Proxy Types for Web Scraping: Residential, Datacenter, and ISP Explained (2026)
If you have ever scaled a scraper past a few hundred requests, you know the drill: 403s start rolling in, CAPTCHAs multiply, and your clean data pipeline turns into a wall of errors. The IP layer is almost always the first thing that breaks. You can write perfect Python, respect robots.txt, randomize your user agents, add human-like delays, and still get blocked because your IP address gives you away immediately.
Proxies are the standard solution. But "just use proxies" is advice that fails people constantly, because the proxy type you choose determines whether you succeed or waste money. Residential proxies cost 10-20x more than datacenter proxies. Picking the wrong type means either burning bandwidth on expensive IPs you did not need, or getting blocked anyway because you went too cheap on a target that required better coverage.
This guide covers every proxy type used in production web scraping in 2026: datacenter, residential, ISP/static residential, and mobile. For each type you will get the technical characteristics, a clear picture of when it works and when it fails, and Python code examples you can drop into your own scraper. We also cover rotating versus sticky sessions, CAPTCHA handling, anti-detection techniques, and how to build retry logic that does not burn through your proxy quota on transient errors.
The goal is to give you enough context to make the right proxy decision on your next project without spending three hours testing configurations that were never going to work.
Why Proxies Matter: What Sites Actually Check
Before picking a proxy type, it helps to understand what anti-bot systems are actually measuring. Modern bot detection is not just "is this IP in a data center." Systems like Cloudflare, Akamai Bot Manager, PerimeterX, DataDome, and Kasada look at a stack of signals simultaneously.
IP-level signals:
- ASN (Autonomous System Number) - which organization owns this IP block
- Whether the ASN is a known cloud provider, VPN service, or data center
- IP reputation score based on past abuse history
- Geographic consistency (IP in Germany, Accept-Language header says Chinese)
- Whether the IP has been seen on abuse databases

TLS fingerprint:
- The exact sequence of cipher suites offered during the TLS handshake
- TLS version, extension order, supported groups
- Tools like curl and Python requests have distinctive TLS fingerprints that differ from Chrome

HTTP/2 fingerprint:
- HTTP/2 SETTINGS frames, WINDOW_UPDATE values, header order
- These fingerprints identify the underlying HTTP client library, not just the browser string

Behavioral signals:
- Request rate and timing patterns
- Navigation paths (does this "user" ever visit non-product pages)
- Mouse movement and interaction data (for sites that inject JS tracking)
- Session consistency (same IP for 1000 requests in 3 minutes)
Proxies solve the IP-level signals. They do not help with TLS fingerprinting or behavioral analysis unless you also handle those layers.
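The geographic-consistency signal is easy to reason about concretely. Here is a rough sketch of the kind of comparison a detection system might run - the country codes and language mapping below are purely illustrative, not any vendor's actual rules:

```python
# Illustrative geo-consistency check: does the Accept-Language header
# plausibly match the country the IP geolocates to? Real anti-bot
# systems use far richer data; this mapping is a toy example.
PLAUSIBLE_LANGUAGES = {
    "DE": {"de", "en"},
    "FR": {"fr", "en"},
    "US": {"en", "es"},
    "CN": {"zh", "en"},
}

def is_geo_consistent(ip_country: str, accept_language: str) -> bool:
    # Primary language is the first tag, e.g. "zh" from "zh-CN,zh;q=0.9"
    primary = accept_language.split(",")[0].split("-")[0].lower()
    allowed = PLAUSIBLE_LANGUAGES.get(ip_country.upper())
    if allowed is None:
        return True  # Unknown country: no basis to flag
    return primary in allowed

# A German IP with a Chinese Accept-Language is a mismatch signal
print(is_geo_consistent("DE", "zh-CN,zh;q=0.9"))  # False
print(is_geo_consistent("US", "en-US,en;q=0.9"))  # True
```

The practical takeaway: when you geo-target proxies, set Accept-Language to match the exit country, or you hand the detector a free mismatch.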
Datacenter Proxies
Datacenter proxies are IPs assigned to physical or virtual servers in commercial data centers. They come from providers like AWS, GCP, Hetzner, OVH, Leaseweb, and thousands of smaller hosts. They are the cheapest proxy type and the easiest to detect.
Technical characteristics:
- Latency: typically 20-80ms, sometimes faster than residential because of direct routing
- Speed: no bandwidth constraints from consumer ISPs
- Pricing: $0.50-3/GB or flat monthly rates for dedicated IPs
- Pool size: unlimited in theory, constrained by what you can afford
- ASN: registered to known hosting companies
Why they get blocked: Anti-bot systems maintain databases of data center IP ranges. Amazon, Google, Azure, DigitalOcean, Linode, Hetzner - every major cloud provider's ASN ranges are widely published. When your request comes from a hosting AS number, Cloudflare knows before it even reads your headers that this is not a human browsing from home.
The tell is not just the ASN. Data center IPs also lack reverse DNS entries that look like consumer ISP records, they do not appear in residential IP geolocation databases, and they tend to have very clean traffic histories.
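You can see the reverse DNS tell for yourself. Consumer ISP PTR records tend to contain tokens like "pool", "dyn", or "cable" plus the ISP brand, while hosting PTR records name the provider or are missing entirely. A heuristic sketch - the token lists here are illustrative, nowhere near a complete database:

```python
import socket

# Tokens commonly seen in consumer ISP PTR records vs hosting PTR records.
# Illustrative heuristics only - real detection systems use curated databases.
RESIDENTIAL_TOKENS = ("dyn", "pool", "cable", "dsl", "cust", "broadband", "comcast", "telstra")
HOSTING_TOKENS = ("aws", "amazonaws", "googleusercontent", "hetzner", "ovh", "linode", "server")

def classify_ptr(hostname: str) -> str:
    host = hostname.lower()
    if any(t in host for t in HOSTING_TOKENS):
        return "hosting"
    if any(t in host for t in RESIDENTIAL_TOKENS):
        return "residential"
    return "unknown"

def classify_ip(ip: str) -> str:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return "no-ptr"  # A missing PTR record is itself a weak hosting signal
    return classify_ptr(hostname)

print(classify_ptr("c-73-111-5-9.hsd1.il.comcast.net"))    # residential
print(classify_ptr("ec2-52-1-2-3.compute-1.amazonaws.com"))  # hosting
```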
When datacenter proxies work:
- Public APIs with no bot detection layer (government open data, academic datasets)
- Sites running minimal protection - basic rate limiting by IP, no fingerprinting
- Your own infrastructure and staging environments
- Development and testing before you burn residential proxy bandwidth
- Scraping sites where the data is meant to be accessed programmatically
When they fail: Any site running Cloudflare Business/Enterprise, Akamai Bot Manager, PerimeterX, DataDome, or Kasada will challenge or block datacenter IPs on the first request.
Python example with datacenter proxy rotation:
import httpx
import random
import time
from typing import Optional
DATACENTER_PROXIES = [
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
]
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
}
def scrape_with_datacenter(url: str, max_retries: int = 3) -> Optional[str]:
for attempt in range(max_retries):
proxy = random.choice(DATACENTER_PROXIES)
try:
with httpx.Client(proxy=proxy, timeout=15.0, headers=HEADERS) as client:
resp = client.get(url)
if resp.status_code == 200:
return resp.text
elif resp.status_code in (429, 503):
wait = 2 ** attempt
time.sleep(wait)
elif resp.status_code == 403:
print(f"403 on {url} - consider upgrading to residential proxies")
return None
except (httpx.TimeoutException, httpx.ProxyError) as e:
print(f"Proxy error attempt {attempt + 1}: {e}")
time.sleep(1)
return None
Residential Proxies
Residential proxies use IP addresses assigned by internet service providers to real households. When a website checks the ASN for a residential proxy request, it sees Comcast, Vodafone, Telstra, or a regional ISP - exactly what it would see for any normal human browsing from home.
This is the fundamental advantage of residential proxies: they are not distinguishable from real user traffic at the IP layer. A Comcast IP in Chicago could belong to a data scientist scraping competitor pricing or someone watching Netflix. The site cannot know, and that uncertainty is what you are paying for.
Technical characteristics:
- Latency: 200-800ms typical, varies by location and ISP routing
- Speed: limited by consumer broadband upstream bandwidth
- Pricing: $3-10/GB depending on provider, geo-targeting, and contract volume
- Pool size: major providers claim 10-100M IPs, though the active pool is smaller
- ASN: registered to residential ISPs worldwide
How residential proxy networks work: Residential proxy providers build their networks by running software on real users' devices - typically through SDKs bundled into mobile apps, browser extensions, or VPN clients. The device owner consents (buried in the terms of service) to having their bandwidth used when their device is idle. Your traffic exits through that device's IP address.
This creates some quirks: IPs go offline when devices sleep or lose connectivity, bandwidth is shared with other customers, and you have no control over which specific IP you get within a geo-targeting filter.
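Because exit devices can vanish mid-request, it is worth wrapping residential calls in a retry layer that treats proxy connect failures as routine rather than fatal. A minimal sketch, with arbitrary retry counts - with httpx you would pass `exceptions=(httpx.ProxyError, httpx.ConnectError)` and wrap the `client.get` call in a lambda:

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")

def retry_on_flake(
    fn: Callable[[], T],
    exceptions: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    attempts: int = 4,
    base_wait: float = 1.0,
) -> T:
    """Retry a callable when the residential exit device drops mid-request.

    Connect failures are expected noise on residential pools: the device
    behind the IP went to sleep or lost connectivity. On a rotating
    gateway, each retry naturally lands on a different exit device.
    """
    last_exc: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions as exc:
            last_exc = exc
            time.sleep(base_wait * (attempt + 1))  # Linear backoff is enough; rotation gives a fresh exit
    assert last_exc is not None
    raise last_exc
```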
When residential proxies are necessary:
- Amazon product and pricing data
- Google Shopping, Google Maps, Google SERP results
- LinkedIn profile and company data
- Real estate portals (Zillow, Realtor, Redfin)
- Social media platforms (Instagram, Twitter/X, TikTok)
- Ticketing platforms (StubHub, Ticketmaster)
- Any Cloudflare-protected site running Bot Fight Mode or Super Bot Fight Mode
- Price monitoring across major e-commerce retailers
Python example with ThorData residential proxies:
import httpx
import time
import random
from dataclasses import dataclass
from typing import Optional
@dataclass
class ScrapeResult:
url: str
status: int
content: Optional[str]
proxy_used: str
attempt: int
latency_ms: float
# ThorData gateway - rotation is handled server-side
THORDATA_PROXY = "http://username:[email protected]:7000"
def build_realistic_headers(accept_language: str = "en-US,en;q=0.9") -> dict:
return {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": accept_language,
"Accept-Encoding": "gzip, deflate, br",
"sec-ch-ua": '"Chromium";v="131", "Not_A Brand";v="24", "Google Chrome";v="131"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
}
def scrape_residential(
url: str,
max_retries: int = 5,
min_delay: float = 1.0,
max_delay: float = 4.0,
) -> ScrapeResult:
headers = build_realistic_headers()
for attempt in range(1, max_retries + 1):
start = time.monotonic()
try:
with httpx.Client(proxy=THORDATA_PROXY, timeout=30.0, headers=headers) as client:
resp = client.get(url)
latency = (time.monotonic() - start) * 1000
if resp.status_code == 200:
return ScrapeResult(url, 200, resp.text, THORDATA_PROXY, attempt, latency)
elif resp.status_code == 429:
wait = random.uniform(min_delay * 2 ** attempt, max_delay * 2 ** attempt)
time.sleep(min(wait, 60))
else:
return ScrapeResult(url, resp.status_code, None, THORDATA_PROXY, attempt, latency)
except httpx.TimeoutException:
time.sleep(random.uniform(min_delay, max_delay))
return ScrapeResult(url, -1, None, THORDATA_PROXY, max_retries, 0.0)
ThorData provides rotating residential proxies with geo-targeting down to city level. Their gateway handles rotation automatically - you do not need to manage a proxy list, just point your client at the gateway endpoint and it distributes requests across their pool.
ISP Proxies (Static Residential)
ISP proxies are the hybrid nobody talks about enough. They are hosted in data centers - so they have datacenter-level speed and uptime - but registered under residential ASNs. When a website checks the ASN for an ISP proxy, it sees a residential internet service provider, not AWS or Hetzner.
The technical trick is that proxy providers purchase IP blocks from ISPs and colocate the actual servers in their own data centers. The IP routing goes through the ISP's network, so the ASN lookup returns the ISP. The traffic itself travels over fast data center infrastructure.
Technical characteristics:
- Latency: 30-100ms - much closer to datacenter than residential
- Speed: not constrained by consumer upload bandwidth
- Pricing: $2-5/GB, more than datacenter but cheaper per-GB than rotating residential
- Assignment: dedicated - you get specific IPs rather than random pool rotation
- Persistence: same IP for days, weeks, or months depending on contract
When to use ISP proxies:
- Long-running monitoring jobs that need consistent identity (price trackers, stock monitors)
- Account management scenarios where IP changes trigger security alerts
- High-volume scrapes where residential bandwidth costs would be prohibitive
- Sites that use IP-session binding (same IP required across a multi-page workflow)
- Performance-sensitive scraping where residential latency is a bottleneck
Python example for session-consistent ISP proxy scraping:
import httpx
import asyncio
from typing import List, Dict, Any
ISP_PROXIES = [
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
]
async def scrape_paginated_session(
base_url: str,
proxy: str,
max_pages: int = 50
) -> List[Dict[str, Any]]:
all_items = []
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
}
async with httpx.AsyncClient(proxy=proxy, timeout=20.0, headers=headers) as client:
for page in range(1, max_pages + 1):
url = f"{base_url}?page={page}"
try:
resp = await client.get(url)
if resp.status_code == 200:
pass # Parse page content here
elif resp.status_code == 404:
break
await asyncio.sleep(0.8 + (page % 3) * 0.4)
except httpx.TimeoutException:
await asyncio.sleep(5)
continue
return all_items
Mobile Proxies
Mobile proxies route traffic through mobile devices on carrier networks (4G, 5G, LTE). They are the most expensive proxy type and also the hardest to block, because mobile carrier IPs are used by millions of users simultaneously - blocking a single mobile IP risks blocking thousands of legitimate users.
Technical characteristics:
- Latency: highly variable, 100-2000ms depending on carrier and location
- Cost: $15-50/GB or per-device monthly pricing
- Detection resistance: extremely high - carriers rotate IPs via CGNAT
- Use cases: mobile-specific content, carrier-gated sites, maximum stealth
Mobile proxies are rarely necessary for standard scraping. They become relevant for carrier-specific content access, mobile app API scraping, or situations where you need to appear as mobile traffic specifically.
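If you do need one, the only changes from the earlier examples are the gateway endpoint and a consistent mobile client identity - a mobile exit IP paired with a desktop user agent is a mismatch signal. A sketch, assuming a hypothetical mobile gateway address (substitute your provider's):

```python
# Hypothetical mobile gateway endpoint - substitute your provider's address
MOBILE_PROXY = "http://user:[email protected]:7000"

def build_mobile_headers() -> dict:
    # Android Chrome identity; sec-ch-ua-mobile must be "?1" to match the UA
    return {
        "User-Agent": (
            "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "sec-ch-ua-mobile": "?1",
        "sec-ch-ua-platform": '"Android"',
    }

def fetch_mobile(url: str) -> str:
    import httpx  # Deferred import so the header helper works standalone
    with httpx.Client(proxy=MOBILE_PROXY, timeout=30.0, headers=build_mobile_headers()) as client:
        return client.get(url).text
```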
Rotating vs Sticky Sessions
Understanding when to rotate IPs versus when to maintain a session is as important as picking the right proxy type.
Rotating sessions assign a new IP for each request or after a short interval. Use rotation for:
- Search result pages (each query is an independent, stateless request)
- Product catalog scraping where each URL is self-contained
- News article collection across multiple outlets
- Any data collection where there is no session state to maintain
Sticky sessions maintain the same IP for a configurable duration (typically 1-30 minutes). Use sticky sessions for:
- Paginating through multi-page results (the site tracks your session)
- Login and authenticated scraping (cookie-to-IP binding is common)
- Shopping cart and checkout observation
- Any workflow where the site uses IP as part of session validation
Python example showing both patterns:
import httpx
import time
import random
from typing import Generator, List
def rotating_scraper(
urls: List[str],
proxy_gateway: str,
delay_range: tuple = (0.5, 2.0)
) -> Generator[tuple, None, None]:
for url in urls:
with httpx.Client(proxy=proxy_gateway, timeout=15.0) as client:
try:
resp = client.get(url)
yield (url, resp.status_code, resp.text if resp.status_code == 200 else None)
except Exception:
yield (url, -1, None)
time.sleep(random.uniform(*delay_range))
def sticky_session_scraper(
start_url: str,
proxy_with_session_id: str,
) -> List[str]:
"""
Maintain same IP across a paginated sequence.
ThorData sticky session format: user-sessid12345:[email protected]:7000
"""
pages = []
with httpx.Client(proxy=proxy_with_session_id, timeout=20.0) as client:
url = start_url
while url:
resp = client.get(url)
if resp.status_code != 200:
break
pages.append(resp.text)
next_url = None # Extract from response
url = next_url
time.sleep(1.5)
return pages
Anti-Detection: Headers, Delays, and Fingerprint Spoofing
Proxies handle the IP layer. Anti-detection covers everything else that bot detection systems analyze.
Request Headers
A bare Python requests call sends headers that no browser actually sends. Building a realistic header set is table stakes:
import random
CHROME_USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
]
def get_chrome_headers(referer: str = None) -> dict:
headers = {
"User-Agent": random.choice(CHROME_USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"sec-ch-ua": '"Chromium";v="131", "Not_A Brand";v="24", "Google Chrome";v="131"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none" if not referer else "same-origin",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"DNT": "1",
}
if referer:
headers["Referer"] = referer
return headers
TLS Fingerprinting with curl_cffi
Standard Python HTTP libraries send a TLS fingerprint that does not match Chrome. The curl_cffi library impersonates real browser TLS fingerprints:
from curl_cffi import requests as cffi_requests
def scrape_with_tls_impersonation(url: str, proxy: str) -> str:
resp = cffi_requests.get(
url,
impersonate="chrome131",
proxy=proxy,
timeout=15,
)
return resp.text
Request Timing and Rate Control
import time
import random
def human_delay(base_seconds: float = 1.0, variance: float = 0.5) -> None:
delay = max(0.1, random.gauss(base_seconds, variance))
time.sleep(delay)
def rate_limited_batch(urls: list, scrape_fn, requests_per_minute: int = 30) -> list:
results = []
min_interval = 60.0 / requests_per_minute
last_request_time = 0.0
for i, url in enumerate(urls):
elapsed = time.monotonic() - last_request_time
if elapsed < min_interval and i >= 5:
time.sleep(min_interval - elapsed + random.uniform(0, 0.3))
result = scrape_fn(url)
results.append(result)
last_request_time = time.monotonic()
if i > 0 and i % random.randint(15, 25) == 0:
time.sleep(random.uniform(5, 15))
return results
CAPTCHA Handling
CAPTCHAs appear when bot detection has flagged your traffic but not outright blocked it.
from bs4 import BeautifulSoup
import httpx
import re
def detect_captcha(response: httpx.Response) -> str:
if response.status_code == 403:
body = response.text.lower()
if "cf-challenge" in body or "challenge-platform" in body:
return "cloudflare"
if "px-captcha" in body or "perimeterx" in body:
return "perimeterx"
if response.status_code == 200:
soup = BeautifulSoup(response.text, "lxml")
if soup.find(attrs={"class": re.compile(r"g-recaptcha")}):
return "recaptcha_v2"
if soup.find(attrs={"data-sitekey": True}):
return "recaptcha_v3"
return "none"
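Detection is only half the story; the scraper also has to decide what to do next. A simple policy mapping, reflecting the usual playbook - rotate the exit IP for IP-level challenges, slow down when the session rather than the IP is flagged. The exact actions are a judgment call, not a standard:

```python
from enum import Enum

class CaptchaAction(Enum):
    ROTATE_PROXY = "rotate_proxy"    # Get a fresh exit IP and retry
    SOLVE_OR_SKIP = "solve_or_skip"  # Hand off to a solver service or drop the URL
    SLOW_DOWN = "slow_down"          # Back off; the session is flagged, not the IP
    PROCEED = "proceed"

# Policy sketch keyed on the detect_captcha() labels from above
CAPTCHA_POLICY = {
    "cloudflare": CaptchaAction.ROTATE_PROXY,
    "perimeterx": CaptchaAction.ROTATE_PROXY,
    "recaptcha_v2": CaptchaAction.SOLVE_OR_SKIP,
    "recaptcha_v3": CaptchaAction.SLOW_DOWN,
    "none": CaptchaAction.PROCEED,
}

def captcha_action(captcha_type: str) -> CaptchaAction:
    # Unknown vendors default to rotating the proxy - the cheapest first response
    return CAPTCHA_POLICY.get(captcha_type, CaptchaAction.ROTATE_PROXY)
```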
Rate Limiting and Retry Logic
Production scrapers need retry logic that distinguishes between transient failures and permanent blocks:
import httpx
import time
import random
import logging
from enum import Enum
from typing import Optional, Callable
logger = logging.getLogger(__name__)
class RetryOutcome(Enum):
SUCCESS = "success"
RATE_LIMITED = "rate_limited"
BLOCKED = "blocked"
SERVER_ERROR = "server_error"
EXHAUSTED = "exhausted"
def classify_response(status: int) -> RetryOutcome:
if status == 200:
return RetryOutcome.SUCCESS
elif status == 429:
return RetryOutcome.RATE_LIMITED
elif status in (403, 401):
return RetryOutcome.BLOCKED
elif status >= 500:
return RetryOutcome.SERVER_ERROR
return RetryOutcome.BLOCKED
def scrape_with_retry(
url: str,
proxy_fn: Callable[[], str],
max_attempts: int = 5,
base_backoff: float = 2.0,
) -> tuple:
for attempt in range(1, max_attempts + 1):
proxy = proxy_fn()
try:
with httpx.Client(proxy=proxy, timeout=20.0) as client:
resp = client.get(url, headers=get_chrome_headers())
outcome = classify_response(resp.status_code)
if outcome == RetryOutcome.SUCCESS:
return resp.text, outcome
elif outcome == RetryOutcome.RATE_LIMITED:
retry_after = int(resp.headers.get("Retry-After", base_backoff * 2 ** attempt))
wait = min(retry_after + random.uniform(0, 2), 120)
logger.warning(f"Rate limited attempt {attempt}, waiting {wait:.1f}s")
time.sleep(wait)
elif outcome == RetryOutcome.BLOCKED:
logger.warning(f"Blocked HTTP {resp.status_code} attempt {attempt}")
time.sleep(random.uniform(2, 5))
elif outcome == RetryOutcome.SERVER_ERROR:
wait = base_backoff * (2 ** (attempt - 1)) + random.uniform(0, 1)
time.sleep(min(wait, 60))
except httpx.TimeoutException:
logger.warning(f"Timeout on attempt {attempt}")
time.sleep(random.uniform(2, 6))
except httpx.ProxyError as e:
logger.error(f"Proxy error attempt {attempt}: {e}")
time.sleep(2)
return None, RetryOutcome.EXHAUSTED
Real-World Use Cases
1. E-commerce Price Monitoring
Price monitoring is one of the most common scraping use cases. The challenge is that major retailers have aggressive bot detection, and residential proxies are required for Amazon, Walmart, and Target.
import httpx
import json
import re
import time
import datetime
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class ProductPrice:
url: str
title: str
price: Optional[float]
currency: str
in_stock: bool
scraped_at: str
def extract_price_from_html(html: str, url: str) -> ProductPrice:
soup = BeautifulSoup(html, "lxml")
schema = soup.find("script", type="application/ld+json")
if schema:
try:
            data = json.loads(schema.string or "{}")  # schema.string can be None for empty tags
if isinstance(data, list):
data = data[0]
if data.get("@type") == "Product":
offer = data.get("offers", {})
if isinstance(offer, list):
offer = offer[0]
return ProductPrice(
url=url,
title=data.get("name", ""),
price=float(offer.get("price", 0)),
currency=offer.get("priceCurrency", "USD"),
in_stock=offer.get("availability", "").endswith("InStock"),
scraped_at=datetime.datetime.utcnow().isoformat(),
)
except (json.JSONDecodeError, KeyError, ValueError):
pass
price_tag = soup.select_one('[itemprop="price"], .price, #price')
title_tag = soup.select_one("h1")
price_text = price_tag.get_text(strip=True) if price_tag else ""
price_match = re.search(r"[\d,]+\.?\d*", price_text.replace(",", ""))
return ProductPrice(
url=url,
title=title_tag.get_text(strip=True) if title_tag else "",
price=float(price_match.group()) if price_match else None,
currency="USD",
in_stock=bool(soup.find(string=re.compile(r"in stock", re.I))),
scraped_at=datetime.datetime.utcnow().isoformat(),
)
2. Real Estate Listing Scraper
Real estate portals like Zillow, Realtor.com, and Redfin are among the most aggressively protected scraping targets. They use device fingerprinting in addition to IP-based blocking, so you need both residential proxies and realistic browser headers.
import httpx
import time
import random
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class RealEstateListing:
address: str
city: str
state: str
zip_code: str
price: Optional[int]
bedrooms: Optional[int]
bathrooms: Optional[float]
sqft: Optional[int]
listing_url: str
days_on_market: Optional[int]
def scrape_real_estate_search(
search_url: str,
proxy: str,
max_pages: int = 10,
) -> List[RealEstateListing]:
listings = []
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.google.com/",
}
with httpx.Client(proxy=proxy, timeout=30.0, headers=headers, follow_redirects=True) as client:
for page in range(1, max_pages + 1):
url = f"{search_url}&page={page}" if "?" in search_url else f"{search_url}?page={page}"
resp = client.get(url)
if resp.status_code != 200:
break
if page % 3 == 0:
time.sleep(5 + random.uniform(0, 3))
else:
time.sleep(1.5 + random.uniform(0, 1))
return listings
3. Job Board Aggregator
Job boards are generally less aggressive with bot detection than e-commerce, but the most popular ones (Indeed, LinkedIn) have significant protection. Many job boards expose JSON APIs that are much cleaner to scrape than HTML.
import httpx
import json
import datetime
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class JobListing:
title: str
company: str
location: str
salary_min: Optional[int]
salary_max: Optional[int]
remote: bool
posted_date: str
listing_url: str
source: str
def scrape_job_api(board_url: str, proxy: str, keyword: str = "python developer") -> List[JobListing]:
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "application/json, text/html, */*",
"Accept-Language": "en-US,en;q=0.9",
}
jobs = []
with httpx.Client(proxy=proxy, timeout=20.0, headers=headers) as client:
encoded = keyword.replace(" ", "+")
resp = client.get(f"{board_url}/search?q={encoded}&format=json")
if resp.status_code == 200:
try:
data = resp.json()
raw_jobs = data.get("jobs", data.get("results", []))
for job in raw_jobs:
company = job.get("company", {})
company_name = company.get("name", "") if isinstance(company, dict) else str(company)
jobs.append(JobListing(
title=job.get("title", ""),
company=company_name,
location=job.get("location", ""),
salary_min=job.get("salary_min"),
salary_max=job.get("salary_max"),
remote="remote" in job.get("location", "").lower(),
posted_date=job.get("created_at", datetime.datetime.utcnow().isoformat()),
listing_url=job.get("url", ""),
source=board_url,
))
except json.JSONDecodeError:
pass
return jobs
4. SERP Rank Tracker
Google is one of the most difficult scraping targets. Residential proxies are mandatory - datacenter IPs get CAPTCHA challenges on nearly every request. Use geo-targeted proxies matching the country you want results for.
import httpx
import time
import random
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List
@dataclass
class SearchResult:
position: int
title: str
url: str
description: str
is_ad: bool
keyword: str
def track_serp_positions(
keywords: List[str],
target_domain: str,
proxy: str,
country_code: str = "us",
) -> List[SearchResult]:
results = []
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept-Language": f"{country_code}-{country_code.upper()},{country_code};q=0.9,en;q=0.8",
"Referer": "https://www.google.com/",
}
for keyword in keywords:
with httpx.Client(proxy=proxy, timeout=15.0, headers=headers) as client:
encoded = keyword.replace(" ", "+")
resp = client.get(f"https://www.google.com/search?q={encoded}&gl={country_code}&num=100")
if resp.status_code == 200:
soup = BeautifulSoup(resp.text, "lxml")
organic = soup.select("div.g")
for pos, div in enumerate(organic[:20], 1):
link = div.select_one("a[href]")
title_el = div.select_one("h3")
snippet = div.select_one(".VwiC3b")
if link and title_el:
url = link.get("href", "")
if target_domain in url:
results.append(SearchResult(
position=pos,
title=title_el.get_text(),
url=url,
description=snippet.get_text() if snippet else "",
is_ad=False,
keyword=keyword,
))
time.sleep(random.uniform(3, 8))
return results
5. Social Media Profile Data Collector
Social platforms have the most sophisticated bot detection systems. Instagram and TikTok run device fingerprinting in JavaScript - bypassing this requires a real browser context via Playwright.
from playwright.async_api import async_playwright
import asyncio
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class SocialProfile:
username: str
display_name: str
bio: str
follower_count: Optional[int]
following_count: Optional[int]
post_count: Optional[int]
verified: bool
profile_url: str
async def scrape_public_profiles(
usernames: List[str],
proxy_config: dict,
platform: str = "twitter",
) -> List[SocialProfile]:
profiles = []
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy=proxy_config,
args=["--disable-blink-features=AutomationControlled"],
)
for username in usernames:
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
locale="en-US",
)
page = await context.new_page()
await page.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)
try:
base_url = f"https://twitter.com/{username}" if platform == "twitter" else f"https://www.instagram.com/{username}/"
await page.goto(base_url, wait_until="networkidle", timeout=30000)
await page.wait_for_timeout(2000)
finally:
await context.close()
await asyncio.sleep(2)
await browser.close()
return profiles
6. News and Media Monitoring
News monitoring benefits from a two-phase approach: parse RSS feeds without proxies to get article URLs cheaply, then use proxies only for fetching full article text. This cuts proxy bandwidth costs significantly.
import httpx
import feedparser
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List, Optional
import datetime
import time
import random
@dataclass
class NewsArticle:
headline: str
url: str
source: str
published_at: str
summary: str
author: Optional[str]
full_text: Optional[str]
def scrape_news_sources(
rss_feeds: List[str],
article_proxy: str,
max_articles_per_feed: int = 20,
) -> List[NewsArticle]:
feed_items = []
for feed_url in rss_feeds:
feed = feedparser.parse(feed_url)
for entry in feed.entries[:max_articles_per_feed]:
feed_items.append({
"title": entry.get("title", ""),
"url": entry.get("link", ""),
"published": entry.get("published", datetime.datetime.utcnow().isoformat()),
"summary": entry.get("summary", ""),
"source": feed.feed.get("title", feed_url),
})
articles = []
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
with httpx.Client(proxy=article_proxy, timeout=15.0, headers=headers) as client:
for item in feed_items:
try:
resp = client.get(item["url"])
if resp.status_code == 200:
soup = BeautifulSoup(resp.text, "lxml")
for tag in soup.select("nav, header, footer, .ad, aside"):
tag.decompose()
paragraphs = soup.select("article p, .article-body p, .post-content p")
full_text = " ".join(p.get_text(strip=True) for p in paragraphs)
articles.append(NewsArticle(
headline=item["title"],
url=item["url"],
source=item["source"],
published_at=item["published"],
summary=item["summary"],
author=None,
full_text=full_text if full_text else None,
))
time.sleep(random.uniform(0.5, 1.5))
except Exception:
pass
return articles
7. Academic and Research Data Harvesting
Academic databases are generally more tolerant of scraping but have strict rate limits. Datacenter proxies work fine for most academic sources - there is no need to pay for residential proxies here.
import httpx
import time
from dataclasses import dataclass, field
from typing import List, Optional
@dataclass
class ResearchPaper:
title: str
authors: List[str]
abstract: str
doi: Optional[str]
publication_year: Optional[int]
journal: Optional[str]
citations: Optional[int]
download_url: Optional[str]
keywords: List[str] = field(default_factory=list)
def scrape_arxiv_papers(
search_query: str,
max_results: int = 100,
proxy: Optional[str] = None,
) -> List[ResearchPaper]:
"""
Scrape arXiv preprints.
arXiv allows programmatic access but rate-limits to roughly 1 req/3s.
Datacenter proxies are fine here.
"""
papers = []
headers = {
"User-Agent": "ResearchScraper/1.0 (academic research; [email protected])",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
client_kwargs: dict = {"timeout": 20.0, "headers": headers}
if proxy:
client_kwargs["proxy"] = proxy
base_url = "https://export.arxiv.org/find/cs/1/all:+{query}/0/{start}/0/all/0/1"
with httpx.Client(**client_kwargs) as client:
start = 0
while start < max_results:
url = base_url.format(query=search_query.replace(" ", "+"), start=start)
resp = client.get(url)
if resp.status_code != 200:
break
start += 25
time.sleep(3)
return papers
Output Schema and Storage
Always define output schemas before scraping at scale. It forces you to think about what you actually need and makes downstream processing trivial.
from dataclasses import dataclass, field
from typing import Optional
import json
import datetime
import sqlite3
import hashlib
@dataclass
class ScrapedPage:
url: str
status_code: int
scraped_at: str = field(default_factory=lambda: datetime.datetime.utcnow().isoformat())
proxy_type: str = "residential"
content_hash: Optional[str] = None
data: dict = field(default_factory=dict)
error: Optional[str] = None
class ScrapingStorage:
def __init__(self, db_path: str = "scraping_results.db"):
self.conn = sqlite3.connect(db_path)
self.conn.execute("""
CREATE TABLE IF NOT EXISTS scraped_pages (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE,
status_code INTEGER,
scraped_at TEXT,
proxy_type TEXT,
content_hash TEXT,
data TEXT,
error TEXT
)
""")
self.conn.commit()
def save(self, page: ScrapedPage) -> None:
if page.data:
content = json.dumps(page.data, sort_keys=True)
page.content_hash = hashlib.sha256(content.encode()).hexdigest()[:16]
self.conn.execute(
"INSERT OR REPLACE INTO scraped_pages VALUES (NULL, ?, ?, ?, ?, ?, ?, ?)",
(page.url, page.status_code, page.scraped_at, page.proxy_type,
page.content_hash, json.dumps(page.data), page.error)
)
self.conn.commit()
def is_scraped(self, url: str, max_age_hours: int = 24) -> bool:
cutoff = (datetime.datetime.utcnow() - datetime.timedelta(hours=max_age_hours)).isoformat()
row = self.conn.execute(
"SELECT 1 FROM scraped_pages WHERE url = ? AND scraped_at > ? AND status_code = 200",
(url, cutoff)
).fetchone()
return row is not None
def export_to_jsonl(self, output_file: str) -> int:
rows = self.conn.execute("SELECT url, data FROM scraped_pages WHERE status_code = 200").fetchall()
with open(output_file, "w") as f:
for url, data_str in rows:
if data_str:
f.write(data_str + "\n")
return len(rows)
Choosing the Right Proxy Provider
What matters in practice: pool size, geo coverage, uptime consistency, and how the provider handles failures. A provider with 100M residential IPs sounds impressive until you realize 70% are offline at any given time and the active pool has heavy overlap with other customers.
For rotating residential proxy coverage with solid geo-targeting, ThorData is worth evaluating. They offer rotating residential proxies with city-level targeting, sticky session support, and pricing that does not penalize high-bandwidth months.
The Decision Matrix
| Scenario | Proxy Type | Session | Expected Success Rate |
|---|---|---|---|
| Public API, no bot detection | Datacenter | Rotating | Very High |
| Cloudflare-protected site | Residential | Rotating | Medium-High |
| Login / session scraping | ISP or Residential | Sticky | Medium |
| Amazon / Google pricing | Residential | Rotating | Medium |
| High-volume catalog scrape | ISP | Mixed | High |
| Mobile-specific content | Mobile | Rotating | High |
| JS-heavy SPA | Residential + Playwright | Per-context | Medium |
Start with the cheapest proxy type that works for your target. Escalate when you hit consistent blocks. Monitor your success rate per proxy type and per target domain - the data will tell you where to invest more.
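That monitoring does not need to be elaborate. A minimal in-memory tracker keyed by proxy type and target domain is enough to show where escalation is warranted - the 70% threshold and 20-sample minimum below are starting points, not magic numbers:

```python
from collections import defaultdict
from urllib.parse import urlparse

class SuccessTracker:
    """Track scraping success rate per (proxy_type, target_domain) pair."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # [successes, total]

    def record(self, proxy_type: str, url: str, success: bool) -> None:
        key = (proxy_type, urlparse(url).netloc)
        self.counts[key][1] += 1
        if success:
            self.counts[key][0] += 1

    def rate(self, proxy_type: str, domain: str) -> float:
        ok, total = self.counts[(proxy_type, domain)]
        return ok / total if total else 0.0

    def should_escalate(self, proxy_type: str, domain: str,
                        threshold: float = 0.7, min_samples: int = 20) -> bool:
        # Below-threshold success after enough samples: try a better proxy type
        ok, total = self.counts[(proxy_type, domain)]
        return total >= min_samples and ok / total < threshold

tracker = SuccessTracker()
for i in range(30):
    tracker.record("datacenter", "https://example.com/p", success=i % 3 == 0)
print(tracker.should_escalate("datacenter", "example.com"))  # True - roughly 33% success
```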