Residential Proxies vs Datacenter Proxies: The Complete 2026 Guide for Web Scraping
If you're building a scraper in 2026, you'll hit the proxy question fast. Datacenter or residential? The proxy industry wants you to believe you need residential proxies for everything. You don't. But sometimes you really do.
The proxy landscape has shifted significantly over the past two years. Bot detection systems have gotten smarter, but so have the proxy technologies available to developers. Mobile proxies have emerged as a serious option, ISP proxies offer an interesting middle ground, and pricing models have diversified beyond simple per-GB billing.
This guide covers everything you need to make an informed decision — not just the basics, but the nuances that only matter once you're running scrapers at scale.
Understanding the Four Proxy Types
Before diving into comparisons, let's be precise about what each proxy type actually is, how it works, and what makes it different at the network level.
Datacenter Proxies
Datacenter proxies come from cloud providers — AWS, Hetzner, OVH, DigitalOcean. These are IP addresses that belong to autonomous systems (AS numbers) registered to hosting companies. When a target site looks up the ASN for your IP, it sees something like "AMAZON-02" or "HETZNER-DC" — a dead giveaway that you're not a regular user.
How they work: Your request goes from your machine to a proxy server in a data center, which forwards it to the target site. The target sees the datacenter IP. Simple, fast, no middleman beyond the proxy server itself.
Performance characteristics:

- Latency: 10-50ms (fast, predictable)
- Bandwidth: High (100Mbps+ is common)
- Uptime: 99.9%+ (enterprise infrastructure)
- IP pool size: Varies, but typically thousands to tens of thousands
- Connection reliability: Very high
Cost: $0.50-5 per GB, or $1-3 per IP per month for dedicated proxies. This is 5-20x cheaper than residential alternatives.
When datacenter proxies are all you need:

- Public APIs with rate limits (just rotate IPs)
- News sites, blogs, documentation pages
- Government data portals and public records
- Price comparison on smaller e-commerce sites
- Any site that doesn't actively fingerprint proxy traffic
- Academic research sites and open-access journals
- Weather and sports data aggregation
- Job boards without advanced bot detection
- Search engine results pages (with proper rotation)
Most developers start here and never need to upgrade. If your target doesn't use advanced bot detection, don't waste money on residential IPs.
Residential Proxies
Residential proxies route your traffic through real home internet connections — Comcast, Vodafone, AT&T, BT, Deutsche Telekom. To the target site, you look like a regular person browsing from their couch.
How they work: Proxy providers maintain networks of residential IPs through SDK integrations with mobile apps, browser extensions, or peer-to-peer networks. When you route traffic through a residential proxy, it exits from someone's home ISP connection. The target site sees an IP registered to a consumer ISP, which is exactly what legitimate traffic looks like.
Performance characteristics:

- Latency: 100-500ms (higher and more variable than datacenter)
- Bandwidth: Variable (depends on the residential connection)
- Uptime: Lower than datacenter (connections drop when users go offline)
- IP pool size: Millions (large providers have 50M+ IPs)
- Connection reliability: Moderate (individual IPs may disconnect)
Cost: $5-20 per GB. The cost varies significantly by provider, geo-target, and volume. Geo-targeting specific countries or cities costs more.
When you genuinely need residential proxies:

- Cloudflare-protected sites: Cloudflare's bot management flags datacenter IP ranges by default. Residential IPs sail through the initial challenge.
- Amazon product pages: Amazon has been blocking datacenter ranges aggressively since 2024. Residential proxies are basically mandatory now.
- Social media platforms: Instagram, LinkedIn, and TikTok all fingerprint datacenter traffic. You'll get CAPTCHAs or shadow-blocks within minutes.
- Sneaker sites, ticket platforms: Anything with anti-bot middleware (Akamai, PerimeterX, DataDome) requires residential IPs.
- Google Search at scale: A few queries from datacenter IPs are fine. Thousands per hour? You need residential.
- Banking and financial sites: Strong fraud detection systems treat datacenter IPs as suspicious by default.
- Travel booking sites: Expedia, Booking.com, and airline sites use sophisticated bot detection that flags datacenter traffic.
ISP Proxies (The Middle Ground)
ISP proxies are a hybrid — they're hosted in data centers but registered under residential ISP ASNs. This gives you datacenter-level speed and reliability with residential-level detection resistance.
How they work: Providers purchase IP blocks from ISPs and host them in data centers. The IPs show up as residential in ASN lookups, but they have the stability and speed of datacenter infrastructure. Think of it as renting an apartment in a residential neighborhood but running it like an office.
Performance characteristics:

- Latency: 20-80ms (near-datacenter performance)
- Bandwidth: High (datacenter infrastructure)
- Uptime: 99%+ (no dependency on end-user connections)
- IP pool size: Smaller (thousands, not millions)
- Connection reliability: Very high
Cost: $10-30 per IP per month (usually sold as dedicated IPs, not per GB). More expensive per-IP than datacenter, cheaper per-GB than residential for high-traffic use cases.
Best for:

- Account management (social media, e-commerce accounts)
- Long-running sessions where you need the same IP for hours
- Sites that check ISP ASNs but don't do deep fingerprinting
- SEO monitoring and SERP tracking
- Ad verification and brand protection
Mobile Proxies
Mobile proxies route traffic through 3G/4G/5G connections from real mobile devices. Mobile carriers use CGNAT (Carrier-Grade NAT), meaning hundreds or thousands of users share the same IP. This makes mobile IPs incredibly hard to block — banning one mobile IP affects thousands of legitimate users.
How they work: Traffic is routed through USB modems, mobile hotspots, or app-based SDKs connected to cellular networks. The target sees an IP from a mobile carrier's IP pool. Because of CGNAT, these IPs have naturally high trust scores.
Performance characteristics:

- Latency: 50-200ms (depends on cellular network)
- Bandwidth: Variable (cell network dependent)
- Uptime: Good (carrier infrastructure)
- IP pool size: Moderate (tens of thousands)
- Connection reliability: Good, but occasional drops
Cost: $20-50 per GB, or $50-300 per port per month. The most expensive option, but sometimes the only one that works.
Best for:

- Platforms with the most aggressive bot detection (Instagram, TikTok)
- Account creation and verification workflows
- Mobile-specific content that differs from desktop
- Targets that have blocked all known residential proxy ranges
- Social media automation at scale
Real-World Detection Rates: What Actually Gets Blocked
Theory is nice, but what matters is whether your requests succeed. Here's what you'll actually encounter against common bot detection systems in 2026:
Cloudflare Bot Management
| Proxy Type | Success Rate | Notes |
|---|---|---|
| Datacenter | 5-20% | Immediate challenge pages, high block rate |
| Residential | 85-95% | Most pass the initial challenge |
| ISP | 60-80% | Better than datacenter, but Cloudflare has started flagging some ISP ranges |
| Mobile | 95-99% | Highest success rate due to CGNAT trust |
Amazon Product Pages
| Proxy Type | Success Rate | Notes |
|---|---|---|
| Datacenter | 10-30% | Aggressive blocking since 2024 |
| Residential | 80-90% | Works well with proper rotation |
| ISP | 50-70% | Hit or miss depending on the ISP range |
| Mobile | 90-95% | Very reliable but expensive for high-volume scraping |
LinkedIn Profiles
| Proxy Type | Success Rate | Notes |
|---|---|---|
| Datacenter | <5% | Almost immediately blocked |
| Residential | 70-85% | Needs sticky sessions for logged-in scraping |
| ISP | 40-60% | Better for account management than bulk scraping |
| Mobile | 85-95% | Best option for account-based operations |
Google Search Results
| Proxy Type | Success Rate | Notes |
|---|---|---|
| Datacenter | 30-50% | Works at low volume with rotation |
| Residential | 90-95% | Standard choice for SERP scraping |
| ISP | 70-85% | Good for moderate volume |
| Mobile | 95%+ | Overkill unless other types fail |
These numbers assume proper rotation, reasonable request rates, and basic anti-detection practices (realistic headers, random delays). Hammering any site at maximum speed will get you blocked regardless of proxy type.
Python Code Examples: Complete Proxy Integration
Basic Proxy Rotation with httpx
import httpx
import random
import time
class ProxyRotator:
    """Tracks failed proxies and returns a random healthy one per request."""
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.failed = set()
    def get_next(self) -> str:
        available = [p for p in self.proxies if p not in self.failed]
        if not available:
            self.failed.clear()  # Reset if all failed
            available = self.proxies
        return random.choice(available)
def mark_failed(self, proxy: str):
self.failed.add(proxy)
# Datacenter proxy rotation
dc_proxies = [
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
]
rotator = ProxyRotator(dc_proxies)
def scrape_with_rotation(urls: list[str]) -> list[dict]:
results = []
for url in urls:
proxy = rotator.get_next()
try:
with httpx.Client(proxy=proxy, timeout=30) as client:
resp = client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
})
results.append({
"url": url,
"status": resp.status_code,
"size": len(resp.content),
})
except (httpx.ConnectError, httpx.TimeoutException) as e:
rotator.mark_failed(proxy)
results.append({"url": url, "status": "error", "error": str(e)})
time.sleep(random.uniform(1, 3)) # Be polite
return results
Residential Proxy with ThorData
ThorData provides both rotating and sticky residential proxies with a clean API. Here's how to integrate them:
import httpx
import asyncio
import random
from dataclasses import dataclass
@dataclass
class ThorDataConfig:
username: str
password: str
host: str = "proxy.thordata.com"
port: int = 9000
def rotating_url(self, country: str = "") -> str:
"""Get a rotating proxy URL — new IP on each request."""
user = self.username
if country:
user = f"{user}-country-{country}"
return f"http://{user}:{self.password}@{self.host}:{self.port}"
def sticky_url(self, session_id: str, country: str = "") -> str:
"""Get a sticky proxy URL — same IP for the session duration."""
user = f"{self.username}-session-{session_id}"
if country:
user = f"{user}-country-{country}"
return f"http://{user}:{self.password}@{self.host}:{self.port}"
# Initialize
thor = ThorDataConfig(username="your_user", password="your_pass")
# Rotating proxy — new IP per request
async def scrape_products(urls: list[str]) -> list[dict]:
proxy = thor.rotating_url(country="us")
results = []
async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
for url in urls:
try:
resp = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36",
})
results.append({"url": url, "status": resp.status_code})
except Exception as e:
results.append({"url": url, "error": str(e)})
await asyncio.sleep(random.uniform(0.5, 2))
return results
# Sticky session — same IP for multi-step flows
async def scrape_with_session(login_url: str, data_urls: list[str]) -> list[dict]:
session_id = f"sess_{random.randint(100000, 999999)}"
proxy = thor.sticky_url(session_id=session_id, country="us")
async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
# Login (IP stays the same)
login_resp = await client.post(login_url, data={
"username": "user", "password": "pass"
})
if login_resp.status_code != 200:
return [{"error": "Login failed"}]
# Scrape data pages with same IP
results = []
for url in data_urls:
resp = await client.get(url)
results.append({"url": url, "status": resp.status_code})
await asyncio.sleep(random.uniform(1, 3))
return results
Async High-Throughput Scraping with Proxy Pool
import httpx
import asyncio
import random
import time
from collections import defaultdict
class SmartProxyPool:
"""Proxy pool that tracks success rates and auto-adjusts."""
def __init__(self, proxies: list[str], max_concurrent: int = 10):
self.proxies = proxies
self.semaphore = asyncio.Semaphore(max_concurrent)
self.stats = defaultdict(lambda: {"success": 0, "fail": 0})
self.cooldown = {} # proxy -> timestamp when cooldown ends
def get_proxy(self) -> str:
now = time.time()
available = [
p for p in self.proxies
if self.cooldown.get(p, 0) < now
]
if not available:
available = self.proxies # All on cooldown, use anyway
        # Prefer proxies with higher success rates; floor the weight so a
        # fully-failed proxy keeps a small chance and the weights never
        # sum to zero (random.choices raises on an all-zero total)
        weights = []
        for p in available:
            s = self.stats[p]
            total = s["success"] + s["fail"]
            if total == 0:
                weights.append(1.0)
            else:
                weights.append(max(0.05, s["success"] / total))
        return random.choices(available, weights=weights, k=1)[0]
def report_success(self, proxy: str):
self.stats[proxy]["success"] += 1
def report_failure(self, proxy: str, cooldown_seconds: int = 30):
self.stats[proxy]["fail"] += 1
self.cooldown[proxy] = time.time() + cooldown_seconds
async def fetch(self, url: str, client_kwargs: dict = None) -> dict:
async with self.semaphore:
proxy = self.get_proxy()
try:
async with httpx.AsyncClient(
proxy=proxy,
timeout=30,
**(client_kwargs or {})
) as client:
resp = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36",
"Accept": "text/html,application/xhtml+xml",
})
if resp.status_code == 200:
self.report_success(proxy)
return {
"url": url,
"status": 200,
"content": resp.text,
"proxy": proxy,
}
elif resp.status_code == 403:
self.report_failure(proxy, cooldown_seconds=60)
return {"url": url, "status": 403, "blocked": True}
elif resp.status_code == 429:
self.report_failure(proxy, cooldown_seconds=120)
return {"url": url, "status": 429, "rate_limited": True}
else:
return {"url": url, "status": resp.status_code}
except (httpx.ConnectError, httpx.TimeoutException) as e:
self.report_failure(proxy, cooldown_seconds=60)
return {"url": url, "error": str(e)}
async def main():
# Mix of proxy types for different targets
pool = SmartProxyPool(
proxies=[
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
"http://user:[email protected]:9000",
"http://user:[email protected]:9000",
],
max_concurrent=5,
)
urls = [f"https://example.com/product/{i}" for i in range(100)]
tasks = [pool.fetch(url) for url in urls]
results = await asyncio.gather(*tasks)
success = sum(1 for r in results if r.get("status") == 200)
blocked = sum(1 for r in results if r.get("blocked"))
errors = sum(1 for r in results if "error" in r)
print(f"Results: {success} success, {blocked} blocked, {errors} errors")
# Print proxy performance stats
for proxy, stats in pool.stats.items():
total = stats["success"] + stats["fail"]
rate = stats["success"] / total * 100 if total > 0 else 0
print(f" {proxy}: {rate:.0f}% success ({total} requests)")
asyncio.run(main())
Playwright with Residential Proxies
For JavaScript-heavy sites, you need a real browser routed through proxies:
from playwright.async_api import async_playwright
import asyncio
import json
async def scrape_spa_with_proxy():
"""Scrape a JavaScript-heavy site through a residential proxy."""
async with async_playwright() as p:
browser = await p.chromium.launch(
proxy={
"server": "http://proxy.thordata.com:9000",
"username": "your_user",
"password": "your_pass",
},
headless=True,
)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36"
),
locale="en-US",
timezone_id="America/New_York",
)
page = await context.new_page()
# Block unnecessary resources to save bandwidth (and proxy cost)
await page.route("**/*.{png,jpg,jpeg,gif,svg,webp}",
lambda route: route.abort())
await page.route("**/analytics*", lambda route: route.abort())
await page.route("**/tracking*", lambda route: route.abort())
try:
await page.goto("https://example.com/products",
wait_until="networkidle", timeout=30000)
# Wait for product cards to load
await page.wait_for_selector(".product-card", timeout=10000)
# Extract data
products = await page.evaluate("""
() => {
return Array.from(
document.querySelectorAll('.product-card')
).map(card => ({
title: card.querySelector('h2')?.textContent?.trim(),
price: card.querySelector('.price')?.textContent?.trim(),
url: card.querySelector('a')?.href,
}));
}
""")
return products
finally:
await browser.close()
# Run
products = asyncio.run(scrape_spa_with_proxy())
for p in products:
print(f"{p['title']}: {p['price']}")
Automatic Proxy Escalation
The smartest approach: start cheap and escalate only when needed.
import httpx
import asyncio
from enum import Enum
class ProxyTier(Enum):
DATACENTER = "datacenter"
RESIDENTIAL = "residential"
MOBILE = "mobile"
class ProxyEscalator:
"""Automatically escalates from cheap to expensive proxies based on blocks."""
def __init__(self, config: dict):
self.tiers = {
ProxyTier.DATACENTER: config["datacenter_proxies"],
ProxyTier.RESIDENTIAL: config["residential_proxies"],
ProxyTier.MOBILE: config.get("mobile_proxies", []),
}
self.tier_order = [ProxyTier.DATACENTER, ProxyTier.RESIDENTIAL, ProxyTier.MOBILE]
async def fetch(self, url: str, max_retries: int = 3) -> dict:
"""Try datacenter first, escalate to residential, then mobile."""
last_error = None
for tier in self.tier_order:
proxies = self.tiers.get(tier, [])
if not proxies:
continue
for attempt in range(max_retries):
proxy = proxies[attempt % len(proxies)]
try:
async with httpx.AsyncClient(
proxy=proxy, timeout=30
) as client:
resp = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36",
})
if resp.status_code == 200:
return {
"url": url,
"status": 200,
"content": resp.text,
"tier": tier.value,
"attempts": attempt + 1,
}
elif resp.status_code in (403, 429, 503):
# Blocked — try next attempt or escalate
last_error = f"HTTP {resp.status_code}"
await asyncio.sleep(2 ** attempt)
continue
else:
return {
"url": url,
"status": resp.status_code,
"tier": tier.value,
}
except Exception as e:
last_error = str(e)
continue
# All retries for this tier failed — escalate
print(f" Tier {tier.value} failed for {url}, escalating...")
return {"url": url, "error": last_error, "exhausted": True}
# Usage
escalator = ProxyEscalator({
"datacenter_proxies": [
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
],
"residential_proxies": [
"http://user:[email protected]:9000",
],
"mobile_proxies": [
"http://user:[email protected]:9100",
],
})
async def main():
urls = [
"https://easy-target.com/page", # Datacenter will work
"https://cloudflare-site.com/data", # Needs residential
"https://instagram.com/profile", # May need mobile
]
results = await asyncio.gather(*[escalator.fetch(url) for url in urls])
for r in results:
tier = r.get("tier", "none")
status = r.get("status", r.get("error", "unknown"))
print(f" {r['url']}: {status} (via {tier})")
asyncio.run(main())
Rotating vs Sticky Sessions
This trips up a lot of developers, so let's be thorough.
Rotating Proxies
Rotating proxies give you a new IP on every request (or every N seconds). Maximum anonymity, minimum pattern detection.
When to use rotating:

- Scraping independent product pages where each request is standalone
- Search engine results where you want maximum query volume
- Price monitoring across hundreds of sites
- Any workflow where requests don't depend on each other
Implementation pattern:
import httpx
# Most providers use username formatting for rotation control
# ThorData example:
rotating_proxy = "http://user-rotate:[email protected]:9000"
async def scrape_product_list(product_ids: list[int]) -> list[dict]:
"""Each request gets a fresh IP."""
results = []
async with httpx.AsyncClient(proxy=rotating_proxy, timeout=30) as client:
for pid in product_ids:
url = f"https://store.example.com/product/{pid}"
resp = await client.get(url)
if resp.status_code == 200:
results.append({"id": pid, "html": resp.text})
return results
Sticky Sessions
Sticky sessions keep the same IP for a set duration (usually 1-30 minutes). The proxy provider assigns you an IP and maintains the mapping for the session duration.
When to use sticky sessions:

- Logging into accounts (session cookies are often tied to IP)
- Multi-page checkout monitoring
- Paginating through search results where the site tracks your session state
- Crawling sites that use server-side session tracking
- Any multi-step workflow where IP changes look suspicious
Implementation pattern:
import httpx
import random
# Sticky session — same IP for the session duration
session_id = f"mysession_{random.randint(10000, 99999)}"
sticky_proxy = f"http://user-session-{session_id}:[email protected]:9000"
async def scrape_paginated_results(base_url: str, pages: int) -> list[dict]:
"""All pages use the same IP — looks like one user browsing."""
results = []
async with httpx.AsyncClient(proxy=sticky_proxy, timeout=30) as client:
for page in range(1, pages + 1):
url = f"{base_url}?page={page}"
resp = await client.get(url, headers={
"Referer": base_url if page == 1 else f"{base_url}?page={page-1}",
})
if resp.status_code == 200:
results.append({"page": page, "html": resp.text})
return results
Session Duration Strategy
Different targets require different session durations:
| Use Case | Recommended Duration | Why |
|---|---|---|
| Product page scraping | No session (rotating) | Each page is independent |
| Search pagination | 5-10 minutes | Enough to paginate through results |
| Login + data export | 10-30 minutes | Complete the full workflow |
| Social media browsing | 15-30 minutes | Mimic real browsing session |
| Account management | 30+ minutes | Maintain account association |
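The table above is easy to encode as a small helper that picks a randomized duration inside the recommended band. The bands come from the table; the jitter, and the 60-minute cap on the open-ended "30+" row, are my additions:

```python
import random

# Recommended sticky-session bands in minutes, from the table above.
# (0, 0) means "use rotating proxies instead of a session".
SESSION_BANDS = {
    "product_pages": (0, 0),
    "search_pagination": (5, 10),
    "login_export": (10, 30),
    "social_browsing": (15, 30),
    "account_management": (30, 60),  # "30+" in the table; cap is an assumption
}

def pick_session_minutes(use_case: str) -> int:
    """Return a randomized session duration (minutes) for a use case."""
    low, high = SESSION_BANDS[use_case]
    if high == 0:
        return 0  # rotating: no sticky session at all
    return random.randint(low, high)
```

Randomizing inside the band matters: if every session you open lasts exactly 10 minutes, that uniformity is itself a detectable pattern.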
CAPTCHA Bypass Strategies
CAPTCHAs are the next layer of defense after IP blocking. Even with residential proxies, you'll encounter them on heavily protected sites.
Understanding CAPTCHA Triggers
CAPTCHAs aren't random. They're triggered by specific signals:
- IP reputation — New or flagged IPs get more CAPTCHAs
- Request patterns — Uniform timing between requests triggers challenges
- Browser fingerprint — Missing or inconsistent JavaScript fingerprints
- Behavioral signals — No mouse movement, no scrolling, instant form submission
- TLS fingerprint — Non-browser TLS handshakes (JA3/JA4 fingerprints)
Strategy 1: Reduce CAPTCHA Frequency
The best CAPTCHA strategy is avoiding them entirely:
import httpx
import random
import time
def human_like_headers() -> dict:
"""Generate realistic browser headers."""
chrome_versions = ["130.0.0.0", "131.0.0.0", "132.0.0.0"]
version = random.choice(chrome_versions)
return {
"User-Agent": f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
f"AppleWebKit/537.36 (KHTML, like Gecko) "
f"Chrome/{version} Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;"
"q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
def human_like_delay():
"""Random delay that mimics human browsing patterns."""
# Humans don't click at exactly 2-second intervals
base = random.uniform(2, 5)
# Occasionally take longer (reading the page)
if random.random() < 0.2:
base += random.uniform(5, 15)
time.sleep(base)
Strategy 2: CAPTCHA Solving Services
When you can't avoid CAPTCHAs, solve them programmatically:
import httpx
import asyncio
import time
class CaptchaSolver:
"""Integration with CAPTCHA solving services."""
def __init__(self, api_key: str, service: str = "2captcha"):
self.api_key = api_key
self.service = service
self.base_url = "https://2captcha.com/in.php"
self.result_url = "https://2captcha.com/res.php"
async def solve_recaptcha_v2(
self, site_key: str, page_url: str
) -> str | None:
"""Submit reCAPTCHA v2 and wait for solution."""
async with httpx.AsyncClient() as client:
# Submit task
resp = await client.post(self.base_url, data={
"key": self.api_key,
"method": "userrecaptcha",
"googlekey": site_key,
"pageurl": page_url,
"json": 1,
})
task = resp.json()
if task.get("status") != 1:
return None
task_id = task["request"]
# Poll for result (typically 20-60 seconds)
for _ in range(30):
await asyncio.sleep(5)
result = await client.get(self.result_url, params={
"key": self.api_key,
"action": "get",
"id": task_id,
"json": 1,
})
data = result.json()
if data.get("status") == 1:
return data["request"] # The solved token
                elif data.get("request") == "CAPCHA_NOT_READY":  # 2Captcha's own spelling
continue
else:
return None # Error
return None # Timeout
async def solve_hcaptcha(
self, site_key: str, page_url: str
) -> str | None:
"""Solve hCaptcha challenge."""
async with httpx.AsyncClient() as client:
resp = await client.post(self.base_url, data={
"key": self.api_key,
"method": "hcaptcha",
"sitekey": site_key,
"pageurl": page_url,
"json": 1,
})
task = resp.json()
if task.get("status") != 1:
return None
task_id = task["request"]
for _ in range(30):
await asyncio.sleep(5)
result = await client.get(self.result_url, params={
"key": self.api_key,
"action": "get",
"id": task_id,
"json": 1,
})
data = result.json()
if data.get("status") == 1:
return data["request"]
elif data.get("request") == "CAPCHA_NOT_READY":
continue
else:
return None
return None
Strategy 3: Browser Automation for JavaScript Challenges
For Cloudflare Turnstile and similar challenges that require browser execution:
from playwright.async_api import async_playwright
import asyncio
async def solve_cloudflare_challenge(url: str, proxy: str) -> str | None:
"""Navigate through Cloudflare's challenge page using a real browser."""
proxy_parts = proxy.replace("http://", "").split("@")
user_pass = proxy_parts[0].split(":")
server = proxy_parts[1]
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy={
"server": f"http://{server}",
"username": user_pass[0],
"password": user_pass[1],
},
)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36"
),
)
page = await context.new_page()
try:
await page.goto(url, wait_until="domcontentloaded")
# Wait for Cloudflare challenge to resolve (up to 15 seconds)
for _ in range(30):
await asyncio.sleep(0.5)
title = await page.title()
# Cloudflare challenge pages have specific titles
if "Just a moment" not in title and "Attention Required" not in title:
break
# Get the page content after challenge is solved
content = await page.content()
            # Extract cookies so follow-up requests can skip the browser:
            # replay cf_clearance with the SAME proxy IP and User-Agent,
            # or Cloudflare will re-challenge
cookies = await context.cookies()
cf_clearance = next(
(c for c in cookies if c["name"] == "cf_clearance"),
None
)
return content
finally:
await browser.close()
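Once the challenge resolves, the extracted cookies can be replayed through a plain HTTP client, as long as you keep the same proxy IP and User-Agent the browser used. Here's a sketch of turning Playwright's cookie list into a Cookie header; the `name`/`value`/`domain` field names match what `context.cookies()` returns:

```python
def playwright_cookies_to_header(cookies: list[dict], domain: str) -> str:
    """Build a Cookie header from Playwright-style cookie dicts for one domain."""
    pairs = [
        f"{c['name']}={c['value']}"
        for c in cookies
        # Playwright stores domains like ".example.com"; match suffix
        if domain.endswith(c.get("domain", "").lstrip("."))
    ]
    return "; ".join(pairs)

# Usage sketch: pass the header to httpx/requests with the same proxy
# and User-Agent used by the browser:
# headers = {
#     "Cookie": playwright_cookies_to_header(cookies, "example.com"),
#     "User-Agent": same_user_agent_as_browser,
# }
```

This hybrid pattern (browser for the challenge, plain client for the bulk requests) is dramatically cheaper in bandwidth and proxy cost than keeping a browser open for every page.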
Cost Analysis: Real Numbers for Real Projects
Let's compare costs across actual scraping scenarios. These are based on 2026 pricing from major providers.
Scenario 1: E-Commerce Price Monitoring
Task: Monitor 10,000 product pages daily, each ~500KB average response. Data volume: ~5GB per day, ~150GB per month.
| Proxy Type | Monthly Cost | Notes |
|---|---|---|
| Datacenter | $75-150 | $0.50-1/GB; works if no bot detection |
| Residential (ThorData) | $750-1,200 | $5-8/GB; needed for Amazon, Walmart |
| Residential (Bright Data) | $1,260-2,100 | $8.40-14/GB; premium network, premium price |
| Mobile | $3,000-7,500 | $20-50/GB; overkill for most e-commerce |
Recommendation: Start with datacenter. Escalate to residential (via ThorData) only for sites that block datacenter IPs.
Scenario 2: Social Media Data Collection
Task: Collect 50,000 public profiles per month, ~200KB per profile. Data volume: ~10GB per month.
| Proxy Type | Monthly Cost | Notes |
|---|---|---|
| Datacenter | Not viable | Social platforms block datacenter IPs |
| Residential | $50-100 | $5-10/GB; standard approach |
| Mobile | $200-500 | $20-50/GB; best for Instagram, TikTok |
Recommendation: Residential for LinkedIn and Twitter. Mobile for Instagram and TikTok if residential gets blocked frequently.
Scenario 3: SERP Tracking (10,000 Keywords)
Task: Track rankings for 10,000 keywords daily across Google, Bing. Data volume: ~3GB per day, ~90GB per month.
| Proxy Type | Monthly Cost | Notes |
|---|---|---|
| Datacenter | $45-90 | Works for Bing, partially for Google |
| Residential | $450-720 | Standard for Google SERP scraping |
| ISP | $200-500 | Good middle ground for moderate volume |
Recommendation: Datacenter for Bing, residential for Google. ISP proxies work for Google at lower volumes.
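The scenario math above is simple enough to script before you commit to a plan. A minimal estimator; the per-GB price ranges are the ones quoted in this guide, and decimal units (1 GB = 1,000,000 KB) are assumed:

```python
# Per-GB price ranges (USD) as quoted earlier in this guide
PRICE_PER_GB = {
    "datacenter": (0.50, 1.0),
    "residential": (5.0, 8.0),
    "mobile": (20.0, 50.0),
}

def monthly_gb(pages_per_day: int, kb_per_page: float, days: int = 30) -> float:
    """Traffic volume in GB/month (decimal: 1 GB = 1,000,000 KB)."""
    return pages_per_day * kb_per_page * days / 1_000_000

def monthly_cost(gb: float, proxy_type: str) -> tuple[float, float]:
    """(low, high) estimated monthly cost in USD for a proxy type."""
    low, high = PRICE_PER_GB[proxy_type]
    return (gb * low, gb * high)

# Scenario 1: 10,000 product pages/day at ~500KB each
gb = monthly_gb(10_000, 500)              # 150.0 GB/month
lo, hi = monthly_cost(gb, "residential")  # (750.0, 1200.0)
```

Plug in your own page counts and response sizes; the per-GB prices vary by provider and volume tier, so treat the output as a planning estimate, not a quote.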
Anti-Detection Best Practices Beyond Proxies
Proxies are necessary but not sufficient. Modern bot detection looks at many signals beyond your IP address.
TLS Fingerprinting (JA3/JA4)
Every HTTPS client has a unique TLS fingerprint based on how it negotiates the connection. Python's requests and httpx libraries have fingerprints that differ from real browsers.
# Use curl_cffi for browser-like TLS fingerprints
from curl_cffi import requests as cffi_requests
# Impersonate Chrome's TLS fingerprint
resp = cffi_requests.get(
"https://example.com",
impersonate="chrome131",
proxies={"https": "http://user:[email protected]:9000"},
)
print(resp.status_code)
HTTP/2 Fingerprinting
Modern sites check HTTP/2 settings (SETTINGS frame, WINDOW_UPDATE, PRIORITY frames). Standard Python clients send different HTTP/2 parameters than browsers.
# httpx with HTTP/2 support (requires: pip install "httpx[http2]")
import httpx
client = httpx.Client(
    http2=True,  # Enable HTTP/2
    proxy="http://user:[email protected]:9000",
)
# Note: httpx's HTTP/2 SETTINGS still differ from Chrome's, so this
# alone won't defeat HTTP/2 fingerprinting; curl_cffi's impersonation
# covers the HTTP/2 layer as well as TLS.
Header Order Fingerprinting
Browsers send headers in a specific, consistent order. Python libraries often send them in a different order. Some detection systems check this.
# Correct header order for Chrome. A plain dict also preserves insertion
# order in Python 3.7+; OrderedDict just makes the intent explicit.
from collections import OrderedDict
headers = OrderedDict([
("Host", "example.com"),
("Connection", "keep-alive"),
("Cache-Control", "max-age=0"),
("sec-ch-ua", '"Chromium";v="131", "Not_A Brand";v="24"'),
("sec-ch-ua-mobile", "?0"),
("sec-ch-ua-platform", '"macOS"'),
("Upgrade-Insecure-Requests", "1"),
("User-Agent", "Mozilla/5.0 ..."),
("Accept", "text/html,application/xhtml+xml,..."),
("Sec-Fetch-Site", "none"),
("Sec-Fetch-Mode", "navigate"),
("Sec-Fetch-User", "?1"),
("Sec-Fetch-Dest", "document"),
("Accept-Encoding", "gzip, deflate, br"),
("Accept-Language", "en-US,en;q=0.9"),
])
Real-World Use Cases
E-Commerce Price Intelligence
import httpx
import asyncio
import json
from datetime import datetime
async def monitor_competitor_prices(
products: list[dict],
proxy_url: str,
) -> list[dict]:
"""
Monitor competitor pricing across e-commerce sites.
Uses residential proxies for protected sites.
"""
results = []
async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
for product in products:
try:
resp = await client.get(product["url"], headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36",
"Accept-Language": "en-US,en;q=0.9",
})
if resp.status_code == 200:
# Parse price from response (site-specific logic)
results.append({
"product_id": product["id"],
"url": product["url"],
"status": "success",
"timestamp": datetime.utcnow().isoformat(),
"html_size": len(resp.content),
})
else:
results.append({
"product_id": product["id"],
"status": "blocked",
"http_code": resp.status_code,
})
await asyncio.sleep(2)
except Exception as e:
results.append({
"product_id": product["id"],
"status": "error",
"error": str(e),
})
return results
Academic Research: Collecting Public Data
import httpx
import asyncio
import csv

async def collect_research_data(
    search_queries: list[str],
    proxy_url: str,
    output_file: str,
) -> int:
    """
    Collect public research data from academic sources and save it as CSV.
    Uses datacenter proxies (academic sites rarely block).
    """
    rows = []
    async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
        for query in search_queries:
            try:
                resp = await client.get(
                    "https://api.openalex.org/works",
                    params={
                        "search": query,
                        "per_page": 50,
                        "sort": "relevance_score:desc",
                    },
                )
                if resp.status_code == 200:
                    for work in resp.json().get("results", []):
                        rows.append([query, work.get("id"), work.get("display_name")])
                await asyncio.sleep(1)
            except Exception:
                continue
    # Persist the collected records to the requested CSV file
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "work_id", "title"])
        writer.writerows(rows)
    return len(rows)
Brand Monitoring and Reputation Tracking
import httpx
import asyncio
from bs4 import BeautifulSoup
async def monitor_brand_mentions(
brand_name: str,
review_sites: list[str],
proxy_url: str,
) -> list[dict]:
"""
Monitor brand mentions across review sites.
Uses residential proxies for sites with bot detection.
"""
mentions = []
async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
for site_url in review_sites:
try:
resp = await client.get(
site_url,
params={"q": brand_name},
headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36",
},
)
if resp.status_code == 200:
soup = BeautifulSoup(resp.text, "html.parser")
# Extract reviews (site-specific selectors)
review_elements = soup.select(".review, .comment, .testimonial")
for elem in review_elements:
text = elem.get_text(strip=True)
if brand_name.lower() in text.lower():
mentions.append({
"source": site_url,
"text": text[:500],
"sentiment": "pending_analysis",
})
await asyncio.sleep(3)
except Exception:
continue
return mentions
The 5-Step Decision Framework
Before you buy proxies, work through this framework:
Step 1: Does your target use bot detection?
- No → Use datacenter proxies. Save your money.
- Yes → Continue to step 2.
Step 2: What detection system?
- Basic rate limiting only → Datacenter proxies with rotation are fine.
- Cloudflare, Akamai, DataDome, PerimeterX → You need residential or better. Continue to step 3.
Step 3: Does your workflow involve sessions?
- No (independent pages) → Use rotating residential proxies.
- Yes (login, pagination, multi-step) → Use sticky residential sessions.
Step 4: What's your budget vs. volume?
- High volume, tight budget → ThorData residential for competitive per-GB rates.
- Low volume, needs reliability → ISP proxies for consistent performance.
- Any budget, maximum success rate needed → Mobile proxies.
Step 5: How critical is the data?
- Nice-to-have → Start cheap, accept some failures.
- Revenue-critical → Invest in reliable proxies and build automatic escalation.
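The escalation mentioned in Step 5 can be sketched in a few lines. This is a minimal version under stated assumptions: `fetch_via` is a hypothetical callable you supply (for example, wrapping httpx with the right proxy URL per tier), and the set of "blocked" status codes is illustrative — tune it to what your targets actually return.

```python
# Status codes treated as "blocked, escalate" (illustrative set).
BLOCK_STATUSES = {403, 407, 429, 503}

def fetch_with_escalation(url, fetch_via,
                          tiers=("datacenter", "residential", "mobile")):
    """Try the cheapest proxy tier first, climbing only when blocked.

    fetch_via(url, tier) -> (status_code, body) is supplied by the caller.
    Returns (tier, status_code, body) for the first non-blocked response,
    or the last attempt if every tier was blocked.
    """
    last = None
    for tier in tiers:
        status, body = fetch_via(url, tier)
        last = (tier, status, body)
        if status not in BLOCK_STATUSES:
            return last  # not blocked -- stop escalating
    return last  # blocked at every tier; caller logs and moves on
```

Because most requests succeed at the datacenter tier, the expensive tiers only see the traffic that actually needs them — which is the whole point of the framework.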
Common Mistakes and How to Avoid Them
Mistake 1: Using Residential Proxies for Everything
Problem: Burning money scraping static blogs through $10/GB proxies. Fix: Start with datacenter. Escalate only when blocked.
Mistake 2: Not Rotating User-Agents
Problem: Same User-Agent across thousands of requests, even with different IPs. Fix: Rotate User-Agents that match real browser distributions.
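A rotation helper is a few lines of Python. The User-Agent strings and weights below are illustrative placeholders — keep them updated against real browser market-share data rather than hardcoding them forever.

```python
import random

# (User-Agent string, approximate share) -- illustrative values only.
USER_AGENTS = [
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36", 0.65),
    ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36", 0.20),
    ("Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 "
     "Firefox/133.0", 0.15),
]

def pick_user_agent() -> str:
    """Weighted random choice so the mix mirrors a realistic distribution."""
    agents, weights = zip(*USER_AGENTS)
    return random.choices(agents, weights=weights, k=1)[0]
```

Weighted selection matters: a pool where rare browsers appear as often as Chrome is itself a statistical fingerprint.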
Mistake 3: Ignoring Request Timing
Problem: 100 requests per second looks robotic regardless of proxy type. Fix: Add random delays (2-10 seconds) that mimic human browsing.
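One way to make the delays less uniform is to add an occasional longer "reading" pause on top of the 2-10 second base. The 5% probability and the pause ranges here are illustrative, not tuned values.

```python
import random

def human_delay() -> float:
    """Seconds to sleep between requests, loosely mimicking a human."""
    # Occasionally a visitor stops to actually read the page.
    if random.random() < 0.05:  # 5% chance of a long pause (illustrative)
        return random.uniform(15.0, 45.0)
    return random.uniform(2.0, 10.0)
```

Perfectly uniform randomness is still a pattern; mixing distributions makes the timing histogram look less machine-generated.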
Mistake 4: Not Monitoring Proxy Health
Problem: Half your proxies are returning errors, wasting time and credits. Fix: Track success rates per proxy and auto-remove underperformers.
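A minimal health tracker can live in a few lines. This sketch assumes per-proxy bookkeeping in memory; the 20-request minimum and 60% success threshold are illustrative defaults to tune against your own traffic.

```python
from collections import defaultdict

class ProxyHealth:
    """Track per-proxy success rates and flag underperformers."""

    def __init__(self, min_requests: int = 20, min_success_rate: float = 0.6):
        self.stats = defaultdict(lambda: {"ok": 0, "total": 0})
        self.min_requests = min_requests
        self.min_success_rate = min_success_rate

    def record(self, proxy: str, success: bool) -> None:
        self.stats[proxy]["total"] += 1
        if success:
            self.stats[proxy]["ok"] += 1

    def healthy(self, proxy: str) -> bool:
        s = self.stats[proxy]
        if s["total"] < self.min_requests:
            return True  # not enough data yet -- give it the benefit of the doubt
        return s["ok"] / s["total"] >= self.min_success_rate
```

Call `record()` after every request and check `healthy()` before reuse; proxies that fall below the threshold drop out of rotation automatically.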
Mistake 5: Sticky Sessions That Are Too Long
Problem: Holding the same IP for hours when you only need 5 minutes. Fix: Match session duration to actual workflow length.
Mistake 6: Not Testing Proxies Before Large Runs
Problem: Running 10,000 requests through untested proxies, discovering issues after burning credits. Fix: Always test with 10-20 requests first, verify success rate, then scale up.
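The canary run can be a tiny gate in front of the full job. This sketch assumes you can express a single attempt as a `fetch(url) -> bool` callable; the sample size and 80% threshold are illustrative defaults.

```python
def canary_check(urls, fetch, sample_size: int = 15,
                 min_success: float = 0.8) -> bool:
    """Probe a small sample before committing credits to the full run.

    fetch(url) -> bool is supplied by the caller (True on success).
    Returns True only if the sample's success rate clears the threshold.
    """
    sample = urls[:sample_size]
    if not sample:
        return False  # nothing to test -- refuse to green-light the run
    ok = sum(1 for url in sample if fetch(url))
    return ok / len(sample) >= min_success
```

If the canary fails, you have burned a dozen requests instead of ten thousand, and you know to check proxies, headers, or target changes before scaling up.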
Provider Comparison: Key Factors
When evaluating proxy providers, focus on these concrete factors:
| Factor | What to Check | Why It Matters |
|---|---|---|
| IP pool size | Total IPs and geo-distribution | Larger pools = less chance of hitting reused/flagged IPs |
| Session support | Rotating + sticky options | Some workflows require persistent sessions |
| Authentication | IP whitelisting + user:pass | IP whitelisting is more secure but less flexible |
| Bandwidth limits | Per-GB pricing vs unlimited | Predictable costs vs. risk of surprise bills |
| Geo-targeting | Country, state, city, ASN | Some targets serve different content by location |
| Protocol support | HTTP, HTTPS, SOCKS5 | SOCKS5 handles non-HTTP traffic and is required by some tools |
| Concurrent connections | Max simultaneous connections | Bottleneck for high-throughput scraping |
| Response time | Average latency by proxy type | Slower proxies = longer scraping runs |
For a solid balance of pool size, pricing, and features, ThorData covers the essentials — large residential pool, both session types, competitive per-GB pricing, and good geo-targeting options.
The Bottom Line
Don't default to residential proxies because a proxy provider told you to. Start with datacenter proxies. They're 5-20x cheaper and work for the majority of scraping tasks. Switch to residential only when you hit actual blocks — Cloudflare challenges, Amazon CAPTCHAs, social media rate limits.
When you do need residential proxies, pick a provider with good IP diversity, sticky session support, and competitive pricing. Build automatic escalation into your code so you use the cheapest proxy that works for each target.
The scraping landscape keeps evolving. Detection systems get smarter, but proxy technology evolves alongside them. The developers who do best are the ones who match the right tool to each specific problem — not the ones who throw expensive proxies at everything hoping it works.
Match the tool to the problem. Your scraping budget will thank you.