Residential Proxies vs Datacenter Proxies: The Complete 2026 Guide for Web Scraping
If you're building a scraper in 2026, you'll hit the proxy question fast. Datacenter or residential? The proxy industry wants you to believe you need residential proxies for everything. You don't. But sometimes you really do.
The proxy landscape has shifted significantly over the past two years. Bot detection systems have gotten smarter, but so have the proxy technologies available to developers. Mobile proxies have emerged as a serious option, ISP proxies offer an interesting middle ground, and pricing models have diversified beyond simple per-GB billing.
This guide covers everything you need to make an informed decision — not just the basics, but the nuances that only matter once you're running scrapers at scale.
Understanding the Four Proxy Types
Before diving into comparisons, let's be precise about what each proxy type actually is, how it works, and what makes it different at the network level.
Datacenter Proxies
Datacenter proxies come from cloud providers — AWS, Hetzner, OVH, DigitalOcean. These are IP addresses that belong to autonomous systems (AS numbers) registered to hosting companies. When a target site looks up the ASN for your IP, it sees something like "AMAZON-02" or "HETZNER-DC" — a dead giveaway that you're not a regular user.
How they work: Your request goes from your machine to a proxy server in a data center, which forwards it to the target site. The target sees the datacenter IP. Simple, fast, no middleman beyond the proxy server itself.
Performance characteristics:

- Latency: 10-50ms (fast, predictable)
- Bandwidth: High (100Mbps+ is common)
- Uptime: 99.9%+ (enterprise infrastructure)
- IP pool size: Varies, but typically thousands to tens of thousands
- Connection reliability: Very high
Cost: $0.50-5 per GB, or $1-3 per IP per month for dedicated proxies. This is 5-20x cheaper than residential alternatives.
When datacenter proxies are all you need:

- Public APIs with rate limits (just rotate IPs)
- News sites, blogs, documentation pages
- Government data portals and public records
- Price comparison on smaller e-commerce sites
- Any site that doesn't actively fingerprint proxy traffic
- Academic research sites and open-access journals
- Weather and sports data aggregation
- Job boards without advanced bot detection
- Search engine results pages (with proper rotation)
Most developers start here and never need to upgrade. If your target doesn't use advanced bot detection, don't waste money on residential IPs.
Residential Proxies
Residential proxies route your traffic through real home internet connections — Comcast, Vodafone, AT&T, BT, Deutsche Telekom. To the target site, you look like a regular person browsing from their couch.
How they work: Proxy providers maintain networks of residential IPs through SDK integrations with mobile apps, browser extensions, or peer-to-peer networks. When you route traffic through a residential proxy, it exits from someone's home ISP connection. The target site sees an IP registered to a consumer ISP, which is exactly what legitimate traffic looks like.
Performance characteristics:

- Latency: 100-500ms (higher and more variable than datacenter)
- Bandwidth: Variable (depends on the residential connection)
- Uptime: Lower than datacenter (connections drop when users go offline)
- IP pool size: Millions (large providers have 50M+ IPs)
- Connection reliability: Moderate (individual IPs may disconnect)
Cost: $5-20 per GB. The cost varies significantly by provider, geo-target, and volume. Geo-targeting specific countries or cities costs more.
When you genuinely need residential proxies:

- Cloudflare-protected sites: Cloudflare's bot management flags datacenter IP ranges by default. Residential IPs sail through the initial challenge.
- Amazon product pages: Amazon has been blocking datacenter ranges aggressively since 2024. Residential proxies are basically mandatory now.
- Social media platforms: Instagram, LinkedIn, and TikTok all fingerprint datacenter traffic. You'll get CAPTCHAs or shadow-blocks within minutes.
- Sneaker sites, ticket platforms: Anything with anti-bot middleware (Akamai, PerimeterX, DataDome) requires residential IPs.
- Google Search at scale: A few queries from datacenter IPs are fine. Thousands per hour? You need residential.
- Banking and financial sites: Strong fraud detection systems treat datacenter IPs as suspicious by default.
- Travel booking sites: Expedia, Booking.com, and airline sites use sophisticated bot detection that flags datacenter traffic.
ISP Proxies (The Middle Ground)
ISP proxies are a hybrid — they're hosted in data centers but registered under residential ISP ASNs. This gives you datacenter-level speed and reliability with residential-level detection resistance.
How they work: Providers purchase IP blocks from ISPs and host them in data centers. The IPs show up as residential in ASN lookups, but they have the stability and speed of datacenter infrastructure. Think of it as renting an apartment in a residential neighborhood but running it like an office.
Performance characteristics:

- Latency: 20-80ms (near-datacenter performance)
- Bandwidth: High (datacenter infrastructure)
- Uptime: 99%+ (no dependency on end-user connections)
- IP pool size: Smaller (thousands, not millions)
- Connection reliability: Very high
Cost: $10-30 per IP per month (usually sold as dedicated IPs, not per GB). More expensive per-IP than datacenter, cheaper per-GB than residential for high-traffic use cases.
Best for:

- Account management (social media, e-commerce accounts)
- Long-running sessions where you need the same IP for hours
- Sites that check ISP ASNs but don't do deep fingerprinting
- SEO monitoring and SERP tracking
- Ad verification and brand protection
Mobile Proxies
Mobile proxies route traffic through 3G/4G/5G connections from real mobile devices. Mobile carriers use CGNAT (Carrier-Grade NAT), meaning hundreds or thousands of users share the same IP. This makes mobile IPs incredibly hard to block — banning one mobile IP affects thousands of legitimate users.
How they work: Traffic is routed through USB modems, mobile hotspots, or app-based SDKs connected to cellular networks. The target sees an IP from a mobile carrier's IP pool. Because of CGNAT, these IPs have naturally high trust scores.
Performance characteristics:

- Latency: 50-200ms (depends on cellular network)
- Bandwidth: Variable (cell network dependent)
- Uptime: Good (carrier infrastructure)
- IP pool size: Moderate (tens of thousands)
- Connection reliability: Good, but occasional drops
Cost: $20-50 per GB, or $50-300 per port per month. The most expensive option, but sometimes the only one that works.
Best for:

- Platforms with the most aggressive bot detection (Instagram, TikTok)
- Account creation and verification workflows
- Mobile-specific content that differs from desktop
- Targets that have blocked all known residential proxy ranges
- Social media automation at scale
Real-World Detection Rates: What Actually Gets Blocked
Theory is nice, but what matters is whether your requests succeed. Here's what you'll actually encounter against common bot detection systems in 2026:
Cloudflare Bot Management
| Proxy Type | Success Rate | Notes |
|---|---|---|
| Datacenter | 5-20% | Immediate challenge pages, high block rate |
| Residential | 85-95% | Most pass the initial challenge |
| ISP | 60-80% | Better than datacenter, but Cloudflare has started flagging some ISP ranges |
| Mobile | 95-99% | Highest success rate due to CGNAT trust |
Amazon Product Pages
| Proxy Type | Success Rate | Notes |
|---|---|---|
| Datacenter | 10-30% | Aggressive blocking since 2024 |
| Residential | 80-90% | Works well with proper rotation |
| ISP | 50-70% | Hit or miss depending on the ISP range |
| Mobile | 90-95% | Very reliable but expensive for high-volume scraping |
LinkedIn Profiles
| Proxy Type | Success Rate | Notes |
|---|---|---|
| Datacenter | <5% | Almost immediately blocked |
| Residential | 70-85% | Needs sticky sessions for logged-in scraping |
| ISP | 40-60% | Better for account management than bulk scraping |
| Mobile | 85-95% | Best option for account-based operations |
Google Search Results
| Proxy Type | Success Rate | Notes |
|---|---|---|
| Datacenter | 30-50% | Works at low volume with rotation |
| Residential | 90-95% | Standard choice for SERP scraping |
| ISP | 70-85% | Good for moderate volume |
| Mobile | 95%+ | Overkill unless other types fail |
These numbers assume proper rotation, reasonable request rates, and basic anti-detection practices (realistic headers, random delays). Hammering any site at maximum speed will get you blocked regardless of proxy type.
Python Code Examples: Complete Proxy Integration
Basic Proxy Rotation with httpx
import httpx
import random
import time
class ProxyRotator:
    """Tracks failed proxies and returns a random healthy one per request."""
    def __init__(self, proxies: list[str]):
        self.proxies = proxies
        self.failed = set()
    def get_next(self) -> str:
        available = [p for p in self.proxies if p not in self.failed]
        if not available:
            self.failed.clear()  # Reset if all failed
            available = self.proxies
        return random.choice(available)
def mark_failed(self, proxy: str):
self.failed.add(proxy)
# Datacenter proxy rotation
dc_proxies = [
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
]
rotator = ProxyRotator(dc_proxies)
def scrape_with_rotation(urls: list[str]) -> list[dict]:
results = []
for url in urls:
proxy = rotator.get_next()
try:
with httpx.Client(proxy=proxy, timeout=30) as client:
resp = client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
})
results.append({
"url": url,
"status": resp.status_code,
"size": len(resp.content),
})
except (httpx.ConnectError, httpx.TimeoutException) as e:
rotator.mark_failed(proxy)
results.append({"url": url, "status": "error", "error": str(e)})
time.sleep(random.uniform(1, 3)) # Be polite
return results
Residential Proxy with ThorData
ThorData provides both rotating and sticky residential proxies with a clean API. Here's how to integrate them:
import httpx
import asyncio
import random
from dataclasses import dataclass
@dataclass
class ThorDataConfig:
username: str
password: str
host: str = "proxy.thordata.com"
port: int = 9000
def rotating_url(self, country: str = "") -> str:
"""Get a rotating proxy URL — new IP on each request."""
user = self.username
if country:
user = f"{user}-country-{country}"
return f"http://{user}:{self.password}@{self.host}:{self.port}"
def sticky_url(self, session_id: str, country: str = "") -> str:
"""Get a sticky proxy URL — same IP for the session duration."""
user = f"{self.username}-session-{session_id}"
if country:
user = f"{user}-country-{country}"
return f"http://{user}:{self.password}@{self.host}:{self.port}"
# Initialize
thor = ThorDataConfig(username="your_user", password="your_pass")
# Rotating proxy — new IP per request
async def scrape_products(urls: list[str]) -> list[dict]:
proxy = thor.rotating_url(country="us")
results = []
async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
for url in urls:
try:
resp = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36",
})
results.append({"url": url, "status": resp.status_code})
except Exception as e:
results.append({"url": url, "error": str(e)})
await asyncio.sleep(random.uniform(0.5, 2))
return results
# Sticky session — same IP for multi-step flows
async def scrape_with_session(login_url: str, data_urls: list[str]) -> list[dict]:
session_id = f"sess_{random.randint(100000, 999999)}"
proxy = thor.sticky_url(session_id=session_id, country="us")
async with httpx.AsyncClient(proxy=proxy, timeout=30) as client:
# Login (IP stays the same)
login_resp = await client.post(login_url, data={
"username": "user", "password": "pass"
})
if login_resp.status_code != 200:
return [{"error": "Login failed"}]
# Scrape data pages with same IP
results = []
for url in data_urls:
resp = await client.get(url)
results.append({"url": url, "status": resp.status_code})
await asyncio.sleep(random.uniform(1, 3))
return results
Async High-Throughput Scraping with Proxy Pool
import httpx
import asyncio
import random
import time
from collections import defaultdict
class SmartProxyPool:
"""Proxy pool that tracks success rates and auto-adjusts."""
def __init__(self, proxies: list[str], max_concurrent: int = 10):
self.proxies = proxies
self.semaphore = asyncio.Semaphore(max_concurrent)
self.stats = defaultdict(lambda: {"success": 0, "fail": 0})
self.cooldown = {} # proxy -> timestamp when cooldown ends
def get_proxy(self) -> str:
now = time.time()
available = [
p for p in self.proxies
if self.cooldown.get(p, 0) < now
]
if not available:
available = self.proxies # All on cooldown, use anyway
        # Prefer proxies with higher success rates; floor the weight so a
        # fully-failed proxy keeps a small chance and the weights never
        # sum to zero (random.choices raises on an all-zero total)
        weights = []
        for p in available:
            s = self.stats[p]
            total = s["success"] + s["fail"]
            if total == 0:
                weights.append(1.0)
            else:
                weights.append(max(0.05, s["success"] / total))
        return random.choices(available, weights=weights, k=1)[0]
def report_success(self, proxy: str):
self.stats[proxy]["success"] += 1
def report_failure(self, proxy: str, cooldown_seconds: int = 30):
self.stats[proxy]["fail"] += 1
self.cooldown[proxy] = time.time() + cooldown_seconds
async def fetch(self, url: str, client_kwargs: dict = None) -> dict:
async with self.semaphore:
proxy = self.get_proxy()
try:
async with httpx.AsyncClient(
proxy=proxy,
timeout=30,
**(client_kwargs or {})
) as client:
resp = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36",
"Accept": "text/html,application/xhtml+xml",
})
if resp.status_code == 200:
self.report_success(proxy)
return {
"url": url,
"status": 200,
"content": resp.text,
"proxy": proxy,
}
elif resp.status_code == 403:
self.report_failure(proxy, cooldown_seconds=60)
return {"url": url, "status": 403, "blocked": True}
elif resp.status_code == 429:
self.report_failure(proxy, cooldown_seconds=120)
return {"url": url, "status": 429, "rate_limited": True}
else:
return {"url": url, "status": resp.status_code}
except (httpx.ConnectError, httpx.TimeoutException) as e:
self.report_failure(proxy, cooldown_seconds=60)
return {"url": url, "error": str(e)}
async def main():
# Mix of proxy types for different targets
pool = SmartProxyPool(
proxies=[
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
"http://user:[email protected]:9000",
"http://user:[email protected]:9000",
],
max_concurrent=5,
)
urls = [f"https://example.com/product/{i}" for i in range(100)]
tasks = [pool.fetch(url) for url in urls]
results = await asyncio.gather(*tasks)
success = sum(1 for r in results if r.get("status") == 200)
blocked = sum(1 for r in results if r.get("blocked"))
errors = sum(1 for r in results if "error" in r)
print(f"Results: {success} success, {blocked} blocked, {errors} errors")
# Print proxy performance stats
for proxy, stats in pool.stats.items():
total = stats["success"] + stats["fail"]
rate = stats["success"] / total * 100 if total > 0 else 0
print(f" {proxy}: {rate:.0f}% success ({total} requests)")
asyncio.run(main())
Playwright with Residential Proxies
For JavaScript-heavy sites, you need a real browser routed through proxies:
from playwright.async_api import async_playwright
import asyncio
import json
async def scrape_spa_with_proxy():
"""Scrape a JavaScript-heavy site through a residential proxy."""
async with async_playwright() as p:
browser = await p.chromium.launch(
proxy={
"server": "http://proxy.thordata.com:9000",
"username": "your_user",
"password": "your_pass",
},
headless=True,
)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36"
),
locale="en-US",
timezone_id="America/New_York",
)
page = await context.new_page()
# Block unnecessary resources to save bandwidth (and proxy cost)
await page.route("**/*.{png,jpg,jpeg,gif,svg,webp}",
lambda route: route.abort())
await page.route("**/analytics*", lambda route: route.abort())
await page.route("**/tracking*", lambda route: route.abort())
try:
await page.goto("https://example.com/products",
wait_until="networkidle", timeout=30000)
# Wait for product cards to load
await page.wait_for_selector(".product-card", timeout=10000)
# Extract data
products = await page.evaluate("""
() => {
return Array.from(
document.querySelectorAll('.product-card')
).map(card => ({
title: card.querySelector('h2')?.textContent?.trim(),
price: card.querySelector('.price')?.textContent?.trim(),
url: card.querySelector('a')?.href,
}));
}
""")
return products
finally:
await browser.close()
# Run
products = asyncio.run(scrape_spa_with_proxy())
for p in products:
print(f"{p['title']}: {p['price']}")
Automatic Proxy Escalation
The smartest approach: start cheap and escalate only when needed.
import httpx
import asyncio
from enum import Enum
class ProxyTier(Enum):
DATACENTER = "datacenter"
RESIDENTIAL = "residential"
MOBILE = "mobile"
class ProxyEscalator:
"""Automatically escalates from cheap to expensive proxies based on blocks."""
def __init__(self, config: dict):
self.tiers = {
ProxyTier.DATACENTER: config["datacenter_proxies"],
ProxyTier.RESIDENTIAL: config["residential_proxies"],
ProxyTier.MOBILE: config.get("mobile_proxies", []),
}
self.tier_order = [ProxyTier.DATACENTER, ProxyTier.RESIDENTIAL, ProxyTier.MOBILE]
async def fetch(self, url: str, max_retries: int = 3) -> dict:
"""Try datacenter first, escalate to residential, then mobile."""
last_error = None
for tier in self.tier_order:
proxies = self.tiers.get(tier, [])
if not proxies:
continue
for attempt in range(max_retries):
proxy = proxies[attempt % len(proxies)]
try:
async with httpx.AsyncClient(
proxy=proxy, timeout=30
) as client:
resp = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36",
})
if resp.status_code == 200:
return {
"url": url,
"status": 200,
"content": resp.text,
"tier": tier.value,
"attempts": attempt + 1,
}
elif resp.status_code in (403, 429, 503):
# Blocked — try next attempt or escalate
last_error = f"HTTP {resp.status_code}"
await asyncio.sleep(2 ** attempt)
continue
else:
return {
"url": url,
"status": resp.status_code,
"tier": tier.value,
}
except Exception as e:
last_error = str(e)
continue
# All retries for this tier failed — escalate
print(f" Tier {tier.value} failed for {url}, escalating...")
return {"url": url, "error": last_error, "exhausted": True}
# Usage
escalator = ProxyEscalator({
"datacenter_proxies": [
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
],
"residential_proxies": [
"http://user:[email protected]:9000",
],
"mobile_proxies": [
"http://user:[email protected]:9100",
],
})
async def main():
urls = [
"https://easy-target.com/page", # Datacenter will work
"https://cloudflare-site.com/data", # Needs residential
"https://instagram.com/profile", # May need mobile
]
results = await asyncio.gather(*[escalator.fetch(url) for url in urls])
for r in results:
tier = r.get("tier", "none")
status = r.get("status", r.get("error", "unknown"))
print(f" {r['url']}: {status} (via {tier})")
asyncio.run(main())
Rotating vs Sticky Sessions
This trips up a lot of developers, so let's be thorough.
Rotating Proxies
Rotating proxies give you a new IP on every request (or every N seconds). Maximum anonymity, minimum pattern detection.
When to use rotating:

- Scraping independent product pages where each request is standalone
- Search engine results where you want maximum query volume
- Price monitoring across hundreds of sites
- Any workflow where requests don't depend on each other
Implementation pattern:
import httpx
# Most providers use username formatting for rotation control
# ThorData example:
rotating_proxy = "http://user-rotate:[email protected]:9000"
async def scrape_product_list(product_ids: list[int]) -> list[dict]:
"""Each request gets a fresh IP."""
results = []
async with httpx.AsyncClient(proxy=rotating_proxy, timeout=30) as client:
for pid in product_ids:
url = f"https://store.example.com/product/{pid}"
resp = await client.get(url)
if resp.status_code == 200:
results.append({"id": pid, "html": resp.text})
return results
Sticky Sessions
Sticky sessions keep the same IP for a set duration (usually 1-30 minutes). The proxy provider assigns you an IP and maintains the mapping for the session duration.
When to use sticky sessions:

- Logging into accounts (session cookies are often tied to IP)
- Multi-page checkout monitoring
- Paginating through search results where the site tracks your session state
- Crawling sites that use server-side session tracking
- Any multi-step workflow where IP changes look suspicious
Implementation pattern:
import httpx
import random
# Sticky session — same IP for the session duration
session_id = f"mysession_{random.randint(10000, 99999)}"
sticky_proxy = f"http://user-session-{session_id}:[email protected]:9000"
async def scrape_paginated_results(base_url: str, pages: int) -> list[dict]:
"""All pages use the same IP — looks like one user browsing."""
results = []
async with httpx.AsyncClient(proxy=sticky_proxy, timeout=30) as client:
for page in range(1, pages + 1):
url = f"{base_url}?page={page}"
resp = await client.get(url, headers={
"Referer": base_url if page == 1 else f"{base_url}?page={page-1}",
})
if resp.status_code == 200:
results.append({"page": page, "html": resp.text})
return results
Session Duration Strategy
Different targets require different session durations:
| Use Case | Recommended Duration | Why |
|---|---|---|
| Product page scraping | No session (rotating) | Each page is independent |
| Search pagination | 5-10 minutes | Enough to paginate through results |
| Login + data export | 10-30 minutes | Complete the full workflow |
| Social media browsing | 15-30 minutes | Mimic real browsing session |
| Account management | 30+ minutes | Maintain account association |
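The table above is easy to encode as a small helper that picks a randomized duration inside the recommended band. The bands come from the table; the jitter, and the 60-minute cap on the open-ended "30+" row, are my additions:

```python
import random

# Recommended sticky-session bands in minutes, from the table above.
# (0, 0) means "use rotating proxies instead of a session".
SESSION_BANDS = {
    "product_pages": (0, 0),
    "search_pagination": (5, 10),
    "login_export": (10, 30),
    "social_browsing": (15, 30),
    "account_management": (30, 60),  # "30+" in the table; cap is an assumption
}

def pick_session_minutes(use_case: str) -> int:
    """Return a randomized session duration (minutes) for a use case."""
    low, high = SESSION_BANDS[use_case]
    if high == 0:
        return 0  # rotating: no sticky session at all
    return random.randint(low, high)
```

Randomizing inside the band matters: if every session you open lasts exactly 10 minutes, that uniformity is itself a detectable pattern.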
CAPTCHA Bypass Strategies
CAPTCHAs are the next layer of defense after IP blocking. Even with residential proxies, you'll encounter them on heavily protected sites.
Understanding CAPTCHA Triggers
CAPTCHAs aren't random. They're triggered by specific signals:
- IP reputation — New or flagged IPs get more CAPTCHAs
- Request patterns — Uniform timing between requests triggers challenges
- Browser fingerprint — Missing or inconsistent JavaScript fingerprints
- Behavioral signals — No mouse movement, no scrolling, instant form submission
- TLS fingerprint — Non-browser TLS handshakes (JA3/JA4 fingerprints)
Strategy 1: Reduce CAPTCHA Frequency
The best CAPTCHA strategy is avoiding them entirely:
import httpx
import random
import time
def human_like_headers() -> dict:
"""Generate realistic browser headers."""
chrome_versions = ["130.0.0.0", "131.0.0.0", "132.0.0.0"]
version = random.choice(chrome_versions)
return {
"User-Agent": f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
f"AppleWebKit/537.36 (KHTML, like Gecko) "
f"Chrome/{version} Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;"
"q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
def human_like_delay():
"""Random delay that mimics human browsing patterns."""
# Humans don't click at exactly 2-second intervals
base = random.uniform(2, 5)
# Occasionally take longer (reading the page)
if random.random() < 0.2:
base += random.uniform(5, 15)
time.sleep(base)
Strategy 2: CAPTCHA Solving Services
When you can't avoid CAPTCHAs, solve them programmatically:
import httpx
import asyncio
import time
class CaptchaSolver:
"""Integration with CAPTCHA solving services."""
def __init__(self, api_key: str, service: str = "2captcha"):
self.api_key = api_key
self.service = service
self.base_url = "https://2captcha.com/in.php"
self.result_url = "https://2captcha.com/res.php"
async def solve_recaptcha_v2(
self, site_key: str, page_url: str
) -> str | None:
"""Submit reCAPTCHA v2 and wait for solution."""
async with httpx.AsyncClient() as client:
# Submit task
resp = await client.post(self.base_url, data={
"key": self.api_key,
"method": "userrecaptcha",
"googlekey": site_key,
"pageurl": page_url,
"json": 1,
})
task = resp.json()
if task.get("status") != 1:
return None
task_id = task["request"]
# Poll for result (typically 20-60 seconds)
for _ in range(30):
await asyncio.sleep(5)
result = await client.get(self.result_url, params={
"key": self.api_key,
"action": "get",
"id": task_id,
"json": 1,
})
data = result.json()
if data.get("status") == 1:
return data["request"] # The solved token
                elif data.get("request") == "CAPCHA_NOT_READY":  # 2Captcha's own spelling
continue
else:
return None # Error
return None # Timeout
async def solve_hcaptcha(
self, site_key: str, page_url: str
) -> str | None:
"""Solve hCaptcha challenge."""
async with httpx.AsyncClient() as client:
resp = await client.post(self.base_url, data={
"key": self.api_key,
"method": "hcaptcha",
"sitekey": site_key,
"pageurl": page_url,
"json": 1,
})
task = resp.json()
if task.get("status") != 1:
return None
task_id = task["request"]
for _ in range(30):
await asyncio.sleep(5)
result = await client.get(self.result_url, params={
"key": self.api_key,
"action": "get",
"id": task_id,
"json": 1,
})
data = result.json()
if data.get("status") == 1:
return data["request"]
elif data.get("request") == "CAPCHA_NOT_READY":
continue
else:
return None
return None
Strategy 3: Browser Automation for JavaScript Challenges
For Cloudflare Turnstile and similar challenges that require browser execution:
from playwright.async_api import async_playwright
import asyncio
async def solve_cloudflare_challenge(url: str, proxy: str) -> str | None:
"""Navigate through Cloudflare's challenge page using a real browser."""
proxy_parts = proxy.replace("http://", "").split("@")
user_pass = proxy_parts[0].split(":")
server = proxy_parts[1]
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy={
"server": f"http://{server}",
"username": user_pass[0],
"password": user_pass[1],
},
)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36"
),
)
page = await context.new_page()
try:
await page.goto(url, wait_until="domcontentloaded")
# Wait for Cloudflare challenge to resolve (up to 15 seconds)
for _ in range(30):
await asyncio.sleep(0.5)
title = await page.title()
# Cloudflare challenge pages have specific titles
if "Just a moment" not in title and "Attention Required" not in title:
break
# Get the page content after challenge is solved
content = await page.content()
            # Extract cookies so follow-up requests can skip the browser:
            # replay cf_clearance with the SAME proxy IP and User-Agent,
            # or Cloudflare will re-challenge
cookies = await context.cookies()
cf_clearance = next(
(c for c in cookies if c["name"] == "cf_clearance"),
None
)
return content
finally:
await browser.close()
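Once the challenge resolves, the extracted cookies can be replayed through a plain HTTP client, as long as you keep the same proxy IP and User-Agent the browser used. Here's a sketch of turning Playwright's cookie list into a Cookie header; the `name`/`value`/`domain` field names match what `context.cookies()` returns:

```python
def playwright_cookies_to_header(cookies: list[dict], domain: str) -> str:
    """Build a Cookie header from Playwright-style cookie dicts for one domain."""
    pairs = [
        f"{c['name']}={c['value']}"
        for c in cookies
        # Playwright stores domains like ".example.com"; match suffix
        if domain.endswith(c.get("domain", "").lstrip("."))
    ]
    return "; ".join(pairs)

# Usage sketch: pass the header to httpx/requests with the same proxy
# and User-Agent used by the browser:
# headers = {
#     "Cookie": playwright_cookies_to_header(cookies, "example.com"),
#     "User-Agent": same_user_agent_as_browser,
# }
```

This hybrid pattern (browser for the challenge, plain client for the bulk requests) is dramatically cheaper in bandwidth and proxy cost than keeping a browser open for every page.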
Cost Analysis: Real Numbers for Real Projects
Let's compare costs across actual scraping scenarios. These are based on 2026 pricing from major providers.
Scenario 1: E-Commerce Price Monitoring
Task: Monitor 10,000 product pages daily, each ~500KB average response. Data volume: ~5GB per day, ~150GB per month.
| Proxy Type | Monthly Cost | Notes |
|---|---|---|
| Datacenter | $75-150 | $0.50-1/GB; works if no bot detection |
| Residential (ThorData) | $750-1,200 | $5-8/GB; needed for Amazon, Walmart |
| Residential (Bright Data) | $1,260-2,100 | $8.40-14/GB; premium network, premium price |
| Mobile | $3,000-7,500 | $20-50/GB; overkill for most e-commerce |
Recommendation: Start with datacenter. Escalate to residential (via ThorData) only for sites that block datacenter IPs.
Scenario 2: Social Media Data Collection
Task: Collect 50,000 public profiles per month, ~200KB per profile. Data volume: ~10GB per month.
| Proxy Type | Monthly Cost | Notes |
|---|---|---|
| Datacenter | Not viable | Social platforms block datacenter IPs |
| Residential | $50-100 | $5-10/GB; standard approach |
| Mobile | $200-500 | $20-50/GB; best for Instagram, TikTok |
Recommendation: Residential for LinkedIn and Twitter. Mobile for Instagram and TikTok if residential gets blocked frequently.
Scenario 3: SERP Tracking (10,000 Keywords)
Task: Track rankings for 10,000 keywords daily across Google, Bing. Data volume: ~3GB per day, ~90GB per month.
| Proxy Type | Monthly Cost | Notes |
|---|---|---|
| Datacenter | $45-90 | Works for Bing, partially for Google |
| Residential | $450-720 | Standard for Google SERP scraping |
| ISP | $200-500 | Good middle ground for moderate volume |
Recommendation: Datacenter for Bing, residential for Google. ISP proxies work for Google at lower volumes.
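The scenario math above is simple enough to script before you commit to a plan. A minimal estimator; the per-GB price ranges are the ones quoted in this guide, and decimal units (1 GB = 1,000,000 KB) are assumed:

```python
# Per-GB price ranges (USD) as quoted earlier in this guide
PRICE_PER_GB = {
    "datacenter": (0.50, 1.0),
    "residential": (5.0, 8.0),
    "mobile": (20.0, 50.0),
}

def monthly_gb(pages_per_day: int, kb_per_page: float, days: int = 30) -> float:
    """Traffic volume in GB/month (decimal: 1 GB = 1,000,000 KB)."""
    return pages_per_day * kb_per_page * days / 1_000_000

def monthly_cost(gb: float, proxy_type: str) -> tuple[float, float]:
    """(low, high) estimated monthly cost in USD for a proxy type."""
    low, high = PRICE_PER_GB[proxy_type]
    return (gb * low, gb * high)

# Scenario 1: 10,000 product pages/day at ~500KB each
gb = monthly_gb(10_000, 500)              # 150.0 GB/month
lo, hi = monthly_cost(gb, "residential")  # (750.0, 1200.0)
```

Plug in your own page counts and response sizes; the per-GB prices vary by provider and volume tier, so treat the output as a planning estimate, not a quote.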
Anti-Detection Best Practices Beyond Proxies
Proxies are necessary but not sufficient. Modern bot detection looks at many signals beyond your IP address.
TLS Fingerprinting (JA3/JA4)
Every HTTPS client has a unique TLS fingerprint based on how it negotiates the connection. Python's requests and httpx libraries have fingerprints that differ from real browsers.
# Use curl_cffi for browser-like TLS fingerprints
from curl_cffi import requests as cffi_requests
# Impersonate Chrome's TLS fingerprint
resp = cffi_requests.get(
"https://example.com",
impersonate="chrome131",
proxies={"https": "http://user:[email protected]:9000"},
)
print(resp.status_code)
HTTP/2 Fingerprinting
Modern sites check HTTP/2 settings (SETTINGS frame, WINDOW_UPDATE, PRIORITY frames). Standard Python clients send different HTTP/2 parameters than browsers.
# httpx with HTTP/2 support (requires: pip install "httpx[http2]")
import httpx
client = httpx.Client(
    http2=True,  # Enable HTTP/2
    proxy="http://user:[email protected]:9000",
)
# Note: httpx's HTTP/2 SETTINGS still differ from Chrome's, so this
# alone won't defeat HTTP/2 fingerprinting; curl_cffi's impersonation
# covers the HTTP/2 layer as well as TLS.
Header Order Fingerprinting
Browsers send headers in a specific, consistent order. Python libraries often send them in a different order. Some detection systems check this.
# Correct header order for Chrome. A plain dict also preserves insertion
# order in Python 3.7+; OrderedDict just makes the intent explicit.
from collections import OrderedDict
headers = OrderedDict([
("Host", "example.com"),
("Connection", "keep-alive"),
("Cache-Control", "max-age=0"),
("sec-ch-ua", '"Chromium";v="131", "Not_A Brand";v="24"'),
("sec-ch-ua-mobile", "?0"),
("sec-ch-ua-platform", '"macOS"'),
("Upgrade-Insecure-Requests", "1"),
("User-Agent", "Mozilla/5.0 ..."),
("Accept", "text/html,application/xhtml+xml,..."),
("Sec-Fetch-Site", "none"),
("Sec-Fetch-Mode", "navigate"),
("Sec-Fetch-User", "?1"),
("Sec-Fetch-Dest", "document"),
("Accept-Encoding", "gzip, deflate, br"),
("Accept-Language", "en-US,en;q=0.9"),
])
Real-World Use Cases
E-Commerce Price Intelligence
import httpx
import asyncio
import json
from datetime import datetime
async def monitor_competitor_prices(
products: list[dict],
proxy_url: str,
) -> list[dict]:
"""
Monitor competitor pricing across e-commerce sites.
Uses residential proxies for protected sites.
"""
results = []
async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
for product in products:
try:
resp = await client.get(product["url"], headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36",
"Accept-Language": "en-US,en;q=0.9",
})
if resp.status_code == 200:
# Parse price from response (site-specific logic)
results.append({
"product_id": product["id"],
"url": product["url"],
"status": "success",
"timestamp": datetime.utcnow().isoformat(),
"html_size": len(resp.content),
})
else:
results.append({
"product_id": product["id"],
"status": "blocked",
"http_code": resp.status_code,
})
await asyncio.sleep(2)
except Exception as e:
results.append({
"product_id": product["id"],
"status": "error",
"error": str(e),
})
return results
Academic Research: Collecting Public Data
import httpx
import asyncio
import csv

async def collect_research_data(
    search_queries: list[str],
    proxy_url: str,
    output_file: str,
) -> int:
    """
    Collect public research data from academic sources and save it as CSV.
    Uses datacenter proxies (academic sites rarely block).
    """
    rows = []
    async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
        for query in search_queries:
            try:
                resp = await client.get(
                    "https://api.openalex.org/works",
                    params={
                        "search": query,
                        "per_page": 50,
                        "sort": "relevance_score:desc",
                    },
                )
                if resp.status_code == 200:
                    for work in resp.json().get("results", []):
                        rows.append([query, work.get("id"), work.get("display_name")])
                await asyncio.sleep(1)
            except Exception:
                continue
    # Persist the collected records to the requested CSV file
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "work_id", "title"])
        writer.writerows(rows)
    return len(rows)
Brand Monitoring and Reputation Tracking
import httpx
import asyncio
from bs4 import BeautifulSoup
async def monitor_brand_mentions(
brand_name: str,
review_sites: list[str],
proxy_url: str,
) -> list[dict]:
"""
Monitor brand mentions across review sites.
Uses residential proxies for sites with bot detection.
"""
mentions = []
async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
for site_url in review_sites:
try:
resp = await client.get(
site_url,
params={"q": brand_name},
headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36",
},
)
if resp.status_code == 200:
soup = BeautifulSoup(resp.text, "html.parser")
# Extract reviews (site-specific selectors)
review_elements = soup.select(".review, .comment, .testimonial")
for elem in review_elements:
text = elem.get_text(strip=True)
if brand_name.lower() in text.lower():
mentions.append({
"source": site_url,
"text": text[:500],
"sentiment": "pending_analysis",
})
await asyncio.sleep(3)
except Exception:
continue
return mentions
The 5-Step Decision Framework
Before you buy proxies, work through this framework:
Step 1: Does your target use bot detection?
- No → Use datacenter proxies. Save your money.
- Yes → Continue to step 2.
Step 2: What detection system?
- Basic rate limiting only → Datacenter proxies with rotation are fine.
- Cloudflare, Akamai, DataDome, PerimeterX → You need residential or better. Continue to step 3.
Step 3: Does your workflow involve sessions?
- No (independent pages) → Use rotating residential proxies.
- Yes (login, pagination, multi-step) → Use sticky residential sessions.
Step 4: What's your budget vs. volume?
- High volume, tight budget → ThorData residential for competitive per-GB rates.
- Low volume, needs reliability → ISP proxies for consistent performance.
- Any budget, maximum success rate needed → Mobile proxies.
Step 5: How critical is the data?
- Nice-to-have → Start cheap, accept some failures.
- Revenue-critical → Invest in reliable proxies and build automatic escalation.
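The escalation mentioned in Step 5 can be sketched in a few lines. This is a minimal version under stated assumptions: `fetch_via` is a hypothetical callable you supply (for example, wrapping httpx with the right proxy URL per tier), and the set of "blocked" status codes is illustrative — tune it to what your targets actually return.

```python
# Status codes treated as "blocked, escalate" (illustrative set).
BLOCK_STATUSES = {403, 407, 429, 503}

def fetch_with_escalation(url, fetch_via,
                          tiers=("datacenter", "residential", "mobile")):
    """Try the cheapest proxy tier first, climbing only when blocked.

    fetch_via(url, tier) -> (status_code, body) is supplied by the caller.
    Returns (tier, status_code, body) for the first non-blocked response,
    or the last attempt if every tier was blocked.
    """
    last = None
    for tier in tiers:
        status, body = fetch_via(url, tier)
        last = (tier, status, body)
        if status not in BLOCK_STATUSES:
            return last  # not blocked -- stop escalating
    return last  # blocked at every tier; caller logs and moves on
```

Because most requests succeed at the datacenter tier, the expensive tiers only see the traffic that actually needs them — which is the whole point of the framework.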
Common Mistakes and How to Avoid Them
Mistake 1: Using Residential Proxies for Everything
Problem: Burning money scraping static blogs through $10/GB proxies. Fix: Start with datacenter. Escalate only when blocked.
Mistake 2: Not Rotating User-Agents
Problem: Same User-Agent across thousands of requests, even with different IPs. Fix: Rotate User-Agents that match real browser distributions.
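A rotation helper is a few lines of Python. The User-Agent strings and weights below are illustrative placeholders — keep them updated against real browser market-share data rather than hardcoding them forever.

```python
import random

# (User-Agent string, approximate share) -- illustrative values only.
USER_AGENTS = [
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36", 0.65),
    ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36", 0.20),
    ("Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 "
     "Firefox/133.0", 0.15),
]

def pick_user_agent() -> str:
    """Weighted random choice so the mix mirrors a realistic distribution."""
    agents, weights = zip(*USER_AGENTS)
    return random.choices(agents, weights=weights, k=1)[0]
```

Weighted selection matters: a pool where rare browsers appear as often as Chrome is itself a statistical fingerprint.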
Mistake 3: Ignoring Request Timing
Problem: 100 requests per second looks robotic regardless of proxy type. Fix: Add random delays (2-10 seconds) that mimic human browsing.
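One way to make the delays less uniform is to add an occasional longer "reading" pause on top of the 2-10 second base. The 5% probability and the pause ranges here are illustrative, not tuned values.

```python
import random

def human_delay() -> float:
    """Seconds to sleep between requests, loosely mimicking a human."""
    # Occasionally a visitor stops to actually read the page.
    if random.random() < 0.05:  # 5% chance of a long pause (illustrative)
        return random.uniform(15.0, 45.0)
    return random.uniform(2.0, 10.0)
```

Perfectly uniform randomness is still a pattern; mixing distributions makes the timing histogram look less machine-generated.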
Mistake 4: Not Monitoring Proxy Health
Problem: Half your proxies are returning errors, wasting time and credits. Fix: Track success rates per proxy and auto-remove underperformers.
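A minimal health tracker can live in a few lines. This sketch assumes per-proxy bookkeeping in memory; the 20-request minimum and 60% success threshold are illustrative defaults to tune against your own traffic.

```python
from collections import defaultdict

class ProxyHealth:
    """Track per-proxy success rates and flag underperformers."""

    def __init__(self, min_requests: int = 20, min_success_rate: float = 0.6):
        self.stats = defaultdict(lambda: {"ok": 0, "total": 0})
        self.min_requests = min_requests
        self.min_success_rate = min_success_rate

    def record(self, proxy: str, success: bool) -> None:
        self.stats[proxy]["total"] += 1
        if success:
            self.stats[proxy]["ok"] += 1

    def healthy(self, proxy: str) -> bool:
        s = self.stats[proxy]
        if s["total"] < self.min_requests:
            return True  # not enough data yet -- give it the benefit of the doubt
        return s["ok"] / s["total"] >= self.min_success_rate
```

Call `record()` after every request and check `healthy()` before reuse; proxies that fall below the threshold drop out of rotation automatically.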
Mistake 5: Sticky Sessions That Are Too Long
Problem: Holding the same IP for hours when you only need 5 minutes. Fix: Match session duration to actual workflow length.
Mistake 6: Not Testing Proxies Before Large Runs
Problem: Running 10,000 requests through untested proxies, discovering issues after burning credits. Fix: Always test with 10-20 requests first, verify success rate, then scale up.
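The canary run can be a tiny gate in front of the full job. This sketch assumes you can express a single attempt as a `fetch(url) -> bool` callable; the sample size and 80% threshold are illustrative defaults.

```python
def canary_check(urls, fetch, sample_size: int = 15,
                 min_success: float = 0.8) -> bool:
    """Probe a small sample before committing credits to the full run.

    fetch(url) -> bool is supplied by the caller (True on success).
    Returns True only if the sample's success rate clears the threshold.
    """
    sample = urls[:sample_size]
    if not sample:
        return False  # nothing to test -- refuse to green-light the run
    ok = sum(1 for url in sample if fetch(url))
    return ok / len(sample) >= min_success
```

If the canary fails, you have burned a dozen requests instead of ten thousand, and you know to check proxies, headers, or target changes before scaling up.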
Provider Comparison: Key Factors
When evaluating proxy providers, focus on these concrete factors:
| Factor | What to Check | Why It Matters |
|---|---|---|
| IP pool size | Total IPs and geo-distribution | Larger pools = less chance of hitting reused/flagged IPs |
| Session support | Rotating + sticky options | Some workflows require persistent sessions |
| Authentication | IP whitelisting + user:pass | IP whitelisting is more secure but less flexible |
| Bandwidth limits | Per-GB pricing vs unlimited | Predictable costs vs. risk of surprise bills |
| Geo-targeting | Country, state, city, ASN | Some targets serve different content by location |
| Protocol support | HTTP, HTTPS, SOCKS5 | SOCKS5 handles non-HTTP traffic and is required by some tools |
| Concurrent connections | Max simultaneous connections | Bottleneck for high-throughput scraping |
| Response time | Average latency by proxy type | Slower proxies = longer scraping runs |
For a solid balance of pool size, pricing, and features, ThorData covers the essentials — large residential pool, both session types, competitive per-GB pricing, and good geo-targeting options.
The Bottom Line
Don't default to residential proxies because a proxy provider told you to. Start with datacenter proxies. They're 5-20x cheaper and work for the majority of scraping tasks. Switch to residential only when you hit actual blocks — Cloudflare challenges, Amazon CAPTCHAs, social media rate limits.
When you do need residential proxies, pick a provider with good IP diversity, sticky session support, and competitive pricing. Build automatic escalation into your code so you use the cheapest proxy that works for each target.
The scraping landscape keeps evolving. Detection systems get smarter, but proxy technology evolves alongside them. The developers who do best are the ones who match the right tool to each specific problem — not the ones who throw expensive proxies at everything hoping it works.
Match the tool to the problem. Your scraping budget will thank you.