How to Handle Anti-Bot Protection When Scraping in 2026 (Cloudflare, DataDome, Imperva)
If you've scraped anything meaningful in the last year, you've hit a wall. Not a 403 — those are easy. I mean the kind where your requests return 200 OK with a JavaScript challenge page, or where everything works for 50 requests and then silently starts returning stale data.
Anti-bot systems got significantly better in 2025-2026. Here's what each one actually does and what works against them.
The Four Tiers of Anti-Bot in 2026
Cloudflare fronts more of the web than any other vendor; by most counts it accounts for over half of all websites that run any bot protection at all. Their stack runs JS challenges (Turnstile), browser fingerprinting via canvas and WebGL hashing, and behavioral analysis on mouse movement and scroll patterns. The free tier is easy to bypass. Business and Enterprise tiers with Bot Management enabled are genuinely hard.
DataDome is the one that makes scraper developers swear. Used by major retailers (Foot Locker, TripAdvisor, dozens of media sites), it runs ML-based scoring on every single request. It evaluates your TLS fingerprint, headers, timing, and behavioral signals in combination — and assigns a bot probability score. If you fix one signal but not the others, you still get blocked.
Imperva (Incapsula) focuses heavily on TLS fingerprinting and cookie-based challenges. It injects JavaScript that sets specific cookies, then validates them on subsequent requests. Less sophisticated than DataDome's ML approach, but very effective against scripts.
Akamai Bot Manager uses token-based validation with heavily obfuscated JavaScript that generates sensor data. The JS changes frequently, making static analysis a losing game.
What Gets You Blocked (Specifically)
Stop guessing. These are the actual detection vectors, ranked by how often they catch scrapers:
1. TLS Fingerprint (JA3/JA4 hash)
Python's requests library uses urllib3's TLS stack, which produces a JA3 fingerprint that looks nothing like any browser. Every anti-bot system maintains a database of known library fingerprints. This is the #1 reason requests.get() fails on protected sites in 2026.
```
# Python requests JA3 (easily identified as non-browser)
769,47-53-5-10-49161-49162-49171-49172-50-56-19-4,0-10-11,23-24-25,0

# Chrome 124 JA3 (what you want to look like)
771,4865-4866-4867-49195-49199-49196-49200-52393-52392...
```
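What detection databases actually store is not this raw string but its JA3 hash, which the JA3 spec defines as the MD5 digest of the comma-separated fingerprint. A minimal sketch:

```python
import hashlib

def ja3_hash(ja3_string: str) -> str:
    """JA3 hash = MD5 of the raw comma-separated fingerprint string."""
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()

# The same client TLS stack always produces the same 32-char digest,
# which is what makes the known-library lookup table cheap to build.
requests_ja3 = "769,47-53-5-10-49161-49162-49171-49172-50-56-19-4,0-10-11,23-24-25,0"
print(ja3_hash(requests_ja3))
```

Two clients with identical TLS stacks hash identically, so one database entry covers every deployment of a given HTTP library.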
2. Missing or Wrong Browser Headers
Modern browsers send 15+ headers per request. Scrapers typically send 3-4. The tells:
- Missing Sec-Fetch-Mode, Sec-Fetch-Site, Sec-Fetch-Dest (Chrome sends these on every request)
- Missing or wrong sec-ch-ua (Client Hints — Chrome's replacement for User-Agent)
- Accept-Language missing or set to something generic like en
- Accept-Encoding not including br (Brotli — all modern browsers support it)
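Those tells are easy to audit before you send a single request. A quick self-check sketch — `header_tells` is a hypothetical helper built from the list above, not a library function:

```python
def header_tells(headers: dict) -> list[str]:
    """Return the header-based detection tells present in an outgoing header set."""
    h = {k.lower(): v for k, v in headers.items()}
    tells = []
    # Chrome sends these on every request; their absence is a giveaway
    for required in ("sec-fetch-mode", "sec-fetch-site", "sec-fetch-dest", "sec-ch-ua"):
        if required not in h:
            tells.append(f"missing {required}")
    if h.get("accept-language", "") in ("", "en"):
        tells.append("generic or missing Accept-Language")
    if "br" not in h.get("accept-encoding", ""):
        tells.append("Accept-Encoding lacks br (Brotli)")
    return tells
```

Run it against your own header dict: a bare `requests`-style set trips every check, while a full Chrome-like set comes back clean.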
3. Headless Browser Detection
Even Playwright and Puppeteer get caught:
- navigator.webdriver returns true by default
- Missing browser plugins (navigator.plugins is empty)
- Chrome automation flags in window.chrome.runtime
- Canvas fingerprint returns a unique hash for headless vs headed rendering
- WebGL renderer string says "SwiftShader" instead of an actual GPU
4. Behavioral Signals
- Request timing too regular (exactly 2.0s between requests = bot)
- No mouse movement or scroll events before clicking
- Loading page resources in wrong order (CSS/JS before HTML is parsed)
- Fetching robots.txt before scraping (ironic, but it flags you)
5. IP Reputation
- Datacenter ASNs (AWS, GCP, DigitalOcean) are pre-flagged
- Known proxy/VPN ranges get higher bot scores
- Residential IPs get a trust bonus but aren't immune
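The timing-regularity signal from the behavioral list is trivial for a server to compute, which is why fixed sleeps get caught. A toy version using the coefficient of variation of inter-request gaps (the threshold is illustrative):

```python
import statistics

def looks_scripted(gaps: list[float], cv_threshold: float = 0.1) -> bool:
    """Flag a session whose inter-request gaps are suspiciously uniform.

    Humans produce high-variance gaps; a loop around time.sleep(2) produces
    near-zero variance. cv = stdev / mean; the 0.1 cutoff is illustrative.
    """
    if len(gaps) < 3:
        return False  # not enough evidence yet
    mean = statistics.mean(gaps)
    if mean == 0:
        return True
    return statistics.stdev(gaps) / mean < cv_threshold
```

A `[2.0, 2.0, 2.0, ...]` trace flags immediately; a jittered human-like trace does not. This is also why `random.uniform` delays (high variance by construction) beat constant sleeps.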
Understanding Each Defense Layer in Depth
How Cloudflare Bot Management Works
Cloudflare's Bot Management assigns every visitor a bot score from 1 to 99. A score of 1 means almost certainly a bot; 99 means almost certainly human. The score is computed from:
- JavaScript rendering — can the client execute JavaScript and return expected results?
- Browser API availability — does the client expose the same APIs a real browser would?
- Canvas fingerprint — is the canvas rendering hash consistent with the claimed browser and OS?
- Network timing — does the TCP handshake timing, TTFB, and subsequent request timing pattern match expected human behavior?
- IP reputation — is the IP known to belong to a hosting provider, VPN, or previously flagged for bot activity?
- Cookie validation — does the client properly store and return the session cookies Cloudflare sets?
The key insight is that Cloudflare doesn't block you when any single signal is off. It accumulates evidence across signals and acts once the combined assessment crosses its blocking threshold. This is why partial fixes don't work: fix your TLS fingerprint but keep a datacenter IP, and you're still blocked.
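That evidence-accumulation model can be sketched as a weighted sum against a threshold. The weights and signal names below are illustrative, not Cloudflare's actual model:

```python
# Illustrative weights -- not Cloudflare's real scoring model.
SIGNAL_WEIGHTS = {
    "library_tls_fingerprint": 40,
    "datacenter_ip": 25,
    "no_js_execution": 30,
    "missing_sec_fetch_headers": 10,
    "cookie_not_returned": 15,
}
BLOCK_THRESHOLD = 50

def bot_score(signals: set[str]) -> int:
    """Accumulate evidence from whichever signals fired."""
    return sum(SIGNAL_WEIGHTS.get(s, 0) for s in signals)

def is_blocked(signals: set[str]) -> bool:
    return bot_score(signals) >= BLOCK_THRESHOLD

# Fixing TLS but keeping a datacenter IP and skipping JS still blocks:
print(is_blocked({"datacenter_ip", "no_js_execution"}))  # True
print(is_blocked({"datacenter_ip"}))                      # False
```

The point of the sketch: no single signal reaches the threshold alone, so every partial fix just lowers the score without clearing it.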
How DataDome's ML Model Works
DataDome is fundamentally different from Cloudflare. Rather than rule-based scoring, it runs a trained ML classifier that evaluates your request against behavioral baselines built from millions of legitimate visitors to that site specifically.
DataDome captures:
- The full sequence of your requests (not just each one individually)
- Time between requests and variance in timing
- Which pages you visit and in what order
- Which resources you request (do you load images? CSS? Tracking pixels?)
- Mouse trajectory data if you execute JavaScript
- The correlation between all of the above
This is why DataDome is so hard. Even if you fool it on request 1, your behavioral pattern across requests 2-50 builds an increasingly clear signal. A human visiting a retail site browses category pages, reads descriptions, goes back, compares items. A scraper hits structured endpoints in order and never loads a product image.
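One of those behavioral tells — whether a client fetches subresources at all — is simple to illustrate. A toy classifier over the URLs a session requested (the extension list and the 30% ratio are illustrative, not DataDome's real features):

```python
SUBRESOURCE_EXTS = (".css", ".js", ".png", ".jpg", ".webp", ".gif", ".svg", ".woff2")

def fetches_like_a_browser(requested_urls: list[str]) -> bool:
    """A real browser loading HTML pages also pulls CSS, JS, and images.

    A scraper that only ever hits document or JSON endpoints stands out,
    because its subresource ratio sits near zero.
    """
    subresources = sum(
        1 for u in requested_urls
        if u.split("?")[0].lower().endswith(SUBRESOURCE_EXTS)
    )
    return subresources / max(len(requested_urls), 1) > 0.3  # illustrative ratio
```

A session that requests only `/api/products?page=N` endpoints fails this check on request two, no matter how good its TLS fingerprint is.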
Imperva's Technical Stack
Imperva (now part of Thales) uses a different approach: rather than ML scoring, it relies on a JavaScript challenge-response scheme:
- It serves a JavaScript payload that must execute and compute a specific value based on browser environment data
- That computed value is embedded in a cookie
- Every subsequent request is validated against that cookie
- The JavaScript to compute the value changes regularly
The weakness: the cookie value computation is deterministic. If you can execute the JavaScript in a real browser context once, you can replay those cookies for some time. But Imperva also validates the browser fingerprint matches the original cookie context, so replaying across different IPs or fingerprints fails.
The practical approach: use curl_cffi for TLS impersonation combined with Playwright to execute the initial JavaScript challenge, then extract and replay the cookies for subsequent requests.
What Actually Works (Per Tier)
Tier 1: Sites With Basic Protection (IP blocking, simple rate limits)
Rotate IPs, add proper browser headers, add random delays. This still works for maybe 40% of protected sites.
```python
import httpx
import random
import time

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Dest": "document",
    "sec-ch-ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
}

def scrape(url: str, proxy: str | None = None) -> httpx.Response:
    """Basic scrape with anti-detection headers and optional proxy."""
    time.sleep(random.uniform(1.5, 4.0))
    # httpx >= 0.26 takes a single `proxy` argument (the old `proxies` dict is deprecated)
    return httpx.get(url, headers=HEADERS, proxy=proxy, follow_redirects=True, timeout=20)

def rotate_user_agent() -> str:
    """Return a random realistic user agent."""
    agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0",
    ]
    return random.choice(agents)
```
Tier 2: Cloudflare (JS Challenges + Fingerprinting)
For Cloudflare Business/Enterprise, you need a real browser. playwright-stealth or undetected-chromedriver patches the obvious headless tells:
```python
import asyncio
import random
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scrape_cloudflare_protected(url: str, proxy_url: str | None = None) -> str:
    """Scrape a Cloudflare-protected page using Playwright with stealth."""
    async with async_playwright() as p:
        launch_kwargs = {
            "headless": True,
            "args": [
                "--no-sandbox",
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--window-size=1440,900",
            ],
        }
        if proxy_url:
            launch_kwargs["proxy"] = {"server": proxy_url}
        browser = await p.chromium.launch(**launch_kwargs)
        context = await browser.new_context(
            viewport={"width": 1440, "height": 900},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = await context.new_page()
        await stealth_async(page)
        # Simulate human-like behavior before navigation
        await page.mouse.move(
            random.randint(100, 300),
            random.randint(100, 300),
        )
        await page.goto(url, wait_until="networkidle", timeout=60000)
        # Cloudflare challenge usually resolves within 5s
        await page.wait_for_timeout(random.randint(4000, 8000))
        # Check if still on challenge page
        title = await page.title()
        if "Just a moment" in title or "Attention Required" in title:
            # Wait longer for challenge to complete
            await page.wait_for_timeout(10000)
        content = await page.content()
        await browser.close()
        return content

# Usage with ThorData residential proxy
result = asyncio.run(scrape_cloudflare_protected(
    "https://protected-site.com/data",
    proxy_url="http://user:[email protected]:9000",
))
```
Key: make sure your launch actually gets Chrome's new headless mode (the default in recent Chrome and Playwright releases); the old headless implementation has too many detectable differences.
Tier 3: DataDome (ML-Based Scoring)
DataDome is the hardest to beat consistently. Their system correlates signals across requests, so fixing one thing doesn't help if three others are wrong.
```python
import asyncio
import random
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scrape_datadome_protected(
    urls: list[str],
    proxy_url: str,
    delay_range: tuple[float, float] = (15, 35),
) -> list[dict]:
    """
    Scrape DataDome-protected pages with realistic behavior simulation.

    Key requirements:
    - Residential proxy (datacenter IPs fail immediately)
    - Slow, randomized request rates (15-35 seconds between pages)
    - Realistic session: homepage first, then navigate
    - Full JavaScript execution
    """
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": proxy_url},
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = await browser.new_context(
            viewport={
                "width": random.choice([1366, 1440, 1920]),
                "height": random.choice([768, 900, 1080]),
            },
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
        )
        page = await context.new_page()
        await stealth_async(page)
        # CRITICAL: Always load the homepage first to establish session
        base_domain = "/".join(urls[0].split("/")[:3])
        await page.goto(base_domain, wait_until="networkidle")
        await asyncio.sleep(random.uniform(3, 6))
        # Simulate reading the homepage
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight * 0.3)")
        await asyncio.sleep(random.uniform(1, 3))
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight * 0.6)")
        await asyncio.sleep(random.uniform(1, 2))
        for url in urls:
            try:
                await page.goto(url, wait_until="networkidle", timeout=45000)
                await asyncio.sleep(random.uniform(*delay_range))
                # Simulate reading the page
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight * 0.5)")
                await asyncio.sleep(random.uniform(2, 5))
                content = await page.content()
                results.append({"url": url, "content": content, "status": "ok"})
            except Exception as e:
                results.append({"url": url, "content": None, "status": str(e)})
        await browser.close()
    return results
```
What works:
- Full browser automation (Playwright, not requests)
- Residential proxies are mandatory — if you need residential proxies for this kind of work, ThorData has a solid residential pool with city-level targeting
- Slow request rates (15-30 seconds between pages)
- Realistic session behavior: load homepage first, navigate via links, don't jump to deep URLs
- Check if the site has an official API before fighting DataDome — many do, and it's cheaper than the proxy bill
Tier 4: Imperva — TLS Impersonation
Imperva's main detection vector is TLS fingerprinting. You can impersonate a browser's TLS stack using curl_cffi, the Python bindings for the curl-impersonate project:
```python
from curl_cffi import requests as curl_requests

def scrape_imperva_protected(url: str, proxy: str | None = None) -> curl_requests.Response:
    """
    Use curl_cffi to impersonate Chrome's exact TLS fingerprint.
    This is the most effective approach for Imperva-protected sites.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-Dest": "document",
    }
    session = curl_requests.Session()
    # Impersonates Chrome's exact TLS fingerprint
    return session.get(
        url,
        impersonate="chrome124",
        headers=headers,
        proxies={"https": proxy} if proxy else None,
        timeout=30,
    )

# Combined approach: curl_cffi for TLS + extract cookies + use for subsequent requests
def get_imperva_session_cookies(site_url: str, proxy: str | None = None) -> dict:
    """Solve the Imperva cookie challenge and return valid session cookies."""
    response = scrape_imperva_protected(site_url, proxy=proxy)
    cookies = dict(response.cookies)
    print(f"Got {len(cookies)} cookies from Imperva challenge")
    return cookies

def scrape_with_imperva_cookies(target_url: str, cookies: dict, proxy: str | None = None) -> str:
    """Use validated Imperva session cookies for subsequent requests."""
    session = curl_requests.Session()
    for name, value in cookies.items():
        session.cookies.set(name, value)
    response = session.get(
        target_url,
        impersonate="chrome124",
        proxies={"https": proxy} if proxy else None,
        timeout=30,
    )
    return response.text
```
curl_cffi compiles against a modified libcurl that reproduces Chrome/Firefox TLS handshakes byte-for-byte. This is the single most effective library for bypassing TLS-based detection in 2026.
Tier 5: Akamai — Sensor Data Bypass
Akamai is the hardest tier. Their obfuscated JavaScript generates "sensor data" that must accompany every request. The JS changes weekly.
```python
from playwright.async_api import async_playwright

async def extract_akamai_sensor_data(target_url: str, proxy: str | None = None) -> dict:
    """
    Use a real browser to execute Akamai's JavaScript and capture
    the sensor data it generates. Then use that data for subsequent
    HTTP requests, avoiding the overhead of full browser automation.
    """
    sensor_data = {}
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": proxy} if proxy else None,
            args=["--disable-blink-features=AutomationControlled"],
        )
        page = await browser.new_page()

        # Intercept the Akamai sensor request to capture the payload
        def handle_request(request):
            if "/_ctr/" in request.url or "/akam/" in request.url:
                body = request.post_data
                if body:
                    sensor_data["payload"] = body
                    sensor_data["headers"] = dict(request.headers)

        page.on("request", handle_request)
        await page.goto(target_url, wait_until="networkidle")
        await page.wait_for_timeout(5000)
        # Extract cookies after JS execution
        cookies = await page.context.cookies()
        sensor_data["cookies"] = {c["name"]: c["value"] for c in cookies}
        await browser.close()
    return sensor_data
```
Anti-Detection Techniques That Work Across All Systems
Browser Fingerprint Hardening
```python
from playwright.async_api import async_playwright

STEALTH_SCRIPT = """
// Override navigator.webdriver
Object.defineProperty(navigator, 'webdriver', {
  get: () => undefined,
  configurable: true
});
// Restore plugins array
Object.defineProperty(navigator, 'plugins', {
  get: () => [
    { name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer' },
    { name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai' },
    { name: 'Native Client', filename: 'internal-nacl-plugin' },
  ],
});
// Fix mimeTypes
Object.defineProperty(navigator, 'mimeTypes', {
  get: () => [
    { type: 'application/pdf', suffixes: 'pdf', description: 'Portable Document Format' },
  ],
});
// Restore chrome runtime
window.chrome = {
  runtime: {},
  loadTimes: function() {},
  csi: function() {},
  app: {},
};
// Fix permissions
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
  parameters.name === 'notifications'
    ? Promise.resolve({ state: Notification.permission })
    : originalQuery(parameters)
);
// Prevent headless detection via iframe
Object.defineProperty(HTMLIFrameElement.prototype, 'contentWindow', {
  get: function() {
    return window;
  }
});
"""

async def create_hardened_browser(proxy_url: str | None = None):
    """Create a Playwright browser with full fingerprint hardening."""
    p = await async_playwright().start()
    launch_args = [
        "--no-sandbox",
        "--disable-blink-features=AutomationControlled",
        "--disable-dev-shm-usage",
        "--disable-infobars",
        "--disable-extensions",
        "--window-size=1440,900",
        "--lang=en-US",
    ]
    browser = await p.chromium.launch(
        headless=True,
        args=launch_args,
        proxy={"server": proxy_url} if proxy_url else None,
    )
    context = await browser.new_context(
        viewport={"width": 1440, "height": 900},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
        permissions=["geolocation"],
        color_scheme="light",
        device_scale_factor=1.0,
        java_script_enabled=True,
        bypass_csp=False,
    )
    # Inject stealth scripts on every page
    await context.add_init_script(STEALTH_SCRIPT)
    return p, browser, context
```
Proxy Rotation Strategy
Not all proxy types are equal. Here's the hierarchy for anti-bot systems:
| Proxy Type | Cloudflare | DataDome | Imperva | Akamai | Cost |
|---|---|---|---|---|---|
| Datacenter | Blocked | Blocked | Often blocked | Blocked | $ |
| ISP/Static Residential | Usually passes | Usually passes | Passes | Usually passes | $$ |
| Rotating Residential | Passes | Passes with slow rates | Passes | Passes | $$$ |
| Mobile Proxies | Best | Best | Best | Best | $$$$ |
For most production scraping work, rotating residential proxies hit the right balance of effectiveness and cost. ThorData's residential network covers 190+ countries with city-level targeting and automatic rotation — you configure the target country/city and each request gets a fresh IP from a real consumer connection.
```python
import httpx

THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000

def get_proxy(country: str = "US", city: str | None = None) -> str:
    """Build a ThorData proxy URL with geo-targeting."""
    user_str = f"{THORDATA_USER}_country-{country}"
    if city:
        user_str += f"_city-{city}"
    return f"http://{user_str}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"

def make_client_with_proxy(country: str = "US") -> httpx.Client:
    """Create an httpx client routing through a residential proxy."""
    # httpx >= 0.26 takes a single `proxy` argument (the old `proxies` dict is deprecated)
    return httpx.Client(
        proxy=get_proxy(country),
        timeout=30,
        follow_redirects=True,
        http2=True,
    )
```
Request Timing and Pattern Randomization
Behavioral analysis is increasingly sophisticated. Pure random delays aren't enough — you need timing that matches human reading patterns:
```python
import asyncio
import random
import time

def human_delay(action: str = "read"):
    """
    Apply realistic delays based on the action type.
    Human behavior isn't uniformly random — it follows patterns.
    """
    if action == "read":
        # Reading a page: log-normal distribution, mean ~8 seconds
        delay = random.lognormvariate(2.0, 0.6)
        delay = max(3.0, min(45.0, delay))  # clamp 3-45s
    elif action == "click":
        # Click reactions: fast, 0.5-2 seconds
        delay = random.uniform(0.4, 2.0)
    elif action == "search":
        # Search and think: medium, 3-12 seconds
        delay = random.uniform(3.0, 12.0)
    elif action == "scroll":
        # Scroll pause: very short, 0.3-1 second
        delay = random.uniform(0.3, 1.0)
    else:
        delay = random.uniform(1.0, 5.0)
    time.sleep(delay)

async def human_scroll(page, target_pct: float = 0.7):
    """Simulate human scrolling — not a single jump to the target."""
    current = 0
    target = target_pct * 100
    while current < target:
        step = random.uniform(5, 15)
        current = min(current + step, target)
        await page.evaluate(f"window.scrollTo(0, document.body.scrollHeight * {current / 100})")
        await asyncio.sleep(random.uniform(0.05, 0.2))
```
Session Warming
Starting a scraping session by going directly to your target data URL is a classic bot tell. Warm up the session first:
```python
import asyncio
import random

async def warm_session(page, base_domain: str):
    """Warm up a browser session to mimic organic navigation."""
    # Load homepage
    await page.goto(base_domain, wait_until="networkidle")
    await asyncio.sleep(random.uniform(3, 7))
    # Simulate reading homepage
    await human_scroll(page, 0.4)
    await asyncio.sleep(random.uniform(2, 4))
    # Click an internal link (not your target yet)
    links = await page.query_selector_all("a[href^='/']")
    if links:
        random_link = random.choice(links[:10])
        href = await random_link.get_attribute("href")
        if href and not any(x in href for x in ["login", "signup", "cart", "checkout"]):
            await page.goto(base_domain + href, wait_until="domcontentloaded")
            await asyncio.sleep(random.uniform(2, 5))
    # Now navigate to your target
    await asyncio.sleep(random.uniform(1, 3))
```
Data Storage for Large-Scale Scraping
When scraping through anti-bot systems, partial failures are common. Design your storage to handle retries and resume without re-scraping:
```python
import sqlite3
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ScrapedPage:
    url: str
    content: str
    status_code: int
    proxy_used: str
    scraped_at: str | None = None
    retries: int = 0
    error: str | None = None

    def __post_init__(self):
        if not self.scraped_at:
            self.scraped_at = datetime.now(timezone.utc).isoformat()

class ScrapingDB:
    def __init__(self, path: str = "scrape_results.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS pages (
                url TEXT PRIMARY KEY,
                content TEXT,
                status_code INTEGER,
                proxy_used TEXT,
                scraped_at TEXT,
                retries INTEGER DEFAULT 0,
                error TEXT
            )
        """)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS queue (
                url TEXT PRIMARY KEY,
                priority INTEGER DEFAULT 0,
                added_at TEXT DEFAULT (datetime('now')),
                attempted_at TEXT,
                status TEXT DEFAULT 'pending'
            )
        """)
        self.conn.commit()

    def add_to_queue(self, urls: list[str], priority: int = 0):
        self.conn.executemany(
            "INSERT OR IGNORE INTO queue (url, priority) VALUES (?, ?)",
            [(url, priority) for url in urls],
        )
        self.conn.commit()

    def get_next_batch(self, batch_size: int = 10) -> list[str]:
        rows = self.conn.execute("""
            SELECT url FROM queue
            WHERE status = 'pending'
            ORDER BY priority DESC, added_at ASC
            LIMIT ?
        """, (batch_size,)).fetchall()
        return [r[0] for r in rows]

    def mark_in_progress(self, url: str):
        self.conn.execute(
            "UPDATE queue SET status='in_progress', attempted_at=datetime('now') WHERE url=?",
            (url,),
        )
        self.conn.commit()

    def save_result(self, page: ScrapedPage):
        self.conn.execute("""
            INSERT OR REPLACE INTO pages
            (url, content, status_code, proxy_used, scraped_at, retries, error)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """, (page.url, page.content, page.status_code, page.proxy_used,
              page.scraped_at, page.retries, page.error))
        self.conn.execute(
            "UPDATE queue SET status=? WHERE url=?",
            ("done" if not page.error else "failed", page.url),
        )
        self.conn.commit()

    def get_failed_urls(self) -> list[str]:
        rows = self.conn.execute(
            "SELECT url FROM queue WHERE status='failed'"
        ).fetchall()
        return [r[0] for r in rows]

    def reset_failed(self):
        """Reset failed items for retry."""
        self.conn.execute("UPDATE queue SET status='pending' WHERE status='failed'")
        self.conn.commit()
```
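A natural companion to `reset_failed()`: when you do retry, back off exponentially with jitter so the retry wave itself doesn't form a detectable pattern. A sketch (the delay parameters are illustrative):

```python
import random

def retry_delay(attempt: int, base: float = 20.0, cap: float = 600.0) -> float:
    """Exponential backoff with full jitter for retry attempt N (1-indexed).

    Full jitter (a delay drawn uniformly from [0, backoff]) avoids the
    synchronized retry spikes that behavioral analysis picks up on.
    """
    backoff = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0, backoff)
```

Sleep for `retry_delay(page.retries + 1)` before re-queuing a failed URL; repeated failures then spread out instead of hammering the site on a fixed schedule.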
The Economics: When to Fight vs. When to Walk Away
Before spending 20 hours reverse-engineering a site's anti-bot system, ask:
1. Does the site have an API? Many sites behind DataDome or Cloudflare have official APIs. LinkedIn has an API. Amazon has the Product Advertising API. Even if they're limited, they might cover your use case.
2. Can you buy the data? Data brokers sell pre-scraped datasets for many common targets. Often cheaper than proxy costs.
3. What's the proxy math? Residential proxies cost $3-12/GB. If you're scraping 10,000 pages at 500KB each, that's ~5GB = $15-60. At scale, this adds up fast. Compare against API costs or data purchases.
4. Is there a less-protected equivalent? Mobile versions of sites (m.example.com), RSS feeds, sitemaps, and cached versions (Google Cache, Wayback Machine) often have weaker or no protection.
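The proxy math in point 3 is worth scripting before you commit to a scrape. A back-of-the-envelope calculator using the per-GB prices quoted above:

```python
def proxy_cost_usd(pages: int, avg_page_kb: float, usd_per_gb: float) -> float:
    """Estimate residential-proxy bandwidth cost for a scrape."""
    gb = pages * avg_page_kb / 1024 / 1024
    return gb * usd_per_gb

# The example from the text: 10,000 pages at 500 KB each, $3-12/GB
low = proxy_cost_usd(10_000, 500, 3)    # roughly 14.3
high = proxy_cost_usd(10_000, 500, 12)  # roughly 57.2
```

That lands at roughly $14-57 for ~4.8 GB, consistent with the estimate above, and scales linearly: ten times the pages means ten times the bill.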
The best scraper engineers I know spend more time finding the path of least resistance than brute-forcing through protection. Every hour fighting an anti-bot system is an hour you could spend building something with the data.
Debugging and Diagnostics
When a scrape fails, diagnosing which layer blocked you saves hours:
```python
import httpx
import json

def diagnose_protection(url: str) -> dict:
    """Diagnose what anti-bot protection a site uses."""
    diagnosis = {"url": url, "protections": []}
    resp = None
    # Test 1: Plain request — see what headers come back
    try:
        resp = httpx.get(url, timeout=10, headers={"User-Agent": "curl/7.68.0"})
        diagnosis["plain_request_status"] = resp.status_code
        if resp.headers.get("cf-ray"):
            diagnosis["protections"].append("Cloudflare")
        set_cookie = resp.headers.get("set-cookie", "").lower()
        if "datadome" in set_cookie:
            diagnosis["protections"].append("DataDome")
        if "incap_ses" in set_cookie:
            diagnosis["protections"].append("Imperva")
        if "ak_bmsc" in set_cookie:
            diagnosis["protections"].append("Akamai")
    except Exception as e:
        diagnosis["plain_request_error"] = str(e)
    # Test 2: Check response body for challenge markers
    if resp is not None:
        body = resp.text.lower()
        if "just a moment" in body or "cf-challenge" in body:
            diagnosis["cloudflare_challenge"] = True
        if "datadome.co" in body:
            diagnosis["datadome_challenge"] = True
        if "/_incapsula_resource" in body:
            diagnosis["imperva_challenge"] = True
    return diagnosis

# Usage
result = diagnose_protection("https://target-site.com")
print(json.dumps(result, indent=2))
```
Real-World Use Cases and Production Patterns
Price Monitoring at Scale
Track competitor pricing across thousands of product pages on well-protected ecommerce sites:
```python
import asyncio
import random

async def monitor_prices(
    product_urls: list[str],
    proxy_pool: list[str],
    db: ScrapingDB,
) -> None:
    """Production price monitoring with anti-detection."""
    db.add_to_queue(product_urls)
    batch = db.get_next_batch(10)
    for url in batch:
        proxy = random.choice(proxy_pool)
        db.mark_in_progress(url)
        try:
            content = await scrape_cloudflare_protected(url, proxy_url=proxy)
            # Parse price from content...
            page = ScrapedPage(
                url=url,
                content=content,
                status_code=200,
                proxy_used=proxy,
            )
            db.save_result(page)
            await asyncio.sleep(random.uniform(5, 15))
        except Exception as e:
            page = ScrapedPage(
                url=url, content="", status_code=0,
                proxy_used=proxy, error=str(e),
            )
            db.save_result(page)
```
News Article Aggregation
News sites often use Cloudflare. RSS feeds are the first fallback, but when you need the full body:
```python
from curl_cffi import requests as curl_requests

def scrape_news_article(url: str, proxy: str | None = None) -> str:
    """Scrape a news article using TLS impersonation."""
    session = curl_requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",  # Simulate coming from search
    }
    response = session.get(
        url,
        impersonate="chrome124",
        headers=headers,
        proxies={"https": proxy} if proxy else None,
        timeout=30,
    )
    if response.status_code == 200:
        return response.text
    return ""
```
Quick Reference
| System | Primary Detection | Best Bypass | Difficulty | Proxy Needed |
|---|---|---|---|---|
| Cloudflare Free | JS Challenge | Browser headers + delays | Easy | Datacenter OK |
| Cloudflare Biz/Enterprise | Fingerprint + Behavioral | Playwright + stealth | Medium | Residential |
| DataDome | ML ensemble scoring | Full browser + residential IP + slow | Hard | Residential mandatory |
| Imperva | TLS fingerprint + cookies | curl_cffi + header matching | Medium | ISP/Residential |
| Akamai | Sensor data + JS tokens | Playwright + sensor capture | Hard | Residential |
| PerimeterX | Behavioral ML | Playwright + human simulation | Hard | Residential/Mobile |
The anti-bot arms race isn't slowing down. The trend is clear: simple HTTP libraries are dead for protected sites. You need either browser automation or TLS impersonation, and increasingly both. Plan your scraping infrastructure accordingly.
For residential proxies that work reliably against these systems, ThorData is worth evaluating — their rotating residential pool covers 190+ countries with automatic rotation per request, which is what the toughest systems require.