How to Handle Anti-Bot Protection When Scraping in 2026 (Cloudflare, DataDome, Imperva)
If you've scraped anything meaningful in the last year, you've hit a wall. Not a 403 — those are easy. I mean the kind where your requests return 200 OK with a JavaScript challenge page, or where everything works for 50 requests and then silently starts returning stale data.
Anti-bot systems got significantly better in 2025-2026. Here's what each one actually does and what works against them.
The Four Tiers of Anti-Bot in 2026
Cloudflare fronts more of the web than any other vendor; by most counts it accounts for over half of all websites that run any bot protection at all. Their stack runs JS challenges (Turnstile), browser fingerprinting via canvas and WebGL hashing, and behavioral analysis on mouse movement and scroll patterns. The free tier is easy to bypass. Business and Enterprise tiers with Bot Management enabled are genuinely hard.
DataDome is the one that makes scraper developers swear. Used by major retailers (Foot Locker, TripAdvisor, dozens of media sites), it runs ML-based scoring on every single request. It evaluates your TLS fingerprint, headers, timing, and behavioral signals in combination — and assigns a bot probability score. If you fix one signal but not the others, you still get blocked.
Imperva (Incapsula) focuses heavily on TLS fingerprinting and cookie-based challenges. It injects JavaScript that sets specific cookies, then validates them on subsequent requests. Less sophisticated than DataDome's ML approach, but very effective against scripts.
Akamai Bot Manager uses token-based validation with heavily obfuscated JavaScript that generates sensor data. The JS changes frequently, making static analysis a losing game.
What Gets You Blocked (Specifically)
Stop guessing. These are the actual detection vectors, ranked by how often they catch scrapers:
1. TLS Fingerprint (JA3/JA4 hash)
Python's requests library uses urllib3's TLS stack, which produces a JA3 fingerprint that looks nothing like any browser. Every anti-bot system maintains a database of known library fingerprints. This is the #1 reason requests.get() fails on protected sites in 2026.
```
# Python requests JA3 (easily identified as non-browser)
769,47-53-5-10-49161-49162-49171-49172-50-56-19-4,0-10-11,23-24-25,0

# Chrome 124 JA3 (what you want to look like)
771,4865-4866-4867-49195-49199-49196-49200-52393-52392...
```
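What detection databases actually store is not this raw string but its JA3 hash, which the JA3 spec defines as the MD5 digest of the comma-separated fingerprint. A minimal sketch:

```python
import hashlib

def ja3_hash(ja3_string: str) -> str:
    """JA3 hash = MD5 of the raw comma-separated fingerprint string."""
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()

# The same client TLS stack always produces the same 32-char digest,
# which is what makes the known-library lookup table cheap to build.
requests_ja3 = "769,47-53-5-10-49161-49162-49171-49172-50-56-19-4,0-10-11,23-24-25,0"
print(ja3_hash(requests_ja3))
```

Two clients with identical TLS stacks hash identically, so one database entry covers every deployment of a given HTTP library.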
2. Missing or Wrong Browser Headers
Modern browsers send 15+ headers per request. Scrapers typically send 3-4. The tells:
- Missing Sec-Fetch-Mode, Sec-Fetch-Site, Sec-Fetch-Dest (Chrome sends these on every request)
- Missing or wrong sec-ch-ua (Client Hints — Chrome's replacement for User-Agent)
- Accept-Language missing or set to something generic like en
- Accept-Encoding not including br (Brotli — all modern browsers support it)
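Those tells are easy to audit before you send a single request. A quick self-check sketch — `header_tells` is a hypothetical helper built from the list above, not a library function:

```python
def header_tells(headers: dict) -> list[str]:
    """Return the header-based detection tells present in an outgoing header set."""
    h = {k.lower(): v for k, v in headers.items()}
    tells = []
    # Chrome sends these on every request; their absence is a giveaway
    for required in ("sec-fetch-mode", "sec-fetch-site", "sec-fetch-dest", "sec-ch-ua"):
        if required not in h:
            tells.append(f"missing {required}")
    if h.get("accept-language", "") in ("", "en"):
        tells.append("generic or missing Accept-Language")
    if "br" not in h.get("accept-encoding", ""):
        tells.append("Accept-Encoding lacks br (Brotli)")
    return tells
```

Run it against your own header dict: a bare `requests`-style set trips every check, while a full Chrome-like set comes back clean.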
3. Headless Browser Detection
Even Playwright and Puppeteer get caught:
- navigator.webdriver returns true by default
- Missing browser plugins (navigator.plugins is empty)
- Chrome automation flags in window.chrome.runtime
- Canvas fingerprint returns a unique hash for headless vs headed rendering
- WebGL renderer string says "SwiftShader" instead of an actual GPU
4. Behavioral Signals
- Request timing too regular (exactly 2.0s between requests = bot)
- No mouse movement or scroll events before clicking
- Loading page resources in wrong order (CSS/JS before HTML is parsed)
- Fetching robots.txt before scraping (ironic, but it flags you)
5. IP Reputation
- Datacenter ASNs (AWS, GCP, DigitalOcean) are pre-flagged
- Known proxy/VPN ranges get higher bot scores
- Residential IPs get a trust bonus but aren't immune
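The timing-regularity signal from the behavioral list is trivial for a server to compute, which is why fixed sleeps get caught. A toy version using the coefficient of variation of inter-request gaps (the threshold is illustrative):

```python
import statistics

def looks_scripted(gaps: list[float], cv_threshold: float = 0.1) -> bool:
    """Flag a session whose inter-request gaps are suspiciously uniform.

    Humans produce high-variance gaps; a loop around time.sleep(2) produces
    near-zero variance. cv = stdev / mean; the 0.1 cutoff is illustrative.
    """
    if len(gaps) < 3:
        return False  # not enough evidence yet
    mean = statistics.mean(gaps)
    if mean == 0:
        return True
    return statistics.stdev(gaps) / mean < cv_threshold
```

A `[2.0, 2.0, 2.0, ...]` trace flags immediately; a jittered human-like trace does not. This is also why `random.uniform` delays (high variance by construction) beat constant sleeps.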
Understanding Each Defense Layer in Depth
How Cloudflare Bot Management Works
Cloudflare's Bot Management assigns every visitor a bot score from 1 to 99. A score of 1 means almost certainly a bot; 99 means almost certainly human. The score is computed from:
- JavaScript rendering — can the client execute JavaScript and return expected results?
- Browser API availability — does the client expose the same APIs a real browser would?
- Canvas fingerprint — is the canvas rendering hash consistent with the claimed browser and OS?
- Network timing — does the TCP handshake timing, TTFB, and subsequent request timing pattern match expected human behavior?
- IP reputation — is the IP known to belong to a hosting provider, VPN, or previously flagged for bot activity?
- Cookie validation — does the client properly store and return the session cookies Cloudflare sets?
The key insight is that Cloudflare doesn't block you when any single signal is off. It accumulates evidence across signals and acts once the combined assessment crosses its blocking threshold. This is why partial fixes don't work: fix your TLS fingerprint but keep a datacenter IP, and you're still blocked.
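That evidence-accumulation model can be sketched as a weighted sum against a threshold. The weights and signal names below are illustrative, not Cloudflare's actual model:

```python
# Illustrative weights -- not Cloudflare's real scoring model.
SIGNAL_WEIGHTS = {
    "library_tls_fingerprint": 40,
    "datacenter_ip": 25,
    "no_js_execution": 30,
    "missing_sec_fetch_headers": 10,
    "cookie_not_returned": 15,
}
BLOCK_THRESHOLD = 50

def bot_score(signals: set[str]) -> int:
    """Accumulate evidence from whichever signals fired."""
    return sum(SIGNAL_WEIGHTS.get(s, 0) for s in signals)

def is_blocked(signals: set[str]) -> bool:
    return bot_score(signals) >= BLOCK_THRESHOLD

# Fixing TLS but keeping a datacenter IP and skipping JS still blocks:
print(is_blocked({"datacenter_ip", "no_js_execution"}))  # True
print(is_blocked({"datacenter_ip"}))                      # False
```

The point of the sketch: no single signal reaches the threshold alone, so every partial fix just lowers the score without clearing it.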
How DataDome's ML Model Works
DataDome is fundamentally different from Cloudflare. Rather than rule-based scoring, it runs a trained ML classifier that evaluates your request against behavioral baselines built from millions of legitimate visitors to that site specifically.
DataDome captures:
- The full sequence of your requests (not just each one individually)
- Time between requests and variance in timing
- Which pages you visit and in what order
- Which resources you request (do you load images? CSS? Tracking pixels?)
- Mouse trajectory data if you execute JavaScript
- The correlation between all of the above
This is why DataDome is so hard. Even if you fool it on request 1, your behavioral pattern across requests 2-50 builds an increasingly clear signal. A human visiting a retail site browses category pages, reads descriptions, goes back, compares items. A scraper hits structured endpoints in order and never loads a product image.
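One of those behavioral tells — whether a client fetches subresources at all — is simple to illustrate. A toy classifier over the URLs a session requested (the extension list and the 30% ratio are illustrative, not DataDome's real features):

```python
SUBRESOURCE_EXTS = (".css", ".js", ".png", ".jpg", ".webp", ".gif", ".svg", ".woff2")

def fetches_like_a_browser(requested_urls: list[str]) -> bool:
    """A real browser loading HTML pages also pulls CSS, JS, and images.

    A scraper that only ever hits document or JSON endpoints stands out,
    because its subresource ratio sits near zero.
    """
    subresources = sum(
        1 for u in requested_urls
        if u.split("?")[0].lower().endswith(SUBRESOURCE_EXTS)
    )
    return subresources / max(len(requested_urls), 1) > 0.3  # illustrative ratio
```

A session that requests only `/api/products?page=N` endpoints fails this check on request two, no matter how good its TLS fingerprint is.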
Imperva's Technical Stack
Imperva (now part of Thales) uses a different approach: rather than ML scoring, it relies on a JavaScript challenge-response scheme:
- It serves a JavaScript payload that must execute and compute a specific value based on browser environment data
- That computed value is embedded in a cookie
- Every subsequent request is validated against that cookie
- The JavaScript to compute the value changes regularly
The weakness: the cookie value computation is deterministic. If you can execute the JavaScript in a real browser context once, you can replay those cookies for some time. But Imperva also validates the browser fingerprint matches the original cookie context, so replaying across different IPs or fingerprints fails.
The practical approach: use curl_cffi for TLS impersonation combined with Playwright to execute the initial JavaScript challenge, then extract and replay the cookies for subsequent requests.
What Actually Works (Per Tier)
Tier 1: Sites With Basic Protection (IP blocking, simple rate limits)
Rotate IPs, add proper browser headers, add random delays. This still works for maybe 40% of protected sites.
```python
import httpx
import random
import time

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Dest": "document",
    "sec-ch-ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
}

def scrape(url: str, proxy: str | None = None) -> httpx.Response:
    """Basic scrape with anti-detection headers and optional proxy."""
    time.sleep(random.uniform(1.5, 4.0))
    # httpx >= 0.26 takes a single `proxy` argument (the old `proxies` dict is deprecated)
    return httpx.get(url, headers=HEADERS, proxy=proxy, follow_redirects=True, timeout=20)

def rotate_user_agent() -> str:
    """Return a random realistic user agent."""
    agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0",
    ]
    return random.choice(agents)
```
Tier 2: Cloudflare (JS Challenges + Fingerprinting)
For Cloudflare Business/Enterprise, you need a real browser. playwright-stealth or undetected-chromedriver patches the obvious headless tells:
```python
import asyncio
import random
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scrape_cloudflare_protected(url: str, proxy_url: str | None = None) -> str:
    """Scrape a Cloudflare-protected page using Playwright with stealth."""
    async with async_playwright() as p:
        launch_kwargs = {
            "headless": True,
            "args": [
                "--no-sandbox",
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--window-size=1440,900",
            ],
        }
        if proxy_url:
            launch_kwargs["proxy"] = {"server": proxy_url}
        browser = await p.chromium.launch(**launch_kwargs)
        context = await browser.new_context(
            viewport={"width": 1440, "height": 900},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = await context.new_page()
        await stealth_async(page)
        # Simulate human-like behavior before navigation
        await page.mouse.move(
            random.randint(100, 300),
            random.randint(100, 300),
        )
        await page.goto(url, wait_until="networkidle", timeout=60000)
        # Cloudflare challenge usually resolves within 5s
        await page.wait_for_timeout(random.randint(4000, 8000))
        # Check if still on challenge page
        title = await page.title()
        if "Just a moment" in title or "Attention Required" in title:
            # Wait longer for challenge to complete
            await page.wait_for_timeout(10000)
        content = await page.content()
        await browser.close()
        return content

# Usage with ThorData residential proxy
result = asyncio.run(scrape_cloudflare_protected(
    "https://protected-site.com/data",
    proxy_url="http://user:[email protected]:9000",
))
```
Key: make sure your launch actually gets Chrome's new headless mode (the default in recent Chrome and Playwright releases); the old headless implementation has too many detectable differences.
Tier 3: DataDome (ML-Based Scoring)
DataDome is the hardest to beat consistently. Their system correlates signals across requests, so fixing one thing doesn't help if three others are wrong.
```python
import asyncio
import random
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def scrape_datadome_protected(
    urls: list[str],
    proxy_url: str,
    delay_range: tuple[float, float] = (15, 35),
) -> list[dict]:
    """
    Scrape DataDome-protected pages with realistic behavior simulation.

    Key requirements:
    - Residential proxy (datacenter IPs fail immediately)
    - Slow, randomized request rates (15-35 seconds between pages)
    - Realistic session: homepage first, then navigate
    - Full JavaScript execution
    """
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": proxy_url},
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = await browser.new_context(
            viewport={
                "width": random.choice([1366, 1440, 1920]),
                "height": random.choice([768, 900, 1080]),
            },
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
        )
        page = await context.new_page()
        await stealth_async(page)
        # CRITICAL: Always load the homepage first to establish session
        base_domain = "/".join(urls[0].split("/")[:3])
        await page.goto(base_domain, wait_until="networkidle")
        await asyncio.sleep(random.uniform(3, 6))
        # Simulate reading the homepage
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight * 0.3)")
        await asyncio.sleep(random.uniform(1, 3))
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight * 0.6)")
        await asyncio.sleep(random.uniform(1, 2))
        for url in urls:
            try:
                await page.goto(url, wait_until="networkidle", timeout=45000)
                await asyncio.sleep(random.uniform(*delay_range))
                # Simulate reading the page
                await page.evaluate("window.scrollTo(0, document.body.scrollHeight * 0.5)")
                await asyncio.sleep(random.uniform(2, 5))
                content = await page.content()
                results.append({"url": url, "content": content, "status": "ok"})
            except Exception as e:
                results.append({"url": url, "content": None, "status": str(e)})
        await browser.close()
    return results
```
What works:
- Full browser automation (Playwright, not requests)
- Residential proxies are mandatory — if you need residential proxies for this kind of work, ThorData has a solid residential pool with city-level targeting
- Slow request rates (15-30 seconds between pages)
- Realistic session behavior: load homepage first, navigate via links, don't jump to deep URLs
- Check if the site has an official API before fighting DataDome — many do, and it's cheaper than the proxy bill
Tier 4: Imperva — TLS Impersonation
Imperva's main detection vector is TLS fingerprinting. You can impersonate a browser's TLS stack using curl_cffi, the Python bindings for the curl-impersonate project:
```python
from curl_cffi import requests as curl_requests

def scrape_imperva_protected(url: str, proxy: str | None = None) -> curl_requests.Response:
    """
    Use curl_cffi to impersonate Chrome's exact TLS fingerprint.
    This is the most effective approach for Imperva-protected sites.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-Dest": "document",
    }
    session = curl_requests.Session()
    # Impersonates Chrome's exact TLS fingerprint
    return session.get(
        url,
        impersonate="chrome124",
        headers=headers,
        proxies={"https": proxy} if proxy else None,
        timeout=30,
    )

# Combined approach: curl_cffi for TLS + extract cookies + use for subsequent requests
def get_imperva_session_cookies(site_url: str, proxy: str | None = None) -> dict:
    """Solve the Imperva cookie challenge and return valid session cookies."""
    response = scrape_imperva_protected(site_url, proxy=proxy)
    cookies = dict(response.cookies)
    print(f"Got {len(cookies)} cookies from Imperva challenge")
    return cookies

def scrape_with_imperva_cookies(target_url: str, cookies: dict, proxy: str | None = None) -> str:
    """Use validated Imperva session cookies for subsequent requests."""
    session = curl_requests.Session()
    for name, value in cookies.items():
        session.cookies.set(name, value)
    response = session.get(
        target_url,
        impersonate="chrome124",
        proxies={"https": proxy} if proxy else None,
        timeout=30,
    )
    return response.text
```
curl_cffi compiles against a modified libcurl that reproduces Chrome/Firefox TLS handshakes byte-for-byte. This is the single most effective library for bypassing TLS-based detection in 2026.
Tier 5: Akamai — Sensor Data Bypass
Akamai is the hardest tier. Their obfuscated JavaScript generates "sensor data" that must accompany every request. The JS changes weekly.
```python
from playwright.async_api import async_playwright

async def extract_akamai_sensor_data(target_url: str, proxy: str | None = None) -> dict:
    """
    Use a real browser to execute Akamai's JavaScript and capture
    the sensor data it generates. Then use that data for subsequent
    HTTP requests, avoiding the overhead of full browser automation.
    """
    sensor_data = {}
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": proxy} if proxy else None,
            args=["--disable-blink-features=AutomationControlled"],
        )
        page = await browser.new_page()

        # Intercept the Akamai sensor request to capture the payload
        def handle_request(request):
            if "/_ctr/" in request.url or "/akam/" in request.url:
                body = request.post_data
                if body:
                    sensor_data["payload"] = body
                    sensor_data["headers"] = dict(request.headers)

        page.on("request", handle_request)
        await page.goto(target_url, wait_until="networkidle")
        await page.wait_for_timeout(5000)
        # Extract cookies after JS execution
        cookies = await page.context.cookies()
        sensor_data["cookies"] = {c["name"]: c["value"] for c in cookies}
        await browser.close()
    return sensor_data
```
Anti-Detection Techniques That Work Across All Systems
Browser Fingerprint Hardening
```python
from playwright.async_api import async_playwright

STEALTH_SCRIPT = """
// Override navigator.webdriver
Object.defineProperty(navigator, 'webdriver', {
  get: () => undefined,
  configurable: true
});
// Restore plugins array
Object.defineProperty(navigator, 'plugins', {
  get: () => [
    { name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer' },
    { name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai' },
    { name: 'Native Client', filename: 'internal-nacl-plugin' },
  ],
});
// Fix mimeTypes
Object.defineProperty(navigator, 'mimeTypes', {
  get: () => [
    { type: 'application/pdf', suffixes: 'pdf', description: 'Portable Document Format' },
  ],
});
// Restore chrome runtime
window.chrome = {
  runtime: {},
  loadTimes: function() {},
  csi: function() {},
  app: {},
};
// Fix permissions
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
  parameters.name === 'notifications'
    ? Promise.resolve({ state: Notification.permission })
    : originalQuery(parameters)
);
// Prevent headless detection via iframe
Object.defineProperty(HTMLIFrameElement.prototype, 'contentWindow', {
  get: function() {
    return window;
  }
});
"""

async def create_hardened_browser(proxy_url: str | None = None):
    """Create a Playwright browser with full fingerprint hardening."""
    p = await async_playwright().start()
    launch_args = [
        "--no-sandbox",
        "--disable-blink-features=AutomationControlled",
        "--disable-dev-shm-usage",
        "--disable-infobars",
        "--disable-extensions",
        "--window-size=1440,900",
        "--lang=en-US",
    ]
    browser = await p.chromium.launch(
        headless=True,
        args=launch_args,
        proxy={"server": proxy_url} if proxy_url else None,
    )
    context = await browser.new_context(
        viewport={"width": 1440, "height": 900},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
        permissions=["geolocation"],
        color_scheme="light",
        device_scale_factor=1.0,
        java_script_enabled=True,
        bypass_csp=False,
    )
    # Inject stealth scripts on every page
    await context.add_init_script(STEALTH_SCRIPT)
    return p, browser, context
```
Proxy Rotation Strategy
Not all proxy types are equal. Here's the hierarchy for anti-bot systems:
| Proxy Type | Cloudflare | DataDome | Imperva | Akamai | Cost |
|---|---|---|---|---|---|
| Datacenter | Blocked | Blocked | Often blocked | Blocked | $ |
| ISP/Static Residential | Usually passes | Usually passes | Passes | Usually passes | $$ |
| Rotating Residential | Passes | Passes with slow rates | Passes | Passes | $$$ |
| Mobile Proxies | Best | Best | Best | Best | $$$$ |
For most production scraping work, rotating residential proxies hit the right balance of effectiveness and cost. ThorData's residential network covers 190+ countries with city-level targeting and automatic rotation — you configure the target country/city and each request gets a fresh IP from a real consumer connection.
```python
import httpx

THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000

def get_proxy(country: str = "US", city: str | None = None) -> str:
    """Build a ThorData proxy URL with geo-targeting."""
    user_str = f"{THORDATA_USER}_country-{country}"
    if city:
        user_str += f"_city-{city}"
    return f"http://{user_str}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"

def make_client_with_proxy(country: str = "US") -> httpx.Client:
    """Create an httpx client routing through a residential proxy."""
    # httpx >= 0.26 takes a single `proxy` argument (the old `proxies` dict is deprecated)
    return httpx.Client(
        proxy=get_proxy(country),
        timeout=30,
        follow_redirects=True,
        http2=True,
    )
```
Request Timing and Pattern Randomization
Behavioral analysis is increasingly sophisticated. Pure random delays aren't enough — you need timing that matches human reading patterns:
```python
import asyncio
import random
import time

def human_delay(action: str = "read"):
    """
    Apply realistic delays based on the action type.
    Human behavior isn't uniformly random — it follows patterns.
    """
    if action == "read":
        # Reading a page: log-normal distribution, mean ~8 seconds
        delay = random.lognormvariate(2.0, 0.6)
        delay = max(3.0, min(45.0, delay))  # clamp 3-45s
    elif action == "click":
        # Click reactions: fast, 0.5-2 seconds
        delay = random.uniform(0.4, 2.0)
    elif action == "search":
        # Search and think: medium, 3-12 seconds
        delay = random.uniform(3.0, 12.0)
    elif action == "scroll":
        # Scroll pause: very short, 0.3-1 second
        delay = random.uniform(0.3, 1.0)
    else:
        delay = random.uniform(1.0, 5.0)
    time.sleep(delay)

async def human_scroll(page, target_pct: float = 0.7):
    """Simulate human scrolling — not a single jump to the target."""
    current = 0
    target = target_pct * 100
    while current < target:
        step = random.uniform(5, 15)
        current = min(current + step, target)
        await page.evaluate(f"window.scrollTo(0, document.body.scrollHeight * {current / 100})")
        await asyncio.sleep(random.uniform(0.05, 0.2))
```
Session Warming
Starting a scraping session by going directly to your target data URL is a classic bot tell. Warm up the session first:
```python
import asyncio
import random

async def warm_session(page, base_domain: str):
    """Warm up a browser session to mimic organic navigation."""
    # Load homepage
    await page.goto(base_domain, wait_until="networkidle")
    await asyncio.sleep(random.uniform(3, 7))
    # Simulate reading homepage
    await human_scroll(page, 0.4)
    await asyncio.sleep(random.uniform(2, 4))
    # Click an internal link (not your target yet)
    links = await page.query_selector_all("a[href^='/']")
    if links:
        random_link = random.choice(links[:10])
        href = await random_link.get_attribute("href")
        if href and not any(x in href for x in ["login", "signup", "cart", "checkout"]):
            await page.goto(base_domain + href, wait_until="domcontentloaded")
            await asyncio.sleep(random.uniform(2, 5))
    # Now navigate to your target
    await asyncio.sleep(random.uniform(1, 3))
```
Data Storage for Large-Scale Scraping
When scraping through anti-bot systems, partial failures are common. Design your storage to handle retries and resume without re-scraping:
```python
import sqlite3
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ScrapedPage:
    url: str
    content: str
    status_code: int
    proxy_used: str
    scraped_at: str | None = None
    retries: int = 0
    error: str | None = None

    def __post_init__(self):
        if not self.scraped_at:
            self.scraped_at = datetime.now(timezone.utc).isoformat()

class ScrapingDB:
    def __init__(self, path: str = "scrape_results.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS pages (
                url TEXT PRIMARY KEY,
                content TEXT,
                status_code INTEGER,
                proxy_used TEXT,
                scraped_at TEXT,
                retries INTEGER DEFAULT 0,
                error TEXT
            )
        """)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS queue (
                url TEXT PRIMARY KEY,
                priority INTEGER DEFAULT 0,
                added_at TEXT DEFAULT (datetime('now')),
                attempted_at TEXT,
                status TEXT DEFAULT 'pending'
            )
        """)
        self.conn.commit()

    def add_to_queue(self, urls: list[str], priority: int = 0):
        self.conn.executemany(
            "INSERT OR IGNORE INTO queue (url, priority) VALUES (?, ?)",
            [(url, priority) for url in urls],
        )
        self.conn.commit()

    def get_next_batch(self, batch_size: int = 10) -> list[str]:
        rows = self.conn.execute("""
            SELECT url FROM queue
            WHERE status = 'pending'
            ORDER BY priority DESC, added_at ASC
            LIMIT ?
        """, (batch_size,)).fetchall()
        return [r[0] for r in rows]

    def mark_in_progress(self, url: str):
        self.conn.execute(
            "UPDATE queue SET status='in_progress', attempted_at=datetime('now') WHERE url=?",
            (url,),
        )
        self.conn.commit()

    def save_result(self, page: ScrapedPage):
        self.conn.execute("""
            INSERT OR REPLACE INTO pages
            (url, content, status_code, proxy_used, scraped_at, retries, error)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """, (page.url, page.content, page.status_code, page.proxy_used,
              page.scraped_at, page.retries, page.error))
        self.conn.execute(
            "UPDATE queue SET status=? WHERE url=?",
            ("done" if not page.error else "failed", page.url),
        )
        self.conn.commit()

    def get_failed_urls(self) -> list[str]:
        rows = self.conn.execute(
            "SELECT url FROM queue WHERE status='failed'"
        ).fetchall()
        return [r[0] for r in rows]

    def reset_failed(self):
        """Reset failed items for retry."""
        self.conn.execute("UPDATE queue SET status='pending' WHERE status='failed'")
        self.conn.commit()
```
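A natural companion to `reset_failed()`: when you do retry, back off exponentially with jitter so the retry wave itself doesn't form a detectable pattern. A sketch (the delay parameters are illustrative):

```python
import random

def retry_delay(attempt: int, base: float = 20.0, cap: float = 600.0) -> float:
    """Exponential backoff with full jitter for retry attempt N (1-indexed).

    Full jitter (a delay drawn uniformly from [0, backoff]) avoids the
    synchronized retry spikes that behavioral analysis picks up on.
    """
    backoff = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0, backoff)
```

Sleep for `retry_delay(page.retries + 1)` before re-queuing a failed URL; repeated failures then spread out instead of hammering the site on a fixed schedule.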
The Economics: When to Fight vs. When to Walk Away
Before spending 20 hours reverse-engineering a site's anti-bot system, ask:
1. Does the site have an API? Many sites behind DataDome or Cloudflare have official APIs. LinkedIn has an API. Amazon has the Product Advertising API. Even if they're limited, they might cover your use case.
2. Can you buy the data? Data brokers sell pre-scraped datasets for many common targets. Often cheaper than proxy costs.
3. What's the proxy math? Residential proxies cost $3-12/GB. If you're scraping 10,000 pages at 500KB each, that's ~5GB = $15-60. At scale, this adds up fast. Compare against API costs or data purchases.
4. Is there a less-protected equivalent? Mobile versions of sites (m.example.com), RSS feeds, sitemaps, and cached versions (Google Cache, Wayback Machine) often have weaker or no protection.
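The proxy math in point 3 is worth scripting before you commit to a scrape. A back-of-the-envelope calculator using the per-GB prices quoted above:

```python
def proxy_cost_usd(pages: int, avg_page_kb: float, usd_per_gb: float) -> float:
    """Estimate residential-proxy bandwidth cost for a scrape."""
    gb = pages * avg_page_kb / 1024 / 1024
    return gb * usd_per_gb

# The example from the text: 10,000 pages at 500 KB each, $3-12/GB
low = proxy_cost_usd(10_000, 500, 3)    # roughly 14.3
high = proxy_cost_usd(10_000, 500, 12)  # roughly 57.2
```

That lands at roughly $14-57 for ~4.8 GB, consistent with the estimate above, and scales linearly: ten times the pages means ten times the bill.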
The best scraper engineers I know spend more time finding the path of least resistance than brute-forcing through protection. Every hour fighting an anti-bot system is an hour you could spend building something with the data.
Debugging and Diagnostics
When a scrape fails, diagnosing which layer blocked you saves hours:
```python
import httpx
import json

def diagnose_protection(url: str) -> dict:
    """Diagnose what anti-bot protection a site uses."""
    diagnosis = {"url": url, "protections": []}
    resp = None
    # Test 1: Plain request — see what headers come back
    try:
        resp = httpx.get(url, timeout=10, headers={"User-Agent": "curl/7.68.0"})
        diagnosis["plain_request_status"] = resp.status_code
        if resp.headers.get("cf-ray"):
            diagnosis["protections"].append("Cloudflare")
        set_cookie = resp.headers.get("set-cookie", "").lower()
        if "datadome" in set_cookie:
            diagnosis["protections"].append("DataDome")
        if "incap_ses" in set_cookie:
            diagnosis["protections"].append("Imperva")
        if "ak_bmsc" in set_cookie:
            diagnosis["protections"].append("Akamai")
    except Exception as e:
        diagnosis["plain_request_error"] = str(e)
    # Test 2: Check response body for challenge markers
    if resp is not None:
        body = resp.text.lower()
        if "just a moment" in body or "cf-challenge" in body:
            diagnosis["cloudflare_challenge"] = True
        if "datadome.co" in body:
            diagnosis["datadome_challenge"] = True
        if "/_incapsula_resource" in body:
            diagnosis["imperva_challenge"] = True
    return diagnosis

# Usage
result = diagnose_protection("https://target-site.com")
print(json.dumps(result, indent=2))
```
Real-World Use Cases and Production Patterns
Price Monitoring at Scale
Track competitor pricing across thousands of product pages on well-protected ecommerce sites:
```python
import asyncio
import random

async def monitor_prices(
    product_urls: list[str],
    proxy_pool: list[str],
    db: ScrapingDB,
) -> None:
    """Production price monitoring with anti-detection."""
    db.add_to_queue(product_urls)
    batch = db.get_next_batch(10)
    for url in batch:
        proxy = random.choice(proxy_pool)
        db.mark_in_progress(url)
        try:
            content = await scrape_cloudflare_protected(url, proxy_url=proxy)
            # Parse price from content...
            page = ScrapedPage(
                url=url,
                content=content,
                status_code=200,
                proxy_used=proxy,
            )
            db.save_result(page)
            await asyncio.sleep(random.uniform(5, 15))
        except Exception as e:
            page = ScrapedPage(
                url=url, content="", status_code=0,
                proxy_used=proxy, error=str(e),
            )
            db.save_result(page)
```
News Article Aggregation
News sites often use Cloudflare. RSS feeds are the first fallback, but when you need the full body:
```python
from curl_cffi import requests as curl_requests

def scrape_news_article(url: str, proxy: str | None = None) -> str:
    """Scrape a news article using TLS impersonation."""
    session = curl_requests.Session()
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",  # Simulate coming from search
    }
    response = session.get(
        url,
        impersonate="chrome124",
        headers=headers,
        proxies={"https": proxy} if proxy else None,
        timeout=30,
    )
    if response.status_code == 200:
        return response.text
    return ""
```
Quick Reference
| System | Primary Detection | Best Bypass | Difficulty | Proxy Needed |
|---|---|---|---|---|
| Cloudflare Free | JS Challenge | Browser headers + delays | Easy | Datacenter OK |
| Cloudflare Biz/Enterprise | Fingerprint + Behavioral | Playwright + stealth | Medium | Residential |
| DataDome | ML ensemble scoring | Full browser + residential IP + slow | Hard | Residential mandatory |
| Imperva | TLS fingerprint + cookies | curl_cffi + header matching | Medium | ISP/Residential |
| Akamai | Sensor data + JS tokens | Playwright + sensor capture | Hard | Residential |
| PerimeterX | Behavioral ML | Playwright + human simulation | Hard | Residential/Mobile |
The anti-bot arms race isn't slowing down. The trend is clear: simple HTTP libraries are dead for protected sites. You need either browser automation or TLS impersonation, and increasingly both. Plan your scraping infrastructure accordingly.
For residential proxies that work reliably against these systems, ThorData is worth evaluating — their rotating residential pool covers 190+ countries with automatic rotation per request, which is what the toughest systems require.