Scraping Cloudflare-Protected Sites in 2026

2026-03-31 cloudflare web-scraping playwright proxies bot-detection anti-bot tls-fingerprint residential-proxies

Scraping Cloudflare-Protected Sites in 2026

Cloudflare sits in front of roughly 20% of all websites on the internet. If you have been doing any serious web scraping in the past two years, you have run into its bot detection system — usually appearing as a 403 Forbidden response, a 1020 Access Denied error page, or an infinite JavaScript challenge loop that never resolves into the content you need.

Understanding why Cloudflare blocks scrapers requires understanding what it is actually analyzing. Cloudflare is not just checking whether you send a convincing User-Agent header. By the time your HTTP library sends its first request header, Cloudflare has already formed a preliminary opinion about you based on your IP address reputation, the TLS handshake signature your library presents at the TCP level, and the HTTP/2 connection fingerprint that your HTTP client library produces. Before a single line of HTML is served, three independent detection channels have already run.

This creates a layered detection problem. Each layer needs to be addressed separately. A scraper that fixes the IP layer but ignores the TLS layer will still fail. One that fixes both but runs inside a headless Chrome with detectable automation flags will fail at the JavaScript challenge layer. Getting through Cloudflare reliably in 2026 means understanding each detection layer and addressing them in sequence.

This guide covers all of it: what Cloudflare is checking, what fails, what works, complete code examples for each approach, proxy rotation with residential IPs, CAPTCHA handling strategies, rate limiting, and a decision framework for choosing the right tool per protection tier. Every code example is production-ready Python.

One important caveat before diving in: this guide focuses on legally and ethically appropriate use cases — price monitoring, publicly accessible data collection, research, and competitive intelligence on data you have a legitimate interest in accessing. Cloudflare's bot protection exists for good reasons, and its terms of service must be respected. Always check the target site's terms before scraping.

What Cloudflare Is Actually Checking

Layer 1: IP Reputation (Pre-TLS)

The first check happens before any HTTP is exchanged. Cloudflare maintains constantly updated databases of IP address reputation. Datacenter IP ranges — AWS, GCP, DigitalOcean, Hetzner, Vultr, Linode, and hundreds of smaller hosting providers — are pre-scored as high-suspicion. Many are outright blocked on Cloudflare's higher protection tiers without any challenge.

This is why a scraper that works perfectly on your local machine may fail completely when deployed to a cloud server. Your home IP has a residential ASN. Your cloud server has a datacenter ASN. Same code, completely different treatment.

Layer 2: TLS Fingerprint (JA3/JA4)

TLS fingerprinting is the most commonly misunderstood detection layer. When your HTTP library makes an HTTPS connection, it performs a TLS handshake with the server. The parameters of that handshake — which cipher suites your client supports, in what order, which TLS extensions are included, the elliptic curve preferences — form a unique signature called a JA3 fingerprint (and its successor, JA4).

Python's requests library, based on urllib3, has a distinctive JA3 fingerprint. Even if you set a convincing Chrome user agent, the TLS fingerprint immediately identifies you as Python/urllib3. Cloudflare has known about this for years and blocks the requests fingerprint on aggressively configured zones.

The solution is curl-cffi, a Python library that wraps libcurl with configurable TLS settings, allowing you to produce a TLS fingerprint that matches real Chrome, Safari, or Firefox.

Layer 3: HTTP/2 Fingerprint

Similar to JA3, HTTP/2 connection parameters form a fingerprint. The SETTINGS frame values, initial window sizes, HEADERS frame ordering, and stream priority weights differ between browser implementations and Python HTTP libraries.

Libraries like httpx send HTTP/2 SETTINGS frames that are characteristic of Python clients. curl-cffi handles this too, producing HTTP/2 fingerprints that match real browsers.

Layer 4: JavaScript Challenge / Browser Fingerprint

If your request passes the IP and TLS layers (or if the site's protection tier doesn't check them), Cloudflare may still serve a JavaScript challenge page. This challenge runs in the browser and checks:

navigator.webdriver — is this an automated browser?
Canvas fingerprint — does the rendered canvas match a known browser/OS combination?
WebGL renderer string — what GPU is reported?
Font enumeration — which system fonts are installed?
Chrome runtime API presence
Time taken to complete the challenge (too fast = bot)
Mouse movement entropy (for Turnstile)

Standard Playwright/Selenium without stealth patches will fail here because navigator.webdriver is true by default in headless automation contexts.

Layer 5: Behavioral Signals

At higher protection tiers and for volume scraping, Cloudflare tracks behavioral patterns: the speed at which you navigate between pages, whether you follow links in the same way humans do, session duration, the ratio of crawled pages to browsing time. These signals matter less for occasional scraping but become significant for high-volume scrapers.

What Does Not Work in 2026

requests with a fake user agent: Gets blocked at IP or TLS layer on any site with medium or higher Cloudflare protection. The JA3 fingerprint of Python's requests is widely known.

cloudscraper library: This library reverse-engineers Cloudflare's JavaScript challenge to solve it in Python. Cloudflare has rotated the challenge generation algorithm multiple times. It works for weeks, then breaks overnight when Cloudflare rotates.

Selenium without stealth: navigator.webdriver = true is trivially detectable. Even with this patched, Chrome DevTools Protocol connections have their own fingerprint.

Standard datacenter proxies: Most datacenter IP ranges are blocked regardless of TLS or browser fingerprint on sites with aggressive Cloudflare configuration. This includes proxies from popular datacenter providers.

Rotating user agents with requests: If your TLS fingerprint says Python, changing the User-Agent header to Chrome does nothing. Cloudflare does not rely on User-Agent as a primary signal — it is too easy to fake.

What Works: The Layered Approach

Layer 1 Fix: curl-cffi for TLS Fingerprint Spoofing

curl-cffi is the primary tool for getting past TLS-based detection without running a full browser. Install it and replace requests with it:

from curl_cffi import requests as cf_requests
import time
import random

def create_cloudflare_session(impersonate: str = "chrome120") -> cf_requests.Session:
    """Create a session that impersonates a real browser at the TLS level."""
    session = cf_requests.Session(impersonate=impersonate)
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "sec-ch-ua": '"Google Chrome";v="120", "Not(A:Brand";v="24", "Chromium";v="120"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"macOS"',
    })
    return session


# Supported impersonation targets
IMPERSONATE_OPTIONS = [
    "chrome99", "chrome100", "chrome101", "chrome104", "chrome107",
    "chrome110", "chrome116", "chrome119", "chrome120",
    "safari15_3", "safari15_5", "safari16", "safari17_0",
    "firefox99", "firefox102", "firefox110",
    "edge99", "edge101",
]


def fetch_cloudflare_page(url: str, proxy_url: str = None) -> str:
    """Fetch a Cloudflare-protected page using TLS fingerprint impersonation."""
    # Randomize the browser version for variety
    impersonate = random.choice(["chrome116", "chrome119", "chrome120", "safari17_0"])

    session = create_cloudflare_session(impersonate)

    if proxy_url:
        session.proxies = {"http": proxy_url, "https": proxy_url}

    try:
        response = session.get(url, timeout=30)

        if response.status_code == 200:
            return response.text
        elif response.status_code == 403:
            raise Exception(f"Blocked (403) — try residential proxy: {url}")
        elif response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", 60))
            time.sleep(retry_after)
            raise Exception(f"Rate limited, waited {retry_after}s")
        else:
            raise Exception(f"HTTP {response.status_code} for {url}")

    except cf_requests.RequestsError as e:
        raise Exception(f"curl-cffi error: {e}")

Layer 2 Fix: Residential Proxies for IP Reputation

Even with perfect TLS impersonation, datacenter IPs fail on aggressively configured Cloudflare zones. You need residential proxies — IP addresses from real consumer ISP connections.

ThorData provides rotating residential proxies with country-level targeting. Here is a complete integration combining curl-cffi with ThorData for maximum Cloudflare bypass effectiveness:

from curl_cffi import requests as cf_requests
from bs4 import BeautifulSoup
import random
import time
import logging

logger = logging.getLogger(__name__)


class CloudflareScraper:
    """
    Production-ready scraper for Cloudflare-protected sites.
    Combines TLS fingerprint impersonation (curl-cffi) with
    residential proxy rotation (ThorData).
    """

    THORDATA_HOST = "proxy.thordata.com"
    THORDATA_PORT = 9000

    BROWSER_PROFILES = [
        {
            "impersonate": "chrome120",
            "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "platform": '"Windows"',
        },
        {
            "impersonate": "chrome119",
            "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
            "platform": '"macOS"',
        },
        {
            "impersonate": "safari17_0",
            "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
            "platform": '"macOS"',
        },
    ]

    def __init__(self, thordata_user: str, thordata_pass: str, 
                 country: str = "US", requests_per_ip: int = 30):
        self.thordata_user = thordata_user
        self.thordata_pass = thordata_pass
        self.country = country
        self.requests_per_ip = requests_per_ip
        self._request_count = 0
        self._current_session_id = self._new_session_id()
        self._session = None
        self._rotate_session()

    def _new_session_id(self) -> str:
        return f"cf-{random.randint(100000, 999999)}"

    def _get_proxy_url(self) -> str:
        proxy_user = f"{self.thordata_user}-country-{self.country}-session-{self._current_session_id}"
        return f"http://{proxy_user}:{self.thordata_pass}@{self.THORDATA_HOST}:{self.THORDATA_PORT}"

    def _rotate_session(self):
        """Create a new session with fresh browser profile and proxy."""
        if self._session:
            self._session.close()

        profile = random.choice(self.BROWSER_PROFILES)
        self._session = cf_requests.Session(impersonate=profile["impersonate"])
        self._session.headers.update({
            "User-Agent": profile["user_agent"],
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "sec-ch-ua-platform": profile["platform"],
        })
        self._session.proxies = {
            "http": self._get_proxy_url(),
            "https": self._get_proxy_url(),
        }
        self._current_session_id = self._new_session_id()
        self._request_count = 0
        logger.info(f"New session: {profile['impersonate']} via {self.country} residential IP")

    def get(self, url: str, **kwargs) -> cf_requests.Response:
        """Fetch URL, rotating session after requests_per_ip requests."""
        if self._request_count >= self.requests_per_ip:
            logger.info(f"Rotating session after {self._request_count} requests")
            self._rotate_session()
            time.sleep(random.uniform(2, 5))

        kwargs.setdefault("timeout", 30)
        response = self._session.get(url, **kwargs)
        self._request_count += 1
        return response

    def scrape(self, url: str, retries: int = 3) -> BeautifulSoup:
        """Fetch and parse, with automatic retry on soft blocks."""
        for attempt in range(retries):
            try:
                response = self.get(url)

                if response.status_code == 200:
                    soup = BeautifulSoup(response.text, "lxml")

                    # Check for Cloudflare challenge page
                    if self._is_cloudflare_challenge(soup):
                        logger.warning(f"Cloudflare challenge on attempt {attempt + 1}: {url}")
                        self._rotate_session()
                        time.sleep(random.uniform(5, 15))
                        continue

                    return soup

                elif response.status_code in (403, 429, 503):
                    logger.warning(f"HTTP {response.status_code} on attempt {attempt + 1}: {url}")
                    self._rotate_session()
                    time.sleep(random.uniform(10, 30) * (attempt + 1))
                    continue

                else:
                    response.raise_for_status()

            except Exception as e:
                logger.error(f"Error on attempt {attempt + 1} for {url}: {e}")
                if attempt < retries - 1:
                    self._rotate_session()
                    time.sleep(random.uniform(5, 15))
                else:
                    raise

        raise Exception(f"Failed after {retries} attempts: {url}")

    @staticmethod
    def _is_cloudflare_challenge(soup: BeautifulSoup) -> bool:
        """Detect Cloudflare challenge pages."""
        indicators = [
            soup.find("title") and "just a moment" in (soup.find("title").string or "").lower(),
            soup.find("div", id="cf-wrapper"),
            soup.find("div", id="challenge-running"),
            soup.find("div", class_="cf-browser-verification"),
            soup.find("script", src=lambda s: s and "challenges.cloudflare.com" in s),
        ]
        return any(indicators)


# Usage example
scraper = CloudflareScraper(
    thordata_user="your_thordata_username",
    thordata_pass="your_thordata_password",
    country="US",
    requests_per_ip=20,
)

soup = scraper.scrape("https://example-cloudflare-protected.com/products")
products = soup.select(".product-card")
print(f"Found {len(products)} products")

Layer 3 Fix: Playwright with Stealth for JavaScript Challenges

When the site requires JavaScript execution — Turnstile, IUAM (I'm Under Attack Mode) — you need a real browser. The key is patching the automation detection APIs before the page's JavaScript runs.

import asyncio
import random
from playwright.async_api import async_playwright, Page, BrowserContext
from bs4 import BeautifulSoup
import logging

logger = logging.getLogger(__name__)

STEALTH_SCRIPT = """
// Patch webdriver detection
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });

// Add Chrome runtime (missing in headless)
window.chrome = {
    runtime: {
        onMessage: { addListener: () => {} },
        sendMessage: () => {},
    },
    loadTimes: () => ({}),
    csi: () => ({}),
};

// Realistic plugin list
Object.defineProperty(navigator, 'plugins', {
    get: () => {
        const plugins = [
            { name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer', description: 'Portable Document Format' },
            { name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai', description: '' },
            { name: 'Native Client', filename: 'internal-nacl-plugin', description: '' },
        ];
        return Object.create(PluginArray.prototype, 
            Object.fromEntries(plugins.map((p, i) => [i, { value: p, enumerable: true }]).concat([
                ['length', { value: plugins.length }],
                ['item', { value: i => plugins[i] }],
                ['namedItem', { value: name => plugins.find(p => p.name === name) || null }],
                [Symbol.iterator, { value: function*() { yield* plugins; } }]
            ]))
        );
    }
});

// Realistic language settings
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });

// Permissions API — avoid undefined handling
const originalQuery = window.navigator.permissions?.query;
if (originalQuery) {
    window.navigator.permissions.query = (parameters) => (
        parameters.name === 'notifications'
            ? Promise.resolve({ state: Notification.permission })
            : originalQuery(parameters)
    );
}

// Prevent iframe detection
Object.defineProperty(HTMLIFrameElement.prototype, 'contentWindow', {
    get: function() {
        return window;
    }
});
"""


async def create_stealth_context(playwright, proxy_url: str = None) -> BrowserContext:
    """Launch a browser with stealth configuration."""
    browser = await playwright.chromium.launch(
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--disable-infobars",
            "--disable-background-timer-throttling",
            "--disable-backgrounding-occluded-windows",
            "--disable-renderer-backgrounding",
            "--no-first-run",
            "--no-default-browser-check",
            "--window-size=1440,900",
        ]
    )

    context_args = {
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "viewport": {"width": 1440, "height": 900},
        "locale": "en-US",
        "timezone_id": "America/New_York",
        "geolocation": {"latitude": 40.7128, "longitude": -74.0060},
        "permissions": ["geolocation"],
        "color_scheme": "light",
        "extra_http_headers": {
            "Accept-Language": "en-US,en;q=0.9",
        },
    }

    if proxy_url:
        context_args["proxy"] = {"server": proxy_url}

    context = await browser.new_context(**context_args)

    # Inject stealth script into every page before any page script runs
    await context.add_init_script(STEALTH_SCRIPT)

    return context


async def scrape_with_stealth(url: str, proxy_url: str = None, 
                               wait_for_selector: str = None,
                               timeout: int = 30000) -> str:
    """
    Scrape a Cloudflare-protected page using stealth Playwright.
    Returns page HTML after all challenges are resolved.
    """
    async with async_playwright() as p:
        context = await create_stealth_context(p, proxy_url)
        page = await context.new_page()

        # Simulate human mouse movement
        await page.mouse.move(
            random.randint(100, 500),
            random.randint(100, 400)
        )

        try:
            await page.goto(url, wait_until="domcontentloaded", timeout=timeout)

            # Wait for Cloudflare challenge to resolve
            try:
                await page.wait_for_function(
                    """
                    () => !document.querySelector('#challenge-running') &&
                          !document.querySelector('.cf-browser-verification') &&
                          !document.title.toLowerCase().includes('just a moment')
                    """,
                    timeout=20000
                )
            except Exception:
                logger.warning(f"Cloudflare challenge wait timed out for {url}")

            # Wait for the actual content
            if wait_for_selector:
                try:
                    await page.wait_for_selector(wait_for_selector, timeout=15000)
                except Exception:
                    logger.warning(f"Target selector '{wait_for_selector}' not found")
            else:
                await asyncio.sleep(random.uniform(2, 4))

            # Scroll to trigger lazy loading
            await page.evaluate("""
                window.scrollTo({ top: document.body.scrollHeight / 3, behavior: 'smooth' });
            """)
            await asyncio.sleep(1)

            html = await page.content()

        finally:
            await context.close()

    return html


async def scrape_multiple_stealth(urls: list, proxy_url: str = None,
                                    max_concurrent: int = 2) -> list:
    """Scrape multiple Cloudflare-protected URLs with concurrency control."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url: str) -> dict:
        async with semaphore:
            await asyncio.sleep(random.uniform(2, 6))
            try:
                html = await scrape_with_stealth(url, proxy_url)
                soup = BeautifulSoup(html, "lxml")
                return {"url": url, "soup": soup, "error": None}
            except Exception as e:
                return {"url": url, "soup": None, "error": str(e)}

    tasks = [fetch_one(url) for url in urls]
    return await asyncio.gather(*tasks)


# Usage
async def main():
    proxy = "http://user:[email protected]:9000"

    html = await scrape_with_stealth(
        "https://cloudflare-protected-site.com/data",
        proxy_url=proxy,
        wait_for_selector=".data-table"
    )

    soup = BeautifulSoup(html, "lxml")
    rows = soup.select(".data-table tr")
    print(f"Found {len(rows)} table rows")

asyncio.run(main())

Complete Use Case Examples

Use Case 1: Scraping a Cloudflare-Protected E-commerce Site

from curl_cffi import requests as cf_requests
from bs4 import BeautifulSoup
import json
import time
import random

def scrape_ecommerce_products(base_url: str, category_path: str,
                               thordata_user: str, thordata_pass: str,
                               max_pages: int = 20) -> list:
    """
    Scrape product listings from a Cloudflare-protected e-commerce site.
    Uses curl-cffi + rotating residential proxies.
    """
    products = []
    session_id = random.randint(100000, 999999)

    for page in range(1, max_pages + 1):
        proxy_url = f"http://{thordata_user}-country-US-session-cf{session_id}:{thordata_pass}@proxy.thordata.com:9000"

        session = cf_requests.Session(impersonate="chrome120")
        session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
        })
        session.proxies = {"http": proxy_url, "https": proxy_url}

        url = f"{base_url}{category_path}?page={page}"

        try:
            resp = session.get(url, timeout=30)

            if resp.status_code != 200:
                print(f"Page {page}: HTTP {resp.status_code}")
                if resp.status_code in (429, 503):
                    time.sleep(random.uniform(30, 60))
                break

            soup = BeautifulSoup(resp.text, "lxml")

            # Check for Cloudflare block
            if "cf-browser-verification" in resp.text or "just a moment" in resp.text.lower():
                print(f"Page {page}: Cloudflare challenge — rotating session")
                session_id = random.randint(100000, 999999)
                time.sleep(random.uniform(10, 20))
                continue

            # Extract products (adapt selectors per site)
            page_products = []
            for card in soup.select(".product-card, [data-testid='product'], .product-item"):
                name_el = card.select_one("h2, h3, .product-name, .title")
                price_el = card.select_one(".price, .product-price, [data-price]")
                link_el = card.select_one("a[href]")
                img_el = card.select_one("img")

                if not name_el:
                    continue

                page_products.append({
                    "name": name_el.get_text(strip=True),
                    "price": price_el.get_text(strip=True) if price_el else None,
                    "url": link_el.get("href") if link_el else None,
                    "image": img_el.get("src") if img_el else None,
                    "page": page,
                })

            if not page_products:
                print(f"Page {page}: No products found — stopping")
                break

            products.extend(page_products)
            print(f"Page {page}: {len(page_products)} products (total: {len(products)})")

            # Rotate IP every 15-20 requests
            if page % random.randint(15, 20) == 0:
                session_id = random.randint(100000, 999999)

            time.sleep(random.uniform(2, 5))

        except Exception as e:
            print(f"Page {page} error: {e}")
            time.sleep(random.uniform(5, 15))

    return products

Use Case 2: Cloudflare-Protected News Site Scraper

import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import Optional, List
import random

@dataclass
class NewsArticle:
    title: str
    author: Optional[str]
    date: Optional[str]
    body: str
    tags: List[str]
    url: str
    word_count: int

async def scrape_news_site(urls: list, proxy_url: str = None) -> List[NewsArticle]:
    """Scrape articles from Cloudflare-protected news sites."""
    articles = []

    async with async_playwright() as p:
        context = await create_stealth_context(p, proxy_url)

        for url in urls:
            page = await context.new_page()

            try:
                await page.goto(url, wait_until="domcontentloaded", timeout=30000)

                # Wait for challenge resolution
                try:
                    await page.wait_for_function(
                        "() => !document.title.toLowerCase().includes('just a moment')",
                        timeout=15000
                    )
                except Exception:
                    pass

                await asyncio.sleep(random.uniform(1, 3))

                html = await page.content()
                soup = BeautifulSoup(html, "lxml")

                # Clean up noise
                for noise in soup.select("nav, footer, .sidebar, .advertisement, script, style"):
                    noise.decompose()

                # Extract article content
                title = ""
                for sel in ["h1.article-title", "h1.headline", "[itemprop='headline']", "h1"]:
                    el = soup.select_one(sel)
                    if el:
                        title = el.get_text(strip=True)
                        break

                author = ""
                for sel in ["[rel='author']", ".author-name", ".byline", "[class*='author']"]:
                    el = soup.select_one(sel)
                    if el:
                        author = el.get_text(strip=True)
                        break

                date_el = soup.select_one("time[datetime], .published-date, [itemprop='datePublished']")
                date = date_el.get("datetime") or (date_el.get_text(strip=True) if date_el else None)

                # Body text
                body = ""
                for sel in ["article .content", ".article-body", ".story-body", "article"]:
                    el = soup.select_one(sel)
                    if el:
                        paras = el.find_all("p")
                        body = " ".join(p.get_text(strip=True) for p in paras if len(p.get_text()) > 30)
                        if body:
                            break

                tags = [t.get_text(strip=True) for t in soup.select(".tag, .topic-tag, .article-tag")]

                articles.append(NewsArticle(
                    title=title,
                    author=author,
                    date=date,
                    body=body,
                    tags=tags[:10],
                    url=url,
                    word_count=len(body.split()),
                ))

                print(f"✓ {title[:60]}... ({len(body.split())} words)")

            except Exception as e:
                print(f"✗ {url}: {e}")

            finally:
                await page.close()

            await asyncio.sleep(random.uniform(3, 8))

        await context.close()

    return articles

Use Case 3: Cloudflare-Protected Price Comparison Scraper

from curl_cffi import requests as cf_requests
from bs4 import BeautifulSoup
import re
import json
from typing import Optional

def scrape_price_data(product_url: str, thordata_user: str, thordata_pass: str) -> dict:
    """
    Extract price, stock status, and variants from a Cloudflare-protected 
    product page. Tries JSON-LD structured data first (faster), falls back 
    to HTML parsing.
    """
    proxy_url = f"http://{thordata_user}-country-US:{thordata_pass}@proxy.thordata.com:9000"

    session = cf_requests.Session(impersonate="chrome120")
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    })
    session.proxies = {"http": proxy_url, "https": proxy_url}

    resp = session.get(product_url, timeout=30)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "lxml")

    # Try JSON-LD structured data first — most reliable
    result = {"url": product_url, "source": "unknown"}

    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string)

            # Handle @graph arrays
            if isinstance(data, dict) and "@graph" in data:
                items = data["@graph"]
            elif isinstance(data, list):
                items = data
            else:
                items = [data]

            for item in items:
                if item.get("@type") in ("Product", "ItemPage"):
                    offers = item.get("offers", {})
                    if isinstance(offers, list):
                        offers = offers[0] if offers else {}

                    result.update({
                        "name": item.get("name"),
                        "brand": item.get("brand", {}).get("name") if isinstance(item.get("brand"), dict) else item.get("brand"),
                        "sku": item.get("sku") or item.get("mpn"),
                        "price": offers.get("price"),
                        "currency": offers.get("priceCurrency"),
                        "availability": offers.get("availability", "").split("/")[-1] if offers.get("availability") else None,
                        "rating": item.get("aggregateRating", {}).get("ratingValue"),
                        "review_count": item.get("aggregateRating", {}).get("reviewCount"),
                        "source": "json-ld",
                    })
                    return result
        except (json.JSONDecodeError, AttributeError):
            continue

    # Fallback: HTML scraping
    name_el = soup.select_one("h1.product-title, h1.product-name, #productTitle, h1[itemprop='name']")
    price_text = ""
    for sel in [".price", ".product-price", "[itemprop='price']", "#priceblock_ourprice", ".offer-price"]:
        el = soup.select_one(sel)
        if el:
            price_text = el.get_text(strip=True)
            break

    price = None
    if price_text:
        match = re.search(r"[\d,]+\.?\d*", price_text.replace(",", ""))
        if match:
            try:
                price = float(match.group())
            except ValueError:
                pass

    result.update({
        "name": name_el.get_text(strip=True) if name_el else None,
        "price": price,
        "currency": "USD",
        "source": "html-scrape",
    })
    return result

CAPTCHA Handling Strategies

Turnstile (Cloudflare's CAPTCHA)

Cloudflare's Turnstile replaced many hCaptcha deployments. It is designed to be invisible for legitimate users. With a real browser and a residential IP, it typically resolves automatically:

async def handle_turnstile(page, timeout: int = 20000) -> bool:
    """
    Wait for Cloudflare Turnstile to auto-resolve.
    Returns True if resolved, False if timed out.
    """
    turnstile_selectors = [
        "iframe[src*='challenges.cloudflare.com']",
        "[data-sitekey]",
        ".cf-turnstile",
    ]

    for selector in turnstile_selectors:
        frame = await page.query_selector(selector)
        if not frame:
            continue

        print("Turnstile detected, waiting for auto-resolution...")

        # Turnstile resolves based on browser behavior analysis
        # With stealth Playwright + residential proxy, it usually passes
        try:
            await page.wait_for_function(
                """
                () => {
                    const input = document.querySelector('[name="cf-turnstile-response"]');
                    return input && input.value && input.value.length > 0;
                }
                """,
                timeout=timeout
            )
            print("Turnstile resolved")
            return True
        except Exception:
            print(f"Turnstile not resolved within {timeout/1000}s")
            return False

    return True  # No Turnstile found


async def scrape_with_turnstile_handling(url: str, proxy_url: str = None) -> str:
    """Scrape URL, handling Turnstile challenges."""
    async with async_playwright() as p:
        context = await create_stealth_context(p, proxy_url)
        page = await context.new_page()

        await page.goto(url, wait_until="domcontentloaded", timeout=30000)

        # Handle Turnstile if present
        resolved = await handle_turnstile(page)
        if not resolved:
            # If Turnstile failed, try a different proxy
            print("Turnstile failed — try different residential proxy or slow down")
            await context.close()
            return ""

        # Wait for content to load after challenge
        await asyncio.sleep(random.uniform(2, 4))
        content = await page.content()
        await context.close()
        return content

Rate Limiting Strategy

Getting through Cloudflare's initial check is only half the problem. The origin site's own rate limiting still applies, and Cloudflare also applies behavioral rate limiting at volume.

import time
import random
from collections import deque

class RateLimiter:
    """Token bucket rate limiter for scraping within Cloudflare's tolerance."""

    def __init__(self, requests_per_minute: int = 20, burst_size: int = 5):
        self.requests_per_minute = requests_per_minute
        self.burst_size = burst_size
        self.min_interval = 60.0 / requests_per_minute
        self.request_times = deque()

    def wait(self):
        """Block until rate limit allows another request."""
        now = time.time()

        # Remove old timestamps outside the 60-second window
        while self.request_times and now - self.request_times[0] > 60:
            self.request_times.popleft()

        if len(self.request_times) >= self.requests_per_minute:
            sleep_time = 60 - (now - self.request_times[0])
            if sleep_time > 0:
                time.sleep(sleep_time)
            now = time.time()

        # Add natural variance to timing
        if self.request_times:
            time_since_last = now - self.request_times[-1]
            if time_since_last < self.min_interval:
                variance = random.uniform(0, self.min_interval * 0.5)
                time.sleep(self.min_interval - time_since_last + variance)

        self.request_times.append(time.time())


# Recommended settings per Cloudflare protection tier
RATE_LIMITS = {
    "low": RateLimiter(requests_per_minute=60, burst_size=10),       # Standard Cloudflare
    "medium": RateLimiter(requests_per_minute=30, burst_size=5),     # Pro/Business tier
    "high": RateLimiter(requests_per_minute=10, burst_size=2),       # Enterprise / aggressive
    "extreme": RateLimiter(requests_per_minute=3, burst_size=1),     # Turnstile / challenge mode
}

Decision Framework: Choosing Your Approach

Protection Level	Symptoms	Best Approach	Proxy Type
No protection	Works with `requests`	Plain `requests`	None needed
Standard Cloudflare	403 from datacenter IPs	`curl-cffi` + residential proxy	Residential
Medium (JS challenge)	JavaScript loop, 503	Playwright stealth + residential	Residential
High (IUAM mode)	Constant challenge, 5s wait	Playwright + slow human simulation	Premium residential
Turnstile interactive	CAPTCHA checkbox	Playwright stealth + residential	Premium residential
Enterprise / WAF	All above + behavioral blocks	Consider alternative data source or API	N/A

For residential proxies, ThorData provides rotating residential IPs with country targeting starting at low cost per GB, with a pool covering 190+ countries. For Cloudflare bypass specifically, US and EU residential IPs have the highest success rates.

Common Error Codes

Code	Meaning	Fix
403 / 1020	IP or ASN blocked	Switch to residential proxy
503	JavaScript challenge pending	Use Playwright with stealth
429	Rate limited	Slow down, respect `Retry-After` header
520-530	Origin server errors (not Cloudflare)	Site-specific issue unrelated to bot detection
Empty body / connection reset	TLS fingerprint blocked	Switch from `requests` to `curl-cffi`
Infinite redirect	Session/cookie issue	Use a persistent session, clear cookies

Output Schema Example

from dataclasses import dataclass, field
from typing import Optional, List
from datetime import datetime

@dataclass
class ScrapedPage:
    url: str
    status_code: int
    scraped_at: str
    bypass_method: str  # "curl-cffi", "playwright-stealth", "direct"
    proxy_country: Optional[str]
    cloudflare_challenge: bool
    html_length: int
    data: dict = field(default_factory=dict)
    error: Optional[str] = None

# Example output
example = ScrapedPage(
    url="https://cloudflare-site.com/products",
    status_code=200,
    scraped_at=datetime.utcnow().isoformat(),
    bypass_method="curl-cffi",
    proxy_country="US",
    cloudflare_challenge=False,
    html_length=145230,
    data={"products": 48, "pages_scraped": 3},
)

Summary

Cloudflare protection in 2026 operates across five independent layers. Each must be addressed or the scraper fails. Start with curl-cffi for the TLS fingerprint layer — it handles 80% of Cloudflare-protected sites without a full browser. Add residential proxy rotation via ThorData for IP reputation. Add Playwright stealth for JavaScript challenges. Implement proper rate limiting to stay under behavioral detection thresholds.

Before building any of this infrastructure, spend five minutes in DevTools checking whether the target data is available via a less-protected API endpoint. Many Cloudflare-protected sites have a mobile API or JSON feed that bypasses the web-tier protection entirely.