Scraping Cloudflare-Protected Sites in 2026
Cloudflare sits in front of roughly 20% of all websites on the internet. If you have been doing any serious web scraping in the past two years, you have run into its bot detection system — usually appearing as a 403 Forbidden response, a 1020 Access Denied error page, or an infinite JavaScript challenge loop that never resolves into the content you need.
Understanding why Cloudflare blocks scrapers requires understanding what it is actually analyzing. Cloudflare is not just checking whether you send a convincing User-Agent header. By the time your first request header arrives, Cloudflare has already formed a preliminary opinion about you based on your IP address reputation, the TLS handshake signature your library presents during connection setup, and the HTTP/2 connection fingerprint your client produces. Before a single line of HTML is served, three independent detection channels have already run.
This creates a layered detection problem. Each layer needs to be addressed separately. A scraper that fixes the IP layer but ignores the TLS layer will still fail. One that fixes both but runs inside a headless Chrome with detectable automation flags will fail at the JavaScript challenge layer. Getting through Cloudflare reliably in 2026 means understanding each detection layer and addressing them in sequence.
This guide covers all of it: what Cloudflare is checking, what fails, what works, complete code examples for each approach, proxy rotation with residential IPs, CAPTCHA handling strategies, rate limiting, and a decision framework for choosing the right tool per protection tier. Every code example is production-ready Python.
One important caveat before diving in: this guide focuses on legally and ethically appropriate use cases — price monitoring, publicly accessible data collection, research, and competitive intelligence on data you have a legitimate interest in accessing. Cloudflare's bot protection exists for good reasons, and its terms of service must be respected. Always check the target site's terms before scraping.
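On that note, Python's standard library can evaluate a site's robots.txt rules directly, which is a cheap first gate before any scraping logic runs. A minimal sketch using urllib.robotparser (the user agent string is a placeholder; pass the robots.txt content you have already fetched):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, agent: str = "my-scraper") -> bool:
    """Check a URL against already-fetched robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

rules = "User-agent: *\nDisallow: /private/"
print(is_allowed(rules, "https://example.com/products"))   # True
print(is_allowed(rules, "https://example.com/private/x"))  # False
```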
What Cloudflare Is Actually Checking
Layer 1: IP Reputation (Pre-TLS)
The first check happens before any HTTP is exchanged. Cloudflare maintains constantly updated databases of IP address reputation. Datacenter IP ranges — AWS, GCP, DigitalOcean, Hetzner, Vultr, Linode, and hundreds of smaller hosting providers — are pre-scored as high-suspicion. Many are outright blocked on Cloudflare's higher protection tiers without any challenge.
This is why a scraper that works perfectly on your local machine may fail completely when deployed to a cloud server. Your home IP has a residential ASN. Your cloud server has a datacenter ASN. Same code, completely different treatment.
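You can see this distinction in your own infrastructure by classifying the ASN behind each egress IP. A toy sketch (the ASN table below is an illustrative subset, not exhaustive; a real pipeline would consult a full ASN database such as MaxMind's GeoLite2-ASN):

```python
# A few well-known datacenter ASNs (illustrative subset, not exhaustive)
DATACENTER_ASNS = {
    "AS16509": "Amazon AWS",
    "AS15169": "Google",
    "AS14061": "DigitalOcean",
    "AS24940": "Hetzner",
}

def ip_reputation_hint(asn: str) -> str:
    """Rough pre-flight guess at how Cloudflare will score an egress IP."""
    if asn in DATACENTER_ASNS:
        return f"high-suspicion (datacenter: {DATACENTER_ASNS[asn]})"
    return "unknown - possibly residential or mobile ASN"

print(ip_reputation_hint("AS14061"))
```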
Layer 2: TLS Fingerprint (JA3/JA4)
TLS fingerprinting is the most commonly misunderstood detection layer. When your HTTP library makes an HTTPS connection, it performs a TLS handshake with the server. The parameters of that handshake — which cipher suites your client supports, in what order, which TLS extensions are included, the elliptic curve preferences — form a unique signature called a JA3 fingerprint (and its successor, JA4).
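The JA3 string itself is simple to compute: it is the MD5 of five comma-separated fields (TLS version, cipher suites, extensions, elliptic curves, point formats), with each list dash-joined. A sketch with made-up handshake values, just to show the mechanics:

```python
import hashlib

def ja3_digest(version: int, ciphers: list[int], extensions: list[int],
               curves: list[int], point_formats: list[int]) -> str:
    """Compute a JA3 hash from decoded ClientHello fields."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode()).hexdigest()

# Example values only, not a real browser's handshake
print(ja3_digest(771, [4865, 4866], [0, 11, 10], [29, 23], [0]))
```

Because the hash covers the entire cipher and extension ordering, two clients that differ in even one list position produce completely different fingerprints, which is why header spoofing alone cannot hide a Python client.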
Python's requests library, based on urllib3, has a distinctive JA3 fingerprint. Even if you set a convincing Chrome user agent, the TLS fingerprint immediately identifies you as Python/urllib3. Cloudflare has known about this for years and blocks the requests fingerprint on aggressively configured zones.
The solution is curl-cffi, a Python library that wraps libcurl with configurable TLS settings, allowing you to produce a TLS fingerprint that matches real Chrome, Safari, or Firefox.
Layer 3: HTTP/2 Fingerprint
Similar to JA3, HTTP/2 connection parameters form a fingerprint. The SETTINGS frame values, initial window sizes, HEADERS frame ordering, and stream priority weights differ between browser implementations and Python HTTP libraries.
Libraries like httpx send HTTP/2 SETTINGS frames that are characteristic of Python clients. curl-cffi handles this too, producing HTTP/2 fingerprints that match real browsers.
Layer 4: JavaScript Challenge / Browser Fingerprint
If your request passes the IP and TLS layers (or if the site's protection tier doesn't check them), Cloudflare may still serve a JavaScript challenge page. This challenge runs in the browser and checks:
- navigator.webdriver — is this an automated browser?
- Canvas fingerprint — does the rendered canvas match a known browser/OS combination?
- WebGL renderer string — what GPU is reported?
- Font enumeration — which system fonts are installed?
- Chrome runtime API presence
- Time taken to complete the challenge (too fast = bot)
- Mouse movement entropy (for Turnstile)
Standard Playwright/Selenium without stealth patches will fail here because navigator.webdriver is true by default in headless automation contexts.
Layer 5: Behavioral Signals
At higher protection tiers and for volume scraping, Cloudflare tracks behavioral patterns: the speed at which you navigate between pages, whether you follow links in the same way humans do, session duration, the ratio of crawled pages to browsing time. These signals matter less for occasional scraping but become significant for high-volume scrapers.
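For behavioral plausibility, avoid fixed sleeps between pages; human inter-page times are roughly log-normal, not uniform. A sketch of a dwell-time sampler (the distribution parameters here are assumptions to tune per site, not measured values):

```python
import math
import random

def human_dwell_seconds(median: float = 4.0, sigma: float = 0.6,
                        cap: float = 45.0) -> float:
    """Sample a page dwell time from a capped log-normal distribution."""
    return min(cap, random.lognormvariate(math.log(median), sigma))
```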
What Does Not Work in 2026
requests with a fake user agent: Gets blocked at IP or TLS layer on any site with medium or higher Cloudflare protection. The JA3 fingerprint of Python's requests is widely known.
cloudscraper library: This library reverse-engineers Cloudflare's JavaScript challenge and solves it in Python. Cloudflare has rotated the challenge-generation algorithm multiple times; cloudscraper works for weeks, then breaks overnight when the algorithm changes again.
Selenium without stealth: navigator.webdriver = true is trivially detectable. Even with this patched, Chrome DevTools Protocol connections have their own fingerprint.
Standard datacenter proxies: Most datacenter IP ranges are blocked regardless of TLS or browser fingerprint on sites with aggressive Cloudflare configuration. This includes proxies from popular datacenter providers.
Rotating user agents with requests: If your TLS fingerprint says Python, changing the User-Agent header to Chrome does nothing. Cloudflare does not rely on User-Agent as a primary signal — it is too easy to fake.
What Works: The Layered Approach
Layer 1 Fix: curl-cffi for TLS Fingerprint Spoofing
curl-cffi is the primary tool for getting past TLS-based detection without running a full browser. Install it and replace requests with it:
from curl_cffi import requests as cf_requests
import time
import random
def create_cloudflare_session(impersonate: str = "chrome120") -> cf_requests.Session:
"""Create a session that impersonates a real browser at the TLS level."""
session = cf_requests.Session(impersonate=impersonate)
session.headers.update({
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"sec-ch-ua": '"Google Chrome";v="120", "Not(A:Brand";v="24", "Chromium";v="120"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
})
return session
# Supported impersonation targets
IMPERSONATE_OPTIONS = [
"chrome99", "chrome100", "chrome101", "chrome104", "chrome107",
"chrome110", "chrome116", "chrome119", "chrome120",
"safari15_3", "safari15_5", "safari16", "safari17_0",
"firefox99", "firefox102", "firefox110",
"edge99", "edge101",
]
def fetch_cloudflare_page(url: str, proxy_url: str = None) -> str:
"""Fetch a Cloudflare-protected page using TLS fingerprint impersonation."""
# Randomize the browser version for variety
impersonate = random.choice(["chrome116", "chrome119", "chrome120", "safari17_0"])
session = create_cloudflare_session(impersonate)
if proxy_url:
session.proxies = {"http": proxy_url, "https": proxy_url}
try:
response = session.get(url, timeout=30)
if response.status_code == 200:
return response.text
elif response.status_code == 403:
raise Exception(f"Blocked (403) — try residential proxy: {url}")
elif response.status_code == 429:
retry_after = int(response.headers.get("Retry-After", 60))
time.sleep(retry_after)
raise Exception(f"Rate limited, waited {retry_after}s")
else:
raise Exception(f"HTTP {response.status_code} for {url}")
except cf_requests.RequestsError as e:
raise Exception(f"curl-cffi error: {e}")
Layer 2 Fix: Residential Proxies for IP Reputation
Even with perfect TLS impersonation, datacenter IPs fail on aggressively configured Cloudflare zones. You need residential proxies — IP addresses from real consumer ISP connections.
ThorData provides rotating residential proxies with country-level targeting. Here is a complete integration combining curl-cffi with ThorData for maximum Cloudflare bypass effectiveness:
from curl_cffi import requests as cf_requests
from bs4 import BeautifulSoup
import random
import time
import logging
logger = logging.getLogger(__name__)
class CloudflareScraper:
"""
Production-ready scraper for Cloudflare-protected sites.
Combines TLS fingerprint impersonation (curl-cffi) with
residential proxy rotation (ThorData).
"""
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
BROWSER_PROFILES = [
{
"impersonate": "chrome120",
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"platform": '"Windows"',
},
{
"impersonate": "chrome119",
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
"platform": '"macOS"',
},
{
"impersonate": "safari17_0",
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
"platform": '"macOS"',
},
]
def __init__(self, thordata_user: str, thordata_pass: str,
country: str = "US", requests_per_ip: int = 30):
self.thordata_user = thordata_user
self.thordata_pass = thordata_pass
self.country = country
self.requests_per_ip = requests_per_ip
self._request_count = 0
self._current_session_id = self._new_session_id()
self._session = None
self._rotate_session()
def _new_session_id(self) -> str:
return f"cf-{random.randint(100000, 999999)}"
def _get_proxy_url(self) -> str:
proxy_user = f"{self.thordata_user}-country-{self.country}-session-{self._current_session_id}"
return f"http://{proxy_user}:{self.thordata_pass}@{self.THORDATA_HOST}:{self.THORDATA_PORT}"
def _rotate_session(self):
"""Create a new session with fresh browser profile and proxy."""
if self._session:
self._session.close()
profile = random.choice(self.BROWSER_PROFILES)
self._session = cf_requests.Session(impersonate=profile["impersonate"])
self._session.headers.update({
"User-Agent": profile["user_agent"],
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"sec-ch-ua-platform": profile["platform"],
})
self._session.proxies = {
"http": self._get_proxy_url(),
"https": self._get_proxy_url(),
}
self._current_session_id = self._new_session_id()
self._request_count = 0
logger.info(f"New session: {profile['impersonate']} via {self.country} residential IP")
def get(self, url: str, **kwargs) -> cf_requests.Response:
"""Fetch URL, rotating session after requests_per_ip requests."""
if self._request_count >= self.requests_per_ip:
logger.info(f"Rotating session after {self._request_count} requests")
self._rotate_session()
time.sleep(random.uniform(2, 5))
kwargs.setdefault("timeout", 30)
response = self._session.get(url, **kwargs)
self._request_count += 1
return response
def scrape(self, url: str, retries: int = 3) -> BeautifulSoup:
"""Fetch and parse, with automatic retry on soft blocks."""
for attempt in range(retries):
try:
response = self.get(url)
if response.status_code == 200:
soup = BeautifulSoup(response.text, "lxml")
# Check for Cloudflare challenge page
if self._is_cloudflare_challenge(soup):
logger.warning(f"Cloudflare challenge on attempt {attempt + 1}: {url}")
self._rotate_session()
time.sleep(random.uniform(5, 15))
continue
return soup
elif response.status_code in (403, 429, 503):
logger.warning(f"HTTP {response.status_code} on attempt {attempt + 1}: {url}")
self._rotate_session()
time.sleep(random.uniform(10, 30) * (attempt + 1))
continue
else:
response.raise_for_status()
except Exception as e:
logger.error(f"Error on attempt {attempt + 1} for {url}: {e}")
if attempt < retries - 1:
self._rotate_session()
time.sleep(random.uniform(5, 15))
else:
raise
raise Exception(f"Failed after {retries} attempts: {url}")
@staticmethod
def _is_cloudflare_challenge(soup: BeautifulSoup) -> bool:
"""Detect Cloudflare challenge pages."""
indicators = [
soup.find("title") and "just a moment" in (soup.find("title").string or "").lower(),
soup.find("div", id="cf-wrapper"),
soup.find("div", id="challenge-running"),
soup.find("div", class_="cf-browser-verification"),
soup.find("script", src=lambda s: s and "challenges.cloudflare.com" in s),
]
return any(indicators)
# Usage example
scraper = CloudflareScraper(
thordata_user="your_thordata_username",
thordata_pass="your_thordata_password",
country="US",
requests_per_ip=20,
)
soup = scraper.scrape("https://example-cloudflare-protected.com/products")
products = soup.select(".product-card")
print(f"Found {len(products)} products")
Layer 3 Fix: Playwright with Stealth for JavaScript Challenges
When the site requires JavaScript execution — Turnstile, IUAM (I'm Under Attack Mode) — you need a real browser. The key is patching the automation detection APIs before the page's JavaScript runs.
import asyncio
import random
from playwright.async_api import async_playwright, Page, BrowserContext
from bs4 import BeautifulSoup
import logging
logger = logging.getLogger(__name__)
STEALTH_SCRIPT = """
// Patch webdriver detection
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
// Add Chrome runtime (missing in headless)
window.chrome = {
runtime: {
onMessage: { addListener: () => {} },
sendMessage: () => {},
},
loadTimes: () => ({}),
csi: () => ({}),
};
// Realistic plugin list
Object.defineProperty(navigator, 'plugins', {
get: () => {
const plugins = [
{ name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer', description: 'Portable Document Format' },
{ name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai', description: '' },
{ name: 'Native Client', filename: 'internal-nacl-plugin', description: '' },
];
return Object.create(PluginArray.prototype,
Object.fromEntries(plugins.map((p, i) => [i, { value: p, enumerable: true }]).concat([
['length', { value: plugins.length }],
['item', { value: i => plugins[i] }],
['namedItem', { value: name => plugins.find(p => p.name === name) || null }],
[Symbol.iterator, { value: function*() { yield* plugins; } }]
]))
);
}
});
// Realistic language settings
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
// Permissions API — avoid undefined handling
const originalQuery = window.navigator.permissions?.query;
if (originalQuery) {
window.navigator.permissions.query = (parameters) => (
parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: originalQuery(parameters)
);
}
// Prevent iframe detection
Object.defineProperty(HTMLIFrameElement.prototype, 'contentWindow', {
get: function() {
return window;
}
});
"""
async def create_stealth_context(playwright, proxy_url: str = None) -> BrowserContext:
"""Launch a browser with stealth configuration."""
browser = await playwright.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--disable-infobars",
"--disable-background-timer-throttling",
"--disable-backgrounding-occluded-windows",
"--disable-renderer-backgrounding",
"--no-first-run",
"--no-default-browser-check",
"--window-size=1440,900",
]
)
context_args = {
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"viewport": {"width": 1440, "height": 900},
"locale": "en-US",
"timezone_id": "America/New_York",
"geolocation": {"latitude": 40.7128, "longitude": -74.0060},
"permissions": ["geolocation"],
"color_scheme": "light",
"extra_http_headers": {
"Accept-Language": "en-US,en;q=0.9",
},
}
if proxy_url:
context_args["proxy"] = {"server": proxy_url}
context = await browser.new_context(**context_args)
# Inject stealth script into every page before any page script runs
await context.add_init_script(STEALTH_SCRIPT)
return context
async def scrape_with_stealth(url: str, proxy_url: str = None,
wait_for_selector: str = None,
timeout: int = 30000) -> str:
"""
Scrape a Cloudflare-protected page using stealth Playwright.
Returns page HTML after all challenges are resolved.
"""
async with async_playwright() as p:
context = await create_stealth_context(p, proxy_url)
page = await context.new_page()
# Simulate human mouse movement
await page.mouse.move(
random.randint(100, 500),
random.randint(100, 400)
)
try:
await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
# Wait for Cloudflare challenge to resolve
try:
await page.wait_for_function(
"""
() => !document.querySelector('#challenge-running') &&
!document.querySelector('.cf-browser-verification') &&
!document.title.toLowerCase().includes('just a moment')
""",
timeout=20000
)
except Exception:
logger.warning(f"Cloudflare challenge wait timed out for {url}")
# Wait for the actual content
if wait_for_selector:
try:
await page.wait_for_selector(wait_for_selector, timeout=15000)
except Exception:
logger.warning(f"Target selector '{wait_for_selector}' not found")
else:
await asyncio.sleep(random.uniform(2, 4))
# Scroll to trigger lazy loading
await page.evaluate("""
window.scrollTo({ top: document.body.scrollHeight / 3, behavior: 'smooth' });
""")
await asyncio.sleep(1)
html = await page.content()
finally:
await context.close()
return html
async def scrape_multiple_stealth(urls: list, proxy_url: str = None,
max_concurrent: int = 2) -> list:
"""Scrape multiple Cloudflare-protected URLs with concurrency control."""
semaphore = asyncio.Semaphore(max_concurrent)
async def fetch_one(url: str) -> dict:
async with semaphore:
await asyncio.sleep(random.uniform(2, 6))
try:
html = await scrape_with_stealth(url, proxy_url)
soup = BeautifulSoup(html, "lxml")
return {"url": url, "soup": soup, "error": None}
except Exception as e:
return {"url": url, "soup": None, "error": str(e)}
tasks = [fetch_one(url) for url in urls]
return await asyncio.gather(*tasks)
# Usage
async def main():
proxy = "http://user:[email protected]:9000"
html = await scrape_with_stealth(
"https://cloudflare-protected-site.com/data",
proxy_url=proxy,
wait_for_selector=".data-table"
)
soup = BeautifulSoup(html, "lxml")
rows = soup.select(".data-table tr")
print(f"Found {len(rows)} table rows")
asyncio.run(main())
Complete Use Case Examples
Use Case 1: Scraping a Cloudflare-Protected E-commerce Site
from curl_cffi import requests as cf_requests
from bs4 import BeautifulSoup
import json
import time
import random
def scrape_ecommerce_products(base_url: str, category_path: str,
thordata_user: str, thordata_pass: str,
max_pages: int = 20) -> list:
"""
Scrape product listings from a Cloudflare-protected e-commerce site.
Uses curl-cffi + rotating residential proxies.
"""
products = []
session_id = random.randint(100000, 999999)
for page in range(1, max_pages + 1):
proxy_url = f"http://{thordata_user}-country-US-session-cf{session_id}:{thordata_pass}@proxy.thordata.com:9000"
session = cf_requests.Session(impersonate="chrome120")
session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
})
session.proxies = {"http": proxy_url, "https": proxy_url}
url = f"{base_url}{category_path}?page={page}"
try:
resp = session.get(url, timeout=30)
            if resp.status_code != 200:
                print(f"Page {page}: HTTP {resp.status_code}")
                if resp.status_code in (429, 503):
                    # Back off and retry this page instead of abandoning the run
                    time.sleep(random.uniform(30, 60))
                    continue
                break
soup = BeautifulSoup(resp.text, "lxml")
# Check for Cloudflare block
if "cf-browser-verification" in resp.text or "just a moment" in resp.text.lower():
print(f"Page {page}: Cloudflare challenge — rotating session")
session_id = random.randint(100000, 999999)
time.sleep(random.uniform(10, 20))
continue
# Extract products (adapt selectors per site)
page_products = []
for card in soup.select(".product-card, [data-testid='product'], .product-item"):
name_el = card.select_one("h2, h3, .product-name, .title")
price_el = card.select_one(".price, .product-price, [data-price]")
link_el = card.select_one("a[href]")
img_el = card.select_one("img")
if not name_el:
continue
page_products.append({
"name": name_el.get_text(strip=True),
"price": price_el.get_text(strip=True) if price_el else None,
"url": link_el.get("href") if link_el else None,
"image": img_el.get("src") if img_el else None,
"page": page,
})
if not page_products:
print(f"Page {page}: No products found — stopping")
break
products.extend(page_products)
print(f"Page {page}: {len(page_products)} products (total: {len(products)})")
# Rotate IP every 15-20 requests
if page % random.randint(15, 20) == 0:
session_id = random.randint(100000, 999999)
time.sleep(random.uniform(2, 5))
except Exception as e:
print(f"Page {page} error: {e}")
time.sleep(random.uniform(5, 15))
return products
Use Case 2: Cloudflare-Protected News Site Scraper
import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import Optional, List
import random
@dataclass
class NewsArticle:
title: str
author: Optional[str]
date: Optional[str]
body: str
tags: List[str]
url: str
word_count: int
async def scrape_news_site(urls: list, proxy_url: str = None) -> List[NewsArticle]:
"""Scrape articles from Cloudflare-protected news sites."""
articles = []
async with async_playwright() as p:
context = await create_stealth_context(p, proxy_url)
for url in urls:
page = await context.new_page()
try:
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
# Wait for challenge resolution
try:
await page.wait_for_function(
"() => !document.title.toLowerCase().includes('just a moment')",
timeout=15000
)
except Exception:
pass
await asyncio.sleep(random.uniform(1, 3))
html = await page.content()
soup = BeautifulSoup(html, "lxml")
# Clean up noise
for noise in soup.select("nav, footer, .sidebar, .advertisement, script, style"):
noise.decompose()
# Extract article content
title = ""
for sel in ["h1.article-title", "h1.headline", "[itemprop='headline']", "h1"]:
el = soup.select_one(sel)
if el:
title = el.get_text(strip=True)
break
author = ""
for sel in ["[rel='author']", ".author-name", ".byline", "[class*='author']"]:
el = soup.select_one(sel)
if el:
author = el.get_text(strip=True)
break
date_el = soup.select_one("time[datetime], .published-date, [itemprop='datePublished']")
                date = (date_el.get("datetime") or date_el.get_text(strip=True)) if date_el else None
# Body text
body = ""
for sel in ["article .content", ".article-body", ".story-body", "article"]:
el = soup.select_one(sel)
if el:
paras = el.find_all("p")
body = " ".join(p.get_text(strip=True) for p in paras if len(p.get_text()) > 30)
if body:
break
tags = [t.get_text(strip=True) for t in soup.select(".tag, .topic-tag, .article-tag")]
articles.append(NewsArticle(
title=title,
author=author,
date=date,
body=body,
tags=tags[:10],
url=url,
word_count=len(body.split()),
))
print(f"✓ {title[:60]}... ({len(body.split())} words)")
except Exception as e:
print(f"✗ {url}: {e}")
finally:
await page.close()
await asyncio.sleep(random.uniform(3, 8))
await context.close()
return articles
Use Case 3: Cloudflare-Protected Price Comparison Scraper
from curl_cffi import requests as cf_requests
from bs4 import BeautifulSoup
import re
import json
from typing import Optional
def scrape_price_data(product_url: str, thordata_user: str, thordata_pass: str) -> dict:
"""
Extract price, stock status, and variants from a Cloudflare-protected
product page. Tries JSON-LD structured data first (faster), falls back
to HTML parsing.
"""
proxy_url = f"http://{thordata_user}-country-US:{thordata_pass}@proxy.thordata.com:9000"
session = cf_requests.Session(impersonate="chrome120")
session.headers.update({
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
})
session.proxies = {"http": proxy_url, "https": proxy_url}
resp = session.get(product_url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
# Try JSON-LD structured data first — most reliable
result = {"url": product_url, "source": "unknown"}
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
# Handle @graph arrays
if isinstance(data, dict) and "@graph" in data:
items = data["@graph"]
elif isinstance(data, list):
items = data
else:
items = [data]
for item in items:
if item.get("@type") in ("Product", "ItemPage"):
offers = item.get("offers", {})
if isinstance(offers, list):
offers = offers[0] if offers else {}
result.update({
"name": item.get("name"),
"brand": item.get("brand", {}).get("name") if isinstance(item.get("brand"), dict) else item.get("brand"),
"sku": item.get("sku") or item.get("mpn"),
"price": offers.get("price"),
"currency": offers.get("priceCurrency"),
"availability": offers.get("availability", "").split("/")[-1] if offers.get("availability") else None,
"rating": item.get("aggregateRating", {}).get("ratingValue"),
"review_count": item.get("aggregateRating", {}).get("reviewCount"),
"source": "json-ld",
})
return result
        except (json.JSONDecodeError, TypeError, AttributeError):
continue
# Fallback: HTML scraping
name_el = soup.select_one("h1.product-title, h1.product-name, #productTitle, h1[itemprop='name']")
price_text = ""
for sel in [".price", ".product-price", "[itemprop='price']", "#priceblock_ourprice", ".offer-price"]:
el = soup.select_one(sel)
if el:
price_text = el.get_text(strip=True)
break
price = None
if price_text:
match = re.search(r"[\d,]+\.?\d*", price_text.replace(",", ""))
if match:
try:
price = float(match.group())
except ValueError:
pass
result.update({
"name": name_el.get_text(strip=True) if name_el else None,
"price": price,
"currency": "USD",
"source": "html-scrape",
})
return result
CAPTCHA Handling Strategies
Turnstile (Cloudflare's CAPTCHA)
Cloudflare's Turnstile replaced many hCaptcha deployments. It is designed to be invisible for legitimate users. With a real browser and a residential IP, it typically resolves automatically:
async def handle_turnstile(page, timeout: int = 20000) -> bool:
"""
Wait for Cloudflare Turnstile to auto-resolve.
Returns True if resolved, False if timed out.
"""
turnstile_selectors = [
"iframe[src*='challenges.cloudflare.com']",
"[data-sitekey]",
".cf-turnstile",
]
for selector in turnstile_selectors:
frame = await page.query_selector(selector)
if not frame:
continue
print("Turnstile detected, waiting for auto-resolution...")
# Turnstile resolves based on browser behavior analysis
# With stealth Playwright + residential proxy, it usually passes
try:
await page.wait_for_function(
"""
() => {
const input = document.querySelector('[name="cf-turnstile-response"]');
return input && input.value && input.value.length > 0;
}
""",
timeout=timeout
)
print("Turnstile resolved")
return True
except Exception:
print(f"Turnstile not resolved within {timeout/1000}s")
return False
return True # No Turnstile found
async def scrape_with_turnstile_handling(url: str, proxy_url: str = None) -> str:
"""Scrape URL, handling Turnstile challenges."""
async with async_playwright() as p:
context = await create_stealth_context(p, proxy_url)
page = await context.new_page()
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
# Handle Turnstile if present
resolved = await handle_turnstile(page)
if not resolved:
# If Turnstile failed, try a different proxy
print("Turnstile failed — try different residential proxy or slow down")
await context.close()
return ""
# Wait for content to load after challenge
await asyncio.sleep(random.uniform(2, 4))
content = await page.content()
await context.close()
return content
Rate Limiting Strategy
Getting through Cloudflare's initial check is only half the problem. The origin site's own rate limiting still applies, and Cloudflare also applies behavioral rate limiting at volume.
import time
import random
from collections import deque
class RateLimiter:
"""Token bucket rate limiter for scraping within Cloudflare's tolerance."""
def __init__(self, requests_per_minute: int = 20, burst_size: int = 5):
self.requests_per_minute = requests_per_minute
self.burst_size = burst_size
self.min_interval = 60.0 / requests_per_minute
self.request_times = deque()
def wait(self):
"""Block until rate limit allows another request."""
now = time.time()
# Remove old timestamps outside the 60-second window
while self.request_times and now - self.request_times[0] > 60:
self.request_times.popleft()
if len(self.request_times) >= self.requests_per_minute:
sleep_time = 60 - (now - self.request_times[0])
if sleep_time > 0:
time.sleep(sleep_time)
now = time.time()
# Add natural variance to timing
if self.request_times:
time_since_last = now - self.request_times[-1]
if time_since_last < self.min_interval:
variance = random.uniform(0, self.min_interval * 0.5)
time.sleep(self.min_interval - time_since_last + variance)
self.request_times.append(time.time())
# Recommended settings per Cloudflare protection tier
RATE_LIMITS = {
"low": RateLimiter(requests_per_minute=60, burst_size=10), # Standard Cloudflare
"medium": RateLimiter(requests_per_minute=30, burst_size=5), # Pro/Business tier
"high": RateLimiter(requests_per_minute=10, burst_size=2), # Enterprise / aggressive
"extreme": RateLimiter(requests_per_minute=3, burst_size=1), # Turnstile / challenge mode
}
Decision Framework: Choosing Your Approach
| Protection Level | Symptoms | Best Approach | Proxy Type |
|---|---|---|---|
| No protection | Works with plain requests | Plain requests | None needed |
| Standard Cloudflare | 403 from datacenter IPs | curl-cffi + residential proxy | Residential |
| Medium (JS challenge) | JavaScript loop, 503 | Playwright stealth + residential | Residential |
| High (IUAM mode) | Constant challenge, 5s wait | Playwright + slow human simulation | Premium residential |
| Turnstile interactive | CAPTCHA checkbox | Playwright stealth + residential | Premium residential |
| Enterprise / WAF | All above + behavioral blocks | Consider alternative data source or API | N/A |
For residential proxies, ThorData provides rotating residential IPs with country targeting starting at low cost per GB, with a pool covering 190+ countries. For Cloudflare bypass specifically, US and EU residential IPs have the highest success rates.
Common Error Codes
| Code | Meaning | Fix |
|---|---|---|
| 403 / 1020 | IP or ASN blocked | Switch to residential proxy |
| 503 | JavaScript challenge pending | Use Playwright with stealth |
| 429 | Rate limited | Slow down, respect Retry-After header |
| 520-530 | Origin server errors (not Cloudflare) | Site-specific issue unrelated to bot detection |
| Empty body / connection reset | TLS fingerprint blocked | Switch from requests to curl-cffi |
| Infinite redirect | Session/cookie issue | Use a persistent session, clear cookies |
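The table above condenses into a small triage helper; the action strings here are placeholders to wire into your own retry logic:

```python
def triage(status: int, body: str = "") -> str:
    """Map a response to a remediation hint, following the table above."""
    if status == 403 or "error code: 1020" in body.lower():
        return "switch-to-residential-proxy"
    if status == 503:
        return "use-playwright-stealth"
    if status == 429:
        return "slow-down-respect-retry-after"
    if 520 <= status <= 530:
        return "origin-error-retry-later"
    if status == 200 and not body.strip():
        return "switch-to-curl-cffi"
    return "ok" if status == 200 else "inspect-manually"
```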
Output Schema Example
from dataclasses import dataclass, field
from typing import Optional, List
from datetime import datetime
@dataclass
class ScrapedPage:
url: str
status_code: int
scraped_at: str
bypass_method: str # "curl-cffi", "playwright-stealth", "direct"
proxy_country: Optional[str]
cloudflare_challenge: bool
html_length: int
data: dict = field(default_factory=dict)
error: Optional[str] = None
# Example output
example = ScrapedPage(
url="https://cloudflare-site.com/products",
status_code=200,
scraped_at=datetime.utcnow().isoformat(),
bypass_method="curl-cffi",
proxy_country="US",
cloudflare_challenge=False,
html_length=145230,
data={"products": 48, "pages_scraped": 3},
)
Summary
Cloudflare protection in 2026 operates across five independent layers. Each must be addressed or the scraper fails. Start with curl-cffi for the TLS fingerprint layer — it handles 80% of Cloudflare-protected sites without a full browser. Add residential proxy rotation via ThorData for IP reputation. Add Playwright stealth for JavaScript challenges. Implement proper rate limiting to stay under behavioral detection thresholds.
Before building any of this infrastructure, spend five minutes in DevTools checking whether the target data is available via a less-protected API endpoint. Many Cloudflare-protected sites have a mobile API or JSON feed that bypasses the web-tier protection entirely.
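Common places to look: many stacks expose JSON under predictable path prefixes. The list below is a set of conventional guesses to probe in DevTools or curl, not endpoints any particular site is known to have:

```python
from urllib.parse import urljoin

# Conventional API path prefixes worth probing (guesses, not guarantees)
CANDIDATE_PREFIXES = ["/api/", "/wp-json/", "/graphql", "/_next/data/", "/feed.json"]

def candidate_api_urls(base_url: str) -> list[str]:
    """Build a list of likely JSON endpoints to check manually."""
    return [urljoin(base_url, prefix) for prefix in CANDIDATE_PREFIXES]

print(candidate_api_urls("https://example.com/products"))
```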