httpx vs Playwright: The Complete Decision Framework for Web Scraping in 2026
I have watched dozens of developers spin up Playwright for scraping tasks that could have been a three-line httpx script. The instinct makes sense — browsers render everything, handle JavaScript, deal with cookies. Why think about it when you can just page.goto() and grab the HTML?
Because you are paying for it. Every single time. And often, you are paying 100x more in compute, time, and maintenance for zero additional benefit.
On the other end of the spectrum, I have seen developers stubbornly stick with httpx for sites that genuinely require browser rendering, wasting days fighting JavaScript challenges and dynamic content that a Playwright script would handle in minutes. Both mistakes come from the same root cause: not understanding when each tool is the right choice.
This guide is the decision framework I wish I had when I started building scraping infrastructure. After running both tools in production across hundreds of target sites — processing millions of pages per month — I have developed a clear, testable methodology for choosing the right tool for any scraping job. The answer is almost never "always use X." It depends on the target, and you need a systematic way to figure it out.
I will give you the benchmarks, the code, the decision tree, and the real-world patterns that determine which tool wins for each use case. By the end, you will never waste time with the wrong tool again.
The Numbers Nobody Talks About
Before we get into decision frameworks, let us establish the performance baseline. These numbers come from my own benchmarks, running both tools against the same set of targets on identical hardware (M2 MacBook Pro, 16GB RAM, gigabit fiber connection).
Speed Comparison
Target: Static HTML product page (example-store.com/product/123)
httpx (sync): 48ms average response time
httpx (async, concurrency 10): 12ms effective per-request (parallel)
Playwright: 2,400ms average page load
Playwright (pool 3): 850ms effective per-request (parallel)
Speedup: 50-200x for httpx depending on concurrency
Target: JavaScript SPA with API backend (react-app.com/dashboard)
httpx (direct API): 65ms average response time
httpx (async, concurrency 10): 18ms effective per-request (parallel)
Playwright: 3,200ms average page load
Playwright (pool 3): 1,100ms effective per-request
Note: httpx directly called the underlying API the React app uses.
Speedup: 50-175x for httpx when API is accessible
Target: Cloudflare-protected site requiring JS execution
httpx: BLOCKED (403 on every attempt)
curl_cffi: 285ms (with TLS impersonation, no JS challenge)
Playwright: 4,100ms (must execute CF challenge JS)
Playwright + stealth: 3,800ms (slightly faster, same approach)
Winner: curl_cffi when possible, Playwright when JS challenge is mandatory
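If you want to reproduce numbers like these against your own targets, a minimal timing harness is enough. This is a sketch, not the exact harness used above; the `time.sleep` call is a stand-in for the real fetch function you are measuring:

```python
import time
import statistics

def benchmark(fetch, trials: int = 20) -> dict:
    """Time a zero-argument fetch callable and report basic stats in ms."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        fetch()
        samples.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }

# Stand-in workload; swap in e.g. lambda: httpx.get(url) or a page.goto wrapper
stats = benchmark(lambda: time.sleep(0.005), trials=5)
print(f"mean: {stats['mean_ms']:.1f}ms, median: {stats['median_ms']:.1f}ms")
```

Run it once per tool against the same URL set and you have your own version of the tables above.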
Memory Comparison
10,000 requests through each tool:
httpx (async):
Peak memory: 180MB
Average per-request: ~2MB overhead
No memory leaks observed over 100K requests
Playwright (3 browser contexts):
Peak memory: 2.1GB
Average per-context: 250-400MB
Memory leaked ~50MB per 1,000 pages (Chrome bug)
Required periodic browser restart
Playwright (single context, reused page):
Peak memory: 800MB
Average: 300MB steady state
Still leaked, but slower
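Those leak numbers are why long-running Playwright jobs need the periodic restart mentioned above. The pattern is simple enough to capture in a tool-agnostic wrapper; this is a sketch where the `factory` and `close` callables stand in for browser launch and teardown:

```python
class Recycler:
    """Recreate a leaky resource (e.g. a browser) after every max_uses uses."""

    def __init__(self, factory, close, max_uses: int = 1000):
        self.factory = factory    # creates a fresh resource
        self.close = close        # tears one down
        self.max_uses = max_uses
        self.uses = 0
        self.resource = None

    def get(self):
        # Lazily create, and recycle once the use budget is exhausted
        if self.resource is None or self.uses >= self.max_uses:
            if self.resource is not None:
                self.close(self.resource)
            self.resource = self.factory()
            self.uses = 0
        self.uses += 1
        return self.resource
```

With Playwright, `factory` would launch a browser and `close` would call `browser.close()`; every 1,000 pages the leaked memory goes away with the old browser process.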
Cost Comparison (Cloud Compute)
Workload: Scrape 50,000 product pages daily
httpx approach:
Instance: 2 vCPU, 4GB RAM ($30/month)
Completion time: ~45 minutes
Proxy costs: ~$5-15/month (bandwidth only)
Total: ~$35-45/month
Playwright approach:
Instance: 4 vCPU, 16GB RAM ($120/month)
Completion time: ~8 hours
Proxy costs: ~$20-50/month (more bandwidth from page assets)
Total: ~$140-170/month
Difference: 3-4x higher cost for Playwright
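Those completion times follow from simple arithmetic you can redo for your own workload. In this sketch, the 48ms and 2.4s figures are the benchmark averages from earlier; the concurrency levels (sequential httpx, a 4-worker Playwright pool) are assumptions that roughly reproduce the totals above:

```python
def daily_runtime_hours(pages: int, seconds_per_page: float, concurrency: int) -> float:
    """Wall-clock hours to scrape `pages` pages at a given per-page latency with N workers."""
    return pages * seconds_per_page / concurrency / 3600

# 50,000 pages/day at the benchmark latencies
httpx_hours = daily_runtime_hours(50_000, 0.048, concurrency=1)
pw_hours = daily_runtime_hours(50_000, 2.4, concurrency=4)
print(f"httpx: {httpx_hours * 60:.0f} min, Playwright: {pw_hours:.1f} h")
```

Plug in your own page count and measured latencies to decide which instance size you actually need before renting it.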
These are not cherry-picked numbers. For sites where both tools work, httpx is consistently 50-200x faster, uses 50-100x less memory, and costs 3-4x less in infrastructure. The question is not "which is faster" — it is "when does the slower tool become necessary?"
The Decision Tree
Before you write a single line of scraping code, run through this decision tree. It takes 5 minutes and will save you hours of wasted effort.
Step 1: Does the Data Exist in the Initial HTML?
Open the target URL in your browser. Open DevTools (F12). Go to Settings and check "Disable JavaScript." Reload the page.
If the content you need is still visible, use httpx. Full stop. Most news sites, blogs, e-commerce product pages, documentation sites, government databases, and forums serve their content in the initial HTML response. There is no reason to launch a browser engine for this.
import httpx
from selectolax.parser import HTMLParser
# Fast check: does the data exist without JavaScript?
def check_html_content(url: str, selector: str) -> bool:
"""Test if target content is in the initial HTML response."""
resp = httpx.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
}, follow_redirects=True, timeout=15)
tree = HTMLParser(resp.text)
elements = tree.css(selector)
if elements:
print(f"Found {len(elements)} elements matching '{selector}'")
print(f"First match: {elements[0].text(strip=True)[:100]}")
return True
else:
print(f"No elements found for '{selector}' -- might need JavaScript")
return False
# Test it
has_data = check_html_content(
"https://example-store.com/products",
".product-card h2" # Your target selector
)
Step 2: Does the Data Come from an API?
This is the most overlooked optimization in web scraping. Most modern websites — especially those built with React, Vue, Angular, or Next.js — load their data from JSON APIs. You can call those APIs directly with httpx and get structured data without parsing HTML.
import httpx
import json
# Step 1: Find the API (check Network tab in DevTools)
# Filter by XHR/Fetch, look for JSON responses
# Step 2: Call it directly
def scrape_via_api(base_url: str, params: dict) -> list[dict]:
"""Call the site's internal API directly instead of rendering pages."""
with httpx.Client(timeout=15) as client:
resp = client.get(
f"{base_url}/api/products",
params=params,
headers={
"Accept": "application/json",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
"Referer": base_url, # Some APIs check this
},
)
resp.raise_for_status()
return resp.json().get("items", [])
# This is BETTER than browser scraping because:
# 1. You get structured JSON, not HTML to parse
# 2. 50-200x faster
# 3. More reliable (no CSS selector breakage)
# 4. Less bandwidth (no images, CSS, JS downloaded)
products = scrape_via_api(
"https://example-store.com",
{"category": "electronics", "page": 1, "limit": 50}
)
How to find the API:
- Open DevTools > Network tab
- Filter by "Fetch/XHR"
- Navigate the page normally and watch the requests
- Look for JSON responses — these are your APIs
- Right-click the request > "Copy as cURL" to get all the headers
# Automated API discovery helper
async def discover_apis(url: str) -> list[dict]:
"""Use Playwright to discover what APIs a page calls."""
from playwright.async_api import async_playwright
apis = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
# Intercept all network requests
async def on_response(response):
if response.request.resource_type in ("xhr", "fetch"):
content_type = response.headers.get("content-type", "")
if "json" in content_type:
apis.append({
"url": response.url,
"method": response.request.method,
"status": response.status,
"content_type": content_type,
})
page.on("response", on_response)
await page.goto(url, wait_until="networkidle")
await browser.close()
return apis
# Run once to find APIs, then use httpx for all future scraping
import asyncio
apis = asyncio.run(discover_apis("https://example-store.com/products"))
for api in apis:
print(f"{api['method']} {api['url']} -> {api['status']}")
This is a powerful pattern: use Playwright once to discover APIs, then use httpx for all ongoing scraping. You get the best of both worlds.
Step 3: Does the Site Require JavaScript for Content?
Some sites genuinely render everything client-side with no callable API. The HTML is an empty <div id="root"> and the data lives inside JavaScript bundles or is generated dynamically. This is less common than most people think, but it exists.
Signs you need JavaScript execution:
- The page source (Ctrl+U) shows an empty body with just script tags
- The Network tab shows no JSON API calls — data is embedded in JS bundles
- Content appears only after JavaScript evaluation
import httpx
def needs_javascript(url: str) -> bool:
"""Heuristic check: does this URL need JS to show content?"""
resp = httpx.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
}, follow_redirects=True, timeout=15)
html = resp.text
# Check for empty body indicators
indicators = [
'<div id="root"></div>',
'<div id="app"></div>',
'<div id="__next"></div>',
'noscript>You need to enable JavaScript',
'noscript>Please enable JavaScript',
]
body_seems_empty = any(ind in html for ind in indicators)
# Check if there is meaningful text content
from selectolax.parser import HTMLParser
tree = HTMLParser(html)
body = tree.css_first("body")
if body:
text = body.text(strip=True)
# If body text is very short, content is probably JS-rendered
meaningful_text = len(text) > 200
if body_seems_empty and not meaningful_text:
return True
return False
# Test before choosing your approach
if needs_javascript("https://target-site.com"):
print("This site needs Playwright")
else:
print("httpx should work fine")
Step 4: Does the Site Use Bot Detection?
This is where it gets nuanced. Bot detection operates at multiple layers:
- IP reputation: Datacenter IPs get blocked, residential IPs pass. Solution: residential proxies from ThorData, not Playwright.
- TLS fingerprinting: Python libraries get caught by JA3/JA4+. Solution: curl_cffi with browser impersonation, not Playwright.
- JavaScript challenges: Cloudflare/Akamai serve JS that must execute. Solution: Playwright (or curl_cffi if no mandatory JS challenge).
- Browser fingerprinting: Canvas, WebGL, fonts, plugins checks. Solution: Playwright with stealth patches.
- Behavioral analysis: Mouse movement, click patterns, timing. Solution: Playwright with human-like automation.
The key insight: most "bot detection" can be bypassed without a browser. Only the last two categories truly require Playwright. Start with httpx + curl_cffi + proxies, and only escalate to Playwright if that fails.
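That escalation order can be encoded directly: try each fetcher cheapest-first and stop at the first success. A tool-agnostic sketch, where the `(name, fetch)` pairs stand in for your httpx, curl_cffi, and Playwright wrappers:

```python
def escalate(url: str, fetchers: list) -> tuple[str, str]:
    """Try (name, fetch) pairs in order; return (name, html) of the first success."""
    errors = []
    for name, fetch in fetchers:
        try:
            html = fetch(url)
            if html:  # treat empty/None as a miss (e.g. blocked or JS-only shell)
                return name, html
        except Exception as e:
            errors.append(f"{name}: {e}")
    raise RuntimeError(f"All strategies failed for {url}: {errors}")
```

In production the list would look like `[("httpx", fetch_httpx), ("curl_cffi", fetch_cffi), ("playwright", fetch_browser)]`, so the expensive browser only ever runs when the cheap tiers fail.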
The Decision Matrix
Here is the complete decision matrix. Find your scenario and use the recommended tool:
| Scenario | httpx | curl_cffi | Playwright | Recommended |
|---|---|---|---|---|
| Static HTML content | Yes | Yes | Yes | httpx |
| JSON API available | Yes | Yes | Yes | httpx |
| Server-side rendered (Next.js SSR, PHP) | Yes | Yes | Yes | httpx |
| TLS fingerprint detection (Cloudflare) | No | Yes | Yes | curl_cffi |
| JavaScript-rendered SPA (no API) | No | No | Yes | Playwright |
| Mandatory JS challenge (CF turnstile) | No | No | Yes | Playwright |
| Complex interactions (login, forms) | Limited | Limited | Yes | Playwright |
| Canvas/WebGL fingerprint check | No | No | Yes | Playwright |
| High volume (10K+ pages/day) | Yes | Yes | Expensive | httpx/curl_cffi |
| Low volume (<100 pages/day) | Yes | Yes | Yes | Your preference |
Side-by-Side Code: Same Task, Both Tools
Task 1: Scraping Product Listings
httpx approach (~50ms per page):
import httpx
from selectolax.parser import HTMLParser
from dataclasses import dataclass
import json
@dataclass
class Product:
title: str
price: str
url: str
image_url: str
rating: str
def scrape_products_httpx(
base_url: str,
pages: int = 10,
proxy: str | None = None,
) -> list[Product]:
"""Scrape product listings with httpx + selectolax."""
products = []
transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
with httpx.Client(
transport=transport,
headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
},
follow_redirects=True,
timeout=15.0,
) as client:
for page in range(1, pages + 1):
resp = client.get(f"{base_url}/products", params={"page": page})
if resp.status_code != 200:
print(f"Page {page}: HTTP {resp.status_code}")
continue
tree = HTMLParser(resp.text)
for card in tree.css(".product-card"):
title_el = card.css_first("h2, .product-title")
price_el = card.css_first(".price, [data-price]")
link_el = card.css_first("a[href]")
img_el = card.css_first("img[src]")
rating_el = card.css_first(".rating, [data-rating]")
if title_el and price_el:
products.append(Product(
title=title_el.text(strip=True),
price=price_el.text(strip=True),
url=link_el.attrs.get("href", "") if link_el else "",
image_url=img_el.attrs.get("src", "") if img_el else "",
rating=rating_el.text(strip=True) if rating_el else "N/A",
))
return products
# Usage
products = scrape_products_httpx("https://example-store.com", pages=5)
print(f"Found {len(products)} products")
for p in products[:3]:
print(f" {p.title}: {p.price}")
Async httpx for parallel scraping (12ms effective per page):
import httpx
import asyncio
from selectolax.parser import HTMLParser
async def scrape_products_async(
base_url: str,
pages: int = 50,
concurrency: int = 10,
proxy: str | None = None,
) -> list[dict]:
"""Parallel product scraping with httpx async."""
products = []
semaphore = asyncio.Semaphore(concurrency)
async def fetch_page(client: httpx.AsyncClient, page: int):
async with semaphore:
resp = await client.get(
f"{base_url}/products",
params={"page": page},
)
if resp.status_code == 200:
tree = HTMLParser(resp.text)
for card in tree.css(".product-card"):
title = card.css_first("h2")
price = card.css_first(".price")
if title and price:
products.append({
"title": title.text(strip=True),
"price": price.text(strip=True),
"page": page,
})
transport = httpx.AsyncHTTPTransport(proxy=proxy) if proxy else None
async with httpx.AsyncClient(
transport=transport,
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36"},
follow_redirects=True,
timeout=15.0,
) as client:
tasks = [fetch_page(client, page) for page in range(1, pages + 1)]
await asyncio.gather(*tasks)
return products
# 50 pages in parallel: ~2 seconds total
products = asyncio.run(scrape_products_async(
"https://example-store.com",
pages=50,
concurrency=10,
))
Playwright approach (~3s per page):
from playwright.async_api import async_playwright
import asyncio
async def scrape_products_playwright(
base_url: str,
pages: int = 10,
proxy: dict | None = None,
) -> list[dict]:
"""Scrape product listings with Playwright."""
products = []
async with async_playwright() as p:
launch_opts = {"headless": True}
if proxy:
launch_opts["proxy"] = proxy
browser = await p.chromium.launch(**launch_opts)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
)
page = await context.new_page()
for page_num in range(1, pages + 1):
await page.goto(
f"{base_url}/products?page={page_num}",
wait_until="networkidle",
)
cards = await page.query_selector_all(".product-card")
for card in cards:
title_el = await card.query_selector("h2, .product-title")
price_el = await card.query_selector(".price, [data-price]")
if title_el and price_el:
products.append({
"title": await title_el.inner_text(),
"price": await price_el.inner_text(),
"page": page_num,
})
await browser.close()
return products
Same result. The httpx version is 60x faster with async and uses a fraction of the memory. I am using selectolax instead of BeautifulSoup because it is approximately 20x faster at parsing — another optimization most people skip.
Task 2: Scraping Behind Cloudflare
When a site is behind Cloudflare's bot management, the approach depends on the protection level:
Level 1 — Basic protection (TLS fingerprint check only):
from curl_cffi import requests as cffi_requests
def scrape_cloudflare_basic(url: str, proxy: str | None = None) -> str:
"""Bypass basic Cloudflare with TLS impersonation only."""
proxy_dict = {"https": proxy, "http": proxy} if proxy else None
session = cffi_requests.Session(
impersonate="chrome131",
proxies=proxy_dict,
)
resp = session.get(url)
session.close()
if resp.status_code == 200:
return resp.text
else:
raise Exception(f"Blocked: HTTP {resp.status_code}")
# 285ms average - no browser needed
html = scrape_cloudflare_basic(
"https://cf-protected-site.com/data",
proxy="http://user:[email protected]:9000",
)
Level 2 — JS challenge (Turnstile/Managed Challenge):
from playwright.async_api import async_playwright
import asyncio
async def scrape_cloudflare_challenge(
url: str,
proxy: dict | None = None,
) -> str:
"""Handle Cloudflare JS challenge that requires browser execution."""
async with async_playwright() as p:
launch_opts = {
"headless": True,
"args": ["--disable-blink-features=AutomationControlled"],
}
if proxy:
launch_opts["proxy"] = proxy
browser = await p.chromium.launch(**launch_opts)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
)
# Stealth: override webdriver detection
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
window.chrome = { runtime: {} };
""")
page = await context.new_page()
await page.goto(url, wait_until="domcontentloaded")
# Wait for Cloudflare challenge to resolve (up to 15s)
try:
await page.wait_for_selector(
"body:not(:has(.cf-challenge-running))",
timeout=15000,
)
except Exception:
pass
# Additional wait for content to render
await asyncio.sleep(3)
html = await page.content()
await browser.close()
return html
Task 3: Scraping JavaScript SPAs
First, try to find the API (httpx):
import httpx
# Most React/Vue/Angular apps call a REST or GraphQL API
# Check Network tab to find it
def scrape_react_app_via_api(api_url: str) -> list[dict]:
"""Bypass the SPA entirely by calling its API directly."""
with httpx.Client(timeout=15) as client:
# For REST APIs
resp = client.get(api_url, params={"limit": 100})
return resp.json()
# 65ms vs 3200ms for Playwright
products = scrape_react_app_via_api("https://app.example.com/api/v1/products")
If no API exists, use Playwright:
from playwright.async_api import async_playwright
import asyncio
async def scrape_spa_content(url: str, content_selector: str) -> list[str]:
"""Scrape content from a JS-only SPA where no API is available."""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="networkidle")
# Wait specifically for the content we need
await page.wait_for_selector(content_selector, timeout=10000)
# Extract content
elements = await page.query_selector_all(content_selector)
texts = [await el.inner_text() for el in elements]
await browser.close()
return texts
Advanced: The Hybrid Approach
The most effective production scrapers use both tools strategically:
import httpx
from curl_cffi import requests as cffi_requests
from selectolax.parser import HTMLParser
import asyncio
from dataclasses import dataclass
from enum import Enum
class ScrapingStrategy(Enum):
HTTPX = "httpx" # Fast, for static/SSR content
CURL_CFFI = "curl_cffi" # TLS-spoofed, for fingerprint checks
PLAYWRIGHT = "playwright" # Full browser, for JS-only sites
@dataclass
class SiteProfile:
"""Profile a target site to determine optimal scraping strategy."""
url: str
needs_javascript: bool = False
has_api: bool = False
api_url: str = ""
has_tls_fingerprinting: bool = False
has_js_challenge: bool = False
recommended: ScrapingStrategy = ScrapingStrategy.HTTPX
async def profile_site(url: str) -> SiteProfile:
"""Automatically determine the best scraping strategy for a URL."""
profile = SiteProfile(url=url)
# Test 1: Can we get content with plain httpx?
try:
resp = httpx.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
}, follow_redirects=True, timeout=15)
if resp.status_code == 403:
profile.has_tls_fingerprinting = True
if resp.status_code == 200:
tree = HTMLParser(resp.text)
body = tree.css_first("body")
body_text = body.text(strip=True) if body else ""
if len(body_text) < 100:
profile.needs_javascript = True
else:
profile.recommended = ScrapingStrategy.HTTPX
return profile
except Exception:
pass
# Test 2: Does curl_cffi with TLS impersonation work?
if profile.has_tls_fingerprinting:
try:
session = cffi_requests.Session(impersonate="chrome131")
resp = session.get(url)
session.close()
if resp.status_code == 200:
tree = HTMLParser(resp.text)
body = tree.css_first("body")
body_text = body.text(strip=True) if body else ""
if len(body_text) > 100:
profile.recommended = ScrapingStrategy.CURL_CFFI
return profile
else:
profile.needs_javascript = True
except Exception:
pass
# Test 3: Discover APIs with Playwright
from playwright.async_api import async_playwright
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
api_calls = []
async def capture_api(response):
ct = response.headers.get("content-type", "")
if "json" in ct and response.request.resource_type in ("xhr", "fetch"):
api_calls.append(response.url)
page.on("response", capture_api)
try:
await page.goto(url, wait_until="networkidle", timeout=15000)
cf_challenge = await page.query_selector(".cf-challenge-running, #challenge-running")
if cf_challenge:
profile.has_js_challenge = True
except Exception:
pass
await browser.close()
if api_calls:
profile.has_api = True
profile.api_url = api_calls[0]
profile.recommended = ScrapingStrategy.HTTPX
elif profile.needs_javascript:
profile.recommended = ScrapingStrategy.PLAYWRIGHT
return profile
class HybridScraper:
"""Use the right tool for each target automatically."""
def __init__(self, proxy_url: str | None = None):
self.proxy_url = proxy_url
self.profiles: dict[str, SiteProfile] = {}
async def scrape(self, url: str) -> str:
"""Scrape a URL using the optimal strategy."""
from urllib.parse import urlparse
domain = urlparse(url).netloc
if domain not in self.profiles:
self.profiles[domain] = await profile_site(url)
profile = self.profiles[domain]
if profile.recommended == ScrapingStrategy.HTTPX:
if profile.has_api:
return self._scrape_api(profile.api_url)
return self._scrape_httpx(url)
elif profile.recommended == ScrapingStrategy.CURL_CFFI:
return self._scrape_curl_cffi(url)
else:
return await self._scrape_playwright(url)
def _scrape_httpx(self, url: str) -> str:
transport = httpx.HTTPTransport(proxy=self.proxy_url) if self.proxy_url else None
with httpx.Client(transport=transport, timeout=15) as client:
resp = client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
})
return resp.text
def _scrape_api(self, api_url: str) -> str:
with httpx.Client(timeout=15) as client:
resp = client.get(api_url)
return resp.text
def _scrape_curl_cffi(self, url: str) -> str:
proxy_dict = {"https": self.proxy_url, "http": self.proxy_url} if self.proxy_url else None
session = cffi_requests.Session(impersonate="chrome131", proxies=proxy_dict)
resp = session.get(url)
session.close()
return resp.text
async def _scrape_playwright(self, url: str) -> str:
from playwright.async_api import async_playwright
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="networkidle")
html = await page.content()
await browser.close()
return html
When You Genuinely Need Playwright
I am not anti-Playwright. I use it regularly for specific scenarios where httpx genuinely cannot work:
1. True SPAs with No Callable API
Rare but they exist — applications that embed all data in JavaScript bundles, generate content via WebAssembly, or use proprietary protocols.
2. Sites with Aggressive Browser Fingerprinting
Canvas fingerprinting, WebGL renderer checks, installed fonts, and other browser-specific API verification:
from playwright.async_api import async_playwright
import asyncio
async def scrape_with_full_fingerprint(url: str, proxy: dict) -> str:
"""For sites that verify complete browser fingerprints."""
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=False, # Headed mode has better fingerprints
args=["--disable-blink-features=AutomationControlled"],
)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
locale="en-US",
timezone_id="America/New_York",
proxy=proxy,
)
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
Object.defineProperty(navigator, 'plugins', { get: () => [1,2,3,4,5] });
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
const getParameter = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function(parameter) {
if (parameter === 37445) return 'Intel Inc.';
if (parameter === 37446) return 'Intel Iris OpenGL Engine';
return getParameter.apply(this, arguments);
};
""")
page = await context.new_page()
await page.goto(url, wait_until="networkidle")
html = await page.content()
await browser.close()
return html
3. Complex Multi-Step Interactions
Login flows, multi-page forms, cookie consent dialogs, and interactive workflows:
async def scrape_behind_login(
login_url: str,
target_url: str,
username: str,
password: str,
proxy: dict | None = None,
) -> tuple[str, str]:
"""Scrape content behind a login wall."""
async with async_playwright() as p:
launch_opts = {"headless": True}
if proxy:
launch_opts["proxy"] = proxy
browser = await p.chromium.launch(**launch_opts)
context = await browser.new_context()
page = await context.new_page()
# Step 1: Navigate to login page
await page.goto(login_url, wait_until="networkidle")
# Step 2: Fill in credentials
await page.fill("input[name='username'], input[type='email']", username)
await page.fill("input[name='password'], input[type='password']", password)
# Step 3: Submit
await page.click("button[type='submit'], input[type='submit']")
# Step 4: Wait for redirect
await page.wait_for_url("**/dashboard**", timeout=10000)
# Step 5: Navigate to target content
await page.goto(target_url, wait_until="networkidle")
html = await page.content()
# Save cookies for future httpx requests
cookies = await context.cookies()
cookie_header = "; ".join(f"{c['name']}={c['value']}" for c in cookies)
await browser.close()
return html, cookie_header
# After getting cookies, use httpx for subsequent requests
html, cookies = asyncio.run(scrape_behind_login(
"https://example.com/login",
"https://example.com/dashboard/data",
"[email protected]",
"password123",
))
# Now use httpx with the session cookies (much faster)
resp = httpx.get(
"https://example.com/dashboard/api/data",
headers={"Cookie": cookies},
)
4. Infinite Scroll and Dynamic Loading
async def scrape_infinite_scroll(
url: str,
item_selector: str,
max_items: int = 500,
) -> list[str]:
"""Scrape content from pages with infinite scroll."""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="networkidle")
items = set()
last_count = 0
stale_scrolls = 0
while len(items) < max_items and stale_scrolls < 3:
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(2)
elements = await page.query_selector_all(item_selector)
for el in elements:
text = await el.inner_text()
items.add(text)
if len(items) == last_count:
stale_scrolls += 1
else:
stale_scrolls = 0
last_count = len(items)
await browser.close()
return list(items)[:max_items]
Performance Optimization Tips
For httpx
# 1. Use selectolax instead of BeautifulSoup (20x faster parsing)
from selectolax.parser import HTMLParser
tree = HTMLParser(html)
titles = [node.text(strip=True) for node in tree.css("h2.title")]
# 2. Use async with semaphore for controlled concurrency
async def fetch_many(urls: list[str], max_concurrent: int = 20):
sem = asyncio.Semaphore(max_concurrent)
async with httpx.AsyncClient(timeout=15) as client:
async def fetch(url):
async with sem:
return await client.get(url)
return await asyncio.gather(*[fetch(u) for u in urls])
# 3. Reuse connections with keep-alive and HTTP/2
# (HTTP/2 support requires the extra: pip install httpx[http2])
with httpx.Client(http2=True) as client:  # HTTP/2 multiplexing
for url in urls:
resp = client.get(url)
# 4. Skip downloading unnecessary content
resp = httpx.get(url, headers={
"Accept": "text/html", # Don't accept images, fonts, etc.
})
For Playwright
# 1. Block unnecessary resources to speed up loading
async def fast_playwright_load(page, url: str):
await page.route("**/*.{png,jpg,jpeg,gif,svg,webp,woff,woff2,ttf,css}",
lambda route: route.abort())
await page.goto(url, wait_until="domcontentloaded")
# 2. Reuse browser contexts instead of creating new ones
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
for url in urls:
await page.goto(url)
# extract data, reuse page
# 3. Use multiple pages in parallel
async def parallel_playwright(urls: list[str], max_pages: int = 5):
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
sem = asyncio.Semaphore(max_pages)
async def fetch(url):
async with sem:
page = await browser.new_page()
await page.goto(url)
html = await page.content()
await page.close()
return html
return await asyncio.gather(*[fetch(u) for u in urls])
# 4. Extract data via JavaScript evaluation (faster than query_selector_all)
data = await page.evaluate("""
() => Array.from(document.querySelectorAll('.product')).map(el => ({
title: el.querySelector('h2')?.textContent?.trim(),
price: el.querySelector('.price')?.textContent?.trim(),
}))
""")
Error Handling Patterns
httpx Error Handling with Automatic Escalation
import httpx
import time
import random
from dataclasses import dataclass
from enum import Enum
class ErrorAction(Enum):
RETRY = "retry"
ROTATE_PROXY = "rotate_proxy"
SWITCH_TO_PLAYWRIGHT = "switch_to_playwright"
ABORT = "abort"
@dataclass
class ScrapeError:
action: ErrorAction
message: str
wait_seconds: float = 0
def handle_httpx_error(
error: Exception | None = None,
response: httpx.Response | None = None,
) -> ScrapeError:
"""Determine the right action for an httpx scraping error."""
if isinstance(error, httpx.ConnectTimeout):
return ScrapeError(ErrorAction.RETRY, "Connection timeout", wait_seconds=5)
if isinstance(error, httpx.ProxyError):
return ScrapeError(ErrorAction.ROTATE_PROXY, "Proxy connection failed")
if response is None:
return ScrapeError(ErrorAction.RETRY, str(error), wait_seconds=10)
status = response.status_code
if status == 403:
if "cloudflare" in response.text.lower() or "cf-chl" in response.text:
return ScrapeError(
ErrorAction.SWITCH_TO_PLAYWRIGHT,
"Cloudflare challenge detected",
)
return ScrapeError(ErrorAction.ROTATE_PROXY, "403 Forbidden", wait_seconds=30)
if status == 429:
retry_after = int(response.headers.get("retry-after", 60))
return ScrapeError(ErrorAction.RETRY, "Rate limited", wait_seconds=retry_after)
if status == 503:
return ScrapeError(ErrorAction.RETRY, "Service unavailable", wait_seconds=15)
if status >= 500:
return ScrapeError(ErrorAction.RETRY, f"Server error {status}", wait_seconds=10)
return ScrapeError(ErrorAction.ABORT, f"Unexpected status {status}")
def scrape_with_fallback(
url: str,
max_retries: int = 3,
proxy_url: str | None = None,
) -> str:
"""Scrape with automatic retry, proxy rotation, and Playwright fallback."""
for attempt in range(max_retries):
try:
transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
with httpx.Client(transport=transport, timeout=15) as client:
resp = client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/131.0.0.0 Safari/537.36",
})
if resp.status_code == 200:
return resp.text
err = handle_httpx_error(response=resp)
except Exception as e:
err = handle_httpx_error(error=e)
if err.action == ErrorAction.SWITCH_TO_PLAYWRIGHT:
import asyncio
return asyncio.run(_playwright_fallback(url, proxy_url))
if err.action == ErrorAction.ABORT:
raise Exception(f"Aborting: {err.message}")
if err.wait_seconds > 0:
    # Jitter the backoff so parallel workers do not retry in lockstep
    time.sleep(err.wait_seconds + random.uniform(0, 1))
raise Exception(f"Failed after {max_retries} attempts")
async def _playwright_fallback(url: str, proxy_url: str | None) -> str:
"""Playwright fallback when httpx fails."""
from playwright.async_api import async_playwright
proxy = None
if proxy_url:
from urllib.parse import urlparse
parsed = urlparse(proxy_url)
proxy = {
"server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
"username": parsed.username,
"password": parsed.password,
}
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="networkidle", timeout=30000)
html = await page.content()
await browser.close()
return html
Output Schema Best Practices
Regardless of which tool you use, structure your output consistently:
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
import json
@dataclass
class ScrapedItem:
    """Base class for scraped items with metadata."""
    url: str
    scraped_at: str = ""
    scraping_method: str = ""  # "httpx", "curl_cffi", "playwright"
    proxy_used: bool = False
    def __post_init__(self):
        if not self.scraped_at:
            # datetime.utcnow() is deprecated; use an explicit UTC-aware timestamp
            self.scraped_at = datetime.now(timezone.utc).isoformat()
@dataclass
class ScrapedProduct(ScrapedItem):
title: str = ""
price: str = ""
currency: str = "USD"
description: str = ""
image_url: str = ""
rating: float = 0.0
review_count: int = 0
in_stock: bool = True
category: str = ""
sku: str = ""
@dataclass
class ScrapedArticle(ScrapedItem):
title: str = ""
author: str = ""
published_date: str = ""
content: str = ""
tags: list[str] = field(default_factory=list)
word_count: int = 0
def __post_init__(self):
super().__post_init__()
if self.content:
self.word_count = len(self.content.split())
@dataclass
class ScrapeResult:
"""Wrapper for batch scraping results."""
items: list[ScrapedItem] = field(default_factory=list)
total_scraped: int = 0
total_errors: int = 0
duration_seconds: float = 0
method: str = ""
def to_json(self) -> str:
return json.dumps(asdict(self), indent=2, ensure_ascii=False, default=str)
def to_jsonl(self) -> str:
return "\n".join(json.dumps(asdict(item), default=str) for item in self.items)
Proxy Integration: Making Both Tools Work Better
Both httpx and Playwright benefit significantly from proxy rotation. Using ThorData residential proxies is one of the most cost-effective ways to improve success rates for either tool:
# httpx with ThorData rotating proxy
transport = httpx.HTTPTransport(proxy="http://user:pass@rotating.thordata.com:9000")
with httpx.Client(transport=transport) as client:
resp = client.get("https://target.com")
# curl_cffi with ThorData (best combination for non-JS sites)
from curl_cffi import requests as cffi_requests
session = cffi_requests.Session(
impersonate="chrome131",
proxies={
"https": "http://user:[email protected]:9000",
"http": "http://user:[email protected]:9000",
},
)
# Playwright with ThorData
browser = await p.chromium.launch(
headless=True,
proxy={
"server": "http://rotating.thordata.com:9000",
"username": "your_user",
"password": "your_pass",
},
)
The key benefit of residential proxies from ThorData is that they work with both tools. When paired with httpx or curl_cffi for speed-sensitive work, you get fast responses with real residential IPs. When paired with Playwright for JS-heavy sites, you get authentic browser fingerprints backed by trusted IP addresses. The per-GB pricing model keeps costs predictable regardless of which tool you use.
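Since the same proxy URL has to be reshaped differently for each tool (requests-style dict for httpx/curl_cffi, separate server/credential fields for Playwright), a small helper avoids copy-paste drift. A sketch, with placeholder credentials and the ThorData hostname from the snippets above:

```python
from urllib.parse import urlparse

def proxy_settings(proxy_url: str) -> tuple[dict, dict]:
    """Split one proxy URL into a requests-style dict and a Playwright config."""
    parsed = urlparse(proxy_url)
    requests_style = {"http": proxy_url, "https": proxy_url}
    playwright_style = {"server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}"}
    if parsed.username:
        # Playwright wants credentials as separate fields, not embedded in the URL
        playwright_style["username"] = parsed.username
        playwright_style["password"] = parsed.password
    return requests_style, playwright_style

req, pw = proxy_settings("http://user:pass@rotating.thordata.com:9000")
print(pw["server"])  # http://rotating.thordata.com:9000
```

One URL in your config, two correctly shaped settings out -- and no chance of the two drifting apart when you rotate credentials.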
Real-World Architecture: Scraping Pipeline
Here is how a production scraping pipeline combines both tools:
import asyncio
import httpx
from curl_cffi import requests as cffi_requests
from dataclasses import dataclass
from enum import Enum
import logging
import time
logger = logging.getLogger(__name__)
class ToolChoice(Enum):
HTTPX = "httpx"
CURL_CFFI = "curl_cffi"
PLAYWRIGHT = "playwright"
@dataclass
class ScrapingPipeline:
"""Production scraping pipeline with automatic tool selection."""
proxy_url: str
max_concurrent_httpx: int = 50
max_concurrent_playwright: int = 3
tool_stats: dict | None = None
def __post_init__(self):
self.tool_stats = {
"httpx": {"success": 0, "fail": 0},
"curl_cffi": {"success": 0, "fail": 0},
"playwright": {"success": 0, "fail": 0},
}
async def scrape_batch(
self,
urls: list[str],
tool: ToolChoice = ToolChoice.HTTPX,
) -> list[dict]:
"""Scrape a batch of URLs with the specified tool."""
if tool == ToolChoice.HTTPX:
return await self._batch_httpx(urls)
elif tool == ToolChoice.CURL_CFFI:
return self._batch_curl_cffi(urls)
else:
return await self._batch_playwright(urls)
async def _batch_httpx(self, urls: list[str]) -> list[dict]:
sem = asyncio.Semaphore(self.max_concurrent_httpx)
transport = httpx.AsyncHTTPTransport(proxy=self.proxy_url)
async with httpx.AsyncClient(
transport=transport,
timeout=15.0,
headers={"User-Agent": "Mozilla/5.0 Chrome/131.0.0.0"},
) as client:
async def fetch(url):
async with sem:
try:
resp = await client.get(url)
self.tool_stats["httpx"]["success"] += 1
return {"url": url, "status": resp.status_code, "html": resp.text}
except Exception as e:
self.tool_stats["httpx"]["fail"] += 1
return {"url": url, "error": str(e)}
results = await asyncio.gather(*[fetch(u) for u in urls])
return results
def _batch_curl_cffi(self, urls: list[str]) -> list[dict]:
session = cffi_requests.Session(
impersonate="chrome131",
proxies={"https": self.proxy_url, "http": self.proxy_url},
)
results = []
for url in urls:
try:
resp = session.get(url)
self.tool_stats["curl_cffi"]["success"] += 1
results.append({"url": url, "status": resp.status_code, "html": resp.text})
except Exception as e:
self.tool_stats["curl_cffi"]["fail"] += 1
results.append({"url": url, "error": str(e)})
time.sleep(1)
session.close()
return results
async def _batch_playwright(self, urls: list[str]) -> list[dict]:
from playwright.async_api import async_playwright
async with async_playwright() as p:
# Playwright needs proxy credentials as separate fields, not embedded in the URL
from urllib.parse import urlparse
parsed = urlparse(self.proxy_url)
proxy_cfg = {"server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}"}
if parsed.username:
    proxy_cfg["username"] = parsed.username
    proxy_cfg["password"] = parsed.password
browser = await p.chromium.launch(headless=True, proxy=proxy_cfg)
sem = asyncio.Semaphore(self.max_concurrent_playwright)
async def fetch(url):
async with sem:
page = await browser.new_page()
try:
await page.goto(url, wait_until="networkidle", timeout=30000)
html = await page.content()
self.tool_stats["playwright"]["success"] += 1
return {"url": url, "status": 200, "html": html}
except Exception as e:
self.tool_stats["playwright"]["fail"] += 1
return {"url": url, "error": str(e)}
finally:
await page.close()
results = await asyncio.gather(*[fetch(u) for u in urls])
await browser.close()
return results
def report(self) -> str:
lines = ["Scraping Pipeline Stats:"]
for tool, stats in self.tool_stats.items():
total = stats["success"] + stats["fail"]
if total > 0:
rate = stats["success"] / total * 100
lines.append(f" {tool}: {stats['success']}/{total} ({rate:.0f}% success)")
return "\n".join(lines)
The Uncomfortable Truth
The reason most developers default to Playwright is not technical — it is cognitive. Browsers feel safe because they "just work." You do not have to think about whether the page uses JavaScript, what headers to send, or how cookies work. The browser handles it all.
But that convenience has a cost: slower scrapes, higher infrastructure bills, flakier pipelines, and jobs that mysteriously fail when Chrome decides to update its binary. The developers I know who scrape at scale learned to reach for httpx first and Playwright only when they have confirmed they need it.
Start simple. Test whether the data exists in the HTML. Check the Network tab for APIs. Try curl_cffi if you get blocked. Only spin up Playwright when you have concrete evidence that nothing else works. This systematic approach will save you time, money, and maintenance headaches.
The right tool is the simplest one that gets the job done. Most of the time, that is httpx.
Further Reading
- How to Scrape Google Search Results Without Getting Blocked — Google SERP scraping with multiple approaches
- TLS Fingerprinting: Why Your Scraper's Handshake Gives It Away — Deep dive into JA3/JA4+ detection
- Residential vs Datacenter Proxies: Complete Guide — Choosing the right proxy type
- Scrapy vs BeautifulSoup vs Playwright — Framework comparison guide