JavaScript Rendering for Web Scraping: When You Actually Need a Browser (2026)
The most common mistake in web scraping is reaching for a headless browser when you don't need one. Playwright and Puppeteer are powerful tools, but they're also slow, memory-hungry, and dramatically more complex than a simple HTTP request. Every browser instance consumes 50-200MB of RAM and takes 2-10 seconds per page — compare that to httpx fetching plain HTML in 200ms using 5MB of memory.
Before you spin up a browser, you should be certain you actually need one. In my experience, roughly 70% of scrapers that use Playwright or Selenium could achieve the same results with plain HTTP requests — either because the data is actually in the initial HTML, or because there's a hidden API endpoint serving the same data as JSON. The browser is the nuclear option, not the default.
This guide covers how to determine when JavaScript rendering is genuinely necessary, how to find and exploit hidden APIs that bypass the need for a browser entirely, how to use Playwright effectively when you do need it, and how to handle the anti-detection challenges that come with browser automation. The code examples include error handling and proxy support where they matter.
Whether you're scraping a React SPA, a server-rendered Next.js app, or a legacy site with jQuery sprinkled in, the decision framework here will save you hours of development time and thousands of dollars in infrastructure costs. Let's figure out the right tool for the job.
The One-Minute Test: Do You Even Need a Browser?
Before writing any code, run this in your terminal:
curl -s "https://example.com/page-you-want" | head -200
Look at the HTML that comes back. If you can see the data you're after — product names, prices, article text — in the raw response, you don't need a browser. Full stop. An HTTP client like httpx will be 10-50x faster.
If the body is mostly empty — a <div id="root"></div> with a pile of <script> tags — that's a single-page application (SPA). The content gets built by JavaScript after the page loads. That's when browser rendering enters the picture.
Here's a Python script that automates this detection:
import httpx
from bs4 import BeautifulSoup
import re
def detect_rendering_type(url):
"""Analyze a page to determine if JavaScript rendering is needed.
Returns a dict with the detection result and reasoning.
"""
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
}
resp = httpx.get(url, headers=headers, timeout=30, follow_redirects=True)
html = resp.text
soup = BeautifulSoup(html, "lxml")
result = {
"url": url,
"status": resp.status_code,
"needs_browser": False,
"reasons": [],
"recommendation": "",
"framework": None,
"has_api": False,
}
    # Check 1: Is the body basically empty?
    body_text = ""  # default so later checks work even if there's no <body>
    body = soup.find("body")
    if body:
        body_text = body.get_text(strip=True)
        visible_elements = len(body.find_all(["p", "h1", "h2", "h3", "li", "td", "span"]))
        if len(body_text) < 200 and visible_elements < 10:
            result["needs_browser"] = True
            result["reasons"].append("Body is mostly empty — content loaded via JS")
# Check 2: Detect SPA frameworks
framework_markers = {
"React": [r'react', r'data-reactroot', r'__REACT', r'_reactRootContainer'],
"Next.js": [r'__NEXT_DATA__', r'_next/static', r'next/dist'],
"Vue.js": [r'__vue__', r'v-if', r'v-for', r'vue\.js', r'vue\.min\.js'],
"Nuxt.js": [r'__NUXT__', r'nuxt', r'_nuxt/'],
"Angular": [r'ng-version', r'ng-app', r'angular\.js', r'angular\.min\.js'],
"Svelte": [r'svelte', r'__svelte'],
}
for framework, patterns in framework_markers.items():
for pattern in patterns:
if re.search(pattern, html, re.IGNORECASE):
result["framework"] = framework
break
if result["framework"]:
break
# Check 3: Look for __NEXT_DATA__ (Next.js SSR — data might be inline!)
next_data_match = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
if next_data_match:
result["reasons"].append("Next.js detected — check __NEXT_DATA__ for inline data")
result["has_api"] = True
# Data might be right here in the HTML, no browser needed!
# Check 4: Look for JSON data embedded in script tags
script_data_patterns = [
r'window\.__INITIAL_STATE__\s*=',
r'window\.__PRELOADED_STATE__\s*=',
r'window\.__STORE__\s*=',
r'window\.__DATA__\s*=',
r'var\s+__NEXT_DATA__\s*=',
r'application/ld\+json',
]
for pattern in script_data_patterns:
if re.search(pattern, html):
result["reasons"].append(f"Found embedded data: {pattern}")
result["has_api"] = True
# Check 5: Count script tags vs content
scripts = soup.find_all("script")
script_count = len(scripts)
if script_count > 15 and len(body_text) < 500:
result["needs_browser"] = True
result["reasons"].append(f"High script-to-content ratio: {script_count} scripts, minimal visible text")
# Recommendation
if result["has_api"] and not result["needs_browser"]:
result["recommendation"] = "Extract data from embedded JSON — no browser needed"
elif result["has_api"]:
result["recommendation"] = "Try extracting embedded data first, browser as fallback"
elif result["needs_browser"]:
result["recommendation"] = "Browser rendering required — use Playwright"
else:
result["recommendation"] = "Static HTML — use httpx/requests directly"
return result
# Test a page
result = detect_rendering_type("https://example.com/products")
print(f"URL: {result['url']}")
print(f"Framework: {result['framework']}")
print(f"Needs browser: {result['needs_browser']}")
print(f"Has embedded API data: {result['has_api']}")
print(f"Recommendation: {result['recommendation']}")
for reason in result['reasons']:
print(f" - {reason}")
What JavaScript Rendering Actually Does
When a browser loads a page, it does far more than fetch HTML:
1. Parses HTML into the initial DOM tree
2. Downloads CSS and computes styles
3. Downloads and executes JavaScript — React, Vue, Angular, or plain JS
4. Makes API calls (XHR/fetch) to backend services for data
5. Constructs the final DOM with the fetched data
6. Hydrates interactive elements (event listeners, state management)
7. Renders visual output (layout, paint, composite)
A headless browser like Playwright runs this entire pipeline in a real Chromium instance, minus the visible window. It's thorough, but expensive. The critical insight is that most of the data you want comes from step 4, the API calls. If you can call those APIs directly, you skip everything else.
The Hidden API Approach: Skip the Browser Entirely
Here's what most tutorials miss: even on JavaScript-heavy sites, you often don't need to render the page. SPAs fetch data from internal APIs that you can call directly. This is almost always the best approach when it works.
Finding Hidden APIs with DevTools
1. Open Chrome DevTools (F12)
2. Go to the Network tab
3. Filter by "Fetch/XHR"
4. Navigate to the page you want to scrape
5. Watch the API calls appear — these are your targets
6. Right-click any request → "Copy as cURL" for instant reproduction
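The "Copy as cURL" output can be replayed from Python without retyping every header. A minimal sketch, stdlib only; it handles the common shape of a DevTools-copied command (`-H` headers plus bare flags like `--compressed`), not the full curl option set:

```python
import shlex

def parse_curl(curl_cmd):
    """Parse a DevTools 'Copy as cURL' command into (url, headers).

    Covers the common shape: curl 'URL' -H 'Name: value' ... --compressed
    Not a full curl parser: just enough to replay a captured GET request.
    """
    tokens = shlex.split(curl_cmd)
    url, headers = None, {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "curl":
            i += 1
        elif tok in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        elif tok.startswith("-"):
            i += 1  # skip bare flags such as --compressed
        else:
            url = tok  # the one non-flag token is the URL
            i += 1
    return url, headers

url, headers = parse_curl(
    "curl 'https://shop.example.com/api/v2/products/ABC123' "
    "-H 'Accept: application/json' -H 'Referer: https://shop.example.com/' --compressed"
)
print(url)
print(headers)
```

Feed the result straight into `httpx.get(url, headers=headers)` to reproduce the request outside the browser.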
Calling Hidden APIs Directly
import httpx
import json
# Instead of rendering a React product page,
# call the same API the frontend calls
def scrape_product_api(product_id):
"""Call the site's internal API directly — no browser needed."""
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
"Accept": "application/json",
"Referer": f"https://shop.example.com/products/{product_id}",
"X-Requested-With": "XMLHttpRequest",
}
resp = httpx.get(
f"https://shop.example.com/api/v2/products/{product_id}",
headers=headers,
timeout=30,
)
resp.raise_for_status()
data = resp.json()
return {
"title": data["name"],
"price": data["price"]["current"],
"description": data["description"],
"images": [img["url"] for img in data.get("images", [])],
"in_stock": data["availability"]["in_stock"],
"rating": data["reviews"]["average_rating"],
}
# 10x faster than Playwright, uses 1/50th the memory
product = scrape_product_api("ABC123")
print(json.dumps(product, indent=2))
Extracting Data from Next.js __NEXT_DATA__
Many Next.js sites embed all page data in a <script id="__NEXT_DATA__"> tag. This means the data is right there in the HTML — you just need to extract the JSON:
import httpx
from bs4 import BeautifulSoup
import json
def extract_next_data(url):
"""Extract data from Next.js __NEXT_DATA__ script tag.
Works on any Next.js site — the data is embedded in the HTML.
No browser rendering needed.
"""
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
}
resp = httpx.get(url, headers=headers, timeout=30, follow_redirects=True)
soup = BeautifulSoup(resp.text, "lxml")
next_data = soup.find("script", id="__NEXT_DATA__")
if not next_data:
return None
data = json.loads(next_data.string)
# The actual page data is usually in props.pageProps
page_props = data.get("props", {}).get("pageProps", {})
return page_props
# Example: scrape a Next.js e-commerce site
page_data = extract_next_data("https://nextjs-store.example.com/product/widget")
if page_data:
print(f"Product: {page_data.get('product', {}).get('name')}")
print(f"Price: {page_data.get('product', {}).get('price')}")
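The nesting under pageProps varies from site to site, so chains of .get() calls pile up fast. A small helper for walking uncertain nested structures keeps the extraction code readable (the helper name is mine, not part of Next.js):

```python
def dig(data, *keys, default=None):
    """Walk nested dicts/lists safely: dig(d, "props", "pageProps", "product").

    Returns `default` as soon as any step is missing or the wrong type,
    instead of raising KeyError/TypeError partway down.
    """
    for key in keys:
        if isinstance(data, dict):
            data = data.get(key, default)
        elif isinstance(data, list) and isinstance(key, int) and -len(data) <= key < len(data):
            data = data[key]
        else:
            return default
    return data

page = {"props": {"pageProps": {"product": {"name": "Widget", "price": 19.5}}}}
print(dig(page, "props", "pageProps", "product", "name"))  # Widget
print(dig(page, "props", "missing", "anything"))           # None
```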
Extracting Nuxt.js Window Data
import re
import json
def extract_nuxt_data(html):
"""Extract data from Nuxt.js window.__NUXT__ variable."""
# Nuxt stores data in a window.__NUXT__ assignment
match = re.search(r'window\.__NUXT__\s*=\s*({.+?})\s*;?\s*<\/script>', html, re.DOTALL)
if match:
# Nuxt data can contain JS-specific syntax (undefined, etc.)
# Simple approach: try direct JSON parse
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
# Fall back to eval-safe extraction if needed
pass
return None
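When json.loads fails because the __NUXT__ payload contains JavaScript literals, a light cleanup pass often rescues it. A hedged sketch: the two substitutions below (bare `undefined` to `null`, trailing commas stripped) cover the common offenders, not arbitrary JS such as inlined functions:

```python
import json
import re

def js_object_to_json(js_text):
    """Best-effort conversion of a simple JS object literal to JSON.

    Only handles bare `undefined` values and trailing commas; anything
    containing functions or expressions will still fail to parse.
    """
    cleaned = re.sub(r'\bundefined\b', 'null', js_text)  # undefined -> null
    cleaned = re.sub(r',\s*([}\]])', r'\1', cleaned)     # drop trailing commas
    return json.loads(cleaned)

data = js_object_to_json('{"user": undefined, "items": [1, 2, 3,],}')
print(data)  # {'user': None, 'items': [1, 2, 3]}
```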
Generic Embedded Data Extractor
import re
import json
from bs4 import BeautifulSoup

def extract_embedded_json(html):
    """Find and extract JSON data embedded in script tags.

    Catches __INITIAL_STATE__, __PRELOADED_STATE__, and similar patterns.
    """
patterns = [
(r'window\.__INITIAL_STATE__\s*=\s*({.+?})\s*;', "Initial State"),
(r'window\.__PRELOADED_STATE__\s*=\s*({.+?})\s*;', "Preloaded State"),
(r'window\.__STORE__\s*=\s*({.+?})\s*;', "Store Data"),
(r'window\.__DATA__\s*=\s*({.+?})\s*;', "Page Data"),
]
results = {}
for pattern, label in patterns:
match = re.search(pattern, html, re.DOTALL)
if match:
try:
data = json.loads(match.group(1))
results[label] = data
print(f"Found {label}: {len(json.dumps(data))} bytes")
except json.JSONDecodeError:
print(f"Found {label} but couldn't parse JSON")
# Also check for JSON-LD structured data
soup = BeautifulSoup(html, "lxml")
json_ld = soup.find_all("script", type="application/ld+json")
for i, script in enumerate(json_ld):
try:
data = json.loads(script.string)
results[f"JSON-LD #{i+1}"] = data
except (json.JSONDecodeError, TypeError):
pass
return results
When You Genuinely Need a Browser
Some scenarios leave no alternative:
- Cloudflare Turnstile, hCaptcha, or reCAPTCHA challenges that require browser fingerprints and actual JS execution
- Content built entirely client-side with no backing API (rare but it happens — some dashboards, data visualization tools, and canvas-rendered charts)
- Complex interaction flows — login forms with CSRF tokens, multi-step wizards, infinite scroll that triggers lazy-loaded API calls, cookie consent dialogs that gate content
- WebSocket-driven content that streams data through a persistent connection
- Shadow DOM content inside web components that isn't in the main HTML
If you've checked for hidden APIs and confirmed you truly need rendering, then it's time to pick a tool.
Playwright: The Modern Standard (Python)
Playwright is the right choice for Python scrapers in 2026. It supports Chromium, Firefox, and WebKit, has smart auto-waits, and offers both sync and async APIs.
Basic Setup and Usage
pip install playwright
playwright install chromium
from playwright.sync_api import sync_playwright
import json
import time
import random
def scrape_spa_page(url, selector=None, wait_for=None):
"""Scrape a JavaScript-rendered page with Playwright.
Args:
url: Page to scrape
selector: CSS selector to wait for (ensures content loaded)
wait_for: Alternative — wait for specific text or element state
"""
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--no-sandbox",
]
)
context = browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
locale="en-US",
timezone_id="America/New_York",
)
page = context.new_page()
# Navigate and wait for content
page.goto(url, wait_until="networkidle")
if selector:
page.wait_for_selector(selector, timeout=15000)
if wait_for:
page.wait_for_selector(f"text={wait_for}", timeout=15000)
# Get the fully rendered HTML
html = page.content()
browser.close()
return html
# Scrape a React SPA
html = scrape_spa_page(
"https://react-store.example.com/products",
selector=".product-card"
)
Extracting Data from Rendered Pages
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
def scrape_product_listings(url):
"""Scrape product listings from a JavaScript-rendered page."""
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
)
page = context.new_page()
page.goto(url, wait_until="networkidle")
page.wait_for_selector(".product-card", timeout=15000)
# Option 1: Extract with Playwright selectors
products = []
cards = page.query_selector_all(".product-card")
for card in cards:
title = card.query_selector("h2")
price = card.query_selector(".price")
rating = card.query_selector(".rating")
products.append({
"title": title.inner_text() if title else None,
"price": price.inner_text() if price else None,
"rating": rating.inner_text() if rating else None,
})
# Option 2: Get rendered HTML and parse with BeautifulSoup
html = page.content()
soup = BeautifulSoup(html, "lxml")
# Now parse as if it were static HTML — all JS content is rendered
browser.close()
return products
products = scrape_product_listings("https://spa-store.example.com/category/electronics")
for p in products[:5]:
print(f"{p['title']} — {p['price']}")
Handling Infinite Scroll
Many modern sites load content as you scroll. Here's how to handle it:
def scrape_infinite_scroll(url, item_selector, max_items=100, scroll_pause=2):
"""Scrape a page with infinite scroll loading."""
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto(url, wait_until="networkidle")
items_collected = 0
last_count = 0
stale_scrolls = 0
while items_collected < max_items and stale_scrolls < 3:
# Scroll to bottom
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(scroll_pause + random.uniform(0.5, 1.5))
# Count current items
items = page.query_selector_all(item_selector)
items_collected = len(items)
if items_collected == last_count:
stale_scrolls += 1
print(f"No new items after scroll (attempt {stale_scrolls}/3)")
else:
stale_scrolls = 0
print(f"Items loaded: {items_collected}")
last_count = items_collected
# Extract all loaded items
results = []
items = page.query_selector_all(item_selector)
for item in items:
results.append({
"text": item.inner_text(),
"html": item.inner_html(),
})
browser.close()
print(f"Collected {len(results)} items total")
return results
Intercepting Network Requests — The Power Move
Instead of parsing the rendered DOM, intercept the API calls the page makes and grab the raw JSON data. This is often the best of both worlds:
def intercept_api_calls(url, api_pattern="/api/"):
"""Let the browser render the page, but capture the API responses.
This gives you structured JSON data without parsing HTML.
"""
captured_responses = []
def handle_response(response):
if api_pattern in response.url and response.status == 200:
try:
body = response.json()
captured_responses.append({
"url": response.url,
"status": response.status,
"data": body,
})
except Exception:
pass
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.on("response", handle_response)
page.goto(url, wait_until="networkidle")
# Interact if needed to trigger more API calls
time.sleep(3)
browser.close()
print(f"Captured {len(captured_responses)} API responses")
for resp in captured_responses:
print(f" {resp['url']}: {type(resp['data']).__name__}")
return captured_responses
# Let the browser do the work, but grab the clean JSON
api_data = intercept_api_calls(
"https://spa-store.example.com/products",
api_pattern="/api/products"
)
Async Playwright for Concurrent Scraping
import asyncio
from playwright.async_api import async_playwright
async def scrape_pages_concurrent(urls, max_concurrent=5):
"""Scrape multiple JS-rendered pages concurrently."""
semaphore = asyncio.Semaphore(max_concurrent)
results = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
async def scrape_one(url):
async with semaphore:
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
)
page = await context.new_page()
try:
await page.goto(url, wait_until="networkidle", timeout=30000)
title = await page.title()
html = await page.content()
results.append({"url": url, "title": title, "html": html})
print(f"OK: {url}")
except Exception as e:
print(f"FAIL: {url} — {e}")
finally:
await context.close()
# Random delay between pages
await asyncio.sleep(random.uniform(1, 3))
await asyncio.gather(*[scrape_one(url) for url in urls])
await browser.close()
return results
# Scrape 20 pages with 5 concurrent browsers
urls = [f"https://example.com/page/{i}" for i in range(1, 21)]
results = asyncio.run(scrape_pages_concurrent(urls))
Playwright vs Puppeteer: The Full Comparison
| Feature | Playwright | Puppeteer (Node.js) |
|---|---|---|
| Language | Python, JS, .NET, Java | JavaScript/TypeScript only |
| Browsers | Chromium, Firefox, WebKit | Chromium only |
| Auto-wait | Built-in smart waits | Manual waits needed |
| Parallel contexts | Browser contexts (lightweight) | Separate browser instances |
| Network interception | Full request/response hooks | Full request/response hooks |
| Download handling | Built-in | Manual |
| Video recording | Built-in | Manual |
| Stealth | Better out-of-box | Needs plugins |
| Community | Growing fast | Larger but JS-centric |
| Maintenance | Microsoft-backed | Google-backed |
Bottom line: If your stack is Python, use Playwright. If you're in Node.js already, either works, but Playwright's multi-browser support and auto-waits make it the better choice in 2026.
Anti-Detection for Browser Automation
Headless browsers leave fingerprints that sophisticated sites detect. Here's how to minimize detection:
Basic Stealth Configuration
def create_stealth_context(browser):
"""Create a browser context that's harder to detect as automated."""
context = browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36",
locale="en-US",
timezone_id="America/New_York",
color_scheme="light",
has_touch=False,
is_mobile=False,
java_script_enabled=True,
permissions=["geolocation"],
geolocation={"latitude": 40.7128, "longitude": -74.0060},
)
# Override navigator properties that reveal automation
context.add_init_script("""
// Remove webdriver flag
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
// Override plugins to look real
Object.defineProperty(navigator, 'plugins', {
get: () => [
{ name: 'Chrome PDF Plugin' },
{ name: 'Chrome PDF Viewer' },
{ name: 'Native Client' },
]
});
// Override languages
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en']
});
// Fix chrome object
window.chrome = {
runtime: {},
loadTimes: function() {},
csi: function() {},
app: {}
};
// Override permissions query
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) =>
parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: originalQuery(parameters);
""")
return context
Using Proxies with Playwright
Route browser traffic through ThorData residential proxies to avoid IP-based detection:
from playwright.sync_api import sync_playwright
def scrape_with_proxy(url, proxy_url):
"""Scrape through a residential proxy to avoid IP detection."""
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
proxy={
"server": proxy_url,
}
)
context = create_stealth_context(browser)
page = context.new_page()
page.goto(url, wait_until="networkidle")
html = page.content()
browser.close()
return html
# Route through ThorData residential proxy
html = scrape_with_proxy(
"https://protected-site.example.com/data",
"http://USERNAME:[email protected]:9000"
)
Proxy Rotation with Context Isolation
def scrape_multiple_with_proxy_rotation(urls, proxy_base_url):
"""Each page gets a fresh proxy IP via ThorData rotation."""
results = []
with sync_playwright() as p:
for url in urls:
# Each browser instance gets a new IP from ThorData
browser = p.chromium.launch(
headless=True,
proxy={"server": proxy_base_url}
)
context = create_stealth_context(browser)
page = context.new_page()
try:
page.goto(url, wait_until="networkidle", timeout=30000)
html = page.content()
results.append({"url": url, "html": html, "status": "ok"})
print(f"OK: {url}")
except Exception as e:
results.append({"url": url, "html": None, "status": str(e)})
print(f"FAIL: {url}: {e}")
finally:
browser.close()
time.sleep(random.uniform(2, 5))
return results
Human-Like Interaction Patterns
def human_like_browse(page, url):
"""Navigate and interact like a real user — reduces bot detection."""
# Don't just teleport to the target — browse naturally
page.goto("https://example.com", wait_until="networkidle")
time.sleep(random.uniform(1, 3))
# Scroll around a bit
page.evaluate("window.scrollTo(0, 300)")
time.sleep(random.uniform(0.5, 1.5))
# Now navigate to the actual target
page.goto(url, wait_until="networkidle")
time.sleep(random.uniform(1, 2))
# Simulate reading — scroll down slowly
for scroll_pos in range(0, 2000, random.randint(200, 400)):
page.evaluate(f"window.scrollTo(0, {scroll_pos})")
time.sleep(random.uniform(0.3, 0.8))
# Random mouse movements
page.mouse.move(
random.randint(100, 800),
random.randint(100, 600)
)
Rate Limiting and CAPTCHA Handling
Rate Limit Detection and Backoff
def rate_limited_browser_scrape(urls, requests_per_minute=10):
"""Scrape with automatic rate limiting for browser requests."""
delay = 60 / requests_per_minute
results = []
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = create_stealth_context(browser)
page = context.new_page()
for i, url in enumerate(urls):
try:
response = page.goto(url, wait_until="networkidle", timeout=30000)
if response and response.status == 429:
# Rate limited — exponential backoff
wait = 60 * (2 ** min(i // 10, 3))
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
response = page.goto(url, wait_until="networkidle")
if response and response.status == 403:
print(f"Blocked (403). Switching to proxy...")
# Restart with proxy
browser.close()
browser = p.chromium.launch(
headless=True,
proxy={"server": "http://USERNAME:[email protected]:9000"}
)
context = create_stealth_context(browser)
page = context.new_page()
response = page.goto(url, wait_until="networkidle")
html = page.content()
results.append({"url": url, "html": html})
except Exception as e:
results.append({"url": url, "error": str(e)})
print(f"Error on {url}: {e}")
# Respect rate limits
time.sleep(delay + random.uniform(0.5, 2.0))
browser.close()
return results
CAPTCHA Detection and Handling
def detect_captcha(page):
"""Check if the current page is showing a CAPTCHA challenge."""
captcha_selectors = [
"iframe[src*='recaptcha']",
"iframe[src*='hcaptcha']",
"[class*='captcha']",
"[id*='captcha']",
"iframe[src*='turnstile']",
".cf-challenge",
"#challenge-running",
]
for selector in captcha_selectors:
if page.query_selector(selector):
return True
# Check page text for challenge indicators
body_text = page.evaluate("document.body.innerText").lower()
indicators = ["verify you are human", "checking your browser",
"just a moment", "challenge", "captcha"]
return any(ind in body_text for ind in indicators)
def scrape_with_captcha_handling(url, max_retries=3):
"""Scrape with CAPTCHA detection and proxy-based bypass."""
for attempt in range(max_retries):
with sync_playwright() as p:
# Use proxy on retry attempts
launch_opts = {"headless": True}
if attempt > 0:
launch_opts["proxy"] = {
"server": "http://USERNAME:[email protected]:9000"
}
browser = p.chromium.launch(**launch_opts)
context = create_stealth_context(browser)
page = context.new_page()
try:
page.goto(url, wait_until="networkidle", timeout=30000)
if detect_captcha(page):
print(f"CAPTCHA detected (attempt {attempt + 1}/{max_retries})")
browser.close()
time.sleep(random.uniform(10, 30))
continue
html = page.content()
browser.close()
return html
except Exception as e:
print(f"Error: {e}")
browser.close()
print("All attempts failed — CAPTCHA not bypassed")
return None
The Performance Reality
Numbers from a real project scraping 1,000 product pages:
| Method | Time | RAM Peak | Requests/sec |
|---|---|---|---|
| httpx (direct API) | 45 sec | 80 MB | ~22 |
| httpx + ThorData proxy | 90 sec | 85 MB | ~11 |
| Playwright (headless) | 22 min | 1.2 GB | ~0.75 |
| Playwright + proxy | 35 min | 1.3 GB | ~0.47 |
That's a 30x speed difference and 15x memory difference between httpx and Playwright. At scale, this translates directly to server costs:
- httpx scraper: Runs comfortably on a $5/month VPS
- Playwright scraper: Needs at least 2GB RAM, ideally 4GB — $20-40/month
If you're routing requests through residential proxies like ThorData, the per-request cost makes the speed difference even more significant. Fewer requests = lower proxy costs.
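The trade-off is easy to put in numbers. A back-of-the-envelope sketch; the per-GB rate and page sizes below are illustrative assumptions, not ThorData's actual pricing:

```python
def monthly_proxy_cost(pages_per_day, mb_per_page, usd_per_gb):
    """Estimate monthly residential-proxy bandwidth cost in USD."""
    gb_per_month = pages_per_day * 30 * mb_per_page / 1024
    return gb_per_month * usd_per_gb

# Assumed figures: a fully rendered page pulls ~2.5 MB (HTML + JS + assets),
# a direct API call ~0.05 MB of JSON, at an illustrative $5/GB.
browser_cost = monthly_proxy_cost(10_000, 2.5, 5.0)
api_cost = monthly_proxy_cost(10_000, 0.05, 5.0)
print(f"Browser rendering: ${browser_cost:,.2f}/month")
print(f"Direct API calls:  ${api_cost:,.2f}/month")
```

Under these assumptions the API route costs roughly 2% of what full rendering does, before counting the CPU and RAM savings.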
Real-World Use Cases
1. E-Commerce Price Monitoring (SPA)
def monitor_prices_spa(product_urls, db_path="prices.db"):
"""Monitor prices on JavaScript-rendered e-commerce sites."""
import sqlite3
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS prices (
url TEXT, price REAL, title TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = create_stealth_context(browser)
for url in product_urls:
page = context.new_page()
try:
page.goto(url, wait_until="networkidle")
page.wait_for_selector("[class*='price']", timeout=10000)
title = page.query_selector("h1")
price_el = page.query_selector("[class*='price']")
if title and price_el:
title_text = title.inner_text()
price_text = price_el.inner_text()
                    import re
                    # Naive parse: strip commas, take the first numeric run.
                    # Works for "$29.99"; decimal-comma formats like
                    # "29,99 EUR" would need locale-aware parsing.
                    price_match = re.search(r'[\d.]+', price_text.replace(',', ''))
                    if price_match:
                        price = float(price_match.group())
conn.execute(
"INSERT INTO prices (url, price, title) VALUES (?, ?, ?)",
(url, price, title_text)
)
print(f"{title_text}: ${price:.2f}")
except Exception as e:
print(f"Failed: {url} — {e}")
finally:
page.close()
time.sleep(random.uniform(3, 6))
browser.close()
conn.commit()
conn.close()
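The inline regex in the monitor is deliberately simple. A slightly more careful parser that distinguishes decimal commas from thousands separators might look like this; still a heuristic sketch, since reliably parsing prices needs locale metadata from the site itself:

```python
import re

def parse_price(text):
    """Parse "$29.99", "1,299.00", or "29,99 EUR" into a float.

    Heuristic: a trailing comma followed by exactly two digits is treated
    as a decimal comma; otherwise commas are thousands separators.
    """
    match = re.search(r'\d[\d.,]*', text)
    if not match:
        return None
    num = match.group()
    if re.search(r',\d{2}$', num):
        # Decimal comma ("29,99"): dots are thousands separators
        num = num.replace('.', '').replace(',', '.')
    else:
        # Decimal point ("1,299.00"): commas are thousands separators
        num = num.replace(',', '')
    return float(num)

print(parse_price("$29.99"))     # 29.99
print(parse_price("29,99 EUR"))  # 29.99
print(parse_price("1,299.00"))   # 1299.0
```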
2. Social Media Data Collection
def scrape_social_feed(profile_url, max_posts=50):
"""Scrape posts from a JavaScript-heavy social media feed."""
posts = []
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = create_stealth_context(browser)
page = context.new_page()
# Intercept API calls to get clean JSON data
api_data = []
def capture_api(response):
if "/api/" in response.url and response.status == 200:
try:
api_data.append(response.json())
except Exception:
pass
page.on("response", capture_api)
page.goto(profile_url, wait_until="networkidle")
# Scroll to load more posts
loaded = 0
while loaded < max_posts:
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(random.uniform(2, 4))
current = len(page.query_selector_all("[class*='post']"))
if current == loaded:
break
loaded = current
print(f"Loaded {loaded} posts...")
browser.close()
# Process intercepted API data for clean extraction
return api_data if api_data else posts
3. Dashboard Data Extraction
def scrape_dashboard(login_url, username, password, data_page_url):
"""Log into a web dashboard and extract data from JS-rendered charts."""
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
context = create_stealth_context(browser)
page = context.new_page()
# Login
page.goto(login_url, wait_until="networkidle")
page.fill("input[name='email']", username)
page.fill("input[name='password']", password)
page.click("button[type='submit']")
page.wait_for_url("**/dashboard**", timeout=15000)
print("Logged in successfully")
# Navigate to data page
page.goto(data_page_url, wait_until="networkidle")
page.wait_for_selector(".data-table", timeout=15000)
# Extract table data
rows = page.query_selector_all(".data-table tr")
data = []
for row in rows:
cells = row.query_selector_all("td")
if cells:
data.append([cell.inner_text() for cell in cells])
browser.close()
return data
4. Real Estate Listings (Map-Based)
def scrape_map_listings(search_url):
"""Scrape real estate listings from a map-based interface."""
listings = []
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
# Intercept listing data API calls
def capture_listings(response):
if "listings" in response.url and response.status == 200:
try:
data = response.json()
if isinstance(data, dict) and "results" in data:
listings.extend(data["results"])
except Exception:
pass
page.on("response", capture_listings)
page.goto(search_url, wait_until="networkidle")
time.sleep(5) # Wait for map tiles and data to load
# Pan map to trigger more listings
for _ in range(3):
page.mouse.move(500, 400)
page.mouse.wheel(0, 300)
time.sleep(3)
browser.close()
print(f"Captured {len(listings)} listings via API interception")
return listings
5. News Aggregation from SPA Sites
def scrape_news_spa(urls, output_file="news.jsonl"):
"""Collect articles from JavaScript-rendered news sites."""
import json
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
for url in urls:
context = create_stealth_context(browser)
page = context.new_page()
try:
page.goto(url, wait_until="networkidle", timeout=30000)
# Try embedded data first
next_data = page.evaluate("""
() => {
const el = document.getElementById('__NEXT_DATA__');
return el ? JSON.parse(el.textContent) : null;
}
""")
if next_data:
# Extract from Next.js data
article = next_data.get("props", {}).get("pageProps", {})
else:
# Fall back to DOM extraction
article = {
"title": page.title(),
"content": page.evaluate(
"document.querySelector('article')?.innerText || ''"
),
"url": url,
}
with open(output_file, "a") as f:
f.write(json.dumps(article, ensure_ascii=False) + "\n")
print(f"Saved: {article.get('title', url)[:60]}")
except Exception as e:
print(f"Failed: {url} — {e}")
finally:
context.close()
time.sleep(random.uniform(2, 5))
browser.close()
6. SPA Search Results Scraping
from urllib.parse import quote_plus

def scrape_search_results(base_url, queries, max_pages=5):
    """Scrape search results from a JavaScript-rendered search engine."""
    all_results = []

    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": "http://USERNAME:[email protected]:9000"},
        )
        context = create_stealth_context(browser)

        for query in queries:
            page = context.new_page()
            # quote_plus handles spaces and special characters in the query
            page.goto(f"{base_url}?q={quote_plus(query)}", wait_until="networkidle")

            for page_num in range(max_pages):
                page.wait_for_selector(".search-result", timeout=10000)
                results = page.query_selector_all(".search-result")

                for result in results:
                    title_el = result.query_selector("h3, h2, .title")
                    link_el = result.query_selector("a")
                    snippet_el = result.query_selector(".snippet, .description, p")
                    all_results.append({
                        "query": query,
                        "page": page_num + 1,
                        "title": title_el.inner_text() if title_el else None,
                        "url": link_el.get_attribute("href") if link_el else None,
                        "snippet": snippet_el.inner_text() if snippet_el else None,
                    })

                # Try to go to the next page
                next_btn = page.query_selector("[aria-label='Next'], .next-page, a:text('Next')")
                if not next_btn:
                    break
                next_btn.click()
                time.sleep(random.uniform(2, 4))

            page.close()
            time.sleep(random.uniform(3, 6))

        browser.close()

    return all_results
7. Authenticated API Discovery
def discover_apis_after_login(login_url, credentials, target_pages):
    """Log in to a site and discover all API endpoints it uses."""
    discovered_apis = {}

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = create_stealth_context(browser)
        page = context.new_page()

        # Track all API calls
        def log_api(response):
            url = response.url
            if any(marker in url for marker in ["/api/", "/graphql", "/v1/", "/v2/"]):
                if url not in discovered_apis:
                    discovered_apis[url] = {
                        "method": response.request.method,
                        "status": response.status,
                        "headers": dict(response.request.headers),
                        "content_type": response.headers.get("content-type", ""),
                    }
                    try:
                        discovered_apis[url]["sample_data"] = response.json()
                    except Exception:
                        pass

        page.on("response", log_api)

        # Log in
        page.goto(login_url, wait_until="networkidle")
        page.fill("input[type='email']", credentials["email"])
        page.fill("input[type='password']", credentials["password"])
        page.click("button[type='submit']")
        time.sleep(5)

        # Visit target pages to trigger API calls
        for target in target_pages:
            page.goto(target, wait_until="networkidle")
            time.sleep(3)

        browser.close()

    print(f"\nDiscovered {len(discovered_apis)} API endpoints:")
    for url, info in discovered_apis.items():
        print(f"  {info['method']} {url} [{info['status']}]")
    return discovered_apis
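The payoff of discovery is replay: once an endpoint is known, you can usually call it directly with httpx and skip the browser entirely. A minimal filter over the dict that `discover_apis_after_login` returns picks out the likely-replayable candidates — successful GET requests that returned JSON — along with the captured request headers (auth tokens, cookies) a replay would need. The field names match the dict built in the listener above:

```python
def replayable_endpoints(discovered_apis):
    """Filter discovered endpoints down to ones httpx can likely replay:
    successful GET requests that answered with JSON.

    Returns {url: captured_request_headers} — pass those headers to
    httpx.get() so the replay carries the same auth as the browser did.
    """
    candidates = {}
    for url, info in discovered_apis.items():
        if (info["method"] == "GET"
                and info["status"] == 200
                and "json" in info.get("content_type", "")):
            candidates[url] = info["headers"]
    return candidates
```

POST and GraphQL endpoints can often be replayed too, but they need the request body as well, so capture `response.request.post_data` during discovery if you want to go that far.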
Error Handling and Retry Logic
import functools
import random
import time

def retry_on_failure(max_retries=3, delay=5):
    """Decorator for retrying failed browser operations."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    print(f"Attempt {attempt + 1}/{max_retries} failed: {e}")
                    if attempt < max_retries - 1:
                        # Exponential backoff with jitter
                        wait = delay * (2 ** attempt) + random.uniform(0, 2)
                        print(f"Retrying in {wait:.1f}s...")
                        time.sleep(wait)
                    else:
                        raise
        return wrapper
    return decorator
@retry_on_failure(max_retries=3)
def reliable_scrape(url):
    """Scrape with automatic retry on failure."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = create_stealth_context(browser)
            page = context.new_page()
            response = page.goto(url, wait_until="networkidle", timeout=30000)
            if response.status >= 400:
                raise Exception(f"HTTP {response.status}")
            if detect_captcha(page):
                raise Exception("CAPTCHA detected")
            return page.content()
        finally:
            # Close the browser on every path — including goto timeouts —
            # so retries don't stack up orphaned browser processes
            browser.close()
Decision Flowchart
Follow this every time before writing scraper code:
- Does curl return the data you need? → Use httpx. You're done.
- Is the page a SPA? Check the Network tab for API calls. → Call the API directly with httpx.
- Is there __NEXT_DATA__ or __NUXT__ in the HTML? → Extract the embedded JSON. No browser needed.
- Is there application/ld+json structured data? → Extract it. Often has everything you need.
- No usable API? Data only exists after JS execution? → Use Playwright.
- Need to interact (login, scroll, click)? → Use Playwright with explicit waits.
- Getting blocked by Cloudflare/CAPTCHAs? → Use Playwright + ThorData residential proxies.
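The first few checks of the flowchart can be automated against the raw HTML from a plain GET. This is a rough triage sketch — the heuristics and the 50-word threshold are guesses, not rules, so always eyeball the page as well:

```python
import re

def triage_html(html):
    """Classify a raw HTML response per the decision flowchart.

    Returns a hint string; heuristics only — verify manually before
    committing to an approach.
    """
    if re.search(r'id=["\']__NEXT_DATA__["\']', html):
        return "next-data"      # embedded Next.js JSON: parse it, no browser
    if "window.__NUXT__" in html:
        return "nuxt-data"      # embedded Nuxt state: same idea
    if "application/ld+json" in html:
        return "ld-json"        # structured data: often sufficient on its own
    # Strip scripts and tags; a near-empty body usually means a SPA shell
    text = re.sub(r"<script.*?</script>", "", html, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", text)
    if len(text.split()) < 50:
        return "spa-shell"      # hunt for API calls, else reach for Playwright
    return "static-html"        # data is in the HTML: use httpx + a parser
```

Feed it `httpx.get(url).text` and let the return value tell you which branch of the flowchart to take first.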
Start from the top every time. The simplest approach that works is always the right one. Browser rendering is the tool of last resort, not the default. Your server bill and your scraper's reliability will both benefit from choosing the lightest tool that gets the job done.