JavaScript Rendering for Web Scraping: When You Actually Need a Browser (2026)

The most common mistake in web scraping is reaching for a headless browser when you don't need one. Playwright and Puppeteer are powerful tools, but they're also slow, memory-hungry, and dramatically more complex than a simple HTTP request. Every browser instance consumes 50-200MB of RAM and takes 2-10 seconds per page — compare that to httpx fetching plain HTML in 200ms using 5MB of memory.

Before you spin up a browser, you should be certain you actually need one. In my experience, roughly 70% of scrapers that use Playwright or Selenium could achieve the same results with plain HTTP requests — either because the data is actually in the initial HTML, or because there's a hidden API endpoint serving the same data as JSON. The browser is the nuclear option, not the default.

This guide will teach you how to determine when JavaScript rendering is genuinely necessary, how to find and exploit hidden APIs that bypass the need for a browser entirely, how to use Playwright effectively when you do need it, and how to handle the anti-detection challenges that come with browser automation. Every code example is production-ready with proper error handling and proxy support.

Whether you're scraping a React SPA, a server-rendered Next.js app, or a legacy site with jQuery sprinkled in, the decision framework here will save you hours of development time and thousands of dollars in infrastructure costs. Let's figure out the right tool for the job.

The One-Minute Test: Do You Even Need a Browser?

Before writing any code, run this in your terminal:

curl -s "https://example.com/page-you-want" | head -200

Look at the HTML that comes back. If you can see the data you're after — product names, prices, article text — in the raw response, you don't need a browser. Full stop. An HTTP client like httpx will be 10-50x faster.
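When the data is right there, the whole scraper can stay small. Here's a minimal sketch — the URL and CSS selectors are placeholders, so swap in whatever markup you saw in the curl output:

import httpx
from bs4 import BeautifulSoup

def scrape_static(url):
    """Fetch and parse server-rendered HTML — no browser involved."""
    resp = httpx.get(url, timeout=30, follow_redirects=True)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")
    # Placeholder selectors — adjust to the markup you inspected
    return [
        {
            "title": card.select_one("h2").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
        }
        for card in soup.select(".product-card")
        if card.select_one("h2") and card.select_one(".price")
    ]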

If the body is mostly empty — a <div id="root"></div> with a pile of <script> tags — that's a single-page application (SPA). The content gets built by JavaScript after the page loads. That's when browser rendering enters the picture.

Here's a Python script that automates this detection:

import httpx
from bs4 import BeautifulSoup
import re

def detect_rendering_type(url):
    """Analyze a page to determine if JavaScript rendering is needed.

    Returns a dict with the detection result and reasoning.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
    }

    resp = httpx.get(url, headers=headers, timeout=30, follow_redirects=True)
    html = resp.text
    soup = BeautifulSoup(html, "lxml")

    result = {
        "url": url,
        "status": resp.status_code,
        "needs_browser": False,
        "reasons": [],
        "recommendation": "",
        "framework": None,
        "has_api": False,
    }

    # Check 1: Is the body basically empty?
    body_text = ""  # initialized up front so Check 5 below can't hit a NameError
    body = soup.find("body")
    if body:
        body_text = body.get_text(strip=True)
        visible_elements = len(body.find_all(["p", "h1", "h2", "h3", "li", "td", "span"]))

        if len(body_text) < 200 and visible_elements < 10:
            result["needs_browser"] = True
            result["reasons"].append("Body is mostly empty — content loaded via JS")

    # Check 2: Detect SPA frameworks
    framework_markers = {
        "React": [r'react', r'data-reactroot', r'__REACT', r'_reactRootContainer'],
        "Next.js": [r'__NEXT_DATA__', r'_next/static', r'next/dist'],
        "Vue.js": [r'__vue__', r'v-if', r'v-for', r'vue\.js', r'vue\.min\.js'],
        "Nuxt.js": [r'__NUXT__', r'nuxt', r'_nuxt/'],
        "Angular": [r'ng-version', r'ng-app', r'angular\.js', r'angular\.min\.js'],
        "Svelte": [r'svelte', r'__svelte'],
    }

    for framework, patterns in framework_markers.items():
        for pattern in patterns:
            if re.search(pattern, html, re.IGNORECASE):
                result["framework"] = framework
                break
        if result["framework"]:
            break

    # Check 3: Look for __NEXT_DATA__ (Next.js SSR — data might be inline!)
    next_data_match = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    if next_data_match:
        result["reasons"].append("Next.js detected — check __NEXT_DATA__ for inline data")
        result["has_api"] = True
        # Data might be right here in the HTML, no browser needed!

    # Check 4: Look for JSON data embedded in script tags
    script_data_patterns = [
        r'window\.__INITIAL_STATE__\s*=',
        r'window\.__PRELOADED_STATE__\s*=',
        r'window\.__STORE__\s*=',
        r'window\.__DATA__\s*=',
        r'var\s+__NEXT_DATA__\s*=',
        r'application/ld\+json',
    ]
    for pattern in script_data_patterns:
        if re.search(pattern, html):
            result["reasons"].append(f"Found embedded data: {pattern}")
            result["has_api"] = True

    # Check 5: Count script tags vs content
    scripts = soup.find_all("script")
    script_count = len(scripts)
    if script_count > 15 and len(body_text) < 500:
        result["needs_browser"] = True
        result["reasons"].append(f"High script-to-content ratio: {script_count} scripts, minimal visible text")

    # Recommendation
    if result["has_api"] and not result["needs_browser"]:
        result["recommendation"] = "Extract data from embedded JSON — no browser needed"
    elif result["has_api"]:
        result["recommendation"] = "Try extracting embedded data first, browser as fallback"
    elif result["needs_browser"]:
        result["recommendation"] = "Browser rendering required — use Playwright"
    else:
        result["recommendation"] = "Static HTML — use httpx/requests directly"

    return result

# Test a page
result = detect_rendering_type("https://example.com/products")
print(f"URL: {result['url']}")
print(f"Framework: {result['framework']}")
print(f"Needs browser: {result['needs_browser']}")
print(f"Has embedded API data: {result['has_api']}")
print(f"Recommendation: {result['recommendation']}")
for reason in result['reasons']:
    print(f"  - {reason}")

What JavaScript Rendering Actually Does

When a browser loads a page, it does far more than fetch HTML:

  1. Parses HTML into the initial DOM tree
  2. Downloads CSS and computes styles
  3. Downloads and executes JavaScript — React, Vue, Angular, or plain JS
  4. Makes API calls (XHR/fetch) to backend services for data
  5. Constructs the final DOM with the fetched data
  6. Hydrates interactive elements (event listeners, state management)
  7. Renders visual output (layout, paint, composite)

A headless browser like Playwright runs all seven steps in a real Chromium instance — including layout and paint, just without putting pixels on a visible window. It's the nuclear option: thorough, but expensive. The critical insight is that most of the data you want comes from step 4 — the API calls. If you can call those APIs directly, you skip everything else.

The Hidden API Approach: Skip the Browser Entirely

Here's what most tutorials miss: even on JavaScript-heavy sites, you often don't need to render the page. SPAs fetch data from internal APIs that you can call directly. This is almost always the best approach when it works.

Finding Hidden APIs with DevTools

1. Open Chrome DevTools (F12)
2. Go to the Network tab
3. Filter by "Fetch/XHR"
4. Navigate to the page you want to scrape
5. Watch the API calls appear — these are your targets
6. Right-click any request → "Copy as cURL" for instant reproduction

Calling Hidden APIs Directly

import httpx
import json

# Instead of rendering a React product page,
# call the same API the frontend calls
def scrape_product_api(product_id):
    """Call the site's internal API directly — no browser needed."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
        "Accept": "application/json",
        "Referer": f"https://shop.example.com/products/{product_id}",
        "X-Requested-With": "XMLHttpRequest",
    }

    resp = httpx.get(
        f"https://shop.example.com/api/v2/products/{product_id}",
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()

    data = resp.json()
    return {
        "title": data["name"],
        "price": data["price"]["current"],
        "description": data["description"],
        "images": [img["url"] for img in data.get("images", [])],
        "in_stock": data["availability"]["in_stock"],
        "rating": data["reviews"]["average_rating"],
    }

# 10x faster than Playwright, uses 1/50th the memory
product = scrape_product_api("ABC123")
print(json.dumps(product, indent=2))

Extracting Data from Next.js __NEXT_DATA__

Many Next.js sites embed all page data in a <script id="__NEXT_DATA__"> tag. This means the data is right there in the HTML — you just need to extract the JSON:

import httpx
from bs4 import BeautifulSoup
import json

def extract_next_data(url):
    """Extract data from Next.js __NEXT_DATA__ script tag.

    Works on any Next.js site — the data is embedded in the HTML.
    No browser rendering needed.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                      "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
    }

    resp = httpx.get(url, headers=headers, timeout=30, follow_redirects=True)
    soup = BeautifulSoup(resp.text, "lxml")

    next_data = soup.find("script", id="__NEXT_DATA__")
    if not next_data:
        return None

    data = json.loads(next_data.string)
    # The actual page data is usually in props.pageProps
    page_props = data.get("props", {}).get("pageProps", {})
    return page_props

# Example: scrape a Next.js e-commerce site
page_data = extract_next_data("https://nextjs-store.example.com/product/widget")
if page_data:
    print(f"Product: {page_data.get('product', {}).get('name')}")
    print(f"Price: {page_data.get('product', {}).get('price')}")

Extracting Nuxt.js Window Data

import re
import json

def extract_nuxt_data(html):
    """Extract data from Nuxt.js window.__NUXT__ variable."""
    # Nuxt stores data in a window.__NUXT__ assignment
    match = re.search(r'window\.__NUXT__\s*=\s*({.+?})\s*;?\s*<\/script>', html, re.DOTALL)
    if match:
        # Nuxt data can contain JS-specific syntax (undefined, etc.)
        # Simple approach: try direct JSON parse
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            # Fall back to eval-safe extraction if needed
            pass
    return None
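If the direct parse fails, one practical fallback — an extra dependency, not something the snippet above requires — is the third-party chompjs library (pip install chompjs), which parses JavaScript object literals that aren't strict JSON: unquoted keys, single quotes, undefined values. A sketch:

import re
import json

import chompjs  # tolerant JS-object parser; assumes the package is installed

def extract_nuxt_data_tolerant(html):
    """Like extract_nuxt_data, but tolerates non-JSON __NUXT__ payloads."""
    match = re.search(r'window\.__NUXT__\s*=\s*({.+?})\s*;?\s*</script>', html, re.DOTALL)
    if not match:
        return None
    raw = match.group(1)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Handles unquoted keys, single quotes, trailing commas, undefined
        return chompjs.parse_js_object(raw)

Note that some Nuxt builds assign window.__NUXT__ via an immediately-invoked function rather than an object literal; no static parser helps there, and network interception (covered later) is the better route.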

Generic Embedded Data Extractor

import re
import json
from bs4 import BeautifulSoup

def extract_embedded_json(html):
    """Find and extract JSON data embedded in script tags.

    Catches __INITIAL_STATE__, __PRELOADED_STATE__, and similar patterns.
    """
    patterns = [
        (r'window\.__INITIAL_STATE__\s*=\s*({.+?})\s*;', "Initial State"),
        (r'window\.__PRELOADED_STATE__\s*=\s*({.+?})\s*;', "Preloaded State"),
        (r'window\.__STORE__\s*=\s*({.+?})\s*;', "Store Data"),
        (r'window\.__DATA__\s*=\s*({.+?})\s*;', "Page Data"),
    ]

    results = {}
    for pattern, label in patterns:
        match = re.search(pattern, html, re.DOTALL)
        if match:
            try:
                data = json.loads(match.group(1))
                results[label] = data
                print(f"Found {label}: {len(json.dumps(data))} bytes")
            except json.JSONDecodeError:
                print(f"Found {label} but couldn't parse JSON")

    # Also check for JSON-LD structured data
    soup = BeautifulSoup(html, "lxml")
    json_ld = soup.find_all("script", type="application/ld+json")
    for i, script in enumerate(json_ld):
        try:
            data = json.loads(script.string)
            results[f"JSON-LD #{i+1}"] = data
        except (json.JSONDecodeError, TypeError):
            pass

    return results

When You Genuinely Need a Browser

Some scenarios leave no alternative:

  1. The data only exists after client-side JavaScript runs, and there's no callable API behind it (or the requests are signed and obfuscated so you can't reproduce them)
  2. You need to interact with the page — log in, click, scroll, fill forms — before the data appears
  3. Content loads through infinite scroll or map panning that only fires on real browser events
  4. An anti-bot wall (Cloudflare and similar) requires a real browser to pass its JavaScript challenge

If you've checked for hidden APIs and confirmed you truly need rendering, then it's time to pick a tool.

Playwright: The Modern Standard (Python)

Playwright is the right choice for Python scrapers in 2026. It supports Chromium, Firefox, and WebKit, has smart auto-waits, and offers both sync and async APIs.

Basic Setup and Usage

pip install playwright
playwright install chromium

from playwright.sync_api import sync_playwright
import json
import time
import random

def scrape_spa_page(url, selector=None, wait_for=None):
    """Scrape a JavaScript-rendered page with Playwright.

    Args:
        url: Page to scrape
        selector: CSS selector to wait for (ensures content loaded)
        wait_for: Alternative — wait for specific text or element state
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--no-sandbox",
            ]
        )
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
            locale="en-US",
            timezone_id="America/New_York",
        )
        page = context.new_page()

        # Navigate and wait for content
        page.goto(url, wait_until="networkidle")

        if selector:
            page.wait_for_selector(selector, timeout=15000)

        if wait_for:
            page.wait_for_selector(f"text={wait_for}", timeout=15000)

        # Get the fully rendered HTML
        html = page.content()

        browser.close()
        return html

# Scrape a React SPA
html = scrape_spa_page(
    "https://react-store.example.com/products",
    selector=".product-card"
)

Extracting Data from Rendered Pages

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_product_listings(url):
    """Scrape product listings from a JavaScript-rendered page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
        )
        page = context.new_page()
        page.goto(url, wait_until="networkidle")
        page.wait_for_selector(".product-card", timeout=15000)

        # Option 1: Extract with Playwright selectors
        products = []
        cards = page.query_selector_all(".product-card")
        for card in cards:
            title = card.query_selector("h2")
            price = card.query_selector(".price")
            rating = card.query_selector(".rating")

            products.append({
                "title": title.inner_text() if title else None,
                "price": price.inner_text() if price else None,
                "rating": rating.inner_text() if rating else None,
            })

        # Option 2: Get rendered HTML and parse with BeautifulSoup
        html = page.content()
        soup = BeautifulSoup(html, "lxml")
        # Now parse as if it were static HTML — all JS content is rendered

        browser.close()
        return products

products = scrape_product_listings("https://spa-store.example.com/category/electronics")
for p in products[:5]:
    print(f"{p['title']} — {p['price']}")

Handling Infinite Scroll

Many modern sites load content as you scroll. Here's how to handle it:

def scrape_infinite_scroll(url, item_selector, max_items=100, scroll_pause=2):
    """Scrape a page with infinite scroll loading."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        items_collected = 0
        last_count = 0
        stale_scrolls = 0

        while items_collected < max_items and stale_scrolls < 3:
            # Scroll to bottom
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(scroll_pause + random.uniform(0.5, 1.5))

            # Count current items
            items = page.query_selector_all(item_selector)
            items_collected = len(items)

            if items_collected == last_count:
                stale_scrolls += 1
                print(f"No new items after scroll (attempt {stale_scrolls}/3)")
            else:
                stale_scrolls = 0
                print(f"Items loaded: {items_collected}")

            last_count = items_collected

        # Extract all loaded items
        results = []
        items = page.query_selector_all(item_selector)
        for item in items:
            results.append({
                "text": item.inner_text(),
                "html": item.inner_html(),
            })

        browser.close()
        print(f"Collected {len(results)} items total")
        return results

Intercepting Network Requests — The Power Move

Instead of parsing the rendered DOM, intercept the API calls the page makes and grab the raw JSON data. This is often the best of both worlds:

def intercept_api_calls(url, api_pattern="/api/"):
    """Let the browser render the page, but capture the API responses.

    This gives you structured JSON data without parsing HTML.
    """
    captured_responses = []

    def handle_response(response):
        if api_pattern in response.url and response.status == 200:
            try:
                body = response.json()
                captured_responses.append({
                    "url": response.url,
                    "status": response.status,
                    "data": body,
                })
            except Exception:
                pass

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", handle_response)

        page.goto(url, wait_until="networkidle")
        # Interact if needed to trigger more API calls
        time.sleep(3)

        browser.close()

    print(f"Captured {len(captured_responses)} API responses")
    for resp in captured_responses:
        print(f"  {resp['url']}: {type(resp['data']).__name__}")

    return captured_responses

# Let the browser do the work, but grab the clean JSON
api_data = intercept_api_calls(
    "https://spa-store.example.com/products",
    api_pattern="/api/products"
)

Async Playwright for Concurrent Scraping

import asyncio
import random

from playwright.async_api import async_playwright

async def scrape_pages_concurrent(urls, max_concurrent=5):
    """Scrape multiple JS-rendered pages concurrently."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results = []

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def scrape_one(url):
            async with semaphore:
                context = await browser.new_context(
                    viewport={"width": 1920, "height": 1080},
                    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                               "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
                )
                page = await context.new_page()
                try:
                    await page.goto(url, wait_until="networkidle", timeout=30000)
                    title = await page.title()
                    html = await page.content()
                    results.append({"url": url, "title": title, "html": html})
                    print(f"OK: {url}")
                except Exception as e:
                    print(f"FAIL: {url} — {e}")
                finally:
                    await context.close()

                # Random delay between pages
                await asyncio.sleep(random.uniform(1, 3))

        await asyncio.gather(*[scrape_one(url) for url in urls])
        await browser.close()

    return results

# Scrape 20 pages with up to 5 concurrent contexts in one browser
urls = [f"https://example.com/page/{i}" for i in range(1, 21)]
results = asyncio.run(scrape_pages_concurrent(urls))

Playwright vs Puppeteer: The Full Comparison

| Feature | Playwright (Python) | Puppeteer (Node.js) |
| --- | --- | --- |
| Language | Python, JS, .NET, Java | JavaScript/TypeScript only |
| Browsers | Chromium, Firefox, WebKit | Chromium only |
| Auto-wait | Built-in smart waits | Manual waits needed |
| Parallel contexts | Browser contexts (lightweight) | Separate browser instances |
| Network interception | Full request/response hooks | Full request/response hooks |
| Download handling | Built-in | Manual |
| Video recording | Built-in | Manual |
| Stealth | Better out of the box | Needs plugins |
| Community | Growing fast | Larger but JS-centric |
| Maintenance | Microsoft-backed | Google-backed |

Bottom line: If your stack is Python, use Playwright. If you're in Node.js already, either works, but Playwright's multi-browser support and auto-waits make it the better choice in 2026.

Anti-Detection for Browser Automation

Headless browsers leave fingerprints that sophisticated sites detect. Here's how to minimize detection:

Basic Stealth Configuration

def create_stealth_context(browser):
    """Create a browser context that's harder to detect as automated."""
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
        color_scheme="light",
        has_touch=False,
        is_mobile=False,
        java_script_enabled=True,
        permissions=["geolocation"],
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
    )

    # Override navigator properties that reveal automation
    context.add_init_script("""
        // Remove webdriver flag
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });

        // Override plugins to look real
        Object.defineProperty(navigator, 'plugins', {
            get: () => [
                { name: 'Chrome PDF Plugin' },
                { name: 'Chrome PDF Viewer' },
                { name: 'Native Client' },
            ]
        });

        // Override languages
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });

        // Fix chrome object
        window.chrome = {
            runtime: {},
            loadTimes: function() {},
            csi: function() {},
            app: {}
        };

        // Override permissions query
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) =>
            parameters.name === 'notifications'
                ? Promise.resolve({ state: Notification.permission })
                : originalQuery(parameters);
    """)

    return context

Using Proxies with Playwright

Route browser traffic through ThorData residential proxies to avoid IP-based detection:

from playwright.sync_api import sync_playwright

def scrape_with_proxy(url, proxy_url):
    """Scrape through a residential proxy to avoid IP detection."""
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={
                "server": proxy_url,
            }
        )
        context = create_stealth_context(browser)
        page = context.new_page()

        page.goto(url, wait_until="networkidle")
        html = page.content()

        browser.close()
        return html

# Route through ThorData residential proxy
html = scrape_with_proxy(
    "https://protected-site.example.com/data",
    "http://USERNAME:[email protected]:9000"
)

Proxy Rotation with Context Isolation

def scrape_multiple_with_proxy_rotation(urls, proxy_base_url):
    """Each page gets a fresh proxy IP via ThorData rotation."""
    results = []

    with sync_playwright() as p:
        for url in urls:
            # Each browser instance gets a new IP from ThorData
            browser = p.chromium.launch(
                headless=True,
                proxy={"server": proxy_base_url}
            )
            context = create_stealth_context(browser)
            page = context.new_page()

            try:
                page.goto(url, wait_until="networkidle", timeout=30000)
                html = page.content()
                results.append({"url": url, "html": html, "status": "ok"})
                print(f"OK: {url}")
            except Exception as e:
                results.append({"url": url, "html": None, "status": str(e)})
                print(f"FAIL: {url}: {e}")
            finally:
                browser.close()

            time.sleep(random.uniform(2, 5))

    return results

Human-Like Interaction Patterns

def human_like_browse(page, url):
    """Navigate and interact like a real user — reduces bot detection."""
    # Don't just teleport to the target — browse naturally
    page.goto("https://example.com", wait_until="networkidle")
    time.sleep(random.uniform(1, 3))

    # Scroll around a bit
    page.evaluate("window.scrollTo(0, 300)")
    time.sleep(random.uniform(0.5, 1.5))

    # Now navigate to the actual target
    page.goto(url, wait_until="networkidle")
    time.sleep(random.uniform(1, 2))

    # Simulate reading — scroll down slowly
    for scroll_pos in range(0, 2000, random.randint(200, 400)):
        page.evaluate(f"window.scrollTo(0, {scroll_pos})")
        time.sleep(random.uniform(0.3, 0.8))

    # Random mouse movements
    page.mouse.move(
        random.randint(100, 800),
        random.randint(100, 600)
    )

Rate Limiting and CAPTCHA Handling

Rate Limit Detection and Backoff

def rate_limited_browser_scrape(urls, requests_per_minute=10):
    """Scrape with automatic rate limiting for browser requests."""
    delay = 60 / requests_per_minute
    results = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = create_stealth_context(browser)
        page = context.new_page()

        for i, url in enumerate(urls):
            try:
                response = page.goto(url, wait_until="networkidle", timeout=30000)

                if response and response.status == 429:
                    # Rate limited — exponential backoff
                    wait = 60 * (2 ** min(i // 10, 3))
                    print(f"Rate limited. Waiting {wait}s...")
                    time.sleep(wait)
                    response = page.goto(url, wait_until="networkidle")

                if response and response.status == 403:
                    print(f"Blocked (403). Switching to proxy...")
                    # Restart with proxy
                    browser.close()
                    browser = p.chromium.launch(
                        headless=True,
                        proxy={"server": "http://USERNAME:[email protected]:9000"}
                    )
                    context = create_stealth_context(browser)
                    page = context.new_page()
                    response = page.goto(url, wait_until="networkidle")

                html = page.content()
                results.append({"url": url, "html": html})

            except Exception as e:
                results.append({"url": url, "error": str(e)})
                print(f"Error on {url}: {e}")

            # Respect rate limits
            time.sleep(delay + random.uniform(0.5, 2.0))

        browser.close()

    return results

CAPTCHA Detection and Handling

def detect_captcha(page):
    """Check if the current page is showing a CAPTCHA challenge."""
    captcha_selectors = [
        "iframe[src*='recaptcha']",
        "iframe[src*='hcaptcha']",
        "[class*='captcha']",
        "[id*='captcha']",
        "iframe[src*='turnstile']",
        ".cf-challenge",
        "#challenge-running",
    ]

    for selector in captcha_selectors:
        if page.query_selector(selector):
            return True

    # Check page text for challenge indicators
    body_text = page.evaluate("document.body.innerText").lower()
    indicators = ["verify you are human", "checking your browser",
                  "just a moment", "challenge", "captcha"]
    return any(ind in body_text for ind in indicators)

def scrape_with_captcha_handling(url, max_retries=3):
    """Scrape with CAPTCHA detection and proxy-based bypass."""
    for attempt in range(max_retries):
        with sync_playwright() as p:
            # Use proxy on retry attempts
            launch_opts = {"headless": True}
            if attempt > 0:
                launch_opts["proxy"] = {
                    "server": "http://USERNAME:[email protected]:9000"
                }

            browser = p.chromium.launch(**launch_opts)
            context = create_stealth_context(browser)
            page = context.new_page()

            try:
                page.goto(url, wait_until="networkidle", timeout=30000)

                if detect_captcha(page):
                    print(f"CAPTCHA detected (attempt {attempt + 1}/{max_retries})")
                    browser.close()
                    time.sleep(random.uniform(10, 30))
                    continue

                html = page.content()
                browser.close()
                return html

            except Exception as e:
                print(f"Error: {e}")
                browser.close()

    print("All attempts failed — CAPTCHA not bypassed")
    return None

The Performance Reality

Numbers from a real project scraping 1,000 product pages:

| Method | Time | RAM peak | Requests/sec |
| --- | --- | --- | --- |
| httpx (direct API) | 45 sec | 80 MB | ~22 |
| httpx + ThorData proxy | 90 sec | 85 MB | ~11 |
| Playwright (headless) | 22 min | 1.2 GB | ~0.75 |
| Playwright + proxy | 35 min | 1.3 GB | ~0.47 |

That's roughly a 30x speed difference and a 15x memory difference between httpx and Playwright. At scale, this translates directly to server costs.

If you're routing requests through residential proxies like ThorData, the per-request cost makes the speed difference even more significant. Fewer requests = lower proxy costs.
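To make that concrete, here's the back-of-envelope arithmetic from the benchmark table, extrapolated to a million pages (the 1M target is purely illustrative — the per-1,000-page timings are the measured ones):

# Measured: seconds per 1,000 pages, from the benchmark table above
HTTPX_SECS_PER_1K = 45
PLAYWRIGHT_SECS_PER_1K = 22 * 60  # 1,320 s

def compute_hours(pages, secs_per_1k):
    """Wall-clock hours to scrape `pages` pages at the measured rate."""
    return pages / 1000 * secs_per_1k / 3600

pages = 1_000_000
print(f"httpx:      {compute_hours(pages, HTTPX_SECS_PER_1K):.1f} h")       # ~12.5 h
print(f"Playwright: {compute_hours(pages, PLAYWRIGHT_SECS_PER_1K):.1f} h")  # ~366.7 h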

Real-World Use Cases

1. E-Commerce Price Monitoring (SPA)

def monitor_prices_spa(product_urls, db_path="prices.db"):
    """Monitor prices on JavaScript-rendered e-commerce sites."""
    import re
    import sqlite3
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS prices (
            url TEXT, price REAL, title TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = create_stealth_context(browser)

        for url in product_urls:
            page = context.new_page()
            try:
                page.goto(url, wait_until="networkidle")
                page.wait_for_selector("[class*='price']", timeout=10000)

                title = page.query_selector("h1")
                price_el = page.query_selector("[class*='price']")

                if title and price_el:
                    title_text = title.inner_text()
                    price_text = price_el.inner_text()
                    # Parse a US-style price like "$29.99" or "$1,299.99";
                    # comma-decimal formats ("29,99 EUR") need separate handling
                    price_match = re.search(r'[\d.]+', price_text.replace(',', ''))
                    if price_match:
                        price = float(price_match.group())
                        conn.execute(
                            "INSERT INTO prices (url, price, title) VALUES (?, ?, ?)",
                            (url, price, title_text)
                        )
                        print(f"{title_text}: ${price:.2f}")

            except Exception as e:
                print(f"Failed: {url} — {e}")
            finally:
                page.close()
            time.sleep(random.uniform(3, 6))

        browser.close()
    conn.commit()
    conn.close()

2. Social Media Data Collection

def scrape_social_feed(profile_url, max_posts=50):
    """Scrape posts from a JavaScript-heavy social media feed."""
    posts = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = create_stealth_context(browser)
        page = context.new_page()

        # Intercept API calls to get clean JSON data
        api_data = []
        def capture_api(response):
            if "/api/" in response.url and response.status == 200:
                try:
                    api_data.append(response.json())
                except Exception:
                    pass
        page.on("response", capture_api)

        page.goto(profile_url, wait_until="networkidle")

        # Scroll to load more posts
        loaded = 0
        while loaded < max_posts:
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            time.sleep(random.uniform(2, 4))

            current = len(page.query_selector_all("[class*='post']"))
            if current == loaded:
                break
            loaded = current
            print(f"Loaded {loaded} posts...")

        browser.close()

    # Process intercepted API data for clean extraction
    return api_data if api_data else posts

3. Dashboard Data Extraction

def scrape_dashboard(login_url, username, password, data_page_url):
    """Log into a web dashboard and extract data from JS-rendered charts."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = create_stealth_context(browser)
        page = context.new_page()

        # Login
        page.goto(login_url, wait_until="networkidle")
        page.fill("input[name='email']", username)
        page.fill("input[name='password']", password)
        page.click("button[type='submit']")
        page.wait_for_url("**/dashboard**", timeout=15000)
        print("Logged in successfully")

        # Navigate to data page
        page.goto(data_page_url, wait_until="networkidle")
        page.wait_for_selector(".data-table", timeout=15000)

        # Extract table data
        rows = page.query_selector_all(".data-table tr")
        data = []
        for row in rows:
            cells = row.query_selector_all("td")
            if cells:
                data.append([cell.inner_text() for cell in cells])

        browser.close()
        return data

4. Real Estate Listings (Map-Based)

def scrape_map_listings(search_url):
    """Scrape real estate listings from a map-based interface."""
    listings = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Intercept listing data API calls
        def capture_listings(response):
            if "listings" in response.url and response.status == 200:
                try:
                    data = response.json()
                    if isinstance(data, dict) and "results" in data:
                        listings.extend(data["results"])
                except Exception:
                    pass
        page.on("response", capture_listings)

        page.goto(search_url, wait_until="networkidle")
        time.sleep(5)  # Wait for map tiles and data to load

        # Pan map to trigger more listings
        for _ in range(3):
            page.mouse.move(500, 400)
            page.mouse.wheel(0, 300)
            time.sleep(3)

        browser.close()

    print(f"Captured {len(listings)} listings via API interception")
    return listings

5. News Aggregation from SPA Sites

def scrape_news_spa(urls, output_file="news.jsonl"):
    """Collect articles from JavaScript-rendered news sites."""
    import json

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)

        for url in urls:
            context = create_stealth_context(browser)
            page = context.new_page()

            try:
                page.goto(url, wait_until="networkidle", timeout=30000)

                # Try embedded data first
                next_data = page.evaluate("""
                    () => {
                        const el = document.getElementById('__NEXT_DATA__');
                        return el ? JSON.parse(el.textContent) : null;
                    }
                """)

                if next_data:
                    # Extract from Next.js data
                    article = next_data.get("props", {}).get("pageProps", {})
                else:
                    # Fall back to DOM extraction
                    article = {
                        "title": page.title(),
                        "content": page.evaluate(
                            "document.querySelector('article')?.innerText || ''"
                        ),
                        "url": url,
                    }

                with open(output_file, "a") as f:
                    f.write(json.dumps(article, ensure_ascii=False) + "\n")
                print(f"Saved: {article.get('title', url)[:60]}")

            except Exception as e:
                print(f"Failed: {url} — {e}")
            finally:
                context.close()
            time.sleep(random.uniform(2, 5))

        browser.close()

6. SPA Search Results Scraping

def scrape_search_results(base_url, queries, max_pages=5):
    """Scrape search results from a JavaScript-rendered search engine."""
    all_results = []

    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": "http://USERNAME:[email protected]:9000"}
        )
        context = create_stealth_context(browser)

        for query in queries:
            page = context.new_page()
            page.goto(f"{base_url}?q={query}", wait_until="networkidle")

            for page_num in range(max_pages):
                page.wait_for_selector(".search-result", timeout=10000)

                results = page.query_selector_all(".search-result")
                for result in results:
                    title_el = result.query_selector("h3, h2, .title")
                    link_el = result.query_selector("a")
                    snippet_el = result.query_selector(".snippet, .description, p")

                    all_results.append({
                        "query": query,
                        "page": page_num + 1,
                        "title": title_el.inner_text() if title_el else None,
                        "url": link_el.get_attribute("href") if link_el else None,
                        "snippet": snippet_el.inner_text() if snippet_el else None,
                    })

                # Try to go to next page
                next_btn = page.query_selector("[aria-label='Next'], .next-page, a:text('Next')")
                if not next_btn:
                    break
                next_btn.click()
                time.sleep(random.uniform(2, 4))

            page.close()
            time.sleep(random.uniform(3, 6))

        browser.close()

    return all_results

7. Authenticated API Discovery

def discover_apis_after_login(login_url, credentials, target_pages):
    """Log in to a site and discover all API endpoints it uses."""
    discovered_apis = {}

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = create_stealth_context(browser)
        page = context.new_page()

        # Track all API calls
        def log_api(response):
            url = response.url
            if any(p in url for p in ["/api/", "/graphql", "/v1/", "/v2/"]):
                if url not in discovered_apis:
                    discovered_apis[url] = {
                        "method": response.request.method,
                        "status": response.status,
                        "headers": dict(response.request.headers),
                        "content_type": response.headers.get("content-type", ""),
                    }
                    try:
                        discovered_apis[url]["sample_data"] = response.json()
                    except Exception:
                        pass

        page.on("response", log_api)

        # Login
        page.goto(login_url, wait_until="networkidle")
        page.fill("input[type='email']", credentials["email"])
        page.fill("input[type='password']", credentials["password"])
        page.click("button[type='submit']")
        time.sleep(5)

        # Visit target pages to trigger API calls
        for target in target_pages:
            page.goto(target, wait_until="networkidle")
            time.sleep(3)

        browser.close()

    print(f"\nDiscovered {len(discovered_apis)} API endpoints:")
    for url, info in discovered_apis.items():
        print(f"  {info['method']} {url} [{info['status']}]")

    return discovered_apis

Error Handling and Retry Logic

import functools

def retry_on_failure(max_retries=3, delay=5):
    """Decorator for retrying failed browser operations."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    print(f"Attempt {attempt + 1}/{max_retries} failed: {e}")
                    if attempt < max_retries - 1:
                        wait = delay * (2 ** attempt) + random.uniform(0, 2)
                        print(f"Retrying in {wait:.1f}s...")
                        time.sleep(wait)
                    else:
                        raise
        return wrapper
    return decorator

@retry_on_failure(max_retries=3)
def reliable_scrape(url):
    """Scrape with automatic retry on failure."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = create_stealth_context(browser)
        page = context.new_page()

        response = page.goto(url, wait_until="networkidle", timeout=30000)

        if response is None or response.status >= 400:
            browser.close()
            raise Exception(f"HTTP {response.status if response else 'no response'}")

        if detect_captcha(page):
            browser.close()
            raise Exception("CAPTCHA detected")

        html = page.content()
        browser.close()
        return html

Decision Flowchart

Follow this every time before writing scraper code (a code sketch of the same logic follows the list):

  1. Does curl return the data you need? → Use httpx. You're done.
  2. Is the page a SPA? Check Network tab for API calls. → Call the API directly with httpx.
  3. Is there __NEXT_DATA__ or __NUXT__ in the HTML? → Extract the embedded JSON. No browser needed.
  4. Is there application/ld+json structured data? → Extract it. Often has everything you need.
  5. No usable API? Data only exists after JS execution? → Use Playwright.
  6. Need to interact (login, scroll, click)? → Use Playwright with explicit waits.
  7. Getting blocked by Cloudflare/CAPTCHAs? → Use Playwright + ThorData residential proxies.
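Here's the flowchart as code, reusing the detect_rendering_type() helper from the top of this guide (the returned strings are just labels — wire them up to whichever scraper functions you've built):

def choose_strategy(url):
    """Map the decision flowchart onto detect_rendering_type() from earlier."""
    result = detect_rendering_type(url)
    if result["has_api"] and not result["needs_browser"]:
        return "embedded-json"               # steps 2-4: httpx + hidden API or embedded JSON
    if result["has_api"]:
        return "embedded-json-then-browser"  # try httpx first, Playwright as fallback
    if result["needs_browser"]:
        return "playwright"                  # steps 5-7: rendering, interaction, or anti-bot
    return "httpx"                           # step 1: static HTML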

Start from the top every time. The simplest approach that works is always the right one. Browser rendering is the tool of last resort, not the default. Your server bill and your scraper's reliability will both benefit from choosing the lightest tool that gets the job done.