How to Scrape JavaScript-Heavy Websites with Playwright (2026)
You write a clean requests-and-BeautifulSoup scraper. It works on most sites. Then you hit a React dashboard or an infinite-scroll product page and get back an empty `<div id="root">` and nothing else. The data you need is nowhere in the HTML source. Welcome to the JavaScript rendering problem.
Modern web applications do not ship HTML anymore - they ship JavaScript bundles that build the HTML inside your browser. React, Vue, Angular, and Svelte applications render entirely client-side. Lazy-loaded content, dynamically injected elements, and XHR-fetched data are invisible to traditional HTTP scrapers. The HTML you download from the server is a skeleton waiting to be filled in by the browser.
This guide covers Playwright for Python in depth: why it is the right tool for JS-heavy sites in 2026, how to set it up, every wait strategy you will actually need, performance optimizations that matter at scale, proxy integration with ThorData, anti-detection techniques, CAPTCHA handling, and seven production-grade use cases with complete code. If you have avoided Playwright because it felt complex, this will change that.
Why Some Sites Need a Real Browser
Before reaching for Playwright, it is worth understanding exactly what the problem is. HTTP scraping fails on JS-heavy sites for a specific reason: the server sends JavaScript, not data. The data only exists after the JavaScript runs in a browser environment.
The rendering pipeline:

1. Browser requests the URL
2. Server returns HTML skeleton + JavaScript bundle (often 1-5MB)
3. Browser parses and executes the JavaScript
4. JavaScript makes additional API calls (XHR/fetch)
5. JavaScript builds the DOM with the fetched data
6. User sees a rendered page
If you stop at step 2 with a raw HTTP client, you get the skeleton. The actual product names, prices, and content are assembled in steps 4-6.
How to tell if you need Playwright:

- Open the page in a browser, then right-click and "View Page Source"
- If View Source shows useful data: requests + BeautifulSoup can handle it
- If View Source shows an empty shell with script tags: you need a browser
- Check the Network tab in DevTools: if the data comes from a /api/ URL as JSON, you can scrape that API directly - often cleaner than Playwright
The API interception alternative: Before building a Playwright scraper, always check if the data comes from an internal API that you can call directly. Open DevTools Network tab, filter by Fetch/XHR, reload the page, and look for JSON responses containing the data you want. Calling that API directly with httpx is faster, cheaper, and more reliable than a full Playwright setup. We cover this pattern in detail later.
Playwright vs Selenium vs Puppeteer in 2026
Selenium is the original browser automation framework, dating to 2004. It works, but it shows its age in several ways: verbose setup code, a WebDriver binary that must match your Chrome version (constant maintenance headache), flaky implicit waits that require time.sleep guesses, and an inconsistent API between Python and other languages.
Puppeteer was Google's modern answer to Selenium - a Chrome-only library with a clean Promise-based API. The official version is JavaScript/Node.js only. The Python port (pyppeteer) was a community project that has been effectively abandoned since 2021 with known compatibility issues on modern Chrome versions.
Playwright is Microsoft's 2020 entry that solved the remaining problems:

- Cross-browser support (Chromium, Firefox, WebKit) from one unified API
- Native async/await in Python with asyncio
- Built-in auto-wait system - no more time.sleep guesswork
- Headless by default, one flag to switch to headed for debugging
- Screenshots, video recording, and network interception built-in
- Active development with monthly releases and genuine Microsoft support
- Browser binaries bundled - no version matching
For web scraping in Python in 2026, Playwright is the clear choice. The only reason to use Selenium is if you are maintaining existing code that uses it.
Installation and Setup
```shell
pip install playwright
playwright install chromium
```
Playwright downloads its own browser binaries to a local cache directory. No chromedriver version matching, no PATH configuration, no system Chrome dependency. This also means the browser version is pinned to the Playwright version, giving you reproducible behavior across environments.
For scraping specifically, Chromium is the right choice: it is what the majority of sites are tested against, and Chrome-specific fingerprinting is widespread.
Optional dependencies that matter:
```shell
# For TLS fingerprinting bypass (when httpx is not enough)
pip install curl-cffi

# For running Playwright in Docker/server environments,
# install system dependencies first:
apt-get install -y libglib2.0-0 libnss3 libnspr4 libdbus-1-3 libatk1.0-0 \
  libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
  libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2
```
Verify installation:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```
Sync vs Async API
Playwright offers both synchronous and asynchronous APIs. For scraping, almost always use the async API:
```python
# Sync - fine for simple scripts, sequential execution
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    content = page.content()
    browser.close()
```
```python
# Async - required for concurrent scraping
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://example.com")
        content = await page.content()
        await browser.close()

asyncio.run(main())
```
The sync API blocks on every operation. When you are scraping a single page at a time for debugging, that is fine. For production scrapers running against multiple URLs concurrently, the async API with asyncio.gather is mandatory - otherwise you are paying Playwright's memory overhead without getting its concurrency benefit.
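The concurrency pattern itself is plain asyncio: fan out one coroutine per URL with asyncio.gather, capped by a semaphore. A stripped-down sketch with the Playwright calls stubbed out (the sleep stands in for navigation and extraction):

```python
import asyncio

async def scrape_one(url: str, sem: asyncio.Semaphore) -> dict:
    # The semaphore caps how many pages are in flight at once;
    # the sleep is a stand-in for real goto()/content() calls.
    async with sem:
        await asyncio.sleep(0.01)
        return {"url": url, "success": True}

async def scrape_all(urls: list, max_concurrent: int = 4) -> list:
    sem = asyncio.Semaphore(max_concurrent)
    # gather runs all coroutines concurrently and preserves input order
    return await asyncio.gather(*(scrape_one(u, sem) for u in urls))

results = asyncio.run(scrape_all([f"https://example.com/p/{i}" for i in range(8)]))
```

The same shape reappears in the production scraper later in this guide; only the body of `scrape_one` changes.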
Full Working Example: JS-Rendered Quote Scraper
Let us start with a real, runnable example. quotes.toscrape.com/js/ renders its quotes entirely via JavaScript, making it the standard test site for browser-based scrapers:
```python
import asyncio
from dataclasses import dataclass
from typing import List

from playwright.async_api import async_playwright

@dataclass
class Quote:
    text: str
    author: str
    tags: List[str]

async def scrape_js_quotes(url: str = "http://quotes.toscrape.com/js/") -> List[Quote]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 720},
        )
        page = await context.new_page()
        await page.goto(url)
        # Wait until JS has rendered the quote elements
        await page.wait_for_selector(".quote", timeout=10000)
        quotes = []
        quote_elements = await page.locator(".quote").all()
        for el in quote_elements:
            text = await el.locator(".text").inner_text()
            author = await el.locator(".author").inner_text()
            tag_elements = await el.locator(".tag").all()
            tags = [await t.inner_text() for t in tag_elements]
            quotes.append(Quote(text=text.strip(), author=author.strip(), tags=tags))
        await context.close()
        await browser.close()
        return quotes

if __name__ == "__main__":
    results = asyncio.run(scrape_js_quotes())
    for q in results:
        print(f"{q.text[:60]}... — {q.author}")
```
Key points in this example:

- wait_for_selector pauses execution until JS has rendered the .quote elements - no arbitrary sleep needed
- locator returns a Locator object, and .all() gives you a list you can iterate
- Each async operation inside the loop awaits independently
- The context is closed before the browser to properly release resources
Wait Strategies
The auto-wait system is Playwright's most important feature for scraping. Every locator action (click, fill, inner_text, etc.) automatically waits for the element to be visible and enabled. But you still need to explicitly wait at the right points.
wait_for_selector: wait for a specific element
```python
# Wait for element to appear (default state: visible)
await page.wait_for_selector(".product-grid", timeout=10000)

# Wait for element to be hidden (useful after loading spinners)
await page.wait_for_selector(".loading-spinner", state="hidden", timeout=15000)

# Wait for element with specific text
await page.wait_for_selector("text=Add to Cart", timeout=5000)
```
wait_for_load_state: wait for page lifecycle events
```python
# Wait for DOMContentLoaded (HTML parsed, DOM built, synchronous scripts done)
await page.goto(url, wait_until="domcontentloaded")

# Wait for load event (HTML + all resources loaded)
await page.goto(url, wait_until="load")

# Wait for network idle (no pending requests for 500ms) - good for SPAs
await page.goto(url, wait_until="networkidle")

# After navigation, wait for network to settle
await page.wait_for_load_state("networkidle", timeout=10000)
```
wait_for_response: wait for a specific API call to complete
```python
# Wait for a specific API response triggered by a user action
async def wait_for_search_results(page, search_term: str) -> dict:
    async with page.expect_response(
        lambda r: "/api/search" in r.url and r.status == 200
    ) as response_info:
        await page.fill("#search-input", search_term)
        await page.press("#search-input", "Enter")
    response = await response_info.value
    return await response.json()
```
wait_for_function: wait for arbitrary JavaScript condition
```python
# Wait until a JavaScript condition is true
await page.wait_for_function(
    "() => document.querySelectorAll('.product-card').length > 10",
    timeout=15000,
)

# Wait for a specific global variable to be set
await page.wait_for_function("() => window.__DATA__ !== undefined")
```
Choosing the right wait strategy:
| Situation | Use |
|---|---|
| Specific element must be present | wait_for_selector |
| Initial page navigation | wait_until="networkidle" in goto |
| After user action triggers content | wait_for_selector on new content |
| Background API must complete | wait_for_response |
| Custom JavaScript condition | wait_for_function |
| Loading spinner must disappear | wait_for_selector state="hidden" |
Blocking Resources for Performance
A full browser loads every resource: HTML, CSS, JavaScript, images, fonts, video. For scraping, you often only need the HTML and the JavaScript that renders it. Blocking unnecessary resources cuts page load time by 40-70%.
```python
from playwright.async_api import async_playwright, Route, Request

BLOCKED_RESOURCE_TYPES = {"image", "stylesheet", "font", "media", "other"}
BLOCKED_DOMAINS = {"google-analytics.com", "doubleclick.net", "facebook.com", "hotjar.com"}

async def block_resources(route: Route, request: Request) -> None:
    """Abort requests for non-essential resources."""
    resource_type = request.resource_type
    url = request.url
    if resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
        return
    if any(domain in url for domain in BLOCKED_DOMAINS):
        await route.abort()
        return
    await route.continue_()

async def scrape_fast(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.route("**/*", block_resources)
        await page.goto(url, wait_until="domcontentloaded")
        await page.wait_for_selector("main, #content, .product-list")
        html = await page.content()
        await browser.close()
        return html
```
Blocking specific ad and tracking scripts:
```python
AD_SCRIPT_PATTERNS = [
    "**/ads/**",
    "**/analytics/**",
    "**/*.tracking.*",
    "**/beacon*",
    "**/pixel*",
    "**/gtm*",
    "**/ga.js",
    "**/analytics.js",
]

async def setup_resource_blocking(page) -> None:
    async def abort_ads(route):
        await route.abort()

    for pattern in AD_SCRIPT_PATTERNS:
        await page.route(pattern, abort_ads)
```
Screenshots and Debugging
When your scraper returns empty results, screenshots tell you exactly what the browser actually sees:
```python
from playwright.async_api import async_playwright

async def debug_scrape(url: str) -> None:
    async with async_playwright() as p:
        # Headed mode with slow_mo for visual debugging
        browser = await p.chromium.launch(headless=False, slow_mo=500)
        page = await browser.new_page()
        # Register console/error listeners before navigating so messages
        # emitted during page load are captured too
        page.on("console", lambda msg: print(f"Console {msg.type}: {msg.text}"))
        page.on("pageerror", lambda err: print(f"Page error: {err}"))
        await page.goto(url)
        # Full-page screenshot
        await page.screenshot(path="debug_full.png", full_page=True)
        # Screenshot of specific element
        element = await page.query_selector(".product-grid")
        if element:
            await element.screenshot(path="debug_element.png")
        await page.wait_for_timeout(3000)
        await browser.close()
```
Video recording for complex debugging:
```python
async def record_scraping_session(url: str, output_dir: str = "/tmp/pw_video") -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            record_video_dir=output_dir,
            record_video_size={"width": 1280, "height": 720},
        )
        page = await context.new_page()
        await page.goto(url)
        # Do your scraping here
        await context.close()  # Video is saved on context close
        await browser.close()
```
Proxy Integration
Residential proxies are essential for scraping sites that block data center IPs. With Playwright, proxy configuration goes into the browser launch or context creation:
Browser-level proxy (all contexts use same proxy):
```python
from playwright.async_api import async_playwright

async def scrape_with_proxy(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://gate.thordata.com:7000",
                "username": "your_username",
                "password": "your_password",
            },
        )
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content
```
Context-level proxy rotation (different proxy per batch):
```python
import asyncio
from typing import List, Optional

from playwright.async_api import async_playwright

THORDATA_BASE = "http://gate.thordata.com:7000"

def get_proxy_config(username: str, password: str, session_id: Optional[str] = None) -> dict:
    """
    ThorData proxy config.
    For sticky sessions, append session ID to username: username-sessid{id}
    For rotating, use base username.
    """
    user = f"{username}-sessid{session_id}" if session_id else username
    return {
        "server": THORDATA_BASE,
        "username": user,
        "password": password,
    }

async def scrape_urls_with_rotation(
    urls: List[str],
    username: str,
    password: str,
    max_concurrent: int = 4,
) -> List[dict]:
    """Each URL gets a fresh proxy context for maximum rotation."""
    semaphore = asyncio.Semaphore(max_concurrent)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def scrape_one(url: str) -> dict:
            async with semaphore:
                proxy = get_proxy_config(username, password)
                context = await browser.new_context(
                    proxy=proxy,
                    viewport={"width": 1366, "height": 768},
                    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
                    locale="en-US",
                )
                page = await context.new_page()
                try:
                    await page.goto(url, wait_until="networkidle", timeout=30000)
                    await page.wait_for_selector("body", timeout=5000)
                    return {"url": url, "html": await page.content(), "success": True}
                except Exception as e:
                    return {"url": url, "html": None, "success": False, "error": str(e)}
                finally:
                    await context.close()

        tasks = [scrape_one(url) for url in urls]
        batch_results = await asyncio.gather(*tasks, return_exceptions=True)
        results = [r for r in batch_results if isinstance(r, dict)]
        await browser.close()
    return results
```
ThorData provides residential proxies that work directly with Playwright's proxy configuration. Their rotating residential pool changes IP on each connection, which maps well to the per-context proxy pattern where each URL gets a fresh browser context.
Anti-Detection Techniques
Playwright runs a real browser, but headless Chrome has some tells that sophisticated bot detection systems look for. Masking them improves success rates on sites running PerimeterX, DataDome, and advanced Cloudflare configurations.
Core automation flags to mask:
```python
async def setup_stealth_context(browser, proxy_config: dict = None):
    """Create a browser context with common automation indicators masked."""
    context_args = {
        "viewport": {"width": 1366, "height": 768},
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "locale": "en-US",
        "timezone_id": "America/New_York",
        "geolocation": {"longitude": -74.0060, "latitude": 40.7128},
        "permissions": ["geolocation"],
        "color_scheme": "light",
        "device_scale_factor": 1,
        "has_touch": False,
        "is_mobile": False,
    }
    if proxy_config:
        context_args["proxy"] = proxy_config
    context = await browser.new_context(**context_args)
    # Inject stealth scripts before any page load
    await context.add_init_script("""
        // Remove webdriver property
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined,
            configurable: true
        });
        // Fake plugin count (headless has 0 plugins)
        Object.defineProperty(navigator, 'plugins', {
            get: () => {
                return {
                    length: 5,
                    0: {name: 'Chrome PDF Plugin'},
                    1: {name: 'Chrome PDF Viewer'},
                    2: {name: 'Native Client'},
                    3: {name: 'Chromium PDF Plugin'},
                    4: {name: 'Widevine Content Decryption Module'},
                };
            }
        });
        // Fake languages
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });
        // Add chrome runtime object (missing in headless)
        if (!window.chrome) {
            window.chrome = {
                runtime: {},
                loadTimes: function() {},
                csi: function() {},
                app: {}
            };
        }
        // Override permissions API
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) => (
            parameters.name === 'notifications' ?
                Promise.resolve({ state: Notification.permission }) :
                originalQuery(parameters)
        );
    """)
    return context
```
Realistic mouse movement simulation:
```python
import asyncio
import random

async def human_mouse_move(page, target_x: int, target_y: int) -> None:
    """Move mouse to target in small jittered steps rather than one straight jump."""
    steps = random.randint(8, 15)
    current_x, current_y = 0, 0
    for i in range(steps):
        progress = (i + 1) / steps
        # Interpolate toward the target with a little random jitter per step
        next_x = int(current_x + (target_x - current_x) * progress + random.uniform(-2, 2))
        next_y = int(current_y + (target_y - current_y) * progress + random.uniform(-2, 2))
        await page.mouse.move(next_x, next_y)
        await asyncio.sleep(random.uniform(0.01, 0.05))
        current_x, current_y = next_x, next_y
    await page.mouse.move(target_x, target_y)

async def human_click(page, selector: str) -> None:
    """Click an element with realistic human-like behavior."""
    element = await page.wait_for_selector(selector)
    box = await element.bounding_box()
    if box:
        # Click slightly off-center
        click_x = box["x"] + box["width"] * random.uniform(0.3, 0.7)
        click_y = box["y"] + box["height"] * random.uniform(0.3, 0.7)
        await human_mouse_move(page, int(click_x), int(click_y))
        await asyncio.sleep(random.uniform(0.05, 0.2))
        await page.mouse.click(int(click_x), int(click_y))
```
Realistic scroll behavior:
```python
async def human_scroll(page, distance: int = 500, speed: str = "normal") -> None:
    """Scroll page with variable speed and slight randomness."""
    if speed == "fast":
        step_size = random.randint(150, 250)
        step_delay = random.uniform(0.05, 0.1)
    elif speed == "slow":
        step_size = random.randint(50, 100)
        step_delay = random.uniform(0.15, 0.3)
    else:  # normal
        step_size = random.randint(80, 150)
        step_delay = random.uniform(0.08, 0.18)
    scrolled = 0
    while scrolled < distance:
        await page.evaluate(f"window.scrollBy(0, {step_size})")
        scrolled += step_size
        await asyncio.sleep(step_delay + random.uniform(-0.02, 0.05))
```
CAPTCHA Handling
CAPTCHAs in Playwright context usually mean the site detected automation signals before JavaScript ran. The approach differs by CAPTCHA type.
Detection:
```python
async def detect_captcha_type(page) -> str:
    """Detect what type of CAPTCHA, if any, is on the current page."""
    content = await page.content()
    url = page.url
    if "cf-challenge" in content or "/cdn-cgi/challenge-platform" in url:
        return "cloudflare_challenge"
    if "g-recaptcha" in content:
        return "recaptcha_v2"
    if "grecaptcha.execute" in content:
        return "recaptcha_v3"
    if "hcaptcha" in content:
        return "hcaptcha"
    if "px-captcha" in content or "perimeterx" in content.lower():
        return "perimeterx"
    if await page.query_selector('iframe[src*="recaptcha"]'):
        return "recaptcha_v2_iframe"
    return "none"
```
Handling Cloudflare challenges: Cloudflare's JS challenge runs browser fingerprint checks. With proper stealth settings and residential proxies, many Cloudflare challenges pass automatically because the fingerprint looks legitimate. When they do not pass:
```python
async def handle_cloudflare(page, max_wait: int = 15000) -> bool:
    """
    Wait for Cloudflare challenge to auto-solve.
    Works for JS challenges (not CAPTCHA challenges) with proper stealth.
    Returns True if challenge passed, False if still blocked.
    """
    try:
        # Wait for the challenge to resolve or timeout
        await page.wait_for_function(
            "() => !document.querySelector('.cf-browser-verification') && document.readyState === 'complete'",
            timeout=max_wait,
        )
        return True
    except Exception:
        # Take screenshot for debugging
        await page.screenshot(path="/tmp/cf_blocked.png", full_page=True)
        return False
```
2captcha/CapSolver integration for reCAPTCHA:
```python
import asyncio

import httpx

async def solve_recaptcha_v2(page, api_key: str) -> bool:
    """
    Use 2captcha API to solve reCAPTCHA v2.
    Requires a paid 2captcha account.
    """
    # Get site key from page
    sitekey = await page.evaluate("""
        () => {
            const el = document.querySelector('[data-sitekey]');
            return el ? el.getAttribute('data-sitekey') : null;
        }
    """)
    if not sitekey:
        return False
    site_url = page.url
    # Submit CAPTCHA to 2captcha (async client so the event loop is not blocked)
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post("https://2captcha.com/in.php", data={
            "key": api_key,
            "method": "userrecaptcha",
            "googlekey": sitekey,
            "pageurl": site_url,
            "json": 1,
        })
        task_id = resp.json().get("request")
        if not task_id:
            return False
        # Poll for solution (typically 15-30 seconds)
        for _ in range(24):
            await asyncio.sleep(5)
            result = await client.get(
                f"https://2captcha.com/res.php?key={api_key}&action=get&id={task_id}&json=1"
            )
            data = result.json()
            if data.get("status") == 1:
                token = data["request"]
                # Inject token into page and fire the site's callback
                await page.evaluate(f"""
                    () => {{
                        document.getElementById('g-recaptcha-response').value = '{token}';
                        if (typeof ___grecaptcha_cfg !== 'undefined') {{
                            Object.entries(___grecaptcha_cfg.clients).forEach(([k, v]) => {{
                                if (v.l && v.l.l) v.l.l.callback('{token}');
                            }});
                        }}
                    }}
                """)
                return True
            elif data.get("request") != "CAPCHA_NOT_READY":
                return False
    return False
```
Rate Limiting and Retry Logic
```python
import asyncio
import logging
import random
from typing import List, Optional

from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout

logger = logging.getLogger(__name__)

class PlaywrightScraper:
    def __init__(
        self,
        proxy_config: dict,
        max_concurrent: int = 4,
        requests_per_minute: int = 20,
        max_retries: int = 3,
    ):
        self.proxy_config = proxy_config
        self.max_concurrent = max_concurrent
        self.min_delay = 60.0 / requests_per_minute
        self.max_retries = max_retries

    async def scrape(self, url: str) -> Optional[str]:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            try:
                for attempt in range(1, self.max_retries + 1):
                    context = await browser.new_context(
                        proxy=self.proxy_config,
                        viewport={"width": 1366, "height": 768},
                        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
                    )
                    page = await context.new_page()
                    try:
                        await page.goto(url, wait_until="networkidle", timeout=30000)
                        # detect_captcha_type is defined in the CAPTCHA section above
                        captcha = await detect_captcha_type(page)
                        if captcha != "none":
                            logger.warning(f"CAPTCHA detected ({captcha}) on attempt {attempt}")
                            await asyncio.sleep(random.uniform(5, 15))
                            continue
                        return await page.content()
                    except PlaywrightTimeout:
                        logger.warning(f"Timeout on attempt {attempt} for {url}")
                        await asyncio.sleep(random.uniform(3, 8) * attempt)
                    except Exception as e:
                        logger.error(f"Error on attempt {attempt}: {e}")
                        await asyncio.sleep(random.uniform(2, 5))
                    finally:
                        await context.close()
            finally:
                await browser.close()
        return None

    async def scrape_batch(self, urls: List[str]) -> List[dict]:
        semaphore = asyncio.Semaphore(self.max_concurrent)

        async def scrape_with_rate_limit(url: str) -> dict:
            async with semaphore:
                html = await self.scrape(url)
                await asyncio.sleep(
                    self.min_delay + random.uniform(0, self.min_delay * 0.5)
                )
                return {"url": url, "html": html, "success": html is not None}

        tasks = [scrape_with_rate_limit(url) for url in urls]
        batch = await asyncio.gather(*tasks, return_exceptions=True)
        return [r for r in batch if isinstance(r, dict)]
```
Real-World Use Cases
1. React SPA Product Catalog
```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

@dataclass
class SPAProduct:
    name: str
    price: Optional[float]
    sku: str
    description: str
    images: List[str]
    variants: List[dict]
    url: str

async def scrape_react_catalog(
    catalog_url: str,
    proxy_config: dict,
    max_products: int = 100,
) -> List[SPAProduct]:
    products = []
    captured_api_data = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        # setup_stealth_context is defined in the anti-detection section above
        context = await setup_stealth_context(browser, proxy_config)
        page = await context.new_page()

        # Intercept product API calls
        async def capture_product_api(response):
            if "/api/products" in response.url or "/api/catalog" in response.url:
                if response.status == 200:
                    try:
                        data = await response.json()
                        products_data = data.get("products", data.get("items", []))
                        if isinstance(products_data, list):
                            captured_api_data.extend(products_data)
                    except Exception:
                        pass

        page.on("response", capture_product_api)
        # Navigate and wait for content
        await page.goto(catalog_url, wait_until="networkidle", timeout=30000)
        await page.wait_for_selector(".product-card, .product-item, [data-product]", timeout=15000)
        # If we captured API data, use it (much cleaner than HTML parsing)
        if captured_api_data:
            for item in captured_api_data[:max_products]:
                products.append(SPAProduct(
                    name=item.get("name", item.get("title", "")),
                    price=item.get("price"),
                    sku=item.get("sku", item.get("id", "")),
                    description=item.get("description", "")[:300],
                    images=item.get("images", item.get("photos", [])),
                    variants=item.get("variants", []),
                    url=catalog_url,
                ))
        else:
            # Fall back to HTML parsing
            html = await page.content()
            soup = BeautifulSoup(html, "lxml")
            for card in soup.select(".product-card, .product-item")[:max_products]:
                name_tag = card.select_one("h2, h3, .product-name")
                products.append(SPAProduct(
                    name=name_tag.get_text(strip=True) if name_tag else "",
                    price=None,
                    sku="",
                    description="",
                    images=[],
                    variants=[],
                    url=catalog_url,
                ))
        await context.close()
        await browser.close()
    return products
```
2. Authenticated Login Scraper
```python
import asyncio
import os
import random
from typing import List

from playwright.async_api import async_playwright

async def scrape_after_login(
    login_url: str,
    username: str,
    password: str,
    target_urls: List[str],
    proxy_config: dict,
    session_file: str = "/tmp/session_state.json",
) -> List[dict]:
    """
    Login once, save session state, then scrape authenticated pages.
    Reuses saved session to avoid re-login on each run.
    """
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        # Try to restore saved session
        context_args = {
            "viewport": {"width": 1366, "height": 768},
            "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
            "proxy": proxy_config,
        }
        if os.path.exists(session_file):
            context_args["storage_state"] = session_file
        context = await browser.new_context(**context_args)
        page = await context.new_page()
        # Check if session is valid (a live session usually redirects away
        # from the login page)
        await page.goto(login_url)
        if page.url == login_url:
            # Need to login
            await page.fill('[name="username"], [name="email"], #username, #email', username)
            await page.fill('[name="password"], #password', password)
            await asyncio.sleep(random.uniform(0.5, 1.5))
            await page.click('[type="submit"], .login-button, button:has-text("Login")')
            await page.wait_for_load_state("networkidle")
            # Save session for next run
            await context.storage_state(path=session_file)
        # Now scrape authenticated pages
        for url in target_urls:
            try:
                await page.goto(url, wait_until="networkidle", timeout=20000)
                await page.wait_for_selector("main, .content, #main-content", timeout=10000)
                html = await page.content()
                results.append({"url": url, "html": html, "authenticated": True})
                await asyncio.sleep(random.uniform(1, 3))
            except Exception as e:
                results.append({"url": url, "html": None, "error": str(e)})
        await context.close()
        await browser.close()
    return results
```
3. Infinite Scroll Aggregator
```python
import asyncio
import random
from dataclasses import dataclass
from typing import List

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

@dataclass
class FeedItem:
    title: str
    content: str
    timestamp: str
    author: str
    url: str

async def scrape_infinite_feed(
    feed_url: str,
    proxy_config: dict,
    target_item_count: int = 200,
    max_scrolls: int = 50,
) -> List[FeedItem]:
    items = []
    seen_titles = set()
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy_config,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = await browser.new_context(
            viewport={"width": 1280, "height": 900},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        page = await context.new_page()
        await page.goto(feed_url, wait_until="networkidle", timeout=30000)
        await page.wait_for_selector(".feed-item, .post, article", timeout=15000)
        for _ in range(max_scrolls):
            if len(items) >= target_item_count:
                break
            # Extract currently visible items
            html = await page.content()
            soup = BeautifulSoup(html, "lxml")
            for item in soup.select(".feed-item, .post, article"):
                title_el = item.select_one("h2, h3, .title")
                content_el = item.select_one("p, .content, .excerpt")
                time_el = item.select_one("time, .timestamp, .date")
                author_el = item.select_one(".author, [rel=author]")
                if title_el:
                    title = title_el.get_text(strip=True)
                    if title not in seen_titles:
                        seen_titles.add(title)
                        items.append(FeedItem(
                            title=title,
                            content=content_el.get_text(strip=True)[:300] if content_el else "",
                            timestamp=time_el.get("datetime", time_el.get_text(strip=True)) if time_el else "",
                            author=author_el.get_text(strip=True) if author_el else "",
                            url=feed_url,
                        ))
            # Scroll down
            prev_height = await page.evaluate("document.body.scrollHeight")
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(random.uniform(1.5, 3.0))
            # Check if new content loaded
            new_height = await page.evaluate("document.body.scrollHeight")
            if new_height == prev_height:
                break  # No more content
        await context.close()
        await browser.close()
    return items[:target_item_count]
```
4. Form-Based Data Extraction
import asyncio
import random
from playwright.async_api import async_playwright
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SearchResult:
    query: str
    title: str
    description: str
    url: str
    position: int

async def scrape_search_form(
    search_url: str,
    queries: List[str],
    proxy_config: dict,
    results_per_query: int = 20,
) -> List[SearchResult]:
    all_results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        for query in queries:
            page = await context.new_page()
            try:
                await page.goto(search_url, wait_until="domcontentloaded")
                # Find and fill the search input
                search_input = await page.wait_for_selector(
                    'input[type="search"], input[name="q"], #search-input, .search-box input',
                    timeout=10000,
                )
                await human_mouse_move(page, 0, 0)  # Move mouse first
                await asyncio.sleep(random.uniform(0.3, 0.8))
                await search_input.click()
                await asyncio.sleep(random.uniform(0.2, 0.5))
                # Type like a human (not all at once)
                for char in query:
                    await page.keyboard.type(char)
                    await asyncio.sleep(random.uniform(0.05, 0.2))
                await asyncio.sleep(random.uniform(0.3, 0.8))
                await page.keyboard.press("Enter")
                await page.wait_for_load_state("networkidle", timeout=15000)
                # Extract results
                result_items = await page.locator(".result, .search-result, article").all()
                for pos, item in enumerate(result_items[:results_per_query], 1):
                    try:
                        title = await item.locator("h2, h3, .title").first.inner_text()
                        desc = await item.locator("p, .description, .snippet").first.inner_text()
                        link = await item.locator("a").first.get_attribute("href")
                        all_results.append(SearchResult(
                            query=query,
                            title=title.strip(),
                            description=desc.strip()[:200],
                            url=link or "",
                            position=pos,
                        ))
                    except Exception:
                        pass
            except Exception as e:
                print(f"Error scraping query '{query}': {e}")
            finally:
                await page.close()
            # Human-like delay between queries
            await asyncio.sleep(random.uniform(3, 8))
        await context.close()
        await browser.close()
    return all_results
5. Multi-Page Pagination with Session Tracking
import asyncio
import random
from playwright.async_api import async_playwright
from typing import List

async def scrape_paginated_spa(
    start_url: str,
    proxy_config: dict,
    max_pages: int = 50,
    items_per_page: int = 20,
) -> List[dict]:
    """
    Handle SPA pagination where page state is managed by JavaScript.
    Uses network response interception to capture data directly.
    """
    all_items = []
    page_data_responses = asyncio.Queue()
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        page = await context.new_page()

        # Intercept pagination API calls
        async def capture_page_data(response):
            if ("/api/items" in response.url or "/api/list" in response.url) and response.status == 200:
                try:
                    data = await response.json()
                    await page_data_responses.put(data)
                except Exception:
                    pass

        page.on("response", capture_page_data)
        await page.goto(start_url, wait_until="networkidle")
        for page_num in range(1, max_pages + 1):
            # Drain any captured API responses
            while not page_data_responses.empty():
                data = await page_data_responses.get()
                items = data.get("items", data.get("results", data.get("data", [])))
                all_items.extend(items)
            # Find and click the "Next" button
            next_button = await page.query_selector(
                'button:has-text("Next"), a:has-text("Next"), [aria-label="Next page"], .pagination-next'
            )
            if not next_button:
                break
            # get_attribute returns "" for a bare disabled attribute, so check for None
            is_disabled = await next_button.get_attribute("disabled")
            if is_disabled is not None:
                break
            await next_button.click()
            await page.wait_for_load_state("networkidle", timeout=10000)
            await asyncio.sleep(random.uniform(1, 2))
            if len(all_items) >= max_pages * items_per_page:
                break
        await context.close()
        await browser.close()
    return all_items
6. File Download Automation
import asyncio
import os
import random
from playwright.async_api import async_playwright
from typing import List

async def download_files_from_portal(
    portal_url: str,
    file_selector: str,
    download_dir: str,
    proxy_config: dict,
    max_files: int = 50,
) -> List[str]:
    """
    Download files from a portal that requires JavaScript for the download UI.
    """
    downloaded_files = []
    os.makedirs(download_dir, exist_ok=True)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            accept_downloads=True,  # Required to intercept downloads
        )
        page = await context.new_page()
        await page.goto(portal_url, wait_until="networkidle")
        await page.wait_for_selector(file_selector, timeout=15000)
        download_links = await page.locator(file_selector).all()
        for i, link in enumerate(download_links[:max_files]):
            try:
                async with page.expect_download(timeout=30000) as download_info:
                    await link.click()
                download = await download_info.value
                save_path = os.path.join(download_dir, download.suggested_filename)
                await download.save_as(save_path)
                downloaded_files.append(save_path)
                await asyncio.sleep(random.uniform(1, 3))
            except Exception as e:
                print(f"Failed to download file {i}: {e}")
        await context.close()
        await browser.close()
    return downloaded_files
7. Real-Time Price and Availability Monitor
import asyncio
import datetime
import random
from playwright.async_api import async_playwright
from dataclasses import dataclass
from typing import List, Optional, Callable

@dataclass
class PriceSnapshot:
    url: str
    price: Optional[float]
    currency: str
    in_stock: bool
    title: str
    scraped_at: str
    proxy_used: str

async def monitor_prices_realtime(
    product_urls: List[str],
    proxy_config: dict,
    on_price_change: Optional[Callable] = None,
    check_interval_minutes: int = 30,
    max_checks: int = 48,
) -> List[List[PriceSnapshot]]:
    """
    Monitor prices over time with Playwright.
    Calls on_price_change callback when a price differs from previous check.
    """
    history = {url: [] for url in product_urls}
    proxy_server = proxy_config.get("server", "")
    for check_num in range(max_checks):
        print(f"Check {check_num + 1}/{max_checks} at {datetime.datetime.utcnow().isoformat()}")
        snapshots = []
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True, proxy=proxy_config)
            for url in product_urls:
                context = await browser.new_context(
                    viewport={"width": 1366, "height": 768},
                    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
                    proxy=proxy_config,
                )
                page = await context.new_page()
                try:
                    await page.goto(url, wait_until="networkidle", timeout=30000)
                    await page.wait_for_selector("[itemprop='price'], .price, #price", timeout=10000)
                    price_text = await page.locator("[itemprop='price'], .price, #price").first.inner_text()
                    title_text = await page.locator("h1").first.inner_text()
                    stock_el = await page.query_selector(".in-stock, [itemprop='availability']")
                    price_clean = "".join(c for c in price_text if c.isdigit() or c == ".")
                    try:
                        price = float(price_clean) if price_clean else None
                    except ValueError:
                        price = None
                    snapshot = PriceSnapshot(
                        url=url,
                        price=price,
                        currency="USD",
                        in_stock=bool(stock_el),
                        title=title_text.strip() if title_text else "",
                        scraped_at=datetime.datetime.utcnow().isoformat(),
                        proxy_used=proxy_server,
                    )
                    # Check for price change
                    if history[url] and on_price_change:
                        prev = history[url][-1]
                        if prev.price != snapshot.price:
                            on_price_change(url, prev.price, snapshot.price, snapshot)
                    history[url].append(snapshot)
                    snapshots.append(snapshot)
                except Exception as e:
                    print(f"Error checking {url}: {e}")
                finally:
                    await context.close()
                await asyncio.sleep(random.uniform(2, 5))
            await browser.close()
        if check_num < max_checks - 1:
            await asyncio.sleep(check_interval_minutes * 60)
    return list(history.values())
When You Do Not Need Playwright
Not every JavaScript-heavy site requires Playwright. Before building a full browser scraper:
Check for an internal API. Open DevTools, go to Network, filter by Fetch/XHR, and reload the page. If you see JSON responses containing the data you want, you can call those endpoints directly with httpx. This is faster, more reliable, and dramatically cheaper in terms of resources.
Check if data is in the initial HTML. Some sites that look like SPAs actually server-side render their initial content. View Page Source (not Inspect Element). If the data is there, BeautifulSoup works fine.
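That manual View Source check can also be automated with a cheap heuristic on the raw server response. A sketch, assuming the usual SPA mount-point ids (#root, #app) as the signal — adjust the threshold and selectors for the site at hand:

```python
from bs4 import BeautifulSoup

def looks_like_spa_shell(html: str) -> bool:
    """Treat the raw server HTML (what View Source shows) as a shell if the
    body has almost no visible text and contains a typical SPA mount point."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body
    if body is None:
        return True
    visible_text = body.get_text(strip=True)
    has_mount_point = soup.select_one("#root, #app, [data-reactroot]") is not None
    return len(visible_text) < 50 and has_mount_point

# A React-style shell vs. server-rendered content:
shell = '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>'
rendered = "<html><body><h1>Blue Widget</h1><p>" + "In stock, ships tomorrow. " * 5 + "</p></body></html>"
```

If the heuristic returns False, the data is already in the HTML and BeautifulSoup alone will do.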
Check for a public API. Many major platforms have official APIs. LinkedIn, Twitter, Google, and Amazon all have APIs with rate limits that are more sustainable than scraping.
The rule: use Playwright only when you genuinely need JavaScript rendering and there is no simpler alternative. For static HTML, requests or httpx with BeautifulSoup runs 10-50x faster and costs a fraction of the compute resources.
Wrapping Up
Playwright has become the go-to tool for scraping JS-heavy sites in 2026 because it solves the real problems: auto-wait eliminates the time.sleep debugging spiral, cross-browser support means one codebase drives Chromium, Firefox, and WebKit, and the async API enables genuine concurrency. Pair it with residential proxies from ThorData for the IP layer, add the stealth scripts to mask automation flags, use resource blocking to cut load times, and you can scrape virtually anything that renders in a browser.
The tradeoff is always speed and memory. Playwright is 10-50x slower than raw HTTP scraping and uses 10x more memory. Use it where it is genuinely required, and fall back to Scrapy or httpx everywhere else.