Playwright for Web Scraping in 2026: A Complete Practical Guide
If you've tried scraping a modern website with requests and BeautifulSoup only to get back an empty <div id="root"></div>, you already know the problem. Most sites worth scraping in 2026 are JavaScript-heavy SPAs that render content client-side. The server sends a near-empty HTML shell and all the real content is injected by React, Vue, or Angular after the JavaScript executes. Traditional HTTP scraping cannot see any of that.
Playwright handles this natively. It runs a real browser engine — Chromium, Firefox, or WebKit — executes all the JavaScript the page loads, waits for content to appear in the DOM, and gives you full programmatic control over interactions. You can click buttons, fill forms, scroll the page, intercept network traffic, and screenshot anything. It is the most capable browser automation tool available in Python today.
This guide goes deep. We start from scratch and build up to production-grade patterns: stealth techniques to defeat bot detection, residential proxy rotation via ThorData, CAPTCHA handling strategies, robust retry logic, and output schemas you can rely on in downstream pipelines. Every code example is complete and runnable.
Why Playwright Beats the Alternatives
The browser automation landscape in 2026 has three serious options: Selenium, Puppeteer, and Playwright. Here is why Playwright has become the default choice for serious scrapers:
Selenium is the oldest and most battle-tested. Every hiring manager knows it. But it requires you to manage WebDriver binaries that must match your browser version exactly — a constant headache. Its API is synchronous-first and verbose. Auto-waiting is limited compared to Playwright: you frequently need manual WebDriverWait with ExpectedConditions boilerplate just to click a button safely.
Puppeteer introduced the modern browser automation API, but it is JavaScript/Node.js only. If your data pipeline is Python — and most data engineering is — you are either writing glue code or maintaining a separate Node service. Not ideal.
Playwright for Python was built by former Puppeteer engineers at Microsoft who redesigned the API from scratch. Key advantages over everything else:
- Auto-waiting everywhere — every interaction (`click`, `fill`, `query_selector`) automatically waits for the element to be actionable. No more `sleep(3)` and hope.
- True multi-browser — Chromium, Firefox, and WebKit from a single Python API. When one engine gets fingerprinted and blocked, you switch.
- Network interception — intercept and modify HTTP requests at the browser level. Capture API responses without parsing HTML. Block images and fonts to cut page load time by 70%.
- Async-first design — the Python API is built on `asyncio`. You can run tens of browser contexts concurrently.
- Built-in tracing — record full browser traces for debugging. When a scraper fails in production, replay the trace and see exactly what happened.
- No binary management — `playwright install chromium` downloads the correct browser version. No more chromedriver version mismatch errors.
For scraping JavaScript-heavy sites in 2026, Playwright is the right tool. The only reason not to use it is if the site you are targeting has a clean API underneath the UI — in which case you do not need a browser at all.
Setup and Installation
pip install playwright playwright-stealth httpx
playwright install chromium
If you are on a headless server (VPS, Docker, CI), you may need system dependencies:
playwright install-deps chromium
Verify the installation:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://example.com")
print(page.title())
browser.close()
If that prints "Example Domain" you are ready.
Architecture: Sync vs Async API
Playwright offers both a synchronous and an asynchronous Python API. Use the async API for any real scraping work.
When to use sync:
- One-off scripts and quick experiments
- Simple scrapers that hit one URL at a time
- Prototyping before you know the scale you need
When to use async:
- Any multi-page scraping (always, basically)
- Running multiple browser contexts concurrently
- Integration with async HTTP clients like `httpx.AsyncClient`
- Production scrapers where throughput matters
The async API looks like this:
import asyncio
from playwright.async_api import async_playwright
async def main():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto("https://example.com")
print(await page.title())
await browser.close()
asyncio.run(main())
The rest of this guide uses the async API throughout.
Your First Complete Scraper
Here is a production-ready scraper for a paginated product catalog. It handles auto-waiting, extracts structured data, and manages browser lifecycle correctly:
import asyncio
import json
from dataclasses import dataclass, asdict
from typing import Optional
from playwright.async_api import async_playwright, Browser, Page
@dataclass
class Product:
title: str
price: Optional[str]
sku: Optional[str]
rating: Optional[float]
review_count: Optional[int]
url: str
async def extract_products_from_page(page: Page) -> list[Product]:
"""Extract all product cards from the current page."""
await page.wait_for_selector(".product-card", timeout=15_000)
cards = await page.query_selector_all(".product-card")
products = []
for card in cards:
title_el = await card.query_selector(".product-title")
price_el = await card.query_selector(".product-price")
sku_el = await card.query_selector("[data-sku]")
rating_el = await card.query_selector("[aria-label*='stars']")
review_el = await card.query_selector(".review-count")
link_el = await card.query_selector("a.product-link")
title = await title_el.inner_text() if title_el else ""
price = await price_el.inner_text() if price_el else None
sku = await sku_el.get_attribute("data-sku") if sku_el else None
url = await link_el.get_attribute("href") if link_el else page.url
# Parse rating from aria-label like "4.5 stars"
rating = None
if rating_el:
label = await rating_el.get_attribute("aria-label") or ""
try:
rating = float(label.split()[0])
except (ValueError, IndexError):
pass
# Parse review count from text like "(1,234)"
review_count = None
if review_el:
text = await review_el.inner_text()
digits = "".join(c for c in text if c.isdigit())
review_count = int(digits) if digits else None
products.append(Product(
title=title.strip(),
price=price.strip() if price else None,
sku=sku,
rating=rating,
review_count=review_count,
url=url if url.startswith("http") else f"https://example-store.com{url}",
))
return products
async def scrape_catalog(base_url: str, max_pages: int = 10) -> list[dict]:
all_products = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
# Block analytics and ad tracking to speed up loads
extra_http_headers={
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
}
)
# Block images, fonts, and media — not needed for data
await context.route(
"**/*.{png,jpg,jpeg,gif,webp,svg,ico,woff,woff2,ttf,mp4,mp3}",
lambda route: route.abort()
)
page = await context.new_page()
await page.goto(base_url, wait_until="domcontentloaded")
for page_num in range(1, max_pages + 1):
print(f"Scraping page {page_num}...")
products = await extract_products_from_page(page)
all_products.extend(products)
print(f" Extracted {len(products)} products")
# Try to navigate to next page
next_btn = await page.query_selector("button[aria-label='Next page']:not([disabled])")
if not next_btn:
print("No more pages.")
break
await next_btn.click()
await page.wait_for_load_state("domcontentloaded")
await browser.close()
return [asdict(p) for p in all_products]
if __name__ == "__main__":
results = asyncio.run(scrape_catalog("https://example-store.com/catalog"))
with open("products.json", "w") as f:
json.dump(results, f, indent=2)
print(f"Saved {len(results)} products to products.json")
A few design decisions worth noting:

- I use `wait_until="domcontentloaded"` instead of `"networkidle"`. Network idle can hang for 30+ seconds on sites with analytics pixels, chat widgets, and ad networks. Load the DOM, then wait for the specific selector you actually need.
- I block images and fonts at the context level, which cuts page load time dramatically and reduces bandwidth.
- I use `wait_for_selector` with an explicit timeout rather than hoping things appear.
Stealth: Defeating Bot Detection
Headless Chromium is detectable. Bot detection services like Cloudflare, DataDome, PerimeterX, and Akamai Bot Manager look for dozens of signals that distinguish a real browser from automated tooling:
- `navigator.webdriver === true` — the clearest signal of all
- Missing browser plugins array (real Chrome has plugins, headless doesn't)
- WebGL renderer string shows "SwiftShader" or "llvmpipe" instead of a real GPU
- `screen.width` and `window.outerWidth` mismatches
- Missing `chrome` runtime object properties
- Inconsistent hardware concurrency and device memory values
- The `HeadlessChrome` substring in the user agent string
- Chrome automation flags visible in `navigator.userAgentData`
- Missing media codec support
The playwright-stealth package patches most of these at the JavaScript level:
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
async def create_stealth_browser():
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
"--disable-infobars",
]
)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
)
page = await context.new_page()
await stealth_async(page)
return playwright, browser, page
For high-security targets, go further with manual JavaScript injection to patch remaining signals:
async def apply_advanced_stealth(page):
"""Inject stealth overrides before any page scripts run."""
await page.add_init_script("""
// Overwrite navigator.webdriver
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined,
});
// Fake plugins
Object.defineProperty(navigator, 'plugins', {
get: () => [
{ name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer' },
{ name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai' },
{ name: 'Native Client', filename: 'internal-nacl-plugin' },
],
});
// Fake languages
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en'],
});
// Fake hardware concurrency (real CPU count)
Object.defineProperty(navigator, 'hardwareConcurrency', {
get: () => 8,
});
// Remove Automation extension exposure
window.chrome = {
runtime: {},
loadTimes: function() {},
csi: function() {},
app: {}
};
// Fix permission query fingerprint
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
parameters.name === 'notifications' ?
Promise.resolve({ state: Notification.permission }) :
originalQuery(parameters)
);
""")
For the toughest targets, launch with channel="chrome" to use the system-installed Chrome binary. This bypasses most fingerprinting entirely because it IS a real Chrome installation:
browser = await p.chromium.launch(
headless=False, # headed mode avoids more detection
channel="chrome",
)
Yes, headed mode is slower and requires a display (use Xvfb on Linux servers). But it sidesteps most headless-detection checks for a simple reason: it is literally Chrome running normally.
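On a Linux server with no display attached, headed Chrome can run inside a virtual framebuffer. A minimal sketch using xvfb-run; the package name is the Debian/Ubuntu one and `scraper.py` is a placeholder for your entry point:

```shell
# Install the X virtual framebuffer (Debian/Ubuntu package name)
sudo apt-get install -y xvfb

# Run the scraper inside a virtual 1920x1080 display
xvfb-run --auto-servernum --server-args='-screen 0 1920x1080x24' python scraper.py
```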
Human Behavior Simulation
Even with a perfect browser fingerprint, bot detection watches for behavioral signals. Clicks that happen 50 milliseconds after page load, mouse movements that go in a perfect straight line, typing at 1000 characters per second — these are inhuman.
Add realistic behavior with these patterns:
import random
import asyncio
async def human_delay(min_ms: int = 500, max_ms: int = 2000):
"""Wait a random human-like delay."""
delay = random.randint(min_ms, max_ms) / 1000
await asyncio.sleep(delay)
async def human_type(page, selector: str, text: str):
"""Type text with realistic per-character delays."""
await page.click(selector)
for char in text:
await page.keyboard.type(char)
# Vary speed like a real typist: 50-200ms per character
await asyncio.sleep(random.uniform(0.05, 0.20))
async def move_mouse_naturally(page, target_x: int, target_y: int):
"""Move mouse in a curved path rather than a straight line."""
current = await page.evaluate("() => ({x: window.mouseX || 0, y: window.mouseY || 0})")
cx, cy = current.get("x", 0), current.get("y", 0)
# Generate bezier-like intermediate points
steps = random.randint(8, 15)
for i in range(1, steps + 1):
t = i / steps
# Slight curve via a random midpoint offset
mid_x = (cx + target_x) / 2 + random.randint(-30, 30)
mid_y = (cy + target_y) / 2 + random.randint(-30, 30)
x = int((1 - t) ** 2 * cx + 2 * (1 - t) * t * mid_x + t ** 2 * target_x)
y = int((1 - t) ** 2 * cy + 2 * (1 - t) * t * mid_y + t ** 2 * target_y)
await page.mouse.move(x, y)
await asyncio.sleep(random.uniform(0.01, 0.03))
async def human_scroll(page, scroll_amount: int = 500):
"""Scroll in multiple small increments."""
steps = random.randint(5, 12)
per_step = scroll_amount // steps
for _ in range(steps):
await page.mouse.wheel(0, per_step + random.randint(-20, 20))
await asyncio.sleep(random.uniform(0.05, 0.15))
Proxy Rotation with ThorData
IP-based blocking is the other half of the anti-scraping equation. Even a perfect fingerprint will get banned if you send thousands of requests from one IP. You need residential proxy rotation.
ThorData provides rotating residential proxies that route traffic through real ISP-assigned IP addresses. Each request can use a different IP from a pool of millions, making your traffic indistinguishable from normal user traffic distributed across the country or world.
Setting up Playwright with ThorData:
import asyncio
from playwright.async_api import async_playwright
THORDATA_USER = "your_thordata_username"
THORDATA_PASS = "your_thordata_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
async def scrape_with_proxy(url: str) -> str:
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy={
"server": f"http://{THORDATA_HOST}:{THORDATA_PORT}",
"username": THORDATA_USER,
"password": THORDATA_PASS,
}
)
page = await browser.new_page()
await page.goto(url)
content = await page.content()
await browser.close()
return content
For country-specific IPs (useful when scraping geo-restricted content), ThorData supports country targeting via username parameters:
# Route through US residential IPs only
proxy_user = f"{THORDATA_USER}-country-us"
# Route through UK IPs
proxy_user = f"{THORDATA_USER}-country-gb"
browser = await p.chromium.launch(
proxy={
"server": f"http://{THORDATA_HOST}:{THORDATA_PORT}",
"username": proxy_user,
"password": THORDATA_PASS,
}
)
For concurrent scraping, launch a fresh browser per URL so each session picks up a different IP from the rotating pool:
async def scrape_batch(urls: list[str], concurrency: int = 5) -> list[dict]:
"""Scrape multiple URLs concurrently, each with its own proxy IP."""
semaphore = asyncio.Semaphore(concurrency)
results = []
async def scrape_one(url: str) -> dict:
async with semaphore:
async with async_playwright() as p:
browser = await p.chromium.launch(
proxy={
"server": f"http://{THORDATA_HOST}:{THORDATA_PORT}",
"username": THORDATA_USER,
"password": THORDATA_PASS,
}
)
try:
page = await browser.new_page()
await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
# ... extract data ...
return {"url": url, "status": "ok", "data": {}}
except Exception as e:
return {"url": url, "status": "error", "error": str(e)}
finally:
await browser.close()
tasks = [scrape_one(url) for url in urls]
results = await asyncio.gather(*tasks)
return list(results)
Intercepting Network Requests
This is the most powerful Playwright feature for scraping, and most scrapers never use it. Modern SPAs fetch their data from internal APIs and render it into HTML. Instead of parsing the rendered HTML, you can intercept the raw API response directly. It is faster, more reliable, and the data arrives already structured.
Here is how to intercept the JSON API calls that power a product listing page:
import asyncio
import json
from playwright.async_api import async_playwright
async def intercept_api_responses(url: str, api_pattern: str) -> list[dict]:
"""
Load a page and capture all API responses matching a URL pattern.
Returns the parsed JSON bodies.
"""
captured_data = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
async def handle_response(response):
if api_pattern in response.url:
try:
if response.status == 200:
data = await response.json()
captured_data.append({
"url": response.url,
"data": data,
})
print(f"Captured: {response.url}")
except Exception as e:
print(f"Failed to parse response from {response.url}: {e}")
page.on("response", handle_response)
await page.goto(url, wait_until="networkidle", timeout=30_000)
# If page loads more data on scroll, trigger it
for _ in range(3):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(2)
await browser.close()
return captured_data
# Example: capture a listing page's internal API responses. The "/api/"
# fragment is a placeholder -- find the real endpoint pattern in your
# browser's Network tab before relying on it.
async def scrape_listing_api(listing_url: str) -> list[dict]:
    data = await intercept_api_responses(listing_url, "/api/")
    return data
You can also intercept requests BEFORE they are sent, which lets you modify headers, block tracking, or substitute mock responses:
async def setup_request_interception(page):
"""Block analytics and modify requests before they fire."""
async def handle_route(route):
url = route.request.url
# Block analytics and tracking
if any(domain in url for domain in [
"google-analytics.com", "facebook.com/tr",
"doubleclick.net", "hotjar.com",
]):
await route.abort()
return
# Add custom headers to all requests
headers = {
**route.request.headers,
"X-Custom-Header": "scraper-v1",
}
await route.continue_(headers=headers)
await page.route("**/*", handle_route)
Handling Pagination: All Three Patterns
Real sites use one of three pagination patterns, each requiring a different approach.
Pattern 1: Click-Based Next Page
async def scrape_paginated_site(start_url: str) -> list[dict]:
all_items = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(start_url, wait_until="domcontentloaded")
page_num = 0
while True:
page_num += 1
print(f"Scraping page {page_num}")
# Wait for content to appear
try:
await page.wait_for_selector(".item-list .item", timeout=10_000)
except Exception:
print("No items found, stopping")
break
# Extract items
items = await page.query_selector_all(".item-list .item")
for item in items:
title_el = await item.query_selector("h3")
all_items.append({
"title": await title_el.inner_text() if title_el else "",
"page": page_num,
})
# Find next page button — disabled or missing means last page
next_btn = await page.query_selector("a.pagination__next:not(.disabled)")
if not next_btn:
print("Reached last page")
break
# Click and wait for new content
old_first_item = await page.inner_text(".item-list .item:first-child h3")
await next_btn.click()
# Wait until content actually changes (not just DOM update)
try:
# Pass the old value as an argument instead of interpolating it into the
# JS string -- interpolation breaks if the title contains a quote
await page.wait_for_function(
    "prev => document.querySelector('.item-list .item:first-child h3')?.innerText !== prev",
    arg=old_first_item,
)
except Exception:
break
await browser.close()
return all_items
Pattern 2: Infinite Scroll / Load More
async def scrape_infinite_scroll(url: str, max_scrolls: int = 20) -> list[dict]:
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="domcontentloaded")
items_seen = set()
all_items = []
for scroll_num in range(max_scrolls):
# Collect currently visible items
cards = await page.query_selector_all("[data-testid='item-card']")
new_count = 0
for card in cards:
item_id = await card.get_attribute("data-id")
if item_id and item_id not in items_seen:
items_seen.add(item_id)
title_el = await card.query_selector(".title")
all_items.append({
"id": item_id,
"title": await title_el.inner_text() if title_el else "",
})
new_count += 1
print(f"Scroll {scroll_num + 1}: +{new_count} new items (total: {len(all_items)})")
if new_count == 0 and scroll_num > 0:
print("No new items after scroll, done")
break
# Scroll to bottom
prev_height = await page.evaluate("document.body.scrollHeight")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(2) # Wait for new content to load
new_height = await page.evaluate("document.body.scrollHeight")
if new_height == prev_height:
print("Page height unchanged, no more content")
break
await browser.close()
return all_items
Pattern 3: URL Parameter Pagination
async def scrape_url_pagination(base_url: str, max_pages: int = 50) -> list[dict]:
"""Simplest case: iterate page numbers in the URL."""
all_items = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
for page_num in range(1, max_pages + 1):
url = f"{base_url}?page={page_num}"
await page.goto(url, wait_until="domcontentloaded")
items = await page.query_selector_all(".search-result")
if not items:
print(f"No results on page {page_num}, stopping")
break
for item in items:
all_items.append({"page": page_num, "url": url})
# Respect the server
await asyncio.sleep(1.0)
await browser.close()
return all_items
Rate Limiting and CAPTCHA Handling
Respecting Rate Limits
The simplest approach is adaptive backoff: start fast, slow down when you see signals of rate limiting (slow responses, empty results, 429 status):
import asyncio
import time
from dataclasses import dataclass
@dataclass
class RateLimiter:
requests_per_second: float = 1.0
_last_request_time: float = 0.0
async def wait(self):
now = time.monotonic()
elapsed = now - self._last_request_time
min_interval = 1.0 / self.requests_per_second
if elapsed < min_interval:
await asyncio.sleep(min_interval - elapsed)
self._last_request_time = time.monotonic()
async def scrape_with_rate_limit(urls: list[str]) -> list[dict]:
limiter = RateLimiter(requests_per_second=0.5) # 1 request per 2 seconds
results = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
for url in urls:
await limiter.wait()
try:
await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
# Check for rate limit page
if "Rate limited" in await page.title() or await page.query_selector(".captcha"):
print(f"Rate limited at {url}, waiting 60s...")
await asyncio.sleep(60)
await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
results.append({"url": url, "status": "ok"})
except Exception as e:
results.append({"url": url, "status": "error", "error": str(e)})
await browser.close()
return results
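The fixed-rate limiter above can be extended into the adaptive backoff described at the start of this section: recover speed slowly while requests succeed, back off sharply on a rate-limit signal. A minimal sketch; the multipliers and bounds are illustrative starting points, not tuned values:

```python
import asyncio
import time

class AdaptiveRateLimiter:
    """Speed up gently on success, back off hard on rate-limit signals."""

    def __init__(self, initial_delay: float = 1.0,
                 min_delay: float = 0.5, max_delay: float = 60.0):
        self.delay = initial_delay
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._last = 0.0

    async def wait(self):
        # Sleep until at least `delay` seconds have passed since the last request
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            await asyncio.sleep(self.delay - elapsed)
        self._last = time.monotonic()

    def record_success(self):
        # Recover speed slowly: 5% faster per success, down to the floor
        self.delay = max(self.min_delay, self.delay * 0.95)

    def record_rate_limited(self):
        # Back off sharply: double the delay, up to the ceiling
        self.delay = min(self.max_delay, self.delay * 2)
```

Call `record_rate_limited()` whenever you see a 429, a CAPTCHA page, or a suspiciously empty result set, and `record_success()` otherwise; the limiter converges toward the fastest rate the target tolerates.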
CAPTCHA Handling
CAPTCHAs are the escalation point when other anti-bot measures fail. You have three options:
Option 1: Avoid triggering CAPTCHAs. Use stealth mode, respect rate limits, use residential proxies. If you are getting CAPTCHAs, you are already doing something that looks bot-like.
Option 2: CAPTCHA solving services. 2captcha, Anti-Captcha, and CapSolver provide workers (human or AI) that solve CAPTCHAs for a fee. Integration looks like this:
import httpx
import asyncio
async def solve_recaptcha_v2(site_key: str, page_url: str, api_key: str) -> str:
"""Solve reCAPTCHA v2 via 2captcha service."""
async with httpx.AsyncClient() as client:
# Submit CAPTCHA task
resp = await client.post(
"https://2captcha.com/in.php",
data={
"key": api_key,
"method": "userrecaptcha",
"googlekey": site_key,
"pageurl": page_url,
"json": 1,
}
)
task_id = resp.json()["request"]
# Poll for result (typically 15-45 seconds)
for _ in range(20):
await asyncio.sleep(5)
result = await client.get(
"https://2captcha.com/res.php",
params={"key": api_key, "action": "get", "id": task_id, "json": 1}
)
data = result.json()
if data["status"] == 1:
return data["request"] # The g-recaptcha-response token
raise Exception("CAPTCHA solve timeout")
async def handle_captcha_page(page, api_key: str):
"""Detect and solve reCAPTCHA on the current page."""
# Check for reCAPTCHA
recaptcha = await page.query_selector(".g-recaptcha")
if not recaptcha:
return True # No CAPTCHA
site_key = await recaptcha.get_attribute("data-sitekey")
if not site_key:
return False
print(f"CAPTCHA detected, solving via 2captcha...")
token = await solve_recaptcha_v2(site_key, page.url, api_key)
# Inject the token and submit
await page.evaluate(f"""
document.getElementById('g-recaptcha-response').value = '{token}';
document.querySelector('[data-callback]') &&
window[document.querySelector('[data-callback]').dataset.callback]();
""")
# Python Playwright has no wait_for_navigation; wait for the load state instead
await page.wait_for_load_state("domcontentloaded")
return True
Option 3: Playwright with real Chrome + stealth. Many CAPTCHA implementations have logic that skips the challenge for browsers with a legitimate traffic history. Using a real Chrome profile with cookies can bypass CAPTCHAs entirely on sites you have visited "normally" before.
Retry Logic and Error Handling
Production scrapers fail. Network timeouts, server errors, page load failures — build retry logic in from the start:
import asyncio
import functools
import random
from typing import Callable
def with_retry(max_attempts: int = 3, backoff_base: float = 2.0):
"""Decorator for automatic retry with exponential backoff."""
def decorator(func: Callable) -> Callable:
@functools.wraps(func)
async def wrapper(*args, **kwargs):
last_error = None
for attempt in range(1, max_attempts + 1):
try:
return await func(*args, **kwargs)
except Exception as e:
last_error = e
if attempt == max_attempts:
break
wait = backoff_base ** attempt + random.uniform(0, 1)
print(f"Attempt {attempt} failed: {e}. Retrying in {wait:.1f}s...")
await asyncio.sleep(wait)
raise last_error
return wrapper
return decorator
@with_retry(max_attempts=3, backoff_base=2.0)
async def scrape_product_page(page, url: str) -> dict:
"""Scrape a single product page with automatic retries."""
await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
# Verify we got the right page (not a 404, rate limit page, etc.)
title = await page.title()
if "404" in title or "Not Found" in title:
raise ValueError(f"Got 404 for {url}")
if "Access Denied" in title or "Blocked" in title:
raise PermissionError(f"Blocked at {url}")
name_el = await page.query_selector("h1.product-name")
if not name_el:
raise ValueError(f"No product name found at {url}")
return {
"url": url,
"name": await name_el.inner_text(),
}
Output Schema Design
Well-designed output schemas make scraped data useful in downstream pipelines. Validate your output shapes:
from dataclasses import dataclass, asdict, field
from typing import Optional
import json
@dataclass
class NutritionFacts:
calories: Optional[float]
fat_g: Optional[float]
saturated_fat_g: Optional[float]
carbohydrates_g: Optional[float]
sugars_g: Optional[float]
fiber_g: Optional[float]
protein_g: Optional[float]
sodium_mg: Optional[float]
@dataclass
class ProductRecord:
# Required fields — scraper must fill these
url: str
scraped_at: str # ISO8601
# Product identity
name: str = ""
brand: str = ""
sku: Optional[str] = None
barcode: Optional[str] = None
# Pricing
price_raw: Optional[str] = None # "$24.99"
price_cents: Optional[int] = None # 2499
currency: Optional[str] = "USD"
in_stock: Optional[bool] = None
# Content
description: Optional[str] = None
images: list[str] = field(default_factory=list)
categories: list[str] = field(default_factory=list)
rating: Optional[float] = None
review_count: Optional[int] = None
# Nutrition (for food products)
nutrition: Optional[NutritionFacts] = None
# Scraper metadata
proxy_used: Optional[str] = None
scrape_duration_ms: Optional[int] = None
# Example output
example = ProductRecord(
url="https://shop.example.com/product/widget-pro",
scraped_at="2026-03-31T14:22:00Z",
name="Widget Pro 3000",
brand="Acme Corp",
sku="WP-3000-BLK",
price_raw="$89.99",
price_cents=8999,
in_stock=True,
rating=4.3,
review_count=1247,
categories=["Electronics", "Gadgets", "Home Office"],
)
print(json.dumps(asdict(example), indent=2, default=str))
Example output:
{
"url": "https://shop.example.com/product/widget-pro",
"scraped_at": "2026-03-31T14:22:00Z",
"name": "Widget Pro 3000",
"brand": "Acme Corp",
"sku": "WP-3000-BLK",
"barcode": null,
"price_raw": "$89.99",
"price_cents": 8999,
"currency": "USD",
"in_stock": true,
"description": null,
"images": [],
"categories": ["Electronics", "Gadgets", "Home Office"],
"rating": 4.3,
"review_count": 1247,
"nutrition": null,
"proxy_used": null,
"scrape_duration_ms": null
}
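A record shape is only useful if bad rows are caught before they reach the pipeline. A lightweight validator can enforce required fields and basic invariants; this sketch re-declares a trimmed ProductRecord so it is self-contained, and the rules are examples to tighten for your own data:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProductRecord:
    # Trimmed re-declaration for this sketch; in practice import the
    # full ProductRecord from your schema module
    url: str
    scraped_at: str
    name: str = ""
    price_cents: Optional[int] = None
    rating: Optional[float] = None

def validate_record(rec: ProductRecord) -> list[str]:
    """Return a list of validation errors; an empty list means the record is OK."""
    errors = []
    if not rec.url.startswith("http"):
        errors.append("url must be absolute")
    if not rec.scraped_at:
        errors.append("scraped_at is required")
    if not rec.name.strip():
        errors.append("name is empty")
    if rec.price_cents is not None and rec.price_cents < 0:
        errors.append("price_cents is negative")
    if rec.rating is not None and not (0 <= rec.rating <= 5):
        errors.append("rating out of range")
    return errors
```

Reject or quarantine records with errors rather than silently writing them; a nulled-out field is recoverable, a corrupted one poisons every downstream consumer.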
7 Real-World Use Cases
1. E-commerce Price Monitoring
Track competitor prices across multiple retailers for a product line. Run daily, alert when a price drops below a threshold:
import asyncio
from datetime import datetime
PRODUCTS_TO_TRACK = [
{"name": "Widget Pro", "urls": [
"https://amazon.com/dp/B0EXAMPLE",
"https://bestbuy.com/site/widget-pro/123456.p",
"https://target.com/p/widget-pro/-/A-123456",
]},
]
async def check_price(page, url: str) -> dict:
await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
# Each retailer needs specific selectors...
price_el = await page.query_selector("[data-testid='price']")
price = await price_el.inner_text() if price_el else "N/A"
return {"url": url, "price": price, "checked_at": datetime.utcnow().isoformat()}
2. Job Board Aggregation
Aggregate job postings from multiple boards into a unified feed:
from datetime import datetime

async def scrape_job_board(url: str) -> list[dict]:
jobs = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="domcontentloaded")
await page.wait_for_selector(".job-listing")
listings = await page.query_selector_all(".job-listing")
for listing in listings:
title_el = await listing.query_selector(".job-title")
company_el = await listing.query_selector(".company-name")
location_el = await listing.query_selector(".location")
jobs.append({
"title": await title_el.inner_text() if title_el else "",
"company": await company_el.inner_text() if company_el else "",
"location": await location_el.inner_text() if location_el else "",
"source_url": url,
"scraped_at": datetime.utcnow().isoformat(),
})
await browser.close()
return jobs
3. Real Estate Listing Data
Pull property listings including price, size, location, and photos:
async def scrape_property_listings(search_url: str) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": "http://proxy.thordata.com:9000",
                   "username": THORDATA_USER, "password": THORDATA_PASS},
        )
        page = await browser.new_page()
        await page.goto(search_url, wait_until="domcontentloaded")
        properties = []
        cards = await page.query_selector_all("[data-listing-id]")
        for card in cards:
            listing_id = await card.get_attribute("data-listing-id")
            price_el = await card.query_selector("[data-test='property-card-price']")
            beds_el = await card.query_selector("[data-test='property-card-beds']")
            sqft_el = await card.query_selector("[data-test='property-card-sqft']")
            properties.append({
                "id": listing_id,
                "price": await price_el.inner_text() if price_el else None,
                "beds": await beds_el.inner_text() if beds_el else None,
                "sqft": await sqft_el.inner_text() if sqft_el else None,
            })
        await browser.close()
    return properties
4. Review and Sentiment Monitoring
Collect product reviews from retail sites for sentiment analysis:
async def scrape_reviews(product_url: str, max_pages: int = 5) -> list[dict]:
    reviews = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        for page_num in range(1, max_pages + 1):
            url = f"{product_url}?reviewsPage={page_num}"
            await page.goto(url, wait_until="domcontentloaded")
            review_cards = await page.query_selector_all(".review-item")
            if not review_cards:
                break
            for card in review_cards:
                rating_el = await card.query_selector("[data-rating]")
                body_el = await card.query_selector(".review-body")
                date_el = await card.query_selector(".review-date")
                reviews.append({
                    "rating": await rating_el.get_attribute("data-rating") if rating_el else None,
                    "body": await body_el.inner_text() if body_el else "",
                    "date": await date_el.inner_text() if date_el else "",
                })
            await asyncio.sleep(1.5)
        await browser.close()
    return reviews
5. News and Media Monitoring
Track mentions of a company or topic across news sites:
from urllib.parse import quote_plus

async def scrape_news_search(query: str, news_sites: list[str]) -> list[dict]:
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        for site in news_sites:
            # URL-encode the query so spaces and special characters survive
            search_url = f"{site}/search?q={quote_plus(query)}"
            await page.goto(search_url, wait_until="domcontentloaded")
            articles = await page.query_selector_all("article")
            for article in articles[:10]:  # cap at 10 per site
                title_el = await article.query_selector("h2, h3")
                link_el = await article.query_selector("a")
                results.append({
                    "title": await title_el.inner_text() if title_el else "",
                    "url": await link_el.get_attribute("href") if link_el else "",
                    "source": site,
                })
            await asyncio.sleep(2)
        await browser.close()
    return results
6. Social Proof and Competitor Analysis
Track a competitor's public metrics — follower counts, engagement rates, posted content frequency:
async def scrape_public_profile(profile_url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,  # Use headed for social sites
            channel="chrome",
        )
        page = await browser.new_page()
        await page.goto(profile_url, wait_until="domcontentloaded")
        await asyncio.sleep(3)  # Wait for dynamic content
        # Generic extraction — adapt selectors to the specific platform
        followers_el = await page.query_selector("[data-testid='followers-count']")
        posts_el = await page.query_selector_all("[data-testid='post-item']")
        result = {
            "url": profile_url,
            "followers": await followers_el.inner_text() if followers_el else "N/A",
            "recent_post_count": len(posts_el),
        }
        await browser.close()
    return result
7. Government and Public Data Collection
Many government datasets are published as searchable web interfaces rather than downloadable files. Playwright handles form submission and table extraction cleanly:
async def scrape_public_records(search_term: str) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://data.example.gov/search")
        # Fill and submit the search form
        await page.fill("#search-input", search_term)
        await page.click("#search-button")
        await page.wait_for_selector("table.results", timeout=15_000)
        # Extract table rows
        rows = await page.query_selector_all("table.results tbody tr")
        records = []
        for row in rows:
            cells = await row.query_selector_all("td")
            cell_texts = [await cell.inner_text() for cell in cells]
            records.append({"row": cell_texts})
        await browser.close()
    return records
Common Pitfalls and How to Avoid Them
Do not use networkidle for initial page load. It waits until the network has been quiet for 500 ms, which may never happen on sites whose analytics, chat widgets, and ad pixels poll continuously, so loads can stall for 30+ seconds. Use domcontentloaded plus an explicit wait_for_selector for the element you actually need.
Set explicit timeouts everywhere. The default 30-second timeout is not always right. Set timeouts explicitly on each action rather than relying on defaults. Long timeouts hide slow pages; short timeouts cause false failures.
Close browser contexts. Leaking browser contexts eats memory. On a 256MB VPS, three leaked contexts crash the process. Always use async with or explicit close() in a finally block.
Do not scrape what you do not need. Block images, fonts, and media at the context level. Set the fields parameter on APIs. Request only the data you will use. This is faster, cheaper, and less load on the target server.
Respect robots.txt and rate limits. Terms of service matter. Hammering a server is rude and often ineffective — aggressive scraping triggers increasingly aggressive defenses. Add delays, respect Retry-After headers, and consider reaching out to the site owner for an official data feed if you need large volumes.
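Two stdlib-only helpers cover the mechanical side of this advice; the robots check uses urllib.robotparser, and the Retry-After helper handles both forms the header can take (delay-seconds or an HTTP date):

```python
import urllib.robotparser
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def allowed_by_robots(robots_txt: str, user_agent: str, path: str) -> bool:
    """Check a path against an already-fetched robots.txt body."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)

def retry_after_seconds(header_value: str) -> float:
    """Retry-After may be delay-seconds ("120") or an HTTP date."""
    try:
        return float(header_value)
    except ValueError:
        dt = parsedate_to_datetime(header_value)
        return max(0.0, (dt - datetime.now(timezone.utc)).total_seconds())
```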
Playwright has matured into the definitive tool for browser-based scraping in 2026. Its auto-waiting eliminates an entire class of flaky scraper failures, network interception often makes HTML parsing unnecessary, and the async API makes concurrent scraping straightforward. Combined with residential proxies from ThorData and proper stealth configuration, it handles virtually any target you will encounter.