Playwright for Web Scraping in 2026: A Complete Practical Guide
If you've tried scraping a modern website with requests and BeautifulSoup only to get back an empty <div id="root"></div>, you already know the problem. Most sites worth scraping in 2026 are JavaScript-heavy SPAs that render content client-side. The server sends a near-empty HTML shell and all the real content is injected by React, Vue, or Angular after the JavaScript executes. Traditional HTTP scraping cannot see any of that.
Playwright handles this natively. It runs a real browser engine — Chromium, Firefox, or WebKit — executes all the JavaScript the page loads, waits for content to appear in the DOM, and gives you full programmatic control over interactions. You can click buttons, fill forms, scroll the page, intercept network traffic, and screenshot anything. It is the most capable browser automation tool available in Python today.
This guide goes deep. We start from scratch and build up to production-grade patterns: stealth techniques to defeat bot detection, residential proxy rotation via ThorData, CAPTCHA handling strategies, robust retry logic, and output schemas you can rely on in downstream pipelines. Every code example is complete and runnable.
Why Playwright Beats the Alternatives
The browser automation landscape in 2026 has three serious options: Selenium, Puppeteer, and Playwright. Here is why Playwright has become the default choice for serious scrapers:
Selenium is the oldest and most battle-tested. Every hiring manager knows it. But it requires you to manage WebDriver binaries that must match your browser version exactly — a constant headache. Its API is synchronous-first and verbose. Auto-waiting is limited compared to Playwright: you frequently need manual WebDriverWait with ExpectedConditions boilerplate just to click a button safely.
Puppeteer introduced the modern browser automation API, but it is JavaScript/Node.js only. If your data pipeline is Python — and most data engineering is — you are either writing glue code or maintaining a separate Node service. Not ideal.
Playwright for Python was built by former Puppeteer engineers at Microsoft who redesigned the API from scratch. Key advantages over everything else:
- Auto-waiting everywhere — every interaction (`click`, `fill`, `query_selector`) automatically waits for the element to be actionable. No more `sleep(3)` and hope.
- True multi-browser — Chromium, Firefox, and WebKit from a single Python API. When one engine gets fingerprinted and blocked, you switch.
- Network interception — intercept and modify HTTP requests at the browser level. Capture API responses without parsing HTML. Block images and fonts to cut page load time by 70%.
- Async-first design — the Python API is built on `asyncio`. You can run tens of browser contexts concurrently.
- Built-in tracing — record full browser traces for debugging. When a scraper fails in production, replay the trace and see exactly what happened.
- No binary management — `playwright install chromium` downloads the correct browser version. No more chromedriver version mismatch errors.
For scraping JavaScript-heavy sites in 2026, Playwright is the right tool. The only reason not to use it is if the site you are targeting has a clean API underneath the UI — in which case you do not need a browser at all.
Setup and Installation
pip install playwright playwright-stealth httpx
playwright install chromium
If you are on a headless server (VPS, Docker, CI), you may need system dependencies:
playwright install-deps chromium
Verify the installation:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
page.goto("https://example.com")
print(page.title())
browser.close()
If that prints "Example Domain" you are ready.
Architecture: Sync vs Async API
Playwright offers both a synchronous and an asynchronous Python API. Use the async API for any real scraping work.
When to use sync:
- One-off scripts and quick experiments
- Simple scrapers that hit one URL at a time
- Prototyping before you know the scale you need
When to use async:
- Any multi-page scraping (always, basically)
- Running multiple browser contexts concurrently
- Integration with async HTTP clients like `httpx.AsyncClient`
- Production scrapers where throughput matters
The async API looks like this:
import asyncio
from playwright.async_api import async_playwright
async def main():
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto("https://example.com")
print(await page.title())
await browser.close()
asyncio.run(main())
The rest of this guide uses the async API throughout.
Your First Complete Scraper
Here is a production-ready scraper for a paginated product catalog. It handles auto-waiting, extracts structured data, and manages browser lifecycle correctly:
import asyncio
import json
from dataclasses import dataclass, asdict
from typing import Optional
from playwright.async_api import async_playwright, Browser, Page
@dataclass
class Product:
title: str
price: Optional[str]
sku: Optional[str]
rating: Optional[float]
review_count: Optional[int]
url: str
async def extract_products_from_page(page: Page) -> list[Product]:
"""Extract all product cards from the current page."""
await page.wait_for_selector(".product-card", timeout=15_000)
cards = await page.query_selector_all(".product-card")
products = []
for card in cards:
title_el = await card.query_selector(".product-title")
price_el = await card.query_selector(".product-price")
sku_el = await card.query_selector("[data-sku]")
rating_el = await card.query_selector("[aria-label*='stars']")
review_el = await card.query_selector(".review-count")
link_el = await card.query_selector("a.product-link")
title = await title_el.inner_text() if title_el else ""
price = await price_el.inner_text() if price_el else None
sku = await sku_el.get_attribute("data-sku") if sku_el else None
url = await link_el.get_attribute("href") if link_el else page.url
# Parse rating from aria-label like "4.5 stars"
rating = None
if rating_el:
label = await rating_el.get_attribute("aria-label") or ""
try:
rating = float(label.split()[0])
except (ValueError, IndexError):
pass
# Parse review count from text like "(1,234)"
review_count = None
if review_el:
text = await review_el.inner_text()
digits = "".join(c for c in text if c.isdigit())
review_count = int(digits) if digits else None
products.append(Product(
title=title.strip(),
price=price.strip() if price else None,
sku=sku,
rating=rating,
review_count=review_count,
url=url if url.startswith("http") else f"https://example-store.com{url}",
))
return products
async def scrape_catalog(base_url: str, max_pages: int = 10) -> list[dict]:
all_products = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
# Block analytics and ad tracking to speed up loads
extra_http_headers={
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
}
)
# Block images, fonts, and media — not needed for data
await context.route(
"**/*.{png,jpg,jpeg,gif,webp,svg,ico,woff,woff2,ttf,mp4,mp3}",
lambda route: route.abort()
)
page = await context.new_page()
await page.goto(base_url, wait_until="domcontentloaded")
for page_num in range(1, max_pages + 1):
print(f"Scraping page {page_num}...")
products = await extract_products_from_page(page)
all_products.extend(products)
print(f" Extracted {len(products)} products")
# Try to navigate to next page
next_btn = await page.query_selector("button[aria-label='Next page']:not([disabled])")
if not next_btn:
print("No more pages.")
break
await next_btn.click()
await page.wait_for_load_state("domcontentloaded")
await browser.close()
return [asdict(p) for p in all_products]
if __name__ == "__main__":
results = asyncio.run(scrape_catalog("https://example-store.com/catalog"))
with open("products.json", "w") as f:
json.dump(results, f, indent=2)
print(f"Saved {len(results)} products to products.json")
A few design decisions worth noting:

- I use `wait_until="domcontentloaded"` instead of `"networkidle"`. Network idle can hang for 30+ seconds on sites with analytics pixels, chat widgets, and ad networks. Load the DOM, then wait for the specific selector you actually need.
- I block images and fonts at the context level, which cuts page load time dramatically and reduces bandwidth.
- I use `wait_for_selector` with an explicit timeout rather than hoping things appear.
Stealth: Defeating Bot Detection
Headless Chromium is detectable. Bot detection services like Cloudflare, DataDome, PerimeterX, and Akamai Bot Manager look for dozens of signals that distinguish a real browser from automated tooling:
- `navigator.webdriver === true` — the clearest signal of all
- Missing browser plugins array (real Chrome has plugins, headless doesn't)
- WebGL renderer string shows "SwiftShader" or "llvmpipe" instead of a real GPU
- `screen.width` and `window.outerWidth` mismatches
- Missing `chrome` runtime object properties
- Inconsistent hardware concurrency and device memory values
- The `HeadlessChrome` substring in the user agent string
- Chrome automation flags visible in `navigator.userAgentData`
- Missing media codec support
The playwright-stealth package patches most of these at the JavaScript level:
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
async def create_stealth_browser():
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
"--disable-infobars",
]
)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
)
page = await context.new_page()
await stealth_async(page)
return playwright, browser, page
For high-security targets, go further with manual JavaScript injection to patch remaining signals:
async def apply_advanced_stealth(page):
"""Inject stealth overrides before any page scripts run."""
await page.add_init_script("""
// Overwrite navigator.webdriver
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined,
});
// Fake plugins
Object.defineProperty(navigator, 'plugins', {
get: () => [
{ name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer' },
{ name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai' },
{ name: 'Native Client', filename: 'internal-nacl-plugin' },
],
});
// Fake languages
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en'],
});
// Fake hardware concurrency (real CPU count)
Object.defineProperty(navigator, 'hardwareConcurrency', {
get: () => 8,
});
// Remove Automation extension exposure
window.chrome = {
runtime: {},
loadTimes: function() {},
csi: function() {},
app: {}
};
// Fix permission query fingerprint
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
parameters.name === 'notifications' ?
Promise.resolve({ state: Notification.permission }) :
originalQuery(parameters)
);
""")
For the toughest targets, launch with channel="chrome" to use the system-installed Chrome binary. This bypasses most fingerprinting entirely because it IS a real Chrome installation:
browser = await p.chromium.launch(
headless=False, # headed mode avoids more detection
channel="chrome",
)
Yes, headed mode is slower and requires a display (use Xvfb on Linux servers). But it sidesteps most headless-detection checks for a simple reason: it is literally Chrome running normally.
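On a Linux server with no display attached, headed Chrome can run inside a virtual framebuffer. A minimal sketch using xvfb-run; the package name is the Debian/Ubuntu one and `scraper.py` is a placeholder for your entry point:

```shell
# Install the X virtual framebuffer (Debian/Ubuntu package name)
sudo apt-get install -y xvfb

# Run the scraper inside a virtual 1920x1080 display
xvfb-run --auto-servernum --server-args='-screen 0 1920x1080x24' python scraper.py
```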
Human Behavior Simulation
Even with a perfect browser fingerprint, bot detection watches for behavioral signals. Clicks that happen 50 milliseconds after page load, mouse movements that go in a perfect straight line, typing at 1000 characters per second — these are inhuman.
Add realistic behavior with these patterns:
import random
import asyncio
async def human_delay(min_ms: int = 500, max_ms: int = 2000):
"""Wait a random human-like delay."""
delay = random.randint(min_ms, max_ms) / 1000
await asyncio.sleep(delay)
async def human_type(page, selector: str, text: str):
"""Type text with realistic per-character delays."""
await page.click(selector)
for char in text:
await page.keyboard.type(char)
# Vary speed like a real typist: 50-200ms per character
await asyncio.sleep(random.uniform(0.05, 0.20))
async def move_mouse_naturally(page, target_x: int, target_y: int):
"""Move mouse in a curved path rather than a straight line."""
current = await page.evaluate("() => ({x: window.mouseX || 0, y: window.mouseY || 0})")
cx, cy = current.get("x", 0), current.get("y", 0)
# Generate bezier-like intermediate points
steps = random.randint(8, 15)
for i in range(1, steps + 1):
t = i / steps
# Slight curve via a random midpoint offset
mid_x = (cx + target_x) / 2 + random.randint(-30, 30)
mid_y = (cy + target_y) / 2 + random.randint(-30, 30)
x = int((1 - t) ** 2 * cx + 2 * (1 - t) * t * mid_x + t ** 2 * target_x)
y = int((1 - t) ** 2 * cy + 2 * (1 - t) * t * mid_y + t ** 2 * target_y)
await page.mouse.move(x, y)
await asyncio.sleep(random.uniform(0.01, 0.03))
async def human_scroll(page, scroll_amount: int = 500):
"""Scroll in multiple small increments."""
steps = random.randint(5, 12)
per_step = scroll_amount // steps
for _ in range(steps):
await page.mouse.wheel(0, per_step + random.randint(-20, 20))
await asyncio.sleep(random.uniform(0.05, 0.15))
Proxy Rotation with ThorData
IP-based blocking is the other half of the anti-scraping equation. Even a perfect fingerprint will get banned if you send thousands of requests from one IP. You need residential proxy rotation.
ThorData provides rotating residential proxies that route traffic through real ISP-assigned IP addresses. Each request can use a different IP from a pool of millions, making your traffic indistinguishable from normal user traffic distributed across the country or world.
Setting up Playwright with ThorData:
import asyncio
from playwright.async_api import async_playwright
THORDATA_USER = "your_thordata_username"
THORDATA_PASS = "your_thordata_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
async def scrape_with_proxy(url: str) -> str:
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy={
"server": f"http://{THORDATA_HOST}:{THORDATA_PORT}",
"username": THORDATA_USER,
"password": THORDATA_PASS,
}
)
page = await browser.new_page()
await page.goto(url)
content = await page.content()
await browser.close()
return content
For country-specific IPs (useful when scraping geo-restricted content), ThorData supports country targeting via username parameters:
# Route through US residential IPs only
proxy_user = f"{THORDATA_USER}-country-us"
# Route through UK IPs
proxy_user = f"{THORDATA_USER}-country-gb"
browser = await p.chromium.launch(
proxy={
"server": f"http://{THORDATA_HOST}:{THORDATA_PORT}",
"username": proxy_user,
"password": THORDATA_PASS,
}
)
For concurrent scraping, launch a fresh browser per URL so each session picks up a different IP from the rotating pool:
async def scrape_batch(urls: list[str], concurrency: int = 5) -> list[dict]:
"""Scrape multiple URLs concurrently, each with its own proxy IP."""
semaphore = asyncio.Semaphore(concurrency)
results = []
async def scrape_one(url: str) -> dict:
async with semaphore:
async with async_playwright() as p:
browser = await p.chromium.launch(
proxy={
"server": f"http://{THORDATA_HOST}:{THORDATA_PORT}",
"username": THORDATA_USER,
"password": THORDATA_PASS,
}
)
try:
page = await browser.new_page()
await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
# ... extract data ...
return {"url": url, "status": "ok", "data": {}}
except Exception as e:
return {"url": url, "status": "error", "error": str(e)}
finally:
await browser.close()
tasks = [scrape_one(url) for url in urls]
results = await asyncio.gather(*tasks)
return list(results)
Intercepting Network Requests
This is the most powerful Playwright feature for scraping, and most scrapers never use it. Modern SPAs fetch their data from internal APIs and render it into HTML. Instead of parsing the rendered HTML, you can intercept the raw API response directly. It is faster, more reliable, and the data arrives already structured.
Here is how to intercept the JSON API calls that power a product listing page:
import asyncio
import json
from playwright.async_api import async_playwright
async def intercept_api_responses(url: str, api_pattern: str) -> list[dict]:
"""
Load a page and capture all API responses matching a URL pattern.
Returns the parsed JSON bodies.
"""
captured_data = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
async def handle_response(response):
if api_pattern in response.url:
try:
if response.status == 200:
data = await response.json()
captured_data.append({
"url": response.url,
"data": data,
})
print(f"Captured: {response.url}")
except Exception as e:
print(f"Failed to parse response from {response.url}: {e}")
page.on("response", handle_response)
await page.goto(url, wait_until="networkidle", timeout=30_000)
# If page loads more data on scroll, trigger it
for _ in range(3):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(2)
await browser.close()
return captured_data
# Example: capture a listing page's internal API responses. The "/api/"
# fragment is a placeholder -- find the real endpoint pattern in your
# browser's Network tab before relying on it.
async def scrape_listing_api(listing_url: str) -> list[dict]:
    data = await intercept_api_responses(listing_url, "/api/")
    return data
You can also intercept requests BEFORE they are sent, which lets you modify headers, block tracking, or substitute mock responses:
async def setup_request_interception(page):
"""Block analytics and modify requests before they fire."""
async def handle_route(route):
url = route.request.url
# Block analytics and tracking
if any(domain in url for domain in [
"google-analytics.com", "facebook.com/tr",
"doubleclick.net", "hotjar.com",
]):
await route.abort()
return
# Add custom headers to all requests
headers = {
**route.request.headers,
"X-Custom-Header": "scraper-v1",
}
await route.continue_(headers=headers)
await page.route("**/*", handle_route)
Handling Pagination: All Three Patterns
Real sites use one of three pagination patterns, each requiring a different approach.
Pattern 1: Click-Based Next Page
async def scrape_paginated_site(start_url: str) -> list[dict]:
all_items = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(start_url, wait_until="domcontentloaded")
page_num = 0
while True:
page_num += 1
print(f"Scraping page {page_num}")
# Wait for content to appear
try:
await page.wait_for_selector(".item-list .item", timeout=10_000)
except Exception:
print("No items found, stopping")
break
# Extract items
items = await page.query_selector_all(".item-list .item")
for item in items:
title_el = await item.query_selector("h3")
all_items.append({
"title": await title_el.inner_text() if title_el else "",
"page": page_num,
})
# Find next page button — disabled or missing means last page
next_btn = await page.query_selector("a.pagination__next:not(.disabled)")
if not next_btn:
print("Reached last page")
break
# Click and wait for new content
old_first_item = await page.inner_text(".item-list .item:first-child h3")
await next_btn.click()
# Wait until content actually changes (not just DOM update)
try:
# Pass the old value as an argument instead of interpolating it into the
# JS string -- interpolation breaks if the title contains a quote
await page.wait_for_function(
    "prev => document.querySelector('.item-list .item:first-child h3')?.innerText !== prev",
    arg=old_first_item,
)
except Exception:
break
await browser.close()
return all_items
Pattern 2: Infinite Scroll / Load More
async def scrape_infinite_scroll(url: str, max_scrolls: int = 20) -> list[dict]:
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="domcontentloaded")
items_seen = set()
all_items = []
for scroll_num in range(max_scrolls):
# Collect currently visible items
cards = await page.query_selector_all("[data-testid='item-card']")
new_count = 0
for card in cards:
item_id = await card.get_attribute("data-id")
if item_id and item_id not in items_seen:
items_seen.add(item_id)
title_el = await card.query_selector(".title")
all_items.append({
"id": item_id,
"title": await title_el.inner_text() if title_el else "",
})
new_count += 1
print(f"Scroll {scroll_num + 1}: +{new_count} new items (total: {len(all_items)})")
if new_count == 0 and scroll_num > 0:
print("No new items after scroll, done")
break
# Scroll to bottom
prev_height = await page.evaluate("document.body.scrollHeight")
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(2) # Wait for new content to load
new_height = await page.evaluate("document.body.scrollHeight")
if new_height == prev_height:
print("Page height unchanged, no more content")
break
await browser.close()
return all_items
Pattern 3: URL Parameter Pagination
async def scrape_url_pagination(base_url: str, max_pages: int = 50) -> list[dict]:
"""Simplest case: iterate page numbers in the URL."""
all_items = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
for page_num in range(1, max_pages + 1):
url = f"{base_url}?page={page_num}"
await page.goto(url, wait_until="domcontentloaded")
items = await page.query_selector_all(".search-result")
if not items:
print(f"No results on page {page_num}, stopping")
break
for item in items:
all_items.append({"page": page_num, "url": url})
# Respect the server
await asyncio.sleep(1.0)
await browser.close()
return all_items
Rate Limiting and CAPTCHA Handling
Respecting Rate Limits
The simplest approach is adaptive backoff: start fast, slow down when you see signals of rate limiting (slow responses, empty results, 429 status):
import asyncio
import time
from dataclasses import dataclass
@dataclass
class RateLimiter:
requests_per_second: float = 1.0
_last_request_time: float = 0.0
async def wait(self):
now = time.monotonic()
elapsed = now - self._last_request_time
min_interval = 1.0 / self.requests_per_second
if elapsed < min_interval:
await asyncio.sleep(min_interval - elapsed)
self._last_request_time = time.monotonic()
async def scrape_with_rate_limit(urls: list[str]) -> list[dict]:
limiter = RateLimiter(requests_per_second=0.5) # 1 request per 2 seconds
results = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
for url in urls:
await limiter.wait()
try:
await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
# Check for rate limit page
if "Rate limited" in await page.title() or await page.query_selector(".captcha"):
print(f"Rate limited at {url}, waiting 60s...")
await asyncio.sleep(60)
await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
results.append({"url": url, "status": "ok"})
except Exception as e:
results.append({"url": url, "status": "error", "error": str(e)})
await browser.close()
return results
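The fixed-rate limiter above can be extended into the adaptive backoff described at the start of this section: recover speed slowly while requests succeed, back off sharply on a rate-limit signal. A minimal sketch; the multipliers and bounds are illustrative starting points, not tuned values:

```python
import asyncio
import time

class AdaptiveRateLimiter:
    """Speed up gently on success, back off hard on rate-limit signals."""

    def __init__(self, initial_delay: float = 1.0,
                 min_delay: float = 0.5, max_delay: float = 60.0):
        self.delay = initial_delay
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._last = 0.0

    async def wait(self):
        # Sleep until at least `delay` seconds have passed since the last request
        elapsed = time.monotonic() - self._last
        if elapsed < self.delay:
            await asyncio.sleep(self.delay - elapsed)
        self._last = time.monotonic()

    def record_success(self):
        # Recover speed slowly: 5% faster per success, down to the floor
        self.delay = max(self.min_delay, self.delay * 0.95)

    def record_rate_limited(self):
        # Back off sharply: double the delay, up to the ceiling
        self.delay = min(self.max_delay, self.delay * 2)
```

Call `record_rate_limited()` whenever you see a 429, a CAPTCHA page, or a suspiciously empty result set, and `record_success()` otherwise; the limiter converges toward the fastest rate the target tolerates.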
CAPTCHA Handling
CAPTCHAs are the escalation point when other anti-bot measures fail. You have three options:
Option 1: Avoid triggering CAPTCHAs. Use stealth mode, respect rate limits, use residential proxies. If you are getting CAPTCHAs, you are already doing something that looks bot-like.
Option 2: CAPTCHA solving services. 2captcha, Anti-Captcha, and CapSolver provide workers (human or AI) that solve CAPTCHAs for a fee. Integration looks like this:
import httpx
import asyncio
async def solve_recaptcha_v2(site_key: str, page_url: str, api_key: str) -> str:
"""Solve reCAPTCHA v2 via 2captcha service."""
async with httpx.AsyncClient() as client:
# Submit CAPTCHA task
resp = await client.post(
"https://2captcha.com/in.php",
data={
"key": api_key,
"method": "userrecaptcha",
"googlekey": site_key,
"pageurl": page_url,
"json": 1,
}
)
task_id = resp.json()["request"]
# Poll for result (typically 15-45 seconds)
for _ in range(20):
await asyncio.sleep(5)
result = await client.get(
"https://2captcha.com/res.php",
params={"key": api_key, "action": "get", "id": task_id, "json": 1}
)
data = result.json()
if data["status"] == 1:
return data["request"] # The g-recaptcha-response token
raise Exception("CAPTCHA solve timeout")
async def handle_captcha_page(page, api_key: str):
"""Detect and solve reCAPTCHA on the current page."""
# Check for reCAPTCHA
recaptcha = await page.query_selector(".g-recaptcha")
if not recaptcha:
return True # No CAPTCHA
site_key = await recaptcha.get_attribute("data-sitekey")
if not site_key:
return False
print(f"CAPTCHA detected, solving via 2captcha...")
token = await solve_recaptcha_v2(site_key, page.url, api_key)
# Inject the token and submit
await page.evaluate(f"""
document.getElementById('g-recaptcha-response').value = '{token}';
document.querySelector('[data-callback]') &&
window[document.querySelector('[data-callback]').dataset.callback]();
""")
# Python Playwright has no wait_for_navigation; wait for the load state instead
await page.wait_for_load_state("domcontentloaded")
return True
Option 3: Playwright with real Chrome + stealth. Many CAPTCHA implementations have logic that skips the challenge for browsers with a legitimate traffic history. Using a real Chrome profile with cookies can bypass CAPTCHAs entirely on sites you have visited "normally" before.
Retry Logic and Error Handling
Production scrapers fail. Network timeouts, server errors, page load failures — build retry logic in from the start:
import asyncio
import functools
import random
from typing import Callable
def with_retry(max_attempts: int = 3, backoff_base: float = 2.0):
"""Decorator for automatic retry with exponential backoff."""
def decorator(func: Callable) -> Callable:
@functools.wraps(func)
async def wrapper(*args, **kwargs):
last_error = None
for attempt in range(1, max_attempts + 1):
try:
return await func(*args, **kwargs)
except Exception as e:
last_error = e
if attempt == max_attempts:
break
wait = backoff_base ** attempt + random.uniform(0, 1)
print(f"Attempt {attempt} failed: {e}. Retrying in {wait:.1f}s...")
await asyncio.sleep(wait)
raise last_error
return wrapper
return decorator
@with_retry(max_attempts=3, backoff_base=2.0)
async def scrape_product_page(page, url: str) -> dict:
"""Scrape a single product page with automatic retries."""
await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
# Verify we got the right page (not a 404, rate limit page, etc.)
title = await page.title()
if "404" in title or "Not Found" in title:
raise ValueError(f"Got 404 for {url}")
if "Access Denied" in title or "Blocked" in title:
raise PermissionError(f"Blocked at {url}")
name_el = await page.query_selector("h1.product-name")
if not name_el:
raise ValueError(f"No product name found at {url}")
return {
"url": url,
"name": await name_el.inner_text(),
}
Output Schema Design
Well-designed output schemas make scraped data useful in downstream pipelines. Validate your output shapes:
from dataclasses import dataclass, asdict, field
from typing import Optional
import json
@dataclass
class NutritionFacts:
calories: Optional[float]
fat_g: Optional[float]
saturated_fat_g: Optional[float]
carbohydrates_g: Optional[float]
sugars_g: Optional[float]
fiber_g: Optional[float]
protein_g: Optional[float]
sodium_mg: Optional[float]
@dataclass
class ProductRecord:
# Required fields — scraper must fill these
url: str
scraped_at: str # ISO8601
# Product identity
name: str = ""
brand: str = ""
sku: Optional[str] = None
barcode: Optional[str] = None
# Pricing
price_raw: Optional[str] = None # "$24.99"
price_cents: Optional[int] = None # 2499
currency: Optional[str] = "USD"
in_stock: Optional[bool] = None
# Content
description: Optional[str] = None
images: list[str] = field(default_factory=list)
categories: list[str] = field(default_factory=list)
rating: Optional[float] = None
review_count: Optional[int] = None
# Nutrition (for food products)
nutrition: Optional[NutritionFacts] = None
# Scraper metadata
proxy_used: Optional[str] = None
scrape_duration_ms: Optional[int] = None
# Example output
example = ProductRecord(
url="https://shop.example.com/product/widget-pro",
scraped_at="2026-03-31T14:22:00Z",
name="Widget Pro 3000",
brand="Acme Corp",
sku="WP-3000-BLK",
price_raw="$89.99",
price_cents=8999,
in_stock=True,
rating=4.3,
review_count=1247,
categories=["Electronics", "Gadgets", "Home Office"],
)
print(json.dumps(asdict(example), indent=2, default=str))
Example output:
{
"url": "https://shop.example.com/product/widget-pro",
"scraped_at": "2026-03-31T14:22:00Z",
"name": "Widget Pro 3000",
"brand": "Acme Corp",
"sku": "WP-3000-BLK",
"barcode": null,
"price_raw": "$89.99",
"price_cents": 8999,
"currency": "USD",
"in_stock": true,
"description": null,
"images": [],
"categories": ["Electronics", "Gadgets", "Home Office"],
"rating": 4.3,
"review_count": 1247,
"nutrition": null,
"proxy_used": null,
"scrape_duration_ms": null
}
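A record shape is only useful if bad rows are caught before they reach the pipeline. A lightweight validator can enforce required fields and basic invariants; this sketch re-declares a trimmed ProductRecord so it is self-contained, and the rules are examples to tighten for your own data:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProductRecord:
    # Trimmed re-declaration for this sketch; in practice import the
    # full ProductRecord from your schema module
    url: str
    scraped_at: str
    name: str = ""
    price_cents: Optional[int] = None
    rating: Optional[float] = None

def validate_record(rec: ProductRecord) -> list[str]:
    """Return a list of validation errors; an empty list means the record is OK."""
    errors = []
    if not rec.url.startswith("http"):
        errors.append("url must be absolute")
    if not rec.scraped_at:
        errors.append("scraped_at is required")
    if not rec.name.strip():
        errors.append("name is empty")
    if rec.price_cents is not None and rec.price_cents < 0:
        errors.append("price_cents is negative")
    if rec.rating is not None and not (0 <= rec.rating <= 5):
        errors.append("rating out of range")
    return errors
```

Reject or quarantine records with errors rather than silently writing them; a nulled-out field is recoverable, a corrupted one poisons every downstream consumer.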
7 Real-World Use Cases
1. E-commerce Price Monitoring
Track competitor prices across multiple retailers for a product line. Run daily, alert when a price drops below a threshold:
import asyncio
from datetime import datetime
PRODUCTS_TO_TRACK = [
{"name": "Widget Pro", "urls": [
"https://amazon.com/dp/B0EXAMPLE",
"https://bestbuy.com/site/widget-pro/123456.p",
"https://target.com/p/widget-pro/-/A-123456",
]},
]
async def check_price(page, url: str) -> dict:
await page.goto(url, wait_until="domcontentloaded", timeout=30_000)
# Each retailer needs specific selectors...
price_el = await page.query_selector("[data-testid='price']")
price = await price_el.inner_text() if price_el else "N/A"
return {"url": url, "price": price, "checked_at": datetime.utcnow().isoformat()}
2. Job Board Aggregation
Aggregate job postings from multiple boards into a unified feed:
from datetime import datetime

async def scrape_job_board(url: str) -> list[dict]:
jobs = []
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
page = await browser.new_page()
await page.goto(url, wait_until="domcontentloaded")
await page.wait_for_selector(".job-listing")
listings = await page.query_selector_all(".job-listing")
for listing in listings:
title_el = await listing.query_selector(".job-title")
company_el = await listing.query_selector(".company-name")
location_el = await listing.query_selector(".location")
jobs.append({
"title": await title_el.inner_text() if title_el else "",
"company": await company_el.inner_text() if company_el else "",
"location": await location_el.inner_text() if location_el else "",
"source_url": url,
"scraped_at": datetime.utcnow().isoformat(),
})
await browser.close()
return jobs
3. Real Estate Listing Data
Pull property listings including price, size, location, and photos:
async def scrape_property_listings(search_url: str) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": "http://proxy.thordata.com:9000",
                   "username": THORDATA_USER, "password": THORDATA_PASS},
        )
        page = await browser.new_page()
        await page.goto(search_url, wait_until="domcontentloaded")
        properties = []
        cards = await page.query_selector_all("[data-listing-id]")
        for card in cards:
            listing_id = await card.get_attribute("data-listing-id")
            price_el = await card.query_selector("[data-test='property-card-price']")
            beds_el = await card.query_selector("[data-test='property-card-beds']")
            sqft_el = await card.query_selector("[data-test='property-card-sqft']")
            properties.append({
                "id": listing_id,
                "price": await price_el.inner_text() if price_el else None,
                "beds": await beds_el.inner_text() if beds_el else None,
                "sqft": await sqft_el.inner_text() if sqft_el else None,
            })
        await browser.close()
    return properties
4. Review and Sentiment Monitoring
Collect product reviews from retail sites for sentiment analysis:
async def scrape_reviews(product_url: str, max_pages: int = 5) -> list[dict]:
    reviews = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        for page_num in range(1, max_pages + 1):
            url = f"{product_url}?reviewsPage={page_num}"
            await page.goto(url, wait_until="domcontentloaded")
            review_cards = await page.query_selector_all(".review-item")
            if not review_cards:
                break
            for card in review_cards:
                rating_el = await card.query_selector("[data-rating]")
                body_el = await card.query_selector(".review-body")
                date_el = await card.query_selector(".review-date")
                reviews.append({
                    "rating": await rating_el.get_attribute("data-rating") if rating_el else None,
                    "body": await body_el.inner_text() if body_el else "",
                    "date": await date_el.inner_text() if date_el else "",
                })
            await asyncio.sleep(1.5)
        await browser.close()
    return reviews
5. News and Media Monitoring
Track mentions of a company or topic across news sites:
from urllib.parse import quote_plus

async def scrape_news_search(query: str, news_sites: list[str]) -> list[dict]:
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        for site in news_sites:
            # URL-encode the query so spaces and special characters survive
            search_url = f"{site}/search?q={quote_plus(query)}"
            await page.goto(search_url, wait_until="domcontentloaded")
            articles = await page.query_selector_all("article")
            for article in articles[:10]:  # cap at 10 per site
                title_el = await article.query_selector("h2, h3")
                link_el = await article.query_selector("a")
                results.append({
                    "title": await title_el.inner_text() if title_el else "",
                    "url": await link_el.get_attribute("href") if link_el else "",
                    "source": site,
                })
            await asyncio.sleep(2)
        await browser.close()
    return results
6. Social Proof and Competitor Analysis
Track a competitor's public metrics — follower counts, engagement rates, posted content frequency:
async def scrape_public_profile(profile_url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=False,  # Use headed for social sites
            channel="chrome",
        )
        page = await browser.new_page()
        await page.goto(profile_url, wait_until="domcontentloaded")
        await asyncio.sleep(3)  # Wait for dynamic content
        # Generic extraction — adapt selectors to the specific platform
        followers_el = await page.query_selector("[data-testid='followers-count']")
        posts_el = await page.query_selector_all("[data-testid='post-item']")
        result = {
            "url": profile_url,
            "followers": await followers_el.inner_text() if followers_el else "N/A",
            "recent_post_count": len(posts_el),
        }
        await browser.close()
    return result
7. Government and Public Data Collection
Many government datasets are published as searchable web interfaces rather than downloadable files. Playwright handles form submission and table extraction cleanly:
async def scrape_public_records(search_term: str) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://data.example.gov/search")
        # Fill and submit the search form
        await page.fill("#search-input", search_term)
        await page.click("#search-button")
        await page.wait_for_selector("table.results", timeout=15_000)
        # Extract table rows
        rows = await page.query_selector_all("table.results tbody tr")
        records = []
        for row in rows:
            cells = await row.query_selector_all("td")
            cell_texts = [await cell.inner_text() for cell in cells]
            records.append({"row": cell_texts})
        await browser.close()
    return records
Common Pitfalls and How to Avoid Them
Do not use networkidle for initial page load. It waits until the network has been quiet for 500 ms, which may never happen on sites whose analytics, chat widgets, and ad pixels poll continuously, so loads can stall for 30+ seconds. Use domcontentloaded plus an explicit wait_for_selector for the element you actually need.
Set explicit timeouts everywhere. The default 30-second timeout is not always right. Set timeouts explicitly on each action rather than relying on defaults. Long timeouts hide slow pages; short timeouts cause false failures.
Close browser contexts. Leaking browser contexts eats memory. On a 256MB VPS, three leaked contexts crash the process. Always use async with or explicit close() in a finally block.
Do not scrape what you do not need. Block images, fonts, and media at the context level. Set the fields parameter on APIs. Request only the data you will use. This is faster, cheaper, and less load on the target server.
Respect robots.txt and rate limits. Terms of service matter. Hammering a server is rude and often ineffective — aggressive scraping triggers increasingly aggressive defenses. Add delays, respect Retry-After headers, and consider reaching out to the site owner for an official data feed if you need large volumes.
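Two stdlib-only helpers cover the mechanical side of this advice; the robots check uses urllib.robotparser, and the Retry-After helper handles both forms the header can take (delay-seconds or an HTTP date):

```python
import urllib.robotparser
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def allowed_by_robots(robots_txt: str, user_agent: str, path: str) -> bool:
    """Check a path against an already-fetched robots.txt body."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)

def retry_after_seconds(header_value: str) -> float:
    """Retry-After may be delay-seconds ("120") or an HTTP date."""
    try:
        return float(header_value)
    except ValueError:
        dt = parsedate_to_datetime(header_value)
        return max(0.0, (dt - datetime.now(timezone.utc)).total_seconds())
```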
Playwright has matured into the definitive tool for browser-based scraping in 2026. Its auto-waiting eliminates an entire class of flaky scraper failures, network interception often makes HTML parsing unnecessary, and the async API makes concurrent scraping straightforward. Combined with residential proxies from ThorData and proper stealth configuration, it handles virtually any target you will encounter.