How to Scrape JavaScript-Heavy Websites with Playwright (2026)
You write a clean requests-and-BeautifulSoup scraper. It works on most sites. Then you hit a React dashboard or an infinite-scroll product page and get back an empty `<div id="root">` and nothing else. The data you need is nowhere in the HTML source. Welcome to the JavaScript rendering problem.
Modern web applications do not ship HTML anymore - they ship JavaScript bundles that build the HTML inside your browser. React, Vue, Angular, and Svelte applications render entirely client-side. Lazy-loaded content, dynamically injected elements, and XHR-fetched data are invisible to traditional HTTP scrapers. The HTML you download from the server is a skeleton waiting to be filled in by the browser.
This guide covers Playwright for Python in depth: why it is the right tool for JS-heavy sites in 2026, how to set it up, every wait strategy you will actually need, performance optimizations that matter at scale, proxy integration with ThorData, anti-detection techniques, CAPTCHA handling, and seven production-grade use cases with complete code. If you have avoided Playwright because it felt complex, this will change that.
Why Some Sites Need a Real Browser
Before reaching for Playwright, it is worth understanding exactly what the problem is. HTTP scraping fails on JS-heavy sites for a specific reason: the server sends JavaScript, not data. The data only exists after the JavaScript runs in a browser environment.
The rendering pipeline:

1. Browser requests the URL
2. Server returns HTML skeleton + JavaScript bundle (often 1-5MB)
3. Browser parses and executes the JavaScript
4. JavaScript makes additional API calls (XHR/fetch)
5. JavaScript builds the DOM with the fetched data
6. User sees a rendered page
If you stop at step 2 with a raw HTTP client, you get the skeleton. The actual product names, prices, and content are assembled in steps 4-6.
How to tell if you need Playwright:

- Open the page in a browser, then right-click and "View Page Source"
- If View Source shows useful data: requests + BeautifulSoup can handle it
- If View Source shows an empty shell with script tags: you need a browser
- Check the Network tab in DevTools: if the data comes from a /api/ URL as JSON, you can scrape that API directly - often cleaner than Playwright
The API interception alternative: Before building a Playwright scraper, always check if the data comes from an internal API that you can call directly. Open DevTools Network tab, filter by Fetch/XHR, reload the page, and look for JSON responses containing the data you want. Calling that API directly with httpx is faster, cheaper, and more reliable than a full Playwright setup. We cover this pattern in detail later.
Playwright vs Selenium vs Puppeteer in 2026
Selenium is the original browser automation framework, dating to 2004. It works, but it shows its age in several ways: verbose setup code, a WebDriver binary that must match your Chrome version (constant maintenance headache), flaky implicit waits that require time.sleep guesses, and an inconsistent API between Python and other languages.
Puppeteer was Google's modern answer to Selenium - a Chrome-only library with a clean Promise-based API. The official version is JavaScript/Node.js only. The Python port (pyppeteer) was a community project that has been effectively abandoned since 2021 with known compatibility issues on modern Chrome versions.
Playwright is Microsoft's 2020 entry that solved the remaining problems:

- Cross-browser support (Chromium, Firefox, WebKit) from one unified API
- Native async/await in Python with asyncio
- Built-in auto-wait system - no more time.sleep guesswork
- Headless by default, one flag to switch to headed for debugging
- Screenshots, video recording, and network interception built-in
- Active development with monthly releases and genuine Microsoft support
- Browser binaries bundled - no version matching
For web scraping in Python in 2026, Playwright is the clear choice. The only reason to use Selenium is if you are maintaining existing code that uses it.
Installation and Setup
```shell
pip install playwright
playwright install chromium
```
Playwright downloads its own browser binaries to a local cache directory. No chromedriver version matching, no PATH configuration, no system Chrome dependency. This also means the browser version is pinned to the Playwright version, giving you reproducible behavior across environments.
For scraping specifically, Chromium is the right choice: it is what the majority of sites are tested against, and Chrome-specific fingerprinting is widespread.
Optional dependencies that matter:
```shell
# For TLS fingerprinting bypass (when httpx is not enough)
pip install curl-cffi

# For running Playwright in Docker/server environments,
# install system dependencies first:
apt-get install -y libglib2.0-0 libnss3 libnspr4 libdbus-1-3 libatk1.0-0 \
  libatk-bridge2.0-0 libcups2 libdrm2 libxkbcommon0 libxcomposite1 \
  libxdamage1 libxfixes3 libxrandr2 libgbm1 libasound2
```
Verify installation:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```
Sync vs Async API
Playwright offers both synchronous and asynchronous APIs. For scraping, almost always use the async API:
```python
# Sync - fine for simple scripts, sequential execution
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    content = page.content()
    browser.close()
```
```python
# Async - required for concurrent scraping
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://example.com")
        content = await page.content()
        await browser.close()

asyncio.run(main())
```
The sync API blocks on every operation. When you are scraping a single page at a time for debugging, that is fine. For production scrapers running against multiple URLs concurrently, the async API with asyncio.gather is mandatory - otherwise you are paying Playwright's memory overhead without getting its concurrency benefit.
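The concurrency pattern itself is plain asyncio: fan out one coroutine per URL with asyncio.gather, capped by a semaphore. A stripped-down sketch with the Playwright calls stubbed out (the sleep stands in for navigation and extraction):

```python
import asyncio

async def scrape_one(url: str, sem: asyncio.Semaphore) -> dict:
    # The semaphore caps how many pages are in flight at once;
    # the sleep is a stand-in for real goto()/content() calls.
    async with sem:
        await asyncio.sleep(0.01)
        return {"url": url, "success": True}

async def scrape_all(urls: list, max_concurrent: int = 4) -> list:
    sem = asyncio.Semaphore(max_concurrent)
    # gather runs all coroutines concurrently and preserves input order
    return await asyncio.gather(*(scrape_one(u, sem) for u in urls))

results = asyncio.run(scrape_all([f"https://example.com/p/{i}" for i in range(8)]))
```

The same shape reappears in the production scraper later in this guide; only the body of `scrape_one` changes.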
Full Working Example: JS-Rendered Quote Scraper
Let us start with a real, runnable example. quotes.toscrape.com/js/ renders its quotes entirely via JavaScript, making it the standard test site for browser-based scrapers:
```python
import asyncio
from dataclasses import dataclass
from typing import List

from playwright.async_api import async_playwright

@dataclass
class Quote:
    text: str
    author: str
    tags: List[str]

async def scrape_js_quotes(url: str = "http://quotes.toscrape.com/js/") -> List[Quote]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 720},
        )
        page = await context.new_page()
        await page.goto(url)
        # Wait until JS has rendered the quote elements
        await page.wait_for_selector(".quote", timeout=10000)
        quotes = []
        quote_elements = await page.locator(".quote").all()
        for el in quote_elements:
            text = await el.locator(".text").inner_text()
            author = await el.locator(".author").inner_text()
            tag_elements = await el.locator(".tag").all()
            tags = [await t.inner_text() for t in tag_elements]
            quotes.append(Quote(text=text.strip(), author=author.strip(), tags=tags))
        await context.close()
        await browser.close()
        return quotes

if __name__ == "__main__":
    results = asyncio.run(scrape_js_quotes())
    for q in results:
        print(f"{q.text[:60]}... — {q.author}")
```
Key points in this example:

- wait_for_selector pauses execution until JS has rendered the .quote elements - no arbitrary sleep needed
- locator returns a Locator object, and .all() gives you a list you can iterate
- Each async operation inside the loop awaits independently
- The context is closed before the browser to properly release resources
Wait Strategies
The auto-wait system is Playwright's most important feature for scraping. Every locator action (click, fill, inner_text, etc.) automatically waits for the element to be visible and enabled. But you still need to explicitly wait at the right points.
wait_for_selector: wait for a specific element
```python
# Wait for element to appear (default state: visible)
await page.wait_for_selector(".product-grid", timeout=10000)

# Wait for element to be hidden (useful after loading spinners)
await page.wait_for_selector(".loading-spinner", state="hidden", timeout=15000)

# Wait for element with specific text
await page.wait_for_selector("text=Add to Cart", timeout=5000)
```
wait_for_load_state: wait for page lifecycle events
```python
# Wait for DOMContentLoaded (HTML parsed, DOM built, synchronous scripts done)
await page.goto(url, wait_until="domcontentloaded")

# Wait for load event (HTML + all resources loaded)
await page.goto(url, wait_until="load")

# Wait for network idle (no pending requests for 500ms) - good for SPAs
await page.goto(url, wait_until="networkidle")

# After navigation, wait for network to settle
await page.wait_for_load_state("networkidle", timeout=10000)
```
wait_for_response: wait for a specific API call to complete
```python
# Wait for a specific API response triggered by a user action
async def wait_for_search_results(page, search_term: str) -> dict:
    async with page.expect_response(
        lambda r: "/api/search" in r.url and r.status == 200
    ) as response_info:
        await page.fill("#search-input", search_term)
        await page.press("#search-input", "Enter")
    response = await response_info.value
    return await response.json()
```
wait_for_function: wait for arbitrary JavaScript condition
```python
# Wait until a JavaScript condition is true
await page.wait_for_function(
    "() => document.querySelectorAll('.product-card').length > 10",
    timeout=15000,
)

# Wait for a specific global variable to be set
await page.wait_for_function("() => window.__DATA__ !== undefined")
```
Choosing the right wait strategy:
| Situation | Use |
|---|---|
| Specific element must be present | wait_for_selector |
| Initial page navigation | wait_until="networkidle" in goto |
| After user action triggers content | wait_for_selector on new content |
| Background API must complete | wait_for_response |
| Custom JavaScript condition | wait_for_function |
| Loading spinner must disappear | wait_for_selector state="hidden" |
Blocking Resources for Performance
A full browser loads every resource: HTML, CSS, JavaScript, images, fonts, video. For scraping, you often only need the HTML and the JavaScript that renders it. Blocking unnecessary resources cuts page load time by 40-70%.
```python
from playwright.async_api import async_playwright, Route, Request

BLOCKED_RESOURCE_TYPES = {"image", "stylesheet", "font", "media", "other"}
BLOCKED_DOMAINS = {"google-analytics.com", "doubleclick.net", "facebook.com", "hotjar.com"}

async def block_resources(route: Route, request: Request) -> None:
    """Abort requests for non-essential resources."""
    resource_type = request.resource_type
    url = request.url
    if resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
        return
    if any(domain in url for domain in BLOCKED_DOMAINS):
        await route.abort()
        return
    await route.continue_()

async def scrape_fast(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.route("**/*", block_resources)
        await page.goto(url, wait_until="domcontentloaded")
        await page.wait_for_selector("main, #content, .product-list")
        html = await page.content()
        await browser.close()
        return html
```
Blocking specific ad and tracking scripts:
```python
AD_SCRIPT_PATTERNS = [
    "**/ads/**",
    "**/analytics/**",
    "**/*.tracking.*",
    "**/beacon*",
    "**/pixel*",
    "**/gtm*",
    "**/ga.js",
    "**/analytics.js",
]

async def setup_resource_blocking(page) -> None:
    async def abort_ads(route):
        await route.abort()

    for pattern in AD_SCRIPT_PATTERNS:
        await page.route(pattern, abort_ads)
```
Screenshots and Debugging
When your scraper returns empty results, screenshots tell you exactly what the browser actually sees:
```python
from playwright.async_api import async_playwright

async def debug_scrape(url: str) -> None:
    async with async_playwright() as p:
        # Headed mode with slow_mo for visual debugging
        browser = await p.chromium.launch(headless=False, slow_mo=500)
        page = await browser.new_page()
        # Register console/error listeners before navigating so messages
        # emitted during page load are captured too
        page.on("console", lambda msg: print(f"Console {msg.type}: {msg.text}"))
        page.on("pageerror", lambda err: print(f"Page error: {err}"))
        await page.goto(url)
        # Full-page screenshot
        await page.screenshot(path="debug_full.png", full_page=True)
        # Screenshot of specific element
        element = await page.query_selector(".product-grid")
        if element:
            await element.screenshot(path="debug_element.png")
        await page.wait_for_timeout(3000)
        await browser.close()
```
Video recording for complex debugging:
```python
async def record_scraping_session(url: str, output_dir: str = "/tmp/pw_video") -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            record_video_dir=output_dir,
            record_video_size={"width": 1280, "height": 720},
        )
        page = await context.new_page()
        await page.goto(url)
        # Do your scraping here
        await context.close()  # Video is saved on context close
        await browser.close()
```
Proxy Integration
Residential proxies are essential for scraping sites that block data center IPs. With Playwright, proxy configuration goes into the browser launch or context creation:
Browser-level proxy (all contexts use same proxy):
```python
from playwright.async_api import async_playwright

async def scrape_with_proxy(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://gate.thordata.com:7000",
                "username": "your_username",
                "password": "your_password",
            },
        )
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content
```
Context-level proxy rotation (different proxy per batch):
```python
import asyncio
from typing import List, Optional

from playwright.async_api import async_playwright

THORDATA_BASE = "http://gate.thordata.com:7000"

def get_proxy_config(username: str, password: str, session_id: Optional[str] = None) -> dict:
    """
    ThorData proxy config.
    For sticky sessions, append session ID to username: username-sessid{id}
    For rotating, use base username.
    """
    user = f"{username}-sessid{session_id}" if session_id else username
    return {
        "server": THORDATA_BASE,
        "username": user,
        "password": password,
    }

async def scrape_urls_with_rotation(
    urls: List[str],
    username: str,
    password: str,
    max_concurrent: int = 4,
) -> List[dict]:
    """Each URL gets a fresh proxy context for maximum rotation."""
    semaphore = asyncio.Semaphore(max_concurrent)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def scrape_one(url: str) -> dict:
            async with semaphore:
                proxy = get_proxy_config(username, password)
                context = await browser.new_context(
                    proxy=proxy,
                    viewport={"width": 1366, "height": 768},
                    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
                    locale="en-US",
                )
                page = await context.new_page()
                try:
                    await page.goto(url, wait_until="networkidle", timeout=30000)
                    await page.wait_for_selector("body", timeout=5000)
                    return {"url": url, "html": await page.content(), "success": True}
                except Exception as e:
                    return {"url": url, "html": None, "success": False, "error": str(e)}
                finally:
                    await context.close()

        tasks = [scrape_one(url) for url in urls]
        batch_results = await asyncio.gather(*tasks, return_exceptions=True)
        results = [r for r in batch_results if isinstance(r, dict)]
        await browser.close()
    return results
```
ThorData provides residential proxies that work directly with Playwright's proxy configuration. Their rotating residential pool changes IP on each connection, which maps well to the per-context proxy pattern where each URL gets a fresh browser context.
Anti-Detection Techniques
Playwright runs a real browser, but headless Chrome has some tells that sophisticated bot detection systems look for. Masking them improves success rates on sites running PerimeterX, DataDome, and advanced Cloudflare configurations.
Core automation flags to mask:
```python
async def setup_stealth_context(browser, proxy_config: dict = None):
    """Create a browser context with common automation indicators masked."""
    context_args = {
        "viewport": {"width": 1366, "height": 768},
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "locale": "en-US",
        "timezone_id": "America/New_York",
        "geolocation": {"longitude": -74.0060, "latitude": 40.7128},
        "permissions": ["geolocation"],
        "color_scheme": "light",
        "device_scale_factor": 1,
        "has_touch": False,
        "is_mobile": False,
    }
    if proxy_config:
        context_args["proxy"] = proxy_config
    context = await browser.new_context(**context_args)
    # Inject stealth scripts before any page load
    await context.add_init_script("""
        // Remove webdriver property
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined,
            configurable: true
        });
        // Fake plugin count (headless has 0 plugins)
        Object.defineProperty(navigator, 'plugins', {
            get: () => {
                return {
                    length: 5,
                    0: {name: 'Chrome PDF Plugin'},
                    1: {name: 'Chrome PDF Viewer'},
                    2: {name: 'Native Client'},
                    3: {name: 'Chromium PDF Plugin'},
                    4: {name: 'Widevine Content Decryption Module'},
                };
            }
        });
        // Fake languages
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });
        // Add chrome runtime object (missing in headless)
        if (!window.chrome) {
            window.chrome = {
                runtime: {},
                loadTimes: function() {},
                csi: function() {},
                app: {}
            };
        }
        // Override permissions API
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) => (
            parameters.name === 'notifications' ?
                Promise.resolve({ state: Notification.permission }) :
                originalQuery(parameters)
        );
    """)
    return context
```
Realistic mouse movement simulation:
```python
import asyncio
import random

async def human_mouse_move(page, target_x: int, target_y: int) -> None:
    """Move mouse to target in small jittered steps rather than one straight jump."""
    steps = random.randint(8, 15)
    current_x, current_y = 0, 0
    for i in range(steps):
        progress = (i + 1) / steps
        # Interpolate toward the target with a little random jitter per step
        next_x = int(current_x + (target_x - current_x) * progress + random.uniform(-2, 2))
        next_y = int(current_y + (target_y - current_y) * progress + random.uniform(-2, 2))
        await page.mouse.move(next_x, next_y)
        await asyncio.sleep(random.uniform(0.01, 0.05))
        current_x, current_y = next_x, next_y
    await page.mouse.move(target_x, target_y)

async def human_click(page, selector: str) -> None:
    """Click an element with realistic human-like behavior."""
    element = await page.wait_for_selector(selector)
    box = await element.bounding_box()
    if box:
        # Click slightly off-center
        click_x = box["x"] + box["width"] * random.uniform(0.3, 0.7)
        click_y = box["y"] + box["height"] * random.uniform(0.3, 0.7)
        await human_mouse_move(page, int(click_x), int(click_y))
        await asyncio.sleep(random.uniform(0.05, 0.2))
        await page.mouse.click(int(click_x), int(click_y))
```
Realistic scroll behavior:
```python
async def human_scroll(page, distance: int = 500, speed: str = "normal") -> None:
    """Scroll page with variable speed and slight randomness."""
    if speed == "fast":
        step_size = random.randint(150, 250)
        step_delay = random.uniform(0.05, 0.1)
    elif speed == "slow":
        step_size = random.randint(50, 100)
        step_delay = random.uniform(0.15, 0.3)
    else:  # normal
        step_size = random.randint(80, 150)
        step_delay = random.uniform(0.08, 0.18)
    scrolled = 0
    while scrolled < distance:
        await page.evaluate(f"window.scrollBy(0, {step_size})")
        scrolled += step_size
        await asyncio.sleep(step_delay + random.uniform(-0.02, 0.05))
```
CAPTCHA Handling
CAPTCHAs in Playwright context usually mean the site detected automation signals before JavaScript ran. The approach differs by CAPTCHA type.
Detection:
```python
async def detect_captcha_type(page) -> str:
    """Detect what type of CAPTCHA, if any, is on the current page."""
    content = await page.content()
    url = page.url
    if "cf-challenge" in content or "/cdn-cgi/challenge-platform" in url:
        return "cloudflare_challenge"
    if "g-recaptcha" in content:
        return "recaptcha_v2"
    if "grecaptcha.execute" in content:
        return "recaptcha_v3"
    if "hcaptcha" in content:
        return "hcaptcha"
    if "px-captcha" in content or "perimeterx" in content.lower():
        return "perimeterx"
    if await page.query_selector('iframe[src*="recaptcha"]'):
        return "recaptcha_v2_iframe"
    return "none"
```
Handling Cloudflare challenges: Cloudflare's JS challenge runs browser fingerprint checks. With proper stealth settings and residential proxies, many Cloudflare challenges pass automatically because the fingerprint looks legitimate. When they do not pass:
```python
async def handle_cloudflare(page, max_wait: int = 15000) -> bool:
    """
    Wait for Cloudflare challenge to auto-solve.
    Works for JS challenges (not CAPTCHA challenges) with proper stealth.
    Returns True if challenge passed, False if still blocked.
    """
    try:
        # Wait for the challenge to resolve or timeout
        await page.wait_for_function(
            "() => !document.querySelector('.cf-browser-verification') && document.readyState === 'complete'",
            timeout=max_wait,
        )
        return True
    except Exception:
        # Take screenshot for debugging
        await page.screenshot(path="/tmp/cf_blocked.png", full_page=True)
        return False
```
2captcha/CapSolver integration for reCAPTCHA:
```python
import asyncio

import httpx

async def solve_recaptcha_v2(page, api_key: str) -> bool:
    """
    Use 2captcha API to solve reCAPTCHA v2.
    Requires a paid 2captcha account.
    """
    # Get site key from page
    sitekey = await page.evaluate("""
        () => {
            const el = document.querySelector('[data-sitekey]');
            return el ? el.getAttribute('data-sitekey') : null;
        }
    """)
    if not sitekey:
        return False
    site_url = page.url
    # Submit CAPTCHA to 2captcha (async client so the event loop is not blocked)
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post("https://2captcha.com/in.php", data={
            "key": api_key,
            "method": "userrecaptcha",
            "googlekey": sitekey,
            "pageurl": site_url,
            "json": 1,
        })
        task_id = resp.json().get("request")
        if not task_id:
            return False
        # Poll for solution (typically 15-30 seconds)
        for _ in range(24):
            await asyncio.sleep(5)
            result = await client.get(
                f"https://2captcha.com/res.php?key={api_key}&action=get&id={task_id}&json=1"
            )
            data = result.json()
            if data.get("status") == 1:
                token = data["request"]
                # Inject token into page and fire the site's callback
                await page.evaluate(f"""
                    () => {{
                        document.getElementById('g-recaptcha-response').value = '{token}';
                        if (typeof ___grecaptcha_cfg !== 'undefined') {{
                            Object.entries(___grecaptcha_cfg.clients).forEach(([k, v]) => {{
                                if (v.l && v.l.l) v.l.l.callback('{token}');
                            }});
                        }}
                    }}
                """)
                return True
            elif data.get("request") != "CAPCHA_NOT_READY":
                return False
    return False
```
Rate Limiting and Retry Logic
```python
import asyncio
import logging
import random
from typing import List, Optional

from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout

logger = logging.getLogger(__name__)

class PlaywrightScraper:
    def __init__(
        self,
        proxy_config: dict,
        max_concurrent: int = 4,
        requests_per_minute: int = 20,
        max_retries: int = 3,
    ):
        self.proxy_config = proxy_config
        self.max_concurrent = max_concurrent
        self.min_delay = 60.0 / requests_per_minute
        self.max_retries = max_retries

    async def scrape(self, url: str) -> Optional[str]:
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            try:
                for attempt in range(1, self.max_retries + 1):
                    context = await browser.new_context(
                        proxy=self.proxy_config,
                        viewport={"width": 1366, "height": 768},
                        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
                    )
                    page = await context.new_page()
                    try:
                        await page.goto(url, wait_until="networkidle", timeout=30000)
                        # detect_captcha_type is defined in the CAPTCHA section above
                        captcha = await detect_captcha_type(page)
                        if captcha != "none":
                            logger.warning(f"CAPTCHA detected ({captcha}) on attempt {attempt}")
                            await asyncio.sleep(random.uniform(5, 15))
                            continue
                        return await page.content()
                    except PlaywrightTimeout:
                        logger.warning(f"Timeout on attempt {attempt} for {url}")
                        await asyncio.sleep(random.uniform(3, 8) * attempt)
                    except Exception as e:
                        logger.error(f"Error on attempt {attempt}: {e}")
                        await asyncio.sleep(random.uniform(2, 5))
                    finally:
                        await context.close()
            finally:
                await browser.close()
        return None

    async def scrape_batch(self, urls: List[str]) -> List[dict]:
        semaphore = asyncio.Semaphore(self.max_concurrent)

        async def scrape_with_rate_limit(url: str) -> dict:
            async with semaphore:
                html = await self.scrape(url)
                await asyncio.sleep(
                    self.min_delay + random.uniform(0, self.min_delay * 0.5)
                )
                return {"url": url, "html": html, "success": html is not None}

        tasks = [scrape_with_rate_limit(url) for url in urls]
        batch = await asyncio.gather(*tasks, return_exceptions=True)
        return [r for r in batch if isinstance(r, dict)]
```
Real-World Use Cases
1. React SPA Product Catalog
```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

@dataclass
class SPAProduct:
    name: str
    price: Optional[float]
    sku: str
    description: str
    images: List[str]
    variants: List[dict]
    url: str

async def scrape_react_catalog(
    catalog_url: str,
    proxy_config: dict,
    max_products: int = 100,
) -> List[SPAProduct]:
    products = []
    captured_api_data = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        # setup_stealth_context is defined in the anti-detection section above
        context = await setup_stealth_context(browser, proxy_config)
        page = await context.new_page()

        # Intercept product API calls
        async def capture_product_api(response):
            if "/api/products" in response.url or "/api/catalog" in response.url:
                if response.status == 200:
                    try:
                        data = await response.json()
                        products_data = data.get("products", data.get("items", []))
                        if isinstance(products_data, list):
                            captured_api_data.extend(products_data)
                    except Exception:
                        pass

        page.on("response", capture_product_api)
        # Navigate and wait for content
        await page.goto(catalog_url, wait_until="networkidle", timeout=30000)
        await page.wait_for_selector(".product-card, .product-item, [data-product]", timeout=15000)
        # If we captured API data, use it (much cleaner than HTML parsing)
        if captured_api_data:
            for item in captured_api_data[:max_products]:
                products.append(SPAProduct(
                    name=item.get("name", item.get("title", "")),
                    price=item.get("price"),
                    sku=item.get("sku", item.get("id", "")),
                    description=item.get("description", "")[:300],
                    images=item.get("images", item.get("photos", [])),
                    variants=item.get("variants", []),
                    url=catalog_url,
                ))
        else:
            # Fall back to HTML parsing
            html = await page.content()
            soup = BeautifulSoup(html, "lxml")
            for card in soup.select(".product-card, .product-item")[:max_products]:
                name_tag = card.select_one("h2, h3, .product-name")
                products.append(SPAProduct(
                    name=name_tag.get_text(strip=True) if name_tag else "",
                    price=None,
                    sku="",
                    description="",
                    images=[],
                    variants=[],
                    url=catalog_url,
                ))
        await context.close()
        await browser.close()
    return products
```
2. Authenticated Login Scraper
```python
import asyncio
import os
import random
from typing import List

from playwright.async_api import async_playwright

async def scrape_after_login(
    login_url: str,
    username: str,
    password: str,
    target_urls: List[str],
    proxy_config: dict,
    session_file: str = "/tmp/session_state.json",
) -> List[dict]:
    """
    Login once, save session state, then scrape authenticated pages.
    Reuses saved session to avoid re-login on each run.
    """
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        # Try to restore saved session
        context_args = {
            "viewport": {"width": 1366, "height": 768},
            "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
            "proxy": proxy_config,
        }
        if os.path.exists(session_file):
            context_args["storage_state"] = session_file
        context = await browser.new_context(**context_args)
        page = await context.new_page()
        # Check if session is valid (a live session usually redirects away
        # from the login page)
        await page.goto(login_url)
        if page.url == login_url:
            # Need to login
            await page.fill('[name="username"], [name="email"], #username, #email', username)
            await page.fill('[name="password"], #password', password)
            await asyncio.sleep(random.uniform(0.5, 1.5))
            await page.click('[type="submit"], .login-button, button:has-text("Login")')
            await page.wait_for_load_state("networkidle")
            # Save session for next run
            await context.storage_state(path=session_file)
        # Now scrape authenticated pages
        for url in target_urls:
            try:
                await page.goto(url, wait_until="networkidle", timeout=20000)
                await page.wait_for_selector("main, .content, #main-content", timeout=10000)
                html = await page.content()
                results.append({"url": url, "html": html, "authenticated": True})
                await asyncio.sleep(random.uniform(1, 3))
            except Exception as e:
                results.append({"url": url, "html": None, "error": str(e)})
        await context.close()
        await browser.close()
    return results
```
3. Infinite Scroll Aggregator
```python
import asyncio
import random
from dataclasses import dataclass
from typing import List

from bs4 import BeautifulSoup
from playwright.async_api import async_playwright

@dataclass
class FeedItem:
    title: str
    content: str
    timestamp: str
    author: str
    url: str

async def scrape_infinite_feed(
    feed_url: str,
    proxy_config: dict,
    target_item_count: int = 200,
    max_scrolls: int = 50,
) -> List[FeedItem]:
    items = []
    seen_titles = set()
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy_config,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = await browser.new_context(
            viewport={"width": 1280, "height": 900},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        page = await context.new_page()
        await page.goto(feed_url, wait_until="networkidle", timeout=30000)
        await page.wait_for_selector(".feed-item, .post, article", timeout=15000)
        for _ in range(max_scrolls):
            if len(items) >= target_item_count:
                break
            # Extract currently visible items
            html = await page.content()
            soup = BeautifulSoup(html, "lxml")
            for item in soup.select(".feed-item, .post, article"):
                title_el = item.select_one("h2, h3, .title")
                content_el = item.select_one("p, .content, .excerpt")
                time_el = item.select_one("time, .timestamp, .date")
                author_el = item.select_one(".author, [rel=author]")
                if title_el:
                    title = title_el.get_text(strip=True)
                    if title not in seen_titles:
                        seen_titles.add(title)
                        items.append(FeedItem(
                            title=title,
                            content=content_el.get_text(strip=True)[:300] if content_el else "",
                            timestamp=time_el.get("datetime", time_el.get_text(strip=True)) if time_el else "",
                            author=author_el.get_text(strip=True) if author_el else "",
                            url=feed_url,
                        ))
            # Scroll down
            prev_height = await page.evaluate("document.body.scrollHeight")
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(random.uniform(1.5, 3.0))
            # Check if new content loaded
            new_height = await page.evaluate("document.body.scrollHeight")
            if new_height == prev_height:
                break  # No more content
        await context.close()
        await browser.close()
    return items[:target_item_count]
```
4. Form-Based Data Extraction
import asyncio
import random
from playwright.async_api import async_playwright
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SearchResult:
    query: str
    title: str
    description: str
    url: str
    position: int

async def scrape_search_form(
    search_url: str,
    queries: List[str],
    proxy_config: dict,
    results_per_query: int = 20,
) -> List[SearchResult]:
    all_results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        for query in queries:
            page = await context.new_page()
            try:
                await page.goto(search_url, wait_until="domcontentloaded")
                # Find and fill the search input
                search_input = await page.wait_for_selector(
                    'input[type="search"], input[name="q"], #search-input, .search-box input',
                    timeout=10000,
                )
                await human_mouse_move(page, 0, 0)  # Move mouse first
                await asyncio.sleep(random.uniform(0.3, 0.8))
                await search_input.click()
                await asyncio.sleep(random.uniform(0.2, 0.5))
                # Type like a human (not all at once)
                for char in query:
                    await page.keyboard.type(char)
                    await asyncio.sleep(random.uniform(0.05, 0.2))
                await asyncio.sleep(random.uniform(0.3, 0.8))
                await page.keyboard.press("Enter")
                await page.wait_for_load_state("networkidle", timeout=15000)
                # Extract results
                result_items = await page.locator(".result, .search-result, article").all()
                for pos, item in enumerate(result_items[:results_per_query], 1):
                    try:
                        title = await item.locator("h2, h3, .title").first.inner_text()
                        desc = await item.locator("p, .description, .snippet").first.inner_text()
                        link = await item.locator("a").first.get_attribute("href")
                        all_results.append(SearchResult(
                            query=query,
                            title=title.strip(),
                            description=desc.strip()[:200],
                            url=link or "",
                            position=pos,
                        ))
                    except Exception:
                        pass
            except Exception as e:
                print(f"Error scraping query '{query}': {e}")
            finally:
                await page.close()
            # Human-like delay between queries
            await asyncio.sleep(random.uniform(3, 8))
        await context.close()
        await browser.close()
    return all_results
5. Multi-Page Pagination with Session Tracking
import asyncio
import random
from playwright.async_api import async_playwright
from typing import List

async def scrape_paginated_spa(
    start_url: str,
    proxy_config: dict,
    max_pages: int = 50,
    items_per_page: int = 20,
) -> List[dict]:
    """
    Handle SPA pagination where page state is managed by JavaScript.
    Uses network response interception to capture data directly.
    """
    all_items = []
    page_data_responses = asyncio.Queue()
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        page = await context.new_page()

        # Intercept pagination API calls
        async def capture_page_data(response):
            if ("/api/items" in response.url or "/api/list" in response.url) and response.status == 200:
                try:
                    data = await response.json()
                    await page_data_responses.put(data)
                except Exception:
                    pass

        page.on("response", capture_page_data)
        await page.goto(start_url, wait_until="networkidle")
        for page_num in range(1, max_pages + 1):
            # Drain any captured API responses
            while not page_data_responses.empty():
                data = await page_data_responses.get()
                items = data.get("items", data.get("results", data.get("data", [])))
                all_items.extend(items)
            # Find and click the "Next" button
            next_button = await page.query_selector(
                'button:has-text("Next"), a:has-text("Next"), [aria-label="Next page"], .pagination-next'
            )
            if not next_button:
                break
            # get_attribute returns "" for a bare disabled attribute, so check for None
            is_disabled = await next_button.get_attribute("disabled")
            if is_disabled is not None:
                break
            await next_button.click()
            await page.wait_for_load_state("networkidle", timeout=10000)
            await asyncio.sleep(random.uniform(1, 2))
            if len(all_items) >= max_pages * items_per_page:
                break
        await context.close()
        await browser.close()
    return all_items
6. File Download Automation
import asyncio
import os
import random
from playwright.async_api import async_playwright
from typing import List

async def download_files_from_portal(
    portal_url: str,
    file_selector: str,
    download_dir: str,
    proxy_config: dict,
    max_files: int = 50,
) -> List[str]:
    """
    Download files from a portal that requires JavaScript for the download UI.
    """
    downloaded_files = []
    os.makedirs(download_dir, exist_ok=True)
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            accept_downloads=True,  # Required to intercept downloads
        )
        page = await context.new_page()
        await page.goto(portal_url, wait_until="networkidle")
        await page.wait_for_selector(file_selector, timeout=15000)
        download_links = await page.locator(file_selector).all()
        for i, link in enumerate(download_links[:max_files]):
            try:
                async with page.expect_download(timeout=30000) as download_info:
                    await link.click()
                download = await download_info.value
                save_path = os.path.join(download_dir, download.suggested_filename)
                await download.save_as(save_path)
                downloaded_files.append(save_path)
                await asyncio.sleep(random.uniform(1, 3))
            except Exception as e:
                print(f"Failed to download file {i}: {e}")
        await context.close()
        await browser.close()
    return downloaded_files
7. Real-Time Price and Availability Monitor
import asyncio
import datetime
import random
from playwright.async_api import async_playwright
from dataclasses import dataclass
from typing import List, Optional, Callable

@dataclass
class PriceSnapshot:
    url: str
    price: Optional[float]
    currency: str
    in_stock: bool
    title: str
    scraped_at: str
    proxy_used: str

async def monitor_prices_realtime(
    product_urls: List[str],
    proxy_config: dict,
    on_price_change: Optional[Callable] = None,
    check_interval_minutes: int = 30,
    max_checks: int = 48,
) -> List[List[PriceSnapshot]]:
    """
    Monitor prices over time with Playwright.
    Calls on_price_change callback when a price differs from previous check.
    """
    history = {url: [] for url in product_urls}
    proxy_server = proxy_config.get("server", "")
    for check_num in range(max_checks):
        print(f"Check {check_num + 1}/{max_checks} at {datetime.datetime.utcnow().isoformat()}")
        snapshots = []
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True, proxy=proxy_config)
            for url in product_urls:
                context = await browser.new_context(
                    viewport={"width": 1366, "height": 768},
                    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
                    proxy=proxy_config,
                )
                page = await context.new_page()
                try:
                    await page.goto(url, wait_until="networkidle", timeout=30000)
                    await page.wait_for_selector("[itemprop='price'], .price, #price", timeout=10000)
                    price_text = await page.locator("[itemprop='price'], .price, #price").first.inner_text()
                    title_text = await page.locator("h1").first.inner_text()
                    stock_el = await page.query_selector(".in-stock, [itemprop='availability']")
                    price_clean = "".join(c for c in price_text if c.isdigit() or c == ".")
                    try:
                        price = float(price_clean) if price_clean else None
                    except ValueError:
                        price = None
                    snapshot = PriceSnapshot(
                        url=url,
                        price=price,
                        currency="USD",
                        in_stock=bool(stock_el),
                        title=title_text.strip() if title_text else "",
                        scraped_at=datetime.datetime.utcnow().isoformat(),
                        proxy_used=proxy_server,
                    )
                    # Check for price change
                    if history[url] and on_price_change:
                        prev = history[url][-1]
                        if prev.price != snapshot.price:
                            on_price_change(url, prev.price, snapshot.price, snapshot)
                    history[url].append(snapshot)
                    snapshots.append(snapshot)
                except Exception as e:
                    print(f"Error checking {url}: {e}")
                finally:
                    await context.close()
                await asyncio.sleep(random.uniform(2, 5))
            await browser.close()
        if check_num < max_checks - 1:
            await asyncio.sleep(check_interval_minutes * 60)
    return list(history.values())
When You Do Not Need Playwright
Not every JavaScript-heavy site requires Playwright. Before building a full browser scraper:
Check for an internal API. Open DevTools, go to Network, filter by Fetch/XHR, and reload the page. If you see JSON responses containing the data you want, you can call those endpoints directly with httpx. This is faster, more reliable, and dramatically cheaper in terms of resources.
Check if data is in the initial HTML. Some sites that look like SPAs actually server-side render their initial content. View Page Source (not Inspect Element). If the data is there, BeautifulSoup works fine.
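That manual View Source check can also be automated with a cheap heuristic on the raw server response. A sketch, assuming the usual SPA mount-point ids (#root, #app) as the signal — adjust the threshold and selectors for the site at hand:

```python
from bs4 import BeautifulSoup

def looks_like_spa_shell(html: str) -> bool:
    """Treat the raw server HTML (what View Source shows) as a shell if the
    body has almost no visible text and contains a typical SPA mount point."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body
    if body is None:
        return True
    visible_text = body.get_text(strip=True)
    has_mount_point = soup.select_one("#root, #app, [data-reactroot]") is not None
    return len(visible_text) < 50 and has_mount_point

# A React-style shell vs. server-rendered content:
shell = '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>'
rendered = "<html><body><h1>Blue Widget</h1><p>" + "In stock, ships tomorrow. " * 5 + "</p></body></html>"
```

If the heuristic returns False, the data is already in the HTML and BeautifulSoup alone will do.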
Check for a public API. Many major platforms have official APIs. LinkedIn, Twitter, Google, and Amazon all have APIs with rate limits that are more sustainable than scraping.
The rule: use Playwright only when you genuinely need JavaScript rendering and there is no simpler alternative. For static HTML, requests or httpx with BeautifulSoup runs 10-50x faster and costs a fraction of the compute resources.
Wrapping Up
Playwright has become the go-to tool for scraping JS-heavy sites in 2026 because it solves the real problems: auto-wait eliminates the time.sleep debugging spiral, cross-browser support means one codebase drives Chromium, Firefox, and WebKit, and the async API enables genuine concurrency. Pair it with residential proxies from ThorData for the IP layer, add the stealth scripts to mask automation flags, use resource blocking to cut load times, and you can scrape virtually anything that renders in a browser.
The tradeoff is always speed and memory. Playwright is 10-50x slower than raw HTTP scraping and uses 10x more memory. Use it where it is genuinely required, and fall back to Scrapy or httpx everywhere else.