Scrapy vs BeautifulSoup vs Playwright: Which to Use for Web Scraping in 2026?
Choosing the wrong scraping tool does not just mean writing bad code - it means spending hours fighting a framework that was never designed for your use case. Playwright crawling a million-page catalog burns memory and slows to a crawl. Scrapy fighting a JavaScript-heavy SPA fails silently with empty fields. BeautifulSoup handling a 50,000-page crawl turns into spaghetti that cannot be maintained.
The Python scraping ecosystem has stabilized around three tools that cover nearly every real-world scenario: BeautifulSoup for lightweight parsing, Scrapy for large-scale crawls, and Playwright for JavaScript-rendered content. Understanding where each excels - and more importantly where each fails - saves you from rewrites.
This guide gives you the full picture. We cover architecture differences, real-world performance numbers, proxy integration patterns, anti-detection techniques for each tool, complete production code examples, and a decision framework for choosing based on your actual requirements. If you have spent time on the fence between these tools, this will settle it.
The Quick Answer
Stop reading comparison articles that end with "it depends." Here is the actual answer:
| Scenario | Use This |
|---|---|
| Static HTML, quick scripts | BeautifulSoup + httpx |
| Multi-page crawls, data pipelines | Scrapy |
| JS-rendered SPAs, login flows | Playwright |
| High-volume crawl with some JS pages | Scrapy + scrapy-playwright |
That is the 80% answer. The remaining 20% is what the rest of this article covers.
BeautifulSoup: The Parser That Refuses to Die
BeautifulSoup is not a scraping framework. It is an HTML parser. That distinction matters more than most tutorials acknowledge.
You give it HTML, it gives you a clean API to extract data from it. That is the entire job. It does not fetch pages, manage sessions, handle retries, enforce rate limits, or deal with JavaScript. You pair it with httpx or requests for fetching, and BeautifulSoup handles the parsing.
Architecture: BeautifulSoup parses a string of HTML into a tree structure using an underlying parser (lxml, html.parser, or html5lib). Once parsed, you navigate the tree with CSS selectors, tag names, or attribute searches. The entire library exists purely in memory during parsing - there is no state, no session, no pipeline.
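That parse-then-navigate flow in miniature. The HTML snippet and selectors below are invented for illustration, and the stdlib `html.parser` backend is used so the sketch has no lxml dependency:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="title">Mechanical Keyboard</h2>
  <span class="price" itemprop="price">89.99</span>
</div>
"""

# Parse once into an in-memory tree...
soup = BeautifulSoup(html, "html.parser")

# ...then query the same tree in different ways
title = soup.select_one("h2.title").get_text(strip=True)           # CSS selector
price = soup.find("span", attrs={"itemprop": "price"}).get_text()  # attribute search
print(title, price)  # Mechanical Keyboard 89.99
```

Everything happens on the string you pass in; swapping `"html.parser"` for `"lxml"` changes only parsing speed and tolerance for malformed markup, not the API.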
Where it shines:
- Parsing HTML you already have (API responses, saved files, email templates)
- One-off scripts where you need data from 5-20 pages
- Prototyping - get something working in 15 minutes before deciding if it needs a real framework
- Teaching - the API maps directly to how HTML structure works
- Extracting data from HTML within larger applications (ETL pipelines, email processors)
- Complex HTML parsing within Scrapy spiders where CSS selectors fall short
Where it falls apart:
- Crawls beyond roughly 100 pages (no built-in concurrency, no crawl management, no scheduler)
- JavaScript-rendered content (it only sees the raw HTML the server returns)
- Production pipelines (no retry logic, no rate limiting, no structured export formats)
- Duplicate URL detection and crawl frontier management
- Anything requiring stateful browser interaction
Performance reality: BeautifulSoup with lxml parses roughly 1-3 MB/s of HTML. For most pages that is fast enough. The bottleneck is almost always the network request, not the parsing. Where BeautifulSoup loses at scale is that you have to write all the infrastructure yourself: retry logic, rate limiting, concurrency, deduplication. By the time you have built all that, you have reinvented a subset of Scrapy - worse.
Complete example: product catalog parser
```python
import httpx
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List, Optional
import time
import random


@dataclass
class Product:
    name: str
    price: Optional[float]
    sku: Optional[str]
    description: str
    image_url: Optional[str]
    in_stock: bool
    url: str


def parse_product_page(html: str, url: str) -> Product:
    soup = BeautifulSoup(html, "lxml")
    name_tag = soup.select_one("h1.product-title, h1[itemprop=name], #productTitle")
    price_tag = soup.select_one('[itemprop="price"], .price, .product-price, #priceblock_ourprice')
    sku_tag = soup.select_one('[itemprop="sku"], .sku, #productSKU')
    desc_tag = soup.select_one('[itemprop="description"], .product-description, #productDescription')
    img_tag = soup.select_one('[itemprop="image"], .product-image img, #landingImage')
    stock_tag = soup.select_one('[itemprop="availability"], .availability, #availability')

    price_text = price_tag.get_text(strip=True) if price_tag else ""
    price_clean = "".join(c for c in price_text if c.isdigit() or c == ".")
    try:
        price = float(price_clean) if price_clean else None
    except ValueError:
        price = None

    in_stock = True
    if stock_tag:
        stock_text = stock_tag.get_text(strip=True).lower()
        in_stock = "out of stock" not in stock_text and "unavailable" not in stock_text

    return Product(
        name=name_tag.get_text(strip=True) if name_tag else "",
        price=price,
        sku=sku_tag.get_text(strip=True) if sku_tag else None,
        description=desc_tag.get_text(strip=True)[:500] if desc_tag else "",
        # Parenthesized so the `or` fallback only runs when img_tag exists
        image_url=(img_tag.get("src") or img_tag.get("data-src")) if img_tag else None,
        in_stock=in_stock,
        url=url,
    )


def scrape_catalog(
    urls: List[str],
    proxy: str,
    rate_limit_rpm: int = 30,
) -> List[Product]:
    products = []
    delay = 60.0 / rate_limit_rpm
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
    with httpx.Client(proxy=proxy, timeout=15.0, headers=headers) as client:
        for url in urls:
            try:
                resp = client.get(url)
                if resp.status_code == 200:
                    products.append(parse_product_page(resp.text, url))
                elif resp.status_code == 429:
                    time.sleep(30)  # back off before a single retry
                    resp = client.get(url)
                    if resp.status_code == 200:
                        products.append(parse_product_page(resp.text, url))
            except (httpx.TimeoutException, httpx.ProxyError):
                pass  # skip this URL; a production version should log the failure
            time.sleep(delay + random.uniform(0, 0.5))
    return products
```
Scrapy: The Industrial Scraper
Scrapy is a framework, not a library. That is both its strength and its barrier to entry.
Out of the box you get: async request handling via Twisted, automatic rate limiting, retry middleware, cookie management, multiple export formats (JSON, CSV, XML, JSONL), item pipelines for cleaning data, a crawl scheduler with disk-based persistence, and a middleware stack for customizing every request and response. You write spiders, Scrapy handles everything else.
Architecture: Scrapy runs on Twisted, Python's mature async networking framework. It maintains a request queue, a downloader that handles concurrent HTTP connections, and a spider engine that processes responses and generates new requests. The middleware system lets you intercept requests and responses at multiple points - proxy rotation, header modification, retry logic, and custom downloaders all live here.
Where it shines:
- Crawling thousands or millions of pages efficiently
- Data pipelines that need cleaning, deduplication, and structured storage
- Respectful scraping with AUTOTHROTTLE, DOWNLOAD_DELAY, CONCURRENT_REQUESTS
- Long-running jobs that need to pause and resume via checkpoint persistence
- Complex crawl graphs with multiple spider types
- Integration with data infrastructure (Kafka, S3, PostgreSQL via item pipelines)
- Distributed crawling via Scrapy-Redis or Scrapyd
Where it falls apart:
- Scraping 3-5 pages from one site (significant setup overhead)
- JavaScript-heavy sites by default (Scrapy sees only the initial HTML)
- Quick prototyping without a project structure
- Scenarios where you need fine-grained browser control (mouse events, file downloads)
Performance reality: Scrapy with default settings runs around 16 concurrent requests and handles 300-500 pages/minute on a standard VPS. With CONCURRENT_REQUESTS=64 and AUTOTHROTTLE disabled, you can push past 1000 pages/minute, though most sites will block you before you get there. The async Twisted engine gives Scrapy a real advantage at scale - it does not block on network I/O the way a synchronous httpx loop does.
Spider example with proxy rotation middleware:
```python
import scrapy
from scrapy.http import Request
from dataclasses import dataclass
import json


@dataclass
class ProductItem:
    name: str
    price: str
    url: str
    category: str
    rating: str
    review_count: str


class ProductSpider(scrapy.Spider):
    name = "products"
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 1.5,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 1,
        "AUTOTHROTTLE_MAX_DELAY": 10,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 8,
        "RETRY_TIMES": 3,
        "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
        "COOKIES_ENABLED": True,
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middlewares.ProxyMiddleware": 350,
            "myproject.middlewares.UserAgentMiddleware": 400,
        },
        "FEEDS": {
            "products.jsonl": {"format": "jsonlines", "overwrite": True},
        },
    }

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url or "https://example.com/products"]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url,
                callback=self.parse_listing,
                headers=self._get_headers(),
                meta={"dont_redirect": False},
            )

    def _get_headers(self) -> dict:
        return {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "sec-ch-ua": '"Chromium";v="131", "Not_A Brand";v="24", "Google Chrome";v="131"',
            "sec-ch-ua-mobile": "?0",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
        }

    def parse_listing(self, response):
        # Extract product links from the listing page
        product_links = response.css("a.product-card::attr(href), a.item-link::attr(href)").getall()
        for link in product_links:
            yield response.follow(link, callback=self.parse_product, headers=self._get_headers())
        # Pagination
        next_page = response.css("a.next-page::attr(href), link[rel=next]::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_listing, headers=self._get_headers())

    def parse_product(self, response):
        # Try JSON-LD structured data first
        json_ld = response.css('script[type="application/ld+json"]::text').get()
        if json_ld:
            try:
                data = json.loads(json_ld)
                if isinstance(data, list):
                    data = data[0]
                if data.get("@type") == "Product":
                    offer = data.get("offers", {})
                    if isinstance(offer, list):
                        offer = offer[0]
                    yield {
                        "name": data.get("name", ""),
                        "price": offer.get("price", ""),
                        "currency": offer.get("priceCurrency", "USD"),
                        "url": response.url,
                        "sku": data.get("sku", ""),
                        "description": (data.get("description") or "")[:300],
                        "in_stock": offer.get("availability", "").endswith("InStock"),
                    }
                    return
            except (json.JSONDecodeError, KeyError):
                pass
        # Fall back to CSS selectors
        yield {
            "name": response.css("h1::text, h1.title::text").get(default="").strip(),
            "price": response.css('[itemprop="price"]::attr(content), .price::text').get(default="").strip(),
            "url": response.url,
            "sku": response.css('[itemprop="sku"]::text').get(default="").strip(),
            "description": response.css('[itemprop="description"]::text').get(default="").strip()[:300],
            "in_stock": bool(response.css('.in-stock, [itemprop="availability"]')),
        }
```
Custom proxy rotation middleware:
```python
# myproject/middlewares.py
import random

from scrapy.exceptions import NotConfigured


class ProxyMiddleware:
    """Rotate through a proxy pool on every request."""

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist("PROXY_LIST", [])
        if not proxy_list:
            raise NotConfigured("PROXY_LIST setting is required")
        return cls(proxy_list)

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxy_list)


class UserAgentMiddleware:
    """Rotate user agents to avoid fingerprinting."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)


class RetryWithNewProxyMiddleware:
    """On 403/429/503, retry with a different proxy."""

    RETRY_CODES = {403, 429, 503}

    def __init__(self, proxy_list, max_retries=3):
        self.proxy_list = proxy_list
        self.max_retries = max_retries

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist("PROXY_LIST", [])
        if not proxy_list:
            raise NotConfigured("PROXY_LIST setting is required")
        max_retries = crawler.settings.getint("RETRY_TIMES", 3)
        return cls(proxy_list, max_retries)

    def process_response(self, request, response, spider):
        if response.status in self.RETRY_CODES:
            retries = request.meta.get("retry_count", 0)
            if retries < self.max_retries:
                new_request = request.copy()
                new_request.meta["proxy"] = random.choice(self.proxy_list)
                new_request.meta["retry_count"] = retries + 1
                new_request.dont_filter = True  # bypass the duplicate filter for the retry
                return new_request
        return response
```
Scrapy settings for ThorData proxy:
```python
# settings.py
PROXY_LIST = ["http://username:[email protected]:7000"]
# For sticky sessions: "http://username-session-{session_id}:[email protected]:7000"

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 350,
    "myproject.middlewares.UserAgentMiddleware": 400,
    "myproject.middlewares.RetryWithNewProxyMiddleware": 550,
}

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
```
Playwright: The Nuclear Option
Playwright launches a real browser. That means it executes JavaScript, renders the page, handles cookies, follows redirects, fires DOM events, and behaves exactly like an actual user. It is the only option when the data you need is constructed client-side.
Architecture: Playwright communicates with browser processes (Chromium, Firefox, or WebKit) over the DevTools Protocol. Each browser instance runs as a separate OS process. Browser contexts within an instance share the browser binary but have isolated storage, cookies, and cache. Pages within a context share the context.
This architecture has real implications: browser startup takes 1-3 seconds, each context consumes 50-150MB RAM, and the DevTools Protocol communication adds latency to every operation.
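Those numbers translate directly into capacity planning. A back-of-envelope sketch, where the function name, the 100MB per-context midpoint, and the 400MB browser-overhead figure are illustrative assumptions (in practice CPU, not memory, often binds first):

```python
def max_playwright_contexts(
    ram_mb: int,
    per_context_mb: int = 100,       # midpoint of the 50-150MB range
    browser_overhead_mb: int = 400,  # assumed cost of the browser process itself
    headroom: float = 0.25,          # reserve for the OS and the scraper process
) -> int:
    """Estimate how many concurrent browser contexts fit in a RAM budget."""
    usable = ram_mb * (1 - headroom) - browser_overhead_mb
    return max(0, int(usable // per_context_mb))

print(max_playwright_contexts(8192))  # a typical 8GB VPS -> 57
```

Memory alone would allow dozens of contexts on 8GB, which is why real deployments (like the 4-context benchmark later in this article) usually hit CPU and proxy-bandwidth limits first.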
Where it shines:
- Single-page applications built with React, Vue, Angular, or Svelte
- Sites behind login walls with complex authentication flows (MFA, OAuth, session tokens)
- Pages that load data via XHR or fetch requests after the initial render
- Infinite scroll pages where content loads on scroll events
- When you need to interact with the page (fill forms, click buttons, handle file uploads)
- Scraping sites that fingerprint clients in JavaScript
- When you need screenshots or PDFs alongside data
Where it falls apart:
- Speed: a real browser is 10-50x slower than raw HTTP requests
- Memory: each browser context consumes 50-150MB RAM
- Scale: running 50+ concurrent browser instances requires serious hardware or cloud resources
- Reliability: network timeouts, flaky selectors, race conditions between JS execution and your waits
- Cost: proportionally more expensive to run at scale
Performance reality with benchmarks: Scraping 10,000 product pages on a standard VPS (4 vCPUs, 8GB RAM):
| Tool | Time | Memory Peak | Pages/min |
|---|---|---|---|
| Scrapy (16 concurrent) | ~8 min | ~120MB | ~1250 |
| httpx + BS4 (async, 12 concurrent) | ~14 min | ~80MB | ~710 |
| Playwright (4 contexts) | ~90 min | ~800MB | ~110 |
Playwright is not a substitute for HTTP-based scraping at scale. It is a specialized tool for content that is genuinely impossible to get any other way.
Full async Playwright scraper with proxy and anti-detection:
```python
import asyncio
from playwright.async_api import async_playwright, Page, BrowserContext
from typing import List, Optional

PROXY_CONFIG = {
    "server": "http://gate.thordata.com:7000",
    "username": "your_username",
    "password": "your_password",
}


async def block_unnecessary_resources(page: Page) -> None:
    """Block images, stylesheets, fonts, and media to speed up scraping."""
    async def handler(route):
        if route.request.resource_type in ("image", "stylesheet", "font", "media"):
            await route.abort()
        else:
            await route.continue_()
    await page.route("**/*", handler)


async def add_stealth_scripts(context: BrowserContext) -> None:
    """Register scripts on the context so they run before every page load."""
    await context.add_init_script("""
        // Remove webdriver flag
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        // Fake plugins array (empty in headless)
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5]});
        // Fake language
        Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
        // Remove automation-specific chrome object properties
        if (window.chrome) {
            window.chrome.runtime = {};
        }
    """)


async def scrape_spa_page(
    url: str,
    context: BrowserContext,
    wait_selector: str = ".content, main, #app",
    timeout: int = 15000,
) -> Optional[str]:
    page = await context.new_page()
    try:
        await block_unnecessary_resources(page)
        await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
        await page.wait_for_selector(wait_selector, timeout=timeout)
        # Give JS time to populate dynamic content
        await page.wait_for_load_state("networkidle", timeout=5000)
        return await page.content()
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None
    finally:
        await page.close()


async def scrape_with_playwright(
    urls: List[str],
    max_contexts: int = 4,
    proxy_config: Optional[dict] = None,
) -> List[dict]:
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy_config or PROXY_CONFIG,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--no-sandbox",
                "--disable-setuid-sandbox",
            ],
        )
        # Limit concurrent contexts with a semaphore
        semaphore = asyncio.Semaphore(max_contexts)

        async def scrape_with_semaphore(url: str) -> Optional[dict]:
            async with semaphore:
                context = await browser.new_context(
                    viewport={"width": 1366, "height": 768},
                    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
                    locale="en-US",
                    timezone_id="America/New_York",
                )
                # Stealth scripts are registered per context, so every page
                # created inside scrape_spa_page inherits them
                await add_stealth_scripts(context)
                try:
                    html = await scrape_spa_page(url, context)
                    if html:
                        return {"url": url, "html": html, "success": True}
                    return {"url": url, "html": None, "success": False}
                finally:
                    await context.close()

        tasks = [scrape_with_semaphore(url) for url in urls]
        batch_results = await asyncio.gather(*tasks, return_exceptions=True)
        for r in batch_results:
            if isinstance(r, dict):
                results.append(r)
        await browser.close()
    return results
```
Handling infinite scroll:
```python
import random

from playwright.async_api import async_playwright


async def scrape_infinite_scroll(url: str, proxy_config: dict, max_scrolls: int = 20) -> list:
    items = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        context = await browser.new_context(
            viewport={"width": 1280, "height": 900},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")

        prev_count = 0
        for scroll_num in range(max_scrolls):
            # Scroll to bottom
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(1500 + random.randint(0, 1000))
            # Count visible items
            current_items = await page.locator(".product-card, .item, .result").all()
            current_count = len(current_items)
            if current_count == prev_count:
                break  # no new items loaded; we are at the bottom
            prev_count = current_count

        # Extract data from all loaded items
        all_items = await page.locator(".product-card, .item, .result").all()
        for item in all_items:
            try:
                name = await item.locator("h2, h3, .title").first.inner_text()
                price = await item.locator(".price, [class*=price]").first.inner_text()
                items.append({"name": name.strip(), "price": price.strip()})
            except Exception:
                pass  # skip items missing a name or price
        await browser.close()
    return items
```
Intercepting XHR/fetch responses (often better than HTML parsing):
```python
from playwright.async_api import async_playwright
from typing import List


async def intercept_api_responses(
    url: str,
    api_pattern: str,
    proxy_config: dict,
) -> List[dict]:
    """
    Intercept background API calls instead of parsing HTML.
    This is cleaner and more reliable when the site fetches data via JSON APIs.
    """
    captured_data = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        context = await browser.new_context()
        page = await context.new_page()

        async def capture_response(response):
            if api_pattern in response.url and response.status == 200:
                try:
                    content_type = response.headers.get("content-type", "")
                    if "json" in content_type:
                        data = await response.json()
                        if isinstance(data, list):
                            captured_data.extend(data)
                        elif isinstance(data, dict):
                            items = data.get("items", data.get("results", data.get("data", [])))
                            captured_data.extend(items)
                except Exception:
                    pass  # non-JSON body or already-closed response; ignore

        page.on("response", capture_response)
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await page.wait_for_timeout(3000)
        await context.close()
        await browser.close()
    return captured_data
```
Real-World Performance Comparison
These numbers come from scraping 10,000 product pages across a variety of sites, measured on a 4-vCPU VPS with 8GB RAM and residential proxy bandwidth:
| Metric | Scrapy | httpx + BS4 | Playwright |
|---|---|---|---|
| Pages/minute | 1200+ | 700 | 110 |
| Memory (steady-state) | ~120MB | ~80MB | ~800MB |
| Setup time (new project) | 15-20 min | 5 min | 10 min |
| JS-rendered sites | No (without plugin) | No | Yes |
| Built-in rate limiting | Yes (AUTOTHROTTLE) | No | No |
| Proxy rotation | Via middleware | Manual | Via browser config |
| Retry logic | Built-in | Manual | Manual |
| Data export | Built-in (JSON, CSV, XML) | Manual | Manual |
| Distributed crawling | Via Scrapy-Redis | Manual | Manual |
| Anti-bot resistance (browser fingerprint) | Low (HTTP only) | Low (HTTP only) | High |
Combining Tools: The Real Power Move
The most capable scrapers in production do not pick one tool - they use the right tool for each part of the job.
Scrapy + scrapy-playwright: The scrapy-playwright plugin lets you mark specific Scrapy requests to use a real browser while routing everything else through Scrapy's fast HTTP engine. The result: Scrapy handles discovery, crawl management, rate limiting, and data export while Playwright renders only the pages that actually need it.
```python
import scrapy
from scrapy_playwright.page import PageMethod


class HybridSpider(scrapy.Spider):
    name = "hybrid"
    start_urls = ["https://example.com/products"]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # scrapy-playwright requires the asyncio Twisted reactor
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
    }

    def parse(self, response):
        # Extract product URLs from the static listing page (fast HTTP)
        for url in response.css("a.product-link::attr(href)").getall():
            # Determine if the product page needs JS rendering
            if self.needs_javascript(url):
                yield scrapy.Request(
                    response.urljoin(url),
                    callback=self.parse_js_product,
                    meta={
                        "playwright": True,
                        "playwright_page_methods": [
                            PageMethod("wait_for_selector", ".product-data", timeout=10000),
                        ],
                    },
                )
            else:
                yield response.follow(url, callback=self.parse_static_product)

    def needs_javascript(self, url: str) -> bool:
        # Logic to determine if this URL requires JS rendering
        js_domains = ["spa-site.com", "dynamic-store.com"]
        return any(d in url for d in js_domains)

    def parse_js_product(self, response):
        """Handle a Playwright-rendered response."""
        yield {
            "name": response.css("h1::text").get(default="").strip(),
            "price": response.css(".price::text").get(default="").strip(),
            "url": response.url,
            "rendered": True,
        }

    def parse_static_product(self, response):
        """Handle a standard HTTP response."""
        yield {
            "name": response.css("h1::text").get(default="").strip(),
            "price": response.css(".price::text").get(default="").strip(),
            "url": response.url,
            "rendered": False,
        }
```
BeautifulSoup inside Scrapy: Some developers prefer BeautifulSoup's API for complex HTML parsing tasks within Scrapy spider callbacks - particularly for nested structures or irregular HTML that Scrapy's CSS/XPath selectors struggle with.
```python
import scrapy
from bs4 import BeautifulSoup


class ProductSpiderWithBS4(scrapy.Spider):
    name = "products_bs4"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Use BS4 for complex parsing within Scrapy
        soup = BeautifulSoup(response.text, "lxml")
        # Complex nested structure that is messy with CSS selectors
        for product_div in soup.select("div.product-complex > article > section.details"):
            nested_data = self._extract_nested(product_div)
            if nested_data:
                yield nested_data

    def _extract_nested(self, tag) -> dict:
        try:
            specs = {}
            for row in tag.select("table.specs tr"):
                cells = row.find_all("td")
                if len(cells) == 2:
                    specs[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
            return {
                "name": tag.select_one("h2").get_text(strip=True),
                "specs": specs,
            }
        except Exception:
            return {}  # e.g. select_one("h2") found nothing
```
Proxy Integration Patterns
Each tool has a slightly different proxy configuration interface.
ThorData with all three tools:
```python
# httpx + BeautifulSoup
import httpx

THORDATA = "http://username:[email protected]:7000"
with httpx.Client(proxy=THORDATA, timeout=20.0) as client:
    resp = client.get("https://target-site.com/page")

# Scrapy - in settings.py
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.ProxyMiddleware": 350}
# PROXY_LIST = ["http://username:[email protected]:7000"]

# Playwright
from playwright.async_api import async_playwright

PLAYWRIGHT_PROXY = {
    "server": "http://gate.thordata.com:7000",
    "username": "username",
    "password": "password",
}

async def run():
    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy=PLAYWRIGHT_PROXY)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://target-site.com")
        await browser.close()
```
ThorData handles proxy rotation at the gateway level for all three tools. For Scrapy and httpx, you point at the gateway URL and rotation happens server-side. For Playwright, proxy authentication goes into the browser config and rotates per browser context when you create fresh contexts for each target URL.
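When you do want a sticky session, the usual pattern is to encode a session ID into the proxy username so each logical session keeps one exit IP. A sketch, assuming the `username-session-{session_id}` format shown in the Scrapy settings comment earlier (confirm the exact format against your provider's docs):

```python
import random
import string


def sticky_proxy_url(username: str, password: str,
                     gateway: str = "gate.thordata.com:7000") -> str:
    """Build a proxy URL with a random session ID baked into the username."""
    session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return f"http://{username}-session-{session_id}:{password}@{gateway}"


# Each call yields a new session; reuse the same URL to keep the same exit IP
url = sticky_proxy_url("user", "pass")
```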
Anti-Detection by Tool
Each tool has different anti-detection challenges and solutions.
BeautifulSoup + httpx:
- Main risk: TLS fingerprint and missing browser headers
- Solution: curl_cffi for TLS impersonation plus a realistic header set
- Rate limiting: implement manually with time.sleep plus gaussian jitter
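The manual rate limiting mentioned above fits in a small helper. A sketch with gaussian jitter; the `polite_delay` name and its default parameters are illustrative, not from any library:

```python
import random


def polite_delay(base: float = 2.0, sigma: float = 0.4, floor: float = 0.5) -> float:
    """Base delay plus gaussian jitter, clamped to a minimum, so request
    timing does not form a detectable fixed-interval pattern."""
    return max(floor, base + random.gauss(0, sigma))


# Between requests:
#   time.sleep(polite_delay())
```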
Scrapy:
- Main risk: a consistent User-Agent, missing Sec-Fetch headers, no cookie reuse between requests
- Solution: UserAgentMiddleware plus a full browser header set in the DEFAULT_REQUEST_HEADERS setting
- Downside: no JavaScript execution, so JS-based fingerprinting always reveals Scrapy
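The settings side of that looks like the fragment below, assuming a standard Scrapy `settings.py` (the header values mirror the spider example earlier; per-request User-Agent rotation still comes from the middleware):

```python
# settings.py - send a consistent, browser-like header set with every request
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Upgrade-Insecure-Requests": "1",
}
```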
Playwright:
- Main risk: the navigator.webdriver flag, an empty plugins array, a missing chrome object
- Solution: add_init_script to mask automation flags, or the playwright-stealth library
- Strong point: an actual browser fingerprint, real JS execution, real TLS from Chromium
```python
# Manual stealth patches (the playwright-stealth package automates similar tweaks)
from playwright.async_api import async_playwright


async def scrape_with_stealth(url: str, proxy: dict) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
            viewport={"width": 1366, "height": 768},
            locale="en-US",
            timezone_id="America/New_York",
            geolocation={"longitude": -74.0060, "latitude": 40.7128},
            permissions=["geolocation"],
        )
        page = await context.new_page()
        # Mask automation indicators before any page script runs
        await page.add_init_script("""
            delete Object.getPrototypeOf(navigator).webdriver;
            Object.defineProperty(navigator, 'platform', {get: () => 'Win32'});
            Object.defineProperty(screen, 'width', {get: () => 1366});
            Object.defineProperty(screen, 'height', {get: () => 768});
        """)
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await context.close()
        await browser.close()
        return content
```
The Bottom Line
Default choice for most developers in 2026: Scrapy for anything beyond a one-off script. It handles 80% of scraping tasks out of the box with sane defaults for rate limiting, retries, and data export. The middleware system makes proxy integration and anti-detection straightforward. Add scrapy-playwright when you hit JS-rendered sites.
Use BeautifulSoup when you need to parse HTML you already have, prototype quickly, or write simple scripts that would be overkill to build in Scrapy.
Use Playwright when the data is genuinely not available without JavaScript execution - SPAs, sites with complex login flows, and targets that fingerprint clients in JavaScript. Accept that you are trading speed for capability.
Never use Playwright as a default just because it is powerful. The resource cost is real, and for static HTML sites it gives you no advantage while making your scraper 10-50x slower.