Scrapy vs BeautifulSoup vs Playwright: Which to Use for Web Scraping in 2026?
Choosing the wrong scraping tool does not just mean writing bad code - it means spending hours fighting a framework that was never designed for your use case. Playwright crawling a million-page catalog burns memory and slows to a crawl. Scrapy fighting a JavaScript-heavy SPA fails silently with empty fields. BeautifulSoup handling a 50,000-page crawl turns into spaghetti that cannot be maintained.
The Python scraping ecosystem has stabilized around three tools that cover nearly every real-world scenario: BeautifulSoup for lightweight parsing, Scrapy for large-scale crawls, and Playwright for JavaScript-rendered content. Understanding where each excels - and more importantly where each fails - saves you from rewrites.
This guide gives you the full picture. We cover architecture differences, real-world performance numbers, proxy integration patterns, anti-detection techniques for each tool, complete production code examples, and a decision framework for choosing based on your actual requirements. If you have spent time on the fence between these tools, this will settle it.
The Quick Answer
Stop reading comparison articles that end with "it depends." Here is the actual answer:
| Scenario | Use This |
|---|---|
| Static HTML, quick scripts | BeautifulSoup + httpx |
| Multi-page crawls, data pipelines | Scrapy |
| JS-rendered SPAs, login flows | Playwright |
| High-volume crawl with some JS pages | Scrapy + scrapy-playwright |
That is the 80% answer. The remaining 20% is what the rest of this article covers.
BeautifulSoup: The Parser That Refuses to Die
BeautifulSoup is not a scraping framework. It is an HTML parser. That distinction matters more than most tutorials acknowledge.
You give it HTML, it gives you a clean API to extract data from it. That is the entire job. It does not fetch pages, manage sessions, handle retries, enforce rate limits, or deal with JavaScript. You pair it with httpx or requests for fetching, and BeautifulSoup handles the parsing.
Architecture: BeautifulSoup parses a string of HTML into a tree structure using an underlying parser (lxml, html.parser, or html5lib). Once parsed, you navigate the tree with CSS selectors, tag names, or attribute searches. The entire library exists purely in memory during parsing - there is no state, no session, no pipeline.
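That parse-then-navigate flow in miniature. The HTML snippet and selectors below are invented for illustration, and the stdlib `html.parser` backend is used so the sketch has no lxml dependency:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="title">Mechanical Keyboard</h2>
  <span class="price" itemprop="price">89.99</span>
</div>
"""

# Parse once into an in-memory tree...
soup = BeautifulSoup(html, "html.parser")

# ...then query the same tree in different ways
title = soup.select_one("h2.title").get_text(strip=True)           # CSS selector
price = soup.find("span", attrs={"itemprop": "price"}).get_text()  # attribute search
print(title, price)  # Mechanical Keyboard 89.99
```

Everything happens on the string you pass in; swapping `"html.parser"` for `"lxml"` changes only parsing speed and tolerance for malformed markup, not the API.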
Where it shines:
- Parsing HTML you already have (API responses, saved files, email templates)
- One-off scripts where you need data from 5-20 pages
- Prototyping - get something working in 15 minutes before deciding if it needs a real framework
- Teaching - the API maps directly to how HTML structure works
- Extracting data from HTML within larger applications (ETL pipelines, email processors)
- Complex HTML parsing within Scrapy spiders where CSS selectors fall short
Where it falls apart:
- Crawls beyond roughly 100 pages (no built-in concurrency, no crawl management, no scheduler)
- JavaScript-rendered content (it only sees the raw HTML the server returns)
- Production pipelines (no retry logic, no rate limiting, no structured export formats)
- Duplicate URL detection and crawl frontier management
- Anything requiring stateful browser interaction
Performance reality: BeautifulSoup with lxml parses roughly 1-3 MB/s of HTML. For most pages that is fast enough. The bottleneck is almost always the network request, not the parsing. Where BeautifulSoup loses at scale is that you have to write all the infrastructure yourself: retry logic, rate limiting, concurrency, deduplication. By the time you have built all that, you have reinvented a subset of Scrapy - worse.
Complete example: product catalog parser
```python
import httpx
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List, Optional
import time
import random


@dataclass
class Product:
    name: str
    price: Optional[float]
    sku: Optional[str]
    description: str
    image_url: Optional[str]
    in_stock: bool
    url: str


def parse_product_page(html: str, url: str) -> Product:
    soup = BeautifulSoup(html, "lxml")
    name_tag = soup.select_one("h1.product-title, h1[itemprop=name], #productTitle")
    price_tag = soup.select_one('[itemprop="price"], .price, .product-price, #priceblock_ourprice')
    sku_tag = soup.select_one('[itemprop="sku"], .sku, #productSKU')
    desc_tag = soup.select_one('[itemprop="description"], .product-description, #productDescription')
    img_tag = soup.select_one('[itemprop="image"], .product-image img, #landingImage')
    stock_tag = soup.select_one('[itemprop="availability"], .availability, #availability')

    price_text = price_tag.get_text(strip=True) if price_tag else ""
    price_clean = "".join(c for c in price_text if c.isdigit() or c == ".")
    try:
        price = float(price_clean) if price_clean else None
    except ValueError:
        price = None

    in_stock = True
    if stock_tag:
        stock_text = stock_tag.get_text(strip=True).lower()
        in_stock = "out of stock" not in stock_text and "unavailable" not in stock_text

    return Product(
        name=name_tag.get_text(strip=True) if name_tag else "",
        price=price,
        sku=sku_tag.get_text(strip=True) if sku_tag else None,
        description=desc_tag.get_text(strip=True)[:500] if desc_tag else "",
        # Parenthesized so the `or` fallback only runs when img_tag exists
        image_url=(img_tag.get("src") or img_tag.get("data-src")) if img_tag else None,
        in_stock=in_stock,
        url=url,
    )


def scrape_catalog(
    urls: List[str],
    proxy: str,
    rate_limit_rpm: int = 30,
) -> List[Product]:
    products = []
    delay = 60.0 / rate_limit_rpm
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }
    with httpx.Client(proxy=proxy, timeout=15.0, headers=headers) as client:
        for url in urls:
            try:
                resp = client.get(url)
                if resp.status_code == 200:
                    products.append(parse_product_page(resp.text, url))
                elif resp.status_code == 429:
                    time.sleep(30)  # back off before a single retry
                    resp = client.get(url)
                    if resp.status_code == 200:
                        products.append(parse_product_page(resp.text, url))
            except (httpx.TimeoutException, httpx.ProxyError):
                pass  # skip this URL; a production version should log the failure
            time.sleep(delay + random.uniform(0, 0.5))
    return products
```
Scrapy: The Industrial Scraper
Scrapy is a framework, not a library. That is both its strength and its barrier to entry.
Out of the box you get: async request handling via Twisted, automatic rate limiting, retry middleware, cookie management, multiple export formats (JSON, CSV, XML, JSONL), item pipelines for cleaning data, a crawl scheduler with disk-based persistence, and a middleware stack for customizing every request and response. You write spiders, Scrapy handles everything else.
Architecture: Scrapy runs on Twisted, Python's mature async networking framework. It maintains a request queue, a downloader that handles concurrent HTTP connections, and a spider engine that processes responses and generates new requests. The middleware system lets you intercept requests and responses at multiple points - proxy rotation, header modification, retry logic, and custom downloaders all live here.
Where it shines:
- Crawling thousands or millions of pages efficiently
- Data pipelines that need cleaning, deduplication, and structured storage
- Respectful scraping with AUTOTHROTTLE, DOWNLOAD_DELAY, CONCURRENT_REQUESTS
- Long-running jobs that need to pause and resume via checkpoint persistence
- Complex crawl graphs with multiple spider types
- Integration with data infrastructure (Kafka, S3, PostgreSQL via item pipelines)
- Distributed crawling via Scrapy-Redis or Scrapyd
Where it falls apart:
- Scraping 3-5 pages from one site (significant setup overhead)
- JavaScript-heavy sites by default (Scrapy sees only the initial HTML)
- Quick prototyping without a project structure
- Scenarios where you need fine-grained browser control (mouse events, file downloads)
Performance reality: Scrapy with default settings runs around 16 concurrent requests and handles 300-500 pages/minute on a standard VPS. With CONCURRENT_REQUESTS=64 and AUTOTHROTTLE disabled, you can push past 1000 pages/minute, though most sites will block you before you get there. The async Twisted engine gives Scrapy a real advantage at scale - it does not block on network I/O the way a synchronous httpx loop does.
Spider example with proxy rotation middleware:
```python
import scrapy
from scrapy.http import Request
from dataclasses import dataclass
import json


@dataclass
class ProductItem:
    name: str
    price: str
    url: str
    category: str
    rating: str
    review_count: str


class ProductSpider(scrapy.Spider):
    name = "products"
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "DOWNLOAD_DELAY": 1.5,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 1,
        "AUTOTHROTTLE_MAX_DELAY": 10,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 8,
        "RETRY_TIMES": 3,
        "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
        "COOKIES_ENABLED": True,
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middlewares.ProxyMiddleware": 350,
            "myproject.middlewares.UserAgentMiddleware": 400,
        },
        "FEEDS": {
            "products.jsonl": {"format": "jsonlines", "overwrite": True},
        },
    }

    def __init__(self, start_url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url or "https://example.com/products"]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(
                url,
                callback=self.parse_listing,
                headers=self._get_headers(),
                meta={"dont_redirect": False},
            )

    def _get_headers(self) -> dict:
        return {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "sec-ch-ua": '"Chromium";v="131", "Not_A Brand";v="24", "Google Chrome";v="131"',
            "sec-ch-ua-mobile": "?0",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
        }

    def parse_listing(self, response):
        # Extract product links from the listing page
        product_links = response.css("a.product-card::attr(href), a.item-link::attr(href)").getall()
        for link in product_links:
            yield response.follow(link, callback=self.parse_product, headers=self._get_headers())
        # Pagination
        next_page = response.css("a.next-page::attr(href), link[rel=next]::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_listing, headers=self._get_headers())

    def parse_product(self, response):
        # Try JSON-LD structured data first
        json_ld = response.css('script[type="application/ld+json"]::text').get()
        if json_ld:
            try:
                data = json.loads(json_ld)
                if isinstance(data, list):
                    data = data[0]
                if data.get("@type") == "Product":
                    offer = data.get("offers", {})
                    if isinstance(offer, list):
                        offer = offer[0]
                    yield {
                        "name": data.get("name", ""),
                        "price": offer.get("price", ""),
                        "currency": offer.get("priceCurrency", "USD"),
                        "url": response.url,
                        "sku": data.get("sku", ""),
                        "description": (data.get("description") or "")[:300],
                        "in_stock": offer.get("availability", "").endswith("InStock"),
                    }
                    return
            except (json.JSONDecodeError, KeyError):
                pass
        # Fall back to CSS selectors
        yield {
            "name": response.css("h1::text, h1.title::text").get(default="").strip(),
            "price": response.css('[itemprop="price"]::attr(content), .price::text').get(default="").strip(),
            "url": response.url,
            "sku": response.css('[itemprop="sku"]::text').get(default="").strip(),
            "description": response.css('[itemprop="description"]::text').get(default="").strip()[:300],
            "in_stock": bool(response.css('.in-stock, [itemprop="availability"]')),
        }
```
Custom proxy rotation middleware:
```python
# myproject/middlewares.py
import random

from scrapy.exceptions import NotConfigured


class ProxyMiddleware:
    """Rotate through a proxy pool on every request."""

    def __init__(self, proxy_list):
        self.proxy_list = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist("PROXY_LIST", [])
        if not proxy_list:
            raise NotConfigured("PROXY_LIST setting is required")
        return cls(proxy_list)

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxy_list)


class UserAgentMiddleware:
    """Rotate user agents to avoid fingerprinting."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)


class RetryWithNewProxyMiddleware:
    """On 403/429/503, retry with a different proxy."""

    RETRY_CODES = {403, 429, 503}

    def __init__(self, proxy_list, max_retries=3):
        self.proxy_list = proxy_list
        self.max_retries = max_retries

    @classmethod
    def from_crawler(cls, crawler):
        proxy_list = crawler.settings.getlist("PROXY_LIST", [])
        if not proxy_list:
            raise NotConfigured("PROXY_LIST setting is required")
        max_retries = crawler.settings.getint("RETRY_TIMES", 3)
        return cls(proxy_list, max_retries)

    def process_response(self, request, response, spider):
        if response.status in self.RETRY_CODES:
            retries = request.meta.get("retry_count", 0)
            if retries < self.max_retries:
                new_request = request.copy()
                new_request.meta["proxy"] = random.choice(self.proxy_list)
                new_request.meta["retry_count"] = retries + 1
                new_request.dont_filter = True  # bypass the duplicate filter for the retry
                return new_request
        return response
```
Scrapy settings for ThorData proxy:
```python
# settings.py
PROXY_LIST = ["http://username:[email protected]:7000"]
# For sticky sessions: "http://username-session-{session_id}:[email protected]:7000"

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 350,
    "myproject.middlewares.UserAgentMiddleware": 400,
    "myproject.middlewares.RetryWithNewProxyMiddleware": 550,
}

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 4
```
Playwright: The Nuclear Option
Playwright launches a real browser. That means it executes JavaScript, renders the page, handles cookies, follows redirects, fires DOM events, and behaves exactly like an actual user. It is the only option when the data you need is constructed client-side.
Architecture: Playwright communicates with browser processes (Chromium, Firefox, or WebKit) over the DevTools Protocol. Each browser instance runs as a separate OS process. Browser contexts within an instance share the browser binary but have isolated storage, cookies, and cache. Pages within a context share the context.
This architecture has real implications: browser startup takes 1-3 seconds, each context consumes 50-150MB RAM, and the DevTools Protocol communication adds latency to every operation.
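Those numbers translate directly into capacity planning. A back-of-envelope sketch, where the function name, the 100MB per-context midpoint, and the 400MB browser-overhead figure are illustrative assumptions (in practice CPU, not memory, often binds first):

```python
def max_playwright_contexts(
    ram_mb: int,
    per_context_mb: int = 100,       # midpoint of the 50-150MB range
    browser_overhead_mb: int = 400,  # assumed cost of the browser process itself
    headroom: float = 0.25,          # reserve for the OS and the scraper process
) -> int:
    """Estimate how many concurrent browser contexts fit in a RAM budget."""
    usable = ram_mb * (1 - headroom) - browser_overhead_mb
    return max(0, int(usable // per_context_mb))

print(max_playwright_contexts(8192))  # a typical 8GB VPS -> 57
```

Memory alone would allow dozens of contexts on 8GB, which is why real deployments (like the 4-context benchmark later in this article) usually hit CPU and proxy-bandwidth limits first.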
Where it shines:
- Single-page applications built with React, Vue, Angular, or Svelte
- Sites behind login walls with complex authentication flows (MFA, OAuth, session tokens)
- Pages that load data via XHR or fetch requests after the initial render
- Infinite scroll pages where content loads on scroll events
- When you need to interact with the page (fill forms, click buttons, handle file uploads)
- Scraping sites that fingerprint clients in JavaScript
- When you need screenshots or PDFs alongside data
Where it falls apart:
- Speed: a real browser is 10-50x slower than raw HTTP requests
- Memory: each browser context consumes 50-150MB RAM
- Scale: running 50+ concurrent browser instances requires serious hardware or cloud resources
- Reliability: network timeouts, flaky selectors, race conditions between JS execution and your waits
- Cost: proportionally more expensive to run at scale
Performance reality with benchmarks: Scraping 10,000 product pages on a standard VPS (4 vCPUs, 8GB RAM):
| Tool | Time | Memory Peak | Pages/min |
|---|---|---|---|
| Scrapy (16 concurrent) | ~8 min | ~120MB | ~1250 |
| httpx + BS4 (async, 12 concurrent) | ~14 min | ~80MB | ~710 |
| Playwright (4 contexts) | ~90 min | ~800MB | ~110 |
Playwright is not a substitute for HTTP-based scraping at scale. It is a specialized tool for content that is genuinely impossible to get any other way.
Full async Playwright scraper with proxy and anti-detection:
```python
import asyncio
from playwright.async_api import async_playwright, Page, BrowserContext
from typing import List, Optional

PROXY_CONFIG = {
    "server": "http://gate.thordata.com:7000",
    "username": "your_username",
    "password": "your_password",
}


async def block_unnecessary_resources(page: Page) -> None:
    """Block images, stylesheets, fonts, and media to speed up scraping."""
    async def handler(route):
        if route.request.resource_type in ("image", "stylesheet", "font", "media"):
            await route.abort()
        else:
            await route.continue_()
    await page.route("**/*", handler)


async def add_stealth_scripts(context: BrowserContext) -> None:
    """Register scripts on the context so they run before every page load."""
    await context.add_init_script("""
        // Remove webdriver flag
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        // Fake plugins array (empty in headless)
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5]});
        // Fake language
        Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
        // Remove automation-specific chrome object properties
        if (window.chrome) {
            window.chrome.runtime = {};
        }
    """)


async def scrape_spa_page(
    url: str,
    context: BrowserContext,
    wait_selector: str = ".content, main, #app",
    timeout: int = 15000,
) -> Optional[str]:
    page = await context.new_page()
    try:
        await block_unnecessary_resources(page)
        await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
        await page.wait_for_selector(wait_selector, timeout=timeout)
        # Give JS time to populate dynamic content
        await page.wait_for_load_state("networkidle", timeout=5000)
        return await page.content()
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None
    finally:
        await page.close()


async def scrape_with_playwright(
    urls: List[str],
    max_contexts: int = 4,
    proxy_config: Optional[dict] = None,
) -> List[dict]:
    results = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy_config or PROXY_CONFIG,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--no-sandbox",
                "--disable-setuid-sandbox",
            ],
        )
        # Limit concurrent contexts with a semaphore
        semaphore = asyncio.Semaphore(max_contexts)

        async def scrape_with_semaphore(url: str) -> Optional[dict]:
            async with semaphore:
                context = await browser.new_context(
                    viewport={"width": 1366, "height": 768},
                    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
                    locale="en-US",
                    timezone_id="America/New_York",
                )
                # Stealth scripts are registered per context, so every page
                # created inside scrape_spa_page inherits them
                await add_stealth_scripts(context)
                try:
                    html = await scrape_spa_page(url, context)
                    if html:
                        return {"url": url, "html": html, "success": True}
                    return {"url": url, "html": None, "success": False}
                finally:
                    await context.close()

        tasks = [scrape_with_semaphore(url) for url in urls]
        batch_results = await asyncio.gather(*tasks, return_exceptions=True)
        for r in batch_results:
            if isinstance(r, dict):
                results.append(r)
        await browser.close()
    return results
```
Handling infinite scroll:
```python
import random

from playwright.async_api import async_playwright


async def scrape_infinite_scroll(url: str, proxy_config: dict, max_scrolls: int = 20) -> list:
    items = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        context = await browser.new_context(
            viewport={"width": 1280, "height": 900},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")

        prev_count = 0
        for scroll_num in range(max_scrolls):
            # Scroll to bottom
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(1500 + random.randint(0, 1000))
            # Count visible items
            current_items = await page.locator(".product-card, .item, .result").all()
            current_count = len(current_items)
            if current_count == prev_count:
                break  # no new items loaded; we are at the bottom
            prev_count = current_count

        # Extract data from all loaded items
        all_items = await page.locator(".product-card, .item, .result").all()
        for item in all_items:
            try:
                name = await item.locator("h2, h3, .title").first.inner_text()
                price = await item.locator(".price, [class*=price]").first.inner_text()
                items.append({"name": name.strip(), "price": price.strip()})
            except Exception:
                pass  # skip items missing a name or price
        await browser.close()
    return items
```
Intercepting XHR/fetch responses (often better than HTML parsing):
```python
from playwright.async_api import async_playwright
from typing import List


async def intercept_api_responses(
    url: str,
    api_pattern: str,
    proxy_config: dict,
) -> List[dict]:
    """
    Intercept background API calls instead of parsing HTML.
    This is cleaner and more reliable when the site fetches data via JSON APIs.
    """
    captured_data = []
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy_config)
        context = await browser.new_context()
        page = await context.new_page()

        async def capture_response(response):
            if api_pattern in response.url and response.status == 200:
                try:
                    content_type = response.headers.get("content-type", "")
                    if "json" in content_type:
                        data = await response.json()
                        if isinstance(data, list):
                            captured_data.extend(data)
                        elif isinstance(data, dict):
                            items = data.get("items", data.get("results", data.get("data", [])))
                            captured_data.extend(items)
                except Exception:
                    pass  # non-JSON body or already-closed response; ignore

        page.on("response", capture_response)
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await page.wait_for_timeout(3000)
        await context.close()
        await browser.close()
    return captured_data
```
Real-World Performance Comparison
These numbers come from scraping 10,000 product pages across a variety of sites, measured on a 4-vCPU VPS with 8GB RAM and residential proxy bandwidth:
| Metric | Scrapy | httpx + BS4 | Playwright |
|---|---|---|---|
| Pages/minute | 1200+ | 700 | 110 |
| Memory (steady-state) | ~120MB | ~80MB | ~800MB |
| Setup time (new project) | 15-20 min | 5 min | 10 min |
| JS-rendered sites | No (without plugin) | No | Yes |
| Built-in rate limiting | Yes (AUTOTHROTTLE) | No | No |
| Proxy rotation | Via middleware | Manual | Via browser config |
| Retry logic | Built-in | Manual | Manual |
| Data export | Built-in (JSON, CSV, XML) | Manual | Manual |
| Distributed crawling | Via Scrapy-Redis | Manual | Manual |
| Anti-bot resistance (browser fingerprint) | Low (HTTP only) | Low (HTTP only) | High |
Combining Tools: The Real Power Move
The most capable scrapers in production do not pick one tool - they use the right tool for each part of the job.
Scrapy + scrapy-playwright: The scrapy-playwright plugin lets you mark specific Scrapy requests to use a real browser while routing everything else through Scrapy's fast HTTP engine. The result: Scrapy handles discovery, crawl management, rate limiting, and data export while Playwright renders only the pages that actually need it.
```python
import scrapy
from scrapy_playwright.page import PageMethod


class HybridSpider(scrapy.Spider):
    name = "hybrid"
    start_urls = ["https://example.com/products"]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # scrapy-playwright requires the asyncio Twisted reactor
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "chromium",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True},
    }

    def parse(self, response):
        # Extract product URLs from the static listing page (fast HTTP)
        for url in response.css("a.product-link::attr(href)").getall():
            # Determine if the product page needs JS rendering
            if self.needs_javascript(url):
                yield scrapy.Request(
                    response.urljoin(url),
                    callback=self.parse_js_product,
                    meta={
                        "playwright": True,
                        "playwright_page_methods": [
                            PageMethod("wait_for_selector", ".product-data", timeout=10000),
                        ],
                    },
                )
            else:
                yield response.follow(url, callback=self.parse_static_product)

    def needs_javascript(self, url: str) -> bool:
        # Logic to determine if this URL requires JS rendering
        js_domains = ["spa-site.com", "dynamic-store.com"]
        return any(d in url for d in js_domains)

    def parse_js_product(self, response):
        """Handle a Playwright-rendered response."""
        yield {
            "name": response.css("h1::text").get(default="").strip(),
            "price": response.css(".price::text").get(default="").strip(),
            "url": response.url,
            "rendered": True,
        }

    def parse_static_product(self, response):
        """Handle a standard HTTP response."""
        yield {
            "name": response.css("h1::text").get(default="").strip(),
            "price": response.css(".price::text").get(default="").strip(),
            "url": response.url,
            "rendered": False,
        }
```
BeautifulSoup inside Scrapy: Some developers prefer BeautifulSoup's API for complex HTML parsing tasks within Scrapy spider callbacks - particularly for nested structures or irregular HTML that Scrapy's CSS/XPath selectors struggle with.
```python
import scrapy
from bs4 import BeautifulSoup


class ProductSpiderWithBS4(scrapy.Spider):
    name = "products_bs4"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        # Use BS4 for complex parsing within Scrapy
        soup = BeautifulSoup(response.text, "lxml")
        # Complex nested structure that is messy with CSS selectors
        for product_div in soup.select("div.product-complex > article > section.details"):
            nested_data = self._extract_nested(product_div)
            if nested_data:
                yield nested_data

    def _extract_nested(self, tag) -> dict:
        try:
            specs = {}
            for row in tag.select("table.specs tr"):
                cells = row.find_all("td")
                if len(cells) == 2:
                    specs[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
            return {
                "name": tag.select_one("h2").get_text(strip=True),
                "specs": specs,
            }
        except Exception:
            return {}  # e.g. select_one("h2") found nothing
```
Proxy Integration Patterns
Each tool has a slightly different proxy configuration interface.
ThorData with all three tools:
```python
# httpx + BeautifulSoup
import httpx

THORDATA = "http://username:[email protected]:7000"
with httpx.Client(proxy=THORDATA, timeout=20.0) as client:
    resp = client.get("https://target-site.com/page")

# Scrapy - in settings.py
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.ProxyMiddleware": 350}
# PROXY_LIST = ["http://username:[email protected]:7000"]

# Playwright
from playwright.async_api import async_playwright

PLAYWRIGHT_PROXY = {
    "server": "http://gate.thordata.com:7000",
    "username": "username",
    "password": "password",
}

async def run():
    async with async_playwright() as p:
        browser = await p.chromium.launch(proxy=PLAYWRIGHT_PROXY)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://target-site.com")
        await browser.close()
```
ThorData handles proxy rotation at the gateway level for all three tools. For Scrapy and httpx, you point at the gateway URL and rotation happens server-side. For Playwright, proxy authentication goes into the browser config and rotates per browser context when you create fresh contexts for each target URL.
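When you do want a sticky session, the usual pattern is to encode a session ID into the proxy username so each logical session keeps one exit IP. A sketch, assuming the `username-session-{session_id}` format shown in the Scrapy settings comment earlier (confirm the exact format against your provider's docs):

```python
import random
import string


def sticky_proxy_url(username: str, password: str,
                     gateway: str = "gate.thordata.com:7000") -> str:
    """Build a proxy URL with a random session ID baked into the username."""
    session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return f"http://{username}-session-{session_id}:{password}@{gateway}"


# Each call yields a new session; reuse the same URL to keep the same exit IP
url = sticky_proxy_url("user", "pass")
```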
Anti-Detection by Tool
Each tool has different anti-detection challenges and solutions.
BeautifulSoup + httpx:
- Main risk: TLS fingerprint and missing browser headers
- Solution: curl_cffi for TLS impersonation plus a realistic header set
- Rate limiting: implement manually with time.sleep plus gaussian jitter
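The manual rate limiting mentioned above fits in a small helper. A sketch with gaussian jitter; the `polite_delay` name and its default parameters are illustrative, not from any library:

```python
import random


def polite_delay(base: float = 2.0, sigma: float = 0.4, floor: float = 0.5) -> float:
    """Base delay plus gaussian jitter, clamped to a minimum, so request
    timing does not form a detectable fixed-interval pattern."""
    return max(floor, base + random.gauss(0, sigma))


# Between requests:
#   time.sleep(polite_delay())
```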
Scrapy:
- Main risk: a consistent User-Agent, missing Sec-Fetch headers, no cookie reuse between requests
- Solution: UserAgentMiddleware plus a full browser header set in the DEFAULT_REQUEST_HEADERS setting
- Downside: no JavaScript execution, so JS-based fingerprinting always reveals Scrapy
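The settings side of that looks like the fragment below, assuming a standard Scrapy `settings.py` (the header values mirror the spider example earlier; per-request User-Agent rotation still comes from the middleware):

```python
# settings.py - send a consistent, browser-like header set with every request
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Upgrade-Insecure-Requests": "1",
}
```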
Playwright:
- Main risk: the navigator.webdriver flag, an empty plugins array, a missing chrome object
- Solution: add_init_script to mask automation flags, or the playwright-stealth library
- Strong point: an actual browser fingerprint, real JS execution, real TLS from Chromium
```python
# Manual stealth patches (the playwright-stealth package automates similar tweaks)
from playwright.async_api import async_playwright


async def scrape_with_stealth(url: str, proxy: dict) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
            viewport={"width": 1366, "height": 768},
            locale="en-US",
            timezone_id="America/New_York",
            geolocation={"longitude": -74.0060, "latitude": 40.7128},
            permissions=["geolocation"],
        )
        page = await context.new_page()
        # Mask automation indicators before any page script runs
        await page.add_init_script("""
            delete Object.getPrototypeOf(navigator).webdriver;
            Object.defineProperty(navigator, 'platform', {get: () => 'Win32'});
            Object.defineProperty(screen, 'width', {get: () => 1366});
            Object.defineProperty(screen, 'height', {get: () => 768});
        """)
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await context.close()
        await browser.close()
        return content
```
The Bottom Line
Default choice for most developers in 2026: Scrapy for anything beyond a one-off script. It handles 80% of scraping tasks out of the box with sane defaults for rate limiting, retries, and data export. The middleware system makes proxy integration and anti-detection straightforward. Add scrapy-playwright when you hit JS-rendered sites.
Use BeautifulSoup when you need to parse HTML you already have, prototype quickly, or write simple scripts that would be overkill to build in Scrapy.
Use Playwright when the data is genuinely not available without JavaScript execution - SPAs, sites with complex login flows, and targets that fingerprint clients in JavaScript. Accept that you are trading speed for capability.
Never use Playwright as a default just because it is powerful. The resource cost is real, and for static HTML sites it gives you no advantage while making your scraper 10-50x slower.