Best Web Scraping APIs and Services in 2026: A Developer's Comprehensive Comparison
Every scraping project starts with the same question: should I build the infrastructure myself, or pay someone to handle the hard parts?
The answer depends on your target sites, budget, how much time you want to spend fighting CAPTCHAs, and whether you need this as a one-off job or an ongoing pipeline. The web scraping tool landscape in 2026 has matured significantly — there are more options, better APIs, and smarter anti-detection built into the tools. But the fundamentals of choosing the right tool haven't changed.
This guide breaks down every viable option in 2026, from fully managed APIs to fully DIY stacks, with real code examples and honest assessments of each.
The Three Approaches to Web Scraping
Before comparing specific tools, understand the three fundamental approaches:
1. Managed Scraping APIs
You send a URL, you get back data. The service handles proxies, JavaScript rendering, CAPTCHA solving, and retries. Examples: ScrapingBee, ScraperAPI, ZenRows.
Pros: Zero infrastructure, fast integration, low maintenance. Cons: Per-request costs add up, limited customization, vendor lock-in. Best for: Teams that need scraping as a feature, not a core competency.
2. Scraping Platforms
You build scrapers on their platform using their SDK. They handle deployment, scheduling, proxy rotation, and storage. Examples: Apify, Zyte/Scrapy Cloud.
Pros: Full control over scraping logic, managed infrastructure, community scrapers. Cons: Platform lock-in, learning curve, compute-based pricing. Best for: Teams that scrape many different sites and want reusable, shareable code.
3. DIY with Proxy Providers
You write your own scrapers and use a proxy service for IP rotation and anti-detection. Examples: httpx + ThorData, Playwright + proxy, Scrapy + proxy middleware.
Pros: Full control, no vendor lock-in, cheapest at scale. Cons: You maintain everything — error handling, retries, CAPTCHA solving, browser management. Best for: Engineering teams with scraping expertise who want maximum flexibility.
Detailed Service Comparisons
1. Apify — The Actor Marketplace
Apify stands out because of its community-driven actor marketplace. Thousands of pre-built scrapers (called "actors") cover common targets — Amazon, Google Maps, Instagram, TikTok, LinkedIn — and you can run them without writing a line of code.
What makes it different: The ecosystem. Over 3,000 ready-to-use actors means someone has probably already built what you need. If you need something custom, you build it with their SDK (JavaScript or Python) and deploy it to their cloud.
SDK Example (Python):
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Run a pre-built Amazon scraper
run_input = {
    "categoryUrls": [
        {"url": "https://www.amazon.com/s?k=wireless+headphones"}
    ],
    "maxItems": 50,
    "proxy": {"useApifyProxy": True, "groups": ["RESIDENTIAL"]},
}
run = client.actor("junglee/amazon-crawler").call(run_input=run_input)

# Fetch results
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
for item in items:
    print(f"{item.get('title')} — ${item.get('price')}")
```
Custom Actor Example:
```python
from apify import Actor
import httpx
from bs4 import BeautifulSoup


async def main():
    async with Actor:
        input_data = await Actor.get_input() or {}
        urls = input_data.get("urls", [])
        # Reuse one client (and its connection pool) across all URLs
        async with httpx.AsyncClient() as client:
            for url in urls:
                resp = await client.get(url)
                soup = BeautifulSoup(resp.text, "html.parser")
                title = soup.select_one("h1")
                price = soup.select_one(".price")
                await Actor.push_data({
                    "url": url,
                    "title": title.get_text(strip=True) if title else "",
                    "price": price.get_text(strip=True) if price else "",
                })
```
Pricing Breakdown:
| Plan | Monthly Cost | Platform Credits | Key Features |
|---|---|---|---|
| Free | $0 | $5 worth | 1 actor, basic scheduling |
| Starter | $49 | $49 worth | Unlimited actors, API access |
| Scale | $499 | $499 worth | Priority support, more compute |
| Enterprise | Custom | Custom | SLA, dedicated infrastructure |
Credits are consumed based on compute time (CPU + memory). A simple HTTP scraper costs fractions of a cent per run. Browser-based actors cost 3-10x more.
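To make that concrete, here is a back-of-the-envelope estimator. It assumes a compute unit equals 1 GB of RAM for 1 hour and uses a placeholder per-unit rate (verify against current Apify pricing); the run counts and durations are illustrative, not measured.

```python
def estimate_apify_cost(
    runs_per_month: int,
    seconds_per_run: float,
    memory_gb: float,
    usd_per_compute_unit: float = 0.4,  # assumed rate -- check current pricing
) -> float:
    """Rough monthly spend: one compute unit = 1 GB of RAM for 1 hour."""
    compute_units = runs_per_month * (seconds_per_run / 3600) * memory_gb
    return compute_units * usd_per_compute_unit


# 10,000 monthly runs: a 5 s HTTP actor at 0.5 GB vs a 30 s browser actor at 4 GB
print(round(estimate_apify_cost(10_000, 5, 0.5), 2))  # ≈ 2.78
print(round(estimate_apify_cost(10_000, 30, 4), 2))   # ≈ 133.33
```

The spread here is wider than the 3-10x rule of thumb because memory allocation matters as much as runtime; plug in your own actors' numbers before committing to a plan.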
Strengths: - Largest marketplace of pre-built scrapers - Excellent documentation and tutorials - Python and JavaScript SDKs - Built-in proxy management - Free tier is genuinely useful for small projects - Dataset storage and export included
Weaknesses: - Compute-based pricing is harder to predict than per-request pricing - Browser actors get expensive at scale - Custom actors require learning their SDK - Some marketplace actors are poorly maintained
Best for: Developers who want reusable, shareable scrapers with managed infrastructure. The actor marketplace is the real differentiator — before building anything custom, check if someone's already built it.
2. ScrapingBee — Simple API, Zero Infrastructure
ScrapingBee takes the simplest possible approach: one API endpoint, you send a URL, you get back rendered HTML. It handles JavaScript rendering, proxy rotation, and CAPTCHA solving behind the scenes.
What makes it different: The simplicity. Integration into existing codebases takes five minutes. No platform to learn, no SDK required (though they have one), no actors to configure.
Basic Usage:
```python
import httpx

SCRAPINGBEE_API_KEY = "YOUR_KEY"


def scrape_url(url: str, render_js: bool = False) -> str:
    """Scrape a URL through ScrapingBee."""
    params = {
        "api_key": SCRAPINGBEE_API_KEY,
        "url": url,
        "render_js": str(render_js).lower(),
    }
    resp = httpx.get("https://app.scrapingbee.com/api/v1/", params=params)
    resp.raise_for_status()
    return resp.text


# Simple HTML scraping (1 credit)
html = scrape_url("https://example.com/products")

# JavaScript-rendered page (5 credits)
html = scrape_url("https://spa-site.com/dashboard", render_js=True)
```
Advanced Features:
```python
import json

import httpx


def scrape_with_options(
    url: str,
    render_js: bool = False,
    premium_proxy: bool = False,
    country_code: str = "",
    wait_for: str = "",
    extract_rules: dict | None = None,
    screenshot: bool = False,
) -> dict:
    """ScrapingBee with all options."""
    params = {
        "api_key": SCRAPINGBEE_API_KEY,
        "url": url,
        "render_js": str(render_js).lower(),
        "premium_proxy": str(premium_proxy).lower(),
    }
    if country_code:
        params["country_code"] = country_code
    if wait_for:
        # CSS selector to wait for before returning
        params["wait_for"] = wait_for
    if extract_rules:
        # Server-side data extraction
        params["extract_rules"] = json.dumps(extract_rules)
    if screenshot:
        params["screenshot"] = "true"
    resp = httpx.get("https://app.scrapingbee.com/api/v1/", params=params, timeout=90)
    if screenshot:
        return {"screenshot": resp.content}
    return {"html": resp.text, "status": resp.status_code}


# Scrape with server-side extraction
result = scrape_with_options(
    url="https://example.com/product/123",
    render_js=True,
    premium_proxy=True,
    country_code="us",
    wait_for=".product-price",
    extract_rules={
        "title": "h1.product-title",
        "price": ".product-price",
        "description": ".product-description",
        "images": {
            "selector": "img.product-image",
            "type": "list",
            "output": "@src",
        },
    },
)
```
Google Search Scraping:
```python
from urllib.parse import quote_plus

import httpx
from bs4 import BeautifulSoup


def scrape_google(query: str, num_results: int = 10) -> list[dict]:
    """Scrape Google search results via ScrapingBee."""
    params = {
        "api_key": SCRAPINGBEE_API_KEY,
        # URL-encode the query so spaces and special characters survive
        "url": f"https://www.google.com/search?q={quote_plus(query)}&num={num_results}",
        "render_js": "false",
        "premium_proxy": "true",
        "country_code": "us",
    }
    resp = httpx.get("https://app.scrapingbee.com/api/v1/", params=params)
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    for div in soup.select("div.g"):
        title_elem = div.select_one("h3")
        link_elem = div.select_one("a[href]")
        snippet_elem = div.select_one("div.VwiC3b")
        if title_elem and link_elem:
            results.append({
                "title": title_elem.get_text(strip=True),
                "url": link_elem["href"],
                "snippet": snippet_elem.get_text(strip=True) if snippet_elem else "",
            })
    return results
```
Pricing:
| Plan | Monthly Cost | API Credits | Cost per Basic Request |
|---|---|---|---|
| Freelance | $49 | 150,000 | $0.00033 |
| Startup | $99 | 500,000 | $0.00020 |
| Business | $249 | 2,000,000 | $0.00012 |
| Enterprise | Custom | Custom | Custom |
Credit costs vary by feature: basic HTML = 1 credit, JS rendering = 5 credits, premium proxy = 10-25 credits.
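Those multipliers compound, so it's worth computing what a plan's credit pool actually buys for your request mix. A small helper using the rates just listed; mapping "premium proxy" to 10 credits without JS and 25 with JS is my assumed reading of the stated 10-25 range.

```python
def scrapingbee_credits(render_js: bool = False, premium_proxy: bool = False) -> int:
    """Credits per request, per the rates above (premium range mapped by assumption)."""
    if premium_proxy:
        return 25 if render_js else 10
    return 5 if render_js else 1


def requests_per_plan(plan_credits: int, **options) -> int:
    """How many requests of one type a plan's credit pool covers."""
    return plan_credits // scrapingbee_credits(**options)


# Startup plan: 500,000 credits/month
print(requests_per_plan(500_000))                                      # 500000
print(requests_per_plan(500_000, render_js=True))                      # 100000
print(requests_per_plan(500_000, render_js=True, premium_proxy=True))  # 20000
```

The takeaway: a plan that covers half a million basic fetches covers only 20,000 JS-rendered premium-proxy requests, so estimate your mix before picking a tier.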
Strengths: - Simplest integration of any scraping service - Server-side data extraction (no local parsing needed) - Google Search API included - Screenshot capability - Good documentation
Weaknesses: - Per-request pricing gets expensive for JavaScript-heavy scraping - Limited customization compared to platform-based tools - No scheduling or data storage - Premium proxy credits burn fast
Best for: Teams that need scraping as a feature inside a larger application, not as the core product. If you want to add "import from URL" to your SaaS without building scraping infrastructure, ScrapingBee is the fastest path.
3. Bright Data — Enterprise Proxy Network
Bright Data (formerly Luminati) operates the largest residential proxy network in the world — over 72 million IPs. They've expanded from pure proxy services into full scraping solutions with their Scraping Browser, Web Unlocker, and pre-built datasets.
What makes it different: Raw proxy power. When other services get blocked, Bright Data's residential and mobile proxies usually still work. Their Scraping Browser runs a full Chromium instance routed through residential IPs, which defeats most fingerprinting.
Web Unlocker Example:
```python
import httpx


def scrape_with_unlocker(url: str) -> str:
    """Use Bright Data's Web Unlocker for anti-bot bypass."""
    proxy = "http://USERNAME:[email protected]:33335"
    with httpx.Client(proxy=proxy, timeout=60) as client:
        resp = client.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36",
        })
    return resp.text
```
Scraping Browser (Playwright + Residential Proxy):
```python
import asyncio

from playwright.async_api import async_playwright


async def scrape_with_bright_browser(url: str) -> str:
    """Use Bright Data's Scraping Browser — real Chrome + residential IP."""
    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(
            "wss://USERNAME:[email protected]:9222"
        )
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content
```
Pricing:
| Product | Pricing Model | Approximate Cost |
|---|---|---|
| Datacenter Proxies | Per IP or per GB | $0.60/GB |
| Residential Proxies | Per GB | $8.40/GB |
| Mobile Proxies | Per GB | $24/GB |
| ISP Proxies | Per IP/month | $12/IP/month |
| Web Unlocker | Per request | $3/1000 requests |
| Scraping Browser | Per request | $8/1000 requests |
| SERP API | Per request | $3/1000 requests |
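Per-GB pricing means the bill tracks page weight, not page count. Here is a sketch for converting a crawl into an approximate bandwidth bill; the average page size is an assumption you should replace with measurements of your actual targets.

```python
def bandwidth_cost(pages: int, avg_page_kb: float, usd_per_gb: float) -> float:
    """Approximate proxy spend under per-GB pricing."""
    gigabytes = pages * avg_page_kb / (1024 * 1024)
    return gigabytes * usd_per_gb


# 1M pages of ~150 KB HTML through residential proxies at $8.40/GB
print(round(bandwidth_cost(1_000_000, 150, 8.40), 2))  # ≈ 1201.63
```

Blocking images, fonts, and other heavy assets directly shrinks this number, which is why the browser examples later in this guide abort those requests.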
Strengths: - Largest IP pool in the industry (72M+ residential IPs) - Multiple products for different needs (proxies, APIs, browser, datasets) - Best anti-bot bypass capabilities - Global geo-targeting (every country) - Enterprise-grade reliability and SLAs
Weaknesses: - Expensive — budget easily runs into thousands per month - Complex pricing with many products and tiers - Overkill for small-scale or occasional scraping - Setup complexity compared to simpler API services
Best for: Enterprise teams scraping at scale against aggressive anti-bot systems. If you're pulling millions of pages from sites that actively fight scrapers, Bright Data has the infrastructure. For smaller teams, the cost is hard to justify.
For proxy-only needs at smaller scale, ThorData offers residential and datacenter proxies at significantly lower rates — worth evaluating if proxies are your main bottleneck rather than full managed scraping.
4. Zyte — Managed Scrapy in the Cloud
Zyte (formerly Scrapinghub) is the company behind Scrapy, the most popular open-source scraping framework. Their platform lets you deploy Scrapy spiders to the cloud and adds AI-powered data extraction on top.
What makes it different: If your team already uses Scrapy, Zyte is the natural upgrade path. Their Zyte API uses machine learning to extract structured data from pages without writing custom selectors — point it at a product page and it returns structured product data.
Scrapy Spider Deployment:
```python
# settings.py for Scrapy Cloud deployment
SPIDER_MODULES = ["myproject.spiders"]
NEWSPIDER_MODULE = "myproject.spiders"

# Zyte proxy middleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610,
}
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = "YOUR_API_KEY"
```
Zyte API for Automatic Extraction:
```python
import httpx

ZYTE_API_KEY = "YOUR_KEY"


def extract_product(url: str) -> dict:
    """Use Zyte API to automatically extract product data."""
    resp = httpx.post(
        "https://api.zyte.com/v1/extract",
        auth=(ZYTE_API_KEY, ""),
        json={
            "url": url,
            "product": True,
            "productOptions": {
                "extractFrom": "httpResponseBody",
            },
        },
    )
    data = resp.json()
    product = data.get("product", {})
    return {
        "name": product.get("name"),
        "price": product.get("price"),
        "currency": product.get("currency"),
        "availability": product.get("availability"),
        "description": product.get("description"),
        "brand": product.get("brand", {}).get("name"),
        "images": [img.get("url") for img in product.get("images", [])],
        "rating": product.get("aggregateRating", {}).get("ratingValue"),
    }


def extract_article(url: str) -> dict:
    """Automatically extract article data."""
    resp = httpx.post(
        "https://api.zyte.com/v1/extract",
        auth=(ZYTE_API_KEY, ""),
        json={
            "url": url,
            "article": True,
        },
    )
    data = resp.json()
    article = data.get("article", {})
    return {
        "headline": article.get("headline"),
        "author": article.get("author"),
        "date": article.get("datePublished"),
        "body": article.get("articleBody"),
    }
```
Pricing:
| Product | Cost | Notes |
|---|---|---|
| Scrapy Cloud Free | $0 | 1 concurrent crawl, limited storage |
| Scrapy Cloud Pro | From $25/mo | More crawls, longer retention |
| Zyte API (Products) | $3.50/1000 | AI-powered product extraction |
| Zyte API (Articles) | $1.80/1000 | AI-powered article extraction |
| Smart Proxy Manager | $29/mo+ | Auto-rotating proxy middleware |
Strengths: - Natural fit for Scrapy users - AI-powered automatic data extraction (no selectors needed) - Cloud deployment with scheduling - Built-in proxy management - Long track record (founded 2010)
Weaknesses: - Scrapy-centric — less useful if you don't use Scrapy - AutoExtract accuracy varies by site - Can be expensive for high-volume extraction - UI/dashboard feels dated compared to competitors
Best for: Teams already invested in Scrapy who want managed hosting. The Zyte API is also great if you need structured data from varied page layouts without writing custom parsers.
5. Crawlee — The Modern Open-Source Framework
Crawlee (from the Apify team) is the successor to Apify SDK's crawling capabilities, available as a standalone open-source framework. It provides a unified API for HTTP crawling and browser automation with built-in anti-detection.
What makes it different: It combines the best of Scrapy (crawl management, request queuing, data storage) with Playwright's browser automation, all in a modern API. Available for both JavaScript/TypeScript and Python.
Python Example (Playwright Crawler):
```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
        headless=True,
        browser_type="chromium",
        # Built-in proxy rotation: ProxyConfiguration wraps the raw URL list
        proxy_configuration=ProxyConfiguration(
            proxy_urls=[
                "http://user:[email protected]:9000",
                "http://user:[email protected]:9000",
            ],
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page
        # Wait for content to load
        await page.wait_for_selector(".product-card")

        # Extract data
        products = await page.evaluate("""
            () => Array.from(document.querySelectorAll('.product-card')).map(card => ({
                title: card.querySelector('h2')?.textContent?.trim(),
                price: card.querySelector('.price')?.textContent?.trim(),
                url: card.querySelector('a')?.href,
            }))
        """)

        # Store results
        await context.push_data(products)

        # Follow pagination (no-op if the selector matches nothing)
        await context.enqueue_links(selector="a.next-page")

    await crawler.run(["https://example.com/products"])

    # Export data
    data = await crawler.get_data()
    print(f"Scraped {len(data.items)} products")


asyncio.run(main())
```
HTTP-only Crawler (Faster, Cheaper):
```python
import asyncio

from bs4 import BeautifulSoup
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main():
    crawler = HttpCrawler(
        max_requests_per_crawl=500,
        max_concurrency=10,
    )

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext):
        soup = BeautifulSoup(context.http_response.read(), "html.parser")
        for item in soup.select(".product"):
            title = item.select_one("h3")
            price = item.select_one(".price")
            await context.push_data({
                "title": title.get_text(strip=True) if title else "",
                "price": price.get_text(strip=True) if price else "",
            })
        # Auto-enqueue pagination links
        await context.enqueue_links(selector="a.pagination")

    await crawler.run(["https://example.com/catalog"])


asyncio.run(main())
```
Pricing: Free (open-source). You pay only for your own infrastructure + proxies.
Strengths: - Modern, well-designed API - Unified HTTP and browser crawling - Built-in anti-detection (fingerprint rotation, session management) - Auto-scaling concurrency - Request queue persistence (survives crashes) - Can deploy to Apify cloud if needed
Weaknesses: - Newer than Scrapy — smaller community - Python SDK less mature than JS/TS version - You still manage your own infrastructure (unless using Apify) - Documentation is good but still growing
Best for: New projects that want modern tooling without the baggage of Scrapy's older architecture. Great for teams comfortable managing their own infrastructure.
6. ScraperAPI — ScrapingBee Alternative
ScraperAPI is functionally similar to ScrapingBee with some differences in pricing and features.
Usage:
import httpx
def scrape_via_scraperapi(url: str, render: bool = False) -> str:
resp = httpx.get("https://api.scraperapi.com/", params={
"api_key": "YOUR_KEY",
"url": url,
"render": str(render).lower(),
})
return resp.text
# Structured data endpoint
def get_amazon_product(asin: str) -> dict:
"""ScraperAPI has dedicated endpoints for common targets."""
resp = httpx.get("https://api.scraperapi.com/structured/amazon/product", params={
"api_key": "YOUR_KEY",
"asin": asin,
"country": "us",
})
return resp.json()
Pricing: Starts at $49/month for 100,000 API credits. Structured data endpoints cost more per request.
Strengths: Dedicated endpoints for Amazon, Google, Walmart. Geographic rotation. Weaknesses: Similar limitations to ScrapingBee. Structured endpoints are limited in scope.
7. ZenRows — Anti-Bot Focus
ZenRows specializes in bypassing anti-bot systems, with built-in support for handling CAPTCHAs, JavaScript rendering, and residential proxy rotation.
Usage:
import httpx
def scrape_protected_site(url: str) -> str:
"""Zenrows handles anti-bot automatically."""
resp = httpx.get("https://api.zenrows.com/v1/", params={
"apikey": "YOUR_KEY",
"url": url,
"js_render": "true",
"antibot": "true",
"premium_proxy": "true",
})
return resp.text
Strengths: Strong anti-bot bypass, automatic CAPTCHA handling. Weaknesses: Premium features are expensive. Newer service with less track record.
DIY Approach: httpx + Playwright + Proxy Provider
Sometimes the right answer is no scraping service at all. You write your own code and use a proxy provider for IP rotation.
When DIY Makes Sense
- You have engineering resources and scraping expertise
- You scrape stable targets that don't change often
- You need maximum control over every aspect
- You want to avoid vendor lock-in
- Cost optimization matters (DIY is cheapest at scale)
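The cost argument can be made concrete with a simple break-even model. All numbers here are assumptions: a $0.002/request managed API, $2.80/GB proxies, ~100 KB pages, and a flat monthly maintenance figure standing in for engineering time.

```python
def managed_cost(pages: int, usd_per_request: float) -> float:
    """Managed API: pay per request."""
    return pages * usd_per_request


def diy_cost(pages: int, avg_page_kb: float, usd_per_gb: float,
             monthly_maintenance_usd: float) -> float:
    """DIY: proxy bandwidth plus a fixed upkeep cost (retries, breakage, monitoring)."""
    gigabytes = pages * avg_page_kb / (1024 * 1024)
    return gigabytes * usd_per_gb + monthly_maintenance_usd


for pages in (100_000, 500_000, 2_000_000):
    print(pages,
          round(managed_cost(pages, 0.002)),
          round(diy_cost(pages, 100, 2.8, 500)))
```

Under these assumptions DIY breaks even somewhere around 300K pages/month; below that, the fixed maintenance cost dominates and a managed API is cheaper.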
Complete DIY Stack
"""
Complete DIY scraping stack:
- httpx for HTTP requests
- Playwright for JavaScript-heavy sites
- ThorData for proxy rotation
- SQLite for data storage
- asyncio for concurrency
"""
import httpx
import asyncio
import sqlite3
import json
import random
import time
from datetime import datetime
from pathlib import Path
from bs4 import BeautifulSoup
class DIYScraper:
"""Production-ready DIY scraper with proxy rotation and storage."""
def __init__(
self,
proxy_url: str,
db_path: str = "scraping_data.db",
max_concurrent: int = 5,
delay_range: tuple = (1, 3),
):
self.proxy_url = proxy_url
self.db_path = db_path
self.max_concurrent = max_concurrent
self.delay_range = delay_range
self.semaphore = asyncio.Semaphore(max_concurrent)
self.stats = {"success": 0, "failed": 0, "blocked": 0}
self._init_db()
def _init_db(self):
"""Initialize SQLite database for results."""
conn = sqlite3.connect(self.db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS results (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT NOT NULL,
data TEXT,
scraped_at TEXT,
status TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS errors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT,
error TEXT,
timestamp TEXT
)
""")
conn.commit()
conn.close()
async def scrape_url(self, url: str, parse_fn) -> dict | None:
"""Scrape a single URL with retry logic."""
async with self.semaphore:
for attempt in range(3):
try:
async with httpx.AsyncClient(
proxy=self.proxy_url,
timeout=30,
follow_redirects=True,
) as client:
resp = await client.get(url, headers={
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
})
if resp.status_code == 200:
data = parse_fn(resp.text)
self._save_result(url, data, "success")
self.stats["success"] += 1
return data
elif resp.status_code in (403, 429, 503):
self.stats["blocked"] += 1
wait = (2 ** attempt) + random.random()
await asyncio.sleep(wait)
continue
else:
self._save_error(url, f"HTTP {resp.status_code}")
self.stats["failed"] += 1
return None
except Exception as e:
if attempt == 2:
self._save_error(url, str(e))
self.stats["failed"] += 1
return None
await asyncio.sleep(2 ** attempt)
finally:
await asyncio.sleep(random.uniform(*self.delay_range))
return None
async def scrape_batch(self, urls: list[str], parse_fn) -> list[dict]:
"""Scrape multiple URLs concurrently."""
tasks = [self.scrape_url(url, parse_fn) for url in urls]
results = await asyncio.gather(*tasks)
print(f"\nScraping complete:")
print(f" Success: {self.stats['success']}")
print(f" Blocked: {self.stats['blocked']}")
print(f" Failed: {self.stats['failed']}")
return [r for r in results if r is not None]
def _save_result(self, url: str, data: dict, status: str):
conn = sqlite3.connect(self.db_path)
conn.execute(
"INSERT INTO results (url, data, scraped_at, status) VALUES (?, ?, ?, ?)",
(url, json.dumps(data), datetime.utcnow().isoformat(), status),
)
conn.commit()
conn.close()
def _save_error(self, url: str, error: str):
conn = sqlite3.connect(self.db_path)
conn.execute(
"INSERT INTO errors (url, error, timestamp) VALUES (?, ?, ?)",
(url, error, datetime.utcnow().isoformat()),
)
conn.commit()
conn.close()
# Usage example
def parse_product(html: str) -> dict:
soup = BeautifulSoup(html, "html.parser")
return {
"title": soup.select_one("h1").get_text(strip=True) if soup.select_one("h1") else "",
"price": soup.select_one(".price").get_text(strip=True) if soup.select_one(".price") else "",
}
async def main():
# Using ThorData residential proxies
scraper = DIYScraper(
proxy_url="http://user:[email protected]:9000",
max_concurrent=5,
delay_range=(2, 5),
)
urls = [f"https://store.example.com/product/{i}" for i in range(1, 101)]
results = await scraper.scrape_batch(urls, parse_product)
print(f"Got {len(results)} products")
asyncio.run(main())
DIY with Playwright for JavaScript Sites
```python
import asyncio

from playwright.async_api import async_playwright


class BrowserScraper:
    """DIY browser scraper with anti-detection measures."""

    def __init__(self, proxy_url: str, max_browsers: int = 3):
        self.proxy_url = proxy_url
        self.max_browsers = max_browsers
        self.semaphore = asyncio.Semaphore(max_browsers)

    async def scrape(self, url: str, extract_js: str) -> dict | None:
        """Scrape a JS-heavy page with Playwright."""
        async with self.semaphore:
            # Split "http://user:pass@host:port" into Playwright's proxy fields
            creds, _, server = self.proxy_url.removeprefix("http://").partition("@")
            username, _, password = creds.partition(":")
            async with async_playwright() as p:
                browser = await p.chromium.launch(
                    headless=True,
                    proxy={
                        "server": f"http://{server}",
                        "username": username,
                        "password": password,
                    },
                )
                context = await browser.new_context(
                    viewport={"width": 1920, "height": 1080},
                    user_agent=(
                        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/131.0.0.0 Safari/537.36"
                    ),
                    locale="en-US",
                )
                # Block heavy resources to save proxy bandwidth
                page = await context.new_page()
                await page.route(
                    "**/*.{png,jpg,gif,svg,woff,woff2}",
                    lambda route: route.abort(),
                )
                try:
                    await page.goto(url, wait_until="networkidle", timeout=30000)
                    data = await page.evaluate(extract_js)
                    return data
                except Exception as e:
                    print(f"Error scraping {url}: {e}")
                    return None
                finally:
                    await browser.close()


# Usage
scraper = BrowserScraper(
    proxy_url="http://user:[email protected]:9000"
)

extract_script = """
() => {
    const products = [];
    document.querySelectorAll('.product-card').forEach(card => {
        products.push({
            title: card.querySelector('h2')?.textContent?.trim(),
            price: card.querySelector('.price')?.textContent?.trim(),
            url: card.querySelector('a')?.href,
        });
    });
    return products;
}
"""

results = asyncio.run(scraper.scrape("https://spa-store.com/products", extract_script))
```
Error Handling and CAPTCHA Strategies
Detecting and Handling Blocks
```python
import asyncio

import httpx


def detect_block(resp: httpx.Response) -> str | None:
    """Detect if a response is a block page, not real content."""
    # Status code checks
    if resp.status_code == 403:
        return "forbidden"
    if resp.status_code == 429:
        return "rate_limited"
    if resp.status_code == 503:
        return "service_unavailable"

    content = resp.text.lower()

    # Cloudflare
    if "checking your browser" in content or "cf-browser-verification" in content:
        return "cloudflare_challenge"
    # Generic CAPTCHA
    if "captcha" in content and ("recaptcha" in content or "hcaptcha" in content):
        return "captcha"
    # Access denied pages
    if any(phrase in content for phrase in [
        "access denied",
        "access to this page has been denied",
        "bot detected",
        "automated access",
        "unusual traffic",
    ]):
        return "access_denied"
    # PerimeterX
    if "perimeterx" in content or "px-captcha" in content:
        return "perimeterx"
    # DataDome
    if "datadome" in content:
        return "datadome"
    # Empty or suspiciously small response
    if len(resp.text) < 500 and resp.status_code == 200:
        return "empty_response"

    return None  # Not blocked


async def scrape_with_block_handling(
    url: str,
    proxy_url: str,
    max_retries: int = 3,
) -> dict:
    """Scrape with intelligent block detection and retry."""
    for attempt in range(max_retries):
        async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
            resp = await client.get(url, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                              "AppleWebKit/537.36",
            })
        block_type = detect_block(resp)
        if block_type is None:
            return {"status": "ok", "html": resp.text}
        if block_type == "rate_limited":
            wait = 30 * (attempt + 1)
            print(f"Rate limited, waiting {wait}s...")
            await asyncio.sleep(wait)
            continue
        if block_type in ("cloudflare_challenge", "captcha"):
            print(f"Challenge detected: {block_type}")
            # Switch to browser-based approach
            return {"status": "needs_browser", "block_type": block_type}
        if block_type == "forbidden":
            print("IP blocked, need different proxy")
            return {"status": "blocked", "block_type": block_type}
    return {"status": "failed", "attempts": max_retries}
```
CAPTCHA Solving Integration
```python
import asyncio

import httpx


async def solve_captcha_2captcha(
    api_key: str,
    site_key: str,
    page_url: str,
    captcha_type: str = "recaptcha_v2",
) -> str | None:
    """Solve CAPTCHAs via the 2Captcha service."""
    method_map = {
        "recaptcha_v2": "userrecaptcha",
        "recaptcha_v3": "userrecaptcha",
        "hcaptcha": "hcaptcha",
    }
    method = method_map.get(captcha_type)
    if not method:
        return None

    async with httpx.AsyncClient() as client:
        # Submit
        submit_data = {
            "key": api_key,
            "method": method,
            "json": 1,
        }
        if captcha_type == "hcaptcha":
            submit_data["sitekey"] = site_key
            submit_data["pageurl"] = page_url
        else:
            submit_data["googlekey"] = site_key
            submit_data["pageurl"] = page_url
            if captcha_type == "recaptcha_v3":
                submit_data["version"] = "v3"
                submit_data["min_score"] = "0.3"

        resp = await client.post("https://2captcha.com/in.php", data=submit_data)
        result = resp.json()
        if result.get("status") != 1:
            return None
        task_id = result["request"]

        # Poll for solution
        for _ in range(60):
            await asyncio.sleep(5)
            resp = await client.get("https://2captcha.com/res.php", params={
                "key": api_key,
                "action": "get",
                "id": task_id,
                "json": 1,
            })
            result = resp.json()
            if result.get("status") == 1:
                return result["request"]
            # "CAPCHA_NOT_READY" (sic) is the service's actual status string
            if result.get("request") != "CAPCHA_NOT_READY":
                return None  # Error
    return None  # Timeout
```
Real-World Use Cases
E-Commerce Price Monitoring
```python
import asyncio
import json
from datetime import datetime

import httpx
from bs4 import BeautifulSoup


async def monitor_prices(
    product_urls: list[str],
    proxy_url: str,
    output_file: str = "prices.jsonl",
) -> dict:
    """
    Daily price monitoring across e-commerce sites.
    Uses residential proxies for Amazon/Walmart, datacenter for others.
    """
    results = {"total": len(product_urls), "success": 0, "failed": 0}
    async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
        for url in product_urls:
            try:
                resp = await client.get(url, headers={
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                  "AppleWebKit/537.36",
                })
                if resp.status_code == 200:
                    # Try JSON-LD first
                    soup = BeautifulSoup(resp.text, "html.parser")
                    price_data = None
                    for script in soup.find_all("script", type="application/ld+json"):
                        try:
                            ld = json.loads(script.string)
                            if ld.get("@type") == "Product":
                                offers = ld.get("offers", {})
                                if isinstance(offers, list):
                                    offers = offers[0]
                                price_data = {
                                    "url": url,
                                    "name": ld.get("name"),
                                    "price": offers.get("price"),
                                    "currency": offers.get("priceCurrency"),
                                    "availability": offers.get("availability"),
                                    "timestamp": datetime.utcnow().isoformat(),
                                }
                                break
                        except (json.JSONDecodeError, TypeError):
                            # TypeError covers empty <script> tags (script.string is None)
                            continue
                    if price_data:
                        with open(output_file, "a") as f:
                            f.write(json.dumps(price_data) + "\n")
                        results["success"] += 1
                    else:
                        results["failed"] += 1
                else:
                    results["failed"] += 1
                await asyncio.sleep(2)
            except Exception:
                results["failed"] += 1
    return results
```
SEO and SERP Monitoring
```python
import asyncio
import random
from datetime import datetime

import httpx
from bs4 import BeautifulSoup


async def track_serp_rankings(
    keywords: list[str],
    domain: str,
    proxy_url: str,
) -> list[dict]:
    """
    Track search engine rankings for keywords.
    Requires residential proxies for Google.
    """
    rankings = []
    async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
        for keyword in keywords:
            try:
                resp = await client.get(
                    "https://www.google.com/search",
                    params={"q": keyword, "num": 100},
                    headers={
                        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                      "AppleWebKit/537.36",
                        "Accept-Language": "en-US,en;q=0.9",
                    },
                )
                if resp.status_code == 200:
                    soup = BeautifulSoup(resp.text, "html.parser")
                    position = None
                    for i, result in enumerate(soup.select("div.g"), 1):
                        link = result.select_one("a[href]")
                        if link and domain in link.get("href", ""):
                            position = i
                            break
                    rankings.append({
                        "keyword": keyword,
                        "position": position,
                        "date": datetime.utcnow().isoformat(),
                    })
                # Important: don't hammer Google
                await asyncio.sleep(5 + random.uniform(0, 5))
            except Exception:
                rankings.append({
                    "keyword": keyword,
                    "position": None,
                    "error": True,
                })
    return rankings
```
Research Data Collection
```python
async def collect_academic_data(
    search_terms: list[str],
    proxy_url: str = "",  # Datacenter is fine for most academic sources
) -> list[dict]:
    """
    Collect research papers from public academic APIs.
    Most academic sources are bot-friendly — datacenter proxies work.
    """
    papers = []
    client_kwargs = {"timeout": 30}
    if proxy_url:
        client_kwargs["proxy"] = proxy_url
    async with httpx.AsyncClient(**client_kwargs) as client:
        for term in search_terms:
            # OpenAlex API (free, no key needed)
            try:
                resp = await client.get(
                    "https://api.openalex.org/works",
                    params={
                        "search": term,
                        "per-page": 25,  # OpenAlex uses a hyphenated param name
                        "sort": "cited_by_count:desc",
                    },
                )
                if resp.status_code == 200:
                    data = resp.json()
                    for work in data.get("results", []):
                        papers.append({
                            "title": work.get("title"),
                            "doi": work.get("doi"),
                            "year": work.get("publication_year"),
                            "citations": work.get("cited_by_count"),
                            "source": "openalex",
                        })
            except Exception:
                pass
            await asyncio.sleep(1)
    return papers
```
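Because one paper often matches several search terms, a de-duplication pass on the collected list is usually worthwhile. A minimal sketch, using the same field names as the dicts built above (the `dedupe_papers` helper is an illustration, not part of the collector):

```python
def dedupe_papers(papers: list[dict]) -> list[dict]:
    """Keep the first occurrence of each DOI; papers without a DOI are kept as-is."""
    seen: set[str] = set()
    unique = []
    for paper in papers:
        doi = paper.get("doi")
        if doi is None:
            unique.append(paper)  # no stable key to dedupe on
        elif doi not in seen:
            seen.add(doi)
            unique.append(paper)
    return unique

papers = [
    {"title": "A", "doi": "10.1/a"},
    {"title": "A again", "doi": "10.1/a"},
    {"title": "B", "doi": None},
]
print(len(dedupe_papers(papers)))  # → 2
```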
Comprehensive Decision Matrix
| Feature | Apify | ScrapingBee | Bright Data | Zyte | Crawlee | DIY + ThorData |
|---|---|---|---|---|---|---|
| Free tier | $5/mo credits | Trial only | Trial only | Limited | Open source | Proxy cost only |
| JS rendering | Yes (actors) | Yes (API) | Yes (browser) | Yes | Yes (Playwright) | Yes (Playwright) |
| Anti-bot bypass | Actor-dependent | Good | Excellent | Good | Good | Manual + proxy |
| Ease of setup | Medium | Very easy | Medium | Medium | Medium | Hard |
| Best scale | Medium-large | Medium | Very large | Medium-large | Any | Any |
| Pricing model | Compute time | Per request | Per GB/request | Per extraction | Free (infra cost) | Per GB (proxy) |
| Customization | High (SDK) | Low (API) | Medium | High (Scrapy) | High | Full control |
| Data storage | Included | None | Datasets available | Included | Local/custom | Custom |
| Scheduling | Included | None (use cron) | Available | Included | Custom | Custom |
| Community | Large (actors) | None | None | Scrapy community | Growing | Python ecosystem |
| Monthly cost (10K pages) | $5-20 | $10-50 | $50-200 | $15-35 | $0 + proxy | $5-20 (proxy) |
| Monthly cost (1M pages) | $200-500 | $500-2000 | $2000-8000 | $500-1500 | $0 + proxy | $200-800 (proxy) |
Choosing the Right Tool: Decision Framework
Step 1: How Many Sites Are You Scraping?
- 1-3 sites: DIY or ScrapingBee. Managed services are overkill.
- 4-20 sites: Apify or Crawlee. You need reusable scraper patterns.
- 20+ sites: Zyte (AutoExtract) or Apify marketplace. Writing custom selectors for 50 sites is unmaintainable.
Step 2: Do Your Targets Use Anti-Bot Protection?
- No protection: DIY with datacenter proxies. Cheapest possible approach.
- Basic protection: ScrapingBee or DIY with ThorData residential proxies.
- Aggressive protection (Cloudflare, DataDome, PerimeterX): Bright Data Scraping Browser or Zenrows.
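If you don't know which tier a target falls into, response headers give a rough first signal. A minimal heuristic sketch: the header names below are commonly observed fingerprints, not an exhaustive or guaranteed detection method, and the `guess_protection` helper is hypothetical.

```python
def guess_protection(headers: dict[str, str]) -> str:
    """
    Rough heuristic: infer an anti-bot vendor from response headers.
    Vendors change these signals; treat the result as a hint only.
    """
    h = {k.lower(): v.lower() for k, v in headers.items()}
    if "cf-ray" in h or "cloudflare" in h.get("server", ""):
        return "cloudflare"
    if any(k.startswith("x-datadome") for k in h):
        return "datadome"
    return "unknown"

print(guess_protection({"Server": "cloudflare", "CF-RAY": "abc123"}))  # → cloudflare
```

Running a plain `httpx.get` against the target and inspecting `resp.headers` this way takes a minute and can save you from overpaying for residential proxies on an unprotected site.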
Step 3: What's Your Budget?
- $0-50/month: Apify free tier, Crawlee + cheap proxies, or DIY.
- $50-500/month: Any of the services work. Pick based on ease of use vs. control.
- $500+/month: At this budget, DIY with ThorData proxies is often cheaper than managed services while giving full control.
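The budget tiers hide a simple break-even calculation: managed APIs price per request, proxies price per GB, so the crossover depends on your average page size. A back-of-the-envelope sketch, where the per-unit prices are illustrative assumptions rather than quotes from any vendor:

```python
def monthly_cost_api(pages: int, price_per_1k_requests: float) -> float:
    """Managed API: flat price per 1,000 successful requests."""
    return pages / 1000 * price_per_1k_requests

def monthly_cost_diy(pages: int, avg_page_kb: float, price_per_gb: float) -> float:
    """DIY: proxy bandwidth only (ignores your own compute and engineering time)."""
    gb = pages * avg_page_kb / 1024 / 1024
    return gb * price_per_gb

# Illustrative assumptions: $2 per 1k requests, 200 KB pages, $5/GB residential traffic.
for pages in (10_000, 1_000_000):
    api = monthly_cost_api(pages, 2.0)
    diy = monthly_cost_diy(pages, 200, 5.0)
    print(f"{pages:>9,} pages: API ${api:,.0f} vs DIY proxy ${diy:,.0f}")
```

Under these assumptions DIY wins on raw dollars at both scales, but the gap only justifies the engineering overhead once volume is high, which is why the $500+/month tier is where DIY starts to make sense.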
Step 4: What's Your Team's Expertise?
- No scraping experience: ScrapingBee (simplest API) or Apify (marketplace has pre-built scrapers).
- Python developers: Crawlee, Scrapy/Zyte, or DIY.
- JavaScript/TypeScript team: Crawlee (JS SDK is more mature), Apify.
- Data engineers: Zyte AutoExtract (minimal code, structured output).
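The four steps can be folded into a quick first-pass helper. This is just the framework above transcribed into code with coarse inputs; the `recommend_tool` function is a sketch to make the decision order explicit, not a substitute for evaluating your actual targets:

```python
def recommend_tool(
    num_sites: int,
    anti_bot: str,          # "none" | "basic" | "aggressive"
    monthly_budget: float,  # USD
    team: str,              # "none" | "python" | "javascript" | "data"
) -> str:
    """Map the four framework questions to a first-pass recommendation."""
    if anti_bot == "aggressive":
        return "Bright Data Scraping Browser or Zenrows"
    if team == "none":
        return "ScrapingBee or Apify marketplace"
    if num_sites <= 3:
        return "DIY or ScrapingBee"
    if num_sites > 20:
        return "Zyte AutoExtract or Apify marketplace"
    if monthly_budget >= 500:
        return "DIY with residential proxies"
    return "Apify or Crawlee"

print(recommend_tool(2, "basic", 100, "python"))  # → DIY or ScrapingBee
```

Note the ordering: anti-bot protection trumps everything else, because no amount of clean code helps if every request gets blocked.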
The Bottom Line
Start with Apify if you want the broadest capability at the lowest entry cost. The actor marketplace means you're rarely starting from scratch, and the free tier lets you validate before spending.
Use ScrapingBee if you need scraping as a feature, not a project. Its API-first design integrates cleanly into existing applications.
Go Bright Data only if you're operating at enterprise scale against hardened targets where other solutions get blocked.
Choose Zyte if your team already knows Scrapy and wants to offload infrastructure, or if you need automatic data extraction across many different page layouts.
Use Crawlee for new projects that want modern, well-designed tooling without platform lock-in.
Build it yourself with ThorData proxies when you want maximum control, minimum vendor lock-in, and the lowest per-page cost at scale. This is the right choice for experienced teams who scrape as a core part of their business.
The scraping landscape evolves fast. Services come and go, anti-bot systems get smarter, and new tools emerge. The framework above — understanding your targets, budget, and team — will help you make the right choice regardless of which specific tools are available.