Best Web Scraping APIs and Services in 2026: A Developer's Comprehensive Comparison
Every scraping project starts with the same question: should I build the infrastructure myself, or pay someone to handle the hard parts?
The answer depends on your target sites, budget, how much time you want to spend fighting CAPTCHAs, and whether you need this as a one-off job or an ongoing pipeline. The web scraping tool landscape in 2026 has matured significantly — there are more options, better APIs, and smarter anti-detection built into the tools. But the fundamentals of choosing the right tool haven't changed.
This guide breaks down every viable option in 2026, from fully managed APIs to fully DIY stacks, with real code examples and honest assessments of each.
The Three Approaches to Web Scraping
Before comparing specific tools, understand the three fundamental approaches:
1. Managed Scraping APIs
You send a URL, you get back data. The service handles proxies, JavaScript rendering, CAPTCHA solving, and retries. Examples: ScrapingBee, ScraperAPI, ZenRows.
Pros: Zero infrastructure, fast integration, low maintenance. Cons: Per-request costs add up, limited customization, vendor lock-in. Best for: Teams that need scraping as a feature, not a core competency.
2. Scraping Platforms
You build scrapers on their platform using their SDK. They handle deployment, scheduling, proxy rotation, and storage. Examples: Apify, Zyte/Scrapy Cloud.
Pros: Full control over scraping logic, managed infrastructure, community scrapers. Cons: Platform lock-in, learning curve, compute-based pricing. Best for: Teams that scrape many different sites and want reusable, shareable code.
3. DIY with Proxy Providers
You write your own scrapers and use a proxy service for IP rotation and anti-detection. Examples: httpx + ThorData, Playwright + proxy, Scrapy + proxy middleware.
Pros: Full control, no vendor lock-in, cheapest at scale. Cons: You maintain everything — error handling, retries, CAPTCHA solving, browser management. Best for: Engineering teams with scraping expertise who want maximum flexibility.
Detailed Service Comparisons
1. Apify — The Actor Marketplace
Apify stands out because of its community-driven actor marketplace. Thousands of pre-built scrapers (called "actors") cover common targets — Amazon, Google Maps, Instagram, TikTok, LinkedIn — and you can run them without writing a line of code.
What makes it different: The ecosystem. Over 3,000 ready-to-use actors means someone has probably already built what you need. If you need something custom, you build it with their SDK (JavaScript or Python) and deploy it to their cloud.
SDK Example (Python):
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Run a pre-built Amazon scraper
run_input = {
    "categoryUrls": [
        {"url": "https://www.amazon.com/s?k=wireless+headphones"}
    ],
    "maxItems": 50,
    "proxy": {"useApifyProxy": True, "groups": ["RESIDENTIAL"]},
}
run = client.actor("junglee/amazon-crawler").call(run_input=run_input)

# Fetch results
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
for item in items:
    print(f"{item.get('title')} — ${item.get('price')}")
```
Custom Actor Example:
```python
from apify import Actor
import httpx
from bs4 import BeautifulSoup


async def main():
    async with Actor:
        input_data = await Actor.get_input() or {}
        urls = input_data.get("urls", [])
        # Reuse one client (and its connection pool) across all URLs
        async with httpx.AsyncClient() as client:
            for url in urls:
                resp = await client.get(url)
                soup = BeautifulSoup(resp.text, "html.parser")
                title = soup.select_one("h1")
                price = soup.select_one(".price")
                await Actor.push_data({
                    "url": url,
                    "title": title.get_text(strip=True) if title else "",
                    "price": price.get_text(strip=True) if price else "",
                })
```
Pricing Breakdown:
| Plan | Monthly Cost | Platform Credits | Key Features |
|---|---|---|---|
| Free | $0 | $5 worth | 1 actor, basic scheduling |
| Starter | $49 | $49 worth | Unlimited actors, API access |
| Scale | $499 | $499 worth | Priority support, more compute |
| Enterprise | Custom | Custom | SLA, dedicated infrastructure |
Credits are consumed based on compute time (CPU + memory). A simple HTTP scraper costs fractions of a cent per run. Browser-based actors cost 3-10x more.
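To make that concrete, here is a back-of-the-envelope estimator. It assumes a compute unit equals 1 GB of RAM for 1 hour and uses a placeholder per-unit rate (verify against current Apify pricing); the run counts and durations are illustrative, not measured.

```python
def estimate_apify_cost(
    runs_per_month: int,
    seconds_per_run: float,
    memory_gb: float,
    usd_per_compute_unit: float = 0.4,  # assumed rate -- check current pricing
) -> float:
    """Rough monthly spend: one compute unit = 1 GB of RAM for 1 hour."""
    compute_units = runs_per_month * (seconds_per_run / 3600) * memory_gb
    return compute_units * usd_per_compute_unit


# 10,000 monthly runs: a 5 s HTTP actor at 0.5 GB vs a 30 s browser actor at 4 GB
print(round(estimate_apify_cost(10_000, 5, 0.5), 2))  # ≈ 2.78
print(round(estimate_apify_cost(10_000, 30, 4), 2))   # ≈ 133.33
```

The spread here is wider than the 3-10x rule of thumb because memory allocation matters as much as runtime; plug in your own actors' numbers before committing to a plan.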
Strengths: - Largest marketplace of pre-built scrapers - Excellent documentation and tutorials - Python and JavaScript SDKs - Built-in proxy management - Free tier is genuinely useful for small projects - Dataset storage and export included
Weaknesses: - Compute-based pricing is harder to predict than per-request pricing - Browser actors get expensive at scale - Custom actors require learning their SDK - Some marketplace actors are poorly maintained
Best for: Developers who want reusable, shareable scrapers with managed infrastructure. The actor marketplace is the real differentiator — before building anything custom, check if someone's already built it.
2. ScrapingBee — Simple API, Zero Infrastructure
ScrapingBee takes the simplest possible approach: one API endpoint, you send a URL, you get back rendered HTML. It handles JavaScript rendering, proxy rotation, and CAPTCHA solving behind the scenes.
What makes it different: The simplicity. Integration into existing codebases takes five minutes. No platform to learn, no SDK required (though they have one), no actors to configure.
Basic Usage:
```python
import httpx

SCRAPINGBEE_API_KEY = "YOUR_KEY"


def scrape_url(url: str, render_js: bool = False) -> str:
    """Scrape a URL through ScrapingBee."""
    params = {
        "api_key": SCRAPINGBEE_API_KEY,
        "url": url,
        "render_js": str(render_js).lower(),
    }
    resp = httpx.get("https://app.scrapingbee.com/api/v1/", params=params)
    resp.raise_for_status()
    return resp.text


# Simple HTML scraping (1 credit)
html = scrape_url("https://example.com/products")

# JavaScript-rendered page (5 credits)
html = scrape_url("https://spa-site.com/dashboard", render_js=True)
```
Advanced Features:
```python
import json

import httpx


def scrape_with_options(
    url: str,
    render_js: bool = False,
    premium_proxy: bool = False,
    country_code: str = "",
    wait_for: str = "",
    extract_rules: dict | None = None,
    screenshot: bool = False,
) -> dict:
    """ScrapingBee with all options."""
    params = {
        "api_key": SCRAPINGBEE_API_KEY,
        "url": url,
        "render_js": str(render_js).lower(),
        "premium_proxy": str(premium_proxy).lower(),
    }
    if country_code:
        params["country_code"] = country_code
    if wait_for:
        # CSS selector to wait for before returning
        params["wait_for"] = wait_for
    if extract_rules:
        # Server-side data extraction
        params["extract_rules"] = json.dumps(extract_rules)
    if screenshot:
        params["screenshot"] = "true"
    resp = httpx.get("https://app.scrapingbee.com/api/v1/", params=params, timeout=90)
    if screenshot:
        return {"screenshot": resp.content}
    return {"html": resp.text, "status": resp.status_code}


# Scrape with server-side extraction
result = scrape_with_options(
    url="https://example.com/product/123",
    render_js=True,
    premium_proxy=True,
    country_code="us",
    wait_for=".product-price",
    extract_rules={
        "title": "h1.product-title",
        "price": ".product-price",
        "description": ".product-description",
        "images": {
            "selector": "img.product-image",
            "type": "list",
            "output": "@src",
        },
    },
)
```
Google Search Scraping:
```python
from urllib.parse import quote_plus

import httpx
from bs4 import BeautifulSoup


def scrape_google(query: str, num_results: int = 10) -> list[dict]:
    """Scrape Google search results via ScrapingBee."""
    params = {
        "api_key": SCRAPINGBEE_API_KEY,
        # URL-encode the query so spaces and special characters survive
        "url": f"https://www.google.com/search?q={quote_plus(query)}&num={num_results}",
        "render_js": "false",
        "premium_proxy": "true",
        "country_code": "us",
    }
    resp = httpx.get("https://app.scrapingbee.com/api/v1/", params=params)
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    for div in soup.select("div.g"):
        title_elem = div.select_one("h3")
        link_elem = div.select_one("a[href]")
        snippet_elem = div.select_one("div.VwiC3b")
        if title_elem and link_elem:
            results.append({
                "title": title_elem.get_text(strip=True),
                "url": link_elem["href"],
                "snippet": snippet_elem.get_text(strip=True) if snippet_elem else "",
            })
    return results
```
Pricing:
| Plan | Monthly Cost | API Credits | Cost per Basic Request |
|---|---|---|---|
| Freelance | $49 | 150,000 | $0.00033 |
| Startup | $99 | 500,000 | $0.00020 |
| Business | $249 | 2,000,000 | $0.00012 |
| Enterprise | Custom | Custom | Custom |
Credit costs vary by feature: basic HTML = 1 credit, JS rendering = 5 credits, premium proxy = 10-25 credits.
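Those multipliers compound, so it's worth computing what a plan's credit pool actually buys for your request mix. A small helper using the rates just listed; mapping "premium proxy" to 10 credits without JS and 25 with JS is my assumed reading of the stated 10-25 range.

```python
def scrapingbee_credits(render_js: bool = False, premium_proxy: bool = False) -> int:
    """Credits per request, per the rates above (premium range mapped by assumption)."""
    if premium_proxy:
        return 25 if render_js else 10
    return 5 if render_js else 1


def requests_per_plan(plan_credits: int, **options) -> int:
    """How many requests of one type a plan's credit pool covers."""
    return plan_credits // scrapingbee_credits(**options)


# Startup plan: 500,000 credits/month
print(requests_per_plan(500_000))                                      # 500000
print(requests_per_plan(500_000, render_js=True))                      # 100000
print(requests_per_plan(500_000, render_js=True, premium_proxy=True))  # 20000
```

The takeaway: a plan that covers half a million basic fetches covers only 20,000 JS-rendered premium-proxy requests, so estimate your mix before picking a tier.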
Strengths: - Simplest integration of any scraping service - Server-side data extraction (no local parsing needed) - Google Search API included - Screenshot capability - Good documentation
Weaknesses: - Per-request pricing gets expensive for JavaScript-heavy scraping - Limited customization compared to platform-based tools - No scheduling or data storage - Premium proxy credits burn fast
Best for: Teams that need scraping as a feature inside a larger application, not as the core product. If you want to add "import from URL" to your SaaS without building scraping infrastructure, ScrapingBee is the fastest path.
3. Bright Data — Enterprise Proxy Network
Bright Data (formerly Luminati) operates the largest residential proxy network in the world — over 72 million IPs. They've expanded from pure proxy services into full scraping solutions with their Scraping Browser, Web Unlocker, and pre-built datasets.
What makes it different: Raw proxy power. When other services get blocked, Bright Data's residential and mobile proxies usually still work. Their Scraping Browser runs a full Chromium instance routed through residential IPs, which defeats most fingerprinting.
Web Unlocker Example:
```python
import httpx


def scrape_with_unlocker(url: str) -> str:
    """Use Bright Data's Web Unlocker for anti-bot bypass."""
    proxy = "http://USERNAME:[email protected]:33335"
    with httpx.Client(proxy=proxy, timeout=60) as client:
        resp = client.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36",
        })
    return resp.text
```
Scraping Browser (Playwright + Residential Proxy):
```python
import asyncio

from playwright.async_api import async_playwright


async def scrape_with_bright_browser(url: str) -> str:
    """Use Bright Data's Scraping Browser — real Chrome + residential IP."""
    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp(
            "wss://USERNAME:[email protected]:9222"
        )
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content
```
Pricing:
| Product | Pricing Model | Approximate Cost |
|---|---|---|
| Datacenter Proxies | Per IP or per GB | $0.60/GB |
| Residential Proxies | Per GB | $8.40/GB |
| Mobile Proxies | Per GB | $24/GB |
| ISP Proxies | Per IP/month | $12/IP/month |
| Web Unlocker | Per request | $3/1000 requests |
| Scraping Browser | Per request | $8/1000 requests |
| SERP API | Per request | $3/1000 requests |
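Per-GB pricing means the bill tracks page weight, not page count. Here is a sketch for converting a crawl into an approximate bandwidth bill; the average page size is an assumption you should replace with measurements of your actual targets.

```python
def bandwidth_cost(pages: int, avg_page_kb: float, usd_per_gb: float) -> float:
    """Approximate proxy spend under per-GB pricing."""
    gigabytes = pages * avg_page_kb / (1024 * 1024)
    return gigabytes * usd_per_gb


# 1M pages of ~150 KB HTML through residential proxies at $8.40/GB
print(round(bandwidth_cost(1_000_000, 150, 8.40), 2))  # ≈ 1201.63
```

Blocking images, fonts, and other heavy assets directly shrinks this number, which is why the browser examples later in this guide abort those requests.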
Strengths: - Largest IP pool in the industry (72M+ residential IPs) - Multiple products for different needs (proxies, APIs, browser, datasets) - Best anti-bot bypass capabilities - Global geo-targeting (every country) - Enterprise-grade reliability and SLAs
Weaknesses: - Expensive — budget easily runs into thousands per month - Complex pricing with many products and tiers - Overkill for small-scale or occasional scraping - Setup complexity compared to simpler API services
Best for: Enterprise teams scraping at scale against aggressive anti-bot systems. If you're pulling millions of pages from sites that actively fight scrapers, Bright Data has the infrastructure. For smaller teams, the cost is hard to justify.
For proxy-only needs at smaller scale, ThorData offers residential and datacenter proxies at significantly lower rates — worth evaluating if proxies are your main bottleneck rather than full managed scraping.
4. Zyte — Managed Scrapy in the Cloud
Zyte (formerly Scrapinghub) is the company behind Scrapy, the most popular open-source scraping framework. Their platform lets you deploy Scrapy spiders to the cloud and adds AI-powered data extraction on top.
What makes it different: If your team already uses Scrapy, Zyte is the natural upgrade path. Their Zyte API uses machine learning to extract structured data from pages without writing custom selectors — point it at a product page and it returns structured product data.
Scrapy Spider Deployment:
```python
# settings.py for Scrapy Cloud deployment
SPIDER_MODULES = ["myproject.spiders"]
NEWSPIDER_MODULE = "myproject.spiders"

# Zyte proxy middleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware": 610,
}
ZYTE_SMARTPROXY_ENABLED = True
ZYTE_SMARTPROXY_APIKEY = "YOUR_API_KEY"
```
Zyte API for Automatic Extraction:
```python
import httpx

ZYTE_API_KEY = "YOUR_KEY"


def extract_product(url: str) -> dict:
    """Use Zyte API to automatically extract product data."""
    resp = httpx.post(
        "https://api.zyte.com/v1/extract",
        auth=(ZYTE_API_KEY, ""),
        json={
            "url": url,
            "product": True,
            "productOptions": {
                "extractFrom": "httpResponseBody",
            },
        },
    )
    data = resp.json()
    product = data.get("product", {})
    return {
        "name": product.get("name"),
        "price": product.get("price"),
        "currency": product.get("currency"),
        "availability": product.get("availability"),
        "description": product.get("description"),
        "brand": product.get("brand", {}).get("name"),
        "images": [img.get("url") for img in product.get("images", [])],
        "rating": product.get("aggregateRating", {}).get("ratingValue"),
    }


def extract_article(url: str) -> dict:
    """Automatically extract article data."""
    resp = httpx.post(
        "https://api.zyte.com/v1/extract",
        auth=(ZYTE_API_KEY, ""),
        json={
            "url": url,
            "article": True,
        },
    )
    data = resp.json()
    article = data.get("article", {})
    return {
        "headline": article.get("headline"),
        "author": article.get("author"),
        "date": article.get("datePublished"),
        "body": article.get("articleBody"),
    }
```
Pricing:
| Product | Cost | Notes |
|---|---|---|
| Scrapy Cloud Free | $0 | 1 concurrent crawl, limited storage |
| Scrapy Cloud Pro | From $25/mo | More crawls, longer retention |
| Zyte API (Products) | $3.50/1000 | AI-powered product extraction |
| Zyte API (Articles) | $1.80/1000 | AI-powered article extraction |
| Smart Proxy Manager | $29/mo+ | Auto-rotating proxy middleware |
Strengths: - Natural fit for Scrapy users - AI-powered automatic data extraction (no selectors needed) - Cloud deployment with scheduling - Built-in proxy management - Long track record (founded 2010)
Weaknesses: - Scrapy-centric — less useful if you don't use Scrapy - AutoExtract accuracy varies by site - Can be expensive for high-volume extraction - UI/dashboard feels dated compared to competitors
Best for: Teams already invested in Scrapy who want managed hosting. The Zyte API is also great if you need structured data from varied page layouts without writing custom parsers.
5. Crawlee — The Modern Open-Source Framework
Crawlee (from the Apify team) is the successor to Apify SDK's crawling capabilities, available as a standalone open-source framework. It provides a unified API for HTTP crawling and browser automation with built-in anti-detection.
What makes it different: It combines the best of Scrapy (crawl management, request queuing, data storage) with Playwright's browser automation, all in a modern API. Available for both JavaScript/TypeScript and Python.
Python Example (Playwright Crawler):
```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.proxy_configuration import ProxyConfiguration


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=100,
        headless=True,
        browser_type="chromium",
        # Built-in proxy rotation: ProxyConfiguration wraps the raw URL list
        proxy_configuration=ProxyConfiguration(
            proxy_urls=[
                "http://user:[email protected]:9000",
                "http://user:[email protected]:9000",
            ],
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext):
        page = context.page
        # Wait for content to load
        await page.wait_for_selector(".product-card")

        # Extract data
        products = await page.evaluate("""
            () => Array.from(document.querySelectorAll('.product-card')).map(card => ({
                title: card.querySelector('h2')?.textContent?.trim(),
                price: card.querySelector('.price')?.textContent?.trim(),
                url: card.querySelector('a')?.href,
            }))
        """)

        # Store results
        await context.push_data(products)

        # Follow pagination (no-op if the selector matches nothing)
        await context.enqueue_links(selector="a.next-page")

    await crawler.run(["https://example.com/products"])

    # Export data
    data = await crawler.get_data()
    print(f"Scraped {len(data.items)} products")


asyncio.run(main())
```
HTTP-only Crawler (Faster, Cheaper):
```python
import asyncio

from bs4 import BeautifulSoup
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main():
    crawler = HttpCrawler(
        max_requests_per_crawl=500,
        max_concurrency=10,
    )

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext):
        soup = BeautifulSoup(context.http_response.read(), "html.parser")
        for item in soup.select(".product"):
            title = item.select_one("h3")
            price = item.select_one(".price")
            await context.push_data({
                "title": title.get_text(strip=True) if title else "",
                "price": price.get_text(strip=True) if price else "",
            })
        # Auto-enqueue pagination links
        await context.enqueue_links(selector="a.pagination")

    await crawler.run(["https://example.com/catalog"])


asyncio.run(main())
```
Pricing: Free (open-source). You pay only for your own infrastructure + proxies.
Strengths: - Modern, well-designed API - Unified HTTP and browser crawling - Built-in anti-detection (fingerprint rotation, session management) - Auto-scaling concurrency - Request queue persistence (survives crashes) - Can deploy to Apify cloud if needed
Weaknesses: - Newer than Scrapy — smaller community - Python SDK less mature than JS/TS version - You still manage your own infrastructure (unless using Apify) - Documentation is good but still growing
Best for: New projects that want modern tooling without the baggage of Scrapy's older architecture. Great for teams comfortable managing their own infrastructure.
6. ScraperAPI — ScrapingBee Alternative
ScraperAPI is functionally similar to ScrapingBee with some differences in pricing and features.
Usage:
import httpx
def scrape_via_scraperapi(url: str, render: bool = False) -> str:
resp = httpx.get("https://api.scraperapi.com/", params={
"api_key": "YOUR_KEY",
"url": url,
"render": str(render).lower(),
})
return resp.text
# Structured data endpoint
def get_amazon_product(asin: str) -> dict:
"""ScraperAPI has dedicated endpoints for common targets."""
resp = httpx.get("https://api.scraperapi.com/structured/amazon/product", params={
"api_key": "YOUR_KEY",
"asin": asin,
"country": "us",
})
return resp.json()
Pricing: Starts at $49/month for 100,000 API credits. Structured data endpoints cost more per request.
Strengths: Dedicated endpoints for Amazon, Google, Walmart. Geographic rotation. Weaknesses: Similar limitations to ScrapingBee. Structured endpoints are limited in scope.
7. ZenRows — Anti-Bot Focus
ZenRows specializes in bypassing anti-bot systems, with built-in support for handling CAPTCHAs, JavaScript rendering, and residential proxy rotation.
Usage:
import httpx
def scrape_protected_site(url: str) -> str:
"""Zenrows handles anti-bot automatically."""
resp = httpx.get("https://api.zenrows.com/v1/", params={
"apikey": "YOUR_KEY",
"url": url,
"js_render": "true",
"antibot": "true",
"premium_proxy": "true",
})
return resp.text
Strengths: Strong anti-bot bypass, automatic CAPTCHA handling. Weaknesses: Premium features are expensive. Newer service with less track record.
DIY Approach: httpx + Playwright + Proxy Provider
Sometimes the right answer is no scraping service at all. You write your own code and use a proxy provider for IP rotation.
When DIY Makes Sense
- You have engineering resources and scraping expertise
- You scrape stable targets that don't change often
- You need maximum control over every aspect
- You want to avoid vendor lock-in
- Cost optimization matters (DIY is cheapest at scale)
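The cost argument can be made concrete with a simple break-even model. All numbers here are assumptions: a $0.002/request managed API, $2.80/GB proxies, ~100 KB pages, and a flat monthly maintenance figure standing in for engineering time.

```python
def managed_cost(pages: int, usd_per_request: float) -> float:
    """Managed API: pay per request."""
    return pages * usd_per_request


def diy_cost(pages: int, avg_page_kb: float, usd_per_gb: float,
             monthly_maintenance_usd: float) -> float:
    """DIY: proxy bandwidth plus a fixed upkeep cost (retries, breakage, monitoring)."""
    gigabytes = pages * avg_page_kb / (1024 * 1024)
    return gigabytes * usd_per_gb + monthly_maintenance_usd


for pages in (100_000, 500_000, 2_000_000):
    print(pages,
          round(managed_cost(pages, 0.002)),
          round(diy_cost(pages, 100, 2.8, 500)))
```

Under these assumptions DIY breaks even somewhere around 300K pages/month; below that, the fixed maintenance cost dominates and a managed API is cheaper.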
Complete DIY Stack
"""
Complete DIY scraping stack:
- httpx for HTTP requests
- Playwright for JavaScript-heavy sites
- ThorData for proxy rotation
- SQLite for data storage
- asyncio for concurrency
"""
import httpx
import asyncio
import sqlite3
import json
import random
import time
from datetime import datetime
from pathlib import Path
from bs4 import BeautifulSoup
class DIYScraper:
"""Production-ready DIY scraper with proxy rotation and storage."""
def __init__(
self,
proxy_url: str,
db_path: str = "scraping_data.db",
max_concurrent: int = 5,
delay_range: tuple = (1, 3),
):
self.proxy_url = proxy_url
self.db_path = db_path
self.max_concurrent = max_concurrent
self.delay_range = delay_range
self.semaphore = asyncio.Semaphore(max_concurrent)
self.stats = {"success": 0, "failed": 0, "blocked": 0}
self._init_db()
def _init_db(self):
"""Initialize SQLite database for results."""
conn = sqlite3.connect(self.db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS results (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT NOT NULL,
data TEXT,
scraped_at TEXT,
status TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS errors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT,
error TEXT,
timestamp TEXT
)
""")
conn.commit()
conn.close()
async def scrape_url(self, url: str, parse_fn) -> dict | None:
"""Scrape a single URL with retry logic."""
async with self.semaphore:
for attempt in range(3):
try:
async with httpx.AsyncClient(
proxy=self.proxy_url,
timeout=30,
follow_redirects=True,
) as client:
resp = await client.get(url, headers={
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
})
if resp.status_code == 200:
data = parse_fn(resp.text)
self._save_result(url, data, "success")
self.stats["success"] += 1
return data
elif resp.status_code in (403, 429, 503):
self.stats["blocked"] += 1
wait = (2 ** attempt) + random.random()
await asyncio.sleep(wait)
continue
else:
self._save_error(url, f"HTTP {resp.status_code}")
self.stats["failed"] += 1
return None
except Exception as e:
if attempt == 2:
self._save_error(url, str(e))
self.stats["failed"] += 1
return None
await asyncio.sleep(2 ** attempt)
finally:
await asyncio.sleep(random.uniform(*self.delay_range))
return None
async def scrape_batch(self, urls: list[str], parse_fn) -> list[dict]:
"""Scrape multiple URLs concurrently."""
tasks = [self.scrape_url(url, parse_fn) for url in urls]
results = await asyncio.gather(*tasks)
print(f"\nScraping complete:")
print(f" Success: {self.stats['success']}")
print(f" Blocked: {self.stats['blocked']}")
print(f" Failed: {self.stats['failed']}")
return [r for r in results if r is not None]
def _save_result(self, url: str, data: dict, status: str):
conn = sqlite3.connect(self.db_path)
conn.execute(
"INSERT INTO results (url, data, scraped_at, status) VALUES (?, ?, ?, ?)",
(url, json.dumps(data), datetime.utcnow().isoformat(), status),
)
conn.commit()
conn.close()
def _save_error(self, url: str, error: str):
conn = sqlite3.connect(self.db_path)
conn.execute(
"INSERT INTO errors (url, error, timestamp) VALUES (?, ?, ?)",
(url, error, datetime.utcnow().isoformat()),
)
conn.commit()
conn.close()
# Usage example
def parse_product(html: str) -> dict:
soup = BeautifulSoup(html, "html.parser")
return {
"title": soup.select_one("h1").get_text(strip=True) if soup.select_one("h1") else "",
"price": soup.select_one(".price").get_text(strip=True) if soup.select_one(".price") else "",
}
async def main():
# Using ThorData residential proxies
scraper = DIYScraper(
proxy_url="http://user:[email protected]:9000",
max_concurrent=5,
delay_range=(2, 5),
)
urls = [f"https://store.example.com/product/{i}" for i in range(1, 101)]
results = await scraper.scrape_batch(urls, parse_product)
print(f"Got {len(results)} products")
asyncio.run(main())
DIY with Playwright for JavaScript Sites
```python
import asyncio

from playwright.async_api import async_playwright


class BrowserScraper:
    """DIY browser scraper with anti-detection measures."""

    def __init__(self, proxy_url: str, max_browsers: int = 3):
        self.proxy_url = proxy_url
        self.max_browsers = max_browsers
        self.semaphore = asyncio.Semaphore(max_browsers)

    async def scrape(self, url: str, extract_js: str) -> dict | None:
        """Scrape a JS-heavy page with Playwright."""
        async with self.semaphore:
            # Split "http://user:pass@host:port" into Playwright's proxy fields
            creds, _, server = self.proxy_url.removeprefix("http://").partition("@")
            username, _, password = creds.partition(":")
            async with async_playwright() as p:
                browser = await p.chromium.launch(
                    headless=True,
                    proxy={
                        "server": f"http://{server}",
                        "username": username,
                        "password": password,
                    },
                )
                context = await browser.new_context(
                    viewport={"width": 1920, "height": 1080},
                    user_agent=(
                        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/131.0.0.0 Safari/537.36"
                    ),
                    locale="en-US",
                )
                # Block heavy resources to save proxy bandwidth
                page = await context.new_page()
                await page.route(
                    "**/*.{png,jpg,gif,svg,woff,woff2}",
                    lambda route: route.abort(),
                )
                try:
                    await page.goto(url, wait_until="networkidle", timeout=30000)
                    data = await page.evaluate(extract_js)
                    return data
                except Exception as e:
                    print(f"Error scraping {url}: {e}")
                    return None
                finally:
                    await browser.close()


# Usage
scraper = BrowserScraper(
    proxy_url="http://user:[email protected]:9000"
)

extract_script = """
() => {
    const products = [];
    document.querySelectorAll('.product-card').forEach(card => {
        products.push({
            title: card.querySelector('h2')?.textContent?.trim(),
            price: card.querySelector('.price')?.textContent?.trim(),
            url: card.querySelector('a')?.href,
        });
    });
    return products;
}
"""

results = asyncio.run(scraper.scrape("https://spa-store.com/products", extract_script))
```
Error Handling and CAPTCHA Strategies
Detecting and Handling Blocks
```python
import asyncio

import httpx


def detect_block(resp: httpx.Response) -> str | None:
    """Detect if a response is a block page, not real content."""
    # Status code checks
    if resp.status_code == 403:
        return "forbidden"
    if resp.status_code == 429:
        return "rate_limited"
    if resp.status_code == 503:
        return "service_unavailable"

    content = resp.text.lower()

    # Cloudflare
    if "checking your browser" in content or "cf-browser-verification" in content:
        return "cloudflare_challenge"
    # Generic CAPTCHA
    if "captcha" in content and ("recaptcha" in content or "hcaptcha" in content):
        return "captcha"
    # Access denied pages
    if any(phrase in content for phrase in [
        "access denied",
        "access to this page has been denied",
        "bot detected",
        "automated access",
        "unusual traffic",
    ]):
        return "access_denied"
    # PerimeterX
    if "perimeterx" in content or "px-captcha" in content:
        return "perimeterx"
    # DataDome
    if "datadome" in content:
        return "datadome"
    # Empty or suspiciously small response
    if len(resp.text) < 500 and resp.status_code == 200:
        return "empty_response"

    return None  # Not blocked


async def scrape_with_block_handling(
    url: str,
    proxy_url: str,
    max_retries: int = 3,
) -> dict:
    """Scrape with intelligent block detection and retry."""
    for attempt in range(max_retries):
        async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
            resp = await client.get(url, headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                              "AppleWebKit/537.36",
            })
        block_type = detect_block(resp)
        if block_type is None:
            return {"status": "ok", "html": resp.text}
        if block_type == "rate_limited":
            wait = 30 * (attempt + 1)
            print(f"Rate limited, waiting {wait}s...")
            await asyncio.sleep(wait)
            continue
        if block_type in ("cloudflare_challenge", "captcha"):
            print(f"Challenge detected: {block_type}")
            # Switch to browser-based approach
            return {"status": "needs_browser", "block_type": block_type}
        if block_type == "forbidden":
            print("IP blocked, need different proxy")
            return {"status": "blocked", "block_type": block_type}
    return {"status": "failed", "attempts": max_retries}
```
CAPTCHA Solving Integration
```python
import asyncio

import httpx


async def solve_captcha_2captcha(
    api_key: str,
    site_key: str,
    page_url: str,
    captcha_type: str = "recaptcha_v2",
) -> str | None:
    """Solve CAPTCHAs via the 2Captcha service."""
    method_map = {
        "recaptcha_v2": "userrecaptcha",
        "recaptcha_v3": "userrecaptcha",
        "hcaptcha": "hcaptcha",
    }
    method = method_map.get(captcha_type)
    if not method:
        return None

    async with httpx.AsyncClient() as client:
        # Submit
        submit_data = {
            "key": api_key,
            "method": method,
            "json": 1,
        }
        if captcha_type == "hcaptcha":
            submit_data["sitekey"] = site_key
            submit_data["pageurl"] = page_url
        else:
            submit_data["googlekey"] = site_key
            submit_data["pageurl"] = page_url
            if captcha_type == "recaptcha_v3":
                submit_data["version"] = "v3"
                submit_data["min_score"] = "0.3"

        resp = await client.post("https://2captcha.com/in.php", data=submit_data)
        result = resp.json()
        if result.get("status") != 1:
            return None
        task_id = result["request"]

        # Poll for solution
        for _ in range(60):
            await asyncio.sleep(5)
            resp = await client.get("https://2captcha.com/res.php", params={
                "key": api_key,
                "action": "get",
                "id": task_id,
                "json": 1,
            })
            result = resp.json()
            if result.get("status") == 1:
                return result["request"]
            # "CAPCHA_NOT_READY" (sic) is the service's actual status string
            if result.get("request") != "CAPCHA_NOT_READY":
                return None  # Error
    return None  # Timeout
```
Real-World Use Cases
E-Commerce Price Monitoring
```python
import asyncio
import json
from datetime import datetime

import httpx
from bs4 import BeautifulSoup


async def monitor_prices(
    product_urls: list[str],
    proxy_url: str,
    output_file: str = "prices.jsonl",
) -> dict:
    """
    Daily price monitoring across e-commerce sites.
    Uses residential proxies for Amazon/Walmart, datacenter for others.
    """
    results = {"total": len(product_urls), "success": 0, "failed": 0}
    async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
        for url in product_urls:
            try:
                resp = await client.get(url, headers={
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                  "AppleWebKit/537.36",
                })
                if resp.status_code == 200:
                    # Try JSON-LD first
                    soup = BeautifulSoup(resp.text, "html.parser")
                    price_data = None
                    for script in soup.find_all("script", type="application/ld+json"):
                        try:
                            ld = json.loads(script.string)
                            if ld.get("@type") == "Product":
                                offers = ld.get("offers", {})
                                if isinstance(offers, list):
                                    offers = offers[0]
                                price_data = {
                                    "url": url,
                                    "name": ld.get("name"),
                                    "price": offers.get("price"),
                                    "currency": offers.get("priceCurrency"),
                                    "availability": offers.get("availability"),
                                    "timestamp": datetime.utcnow().isoformat(),
                                }
                                break
                        except (json.JSONDecodeError, TypeError):
                            # TypeError covers empty <script> tags (script.string is None)
                            continue
                    if price_data:
                        with open(output_file, "a") as f:
                            f.write(json.dumps(price_data) + "\n")
                        results["success"] += 1
                    else:
                        results["failed"] += 1
                else:
                    results["failed"] += 1
                await asyncio.sleep(2)
            except Exception:
                results["failed"] += 1
    return results
```
SEO and SERP Monitoring
```python
import asyncio
import random
from datetime import datetime

import httpx
from bs4 import BeautifulSoup


async def track_serp_rankings(
    keywords: list[str],
    domain: str,
    proxy_url: str,
) -> list[dict]:
    """
    Track search engine rankings for keywords.
    Requires residential proxies for Google.
    """
    rankings = []
    async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
        for keyword in keywords:
            try:
                resp = await client.get(
                    "https://www.google.com/search",
                    params={"q": keyword, "num": 100},
                    headers={
                        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                      "AppleWebKit/537.36",
                        "Accept-Language": "en-US,en;q=0.9",
                    },
                )
                if resp.status_code == 200:
                    soup = BeautifulSoup(resp.text, "html.parser")
                    position = None
                    for i, result in enumerate(soup.select("div.g"), 1):
                        link = result.select_one("a[href]")
                        if link and domain in link.get("href", ""):
                            position = i
                            break
                    rankings.append({
                        "keyword": keyword,
                        "position": position,
                        "date": datetime.utcnow().isoformat(),
                    })
                # Important: don't hammer Google
                await asyncio.sleep(5 + random.uniform(0, 5))
            except Exception:
                rankings.append({
                    "keyword": keyword,
                    "position": None,
                    "error": True,
                })
    return rankings
```
Research Data Collection
```python
async def collect_academic_data(
    search_terms: list[str],
    proxy_url: str = "",  # Datacenter is fine for most academic sources
) -> list[dict]:
    """
    Collect research papers from public academic APIs.
    Most academic sources are bot-friendly — datacenter proxies work.
    """
    papers = []
    client_kwargs = {"timeout": 30}
    if proxy_url:
        client_kwargs["proxy"] = proxy_url
    async with httpx.AsyncClient(**client_kwargs) as client:
        for term in search_terms:
            # OpenAlex API (free, no key needed)
            try:
                resp = await client.get(
                    "https://api.openalex.org/works",
                    params={
                        "search": term,
                        "per-page": 25,  # OpenAlex uses a hyphenated param name
                        "sort": "cited_by_count:desc",
                    },
                )
                if resp.status_code == 200:
                    data = resp.json()
                    for work in data.get("results", []):
                        papers.append({
                            "title": work.get("title"),
                            "doi": work.get("doi"),
                            "year": work.get("publication_year"),
                            "citations": work.get("cited_by_count"),
                            "source": "openalex",
                        })
            except Exception:
                pass
            await asyncio.sleep(1)
    return papers
```
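Because one paper often matches several search terms, a de-duplication pass on the collected list is usually worthwhile. A minimal sketch, using the same field names as the dicts built above (the `dedupe_papers` helper is an illustration, not part of the collector):

```python
def dedupe_papers(papers: list[dict]) -> list[dict]:
    """Keep the first occurrence of each DOI; papers without a DOI are kept as-is."""
    seen: set[str] = set()
    unique = []
    for paper in papers:
        doi = paper.get("doi")
        if doi is None:
            unique.append(paper)  # no stable key to dedupe on
        elif doi not in seen:
            seen.add(doi)
            unique.append(paper)
    return unique

papers = [
    {"title": "A", "doi": "10.1/a"},
    {"title": "A again", "doi": "10.1/a"},
    {"title": "B", "doi": None},
]
print(len(dedupe_papers(papers)))  # → 2
```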
Comprehensive Decision Matrix
| Feature | Apify | ScrapingBee | Bright Data | Zyte | Crawlee | DIY + ThorData |
|---|---|---|---|---|---|---|
| Free tier | $5/mo credits | Trial only | Trial only | Limited | Open source | Proxy cost only |
| JS rendering | Yes (actors) | Yes (API) | Yes (browser) | Yes | Yes (Playwright) | Yes (Playwright) |
| Anti-bot bypass | Actor-dependent | Good | Excellent | Good | Good | Manual + proxy |
| Ease of setup | Medium | Very easy | Medium | Medium | Medium | Hard |
| Best scale | Medium-large | Medium | Very large | Medium-large | Any | Any |
| Pricing model | Compute time | Per request | Per GB/request | Per extraction | Free (infra cost) | Per GB (proxy) |
| Customization | High (SDK) | Low (API) | Medium | High (Scrapy) | High | Full control |
| Data storage | Included | None | Datasets available | Included | Local/custom | Custom |
| Scheduling | Included | None (use cron) | Available | Included | Custom | Custom |
| Community | Large (actors) | None | None | Scrapy community | Growing | Python ecosystem |
| Monthly cost (10K pages) | $5-20 | $10-50 | $50-200 | $15-35 | $0 + proxy | $5-20 (proxy) |
| Monthly cost (1M pages) | $200-500 | $500-2000 | $2000-8000 | $500-1500 | $0 + proxy | $200-800 (proxy) |
Choosing the Right Tool: Decision Framework
Step 1: How Many Sites Are You Scraping?
- 1-3 sites: DIY or ScrapingBee. Managed services are overkill.
- 4-20 sites: Apify or Crawlee. You need reusable scraper patterns.
- 20+ sites: Zyte (AutoExtract) or Apify marketplace. Writing custom selectors for 50 sites is unmaintainable.
Step 2: Do Your Targets Use Anti-Bot Protection?
- No protection: DIY with datacenter proxies. Cheapest possible approach.
- Basic protection: ScrapingBee or DIY with ThorData residential proxies.
- Aggressive protection (Cloudflare, DataDome, PerimeterX): Bright Data Scraping Browser or Zenrows.
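If you don't know which tier a target falls into, response headers give a rough first signal. A minimal heuristic sketch: the header names below are commonly observed fingerprints, not an exhaustive or guaranteed detection method, and the `guess_protection` helper is hypothetical.

```python
def guess_protection(headers: dict[str, str]) -> str:
    """
    Rough heuristic: infer an anti-bot vendor from response headers.
    Vendors change these signals; treat the result as a hint only.
    """
    h = {k.lower(): v.lower() for k, v in headers.items()}
    if "cf-ray" in h or "cloudflare" in h.get("server", ""):
        return "cloudflare"
    if any(k.startswith("x-datadome") for k in h):
        return "datadome"
    return "unknown"

print(guess_protection({"Server": "cloudflare", "CF-RAY": "abc123"}))  # → cloudflare
```

Running a plain `httpx.get` against the target and inspecting `resp.headers` this way takes a minute and can save you from overpaying for residential proxies on an unprotected site.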
Step 3: What's Your Budget?
- $0-50/month: Apify free tier, Crawlee + cheap proxies, or DIY.
- $50-500/month: Any of the services work. Pick based on ease of use vs. control.
- $500+/month: At this budget, DIY with ThorData proxies is often cheaper than managed services while giving full control.
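The budget tiers hide a simple break-even calculation: managed APIs price per request, proxies price per GB, so the crossover depends on your average page size. A back-of-the-envelope sketch, where the per-unit prices are illustrative assumptions rather than quotes from any vendor:

```python
def monthly_cost_api(pages: int, price_per_1k_requests: float) -> float:
    """Managed API: flat price per 1,000 successful requests."""
    return pages / 1000 * price_per_1k_requests

def monthly_cost_diy(pages: int, avg_page_kb: float, price_per_gb: float) -> float:
    """DIY: proxy bandwidth only (ignores your own compute and engineering time)."""
    gb = pages * avg_page_kb / 1024 / 1024
    return gb * price_per_gb

# Illustrative assumptions: $2 per 1k requests, 200 KB pages, $5/GB residential traffic.
for pages in (10_000, 1_000_000):
    api = monthly_cost_api(pages, 2.0)
    diy = monthly_cost_diy(pages, 200, 5.0)
    print(f"{pages:>9,} pages: API ${api:,.0f} vs DIY proxy ${diy:,.0f}")
```

Under these assumptions DIY wins on raw dollars at both scales, but the gap only justifies the engineering overhead once volume is high, which is why the $500+/month tier is where DIY starts to make sense.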
Step 4: What's Your Team's Expertise?
- No scraping experience: ScrapingBee (simplest API) or Apify (marketplace has pre-built scrapers).
- Python developers: Crawlee, Scrapy/Zyte, or DIY.
- JavaScript/TypeScript team: Crawlee (JS SDK is more mature), Apify.
- Data engineers: Zyte AutoExtract (minimal code, structured output).
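The four steps can be folded into a quick first-pass helper. This is just the framework above transcribed into code with coarse inputs; the `recommend_tool` function is a sketch to make the decision order explicit, not a substitute for evaluating your actual targets:

```python
def recommend_tool(
    num_sites: int,
    anti_bot: str,          # "none" | "basic" | "aggressive"
    monthly_budget: float,  # USD
    team: str,              # "none" | "python" | "javascript" | "data"
) -> str:
    """Map the four framework questions to a first-pass recommendation."""
    if anti_bot == "aggressive":
        return "Bright Data Scraping Browser or Zenrows"
    if team == "none":
        return "ScrapingBee or Apify marketplace"
    if num_sites <= 3:
        return "DIY or ScrapingBee"
    if num_sites > 20:
        return "Zyte AutoExtract or Apify marketplace"
    if monthly_budget >= 500:
        return "DIY with residential proxies"
    return "Apify or Crawlee"

print(recommend_tool(2, "basic", 100, "python"))  # → DIY or ScrapingBee
```

Note the ordering: anti-bot protection trumps everything else, because no amount of clean code helps if every request gets blocked.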
The Bottom Line
Start with Apify if you want the broadest capability at the lowest entry cost. The actor marketplace means you're rarely starting from scratch, and the free tier lets you validate before spending.
Use ScrapingBee if you need scraping as a feature, not a project. Its API-first design integrates cleanly into existing applications.
Go Bright Data only if you're operating at enterprise scale against hardened targets where other solutions get blocked.
Choose Zyte if your team already knows Scrapy and wants to offload infrastructure, or if you need automatic data extraction across many different page layouts.
Use Crawlee for new projects that want modern, well-designed tooling without platform lock-in.
Build it yourself with ThorData proxies when you want maximum control, minimum vendor lock-in, and the lowest per-page cost at scale. This is the right choice for experienced teams who scrape as a core part of their business.
The scraping landscape evolves fast. Services come and go, anti-bot systems get smarter, and new tools emerge. The framework above — understanding your targets, budget, and team — will help you make the right choice regardless of which specific tools are available.