10 Python Web Scraping Mistakes That Break Your Scraper (and How to Fix Them)
Web scraping is one of those disciplines that looks deceptively simple from the outside. You write fifteen lines of Python, call requests.get(), parse some HTML, and pat yourself on the back. Then your scraper runs in production for two days, silently fills a database with garbage, crashes at 3 AM on the most important run of the month, and you spend a week figuring out why a script that "worked fine locally" fell apart completely at scale.
I have debugged enough broken scrapers — my own and other people's — to see the same patterns emerge every single time. The same ten mistakes, over and over, in projects ranging from a weekend side hustle to production pipelines processing millions of pages a day.
The frustrating part is that none of these are hard mistakes to fix. They're not deep algorithmic problems or obscure library bugs. They're the kind of thing that seems obvious in retrospect and invisible when you're in the middle of writing the code. The time.sleep() in an async function that blocks the entire event loop for hours. The selector that worked perfectly against yesterday's version of the site. The retry loop that retries a 403 seventeen times before giving up, burning your entire rate limit on requests that were never going to succeed.
What makes these mistakes expensive isn't just the bugs they cause — it's the way they compound. A scraper with bad headers gets blocked. The blocked scraper retries with exponential backoff. The retries hit a rate limit. The rate limit triggers a temporary IP ban. Now the scraper is logging thousands of errors, consuming your entire proxy budget on failed requests, and you've lost the data window you needed. What started as a missing Accept-Language header turned into a two-hour outage.
This guide goes through each of the ten most common mistakes in detail — what causes them, what they look like in practice, and how to fix them in a way that actually holds up at scale. I'll cover everything from the basic header problems that catch 80% of hobby scrapers to the more subtle issues with encoding, async patterns, and retry logic that trip up experienced engineers.
By the end, you'll have a mental checklist you can run through before every scraper you write. Not every item will apply every time — a quick one-off script to grab data from a single page doesn't need multi-selector fallbacks and session caching. But for anything you're going to run repeatedly, at scale, or in production, these are the guardrails that keep you sane.
Let's start with the most common one, and work our way toward the subtle ones.
1. Not Sending Browser Headers
The default requests User-Agent is python-requests/2.31.0 (or whatever version you happen to have installed). Every website on the planet — and certainly every anti-bot system — knows that string means a script, not a browser. You'll get blocked before you even parse a single <div>.
But it's not just the User-Agent. A real browser sends a full set of headers on every request. Missing headers are as suspicious as wrong ones. If you only set User-Agent but omit Accept-Language, Accept-Encoding, and Sec-Fetch-* headers, fingerprinting systems will still flag you.
Fix: Send a complete, realistic header set.
import requests
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Cache-Control": "max-age=0",
}
session = requests.Session()
session.headers.update(HEADERS)
resp = session.get("https://example.com/products")
resp.raise_for_status()
Using a Session object is better than passing headers to each requests.get() call — the session persists cookies, headers, and connection pooling across all requests, making the browsing pattern look more natural.
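You can see the header merging without any network traffic by preparing a request through the session and inspecting what it would send:

```python
import requests

# Headers set once on the session are merged into every request
# prepared through it — no need to repeat them per call.
session = requests.Session()
session.headers.update({"Accept-Language": "en-US,en;q=0.9"})

prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/products")
)
print(prepared.headers["Accept-Language"])  # en-US,en;q=0.9
```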
For anything beyond occasional scraping, rotating IPs through a residential proxy service like ThorData is essential. Headers alone won't save you at scale — an IP that has sent ten thousand requests in an hour is suspicious regardless of how good the headers look.
proxies = {
"http": "http://username:[email protected]:9000",
"https": "http://username:[email protected]:9000",
}
resp = session.get("https://example.com/products", proxies=proxies)
2. Hardcoding CSS Selectors Without Fallbacks
You write soup.select_one("div.product-card > span.price") and it works perfectly on Tuesday. The site ships a redesign on Wednesday, your selector matches nothing, and your scraper silently returns empty data for three weeks before you notice it in a quarterly data audit.
Sites change their markup constantly — A/B tests, redesigns, framework migrations, minor tweaks. A scraper that relies on a single brittle selector will break, and it will break at the worst possible time.
Fix: Use multiple selector fallbacks and fail loudly when none match.
from bs4 import BeautifulSoup
import requests
def extract_price(soup: BeautifulSoup, url: str) -> str:
"""Try multiple selectors in priority order, raise if nothing matches."""
selectors = [
"span.price",
"div.product-price",
"[data-testid='price']",
"[itemprop='price']",
".price-box .final-price",
]
for sel in selectors:
element = soup.select_one(sel)
if element:
return element.get_text(strip=True)
raise ValueError(
f"No price element found on {url}. "
f"Tried selectors: {selectors}. "
f"HTML snippet: {str(soup.body)[:500]}"
)
resp = requests.get("https://example.com/product/123", headers=HEADERS)
soup = BeautifulSoup(resp.text, "lxml")
price = extract_price(soup, resp.url)
The key additions here: the error message includes the URL, the list of selectors tried, and a snippet of the actual HTML. When this breaks at 3 AM, you'll know exactly what to look at instead of guessing.
For structured data, also check for JSON-LD before parsing HTML — many e-commerce sites embed product data in <script type="application/ld+json"> blocks, which are far more stable than CSS selectors:
import json
def extract_structured_data(soup: BeautifulSoup) -> dict | list | None:
    script = soup.find("script", type="application/ld+json")
    if script:
        try:
            # script.string can be None for empty tags; json.loads("") raises JSONDecodeError
            return json.loads(script.string or "")
        except json.JSONDecodeError:
            pass
    return None
3. Using time.sleep() in Async Code
This one is subtle and devastatingly expensive. time.sleep(2) inside an async def function blocks the entire Python event loop. Every coroutine in your program freezes while it waits. Your "concurrent" scraper that was supposed to process 50 URLs in parallel is now sequential — and slower than a synchronous scraper because of the async overhead.
The really painful part: it works fine in testing because you're only testing with 3-5 URLs. The performance disaster only becomes apparent when you scale up.
Fix: Use asyncio.sleep() in async code.
import asyncio
import httpx
import random
# WRONG — blocks the entire event loop
async def fetch_bad(url: str, client: httpx.AsyncClient) -> str:
resp = await client.get(url)
import time
time.sleep(2) # This FREEZES everything else
return resp.text
# CORRECT — yields control back to the event loop
async def fetch_good(url: str, client: httpx.AsyncClient) -> str:
resp = await client.get(url)
await asyncio.sleep(random.uniform(1.0, 3.0)) # Other coroutines run during this
return resp.text
async def scrape_many(urls: list[str]) -> list[str]:
async with httpx.AsyncClient(headers=HEADERS) as client:
tasks = [fetch_good(url, client) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
# Filter out errors
return [r for r in results if isinstance(r, str)]
# Run it
urls = [f"https://example.com/page/{i}" for i in range(50)]
results = asyncio.run(scrape_many(urls))
print(f"Fetched {len(results)} pages")
Also watch for time.sleep() hidden inside helper functions that get called from async code — the bug isn't always visible at the call site.
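If the blocking call lives in third-party code you can't edit, asyncio.to_thread() (Python 3.9+) runs it in a worker thread so the event loop stays responsive. The legacy_fetch helper below is a stand-in for such code:

```python
import asyncio
import time

def legacy_fetch(url: str) -> str:
    # Stand-in for a third-party helper that blocks internally
    time.sleep(1)
    return f"data from {url}"

async def scrape_all(urls: list[str]) -> list[str]:
    # Each blocking call runs in its own thread; the loop keeps scheduling
    # others, so three 1-second calls finish in about 1 second, not 3.
    return await asyncio.gather(*(asyncio.to_thread(legacy_fetch, u) for u in urls))

results = asyncio.run(scrape_all(["a", "b", "c"]))
print(results)
```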
4. Not Handling None from .find()
BeautifulSoup's .find() returns None when it doesn't match. Then you call .text on None and get an AttributeError. This is the single most common crash I see in scrapers. It's also the most avoidable.
The problem compounds because the crash happens during data extraction, not data fetching. If you've already done the expensive work of fetching 10,000 pages and you crash during parsing on page 7,432, you may have to start over.
Fix: Always guard .find() results, and do it consistently.
from dataclasses import dataclass
from typing import Optional
@dataclass
class Product:
name: Optional[str]
price: Optional[str]
rating: Optional[str]
in_stock: bool
def parse_product(soup: BeautifulSoup, url: str) -> Product:
def safe_text(selector: str, default: Optional[str] = None) -> Optional[str]:
"""Find element, return its text or default if not found."""
element = soup.select_one(selector)
return element.get_text(strip=True) if element else default
name = safe_text("h1.product-title")
price = safe_text("span.price")
rating = safe_text("[data-testid='rating']")
stock_el = soup.find(class_="stock-status")
in_stock = stock_el.get_text(strip=True).lower() == "in stock" if stock_el else False
if name is None and price is None:
import logging
logging.warning(f"Got empty product on {url} — possible bot detection")
return Product(name=name, price=price, rating=rating, in_stock=in_stock)
The safe_text helper function eliminates the pattern repetition. You write the guard logic once, and the rest of the parsing code stays clean.
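One more guard against the page-7,432 problem: persist the raw HTML before parsing. If extraction crashes, you re-run the parser against files on disk instead of re-fetching everything. A minimal sketch, where the save_raw name and raw_pages directory are arbitrary choices:

```python
import hashlib
from pathlib import Path

def save_raw(html: str, url: str, out_dir: str = "raw_pages") -> Path:
    """Write fetched HTML to disk, keyed by a hash of the URL."""
    directory = Path(out_dir)
    directory.mkdir(exist_ok=True)
    path = directory / f"{hashlib.sha256(url.encode()).hexdigest()}.html"
    path.write_text(html, encoding="utf-8")
    return path

saved = save_raw("<html><body>example</body></html>", "https://example.com/product/1")
print(saved.name)
```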
5. Re-requesting Pages You Already Have
Your scraper runs daily. It fetches 10,000 pages. 9,500 of them haven't changed since yesterday. You're wasting bandwidth, hammering the target server unnecessarily, burning proxy quota, and taking 10x longer than needed.
Beyond cost and performance, excessive requests are one of the clearest signals to anti-bot systems that something automated is happening. A real user doesn't request the same catalog of 10,000 pages every day.
Fix: Cache responses with content hashing for change detection.
import hashlib
import random
import sqlite3
import time
class ResponseCache:
def __init__(self, db_path: str = "scraper_cache.db", ttl_hours: int = 24):
self.db_path = db_path
self.ttl_seconds = ttl_hours * 3600
self._init_db()
def _init_db(self):
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS cache (
url_hash TEXT PRIMARY KEY,
url TEXT,
html TEXT,
content_hash TEXT,
fetched_at REAL,
changed INTEGER DEFAULT 0
)
""")
def get(self, url: str) -> str | None:
key = hashlib.sha256(url.encode()).hexdigest()
with sqlite3.connect(self.db_path) as conn:
row = conn.execute(
"SELECT html, fetched_at FROM cache WHERE url_hash=?", (key,)
).fetchone()
if row and (time.time() - row[1]) < self.ttl_seconds:
return row[0]
return None
def set(self, url: str, html: str) -> bool:
"""Store response. Returns True if content changed since last fetch."""
key = hashlib.sha256(url.encode()).hexdigest()
content_hash = hashlib.md5(html.encode()).hexdigest()
with sqlite3.connect(self.db_path) as conn:
existing = conn.execute(
"SELECT content_hash FROM cache WHERE url_hash=?", (key,)
).fetchone()
changed = existing is None or existing[0] != content_hash
conn.execute("""
INSERT OR REPLACE INTO cache (url_hash, url, html, content_hash, fetched_at, changed)
VALUES (?, ?, ?, ?, ?, ?)
""", (key, url, html, content_hash, time.time(), int(changed)))
return changed
def fetch_with_cache(self, url: str, session: requests.Session) -> tuple[str, bool]:
"""Returns (html, was_cached)."""
cached = self.get(url)
if cached:
return cached, True
resp = session.get(url, timeout=30)
resp.raise_for_status()
self.set(url, resp.text)
return resp.text, False
# Usage
cache = ResponseCache(ttl_hours=24)
session = requests.Session()
session.headers.update(HEADERS)
for url in urls:
html, from_cache = cache.fetch_with_cache(url, session)
    if not from_cache:
        time.sleep(random.uniform(1, 3))  # Only delay on actual fetches
6. Ignoring HTTP Error Codes
Your scraper gets a 403. Instead of stopping, it parses the "Access Denied" HTML as product data. Now your database contains {"price": "Access Denied", "name": "You do not have permission to access this resource"} for 800 entries, and you don't know which ones are real.
Different error codes require different responses. Treating them all the same is a bug.
Fix: Handle each code class differently.
import logging
from enum import Enum
class FetchResult(Enum):
SUCCESS = "success"
BLOCKED = "blocked"
RATE_LIMITED = "rate_limited"
NOT_FOUND = "not_found"
SERVER_ERROR = "server_error"
def fetch_with_error_handling(url: str, session: requests.Session) -> tuple[FetchResult, str | None]:
try:
resp = session.get(url, timeout=30)
match resp.status_code:
case 200:
# Verify it's not a disguised block
if any(marker in resp.text.lower() for marker in [
"captcha", "verify you are human", "access denied",
"cloudflare", "rate limit exceeded"
]):
logging.warning(f"Soft block on {url} despite 200 OK")
return FetchResult.BLOCKED, None
return FetchResult.SUCCESS, resp.text
case 403 | 401:
logging.warning(f"Blocked (HTTP {resp.status_code}) on {url}")
return FetchResult.BLOCKED, None
case 404:
logging.info(f"Not found: {url}")
return FetchResult.NOT_FOUND, None
case 429:
                retry_after_raw = resp.headers.get("Retry-After", "60")
                # Retry-After may be seconds or an HTTP date; fall back if not numeric
                retry_after = int(retry_after_raw) if retry_after_raw.isdigit() else 60
logging.warning(f"Rate limited on {url}. Retry after {retry_after}s")
return FetchResult.RATE_LIMITED, None
case _ if 500 <= resp.status_code < 600:
logging.error(f"Server error {resp.status_code} on {url}")
return FetchResult.SERVER_ERROR, None
case _:
resp.raise_for_status()
return FetchResult.SUCCESS, resp.text
except requests.RequestException as e:
logging.error(f"Request failed for {url}: {e}")
return FetchResult.SERVER_ERROR, None
7. Parsing the Wrong Page
You got a 200 OK, so everything's fine, right? Not always. Many sites return 200 for login redirects, CAPTCHA challenges, "verify you're human" interstitials, or maintenance pages. Your scraper happily parses a CAPTCHA page and extracts nonsense, or silently migrates to the login page and starts extracting form fields.
This is especially common with JavaScript-heavy sites where the initial HTML is a loading screen, not the actual content.
Fix: Validate content, not just status codes.
def validate_page_content(html: str, url: str, expected_signals: list[str]) -> str:
"""Verify the page contains what we expect before parsing."""
html_lower = html.lower()
# Check for common block/error patterns
block_signals = [
("captcha" in html_lower, "CAPTCHA challenge detected"),
("verify you are human" in html_lower, "Human verification required"),
("access denied" in html_lower and len(html) < 5000, "Short access denied page"),
("please enable javascript" in html_lower, "JavaScript required page"),
("maintenance" in html_lower and len(html) < 3000, "Maintenance page"),
]
for condition, message in block_signals:
if condition:
raise RuntimeError(f"{message} on {url}")
# Verify expected content is present
for signal in expected_signals:
if signal.lower() not in html_lower:
raise RuntimeError(
f"Expected content '{signal}' not found on {url}. "
f"Page length: {len(html)}. "
f"First 200 chars: {html[:200]}"
)
return html
# Usage — pass signals that should be present on a valid product page
html = session.get(url).text
validated_html = validate_page_content(
html, url,
expected_signals=["add to cart", "product description"]
)
8. Not Rotating User-Agents
Sending the same User-Agent on 10,000 requests from the same IP is a dead giveaway even if the UA string itself is realistic. Real users don't have identical browser versions. Anti-bot systems track not just individual signals but the entropy (or lack of it) in request patterns.
Fix: Rotate from a curated list of current browser UAs, matched to platform and version distribution.
import random
from dataclasses import dataclass
@dataclass
class BrowserProfile:
    user_agent: str
    sec_ch_ua: str | None  # Client hints are Chromium-only; None for Safari and Firefox
    platform: str

# Reflect realistic browser market share (Chrome dominant, then Safari, Firefox)
BROWSER_PROFILES = [
    BrowserProfile(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        sec_ch_ua='"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
        platform="Windows",
    ),
    BrowserProfile(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15",
        sec_ch_ua=None,  # Safari does not send Sec-CH-UA headers
        platform="macOS",
    ),
    BrowserProfile(
        user_agent="Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
        sec_ch_ua=None,  # Neither does Firefox
        platform="Linux",
    ),
    BrowserProfile(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0",
        sec_ch_ua='"Microsoft Edge";v="124", "Chromium";v="124", "Not-A.Brand";v="99"',
        platform="Windows",
    ),
]

def get_random_headers() -> dict:
    profile = random.choice(BROWSER_PROFILES)
    headers = {
        "User-Agent": profile.user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    }
    # Only Chromium-based profiles send client-hint headers
    if profile.sec_ch_ua:
        headers["Sec-Ch-Ua"] = profile.sec_ch_ua
        headers["Sec-Ch-Ua-Platform"] = f'"{profile.platform}"'
    return headers
# Each request gets a different profile
for url in urls:
resp = requests.get(url, headers=get_random_headers())
Pair this with ThorData residential proxy rotation for maximum effectiveness — rotating UAs without rotating IPs only solves half the problem.
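One way to keep an identity internally consistent is to bind a single UA and a single proxy to each Session for its whole lifetime, rather than rotating per request. A sketch where the proxy URLs are placeholders for whatever gateway your provider gives you:

```python
import random
import requests

# Placeholder proxy endpoints — substitute your provider's real gateways
PROXY_POOL = [
    "http://user:[email protected]:9000",
    "http://user:[email protected]:9000",
]

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15",
]

def make_identity_session() -> requests.Session:
    """Pick one UA and one proxy, then keep both for the session's lifetime."""
    session = requests.Session()
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    proxy = random.choice(PROXY_POOL)
    # Same proxy for both schemes keeps the exit IP consistent
    session.proxies.update({"http": proxy, "https": proxy})
    return session

s = make_identity_session()
```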
9. Forgetting About Encoding
response.text auto-decodes using the encoding declared in HTTP headers. But some sites lie about their encoding, declare the wrong one, or don't declare any. You end up with mojibake — garbled text that looks fine in your terminal (because the terminal guesses the encoding) but silently corrupts your data downstream.
This is particularly common with older European sites, Asian-language sites, and sites running legacy CMS platforms.
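To see what mojibake actually is, decode UTF-8 bytes with the wrong codec:

```python
# UTF-8 bytes misread as Windows-1252: the classic mojibake pattern
raw = "Café – 30 m²".encode("utf-8")

print(raw.decode("cp1252"))  # garbled, something like 'CafÃ© â€“ 30 mÂ²'
print(raw.decode("utf-8"))   # the intended text

# The corruption is reversible only if nothing normalized the bad text in between
assert raw.decode("cp1252").encode("cp1252").decode("utf-8") == "Café – 30 m²"
```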
Fix: Detect and override encoding when needed.
import chardet
def get_response_text(resp: requests.Response) -> str:
# Check what the server claims
declared_encoding = resp.encoding
    # Detect from actual bytes; chardet returns encoding=None when unsure
    detected = chardet.detect(resp.content)
    detected_encoding = detected.get("encoding") or "utf-8"
    confidence = detected.get("confidence") or 0
# If detection is confident and differs from declared, use detected
if confidence > 0.85 and detected_encoding.lower() != (declared_encoding or "").lower():
import logging
logging.debug(
f"Encoding mismatch on {resp.url}: "
f"declared={declared_encoding}, detected={detected_encoding} (confidence={confidence:.0%})"
)
resp.encoding = detected_encoding
return resp.text
# Test it
resp = requests.get("https://example.com/page", headers=HEADERS)
text = get_response_text(resp)
When saving to files, always specify encoding explicitly:
import csv
with open("output.csv", "w", encoding="utf-8", newline="") as f:
writer = csv.DictWriter(f, fieldnames=["name", "price", "rating"])
writer.writeheader()
writer.writerows(products)
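One wrinkle worth knowing: Excel often misreads plain UTF-8 CSVs. Writing with the utf-8-sig codec prepends a byte-order mark that Excel uses to detect the encoding:

```python
import csv

products = [{"name": "Café Grinder", "price": "49.99", "rating": "4.5"}]

# utf-8-sig writes a UTF-8 BOM (EF BB BF) at the start of the file,
# which Excel treats as an encoding hint
with open("output_excel.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "rating"])
    writer.writeheader()
    writer.writerows(products)

with open("output_excel.csv", "rb") as f:
    print(f.read(3))  # b'\xef\xbb\xbf'
```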
10. Over-engineering Retry Logic
You build a retry decorator that retries every failed request 10 times with exponential backoff. Sounds robust. But now your scraper spends 20 minutes retrying a 403 that will never succeed — the site blocked you, and no amount of waiting will change that. Meanwhile, a legitimate 503 that would have recovered in 5 seconds triggers your 10x retry chain, delaying your pipeline by 45 minutes.
The key insight: different error types require different retry strategies.
Fix: Build retry logic that understands what it's retrying.
from tenacity import (
retry, stop_after_attempt, wait_exponential,
retry_if_exception_type, before_sleep_log
)
import logging
logger = logging.getLogger(__name__)
class BlockedError(Exception):
"""Raised when the server actively blocks us — do NOT retry."""
pass
class TransientError(Exception):
"""Raised for temporary server issues — safe to retry."""
pass
def classify_response(resp: requests.Response, url: str) -> requests.Response:
    """Raise the appropriate exception based on response code."""
    if resp.status_code == 200:
        return resp
    elif resp.status_code in (403, 401):
        raise BlockedError(f"Blocked on {url}: HTTP {resp.status_code}")
    elif resp.status_code == 429:
        import time
        raw = resp.headers.get("Retry-After", "30")
        # Retry-After may be seconds or an HTTP date; fall back if not numeric.
        # Note that tenacity adds its own backoff on top of this wait.
        time.sleep(int(raw) if raw.isdigit() else 30)
        raise TransientError(f"Rate limited on {url}")
    elif resp.status_code in (500, 502, 503, 504):
        raise TransientError(f"Server error on {url}: HTTP {resp.status_code}")
    else:
        resp.raise_for_status()
        return resp
@retry(
retry=retry_if_exception_type(TransientError),
stop=stop_after_attempt(4),
wait=wait_exponential(multiplier=2, min=2, max=60),
before_sleep=before_sleep_log(logger, logging.WARNING),
reraise=True,
)
def fetch_with_retry(url: str, session: requests.Session) -> requests.Response:
resp = session.get(url, headers=get_random_headers(), timeout=30)
return classify_response(resp, url)
# Usage
try:
resp = fetch_with_retry(url, session)
html = resp.text
except BlockedError:
logger.error(f"Permanently blocked on {url} — rotate IP and skip")
# Don't retry — log it, move on
except TransientError:
logger.error(f"Gave up on {url} after 4 attempts")
# Save to retry queue for later
The rules: retry 429s and 5xx errors (transient). Never retry 403s or 401s (you're blocked — fix the root cause, not the symptom). Never retry 404s (the page doesn't exist). And cap retries at 3-5 attempts — if something is still failing after five tries, the problem isn't transient.
Real-World Use Cases
E-commerce Price Monitoring
import requests
import sqlite3
import time
import random
from bs4 import BeautifulSoup
from datetime import datetime
def monitor_prices(product_urls: list[str], db_path: str = "prices.db"):
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS prices (
url TEXT, price REAL, currency TEXT,
captured_at TEXT, in_stock INTEGER
)
""")
session = requests.Session()
for url in product_urls:
try:
session.headers.update(get_random_headers())
resp = session.get(url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
price_text = None
for sel in ["[itemprop='price']", "span.price", ".price-box"]:
el = soup.select_one(sel)
if el:
price_text = el.get("content") or el.get_text(strip=True)
break
if price_text:
import re
price_num = float(re.sub(r"[^\d.]", "", price_text))
conn.execute(
"INSERT INTO prices VALUES (?, ?, ?, ?, ?)",
(url, price_num, "USD", datetime.now().isoformat(), 1)
)
conn.commit()
except Exception as e:
print(f"Error on {url}: {e}")
time.sleep(random.uniform(2, 5))
conn.close()
Job Listing Aggregator
from dataclasses import dataclass, asdict
import json
@dataclass
class JobListing:
title: str
company: str
location: str
salary_range: str | None
url: str
posted_date: str
description_snippet: str
def scrape_job_listings(search_url: str) -> list[JobListing]:
session = requests.Session()
session.headers.update(get_random_headers())
resp = session.get(search_url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
jobs = []
# Try JSON-LD structured data first (most reliable)
for script in soup.find_all("script", type="application/ld+json"):
try:
            data = json.loads(script.string or "")
if isinstance(data, list):
items = data
elif data.get("@type") == "JobPosting":
items = [data]
else:
continue
for item in items:
if item.get("@type") == "JobPosting":
jobs.append(JobListing(
title=item.get("title", ""),
company=item.get("hiringOrganization", {}).get("name", ""),
location=item.get("jobLocation", {}).get("address", {}).get("addressLocality", ""),
salary_range=item.get("baseSalary", {}).get("value", {}).get("value"),
url=item.get("url", search_url),
posted_date=item.get("datePosted", ""),
description_snippet=item.get("description", "")[:300],
))
except (json.JSONDecodeError, AttributeError):
continue
return jobs
Real Estate Listing Scraper
import asyncio
import random

import httpx
from bs4 import BeautifulSoup
async def scrape_listings_async(listing_urls: list[str]) -> list[dict]:
"""Scrape multiple real estate listings concurrently."""
results = []
async with httpx.AsyncClient(
headers=get_random_headers(),
follow_redirects=True,
timeout=30.0,
) as client:
async def fetch_listing(url: str) -> dict | None:
try:
await asyncio.sleep(random.uniform(1, 2))
resp = await client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
                def text_of(sel: str) -> str:
                    el = soup.select_one(sel)
                    return el.get_text(strip=True) if el else ""
                return {
                    "url": url,
                    "price": text_of("[data-testid='price']"),
                    "beds": text_of("[data-testid='beds']"),
                    "sqft": text_of("[data-testid='sqft']"),
                    "address": text_of("h1"),
                }
except Exception as e:
print(f"Error on {url}: {e}")
return None
tasks = [fetch_listing(url) for url in listing_urls]
raw_results = await asyncio.gather(*tasks)
results = [r for r in raw_results if r is not None]
return results
Error Handling Schema
When building production scrapers, standardize your error output so downstream systems can handle failures gracefully:
from dataclasses import dataclass, field
from typing import Any
from datetime import datetime
import json
@dataclass
class ScrapeResult:
url: str
success: bool
data: dict[str, Any] | None
error_type: str | None = None
error_message: str | None = None
http_status: int | None = None
duration_ms: float | None = None
fetched_at: str = field(default_factory=lambda: datetime.now().isoformat())
from_cache: bool = False
def to_json(self) -> str:
return json.dumps({
"url": self.url,
"success": self.success,
"data": self.data,
"error": {
"type": self.error_type,
"message": self.error_message,
} if not self.success else None,
"meta": {
"http_status": self.http_status,
"duration_ms": self.duration_ms,
"fetched_at": self.fetched_at,
"from_cache": self.from_cache,
}
}, default=str)
# Example output for a successful scrape:
# {
# "url": "https://example.com/product/123",
# "success": true,
# "data": {"name": "Widget Pro", "price": 29.99},
# "error": null,
# "meta": {"http_status": 200, "duration_ms": 342.1, "fetched_at": "2026-03-30T14:23:01", "from_cache": false}
# }
# Example output for a blocked scrape:
# {
# "url": "https://example.com/product/456",
# "success": false,
# "data": null,
# "error": {"type": "BlockedError", "message": "Blocked on https://example.com/product/456: HTTP 403"},
# "meta": {"http_status": 403, "duration_ms": 89.3, "fetched_at": "2026-03-30T14:23:05", "from_cache": false}
# }
Every one of these mistakes is something I've either made myself or found in someone else's scraper. None of them require deep expertise to fix — the hard part is building the habit of checking for them before you deploy.
Build defensively. Check your assumptions. Validate content, not just status codes. And when your scraper breaks at 3 AM, at least now you'll know exactly where to start looking.