BeautifulSoup Web Scraping Tutorial: Complete Python Guide (2026)
Web scraping in 2026 is simultaneously easier and harder than it was five years ago. Easier because the libraries are mature, documentation is excellent, and Python tooling has improved dramatically. Harder because websites have become sophisticated at detecting and blocking automated traffic. A scraper that worked in 2021 may silently return empty results today — not because you coded it wrong, but because the target has deployed bot detection that filters your requests before they ever reach the HTML you want.
BeautifulSoup remains the foundation of Python web scraping. It has been around since 2004, survived multiple paradigm shifts in web development, and continues to be the right tool for the majority of scraping tasks. It does one thing: turn raw HTML into a navigable tree of objects. It does not fetch pages, handle JavaScript rendering, manage sessions, or rotate proxies. Those concerns live in the layers around it. BeautifulSoup itself is a pure parser, and that simplicity is precisely why it has lasted.
This tutorial covers everything you need to go from zero to production-ready scraper in 2026. We start with the basics of parsing and element extraction, move through anti-detection techniques, proxy rotation, error handling and retry logic, then finish with complete real-world examples across seven use cases. Every code block is working Python 3 code designed for the current ecosystem.
Understanding what BeautifulSoup is not doing is as important as understanding what it is doing. When you call BeautifulSoup(html, "lxml"), you are passing a string of HTML text and getting back an object that lets you search and traverse that text using CSS selectors or tag navigation. No network requests happen inside BeautifulSoup. That separation is the key architectural insight: you fetch with requests or httpx, you parse with BeautifulSoup. The two concerns stay cleanly separated, which makes both easier to test and maintain.
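Because no network I/O ever happens inside BeautifulSoup, you can exercise parsing logic on a plain string — which also makes extraction code easy to unit-test against saved HTML fixtures. A minimal sketch:

```python
from bs4 import BeautifulSoup

# A fixture string stands in for a fetched page — no network needed
html = """
<html><body>
  <div class="product"><span class="price">$19.99</span></div>
  <div class="product"><span class="price">$24.50</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "lxml")
prices = [el.get_text(strip=True) for el in soup.select(".product .price")]
print(prices)
# → ['$19.99', '$24.50']
```

The soup object comes back identical whether the HTML arrived via requests, httpx, Playwright, or a file on disk — that is the separation of concerns in practice.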
The goal of this guide is to give you patterns that actually work against real websites in 2026 — not toy examples against sites that welcome bots, but patterns that handle the full stack of challenges you will encounter in production scraping work.
Setup and Installation
pip install beautifulsoup4 lxml requests httpx playwright
playwright install chromium
For production environments, use a proper dependency file:
# pyproject.toml
[project]
dependencies = [
"beautifulsoup4>=4.12",
"lxml>=5.0",
"requests>=2.31",
"httpx>=0.27",
"playwright>=1.44",
"tenacity>=8.3",
"fake-useragent>=1.5",
]
Quick sanity check:
from bs4 import BeautifulSoup
import requests
response = requests.get(
"https://example.com",
headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"}
)
soup = BeautifulSoup(response.text, "lxml")
print(soup.title.string)
# → Example Domain
Choosing a Parser
BeautifulSoup supports multiple parsers, each with different trade-offs:
| Parser | Install | Speed | Handles Broken HTML | Notes |
|---|---|---|---|---|
| lxml | pip install lxml | Very fast | Excellent | Recommended for all production use |
| html.parser | Built-in | Moderate | Good | Use when C extensions unavailable |
| lxml-xml | via lxml | Fast | N/A | For XML documents specifically |
| html5lib | pip install html5lib | Slow | Perfect | For heavily broken HTML only |
Use lxml for everything unless you have a specific reason not to. It is a C extension that parses HTML 5-10x faster than html.parser and handles malformed markup gracefully. The only case for html.parser is a constrained deployment where you cannot install C extensions.
# Parser selection pattern
def parse_html(html: str) -> BeautifulSoup:
try:
return BeautifulSoup(html, "lxml")
except Exception:
# Fallback if lxml has issues
return BeautifulSoup(html, "html.parser")
Anti-Detection: Headers and Session Setup
The single most common reason scrapers fail in 2026 is not being blocked — it is being silently filtered. Sites return HTTP 200 with either empty results, a CAPTCHA page, or a honeypot response designed to waste your time. Proper request headers are the first line of defense.
A real browser sends dozens of headers with every request. Here is a realistic header set:
import requests
from fake_useragent import UserAgent
ua = UserAgent()
HEADERS = {
"User-Agent": ua.chrome,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Cache-Control": "max-age=0",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"sec-ch-ua": '"Google Chrome";v="122", "Not(A:Brand";v="24", "Chromium";v="122"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
}
session = requests.Session()
session.headers.update(HEADERS)
The Sec-Fetch-* and sec-ch-ua headers are client hints that browsers send automatically. Many bot detection systems check for their presence. If you send a Chrome User-Agent but omit these headers, detection algorithms can infer you are a bot.
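A cheap way to catch that mismatch before a target site does is a self-consistency check on your own header dict. The function below is a hypothetical helper, not part of any library; sending the assembled headers to an echo service such as httpbin.org/headers is another way to see exactly what servers receive:

```python
def headers_look_consistent(headers: dict) -> list[str]:
    """Return warnings for header combinations that look bot-like.

    Hypothetical sanity check: a Chrome User-Agent should be accompanied
    by the client-hint and Sec-Fetch-* headers Chrome actually sends.
    """
    warnings = []
    ua = headers.get("User-Agent", "")
    if "Chrome" in ua:
        for required in ("sec-ch-ua", "sec-ch-ua-mobile", "sec-ch-ua-platform"):
            if required not in headers:
                warnings.append(f"Chrome UA but missing client hint: {required}")
        if "Sec-Fetch-Mode" not in headers:
            warnings.append("Chrome UA but missing Sec-Fetch-Mode")
    if "Accept-Language" not in headers:
        warnings.append("No Accept-Language — real browsers always send one")
    return warnings

# A bare Chrome UA with nothing else trips every check
print(headers_look_consistent({"User-Agent": "Mozilla/5.0 ... Chrome/122.0.0.0 ..."}))
```

Run it against the HEADERS dict above and it should come back empty; run it against a bare User-Agent and the gaps show immediately.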
For rotating user agents, fake-useragent provides real browser strings pulled from a maintained database:
from fake_useragent import UserAgent
ua = UserAgent()
def get_random_headers() -> dict:
"""Generate headers with a random but realistic user agent."""
browser = ua.random
return {
"User-Agent": browser,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
}
Proxy Rotation with ThorData
Headers alone will not save you at scale. Most production scrapers need proxy rotation. When you send thousands of requests from a single IP, rate limiting and IP bans are inevitable.
Residential proxies route your traffic through real consumer IP addresses, making your requests indistinguishable from normal browsing traffic. ThorData provides rotating residential proxies with global coverage and per-country targeting. Here is a complete proxy-aware scraping session:
import requests
import random
import time
from bs4 import BeautifulSoup
from typing import Optional
class ProxySession:
"""Requests session with ThorData proxy rotation and anti-detection headers."""
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
def __init__(self, username: str, password: str, country: str = "US"):
self.username = username
self.password = password
self.country = country
self.session = requests.Session()
self._setup_headers()
def _setup_headers(self):
self.session.headers.update({
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
})
def _get_proxy(self) -> dict:
"""Build a rotating proxy URL. Each request gets a fresh residential IP."""
# ThorData rotating residential format
proxy_user = f"{self.username}-country-{self.country}-session-{random.randint(100000, 999999)}"
proxy_url = f"http://{proxy_user}:{self.password}@{self.THORDATA_HOST}:{self.THORDATA_PORT}"
return {"http": proxy_url, "https": proxy_url}
def get(self, url: str, **kwargs) -> requests.Response:
kwargs.setdefault("proxies", self._get_proxy())
kwargs.setdefault("timeout", 30)
return self.session.get(url, **kwargs)
def parse(self, url: str, **kwargs) -> BeautifulSoup:
"""Fetch and parse in one call."""
resp = self.get(url, **kwargs)
resp.raise_for_status()
return BeautifulSoup(resp.text, "lxml")
# Usage
scraper = ProxySession(
username="your_thordata_user",
password="your_thordata_pass",
country="US"
)
soup = scraper.parse("https://target-site.com/products")
products = soup.select(".product-card")
print(f"Found {len(products)} products")
For sticky sessions (same IP across multiple requests to maintain login state or pagination context), use session-based routing:
def get_sticky_proxy(username: str, password: str, session_id: str, country: str = "US") -> dict:
"""Return a proxy that sticks to the same IP for the duration of the session."""
proxy_user = f"{username}-country-{country}-session-{session_id}"
proxy_url = f"http://{proxy_user}:{password}@proxy.thordata.com:9000"
return {"http": proxy_url, "https": proxy_url}
# Reuse the same session_id across multiple requests to maintain the same IP
session_id = "scrape-job-001"
proxies = get_sticky_proxy("user", "pass", session_id, country="GB")
session = requests.Session()
session.get("https://site.com/login", proxies=proxies)
session.post("https://site.com/login", data={"user": "x", "pass": "y"}, proxies=proxies)
session.get("https://site.com/dashboard", proxies=proxies) # Same IP through all requests
Finding Elements
find() and find_all()
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
# First match
heading = soup.find("h2")
# All matches
links = soup.find_all("a")
# By class
cards = soup.find_all("div", class_="product-card")
# By ID
sidebar = soup.find("div", id="sidebar")
# Multiple classes — a list matches elements having ANY of the classes (OR)
items = soup.find_all("li", class_=["active", "featured"])
# To require BOTH classes, use a CSS selector instead
both = soup.select("li.active.featured")
# By attribute
images = soup.find_all("img", src=True) # Any img with a src attribute
external = soup.find_all("a", attrs={"target": "_blank"})
# By pattern
import re
price_elements = soup.find_all("span", class_=re.compile(r"price"))
CSS Selectors
CSS selectors are often more readable for complex queries:
# Descendant selection
items = soup.select("#product-list .item")
# Direct child
direct = soup.select("ul > li")
# First of type
first_price = soup.select_one(".product .price")
# Attribute selector
external_links = soup.select('a[href^="https://"]')
data_items = soup.select('[data-category="electronics"]')
# Pseudo-selectors — .select() uses Soup Sieve (bundled since BS4 4.7),
# which supports :nth-child, :nth-of-type, :not, and more
second_row = soup.select_one("table tr:nth-child(2)")
# Slicing is still a simple way to skip a header row
all_rows = soup.select("table tr")[1:]
Navigating the Tree
# Parent
parent = element.parent
# Siblings — next_sibling/previous_sibling often return whitespace-only
# NavigableStrings between tags; find_next_sibling() skips to the next tag
next_el = element.next_sibling
prev_el = element.previous_sibling
next_tag = element.find_next_sibling("div")
# Children
children = list(element.children)
all_descendants = list(element.descendants)
# First/last child
first = element.find("div")
Extracting Data
link = soup.find("a", class_="product-link")
# Text content
raw_text = link.text # Includes whitespace
clean_text = link.get_text(strip=True) # Stripped
normalized = link.get_text(separator=" ", strip=True) # Join with separator
# Attributes — always use .get() to avoid KeyError
href = link.get("href")
data_id = link.get("data-id")
all_attrs = link.attrs # Dict of all attributes
# When attribute may be a list (like class)
classes = link.get("class", []) # Returns list: ["btn", "primary"]
class_str = " ".join(link.get("class", []))
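Real pages rarely match a single selector, so extraction code tends to repeat a try-each-selector loop. A small helper (hypothetical, shown as a sketch) keeps that pattern in one place:

```python
from typing import Optional
from bs4 import BeautifulSoup

def first_text(soup: BeautifulSoup, selectors: list, default: Optional[str] = None) -> Optional[str]:
    """Return the stripped text of the first selector that matches, else default."""
    for selector in selectors:
        el = soup.select_one(selector)
        if el:
            text = el.get_text(strip=True)
            if text:
                return text
    return default

html = '<div><h1 class="title">Widget Pro</h1></div>'
soup = BeautifulSoup(html, "lxml")
print(first_text(soup, ["h1.product-title", "h1.title", "h1"]))
# → Widget Pro
```

The same cascade-of-selectors idea shows up in the use-case scrapers later in this guide; a helper like this keeps them shorter.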
Rate Limiting and Delays
Sending requests too fast is the fastest way to get blocked. Implement variable delays that simulate human browsing patterns:
import time
import random
def human_delay(min_seconds: float = 1.0, max_seconds: float = 4.0):
"""Sleep for a random duration to simulate human browsing."""
delay = random.uniform(min_seconds, max_seconds)
# Add occasional longer pauses (like a human reading)
if random.random() < 0.1: # 10% chance of a longer pause
delay += random.uniform(3.0, 8.0)
time.sleep(delay)
def scrape_with_rate_limit(urls: list, session: requests.Session,
min_delay: float = 1.5, max_delay: float = 5.0) -> list:
"""Scrape a list of URLs with rate limiting."""
results = []
for i, url in enumerate(urls):
try:
resp = session.get(url, timeout=30)
resp.raise_for_status()
results.append({"url": url, "html": resp.text, "status": resp.status_code})
except Exception as e:
results.append({"url": url, "html": None, "error": str(e)})
# Don't delay after the last URL
if i < len(urls) - 1:
human_delay(min_delay, max_delay)
return results
For async scraping with httpx, you can control concurrency more precisely:
import asyncio
import httpx
import random
async def scrape_urls_async(urls: list, max_concurrent: int = 3,
min_delay: float = 0.5, max_delay: float = 2.0) -> list:
"""Async scraper with concurrency limiting and delays."""
semaphore = asyncio.Semaphore(max_concurrent)
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
async def fetch(client: httpx.AsyncClient, url: str) -> dict:
async with semaphore:
await asyncio.sleep(random.uniform(min_delay, max_delay))
try:
resp = await client.get(url, timeout=30)
return {"url": url, "html": resp.text, "status": resp.status_code}
except Exception as e:
return {"url": url, "html": None, "error": str(e)}
async with httpx.AsyncClient(headers=headers, follow_redirects=True) as client:
tasks = [fetch(client, url) for url in urls]
results = await asyncio.gather(*tasks)
return list(results)
# Run it
urls = ["https://example.com/page1", "https://example.com/page2"]
results = asyncio.run(scrape_urls_async(urls))
Error Handling and Retry Logic
Production scrapers must handle failures gracefully. Networks are unreliable, sites go down, rate limits trigger. The tenacity library provides clean retry logic:
import requests
from bs4 import BeautifulSoup
from tenacity import (
retry,
stop_after_attempt,
wait_exponential,
retry_if_exception_type,
before_sleep_log,
)
import logging
import time
logger = logging.getLogger(__name__)
class ScraperError(Exception):
pass
class RateLimitError(ScraperError):
pass
class BlockedError(ScraperError):
pass
@retry(
retry=retry_if_exception_type((requests.RequestException, ScraperError)),
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=1, min=2, max=60),
before_sleep=before_sleep_log(logger, logging.WARNING),
)
def fetch_with_retry(url: str, session: requests.Session) -> BeautifulSoup:
"""Fetch a URL with exponential backoff retry."""
resp = session.get(url, timeout=30)
# Detect soft blocks
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 60))
logger.warning(f"Rate limited. Waiting {retry_after}s")
time.sleep(retry_after)
raise RateLimitError(f"Rate limited on {url}")
if resp.status_code == 403:
raise BlockedError(f"Blocked (403) on {url}")
if resp.status_code >= 500:
raise ScraperError(f"Server error {resp.status_code} on {url}")
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
# Detect CAPTCHA pages
if _is_captcha_page(soup):
raise BlockedError(f"CAPTCHA detected on {url}")
# Detect empty/honeypot responses
if _is_empty_response(soup):
raise ScraperError(f"Suspicious empty response for {url}")
return soup
def _is_captcha_page(soup: BeautifulSoup) -> bool:
"""Detect common CAPTCHA patterns."""
captcha_indicators = [
soup.find("div", id="captcha"),
soup.find("div", class_="g-recaptcha"),
soup.find("div", class_="h-captcha"),
soup.find(string=lambda t: t and "robot" in t.lower()),
soup.find(string=lambda t: t and "captcha" in t.lower()),
        bool(soup.title and soup.title.string and "access denied" in soup.title.string.lower()),
]
return any(captcha_indicators)
def _is_empty_response(soup: BeautifulSoup) -> bool:
"""Detect suspiciously empty responses."""
body_text = soup.get_text(strip=True)
return len(body_text) < 200 # Adjust threshold per site
# Manual retry with progressive delays for simpler cases
def fetch_simple_retry(url: str, session: requests.Session,
max_retries: int = 3) -> requests.Response:
"""Simple retry without external dependencies."""
delays = [2, 5, 15] # Seconds between retries
for attempt in range(max_retries):
try:
resp = session.get(url, timeout=30)
if resp.status_code == 200:
return resp
if resp.status_code == 429:
wait = delays[min(attempt, len(delays)-1)]
logger.warning(f"Rate limited, waiting {wait}s (attempt {attempt+1})")
time.sleep(wait)
continue
resp.raise_for_status()
except requests.ConnectionError as e:
if attempt == max_retries - 1:
raise
wait = delays[min(attempt, len(delays)-1)]
logger.warning(f"Connection error, retrying in {wait}s: {e}")
time.sleep(wait)
raise ScraperError(f"Failed after {max_retries} attempts: {url}")
CAPTCHA Handling
CAPTCHAs are the hardest problem in web scraping. The right response depends on your use case and scale:
Approach 1: Avoid CAPTCHAs by looking like a browser
import asyncio
import random
from typing import Optional
from playwright.async_api import async_playwright
async def scrape_with_playwright(url: str, proxy: Optional[str] = None) -> str:
"""Use a real browser to avoid CAPTCHA triggers."""
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--disable-infobars",
"--no-sandbox",
]
)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
viewport={"width": 1440, "height": 900},
proxy={"server": proxy} if proxy else None,
java_script_enabled=True,
)
# Patch automation detection
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
window.chrome = { runtime: {} };
""")
page = await context.new_page()
# Simulate human-like navigation
await page.goto(url, wait_until="domcontentloaded")
await asyncio.sleep(random.uniform(1, 3))
# Scroll to simulate reading
await page.evaluate("window.scrollTo(0, document.body.scrollHeight / 3)")
await asyncio.sleep(random.uniform(0.5, 1.5))
content = await page.content()
await browser.close()
return content
Approach 2: Detect and wait for CAPTCHA resolution
async def handle_captcha_wait(page, timeout: int = 30) -> bool:
"""Wait for manual CAPTCHA resolution (for supervised scraping)."""
captcha_selectors = [
".g-recaptcha",
"#captcha",
".h-captcha",
"iframe[src*='recaptcha']",
"iframe[src*='hcaptcha']",
]
for selector in captcha_selectors:
captcha = await page.query_selector(selector)
if captcha:
print(f"CAPTCHA detected. Waiting up to {timeout}s for resolution...")
try:
# Wait for CAPTCHA container to disappear
await page.wait_for_selector(selector, state="hidden", timeout=timeout * 1000)
return True
except Exception:
return False
return True # No CAPTCHA found
Approach 3: Skip to the API
Before building CAPTCHA handling, check the Network tab in DevTools. Most sites that show CAPTCHAs on their web interface have a mobile API or internal JSON endpoint that bypasses them entirely. This is almost always the better path.
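A related shortcut works even without a separate endpoint: many sites embed the same data as JSON inside the HTML itself, in JSON-LD script blocks or framework state objects. BeautifulSoup finds the script tag, and json does the rest — no selectors to maintain. A sketch against an inline fixture:

```python
import json
from bs4 import BeautifulSoup

html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget Pro",
 "offers": {"price": "24.99", "priceCurrency": "USD"}}
</script>
</head><body>...</body></html>
"""

soup = BeautifulSoup(html, "lxml")
# The script body is raw JSON — parse it instead of scraping rendered markup
ld_script = soup.find("script", type="application/ld+json")
data = json.loads(ld_script.string)
print(data["name"], data["offers"]["price"])
# → Widget Pro 24.99
```

Structured data embedded this way tends to be far more stable than CSS class names, because search engines consume it too.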
Real-World Use Cases
Use Case 1: E-commerce Price Monitor
import requests
from bs4 import BeautifulSoup
import json
from datetime import datetime
from dataclasses import dataclass, asdict
from typing import Optional
import time
import random
@dataclass
class Product:
url: str
name: str
price: Optional[float]
currency: str
availability: str
scraped_at: str
sku: Optional[str] = None
rating: Optional[float] = None
review_count: Optional[int] = None
def scrape_product(url: str, session: requests.Session) -> Product:
"""Extract product data from a product page."""
resp = session.get(url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
# Try common product name patterns
name = None
for selector in ["h1.product-title", "h1.product-name", "#productTitle", "h1[itemprop='name']", "h1"]:
el = soup.select_one(selector)
if el:
name = el.get_text(strip=True)
break
    # Extract price — handle various formats
    price = None
    currency = "USD"
    import re
    for selector in [".price", ".product-price", "[itemprop='price']", ".offer-price", "#priceblock_ourprice"]:
        el = soup.select_one(selector)
        if el:
            raw = el.get_text(strip=True)
            # Parse price from strings like "$1,234.56" or "24,99 €"
            match = re.search(r"\d[\d.,]*", raw)
            if match:
                num = match.group()
                if "," in num and "." in num:
                    # Whichever separator comes last is the decimal point
                    if num.rfind(",") > num.rfind("."):
                        num = num.replace(".", "").replace(",", ".")
                    else:
                        num = num.replace(",", "")
                elif "," in num:
                    # Lone comma: decimal if exactly two digits follow, else thousands
                    num = num.replace(",", ".") if re.search(r",\d{2}$", num) else num.replace(",", "")
                try:
                    price = float(num)
                except ValueError:
                    pass
            if "€" in raw:
                currency = "EUR"
            elif "£" in raw:
                currency = "GBP"
            break
# Availability
availability = "unknown"
avail_el = soup.find(attrs={"itemprop": "availability"})
if avail_el:
avail_href = avail_el.get("href", "")
if "InStock" in avail_href:
availability = "in_stock"
elif "OutOfStock" in avail_href:
availability = "out_of_stock"
else:
page_text = soup.get_text().lower()
if "add to cart" in page_text or "in stock" in page_text:
availability = "in_stock"
elif "out of stock" in page_text or "unavailable" in page_text:
availability = "out_of_stock"
return Product(
url=url,
name=name or "Unknown",
price=price,
currency=currency,
availability=availability,
scraped_at=datetime.utcnow().isoformat(),
)
def monitor_prices(urls: list, output_file: str = "prices.jsonl"):
"""Monitor multiple product pages, appending results to a JSONL file."""
session = requests.Session()
session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
})
with open(output_file, "a") as f:
for url in urls:
try:
product = scrape_product(url, session)
f.write(json.dumps(asdict(product)) + "\n")
print(f"✓ {product.name}: {product.currency} {product.price}")
except Exception as e:
print(f"✗ {url}: {e}")
time.sleep(random.uniform(2, 5))
Output schema:
{
"url": "https://example.com/products/widget-pro",
"name": "Widget Pro 2026",
"price": 24.99,
"currency": "USD",
"availability": "in_stock",
"scraped_at": "2026-03-31T10:22:45.123456",
"sku": null,
"rating": null,
"review_count": null
}
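Because each run appends to the JSONL file, computing price movements is just a matter of grouping records by URL and comparing the last two observations. A sketch — the field names match the Product schema above, but the reporting function itself is a hypothetical add-on:

```python
import json
from collections import defaultdict

def price_changes(jsonl_path: str) -> list:
    """Return (url, old_price, new_price) tuples for products whose price moved."""
    history = defaultdict(list)
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            if rec.get("price") is not None:
                history[rec["url"]].append(rec["price"])
    changes = []
    for url, prices in history.items():
        # Compare the two most recent observations for each product
        if len(prices) >= 2 and prices[-1] != prices[-2]:
            changes.append((url, prices[-2], prices[-1]))
    return changes
```

Run this after each monitoring pass to trigger alerts only on actual changes rather than on every scrape.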
Use Case 2: News Article Aggregator
from bs4 import BeautifulSoup
import requests
import feedparser
from dataclasses import dataclass
from typing import Optional, List
from datetime import datetime
import hashlib
@dataclass
class Article:
url: str
title: str
author: Optional[str]
published_at: Optional[str]
body_text: str
word_count: int
tags: List[str]
source_domain: str
content_hash: str
def extract_article(url: str, session: requests.Session) -> Article:
"""Extract article content using common news site patterns."""
resp = session.get(url, timeout=30)
soup = BeautifulSoup(resp.text, "lxml")
# Remove noise elements
for tag in soup.select("nav, footer, header, .sidebar, .advertisement, .ad, script, style, [class*='ad-'], [id*='sidebar']"):
tag.decompose()
# Title
title = None
for selector in ["h1.article-title", "h1.entry-title", "h1.post-title",
"[itemprop='headline']", "article h1", "h1"]:
el = soup.select_one(selector)
if el:
title = el.get_text(strip=True)
break
# Author
author = None
for selector in ["[rel='author']", "[itemprop='author']", ".author-name",
".byline", "[class*='author']"]:
el = soup.select_one(selector)
if el:
author = el.get_text(strip=True)
break
# Published date
published = None
for selector in ["time[datetime]", "[itemprop='datePublished']", ".published-date"]:
el = soup.select_one(selector)
if el:
published = el.get("datetime") or el.get_text(strip=True)
break
# Body text — prioritize article/main content containers
body = ""
for selector in ["article .content", "article .body", ".article-body",
".entry-content", ".post-content", "article", "main"]:
el = soup.select_one(selector)
if el:
# Get all paragraph text
paragraphs = el.find_all("p")
body = " ".join(p.get_text(strip=True) for p in paragraphs if len(p.get_text(strip=True)) > 50)
if len(body) > 200:
break
# Tags from meta keywords or tag links
tags = []
meta_keywords = soup.find("meta", attrs={"name": "keywords"})
if meta_keywords:
tags = [t.strip() for t in meta_keywords.get("content", "").split(",")]
from urllib.parse import urlparse
domain = urlparse(url).netloc
return Article(
url=url,
title=title or "Unknown",
author=author,
published_at=published,
body_text=body,
word_count=len(body.split()),
tags=tags[:10],
source_domain=domain,
content_hash=hashlib.md5(body.encode()).hexdigest(),
)
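The content_hash field exists so an aggregator can drop articles syndicated under multiple URLs. Assuming the articles have been serialized to dicts (as asdict would produce), a dedup pass is a few lines:

```python
def dedupe_by_hash(articles: list) -> list:
    """Keep the first article seen for each content_hash, drop duplicates."""
    seen = set()
    unique = []
    for article in articles:
        h = article["content_hash"]
        if h not in seen:
            seen.add(h)
            unique.append(article)
    return unique

articles = [
    {"url": "https://a.com/story", "content_hash": "abc"},
    {"url": "https://b.com/syndicated-story", "content_hash": "abc"},
    {"url": "https://a.com/other", "content_hash": "def"},
]
print(len(dedupe_by_hash(articles)))
# → 2
```

Hashing only the body text (not the full HTML) is what makes this work: two sites wrapping the same wire story in different templates still collide on the hash.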
Use Case 3: Job Listing Scraper with Pagination
import requests
from bs4 import BeautifulSoup
from typing import Generator
import time
import random
from dataclasses import dataclass
from typing import Optional, List
@dataclass
class JobListing:
title: str
company: str
location: str
salary: Optional[str]
url: str
job_type: Optional[str]
description_preview: str
tags: List[str]
def scrape_jobs_paginated(base_url: str, query: str, location: str,
max_pages: int = 10) -> Generator[JobListing, None, None]:
"""Scrape job listings across multiple pages."""
session = requests.Session()
session.headers.update({
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
})
for page in range(1, max_pages + 1):
params = {"q": query, "l": location, "start": (page - 1) * 10}
try:
resp = session.get(base_url, params=params, timeout=30)
resp.raise_for_status()
except requests.HTTPError as e:
print(f"Page {page} failed: {e}")
break
soup = BeautifulSoup(resp.text, "lxml")
# Generic job card pattern — adapt selectors per site
job_cards = soup.select(".job-card, .job-listing, [data-testid='job-card'], .result")
if not job_cards:
print(f"No jobs found on page {page}, stopping pagination")
break
for card in job_cards:
title_el = card.select_one("h2 a, h3 a, .job-title a, [data-testid='job-title']")
company_el = card.select_one(".company-name, [data-testid='company-name'], .employer")
location_el = card.select_one(".location, [data-testid='job-location'], .job-location")
salary_el = card.select_one(".salary, .compensation, [data-testid='salary-snippet']")
desc_el = card.select_one(".description, .summary, .job-snippet")
if not title_el:
continue
tags = [t.get_text(strip=True) for t in card.select(".tag, .skill-tag, .badge")]
yield JobListing(
title=title_el.get_text(strip=True),
company=company_el.get_text(strip=True) if company_el else "Unknown",
location=location_el.get_text(strip=True) if location_el else "Remote/Unknown",
salary=salary_el.get_text(strip=True) if salary_el else None,
url=title_el.get("href", ""),
job_type=None,
description_preview=desc_el.get_text(strip=True)[:300] if desc_el else "",
tags=tags[:8],
)
# Check for next page
next_link = soup.select_one("a[aria-label='Next'], .pagination-next a, a.next")
if not next_link:
print(f"No next page link found after page {page}")
break
time.sleep(random.uniform(2, 5))
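One detail to watch when consuming these listings: the href pulled from title_el is frequently a relative path like /jobs/12345. The standard library's urljoin resolves it against the site root before you store it — a thin wrapper for clarity (the helper name is illustrative):

```python
from urllib.parse import urljoin

def absolute_url(base_url: str, href: str) -> str:
    """Resolve a possibly-relative href against the listing site's base URL."""
    return urljoin(base_url, href)

# Relative paths get resolved against the site root
print(absolute_url("https://jobs.example.com/search", "/jobs/12345"))
# → https://jobs.example.com/jobs/12345
# Already-absolute URLs pass through unchanged
print(absolute_url("https://jobs.example.com/search", "https://other.com/x"))
# → https://other.com/x
```

Applying this to the url field of each JobListing keeps downstream deduplication and storage consistent.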
Use Case 4: Real Estate Listing Scraper
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
from typing import Optional, List
import re
@dataclass
class PropertyListing:
address: str
price: Optional[float]
price_per_sqft: Optional[float]
bedrooms: Optional[int]
bathrooms: Optional[float]
square_feet: Optional[int]
lot_size: Optional[str]
year_built: Optional[int]
property_type: str
listing_url: str
mls_id: Optional[str]
days_on_market: Optional[int]
description: str
features: List[str] = field(default_factory=list)
def parse_property_details(soup: BeautifulSoup, url: str) -> PropertyListing:
"""Parse property details from a listing page."""
def clean_number(text: str) -> Optional[float]:
"""Extract numeric value from formatted string."""
if not text:
return None
clean = re.sub(r"[^\d.]", "", text.replace(",", ""))
try:
return float(clean)
except ValueError:
return None
# Address
address = ""
for sel in ["h1.address", "[itemprop='streetAddress']", ".property-address", "h1"]:
el = soup.select_one(sel)
if el:
address = el.get_text(strip=True)
break
# Price
price = None
for sel in [".listing-price", ".price", "[data-testid='price']", "span.price"]:
el = soup.select_one(sel)
if el:
price = clean_number(el.get_text())
break
# Key facts — often in a definition list or facts grid
bedrooms = bathrooms = sqft = year_built = None
facts_container = soup.select_one(".facts-grid, .property-facts, .key-facts, .home-facts")
if facts_container:
text = facts_container.get_text(" ", strip=True).lower()
bed_match = re.search(r"(\d+)\s*bed", text)
bath_match = re.search(r"([\d.]+)\s*bath", text)
sqft_match = re.search(r"([\d,]+)\s*sq\.?\s*ft", text)
year_match = re.search(r"built\s+in\s+(\d{4})", text)
bedrooms = int(bed_match.group(1)) if bed_match else None
bathrooms = float(bath_match.group(1)) if bath_match else None
sqft = int(sqft_match.group(1).replace(",", "")) if sqft_match else None
year_built = int(year_match.group(1)) if year_match else None
# Description
desc = ""
for sel in [".property-description", ".listing-description", "[data-testid='description']"]:
el = soup.select_one(sel)
if el:
desc = el.get_text(strip=True)
break
# Features/amenities
features = [el.get_text(strip=True) for el in soup.select(".features li, .amenities li, .feature-item")]
price_per_sqft = round(price / sqft, 2) if price and sqft else None
return PropertyListing(
address=address,
price=price,
price_per_sqft=price_per_sqft,
bedrooms=bedrooms,
bathrooms=bathrooms,
square_feet=sqft,
lot_size=None,
year_built=year_built,
property_type="residential",
listing_url=url,
mls_id=None,
days_on_market=None,
description=desc[:1000],
features=features[:20],
)
Use Case 5: Academic Paper Metadata Extractor
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
from typing import Optional, List
@dataclass
class Paper:
title: str
authors: List[str]
abstract: str
doi: Optional[str]
journal: Optional[str]
year: Optional[int]
keywords: List[str]
citations: Optional[int]
pdf_url: Optional[str]
def extract_paper_metadata(url: str, session: requests.Session) -> Paper:
"""Extract academic paper metadata. Works with arXiv, PubMed-style pages."""
resp = session.get(url, timeout=30)
soup = BeautifulSoup(resp.text, "lxml")
# Try Open Graph / Dublin Core / Schema.org metadata first (most reliable)
def get_meta(name: str = None, property_: str = None) -> Optional[str]:
if name:
el = soup.find("meta", attrs={"name": name})
return el.get("content") if el else None
if property_:
el = soup.find("meta", attrs={"property": property_})
return el.get("content") if el else None
return None
title = (get_meta("citation_title") or get_meta(property_="og:title") or
(soup.find("h1") and soup.find("h1").get_text(strip=True)))
abstract = (get_meta("citation_abstract") or get_meta("description") or
get_meta(property_="og:description") or "")
doi = get_meta("citation_doi")
journal = get_meta("citation_journal_title")
# Authors from citation metadata (can be multiple)
author_metas = soup.find_all("meta", attrs={"name": "citation_author"})
if author_metas:
authors = [m.get("content", "") for m in author_metas]
else:
# Fallback to page scraping
author_els = soup.select(".author, [itemprop='author'], .authors a")
authors = [el.get_text(strip=True) for el in author_els]
# Year
year = None
date_str = get_meta("citation_publication_date") or get_meta("citation_date")
if date_str:
import re
year_match = re.search(r"\d{4}", date_str)
year = int(year_match.group()) if year_match else None
# Keywords
keywords_str = get_meta("citation_keywords") or get_meta("keywords") or ""
keywords = [k.strip() for k in keywords_str.replace(";", ",").split(",") if k.strip()]
# PDF link
pdf_url = None
pdf_meta = soup.find("meta", attrs={"name": "citation_pdf_url"})
if pdf_meta:
pdf_url = pdf_meta.get("content")
else:
pdf_link = soup.find("a", href=lambda h: h and ".pdf" in h.lower())
if pdf_link:
pdf_url = pdf_link.get("href")
return Paper(
title=title or "Unknown",
authors=authors,
abstract=abstract[:2000],
doi=doi,
journal=journal,
year=year,
keywords=keywords[:15],
citations=None,
pdf_url=pdf_url,
)
Use Case 6: Social Media Public Profile Scraper
import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import Optional, List
@dataclass
class PublicProfile:
username: str
display_name: str
bio: Optional[str]
follower_count: Optional[int]
following_count: Optional[int]
post_count: Optional[int]
website: Optional[str]
recent_posts: List[dict]
async def scrape_public_profile(username: str, platform_url: str,
                                proxy: Optional[str] = None) -> PublicProfile:
"""Scrape a public social media profile using Playwright."""
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
args=["--disable-blink-features=AutomationControlled"]
)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
viewport={"width": 1280, "height": 800},
proxy={"server": proxy} if proxy else None,
)
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
""")
page = await context.new_page()
# Intercept API calls to extract data more cleanly
api_data = {}
async def handle_response(response):
if "graphql" in response.url or "/api/" in response.url:
try:
data = await response.json()
api_data[response.url] = data
except Exception:
pass
page.on("response", handle_response)
await page.goto(f"{platform_url}/{username}", wait_until="networkidle")
await asyncio.sleep(2)
html = await page.content()
soup = BeautifulSoup(html, "lxml")
await browser.close()
# Parse the rendered HTML
import re
def parse_count(text: str) -> Optional[int]:
if not text:
return None
text = text.replace(",", "").strip()
if "K" in text:
return int(float(text.replace("K", "")) * 1000)
if "M" in text:
return int(float(text.replace("M", "")) * 1_000_000)
try:
return int(re.sub(r"[^\d]", "", text))
except ValueError:
return None
# These selectors are illustrative — adapt per platform
display_name = ""
bio = ""
name_el = soup.select_one("h1, .profile-name, [data-testid='display-name']")
bio_el = soup.select_one(".bio, .profile-bio, [data-testid='bio']")
if name_el:
display_name = name_el.get_text(strip=True)
if bio_el:
bio = bio_el.get_text(strip=True)
return PublicProfile(
username=username,
display_name=display_name,
bio=bio,
follower_count=None, # Extract from stats elements
following_count=None,
post_count=None,
website=None,
recent_posts=[],
)
Use Case 7: Government Data Extractor
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
from typing import Optional, List
import csv
import io
@dataclass
class GovernmentRecord:
record_id: str
source_url: str
record_type: str
entity_name: str
date: Optional[str]
amount: Optional[float]
description: str
raw_data: dict = field(default_factory=dict)
def scrape_government_data_table(url: str, record_type: str,
session: requests.Session) -> List[GovernmentRecord]:
"""Extract tabular data from government data portals."""
resp = session.get(url, timeout=60)
soup = BeautifulSoup(resp.text, "lxml")
records = []
# Try to find a data table
tables = soup.find_all("table")
if not tables:
print("No tables found — checking for downloadable data")
# Many .gov sites offer CSV downloads
csv_link = soup.find("a", href=lambda h: h and (".csv" in h.lower() or "download" in h.lower()))
if csv_link:
csv_url = csv_link.get("href")
if not csv_url.startswith("http"):
from urllib.parse import urljoin
csv_url = urljoin(url, csv_url)
csv_resp = session.get(csv_url, timeout=60)
reader = csv.DictReader(io.StringIO(csv_resp.text))
for row in reader:
records.append(GovernmentRecord(
record_id=str(len(records)),
source_url=url,
record_type=record_type,
entity_name=list(row.values())[0] if row else "",
date=None,
amount=None,
description=str(row),
raw_data=dict(row),
))
return records
# Parse the largest table
main_table = max(tables, key=lambda t: len(t.find_all("tr")))
# Extract headers
header_row = main_table.find("tr")
headers = [th.get_text(strip=True).lower().replace(" ", "_")
for th in header_row.find_all(["th", "td"])]
for row in main_table.find_all("tr")[1:]:
cells = row.find_all(["td", "th"])
if not cells:
continue
row_data = {}
for i, cell in enumerate(cells):
if i < len(headers):
row_data[headers[i]] = cell.get_text(strip=True)
# Try to find standard fields
import re
        entity_name = (row_data.get("name") or row_data.get("entity")
                       or list(row_data.values())[0]) if row_data else ""
amount = None
for key in ["amount", "value", "total", "contract_amount"]:
if key in row_data:
clean = re.sub(r"[^\d.]", "", row_data[key])
try:
amount = float(clean)
break
except ValueError:
pass
records.append(GovernmentRecord(
record_id=row_data.get("id", str(len(records))),
source_url=url,
record_type=record_type,
entity_name=entity_name,
date=row_data.get("date") or row_data.get("filing_date"),
amount=amount,
description=str(row_data),
raw_data=row_data,
))
return records
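The header-normalization and amount-cleaning steps above are easy to sanity-check against a tiny inline table (the markup is invented for illustration):

```python
import re
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Entity Name</th><th>Contract Amount</th></tr>
  <tr><td>Acme Corp</td><td>$1,200.50</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Lowercase and underscore the header cells, same as the scraper does
header_row = table.find("tr")
headers = [th.get_text(strip=True).lower().replace(" ", "_")
           for th in header_row.find_all(["th", "td"])]
print(headers)  # ['entity_name', 'contract_amount']

# Strip currency symbols and thousands separators before parsing the amount
cells = table.find_all("tr")[1].find_all("td")
amount = float(re.sub(r"[^\d.]", "", cells[1].get_text(strip=True)))
print(amount)   # 1200.5
```

Be aware that the `[^\d.]` cleanup assumes US-style numbers; European formats like "1.200,50" would parse incorrectly and need locale-aware handling.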
Output Schemas and Storage
Always define your output schema before you start scraping:
import json
import csv
import sqlite3
from pathlib import Path
from dataclasses import asdict
# JSONL — best for streaming large datasets
def save_jsonl(records: list, filepath: str):
with open(filepath, "a", encoding="utf-8") as f:
for record in records:
f.write(json.dumps(asdict(record) if hasattr(record, "__dataclass_fields__") else record,
ensure_ascii=False, default=str) + "\n")
# CSV — best for spreadsheet analysis
def save_csv(records: list, filepath: str, fieldnames: list = None):
if not records:
return
dicts = [asdict(r) if hasattr(r, "__dataclass_fields__") else r for r in records]
fieldnames = fieldnames or list(dicts[0].keys())
write_header = not Path(filepath).exists()
with open(filepath, "a", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
if write_header:
writer.writeheader()
writer.writerows(dicts)
# SQLite — best for querying and incremental updates
def save_sqlite(records: list, db_path: str, table_name: str):
if not records:
return
dicts = [asdict(r) if hasattr(r, "__dataclass_fields__") else r for r in records]
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
    # Create table from first record's keys
    # Note: sqlite3 cannot parameterize identifiers, so table_name and the
    # column names are interpolated directly; only pass trusted schema names
    columns = list(dicts[0].keys())
    col_defs = ", ".join(f"{col} TEXT" for col in columns)
    cursor.execute(f"CREATE TABLE IF NOT EXISTS {table_name} ({col_defs})")
    # INSERT OR REPLACE only deduplicates if the table has a PRIMARY KEY or
    # UNIQUE constraint; without one it behaves like a plain INSERT
for record in dicts:
placeholders = ", ".join("?" * len(columns))
values = [str(record.get(col, "")) for col in columns]
cursor.execute(
f"INSERT OR REPLACE INTO {table_name} VALUES ({placeholders})",
values
)
conn.commit()
conn.close()
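A quick way to validate a schema before pointing the scraper at real pages is to round-trip one record through the JSONL representation. The `Product` dataclass here is purely illustrative; substitute your own:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Product:  # illustrative schema: define yours before scraping
    name: str
    price: float
    in_stock: bool

rec = Product(name="Widget", price=9.99, in_stock=True)

# Serialize one line of JSONL, then reconstruct the dataclass from it
line = json.dumps(asdict(rec), ensure_ascii=False)
restored = Product(**json.loads(line))

print(restored == rec)  # True
```

If the round trip fails (for example because a field holds a non-serializable type), you will find out after one record instead of after a 10,000-page crawl.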
Production Checklist
Before deploying a scraper to production:
- Test with 5 URLs before running 5,000
- Cache raw HTML during development to avoid hammering targets
- Log everything — URL, status code, response size, parse time
- Handle None everywhere — every select_one() can return None
- Validate output — check that extracted fields are non-empty before saving
- Monitor file sizes — empty or suspiciously small outputs indicate blocking
- Set realistic timeouts — 30s for page loads, 60s for slow government sites
- Respect robots.txt — at minimum for legal protection and to stay undetected longer
- Use a session — reuse TCP connections and cookies across requests
- Rotate proxies via ThorData for any volume above ~100 URLs per day
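The robots.txt item can be automated with the standard library's urllib.robotparser. This sketch parses rules from raw text so it runs offline; the rules shown are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit fetching this URL."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network call is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """User-agent: *
Disallow: /admin/
"""
print(allowed_by_robots(rules, "MyScraper", "https://example.com/products"))  # True
print(allowed_by_robots(rules, "MyScraper", "https://example.com/admin/x"))   # False
```

In production you would fetch `{site}/robots.txt` once per domain through your session, cache the parsed result, and check each URL before queueing it.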
import logging
import time

import requests
from bs4 import BeautifulSoup
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
handlers=[
logging.FileHandler("scraper.log"),
logging.StreamHandler(),
]
)
logger = logging.getLogger(__name__)
# Template for every production scrape
def scrape_url(url: str, session: requests.Session) -> dict:
start = time.time()
try:
resp = session.get(url, timeout=30)
elapsed = time.time() - start
logger.info(f"GET {url} → {resp.status_code} ({len(resp.content)} bytes, {elapsed:.2f}s)")
if resp.status_code != 200:
logger.warning(f"Non-200 status for {url}: {resp.status_code}")
return {"url": url, "error": f"HTTP {resp.status_code}", "data": None}
soup = BeautifulSoup(resp.text, "lxml")
data = extract_data(soup) # Your extraction logic
return {"url": url, "data": data, "error": None}
except requests.Timeout:
logger.error(f"Timeout fetching {url}")
return {"url": url, "error": "timeout", "data": None}
except Exception as e:
logger.exception(f"Unexpected error for {url}: {e}")
return {"url": url, "error": str(e), "data": None}
Summary
BeautifulSoup handles the HTML parsing layer of web scraping reliably and efficiently. Pair it with requests for simple static sites, httpx for async work, and playwright when JavaScript rendering is required. Add proper headers to avoid easy detection, proxy rotation via ThorData for volume work, and tenacity for retry logic. Define your output schema before you start extracting data.
The biggest wins in production scraping come not from clever parsing tricks but from respecting the fundamental rules: look like a browser, rotate your IP, slow down, handle failures gracefully, and always validate your output before assuming the scraper worked.