
BeautifulSoup Web Scraping Tutorial: Complete Python Guide (2026)

Web scraping in 2026 is simultaneously easier and harder than it was five years ago. Easier because the libraries are mature, documentation is excellent, and Python tooling has improved dramatically. Harder because websites have become sophisticated at detecting and blocking automated traffic. A scraper that worked in 2021 may silently return empty results today — not because you coded it wrong, but because the target has deployed bot detection that filters your requests before they ever reach the HTML you want.

BeautifulSoup remains the foundation of Python web scraping. It has been around since 2004, survived multiple paradigm shifts in web development, and continues to be the right tool for the majority of scraping tasks. It does one thing: turn raw HTML into a navigable tree of objects. It does not fetch pages, handle JavaScript rendering, manage sessions, or rotate proxies. Those concerns live in the layers around it. BeautifulSoup itself is a pure parser, and that simplicity is precisely why it has lasted.

This tutorial covers everything you need to go from zero to production-ready scraper in 2026. We start with the basics of parsing and element extraction, move through anti-detection techniques, proxy rotation, error handling and retry logic, then finish with complete real-world examples across seven use cases. Every code block is working Python 3 code designed for the current ecosystem.

Understanding what BeautifulSoup is not doing is as important as understanding what it is doing. When you call BeautifulSoup(html, "lxml"), you are passing a string of HTML text and getting back an object that lets you search and traverse that text using CSS selectors or tag navigation. No network requests happen inside BeautifulSoup. That separation is the key architectural insight: you fetch with requests or httpx, you parse with BeautifulSoup. The two concerns stay cleanly separated, which makes both easier to test and maintain.

The goal of this guide is to give you patterns that actually work against real websites in 2026 — not toy examples against sites that welcome bots, but patterns that handle the full stack of challenges you will encounter in production scraping work.

Setup and Installation

pip install beautifulsoup4 lxml requests httpx playwright
playwright install chromium

For production environments, use a proper dependency file:

# pyproject.toml
[project]
dependencies = [
    "beautifulsoup4>=4.12",
    "lxml>=5.0",
    "requests>=2.31",
    "httpx>=0.27",
    "playwright>=1.44",
    "tenacity>=8.3",
    "fake-useragent>=1.5",
]

Quick sanity check:

from bs4 import BeautifulSoup
import requests

response = requests.get(
    "https://example.com",
    headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"}
)
soup = BeautifulSoup(response.text, "lxml")
print(soup.title.string)
# → Example Domain

Choosing a Parser

BeautifulSoup supports multiple parsers, each with different trade-offs:

Parser       Install               Speed      Handles broken HTML  Notes
lxml         pip install lxml      Very fast  Excellent            Recommended for all production use
html.parser  Built-in              Moderate   Good                 Use when C extensions unavailable
lxml-xml     via lxml              Fast       N/A                  For XML documents specifically
html5lib     pip install html5lib  Slow       Perfect              For heavily broken HTML only

Use lxml for everything unless you have a specific reason not to. It is a C extension that parses HTML 5-10x faster than html.parser and handles malformed markup gracefully. The only case for html.parser is a constrained deployment where you cannot install C extensions.

# Parser selection pattern
def parse_html(html: str) -> BeautifulSoup:
    try:
        return BeautifulSoup(html, "lxml")
    except Exception:
        # Fallback if lxml has issues
        return BeautifulSoup(html, "html.parser")

Anti-Detection: Headers and Session Setup

The single most common reason scrapers fail in 2026 is not being blocked — it is being silently filtered. Sites return HTTP 200 with either empty results, a CAPTCHA page, or a honeypot response designed to waste your time. Proper request headers are the first line of defense.

A real browser sends dozens of headers with every request. Here is a realistic header set:

import requests
from fake_useragent import UserAgent

ua = UserAgent()

HEADERS = {
    "User-Agent": ua.chrome,
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "sec-ch-ua": '"Google Chrome";v="122", "Not(A:Brand";v="24", "Chromium";v="122"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"macOS"',
}

session = requests.Session()
session.headers.update(HEADERS)

The Sec-Fetch-* headers are fetch metadata and the sec-ch-ua headers are client hints; real browsers attach both automatically. Many bot detection systems check for their presence. If you send a Chrome User-Agent but omit these headers, detection algorithms can infer you are a bot.

For rotating user agents, fake-useragent provides real browser strings pulled from a maintained database:

from fake_useragent import UserAgent

ua = UserAgent()

def get_random_headers() -> dict:
    """Generate headers with a random but realistic user agent."""
    browser = ua.random
    return {
        "User-Agent": browser,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }

Proxy Rotation with ThorData

Headers alone will not save you at scale. Most production scrapers need proxy rotation. When you send thousands of requests from a single IP, rate limiting and IP bans are inevitable.

Residential proxies route your traffic through real consumer IP addresses, making your requests indistinguishable from normal browsing traffic. ThorData provides rotating residential proxies with global coverage and per-country targeting. Here is a complete proxy-aware scraping session:

import requests
import random
import time
from bs4 import BeautifulSoup
from typing import Optional

class ProxySession:
    """Requests session with ThorData proxy rotation and anti-detection headers."""

    THORDATA_HOST = "proxy.thordata.com"
    THORDATA_PORT = 9000

    def __init__(self, username: str, password: str, country: str = "US"):
        self.username = username
        self.password = password
        self.country = country
        self.session = requests.Session()
        self._setup_headers()

    def _setup_headers(self):
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
        })

    def _get_proxy(self) -> dict:
        """Build a rotating proxy URL. Each request gets a fresh residential IP."""
        # ThorData rotating residential format
        proxy_user = f"{self.username}-country-{self.country}-session-{random.randint(100000, 999999)}"
        proxy_url = f"http://{proxy_user}:{self.password}@{self.THORDATA_HOST}:{self.THORDATA_PORT}"
        return {"http": proxy_url, "https": proxy_url}

    def get(self, url: str, **kwargs) -> requests.Response:
        kwargs.setdefault("proxies", self._get_proxy())
        kwargs.setdefault("timeout", 30)
        return self.session.get(url, **kwargs)

    def parse(self, url: str, **kwargs) -> BeautifulSoup:
        """Fetch and parse in one call."""
        resp = self.get(url, **kwargs)
        resp.raise_for_status()
        return BeautifulSoup(resp.text, "lxml")


# Usage
scraper = ProxySession(
    username="your_thordata_user",
    password="your_thordata_pass",
    country="US"
)

soup = scraper.parse("https://target-site.com/products")
products = soup.select(".product-card")
print(f"Found {len(products)} products")

For sticky sessions (same IP across multiple requests to maintain login state or pagination context), use session-based routing:

def get_sticky_proxy(username: str, password: str, session_id: str, country: str = "US") -> dict:
    """Return a proxy that sticks to the same IP for the duration of the session."""
    proxy_user = f"{username}-country-{country}-session-{session_id}"
    proxy_url = f"http://{proxy_user}:{password}@proxy.thordata.com:9000"
    return {"http": proxy_url, "https": proxy_url}

# Reuse the same session_id across multiple requests to maintain the same IP
session_id = "scrape-job-001"
proxies = get_sticky_proxy("user", "pass", session_id, country="GB")

session = requests.Session()
session.get("https://site.com/login", proxies=proxies)
session.post("https://site.com/login", data={"user": "x", "pass": "y"}, proxies=proxies)
session.get("https://site.com/dashboard", proxies=proxies)  # Same IP through all requests

Finding Elements

find() and find_all()

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")

# First match
heading = soup.find("h2")

# All matches
links = soup.find_all("a")

# By class
cards = soup.find_all("div", class_="product-card")

# By ID
sidebar = soup.find("div", id="sidebar")

# Multiple classes — a list matches elements with ANY of the classes (OR);
# to require both, use a CSS selector: soup.select("li.active.featured")
items = soup.find_all("li", class_=["active", "featured"])

# By attribute
images = soup.find_all("img", src=True)  # Any img with a src attribute
external = soup.find_all("a", attrs={"target": "_blank"})

# By pattern
import re
price_elements = soup.find_all("span", class_=re.compile(r"price"))

CSS Selectors

CSS selectors are often more readable for complex queries:

# Descendant selection
items = soup.select("#product-list .item")

# Direct child
direct = soup.select("ul > li")

# First of type
first_price = soup.select_one(".product .price")

# Attribute selector
external_links = soup.select('a[href^="https://"]')
data_items = soup.select('[data-category="electronics"]')

# Pseudo-classes — Soup Sieve (bundled with BeautifulSoup 4.7+) supports
# most structural ones, such as :nth-child, :not(), :first-of-type
first_item = soup.select_one("ul li:first-of-type")
all_rows = soup.select("table tr")[1:]  # Slicing also works to skip a header row

Navigating the Tree

# Parent
parent = element.parent

# Siblings
next_el = element.next_sibling
prev_el = element.previous_sibling
next_tag = element.find_next_sibling("div")

# Children
children = list(element.children)
all_descendants = list(element.descendants)

# First matching descendant
first = element.find("div")

Extracting Data

link = soup.find("a", class_="product-link")

# Text content
raw_text = link.text                      # Includes whitespace
clean_text = link.get_text(strip=True)    # Stripped
normalized = link.get_text(separator=" ", strip=True)  # Join with separator

# Attributes — always use .get() to avoid KeyError
href = link.get("href")
data_id = link.get("data-id")
all_attrs = link.attrs  # Dict of all attributes

# When attribute may be a list (like class)
classes = link.get("class", [])  # Returns list: ["btn", "primary"]
class_str = " ".join(link.get("class", []))
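The same defensive habit applies to element lookups: find() and select_one() return None when nothing matches, so chaining .get_text() onto a miss raises AttributeError. A small helper (the name is ours, not a BeautifulSoup API) that tries a list of selectors in order:

```python
def first_text(soup, selectors, default=None):
    """Return the stripped text of the first selector that matches, else default.

    find() and select_one() return None on a miss, so chaining .get_text()
    directly onto the result raises AttributeError; this guard avoids that.
    """
    for sel in selectors:
        el = soup.select_one(sel)
        if el is not None:
            return el.get_text(strip=True)
    return default
```

This fallback-chain pattern keeps extraction code readable when a site uses different markup across page templates.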

Rate Limiting and Delays

Sending requests too fast is the fastest way to get blocked. Implement variable delays that simulate human browsing patterns:

import time
import random

def human_delay(min_seconds: float = 1.0, max_seconds: float = 4.0):
    """Sleep for a random duration to simulate human browsing."""
    delay = random.uniform(min_seconds, max_seconds)
    # Add occasional longer pauses (like a human reading)
    if random.random() < 0.1:  # 10% chance of a longer pause
        delay += random.uniform(3.0, 8.0)
    time.sleep(delay)


def scrape_with_rate_limit(urls: list, session: requests.Session, 
                            min_delay: float = 1.5, max_delay: float = 5.0) -> list:
    """Scrape a list of URLs with rate limiting."""
    results = []
    for i, url in enumerate(urls):
        try:
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            results.append({"url": url, "html": resp.text, "status": resp.status_code})
        except Exception as e:
            results.append({"url": url, "html": None, "error": str(e)})

        # Don't delay after the last URL
        if i < len(urls) - 1:
            human_delay(min_delay, max_delay)

    return results

For async scraping with httpx, you can control concurrency more precisely:

import asyncio
import httpx
import random

async def scrape_urls_async(urls: list, max_concurrent: int = 3,
                             min_delay: float = 0.5, max_delay: float = 2.0) -> list:
    """Async scraper with concurrency limiting and delays."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results = []

    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    async def fetch(client: httpx.AsyncClient, url: str) -> dict:
        async with semaphore:
            await asyncio.sleep(random.uniform(min_delay, max_delay))
            try:
                resp = await client.get(url, timeout=30)
                return {"url": url, "html": resp.text, "status": resp.status_code}
            except Exception as e:
                return {"url": url, "html": None, "error": str(e)}

    async with httpx.AsyncClient(headers=headers, follow_redirects=True) as client:
        tasks = [fetch(client, url) for url in urls]
        results = await asyncio.gather(*tasks)

    return list(results)


# Run it
urls = ["https://example.com/page1", "https://example.com/page2"]
results = asyncio.run(scrape_urls_async(urls))

Error Handling and Retry Logic

Production scrapers must handle failures gracefully. Networks are unreliable, sites go down, rate limits trigger. The tenacity library provides clean retry logic:

import requests
from bs4 import BeautifulSoup
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
    before_sleep_log,
)
import logging
import time

logger = logging.getLogger(__name__)


class ScraperError(Exception):
    pass


class RateLimitError(ScraperError):
    pass


class BlockedError(ScraperError):
    pass


@retry(
    retry=retry_if_exception_type((requests.RequestException, ScraperError)),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=2, max=60),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def fetch_with_retry(url: str, session: requests.Session) -> BeautifulSoup:
    """Fetch a URL with exponential backoff retry."""
    resp = session.get(url, timeout=30)

    # Detect soft blocks
    if resp.status_code == 429:
        retry_after = int(resp.headers.get("Retry-After", 60))
        logger.warning(f"Rate limited. Waiting {retry_after}s")
        time.sleep(retry_after)
        raise RateLimitError(f"Rate limited on {url}")

    if resp.status_code == 403:
        raise BlockedError(f"Blocked (403) on {url}")

    if resp.status_code >= 500:
        raise ScraperError(f"Server error {resp.status_code} on {url}")

    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "lxml")

    # Detect CAPTCHA pages
    if _is_captcha_page(soup):
        raise BlockedError(f"CAPTCHA detected on {url}")

    # Detect empty/honeypot responses
    if _is_empty_response(soup):
        raise ScraperError(f"Suspicious empty response for {url}")

    return soup


def _is_captcha_page(soup: BeautifulSoup) -> bool:
    """Detect common CAPTCHA patterns."""
    captcha_indicators = [
        soup.find("div", id="captcha"),
        soup.find("div", class_="g-recaptcha"),
        soup.find("div", class_="h-captcha"),
        soup.find(string=lambda t: t and "robot" in t.lower()),
        soup.find(string=lambda t: t and "captcha" in t.lower()),
        bool(soup.title and soup.title.string and "access denied" in soup.title.string.lower()),
    ]
    return any(captcha_indicators)


def _is_empty_response(soup: BeautifulSoup) -> bool:
    """Detect suspiciously empty responses."""
    body_text = soup.get_text(strip=True)
    return len(body_text) < 200  # Adjust threshold per site


# Manual retry with progressive delays for simpler cases
def fetch_simple_retry(url: str, session: requests.Session, 
                        max_retries: int = 3) -> requests.Response:
    """Simple retry without external dependencies."""
    delays = [2, 5, 15]  # Seconds between retries

    for attempt in range(max_retries):
        try:
            resp = session.get(url, timeout=30)
            if resp.status_code == 200:
                return resp
            if resp.status_code == 429:
                wait = delays[min(attempt, len(delays)-1)]
                logger.warning(f"Rate limited, waiting {wait}s (attempt {attempt+1})")
                time.sleep(wait)
                continue
            resp.raise_for_status()
        except requests.ConnectionError as e:
            if attempt == max_retries - 1:
                raise
            wait = delays[min(attempt, len(delays)-1)]
            logger.warning(f"Connection error, retrying in {wait}s: {e}")
            time.sleep(wait)

    raise ScraperError(f"Failed after {max_retries} attempts: {url}")

CAPTCHA Handling

CAPTCHAs are the hardest problem in web scraping. The right response depends on your use case and scale:

Approach 1: Avoid CAPTCHAs by looking like a browser

import asyncio
import random
from playwright.async_api import async_playwright

async def scrape_with_playwright(url: str, proxy: str = None) -> str:
    """Use a real browser to avoid CAPTCHA triggers."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-infobars",
                "--no-sandbox",
            ]
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            viewport={"width": 1440, "height": 900},
            proxy={"server": proxy} if proxy else None,
            java_script_enabled=True,
        )

        # Patch automation detection
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
            Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
            Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
            window.chrome = { runtime: {} };
        """)

        page = await context.new_page()

        # Simulate human-like navigation
        await page.goto(url, wait_until="domcontentloaded")
        await asyncio.sleep(random.uniform(1, 3))

        # Scroll to simulate reading
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight / 3)")
        await asyncio.sleep(random.uniform(0.5, 1.5))

        content = await page.content()
        await browser.close()
        return content

Approach 2: Detect and wait for CAPTCHA resolution

async def handle_captcha_wait(page, timeout: int = 30) -> bool:
    """Wait for manual CAPTCHA resolution (for supervised scraping)."""
    captcha_selectors = [
        ".g-recaptcha",
        "#captcha",
        ".h-captcha",
        "iframe[src*='recaptcha']",
        "iframe[src*='hcaptcha']",
    ]

    for selector in captcha_selectors:
        captcha = await page.query_selector(selector)
        if captcha:
            print(f"CAPTCHA detected. Waiting up to {timeout}s for resolution...")
            try:
                # Wait for CAPTCHA container to disappear
                await page.wait_for_selector(selector, state="hidden", timeout=timeout * 1000)
                return True
            except Exception:
                return False

    return True  # No CAPTCHA found


Approach 3: Skip to the API

Before building CAPTCHA handling, check the Network tab in DevTools. Most sites that show CAPTCHAs on their web interface have a mobile API or internal JSON endpoint that bypasses them entirely. This is almost always the better path.
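As a sketch of that path, suppose DevTools reveals a JSON endpoint behind the product listing (the URL and payload shape here are hypothetical). You skip BeautifulSoup entirely and parse the JSON:

```python
import json

def fetch_products_json(session, page=1):
    """Fetch a page from a (hypothetical) internal JSON endpoint.

    `session` is a requests.Session configured as in the earlier sections.
    """
    resp = session.get(
        "https://target-site.com/api/v1/products",  # discovered in the Network tab
        params={"page": page},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    return parse_products(resp.text)

def parse_products(payload: str) -> list:
    """Extract the fields we care about from the JSON payload."""
    data = json.loads(payload)
    return [
        {"name": item.get("title"), "price": item.get("price")}
        for item in data.get("items", [])
    ]
```

JSON endpoints are also far more stable than HTML markup, so scrapers built this way break less often.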

Real-World Use Cases

Use Case 1: E-commerce Price Monitor

import requests
from bs4 import BeautifulSoup
import json
from datetime import datetime
from dataclasses import dataclass, asdict
from typing import Optional
import time
import random

@dataclass
class Product:
    url: str
    name: str
    price: Optional[float]
    currency: str
    availability: str
    scraped_at: str
    sku: Optional[str] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None

def scrape_product(url: str, session: requests.Session) -> Product:
    """Extract product data from a product page."""
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    # Try common product name patterns
    name = None
    for selector in ["h1.product-title", "h1.product-name", "#productTitle", "h1[itemprop='name']", "h1"]:
        el = soup.select_one(selector)
        if el:
            name = el.get_text(strip=True)
            break

    # Extract price — handle formats like "$24.99", "24,99 €", "$1,299.99"
    import re
    price = None
    currency = "USD"
    for selector in [".price", ".product-price", "[itemprop='price']", ".offer-price", "#priceblock_ourprice"]:
        el = soup.select_one(selector)
        if el:
            raw = el.get_text(strip=True)
            match = re.search(r"\d[\d.,]*", raw)
            if match:
                num = match.group()
                # "24,99" uses a decimal comma; "1,299.99" uses thousands separators
                if re.fullmatch(r"\d+,\d{1,2}", num):
                    num = num.replace(",", ".")
                else:
                    num = num.replace(",", "")
                try:
                    price = float(num)
                except ValueError:
                    pass
            if "€" in raw:
                currency = "EUR"
            elif "£" in raw:
                currency = "GBP"
            break

    # Availability
    availability = "unknown"
    avail_el = soup.find(attrs={"itemprop": "availability"})
    if avail_el:
        avail_href = avail_el.get("href", "")
        if "InStock" in avail_href:
            availability = "in_stock"
        elif "OutOfStock" in avail_href:
            availability = "out_of_stock"
    else:
        page_text = soup.get_text().lower()
        if "add to cart" in page_text or "in stock" in page_text:
            availability = "in_stock"
        elif "out of stock" in page_text or "unavailable" in page_text:
            availability = "out_of_stock"

    return Product(
        url=url,
        name=name or "Unknown",
        price=price,
        currency=currency,
        availability=availability,
        scraped_at=datetime.utcnow().isoformat(),
    )

def monitor_prices(urls: list, output_file: str = "prices.jsonl"):
    """Monitor multiple product pages, appending results to a JSONL file."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    })

    with open(output_file, "a") as f:
        for url in urls:
            try:
                product = scrape_product(url, session)
                f.write(json.dumps(asdict(product)) + "\n")
                print(f"✓ {product.name}: {product.currency} {product.price}")
            except Exception as e:
                print(f"✗ {url}: {e}")
            time.sleep(random.uniform(2, 5))

Output schema:

{
  "url": "https://example.com/products/widget-pro",
  "name": "Widget Pro 2026",
  "price": 24.99,
  "currency": "USD",
  "availability": "in_stock",
  "scraped_at": "2026-03-31T10:22:45.123456",
  "sku": null,
  "rating": null,
  "review_count": null
}
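Once prices.jsonl has accumulated a few runs, reading it back is plain JSON handling. A sketch that groups records by URL and reports the first-to-last price change (field names follow the schema above; the helper itself is illustrative):

```python
import json

def price_changes(jsonl_lines):
    """Group JSONL price records by URL and report first vs. latest price."""
    history = {}
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("price") is not None:  # Skip records where parsing failed
            history.setdefault(rec["url"], []).append(rec["price"])
    return {
        url: {
            "first": prices[0],
            "last": prices[-1],
            "change": round(prices[-1] - prices[0], 2),
        }
        for url, prices in history.items()
    }

# Usage: with open("prices.jsonl") as f: report = price_changes(f)
```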

Use Case 2: News Article Aggregator

from bs4 import BeautifulSoup
import requests
import feedparser
from dataclasses import dataclass
from typing import Optional, List
from datetime import datetime
import hashlib

@dataclass
class Article:
    url: str
    title: str
    author: Optional[str]
    published_at: Optional[str]
    body_text: str
    word_count: int
    tags: List[str]
    source_domain: str
    content_hash: str

def extract_article(url: str, session: requests.Session) -> Article:
    """Extract article content using common news site patterns."""
    resp = session.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "lxml")

    # Remove noise elements
    for tag in soup.select("nav, footer, header, .sidebar, .advertisement, .ad, script, style, [class*='ad-'], [id*='sidebar']"):
        tag.decompose()

    # Title
    title = None
    for selector in ["h1.article-title", "h1.entry-title", "h1.post-title", 
                      "[itemprop='headline']", "article h1", "h1"]:
        el = soup.select_one(selector)
        if el:
            title = el.get_text(strip=True)
            break

    # Author
    author = None
    for selector in ["[rel='author']", "[itemprop='author']", ".author-name", 
                      ".byline", "[class*='author']"]:
        el = soup.select_one(selector)
        if el:
            author = el.get_text(strip=True)
            break

    # Published date
    published = None
    for selector in ["time[datetime]", "[itemprop='datePublished']", ".published-date"]:
        el = soup.select_one(selector)
        if el:
            published = el.get("datetime") or el.get_text(strip=True)
            break

    # Body text — prioritize article/main content containers
    body = ""
    for selector in ["article .content", "article .body", ".article-body", 
                      ".entry-content", ".post-content", "article", "main"]:
        el = soup.select_one(selector)
        if el:
            # Get all paragraph text
            paragraphs = el.find_all("p")
            body = " ".join(p.get_text(strip=True) for p in paragraphs if len(p.get_text(strip=True)) > 50)
            if len(body) > 200:
                break

    # Tags from meta keywords or tag links
    tags = []
    meta_keywords = soup.find("meta", attrs={"name": "keywords"})
    if meta_keywords:
        tags = [t.strip() for t in meta_keywords.get("content", "").split(",")]

    from urllib.parse import urlparse
    domain = urlparse(url).netloc

    return Article(
        url=url,
        title=title or "Unknown",
        author=author,
        published_at=published,
        body_text=body,
        word_count=len(body.split()),
        tags=tags[:10],
        source_domain=domain,
        content_hash=hashlib.md5(body.encode()).hexdigest(),
    )
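The content_hash field is there so an aggregator can drop syndicated reprints of the same story fetched from different URLs. A minimal dedup pass over extracted articles (the helper name is ours):

```python
def dedupe_articles(articles):
    """Keep the first article seen for each content hash, dropping reprints."""
    seen = set()
    unique = []
    for article in articles:
        if article.content_hash not in seen:
            seen.add(article.content_hash)
            unique.append(article)
    return unique
```

Hashing the body text rather than the URL catches the common case where the same wire story appears under many domains.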

Use Case 3: Job Listing Scraper with Pagination

import requests
from bs4 import BeautifulSoup
from typing import Generator
import time
import random
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class JobListing:
    title: str
    company: str
    location: str
    salary: Optional[str]
    url: str
    job_type: Optional[str]
    description_preview: str
    tags: List[str]

def scrape_jobs_paginated(base_url: str, query: str, location: str,
                           max_pages: int = 10) -> Generator[JobListing, None, None]:
    """Scrape job listings across multiple pages."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    })

    for page in range(1, max_pages + 1):
        params = {"q": query, "l": location, "start": (page - 1) * 10}

        try:
            resp = session.get(base_url, params=params, timeout=30)
            resp.raise_for_status()
        except requests.HTTPError as e:
            print(f"Page {page} failed: {e}")
            break

        soup = BeautifulSoup(resp.text, "lxml")

        # Generic job card pattern — adapt selectors per site
        job_cards = soup.select(".job-card, .job-listing, [data-testid='job-card'], .result")

        if not job_cards:
            print(f"No jobs found on page {page}, stopping pagination")
            break

        for card in job_cards:
            title_el = card.select_one("h2 a, h3 a, .job-title a, [data-testid='job-title']")
            company_el = card.select_one(".company-name, [data-testid='company-name'], .employer")
            location_el = card.select_one(".location, [data-testid='job-location'], .job-location")
            salary_el = card.select_one(".salary, .compensation, [data-testid='salary-snippet']")
            desc_el = card.select_one(".description, .summary, .job-snippet")

            if not title_el:
                continue

            tags = [t.get_text(strip=True) for t in card.select(".tag, .skill-tag, .badge")]

            yield JobListing(
                title=title_el.get_text(strip=True),
                company=company_el.get_text(strip=True) if company_el else "Unknown",
                location=location_el.get_text(strip=True) if location_el else "Remote/Unknown",
                salary=salary_el.get_text(strip=True) if salary_el else None,
                url=requests.compat.urljoin(base_url, title_el.get("href", "")),
                job_type=None,
                description_preview=desc_el.get_text(strip=True)[:300] if desc_el else "",
                tags=tags[:8],
            )

        # Check for next page
        next_link = soup.select_one("a[aria-label='Next'], .pagination-next a, a.next")
        if not next_link:
            print(f"No next page link found after page {page}")
            break

        time.sleep(random.uniform(2, 5))

Use Case 4: Real Estate Listing Scraper

import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
from typing import Optional, List
import re

@dataclass
class PropertyListing:
    address: str
    price: Optional[float]
    price_per_sqft: Optional[float]
    bedrooms: Optional[int]
    bathrooms: Optional[float]
    square_feet: Optional[int]
    lot_size: Optional[str]
    year_built: Optional[int]
    property_type: str
    listing_url: str
    mls_id: Optional[str]
    days_on_market: Optional[int]
    description: str
    features: List[str] = field(default_factory=list)

def parse_property_details(soup: BeautifulSoup, url: str) -> PropertyListing:
    """Parse property details from a listing page."""

    def clean_number(text: str) -> Optional[float]:
        """Extract numeric value from formatted string."""
        if not text:
            return None
        clean = re.sub(r"[^\d.]", "", text.replace(",", ""))
        try:
            return float(clean)
        except ValueError:
            return None

    # Address
    address = ""
    for sel in ["h1.address", "[itemprop='streetAddress']", ".property-address", "h1"]:
        el = soup.select_one(sel)
        if el:
            address = el.get_text(strip=True)
            break

    # Price
    price = None
    for sel in [".listing-price", ".price", "[data-testid='price']", "span.price"]:
        el = soup.select_one(sel)
        if el:
            price = clean_number(el.get_text())
            break

    # Key facts — often in a definition list or facts grid
    bedrooms = bathrooms = sqft = year_built = None

    facts_container = soup.select_one(".facts-grid, .property-facts, .key-facts, .home-facts")
    if facts_container:
        text = facts_container.get_text(" ", strip=True).lower()

        bed_match = re.search(r"(\d+)\s*bed", text)
        bath_match = re.search(r"([\d.]+)\s*bath", text)
        sqft_match = re.search(r"([\d,]+)\s*sq\.?\s*ft", text)
        year_match = re.search(r"built\s+in\s+(\d{4})", text)

        bedrooms = int(bed_match.group(1)) if bed_match else None
        bathrooms = float(bath_match.group(1)) if bath_match else None
        sqft = int(sqft_match.group(1).replace(",", "")) if sqft_match else None
        year_built = int(year_match.group(1)) if year_match else None

    # Description
    desc = ""
    for sel in [".property-description", ".listing-description", "[data-testid='description']"]:
        el = soup.select_one(sel)
        if el:
            desc = el.get_text(strip=True)
            break

    # Features/amenities
    features = [el.get_text(strip=True) for el in soup.select(".features li, .amenities li, .feature-item")]

    price_per_sqft = round(price / sqft, 2) if price and sqft else None

    return PropertyListing(
        address=address,
        price=price,
        price_per_sqft=price_per_sqft,
        bedrooms=bedrooms,
        bathrooms=bathrooms,
        square_feet=sqft,
        lot_size=None,
        year_built=year_built,
        property_type="residential",
        listing_url=url,
        mls_id=None,
        days_on_market=None,
        description=desc[:1000],
        features=features[:20],
    )
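
The facts-grid regexes are the fragile part of this parser, so it pays to sanity-check them offline before pointing the scraper at live pages. A minimal check using the same patterns as above — the markup here is hypothetical, since every listing site structures its facts differently:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical facts-grid markup -- real listing sites vary widely
html = """
<div class="facts-grid">
  <span>3 beds</span> <span>2.5 baths</span>
  <span>1,850 sq ft</span> <span>Built in 1987</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
text = soup.select_one(".facts-grid").get_text(" ", strip=True).lower()

# Same patterns as parse_property_details above
bedrooms = int(re.search(r"(\d+)\s*bed", text).group(1))
bathrooms = float(re.search(r"([\d.]+)\s*bath", text).group(1))
sqft = int(re.search(r"([\d,]+)\s*sq\.?\s*ft", text).group(1).replace(",", ""))
year_built = int(re.search(r"built\s+in\s+(\d{4})", text).group(1))

print(bedrooms, bathrooms, sqft, year_built)  # 3 2.5 1850 1987
```

If a real page uses "ba" instead of "baths" or separates facts into individual labeled elements, the regexes need adjusting — which is exactly why this kind of fixture test is worth keeping next to the scraper.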

Use Case 5: Academic Paper Metadata Extractor

import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Paper:
    title: str
    authors: List[str]
    abstract: str
    doi: Optional[str]
    journal: Optional[str]
    year: Optional[int]
    keywords: List[str]
    citations: Optional[int]
    pdf_url: Optional[str]

def extract_paper_metadata(url: str, session: requests.Session) -> Paper:
    """Extract academic paper metadata. Works with arXiv, PubMed-style pages."""
    resp = session.get(url, timeout=30)
    soup = BeautifulSoup(resp.text, "lxml")

    # Try Open Graph / Dublin Core / Schema.org metadata first (most reliable)
    def get_meta(name: Optional[str] = None, property_: Optional[str] = None) -> Optional[str]:
        """Read a <meta> tag's content by its name= or property= attribute."""
        if name:
            el = soup.find("meta", attrs={"name": name})
        elif property_:
            el = soup.find("meta", attrs={"property": property_})
        else:
            return None
        return el.get("content") if el else None

    h1 = soup.find("h1")
    title = (get_meta("citation_title") or get_meta(property_="og:title") or
             (h1.get_text(strip=True) if h1 else None))

    abstract = (get_meta("citation_abstract") or get_meta("description") or
                get_meta(property_="og:description") or "")

    doi = get_meta("citation_doi")
    journal = get_meta("citation_journal_title")

    # Authors from citation metadata (can be multiple)
    author_metas = soup.find_all("meta", attrs={"name": "citation_author"})
    if author_metas:
        authors = [m.get("content", "") for m in author_metas]
    else:
        # Fallback to page scraping
        author_els = soup.select(".author, [itemprop='author'], .authors a")
        authors = [el.get_text(strip=True) for el in author_els]

    # Year
    year = None
    date_str = get_meta("citation_publication_date") or get_meta("citation_date")
    if date_str:
        import re
        year_match = re.search(r"\d{4}", date_str)
        year = int(year_match.group()) if year_match else None

    # Keywords
    keywords_str = get_meta("citation_keywords") or get_meta("keywords") or ""
    keywords = [k.strip() for k in keywords_str.replace(";", ",").split(",") if k.strip()]

    # PDF link
    pdf_url = None
    pdf_meta = soup.find("meta", attrs={"name": "citation_pdf_url"})
    if pdf_meta:
        pdf_url = pdf_meta.get("content")
    else:
        pdf_link = soup.find("a", href=lambda h: h and ".pdf" in h.lower())
        if pdf_link:
            pdf_url = pdf_link.get("href")

    return Paper(
        title=title or "Unknown",
        authors=authors,
        abstract=abstract[:2000],
        doi=doi,
        journal=journal,
        year=year,
        keywords=keywords[:15],
        citations=None,
        pdf_url=pdf_url,
    )
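
The citation_* tags are the Highwire Press metadata that Google Scholar indexes, which is why the function prefers them over visible page elements: they are stable across redesigns. A quick offline check of the multi-author pattern, using real metadata from a well-known paper as the fixture:

```python
from bs4 import BeautifulSoup

html = """
<head>
  <meta name="citation_title" content="Attention Is All You Need">
  <meta name="citation_author" content="Ashish Vaswani">
  <meta name="citation_author" content="Noam Shazeer">
  <meta name="citation_publication_date" content="2017/06/12">
</head>
"""
soup = BeautifulSoup(html, "lxml")

# One citation_author tag per author, in order -- hence find_all, not find
title = soup.find("meta", attrs={"name": "citation_title"})["content"]
authors = [m["content"] for m in soup.find_all("meta", attrs={"name": "citation_author"})]

print(title)    # Attention Is All You Need
print(authors)  # ['Ashish Vaswani', 'Noam Shazeer']
```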

Use Case 6: Social Media Public Profile Scraper

import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class PublicProfile:
    username: str
    display_name: str
    bio: Optional[str]
    follower_count: Optional[int]
    following_count: Optional[int]
    post_count: Optional[int]
    website: Optional[str]
    recent_posts: List[dict]

async def scrape_public_profile(username: str, platform_url: str,
                                proxy: Optional[str] = None) -> PublicProfile:
    """Scrape a public social media profile using Playwright."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"]
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800},
            proxy={"server": proxy} if proxy else None,
        )

        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
        """)

        page = await context.new_page()

        # Intercept API calls to extract data more cleanly
        api_data = {}

        async def handle_response(response):
            if "graphql" in response.url or "/api/" in response.url:
                try:
                    data = await response.json()
                    api_data[response.url] = data
                except Exception:
                    pass

        page.on("response", handle_response)

        await page.goto(f"{platform_url}/{username}", wait_until="networkidle")
        await asyncio.sleep(2)

        html = await page.content()
        soup = BeautifulSoup(html, "lxml")
        await browser.close()

    # Parse the rendered HTML
    import re

    def parse_count(text: str) -> Optional[int]:
        if not text:
            return None
        text = text.replace(",", "").strip()
        if "K" in text:
            return int(float(text.replace("K", "")) * 1000)
        if "M" in text:
            return int(float(text.replace("M", "")) * 1_000_000)
        try:
            return int(re.sub(r"[^\d]", "", text))
        except ValueError:
            return None

    # These selectors are illustrative — adapt per platform
    display_name = ""
    bio = ""
    name_el = soup.select_one("h1, .profile-name, [data-testid='display-name']")
    bio_el = soup.select_one(".bio, .profile-bio, [data-testid='bio']")

    if name_el:
        display_name = name_el.get_text(strip=True)
    if bio_el:
        bio = bio_el.get_text(strip=True)

    return PublicProfile(
        username=username,
        display_name=display_name,
        bio=bio,
        follower_count=None,  # Extract from stats elements
        following_count=None,
        post_count=None,
        website=None,
        recent_posts=[],
    )
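
Filling in the follower_count placeholder means locating the stats elements and running their text through parse_count. A self-contained sketch against hypothetical markup — the .stats and .stat-value selectors are invented for illustration; inspect the real DOM and adapt:

```python
import re
from typing import Optional
from bs4 import BeautifulSoup

def parse_count(text: str) -> Optional[int]:
    # Same abbreviation handling as the helper above: "12.5K" -> 12500
    if not text:
        return None
    text = text.replace(",", "").strip()
    if "K" in text:
        return int(float(text.replace("K", "")) * 1000)
    if "M" in text:
        return int(float(text.replace("M", "")) * 1_000_000)
    try:
        return int(re.sub(r"[^\d]", "", text))
    except ValueError:
        return None

# Hypothetical profile-stats markup -- real platforms differ
html = """
<ul class="stats">
  <li><span class="stat-value">12.5K</span> followers</li>
  <li><span class="stat-value">384</span> following</li>
  <li><span class="stat-value">1.2M</span> likes</li>
</ul>
"""
soup = BeautifulSoup(html, "lxml")
values = [parse_count(el.get_text()) for el in soup.select(".stats .stat-value")]
print(values)  # [12500, 384, 1200000]
```

Because platforms localize abbreviations ("12,5 k", "1.2 jt"), treat parse_count as a starting point and unit-test it against strings copied from the actual target.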

Use Case 7: Government Data Extractor

import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
from typing import Optional, List
import csv
import io

@dataclass
class GovernmentRecord:
    record_id: str
    source_url: str
    record_type: str
    entity_name: str
    date: Optional[str]
    amount: Optional[float]
    description: str
    raw_data: dict = field(default_factory=dict)

def scrape_government_data_table(url: str, record_type: str,
                                   session: requests.Session) -> List[GovernmentRecord]:
    """Extract tabular data from government data portals."""
    resp = session.get(url, timeout=60)
    soup = BeautifulSoup(resp.text, "lxml")

    records = []

    # Try to find a data table
    tables = soup.find_all("table")
    if not tables:
        print("No tables found — checking for downloadable data")
        # Many .gov sites offer CSV downloads
        csv_link = soup.find("a", href=lambda h: h and (".csv" in h.lower() or "download" in h.lower()))
        if csv_link:
            csv_url = csv_link.get("href")
            if not csv_url.startswith("http"):
                from urllib.parse import urljoin
                csv_url = urljoin(url, csv_url)
            csv_resp = session.get(csv_url, timeout=60)
            reader = csv.DictReader(io.StringIO(csv_resp.text))
            for row in reader:
                records.append(GovernmentRecord(
                    record_id=str(len(records)),
                    source_url=url,
                    record_type=record_type,
                    entity_name=list(row.values())[0] if row else "",
                    date=None,
                    amount=None,
                    description=str(row),
                    raw_data=dict(row),
                ))
        return records

    # Parse the largest table
    main_table = max(tables, key=lambda t: len(t.find_all("tr")))

    # Extract headers
    header_row = main_table.find("tr")
    headers = [th.get_text(strip=True).lower().replace(" ", "_")
               for th in header_row.find_all(["th", "td"])]

    import re  # hoisted out of the row loop below

    for row in main_table.find_all("tr")[1:]:
        cells = row.find_all(["td", "th"])
        if not cells:
            continue

        row_data = {}
        for i, cell in enumerate(cells):
            if i < len(headers):
                row_data[headers[i]] = cell.get_text(strip=True)

        # Map common column names onto the standard fields
        entity_name = (row_data.get("name") or row_data.get("entity")
                       or next(iter(row_data.values()))) if row_data else ""

        amount = None
        for key in ["amount", "value", "total", "contract_amount"]:
            if key in row_data:
                clean = re.sub(r"[^\d.]", "", row_data[key])
                try:
                    amount = float(clean)
                    break
                except ValueError:
                    pass

        records.append(GovernmentRecord(
            record_id=row_data.get("id", str(len(records))),
            source_url=url,
            record_type=record_type,
            entity_name=entity_name,
            date=row_data.get("date") or row_data.get("filing_date"),
            amount=amount,
            description=str(row_data),
            raw_data=row_data,
        ))

    return records
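
Government portals change markup rarely but vary between agencies, so it is worth exercising the header-mapping and amount-cleanup logic against a fixture before pointing it at a live portal. A minimal offline check using the same extraction steps as above:

```python
import re
from bs4 import BeautifulSoup

# Fixture table in the shape the scraper expects
html = """
<table>
  <tr><th>Entity Name</th><th>Contract Amount</th><th>Date</th></tr>
  <tr><td>Acme Corp</td><td>$12,500.00</td><td>2025-01-15</td></tr>
</table>
"""
soup = BeautifulSoup(html, "lxml")
table = soup.find("table")

# Header cells become snake_case dict keys, exactly as in the function above
headers = [th.get_text(strip=True).lower().replace(" ", "_")
           for th in table.find("tr").find_all(["th", "td"])]

row = table.find_all("tr")[1]
row_data = {headers[i]: td.get_text(strip=True)
            for i, td in enumerate(row.find_all(["td", "th"]))}

amount = float(re.sub(r"[^\d.]", "", row_data["contract_amount"]))
print(row_data["entity_name"], amount)  # Acme Corp 12500.0
```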

Output Schemas and Storage

Always define your output schema before you start scraping:

import json
import csv
import sqlite3
from pathlib import Path
from dataclasses import asdict

# JSONL — best for streaming large datasets
def save_jsonl(records: list, filepath: str):
    with open(filepath, "a", encoding="utf-8") as f:
        for record in records:
            row = asdict(record) if hasattr(record, "__dataclass_fields__") else record
            f.write(json.dumps(row, ensure_ascii=False, default=str) + "\n")

# CSV — best for spreadsheet analysis
def save_csv(records: list, filepath: str, fieldnames: list = None):
    if not records:
        return

    dicts = [asdict(r) if hasattr(r, "__dataclass_fields__") else r for r in records]
    fieldnames = fieldnames or list(dicts[0].keys())

    write_header = not Path(filepath).exists()
    with open(filepath, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        writer.writerows(dicts)

# SQLite — best for querying and incremental updates
def save_sqlite(records: list, db_path: str, table_name: str):
    if not records:
        return

    dicts = [asdict(r) if hasattr(r, "__dataclass_fields__") else r for r in records]

    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Create table from first record's keys. Column and table names are
    # interpolated directly into the SQL, so use trusted identifiers only.
    columns = list(dicts[0].keys())
    col_defs = ", ".join(f"{col} TEXT" for col in columns)
    cursor.execute(f"CREATE TABLE IF NOT EXISTS {table_name} ({col_defs})")

    placeholders = ", ".join("?" * len(columns))
    for record in dicts:
        values = [str(record.get(col, "")) for col in columns]
        # Note: without a PRIMARY KEY or UNIQUE constraint, OR REPLACE
        # behaves like a plain INSERT — add one if you need idempotent re-runs
        cursor.execute(
            f"INSERT OR REPLACE INTO {table_name} VALUES ({placeholders})",
            values
        )

    conn.commit()
    conn.close()
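
A quick round-trip check of the JSONL shape — each line is an independent JSON document, which is what makes the format safe for appends and streaming. The Quote dataclass here is purely illustrative, and StringIO stands in for the output file:

```python
import io
import json
from dataclasses import dataclass, asdict

@dataclass
class Quote:
    text: str
    author: str

records = [Quote("Simple is better than complex.", "Tim Peters"),
           Quote("Readability counts.", "Tim Peters")]

buf = io.StringIO()  # stands in for the JSONL file on disk
for r in records:
    buf.write(json.dumps(asdict(r), ensure_ascii=False) + "\n")

# Reading back: one json.loads per line, no framing needed
loaded = [json.loads(line) for line in buf.getvalue().splitlines()]
print(loaded[0]["author"])  # Tim Peters
```

Because each record is its own line, a crash mid-run corrupts at most one line — re-running with append mode picks up where you left off.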

Production Checklist

Before deploying a scraper to production:

  1. Test with 5 URLs before running 5,000
  2. Cache raw HTML during development to avoid hammering targets
  3. Log everything — URL, status code, response size, parse time
  4. Handle None everywhere — every select_one() can return None
  5. Validate output — check that extracted fields are non-empty before saving
  6. Monitor file sizes — empty or suspiciously small outputs indicate blocking
  7. Set realistic timeouts — 30s for page loads, 60s for slow government sites
  8. Respect robots.txt — at minimum for legal protection and to stay undetected longer
  9. Use a session — reuse TCP connections and cookies across requests
  10. Rotate proxies via ThorData for any volume above ~100 URLs per day

A minimal logging setup plus the per-URL template below cover checklist items 3, 4, and 6:

import logging
import time

import requests
from bs4 import BeautifulSoup

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler("scraper.log"),
        logging.StreamHandler(),
    ]
)
logger = logging.getLogger(__name__)

# Template for every production scrape
def scrape_url(url: str, session: requests.Session) -> dict:
    start = time.time()
    try:
        resp = session.get(url, timeout=30)
        elapsed = time.time() - start
        logger.info(f"GET {url} → {resp.status_code} ({len(resp.content)} bytes, {elapsed:.2f}s)")

        if resp.status_code != 200:
            logger.warning(f"Non-200 status for {url}: {resp.status_code}")
            return {"url": url, "error": f"HTTP {resp.status_code}", "data": None}

        soup = BeautifulSoup(resp.text, "lxml")
        data = extract_data(soup)  # Your extraction logic

        return {"url": url, "data": data, "error": None}

    except requests.Timeout:
        logger.error(f"Timeout fetching {url}")
        return {"url": url, "error": "timeout", "data": None}
    except Exception as e:
        logger.exception(f"Unexpected error for {url}: {e}")
        return {"url": url, "error": str(e), "data": None}
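
Checklist item 2 — caching raw HTML during development — is a one-function habit. A sketch under assumed conventions (the .cache/html directory and SHA-256 filenames are arbitrary choices):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path(".cache/html")

def cache_key(url: str) -> Path:
    """Deterministic on-disk filename for a URL."""
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")

def fetch_cached(url: str, session) -> str:
    """Return cached HTML if present; otherwise fetch once and cache it."""
    path = cache_key(url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(resp.text, encoding="utf-8")
    return resp.text
```

Every re-run after the first parses from disk instead of hammering the target; delete .cache/html when you need fresh pages.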

Summary

BeautifulSoup handles the HTML parsing layer of web scraping reliably and efficiently. Pair it with requests for simple static sites, httpx for async work, and playwright when JavaScript rendering is required. Add proper headers to avoid easy detection, proxy rotation via ThorData for volume work, and tenacity for retry logic. Define your output schema before you start extracting data.

The biggest wins in production scraping come not from clever parsing tricks but from respecting the fundamental rules: look like a browser, rotate your IP, slow down, handle failures gracefully, and always validate your output before assuming the scraper worked.