Extracting Structured Data from HTML: The Complete Python Guide (2026)
Every scraping project starts with the same question: how do I get this data out of the page? There are multiple approaches, and picking the right one saves you hours of debugging brittle selectors.
Here's the thing most tutorials miss — you should check for structured data before writing a single selector. Many sites embed clean, machine-readable data that's more reliable and easier to parse than any CSS selector you could write.
This guide covers every extraction technique available in 2026, from the simplest to the most complex, with production-ready Python code and real-world examples for each approach.
The Extraction Hierarchy
Before writing any code, understand the reliability hierarchy. Always try methods higher on this list before falling back to lower ones:
- APIs — official or undocumented, always the most reliable
- JSON-LD / Schema.org — structured data embedded in the page for SEO
- Microdata / RDFa — older structured data formats, still widely used
- Open Graph / Twitter Cards — metadata tags with clean data
- __NEXT_DATA__ / inline JSON — data blobs in script tags
- CSS Selectors — the workhorse for most scraping
- XPath — when CSS selectors aren't powerful enough
- Regex — last resort for data embedded in JavaScript variables
Each method further down is more fragile — more likely to break when the site redesigns. Let's dive into each one.
1. JSON-LD: Your First Stop (Always)
JSON-LD (JavaScript Object Notation for Linked Data) is structured data embedded in <script type="application/ld+json"> tags. Sites add it for SEO — Google uses it to generate rich search results. This means the data is accurate, maintained by the site's SEO team, and unlikely to disappear in a redesign.
Basic JSON-LD Extraction
import json
from bs4 import BeautifulSoup
import httpx
def extract_jsonld(html: str) -> list[dict]:
"""Extract all JSON-LD blocks from an HTML page."""
soup = BeautifulSoup(html, "html.parser")
results = []
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
results.append(data)
except (json.JSONDecodeError, TypeError):
continue
return results
# Example: Scrape a product page
resp = httpx.get("https://example.com/product/123")
ld_blocks = extract_jsonld(resp.text)
for block in ld_blocks:
if block.get("@type") == "Product":
print(f"Name: {block['name']}")
print(f"Price: {block['offers']['price']}")
print(f"Currency: {block['offers']['priceCurrency']}")
print(f"Available: {block['offers']['availability']}")
print(f"Brand: {block.get('brand', {}).get('name', 'N/A')}")
print(f"Rating: {block.get('aggregateRating', {}).get('ratingValue', 'N/A')}")
Handling Complex JSON-LD Structures
Real-world JSON-LD comes in several formats. Some sites use @graph arrays, some nest objects, some have multiple blocks on one page:
from typing import Any
def find_jsonld_by_type(html: str, target_type: str) -> list[dict]:
"""
Find all JSON-LD objects matching a Schema.org type.
Handles @graph arrays, nested objects, and list formats.
"""
soup = BeautifulSoup(html, "html.parser")
matches = []
def search_object(obj: Any):
if isinstance(obj, dict):
obj_type = obj.get("@type", "")
# @type can be a string or a list
if isinstance(obj_type, list):
if target_type in obj_type:
matches.append(obj)
elif obj_type == target_type:
matches.append(obj)
# Recurse into @graph
if "@graph" in obj:
search_object(obj["@graph"])
# Recurse into nested objects
for value in obj.values():
if isinstance(value, (dict, list)):
search_object(value)
elif isinstance(obj, list):
for item in obj:
search_object(item)
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
search_object(data)
except (json.JSONDecodeError, TypeError):
continue
return matches
# Example: Find all products, even in nested @graph structures
products = find_jsonld_by_type(page_html, "Product")
articles = find_jsonld_by_type(page_html, "Article")
recipes = find_jsonld_by_type(page_html, "Recipe")
events = find_jsonld_by_type(page_html, "Event")
Common JSON-LD Types and What They Contain
| Schema.org Type | Common On | Data Available |
|---|---|---|
| Product | E-commerce sites | Name, price, currency, availability, reviews, images, brand |
| Article / NewsArticle | News sites, blogs | Headline, author, date published, body text, images |
| Recipe | Food sites | Name, ingredients, instructions, cook time, nutrition |
| LocalBusiness | Business listings | Name, address, phone, hours, geo-coordinates |
| Event | Event sites | Name, date, location, performer, ticket info |
| JobPosting | Job boards | Title, company, salary, location, description |
| FAQPage | Help pages | Question-answer pairs |
| Review | Review sites | Rating, author, review body |
| BreadcrumbList | Most sites | Navigation hierarchy (useful for categorization) |
| Organization | Company pages | Name, logo, social profiles, contact info |
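To make the table concrete, here's what a typical Product block looks like on the wire, parsed with nothing but the standard library. The field values are invented for illustration:

```python
import json

# A representative Product JSON-LD payload (values invented for illustration)
raw = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Widget Pro",
  "sku": "WP-123",
  "brand": {"@type": "Brand", "name": "Acme"},
  "offers": {
    "@type": "Offer",
    "price": "29.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.6", "reviewCount": "128"}
}
"""

data = json.loads(raw)
offers = data["offers"]
print(data["name"])                           # Widget Pro
print(float(offers["price"]))                 # 29.99
print(offers["availability"].split("/")[-1])  # InStock
```

Note that price and rating arrive as strings — Schema.org doesn't mandate numeric types, so always cast before doing arithmetic.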
Production JSON-LD Scraper
import httpx
import json
from dataclasses import dataclass, field
from bs4 import BeautifulSoup
@dataclass
class ProductData:
name: str = ""
price: float = 0.0
currency: str = "USD"
availability: str = ""
brand: str = ""
rating: float = 0.0
review_count: int = 0
description: str = ""
image_url: str = ""
sku: str = ""
url: str = ""
@classmethod
def from_jsonld(cls, data: dict, url: str = "") -> "ProductData":
"""Parse a Product JSON-LD object into a clean dataclass."""
offers = data.get("offers", {})
# offers can be a list (multiple offers) or a dict
if isinstance(offers, list):
offers = offers[0] if offers else {}
rating_data = data.get("aggregateRating", {})
brand_data = data.get("brand", {})
# Image can be string, list, or dict
image = data.get("image", "")
if isinstance(image, list):
image = image[0] if image else ""
elif isinstance(image, dict):
image = image.get("url", "")
return cls(
name=data.get("name", ""),
price=float(offers.get("price", 0)),
currency=offers.get("priceCurrency", "USD"),
availability=offers.get("availability", "").split("/")[-1],
brand=brand_data.get("name", "") if isinstance(brand_data, dict) else str(brand_data),
rating=float(rating_data.get("ratingValue", 0)),
review_count=int(rating_data.get("reviewCount", 0)),
description=data.get("description", ""),
image_url=image,
sku=data.get("sku", ""),
url=url,
)
async def scrape_products_jsonld(
urls: list[str],
proxy_url: str = "",
) -> list[ProductData]:
"""Scrape product data using JSON-LD extraction."""
products = []
client_kwargs = {}
if proxy_url:
client_kwargs["proxy"] = proxy_url
async with httpx.AsyncClient(timeout=30, **client_kwargs) as client:
for url in urls:
try:
resp = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36",
"Accept": "text/html",
})
if resp.status_code == 200:
product_lds = find_jsonld_by_type(resp.text, "Product")
for ld in product_lds:
product = ProductData.from_jsonld(ld, url=url)
if product.name: # Skip empty results
products.append(product)
except Exception as e:
print(f"Error scraping {url}: {e}")
return products
2. Microdata and RDFa
Older than JSON-LD but still present on many sites. Microdata is embedded directly in HTML attributes (itemscope, itemprop, itemtype). RDFa uses typeof, property, and about attributes.
Extracting Microdata
from bs4 import BeautifulSoup
def extract_microdata(html: str) -> list[dict]:
"""Extract Schema.org microdata from HTML attributes."""
soup = BeautifulSoup(html, "html.parser")
items = []
for element in soup.find_all(attrs={"itemscope": True}):
# Skip nested items — they'll be handled by their parent
parent_scope = element.find_parent(attrs={"itemscope": True})
if parent_scope:
continue
item = parse_microdata_item(element)
items.append(item)
return items
def parse_microdata_item(element) -> dict:
"""Recursively parse a microdata item and its properties."""
item = {
"@type": element.get("itemtype", "").split("/")[-1],
}
for prop in element.find_all(attrs={"itemprop": True}):
name = prop.get("itemprop")
# Check if this property is itself a nested item
if prop.get("itemscope") is not None:
value = parse_microdata_item(prop)
elif prop.name == "meta":
value = prop.get("content", "")
elif prop.name == "link":
value = prop.get("href", "")
elif prop.name == "img":
value = prop.get("src", "")
elif prop.name == "time":
value = prop.get("datetime", prop.get_text(strip=True))
elif prop.name == "a":
value = prop.get("href", prop.get_text(strip=True))
else:
value = prop.get_text(strip=True)
# Handle multiple values for the same property
if name in item:
existing = item[name]
if isinstance(existing, list):
existing.append(value)
else:
item[name] = [existing, value]
else:
item[name] = value
return item
# Example HTML with microdata:
# <div itemscope itemtype="https://schema.org/Product">
# <h1 itemprop="name">Widget Pro</h1>
# <span itemprop="price" content="29.99">$29.99</span>
# <meta itemprop="priceCurrency" content="USD">
# </div>
Extracting RDFa
def extract_rdfa(html: str) -> list[dict]:
"""Extract RDFa structured data from HTML."""
soup = BeautifulSoup(html, "html.parser")
items = []
for element in soup.find_all(attrs={"typeof": True}):
item = {"@type": element.get("typeof")}
for prop in element.find_all(attrs={"property": True}):
name = prop.get("property").split(":")[-1] # Remove prefix
if prop.get("content"):
value = prop["content"]
elif prop.name == "a":
value = prop.get("href", prop.get_text(strip=True))
elif prop.name == "img":
value = prop.get("src", "")
else:
value = prop.get_text(strip=True)
item[name] = value
items.append(item)
return items
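A quick self-contained check of the idea, using a minimal RDFa snippet (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

# Minimal RDFa markup (invented for illustration)
html_doc = """
<div typeof="schema:Product">
  <span property="schema:name">Widget Pro</span>
  <meta property="schema:price" content="29.99">
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
item = {"@type": soup.find(attrs={"typeof": True})["typeof"]}
for prop in soup.find_all(attrs={"property": True}):
    name = prop["property"].split(":")[-1]  # strip the "schema:" prefix
    # meta tags carry the value in content=; visible elements carry it as text
    item[name] = prop.get("content") or prop.get_text(strip=True)

print(item)  # {'@type': 'schema:Product', 'name': 'Widget Pro', 'price': '29.99'}
```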
3. Open Graph and Twitter Card Metadata
Almost every website has Open Graph tags for social media previews. These contain titles, descriptions, images, and sometimes prices — cleaner than scraping the page body.
def extract_meta_tags(html: str) -> dict:
"""Extract Open Graph, Twitter Card, and standard meta tags."""
soup = BeautifulSoup(html, "html.parser")
meta = {
"og": {},
"twitter": {},
"standard": {},
}
for tag in soup.find_all("meta"):
# Open Graph tags
prop = tag.get("property", "")
if prop.startswith("og:"):
key = prop[3:] # Remove "og:" prefix
meta["og"][key] = tag.get("content", "")
# Twitter Card tags
name = tag.get("name", "")
if name.startswith("twitter:"):
key = name[8:] # Remove "twitter:" prefix
meta["twitter"][key] = tag.get("content", "")
# Standard meta tags
if name in ("description", "keywords", "author", "robots"):
meta["standard"][name] = tag.get("content", "")
# Also grab the title tag
title_tag = soup.find("title")
if title_tag:
meta["standard"]["title"] = title_tag.get_text(strip=True)
# Canonical URL
canonical = soup.find("link", rel="canonical")
if canonical:
meta["standard"]["canonical"] = canonical.get("href", "")
return meta
# Usage
resp = httpx.get("https://example.com/article/123")
meta = extract_meta_tags(resp.text)
print(f"Title: {meta['og'].get('title', meta['standard'].get('title'))}")
print(f"Description: {meta['og'].get('description')}")
print(f"Image: {meta['og'].get('image')}")
print(f"Type: {meta['og'].get('type')}")
print(f"Site: {meta['og'].get('site_name')}")
Product-Specific Open Graph Tags
E-commerce sites often include pricing in Open Graph tags:
def extract_product_og(html: str) -> dict | None:
"""Extract product-specific Open Graph data."""
soup = BeautifulSoup(html, "html.parser")
product = {}
og_mappings = {
"og:title": "name",
"og:description": "description",
"og:image": "image",
"og:url": "url",
"product:price:amount": "price",
"product:price:currency": "currency",
"product:availability": "availability",
"product:brand": "brand",
"product:category": "category",
"product:condition": "condition",
}
for tag in soup.find_all("meta"):
prop = tag.get("property", "")
if prop in og_mappings:
product[og_mappings[prop]] = tag.get("content", "")
return product if product else None
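Here's the mapping approach boiled down to a few lines, run against a synthetic head section (the meta tags are invented for illustration):

```python
from bs4 import BeautifulSoup

# Head section with product Open Graph tags (invented for illustration)
html_doc = """
<head>
  <meta property="og:title" content="Widget Pro">
  <meta property="product:price:amount" content="29.99">
  <meta property="product:price:currency" content="USD">
</head>
"""

mapping = {
    "og:title": "name",
    "product:price:amount": "price",
    "product:price:currency": "currency",
}
soup = BeautifulSoup(html_doc, "html.parser")
product = {
    mapping[tag["property"]]: tag.get("content", "")
    for tag in soup.find_all("meta", property=True)
    if tag.get("property") in mapping
}
print(product)  # {'name': 'Widget Pro', 'price': '29.99', 'currency': 'USD'}
```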
4. JavaScript Data Extraction
Modern websites increasingly load data via JavaScript. The HTML is a shell, and the actual data lives in script tags as JSON blobs, window variables, or framework-specific data stores.
Next.js __NEXT_DATA__
Next.js applications embed their page data in a <script id="__NEXT_DATA__"> tag. This is a goldmine — the entire page's data in one clean JSON object:
import re
import json
def extract_nextjs_data(html: str) -> dict | None:
"""Extract Next.js page data from __NEXT_DATA__ script tag."""
# Method 1: BeautifulSoup (more reliable)
soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", id="__NEXT_DATA__")
if script and script.string:
try:
data = json.loads(script.string)
return data.get("props", {}).get("pageProps", {})
except json.JSONDecodeError:
pass
# Method 2: Regex fallback
match = re.search(
r'<script\s+id="__NEXT_DATA__"[^>]*>(.*?)</script>',
html,
re.DOTALL,
)
if match:
try:
data = json.loads(match.group(1))
return data.get("props", {}).get("pageProps", {})
except json.JSONDecodeError:
pass
return None
# Example: Scrape a Next.js e-commerce site
resp = httpx.get("https://nextjs-store.example.com/product/widget-pro")
page_data = extract_nextjs_data(resp.text)
if page_data:
product = page_data.get("product", {})
print(f"Name: {product.get('name')}")
print(f"Price: {product.get('price')}")
print(f"Variants: {len(product.get('variants', []))}")
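You can verify the shape of this payload without a live site. Here's the regex path run against a synthetic __NEXT_DATA__ tag (the page data is invented):

```python
import json
import re

# Synthetic Next.js page (payload invented for illustration)
html_doc = (
    '<html><body>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"product": {"name": "Widget Pro", "price": 29.99}}}}'
    '</script></body></html>'
)

match = re.search(
    r'<script\s+id="__NEXT_DATA__"[^>]*>(.*?)</script>',
    html_doc,
    re.DOTALL,
)
page_props = json.loads(match.group(1)).get("props", {}).get("pageProps", {})
print(page_props["product"])  # {'name': 'Widget Pro', 'price': 29.99}
```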
Nuxt.js __NUXT_DATA__
Nuxt 3 uses a different serialization format:
def extract_nuxt_data(html: str) -> list | None:
"""Extract Nuxt.js page data."""
soup = BeautifulSoup(html, "html.parser")
# Nuxt 3 uses multiple script tags with type="application/json" and id pattern
scripts = soup.find_all("script", type="application/json")
for script in scripts:
if script.get("id", "").startswith("__NUXT_DATA__"):
try:
return json.loads(script.string)
except json.JSONDecodeError:
continue
# Nuxt 2 uses window.__NUXT__
match = re.search(
r'window\.__NUXT__\s*=\s*({.*?});?\s*</script>',
html,
re.DOTALL,
)
if match:
# Nuxt 2 data often uses JavaScript syntax (not pure JSON)
# This may need eval-like parsing for complex cases
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
pass
return None
Generic JavaScript Variable Extraction
def extract_js_variables(html: str, patterns: list[str]) -> dict:
"""
Extract JavaScript variables from inline scripts.
patterns: list of variable name patterns like:
"window.initialState", "var productData", "__CONFIG__"
"""
results = {}
for pattern in patterns:
# Handle different assignment styles
regex_patterns = [
rf'{re.escape(pattern)}\s*=\s*({{.*?}});', # var x = {...};
rf'{re.escape(pattern)}\s*=\s*(\[.*?\]);', # var x = [...];
rf'{re.escape(pattern)}\s*=\s*JSON\.parse\(\'(.*?)\'\)', # JSON.parse('...')
]
for regex in regex_patterns:
match = re.search(regex, html, re.DOTALL)
if match:
try:
# Try parsing as JSON
data = json.loads(match.group(1))
results[pattern] = data
break
except json.JSONDecodeError:
# Store raw string if not valid JSON
results[pattern] = match.group(1)
break
return results
# Usage
data = extract_js_variables(page_html, [
"window.__INITIAL_STATE__",
"window.__PRELOADED_STATE__",
"window.__APP_DATA__",
"window.pageData",
])
API Response Interception
Sometimes the best data source is the API the frontend calls. Intercept these with Playwright:
from playwright.async_api import async_playwright
import asyncio
import json
async def intercept_api_calls(
url: str,
api_patterns: list[str],
proxy: str = "",
) -> list[dict]:
"""
Load a page and capture API responses matching patterns.
Often cleaner than parsing HTML at all.
"""
captured = []
async with async_playwright() as p:
launch_kwargs = {"headless": True}
if proxy:
launch_kwargs["proxy"] = {"server": proxy}
browser = await p.chromium.launch(**launch_kwargs)
page = await browser.new_page()
# Intercept API responses
async def handle_response(response):
for pattern in api_patterns:
if pattern in response.url:
try:
body = await response.json()
captured.append({
"url": response.url,
"status": response.status,
"data": body,
})
except Exception:
pass
page.on("response", handle_response)
await page.goto(url, wait_until="networkidle")
await asyncio.sleep(2) # Wait for any lazy-loaded API calls
await browser.close()
return captured
# Example: Capture product API responses
results = asyncio.run(intercept_api_calls(
url="https://store.example.com/category/electronics",
api_patterns=["/api/products", "/api/v2/catalog", "graphql"],
))
for result in results:
print(f"API: {result['url']}")
print(f" Items: {len(result['data'].get('products', []))}")
5. CSS Selectors with BeautifulSoup
When structured data isn't available, CSS selectors are the workhorse. They're fast, readable, and handle most well-structured HTML.
Core Selector Patterns
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# === Basic Selectors ===
soup.select_one(".product-title") # Class
soup.select_one("#price") # ID
soup.select("ul.results > li") # Direct children
soup.select("div.card") # All matching elements
# === Attribute Selectors ===
soup.select('a[href*="product"]') # href contains "product"
soup.select('a[href^="/category"]') # href starts with "/category"
soup.select('a[href$=".pdf"]') # href ends with ".pdf"
soup.select('input[type="hidden"]') # exact attribute match
soup.select('[data-product-id]') # has attribute (any value)
# === Combinators ===
soup.select("div.sidebar a") # Descendant (any depth)
soup.select("div.sidebar > a") # Direct child only
soup.select("h2 + p") # Immediately following sibling
soup.select("h2 ~ p") # Any following sibling
# === Pseudo-selectors ===
soup.select("tr:nth-child(2) td") # Second row
soup.select("li:first-child") # First list item
soup.select("li:last-child") # Last list item
soup.select("p:not(.ad)") # Exclude class
# === Combining Multiple Selectors ===
soup.select("h1, h2, h3") # Any of these tags
soup.select("div.price.sale") # Element with both classes
Production-Ready Card Scraping
from dataclasses import dataclass
from bs4 import BeautifulSoup, Tag
@dataclass
class ScrapedItem:
title: str
price: str
url: str
image: str
rating: str
def is_valid(self) -> bool:
return bool(self.title and self.price)
def scrape_product_cards(html: str, base_url: str = "") -> list[ScrapedItem]:
"""Extract product data from common card-based layouts."""
soup = BeautifulSoup(html, "html.parser")
items = []
# Try common card selectors
card_selectors = [
"div.product-card",
"div.product-item",
"li.product",
"article.product",
"div[data-component='product-card']",
".search-result-item",
".listing-card",
]
cards = []
for selector in card_selectors:
cards = soup.select(selector)
if cards:
break
for card in cards:
item = ScrapedItem(
title=extract_text(card, [
"h2", "h3", ".product-title", ".product-name",
"[data-testid='title']", ".listing-title",
]),
price=extract_text(card, [
".price", ".product-price", "[data-testid='price']",
".sale-price", ".current-price", "span.amount",
]),
url=extract_link(card, base_url),
image=extract_image(card),
rating=extract_text(card, [
".rating", ".stars", "[data-testid='rating']",
".review-score",
]),
)
if item.is_valid():
items.append(item)
return items
def extract_text(parent: Tag, selectors: list[str]) -> str:
"""Try multiple selectors, return first match's text."""
for selector in selectors:
elem = parent.select_one(selector)
if elem:
return elem.get_text(strip=True)
return ""
def extract_link(parent: Tag, base_url: str) -> str:
"""Extract the primary link from a card element."""
link = parent.select_one("a[href]")
if link:
href = link.get("href", "")
if href.startswith("/"):
return base_url + href
return href
return ""
def extract_image(parent: Tag) -> str:
"""Extract image URL, handling lazy-loading attributes."""
img = parent.select_one("img")
if img:
# Try lazy-loading attributes first (actual image URL)
for attr in ["data-src", "data-lazy-src", "data-original"]:
if img.get(attr):
return img[attr]
return img.get("src", "")
return ""
Table Extraction
Tables are one of the most common data formats on the web. Here's a robust extractor:
import pandas as pd
from bs4 import BeautifulSoup
def extract_tables(html: str, table_index: int | None = None) -> list[list[dict]]:
"""
Extract all tables from HTML into list-of-dicts format.
Each table becomes a list of rows, each row is a dict.
"""
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")
if table_index is not None:
tables = [tables[table_index]] if table_index < len(tables) else []
results = []
for table in tables:
# Extract headers
headers = []
header_row = table.find("thead")
if header_row:
headers = [
th.get_text(strip=True)
for th in header_row.find_all(["th", "td"])
]
else:
# Try first row as header
first_row = table.find("tr")
if first_row and first_row.find("th"):
headers = [
th.get_text(strip=True) for th in first_row.find_all("th")
]
# Extract body rows
body = table.find("tbody") or table
rows = []
for tr in body.find_all("tr"):
cells = tr.find_all(["td", "th"])
if not cells:
continue
values = [cell.get_text(strip=True) for cell in cells]
if headers and len(values) == len(headers):
row = dict(zip(headers, values))
else:
row = {f"col_{i}": v for i, v in enumerate(values)}
# Skip header row if it's in the body
if values != headers:
rows.append(row)
results.append(rows)
return results
def tables_to_dataframes(html: str) -> list[pd.DataFrame]:
"""Convert HTML tables directly to pandas DataFrames."""
tables = extract_tables(html)
return [pd.DataFrame(table) for table in tables if table]
# Quick table extraction with pandas
dfs = pd.read_html("https://example.com/stats")
for i, df in enumerate(dfs):
print(f"Table {i}: {len(df)} rows x {len(df.columns)} columns")
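The core of the extractor, zipping header cells with body cells, can be checked in isolation against an inline table (sample markup assumed):

```python
from bs4 import BeautifulSoup

# A small well-formed table (sample markup for illustration)
html_doc = """
<table>
  <thead><tr><th>Name</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Widget</td><td>29.99</td></tr>
    <tr><td>Gadget</td><td>49.99</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.thead.find_all("th")]
rows = [
    dict(zip(headers, (td.get_text(strip=True) for td in tr.find_all("td"))))
    for tr in table.tbody.find_all("tr")
]
print(rows)  # [{'Name': 'Widget', 'Price': '29.99'}, {'Name': 'Gadget', 'Price': '49.99'}]
```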
Handling CSS-in-JS (Hashed Class Names)
Modern React/Vue apps often use CSS-in-JS libraries that generate random class names like _a3f2b or css-1x2y3z. These break between deployments. Strategies:
def extract_by_structure(html: str) -> list[dict]:
"""
Extract data using structural patterns instead of class names.
Works when classes are hashed/random.
"""
soup = BeautifulSoup(html, "html.parser")
# Strategy 1: Use data-* attributes (these survive CSS-in-JS)
items = soup.select("[data-testid='product-card']")
# Strategy 2: Use ARIA attributes
items = soup.select("[role='listitem']")
items = soup.select("[aria-label*='product']")
# Strategy 3: Use tag structure
# "Find all divs that contain an h2 and a span with $ in the text"
results = []
for div in soup.find_all("div"):
h2 = div.find("h2")
price_span = div.find("span", string=re.compile(r'\$\d'))
if h2 and price_span:
results.append({
"title": h2.get_text(strip=True),
"price": price_span.get_text(strip=True),
})
# Strategy 4: Use semantic HTML tags
for article in soup.find_all("article"):
heading = article.find(["h1", "h2", "h3"])
if heading:
results.append({"title": heading.get_text(strip=True)})
return results
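Strategy 3 in action against markup with hashed class names (the sample HTML is invented). The class names carry no meaning; only the heading-plus-price structure is matched:

```python
import re
from bs4 import BeautifulSoup

# Hashed, meaningless class names; only the structure is stable (invented sample)
html_doc = """
<div class="css-1x2y3z"><h2>Widget Pro</h2><span class="_a3f2b">$29.99</span></div>
<div class="css-9q8w7e"><h2>Gadget Max</h2><span class="_c9d1e">$49.99</span></div>
<div class="css-footer"><span>No heading here</span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
results = []
for div in soup.find_all("div"):
    h2 = div.find("h2")
    price = div.find("span", string=re.compile(r"\$\d"))
    if h2 and price:  # structural match: a heading plus a price-looking span
        results.append({
            "title": h2.get_text(strip=True),
            "price": price.get_text(strip=True),
        })

print(results)
# [{'title': 'Widget Pro', 'price': '$29.99'}, {'title': 'Gadget Max', 'price': '$49.99'}]
```

The footer div is skipped because it has no heading — the selector survives any renaming of the classes.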
6. XPath with lxml
XPath's killer features are parent traversal (CSS can't go up the tree) and complex conditional expressions. Use lxml for XPath in Python.
Core XPath Patterns
from lxml import html
tree = html.fromstring(page_content)
# === Basic Navigation ===
titles = tree.xpath("//h2/text()") # All h2 text
links = tree.xpath("//a/@href") # All link hrefs
tree.xpath("//div[@class='product']/h2/text()") # Class match
# === Parent Traversal (CSS can't do this) ===
# Find the div that contains a span with text "Price"
container = tree.xpath("//span[text()='Price']/parent::div")
# Find the table row containing "Total"
total_row = tree.xpath("//td[contains(text(),'Total')]/parent::tr")
# === Sibling Navigation ===
# Get the value next to a label
value = tree.xpath("//dt[text()='SKU']/following-sibling::dd[1]/text()")
# Get all list items after a specific heading
items = tree.xpath("//h3[text()='Features']/following-sibling::ul[1]/li/text()")
# === Complex Conditions ===
# Find rows where the second column contains "USD"
rows = tree.xpath("//tr[td[2][contains(text(), 'USD')]]")
# Find links that don't start with '#' or 'javascript:'
links = tree.xpath("//a[not(starts-with(@href, '#')) and not(starts-with(@href, 'javascript:'))]/@href")
# Find products with price below $50 (text comparison, not numeric)
tree.xpath("//div[@class='product'][.//span[@class='price' and number(translate(text(), '$,', '')) < 50]]")
# === Text Functions ===
# Normalize whitespace
tree.xpath("normalize-space(//div[@class='description'])")
# Concatenate text from multiple elements
tree.xpath("string(//div[@class='address'])")
# === Positional ===
tree.xpath("(//div[@class='result'])[1]") # First result
tree.xpath("(//div[@class='result'])[last()]") # Last result
tree.xpath("(//div[@class='result'])[position() <= 5]") # First 5
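The parent-traversal and sibling patterns above, run against a small spec fragment (markup invented for illustration):

```python
from lxml import html

# Definition list of product specs plus a totals row (invented for illustration)
doc = html.fromstring("""
<div>
  <dl>
    <dt>SKU</dt><dd>WP-123</dd>
    <dt>Weight</dt><dd>1.2 kg</dd>
  </dl>
  <table><tr><td>Subtotal</td><td>$25.00</td></tr></table>
</div>
""")

# Sibling navigation: the value next to a label
sku = doc.xpath("//dt[text()='SKU']/following-sibling::dd[1]/text()")[0]
print(sku)  # WP-123

# Parent traversal: the row containing a given cell (CSS can't go up)
row = doc.xpath("//td[contains(text(),'Subtotal')]/parent::tr")[0]
print(row.xpath("./td[2]/text()")[0])  # $25.00
```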
Practical XPath Scraper
from lxml import html
import httpx
def scrape_with_xpath(page_content: str) -> list[dict]:
"""Extract data using XPath — handles complex layouts."""
tree = html.fromstring(page_content)
results = []
# Find product containers using structural XPath
products = tree.xpath(
"//div[contains(@class, 'product') or contains(@class, 'item')]"
"[.//h2 or .//h3]" # Must contain a heading
"[.//span[contains(@class, 'price')]]" # Must contain a price
)
for product in products:
# Extract with fallback chains
title = (
product.xpath(".//h2/text()") or
product.xpath(".//h3/text()") or
product.xpath(".//a[@title]/@title") or
[""]
)[0].strip()
price = (
product.xpath(".//span[contains(@class, 'price')]/text()") or
product.xpath(".//*[contains(@class, 'amount')]/text()") or
[""]
)[0].strip()
link = (
product.xpath(".//a/@href") or
[""]
)[0]
# Extract structured attributes
data_attrs = {}
for attr in ["data-id", "data-sku", "data-price", "data-brand"]:
values = product.xpath(f".//@{attr}")
if values:
data_attrs[attr.replace("data-", "")] = values[0]
results.append({
"title": title,
"price": price,
"link": link,
**data_attrs,
})
return results
7. Regex for Non-HTML Data
Regex should never be your first choice for parsing HTML structure. But it's essential for extracting data from non-HTML sources embedded in pages — JavaScript variables, inline styles, comments, and data URIs.
import re
import json
def extract_embedded_data(html: str) -> dict:
"""Extract various types of embedded data from a page."""
results = {}
# Email addresses
emails = re.findall(
r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
html,
)
if emails:
results["emails"] = list(set(emails))
# Phone numbers (various formats)
phones = re.findall(
r'[\+]?[(]?[0-9]{1,4}[)]?[-\s\./0-9]{7,15}',
html,
)
if phones:
results["phones"] = [p.strip() for p in set(phones) if len(p.strip()) >= 10]
# Prices (various formats)
prices = re.findall(
r'(?:[$€£¥])\s*[\d,]+(?:\.\d{2})?|[\d,]+(?:\.\d{2})?\s*(?:USD|EUR|GBP)',
html,
)
if prices:
results["prices"] = list(set(prices))
# Coordinates (latitude, longitude)
coords = re.findall(
r'[-+]?(?:[1-8]?\d(?:\.\d+)?|90(?:\.0+)?)\s*,\s*'
r'[-+]?(?:180(?:\.0+)?|(?:(?:1[0-7]\d)|(?:[1-9]?\d))(?:\.\d+)?)',
html,
)
if coords:
results["coordinates"] = coords
# JSON objects in script tags
json_blobs = re.findall(
r'(?:var|let|const)\s+\w+\s*=\s*({[^;]+});',
html,
re.DOTALL,
)
valid_json = []
for blob in json_blobs:
try:
parsed = json.loads(blob)
valid_json.append(parsed)
except json.JSONDecodeError:
pass
if valid_json:
results["json_data"] = valid_json
return results
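A quick sanity check of the email and price patterns against a synthetic page fragment (the text is invented):

```python
import re

# Synthetic page fragment (invented for illustration)
text = "Contact sales@example.com for Widget Pro, now $29.99 (was $39.99)"

emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
prices = re.findall(r'[$€£¥]\s*[\d,]+(?:\.\d{2})?', text)

print(emails)  # ['sales@example.com']
print(prices)  # ['$29.99', '$39.99']
```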
Putting It All Together: Universal Extractor
Here's a production-ready extractor that tries every method in order:
import httpx
import json
import re
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
@dataclass
class ExtractionResult:
url: str
method: str # Which extraction method succeeded
data: dict = field(default_factory=dict)
raw_html: str = ""
confidence: float = 0.0 # 0-1, how confident we are in the data
class UniversalExtractor:
"""Try multiple extraction methods, return the best result."""
def __init__(self, proxy_url: str = ""):
self.proxy_url = proxy_url
async def extract(self, url: str) -> ExtractionResult:
"""Fetch URL and extract data using the best available method."""
client_kwargs = {"timeout": 30}
if self.proxy_url:
client_kwargs["proxy"] = self.proxy_url
async with httpx.AsyncClient(**client_kwargs) as client:
resp = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36",
})
html = resp.text
# Try methods in order of reliability
# 1. JSON-LD
jsonld_data = self._try_jsonld(html)
if jsonld_data:
return ExtractionResult(
url=url, method="jsonld",
data=jsonld_data, confidence=0.95,
)
# 2. Next.js data
nextjs_data = self._try_nextjs(html)
if nextjs_data:
return ExtractionResult(
url=url, method="nextjs",
data=nextjs_data, confidence=0.90,
)
# 3. Open Graph
og_data = self._try_opengraph(html)
if og_data and len(og_data) >= 3: # Meaningful OG data
return ExtractionResult(
url=url, method="opengraph",
data=og_data, confidence=0.80,
)
# 4. Microdata
micro_data = self._try_microdata(html)
if micro_data:
return ExtractionResult(
url=url, method="microdata",
data=micro_data, confidence=0.85,
)
# 5. CSS selectors (generic)
css_data = self._try_css(html)
if css_data:
return ExtractionResult(
url=url, method="css",
data=css_data, confidence=0.70,
)
# Return raw HTML if nothing worked
return ExtractionResult(
url=url, method="none",
data={}, raw_html=html, confidence=0.0,
)
def _try_jsonld(self, html: str) -> dict | None:
soup = BeautifulSoup(html, "html.parser")
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
if isinstance(data, dict) and "@type" in data:
return data
if isinstance(data, dict) and "@graph" in data:
return {"@graph": data["@graph"]}
except (json.JSONDecodeError, TypeError):
continue
return None
def _try_nextjs(self, html: str) -> dict | None:
match = re.search(
r'<script\s+id="__NEXT_DATA__"[^>]*>(.*?)</script>',
html, re.DOTALL,
)
if match:
try:
data = json.loads(match.group(1))
return data.get("props", {}).get("pageProps", {})
except json.JSONDecodeError:
pass
return None
def _try_opengraph(self, html: str) -> dict | None:
soup = BeautifulSoup(html, "html.parser")
og = {}
for tag in soup.find_all("meta"):
prop = tag.get("property", "")
if prop.startswith("og:"):
og[prop[3:]] = tag.get("content", "")
return og if og else None
def _try_microdata(self, html: str) -> dict | None:
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all(attrs={"itemscope": True, "itemtype": True})
if items:
item = items[0]
result = {"@type": item.get("itemtype", "").split("/")[-1]}
for prop in item.find_all(attrs={"itemprop": True}):
name = prop["itemprop"]
if prop.get("content"):
result[name] = prop["content"]
else:
result[name] = prop.get_text(strip=True)
return result
return None
def _try_css(self, html: str) -> dict | None:
soup = BeautifulSoup(html, "html.parser")
result = {}
# Try to extract title
for sel in ["h1", "h2.title", ".product-title", "[data-testid='title']"]:
elem = soup.select_one(sel)
if elem:
result["title"] = elem.get_text(strip=True)
break
# Try to extract price
for sel in [".price", "#price", "[data-testid='price']", ".amount"]:
elem = soup.select_one(sel)
if elem:
result["price"] = elem.get_text(strip=True)
break
# Try to extract description
for sel in [".description", "#description", "[data-testid='description']", "p.intro"]:
elem = soup.select_one(sel)
if elem:
result["description"] = elem.get_text(strip=True)[:500]
break
return result if result else None
## Error Handling and Robustness

### Handling Encoding Issues

```python
import httpx
import chardet

def fetch_with_encoding(url: str, proxy: str = "") -> str:
    """Fetch a page and handle encoding correctly."""
    client_kwargs = {"timeout": 30}
    if proxy:
        client_kwargs["proxy"] = proxy
    with httpx.Client(**client_kwargs) as client:
        resp = client.get(url)
        # httpx usually detects encoding correctly, but sometimes it doesn't
        if resp.encoding == "ascii" or resp.encoding is None:
            # Detect from the raw bytes; chardet may report encoding as None,
            # so use `or` rather than a .get() default
            detected = chardet.detect(resp.content)
            encoding = detected.get("encoding") or "utf-8"
            return resp.content.decode(encoding, errors="replace")
        return resp.text
```
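A subtlety worth knowing when wiring this up: chardet can report `"encoding": None` when its confidence is low, and `dict.get`'s default only applies when the key is *missing*, not when its value is `None`. The dict below is illustrative of chardet's output shape:

```python
# What chardet.detect() may return on ambiguous bytes
detected = {"encoding": None, "confidence": 0.0}

wrong = detected.get("encoding", "utf-8")    # None: key exists, default ignored
right = detected.get("encoding") or "utf-8"  # "utf-8": falsy value replaced

print(wrong, right)
```

Passing `None` as an encoding to `bytes.decode` raises a `TypeError`, so the `or` guard is what keeps the fallback path from crashing.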
### Handling Malformed HTML

```python
from bs4 import BeautifulSoup

def parse_html_robust(raw_html: str) -> BeautifulSoup:
    """Parse HTML that might be malformed."""
    # lxml is the fastest parser, but it must be installed and can choke on
    # severely broken markup; html.parser (stdlib) is more forgiving
    try:
        return BeautifulSoup(raw_html, "lxml")
    except Exception:
        pass
    try:
        return BeautifulSoup(raw_html, "html.parser")
    except Exception:
        pass
    # Last resort: html5lib (slowest, but parses anything a browser would)
    return BeautifulSoup(raw_html, "html5lib")
```
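As a quick sanity check that malformed markup still yields a usable tree, here is a standalone snippet using `html.parser` directly (so it runs without lxml installed); the broken HTML is invented:

```python
from bs4 import BeautifulSoup

broken = "<div><p>Unclosed paragraph<span>nested"
soup = BeautifulSoup(broken, "html.parser")

# The parser closes the dangling tags for us and builds a navigable tree
print(soup.find("p").get_text())
```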
### Handling Missing Elements Gracefully

```python
def safe_select(soup, selector: str, attribute: str | None = None) -> str:
    """Safely extract text or an attribute via a CSS selector."""
    elem = soup.select_one(selector)
    if elem is None:
        return ""
    if attribute:
        return elem.get(attribute, "")
    return elem.get_text(strip=True)

def safe_select_all(soup, selector: str) -> list[str]:
    """Safely extract text from all matching elements."""
    return [elem.get_text(strip=True) for elem in soup.select(selector)]
```
## Using Proxies for Large-Scale Extraction

When scraping at scale, you need proxies to avoid rate limits. Here's how to integrate proxy rotation with the extraction techniques above:
```python
import asyncio
import random

import httpx

async def extract_at_scale(
    urls: list[str],
    proxy_url: str,
    extract_fn,
    max_concurrent: int = 5,
    delay_range: tuple[float, float] = (1, 3),
) -> list[dict]:
    """
    Scrape URLs at scale with proxy rotation and rate limiting.

    Uses ThorData residential proxies for protected sites.
    Get started at: https://thordata.partnerstack.com/partner/0a0x4nzh
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url: str) -> dict:
        async with semaphore:
            try:
                async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
                    resp = await client.get(url, headers={
                        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                      "AppleWebKit/537.36",
                        "Accept": "text/html,application/xhtml+xml",
                        "Accept-Language": "en-US,en;q=0.9",
                    })
                    if resp.status_code == 200:
                        return {"url": url, "status": "ok", "data": extract_fn(resp.text)}
                    return {"url": url, "status": resp.status_code}
            except Exception as e:
                return {"url": url, "status": "error", "error": str(e)}
            finally:
                # Random delay so requests don't fire at a detectable cadence
                await asyncio.sleep(random.uniform(*delay_range))

    results = await asyncio.gather(*(fetch_one(url) for url in urls))
    success = sum(1 for r in results if r.get("status") == "ok")
    print(f"Extracted: {success}/{len(urls)} successful")
    return results
```
## Testing Your Selectors Before Coding

Before writing Python, test selectors in the browser DevTools console:
```javascript
// Test CSS selectors
document.querySelectorAll("div.product-card h2")

// Test XPath
document.evaluate("//h2[@class='title']", document, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null)

// Check for JSON-LD
document.querySelectorAll('script[type="application/ld+json"]')
    .forEach(s => console.log(JSON.parse(s.textContent)))

// Check for Next.js data
const nd = document.getElementById('__NEXT_DATA__');
if (nd) console.log(JSON.parse(nd.textContent).props.pageProps);

// Check for microdata
document.querySelectorAll('[itemscope]').forEach(el => {
  console.log('Type:', el.getAttribute('itemtype'));
  el.querySelectorAll('[itemprop]').forEach(prop => {
    console.log(`  ${prop.getAttribute('itemprop')}: ${prop.textContent.trim()}`);
  });
});
```
This takes 30 seconds and saves you from running your scraper 15 times to debug a selector.
## The Main Takeaway

Always check for structured data first. The best selector is the one you don't have to write. JSON-LD gives you clean, typed, stable data that survives redesigns. `__NEXT_DATA__` gives you the exact data the frontend uses. Open Graph tags give you the title, description, and image without parsing the page body.

Start at the top of the extraction hierarchy and work your way down. Every step down increases fragility and maintenance cost. The 30 seconds you spend checking for JSON-LD before writing CSS selectors will save you hours of selector maintenance when the site inevitably redesigns.

When you do need to use selectors, prefer data attributes (`data-testid`, `data-id`) and semantic HTML (`article`, `nav`, `main`) over CSS classes, since they're more stable across redesigns. And when scraping at scale, pair your extraction logic with reliable proxy rotation through a service like ThorData to handle rate limits and geo-restrictions without getting blocked.
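To make the stability point concrete, here is a toy before/after comparison with invented HTML: the class name changes in a redesign while the `data-testid` hook survives:

```python
from bs4 import BeautifulSoup

# The same element before and after a hypothetical redesign: the styling
# class changes, the data-testid hook does not.
before = BeautifulSoup(
    '<span class="price-lg" data-testid="price">$19.99</span>', "html.parser")
after = BeautifulSoup(
    '<span class="amount--xl" data-testid="price">$19.99</span>', "html.parser")

print(before.select_one(".price-lg").get_text())             # $19.99
print(after.select_one(".price-lg"))                         # None: class selector broke
print(after.select_one('[data-testid="price"]').get_text())  # $19.99: still works
```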