Extracting Structured Data from HTML: The Complete Python Guide (2026)
Every scraping project starts with the same question: how do I get this data out of the page? There are multiple approaches, and picking the right one saves you hours of debugging brittle selectors.
Here's the thing most tutorials miss — you should check for structured data before writing a single selector. Many sites embed clean, machine-readable data that's more reliable and easier to parse than any CSS selector you could write.
This guide covers every extraction technique available in 2026, from the simplest to the most complex, with production-ready Python code and real-world examples for each approach.
The Extraction Hierarchy
Before writing any code, understand the reliability hierarchy. Always try methods higher on this list before falling back to lower ones:
- APIs — official or undocumented, always the most reliable
- JSON-LD / Schema.org — structured data embedded in the page for SEO
- Microdata / RDFa — older structured data formats, still widely used
- Open Graph / Twitter Cards — metadata tags with clean data
- __NEXT_DATA__ / inline JSON — data blobs in script tags
- CSS Selectors — the workhorse for most scraping
- XPath — when CSS selectors aren't powerful enough
- Regex — last resort for data embedded in JavaScript variables
Each method further down is more fragile — more likely to break when the site redesigns. Let's dive into each one.
1. JSON-LD: Your First Stop (Always)
JSON-LD (JavaScript Object Notation for Linked Data) is structured data embedded in <script type="application/ld+json"> tags. Sites add it for SEO — Google uses it to generate rich search results. This means the data is accurate, maintained by the site's SEO team, and unlikely to disappear in a redesign.
Basic JSON-LD Extraction
import json
from bs4 import BeautifulSoup
import httpx
def extract_jsonld(html: str) -> list[dict]:
"""Extract all JSON-LD blocks from an HTML page."""
soup = BeautifulSoup(html, "html.parser")
results = []
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
results.append(data)
except (json.JSONDecodeError, TypeError):
continue
return results
# Example: Scrape a product page
resp = httpx.get("https://example.com/product/123")
ld_blocks = extract_jsonld(resp.text)
for block in ld_blocks:
if block.get("@type") == "Product":
print(f"Name: {block['name']}")
print(f"Price: {block['offers']['price']}")
print(f"Currency: {block['offers']['priceCurrency']}")
print(f"Available: {block['offers']['availability']}")
print(f"Brand: {block.get('brand', {}).get('name', 'N/A')}")
print(f"Rating: {block.get('aggregateRating', {}).get('ratingValue', 'N/A')}")
Handling Complex JSON-LD Structures
Real-world JSON-LD comes in several formats. Some sites use @graph arrays, some nest objects, some have multiple blocks on one page:
from typing import Any
def find_jsonld_by_type(html: str, target_type: str) -> list[dict]:
"""
Find all JSON-LD objects matching a Schema.org type.
Handles @graph arrays, nested objects, and list formats.
"""
soup = BeautifulSoup(html, "html.parser")
matches = []
def search_object(obj: Any):
if isinstance(obj, dict):
obj_type = obj.get("@type", "")
# @type can be a string or a list
if isinstance(obj_type, list):
if target_type in obj_type:
matches.append(obj)
elif obj_type == target_type:
matches.append(obj)
# Recurse into @graph
if "@graph" in obj:
search_object(obj["@graph"])
# Recurse into nested objects
for value in obj.values():
if isinstance(value, (dict, list)):
search_object(value)
elif isinstance(obj, list):
for item in obj:
search_object(item)
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
search_object(data)
except (json.JSONDecodeError, TypeError):
continue
return matches
# Example: Find all products, even in nested @graph structures
products = find_jsonld_by_type(page_html, "Product")
articles = find_jsonld_by_type(page_html, "Article")
recipes = find_jsonld_by_type(page_html, "Recipe")
events = find_jsonld_by_type(page_html, "Event")
Common JSON-LD Types and What They Contain
| Schema.org Type | Common On | Data Available |
|---|---|---|
| Product | E-commerce sites | Name, price, currency, availability, reviews, images, brand |
| Article / NewsArticle | News sites, blogs | Headline, author, date published, body text, images |
| Recipe | Food sites | Name, ingredients, instructions, cook time, nutrition |
| LocalBusiness | Business listings | Name, address, phone, hours, geo-coordinates |
| Event | Event sites | Name, date, location, performer, ticket info |
| JobPosting | Job boards | Title, company, salary, location, description |
| FAQPage | Help pages | Question-answer pairs |
| Review | Review sites | Rating, author, review body |
| BreadcrumbList | Most sites | Navigation hierarchy (useful for categorization) |
| Organization | Company pages | Name, logo, social profiles, contact info |
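To make the table concrete, here's what a typical Product block looks like on the wire, parsed with nothing but the standard library. The field values are invented for illustration:

```python
import json

# A representative Product JSON-LD payload (values invented for illustration)
raw = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Widget Pro",
  "sku": "WP-123",
  "brand": {"@type": "Brand", "name": "Acme"},
  "offers": {
    "@type": "Offer",
    "price": "29.99",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  },
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.6", "reviewCount": "128"}
}
"""

data = json.loads(raw)
offers = data["offers"]
print(data["name"])                           # Widget Pro
print(float(offers["price"]))                 # 29.99
print(offers["availability"].split("/")[-1])  # InStock
```

Note that price and rating arrive as strings — Schema.org doesn't mandate numeric types, so always cast before doing arithmetic.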
Production JSON-LD Scraper
import httpx
import json
from dataclasses import dataclass, field
from bs4 import BeautifulSoup
@dataclass
class ProductData:
name: str = ""
price: float = 0.0
currency: str = "USD"
availability: str = ""
brand: str = ""
rating: float = 0.0
review_count: int = 0
description: str = ""
image_url: str = ""
sku: str = ""
url: str = ""
@classmethod
def from_jsonld(cls, data: dict, url: str = "") -> "ProductData":
"""Parse a Product JSON-LD object into a clean dataclass."""
offers = data.get("offers", {})
# offers can be a list (multiple offers) or a dict
if isinstance(offers, list):
offers = offers[0] if offers else {}
rating_data = data.get("aggregateRating", {})
brand_data = data.get("brand", {})
# Image can be string, list, or dict
image = data.get("image", "")
if isinstance(image, list):
image = image[0] if image else ""
elif isinstance(image, dict):
image = image.get("url", "")
return cls(
name=data.get("name", ""),
price=float(offers.get("price", 0)),
currency=offers.get("priceCurrency", "USD"),
availability=offers.get("availability", "").split("/")[-1],
brand=brand_data.get("name", "") if isinstance(brand_data, dict) else str(brand_data),
rating=float(rating_data.get("ratingValue", 0)),
review_count=int(rating_data.get("reviewCount", 0)),
description=data.get("description", ""),
image_url=image,
sku=data.get("sku", ""),
url=url,
)
async def scrape_products_jsonld(
urls: list[str],
proxy_url: str = "",
) -> list[ProductData]:
"""Scrape product data using JSON-LD extraction."""
products = []
client_kwargs = {}
if proxy_url:
client_kwargs["proxy"] = proxy_url
async with httpx.AsyncClient(timeout=30, **client_kwargs) as client:
for url in urls:
try:
resp = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36",
"Accept": "text/html",
})
if resp.status_code == 200:
product_lds = find_jsonld_by_type(resp.text, "Product")
for ld in product_lds:
product = ProductData.from_jsonld(ld, url=url)
if product.name: # Skip empty results
products.append(product)
except Exception as e:
print(f"Error scraping {url}: {e}")
return products
2. Microdata and RDFa
Older than JSON-LD but still present on many sites. Microdata is embedded directly in HTML attributes (itemscope, itemprop, itemtype). RDFa uses typeof, property, and about attributes.
Extracting Microdata
from bs4 import BeautifulSoup
def extract_microdata(html: str) -> list[dict]:
"""Extract Schema.org microdata from HTML attributes."""
soup = BeautifulSoup(html, "html.parser")
items = []
for element in soup.find_all(attrs={"itemscope": True}):
# Skip nested items — they'll be handled by their parent
parent_scope = element.find_parent(attrs={"itemscope": True})
if parent_scope:
continue
item = parse_microdata_item(element)
items.append(item)
return items
def parse_microdata_item(element) -> dict:
"""Recursively parse a microdata item and its properties."""
item = {
"@type": element.get("itemtype", "").split("/")[-1],
}
for prop in element.find_all(attrs={"itemprop": True}):
name = prop.get("itemprop")
# Check if this property is itself a nested item
if prop.get("itemscope") is not None:
value = parse_microdata_item(prop)
elif prop.name == "meta":
value = prop.get("content", "")
elif prop.name == "link":
value = prop.get("href", "")
elif prop.name == "img":
value = prop.get("src", "")
elif prop.name == "time":
value = prop.get("datetime", prop.get_text(strip=True))
elif prop.name == "a":
value = prop.get("href", prop.get_text(strip=True))
else:
value = prop.get_text(strip=True)
# Handle multiple values for the same property
if name in item:
existing = item[name]
if isinstance(existing, list):
existing.append(value)
else:
item[name] = [existing, value]
else:
item[name] = value
return item
# Example HTML with microdata:
# <div itemscope itemtype="https://schema.org/Product">
# <h1 itemprop="name">Widget Pro</h1>
# <span itemprop="price" content="29.99">$29.99</span>
# <meta itemprop="priceCurrency" content="USD">
# </div>
Extracting RDFa
def extract_rdfa(html: str) -> list[dict]:
"""Extract RDFa structured data from HTML."""
soup = BeautifulSoup(html, "html.parser")
items = []
for element in soup.find_all(attrs={"typeof": True}):
item = {"@type": element.get("typeof")}
for prop in element.find_all(attrs={"property": True}):
name = prop.get("property").split(":")[-1] # Remove prefix
if prop.get("content"):
value = prop["content"]
elif prop.name == "a":
value = prop.get("href", prop.get_text(strip=True))
elif prop.name == "img":
value = prop.get("src", "")
else:
value = prop.get_text(strip=True)
item[name] = value
items.append(item)
return items
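A quick self-contained check of the idea, using a minimal RDFa snippet (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

# Minimal RDFa markup (invented for illustration)
html_doc = """
<div typeof="schema:Product">
  <span property="schema:name">Widget Pro</span>
  <meta property="schema:price" content="29.99">
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
item = {"@type": soup.find(attrs={"typeof": True})["typeof"]}
for prop in soup.find_all(attrs={"property": True}):
    name = prop["property"].split(":")[-1]  # strip the "schema:" prefix
    # meta tags carry the value in content=; visible elements carry it as text
    item[name] = prop.get("content") or prop.get_text(strip=True)

print(item)  # {'@type': 'schema:Product', 'name': 'Widget Pro', 'price': '29.99'}
```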
3. Open Graph and Twitter Card Metadata
Almost every website has Open Graph tags for social media previews. These contain titles, descriptions, images, and sometimes prices — cleaner than scraping the page body.
def extract_meta_tags(html: str) -> dict:
"""Extract Open Graph, Twitter Card, and standard meta tags."""
soup = BeautifulSoup(html, "html.parser")
meta = {
"og": {},
"twitter": {},
"standard": {},
}
for tag in soup.find_all("meta"):
# Open Graph tags
prop = tag.get("property", "")
if prop.startswith("og:"):
key = prop[3:] # Remove "og:" prefix
meta["og"][key] = tag.get("content", "")
# Twitter Card tags
name = tag.get("name", "")
if name.startswith("twitter:"):
key = name[8:] # Remove "twitter:" prefix
meta["twitter"][key] = tag.get("content", "")
# Standard meta tags
if name in ("description", "keywords", "author", "robots"):
meta["standard"][name] = tag.get("content", "")
# Also grab the title tag
title_tag = soup.find("title")
if title_tag:
meta["standard"]["title"] = title_tag.get_text(strip=True)
# Canonical URL
canonical = soup.find("link", rel="canonical")
if canonical:
meta["standard"]["canonical"] = canonical.get("href", "")
return meta
# Usage
resp = httpx.get("https://example.com/article/123")
meta = extract_meta_tags(resp.text)
print(f"Title: {meta['og'].get('title', meta['standard'].get('title'))}")
print(f"Description: {meta['og'].get('description')}")
print(f"Image: {meta['og'].get('image')}")
print(f"Type: {meta['og'].get('type')}")
print(f"Site: {meta['og'].get('site_name')}")
Product-Specific Open Graph Tags
E-commerce sites often include pricing in Open Graph tags:
def extract_product_og(html: str) -> dict | None:
"""Extract product-specific Open Graph data."""
soup = BeautifulSoup(html, "html.parser")
product = {}
og_mappings = {
"og:title": "name",
"og:description": "description",
"og:image": "image",
"og:url": "url",
"product:price:amount": "price",
"product:price:currency": "currency",
"product:availability": "availability",
"product:brand": "brand",
"product:category": "category",
"product:condition": "condition",
}
for tag in soup.find_all("meta"):
prop = tag.get("property", "")
if prop in og_mappings:
product[og_mappings[prop]] = tag.get("content", "")
return product if product else None
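Here's the mapping approach boiled down to a few lines, run against a synthetic head section (the meta tags are invented for illustration):

```python
from bs4 import BeautifulSoup

# Head section with product Open Graph tags (invented for illustration)
html_doc = """
<head>
  <meta property="og:title" content="Widget Pro">
  <meta property="product:price:amount" content="29.99">
  <meta property="product:price:currency" content="USD">
</head>
"""

mapping = {
    "og:title": "name",
    "product:price:amount": "price",
    "product:price:currency": "currency",
}
soup = BeautifulSoup(html_doc, "html.parser")
product = {
    mapping[tag["property"]]: tag.get("content", "")
    for tag in soup.find_all("meta", property=True)
    if tag.get("property") in mapping
}
print(product)  # {'name': 'Widget Pro', 'price': '29.99', 'currency': 'USD'}
```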
4. JavaScript Data Extraction
Modern websites increasingly load data via JavaScript. The HTML is a shell, and the actual data lives in script tags as JSON blobs, window variables, or framework-specific data stores.
Next.js __NEXT_DATA__
Next.js applications embed their page data in a <script id="__NEXT_DATA__"> tag. This is a goldmine — the entire page's data in one clean JSON object:
import re
import json
def extract_nextjs_data(html: str) -> dict | None:
"""Extract Next.js page data from __NEXT_DATA__ script tag."""
# Method 1: BeautifulSoup (more reliable)
soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", id="__NEXT_DATA__")
if script and script.string:
try:
data = json.loads(script.string)
return data.get("props", {}).get("pageProps", {})
except json.JSONDecodeError:
pass
# Method 2: Regex fallback
match = re.search(
r'<script\s+id="__NEXT_DATA__"[^>]*>(.*?)</script>',
html,
re.DOTALL,
)
if match:
try:
data = json.loads(match.group(1))
return data.get("props", {}).get("pageProps", {})
except json.JSONDecodeError:
pass
return None
# Example: Scrape a Next.js e-commerce site
resp = httpx.get("https://nextjs-store.example.com/product/widget-pro")
page_data = extract_nextjs_data(resp.text)
if page_data:
product = page_data.get("product", {})
print(f"Name: {product.get('name')}")
print(f"Price: {product.get('price')}")
print(f"Variants: {len(product.get('variants', []))}")
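You can verify the shape of this payload without a live site. Here's the regex path run against a synthetic __NEXT_DATA__ tag (the page data is invented):

```python
import json
import re

# Synthetic Next.js page (payload invented for illustration)
html_doc = (
    '<html><body>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"product": {"name": "Widget Pro", "price": 29.99}}}}'
    '</script></body></html>'
)

match = re.search(
    r'<script\s+id="__NEXT_DATA__"[^>]*>(.*?)</script>',
    html_doc,
    re.DOTALL,
)
page_props = json.loads(match.group(1)).get("props", {}).get("pageProps", {})
print(page_props["product"])  # {'name': 'Widget Pro', 'price': 29.99}
```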
Nuxt.js __NUXT_DATA__
Nuxt 3 uses a different serialization format:
def extract_nuxt_data(html: str) -> list | None:
"""Extract Nuxt.js page data."""
soup = BeautifulSoup(html, "html.parser")
# Nuxt 3 uses multiple script tags with type="application/json" and id pattern
scripts = soup.find_all("script", type="application/json")
for script in scripts:
if script.get("id", "").startswith("__NUXT_DATA__"):
try:
return json.loads(script.string)
except json.JSONDecodeError:
continue
# Nuxt 2 uses window.__NUXT__
match = re.search(
r'window\.__NUXT__\s*=\s*({.*?});?\s*</script>',
html,
re.DOTALL,
)
if match:
# Nuxt 2 data often uses JavaScript syntax (not pure JSON)
# This may need eval-like parsing for complex cases
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
pass
return None
Generic JavaScript Variable Extraction
def extract_js_variables(html: str, patterns: list[str]) -> dict:
"""
Extract JavaScript variables from inline scripts.
patterns: list of variable name patterns like:
"window.initialState", "var productData", "__CONFIG__"
"""
results = {}
for pattern in patterns:
# Handle different assignment styles
regex_patterns = [
rf'{re.escape(pattern)}\s*=\s*({{.*?}});', # var x = {...};
rf'{re.escape(pattern)}\s*=\s*(\[.*?\]);', # var x = [...];
rf'{re.escape(pattern)}\s*=\s*JSON\.parse\(\'(.*?)\'\)', # JSON.parse('...')
]
for regex in regex_patterns:
match = re.search(regex, html, re.DOTALL)
if match:
try:
# Try parsing as JSON
data = json.loads(match.group(1))
results[pattern] = data
break
except json.JSONDecodeError:
# Store raw string if not valid JSON
results[pattern] = match.group(1)
break
return results
# Usage
data = extract_js_variables(page_html, [
"window.__INITIAL_STATE__",
"window.__PRELOADED_STATE__",
"window.__APP_DATA__",
"window.pageData",
])
API Response Interception
Sometimes the best data source is the API the frontend calls. Intercept these with Playwright:
from playwright.async_api import async_playwright
import asyncio
import json
async def intercept_api_calls(
url: str,
api_patterns: list[str],
proxy: str = "",
) -> list[dict]:
"""
Load a page and capture API responses matching patterns.
Often cleaner than parsing HTML at all.
"""
captured = []
async with async_playwright() as p:
launch_kwargs = {"headless": True}
if proxy:
launch_kwargs["proxy"] = {"server": proxy}
browser = await p.chromium.launch(**launch_kwargs)
page = await browser.new_page()
# Intercept API responses
async def handle_response(response):
for pattern in api_patterns:
if pattern in response.url:
try:
body = await response.json()
captured.append({
"url": response.url,
"status": response.status,
"data": body,
})
except Exception:
pass
page.on("response", handle_response)
await page.goto(url, wait_until="networkidle")
await asyncio.sleep(2) # Wait for any lazy-loaded API calls
await browser.close()
return captured
# Example: Capture product API responses
results = asyncio.run(intercept_api_calls(
url="https://store.example.com/category/electronics",
api_patterns=["/api/products", "/api/v2/catalog", "graphql"],
))
for result in results:
print(f"API: {result['url']}")
print(f" Items: {len(result['data'].get('products', []))}")
5. CSS Selectors with BeautifulSoup
When structured data isn't available, CSS selectors are the workhorse. They're fast, readable, and handle most well-structured HTML.
Core Selector Patterns
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# === Basic Selectors ===
soup.select_one(".product-title") # Class
soup.select_one("#price") # ID
soup.select("ul.results > li") # Direct children
soup.select("div.card") # All matching elements
# === Attribute Selectors ===
soup.select('a[href*="product"]') # href contains "product"
soup.select('a[href^="/category"]') # href starts with "/category"
soup.select('a[href$=".pdf"]') # href ends with ".pdf"
soup.select('input[type="hidden"]') # exact attribute match
soup.select('[data-product-id]') # has attribute (any value)
# === Combinators ===
soup.select("div.sidebar a") # Descendant (any depth)
soup.select("div.sidebar > a") # Direct child only
soup.select("h2 + p") # Immediately following sibling
soup.select("h2 ~ p") # Any following sibling
# === Pseudo-selectors ===
soup.select("tr:nth-child(2) td") # Second row
soup.select("li:first-child") # First list item
soup.select("li:last-child") # Last list item
soup.select("p:not(.ad)") # Exclude class
# === Combining Multiple Selectors ===
soup.select("h1, h2, h3") # Any of these tags
soup.select("div.price.sale") # Element with both classes
Production-Ready Card Scraping
from dataclasses import dataclass
from bs4 import BeautifulSoup, Tag
@dataclass
class ScrapedItem:
title: str
price: str
url: str
image: str
rating: str
def is_valid(self) -> bool:
return bool(self.title and self.price)
def scrape_product_cards(html: str, base_url: str = "") -> list[ScrapedItem]:
"""Extract product data from common card-based layouts."""
soup = BeautifulSoup(html, "html.parser")
items = []
# Try common card selectors
card_selectors = [
"div.product-card",
"div.product-item",
"li.product",
"article.product",
"div[data-component='product-card']",
".search-result-item",
".listing-card",
]
cards = []
for selector in card_selectors:
cards = soup.select(selector)
if cards:
break
for card in cards:
item = ScrapedItem(
title=extract_text(card, [
"h2", "h3", ".product-title", ".product-name",
"[data-testid='title']", ".listing-title",
]),
price=extract_text(card, [
".price", ".product-price", "[data-testid='price']",
".sale-price", ".current-price", "span.amount",
]),
url=extract_link(card, base_url),
image=extract_image(card),
rating=extract_text(card, [
".rating", ".stars", "[data-testid='rating']",
".review-score",
]),
)
if item.is_valid():
items.append(item)
return items
def extract_text(parent: Tag, selectors: list[str]) -> str:
"""Try multiple selectors, return first match's text."""
for selector in selectors:
elem = parent.select_one(selector)
if elem:
return elem.get_text(strip=True)
return ""
def extract_link(parent: Tag, base_url: str) -> str:
"""Extract the primary link from a card element."""
link = parent.select_one("a[href]")
if link:
href = link.get("href", "")
if href.startswith("/"):
return base_url + href
return href
return ""
def extract_image(parent: Tag) -> str:
"""Extract image URL, handling lazy-loading attributes."""
img = parent.select_one("img")
if img:
# Try lazy-loading attributes first (actual image URL)
for attr in ["data-src", "data-lazy-src", "data-original"]:
if img.get(attr):
return img[attr]
return img.get("src", "")
return ""
Table Extraction
Tables are one of the most common data formats on the web. Here's a robust extractor:
import pandas as pd
from bs4 import BeautifulSoup
def extract_tables(html: str, table_index: int | None = None) -> list[list[dict]]:
"""
Extract all tables from HTML into list-of-dicts format.
Each table becomes a list of rows, each row is a dict.
"""
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")
if table_index is not None:
tables = [tables[table_index]] if table_index < len(tables) else []
results = []
for table in tables:
# Extract headers
headers = []
header_row = table.find("thead")
if header_row:
headers = [
th.get_text(strip=True)
for th in header_row.find_all(["th", "td"])
]
else:
# Try first row as header
first_row = table.find("tr")
if first_row and first_row.find("th"):
headers = [
th.get_text(strip=True) for th in first_row.find_all("th")
]
# Extract body rows
body = table.find("tbody") or table
rows = []
for tr in body.find_all("tr"):
cells = tr.find_all(["td", "th"])
if not cells:
continue
values = [cell.get_text(strip=True) for cell in cells]
if headers and len(values) == len(headers):
row = dict(zip(headers, values))
else:
row = {f"col_{i}": v for i, v in enumerate(values)}
# Skip header row if it's in the body
if values != headers:
rows.append(row)
results.append(rows)
return results
def tables_to_dataframes(html: str) -> list[pd.DataFrame]:
"""Convert HTML tables directly to pandas DataFrames."""
tables = extract_tables(html)
return [pd.DataFrame(table) for table in tables if table]
# Quick table extraction with pandas
dfs = pd.read_html("https://example.com/stats")
for i, df in enumerate(dfs):
print(f"Table {i}: {len(df)} rows x {len(df.columns)} columns")
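The core of the extractor, zipping header cells with body cells, can be checked in isolation against an inline table (sample markup assumed):

```python
from bs4 import BeautifulSoup

# A small well-formed table (sample markup for illustration)
html_doc = """
<table>
  <thead><tr><th>Name</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Widget</td><td>29.99</td></tr>
    <tr><td>Gadget</td><td>49.99</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table")
headers = [th.get_text(strip=True) for th in table.thead.find_all("th")]
rows = [
    dict(zip(headers, (td.get_text(strip=True) for td in tr.find_all("td"))))
    for tr in table.tbody.find_all("tr")
]
print(rows)  # [{'Name': 'Widget', 'Price': '29.99'}, {'Name': 'Gadget', 'Price': '49.99'}]
```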
Handling CSS-in-JS (Hashed Class Names)
Modern React/Vue apps often use CSS-in-JS libraries that generate random class names like _a3f2b or css-1x2y3z. These break between deployments. Strategies:
def extract_by_structure(html: str) -> list[dict]:
"""
Extract data using structural patterns instead of class names.
Works when classes are hashed/random.
"""
soup = BeautifulSoup(html, "html.parser")
# Strategy 1: Use data-* attributes (these survive CSS-in-JS)
items = soup.select("[data-testid='product-card']")
# Strategy 2: Use ARIA attributes
items = soup.select("[role='listitem']")
items = soup.select("[aria-label*='product']")
# Strategy 3: Use tag structure
# "Find all divs that contain an h2 and a span with $ in the text"
results = []
for div in soup.find_all("div"):
h2 = div.find("h2")
price_span = div.find("span", string=re.compile(r'\$\d'))
if h2 and price_span:
results.append({
"title": h2.get_text(strip=True),
"price": price_span.get_text(strip=True),
})
# Strategy 4: Use semantic HTML tags
for article in soup.find_all("article"):
heading = article.find(["h1", "h2", "h3"])
if heading:
results.append({"title": heading.get_text(strip=True)})
return results
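Strategy 3 in action against markup with hashed class names (the sample HTML is invented). The class names carry no meaning; only the heading-plus-price structure is matched:

```python
import re
from bs4 import BeautifulSoup

# Hashed, meaningless class names; only the structure is stable (invented sample)
html_doc = """
<div class="css-1x2y3z"><h2>Widget Pro</h2><span class="_a3f2b">$29.99</span></div>
<div class="css-9q8w7e"><h2>Gadget Max</h2><span class="_c9d1e">$49.99</span></div>
<div class="css-footer"><span>No heading here</span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
results = []
for div in soup.find_all("div"):
    h2 = div.find("h2")
    price = div.find("span", string=re.compile(r"\$\d"))
    if h2 and price:  # structural match: a heading plus a price-looking span
        results.append({
            "title": h2.get_text(strip=True),
            "price": price.get_text(strip=True),
        })

print(results)
# [{'title': 'Widget Pro', 'price': '$29.99'}, {'title': 'Gadget Max', 'price': '$49.99'}]
```

The footer div is skipped because it has no heading — the selector survives any renaming of the classes.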
6. XPath with lxml
XPath's killer features are parent traversal (CSS can't go up the tree) and complex conditional expressions. Use lxml for XPath in Python.
Core XPath Patterns
from lxml import html
tree = html.fromstring(page_content)
# === Basic Navigation ===
titles = tree.xpath("//h2/text()") # All h2 text
links = tree.xpath("//a/@href") # All link hrefs
tree.xpath("//div[@class='product']/h2/text()") # Class match
# === Parent Traversal (CSS can't do this) ===
# Find the div that contains a span with text "Price"
container = tree.xpath("//span[text()='Price']/parent::div")
# Find the table row containing "Total"
total_row = tree.xpath("//td[contains(text(),'Total')]/parent::tr")
# === Sibling Navigation ===
# Get the value next to a label
value = tree.xpath("//dt[text()='SKU']/following-sibling::dd[1]/text()")
# Get all list items after a specific heading
items = tree.xpath("//h3[text()='Features']/following-sibling::ul[1]/li/text()")
# === Complex Conditions ===
# Find rows where the second column contains "USD"
rows = tree.xpath("//tr[td[2][contains(text(), 'USD')]]")
# Find links that don't start with '#' or 'javascript:'
links = tree.xpath("//a[not(starts-with(@href, '#')) and not(starts-with(@href, 'javascript:'))]/@href")
# Find products with price below $50 (text comparison, not numeric)
tree.xpath("//div[@class='product'][.//span[@class='price' and number(translate(text(), '$,', '')) < 50]]")
# === Text Functions ===
# Normalize whitespace
tree.xpath("normalize-space(//div[@class='description'])")
# Concatenate text from multiple elements
tree.xpath("string(//div[@class='address'])")
# === Positional ===
tree.xpath("(//div[@class='result'])[1]") # First result
tree.xpath("(//div[@class='result'])[last()]") # Last result
tree.xpath("(//div[@class='result'])[position() <= 5]") # First 5
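The parent-traversal and sibling patterns above, run against a small spec fragment (markup invented for illustration):

```python
from lxml import html

# Definition list of product specs plus a totals row (invented for illustration)
doc = html.fromstring("""
<div>
  <dl>
    <dt>SKU</dt><dd>WP-123</dd>
    <dt>Weight</dt><dd>1.2 kg</dd>
  </dl>
  <table><tr><td>Subtotal</td><td>$25.00</td></tr></table>
</div>
""")

# Sibling navigation: the value next to a label
sku = doc.xpath("//dt[text()='SKU']/following-sibling::dd[1]/text()")[0]
print(sku)  # WP-123

# Parent traversal: the row containing a given cell (CSS can't go up)
row = doc.xpath("//td[contains(text(),'Subtotal')]/parent::tr")[0]
print(row.xpath("./td[2]/text()")[0])  # $25.00
```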
Practical XPath Scraper
from lxml import html
import httpx
def scrape_with_xpath(page_content: str) -> list[dict]:
"""Extract data using XPath — handles complex layouts."""
tree = html.fromstring(page_content)
results = []
# Find product containers using structural XPath
products = tree.xpath(
"//div[contains(@class, 'product') or contains(@class, 'item')]"
"[.//h2 or .//h3]" # Must contain a heading
"[.//span[contains(@class, 'price')]]" # Must contain a price
)
for product in products:
# Extract with fallback chains
title = (
product.xpath(".//h2/text()") or
product.xpath(".//h3/text()") or
product.xpath(".//a[@title]/@title") or
[""]
)[0].strip()
price = (
product.xpath(".//span[contains(@class, 'price')]/text()") or
product.xpath(".//*[contains(@class, 'amount')]/text()") or
[""]
)[0].strip()
link = (
product.xpath(".//a/@href") or
[""]
)[0]
# Extract structured attributes
data_attrs = {}
for attr in ["data-id", "data-sku", "data-price", "data-brand"]:
values = product.xpath(f".//@{attr}")
if values:
data_attrs[attr.replace("data-", "")] = values[0]
results.append({
"title": title,
"price": price,
"link": link,
**data_attrs,
})
return results
7. Regex for Non-HTML Data
Regex should never be your first choice for parsing HTML structure. But it's essential for extracting data from non-HTML sources embedded in pages — JavaScript variables, inline styles, comments, and data URIs.
import re
import json
def extract_embedded_data(html: str) -> dict:
"""Extract various types of embedded data from a page."""
results = {}
# Email addresses
emails = re.findall(
r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
html,
)
if emails:
results["emails"] = list(set(emails))
# Phone numbers (various formats)
phones = re.findall(
r'[\+]?[(]?[0-9]{1,4}[)]?[-\s\./0-9]{7,15}',
html,
)
if phones:
results["phones"] = [p.strip() for p in set(phones) if len(p.strip()) >= 10]
# Prices (various formats)
prices = re.findall(
r'(?:[$€£¥])\s*[\d,]+(?:\.\d{2})?|[\d,]+(?:\.\d{2})?\s*(?:USD|EUR|GBP)',
html,
)
if prices:
results["prices"] = list(set(prices))
# Coordinates (latitude, longitude)
coords = re.findall(
r'[-+]?(?:[1-8]?\d(?:\.\d+)?|90(?:\.0+)?)\s*,\s*'
r'[-+]?(?:180(?:\.0+)?|(?:(?:1[0-7]\d)|(?:[1-9]?\d))(?:\.\d+)?)',
html,
)
if coords:
results["coordinates"] = coords
# JSON objects in script tags
json_blobs = re.findall(
r'(?:var|let|const)\s+\w+\s*=\s*({[^;]+});',
html,
re.DOTALL,
)
valid_json = []
for blob in json_blobs:
try:
parsed = json.loads(blob)
valid_json.append(parsed)
except json.JSONDecodeError:
pass
if valid_json:
results["json_data"] = valid_json
return results
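A quick sanity check of the email and price patterns against a synthetic page fragment (the text is invented):

```python
import re

# Synthetic page fragment (invented for illustration)
text = "Contact sales@example.com for Widget Pro, now $29.99 (was $39.99)"

emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
prices = re.findall(r'[$€£¥]\s*[\d,]+(?:\.\d{2})?', text)

print(emails)  # ['sales@example.com']
print(prices)  # ['$29.99', '$39.99']
```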
Putting It All Together: Universal Extractor
Here's a production-ready extractor that tries every method in order:
import httpx
import json
import re
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
@dataclass
class ExtractionResult:
url: str
method: str # Which extraction method succeeded
data: dict = field(default_factory=dict)
raw_html: str = ""
confidence: float = 0.0 # 0-1, how confident we are in the data
class UniversalExtractor:
"""Try multiple extraction methods, return the best result."""
def __init__(self, proxy_url: str = ""):
self.proxy_url = proxy_url
async def extract(self, url: str) -> ExtractionResult:
"""Fetch URL and extract data using the best available method."""
client_kwargs = {"timeout": 30}
if self.proxy_url:
client_kwargs["proxy"] = self.proxy_url
async with httpx.AsyncClient(**client_kwargs) as client:
resp = await client.get(url, headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36",
})
html = resp.text
# Try methods in order of reliability
# 1. JSON-LD
jsonld_data = self._try_jsonld(html)
if jsonld_data:
return ExtractionResult(
url=url, method="jsonld",
data=jsonld_data, confidence=0.95,
)
# 2. Next.js data
nextjs_data = self._try_nextjs(html)
if nextjs_data:
return ExtractionResult(
url=url, method="nextjs",
data=nextjs_data, confidence=0.90,
)
# 3. Open Graph
og_data = self._try_opengraph(html)
if og_data and len(og_data) >= 3: # Meaningful OG data
return ExtractionResult(
url=url, method="opengraph",
data=og_data, confidence=0.80,
)
# 4. Microdata
micro_data = self._try_microdata(html)
if micro_data:
return ExtractionResult(
url=url, method="microdata",
data=micro_data, confidence=0.85,
)
# 5. CSS selectors (generic)
css_data = self._try_css(html)
if css_data:
return ExtractionResult(
url=url, method="css",
data=css_data, confidence=0.70,
)
# Return raw HTML if nothing worked
return ExtractionResult(
url=url, method="none",
data={}, raw_html=html, confidence=0.0,
)
def _try_jsonld(self, html: str) -> dict | None:
soup = BeautifulSoup(html, "html.parser")
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
if isinstance(data, dict) and "@type" in data:
return data
if isinstance(data, dict) and "@graph" in data:
return {"@graph": data["@graph"]}
except (json.JSONDecodeError, TypeError):
continue
return None
def _try_nextjs(self, html: str) -> dict | None:
match = re.search(
r'<script\s+id="__NEXT_DATA__"[^>]*>(.*?)</script>',
html, re.DOTALL,
)
if match:
try:
data = json.loads(match.group(1))
return data.get("props", {}).get("pageProps", {})
except json.JSONDecodeError:
pass
return None
def _try_opengraph(self, html: str) -> dict | None:
soup = BeautifulSoup(html, "html.parser")
og = {}
for tag in soup.find_all("meta"):
prop = tag.get("property", "")
if prop.startswith("og:"):
og[prop[3:]] = tag.get("content", "")
return og if og else None
def _try_microdata(self, html: str) -> dict | None:
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all(attrs={"itemscope": True, "itemtype": True})
if items:
item = items[0]
result = {"@type": item.get("itemtype", "").split("/")[-1]}
for prop in item.find_all(attrs={"itemprop": True}):
name = prop["itemprop"]
if prop.get("content"):
result[name] = prop["content"]
else:
result[name] = prop.get_text(strip=True)
return result
return None
def _try_css(self, html: str) -> dict | None:
soup = BeautifulSoup(html, "html.parser")
result = {}
# Try to extract title
for sel in ["h1", "h2.title", ".product-title", "[data-testid='title']"]:
elem = soup.select_one(sel)
if elem:
result["title"] = elem.get_text(strip=True)
break
# Try to extract price
for sel in [".price", "#price", "[data-testid='price']", ".amount"]:
elem = soup.select_one(sel)
if elem:
result["price"] = elem.get_text(strip=True)
break
# Try to extract description
for sel in [".description", "#description", "[data-testid='description']", "p.intro"]:
elem = soup.select_one(sel)
if elem:
result["description"] = elem.get_text(strip=True)[:500]
break
return result if result else None
## Error Handling and Robustness

### Handling Encoding Issues

```python
import httpx
import chardet

def fetch_with_encoding(url: str, proxy: str = "") -> str:
    """Fetch a page and handle encoding correctly."""
    client_kwargs = {"timeout": 30}
    if proxy:
        client_kwargs["proxy"] = proxy
    with httpx.Client(**client_kwargs) as client:
        resp = client.get(url)
        # httpx usually detects encoding correctly, but sometimes it doesn't
        if resp.encoding == "ascii" or resp.encoding is None:
            # Detect from the raw bytes; chardet may report encoding as None,
            # so use `or` rather than a .get() default
            detected = chardet.detect(resp.content)
            encoding = detected.get("encoding") or "utf-8"
            return resp.content.decode(encoding, errors="replace")
        return resp.text
```
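A subtlety worth knowing when wiring this up: chardet can report `"encoding": None` when its confidence is low, and `dict.get`'s default only applies when the key is *missing*, not when its value is `None`. The dict below is illustrative of chardet's output shape:

```python
# What chardet.detect() may return on ambiguous bytes
detected = {"encoding": None, "confidence": 0.0}

wrong = detected.get("encoding", "utf-8")    # None: key exists, default ignored
right = detected.get("encoding") or "utf-8"  # "utf-8": falsy value replaced

print(wrong, right)
```

Passing `None` as an encoding to `bytes.decode` raises a `TypeError`, so the `or` guard is what keeps the fallback path from crashing.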
### Handling Malformed HTML

```python
from bs4 import BeautifulSoup

def parse_html_robust(raw_html: str) -> BeautifulSoup:
    """Parse HTML that might be malformed."""
    # lxml is the fastest parser, but it must be installed and can choke on
    # severely broken markup; html.parser (stdlib) is more forgiving
    try:
        return BeautifulSoup(raw_html, "lxml")
    except Exception:
        pass
    try:
        return BeautifulSoup(raw_html, "html.parser")
    except Exception:
        pass
    # Last resort: html5lib (slowest, but parses anything a browser would)
    return BeautifulSoup(raw_html, "html5lib")
```
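As a quick sanity check that malformed markup still yields a usable tree, here is a standalone snippet using `html.parser` directly (so it runs without lxml installed); the broken HTML is invented:

```python
from bs4 import BeautifulSoup

broken = "<div><p>Unclosed paragraph<span>nested"
soup = BeautifulSoup(broken, "html.parser")

# The parser closes the dangling tags for us and builds a navigable tree
print(soup.find("p").get_text())
```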
### Handling Missing Elements Gracefully

```python
def safe_select(soup, selector: str, attribute: str | None = None) -> str:
    """Safely extract text or an attribute via a CSS selector."""
    elem = soup.select_one(selector)
    if elem is None:
        return ""
    if attribute:
        return elem.get(attribute, "")
    return elem.get_text(strip=True)

def safe_select_all(soup, selector: str) -> list[str]:
    """Safely extract text from all matching elements."""
    return [elem.get_text(strip=True) for elem in soup.select(selector)]
```
## Using Proxies for Large-Scale Extraction

When scraping at scale, you need proxies to avoid rate limits. Here's how to integrate proxy rotation with the extraction techniques above:
```python
import asyncio
import random

import httpx

async def extract_at_scale(
    urls: list[str],
    proxy_url: str,
    extract_fn,
    max_concurrent: int = 5,
    delay_range: tuple[float, float] = (1, 3),
) -> list[dict]:
    """
    Scrape URLs at scale with proxy rotation and rate limiting.

    Uses ThorData residential proxies for protected sites.
    Get started at: https://thordata.partnerstack.com/partner/0a0x4nzh
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(url: str) -> dict:
        async with semaphore:
            try:
                async with httpx.AsyncClient(proxy=proxy_url, timeout=30) as client:
                    resp = await client.get(url, headers={
                        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                      "AppleWebKit/537.36",
                        "Accept": "text/html,application/xhtml+xml",
                        "Accept-Language": "en-US,en;q=0.9",
                    })
                    if resp.status_code == 200:
                        return {"url": url, "status": "ok", "data": extract_fn(resp.text)}
                    return {"url": url, "status": resp.status_code}
            except Exception as e:
                return {"url": url, "status": "error", "error": str(e)}
            finally:
                # Random delay so requests don't fire at a detectable cadence
                await asyncio.sleep(random.uniform(*delay_range))

    results = await asyncio.gather(*(fetch_one(url) for url in urls))
    success = sum(1 for r in results if r.get("status") == "ok")
    print(f"Extracted: {success}/{len(urls)} successful")
    return results
```
## Testing Your Selectors Before Coding

Before writing Python, test selectors in the browser DevTools console:
```javascript
// Test CSS selectors
document.querySelectorAll("div.product-card h2")

// Test XPath
document.evaluate("//h2[@class='title']", document, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null)

// Check for JSON-LD
document.querySelectorAll('script[type="application/ld+json"]')
    .forEach(s => console.log(JSON.parse(s.textContent)))

// Check for Next.js data
const nd = document.getElementById('__NEXT_DATA__');
if (nd) console.log(JSON.parse(nd.textContent).props.pageProps);

// Check for microdata
document.querySelectorAll('[itemscope]').forEach(el => {
  console.log('Type:', el.getAttribute('itemtype'));
  el.querySelectorAll('[itemprop]').forEach(prop => {
    console.log(`  ${prop.getAttribute('itemprop')}: ${prop.textContent.trim()}`);
  });
});
```
This takes 30 seconds and saves you from running your scraper 15 times to debug a selector.
## The Main Takeaway

Always check for structured data first. The best selector is the one you don't have to write. JSON-LD gives you clean, typed, stable data that survives redesigns. `__NEXT_DATA__` gives you the exact data the frontend uses. Open Graph tags give you the title, description, and image without parsing the page body.

Start at the top of the extraction hierarchy and work your way down. Every step down increases fragility and maintenance cost. The 30 seconds you spend checking for JSON-LD before writing CSS selectors will save you hours of selector maintenance when the site inevitably redesigns.

When you do need to use selectors, prefer data attributes (`data-testid`, `data-id`) and semantic HTML (`article`, `nav`, `main`) over CSS classes, since they're more stable across redesigns. And when scraping at scale, pair your extraction logic with reliable proxy rotation through a service like ThorData to handle rate limits and geo-restrictions without getting blocked.
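To make the stability point concrete, here is a toy before/after comparison with invented HTML: the class name changes in a redesign while the `data-testid` hook survives:

```python
from bs4 import BeautifulSoup

# The same element before and after a hypothetical redesign: the styling
# class changes, the data-testid hook does not.
before = BeautifulSoup(
    '<span class="price-lg" data-testid="price">$19.99</span>', "html.parser")
after = BeautifulSoup(
    '<span class="amount--xl" data-testid="price">$19.99</span>', "html.parser")

print(before.select_one(".price-lg").get_text())             # $19.99
print(after.select_one(".price-lg"))                         # None: class selector broke
print(after.select_one('[data-testid="price"]').get_text())  # $19.99: still works
```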