How to Scrape Recipe Data from AllRecipes in 2026 (JSON-LD Extraction)
Recipe sites have quietly become some of the best-structured data sources on the web. AllRecipes, Food Network, and most major cooking sites embed their recipe data as JSON-LD inside <script> tags — a byproduct of chasing Google's rich results. This means you can often skip fighting HTML entirely and pull clean, structured data with a handful of lines of Python.
This guide covers the full pipeline: extracting JSON-LD, falling back to HTML when needed, paginating through category pages, handling anti-bot layers, storing results, and running at production scale.
Why JSON-LD Makes Recipe Scraping Easy
Google's recipe rich results require structured data in the schema.org/Recipe format. Sites that want their recipes to appear with star ratings, cooking times, and calorie counts in search results have to publish this data. The result is that most major recipe sites now embed a machine-readable version of every recipe directly in the page.
A typical JSON-LD block on AllRecipes looks like this (inside a <script type="application/ld+json"> tag):
{
"@context": "https://schema.org",
"@type": "Recipe",
"name": "Classic Beef Stew",
"recipeIngredient": ["2 lbs beef chuck", "4 carrots", "3 potatoes"],
"recipeInstructions": [
{"@type": "HowToStep", "text": "Season beef and brown in batches..."},
{"@type": "HowToStep", "text": "Add vegetables and simmer..."}
],
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": "4.6",
"reviewCount": "1842"
},
"nutrition": {
"@type": "NutritionInformation",
"calories": "420 calories",
"fatContent": "18g",
"proteinContent": "35g",
"carbohydrateContent": "32g",
"sodiumContent": "820mg"
},
"totalTime": "PT2H30M",
"prepTime": "PT30M",
"cookTime": "PT2H",
"recipeYield": "6 servings",
"recipeCategory": "Main Dish",
"recipeCuisine": "American",
"keywords": "beef stew, comfort food, winter recipe"
}
Every field you care about — ingredients, ratings, nutrition, cooking time, yield — is already parsed and labeled. Compare this to scraping the same data from HTML: you'd be chasing inconsistent class names, parsing free-text strings like "1½ cups, sifted", and writing fragile regex for every site variant.
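You can verify this yourself with nothing but the standard library. Here the sample block above (trimmed to a few fields) is parsed and accessed directly:

```python
import json

# A trimmed copy of the sample JSON-LD block shown above
raw = """{
  "@type": "Recipe",
  "name": "Classic Beef Stew",
  "recipeIngredient": ["2 lbs beef chuck", "4 carrots", "3 potatoes"],
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.6", "reviewCount": "1842"}
}"""

recipe = json.loads(raw)
print(recipe["name"])                                   # Classic Beef Stew
print(len(recipe["recipeIngredient"]))                  # 3
print(float(recipe["aggregateRating"]["ratingValue"]))  # 4.6
```

Note that `ratingValue` arrives as a string; casting to `float` is your job (more on this in the gotchas section).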
Environment Setup
pip install "httpx[http2]" beautifulsoup4 isodate
We'll use httpx for its HTTP/2 support (requests is HTTP/1.1-only, which stands out to bot-detection layers), BeautifulSoup for HTML parsing fallbacks, and isodate for converting ISO 8601 duration strings to minutes. sqlite3 ships with the Python standard library, so it needs no install; the http2 extra pulls in the h2 dependency that httpx requires for HTTP/2.
Core Extraction: JSON-LD with Python
The extraction is straightforward. Use httpx for requests and BeautifulSoup to find the script tags:
import httpx
import json
import isodate
from bs4 import BeautifulSoup
from datetime import timedelta
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
}
def get_recipe_schema(url: str, client: httpx.Client | None = None) -> dict | None:
"""Extract the schema.org/Recipe JSON-LD object from a recipe page."""
close_client = False
if client is None:
client = httpx.Client(headers=HEADERS, follow_redirects=True, timeout=20)
close_client = True
try:
resp = client.get(url)
resp.raise_for_status()
finally:
if close_client:
client.close()
soup = BeautifulSoup(resp.text, "html.parser")
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
except (json.JSONDecodeError, TypeError):
continue
        # Handle direct objects, top-level arrays, and @graph wrappers
        if isinstance(data, dict) and "@graph" in data:
            items = data["@graph"]
        elif isinstance(data, list):
            items = data
        else:
            items = [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Recipe" in types:  # @type may be a list like ["Recipe", "NewsArticle"]
                return item
return None
def iso_duration_to_minutes(duration_str: str | None) -> int | None:
"""Convert ISO 8601 duration (PT1H30M) to minutes."""
if not duration_str:
return None
try:
duration = isodate.parse_duration(duration_str)
if isinstance(duration, timedelta):
return int(duration.total_seconds() / 60)
except Exception:
pass
return None
def parse_recipe(schema: dict) -> dict:
"""Normalize a schema.org/Recipe dict into a clean flat structure."""
    rating = schema.get("aggregateRating") or {}
    nutrition = schema.get("nutrition") or {}
# Instructions may be strings or HowToStep objects
instructions_raw = schema.get("recipeInstructions", [])
if isinstance(instructions_raw, str):
instructions = [instructions_raw]
else:
instructions = []
for step in instructions_raw:
if isinstance(step, str):
instructions.append(step)
elif isinstance(step, dict):
instructions.append(step.get("text", ""))
# Image may be a string, list, or ImageObject dict
image_raw = schema.get("image")
if isinstance(image_raw, list):
image = image_raw[0] if image_raw else None
elif isinstance(image_raw, dict):
image = image_raw.get("url")
else:
image = image_raw
return {
"name": schema.get("name"),
"description": schema.get("description"),
"ingredients": schema.get("recipeIngredient", []),
"instructions": instructions,
"servings": schema.get("recipeYield"),
"total_time_minutes": iso_duration_to_minutes(schema.get("totalTime")),
"prep_time_minutes": iso_duration_to_minutes(schema.get("prepTime")),
"cook_time_minutes": iso_duration_to_minutes(schema.get("cookTime")),
"category": schema.get("recipeCategory"),
"cuisine": schema.get("recipeCuisine"),
"keywords": schema.get("keywords"),
"rating": rating.get("ratingValue"),
"review_count": rating.get("reviewCount"),
"calories": nutrition.get("calories"),
"fat": nutrition.get("fatContent"),
"protein": nutrition.get("proteinContent"),
"carbs": nutrition.get("carbohydrateContent"),
"sodium": nutrition.get("sodiumContent"),
"fiber": nutrition.get("fiberContent"),
"sugar": nutrition.get("sugarContent"),
"image": image,
"author": schema.get("author", {}).get("name") if isinstance(schema.get("author"), dict) else schema.get("author"),
}
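The @graph and @type handling deserves a standalone sanity check. This stdlib-only sketch mirrors the lookup logic against two fabricated minimal payloads (find_recipe is an illustrative helper, not part of the pipeline above):

```python
def find_recipe(data):
    # Accept a bare object, a top-level list, or an @graph wrapper
    if isinstance(data, dict) and "@graph" in data:
        items = data["@graph"]
    elif isinstance(data, list):
        items = data
    else:
        items = [data]
    for item in items:
        if not isinstance(item, dict):
            continue
        t = item.get("@type")
        if "Recipe" in (t if isinstance(t, list) else [t]):
            return item
    return None

direct = {"@type": "Recipe", "name": "Stew"}
graph = {"@context": "https://schema.org",
         "@graph": [{"@type": "WebPage"},
                    {"@type": ["Recipe", "Article"], "name": "Stew"}]}
print(find_recipe(direct)["name"], find_recipe(graph)["name"])  # Stew Stew
```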
Fallback HTML Parsing
Not every recipe page has clean JSON-LD. Older pages, user-submitted content, and some Food Network categories ship HTML-only markup. When get_recipe_schema() returns None, fall back to an HTML parser:
def parse_recipe_html_fallback(url: str, client: httpx.Client) -> dict:
"""HTML fallback for pages without JSON-LD."""
resp = client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
# Try multiple selector strategies, most specific first
def get_text(selectors: list[str]) -> str | None:
for sel in selectors:
el = soup.select_one(sel)
if el:
return el.get_text(strip=True)
return None
def get_list(selectors: list[str]) -> list[str]:
for sel in selectors:
items = soup.select(sel)
if items:
return [i.get_text(strip=True) for i in items]
return []
ingredients = get_list([
"[class*='ingredient'] li",
"[itemprop='recipeIngredient']",
".ingredients-item",
".ingredient",
])
instructions = get_list([
"[class*='instruction'] li",
"[class*='direction'] li",
"[itemprop='recipeInstructions'] li",
".step",
])
rating_text = get_text([
"[class*='rating-value']",
"[itemprop='ratingValue']",
"[class*='aggregate-rating']",
])
return {
"name": get_text(["h1", "[itemprop='name']"]),
"ingredients": ingredients,
"instructions": instructions,
"rating": rating_text,
"source": "html_fallback",
}
CSS attribute selectors like [class*='ingredient'] are more resilient than exact class names, which change with redesigns. This is messier than JSON-LD but workable as a fallback.
Paginating Through Category Pages
Category pages (e.g., /recipes/breakfast/, /recipes/world-cuisine/italian/) paginate results. AllRecipes typically uses a ?page=N parameter:
import time
import random
def scrape_category_urls(
base_url: str,
client: httpx.Client,
max_pages: int = 20
) -> list[str]:
"""Collect all recipe URLs from a category page."""
urls = []
for page in range(1, max_pages + 1):
page_url = f"{base_url}?page={page}" if page > 1 else base_url
try:
resp = client.get(page_url)
if resp.status_code == 404:
break
resp.raise_for_status()
except httpx.HTTPStatusError:
break
soup = BeautifulSoup(resp.text, "html.parser")
# Find recipe card links
recipe_links = soup.select("a[href*='/recipe/']")
page_urls = list(set(
a["href"] for a in recipe_links
if a.get("href") and "/recipe/" in a["href"]
))
if not page_urls:
print(f"Page {page}: no recipes found, stopping")
break
# Make absolute URLs
page_urls = [
u if u.startswith("http") else f"https://www.allrecipes.com{u}"
for u in page_urls
]
urls.extend(page_urls)
print(f"Page {page}: found {len(page_urls)} recipe links")
time.sleep(random.uniform(1.5, 3.5))
return list(set(urls)) # deduplicate
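The manual prefixing above can also be done with urllib.parse.urljoin from the standard library, which handles site-relative, absolute, and protocol-relative hrefs uniformly (a sketch; the recipe IDs here are fabricated for illustration):

```python
from urllib.parse import urljoin

base = "https://www.allrecipes.com/recipes/80/main-dish/"
hrefs = [
    "/recipe/25202/classic-beef-stew/",                          # site-relative
    "https://www.allrecipes.com/recipe/228823/quick-stir-fry/",  # already absolute
    "//www.allrecipes.com/recipe/12345/some-recipe/",            # protocol-relative
]
for h in hrefs:
    # urljoin resolves each href against the page it was found on
    print(urljoin(base, h))
```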
Handling Site Search
AllRecipes has a search endpoint you can hit directly:
def search_recipes(
query: str,
client: httpx.Client,
max_results: int = 100
) -> list[str]:
"""Search AllRecipes and return recipe URLs."""
urls = []
page = 1
while len(urls) < max_results:
        # quote_plus correctly encodes spaces and special characters like '&'
        from urllib.parse import quote_plus
        search_url = f"https://www.allrecipes.com/search?q={quote_plus(query)}&page={page}"
resp = client.get(search_url)
if resp.status_code != 200:
break
soup = BeautifulSoup(resp.text, "html.parser")
links = [
a["href"] for a in soup.select("a[href*='/recipe/']")
if a.get("href") and "/recipe/" in a["href"]
]
links = list(set(
l if l.startswith("http") else f"https://www.allrecipes.com{l}"
for l in links
))
if not links:
break
urls.extend(links)
page += 1
time.sleep(random.uniform(2, 4))
return list(set(urls))[:max_results]
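One detail worth handling explicitly: multi-word or punctuated queries need real URL encoding. urllib.parse.quote_plus (stdlib) handles characters like '&' that a bare space-to-plus replacement would pass through unescaped:

```python
from urllib.parse import quote_plus

query = "mac & cheese"
encoded = quote_plus(query)
print(encoded)  # mac+%26+cheese
print(f"https://www.allrecipes.com/search?q={encoded}&page=1")
```

Without encoding, the raw '&' would terminate the q parameter and silently truncate the search to "mac".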
Anti-Bot Measures and How to Handle Them
AllRecipes and Food Network run behind Cloudflare. At small volumes — a few hundred requests per day with proper delays — you will usually get through with realistic headers. At scale, expect these layers:
Cloudflare Bot Management — The most common blocker. Returns 403 or a JS challenge page (check whether resp.text contains cf-browser-verification or challenge-platform). Cloudflare fingerprints TLS configuration, HTTP/2 settings, and the JavaScript execution environment. The requests library doesn't speak HTTP/2 at all, which makes it easy to flag; use httpx with http2=True:
client = httpx.Client(
headers=HEADERS,
http2=True,
follow_redirects=True,
timeout=20
)
Rate limiting — AllRecipes throttles with 429 responses after aggressive crawling. Keep delays above 1.5-2 seconds and use jitter: time.sleep(random.uniform(1.5, 4.0)).
Header validation — Missing or inconsistent headers trigger soft blocks. Always send a complete, consistent header set. Pay special attention to Sec-Fetch-* headers which signal navigation context.
IP reputation — Your home or VPS IP will get flagged after a few thousand requests. For production-scale scraping, residential proxies are the practical solution. Datacenter IPs get blocked quickly on both AllRecipes and Food Network. ThorData provides residential proxy access with per-GB billing and supports sticky sessions, useful when a recipe page requires multiple requests (following redirects from mobile to desktop URLs):
# ThorData proxy setup
PROXY_URL = "http://USER:[email protected]:9000"
client = httpx.Client(
headers=HEADERS,
proxy=PROXY_URL,
http2=True,
follow_redirects=True,
timeout=25,
)
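For the rate-limiting layer specifically, the delay logic can be factored into a small jittered-backoff helper (a sketch; the constants are illustrative, not tuned to any site's actual limits):

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 120.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# attempt 0 -> up to 2s, attempt 3 -> up to 16s, large attempts capped at 120s
for attempt in range(4):
    ceiling = min(120.0, 2.0 * 2 ** attempt)
    print(f"attempt {attempt}: sleep up to {ceiling:.0f}s "
          f"(sampled {backoff_delay(attempt):.2f}s)")
```

Full jitter (random over the whole interval rather than a fixed delay plus small noise) spreads retries out, which matters when many workers hit a 429 at the same moment.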
Robust Scraping with Error Handling
import time
import random
from dataclasses import dataclass
@dataclass
class ScrapeResult:
url: str
recipe: dict | None
error: str | None
def scrape_recipes_robust(
urls: list[str],
proxy_url: str | None = None,
max_retries: int = 3
) -> list[ScrapeResult]:
"""Scrape a list of recipe URLs with retry logic and error handling."""
results = []
client_kwargs = {
"headers": HEADERS,
"follow_redirects": True,
"timeout": 25,
"http2": True,
}
if proxy_url:
client_kwargs["proxy"] = proxy_url
with httpx.Client(**client_kwargs) as client:
for i, url in enumerate(urls):
result = None
for attempt in range(max_retries):
try:
schema = get_recipe_schema(url, client)
if schema:
recipe = parse_recipe(schema)
recipe["url"] = url
result = ScrapeResult(url=url, recipe=recipe, error=None)
else:
# Try HTML fallback
recipe = parse_recipe_html_fallback(url, client)
recipe["url"] = url
result = ScrapeResult(url=url, recipe=recipe, error=None)
break
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
wait = (2 ** attempt) * 10 + random.uniform(0, 5)
print(f"Rate limited ({url}). Waiting {wait:.0f}s...")
time.sleep(wait)
elif e.response.status_code == 403:
print(f"Blocked on {url} — consider rotating IP")
result = ScrapeResult(url=url, recipe=None, error="403 blocked")
break
else:
result = ScrapeResult(url=url, recipe=None,
error=f"HTTP {e.response.status_code}")
break
                except Exception as e:
                    if attempt == max_retries - 1:
                        result = ScrapeResult(url=url, recipe=None, error=str(e))
                    else:
                        time.sleep((2 ** attempt) * 2 + random.uniform(0, 1))  # back off before retrying
if result is None:
result = ScrapeResult(url=url, recipe=None, error="max retries exceeded")
results.append(result)
if (i + 1) % 10 == 0:
print(f"Progress: {i + 1}/{len(urls)} ({len([r for r in results if r.recipe])} successful)")
time.sleep(random.uniform(1.5, 3.5))
return results
Storing Data in SQLite
import sqlite3
import json
from datetime import datetime, timezone
def init_recipe_db(path: str = "recipes.db") -> sqlite3.Connection:
conn = sqlite3.connect(path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS recipes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE NOT NULL,
name TEXT,
category TEXT,
cuisine TEXT,
total_time_minutes INTEGER,
prep_time_minutes INTEGER,
cook_time_minutes INTEGER,
servings TEXT,
rating REAL,
review_count INTEGER,
calories TEXT,
fat TEXT,
protein TEXT,
carbs TEXT,
sodium TEXT,
ingredients_json TEXT,
instructions_json TEXT,
keywords TEXT,
author TEXT,
image_url TEXT,
scraped_at TEXT,
raw_json TEXT
);
CREATE INDEX IF NOT EXISTS idx_category ON recipes(category);
CREATE INDEX IF NOT EXISTS idx_cuisine ON recipes(cuisine);
CREATE INDEX IF NOT EXISTS idx_rating ON recipes(rating);
""")
conn.commit()
return conn
def save_recipe(conn: sqlite3.Connection, recipe: dict, raw_schema: dict) -> None:
conn.execute("""
INSERT OR REPLACE INTO recipes (
url, name, category, cuisine, total_time_minutes, prep_time_minutes,
cook_time_minutes, servings, rating, review_count, calories, fat,
protein, carbs, sodium, ingredients_json, instructions_json,
keywords, author, image_url, scraped_at, raw_json
) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
recipe.get("url"),
recipe.get("name"),
recipe.get("category"),
recipe.get("cuisine"),
recipe.get("total_time_minutes"),
recipe.get("prep_time_minutes"),
recipe.get("cook_time_minutes"),
recipe.get("servings"),
recipe.get("rating"),
recipe.get("review_count"),
recipe.get("calories"),
recipe.get("fat"),
recipe.get("protein"),
recipe.get("carbs"),
recipe.get("sodium"),
json.dumps(recipe.get("ingredients", [])),
json.dumps(recipe.get("instructions", [])),
recipe.get("keywords"),
recipe.get("author"),
recipe.get("image"),
datetime.now(timezone.utc).isoformat(),
json.dumps(raw_schema),
))
conn.commit()
def query_recipes_by_ingredient(conn: sqlite3.Connection, ingredient: str) -> list[dict]:
"""Find recipes containing a specific ingredient (basic text search)."""
rows = conn.execute(
"SELECT name, url, rating, total_time_minutes FROM recipes "
"WHERE ingredients_json LIKE ? ORDER BY rating DESC LIMIT 50",
(f"%{ingredient}%",)
).fetchall()
return [
{"name": r[0], "url": r[1], "rating": r[2], "time_min": r[3]}
for r in rows
]
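A LIKE over the raw JSON string works, but it also matches JSON syntax and can't distinguish one ingredient from another. SQLite's JSON1 extension (built into modern SQLite) provides json_each, which expands the array so the pattern is tested per ingredient — a prerequisite for queries like "at least N matching ingredients". A sketch on an in-memory database with fabricated rows:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recipes (name TEXT, ingredients_json TEXT)")
conn.executemany(
    "INSERT INTO recipes VALUES (?, ?)",
    [("Beef Stew", json.dumps(["2 lbs beef chuck", "4 carrots"])),
     ("Carrot Cake", json.dumps(["3 cups grated carrots", "2 cups flour"]))],
)

# json_each turns each JSON array element into a row, so LIKE runs per ingredient
rows = conn.execute(
    "SELECT DISTINCT r.name FROM recipes r, json_each(r.ingredients_json) j "
    "WHERE j.value LIKE ?",
    ("%carrot%",),
).fetchall()
print(sorted(name for (name,) in rows))  # ['Beef Stew', 'Carrot Cake']
```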
Full Pipeline Script
Here's a complete script that ties everything together:
def main():
proxy_url = "http://USER:[email protected]:9000"
conn = init_recipe_db("recipes.db")
    # Categories to scrape
categories = [
"https://www.allrecipes.com/recipes/80/main-dish/",
"https://www.allrecipes.com/recipes/76/appetizers-and-snacks/",
"https://www.allrecipes.com/recipes/156/bread/",
]
all_urls = []
with httpx.Client(headers=HEADERS, proxy=proxy_url, http2=True,
follow_redirects=True, timeout=25) as client:
for cat_url in categories:
print(f"\nCollecting URLs from: {cat_url}")
urls = scrape_category_urls(cat_url, client, max_pages=10)
all_urls.extend(urls)
print(f"Found {len(urls)} recipes")
time.sleep(random.uniform(3, 6))
print(f"\nTotal URLs collected: {len(all_urls)}")
print("Starting recipe extraction...")
results = scrape_recipes_robust(all_urls, proxy_url=proxy_url)
success_count = 0
for r in results:
if r.recipe:
schema = {} # We'd need to pass the raw schema through too
save_recipe(conn, r.recipe, schema)
success_count += 1
else:
print(f"Failed: {r.url} — {r.error}")
print(f"\nComplete. {success_count}/{len(results)} recipes saved to recipes.db")
if __name__ == "__main__":
main()
Working with Nutritional Data
Nutrition strings from AllRecipes come in formats like "420 calories" or "18g". Clean them up for analysis:
import re
def parse_nutrition_value(value: str | None) -> float | None:
"""Extract numeric value from nutrition strings like '420 calories' or '18g'."""
if not value:
return None
match = re.search(r'([\d.]+)', str(value))
return float(match.group(1)) if match else None
def compute_macros(recipe: dict) -> dict:
"""Return clean numeric macros from a parsed recipe."""
return {
"calories": parse_nutrition_value(recipe.get("calories")),
"fat_g": parse_nutrition_value(recipe.get("fat")),
"protein_g": parse_nutrition_value(recipe.get("protein")),
"carbs_g": parse_nutrition_value(recipe.get("carbs")),
"sodium_mg": parse_nutrition_value(recipe.get("sodium")),
}
Common Gotchas
@graph arrays: Some sites wrap multiple schema objects in a single JSON-LD block with an @graph key. Always check for this pattern before assuming the top-level object is the Recipe.
Embedded JSON in window.__INITIAL_STATE__: Some sites (particularly Food Network) don't use JSON-LD but instead embed recipe data in a JavaScript variable. Look for window.__INITIAL_STATE__ or window.__SERVER_DATA__ in script tags and parse those as a fallback.
ISO duration edge cases: PT30M (30 minutes), PT1H (1 hour), P1DT2H (1 day 2 hours) are all valid. The isodate library handles them all correctly; a simple regex won't.
Rating type inconsistency: Some pages have ratingValue as a string ("4.6") and others as a float (4.6). Always cast to float before storing numerically.
Ingredient formatting: AllRecipes ingredients are free-text strings. Normalizing "1½ cups, sifted all-purpose flour" into structured quantity/unit/ingredient requires an ingredient parsing library like ingredient-parser or spacy for NLP.
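Before reaching for a full ingredient parser, normalizing unicode vulgar fractions to ASCII is a cheap preprocessing step. A minimal sketch covering the common glyphs (a real parser library handles far more):

```python
FRACTIONS = {"½": "1/2", "⅓": "1/3", "⅔": "2/3", "¼": "1/4", "¾": "3/4", "⅛": "1/8"}

def normalize_fractions(text: str) -> str:
    # "1½" becomes "1 1/2"; a leading fraction like "½ cup" becomes "1/2 cup"
    for glyph, ascii_frac in FRACTIONS.items():
        text = text.replace(glyph, " " + ascii_frac).replace("  ", " ")
    return text.strip()

print(normalize_fractions("1½ cups, sifted"))  # 1 1/2 cups, sifted
print(normalize_fractions("½ cup sugar"))      # 1/2 cup sugar
```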
What You Get
A well-structured recipe dataset includes normalized ingredient lists, ISO durations converted to minutes, per-serving nutrition facts, aggregate ratings with review counts, and category/cuisine tags, all without writing a single HTML parser for the data itself. JSON-LD does the normalization work for you because Google required it.
The main practical limit is rate and IP-level access. At research scale, plain httpx with good headers works fine. At production scale, budget for residential proxy bandwidth via a service like ThorData and implement retry logic with exponential backoff on 429 and 503 responses.