How to Scrape Recipe Data from AllRecipes in 2026 (JSON-LD Extraction)
Recipe sites have quietly become some of the best-structured data sources on the web. AllRecipes, Food Network, and most major cooking sites embed their recipe data as JSON-LD inside <script> tags — a byproduct of chasing Google's rich results. This means you can often skip fighting HTML entirely and pull clean, structured data with a handful of lines of Python.
This guide covers the full pipeline: extracting JSON-LD, falling back to HTML when needed, paginating through category pages, handling anti-bot layers, storing results, and running at production scale.
Why JSON-LD Makes Recipe Scraping Easy
Google's recipe rich results require structured data in the schema.org/Recipe format. Sites that want their recipes to appear with star ratings, cooking times, and calorie counts in search results have to publish this data. The result is that most major recipe sites now embed a machine-readable version of every recipe directly in the page.
A typical JSON-LD block on AllRecipes looks like this (inside a <script type="application/ld+json"> tag):
{
"@context": "https://schema.org",
"@type": "Recipe",
"name": "Classic Beef Stew",
"recipeIngredient": ["2 lbs beef chuck", "4 carrots", "3 potatoes"],
"recipeInstructions": [
{"@type": "HowToStep", "text": "Season beef and brown in batches..."},
{"@type": "HowToStep", "text": "Add vegetables and simmer..."}
],
"aggregateRating": {
"@type": "AggregateRating",
"ratingValue": "4.6",
"reviewCount": "1842"
},
"nutrition": {
"@type": "NutritionInformation",
"calories": "420 calories",
"fatContent": "18g",
"proteinContent": "35g",
"carbohydrateContent": "32g",
"sodiumContent": "820mg"
},
"totalTime": "PT2H30M",
"prepTime": "PT30M",
"cookTime": "PT2H",
"recipeYield": "6 servings",
"recipeCategory": "Main Dish",
"recipeCuisine": "American",
"keywords": "beef stew, comfort food, winter recipe"
}
Every field you care about — ingredients, ratings, nutrition, cooking time, yield — is already parsed and labeled. Compare this to scraping the same data from HTML: you'd be chasing inconsistent class names, parsing free-text strings like "1½ cups, sifted", and writing fragile regex for every site variant.
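You can verify this yourself with nothing but the standard library. Here the sample block above (trimmed to a few fields) is parsed and accessed directly:

```python
import json

# A trimmed copy of the sample JSON-LD block shown above
raw = """{
  "@type": "Recipe",
  "name": "Classic Beef Stew",
  "recipeIngredient": ["2 lbs beef chuck", "4 carrots", "3 potatoes"],
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.6", "reviewCount": "1842"}
}"""

recipe = json.loads(raw)
print(recipe["name"])                                   # Classic Beef Stew
print(len(recipe["recipeIngredient"]))                  # 3
print(float(recipe["aggregateRating"]["ratingValue"]))  # 4.6
```

Note that `ratingValue` arrives as a string; casting to `float` is your job (more on this in the gotchas section).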
Environment Setup
pip install "httpx[http2]" beautifulsoup4 isodate
We'll use httpx for its HTTP/2 support (requests is HTTP/1.1-only, which stands out to bot-detection layers), BeautifulSoup for HTML parsing fallbacks, and isodate for converting ISO 8601 duration strings to minutes. sqlite3 ships with the Python standard library, so it needs no install; the http2 extra pulls in the h2 dependency that httpx requires for HTTP/2.
Core Extraction: JSON-LD with Python
The extraction is straightforward. Use httpx for requests and BeautifulSoup to find the script tags:
import httpx
import json
import isodate
from bs4 import BeautifulSoup
from datetime import timedelta
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
}
def get_recipe_schema(url: str, client: httpx.Client | None = None) -> dict | None:
"""Extract the schema.org/Recipe JSON-LD object from a recipe page."""
close_client = False
if client is None:
client = httpx.Client(headers=HEADERS, follow_redirects=True, timeout=20)
close_client = True
try:
resp = client.get(url)
resp.raise_for_status()
finally:
if close_client:
client.close()
soup = BeautifulSoup(resp.text, "html.parser")
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
except (json.JSONDecodeError, TypeError):
continue
        # Handle direct objects, top-level arrays, and @graph wrappers
        if isinstance(data, dict) and "@graph" in data:
            items = data["@graph"]
        elif isinstance(data, list):
            items = data
        else:
            items = [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Recipe" in types:  # @type may be a list like ["Recipe", "NewsArticle"]
                return item
return None
def iso_duration_to_minutes(duration_str: str | None) -> int | None:
"""Convert ISO 8601 duration (PT1H30M) to minutes."""
if not duration_str:
return None
try:
duration = isodate.parse_duration(duration_str)
if isinstance(duration, timedelta):
return int(duration.total_seconds() / 60)
except Exception:
pass
return None
def parse_recipe(schema: dict) -> dict:
"""Normalize a schema.org/Recipe dict into a clean flat structure."""
    rating = schema.get("aggregateRating") or {}
    nutrition = schema.get("nutrition") or {}
# Instructions may be strings or HowToStep objects
instructions_raw = schema.get("recipeInstructions", [])
if isinstance(instructions_raw, str):
instructions = [instructions_raw]
else:
instructions = []
for step in instructions_raw:
if isinstance(step, str):
instructions.append(step)
elif isinstance(step, dict):
instructions.append(step.get("text", ""))
# Image may be a string, list, or ImageObject dict
image_raw = schema.get("image")
if isinstance(image_raw, list):
image = image_raw[0] if image_raw else None
elif isinstance(image_raw, dict):
image = image_raw.get("url")
else:
image = image_raw
return {
"name": schema.get("name"),
"description": schema.get("description"),
"ingredients": schema.get("recipeIngredient", []),
"instructions": instructions,
"servings": schema.get("recipeYield"),
"total_time_minutes": iso_duration_to_minutes(schema.get("totalTime")),
"prep_time_minutes": iso_duration_to_minutes(schema.get("prepTime")),
"cook_time_minutes": iso_duration_to_minutes(schema.get("cookTime")),
"category": schema.get("recipeCategory"),
"cuisine": schema.get("recipeCuisine"),
"keywords": schema.get("keywords"),
"rating": rating.get("ratingValue"),
"review_count": rating.get("reviewCount"),
"calories": nutrition.get("calories"),
"fat": nutrition.get("fatContent"),
"protein": nutrition.get("proteinContent"),
"carbs": nutrition.get("carbohydrateContent"),
"sodium": nutrition.get("sodiumContent"),
"fiber": nutrition.get("fiberContent"),
"sugar": nutrition.get("sugarContent"),
"image": image,
"author": schema.get("author", {}).get("name") if isinstance(schema.get("author"), dict) else schema.get("author"),
}
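The @graph and @type handling deserves a standalone sanity check. This stdlib-only sketch mirrors the lookup logic against two fabricated minimal payloads (find_recipe is an illustrative helper, not part of the pipeline above):

```python
def find_recipe(data):
    # Accept a bare object, a top-level list, or an @graph wrapper
    if isinstance(data, dict) and "@graph" in data:
        items = data["@graph"]
    elif isinstance(data, list):
        items = data
    else:
        items = [data]
    for item in items:
        if not isinstance(item, dict):
            continue
        t = item.get("@type")
        if "Recipe" in (t if isinstance(t, list) else [t]):
            return item
    return None

direct = {"@type": "Recipe", "name": "Stew"}
graph = {"@context": "https://schema.org",
         "@graph": [{"@type": "WebPage"},
                    {"@type": ["Recipe", "Article"], "name": "Stew"}]}
print(find_recipe(direct)["name"], find_recipe(graph)["name"])  # Stew Stew
```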
Fallback HTML Parsing
Not every recipe page has clean JSON-LD. Older pages, user-submitted content, and some Food Network categories ship HTML-only markup. When get_recipe_schema() returns None, fall back to an HTML parser:
def parse_recipe_html_fallback(url: str, client: httpx.Client) -> dict:
"""HTML fallback for pages without JSON-LD."""
resp = client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
# Try multiple selector strategies, most specific first
def get_text(selectors: list[str]) -> str | None:
for sel in selectors:
el = soup.select_one(sel)
if el:
return el.get_text(strip=True)
return None
def get_list(selectors: list[str]) -> list[str]:
for sel in selectors:
items = soup.select(sel)
if items:
return [i.get_text(strip=True) for i in items]
return []
ingredients = get_list([
"[class*='ingredient'] li",
"[itemprop='recipeIngredient']",
".ingredients-item",
".ingredient",
])
instructions = get_list([
"[class*='instruction'] li",
"[class*='direction'] li",
"[itemprop='recipeInstructions'] li",
".step",
])
rating_text = get_text([
"[class*='rating-value']",
"[itemprop='ratingValue']",
"[class*='aggregate-rating']",
])
return {
"name": get_text(["h1", "[itemprop='name']"]),
"ingredients": ingredients,
"instructions": instructions,
"rating": rating_text,
"source": "html_fallback",
}
CSS attribute selectors like [class*='ingredient'] are more resilient than exact class names, which change with redesigns. This is messier than JSON-LD but workable as a fallback.
Paginating Through Category Pages
Category pages (e.g., /recipes/breakfast/, /recipes/world-cuisine/italian/) paginate results. AllRecipes typically uses a ?page=N parameter:
import time
import random
def scrape_category_urls(
base_url: str,
client: httpx.Client,
max_pages: int = 20
) -> list[str]:
"""Collect all recipe URLs from a category page."""
urls = []
for page in range(1, max_pages + 1):
page_url = f"{base_url}?page={page}" if page > 1 else base_url
try:
resp = client.get(page_url)
if resp.status_code == 404:
break
resp.raise_for_status()
except httpx.HTTPStatusError:
break
soup = BeautifulSoup(resp.text, "html.parser")
# Find recipe card links
recipe_links = soup.select("a[href*='/recipe/']")
page_urls = list(set(
a["href"] for a in recipe_links
if a.get("href") and "/recipe/" in a["href"]
))
if not page_urls:
print(f"Page {page}: no recipes found, stopping")
break
# Make absolute URLs
page_urls = [
u if u.startswith("http") else f"https://www.allrecipes.com{u}"
for u in page_urls
]
urls.extend(page_urls)
print(f"Page {page}: found {len(page_urls)} recipe links")
time.sleep(random.uniform(1.5, 3.5))
return list(set(urls)) # deduplicate
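The manual prefixing above can also be done with urllib.parse.urljoin from the standard library, which handles site-relative, absolute, and protocol-relative hrefs uniformly (a sketch; the recipe IDs here are fabricated for illustration):

```python
from urllib.parse import urljoin

base = "https://www.allrecipes.com/recipes/80/main-dish/"
hrefs = [
    "/recipe/25202/classic-beef-stew/",                          # site-relative
    "https://www.allrecipes.com/recipe/228823/quick-stir-fry/",  # already absolute
    "//www.allrecipes.com/recipe/12345/some-recipe/",            # protocol-relative
]
for h in hrefs:
    # urljoin resolves each href against the page it was found on
    print(urljoin(base, h))
```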
Handling Site Search
AllRecipes has a search endpoint you can hit directly:
def search_recipes(
query: str,
client: httpx.Client,
max_results: int = 100
) -> list[str]:
"""Search AllRecipes and return recipe URLs."""
urls = []
page = 1
while len(urls) < max_results:
        # quote_plus correctly encodes spaces and special characters like '&'
        from urllib.parse import quote_plus
        search_url = f"https://www.allrecipes.com/search?q={quote_plus(query)}&page={page}"
resp = client.get(search_url)
if resp.status_code != 200:
break
soup = BeautifulSoup(resp.text, "html.parser")
links = [
a["href"] for a in soup.select("a[href*='/recipe/']")
if a.get("href") and "/recipe/" in a["href"]
]
links = list(set(
l if l.startswith("http") else f"https://www.allrecipes.com{l}"
for l in links
))
if not links:
break
urls.extend(links)
page += 1
time.sleep(random.uniform(2, 4))
return list(set(urls))[:max_results]
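One detail worth handling explicitly: multi-word or punctuated queries need real URL encoding. urllib.parse.quote_plus (stdlib) handles characters like '&' that a bare space-to-plus replacement would pass through unescaped:

```python
from urllib.parse import quote_plus

query = "mac & cheese"
encoded = quote_plus(query)
print(encoded)  # mac+%26+cheese
print(f"https://www.allrecipes.com/search?q={encoded}&page=1")
```

Without encoding, the raw '&' would terminate the q parameter and silently truncate the search to "mac".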
Anti-Bot Measures and How to Handle Them
AllRecipes and Food Network run behind Cloudflare. At small volumes — a few hundred requests per day with proper delays — you will usually get through with realistic headers. At scale, expect these layers:
Cloudflare Bot Management — The most common blocker. Returns 403 or a JS challenge page (check whether resp.text contains cf-browser-verification or challenge-platform). Cloudflare fingerprints TLS configuration, HTTP/2 settings, and the JavaScript execution environment. The requests library doesn't speak HTTP/2 at all, which makes it easy to flag; use httpx with http2=True:
client = httpx.Client(
headers=HEADERS,
http2=True,
follow_redirects=True,
timeout=20
)
Rate limiting — AllRecipes throttles with 429 responses after aggressive crawling. Keep delays above 1.5-2 seconds and use jitter: time.sleep(random.uniform(1.5, 4.0)).
Header validation — Missing or inconsistent headers trigger soft blocks. Always send a complete, consistent header set. Pay special attention to Sec-Fetch-* headers which signal navigation context.
IP reputation — Your home or VPS IP will get flagged after a few thousand requests. For production-scale scraping, residential proxies are the practical solution. Datacenter IPs get blocked quickly on both AllRecipes and Food Network. ThorData provides residential proxy access with per-GB billing and supports sticky sessions, useful when a recipe page requires multiple requests (following redirects from mobile to desktop URLs):
# ThorData proxy setup
PROXY_URL = "http://USER:[email protected]:9000"
client = httpx.Client(
headers=HEADERS,
proxy=PROXY_URL,
http2=True,
follow_redirects=True,
timeout=25,
)
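For the rate-limiting layer specifically, the delay logic can be factored into a small jittered-backoff helper (a sketch; the constants are illustrative, not tuned to any site's actual limits):

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 120.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# attempt 0 -> up to 2s, attempt 3 -> up to 16s, large attempts capped at 120s
for attempt in range(4):
    ceiling = min(120.0, 2.0 * 2 ** attempt)
    print(f"attempt {attempt}: sleep up to {ceiling:.0f}s "
          f"(sampled {backoff_delay(attempt):.2f}s)")
```

Full jitter (random over the whole interval rather than a fixed delay plus small noise) spreads retries out, which matters when many workers hit a 429 at the same moment.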
Robust Scraping with Error Handling
import time
import random
from dataclasses import dataclass
@dataclass
class ScrapeResult:
url: str
recipe: dict | None
error: str | None
def scrape_recipes_robust(
urls: list[str],
proxy_url: str | None = None,
max_retries: int = 3
) -> list[ScrapeResult]:
"""Scrape a list of recipe URLs with retry logic and error handling."""
results = []
client_kwargs = {
"headers": HEADERS,
"follow_redirects": True,
"timeout": 25,
"http2": True,
}
if proxy_url:
client_kwargs["proxy"] = proxy_url
with httpx.Client(**client_kwargs) as client:
for i, url in enumerate(urls):
result = None
for attempt in range(max_retries):
try:
schema = get_recipe_schema(url, client)
if schema:
recipe = parse_recipe(schema)
recipe["url"] = url
result = ScrapeResult(url=url, recipe=recipe, error=None)
else:
# Try HTML fallback
recipe = parse_recipe_html_fallback(url, client)
recipe["url"] = url
result = ScrapeResult(url=url, recipe=recipe, error=None)
break
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
wait = (2 ** attempt) * 10 + random.uniform(0, 5)
print(f"Rate limited ({url}). Waiting {wait:.0f}s...")
time.sleep(wait)
elif e.response.status_code == 403:
print(f"Blocked on {url} — consider rotating IP")
result = ScrapeResult(url=url, recipe=None, error="403 blocked")
break
else:
result = ScrapeResult(url=url, recipe=None,
error=f"HTTP {e.response.status_code}")
break
                except Exception as e:
                    if attempt == max_retries - 1:
                        result = ScrapeResult(url=url, recipe=None, error=str(e))
                    else:
                        time.sleep((2 ** attempt) * 2 + random.uniform(0, 1))  # back off before retrying
if result is None:
result = ScrapeResult(url=url, recipe=None, error="max retries exceeded")
results.append(result)
if (i + 1) % 10 == 0:
print(f"Progress: {i + 1}/{len(urls)} ({len([r for r in results if r.recipe])} successful)")
time.sleep(random.uniform(1.5, 3.5))
return results
Storing Data in SQLite
import sqlite3
import json
from datetime import datetime, timezone
def init_recipe_db(path: str = "recipes.db") -> sqlite3.Connection:
conn = sqlite3.connect(path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS recipes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE NOT NULL,
name TEXT,
category TEXT,
cuisine TEXT,
total_time_minutes INTEGER,
prep_time_minutes INTEGER,
cook_time_minutes INTEGER,
servings TEXT,
rating REAL,
review_count INTEGER,
calories TEXT,
fat TEXT,
protein TEXT,
carbs TEXT,
sodium TEXT,
ingredients_json TEXT,
instructions_json TEXT,
keywords TEXT,
author TEXT,
image_url TEXT,
scraped_at TEXT,
raw_json TEXT
);
CREATE INDEX IF NOT EXISTS idx_category ON recipes(category);
CREATE INDEX IF NOT EXISTS idx_cuisine ON recipes(cuisine);
CREATE INDEX IF NOT EXISTS idx_rating ON recipes(rating);
""")
conn.commit()
return conn
def save_recipe(conn: sqlite3.Connection, recipe: dict, raw_schema: dict) -> None:
conn.execute("""
INSERT OR REPLACE INTO recipes (
url, name, category, cuisine, total_time_minutes, prep_time_minutes,
cook_time_minutes, servings, rating, review_count, calories, fat,
protein, carbs, sodium, ingredients_json, instructions_json,
keywords, author, image_url, scraped_at, raw_json
) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
recipe.get("url"),
recipe.get("name"),
recipe.get("category"),
recipe.get("cuisine"),
recipe.get("total_time_minutes"),
recipe.get("prep_time_minutes"),
recipe.get("cook_time_minutes"),
recipe.get("servings"),
recipe.get("rating"),
recipe.get("review_count"),
recipe.get("calories"),
recipe.get("fat"),
recipe.get("protein"),
recipe.get("carbs"),
recipe.get("sodium"),
json.dumps(recipe.get("ingredients", [])),
json.dumps(recipe.get("instructions", [])),
recipe.get("keywords"),
recipe.get("author"),
recipe.get("image"),
datetime.now(timezone.utc).isoformat(),
json.dumps(raw_schema),
))
conn.commit()
def query_recipes_by_ingredient(conn: sqlite3.Connection, ingredient: str) -> list[dict]:
"""Find recipes containing a specific ingredient (basic text search)."""
rows = conn.execute(
"SELECT name, url, rating, total_time_minutes FROM recipes "
"WHERE ingredients_json LIKE ? ORDER BY rating DESC LIMIT 50",
(f"%{ingredient}%",)
).fetchall()
return [
{"name": r[0], "url": r[1], "rating": r[2], "time_min": r[3]}
for r in rows
]
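A LIKE over the raw JSON string works, but it also matches JSON syntax and can't distinguish one ingredient from another. SQLite's JSON1 extension (built into modern SQLite) provides json_each, which expands the array so the pattern is tested per ingredient — a prerequisite for queries like "at least N matching ingredients". A sketch on an in-memory database with fabricated rows:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recipes (name TEXT, ingredients_json TEXT)")
conn.executemany(
    "INSERT INTO recipes VALUES (?, ?)",
    [("Beef Stew", json.dumps(["2 lbs beef chuck", "4 carrots"])),
     ("Carrot Cake", json.dumps(["3 cups grated carrots", "2 cups flour"]))],
)

# json_each turns each JSON array element into a row, so LIKE runs per ingredient
rows = conn.execute(
    "SELECT DISTINCT r.name FROM recipes r, json_each(r.ingredients_json) j "
    "WHERE j.value LIKE ?",
    ("%carrot%",),
).fetchall()
print(sorted(name for (name,) in rows))  # ['Beef Stew', 'Carrot Cake']
```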
Full Pipeline Script
Here's a complete script that ties everything together:
def main():
proxy_url = "http://USER:[email protected]:9000"
conn = init_recipe_db("recipes.db")
    # Categories to scrape
categories = [
"https://www.allrecipes.com/recipes/80/main-dish/",
"https://www.allrecipes.com/recipes/76/appetizers-and-snacks/",
"https://www.allrecipes.com/recipes/156/bread/",
]
all_urls = []
with httpx.Client(headers=HEADERS, proxy=proxy_url, http2=True,
follow_redirects=True, timeout=25) as client:
for cat_url in categories:
print(f"\nCollecting URLs from: {cat_url}")
urls = scrape_category_urls(cat_url, client, max_pages=10)
all_urls.extend(urls)
print(f"Found {len(urls)} recipes")
time.sleep(random.uniform(3, 6))
print(f"\nTotal URLs collected: {len(all_urls)}")
print("Starting recipe extraction...")
results = scrape_recipes_robust(all_urls, proxy_url=proxy_url)
success_count = 0
for r in results:
if r.recipe:
schema = {} # We'd need to pass the raw schema through too
save_recipe(conn, r.recipe, schema)
success_count += 1
else:
print(f"Failed: {r.url} — {r.error}")
print(f"\nComplete. {success_count}/{len(results)} recipes saved to recipes.db")
if __name__ == "__main__":
main()
Working with Nutritional Data
Nutrition strings from AllRecipes come in formats like "420 calories" or "18g". Clean them up for analysis:
import re
def parse_nutrition_value(value: str | None) -> float | None:
"""Extract numeric value from nutrition strings like '420 calories' or '18g'."""
if not value:
return None
match = re.search(r'([\d.]+)', str(value))
return float(match.group(1)) if match else None
def compute_macros(recipe: dict) -> dict:
"""Return clean numeric macros from a parsed recipe."""
return {
"calories": parse_nutrition_value(recipe.get("calories")),
"fat_g": parse_nutrition_value(recipe.get("fat")),
"protein_g": parse_nutrition_value(recipe.get("protein")),
"carbs_g": parse_nutrition_value(recipe.get("carbs")),
"sodium_mg": parse_nutrition_value(recipe.get("sodium")),
}
Common Gotchas
@graph arrays: Some sites wrap multiple schema objects in a single JSON-LD block with an @graph key. Always check for this pattern before assuming the top-level object is the Recipe.
Embedded JSON in window.__INITIAL_STATE__: Some sites (particularly Food Network) don't use JSON-LD but instead embed recipe data in a JavaScript variable. Look for window.__INITIAL_STATE__ or window.__SERVER_DATA__ in script tags and parse those as a fallback.
ISO duration edge cases: PT30M (30 minutes), PT1H (1 hour), P1DT2H (1 day 2 hours) are all valid. The isodate library handles them all correctly; a simple regex won't.
Rating type inconsistency: Some pages have ratingValue as a string ("4.6") and others as a float (4.6). Always cast to float before storing numerically.
Ingredient formatting: AllRecipes ingredients are free-text strings. Normalizing "1½ cups, sifted all-purpose flour" into structured quantity/unit/ingredient requires an ingredient parsing library like ingredient-parser or spacy for NLP.
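Before reaching for a full ingredient parser, normalizing unicode vulgar fractions to ASCII is a cheap preprocessing step. A minimal sketch covering the common glyphs (a real parser library handles far more):

```python
FRACTIONS = {"½": "1/2", "⅓": "1/3", "⅔": "2/3", "¼": "1/4", "¾": "3/4", "⅛": "1/8"}

def normalize_fractions(text: str) -> str:
    # "1½" becomes "1 1/2"; a leading fraction like "½ cup" becomes "1/2 cup"
    for glyph, ascii_frac in FRACTIONS.items():
        text = text.replace(glyph, " " + ascii_frac).replace("  ", " ")
    return text.strip()

print(normalize_fractions("1½ cups, sifted"))  # 1 1/2 cups, sifted
print(normalize_fractions("½ cup sugar"))      # 1/2 cup sugar
```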
What You Get
A well-structured recipe dataset includes normalized ingredient lists, ISO durations converted to minutes, per-serving nutrition facts, aggregate ratings with review counts, and category/cuisine tags, all without writing a single HTML parser for the data itself. JSON-LD does the normalization work for you because Google required it.
The main practical limit is rate and IP-level access. At research scale, plain httpx with good headers works fine. At production scale, budget for residential proxy bandwidth via a service like ThorData and implement retry logic with exponential backoff on 429 and 503 responses.