Scraping Open Food Facts for Nutrition Data with Python (2026 Complete Guide)

2026-04-09 ["open-food-facts" "nutrition-data" "api" "python" "food-scraping" "httpx" "proxy-rotation"]

Open Food Facts is one of the most valuable free datasets on the internet. Over 3.5 million food products from 180+ countries, all under an open data license, accessible via a well-documented API that requires no API key and imposes no aggressive rate limits. If you are building a nutrition tracking app, conducting food research, comparing products across categories, monitoring allergen information, or building training datasets for food-related machine learning — this is where you start.

The database is genuinely useful and the quality is much better than people expect. Products from major European and North American brands tend to have complete nutrition panels, verified barcodes, ingredient lists, allergen tags, Nutri-Scores, and NOVA food processing classifications. Niche or regional products may have gaps, but you can handle those gracefully.

This guide covers everything from basic single-product lookups to production-grade bulk collection, including how to cross-reference the Open Food Facts data with commercial grocery sites (which do block scrapers, requiring residential proxy rotation via ThorData), building robust retry logic, validating output shapes, and designing schemas that work well in downstream analytics pipelines. All code examples are complete and tested.

How the Open Food Facts API Works

The base URL is world.openfoodfacts.org. There is no authentication. They ask for a descriptive User-Agent header identifying your application and contact email — this is the extent of the "rules." They are a nonprofit run by volunteers and they genuinely want people to use this data.

Three endpoints cover 95% of use cases:

Single product by barcode (v2 API):

GET https://world.openfoodfacts.org/api/v2/product/{barcode}.json

Search and category browsing:

GET https://world.openfoodfacts.org/cgi/search.pl?...

Product listing by category/tag:

GET https://world.openfoodfacts.org/category/{category}.json?page={n}

Responses are JSON. The product object is rich and sometimes overwhelming — a full product response can be 50KB+, containing dozens of nutriment fields, multiple image URLs, contributor history, and data quality flags. The fields parameter on search requests lets you request only the fields you need, dramatically reducing response size.

The status field in product responses indicates data quality: 1 means the product was found and has data, 0 means not found. Always check this before parsing.

Setting Up the Client

Use httpx rather than requests. It handles HTTP/2 correctly, has a cleaner async API for bulk collection, and manages connection pooling efficiently.

import httpx
import json
import time
from typing import Optional, Iterator

# Single shared client for all requests — handles connection pooling
client = httpx.Client(
    headers={
        "User-Agent": "NutritionResearchBot/2.0 ([email protected]) "
                      "github.com/yourusername/yourproject",
        "Accept": "application/json",
        "Accept-Encoding": "gzip, deflate",
    },
    timeout=20.0,
    follow_redirects=True,
)

OFF_BASE = "https://world.openfoodfacts.org"

The User-Agent is not just a courtesy — it helps the Open Food Facts team understand what the API is being used for and contact you if there is an issue. Use something descriptive.

Fetching a Single Product by Barcode

from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class NutritionPer100g:
    energy_kj: Optional[float] = None
    energy_kcal: Optional[float] = None
    fat: Optional[float] = None
    saturated_fat: Optional[float] = None
    trans_fat: Optional[float] = None
    carbohydrates: Optional[float] = None
    sugars: Optional[float] = None
    fiber: Optional[float] = None
    proteins: Optional[float] = None
    salt: Optional[float] = None
    sodium: Optional[float] = None
    # Vitamins and minerals — present on some products
    vitamin_a_mg: Optional[float] = None
    vitamin_c_mg: Optional[float] = None
    calcium_mg: Optional[float] = None
    iron_mg: Optional[float] = None


@dataclass
class FoodProduct:
    barcode: str
    # Identity
    name: str = ""
    name_en: str = ""
    brands: str = ""
    quantity: str = ""
    categories: str = ""
    countries: str = ""
    # Ingredients
    ingredients_text: str = ""
    ingredients_text_en: str = ""
    additives: list[str] = field(default_factory=list)
    # Allergens
    allergens: str = ""
    allergens_tags: list[str] = field(default_factory=list)
    traces: str = ""
    # Scores and grades
    nutriscore_grade: Optional[str] = None   # a/b/c/d/e
    nutriscore_score: Optional[int] = None   # -15 to +40
    nova_group: Optional[int] = None         # 1-4 (food processing level)
    ecoscore_grade: Optional[str] = None     # environmental impact
    # Nutrition
    nutrition: Optional[NutritionPer100g] = None
    serving_size: Optional[str] = None
    nutrition_grade_fr: Optional[str] = None
    # Images
    image_url: Optional[str] = None
    image_front_url: Optional[str] = None
    # Metadata
    last_modified: Optional[str] = None
    data_quality_tags: list[str] = field(default_factory=list)
    states_tags: list[str] = field(default_factory=list)


def parse_nutriments(nutriments: dict) -> NutritionPer100g:
    """Extract per-100g nutrition values from the nutriments object."""
    def get_val(key: str) -> Optional[float]:
        v = nutriments.get(f"{key}_100g")
        if v is None:
            v = nutriments.get(key)
        try:
            return float(v) if v is not None else None
        except (TypeError, ValueError):
            return None

    return NutritionPer100g(
        energy_kj=get_val("energy-kj"),
        energy_kcal=get_val("energy-kcal"),
        fat=get_val("fat"),
        saturated_fat=get_val("saturated-fat"),
        trans_fat=get_val("trans-fat"),
        carbohydrates=get_val("carbohydrates"),
        sugars=get_val("sugars"),
        fiber=get_val("fiber"),
        proteins=get_val("proteins"),
        salt=get_val("salt"),
        sodium=get_val("sodium"),
        vitamin_a_mg=get_val("vitamin-a"),
        vitamin_c_mg=get_val("vitamin-c"),
        calcium_mg=get_val("calcium"),
        iron_mg=get_val("iron"),
    )


def get_product(barcode: str) -> Optional[FoodProduct]:
    """Fetch a single product by EAN/UPC barcode."""
    resp = client.get(f"{OFF_BASE}/api/v2/product/{barcode}.json")
    if resp.status_code != 200:
        return None

    data = resp.json()
    if data.get("status") != 1:
        return None

    p = data["product"]
    return FoodProduct(
        barcode=barcode,
        name=p.get("product_name", ""),
        name_en=p.get("product_name_en", ""),
        brands=p.get("brands", ""),
        quantity=p.get("quantity", ""),
        categories=p.get("categories", ""),
        countries=p.get("countries", ""),
        ingredients_text=p.get("ingredients_text", ""),
        ingredients_text_en=p.get("ingredients_text_en", ""),
        additives=[tag.split(":")[-1] for tag in p.get("additives_tags", [])],
        allergens=p.get("allergens", ""),
        allergens_tags=p.get("allergens_tags", []),
        traces=p.get("traces", ""),
        nutriscore_grade=p.get("nutriscore_grade"),
        nutriscore_score=p.get("nutriscore_score"),
        nova_group=p.get("nova_group"),
        ecoscore_grade=p.get("ecoscore_grade"),
        nutrition=parse_nutriments(p.get("nutriments", {})),
        serving_size=p.get("serving_size"),
        nutrition_grade_fr=p.get("nutrition_grade_fr"),
        image_url=p.get("image_url"),
        image_front_url=p.get("image_front_url"),
        last_modified=p.get("last_modified_t"),
        data_quality_tags=p.get("data_quality_tags", []),
        states_tags=p.get("states_tags", []),
    )


# Test with Nutella (EAN-13: 3017620422003)
if __name__ == "__main__":
    product = get_product("3017620422003")
    if product:
        print(f"Product: {product.name}")
        print(f"Brand: {product.brands}")
        print(f"Nutri-Score: {product.nutriscore_grade}")
        print(f"NOVA: {product.nova_group}")
        if product.nutrition:
            print(f"Calories: {product.nutrition.energy_kcal} kcal/100g")
            print(f"Sugars: {product.nutrition.sugars}g/100g")
        print(f"Allergens: {product.allergens}")

Example JSON output for Nutella:

{
  "barcode": "3017620422003",
  "name": "Nutella",
  "brands": "Ferrero",
  "quantity": "750 g",
  "categories": "Spreads, Sweet spreads, Hazelnut spreads",
  "nutriscore_grade": "e",
  "nova_group": 4,
  "nutrition": {
    "energy_kcal": 539.0,
    "fat": 30.9,
    "saturated_fat": 10.6,
    "carbohydrates": 57.5,
    "sugars": 56.3,
    "fiber": null,
    "proteins": 6.3,
    "salt": 0.107
  },
  "allergens": "en:milk, en:nuts",
  "allergens_tags": ["en:milk", "en:nuts"]
}

Bulk Search and Category Collection

The search endpoint supports free-text search and tag-based filtering. Use the fields parameter to request only what you need — the full product object can be 50KB+, so specifying fields is both faster and kinder to their servers.

import time
from typing import Iterator

# Fields to request — covers most use cases efficiently
STANDARD_FIELDS = (
    "code,product_name,product_name_en,brands,quantity,categories,"
    "ingredients_text,allergens,allergens_tags,traces,"
    "nutriscore_grade,nutriscore_score,nova_group,ecoscore_grade,"
    "nutriments,serving_size,image_url,last_modified_t,"
    "data_quality_tags,states_tags"
)


def search_products(
    query: str,
    category: Optional[str] = None,
    country: Optional[str] = None,
    nutrition_grade: Optional[str] = None,
    page_size: int = 100,
    max_pages: int = 20,
) -> Iterator[FoodProduct]:
    """
    Search Open Food Facts and yield FoodProduct objects.
    Handles pagination automatically.
    """
    for page in range(1, max_pages + 1):
        params = {
            "search_terms": query,
            "json": "1",
            "page_size": page_size,
            "page": page,
            "fields": STANDARD_FIELDS,
            "sort_by": "unique_scans_n",  # Most scanned first = better data quality
        }

        # Category filter
        if category:
            params["tagtype_0"] = "categories"
            params["tag_contains_0"] = "contains"
            params["tag_0"] = category

        # Country filter
        if country:
            tag_idx = 1 if category else 0
            params[f"tagtype_{tag_idx}"] = "countries"
            params[f"tag_contains_{tag_idx}"] = "contains"
            params[f"tag_{tag_idx}"] = country

        # Nutrition grade filter
        if nutrition_grade:
            params["nutrigrade"] = nutrition_grade

        resp = client.get(f"{OFF_BASE}/cgi/search.pl", params=params)
        if resp.status_code != 200:
            print(f"Search request failed: {resp.status_code}")
            break

        data = resp.json()
        products = data.get("products", [])
        total = data.get("count", 0)

        if not products:
            break

        print(f"Page {page}/{min(max_pages, (total // page_size) + 1)}: "
              f"{len(products)} products (total in category: {total})")

        for p_raw in products:
            barcode = p_raw.get("code", "")
            if not barcode:
                continue
            product = FoodProduct(
                barcode=barcode,
                name=p_raw.get("product_name", ""),
                name_en=p_raw.get("product_name_en", ""),
                brands=p_raw.get("brands", ""),
                quantity=p_raw.get("quantity", ""),
                categories=p_raw.get("categories", ""),
                ingredients_text=p_raw.get("ingredients_text", ""),
                allergens=p_raw.get("allergens", ""),
                allergens_tags=p_raw.get("allergens_tags", []),
                traces=p_raw.get("traces", ""),
                nutriscore_grade=p_raw.get("nutriscore_grade"),
                nutriscore_score=p_raw.get("nutriscore_score"),
                nova_group=p_raw.get("nova_group"),
                ecoscore_grade=p_raw.get("ecoscore_grade"),
                nutrition=parse_nutriments(p_raw.get("nutriments", {})),
                serving_size=p_raw.get("serving_size"),
                image_url=p_raw.get("image_url"),
                last_modified=p_raw.get("last_modified_t"),
                data_quality_tags=p_raw.get("data_quality_tags", []),
            )
            yield product

        if len(products) < page_size:
            break  # Last page

        time.sleep(1.0)  # Be a good citizen


# Collect all breakfast cereals with Nutri-Score A or B
good_cereals = list(search_products(
    query="cereal",
    category="en:breakfast-cereals",
    nutrition_grade="a",
    max_pages=10,
))
print(f"Found {len(good_cereals)} breakfast cereals with Nutri-Score A")

Browsing by Category

If you want all products in a specific category without a text search, use the category endpoint directly. This is more reliable for bulk collection:

def browse_category(
    category_tag: str,  # e.g. "en:biscuits-and-cakes"
    max_pages: int = 50,
) -> Iterator[FoodProduct]:
    """Browse all products in a category using the category endpoint."""
    for page in range(1, max_pages + 1):
        url = f"{OFF_BASE}/category/{category_tag}.json"
        resp = client.get(url, params={"page": page, "fields": STANDARD_FIELDS})
        if resp.status_code != 200:
            break

        data = resp.json()
        products = data.get("products", [])
        if not products:
            break

        for p_raw in products:
            barcode = p_raw.get("code", "")
            if barcode:
                yield FoodProduct(
                    barcode=barcode,
                    name=p_raw.get("product_name", ""),
                    brands=p_raw.get("brands", ""),
                    nutriscore_grade=p_raw.get("nutriscore_grade"),
                    nova_group=p_raw.get("nova_group"),
                    nutrition=parse_nutriments(p_raw.get("nutriments", {})),
                    allergens_tags=p_raw.get("allergens_tags", []),
                )

        time.sleep(0.5)


# Example: all chips and crisps
for product in browse_category("en:chips-and-crisps", max_pages=20):
    print(f"{product.name} ({product.brands}) — Nutri-Score: {product.nutriscore_grade}")

Handling Allergen Data Correctly

Allergen handling in Open Food Facts has nuance. There are three separate fields:

allergens: Raw text from the product label (e.g. "Contains: milk, soy, wheat")
allergens_tags: Normalized tags extracted from the allergens field (e.g. ["en:milk", "en:gluten"])
allergens_from_ingredients: Allergens detected by analyzing the ingredients text (may catch things the explicit allergens field missed)
traces: May contain traces of other allergens (cross-contamination)

For any application where safety matters, use all four:

# Standard 14 EU allergens (EN tags)
EU_ALLERGENS = {
    "en:gluten": "Gluten",
    "en:crustaceans": "Crustaceans",
    "en:eggs": "Eggs",
    "en:fish": "Fish",
    "en:peanuts": "Peanuts",
    "en:soybeans": "Soybeans",
    "en:milk": "Milk",
    "en:nuts": "Nuts",
    "en:celery": "Celery",
    "en:mustard": "Mustard",
    "en:sesame-seeds": "Sesame",
    "en:sulphur-dioxide-and-sulphites": "Sulphites",
    "en:lupin": "Lupin",
    "en:molluscs": "Molluscs",
}


def parse_allergens_comprehensive(product_data: dict) -> dict:
    """
    Extract all allergen information from a raw product dict.
    Returns {allergen_name: {"contains": bool, "may_contain": bool}}
    """
    contains = set()
    may_contain = set()

    # From explicit allergens field
    for tag in product_data.get("allergens_tags", []):
        human_name = EU_ALLERGENS.get(tag)
        if human_name:
            contains.add(human_name)

    # From ingredients analysis (may catch additional allergens)
    for tag in product_data.get("allergens_from_ingredients_tags", []):
        human_name = EU_ALLERGENS.get(tag)
        if human_name:
            contains.add(human_name)

    # Traces / may contain
    for tag in product_data.get("traces_tags", []):
        human_name = EU_ALLERGENS.get(tag)
        if human_name and human_name not in contains:
            may_contain.add(human_name)

    return {
        "contains": sorted(contains),
        "may_contain": sorted(may_contain - contains),
        "allergen_text_raw": product_data.get("allergens", ""),
        "traces_text_raw": product_data.get("traces", ""),
    }


# Example output for Nutella:
# {
#   "contains": ["Milk", "Nuts"],
#   "may_contain": [],
#   "allergen_text_raw": "en:milk, en:nuts",
#   "traces_text_raw": ""
# }

Async Bulk Collection for Large Datasets

For collecting thousands of products efficiently, the async approach with httpx is dramatically faster than sequential requests:

import asyncio
import httpx
from typing import list

async def fetch_product_async(
    client: httpx.AsyncClient,
    barcode: str,
    semaphore: asyncio.Semaphore,
) -> Optional[FoodProduct]:
    """Fetch a single product asynchronously."""
    async with semaphore:
        try:
            resp = await client.get(
                f"{OFF_BASE}/api/v2/product/{barcode}.json",
                params={"fields": STANDARD_FIELDS},
            )
            if resp.status_code != 200:
                return None
            data = resp.json()
            if data.get("status") != 1:
                return None
            p = data["product"]
            return FoodProduct(
                barcode=barcode,
                name=p.get("product_name", ""),
                brands=p.get("brands", ""),
                nutriscore_grade=p.get("nutriscore_grade"),
                nova_group=p.get("nova_group"),
                nutrition=parse_nutriments(p.get("nutriments", {})),
                allergens_tags=p.get("allergens_tags", []),
            )
        except Exception as e:
            print(f"Failed to fetch {barcode}: {e}")
            return None


async def bulk_fetch_products(
    barcodes: list[str],
    concurrency: int = 5,
) -> list[FoodProduct]:
    """Fetch many products concurrently with rate limiting."""
    semaphore = asyncio.Semaphore(concurrency)
    headers = {
        "User-Agent": "NutritionBot/2.0 ([email protected])",
        "Accept": "application/json",
    }

    async with httpx.AsyncClient(headers=headers, timeout=20.0) as client:
        tasks = [
            fetch_product_async(client, barcode, semaphore)
            for barcode in barcodes
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    products = []
    for result in results:
        if isinstance(result, FoodProduct):
            products.append(result)
        elif isinstance(result, Exception):
            print(f"Exception during fetch: {result}")

    return products


# Usage
barcodes = ["3017620422003", "5449000000996", "8000500310427"]  # Nutella, Coca-Cola, Kinder
products = asyncio.run(bulk_fetch_products(barcodes, concurrency=5))
print(f"Fetched {len(products)} products")

Cross-Referencing with Commercial Grocery Sites

Open Food Facts gives you nutrition data, but commercial grocery sites have prices, availability, store locations, and promotional data that OFF does not. Cross-referencing these sources creates richer datasets — but commercial grocery sites actively block scrapers.

This is where residential proxy rotation becomes necessary. Commercial grocery sites (Tesco, Walmart, Carrefour, etc.) use IP reputation systems that ban datacenter IPs within minutes. Residential proxies from ThorData route your requests through real ISP IP addresses, making them indistinguishable from normal shopper traffic.

import httpx
import asyncio
from typing import Optional

THORDATA_USER = "your_thordata_username"
THORDATA_PASS = "your_thordata_password"

def make_proxied_client(country: str = "gb") -> httpx.Client:
    """Create an httpx client routing through ThorData residential proxies."""
    proxy_url = f"http://{THORDATA_USER}-country-{country}:{THORDATA_PASS}@proxy.thordata.com:9000"
    return httpx.Client(
        proxy=proxy_url,
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/124.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-GB,en;q=0.9",
        },
        timeout=30.0,
        follow_redirects=True,
    )


async def get_product_price_tesco(barcode: str, country: str = "gb") -> Optional[dict]:
    """
    Look up a product's price on Tesco using their unofficial search API.
    Requires UK residential IP — use ThorData UK proxies.
    """
    proxy_url = f"http://{THORDATA_USER}-country-{country}:{THORDATA_PASS}@proxy.thordata.com:9000"

    async with httpx.AsyncClient(
        proxy=proxy_url,
        headers={
            "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                          "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile Safari/604.1",
            "Accept": "application/json",
        },
        timeout=20.0,
    ) as client:
        # Tesco product API (found via DevTools inspection)
        resp = await client.get(
            "https://api.tesco.com/shoppingexperience/v1/api/products",
            params={"query": barcode, "offset": 0, "limit": 5},
            headers={"x-api-key": "your_intercepted_key"},
        )
        if resp.status_code != 200:
            return None
        data = resp.json()
        items = data.get("products", {}).get("results", [])
        if not items:
            return None
        item = items[0]
        return {
            "barcode": barcode,
            "retailer": "tesco",
            "name": item.get("name", ""),
            "price": item.get("price", {}).get("actual"),
            "unit_price": item.get("unitPrice", {}).get("price"),
            "in_stock": item.get("available", False),
            "url": f"https://www.tesco.com/groceries/en-GB/products/{item.get('id')}",
        }


async def enrich_with_prices(
    products: list[FoodProduct],
    country: str = "gb",
) -> list[dict]:
    """Enrich Open Food Facts data with current retail prices."""
    enriched = []
    semaphore = asyncio.Semaphore(3)  # Conservative concurrency for commercial sites

    async def enrich_one(product: FoodProduct) -> dict:
        async with semaphore:
            base_data = asdict(product)
            price_data = await get_product_price_tesco(product.barcode, country)
            if price_data:
                base_data["retail_price"] = price_data.get("price")
                base_data["retail_in_stock"] = price_data.get("in_stock")
                base_data["retail_url"] = price_data.get("url")
            await asyncio.sleep(1.5)  # Rate limit for commercial sites
            return base_data

    tasks = [enrich_one(p) for p in products]
    enriched = await asyncio.gather(*tasks, return_exceptions=False)
    return list(enriched)

Retry Logic and Error Handling

Open Food Facts is generally reliable but network issues happen. Build robust retry logic:

import time
import functools
import random
from typing import TypeVar, Callable

T = TypeVar("T")

def retry_with_backoff(
    max_attempts: int = 3,
    base_wait: float = 1.0,
    exceptions: tuple = (httpx.HTTPError, httpx.TimeoutException),
):
    """Decorator for automatic retry with exponential backoff."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_attempts:
                        raise
                    wait = base_wait * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
                    print(f"Attempt {attempt} failed ({e}), retrying in {wait:.1f}s...")
                    time.sleep(wait)
        return wrapper
    return decorator


@retry_with_backoff(max_attempts=3)
def get_product_robust(barcode: str) -> Optional[FoodProduct]:
    """Fetch product with automatic retry on network errors."""
    resp = client.get(f"{OFF_BASE}/api/v2/product/{barcode}.json")
    resp.raise_for_status()  # Raises HTTPStatusError on 4xx/5xx

    data = resp.json()
    if data.get("status") != 1:
        return None  # Product not found — don't retry this

    p = data["product"]
    return FoodProduct(
        barcode=barcode,
        name=p.get("product_name", ""),
        brands=p.get("brands", ""),
        nutriscore_grade=p.get("nutriscore_grade"),
        nutrition=parse_nutriments(p.get("nutriments", {})),
        allergens_tags=p.get("allergens_tags", []),
    )


def bulk_fetch_robust(
    barcodes: list[str],
    delay_between: float = 0.5,
) -> tuple[list[FoodProduct], list[str]]:
    """
    Fetch products sequentially with error handling.
    Returns (successful products, failed barcodes).
    """
    products = []
    failed = []

    for i, barcode in enumerate(barcodes):
        try:
            product = get_product_robust(barcode)
            if product:
                products.append(product)
            else:
                print(f"[{i+1}/{len(barcodes)}] Not found: {barcode}")
        except Exception as e:
            print(f"[{i+1}/{len(barcodes)}] Failed {barcode}: {e}")
            failed.append(barcode)

        if i < len(barcodes) - 1:
            time.sleep(delay_between)

    return products, failed

Data Quality Assessment

Not all records in Open Food Facts are complete. Assess quality before including records in downstream analysis:

def assess_data_quality(product: FoodProduct) -> dict:
    """Score a product record's completeness for different use cases."""
    has_basic = bool(product.name and product.brands)
    has_nutrition = (
        product.nutrition is not None and
        product.nutrition.energy_kcal is not None and
        product.nutrition.proteins is not None
    )
    has_full_nutrition = (
        has_nutrition and
        product.nutrition.fat is not None and
        product.nutrition.carbohydrates is not None and
        product.nutrition.sugars is not None and
        product.nutrition.salt is not None
    )
    has_allergens = len(product.allergens_tags) > 0 or bool(product.allergens)
    has_ingredients = bool(product.ingredients_text)
    has_scores = product.nutriscore_grade is not None
    has_nova = product.nova_group is not None

    # Data quality warnings from OFF's own validation
    quality_warnings = [
        tag for tag in product.data_quality_tags
        if "warning" in tag or "error" in tag
    ]

    # Completeness score 0-100
    checks = [has_basic, has_nutrition, has_full_nutrition,
               has_allergens, has_ingredients, has_scores, has_nova]
    score = int(100 * sum(checks) / len(checks))

    return {
        "barcode": product.barcode,
        "completeness_score": score,
        "has_basic_info": has_basic,
        "has_nutrition": has_nutrition,
        "has_full_nutrition": has_full_nutrition,
        "has_allergen_data": has_allergens,
        "has_ingredients": has_ingredients,
        "has_nutriscore": has_scores,
        "has_nova": has_nova,
        "quality_warnings": quality_warnings,
        "suitable_for": {
            "nutrition_analysis": has_full_nutrition,
            "allergen_app": has_allergens and has_basic,
            "nutriscore_comparison": has_scores and has_basic,
            "ingredient_analysis": has_ingredients,
            "ml_training": score >= 70,
        }
    }

Exporting Data

For analysis and storage, export to CSV or JSON:

import csv
import json
from pathlib import Path

def export_to_csv(products: list[FoodProduct], path: str):
    """Export products to CSV with flattened nutrition data."""
    if not products:
        return

    rows = []
    for p in products:
        row = {
            "barcode": p.barcode,
            "name": p.name,
            "brands": p.brands,
            "quantity": p.quantity,
            "categories": p.categories,
            "allergens": p.allergens,
            "traces": p.traces,
            "nutriscore_grade": p.nutriscore_grade,
            "nutriscore_score": p.nutriscore_score,
            "nova_group": p.nova_group,
            "ecoscore_grade": p.ecoscore_grade,
            "serving_size": p.serving_size,
            "image_url": p.image_url,
        }
        # Flatten nutrition
        if p.nutrition:
            for field_name, value in vars(p.nutrition).items():
                row[f"nutrition_{field_name}"] = value
        rows.append(row)

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

    print(f"Exported {len(rows)} products to {path}")


def export_to_jsonl(products: list[FoodProduct], path: str):
    """Export products to JSON Lines format (one JSON object per line)."""
    with open(path, "w", encoding="utf-8") as f:
        for p in products:
            f.write(json.dumps(asdict(p), default=str) + "\n")
    print(f"Exported {len(products)} products to {path}")

Using the Official Data Dump

For truly large-scale collection — millions of products — the API is the wrong tool. Open Food Facts publishes daily data dumps:

import subprocess
import gzip
import json

def download_and_process_dump():
    """
    Download the full Open Food Facts database dump and process it.
    The CSV dump is ~10GB uncompressed; the JSON dump is ~30GB.
    Process it as a stream to avoid loading everything into memory.
    """
    # Download (takes time — several GB)
    dump_url = "https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz"

    # Process as a stream
    count = 0
    cereals = []

    with gzip.open("openfoodfacts-products.jsonl.gz", "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            try:
                p = json.loads(line)
                # Filter to only breakfast cereals
                categories = p.get("categories_tags", [])
                if "en:breakfast-cereals" not in categories:
                    continue
                cereals.append({
                    "barcode": p.get("code"),
                    "name": p.get("product_name", ""),
                    "nutriscore": p.get("nutriscore_grade"),
                    "kcal": p.get("nutriments", {}).get("energy-kcal_100g"),
                })
                count += 1
                if count % 10000 == 0:
                    print(f"Processed {count} matching products...")
            except json.JSONDecodeError:
                continue

    print(f"Total breakfast cereals found: {len(cereals)}")
    return cereals

7 Real-World Applications

1. Allergen Alert Mobile App

The most immediately practical use case: an app that scans a barcode and tells users instantly whether a product contains their allergens:

def check_product_for_allergens(
    barcode: str,
    user_allergens: list[str],  # e.g. ["Milk", "Gluten", "Nuts"]
) -> dict:
    product = get_product_robust(barcode)
    if not product:
        return {"found": False, "barcode": barcode}

    allergen_info = parse_allergens_comprehensive({
        "allergens_tags": product.allergens_tags,
        "allergens": product.allergens,
        "traces_tags": [],
        "traces": "",
    })

    dangers = [a for a in user_allergens if a in allergen_info["contains"]]
    warnings = [a for a in user_allergens if a in allergen_info["may_contain"]]

    return {
        "found": True,
        "barcode": barcode,
        "name": product.name,
        "brand": product.brands,
        "safe": len(dangers) == 0 and len(warnings) == 0,
        "contains_allergens": dangers,
        "may_contain_allergens": warnings,
        "nutriscore": product.nutriscore_grade,
    }

2. Nutri-Score Category Analysis

Compare the nutritional profile of products within a category to identify the healthiest options:

def analyze_category_nutrition(category: str) -> dict:
    products = list(search_products("", category=category, max_pages=10))

    # Filter to products with complete nutrition data
    scored = [p for p in products if p.nutriscore_grade and p.nutrition and p.nutrition.energy_kcal]

    grade_counts = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 0}
    for p in scored:
        grade = p.nutriscore_grade.lower()
        if grade in grade_counts:
            grade_counts[grade] += 1

    # Best products by Nutri-Score
    best = [p for p in scored if p.nutriscore_grade in ("a", "A")][:10]

    return {
        "category": category,
        "total_products": len(products),
        "with_nutriscore": len(scored),
        "grade_distribution": grade_counts,
        "best_products": [
            {"name": p.name, "brand": p.brands, "grade": p.nutriscore_grade,
             "kcal": p.nutrition.energy_kcal if p.nutrition else None}
            for p in best
        ],
    }

3. NOVA Processing Level Research

NOVA classifies foods 1-4 by processing level. Level 1 is unprocessed (fruits, vegetables, meat). Level 4 is ultra-processed (soft drinks, packaged snacks, reconstituted meat products). Collecting NOVA data at scale enables research into dietary patterns:

def analyze_nova_distribution(country: str = "en:france") -> dict:
    all_products = list(search_products("", country=country, max_pages=20))
    with_nova = [p for p in all_products if p.nova_group]

    distribution = {1: [], 2: [], 3: [], 4: []}
    for p in with_nova:
        if p.nova_group in distribution:
            distribution[p.nova_group].append(p)

    return {
        "country": country,
        "total_analyzed": len(with_nova),
        "nova_distribution": {
            level: {
                "count": len(products),
                "percentage": round(100 * len(products) / len(with_nova), 1) if with_nova else 0,
                "example_brands": list({p.brands for p in products[:5] if p.brands}),
            }
            for level, products in distribution.items()
        }
    }

4. Reformulation Tracking

Track how product recipes change over time by storing historical snapshots and comparing them:

import sqlite3
import json

def setup_tracking_db(db_path: str = "reformulation_tracking.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS snapshots (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            barcode TEXT NOT NULL,
            snapshot_date TEXT NOT NULL,
            product_name TEXT,
            nutriscore_grade TEXT,
            sugar_per_100g REAL,
            salt_per_100g REAL,
            fat_per_100g REAL,
            ingredients_text TEXT,
            full_data JSON
        )
    """)
    conn.commit()
    return conn


def record_snapshot(conn: sqlite3.Connection, product: FoodProduct):
    import datetime
    conn.execute("""
        INSERT INTO snapshots 
        (barcode, snapshot_date, product_name, nutriscore_grade, 
         sugar_per_100g, salt_per_100g, fat_per_100g, ingredients_text, full_data)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        product.barcode,
        datetime.date.today().isoformat(),
        product.name,
        product.nutriscore_grade,
        product.nutrition.sugars if product.nutrition else None,
        product.nutrition.salt if product.nutrition else None,
        product.nutrition.fat if product.nutrition else None,
        product.ingredients_text,
        json.dumps(asdict(product), default=str),
    ))
    conn.commit()

5. Price Comparison Engine (with ThorData Proxies)

Combine Open Food Facts barcodes with real-time prices scraped from grocery retailers using residential proxies:

async def build_price_comparison(
    category: str,
    retailers: list[str] = ["tesco", "sainsburys", "waitrose"],
) -> list[dict]:
    # Get products from OFF
    products = list(search_products("", category=category, max_pages=3))
    print(f"Loaded {len(products)} products from Open Food Facts")

    enriched = []
    semaphore = asyncio.Semaphore(2)  # Low concurrency for retail sites

    async def get_prices_for_product(product: FoodProduct) -> dict:
        async with semaphore:
            result = asdict(product)
            result["prices"] = {}

            for retailer in retailers:
                proxy = f"http://{THORDATA_USER}-country-gb:{THORDATA_PASS}@proxy.thordata.com:9000"
                async with httpx.AsyncClient(proxy=proxy, timeout=20.0) as client:
                    try:
                        # Each retailer has a different API pattern
                        # Intercept via DevTools/mitmproxy to find the right endpoint
                        resp = await client.get(
                            f"https://api.{retailer}.com/products",
                            params={"barcode": product.barcode},
                            headers={"User-Agent": "Mozilla/5.0 ..."},
                        )
                        if resp.status_code == 200:
                            data = resp.json()
                            result["prices"][retailer] = data.get("price")
                    except Exception:
                        result["prices"][retailer] = None

                await asyncio.sleep(1.0)

            return result

    tasks = [get_prices_for_product(p) for p in products[:50]]
    enriched = await asyncio.gather(*tasks)
    return list(enriched)

6. Nutritional Label Compliance Checker

Automatically check whether products meet nutritional criteria for specific health claims or regulatory thresholds:

def check_health_claim_eligibility(product: FoodProduct) -> dict:
    """
    Check EU health claim eligibility based on nutrition data.
    Rules vary by claim — this covers common ones.
    """
    n = product.nutrition
    if not n:
        return {"error": "No nutrition data"}

    checks = {}

    # Low fat claim: ≤3g fat per 100g (solids), ≤1.5g per 100ml (liquids)
    if n.fat is not None:
        checks["low_fat"] = n.fat <= 3.0

    # Low sugar claim: ≤5g sugars per 100g
    if n.sugars is not None:
        checks["low_sugar"] = n.sugars <= 5.0

    # Low sodium/salt claim: ≤0.12g sodium per 100g
    if n.sodium is not None:
        checks["low_sodium"] = n.sodium <= 0.12
    elif n.salt is not None:
        checks["low_sodium"] = n.salt <= 0.3  # 0.12g sodium ≈ 0.3g salt

    # Source of protein: ≥12% energy from protein
    if n.energy_kcal and n.proteins:
        protein_energy_pct = (n.proteins * 4 / n.energy_kcal) * 100
        checks["source_of_protein"] = protein_energy_pct >= 12.0
        checks["high_protein"] = protein_energy_pct >= 20.0

    # High fiber: ≥6g fiber per 100g
    if n.fiber is not None:
        checks["high_fiber"] = n.fiber >= 6.0
        checks["source_of_fiber"] = n.fiber >= 3.0

    return {
        "barcode": product.barcode,
        "name": product.name,
        "eligible_claims": [claim for claim, eligible in checks.items() if eligible],
        "ineligible_claims": [claim for claim, eligible in checks.items() if not eligible],
        "checks": checks,
    }

7. ML Training Dataset Builder

Build labeled datasets for food image classification, ingredient parsing, or Nutri-Score prediction:

import csv
from pathlib import Path
import httpx

def build_nutriscore_dataset(
    target_per_grade: int = 500,
    output_dir: str = "nutriscore_dataset",
) -> dict:
    """
    Build a balanced dataset for Nutri-Score prediction.
    target_per_grade: how many examples to collect per A/B/C/D/E grade.
    """
    Path(output_dir).mkdir(exist_ok=True)
    records_by_grade = {"a": [], "b": [], "c": [], "d": [], "e": []}
    collected_total = 0

    for grade in ["a", "b", "c", "d", "e"]:
        print(f"Collecting grade {grade.upper()} products...")
        for product in search_products(
            "",
            nutrition_grade=grade,
            max_pages=20,
        ):
            if len(records_by_grade[grade]) >= target_per_grade:
                break

            quality = assess_data_quality(product)
            if not quality["suitable_for"]["ml_training"]:
                continue

            n = product.nutrition
            if not n or not n.energy_kcal:
                continue

            records_by_grade[grade].append({
                "barcode": product.barcode,
                "label": grade,
                "energy_kcal": n.energy_kcal,
                "fat": n.fat,
                "saturated_fat": n.saturated_fat,
                "carbohydrates": n.carbohydrates,
                "sugars": n.sugars,
                "fiber": n.fiber,
                "proteins": n.proteins,
                "salt": n.salt,
            })
            collected_total += 1

    # Write to CSV
    all_records = []
    for grade_records in records_by_grade.values():
        all_records.extend(grade_records)

    if all_records:
        output_path = f"{output_dir}/nutriscore_features.csv"
        with open(output_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=all_records[0].keys())
            writer.writeheader()
            writer.writerows(all_records)

    return {
        "total_collected": collected_total,
        "per_grade": {g: len(r) for g, r in records_by_grade.items()},
        "output": f"{output_dir}/nutriscore_features.csv",
    }

Rate Limits and Being a Good Citizen

Open Food Facts does not publish strict rate limits, but their infrastructure is not Google-scale. They are a volunteer-run nonprofit. Practical guidelines that keep you in good standing:

1 request per second for search queries
Up to 10 requests per second for individual barcode lookups (cached at CDN edge)
Always include a descriptive User-Agent with contact information
Use the fields parameter to reduce response sizes
For bulk collection of millions of products, use the data dump instead of the API
If you are building something commercial, consider contributing back to the project

The data quality improves as more people contribute. If you notice missing or incorrect data during your scraping, use their product edit interface or API to contribute improvements. The project runs entirely on community contributions.

Summary

Open Food Facts is an exceptional free resource for food and nutrition data. The API is straightforward — no auth, sensible rate limits, well-documented fields. The data quality is genuinely good for major brands in Europe and North America and improving globally.

The patterns in this guide scale from a quick single-product lookup to collecting millions of records from the daily data dump. For cross-referencing with commercial grocery sites, residential proxies from ThorData handle the IP-based blocking that would otherwise prevent access. The output schemas and quality assessment functions give you a solid foundation for building reliable data pipelines on top of the raw API.

Start with the single-product lookup, test with a few known barcodes, then scale up to category searches and bulk collection as you understand the data shape you are working with.