Scraping Open Food Facts for Nutrition Data with Python (2026 Complete Guide)
Open Food Facts is one of the most valuable free datasets on the internet. Over 3.5 million food products from 180+ countries, all under an open data license, accessible via a well-documented API that requires no API key and imposes no aggressive rate limits. If you are building a nutrition tracking app, conducting food research, comparing products across categories, monitoring allergen information, or building training datasets for food-related machine learning — this is where you start.
The database is genuinely useful and the quality is much better than people expect. Products from major European and North American brands tend to have complete nutrition panels, verified barcodes, ingredient lists, allergen tags, Nutri-Scores, and NOVA food processing classifications. Niche or regional products may have gaps, but you can handle those gracefully.
This guide covers everything from basic single-product lookups to production-grade bulk collection: cross-referencing Open Food Facts data with commercial grocery sites (which do block scrapers, requiring residential proxy rotation via ThorData), building robust retry logic, validating output shapes, and designing schemas that work well in downstream analytics pipelines. The code examples are complete; swap in your own contact details and credentials where placeholders appear.
How the Open Food Facts API Works
The base URL is world.openfoodfacts.org. There is no authentication. They ask for a descriptive User-Agent header identifying your application and contact email, and publish modest rate limits (on the order of 100 product requests and 10 search requests per minute) — that is the extent of the "rules." They are a nonprofit run by volunteers and they genuinely want people to use this data.
Three endpoints cover 95% of use cases:
Single product by barcode (v2 API):
GET https://world.openfoodfacts.org/api/v2/product/{barcode}.json
Search and category browsing:
GET https://world.openfoodfacts.org/cgi/search.pl?...
Product listing by category/tag:
GET https://world.openfoodfacts.org/category/{category}.json?page={n}
Responses are JSON. The product object is rich and sometimes overwhelming — a full product response can be 50KB+, containing dozens of nutriment fields, multiple image URLs, contributor history, and data quality flags. The fields parameter (supported on both product and search requests) lets you request only the fields you need, dramatically reducing response size.
The status field in product responses indicates data quality: 1 means the product was found and has data, 0 means not found. Always check this before parsing.
Setting Up the Client
Use httpx rather than requests. It supports HTTP/2 (via the optional h2 extra), has a cleaner async API for bulk collection, and manages connection pooling efficiently.
import httpx
import json
import time
from typing import Optional, Iterator
# Single shared client for all requests — handles connection pooling
client = httpx.Client(
headers={
"User-Agent": "NutritionResearchBot/2.0 ([email protected]) "
"github.com/yourusername/yourproject",
"Accept": "application/json",
"Accept-Encoding": "gzip, deflate",
},
timeout=20.0,
follow_redirects=True,
)
OFF_BASE = "https://world.openfoodfacts.org"
The User-Agent is not just a courtesy — it helps the Open Food Facts team understand what the API is being used for and contact you if there is an issue. Use something descriptive.
Fetching a Single Product by Barcode
from dataclasses import dataclass, field, asdict
from typing import Optional
@dataclass
class NutritionPer100g:
energy_kj: Optional[float] = None
energy_kcal: Optional[float] = None
fat: Optional[float] = None
saturated_fat: Optional[float] = None
trans_fat: Optional[float] = None
carbohydrates: Optional[float] = None
sugars: Optional[float] = None
fiber: Optional[float] = None
proteins: Optional[float] = None
salt: Optional[float] = None
sodium: Optional[float] = None
# Vitamins and minerals — present on some products.
# NOTE: OFF nutriment values default to grams; if you need true mg values,
# multiply by 1000 (and check the corresponding *_unit fields).
vitamin_a_mg: Optional[float] = None
vitamin_c_mg: Optional[float] = None
calcium_mg: Optional[float] = None
iron_mg: Optional[float] = None
@dataclass
class FoodProduct:
barcode: str
# Identity
name: str = ""
name_en: str = ""
brands: str = ""
quantity: str = ""
categories: str = ""
countries: str = ""
# Ingredients
ingredients_text: str = ""
ingredients_text_en: str = ""
additives: list[str] = field(default_factory=list)
# Allergens
allergens: str = ""
allergens_tags: list[str] = field(default_factory=list)
traces: str = ""
# Scores and grades
nutriscore_grade: Optional[str] = None # a/b/c/d/e
nutriscore_score: Optional[int] = None # -15 to +40
nova_group: Optional[int] = None # 1-4 (food processing level)
ecoscore_grade: Optional[str] = None # environmental impact
# Nutrition
nutrition: Optional[NutritionPer100g] = None
serving_size: Optional[str] = None
nutrition_grade_fr: Optional[str] = None
# Images
image_url: Optional[str] = None
image_front_url: Optional[str] = None
# Metadata
last_modified: Optional[str] = None
data_quality_tags: list[str] = field(default_factory=list)
states_tags: list[str] = field(default_factory=list)
def parse_nutriments(nutriments: dict) -> NutritionPer100g:
"""Extract per-100g nutrition values from the nutriments object."""
def get_val(key: str) -> Optional[float]:
v = nutriments.get(f"{key}_100g")
if v is None:
v = nutriments.get(key)
try:
return float(v) if v is not None else None
except (TypeError, ValueError):
return None
return NutritionPer100g(
energy_kj=get_val("energy-kj"),
energy_kcal=get_val("energy-kcal"),
fat=get_val("fat"),
saturated_fat=get_val("saturated-fat"),
trans_fat=get_val("trans-fat"),
carbohydrates=get_val("carbohydrates"),
sugars=get_val("sugars"),
fiber=get_val("fiber"),
proteins=get_val("proteins"),
salt=get_val("salt"),
sodium=get_val("sodium"),
vitamin_a_mg=get_val("vitamin-a"),
vitamin_c_mg=get_val("vitamin-c"),
calcium_mg=get_val("calcium"),
iron_mg=get_val("iron"),
)
def get_product(barcode: str) -> Optional[FoodProduct]:
"""Fetch a single product by EAN/UPC barcode."""
resp = client.get(f"{OFF_BASE}/api/v2/product/{barcode}.json")
if resp.status_code != 200:
return None
data = resp.json()
if data.get("status") != 1:
return None
p = data["product"]
return FoodProduct(
barcode=barcode,
name=p.get("product_name", ""),
name_en=p.get("product_name_en", ""),
brands=p.get("brands", ""),
quantity=p.get("quantity", ""),
categories=p.get("categories", ""),
countries=p.get("countries", ""),
ingredients_text=p.get("ingredients_text", ""),
ingredients_text_en=p.get("ingredients_text_en", ""),
additives=[tag.split(":")[-1] for tag in p.get("additives_tags", [])],
allergens=p.get("allergens", ""),
allergens_tags=p.get("allergens_tags", []),
traces=p.get("traces", ""),
nutriscore_grade=p.get("nutriscore_grade"),
nutriscore_score=p.get("nutriscore_score"),
nova_group=p.get("nova_group"),
ecoscore_grade=p.get("ecoscore_grade"),
nutrition=parse_nutriments(p.get("nutriments", {})),
serving_size=p.get("serving_size"),
nutrition_grade_fr=p.get("nutrition_grade_fr"),
image_url=p.get("image_url"),
image_front_url=p.get("image_front_url"),
last_modified=p.get("last_modified_t"),
data_quality_tags=p.get("data_quality_tags", []),
states_tags=p.get("states_tags", []),
)
# Test with Nutella (EAN-13: 3017620422003)
if __name__ == "__main__":
product = get_product("3017620422003")
if product:
print(f"Product: {product.name}")
print(f"Brand: {product.brands}")
print(f"Nutri-Score: {product.nutriscore_grade}")
print(f"NOVA: {product.nova_group}")
if product.nutrition:
print(f"Calories: {product.nutrition.energy_kcal} kcal/100g")
print(f"Sugars: {product.nutrition.sugars}g/100g")
print(f"Allergens: {product.allergens}")
Example JSON output for Nutella:
{
"barcode": "3017620422003",
"name": "Nutella",
"brands": "Ferrero",
"quantity": "750 g",
"categories": "Spreads, Sweet spreads, Hazelnut spreads",
"nutriscore_grade": "e",
"nova_group": 4,
"nutrition": {
"energy_kcal": 539.0,
"fat": 30.9,
"saturated_fat": 10.6,
"carbohydrates": 57.5,
"sugars": 56.3,
"fiber": null,
"proteins": 6.3,
"salt": 0.107
},
"allergens": "en:milk, en:nuts",
"allergens_tags": ["en:milk", "en:nuts"]
}
Bulk Search and Category Collection
The search endpoint supports free-text search and tag-based filtering. Use the fields parameter to request only what you need — the full product object can be 50KB+, so specifying fields is both faster and kinder to their servers.
import time
from typing import Iterator
# Fields to request — covers most use cases efficiently
STANDARD_FIELDS = (
"code,product_name,product_name_en,brands,quantity,categories,"
"ingredients_text,allergens,allergens_tags,traces,"
"nutriscore_grade,nutriscore_score,nova_group,ecoscore_grade,"
"nutriments,serving_size,image_url,last_modified_t,"
"data_quality_tags,states_tags"
)
def search_products(
query: str,
category: Optional[str] = None,
country: Optional[str] = None,
nutrition_grade: Optional[str] = None,
page_size: int = 100,
max_pages: int = 20,
) -> Iterator[FoodProduct]:
"""
Search Open Food Facts and yield FoodProduct objects.
Handles pagination automatically.
"""
for page in range(1, max_pages + 1):
params = {
"search_terms": query,
"json": "1",
"page_size": page_size,
"page": page,
"fields": STANDARD_FIELDS,
"sort_by": "unique_scans_n", # Most scanned first = better data quality
}
# Category filter
if category:
params["tagtype_0"] = "categories"
params["tag_contains_0"] = "contains"
params["tag_0"] = category
# Country filter
if country:
tag_idx = 1 if category else 0
params[f"tagtype_{tag_idx}"] = "countries"
params[f"tag_contains_{tag_idx}"] = "contains"
params[f"tag_{tag_idx}"] = country
# Nutrition grade filter
if nutrition_grade:
params["nutrigrade"] = nutrition_grade
resp = client.get(f"{OFF_BASE}/cgi/search.pl", params=params)
if resp.status_code != 200:
print(f"Search request failed: {resp.status_code}")
break
data = resp.json()
products = data.get("products", [])
total = data.get("count", 0)
if not products:
break
print(f"Page {page}/{min(max_pages, (total // page_size) + 1)}: "
f"{len(products)} products (total in category: {total})")
for p_raw in products:
barcode = p_raw.get("code", "")
if not barcode:
continue
product = FoodProduct(
barcode=barcode,
name=p_raw.get("product_name", ""),
name_en=p_raw.get("product_name_en", ""),
brands=p_raw.get("brands", ""),
quantity=p_raw.get("quantity", ""),
categories=p_raw.get("categories", ""),
ingredients_text=p_raw.get("ingredients_text", ""),
allergens=p_raw.get("allergens", ""),
allergens_tags=p_raw.get("allergens_tags", []),
traces=p_raw.get("traces", ""),
nutriscore_grade=p_raw.get("nutriscore_grade"),
nutriscore_score=p_raw.get("nutriscore_score"),
nova_group=p_raw.get("nova_group"),
ecoscore_grade=p_raw.get("ecoscore_grade"),
nutrition=parse_nutriments(p_raw.get("nutriments", {})),
serving_size=p_raw.get("serving_size"),
image_url=p_raw.get("image_url"),
last_modified=p_raw.get("last_modified_t"),
data_quality_tags=p_raw.get("data_quality_tags", []),
)
yield product
if len(products) < page_size:
break # Last page
time.sleep(1.0) # Be a good citizen
# Collect all breakfast cereals with Nutri-Score A or B
good_cereals = list(search_products(
query="cereal",
category="en:breakfast-cereals",
nutrition_grade="a",
max_pages=10,
))
print(f"Found {len(good_cereals)} breakfast cereals with Nutri-Score A")
Browsing by Category
If you want all products in a specific category without a text search, use the category endpoint directly. This is more reliable for bulk collection:
def browse_category(
category_tag: str, # e.g. "en:biscuits-and-cakes"
max_pages: int = 50,
) -> Iterator[FoodProduct]:
"""Browse all products in a category using the category endpoint."""
for page in range(1, max_pages + 1):
url = f"{OFF_BASE}/category/{category_tag}.json"
resp = client.get(url, params={"page": page, "fields": STANDARD_FIELDS})
if resp.status_code != 200:
break
data = resp.json()
products = data.get("products", [])
if not products:
break
for p_raw in products:
barcode = p_raw.get("code", "")
if barcode:
yield FoodProduct(
barcode=barcode,
name=p_raw.get("product_name", ""),
brands=p_raw.get("brands", ""),
nutriscore_grade=p_raw.get("nutriscore_grade"),
nova_group=p_raw.get("nova_group"),
nutrition=parse_nutriments(p_raw.get("nutriments", {})),
allergens_tags=p_raw.get("allergens_tags", []),
)
time.sleep(0.5)
# Example: all chips and crisps
for product in browse_category("en:chips-and-crisps", max_pages=20):
print(f"{product.name} ({product.brands}) — Nutri-Score: {product.nutriscore_grade}")
Handling Allergen Data Correctly
Allergen handling in Open Food Facts has nuance. There are four separate fields:
allergens: Raw text from the product label (e.g. "Contains: milk, soy, wheat")
allergens_tags: Normalized tags extracted from the allergens field (e.g. ["en:milk", "en:gluten"])
allergens_from_ingredients: Allergens detected by analyzing the ingredients text (may catch things the explicit allergens field missed)
traces: "May contain traces of" allergens (cross-contamination risk)
For any application where safety matters, use all four:
# Standard 14 EU allergens (EN tags)
EU_ALLERGENS = {
"en:gluten": "Gluten",
"en:crustaceans": "Crustaceans",
"en:eggs": "Eggs",
"en:fish": "Fish",
"en:peanuts": "Peanuts",
"en:soybeans": "Soybeans",
"en:milk": "Milk",
"en:nuts": "Nuts",
"en:celery": "Celery",
"en:mustard": "Mustard",
"en:sesame-seeds": "Sesame",
"en:sulphur-dioxide-and-sulphites": "Sulphites",
"en:lupin": "Lupin",
"en:molluscs": "Molluscs",
}
def parse_allergens_comprehensive(product_data: dict) -> dict:
"""
Extract all allergen information from a raw product dict.
Returns {"contains": [...], "may_contain": [...]} plus the raw text fields.
"""
contains = set()
may_contain = set()
# From explicit allergens field
for tag in product_data.get("allergens_tags", []):
human_name = EU_ALLERGENS.get(tag)
if human_name:
contains.add(human_name)
# From ingredients analysis (may catch additional allergens)
for tag in product_data.get("allergens_from_ingredients_tags", []):
human_name = EU_ALLERGENS.get(tag)
if human_name:
contains.add(human_name)
# Traces / may contain
for tag in product_data.get("traces_tags", []):
human_name = EU_ALLERGENS.get(tag)
if human_name and human_name not in contains:
may_contain.add(human_name)
return {
"contains": sorted(contains),
"may_contain": sorted(may_contain - contains),
"allergen_text_raw": product_data.get("allergens", ""),
"traces_text_raw": product_data.get("traces", ""),
}
# Example output for Nutella:
# {
# "contains": ["Milk", "Nuts"],
# "may_contain": [],
# "allergen_text_raw": "en:milk, en:nuts",
# "traces_text_raw": ""
# }
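The matching step against a user's avoid-list can be reduced to a self-contained sketch. The abbreviated TAG_NAMES map here stands in for the full EU_ALLERGENS table above:

```python
# Abbreviated tag -> name map for the sketch (the real map has 14 EU allergens)
TAG_NAMES = {"en:milk": "Milk", "en:nuts": "Nuts", "en:soybeans": "Soybeans"}

def flag_allergens(product: dict, avoid: set[str]) -> dict:
    """Split a user's avoid-list into confirmed hits and trace-only hits."""
    contains = {TAG_NAMES.get(t) for t in product.get("allergens_tags", [])} - {None}
    traces = {TAG_NAMES.get(t) for t in product.get("traces_tags", [])} - {None}
    return {
        "contains": sorted(avoid & contains),
        "may_contain": sorted((avoid & traces) - contains),
    }

print(flag_allergens(
    {"allergens_tags": ["en:milk", "en:nuts"], "traces_tags": ["en:soybeans"]},
    {"Milk", "Soybeans"},
))
# {'contains': ['Milk'], 'may_contain': ['Soybeans']}
```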
Async Bulk Collection for Large Datasets
For collecting thousands of products efficiently, the async approach with httpx is dramatically faster than sequential requests:
import asyncio
import httpx
async def fetch_product_async(
client: httpx.AsyncClient,
barcode: str,
semaphore: asyncio.Semaphore,
) -> Optional[FoodProduct]:
"""Fetch a single product asynchronously."""
async with semaphore:
try:
resp = await client.get(
f"{OFF_BASE}/api/v2/product/{barcode}.json",
params={"fields": STANDARD_FIELDS},
)
if resp.status_code != 200:
return None
data = resp.json()
if data.get("status") != 1:
return None
p = data["product"]
return FoodProduct(
barcode=barcode,
name=p.get("product_name", ""),
brands=p.get("brands", ""),
nutriscore_grade=p.get("nutriscore_grade"),
nova_group=p.get("nova_group"),
nutrition=parse_nutriments(p.get("nutriments", {})),
allergens_tags=p.get("allergens_tags", []),
)
except Exception as e:
print(f"Failed to fetch {barcode}: {e}")
return None
async def bulk_fetch_products(
barcodes: list[str],
concurrency: int = 5,
) -> list[FoodProduct]:
"""Fetch many products concurrently with rate limiting."""
semaphore = asyncio.Semaphore(concurrency)
headers = {
"User-Agent": "NutritionBot/2.0 ([email protected])",
"Accept": "application/json",
}
async with httpx.AsyncClient(headers=headers, timeout=20.0) as client:
tasks = [
fetch_product_async(client, barcode, semaphore)
for barcode in barcodes
]
results = await asyncio.gather(*tasks, return_exceptions=True)
products = []
for result in results:
if isinstance(result, FoodProduct):
products.append(result)
elif isinstance(result, Exception):
print(f"Exception during fetch: {result}")
return products
# Usage
barcodes = ["3017620422003", "5449000000996", "8000500310427"] # Nutella, Coca-Cola, Kinder
products = asyncio.run(bulk_fetch_products(barcodes, concurrency=5))
print(f"Fetched {len(products)} products")
Cross-Referencing with Commercial Grocery Sites
Open Food Facts gives you nutrition data, but commercial grocery sites have prices, availability, store locations, and promotional data that OFF does not. Cross-referencing these sources creates richer datasets — but commercial grocery sites actively block scrapers.
This is where residential proxy rotation becomes necessary. Commercial grocery sites (Tesco, Walmart, Carrefour, etc.) use IP reputation systems that ban datacenter IPs within minutes. Residential proxies from ThorData route your requests through real ISP IP addresses, making them indistinguishable from normal shopper traffic.
import httpx
import asyncio
from typing import Optional
THORDATA_USER = "your_thordata_username"
THORDATA_PASS = "your_thordata_password"
def make_proxied_client(country: str = "gb") -> httpx.Client:
"""Create an httpx client routing through ThorData residential proxies."""
proxy_url = f"http://{THORDATA_USER}-country-{country}:{THORDATA_PASS}@proxy.thordata.com:9000"
return httpx.Client(
proxy=proxy_url,
headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-GB,en;q=0.9",
},
timeout=30.0,
follow_redirects=True,
)
async def get_product_price_tesco(barcode: str, country: str = "gb") -> Optional[dict]:
"""
Look up a product's price on Tesco using their unofficial search API.
Requires UK residential IP — use ThorData UK proxies.
"""
proxy_url = f"http://{THORDATA_USER}-country-{country}:{THORDATA_PASS}@proxy.thordata.com:9000"
async with httpx.AsyncClient(
proxy=proxy_url,
headers={
"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile Safari/604.1",
"Accept": "application/json",
},
timeout=20.0,
) as client:
# Tesco product API (found via DevTools inspection)
resp = await client.get(
"https://api.tesco.com/shoppingexperience/v1/api/products",
params={"query": barcode, "offset": 0, "limit": 5},
headers={"x-api-key": "your_intercepted_key"},
)
if resp.status_code != 200:
return None
data = resp.json()
items = data.get("products", {}).get("results", [])
if not items:
return None
item = items[0]
return {
"barcode": barcode,
"retailer": "tesco",
"name": item.get("name", ""),
"price": item.get("price", {}).get("actual"),
"unit_price": item.get("unitPrice", {}).get("price"),
"in_stock": item.get("available", False),
"url": f"https://www.tesco.com/groceries/en-GB/products/{item.get('id')}",
}
async def enrich_with_prices(
products: list[FoodProduct],
country: str = "gb",
) -> list[dict]:
"""Enrich Open Food Facts data with current retail prices."""
enriched = []
semaphore = asyncio.Semaphore(3) # Conservative concurrency for commercial sites
async def enrich_one(product: FoodProduct) -> dict:
async with semaphore:
base_data = asdict(product)
price_data = await get_product_price_tesco(product.barcode, country)
if price_data:
base_data["retail_price"] = price_data.get("price")
base_data["retail_in_stock"] = price_data.get("in_stock")
base_data["retail_url"] = price_data.get("url")
await asyncio.sleep(1.5) # Rate limit for commercial sites
return base_data
tasks = [enrich_one(p) for p in products]
enriched = await asyncio.gather(*tasks, return_exceptions=False)
return list(enriched)
Retry Logic and Error Handling
Open Food Facts is generally reliable but network issues happen. Build robust retry logic:
import time
import functools
import random
from typing import TypeVar, Callable
T = TypeVar("T")
def retry_with_backoff(
max_attempts: int = 3,
base_wait: float = 1.0,
exceptions: tuple = (httpx.HTTPError, httpx.TimeoutException),
):
"""Decorator for automatic retry with exponential backoff."""
def decorator(func: Callable) -> Callable:
@functools.wraps(func)
def wrapper(*args, **kwargs):
for attempt in range(1, max_attempts + 1):
try:
return func(*args, **kwargs)
except exceptions as e:
if attempt == max_attempts:
raise
wait = base_wait * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
print(f"Attempt {attempt} failed ({e}), retrying in {wait:.1f}s...")
time.sleep(wait)
return wrapper
return decorator
@retry_with_backoff(max_attempts=3)
def get_product_robust(barcode: str) -> Optional[FoodProduct]:
"""Fetch product with automatic retry on network errors."""
resp = client.get(f"{OFF_BASE}/api/v2/product/{barcode}.json")
if resp.status_code == 404:
return None # Unknown barcode — no point retrying a 404
resp.raise_for_status() # Raises HTTPStatusError on other 4xx/5xx
data = resp.json()
if data.get("status") != 1:
return None # Product not found — don't retry this
p = data["product"]
return FoodProduct(
barcode=barcode,
name=p.get("product_name", ""),
brands=p.get("brands", ""),
nutriscore_grade=p.get("nutriscore_grade"),
nutrition=parse_nutriments(p.get("nutriments", {})),
allergens_tags=p.get("allergens_tags", []),
)
def bulk_fetch_robust(
barcodes: list[str],
delay_between: float = 0.5,
) -> tuple[list[FoodProduct], list[str]]:
"""
Fetch products sequentially with error handling.
Returns (successful products, failed barcodes).
"""
products = []
failed = []
for i, barcode in enumerate(barcodes):
try:
product = get_product_robust(barcode)
if product:
products.append(product)
else:
print(f"[{i+1}/{len(barcodes)}] Not found: {barcode}")
except Exception as e:
print(f"[{i+1}/{len(barcodes)}] Failed {barcode}: {e}")
failed.append(barcode)
if i < len(barcodes) - 1:
time.sleep(delay_between)
return products, failed
Data Quality Assessment
Not all records in Open Food Facts are complete. Assess quality before including records in downstream analysis:
def assess_data_quality(product: FoodProduct) -> dict:
"""Score a product record's completeness for different use cases."""
has_basic = bool(product.name and product.brands)
has_nutrition = (
product.nutrition is not None and
product.nutrition.energy_kcal is not None and
product.nutrition.proteins is not None
)
has_full_nutrition = (
has_nutrition and
product.nutrition.fat is not None and
product.nutrition.carbohydrates is not None and
product.nutrition.sugars is not None and
product.nutrition.salt is not None
)
has_allergens = len(product.allergens_tags) > 0 or bool(product.allergens)
has_ingredients = bool(product.ingredients_text)
has_scores = product.nutriscore_grade is not None
has_nova = product.nova_group is not None
# Data quality warnings from OFF's own validation
quality_warnings = [
tag for tag in product.data_quality_tags
if "warning" in tag or "error" in tag
]
# Completeness score 0-100
checks = [has_basic, has_nutrition, has_full_nutrition,
has_allergens, has_ingredients, has_scores, has_nova]
score = int(100 * sum(checks) / len(checks))
return {
"barcode": product.barcode,
"completeness_score": score,
"has_basic_info": has_basic,
"has_nutrition": has_nutrition,
"has_full_nutrition": has_full_nutrition,
"has_allergen_data": has_allergens,
"has_ingredients": has_ingredients,
"has_nutriscore": has_scores,
"has_nova": has_nova,
"quality_warnings": quality_warnings,
"suitable_for": {
"nutrition_analysis": has_full_nutrition,
"allergen_app": has_allergens and has_basic,
"nutriscore_comparison": has_scores and has_basic,
"ingredient_analysis": has_ingredients,
"ml_training": score >= 70,
}
}
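When you want to filter raw search results before constructing FoodProduct objects at all, the same idea can run directly on the dicts — a minimal self-contained sketch using an equal-weight check list:

```python
def completeness_score(p: dict) -> int:
    """Rough 0-100 completeness score for a raw product dict from the API."""
    checks = [
        bool(p.get("product_name")),
        bool(p.get("brands")),
        bool(p.get("ingredients_text")),
        bool(p.get("allergens_tags")),
        p.get("nutriscore_grade") is not None,
        p.get("nova_group") is not None,
        p.get("nutriments", {}).get("energy-kcal_100g") is not None,
    ]
    return int(100 * sum(checks) / len(checks))
```

Filtering to, say, `completeness_score(p) >= 70` before parsing drops the emptiest records cheaply.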
Exporting Data
For analysis and storage, export to CSV or JSON:
import csv
import json
from pathlib import Path
def export_to_csv(products: list[FoodProduct], path: str):
"""Export products to CSV with flattened nutrition data."""
if not products:
return
rows = []
for p in products:
row = {
"barcode": p.barcode,
"name": p.name,
"brands": p.brands,
"quantity": p.quantity,
"categories": p.categories,
"allergens": p.allergens,
"traces": p.traces,
"nutriscore_grade": p.nutriscore_grade,
"nutriscore_score": p.nutriscore_score,
"nova_group": p.nova_group,
"ecoscore_grade": p.ecoscore_grade,
"serving_size": p.serving_size,
"image_url": p.image_url,
}
# Flatten nutrition
if p.nutrition:
for field_name, value in vars(p.nutrition).items():
row[f"nutrition_{field_name}"] = value
rows.append(row)
# Union of keys across all rows — products without nutrition lack nutrition_* keys
fieldnames = list(dict.fromkeys(k for r in rows for k in r))
with open(path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
print(f"Exported {len(rows)} products to {path}")
def export_to_jsonl(products: list[FoodProduct], path: str):
"""Export products to JSON Lines format (one JSON object per line)."""
with open(path, "w", encoding="utf-8") as f:
for p in products:
f.write(json.dumps(asdict(p), default=str) + "\n")
print(f"Exported {len(products)} products to {path}")
Using the Official Data Dump
For truly large-scale collection — millions of products — the API is the wrong tool. Open Food Facts publishes daily data dumps:
import gzip
import json
# Download the dump first (several GB), e.g.:
# curl -LO https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz
def download_and_process_dump():
"""
Process the full Open Food Facts database dump as a stream.
The CSV dump is ~10GB uncompressed; the JSON dump is ~30GB —
never load everything into memory at once.
"""
# Process as a stream
count = 0
cereals = []
with gzip.open("openfoodfacts-products.jsonl.gz", "rt", encoding="utf-8") as f:
for line in f:
if not line.strip():
continue
try:
p = json.loads(line)
# Filter to only breakfast cereals
categories = p.get("categories_tags", [])
if "en:breakfast-cereals" not in categories:
continue
cereals.append({
"barcode": p.get("code"),
"name": p.get("product_name", ""),
"nutriscore": p.get("nutriscore_grade"),
"kcal": p.get("nutriments", {}).get("energy-kcal_100g"),
})
count += 1
if count % 10000 == 0:
print(f"Processed {count} matching products...")
except json.JSONDecodeError:
continue
print(f"Total breakfast cereals found: {len(cereals)}")
return cereals
7 Real-World Applications
1. Allergen Alert Mobile App
The most immediately practical use case: an app that scans a barcode and tells users instantly whether a product contains their allergens:
def check_product_for_allergens(
barcode: str,
user_allergens: list[str], # e.g. ["Milk", "Gluten", "Nuts"]
) -> dict:
product = get_product_robust(barcode)
if not product:
return {"found": False, "barcode": barcode}
allergen_info = parse_allergens_comprehensive({
"allergens_tags": product.allergens_tags,
"allergens": product.allergens,
"traces_tags": [], # FoodProduct doesn't store traces_tags, so may_contain stays empty here
"traces": product.traces,
})
dangers = [a for a in user_allergens if a in allergen_info["contains"]]
warnings = [a for a in user_allergens if a in allergen_info["may_contain"]]
return {
"found": True,
"barcode": barcode,
"name": product.name,
"brand": product.brands,
"safe": len(dangers) == 0 and len(warnings) == 0,
"contains_allergens": dangers,
"may_contain_allergens": warnings,
"nutriscore": product.nutriscore_grade,
}
2. Nutri-Score Category Analysis
Compare the nutritional profile of products within a category to identify the healthiest options:
def analyze_category_nutrition(category: str) -> dict:
products = list(search_products("", category=category, max_pages=10))
# Filter to products with complete nutrition data
scored = [p for p in products if p.nutriscore_grade and p.nutrition and p.nutrition.energy_kcal]
grade_counts = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 0}
for p in scored:
grade = p.nutriscore_grade.lower()
if grade in grade_counts:
grade_counts[grade] += 1
# Best products by Nutri-Score
best = [p for p in scored if p.nutriscore_grade in ("a", "A")][:10]
return {
"category": category,
"total_products": len(products),
"with_nutriscore": len(scored),
"grade_distribution": grade_counts,
"best_products": [
{"name": p.name, "brand": p.brands, "grade": p.nutriscore_grade,
"kcal": p.nutrition.energy_kcal if p.nutrition else None}
for p in best
],
}
3. NOVA Processing Level Research
NOVA classifies foods 1-4 by processing level. Level 1 is unprocessed (fruits, vegetables, meat). Level 4 is ultra-processed (soft drinks, packaged snacks, reconstituted meat products). Collecting NOVA data at scale enables research into dietary patterns:
def analyze_nova_distribution(country: str = "en:france") -> dict:
all_products = list(search_products("", country=country, max_pages=20))
with_nova = [p for p in all_products if p.nova_group]
distribution = {1: [], 2: [], 3: [], 4: []}
for p in with_nova:
if p.nova_group in distribution:
distribution[p.nova_group].append(p)
return {
"country": country,
"total_analyzed": len(with_nova),
"nova_distribution": {
level: {
"count": len(products),
"percentage": round(100 * len(products) / len(with_nova), 1) if with_nova else 0,
"example_brands": list({p.brands for p in products[:5] if p.brands}),
}
for level, products in distribution.items()
}
}
4. Reformulation Tracking
Track how product recipes change over time by storing historical snapshots and comparing them:
import sqlite3
import json
def setup_tracking_db(db_path: str = "reformulation_tracking.db"):
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
barcode TEXT NOT NULL,
snapshot_date TEXT NOT NULL,
product_name TEXT,
nutriscore_grade TEXT,
sugar_per_100g REAL,
salt_per_100g REAL,
fat_per_100g REAL,
ingredients_text TEXT,
full_data JSON
)
""")
conn.commit()
return conn
def record_snapshot(conn: sqlite3.Connection, product: FoodProduct):
import datetime
conn.execute("""
INSERT INTO snapshots
(barcode, snapshot_date, product_name, nutriscore_grade,
sugar_per_100g, salt_per_100g, fat_per_100g, ingredients_text, full_data)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
product.barcode,
datetime.date.today().isoformat(),
product.name,
product.nutriscore_grade,
product.nutrition.sugars if product.nutrition else None,
product.nutrition.salt if product.nutrition else None,
product.nutrition.fat if product.nutrition else None,
product.ingredients_text,
json.dumps(asdict(product), default=str),
))
conn.commit()
5. Price Comparison Engine (with ThorData Proxies)
Combine Open Food Facts barcodes with real-time prices scraped from grocery retailers using residential proxies:
async def build_price_comparison(
category: str,
retailers: list[str] = ["tesco", "sainsburys", "waitrose"],
) -> list[dict]:
# Get products from OFF
products = list(search_products("", category=category, max_pages=3))
print(f"Loaded {len(products)} products from Open Food Facts")
enriched = []
semaphore = asyncio.Semaphore(2) # Low concurrency for retail sites
async def get_prices_for_product(product: FoodProduct) -> dict:
async with semaphore:
result = asdict(product)
result["prices"] = {}
for retailer in retailers:
proxy = f"http://{THORDATA_USER}-country-gb:{THORDATA_PASS}@proxy.thordata.com:9000"
async with httpx.AsyncClient(proxy=proxy, timeout=20.0) as client:
try:
# Each retailer has a different API pattern
# Intercept via DevTools/mitmproxy to find the right endpoint
resp = await client.get(
f"https://api.{retailer}.com/products",
params={"barcode": product.barcode},
headers={"User-Agent": "Mozilla/5.0 ..."},
)
if resp.status_code == 200:
data = resp.json()
result["prices"][retailer] = data.get("price")
except Exception:
result["prices"][retailer] = None
await asyncio.sleep(1.0)
return result
tasks = [get_prices_for_product(p) for p in products[:50]]
enriched = await asyncio.gather(*tasks)
return list(enriched)
6. Nutritional Label Compliance Checker
Automatically check whether products meet nutritional criteria for specific health claims or regulatory thresholds:
def check_health_claim_eligibility(product: FoodProduct) -> dict:
    """
    Check EU health claim eligibility based on nutrition data.
    Rules vary by claim — this covers common ones.
    """
    n = product.nutrition
    if not n:
        return {"error": "No nutrition data"}
    checks = {}
    # Low fat claim: ≤3g fat per 100g (solids), ≤1.5g per 100ml (liquids)
    if n.fat is not None:
        checks["low_fat"] = n.fat <= 3.0
    # Low sugar claim: ≤5g sugars per 100g
    if n.sugars is not None:
        checks["low_sugar"] = n.sugars <= 5.0
    # Low sodium/salt claim: ≤0.12g sodium per 100g
    if n.sodium is not None:
        checks["low_sodium"] = n.sodium <= 0.12
    elif n.salt is not None:
        checks["low_sodium"] = n.salt <= 0.3  # 0.12g sodium ≈ 0.3g salt
    # Source of protein: ≥12% energy from protein
    if n.energy_kcal and n.proteins:
        protein_energy_pct = (n.proteins * 4 / n.energy_kcal) * 100
        checks["source_of_protein"] = protein_energy_pct >= 12.0
        checks["high_protein"] = protein_energy_pct >= 20.0
    # High fiber: ≥6g fiber per 100g
    if n.fiber is not None:
        checks["high_fiber"] = n.fiber >= 6.0
        checks["source_of_fiber"] = n.fiber >= 3.0
    return {
        "barcode": product.barcode,
        "name": product.name,
        "eligible_claims": [claim for claim, eligible in checks.items() if eligible],
        "ineligible_claims": [claim for claim, eligible in checks.items() if not eligible],
        "checks": checks,
    }
7. ML Training Dataset Builder
Build labeled datasets for food image classification, ingredient parsing, or Nutri-Score prediction:
import csv
from pathlib import Path

def build_nutriscore_dataset(
    target_per_grade: int = 500,
    output_dir: str = "nutriscore_dataset",
) -> dict:
    """
    Build a balanced dataset for Nutri-Score prediction.
    target_per_grade: how many examples to collect per A/B/C/D/E grade.
    """
    Path(output_dir).mkdir(exist_ok=True)
    records_by_grade = {"a": [], "b": [], "c": [], "d": [], "e": []}
    collected_total = 0
    for grade in ["a", "b", "c", "d", "e"]:
        print(f"Collecting grade {grade.upper()} products...")
        for product in search_products(
            "",
            nutrition_grade=grade,
            max_pages=20,
        ):
            if len(records_by_grade[grade]) >= target_per_grade:
                break
            quality = assess_data_quality(product)
            if not quality["suitable_for"]["ml_training"]:
                continue
            n = product.nutrition
            if not n or not n.energy_kcal:
                continue
            records_by_grade[grade].append({
                "barcode": product.barcode,
                "label": grade,
                "energy_kcal": n.energy_kcal,
                "fat": n.fat,
                "saturated_fat": n.saturated_fat,
                "carbohydrates": n.carbohydrates,
                "sugars": n.sugars,
                "fiber": n.fiber,
                "proteins": n.proteins,
                "salt": n.salt,
            })
            collected_total += 1
    # Write to CSV
    all_records = []
    for grade_records in records_by_grade.values():
        all_records.extend(grade_records)
    output_path = f"{output_dir}/nutriscore_features.csv"
    if all_records:
        with open(output_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=all_records[0].keys())
            writer.writeheader()
            writer.writerows(all_records)
    return {
        "total_collected": collected_total,
        "per_grade": {g: len(r) for g, r in records_by_grade.items()},
        "output": output_path,
    }
Rate Limits and Being a Good Citizen
Open Food Facts does not publish strict rate limits, but their infrastructure is not Google-scale. They are a volunteer-run nonprofit. Practical guidelines that keep you in good standing:
- 1 request per second for search queries
- Up to 10 requests per second for individual barcode lookups (cached at CDN edge)
- Always include a descriptive User-Agent with contact information
- Use the fields parameter to reduce response sizes
- For bulk collection of millions of products, use the data dump instead of the API
- If you are building something commercial, consider contributing back to the project
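The 1 request/second guideline is easy to enforce in code with a small throttle that spaces calls by a minimum interval. This is just a pattern sketch, not an official client; the short interval here keeps the demo fast, and you would use `interval=1.0` for real search queries:

```python
import time

class Throttle:
    """Block until at least `interval` seconds have passed since the last call."""

    def __init__(self, interval: float):
        self.interval = interval
        self._last = 0.0

    def wait(self) -> None:
        elapsed_since_last = time.monotonic() - self._last
        if elapsed_since_last < self.interval:
            time.sleep(self.interval - elapsed_since_last)
        self._last = time.monotonic()

throttle = Throttle(interval=0.05)  # use 1.0 for real search queries
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # place this before each API request
elapsed = time.monotonic() - start
```

The first call goes through immediately; each later call waits out the remainder of the interval, so three calls take at least two full intervals.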
The data quality improves as more people contribute. If you notice missing or incorrect data during your scraping, use their product edit interface or API to contribute improvements. The project runs entirely on community contributions.
Summary
Open Food Facts is an exceptional free resource for food and nutrition data. The API is straightforward — no auth, sensible rate limits, well-documented fields. The data quality is genuinely good for major brands in Europe and North America and improving globally.
The patterns in this guide scale from a quick single-product lookup to collecting millions of records from the daily data dump. For cross-referencing with commercial grocery sites, residential proxies from ThorData handle the IP-based blocking that would otherwise prevent access. The output schemas and quality assessment functions give you a solid foundation for building reliable data pipelines on top of the raw API.
Start with the single-product lookup, test with a few known barcodes, then scale up to category searches and bulk collection as you understand the data shape you are working with.