Scraping the FDA Drug Database: Approvals, Adverse Events & Recalls with Python (2026)
The FDA maintains one of the most valuable public drug databases in the world. And unlike most government data, it's actually accessible — the openFDA API is well-documented, returns clean JSON, and doesn't require authentication for basic use.
If you're doing pharmaceutical research, tracking drug safety signals, building pharmacovigilance tools, or constructing a drug information product, openFDA is where you start. This guide covers pulling drug approvals, adverse event reports, recall data, and drug labeling using Python, with production-grade error handling and storage patterns.
The openFDA API Overview
The API lives at https://api.fda.gov. No API key is required: unauthenticated clients get up to 240 requests per minute and 1,000 requests per day per IP. Registering a free key at open.fda.gov/apis/authentication/ keeps the same per-minute rate but raises the daily cap to 120,000 requests per key.
Main endpoints:
| Endpoint | Description |
|---|---|
| /drug/event.json | Adverse event reports (FAERS database) |
| /drug/label.json | Drug labeling / package inserts |
| /drug/enforcement.json | Recalls and market withdrawals |
| /drug/drugsfda.json | Drug approval information (NDA/ANDA) |
| /drug/ndc.json | NDC product codes and packaging |
| /device/event.json | Medical device adverse events |
| /food/enforcement.json | Food recalls |
All endpoints return JSON. All support search, count, skip, and limit parameters for filtering and pagination.
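To make that concrete, here's how a query string is assembled from those parameters. The drug name is purely illustrative; httpx or requests will do this encoding for you when you pass a params dict, but seeing the raw URL helps when debugging searches:

```python
from urllib.parse import urlencode

# Sketch of an openFDA query string. The search syntax
# (field:"value", AND, ranges) is shared across all endpoints.
params = {
    "search": 'openfda.brand_name:"aspirin"',
    "limit": 10,   # records per page (API max is 100)
    "skip": 20,    # offset for pagination
}
url = "https://api.fda.gov/drug/label.json?" + urlencode(params)
print(url)
```

Paste the printed URL into a browser and you'll get the same JSON the client code below consumes.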
Setting Up the Client
import httpx
import time
import json
import sqlite3
from datetime import datetime
BASE = "https://api.fda.gov"
API_KEY = "" # Optional: get a free key at open.fda.gov/apis/authentication/
def fda_request(endpoint: str, params: dict,
max_retries: int = 5) -> dict:
"""
Make an openFDA API request with retry logic and rate limit handling.
Implements exponential backoff for 429 responses.
"""
if API_KEY:
params["api_key"] = API_KEY
for attempt in range(max_retries):
try:
resp = httpx.get(
f"{BASE}{endpoint}",
params=params,
timeout=20,
follow_redirects=True,
)
if resp.status_code == 200:
return resp.json()
elif resp.status_code == 404:
# No results found — not an error
return {"results": [], "meta": {"results": {"total": 0}}}
elif resp.status_code == 429:
wait = 2 ** attempt * 10
print(f" Rate limited (attempt {attempt+1}). Waiting {wait}s...")
time.sleep(wait)
continue
elif resp.status_code >= 500:
wait = 2 ** attempt * 5
print(f" Server error {resp.status_code} (attempt {attempt+1}). Waiting {wait}s...")
time.sleep(wait)
continue
else:
resp.raise_for_status()
except httpx.TimeoutException:
wait = 2 ** attempt * 5
print(f" Timeout on attempt {attempt+1}. Waiting {wait}s...")
time.sleep(wait)
except httpx.ConnectError:
wait = 2 ** attempt * 10
print(f" Connection error on attempt {attempt+1}. Waiting {wait}s...")
time.sleep(wait)
# After max retries — return a placeholder and move on
print(f" Failed after {max_retries} attempts: {endpoint}")
return {"results": [], "error": "max_retries_exceeded"}
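One thing worth noticing in the retry loop above: the two backoff schedules grow quickly, so five failed attempts can stall a script for several minutes in the worst case. The waits per attempt work out to:

```python
# Waits (seconds) produced by fda_request's backoff branches, per attempt
waits_429 = [2 ** attempt * 10 for attempt in range(5)]  # rate-limit branch
waits_5xx = [2 ** attempt * 5 for attempt in range(5)]   # server-error branch
print(waits_429)  # [10, 20, 40, 80, 160]
print(waits_5xx)  # [5, 10, 20, 40, 80]
```

If that ceiling is too high for an interactive tool, cap the wait (e.g. `min(wait, 60)`) or lower `max_retries`.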
Pulling Drug Approval Data
The /drug/drugsfda.json endpoint covers approved drugs — both brand-name (NDA) and generic (ANDA):
def search_drug_approvals(drug_name: str, limit: int = 10) -> list:
"""Search for FDA drug approval records by name."""
params = {
"search": f'openfda.brand_name:"{drug_name}"',
"limit": limit,
}
data = fda_request("/drug/drugsfda.json", params)
return data.get("results", [])
def get_approval_details(application_number: str) -> dict | None:
"""Get full details for a specific NDA/ANDA application."""
params = {
"search": f'application_number:"{application_number}"',
"limit": 1,
}
data = fda_request("/drug/drugsfda.json", params)
results = data.get("results", [])
return results[0] if results else None
def parse_approval_record(record: dict) -> dict:
"""Flatten an approval record into a clean structure."""
products = []
for product in record.get("products", []):
products.append({
"brand_name": product.get("brand_name", ""),
"generic_name": product.get("active_ingredients", [{}])[0].get("name", ""),
"dosage_form": product.get("dosage_form", ""),
"route": product.get("route", ""),
"marketing_status": product.get("marketing_status", ""),
"te_code": product.get("te_code", ""), # therapeutic equivalence
})
submissions = []
for sub in record.get("submissions", []):
if sub.get("submission_type") in ("ORIG", "SUPPL"):
submissions.append({
"type": sub.get("submission_type"),
"number": sub.get("submission_number"),
"status": sub.get("submission_status"),
"date": sub.get("submission_status_date"),
"class_code": sub.get("submission_class_code"),
"class_description": sub.get("submission_class_code_description"),
})
return {
"application_number": record.get("application_number"),
"sponsor": record.get("sponsor_name"),
"openfda": record.get("openfda", {}),
"products": products,
"submissions": submissions,
}
# Example: look up ozempic approvals
approvals = search_drug_approvals("ozempic")
for rec in approvals:
parsed = parse_approval_record(rec)
print(f"\nApplication: {parsed['application_number']}")
print(f" Sponsor: {parsed['sponsor']}")
for prod in parsed['products'][:2]:
print(f" Product: {prod['brand_name']} ({prod['dosage_form']}) — {prod['marketing_status']}")
Mining Adverse Event Reports (FAERS)
The FAERS (FDA Adverse Event Reporting System) endpoint is the most data-rich. It contains millions of reports from healthcare providers, patients, and manufacturers about adverse drug reactions.
def get_adverse_events(drug_name: str, limit: int = 100,
skip: int = 0) -> dict:
"""Pull adverse event reports for a specific drug."""
params = {
"search": f'patient.drug.openfda.brand_name:"{drug_name}"',
"limit": min(limit, 100), # API max is 100 per request
"skip": skip,
}
return fda_request("/drug/event.json", params)
def get_total_event_count(drug_name: str) -> int:
"""Get total adverse event count for a drug."""
params = {
"search": f'patient.drug.openfda.brand_name:"{drug_name}"',
"limit": 1,
}
data = fda_request("/drug/event.json", params)
return data.get("meta", {}).get("results", {}).get("total", 0)
def collect_all_events(drug_name: str, max_records: int = 5000) -> list:
"""Paginate through all adverse event records for a drug."""
total = get_total_event_count(drug_name)
actual_max = min(total, max_records)
print(f" Total available: {total:,}. Collecting up to {actual_max:,}...")
all_results = []
skip = 0
while skip < actual_max:
batch_limit = min(100, actual_max - skip)
data = get_adverse_events(drug_name, limit=batch_limit, skip=skip)
results = data.get("results", [])
if not results:
break
all_results.extend(results)
skip += len(results)
if skip % 1000 == 0:
print(f" Collected {skip:,}/{actual_max:,} records...")
time.sleep(0.3) # Stay comfortably under rate limits
return all_results
events = collect_all_events("ozempic", max_records=1000)
print(f"Collected {len(events):,} adverse event reports")
Parsing and Analyzing Adverse Event Reports
The FAERS data structure is deeply nested. Here's how to flatten it for analysis:
def parse_adverse_event(event: dict) -> dict:
"""Flatten a FAERS adverse event report into a clean record."""
patient = event.get("patient", {})
# Primary suspect drugs (characterization = "1")
suspect_drugs = []
concomitant_drugs = []
for drug in patient.get("drug", []):
openfda = drug.get("openfda", {})
drug_info = {
"name": drug.get("medicinalproduct", ""),
"brand_names": openfda.get("brand_name", []),
"generic_names": openfda.get("generic_name", []),
"indication": drug.get("drugindication", ""),
"dose": drug.get("drugdosagetext", ""),
"route": drug.get("drugadministrationroute", ""),
"characterization": drug.get("drugcharacterization", ""),
# 1=suspect, 2=concomitant, 3=interacting
}
if drug.get("drugcharacterization") == "1":
suspect_drugs.append(drug_info)
else:
concomitant_drugs.append(drug_info)
# Reported reactions (MedDRA terms)
reactions = []
for reaction in patient.get("reaction", []):
reactions.append({
"term": reaction.get("reactionmeddrapt", ""),
"outcome": reaction.get("reactionoutcome", ""),
# 1=recovered, 2=recovering, 3=not recovered, 4=recovered with sequelae, 5=fatal, 6=unknown
})
# Outcomes
outcomes = {
"serious": event.get("serious") == "1",
"death": event.get("seriousnessdeath") == "1",
"hospitalized": event.get("seriousnesshospitalization") == "1",
"life_threatening": event.get("seriousnesslifethreatening") == "1",
"disability": event.get("seriousnessdisabling") == "1",
"congenital": event.get("seriousnesscongenitalanomali") == "1",
"other": event.get("seriousnessother") == "1",
}
return {
"report_id": event.get("safetyreportid", ""),
"receive_date": event.get("receivedate", ""),
"receipt_date": event.get("receiptdate", ""),
"report_type": event.get("reporttype", ""),
"reporter_country": event.get("primarysource", {}).get("reportercountry", ""),
"reporter_qualification": event.get("primarysource", {}).get("qualification", ""),
"patient_age": patient.get("patientonsetage", ""),
"patient_age_unit": patient.get("patientonsetageunit", ""),
"patient_sex": patient.get("patientsex", ""), # 0=unknown, 1=male, 2=female
"patient_weight_kg": patient.get("patientweight", ""),
"suspect_drugs": suspect_drugs,
"concomitant_drugs": concomitant_drugs,
"reactions": reactions,
"outcomes": outcomes,
}
def summarize_events(events: list) -> dict:
"""Summarize adverse events — top reactions, serious event rate, etc."""
from collections import Counter
parsed = [parse_adverse_event(e) for e in events]
all_reactions = [r["term"] for e in parsed for r in e["reactions"] if r["term"]]
reaction_counts = Counter(all_reactions).most_common(20)
serious_count = sum(1 for e in parsed if e["outcomes"]["serious"])
death_count = sum(1 for e in parsed if e["outcomes"]["death"])
hosp_count = sum(1 for e in parsed if e["outcomes"]["hospitalized"])
return {
"total_reports": len(parsed),
"serious_rate": serious_count / len(parsed) if parsed else 0,
"death_rate": death_count / len(parsed) if parsed else 0,
"hospitalization_rate": hosp_count / len(parsed) if parsed else 0,
"top_reactions": reaction_counts,
}
summary = summarize_events(events)
print(f"\nTotal reports: {summary['total_reports']:,}")
print(f"Serious rate: {summary['serious_rate']:.1%}")
print(f"Death rate: {summary['death_rate']:.1%}")
print(f"Hospitalization rate: {summary['hospitalization_rate']:.1%}")
print("\nTop reactions:")
for reaction, count in summary["top_reactions"][:10]:
print(f" {reaction}: {count}")
Tracking Drug Recalls
The enforcement endpoint tracks recalls, market withdrawals, and safety alerts:
def get_recalls(search_term: str = None, classification: str = None,
product_type: str = "drug", limit: int = 100,
skip: int = 0) -> list:
"""Search for FDA recalls."""
search_parts = []
if search_term:
search_parts.append(f'reason_for_recall:"{search_term}"')
if classification:
search_parts.append(f'classification:"{classification}"')
params = {
"search": " AND ".join(search_parts) if search_parts else None,
"limit": limit,
"skip": skip,
"sort": "report_date:desc",
}
# Remove None values
params = {k: v for k, v in params.items() if v is not None}
data = fda_request("/drug/enforcement.json", params)
return data.get("results", [])
def get_recent_recalls(days_back: int = 30, classification: str = "Class I") -> list:
"""Get recent high-severity drug recalls."""
import datetime
cutoff = (datetime.date.today() - datetime.timedelta(days=days_back)).strftime("%Y%m%d")
params = {
"search": f'report_date:[{cutoff} TO 20991231] AND classification:"{classification}"',
# Use literal spaces in the [X TO Y] range — httpx URL-encodes them;
# a hard-coded "+" would be double-encoded as %2B and break the query
"limit": 100,
"sort": "report_date:desc",
}
data = fda_request("/drug/enforcement.json", params)
return data.get("results", [])
def parse_recall(recall: dict) -> dict:
"""Parse a recall record."""
return {
"recall_number": recall.get("recall_number"),
"classification": recall.get("classification"),
# Class I: dangerous/defective
# Class II: may cause temporary adverse health consequences
# Class III: unlikely to cause adverse health consequences
"product_description": recall.get("product_description", ""),
"reason": recall.get("reason_for_recall", ""),
"action": recall.get("action", ""),
"firm": recall.get("recalling_firm", ""),
"city": recall.get("city", ""),
"state": recall.get("state", ""),
"country": recall.get("country", ""),
"report_date": recall.get("report_date", ""),
"recall_initiation_date": recall.get("recall_initiation_date", ""),
"termination_date": recall.get("termination_date"),
"distribution_pattern": recall.get("distribution_pattern", ""),
"product_quantity": recall.get("product_quantity", ""),
"lot_numbers": recall.get("code_info", ""),
"status": recall.get("status", ""), # Ongoing, Completed, Terminated
}
# Get recent Class I recalls
recent = get_recent_recalls(days_back=60, classification="Class I")
for recall in recent[:5]:
parsed = parse_recall(recall)
print(f"\n[{parsed['classification']}] {parsed['firm']}")
print(f" Product: {parsed['product_description'][:100]}")
print(f" Reason: {parsed['reason'][:100]}")
print(f" Date: {parsed['report_date']}")
Drug Labeling API
The labeling endpoint contains full package insert text, including indications, warnings, and contraindications:
def get_drug_label(drug_name: str) -> dict | None:
"""Get full drug labeling for a drug."""
params = {
"search": f'openfda.brand_name:"{drug_name}"',
"limit": 1,
}
data = fda_request("/drug/label.json", params)
results = data.get("results", [])
return results[0] if results else None
def extract_label_sections(label: dict) -> dict:
"""Extract key sections from a drug label."""
return {
"brand_name": label.get("openfda", {}).get("brand_name", []),
"generic_name": label.get("openfda", {}).get("generic_name", []),
"manufacturer": label.get("openfda", {}).get("manufacturer_name", []),
"product_type": label.get("openfda", {}).get("product_type", []),
"route": label.get("openfda", {}).get("route", []),
"indications": label.get("indications_and_usage", [""])[0][:500],
"contraindications": label.get("contraindications", [""])[0][:500],
"warnings": label.get("warnings", [""])[0][:500],
"adverse_reactions": label.get("adverse_reactions", [""])[0][:500],
"drug_interactions": label.get("drug_interactions", [""])[0][:500],
"dosage": label.get("dosage_and_administration", [""])[0][:300],
"effective_time": label.get("effective_time", ""),
}
label = get_drug_label("humira")
if label:
sections = extract_label_sections(label)
print(f"Drug: {sections['brand_name']}")
print(f"Generic: {sections['generic_name']}")
print(f"Route: {sections['route']}")
print(f"Indications: {sections['indications'][:200]}...")
Count Queries — Aggregation Without Pagination
openFDA supports count queries that return frequency distributions without paginating through individual records. These are much faster for analytics:
def count_adverse_events_by_reaction(drug_name: str, top_n: int = 20) -> list:
"""Get reaction frequency distribution for a drug."""
params = {
"search": f'patient.drug.openfda.brand_name:"{drug_name}"',
"count": "patient.reaction.reactionmeddrapt.exact",
"limit": top_n,
}
data = fda_request("/drug/event.json", params)
return data.get("results", [])
def count_recalls_by_firm(top_n: int = 20) -> list:
"""Get top firms by number of drug recalls."""
params = {
"count": "recalling_firm.exact",
"limit": top_n,
}
data = fda_request("/drug/enforcement.json", params)
return data.get("results", [])
def count_recalls_by_reason(top_n: int = 20) -> list:
"""Get top recall reasons."""
params = {
"count": "reason_for_recall.exact",
"limit": top_n,
}
data = fda_request("/drug/enforcement.json", params)
return data.get("results", [])
# Top adverse reactions for ozempic
reactions = count_adverse_events_by_reaction("ozempic", top_n=15)
print("Top adverse reactions for Ozempic:")
for r in reactions:
print(f" {r['term']}: {r['count']:,}")
Rate Limits and Proxy Rotation
openFDA is generous — 240 requests per minute whether or not you use a key; the key mainly raises the daily cap (from 1,000 to 120,000 requests). For large-scale collection (pulling millions of FAERS records across multiple drugs), those limits become binding.
Practical strategies:
- Register for an API key — free and instant, raises limits significantly
- Cache aggressively — FDA data is relatively stable; cache responses and set a refresh schedule
- Use count queries for aggregation — much faster than paginating through individual records
- Proxy rotation for parallel collection — when pulling from multiple endpoints simultaneously
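To stay under the per-minute cap without sprinkling sleep() calls everywhere, a small sliding-window limiter helps. This is our own sketch (RateLimiter is not part of httpx or openFDA):

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window throttle: at most max_calls per period seconds.
    RateLimiter(240, 60) keeps a single worker under openFDA's
    published per-minute limit."""
    def __init__(self, max_calls: int, period: float):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # monotonic timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())

limiter = RateLimiter(240, 60)
# limiter.wait()  # call before each fda_request(...)
```

Call limiter.wait() immediately before each fda_request() and a single worker will stay under 240 requests in any 60-second window.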
ThorData's proxy network works well for FDA API work. For API endpoints (as opposed to browser scraping), datacenter proxies are sufficient and cheaper than residential:
PROXY_CONFIGS = {
"host": "proxy.thordata.net",
"port": 10000,
"user": "your_thordata_user",
"pass": "your_thordata_password",
}
def fda_request_with_proxy(endpoint: str, params: dict,
session_id: int = None) -> dict:
"""FDA request routed through a proxy."""
import random
sid = session_id or random.randint(1000, 9999)
proxy_url = (
f"http://{PROXY_CONFIGS['user']}-session-{sid}:"
f"{PROXY_CONFIGS['pass']}@{PROXY_CONFIGS['host']}:{PROXY_CONFIGS['port']}"
)
if API_KEY:
params["api_key"] = API_KEY
resp = httpx.get(
f"{BASE}{endpoint}",
params=params,
timeout=20,
proxies={"http://": proxy_url, "https://": proxy_url},  # httpx < 0.26; newer releases use proxy=proxy_url
)
resp.raise_for_status()
return resp.json()
Storing Results in SQLite
For any serious data collection, persist to SQLite as you go:
def init_fda_db(db_path: str = "fda_data.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS adverse_events (
report_id TEXT PRIMARY KEY,
receive_date TEXT,
report_type TEXT,
patient_age TEXT,
patient_sex TEXT,
serious INTEGER,
death INTEGER,
hospitalized INTEGER,
suspect_drugs TEXT,
reactions TEXT,
raw_json TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS recalls (
recall_number TEXT PRIMARY KEY,
classification TEXT,
product_description TEXT,
reason TEXT,
firm TEXT,
city TEXT,
state TEXT,
report_date TEXT,
status TEXT,
raw_json TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS drug_approvals (
application_number TEXT PRIMARY KEY,
sponsor TEXT,
brand_names TEXT,
generic_names TEXT,
products_json TEXT,
submissions_json TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS drug_labels (
set_id TEXT PRIMARY KEY,
brand_name TEXT,
generic_name TEXT,
manufacturer TEXT,
effective_time TEXT,
indications TEXT,
contraindications TEXT,
warnings TEXT,
adverse_reactions TEXT,
raw_json TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_date ON adverse_events(receive_date)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_recalls_date ON recalls(report_date)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_recalls_class ON recalls(classification)")
conn.commit()
return conn
def save_adverse_events(conn: sqlite3.Connection, events: list):
"""Bulk insert adverse events into SQLite."""
rows = []
for e in events:
parsed = parse_adverse_event(e)
rows.append((
parsed["report_id"],
parsed["receive_date"],
parsed["report_type"],
parsed["patient_age"],
parsed["patient_sex"],
1 if parsed["outcomes"]["serious"] else 0,
1 if parsed["outcomes"]["death"] else 0,
1 if parsed["outcomes"]["hospitalized"] else 0,
json.dumps(parsed["suspect_drugs"]),
json.dumps([r["term"] for r in parsed["reactions"]]),
json.dumps(e),
))
conn.executemany(
"INSERT OR REPLACE INTO adverse_events VALUES (?,?,?,?,?,?,?,?,?,?,?,CURRENT_TIMESTAMP)",
rows,
)
conn.commit()
print(f" Saved {len(rows)} adverse event records")
def save_recalls(conn: sqlite3.Connection, recalls: list):
"""Bulk insert recall records into SQLite."""
rows = []
for r in recalls:
parsed = parse_recall(r)
rows.append((
parsed["recall_number"],
parsed["classification"],
parsed["product_description"][:500],
parsed["reason"][:500],
parsed["firm"],
parsed["city"],
parsed["state"],
parsed["report_date"],
parsed["status"],
json.dumps(r),
))
conn.executemany(
"INSERT OR REPLACE INTO recalls VALUES (?,?,?,?,?,?,?,?,?,?,CURRENT_TIMESTAMP)",
rows,
)
conn.commit()
print(f" Saved {len(rows)} recall records")
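Once data is in SQLite, the JSON-encoded reactions column is directly queryable with SQLite's built-in json_each() table function (present in the SQLite bundled with current Python builds). A sketch against a trimmed-down version of the adverse_events table above, with made-up sample rows:

```python
import sqlite3
import json

# Minimal stand-in for the adverse_events schema defined earlier
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE adverse_events (
        report_id TEXT PRIMARY KEY,
        serious INTEGER,
        reactions TEXT  -- JSON array of MedDRA terms
    )
""")
rows = [
    ("1", 1, json.dumps(["Nausea", "Vomiting"])),
    ("2", 0, json.dumps(["Nausea"])),
    ("3", 1, json.dumps(["Headache"])),
]
conn.executemany("INSERT INTO adverse_events VALUES (?,?,?)", rows)

# Unpack each reactions array into rows and count term frequency
top = conn.execute("""
    SELECT je.value AS reaction, COUNT(*) AS n
    FROM adverse_events, json_each(adverse_events.reactions) AS je
    GROUP BY je.value
    ORDER BY n DESC
""").fetchall()
print(top)
```

This pushes the "top reactions" aggregation into SQL, so you never have to re-parse thousands of JSON blobs in Python.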
Complete Pipeline Example
Here's a full pipeline that builds a drug safety database:
def build_drug_safety_db(drug_names: list, db_path: str = "fda_safety.db"):
"""Build a comprehensive drug safety database for a list of drugs."""
conn = init_fda_db(db_path)
for drug_name in drug_names:
print(f"\nProcessing: {drug_name}")
# Adverse events
events = collect_all_events(drug_name, max_records=2000)
if events:
save_adverse_events(conn, events)
time.sleep(2)
# Recalls
recalls = get_recalls(search_term=drug_name, limit=100)
if recalls:
save_recalls(conn, recalls)
time.sleep(2)
# Approval data
approvals = search_drug_approvals(drug_name, limit=5)
for approval in approvals:
parsed = parse_approval_record(approval)
conn.execute(
"INSERT OR REPLACE INTO drug_approvals VALUES (?,?,?,?,?,?,CURRENT_TIMESTAMP)",
(
parsed["application_number"],
parsed["sponsor"],
json.dumps(parsed["openfda"].get("brand_name", [])),
json.dumps(parsed["openfda"].get("generic_name", [])),
json.dumps(parsed["products"]),
json.dumps(parsed["submissions"]),
)
)
conn.commit()
print(f" Completed {drug_name}")
time.sleep(3)
conn.close()
print(f"\nDatabase built: {db_path}")
# Build safety data for a set of GLP-1 drugs
build_drug_safety_db(["ozempic", "wegovy", "mounjaro", "zepbound"])
What You Can Build
The FDA data is a goldmine for health technology:
Drug safety dashboards — Visualize adverse event trends over time for specific drugs. Compare pre/post-approval signal rates. Track seasonal patterns in reporting.
Recall monitoring system — Build an alert system for new recalls in specific drug categories or for specific manufacturers. Integrate with Slack or email notifications for Class I recalls.
Pharmacovigilance signals — Use disproportionality analysis (PRR, ROR) to detect emerging safety signals before they hit mainstream news. Cross-reference FAERS reports where multiple suspect drugs co-occur.
Drug interaction analysis — Identify drugs that frequently co-appear in serious adverse event reports and correlate with specific reaction clusters.
Regulatory research tools — Track approval timelines, submission types, and approval rates by therapeutic area, sponsor, or drug type.
Clinical trial support — Cross-reference approved indications against reported adverse reactions to assess safety profiles for specific patient populations.
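As a concrete example of the pharmacovigilance idea: the proportional reporting ratio (PRR) compares how often a reaction is reported for one drug versus all other drugs. A minimal sketch with illustrative, made-up counts:

```python
def prr(a: int, b: int, c: int, d: int) -> float:
    """Proportional reporting ratio from a 2x2 contingency table.
    a: target drug, target reaction
    b: target drug, all other reactions
    c: all other drugs, target reaction
    d: all other drugs, all other reactions
    """
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts: 120 nausea reports out of 2,000 for the target
# drug vs 5,000 out of 500,000 for every other drug combined
print(round(prr(120, 1880, 5000, 495000), 2))  # 6.0
```

A PRR well above 2 with enough supporting reports is a common screening threshold; real signal detection adds a chi-square statistic or confidence interval before flagging anything, and the count queries above give you the `a` and `c` inputs cheaply.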
The openFDA API is one of the best public data sources available. Clean JSON, free access, well-documented, and covers decades of regulatory data. The hard part isn't getting the data — it's asking the right questions once you have it.