Scraping the FDA Drug Database: Approvals, Adverse Events & Recalls with Python (2026)
The FDA maintains one of the most valuable public drug databases in the world. And unlike most government data, it's actually accessible — the openFDA API is well-documented, returns clean JSON, and doesn't require authentication for basic use.
If you're doing pharmaceutical research, tracking drug safety signals, building pharmacovigilance tools, or constructing a drug information product, openFDA is where you start. This guide covers pulling drug approvals, adverse event reports, recall data, and drug labeling using Python, with production-grade error handling and storage patterns.
The openFDA API Overview
The API lives at https://api.fda.gov. No API key is required: unauthenticated clients get up to 240 requests per minute and 1,000 requests per day per IP. Registering a free key at open.fda.gov/apis/authentication/ keeps the same per-minute rate but raises the daily cap to 120,000 requests per key.
Main endpoints:
| Endpoint | Description |
|---|---|
| /drug/event.json | Adverse event reports (FAERS database) |
| /drug/label.json | Drug labeling / package inserts |
| /drug/enforcement.json | Recalls and market withdrawals |
| /drug/drugsfda.json | Drug approval information (NDA/ANDA) |
| /drug/ndc.json | NDC product codes and packaging |
| /device/event.json | Medical device adverse events |
| /food/enforcement.json | Food recalls |
All endpoints return JSON. All support search, count, skip, and limit parameters for filtering and pagination.
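To make that concrete, here's how a query string is assembled from those parameters. The drug name is purely illustrative; httpx or requests will do this encoding for you when you pass a params dict, but seeing the raw URL helps when debugging searches:

```python
from urllib.parse import urlencode

# Sketch of an openFDA query string. The search syntax
# (field:"value", AND, ranges) is shared across all endpoints.
params = {
    "search": 'openfda.brand_name:"aspirin"',
    "limit": 10,   # records per page (API max is 100)
    "skip": 20,    # offset for pagination
}
url = "https://api.fda.gov/drug/label.json?" + urlencode(params)
print(url)
```

Paste the printed URL into a browser and you'll get the same JSON the client code below consumes.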
Setting Up the Client
import httpx
import time
import json
import sqlite3
from datetime import datetime
BASE = "https://api.fda.gov"
API_KEY = "" # Optional: get a free key at open.fda.gov/apis/authentication/
def fda_request(endpoint: str, params: dict,
max_retries: int = 5) -> dict:
"""
Make an openFDA API request with retry logic and rate limit handling.
Implements exponential backoff for 429 responses.
"""
if API_KEY:
params["api_key"] = API_KEY
for attempt in range(max_retries):
try:
resp = httpx.get(
f"{BASE}{endpoint}",
params=params,
timeout=20,
follow_redirects=True,
)
if resp.status_code == 200:
return resp.json()
elif resp.status_code == 404:
# No results found — not an error
return {"results": [], "meta": {"results": {"total": 0}}}
elif resp.status_code == 429:
wait = 2 ** attempt * 10
print(f" Rate limited (attempt {attempt+1}). Waiting {wait}s...")
time.sleep(wait)
continue
elif resp.status_code >= 500:
wait = 2 ** attempt * 5
print(f" Server error {resp.status_code} (attempt {attempt+1}). Waiting {wait}s...")
time.sleep(wait)
continue
else:
resp.raise_for_status()
except httpx.TimeoutException:
wait = 2 ** attempt * 5
print(f" Timeout on attempt {attempt+1}. Waiting {wait}s...")
time.sleep(wait)
except httpx.ConnectError:
wait = 2 ** attempt * 10
print(f" Connection error on attempt {attempt+1}. Waiting {wait}s...")
time.sleep(wait)
# After max retries — return a placeholder and move on
print(f" Failed after {max_retries} attempts: {endpoint}")
return {"results": [], "error": "max_retries_exceeded"}
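One thing worth noticing in the retry loop above: the two backoff schedules grow quickly, so five failed attempts can stall a script for several minutes in the worst case. The waits per attempt work out to:

```python
# Waits (seconds) produced by fda_request's backoff branches, per attempt
waits_429 = [2 ** attempt * 10 for attempt in range(5)]  # rate-limit branch
waits_5xx = [2 ** attempt * 5 for attempt in range(5)]   # server-error branch
print(waits_429)  # [10, 20, 40, 80, 160]
print(waits_5xx)  # [5, 10, 20, 40, 80]
```

If that ceiling is too high for an interactive tool, cap the wait (e.g. `min(wait, 60)`) or lower `max_retries`.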
Pulling Drug Approval Data
The /drug/drugsfda.json endpoint covers approved drugs — both brand-name (NDA) and generic (ANDA):
def search_drug_approvals(drug_name: str, limit: int = 10) -> list:
"""Search for FDA drug approval records by name."""
params = {
"search": f'openfda.brand_name:"{drug_name}"',
"limit": limit,
}
data = fda_request("/drug/drugsfda.json", params)
return data.get("results", [])
def get_approval_details(application_number: str) -> dict | None:
"""Get full details for a specific NDA/ANDA application."""
params = {
"search": f'application_number:"{application_number}"',
"limit": 1,
}
data = fda_request("/drug/drugsfda.json", params)
results = data.get("results", [])
return results[0] if results else None
def parse_approval_record(record: dict) -> dict:
"""Flatten an approval record into a clean structure."""
products = []
for product in record.get("products", []):
products.append({
"brand_name": product.get("brand_name", ""),
"generic_name": product.get("active_ingredients", [{}])[0].get("name", ""),
"dosage_form": product.get("dosage_form", ""),
"route": product.get("route", ""),
"marketing_status": product.get("marketing_status", ""),
"te_code": product.get("te_code", ""), # therapeutic equivalence
})
submissions = []
for sub in record.get("submissions", []):
if sub.get("submission_type") in ("ORIG", "SUPPL"):
submissions.append({
"type": sub.get("submission_type"),
"number": sub.get("submission_number"),
"status": sub.get("submission_status"),
"date": sub.get("submission_status_date"),
"class_code": sub.get("submission_class_code"),
"class_description": sub.get("submission_class_code_description"),
})
return {
"application_number": record.get("application_number"),
"sponsor": record.get("sponsor_name"),
"openfda": record.get("openfda", {}),
"products": products,
"submissions": submissions,
}
# Example: look up ozempic approvals
approvals = search_drug_approvals("ozempic")
for rec in approvals:
parsed = parse_approval_record(rec)
print(f"\nApplication: {parsed['application_number']}")
print(f" Sponsor: {parsed['sponsor']}")
for prod in parsed['products'][:2]:
print(f" Product: {prod['brand_name']} ({prod['dosage_form']}) — {prod['marketing_status']}")
Mining Adverse Event Reports (FAERS)
The FAERS (FDA Adverse Event Reporting System) endpoint is the most data-rich. It contains millions of reports from healthcare providers, patients, and manufacturers about adverse drug reactions.
def get_adverse_events(drug_name: str, limit: int = 100,
skip: int = 0) -> dict:
"""Pull adverse event reports for a specific drug."""
params = {
"search": f'patient.drug.openfda.brand_name:"{drug_name}"',
"limit": min(limit, 100), # API max is 100 per request
"skip": skip,
}
return fda_request("/drug/event.json", params)
def get_total_event_count(drug_name: str) -> int:
"""Get total adverse event count for a drug."""
params = {
"search": f'patient.drug.openfda.brand_name:"{drug_name}"',
"limit": 1,
}
data = fda_request("/drug/event.json", params)
return data.get("meta", {}).get("results", {}).get("total", 0)
def collect_all_events(drug_name: str, max_records: int = 5000) -> list:
"""Paginate through all adverse event records for a drug."""
total = get_total_event_count(drug_name)
actual_max = min(total, max_records)
print(f" Total available: {total:,}. Collecting up to {actual_max:,}...")
all_results = []
skip = 0
while skip < actual_max:
batch_limit = min(100, actual_max - skip)
data = get_adverse_events(drug_name, limit=batch_limit, skip=skip)
results = data.get("results", [])
if not results:
break
all_results.extend(results)
skip += len(results)
if skip % 1000 == 0:
print(f" Collected {skip:,}/{actual_max:,} records...")
time.sleep(0.3) # Stay comfortably under rate limits
return all_results
events = collect_all_events("ozempic", max_records=1000)
print(f"Collected {len(events):,} adverse event reports")
Parsing and Analyzing Adverse Event Reports
The FAERS data structure is deeply nested. Here's how to flatten it for analysis:
def parse_adverse_event(event: dict) -> dict:
"""Flatten a FAERS adverse event report into a clean record."""
patient = event.get("patient", {})
# Primary suspect drugs (characterization = "1")
suspect_drugs = []
concomitant_drugs = []
for drug in patient.get("drug", []):
openfda = drug.get("openfda", {})
drug_info = {
"name": drug.get("medicinalproduct", ""),
"brand_names": openfda.get("brand_name", []),
"generic_names": openfda.get("generic_name", []),
"indication": drug.get("drugindication", ""),
"dose": drug.get("drugdosagetext", ""),
"route": drug.get("drugadministrationroute", ""),
"characterization": drug.get("drugcharacterization", ""),
# 1=suspect, 2=concomitant, 3=interacting
}
if drug.get("drugcharacterization") == "1":
suspect_drugs.append(drug_info)
else:
concomitant_drugs.append(drug_info)
# Reported reactions (MedDRA terms)
reactions = []
for reaction in patient.get("reaction", []):
reactions.append({
"term": reaction.get("reactionmeddrapt", ""),
"outcome": reaction.get("reactionoutcome", ""),
# 1=recovered, 2=recovering, 3=not recovered, 4=recovered with sequelae, 5=fatal, 6=unknown
})
# Outcomes
outcomes = {
"serious": event.get("serious") == "1",
"death": event.get("seriousnessdeath") == "1",
"hospitalized": event.get("seriousnesshospitalization") == "1",
"life_threatening": event.get("seriousnesslifethreatening") == "1",
"disability": event.get("seriousnessdisabling") == "1",
"congenital": event.get("seriousnesscongenitalanomali") == "1",
"other": event.get("seriousnessother") == "1",
}
return {
"report_id": event.get("safetyreportid", ""),
"receive_date": event.get("receivedate", ""),
"receipt_date": event.get("receiptdate", ""),
"report_type": event.get("reporttype", ""),
"reporter_country": event.get("primarysource", {}).get("reportercountry", ""),
"reporter_qualification": event.get("primarysource", {}).get("qualification", ""),
"patient_age": patient.get("patientonsetage", ""),
"patient_age_unit": patient.get("patientonsetageunit", ""),
"patient_sex": patient.get("patientsex", ""), # 0=unknown, 1=male, 2=female
"patient_weight_kg": patient.get("patientweight", ""),
"suspect_drugs": suspect_drugs,
"concomitant_drugs": concomitant_drugs,
"reactions": reactions,
"outcomes": outcomes,
}
def summarize_events(events: list) -> dict:
"""Summarize adverse events — top reactions, serious event rate, etc."""
from collections import Counter
parsed = [parse_adverse_event(e) for e in events]
all_reactions = [r["term"] for e in parsed for r in e["reactions"] if r["term"]]
reaction_counts = Counter(all_reactions).most_common(20)
serious_count = sum(1 for e in parsed if e["outcomes"]["serious"])
death_count = sum(1 for e in parsed if e["outcomes"]["death"])
hosp_count = sum(1 for e in parsed if e["outcomes"]["hospitalized"])
return {
"total_reports": len(parsed),
"serious_rate": serious_count / len(parsed) if parsed else 0,
"death_rate": death_count / len(parsed) if parsed else 0,
"hospitalization_rate": hosp_count / len(parsed) if parsed else 0,
"top_reactions": reaction_counts,
}
summary = summarize_events(events)
print(f"\nTotal reports: {summary['total_reports']:,}")
print(f"Serious rate: {summary['serious_rate']:.1%}")
print(f"Death rate: {summary['death_rate']:.1%}")
print(f"Hospitalization rate: {summary['hospitalization_rate']:.1%}")
print("\nTop reactions:")
for reaction, count in summary["top_reactions"][:10]:
print(f" {reaction}: {count}")
Tracking Drug Recalls
The enforcement endpoint tracks recalls, market withdrawals, and safety alerts:
def get_recalls(search_term: str = None, classification: str = None,
product_type: str = "drug", limit: int = 100,
skip: int = 0) -> list:
"""Search for FDA recalls."""
search_parts = []
if search_term:
search_parts.append(f'reason_for_recall:"{search_term}"')
if classification:
search_parts.append(f'classification:"{classification}"')
params = {
"search": " AND ".join(search_parts) if search_parts else None,
"limit": limit,
"skip": skip,
"sort": "report_date:desc",
}
# Remove None values
params = {k: v for k, v in params.items() if v is not None}
data = fda_request("/drug/enforcement.json", params)
return data.get("results", [])
def get_recent_recalls(days_back: int = 30, classification: str = "Class I") -> list:
"""Get recent high-severity drug recalls."""
import datetime
cutoff = (datetime.date.today() - datetime.timedelta(days=days_back)).strftime("%Y%m%d")
params = {
"search": f'report_date:[{cutoff} TO 20991231] AND classification:"{classification}"',
# Use literal spaces in the [X TO Y] range — httpx URL-encodes them;
# a hard-coded "+" would be double-encoded as %2B and break the query
"limit": 100,
"sort": "report_date:desc",
}
data = fda_request("/drug/enforcement.json", params)
return data.get("results", [])
def parse_recall(recall: dict) -> dict:
"""Parse a recall record."""
return {
"recall_number": recall.get("recall_number"),
"classification": recall.get("classification"),
# Class I: dangerous/defective
# Class II: may cause temporary adverse health consequences
# Class III: unlikely to cause adverse health consequences
"product_description": recall.get("product_description", ""),
"reason": recall.get("reason_for_recall", ""),
"action": recall.get("action", ""),
"firm": recall.get("recalling_firm", ""),
"city": recall.get("city", ""),
"state": recall.get("state", ""),
"country": recall.get("country", ""),
"report_date": recall.get("report_date", ""),
"recall_initiation_date": recall.get("recall_initiation_date", ""),
"termination_date": recall.get("termination_date"),
"distribution_pattern": recall.get("distribution_pattern", ""),
"product_quantity": recall.get("product_quantity", ""),
"lot_numbers": recall.get("code_info", ""),
"status": recall.get("status", ""), # Ongoing, Completed, Terminated
}
# Get recent Class I recalls
recent = get_recent_recalls(days_back=60, classification="Class I")
for recall in recent[:5]:
parsed = parse_recall(recall)
print(f"\n[{parsed['classification']}] {parsed['firm']}")
print(f" Product: {parsed['product_description'][:100]}")
print(f" Reason: {parsed['reason'][:100]}")
print(f" Date: {parsed['report_date']}")
Drug Labeling API
The labeling endpoint contains full package insert text, including indications, warnings, and contraindications:
def get_drug_label(drug_name: str) -> dict | None:
"""Get full drug labeling for a drug."""
params = {
"search": f'openfda.brand_name:"{drug_name}"',
"limit": 1,
}
data = fda_request("/drug/label.json", params)
results = data.get("results", [])
return results[0] if results else None
def extract_label_sections(label: dict) -> dict:
"""Extract key sections from a drug label."""
return {
"brand_name": label.get("openfda", {}).get("brand_name", []),
"generic_name": label.get("openfda", {}).get("generic_name", []),
"manufacturer": label.get("openfda", {}).get("manufacturer_name", []),
"product_type": label.get("openfda", {}).get("product_type", []),
"route": label.get("openfda", {}).get("route", []),
"indications": label.get("indications_and_usage", [""])[0][:500],
"contraindications": label.get("contraindications", [""])[0][:500],
"warnings": label.get("warnings", [""])[0][:500],
"adverse_reactions": label.get("adverse_reactions", [""])[0][:500],
"drug_interactions": label.get("drug_interactions", [""])[0][:500],
"dosage": label.get("dosage_and_administration", [""])[0][:300],
"effective_time": label.get("effective_time", ""),
}
label = get_drug_label("humira")
if label:
sections = extract_label_sections(label)
print(f"Drug: {sections['brand_name']}")
print(f"Generic: {sections['generic_name']}")
print(f"Route: {sections['route']}")
print(f"Indications: {sections['indications'][:200]}...")
Count Queries — Aggregation Without Pagination
openFDA supports count queries that return frequency distributions without paginating through individual records. These are much faster for analytics:
def count_adverse_events_by_reaction(drug_name: str, top_n: int = 20) -> list:
"""Get reaction frequency distribution for a drug."""
params = {
"search": f'patient.drug.openfda.brand_name:"{drug_name}"',
"count": "patient.reaction.reactionmeddrapt.exact",
"limit": top_n,
}
data = fda_request("/drug/event.json", params)
return data.get("results", [])
def count_recalls_by_firm(top_n: int = 20) -> list:
"""Get top firms by number of drug recalls."""
params = {
"count": "recalling_firm.exact",
"limit": top_n,
}
data = fda_request("/drug/enforcement.json", params)
return data.get("results", [])
def count_recalls_by_reason(top_n: int = 20) -> list:
"""Get top recall reasons."""
params = {
"count": "reason_for_recall.exact",
"limit": top_n,
}
data = fda_request("/drug/enforcement.json", params)
return data.get("results", [])
# Top adverse reactions for ozempic
reactions = count_adverse_events_by_reaction("ozempic", top_n=15)
print("Top adverse reactions for Ozempic:")
for r in reactions:
print(f" {r['term']}: {r['count']:,}")
Rate Limits and Proxy Rotation
openFDA is generous — 240 requests per minute whether or not you use a key; the key mainly raises the daily cap (from 1,000 to 120,000 requests). For large-scale collection (pulling millions of FAERS records across multiple drugs), those limits become binding.
Practical strategies:
- Register for an API key — free and instant, raises limits significantly
- Cache aggressively — FDA data is relatively stable; cache responses and set a refresh schedule
- Use count queries for aggregation — much faster than paginating through individual records
- Proxy rotation for parallel collection — when pulling from multiple endpoints simultaneously
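To stay under the per-minute cap without sprinkling sleep() calls everywhere, a small sliding-window limiter helps. This is our own sketch (RateLimiter is not part of httpx or openFDA):

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window throttle: at most max_calls per period seconds.
    RateLimiter(240, 60) keeps a single worker under openFDA's
    published per-minute limit."""
    def __init__(self, max_calls: int, period: float):
        self.max_calls = max_calls
        self.period = period
        self.calls = deque()  # monotonic timestamps of recent calls

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())

limiter = RateLimiter(240, 60)
# limiter.wait()  # call before each fda_request(...)
```

Call limiter.wait() immediately before each fda_request() and a single worker will stay under 240 requests in any 60-second window.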
ThorData's proxy network works well for FDA API work. For API endpoints (as opposed to browser scraping), datacenter proxies are sufficient and cheaper than residential:
PROXY_CONFIGS = {
"host": "proxy.thordata.net",
"port": 10000,
"user": "your_thordata_user",
"pass": "your_thordata_password",
}
def fda_request_with_proxy(endpoint: str, params: dict,
session_id: int = None) -> dict:
"""FDA request routed through a proxy."""
import random
sid = session_id or random.randint(1000, 9999)
proxy_url = (
f"http://{PROXY_CONFIGS['user']}-session-{sid}:"
f"{PROXY_CONFIGS['pass']}@{PROXY_CONFIGS['host']}:{PROXY_CONFIGS['port']}"
)
if API_KEY:
params["api_key"] = API_KEY
resp = httpx.get(
f"{BASE}{endpoint}",
params=params,
timeout=20,
proxies={"http://": proxy_url, "https://": proxy_url},  # httpx < 0.26; newer releases use proxy=proxy_url
)
resp.raise_for_status()
return resp.json()
Storing Results in SQLite
For any serious data collection, persist to SQLite as you go:
def init_fda_db(db_path: str = "fda_data.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS adverse_events (
report_id TEXT PRIMARY KEY,
receive_date TEXT,
report_type TEXT,
patient_age TEXT,
patient_sex TEXT,
serious INTEGER,
death INTEGER,
hospitalized INTEGER,
suspect_drugs TEXT,
reactions TEXT,
raw_json TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS recalls (
recall_number TEXT PRIMARY KEY,
classification TEXT,
product_description TEXT,
reason TEXT,
firm TEXT,
city TEXT,
state TEXT,
report_date TEXT,
status TEXT,
raw_json TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS drug_approvals (
application_number TEXT PRIMARY KEY,
sponsor TEXT,
brand_names TEXT,
generic_names TEXT,
products_json TEXT,
submissions_json TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS drug_labels (
set_id TEXT PRIMARY KEY,
brand_name TEXT,
generic_name TEXT,
manufacturer TEXT,
effective_time TEXT,
indications TEXT,
contraindications TEXT,
warnings TEXT,
adverse_reactions TEXT,
raw_json TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_date ON adverse_events(receive_date)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_recalls_date ON recalls(report_date)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_recalls_class ON recalls(classification)")
conn.commit()
return conn
def save_adverse_events(conn: sqlite3.Connection, events: list):
"""Bulk insert adverse events into SQLite."""
rows = []
for e in events:
parsed = parse_adverse_event(e)
rows.append((
parsed["report_id"],
parsed["receive_date"],
parsed["report_type"],
parsed["patient_age"],
parsed["patient_sex"],
1 if parsed["outcomes"]["serious"] else 0,
1 if parsed["outcomes"]["death"] else 0,
1 if parsed["outcomes"]["hospitalized"] else 0,
json.dumps(parsed["suspect_drugs"]),
json.dumps([r["term"] for r in parsed["reactions"]]),
json.dumps(e),
))
conn.executemany(
"INSERT OR REPLACE INTO adverse_events VALUES (?,?,?,?,?,?,?,?,?,?,?,CURRENT_TIMESTAMP)",
rows,
)
conn.commit()
print(f" Saved {len(rows)} adverse event records")
def save_recalls(conn: sqlite3.Connection, recalls: list):
"""Bulk insert recall records into SQLite."""
rows = []
for r in recalls:
parsed = parse_recall(r)
rows.append((
parsed["recall_number"],
parsed["classification"],
parsed["product_description"][:500],
parsed["reason"][:500],
parsed["firm"],
parsed["city"],
parsed["state"],
parsed["report_date"],
parsed["status"],
json.dumps(r),
))
conn.executemany(
"INSERT OR REPLACE INTO recalls VALUES (?,?,?,?,?,?,?,?,?,?,CURRENT_TIMESTAMP)",
rows,
)
conn.commit()
print(f" Saved {len(rows)} recall records")
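Once data is in SQLite, the JSON-encoded reactions column is directly queryable with SQLite's built-in json_each() table function (present in the SQLite bundled with current Python builds). A sketch against a trimmed-down version of the adverse_events table above, with made-up sample rows:

```python
import sqlite3
import json

# Minimal stand-in for the adverse_events schema defined earlier
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE adverse_events (
        report_id TEXT PRIMARY KEY,
        serious INTEGER,
        reactions TEXT  -- JSON array of MedDRA terms
    )
""")
rows = [
    ("1", 1, json.dumps(["Nausea", "Vomiting"])),
    ("2", 0, json.dumps(["Nausea"])),
    ("3", 1, json.dumps(["Headache"])),
]
conn.executemany("INSERT INTO adverse_events VALUES (?,?,?)", rows)

# Unpack each reactions array into rows and count term frequency
top = conn.execute("""
    SELECT je.value AS reaction, COUNT(*) AS n
    FROM adverse_events, json_each(adverse_events.reactions) AS je
    GROUP BY je.value
    ORDER BY n DESC
""").fetchall()
print(top)
```

This pushes the "top reactions" aggregation into SQL, so you never have to re-parse thousands of JSON blobs in Python.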
Complete Pipeline Example
Here's a full pipeline that builds a drug safety database:
def build_drug_safety_db(drug_names: list, db_path: str = "fda_safety.db"):
"""Build a comprehensive drug safety database for a list of drugs."""
conn = init_fda_db(db_path)
for drug_name in drug_names:
print(f"\nProcessing: {drug_name}")
# Adverse events
events = collect_all_events(drug_name, max_records=2000)
if events:
save_adverse_events(conn, events)
time.sleep(2)
# Recalls
recalls = get_recalls(search_term=drug_name, limit=100)
if recalls:
save_recalls(conn, recalls)
time.sleep(2)
# Approval data
approvals = search_drug_approvals(drug_name, limit=5)
for approval in approvals:
parsed = parse_approval_record(approval)
conn.execute(
"INSERT OR REPLACE INTO drug_approvals VALUES (?,?,?,?,?,?,CURRENT_TIMESTAMP)",
(
parsed["application_number"],
parsed["sponsor"],
json.dumps(parsed["openfda"].get("brand_name", [])),
json.dumps(parsed["openfda"].get("generic_name", [])),
json.dumps(parsed["products"]),
json.dumps(parsed["submissions"]),
)
)
conn.commit()
print(f" Completed {drug_name}")
time.sleep(3)
conn.close()
print(f"\nDatabase built: {db_path}")
# Build safety data for a set of GLP-1 drugs
build_drug_safety_db(["ozempic", "wegovy", "mounjaro", "zepbound"])
What You Can Build
The FDA data is a goldmine for health technology:
Drug safety dashboards — Visualize adverse event trends over time for specific drugs. Compare pre/post-approval signal rates. Track seasonal patterns in reporting.
Recall monitoring system — Build an alert system for new recalls in specific drug categories or for specific manufacturers. Integrate with Slack or email notifications for Class I recalls.
Pharmacovigilance signals — Use disproportionality analysis (PRR, ROR) to detect emerging safety signals before they hit mainstream news. Cross-reference FAERS reports where multiple suspect drugs co-occur.
Drug interaction analysis — Identify drugs that frequently co-appear in serious adverse event reports and correlate with specific reaction clusters.
Regulatory research tools — Track approval timelines, submission types, and approval rates by therapeutic area, sponsor, or drug type.
Clinical trial support — Cross-reference approved indications against reported adverse reactions to assess safety profiles for specific patient populations.
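As a concrete example of the pharmacovigilance idea: the proportional reporting ratio (PRR) compares how often a reaction is reported for one drug versus all other drugs. A minimal sketch with illustrative, made-up counts:

```python
def prr(a: int, b: int, c: int, d: int) -> float:
    """Proportional reporting ratio from a 2x2 contingency table.
    a: target drug, target reaction
    b: target drug, all other reactions
    c: all other drugs, target reaction
    d: all other drugs, all other reactions
    """
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts: 120 nausea reports out of 2,000 for the target
# drug vs 5,000 out of 500,000 for every other drug combined
print(round(prr(120, 1880, 5000, 495000), 2))  # 6.0
```

A PRR well above 2 with enough supporting reports is a common screening threshold; real signal detection adds a chi-square statistic or confidence interval before flagging anything, and the count queries above give you the `a` and `c` inputs cheaply.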
The openFDA API is one of the best public data sources available. Clean JSON, free access, well-documented, and covers decades of regulatory data. The hard part isn't getting the data — it's asking the right questions once you have it.