
Scraping WebMD: Medical Conditions, Symptoms & Drug Data with Python (2026)

WebMD is one of the largest medical reference sites on the web. It has structured data on thousands of conditions, drugs, symptoms, and user-submitted reviews. If you're building a health research dataset, comparing drug interactions, aggregating patient experiences, or feeding medical content into an NLP pipeline — the data is there, and it's publicly accessible.

This guide covers scraping condition pages, drug interaction data, user reviews, and symptom mappings from WebMD using Python. We'll deal with their anti-bot measures, set up SQLite storage, integrate residential proxies, and walk through practical use cases for the data you collect.

A Note on Ethics

Medical data carries weight. If you scrape it, be responsible with it. Don't republish it as your own medical advice. Don't use it to build misleading health products. WebMD's content is written and reviewed by medical professionals — treat it accordingly.

Also: respect their robots.txt and rate-limit your requests. Hammering a health information site helps nobody, and getting your IP banned on the first day kills any research project before it starts.
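Checking robots.txt doesn't require any third-party library. Here's a minimal sketch using the standard library's urllib.robotparser, written against already-fetched robots.txt text so it's easy to test; the rules shown are illustrative, not WebMD's actual policy:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, robots_txt: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt text you've already fetched."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Illustrative rules -- fetch the real file from /robots.txt in practice
robots = "User-agent: *\nDisallow: /private/"
print(is_allowed("https://www.webmd.com/diabetes/type-2-diabetes", robots))
```

In a real crawler you'd fetch https://www.webmd.com/robots.txt once at startup and gate every request through a check like this.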

What Data Is Available

WebMD organizes content across several distinct categories:

- Condition pages — overview, symptoms, causes, and treatment sections for thousands of conditions
- Drug pages — uses, side effects, interactions, warnings, and dosage notes
- User reviews — ratings and comments on drugs and treatments
- Symptom references — A-to-Z symptom listings that map back to conditions

Setup

pip install httpx beautifulsoup4 lxml

I'm using httpx over requests — it handles HTTP/2 and async better, which matters when you're making many sequential requests with delays. lxml is significantly faster than Python's built-in HTML parser for large pages. sqlite3 ships with Python's standard library, so it needs no separate install.

Understanding WebMD's URL Structure

WebMD uses predictable URL patterns that make enumeration straightforward:

# Condition pages
https://www.webmd.com/[category]/[condition-slug]
https://www.webmd.com/diabetes/type-2-diabetes
https://www.webmd.com/heart-disease/atrial-fibrillation/
https://www.webmd.com/migraines-headaches/migraines-headaches-migraines

# Drug pages
https://www.webmd.com/drugs/2/drug-[drug-id]/[drug-name]/details

# Drug reviews
https://www.webmd.com/drugs/drugreview-[drug-id]-[drug-name].aspx

# Symptom index
https://www.webmd.com/a-to-z-guides/symptoms-a-z

Understanding these patterns lets you build a crawler that works systematically from index pages rather than guessing URLs.
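The patterns above can be captured as a few small URL builders. The slugs and drug IDs below are placeholders for illustration, not verified pages:

```python
# URL builders mirroring WebMD's path patterns.
BASE = "https://www.webmd.com"

def condition_url(category: str, slug: str) -> str:
    return f"{BASE}/{category}/{slug}"

def drug_detail_url(drug_id: int, drug_name: str) -> str:
    return f"{BASE}/drugs/2/drug-{drug_id}/{drug_name}/details"

def drug_review_url(drug_id: int, drug_name: str) -> str:
    return f"{BASE}/drugs/drugreview-{drug_id}-{drug_name}.aspx"

print(condition_url("diabetes", "type-2-diabetes"))
# -> https://www.webmd.com/diabetes/type-2-diabetes
```

Centralizing URL construction like this keeps your crawler working from index pages rather than hand-written URL strings scattered through the code.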

Scraping Condition Pages

WebMD condition pages follow a structured layout with sections for symptoms, causes, treatments, and related conditions:

import httpx
from bs4 import BeautifulSoup
import time
import json
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Upgrade-Insecure-Requests": "1",
}

def scrape_condition(url: str, client: httpx.Client = None) -> dict:
    """Scrape a WebMD condition page for structured data."""
    use_client = client or httpx.Client(headers=headers, follow_redirects=True, timeout=15)

    try:
        resp = use_client.get(url)
        resp.raise_for_status()
    except httpx.HTTPStatusError as e:
        print(f"HTTP {e.response.status_code} for {url}")
        return {}

    # Check for a Cloudflare challenge before bothering to parse
    if is_challenge_page(resp.text):
        print(f"Challenge page detected for {url}")
        return {}

    soup = BeautifulSoup(resp.text, "lxml")

    data = {
        "url": url,
        "title": "",
        "summary": "",
        "sections": {},
        "related": [],
        "tags": [],
        "last_reviewed": "",
    }

    # Title
    title_tag = soup.select_one("h1")
    if title_tag:
        data["title"] = title_tag.get_text(strip=True)

    # Meta description as summary
    meta_desc = soup.select_one("meta[name='description']")
    if meta_desc:
        data["summary"] = meta_desc.get("content", "")

    # Last reviewed date
    reviewed = soup.select_one("[class*='reviewed'], [class*='byline'] time")
    if reviewed:
        data["last_reviewed"] = reviewed.get_text(strip=True)

    # Main content sections
    for section in soup.select("div.article-page section, div.article-body section, section.article-section"):
        heading = section.select_one("h2, h3")
        if heading:
            section_title = heading.get_text(strip=True)
            paragraphs = [p.get_text(strip=True) for p in section.select("p") if p.get_text(strip=True)]
            bullet_points = [li.get_text(strip=True) for li in section.select("ul li")]
            data["sections"][section_title] = {
                "text": paragraphs,
                "bullet_points": bullet_points,
            }

    # Tags/categories
    for tag in soup.select("a[href*='/a-to-z-guides/'], nav.breadcrumb a"):
        tag_text = tag.get_text(strip=True)
        if tag_text and len(tag_text) > 2:
            data["tags"].append(tag_text)

    # Related conditions sidebar
    for link in soup.select("div.related-conditions a, nav[class*='related'] a, div[class*='related-links'] a"):
        href = link.get("href", "")
        if "webmd.com" in href or href.startswith("/"):
            data["related"].append({
                "title": link.get_text(strip=True),
                "url": href if href.startswith("http") else "https://www.webmd.com" + href,
            })

    return data

Extracting Drug Information

WebMD's drug database is particularly structured. Each drug page follows a consistent layout:

def scrape_drug_page(drug_url: str, client: httpx.Client = None) -> dict:
    """Extract drug information from a WebMD drug page."""
    use_client = client or httpx.Client(headers=headers, follow_redirects=True, timeout=15)

    resp = use_client.get(drug_url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    drug_data = {
        "url": drug_url,
        "name": "",
        "generic_name": "",
        "drug_class": "",
        "uses": [],
        "side_effects_common": [],
        "side_effects_serious": [],
        "interactions": [],
        "warnings": [],
        "dosage_notes": "",
        "pregnancy_category": "",
        "controlled_substance": False,
    }

    # Drug name and class
    name_tag = soup.select_one("h1, .drug-name, [class*='DrugName']")
    if name_tag:
        drug_data["name"] = name_tag.get_text(strip=True)

    generic_tag = soup.select_one("[class*='generic-name'], .generic")
    if generic_tag:
        drug_data["generic_name"] = generic_tag.get_text(strip=True)

    class_tag = soup.select_one("[class*='drug-class']")
    if class_tag:
        drug_data["drug_class"] = class_tag.get_text(strip=True)

    # Uses section
    uses_section = soup.find(["h2", "h3"], string=lambda t: t and ("use" in t.lower() or "treat" in t.lower()))
    if uses_section:
        container = uses_section.find_next(["div", "ul"])
        if container:
            drug_data["uses"] = [
                item.get_text(strip=True)
                for item in container.select("p, li")
                if item.get_text(strip=True)
            ]

    # Side effects — split into common vs serious
    side_section = soup.find(["h2", "h3"], string=lambda t: t and "side effect" in t.lower())
    if side_section:
        container = side_section.find_next("div")
        if container:
            all_effects = [li.get_text(strip=True) for li in container.select("li")]
            # WebMD typically lists common first, then serious with a subheading
            serious_marker = container.find(string=lambda t: t and "serious" in t.lower())
            if serious_marker:
                marker_text = serious_marker.parent.get_text(strip=True)
                idx = all_effects.index(marker_text) if marker_text in all_effects else len(all_effects)
                drug_data["side_effects_common"] = all_effects[:idx]
                drug_data["side_effects_serious"] = all_effects[idx:]
            else:
                drug_data["side_effects_common"] = all_effects

    # Drug interactions
    interact_section = soup.find(["h2", "h3"], string=lambda t: t and "interaction" in t.lower())
    if interact_section:
        container = interact_section.find_next("div")
        if container:
            drug_data["interactions"] = [li.get_text(strip=True) for li in container.select("li")]

    # Warnings (black box, pregnancy warnings)
    warning_section = soup.find(["h2", "h3"], string=lambda t: t and ("warning" in t.lower() or "precaution" in t.lower()))
    if warning_section:
        container = warning_section.find_next("div")
        if container:
            drug_data["warnings"] = [p.get_text(strip=True) for p in container.select("p, li")]

    # Controlled substance check
    if "controlled substance" in resp.text.lower() or "schedule ii" in resp.text.lower():
        drug_data["controlled_substance"] = True

    return drug_data


def scrape_drug_index(letter: str = "A", client: httpx.Client = None) -> list:
    """Scrape the drug index for a given letter to get all drug URLs."""
    use_client = client or httpx.Client(headers=headers, follow_redirects=True, timeout=15)

    url = f"https://www.webmd.com/drugs/2/alpha/{letter}"
    resp = use_client.get(url)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "lxml")
    drugs = []

    for link in soup.select("a[href*='/drugs/2/drug-']"):
        drugs.append({
            "name": link.get_text(strip=True),
            "url": link.get("href", ""),
        })

    return drugs

Scraping User Reviews

WebMD has user-submitted drug and condition reviews. These are valuable for sentiment analysis and patient experience research:

def scrape_reviews(reviews_url: str, max_pages: int = 10, client: httpx.Client = None) -> list:
    """Scrape user reviews from a WebMD drug/condition review page."""
    use_client = client or httpx.Client(headers=headers, follow_redirects=True, timeout=15)
    all_reviews = []

    for page in range(1, max_pages + 1):
        url = f"{reviews_url}?page={page}" if page > 1 else reviews_url

        try:
            resp = use_client.get(url)
            if resp.status_code != 200:
                break
        except Exception as e:
            print(f"Error on page {page}: {e}")
            break

        soup = BeautifulSoup(resp.text, "lxml")

        # WebMD review selectors (multiple possible formats)
        review_cards = soup.select(
            "div.review-card, div.user-review, div[class*='drugReview'], "
            "div[class*='review-content'], div.patientReview"
        )

        if not review_cards:
            # Check if we're on the last page
            no_more = soup.select_one("[class*='no-reviews'], [class*='noResults']")
            if no_more or page > 1:
                break

        for card in review_cards:
            review = {"page": page}

            # Rating — WebMD uses 1-5 scale
            rating_el = card.select_one(
                ".rating, .review-rating, [class*='userRating'], "
                "span[class*='stars'], [itemprop='ratingValue']"
            )
            if rating_el:
                rating_text = rating_el.get("content") or rating_el.get_text(strip=True)
                try:
                    review["rating"] = float(rating_text.replace("/5", "").strip())
                except (ValueError, AttributeError):
                    review["rating"] = None

            # Review text
            comment_el = card.select_one(
                ".review-comment, .comment-text, [class*='reviewText'], "
                "p.review-body, [itemprop='reviewBody']"
            )
            review["comment"] = comment_el.get_text(strip=True) if comment_el else None

            # Condition the reviewer was treating
            condition_el = card.select_one(".condition, .review-condition, [class*='conditionTreated']")
            review["condition"] = condition_el.get_text(strip=True) if condition_el else None

            # Time on medication
            time_el = card.select_one("[class*='timeMedication'], [class*='duration']")
            review["time_on_medication"] = time_el.get_text(strip=True) if time_el else None

            # Effectiveness, ease of use, satisfaction sub-ratings
            sub_ratings = {}
            for sub in card.select("[class*='subRating'], [class*='category-rating']"):
                label_el = sub.select_one("[class*='label'], span:first-child")
                value_el = sub.select_one("[class*='value'], [class*='score']")
                if label_el and value_el:
                    sub_ratings[label_el.get_text(strip=True)] = value_el.get_text(strip=True)
            review["sub_ratings"] = sub_ratings

            # Date posted
            date_el = card.select_one("time, [class*='reviewDate'], [class*='date']")
            if date_el:
                review["date"] = date_el.get("datetime") or date_el.get_text(strip=True)
            else:
                review["date"] = None

            if review.get("comment"):  # Only add reviews with content
                all_reviews.append(review)

        print(f"  Page {page}: {len(review_cards)} reviews (total: {len(all_reviews)})")
        time.sleep(4)  # Be respectful between pages

    return all_reviews

Dealing with Anti-Bot Protections

WebMD uses Cloudflare and fingerprinting on some pages. You'll notice 403 responses or challenge pages if you hit it too fast or with bare headers.

def is_challenge_page(html: str) -> bool:
    """Check if response is a Cloudflare or bot-detection challenge page."""
    markers = [
        "challenge-platform",
        "cf-browser-verification",
        "Just a moment",
        "Checking if the site connection is secure",
        "_cf_chl_opt",
        "Enable JavaScript and cookies",
    ]
    return any(marker in html for marker in markers)


def make_safe_request(url: str, client: httpx.Client, max_retries: int = 3) -> httpx.Response | None:
    """Make a request with retry logic and challenge detection."""
    for attempt in range(max_retries):
        try:
            resp = client.get(url)

            if is_challenge_page(resp.text):
                print(f"  Challenge page on attempt {attempt + 1}, backing off...")
                time.sleep(30 * (attempt + 1))  # Linear backoff: 30s, 60s, 90s
                continue

            if resp.status_code == 429:
                print(f"  Rate limited, waiting {60 * (attempt + 1)}s...")
                time.sleep(60 * (attempt + 1))
                continue

            resp.raise_for_status()
            return resp

        except httpx.HTTPStatusError as e:
            if e.response.status_code in [403, 503]:
                print(f"  HTTP {e.response.status_code} on attempt {attempt + 1}")
                time.sleep(15 * (attempt + 1))
            else:
                raise

    print(f"  Failed after {max_retries} attempts: {url}")
    return None

What works in 2026: residential proxies, user-agent rotation, HTTP/2 clients, and generous backoff.

For residential proxies, ThorData works well for medical site scraping. Their rotating residential IPs handle Cloudflare challenges and the geo-targeting is useful when WebMD serves different content by region. Setup:

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
]

def make_client(proxy_url: str = None, rotate_ua: bool = True) -> httpx.Client:
    """Create an httpx client with optional proxy and UA rotation."""
    current_headers = headers.copy()
    if rotate_ua:
        current_headers["User-Agent"] = random.choice(USER_AGENTS)

    return httpx.Client(
        headers=current_headers,
        proxy=proxy_url,  # e.g., "http://USER:[email protected]:9000"
        follow_redirects=True,
        timeout=20,
        http2=True,  # HTTP/2 looks more like a real browser
    )

# Rotate client every N requests
def create_rotating_scraper(proxy_url: str):
    """Create a scraper that rotates UA every 10 requests."""
    request_count = 0
    client = make_client(proxy_url)

    def get(url: str) -> httpx.Response:
        nonlocal request_count, client
        if request_count and request_count % 10 == 0:
            client.close()
            client = make_client(proxy_url)  # Fresh UA
        request_count += 1
        return client.get(url)

    return get

Building a Symptom Database

With the condition scraper, you can build a structured symptom-to-condition mapping:

def build_symptom_index(conditions: list[dict]) -> dict:
    """Build a reverse index: symptom -> list of conditions."""
    index = {}
    for condition in conditions:
        title = condition.get("title", "")
        for section_name, section_data in condition.get("sections", {}).items():
            if "symptom" in section_name.lower():
                for symptom in section_data.get("bullet_points", []):
                    symptom_clean = symptom.lower().strip().rstrip(".").rstrip(",")
                    # Filter out very short or very long entries (likely not symptoms)
                    if 3 < len(symptom_clean) < 100:
                        if symptom_clean not in index:
                            index[symptom_clean] = []
                        if title not in index[symptom_clean]:
                            index[symptom_clean].append(title)
    return index


def top_symptoms_by_condition_count(index: dict, top_n: int = 20) -> list:
    """Find symptoms that appear across the most conditions."""
    return sorted(
        index.items(),
        key=lambda x: len(x[1]),
        reverse=True
    )[:top_n]
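A quick sanity check of the reverse index on hand-made data. This snippet inlines the core loop of build_symptom_index() so it runs standalone; the two condition records are toy examples:

```python
# Toy demonstration of the reverse index: symptom -> list of conditions.
conditions = [
    {"title": "Migraine", "sections": {"Symptoms": {"bullet_points": ["Nausea", "Light sensitivity"]}}},
    {"title": "Flu", "sections": {"Symptoms": {"bullet_points": ["Nausea", "Fever"]}}},
]

index: dict[str, list[str]] = {}
for cond in conditions:
    for name, data in cond["sections"].items():
        if "symptom" in name.lower():
            for s in data["bullet_points"]:
                index.setdefault(s.lower().strip("."), []).append(cond["title"])

print(index["nausea"])  # both conditions share this symptom
```

On real scraped data the same structure lets you answer "which conditions list this symptom?" with a single dictionary lookup.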

SQLite Storage

For any serious scraping job, plan your storage before you start. If the scraper crashes at page 500, you don't want to restart from zero:

import sqlite3
from datetime import datetime


def init_db(db_path: str = "webmd_data.db") -> sqlite3.Connection:
    """Initialize SQLite database with all required tables."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")  # Better concurrent write performance

    conn.execute("""
        CREATE TABLE IF NOT EXISTS conditions (
            url TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            summary TEXT,
            tags TEXT,  -- JSON array
            last_reviewed TEXT,
            data TEXT NOT NULL,  -- Full JSON
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS drugs (
            url TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            generic_name TEXT,
            drug_class TEXT,
            controlled_substance INTEGER DEFAULT 0,
            data TEXT NOT NULL,  -- Full JSON
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS reviews (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source_url TEXT NOT NULL,
            drug_or_condition TEXT,
            rating REAL,
            comment TEXT,
            condition_treated TEXT,
            time_on_medication TEXT,
            date_posted TEXT,
            sub_ratings TEXT,  -- JSON
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS symptoms (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            symptom TEXT NOT NULL,
            condition TEXT NOT NULL,
            UNIQUE(symptom, condition)
        )
    """)

    conn.execute("CREATE INDEX IF NOT EXISTS idx_conditions_title ON conditions(title)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_drugs_name ON drugs(name)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_reviews_url ON reviews(source_url)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_symptoms_symptom ON symptoms(symptom)")

    conn.commit()
    return conn


def save_condition(conn: sqlite3.Connection, condition: dict) -> bool:
    """Save a condition to the database. Returns True if new, False if updated."""
    try:
        # Check existence first so we can report new vs. updated accurately
        is_new = conn.execute(
            "SELECT 1 FROM conditions WHERE url = ?", (condition["url"],)
        ).fetchone() is None
        conn.execute(
            """INSERT OR REPLACE INTO conditions
               (url, title, summary, tags, last_reviewed, data)
               VALUES (?, ?, ?, ?, ?, ?)""",
            (
                condition["url"],
                condition.get("title", ""),
                condition.get("summary", ""),
                json.dumps(condition.get("tags", [])),
                condition.get("last_reviewed", ""),
                json.dumps(condition),
            )
        )
        conn.commit()
        return is_new
    except sqlite3.Error as e:
        print(f"DB error saving condition {condition.get('url')}: {e}")
        return False


def save_reviews(conn: sqlite3.Connection, source_url: str, drug_name: str, reviews: list) -> int:
    """Bulk save reviews to database."""
    saved = 0
    for review in reviews:
        try:
            conn.execute(
                """INSERT INTO reviews
                   (source_url, drug_or_condition, rating, comment, condition_treated,
                    time_on_medication, date_posted, sub_ratings)
                   VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
                (
                    source_url,
                    drug_name,
                    review.get("rating"),
                    review.get("comment"),
                    review.get("condition"),
                    review.get("time_on_medication"),
                    review.get("date"),
                    json.dumps(review.get("sub_ratings", {})),
                )
            )
            saved += 1
        except sqlite3.Error:
            continue
    conn.commit()
    return saved


def is_already_scraped(conn: sqlite3.Connection, url: str, table: str = "conditions") -> bool:
    """Check if a URL has already been scraped (for resume capability)."""
    cursor = conn.execute(f"SELECT 1 FROM {table} WHERE url = ?", (url,))
    return cursor.fetchone() is not None

Pagination and Crawling at Scale

Crawling all of WebMD's condition pages requires starting from an index:

def get_condition_urls_from_index(client: httpx.Client) -> list[str]:
    """Get all condition URLs from WebMD's A-Z index."""
    all_urls = []

    # WebMD's condition index
    index_url = "https://www.webmd.com/a-to-z-guides/medical-conditions"
    resp = client.get(index_url)
    soup = BeautifulSoup(resp.text, "lxml")

    for link in soup.select("a[href*='webmd.com'], a[href^='/']"):
        href = link.get("href", "")
        # Filter to condition-like URLs (exclude ads, nav links)
        if any(cat in href for cat in [
            "/diabetes/", "/heart-disease/", "/cancer/", "/mental-health/",
            "/digestive-disorders/", "/pain-management/", "/allergies/",
            "/cold-and-flu/", "/depression/", "/arthritis/"
        ]):
            full_url = href if href.startswith("http") else f"https://www.webmd.com{href}"
            if full_url not in all_urls:
                all_urls.append(full_url)

    return all_urls


def crawl_conditions(db_path: str, proxy_url: str = None,
                      max_conditions: int = 500) -> dict:
    """Full crawl pipeline: discover and scrape condition pages."""
    conn = init_db(db_path)
    client = make_client(proxy_url)

    print("Discovering condition URLs...")
    urls = get_condition_urls_from_index(client)
    print(f"Found {len(urls)} condition URLs")

    stats = {"new": 0, "updated": 0, "failed": 0, "skipped": 0}

    for i, url in enumerate(urls[:max_conditions]):
        if is_already_scraped(conn, url):
            stats["skipped"] += 1
            continue

        print(f"[{i+1}/{min(len(urls), max_conditions)}] {url}")

        # Rotate client every 15 requests
        if i % 15 == 0 and i > 0:
            client = make_client(proxy_url)

        condition = scrape_condition(url, client)

        if condition and condition.get("title"):
            saved = save_condition(conn, condition)

            # Also index symptoms
            symptom_index = build_symptom_index([condition])
            for symptom, conditions in symptom_index.items():
                for cond_title in conditions:
                    try:
                        conn.execute(
                            "INSERT OR IGNORE INTO symptoms (symptom, condition) VALUES (?, ?)",
                            (symptom, cond_title)
                        )
                    except sqlite3.Error:
                        pass
            conn.commit()

            stats["new" if saved else "updated"] += 1
            print(f"  OK: {condition['title']} ({len(condition['sections'])} sections)")
        else:
            stats["failed"] += 1
            print(f"  FAILED: {url}")

        # Respectful delay: 3-6 seconds with real jitter (requires `import random`)
        time.sleep(random.uniform(3, 6))

    conn.close()
    return stats

Querying Your Dataset

Once you have data in SQLite, it's easy to run analyses:

def query_conditions_by_symptom(db_path: str, symptom: str) -> list:
    """Find all conditions associated with a symptom."""
    conn = sqlite3.connect(db_path)
    results = conn.execute(
        """SELECT condition, COUNT(*) as match_count
           FROM symptoms
           WHERE LOWER(symptom) LIKE ?
           GROUP BY condition
           ORDER BY match_count DESC""",
        (f"%{symptom.lower()}%",)
    ).fetchall()
    conn.close()
    return results


def get_drug_interaction_summary(db_path: str, drug_name: str) -> dict:
    """Get interaction data for a drug."""
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT data FROM drugs WHERE LOWER(name) LIKE ?",
        (f"%{drug_name.lower()}%",)
    ).fetchone()
    conn.close()

    if row:
        data = json.loads(row[0])
        return {
            "drug": data["name"],
            "interactions": data.get("interactions", []),
            "warnings": data.get("warnings", []),
        }
    return {}


def top_rated_drugs_by_condition(db_path: str, condition: str, min_reviews: int = 10) -> list:
    """Rank drugs by average user rating for a given condition."""
    conn = sqlite3.connect(db_path)
    results = conn.execute(
        """SELECT drug_or_condition,
                  AVG(rating) as avg_rating,
                  COUNT(*) as review_count
           FROM reviews
           WHERE LOWER(condition_treated) LIKE ?
             AND rating IS NOT NULL
           GROUP BY drug_or_condition
           HAVING COUNT(*) >= ?
           ORDER BY avg_rating DESC""",
        (f"%{condition.lower()}%", min_reviews)
    ).fetchall()
    conn.close()
    return results

Complete Pipeline Example

Putting it all together — a script that scrapes a set of conditions, their drug reviews, and stores everything:

import sqlite3, json, time, random
import httpx
from bs4 import BeautifulSoup

PROXY_URL = "http://USER:[email protected]:9000"  # ThorData residential proxy

TARGET_CONDITIONS = [
    ("https://www.webmd.com/diabetes/type-2-diabetes", "type-2-diabetes"),
    ("https://www.webmd.com/heart-disease/atrial-fibrillation/", "atrial-fibrillation"),
    ("https://www.webmd.com/migraines-headaches/migraines-headaches-migraines", "migraines"),
    ("https://www.webmd.com/depression/guide/depression-diagnosis-tests", "depression"),
    ("https://www.webmd.com/arthritis/rheumatoid-arthritis-ra", "rheumatoid-arthritis"),
]

def run_full_pipeline(db_path: str = "webmd_data.db"):
    conn = init_db(db_path)
    client = make_client(PROXY_URL)

    for url, slug in TARGET_CONDITIONS:
        print(f"\n=== {slug} ===")

        if not is_already_scraped(conn, url):
            condition = scrape_condition(url, client)
            if condition.get("title"):
                save_condition(conn, condition)
                print(f"Saved condition: {condition['title']}")
        else:
            print("Already scraped, skipping")

        # Scrape reviews for this condition
        reviews_url = f"https://www.webmd.com/drugs/condition-{slug}"
        reviews = scrape_reviews(reviews_url, max_pages=5, client=client)
        if reviews:
            saved = save_reviews(conn, reviews_url, slug, reviews)
            print(f"Saved {saved} reviews")

        time.sleep(random.uniform(5, 10))

    # Summary stats
    stats = {}
    for table in ["conditions", "drugs", "reviews", "symptoms"]:
        count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        stats[table] = count

    print(f"\n=== Database Summary ===")
    for table, count in stats.items():
        print(f"  {table}: {count:,} records")

    conn.close()
    return stats

if __name__ == "__main__":
    run_full_pipeline()

What You Can Build

A few practical uses for this data:

- A health research dataset covering conditions, symptoms, and treatments
- Drug comparison tools built on interaction and side-effect data
- Patient-experience analysis over aggregated review sentiment
- Training or retrieval corpora for medical NLP pipelines

Rate Limits and Ethical Considerations

WebMD's data is rich and mostly well-structured in the HTML. The main challenge is scale — be patient with rate limiting, rotate your IPs, and store incrementally.

Practical rate limit guidelines based on real-world testing in 2026:

Target           Safe rate   Risk threshold
Condition pages  1 req/4s    > 1 req/2s
Drug pages       1 req/5s    > 1 req/3s
Review pages     1 req/6s    > 1 req/4s
Drug index       1 req/3s    > 1 req/2s
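These per-target intervals can be enforced centrally rather than with scattered sleep() calls. A minimal sketch (the target names mirror this article's page types, not any official API):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests, per target type."""

    def __init__(self, intervals: dict[str, float]):
        self.intervals = intervals          # seconds between requests
        self.last: dict[str, float] = {}    # last request time per target

    def wait(self, target: str) -> float:
        """Sleep if the target's interval hasn't elapsed; return sleep time."""
        now = time.monotonic()
        interval = self.intervals.get(target, 5.0)
        elapsed = now - self.last.get(target, float("-inf"))
        delay = max(0.0, interval - elapsed)
        if delay:
            time.sleep(delay)
        self.last[target] = time.monotonic()
        return delay

limiter = RateLimiter({"condition": 4, "drug": 5, "review": 6, "index": 3})
```

Call limiter.wait("drug") before each drug-page fetch and the safe rates above hold automatically, even when requests to different page types are interleaved.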

Always validate a response before parsing it — a 200 status whose body is actually a challenge page is a common failure mode. The is_challenge_page() function above handles this.
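The two checks combine into one validate-before-parse gate. A sketch using a trimmed marker list (the full list is in is_challenge_page() above):

```python
# Validate-before-parse: a 200 status alone is not enough.
CHALLENGE_MARKERS = ("Just a moment", "cf-browser-verification", "_cf_chl_opt")

def is_parseable(status_code: int, html: str) -> bool:
    """True only if the response is a 200 AND not a challenge page."""
    if status_code != 200:
        return False
    return not any(marker in html for marker in CHALLENGE_MARKERS)

print(is_parseable(200, "<html>Just a moment...</html>"))  # challenge slips through status check
```

Gate every parse on a check like this and challenge pages never pollute your database.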

The data is worth the patience. WebMD's medical content is authored and reviewed by clinical professionals, which makes it higher quality than most scraped health content. Use it responsibly.