Scraping WebMD: Medical Conditions, Symptoms & Drug Data with Python (2026)
WebMD is one of the largest medical reference sites on the web. It has structured data on thousands of conditions, drugs, symptoms, and user-submitted reviews. If you're building a health research dataset, comparing drug interactions, aggregating patient experiences, or feeding medical content into an NLP pipeline — the data is there, and it's publicly accessible.
This guide covers scraping condition pages, drug interaction data, user reviews, and symptom mappings from WebMD using Python. We'll deal with their anti-bot measures, set up SQLite storage, integrate residential proxies, and walk through practical use cases for the data you collect.
A Note on Ethics
Medical data carries weight. If you scrape it, be responsible with it. Don't republish it as your own medical advice. Don't use it to build misleading health products. WebMD's content is written and reviewed by medical professionals — treat it accordingly.
Also: respect their robots.txt and rate-limit your requests. Hammering a health information site helps nobody, and getting your IP banned on the first day kills any research project before it starts.
What Data Is Available
WebMD organizes content across several distinct categories:
- Condition pages — symptoms, causes, risk factors, diagnosis, treatment options, complications, when to see a doctor
- Drug database — dosage information, side effects (common and rare), interactions, contraindications, black box warnings
- Drug interaction checker — multi-drug interaction reports generated dynamically
- User reviews — patient-submitted ratings and comments for drugs and treatments, organized by condition
- Symptom checker — structured symptom-to-condition mapping (partially in the DOM, partially JavaScript)
- Vitamins & supplements — structured data on OTC supplements following the same format as drugs
- Medical reference A-Z — alphabetical index of all conditions with category tags
Setup
pip install httpx beautifulsoup4 lxml
I'm using httpx over requests — it handles HTTP/2 and async better, which matters when you're making many sequential requests with delays. lxml is significantly faster than Python's built-in HTML parser for large pages. sqlite3 ships with the Python standard library, so it doesn't need installing.
Understanding WebMD's URL Structure
WebMD uses predictable URL patterns that make enumeration straightforward:
# Condition pages
https://www.webmd.com/[category]/[condition-slug]
https://www.webmd.com/diabetes/type-2-diabetes
https://www.webmd.com/heart-disease/atrial-fibrillation/
https://www.webmd.com/migraines-headaches/migraines-headaches-migraines
# Drug pages
https://www.webmd.com/drugs/2/drug-[drug-id]/[drug-name]/details
# Drug reviews
https://www.webmd.com/drugs/drugreview-[drug-id]-[drug-name].aspx
# Symptom index
https://www.webmd.com/a-to-z-guides/symptoms-a-z
Understanding these patterns lets you build a crawler that works systematically from index pages rather than guessing URLs.
Scraping Condition Pages
WebMD condition pages follow a structured layout with sections for symptoms, causes, treatments, and related conditions:
import httpx
from bs4 import BeautifulSoup
import time
import json
import re
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Upgrade-Insecure-Requests": "1",
}
def scrape_condition(url: str, client: httpx.Client | None = None) -> dict:
"""Scrape a WebMD condition page for structured data."""
use_client = client or httpx.Client(headers=headers, follow_redirects=True, timeout=15)
try:
resp = use_client.get(url)
resp.raise_for_status()
except httpx.HTTPStatusError as e:
print(f"HTTP {e.response.status_code} for {url}")
return {}
# Check for a Cloudflare challenge before bothering to parse
if is_challenge_page(resp.text):
print(f"Challenge page detected for {url}")
return {}
soup = BeautifulSoup(resp.text, "lxml")
data = {
"url": url,
"title": "",
"summary": "",
"sections": {},
"related": [],
"tags": [],
"last_reviewed": "",
}
# Title
title_tag = soup.select_one("h1")
if title_tag:
data["title"] = title_tag.get_text(strip=True)
# Meta description as summary
meta_desc = soup.select_one("meta[name='description']")
if meta_desc:
data["summary"] = meta_desc.get("content", "")
# Last reviewed date
reviewed = soup.select_one("[class*='reviewed'], [class*='byline'] time")
if reviewed:
data["last_reviewed"] = reviewed.get_text(strip=True)
# Main content sections
for section in soup.select("div.article-page section, div.article-body section, section.article-section"):
heading = section.select_one("h2, h3")
if heading:
section_title = heading.get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in section.select("p") if p.get_text(strip=True)]
bullet_points = [li.get_text(strip=True) for li in section.select("ul li")]
data["sections"][section_title] = {
"text": paragraphs,
"bullet_points": bullet_points,
}
# Tags/categories
for tag in soup.select("a[href*='/a-to-z-guides/'], nav.breadcrumb a"):
tag_text = tag.get_text(strip=True)
if tag_text and len(tag_text) > 2:
data["tags"].append(tag_text)
# Related conditions sidebar
for link in soup.select("div.related-conditions a, nav[class*='related'] a, div[class*='related-links'] a"):
href = link.get("href", "")
if "webmd.com" in href or href.startswith("/"):
data["related"].append({
"title": link.get_text(strip=True),
"url": href if href.startswith("http") else "https://www.webmd.com" + href,
})
return data
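Once pages are scraped, a common next step — e.g. for the NLP pipelines mentioned earlier — is flattening the nested dict into plain text. A minimal sketch, assuming the exact dict shape scrape_condition returns:

```python
def condition_to_text(data: dict) -> str:
    """Flatten a scraped condition dict into one plain-text document."""
    parts = [data.get("title", ""), data.get("summary", "")]
    for name, section in data.get("sections", {}).items():
        parts.append(name)                              # section heading
        parts.extend(section.get("text", []))           # paragraphs
        parts.extend(section.get("bullet_points", []))  # list items
    return "\n".join(p for p in parts if p)
```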
Extracting Drug Information
WebMD's drug database is particularly structured. Each drug page follows a consistent layout:
def scrape_drug_page(drug_url: str, client: httpx.Client | None = None) -> dict:
"""Extract drug information from a WebMD drug page."""
use_client = client or httpx.Client(headers=headers, follow_redirects=True, timeout=15)
resp = use_client.get(drug_url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
drug_data = {
"url": drug_url,
"name": "",
"generic_name": "",
"drug_class": "",
"uses": [],
"side_effects_common": [],
"side_effects_serious": [],
"interactions": [],
"warnings": [],
"dosage_notes": "",
"pregnancy_category": "",
"controlled_substance": False,
}
# Drug name and class
name_tag = soup.select_one("h1, .drug-name, [class*='DrugName']")
if name_tag:
drug_data["name"] = name_tag.get_text(strip=True)
generic_tag = soup.select_one("[class*='generic-name'], .generic")
if generic_tag:
drug_data["generic_name"] = generic_tag.get_text(strip=True)
class_tag = soup.select_one("[class*='drug-class']")
if class_tag:
drug_data["drug_class"] = class_tag.get_text(strip=True)
# Uses section
uses_section = soup.find(["h2", "h3"], string=lambda t: t and ("use" in t.lower() or "treat" in t.lower()))
if uses_section:
container = uses_section.find_next(["div", "ul"])
if container:
drug_data["uses"] = [
item.get_text(strip=True)
for item in container.select("p, li")
if item.get_text(strip=True)
]
# Side effects — split into common vs serious
side_section = soup.find(["h2", "h3"], string=lambda t: t and "side effect" in t.lower())
if side_section:
container = side_section.find_next("div")
if container:
all_effects = [li.get_text(strip=True) for li in container.select("li")]
# WebMD typically lists common first, then serious with a subheading
serious_marker = container.find(string=lambda t: t and "serious" in t.lower())
if serious_marker:
serious_text = serious_marker.parent.get_text(strip=True)
idx = all_effects.index(serious_text) if serious_text in all_effects else len(all_effects)
drug_data["side_effects_common"] = all_effects[:idx]
drug_data["side_effects_serious"] = all_effects[idx:]
else:
drug_data["side_effects_common"] = all_effects
# Drug interactions
interact_section = soup.find(["h2", "h3"], string=lambda t: t and "interaction" in t.lower())
if interact_section:
container = interact_section.find_next("div")
if container:
drug_data["interactions"] = [li.get_text(strip=True) for li in container.select("li")]
# Warnings (black box, pregnancy warnings)
warning_section = soup.find(["h2", "h3"], string=lambda t: t and ("warning" in t.lower() or "precaution" in t.lower()))
if warning_section:
container = warning_section.find_next("div")
if container:
drug_data["warnings"] = [p.get_text(strip=True) for p in container.select("p, li")]
# Controlled substance check
if "controlled substance" in resp.text.lower() or "schedule ii" in resp.text.lower():
drug_data["controlled_substance"] = True
return drug_data
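The common-versus-serious split is the most fragile part of the drug scraper, so it's worth exercising offline. This sketch uses invented markup (not WebMD's real classes) and a slightly more robust split based on document order rather than list indices:

```python
from bs4 import BeautifulSoup

SAMPLE = """
<div>
  <ul><li>Nausea</li><li>Headache</li></ul>
  <h4>Serious side effects</h4>
  <ul><li>Chest pain</li><li>Severe rash</li></ul>
</div>
"""

def split_side_effects(html: str) -> tuple[list[str], list[str]]:
    """Split <li> items into common/serious at a 'serious' subheading."""
    soup = BeautifulSoup(html, "html.parser")  # html.parser so the sketch runs without lxml
    marker = soup.find(["h3", "h4"], string=lambda t: t and "serious" in t.lower())
    all_items = [li.get_text(strip=True) for li in soup.select("li")]
    if not marker:
        return all_items, []
    serious = [li.get_text(strip=True) for li in marker.find_all_next("li")]
    # Everything before the marker is "common"; assumes no duplicate item text
    common = [t for t in all_items if t not in serious]
    return common, serious
```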
def scrape_drug_index(letter: str = "A", client: httpx.Client | None = None) -> list:
"""Scrape the drug index for a given letter to get all drug URLs."""
use_client = client or httpx.Client(headers=headers, follow_redirects=True, timeout=15)
url = f"https://www.webmd.com/drugs/2/alpha/{letter}"
resp = use_client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
drugs = []
for link in soup.select("a[href*='/drugs/2/drug-']"):
drugs.append({
"name": link.get_text(strip=True),
"url": link.get("href", ""),
})
return drugs
Scraping User Reviews
WebMD has user-submitted drug and condition reviews. These are valuable for sentiment analysis and patient experience research:
def scrape_reviews(reviews_url: str, max_pages: int = 10, client: httpx.Client | None = None) -> list:
"""Scrape user reviews from a WebMD drug/condition review page."""
use_client = client or httpx.Client(headers=headers, follow_redirects=True, timeout=15)
all_reviews = []
for page in range(1, max_pages + 1):
url = f"{reviews_url}?page={page}" if page > 1 else reviews_url
try:
resp = use_client.get(url)
if resp.status_code != 200:
break
except Exception as e:
print(f"Error on page {page}: {e}")
break
soup = BeautifulSoup(resp.text, "lxml")
# WebMD review selectors (multiple possible formats)
review_cards = soup.select(
"div.review-card, div.user-review, div[class*='drugReview'], "
"div[class*='review-content'], div.patientReview"
)
if not review_cards:
# No cards found: either past the last page or the selectors missed — stop here
break
for card in review_cards:
review = {"page": page}
# Rating — WebMD uses 1-5 scale
rating_el = card.select_one(
".rating, .review-rating, [class*='userRating'], "
"span[class*='stars'], [itemprop='ratingValue']"
)
if rating_el:
rating_text = rating_el.get("content") or rating_el.get_text(strip=True)
try:
review["rating"] = float(rating_text.replace("/5", "").strip())
except (ValueError, AttributeError):
review["rating"] = None
# Review text
comment_el = card.select_one(
".review-comment, .comment-text, [class*='reviewText'], "
"p.review-body, [itemprop='reviewBody']"
)
review["comment"] = comment_el.get_text(strip=True) if comment_el else None
# Condition the reviewer was treating
condition_el = card.select_one(".condition, .review-condition, [class*='conditionTreated']")
review["condition"] = condition_el.get_text(strip=True) if condition_el else None
# Time on medication
time_el = card.select_one("[class*='timeMedication'], [class*='duration']")
review["time_on_medication"] = time_el.get_text(strip=True) if time_el else None
# Effectiveness, ease of use, satisfaction sub-ratings
sub_ratings = {}
for sub in card.select("[class*='subRating'], [class*='category-rating']"):
label_el = sub.select_one("[class*='label'], span:first-child")
value_el = sub.select_one("[class*='value'], [class*='score']")
if label_el and value_el:
sub_ratings[label_el.get_text(strip=True)] = value_el.get_text(strip=True)
review["sub_ratings"] = sub_ratings
# Date posted
date_el = card.select_one("time, [class*='reviewDate'], [class*='date']")
review["date"] = (date_el.get("datetime") or date_el.get_text(strip=True)) if date_el else None
if review.get("comment"): # Only add reviews with content
all_reviews.append(review)
print(f" Page {page}: {len(review_cards)} reviews (total: {len(all_reviews)})")
time.sleep(4) # Be respectful between pages
return all_reviews
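With reviews in hand, a small stdlib-only aggregator gives quick sanity checks before any heavier analysis. It assumes the review dict shape produced by scrape_reviews:

```python
from statistics import mean

def summarize_reviews(reviews: list[dict]) -> dict:
    """Aggregate review dicts: average rating, 1-5 histogram,
    and the number of comments mentioning side effects."""
    rated = [r["rating"] for r in reviews if isinstance(r.get("rating"), (int, float))]
    histogram = {n: 0 for n in range(1, 6)}
    for r in rated:
        bucket = int(round(r))
        if bucket in histogram:
            histogram[bucket] += 1
    mentions = sum(
        1 for r in reviews
        if r.get("comment") and "side effect" in r["comment"].lower()
    )
    return {
        "count": len(reviews),
        "avg_rating": round(mean(rated), 2) if rated else None,
        "histogram": histogram,
        "side_effect_mentions": mentions,
    }
```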
Dealing with Anti-Bot Protections
WebMD uses Cloudflare and fingerprinting on some pages. You'll notice 403 responses or challenge pages if you hit it too fast or with bare headers.
def is_challenge_page(html: str) -> bool:
"""Check if response is a Cloudflare or bot-detection challenge page."""
markers = [
"challenge-platform",
"cf-browser-verification",
"Just a moment",
"Checking if the site connection is secure",
"_cf_chl_opt",
"Enable JavaScript and cookies",
]
return any(marker in html for marker in markers)
def make_safe_request(url: str, client: httpx.Client, max_retries: int = 3) -> httpx.Response | None:
"""Make a request with retry logic and challenge detection."""
for attempt in range(max_retries):
try:
resp = client.get(url)
if is_challenge_page(resp.text):
print(f" Challenge page on attempt {attempt + 1}, backing off...")
time.sleep(30 * (attempt + 1)) # Linear backoff: 30s, 60s, 90s
continue
if resp.status_code == 429:
print(f" Rate limited, waiting {60 * (attempt + 1)}s...")
time.sleep(60 * (attempt + 1))
continue
resp.raise_for_status()
return resp
except httpx.HTTPStatusError as e:
if e.response.status_code in [403, 503]:
print(f" HTTP {e.response.status_code} on attempt {attempt + 1}")
time.sleep(15 * (attempt + 1))
else:
raise
print(f" Failed after {max_retries} attempts: {url}")
return None
What works in 2026:
- Rotate User-Agents — keep a list of 10+ current browser UAs and rotate them per request
- Slow down — 3-5 seconds between requests minimum; WebMD isn't a speed game
- Use residential proxies — datacenter IPs get flagged quickly on any meaningful scale
For residential proxies, ThorData works well for medical site scraping. Their rotating residential IPs handle Cloudflare challenges and the geo-targeting is useful when WebMD serves different content by region. Setup:
import random
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
]
def make_client(proxy_url: str | None = None, rotate_ua: bool = True) -> httpx.Client:
"""Create an httpx client with optional proxy and UA rotation."""
current_headers = headers.copy()
if rotate_ua:
current_headers["User-Agent"] = random.choice(USER_AGENTS)
return httpx.Client(
headers=current_headers,
proxy=proxy_url, # e.g., "http://USER:[email protected]:9000"
follow_redirects=True,
timeout=20,
http2=True, # HTTP/2 looks more like a real browser
)
# Rotate client every N requests
def create_rotating_scraper(proxy_url: str):
"""Create a scraper that rotates UA every 10 requests."""
request_count = 0
client = make_client(proxy_url)
def get(url: str) -> httpx.Response:
nonlocal request_count, client
if request_count and request_count % 10 == 0:
client.close() # Drop the old connection pool
client = make_client(proxy_url) # Fresh UA
request_count += 1
return client.get(url)
return get
Building a Symptom Database
With the condition scraper, you can build a structured symptom-to-condition mapping:
def build_symptom_index(conditions: list[dict]) -> dict:
"""Build a reverse index: symptom -> list of conditions."""
index = {}
for condition in conditions:
title = condition.get("title", "")
for section_name, section_data in condition.get("sections", {}).items():
if "symptom" in section_name.lower():
for symptom in section_data.get("bullet_points", []):
symptom_clean = symptom.lower().strip().rstrip(".").rstrip(",")
# Filter out very short or very long entries (likely not symptoms)
if 3 < len(symptom_clean) < 100:
if symptom_clean not in index:
index[symptom_clean] = []
if title not in index[symptom_clean]:
index[symptom_clean].append(title)
return index
def top_symptoms_by_condition_count(index: dict, top_n: int = 20) -> list:
"""Find symptoms that appear across the most conditions."""
return sorted(
index.items(),
key=lambda x: len(x[1]),
reverse=True
)[:top_n]
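The reverse index also enables the condition-similarity clustering mentioned later: a Jaccard score over shared symptoms. A stdlib-only sketch, assuming the symptom -> conditions mapping that build_symptom_index returns:

```python
from itertools import combinations

def condition_similarity(index: dict[str, list[str]]) -> dict[tuple, float]:
    """Jaccard similarity between condition pairs based on shared symptoms."""
    # Invert the index: condition -> set of symptoms
    by_condition: dict[str, set] = {}
    for symptom, conditions in index.items():
        for cond in conditions:
            by_condition.setdefault(cond, set()).add(symptom)
    scores = {}
    for a, b in combinations(sorted(by_condition), 2):
        sa, sb = by_condition[a], by_condition[b]
        union = sa | sb
        if union:
            scores[(a, b)] = round(len(sa & sb) / len(union), 3)
    return scores
```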
SQLite Storage
For any serious scraping job, plan your storage before you start. If the scraper crashes at page 500, you don't want to restart from zero:
import sqlite3
from datetime import datetime
def init_db(db_path: str = "webmd_data.db") -> sqlite3.Connection:
"""Initialize SQLite database with all required tables."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL") # Better concurrent write performance
conn.execute("""
CREATE TABLE IF NOT EXISTS conditions (
url TEXT PRIMARY KEY,
title TEXT NOT NULL,
summary TEXT,
tags TEXT, -- JSON array
last_reviewed TEXT,
data TEXT NOT NULL, -- Full JSON
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS drugs (
url TEXT PRIMARY KEY,
name TEXT NOT NULL,
generic_name TEXT,
drug_class TEXT,
controlled_substance INTEGER DEFAULT 0,
data TEXT NOT NULL, -- Full JSON
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
source_url TEXT NOT NULL,
drug_or_condition TEXT,
rating REAL,
comment TEXT,
condition_treated TEXT,
time_on_medication TEXT,
date_posted TEXT,
sub_ratings TEXT, -- JSON
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS symptoms (
id INTEGER PRIMARY KEY AUTOINCREMENT,
symptom TEXT NOT NULL,
condition TEXT NOT NULL,
UNIQUE(symptom, condition)
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_conditions_title ON conditions(title)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_drugs_name ON drugs(name)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_reviews_url ON reviews(source_url)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_symptoms_symptom ON symptoms(symptom)")
conn.commit()
return conn
def save_condition(conn: sqlite3.Connection, condition: dict) -> bool:
"""Save a condition to the database. Returns True if new, False if updated."""
try:
# Check first so the return value can distinguish insert from update —
# lastrowid is set either way on INSERT OR REPLACE, so it can't tell them apart
is_new = conn.execute(
"SELECT 1 FROM conditions WHERE url = ?", (condition["url"],)
).fetchone() is None
conn.execute(
"""INSERT OR REPLACE INTO conditions
(url, title, summary, tags, last_reviewed, data)
VALUES (?, ?, ?, ?, ?, ?)""",
(
condition["url"],
condition.get("title", ""),
condition.get("summary", ""),
json.dumps(condition.get("tags", [])),
condition.get("last_reviewed", ""),
json.dumps(condition),
)
)
conn.commit()
return is_new
except sqlite3.Error as e:
print(f"DB error saving condition {condition.get('url')}: {e}")
return False
def save_reviews(conn: sqlite3.Connection, source_url: str, drug_name: str, reviews: list) -> int:
"""Bulk save reviews to database."""
saved = 0
for review in reviews:
try:
conn.execute(
"""INSERT INTO reviews
(source_url, drug_or_condition, rating, comment, condition_treated,
time_on_medication, date_posted, sub_ratings)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
(
source_url,
drug_name,
review.get("rating"),
review.get("comment"),
review.get("condition"),
review.get("time_on_medication"),
review.get("date"),
json.dumps(review.get("sub_ratings", {})),
)
)
saved += 1
except sqlite3.Error:
continue
conn.commit()
return saved
def is_already_scraped(conn: sqlite3.Connection, url: str, table: str = "conditions") -> bool:
"""Check if a URL has already been scraped (for resume capability)."""
cursor = conn.execute(f"SELECT 1 FROM {table} WHERE url = ?", (url,))
return cursor.fetchone() is not None
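A condensed round trip shows the resume logic working end to end. It uses an in-memory database and a trimmed-down conditions table so the sketch runs standalone:

```python
import json
import sqlite3

# In-memory DB with a trimmed version of the conditions schema from init_db
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE conditions (
        url TEXT PRIMARY KEY,
        title TEXT NOT NULL,
        data TEXT NOT NULL,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
record = {"url": "https://www.webmd.com/diabetes/type-2-diabetes", "title": "Type 2 Diabetes"}
conn.execute(
    "INSERT OR REPLACE INTO conditions (url, title, data) VALUES (?, ?, ?)",
    (record["url"], record["title"], json.dumps(record)),
)
conn.commit()

# Resume check — same shape as is_already_scraped()
already = conn.execute(
    "SELECT 1 FROM conditions WHERE url = ?", (record["url"],)
).fetchone() is not None
```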
Pagination and Crawling at Scale
Crawling all of WebMD's condition pages requires starting from an index:
def get_condition_urls_from_index(client: httpx.Client) -> list[str]:
"""Get all condition URLs from WebMD's A-Z index."""
all_urls = []
# WebMD's condition index
index_url = "https://www.webmd.com/a-to-z-guides/medical-conditions"
resp = client.get(index_url)
soup = BeautifulSoup(resp.text, "lxml")
for link in soup.select("a[href*='webmd.com'], a[href^='/']"):
href = link.get("href", "")
# Filter to condition-like URLs (exclude ads, nav links)
if any(cat in href for cat in [
"/diabetes/", "/heart-disease/", "/cancer/", "/mental-health/",
"/digestive-disorders/", "/pain-management/", "/allergies/",
"/cold-and-flu/", "/depression/", "/arthritis/"
]):
full_url = href if href.startswith("http") else f"https://www.webmd.com{href}"
if full_url not in all_urls:
all_urls.append(full_url)
return all_urls
def crawl_conditions(db_path: str, proxy_url: str | None = None,
max_conditions: int = 500) -> dict:
"""Full crawl pipeline: discover and scrape condition pages."""
conn = init_db(db_path)
client = make_client(proxy_url)
print("Discovering condition URLs...")
urls = get_condition_urls_from_index(client)
print(f"Found {len(urls)} condition URLs")
stats = {"new": 0, "updated": 0, "failed": 0, "skipped": 0}
for i, url in enumerate(urls[:max_conditions]):
if is_already_scraped(conn, url):
stats["skipped"] += 1
continue
print(f"[{i+1}/{min(len(urls), max_conditions)}] {url}")
# Rotate client every 15 requests
if i % 15 == 0 and i > 0:
client = make_client(proxy_url)
condition = scrape_condition(url, client)
if condition and condition.get("title"):
saved = save_condition(conn, condition)
# Also index symptoms
symptom_index = build_symptom_index([condition])
for symptom, conditions in symptom_index.items():
for cond_title in conditions:
try:
conn.execute(
"INSERT OR IGNORE INTO symptoms (symptom, condition) VALUES (?, ?)",
(symptom, cond_title)
)
except sqlite3.Error:
pass
conn.commit()
stats["new" if saved else "updated"] += 1
print(f" OK: {condition['title']} ({len(condition['sections'])} sections)")
else:
stats["failed"] += 1
print(f" FAILED: {url}")
# Respectful delay: 3-6 seconds with random jitter
time.sleep(random.uniform(3, 6))
conn.close()
return stats
Querying Your Dataset
Once you have data in SQLite, it's easy to run analyses:
def query_conditions_by_symptom(db_path: str, symptom: str) -> list:
"""Find all conditions associated with a symptom."""
conn = sqlite3.connect(db_path)
results = conn.execute(
"""SELECT condition, COUNT(*) as match_count
FROM symptoms
WHERE LOWER(symptom) LIKE ?
GROUP BY condition
ORDER BY match_count DESC""",
(f"%{symptom.lower()}%",)
).fetchall()
conn.close()
return results
def get_drug_interaction_summary(db_path: str, drug_name: str) -> dict:
"""Get interaction data for a drug."""
conn = sqlite3.connect(db_path)
row = conn.execute(
"SELECT data FROM drugs WHERE LOWER(name) LIKE ?",
(f"%{drug_name.lower()}%",)
).fetchone()
conn.close()
if row:
data = json.loads(row[0])
return {
"drug": data["name"],
"interactions": data.get("interactions", []),
"warnings": data.get("warnings", []),
}
return {}
def top_rated_drugs_by_condition(db_path: str, condition: str, min_reviews: int = 10) -> list:
"""Rank drugs by average user rating for a given condition."""
conn = sqlite3.connect(db_path)
results = conn.execute(
"""SELECT drug_or_condition,
AVG(rating) as avg_rating,
COUNT(*) as review_count
FROM reviews
WHERE LOWER(condition_treated) LIKE ?
AND rating IS NOT NULL
GROUP BY drug_or_condition
HAVING COUNT(*) >= ?
ORDER BY avg_rating DESC""",
(f"%{condition.lower()}%", min_reviews)
).fetchall()
conn.close()
return results
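One refinement worth considering for that ranking query: a plain average over-rewards drugs with only a handful of reviews. A Bayesian-smoothed score is a common fix — the prior mean and weight below are tunable assumptions, not values from WebMD:

```python
def bayesian_rating(avg: float, count: int,
                    prior_mean: float = 3.0, prior_weight: int = 10) -> float:
    """Shrink small-sample averages toward a prior so a drug with three
    5-star reviews doesn't outrank one with two hundred 4.6-star reviews."""
    return round((prior_weight * prior_mean + count * avg) / (prior_weight + count), 2)
```

Apply it to each (avg_rating, review_count) row before sorting instead of ordering by the raw average.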
Complete Pipeline Example
Putting it all together — a script that scrapes a set of conditions, their drug reviews, and stores everything:
import sqlite3, json, time, random
import httpx
from bs4 import BeautifulSoup
PROXY_URL = "http://USER:[email protected]:9000" # ThorData residential proxy
TARGET_CONDITIONS = [
("https://www.webmd.com/diabetes/type-2-diabetes", "type-2-diabetes"),
("https://www.webmd.com/heart-disease/atrial-fibrillation/", "atrial-fibrillation"),
("https://www.webmd.com/migraines-headaches/migraines-headaches-migraines", "migraines"),
("https://www.webmd.com/depression/guide/depression-diagnosis-tests", "depression"),
("https://www.webmd.com/arthritis/rheumatoid-arthritis-ra", "rheumatoid-arthritis"),
]
def run_full_pipeline(db_path: str = "webmd_data.db"):
conn = init_db(db_path)
client = make_client(PROXY_URL)
for url, slug in TARGET_CONDITIONS:
print(f"\n=== {slug} ===")
if not is_already_scraped(conn, url):
condition = scrape_condition(url, client)
if condition.get("title"):
save_condition(conn, condition)
print(f"Saved condition: {condition['title']}")
else:
print("Already scraped, skipping")
# Scrape reviews for this condition
reviews_url = f"https://www.webmd.com/drugs/condition-{slug}"
reviews = scrape_reviews(reviews_url, max_pages=5, client=client)
if reviews:
saved = save_reviews(conn, reviews_url, slug, reviews)
print(f"Saved {saved} reviews")
time.sleep(random.uniform(5, 10))
# Summary stats
stats = {}
for table in ["conditions", "drugs", "reviews", "symptoms"]:
count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
stats[table] = count
print(f"\n=== Database Summary ===")
for table, count in stats.items():
print(f" {table}: {count:,} records")
conn.close()
return stats
if __name__ == "__main__":
run_full_pipeline()
What You Can Build
A few practical uses for this data:
- Symptom checker datasets — map symptoms to conditions with frequency and co-occurrence data for ML training
- Drug interaction graphs — network analysis of which drugs interact with which; useful for pharmacovigilance research
- Patient sentiment analysis — NLP on user reviews grouped by condition or treatment; train classifiers on effectiveness vs. side-effect complaints
- Medical content gap analysis — compare what WebMD covers vs. clinical literature or other sources
- Condition similarity clustering — use symptom co-occurrence to cluster related conditions; useful for recommendation systems
- Drug pricing and availability monitoring — track which drugs are mentioned across condition pages and cross-reference with pharmacy APIs
- SEO research — WebMD is a dominant health content site; analyzing their content structure reveals what Google rewards for medical topics
Rate Limits and Ethical Considerations
WebMD's data is rich and mostly well-structured in the HTML. The main challenge is scale — be patient with rate limiting, rotate your IPs, and store incrementally.
Practical rate limit guidelines based on real-world testing in 2026:
| Target | Safe rate | Risk threshold |
|---|---|---|
| Condition pages | 1 req/4s | > 1 req/2s |
| Drug pages | 1 req/5s | > 1 req/3s |
| Review pages | 1 req/6s | > 1 req/4s |
| Drug index | 1 req/3s | > 1 req/2s |
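Those intervals are easy to enforce with a tiny limiter instead of scattering sleep() calls around the code — one limiter instance per target type:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests (per the table above)."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        # Sleep only for the remainder of the interval since the last call
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Usage: create `condition_limiter = RateLimiter(4.0)` and call `condition_limiter.wait()` before each condition-page request; time spent parsing counts toward the interval for free.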
Always check if a URL returns a 200 before parsing — a 200 that contains a challenge page is a common failure mode. The is_challenge_page() function above handles this.
The data is worth the patience. WebMD's medical content is authored and reviewed by clinical professionals, which makes it higher quality than most scraped health content. Use it responsibly.