How to Scrape Drugs.com for Medication Data with Python (2026)
Drugs.com stands as one of the most comprehensive publicly accessible medication databases on the internet. With over 24,000 drug monographs, millions of patient reviews, detailed pharmacological interaction data, dosage calculators, and side-effect frequency breakdowns sourced from clinical trials, it represents a goldmine for pharmaceutical researchers, healthcare data scientists, pharmacovigilance teams, and developers building health-adjacent applications.
Unlike many data sources in the healthcare space, Drugs.com provides consumer-facing information that bridges clinical data and real patient experiences. The combination of FDA-approved prescribing information alongside unstructured patient narratives creates a uniquely rich dataset. You can understand not just what a drug is supposed to do, but what patients actually experience — the difference between documented side-effect frequency in clinical trials and what emerges when millions of real-world patients self-report over years of use.
The technical challenge of scraping Drugs.com lies not in parsing complexity (the HTML is well-structured) but in circumventing its multi-layered bot defenses. The site sits behind Cloudflare's Enterprise WAF, uses behavioral fingerprinting, and deploys IP reputation scoring that's particularly harsh on datacenter ranges. A naive requests.get() approach will get you blocked within minutes. This guide covers everything from basic setup through production-grade scraping pipelines with residential proxy rotation, adaptive rate limiting, session management, and CAPTCHA fallback strategies.
Disclaimer: This data is for research and analytical purposes only. Never use scraped medication data for clinical decisions, prescribing advice, or patient care. Always consult licensed healthcare professionals for medical guidance. Review Drugs.com's Terms of Service and your jurisdiction's data scraping laws before building any large-scale collection pipeline. Patient reviews contain personal health information — anonymize and aggregate appropriately.
Understanding the Data Structure
Before writing a single line of code, map out what Drugs.com actually offers:
- Drug monographs at drugs.com/{drug-name}.html — clinical descriptions, mechanism, indications
- Side effects at drugs.com/sfx/{drug-name}-side-effects.html — frequency-bucketed adverse effects
- Patient reviews at drugs.com/comments/{drug-name}/ — paginated, with condition, rating, text
- Drug interactions at drugs.com/drug-interactions/{drug-name}.html — severity-classified pairings
- Dosage information at drugs.com/dosage/{drug-name}.html — age/weight/indication tables
- Drug search at drugs.com/search.php?searchterm={query} — autocomplete-style results
- Drug classes at drugs.com/drug-class/ — categorical browsing structure
- FDA drug database at drugs.com/fda/ — regulatory filings and approvals
Most of these pages are server-side rendered HTML. The reviews section uses some dynamic loading for pagination, but the core content is available in the initial response.
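Every scraper in this guide derives its URLs from the same slug rule (lowercase, spaces to hyphens), so it is worth centralizing the patterns above in one place. A small sketch; the ENDPOINTS table and drug_url helper are illustrative names, not part of any library, and brand or combination names may not follow the simple slug rule:

```python
# Hypothetical helper centralizing the URL patterns listed above.
ENDPOINTS = {
    "monograph": "https://www.drugs.com/{slug}.html",
    "side_effects": "https://www.drugs.com/sfx/{slug}-side-effects.html",
    "reviews": "https://www.drugs.com/comments/{slug}/",
    "interactions": "https://www.drugs.com/drug-interactions/{slug}.html",
    "dosage": "https://www.drugs.com/dosage/{slug}.html",
}

def drug_url(drug_name: str, kind: str = "monograph") -> str:
    """Build an endpoint URL from a drug name using the common slug rule."""
    slug = drug_name.strip().lower().replace(" ", "-")
    return ENDPOINTS[kind].format(slug=slug)
```

Keeping the patterns in one table means a site-side URL change is a one-line fix instead of a hunt through every function.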
Setup and Dependencies
pip install requests httpx beautifulsoup4 lxml pandas playwright tenacity fake-useragent
playwright install chromium
For proxy-enabled scraping at scale:
pip install requests[socks] httpx[socks]
Core HTTP Client with Anti-Detection
The foundation of any successful Drugs.com scraper is a well-crafted HTTP client that looks like real browser traffic. This means realistic headers, consistent TLS fingerprinting, and session cookie persistence.
import requests
import httpx
import random
import time
import logging
from typing import Optional, Dict, Any
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
logger = logging.getLogger(__name__)
# Rotate through realistic browser user agents
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
def build_headers(referer: Optional[str] = None) -> Dict[str, str]:
"""Build realistic browser headers with optional referer."""
ua = random.choice(USER_AGENTS)
headers = {
"User-Agent": ua,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "same-origin" if referer else "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
if referer:
headers["Referer"] = referer
return headers
class DrugsComSession:
"""Persistent session with cookie management and proxy rotation."""
def __init__(self, proxy_url: Optional[str] = None):
self.session = requests.Session()
self.proxy_url = proxy_url
self._warm_up()
def _warm_up(self):
"""Visit homepage to establish cookies before scraping."""
try:
self.session.headers.update(build_headers())
if self.proxy_url:
self.session.proxies = {
"http": self.proxy_url,
"https": self.proxy_url,
}
resp = self.session.get(
"https://www.drugs.com/",
timeout=30,
allow_redirects=True,
)
logger.info(f"Session warmed up, status={resp.status_code}, cookies={len(self.session.cookies)}")
time.sleep(random.uniform(1.5, 3.0))
except Exception as e:
logger.warning(f"Warm-up failed: {e}")
@retry(
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=2, min=4, max=60),
retry=retry_if_exception_type((requests.RequestException, IOError)),
)
def get(self, url: str, referer: Optional[str] = None) -> requests.Response:
"""Make a GET request with retry logic and adaptive delays."""
# Update headers on each request for variation
headers = build_headers(referer)
self.session.headers.update(headers)
time.sleep(random.uniform(3.5, 7.0))
resp = self.session.get(url, timeout=30)
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 120))
logger.warning(f"Rate limited, waiting {retry_after}s")
time.sleep(retry_after + random.uniform(5, 15))
raise requests.RequestException("Rate limited")
if resp.status_code == 403:
logger.warning(f"403 Forbidden at {url} - possible Cloudflare block")
time.sleep(random.uniform(30, 60))
raise requests.RequestException("Forbidden")
if resp.status_code == 503:
logger.warning("503 Service Unavailable - server overloaded")
time.sleep(random.uniform(20, 40))
raise requests.RequestException("Service unavailable")
resp.raise_for_status()
return resp
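One subtlety in the 429 branch above: per RFC 7231, Retry-After may carry either delta-seconds or an HTTP-date, so the bare int(...) call raises ValueError on the date form. A defensive parser, as a standard-library-only sketch (the parse_retry_after name is my own, not a requests API):

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def parse_retry_after(value: str, default: float = 120.0) -> float:
    """Parse a Retry-After header that may be delta-seconds or an HTTP-date."""
    if not value:
        return default
    try:
        # Delta-seconds form, e.g. "120"
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        dt = parsedate_to_datetime(value)
        return max(0.0, (dt - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default
```

Swapping this in for the int(resp.headers.get(...)) calls keeps a malformed or date-valued header from turning a throttle into a crash.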
Drug Information Pages
from bs4 import BeautifulSoup
import re
def parse_drug_info(html: str, drug_name: str) -> Dict[str, Any]:
"""Parse a drug monograph page into structured data."""
soup = BeautifulSoup(html, "lxml")
info = {"name": drug_name, "url": f"https://www.drugs.com/{drug_name.lower().replace(' ', '-')}.html"}
# Title and generic name
h1 = soup.find("h1")
if h1:
info["title"] = h1.get_text(strip=True)
# Drug class badge
drug_class = soup.select_one(".drug-class a, .content-box .ddc-pid-class")
if drug_class:
info["drug_class"] = drug_class.get_text(strip=True)
# Main description (first substantial paragraph)
content_box = soup.find("div", class_="contentBox") or soup.find("div", class_="ddc-main-content")
if content_box:
paragraphs = content_box.find_all("p", recursive=False)
if paragraphs:
info["description"] = paragraphs[0].get_text(strip=True)
# Availability (Rx/OTC)
availability = soup.find(string=re.compile(r"Availability|Rx only|OTC"))
if availability:
info["availability"] = str(availability).strip()
# FDA approval status
fda_note = soup.find("div", class_="ddc-fda-approval")
if fda_note:
info["fda_status"] = fda_note.get_text(strip=True)
# Related drugs / alternatives
related = []
for link in soup.select(".ddc-related a, .related-drugs a")[:10]:
related.append(link.get_text(strip=True))
if related:
info["related_drugs"] = related
# Pronunciation guide
pronunciation = soup.find("div", class_="pronunciation")
if pronunciation:
info["pronunciation"] = pronunciation.get_text(strip=True)
return info
def get_drug_info(drug_name: str, session: DrugsComSession) -> Dict[str, Any]:
slug = drug_name.lower().replace(" ", "-")
url = f"https://www.drugs.com/{slug}.html"
resp = session.get(url, referer="https://www.drugs.com/")
return parse_drug_info(resp.text, drug_name)
# Usage
session = DrugsComSession(proxy_url="http://USER:[email protected]:9000")
info = get_drug_info("metformin", session)
print(info)
Side Effects with Frequency Data
Side effect pages contain clinically useful frequency buckets from trial data:
import pandas as pd
def get_side_effects(drug_name: str, session: DrugsComSession) -> pd.DataFrame:
"""Scrape side effects with frequency classification."""
slug = drug_name.lower().replace(" ", "-")
url = f"https://www.drugs.com/sfx/{slug}-side-effects.html"
resp = session.get(url, referer=f"https://www.drugs.com/{slug}.html")
soup = BeautifulSoup(resp.text, "lxml")
effects = []
# Frequency-bucketed sections (Common, Infrequent, Rare)
for section in soup.find_all(["h2", "h3"]):
header_text = section.get_text(strip=True)
# Find the list following this header
next_sibling = section.find_next_sibling()
while next_sibling:
if next_sibling.name in ["ul", "div"] and next_sibling.find("li"):
for item in next_sibling.find_all("li"):
effect_text = item.get_text(strip=True)
if effect_text:
effects.append({
"drug": drug_name,
"side_effect": effect_text,
"frequency_category": header_text,
})
break
next_sibling = next_sibling.find_next_sibling()
# Also try the structured side-effects-list divs
for container in soup.find_all("div", class_=re.compile("side-effects")):
category = container.get("data-freq", "Unknown")
header = container.find_previous(["h2", "h3"])
if header:
category = header.get_text(strip=True)
for item in container.find_all("li"):
text = item.get_text(strip=True)
if text and not any(e["side_effect"] == text for e in effects):
effects.append({
"drug": drug_name,
"side_effect": text,
"frequency_category": category,
})
df = pd.DataFrame(effects)
logger.info(f"Extracted {len(df)} side effects for {drug_name}")
return df
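The frame this returns pivots naturally into a per-category count matrix, which is the usual first step when comparing drugs. A self-contained sketch on made-up rows in the same shape get_side_effects() produces:

```python
import pandas as pd

# Illustrative rows only; real category labels vary by drug page.
sample = pd.DataFrame([
    {"drug": "metformin", "side_effect": "Nausea", "frequency_category": "More common"},
    {"drug": "metformin", "side_effect": "Diarrhea", "frequency_category": "More common"},
    {"drug": "metformin", "side_effect": "Lactic acidosis", "frequency_category": "Rare"},
])

# One row per drug, one column per frequency bucket, cell = effect count
counts = (
    sample.groupby(["drug", "frequency_category"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```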
Scraping Patient Reviews at Scale
Reviews are the highest-value data on Drugs.com — real patient experiences with effectiveness ratings, condition-specific filtering, and temporal data going back over a decade.
from dataclasses import dataclass
from typing import List, Iterator
import sqlite3
@dataclass
class DrugReview:
drug: str
condition: str
rating: float
effectiveness: Optional[str]
ease_of_use: Optional[str]
satisfaction: Optional[str]
review_text: str
date: str
reviewer_age: Optional[str]
duration_of_use: Optional[str]
helpful_votes: int
def parse_review_card(card: BeautifulSoup, drug_name: str) -> Optional[DrugReview]:
"""Parse a single review card element."""
try:
condition_el = card.find("b", class_=re.compile("condition")) or card.find("strong", string=re.compile("Condition"))
condition = ""
if condition_el:
# Sometimes condition is in the next sibling text
condition = condition_el.get_text(strip=True).replace("Condition:", "").strip()
rating_el = card.find("span", class_=re.compile("rating")) or card.find("div", class_=re.compile("rating"))
rating = 0.0
if rating_el:
rating_text = rating_el.get_text(strip=True)
match = re.search(r"(\d+(?:\.\d+)?)", rating_text)
if match:
rating = float(match.group(1))
comment_el = (
card.find("span", class_=re.compile("comment-text"))
or card.find("p", class_=re.compile("comment"))
or card.find("div", class_=re.compile("review-text"))
)
review_text = comment_el.get_text(strip=True) if comment_el else ""
date_el = card.find("span", class_=re.compile("date")) or card.find("time")
date = date_el.get_text(strip=True) if date_el else ""
helpful_el = card.find(string=re.compile(r"\d+ found this comment helpful"))
helpful_votes = 0
if helpful_el:
match = re.search(r"(\d+)", str(helpful_el))
if match:
helpful_votes = int(match.group(1))
duration_el = card.find(string=re.compile(r"Duration of Use|duration"))
duration = None
if duration_el:
duration = str(duration_el).strip()
age_el = card.find(string=re.compile(r"Age:|years old"))
age = None
if age_el:
age = str(age_el).strip()
return DrugReview(
drug=drug_name,
condition=condition,
rating=rating,
effectiveness=None,
ease_of_use=None,
satisfaction=None,
review_text=review_text,
date=date,
reviewer_age=age,
duration_of_use=duration,
helpful_votes=helpful_votes,
)
except Exception as e:
logger.warning(f"Failed to parse review card: {e}")
return None
def iter_drug_reviews(drug_name: str, session: DrugsComSession, max_pages: int = 20) -> Iterator[DrugReview]:
"""Iterate over all review pages for a drug."""
slug = drug_name.lower().replace(" ", "-")
base_url = f"https://www.drugs.com/comments/{slug}/"
for page in range(1, max_pages + 1):
url = base_url if page == 1 else f"{base_url}?page={page}"
referer = base_url if page > 1 else "https://www.drugs.com/"
try:
resp = session.get(url, referer=referer)
except Exception as e:
logger.error(f"Failed to fetch page {page} for {drug_name}: {e}")
break
soup = BeautifulSoup(resp.text, "lxml")
# Detect end of pagination
cards = (
soup.find_all("div", class_=re.compile(r"user-comment|review-card|comment-card"))
or soup.find_all("li", class_=re.compile(r"review"))
)
if not cards:
logger.info(f"No review cards found on page {page} for {drug_name}, stopping")
break
page_reviews = 0
for card in cards:
review = parse_review_card(card, drug_name)
if review and review.review_text:
yield review
page_reviews += 1
logger.info(f"Page {page}: extracted {page_reviews} reviews for {drug_name}")
# Check if there's a next page
next_link = soup.find("a", string=re.compile(r"Next|>")) or soup.find("a", rel="next")
if not next_link:
break
def save_reviews_to_sqlite(
drug_name: str,
session: DrugsComSession,
db_path: str = "drugs_reviews.db",
max_pages: int = 20,
) -> int:
"""Stream reviews directly into SQLite."""
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
drug TEXT,
condition TEXT,
rating REAL,
review_text TEXT,
date TEXT,
reviewer_age TEXT,
duration_of_use TEXT,
helpful_votes INTEGER,
scraped_at TEXT DEFAULT (datetime('now'))
)
""")
conn.commit()
count = 0
for review in iter_drug_reviews(drug_name, session, max_pages=max_pages):
conn.execute(
"""INSERT INTO reviews
(drug, condition, rating, review_text, date, reviewer_age, duration_of_use, helpful_votes)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
(
review.drug, review.condition, review.rating, review.review_text,
review.date, review.reviewer_age, review.duration_of_use, review.helpful_votes,
),
)
count += 1
if count % 50 == 0:
conn.commit()
logger.info(f"Committed {count} reviews for {drug_name}")
conn.commit()
conn.close()
return count
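Once reviews are in SQLite, aggregation is a single SQL query. A self-contained demo against an in-memory database with made-up rows; the table layout matches the CREATE TABLE above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reviews (
        drug TEXT, condition TEXT, rating REAL, review_text TEXT,
        date TEXT, reviewer_age TEXT, duration_of_use TEXT, helpful_votes INTEGER
    )
""")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("metformin", "Type 2 Diabetes", 8.0, "ok", "2024-01-02", None, None, 3),
        ("metformin", "Type 2 Diabetes", 6.0, "ok", "2024-02-11", None, None, 1),
        ("metformin", "PCOS", 9.0, "ok", "2024-03-05", None, None, 7),
    ],
)
# Mean rating and review count per condition, most-reviewed first
rows = conn.execute(
    """SELECT condition, ROUND(AVG(rating), 1), COUNT(*)
       FROM reviews GROUP BY condition ORDER BY COUNT(*) DESC"""
).fetchall()
# rows == [('Type 2 Diabetes', 7.0, 2), ('PCOS', 9.0, 1)]
```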
Drug Interactions
The interactions database is particularly valuable for pharmacovigilance and clinical decision support research:
from enum import Enum
class InteractionSeverity(str, Enum):
MAJOR = "major"
MODERATE = "moderate"
MINOR = "minor"
UNKNOWN = "unknown"
FOOD = "food"
def get_interactions(drug_name: str, session: DrugsComSession) -> pd.DataFrame:
"""Scrape drug interaction data with severity classifications."""
slug = drug_name.lower().replace(" ", "-")
url = f"https://www.drugs.com/drug-interactions/{slug}.html"
resp = session.get(url, referer=f"https://www.drugs.com/{slug}.html")
soup = BeautifulSoup(resp.text, "lxml")
interactions = []
# Structured interaction rows
for row in soup.find_all("tr", class_=re.compile(r"int-")):
cells = row.find_all("td")
if len(cells) < 2:
continue
classes = row.get("class", [])
severity = InteractionSeverity.UNKNOWN
for cls in classes:
if "major" in cls:
severity = InteractionSeverity.MAJOR
elif "moderate" in cls:
severity = InteractionSeverity.MODERATE
elif "minor" in cls:
severity = InteractionSeverity.MINOR
elif "food" in cls:
severity = InteractionSeverity.FOOD
interactant_link = cells[0].find("a")
interactions.append({
"drug": drug_name,
"interacts_with": cells[0].get_text(strip=True),
"interactant_url": interactant_link.get("href", "") if interactant_link else "",
"severity": severity.value,
"description": cells[1].get_text(strip=True) if len(cells) > 1 else "",
})
# Also check for food interactions section
food_section = soup.find("h2", string=re.compile(r"food", re.IGNORECASE))
if food_section:
food_list = food_section.find_next("ul")
if food_list:
for item in food_list.find_all("li"):
interactions.append({
"drug": drug_name,
"interacts_with": item.get_text(strip=True),
"interactant_url": "",
"severity": InteractionSeverity.FOOD.value,
"description": "Food interaction",
})
df = pd.DataFrame(interactions)
logger.info(f"Found {len(df)} interactions for {drug_name} ({df['severity'].value_counts().to_dict() if not df.empty else {}})")
return df
Proxy Rotation with ThorData
At scale — scraping hundreds of drugs — you will get blocked without residential proxies. Drugs.com's Cloudflare integration is tuned to flag datacenter IPs aggressively. ThorData provides residential proxy pools with real ISP addresses that pass Cloudflare's reputation checks.
import threading
class ThorDataProxyPool:
"""
Rotating proxy pool using ThorData's residential network.
Supports sticky sessions (same IP for multi-page workflows)
and rotating sessions (new IP per request).
"""
def __init__(
self,
username: str,
password: str,
host: str = "proxy.thordata.com",
port: int = 9000,
country: str = "US",
sticky_session_minutes: int = 5,
):
self.username = username
self.password = password
self.host = host
self.port = port
self.country = country
self.sticky_minutes = sticky_session_minutes
self._session_id = None
self._session_created = 0
self._lock = threading.Lock()
def _new_session_id(self) -> str:
"""Generate a random session identifier for sticky sessions."""
return f"sess_{random.randint(100000, 999999)}"
def get_rotating_proxy(self) -> str:
"""Get a proxy URL that rotates on every request."""
return (
f"http://{self.username}-country-{self.country}:"
f"{self.password}@{self.host}:{self.port}"
)
def get_sticky_proxy(self) -> str:
"""
Get a proxy URL that uses the same exit IP for up to sticky_session_minutes.
Useful for multi-page workflows like paginated reviews.
"""
with self._lock:
now = time.time()
if (
self._session_id is None
or now - self._session_created > self.sticky_minutes * 60
):
self._session_id = self._new_session_id()
self._session_created = now
logger.debug(f"New sticky proxy session: {self._session_id}")
return (
f"http://{self.username}-country-{self.country}-"
f"session-{self._session_id}:{self.password}@{self.host}:{self.port}"
)
def rotate(self):
"""Force rotation to a new IP on next sticky request."""
with self._lock:
self._session_id = None
# Usage pattern
proxy_pool = ThorDataProxyPool(
username="your_username",
password="your_password",
country="US",
sticky_session_minutes=10, # Keep same IP for 10 minutes per drug
)
def create_session_for_drug(drug_name: str) -> DrugsComSession:
"""Create a fresh session with sticky proxy for a single drug's data collection."""
proxy_url = proxy_pool.get_sticky_proxy()
session = DrugsComSession(proxy_url=proxy_url)
logger.info(f"Created session for {drug_name} via {proxy_url[:50]}...")
return session
def scrape_drug_complete(drug_name: str) -> Dict[str, Any]:
"""Full data collection for a single drug with proxy rotation between drugs."""
session = create_session_for_drug(drug_name)
result = {}
try:
result["info"] = get_drug_info(drug_name, session)
time.sleep(random.uniform(4, 8))
result["side_effects"] = get_side_effects(drug_name, session).to_dict(orient="records")
time.sleep(random.uniform(4, 8))
result["interactions"] = get_interactions(drug_name, session).to_dict(orient="records")
except Exception as e:
logger.error(f"Failed scraping {drug_name}: {e}")
result["error"] = str(e)
# Force new IP for next drug
proxy_pool.rotate()
return result
Anti-Detection: Headers, Delays, Fingerprint Spoofing
Beyond proxies, there are several layers of detection to defeat:
import json
from datetime import datetime, timedelta
class BrowserFingerprintSimulator:
"""
Simulate consistent browser fingerprint attributes.
A real browser has consistent screen resolution, timezone, plugins, etc.
Inconsistency is a bot signal.
"""
BROWSER_PROFILES = [
{
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"sec_ch_ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
"sec_ch_ua_platform": '"Windows"',
"accept_language": "en-US,en;q=0.9",
"viewport": "1920x1080",
},
{
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"sec_ch_ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
"sec_ch_ua_platform": '"macOS"',
"accept_language": "en-US,en;q=0.9,en-GB;q=0.8",
"viewport": "1440x900",
},
{
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
"sec_ch_ua": None, # Firefox doesn't send sec-ch-ua
"sec_ch_ua_platform": None,
"accept_language": "en-US,en;q=0.5",
"viewport": "1920x1080",
},
]
def __init__(self):
self._profile = random.choice(self.BROWSER_PROFILES)
def get_headers(self, url: str, referer: Optional[str] = None) -> Dict[str, str]:
headers = {
"User-Agent": self._profile["user_agent"],
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Language": self._profile["accept_language"],
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
}
if self._profile.get("sec_ch_ua"):
headers["Sec-CH-UA"] = self._profile["sec_ch_ua"]
headers["Sec-CH-UA-Mobile"] = "?0"
headers["Sec-CH-UA-Platform"] = self._profile["sec_ch_ua_platform"]
if referer:
headers["Referer"] = referer
headers["Sec-Fetch-Site"] = "same-origin"
else:
headers["Sec-Fetch-Site"] = "none"
headers["Sec-Fetch-Mode"] = "navigate"
headers["Sec-Fetch-Dest"] = "document"
headers["Sec-Fetch-User"] = "?1"
return headers
class AdaptiveRateLimiter:
"""
Adaptive rate limiter that backs off when detecting throttling signals
and speeds up when requests are succeeding consistently.
"""
def __init__(self, base_delay: float = 4.0, min_delay: float = 2.0, max_delay: float = 30.0):
self.delay = base_delay
self.min_delay = min_delay
self.max_delay = max_delay
self._consecutive_success = 0
self._consecutive_failure = 0
self._lock = threading.Lock()
def wait(self):
"""Wait the appropriate amount of time before next request."""
with self._lock:
jitter = random.uniform(-0.5, 1.5)
sleep_time = max(self.min_delay, self.delay + jitter)
time.sleep(sleep_time)
def record_success(self):
with self._lock:
self._consecutive_success += 1
self._consecutive_failure = 0
# Gradually speed up after 5 consecutive successes
if self._consecutive_success >= 5:
self.delay = max(self.min_delay, self.delay * 0.9)
self._consecutive_success = 0
logger.debug(f"Rate limiter sped up: delay={self.delay:.1f}s")
def record_throttle(self):
with self._lock:
self._consecutive_failure += 1
self._consecutive_success = 0
self.delay = min(self.max_delay, self.delay * 2.0)
logger.warning(f"Rate limiter backed off: delay={self.delay:.1f}s")
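The limiter's update rules are worth sanity-checking in isolation. A minimal restatement of the same arithmetic, stripped of the locking and sleeping:

```python
def backed_off(delay: float, max_delay: float = 30.0) -> float:
    """After a throttle signal: double the delay, capped at max_delay."""
    return min(max_delay, delay * 2.0)

def sped_up(delay: float, min_delay: float = 2.0) -> float:
    """After five straight successes: shave 10% off, floored at min_delay."""
    return max(min_delay, delay * 0.9)

d = 4.0
for _ in range(4):
    d = backed_off(d)  # 8.0, 16.0, then capped at 30.0
```

The asymmetry is deliberate: back off multiplicatively and fast, recover slowly, so a single block costs minutes rather than triggering a ban.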
Playwright-Based Scraping for JavaScript-Heavy Pages
Some sections of Drugs.com — particularly the interaction checker and review sections with modern pagination — require JavaScript execution:
import asyncio
from playwright.async_api import async_playwright, Page, Browser
from typing import AsyncIterator
async def setup_stealth_browser(proxy_url: Optional[str] = None) -> Browser:
"""Launch a hardened Playwright browser with anti-detection measures."""
playwright = await async_playwright().start()
launch_args = [
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--no-first-run",
"--no-default-browser-check",
"--disable-infobars",
"--window-size=1920,1080",
"--lang=en-US",
]
launch_kwargs = {
"headless": True,
"args": launch_args,
}
if proxy_url:
# Parse proxy URL into components for Playwright
from urllib.parse import urlparse
parsed = urlparse(proxy_url)
launch_kwargs["proxy"] = {
"server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
"username": parsed.username or "",
"password": parsed.password or "",
}
browser = await playwright.chromium.launch(**launch_kwargs)
return browser
async def make_stealth_page(browser: Browser) -> Page:
"""Create a new page with stealth settings to avoid detection."""
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
locale="en-US",
timezone_id="America/New_York",
user_agent=random.choice(USER_AGENTS),
java_script_enabled=True,
accept_downloads=False,
ignore_https_errors=False,
)
page = await context.new_page()
# Override navigator.webdriver property
await page.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5]});
Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
window.chrome = { runtime: {} };
""")
page.set_default_timeout(30000)
return page
async def scrape_reviews_playwright(
drug_name: str,
proxy_url: Optional[str] = None,
max_pages: int = 10,
) -> List[Dict]:
"""Scrape reviews using full browser automation for sites with JS challenges."""
slug = drug_name.lower().replace(" ", "-")
reviews = []
browser = await setup_stealth_browser(proxy_url)
try:
page = await make_stealth_page(browser)
# First visit the main drug page to establish session
await page.goto(f"https://www.drugs.com/{slug}.html", wait_until="networkidle")
await asyncio.sleep(random.uniform(2, 4))
for page_num in range(1, max_pages + 1):
url = f"https://www.drugs.com/comments/{slug}/"
if page_num > 1:
url += f"?page={page_num}"
await page.goto(url, wait_until="networkidle")
await asyncio.sleep(random.uniform(1.5, 3.5))
# Check for CAPTCHA
captcha_frame = await page.query_selector("iframe[src*='captcha'], .captcha-container")
if captcha_frame:
logger.warning(f"CAPTCHA detected on page {page_num}, stopping")
break
# Extract review data via page evaluation
page_reviews = await page.evaluate("""
() => {
const reviews = [];
const cards = document.querySelectorAll('[class*="user-comment"], [class*="review-card"]');
cards.forEach(card => {
const rating = card.querySelector('[class*="rating"]');
const text = card.querySelector('[class*="comment-text"], [class*="review-text"]');
const date = card.querySelector('[class*="date"], time');
const condition = card.querySelector('[class*="condition"]');
reviews.push({
rating: rating ? rating.textContent.trim() : '',
text: text ? text.textContent.trim() : '',
date: date ? (date.getAttribute('datetime') || date.textContent.trim()) : '',
condition: condition ? condition.textContent.trim() : '',
});
});
return reviews;
}
""")
if not page_reviews:
break
reviews.extend([{**r, "drug": drug_name} for r in page_reviews])
logger.info(f"Playwright page {page_num}: {len(page_reviews)} reviews")
finally:
await browser.close()
return reviews
Rate Limiting and CAPTCHA Handling
class CaptchaHandler:
"""
CAPTCHA detection and fallback strategies.
Note: Solving CAPTCHAs programmatically may violate ToS.
This class focuses on detection and graceful fallback.
"""
CAPTCHA_SIGNALS = [
"captcha",
"cf-challenge",
"challenge-form",
"ray id",
"checking your browser",
"ddos-guard",
"access denied",
"bot detection",
]
@staticmethod
def is_captcha_page(html: str) -> bool:
"""Detect if a response contains a CAPTCHA or bot challenge."""
html_lower = html.lower()
return any(signal in html_lower for signal in CaptchaHandler.CAPTCHA_SIGNALS)
@staticmethod
def is_rate_limited(response: requests.Response) -> bool:
"""Check if response indicates rate limiting."""
if response.status_code == 429:
return True
if response.status_code == 503:
return True
if "rate limit" in response.text.lower():
return True
return False
@staticmethod
def handle_captcha_fallback(url: str, drug_name: str) -> Optional[Dict]:
"""
Fallback strategy when CAPTCHA is encountered.
Options:
1. Rotate proxy (most effective)
2. Wait and retry with longer delay
3. Switch to Playwright with fresh browser context
4. Log URL for manual review
5. Use alternative data source (FDA API, etc.)
"""
logger.warning(f"CAPTCHA/block detected for {drug_name} at {url}")
logger.info("Strategies: rotate proxy, extend delay, or use Playwright fallback")
# Record for later retry
with open("captcha_backlog.txt", "a") as f:
f.write(f"{url}\t{drug_name}\t{datetime.now().isoformat()}\n")
return None
def make_request_with_captcha_handling(
url: str,
session: DrugsComSession,
rate_limiter: AdaptiveRateLimiter,
proxy_pool: Optional[ThorDataProxyPool] = None,
) -> Optional[requests.Response]:
"""Make request with full CAPTCHA and rate limit handling."""
rate_limiter.wait()
try:
resp = session.get(url)
if CaptchaHandler.is_rate_limited(resp):
rate_limiter.record_throttle()
retry_after = int(resp.headers.get("Retry-After", 60))
logger.warning(f"Rate limited, waiting {retry_after}s")
time.sleep(retry_after + random.uniform(10, 30))
if proxy_pool:
proxy_pool.rotate()
return None
if CaptchaHandler.is_captcha_page(resp.text):
rate_limiter.record_throttle()
if proxy_pool:
proxy_pool.rotate()
logger.info("Rotated proxy after CAPTCHA detection")
return None
rate_limiter.record_success()
return resp
except requests.RequestException as e:
rate_limiter.record_throttle()
logger.error(f"Request failed for {url}: {e}")
return None
Output Schemas with Examples
A well-defined output schema is essential for data pipeline reliability:
# Drug information output schema
DRUG_INFO_SCHEMA = {
"name": "metformin",
"title": "Metformin",
"drug_class": "Biguanide antidiabetic",
"description": "Metformin is an oral diabetes medicine that helps control blood sugar levels...",
"availability": "Rx only",
"fda_status": "FDA approved",
"related_drugs": ["glipizide", "januvia", "victoza"],
"pronunciation": "met-FOR-min",
"url": "https://www.drugs.com/metformin.html",
}
# Side effects output schema
SIDE_EFFECTS_SCHEMA = {
"drug": "metformin",
"side_effect": "Nausea",
"frequency_category": "More Common", # More Common / Less Common / Rare
}
# Review output schema
REVIEW_SCHEMA = {
"drug": "metformin",
"condition": "Type 2 Diabetes",
"rating": 8.5, # 1-10 scale
"review_text": "I've been taking metformin for 6 months...",
"date": "March 15, 2024",
"reviewer_age": "45-54",
"duration_of_use": "1 to 6 months",
"helpful_votes": 12,
"scraped_at": "2026-03-31T14:22:00",
}
# Interaction output schema
INTERACTION_SCHEMA = {
"drug": "warfarin",
"interacts_with": "aspirin",
"interactant_url": "/drug-interactions/aspirin.html",
"severity": "major", # major / moderate / minor / food / unknown
"description": "May significantly increase the risk of bleeding...",
}
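Before records enter a downstream pipeline, a lightweight validation pass catches parser drift early (a renamed CSS class silently yielding empty strings, for example). A sketch of plain type checks against the review schema; the REVIEW_FIELDS table and validate_review helper are illustrative, not a library API:

```python
from typing import Any, Dict, List

# Field names mirror REVIEW_SCHEMA above
REVIEW_FIELDS: Dict[str, type] = {
    "drug": str,
    "condition": str,
    "rating": float,
    "review_text": str,
    "date": str,
    "helpful_votes": int,
}

def validate_review(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected in REVIEW_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(record[field]).__name__}"
            )
    rating = record.get("rating")
    if isinstance(rating, float) and not 0.0 <= rating <= 10.0:
        problems.append("rating outside the 0-10 scale")
    return problems
```

Rejected records go to a quarantine table for inspection rather than silently polluting aggregates.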
Real-World Use Cases with Code
Use Case 1: Pharmacovigilance Signal Detection
Identify drugs with unusually high proportions of serious adverse event reports:
def build_adverse_event_profile(drug_list: List[str], session: DrugsComSession) -> pd.DataFrame:
"""Build adverse event severity profiles for a drug list."""
profiles = []
for drug in drug_list:
try:
effects_df = get_side_effects(drug, session)
if effects_df.empty:
continue
total = len(effects_df)
serious_keywords = r"death|hospitali|seizure|fatal|cardiac|stroke|anaphylaxis|liver failure"
serious = effects_df["side_effect"].str.contains(serious_keywords, case=False, na=False).sum()
profiles.append({
"drug": drug,
"total_effects": total,
"serious_effects": serious,
"serious_ratio": serious / total if total > 0 else 0,
"frequency_breakdown": effects_df["frequency_category"].value_counts().to_dict(),
})
time.sleep(random.uniform(5, 10))
except Exception as e:
logger.error(f"Failed profile for {drug}: {e}")
return pd.DataFrame(profiles).sort_values("serious_ratio", ascending=False)
# Analyze top antidiabetics
diabetes_drugs = ["metformin", "ozempic", "victoza", "januvia", "jardiance", "farxiga"]
profiles = build_adverse_event_profile(diabetes_drugs, session)
print(profiles[["drug", "total_effects", "serious_ratio"]].to_string())
Use Case 2: Patient Sentiment Analysis by Condition
Analyze how patients rate the same drug for different conditions:
def analyze_drug_by_condition(drug_name: str, session: DrugsComSession) -> pd.DataFrame:
"""Compare drug effectiveness ratings across different conditions."""
reviews = []
for review in iter_drug_reviews(drug_name, session, max_pages=15):
if review.condition and review.rating > 0:
reviews.append({"condition": review.condition, "rating": review.rating, "text": review.review_text})
if not reviews:
return pd.DataFrame()
df = pd.DataFrame(reviews)
analysis = df.groupby("condition").agg(
mean_rating=("rating", "mean"),
review_count=("rating", "count"),
std_rating=("rating", "std"),
).round(2).sort_values("review_count", ascending=False)
return analysis[analysis["review_count"] >= 5] # Filter low-sample conditions
# Example output:
# mean_rating review_count std_rating
# Type 2 Diabetes 7.2 847 2.1
# Weight Loss 5.8 234 2.8
# Polycystic Ovary Syndrome 7.6 198 2.2
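Ratings alone miss what patients actually say in the review text. As a naive illustration of the sentiment step (the keyword lists here are arbitrary assumptions, not a validated lexicon; a real pipeline would use a proper sentiment model), reviews can be bucketed with simple substring matching:

```python
# Naive keyword-based sentiment tagging for review text.
# Keyword lists are illustrative only; substring matching will misread
# negations such as "would not recommend".
POSITIVE = {"helped", "effective", "great", "improved", "recommend"}
NEGATIVE = {"nausea", "worse", "awful", "stopped", "terrible", "side effects"}

def tag_sentiment(text: str) -> str:
    """Tag a review as positive/negative/neutral by keyword counts."""
    lowered = text.lower()
    pos = sum(word in lowered for word in POSITIVE)
    neg = sum(word in lowered for word in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

Applied as `df["sentiment"] = df["text"].apply(tag_sentiment)`, this gives a quick second axis alongside the numeric ratings, which is often enough for exploratory work before investing in an NLP model.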
Use Case 3: Drug Interaction Network Builder
Build a graph of drug-drug interactions:
import json
def build_interaction_network(
seed_drugs: List[str],
session: DrugsComSession,
max_depth: int = 1,
) -> Dict:
"""Build a drug interaction network up to specified depth."""
network = {"nodes": {}, "edges": []}
visited = set()
queue = [(drug, 0) for drug in seed_drugs]
while queue:
drug, depth = queue.pop(0)
if drug in visited:
continue
visited.add(drug)
network["nodes"][drug] = {"scraped": True}
interactions = get_interactions(drug, session)
if interactions.empty:
continue
for _, row in interactions.iterrows():
edge = {
"source": drug,
"target": row["interacts_with"],
"severity": row["severity"],
}
network["edges"].append(edge)
if depth < max_depth and row["interacts_with"] not in visited:
if row["severity"] in ["major", "moderate"]:
queue.append((row["interacts_with"], depth + 1))
time.sleep(random.uniform(5, 10))
with open("interaction_network.json", "w") as f:
json.dump(network, f, indent=2)
return network
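Once the network JSON is written, it can be analyzed with nothing but the standard library. A hypothetical `summarize_network` helper (not part of the code above), assuming the `{"nodes": ..., "edges": ...}` shape produced by `build_interaction_network`:

```python
from collections import Counter

def summarize_network(network: dict) -> dict:
    """Summarize an interaction network: edge severities and hub drugs."""
    severity_counts = Counter(edge["severity"] for edge in network["edges"])
    # Degree count treats the graph as undirected: both endpoints get credit.
    degree = Counter()
    for edge in network["edges"]:
        degree[edge["source"]] += 1
        degree[edge["target"]] += 1
    return {
        "n_nodes": len(network["nodes"]),
        "n_edges": len(network["edges"]),
        "severity_counts": dict(severity_counts),
        "most_connected": degree.most_common(5),
    }
```

For anything beyond counts (shortest paths, clustering), loading the same edge list into a graph library is a natural next step.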
Use Case 4: Generic vs Brand Price Intelligence
def compare_generic_brand(drug_name: str, session: DrugsComSession) -> Dict:
"""Scrape price comparison data between generic and brand versions."""
slug = drug_name.lower().replace(" ", "-")
url = f"https://www.drugs.com/price-guide/{slug}"
try:
resp = session.get(url)
soup = BeautifulSoup(resp.text, "lxml")
prices = {}
price_table = soup.find("table", class_=re.compile("price"))
if price_table:
for row in price_table.find_all("tr")[1:]:
cells = row.find_all("td")
if len(cells) >= 3:
prices[cells[0].get_text(strip=True)] = {
"dosage": cells[1].get_text(strip=True),
"price": cells[2].get_text(strip=True),
}
return prices
except Exception as e:
logger.error(f"Price scrape failed for {drug_name}: {e}")
return {}
Use Case 5: Dosage Table Extraction
def get_dosage_info(drug_name: str, session: DrugsComSession) -> pd.DataFrame:
"""Extract structured dosage information."""
slug = drug_name.lower().replace(" ", "-")
url = f"https://www.drugs.com/dosage/{slug}.html"
resp = session.get(url, referer=f"https://www.drugs.com/{slug}.html")
soup = BeautifulSoup(resp.text, "lxml")
dosage_data = []
# Extract from structured tables
for table in soup.find_all("table"):
headers = [th.get_text(strip=True) for th in table.find_all("th")]
for row in table.find_all("tr")[1:]:
cells = [td.get_text(strip=True) for td in row.find_all("td")]
if cells:
row_dict = dict(zip(headers, cells))
row_dict["drug"] = drug_name
dosage_data.append(row_dict)
return pd.DataFrame(dosage_data)
Use Case 6: Treatment Outcome Research Dataset
def build_treatment_research_dataset(
conditions: List[str],
drugs_per_condition: int = 5,
reviews_per_drug: int = 100,
db_path: str = "treatment_research.db",
) -> None:
"""
Build a research dataset linking conditions to drug effectiveness scores.
Useful for comparative effectiveness research.
"""
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS drug_condition_stats (
id INTEGER PRIMARY KEY AUTOINCREMENT,
drug TEXT NOT NULL,
condition TEXT NOT NULL,
mean_rating REAL,
review_count INTEGER,
positive_ratio REAL,
scraped_at TEXT DEFAULT (datetime('now')),
UNIQUE(drug, condition)
);
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
drug TEXT,
condition TEXT,
rating REAL,
review_text TEXT,
date TEXT,
helpful_votes INTEGER
);
""")
conn.commit()
session = DrugsComSession()  # pass proxy_url=... here if routing through a proxy pool
for condition in conditions:
logger.info(f"Processing condition: {condition}")
# In practice, you'd use a drug-by-condition lookup endpoint
# or pre-populate from medical knowledge bases
time.sleep(random.uniform(5, 10))
conn.close()
logger.info(f"Research dataset saved to {db_path}")
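The stub above leaves the aggregation step open. One possible sketch of computing per-condition stats and upserting them into the `drug_condition_stats` table, assuming ratings on the site's 1-10 scale and treating a rating of 7 or higher as "positive" (an arbitrary cutoff chosen here for illustration):

```python
def upsert_condition_stats(conn, drug: str, condition: str, ratings: list) -> None:
    """Aggregate a list of 1-10 ratings and upsert into drug_condition_stats."""
    if not ratings:
        return
    mean_rating = sum(ratings) / len(ratings)
    # ">= 7 counts as positive" is an arbitrary threshold for this sketch.
    positive_ratio = sum(1 for r in ratings if r >= 7) / len(ratings)
    # Relies on the UNIQUE(drug, condition) constraint defined in the schema.
    conn.execute(
        """INSERT INTO drug_condition_stats (drug, condition, mean_rating, review_count, positive_ratio)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT(drug, condition) DO UPDATE SET
               mean_rating = excluded.mean_rating,
               review_count = excluded.review_count,
               positive_ratio = excluded.positive_ratio""",
        (drug, condition, round(mean_rating, 2), len(ratings), round(positive_ratio, 3)),
    )
    conn.commit()
```

The `ON CONFLICT ... DO UPDATE` upsert means re-running the pipeline refreshes existing rows rather than erroring on the unique constraint, which keeps incremental re-scrapes idempotent.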
Use Case 7: New Drug Approval Monitor
def monitor_new_approvals(session: DrugsComSession, db_path: str = "approvals.db") -> List[Dict]:
"""
Monitor Drugs.com new approvals feed for recently approved drugs.
Useful for pharmaceutical market intelligence.
"""
url = "https://www.drugs.com/newdrugs.html"
resp = session.get(url)
soup = BeautifulSoup(resp.text, "lxml")
new_drugs = []
for item in soup.select(".newdrugs-list li, .content-box li"):
link = item.find("a")
if link:
new_drugs.append({
"name": link.get_text(strip=True),
"url": "https://www.drugs.com" + link.get("href", ""),
"description": item.get_text(strip=True),
"found_at": datetime.now().isoformat(),
})
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS approvals (
id INTEGER PRIMARY KEY, name TEXT, url TEXT UNIQUE,
description TEXT, found_at TEXT
)
""")
for drug in new_drugs:
conn.execute(
"INSERT OR IGNORE INTO approvals (name, url, description, found_at) VALUES (?, ?, ?, ?)",
(drug["name"], drug["url"], drug["description"], drug["found_at"]),
)
conn.commit()
conn.close()
return new_drugs
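For a monitor, the useful signal is which approvals are new since the last run, not the full list. A small hypothetical helper (`diff_new_approvals` is not part of the code above) that checks the `approvals` table before you alert on anything:

```python
def diff_new_approvals(conn, scraped: list) -> list:
    """Return only scraped drugs whose URL is not yet in the approvals table.

    Call this *before* the INSERT OR IGNORE pass so you can alert on
    genuinely new entries.
    """
    fresh = []
    for drug in scraped:
        row = conn.execute(
            "SELECT 1 FROM approvals WHERE url = ?", (drug["url"],)
        ).fetchone()
        if row is None:
            fresh.append(drug)
    return fresh
```

Run on a schedule (cron, a cloud function), the monitor then only notifies when `diff_new_approvals` returns a non-empty list.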
Complete Production Pipeline
import concurrent.futures
import argparse
from pathlib import Path
def run_full_scraping_pipeline(
drug_list: List[str],
output_dir: str = "output",
max_workers: int = 2,
proxy_username: str = "",
proxy_password: str = "",
) -> Dict[str, Any]:
"""
Production-grade pipeline for scraping drug data at scale.
Uses thread pool for parallelism with per-drug proxy rotation.
"""
Path(output_dir).mkdir(exist_ok=True)
proxy_pool = ThorDataProxyPool(
username=proxy_username,
password=proxy_password,
country="US",
sticky_session_minutes=8,
) if proxy_username else None
rate_limiter = AdaptiveRateLimiter(base_delay=5.0)  # shared across workers; wire into scrape_one's request loop
results = {"success": [], "failed": [], "total": len(drug_list)}
def scrape_one(drug_name: str) -> bool:
try:
proxy = proxy_pool.get_sticky_proxy() if proxy_pool else None
session = DrugsComSession(proxy_url=proxy)
data = {
"info": get_drug_info(drug_name, session),
"side_effects": get_side_effects(drug_name, session).to_dict(orient="records"),
"interactions": get_interactions(drug_name, session).to_dict(orient="records"),
}
out_file = Path(output_dir) / f"{drug_name.replace(' ', '_')}.json"
with open(out_file, "w") as f:
json.dump(data, f, indent=2, default=str)
if proxy_pool:
proxy_pool.rotate()
return True
except Exception as e:
logger.error(f"Pipeline failed for {drug_name}: {e}")
return False
# Process with limited concurrency to respect rate limits
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = {executor.submit(scrape_one, drug): drug for drug in drug_list}
for future in concurrent.futures.as_completed(futures):
drug = futures[future]
success = future.result()
(results["success"] if success else results["failed"]).append(drug)
logger.info(f"Progress: {len(results['success']) + len(results['failed'])}/{results['total']}")
return results
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--drugs", nargs="+", default=["metformin", "lisinopril", "atorvastatin"])
parser.add_argument("--proxy-user", default="")
parser.add_argument("--proxy-pass", default="")
args = parser.parse_args()
results = run_full_scraping_pipeline(
args.drugs,
proxy_username=args.proxy_user,
proxy_password=args.proxy_pass,
)
print(f"Completed: {len(results['success'])} success, {len(results['failed'])} failed")
Ethical Considerations and Legal Compliance
Scraping health data carries heightened responsibility beyond typical web scraping:
- robots.txt: Always check https://www.drugs.com/robots.txt and respect disallowed paths
- Patient privacy: Reviews contain personal health disclosures; aggregate and anonymize, and do not redistribute individually
- Rate limiting: Real patients depend on this site. Keep requests slow enough to avoid impacting legitimate users
- Research ethics: If publishing findings from this data, consult your institution's IRB/ethics board
- HIPAA awareness: Even publicly available health data can raise regulatory concerns in certain research contexts
- Terms of Service: Drugs.com prohibits systematic data collection without authorization; large-scale scraping should be preceded by legal review
- Clinical safety: Never present scraped medication data as clinical guidance or use it in any decision-support system without proper validation and regulatory clearance
The combination of structured pharmacological data and patient narratives makes Drugs.com extraordinarily valuable for pharmaceutical research, drug safety surveillance, and patient experience analytics. Approach it with the care and responsibility that health data demands.