How to Scrape Drugs.com for Medication Data with Python (2026)
Drugs.com stands as one of the most comprehensive publicly accessible medication databases on the internet. With over 24,000 drug monographs, millions of patient reviews, detailed pharmacological interaction data, dosage calculators, and side-effect frequency breakdowns sourced from clinical trials, it represents a goldmine for pharmaceutical researchers, healthcare data scientists, pharmacovigilance teams, and developers building health-adjacent applications.
Unlike many data sources in the healthcare space, Drugs.com provides consumer-facing information that bridges clinical data and real patient experiences. The combination of FDA-approved prescribing information alongside unstructured patient narratives creates a uniquely rich dataset. You can understand not just what a drug is supposed to do, but what patients actually experience — the difference between documented side-effect frequency in clinical trials and what emerges when millions of real-world patients self-report over years of use.
The technical challenge of scraping Drugs.com lies not in parsing complexity (the HTML is well-structured) but in circumventing its multi-layered bot defenses. The site sits behind Cloudflare's Enterprise WAF, uses behavioral fingerprinting, and deploys IP reputation scoring that's particularly harsh on datacenter ranges. A naive requests.get() approach will get you blocked within minutes. This guide covers everything from basic setup through production-grade scraping pipelines with residential proxy rotation, adaptive rate limiting, session management, and CAPTCHA fallback strategies.
Disclaimer: This data is for research and analytical purposes only. Never use scraped medication data for clinical decisions, prescribing advice, or patient care. Always consult licensed healthcare professionals for medical guidance. Review Drugs.com's Terms of Service and your jurisdiction's data scraping laws before building any large-scale collection pipeline. Patient reviews contain personal health information — anonymize and aggregate appropriately.
Understanding the Data Structure
Before writing a single line of code, map out what Drugs.com actually offers:
- Drug monographs at drugs.com/{drug-name}.html — clinical descriptions, mechanism, indications
- Side effects at drugs.com/sfx/{drug-name}-side-effects.html — frequency-bucketed adverse effects
- Patient reviews at drugs.com/comments/{drug-name}/ — paginated, with condition, rating, text
- Drug interactions at drugs.com/drug-interactions/{drug-name}.html — severity-classified pairings
- Dosage information at drugs.com/dosage/{drug-name}.html — age/weight/indication tables
- Drug search at drugs.com/search.php?searchterm={query} — autocomplete-style results
- Drug classes at drugs.com/drug-class/ — categorical browsing structure
- FDA drug database at drugs.com/fda/ — regulatory filings and approvals
Most of these pages are server-side rendered HTML. The reviews section uses some dynamic loading for pagination, but the core content is available in the initial response.
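Every scraper in this guide derives its URLs from the same slug rule (lowercase, spaces to hyphens), so it is worth centralizing the patterns above in one place. A small sketch; the ENDPOINTS table and drug_url helper are illustrative names, not part of any library, and brand or combination names may not follow the simple slug rule:

```python
# Hypothetical helper centralizing the URL patterns listed above.
ENDPOINTS = {
    "monograph": "https://www.drugs.com/{slug}.html",
    "side_effects": "https://www.drugs.com/sfx/{slug}-side-effects.html",
    "reviews": "https://www.drugs.com/comments/{slug}/",
    "interactions": "https://www.drugs.com/drug-interactions/{slug}.html",
    "dosage": "https://www.drugs.com/dosage/{slug}.html",
}

def drug_url(drug_name: str, kind: str = "monograph") -> str:
    """Build an endpoint URL from a drug name using the common slug rule."""
    slug = drug_name.strip().lower().replace(" ", "-")
    return ENDPOINTS[kind].format(slug=slug)
```

Keeping the patterns in one table means a site-side URL change is a one-line fix instead of a hunt through every function.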
Setup and Dependencies
pip install requests httpx beautifulsoup4 lxml pandas playwright tenacity fake-useragent
playwright install chromium
For proxy-enabled scraping at scale:
pip install requests[socks] httpx[socks]
Core HTTP Client with Anti-Detection
The foundation of any successful Drugs.com scraper is a well-crafted HTTP client that looks like real browser traffic. This means realistic headers, consistent TLS fingerprinting, and session cookie persistence.
import requests
import httpx
import random
import time
import logging
from typing import Optional, Dict, Any
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
logger = logging.getLogger(__name__)
# Rotate through realistic browser user agents
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
def build_headers(referer: Optional[str] = None) -> Dict[str, str]:
"""Build realistic browser headers with optional referer."""
ua = random.choice(USER_AGENTS)
headers = {
"User-Agent": ua,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "same-origin" if referer else "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
if referer:
headers["Referer"] = referer
return headers
class DrugsComSession:
"""Persistent session with cookie management and proxy rotation."""
def __init__(self, proxy_url: Optional[str] = None):
self.session = requests.Session()
self.proxy_url = proxy_url
self._warm_up()
def _warm_up(self):
"""Visit homepage to establish cookies before scraping."""
try:
self.session.headers.update(build_headers())
if self.proxy_url:
self.session.proxies = {
"http": self.proxy_url,
"https": self.proxy_url,
}
resp = self.session.get(
"https://www.drugs.com/",
timeout=30,
allow_redirects=True,
)
logger.info(f"Session warmed up, status={resp.status_code}, cookies={len(self.session.cookies)}")
time.sleep(random.uniform(1.5, 3.0))
except Exception as e:
logger.warning(f"Warm-up failed: {e}")
@retry(
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=2, min=4, max=60),
retry=retry_if_exception_type((requests.RequestException, IOError)),
)
def get(self, url: str, referer: Optional[str] = None) -> requests.Response:
"""Make a GET request with retry logic and adaptive delays."""
# Update headers on each request for variation
headers = build_headers(referer)
self.session.headers.update(headers)
time.sleep(random.uniform(3.5, 7.0))
resp = self.session.get(url, timeout=30)
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 120))
logger.warning(f"Rate limited, waiting {retry_after}s")
time.sleep(retry_after + random.uniform(5, 15))
raise requests.RequestException("Rate limited")
if resp.status_code == 403:
logger.warning(f"403 Forbidden at {url} - possible Cloudflare block")
time.sleep(random.uniform(30, 60))
raise requests.RequestException("Forbidden")
if resp.status_code == 503:
logger.warning("503 Service Unavailable - server overloaded")
time.sleep(random.uniform(20, 40))
raise requests.RequestException("Service unavailable")
resp.raise_for_status()
return resp
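One subtlety in the 429 branch above: per RFC 7231, Retry-After may carry either delta-seconds or an HTTP-date, so the bare int(...) call raises ValueError on the date form. A defensive parser, as a standard-library-only sketch (the parse_retry_after name is my own, not a requests API):

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def parse_retry_after(value: str, default: float = 120.0) -> float:
    """Parse a Retry-After header that may be delta-seconds or an HTTP-date."""
    if not value:
        return default
    try:
        # Delta-seconds form, e.g. "120"
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        dt = parsedate_to_datetime(value)
        return max(0.0, (dt - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default
```

Swapping this in for the int(resp.headers.get(...)) calls keeps a malformed or date-valued header from turning a throttle into a crash.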
Drug Information Pages
from bs4 import BeautifulSoup
import re
def parse_drug_info(html: str, drug_name: str) -> Dict[str, Any]:
"""Parse a drug monograph page into structured data."""
soup = BeautifulSoup(html, "lxml")
info = {"name": drug_name, "url": f"https://www.drugs.com/{drug_name.lower().replace(' ', '-')}.html"}
# Title and generic name
h1 = soup.find("h1")
if h1:
info["title"] = h1.get_text(strip=True)
# Drug class badge
drug_class = soup.select_one(".drug-class a, .content-box .ddc-pid-class")
if drug_class:
info["drug_class"] = drug_class.get_text(strip=True)
# Main description (first substantial paragraph)
content_box = soup.find("div", class_="contentBox") or soup.find("div", class_="ddc-main-content")
if content_box:
paragraphs = content_box.find_all("p", recursive=False)
if paragraphs:
info["description"] = paragraphs[0].get_text(strip=True)
# Availability (Rx/OTC)
availability = soup.find(string=re.compile(r"Availability|Rx only|OTC"))
if availability:
info["availability"] = str(availability).strip()
# FDA approval status
fda_note = soup.find("div", class_="ddc-fda-approval")
if fda_note:
info["fda_status"] = fda_note.get_text(strip=True)
# Related drugs / alternatives
related = []
for link in soup.select(".ddc-related a, .related-drugs a")[:10]:
related.append(link.get_text(strip=True))
if related:
info["related_drugs"] = related
# Pronunciation guide
pronunciation = soup.find("div", class_="pronunciation")
if pronunciation:
info["pronunciation"] = pronunciation.get_text(strip=True)
return info
def get_drug_info(drug_name: str, session: DrugsComSession) -> Dict[str, Any]:
slug = drug_name.lower().replace(" ", "-")
url = f"https://www.drugs.com/{slug}.html"
resp = session.get(url, referer="https://www.drugs.com/")
return parse_drug_info(resp.text, drug_name)
# Usage
session = DrugsComSession(proxy_url="http://USER:[email protected]:9000")
info = get_drug_info("metformin", session)
print(info)
Side Effects with Frequency Data
Side effect pages contain clinically useful frequency buckets from trial data:
import pandas as pd
def get_side_effects(drug_name: str, session: DrugsComSession) -> pd.DataFrame:
"""Scrape side effects with frequency classification."""
slug = drug_name.lower().replace(" ", "-")
url = f"https://www.drugs.com/sfx/{slug}-side-effects.html"
resp = session.get(url, referer=f"https://www.drugs.com/{slug}.html")
soup = BeautifulSoup(resp.text, "lxml")
effects = []
# Frequency-bucketed sections (Common, Infrequent, Rare)
for section in soup.find_all(["h2", "h3"]):
header_text = section.get_text(strip=True)
# Find the list following this header
next_sibling = section.find_next_sibling()
while next_sibling:
if next_sibling.name in ["ul", "div"] and next_sibling.find("li"):
for item in next_sibling.find_all("li"):
effect_text = item.get_text(strip=True)
if effect_text:
effects.append({
"drug": drug_name,
"side_effect": effect_text,
"frequency_category": header_text,
})
break
next_sibling = next_sibling.find_next_sibling()
# Also try the structured side-effects-list divs
for container in soup.find_all("div", class_=re.compile("side-effects")):
category = container.get("data-freq", "Unknown")
header = container.find_previous(["h2", "h3"])
if header:
category = header.get_text(strip=True)
for item in container.find_all("li"):
text = item.get_text(strip=True)
if text and not any(e["side_effect"] == text for e in effects):
effects.append({
"drug": drug_name,
"side_effect": text,
"frequency_category": category,
})
df = pd.DataFrame(effects)
logger.info(f"Extracted {len(df)} side effects for {drug_name}")
return df
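The frame this returns pivots naturally into a per-category count matrix, which is the usual first step when comparing drugs. A self-contained sketch on made-up rows in the same shape get_side_effects() produces:

```python
import pandas as pd

# Illustrative rows only; real category labels vary by drug page.
sample = pd.DataFrame([
    {"drug": "metformin", "side_effect": "Nausea", "frequency_category": "More common"},
    {"drug": "metformin", "side_effect": "Diarrhea", "frequency_category": "More common"},
    {"drug": "metformin", "side_effect": "Lactic acidosis", "frequency_category": "Rare"},
])

# One row per drug, one column per frequency bucket, cell = effect count
counts = (
    sample.groupby(["drug", "frequency_category"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```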
Scraping Patient Reviews at Scale
Reviews are the highest-value data on Drugs.com — real patient experiences with effectiveness ratings, condition-specific filtering, and temporal data going back over a decade.
from dataclasses import dataclass
from typing import List, Iterator
import sqlite3
@dataclass
class DrugReview:
drug: str
condition: str
rating: float
effectiveness: Optional[str]
ease_of_use: Optional[str]
satisfaction: Optional[str]
review_text: str
date: str
reviewer_age: Optional[str]
duration_of_use: Optional[str]
helpful_votes: int
def parse_review_card(card: BeautifulSoup, drug_name: str) -> Optional[DrugReview]:
"""Parse a single review card element."""
try:
condition_el = card.find("b", class_=re.compile("condition")) or card.find("strong", string=re.compile("Condition"))
condition = ""
if condition_el:
# Sometimes condition is in the next sibling text
condition = condition_el.get_text(strip=True).replace("Condition:", "").strip()
rating_el = card.find("span", class_=re.compile("rating")) or card.find("div", class_=re.compile("rating"))
rating = 0.0
if rating_el:
rating_text = rating_el.get_text(strip=True)
match = re.search(r"(\d+(?:\.\d+)?)", rating_text)
if match:
rating = float(match.group(1))
comment_el = (
card.find("span", class_=re.compile("comment-text"))
or card.find("p", class_=re.compile("comment"))
or card.find("div", class_=re.compile("review-text"))
)
review_text = comment_el.get_text(strip=True) if comment_el else ""
date_el = card.find("span", class_=re.compile("date")) or card.find("time")
date = date_el.get_text(strip=True) if date_el else ""
helpful_el = card.find(string=re.compile(r"\d+ found this comment helpful"))
helpful_votes = 0
if helpful_el:
match = re.search(r"(\d+)", str(helpful_el))
if match:
helpful_votes = int(match.group(1))
duration_el = card.find(string=re.compile(r"Duration of Use|duration"))
duration = None
if duration_el:
duration = str(duration_el).strip()
age_el = card.find(string=re.compile(r"Age:|years old"))
age = None
if age_el:
age = str(age_el).strip()
return DrugReview(
drug=drug_name,
condition=condition,
rating=rating,
effectiveness=None,
ease_of_use=None,
satisfaction=None,
review_text=review_text,
date=date,
reviewer_age=age,
duration_of_use=duration,
helpful_votes=helpful_votes,
)
except Exception as e:
logger.warning(f"Failed to parse review card: {e}")
return None
def iter_drug_reviews(drug_name: str, session: DrugsComSession, max_pages: int = 20) -> Iterator[DrugReview]:
"""Iterate over all review pages for a drug."""
slug = drug_name.lower().replace(" ", "-")
base_url = f"https://www.drugs.com/comments/{slug}/"
for page in range(1, max_pages + 1):
url = base_url if page == 1 else f"{base_url}?page={page}"
referer = base_url if page > 1 else "https://www.drugs.com/"
try:
resp = session.get(url, referer=referer)
except Exception as e:
logger.error(f"Failed to fetch page {page} for {drug_name}: {e}")
break
soup = BeautifulSoup(resp.text, "lxml")
# Detect end of pagination
cards = (
soup.find_all("div", class_=re.compile(r"user-comment|review-card|comment-card"))
or soup.find_all("li", class_=re.compile(r"review"))
)
if not cards:
logger.info(f"No review cards found on page {page} for {drug_name}, stopping")
break
page_reviews = 0
for card in cards:
review = parse_review_card(card, drug_name)
if review and review.review_text:
yield review
page_reviews += 1
logger.info(f"Page {page}: extracted {page_reviews} reviews for {drug_name}")
# Check if there's a next page
next_link = soup.find("a", string=re.compile(r"Next|>")) or soup.find("a", rel="next")
if not next_link:
break
def save_reviews_to_sqlite(
drug_name: str,
session: DrugsComSession,
db_path: str = "drugs_reviews.db",
max_pages: int = 20,
) -> int:
"""Stream reviews directly into SQLite."""
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
drug TEXT,
condition TEXT,
rating REAL,
review_text TEXT,
date TEXT,
reviewer_age TEXT,
duration_of_use TEXT,
helpful_votes INTEGER,
scraped_at TEXT DEFAULT (datetime('now'))
)
""")
conn.commit()
count = 0
for review in iter_drug_reviews(drug_name, session, max_pages=max_pages):
conn.execute(
"""INSERT INTO reviews
(drug, condition, rating, review_text, date, reviewer_age, duration_of_use, helpful_votes)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
(
review.drug, review.condition, review.rating, review.review_text,
review.date, review.reviewer_age, review.duration_of_use, review.helpful_votes,
),
)
count += 1
if count % 50 == 0:
conn.commit()
logger.info(f"Committed {count} reviews for {drug_name}")
conn.commit()
conn.close()
return count
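Once reviews are in SQLite, aggregation is a single SQL query. A self-contained demo against an in-memory database with made-up rows; the table layout matches the CREATE TABLE above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reviews (
        drug TEXT, condition TEXT, rating REAL, review_text TEXT,
        date TEXT, reviewer_age TEXT, duration_of_use TEXT, helpful_votes INTEGER
    )
""")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("metformin", "Type 2 Diabetes", 8.0, "ok", "2024-01-02", None, None, 3),
        ("metformin", "Type 2 Diabetes", 6.0, "ok", "2024-02-11", None, None, 1),
        ("metformin", "PCOS", 9.0, "ok", "2024-03-05", None, None, 7),
    ],
)
# Mean rating and review count per condition, most-reviewed first
rows = conn.execute(
    """SELECT condition, ROUND(AVG(rating), 1), COUNT(*)
       FROM reviews GROUP BY condition ORDER BY COUNT(*) DESC"""
).fetchall()
# rows == [('Type 2 Diabetes', 7.0, 2), ('PCOS', 9.0, 1)]
```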
Drug Interactions
The interactions database is particularly valuable for pharmacovigilance and clinical decision support research:
from enum import Enum
class InteractionSeverity(str, Enum):
MAJOR = "major"
MODERATE = "moderate"
MINOR = "minor"
UNKNOWN = "unknown"
FOOD = "food"
def get_interactions(drug_name: str, session: DrugsComSession) -> pd.DataFrame:
"""Scrape drug interaction data with severity classifications."""
slug = drug_name.lower().replace(" ", "-")
url = f"https://www.drugs.com/drug-interactions/{slug}.html"
resp = session.get(url, referer=f"https://www.drugs.com/{slug}.html")
soup = BeautifulSoup(resp.text, "lxml")
interactions = []
# Structured interaction rows
for row in soup.find_all("tr", class_=re.compile(r"int-")):
cells = row.find_all("td")
if len(cells) < 2:
continue
classes = row.get("class", [])
severity = InteractionSeverity.UNKNOWN
for cls in classes:
if "major" in cls:
severity = InteractionSeverity.MAJOR
elif "moderate" in cls:
severity = InteractionSeverity.MODERATE
elif "minor" in cls:
severity = InteractionSeverity.MINOR
elif "food" in cls:
severity = InteractionSeverity.FOOD
interactant_link = cells[0].find("a")
interactions.append({
"drug": drug_name,
"interacts_with": cells[0].get_text(strip=True),
"interactant_url": interactant_link.get("href", "") if interactant_link else "",
"severity": severity.value,
"description": cells[1].get_text(strip=True) if len(cells) > 1 else "",
})
# Also check for food interactions section
food_section = soup.find("h2", string=re.compile(r"food", re.IGNORECASE))
if food_section:
food_list = food_section.find_next("ul")
if food_list:
for item in food_list.find_all("li"):
interactions.append({
"drug": drug_name,
"interacts_with": item.get_text(strip=True),
"interactant_url": "",
"severity": InteractionSeverity.FOOD.value,
"description": "Food interaction",
})
df = pd.DataFrame(interactions)
logger.info(f"Found {len(df)} interactions for {drug_name} ({df['severity'].value_counts().to_dict() if not df.empty else {}})")
return df
Proxy Rotation with ThorData
At scale — scraping hundreds of drugs — you will get blocked without residential proxies. Drugs.com's Cloudflare integration is tuned to flag datacenter IPs aggressively. ThorData provides residential proxy pools with real ISP addresses that pass Cloudflare's reputation checks.
import threading
class ThorDataProxyPool:
"""
Rotating proxy pool using ThorData's residential network.
Supports sticky sessions (same IP for multi-page workflows)
and rotating sessions (new IP per request).
"""
def __init__(
self,
username: str,
password: str,
host: str = "proxy.thordata.com",
port: int = 9000,
country: str = "US",
sticky_session_minutes: int = 5,
):
self.username = username
self.password = password
self.host = host
self.port = port
self.country = country
self.sticky_minutes = sticky_session_minutes
self._session_id = None
self._session_created = 0
self._lock = threading.Lock()
def _new_session_id(self) -> str:
"""Generate a random session identifier for sticky sessions."""
return f"sess_{random.randint(100000, 999999)}"
def get_rotating_proxy(self) -> str:
"""Get a proxy URL that rotates on every request."""
return (
f"http://{self.username}-country-{self.country}:"
f"{self.password}@{self.host}:{self.port}"
)
def get_sticky_proxy(self) -> str:
"""
Get a proxy URL that uses the same exit IP for up to sticky_session_minutes.
Useful for multi-page workflows like paginated reviews.
"""
with self._lock:
now = time.time()
if (
self._session_id is None
or now - self._session_created > self.sticky_minutes * 60
):
self._session_id = self._new_session_id()
self._session_created = now
logger.debug(f"New sticky proxy session: {self._session_id}")
return (
f"http://{self.username}-country-{self.country}-"
f"session-{self._session_id}:{self.password}@{self.host}:{self.port}"
)
def rotate(self):
"""Force rotation to a new IP on next sticky request."""
with self._lock:
self._session_id = None
# Usage pattern
proxy_pool = ThorDataProxyPool(
username="your_username",
password="your_password",
country="US",
sticky_session_minutes=10, # Keep same IP for 10 minutes per drug
)
def create_session_for_drug(drug_name: str) -> DrugsComSession:
"""Create a fresh session with sticky proxy for a single drug's data collection."""
proxy_url = proxy_pool.get_sticky_proxy()
session = DrugsComSession(proxy_url=proxy_url)
logger.info(f"Created session for {drug_name} via {proxy_url[:50]}...")
return session
def scrape_drug_complete(drug_name: str) -> Dict[str, Any]:
"""Full data collection for a single drug with proxy rotation between drugs."""
session = create_session_for_drug(drug_name)
result = {}
try:
result["info"] = get_drug_info(drug_name, session)
time.sleep(random.uniform(4, 8))
result["side_effects"] = get_side_effects(drug_name, session).to_dict(orient="records")
time.sleep(random.uniform(4, 8))
result["interactions"] = get_interactions(drug_name, session).to_dict(orient="records")
except Exception as e:
logger.error(f"Failed scraping {drug_name}: {e}")
result["error"] = str(e)
# Force new IP for next drug
proxy_pool.rotate()
return result
Anti-Detection: Headers, Delays, Fingerprint Spoofing
Beyond proxies, there are several layers of detection to defeat:
import json
from datetime import datetime, timedelta
class BrowserFingerprintSimulator:
"""
Simulate consistent browser fingerprint attributes.
A real browser has consistent screen resolution, timezone, plugins, etc.
Inconsistency is a bot signal.
"""
BROWSER_PROFILES = [
{
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"sec_ch_ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
"sec_ch_ua_platform": '"Windows"',
"accept_language": "en-US,en;q=0.9",
"viewport": "1920x1080",
},
{
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"sec_ch_ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
"sec_ch_ua_platform": '"macOS"',
"accept_language": "en-US,en;q=0.9,en-GB;q=0.8",
"viewport": "1440x900",
},
{
"user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
"sec_ch_ua": None, # Firefox doesn't send sec-ch-ua
"sec_ch_ua_platform": None,
"accept_language": "en-US,en;q=0.5",
"viewport": "1920x1080",
},
]
def __init__(self):
self._profile = random.choice(self.BROWSER_PROFILES)
def get_headers(self, url: str, referer: Optional[str] = None) -> Dict[str, str]:
headers = {
"User-Agent": self._profile["user_agent"],
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
"Accept-Language": self._profile["accept_language"],
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
}
if self._profile.get("sec_ch_ua"):
headers["Sec-CH-UA"] = self._profile["sec_ch_ua"]
headers["Sec-CH-UA-Mobile"] = "?0"
headers["Sec-CH-UA-Platform"] = self._profile["sec_ch_ua_platform"]
if referer:
headers["Referer"] = referer
headers["Sec-Fetch-Site"] = "same-origin"
else:
headers["Sec-Fetch-Site"] = "none"
headers["Sec-Fetch-Mode"] = "navigate"
headers["Sec-Fetch-Dest"] = "document"
headers["Sec-Fetch-User"] = "?1"
return headers
class AdaptiveRateLimiter:
"""
Adaptive rate limiter that backs off when detecting throttling signals
and speeds up when requests are succeeding consistently.
"""
def __init__(self, base_delay: float = 4.0, min_delay: float = 2.0, max_delay: float = 30.0):
self.delay = base_delay
self.min_delay = min_delay
self.max_delay = max_delay
self._consecutive_success = 0
self._consecutive_failure = 0
self._lock = threading.Lock()
def wait(self):
"""Wait the appropriate amount of time before next request."""
with self._lock:
jitter = random.uniform(-0.5, 1.5)
sleep_time = max(self.min_delay, self.delay + jitter)
time.sleep(sleep_time)
def record_success(self):
with self._lock:
self._consecutive_success += 1
self._consecutive_failure = 0
# Gradually speed up after 5 consecutive successes
if self._consecutive_success >= 5:
self.delay = max(self.min_delay, self.delay * 0.9)
self._consecutive_success = 0
logger.debug(f"Rate limiter sped up: delay={self.delay:.1f}s")
def record_throttle(self):
with self._lock:
self._consecutive_failure += 1
self._consecutive_success = 0
self.delay = min(self.max_delay, self.delay * 2.0)
logger.warning(f"Rate limiter backed off: delay={self.delay:.1f}s")
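The limiter's update rules are worth sanity-checking in isolation. A minimal restatement of the same arithmetic, stripped of the locking and sleeping:

```python
def backed_off(delay: float, max_delay: float = 30.0) -> float:
    """After a throttle signal: double the delay, capped at max_delay."""
    return min(max_delay, delay * 2.0)

def sped_up(delay: float, min_delay: float = 2.0) -> float:
    """After five straight successes: shave 10% off, floored at min_delay."""
    return max(min_delay, delay * 0.9)

d = 4.0
for _ in range(4):
    d = backed_off(d)  # 8.0, 16.0, then capped at 30.0
```

The asymmetry is deliberate: back off multiplicatively and fast, recover slowly, so a single block costs minutes rather than triggering a ban.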
Playwright-Based Scraping for JavaScript-Heavy Pages
Some sections of Drugs.com — particularly the interaction checker and review sections with modern pagination — require JavaScript execution:
import asyncio
from playwright.async_api import async_playwright, Page, Browser
from typing import AsyncIterator
async def setup_stealth_browser(proxy_url: Optional[str] = None) -> Browser:
"""Launch a hardened Playwright browser with anti-detection measures."""
playwright = await async_playwright().start()
launch_args = [
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--no-first-run",
"--no-default-browser-check",
"--disable-infobars",
"--window-size=1920,1080",
"--lang=en-US",
]
launch_kwargs = {
"headless": True,
"args": launch_args,
}
if proxy_url:
# Parse proxy URL into components for Playwright
from urllib.parse import urlparse
parsed = urlparse(proxy_url)
launch_kwargs["proxy"] = {
"server": f"{parsed.scheme}://{parsed.hostname}:{parsed.port}",
"username": parsed.username or "",
"password": parsed.password or "",
}
browser = await playwright.chromium.launch(**launch_kwargs)
return browser
async def make_stealth_page(browser: Browser) -> Page:
"""Create a new page with stealth settings to avoid detection."""
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
locale="en-US",
timezone_id="America/New_York",
user_agent=random.choice(USER_AGENTS),
java_script_enabled=True,
accept_downloads=False,
ignore_https_errors=False,
)
page = await context.new_page()
# Override navigator.webdriver property
await page.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5]});
Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
window.chrome = { runtime: {} };
""")
page.set_default_timeout(30000)
return page
async def scrape_reviews_playwright(
drug_name: str,
proxy_url: Optional[str] = None,
max_pages: int = 10,
) -> List[Dict]:
"""Scrape reviews using full browser automation for sites with JS challenges."""
slug = drug_name.lower().replace(" ", "-")
reviews = []
browser = await setup_stealth_browser(proxy_url)
try:
page = await make_stealth_page(browser)
# First visit the main drug page to establish session
await page.goto(f"https://www.drugs.com/{slug}.html", wait_until="networkidle")
await asyncio.sleep(random.uniform(2, 4))
for page_num in range(1, max_pages + 1):
url = f"https://www.drugs.com/comments/{slug}/"
if page_num > 1:
url += f"?page={page_num}"
await page.goto(url, wait_until="networkidle")
await asyncio.sleep(random.uniform(1.5, 3.5))
# Check for CAPTCHA
captcha_frame = await page.query_selector("iframe[src*='captcha'], .captcha-container")
if captcha_frame:
logger.warning(f"CAPTCHA detected on page {page_num}, stopping")
break
# Extract review data via page evaluation
page_reviews = await page.evaluate("""
() => {
const reviews = [];
const cards = document.querySelectorAll('[class*="user-comment"], [class*="review-card"]');
cards.forEach(card => {
const rating = card.querySelector('[class*="rating"]');
const text = card.querySelector('[class*="comment-text"], [class*="review-text"]');
const date = card.querySelector('[class*="date"], time');
const condition = card.querySelector('[class*="condition"]');
reviews.push({
rating: rating ? rating.textContent.trim() : '',
text: text ? text.textContent.trim() : '',
date: date ? (date.getAttribute('datetime') || date.textContent.trim()) : '',
condition: condition ? condition.textContent.trim() : '',
});
});
return reviews;
}
""")
if not page_reviews:
break
reviews.extend([{**r, "drug": drug_name} for r in page_reviews])
logger.info(f"Playwright page {page_num}: {len(page_reviews)} reviews")
finally:
await browser.close()
return reviews
Rate Limiting and CAPTCHA Handling
class CaptchaHandler:
"""
CAPTCHA detection and fallback strategies.
Note: Solving CAPTCHAs programmatically may violate ToS.
This class focuses on detection and graceful fallback.
"""
CAPTCHA_SIGNALS = [
"captcha",
"cf-challenge",
"challenge-form",
"ray id",
"checking your browser",
"ddos-guard",
"access denied",
"bot detection",
]
@staticmethod
def is_captcha_page(html: str) -> bool:
"""Detect if a response contains a CAPTCHA or bot challenge."""
html_lower = html.lower()
return any(signal in html_lower for signal in CaptchaHandler.CAPTCHA_SIGNALS)
@staticmethod
def is_rate_limited(response: requests.Response) -> bool:
"""Check if response indicates rate limiting."""
if response.status_code == 429:
return True
if response.status_code == 503:
return True
if "rate limit" in response.text.lower():
return True
return False
@staticmethod
def handle_captcha_fallback(url: str, drug_name: str) -> Optional[Dict]:
"""
Fallback strategy when CAPTCHA is encountered.
Options:
1. Rotate proxy (most effective)
2. Wait and retry with longer delay
3. Switch to Playwright with fresh browser context
4. Log URL for manual review
5. Use alternative data source (FDA API, etc.)
"""
logger.warning(f"CAPTCHA/block detected for {drug_name} at {url}")
logger.info("Strategies: rotate proxy, extend delay, or use Playwright fallback")
# Record for later retry
with open("captcha_backlog.txt", "a") as f:
f.write(f"{url}\t{drug_name}\t{datetime.now().isoformat()}\n")
return None
def make_request_with_captcha_handling(
url: str,
session: DrugsComSession,
rate_limiter: AdaptiveRateLimiter,
proxy_pool: Optional[ThorDataProxyPool] = None,
) -> Optional[requests.Response]:
"""Make request with full CAPTCHA and rate limit handling."""
rate_limiter.wait()
try:
resp = session.get(url)
if CaptchaHandler.is_rate_limited(resp):
rate_limiter.record_throttle()
retry_after = int(resp.headers.get("Retry-After", 60))
logger.warning(f"Rate limited, waiting {retry_after}s")
time.sleep(retry_after + random.uniform(10, 30))
if proxy_pool:
proxy_pool.rotate()
return None
if CaptchaHandler.is_captcha_page(resp.text):
rate_limiter.record_throttle()
if proxy_pool:
proxy_pool.rotate()
logger.info("Rotated proxy after CAPTCHA detection")
return None
rate_limiter.record_success()
return resp
except requests.RequestException as e:
rate_limiter.record_throttle()
logger.error(f"Request failed for {url}: {e}")
return None
Output Schemas with Examples
A well-defined output schema is essential for data pipeline reliability:
# Drug information output schema
DRUG_INFO_SCHEMA = {
"name": "metformin",
"title": "Metformin",
"drug_class": "Biguanide antidiabetic",
"description": "Metformin is an oral diabetes medicine that helps control blood sugar levels...",
"availability": "Rx only",
"fda_status": "FDA approved",
"related_drugs": ["glipizide", "januvia", "victoza"],
"pronunciation": "met-FOR-min",
"url": "https://www.drugs.com/metformin.html",
}
# Side effects output schema
SIDE_EFFECTS_SCHEMA = {
"drug": "metformin",
"side_effect": "Nausea",
"frequency_category": "More Common", # More Common / Less Common / Rare
}
# Review output schema
REVIEW_SCHEMA = {
"drug": "metformin",
"condition": "Type 2 Diabetes",
"rating": 8.5, # 1-10 scale
"review_text": "I've been taking metformin for 6 months...",
"date": "March 15, 2024",
"reviewer_age": "45-54",
"duration_of_use": "1 to 6 months",
"helpful_votes": 12,
"scraped_at": "2026-03-31T14:22:00",
}
# Interaction output schema
INTERACTION_SCHEMA = {
"drug": "warfarin",
"interacts_with": "aspirin",
"interactant_url": "/drug-interactions/aspirin.html",
"severity": "major", # major / moderate / minor / food / unknown
"description": "May significantly increase the risk of bleeding...",
}
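Before records enter a downstream pipeline, a lightweight validation pass catches parser drift early (a renamed CSS class silently yielding empty strings, for example). A sketch of plain type checks against the review schema; the REVIEW_FIELDS table and validate_review helper are illustrative, not a library API:

```python
from typing import Any, Dict, List

# Field names mirror REVIEW_SCHEMA above
REVIEW_FIELDS: Dict[str, type] = {
    "drug": str,
    "condition": str,
    "rating": float,
    "review_text": str,
    "date": str,
    "helpful_votes": int,
}

def validate_review(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected in REVIEW_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(record[field]).__name__}"
            )
    rating = record.get("rating")
    if isinstance(rating, float) and not 0.0 <= rating <= 10.0:
        problems.append("rating outside the 0-10 scale")
    return problems
```

Rejected records go to a quarantine table for inspection rather than silently polluting aggregates.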
Real-World Use Cases with Code
Use Case 1: Pharmacovigilance Signal Detection
Identify drugs with unusually high proportions of serious adverse event reports:
def build_adverse_event_profile(drug_list: List[str], session: DrugsComSession) -> pd.DataFrame:
"""Build adverse event severity profiles for a drug list."""
profiles = []
for drug in drug_list:
try:
effects_df = get_side_effects(drug, session)
if effects_df.empty:
continue
total = len(effects_df)
serious_keywords = r"death|hospitali|seizure|fatal|cardiac|stroke|anaphylaxis|liver failure"
serious = effects_df["side_effect"].str.contains(serious_keywords, case=False, na=False).sum()
profiles.append({
"drug": drug,
"total_effects": total,
"serious_effects": serious,
"serious_ratio": serious / total if total > 0 else 0,
"frequency_breakdown": effects_df["frequency_category"].value_counts().to_dict(),
})
time.sleep(random.uniform(5, 10))
except Exception as e:
logger.error(f"Failed profile for {drug}: {e}")
return pd.DataFrame(profiles).sort_values("serious_ratio", ascending=False)
# Analyze top antidiabetics
diabetes_drugs = ["metformin", "ozempic", "victoza", "januvia", "jardiance", "farxiga"]
profiles = build_adverse_event_profile(diabetes_drugs, session)
print(profiles[["drug", "total_effects", "serious_ratio"]].to_string())
Use Case 2: Patient Sentiment Analysis by Condition
Analyze how patients rate the same drug for different conditions:
def analyze_drug_by_condition(drug_name: str, session: DrugsComSession) -> pd.DataFrame:
"""Compare drug effectiveness ratings across different conditions."""
reviews = []
for review in iter_drug_reviews(drug_name, session, max_pages=15):
if review.condition and review.rating > 0:
reviews.append({"condition": review.condition, "rating": review.rating, "text": review.review_text})
if not reviews:
return pd.DataFrame()
df = pd.DataFrame(reviews)
analysis = df.groupby("condition").agg(
mean_rating=("rating", "mean"),
review_count=("rating", "count"),
std_rating=("rating", "std"),
).round(2).sort_values("review_count", ascending=False)
return analysis[analysis["review_count"] >= 5] # Filter low-sample conditions
# Example output:
# mean_rating review_count std_rating
# Type 2 Diabetes 7.2 847 2.1
# Weight Loss 5.8 234 2.8
# Polycystic Ovary Syndrome 7.6 198 2.2
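Ratings alone miss what patients actually say in the review text. As a naive illustration of the sentiment step (the keyword lists here are arbitrary assumptions, not a validated lexicon; a real pipeline would use a proper sentiment model), reviews can be bucketed with simple substring matching:

```python
# Naive keyword-based sentiment tagging for review text.
# Keyword lists are illustrative only; substring matching will misread
# negations such as "would not recommend".
POSITIVE = {"helped", "effective", "great", "improved", "recommend"}
NEGATIVE = {"nausea", "worse", "awful", "stopped", "terrible", "side effects"}

def tag_sentiment(text: str) -> str:
    """Tag a review as positive/negative/neutral by keyword counts."""
    lowered = text.lower()
    pos = sum(word in lowered for word in POSITIVE)
    neg = sum(word in lowered for word in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

Applied as `df["sentiment"] = df["text"].apply(tag_sentiment)`, this gives a quick second axis alongside the numeric ratings, which is often enough for exploratory work before investing in an NLP model.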
Use Case 3: Drug Interaction Network Builder
Build a graph of drug-drug interactions:
import json
def build_interaction_network(
seed_drugs: List[str],
session: DrugsComSession,
max_depth: int = 1,
) -> Dict:
"""Build a drug interaction network up to specified depth."""
network = {"nodes": {}, "edges": []}
visited = set()
queue = [(drug, 0) for drug in seed_drugs]
while queue:
drug, depth = queue.pop(0)
if drug in visited:
continue
visited.add(drug)
network["nodes"][drug] = {"scraped": True}
interactions = get_interactions(drug, session)
if interactions.empty:
continue
for _, row in interactions.iterrows():
edge = {
"source": drug,
"target": row["interacts_with"],
"severity": row["severity"],
}
network["edges"].append(edge)
if depth < max_depth and row["interacts_with"] not in visited:
if row["severity"] in ["major", "moderate"]:
queue.append((row["interacts_with"], depth + 1))
time.sleep(random.uniform(5, 10))
with open("interaction_network.json", "w") as f:
json.dump(network, f, indent=2)
return network
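Once the network JSON is written, it can be analyzed with nothing but the standard library. A hypothetical `summarize_network` helper (not part of the code above), assuming the `{"nodes": ..., "edges": ...}` shape produced by `build_interaction_network`:

```python
from collections import Counter

def summarize_network(network: dict) -> dict:
    """Summarize an interaction network: edge severities and hub drugs."""
    severity_counts = Counter(edge["severity"] for edge in network["edges"])
    # Degree count treats the graph as undirected: both endpoints get credit.
    degree = Counter()
    for edge in network["edges"]:
        degree[edge["source"]] += 1
        degree[edge["target"]] += 1
    return {
        "n_nodes": len(network["nodes"]),
        "n_edges": len(network["edges"]),
        "severity_counts": dict(severity_counts),
        "most_connected": degree.most_common(5),
    }
```

For anything beyond counts (shortest paths, clustering), loading the same edge list into a graph library is a natural next step.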
Use Case 4: Generic vs Brand Price Intelligence
def compare_generic_brand(drug_name: str, session: DrugsComSession) -> Dict:
"""Scrape price comparison data between generic and brand versions."""
slug = drug_name.lower().replace(" ", "-")
url = f"https://www.drugs.com/price-guide/{slug}"
try:
resp = session.get(url)
soup = BeautifulSoup(resp.text, "lxml")
prices = {}
price_table = soup.find("table", class_=re.compile("price"))
if price_table:
for row in price_table.find_all("tr")[1:]:
cells = row.find_all("td")
if len(cells) >= 3:
prices[cells[0].get_text(strip=True)] = {
"dosage": cells[1].get_text(strip=True),
"price": cells[2].get_text(strip=True),
}
return prices
except Exception as e:
logger.error(f"Price scrape failed for {drug_name}: {e}")
return {}
Use Case 5: Dosage Table Extraction
def get_dosage_info(drug_name: str, session: DrugsComSession) -> pd.DataFrame:
"""Extract structured dosage information."""
slug = drug_name.lower().replace(" ", "-")
url = f"https://www.drugs.com/dosage/{slug}.html"
resp = session.get(url, referer=f"https://www.drugs.com/{slug}.html")
soup = BeautifulSoup(resp.text, "lxml")
dosage_data = []
# Extract from structured tables
for table in soup.find_all("table"):
headers = [th.get_text(strip=True) for th in table.find_all("th")]
for row in table.find_all("tr")[1:]:
cells = [td.get_text(strip=True) for td in row.find_all("td")]
if cells:
row_dict = dict(zip(headers, cells))
row_dict["drug"] = drug_name
dosage_data.append(row_dict)
return pd.DataFrame(dosage_data)
Use Case 6: Treatment Outcome Research Dataset
def build_treatment_research_dataset(
conditions: List[str],
drugs_per_condition: int = 5,
reviews_per_drug: int = 100,
db_path: str = "treatment_research.db",
) -> None:
"""
Build a research dataset linking conditions to drug effectiveness scores.
Useful for comparative effectiveness research.
"""
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS drug_condition_stats (
id INTEGER PRIMARY KEY AUTOINCREMENT,
drug TEXT NOT NULL,
condition TEXT NOT NULL,
mean_rating REAL,
review_count INTEGER,
positive_ratio REAL,
scraped_at TEXT DEFAULT (datetime('now')),
UNIQUE(drug, condition)
);
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
drug TEXT,
condition TEXT,
rating REAL,
review_text TEXT,
date TEXT,
helpful_votes INTEGER
);
""")
conn.commit()
session = DrugsComSession()  # pass proxy_url=... here if routing through a proxy pool
for condition in conditions:
logger.info(f"Processing condition: {condition}")
# In practice, you'd use a drug-by-condition lookup endpoint
# or pre-populate from medical knowledge bases
time.sleep(random.uniform(5, 10))
conn.close()
logger.info(f"Research dataset saved to {db_path}")
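The stub above leaves the aggregation step open. One possible sketch of computing per-condition stats and upserting them into the `drug_condition_stats` table, assuming ratings on the site's 1-10 scale and treating a rating of 7 or higher as "positive" (an arbitrary cutoff chosen here for illustration):

```python
def upsert_condition_stats(conn, drug: str, condition: str, ratings: list) -> None:
    """Aggregate a list of 1-10 ratings and upsert into drug_condition_stats."""
    if not ratings:
        return
    mean_rating = sum(ratings) / len(ratings)
    # ">= 7 counts as positive" is an arbitrary threshold for this sketch.
    positive_ratio = sum(1 for r in ratings if r >= 7) / len(ratings)
    # Relies on the UNIQUE(drug, condition) constraint defined in the schema.
    conn.execute(
        """INSERT INTO drug_condition_stats (drug, condition, mean_rating, review_count, positive_ratio)
           VALUES (?, ?, ?, ?, ?)
           ON CONFLICT(drug, condition) DO UPDATE SET
               mean_rating = excluded.mean_rating,
               review_count = excluded.review_count,
               positive_ratio = excluded.positive_ratio""",
        (drug, condition, round(mean_rating, 2), len(ratings), round(positive_ratio, 3)),
    )
    conn.commit()
```

The `ON CONFLICT ... DO UPDATE` upsert means re-running the pipeline refreshes existing rows rather than erroring on the unique constraint, which keeps incremental re-scrapes idempotent.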
Use Case 7: New Drug Approval Monitor
def monitor_new_approvals(session: DrugsComSession, db_path: str = "approvals.db") -> List[Dict]:
"""
Monitor Drugs.com new approvals feed for recently approved drugs.
Useful for pharmaceutical market intelligence.
"""
url = "https://www.drugs.com/newdrugs.html"
resp = session.get(url)
soup = BeautifulSoup(resp.text, "lxml")
new_drugs = []
for item in soup.select(".newdrugs-list li, .content-box li"):
link = item.find("a")
if link:
new_drugs.append({
"name": link.get_text(strip=True),
"url": "https://www.drugs.com" + link.get("href", ""),
"description": item.get_text(strip=True),
"found_at": datetime.now().isoformat(),
})
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS approvals (
id INTEGER PRIMARY KEY, name TEXT, url TEXT UNIQUE,
description TEXT, found_at TEXT
)
""")
for drug in new_drugs:
conn.execute(
"INSERT OR IGNORE INTO approvals (name, url, description, found_at) VALUES (?, ?, ?, ?)",
(drug["name"], drug["url"], drug["description"], drug["found_at"]),
)
conn.commit()
conn.close()
return new_drugs
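For a monitor, the useful signal is which approvals are new since the last run, not the full list. A small hypothetical helper (`diff_new_approvals` is not part of the code above) that checks the `approvals` table before you alert on anything:

```python
def diff_new_approvals(conn, scraped: list) -> list:
    """Return only scraped drugs whose URL is not yet in the approvals table.

    Call this *before* the INSERT OR IGNORE pass so you can alert on
    genuinely new entries.
    """
    fresh = []
    for drug in scraped:
        row = conn.execute(
            "SELECT 1 FROM approvals WHERE url = ?", (drug["url"],)
        ).fetchone()
        if row is None:
            fresh.append(drug)
    return fresh
```

Run on a schedule (cron, a cloud function), the monitor then only notifies when `diff_new_approvals` returns a non-empty list.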
Complete Production Pipeline
import concurrent.futures
import argparse
from pathlib import Path
def run_full_scraping_pipeline(
drug_list: List[str],
output_dir: str = "output",
max_workers: int = 2,
proxy_username: str = "",
proxy_password: str = "",
) -> Dict[str, Any]:
"""
Production-grade pipeline for scraping drug data at scale.
Uses thread pool for parallelism with per-drug proxy rotation.
"""
Path(output_dir).mkdir(exist_ok=True)
proxy_pool = ThorDataProxyPool(
username=proxy_username,
password=proxy_password,
country="US",
sticky_session_minutes=8,
) if proxy_username else None
rate_limiter = AdaptiveRateLimiter(base_delay=5.0)  # shared across workers; wire into scrape_one's request loop
results = {"success": [], "failed": [], "total": len(drug_list)}
def scrape_one(drug_name: str) -> bool:
try:
proxy = proxy_pool.get_sticky_proxy() if proxy_pool else None
session = DrugsComSession(proxy_url=proxy)
data = {
"info": get_drug_info(drug_name, session),
"side_effects": get_side_effects(drug_name, session).to_dict(orient="records"),
"interactions": get_interactions(drug_name, session).to_dict(orient="records"),
}
out_file = Path(output_dir) / f"{drug_name.replace(' ', '_')}.json"
with open(out_file, "w") as f:
json.dump(data, f, indent=2, default=str)
if proxy_pool:
proxy_pool.rotate()
return True
except Exception as e:
logger.error(f"Pipeline failed for {drug_name}: {e}")
return False
# Process with limited concurrency to respect rate limits
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = {executor.submit(scrape_one, drug): drug for drug in drug_list}
for future in concurrent.futures.as_completed(futures):
drug = futures[future]
success = future.result()
(results["success"] if success else results["failed"]).append(drug)
logger.info(f"Progress: {len(results['success']) + len(results['failed'])}/{results['total']}")
return results
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--drugs", nargs="+", default=["metformin", "lisinopril", "atorvastatin"])
parser.add_argument("--proxy-user", default="")
parser.add_argument("--proxy-pass", default="")
args = parser.parse_args()
results = run_full_scraping_pipeline(
args.drugs,
proxy_username=args.proxy_user,
proxy_password=args.proxy_pass,
)
print(f"Completed: {len(results['success'])} success, {len(results['failed'])} failed")
Ethical Considerations and Legal Compliance
Scraping health data carries heightened responsibility beyond typical web scraping:
- robots.txt: Always check https://www.drugs.com/robots.txt and respect disallowed paths
- Patient privacy: Reviews contain personal health disclosures; aggregate and anonymize, and do not redistribute individually
- Rate limiting: Real patients depend on this site. Keep requests slow enough to avoid impacting legitimate users
- Research ethics: If publishing findings from this data, consult your institution's IRB/ethics board
- HIPAA awareness: Even publicly available health data can raise regulatory concerns in certain research contexts
- Terms of Service: Drugs.com prohibits systematic data collection without authorization; large-scale scraping should be preceded by legal review
- Clinical safety: Never present scraped medication data as clinical guidance or use it in any decision-support system without proper validation and regulatory clearance
The combination of structured pharmacological data and patient narratives makes Drugs.com extraordinarily valuable for pharmaceutical research, drug safety surveillance, and patient experience analytics. Approach it with the care and responsibility that health data demands.