
How to Scrape Amazon Product Reviews with Python (2026)


Amazon doesn't have a public reviews API. There's the Product Advertising API, but it doesn't include review text — just aggregate star ratings. If you need actual review content, ratings breakdowns, reviewer history, or verified purchase status, you're scraping HTML.

And Amazon really doesn't want you to. Their anti-bot system is one of the most aggressive on the web. You'll hit CAPTCHAs, IP blocks, and detection fingerprinting within minutes if you're not careful. But this data is invaluable — for competitive analysis, sentiment research, product development feedback loops, review aggregation tools, and brand monitoring.

This guide walks through the full pipeline: URL structure, parsing, anti-detection, pagination, proxy integration with ThorData, and durable storage.


Why Scrape Amazon Reviews?

Amazon product reviews are some of the most valuable unstructured text on the internet: candid customer language about what works and what breaks, star ratings tied to verified purchases, helpful-vote counts that show which opinions shoppers trust, and dates that let you track sentiment across a product's life. That's the raw material for the competitive analysis, sentiment research, and brand monitoring use cases above.

None of this is available through Amazon's official APIs. The Product Advertising API gives you star averages; the raw text lives in HTML.


Review URL Structure

Every Amazon product has an ASIN — a 10-character alphanumeric identifier. Review pages follow a consistent URL pattern:

https://www.amazon.com/product-reviews/{ASIN}/
    ?pageNumber={page}
    &filterByStar={star_rating}
    &reviewerType=avp_only_reviews
    &sortBy=recent

Key query parameters:

- pageNumber: 1–500. Pagination (10 reviews per page).
- filterByStar: one_star, two_star, three_star, four_star, five_star, all_stars, critical, positive. Filter by rating.
- reviewerType: avp_only_reviews, all_reviews. Verified purchase filter.
- sortBy: recent, helpful. Sort order.

Amazon caps pages at around 500 (5,000 reviews per filter). For products with 100,000+ reviews, you need to segment by star rating to maximize coverage.
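The arithmetic behind that advice is worth making explicit. A quick sketch of the coverage math, using the numbers above (500-page cap, 10 reviews per page):

```python
def accessible_reviews(total_reviews, segments=1, page_cap=500, per_page=10):
    """Estimate how many reviews are reachable across filter segments."""
    per_segment_cap = page_cap * per_page  # 5,000 per filter
    return min(total_reviews, segments * per_segment_cap)

# A product with 120,000 reviews:
print(accessible_reviews(120_000, segments=1))  # 5000  - one "all_stars" pass
print(accessible_reviews(120_000, segments=5))  # 25000 - one pass per star rating
```

Five star-rating segments quintuple your ceiling; adding a second sort order per segment pushes it further, since the two orderings overlap only partially.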


Understanding the HTML Structure

Amazon's review HTML is relatively stable. The data-hook attributes act as durable anchors for review elements:

<div data-hook="review">
  <span data-hook="review-star-rating">4.0 out of 5 stars</span>
  <span data-hook="review-title">Great product, minor issues</span>
  <span data-hook="review-date">Reviewed in the United States on March 15, 2026</span>
  <span data-hook="avp-badge">Verified Purchase</span>
  <span data-hook="review-body">The build quality is excellent...</span>
  <span data-hook="helpful-vote-statement">47 people found this helpful</span>
  <span data-hook="review-author">CustomerName</span>
</div>

These data-hook attributes have been consistent for years. They're more reliable than class-based selectors, which Amazon rotates frequently to break scrapers.
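To see how attribute selectors target these hooks, here's a minimal check against a trimmed version of the markup above:

```python
from bs4 import BeautifulSoup

sample = '''
<div data-hook="review" id="R1ABC">
  <span data-hook="review-star-rating">4.0 out of 5 stars</span>
  <span data-hook="review-title">Great product, minor issues</span>
</div>
'''

soup = BeautifulSoup(sample, "html.parser")
review = soup.select_one('[data-hook="review"]')
rating_text = review.select_one('[data-hook="review-star-rating"]').get_text(strip=True)
print(review["id"], "->", rating_text)  # R1ABC -> 4.0 out of 5 stars
```

The `[data-hook="..."]` CSS attribute selector works regardless of what classes Amazon attaches to the same element, which is exactly why it survives their class rotations.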


Basic Scraper

Start with a minimal working scraper, then add robustness:

import httpx
from bs4 import BeautifulSoup
import time
import random

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Cache-Control": "max-age=0",
}

def build_review_url(asin, page=1, star_filter="all_stars",
                     verified_only=False, sort="recent"):
    url = f"https://www.amazon.com/product-reviews/{asin}/"
    params = {
        "pageNumber": page,
        "filterByStar": star_filter,
        "sortBy": sort,
    }
    if verified_only:
        params["reviewerType"] = "avp_only_reviews"
    # Build query string manually to preserve order
    qs = "&".join(f"{k}={v}" for k, v in params.items())
    return f"{url}?{qs}"

def parse_rating(el):
    """Extract float rating from Amazon star element."""
    if not el:
        return None
    text = el.get_text()
    try:
        return float(text.split(" out of")[0].strip())
    except (ValueError, IndexError):
        return None

def parse_helpful_votes(el):
    """Parse helpful vote count like '47 people found this helpful'."""
    if not el:
        return 0
    text = el.get_text(strip=True)
    if text == "One person found this helpful":
        return 1
    try:
        num = text.split(" ")[0].replace(",", "")
        return int(num)
    except (ValueError, IndexError):
        return 0

def parse_review_div(div):
    """Extract all fields from a single review div."""
    title_el = div.select_one('[data-hook="review-title"]')
    body_el = div.select_one('[data-hook="review-body"]')
    rating_el = div.select_one('[data-hook="review-star-rating"]')
    date_el = div.select_one('[data-hook="review-date"]')
    verified_el = div.select_one('[data-hook="avp-badge"]')
    helpful_el = div.select_one('[data-hook="helpful-vote-statement"]')
    author_el = div.select_one('[data-hook="review-author"]')

    # Review ID lives on the parent div
    review_id = div.get("id", "")

    # Extract title text — skip leading star rating text if present
    title_text = ""
    if title_el:
        # Title spans sometimes contain nested span with star text
        spans = title_el.find_all("span")
        title_text = spans[-1].get_text(strip=True) if spans else title_el.get_text(strip=True)

    return {
        "review_id": review_id,
        "title": title_text,
        "body": body_el.get_text(strip=True) if body_el else "",
        "rating": parse_rating(rating_el),
        "date": date_el.get_text(strip=True) if date_el else "",
        "verified": verified_el is not None,
        "helpful_votes": parse_helpful_votes(helpful_el),
        "author": author_el.get_text(strip=True) if author_el else "",
    }

def scrape_reviews_page(asin, page=1, star_filter="all_stars",
                        verified_only=False):
    """Fetch and parse a single review page."""
    url = build_review_url(asin, page, star_filter, verified_only)

    try:
        resp = httpx.get(url, headers=HEADERS, follow_redirects=True,
                         timeout=20)
    except httpx.TimeoutException:
        print(f"Timeout on page {page}")
        return None, None

    if resp.status_code != 200:
        print(f"Page {page}: HTTP {resp.status_code}")
        return None, resp.status_code

    html = resp.text

    # Check for CAPTCHA or block
    if is_blocked(html):
        return None, "blocked"

    soup = BeautifulSoup(html, "lxml")
    review_divs = soup.select('[data-hook="review"]')

    reviews = [parse_review_div(div) for div in review_divs]
    return reviews, 200

def is_blocked(html):
    """Detect Amazon CAPTCHA or soft-block pages."""
    block_markers = [
        "Type the characters you see in this image",
        "[email protected]",
        "/errors/validateCaptcha",
        "Enter the characters you see below",
        "Sorry, we just need to make sure you",
    ]
    return any(m in html for m in block_markers)

def scrape_reviews(asin, max_pages=10, star_filter="all_stars",
                   verified_only=False):
    """Scrape multiple pages of reviews with delays."""
    all_reviews = []

    for page in range(1, max_pages + 1):
        reviews, status = scrape_reviews_page(
            asin, page, star_filter, verified_only
        )

        if status == "blocked":
            print(f"Blocked on page {page}. Stopping.")
            break
        if reviews is None:
            break
        if not reviews:
            print(f"Page {page}: empty, done.")
            break

        all_reviews.extend(reviews)
        print(f"Page {page}: {len(reviews)} reviews (total: {len(all_reviews)})")

        if page < max_pages:
            time.sleep(random.uniform(4.0, 8.0))

    return all_reviews

# Basic usage
if __name__ == "__main__":
    asin = "B0CX59THPZ"  # Replace with your target ASIN
    reviews = scrape_reviews(asin, max_pages=5)
    print(f"Collected {len(reviews)} reviews")

This gets you started, but hits Amazon's anti-bot detection quickly at any meaningful scale.


Amazon's Anti-Bot Measures

Amazon runs one of the most sophisticated bot detection systems in e-commerce. Understanding it helps you build more resilient scrapers.

IP Reputation Tracking

Amazon maintains per-IP reputation scores. After 20–30 requests, fresh datacenter IPs start returning CAPTCHA pages. After continued requests, they return 503s. The reputation decays slowly — an IP blocked today may work again in 24–48 hours, but at any meaningful scraping volume, you'll exhaust IPs faster than they recover.
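Since a blocked IP recovers after roughly 24–48 hours, it can pay to rest blocked sessions rather than discard them. A minimal sketch (the class and its cooldown window are illustrative, not part of any library):

```python
import time

class SessionCooldown:
    """Track when proxy sessions were blocked; rest them for a cooldown window."""

    def __init__(self, cooldown_seconds=24 * 3600):
        self.cooldown = cooldown_seconds
        self.blocked_at = {}  # session_id -> timestamp of last block

    def mark_blocked(self, session_id):
        self.blocked_at[session_id] = time.time()

    def is_usable(self, session_id):
        ts = self.blocked_at.get(session_id)
        return ts is None or (time.time() - ts) >= self.cooldown

pool = SessionCooldown(cooldown_seconds=60)
pool.mark_blocked("session-123")
print(pool.is_usable("session-123"))  # False until 60s have passed
print(pool.is_usable("session-456"))  # True
```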

TLS Fingerprinting

Amazon (and its CDN) inspects your TLS ClientHello — the cipher suites offered, extension order, and supported groups. Python's default TLS stack (OpenSSL via requests/httpx) produces a fingerprint distinctly unlike Chrome or Firefox. Amazon's detection picks this up and applies additional scrutiny to requests from known-bot TLS stacks.

Libraries like curl_cffi solve this by using libcurl with Chrome's exact TLS parameters:

import curl_cffi.requests as cffi_req

resp = cffi_req.get(
    url,
    headers=HEADERS,
    impersonate="chrome124",  # Mimics Chrome 124 TLS fingerprint
    timeout=20,
)

Behavioral Analysis

Amazon tracks:

- Request cadence: machine-perfect timing (exactly 5s between requests) is a signal
- Navigation patterns: real browsers load CSS, images, fonts; scrapers only load HTML
- Cookie behavior: sessions that never set or read preferences look robotic
- Referrer chains: jumping directly to review pages without a product page visit is suspicious

Amazon sets session cookies on first visit. If your "browser session" visits 47 different product review pages without ever visiting a product listing, that's anomalous. Warming up your session by visiting a few product pages before hitting reviews reduces detection risk.


Anti-Detection Implementation

Using curl_cffi for TLS Fingerprint Spoofing

import curl_cffi.requests as cffi_req
import random
import time

# Multiple Chrome versions to rotate
IMPERSONATIONS = ["chrome120", "chrome124", "chrome126"]

def fetch_with_cffi(url, proxy_url=None, retries=3):
    """Fetch with Chrome TLS fingerprint via curl_cffi."""
    impersonate = random.choice(IMPERSONATIONS)

    for attempt in range(retries):
        try:
            kwargs = {
                "headers": HEADERS,
                "impersonate": impersonate,
                "timeout": 25,
                "follow_redirects": True,
            }
            if proxy_url:
                kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}

            resp = cffi_req.get(url, **kwargs)

            if resp.status_code == 200 and not is_blocked(resp.text):
                return resp.text
            elif is_blocked(resp.text):
                print(f"Attempt {attempt+1}: CAPTCHA/blocked")
                time.sleep(random.uniform(15, 30))
            else:
                print(f"Attempt {attempt+1}: HTTP {resp.status_code}")
                time.sleep(random.uniform(5, 10))

        except Exception as e:
            print(f"Attempt {attempt+1}: Error — {e}")
            time.sleep(random.uniform(5, 15))

    return None

def session_warmup(proxy_url=None):
    """Visit product and category pages before scraping reviews."""
    warmup_urls = [
        "https://www.amazon.com/",
        "https://www.amazon.com/best-sellers-electronics/",
    ]
    for url in warmup_urls:
        fetch_with_cffi(url, proxy_url)
        time.sleep(random.uniform(2, 5))
    print("Session warmed up")

Randomized Timing Patterns

import random
import time

def human_delay(min_s=3.0, max_s=8.0, spike_chance=0.1):
    """
    Simulate human reading time. Occasionally pause longer
    as if the user is reading a review carefully.
    """
    if random.random() < spike_chance:
        # Occasional longer pause (human got distracted)
        delay = random.uniform(15, 45)
    else:
        delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

def page_delay(page_num):
    """
    Slightly different delay pattern per page number.
    Humans slow down on later pages.
    """
    base = 4.0 + (page_num * 0.3)
    jitter = random.gauss(0, 1.5)
    delay = max(2.0, base + jitter)
    time.sleep(delay)

User-Agent Rotation

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
]

def get_headers():
    return {
        **HEADERS,
        "User-Agent": random.choice(USER_AGENTS),
    }

ThorData Proxy Integration

For any serious Amazon scraping, residential proxies are non-negotiable. Datacenter IPs are flagged almost immediately. Amazon maintains blocklists of major datacenter IP ranges (AWS, GCP, Azure, common proxy providers) and applies much stricter bot detection to those ranges.

ThorData's residential proxy network provides access to millions of real residential IPs, with automatic rotation and US-specific IP pools that match Amazon's expected traffic patterns.

Proxy Setup and Rotation

THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000

def get_proxy_url(session_id=None, country="us"):
    """
    Build ThorData proxy URL.
    session_id: use same IP for a session (sticky), or None for per-request rotation
    country: target country for the exit node
    """
    if session_id:
        # Sticky session — same IP across requests
        user = f"{THORDATA_USER}-session-{session_id}-country-{country}"
    else:
        # Rotating — new IP each request
        user = f"{THORDATA_USER}-country-{country}"

    return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"

def scrape_with_rotation(asin, page, star_filter="all_stars"):
    """Fetch with a fresh proxy IP per page."""
    proxy_url = get_proxy_url(country="us")
    url = build_review_url(asin, page, star_filter)
    html = fetch_with_cffi(url, proxy_url)
    return html

def scrape_with_sticky_session(asin, pages, star_filter="all_stars"):
    """
    Use the same IP across a short session (mimics one user browsing).
    Rotate sessions between products.
    """
    session_id = random.randint(10000, 99999)
    proxy_url = get_proxy_url(session_id=session_id, country="us")

    all_reviews = []
    for page in range(1, pages + 1):
        url = build_review_url(asin, page, star_filter)
        html = fetch_with_cffi(url, proxy_url)

        if not html:
            # Rotate to fresh session on block
            session_id = random.randint(10000, 99999)
            proxy_url = get_proxy_url(session_id=session_id, country="us")
            time.sleep(random.uniform(20, 40))
            continue

        soup = BeautifulSoup(html, "lxml")
        reviews = [parse_review_div(d) for d in soup.select('[data-hook="review"]')]
        all_reviews.extend(reviews)
        print(f"Page {page}: {len(reviews)} reviews")
        page_delay(page)

    return all_reviews

CAPTCHA Handling

Even with residential proxies, you'll occasionally hit CAPTCHA pages. Detect them and rotate:

def robust_fetch(asin, page, star_filter="all_stars", max_retries=4):
    """Fetch with automatic proxy rotation on CAPTCHA."""
    for attempt in range(max_retries):
        proxy_url = get_proxy_url(country="us")
        url = build_review_url(asin, page, star_filter)
        html = fetch_with_cffi(url, proxy_url)

        if html and not is_blocked(html):
            return html

        wait = min(30 * (2 ** attempt), 120)  # Exponential backoff, cap 2 min
        print(f"Attempt {attempt+1} blocked. Waiting {wait}s...")
        time.sleep(wait + random.uniform(0, 10))

    print(f"Page {page}: failed after {max_retries} attempts")
    return None

Star Rating Segmentation

Amazon caps accessible pages at ~500 per filter (5,000 reviews). For popular products with 50,000+ reviews, segment by star rating to multiply accessible reviews by 5–10x:

STAR_FILTERS = ["one_star", "two_star", "three_star", "four_star", "five_star"]
SORT_ORDERS = ["recent", "helpful"]

def scrape_all_segments(asin, pages_per_segment=50):
    """
    Scrape each star rating + sort order combination.
    Effectively gives access to 10x more reviews.
    """
    seen_ids = set()
    all_reviews = []

    for star in STAR_FILTERS:
        for sort in SORT_ORDERS:
            print(f"\nScraping {star} reviews, sorted by {sort}...")

            for page in range(1, pages_per_segment + 1):
                html = robust_fetch(asin, page, star_filter=star)
                if not html:
                    break

                soup = BeautifulSoup(html, "lxml")
                review_divs = soup.select('[data-hook="review"]')

                if not review_divs:
                    break

                new_reviews = 0
                for div in review_divs:
                    review = parse_review_div(div)
                    review["star_segment"] = star
                    review["sort_segment"] = sort

                    if review["review_id"] not in seen_ids:
                        seen_ids.add(review["review_id"])
                        all_reviews.append(review)
                        new_reviews += 1

                print(f"  Page {page}: {new_reviews} new ({len(all_reviews)} total unique)")
                if new_reviews == 0:
                    break  # All reviews on this page already seen

                page_delay(page)

    return all_reviews

Pagination Handling

Amazon paginates review pages with standard pageNumber query params, but there are edge cases:

def get_total_review_count(asin):
    """Extract total review count from the product reviews page."""
    html = robust_fetch(asin, page=1)
    if not html:
        return None

    soup = BeautifulSoup(html, "lxml")

    # Total count appears in multiple places
    selectors = [
        '[data-hook="total-review-count"]',
        '[data-hook="cr-filter-info-review-count"]',
        'span[data-action="reviews:filter-by-star:ratings-count"]',
    ]

    for selector in selectors:
        el = soup.select_one(selector)
        if el:
            text = el.get_text(strip=True)
            # Extract number from "1,234 global ratings" or "1,234 reviews"
            num = text.replace(",", "").split()[0]
            try:
                return int(num)
            except ValueError:
                continue

    return None

def calculate_scraping_strategy(asin):
    """Plan how many pages to scrape per segment."""
    total = get_total_review_count(asin)
    if not total:
        return {"pages_per_segment": 50, "total_estimate": "unknown"}

    print(f"Total reviews: {total:,}")
    per_segment = min(50, (total // 5 // 10) + 5)  # 10 reviews per page

    return {
        "total_reviews": total,
        "pages_per_segment": per_segment,
        "estimated_accessible": per_segment * 10 * 5,  # 5 star segments
    }

strategy = calculate_scraping_strategy("B0CX59THPZ")
print(f"Strategy: {strategy}")

Data Storage

Incremental JSONL Storage

Write each page's reviews immediately — don't hold everything in memory:

import json
from pathlib import Path
from datetime import datetime

def save_reviews_jsonl(reviews, filepath):
    """Append reviews to JSONL file — one JSON object per line."""
    path = Path(filepath)
    with path.open("a", encoding="utf-8") as f:
        for review in reviews:
            review["scraped_at"] = datetime.utcnow().isoformat()
            f.write(json.dumps(review, ensure_ascii=False) + "\n")

def load_reviews_jsonl(filepath):
    """Load all reviews from JSONL file."""
    path = Path(filepath)
    if not path.exists():
        return []
    reviews = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                try:
                    reviews.append(json.loads(line))
                except json.JSONDecodeError:
                    continue
    return reviews

# Usage: save after each page, resume after interruptions
asin = "B0CX59THPZ"  # target ASIN
output_file = f"reviews_{asin}.jsonl"
seen_file = Path(f"seen_ids_{asin}.txt")

# Load previously seen IDs to avoid duplicates on resume
seen_ids = set()
if seen_file.exists():
    seen_ids = set(seen_file.read_text().split())

for page in range(1, 51):
    html = robust_fetch(asin, page)
    if not html:
        break

    soup = BeautifulSoup(html, "lxml")
    divs = soup.select('[data-hook="review"]')
    new_reviews = []

    for div in divs:
        r = parse_review_div(div)
        if r["review_id"] and r["review_id"] not in seen_ids:
            new_reviews.append(r)
            seen_ids.add(r["review_id"])

    save_reviews_jsonl(new_reviews, output_file)
    # Update seen IDs file
    seen_file.write_text("\n".join(seen_ids))
    print(f"Page {page}: saved {len(new_reviews)} reviews")
    page_delay(page)

SQLite Storage for Analysis

For analysis queries, SQLite beats JSONL:

import sqlite3
from datetime import datetime

def init_reviews_db(db_path="amazon_reviews.db"):
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS reviews (
            review_id TEXT PRIMARY KEY,
            asin TEXT NOT NULL,
            title TEXT,
            body TEXT,
            rating REAL,
            date TEXT,
            verified INTEGER DEFAULT 0,
            helpful_votes INTEGER DEFAULT 0,
            author TEXT,
            star_segment TEXT,
            scraped_at TEXT,
            UNIQUE(review_id)
        );

        CREATE TABLE IF NOT EXISTS scrape_runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            asin TEXT,
            total_reviews INTEGER,
            pages_scraped INTEGER,
            started_at TEXT,
            completed_at TEXT
        );

        CREATE INDEX IF NOT EXISTS idx_asin ON reviews(asin);
        CREATE INDEX IF NOT EXISTS idx_rating ON reviews(rating);
        CREATE INDEX IF NOT EXISTS idx_date ON reviews(date);
    """)
    conn.commit()
    return conn

def insert_reviews(conn, reviews, asin):
    """Bulk insert reviews, ignore duplicates."""
    rows = [
        (
            r.get("review_id"), asin,
            r.get("title"), r.get("body"),
            r.get("rating"), r.get("date"),
            1 if r.get("verified") else 0,
            r.get("helpful_votes", 0),
            r.get("author"), r.get("star_segment"),
            datetime.utcnow().isoformat(),
        )
        for r in reviews
    ]
    conn.executemany("""
        INSERT OR IGNORE INTO reviews
        (review_id, asin, title, body, rating, date,
         verified, helpful_votes, author, star_segment, scraped_at)
        VALUES (?,?,?,?,?,?,?,?,?,?,?)
    """, rows)
    conn.commit()

def analyze_reviews(conn, asin):
    """Quick sentiment analysis from the database."""
    cursor = conn.execute("""
        SELECT
            rating,
            COUNT(*) as count,
            AVG(helpful_votes) as avg_helpful,
            SUM(CASE WHEN verified = 1 THEN 1 ELSE 0 END) as verified_count
        FROM reviews
        WHERE asin = ?
        GROUP BY rating
        ORDER BY rating
    """, (asin,))

    print(f"\nReview breakdown for {asin}:")
    for row in cursor.fetchall():
        stars = "★" * int(row[0]) if row[0] else "?"
        print(f"  {stars}: {row[1]} reviews, {row[3]} verified, {row[2]:.1f} avg helpful")

Real-World Use Cases

1. Competitor Analysis Tool

Track how a competitor's reviews evolve over time, particularly after product launches or recalls:

def monitor_competitor_reviews(asin_list, interval_hours=24):
    """
    Daily review monitor — alert when negative reviews spike.
    """
    import time
    conn = init_reviews_db()

    while True:
        for asin in asin_list:
            print(f"Checking {asin}...")
            reviews = scrape_with_sticky_session(asin, pages=3)
            insert_reviews(conn, reviews, asin)

            # Check for spike in 1-star reviews
            cursor = conn.execute("""
                SELECT COUNT(*) FROM reviews
                WHERE asin = ? AND rating <= 2
                  AND scraped_at > datetime('now', '-24 hours')
            """, (asin,))
            recent_negative = cursor.fetchone()[0]

            if recent_negative > 10:
                print(f"ALERT: {recent_negative} negative reviews in last 24h for {asin}")

        time.sleep(interval_hours * 3600)

2. Review Sentiment Aggregator

Extract key themes from negative reviews to inform product development:

from collections import Counter
import re

def extract_common_complaints(reviews, min_rating=2):
    """Find most common words/phrases in low-rated reviews."""
    negative_reviews = [r for r in reviews if (r.get("rating") or 5) <= min_rating]

    # Simple word frequency (replace with NLP for better results)
    word_freq = Counter()
    stop_words = {"the", "a", "an", "is", "it", "this", "was", "i", "and",
                  "to", "of", "in", "for", "on", "that", "my", "but", "not"}

    for review in negative_reviews:
        text = (review.get("body", "") + " " + review.get("title", "")).lower()
        words = re.findall(r"\b[a-z]{4,}\b", text)
        word_freq.update(w for w in words if w not in stop_words)

    return word_freq.most_common(20)

complaints = extract_common_complaints(reviews, min_rating=2)  # reviews from scrape_reviews()
print("Most common complaint terms:")
for word, count in complaints:
    print(f"  {word}: {count}")

3. Price-to-Satisfaction Correlation

Combine review data with pricing to find the optimal price point:

def analyze_price_sensitivity(reviews_by_price):
    """
    Compare ratings across price points.
    reviews_by_price: {price: [reviews]}
    """
    for price, reviews in sorted(reviews_by_price.items()):
        ratings = [r["rating"] for r in reviews if r.get("rating")]
        if ratings:
            avg = sum(ratings) / len(ratings)
            print(f"${price}: avg rating {avg:.2f} ({len(ratings)} reviews)")

Handling International Reviews

Amazon operates separate storefronts for each country. Reviews on amazon.co.uk differ from amazon.com. Use the same scraper with country-specific domains:

AMAZON_DOMAINS = {
    "us": "www.amazon.com",
    "uk": "www.amazon.co.uk",
    "de": "www.amazon.de",
    "fr": "www.amazon.fr",
    "jp": "www.amazon.co.jp",
    "ca": "www.amazon.ca",
    "au": "www.amazon.com.au",
}

def scrape_international_reviews(asin, countries=None, pages=5):
    """Scrape reviews from multiple Amazon storefronts."""
    if countries is None:
        countries = ["us", "uk", "de"]

    all_reviews = []
    for country in countries:
        domain = AMAZON_DOMAINS.get(country)
        if not domain:
            continue

        print(f"\nScraping {country.upper()} reviews...")
        # Use country-targeted proxy
        proxy_url = get_proxy_url(country=country)

        for page in range(1, pages + 1):
            url = f"https://{domain}/product-reviews/{asin}/?pageNumber={page}"
            html = fetch_with_cffi(url, proxy_url)
            if not html:
                break

            soup = BeautifulSoup(html, "lxml")
            divs = soup.select('[data-hook="review"]')
            if not divs:
                break

            for div in divs:
                review = parse_review_div(div)
                review["country"] = country
                review["storefront"] = domain
                all_reviews.append(review)

            page_delay(page)

    return all_reviews

Rate Limiting and Retry Logic

Production-grade retry logic with exponential backoff:

import functools
import time
import random

def with_retry(max_attempts=4, base_wait=10):
    """Decorator for automatic retry with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    result = func(*args, **kwargs)
                    if result is not None:
                        return result
                except Exception as e:
                    print(f"Attempt {attempt+1} error: {e}")

                if attempt < max_attempts - 1:
                    wait = base_wait * (2 ** attempt) + random.uniform(0, 5)
                    print(f"Retrying in {wait:.1f}s...")
                    time.sleep(wait)

            return None
        return wrapper
    return decorator

@with_retry(max_attempts=4, base_wait=15)
def fetch_page_with_retry(asin, page, star_filter="all_stars"):
    """Fetch with built-in retry."""
    proxy_url = get_proxy_url(country="us")
    url = build_review_url(asin, page, star_filter)
    html = fetch_with_cffi(url, proxy_url)

    if html and not is_blocked(html):
        return html
    return None

Legal and Ethical Considerations

Amazon's Terms of Service prohibit automated scraping. That matters both practically (blocked IPs, banned accounts) and legally: web scraping case law is unsettled, and what's permissible varies by jurisdiction and by what you do with the data.

Practical guidance:

- Use review data for research, analysis, and internal tooling, not for resale or republishing at scale
- Don't scrape review author contact information
- Respect rate limits; don't overwhelm Amazon's servers
- Store data with appropriate retention limits
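"Respect rate limits" can be enforced mechanically rather than left to discipline. A minimal token-bucket limiter (a sketch; tune `rate` and `capacity` to your own risk tolerance):

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self):
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

bucket = TokenBucket(rate=0.2, capacity=2)  # ~1 request per 5 seconds, burst of 2
```

Calling `bucket.acquire()` before every fetch guarantees a hard ceiling on request rate no matter how many code paths issue requests.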


Complete Pipeline

Putting it all together:

import json
from pathlib import Path
from datetime import datetime

def full_scrape_pipeline(asin, output_dir="output", max_pages_per_segment=30):
    """Complete Amazon review scraping pipeline."""
    out = Path(output_dir)
    out.mkdir(exist_ok=True)

    output_file = out / f"reviews_{asin}.jsonl"
    state_file = out / f"state_{asin}.json"

    # Load or initialize state
    state = {"seen_ids": [], "completed_segments": []}
    if state_file.exists():
        state = json.loads(state_file.read_text())
    seen_ids = set(state["seen_ids"])
    completed = set(state["completed_segments"])

    # Initialize DB
    conn = init_reviews_db(str(out / "amazon_reviews.db"))

    # Warm up session
    session_warmup()

    total_new = 0
    for star in STAR_FILTERS:
        segment_key = f"{star}_recent"
        if segment_key in completed:
            print(f"Skipping {segment_key} (already done)")
            continue

        print(f"\n=== Scraping {star} reviews ===")
        new_in_segment = 0

        for page in range(1, max_pages_per_segment + 1):
            html = fetch_page_with_retry(asin, page, star_filter=star)
            if not html:
                break

            soup = BeautifulSoup(html, "lxml")
            divs = soup.select('[data-hook="review"]')
            if not divs:
                break

            new_reviews = []
            for div in divs:
                r = parse_review_div(div)
                r["asin"] = asin
                r["star_segment"] = star
                if r["review_id"] not in seen_ids:
                    new_reviews.append(r)
                    seen_ids.add(r["review_id"])

            if new_reviews:
                save_reviews_jsonl(new_reviews, str(output_file))
                insert_reviews(conn, new_reviews, asin)
                new_in_segment += len(new_reviews)
                total_new += len(new_reviews)

            print(f"  Page {page}: {len(new_reviews)} new reviews")
            page_delay(page)

            # Save progress after each page
            state["seen_ids"] = list(seen_ids)
            state_file.write_text(json.dumps(state))

        completed.add(segment_key)
        state["completed_segments"] = list(completed)
        state_file.write_text(json.dumps(state))
        print(f"Segment complete: {new_in_segment} reviews")

    print(f"\nPipeline complete. {total_new} total new reviews for {asin}")
    analyze_reviews(conn, asin)
    return total_new

if __name__ == "__main__":
    full_scrape_pipeline("B0CX59THPZ", max_pages_per_segment=20)

Summary

Amazon review scraping in 2026 requires:

  1. Stable selectors: use data-hook attributes, which Amazon doesn't rotate
  2. TLS fingerprint spoofing: curl_cffi with Chrome impersonation
  3. Residential proxies: ThorData for IPs that pass Amazon's reputation checks
  4. Segmentation: scrape by star rating to multiply the accessible review count
  5. Incremental storage: JSONL + SQLite, saved after each page
  6. Retry logic: exponential backoff on blocks, proxy rotation on CAPTCHA

The arms race continues. Amazon updates their detection quarterly. The data-hook selectors have been stable for years, but proxy rotation and TLS fingerprinting will need periodic updates to match Amazon's latest detection methods. Build your scraper to be easily updatable rather than deeply hardcoded.
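"Easily updatable" can be as simple as keeping every selector in one dictionary, so a markup change is a one-line fix instead of a code hunt. A sketch mirroring the data-hook fields used throughout this guide:

```python
# All selectors in one place; parsing code references keys, never raw strings.
SELECTORS = {
    "review": '[data-hook="review"]',
    "title": '[data-hook="review-title"]',
    "body": '[data-hook="review-body"]',
    "rating": '[data-hook="review-star-rating"]',
    "date": '[data-hook="review-date"]',
    "verified": '[data-hook="avp-badge"]',
    "helpful": '[data-hook="helpful-vote-statement"]',
    "author": '[data-hook="review-author"]',
}

def sel(tag, key):
    """Look up a field by logical name; when Amazon changes markup,
    update SELECTORS and every parser picks up the fix."""
    return tag.select_one(SELECTORS[key])
```

The same applies to block markers, impersonation targets, and delay parameters: hoist them into one config module so detection changes cost minutes, not a rewrite.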