How to Scrape Amazon Product Reviews with Python (2026)
Amazon doesn't have a public reviews API. There's the Product Advertising API, but it doesn't include review text — just aggregate star ratings. If you need actual review content, ratings breakdowns, reviewer history, or verified purchase status, you're scraping HTML.
And Amazon really doesn't want you to. Their anti-bot system is one of the most aggressive on the web. You'll hit CAPTCHAs, IP blocks, and fingerprint-based detection within minutes if you're not careful. But this data is invaluable — for competitive analysis, sentiment research, product development feedback loops, review aggregation tools, and brand monitoring.
This guide walks through the full pipeline: URL structure, parsing, anti-detection, pagination, proxy integration with ThorData, and durable storage.
Why Scrape Amazon Reviews?
Amazon product reviews are some of the most valuable unstructured text on the internet:
- Competitive intelligence — understand what customers hate about rival products
- Market research — identify pain points that your product could solve
- Sentiment analysis — track brand reputation over time
- Review aggregation — build tools that consolidate reviews across platforms
- Fraud detection — identify fake review patterns (sudden bursts, suspiciously similar phrasing)
- Product development — surface the most common feature requests from 1-star and 2-star reviews
None of this is available through Amazon's official APIs. The Product Advertising API gives you star averages; the raw text lives in HTML.
Review URL Structure
Every Amazon product has an ASIN — a 10-character alphanumeric identifier. Review pages follow a consistent URL pattern:
https://www.amazon.com/product-reviews/{ASIN}/
?pageNumber={page}
&filterByStar={star_rating}
&reviewerType=avp_only_reviews
&sortBy=recent
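If you're starting from product-page URLs rather than ASINs, a small helper can pull the identifier out. This is an illustrative sketch (`extract_asin` is my name): the URL shapes it matches (`/dp/`, `/gp/product/`, `/product-reviews/`) are common patterns, not an exhaustive list.

```python
import re

# Illustrative: match the 10-character ASIN segment in common Amazon URL shapes.
ASIN_RE = re.compile(r"/(?:dp|gp/product|product-reviews)/([A-Z0-9]{10})")

def extract_asin(url):
    """Return the 10-character ASIN from a product URL, or None if absent."""
    m = ASIN_RE.search(url)
    return m.group(1) if m else None
```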
Key query parameters:
| Parameter | Options | Purpose |
|---|---|---|
| pageNumber | 1–500 | Pagination (10 reviews per page) |
| filterByStar | one_star, two_star, three_star, four_star, five_star, all_stars, critical, positive | Filter by rating |
| reviewerType | avp_only_reviews, all_reviews | Verified purchase filter |
| sortBy | recent, helpful | Sort order |
Amazon caps pages at around 500 (5,000 reviews per filter). For products with 100,000+ reviews, you need to segment by star rating to maximize coverage.
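The arithmetic behind that cap, as a quick sketch (constants taken from the limits described above):

```python
# From the limits above: ~500 pages per filter, 10 reviews per page.
PAGE_CAP = 500
REVIEWS_PER_PAGE = 10

def max_accessible(num_segments=1):
    """Upper bound on reviews reachable across the given number of filters."""
    return PAGE_CAP * REVIEWS_PER_PAGE * num_segments
```

A single all_stars pass tops out at 5,000 reviews; segmenting across the five star filters raises the ceiling to 25,000.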
Understanding the HTML Structure
Amazon's review HTML is relatively stable. The data-hook attributes act as consistent anchors for review elements:
<div data-hook="review">
<span data-hook="review-star-rating">4.0 out of 5 stars</span>
<span data-hook="review-title">Great product, minor issues</span>
<span data-hook="review-date">Reviewed in the United States on March 15, 2026</span>
<span data-hook="avp-badge">Verified Purchase</span>
<span data-hook="review-body">The build quality is excellent...</span>
<span data-hook="helpful-vote-statement">47 people found this helpful</span>
<span data-hook="review-author">CustomerName</span>
</div>
These data-hook attributes have been consistent for years. They're more reliable than class-based selectors, which Amazon rotates frequently to break scrapers.
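As a quick illustration of why these selectors are convenient, here's the sample markup above parsed with BeautifulSoup's attribute selectors (assumes `beautifulsoup4` is installed; the variable names are mine):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# The sample review markup from above, trimmed to a few fields.
SAMPLE = '''
<div data-hook="review">
  <span data-hook="review-star-rating">4.0 out of 5 stars</span>
  <span data-hook="review-title">Great product, minor issues</span>
  <span data-hook="avp-badge">Verified Purchase</span>
</div>
'''

soup = BeautifulSoup(SAMPLE, "html.parser")
review = soup.select_one('[data-hook="review"]')
# Attribute selectors survive Amazon's class-name rotation.
rating_text = review.select_one('[data-hook="review-star-rating"]').get_text(strip=True)
is_verified = review.select_one('[data-hook="avp-badge"]') is not None
```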
Basic Scraper
Start with a minimal working scraper, then add robustness:
import httpx
from bs4 import BeautifulSoup
import time
import random
import json
from pathlib import Path
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/125.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
def build_review_url(asin, page=1, star_filter="all_stars",
verified_only=False, sort="recent"):
url = f"https://www.amazon.com/product-reviews/{asin}/"
params = {
"pageNumber": page,
"filterByStar": star_filter,
"sortBy": sort,
}
if verified_only:
params["reviewerType"] = "avp_only_reviews"
# Build query string manually to preserve order
qs = "&".join(f"{k}={v}" for k, v in params.items())
return f"{url}?{qs}"
def parse_rating(el):
"""Extract float rating from Amazon star element."""
if not el:
return None
text = el.get_text()
try:
return float(text.split(" out of")[0].strip())
except (ValueError, IndexError):
return None
def parse_helpful_votes(el):
"""Parse helpful vote count like '47 people found this helpful'."""
if not el:
return 0
text = el.get_text(strip=True)
if text == "One person found this helpful":
return 1
try:
num = text.split(" ")[0].replace(",", "")
return int(num)
except (ValueError, IndexError):
return 0
def parse_review_div(div):
"""Extract all fields from a single review div."""
title_el = div.select_one('[data-hook="review-title"]')
body_el = div.select_one('[data-hook="review-body"]')
rating_el = div.select_one('[data-hook="review-star-rating"]')
date_el = div.select_one('[data-hook="review-date"]')
verified_el = div.select_one('[data-hook="avp-badge"]')
helpful_el = div.select_one('[data-hook="helpful-vote-statement"]')
author_el = div.select_one('[data-hook="review-author"]')
# Review ID lives on the parent div
review_id = div.get("id", "")
# Extract title text — skip leading star rating text if present
title_text = ""
if title_el:
# Title spans sometimes contain nested span with star text
spans = title_el.find_all("span")
title_text = spans[-1].get_text(strip=True) if spans else title_el.get_text(strip=True)
return {
"review_id": review_id,
"title": title_text,
"body": body_el.get_text(strip=True) if body_el else "",
"rating": parse_rating(rating_el),
"date": date_el.get_text(strip=True) if date_el else "",
"verified": verified_el is not None,
"helpful_votes": parse_helpful_votes(helpful_el),
"author": author_el.get_text(strip=True) if author_el else "",
}
def scrape_reviews_page(asin, page=1, star_filter="all_stars",
verified_only=False):
"""Fetch and parse a single review page."""
url = build_review_url(asin, page, star_filter, verified_only)
try:
resp = httpx.get(url, headers=HEADERS, follow_redirects=True,
timeout=20)
except httpx.TimeoutException:
print(f"Timeout on page {page}")
return None, None
if resp.status_code != 200:
print(f"Page {page}: HTTP {resp.status_code}")
return None, resp.status_code
html = resp.text
# Check for CAPTCHA or block
if is_blocked(html):
return None, "blocked"
soup = BeautifulSoup(html, "lxml")
review_divs = soup.select('[data-hook="review"]')
reviews = [parse_review_div(div) for div in review_divs]
return reviews, 200
def is_blocked(html):
"""Detect Amazon CAPTCHA or soft-block pages."""
block_markers = [
"Type the characters you see in this image",
"[email protected]",
"/errors/validateCaptcha",
"Enter the characters you see below",
"Sorry, we just need to make sure you",
]
return any(m in html for m in block_markers)
def scrape_reviews(asin, max_pages=10, star_filter="all_stars",
verified_only=False):
"""Scrape multiple pages of reviews with delays."""
all_reviews = []
for page in range(1, max_pages + 1):
reviews, status = scrape_reviews_page(
asin, page, star_filter, verified_only
)
if status == "blocked":
print(f"Blocked on page {page}. Stopping.")
break
if reviews is None:
break
if not reviews:
print(f"Page {page}: empty, done.")
break
all_reviews.extend(reviews)
print(f"Page {page}: {len(reviews)} reviews (total: {len(all_reviews)})")
if page < max_pages:
time.sleep(random.uniform(4.0, 8.0))
return all_reviews
# Basic usage
if __name__ == "__main__":
asin = "B0CX59THPZ" # Replace with your target ASIN
reviews = scrape_reviews(asin, max_pages=5)
print(f"Collected {len(reviews)} reviews")
This gets you started, but hits Amazon's anti-bot detection quickly at any meaningful scale.
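One refinement worth adding early: the `review-date` strings are prose, not timestamps. Here's a hedged helper to normalize them (`parse_review_date` is my name), assuming the English-locale "on Month D, YYYY" suffix shown in the sample HTML; other storefronts localize this differently.

```python
from datetime import datetime

def parse_review_date(text):
    """
    Normalize strings like 'Reviewed in the United States on March 15, 2026'
    to ISO dates. Assumes the English-locale 'on <Month D, YYYY>' suffix;
    returns None when the string doesn't match.
    """
    _, _, date_part = text.rpartition(" on ")
    try:
        return datetime.strptime(date_part.strip(), "%B %d, %Y").date().isoformat()
    except ValueError:
        return None
```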
Amazon's Anti-Bot Measures
Amazon runs one of the most sophisticated bot detection systems in e-commerce. Understanding it helps you build more resilient scrapers.
IP Reputation Tracking
Amazon maintains per-IP reputation scores. After 20–30 requests, fresh datacenter IPs start returning CAPTCHA pages. After continued requests, they return 503s. The reputation decays slowly — an IP blocked today may work again in 24–48 hours, but at any meaningful scraping volume, you'll exhaust IPs faster than they recover.
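One way to work with that decay is a minimal cooldown tracker (illustrative sketch; `ProxyCooldown` is not part of any library): bench an IP when it gets blocked and only reuse it after the reputation window has passed.

```python
import time

class ProxyCooldown:
    """Track blocked proxies/IPs and bench them for a cooldown window."""

    def __init__(self, cooldown_seconds=24 * 3600):
        self.cooldown = cooldown_seconds
        self.blocked_at = {}  # proxy id -> timestamp of last block

    def mark_blocked(self, proxy_id):
        """Record that this proxy just got a CAPTCHA/503."""
        self.blocked_at[proxy_id] = time.time()

    def is_usable(self, proxy_id):
        """A proxy is usable if never blocked, or the cooldown has elapsed."""
        ts = self.blocked_at.get(proxy_id)
        return ts is None or (time.time() - ts) >= self.cooldown
```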
TLS Fingerprinting
Amazon (and its CDN) inspects your TLS ClientHello — the cipher suites offered, extension order, and supported groups. Python's default TLS stack (OpenSSL via requests/httpx) produces a fingerprint distinctly unlike Chrome or Firefox. Amazon's detection picks this up and applies additional scrutiny to requests from known-bot TLS stacks.
Libraries like curl_cffi solve this by using libcurl with Chrome's exact TLS parameters:
import curl_cffi.requests as cffi_req
resp = cffi_req.get(
url,
headers=HEADERS,
impersonate="chrome124", # Mimics Chrome 124 TLS fingerprint
timeout=20,
)
Behavioral Analysis
Amazon tracks:
- Request cadence — machine-perfect timing (exactly 5s between requests) is a signal
- Navigation patterns — real browsers load CSS, images, fonts; scrapers only load HTML
- Cookie behavior — sessions that never set/read preferences look robotic
- Referrer chains — jumping directly to review pages without a product page visit is suspicious
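The referrer-chain signal is the easiest to address at the header level. A hedged sketch that attaches a product-page `Referer` before requesting reviews (`review_headers` is my name; `Sec-Fetch-Site` changes to `same-origin` because the navigation now appears to come from within the site):

```python
def review_headers(base_headers, asin, domain="www.amazon.com"):
    """Return headers with a product-page Referer for a review-page request."""
    return {
        **base_headers,
        # Pretend the user clicked through from the product page.
        "Referer": f"https://{domain}/dp/{asin}",
        "Sec-Fetch-Site": "same-origin",  # navigation within the same site
    }
```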
Cookie and Session Tracking
Amazon sets session cookies on first visit. If your "browser session" visits 47 different product review pages without ever visiting a product listing, that's anomalous. Warming up your session by visiting a few product pages before hitting reviews reduces detection risk.
Anti-Detection Implementation
Using curl_cffi for TLS Fingerprint Spoofing
import curl_cffi.requests as cffi_req
import random
import time
# Multiple Chrome versions to rotate
IMPERSONATIONS = ["chrome120", "chrome124", "chrome126"]
def fetch_with_cffi(url, proxy_url=None, retries=3):
"""Fetch with Chrome TLS fingerprint via curl_cffi."""
impersonate = random.choice(IMPERSONATIONS)
for attempt in range(retries):
try:
kwargs = {
"headers": HEADERS,
"impersonate": impersonate,
"timeout": 25,
"allow_redirects": True,  # curl_cffi uses the requests-style kwarg
}
if proxy_url:
kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}
resp = cffi_req.get(url, **kwargs)
if resp.status_code == 200 and not is_blocked(resp.text):
return resp.text
elif is_blocked(resp.text):
print(f"Attempt {attempt+1}: CAPTCHA/blocked")
time.sleep(random.uniform(15, 30))
else:
print(f"Attempt {attempt+1}: HTTP {resp.status_code}")
time.sleep(random.uniform(5, 10))
except Exception as e:
print(f"Attempt {attempt+1}: Error — {e}")
time.sleep(random.uniform(5, 15))
return None
def session_warmup(proxy_url=None):
"""Visit product and category pages before scraping reviews."""
warmup_urls = [
"https://www.amazon.com/",
"https://www.amazon.com/best-sellers-electronics/",
]
for url in warmup_urls:
fetch_with_cffi(url, proxy_url)
time.sleep(random.uniform(2, 5))
print("Session warmed up")
Randomized Timing Patterns
import random
def human_delay(min_s=3.0, max_s=8.0, spike_chance=0.1):
"""
Simulate human reading time. Occasionally pause longer
as if the user is reading a review carefully.
"""
if random.random() < spike_chance:
# Occasional longer pause (human got distracted)
delay = random.uniform(15, 45)
else:
delay = random.uniform(min_s, max_s)
time.sleep(delay)
return delay
def page_delay(page_num):
"""
Slightly different delay pattern per page number.
Humans slow down on later pages.
"""
base = 4.0 + (page_num * 0.3)
jitter = random.gauss(0, 1.5)
delay = max(2.0, base + jitter)
time.sleep(delay)
User-Agent Rotation
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.5 Safari/605.1.15",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
]
def get_headers():
return {
**HEADERS,
"User-Agent": random.choice(USER_AGENTS),
}
ThorData Proxy Integration
For any serious Amazon scraping, residential proxies are non-negotiable. Datacenter IPs are flagged almost immediately. Amazon maintains blocklists of major datacenter IP ranges (AWS, GCP, Azure, common proxy providers) and applies much stricter bot detection to those ranges.
ThorData's residential proxy network provides access to millions of real residential IPs, with automatic rotation and US-specific IP pools that match Amazon's expected traffic patterns.
Proxy Setup and Rotation
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
def get_proxy_url(session_id=None, country="us"):
"""
Build ThorData proxy URL.
session_id: use same IP for a session (sticky), or None for per-request rotation
country: target country for the exit node
"""
if session_id:
# Sticky session — same IP across requests
user = f"{THORDATA_USER}-session-{session_id}-country-{country}"
else:
# Rotating — new IP each request
user = f"{THORDATA_USER}-country-{country}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
def scrape_with_rotation(asin, page, star_filter="all_stars"):
"""Fetch with a fresh proxy IP per page."""
proxy_url = get_proxy_url(country="us")
url = build_review_url(asin, page, star_filter)
html = fetch_with_cffi(url, proxy_url)
return html
def scrape_with_sticky_session(asin, pages, star_filter="all_stars"):
"""
Use the same IP across a short session (mimics one user browsing).
Rotate sessions between products.
"""
session_id = random.randint(10000, 99999)
proxy_url = get_proxy_url(session_id=session_id, country="us")
all_reviews = []
for page in range(1, pages + 1):
url = build_review_url(asin, page, star_filter)
html = fetch_with_cffi(url, proxy_url)
if not html:
# Rotate to fresh session on block
session_id = random.randint(10000, 99999)
proxy_url = get_proxy_url(session_id=session_id, country="us")
time.sleep(random.uniform(20, 40))
continue
soup = BeautifulSoup(html, "lxml")
reviews = [parse_review_div(d) for d in soup.select('[data-hook="review"]')]
all_reviews.extend(reviews)
print(f"Page {page}: {len(reviews)} reviews")
page_delay(page)
return all_reviews
CAPTCHA Handling
Even with residential proxies, you'll occasionally hit CAPTCHA pages. Detect them and rotate:
def robust_fetch(asin, page, star_filter="all_stars", max_retries=4):
"""Fetch with automatic proxy rotation on CAPTCHA."""
for attempt in range(max_retries):
proxy_url = get_proxy_url(country="us")
url = build_review_url(asin, page, star_filter)
html = fetch_with_cffi(url, proxy_url)
if html and not is_blocked(html):
return html
wait = min(30 * (2 ** attempt), 120) # Exponential backoff, cap 2 min
print(f"Attempt {attempt+1} blocked. Waiting {wait}s...")
time.sleep(wait + random.uniform(0, 10))
print(f"Page {page}: failed after {max_retries} attempts")
return None
Star Rating Segmentation
Amazon caps accessible pages at ~500 per filter (5,000 reviews). For popular products with 50,000+ reviews, segment by star rating to multiply accessible reviews by 5–10x:
STAR_FILTERS = ["one_star", "two_star", "three_star", "four_star", "five_star"]
SORT_ORDERS = ["recent", "helpful"]
def scrape_all_segments(asin, pages_per_segment=50):
"""
Scrape each star rating + sort order combination.
Effectively gives access to 10x more reviews.
"""
seen_ids = set()
all_reviews = []
for star in STAR_FILTERS:
for sort in SORT_ORDERS:
print(f"\nScraping {star} reviews, sorted by {sort}...")
for page in range(1, pages_per_segment + 1):
html = robust_fetch(asin, page, star_filter=star)
if not html:
break
soup = BeautifulSoup(html, "lxml")
review_divs = soup.select('[data-hook="review"]')
if not review_divs:
break
new_reviews = 0
for div in review_divs:
review = parse_review_div(div)
review["star_segment"] = star
review["sort_segment"] = sort
if review["review_id"] not in seen_ids:
seen_ids.add(review["review_id"])
all_reviews.append(review)
new_reviews += 1
print(f" Page {page}: {new_reviews} new ({len(all_reviews)} total unique)")
if new_reviews == 0:
break # All reviews on this page already seen
page_delay(page)
return all_reviews
Pagination Handling
Amazon paginates review pages with standard pageNumber query params, but there are edge cases:
def get_total_review_count(asin):
"""Extract total review count from the product reviews page."""
html = robust_fetch(asin, page=1)
if not html:
return None
soup = BeautifulSoup(html, "lxml")
# Total count appears in multiple places
selectors = [
'[data-hook="total-review-count"]',
'[data-hook="cr-filter-info-review-count"]',
'span[data-action="reviews:filter-by-star:ratings-count"]',
]
for selector in selectors:
el = soup.select_one(selector)
if el:
text = el.get_text(strip=True)
# Extract number from "1,234 global ratings" or "1,234 reviews"
num = text.replace(",", "").split()[0]
try:
return int(num)
except ValueError:
continue
return None
def calculate_scraping_strategy(asin):
"""Plan how many pages to scrape per segment."""
total = get_total_review_count(asin)
if not total:
return {"pages_per_segment": 50, "total_estimate": "unknown"}
print(f"Total reviews: {total:,}")
per_segment = min(50, (total // 5 // 10) + 5) # 10 reviews per page
return {
"total_reviews": total,
"pages_per_segment": per_segment,
"estimated_accessible": per_segment * 10 * 5, # 5 star segments
}
strategy = calculate_scraping_strategy("B0CX59THPZ")
print(f"Strategy: {strategy}")
Data Storage
Incremental JSONL Storage
Write each page's reviews immediately — don't hold everything in memory:
import json
from pathlib import Path
from datetime import datetime
def save_reviews_jsonl(reviews, filepath):
"""Append reviews to JSONL file — one JSON object per line."""
path = Path(filepath)
with path.open("a", encoding="utf-8") as f:
for review in reviews:
review["scraped_at"] = datetime.utcnow().isoformat()
f.write(json.dumps(review, ensure_ascii=False) + "\n")
def load_reviews_jsonl(filepath):
"""Load all reviews from JSONL file."""
path = Path(filepath)
if not path.exists():
return []
reviews = []
with path.open(encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
try:
reviews.append(json.loads(line))
except json.JSONDecodeError:
continue
return reviews
# Usage: save after each page, resume after interruptions
output_file = f"reviews_{asin}.jsonl"
seen_file = Path(f"seen_ids_{asin}.txt")
# Load previously seen IDs to avoid duplicates on resume
seen_ids = set()
if seen_file.exists():
seen_ids = set(seen_file.read_text().split())
for page in range(1, 51):
html = robust_fetch(asin, page)
if not html:
break
soup = BeautifulSoup(html, "lxml")
divs = soup.select('[data-hook="review"]')
new_reviews = []
for div in divs:
r = parse_review_div(div)
if r["review_id"] and r["review_id"] not in seen_ids:
new_reviews.append(r)
seen_ids.add(r["review_id"])
save_reviews_jsonl(new_reviews, output_file)
# Update seen IDs file
seen_file.write_text("\n".join(seen_ids))
print(f"Page {page}: saved {len(new_reviews)} reviews")
page_delay(page)
SQLite Storage for Analysis
For analysis queries, SQLite beats JSONL:
import sqlite3
import json
from datetime import datetime
def init_reviews_db(db_path="amazon_reviews.db"):
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS reviews (
review_id TEXT PRIMARY KEY,
asin TEXT NOT NULL,
title TEXT,
body TEXT,
rating REAL,
date TEXT,
verified INTEGER DEFAULT 0,
helpful_votes INTEGER DEFAULT 0,
author TEXT,
star_segment TEXT,
scraped_at TEXT,
UNIQUE(review_id)
);
CREATE TABLE IF NOT EXISTS scrape_runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
asin TEXT,
total_reviews INTEGER,
pages_scraped INTEGER,
started_at TEXT,
completed_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_asin ON reviews(asin);
CREATE INDEX IF NOT EXISTS idx_rating ON reviews(rating);
CREATE INDEX IF NOT EXISTS idx_date ON reviews(date);
""")
conn.commit()
return conn
def insert_reviews(conn, reviews, asin):
"""Bulk insert reviews, ignore duplicates."""
rows = [
(
r.get("review_id"), asin,
r.get("title"), r.get("body"),
r.get("rating"), r.get("date"),
1 if r.get("verified") else 0,
r.get("helpful_votes", 0),
r.get("author"), r.get("star_segment"),
datetime.utcnow().isoformat(),
)
for r in reviews
]
conn.executemany("""
INSERT OR IGNORE INTO reviews
(review_id, asin, title, body, rating, date,
verified, helpful_votes, author, star_segment, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?)
""", rows)
conn.commit()
def analyze_reviews(conn, asin):
"""Quick sentiment analysis from the database."""
cursor = conn.execute("""
SELECT
rating,
COUNT(*) as count,
AVG(helpful_votes) as avg_helpful,
SUM(CASE WHEN verified = 1 THEN 1 ELSE 0 END) as verified_count
FROM reviews
WHERE asin = ?
GROUP BY rating
ORDER BY rating
""", (asin,))
print(f"\nReview breakdown for {asin}:")
for row in cursor.fetchall():
stars = "★" * int(row[0]) if row[0] else "?"
print(f" {stars}: {row[1]} reviews, {row[3]} verified, {row[2]:.1f} avg helpful")
Real-World Use Cases
1. Competitor Analysis Tool
Track how a competitor's reviews evolve over time, particularly after product launches or recalls:
def monitor_competitor_reviews(asin_list, interval_hours=24):
"""
Daily review monitor — alert when negative reviews spike.
"""
import time
conn = init_reviews_db()
while True:
for asin in asin_list:
print(f"Checking {asin}...")
reviews = scrape_with_sticky_session(asin, pages=3)
insert_reviews(conn, reviews, asin)
# Check for spike in 1-star reviews
cursor = conn.execute("""
SELECT COUNT(*) FROM reviews
WHERE asin = ? AND rating <= 2
AND scraped_at > datetime('now', '-24 hours')
""", (asin,))
recent_negative = cursor.fetchone()[0]
if recent_negative > 10:
print(f"ALERT: {recent_negative} negative reviews in last 24h for {asin}")
time.sleep(interval_hours * 3600)
2. Review Sentiment Aggregator
Extract key themes from negative reviews to inform product development:
from collections import Counter
import re
def extract_common_complaints(reviews, min_rating=2):
"""Find most common words/phrases in low-rated reviews."""
negative_reviews = [r for r in reviews if (r.get("rating") or 5) <= min_rating]
# Simple word frequency (replace with NLP for better results)
word_freq = Counter()
stop_words = {"the", "a", "an", "is", "it", "this", "was", "i", "and",
"to", "of", "in", "for", "on", "that", "my", "but", "not"}
for review in negative_reviews:
text = (review.get("body", "") + " " + review.get("title", "")).lower()
words = re.findall(r"\b[a-z]{4,}\b", text)
word_freq.update(w for w in words if w not in stop_words)
return word_freq.most_common(20)
complaints = extract_common_complaints(reviews, min_rating=2)
print("Most common complaint terms:")
for word, count in complaints:
print(f" {word}: {count}")
3. Price-to-Satisfaction Correlation
Combine review data with pricing to find the optimal price point:
def analyze_price_sensitivity(reviews_by_price):
"""
Compare ratings across price points.
reviews_by_price: {price: [reviews]}
"""
for price, reviews in sorted(reviews_by_price.items()):
ratings = [r["rating"] for r in reviews if r.get("rating")]
if ratings:
avg = sum(ratings) / len(ratings)
print(f"${price}: avg rating {avg:.2f} ({len(ratings)} reviews)")
Handling International Reviews
Amazon operates separate storefronts for each country. Reviews on amazon.co.uk differ from amazon.com. Use the same scraper with country-specific domains:
AMAZON_DOMAINS = {
"us": "www.amazon.com",
"uk": "www.amazon.co.uk",
"de": "www.amazon.de",
"fr": "www.amazon.fr",
"jp": "www.amazon.co.jp",
"ca": "www.amazon.ca",
"au": "www.amazon.com.au",
}
def scrape_international_reviews(asin, countries=None, pages=5):
"""Scrape reviews from multiple Amazon storefronts."""
if countries is None:
countries = ["us", "uk", "de"]
all_reviews = []
for country in countries:
domain = AMAZON_DOMAINS.get(country)
if not domain:
continue
print(f"\nScraping {country.upper()} reviews...")
# Use country-targeted proxy
proxy_url = get_proxy_url(country=country)
for page in range(1, pages + 1):
url = f"https://{domain}/product-reviews/{asin}/?pageNumber={page}"
html = fetch_with_cffi(url, proxy_url)
if not html:
break
soup = BeautifulSoup(html, "lxml")
divs = soup.select('[data-hook="review"]')
if not divs:
break
for div in divs:
review = parse_review_div(div)
review["country"] = country
review["storefront"] = domain
all_reviews.append(review)
page_delay(page)
return all_reviews
Rate Limiting and Retry Logic
Production-grade retry logic with exponential backoff:
import functools
import time
import random
def with_retry(max_attempts=4, base_wait=10):
"""Decorator for automatic retry with exponential backoff."""
def decorator(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
for attempt in range(max_attempts):
try:
result = func(*args, **kwargs)
if result is not None:
return result
except Exception as e:
print(f"Attempt {attempt+1} error: {e}")
if attempt < max_attempts - 1:
wait = base_wait * (2 ** attempt) + random.uniform(0, 5)
print(f"Retrying in {wait:.1f}s...")
time.sleep(wait)
return None
return wrapper
return decorator
@with_retry(max_attempts=4, base_wait=15)
def fetch_page_with_retry(asin, page, star_filter="all_stars"):
"""Fetch with built-in retry."""
proxy_url = get_proxy_url(country="us")
url = build_review_url(asin, page, star_filter)
html = fetch_with_cffi(url, proxy_url)
if html and not is_blocked(html):
return html
return None
Legal Considerations
Amazon's Terms of Service prohibit automated scraping. This matters practically and legally:
- ToS violation: Amazon can terminate your account and IP-block your traffic
- CFAA exposure: In the US, the Computer Fraud and Abuse Act has been used (controversially) against scrapers
- hiQ v. LinkedIn: Courts have ruled that scraping publicly accessible data is generally permissible, but this precedent doesn't fully cover Amazon's walled-garden data
Practical guidance:
- Use review data for research, analysis, and internal tooling — not for resale or republishing at scale
- Don't scrape review author contact information
- Respect rate limits; don't overwhelm Amazon's servers
- Store data with appropriate retention limits
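You can also check paths against robots.txt programmatically before scraping. A sketch using the standard library's `urllib.robotparser`; the rules below are illustrative only, so fetch the live `https://www.amazon.com/robots.txt` for the actual policy.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, user_agent, path):
    """Check a URL path against parsed robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(user_agent, path)

# Illustrative rules only; not Amazon's actual robots.txt.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /product-reviews/",
]
```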
Complete Pipeline
Putting it all together:
import json
from pathlib import Path
from datetime import datetime
def full_scrape_pipeline(asin, output_dir="output", max_pages_per_segment=30):
"""Complete Amazon review scraping pipeline."""
out = Path(output_dir)
out.mkdir(exist_ok=True)
output_file = out / f"reviews_{asin}.jsonl"
state_file = out / f"state_{asin}.json"
# Load or initialize state
state = {"seen_ids": [], "completed_segments": []}
if state_file.exists():
state = json.loads(state_file.read_text())
seen_ids = set(state["seen_ids"])
completed = set(state["completed_segments"])
# Initialize DB
conn = init_reviews_db(str(out / "amazon_reviews.db"))
# Warm up session
session_warmup()
total_new = 0
for star in STAR_FILTERS:
segment_key = f"{star}_recent"
if segment_key in completed:
print(f"Skipping {segment_key} (already done)")
continue
print(f"\n=== Scraping {star} reviews ===")
new_in_segment = 0
for page in range(1, max_pages_per_segment + 1):
html = fetch_page_with_retry(asin, page, star_filter=star)
if not html:
break
soup = BeautifulSoup(html, "lxml")
divs = soup.select('[data-hook="review"]')
if not divs:
break
new_reviews = []
for div in divs:
r = parse_review_div(div)
r["asin"] = asin
r["star_segment"] = star
if r["review_id"] not in seen_ids:
new_reviews.append(r)
seen_ids.add(r["review_id"])
if new_reviews:
save_reviews_jsonl(new_reviews, str(output_file))
insert_reviews(conn, new_reviews, asin)
new_in_segment += len(new_reviews)
total_new += len(new_reviews)
print(f" Page {page}: {len(new_reviews)} new reviews")
page_delay(page)
# Save progress after each page
state["seen_ids"] = list(seen_ids)
state_file.write_text(json.dumps(state))
completed.add(segment_key)
state["completed_segments"] = list(completed)
state_file.write_text(json.dumps(state))
print(f"Segment complete: {new_in_segment} reviews")
print(f"\nPipeline complete. {total_new} total new reviews for {asin}")
analyze_reviews(conn, asin)
return total_new
if __name__ == "__main__":
full_scrape_pipeline("B0CX59THPZ", max_pages_per_segment=20)
Summary
Amazon review scraping in 2026 requires:
- Stable selectors — use data-hook attributes, which Amazon doesn't rotate
- TLS fingerprint spoofing — curl_cffi with Chrome impersonation
- Residential proxies — ThorData for IPs that pass Amazon's reputation checks
- Segmentation — scrape by star rating to multiply accessible review count
- Incremental storage — JSONL + SQLite, save after each page
- Retry logic — exponential backoff on blocks, rotate proxy on CAPTCHA
The arms race continues. Amazon's detection evolves constantly: the data-hook selectors have been stable for years, but proxy rotation and TLS fingerprinting will need periodic updates to keep pace with Amazon's latest detection methods. Build your scraper to be easily updatable rather than deeply hardcoded.