Scraping Amazon Product Data Without Getting Blocked (2026)
Amazon runs one of the most aggressive anti-bot systems on the web. Send a few requests with a default Python user-agent and you'll hit a CAPTCHA wall within minutes. Scale that up and your IP gets blacklisted for hours.
But people still need Amazon data — price monitoring, competitor analysis, review tracking. Here's what actually works in 2026 without getting your infrastructure burned.
Why Amazon Is So Hard to Scrape
Amazon uses a layered defense system:
- DataDome and proprietary fingerprinting: They analyze TLS fingerprints, JavaScript execution patterns, mouse movements, and request timing. A simple `requests.get()` is flagged instantly.
- IP reputation scoring: Datacenter IPs are pre-flagged. Even clean residential IPs get throttled after repeated access.
- Login walls on certain data: Seller-specific metrics, purchase history, and some review pages require authentication. You can't scrape these without violating ToS.
- Dynamic HTML rendering: Product pages load critical data via JavaScript, meaning static HTML parsing misses prices and availability on some layouts.
The key insight: Amazon product pages are more accessible than search results. Search result pages have the tightest bot detection. Individual product pages (accessed via direct ASIN URLs) are comparatively easier to scrape.
What You Can Actually Get From Public Pages
Without logging in, you can reliably extract:
- Product title, description, and bullet points
- Current price and deal status
- Star rating and total review count
- Product images (high-res URLs are in the page source)
- "Frequently bought together" items
- Basic seller information
- BSR (Best Seller Rank) within categories
What you can't get without authentication: detailed seller analytics, your purchase history, subscriber discounts, full review text beyond the first page.
The Proxy Situation: Residential Is Non-Negotiable
For Amazon specifically, datacenter proxies are nearly useless. Amazon maintains blocklists of major datacenter IP ranges. Even rotating through thousands of datacenter IPs, you'll see CAPTCHA rates above 60%.
Residential proxies are the baseline requirement. These route through real ISP connections, making requests look like normal household traffic.
ThorData's residential proxy network works well for Amazon scraping — their pool covers IPs across multiple regions, which helps when Amazon serves different content based on location. The geo-targeting is useful for price comparison across markets.
Budget reality: expect to pay $3-8 per GB of residential proxy traffic. Amazon product pages average 200-400KB each, so you're looking at roughly 2,500-5,000 pages per GB.
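To turn those ranges into a per-page budget, the arithmetic can be sketched as follows. It uses only the figures quoted above (200-400KB per page, $3-8 per GB, with 1 GB counted as 1,000,000 KB) and ignores the extra bandwidth spent on retries and CAPTCHA pages, so treat the result as a lower bound. The function name is hypothetical.

```python
def proxy_cost_per_1000_pages(page_kb: float, usd_per_gb: float) -> float:
    """Estimated proxy cost in USD for 1,000 page fetches (1 GB = 1,000,000 KB)."""
    gb_used = page_kb * 1000 / 1_000_000
    return gb_used * usd_per_gb


best = proxy_cost_per_1000_pages(page_kb=200, usd_per_gb=3)   # light pages, cheap plan
worst = proxy_cost_per_1000_pages(page_kb=400, usd_per_gb=8)  # heavy pages, pricey plan
print(f"${best:.2f} to ${worst:.2f} per 1,000 pages")  # $0.60 to $3.20 per 1,000 pages
```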
Complete Amazon Product Scraper
Here's a full working scraper with proxy support, CAPTCHA detection, retry logic with backoff, and structured data extraction. It takes a single proxy URL, so IP rotation is left to the proxy provider's gateway:
```python
#!/usr/bin/env python3
"""
Amazon Product Scraper: Residential Proxy + Anti-Detection

Scrapes product data from Amazon product pages using direct ASIN URLs.
Handles CAPTCHA detection, automatic retry with backoff, a single proxy
endpoint (rotation is left to the provider's gateway), and exports to
JSON or CSV.

Usage:
    python amazon_scraper.py B0BSHF7WHW B0D1XD1ZV3
    python amazon_scraper.py --file asins.txt --format csv --output products.csv
    python amazon_scraper.py B0BSHF7WHW --domain amazon.co.uk

Requirements:
    pip install "httpx[http2,brotli]" selectolax
    (the http2 extra backs http2=True; the brotli extra lets httpx decode
    the "br" encoding advertised in Accept-Encoding)
"""
import argparse
import csv
import json
import random
import re
import sys
import time
from datetime import datetime, timezone
from pathlib import Path

import httpx
from selectolax.parser import HTMLParser

# Browser User-Agent pool (keep updated: stale UAs are a detection signal)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:132.0) Gecko/20100101 "
    "Firefox/132.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/18.1 Safari/605.1.15",
]

# Accept-Language variations to match different browser profiles
ACCEPT_LANGUAGES = [
    "en-US,en;q=0.9",
    "en-US,en;q=0.9,es;q=0.8",
    "en-GB,en;q=0.9,en-US;q=0.8",
    "en-US,en;q=0.5",
]


class AmazonScraper:
    def __init__(self, proxy_url: str | None = None, domain: str = "amazon.com",
                 delay_range: tuple = (5.0, 12.0)):
        """
        Args:
            proxy_url: Residential proxy URL (http://user:pass@host:port)
            domain: Amazon domain (amazon.com, amazon.co.uk, amazon.de, etc.)
            delay_range: Random delay between requests in seconds
        """
        self.domain = domain
        self.base_url = f"https://www.{domain}"
        self.delay_range = delay_range
        self.request_count = 0
        self.captcha_count = 0
        self.success_count = 0
        self.client_kwargs = {
            "timeout": 30,
            "follow_redirects": True,
            "http2": True,
        }
        if proxy_url:
            # "proxy" is the keyword in httpx >= 0.26; older releases used "proxies"
            self.client_kwargs["proxy"] = proxy_url

    def _get_headers(self) -> dict:
        """Generate realistic browser headers with randomization."""
        ua = random.choice(USER_AGENTS)
        return {
            "User-Agent": ua,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
                      "image/avif,image/webp,*/*;q=0.8",
            "Accept-Language": random.choice(ACCEPT_LANGUAGES),
            "Accept-Encoding": "gzip, deflate, br",
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "Cache-Control": "max-age=0",
        }

    def _is_captcha(self, html: str) -> bool:
        """Detect CAPTCHA / bot challenge pages."""
        captcha_signals = [
            "captcha", "robot check", "automated access",
            "enter the characters", "type the characters",
            "sorry, we just need to make sure",
            # Contact address shown on the robot-check page (the source text
            # had this string mangled by e-mail obfuscation)
            "api-services-support@amazon.com",
        ]
        html_lower = html.lower()
        return any(signal in html_lower for signal in captcha_signals)

    def _is_dog_page(self, html: str) -> bool:
        """Detect the Amazon 'sorry' dog page (soft block)."""
        html_lower = html.lower()
        return "sorry" in html_lower and "dogs of amazon" in html_lower

    def _rate_limit(self):
        """Apply adaptive rate limiting."""
        self.request_count += 1
        # Longer pause every 20 requests
        if self.request_count % 20 == 0:
            pause = random.uniform(20, 40)
            print(f"  Cooling down {pause:.0f}s after {self.request_count} requests "
                  f"({self.captcha_count} CAPTCHAs so far)...")
            time.sleep(pause)
        else:
            time.sleep(random.uniform(*self.delay_range))

    def fetch_product_page(self, asin: str, max_retries: int = 3) -> str | None:
        """
        Fetch raw HTML for a product page with retry logic.
        Returns HTML string or None on failure.
        """
        url = f"{self.base_url}/dp/{asin}"
        for attempt in range(max_retries):
            try:
                headers = self._get_headers()
                # A fresh client per attempt means no cookies carry over
                with httpx.Client(**self.client_kwargs) as client:
                    resp = client.get(url, headers=headers)

                if resp.status_code == 503 or self._is_captcha(resp.text):
                    self.captcha_count += 1
                    # Exponential backoff (10s, 20s, 40s) plus random jitter
                    wait = (2 ** attempt) * 10 + random.uniform(5, 15)
                    print(f"  CAPTCHA on {asin} (attempt {attempt + 1}). "
                          f"Backing off {wait:.0f}s...")
                    time.sleep(wait)
                    continue

                if self._is_dog_page(resp.text):
                    print(f"  Dog page on {asin}: IP may be flagged. "
                          f"Waiting 60s...")
                    time.sleep(60)
                    continue

                if resp.status_code == 404:
                    print(f"  {asin}: product not found (404)")
                    return None

                if resp.status_code == 200:
                    self.success_count += 1
                    return resp.text

                print(f"  {asin}: HTTP {resp.status_code} on attempt {attempt + 1}")
                time.sleep(5)
            except httpx.RequestError as e:
                print(f"  {asin}: connection error - {e}")
                time.sleep(5)

        print(f"  {asin}: all {max_retries} attempts failed")
        return None

    def parse_product(self, html: str, asin: str) -> dict:
        """Extract structured product data from HTML."""
        tree = HTMLParser(html)
        product = {"asin": asin, "url": f"{self.base_url}/dp/{asin}"}

        # Title
        title_el = tree.css_first("#productTitle")
        product["title"] = title_el.text(strip=True) if title_el else None

        # Price: Amazon uses multiple price containers
        price = None
        for selector in [
            ".a-price .a-offscreen",
            "#priceblock_ourprice",
            "#priceblock_dealprice",
            "span.a-price span.a-offscreen",
            "#corePrice_feature_div .a-offscreen",
        ]:
            el = tree.css_first(selector)
            if el:
                price = el.text(strip=True)
                break
        product["price"] = price

        # Original price (for deals)
        orig_price_el = tree.css_first(
            ".a-price.a-text-price .a-offscreen, "
            "#listPrice, .basisPrice .a-offscreen"
        )
        product["original_price"] = (
            orig_price_el.text(strip=True) if orig_price_el else None
        )

        # Rating
        rating_el = tree.css_first("#acrPopover span.a-size-base, #acrPopover .a-icon-alt")
        if rating_el:
            rating_text = rating_el.text(strip=True)
            match = re.search(r'(\d+\.?\d*)', rating_text)
            product["rating"] = float(match.group(1)) if match else None
        else:
            product["rating"] = None

        # Review count
        review_el = tree.css_first("#acrCustomerReviewText")
        if review_el:
            review_text = review_el.text(strip=True).replace(",", "")
            match = re.search(r'(\d+)', review_text)
            product["review_count"] = int(match.group(1)) if match else 0
        else:
            product["review_count"] = 0

        # Availability
        avail_el = tree.css_first("#availability span, #availability")
        product["availability"] = (
            avail_el.text(strip=True) if avail_el else "Unknown"
        )

        # Brand
        brand_el = tree.css_first("#bylineInfo, a#brand")
        if brand_el:
            brand_text = brand_el.text(strip=True)
            brand_text = re.sub(r'^(Visit the |Brand: )', '', brand_text)
            brand_text = brand_text.replace(" Store", "")
            product["brand"] = brand_text
        else:
            product["brand"] = None

        # Bullet points / feature list
        bullets = []
        for li in tree.css("#feature-bullets ul li span.a-list-item"):
            text = li.text(strip=True)
            if text and "see more product details" not in text.lower():
                bullets.append(text)
        product["features"] = bullets

        # Product description
        desc_el = tree.css_first("#productDescription p, #productDescription span")
        product["description"] = desc_el.text(strip=True) if desc_el else None

        # Best Seller Rank
        bsr_el = tree.css_first("#SalesRank, th:contains('Best Sellers Rank') + td")
        if bsr_el:
            bsr_text = bsr_el.text(strip=True)
            match = re.search(r'#([\d,]+)', bsr_text)
            product["bsr"] = (
                int(match.group(1).replace(",", "")) if match else None
            )
            # Extract category
            cat_match = re.search(r'in\s+(.+?)(?:\(|$)', bsr_text)
            product["bsr_category"] = (
                cat_match.group(1).strip() if cat_match else None
            )
        else:
            product["bsr"] = None
            product["bsr_category"] = None

        # Images (high-res URLs from the page source)
        images = []
        img_matches = re.findall(
            r'"hiRes"\s*:\s*"(https://[^"]+)"', html
        )
        images.extend(img_matches)
        if not images:
            # Fallback to main image
            main_img = tree.css_first("#landingImage, #imgBlkFront")
            if main_img:
                src = main_img.attributes.get("src", "")
                if src:
                    images.append(src)
        product["images"] = images[:10]  # cap at 10

        # ASIN confirmation from page. A hidden <input> carries the ASIN in
        # its value attribute, a table cell carries it as text.
        asin_el = tree.css_first("th:contains('ASIN') + td, input#ASIN")
        if asin_el:
            product["asin_confirmed"] = (
                asin_el.attributes.get("value") or asin_el.text(strip=True)
            )

        product["scraped_at"] = datetime.now(timezone.utc).isoformat()
        product["domain"] = self.domain
        return product

    def scrape_product(self, asin: str) -> dict | None:
        """Fetch and parse a single product."""
        html = self.fetch_product_page(asin)
        if not html:
            return None
        return self.parse_product(html, asin)

    def scrape_batch(self, asins: list[str]) -> list[dict]:
        """Scrape multiple products with rate limiting."""
        # Drop blank entries up front so the progress and success figures are accurate
        asins = [a.strip() for a in asins if a.strip()]
        results = []
        total = len(asins)
        for i, asin in enumerate(asins):
            print(f"[{i + 1}/{total}] Scraping {asin}...")
            product = self.scrape_product(asin)
            if product:
                results.append(product)
                title = (product["title"] or "No title")[:60]
                print(f"  {title}")
                print(f"  Price: {product['price']} | "
                      f"Rating: {product['rating']} | "
                      f"Reviews: {product['review_count']:,}")
            if i < total - 1:
                self._rate_limit()

        print(f"\nDone. Scraped {len(results)}/{total} products.")
        if total:
            print(f"Success rate: {len(results) / total * 100:.0f}% "
                  f"| CAPTCHAs hit: {self.captcha_count}")
        return results


def export_json(products: list[dict], filename: str):
    """Export products to JSON."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(products, f, indent=2, ensure_ascii=False)
    print(f"Exported {len(products)} products to {filename}")


def export_csv(products: list[dict], filename: str):
    """Export products to CSV (features flattened to semicolons)."""
    if not products:
        return
    flat = []
    for p in products:
        row = {k: v for k, v in p.items() if k not in ("features", "images")}
        row["features"] = "; ".join(p.get("features", []))
        row["image_count"] = len(p.get("images", []))
        row["main_image"] = p["images"][0] if p.get("images") else ""
        flat.append(row)

    fieldnames = list(flat[0].keys())
    with open(filename, "w", newline="", encoding="utf-8") as f:
        # asin_confirmed is only present on some rows; leave it out of the CSV
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(flat)
    print(f"Exported {len(flat)} products to {filename}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Scrape Amazon product data via ASIN"
    )
    parser.add_argument("asins", nargs="*", help="Amazon ASINs to scrape")
    parser.add_argument("--file", "-f", help="File with ASINs (one per line)")
    parser.add_argument("--format", choices=["json", "csv"], default="json")
    parser.add_argument("--output", "-o", help="Output filename")
    parser.add_argument("--domain", default="amazon.com",
                        help="Amazon domain (amazon.com, amazon.co.uk, etc.)")
    parser.add_argument("--proxy", help="Proxy URL (http://user:pass@host:port)")
    args = parser.parse_args()

    asins = list(args.asins)
    if args.file:
        asins.extend(Path(args.file).read_text().strip().splitlines())
    if not asins:
        print("No ASINs provided. Usage: python amazon_scraper.py B0BSHF7WHW")
        sys.exit(1)

    scraper = AmazonScraper(proxy_url=args.proxy, domain=args.domain)
    products = scraper.scrape_batch(asins)

    out_file = args.output or f"amazon_products.{args.format}"
    if args.format == "csv":
        export_csv(products, out_file)
    else:
        export_json(products, out_file)
```
Expected output
Running python amazon_scraper.py B0BSHF7WHW B0D1XD1ZV3 --format json:
[
{
"asin": "B0BSHF7WHW",
"url": "https://www.amazon.com/dp/B0BSHF7WHW",
"title": "Apple AirPods Pro (2nd Generation) Wireless Ear Buds",
"price": "$189.99",
"original_price": "$249.00",
"rating": 4.7,
"review_count": 87432,
"availability": "In Stock",
"brand": "Apple",
"features": [
"PIONEERING HEARING — AirPods Pro 2 unlock world-class hearing aid...",
"INTELLIGENT NOISE CONTROL — Active Noise Cancellation removes...",
"IMPROVED SOUND AND CALL QUALITY — A custom Apple-designed chip..."
],
"bsr": 3,
"bsr_category": "Electronics",
"images": [
"https://m.media-amazon.com/images/I/61f1YfTkTDL._AC_SL1500_.jpg"
],
"domain": "amazon.com",
"scraped_at": "2026-03-30T14:22:00+00:00"
}
]
CSV output:
asin,url,title,price,original_price,rating,review_count,availability,brand,bsr,bsr_category,features,image_count,main_image
B0BSHF7WHW,https://www.amazon.com/dp/B0BSHF7WHW,"Apple AirPods Pro...",$189.99,$249.00,4.7,87432,In Stock,Apple,3,Electronics,"PIONEERING HEARING...;INTELLIGENT NOISE...",6,https://m.media-amazon.com/...
Use Case 1: Price Monitoring Dashboard
Track price changes over time. This is useful for deal alerts, competitor pricing, or purchase timing:

```python
"""
Amazon Price Monitor

Runs on a schedule to track price history for a watchlist of ASINs.
Appends to a CSV log for time-series analysis.
"""
import csv
from datetime import datetime, timezone
from pathlib import Path

# Assumes the scraper above is saved as amazon_scraper.py
from amazon_scraper import AmazonScraper


def monitor_prices(asins: list[str], log_file: str = "price_history.csv",
                   proxy_url: str | None = None) -> list[dict]:
    """Scrape current prices and append to history log."""
    scraper = AmazonScraper(proxy_url=proxy_url)
    products = scraper.scrape_batch(asins)
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")

    file_exists = Path(log_file).exists()
    with open(log_file, "a", newline="", encoding="utf-8") as f:
        fieldnames = [
            "timestamp", "asin", "title", "price", "original_price",
            "rating", "review_count", "availability", "bsr"
        ]
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if not file_exists:
            writer.writeheader()
        for p in products:
            writer.writerow({
                "timestamp": timestamp,
                "asin": p["asin"],
                "title": (p["title"] or "")[:80],
                "price": p["price"],
                "original_price": p["original_price"],
                "rating": p["rating"],
                "review_count": p["review_count"],
                "availability": p["availability"],
                "bsr": p["bsr"],
            })

    # Print price summary
    print(f"\n{'ASIN':<14} {'Price':>10} {'Was':>10} {'Rating':>7} {'Reviews':>8}")
    print("-" * 55)
    for p in products:
        print(f"{p['asin']:<14} {p['price'] or 'N/A':>10} "
              f"{p['original_price'] or '-':>10} "
              f"{p['rating'] or '-':>7} {p['review_count']:>8,}")
    return products


# Run daily via cron. The scraper's CLI has no monitoring flag, so point cron
# at a small wrapper script that calls monitor_prices() (price_monitor.py is a
# placeholder name):
# 0 9 * * * cd /path/to && python price_monitor.py
```
Use Case 2: Cross-Market Price Comparison
Compare the same product across different Amazon marketplaces to find price differences:

```python
"""
Cross-market Amazon price comparison.

Checks the same ASIN across multiple Amazon domains.
"""
import random
import time

# Assumes the scraper above is saved as amazon_scraper.py
from amazon_scraper import AmazonScraper

AMAZON_DOMAINS = [
    "amazon.com",    # US
    "amazon.co.uk",  # UK
    "amazon.de",     # Germany
    "amazon.fr",     # France
    "amazon.co.jp",  # Japan
    "amazon.ca",     # Canada
]


def compare_markets(asin: str, domains: list[str] | None = None,
                    proxy_url: str | None = None) -> list[dict]:
    """Compare prices for an ASIN across Amazon marketplaces."""
    domains = domains or AMAZON_DOMAINS
    print(f"\nCross-market price comparison for {asin}")
    print("=" * 60)

    results = []
    for domain in domains:
        print(f"\n  Checking {domain}...")
        scraper = AmazonScraper(proxy_url=proxy_url, domain=domain)
        product = scraper.scrape_product(asin)
        if product and product["price"]:
            results.append({
                "domain": domain,
                "price": product["price"],
                "availability": product["availability"],
                "rating": product["rating"],
                "review_count": product["review_count"],
            })
            print(f"  Price: {product['price']} | {product['availability']}")
        else:
            print("  Not available or blocked")
        time.sleep(random.uniform(5, 10))

    if results:
        print(f"\n{'Domain':<20} {'Price':>12} {'Rating':>8} {'Reviews':>10}")
        print("-" * 52)
        for r in results:
            print(f"{r['domain']:<20} {r['price']:>12} "
                  f"{r['rating'] or '-':>8} {r['review_count']:>10,}")
    return results
```
Use Case 3: Review Trend Tracker
Monitor how review counts and ratings change over time. This is useful for detecting review manipulation or tracking product reception:

```python
"""
Amazon Review Trend Tracker

Tracks review count and rating changes daily.
Flags anomalies like sudden review spikes (possible fake reviews).
"""
import csv
from datetime import datetime, timezone
from pathlib import Path

# Assumes the scraper above is saved as amazon_scraper.py
from amazon_scraper import AmazonScraper


def track_review_trends(asins: list[str], history_file: str = "review_trends.csv",
                        proxy_url: str | None = None) -> list[dict]:
    """Track review metrics and flag anomalies."""
    scraper = AmazonScraper(proxy_url=proxy_url)
    products = scraper.scrape_batch(asins)
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    history = Path(history_file)

    # Load the most recent previous row per ASIN for comparison
    previous = {}
    if history.exists():
        with open(history, "r", newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for row in reader:
                key = row["asin"]
                if key not in previous or row["date"] > previous[key]["date"]:
                    previous[key] = row

    # Append new data
    file_exists = history.exists()
    with open(history, "a", newline="", encoding="utf-8") as f:
        fieldnames = ["date", "asin", "title", "rating", "review_count", "bsr", "flag"]
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if not file_exists:
            writer.writeheader()
        for p in products:
            flags = []
            prev = previous.get(p["asin"])
            if prev:
                old_reviews = int(prev.get("review_count") or 0)
                new_reviews = p["review_count"]
                increase = new_reviews - old_reviews
                # Flag if review count jumped by more than 5% since the last
                # run (one day, if the tracker runs daily)
                if old_reviews > 0 and increase > old_reviews * 0.05:
                    flags.append(f"SPIKE: +{increase} reviews since {prev['date']}")
                # Flag if rating dropped significantly
                old_rating = float(prev.get("rating") or 0)
                if old_rating > 0 and p["rating"] and p["rating"] < old_rating - 0.2:
                    flags.append(f"RATING_DROP: {old_rating} -> {p['rating']}")
            flag = " ".join(flags)
            if flag:
                print(f"  WARNING: {p['asin']} - {flag}")
            writer.writerow({
                "date": today,
                "asin": p["asin"],
                "title": (p["title"] or "")[:60],
                "rating": p["rating"],
                "review_count": p["review_count"],
                "bsr": p["bsr"],
                "flag": flag,
            })
    return products
```
Rate Limiting: The Single Most Important Thing
The single biggest mistake is going too fast. Even with perfect proxies, sending 10 requests per second in any pattern will trigger detection.
What works in practice
| Setting | Value | Why |
|---|---|---|
| Delay between requests | 5-12 seconds random | Fixed intervals are a bot signature |
| Cooldown every 20 requests | 20-40 seconds | Prevents cumulative detection |
| CAPTCHA backoff | Exponential (10s, 20s, 40s) | Don't retry immediately |
| Max requests per IP per hour | ~50-80 | Beyond this, blocks increase sharply |
| Session duration | Fresh client per 50 requests | Prevents cookie tracking |
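It is worth checking how the delay settings relate to the per-IP hourly ceiling. The sketch below averages the pacing used by the scraper's `_rate_limit()` (every 20th pause is a cooldown instead of a normal delay) and ignores fetch time and CAPTCHA backoff unless you pass an assumed `fetch_seconds`. The function name and the `fetch_seconds` parameter are hypothetical.

```python
import math


def requests_per_hour(delay_range=(5.0, 12.0), cooldown_every=20,
                      cooldown_range=(20.0, 40.0), fetch_seconds=0.0) -> float:
    """Average requests per hour implied by the pacing settings."""
    mean_delay = sum(delay_range) / 2
    mean_cooldown = sum(cooldown_range) / 2
    # One cycle = (cooldown_every - 1) normal delays + 1 cooldown + fetch time
    cycle_seconds = ((cooldown_every - 1) * mean_delay + mean_cooldown
                     + cooldown_every * fetch_seconds)
    return cooldown_every * 3600 / cycle_seconds


rate = requests_per_hour()
print(f"{rate:.0f} requests/hour through one endpoint")
print(f"IPs needed to stay under 80/hour each: {math.ceil(rate / 80)}")
```

With the defaults this comes to roughly 376 requests per hour, several times the 50-80 per-IP figure in the table. The delays alone therefore only keep you under the ceiling if the proxy gateway spreads traffic over at least five exit IPs; on a single sticky IP the average gap would need to be closer to 45 seconds.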
Signals that get you blocked
- Fixed-interval requests: Real humans don't browse at exactly 3.0-second intervals
- Accessing only `/dp/` pages: Real users browse categories, search, and click around
- Stale User-Agent strings: Chrome 100 in 2026 is an obvious red flag
- Missing browser headers: `Sec-Fetch-*` headers are expected by modern detection
- HTTP/1.1 only: Modern browsers negotiate HTTP/2; failing to do so is a fingerprint
What Breaks and When to Expect It
Amazon updates their anti-bot measures roughly every 2-4 weeks. Common breakage points:
- Selector changes: CSS selectors for price, title, and reviews shift periodically. The scraper returns `None` values when this happens. Check `#corePrice_feature_div` variants first.
- New fingerprinting checks: TLS fingerprint detection gets tighter over time. The `httpx` library's default TLS fingerprint is on some watchlists. Consider `curl_cffi` if blocks increase without higher request volume.
- Geographic restrictions: Some product pages now redirect based on IP geolocation, returning different content or blocking entirely.
- JavaScript-rendered prices: Some product variants only show prices after JS execution. When the scraper returns `None` for price but other fields work, this is likely the cause. Use Playwright as a fallback for these cases.
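Since most of these failures show up as fields silently coming back empty, a cheap early-warning check is to measure how often each field is filled across a batch. The sketch below works on the list of dicts returned by `scrape_batch()`. The helper names, the default field list, and the 50% threshold are all assumptions to tune against your own baseline.

```python
from collections import Counter

DEFAULT_FIELDS = ("title", "price", "rating", "bsr")


def field_fill_rates(products: list[dict], fields=DEFAULT_FIELDS) -> dict[str, float]:
    """Share of scraped products with a non-empty value for each field."""
    if not products:
        return {}
    filled = Counter()
    for product in products:
        for field in fields:
            if product.get(field) not in (None, "", []):
                filled[field] += 1
    return {field: filled[field] / len(products) for field in fields}


def broken_fields(products: list[dict], threshold: float = 0.5,
                  fields=DEFAULT_FIELDS) -> list[str]:
    """Fields whose fill rate fell below the threshold (likely selector breakage)."""
    rates = field_fill_rates(products, fields)
    return [field for field, rate in rates.items() if rate < threshold]


sample = [
    {"title": "A", "price": None, "rating": 4.5, "bsr": 10},
    {"title": "B", "price": None, "rating": 4.1, "bsr": None},
    {"title": "C", "price": "$9.99", "rating": None, "bsr": 7},
]
print(broken_fields(sample))
```

In the sample only `price` falls below the threshold, which matches the pattern described above: a price that goes missing while the other fields still parse points at a changed selector or a JavaScript-rendered price.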
Plan for maintenance. Any Amazon scraper that works today will need updates within a month.
The Easier Alternative: Pre-Built Actors
If you don't want to maintain proxy infrastructure and fight Amazon's constantly-changing selectors, Apify's Amazon scraper actors handle the anti-bot layer for you. They maintain the selectors, rotate proxies internally, and handle CAPTCHAs. You pay per result instead of managing infrastructure.
This makes more sense when you need data reliably at scale rather than building scraping as a core competency.
Summary
The formula for scraping Amazon without blocks in 2026: residential proxies + slow request rates + direct product page URLs + realistic browser headers + adaptive backoff. Skip any of these and you'll hit CAPTCHAs immediately.
The complete scraper above handles CAPTCHA detection, exponential backoff, dog-page detection, multi-domain support, and clean CSV/JSON export. For production use, combine it with the price monitoring or review tracking pipelines to build ongoing data collection that survives Amazon's regular anti-bot updates.