Scraping Amazon Prime Video: Catalog Data and Streaming Availability (2026)
Amazon Prime Video is one of the harder streaming platforms to scrape. Unlike Netflix or Disney+, Amazon doesn't have a widely documented internal API that people have reverse-engineered. Their catalog sits behind multiple layers of authentication, dynamic rendering, and aggressive bot detection. But the data is valuable — content availability varies wildly by country, pricing for rent/buy titles changes constantly, and there's no public dataset tracking any of it.
Here's what actually works in 2026 for extracting Prime Video catalog data.
What You Can Extract
Prime Video pages contain surprisingly rich structured data if you know where to look:
- Title metadata — name, release year, genre tags, runtime, MPAA rating, IMDb score
- Availability status — included with Prime, rent price, buy price, not available
- Cast and crew — actors, directors, writers with their Amazon person IDs
- Episode data — for series: season count, episode titles, individual runtimes
- Country availability — which titles are available in which regions
- Customer reviews — star ratings, review text, helpful votes
- Related titles — "customers also watched" recommendations
The structured data lives in two places: JSON-LD in the page source and a GraphQL-like internal API that the React frontend calls.
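To make the JSON-LD route concrete, here's a minimal sketch of pulling the interesting fields out of a schema.org Movie blob. The sample payload is illustrative only — field names follow schema.org's Movie type, and the exact shape Amazon emits may differ from this:

```python
import json

# Trimmed, illustrative JSON-LD payload (schema.org Movie type).
SAMPLE_JSONLD = """
{
  "@context": "https://schema.org",
  "@type": "Movie",
  "name": "Example Movie",
  "dateCreated": "2021-03-12",
  "genre": ["Drama", "Thriller"],
  "contentRating": "R",
  "aggregateRating": {"@type": "AggregateRating",
                      "ratingValue": "7.4", "ratingCount": "1283"}
}
"""

def parse_jsonld(raw: str) -> dict:
    """Extract the fields we care about from a Movie/TVSeries JSON-LD blob."""
    ld = json.loads(raw)
    genres = ld.get("genre", [])
    if isinstance(genres, str):  # schema.org allows a bare string here
        genres = [genres]
    agg = ld.get("aggregateRating") or {}
    year = str(ld.get("dateCreated", ""))[:4]
    return {
        "title": ld.get("name", ""),
        "content_type": ld.get("@type", ""),
        "release_year": int(year) if year.isdigit() else 0,
        "genres": genres,
        "rating": ld.get("contentRating", ""),
        "imdb_score": float(agg.get("ratingValue", 0)),
    }
```

The same function works unchanged on whatever `script[type="application/ld+json"]` tag you pull out of a live page, as long as the `@type` is one of the video types.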
Anti-Bot Defenses
Amazon's bot detection is serious. They run their own detection stack across all Amazon properties, and Prime Video inherits all of it.
Device fingerprinting. Amazon's JavaScript collects canvas fingerprints, WebGL renderer strings, audio context hashes, installed plugins, screen dimensions, and timezone. They build a device ID from this and track it across sessions. Default headless browsers fail this instantly.
Session behavior analysis. They watch navigation patterns. A real user browses, scrolls, clicks through categories. A bot hits 200 title pages in sequence. Amazon's ML models catch this pattern quickly and start serving CAPTCHAs or empty pages.
IP reputation. Amazon maintains extensive IP reputation databases. Datacenter IPs are blocked on sight for most requests. Even some residential IPs with abuse history get flagged. You need clean residential IPs that look like normal home connections.
TLS fingerprinting. Like most major sites in 2026, Amazon inspects the TLS ClientHello. Python's default TLS stack has a recognizable fingerprint. You either need to use a real browser or a library like curl_cffi that impersonates a real browser's TLS.
For anything beyond basic testing, residential proxies are essential. I've been running ThorData's rotating residential proxies for Amazon scraping — they have good coverage across countries, which matters when you're checking availability by region. Datacenter proxies just don't work here.
Browser-Based Approach
The most reliable method is Playwright with stealth patches. You drive a real Chromium instance, so TLS fingerprinting and most JavaScript checks pass automatically.
import asyncio
import json
import re
from dataclasses import dataclass, field

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

PROXY_CONFIG = {
    "server": "http://proxy.thordata.com:9000",
    "username": "YOUR_USER",
    "password": "YOUR_PASS",
}


@dataclass
class PrimeTitle:
    """A single title from Prime Video."""
    asin: str = ""
    title: str = ""
    content_type: str = ""  # movie, series
    release_year: int = 0
    genres: list[str] = field(default_factory=list)
    rating: str = ""
    imdb_score: float = 0.0
    synopsis: str = ""
    seasons: int = 0
    runtime_minutes: int = 0
    cast: list[str] = field(default_factory=list)
    directors: list[str] = field(default_factory=list)
    included_with_prime: bool = False
    rent_price: float | None = None
    buy_price: float | None = None


async def create_browser_context(proxy_config: dict | None = None):
    """Create a stealth browser context that passes Amazon's checks."""
    pw = await async_playwright().start()
    launch_args = {
        "headless": True,
        "args": [
            "--disable-blink-features=AutomationControlled",
            "--disable-features=IsolateOrigins,site-per-process",
        ],
    }
    if proxy_config:
        launch_args["proxy"] = proxy_config
    browser = await pw.chromium.launch(**launch_args)
    context = await browser.new_context(
        viewport={"width": 1440, "height": 900},
        locale="en-US",
        timezone_id="America/New_York",
        user_agent=(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
    )
    return pw, browser, context


async def scrape_prime_title(url: str, context=None) -> dict:
    """Scrape a single Prime Video title page."""
    own_context = context is None
    if own_context:
        pw, browser, context = await create_browser_context(PROXY_CONFIG)
    page = await context.new_page()
    await stealth_async(page)
    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await page.wait_for_timeout(2000)  # let JS hydrate
        data = await extract_structured_data(page)
    finally:
        # Always clean up, even if navigation or extraction throws
        await page.close()
        if own_context:
            await browser.close()
            await pw.stop()
    return data
Extracting Structured Data
Amazon embeds title metadata in multiple places on the page. The extraction function tries three sources in order of reliability:
async def extract_structured_data(page) -> dict:
    """Pull structured data from JSON-LD, page state, and DOM."""
    result = {}

    # Source 1: JSON-LD (most reliable, cleanest structure)
    ld_scripts = await page.query_selector_all(
        'script[type="application/ld+json"]'
    )
    for script in ld_scripts:
        text = await script.inner_text()
        try:
            ld = json.loads(text)
            if ld.get("@type") in ("Movie", "TVSeries", "TVSeason"):
                result["title"] = ld.get("name", "")
                result["content_type"] = ld["@type"]
                result["synopsis"] = ld.get("description", "")
                result["genres"] = ld.get("genre", [])
                if isinstance(result["genres"], str):
                    result["genres"] = [result["genres"]]
                result["rating"] = ld.get("contentRating", "")
                year = str(ld.get("dateCreated", ""))[:4]
                result["release_year"] = int(year) if year.isdigit() else 0
                # Cast from JSON-LD
                actors = ld.get("actor", [])
                result["cast"] = [
                    a["name"] if isinstance(a, dict) else a
                    for a in actors
                ]
                directors = ld.get("director", [])
                result["directors"] = [
                    d["name"] if isinstance(d, dict) else d
                    for d in directors
                ]
                # Rating from aggregateRating
                agg = ld.get("aggregateRating", {})
                if agg:
                    result["imdb_score"] = float(agg.get("ratingValue", 0))
                    result["review_count"] = int(agg.get("ratingCount", 0))
        except (json.JSONDecodeError, KeyError, ValueError):
            continue

    # Source 2: Window state object (has pricing and availability)
    state_data = await page.evaluate("""
        () => {
            const scripts = document.querySelectorAll('script');
            for (const s of scripts) {
                const text = s.textContent;
                if (text.includes('window.__INITIAL_STATE__')) {
                    const match = text.match(
                        /window\\.__INITIAL_STATE__\\s*=\\s*({.+?});/s
                    );
                    if (match) return JSON.parse(match[1]);
                }
            }
            return null;
        }
    """)
    if state_data:
        result["_state"] = state_data
        # Extract pricing from state if available
        detail = state_data.get("detail", state_data.get("self", {}))
        if isinstance(detail, dict):
            offers = detail.get("offers", [])
            for offer in offers if isinstance(offers, list) else []:
                otype = offer.get("offerType", "")
                price = offer.get("price", {}).get("amount")
                if "RENT" in otype.upper() and price:
                    result["rent_price"] = float(price)
                elif "BUY" in otype.upper() and price:
                    result["buy_price"] = float(price)

    # Source 3: DOM extraction (fallback for anything still missing)
    if "title" not in result:
        title_el = await page.query_selector(
            '[data-automation-id="title"]'
        )
        if title_el:
            result["title"] = await title_el.inner_text()
    if "synopsis" not in result:
        synopsis_el = await page.query_selector(
            '[data-automation-id="synopsis"]'
        )
        if synopsis_el:
            result["synopsis"] = await synopsis_el.inner_text()

    # Genre tags from DOM
    if not result.get("genres"):
        genre_els = await page.query_selector_all(
            '[data-automation-id="genre-tag"] a'
        )
        result["genres"] = [await g.inner_text() for g in genre_els]

    # Metadata line (year, runtime, rating)
    meta_els = await page.query_selector_all(
        '[data-automation-id="meta-info"] span'
    )
    result["meta_raw"] = [await m.inner_text() for m in meta_els]

    # Check Prime inclusion badge
    prime_badge = await page.query_selector(
        '[data-automation-id="prime-badge"]'
    )
    result["included_with_prime"] = prime_badge is not None

    # Extract ASIN from URL
    asin_match = re.search(r"/dp/([A-Z0-9]{10})", page.url)
    if asin_match:
        result["asin"] = asin_match.group(1)

    return result
Catalog Discovery via Browse Pages
You can't just guess URLs — you need a way to discover what's in the catalog. Prime Video's browse pages load content via AJAX calls as you scroll. Intercept those network requests to collect ASINs.
async def discover_catalog(
    category_url: str,
    max_scrolls: int = 15,
    proxy_config: dict | None = None,
) -> list[str]:
    """Discover title ASINs by intercepting browse page API calls."""
    collected_asins = set()
    api_responses = []

    async def capture_api(response):
        """Intercept API responses that contain title data."""
        url = response.url
        if any(k in url for k in ("api/discover", "api/search", "getDataByPath")):
            try:
                body = await response.json()
                api_responses.append(body)
            except Exception:
                pass

    pw, browser, context = await create_browser_context(proxy_config)
    page = await context.new_page()
    await stealth_async(page)
    page.on("response", capture_api)
    await page.goto(category_url, wait_until="networkidle")
    await page.wait_for_timeout(3000)

    # Scroll down to trigger lazy loading of more titles
    for _ in range(max_scrolls):
        await page.evaluate("window.scrollBy(0, window.innerHeight)")
        await page.wait_for_timeout(1500)

    await browser.close()
    await pw.stop()

    # Parse collected API responses for ASINs
    def extract_asins(obj):
        """Recursively find ASINs in nested API response data."""
        if isinstance(obj, dict):
            for key in ("titleId", "asin", "catalogId"):
                val = obj.get(key, "")
                if isinstance(val, str) and re.match(r"^[A-Z0-9]{10}$", val):
                    collected_asins.add(val)
            for v in obj.values():
                extract_asins(v)
        elif isinstance(obj, list):
            for item in obj:
                extract_asins(item)

    for resp in api_responses:
        extract_asins(resp)
    return list(collected_asins)


# Usage: discover ASINs from Prime Video's browse hubs
async def discover_all_categories() -> list[str]:
    """Crawl multiple browse categories to build a catalog."""
    categories = [
        "https://www.amazon.com/gp/video/storefront/ref=atv_dp_cnc_brws_0",
        "https://www.amazon.com/Amazon-Video/b?node=2858778011",  # Movies
        "https://www.amazon.com/Amazon-Video/b?node=2864549011",  # TV Shows
        "https://www.amazon.com/gp/video/offers/ref=atv_dp_cnc_sby_1",  # Prime
    ]
    all_asins = set()
    for url in categories:
        asins = await discover_catalog(url, proxy_config=PROXY_CONFIG)
        all_asins.update(asins)
        print(f"  {url[:60]}... → {len(asins)} ASINs ({len(all_asins)} total)")
        await asyncio.sleep(5)  # pause between categories
    return list(all_asins)
Country Availability Checking
This is where Prime Video scraping gets really useful. Content libraries differ massively between countries. To check availability across regions, route requests through proxies in each target country.
COUNTRY_DOMAINS = {
    "US": ("amazon.com", "en-US"),
    "UK": ("amazon.co.uk", "en-GB"),
    "DE": ("amazon.de", "de-DE"),
    "JP": ("amazon.co.jp", "ja-JP"),
    "IN": ("amazon.in", "en-IN"),
    "BR": ("amazon.com.br", "pt-BR"),
    "FR": ("amazon.fr", "fr-FR"),
    "ES": ("amazon.es", "es-ES"),
    "IT": ("amazon.it", "it-IT"),
    "AU": ("amazon.com.au", "en-AU"),
}


async def check_availability(
    asin: str,
    countries: list[str] | None = None,
) -> dict:
    """Check title availability across multiple countries.

    Returns a dict mapping country code to availability info:
        {
            "US": {"available": True, "prime": True, "rent": 3.99, "buy": 14.99},
            "DE": {"available": True, "prime": False, "rent": 4.99, "buy": None},
            "JP": {"available": False},
        }
    """
    if countries is None:
        countries = list(COUNTRY_DOMAINS.keys())
    results = {}
    for code in countries:
        domain, _locale = COUNTRY_DOMAINS[code]
        url = f"https://www.{domain}/dp/{asin}"
        # Route through country-specific proxy
        proxy = {
            "server": "http://proxy.thordata.com:9000",
            "username": f"YOUR_USER-country-{code.lower()}",
            "password": "YOUR_PASS",
        }
        pw, browser, context = await create_browser_context(proxy)
        page = await context.new_page()
        await stealth_async(page)
        try:
            resp = await page.goto(
                url, wait_until="domcontentloaded", timeout=20000
            )
            await page.wait_for_timeout(2000)

            # Check HTTP status first
            if resp and resp.status >= 400:
                results[code] = {"available": False, "reason": f"HTTP {resp.status}"}
                continue

            # Check for "not available" indicators
            unavailable = await page.query_selector(
                '[data-automation-id="unavailable"]'
            )
            not_found = await page.query_selector('text="not available"')
            if unavailable or not_found:
                results[code] = {"available": False}
                continue

            # Check Prime inclusion
            prime_badge = await page.query_selector(
                '[data-automation-id="prime-badge"]'
            )

            # Check rent/buy buttons and extract prices
            rent_price = None
            buy_price = None
            rent_btn = await page.query_selector(
                '[data-automation-id="rent-button"]'
            )
            if rent_btn:
                rent_text = await rent_btn.inner_text()
                price_match = re.search(r"[\d.,]+", rent_text)
                if price_match:
                    rent_price = float(
                        price_match.group().replace(",", ".")
                    )
            buy_btn = await page.query_selector(
                '[data-automation-id="buy-button"]'
            )
            if buy_btn:
                buy_text = await buy_btn.inner_text()
                price_match = re.search(r"[\d.,]+", buy_text)
                if price_match:
                    buy_price = float(
                        price_match.group().replace(",", ".")
                    )

            results[code] = {
                "available": True,
                "prime": prime_badge is not None,
                "rent": rent_price,
                "buy": buy_price,
                "url": url,
            }
        except Exception as e:
            results[code] = {"available": None, "error": str(e)}
        finally:
            await browser.close()
            await pw.stop()
            await asyncio.sleep(3)  # pace between countries
    return results
Storage and Change Tracking
For catalog tracking over time, you need both a snapshot of current state and a history of changes. Here's a complete storage layer:
import sqlite3
from datetime import datetime, timezone


def init_prime_db(path: str = "prime_video.db") -> sqlite3.Connection:
    """Initialize database with tables for titles, availability, and changes."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS titles (
            asin TEXT PRIMARY KEY,
            title TEXT,
            content_type TEXT,
            genres TEXT,
            release_year INTEGER,
            imdb_score REAL,
            synopsis TEXT,
            cast_list TEXT,
            directors TEXT,
            first_seen TEXT,
            last_checked TEXT
        );
        CREATE TABLE IF NOT EXISTS availability (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            asin TEXT NOT NULL,
            country TEXT NOT NULL,
            included_with_prime BOOLEAN,
            rent_price REAL,
            buy_price REAL,
            checked_at TEXT NOT NULL,
            FOREIGN KEY (asin) REFERENCES titles(asin)
        );
        CREATE TABLE IF NOT EXISTS changes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            asin TEXT NOT NULL,
            country TEXT NOT NULL,
            change_type TEXT NOT NULL,
            old_value TEXT,
            new_value TEXT,
            detected_at TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_avail_asin
            ON availability(asin);
        CREATE INDEX IF NOT EXISTS idx_avail_country
            ON availability(country);
        CREATE INDEX IF NOT EXISTS idx_avail_checked
            ON availability(checked_at);
        CREATE INDEX IF NOT EXISTS idx_changes_asin
            ON changes(asin);
    """)
    conn.commit()
    return conn
def save_title(conn: sqlite3.Connection, data: dict):
    """Insert or update a title record."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute("""
        INSERT INTO titles (asin, title, content_type, genres, release_year,
                            imdb_score, synopsis, cast_list, directors,
                            first_seen, last_checked)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(asin) DO UPDATE SET
            title = excluded.title,
            imdb_score = excluded.imdb_score,
            last_checked = excluded.last_checked
    """, (
        data.get("asin"),
        data.get("title"),
        data.get("content_type"),
        json.dumps(data.get("genres", [])),
        int(data.get("release_year") or 0),
        data.get("imdb_score", 0.0),
        data.get("synopsis"),
        json.dumps(data.get("cast", [])),
        json.dumps(data.get("directors", [])),
        now, now,
    ))
    conn.commit()


def save_availability(
    conn: sqlite3.Connection, asin: str, country_data: dict
):
    """Save availability results and detect changes from previous check."""
    now = datetime.now(timezone.utc).isoformat()
    for country, info in country_data.items():
        if info.get("available") is None:
            continue  # skip errors

        # Get the previous availability for this title + country
        prev = conn.execute("""
            SELECT included_with_prime, rent_price, buy_price
            FROM availability
            WHERE asin = ? AND country = ?
            ORDER BY checked_at DESC LIMIT 1
        """, (asin, country)).fetchone()

        prime = info.get("prime", False)
        rent = info.get("rent")
        buy = info.get("buy")

        # Record the current availability
        conn.execute("""
            INSERT INTO availability
                (asin, country, included_with_prime, rent_price,
                 buy_price, checked_at)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (asin, country, prime, rent, buy, now))

        # Detect and log changes
        if prev is not None:
            prev_prime, prev_rent, prev_buy = prev
            if bool(prev_prime) != prime:
                change = "added_to_prime" if prime else "removed_from_prime"
                conn.execute("""
                    INSERT INTO changes
                        (asin, country, change_type, old_value,
                         new_value, detected_at)
                    VALUES (?, ?, ?, ?, ?, ?)
                """, (asin, country, change,
                      str(prev_prime), str(prime), now))
            if prev_rent != rent and rent is not None:
                conn.execute("""
                    INSERT INTO changes
                        (asin, country, change_type, old_value,
                         new_value, detected_at)
                    VALUES (?, ?, 'rent_price_change', ?, ?, ?)
                """, (asin, country, str(prev_rent), str(rent), now))
            if prev_buy != buy and buy is not None:
                conn.execute("""
                    INSERT INTO changes
                        (asin, country, change_type, old_value,
                         new_value, detected_at)
                    VALUES (?, ?, 'buy_price_change', ?, ?, ?)
                """, (asin, country, str(prev_buy), str(buy), now))
    conn.commit()
Running the Full Pipeline
Here's a complete daily pipeline that discovers titles, checks availability, and reports changes:
async def daily_pipeline(
    db_path: str = "prime_video.db",
    target_countries: list[str] | None = None,
    max_titles: int = 50,
):
    """Run a full catalog scan: discover → scrape → check availability → report."""
    if target_countries is None:
        target_countries = ["US", "UK", "DE"]
    conn = init_prime_db(db_path)

    # Step 1: Discover new ASINs
    print("Discovering catalog...")
    asins = await discover_all_categories()
    print(f"Found {len(asins)} ASINs")

    # Step 2: Scrape title metadata for new/unchecked titles
    existing = {
        row[0] for row in
        conn.execute("SELECT asin FROM titles").fetchall()
    }
    new_asins = [a for a in asins if a not in existing][:max_titles]
    pw, browser, context = await create_browser_context(PROXY_CONFIG)
    print(f"Scraping {len(new_asins)} new titles...")
    for i, asin in enumerate(new_asins):
        url = f"https://www.amazon.com/dp/{asin}"
        try:
            data = await scrape_prime_title(url, context=context)
            data["asin"] = asin
            save_title(conn, data)
            print(f"  [{i+1}/{len(new_asins)}] {data.get('title', asin)}")
        except Exception as e:
            print(f"  [{i+1}/{len(new_asins)}] Error on {asin}: {e}")
        await asyncio.sleep(5)
    await browser.close()
    await pw.stop()

    # Step 3: Check availability across countries
    all_asins = [
        row[0] for row in
        conn.execute("SELECT asin FROM titles ORDER BY last_checked ASC").fetchall()
    ][:max_titles]
    print(f"Checking availability for {len(all_asins)} titles...")
    for i, asin in enumerate(all_asins):
        avail = await check_availability(asin, target_countries)
        save_availability(conn, asin, avail)
        summary = ", ".join(
            f"{c}:{'P' if v.get('prime') else 'R' if v.get('rent') else 'X'}"
            for c, v in avail.items() if v.get("available")
        )
        print(f"  [{i+1}/{len(all_asins)}] {asin}: {summary}")

    # Step 4: Report changes
    recent_changes = conn.execute("""
        SELECT c.asin, t.title, c.country, c.change_type,
               c.old_value, c.new_value
        FROM changes c
        JOIN titles t ON t.asin = c.asin
        WHERE c.detected_at > datetime('now', '-1 day')
        ORDER BY c.detected_at DESC
    """).fetchall()
    if recent_changes:
        print(f"\n--- {len(recent_changes)} changes detected ---")
        for asin, title, country, ctype, old, new in recent_changes:
            print(f"  {title} [{country}]: {ctype} ({old} → {new})")
    else:
        print("\nNo changes detected.")
    conn.close()


if __name__ == "__main__":
    asyncio.run(daily_pipeline())
Querying Your Dataset
Once you have a few weeks of data, the interesting queries write themselves:
def get_availability_matrix(conn: sqlite3.Connection) -> list[dict]:
    """Which titles are available where? Returns a matrix."""
    rows = conn.execute("""
        SELECT t.asin, t.title, a.country, a.included_with_prime,
               a.rent_price, a.buy_price
        FROM titles t
        JOIN availability a ON a.asin = t.asin
        WHERE a.checked_at = (
            SELECT MAX(a2.checked_at) FROM availability a2
            WHERE a2.asin = a.asin AND a2.country = a.country
        )
        ORDER BY t.title, a.country
    """).fetchall()
    return [
        {"asin": r[0], "title": r[1], "country": r[2],
         "prime": bool(r[3]), "rent": r[4], "buy": r[5]}
        for r in rows
    ]


def get_prime_exclusive_by_country(
    conn: sqlite3.Connection, country: str
) -> list[str]:
    """Titles included with Prime in one country but not others."""
    rows = conn.execute("""
        SELECT DISTINCT t.title FROM titles t
        JOIN availability a1 ON a1.asin = t.asin
        WHERE a1.country = ? AND a1.included_with_prime = 1
          AND a1.asin NOT IN (
              SELECT a2.asin FROM availability a2
              WHERE a2.country != ? AND a2.included_with_prime = 1
          )
    """, (country, country)).fetchall()
    return [r[0] for r in rows]


def get_price_drops(
    conn: sqlite3.Connection, days: int = 7
) -> list[dict]:
    """Titles with rent/buy price decreases in the last N days."""
    rows = conn.execute("""
        SELECT c.asin, t.title, c.country,
               c.change_type, c.old_value, c.new_value
        FROM changes c
        JOIN titles t ON t.asin = c.asin
        WHERE c.change_type IN ('rent_price_change', 'buy_price_change')
          AND CAST(c.new_value AS REAL) < CAST(c.old_value AS REAL)
          AND c.detected_at > datetime('now', ? || ' days')
        ORDER BY c.detected_at DESC
    """, (f"-{days}",)).fetchall()
    return [
        {"asin": r[0], "title": r[1], "country": r[2],
         "type": r[3], "old_price": r[4], "new_price": r[5]}
        for r in rows
    ]
Practical Tips
A few things I've learned scraping Prime Video:
Don't scrape logged in. It's tempting to authenticate for more data, but Amazon monitors authenticated sessions much more aggressively. Stick to public catalog pages. You get 90% of the useful metadata without authentication.
Pace yourself aggressively. One title page every 5-10 seconds is safe. If you need to cover a large catalog, spread it over days. Amazon's detection has memory — if you get flagged, that IP stays flagged for weeks.
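A fixed sleep between requests is itself a fingerprint — perfectly regular intervals don't look human. A tiny helper that jitters the delay into the 5-10 second band mentioned above costs nothing (uniform jitter is a crude stand-in for human pacing, but better than a constant):

```python
import random

def humanized_delay(base: float = 5.0, spread: float = 5.0) -> float:
    """Return a randomized per-request delay in the [base, base + spread) band.

    Avoids the perfectly regular intervals that are trivial to detect.
    """
    return base + random.random() * spread
```

Usage in an async loop: `await asyncio.sleep(humanized_delay())` instead of `await asyncio.sleep(5)`.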
ASINs are your primary key. Every Amazon product has an ASIN (Amazon Standard Identification Number). For Prime Video content, the ASIN is the stable identifier. URLs change, page layouts change, ASINs persist across redesigns.
Watch for soft blocks. Amazon doesn't always give you a clear "you're blocked" page. Sometimes they return the page but with missing data, or redirect to a CAPTCHA that looks like a normal page element. Always validate that your scraped data contains what you expected — check that the title field isn't empty before saving.
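That validation step is worth making explicit. Here's one possible heuristic, keyed to the field names produced by the extractor above — tune the required and secondary sets to whatever your pipeline actually depends on:

```python
def looks_like_soft_block(data: dict) -> bool:
    """Heuristic: a 'successful' scrape that yields no usable fields
    is probably a soft block, not a real title page."""
    required = ("title", "asin")
    if any(not data.get(k) for k in required):
        return True
    # A genuine title page yields at least some secondary metadata too.
    secondary = ("synopsis", "genres", "meta_raw")
    return not any(data.get(k) for k in secondary)
```

Call it before `save_title`; when it fires repeatedly, rotate the proxy and back off rather than writing empty rows into the database.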
Country checks are the real value. There's no good public source for tracking which titles move between countries or switch from included-with-Prime to rent-only. If you build this dataset consistently, it has real commercial value. Streaming comparison sites charge subscription fees for exactly this kind of information.
Reuse browser contexts. Opening a new browser for each title is slow and wasteful. Create one context, scrape multiple pages through it, and only restart when you hit a block or change proxy regions. This is both faster and looks more like real browsing behavior.
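One way to structure this is a batching wrapper that restarts the browser every N titles. This sketch takes the context factory and scrape function as parameters (mirroring `create_browser_context` and `scrape_prime_title` above) so the restart logic stands on its own:

```python
import asyncio

async def scrape_in_batches(asins, scrape_fn, make_context, batch_size=25):
    """Scrape many titles, restarting the browser every `batch_size` pages.

    `make_context` returns (pw, browser, context); `scrape_fn(asin, context)`
    does the per-page work. One failure is recorded, not fatal.
    """
    results = {}
    for start in range(0, len(asins), batch_size):
        pw, browser, context = await make_context()
        try:
            for asin in asins[start:start + batch_size]:
                try:
                    results[asin] = await scrape_fn(asin, context)
                except Exception as e:
                    results[asin] = {"error": str(e)}
        finally:
            # Fresh browser per batch: bounded memory, new fingerprint surface
            await browser.close()
            await pw.stop()
    return results
```

A batch size of 20-30 pages per context has worked for me as a balance between speed and looking like a plausible browsing session.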