Scraping Amazon Prime Video: Catalog Data and Streaming Availability (2026)
Amazon Prime Video is one of the harder streaming platforms to scrape. Unlike Netflix or Disney+, Amazon doesn't have a widely documented internal API that people have reverse-engineered. Their catalog sits behind multiple layers of authentication, dynamic rendering, and aggressive bot detection. But the data is valuable — content availability varies wildly by country, pricing for rent/buy titles changes constantly, and there's no public dataset tracking any of it.
Here's what actually works in 2026 for extracting Prime Video catalog data.
What You Can Extract
Prime Video pages contain surprisingly rich structured data if you know where to look:
- Title metadata — name, release year, genre tags, runtime, MPAA rating, IMDb score
- Availability status — included with Prime, rent price, buy price, not available
- Cast and crew — actors, directors, writers with their Amazon person IDs
- Episode data — for series: season count, episode titles, individual runtimes
- Country availability — which titles are available in which regions
- Customer reviews — star ratings, review text, helpful votes
- Related titles — "customers also watched" recommendations
The structured data lives in two places: JSON-LD in the page source and a GraphQL-like internal API that the React frontend calls.
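To make the JSON-LD route concrete, here's a minimal sketch of pulling the interesting fields out of a schema.org Movie blob. The sample payload is illustrative only — field names follow schema.org's Movie type, and the exact shape Amazon emits may differ from this:

```python
import json

# Trimmed, illustrative JSON-LD payload (schema.org Movie type).
SAMPLE_JSONLD = """
{
  "@context": "https://schema.org",
  "@type": "Movie",
  "name": "Example Movie",
  "dateCreated": "2021-03-12",
  "genre": ["Drama", "Thriller"],
  "contentRating": "R",
  "aggregateRating": {"@type": "AggregateRating",
                      "ratingValue": "7.4", "ratingCount": "1283"}
}
"""

def parse_jsonld(raw: str) -> dict:
    """Extract the fields we care about from a Movie/TVSeries JSON-LD blob."""
    ld = json.loads(raw)
    genres = ld.get("genre", [])
    if isinstance(genres, str):  # schema.org allows a bare string here
        genres = [genres]
    agg = ld.get("aggregateRating") or {}
    year = str(ld.get("dateCreated", ""))[:4]
    return {
        "title": ld.get("name", ""),
        "content_type": ld.get("@type", ""),
        "release_year": int(year) if year.isdigit() else 0,
        "genres": genres,
        "rating": ld.get("contentRating", ""),
        "imdb_score": float(agg.get("ratingValue", 0)),
    }
```

The same function works unchanged on whatever `script[type="application/ld+json"]` tag you pull out of a live page, as long as the `@type` is one of the video types.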
Anti-Bot Defenses
Amazon's bot detection is serious. They run their own detection stack across all Amazon properties, and Prime Video inherits all of it.
Device fingerprinting. Amazon's JavaScript collects canvas fingerprints, WebGL renderer strings, audio context hashes, installed plugins, screen dimensions, and timezone. They build a device ID from this and track it across sessions. Default headless browsers fail this instantly.
Session behavior analysis. They watch navigation patterns. A real user browses, scrolls, clicks through categories. A bot hits 200 title pages in sequence. Amazon's ML models catch this pattern quickly and start serving CAPTCHAs or empty pages.
IP reputation. Amazon maintains extensive IP reputation databases. Datacenter IPs are blocked on sight for most requests. Even some residential IPs with abuse history get flagged. You need clean residential IPs that look like normal home connections.
TLS fingerprinting. Like most major sites in 2026, Amazon inspects the TLS ClientHello. Python's default TLS stack has a recognizable fingerprint. You either need to use a real browser or a library like curl_cffi that impersonates a real browser's TLS.
For anything beyond basic testing, residential proxies are essential. I've been running ThorData's rotating residential proxies for Amazon scraping — they have good coverage across countries, which matters when you're checking availability by region. Datacenter proxies just don't work here.
Browser-Based Approach
The most reliable method is Playwright with stealth patches. You drive a real Chromium instance, so TLS fingerprinting and most JavaScript checks pass automatically.
import asyncio
import json
import re
from dataclasses import dataclass, field

from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

PROXY_CONFIG = {
    "server": "http://proxy.thordata.com:9000",
    "username": "YOUR_USER",
    "password": "YOUR_PASS",
}


@dataclass
class PrimeTitle:
    """A single title from Prime Video."""
    asin: str = ""
    title: str = ""
    content_type: str = ""  # movie, series
    release_year: int = 0
    genres: list[str] = field(default_factory=list)
    rating: str = ""
    imdb_score: float = 0.0
    synopsis: str = ""
    seasons: int = 0
    runtime_minutes: int = 0
    cast: list[str] = field(default_factory=list)
    directors: list[str] = field(default_factory=list)
    included_with_prime: bool = False
    rent_price: float | None = None
    buy_price: float | None = None


async def create_browser_context(proxy_config: dict | None = None):
    """Create a stealth browser context that passes Amazon's checks."""
    pw = await async_playwright().start()
    launch_args = {
        "headless": True,
        "args": [
            "--disable-blink-features=AutomationControlled",
            "--disable-features=IsolateOrigins,site-per-process",
        ],
    }
    if proxy_config:
        launch_args["proxy"] = proxy_config
    browser = await pw.chromium.launch(**launch_args)
    context = await browser.new_context(
        viewport={"width": 1440, "height": 900},
        locale="en-US",
        timezone_id="America/New_York",
        user_agent=(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
    )
    return pw, browser, context


async def scrape_prime_title(url: str, context=None) -> dict:
    """Scrape a single Prime Video title page."""
    own_context = context is None
    if own_context:
        pw, browser, context = await create_browser_context(PROXY_CONFIG)
    page = await context.new_page()
    await stealth_async(page)
    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await page.wait_for_timeout(2000)  # let JS hydrate
        data = await extract_structured_data(page)
    finally:
        # Always clean up, even if navigation or extraction throws
        await page.close()
        if own_context:
            await browser.close()
            await pw.stop()
    return data
Extracting Structured Data
Amazon embeds title metadata in multiple places on the page. The extraction function tries three sources in order of reliability:
async def extract_structured_data(page) -> dict:
    """Pull structured data from JSON-LD, page state, and DOM."""
    result = {}

    # Source 1: JSON-LD (most reliable, cleanest structure)
    ld_scripts = await page.query_selector_all(
        'script[type="application/ld+json"]'
    )
    for script in ld_scripts:
        text = await script.inner_text()
        try:
            ld = json.loads(text)
            if ld.get("@type") in ("Movie", "TVSeries", "TVSeason"):
                result["title"] = ld.get("name", "")
                result["content_type"] = ld["@type"]
                result["synopsis"] = ld.get("description", "")
                result["genres"] = ld.get("genre", [])
                if isinstance(result["genres"], str):
                    result["genres"] = [result["genres"]]
                result["rating"] = ld.get("contentRating", "")
                year = str(ld.get("dateCreated", ""))[:4]
                result["release_year"] = int(year) if year.isdigit() else 0
                # Cast from JSON-LD
                actors = ld.get("actor", [])
                result["cast"] = [
                    a["name"] if isinstance(a, dict) else a
                    for a in actors
                ]
                directors = ld.get("director", [])
                result["directors"] = [
                    d["name"] if isinstance(d, dict) else d
                    for d in directors
                ]
                # Rating from aggregateRating
                agg = ld.get("aggregateRating", {})
                if agg:
                    result["imdb_score"] = float(agg.get("ratingValue", 0))
                    result["review_count"] = int(agg.get("ratingCount", 0))
        except (json.JSONDecodeError, KeyError, ValueError):
            continue

    # Source 2: Window state object (has pricing and availability)
    state_data = await page.evaluate("""
        () => {
            const scripts = document.querySelectorAll('script');
            for (const s of scripts) {
                const text = s.textContent;
                if (text.includes('window.__INITIAL_STATE__')) {
                    const match = text.match(
                        /window\\.__INITIAL_STATE__\\s*=\\s*({.+?});/s
                    );
                    if (match) return JSON.parse(match[1]);
                }
            }
            return null;
        }
    """)
    if state_data:
        result["_state"] = state_data
        # Extract pricing from state if available
        detail = state_data.get("detail", state_data.get("self", {}))
        if isinstance(detail, dict):
            offers = detail.get("offers", [])
            for offer in offers if isinstance(offers, list) else []:
                otype = offer.get("offerType", "")
                price = offer.get("price", {}).get("amount")
                if "RENT" in otype.upper() and price:
                    result["rent_price"] = float(price)
                elif "BUY" in otype.upper() and price:
                    result["buy_price"] = float(price)

    # Source 3: DOM extraction (fallback for anything still missing)
    if "title" not in result:
        title_el = await page.query_selector(
            '[data-automation-id="title"]'
        )
        if title_el:
            result["title"] = await title_el.inner_text()
    if "synopsis" not in result:
        synopsis_el = await page.query_selector(
            '[data-automation-id="synopsis"]'
        )
        if synopsis_el:
            result["synopsis"] = await synopsis_el.inner_text()

    # Genre tags from DOM
    if not result.get("genres"):
        genre_els = await page.query_selector_all(
            '[data-automation-id="genre-tag"] a'
        )
        result["genres"] = [await g.inner_text() for g in genre_els]

    # Metadata line (year, runtime, rating)
    meta_els = await page.query_selector_all(
        '[data-automation-id="meta-info"] span'
    )
    result["meta_raw"] = [await m.inner_text() for m in meta_els]

    # Check Prime inclusion badge
    prime_badge = await page.query_selector(
        '[data-automation-id="prime-badge"]'
    )
    result["included_with_prime"] = prime_badge is not None

    # Extract ASIN from URL
    asin_match = re.search(r"/dp/([A-Z0-9]{10})", page.url)
    if asin_match:
        result["asin"] = asin_match.group(1)

    return result
Catalog Discovery via Browse Pages
You can't just guess URLs — you need a way to discover what's in the catalog. Prime Video's browse pages load content via AJAX calls as you scroll. Intercept those network requests to collect ASINs.
async def discover_catalog(
    category_url: str,
    max_scrolls: int = 15,
    proxy_config: dict | None = None,
) -> list[str]:
    """Discover title ASINs by intercepting browse page API calls."""
    collected_asins = set()
    api_responses = []

    async def capture_api(response):
        """Intercept API responses that contain title data."""
        url = response.url
        if any(k in url for k in ("api/discover", "api/search", "getDataByPath")):
            try:
                body = await response.json()
                api_responses.append(body)
            except Exception:
                pass

    pw, browser, context = await create_browser_context(proxy_config)
    page = await context.new_page()
    await stealth_async(page)
    page.on("response", capture_api)
    await page.goto(category_url, wait_until="networkidle")
    await page.wait_for_timeout(3000)

    # Scroll down to trigger lazy loading of more titles
    for _ in range(max_scrolls):
        await page.evaluate("window.scrollBy(0, window.innerHeight)")
        await page.wait_for_timeout(1500)

    await browser.close()
    await pw.stop()

    # Parse collected API responses for ASINs
    def extract_asins(obj):
        """Recursively find ASINs in nested API response data."""
        if isinstance(obj, dict):
            for key in ("titleId", "asin", "catalogId"):
                val = obj.get(key, "")
                if isinstance(val, str) and re.match(r"^[A-Z0-9]{10}$", val):
                    collected_asins.add(val)
            for v in obj.values():
                extract_asins(v)
        elif isinstance(obj, list):
            for item in obj:
                extract_asins(item)

    for resp in api_responses:
        extract_asins(resp)
    return list(collected_asins)


# Usage: discover ASINs from Prime Video's browse hubs
async def discover_all_categories() -> list[str]:
    """Crawl multiple browse categories to build a catalog."""
    categories = [
        "https://www.amazon.com/gp/video/storefront/ref=atv_dp_cnc_brws_0",
        "https://www.amazon.com/Amazon-Video/b?node=2858778011",  # Movies
        "https://www.amazon.com/Amazon-Video/b?node=2864549011",  # TV Shows
        "https://www.amazon.com/gp/video/offers/ref=atv_dp_cnc_sby_1",  # Prime
    ]
    all_asins = set()
    for url in categories:
        asins = await discover_catalog(url, proxy_config=PROXY_CONFIG)
        all_asins.update(asins)
        print(f"  {url[:60]}... → {len(asins)} ASINs ({len(all_asins)} total)")
        await asyncio.sleep(5)  # pause between categories
    return list(all_asins)
Country Availability Checking
This is where Prime Video scraping gets really useful. Content libraries differ massively between countries. To check availability across regions, route requests through proxies in each target country.
COUNTRY_DOMAINS = {
    "US": ("amazon.com", "en-US"),
    "UK": ("amazon.co.uk", "en-GB"),
    "DE": ("amazon.de", "de-DE"),
    "JP": ("amazon.co.jp", "ja-JP"),
    "IN": ("amazon.in", "en-IN"),
    "BR": ("amazon.com.br", "pt-BR"),
    "FR": ("amazon.fr", "fr-FR"),
    "ES": ("amazon.es", "es-ES"),
    "IT": ("amazon.it", "it-IT"),
    "AU": ("amazon.com.au", "en-AU"),
}


async def check_availability(
    asin: str,
    countries: list[str] | None = None,
) -> dict:
    """Check title availability across multiple countries.

    Returns a dict mapping country code to availability info:
        {
            "US": {"available": True, "prime": True, "rent": 3.99, "buy": 14.99},
            "DE": {"available": True, "prime": False, "rent": 4.99, "buy": None},
            "JP": {"available": False},
        }
    """
    if countries is None:
        countries = list(COUNTRY_DOMAINS.keys())
    results = {}
    for code in countries:
        domain, _locale = COUNTRY_DOMAINS[code]
        url = f"https://www.{domain}/dp/{asin}"
        # Route through country-specific proxy
        proxy = {
            "server": "http://proxy.thordata.com:9000",
            "username": f"YOUR_USER-country-{code.lower()}",
            "password": "YOUR_PASS",
        }
        pw, browser, context = await create_browser_context(proxy)
        page = await context.new_page()
        await stealth_async(page)
        try:
            resp = await page.goto(
                url, wait_until="domcontentloaded", timeout=20000
            )
            await page.wait_for_timeout(2000)

            # Check HTTP status first
            if resp and resp.status >= 400:
                results[code] = {"available": False, "reason": f"HTTP {resp.status}"}
                continue

            # Check for "not available" indicators
            unavailable = await page.query_selector(
                '[data-automation-id="unavailable"]'
            )
            not_found = await page.query_selector('text="not available"')
            if unavailable or not_found:
                results[code] = {"available": False}
                continue

            # Check Prime inclusion
            prime_badge = await page.query_selector(
                '[data-automation-id="prime-badge"]'
            )

            # Check rent/buy buttons and extract prices
            rent_price = None
            buy_price = None
            rent_btn = await page.query_selector(
                '[data-automation-id="rent-button"]'
            )
            if rent_btn:
                rent_text = await rent_btn.inner_text()
                price_match = re.search(r"[\d.,]+", rent_text)
                if price_match:
                    rent_price = float(
                        price_match.group().replace(",", ".")
                    )
            buy_btn = await page.query_selector(
                '[data-automation-id="buy-button"]'
            )
            if buy_btn:
                buy_text = await buy_btn.inner_text()
                price_match = re.search(r"[\d.,]+", buy_text)
                if price_match:
                    buy_price = float(
                        price_match.group().replace(",", ".")
                    )

            results[code] = {
                "available": True,
                "prime": prime_badge is not None,
                "rent": rent_price,
                "buy": buy_price,
                "url": url,
            }
        except Exception as e:
            results[code] = {"available": None, "error": str(e)}
        finally:
            await browser.close()
            await pw.stop()
            await asyncio.sleep(3)  # pace between countries
    return results
Storage and Change Tracking
For catalog tracking over time, you need both a snapshot of current state and a history of changes. Here's a complete storage layer:
import sqlite3
from datetime import datetime, timezone


def init_prime_db(path: str = "prime_video.db") -> sqlite3.Connection:
    """Initialize database with tables for titles, availability, and changes."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS titles (
            asin TEXT PRIMARY KEY,
            title TEXT,
            content_type TEXT,
            genres TEXT,
            release_year INTEGER,
            imdb_score REAL,
            synopsis TEXT,
            cast_list TEXT,
            directors TEXT,
            first_seen TEXT,
            last_checked TEXT
        );
        CREATE TABLE IF NOT EXISTS availability (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            asin TEXT NOT NULL,
            country TEXT NOT NULL,
            included_with_prime BOOLEAN,
            rent_price REAL,
            buy_price REAL,
            checked_at TEXT NOT NULL,
            FOREIGN KEY (asin) REFERENCES titles(asin)
        );
        CREATE TABLE IF NOT EXISTS changes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            asin TEXT NOT NULL,
            country TEXT NOT NULL,
            change_type TEXT NOT NULL,
            old_value TEXT,
            new_value TEXT,
            detected_at TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_avail_asin
            ON availability(asin);
        CREATE INDEX IF NOT EXISTS idx_avail_country
            ON availability(country);
        CREATE INDEX IF NOT EXISTS idx_avail_checked
            ON availability(checked_at);
        CREATE INDEX IF NOT EXISTS idx_changes_asin
            ON changes(asin);
    """)
    conn.commit()
    return conn
def save_title(conn: sqlite3.Connection, data: dict):
    """Insert or update a title record."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute("""
        INSERT INTO titles (asin, title, content_type, genres, release_year,
                            imdb_score, synopsis, cast_list, directors,
                            first_seen, last_checked)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(asin) DO UPDATE SET
            title = excluded.title,
            imdb_score = excluded.imdb_score,
            last_checked = excluded.last_checked
    """, (
        data.get("asin"),
        data.get("title"),
        data.get("content_type"),
        json.dumps(data.get("genres", [])),
        int(data.get("release_year") or 0),
        data.get("imdb_score", 0.0),
        data.get("synopsis"),
        json.dumps(data.get("cast", [])),
        json.dumps(data.get("directors", [])),
        now, now,
    ))
    conn.commit()


def save_availability(
    conn: sqlite3.Connection, asin: str, country_data: dict
):
    """Save availability results and detect changes from previous check."""
    now = datetime.now(timezone.utc).isoformat()
    for country, info in country_data.items():
        if info.get("available") is None:
            continue  # skip errors

        # Get the previous availability for this title + country
        prev = conn.execute("""
            SELECT included_with_prime, rent_price, buy_price
            FROM availability
            WHERE asin = ? AND country = ?
            ORDER BY checked_at DESC LIMIT 1
        """, (asin, country)).fetchone()

        prime = info.get("prime", False)
        rent = info.get("rent")
        buy = info.get("buy")

        # Record the current availability
        conn.execute("""
            INSERT INTO availability
                (asin, country, included_with_prime, rent_price,
                 buy_price, checked_at)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (asin, country, prime, rent, buy, now))

        # Detect and log changes
        if prev is not None:
            prev_prime, prev_rent, prev_buy = prev
            if bool(prev_prime) != prime:
                change = "added_to_prime" if prime else "removed_from_prime"
                conn.execute("""
                    INSERT INTO changes
                        (asin, country, change_type, old_value,
                         new_value, detected_at)
                    VALUES (?, ?, ?, ?, ?, ?)
                """, (asin, country, change,
                      str(prev_prime), str(prime), now))
            if prev_rent != rent and rent is not None:
                conn.execute("""
                    INSERT INTO changes
                        (asin, country, change_type, old_value,
                         new_value, detected_at)
                    VALUES (?, ?, 'rent_price_change', ?, ?, ?)
                """, (asin, country, str(prev_rent), str(rent), now))
            if prev_buy != buy and buy is not None:
                conn.execute("""
                    INSERT INTO changes
                        (asin, country, change_type, old_value,
                         new_value, detected_at)
                    VALUES (?, ?, 'buy_price_change', ?, ?, ?)
                """, (asin, country, str(prev_buy), str(buy), now))
    conn.commit()
Running the Full Pipeline
Here's a complete daily pipeline that discovers titles, checks availability, and reports changes:
async def daily_pipeline(
    db_path: str = "prime_video.db",
    target_countries: list[str] | None = None,
    max_titles: int = 50,
):
    """Run a full catalog scan: discover → scrape → check availability → report."""
    if target_countries is None:
        target_countries = ["US", "UK", "DE"]
    conn = init_prime_db(db_path)

    # Step 1: Discover new ASINs
    print("Discovering catalog...")
    asins = await discover_all_categories()
    print(f"Found {len(asins)} ASINs")

    # Step 2: Scrape title metadata for new/unchecked titles
    existing = {
        row[0] for row in
        conn.execute("SELECT asin FROM titles").fetchall()
    }
    new_asins = [a for a in asins if a not in existing][:max_titles]
    pw, browser, context = await create_browser_context(PROXY_CONFIG)
    print(f"Scraping {len(new_asins)} new titles...")
    for i, asin in enumerate(new_asins):
        url = f"https://www.amazon.com/dp/{asin}"
        try:
            data = await scrape_prime_title(url, context=context)
            data["asin"] = asin
            save_title(conn, data)
            print(f"  [{i+1}/{len(new_asins)}] {data.get('title', asin)}")
        except Exception as e:
            print(f"  [{i+1}/{len(new_asins)}] Error on {asin}: {e}")
        await asyncio.sleep(5)
    await browser.close()
    await pw.stop()

    # Step 3: Check availability across countries
    all_asins = [
        row[0] for row in
        conn.execute("SELECT asin FROM titles ORDER BY last_checked ASC").fetchall()
    ][:max_titles]
    print(f"Checking availability for {len(all_asins)} titles...")
    for i, asin in enumerate(all_asins):
        avail = await check_availability(asin, target_countries)
        save_availability(conn, asin, avail)
        summary = ", ".join(
            f"{c}:{'P' if v.get('prime') else 'R' if v.get('rent') else 'X'}"
            for c, v in avail.items() if v.get("available")
        )
        print(f"  [{i+1}/{len(all_asins)}] {asin}: {summary}")

    # Step 4: Report changes
    recent_changes = conn.execute("""
        SELECT c.asin, t.title, c.country, c.change_type,
               c.old_value, c.new_value
        FROM changes c
        JOIN titles t ON t.asin = c.asin
        WHERE c.detected_at > datetime('now', '-1 day')
        ORDER BY c.detected_at DESC
    """).fetchall()
    if recent_changes:
        print(f"\n--- {len(recent_changes)} changes detected ---")
        for asin, title, country, ctype, old, new in recent_changes:
            print(f"  {title} [{country}]: {ctype} ({old} → {new})")
    else:
        print("\nNo changes detected.")
    conn.close()


if __name__ == "__main__":
    asyncio.run(daily_pipeline())
Querying Your Dataset
Once you have a few weeks of data, the interesting queries write themselves:
def get_availability_matrix(conn: sqlite3.Connection) -> list[dict]:
    """Which titles are available where? Returns a matrix."""
    rows = conn.execute("""
        SELECT t.asin, t.title, a.country, a.included_with_prime,
               a.rent_price, a.buy_price
        FROM titles t
        JOIN availability a ON a.asin = t.asin
        WHERE a.checked_at = (
            SELECT MAX(a2.checked_at) FROM availability a2
            WHERE a2.asin = a.asin AND a2.country = a.country
        )
        ORDER BY t.title, a.country
    """).fetchall()
    return [
        {"asin": r[0], "title": r[1], "country": r[2],
         "prime": bool(r[3]), "rent": r[4], "buy": r[5]}
        for r in rows
    ]


def get_prime_exclusive_by_country(
    conn: sqlite3.Connection, country: str
) -> list[str]:
    """Titles included with Prime in one country but not others."""
    rows = conn.execute("""
        SELECT DISTINCT t.title FROM titles t
        JOIN availability a1 ON a1.asin = t.asin
        WHERE a1.country = ? AND a1.included_with_prime = 1
          AND a1.asin NOT IN (
              SELECT a2.asin FROM availability a2
              WHERE a2.country != ? AND a2.included_with_prime = 1
          )
    """, (country, country)).fetchall()
    return [r[0] for r in rows]


def get_price_drops(
    conn: sqlite3.Connection, days: int = 7
) -> list[dict]:
    """Titles with rent/buy price decreases in the last N days."""
    rows = conn.execute("""
        SELECT c.asin, t.title, c.country,
               c.change_type, c.old_value, c.new_value
        FROM changes c
        JOIN titles t ON t.asin = c.asin
        WHERE c.change_type IN ('rent_price_change', 'buy_price_change')
          AND CAST(c.new_value AS REAL) < CAST(c.old_value AS REAL)
          AND c.detected_at > datetime('now', ? || ' days')
        ORDER BY c.detected_at DESC
    """, (f"-{days}",)).fetchall()
    return [
        {"asin": r[0], "title": r[1], "country": r[2],
         "type": r[3], "old_price": r[4], "new_price": r[5]}
        for r in rows
    ]
Practical Tips
A few things I've learned scraping Prime Video:
Don't scrape logged in. It's tempting to authenticate for more data, but Amazon monitors authenticated sessions much more aggressively. Stick to public catalog pages. You get 90% of the useful metadata without authentication.
Pace yourself aggressively. One title page every 5-10 seconds is safe. If you need to cover a large catalog, spread it over days. Amazon's detection has memory — if you get flagged, that IP stays flagged for weeks.
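A fixed sleep between requests is itself a fingerprint — perfectly regular intervals don't look human. A tiny helper that jitters the delay into the 5-10 second band mentioned above costs nothing (uniform jitter is a crude stand-in for human pacing, but better than a constant):

```python
import random

def humanized_delay(base: float = 5.0, spread: float = 5.0) -> float:
    """Return a randomized per-request delay in the [base, base + spread) band.

    Avoids the perfectly regular intervals that are trivial to detect.
    """
    return base + random.random() * spread
```

Usage in an async loop: `await asyncio.sleep(humanized_delay())` instead of `await asyncio.sleep(5)`.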
ASINs are your primary key. Every Amazon product has an ASIN (Amazon Standard Identification Number). For Prime Video content, the ASIN is the stable identifier. URLs change, page layouts change, ASINs persist across redesigns.
Watch for soft blocks. Amazon doesn't always give you a clear "you're blocked" page. Sometimes they return the page but with missing data, or redirect to a CAPTCHA that looks like a normal page element. Always validate that your scraped data contains what you expected — check that the title field isn't empty before saving.
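That validation step is worth making explicit. Here's one possible heuristic, keyed to the field names produced by the extractor above — tune the required and secondary sets to whatever your pipeline actually depends on:

```python
def looks_like_soft_block(data: dict) -> bool:
    """Heuristic: a 'successful' scrape that yields no usable fields
    is probably a soft block, not a real title page."""
    required = ("title", "asin")
    if any(not data.get(k) for k in required):
        return True
    # A genuine title page yields at least some secondary metadata too.
    secondary = ("synopsis", "genres", "meta_raw")
    return not any(data.get(k) for k in secondary)
```

Call it before `save_title`; when it fires repeatedly, rotate the proxy and back off rather than writing empty rows into the database.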
Country checks are the real value. There's no good public source for tracking which titles move between countries or switch from included-with-Prime to rent-only. If you build this dataset consistently, it has real commercial value. Streaming comparison sites charge subscription fees for exactly this kind of information.
Reuse browser contexts. Opening a new browser for each title is slow and wasteful. Create one context, scrape multiple pages through it, and only restart when you hit a block or change proxy regions. This is both faster and looks more like real browsing behavior.
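One way to structure this is a batching wrapper that restarts the browser every N titles. This sketch takes the context factory and scrape function as parameters (mirroring `create_browser_context` and `scrape_prime_title` above) so the restart logic stands on its own:

```python
import asyncio

async def scrape_in_batches(asins, scrape_fn, make_context, batch_size=25):
    """Scrape many titles, restarting the browser every `batch_size` pages.

    `make_context` returns (pw, browser, context); `scrape_fn(asin, context)`
    does the per-page work. One failure is recorded, not fatal.
    """
    results = {}
    for start in range(0, len(asins), batch_size):
        pw, browser, context = await make_context()
        try:
            for asin in asins[start:start + batch_size]:
                try:
                    results[asin] = await scrape_fn(asin, context)
                except Exception as e:
                    results[asin] = {"error": str(e)}
        finally:
            # Fresh browser per batch: bounded memory, new fingerprint surface
            await browser.close()
            await pw.stop()
    return results
```

A batch size of 20-30 pages per context has worked for me as a balance between speed and looking like a plausible browsing session.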