
How to Scrape IMDb Movie Data in 2026: Ratings, Cast & Box Office


IMDb is the single best source for structured movie data on the internet. Over 12 million titles, 600 million monthly visitors, and decades of user ratings make it invaluable for building recommendation engines, tracking box office trends, or doing film analysis that goes beyond what the official datasets provide.

The old IMDb datasets (the TSV dumps) are fine for basic lookups, but they don't include box office numbers, user reviews, or detailed cast billing order. For that, you need to scrape the actual pages. This guide covers scraping IMDb with Python — including working code for every major data type, anti-bot strategies, and how to handle IMDb's 2026 defenses.

What Data Can You Pull from IMDb?

Each movie page packs a substantial amount of structured data:

  - Title, original title, release year, and content rating
  - User rating, vote count, and Metascore
  - Genres, runtime, spoken languages, and countries of origin
  - Directors, writers, and the full cast with billing order and character names
  - Box office figures: budget, opening weekend, and domestic and worldwide gross
  - User reviews, plot keywords, and technical specs

IMDb's Anti-Bot Measures in 2026

IMDb runs behind AWS CloudFront with additional behavioral analysis:

  1. Rate limiting per IP: More than ~40 requests per minute triggers a 503 with a CAPTCHA challenge page. CloudFront applies this before any application logic runs.
  2. Cookie gating: The session-id and ubid-main cookies must be present from a valid initial homepage visit. Missing them returns stripped-down HTML without the rich JSON-LD metadata.
  3. JSON-LD throttling: IMDb embeds rich structured data in <script type="application/ld+json"> tags. Heavy scrapers get served pages without it after sustained traffic.
  4. Limited inline reviews: Only the first few reviews appear in the initial HTML. The rest must be fetched through IMDb's AJAX pagination endpoint, which fortunately works without logging in.
  5. Header fingerprinting: Non-browser header combinations (missing sec-fetch-* headers, wrong Accept ordering) get flagged and served degraded content.
  6. JavaScript challenge pages: Heavy traffic from an IP triggers a Cloudflare-style JS challenge that requires a real browser to solve.

The good news: IMDb's protections are moderate compared to Zillow or LinkedIn. With proper headers, session initialization, and reasonable delays, you can scrape thousands of titles per day from a single residential IP.
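
Before parsing any response, it's worth checking whether IMDb actually served a full page or one of the degraded variants above. A minimal heuristic sketch (the marker strings and size threshold are assumptions, not documented behavior):

```python
def looks_blocked(html: str, status_code: int) -> bool:
    """Heuristic: True if the response is likely a challenge page
    or degraded HTML rather than a full title page."""
    if status_code in (403, 429, 503):
        return True
    if not html or len(html) < 5_000:  # full title pages are far larger
        return True
    lowered = html.lower()
    if "captcha" in lowered or "automated access" in lowered:
        return True
    # Degraded responses omit the JSON-LD metadata block
    return "application/ld+json" not in html
```

If this returns True, back off and retry rather than parsing empty results into your database.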

Setting Up Your Scraper

pip install requests beautifulsoup4 fake-useragent lxml

sqlite3 ships with Python's standard library, so it doesn't need a pip install.

Session Initialization

Always initialize a session by hitting the IMDb homepage first — this sets the cookies that IMDb expects on subsequent requests:

import requests
import json
import time
import random
import re
import csv
import sqlite3
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from datetime import datetime

ua = UserAgent()

# ThorData residential proxy - critical for sustained scraping
PROXY = "http://USERNAME:[email protected]:7777"

def create_imdb_session(proxy: str = None) -> requests.Session:
    """
    Create a requests Session pre-loaded with IMDb cookies.
    Must call this before any title page requests.
    """
    s = requests.Session()
    s.headers.update({
        "User-Agent": ua.random,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    })

    if proxy:
        s.proxies = {"http": proxy, "https": proxy}

    # Hit the homepage to initialize session cookies
    try:
        resp = s.get("https://www.imdb.com/", timeout=20)
        # IMDb may set additional cookies via meta refresh or JS — the important ones
        # are set via HTTP Set-Cookie headers which requests handles automatically
        print(f"Session initialized: {len(s.cookies)} cookies set")
    except Exception as e:
        print(f"Warning: Session init failed ({e}), continuing anyway")

    return s

def fetch_imdb_page(url: str, session: requests.Session, retries: int = 3) -> tuple[str, int]:
    """Fetch an IMDb page with retry logic and delay."""
    for attempt in range(retries):
        try:
            time.sleep(random.uniform(1.5, 4.0))  # Polite delay
            resp = session.get(url, timeout=20)

            if resp.status_code == 200:
                return resp.text, 200
            elif resp.status_code == 503:
                print(f"503 on {url} — backing off {30 * (2**attempt)}s")
                time.sleep(30 * (2 ** attempt))
            elif resp.status_code == 429:
                print(f"429 rate limit — backing off {60 * (2**attempt)}s")
                time.sleep(60 * (2 ** attempt))
            else:
                return "", resp.status_code

        except requests.exceptions.ProxyError:
            print(f"Proxy error on attempt {attempt+1}")
            time.sleep(5)
        except requests.exceptions.Timeout:
            print(f"Timeout on attempt {attempt+1}")
            time.sleep(10)
        except Exception as e:
            print(f"Error: {e}")
            time.sleep(5)

    return "", -1

Scraping Movie Details

The primary data extraction strategy uses IMDb's embedded JSON-LD structured data, which is more reliable than CSS selectors that change with redesigns:

def scrape_movie(imdb_id: str, session: requests.Session) -> dict:
    """
    Scrape a single IMDb movie page by title ID.
    imdb_id format: 'tt1375666' (Inception), 'tt0111161' (Shawshank), etc.
    """
    url = f"https://www.imdb.com/title/{imdb_id}/"
    html, status = fetch_imdb_page(url, session)

    if status != 200:
        return {"error": status, "imdb_id": imdb_id}

    soup = BeautifulSoup(html, "lxml")

    movie = {
        "imdb_id": imdb_id,
        "url": url,
        "title": None,
        "original_title": None,
        "year": None,
        "end_year": None,  # for TV series
        "genres": [],
        "rating": None,
        "vote_count": None,
        "metascore": None,
        "description": None,
        "director": [],
        "writers": [],
        "stars": [],
        "runtime": None,
        "content_rating": None,
        "languages": [],
        "countries": [],
        "box_office": {},
        "cast": [],
        "awards_summary": None,
        "keywords": [],
        "similar_titles": [],
    }

    # === JSON-LD: Most reliable data source ===
    ld_script = soup.find("script", {"type": "application/ld+json"})
    if ld_script:
        try:
            ld = json.loads(ld_script.string)

            movie["title"] = ld.get("name")
            movie["description"] = ld.get("description")
            movie["genres"] = ld.get("genre", [])
            if isinstance(movie["genres"], str):
                movie["genres"] = [movie["genres"]]

            # Date handling
            pub_date = ld.get("datePublished", "")
            if pub_date:
                movie["year"] = pub_date[:4]

            # Ratings
            agg = ld.get("aggregateRating", {})
            movie["rating"] = agg.get("ratingValue")
            movie["vote_count"] = agg.get("ratingCount")

            # Content rating
            movie["content_rating"] = ld.get("contentRating")

            # Duration (ISO 8601 duration format: PT2H22M)
            duration = ld.get("duration", "")
            if duration:
                hours = re.search(r"(\d+)H", duration)
                minutes = re.search(r"(\d+)M", duration)
                h = int(hours.group(1)) if hours else 0
                m = int(minutes.group(1)) if minutes else 0
                movie["runtime"] = h * 60 + m if (h or m) else None
                movie["runtime_str"] = f"{h}h {m}m" if h else f"{m}m"

            # Directors
            directors = ld.get("director", [])
            if isinstance(directors, dict):
                directors = [directors]
            movie["director"] = [d.get("name") for d in directors if d.get("name")]

            # Actors (top cast from JSON-LD)
            actors = ld.get("actor", [])
            if isinstance(actors, dict):
                actors = [actors]
            movie["stars"] = [a.get("name") for a in actors[:5] if a.get("name")]

            # Trailer URL
            trailer = ld.get("trailer", {})
            if isinstance(trailer, dict):
                movie["trailer_url"] = trailer.get("url") or trailer.get("embedUrl")

        except (json.JSONDecodeError, TypeError) as e:
            print(f"  JSON-LD parse error for {imdb_id}: {e}")

    # === __NEXT_DATA__: Rich structured data from Next.js ===
    next_script = soup.find("script", {"id": "__NEXT_DATA__"})
    if next_script:
        try:
            next_data = json.loads(next_script.string)
            _extract_next_data(movie, next_data)
        except (json.JSONDecodeError, TypeError):
            pass

    # === HTML fallback: Box office section ===
    if not movie["box_office"]:
        for item in soup.select("[data-testid='title-boxoffice-section'] li"):
            label_el = item.select_one(".ipc-metadata-list-item__label")
            value_el = item.select_one(".ipc-metadata-list-item__content-container")
            if label_el and value_el:
                key = label_el.get_text(strip=True)
                val = value_el.get_text(strip=True)
                movie["box_office"][key] = val

    # === Full cast from HTML ===
    for cast_item in soup.select("[data-testid='title-cast-item']")[:20]:
        actor_el = cast_item.select_one("[data-testid='title-cast-item__actor']")
        char_el = cast_item.select_one("[data-testid='cast-item-characters-link'] span")
        if char_el is None:
            char_el = cast_item.select_one("[data-testid='cast-item-characters-with-link'] span")

        if actor_el:
            movie["cast"].append({
                "actor": actor_el.get_text(strip=True),
                "character": char_el.get_text(strip=True) if char_el else None,
                "billing_order": len(movie["cast"]) + 1,
            })

    # === Metascore ===
    metascore_el = soup.select_one("[data-testid='metacritic-score-box'] span")
    if metascore_el:
        try:
            movie["metascore"] = int(metascore_el.get_text(strip=True))
        except ValueError:
            pass

    # === Technical specs ===
    tech_items = soup.select("[data-testid='title-techspecs-section'] li")
    tech = {}
    for item in tech_items:
        label = item.select_one(".ipc-metadata-list-item__label")
        value = item.select_one(".ipc-metadata-list-item__content-container")
        if label and value:
            tech[label.get_text(strip=True)] = value.get_text(strip=True)
    if tech:
        movie["technical_specs"] = tech

    return movie

def _extract_next_data(movie: dict, next_data: dict):
    """Extract additional data from Next.js page data."""
    try:
        page_props = next_data.get("props", {}).get("pageProps", {})
        title_data = page_props.get("aboveTheFold", {})

        # Original title
        original = title_data.get("originalTitleText", {})
        if isinstance(original, dict):
            movie["original_title"] = original.get("text")

        # More detailed release year info
        release = title_data.get("releaseYear", {})
        if isinstance(release, dict):
            movie["year"] = release.get("year")
            movie["end_year"] = release.get("endYear")

        # Languages and countries from below the fold
        below = page_props.get("belowTheFold", {})
        if isinstance(below, dict):
            # Countries
            countries = below.get("countriesOfOrigin", {})
            if isinstance(countries, dict):
                movie["countries"] = [
                    c.get("text", "") for c in countries.get("countries", [])
                ]
            # Languages
            langs = below.get("spokenLanguages", {})
            if isinstance(langs, dict):
                movie["languages"] = [
                    l.get("text", "") for l in langs.get("spokenLanguages", [])
                ]
            # Keywords
            kws = below.get("keywords", {})
            if isinstance(kws, dict):
                movie["keywords"] = [
                    edge["node"].get("text", "")
                    for edge in kws.get("edges", [])
                    if isinstance(edge.get("node"), dict)
                ][:10]

    except (KeyError, AttributeError, TypeError):
        pass
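
To see the JSON-LD path in isolation, here's the same duration and rating extraction run against a hand-written sample blob. The field values are made up for illustration; only the key names mirror IMDb's schema.org markup:

```python
import json
import re

# Minimal stand-in for what a title page embeds in its JSON-LD script tag
sample_ld = json.dumps({
    "@type": "Movie",
    "name": "Example Film",
    "datePublished": "2010-07-16",
    "aggregateRating": {"ratingValue": 8.8, "ratingCount": 2_400_000},
    "duration": "PT2H28M",
    "genre": ["Action", "Sci-Fi"],
})

ld = json.loads(sample_ld)
hours = re.search(r"(\d+)H", ld["duration"])
minutes = re.search(r"(\d+)M", ld["duration"])
runtime = (int(hours.group(1)) if hours else 0) * 60 \
    + (int(minutes.group(1)) if minutes else 0)

print(ld["name"], ld["datePublished"][:4], runtime)  # Example Film 2010 148
```

The ISO 8601 duration "PT2H28M" parses to 148 minutes, exactly what the `runtime` field in `scrape_movie` stores.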

Scraping User Reviews

IMDb paginates reviews through an AJAX endpoint. You can hit it directly:

def scrape_reviews(
    imdb_id: str,
    session: requests.Session,
    max_reviews: int = 100,
    sort: str = "helpfulnessScore",
) -> list[dict]:
    """
    Scrape user reviews for a title.
    sort options: helpfulnessScore, submissionDate, userRating
    """
    reviews = []
    pagination_key = None
    base_url = f"https://www.imdb.com/title/{imdb_id}/reviews/"

    # First page (sort goes in the query string)
    html, status = fetch_imdb_page(f"{base_url}?sort={sort}", session)

    if status != 200:
        return []

    soup = BeautifulSoup(html, "lxml")

    def parse_review_page(soup: BeautifulSoup) -> tuple[list[dict], str | None]:
        page_reviews = []
        for div in soup.select(".review-container"):
            title_el = div.select_one(".title")
            text_el = div.select_one(".text.show-more__control")
            if text_el is None:
                text_el = div.select_one(".content .text")
            rating_el = div.select_one(".rating-other-user-rating span")
            date_el = div.select_one(".review-date")
            author_el = div.select_one(".display-name-link")
            helpful_el = div.select_one(".actions")

            page_reviews.append({
                "title": title_el.get_text(strip=True) if title_el else None,
                "text": text_el.get_text(strip=True) if text_el else None,
                "rating": int(rating_el.get_text(strip=True)) if rating_el else None,
                "date": date_el.get_text(strip=True) if date_el else None,
                "author": author_el.get_text(strip=True) if author_el else None,
                "helpful_text": helpful_el.get_text(strip=True) if helpful_el else None,
            })

        # Get pagination key for next page load
        load_more = soup.select_one(".load-more-data")
        next_key = load_more.get("data-key") if load_more else None

        return page_reviews, next_key

    page_reviews, pagination_key = parse_review_page(soup)
    reviews.extend(page_reviews)

    # Load more pages via AJAX endpoint
    while pagination_key and len(reviews) < max_reviews:
        ajax_url = f"https://www.imdb.com/title/{imdb_id}/reviews/_ajax"
        params = {
            "ref_": "undefined",
            "paginationKey": pagination_key,
            "spoiler": "include",
        }

        time.sleep(random.uniform(2, 5))

        try:
            resp = session.get(ajax_url, params=params, timeout=15)
            if resp.status_code != 200:
                break

            ajax_soup = BeautifulSoup(resp.text, "lxml")
            page_reviews, pagination_key = parse_review_page(ajax_soup)

            if not page_reviews:
                break

            reviews.extend(page_reviews)
        except Exception as e:
            print(f"  Review pagination error: {e}")
            break

    return reviews[:max_reviews]

Scraping the Top 250 and Genre Charts

def scrape_top_250(session: requests.Session) -> list[dict]:
    """Get IMDb Top 250 movies with rankings and basic data."""
    html, status = fetch_imdb_page("https://www.imdb.com/chart/top/", session)
    if status != 200:
        return []

    soup = BeautifulSoup(html, "lxml")

    # Top 250 is now rendered as JSON in a script tag
    chart_data_script = soup.find("script", {"id": "__NEXT_DATA__"})
    if chart_data_script:
        try:
            next_data = json.loads(chart_data_script.string)
            page_props = next_data.get("props", {}).get("pageProps", {})
            page_data = page_props.get("pageData", {})
            chart_titles = page_data.get("chartTitles", {}).get("edges", [])

            return [
                {
                    "rank": i + 1,
                    "imdb_id": edge.get("node", {}).get("id"),
                    "title": edge.get("node", {}).get("titleText", {}).get("text"),
                    "year": edge.get("node", {}).get("releaseYear", {}).get("year"),
                    "rating": edge.get("node", {}).get("ratingsSummary", {}).get("aggregateRating"),
                    "vote_count": edge.get("node", {}).get("ratingsSummary", {}).get("voteCount"),
                }
                for i, edge in enumerate(chart_titles)
            ]
        except (json.JSONDecodeError, TypeError, AttributeError):
            pass

    # Fallback: parse HTML table
    movies = []
    rank = 0
    for row in soup.select("li.ipc-metadata-list-summary-item"):
        rank += 1
        title_el = row.select_one(".ipc-title__text")
        year_el = row.select_one(".cli-title-metadata-item")
        rating_el = row.select_one(".ipc-rating-star--rating")
        link_el = row.select_one("a.ipc-title-link-wrapper")

        imdb_id = None
        if link_el:
            href = link_el.get("href", "")
            id_match = re.search(r"/(tt\d+)/", href)
            if id_match:
                imdb_id = id_match.group(1)

        movies.append({
            "rank": rank,
            "imdb_id": imdb_id,
            "title": title_el.get_text(strip=True).split(". ", 1)[-1] if title_el else None,
            "year": year_el.get_text(strip=True) if year_el else None,
            "rating": float(rating_el.get_text(strip=True)) if rating_el else None,
        })

    return movies

def scrape_genre_chart(genre: str, session: requests.Session, limit: int = 50) -> list[dict]:
    """
    Scrape the most popular titles for a genre using IMDb's Advanced Search.
    genre examples: action, comedy, drama, horror, sci-fi, thriller, animation
    """
    url = (
        f"https://www.imdb.com/search/title/"
        f"?genres={genre}&sort=num_votes,desc&count={min(limit, 250)}"
    )
    html, status = fetch_imdb_page(url, session)
    if status != 200:
        return []

    soup = BeautifulSoup(html, "lxml")
    movies = []

    for item in soup.select(".ipc-metadata-list-summary-item")[:limit]:
        title_el = item.select_one(".ipc-title__text")
        rating_el = item.select_one(".ipc-rating-star--rating")
        year_el = item.select_one(".dli-title-metadata-item")
        link_el = item.select_one("a.ipc-title-link-wrapper")

        imdb_id = None
        if link_el:
            href = link_el.get("href", "")
            id_match = re.search(r"/(tt\d+)/", href)
            if id_match:
                imdb_id = id_match.group(1)

        movies.append({
            "imdb_id": imdb_id,
            "title": title_el.get_text(strip=True).split(". ", 1)[-1] if title_el else None,
            "year": year_el.get_text(strip=True) if year_el else None,
            "rating": float(rating_el.get_text(strip=True)) if rating_el else None,
            "genre": genre,
        })

    return movies

Full Pipeline: Scrape, Store, and Export

def setup_imdb_database(db_path: str) -> sqlite3.Connection:
    """Create SQLite schema for IMDb data."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")

    conn.execute("""
        CREATE TABLE IF NOT EXISTS movies (
            imdb_id TEXT PRIMARY KEY,
            title TEXT,
            original_title TEXT,
            year INTEGER,
            runtime INTEGER,
            content_rating TEXT,
            rating REAL,
            vote_count INTEGER,
            metascore INTEGER,
            description TEXT,
            genres TEXT,
            director TEXT,
            stars TEXT,
            budget TEXT,
            domestic_gross TEXT,
            worldwide_gross TEXT,
            opening_weekend TEXT,
            countries TEXT,
            languages TEXT,
            keywords TEXT,
            scraped_at TEXT
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS cast_members (
            imdb_id TEXT,
            billing_order INTEGER,
            actor TEXT,
            character TEXT,
            PRIMARY KEY (imdb_id, billing_order)
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS reviews (
            imdb_id TEXT,
            review_idx INTEGER,
            title TEXT,
            text TEXT,
            rating INTEGER,
            date TEXT,
            author TEXT,
            PRIMARY KEY (imdb_id, review_idx)
        )
    """)

    conn.commit()
    return conn

def save_movie(conn: sqlite3.Connection, movie: dict):
    """Save a scraped movie to the database."""
    box = movie.get("box_office", {})
    now = datetime.now().isoformat()

    conn.execute("""
        INSERT OR REPLACE INTO movies
        (imdb_id, title, original_title, year, runtime, content_rating,
         rating, vote_count, metascore, description, genres, director,
         stars, budget, domestic_gross, worldwide_gross, opening_weekend,
         countries, languages, keywords, scraped_at)
        VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
    """, (
        movie["imdb_id"],
        movie.get("title"),
        movie.get("original_title"),
        movie.get("year"),
        movie.get("runtime"),
        movie.get("content_rating"),
        movie.get("rating"),
        movie.get("vote_count"),
        movie.get("metascore"),
        movie.get("description"),
        json.dumps(movie.get("genres", [])),
        json.dumps(movie.get("director", [])),
        json.dumps(movie.get("stars", [])),
        box.get("Budget") or box.get("Gross Budget"),
        box.get("Gross US & Canada"),
        box.get("Gross worldwide") or box.get("Gross Worldwide"),
        box.get("Opening weekend US & Canada") or box.get("Opening Weekend"),
        json.dumps(movie.get("countries", [])),
        json.dumps(movie.get("languages", [])),
        json.dumps(movie.get("keywords", [])),
        now,
    ))

    # Save cast
    for member in movie.get("cast", []):
        conn.execute("""
            INSERT OR REPLACE INTO cast_members VALUES (?,?,?,?)
        """, (
            movie["imdb_id"],
            member.get("billing_order", 0),
            member.get("actor"),
            member.get("character"),
        ))

def run_scraping_pipeline(
    imdb_ids: list[str],
    db_path: str = "imdb_data.db",
    include_reviews: bool = False,
    proxy: str = None,
) -> dict:
    """
    Full pipeline: scrape movies, save to DB, return stats.
    """
    conn = setup_imdb_database(db_path)
    session = create_imdb_session(proxy=proxy)

    stats = {"scraped": 0, "failed": 0, "errors": []}

    for i, imdb_id in enumerate(imdb_ids):
        print(f"[{i+1}/{len(imdb_ids)}] Scraping {imdb_id}...")

        movie = scrape_movie(imdb_id, session)

        if "error" in movie:
            print(f"  Failed: {movie['error']}")
            stats["failed"] += 1
            stats["errors"].append(imdb_id)

            # Refresh the session after repeated failures (with a rotating
            # proxy, a new session also gets a fresh exit IP)
            if stats["failed"] % 5 == 0 and proxy:
                session = create_imdb_session(proxy=proxy)
                print("  Refreshed session after failures")
            continue

        print(f"  {movie.get('title', '?')} ({movie.get('year', '?')}) "
              f"— {movie.get('rating', '?')}★ ({movie.get('vote_count', 0):,} votes)")

        if include_reviews:
            reviews = scrape_reviews(imdb_id, session, max_reviews=20)
            for j, review in enumerate(reviews):
                conn.execute("""
                    INSERT OR REPLACE INTO reviews VALUES (?,?,?,?,?,?,?)
                """, (
                    imdb_id, j,
                    review.get("title"),
                    review.get("text"),
                    review.get("rating"),
                    review.get("date"),
                    review.get("author"),
                ))
            print(f"  + {len(reviews)} reviews")

        save_movie(conn, movie)
        conn.commit()
        stats["scraped"] += 1

    conn.close()
    print(f"\nDone: {stats['scraped']} scraped, {stats['failed']} failed")
    return stats
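
One practical addition: long runs get interrupted, so it helps to drop IDs that are already in the database before calling the pipeline again. A resume helper sketch (assumes the `movies` schema from `setup_imdb_database`):

```python
import sqlite3

def filter_unscraped(db_path: str, imdb_ids: list[str]) -> list[str]:
    """Return only the IDs not yet stored, so an interrupted run
    can resume without re-fetching finished titles."""
    conn = sqlite3.connect(db_path)
    try:
        done = {row[0] for row in conn.execute("SELECT imdb_id FROM movies")}
    except sqlite3.OperationalError:  # fresh DB, table not created yet
        done = set()
    finally:
        conn.close()
    return [tid for tid in imdb_ids if tid not in done]

# remaining = filter_unscraped("imdb_top250.db", top250_ids)
```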

# Scrape the IMDb Top 250
session = create_imdb_session(proxy=PROXY)
top250 = scrape_top_250(session)
top250_ids = [m["imdb_id"] for m in top250 if m["imdb_id"]]
print(f"Got {len(top250_ids)} IDs from Top 250")

# Scrape details for all of them
stats = run_scraping_pipeline(
    top250_ids,
    db_path="imdb_top250.db",
    include_reviews=False,
    proxy=PROXY,
)

Handling Proxies at Scale

IMDb's CloudFront layer checks ASN reputation — datacenter IPs (AWS, GCP, Hetzner, etc.) get throttled faster than residential ones. After about 40 requests per minute from a datacenter IP, you hit 503 responses that don't clear for 10-15 minutes.

ThorData provides rotating residential proxies where each request appears to come from a different home connection. For IMDb, this means:

  - Requests appear as organic user traffic from different household IPs
  - CloudFront's per-IP rate limits reset with each rotation
  - Geographic targeting lets you pull region-specific release dates

For sticky sessions (needed when following links from one page to the next), use a session label:

def get_imdb_proxy(session_id: str = None) -> str:
    """
    Get a proxy URL. If session_id provided, uses sticky routing
    (same exit IP for all requests with that session_id).
    """
    base = "http://USERNAME:PASSWORD"
    host = "gate.thordata.com:7777"
    if session_id:
        return f"{base}-session-{session_id}@{host}"
    return f"{base}@{host}"

# Use sticky proxy for scraping a single movie + its reviews
import random
movie_session = str(random.randint(10000, 99999))
session = create_imdb_session(proxy=get_imdb_proxy(movie_session))

movie = scrape_movie("tt1375666", session)
reviews = scrape_reviews("tt1375666", session, max_reviews=50)

Exporting to CSV for Analysis

def export_movies_to_csv(db_path: str, output_file: str = "imdb_export.csv"):
    """Export movie data with computed fields for analysis."""
    conn = sqlite3.connect(db_path)

    cursor = conn.execute("""
        SELECT
            m.imdb_id,
            m.title,
            m.year,
            m.runtime,
            m.content_rating,
            m.rating,
            m.vote_count,
            m.metascore,
            m.genres,
            m.director,
            m.budget,
            m.domestic_gross,
            m.worldwide_gross,
            m.countries,
            m.languages,
            COUNT(c.billing_order) as cast_count
        FROM movies m
        LEFT JOIN cast_members c ON m.imdb_id = c.imdb_id
        WHERE m.rating IS NOT NULL
        GROUP BY m.imdb_id
        ORDER BY m.vote_count DESC NULLS LAST
    """)

    rows = cursor.fetchall()
    fieldnames = [
        "imdb_id", "title", "year", "runtime", "content_rating",
        "rating", "vote_count", "metascore", "genres", "director",
        "budget", "domestic_gross", "worldwide_gross", "countries",
        "languages", "cast_count"
    ]

    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow(dict(zip(fieldnames, row)))

    conn.close()
    print(f"Exported {len(rows)} movies to {output_file}")

export_movies_to_csv("imdb_top250.db", "top250_analysis.csv")

Using the Official IMDb Datasets

For basic metadata, IMDb provides official TSV datasets at datasets.imdbws.com — use these when possible instead of scraping:

import gzip
import csv

def download_and_parse_imdb_dataset(dataset_name: str) -> list[dict]:
    """
    Download and parse official IMDb datasets.
    Available: title.basics, title.ratings, title.akas, title.crew,
               title.episode, title.principals, name.basics
    """
    import io
    import requests

    url = f"https://datasets.imdbws.com/{dataset_name}.tsv.gz"
    print(f"Downloading {url}...")

    resp = requests.get(url, timeout=120)
    resp.raise_for_status()
    rows = []

    # Decompress from memory: piping resp.raw straight into gzip.open can
    # fail if the server applies a transfer encoding
    with gzip.open(io.BytesIO(resp.content), "rt", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            # Replace IMDb's null marker with None
            cleaned = {k: (v if v != "\\N" else None) for k, v in row.items()}
            rows.append(cleaned)

    print(f"Loaded {len(rows):,} rows from {dataset_name}")
    return rows

# Load movie ratings (no scraping needed)
# ratings = download_and_parse_imdb_dataset("title.ratings")
# basics = download_and_parse_imdb_dataset("title.basics")

IMDb's Conditions of Use prohibit data mining, robots, and scraping, and Amazon (IMDb's parent company) takes enforcement seriously. The official datasets at datasets.imdbws.com are offered for personal, non-commercial use, so rely on them for anything they cover.

For data not in the official dumps (box office numbers, reviews, filming locations), scraping for personal research and non-commercial analysis is generally tolerated. Don't republish raw IMDb content at scale, and don't build a product that competes with IMDb using scraped data.

Key Takeaways