Scraping Letterboxd Film Reviews, Ratings, and User Lists (2026)
Letterboxd is the social network that film nerds actually use. It has detailed ratings, thoughtful reviews, curated lists, and active community engagement around movies. It also has no public API. If you want Letterboxd data for research, recommendation engines, sentiment analysis, or film trend tracking, you're scraping HTML.
The good news is that Letterboxd's HTML is remarkably clean. Server-rendered pages, consistent CSS classes, predictable URL structures. It's one of the more pleasant scraping targets if you respect their rate limits. This guide covers extracting film ratings, review text, user activity, popular lists, and diary data — along with production-ready patterns for building a reliable film data pipeline.
The Value of Letterboxd Data
Letterboxd has unique data characteristics that make it worth the effort to scrape:
Review quality is high. Unlike Amazon or Google reviews, Letterboxd attracts film enthusiasts who write substantive reviews. The average review length and analytical depth are significantly higher than on other review platforms. This makes Letterboxd data particularly valuable for sentiment analysis, recommendation systems, and film criticism research.
Community curation is rich. User-created lists on Letterboxd are an underappreciated dataset. Lists like "Best Films of the 21st Century," "Essential Horror," or "Overlooked Science Fiction" represent aggregated editorial judgment that's hard to find elsewhere.
Rating distribution matters. Letterboxd's 0.5-5.0 scale with half-star increments creates a more granular distribution than most platforms' 1-5 integer ratings. The distribution of ratings across the community often reveals interesting bimodal patterns for controversial films.
Popular films lists are real-time signals. Letterboxd's "popular this week" list reflects actual community engagement, not algorithmic promotion. It's a cleaner signal for film trend tracking than streaming charts that are influenced by platform recommendations.
URL Structure
Understanding Letterboxd's URL patterns is essential:
Film page: https://letterboxd.com/film/{slug}/
Film reviews: https://letterboxd.com/film/{slug}/reviews/by/activity/
Film reviews (pop): https://letterboxd.com/film/{slug}/reviews/by/popularity/
Film reviews (p.N): https://letterboxd.com/film/{slug}/reviews/by/activity/page/{n}/
Film ratings: https://letterboxd.com/film/{slug}/ratings/
Film cast: https://letterboxd.com/film/{slug}/cast/
Film fans: https://letterboxd.com/film/{slug}/fans/
Film lists: https://letterboxd.com/film/{slug}/lists/
User profile: https://letterboxd.com/{username}/
User films: https://letterboxd.com/{username}/films/
User diary: https://letterboxd.com/{username}/films/diary/
User reviews: https://letterboxd.com/{username}/films/reviews/
User lists: https://letterboxd.com/{username}/lists/
User watchlist: https://letterboxd.com/{username}/watchlist/
Popular films: https://letterboxd.com/films/popular/
Popular this week: https://letterboxd.com/films/popular/this/week/
Popular this month: https://letterboxd.com/films/popular/this/month/
Genre popular: https://letterboxd.com/films/popular/genre/{genre}/
Director films: https://letterboxd.com/director/{name}/
Actor films: https://letterboxd.com/actor/{name}/
All of these return server-rendered HTML that BeautifulSoup can parse directly.
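These patterns are regular enough to centralize in a small helper, which keeps URL construction in one place when Letterboxd tweaks a path. A sketch (the function names are mine, not Letterboxd's):

```python
BASE = "https://letterboxd.com"

def film_url(slug: str, section: str = "", page: int = 0) -> str:
    """Build a film URL, e.g. film_url('parasite-2019', 'reviews/by/activity', 2)."""
    parts = [BASE, "film", slug]
    if section:
        parts.append(section.strip("/"))
    if page > 1:
        parts += ["page", str(page)]
    return "/".join(parts) + "/"

def user_url(username: str, section: str = "") -> str:
    """Build a user URL, e.g. user_url('alice', 'films/diary')."""
    parts = [BASE, username]
    if section:
        parts.append(section.strip("/"))
    return "/".join(parts) + "/"

print(film_url("parasite-2019", "reviews/by/activity", 2))
# -> https://letterboxd.com/film/parasite-2019/reviews/by/activity/page/2/
```

Trailing slashes matter: Letterboxd redirects slashless URLs, and skipping the redirect saves a round trip per request.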
Setup and Base Client
pip install httpx beautifulsoup4 lxml
import httpx
import time
import re
import json
import sqlite3
import random
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
from typing import Optional
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
def create_client(proxy: Optional[str] = None, ua: Optional[str] = None) -> httpx.Client:
"""Create an httpx client with appropriate headers for Letterboxd."""
transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
return httpx.Client(
transport=transport,
headers={
"User-Agent": ua or random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Upgrade-Insecure-Requests": "1",
},
timeout=20,
follow_redirects=True,
)
def get_soup(client: httpx.Client, url: str, retries: int = 3) -> Optional[BeautifulSoup]:
"""Fetch a URL and return a BeautifulSoup object with retry logic."""
for attempt in range(retries):
try:
resp = client.get(url)
if resp.status_code == 200:
return BeautifulSoup(resp.text, "lxml")
elif resp.status_code == 404:
return None
elif resp.status_code == 429:
wait = float(resp.headers.get("Retry-After", 30 * (attempt + 1)))
print(f"Rate limited on {url}. Waiting {wait:.0f}s...")
time.sleep(wait)
elif resp.status_code == 503:
time.sleep(10 * (attempt + 1))
else:
print(f"Got {resp.status_code} for {url} (attempt {attempt + 1})")
time.sleep(2 * (attempt + 1))
except httpx.TimeoutException:
print(f"Timeout on {url} (attempt {attempt + 1})")
time.sleep(3 * (attempt + 1))
except httpx.ConnectError as e:
print(f"Connection error: {e}")
time.sleep(5)
return None
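The fixed sleeps above can be swapped for exponential backoff with full jitter, which spreads retries out in time and avoids many workers retrying in lockstep. A sketch of the delay schedule (the helper name is mine):

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: wait somewhere in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# Worst-case waits grow 2s, 4s, 8s, ... and never exceed the cap.
for attempt in range(4):
    print(f"attempt {attempt}: up to {min(60.0, 2.0 * 2 ** attempt):.0f}s")
```

Inside get_soup, `time.sleep(backoff_delay(attempt))` would replace the `2 * (attempt + 1)`-style sleeps.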
Scraping Film Metadata
Each film page contains structured data in both the HTML and JSON-LD:
@dataclass
class LetterboxdFilm:
slug: str
title: str
year: str
director: str
average_rating: Optional[str]
rating_count: Optional[int]
genres: list
description: str
image_url: str
url: str
runtime_minutes: Optional[int] = None
countries: list = field(default_factory=list)
languages: list = field(default_factory=list)
studios: list = field(default_factory=list)
def scrape_film(client: httpx.Client, film_slug: str) -> Optional[LetterboxdFilm]:
"""Scrape comprehensive film metadata from its Letterboxd page."""
url = f"https://letterboxd.com/film/{film_slug}/"
soup = get_soup(client, url)
if not soup:
return None
    # Extract JSON-LD structured data; Letterboxd wraps the JSON in CDATA-style
    # comments (/* <![CDATA[ */ ... /* ]]> */) that must be stripped before parsing
    ld_script = soup.find("script", type="application/ld+json")
    ld_data = {}
    if ld_script and ld_script.string:
        raw = re.sub(r"/\*.*?\*/", "", ld_script.string, flags=re.S).strip()
        try:
            ld_data = json.loads(raw)
        except json.JSONDecodeError:
            pass
# Average rating from meta tag
rating_meta = soup.find("meta", {"name": "twitter:data2"})
avg_rating = None
if rating_meta and rating_meta.get("content"):
content = rating_meta["content"]
match = re.search(r"([\d.]+)\s+out of 5", content)
if match:
avg_rating = match.group(1)
# Rating count
rating_count = None
for tooltip_el in soup.find_all("a", class_="tooltip"):
title_attr = tooltip_el.get("title", "")
if "rating" in title_attr.lower():
count_match = re.search(r"([\d,]+)\s+rating", title_attr)
if count_match:
try:
rating_count = int(count_match.group(1).replace(",", ""))
except ValueError:
pass
break
# Year from featured film header
year = ""
year_el = soup.find("small", class_="number")
if year_el:
year = year_el.text.strip()
elif soup.find("meta", property="og:title"):
og_title = soup.find("meta", property="og:title").get("content", "")
year_match = re.search(r"\((\d{4})\)", og_title)
if year_match:
year = year_match.group(1)
# Director from twitter meta
director = ""
director_meta = soup.find("meta", {"name": "twitter:data1"})
if director_meta:
director = director_meta.get("content", "")
# Genres from tab
genres = []
genre_tab = soup.find("div", id="tab-genres")
if genre_tab:
for a_tag in genre_tab.find_all("a", class_="text-slug"):
g = a_tag.text.strip()
if g:
genres.append(g)
# Runtime
runtime = None
runtime_el = soup.find("p", class_="text-link text-footer")
if not runtime_el:
runtime_el = soup.find("span", class_="duration")
if runtime_el:
runtime_text = runtime_el.get_text()
minutes_match = re.search(r"(\d+)\s*mins?", runtime_text)
if minutes_match:
runtime = int(minutes_match.group(1))
# Countries and languages from details tab
countries = []
languages = []
details_tab = soup.find("div", id="tab-details")
if details_tab:
for section in details_tab.find_all("div", class_=True):
label_el = section.find("h3")
if label_el:
label = label_el.text.strip().lower()
items = [a.text.strip() for a in section.find_all("a", class_="text-slug")]
if "countr" in label:
countries = items
elif "language" in label:
languages = items
return LetterboxdFilm(
slug=film_slug,
title=ld_data.get("name", ""),
year=year,
director=director,
average_rating=avg_rating,
rating_count=rating_count,
genres=genres,
description=(ld_data.get("description") or "")[:500],
image_url=ld_data.get("image", ""),
url=url,
runtime_minutes=runtime,
countries=countries,
languages=languages,
)
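The JSON-LD block is the most fragile dependency in scrape_film, so it is worth exercising offline. The snippet below runs the extraction against a saved, made-up page fragment; Letterboxd has historically wrapped the JSON in CDATA-style comments, which the helper strips defensively:

```python
import json
import re

# Invented fragment mimicking the relevant part of a film page; not real Letterboxd HTML.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
/* <![CDATA[ */
{"name": "Example Film", "image": "https://example.com/poster.jpg",
 "description": "A film that does not exist."}
/* ]]> */
</script>
</head></html>
"""

def extract_json_ld(html: str) -> dict:
    """Pull the first JSON-LD block, stripping any comment wrapper around the JSON."""
    m = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
    if not m:
        return {}
    body = re.sub(r"/\*.*?\*/", "", m.group(1), flags=re.S).strip()
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return {}

print(extract_json_ld(SAMPLE_HTML)["name"])  # Example Film
```

Saving one real film page to disk and running this parser against it makes a cheap regression test for the JSON-LD path.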
Extracting Reviews
Letterboxd reviews are paginated, with 12 reviews per page. The page structure is stable across the activity and popularity sort modes:
@dataclass
class LetterboxdReview:
reviewer: str
reviewer_url: str
rating: Optional[float] # 0.5 to 5.0 in 0.5 increments
text: str
date: str
likes: int
has_spoiler_warning: bool = False
is_featured: bool = False
def _extract_rating_from_classes(classes: list[str]) -> Optional[float]:
"""Convert Letterboxd CSS rating class to float (rated-7 -> 3.5)."""
for cls in classes:
if cls.startswith("rated-"):
try:
raw = int(cls.replace("rated-", ""))
return raw / 2 # 1-10 -> 0.5-5.0
except ValueError:
pass
return None
def scrape_film_reviews(
client: httpx.Client,
film_slug: str,
sort: str = "activity",
max_pages: int = 5,
delay: float = 2.0,
) -> list[LetterboxdReview]:
"""
Scrape reviews for a film.
sort: 'activity' (recent), 'popularity' (most liked), 'highest-rated', 'lowest-rated'
"""
reviews = []
for page in range(1, max_pages + 1):
url = f"https://letterboxd.com/film/{film_slug}/reviews/by/{sort}/page/{page}/"
soup = get_soup(client, url)
if soup is None:
break
review_items = soup.find_all("li", class_=lambda c: c and "film-detail" in c)
if not review_items:
break
for item in review_items:
# Reviewer info
reviewer_el = item.find("strong", class_="name")
reviewer = ""
reviewer_url = ""
if reviewer_el:
reviewer = reviewer_el.text.strip()
link = reviewer_el.find("a") or reviewer_el.parent
if link and link.get("href"):
reviewer_url = "https://letterboxd.com" + link.get("href", "")
# Rating
rating_el = item.find("span", class_="rating")
rating = _extract_rating_from_classes(rating_el.get("class", [])) if rating_el else None
# Review text — handle collapsed/spoiler versions
body_el = item.find("div", class_="body-text")
review_text = ""
has_spoiler = False
if body_el:
# Check for spoiler warning
spoiler_el = body_el.find("p", class_="contains-spoilers")
if spoiler_el:
has_spoiler = True
# Get text from collapsed or full version
collapsed = body_el.find("div", class_="collapsed-text")
full = body_el.find("div", class_="full-text")
text_el = collapsed or full or body_el
review_text = text_el.get_text(separator=" ", strip=True)[:2000]
# Date
date = ""
date_el = item.find("span", class_="_nobr")
if date_el:
date = date_el.text.strip()
else:
# Try time element
time_el = item.find("time")
if time_el:
date = time_el.get("datetime", time_el.text.strip())
# Like count
likes = 0
like_el = item.find("a", class_=lambda c: c and "has-icon" in c and "icon-like" in c)
if like_el:
like_text = like_el.get("title", like_el.text)
like_match = re.search(r"([\d,]+)\s+like", like_text, re.IGNORECASE)
if like_match:
try:
likes = int(like_match.group(1).replace(",", ""))
except ValueError:
pass
# Featured review indicator
is_featured = bool(item.find(class_=lambda c: c and "featured" in str(c).lower()))
if reviewer or review_text:
reviews.append(LetterboxdReview(
reviewer=reviewer,
reviewer_url=reviewer_url,
rating=rating,
text=review_text,
date=date,
likes=likes,
has_spoiler_warning=has_spoiler,
is_featured=is_featured,
))
time.sleep(delay)
return reviews
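With reviews in hand, a half-star histogram makes the bimodal distributions mentioned earlier visible at a glance. A minimal sketch over plain rating values (the helper name is mine):

```python
from collections import Counter

def rating_histogram(ratings: list) -> dict:
    """Count ratings per half-star bucket, ignoring reviews with no rating."""
    return dict(sorted(Counter(r for r in ratings if r is not None).items()))

# e.g. [r.rating for r in scrape_film_reviews(...)] would feed this directly
sample = [4.0, 4.5, 0.5, 4.0, None, 1.0, 4.5, 0.5]
print(rating_histogram(sample))
# {0.5: 2, 1.0: 1, 4.0: 2, 4.5: 2}
```

A distribution with heavy mass at both ends and a thin middle is the classic "love it or hate it" shape.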
Scraping Popular Films Lists
@dataclass
class LetterboxdListEntry:
title: str
slug: str
year: Optional[str]
position: int
poster_url: str
url: str
def scrape_popular_films(
client: httpx.Client,
time_period: str = "this/week",
genre: Optional[str] = None,
decade: Optional[str] = None,
max_pages: int = 3,
delay: float = 2.0,
) -> list[LetterboxdListEntry]:
"""
Scrape Letterboxd popular films lists.
time_period: 'this/week', 'this/month', 'this/year', '' (all time)
genre: 'horror', 'drama', 'comedy', etc.
decade: '2020s', '2010s', '1980s', etc.
"""
films = []
for page in range(1, max_pages + 1):
# Build URL
base = "https://letterboxd.com/films/popular"
url_parts = [base]
if time_period:
url_parts.append(time_period)
if genre:
url_parts.append(f"genre/{genre}")
if decade:
url_parts.append(f"decade/{decade}")
url_parts.append(f"page/{page}/")
url = "/".join(url_parts)
soup = get_soup(client, url)
if not soup:
break
posters = soup.find_all("li", class_="poster-container")
if not posters:
break
for poster_li in posters:
div = poster_li.find("div", class_="film-poster")
if not div:
continue
slug = div.get("data-film-slug", "")
film_id = div.get("data-film-id", "")
img = poster_li.find("img")
title = img.get("alt", "") if img else ""
# Poster URL
poster_url = ""
if img:
poster_url = img.get("src", "") or img.get("data-src", "")
# Year (sometimes in data attribute)
year = div.get("data-film-release-year", "")
films.append(LetterboxdListEntry(
title=title,
slug=slug,
year=year or None,
position=len(films) + 1,
poster_url=poster_url,
url=f"https://letterboxd.com/film/{slug}/",
))
time.sleep(delay)
return films
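Popularity snapshots pay off when compared across weeks. This sketch diffs two ordered slug lists to surface new entries and rank moves (the film slugs here are illustrative):

```python
def diff_popular(last_week: list, this_week: list) -> dict:
    """Compare two ordered slug lists and report new entries and rank changes."""
    prev_pos = {slug: i + 1 for i, slug in enumerate(last_week)}
    new, moved = [], []
    for i, slug in enumerate(this_week, start=1):
        if slug not in prev_pos:
            new.append(slug)
        elif prev_pos[slug] != i:
            moved.append((slug, prev_pos[slug], i))  # (slug, old rank, new rank)
    return {"new": new, "moved": moved}

result = diff_popular(
    ["dune-part-two", "poor-things", "anatomy-of-a-fall"],
    ["poor-things", "dune-part-two", "challengers"],
)
print(result)
# {'new': ['challengers'], 'moved': [('poor-things', 2, 1), ('dune-part-two', 1, 2)]}
```

Feeding this from the popular_snapshots table (one query per snapshot_date) turns the raw scrapes into a weekly trend report.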
def scrape_films_by_genre(
client: httpx.Client,
genre: str,
sort: str = "popular",
decade: Optional[str] = None,
max_pages: int = 5,
) -> list[LetterboxdListEntry]:
"""
Scrape films filtered by genre.
sort: 'popular', 'average-rating', 'date', 'name'
"""
films = []
    for page in range(1, max_pages + 1):
        # 'popular' maps to the /films/popular/ path; other sorts use a /by/{sort}/
        # segment (verify the segment order against the live site before relying on it)
        if sort == "popular":
            base = f"https://letterboxd.com/films/popular/genre/{genre}"
        else:
            base = f"https://letterboxd.com/films/genre/{genre}/by/{sort}"
        if decade:
            base += f"/decade/{decade}"
        url = f"{base}/page/{page}/"
soup = get_soup(client, url)
if not soup:
break
posters = soup.find_all("li", class_="poster-container")
if not posters:
break
for poster_li in posters:
div = poster_li.find("div", class_="film-poster")
if not div:
continue
slug = div.get("data-film-slug", "")
img = poster_li.find("img")
title = img.get("alt", "") if img else ""
films.append(LetterboxdListEntry(
title=title, slug=slug, year=None,
position=len(films) + 1,
poster_url="",
url=f"https://letterboxd.com/film/{slug}/",
))
time.sleep(2.0)
return films
Scraping User Profiles, Diary, and Lists
@dataclass
class LetterboxdUserProfile:
username: str
bio: str
films_count: int
following_count: int
followers_count: int
lists_count: int
favorite_films: list
@dataclass
class LetterboxdDiaryEntry:
title: str
slug: str
date: str
rating: Optional[float]
rewatch: bool
liked: bool
@dataclass
class LetterboxdList:
title: str
slug: str
url: str
film_count: int
description: str
published_date: str
likes: int
tags: list
def scrape_user_profile(client: httpx.Client, username: str) -> Optional[LetterboxdUserProfile]:
"""Scrape a user's public profile page."""
url = f"https://letterboxd.com/{username}/"
soup = get_soup(client, url)
if not soup:
return None
# Check if profile is private
if soup.find("section", class_="error"):
return None
    # Stats: the label lives in span.definition, the number in span.value
    stats = {}
    for stat_el in soup.find_all("h4", class_="profile-statistic"):
        value_el = stat_el.find("span", class_="value")
        label_el = stat_el.find("span", class_="definition")
        if value_el and label_el:
            label = label_el.text.strip().lower()
            value_text = value_el.text.strip().lower().replace(",", "")
            try:
                # Letterboxd abbreviates large counts, e.g. "1.2k"
                if value_text.endswith("k"):
                    stats[label] = int(float(value_text[:-1]) * 1000)
                else:
                    stats[label] = int(value_text)
            except ValueError:
                pass
# Bio
bio = ""
bio_el = soup.find("div", class_="body-text")
if bio_el:
bio = bio_el.get_text(separator=" ", strip=True)[:500]
# Favorite films
favorites = []
fav_section = soup.find("section", id="favourites")
if fav_section:
for poster_div in fav_section.find_all("div", class_="film-poster"):
img = poster_div.find("img")
if img:
favorites.append({
"title": img.get("alt", ""),
"slug": poster_div.get("data-film-slug", ""),
})
return LetterboxdUserProfile(
username=username,
bio=bio,
films_count=stats.get("films", 0),
following_count=stats.get("following", 0),
followers_count=stats.get("followers", 0),
lists_count=stats.get("lists", 0),
favorite_films=favorites,
)
def scrape_user_diary(
client: httpx.Client,
username: str,
max_pages: int = 5,
year: Optional[int] = None,
delay: float = 2.0,
) -> list[LetterboxdDiaryEntry]:
"""
Scrape a user's film diary (watches with dates and ratings).
year: filter to a specific year (e.g., 2024)
"""
entries = []
base_url = f"https://letterboxd.com/{username}/films/diary"
if year:
base_url = f"{base_url}/for/{year}"
for page in range(1, max_pages + 1):
url = f"{base_url}/page/{page}/"
soup = get_soup(client, url)
if not soup:
break
rows = soup.find_all("tr", class_="diary-entry-row")
if not rows:
break
for row in rows:
# Film info
title_cell = row.find("td", class_="td-film-details")
title = ""
slug = ""
if title_cell:
link_el = title_cell.find("a", href=True)
if link_el:
title = link_el.text.strip()
href = link_el.get("href", "")
if "/film/" in href:
slug = href.split("/film/")[-1].strip("/")
# Watch date
date = ""
cal_cell = row.find("td", class_="td-calendar")
if cal_cell:
date_link = cal_cell.find("a", href=True)
if date_link:
href = date_link.get("href", "")
# URL format: /username/films/diary/for/2024/01/15/
date_match = re.search(r"/for/(\d{4}/\d{2}/\d{2})/", href)
if date_match:
date = date_match.group(1).replace("/", "-")
# Rating
rating_el = row.find("td", class_="td-rating")
rating = None
if rating_el:
span = rating_el.find("span", class_="rating")
if span:
rating = _extract_rating_from_classes(span.get("class", []))
# Rewatch
rewatch = bool(row.find("td", class_="td-rewatch"))
# Liked
liked_el = row.find("td", class_="td-like")
liked = bool(liked_el and liked_el.find(class_=lambda c: c and "liked" in c))
if title:
entries.append(LetterboxdDiaryEntry(
title=title,
slug=slug,
date=date,
rating=rating,
rewatch=rewatch,
liked=liked,
))
time.sleep(delay)
return entries
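Because diary dates come out as YYYY-MM-DD strings, monthly viewing counts reduce to a prefix grouping. A sketch over invented dates (the helper name is mine):

```python
from collections import Counter

def films_per_month(dates: list) -> dict:
    """Count diary entries per YYYY-MM bucket; entries without a date are skipped."""
    return dict(sorted(Counter(d[:7] for d in dates if d).items()))

# e.g. [e.date for e in scrape_user_diary(...)] would feed this directly
dates = ["2024-01-15", "2024-01-28", "2024-02-03", "", "2024-02-14", "2024-02-20"]
print(films_per_month(dates))
# {'2024-01': 2, '2024-02': 3}
```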
def scrape_user_lists(
client: httpx.Client,
username: str,
max_pages: int = 3,
delay: float = 2.0,
) -> list[LetterboxdList]:
"""Scrape a user's published lists."""
lists = []
for page in range(1, max_pages + 1):
url = f"https://letterboxd.com/{username}/lists/page/{page}/"
soup = get_soup(client, url)
if not soup:
break
        # Each published list is headed by an h2.list-title; stop when a page has none
        title_els = soup.find_all("h2", class_=lambda c: c and "list-title" in str(c))
        if not title_els:
            break
        for item in title_els:
link = item.find("a", href=True)
if not link:
continue
href = link.get("href", "")
list_url = "https://letterboxd.com" + href if href.startswith("/") else href
# Film count
count = 0
count_el = item.find_next("small", class_=lambda c: c and "subtitle" in str(c))
if count_el:
count_match = re.search(r"(\d+)\s+film", count_el.text)
if count_match:
count = int(count_match.group(1))
lists.append(LetterboxdList(
title=link.text.strip(),
slug=href.split("/")[-2] if href else "",
url=list_url,
film_count=count,
description="",
published_date="",
likes=0,
tags=[],
))
time.sleep(delay)
return lists
def scrape_list_films(
client: httpx.Client,
list_url: str,
max_pages: int = 10,
delay: float = 2.0,
) -> list[dict]:
"""Scrape all films from a Letterboxd list."""
films = []
for page in range(1, max_pages + 1):
url = f"{list_url.rstrip('/')}/page/{page}/"
soup = get_soup(client, url)
if not soup:
break
posters = soup.find_all("li", class_="poster-container")
if not posters:
break
        for li in posters:
div = li.find("div", class_="film-poster")
if not div:
continue
img = li.find("img")
films.append({
"position": len(films) + 1,
"title": img.get("alt", "") if img else "",
"slug": div.get("data-film-slug", ""),
"year": div.get("data-film-release-year", ""),
"url": f"https://letterboxd.com/film/{div.get('data-film-slug', '')}/",
})
time.sleep(delay)
return films
Anti-Bot Measures and Production Setup
Letterboxd is not aggressively protected, but it does enforce meaningful limits.
Rate limiting. Letterboxd will return 429 responses if you exceed roughly 1 request per second sustained. The 2-second sleep in the examples above keeps you comfortably under this threshold.
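Rather than scattering time.sleep calls across every loop, a small throttle object can enforce the interval centrally (the class name is mine):

```python
import time

class Throttle:
    """Enforce a minimum interval between requests, regardless of call site."""
    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> float:
        """Sleep just long enough to honor the interval; returns the time slept."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
        return max(0.0, remaining)

throttle = Throttle(min_interval=0.1)
throttle.wait()          # first call returns immediately
slept = throttle.wait()  # second call sleeps to maintain spacing
print(f"slept {slept:.2f}s")
```

Calling throttle.wait() right before each client.get keeps the whole pipeline under the target rate even when several scraper functions share one client.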
IP blocking. Sustained scraping from a single IP — even at polite rates — will eventually get noticed. Letterboxd monitors for patterns: the same IP requesting hundreds of review pages, or systematically crawling user profiles. For any collection beyond a few hundred pages, IP rotation is needed.
Cloudflare. Letterboxd uses Cloudflare at a moderate protection level. Most requests pass without challenge. If you see Cloudflare challenge pages (the "checking your browser" interstitial), your request pattern or IP has been flagged.
ThorData's residential proxy network provides residential IP rotation that prevents individual IPs from accumulating enough requests to trigger Letterboxd's pattern detection. This is especially relevant when scraping user profiles, where the URL pattern makes bot activity easy to detect from server logs.
THORDATA_PROXY_ROTATING = "http://USER:[email protected]:9001"
def create_proxied_client() -> httpx.Client:
return create_client(proxy=THORDATA_PROXY_ROTATING)
def scrape_with_rotation(
slugs: list[str],
scrape_reviews: bool = True,
proxy_url: Optional[str] = None,
delay: float = 2.5,
) -> list[dict]:
"""
Scrape multiple films with proxy rotation and polite pacing.
"""
client = create_client(proxy=proxy_url)
results = []
for i, slug in enumerate(slugs):
try:
film = scrape_film(client, slug)
if not film:
print(f" [{i+1}/{len(slugs)}] Not found: {slug}")
continue
result = {
"slug": slug,
"title": film.title,
"year": film.year,
"director": film.director,
"average_rating": film.average_rating,
"rating_count": film.rating_count,
"genres": film.genres,
"runtime_minutes": film.runtime_minutes,
"reviews": [],
}
if scrape_reviews:
reviews = scrape_film_reviews(client, slug, max_pages=2, delay=delay)
result["reviews_scraped"] = len(reviews)
result["reviews"] = [
{
"reviewer": r.reviewer,
"rating": r.rating,
"text": r.text[:500],
"date": r.date,
"likes": r.likes,
}
for r in reviews[:10]
]
results.append(result)
print(f" [{i+1}/{len(slugs)}] {film.title} ({film.year}) — {film.average_rating or 'N/A'} avg")
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
print(f" Rate limited on {slug}. Waiting 60s...")
time.sleep(60)
elif e.response.status_code == 403:
print(f" Blocked on {slug}. Waiting 30s and rotating...")
time.sleep(30)
# Re-create client to get fresh proxy IP
client.close()
client = create_client(proxy=proxy_url, ua=random.choice(USER_AGENTS))
else:
print(f" HTTP error on {slug}: {e.response.status_code}")
except Exception as e:
print(f" Error on {slug}: {e}")
time.sleep(delay + random.uniform(0, 1.0))
client.close()
return results
Storage Schema
def setup_letterboxd_db(db_path: str = "letterboxd.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS films (
slug TEXT PRIMARY KEY,
title TEXT,
year TEXT,
director TEXT,
average_rating TEXT,
rating_count INTEGER,
genres TEXT,
description TEXT,
runtime_minutes INTEGER,
countries TEXT,
languages TEXT,
image_url TEXT,
scraped_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
film_slug TEXT NOT NULL,
reviewer TEXT,
reviewer_url TEXT,
rating REAL,
review_text TEXT,
review_date TEXT,
likes INTEGER DEFAULT 0,
has_spoiler INTEGER DEFAULT 0,
scraped_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY (film_slug) REFERENCES films(slug)
);
CREATE TABLE IF NOT EXISTS popular_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
film_slug TEXT,
film_title TEXT,
position INTEGER,
time_period TEXT,
genre TEXT,
snapshot_date TEXT DEFAULT (date('now')),
captured_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS users (
username TEXT PRIMARY KEY,
films_count INTEGER,
following_count INTEGER,
followers_count INTEGER,
bio TEXT,
scraped_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS diary_entries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
username TEXT NOT NULL,
film_slug TEXT,
film_title TEXT,
watch_date TEXT,
rating REAL,
rewatch INTEGER DEFAULT 0,
liked INTEGER DEFAULT 0,
FOREIGN KEY (username) REFERENCES users(username),
UNIQUE (username, film_slug, watch_date)
);
CREATE INDEX IF NOT EXISTS idx_reviews_film ON reviews(film_slug);
CREATE INDEX IF NOT EXISTS idx_reviews_rating ON reviews(rating);
CREATE INDEX IF NOT EXISTS idx_reviews_date ON reviews(review_date);
CREATE INDEX IF NOT EXISTS idx_popular_date ON popular_snapshots(snapshot_date, time_period);
CREATE INDEX IF NOT EXISTS idx_diary_user ON diary_entries(username);
CREATE INDEX IF NOT EXISTS idx_diary_date ON diary_entries(watch_date);
""")
conn.commit()
return conn
def save_film(conn: sqlite3.Connection, film: LetterboxdFilm):
conn.execute("""
INSERT OR REPLACE INTO films
(slug, title, year, director, average_rating, rating_count, genres,
description, runtime_minutes, countries, languages, image_url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
film.slug, film.title, film.year, film.director,
film.average_rating, film.rating_count, json.dumps(film.genres),
film.description, film.runtime_minutes,
json.dumps(film.countries), json.dumps(film.languages), film.image_url,
))
conn.commit()
def save_reviews(conn: sqlite3.Connection, film_slug: str, reviews: list[LetterboxdReview]):
conn.executemany("""
INSERT OR IGNORE INTO reviews
(film_slug, reviewer, reviewer_url, rating, review_text, review_date, likes, has_spoiler)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", [
(film_slug, r.reviewer, r.reviewer_url, r.rating, r.text,
r.date, r.likes, int(r.has_spoiler_warning))
for r in reviews
])
conn.commit()
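Once rows are in SQLite, analysis is ordinary SQL. A self-contained example against an in-memory database with invented rows (only the columns the query touches are created here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (film_slug TEXT, rating REAL, likes INTEGER)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [("film-a", 4.5, 10), ("film-a", 3.5, 2), ("film-b", 2.0, 5), ("film-b", 1.0, 0)],
)

# Average scraped review rating per film, highest first
rows = conn.execute("""
    SELECT film_slug, ROUND(AVG(rating), 2) AS avg_rating, COUNT(*) AS n
    FROM reviews
    GROUP BY film_slug
    ORDER BY avg_rating DESC
""").fetchall()
print(rows)
# [('film-a', 4.0, 2), ('film-b', 1.5, 2)]
```

Note this averages only the reviews you scraped, which skews toward whatever sort mode you fetched; the site-wide average lives in the films table.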
A Complete Weekly Film Report Pipeline
def letterboxd_weekly_report(
db_path: str = "letterboxd.db",
proxy: Optional[str] = None,
max_films: int = 30,
) -> dict:
"""
Generate a weekly popular films report with reviews.
Scrapes this week's popular films and their top reviews.
"""
conn = setup_letterboxd_db(db_path)
client = create_client(proxy=proxy)
stats = {"films_scraped": 0, "reviews_scraped": 0, "errors": 0}
print("Fetching popular films this week...")
popular = scrape_popular_films(client, time_period="this/week", max_pages=2)
print(f"Found {len(popular)} popular films")
for film_info in popular[:max_films]:
slug = film_info.slug
print(f"\nScraping: {film_info.title}")
try:
film = scrape_film(client, slug)
if not film:
stats["errors"] += 1
continue
save_film(conn, film)
reviews = scrape_film_reviews(client, slug, max_pages=2, delay=2.0)
save_reviews(conn, slug, reviews)
# Save to popular snapshots
conn.execute("""
INSERT INTO popular_snapshots (film_slug, film_title, position, time_period, genre)
VALUES (?, ?, ?, 'this/week', 'all')
""", (slug, film.title, film_info.position))
conn.commit()
stats["films_scraped"] += 1
stats["reviews_scraped"] += len(reviews)
print(f" Rating: {film.average_rating} ({film.rating_count} ratings), {len(reviews)} reviews scraped")
except Exception as e:
print(f" Error: {e}")
stats["errors"] += 1
time.sleep(2.5)
client.close()
conn.close()
print(f"\n=== Report complete ===")
print(f"Films: {stats['films_scraped']}, Reviews: {stats['reviews_scraped']}, Errors: {stats['errors']}")
return stats
What to Watch Out For
HTML structure changes. Letterboxd updates their frontend periodically. The CSS class names used here (film-detail, body-text, poster-container, diary-entry-row) have been stable for years, but check your selectors when things break.
Logged-in vs. logged-out content. Some user profiles are private and return nothing useful. Reviews marked as spoilers are partially hidden. The scraper above collects public, accessible data only.
Rating scale conversion. Letterboxd uses a 0.5 to 5.0 scale. The CSS classes encode this as integers 1-10 (rated-7 means 3.5 stars). The code above converts to the standard scale. Be careful not to confuse rated-7 (3.5 stars) with a 7/10 rating.
Lists vs. watchlists. User "watchlists" and user "lists" are different endpoints with different HTML structures. The list scraper above works for published lists; watchlists use a similar poster grid but different container classes.
No API means fragile selectors. Everything here depends on HTML parsing. Write tests that verify your selectors still return expected results against a saved snapshot of each page type, and run them weekly. A simple CI check that scrapes one known film and validates the output is a reasonable canary.
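Such a canary can be a handful of asserts over one parsed record. A sketch of a validation helper (the function and the expected record shape are mine):

```python
def validate_film_record(record: dict) -> list:
    """Return a list of problems with a scraped film record; empty means it passed."""
    problems = []
    for key in ("slug", "title", "url"):
        if not record.get(key):
            problems.append(f"missing {key}")
    rating = record.get("average_rating")
    if rating is not None:
        try:
            r = float(rating)
        except (TypeError, ValueError):
            problems.append(f"unparseable rating: {rating!r}")
        else:
            if not 0.5 <= r <= 5.0:
                problems.append(f"rating out of range: {rating}")
    return problems

# A record a weekly CI job might build from scrape_film, then validate
ok = {"slug": "parasite-2019", "title": "Parasite",
      "url": "https://letterboxd.com/film/parasite-2019/", "average_rating": "4.5"}
bad = {"slug": "", "title": "Parasite", "url": "", "average_rating": "9.1"}
print(validate_film_record(ok), validate_film_record(bad))
# [] ['missing slug', 'missing url', 'rating out of range: 9.1']
```

If the canary scrape of a known film returns any problems, fail the CI job loudly and go inspect the selectors before trusting that week's data.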
Letterboxd is a rewarding target for film data because the community is genuinely engaged and the reviews have analytical substance. The lack of a public API makes it scraping-only territory, but the clean HTML and moderate anti-bot stance keep it accessible with basic tools, respectful rate limits, and ThorData residential proxies for production-scale collection.