Scraping Letterboxd Film Reviews, Ratings, and User Lists (2026)
Letterboxd is the social network that film nerds actually use. It has detailed ratings, thoughtful reviews, curated lists, and active community engagement around movies. It also has no public API. If you want Letterboxd data for research, recommendation engines, sentiment analysis, or film trend tracking, you're scraping HTML.
The good news is that Letterboxd's HTML is remarkably clean. Server-rendered pages, consistent CSS classes, predictable URL structures. It's one of the more pleasant scraping targets if you respect their rate limits. This guide covers extracting film ratings, review text, user activity, popular lists, and diary data — along with production-ready patterns for building a reliable film data pipeline.
The Value of Letterboxd Data
Letterboxd has unique data characteristics that make it worth the effort to scrape:
Review quality is high. Unlike Amazon or Google reviews, Letterboxd attracts film enthusiasts who write substantive reviews. The average review length and analytical depth are significantly higher than on other review platforms. This makes Letterboxd data particularly valuable for sentiment analysis, recommendation systems, and film criticism research.
Community curation is rich. User-created lists on Letterboxd are an underappreciated dataset. Lists like "Best Films of the 21st Century," "Essential Horror," or "Overlooked Science Fiction" represent aggregated editorial judgment that's hard to find elsewhere.
Rating distribution matters. Letterboxd's 0.5-5.0 scale with half-star increments creates a more granular distribution than most platforms' 1-5 integer ratings. The distribution of ratings across the community often reveals interesting bimodal patterns for controversial films.
Popular films lists are real-time signals. Letterboxd's "popular this week" list reflects actual community engagement, not algorithmic promotion. It's a cleaner signal for film trend tracking than streaming charts that are influenced by platform recommendations.
URL Structure
Understanding Letterboxd's URL patterns is essential:
Film page: https://letterboxd.com/film/{slug}/
Film reviews: https://letterboxd.com/film/{slug}/reviews/by/activity/
Film reviews (pop): https://letterboxd.com/film/{slug}/reviews/by/popularity/
Film reviews (p.N): https://letterboxd.com/film/{slug}/reviews/by/activity/page/{n}/
Film ratings: https://letterboxd.com/film/{slug}/ratings/
Film cast: https://letterboxd.com/film/{slug}/cast/
Film fans: https://letterboxd.com/film/{slug}/fans/
Film lists: https://letterboxd.com/film/{slug}/lists/
User profile: https://letterboxd.com/{username}/
User films: https://letterboxd.com/{username}/films/
User diary: https://letterboxd.com/{username}/films/diary/
User reviews: https://letterboxd.com/{username}/films/reviews/
User lists: https://letterboxd.com/{username}/lists/
User watchlist: https://letterboxd.com/{username}/watchlist/
Popular films: https://letterboxd.com/films/popular/
Popular this week: https://letterboxd.com/films/popular/this/week/
Popular this month: https://letterboxd.com/films/popular/this/month/
Genre popular: https://letterboxd.com/films/popular/genre/{genre}/
Director films: https://letterboxd.com/director/{name}/
Actor films: https://letterboxd.com/actor/{name}/
All of these return server-rendered HTML that BeautifulSoup can parse directly.
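These patterns are regular enough to centralize in a small helper, which keeps URL construction in one place when Letterboxd tweaks a path. A sketch (the function names are mine, not Letterboxd's):

```python
BASE = "https://letterboxd.com"

def film_url(slug: str, section: str = "", page: int = 0) -> str:
    """Build a film URL, e.g. film_url('parasite-2019', 'reviews/by/activity', 2)."""
    parts = [BASE, "film", slug]
    if section:
        parts.append(section.strip("/"))
    if page > 1:
        parts += ["page", str(page)]
    return "/".join(parts) + "/"

def user_url(username: str, section: str = "") -> str:
    """Build a user URL, e.g. user_url('alice', 'films/diary')."""
    parts = [BASE, username]
    if section:
        parts.append(section.strip("/"))
    return "/".join(parts) + "/"

print(film_url("parasite-2019", "reviews/by/activity", 2))
# -> https://letterboxd.com/film/parasite-2019/reviews/by/activity/page/2/
```

Trailing slashes matter: Letterboxd redirects slashless URLs, and skipping the redirect saves a round trip per request.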
Setup and Base Client
pip install httpx beautifulsoup4 lxml
import httpx
import time
import re
import json
import sqlite3
import random
from bs4 import BeautifulSoup
from dataclasses import dataclass, field
from typing import Optional
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
def create_client(proxy: Optional[str] = None, ua: Optional[str] = None) -> httpx.Client:
"""Create an httpx client with appropriate headers for Letterboxd."""
transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
return httpx.Client(
transport=transport,
headers={
"User-Agent": ua or random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Upgrade-Insecure-Requests": "1",
},
timeout=20,
follow_redirects=True,
)
def get_soup(client: httpx.Client, url: str, retries: int = 3) -> Optional[BeautifulSoup]:
"""Fetch a URL and return a BeautifulSoup object with retry logic."""
for attempt in range(retries):
try:
resp = client.get(url)
if resp.status_code == 200:
return BeautifulSoup(resp.text, "lxml")
elif resp.status_code == 404:
return None
elif resp.status_code == 429:
wait = float(resp.headers.get("Retry-After", 30 * (attempt + 1)))
print(f"Rate limited on {url}. Waiting {wait:.0f}s...")
time.sleep(wait)
elif resp.status_code == 503:
time.sleep(10 * (attempt + 1))
else:
print(f"Got {resp.status_code} for {url} (attempt {attempt + 1})")
time.sleep(2 * (attempt + 1))
except httpx.TimeoutException:
print(f"Timeout on {url} (attempt {attempt + 1})")
time.sleep(3 * (attempt + 1))
except httpx.ConnectError as e:
print(f"Connection error: {e}")
time.sleep(5)
return None
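The fixed sleeps above can be swapped for exponential backoff with full jitter, which spreads retries out in time and avoids many workers retrying in lockstep. A sketch of the delay schedule (the helper name is mine):

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: wait somewhere in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# Worst-case waits grow 2s, 4s, 8s, ... and never exceed the cap.
for attempt in range(4):
    print(f"attempt {attempt}: up to {min(60.0, 2.0 * 2 ** attempt):.0f}s")
```

Inside get_soup, `time.sleep(backoff_delay(attempt))` would replace the `2 * (attempt + 1)`-style sleeps.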
Scraping Film Metadata
Each film page contains structured data in both the HTML and JSON-LD:
@dataclass
class LetterboxdFilm:
slug: str
title: str
year: str
director: str
average_rating: Optional[str]
rating_count: Optional[int]
genres: list
description: str
image_url: str
url: str
runtime_minutes: Optional[int] = None
countries: list = field(default_factory=list)
languages: list = field(default_factory=list)
studios: list = field(default_factory=list)
def scrape_film(client: httpx.Client, film_slug: str) -> Optional[LetterboxdFilm]:
"""Scrape comprehensive film metadata from its Letterboxd page."""
url = f"https://letterboxd.com/film/{film_slug}/"
soup = get_soup(client, url)
if not soup:
return None
    # Extract JSON-LD structured data; Letterboxd wraps the JSON in CDATA-style
    # comments (/* <![CDATA[ */ ... /* ]]> */) that must be stripped before parsing
    ld_script = soup.find("script", type="application/ld+json")
    ld_data = {}
    if ld_script and ld_script.string:
        raw = re.sub(r"/\*.*?\*/", "", ld_script.string, flags=re.S).strip()
        try:
            ld_data = json.loads(raw)
        except json.JSONDecodeError:
            pass
# Average rating from meta tag
rating_meta = soup.find("meta", {"name": "twitter:data2"})
avg_rating = None
if rating_meta and rating_meta.get("content"):
content = rating_meta["content"]
match = re.search(r"([\d.]+)\s+out of 5", content)
if match:
avg_rating = match.group(1)
# Rating count
rating_count = None
for tooltip_el in soup.find_all("a", class_="tooltip"):
title_attr = tooltip_el.get("title", "")
if "rating" in title_attr.lower():
count_match = re.search(r"([\d,]+)\s+rating", title_attr)
if count_match:
try:
rating_count = int(count_match.group(1).replace(",", ""))
except ValueError:
pass
break
# Year from featured film header
year = ""
year_el = soup.find("small", class_="number")
if year_el:
year = year_el.text.strip()
elif soup.find("meta", property="og:title"):
og_title = soup.find("meta", property="og:title").get("content", "")
year_match = re.search(r"\((\d{4})\)", og_title)
if year_match:
year = year_match.group(1)
# Director from twitter meta
director = ""
director_meta = soup.find("meta", {"name": "twitter:data1"})
if director_meta:
director = director_meta.get("content", "")
# Genres from tab
genres = []
genre_tab = soup.find("div", id="tab-genres")
if genre_tab:
for a_tag in genre_tab.find_all("a", class_="text-slug"):
g = a_tag.text.strip()
if g:
genres.append(g)
# Runtime
runtime = None
runtime_el = soup.find("p", class_="text-link text-footer")
if not runtime_el:
runtime_el = soup.find("span", class_="duration")
if runtime_el:
runtime_text = runtime_el.get_text()
minutes_match = re.search(r"(\d+)\s*mins?", runtime_text)
if minutes_match:
runtime = int(minutes_match.group(1))
# Countries and languages from details tab
countries = []
languages = []
details_tab = soup.find("div", id="tab-details")
if details_tab:
for section in details_tab.find_all("div", class_=True):
label_el = section.find("h3")
if label_el:
label = label_el.text.strip().lower()
items = [a.text.strip() for a in section.find_all("a", class_="text-slug")]
if "countr" in label:
countries = items
elif "language" in label:
languages = items
return LetterboxdFilm(
slug=film_slug,
title=ld_data.get("name", ""),
year=year,
director=director,
average_rating=avg_rating,
rating_count=rating_count,
genres=genres,
description=(ld_data.get("description") or "")[:500],
image_url=ld_data.get("image", ""),
url=url,
runtime_minutes=runtime,
countries=countries,
languages=languages,
)
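The JSON-LD block is the most fragile dependency in scrape_film, so it is worth exercising offline. The snippet below runs the extraction against a saved, made-up page fragment; Letterboxd has historically wrapped the JSON in CDATA-style comments, which the helper strips defensively:

```python
import json
import re

# Invented fragment mimicking the relevant part of a film page; not real Letterboxd HTML.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
/* <![CDATA[ */
{"name": "Example Film", "image": "https://example.com/poster.jpg",
 "description": "A film that does not exist."}
/* ]]> */
</script>
</head></html>
"""

def extract_json_ld(html: str) -> dict:
    """Pull the first JSON-LD block, stripping any comment wrapper around the JSON."""
    m = re.search(r'<script type="application/ld\+json">(.*?)</script>', html, re.S)
    if not m:
        return {}
    body = re.sub(r"/\*.*?\*/", "", m.group(1), flags=re.S).strip()
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return {}

print(extract_json_ld(SAMPLE_HTML)["name"])  # Example Film
```

Saving one real film page to disk and running this parser against it makes a cheap regression test for the JSON-LD path.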
Extracting Reviews
Letterboxd reviews are paginated, with 12 reviews per page. The page structure is stable across the activity and popularity sort modes:
@dataclass
class LetterboxdReview:
reviewer: str
reviewer_url: str
rating: Optional[float] # 0.5 to 5.0 in 0.5 increments
text: str
date: str
likes: int
has_spoiler_warning: bool = False
is_featured: bool = False
def _extract_rating_from_classes(classes: list[str]) -> Optional[float]:
"""Convert Letterboxd CSS rating class to float (rated-7 -> 3.5)."""
for cls in classes:
if cls.startswith("rated-"):
try:
raw = int(cls.replace("rated-", ""))
return raw / 2 # 1-10 -> 0.5-5.0
except ValueError:
pass
return None
def scrape_film_reviews(
client: httpx.Client,
film_slug: str,
sort: str = "activity",
max_pages: int = 5,
delay: float = 2.0,
) -> list[LetterboxdReview]:
"""
Scrape reviews for a film.
sort: 'activity' (recent), 'popularity' (most liked), 'highest-rated', 'lowest-rated'
"""
reviews = []
for page in range(1, max_pages + 1):
url = f"https://letterboxd.com/film/{film_slug}/reviews/by/{sort}/page/{page}/"
soup = get_soup(client, url)
if soup is None:
break
review_items = soup.find_all("li", class_=lambda c: c and "film-detail" in c)
if not review_items:
break
for item in review_items:
# Reviewer info
reviewer_el = item.find("strong", class_="name")
reviewer = ""
reviewer_url = ""
if reviewer_el:
reviewer = reviewer_el.text.strip()
link = reviewer_el.find("a") or reviewer_el.parent
if link and link.get("href"):
reviewer_url = "https://letterboxd.com" + link.get("href", "")
# Rating
rating_el = item.find("span", class_="rating")
rating = _extract_rating_from_classes(rating_el.get("class", [])) if rating_el else None
# Review text — handle collapsed/spoiler versions
body_el = item.find("div", class_="body-text")
review_text = ""
has_spoiler = False
if body_el:
# Check for spoiler warning
spoiler_el = body_el.find("p", class_="contains-spoilers")
if spoiler_el:
has_spoiler = True
# Get text from collapsed or full version
collapsed = body_el.find("div", class_="collapsed-text")
full = body_el.find("div", class_="full-text")
text_el = collapsed or full or body_el
review_text = text_el.get_text(separator=" ", strip=True)[:2000]
# Date
date = ""
date_el = item.find("span", class_="_nobr")
if date_el:
date = date_el.text.strip()
else:
# Try time element
time_el = item.find("time")
if time_el:
date = time_el.get("datetime", time_el.text.strip())
# Like count
likes = 0
like_el = item.find("a", class_=lambda c: c and "has-icon" in c and "icon-like" in c)
if like_el:
like_text = like_el.get("title", like_el.text)
like_match = re.search(r"([\d,]+)\s+like", like_text, re.IGNORECASE)
if like_match:
try:
likes = int(like_match.group(1).replace(",", ""))
except ValueError:
pass
# Featured review indicator
is_featured = bool(item.find(class_=lambda c: c and "featured" in str(c).lower()))
if reviewer or review_text:
reviews.append(LetterboxdReview(
reviewer=reviewer,
reviewer_url=reviewer_url,
rating=rating,
text=review_text,
date=date,
likes=likes,
has_spoiler_warning=has_spoiler,
is_featured=is_featured,
))
time.sleep(delay)
return reviews
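With reviews in hand, a half-star histogram makes the bimodal distributions mentioned earlier visible at a glance. A minimal sketch over plain rating values (the helper name is mine):

```python
from collections import Counter

def rating_histogram(ratings: list) -> dict:
    """Count ratings per half-star bucket, ignoring reviews with no rating."""
    return dict(sorted(Counter(r for r in ratings if r is not None).items()))

# e.g. [r.rating for r in scrape_film_reviews(...)] would feed this directly
sample = [4.0, 4.5, 0.5, 4.0, None, 1.0, 4.5, 0.5]
print(rating_histogram(sample))
# {0.5: 2, 1.0: 1, 4.0: 2, 4.5: 2}
```

A distribution with heavy mass at both ends and a thin middle is the classic "love it or hate it" shape.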
Scraping Popular Films Lists
@dataclass
class LetterboxdListEntry:
title: str
slug: str
year: Optional[str]
position: int
poster_url: str
url: str
def scrape_popular_films(
client: httpx.Client,
time_period: str = "this/week",
genre: Optional[str] = None,
decade: Optional[str] = None,
max_pages: int = 3,
delay: float = 2.0,
) -> list[LetterboxdListEntry]:
"""
Scrape Letterboxd popular films lists.
time_period: 'this/week', 'this/month', 'this/year', '' (all time)
genre: 'horror', 'drama', 'comedy', etc.
decade: '2020s', '2010s', '1980s', etc.
"""
films = []
for page in range(1, max_pages + 1):
# Build URL
base = "https://letterboxd.com/films/popular"
url_parts = [base]
if time_period:
url_parts.append(time_period)
if genre:
url_parts.append(f"genre/{genre}")
if decade:
url_parts.append(f"decade/{decade}")
url_parts.append(f"page/{page}/")
url = "/".join(url_parts)
soup = get_soup(client, url)
if not soup:
break
posters = soup.find_all("li", class_="poster-container")
if not posters:
break
for poster_li in posters:
div = poster_li.find("div", class_="film-poster")
if not div:
continue
slug = div.get("data-film-slug", "")
film_id = div.get("data-film-id", "")
img = poster_li.find("img")
title = img.get("alt", "") if img else ""
# Poster URL
poster_url = ""
if img:
poster_url = img.get("src", "") or img.get("data-src", "")
# Year (sometimes in data attribute)
year = div.get("data-film-release-year", "")
films.append(LetterboxdListEntry(
title=title,
slug=slug,
year=year or None,
position=len(films) + 1,
poster_url=poster_url,
url=f"https://letterboxd.com/film/{slug}/",
))
time.sleep(delay)
return films
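Popularity snapshots pay off when compared across weeks. This sketch diffs two ordered slug lists to surface new entries and rank moves (the film slugs here are illustrative):

```python
def diff_popular(last_week: list, this_week: list) -> dict:
    """Compare two ordered slug lists and report new entries and rank changes."""
    prev_pos = {slug: i + 1 for i, slug in enumerate(last_week)}
    new, moved = [], []
    for i, slug in enumerate(this_week, start=1):
        if slug not in prev_pos:
            new.append(slug)
        elif prev_pos[slug] != i:
            moved.append((slug, prev_pos[slug], i))  # (slug, old rank, new rank)
    return {"new": new, "moved": moved}

result = diff_popular(
    ["dune-part-two", "poor-things", "anatomy-of-a-fall"],
    ["poor-things", "dune-part-two", "challengers"],
)
print(result)
# {'new': ['challengers'], 'moved': [('poor-things', 2, 1), ('dune-part-two', 1, 2)]}
```

Feeding this from the popular_snapshots table (one query per snapshot_date) turns the raw scrapes into a weekly trend report.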
def scrape_films_by_genre(
client: httpx.Client,
genre: str,
sort: str = "popular",
decade: Optional[str] = None,
max_pages: int = 5,
) -> list[LetterboxdListEntry]:
"""
Scrape films filtered by genre.
sort: 'popular', 'average-rating', 'date', 'name'
"""
films = []
    for page in range(1, max_pages + 1):
        # 'popular' maps to the /films/popular/ path; other sorts use a /by/{sort}/
        # segment (verify the segment order against the live site before relying on it)
        if sort == "popular":
            base = f"https://letterboxd.com/films/popular/genre/{genre}"
        else:
            base = f"https://letterboxd.com/films/genre/{genre}/by/{sort}"
        if decade:
            base += f"/decade/{decade}"
        url = f"{base}/page/{page}/"
soup = get_soup(client, url)
if not soup:
break
posters = soup.find_all("li", class_="poster-container")
if not posters:
break
for poster_li in posters:
div = poster_li.find("div", class_="film-poster")
if not div:
continue
slug = div.get("data-film-slug", "")
img = poster_li.find("img")
title = img.get("alt", "") if img else ""
films.append(LetterboxdListEntry(
title=title, slug=slug, year=None,
position=len(films) + 1,
poster_url="",
url=f"https://letterboxd.com/film/{slug}/",
))
time.sleep(2.0)
return films
Scraping User Profiles, Diary, and Lists
@dataclass
class LetterboxdUserProfile:
username: str
bio: str
films_count: int
following_count: int
followers_count: int
lists_count: int
favorite_films: list
@dataclass
class LetterboxdDiaryEntry:
title: str
slug: str
date: str
rating: Optional[float]
rewatch: bool
liked: bool
@dataclass
class LetterboxdList:
title: str
slug: str
url: str
film_count: int
description: str
published_date: str
likes: int
tags: list
def scrape_user_profile(client: httpx.Client, username: str) -> Optional[LetterboxdUserProfile]:
"""Scrape a user's public profile page."""
url = f"https://letterboxd.com/{username}/"
soup = get_soup(client, url)
if not soup:
return None
# Check if profile is private
if soup.find("section", class_="error"):
return None
    # Stats: the label lives in span.definition, the number in span.value
    stats = {}
    for stat_el in soup.find_all("h4", class_="profile-statistic"):
        value_el = stat_el.find("span", class_="value")
        label_el = stat_el.find("span", class_="definition")
        if value_el and label_el:
            label = label_el.text.strip().lower()
            value_text = value_el.text.strip().lower().replace(",", "")
            try:
                # Letterboxd abbreviates large counts, e.g. "1.2k"
                if value_text.endswith("k"):
                    stats[label] = int(float(value_text[:-1]) * 1000)
                else:
                    stats[label] = int(value_text)
            except ValueError:
                pass
# Bio
bio = ""
bio_el = soup.find("div", class_="body-text")
if bio_el:
bio = bio_el.get_text(separator=" ", strip=True)[:500]
# Favorite films
favorites = []
fav_section = soup.find("section", id="favourites")
if fav_section:
for poster_div in fav_section.find_all("div", class_="film-poster"):
img = poster_div.find("img")
if img:
favorites.append({
"title": img.get("alt", ""),
"slug": poster_div.get("data-film-slug", ""),
})
return LetterboxdUserProfile(
username=username,
bio=bio,
films_count=stats.get("films", 0),
following_count=stats.get("following", 0),
followers_count=stats.get("followers", 0),
lists_count=stats.get("lists", 0),
favorite_films=favorites,
)
def scrape_user_diary(
client: httpx.Client,
username: str,
max_pages: int = 5,
year: Optional[int] = None,
delay: float = 2.0,
) -> list[LetterboxdDiaryEntry]:
"""
Scrape a user's film diary (watches with dates and ratings).
year: filter to a specific year (e.g., 2024)
"""
entries = []
base_url = f"https://letterboxd.com/{username}/films/diary"
if year:
base_url = f"{base_url}/for/{year}"
for page in range(1, max_pages + 1):
url = f"{base_url}/page/{page}/"
soup = get_soup(client, url)
if not soup:
break
rows = soup.find_all("tr", class_="diary-entry-row")
if not rows:
break
for row in rows:
# Film info
title_cell = row.find("td", class_="td-film-details")
title = ""
slug = ""
if title_cell:
link_el = title_cell.find("a", href=True)
if link_el:
title = link_el.text.strip()
href = link_el.get("href", "")
if "/film/" in href:
slug = href.split("/film/")[-1].strip("/")
# Watch date
date = ""
cal_cell = row.find("td", class_="td-calendar")
if cal_cell:
date_link = cal_cell.find("a", href=True)
if date_link:
href = date_link.get("href", "")
# URL format: /username/films/diary/for/2024/01/15/
date_match = re.search(r"/for/(\d{4}/\d{2}/\d{2})/", href)
if date_match:
date = date_match.group(1).replace("/", "-")
# Rating
rating_el = row.find("td", class_="td-rating")
rating = None
if rating_el:
span = rating_el.find("span", class_="rating")
if span:
rating = _extract_rating_from_classes(span.get("class", []))
# Rewatch
rewatch = bool(row.find("td", class_="td-rewatch"))
# Liked
liked_el = row.find("td", class_="td-like")
liked = bool(liked_el and liked_el.find(class_=lambda c: c and "liked" in c))
if title:
entries.append(LetterboxdDiaryEntry(
title=title,
slug=slug,
date=date,
rating=rating,
rewatch=rewatch,
liked=liked,
))
time.sleep(delay)
return entries
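Because diary dates come out as YYYY-MM-DD strings, monthly viewing counts reduce to a prefix grouping. A sketch over invented dates (the helper name is mine):

```python
from collections import Counter

def films_per_month(dates: list) -> dict:
    """Count diary entries per YYYY-MM bucket; entries without a date are skipped."""
    return dict(sorted(Counter(d[:7] for d in dates if d).items()))

# e.g. [e.date for e in scrape_user_diary(...)] would feed this directly
dates = ["2024-01-15", "2024-01-28", "2024-02-03", "", "2024-02-14", "2024-02-20"]
print(films_per_month(dates))
# {'2024-01': 2, '2024-02': 3}
```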
def scrape_user_lists(
client: httpx.Client,
username: str,
max_pages: int = 3,
delay: float = 2.0,
) -> list[LetterboxdList]:
"""Scrape a user's published lists."""
lists = []
for page in range(1, max_pages + 1):
url = f"https://letterboxd.com/{username}/lists/page/{page}/"
soup = get_soup(client, url)
if not soup:
break
        # Each published list is headed by an h2.list-title; stop when a page has none
        title_els = soup.find_all("h2", class_=lambda c: c and "list-title" in str(c))
        if not title_els:
            break
        for item in title_els:
link = item.find("a", href=True)
if not link:
continue
href = link.get("href", "")
list_url = "https://letterboxd.com" + href if href.startswith("/") else href
# Film count
count = 0
count_el = item.find_next("small", class_=lambda c: c and "subtitle" in str(c))
if count_el:
count_match = re.search(r"(\d+)\s+film", count_el.text)
if count_match:
count = int(count_match.group(1))
lists.append(LetterboxdList(
title=link.text.strip(),
slug=href.split("/")[-2] if href else "",
url=list_url,
film_count=count,
description="",
published_date="",
likes=0,
tags=[],
))
time.sleep(delay)
return lists
def scrape_list_films(
client: httpx.Client,
list_url: str,
max_pages: int = 10,
delay: float = 2.0,
) -> list[dict]:
"""Scrape all films from a Letterboxd list."""
films = []
for page in range(1, max_pages + 1):
url = f"{list_url.rstrip('/')}/page/{page}/"
soup = get_soup(client, url)
if not soup:
break
posters = soup.find_all("li", class_="poster-container")
if not posters:
break
        for li in posters:
div = li.find("div", class_="film-poster")
if not div:
continue
img = li.find("img")
films.append({
"position": len(films) + 1,
"title": img.get("alt", "") if img else "",
"slug": div.get("data-film-slug", ""),
"year": div.get("data-film-release-year", ""),
"url": f"https://letterboxd.com/film/{div.get('data-film-slug', '')}/",
})
time.sleep(delay)
return films
Anti-Bot Measures and Production Setup
Letterboxd is not aggressively protected, but it does enforce meaningful limits.
Rate limiting. Letterboxd will return 429 responses if you exceed roughly 1 request per second sustained. The 2-second sleep in the examples above keeps you comfortably under this threshold.
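Rather than scattering time.sleep calls across every loop, a small throttle object can enforce the interval centrally (the class name is mine):

```python
import time

class Throttle:
    """Enforce a minimum interval between requests, regardless of call site."""
    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> float:
        """Sleep just long enough to honor the interval; returns the time slept."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
        return max(0.0, remaining)

throttle = Throttle(min_interval=0.1)
throttle.wait()          # first call returns immediately
slept = throttle.wait()  # second call sleeps to maintain spacing
print(f"slept {slept:.2f}s")
```

Calling throttle.wait() right before each client.get keeps the whole pipeline under the target rate even when several scraper functions share one client.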
IP blocking. Sustained scraping from a single IP — even at polite rates — will eventually get noticed. Letterboxd monitors for patterns: the same IP requesting hundreds of review pages, or systematically crawling user profiles. For any collection beyond a few hundred pages, IP rotation is needed.
Cloudflare. Letterboxd uses Cloudflare at a moderate protection level. Most requests pass without challenge. If you see Cloudflare challenge pages (the "checking your browser" interstitial), your request pattern or IP has been flagged.
ThorData's residential proxy network provides residential IP rotation that prevents individual IPs from accumulating enough requests to trigger Letterboxd's pattern detection. This is especially relevant when scraping user profiles, where the URL pattern makes bot activity easy to detect from server logs.
THORDATA_PROXY_ROTATING = "http://USER:[email protected]:9001"
def create_proxied_client() -> httpx.Client:
return create_client(proxy=THORDATA_PROXY_ROTATING)
def scrape_with_rotation(
slugs: list[str],
scrape_reviews: bool = True,
proxy_url: Optional[str] = None,
delay: float = 2.5,
) -> list[dict]:
"""
Scrape multiple films with proxy rotation and polite pacing.
"""
client = create_client(proxy=proxy_url)
results = []
for i, slug in enumerate(slugs):
try:
film = scrape_film(client, slug)
if not film:
print(f" [{i+1}/{len(slugs)}] Not found: {slug}")
continue
result = {
"slug": slug,
"title": film.title,
"year": film.year,
"director": film.director,
"average_rating": film.average_rating,
"rating_count": film.rating_count,
"genres": film.genres,
"runtime_minutes": film.runtime_minutes,
"reviews": [],
}
if scrape_reviews:
reviews = scrape_film_reviews(client, slug, max_pages=2, delay=delay)
result["reviews_scraped"] = len(reviews)
result["reviews"] = [
{
"reviewer": r.reviewer,
"rating": r.rating,
"text": r.text[:500],
"date": r.date,
"likes": r.likes,
}
for r in reviews[:10]
]
results.append(result)
print(f" [{i+1}/{len(slugs)}] {film.title} ({film.year}) — {film.average_rating or 'N/A'} avg")
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
print(f" Rate limited on {slug}. Waiting 60s...")
time.sleep(60)
elif e.response.status_code == 403:
print(f" Blocked on {slug}. Waiting 30s and rotating...")
time.sleep(30)
# Re-create client to get fresh proxy IP
client.close()
client = create_client(proxy=proxy_url, ua=random.choice(USER_AGENTS))
else:
print(f" HTTP error on {slug}: {e.response.status_code}")
except Exception as e:
print(f" Error on {slug}: {e}")
time.sleep(delay + random.uniform(0, 1.0))
client.close()
return results
Storage Schema
def setup_letterboxd_db(db_path: str = "letterboxd.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS films (
slug TEXT PRIMARY KEY,
title TEXT,
year TEXT,
director TEXT,
average_rating TEXT,
rating_count INTEGER,
genres TEXT,
description TEXT,
runtime_minutes INTEGER,
countries TEXT,
languages TEXT,
image_url TEXT,
scraped_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
film_slug TEXT NOT NULL,
reviewer TEXT,
reviewer_url TEXT,
rating REAL,
review_text TEXT,
review_date TEXT,
likes INTEGER DEFAULT 0,
has_spoiler INTEGER DEFAULT 0,
scraped_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY (film_slug) REFERENCES films(slug)
);
CREATE TABLE IF NOT EXISTS popular_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
film_slug TEXT,
film_title TEXT,
position INTEGER,
time_period TEXT,
genre TEXT,
snapshot_date TEXT DEFAULT (date('now')),
captured_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS users (
username TEXT PRIMARY KEY,
films_count INTEGER,
following_count INTEGER,
followers_count INTEGER,
bio TEXT,
scraped_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS diary_entries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
username TEXT NOT NULL,
film_slug TEXT,
film_title TEXT,
watch_date TEXT,
rating REAL,
rewatch INTEGER DEFAULT 0,
liked INTEGER DEFAULT 0,
FOREIGN KEY (username) REFERENCES users(username),
UNIQUE (username, film_slug, watch_date)
);
CREATE INDEX IF NOT EXISTS idx_reviews_film ON reviews(film_slug);
CREATE INDEX IF NOT EXISTS idx_reviews_rating ON reviews(rating);
CREATE INDEX IF NOT EXISTS idx_reviews_date ON reviews(review_date);
CREATE INDEX IF NOT EXISTS idx_popular_date ON popular_snapshots(snapshot_date, time_period);
CREATE INDEX IF NOT EXISTS idx_diary_user ON diary_entries(username);
CREATE INDEX IF NOT EXISTS idx_diary_date ON diary_entries(watch_date);
""")
conn.commit()
return conn
def save_film(conn: sqlite3.Connection, film: LetterboxdFilm):
conn.execute("""
INSERT OR REPLACE INTO films
(slug, title, year, director, average_rating, rating_count, genres,
description, runtime_minutes, countries, languages, image_url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
film.slug, film.title, film.year, film.director,
film.average_rating, film.rating_count, json.dumps(film.genres),
film.description, film.runtime_minutes,
json.dumps(film.countries), json.dumps(film.languages), film.image_url,
))
conn.commit()
def save_reviews(conn: sqlite3.Connection, film_slug: str, reviews: list[LetterboxdReview]):
conn.executemany("""
INSERT OR IGNORE INTO reviews
(film_slug, reviewer, reviewer_url, rating, review_text, review_date, likes, has_spoiler)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", [
(film_slug, r.reviewer, r.reviewer_url, r.rating, r.text,
r.date, r.likes, int(r.has_spoiler_warning))
for r in reviews
])
conn.commit()
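Once rows are in SQLite, analysis is ordinary SQL. A self-contained example against an in-memory database with invented rows (only the columns the query touches are created here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (film_slug TEXT, rating REAL, likes INTEGER)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [("film-a", 4.5, 10), ("film-a", 3.5, 2), ("film-b", 2.0, 5), ("film-b", 1.0, 0)],
)

# Average scraped review rating per film, highest first
rows = conn.execute("""
    SELECT film_slug, ROUND(AVG(rating), 2) AS avg_rating, COUNT(*) AS n
    FROM reviews
    GROUP BY film_slug
    ORDER BY avg_rating DESC
""").fetchall()
print(rows)
# [('film-a', 4.0, 2), ('film-b', 1.5, 2)]
```

Note this averages only the reviews you scraped, which skews toward whatever sort mode you fetched; the site-wide average lives in the films table.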
A Complete Weekly Film Report Pipeline
def letterboxd_weekly_report(
db_path: str = "letterboxd.db",
proxy: Optional[str] = None,
max_films: int = 30,
) -> dict:
"""
Generate a weekly popular films report with reviews.
Scrapes this week's popular films and their top reviews.
"""
conn = setup_letterboxd_db(db_path)
client = create_client(proxy=proxy)
stats = {"films_scraped": 0, "reviews_scraped": 0, "errors": 0}
print("Fetching popular films this week...")
popular = scrape_popular_films(client, time_period="this/week", max_pages=2)
print(f"Found {len(popular)} popular films")
for film_info in popular[:max_films]:
slug = film_info.slug
print(f"\nScraping: {film_info.title}")
try:
film = scrape_film(client, slug)
if not film:
stats["errors"] += 1
continue
save_film(conn, film)
reviews = scrape_film_reviews(client, slug, max_pages=2, delay=2.0)
save_reviews(conn, slug, reviews)
# Save to popular snapshots
conn.execute("""
INSERT INTO popular_snapshots (film_slug, film_title, position, time_period, genre)
VALUES (?, ?, ?, 'this/week', 'all')
""", (slug, film.title, film_info.position))
conn.commit()
stats["films_scraped"] += 1
stats["reviews_scraped"] += len(reviews)
print(f" Rating: {film.average_rating} ({film.rating_count} ratings), {len(reviews)} reviews scraped")
except Exception as e:
print(f" Error: {e}")
stats["errors"] += 1
time.sleep(2.5)
client.close()
conn.close()
print(f"\n=== Report complete ===")
print(f"Films: {stats['films_scraped']}, Reviews: {stats['reviews_scraped']}, Errors: {stats['errors']}")
return stats
What to Watch Out For
HTML structure changes. Letterboxd updates their frontend periodically. The CSS class names used here (film-detail, body-text, poster-container, diary-entry-row) have been stable for years, but check your selectors when things break.
Logged-in vs. logged-out content. Some user profiles are private and return nothing useful. Reviews marked as spoilers are partially hidden. The scraper above collects public, accessible data only.
Rating scale conversion. Letterboxd uses a 0.5 to 5.0 scale. The CSS classes encode this as integers 1-10 (rated-7 means 3.5 stars). The code above converts to the standard scale. Be careful not to confuse rated-7 (3.5 stars) with a 7/10 rating.
Lists vs. watchlists. User "watchlists" and user "lists" are different endpoints with different HTML structures. The list scraper above works for published lists; watchlists use a similar poster grid but different container classes.
No API means fragile selectors. Everything here depends on HTML parsing. Write tests that verify your selectors still return expected results against a saved snapshot of each page type, and run them weekly. A simple CI check that scrapes one known film and validates the output is a reasonable canary.
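Such a canary can be a handful of asserts over one parsed record. A sketch of a validation helper (the function and the expected record shape are mine):

```python
def validate_film_record(record: dict) -> list:
    """Return a list of problems with a scraped film record; empty means it passed."""
    problems = []
    for key in ("slug", "title", "url"):
        if not record.get(key):
            problems.append(f"missing {key}")
    rating = record.get("average_rating")
    if rating is not None:
        try:
            r = float(rating)
        except (TypeError, ValueError):
            problems.append(f"unparseable rating: {rating!r}")
        else:
            if not 0.5 <= r <= 5.0:
                problems.append(f"rating out of range: {rating}")
    return problems

# A record a weekly CI job might build from scrape_film, then validate
ok = {"slug": "parasite-2019", "title": "Parasite",
      "url": "https://letterboxd.com/film/parasite-2019/", "average_rating": "4.5"}
bad = {"slug": "", "title": "Parasite", "url": "", "average_rating": "9.1"}
print(validate_film_record(ok), validate_film_record(bad))
# [] ['missing slug', 'missing url', 'rating out of range: 9.1']
```

If the canary scrape of a known film returns any problems, fail the CI job loudly and go inspect the selectors before trusting that week's data.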
Letterboxd is a rewarding target for film data because the community is genuinely engaged and the reviews have analytical substance. The lack of a public API makes it scraping-only territory, but the clean HTML and moderate anti-bot stance keep it accessible with basic tools, respectful rate limits, and ThorData residential proxies for production-scale collection.