Scraping BoardGameGeek Data with Python (2026)
BoardGameGeek is the definitive database for tabletop games — over 140,000 games, 20 million user ratings, and one of the richest structured datasets in any hobbyist domain. For analysts, app developers, and recommendation engine builders, it's an extraordinary resource. BGG even provides a free XML API, which is both a gift and a source of frustration. The API is underdocumented, aggressively rate-limited, and occasionally responds with a 202 that tells you to come back later. This guide covers what the API exposes, how to parse it, and how to handle it without getting blocked.
What Data BGG Exposes
The BGG XML API v2 gives you access to:
- Game details — name, description, year published, minimum and maximum players, play time, age rating, BGG weight (complexity score)
- Ratings and statistics — average rating, Bayesian average (geek rating), number of ratings, rank within categories
- Mechanics — worker placement, deck building, area control, cooperative play, etc.
- Categories — fantasy, war games, economic, card games, etc.
- Designers, artists, publishers — all linked entities with their own IDs
- User collections — every game in a user's library with personal ratings, play counts, and status flags (owned, wishlist, for-trade, etc.)
- Play logs — individual logged plays with date, player count, location, and comments
This is real structured data, not scraped HTML — which makes it far more reliable than most scraping targets. But the API is XML-only, and the XML structure has some rough edges.
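Those rough edges are worth seeing up front: in API v2, most values live in XML attributes rather than element text, and a game carries several `<name>` elements distinguished by a `type` attribute. A minimal parsing sketch against a hand-written sample (the XML below is illustrative, mirroring the shape of a `/thing` response rather than reproducing one):

```python
import xml.etree.ElementTree as ET

# Hand-written sample mirroring the shape of a /thing response
SAMPLE = """
<items>
  <item type="boardgame" id="266192">
    <name type="primary" sortindex="1" value="Wingspan"/>
    <name type="alternate" sortindex="1" value="Fluegelschlag"/>
    <yearpublished value="2019"/>
    <minplayers value="1"/>
    <maxplayers value="5"/>
  </item>
</items>
"""

root = ET.fromstring(SAMPLE)
item = root.find("item")

# The element text is empty — the data sits in the value attribute
assert item.find("yearpublished").text is None

primary = next(
    n.get("value") for n in item.findall("name") if n.get("type") == "primary"
)
year = item.find("yearpublished").get("value")
print(primary, year)  # Wingspan 2019
```

Forgetting this attribute-vs-text distinction is the single most common parsing mistake with this API.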
API Endpoints Overview
All endpoints live under https://boardgamegeek.com/xmlapi2/.
| Endpoint | Purpose |
|---|---|
| `/xmlapi2/thing` | Game details, ratings, mechanics, links |
| `/xmlapi2/collection` | A user's game collection with personal data |
| `/xmlapi2/search` | Search games by name |
| `/xmlapi2/plays` | A user's logged plays |
The thing endpoint is the workhorse. You pass one or more game IDs and get full details back. The collection endpoint is more complex — it queues requests server-side and may return a 202 on first hit.
Rate Limiting and Anti-Bot Behavior
BGG's API is free and public, but BGG treats it like a fragile internal service. The rate limiting is aggressive and inconsistently documented.
What you will hit in practice:
- 429 responses — BGG starts returning 429s quickly if you issue requests without delays. In testing, consecutive requests with less than 2-second gaps reliably trigger throttling. Use 5+ second delays between requests to stay clean.
- 202 "please wait" responses — The /xmlapi2/collection endpoint is notorious for this. BGG queues the data export on the server side. Your first request returns HTTP 202 with a message body asking you to retry. You must poll with retries until you get a 200. This is by design, not a bug — but many scrapers silently fail here.
- IP blocks — Rapid requests from a single IP, especially to /xmlapi2/thing with many IDs, will result in temporary IP blocks. BGG does not communicate these clearly; you typically get connection timeouts or HTML error pages instead of proper API error codes.
- No authentication — There are no API keys. Your IP is your identity, which makes rate limit sharing across concurrent processes a real concern.
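Before reaching for proxies, the cheapest mitigation is pacing a single process so consecutive requests never land closer together than your chosen gap. A minimal sketch (`MinIntervalThrottle` is a name invented here, and the 5-second default is an assumption based on the throttling behavior above, not a documented limit):

```python
import time

class MinIntervalThrottle:
    """Enforce a minimum gap between requests from this process."""

    def __init__(self, min_gap: float = 5.0):
        self.min_gap = min_gap
        self._last = 0.0

    def wait(self) -> None:
        # Sleep only for the remainder of the gap, so time spent
        # parsing between requests counts toward the spacing
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_gap:
            time.sleep(self.min_gap - elapsed)
        self._last = time.monotonic()

# throttle = MinIntervalThrottle(5.0)
# throttle.wait()  # call before every BGG request
```

This keeps a single worker under the radar; it does nothing for multiple concurrent processes sharing one IP, which is where proxy rotation comes in.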
For any production use — price tracking, recommendation systems, dataset collection — you need to distribute requests across multiple IPs. ThorData is a good fit here: their residential proxy pool rotates IPs automatically, so each BGG request appears to come from a different address and you avoid the cumulative throttling that kills single-IP scrapers.
Setup
pip install httpx
No HTML parsing libraries needed — everything comes back as XML, which Python handles natively.
import httpx
import xml.etree.ElementTree as ET
import time
BASE_URL = "https://boardgamegeek.com/xmlapi2"
def get(path: str, params: dict | None = None, retries: int = 5) -> ET.Element:
"""Make a BGG API request with retry logic for 429s and 202s."""
url = f"{BASE_URL}{path}"
for attempt in range(retries):
resp = httpx.get(url, params=params, timeout=30)
if resp.status_code == 200:
return ET.fromstring(resp.text)
elif resp.status_code == 202:
# BGG is building the response — retry after delay
print(f"Got 202, retrying in 10s (attempt {attempt + 1})")
time.sleep(10)
elif resp.status_code == 429:
wait = 2 ** attempt * 5
print(f"Rate limited, waiting {wait}s")
time.sleep(wait)
else:
raise Exception(f"Unexpected status {resp.status_code}: {resp.text[:200]}")
raise Exception(f"Failed after {retries} retries for {url}")
Fetching Game Details
The /xmlapi2/thing endpoint accepts a comma-separated list of game IDs and an optional stats=1 flag to include ratings.
def _attr(parent: ET.Element | None, tag: str) -> str | None:
    """Read a child element's value attribute (API v2 stores data in attributes)."""
    el = parent.find(tag) if parent is not None else None
    return el.get("value") if el is not None else None

def get_games(ids: list[int]) -> list[dict]:
    """Fetch full details for one or more games by BGG ID."""
    id_str = ",".join(str(i) for i in ids)
    root = get("/thing", params={"id": id_str, "stats": 1, "type": "boardgame"})
    games = []
    for item in root.findall("item"):
        # Primary name — games carry several <name> elements; type="primary" is canonical
        name = ""
        for n in item.findall("name"):
            if n.get("type") == "primary":
                name = n.get("value", "")
                break
        # Statistics — values live in attributes, not element text
        stats = item.find("statistics/ratings")
        avg_rating = bgg_rating = num_ratings = rank = None
        if stats is not None:
            avg_rating = float(_attr(stats, "average") or 0)
            bgg_rating = float(_attr(stats, "bayesaverage") or 0)
            num_ratings = int(_attr(stats, "usersrated") or 0)
            for r in stats.findall("ranks/rank"):
                if r.get("name") == "boardgame":
                    try:
                        rank = int(r.get("value"))
                    except (TypeError, ValueError):
                        rank = None  # unranked games report value="Not Ranked"
                    break
        # Links — mechanics, categories, designers, publishers
        mechanics, categories, designers, publishers = [], [], [], []
        for link in item.findall("link"):
            ltype = link.get("type")
            lval = link.get("value", "")
            if ltype == "boardgamemechanic":
                mechanics.append(lval)
            elif ltype == "boardgamecategory":
                categories.append(lval)
            elif ltype == "boardgamedesigner":
                designers.append(lval)
            elif ltype == "boardgamepublisher":
                publishers.append(lval)
        games.append({
            "id": int(item.get("id")),
            "name": name,
            # These fields use value attributes; only <description> holds text
            "year_published": _attr(item, "yearpublished"),
            "min_players": _attr(item, "minplayers"),
            "max_players": _attr(item, "maxplayers"),
            "play_time": _attr(item, "playingtime"),
            "min_age": _attr(item, "minage"),
            "description": (item.findtext("description") or "").strip(),
            "avg_rating": avg_rating,
            "bgg_rating": bgg_rating,
            "num_ratings": num_ratings,
            "bgg_rank": rank,
            "weight": float(_attr(stats, "averageweight") or 0) if stats is not None else None,
            "mechanics": mechanics,
            "categories": categories,
            "designers": designers,
            "publishers": publishers,
        })
    return games
Searching Games
def search_games(query: str, exact: bool = False) -> list[dict]:
"""Search BGG by game name. Returns IDs and names only — fetch details separately."""
params = {"query": query, "type": "boardgame"}
if exact:
params["exact"] = 1
root = get("/search", params=params)
results = []
for item in root.findall("item"):
        name_el = item.find("name")
        year_el = item.find("yearpublished")
        results.append({
            "id": int(item.get("id")),
            "name": name_el.get("value") if name_el is not None else None,
            # The year also lives in a value attribute, not element text
            "year": year_el.get("value") if year_el is not None else None,
        })
return results
Fetching User Collections
This is where the 202 pattern shows up most often. BGG processes collection exports asynchronously — the get() helper above handles the retry loop automatically.
def get_collection(username: str, status: str = "own") -> list[dict]:
"""
Fetch a user's game collection.
status options: own, wishlist, wanttoplay, fortrade, prevowned
"""
params = {"username": username, status: 1, "stats": 1}
root = get("/collection", params=params)
items = []
for item in root.findall("item"):
status_el = item.find("status")
stats_el = item.find("stats/rating")
user_rating = None
if stats_el is not None:
try:
user_rating = float(stats_el.get("value"))
except (TypeError, ValueError):
user_rating = None
items.append({
"game_id": int(item.get("objectid")),
"name": item.findtext("name"),
"year_published": item.findtext("yearpublished"),
"num_plays": int(item.findtext("numplays") or 0),
"user_rating": user_rating,
"owned": status_el.get("own") == "1" if status_el is not None else False,
"wishlist": status_el.get("wishlist") == "1" if status_el is not None else False,
"for_trade": status_el.get("fortrade") == "1" if status_el is not None else False,
})
return items
Fetching Play Logs
Play logs are paginated at 100 plays per page. Iterate until you exhaust them.
def get_plays(username: str, game_id: int | None = None) -> list[dict]:
"""Fetch all logged plays for a user, optionally filtered to a single game."""
plays = []
page = 1
while True:
params = {"username": username, "page": page}
if game_id:
params["id"] = game_id
params["type"] = "thing"
root = get("/plays", params=params)
total = int(root.get("total", 0))
batch = root.findall("play")
if not batch:
break
for play in batch:
item = play.find("item")
players = [
{"name": p.get("name"), "score": p.get("score"), "win": p.get("win") == "1"}
for p in play.findall("players/player")
]
plays.append({
"play_id": int(play.get("id")),
"date": play.get("date"),
"quantity": int(play.get("quantity", 1)),
"length_minutes": play.get("length"),
"location": play.get("location"),
"game_id": int(item.get("objectid")) if item is not None else None,
"game_name": item.get("name") if item is not None else None,
"players": players,
"comments": play.findtext("comments"),
})
if len(plays) >= total:
break
page += 1
time.sleep(5)
return plays
Putting It Together
# Search for a game, fetch details, then get plays
results = search_games("Wingspan", exact=True)
if results:
game_id = results[0]["id"]
details = get_games([game_id])
print(f"{details[0]['name']} — BGG rank: {details[0]['bgg_rank']}, Rating: {details[0]['avg_rating']:.2f}")
print(f"Mechanics: {', '.join(details[0]['mechanics'][:5])}")
# Fetch a user's collection and plays
time.sleep(5)
collection = get_collection("SomeUsername")
print(f"Collection size: {len(collection)} games")
time.sleep(5)
plays = get_plays("SomeUsername", game_id=game_id)
print(f"Logged plays for this game: {len(plays)}")
Legal Considerations
BGG's API is public and intended for developer use. Their terms of service permit reasonable programmatic access. Keep delays between requests, don't bulk-scrape user profile data at scale, and don't redistribute the full dataset commercially. Academic research, personal tools, and recommendation engines are all well within normal use.
The one area to watch: the BGG community treats its data as a commons, but the organization has historically pushed back on usage that threatens server stability. BGG runs on volunteer contributions and community goodwill, so don't hammer the API with concurrent workers — serialize your requests, respect the 202 retry pattern, use proxy rotation to distribute load rather than to circumvent protections, and consider contributing back if you build something useful with the data.
Advanced: Bulk Game ID Discovery
The BGG hot list and rankings give you starting points for bulk collection:
def get_hot_list() -> list[int]:
"""Fetch BGG's current 'hot games' list (top 50 by recent activity)."""
root = get("/hot", params={"type": "boardgame"})
return [int(item.get("id")) for item in root.findall("item")]
def get_ranked_games(start_rank: int = 1, end_rank: int = 1000) -> list[dict]:
    """
    Get games ranked from start_rank to end_rank.
    Caveat: the XML API has no ranking endpoint. The browse pages at
    boardgamegeek.com/browse/boardgame are plain HTML, so the ID list
    must be scraped from them (or taken from a BGG data dump) before
    details can be fetched in bulk from /thing. The loop below assumes
    a hypothetical helper, get_browse_page_ids(page), that returns the
    game IDs listed on one browse page.
    """
    games = []
    per_page = 50  # browse pages list games in fixed-size batches
    page = (start_rank - 1) // per_page + 1
    while True:
        rank_start = (page - 1) * per_page + 1
        if rank_start > end_rank:
            break
        ids = get_browse_page_ids(page)  # hypothetical HTML-scraping helper
        if not ids:
            break
        # Fetch details in batches of 20 IDs per /thing request
        for i in range(0, len(ids), 20):
            games.extend(get_games(ids[i:i + 20]))
            time.sleep(5)
        page += 1
    return games
def get_top_games_by_mechanic(mechanic: str, top_n: int = 100) -> list[dict]:
    """
    Get top-rated games that include a specific mechanic.
    Caveat: /search matches game names, not mechanics, so searching for
    the mechanic name only yields a rough seed list — filtering an
    already collected dataset (see the SQLite section) is more reliable.
    """
    root = get("/search", params={"query": mechanic, "type": "boardgame"})
    ids = [int(item.get("id")) for item in root.findall("item")][:top_n]
    if not ids:
        return []
    # Fetch details in batches of 20
    all_games = []
for i in range(0, len(ids), 20):
batch = ids[i:i+20]
games = get_games(batch)
all_games.extend(games)
time.sleep(5)
# Filter to those that actually have this mechanic and sort by BGG rank
filtered = [g for g in all_games if mechanic in g.get("mechanics", [])]
return sorted(filtered, key=lambda x: x.get("bgg_rank") or 9999)
Storing Data in SQLite
import sqlite3
import json
def init_bgg_db(db_path: str = "bgg.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS games (
id INTEGER PRIMARY KEY,
name TEXT,
year_published TEXT,
min_players INTEGER,
max_players INTEGER,
play_time INTEGER,
min_age INTEGER,
description TEXT,
avg_rating REAL,
bgg_rating REAL,
num_ratings INTEGER,
bgg_rank INTEGER,
weight REAL,
mechanics TEXT,
categories TEXT,
designers TEXT,
publishers TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS user_collections (
username TEXT,
game_id INTEGER,
game_name TEXT,
num_plays INTEGER,
user_rating REAL,
owned INTEGER,
wishlist INTEGER,
for_trade INTEGER,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (username, game_id)
);
CREATE TABLE IF NOT EXISTS plays (
play_id INTEGER PRIMARY KEY,
username TEXT,
game_id INTEGER,
game_name TEXT,
date TEXT,
quantity INTEGER,
length_minutes TEXT,
location TEXT,
players TEXT,
comments TEXT
);
CREATE INDEX IF NOT EXISTS idx_games_rank
ON games(bgg_rank);
CREATE INDEX IF NOT EXISTS idx_games_rating
ON games(bgg_rating DESC);
CREATE INDEX IF NOT EXISTS idx_collection_user
ON user_collections(username);
""")
conn.commit()
return conn
def save_game(conn: sqlite3.Connection, game: dict):
conn.execute(
"""INSERT OR REPLACE INTO games
(id, name, year_published, min_players, max_players, play_time,
min_age, description, avg_rating, bgg_rating, num_ratings,
bgg_rank, weight, mechanics, categories, designers, publishers)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
game.get("id"),
game.get("name"),
game.get("year_published"),
game.get("min_players"),
game.get("max_players"),
game.get("play_time"),
game.get("min_age"),
game.get("description", "")[:2000],
game.get("avg_rating"),
game.get("bgg_rating"),
game.get("num_ratings"),
game.get("bgg_rank"),
game.get("weight"),
json.dumps(game.get("mechanics", [])),
json.dumps(game.get("categories", [])),
json.dumps(game.get("designers", [])),
json.dumps(game.get("publishers", [])),
),
)
conn.commit()
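One detail to remember when querying this schema: mechanics, categories, designers, and publishers are serialized with json.dumps, so reads need a matching json.loads. A trimmed, standalone sketch of the round trip (the two-column table here is for illustration only, not the real schema):

```python
import json
import sqlite3

# Minimal stand-in for the games table above, trimmed to two columns
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (id INTEGER PRIMARY KEY, mechanics TEXT)")
conn.execute(
    "INSERT INTO games VALUES (?, ?)",
    (266192, json.dumps(["Hand Management", "Set Collection"])),
)

# Reading the list back requires deserializing the TEXT column
row = conn.execute("SELECT mechanics FROM games WHERE id = 266192").fetchone()
mechanics = json.loads(row[0])
print(mechanics)
```

This is also why the analysis queries later in this guide match mechanics with `LIKE '%...%'` — the column holds a JSON string, not a relational link table.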
Proxy Configuration for High-Volume Collection
BGG's API is free and has no authentication, but it rate-limits by IP. For bulk collection of thousands of games, distributing requests across multiple IPs prevents cumulative throttling.
ThorData's residential proxies integrate cleanly with httpx. Each request rotates to a new residential IP, so BGG's per-IP rate counter never accumulates:
import httpx
import xml.etree.ElementTree as ET
import time

PROXY = "http://USERNAME:[email protected]:9000"
def get_with_proxy(path: str, params: dict | None = None) -> ET.Element:
"""BGG API request routed through residential proxy."""
url = f"https://boardgamegeek.com/xmlapi2{path}"
    # Recent httpx versions take a single proxy= URL (the proxies= mapping was removed)
    with httpx.Client(proxy=PROXY, timeout=30) as client:
for attempt in range(5):
resp = client.get(url, params=params)
if resp.status_code == 200:
return ET.fromstring(resp.text)
elif resp.status_code == 202:
time.sleep(10)
elif resp.status_code == 429:
time.sleep(2 ** attempt * 5)
else:
raise Exception(f"Status {resp.status_code}")
raise Exception(f"Failed after retries: {path}")
Analyzing Board Game Data
Once you have a database of games, you can run interesting analyses:
import sqlite3
import json
import statistics
conn = sqlite3.connect("bgg.db")
# Top games by weight (complexity) in a given mechanic
def games_by_complexity(mechanic: str, min_ratings: int = 500) -> list:
rows = conn.execute("""
SELECT name, bgg_rank, weight, avg_rating, num_ratings, year_published
FROM games
WHERE mechanics LIKE ?
AND num_ratings >= ?
AND weight IS NOT NULL
ORDER BY weight DESC
LIMIT 20
""", (f'%{mechanic}%', min_ratings)).fetchall()
return rows
# Ratings distribution by mechanic
def mechanic_rating_analysis() -> dict:
rows = conn.execute("""
SELECT mechanics, avg_rating, num_ratings
FROM games
WHERE avg_rating IS NOT NULL AND mechanics != '[]'
""").fetchall()
mechanic_ratings = {}
for row in rows:
try:
mechanics = json.loads(row[0])
for mechanic in mechanics:
if mechanic not in mechanic_ratings:
mechanic_ratings[mechanic] = []
mechanic_ratings[mechanic].append(row[1])
except json.JSONDecodeError:
pass
analysis = {}
for mechanic, ratings in mechanic_ratings.items():
if len(ratings) >= 10:
analysis[mechanic] = {
"count": len(ratings),
"avg_rating": round(statistics.mean(ratings), 3),
"median_rating": round(statistics.median(ratings), 3),
}
return dict(sorted(analysis.items(), key=lambda x: x[1]["avg_rating"], reverse=True))
# Print top mechanics by average rating
analysis = mechanic_rating_analysis()
print("Top mechanics by average game rating:")
for mechanic, stats in list(analysis.items())[:15]:
print(f" {mechanic:<35} avg={stats['avg_rating']:.3f} n={stats['count']}")
Recommendation System Basics
The BGG data is an ideal foundation for a simple recommendation engine:
def find_similar_games(
game_id: int,
db_path: str = "bgg.db",
top_n: int = 10,
) -> list:
"""
Find games similar to a target game based on shared mechanics
and categories, weighted by BGG rating.
"""
conn = sqlite3.connect(db_path)
# Get target game's mechanics and categories
target = conn.execute(
"SELECT mechanics, categories FROM games WHERE id = ?", (game_id,)
).fetchone()
if not target:
return []
target_mechanics = set(json.loads(target[0] or "[]"))
target_categories = set(json.loads(target[1] or "[]"))
if not target_mechanics and not target_categories:
return []
# Score all other games by overlap
all_games = conn.execute(
"""SELECT id, name, mechanics, categories, bgg_rating, bgg_rank
FROM games WHERE id != ? AND bgg_rating IS NOT NULL""",
(game_id,)
).fetchall()
scored = []
for row in all_games:
mechanics = set(json.loads(row[2] or "[]"))
categories = set(json.loads(row[3] or "[]"))
mechanic_overlap = len(target_mechanics & mechanics) / max(len(target_mechanics), 1)
category_overlap = len(target_categories & categories) / max(len(target_categories), 1)
similarity = mechanic_overlap * 0.7 + category_overlap * 0.3
if similarity > 0:
scored.append({
"id": row[0],
"name": row[1],
"bgg_rating": row[4],
"bgg_rank": row[5],
"similarity": round(similarity, 3),
})
# Sort by similarity, then by rating
scored.sort(key=lambda x: (x["similarity"], x["bgg_rating"] or 0), reverse=True)
conn.close()
return scored[:top_n]
# Example: find games similar to Wingspan (ID: 266192)
similar = find_similar_games(266192)
print("Games similar to Wingspan:")
for g in similar:
print(f" {g['name']:<35} similarity={g['similarity']:.2f} rating={g['bgg_rating']:.2f}")
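The 0.7/0.3 weighting above is a judgment call, not a tuned constant: shared mechanics say more about how two games play than shared categories do. To make the formula concrete, here it is computed by hand for a toy target and candidate (all set contents are made up):

```python
# Target game's traits vs. one candidate's traits (illustrative values)
target_mechanics = {"Hand Management", "Set Collection", "Dice Rolling"}
target_categories = {"Animals"}
cand_mechanics = {"Hand Management", "Set Collection"}
cand_categories = {"Animals", "Economic"}

# Overlap is normalized by the target's set sizes, as in find_similar_games
mechanic_overlap = len(target_mechanics & cand_mechanics) / max(len(target_mechanics), 1)
category_overlap = len(target_categories & cand_categories) / max(len(target_categories), 1)

# Mechanics dominate the score (0.7) over categories (0.3)
similarity = mechanic_overlap * 0.7 + category_overlap * 0.3
print(round(similarity, 3))  # 0.767
```

Note the asymmetry: overlap is divided by the target's set sizes, so a sprawling candidate that happens to contain all the target's mechanics scores a perfect 1.0 on that axis. Jaccard similarity (intersection over union) is the usual fix if that bothers you.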
Full Collection Pipeline
def run_bgg_pipeline(
start_rank: int = 1,
end_rank: int = 500,
db_path: str = "bgg.db",
):
"""
Collect BGG top games by rank range.
Fetches details including mechanics, ratings, and categories.
"""
conn = init_bgg_db(db_path)
hot_ids = get_hot_list()
# Fetch hot list first
print(f"Fetching {len(hot_ids)} hot games...")
for i in range(0, len(hot_ids), 20):
batch = hot_ids[i:i+20]
games = get_games(batch)
for game in games:
save_game(conn, game)
print(f" Saved {min(i+20, len(hot_ids))}/{len(hot_ids)}")
time.sleep(5)
# Then collect by rank range
print(f"\nCollecting ranked games {start_rank}-{end_rank}...")
ranked = get_ranked_games(start_rank=start_rank, end_rank=end_rank)
for game in ranked:
save_game(conn, game)
conn.close()
    print(f"\nPipeline complete: {len(hot_ids) + len(ranked)} games saved")
run_bgg_pipeline(start_rank=1, end_rank=1000)