Scraping BoardGameGeek Data with Python (2026)
BoardGameGeek is the definitive database for tabletop games — over 140,000 games, 20 million user ratings, and one of the richest structured datasets in any hobbyist domain. For analysts, app developers, and recommendation engine builders, it's an extraordinary resource. BGG even provides a free XML API, which is both a gift and a source of frustration. The API is underdocumented, aggressively rate-limited, and occasionally responds with a 202 that tells you to come back later. This guide covers what the API exposes, how to parse it, and how to handle it without getting blocked.
What Data BGG Exposes
The BGG XML API v2 gives you access to:
- Game details — name, description, year published, minimum and maximum players, play time, age rating, BGG weight (complexity score)
- Ratings and statistics — average rating, Bayesian average (geek rating), number of ratings, rank within categories
- Mechanics — worker placement, deck building, area control, cooperative play, etc.
- Categories — fantasy, war games, economic, card games, etc.
- Designers, artists, publishers — all linked entities with their own IDs
- User collections — every game in a user's library with personal ratings, play counts, and status flags (owned, wishlist, for-trade, etc.)
- Play logs — individual logged plays with date, player count, location, and comments
This is real structured data, not scraped HTML — which makes it far more reliable than most scraping targets. But the API is XML-only, and the XML structure has some rough edges.
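Those rough edges are worth seeing up front: in API v2, most values live in XML attributes rather than element text, and a game carries several `<name>` elements distinguished by a `type` attribute. A minimal parsing sketch against a hand-written sample (the XML below is illustrative, mirroring the shape of a `/thing` response rather than reproducing one):

```python
import xml.etree.ElementTree as ET

# Hand-written sample mirroring the shape of a /thing response
SAMPLE = """
<items>
  <item type="boardgame" id="266192">
    <name type="primary" sortindex="1" value="Wingspan"/>
    <name type="alternate" sortindex="1" value="Fluegelschlag"/>
    <yearpublished value="2019"/>
    <minplayers value="1"/>
    <maxplayers value="5"/>
  </item>
</items>
"""

root = ET.fromstring(SAMPLE)
item = root.find("item")

# The element text is empty — the data sits in the value attribute
assert item.find("yearpublished").text is None

primary = next(
    n.get("value") for n in item.findall("name") if n.get("type") == "primary"
)
year = item.find("yearpublished").get("value")
print(primary, year)  # Wingspan 2019
```

Forgetting this attribute-vs-text distinction is the single most common parsing mistake with this API.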
API Endpoints Overview
All endpoints live under https://boardgamegeek.com/xmlapi2/.
| Endpoint | Purpose |
|---|---|
| `/xmlapi2/thing` | Game details, ratings, mechanics, links |
| `/xmlapi2/collection` | A user's game collection with personal data |
| `/xmlapi2/search` | Search games by name |
| `/xmlapi2/plays` | A user's logged plays |
The thing endpoint is the workhorse. You pass one or more game IDs and get full details back. The collection endpoint is more complex — it queues requests server-side and may return a 202 on first hit.
Rate Limiting and Anti-Bot Behavior
BGG's API is free and public, but BGG treats it like a fragile internal service. The rate limiting is aggressive and inconsistently documented.
What you will hit in practice:
- 429 responses — BGG starts returning 429s quickly if you issue requests without delays. In testing, consecutive requests with less than 2-second gaps reliably trigger throttling. Use 5+ second delays between requests to stay clean.
- 202 "please wait" responses — The /xmlapi2/collection endpoint is notorious for this. BGG queues the data export on the server side. Your first request returns HTTP 202 with a message body asking you to retry. You must poll with retries until you get a 200. This is by design, not a bug — but many scrapers silently fail here.
- IP blocks — Rapid requests from a single IP, especially to /xmlapi2/thing with many IDs, will result in temporary IP blocks. BGG does not communicate these clearly; you typically get connection timeouts or HTML error pages instead of proper API error codes.
- No authentication — There are no API keys. Your IP is your identity, which makes rate limit sharing across concurrent processes a real concern.
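Before reaching for proxies, the cheapest mitigation is pacing a single process so consecutive requests never land closer together than your chosen gap. A minimal sketch (`MinIntervalThrottle` is a name invented here, and the 5-second default is an assumption based on the throttling behavior above, not a documented limit):

```python
import time

class MinIntervalThrottle:
    """Enforce a minimum gap between requests from this process."""

    def __init__(self, min_gap: float = 5.0):
        self.min_gap = min_gap
        self._last = 0.0

    def wait(self) -> None:
        # Sleep only for the remainder of the gap, so time spent
        # parsing between requests counts toward the spacing
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_gap:
            time.sleep(self.min_gap - elapsed)
        self._last = time.monotonic()

# throttle = MinIntervalThrottle(5.0)
# throttle.wait()  # call before every BGG request
```

This keeps a single worker under the radar; it does nothing for multiple concurrent processes sharing one IP, which is where proxy rotation comes in.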
For any production use — price tracking, recommendation systems, dataset collection — you need to distribute requests across multiple IPs. ThorData is a good fit here: their residential proxy pool rotates IPs automatically, so each BGG request appears to come from a different address and you avoid the cumulative throttling that kills single-IP scrapers.
Setup
pip install httpx
No HTML parsing libraries needed — everything comes back as XML, which Python handles natively.
import httpx
import xml.etree.ElementTree as ET
import time
BASE_URL = "https://boardgamegeek.com/xmlapi2"
def get(path: str, params: dict | None = None, retries: int = 5) -> ET.Element:
"""Make a BGG API request with retry logic for 429s and 202s."""
url = f"{BASE_URL}{path}"
for attempt in range(retries):
resp = httpx.get(url, params=params, timeout=30)
if resp.status_code == 200:
return ET.fromstring(resp.text)
elif resp.status_code == 202:
# BGG is building the response — retry after delay
print(f"Got 202, retrying in 10s (attempt {attempt + 1})")
time.sleep(10)
elif resp.status_code == 429:
wait = 2 ** attempt * 5
print(f"Rate limited, waiting {wait}s")
time.sleep(wait)
else:
raise Exception(f"Unexpected status {resp.status_code}: {resp.text[:200]}")
raise Exception(f"Failed after {retries} retries for {url}")
Fetching Game Details
The /xmlapi2/thing endpoint accepts a comma-separated list of game IDs and an optional stats=1 flag to include ratings.
def _attr(parent: ET.Element | None, tag: str) -> str | None:
    """Read a child element's value attribute (API v2 stores data in attributes)."""
    el = parent.find(tag) if parent is not None else None
    return el.get("value") if el is not None else None

def get_games(ids: list[int]) -> list[dict]:
    """Fetch full details for one or more games by BGG ID."""
    id_str = ",".join(str(i) for i in ids)
    root = get("/thing", params={"id": id_str, "stats": 1, "type": "boardgame"})
    games = []
    for item in root.findall("item"):
        # Primary name — games carry several <name> elements; type="primary" is canonical
        name = ""
        for n in item.findall("name"):
            if n.get("type") == "primary":
                name = n.get("value", "")
                break
        # Statistics — values live in attributes, not element text
        stats = item.find("statistics/ratings")
        avg_rating = bgg_rating = num_ratings = rank = None
        if stats is not None:
            avg_rating = float(_attr(stats, "average") or 0)
            bgg_rating = float(_attr(stats, "bayesaverage") or 0)
            num_ratings = int(_attr(stats, "usersrated") or 0)
            for r in stats.findall("ranks/rank"):
                if r.get("name") == "boardgame":
                    try:
                        rank = int(r.get("value"))
                    except (TypeError, ValueError):
                        rank = None  # unranked games report value="Not Ranked"
                    break
        # Links — mechanics, categories, designers, publishers
        mechanics, categories, designers, publishers = [], [], [], []
        for link in item.findall("link"):
            ltype = link.get("type")
            lval = link.get("value", "")
            if ltype == "boardgamemechanic":
                mechanics.append(lval)
            elif ltype == "boardgamecategory":
                categories.append(lval)
            elif ltype == "boardgamedesigner":
                designers.append(lval)
            elif ltype == "boardgamepublisher":
                publishers.append(lval)
        games.append({
            "id": int(item.get("id")),
            "name": name,
            # These fields use value attributes; only <description> holds text
            "year_published": _attr(item, "yearpublished"),
            "min_players": _attr(item, "minplayers"),
            "max_players": _attr(item, "maxplayers"),
            "play_time": _attr(item, "playingtime"),
            "min_age": _attr(item, "minage"),
            "description": (item.findtext("description") or "").strip(),
            "avg_rating": avg_rating,
            "bgg_rating": bgg_rating,
            "num_ratings": num_ratings,
            "bgg_rank": rank,
            "weight": float(_attr(stats, "averageweight") or 0) if stats is not None else None,
            "mechanics": mechanics,
            "categories": categories,
            "designers": designers,
            "publishers": publishers,
        })
    return games
Searching Games
def search_games(query: str, exact: bool = False) -> list[dict]:
"""Search BGG by game name. Returns IDs and names only — fetch details separately."""
params = {"query": query, "type": "boardgame"}
if exact:
params["exact"] = 1
root = get("/search", params=params)
results = []
for item in root.findall("item"):
        name_el = item.find("name")
        year_el = item.find("yearpublished")
        results.append({
            "id": int(item.get("id")),
            "name": name_el.get("value") if name_el is not None else None,
            # The year also lives in a value attribute, not element text
            "year": year_el.get("value") if year_el is not None else None,
        })
return results
Fetching User Collections
This is where the 202 pattern shows up most often. BGG processes collection exports asynchronously — the get() helper above handles the retry loop automatically.
def get_collection(username: str, status: str = "own") -> list[dict]:
"""
Fetch a user's game collection.
status options: own, wishlist, wanttoplay, fortrade, prevowned
"""
params = {"username": username, status: 1, "stats": 1}
root = get("/collection", params=params)
items = []
for item in root.findall("item"):
status_el = item.find("status")
stats_el = item.find("stats/rating")
user_rating = None
if stats_el is not None:
try:
user_rating = float(stats_el.get("value"))
except (TypeError, ValueError):
user_rating = None
items.append({
"game_id": int(item.get("objectid")),
"name": item.findtext("name"),
"year_published": item.findtext("yearpublished"),
"num_plays": int(item.findtext("numplays") or 0),
"user_rating": user_rating,
"owned": status_el.get("own") == "1" if status_el is not None else False,
"wishlist": status_el.get("wishlist") == "1" if status_el is not None else False,
"for_trade": status_el.get("fortrade") == "1" if status_el is not None else False,
})
return items
Fetching Play Logs
Play logs are paginated at 100 plays per page. Iterate until you exhaust them.
def get_plays(username: str, game_id: int | None = None) -> list[dict]:
"""Fetch all logged plays for a user, optionally filtered to a single game."""
plays = []
page = 1
while True:
params = {"username": username, "page": page}
if game_id:
params["id"] = game_id
params["type"] = "thing"
root = get("/plays", params=params)
total = int(root.get("total", 0))
batch = root.findall("play")
if not batch:
break
for play in batch:
item = play.find("item")
players = [
{"name": p.get("name"), "score": p.get("score"), "win": p.get("win") == "1"}
for p in play.findall("players/player")
]
plays.append({
"play_id": int(play.get("id")),
"date": play.get("date"),
"quantity": int(play.get("quantity", 1)),
"length_minutes": play.get("length"),
"location": play.get("location"),
"game_id": int(item.get("objectid")) if item is not None else None,
"game_name": item.get("name") if item is not None else None,
"players": players,
"comments": play.findtext("comments"),
})
if len(plays) >= total:
break
page += 1
time.sleep(5)
return plays
Putting It Together
# Search for a game, fetch details, then get plays
results = search_games("Wingspan", exact=True)
if results:
game_id = results[0]["id"]
details = get_games([game_id])
print(f"{details[0]['name']} — BGG rank: {details[0]['bgg_rank']}, Rating: {details[0]['avg_rating']:.2f}")
print(f"Mechanics: {', '.join(details[0]['mechanics'][:5])}")
# Fetch a user's collection and plays
time.sleep(5)
collection = get_collection("SomeUsername")
print(f"Collection size: {len(collection)} games")
time.sleep(5)
plays = get_plays("SomeUsername", game_id=game_id)
print(f"Logged plays for this game: {len(plays)}")
Legal Considerations
BGG's API is public and intended for developer use. Their terms of service permit reasonable programmatic access. Keep delays between requests, don't bulk-scrape user profile data at scale, and don't redistribute the full dataset commercially. Academic research, personal tools, and recommendation engines are all well within normal use.
The one area to watch: the BGG community treats its data as a commons, but the organization has historically pushed back on usage that threatens server stability. BGG runs on volunteer contributions and community goodwill, so don't hammer the API with concurrent workers — serialize your requests, respect the 202 retry pattern, use proxy rotation to distribute load rather than to circumvent protections, and consider contributing back if you build something useful with the data.
Advanced: Bulk Game ID Discovery
The BGG hot list and rankings give you starting points for bulk collection:
def get_hot_list() -> list[int]:
"""Fetch BGG's current 'hot games' list (top 50 by recent activity)."""
root = get("/hot", params={"type": "boardgame"})
return [int(item.get("id")) for item in root.findall("item")]
def get_ranked_games(start_rank: int = 1, end_rank: int = 1000) -> list[dict]:
    """
    Get games ranked from start_rank to end_rank.
    Caveat: the XML API has no ranking endpoint. The browse pages at
    boardgamegeek.com/browse/boardgame are plain HTML, so the ID list
    must be scraped from them (or taken from a BGG data dump) before
    details can be fetched in bulk from /thing. The loop below assumes
    a hypothetical helper, get_browse_page_ids(page), that returns the
    game IDs listed on one browse page.
    """
    games = []
    per_page = 50  # browse pages list games in fixed-size batches
    page = (start_rank - 1) // per_page + 1
    while True:
        rank_start = (page - 1) * per_page + 1
        if rank_start > end_rank:
            break
        ids = get_browse_page_ids(page)  # hypothetical HTML-scraping helper
        if not ids:
            break
        # Fetch details in batches of 20 IDs per /thing request
        for i in range(0, len(ids), 20):
            games.extend(get_games(ids[i:i + 20]))
            time.sleep(5)
        page += 1
    return games
def get_top_games_by_mechanic(mechanic: str, top_n: int = 100) -> list[dict]:
    """
    Get top-rated games that include a specific mechanic.
    Caveat: /search matches game names, not mechanics, so searching for
    the mechanic name only yields a rough seed list — filtering an
    already collected dataset (see the SQLite section) is more reliable.
    """
    root = get("/search", params={"query": mechanic, "type": "boardgame"})
    ids = [int(item.get("id")) for item in root.findall("item")][:top_n]
    if not ids:
        return []
    # Fetch details in batches of 20
    all_games = []
for i in range(0, len(ids), 20):
batch = ids[i:i+20]
games = get_games(batch)
all_games.extend(games)
time.sleep(5)
# Filter to those that actually have this mechanic and sort by BGG rank
filtered = [g for g in all_games if mechanic in g.get("mechanics", [])]
return sorted(filtered, key=lambda x: x.get("bgg_rank") or 9999)
Storing Data in SQLite
import sqlite3
import json
def init_bgg_db(db_path: str = "bgg.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS games (
id INTEGER PRIMARY KEY,
name TEXT,
year_published TEXT,
min_players INTEGER,
max_players INTEGER,
play_time INTEGER,
min_age INTEGER,
description TEXT,
avg_rating REAL,
bgg_rating REAL,
num_ratings INTEGER,
bgg_rank INTEGER,
weight REAL,
mechanics TEXT,
categories TEXT,
designers TEXT,
publishers TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS user_collections (
username TEXT,
game_id INTEGER,
game_name TEXT,
num_plays INTEGER,
user_rating REAL,
owned INTEGER,
wishlist INTEGER,
for_trade INTEGER,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (username, game_id)
);
CREATE TABLE IF NOT EXISTS plays (
play_id INTEGER PRIMARY KEY,
username TEXT,
game_id INTEGER,
game_name TEXT,
date TEXT,
quantity INTEGER,
length_minutes TEXT,
location TEXT,
players TEXT,
comments TEXT
);
CREATE INDEX IF NOT EXISTS idx_games_rank
ON games(bgg_rank);
CREATE INDEX IF NOT EXISTS idx_games_rating
ON games(bgg_rating DESC);
CREATE INDEX IF NOT EXISTS idx_collection_user
ON user_collections(username);
""")
conn.commit()
return conn
def save_game(conn: sqlite3.Connection, game: dict):
conn.execute(
"""INSERT OR REPLACE INTO games
(id, name, year_published, min_players, max_players, play_time,
min_age, description, avg_rating, bgg_rating, num_ratings,
bgg_rank, weight, mechanics, categories, designers, publishers)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
game.get("id"),
game.get("name"),
game.get("year_published"),
game.get("min_players"),
game.get("max_players"),
game.get("play_time"),
game.get("min_age"),
game.get("description", "")[:2000],
game.get("avg_rating"),
game.get("bgg_rating"),
game.get("num_ratings"),
game.get("bgg_rank"),
game.get("weight"),
json.dumps(game.get("mechanics", [])),
json.dumps(game.get("categories", [])),
json.dumps(game.get("designers", [])),
json.dumps(game.get("publishers", [])),
),
)
conn.commit()
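One detail to remember when querying this schema: mechanics, categories, designers, and publishers are serialized with json.dumps, so reads need a matching json.loads. A trimmed, standalone sketch of the round trip (the two-column table here is for illustration only, not the real schema):

```python
import json
import sqlite3

# Minimal stand-in for the games table above, trimmed to two columns
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (id INTEGER PRIMARY KEY, mechanics TEXT)")
conn.execute(
    "INSERT INTO games VALUES (?, ?)",
    (266192, json.dumps(["Hand Management", "Set Collection"])),
)

# Reading the list back requires deserializing the TEXT column
row = conn.execute("SELECT mechanics FROM games WHERE id = 266192").fetchone()
mechanics = json.loads(row[0])
print(mechanics)
```

This is also why the analysis queries later in this guide match mechanics with `LIKE '%...%'` — the column holds a JSON string, not a relational link table.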
Proxy Configuration for High-Volume Collection
BGG's API is free and has no authentication, but it rate-limits by IP. For bulk collection of thousands of games, distributing requests across multiple IPs prevents cumulative throttling.
ThorData's residential proxies integrate cleanly with httpx. Each request rotates to a new residential IP, so BGG's per-IP rate counter never accumulates:
import httpx
import xml.etree.ElementTree as ET
import time

PROXY = "http://USERNAME:[email protected]:9000"
def get_with_proxy(path: str, params: dict | None = None) -> ET.Element:
"""BGG API request routed through residential proxy."""
url = f"https://boardgamegeek.com/xmlapi2{path}"
    # Recent httpx versions take a single proxy= URL (the proxies= mapping was removed)
    with httpx.Client(proxy=PROXY, timeout=30) as client:
for attempt in range(5):
resp = client.get(url, params=params)
if resp.status_code == 200:
return ET.fromstring(resp.text)
elif resp.status_code == 202:
time.sleep(10)
elif resp.status_code == 429:
time.sleep(2 ** attempt * 5)
else:
raise Exception(f"Status {resp.status_code}")
raise Exception(f"Failed after retries: {path}")
Analyzing Board Game Data
Once you have a database of games, you can run interesting analyses:
import sqlite3
import json
import statistics
conn = sqlite3.connect("bgg.db")
# Top games by weight (complexity) in a given mechanic
def games_by_complexity(mechanic: str, min_ratings: int = 500) -> list:
rows = conn.execute("""
SELECT name, bgg_rank, weight, avg_rating, num_ratings, year_published
FROM games
WHERE mechanics LIKE ?
AND num_ratings >= ?
AND weight IS NOT NULL
ORDER BY weight DESC
LIMIT 20
""", (f'%{mechanic}%', min_ratings)).fetchall()
return rows
# Ratings distribution by mechanic
def mechanic_rating_analysis() -> dict:
rows = conn.execute("""
SELECT mechanics, avg_rating, num_ratings
FROM games
WHERE avg_rating IS NOT NULL AND mechanics != '[]'
""").fetchall()
mechanic_ratings = {}
for row in rows:
try:
mechanics = json.loads(row[0])
for mechanic in mechanics:
if mechanic not in mechanic_ratings:
mechanic_ratings[mechanic] = []
mechanic_ratings[mechanic].append(row[1])
except json.JSONDecodeError:
pass
analysis = {}
for mechanic, ratings in mechanic_ratings.items():
if len(ratings) >= 10:
analysis[mechanic] = {
"count": len(ratings),
"avg_rating": round(statistics.mean(ratings), 3),
"median_rating": round(statistics.median(ratings), 3),
}
return dict(sorted(analysis.items(), key=lambda x: x[1]["avg_rating"], reverse=True))
# Print top mechanics by average rating
analysis = mechanic_rating_analysis()
print("Top mechanics by average game rating:")
for mechanic, stats in list(analysis.items())[:15]:
print(f" {mechanic:<35} avg={stats['avg_rating']:.3f} n={stats['count']}")
Recommendation System Basics
The BGG data is an ideal foundation for a simple recommendation engine:
def find_similar_games(
game_id: int,
db_path: str = "bgg.db",
top_n: int = 10,
) -> list:
"""
Find games similar to a target game based on shared mechanics
and categories, weighted by BGG rating.
"""
conn = sqlite3.connect(db_path)
# Get target game's mechanics and categories
target = conn.execute(
"SELECT mechanics, categories FROM games WHERE id = ?", (game_id,)
).fetchone()
if not target:
return []
target_mechanics = set(json.loads(target[0] or "[]"))
target_categories = set(json.loads(target[1] or "[]"))
if not target_mechanics and not target_categories:
return []
# Score all other games by overlap
all_games = conn.execute(
"""SELECT id, name, mechanics, categories, bgg_rating, bgg_rank
FROM games WHERE id != ? AND bgg_rating IS NOT NULL""",
(game_id,)
).fetchall()
scored = []
for row in all_games:
mechanics = set(json.loads(row[2] or "[]"))
categories = set(json.loads(row[3] or "[]"))
mechanic_overlap = len(target_mechanics & mechanics) / max(len(target_mechanics), 1)
category_overlap = len(target_categories & categories) / max(len(target_categories), 1)
similarity = mechanic_overlap * 0.7 + category_overlap * 0.3
if similarity > 0:
scored.append({
"id": row[0],
"name": row[1],
"bgg_rating": row[4],
"bgg_rank": row[5],
"similarity": round(similarity, 3),
})
# Sort by similarity, then by rating
scored.sort(key=lambda x: (x["similarity"], x["bgg_rating"] or 0), reverse=True)
conn.close()
return scored[:top_n]
# Example: find games similar to Wingspan (ID: 266192)
similar = find_similar_games(266192)
print("Games similar to Wingspan:")
for g in similar:
print(f" {g['name']:<35} similarity={g['similarity']:.2f} rating={g['bgg_rating']:.2f}")
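The 0.7/0.3 weighting above is a judgment call, not a tuned constant: shared mechanics say more about how two games play than shared categories do. To make the formula concrete, here it is computed by hand for a toy target and candidate (all set contents are made up):

```python
# Target game's traits vs. one candidate's traits (illustrative values)
target_mechanics = {"Hand Management", "Set Collection", "Dice Rolling"}
target_categories = {"Animals"}
cand_mechanics = {"Hand Management", "Set Collection"}
cand_categories = {"Animals", "Economic"}

# Overlap is normalized by the target's set sizes, as in find_similar_games
mechanic_overlap = len(target_mechanics & cand_mechanics) / max(len(target_mechanics), 1)
category_overlap = len(target_categories & cand_categories) / max(len(target_categories), 1)

# Mechanics dominate the score (0.7) over categories (0.3)
similarity = mechanic_overlap * 0.7 + category_overlap * 0.3
print(round(similarity, 3))  # 0.767
```

Note the asymmetry: overlap is divided by the target's set sizes, so a sprawling candidate that happens to contain all the target's mechanics scores a perfect 1.0 on that axis. Jaccard similarity (intersection over union) is the usual fix if that bothers you.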
Full Collection Pipeline
def run_bgg_pipeline(
start_rank: int = 1,
end_rank: int = 500,
db_path: str = "bgg.db",
):
"""
Collect BGG top games by rank range.
Fetches details including mechanics, ratings, and categories.
"""
conn = init_bgg_db(db_path)
hot_ids = get_hot_list()
# Fetch hot list first
print(f"Fetching {len(hot_ids)} hot games...")
for i in range(0, len(hot_ids), 20):
batch = hot_ids[i:i+20]
games = get_games(batch)
for game in games:
save_game(conn, game)
print(f" Saved {min(i+20, len(hot_ids))}/{len(hot_ids)}")
time.sleep(5)
# Then collect by rank range
print(f"\nCollecting ranked games {start_rank}-{end_rank}...")
ranked = get_ranked_games(start_rank=start_rank, end_rank=end_rank)
for game in ranked:
save_game(conn, game)
conn.close()
    print(f"\nPipeline complete: {len(hot_ids) + len(ranked)} games saved")
run_bgg_pipeline(start_rank=1, end_rank=1000)