Scraping Basketball-Reference for NBA Stats with Python (2026)
Basketball-Reference is the definitive source for NBA statistics. It covers every player who's ever stepped on an NBA court, going back to the league's founding in 1946, with advanced metrics like PER, Win Shares, VORP, and Box Plus/Minus that you won't find elsewhere for free. If you're building a fantasy basketball tool, training a machine learning model on player performance, or analyzing historical trends for sports journalism, Basketball-Reference is where the data lives.
There's no official API. The data sits in well-structured HTML tables, which makes it pleasant to scrape — if you handle the anti-bot measures correctly. Basketball-Reference is owned by Sports Reference, which also runs Pro-Football-Reference and Baseball-Reference. They all share the same infrastructure, the same table structures, and the same anti-bot protections, so what you learn here applies across all three.
The site serves roughly 30 million page views per month, and they're protective of their bandwidth. Get caught scraping too aggressively and you'll eat a temporary IP ban — usually one hour for the first offense, longer for repeat violations. The site uses Cloudflare for protection, so you'll see JavaScript challenges on suspicious traffic. None of this is insurmountable, but you need to approach it with respect for their infrastructure.
This guide covers everything from basic per-game stats to advanced analytics, game logs, team pages, playoff data, and building complete historical datasets. Every code example is production-ready with proper error handling, rate limiting, and storage. Let's get your data pipeline running.
Setup and Dependencies
pip install requests beautifulsoup4 pandas lxml httpx
Basketball-Reference serves static HTML with data in <table> elements. No Playwright or Selenium needed — this is a pure HTTP + HTML parsing job. The lxml parser is faster than Python's built-in html.parser, and pandas' read_html() makes table extraction trivial.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
import time
import random
import sqlite3
from datetime import datetime
# Base configuration
BASE_URL = "https://www.basketball-reference.com"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
}
def polite_request(url, session=None, min_delay=3.5, max_delay=6.0):
"""Make a request with human-like delay and error handling."""
time.sleep(random.uniform(min_delay, max_delay))
client = session or requests
resp = client.get(url, headers=HEADERS, timeout=30)
if resp.status_code == 429:
wait = int(resp.headers.get("Retry-After", 120))
print(f"Rate limited — waiting {wait}s")
time.sleep(wait)
resp = client.get(url, headers=HEADERS, timeout=30)
if resp.status_code == 403:
        print(f"Blocked (403) on {url} — need proxy rotation or longer delay")
raise requests.HTTPError(f"403 on {url}")
resp.raise_for_status()
return resp
Understanding Basketball-Reference's Table Structure
Before diving into code, understand how the site structures its HTML. Most stats pages have multiple tables — per-game averages, totals, per-36-minute, advanced, shooting, play-by-play. Each has a unique id attribute.
The critical gotcha: Basketball-Reference wraps some tables in HTML comments to speed up page rendering. When the page loads in a browser, JavaScript uncomments them. But when you fetch the raw HTML with requests, those tables are invisible to soup.find("table"). You must parse comments separately.
def find_table(soup, table_id):
"""Find a table by ID, checking both visible tables and HTML comments.
Basketball-Reference hides some tables in HTML comments for performance.
This function checks both — it's the single most important helper for BBRef scraping.
"""
# First try: visible table
table = soup.find("table", id=table_id)
if table:
return table
# Second try: check inside HTML comments
comments = soup.find_all(string=lambda t: isinstance(t, Comment))
for comment in comments:
if f'id="{table_id}"' in comment:
comment_soup = BeautifulSoup(comment, "lxml")
table = comment_soup.find("table", id=table_id)
if table:
return table
return None
def table_to_df(table):
"""Convert an HTML table to a clean pandas DataFrame."""
if table is None:
return pd.DataFrame()
    # Wrap in StringIO — pandas deprecated passing literal HTML strings to read_html
    from io import StringIO
    df = pd.read_html(StringIO(str(table)))[0]
# Remove repeated header rows (BBRef uses these as visual separators)
if "Season" in df.columns:
df = df[df["Season"] != "Season"]
if "Rk" in df.columns:
df = df[df["Rk"] != "Rk"]
return df
Player Season Stats (Per-Game Averages)
The most common use case — getting a player's career per-game averages:
def get_player_per_game(player_slug):
"""Get career per-game stats for a player.
Player slug is the URL identifier, e.g., 'jamesle01' for LeBron James.
Find slugs by searching the site or checking player page URLs.
"""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "per_game")
df = table_to_df(table)
if df.empty:
return df
# Convert numeric columns
numeric_cols = ["G", "GS", "MP", "FG", "FGA", "FG%", "3P", "3PA", "3P%",
"2P", "2PA", "2P%", "eFG%", "FT", "FTA", "FT%",
"ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS"]
for col in numeric_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
return df
# Example: LeBron James career per-game
stats = get_player_per_game("jamesle01")
print(stats[["Season", "Tm", "G", "PTS", "TRB", "AST"]].tail(10))
Expected output:
Season Tm G PTS TRB AST
15 2018-19 LAL 55 27.4 8.5 8.3
16 2019-20 LAL 67 25.3 7.8 10.2
17 2020-21 LAL 45 25.0 7.7 7.8
18 2021-22 LAL 56 30.3 8.2 6.2
19 2022-23 LAL 55 28.9 8.3 6.8
20 2023-24 LAL 71 25.7 7.3 8.3
21 2024-25 LAL 62 23.5 7.9 9.0
22 2025-26 LAL 48 21.8 6.8 8.4
Advanced Metrics — PER, Win Shares, VORP
The stats that analysts and front offices actually care about:
def get_advanced_stats(player_slug):
"""Get career advanced stats — PER, TS%, WS, BPM, VORP."""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "advanced")
df = table_to_df(table)
if df.empty:
return df
# Key advanced stat columns
adv_cols = ["PER", "TS%", "3PAr", "FTr", "ORB%", "DRB%", "TRB%",
"AST%", "STL%", "BLK%", "TOV%", "USG%",
"OWS", "DWS", "WS", "WS/48", "OBPM", "DBPM", "BPM", "VORP"]
for col in adv_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
return df
# Compare two MVPs
giannis = get_advanced_stats("antetgi01")
jokic = get_advanced_stats("jokicni01")
print("Giannis Advanced (last 5 seasons):")
print(giannis[["Season", "PER", "TS%", "WS", "BPM", "VORP"]].tail(5))
print("\nJokic Advanced (last 5 seasons):")
print(jokic[["Season", "PER", "TS%", "WS", "BPM", "VORP"]].tail(5))
Shooting Splits and Shot Charts
def get_shooting_stats(player_slug):
"""Get shooting splits — by distance, shot type, and quarter."""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "shooting")
return table_to_df(table)
def get_shot_chart_data(player_slug, season):
"""Get individual game shooting data for shot chart reconstruction."""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}/shooting/{season}"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
# Shot chart data is in a specific div
shots = []
chart = soup.find("div", id="shot-chart")
if chart:
for tip in chart.find_all("div", class_="tooltip"):
style = tip.get("style", "")
# Parse position from CSS top/left
shots.append({
"description": tip.get_text(strip=True),
"style": style,
"made": "make" in tip.get("class", []),
})
return shots
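The style attribute captured above encodes the shot's court position as CSS offsets. Assuming the inline style follows the typical `top:Xpx;left:Ypx` pattern (an assumption; verify against the live markup), a minimal parser might look like:

```python
import re

# Assumes the tooltip style looks like "top:45px;left:220px;" -- check the
# live HTML, since Basketball-Reference may change this format.
STYLE_RE = re.compile(r"top:\s*(-?\d+)px;\s*left:\s*(-?\d+)px")

def parse_shot_position(style):
    """Extract (top, left) pixel offsets from a shot tooltip's inline CSS."""
    match = STYLE_RE.search(style)
    if match:
        return int(match.group(1)), int(match.group(2))
    return None
```

Feed each shot's "style" value through this to get plottable coordinates: parse_shot_position("top:45px;left:220px;") returns (45, 220).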
Game Logs — Every Game, Full Box Score
The most granular data available — individual game stats:
def get_game_logs(player_slug, season):
"""Get game-by-game stats for a player in a season.
Season 2026 = the 2025-26 season (ending year).
"""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}/gamelog/{season}"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "pgl_basic")
df = table_to_df(table)
if df.empty:
return df
# Convert stats to numeric
stat_cols = ["GS", "MP", "FG", "FGA", "FG%", "3P", "3PA", "3P%",
"FT", "FTA", "FT%", "ORB", "DRB", "TRB", "AST",
"STL", "BLK", "TOV", "PF", "PTS", "GmSc", "+/-"]
for col in stat_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
return df
# Luka Doncic 2025-26 game logs
logs = get_game_logs("doncilu01", 2026)
if not logs.empty:
print(f"Games played: {len(logs)}")
print(f"PPG: {logs['PTS'].mean():.1f}")
print(f"Best game: {logs['PTS'].max()} points")
print(f"Triple-doubles: {len(logs[(logs['PTS'] >= 10) & (logs['TRB'] >= 10) & (logs['AST'] >= 10)])}")
# Monthly splits
if "Date" in logs.columns:
logs["Month"] = pd.to_datetime(logs["Date"]).dt.month_name()
monthly = logs.groupby("Month")[["PTS", "TRB", "AST"]].mean().round(1)
print("\nMonthly averages:")
print(monthly)
Playoff Stats
Playoff data lives on the same player page but in different tables:
def get_playoff_stats(player_slug):
"""Get career playoff per-game averages."""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
    # Playoff tables use a playoffs_ prefix on the table id
table = find_table(soup, "playoffs_per_game")
df = table_to_df(table)
if not df.empty:
numeric_cols = ["G", "PTS", "TRB", "AST", "STL", "BLK"]
for col in numeric_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
return df
def compare_regular_vs_playoffs(player_slug):
"""Compare a player's regular season vs playoff performance."""
regular = get_player_per_game(player_slug)
playoffs = get_playoff_stats(player_slug)
if regular.empty or playoffs.empty:
print("Insufficient data for comparison")
return
# Career averages
reg_avg = regular[["PTS", "TRB", "AST"]].mean()
play_avg = playoffs[["PTS", "TRB", "AST"]].mean()
print(f"Regular Season: {reg_avg['PTS']:.1f}/{reg_avg['TRB']:.1f}/{reg_avg['AST']:.1f}")
print(f"Playoffs: {play_avg['PTS']:.1f}/{play_avg['TRB']:.1f}/{play_avg['AST']:.1f}")
for stat in ["PTS", "TRB", "AST"]:
diff = play_avg[stat] - reg_avg[stat]
        direction = "UP" if diff > 0 else "DOWN"
        print(f"  {stat}: {direction} {abs(diff):.1f}")
League-Wide Season Data
Stats for every player in a season — essential for rankings and comparisons:
def get_season_stats(season, stat_type="per_game"):
"""Get league-wide stats for a season.
stat_type options: 'per_game', 'totals', 'per_minute', 'per_poss', 'advanced'
"""
url = f"{BASE_URL}/leagues/NBA_{season}_{stat_type}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, f"{stat_type}_stats")
df = table_to_df(table)
if df.empty:
return df
# Convert all numeric columns
skip_cols = {"Player", "Pos", "Tm", "Season", "Rk"}
for col in df.columns:
if col not in skip_cols:
df[col] = pd.to_numeric(df[col], errors="coerce")
return df
# Top scorers in 2025-26
season = get_season_stats(2026)
if not season.empty:
# Filter to players with enough games
qualified = season[season["G"] >= 40].copy()
print("Top 10 Scorers (2025-26):")
top_scorers = qualified.nlargest(10, "PTS")
print(top_scorers[["Player", "Tm", "G", "PTS", "TRB", "AST", "FG%"]].to_string(index=False))
print("\nTop 10 by PER:")
adv = get_season_stats(2026, "advanced")
if not adv.empty:
qualified_adv = adv[adv["G"] >= 40]
print(qualified_adv.nlargest(10, "PER")[["Player", "PER", "WS", "BPM", "VORP"]].to_string(index=False))
Team Statistics
def get_team_roster(team_abbr, season):
"""Get a team's roster with per-game stats.
team_abbr: e.g., 'LAL', 'BOS', 'GSW', 'MIL', 'DEN'
"""
url = f"{BASE_URL}/teams/{team_abbr}/{season}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "per_game")
df = table_to_df(table)
# Add team context
if not df.empty:
df["Team"] = team_abbr
df["Season"] = season
return df
def get_team_standings(season):
"""Get full league standings for a season."""
url = f"{BASE_URL}/leagues/NBA_{season}_standings.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
east = find_table(soup, "divs_standings_E")
west = find_table(soup, "divs_standings_W")
results = {}
    if east is not None:
        results["Eastern"] = table_to_df(east)
    if west is not None:
        results["Western"] = table_to_df(west)
return results
def get_team_schedule(team_abbr, season):
"""Get a team's full season schedule with results."""
url = f"{BASE_URL}/teams/{team_abbr}/{season}_games.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "games")
return table_to_df(table)
# Example: Lakers roster and team stats
roster = get_team_roster("LAL", 2026)
if not roster.empty:
print("Lakers 2025-26 Roster:")
print(roster[["Player", "G", "MP", "PTS", "TRB", "AST"]].to_string(index=False))
Historical Data and Multi-Season Analysis
Building datasets across multiple seasons for trend analysis:
def build_career_dataset(player_slugs):
"""Collect career stats for multiple players into one DataFrame."""
all_stats = []
session = requests.Session()
session.headers.update(HEADERS)
for name, slug in player_slugs.items():
print(f"Fetching {name}...")
try:
url = f"{BASE_URL}/players/{slug[0]}/{slug}.html"
resp = polite_request(url, session=session)
soup = BeautifulSoup(resp.text, "lxml")
# Get per-game stats
per_game = table_to_df(find_table(soup, "per_game"))
if not per_game.empty:
per_game["Player"] = name
per_game["Slug"] = slug
all_stats.append(per_game)
except Exception as e:
print(f" Failed for {name}: {e}")
if not all_stats:
return pd.DataFrame()
combined = pd.concat(all_stats, ignore_index=True)
return combined
def build_historical_season_data(start_year, end_year):
"""Build a multi-season dataset of league-wide stats."""
all_seasons = []
for year in range(start_year, end_year + 1):
print(f"Fetching {year-1}-{str(year)[2:]} season...")
try:
df = get_season_stats(year)
if not df.empty:
df["Season_Year"] = year
all_seasons.append(df)
except Exception as e:
print(f" Failed for {year}: {e}")
if not all_seasons:
return pd.DataFrame()
combined = pd.concat(all_seasons, ignore_index=True)
print(f"\nCollected {len(combined)} player-seasons from {start_year} to {end_year}")
return combined
# All-time greats comparison
greats = {
"LeBron James": "jamesle01",
"Kevin Durant": "duranke01",
"Stephen Curry": "curryst01",
"Nikola Jokic": "jokicni01",
"Giannis Antetokounmpo": "antetgi01",
"Luka Doncic": "doncilu01",
}
dataset = build_career_dataset(greats)
Player Search and Discovery
def search_player(name):
"""Search for a player and return their slug and basic info."""
    from urllib.parse import quote
    resp = polite_request(f"{BASE_URL}/search/search.fcgi?search={quote(name)}")
# BBRef either redirects to the player page or shows search results
if "/players/" in resp.url:
# Direct match — extract slug from URL
slug = resp.url.split("/players/")[1].rstrip("/").split("/")[-1].replace(".html", "")
return {"name": name, "slug": slug, "url": resp.url}
# Parse search results
soup = BeautifulSoup(resp.text, "lxml")
results = []
for item in soup.select(".search-item-name"):
link = item.find("a")
if link and "/players/" in link.get("href", ""):
href = link["href"]
slug = href.split("/")[-1].replace(".html", "")
results.append({
"name": link.get_text(strip=True),
"slug": slug,
"url": BASE_URL + href
})
return results
# Find a player
results = search_player("Nikola Jokic")
print(results)
Anti-Bot Protections and How to Handle Them
Basketball-Reference has progressively tightened its defenses. Here's what you'll encounter and how to handle each:
Rate Limiting
The site allows roughly 20 requests per minute. Exceed that and you get 429 responses followed by a temporary IP ban (usually 1 hour, escalating to 24 hours for repeat offenses).
class RateLimiter:
"""Track request timing to stay under BBRef's rate limits."""
def __init__(self, requests_per_minute=15):
self.rpm = requests_per_minute
self.request_times = []
def wait_if_needed(self):
"""Block until we can make another request without exceeding limits."""
now = time.time()
# Remove requests older than 60 seconds
self.request_times = [t for t in self.request_times if now - t < 60]
if len(self.request_times) >= self.rpm:
oldest = self.request_times[0]
wait = 60 - (now - oldest) + random.uniform(1, 3)
if wait > 0:
print(f"Rate limit approaching — waiting {wait:.1f}s")
time.sleep(wait)
self.request_times.append(time.time())
limiter = RateLimiter(requests_per_minute=15)
def rate_limited_request(url, session=None):
"""Make a request respecting rate limits."""
limiter.wait_if_needed()
client = session or requests
resp = client.get(url, headers=HEADERS, timeout=30)
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 120))
print(f"Rate limited! Backing off {retry_after}s")
time.sleep(retry_after)
limiter.request_times.clear()
resp = client.get(url, headers=HEADERS, timeout=30)
resp.raise_for_status()
return resp
IP Bans and Proxy Rotation
The site is harsh on cloud provider IP ranges (AWS, GCP, Azure, DigitalOcean). If you're scraping from a VPS, you'll likely get blocked within minutes. For any project pulling data across multiple players or seasons, you need residential IP rotation.
ThorData provides residential proxy pools that rotate automatically. Each request exits from a different residential IP, making your traffic look like normal users browsing the site:
THORDATA_PROXY = "http://USERNAME:PASSWORD@PROXY_HOST:9000"  # substitute your ThorData credentials and endpoint
def request_with_proxy(url, proxy_url=THORDATA_PROXY):
"""Make a request through ThorData residential proxy."""
proxies = {"http": proxy_url, "https": proxy_url}
limiter.wait_if_needed()
resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=30)
if resp.status_code == 429:
print("Rate limited even through proxy — backing off 120s")
time.sleep(120)
resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=30)
resp.raise_for_status()
return resp
def scrape_with_fallback(url, session=None):
"""Try direct request first, fall back to proxy on block."""
try:
return rate_limited_request(url, session)
except requests.HTTPError as e:
if e.response is not None and e.response.status_code in (403, 429):
            print("Blocked directly, trying ThorData proxy...")
return request_with_proxy(url)
raise
Session Management for Extended Scraping
def create_scraping_session():
"""Create a requests session that mimics a real browser session."""
session = requests.Session()
session.headers.update(HEADERS)
# Visit the homepage first to get cookies
try:
session.get(BASE_URL, timeout=30)
time.sleep(random.uniform(2, 4))
except Exception:
pass
return session
def scrape_multiple_players(player_slugs, output_db="nba_stats.db"):
    """Production scraper for multiple players with session reuse."""
    session = create_scraping_session()
    conn = sqlite3.connect(output_db)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS player_stats (
            slug TEXT, season TEXT, team TEXT, games INTEGER,
            ppg REAL, rpg REAL, apg REAL,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (slug, season)
        )
    """)
    for i, (name, slug) in enumerate(player_slugs.items()):
        print(f"[{i+1}/{len(player_slugs)}] {name}")
        try:
            # Fetch through the shared session so cookies persist across requests
            url = f"{BASE_URL}/players/{slug[0]}/{slug}.html"
            resp = polite_request(url, session=session)
            soup = BeautifulSoup(resp.text, "lxml")
            df = table_to_df(find_table(soup, "per_game"))
            if df.empty:
                continue
            for col in ("G", "PTS", "TRB", "AST"):
                if col in df.columns:
                    df[col] = pd.to_numeric(df[col], errors="coerce")
            for _, row in df.iterrows():
                conn.execute("""
                    INSERT OR REPLACE INTO player_stats
                    (slug, season, team, games, ppg, rpg, apg)
                    VALUES (?, ?, ?, ?, ?, ?, ?)
                """, (slug, row.get("Season"), row.get("Tm"),
                      row.get("G"), row.get("PTS"),
                      row.get("TRB"), row.get("AST")))
            conn.commit()
            print(f"  Saved {len(df)} seasons")
        except Exception as e:
            print(f"  Failed: {e}")
        # Refresh session periodically to avoid stale cookies
        if (i + 1) % 20 == 0:
            session = create_scraping_session()
            print("  [Session refreshed]")
    conn.close()
    print(f"\nDone. Data saved to {output_db}")
Storing NBA Data with SQLite
A complete storage layer for Basketball-Reference data:
def init_nba_db(db_path="nba_data.db"):
"""Initialize a comprehensive NBA stats database."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS players (
slug TEXT PRIMARY KEY,
name TEXT NOT NULL,
position TEXT,
height TEXT,
weight TEXT,
birth_date TEXT,
draft_year INTEGER,
draft_pick INTEGER,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS season_stats (
slug TEXT NOT NULL,
season TEXT NOT NULL,
team TEXT,
games INTEGER,
games_started INTEGER,
minutes_pg REAL,
points_pg REAL,
rebounds_pg REAL,
assists_pg REAL,
steals_pg REAL,
blocks_pg REAL,
fg_pct REAL,
three_pct REAL,
ft_pct REAL,
per REAL,
ts_pct REAL,
win_shares REAL,
bpm REAL,
vorp REAL,
is_playoff INTEGER DEFAULT 0,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (slug, season, team, is_playoff),
FOREIGN KEY (slug) REFERENCES players(slug)
);
CREATE TABLE IF NOT EXISTS game_logs (
slug TEXT NOT NULL,
season INTEGER NOT NULL,
game_date TEXT,
opponent TEXT,
home_away TEXT,
result TEXT,
minutes INTEGER,
points INTEGER,
rebounds INTEGER,
assists INTEGER,
steals INTEGER,
blocks INTEGER,
turnovers INTEGER,
fg_made INTEGER,
fg_attempted INTEGER,
three_made INTEGER,
three_attempted INTEGER,
ft_made INTEGER,
ft_attempted INTEGER,
plus_minus INTEGER,
game_score REAL,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (slug, season, game_date)
);
CREATE INDEX IF NOT EXISTS idx_season_stats_slug ON season_stats(slug);
CREATE INDEX IF NOT EXISTS idx_game_logs_date ON game_logs(game_date);
""")
conn.commit()
return conn
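To move data from the scraping functions into this schema, a small upsert helper is enough. This is a sketch: it assumes Basketball-Reference's per-game column names ("Season", "Tm", "G", "PTS", "TRB", "AST"), fills only a subset of the schema's fields, and casts numpy scalars to native Python types, which sqlite3 cannot bind directly.

```python
import sqlite3
import pandas as pd

def save_season_stats(conn, slug, df, is_playoff=0):
    """Upsert per-game rows into season_stats.

    Assumes BBRef-style column names; adjust the mapping if the site
    renames columns. Only a subset of the schema's fields is filled here.
    """
    def num(value, cast):
        # sqlite3 can't bind numpy scalars, so cast to native types
        return cast(value) if pd.notna(value) else None

    for _, row in df.iterrows():
        conn.execute(
            """INSERT OR REPLACE INTO season_stats
               (slug, season, team, games, points_pg, rebounds_pg,
                assists_pg, is_playoff)
               VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
            (slug, row.get("Season"), row.get("Tm"),
             num(row.get("G"), int), num(row.get("PTS"), float),
             num(row.get("TRB"), float), num(row.get("AST"), float),
             is_playoff),
        )
    conn.commit()
```

Because the schema's primary key covers (slug, season, team, is_playoff), INSERT OR REPLACE makes re-runs idempotent: scraping the same player twice just refreshes the rows.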
Real-World Use Cases
1. Fantasy Basketball Draft Tool
def fantasy_value_analysis(season=2026, min_games=50):
"""Rank players by fantasy basketball value (9-cat leagues)."""
df = get_season_stats(season)
if df.empty:
return
qualified = df[df["G"] >= min_games].copy()
# Standard 9-category scoring
categories = ["PTS", "TRB", "AST", "STL", "BLK", "3P", "FG%", "FT%", "TOV"]
for cat in categories:
if cat in qualified.columns:
if cat == "TOV":
# Lower is better for turnovers
qualified[f"{cat}_z"] = -(qualified[cat] - qualified[cat].mean()) / qualified[cat].std()
else:
qualified[f"{cat}_z"] = (qualified[cat] - qualified[cat].mean()) / qualified[cat].std()
z_cols = [c for c in qualified.columns if c.endswith("_z")]
qualified["fantasy_value"] = qualified[z_cols].sum(axis=1)
qualified = qualified.sort_values("fantasy_value", ascending=False)
print("Top 20 Fantasy Values (9-cat):")
print(qualified[["Player", "Tm", "PTS", "TRB", "AST", "fantasy_value"]].head(20).to_string(index=False))
2. MVP Prediction Model Data
def collect_mvp_training_data(start_year=2000, end_year=2026):
"""Collect features for MVP prediction model training."""
records = []
for year in range(start_year, end_year + 1):
print(f"Collecting {year} season data...")
adv = get_season_stats(year, "advanced")
per_game = get_season_stats(year)
if adv.empty or per_game.empty:
continue
# Merge per-game and advanced stats
merged = per_game.merge(adv[["Player", "Tm", "PER", "WS", "BPM", "VORP"]],
on=["Player", "Tm"], how="left")
# Top candidates (high VORP players)
top = merged.nlargest(10, "VORP")
for _, row in top.iterrows():
records.append({
"season": year,
"player": row["Player"],
"team": row["Tm"],
"ppg": row.get("PTS"),
"rpg": row.get("TRB"),
"apg": row.get("AST"),
"per": row.get("PER"),
"ws": row.get("WS"),
"bpm": row.get("BPM"),
"vorp": row.get("VORP"),
})
time.sleep(random.uniform(4, 7))
return pd.DataFrame(records)
3. Player Comparison Tool
def compare_players(slugs_dict, seasons=5):
"""Compare multiple players across recent seasons."""
comparisons = {}
for name, slug in slugs_dict.items():
per_game = get_player_per_game(slug)
advanced = get_advanced_stats(slug)
if not per_game.empty:
recent = per_game.tail(seasons)
comparisons[name] = {
"ppg": recent["PTS"].mean() if "PTS" in recent else None,
"rpg": recent["TRB"].mean() if "TRB" in recent else None,
"apg": recent["AST"].mean() if "AST" in recent else None,
"fg_pct": recent["FG%"].mean() if "FG%" in recent else None,
"games": recent["G"].sum() if "G" in recent else None,
}
if not advanced.empty:
recent_adv = advanced.tail(seasons)
if name in comparisons:
comparisons[name].update({
"per": recent_adv["PER"].mean() if "PER" in recent_adv else None,
"ws": recent_adv["WS"].sum() if "WS" in recent_adv else None,
"bpm": recent_adv["BPM"].mean() if "BPM" in recent_adv else None,
})
comp_df = pd.DataFrame(comparisons).T
print(comp_df.round(1).to_string())
return comp_df
4. Injury Impact Analysis
def analyze_injury_impact(player_slug, injury_season):
"""Compare player stats before and after an injury season."""
per_game = get_player_per_game(player_slug)
if per_game.empty:
return
seasons = per_game["Season"].tolist()
injury_idx = None
for i, s in enumerate(seasons):
if injury_season in str(s):
injury_idx = i
break
if injury_idx is None:
print(f"Season {injury_season} not found")
return
before = per_game.iloc[max(0, injury_idx-3):injury_idx]
after = per_game.iloc[injury_idx+1:injury_idx+4]
print(f"Pre-injury (3 seasons before):")
print(f" PPG: {before['PTS'].mean():.1f}, RPG: {before['TRB'].mean():.1f}")
print(f"Post-injury (3 seasons after):")
print(f" PPG: {after['PTS'].mean():.1f}, RPG: {after['TRB'].mean():.1f}")
5. Draft Class Analysis
def analyze_draft_class(draft_year, seasons_to_check=5):
"""Analyze a draft class's performance over their first N seasons."""
url = f"{BASE_URL}/draft/NBA_{draft_year}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "stats")
    if table is None:
print("Draft page table not found")
return
df = table_to_df(table)
print(f"Draft class {draft_year}: {len(df)} picks")
print(df[["Pk", "Player", "Tm"]].head(15).to_string(index=False))
return df
6. Clutch Performance Tracker
def find_clutch_performances(player_slug, season, min_points=30):
"""Find games where a player scored big in close games."""
logs = get_game_logs(player_slug, season)
if logs.empty:
return pd.DataFrame()
big_games = logs[logs["PTS"] >= min_points].copy()
print(f"Games with {min_points}+ points: {len(big_games)}")
if not big_games.empty:
print(big_games[["Date", "Opp", "PTS", "TRB", "AST", "GmSc"]].to_string(index=False))
return big_games
7. Historical Trend Analysis
def league_scoring_trends(start_year=1980, end_year=2026):
"""Track how league-wide scoring has evolved over decades."""
trends = []
for year in range(start_year, end_year + 1):
try:
df = get_season_stats(year)
if not df.empty:
trends.append({
"season": f"{year-1}-{str(year)[2:]}",
"avg_ppg": df["PTS"].mean(),
"avg_3pa": df["3PA"].mean() if "3PA" in df.columns else None,
"avg_3p_pct": df["3P%"].mean() if "3P%" in df.columns else None,
                    "avg_pace": df["Pace"].mean() if "Pace" in df.columns else None,
})
except Exception:
pass
time.sleep(random.uniform(4, 8))
trend_df = pd.DataFrame(trends)
print(trend_df.to_string(index=False))
return trend_df
Common Pitfalls and Solutions
1. Comment-Wrapped Tables
The biggest gotcha. Always use the find_table() helper that checks both visible tables and HTML comments. Without it, you'll miss advanced, shooting, and playoff tables.
2. Traded Players Appear Multiple Times
A player traded mid-season has rows for each team plus a "TOT" (total) row. Filter Tm == "TOT" for season totals, or keep individual team rows for team-specific analysis.
def handle_traded_players(df):
"""Keep only the TOT row for traded players."""
if "Tm" not in df.columns:
return df
# Find players who appear multiple times (traded)
traded = df[df.duplicated("Player", keep=False)]
if traded.empty:
return df
# Keep TOT rows for traded, all rows for non-traded
non_traded = df[~df.duplicated("Player", keep=False)]
tot_rows = df[(df.duplicated("Player", keep=False)) & (df["Tm"] == "TOT")]
return pd.concat([non_traded, tot_rows], ignore_index=True)
3. Season Numbering
"2026" means the 2025-26 season — the ending year, not the starting year. This trips up everyone at first.
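Two tiny converters (a sketch, matching the site's "2025-26" label format) keep the off-by-one out of your code:

```python
def season_to_year(season_label):
    """'2025-26' -> 2026: the ending year BBRef uses in URLs."""
    return int(season_label.split("-")[0]) + 1

def year_to_season(year):
    """2026 -> '2025-26': the label shown in stats tables."""
    return f"{year - 1}-{str(year)[2:]}"
```

Use season_to_year() when turning a "Season" column value into a game-log or league-page URL, and year_to_season() when labeling output.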
4. Table IDs Vary by Page
Per-game stats use per_game, game logs use pgl_basic, advanced uses advanced. Don't assume one ID works everywhere. Check the page source.
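When you're unsure what a page calls its tables, dump every table id on it, including the comment-wrapped ones, using the same trick as find_table(). A sketch:

```python
from bs4 import BeautifulSoup, Comment

def list_table_ids(soup):
    """Return all table ids on a page, visible and comment-wrapped."""
    ids = [t.get("id") for t in soup.find_all("table") if t.get("id")]
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        if "<table" in comment:
            inner = BeautifulSoup(comment, "lxml")
            ids += [t.get("id") for t in inner.find_all("table") if t.get("id")]
    return sorted(set(ids))
```

Run it on any fetched page to see exactly which ids you can pass to find_table().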
5. Missing Data for Older Seasons
Three-point statistics don't exist before 1979-80. Advanced metrics like VORP start later. Always handle NaN values gracefully.
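A small guard helper (a sketch using pandas) keeps aggregations from blowing up when a column is missing or entirely NaN for an old season:

```python
import pandas as pd

def safe_mean(df, col):
    """Column mean as a float, or None if the column is absent or all-NaN."""
    if col not in df.columns:
        return None
    series = pd.to_numeric(df[col], errors="coerce").dropna()
    return float(series.mean()) if not series.empty else None
```

For example, safe_mean(df, "3PA") returns None on a 1975 season frame instead of raising a KeyError.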
6. Cloudflare Challenges
If you get a response containing "Checking your browser" text, your IP is flagged. Switch to residential proxies through ThorData — residential IPs from real ISPs are far less likely to trigger Cloudflare challenges.
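It helps to detect a challenge page before trying to parse it. A heuristic sketch; the marker strings are assumptions based on typical Cloudflare interstitials, so adjust them if the block page changes:

```python
def looks_blocked(resp):
    """Heuristically detect a Cloudflare challenge instead of real content."""
    if resp.status_code in (403, 503):
        return True
    # Marker strings seen on typical Cloudflare interstitial pages (assumption)
    markers = ("Checking your browser", "Just a moment", "cf-chl")
    return any(marker in resp.text for marker in markers)
```

Call it on every response and switch to the proxy path when it returns True, rather than feeding challenge HTML into BeautifulSoup.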
Basketball-Reference is a remarkably clean data source. Keep your request rate slow (3-5 seconds between requests minimum), rotate IPs for larger jobs, always check those HTML comments for hidden tables, and handle traded-player duplicates. Follow these rules and you'll have decades of basketball data at your fingertips.