Scraping Basketball-Reference for NBA Stats with Python (2026)
Basketball-Reference is the definitive source for NBA statistics. It covers every player who's ever stepped on an NBA court, going back to the league's founding in 1946, with advanced metrics like PER, Win Shares, VORP, and Box Plus/Minus that you won't find elsewhere for free. If you're building a fantasy basketball tool, training a machine learning model on player performance, or analyzing historical trends for sports journalism, Basketball-Reference is where the data lives.
There's no official API. The data sits in well-structured HTML tables, which makes it pleasant to scrape — if you handle the anti-bot measures correctly. Basketball-Reference is owned by Sports Reference, which also runs Pro-Football-Reference and Baseball-Reference. They all share the same infrastructure, the same table structures, and the same anti-bot protections, so what you learn here applies across all three.
The site serves roughly 30 million page views per month, and they're protective of their bandwidth. Get caught scraping too aggressively and you'll eat a temporary IP ban — usually one hour for the first offense, longer for repeat violations. The site uses Cloudflare for protection, so you'll see JavaScript challenges on suspicious traffic. None of this is insurmountable, but you need to approach it with respect for their infrastructure.
This guide covers everything from basic per-game stats to advanced analytics, game logs, team pages, playoff data, and building complete historical datasets. Every code example is production-ready with proper error handling, rate limiting, and storage. Let's get your data pipeline running.
Setup and Dependencies
pip install requests beautifulsoup4 pandas lxml httpx
Basketball-Reference serves static HTML with data in <table> elements. No Playwright or Selenium needed — this is a pure HTTP + HTML parsing job. The lxml parser is faster than Python's built-in html.parser, and pandas' read_html() makes table extraction trivial.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
import time
import random
import sqlite3
from datetime import datetime
# Base configuration
BASE_URL = "https://www.basketball-reference.com"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
}
def polite_request(url, session=None, min_delay=3.5, max_delay=6.0):
"""Make a request with human-like delay and error handling."""
time.sleep(random.uniform(min_delay, max_delay))
client = session or requests
resp = client.get(url, headers=HEADERS, timeout=30)
if resp.status_code == 429:
wait = int(resp.headers.get("Retry-After", 120))
print(f"Rate limited — waiting {wait}s")
time.sleep(wait)
resp = client.get(url, headers=HEADERS, timeout=30)
if resp.status_code == 403:
        print(f"Blocked (403) on {url} — need proxy rotation or longer delay")
raise requests.HTTPError(f"403 on {url}")
resp.raise_for_status()
return resp
Understanding Basketball-Reference's Table Structure
Before diving into code, understand how the site structures its HTML. Most stats pages have multiple tables — per-game averages, totals, per-36-minute, advanced, shooting, play-by-play. Each has a unique id attribute.
The critical gotcha: Basketball-Reference wraps some tables in HTML comments to speed up page rendering. When the page loads in a browser, JavaScript uncomments them. But when you fetch the raw HTML with requests, those tables are invisible to soup.find("table"). You must parse comments separately.
def find_table(soup, table_id):
"""Find a table by ID, checking both visible tables and HTML comments.
Basketball-Reference hides some tables in HTML comments for performance.
This function checks both — it's the single most important helper for BBRef scraping.
"""
# First try: visible table
table = soup.find("table", id=table_id)
if table:
return table
# Second try: check inside HTML comments
comments = soup.find_all(string=lambda t: isinstance(t, Comment))
for comment in comments:
if f'id="{table_id}"' in comment:
comment_soup = BeautifulSoup(comment, "lxml")
table = comment_soup.find("table", id=table_id)
if table:
return table
return None
def table_to_df(table):
"""Convert an HTML table to a clean pandas DataFrame."""
if table is None:
return pd.DataFrame()
    # Wrap in StringIO — pandas deprecated passing literal HTML strings to read_html
    from io import StringIO
    df = pd.read_html(StringIO(str(table)))[0]
# Remove repeated header rows (BBRef uses these as visual separators)
if "Season" in df.columns:
df = df[df["Season"] != "Season"]
if "Rk" in df.columns:
df = df[df["Rk"] != "Rk"]
return df
Player Season Stats (Per-Game Averages)
The most common use case — getting a player's career per-game averages:
def get_player_per_game(player_slug):
"""Get career per-game stats for a player.
Player slug is the URL identifier, e.g., 'jamesle01' for LeBron James.
Find slugs by searching the site or checking player page URLs.
"""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "per_game")
df = table_to_df(table)
if df.empty:
return df
# Convert numeric columns
numeric_cols = ["G", "GS", "MP", "FG", "FGA", "FG%", "3P", "3PA", "3P%",
"2P", "2PA", "2P%", "eFG%", "FT", "FTA", "FT%",
"ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS"]
for col in numeric_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
return df
# Example: LeBron James career per-game
stats = get_player_per_game("jamesle01")
print(stats[["Season", "Tm", "G", "PTS", "TRB", "AST"]].tail(10))
Expected output:
Season Tm G PTS TRB AST
15 2018-19 LAL 55 27.4 8.5 8.3
16 2019-20 LAL 67 25.3 7.8 10.2
17 2020-21 LAL 45 25.0 7.7 7.8
18 2021-22 LAL 56 30.3 8.2 6.2
19 2022-23 LAL 55 28.9 8.3 6.8
20 2023-24 LAL 71 25.7 7.3 8.3
21 2024-25 LAL 62 23.5 7.9 9.0
22 2025-26 LAL 48 21.8 6.8 8.4
Advanced Metrics — PER, Win Shares, VORP
The stats that analysts and front offices actually care about:
def get_advanced_stats(player_slug):
"""Get career advanced stats — PER, TS%, WS, BPM, VORP."""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "advanced")
df = table_to_df(table)
if df.empty:
return df
# Key advanced stat columns
adv_cols = ["PER", "TS%", "3PAr", "FTr", "ORB%", "DRB%", "TRB%",
"AST%", "STL%", "BLK%", "TOV%", "USG%",
"OWS", "DWS", "WS", "WS/48", "OBPM", "DBPM", "BPM", "VORP"]
for col in adv_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
return df
# Compare two MVPs
giannis = get_advanced_stats("antetgi01")
jokic = get_advanced_stats("jokicni01")
print("Giannis Advanced (last 5 seasons):")
print(giannis[["Season", "PER", "TS%", "WS", "BPM", "VORP"]].tail(5))
print("\nJokic Advanced (last 5 seasons):")
print(jokic[["Season", "PER", "TS%", "WS", "BPM", "VORP"]].tail(5))
Shooting Splits and Shot Charts
def get_shooting_stats(player_slug):
"""Get shooting splits — by distance, shot type, and quarter."""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "shooting")
return table_to_df(table)
def get_shot_chart_data(player_slug, season):
"""Get individual game shooting data for shot chart reconstruction."""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}/shooting/{season}"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
# Shot chart data is in a specific div
shots = []
chart = soup.find("div", id="shot-chart")
if chart:
for tip in chart.find_all("div", class_="tooltip"):
style = tip.get("style", "")
# Parse position from CSS top/left
shots.append({
"description": tip.get_text(strip=True),
"style": style,
"made": "make" in tip.get("class", []),
})
return shots
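The style attribute captured above encodes the shot's court position as CSS offsets. Assuming the inline style follows the typical `top:Xpx;left:Ypx` pattern (an assumption; verify against the live markup), a minimal parser might look like:

```python
import re

# Assumes the tooltip style looks like "top:45px;left:220px;" -- check the
# live HTML, since Basketball-Reference may change this format.
STYLE_RE = re.compile(r"top:\s*(-?\d+)px;\s*left:\s*(-?\d+)px")

def parse_shot_position(style):
    """Extract (top, left) pixel offsets from a shot tooltip's inline CSS."""
    match = STYLE_RE.search(style)
    if match:
        return int(match.group(1)), int(match.group(2))
    return None
```

Feed each shot's "style" value through this to get plottable coordinates: parse_shot_position("top:45px;left:220px;") returns (45, 220).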
Game Logs — Every Game, Full Box Score
The most granular data available — individual game stats:
def get_game_logs(player_slug, season):
"""Get game-by-game stats for a player in a season.
Season 2026 = the 2025-26 season (ending year).
"""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}/gamelog/{season}"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "pgl_basic")
df = table_to_df(table)
if df.empty:
return df
# Convert stats to numeric
stat_cols = ["GS", "MP", "FG", "FGA", "FG%", "3P", "3PA", "3P%",
"FT", "FTA", "FT%", "ORB", "DRB", "TRB", "AST",
"STL", "BLK", "TOV", "PF", "PTS", "GmSc", "+/-"]
for col in stat_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
return df
# Luka Doncic 2025-26 game logs
logs = get_game_logs("doncilu01", 2026)
if not logs.empty:
print(f"Games played: {len(logs)}")
print(f"PPG: {logs['PTS'].mean():.1f}")
print(f"Best game: {logs['PTS'].max()} points")
print(f"Triple-doubles: {len(logs[(logs['PTS'] >= 10) & (logs['TRB'] >= 10) & (logs['AST'] >= 10)])}")
# Monthly splits
if "Date" in logs.columns:
logs["Month"] = pd.to_datetime(logs["Date"]).dt.month_name()
monthly = logs.groupby("Month")[["PTS", "TRB", "AST"]].mean().round(1)
print("\nMonthly averages:")
print(monthly)
Playoff Stats
Playoff data lives on the same player page but in different tables:
def get_playoff_stats(player_slug):
"""Get career playoff per-game averages."""
url = f"{BASE_URL}/players/{player_slug[0]}/{player_slug}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
    # Playoff tables use a playoffs_ prefix on the table id
table = find_table(soup, "playoffs_per_game")
df = table_to_df(table)
if not df.empty:
numeric_cols = ["G", "PTS", "TRB", "AST", "STL", "BLK"]
for col in numeric_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
return df
def compare_regular_vs_playoffs(player_slug):
"""Compare a player's regular season vs playoff performance."""
regular = get_player_per_game(player_slug)
playoffs = get_playoff_stats(player_slug)
if regular.empty or playoffs.empty:
print("Insufficient data for comparison")
return
# Career averages
reg_avg = regular[["PTS", "TRB", "AST"]].mean()
play_avg = playoffs[["PTS", "TRB", "AST"]].mean()
print(f"Regular Season: {reg_avg['PTS']:.1f}/{reg_avg['TRB']:.1f}/{reg_avg['AST']:.1f}")
print(f"Playoffs: {play_avg['PTS']:.1f}/{play_avg['TRB']:.1f}/{play_avg['AST']:.1f}")
for stat in ["PTS", "TRB", "AST"]:
diff = play_avg[stat] - reg_avg[stat]
        direction = "UP" if diff > 0 else "DOWN"
        print(f"  {stat}: {direction} {abs(diff):.1f}")
League-Wide Season Data
Stats for every player in a season — essential for rankings and comparisons:
def get_season_stats(season, stat_type="per_game"):
"""Get league-wide stats for a season.
stat_type options: 'per_game', 'totals', 'per_minute', 'per_poss', 'advanced'
"""
url = f"{BASE_URL}/leagues/NBA_{season}_{stat_type}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, f"{stat_type}_stats")
df = table_to_df(table)
if df.empty:
return df
# Convert all numeric columns
skip_cols = {"Player", "Pos", "Tm", "Season", "Rk"}
for col in df.columns:
if col not in skip_cols:
df[col] = pd.to_numeric(df[col], errors="coerce")
return df
# Top scorers in 2025-26
season = get_season_stats(2026)
if not season.empty:
# Filter to players with enough games
qualified = season[season["G"] >= 40].copy()
print("Top 10 Scorers (2025-26):")
top_scorers = qualified.nlargest(10, "PTS")
print(top_scorers[["Player", "Tm", "G", "PTS", "TRB", "AST", "FG%"]].to_string(index=False))
print("\nTop 10 by PER:")
adv = get_season_stats(2026, "advanced")
if not adv.empty:
qualified_adv = adv[adv["G"] >= 40]
print(qualified_adv.nlargest(10, "PER")[["Player", "PER", "WS", "BPM", "VORP"]].to_string(index=False))
Team Statistics
def get_team_roster(team_abbr, season):
"""Get a team's roster with per-game stats.
team_abbr: e.g., 'LAL', 'BOS', 'GSW', 'MIL', 'DEN'
"""
url = f"{BASE_URL}/teams/{team_abbr}/{season}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "per_game")
df = table_to_df(table)
# Add team context
if not df.empty:
df["Team"] = team_abbr
df["Season"] = season
return df
def get_team_standings(season):
"""Get full league standings for a season."""
url = f"{BASE_URL}/leagues/NBA_{season}_standings.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
east = find_table(soup, "divs_standings_E")
west = find_table(soup, "divs_standings_W")
results = {}
    if east is not None:
        results["Eastern"] = table_to_df(east)
    if west is not None:
        results["Western"] = table_to_df(west)
return results
def get_team_schedule(team_abbr, season):
"""Get a team's full season schedule with results."""
url = f"{BASE_URL}/teams/{team_abbr}/{season}_games.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "games")
return table_to_df(table)
# Example: Lakers roster and team stats
roster = get_team_roster("LAL", 2026)
if not roster.empty:
print("Lakers 2025-26 Roster:")
print(roster[["Player", "G", "MP", "PTS", "TRB", "AST"]].to_string(index=False))
Historical Data and Multi-Season Analysis
Building datasets across multiple seasons for trend analysis:
def build_career_dataset(player_slugs):
"""Collect career stats for multiple players into one DataFrame."""
all_stats = []
session = requests.Session()
session.headers.update(HEADERS)
for name, slug in player_slugs.items():
print(f"Fetching {name}...")
try:
url = f"{BASE_URL}/players/{slug[0]}/{slug}.html"
resp = polite_request(url, session=session)
soup = BeautifulSoup(resp.text, "lxml")
# Get per-game stats
per_game = table_to_df(find_table(soup, "per_game"))
if not per_game.empty:
per_game["Player"] = name
per_game["Slug"] = slug
all_stats.append(per_game)
except Exception as e:
print(f" Failed for {name}: {e}")
if not all_stats:
return pd.DataFrame()
combined = pd.concat(all_stats, ignore_index=True)
return combined
def build_historical_season_data(start_year, end_year):
"""Build a multi-season dataset of league-wide stats."""
all_seasons = []
for year in range(start_year, end_year + 1):
print(f"Fetching {year-1}-{str(year)[2:]} season...")
try:
df = get_season_stats(year)
if not df.empty:
df["Season_Year"] = year
all_seasons.append(df)
except Exception as e:
print(f" Failed for {year}: {e}")
if not all_seasons:
return pd.DataFrame()
combined = pd.concat(all_seasons, ignore_index=True)
print(f"\nCollected {len(combined)} player-seasons from {start_year} to {end_year}")
return combined
# All-time greats comparison
greats = {
"LeBron James": "jamesle01",
"Kevin Durant": "duranke01",
"Stephen Curry": "curryst01",
"Nikola Jokic": "jokicni01",
"Giannis Antetokounmpo": "antetgi01",
"Luka Doncic": "doncilu01",
}
dataset = build_career_dataset(greats)
Player Search and Discovery
def search_player(name):
"""Search for a player and return their slug and basic info."""
    from urllib.parse import quote
    resp = polite_request(f"{BASE_URL}/search/search.fcgi?search={quote(name)}")
# BBRef either redirects to the player page or shows search results
if "/players/" in resp.url:
# Direct match — extract slug from URL
slug = resp.url.split("/players/")[1].rstrip("/").split("/")[-1].replace(".html", "")
return {"name": name, "slug": slug, "url": resp.url}
# Parse search results
soup = BeautifulSoup(resp.text, "lxml")
results = []
for item in soup.select(".search-item-name"):
link = item.find("a")
if link and "/players/" in link.get("href", ""):
href = link["href"]
slug = href.split("/")[-1].replace(".html", "")
results.append({
"name": link.get_text(strip=True),
"slug": slug,
"url": BASE_URL + href
})
return results
# Find a player
results = search_player("Nikola Jokic")
print(results)
Anti-Bot Protections and How to Handle Them
Basketball-Reference has progressively tightened its defenses. Here's what you'll encounter and how to handle each:
Rate Limiting
The site allows roughly 20 requests per minute. Exceed that and you get 429 responses followed by a temporary IP ban (usually 1 hour, escalating to 24 hours for repeat offenses).
class RateLimiter:
"""Track request timing to stay under BBRef's rate limits."""
def __init__(self, requests_per_minute=15):
self.rpm = requests_per_minute
self.request_times = []
def wait_if_needed(self):
"""Block until we can make another request without exceeding limits."""
now = time.time()
# Remove requests older than 60 seconds
self.request_times = [t for t in self.request_times if now - t < 60]
if len(self.request_times) >= self.rpm:
oldest = self.request_times[0]
wait = 60 - (now - oldest) + random.uniform(1, 3)
if wait > 0:
print(f"Rate limit approaching — waiting {wait:.1f}s")
time.sleep(wait)
self.request_times.append(time.time())
limiter = RateLimiter(requests_per_minute=15)
def rate_limited_request(url, session=None):
"""Make a request respecting rate limits."""
limiter.wait_if_needed()
client = session or requests
resp = client.get(url, headers=HEADERS, timeout=30)
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 120))
print(f"Rate limited! Backing off {retry_after}s")
time.sleep(retry_after)
limiter.request_times.clear()
resp = client.get(url, headers=HEADERS, timeout=30)
resp.raise_for_status()
return resp
IP Bans and Proxy Rotation
The site is harsh on cloud provider IP ranges (AWS, GCP, Azure, DigitalOcean). If you're scraping from a VPS, you'll likely get blocked within minutes. For any project pulling data across multiple players or seasons, you need residential IP rotation.
ThorData provides residential proxy pools that rotate automatically. Each request exits from a different residential IP, making your traffic look like normal users browsing the site:
THORDATA_PROXY = "http://USERNAME:PASSWORD@PROXY_HOST:9000"  # substitute your ThorData credentials and endpoint
def request_with_proxy(url, proxy_url=THORDATA_PROXY):
"""Make a request through ThorData residential proxy."""
proxies = {"http": proxy_url, "https": proxy_url}
limiter.wait_if_needed()
resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=30)
if resp.status_code == 429:
print("Rate limited even through proxy — backing off 120s")
time.sleep(120)
resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=30)
resp.raise_for_status()
return resp
def scrape_with_fallback(url, session=None):
"""Try direct request first, fall back to proxy on block."""
try:
return rate_limited_request(url, session)
except requests.HTTPError as e:
if e.response is not None and e.response.status_code in (403, 429):
            print("Blocked directly, trying ThorData proxy...")
return request_with_proxy(url)
raise
Session Management for Extended Scraping
def create_scraping_session():
"""Create a requests session that mimics a real browser session."""
session = requests.Session()
session.headers.update(HEADERS)
# Visit the homepage first to get cookies
try:
session.get(BASE_URL, timeout=30)
time.sleep(random.uniform(2, 4))
except Exception:
pass
return session
def scrape_multiple_players(player_slugs, output_db="nba_stats.db"):
    """Production scraper for multiple players with session reuse."""
    session = create_scraping_session()
    conn = sqlite3.connect(output_db)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS player_stats (
            slug TEXT, season TEXT, team TEXT, games INTEGER,
            ppg REAL, rpg REAL, apg REAL,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (slug, season)
        )
    """)
    for i, (name, slug) in enumerate(player_slugs.items()):
        print(f"[{i+1}/{len(player_slugs)}] {name}")
        try:
            # Fetch through the shared session so cookies persist across requests
            url = f"{BASE_URL}/players/{slug[0]}/{slug}.html"
            resp = polite_request(url, session=session)
            soup = BeautifulSoup(resp.text, "lxml")
            df = table_to_df(find_table(soup, "per_game"))
            if df.empty:
                continue
            for col in ("G", "PTS", "TRB", "AST"):
                if col in df.columns:
                    df[col] = pd.to_numeric(df[col], errors="coerce")
            for _, row in df.iterrows():
                conn.execute("""
                    INSERT OR REPLACE INTO player_stats
                    (slug, season, team, games, ppg, rpg, apg)
                    VALUES (?, ?, ?, ?, ?, ?, ?)
                """, (slug, row.get("Season"), row.get("Tm"),
                      row.get("G"), row.get("PTS"),
                      row.get("TRB"), row.get("AST")))
            conn.commit()
            print(f"  Saved {len(df)} seasons")
        except Exception as e:
            print(f"  Failed: {e}")
        # Refresh session periodically to avoid stale cookies
        if (i + 1) % 20 == 0:
            session = create_scraping_session()
            print("  [Session refreshed]")
    conn.close()
    print(f"\nDone. Data saved to {output_db}")
Storing NBA Data with SQLite
A complete storage layer for Basketball-Reference data:
def init_nba_db(db_path="nba_data.db"):
"""Initialize a comprehensive NBA stats database."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS players (
slug TEXT PRIMARY KEY,
name TEXT NOT NULL,
position TEXT,
height TEXT,
weight TEXT,
birth_date TEXT,
draft_year INTEGER,
draft_pick INTEGER,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS season_stats (
slug TEXT NOT NULL,
season TEXT NOT NULL,
team TEXT,
games INTEGER,
games_started INTEGER,
minutes_pg REAL,
points_pg REAL,
rebounds_pg REAL,
assists_pg REAL,
steals_pg REAL,
blocks_pg REAL,
fg_pct REAL,
three_pct REAL,
ft_pct REAL,
per REAL,
ts_pct REAL,
win_shares REAL,
bpm REAL,
vorp REAL,
is_playoff INTEGER DEFAULT 0,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (slug, season, team, is_playoff),
FOREIGN KEY (slug) REFERENCES players(slug)
);
CREATE TABLE IF NOT EXISTS game_logs (
slug TEXT NOT NULL,
season INTEGER NOT NULL,
game_date TEXT,
opponent TEXT,
home_away TEXT,
result TEXT,
minutes INTEGER,
points INTEGER,
rebounds INTEGER,
assists INTEGER,
steals INTEGER,
blocks INTEGER,
turnovers INTEGER,
fg_made INTEGER,
fg_attempted INTEGER,
three_made INTEGER,
three_attempted INTEGER,
ft_made INTEGER,
ft_attempted INTEGER,
plus_minus INTEGER,
game_score REAL,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (slug, season, game_date)
);
CREATE INDEX IF NOT EXISTS idx_season_stats_slug ON season_stats(slug);
CREATE INDEX IF NOT EXISTS idx_game_logs_date ON game_logs(game_date);
""")
conn.commit()
return conn
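To move data from the scraping functions into this schema, a small upsert helper is enough. This is a sketch: it assumes Basketball-Reference's per-game column names ("Season", "Tm", "G", "PTS", "TRB", "AST"), fills only a subset of the schema's fields, and casts numpy scalars to native Python types, which sqlite3 cannot bind directly.

```python
import sqlite3
import pandas as pd

def save_season_stats(conn, slug, df, is_playoff=0):
    """Upsert per-game rows into season_stats.

    Assumes BBRef-style column names; adjust the mapping if the site
    renames columns. Only a subset of the schema's fields is filled here.
    """
    def num(value, cast):
        # sqlite3 can't bind numpy scalars, so cast to native types
        return cast(value) if pd.notna(value) else None

    for _, row in df.iterrows():
        conn.execute(
            """INSERT OR REPLACE INTO season_stats
               (slug, season, team, games, points_pg, rebounds_pg,
                assists_pg, is_playoff)
               VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
            (slug, row.get("Season"), row.get("Tm"),
             num(row.get("G"), int), num(row.get("PTS"), float),
             num(row.get("TRB"), float), num(row.get("AST"), float),
             is_playoff),
        )
    conn.commit()
```

Because the schema's primary key covers (slug, season, team, is_playoff), INSERT OR REPLACE makes re-runs idempotent: scraping the same player twice just refreshes the rows.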
Real-World Use Cases
1. Fantasy Basketball Draft Tool
def fantasy_value_analysis(season=2026, min_games=50):
"""Rank players by fantasy basketball value (9-cat leagues)."""
df = get_season_stats(season)
if df.empty:
return
qualified = df[df["G"] >= min_games].copy()
# Standard 9-category scoring
categories = ["PTS", "TRB", "AST", "STL", "BLK", "3P", "FG%", "FT%", "TOV"]
for cat in categories:
if cat in qualified.columns:
if cat == "TOV":
# Lower is better for turnovers
qualified[f"{cat}_z"] = -(qualified[cat] - qualified[cat].mean()) / qualified[cat].std()
else:
qualified[f"{cat}_z"] = (qualified[cat] - qualified[cat].mean()) / qualified[cat].std()
z_cols = [c for c in qualified.columns if c.endswith("_z")]
qualified["fantasy_value"] = qualified[z_cols].sum(axis=1)
qualified = qualified.sort_values("fantasy_value", ascending=False)
print("Top 20 Fantasy Values (9-cat):")
print(qualified[["Player", "Tm", "PTS", "TRB", "AST", "fantasy_value"]].head(20).to_string(index=False))
2. MVP Prediction Model Data
def collect_mvp_training_data(start_year=2000, end_year=2026):
"""Collect features for MVP prediction model training."""
records = []
for year in range(start_year, end_year + 1):
print(f"Collecting {year} season data...")
adv = get_season_stats(year, "advanced")
per_game = get_season_stats(year)
if adv.empty or per_game.empty:
continue
# Merge per-game and advanced stats
merged = per_game.merge(adv[["Player", "Tm", "PER", "WS", "BPM", "VORP"]],
on=["Player", "Tm"], how="left")
# Top candidates (high VORP players)
top = merged.nlargest(10, "VORP")
for _, row in top.iterrows():
records.append({
"season": year,
"player": row["Player"],
"team": row["Tm"],
"ppg": row.get("PTS"),
"rpg": row.get("TRB"),
"apg": row.get("AST"),
"per": row.get("PER"),
"ws": row.get("WS"),
"bpm": row.get("BPM"),
"vorp": row.get("VORP"),
})
time.sleep(random.uniform(4, 7))
return pd.DataFrame(records)
3. Player Comparison Tool
def compare_players(slugs_dict, seasons=5):
"""Compare multiple players across recent seasons."""
comparisons = {}
for name, slug in slugs_dict.items():
per_game = get_player_per_game(slug)
advanced = get_advanced_stats(slug)
if not per_game.empty:
recent = per_game.tail(seasons)
comparisons[name] = {
"ppg": recent["PTS"].mean() if "PTS" in recent else None,
"rpg": recent["TRB"].mean() if "TRB" in recent else None,
"apg": recent["AST"].mean() if "AST" in recent else None,
"fg_pct": recent["FG%"].mean() if "FG%" in recent else None,
"games": recent["G"].sum() if "G" in recent else None,
}
if not advanced.empty:
recent_adv = advanced.tail(seasons)
if name in comparisons:
comparisons[name].update({
"per": recent_adv["PER"].mean() if "PER" in recent_adv else None,
"ws": recent_adv["WS"].sum() if "WS" in recent_adv else None,
"bpm": recent_adv["BPM"].mean() if "BPM" in recent_adv else None,
})
comp_df = pd.DataFrame(comparisons).T
print(comp_df.round(1).to_string())
return comp_df
4. Injury Impact Analysis
def analyze_injury_impact(player_slug, injury_season):
"""Compare player stats before and after an injury season."""
per_game = get_player_per_game(player_slug)
if per_game.empty:
return
seasons = per_game["Season"].tolist()
injury_idx = None
for i, s in enumerate(seasons):
if injury_season in str(s):
injury_idx = i
break
if injury_idx is None:
print(f"Season {injury_season} not found")
return
before = per_game.iloc[max(0, injury_idx-3):injury_idx]
after = per_game.iloc[injury_idx+1:injury_idx+4]
print(f"Pre-injury (3 seasons before):")
print(f" PPG: {before['PTS'].mean():.1f}, RPG: {before['TRB'].mean():.1f}")
print(f"Post-injury (3 seasons after):")
print(f" PPG: {after['PTS'].mean():.1f}, RPG: {after['TRB'].mean():.1f}")
5. Draft Class Analysis
def analyze_draft_class(draft_year, seasons_to_check=5):
"""Analyze a draft class's performance over their first N seasons."""
url = f"{BASE_URL}/draft/NBA_{draft_year}.html"
resp = polite_request(url)
soup = BeautifulSoup(resp.text, "lxml")
table = find_table(soup, "stats")
    if table is None:
print("Draft page table not found")
return
df = table_to_df(table)
print(f"Draft class {draft_year}: {len(df)} picks")
print(df[["Pk", "Player", "Tm"]].head(15).to_string(index=False))
return df
6. Clutch Performance Tracker
def find_clutch_performances(player_slug, season, min_points=30):
"""Find games where a player scored big in close games."""
logs = get_game_logs(player_slug, season)
if logs.empty:
return pd.DataFrame()
big_games = logs[logs["PTS"] >= min_points].copy()
print(f"Games with {min_points}+ points: {len(big_games)}")
if not big_games.empty:
print(big_games[["Date", "Opp", "PTS", "TRB", "AST", "GmSc"]].to_string(index=False))
return big_games
7. Historical Trend Analysis
def league_scoring_trends(start_year=1980, end_year=2026):
"""Track how league-wide scoring has evolved over decades."""
trends = []
for year in range(start_year, end_year + 1):
try:
df = get_season_stats(year)
if not df.empty:
trends.append({
"season": f"{year-1}-{str(year)[2:]}",
"avg_ppg": df["PTS"].mean(),
"avg_3pa": df["3PA"].mean() if "3PA" in df.columns else None,
"avg_3p_pct": df["3P%"].mean() if "3P%" in df.columns else None,
                    "avg_pace": df["Pace"].mean() if "Pace" in df.columns else None,
})
except Exception:
pass
time.sleep(random.uniform(4, 8))
trend_df = pd.DataFrame(trends)
print(trend_df.to_string(index=False))
return trend_df
Common Pitfalls and Solutions
1. Comment-Wrapped Tables
The biggest gotcha. Always use the find_table() helper that checks both visible tables and HTML comments. Without it, you'll miss advanced, shooting, and playoff tables.
2. Traded Players Appear Multiple Times
A player traded mid-season has rows for each team plus a "TOT" (total) row. Filter Tm == "TOT" for season totals, or keep individual team rows for team-specific analysis.
def handle_traded_players(df):
"""Keep only the TOT row for traded players."""
if "Tm" not in df.columns:
return df
# Find players who appear multiple times (traded)
traded = df[df.duplicated("Player", keep=False)]
if traded.empty:
return df
# Keep TOT rows for traded, all rows for non-traded
non_traded = df[~df.duplicated("Player", keep=False)]
tot_rows = df[(df.duplicated("Player", keep=False)) & (df["Tm"] == "TOT")]
return pd.concat([non_traded, tot_rows], ignore_index=True)
3. Season Numbering
"2026" means the 2025-26 season — the ending year, not the starting year. This trips up everyone at first.
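Two tiny converters (a sketch, matching the site's "2025-26" label format) keep the off-by-one out of your code:

```python
def season_to_year(season_label):
    """'2025-26' -> 2026: the ending year BBRef uses in URLs."""
    return int(season_label.split("-")[0]) + 1

def year_to_season(year):
    """2026 -> '2025-26': the label shown in stats tables."""
    return f"{year - 1}-{str(year)[2:]}"
```

Use season_to_year() when turning a "Season" column value into a game-log or league-page URL, and year_to_season() when labeling output.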
4. Table IDs Vary by Page
Per-game stats use per_game, game logs use pgl_basic, advanced uses advanced. Don't assume one ID works everywhere. Check the page source.
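When you're unsure what a page calls its tables, dump every table id on it, including the comment-wrapped ones, using the same trick as find_table(). A sketch:

```python
from bs4 import BeautifulSoup, Comment

def list_table_ids(soup):
    """Return all table ids on a page, visible and comment-wrapped."""
    ids = [t.get("id") for t in soup.find_all("table") if t.get("id")]
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        if "<table" in comment:
            inner = BeautifulSoup(comment, "lxml")
            ids += [t.get("id") for t in inner.find_all("table") if t.get("id")]
    return sorted(set(ids))
```

Run it on any fetched page to see exactly which ids you can pass to find_table().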
5. Missing Data for Older Seasons
Three-point statistics don't exist before 1979-80. Advanced metrics like VORP start later. Always handle NaN values gracefully.
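A small guard helper (a sketch using pandas) keeps aggregations from blowing up when a column is missing or entirely NaN for an old season:

```python
import pandas as pd

def safe_mean(df, col):
    """Column mean as a float, or None if the column is absent or all-NaN."""
    if col not in df.columns:
        return None
    series = pd.to_numeric(df[col], errors="coerce").dropna()
    return float(series.mean()) if not series.empty else None
```

For example, safe_mean(df, "3PA") returns None on a 1975 season frame instead of raising a KeyError.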
6. Cloudflare Challenges
If you get a response containing "Checking your browser" text, your IP is flagged. Switch to residential proxies through ThorData — residential IPs from real ISPs are far less likely to trigger Cloudflare challenges.
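It helps to detect a challenge page before trying to parse it. A heuristic sketch; the marker strings are assumptions based on typical Cloudflare interstitials, so adjust them if the block page changes:

```python
def looks_blocked(resp):
    """Heuristically detect a Cloudflare challenge instead of real content."""
    if resp.status_code in (403, 503):
        return True
    # Marker strings seen on typical Cloudflare interstitial pages (assumption)
    markers = ("Checking your browser", "Just a moment", "cf-chl")
    return any(marker in resp.text for marker in markers)
```

Call it on every response and switch to the proxy path when it returns True, rather than feeding challenge HTML into BeautifulSoup.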
Basketball-Reference is a remarkably clean data source. Keep your request rate slow (3-5 seconds between requests minimum), rotate IPs for larger jobs, always check those HTML comments for hidden tables, and handle traded-player duplicates. Follow these rules and you'll have decades of basketball data at your fingertips.