Scrape Baseball-Reference: MLB Player Stats, WAR & Game Logs with Python (2026)
If you're doing any kind of baseball analytics, Baseball-Reference is the gold standard. It covers every MLB player back to 1871, has WAR (Wins Above Replacement) for pitchers and hitters, detailed game logs, batting splits, fielding stats, and historical season-by-season breakdowns. For free. No API key required.
That makes it a great scraping target — but it also means they get hammered with requests and have protections in place. This guide walks through scraping it correctly: pulling player stats, game logs, and WAR metrics, handling the anti-bot layer, paginating across seasons, and storing everything cleanly in SQLite for downstream analysis.
Why Baseball-Reference Data Matters
Baseball-Reference aggregates a century and a half of baseball statistics into a single consistent format. What you can do with this data:
- Fantasy baseball optimization: Build models that predict player performance using historical WAR, game logs, and situational splits
- Sabermetric research: Replicate and extend academic baseball research without paying for proprietary datasets
- Historical trend analysis: Track how the game has changed over 150 years — pace of play, strikeout rates, defensive shifts
- Player valuation models: Build your own WAR variants or complement BR's metrics with your own analysis
- Injury impact studies: Correlate player performance metrics with injury history data
- Draft research: Analyze minor league conversion rates and prospect development patterns
Setup
You need four libraries. Nothing exotic. (sqlite3 ships with Python's standard library, so it doesn't need installing.)
pip install requests beautifulsoup4 lxml pandas
lxml is important here. Baseball-Reference pages are large and have complex table structures. The lxml parser is significantly faster than Python's built-in html.parser and handles malformed HTML more gracefully.
Understanding Baseball-Reference's Structure
Before scraping, understand the URL patterns and how data is organized.
Player pages: https://www.baseball-reference.com/players/{first_letter}/{player_id}.shtml
- Example: Mike Trout → /players/t/troutmi01.shtml
- Player IDs follow: [first 5 of last name][first 2 of first name][2-digit number]
Game log pages: https://www.baseball-reference.com/players/{letter}/{id}/batting-gamelogs/{year}/
Team pages: https://www.baseball-reference.com/teams/{team_abbr}/{year}.shtml
Season leaders: https://www.baseball-reference.com/leagues/MLB/{year}-batting-leaders.shtml
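These URLs are simple enough to build with plain string formatting. A small sketch; the helper names are mine, not from any library:

```python
BASE = "https://www.baseball-reference.com"

def player_url(player_id: str) -> str:
    # Player pages are sharded by the first letter of the player ID
    return f"{BASE}/players/{player_id[0]}/{player_id}.shtml"

def gamelog_url(player_id: str, year: int, log_type: str = "batting") -> str:
    return f"{BASE}/players/{player_id[0]}/{player_id}/{log_type}-gamelogs/{year}/"

def team_url(team_abbr: str, year: int) -> str:
    return f"{BASE}/teams/{team_abbr}/{year}.shtml"

print(player_url("troutmi01"))
# https://www.baseball-reference.com/players/t/troutmi01.shtml
```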
Key table IDs (used with BeautifulSoup):
- batting_standard — Season-by-season batting stats
- pitching_standard — Season-by-season pitching stats
- batting_value — WAR and value stats for hitters (embedded in HTML comment)
- pitching_value — WAR and value stats for pitchers (embedded in HTML comment)
- batting_gamelogs — Game-by-game stats for a season
- pitching_gamelogs — Game-by-game pitching stats
The most important quirk: Baseball-Reference ships some tables inside HTML comments and reveals them with JavaScript after the page loads. If soup.find() returns None for a table you can clearly see in the browser, you need to strip the comment wrappers first.
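Here's what that quirk looks like on a toy fragment. The same regex used by uncomment_tables below strips the comment wrappers so any parser can see the table:

```python
import re

# Simplified stand-in for a BR page fragment: the table ships inside a
# comment, so HTML parsers treat it as opaque text rather than markup.
html = '<div><!-- <table id="batting_value"><tr><td>8.3</td></tr></table> --></div>'

uncommented = re.sub(r'<!--(.*?)-->', r'\1', html, flags=re.DOTALL)
print('<!--' in uncommented)  # False: the table is now ordinary markup
```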
Core Scraping Utilities
import requests
from bs4 import BeautifulSoup
import pandas as pd
import sqlite3
import time
import random
import re
from typing import Optional, Dict, List
from datetime import datetime
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
}
def uncomment_tables(html: str) -> str:
"""Remove HTML comment wrappers from hidden tables."""
return re.sub(r'<!--(.*?)-->', r'\1', html, flags=re.DOTALL)
def make_request(
url: str,
proxy: Optional[str] = None,
max_retries: int = 5,
base_delay: float = 3.0,
) -> Optional[requests.Response]:
"""Make an HTTP request with exponential backoff and proxy support."""
proxies = {"http": proxy, "https": proxy} if proxy else None
for attempt in range(max_retries):
try:
resp = requests.get(
url,
headers=HEADERS,
proxies=proxies,
timeout=25,
)
if resp.status_code == 200:
return resp
elif resp.status_code == 429:
wait = base_delay * (3 ** attempt) + random.uniform(0, 5)
print(f"[RATE LIMIT] {url} — waiting {wait:.1f}s (attempt {attempt + 1})")
time.sleep(wait)
elif resp.status_code == 503:
wait = base_delay * (2 ** attempt)
print(f"[503] Service unavailable, waiting {wait:.1f}s")
time.sleep(wait)
elif resp.status_code == 404:
print(f"[404] Not found: {url}")
return None
else:
print(f"[ERROR] HTTP {resp.status_code} for {url}")
return None
except requests.Timeout:
wait = base_delay * (attempt + 1)
print(f"[TIMEOUT] Attempt {attempt + 1}, waiting {wait:.1f}s")
time.sleep(wait)
except requests.RequestException as e:
print(f"[ERROR] Request failed: {e}")
if attempt < max_retries - 1:
time.sleep(base_delay * (attempt + 1))
print(f"[FAIL] Exhausted retries for {url}")
return None
def polite_sleep(min_s: float = 3.0, max_s: float = 6.0):
"""Sleep for a human-like random duration."""
time.sleep(random.uniform(min_s, max_s))
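With the defaults above, the 429 branch backs off hard: the deterministic part of the wait triples on every attempt, plus up to 5 seconds of jitter. Ignoring the jitter, the schedule across five attempts is:

```python
base_delay = 3.0
waits = [base_delay * (3 ** attempt) for attempt in range(5)]
print(waits)  # [3.0, 9.0, 27.0, 81.0, 243.0], about six minutes in total
```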
Scraping Player Season Stats
def get_player_batting_stats(
player_id: str,
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get career season-by-season batting stats for a player."""
first_letter = player_id[0]
url = f"https://www.baseball-reference.com/players/{first_letter}/{player_id}.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
soup = BeautifulSoup(resp.text, "lxml")
# Standard batting table
table = soup.find("table", {"id": "batting_standard"})
if table is None:
# Table may be in HTML comment
soup_uncommented = BeautifulSoup(uncomment_tables(resp.text), "lxml")
table = soup_uncommented.find("table", {"id": "batting_standard"})
if table is None:
print(f"[WARN] No batting_standard table found for {player_id}")
return None
df = pd.read_html(str(table))[0]
# Drop separator/header rows that repeat mid-table
df = df[pd.to_numeric(df.get("Year", df.get("Rk")), errors="coerce").notna()].copy()
# Convert Year to int where possible
if "Year" in df.columns:
df["Year"] = pd.to_numeric(df["Year"], errors="coerce")
df = df.dropna(subset=["Year"])
df["Year"] = df["Year"].astype(int)
df["PlayerID"] = player_id
df["ScrapedAt"] = datetime.utcnow().isoformat()
return df
def get_player_pitching_stats(
player_id: str,
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get career season-by-season pitching stats."""
first_letter = player_id[0]
url = f"https://www.baseball-reference.com/players/{first_letter}/{player_id}.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
# Pitching stats are often in commented tables
soup = BeautifulSoup(uncomment_tables(resp.text), "lxml")
table = soup.find("table", {"id": "pitching_standard"})
if table is None:
print(f"[WARN] No pitching_standard table for {player_id}")
return None
df = pd.read_html(str(table))[0]
df = df[pd.to_numeric(df.get("Year", df.get("Rk")), errors="coerce").notna()].copy()
if "Year" in df.columns:
df["Year"] = pd.to_numeric(df["Year"], errors="coerce")
df = df.dropna(subset=["Year"])
df["Year"] = df["Year"].astype(int)
df["PlayerID"] = player_id
df["ScrapedAt"] = datetime.utcnow().isoformat()
return df
def get_player_fielding_stats(
player_id: str,
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get career fielding stats."""
first_letter = player_id[0]
url = f"https://www.baseball-reference.com/players/{first_letter}/{player_id}.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
soup = BeautifulSoup(uncomment_tables(resp.text), "lxml")
table = soup.find("table", {"id": "standard_fielding"})
if table is None:
return None
df = pd.read_html(str(table))[0]
df["PlayerID"] = player_id
return df
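The pd.to_numeric filters in these functions exist because BR stats tables mix real season rows with repeated header rows and summary rows like "162 Game Avg". The same filtering logic in plain Python, on fabricated rows:

```python
# Fabricated rows mimicking what pd.read_html returns for a BR stats table
rows = [
    {"Year": "2021", "HR": "39"},
    {"Year": "Year", "HR": "HR"},          # header row repeated mid-table
    {"Year": "2022", "HR": "40"},
    {"Year": "162 Game Avg", "HR": "37"},  # summary row, not a season
]

def is_season_row(row: dict) -> bool:
    # A real season row has a purely numeric Year value
    try:
        int(row["Year"])
        return True
    except ValueError:
        return False

seasons = [r for r in rows if is_season_row(r)]
print([r["Year"] for r in seasons])  # ['2021', '2022']
```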
Scraping Game Logs
Game logs give you granular day-by-day performance data — essential for in-season analysis and injury impact studies.
def get_game_logs(
player_id: str,
year: int,
log_type: str = "batting",
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get game-by-game stats for a player in a given season.
log_type: 'batting' or 'pitching'
"""
first_letter = player_id[0]
url = (
f"https://www.baseball-reference.com/players/{first_letter}/"
f"{player_id}/{log_type}-gamelogs/{year}/"
)
resp = make_request(url, proxy=proxy)
if not resp:
return None
# Game log tables are usually NOT in comments
soup = BeautifulSoup(resp.text, "lxml")
table_id = f"{log_type}_gamelogs"
table = soup.find("table", {"id": table_id})
if table is None:
# Try with uncommented HTML
soup2 = BeautifulSoup(uncomment_tables(resp.text), "lxml")
table = soup2.find("table", {"id": table_id})
if table is None:
print(f"[WARN] No {table_id} table for {player_id} {year}")
return None
df = pd.read_html(str(table))[0]
# Drop header rows that repeat mid-table (Rk column contains "Rk" for separator rows)
if "Rk" in df.columns:
df = df[df["Rk"] != "Rk"].copy()
df = df[pd.to_numeric(df["Rk"], errors="coerce").notna()].copy()
df["Year"] = year
df["PlayerID"] = player_id
df["LogType"] = log_type
df["ScrapedAt"] = datetime.utcnow().isoformat()
return df
def get_multi_season_logs(
player_id: str,
years: List[int],
log_type: str = "batting",
proxy: Optional[str] = None,
) -> pd.DataFrame:
"""Get game logs for multiple seasons and concatenate."""
dfs = []
for year in years:
df = get_game_logs(player_id, year, log_type=log_type, proxy=proxy)
if df is not None and not df.empty:
dfs.append(df)
print(f" {player_id} {year}: {len(df)} games")
polite_sleep(3.0, 6.0)
if not dfs:
return pd.DataFrame()
return pd.concat(dfs, ignore_index=True)
Extracting WAR and Advanced Metrics
WAR is Baseball-Reference's signature metric, and it lives in tables embedded in HTML comments, a quirk shared across the whole Sports-Reference family of sites.
def get_war_data(
player_id: str,
player_type: str = "batter",
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Extract WAR and value stats for a player.
player_type: 'batter' or 'pitcher'
"""
first_letter = player_id[0]
url = f"https://www.baseball-reference.com/players/{first_letter}/{player_id}.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
# WAR tables are ALWAYS in HTML comments
soup = BeautifulSoup(uncomment_tables(resp.text), "lxml")
table_id = "batting_value" if player_type == "batter" else "pitching_value"
table = soup.find("table", {"id": table_id})
if table is None:
print(f"[WARN] {table_id} table not found for {player_id}")
return None
df = pd.read_html(str(table))[0]
# Keep only actual season rows (drop career totals, averages)
if "Year" in df.columns:
df = df[pd.to_numeric(df["Year"], errors="coerce").notna()].copy()
df["Year"] = df["Year"].astype(int)
war_cols = [c for c in df.columns if "WAR" in str(c)]
base_cols = ["Year", "Age", "Tm", "G"]
keep_cols = [c for c in base_cols if c in df.columns] + war_cols
df = df[keep_cols].copy()
df["PlayerID"] = player_id
return df
def get_career_war_summary(player_id: str, proxy: Optional[str] = None) -> Dict:
"""Get career WAR totals for a player."""
war_df = get_war_data(player_id, proxy=proxy)
if war_df is None or war_df.empty:
return {}
war_col = next((c for c in war_df.columns if c == "WAR" or c == "rWAR"), None)
if not war_col:
return {}
return {
"player_id": player_id,
"career_war": float(pd.to_numeric(war_df[war_col], errors="coerce").sum()),
"peak_war_season": float(pd.to_numeric(war_df[war_col], errors="coerce").max()),
"peak_war_year": int(war_df.loc[pd.to_numeric(war_df[war_col], errors="coerce").idxmax(), "Year"]) if "Year" in war_df.columns else None,
"seasons": len(war_df),
}
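The arithmetic behind get_career_war_summary is easy to sanity-check with made-up season values:

```python
# Fabricated (year, WAR) pairs, for illustration only
seasons = [(2021, 2.1), (2022, 6.3), (2023, 4.0)]

career_war = sum(war for _, war in seasons)
peak_year, peak_war = max(seasons, key=lambda s: s[1])

print(round(career_war, 1), peak_year, peak_war)  # 12.4 2022 6.3
```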
Scraping Team and Season Data
def get_team_roster(
team: str,
year: int,
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get team roster for a given season.
team: 3-letter team abbreviation (LAA, NYY, BOS, etc.)
"""
url = f"https://www.baseball-reference.com/teams/{team}/{year}.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
soup = BeautifulSoup(uncomment_tables(resp.text), "lxml")
# Team batting table
table = soup.find("table", {"id": "team_batting"})
if not table:
return None
df = pd.read_html(str(table))[0]
df["Team"] = team
df["Year"] = year
# Extract player IDs from links in the HTML table
player_links = {}
for row in table.find_all("tr"):
name_td = row.find("td", {"data-stat": "player"})
if name_td:
link = name_td.find("a")
if link and "/players/" in link.get("href", ""):
href = link["href"]
pid = href.split("/")[-1].replace(".shtml", "")
player_links[link.text.strip()] = pid
df["PlayerID"] = df.get("Name", df.iloc[:, 0]).map(player_links)
return df
def get_season_leaders(
year: int,
stat: str = "batting",
min_pa: int = 300,
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get season statistical leaders.
stat: 'batting' or 'pitching'
"""
url = f"https://www.baseball-reference.com/leagues/MLB/{year}-{stat}-leaders.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
soup = BeautifulSoup(uncomment_tables(resp.text), "lxml")
# Leader tables are named like leader_onbase_perc, leader_slugging_perc, etc.
# The main summary table is usually the first one
tables = soup.find_all("table", class_="stats_table")
if not tables:
return None
dfs = []
for table in tables[:5]: # Get first 5 leader tables
try:
df = pd.read_html(str(table))[0]
df["Year"] = year
dfs.append(df)
except Exception:
continue
return pd.concat(dfs, ignore_index=True) if dfs else None
Anti-Bot Handling and Proxy Integration
Baseball-Reference has rate limiting, user-agent checks, and will serve CAPTCHAs if you're hitting it too fast or from a datacenter IP. The user-agent header helps. Random delays help more. But for any real scale, rotating residential proxies are essential.
ThorData provides residential proxies that look like real ISP traffic. Baseball-Reference rarely flags residential IPs the way it flags datacenter ranges.
class ThorDataProxyPool:
"""Rotating residential proxy pool via ThorData."""
def __init__(self, username: str, password: str):
self.username = username
self.password = password
self.host = "gate.thordata.com"
self.port = 9000
def get_proxy(self, country: Optional[str] = None) -> str:
"""Get a rotating residential proxy URL."""
user = self.username
if country:
user = f"{self.username}-country-{country.upper()}"
return f"http://{user}:{self.password}@{self.host}:{self.port}"
def get_us_proxy(self) -> str:
"""Get a US-based proxy (useful for Baseball-Reference geo-content)."""
return self.get_proxy(country="US")
def get_with_proxy(
url: str,
proxy_pool: Optional[ThorDataProxyPool] = None,
retries: int = 4,
) -> Optional[requests.Response]:
"""Fetch URL with optional proxy rotation and backoff."""
proxy = proxy_pool.get_us_proxy() if proxy_pool else None
for attempt in range(retries):
# Rotate proxy on each retry
if proxy_pool and attempt > 0:
proxy = proxy_pool.get_proxy()
resp = make_request(url, proxy=proxy, max_retries=1)
if resp:
return resp
wait = (2 ** attempt) * 5 + random.uniform(0, 5)
print(f"[RETRY] Attempt {attempt + 1}/{retries} for {url}, waiting {wait:.1f}s")
time.sleep(wait)
return None
Pagination Handling
Baseball-Reference doesn't paginate season stats (all seasons appear on one page), but some data requires multiple URLs:
def get_full_career_gamelogs(
player_id: str,
start_year: int,
end_year: int,
proxy: Optional[str] = None,
) -> pd.DataFrame:
"""Get all game logs for a player's career."""
all_dfs = []
for year in range(start_year, end_year + 1):
print(f" Fetching {player_id} game logs for {year}...")
df = get_game_logs(player_id, year, proxy=proxy)
if df is not None and not df.empty:
all_dfs.append(df)
polite_sleep(3.5, 7.0) # Be respectful of BR's servers
if not all_dfs:
return pd.DataFrame()
combined = pd.concat(all_dfs, ignore_index=True)
return combined
def paginate_player_search(
search_term: str,
proxy: Optional[str] = None,
) -> List[Dict]:
"""Search for players by name and return player IDs."""
url = f"https://www.baseball-reference.com/search/search.fcgi?search={search_term}&pid=player_search"
resp = make_request(url, proxy=proxy)
if not resp:
return []
soup = BeautifulSoup(resp.text, "lxml")
players = []
# Direct redirect if only one match
if "players" in resp.url and ".shtml" in resp.url:
player_id = resp.url.split("/")[-1].replace(".shtml", "")
name_el = soup.select_one("h1[itemprop='name']")
return [{
"player_id": player_id,
"name": name_el.text.strip() if name_el else search_term,
"url": resp.url,
}]
# Multiple results page
for row in soup.select(".search-item-name"):
link = row.find("a")
if link and "/players/" in link.get("href", ""):
pid = link["href"].split("/")[-1].replace(".shtml", "")
players.append({
"player_id": pid,
"name": link.text.strip(),
"url": f"https://www.baseball-reference.com{link['href']}",
})
return players
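The ID extraction in that loop is plain string surgery on the href. Isolated into a helper (the name is mine), it behaves like this:

```python
def player_id_from_href(href: str) -> str:
    # "/players/t/troutmi01.shtml" -> "troutmi01"
    return href.rstrip("/").split("/")[-1].replace(".shtml", "")

print(player_id_from_href("/players/t/troutmi01.shtml"))  # troutmi01
```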
Data Storage
SQLite works well for this. You can query it later without loading everything into memory. Data for completed seasons rarely changes, so scrape it once and cache it permanently.
def init_database(db_path: str = "baseball.db") -> sqlite3.Connection:
"""Initialize the Baseball-Reference database."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS batting_stats (
player_id TEXT,
year INTEGER,
age INTEGER,
team TEXT,
games INTEGER,
plate_appearances INTEGER,
at_bats INTEGER,
runs INTEGER,
hits INTEGER,
doubles INTEGER,
triples INTEGER,
home_runs INTEGER,
rbi INTEGER,
stolen_bases INTEGER,
caught_stealing INTEGER,
walks INTEGER,
strikeouts INTEGER,
batting_avg REAL,
obp REAL,
slg REAL,
ops REAL,
ops_plus INTEGER,
scraped_at TEXT,
PRIMARY KEY (player_id, year, team)
);
CREATE TABLE IF NOT EXISTS pitching_stats (
player_id TEXT,
year INTEGER,
age INTEGER,
team TEXT,
wins INTEGER,
losses INTEGER,
era REAL,
games INTEGER,
games_started INTEGER,
innings_pitched REAL,
hits_allowed INTEGER,
earned_runs INTEGER,
home_runs_allowed INTEGER,
walks INTEGER,
strikeouts INTEGER,
whip REAL,
era_plus INTEGER,
scraped_at TEXT,
PRIMARY KEY (player_id, year, team)
);
CREATE TABLE IF NOT EXISTS game_logs (
player_id TEXT,
year INTEGER,
game_num INTEGER,
date TEXT,
team TEXT,
opponent TEXT,
at_bats INTEGER,
hits INTEGER,
home_runs INTEGER,
rbi INTEGER,
batting_avg REAL,
log_type TEXT,
scraped_at TEXT,
PRIMARY KEY (player_id, year, game_num, log_type)
);
CREATE TABLE IF NOT EXISTS war_data (
player_id TEXT,
year INTEGER,
team TEXT,
games INTEGER,
war REAL,
scraped_at TEXT,
PRIMARY KEY (player_id, year, team)
);
CREATE TABLE IF NOT EXISTS scrape_cache (
url TEXT PRIMARY KEY,
scraped_at TEXT,
status_code INTEGER
);
CREATE INDEX IF NOT EXISTS idx_batting_year ON batting_stats(year);
CREATE INDEX IF NOT EXISTS idx_pitching_year ON pitching_stats(year);
CREATE INDEX IF NOT EXISTS idx_gamelogs_player ON game_logs(player_id, year);
""")
conn.commit()
return conn
def save_batting_stats(conn: sqlite3.Connection, df: pd.DataFrame, player_id: str):
"""Save batting stats DataFrame to SQLite."""
if df is None or df.empty:
return 0
    # Reference mapping from BR column names to DB columns; the INSERT below applies it by hand
    col_map = {
"Year": "year", "Age": "age", "Tm": "team",
"G": "games", "PA": "plate_appearances", "AB": "at_bats",
"R": "runs", "H": "hits", "2B": "doubles", "3B": "triples",
"HR": "home_runs", "RBI": "rbi", "SB": "stolen_bases",
"CS": "caught_stealing", "BB": "walks", "SO": "strikeouts",
"BA": "batting_avg", "OBP": "obp", "SLG": "slg",
"OPS": "ops", "OPS+": "ops_plus",
}
rows_saved = 0
for _, row in df.iterrows():
try:
conn.execute(
"""INSERT OR REPLACE INTO batting_stats
(player_id, year, age, team, games, plate_appearances, at_bats,
runs, hits, doubles, triples, home_runs, rbi, stolen_bases,
caught_stealing, walks, strikeouts, batting_avg, obp, slg, ops, ops_plus, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
player_id,
_safe_int(row.get("Year")),
_safe_int(row.get("Age")),
str(row.get("Tm", "")),
_safe_int(row.get("G")),
_safe_int(row.get("PA")),
_safe_int(row.get("AB")),
_safe_int(row.get("R")),
_safe_int(row.get("H")),
_safe_int(row.get("2B")),
_safe_int(row.get("3B")),
_safe_int(row.get("HR")),
_safe_int(row.get("RBI")),
_safe_int(row.get("SB")),
_safe_int(row.get("CS")),
_safe_int(row.get("BB")),
_safe_int(row.get("SO")),
_safe_float(row.get("BA")),
_safe_float(row.get("OBP")),
_safe_float(row.get("SLG")),
_safe_float(row.get("OPS")),
_safe_int(row.get("OPS+")),
datetime.utcnow().isoformat(),
)
)
rows_saved += 1
except Exception as e:
print(f"[ERROR] Failed to save row: {e}")
conn.commit()
return rows_saved
def _safe_int(val) -> Optional[int]:
try:
return int(float(val))
except (TypeError, ValueError):
return None
def _safe_float(val) -> Optional[float]:
try:
return float(val)
except (TypeError, ValueError):
return None
def already_scraped(
conn: sqlite3.Connection,
player_id: str,
year: int,
table: str = "batting_stats",
) -> bool:
"""Check if data already exists for a player/year combination."""
row = conn.execute(
f"SELECT COUNT(*) FROM {table} WHERE player_id = ? AND year = ?",
(player_id, year),
).fetchone()
return row[0] > 0
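Once rows are in SQLite, downstream analysis is a query away. A self-contained sketch against an in-memory database; the stat lines are fabricated for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE batting_stats (
    player_id TEXT, year INTEGER, team TEXT, home_runs INTEGER,
    PRIMARY KEY (player_id, year, team))""")
conn.executemany(
    "INSERT INTO batting_stats VALUES (?,?,?,?)",
    [("troutmi01", 2022, "LAA", 40), ("troutmi01", 2023, "LAA", 18),
     ("judgeaa01", 2022, "NYY", 62)],
)
# Career home runs per player, highest first
rows = conn.execute(
    "SELECT player_id, SUM(home_runs) FROM batting_stats "
    "GROUP BY player_id ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('judgeaa01', 62), ('troutmi01', 58)]
```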
Rate Limiting Best Practices
Baseball-Reference is a free resource maintained by a small team. Don't hammer it. The site runs on advertising, and heavy automated traffic doesn't generate ad revenue but does cost them bandwidth.
# Practical rate limits that keep you under the radar:
SCRAPE_CONFIG = {
"min_delay_between_requests": 3.0, # Never go below 3 seconds
"max_delay_between_requests": 6.0,
"delay_between_players": 8.0, # Longer pause when switching players
"delay_between_seasons": 5.0, # Between seasons for same player
"max_requests_per_hour": 120, # 2 per minute max for single IP
"cache_historical_data": True, # Pre-2020 data is static, cache forever
}
class RateLimiter:
"""Simple rate limiter to enforce polite scraping."""
def __init__(self, min_interval: float = 3.0, max_interval: float = 6.0):
self.min_interval = min_interval
self.max_interval = max_interval
self.last_request = 0.0
def wait(self):
"""Wait the appropriate amount before the next request."""
elapsed = time.time() - self.last_request
target = random.uniform(self.min_interval, self.max_interval)
if elapsed < target:
time.sleep(target - elapsed)
self.last_request = time.time()
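The wait logic is easiest to verify with the clock factored out. A hypothetical dry-run helper mirroring RateLimiter.wait without actually sleeping:

```python
def compute_wait(last_request: float, now: float, target: float) -> float:
    # Mirror of RateLimiter.wait: sleep only for the remainder of the gap
    elapsed = now - last_request
    return max(0.0, target - elapsed)

print(compute_wait(100.0, 102.0, 5.0))  # 3.0, two seconds already elapsed
print(compute_wait(100.0, 110.0, 5.0))  # 0.0, the gap is already satisfied
```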
Complete Production Pipeline
def run_player_pipeline(
player_ids: List[str],
years: List[int],
db_path: str = "baseball.db",
proxy_pool: Optional[ThorDataProxyPool] = None,
) -> Dict:
"""Complete pipeline for scraping player stats across seasons."""
conn = init_database(db_path)
rate_limiter = RateLimiter(min_interval=3.5, max_interval=7.0)
stats = {
"players_scraped": 0,
"seasons_scraped": 0,
"game_logs_scraped": 0,
"errors": 0,
"skipped": 0,
}
for player_id in player_ids:
print(f"\n[PLAYER] {player_id}")
proxy = proxy_pool.get_us_proxy() if proxy_pool else None
# Get career batting stats
rate_limiter.wait()
batting_df = get_player_batting_stats(player_id, proxy=proxy)
if batting_df is not None:
rows = save_batting_stats(conn, batting_df, player_id)
stats["seasons_scraped"] += rows
print(f" Batting stats: {rows} season rows saved")
# Get WAR data
rate_limiter.wait()
        war_df = get_war_data(player_id, proxy=proxy)
        if war_df is not None and not war_df.empty:
            # Summarize from the frame already in hand; calling
            # get_career_war_summary here would re-request the same page
            war_col = next((c for c in war_df.columns if c in ("WAR", "rWAR")), None)
            if war_col:
                career = pd.to_numeric(war_df[war_col], errors="coerce").sum()
                print(f"  Career WAR: {career:.1f}")
# Get game logs for requested years
for year in years:
if already_scraped(conn, player_id, year, "game_logs"):
print(f" [SKIP] {player_id} {year} game logs already in DB")
stats["skipped"] += 1
continue
rate_limiter.wait()
logs_df = get_game_logs(player_id, year, proxy=proxy)
if logs_df is not None and not logs_df.empty:
# Save game logs (simplified — full save would map all columns)
stats["game_logs_scraped"] += len(logs_df)
print(f" {year} game logs: {len(logs_df)} games")
stats["players_scraped"] += 1
time.sleep(random.uniform(8.0, 15.0)) # Long pause between players
conn.close()
print(f"\nPipeline complete: {stats}")
return stats
# Example usage
if __name__ == "__main__":
PLAYERS = [
"troutmi01", # Mike Trout
"judgeaa01", # Aaron Judge
"bettsmo01", # Mookie Betts
"arenaro01", # Nolan Arenado
"goldspa01", # Paul Goldschmidt
]
YEARS = [2022, 2023, 2024, 2025]
# With ThorData proxy
# pool = ThorDataProxyPool("YOUR_USER", "YOUR_PASS")
# run_player_pipeline(PLAYERS, YEARS, proxy_pool=pool)
# Without proxy (slower, risk of IP block at scale)
run_player_pipeline(PLAYERS, YEARS)
Real-World Use Cases
Fantasy Baseball Model: Pull WAR data and game logs for all current players. Build regression models predicting next-season WAR from age, injury history, and recent performance trends. Backtest against historical data to validate.
Historical Trend Analysis: Pull season leader data from 1900-present to track how baseball has changed. Strikeout rates, home runs per game, stolen base frequency — all quantifiable with this data.
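As a concrete sketch of that pattern, league strikeout rate (SO per PA) by year falls out of a single GROUP BY over the batting_stats table; the rows below are fabricated:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batting_stats "
    "(player_id TEXT, year INTEGER, plate_appearances INTEGER, strikeouts INTEGER)"
)
conn.executemany(
    "INSERT INTO batting_stats VALUES (?,?,?,?)",
    [("a", 2000, 600, 90), ("b", 2000, 400, 60),
     ("a", 2023, 600, 150), ("b", 2023, 400, 100)],
)
# League-wide SO/PA per season, oldest first
rates = conn.execute(
    "SELECT year, ROUND(1.0 * SUM(strikeouts) / SUM(plate_appearances), 3) "
    "FROM batting_stats GROUP BY year ORDER BY year"
).fetchall()
print(rates)  # [(2000, 0.15), (2023, 0.25)]
```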
Player Comparison Tool: Given two player IDs, fetch their career stats, WAR trajectories, and game log consistency metrics. Generate statistical comparisons normalized for era and park factors.
Injury Impact Quantification: Cross-reference game log gaps (stretches of consecutive missed games) with subsequent performance. Quantify how different injury types affect player output across their careers.
Draft Value Optimizer: Compile statistics for prospects' minor and major league stints, identify which minor league performance metrics best predict major league success, and build a draft value model.
Baseball-Reference has an enormous amount of data if you're patient about pulling it. WAR going back to 1871, every game log, park factors, splits by handedness, leverage index — it's all there in structured HTML tables. The scraping itself is straightforward once you know which table IDs to target and how to handle the HTML comment-wrapped tables. The main challenges are the anti-bot layer and just being a good citizen about request volume. Use ThorData's residential proxy network for any serious bulk collection, cache historical data aggressively, and never re-fetch data you already have.