Scrape Baseball-Reference: MLB Player Stats, WAR & Game Logs with Python (2026)
If you're doing any kind of baseball analytics, Baseball-Reference is the gold standard. It covers every MLB player back to 1871, has WAR (Wins Above Replacement) for pitchers and hitters, detailed game logs, batting splits, fielding stats, and historical season-by-season breakdowns. For free. No API key required.
That makes it a great scraping target — but it also means they get hammered with requests and have protections in place. This guide walks through scraping it correctly: pulling player stats, game logs, and WAR metrics, handling the anti-bot layer, paginating across seasons, and storing everything cleanly in SQLite for downstream analysis.
Why Baseball-Reference Data Matters
Baseball-Reference aggregates a century and a half of baseball statistics into a single consistent format. What you can do with this data:
- Fantasy baseball optimization: Build models that predict player performance using historical WAR, game logs, and situational splits
- Sabermetric research: Replicate and extend academic baseball research without paying for proprietary datasets
- Historical trend analysis: Track how the game has changed over 150 years — pace of play, strikeout rates, defensive shifts
- Player valuation models: Build your own WAR variants or complement BR's metrics with your own analysis
- Injury impact studies: Correlate player performance metrics with injury history data
- Draft research: Analyze minor league conversion rates and prospect development patterns
Setup
You need four libraries. Nothing exotic. (sqlite3 ships with Python's standard library, so it doesn't need installing.)
pip install requests beautifulsoup4 lxml pandas
lxml is important here. Baseball-Reference pages are large and have complex table structures. The lxml parser is significantly faster than Python's built-in html.parser and handles malformed HTML more gracefully.
Understanding Baseball-Reference's Structure
Before scraping, understand the URL patterns and how data is organized.
Player pages: https://www.baseball-reference.com/players/{first_letter}/{player_id}.shtml
- Example: Mike Trout → /players/t/troutmi01.shtml
- Player IDs follow: [first 5 of last name][first 2 of first name][2-digit number]
Game log pages: https://www.baseball-reference.com/players/{letter}/{id}/batting-gamelogs/{year}/
Team pages: https://www.baseball-reference.com/teams/{team_abbr}/{year}.shtml
Season leaders: https://www.baseball-reference.com/leagues/MLB/{year}-batting-leaders.shtml
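These URLs are simple enough to build with plain string formatting. A small sketch; the helper names are mine, not from any library:

```python
BASE = "https://www.baseball-reference.com"

def player_url(player_id: str) -> str:
    # Player pages are sharded by the first letter of the player ID
    return f"{BASE}/players/{player_id[0]}/{player_id}.shtml"

def gamelog_url(player_id: str, year: int, log_type: str = "batting") -> str:
    return f"{BASE}/players/{player_id[0]}/{player_id}/{log_type}-gamelogs/{year}/"

def team_url(team_abbr: str, year: int) -> str:
    return f"{BASE}/teams/{team_abbr}/{year}.shtml"

print(player_url("troutmi01"))
# https://www.baseball-reference.com/players/t/troutmi01.shtml
```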
Key table IDs (used with BeautifulSoup):
- batting_standard — Season-by-season batting stats
- pitching_standard — Season-by-season pitching stats
- batting_value — WAR and value stats for hitters (embedded in HTML comment)
- pitching_value — WAR and value stats for pitchers (embedded in HTML comment)
- batting_gamelogs — Game-by-game stats for a season
- pitching_gamelogs — Game-by-game pitching stats
The most important quirk: Baseball-Reference ships some tables inside HTML comments and reveals them with JavaScript after the page loads. If soup.find() returns None for a table you can clearly see in the browser, you need to strip the comment wrappers first.
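Here's what that quirk looks like on a toy fragment. The same regex used by uncomment_tables below strips the comment wrappers so any parser can see the table:

```python
import re

# Simplified stand-in for a BR page fragment: the table ships inside a
# comment, so HTML parsers treat it as opaque text rather than markup.
html = '<div><!-- <table id="batting_value"><tr><td>8.3</td></tr></table> --></div>'

uncommented = re.sub(r'<!--(.*?)-->', r'\1', html, flags=re.DOTALL)
print('<!--' in uncommented)  # False: the table is now ordinary markup
```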
Core Scraping Utilities
import requests
from bs4 import BeautifulSoup
import pandas as pd
import sqlite3
import time
import random
import re
from typing import Optional, Dict, List
from datetime import datetime
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
}
def uncomment_tables(html: str) -> str:
"""Remove HTML comment wrappers from hidden tables."""
return re.sub(r'<!--(.*?)-->', r'\1', html, flags=re.DOTALL)
def make_request(
url: str,
proxy: Optional[str] = None,
max_retries: int = 5,
base_delay: float = 3.0,
) -> Optional[requests.Response]:
"""Make an HTTP request with exponential backoff and proxy support."""
proxies = {"http": proxy, "https": proxy} if proxy else None
for attempt in range(max_retries):
try:
resp = requests.get(
url,
headers=HEADERS,
proxies=proxies,
timeout=25,
)
if resp.status_code == 200:
return resp
elif resp.status_code == 429:
wait = base_delay * (3 ** attempt) + random.uniform(0, 5)
print(f"[RATE LIMIT] {url} — waiting {wait:.1f}s (attempt {attempt + 1})")
time.sleep(wait)
elif resp.status_code == 503:
wait = base_delay * (2 ** attempt)
print(f"[503] Service unavailable, waiting {wait:.1f}s")
time.sleep(wait)
elif resp.status_code == 404:
print(f"[404] Not found: {url}")
return None
else:
print(f"[ERROR] HTTP {resp.status_code} for {url}")
return None
except requests.Timeout:
wait = base_delay * (attempt + 1)
print(f"[TIMEOUT] Attempt {attempt + 1}, waiting {wait:.1f}s")
time.sleep(wait)
except requests.RequestException as e:
print(f"[ERROR] Request failed: {e}")
if attempt < max_retries - 1:
time.sleep(base_delay * (attempt + 1))
print(f"[FAIL] Exhausted retries for {url}")
return None
def polite_sleep(min_s: float = 3.0, max_s: float = 6.0):
"""Sleep for a human-like random duration."""
time.sleep(random.uniform(min_s, max_s))
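With the defaults above, the 429 branch backs off hard: the deterministic part of the wait triples on every attempt, plus up to 5 seconds of jitter. Ignoring the jitter, the schedule across five attempts is:

```python
base_delay = 3.0
waits = [base_delay * (3 ** attempt) for attempt in range(5)]
print(waits)  # [3.0, 9.0, 27.0, 81.0, 243.0], about six minutes in total
```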
Scraping Player Season Stats
def get_player_batting_stats(
player_id: str,
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get career season-by-season batting stats for a player."""
first_letter = player_id[0]
url = f"https://www.baseball-reference.com/players/{first_letter}/{player_id}.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
soup = BeautifulSoup(resp.text, "lxml")
# Standard batting table
table = soup.find("table", {"id": "batting_standard"})
if table is None:
# Table may be in HTML comment
soup_uncommented = BeautifulSoup(uncomment_tables(resp.text), "lxml")
table = soup_uncommented.find("table", {"id": "batting_standard"})
if table is None:
print(f"[WARN] No batting_standard table found for {player_id}")
return None
df = pd.read_html(str(table))[0]
# Drop separator/header rows that repeat mid-table
df = df[pd.to_numeric(df.get("Year", df.get("Rk")), errors="coerce").notna()].copy()
# Convert Year to int where possible
if "Year" in df.columns:
df["Year"] = pd.to_numeric(df["Year"], errors="coerce")
df = df.dropna(subset=["Year"])
df["Year"] = df["Year"].astype(int)
df["PlayerID"] = player_id
df["ScrapedAt"] = datetime.utcnow().isoformat()
return df
def get_player_pitching_stats(
player_id: str,
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get career season-by-season pitching stats."""
first_letter = player_id[0]
url = f"https://www.baseball-reference.com/players/{first_letter}/{player_id}.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
# Pitching stats are often in commented tables
soup = BeautifulSoup(uncomment_tables(resp.text), "lxml")
table = soup.find("table", {"id": "pitching_standard"})
if table is None:
print(f"[WARN] No pitching_standard table for {player_id}")
return None
df = pd.read_html(str(table))[0]
df = df[pd.to_numeric(df.get("Year", df.get("Rk")), errors="coerce").notna()].copy()
if "Year" in df.columns:
df["Year"] = pd.to_numeric(df["Year"], errors="coerce")
df = df.dropna(subset=["Year"])
df["Year"] = df["Year"].astype(int)
df["PlayerID"] = player_id
df["ScrapedAt"] = datetime.utcnow().isoformat()
return df
def get_player_fielding_stats(
player_id: str,
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get career fielding stats."""
first_letter = player_id[0]
url = f"https://www.baseball-reference.com/players/{first_letter}/{player_id}.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
soup = BeautifulSoup(uncomment_tables(resp.text), "lxml")
table = soup.find("table", {"id": "standard_fielding"})
if table is None:
return None
df = pd.read_html(str(table))[0]
df["PlayerID"] = player_id
return df
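The pd.to_numeric filters in these functions exist because BR stats tables mix real season rows with repeated header rows and summary rows like "162 Game Avg". The same filtering logic in plain Python, on fabricated rows:

```python
# Fabricated rows mimicking what pd.read_html returns for a BR stats table
rows = [
    {"Year": "2021", "HR": "39"},
    {"Year": "Year", "HR": "HR"},          # header row repeated mid-table
    {"Year": "2022", "HR": "40"},
    {"Year": "162 Game Avg", "HR": "37"},  # summary row, not a season
]

def is_season_row(row: dict) -> bool:
    # A real season row has a purely numeric Year value
    try:
        int(row["Year"])
        return True
    except ValueError:
        return False

seasons = [r for r in rows if is_season_row(r)]
print([r["Year"] for r in seasons])  # ['2021', '2022']
```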
Scraping Game Logs
Game logs give you granular day-by-day performance data — essential for in-season analysis and injury impact studies.
def get_game_logs(
player_id: str,
year: int,
log_type: str = "batting",
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get game-by-game stats for a player in a given season.
log_type: 'batting' or 'pitching'
"""
first_letter = player_id[0]
url = (
f"https://www.baseball-reference.com/players/{first_letter}/"
f"{player_id}/{log_type}-gamelogs/{year}/"
)
resp = make_request(url, proxy=proxy)
if not resp:
return None
# Game log tables are usually NOT in comments
soup = BeautifulSoup(resp.text, "lxml")
table_id = f"{log_type}_gamelogs"
table = soup.find("table", {"id": table_id})
if table is None:
# Try with uncommented HTML
soup2 = BeautifulSoup(uncomment_tables(resp.text), "lxml")
table = soup2.find("table", {"id": table_id})
if table is None:
print(f"[WARN] No {table_id} table for {player_id} {year}")
return None
df = pd.read_html(str(table))[0]
# Drop header rows that repeat mid-table (Rk column contains "Rk" for separator rows)
if "Rk" in df.columns:
df = df[df["Rk"] != "Rk"].copy()
df = df[pd.to_numeric(df["Rk"], errors="coerce").notna()].copy()
df["Year"] = year
df["PlayerID"] = player_id
df["LogType"] = log_type
df["ScrapedAt"] = datetime.utcnow().isoformat()
return df
def get_multi_season_logs(
player_id: str,
years: List[int],
log_type: str = "batting",
proxy: Optional[str] = None,
) -> pd.DataFrame:
"""Get game logs for multiple seasons and concatenate."""
dfs = []
for year in years:
df = get_game_logs(player_id, year, log_type=log_type, proxy=proxy)
if df is not None and not df.empty:
dfs.append(df)
print(f" {player_id} {year}: {len(df)} games")
polite_sleep(3.0, 6.0)
if not dfs:
return pd.DataFrame()
return pd.concat(dfs, ignore_index=True)
Extracting WAR and Advanced Metrics
WAR is Baseball-Reference's signature metric, and it lives in tables embedded in HTML comments, a quirk shared across the whole Sports-Reference family of sites.
def get_war_data(
player_id: str,
player_type: str = "batter",
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Extract WAR and value stats for a player.
player_type: 'batter' or 'pitcher'
"""
first_letter = player_id[0]
url = f"https://www.baseball-reference.com/players/{first_letter}/{player_id}.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
# WAR tables are ALWAYS in HTML comments
soup = BeautifulSoup(uncomment_tables(resp.text), "lxml")
table_id = "batting_value" if player_type == "batter" else "pitching_value"
table = soup.find("table", {"id": table_id})
if table is None:
print(f"[WARN] {table_id} table not found for {player_id}")
return None
df = pd.read_html(str(table))[0]
# Keep only actual season rows (drop career totals, averages)
if "Year" in df.columns:
df = df[pd.to_numeric(df["Year"], errors="coerce").notna()].copy()
df["Year"] = df["Year"].astype(int)
war_cols = [c for c in df.columns if "WAR" in str(c)]
base_cols = ["Year", "Age", "Tm", "G"]
keep_cols = [c for c in base_cols if c in df.columns] + war_cols
df = df[keep_cols].copy()
df["PlayerID"] = player_id
return df
def get_career_war_summary(player_id: str, proxy: Optional[str] = None) -> Dict:
"""Get career WAR totals for a player."""
war_df = get_war_data(player_id, proxy=proxy)
if war_df is None or war_df.empty:
return {}
war_col = next((c for c in war_df.columns if c == "WAR" or c == "rWAR"), None)
if not war_col:
return {}
return {
"player_id": player_id,
"career_war": float(pd.to_numeric(war_df[war_col], errors="coerce").sum()),
"peak_war_season": float(pd.to_numeric(war_df[war_col], errors="coerce").max()),
"peak_war_year": int(war_df.loc[pd.to_numeric(war_df[war_col], errors="coerce").idxmax(), "Year"]) if "Year" in war_df.columns else None,
"seasons": len(war_df),
}
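The arithmetic behind get_career_war_summary is easy to sanity-check with made-up season values:

```python
# Fabricated (year, WAR) pairs, for illustration only
seasons = [(2021, 2.1), (2022, 6.3), (2023, 4.0)]

career_war = sum(war for _, war in seasons)
peak_year, peak_war = max(seasons, key=lambda s: s[1])

print(round(career_war, 1), peak_year, peak_war)  # 12.4 2022 6.3
```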
Scraping Team and Season Data
def get_team_roster(
team: str,
year: int,
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get team roster for a given season.
team: 3-letter team abbreviation (LAA, NYY, BOS, etc.)
"""
url = f"https://www.baseball-reference.com/teams/{team}/{year}.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
soup = BeautifulSoup(uncomment_tables(resp.text), "lxml")
# Team batting table
table = soup.find("table", {"id": "team_batting"})
if not table:
return None
df = pd.read_html(str(table))[0]
df["Team"] = team
df["Year"] = year
# Extract player IDs from links in the HTML table
player_links = {}
for row in table.find_all("tr"):
name_td = row.find("td", {"data-stat": "player"})
if name_td:
link = name_td.find("a")
if link and "/players/" in link.get("href", ""):
href = link["href"]
pid = href.split("/")[-1].replace(".shtml", "")
player_links[link.text.strip()] = pid
df["PlayerID"] = df.get("Name", df.iloc[:, 0]).map(player_links)
return df
def get_season_leaders(
year: int,
stat: str = "batting",
min_pa: int = 300,
proxy: Optional[str] = None,
) -> Optional[pd.DataFrame]:
"""Get season statistical leaders.
stat: 'batting' or 'pitching'
"""
url = f"https://www.baseball-reference.com/leagues/MLB/{year}-{stat}-leaders.shtml"
resp = make_request(url, proxy=proxy)
if not resp:
return None
soup = BeautifulSoup(uncomment_tables(resp.text), "lxml")
# Leader tables are named like leader_onbase_perc, leader_slugging_perc, etc.
# The main summary table is usually the first one
tables = soup.find_all("table", class_="stats_table")
if not tables:
return None
dfs = []
for table in tables[:5]: # Get first 5 leader tables
try:
df = pd.read_html(str(table))[0]
df["Year"] = year
dfs.append(df)
except Exception:
continue
return pd.concat(dfs, ignore_index=True) if dfs else None
Anti-Bot Handling and Proxy Integration
Baseball-Reference has rate limiting, user-agent checks, and will serve CAPTCHAs if you're hitting it too fast or from a datacenter IP. The user-agent header helps. Random delays help more. But for any real scale, rotating residential proxies are essential.
ThorData provides residential proxies that look like real ISP traffic. Baseball-Reference rarely flags residential IPs the way it flags datacenter ranges.
class ThorDataProxyPool:
"""Rotating residential proxy pool via ThorData."""
def __init__(self, username: str, password: str):
self.username = username
self.password = password
self.host = "gate.thordata.com"
self.port = 9000
def get_proxy(self, country: Optional[str] = None) -> str:
"""Get a rotating residential proxy URL."""
user = self.username
if country:
user = f"{self.username}-country-{country.upper()}"
return f"http://{user}:{self.password}@{self.host}:{self.port}"
def get_us_proxy(self) -> str:
"""Get a US-based proxy (useful for Baseball-Reference geo-content)."""
return self.get_proxy(country="US")
def get_with_proxy(
url: str,
proxy_pool: Optional[ThorDataProxyPool] = None,
retries: int = 4,
) -> Optional[requests.Response]:
"""Fetch URL with optional proxy rotation and backoff."""
proxy = proxy_pool.get_us_proxy() if proxy_pool else None
for attempt in range(retries):
# Rotate proxy on each retry
if proxy_pool and attempt > 0:
proxy = proxy_pool.get_proxy()
resp = make_request(url, proxy=proxy, max_retries=1)
if resp:
return resp
wait = (2 ** attempt) * 5 + random.uniform(0, 5)
print(f"[RETRY] Attempt {attempt + 1}/{retries} for {url}, waiting {wait:.1f}s")
time.sleep(wait)
return None
Pagination Handling
Baseball-Reference doesn't paginate season stats (all seasons appear on one page), but some data requires multiple URLs:
def get_full_career_gamelogs(
player_id: str,
start_year: int,
end_year: int,
proxy: Optional[str] = None,
) -> pd.DataFrame:
"""Get all game logs for a player's career."""
all_dfs = []
for year in range(start_year, end_year + 1):
print(f" Fetching {player_id} game logs for {year}...")
df = get_game_logs(player_id, year, proxy=proxy)
if df is not None and not df.empty:
all_dfs.append(df)
polite_sleep(3.5, 7.0) # Be respectful of BR's servers
if not all_dfs:
return pd.DataFrame()
combined = pd.concat(all_dfs, ignore_index=True)
return combined
def paginate_player_search(
search_term: str,
proxy: Optional[str] = None,
) -> List[Dict]:
"""Search for players by name and return player IDs."""
url = f"https://www.baseball-reference.com/search/search.fcgi?search={search_term}&pid=player_search"
resp = make_request(url, proxy=proxy)
if not resp:
return []
soup = BeautifulSoup(resp.text, "lxml")
players = []
# Direct redirect if only one match
if "players" in resp.url and ".shtml" in resp.url:
player_id = resp.url.split("/")[-1].replace(".shtml", "")
name_el = soup.select_one("h1[itemprop='name']")
return [{
"player_id": player_id,
"name": name_el.text.strip() if name_el else search_term,
"url": resp.url,
}]
# Multiple results page
for row in soup.select(".search-item-name"):
link = row.find("a")
if link and "/players/" in link.get("href", ""):
pid = link["href"].split("/")[-1].replace(".shtml", "")
players.append({
"player_id": pid,
"name": link.text.strip(),
"url": f"https://www.baseball-reference.com{link['href']}",
})
return players
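The ID extraction in that loop is plain string surgery on the href. Isolated into a helper (the name is mine), it behaves like this:

```python
def player_id_from_href(href: str) -> str:
    # "/players/t/troutmi01.shtml" -> "troutmi01"
    return href.rstrip("/").split("/")[-1].replace(".shtml", "")

print(player_id_from_href("/players/t/troutmi01.shtml"))  # troutmi01
```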
Data Storage
SQLite works well for this. You can query it later without loading everything into memory. Data for completed seasons rarely changes, so scrape it once and cache it permanently.
def init_database(db_path: str = "baseball.db") -> sqlite3.Connection:
"""Initialize the Baseball-Reference database."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS batting_stats (
player_id TEXT,
year INTEGER,
age INTEGER,
team TEXT,
games INTEGER,
plate_appearances INTEGER,
at_bats INTEGER,
runs INTEGER,
hits INTEGER,
doubles INTEGER,
triples INTEGER,
home_runs INTEGER,
rbi INTEGER,
stolen_bases INTEGER,
caught_stealing INTEGER,
walks INTEGER,
strikeouts INTEGER,
batting_avg REAL,
obp REAL,
slg REAL,
ops REAL,
ops_plus INTEGER,
scraped_at TEXT,
PRIMARY KEY (player_id, year, team)
);
CREATE TABLE IF NOT EXISTS pitching_stats (
player_id TEXT,
year INTEGER,
age INTEGER,
team TEXT,
wins INTEGER,
losses INTEGER,
era REAL,
games INTEGER,
games_started INTEGER,
innings_pitched REAL,
hits_allowed INTEGER,
earned_runs INTEGER,
home_runs_allowed INTEGER,
walks INTEGER,
strikeouts INTEGER,
whip REAL,
era_plus INTEGER,
scraped_at TEXT,
PRIMARY KEY (player_id, year, team)
);
CREATE TABLE IF NOT EXISTS game_logs (
player_id TEXT,
year INTEGER,
game_num INTEGER,
date TEXT,
team TEXT,
opponent TEXT,
at_bats INTEGER,
hits INTEGER,
home_runs INTEGER,
rbi INTEGER,
batting_avg REAL,
log_type TEXT,
scraped_at TEXT,
PRIMARY KEY (player_id, year, game_num, log_type)
);
CREATE TABLE IF NOT EXISTS war_data (
player_id TEXT,
year INTEGER,
team TEXT,
games INTEGER,
war REAL,
scraped_at TEXT,
PRIMARY KEY (player_id, year, team)
);
CREATE TABLE IF NOT EXISTS scrape_cache (
url TEXT PRIMARY KEY,
scraped_at TEXT,
status_code INTEGER
);
CREATE INDEX IF NOT EXISTS idx_batting_year ON batting_stats(year);
CREATE INDEX IF NOT EXISTS idx_pitching_year ON pitching_stats(year);
CREATE INDEX IF NOT EXISTS idx_gamelogs_player ON game_logs(player_id, year);
""")
conn.commit()
return conn
def save_batting_stats(conn: sqlite3.Connection, df: pd.DataFrame, player_id: str):
"""Save batting stats DataFrame to SQLite."""
if df is None or df.empty:
return 0
    # Reference mapping from BR column names to DB columns; the INSERT below applies it by hand
    col_map = {
"Year": "year", "Age": "age", "Tm": "team",
"G": "games", "PA": "plate_appearances", "AB": "at_bats",
"R": "runs", "H": "hits", "2B": "doubles", "3B": "triples",
"HR": "home_runs", "RBI": "rbi", "SB": "stolen_bases",
"CS": "caught_stealing", "BB": "walks", "SO": "strikeouts",
"BA": "batting_avg", "OBP": "obp", "SLG": "slg",
"OPS": "ops", "OPS+": "ops_plus",
}
rows_saved = 0
for _, row in df.iterrows():
try:
conn.execute(
"""INSERT OR REPLACE INTO batting_stats
(player_id, year, age, team, games, plate_appearances, at_bats,
runs, hits, doubles, triples, home_runs, rbi, stolen_bases,
caught_stealing, walks, strikeouts, batting_avg, obp, slg, ops, ops_plus, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
player_id,
_safe_int(row.get("Year")),
_safe_int(row.get("Age")),
str(row.get("Tm", "")),
_safe_int(row.get("G")),
_safe_int(row.get("PA")),
_safe_int(row.get("AB")),
_safe_int(row.get("R")),
_safe_int(row.get("H")),
_safe_int(row.get("2B")),
_safe_int(row.get("3B")),
_safe_int(row.get("HR")),
_safe_int(row.get("RBI")),
_safe_int(row.get("SB")),
_safe_int(row.get("CS")),
_safe_int(row.get("BB")),
_safe_int(row.get("SO")),
_safe_float(row.get("BA")),
_safe_float(row.get("OBP")),
_safe_float(row.get("SLG")),
_safe_float(row.get("OPS")),
_safe_int(row.get("OPS+")),
datetime.utcnow().isoformat(),
)
)
rows_saved += 1
except Exception as e:
print(f"[ERROR] Failed to save row: {e}")
conn.commit()
return rows_saved
def _safe_int(val) -> Optional[int]:
try:
return int(float(val))
except (TypeError, ValueError):
return None
def _safe_float(val) -> Optional[float]:
try:
return float(val)
except (TypeError, ValueError):
return None
def already_scraped(
conn: sqlite3.Connection,
player_id: str,
year: int,
table: str = "batting_stats",
) -> bool:
"""Check if data already exists for a player/year combination."""
row = conn.execute(
f"SELECT COUNT(*) FROM {table} WHERE player_id = ? AND year = ?",
(player_id, year),
).fetchone()
return row[0] > 0
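Once rows are in SQLite, downstream analysis is a query away. A self-contained sketch against an in-memory database; the stat lines are fabricated for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE batting_stats (
    player_id TEXT, year INTEGER, team TEXT, home_runs INTEGER,
    PRIMARY KEY (player_id, year, team))""")
conn.executemany(
    "INSERT INTO batting_stats VALUES (?,?,?,?)",
    [("troutmi01", 2022, "LAA", 40), ("troutmi01", 2023, "LAA", 18),
     ("judgeaa01", 2022, "NYY", 62)],
)
# Career home runs per player, highest first
rows = conn.execute(
    "SELECT player_id, SUM(home_runs) FROM batting_stats "
    "GROUP BY player_id ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('judgeaa01', 62), ('troutmi01', 58)]
```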
Rate Limiting Best Practices
Baseball-Reference is a free resource maintained by a small team. Don't hammer it. The site runs on advertising, and heavy automated traffic doesn't generate ad revenue but does cost them bandwidth.
# Practical rate limits that keep you under the radar:
SCRAPE_CONFIG = {
"min_delay_between_requests": 3.0, # Never go below 3 seconds
"max_delay_between_requests": 6.0,
"delay_between_players": 8.0, # Longer pause when switching players
"delay_between_seasons": 5.0, # Between seasons for same player
"max_requests_per_hour": 120, # 2 per minute max for single IP
"cache_historical_data": True, # Pre-2020 data is static, cache forever
}
class RateLimiter:
"""Simple rate limiter to enforce polite scraping."""
def __init__(self, min_interval: float = 3.0, max_interval: float = 6.0):
self.min_interval = min_interval
self.max_interval = max_interval
self.last_request = 0.0
def wait(self):
"""Wait the appropriate amount before the next request."""
elapsed = time.time() - self.last_request
target = random.uniform(self.min_interval, self.max_interval)
if elapsed < target:
time.sleep(target - elapsed)
self.last_request = time.time()
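The wait logic is easiest to verify with the clock factored out. A hypothetical dry-run helper mirroring RateLimiter.wait without actually sleeping:

```python
def compute_wait(last_request: float, now: float, target: float) -> float:
    # Mirror of RateLimiter.wait: sleep only for the remainder of the gap
    elapsed = now - last_request
    return max(0.0, target - elapsed)

print(compute_wait(100.0, 102.0, 5.0))  # 3.0, two seconds already elapsed
print(compute_wait(100.0, 110.0, 5.0))  # 0.0, the gap is already satisfied
```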
Complete Production Pipeline
def run_player_pipeline(
player_ids: List[str],
years: List[int],
db_path: str = "baseball.db",
proxy_pool: Optional[ThorDataProxyPool] = None,
) -> Dict:
"""Complete pipeline for scraping player stats across seasons."""
conn = init_database(db_path)
rate_limiter = RateLimiter(min_interval=3.5, max_interval=7.0)
stats = {
"players_scraped": 0,
"seasons_scraped": 0,
"game_logs_scraped": 0,
"errors": 0,
"skipped": 0,
}
for player_id in player_ids:
print(f"\n[PLAYER] {player_id}")
proxy = proxy_pool.get_us_proxy() if proxy_pool else None
# Get career batting stats
rate_limiter.wait()
batting_df = get_player_batting_stats(player_id, proxy=proxy)
if batting_df is not None:
rows = save_batting_stats(conn, batting_df, player_id)
stats["seasons_scraped"] += rows
print(f" Batting stats: {rows} season rows saved")
# Get WAR data
rate_limiter.wait()
        war_df = get_war_data(player_id, proxy=proxy)
        if war_df is not None and not war_df.empty:
            # Summarize from the frame already in hand; calling
            # get_career_war_summary here would re-request the same page
            war_col = next((c for c in war_df.columns if c in ("WAR", "rWAR")), None)
            if war_col:
                career = pd.to_numeric(war_df[war_col], errors="coerce").sum()
                print(f"  Career WAR: {career:.1f}")
# Get game logs for requested years
for year in years:
if already_scraped(conn, player_id, year, "game_logs"):
print(f" [SKIP] {player_id} {year} game logs already in DB")
stats["skipped"] += 1
continue
rate_limiter.wait()
logs_df = get_game_logs(player_id, year, proxy=proxy)
if logs_df is not None and not logs_df.empty:
# Save game logs (simplified — full save would map all columns)
stats["game_logs_scraped"] += len(logs_df)
print(f" {year} game logs: {len(logs_df)} games")
stats["players_scraped"] += 1
time.sleep(random.uniform(8.0, 15.0)) # Long pause between players
conn.close()
print(f"\nPipeline complete: {stats}")
return stats
# Example usage
if __name__ == "__main__":
PLAYERS = [
"troutmi01", # Mike Trout
"judgeaa01", # Aaron Judge
"bettsmo01", # Mookie Betts
"arenaro01", # Nolan Arenado
"goldspa01", # Paul Goldschmidt
]
YEARS = [2022, 2023, 2024, 2025]
# With ThorData proxy
# pool = ThorDataProxyPool("YOUR_USER", "YOUR_PASS")
# run_player_pipeline(PLAYERS, YEARS, proxy_pool=pool)
# Without proxy (slower, risk of IP block at scale)
run_player_pipeline(PLAYERS, YEARS)
Real-World Use Cases
Fantasy Baseball Model: Pull WAR data and game logs for all current players. Build regression models predicting next-season WAR from age, injury history, and recent performance trends. Backtest against historical data to validate.
Historical Trend Analysis: Pull season leader data from 1900-present to track how baseball has changed. Strikeout rates, home runs per game, stolen base frequency — all quantifiable with this data.
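As a concrete sketch of that pattern, league strikeout rate (SO per PA) by year falls out of a single GROUP BY over the batting_stats table; the rows below are fabricated:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE batting_stats "
    "(player_id TEXT, year INTEGER, plate_appearances INTEGER, strikeouts INTEGER)"
)
conn.executemany(
    "INSERT INTO batting_stats VALUES (?,?,?,?)",
    [("a", 2000, 600, 90), ("b", 2000, 400, 60),
     ("a", 2023, 600, 150), ("b", 2023, 400, 100)],
)
# League-wide SO/PA per season, oldest first
rates = conn.execute(
    "SELECT year, ROUND(1.0 * SUM(strikeouts) / SUM(plate_appearances), 3) "
    "FROM batting_stats GROUP BY year ORDER BY year"
).fetchall()
print(rates)  # [(2000, 0.15), (2023, 0.25)]
```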
Player Comparison Tool: Given two player IDs, fetch their career stats, WAR trajectories, and game log consistency metrics. Generate statistical comparisons normalized for era and park factors.
Injury Impact Quantification: Cross-reference game log gaps (stretches of consecutive missed games) with subsequent performance. Quantify how different injury types affect player output across their careers.
Draft Value Optimizer: Compile statistics for prospects' minor and major league stints, identify which minor league performance metrics best predict major league success, and build a draft value model.
Baseball-Reference has an enormous amount of data if you're patient about pulling it. WAR going back to 1871, every game log, park factors, splits by handedness, leverage index — it's all there in structured HTML tables. The scraping itself is straightforward once you know which table IDs to target and how to handle the HTML comment-wrapped tables. The main challenges are the anti-bot layer and just being a good citizen about request volume. Use ThorData's residential proxy network for any serious bulk collection, cache historical data aggressively, and never re-fetch data you already have.