---
slug: scrape-pro-football-reference-2026
title: "Scrape Pro Football Reference: NFL Stats, Game Logs & Advanced Metrics with Python (2026)"
date: 2026-04-09
description: "How to scrape Pro Football Reference for NFL player stats, game logs, and advanced metrics using Python — with working code, the HTML comment table trick, anti-bot handling, proxy rotation, and SQLite storage."
tags: [python, scraping, nfl, football, sports-data]
---
Scrape Pro Football Reference: NFL Stats, Game Logs & Advanced Metrics with Python (2026)
Pro Football Reference is the canonical source for NFL data. It's part of the Sports Reference family (same people behind Baseball Reference, Basketball Reference, etc.) and covers every NFL season going back to 1920. If you need career stats, game logs, draft data, or advanced metrics like Approximate Value (AV) or EPA, this is the site.
The data is genuinely excellent. AV is their proprietary catch-all metric for comparing players across positions and eras. They also surface EPA (Expected Points Added) for offensive plays, and they link to DVOA data from Football Outsiders. For anyone building models, doing fantasy research, or conducting sports analytics, it's the first stop.
This guide walks through scraping it with Python. There are quirks — including one genuinely weird one involving HTML comments — so let's get into it.
Setup
You need four libraries:
pip install requests beautifulsoup4 lxml pandas
lxml is the parser. It's faster and more lenient than the built-in html.parser, which matters when you're dealing with large stat tables. pandas handles the table parsing cleanly once you have the HTML element.
The Comment Table Quirk
Before anything else, you need to know this: Sports Reference wraps many of their stat tables in HTML comments. This appears to be a page-load optimization (the commented tables are revealed client-side after the initial render), and the practical effect is that naive scrapers fail out of the box. If you run soup.find('table', id='passing') and get None, this is almost certainly why.
Here's how to handle it:
from bs4 import BeautifulSoup, Comment
import requests
SESSION = requests.Session()
SESSION.headers.update({
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
})
def get_soup(url: str) -> BeautifulSoup:
resp = SESSION.get(url, timeout=15)
resp.raise_for_status()
return BeautifulSoup(resp.text, "lxml")
def find_commented_table(soup: BeautifulSoup, table_id: str) -> BeautifulSoup | None:
"""Extract a table that Sports Reference hid inside an HTML comment."""
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
comment_soup = BeautifulSoup(comment, "lxml")
table = comment_soup.find("table", id=table_id)
if table:
return table
return None
def get_table(soup: BeautifulSoup, table_id: str) -> BeautifulSoup | None:
"""Try direct lookup first, then fall back to comment extraction."""
table = soup.find("table", id=table_id)
if table is None:
table = find_commented_table(soup, table_id)
return table
Use get_table() for every PFR table lookup. You'll need the comment fallback constantly — the majority of their detailed stat tables are hidden in comments.
Understanding PFR's URL Structure
Player pages follow a consistent pattern based on an encoded player ID:
https://www.pro-football-reference.com/players/{first_letter}/{player_id}.htm
Where {player_id} is typically built from the first four letters of the last name, the first two letters of the first name, and a two-digit counter (e.g., MahoPa00 for Patrick Mahomes, BradTo00 for Tom Brady). The counter disambiguates collisions: a second player whose name produced the MahoPa stem would get MahoPa01.
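That convention can be sketched as a small helper. Treat this as a heuristic, not a guarantee: short last names, suffixes (Jr., III), and legacy players deviate from the pattern, so always verify that the constructed URL actually resolves.

```python
def guess_player_id(first: str, last: str, counter: int = 0) -> str:
    """Heuristic PFR player ID: first 4 letters of last name +
    first 2 of first name + two-digit counter. Not guaranteed to
    match for every player; verify the URL before relying on it."""
    return f"{last[:4].capitalize()}{first[:2].capitalize()}{counter:02d}"

print(guess_player_id("Patrick", "Mahomes"))  # MahoPa00
print(guess_player_id("Tom", "Brady"))        # BradTo00
```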
Other important URL patterns:
# Season stats pages
/years/{year}/passing.htm # All passers for a season
/years/{year}/rushing.htm
/years/{year}/receiving.htm
/years/{year}/defense.htm
# Game logs
/players/{L}/{player_id}/gamelog/{year}/
# Advanced stats
/years/{year}/passing_advanced.htm
/years/{year}/rushing_advanced.htm
# Draft
/years/{year}/draft.htm
# Team pages
/teams/{abbr}/{year}.htm
# Play-by-play (game)
/boxscores/{game_id}.htm
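Since these patterns are stable, it helps to centralize URL construction in a few small builders rather than interpolating strings at every call site. A minimal sketch based on the patterns above:

```python
BASE = "https://www.pro-football-reference.com"

def season_url(year: int, stat_type: str = "passing") -> str:
    """Leaderboard page for one season and stat type."""
    return f"{BASE}/years/{year}/{stat_type}.htm"

def gamelog_url(player_id: str, year: int) -> str:
    """Weekly game log for one player-season."""
    return f"{BASE}/players/{player_id[0].upper()}/{player_id}/gamelog/{year}/"

def team_url(abbr: str, year: int) -> str:
    """Team season page (PFR uses lowercase team abbreviations)."""
    return f"{BASE}/teams/{abbr.lower()}/{year}.htm"

print(season_url(2024))             # .../years/2024/passing.htm
print(gamelog_url("MahoPa00", 2024))
print(team_url("KAN", 2024))        # .../teams/kan/2024.htm
```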
Scraping Player Career Stats
Let's pull Patrick Mahomes' career passing stats:
import pandas as pd
from io import StringIO
import time
import random
def scrape_career_passing(player_url: str) -> pd.DataFrame | None:
"""Scrape career passing stats from a player page."""
soup = get_soup(player_url)
table = get_table(soup, "passing")
if table is None:
print(f"No passing table found at {player_url}")
return None
df = pd.read_html(StringIO(str(table)))[0]
# Drop multi-header rows that PFR inserts every N rows
df = df[df["Year"].notna()]
df = df[~df["Year"].astype(str).str.contains("Year|Career|yrs", na=False)]
# Convert numeric columns
numeric_cols = ["G", "GS", "Cmp", "Att", "Yds", "TD", "Int", "Rate", "Sk", "AV"]
for col in numeric_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
df = df.reset_index(drop=True)
return df
mahomes_url = "https://www.pro-football-reference.com/players/M/MahoPa00.htm"
career_stats = scrape_career_passing(mahomes_url)
if career_stats is not None:
print(career_stats[["Year", "Tm", "G", "Cmp", "Att", "Yds", "TD", "Int", "Rate"]].to_string())
The pd.read_html() call does the heavy lifting once you have the table element. The main cleanup tasks are removing the repeated header rows that PFR inserts every few rows and stripping out the career totals row at the bottom.
Scraping Game Logs for a Season
Game logs give you week-by-week performance — much more useful for modeling than career averages. They live at a URL like /players/M/MahoPa00/gamelog/2024/.
def scrape_game_log(player_id: str, year: int) -> pd.DataFrame | None:
"""Scrape weekly game log for a player in a given season."""
first_letter = player_id[0].upper()
url = f"https://www.pro-football-reference.com/players/{first_letter}/{player_id}/gamelog/{year}/"
soup = get_soup(url)
table = get_table(soup, "stats")
if table is None:
print(f"No stats table found for {player_id} {year}")
return None
df = pd.read_html(StringIO(str(table)))[0]
# Flatten multi-level columns if present (PFR uses grouped headers)
if isinstance(df.columns, pd.MultiIndex):
df.columns = ["_".join(filter(None, col)).strip() for col in df.columns]
# Remove non-game rows (bye weeks, repeated headers, career totals)
if "Week" in df.columns:
df = df[pd.to_numeric(df["Week"], errors="coerce").notna()]
elif "Rk" in df.columns:
df = df[pd.to_numeric(df["Rk"], errors="coerce").notna()]
df = df.reset_index(drop=True)
return df
game_log = scrape_game_log("MahoPa00", 2024)
if game_log is not None:
print(f"Games: {len(game_log)}")
print(game_log.head(3).to_string())
The multi-level column handling is necessary because PFR uses grouped headers (Passing / Rushing / etc.) which pandas interprets as a MultiIndex. The flattening logic joins the level names with underscores.
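To see what that flattening does in isolation, here is the same logic applied to a toy frame with grouped headers (the column names are illustrative, not PFR's exact ones):

```python
import pandas as pd

# A toy frame mimicking PFR's grouped headers (Passing / Rushing)
df = pd.DataFrame(
    [[300, 2, 25, 1]],
    columns=pd.MultiIndex.from_tuples([
        ("Passing", "Yds"), ("Passing", "TD"),
        ("Rushing", "Yds"), ("Rushing", "TD"),
    ]),
)
# Same flattening as scrape_game_log(): join levels with "_",
# dropping empty level names via filter(None, ...)
df.columns = ["_".join(filter(None, col)).strip() for col in df.columns]
print(list(df.columns))  # ['Passing_Yds', 'Passing_TD', 'Rushing_Yds', 'Rushing_TD']
```

The `filter(None, col)` step matters because ungrouped columns (like Date or Opp) often have an empty string as their top level, and you want `Date`, not `_Date`.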
Season-Level Leaderboards
For building datasets across all players in a season, the season stats pages are the entry point:
def scrape_season_stats(year: int, stat_type: str = "passing") -> pd.DataFrame | None:
"""
Scrape full season stats for all players.
stat_type: 'passing', 'rushing', 'receiving', 'defense', 'kicking'
"""
url = f"https://www.pro-football-reference.com/years/{year}/{stat_type}.htm"
soup = get_soup(url)
table = get_table(soup, stat_type)
if table is None:
# Try alternate table IDs
for alt_id in [f"{stat_type}_stats", "stats"]:
table = get_table(soup, alt_id)
if table:
break
if table is None:
return None
df = pd.read_html(StringIO(str(table)))[0]
# Remove repeated header rows
if "Rk" in df.columns:
df = df[pd.to_numeric(df["Rk"], errors="coerce").notna()]
# Drop rank column
df = df.drop(columns=["Rk"], errors="ignore")
df = df.reset_index(drop=True)
return df
# Get all passers for the 2024 season
passers_2024 = scrape_season_stats(2024, "passing")
if passers_2024 is not None:
print(f"Total passers: {len(passers_2024)}")
top5 = passers_2024.nlargest(5, "Yds") if "Yds" in passers_2024.columns else passers_2024.head(5)
print(top5[["Player", "Tm", "G", "Att", "Yds", "TD", "Int"]].to_string())
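When you scrape several seasons this way, you usually want one long DataFrame rather than a dict of per-year frames. A small helper (names are mine, not PFR's) that tags each frame with its season before concatenating:

```python
import pandas as pd

def combine_seasons(frames: dict[int, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-season leaderboards into one long frame with a Year column."""
    parts = []
    for year, df in frames.items():
        part = df.copy()
        part["Year"] = year
        parts.append(part)
    return pd.concat(parts, ignore_index=True)
```

Usage would be something like `combine_seasons({y: scrape_season_stats(y, "passing") for y in range(2020, 2025)})`, filtering out any years that returned None.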
Extracting Advanced Metrics
The advanced stats pages contain metrics not available on standard leaderboards:
def scrape_advanced_passing(year: int) -> pd.DataFrame | None:
"""Scrape advanced passing stats for a season."""
url = f"https://www.pro-football-reference.com/years/{year}/passing_advanced.htm"
soup = get_soup(url)
# Try multiple table IDs — PFR uses different IDs for different sub-tables
for table_id in ["advanced_air_yards", "advanced_accuracy", "advanced_pressure"]:
table = get_table(soup, table_id)
if table:
df = pd.read_html(StringIO(str(table)))[0]
# Flatten multi-level headers
if isinstance(df.columns, pd.MultiIndex):
df.columns = ["_".join(filter(None, col)).strip() for col in df.columns]
            # Drop repeated header rows: keep rows whose first cell is numeric
            first_col = df.columns[0]
            df = df[pd.to_numeric(df[first_col], errors="coerce").notna()]
return df
return None
def scrape_advanced_rushing(year: int) -> pd.DataFrame | None:
"""Scrape advanced rushing stats."""
url = f"https://www.pro-football-reference.com/years/{year}/rushing_advanced.htm"
soup = get_soup(url)
table = get_table(soup, "advanced_rushing")
if table is None:
return None
df = pd.read_html(StringIO(str(table)))[0]
if isinstance(df.columns, pd.MultiIndex):
df.columns = ["_".join(filter(None, col)).strip() for col in df.columns]
return df
advanced = scrape_advanced_passing(2024)
if advanced is not None:
print(f"Advanced passing columns: {list(advanced.columns)[:10]}")
# Available metrics typically include:
# IAY/PA (intended air yards per attempt)
# CAY/Cmp (completed air yards per completion)
# Drop% (drop rate)
# BadTh% (bad throw rate)
# OnTgt% (on-target throw rate)
# Prss% (under pressure rate)
# Scrm% (scramble rate)
Draft Data
Historical draft data is valuable for prospect analysis and for building career trajectory models:
def scrape_draft(year: int) -> pd.DataFrame | None:
"""Scrape NFL Draft picks for a given year."""
url = f"https://www.pro-football-reference.com/years/{year}/draft.htm"
soup = get_soup(url)
table = get_table(soup, "drafts")
if table is None:
return None
df = pd.read_html(StringIO(str(table)))[0]
# Flatten multi-level headers
if isinstance(df.columns, pd.MultiIndex):
df.columns = ["_".join(filter(None, col)).strip() for col in df.columns]
# Remove header-repeat rows
df = df[df.iloc[:, 0].apply(lambda x: str(x).isdigit())]
df = df.reset_index(drop=True)
return df
draft_2024 = scrape_draft(2024)
if draft_2024 is not None:
print(f"Draft picks: {len(draft_2024)}")
print(draft_2024.head(5).to_string())
Anti-Bot Handling and Rate Limits
Sports Reference publicly documents a rate limit of 20 requests per minute per IP (see their bot-traffic FAQ). Exceed that and you'll get a 429 or a temporary ban. They also check user agents — a default python-requests/2.x header will get you blocked fast.
The key rules:
1. Use a realistic browser user agent (rotated across a few options)
2. Add a Referer header pointing to the site itself
3. Wait 3–5 seconds between requests (20 req/min = 3 seconds minimum)
4. Back off exponentially on 429 responses
USER_AGENTS = [
("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"),
("Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) "
"Version/17.4 Safari/605.1.15"),
("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
"Gecko/20100101 Firefox/125.0"),
]
def get_soup_polite(url: str, min_delay: float = 3.0,
max_delay: float = 6.0) -> BeautifulSoup:
"""Fetch a PFR page with realistic headers and rate limiting."""
SESSION.headers.update({
"User-Agent": random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.pro-football-reference.com/",
})
time.sleep(random.uniform(min_delay, max_delay))
for attempt in range(4):
try:
resp = SESSION.get(url, timeout=15)
if resp.status_code == 429:
wait = 2 ** attempt * 15 + random.uniform(0, 10)
print(f"Rate limited. Waiting {wait:.0f}s...")
time.sleep(wait)
continue
resp.raise_for_status()
return BeautifulSoup(resp.text, "lxml")
except requests.exceptions.Timeout:
wait = 2 ** attempt * 5
print(f"Timeout on attempt {attempt+1}. Waiting {wait}s...")
time.sleep(wait)
    raise RuntimeError(f"Failed after 4 attempts: {url}")
Proxy Rotation for Bulk Scraping
For pulling data across hundreds of players or multiple seasons, a single IP will hit the rate limit even with delays. Sports Reference's detection is tuned to catch datacenter IPs and volume-based patterns from single sources.
ThorData residential proxies work well here — residential IPs look like real users rather than datacenter traffic. Integrate them into the requests session:
PROXY_USER = "your_thordata_username"
PROXY_PASS = "your_thordata_password"
def get_proxy_session(session_id: int | None = None) -> requests.Session:
    """Create a requests session routed through a residential proxy."""
    sid = session_id or random.randint(1000, 9999)
proxy_url = f"http://{PROXY_USER}-session-{sid}:{PROXY_PASS}@rotating.thordata.net:10000"
s = requests.Session()
s.proxies = {"http": proxy_url, "https": proxy_url}
s.headers.update({
"User-Agent": random.choice(USER_AGENTS),
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.pro-football-reference.com/",
})
return s
def scrape_with_proxy(url: str, session_id: int | None = None) -> BeautifulSoup:
"""Fetch a PFR page via residential proxy."""
s = get_proxy_session(session_id)
time.sleep(random.uniform(3, 6))
resp = s.get(url, timeout=20)
resp.raise_for_status()
return BeautifulSoup(resp.text, "lxml")
Rotate the session ID every few requests to get different exit IPs while maintaining session continuity within a small burst.
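That rotation schedule can be made explicit with a tiny planner. This is a sketch with names of my own choosing: it just pairs each URL with a sticky session ID that advances every few requests, which you would then pass to get_proxy_session().

```python
def assign_sessions(urls: list[str], rotate_every: int = 5,
                    base_sid: int = 1000) -> list[tuple[int, str]]:
    """Pair each URL with a session ID that changes every `rotate_every`
    requests, so a short burst shares one exit IP before rotating."""
    return [(base_sid + i // rotate_every, url) for i, url in enumerate(urls)]

plan = assign_sessions(["/years/2024/passing.htm", "/years/2024/rushing.htm",
                        "/years/2024/receiving.htm", "/years/2024/defense.htm"],
                       rotate_every=2)
for sid, url in plan:
    print(sid, url)  # first two share sid 1000, next two share sid 1001
```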
Storing Results in SQLite
For any serious data collection, persist to SQLite as you go:
import sqlite3
import json
def init_pfr_db(db_path: str = "pfr_data.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS players (
player_id TEXT PRIMARY KEY,
name TEXT,
position TEXT,
college TEXT,
draft_year INTEGER,
draft_round INTEGER,
draft_pick INTEGER,
team TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS career_stats (
player_id TEXT,
year INTEGER,
team TEXT,
games INTEGER,
games_started INTEGER,
stat_type TEXT,
stats_json TEXT,
PRIMARY KEY (player_id, year, team, stat_type),
FOREIGN KEY (player_id) REFERENCES players(player_id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS game_logs (
player_id TEXT,
year INTEGER,
week INTEGER,
opponent TEXT,
home_away TEXT,
result TEXT,
stats_json TEXT,
PRIMARY KEY (player_id, year, week),
FOREIGN KEY (player_id) REFERENCES players(player_id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS season_leaders (
year INTEGER,
stat_type TEXT,
player_name TEXT,
team TEXT,
stats_json TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (year, stat_type, player_name)
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_career_player ON career_stats(player_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_gamelog_player ON game_logs(player_id, year)")
conn.commit()
return conn
def _safe_int(val) -> int:
    """Coerce a stat cell to int, treating NaN/blank/missing as 0."""
    num = pd.to_numeric(val, errors="coerce")
    return 0 if pd.isna(num) else int(num)
def store_career_stats(conn: sqlite3.Connection, player_id: str,
                       df: pd.DataFrame, stat_type: str):
    """Store career stats DataFrame in SQLite."""
    for _, row in df.iterrows():
        year_val = row.get("Year")
        try:
            year = int(str(year_val).replace("*", "").replace("+", ""))
        except (ValueError, TypeError):
            continue
        conn.execute(
            "INSERT OR REPLACE INTO career_stats VALUES (?,?,?,?,?,?,?)",
            (
                player_id,
                year,
                row.get("Tm", ""),
                _safe_int(row.get("G")),   # NaN-safe: int(nan) would raise
                _safe_int(row.get("GS")),
                stat_type,
                row.to_json(),
            )
        )
    conn.commit()
def store_season_leaders(conn: sqlite3.Connection, year: int,
stat_type: str, df: pd.DataFrame):
"""Store season leaderboard in SQLite."""
for _, row in df.iterrows():
conn.execute(
"INSERT OR REPLACE INTO season_leaders VALUES (?,?,?,?,?,CURRENT_TIMESTAMP)",
(year, stat_type, row.get("Player", ""), row.get("Tm", ""), row.to_json())
)
conn.commit()
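Because each row is stored as a stats_json blob, reading the data back out is just a matter of deserializing. A minimal loader for the season_leaders table defined above (the function name is mine):

```python
import json
import sqlite3
import pandas as pd

def load_season(conn: sqlite3.Connection, year: int, stat_type: str) -> pd.DataFrame:
    """Rebuild a leaderboard DataFrame from the stored stats_json blobs."""
    rows = conn.execute(
        "SELECT stats_json FROM season_leaders WHERE year = ? AND stat_type = ?",
        (year, stat_type),
    ).fetchall()
    # Each blob is a column->value mapping produced by row.to_json()
    return pd.DataFrame([json.loads(r[0]) for r in rows])
```

This round-trip keeps the schema flexible: new columns PFR adds in future seasons land in the JSON automatically, without a migration.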
Building a Multi-Season Dataset
Here's a complete pipeline for building a multi-season passing dataset:
def build_passing_dataset(years: list, db_path: str = "pfr_passing.db"):
"""Build a complete passing stats dataset for specified years."""
conn = init_pfr_db(db_path)
for year in years:
print(f"\nScraping {year} season...")
df = scrape_season_stats(year, "passing")
if df is not None:
store_season_leaders(conn, year, "passing", df)
print(f" {len(df)} passers stored")
# Advanced stats
adv = scrape_advanced_passing(year)
if adv is not None:
store_season_leaders(conn, year, "passing_advanced", adv)
print(f" {len(adv)} advanced passing records stored")
# Polite delay between seasons
time.sleep(random.uniform(5, 10))
conn.close()
print(f"\nDataset complete: {db_path}")
# Build a 5-year dataset
build_passing_dataset(list(range(2020, 2025)))
Appending New Data Without Duplication
For regular updates, append new data rather than re-scraping everything:
def update_season_data(conn: sqlite3.Connection, year: int, stat_type: str) -> int:
"""Update season data if not already present. Returns rows added."""
existing = conn.execute(
"SELECT COUNT(*) FROM season_leaders WHERE year = ? AND stat_type = ?",
(year, stat_type)
).fetchone()[0]
if existing > 0:
print(f" {year} {stat_type}: already have {existing} rows, skipping")
return 0
df = scrape_season_stats(year, stat_type)
if df is None:
return 0
store_season_leaders(conn, year, stat_type, df)
print(f" {year} {stat_type}: added {len(df)} rows")
return len(df)
Key Takeaways
The main things to remember when scraping PFR:
- Most tables are in HTML comments — always use the get_table() helper with comment fallback
- Respect the 20 req/min rate limit — use 3–6 second randomized delays between requests
- Use a realistic browser user agent — rotate across Chrome, Firefox, and Safari strings
- Add a Referer header — https://www.pro-football-reference.com/
- For multi-player or multi-season jobs, rotate proxies — residential IPs via ThorData
- Handle multi-level column headers — flatten with "_".join(col) for MultiIndex
- Player IDs follow the LastFirst00 pattern — construct programmatically from roster lists
PFR's data quality is high and the URL structure is consistent, which makes it reasonably pleasant to work with once you know the comment trick. The combination of career stats, game logs, advanced metrics, and draft data makes it one of the best free sports databases available.
Team Season Pages
Team pages give you roster, schedule, and game-by-game results for a specific season:
def scrape_team_season(team_abbr, year):
url = f'https://www.pro-football-reference.com/teams/{team_abbr.lower()}/{year}.htm'
soup = get_soup_polite(url)
# Passing stats table
passing_table = get_table(soup, 'passing')
# Rushing stats table
rushing_table = get_table(soup, 'rushing')
# Schedule and results
schedule_table = get_table(soup, 'games')
results = {}
for name, table in [('passing', passing_table), ('rushing', rushing_table),
('schedule', schedule_table)]:
if table:
df = pd.read_html(StringIO(str(table)))[0]
results[name] = df
return results
# Get the 2024 Kansas City Chiefs season
chiefs_2024 = scrape_team_season('kan', 2024)
if 'schedule' in chiefs_2024:
print(chiefs_2024['schedule'].head())
Play-by-Play Box Scores
Individual game box scores contain quarter-by-quarter scoring and full play-by-play tables for each game:
def scrape_game_boxscore(game_id):
url = f'https://www.pro-football-reference.com/boxscores/{game_id}.htm'
soup = get_soup_polite(url)
# Team stats table
team_stats = get_table(soup, 'team_stats')
# Play-by-play
pbp = get_table(soup, 'pbp')
result = {}
if team_stats:
result['team_stats'] = pd.read_html(StringIO(str(team_stats)))[0]
if pbp:
result['pbp'] = pd.read_html(StringIO(str(pbp)))[0]
return result
Game IDs follow the format YYYYMMDD0abb, where abb is PFR's lowercase abbreviation for the home team. For example: 202402110sfo for Super Bowl LVIII (played February 11, 2024, with the 49ers as the designated home team).
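That convention can be captured in a one-line builder. The function name is mine, and the format is an assumption based on observed URLs, so verify the constructed URL resolves before scraping in bulk:

```python
def build_game_id(date_iso: str, home_abbr: str) -> str:
    """Construct a boxscore ID from an ISO date and home-team abbreviation.

    Assumes the YYYYMMDD0abb convention; note that PFR's abbreviations
    (e.g. 'kan', 'sfo', 'nwe') differ from common broadcast abbreviations.
    """
    return date_iso.replace("-", "") + "0" + home_abbr.lower()

print(build_game_id("2024-02-11", "sfo"))  # 202402110sfo
```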
Approximate Value (AV) — PFR's Catch-All Metric
AV appears in career stat tables and is PFR's attempt to create a single number that captures a player's value in a season across all positions. It is imperfect but useful for cross-position and cross-era comparisons:
def extract_av_history(player_url):
soup = get_soup_polite(player_url)
table = get_table(soup, 'passing') # or rushing, receiving, etc.
if table is None:
return None
df = pd.read_html(StringIO(str(table)))[0]
df = df[df['Year'].notna()]
df = df[~df['Year'].astype(str).str.contains('Year|Career', na=False)]
if 'AV' in df.columns:
df['AV'] = pd.to_numeric(df['AV'], errors='coerce')
return df[['Year', 'Tm', 'AV']].dropna()
return None
av = extract_av_history(mahomes_url)
if av is not None:
total_av = av['AV'].sum()
peak_av = av['AV'].max()
print(f'Career AV: {total_av}, Peak Season: {peak_av}')
Handling Pagination in Season Pages
The main season leaderboard pages (passing.htm, rushing.htm, etc.) load all players on a single page — no pagination required. For older seasons the tables are naturally much smaller, so before treating a scrape as complete, sanity-check the row count against what you'd expect for that era rather than assuming a short table means a parsing failure (or vice versa).
Summary
Pro Football Reference is the most complete free NFL data source available:
- The comment trick is essential: use find_commented_table() for every table lookup
- Rate limit is 20 req/min — maintain 3–6 second delays between requests
- Multi-level headers appear in many tables — flatten with "_".join(col) on MultiIndex
- Proxy rotation is needed for bulk multi-player or multi-season jobs
- Player IDs follow the LastFirst00 convention and are consistent across pages
- AV is useful for cross-position comparisons; EPA and DVOA are available elsewhere
Available data: career stats, game logs, play-by-play, draft records, team seasons, advanced metrics (air yards, pressure rate, on-target%), and historical data back to 1920. For fantasy or modeling purposes, game logs and advanced passing metrics are the most actionable tables.