---
slug: scrape-pro-football-reference-2026
title: "Scrape Pro Football Reference: NFL Stats, Game Logs & Advanced Metrics with Python (2026)"
date: 2026-04-09
description: "How to scrape Pro Football Reference for NFL player stats, game logs, and advanced metrics using Python — with working code, the HTML comment table trick, anti-bot handling, proxy rotation, and SQLite storage."
tags: [python, scraping, nfl, football, sports-data]
---
Scrape Pro Football Reference: NFL Stats, Game Logs & Advanced Metrics with Python (2026)
Pro Football Reference is the canonical source for NFL data. It's part of the Sports Reference family (same people behind Baseball Reference, Basketball Reference, etc.) and covers every NFL season going back to 1920. If you need career stats, game logs, draft data, or advanced metrics like Approximate Value (AV) or EPA, this is the site.
The data is genuinely excellent. AV is their proprietary catch-all metric for comparing players across positions and eras. They also surface EPA (Expected Points Added) for offensive plays, and they link to DVOA data from Football Outsiders. For anyone building models, doing fantasy research, or conducting sports analytics, it's the first stop.
This guide walks through scraping it with Python. There are quirks — including one genuinely weird one involving HTML comments — so let's get into it.
Setup
You need four libraries:
pip install requests beautifulsoup4 lxml pandas
lxml is the parser. It's faster and more lenient than the built-in html.parser, which matters when you're dealing with large stat tables. pandas handles the table parsing cleanly once you have the HTML element.
The Comment Table Quirk
Before anything else, you need to know this: Sports Reference wraps many of their stat tables in HTML comments. This appears to be a page-load optimization (the commented tables are revealed client-side after the initial render), and the practical effect is that naive scrapers fail out of the box. If you run soup.find('table', id='passing') and get None, this is almost certainly why.
Here's how to handle it:
from bs4 import BeautifulSoup, Comment
import requests
SESSION = requests.Session()
SESSION.headers.update({
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
})
def get_soup(url: str) -> BeautifulSoup:
resp = SESSION.get(url, timeout=15)
resp.raise_for_status()
return BeautifulSoup(resp.text, "lxml")
def find_commented_table(soup: BeautifulSoup, table_id: str) -> BeautifulSoup | None:
"""Extract a table that Sports Reference hid inside an HTML comment."""
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
comment_soup = BeautifulSoup(comment, "lxml")
table = comment_soup.find("table", id=table_id)
if table:
return table
return None
def get_table(soup: BeautifulSoup, table_id: str) -> BeautifulSoup | None:
"""Try direct lookup first, then fall back to comment extraction."""
table = soup.find("table", id=table_id)
if table is None:
table = find_commented_table(soup, table_id)
return table
Use get_table() for every PFR table lookup. You'll need the comment fallback constantly — the majority of their detailed stat tables are hidden in comments.
Understanding PFR's URL Structure
Player pages follow a consistent pattern based on an encoded player ID:
https://www.pro-football-reference.com/players/{first_letter}/{player_id}.htm
Where {player_id} is typically built from the first four letters of the last name, the first two letters of the first name, and a two-digit counter (e.g., MahoPa00 for Patrick Mahomes, BradTo00 for Tom Brady). The counter disambiguates collisions: a second player whose name produced the MahoPa stem would get MahoPa01.
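That convention can be sketched as a small helper. Treat this as a heuristic, not a guarantee: short last names, suffixes (Jr., III), and legacy players deviate from the pattern, so always verify that the constructed URL actually resolves.

```python
def guess_player_id(first: str, last: str, counter: int = 0) -> str:
    """Heuristic PFR player ID: first 4 letters of last name +
    first 2 of first name + two-digit counter. Not guaranteed to
    match for every player; verify the URL before relying on it."""
    return f"{last[:4].capitalize()}{first[:2].capitalize()}{counter:02d}"

print(guess_player_id("Patrick", "Mahomes"))  # MahoPa00
print(guess_player_id("Tom", "Brady"))        # BradTo00
```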
Other important URL patterns:
# Season stats pages
/years/{year}/passing.htm # All passers for a season
/years/{year}/rushing.htm
/years/{year}/receiving.htm
/years/{year}/defense.htm
# Game logs
/players/{L}/{player_id}/gamelog/{year}/
# Advanced stats
/years/{year}/passing_advanced.htm
/years/{year}/rushing_advanced.htm
# Draft
/years/{year}/draft.htm
# Team pages
/teams/{abbr}/{year}.htm
# Play-by-play (game)
/boxscores/{game_id}.htm
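Since these patterns are stable, it helps to centralize URL construction in a few small builders rather than interpolating strings at every call site. A minimal sketch based on the patterns above:

```python
BASE = "https://www.pro-football-reference.com"

def season_url(year: int, stat_type: str = "passing") -> str:
    """Leaderboard page for one season and stat type."""
    return f"{BASE}/years/{year}/{stat_type}.htm"

def gamelog_url(player_id: str, year: int) -> str:
    """Weekly game log for one player-season."""
    return f"{BASE}/players/{player_id[0].upper()}/{player_id}/gamelog/{year}/"

def team_url(abbr: str, year: int) -> str:
    """Team season page (PFR uses lowercase team abbreviations)."""
    return f"{BASE}/teams/{abbr.lower()}/{year}.htm"

print(season_url(2024))             # .../years/2024/passing.htm
print(gamelog_url("MahoPa00", 2024))
print(team_url("KAN", 2024))        # .../teams/kan/2024.htm
```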
Scraping Player Career Stats
Let's pull Patrick Mahomes' career passing stats:
import pandas as pd
from io import StringIO
import time
import random
def scrape_career_passing(player_url: str) -> pd.DataFrame | None:
"""Scrape career passing stats from a player page."""
soup = get_soup(player_url)
table = get_table(soup, "passing")
if table is None:
print(f"No passing table found at {player_url}")
return None
df = pd.read_html(StringIO(str(table)))[0]
# Drop multi-header rows that PFR inserts every N rows
df = df[df["Year"].notna()]
df = df[~df["Year"].astype(str).str.contains("Year|Career|yrs", na=False)]
# Convert numeric columns
numeric_cols = ["G", "GS", "Cmp", "Att", "Yds", "TD", "Int", "Rate", "Sk", "AV"]
for col in numeric_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
df = df.reset_index(drop=True)
return df
mahomes_url = "https://www.pro-football-reference.com/players/M/MahoPa00.htm"
career_stats = scrape_career_passing(mahomes_url)
if career_stats is not None:
print(career_stats[["Year", "Tm", "G", "Cmp", "Att", "Yds", "TD", "Int", "Rate"]].to_string())
The pd.read_html() call does the heavy lifting once you have the table element. The main cleanup tasks are removing the repeated header rows that PFR inserts every few rows and stripping out the career totals row at the bottom.
Scraping Game Logs for a Season
Game logs give you week-by-week performance — much more useful for modeling than career averages. They live at a URL like /players/M/MahoPa00/gamelog/2024/.
def scrape_game_log(player_id: str, year: int) -> pd.DataFrame | None:
"""Scrape weekly game log for a player in a given season."""
first_letter = player_id[0].upper()
url = f"https://www.pro-football-reference.com/players/{first_letter}/{player_id}/gamelog/{year}/"
soup = get_soup(url)
table = get_table(soup, "stats")
if table is None:
print(f"No stats table found for {player_id} {year}")
return None
df = pd.read_html(StringIO(str(table)))[0]
# Flatten multi-level columns if present (PFR uses grouped headers)
if isinstance(df.columns, pd.MultiIndex):
df.columns = ["_".join(filter(None, col)).strip() for col in df.columns]
# Remove non-game rows (bye weeks, repeated headers, career totals)
if "Week" in df.columns:
df = df[pd.to_numeric(df["Week"], errors="coerce").notna()]
elif "Rk" in df.columns:
df = df[pd.to_numeric(df["Rk"], errors="coerce").notna()]
df = df.reset_index(drop=True)
return df
game_log = scrape_game_log("MahoPa00", 2024)
if game_log is not None:
print(f"Games: {len(game_log)}")
print(game_log.head(3).to_string())
The multi-level column handling is necessary because PFR uses grouped headers (Passing / Rushing / etc.) which pandas interprets as a MultiIndex. The flattening logic joins the level names with underscores.
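To see what that flattening does in isolation, here is the same logic applied to a toy frame with grouped headers (the column names are illustrative, not PFR's exact ones):

```python
import pandas as pd

# A toy frame mimicking PFR's grouped headers (Passing / Rushing)
df = pd.DataFrame(
    [[300, 2, 25, 1]],
    columns=pd.MultiIndex.from_tuples([
        ("Passing", "Yds"), ("Passing", "TD"),
        ("Rushing", "Yds"), ("Rushing", "TD"),
    ]),
)
# Same flattening as scrape_game_log(): join levels with "_",
# dropping empty level names via filter(None, ...)
df.columns = ["_".join(filter(None, col)).strip() for col in df.columns]
print(list(df.columns))  # ['Passing_Yds', 'Passing_TD', 'Rushing_Yds', 'Rushing_TD']
```

The `filter(None, col)` step matters because ungrouped columns (like Date or Opp) often have an empty string as their top level, and you want `Date`, not `_Date`.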
Season-Level Leaderboards
For building datasets across all players in a season, the season stats pages are the entry point:
def scrape_season_stats(year: int, stat_type: str = "passing") -> pd.DataFrame | None:
"""
Scrape full season stats for all players.
stat_type: 'passing', 'rushing', 'receiving', 'defense', 'kicking'
"""
url = f"https://www.pro-football-reference.com/years/{year}/{stat_type}.htm"
soup = get_soup(url)
table = get_table(soup, stat_type)
if table is None:
# Try alternate table IDs
for alt_id in [f"{stat_type}_stats", "stats"]:
table = get_table(soup, alt_id)
if table:
break
if table is None:
return None
df = pd.read_html(StringIO(str(table)))[0]
# Remove repeated header rows
if "Rk" in df.columns:
df = df[pd.to_numeric(df["Rk"], errors="coerce").notna()]
# Drop rank column
df = df.drop(columns=["Rk"], errors="ignore")
df = df.reset_index(drop=True)
return df
# Get all passers for the 2024 season
passers_2024 = scrape_season_stats(2024, "passing")
if passers_2024 is not None:
print(f"Total passers: {len(passers_2024)}")
top5 = passers_2024.nlargest(5, "Yds") if "Yds" in passers_2024.columns else passers_2024.head(5)
print(top5[["Player", "Tm", "G", "Att", "Yds", "TD", "Int"]].to_string())
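When you scrape several seasons this way, you usually want one long DataFrame rather than a dict of per-year frames. A small helper (names are mine, not PFR's) that tags each frame with its season before concatenating:

```python
import pandas as pd

def combine_seasons(frames: dict[int, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-season leaderboards into one long frame with a Year column."""
    parts = []
    for year, df in frames.items():
        part = df.copy()
        part["Year"] = year
        parts.append(part)
    return pd.concat(parts, ignore_index=True)
```

Usage would be something like `combine_seasons({y: scrape_season_stats(y, "passing") for y in range(2020, 2025)})`, filtering out any years that returned None.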
Extracting Advanced Metrics
The advanced stats pages contain metrics not available on standard leaderboards:
def scrape_advanced_passing(year: int) -> pd.DataFrame | None:
"""Scrape advanced passing stats for a season."""
url = f"https://www.pro-football-reference.com/years/{year}/passing_advanced.htm"
soup = get_soup(url)
# Try multiple table IDs — PFR uses different IDs for different sub-tables
for table_id in ["advanced_air_yards", "advanced_accuracy", "advanced_pressure"]:
table = get_table(soup, table_id)
if table:
df = pd.read_html(StringIO(str(table)))[0]
# Flatten multi-level headers
if isinstance(df.columns, pd.MultiIndex):
df.columns = ["_".join(filter(None, col)).strip() for col in df.columns]
            # Drop repeated header rows: keep rows whose first cell is numeric
            first_col = df.columns[0]
            df = df[pd.to_numeric(df[first_col], errors="coerce").notna()]
return df
return None
def scrape_advanced_rushing(year: int) -> pd.DataFrame | None:
"""Scrape advanced rushing stats."""
url = f"https://www.pro-football-reference.com/years/{year}/rushing_advanced.htm"
soup = get_soup(url)
table = get_table(soup, "advanced_rushing")
if table is None:
return None
df = pd.read_html(StringIO(str(table)))[0]
if isinstance(df.columns, pd.MultiIndex):
df.columns = ["_".join(filter(None, col)).strip() for col in df.columns]
return df
advanced = scrape_advanced_passing(2024)
if advanced is not None:
print(f"Advanced passing columns: {list(advanced.columns)[:10]}")
# Available metrics typically include:
# IAY/PA (intended air yards per attempt)
# CAY/Cmp (completed air yards per completion)
# Drop% (drop rate)
# BadTh% (bad throw rate)
# OnTgt% (on-target throw rate)
# Prss% (under pressure rate)
# Scrm% (scramble rate)
Draft Data
Historical draft data is valuable for prospect analysis and for building career trajectory models:
def scrape_draft(year: int) -> pd.DataFrame | None:
"""Scrape NFL Draft picks for a given year."""
url = f"https://www.pro-football-reference.com/years/{year}/draft.htm"
soup = get_soup(url)
table = get_table(soup, "drafts")
if table is None:
return None
df = pd.read_html(StringIO(str(table)))[0]
# Flatten multi-level headers
if isinstance(df.columns, pd.MultiIndex):
df.columns = ["_".join(filter(None, col)).strip() for col in df.columns]
# Remove header-repeat rows
df = df[df.iloc[:, 0].apply(lambda x: str(x).isdigit())]
df = df.reset_index(drop=True)
return df
draft_2024 = scrape_draft(2024)
if draft_2024 is not None:
print(f"Draft picks: {len(draft_2024)}")
print(draft_2024.head(5).to_string())
Anti-Bot Handling and Rate Limits
Sports Reference publicly documents a rate limit of 20 requests per minute per IP (see their bot-traffic FAQ). Exceed that and you'll get a 429 or a temporary ban. They also check user agents — a default python-requests/2.x header will get you blocked fast.
The key rules:
1. Use a realistic browser user agent (rotated across a few options)
2. Add a Referer header pointing to the site itself
3. Wait 3–5 seconds between requests (20 req/min = 3 seconds minimum)
4. Back off exponentially on 429 responses
USER_AGENTS = [
("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"),
("Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) "
"Version/17.4 Safari/605.1.15"),
("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
"Gecko/20100101 Firefox/125.0"),
]
def get_soup_polite(url: str, min_delay: float = 3.0,
max_delay: float = 6.0) -> BeautifulSoup:
"""Fetch a PFR page with realistic headers and rate limiting."""
SESSION.headers.update({
"User-Agent": random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.pro-football-reference.com/",
})
time.sleep(random.uniform(min_delay, max_delay))
for attempt in range(4):
try:
resp = SESSION.get(url, timeout=15)
if resp.status_code == 429:
wait = 2 ** attempt * 15 + random.uniform(0, 10)
print(f"Rate limited. Waiting {wait:.0f}s...")
time.sleep(wait)
continue
resp.raise_for_status()
return BeautifulSoup(resp.text, "lxml")
except requests.exceptions.Timeout:
wait = 2 ** attempt * 5
print(f"Timeout on attempt {attempt+1}. Waiting {wait}s...")
time.sleep(wait)
    raise RuntimeError(f"Failed after 4 attempts: {url}")
Proxy Rotation for Bulk Scraping
For pulling data across hundreds of players or multiple seasons, a single IP will hit the rate limit even with delays. Sports Reference's detection is tuned to catch datacenter IPs and volume-based patterns from single sources.
ThorData residential proxies work well here — residential IPs look like real users rather than datacenter traffic. Integrate them into the requests session:
PROXY_USER = "your_thordata_username"
PROXY_PASS = "your_thordata_password"
def get_proxy_session(session_id: int | None = None) -> requests.Session:
    """Create a requests session routed through a residential proxy."""
    sid = session_id or random.randint(1000, 9999)
proxy_url = f"http://{PROXY_USER}-session-{sid}:{PROXY_PASS}@rotating.thordata.net:10000"
s = requests.Session()
s.proxies = {"http": proxy_url, "https": proxy_url}
s.headers.update({
"User-Agent": random.choice(USER_AGENTS),
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.pro-football-reference.com/",
})
return s
def scrape_with_proxy(url: str, session_id: int | None = None) -> BeautifulSoup:
"""Fetch a PFR page via residential proxy."""
s = get_proxy_session(session_id)
time.sleep(random.uniform(3, 6))
resp = s.get(url, timeout=20)
resp.raise_for_status()
return BeautifulSoup(resp.text, "lxml")
Rotate the session ID every few requests to get different exit IPs while maintaining session continuity within a small burst.
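That rotation schedule can be made explicit with a tiny planner. This is a sketch with names of my own choosing: it just pairs each URL with a sticky session ID that advances every few requests, which you would then pass to get_proxy_session().

```python
def assign_sessions(urls: list[str], rotate_every: int = 5,
                    base_sid: int = 1000) -> list[tuple[int, str]]:
    """Pair each URL with a session ID that changes every `rotate_every`
    requests, so a short burst shares one exit IP before rotating."""
    return [(base_sid + i // rotate_every, url) for i, url in enumerate(urls)]

plan = assign_sessions(["/years/2024/passing.htm", "/years/2024/rushing.htm",
                        "/years/2024/receiving.htm", "/years/2024/defense.htm"],
                       rotate_every=2)
for sid, url in plan:
    print(sid, url)  # first two share sid 1000, next two share sid 1001
```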
Storing Results in SQLite
For any serious data collection, persist to SQLite as you go:
import sqlite3
import json
def init_pfr_db(db_path: str = "pfr_data.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS players (
player_id TEXT PRIMARY KEY,
name TEXT,
position TEXT,
college TEXT,
draft_year INTEGER,
draft_round INTEGER,
draft_pick INTEGER,
team TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS career_stats (
player_id TEXT,
year INTEGER,
team TEXT,
games INTEGER,
games_started INTEGER,
stat_type TEXT,
stats_json TEXT,
PRIMARY KEY (player_id, year, team, stat_type),
FOREIGN KEY (player_id) REFERENCES players(player_id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS game_logs (
player_id TEXT,
year INTEGER,
week INTEGER,
opponent TEXT,
home_away TEXT,
result TEXT,
stats_json TEXT,
PRIMARY KEY (player_id, year, week),
FOREIGN KEY (player_id) REFERENCES players(player_id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS season_leaders (
year INTEGER,
stat_type TEXT,
player_name TEXT,
team TEXT,
stats_json TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (year, stat_type, player_name)
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_career_player ON career_stats(player_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_gamelog_player ON game_logs(player_id, year)")
conn.commit()
return conn
def _safe_int(val) -> int:
    """Coerce a stat cell to int, treating NaN/blank/missing as 0."""
    num = pd.to_numeric(val, errors="coerce")
    return 0 if pd.isna(num) else int(num)
def store_career_stats(conn: sqlite3.Connection, player_id: str,
                       df: pd.DataFrame, stat_type: str):
    """Store career stats DataFrame in SQLite."""
    for _, row in df.iterrows():
        year_val = row.get("Year")
        try:
            year = int(str(year_val).replace("*", "").replace("+", ""))
        except (ValueError, TypeError):
            continue
        conn.execute(
            "INSERT OR REPLACE INTO career_stats VALUES (?,?,?,?,?,?,?)",
            (
                player_id,
                year,
                row.get("Tm", ""),
                _safe_int(row.get("G")),   # NaN-safe: int(nan) would raise
                _safe_int(row.get("GS")),
                stat_type,
                row.to_json(),
            )
        )
    conn.commit()
def store_season_leaders(conn: sqlite3.Connection, year: int,
stat_type: str, df: pd.DataFrame):
"""Store season leaderboard in SQLite."""
for _, row in df.iterrows():
conn.execute(
"INSERT OR REPLACE INTO season_leaders VALUES (?,?,?,?,?,CURRENT_TIMESTAMP)",
(year, stat_type, row.get("Player", ""), row.get("Tm", ""), row.to_json())
)
conn.commit()
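Because each row is stored as a stats_json blob, reading the data back out is just a matter of deserializing. A minimal loader for the season_leaders table defined above (the function name is mine):

```python
import json
import sqlite3
import pandas as pd

def load_season(conn: sqlite3.Connection, year: int, stat_type: str) -> pd.DataFrame:
    """Rebuild a leaderboard DataFrame from the stored stats_json blobs."""
    rows = conn.execute(
        "SELECT stats_json FROM season_leaders WHERE year = ? AND stat_type = ?",
        (year, stat_type),
    ).fetchall()
    # Each blob is a column->value mapping produced by row.to_json()
    return pd.DataFrame([json.loads(r[0]) for r in rows])
```

This round-trip keeps the schema flexible: new columns PFR adds in future seasons land in the JSON automatically, without a migration.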
Building a Multi-Season Dataset
Here's a complete pipeline for building a multi-season passing dataset:
def build_passing_dataset(years: list, db_path: str = "pfr_passing.db"):
"""Build a complete passing stats dataset for specified years."""
conn = init_pfr_db(db_path)
for year in years:
print(f"\nScraping {year} season...")
df = scrape_season_stats(year, "passing")
if df is not None:
store_season_leaders(conn, year, "passing", df)
print(f" {len(df)} passers stored")
# Advanced stats
adv = scrape_advanced_passing(year)
if adv is not None:
store_season_leaders(conn, year, "passing_advanced", adv)
print(f" {len(adv)} advanced passing records stored")
# Polite delay between seasons
time.sleep(random.uniform(5, 10))
conn.close()
print(f"\nDataset complete: {db_path}")
# Build a 5-year dataset
build_passing_dataset(list(range(2020, 2025)))
Appending New Data Without Duplication
For regular updates, append new data rather than re-scraping everything:
def update_season_data(conn: sqlite3.Connection, year: int, stat_type: str) -> int:
"""Update season data if not already present. Returns rows added."""
existing = conn.execute(
"SELECT COUNT(*) FROM season_leaders WHERE year = ? AND stat_type = ?",
(year, stat_type)
).fetchone()[0]
if existing > 0:
print(f" {year} {stat_type}: already have {existing} rows, skipping")
return 0
df = scrape_season_stats(year, stat_type)
if df is None:
return 0
store_season_leaders(conn, year, stat_type, df)
print(f" {year} {stat_type}: added {len(df)} rows")
return len(df)
Key Takeaways
The main things to remember when scraping PFR:
- Most tables are in HTML comments — always use the get_table() helper with comment fallback
- Respect the 20 req/min rate limit — use 3–6 second randomized delays between requests
- Use a realistic browser user agent — rotate across Chrome, Firefox, and Safari strings
- Add a Referer header — https://www.pro-football-reference.com/
- For multi-player or multi-season jobs, rotate proxies — residential IPs via ThorData
- Handle multi-level column headers — flatten with "_".join(col) for MultiIndex
- Player IDs follow the LastFirst00 pattern — construct programmatically from roster lists
PFR's data quality is high and the URL structure is consistent, which makes it reasonably pleasant to work with once you know the comment trick. The combination of career stats, game logs, advanced metrics, and draft data makes it one of the best free sports databases available.
Team Season Pages
Team pages give you roster, schedule, and game-by-game results for a specific season:
def scrape_team_season(team_abbr, year):
url = f'https://www.pro-football-reference.com/teams/{team_abbr.lower()}/{year}.htm'
soup = get_soup_polite(url)
# Passing stats table
passing_table = get_table(soup, 'passing')
# Rushing stats table
rushing_table = get_table(soup, 'rushing')
# Schedule and results
schedule_table = get_table(soup, 'games')
results = {}
for name, table in [('passing', passing_table), ('rushing', rushing_table),
('schedule', schedule_table)]:
if table:
df = pd.read_html(StringIO(str(table)))[0]
results[name] = df
return results
# Get the 2024 Kansas City Chiefs season
chiefs_2024 = scrape_team_season('kan', 2024)
if 'schedule' in chiefs_2024:
print(chiefs_2024['schedule'].head())
Play-by-Play Box Scores
Individual game box scores contain quarter-by-quarter scoring and full play-by-play tables for each game:
def scrape_game_boxscore(game_id):
url = f'https://www.pro-football-reference.com/boxscores/{game_id}.htm'
soup = get_soup_polite(url)
# Team stats table
team_stats = get_table(soup, 'team_stats')
# Play-by-play
pbp = get_table(soup, 'pbp')
result = {}
if team_stats:
result['team_stats'] = pd.read_html(StringIO(str(team_stats)))[0]
if pbp:
result['pbp'] = pd.read_html(StringIO(str(pbp)))[0]
return result
Game IDs follow the format YYYYMMDD0abb, where abb is PFR's lowercase abbreviation for the home team. For example: 202402110sfo for Super Bowl LVIII (played February 11, 2024, with the 49ers as the designated home team).
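That convention can be captured in a one-line builder. The function name is mine, and the format is an assumption based on observed URLs, so verify the constructed URL resolves before scraping in bulk:

```python
def build_game_id(date_iso: str, home_abbr: str) -> str:
    """Construct a boxscore ID from an ISO date and home-team abbreviation.

    Assumes the YYYYMMDD0abb convention; note that PFR's abbreviations
    (e.g. 'kan', 'sfo', 'nwe') differ from common broadcast abbreviations.
    """
    return date_iso.replace("-", "") + "0" + home_abbr.lower()

print(build_game_id("2024-02-11", "sfo"))  # 202402110sfo
```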
Approximate Value (AV) — PFR's Catch-All Metric
AV appears in career stat tables and is PFR's attempt to create a single number that captures a player's value in a season across all positions. It is imperfect but useful for cross-position and cross-era comparisons:
def extract_av_history(player_url):
soup = get_soup_polite(player_url)
table = get_table(soup, 'passing') # or rushing, receiving, etc.
if table is None:
return None
df = pd.read_html(StringIO(str(table)))[0]
df = df[df['Year'].notna()]
df = df[~df['Year'].astype(str).str.contains('Year|Career', na=False)]
if 'AV' in df.columns:
df['AV'] = pd.to_numeric(df['AV'], errors='coerce')
return df[['Year', 'Tm', 'AV']].dropna()
return None
av = extract_av_history(mahomes_url)
if av is not None:
total_av = av['AV'].sum()
peak_av = av['AV'].max()
print(f'Career AV: {total_av}, Peak Season: {peak_av}')
Handling Pagination in Season Pages
The main season leaderboard pages (passing.htm, rushing.htm, etc.) load all players on a single page — no pagination required. For older seasons the tables are naturally much smaller, so before treating a scrape as complete, sanity-check the row count against what you'd expect for that era rather than assuming a short table means a parsing failure (or vice versa).
Summary
Pro Football Reference is the most complete free NFL data source available:
- The comment trick is essential: use find_commented_table() for every table lookup
- Rate limit is 20 req/min — maintain 3–6 second delays between requests
- Multi-level headers appear in many tables — flatten with "_".join(col) on MultiIndex
- Proxy rotation is needed for bulk multi-player or multi-season jobs
- Player IDs follow the LastFirst00 convention and are consistent across pages
- AV is useful for cross-position comparisons; EPA and DVOA are available elsewhere
Available data: career stats, game logs, play-by-play, draft records, team seasons, advanced metrics (air yards, pressure rate, on-target%), and historical data back to 1920. For fantasy or modeling purposes, game logs and advanced passing metrics are the most actionable tables.