How to Scrape FBref for Football Stats with Python (2026)
Football analytics has undergone a quiet revolution over the past decade. What was once the domain of clubs with multi-million pound data science budgets is now accessible to independent analysts, fantasy football players, model builders, journalists, and fans who want to understand the game at a deeper level. And at the center of this democratization is FBref.
FBref is the most comprehensive free source of football statistics on the internet. It covers every major league and competition worldwide, with data going back decades for top competitions. The underlying stats come from Opta (Stats Perform), with older seasons sourced from StatsBomb — the same providers used by Premier League clubs and broadcasters — which means you're not looking at approximations or second-tier tracking data. You're looking at the same raw numbers that professional analysts use.
The scope of data available is remarkable. League tables with goal difference and expected goal difference. Player-level stats for goals, assists, expected goals, progressive passes, and carries. Match-level shot maps with individual expected goal values for every attempt. Passing networks showing who passed to whom and how often. Goalkeeper performance metrics including post-shot xG. Defensive stats like tackles, blocks, and interceptions by pitch zone — plus pressure data for StatsBomb-era seasons. Age demographics for squad planning. Contract and wage data for transfer analysis.
None of it has a public API.
If you want FBref data programmatically — for a model, a dashboard, a research project, or simply to avoid copying numbers by hand — you need to scrape it. This guide shows you exactly how.
We'll cover the full pipeline: setup and dependencies, scraping each major data type with working code, handling FBref's anti-bot protections (which are real and require genuine care), proxy rotation for larger operations, managing the multi-level column headers that catch many first-time scrapers, dealing with tables hidden in HTML comments, output schemas, and error handling patterns.
The code here is tested against FBref's current structure. The site does change its table IDs and URL patterns occasionally, and I'll explain how to diagnose and fix those changes when they happen rather than giving you brittle selectors that break without warning.
Setup
pip install requests beautifulsoup4 pandas lxml
# For browser-based scraping (needed for some pages)
pip install playwright
playwright install chromium
FBref renders its core stats tables in plain HTML. You don't need browser automation for most data — requests plus BeautifulSoup plus pandas handles it. The browser is only needed when Cloudflare serves a JS challenge, which happens more frequently on datacenter IPs than residential ones.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import logging
from typing import Optional
from urllib.parse import urljoin
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)
# FBref base URL
FBREF_BASE = "https://fbref.com"
# Realistic browser headers — critical for avoiding immediate blocks
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Cache-Control": "max-age=0",
}
# FBref explicitly asks for 3+ second delays. Respect this.
MIN_DELAY = 3.0
MAX_DELAY = 6.0
FBref URL Structure
Understanding FBref's URL patterns is essential for building reliable scrapers. The structure is:
Competition stats: /en/comps/{league_id}/{Season}/{Season}-{Competition}-Stats
Squad stats: /en/squads/{squad_id}/{Season}/stats/{Squad-Name}-Stats
Squad specific stat: /en/squads/{squad_id}/{Season}/shooting/{Squad-Name}-Stats
Player scouting report: /en/players/{player_id}/scouting/365_m1/{Player-Name}-Scouting-Report
Match report: /en/matches/{match_id}/{Home-Away-Date-Match-Report}
Key competition IDs:
- Premier League: 9
- La Liga: 12
- Serie A: 11
- Bundesliga: 20
- Ligue 1: 13
- Champions League: 8
- Europa League: 19
def fbref_league_url(league_id: int, season: str = "2025-2026") -> str:
"""Build a FBref competition stats URL."""
league_names = {9: "Premier-League", 12: "La-Liga", 11: "Serie-A",
20: "Bundesliga", 13: "Ligue-1", 8: "Champions-League",
19: "Europa-League"}
name = league_names.get(league_id, f"League-{league_id}")
return f"{FBREF_BASE}/en/comps/{league_id}/{season}/{season}-{name}-Stats"
def fbref_squad_url(squad_id: str, squad_name: str, season: str = "2025-2026", stat_type: str = "stats") -> str:
"""Build a FBref squad stats URL."""
return f"{FBREF_BASE}/en/squads/{squad_id}/{season}/{stat_type}/{squad_name}-Stats"
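It's worth sanity-checking what the builders expand to before pointing a scraper at them. This standalone check mirrors the f-string in fbref_league_url, using the Premier League ID from the table above:

```python
FBREF_BASE = "https://fbref.com"

# Mirrors fbref_league_url(9, "2024-2025") from above
league_id, season, name = 9, "2024-2025", "Premier-League"
url = f"{FBREF_BASE}/en/comps/{league_id}/{season}/{season}-{name}-Stats"
print(url)
# https://fbref.com/en/comps/9/2024-2025/2024-2025-Premier-League-Stats
```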
Core Request Function
FBref's rate limiting is the primary anti-scraping measure. Exceed it and you'll get a 429, followed by a temporary IP ban if you keep trying.
def polite_get(
url: str,
session: requests.Session,
min_delay: float = MIN_DELAY,
max_delay: float = MAX_DELAY,
max_retries: int = 3,
) -> requests.Response | None:
"""
Fetch a URL with FBref-appropriate rate limiting and retry logic.
Returns None on unrecoverable errors, raises on programming errors.
"""
# Enforce delay before every request
time.sleep(random.uniform(min_delay, max_delay))
for attempt in range(max_retries):
try:
resp = session.get(url, headers=HEADERS, timeout=30)
if resp.status_code == 200:
# Verify it's not a disguised block
if any(marker in resp.text.lower() for marker in ("captcha", "verify you are human", "just a moment")):
logger.warning(f"CAPTCHA/block page received on {url}")
return None
return resp
elif resp.status_code == 429:
try:
retry_after = int(resp.headers.get("Retry-After", 120))
except ValueError:
retry_after = 120  # Retry-After can also be an HTTP-date; fall back to a flat wait
logger.warning(f"Rate limited on {url}. Waiting {retry_after}s (attempt {attempt+1})")
time.sleep(retry_after + random.uniform(5, 15))
continue
elif resp.status_code == 403:
logger.warning(f"Forbidden on {url} — IP likely blocked")
return None
elif resp.status_code == 404:
logger.info(f"Not found: {url}")
return None
else:
logger.error(f"HTTP {resp.status_code} on {url}")
if attempt < max_retries - 1:
time.sleep(random.uniform(10, 30))
continue
except requests.Timeout:
logger.warning(f"Timeout on {url} (attempt {attempt+1})")
if attempt < max_retries - 1:
time.sleep(random.uniform(5, 15))
except requests.RequestException as e:
logger.error(f"Request error on {url}: {e}")
return None
logger.error(f"Gave up on {url} after {max_retries} attempts")
return None
def create_session(proxy_url: str | None = None) -> requests.Session:
session = requests.Session()
session.headers.update(HEADERS)
if proxy_url:
session.proxies = {"http": proxy_url, "https": proxy_url}
return session
Scraping League Tables
The league standings table is the simplest FBref data to extract:
def scrape_league_table(league_id: int, season: str = "2025-2026") -> pd.DataFrame | None:
"""
Scrape the overall league standings table.
Returns DataFrame with columns:
Rk, Squad, MP, W, D, L, GF, GA, GD, Pts, Pts/MP, xG, xGA, xGD, xGD/90
"""
session = create_session()
url = fbref_league_url(league_id, season)
logger.info(f"Fetching league table: {url}")
resp = polite_get(url, session)
if not resp:
return None
soup = BeautifulSoup(resp.text, "lxml")
# Try the specific table ID first, fall back to class-based search
table = (
soup.find("table", id=lambda x: x and "overall" in str(x).lower())
or soup.find("table", class_="stats_table")
)
# FBref sometimes hides tables inside HTML comments (for lazy loading)
if not table:
from bs4 import Comment
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
if "stats_table" in comment or "overall" in comment.lower():
comment_soup = BeautifulSoup(comment, "lxml")
table = comment_soup.find("table")
if table:
break
if not table:
logger.error(f"No standings table found at {url}")
return None
try:
from io import StringIO  # pandas 2.1+ deprecates passing literal HTML strings to read_html
df = pd.read_html(StringIO(str(table)))[0]
# Handle MultiIndex columns (FBref uses grouped headers)
if isinstance(df.columns, pd.MultiIndex):
df.columns = [" ".join(str(c) for c in col if "Unnamed" not in str(c)).strip()
for col in df.columns]
# Remove separator rows (FBref inserts blank rows every 5-10 teams)
df = df.dropna(subset=["Squad"])
df = df[df["Squad"] != "Squad"] # Remove header repeat rows
# Convert numeric columns
numeric_cols = ["MP", "W", "D", "L", "GF", "GA", "GD", "Pts", "xG", "xGA"]
for col in numeric_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
logger.info(f"Scraped {len(df)} teams")
return df
except Exception as e:
logger.error(f"Error parsing standings table: {e}")
return None
# Example output schema:
# {
# "Rk": 1, "Squad": "Liverpool", "MP": 28, "W": 21, "D": 5, "L": 2,
# "GF": 68, "GA": 28, "GD": 40, "Pts": 68, "xG": 61.2, "xGA": 24.8,
# "xGD": 36.4, "xGD/90": 1.30
# }
Player Shooting Stats with xG
Expected Goals (xG) is the most important advanced metric in modern football analytics. FBref provides it at the player level across all major competitions:
def scrape_player_shooting(squad_id: str, squad_name: str, season: str = "2025-2026") -> pd.DataFrame | None:
"""
Scrape per-player shooting and expected goals stats.
Returns DataFrame with columns:
Player, Nation, Pos, Age, MP, Starts, Min, Gls, Sh, SoT, SoT%, Sh/90,
SoT/90, G/Sh, G/SoT, Dist, FK, PK, PKatt, xG, npxG, npxG/Sh, G-xG, np:G-xG
"""
session = create_session()
url = fbref_squad_url(squad_id, squad_name, season, stat_type="shooting")
logger.info(f"Fetching shooting stats: {url}")
resp = polite_get(url, session)
if not resp:
return None
soup = BeautifulSoup(resp.text, "lxml")
# Find shooting table (may be in HTML comments)
table = soup.find("table", id=lambda x: x and "shooting" in str(x).lower())
if not table:
from bs4 import Comment
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
if "shooting" in comment.lower():
comment_soup = BeautifulSoup(comment, "lxml")
table = comment_soup.find("table", id=lambda x: x and "shooting" in str(x).lower())
if table:
break
if not table:
logger.error(f"No shooting table found for {squad_name}")
return None
try:
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]
# FBref uses two-level column headers for shooting stats
if isinstance(df.columns, pd.MultiIndex):
df.columns = [
col[-1] if col[-1] and "Unnamed" not in str(col[-1]) else col[0]
for col in df.columns
]
# Remove total/separator rows
df = df[df["Player"] != "Player"]
df = df.dropna(subset=["Player"])
df = df[~df["Player"].str.contains("Squad Total|Opponent Total", na=False)]
# Ensure key columns are numeric
numeric_cols = ["Gls", "Sh", "SoT", "xG", "npxG", "G-xG"]
for col in numeric_cols:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
# Add squad context
df["Squad"] = squad_name
df["Season"] = season
return df
except Exception as e:
logger.error(f"Error parsing shooting table: {e}")
return None
# Example row from output:
# {
# "Player": "Mohamed Salah", "Nation": "eg EGY", "Pos": "FW",
# "Age": "32-181", "MP": 28, "Gls": 22, "Sh": 87, "SoT": 47,
# "xG": 19.4, "npxG": 18.1, "G-xG": 2.6, "Squad": "Liverpool", "Season": "2025-2026"
# }
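Once the frame is clean, G-xG gives the quickest finishing-quality read — positive means a player is scoring above the quality of his chances. A tiny illustration with invented numbers:

```python
import pandas as pd

# Synthetic rows, not real FBref output
df = pd.DataFrame([
    {"Player": "Forward A", "Gls": 12, "xG": 8.3},
    {"Player": "Forward B", "Gls": 7, "xG": 9.1},
])
df["G-xG"] = (df["Gls"] - df["xG"]).round(1)
ranked = df.sort_values("G-xG", ascending=False)["Player"].tolist()
print(ranked)  # ['Forward A', 'Forward B']
```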
Match-Level Shot Data
The most granular data FBref offers is individual shot records from match reports. Each record includes the expected goal value for that specific shot:
def scrape_match_shots(match_url: str) -> pd.DataFrame | None:
"""
Scrape all shots from a specific match with per-shot xG values.
Returns DataFrame with columns:
minute, player, squad, xG, outcome, distance, body_part, notes
"""
session = create_session()
resp = polite_get(match_url, session)
if not resp:
return None
soup = BeautifulSoup(resp.text, "lxml")
# Shot tables are usually in comments on match pages
all_tables = []
def find_shot_tables(source_soup: BeautifulSoup):
for table in source_soup.find_all("table", id=lambda x: x and "shots" in str(x).lower()):
all_tables.append(table)
find_shot_tables(soup)
if not all_tables:
from bs4 import Comment
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
if "shots" in comment.lower():
comment_soup = BeautifulSoup(comment, "lxml")
find_shot_tables(comment_soup)
if not all_tables:
logger.warning(f"No shot tables found at {match_url}")
return None
shots_data = []
for table in all_tables:
try:
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]
if isinstance(df.columns, pd.MultiIndex):
df.columns = [col[-1] if "Unnamed" not in str(col[-1]) else col[0]
for col in df.columns]
df = df.dropna(how="all")
# Drop spacer/header-repeat rows where every cell holds the same value
df = df[df.apply(lambda row: not all(str(v) == str(row.iloc[0]) for v in row), axis=1)]
# Extract shot details from table structure
for _, row in df.iterrows():
shot = {}
for col in df.columns:
col_lower = str(col).lower()
if "min" in col_lower or "minute" in col_lower:
shot["minute"] = str(row[col]).strip()
elif "player" in col_lower:
shot["player"] = str(row[col]).strip()
elif "squad" in col_lower or "team" in col_lower:
shot["squad"] = str(row[col]).strip()
elif col_lower == "xg":
shot["xg"] = pd.to_numeric(row[col], errors="coerce")
elif "outcome" in col_lower or "result" in col_lower:
shot["outcome"] = str(row[col]).strip()
elif "dist" in col_lower:
shot["distance_m"] = pd.to_numeric(row[col], errors="coerce")
elif "body" in col_lower:
shot["body_part"] = str(row[col]).strip()
elif "note" in col_lower:
shot["notes"] = str(row[col]).strip()
if shot.get("player") and shot.get("player") != "nan":
shots_data.append(shot)
except Exception as e:
logger.warning(f"Error parsing shot table: {e}")
if not shots_data:
return None
df = pd.DataFrame(shots_data)
df["match_url"] = match_url
return df
# Example output schema:
# {
# "minute": "23", "player": "Bruno Fernandes", "squad": "Manchester Utd",
# "xg": 0.34, "outcome": "Goal", "distance_m": 18.0,
# "body_part": "Right Foot", "notes": "", "match_url": "https://fbref.com/..."
# }
Passing Stats and Progressive Passes
Passing data reveals how teams build attacks and which players are responsible for advancing the ball:
def scrape_passing_stats(squad_id: str, squad_name: str, season: str = "2025-2026") -> pd.DataFrame | None:
"""
Scrape passing stats per player.
Key columns: Cmp, Att, Cmp%, TotDist, PrgDist, Ast, xA, KP, 1/3, PPA, CrsPA, PrgP
"""
session = create_session()
url = fbref_squad_url(squad_id, squad_name, season, stat_type="passing")
resp = polite_get(url, session)
if not resp:
return None
soup = BeautifulSoup(resp.text, "lxml")
table = soup.find("table", id=lambda x: x and "passing" in str(x).lower())
if not table:
from bs4 import Comment
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
if "passing" in comment.lower():
cs = BeautifulSoup(comment, "lxml")
table = cs.find("table", id=lambda x: x and "passing" in str(x).lower())
if table:
break
if not table:
return None
try:
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]
# Flatten multi-level headers
if isinstance(df.columns, pd.MultiIndex):
# FBref groups passing columns under "Short", "Medium", "Long" headers
# Flatten to include group prefix for disambiguation
new_cols = []
for col in df.columns:
top, bottom = col[0], col[-1]
if "Unnamed" in str(top):
new_cols.append(str(bottom))
else:
if str(top) != str(bottom):
new_cols.append(f"{top}_{bottom}")
else:
new_cols.append(str(bottom))
df.columns = new_cols
df = df[df["Player"] != "Player"].dropna(subset=["Player"])
df = df[~df["Player"].str.contains("Squad Total|Opponent Total", na=False)]
df["Squad"] = squad_name
df["Season"] = season
return df
except Exception as e:
logger.error(f"Error parsing passing table: {e}")
return None
Defensive Stats: Pressures and Tackles
Defensive analytics is where FBref really differentiates itself from traditional stats sources:
def scrape_defensive_stats(squad_id: str, squad_name: str, season: str = "2025-2026") -> pd.DataFrame | None:
"""
Scrape defensive actions per player.
Key columns: Tkl, TklW, Def 3rd, Mid 3rd, Att 3rd (tackles by zone),
Blocks, Sh, Pass, Int, Tkl+Int, Clr, Err.
Note: pressure columns (Press, Succ, %) only exist for StatsBomb-era
seasons; Opta-sourced seasons track challenges instead.
"""
session = create_session()
url = fbref_squad_url(squad_id, squad_name, season, stat_type="defense")
resp = polite_get(url, session)
if not resp:
return None
soup = BeautifulSoup(resp.text, "lxml")
table = soup.find("table", id=lambda x: x and "defense" in str(x).lower())
if not table:
from bs4 import Comment
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
if "defense" in comment.lower():
cs = BeautifulSoup(comment, "lxml")
table = cs.find("table")
if table:
break
if not table:
return None
try:
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]
if isinstance(df.columns, pd.MultiIndex):
# Defense tables have headers like "Tackles", "Challenges", "Blocks"
# Prefix subcolumns with their group
new_cols = []
prev_group = ""
for col in df.columns:
top = str(col[0])
bottom = str(col[-1])
if "Unnamed" in top:
new_cols.append(bottom)
elif top != bottom and len(col) > 1:
# Shorten common group names
group_short = {
"Tackles": "Tkl", "Challenges": "Chl",
"Blocks": "Blk", "Pressures": "Prs"
}.get(top, top[:4])
new_cols.append(f"{group_short}_{bottom}")
else:
new_cols.append(bottom)
df.columns = new_cols
df = df[df["Player"] != "Player"].dropna(subset=["Player"])
df["Squad"] = squad_name
df["Season"] = season
return df
except Exception as e:
logger.error(f"Error parsing defense table: {e}")
return None
Goalkeeper Performance with Post-Shot xG
Post-shot xG (PSxG) is FBref's measure of goalkeeper performance — the difference between expected goals based on shot quality and actual goals allowed:
def scrape_keeper_stats(squad_id: str, squad_name: str, season: str = "2025-2026") -> pd.DataFrame | None:
"""
Scrape goalkeeper performance stats including PSxG.
Key columns: GA90, SoTA, Saves, Save%, W, D, L, CS, CS%
PSxG, PSxG/SoT, PSxG+/-, /90, #OPA, #OPA/90, AvgDist (advanced)
"""
session = create_session()
# Advanced keeper stats are on a separate page
url = fbref_squad_url(squad_id, squad_name, season, stat_type="keepersadv")
resp = polite_get(url, session)
if not resp:
# Fall back to basic keeper stats
url = fbref_squad_url(squad_id, squad_name, season, stat_type="keepers")
resp = polite_get(url, session)
if not resp:
return None
soup = BeautifulSoup(resp.text, "lxml")
table = soup.find("table", id=lambda x: x and ("keeper" in str(x).lower() or "gk" in str(x).lower()))
if not table:
from bs4 import Comment
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
if "keeper" in comment.lower():
cs = BeautifulSoup(comment, "lxml")
table = cs.find("table")
if table:
break
if not table:
return None
try:
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]
if isinstance(df.columns, pd.MultiIndex):
df.columns = [col[-1] if "Unnamed" not in str(col[-1]) else col[0] for col in df.columns]
df = df[df["Player"] != "Player"].dropna(subset=["Player"])
df["Squad"] = squad_name
df["Season"] = season
return df
except Exception as e:
logger.error(f"Error parsing keeper table: {e}")
return None
# Example output schema for PSxG analysis:
# {
# "Player": "Alisson", "Nation": "br BRA", "Age": "32-115",
# "GA": 18, "PSxG": 22.3, "PSxG+/-": 4.3, "/90": 0.15,
# "Saves": 78, "Save%": 81.2,
# "Squad": "Liverpool", "Season": "2025-2026"
# }
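The sign convention matters: PSxG+/- is PSxG minus goals allowed, so a positive number means the keeper conceded fewer goals than an average keeper would facing the same shots. A quick check with invented numbers:

```python
# Invented illustrative numbers — not any real keeper's stats
psxg, goals_allowed, minutes = 28.1, 25, 2700

psxg_plus_minus = round(psxg - goals_allowed, 2)  # goals prevented vs. an average keeper
per_90 = round(psxg_plus_minus / (minutes / 90), 2)
print(psxg_plus_minus, per_90)  # 3.1 0.1
```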
Multi-Season Data Pipeline
For model building, you need data across multiple seasons. Here's a complete pipeline:
import json
from pathlib import Path
from dataclasses import dataclass, asdict
@dataclass
class SeasonScrapeJob:
squad_id: str
squad_name: str
seasons: list[str]
stat_types: list[str]
def run_multi_season_pipeline(
jobs: list[SeasonScrapeJob],
output_dir: str = "fbref_data",
proxy_url: str | None = None,
) -> dict[str, int]:
"""
Run a multi-squad, multi-season, multi-stat-type scrape.
Returns dict of {filename: row_count}.
"""
Path(output_dir).mkdir(exist_ok=True)
results = {}
STAT_SCRAPERS = {
"shooting": scrape_player_shooting,
"passing": scrape_passing_stats,
"defense": scrape_defensive_stats,
"keepers": scrape_keeper_stats,
}
total_jobs = sum(len(j.seasons) * len(j.stat_types) for j in jobs)
completed = 0
for job in jobs:
for season in job.seasons:
for stat_type in job.stat_types:
completed += 1
logger.info(f"[{completed}/{total_jobs}] {job.squad_name} {season} {stat_type}")
scraper_fn = STAT_SCRAPERS.get(stat_type)
if not scraper_fn:
logger.warning(f"Unknown stat type: {stat_type}")
continue
df = scraper_fn(job.squad_id, job.squad_name, season)
if df is not None and not df.empty:
filename = f"{output_dir}/{job.squad_name.lower().replace(' ', '_')}_{season}_{stat_type}.csv"
df.to_csv(filename, index=False, encoding="utf-8")
results[filename] = len(df)
logger.info(f"Saved {len(df)} rows to {filename}")
else:
logger.warning(f"No data for {job.squad_name} {season} {stat_type}")
# Save metadata
metadata = {
"scraped_at": pd.Timestamp.now().isoformat(),
"total_files": len(results),
"total_rows": sum(results.values()),
"files": results,
}
with open(f"{output_dir}/metadata.json", "w") as f:
json.dump(metadata, f, indent=2)
return results
# Example usage — scrape Big Six Premier League clubs, 3 seasons
top6_jobs = [
SeasonScrapeJob("18bb7c10", "Arsenal", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
SeasonScrapeJob("b8fd03ef", "Manchester-City", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
SeasonScrapeJob("822bd0ba", "Liverpool", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
SeasonScrapeJob("cff3d9bb", "Chelsea", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
SeasonScrapeJob("361ca564", "Tottenham", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
SeasonScrapeJob("19538871", "Manchester-Utd", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
]
# results = run_multi_season_pipeline(top6_jobs, proxy_url="http://user:[email protected]:7777")
Anti-Bot Measures and How to Handle Them
1. Rate Limiting (Primary Threat)
FBref's primary defense is strict rate limiting. The site explicitly states in its terms that automated tools should wait at least 3 seconds between requests. Violate this and you'll get 429 responses followed by temporary IP bans.
class FBrefRateLimiter:
"""Respects FBref's rate limiting with adaptive backoff."""
def __init__(self):
self._last_request = 0.0
self._consecutive_429s = 0
self._base_delay = 3.0
def wait(self):
"""Wait the appropriate time before the next request."""
elapsed = time.time() - self._last_request
# Increase base delay after repeated rate limiting
delay = self._base_delay * (1.5 ** self._consecutive_429s)
delay += random.uniform(0, delay * 0.3) # Add jitter
if elapsed < delay:
time.sleep(delay - elapsed)
self._last_request = time.time()
def on_429(self, retry_after: int = 120):
self._consecutive_429s += 1
logger.warning(f"Rate limited ({self._consecutive_429s} consecutive). Backing off {retry_after}s")
time.sleep(retry_after + random.uniform(10, 30))
def on_success(self):
self._consecutive_429s = max(0, self._consecutive_429s - 1)
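The 1.5× multiplier means the delay floor grows quickly under sustained 429s — worth tabulating so you know what the scraper will actually do (jitter omitted):

```python
base_delay = 3.0

# Effective delay floor after n consecutive 429s, per FBrefRateLimiter.wait()
schedule = [base_delay * (1.5 ** n) for n in range(5)]
print([round(d, 2) for d in schedule])  # [3.0, 4.5, 6.75, 10.12, 15.19]
```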
2. Cloudflare Challenges
FBref uses Cloudflare. Datacenter IPs (AWS, GCP, Digital Ocean, most VPNs) trigger Cloudflare challenges much more aggressively than residential IPs. If you're getting Cloudflare blocks, the most effective fix is using residential proxies.
ThorData provides residential proxies with session stickiness — you can keep the same IP across a sequence of related requests (useful for scraping a squad's full stat pages without triggering session-based behavioral analysis).
def create_thordata_session(
username: str,
password: str,
country: str = "GB",  # any major-market country works; pick one and stay consistent
session_id: str | None = None,
) -> requests.Session:
"""
Create a requests.Session routed through ThorData residential proxy.
Use session_id for sticky sessions (same IP across multiple requests).
Use None for rotating sessions (different IP per request).
"""
session = requests.Session()
session.headers.update(HEADERS)
if session_id:
# Sticky: same IP for this session
proxy_auth = f"{username}-country-{country}-session-{session_id}:{password}"
else:
# Rotating: fresh IP per request
proxy_auth = f"{username}-country-{country}:{password}"
proxy_url = f"http://{proxy_auth}@gate.thordata.com:7777"
session.proxies = {"http": proxy_url, "https": proxy_url}
return session
# For scraping a squad's pages: use sticky session so all requests for one squad
# come from the same IP (more natural browsing pattern)
squad_session = create_thordata_session(
"username", "password",
country="GB",
session_id="arsenal-2025"
)
3. Multi-Level Headers (Most Common Parsing Error)
FBref tables use grouped column headers that pandas reads as MultiIndex. If you try to access columns by name without flattening, you'll get KeyErrors or weird column names.
def flatten_multiindex_columns(df: pd.DataFrame) -> pd.DataFrame:
"""
Flatten FBref's MultiIndex columns into usable single-level column names.
Handles FBref's specific convention for unnamed top-level groups.
"""
if not isinstance(df.columns, pd.MultiIndex):
return df
new_columns = []
for col in df.columns:
parts = [str(c) for c in col if "Unnamed" not in str(c) and str(c) != "nan"]
if len(parts) == 0:
new_columns.append("unknown")
elif len(parts) == 1:
new_columns.append(parts[0])
else:
# For grouped columns, prefix with group name if it adds context
# e.g., ("Expected", "xG") -> "Expe_xG"
# But ("Player",) -> "Player"
top = parts[0]
bottom = parts[-1]
if top == bottom:
new_columns.append(top)
else:
new_columns.append(f"{top[:4]}_{bottom}" if len(top) > 6 else f"{top}_{bottom}")
df.columns = new_columns
return df
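A standalone illustration of the convention, using synthetic column tuples that mimic FBref's grouped headers:

```python
import pandas as pd

# Synthetic MultiIndex mimicking FBref's grouped headers
cols = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Player"),
    ("Expected", "xG"),
    ("Expected", "npxG"),
])
df = pd.DataFrame([["Salah", 0.52, 0.41]], columns=cols)

# Same rule as flatten_multiindex_columns: drop "Unnamed" parts,
# prefix grouped columns with a shortened group name
flat = []
for col in df.columns:
    parts = [str(c) for c in col if "Unnamed" not in str(c)]
    flat.append(parts[-1] if len(parts) < 2 else f"{parts[0][:4]}_{parts[-1]}")
df.columns = flat
print(df.columns.tolist())
# ['Player', 'Expe_xG', 'Expe_npxG']
```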
4. Tables in HTML Comments
Some FBref tables are wrapped in HTML comments for deferred loading. BeautifulSoup.find("table") won't find them. You have to parse the comments.
from bs4 import Comment
def find_table_in_comments(soup: BeautifulSoup, table_id_pattern: str) -> Optional[BeautifulSoup]:
"""Search HTML comment blocks for a table matching the pattern."""
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
if table_id_pattern.lower() in comment.lower():
comment_soup = BeautifulSoup(comment, "lxml")
table = comment_soup.find("table", id=lambda x: x and table_id_pattern.lower() in str(x).lower())
if table:
return table
return None
def get_table_robust(soup: BeautifulSoup, id_pattern: str) -> Optional[BeautifulSoup]:
"""Find a FBref table either in main HTML or in comments."""
# Try main HTML first
table = soup.find("table", id=lambda x: x and id_pattern.lower() in str(x).lower())
if table:
return table
# Fall back to comments
return find_table_in_comments(soup, id_pattern)
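To see why the fallback is necessary, here's a minimal synthetic page with the table wrapped in a comment, the way FBref serves some of its tables:

```python
from bs4 import BeautifulSoup, Comment

# Synthetic HTML, not a real FBref page
html = """
<div id="all_stats_shooting">
<!--
<table id="stats_shooting"><tr><th>Player</th></tr><tr><td>Salah</td></tr></table>
-->
</div>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.find("table"))  # None — the table only exists inside the comment

# Re-parse the comment text to recover it
for c in soup.find_all(string=lambda t: isinstance(t, Comment)):
    inner = BeautifulSoup(c, "lxml")
    table = inner.find("table", id="stats_shooting")
    if table is not None:
        print(table.find("td").get_text())  # Salah
```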
Real-World Use Cases
1. xG League Table Generator
Build a table showing expected performance vs. actual results to identify over- and underperforming teams:
def build_xg_analysis(league_id: int, season: str = "2025-2026") -> pd.DataFrame | None:
"""Identify teams outperforming or underperforming their xG metrics."""
df = scrape_league_table(league_id, season)
if df is None:
return None
for col in ["Pts", "xG", "xGA", "GF", "GA", "MP"]:
if col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
# Crude linear heuristic, not a simulated expected-points model:
# scale xGD by 0.9 and add a ~1.3 pts/match baseline
df["xPts_approx"] = (df["xG"] - df["xGA"]) * 0.9 + df["MP"] * 1.3
df["PtsDiff"] = df["Pts"] - df["xPts_approx"]
df["GoalDiff_vs_xG"] = df["GF"] - df["xG"]
df["ConcedeDiff_vs_xGA"] = df["xGA"] - df["GA"]
return df.sort_values("Pts", ascending=False)[
["Squad", "Pts", "xG", "xGA", "GF", "GA", "GoalDiff_vs_xG", "ConcedeDiff_vs_xGA", "PtsDiff"]
]
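Because xPts_approx is pure arithmetic, you can sanity-check the calculation offline with a synthetic row (all numbers invented):

```python
import pandas as pd

# All numbers invented for the check
df = pd.DataFrame([{"Squad": "Exampletown", "MP": 28, "Pts": 60,
                    "xG": 55.0, "xGA": 30.0}])
df["xPts_approx"] = (df["xG"] - df["xGA"]) * 0.9 + df["MP"] * 1.3
df["PtsDiff"] = df["Pts"] - df["xPts_approx"]
print(round(df["xPts_approx"].iloc[0], 1), round(df["PtsDiff"].iloc[0], 1))  # 58.9 1.1
```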
2. Player Recruitment Profiler
Find players matching a specific profile across multiple leagues:
def find_progressive_passers(
squads: list[tuple[str, str]], # [(squad_id, squad_name), ...]
season: str = "2025-2026",
min_minutes: int = 900,
min_prog_passes_per90: float = 5.0,
) -> pd.DataFrame:
"""Find midfielders making lots of progressive passes per 90 minutes."""
all_data = []
for squad_id, squad_name in squads:
df = scrape_passing_stats(squad_id, squad_name, season)
if df is None:
continue
# Look for progressive passes column (naming varies)
prog_col = next((c for c in df.columns if "PrgP" in c or "Prog" in c), None)
min_col = next((c for c in df.columns if c in ["Min", "90s"]), None)
if prog_col and min_col:
df[min_col] = pd.to_numeric(df[min_col], errors="coerce")
df[prog_col] = pd.to_numeric(df[prog_col], errors="coerce")
# "90s" is already minutes / 90; only divide when we found raw minutes
nineties = df[min_col] if min_col == "90s" else df[min_col] / 90
df["PrgP_per90"] = df[prog_col] / nineties
filtered = df[
(nineties * 90 >= min_minutes) &
(df["PrgP_per90"] >= min_prog_passes_per90)
]
all_data.append(filtered)
if not all_data:
return pd.DataFrame()
combined = pd.concat(all_data, ignore_index=True)
return combined.sort_values("PrgP_per90", ascending=False)
3. Match xG Timeline Builder
Visualize how a match evolved using shot-by-shot xG:
def build_xg_timeline(match_url: str) -> dict | None:
"""Build a cumulative xG timeline for a match."""
shots_df = scrape_match_shots(match_url)
if shots_df is None or shots_df.empty:
return None
shots_df["minute_int"] = pd.to_numeric(
shots_df["minute"].str.extract(r"(\d+)")[0], errors="coerce"
)
if "xg" not in shots_df.columns:
shots_df["xg"] = 0.0
shots_df["xg"] = pd.to_numeric(shots_df["xg"], errors="coerce").fillna(0)
shots_df = shots_df.sort_values("minute_int")
teams = shots_df["squad"].dropna().unique().tolist()[:2] if "squad" in shots_df.columns else ["Home", "Away"]
timeline = {}
for team in teams:
team_shots = shots_df[shots_df.get("squad", pd.Series()) == team].copy() if "squad" in shots_df.columns else shots_df
team_shots = team_shots.sort_values("minute_int")
team_shots["cumulative_xg"] = team_shots["xg"].cumsum()
timeline[team] = [
{"minute": int(row["minute_int"]), "xg": float(row["xg"]),
"cumulative_xg": float(row["cumulative_xg"]),
"player": row.get("player", ""), "outcome": row.get("outcome", "")}
for _, row in team_shots.iterrows()
]
return {
"match_url": match_url,
"teams": teams,
"timeline": timeline,
"final_xg": {team: round(sum(s["xg"] for s in shots), 2) for team, shots in timeline.items()},
}
# Example output:
# {
# "match_url": "https://fbref.com/en/matches/...",
# "teams": ["Arsenal", "Manchester City"],
# "final_xg": {"Arsenal": 1.84, "Manchester City": 2.31},
# "timeline": {
# "Arsenal": [
# {"minute": 12, "xg": 0.08, "cumulative_xg": 0.08, "player": "Saka", "outcome": "Missed"},
# {"minute": 34, "xg": 0.42, "cumulative_xg": 0.50, "player": "Havertz", "outcome": "Goal"},
# ...
# ]
# }
# }
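Stoppage-time minutes arrive as strings like "45+2"; the regex extract above keeps only the base minute, which is what you want for ordering the timeline. A standalone check:

```python
import pandas as pd

minutes = pd.Series(["23", "45+2", "90+5"])
base = pd.to_numeric(minutes.str.extract(r"(\d+)")[0], errors="coerce")
print(base.tolist())  # [23, 45, 90]
```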
4. Squad Age Profile for Transfer Planning
def scrape_squad_age_profile(squad_id: str, squad_name: str, season: str = "2025-2026") -> dict | None:
"""Analyze squad age distribution for transfer window planning."""
df = scrape_player_shooting(squad_id, squad_name, season) # Has age data
if df is None:
return None
if "Age" not in df.columns:
return None
# Parse FBref age format "28-142" (years-days)
df["age_years"] = pd.to_numeric(
df["Age"].str.split("-").str[0], errors="coerce"
)
if "Min" not in df.columns:
df["Min"] = 0.0
df["Min"] = pd.to_numeric(df["Min"], errors="coerce").fillna(0)
# Weighted by minutes played
total_mins = df["Min"].sum()
if total_mins > 0:
df["weight"] = df["Min"] / total_mins
weighted_age = (df["age_years"] * df["weight"]).sum()
else:
weighted_age = df["age_years"].mean()
age_bands = {
"U23 (development)": len(df[df["age_years"] < 23]),
"Peak (23-29)": len(df[(df["age_years"] >= 23) & (df["age_years"] <= 29)]),
"Experienced (30+)": len(df[df["age_years"] >= 30]),
}
return {
"squad": squad_name,
"season": season,
"mean_age": round(df["age_years"].mean(), 1),
"weighted_age_by_minutes": round(weighted_age, 1),
"oldest_player": df.loc[df["age_years"].idxmax(), "Player"] if not df.empty else None,
"youngest_player": df.loc[df["age_years"].idxmin(), "Player"] if not df.empty else None,
"age_bands": age_bands,
"players": len(df),
}
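FBref ages come as "years-days" strings, so the split-on-hyphen step above is worth checking in isolation:

```python
import pandas as pd

ages = pd.Series(["28-142", "19-003", "33-210"])
years = pd.to_numeric(ages.str.split("-").str[0], errors="coerce")
print(years.tolist())  # [28, 19, 33]
```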
5. Top Goalscorer Tracker Across Leagues
MAJOR_LEAGUE_SQUADS = {
# Add real squad IDs from FBref URLs
9: [("18bb7c10", "Arsenal"), ("822bd0ba", "Liverpool"), ("b8fd03ef", "Manchester-City")],
12: [], # La Liga squads
}
def track_golden_boot_race(league_ids: list[int], season: str = "2025-2026") -> pd.DataFrame:
"""Track top scorers across multiple leagues."""
all_scorers = []
for league_id in league_ids:
squads = MAJOR_LEAGUE_SQUADS.get(league_id, [])
for squad_id, squad_name in squads:
df = scrape_player_shooting(squad_id, squad_name, season)
if df is None:
continue
if "Gls" in df.columns and "Player" in df.columns:
df["Gls"] = pd.to_numeric(df["Gls"], errors="coerce")
df["league_id"] = league_id
keep = [c for c in ["Player", "Squad", "Pos", "Age", "Gls", "xG", "league_id"] if c in df.columns]
all_scorers.append(df[keep])
if not all_scorers:
return pd.DataFrame()
combined = pd.concat(all_scorers, ignore_index=True)
combined = combined.dropna(subset=["Gls"])
# Filter for outfield players only
if "Pos" in combined.columns:
combined = combined[~combined["Pos"].str.contains("GK", na=False)]
return combined.sort_values("Gls", ascending=False).head(20)
Complete Output Schemas
For building consistent data pipelines, here are the canonical output schemas for each data type:
League Table Row
{
"Rk": 1,
"Squad": "Liverpool",
"MP": 29, "W": 22, "D": 5, "L": 2,
"GF": 71, "GA": 29, "GD": 42, "Pts": 71,
"xG": 62.4, "xGA": 25.1, "xGD": 37.3, "xGD/90": 1.29
}
Player Shooting Row
{
"Player": "Mohamed Salah", "Nation": "eg EGY", "Pos": "FW",
"Age": "32-245", "MP": 29, "Starts": 29, "Min": 2493,
"Gls": 23, "Sh": 91, "SoT": 49, "SoT%": 53.8,
"Sh/90": 3.28, "SoT/90": 1.77, "G/Sh": 0.25, "Dist": 14.2,
"xG": 20.1, "npxG": 18.8, "npxG/Sh": 0.21, "G-xG": 2.9,
"Squad": "Liverpool", "Season": "2025-2026"
}
Shot Record
{
"minute": "34+2", "player": "Havertz", "squad": "Arsenal",
"xg": 0.38, "outcome": "Goal", "distance_m": 12.0,
"body_part": "Right Foot", "notes": "",
"match_url": "https://fbref.com/en/matches/..."
}
Goalkeeper Advanced Row
{
"Player": "Alisson", "Nation": "br BRA",
"GA": 19, "PSxG": 23.8, "PSxG/SoT": 0.29, "PSxG+/-": 4.8, "/90": 0.17,
"Stp%": 10.3, "AvgDist": 15.8,
"Squad": "Liverpool", "Season": "2025-2026"
}
FBref is one of the most valuable publicly accessible sports databases in existence. Keep your request rate conservative — 3 seconds minimum between requests — use residential proxies via ThorData if you're hitting Cloudflare blocks, always check for tables in HTML comments when find("table") returns nothing, and flatten those multi-level headers before doing anything with the data.
With those guardrails in place, you'll have access to the same professional-grade football statistics used by Premier League analysts, sports journalists, and the growing community of football data scientists building the next generation of analysis tools.