
How to Scrape FBref for Football Stats with Python (2026)


Football analytics has undergone a quiet revolution over the past decade. What was once the domain of clubs with multi-million pound data science budgets is now accessible to independent analysts, fantasy football players, model builders, journalists, and fans who want to understand the game at a deeper level. And at the center of this democratization is FBref.

FBref is the most comprehensive free source of football statistics on the internet. It covers every major league and competition worldwide, with data going back decades for top competitions. The stats come from Opta (Stats Perform), with StatsBomb supplying many earlier seasons — the same providers used by Premier League clubs and broadcasters — which means you're not looking at approximations or second-tier tracking data. You're looking at the same raw numbers that professional analysts use.

The scope of data available is remarkable. League tables with goal difference and expected goal difference. Player-level stats for goals, assists, expected goals, progressive passes, carries, and pressures. Match-level shot maps with individual expected goal values for every attempt. Passing networks showing who passed to whom and how often. Goalkeeper performance metrics including post-shot xG. Defensive stats like pressure success rates and tackle locations. Age demographics for squad planning. Contract and wage data for transfer analysis.

None of it has a public API.

If you want FBref data programmatically — for a model, a dashboard, a research project, or simply to avoid copying numbers by hand — you need to scrape it. This guide shows you exactly how.

We'll cover the full pipeline: setup and dependencies, scraping each major data type with working code, handling FBref's anti-bot protections (which are real and require genuine care), proxy rotation for larger operations, managing the multi-level column headers that catch many first-time scrapers, dealing with tables hidden in HTML comments, output schemas, and error handling patterns.

The code here is tested against FBref's current structure. The site does change its table IDs and URL patterns occasionally, and I'll explain how to diagnose and fix those changes when they happen rather than giving you brittle selectors that break without warning.


Setup

pip install requests beautifulsoup4 pandas lxml httpx tenacity
# For browser-based scraping (needed for some pages)
pip install playwright
playwright install chromium

FBref renders its core stats tables in plain HTML. You don't need browser automation for most data — requests plus BeautifulSoup plus pandas handles it. The browser is only needed when Cloudflare serves a JS challenge, which happens more frequently on datacenter IPs than residential ones.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import logging
from typing import Optional
from urllib.parse import urljoin

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

# FBref base URL
FBREF_BASE = "https://fbref.com"

# Realistic browser headers — critical for avoiding immediate blocks
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Cache-Control": "max-age=0",
}

# FBref explicitly asks for 3+ second delays. Respect this.
MIN_DELAY = 3.0
MAX_DELAY = 6.0

FBref URL Structure

Understanding FBref's URL patterns is essential for building reliable scrapers. The structure is:

Competition stats:       /en/comps/{league_id}/{season}/{season}-{Competition}-Stats
Squad stats:             /en/squads/{squad_id}/{season}/stats/{Squad-Name}-Stats
Squad specific stat:     /en/squads/{squad_id}/{season}/shooting/{Squad-Name}-Stats
Player scouting report:  /en/players/{player_id}/scouting/365_m1/{Player-Name}-Scouting-Report
Match report:            /en/matches/{match_id}/{Home-Away-Date-Match-Report}

Key competition IDs:
- Premier League: 9
- La Liga: 12
- Serie A: 11
- Bundesliga: 20
- Ligue 1: 13
- Champions League: 8
- Europa League: 19

def fbref_league_url(league_id: int, season: str = "2025-2026") -> str:
    """Build a FBref competition stats URL."""
    league_names = {9: "Premier-League", 12: "La-Liga", 11: "Serie-A", 
                    20: "Bundesliga", 13: "Ligue-1", 8: "Champions-League"}
    name = league_names.get(league_id, f"League-{league_id}")
    return f"{FBREF_BASE}/en/comps/{league_id}/{season}/{season}-{name}-Stats"

def fbref_squad_url(squad_id: str, squad_name: str, season: str = "2025-2026", stat_type: str = "stats") -> str:
    """Build a FBref squad stats URL."""
    return f"{FBREF_BASE}/en/squads/{squad_id}/{season}/{stat_type}/{squad_name}-Stats"

Core Request Function

FBref's rate limiting is the primary anti-scraping measure. Exceed it and you'll get a 429, followed by a temporary IP ban if you keep trying.

def polite_get(
    url: str,
    session: requests.Session,
    min_delay: float = MIN_DELAY,
    max_delay: float = MAX_DELAY,
    max_retries: int = 3,
) -> requests.Response | None:
    """
    Fetch a URL with FBref-appropriate rate limiting and retry logic.
    Returns None on unrecoverable errors, raises on programming errors.
    """
    # Enforce delay before every request
    time.sleep(random.uniform(min_delay, max_delay))

    for attempt in range(max_retries):
        try:
            resp = session.get(url, headers=HEADERS, timeout=30)

            if resp.status_code == 200:
                # Verify it's not a disguised block
                if "captcha" in resp.text.lower() or "verify you are human" in resp.text.lower():
                    logger.warning(f"CAPTCHA/block page received on {url}")
                    return None
                return resp

            elif resp.status_code == 429:
                # Retry-After may be an HTTP-date rather than seconds; fall back safely
                retry_after_raw = resp.headers.get("Retry-After", "120")
                retry_after = int(retry_after_raw) if retry_after_raw.isdigit() else 120
                logger.warning(f"Rate limited on {url}. Waiting {retry_after}s (attempt {attempt+1})")
                time.sleep(retry_after + random.uniform(5, 15))
                continue

            elif resp.status_code == 403:
                logger.warning(f"Forbidden on {url} — IP likely blocked")
                return None

            elif resp.status_code == 404:
                logger.info(f"Not found: {url}")
                return None

            else:
                logger.error(f"HTTP {resp.status_code} on {url}")
                if attempt < max_retries - 1:
                    time.sleep(random.uniform(10, 30))
                continue

        except requests.Timeout:
            logger.warning(f"Timeout on {url} (attempt {attempt+1})")
            if attempt < max_retries - 1:
                time.sleep(random.uniform(5, 15))

        except requests.RequestException as e:
            logger.error(f"Request error on {url}: {e}")
            return None

    logger.error(f"Gave up on {url} after {max_retries} attempts")
    return None
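Before running this against the live site, the retry shape is easy to exercise offline by stubbing the session with unittest.mock (no network needed; the URL below is just a placeholder):

```python
import time
from unittest.mock import MagicMock

# Fake responses: first a 429, then a success
limited = MagicMock(status_code=429, headers={"Retry-After": "1"})
ok = MagicMock(status_code=200, text="<html>stats tables</html>")

session = MagicMock()
session.get.side_effect = [limited, ok]

# Same retry shape as polite_get, with the sleeps zeroed out for the test
resp = None
for attempt in range(3):
    r = session.get("https://fbref.com/en/comps/9/Premier-League-Stats")
    if r.status_code == 200:
        resp = r
        break
    if r.status_code == 429:
        time.sleep(0)  # real code waits Retry-After seconds plus jitter
        continue

print(resp.status_code, session.get.call_count)  # 200 2
```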

def create_session(proxy_url: str | None = None) -> requests.Session:
    session = requests.Session()
    session.headers.update(HEADERS)
    if proxy_url:
        session.proxies = {"http": proxy_url, "https": proxy_url}
    return session

Scraping League Tables

The league standings table is the simplest FBref data to extract:

def scrape_league_table(league_id: int, season: str = "2025-2026") -> pd.DataFrame | None:
    """
    Scrape the overall league standings table.

    Returns DataFrame with columns:
    Rk, Squad, MP, W, D, L, GF, GA, GD, Pts, Pts/MP, xG, xGA, xGD, xGD/90
    """
    session = create_session()
    url = fbref_league_url(league_id, season)
    logger.info(f"Fetching league table: {url}")

    resp = polite_get(url, session)
    if not resp:
        return None

    soup = BeautifulSoup(resp.text, "lxml")

    # Try the specific table ID first, fall back to class-based search
    table = (
        soup.find("table", id=lambda x: x and "overall" in str(x).lower())
        or soup.find("table", class_="stats_table")
    )

    # FBref sometimes hides tables inside HTML comments (for lazy loading)
    if not table:
        from bs4 import Comment
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            if "stats_table" in comment or "overall" in comment.lower():
                comment_soup = BeautifulSoup(comment, "lxml")
                table = comment_soup.find("table")
                if table:
                    break

    if not table:
        logger.error(f"No standings table found at {url}")
        return None

    try:
        df = pd.read_html(str(table))[0]

        # Handle MultiIndex columns (FBref uses grouped headers)
        if isinstance(df.columns, pd.MultiIndex):
            df.columns = [" ".join(str(c) for c in col if "Unnamed" not in str(c)).strip() 
                         for col in df.columns]

        # Remove separator rows (FBref inserts blank rows every 5-10 teams)
        df = df.dropna(subset=["Squad"])
        df = df[df["Squad"] != "Squad"]  # Remove header repeat rows

        # Convert numeric columns
        numeric_cols = ["MP", "W", "D", "L", "GF", "GA", "GD", "Pts", "xG", "xGA"]
        for col in numeric_cols:
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors="coerce")

        logger.info(f"Scraped {len(df)} teams")
        return df

    except Exception as e:
        logger.error(f"Error parsing standings table: {e}")
        return None

# Example output schema:
# {
#   "Rk": 1, "Squad": "Liverpool", "MP": 28, "W": 21, "D": 5, "L": 2,
#   "GF": 68, "GA": 28, "GD": 40, "Pts": 68, "xG": 61.2, "xGA": 24.8,
#   "xGD": 36.4, "xGD/90": 1.30
# }
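The separator-row cleanup in scrape_league_table is worth understanding in isolation; here is the same logic on a synthetic standings frame (made-up rows, not real data):

```python
import pandas as pd

# Synthetic standings with a repeated header row and a blank separator row,
# the two artifacts FBref tables commonly contain
raw = pd.DataFrame({
    "Squad": ["Liverpool", "Squad", "Arsenal", None],
    "Pts": ["68", "Pts", "61", None],
})

clean = raw.dropna(subset=["Squad"])
clean = clean[clean["Squad"] != "Squad"]  # drop header-repeat rows
clean["Pts"] = pd.to_numeric(clean["Pts"], errors="coerce")

print(clean["Squad"].tolist())  # ['Liverpool', 'Arsenal']
print(clean["Pts"].sum())       # 129
```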

Player Shooting Stats with xG

Expected Goals (xG) is the most important advanced metric in modern football analytics. FBref provides it at the player level across all major competitions:

def scrape_player_shooting(squad_id: str, squad_name: str, season: str = "2025-2026") -> pd.DataFrame | None:
    """
    Scrape per-player shooting and expected goals stats.

    Returns DataFrame with columns:
    Player, Nation, Pos, Age, MP, Starts, Min, Gls, Sh, SoT, SoT%, Sh/90,
    SoT/90, G/Sh, G/SoT, Dist, FK, PK, PKatt, xG, npxG, npxG/Sh, G-xG, np:G-xG
    """
    session = create_session()
    url = fbref_squad_url(squad_id, squad_name, season, stat_type="shooting")
    logger.info(f"Fetching shooting stats: {url}")

    resp = polite_get(url, session)
    if not resp:
        return None

    soup = BeautifulSoup(resp.text, "lxml")

    # Find shooting table (may be in HTML comments)
    table = soup.find("table", id=lambda x: x and "shooting" in str(x).lower())

    if not table:
        from bs4 import Comment
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            if "shooting" in comment.lower():
                comment_soup = BeautifulSoup(comment, "lxml")
                table = comment_soup.find("table", id=lambda x: x and "shooting" in str(x).lower())
                if table:
                    break

    if not table:
        logger.error(f"No shooting table found for {squad_name}")
        return None

    try:
        df = pd.read_html(str(table))[0]

        # FBref uses two-level column headers for shooting stats
        if isinstance(df.columns, pd.MultiIndex):
            df.columns = [
                col[-1] if col[-1] and "Unnamed" not in str(col[-1]) else col[0]
                for col in df.columns
            ]

        # Remove total/separator rows
        df = df[df["Player"] != "Player"]
        df = df.dropna(subset=["Player"])
        df = df[~df["Player"].str.contains("Squad Total|Opponent Total", na=False)]

        # Ensure key columns are numeric
        numeric_cols = ["Gls", "Sh", "SoT", "xG", "npxG", "G-xG"]
        for col in numeric_cols:
            if col in df.columns:
                df[col] = pd.to_numeric(df[col], errors="coerce")

        # Add squad context
        df["Squad"] = squad_name
        df["Season"] = season

        return df

    except Exception as e:
        logger.error(f"Error parsing shooting table: {e}")
        return None

# Example row from output:
# {
#   "Player": "Mohamed Salah", "Nation": "eg EGY", "Pos": "FW",
#   "Age": "32-181", "MP": 28, "Gls": 22, "Sh": 87, "SoT": 47,
#   "xG": 19.4, "npxG": 18.1, "G-xG": 2.6, "Squad": "Liverpool", "Season": "2025-2026"
# }
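The derived columns are simple arithmetic; using the hypothetical row above:

```python
# Numbers from the hypothetical Salah row above
gls, xg = 22, 19.4
npxg, shots = 18.1, 87

print(round(gls - xg, 1))      # 2.6 (G-xG: finishing above expectation)
print(round(npxg / shots, 3))  # 0.208 (npxG/Sh: average non-penalty chance quality)
```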

Match-Level Shot Data

The most granular data FBref offers is individual shot records from match reports. Each record includes the expected goal value for that specific shot:

def scrape_match_shots(match_url: str) -> pd.DataFrame | None:
    """
    Scrape all shots from a specific match with per-shot xG values.

    Returns DataFrame with columns:
    minute, player, squad, xG, outcome, distance, body_part, notes
    """
    session = create_session()
    resp = polite_get(match_url, session)
    if not resp:
        return None

    soup = BeautifulSoup(resp.text, "lxml")

    # Shot tables are usually in comments on match pages
    all_tables = []

    def find_shot_tables(source_soup: BeautifulSoup):
        for table in source_soup.find_all("table", id=lambda x: x and "shots" in str(x).lower()):
            all_tables.append(table)

    find_shot_tables(soup)

    if not all_tables:
        from bs4 import Comment
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            if "shots" in comment.lower():
                comment_soup = BeautifulSoup(comment, "lxml")
                find_shot_tables(comment_soup)

    if not all_tables:
        logger.warning(f"No shot tables found at {match_url}")
        return None

    shots_data = []
    for table in all_tables:
        try:
            df = pd.read_html(str(table))[0]

            if isinstance(df.columns, pd.MultiIndex):
                df.columns = [col[-1] if "Unnamed" not in str(col[-1]) else col[0] 
                              for col in df.columns]

            df = df.dropna(how="all")
            # Drop spanning/separator rows where every cell repeats the first cell's value
            df = df[df.apply(lambda row: not all(str(v) == str(row.iloc[0]) for v in row), axis=1)]

            # Extract shot details from table structure
            for _, row in df.iterrows():
                shot = {}
                for col in df.columns:
                    col_lower = str(col).lower()
                    if "min" in col_lower or "minute" in col_lower:
                        shot["minute"] = str(row[col]).strip()
                    elif "player" in col_lower:
                        shot["player"] = str(row[col]).strip()
                    elif "squad" in col_lower or "team" in col_lower:
                        shot["squad"] = str(row[col]).strip()
                    elif col_lower == "xg":
                        shot["xg"] = pd.to_numeric(row[col], errors="coerce")
                    elif "outcome" in col_lower or "result" in col_lower:
                        shot["outcome"] = str(row[col]).strip()
                    elif "dist" in col_lower:
                        shot["distance_m"] = pd.to_numeric(row[col], errors="coerce")
                    elif "body" in col_lower:
                        shot["body_part"] = str(row[col]).strip()
                    elif "note" in col_lower:
                        shot["notes"] = str(row[col]).strip()

                if shot.get("player") and shot.get("player") != "nan":
                    shots_data.append(shot)

        except Exception as e:
            logger.warning(f"Error parsing shot table: {e}")

    if not shots_data:
        return None

    df = pd.DataFrame(shots_data)
    df["match_url"] = match_url

    return df

# Example output schema:
# {
#   "minute": "23", "player": "Bruno Fernandes", "squad": "Manchester Utd",
#   "xg": 0.34, "outcome": "Goal", "distance_m": 18.0,
#   "body_part": "Right Foot", "notes": "", "match_url": "https://fbref.com/..."
# }
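Per-shot records roll up naturally into team-level xG totals; a sketch on made-up shot rows:

```python
import pandas as pd

# Hypothetical shots in the schema produced by scrape_match_shots
shots = pd.DataFrame([
    {"minute": "12", "squad": "Manchester Utd", "xg": 0.08, "outcome": "Saved"},
    {"minute": "23", "squad": "Manchester Utd", "xg": 0.34, "outcome": "Goal"},
    {"minute": "57", "squad": "Fulham", "xg": 0.12, "outcome": "Off Target"},
])

# Sum per-shot xG into a team total for the match
team_xg = shots.groupby("squad")["xg"].sum().round(2)
print(team_xg.to_dict())  # {'Fulham': 0.12, 'Manchester Utd': 0.42}
```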

Passing Stats and Progressive Passes

Passing data reveals how teams build attacks and which players are responsible for advancing the ball:

def scrape_passing_stats(squad_id: str, squad_name: str, season: str = "2025-2026") -> pd.DataFrame | None:
    """
    Scrape passing stats per player.

    Key columns: Cmp, Att, Cmp%, TotDist, PrgDist, Ast, xA, KP, 1/3, PPA, CrsPA, PrgP
    """
    session = create_session()
    url = fbref_squad_url(squad_id, squad_name, season, stat_type="passing")
    resp = polite_get(url, session)
    if not resp:
        return None

    soup = BeautifulSoup(resp.text, "lxml")
    table = soup.find("table", id=lambda x: x and "passing" in str(x).lower())

    if not table:
        from bs4 import Comment
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            if "passing" in comment.lower():
                cs = BeautifulSoup(comment, "lxml")
                table = cs.find("table", id=lambda x: x and "passing" in str(x).lower())
                if table:
                    break

    if not table:
        return None

    try:
        df = pd.read_html(str(table))[0]

        # Flatten multi-level headers
        if isinstance(df.columns, pd.MultiIndex):
            # FBref groups passing columns under "Short", "Medium", "Long" headers
            # Flatten to include group prefix for disambiguation
            new_cols = []
            for col in df.columns:
                top, bottom = col[0], col[-1]
                if "Unnamed" in str(top):
                    new_cols.append(str(bottom))
                else:
                    if str(top) != str(bottom):
                        new_cols.append(f"{top}_{bottom}")
                    else:
                        new_cols.append(str(bottom))
            df.columns = new_cols

        df = df[df["Player"] != "Player"].dropna(subset=["Player"])
        df = df[~df["Player"].str.contains("Squad Total|Opponent Total", na=False)]
        df["Squad"] = squad_name
        df["Season"] = season

        return df

    except Exception as e:
        logger.error(f"Error parsing passing table: {e}")
        return None

Defensive Stats: Pressures and Tackles

Defensive analytics is where FBref really differentiates itself from traditional stats sources:

def scrape_defensive_stats(squad_id: str, squad_name: str, season: str = "2025-2026") -> pd.DataFrame | None:
    """
    Scrape defensive actions per player.

    Key columns: Tkl, TklW, Def 3rd, Mid 3rd, Att 3rd (tackles by zone)
    Press, Succ, %, Def 3rd, Mid 3rd, Att 3rd (pressures by zone)
    Blocks, Sh, Pass, Int, Tkl+Int, Clr, Err
    """
    session = create_session()
    url = fbref_squad_url(squad_id, squad_name, season, stat_type="defense")
    resp = polite_get(url, session)
    if not resp:
        return None

    soup = BeautifulSoup(resp.text, "lxml")
    table = soup.find("table", id=lambda x: x and "defense" in str(x).lower())

    if not table:
        from bs4 import Comment
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            if "defense" in comment.lower():
                cs = BeautifulSoup(comment, "lxml")
                table = cs.find("table")
                if table:
                    break

    if not table:
        return None

    try:
        df = pd.read_html(str(table))[0]

        if isinstance(df.columns, pd.MultiIndex):
            # Defense tables have headers like "Tackles", "Challenges", "Blocks"
            # Prefix subcolumns with their group
            new_cols = []
            prev_group = ""
            for col in df.columns:
                top = str(col[0])
                bottom = str(col[-1])
                if "Unnamed" in top:
                    new_cols.append(bottom)
                elif top != bottom and len(col) > 1:
                    # Shorten common group names
                    group_short = {
                        "Tackles": "Tkl", "Challenges": "Chl",
                        "Blocks": "Blk", "Pressures": "Prs"
                    }.get(top, top[:4])
                    new_cols.append(f"{group_short}_{bottom}")
                else:
                    new_cols.append(bottom)
            df.columns = new_cols

        df = df[df["Player"] != "Player"].dropna(subset=["Player"])
        df["Squad"] = squad_name
        df["Season"] = season

        return df

    except Exception as e:
        logger.error(f"Error parsing defense table: {e}")
        return None

Goalkeeper Performance with Post-Shot xG

Post-shot xG (PSxG) measures shot quality after the shot is taken, accounting for placement. FBref's headline shot-stopping metric is PSxG+/-, the difference between the PSxG a keeper faced and the goals actually allowed; positive values mean the keeper stopped more than an average keeper would facing the same shots:

def scrape_keeper_stats(squad_id: str, squad_name: str, season: str = "2025-2026") -> pd.DataFrame | None:
    """
    Scrape goalkeeper performance stats including PSxG.

    Key columns: GA90, SoTA, Saves, Save%, W, D, L, CS, CS%
    PSxG, PSxG/SoT, PSxG+/-, /90, #OPA, #OPA/90, AvgDist (advanced)
    """
    session = create_session()

    # Advanced keeper stats are on a separate page
    url = fbref_squad_url(squad_id, squad_name, season, stat_type="keepersadv")
    resp = polite_get(url, session)
    if not resp:
        # Fall back to basic keeper stats
        url = fbref_squad_url(squad_id, squad_name, season, stat_type="keepers")
        resp = polite_get(url, session)
    if not resp:
        return None

    soup = BeautifulSoup(resp.text, "lxml")
    table = soup.find("table", id=lambda x: x and ("keeper" in str(x).lower() or "gk" in str(x).lower()))

    if not table:
        from bs4 import Comment
        for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
            if "keeper" in comment.lower():
                cs = BeautifulSoup(comment, "lxml")
                table = cs.find("table")
                if table:
                    break

    if not table:
        return None

    try:
        df = pd.read_html(str(table))[0]
        if isinstance(df.columns, pd.MultiIndex):
            df.columns = [col[-1] if "Unnamed" not in str(col[-1]) else col[0] for col in df.columns]

        df = df[df["Player"] != "Player"].dropna(subset=["Player"])
        df["Squad"] = squad_name
        df["Season"] = season
        return df
    except Exception as e:
        logger.error(f"Error parsing keeper table: {e}")
        return None

# Example output schema for PSxG analysis:
# {
#   "Player": "Alisson", "Nation": "br BRA", "Age": "32-115",
#   "GA": 18, "PSxG": 22.3, "PSxG+/-": 4.3, "/90": 0.15,
#   "Saves": 78, "Save%": 81.2,
#   "Squad": "Liverpool", "Season": "2025-2026"
# }
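PSxG+/- is post-shot xG faced minus goals conceded, so positive values indicate shot-stopping above expectation. A minimal worked example with made-up season totals:

```python
# Hypothetical season: keeper faces shots worth 40.5 PSxG, concedes 36, plays 3060 minutes
psxg, goals_against, minutes = 40.5, 36, 3060

psxg_plus_minus = round(psxg - goals_against, 1)     # goals prevented above expectation
per_90 = round(psxg_plus_minus / (minutes / 90), 2)  # normalized per full match

print(psxg_plus_minus)  # 4.5
print(per_90)           # 0.13
```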

Multi-Season Data Pipeline

For model building, you need data across multiple seasons. Here's a complete pipeline:

import json
from pathlib import Path
from dataclasses import dataclass, asdict

@dataclass
class SeasonScrapeJob:
    squad_id: str
    squad_name: str
    seasons: list[str]
    stat_types: list[str]

def run_multi_season_pipeline(
    jobs: list[SeasonScrapeJob],
    output_dir: str = "fbref_data",
    proxy_url: str | None = None,
) -> dict[str, int]:
    """
    Run a multi-squad, multi-season, multi-stat-type scrape.
    Returns dict of {filename: row_count}.
    """
    Path(output_dir).mkdir(exist_ok=True)
    results = {}

    STAT_SCRAPERS = {
        "shooting": scrape_player_shooting,
        "passing": scrape_passing_stats,
        "defense": scrape_defensive_stats,
        "keepers": scrape_keeper_stats,
    }

    total_jobs = sum(len(j.seasons) * len(j.stat_types) for j in jobs)
    completed = 0

    for job in jobs:
        for season in job.seasons:
            for stat_type in job.stat_types:
                completed += 1
                logger.info(f"[{completed}/{total_jobs}] {job.squad_name} {season} {stat_type}")

                scraper_fn = STAT_SCRAPERS.get(stat_type)
                if not scraper_fn:
                    logger.warning(f"Unknown stat type: {stat_type}")
                    continue

                df = scraper_fn(job.squad_id, job.squad_name, season)

                if df is not None and not df.empty:
                    filename = f"{output_dir}/{job.squad_name.lower().replace(' ', '_')}_{season}_{stat_type}.csv"
                    df.to_csv(filename, index=False, encoding="utf-8")
                    results[filename] = len(df)
                    logger.info(f"Saved {len(df)} rows to {filename}")
                else:
                    logger.warning(f"No data for {job.squad_name} {season} {stat_type}")

    # Save metadata
    metadata = {
        "scraped_at": pd.Timestamp.now().isoformat(),
        "total_files": len(results),
        "total_rows": sum(results.values()),
        "files": results,
    }
    with open(f"{output_dir}/metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)

    return results

# Example usage — scrape Big Six Premier League clubs, 3 seasons
top6_jobs = [
    SeasonScrapeJob("18bb7c10", "Arsenal", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
    SeasonScrapeJob("b8fd03ef", "Manchester-City", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
    SeasonScrapeJob("822bd0ba", "Liverpool", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
    SeasonScrapeJob("19538871", "Chelsea", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
    SeasonScrapeJob("361ca564", "Tottenham", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
    SeasonScrapeJob("206d90db", "Manchester-Utd", ["2023-2024", "2024-2025", "2025-2026"], ["shooting", "passing"]),
]

# results = run_multi_season_pipeline(top6_jobs, proxy_url="http://user:[email protected]:7777")

Anti-Bot Measures and How to Handle Them

1. Rate Limiting (Primary Threat)

FBref's primary defense is strict rate limiting. The site explicitly states in its terms that automated tools should wait at least 3 seconds between requests. Violate this and you'll get 429 responses followed by temporary IP bans.

class FBrefRateLimiter:
    """Respects FBref's rate limiting with adaptive backoff."""

    def __init__(self):
        self._last_request = 0.0
        self._consecutive_429s = 0
        self._base_delay = 3.0

    def wait(self):
        """Wait the appropriate time before the next request."""
        elapsed = time.time() - self._last_request

        # Increase base delay after repeated rate limiting
        delay = self._base_delay * (1.5 ** self._consecutive_429s)
        delay += random.uniform(0, delay * 0.3)  # Add jitter

        if elapsed < delay:
            time.sleep(delay - elapsed)

        self._last_request = time.time()

    def on_429(self, retry_after: int = 120):
        self._consecutive_429s += 1
        logger.warning(f"Rate limited ({self._consecutive_429s} consecutive). Backing off {retry_after}s")
        time.sleep(retry_after + random.uniform(10, 30))

    def on_success(self):
        self._consecutive_429s = max(0, self._consecutive_429s - 1)
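The deterministic part of that adaptive delay (jitter aside) grows geometrically with consecutive 429s; the schedule is easy to verify in isolation:

```python
BASE_DELAY = 3.0  # FBref's minimum requested delay, same as the class above

def backoff_delay(consecutive_429s: int) -> float:
    """Deterministic part of the adaptive delay (jitter excluded)."""
    return BASE_DELAY * (1.5 ** consecutive_429s)

# Delay after 0, 1, 2 consecutive rate-limit responses
schedule = [round(backoff_delay(n), 2) for n in range(3)]
print(schedule)  # [3.0, 4.5, 6.75]
```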

2. Cloudflare Challenges

FBref uses Cloudflare. Datacenter IPs (AWS, GCP, Digital Ocean, most VPNs) trigger Cloudflare challenges much more aggressively than residential IPs. If you're getting Cloudflare blocks, the most effective fix is using residential proxies.

ThorData provides residential proxies with session stickiness — you can keep the same IP across a sequence of related requests (useful for scraping a squad's full stat pages without triggering session-based behavioral analysis).

def create_thordata_session(
    username: str, 
    password: str, 
    country: str = "GB",      # UK residential IPs look natural for UK football sites
    session_id: str | None = None,
) -> requests.Session:
    """
    Create a requests.Session routed through ThorData residential proxy.
    Use session_id for sticky sessions (same IP across multiple requests).
    Use None for rotating sessions (different IP per request).
    """
    session = requests.Session()
    session.headers.update(HEADERS)

    if session_id:
        # Sticky: same IP for this session
        proxy_auth = f"{username}-country-{country}-session-{session_id}:{password}"
    else:
        # Rotating: fresh IP per request
        proxy_auth = f"{username}-country-{country}:{password}"

    proxy_url = f"http://{proxy_auth}@gate.thordata.com:7777"
    session.proxies = {"http": proxy_url, "https": proxy_url}

    return session

# For scraping a squad's pages: use sticky session so all requests for one squad
# come from the same IP (more natural browsing pattern)
squad_session = create_thordata_session(
    "username", "password", 
    country="GB",
    session_id="arsenal-2025"
)

3. Multi-Level Headers (Most Common Parsing Error)

FBref tables use grouped column headers that pandas reads as MultiIndex. If you try to access columns by name without flattening, you'll get KeyErrors or weird column names.

def flatten_multiindex_columns(df: pd.DataFrame) -> pd.DataFrame:
    """
    Flatten FBref's MultiIndex columns into usable single-level column names.
    Handles FBref's specific convention for unnamed top-level groups.
    """
    if not isinstance(df.columns, pd.MultiIndex):
        return df

    new_columns = []
    for col in df.columns:
        parts = [str(c) for c in col if "Unnamed" not in str(c) and str(c) != "nan"]
        if len(parts) == 0:
            new_columns.append("unknown")
        elif len(parts) == 1:
            new_columns.append(parts[0])
        else:
            # For grouped columns, prefix with group name if it adds context
            # e.g., ("Expected", "xG") -> "Expe_xG" (long group names truncated to four chars)
            # But ("Player",) -> "Player"
            top = parts[0]
            bottom = parts[-1]
            if top == bottom:
                new_columns.append(top)
            else:
                new_columns.append(f"{top[:4]}_{bottom}" if len(top) > 6 else f"{top}_{bottom}")

    df.columns = new_columns
    return df
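To see the idea without hitting the site, here is a simplified variant of that flattening (full group names kept, no truncation) on a hand-built MultiIndex mimicking FBref's shooting header:

```python
import pandas as pd

# Column tuples shaped like FBref's grouped shooting header
cols = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Player"),
    ("Standard", "Gls"),
    ("Expected", "xG"),
])
df = pd.DataFrame([["Salah", 22, 19.4]], columns=cols)

# Drop "Unnamed" placeholder parts, keep lone names, prefix grouped ones
flat = []
for col in df.columns:
    parts = [str(c) for c in col if "Unnamed" not in str(c)]
    flat.append(parts[0] if len(parts) == 1 else f"{parts[0]}_{parts[-1]}")
df.columns = flat

print(df.columns.tolist())  # ['Player', 'Standard_Gls', 'Expected_xG']
```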

4. Tables in HTML Comments

Some FBref tables are wrapped in HTML comments for deferred loading. BeautifulSoup.find("table") won't find them. You have to parse the comments.

from bs4 import Comment

def find_table_in_comments(soup: BeautifulSoup, table_id_pattern: str) -> Optional[BeautifulSoup]:
    """Search HTML comment blocks for a table matching the pattern."""
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        if table_id_pattern.lower() in comment.lower():
            comment_soup = BeautifulSoup(comment, "lxml")
            table = comment_soup.find("table", id=lambda x: x and table_id_pattern.lower() in str(x).lower())
            if table:
                return table
    return None

def get_table_robust(soup: BeautifulSoup, id_pattern: str) -> Optional[BeautifulSoup]:
    """Find a FBref table either in main HTML or in comments."""
    # Try main HTML first
    table = soup.find("table", id=lambda x: x and id_pattern.lower() in str(x).lower())
    if table:
        return table

    # Fall back to comments
    return find_table_in_comments(soup, id_pattern)
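You can verify the comment-parsing path against a tiny synthetic page (the table id here is made up):

```python
from bs4 import BeautifulSoup, Comment

html = """
<div id="all_stats">
  <!-- <table id="stats_shooting"><tr><th>Player</th></tr>
       <tr><td>Salah</td></tr></table> -->
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Direct search fails: the table exists only inside the comment
assert soup.find("table", id="stats_shooting") is None

table = None
for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
    # Re-parse the comment's text as HTML to expose the hidden table
    inner = BeautifulSoup(comment, "html.parser")
    table = inner.find("table", id="stats_shooting")
    if table:
        break

print(table.find("td").get_text())  # Salah
```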

Real-World Use Cases

1. xG League Table Generator

Build a table showing expected performance vs. actual results to identify over- and underperforming teams:

def build_xg_analysis(league_id: int, season: str = "2025-2026") -> pd.DataFrame | None:
    """Identify teams outperforming or underperforming their xG metrics."""
    df = scrape_league_table(league_id, season)
    if df is None:
        return None

    for col in ["Pts", "xG", "xGA", "GF", "GA", "MP"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")

    df["xPts_approx"] = (df["xG"] - df["xGA"]) * 0.9 + df["MP"] * 1.3
    df["PtsDiff"] = df["Pts"] - df["xPts_approx"]
    df["GoalDiff_vs_xG"] = df["GF"] - df["xG"]
    df["ConcedeDiff_vs_xGA"] = df["xGA"] - df["GA"]

    return df.sort_values("Pts", ascending=False)[
        ["Squad", "Pts", "xG", "xGA", "GF", "GA", "GoalDiff_vs_xG", "ConcedeDiff_vs_xGA", "PtsDiff"]
    ]
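As a quick sanity check of that heuristic (the 0.9 and 1.3 coefficients are the rough approximation used above, not an official xPts model), plugging in Liverpool-like season numbers:

```python
def xpts_approx(xg: float, xga: float, mp: int) -> float:
    # Same rough heuristic as build_xg_analysis: ~1.3 pts/match baseline
    # plus 0.9 points per unit of expected goal difference
    return (xg - xga) * 0.9 + mp * 1.3

xpts = xpts_approx(62.4, 25.1, 29)
print(round(xpts, 2))       # 71.27
print(round(71 - xpts, 2))  # PtsDiff of -0.27: results track the underlying numbers
```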

2. Player Recruitment Profiler

Find players matching a specific profile across multiple leagues:

def find_progressive_passers(
    squads: list[tuple[str, str]],   # [(squad_id, squad_name), ...]
    season: str = "2025-2026",
    min_minutes: int = 900,
    min_prog_passes_per90: float = 5.0,
) -> pd.DataFrame:
    """Find midfielders making lots of progressive passes per 90 minutes."""
    all_data = []

    for squad_id, squad_name in squads:
        df = scrape_passing_stats(squad_id, squad_name, season)
        if df is None:
            continue

        # Look for progressive passes column (naming varies)
        prog_col = next((c for c in df.columns if "PrgP" in c or "Prog" in c), None)
        # Look for a minutes column ("Mn/MP" is minutes per match, not a total)
        min_col = next((c for c in df.columns if c in ["Min", "90s"]), None)

        if prog_col and min_col:
            df[min_col] = pd.to_numeric(df[min_col], errors="coerce")
            df[prog_col] = pd.to_numeric(df[prog_col], errors="coerce")
            # A "90s" column is already in 90-minute units; raw minutes need converting
            df["nineties"] = df[min_col] if min_col == "90s" else df[min_col] / 90
            df["PrgP_per90"] = df[prog_col] / df["nineties"].replace(0, float("nan"))

            filtered = df[
                (df["nineties"] * 90 >= min_minutes)
                & (df["PrgP_per90"] >= min_prog_passes_per90)
            ]
            all_data.append(filtered)
            all_data.append(filtered)

    if not all_data:
        return pd.DataFrame()

    combined = pd.concat(all_data, ignore_index=True)
    return combined.sort_values("PrgP_per90", ascending=False)

3. Match xG Timeline Builder

Visualize how a match evolved using shot-by-shot xG:

def build_xg_timeline(match_url: str) -> dict | None:
    """Build a cumulative xG timeline for a match."""
    shots_df = scrape_match_shots(match_url)
    if shots_df is None or shots_df.empty:
        return None

    shots_df["minute_int"] = pd.to_numeric(
        shots_df["minute"].str.extract(r"(\d+)")[0], errors="coerce"
    )
    shots_df["xg"] = pd.to_numeric(shots_df.get("xg", 0), errors="coerce").fillna(0)
    shots_df = shots_df.sort_values("minute_int")

    teams = shots_df["squad"].dropna().unique().tolist()[:2] if "squad" in shots_df.columns else ["Home", "Away"]

    timeline = {}
    for team in teams:
        team_shots = shots_df[shots_df.get("squad", pd.Series()) == team].copy() if "squad" in shots_df.columns else shots_df
        team_shots = team_shots.sort_values("minute_int")
        team_shots["cumulative_xg"] = team_shots["xg"].cumsum()

        timeline[team] = [
            {"minute": int(row["minute_int"]), "xg": float(row["xg"]), 
             "cumulative_xg": float(row["cumulative_xg"]),
             "player": row.get("player", ""), "outcome": row.get("outcome", "")}
            for _, row in team_shots.iterrows()
        ]

    return {
        "match_url": match_url,
        "teams": teams,
        "timeline": timeline,
        "final_xg": {team: round(sum(s["xg"] for s in shots), 2) for team, shots in timeline.items()},
    }

# Example output:
# {
#   "match_url": "https://fbref.com/en/matches/...",
#   "teams": ["Arsenal", "Manchester City"],
#   "final_xg": {"Arsenal": 1.84, "Manchester City": 2.31},
#   "timeline": {
#     "Arsenal": [
#       {"minute": 12, "xg": 0.08, "cumulative_xg": 0.08, "player": "Saka", "outcome": "Missed"},
#       {"minute": 34, "xg": 0.42, "cumulative_xg": 0.50, "player": "Havertz", "outcome": "Goal"},
#       ...
#     ]
#   }
# }
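One subtlety in the timeline above: the regex keeps only the base minute, so a shot at "45+2" sorts as minute 45 alongside regular-time shots. If you need stoppage-time attempts ordered after regular time but before the next minute, a fractional sort key works (a standalone sketch):

```python
import re

def minute_sort_key(minute: str) -> float:
    """'45+2' -> 45.02, so stoppage time sorts after 45 but before 46."""
    m = re.match(r"(\d+)(?:\+(\d+))?", minute.strip())
    if not m:
        return 0.0
    base = int(m.group(1))
    added = int(m.group(2) or 0)
    return base + added / 100  # assumes added time stays under 100 minutes

shots = ["46", "45+2", "45", "90+4", "12"]
print(sorted(shots, key=minute_sort_key))
# ['12', '45', '45+2', '46', '90+4']
```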

4. Squad Age Profile for Transfer Planning

def scrape_squad_age_profile(squad_id: str, squad_name: str, season: str = "2025-2026") -> dict | None:
    """Analyze squad age distribution for transfer window planning."""
    df = scrape_player_shooting(squad_id, squad_name, season)  # Has age data
    if df is None:
        return None

    if "Age" not in df.columns:
        return None

    # Parse FBref age format "28-142" (years-days)
    df["age_years"] = pd.to_numeric(
        df["Age"].str.split("-").str[0], errors="coerce"
    )
    df["Min"] = pd.to_numeric(df.get("Min", 0), errors="coerce").fillna(0)

    # Weighted by minutes played
    total_mins = df["Min"].sum()
    if total_mins > 0:
        df["weight"] = df["Min"] / total_mins
        weighted_age = (df["age_years"] * df["weight"]).sum()
    else:
        weighted_age = df["age_years"].mean()

    age_bands = {
        "U23 (development)": len(df[df["age_years"] < 23]),
        "Peak (23-29)": len(df[(df["age_years"] >= 23) & (df["age_years"] <= 29)]),
        "Experienced (30+)": len(df[df["age_years"] >= 30]),
    }

    return {
        "squad": squad_name,
        "season": season,
        "mean_age": round(df["age_years"].mean(), 1),
        "weighted_age_by_minutes": round(weighted_age, 1),
        "oldest_player": df.loc[df["age_years"].idxmax(), "Player"] if not df.empty else None,
        "youngest_player": df.loc[df["age_years"].idxmin(), "Player"] if not df.empty else None,
        "age_bands": age_bands,
        "players": len(df),
    }

5. Top Goalscorer Tracker Across Leagues

MAJOR_LEAGUE_SQUADS = {
    # Add real squad IDs from FBref URLs
    9: [("18bb7c10", "Arsenal"), ("822bd0ba", "Liverpool"), ("b8fd03ef", "Manchester-City")],
    12: [],  # La Liga squads
}

def track_golden_boot_race(league_ids: list[int], season: str = "2025-2026") -> pd.DataFrame:
    """Track top scorers across multiple leagues."""
    all_scorers = []

    for league_id in league_ids:
        squads = MAJOR_LEAGUE_SQUADS.get(league_id, [])
        for squad_id, squad_name in squads:
            df = scrape_player_shooting(squad_id, squad_name, season)
            if df is None:
                continue

            if "Gls" in df.columns and "Player" in df.columns:
                df["Gls"] = pd.to_numeric(df["Gls"], errors="coerce")
                df["league_id"] = league_id
                all_scorers.append(df[["Player", "Squad", "Pos", "Age", "Gls", "xG", "league_id"]])

    if not all_scorers:
        return pd.DataFrame()

    combined = pd.concat(all_scorers, ignore_index=True)
    combined = combined.dropna(subset=["Gls"])

    # Filter for outfield players only
    if "Pos" in combined.columns:
        combined = combined[~combined["Pos"].str.contains("GK", na=False)]

    return combined.sort_values("Gls", ascending=False).head(20)

Complete Output Schemas

For building consistent data pipelines, here are the canonical output schemas for each data type:

League Table Row

{
  "Rk": 1,
  "Squad": "Liverpool",
  "MP": 29, "W": 22, "D": 5, "L": 2,
  "GF": 71, "GA": 29, "GD": 42, "Pts": 71,
  "xG": 62.4, "xGA": 25.1, "xGD": 37.3, "xGD/90": 1.29
}

Player Shooting Row

{
  "Player": "Mohamed Salah", "Nation": "eg EGY", "Pos": "FW",
  "Age": "32-245", "MP": 29, "Starts": 29, "Min": 2493,
  "Gls": 23, "Sh": 91, "SoT": 49, "SoT%": 53.8,
  "Sh/90": 3.28, "SoT/90": 1.77, "G/Sh": 0.25, "Dist": 14.2,
  "xG": 20.1, "npxG": 18.8, "npxG/Sh": 0.21, "G-xG": 2.9,
  "Squad": "Liverpool", "Season": "2025-2026"
}

Shot Record

{
  "minute": "34+2", "player": "Havertz", "squad": "Arsenal",
  "xg": 0.38, "outcome": "Goal", "distance_m": 12.0,
  "body_part": "Right Foot", "notes": "",
  "match_url": "https://fbref.com/en/matches/..."
}

Goalkeeper Advanced Row

{
  "Player": "Alisson", "Nation": "br BRA",
  "GA": 19, "PSxG": 23.8, "PSxG/SoT": 0.29, "PSxG+/-": -4.8, "/90": -0.17,
  "Stp%": 10.3, "AvgDist": 15.8,
  "Squad": "Liverpool", "Season": "2025-2026"
}
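To keep a pipeline honest against these schemas, a lightweight check before writing rows downstream can catch scraper drift early. A minimal sketch, with required-field lists pared down from the schemas above:

```python
# Minimal required fields per row type, pared down from the schemas above
REQUIRED_FIELDS = {
    "league_table": {"Squad", "MP", "Pts", "xG", "xGA"},
    "shot": {"minute", "player", "squad", "xg", "outcome"},
}

def validate_row(row: dict, row_type: str) -> list[str]:
    """Return the sorted list of required fields missing from a scraped row."""
    required = REQUIRED_FIELDS.get(row_type, set())
    return sorted(required - row.keys())

row = {"Squad": "Liverpool", "MP": 29, "Pts": 71, "xG": 62.4}
print(validate_row(row, "league_table"))  # ['xGA'] -- flag before it hits the pipeline
```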

FBref is one of the most valuable publicly accessible sports databases in existence. Keep your request rate conservative (3 seconds minimum between requests), use residential proxies via ThorData if you're hitting Cloudflare blocks, always check for tables in HTML comments when find("table") returns nothing, and flatten those multi-level headers before doing anything with the data.

With those guardrails in place, you'll have access to the same professional-grade football statistics used by Premier League analysts, sports journalists, and the growing community of football data scientists building the next generation of analysis tools.