
Extract Wikipedia Infobox Data at Scale with Python (2026)

Wikipedia infoboxes are the dense rectangles of structured data at the top-right of most articles. Behind the scenes they're wikitext templates — and that makes them machine-readable if you know how to parse them. This guide walks through pulling infobox data at scale using the MediaWiki API and wikitextparser, turning raw wikitext into clean Python dicts suitable for datasets, knowledge graphs, or research pipelines.

Why Infoboxes

Wikidata is the canonical linked-data layer for Wikipedia facts, but infobox data often contains values that haven't been migrated to Wikidata yet, or that exist in more human-readable form. For many use cases — building a company database, collecting population figures for settlements, assembling a biographical dataset — scraping infoboxes directly is faster and more complete than querying Wikidata's SPARQL endpoint.

The other reason is format fidelity. Wikidata normalizes everything into statements with qualifiers. Infoboxes give you the text as editors wrote it: "c. 1450–1516" for birth/death ranges, "$2.3 billion (2025 est.)" for revenue, "New York City, United States" for locations. Sometimes you want this less-normalized form.

Understanding Infobox Types

Wikipedia uses hundreds of infobox templates, each with its own field names. Some common ones:

| Template Name | Used For | Key Fields |
| --- | --- | --- |
| Infobox person | Biographical articles | birth_date, birth_place, death_date, nationality, occupation |
| Infobox company | Corporations | founded, founders, headquarters, industry, revenue, employees |
| Infobox settlement | Cities, towns, villages | country, population_total, population_as_of, area_total_km2 |
| Infobox film | Movies | director, producer, writer, starring, released, runtime, budget |
| Infobox book | Published books | author, language, subject, genre, published, publisher |
| Infobox album | Music albums | artist, released, genre, label, producer |
| Infobox country | Nations | capital, official_languages, area_km2, population_estimate |
| Infobox university | Higher education | established, type, president, students, endowment, location |
| Infobox sportsperson | Athletes | nationality, birth_place, sport, team |

The template name isn't always an exact match — variants like Infobox_person, infobox person, and Infobox Person all exist. Normalize to lowercase and convert underscores to spaces before matching.
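That normalization step can be captured in a small helper (the function name here is illustrative, not part of the scraper below):

```python
import re


def normalize_template_name(name: str) -> str:
    """Lowercase, convert underscores to spaces, collapse runs of whitespace."""
    return re.sub(r"[\s_]+", " ", name).strip().lower()


# All common variants collapse to the same key:
# "Infobox_person", "infobox person", "Infobox Person" -> "infobox person"
```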

Getting Raw Wikitext from the MediaWiki API

The action=query endpoint with prop=revisions returns raw wikitext. This is cleaner than scraping HTML because you work with the template syntax directly — HTML rendering loses structural information.

import requests
import json
import time

SESSION = requests.Session()
SESSION.headers.update({
    "User-Agent": "InfoboxScraper/1.0 (https://yourproject.example; [email protected])"
})

API = "https://en.wikipedia.org/w/api.php"


def get_wikitext(title: str) -> str:
    """Fetch raw wikitext for a Wikipedia article."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "maxlag": 5,
    }

    while True:
        r = SESSION.get(API, params=params, timeout=20)

        if r.status_code == 503:
            # maxlag response — server is busy, retry after specified delay
            retry_after = int(r.headers.get("Retry-After", 10))
            print(f"  Maxlag triggered, waiting {retry_after}s")
            time.sleep(retry_after)
            continue

        r.raise_for_status()
        data = r.json()

        pages = data["query"]["pages"]
        page = next(iter(pages.values()))

        if "missing" in page:
            raise ValueError(f"Article not found: {title}")

        return page["revisions"][0]["slots"]["main"]["*"]


def get_wikitext_batch(titles: list[str]) -> dict[str, str]:
    """Fetch wikitext for up to 50 articles in one API call."""
    params = {
        "action": "query",
        "titles": "|".join(titles[:50]),
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "maxlag": 5,
    }

    r = SESSION.get(API, params=params, timeout=30)
    r.raise_for_status()
    data = r.json()

    result = {}
    for page in data["query"]["pages"].values():
        if "missing" not in page and "revisions" in page:
            title = page["title"]
            result[title] = page["revisions"][0]["slots"]["main"]["*"]

    return result

The User-Agent header is required by Wikimedia's API policy — requests without a descriptive UA identifying your project can be blocked. The maxlag=5 parameter tells the server to return a 503 if replication lag exceeds 5 seconds, which you catch and retry. This is proper API etiquette.
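Since the batch endpoint caps at 50 titles per request, a larger title list needs to be split before calling get_wikitext_batch. A minimal chunking helper (chunked is my name, not an API function) might look like:

```python
def chunked(items: list, size: int = 50):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# e.g. a 120-title list becomes batches of 50, 50, and 20 —
# each small enough for a single batch API call
```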

Parsing Infobox Templates with wikitextparser

wikitextparser handles nested templates, wikilinks, and pipe-separated values correctly. Avoid regex for this — wikitext nesting is not regular.

import wikitextparser as wtp


def extract_all_infoboxes(wikitext: str) -> list[dict]:
    """Extract all infobox templates from a wikitext string."""
    parsed = wtp.parse(wikitext)
    infoboxes = []

    for template in parsed.templates:
        # Lowercase and convert underscores so "Infobox_person" also matches
        name = template.name.strip().lower().replace("_", " ")
        # Match any template whose name starts with "infobox"
        if name.startswith("infobox"):
            infobox = {
                "type": template.name.strip(),
                "type_normalized": name,
                "fields": parse_template_fields(template),
            }
            infoboxes.append(infobox)

    return infoboxes


def extract_primary_infobox(wikitext: str) -> dict | None:
    """Extract the first (primary) infobox from a wikitext string."""
    infoboxes = extract_all_infoboxes(wikitext)
    return infoboxes[0] if infoboxes else None


def parse_template_fields(template) -> dict:
    """Parse all named arguments from a wikitext template."""
    fields = {}
    for arg in template.arguments:
        key = arg.name.strip() if arg.name else None
        if not key:
            continue

        raw_value = arg.value
        clean = clean_wikitext_value(raw_value)

        if clean:
            fields[key] = clean

    return fields


def clean_wikitext_value(raw: str) -> str:
    """
    Convert a raw wikitext value to plain text.
    Resolves [[links]], removes {{nested templates}}, strips comments.
    """
    if not raw or not raw.strip():
        return ""

    parsed = wtp.parse(raw)

    # Resolve wikilinks: [[Target|Label]] -> Label, [[Target]] -> Target
    for link in reversed(list(parsed.wikilinks)):
        display = link.text.strip() if link.text and link.text.strip() else link.target.strip()
        # Replace the link span with its display text
        parsed = wtp.parse(parsed.string[:link.span[0]] + display + parsed.string[link.span[1]:])

    # Use plain_text() to strip remaining templates, tags, etc.
    try:
        result = parsed.plain_text(
            replace_wikilinks=True,
            replace_templates=False,
        ).strip()
    except Exception:
        # Fallback for edge cases
        result = raw.strip()

    # Remove HTML comments
    import re
    result = re.sub(r"<!--.*?-->", "", result, flags=re.DOTALL)

    # Remove wiki markup artifacts
    result = re.sub(r"\s*\n\s*", " ", result)  # Collapse newlines
    result = result.strip()

    return result
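Full wikitext needs a real parser, but the wikilink-resolution step in isolation is simple enough to illustrate with a regex — valid only for flat, non-nested links. This is a standalone sketch, not part of the pipeline above:

```python
import re


def resolve_wikilinks(text: str) -> str:
    # [[Target|Label]] -> Label, [[Target]] -> Target (flat links only)
    return re.sub(
        r"\[\[([^\[\]|]*)(?:\|([^\[\]]*))?\]\]",
        lambda m: (m.group(2) or m.group(1)).strip(),
        text,
    )
```

For anything with nested templates inside link labels, fall back to wikitextparser as shown above.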

Handling Complex Field Values

Some infobox fields use nested templates to express structured data. Date templates are the most common:

import re


def parse_date_field(raw_value: str) -> str:
    """
    Parse common Wikipedia date template patterns into ISO-ish strings.
    Examples: {{Birth date|1990|3|15}} -> "1990-03-15"
              {{Birth date and age|1945|7|22}} -> "1945-07-22"
    """
    # {{Birth date|YYYY|M|D}} or {{Birth date and age|YYYY|M|D}}
    match = re.search(
        r"\{\{(?:birth|death)\s*date(?:\s*and\s*age)?\s*\|(\d{4})\|(\d{1,2})\|(\d{1,2})",
        raw_value, re.IGNORECASE
    )
    if match:
        y, m, d = match.groups()
        return f"{y}-{int(m):02d}-{int(d):02d}"

    # {{Start date|YYYY|M|D}}
    match = re.search(
        r"\{\{start\s*date\|(\d{4})\|(\d{1,2})(?:\|(\d{1,2}))?",
        raw_value, re.IGNORECASE
    )
    if match:
        y, m, d = match.groups()
        d = d or "01"
        return f"{y}-{int(m):02d}-{int(d):02d}"

    # Fallback: try to clean the value normally
    return clean_wikitext_value(raw_value)


def parse_population_field(raw_value: str) -> int | None:
    """Extract numeric population from formatted infobox values."""
    # Remove formatting templates like {{formatnum:1234567}}
    cleaned = re.sub(r"\{\{formatnum:([0-9,]+)\}\}", r"\1", raw_value, flags=re.IGNORECASE)
    # Take the first (possibly comma-grouped) number rather than all digits,
    # so trailing years like "(2020 census)" aren't merged into the value
    match = re.search(r"\d+(?:,\d{3})*", cleaned)
    if not match:
        return None
    return int(match.group(0).replace(",", ""))


def parse_currency_field(raw_value: str) -> dict:
    """Extract amount and currency from financial fields."""
    cleaned = clean_wikitext_value(raw_value)

    # Match patterns like "$2.3 billion", "€450 million", "US$1.2 trillion"
    match = re.search(
        r"(US\$|\$|€|£|¥|₹|A\$|C\$)?\s*([\d,.]+)\s*(billion|million|trillion|thousand)?",
        cleaned, re.IGNORECASE
    )
    if match:
        currency_symbol = match.group(1) or ""
        amount_str = match.group(2).replace(",", "")
        multiplier_str = (match.group(3) or "").lower()

        multipliers = {
            "trillion": 1e12, "billion": 1e9, "million": 1e6, "thousand": 1e3
        }
        multiplier = multipliers.get(multiplier_str, 1)

        try:
            amount = float(amount_str) * multiplier
            return {"amount": amount, "currency_symbol": currency_symbol, "raw": cleaned}
        except ValueError:
            pass

    return {"raw": cleaned}

Schema-Driven Normalization

Different infobox types need different field mappings. Define schemas and use them to produce consistent output:

INFOBOX_SCHEMAS = {
    "infobox person": {
        "required": ["name"],
        "fields": {
            "name": "name",
            "birth_name": "birth_name",
            "birth_date": "birth_date",
            "birth_place": "birth_place",
            "death_date": "death_date",
            "death_place": "death_place",
            "nationality": "nationality",
            "citizenship": "nationality",
            "occupation": "occupation",
            "known_for": "known_for",
            "awards": "awards",
            "spouse": "spouse",
            "education": "education",
            "alma_mater": "alma_mater",
            "employer": "employer",
        },
        "parsers": {
            "birth_date": parse_date_field,
            "death_date": parse_date_field,
        },
    },
    "infobox company": {
        "required": ["name"],
        "fields": {
            "name": "name",
            "type": "company_type",
            "industry": "industry",
            "founded": "founded",
            "founders": "founders",
            "hq_location": "headquarters",
            "location": "headquarters",
            "key_people": "key_people",
            "revenue": "revenue",
            "net_income": "net_income",
            "total_assets": "total_assets",
            "num_employees": "employees",
            "products": "products",
            "services": "services",
            "website": "website",
        },
        "parsers": {
            "revenue": parse_currency_field,
            "net_income": parse_currency_field,
        },
    },
    "infobox settlement": {
        "required": ["name"],
        "fields": {
            "name": "name",
            "official_name": "official_name",
            "country": "country",
            "subdivision_type1": "region_type",
            "subdivision_name1": "region",
            "population_total": "population",
            "population_as_of": "population_year",
            "population_density_km2": "population_density",
            "area_total_km2": "area_km2",
            "elevation_m": "elevation_m",
            "timezone": "timezone",
            "utc_offset": "utc_offset",
            "website": "website",
        },
        "parsers": {
            "population": parse_population_field,
        },
    },
}


def normalize_infobox(raw: dict, article_title: str = "") -> dict:
    """
    Apply schema-driven normalization to a raw infobox dict.
    Unknown infobox types are returned with all fields preserved.
    """
    infobox_type = raw["type_normalized"]
    schema = INFOBOX_SCHEMAS.get(infobox_type)

    result = {
        "_title": article_title,
        "_type": raw["type"].strip(),
        "_type_key": infobox_type,
    }

    if schema:
        field_map = schema["fields"]
        parsers = schema.get("parsers", {})

        for source_field, target_field in field_map.items():
            raw_value = raw["fields"].get(source_field)
            if raw_value:
                parser = parsers.get(source_field)
                result[target_field] = parser(raw_value) if parser else raw_value

        # Add any unmapped fields as _extra_*
        mapped_sources = set(field_map.keys())
        for key, val in raw["fields"].items():
            if key not in mapped_sources:
                result[f"_extra_{key}"] = val
    else:
        # Unknown infobox type — include all fields verbatim
        result.update(raw["fields"])

    return result

Bulk Extraction by Category

The action=query endpoint with list=categorymembers returns all articles in a Wikipedia category:

def get_category_members(
    category: str,
    namespace: int = 0,
    limit: int = 500,
) -> list[str]:
    """Return all article titles in a Wikipedia category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": min(limit, 500),
        "cmtype": "page",
        "cmnamespace": namespace,
        "format": "json",
        "maxlag": 5,
    }

    all_titles = []

    while True:
        r = SESSION.get(API, params=params, timeout=20)

        if r.status_code == 503:
            retry_after = int(r.headers.get("Retry-After", 10))
            time.sleep(retry_after)
            continue

        r.raise_for_status()
        data = r.json()

        for member in data["query"]["categorymembers"]:
            all_titles.append(member["title"])

        if "continue" not in data or len(all_titles) >= limit:
            break

        params["cmcontinue"] = data["continue"]["cmcontinue"]
        time.sleep(0.5)

    return all_titles[:limit]


def get_subcategories(category: str) -> list[str]:
    """Return names of subcategories within a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "subcat",
        "cmlimit": 100,
        "format": "json",
    }

    r = SESSION.get(API, params=params, timeout=20)
    r.raise_for_status()
    data = r.json()

    return [
        m["title"].replace("Category:", "")
        for m in data["query"]["categorymembers"]
    ]


def extract_category(
    category: str,
    output_file: str | None = None,
    include_subcategories: bool = False,
    max_articles: int = 1000,
) -> list[dict]:
    """
    Extract infobox data for all articles in a Wikipedia category.
    """
    titles = get_category_members(category, limit=max_articles)

    if include_subcategories:
        subcats = get_subcategories(category)
        for subcat in subcats[:10]:  # Limit the number of subcategories crawled
            sub_titles = get_category_members(subcat, limit=100)
            titles.extend(sub_titles)
        titles = list(dict.fromkeys(titles))[:max_articles]  # Dedupe, preserving order

    print(f"Extracting infoboxes for {len(titles)} articles in Category:{category}")

    results = []
    failed = []

    for i, title in enumerate(titles):
        if i > 0 and i % 50 == 0:
            print(f"  Progress: {i}/{len(titles)}")

        try:
            wikitext = get_wikitext(title)
            raw = extract_primary_infobox(wikitext)

            if raw:
                record = normalize_infobox(raw, article_title=title)
                results.append(record)
            else:
                # No infobox found
                results.append({"_title": title, "_type": None, "_no_infobox": True})

            time.sleep(0.2)  # ~5 req/s — polite serial pacing for the Action API

        except Exception as e:
            failed.append({"title": title, "error": str(e)})
            print(f"  Failed: {title} — {e}")

    if output_file:
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=2, ensure_ascii=False)
        print(f"Wrote {len(results)} records to {output_file}")

    if failed:
        print(f"Failed: {len(failed)} articles")

    return results

Parallel Extraction at Scale

For tens of thousands of articles, serial requests are too slow. Parallelize with thread workers, each using a distinct proxy session to distribute load:

from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

# Thread-local sessions for parallel workers
_thread_local = threading.local()

PROXY_POOL = [
    "http://user:[email protected]:PORT",
    "http://user:[email protected]:PORT",
    # Add more proxy endpoints from your provider
]


def get_worker_session(worker_id: int) -> requests.Session:
    """Get or create a requests session for this worker thread."""
    if not hasattr(_thread_local, "session"):
        session = requests.Session()
        session.headers.update({
            "User-Agent": "InfoboxScraper/1.0 ([email protected])"
        })
        if PROXY_POOL:
            proxy_url = PROXY_POOL[worker_id % len(PROXY_POOL)]
            session.proxies = {"http": proxy_url, "https": proxy_url}
        _thread_local.session = session

    return _thread_local.session


def fetch_one_parallel(args: tuple) -> dict | None:
    """Fetch and extract infobox for a single article (runs in thread pool)."""
    title, worker_id = args
    session = get_worker_session(worker_id)

    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "format": "json",
        "maxlag": 5,
    }

    try:
        r = session.get(API, params=params, timeout=20)
        r.raise_for_status()
        data = r.json()

        pages = data["query"]["pages"]
        page = next(iter(pages.values()))

        if "missing" in page or "revisions" not in page:
            return None

        wikitext = page["revisions"][0]["slots"]["main"]["*"]
        raw = extract_primary_infobox(wikitext)

        if raw:
            return normalize_infobox(raw, article_title=title)

    except Exception as e:
        print(f"  Thread error for {title}: {e}")

    return None


def parallel_extract(
    titles: list[str],
    workers: int = 10,
    delay_per_worker: float = 1.0,
) -> list[dict]:
    """
    Extract infoboxes for a list of titles using a thread pool.
    Each worker uses a different proxy for load distribution.
    """
    results = []
    tasks = [(title, i % workers) for i, title in enumerate(titles)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one_parallel, task): task[0] for task in tasks}

        completed = 0
        for future in as_completed(futures):
            result = future.result()
            if result:
                results.append(result)
            completed += 1

            if completed % 100 == 0:
                print(f"  Completed {completed}/{len(titles)} ({len(results)} with infoboxes)")

            # Note: this only paces the collection loop — all futures were
            # submitted up front, so a strict per-IP cap requires a sleep or
            # rate limiter inside fetch_one_parallel itself
            time.sleep(delay_per_worker / workers)

    return results

A rotating proxy network such as ThorData suits large-scale Wikipedia extraction. Wikipedia's infrastructure tolerates substantial aggregate traffic, but individual IPs can be throttled or blocked for aggressive crawling. With 10 workers each on a different residential exit IP, you can run 10 parallel requests while each IP individually makes only about 1 request per second — a polite rate per address.
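If you want a hard per-IP cap rather than the approximate sleeps above, a small thread-safe limiter works. This is a sketch I'm adding, not part of the original pipeline — each worker would call limiter.wait() before its request:

```python
import threading
import time


class RateLimiter:
    """Allow at most one call per `interval` seconds, across threads."""

    def __init__(self, interval: float):
        self.interval = interval
        self._lock = threading.Lock()
        self._next = 0.0

    def wait(self) -> None:
        # Reserve the next time slot under the lock, then sleep outside it
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next - now)
            self._next = max(now, self._next) + self.interval
        if delay:
            time.sleep(delay)
```

One limiter instance per proxy IP keeps each exit address at its own steady rate even when multiple workers share it.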

Change Tracking

Poll infoboxes on a schedule and diff the output to detect when editors update facts:

import sqlite3
from datetime import datetime


def init_tracking_db(path: str = "infobox_history.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS infobox_snapshots (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            article     TEXT NOT NULL,
            infobox_type TEXT,
            fields_json TEXT,
            captured_at TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS field_changes (
            id          INTEGER PRIMARY KEY AUTOINCREMENT,
            article     TEXT NOT NULL,
            field_name  TEXT NOT NULL,
            old_value   TEXT,
            new_value   TEXT,
            changed_at  TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_snapshots_article ON infobox_snapshots(article)")
    conn.commit()
    return conn


def save_snapshot(conn: sqlite3.Connection, article: str, infobox: dict):
    """Save current infobox state and detect changes vs. previous snapshot."""
    now = datetime.utcnow().isoformat()
    fields = {k: v for k, v in infobox.items() if not k.startswith("_")}
    fields_json = json.dumps(fields, ensure_ascii=False, sort_keys=True)

    # Get last snapshot
    last = conn.execute(
        "SELECT fields_json FROM infobox_snapshots WHERE article=? ORDER BY id DESC LIMIT 1",
        (article,)
    ).fetchone()

    # Save new snapshot
    conn.execute(
        "INSERT INTO infobox_snapshots (article, infobox_type, fields_json) VALUES (?, ?, ?)",
        (article, infobox.get("_type"), fields_json)
    )

    # Detect and record changes
    if last:
        old_fields = json.loads(last[0])
        for key in set(list(old_fields.keys()) + list(fields.keys())):
            old_val = str(old_fields.get(key, ""))
            new_val = str(fields.get(key, ""))
            if old_val != new_val:
                conn.execute(
                    "INSERT INTO field_changes (article, field_name, old_value, new_value, changed_at) VALUES (?, ?, ?, ?, ?)",
                    (article, key, old_val or None, new_val or None, now)
                )

    conn.commit()


def get_recent_changes(
    conn: sqlite3.Connection,
    since: str,
    field_filter: str | None = None,
) -> list[dict]:
    """Get all infobox field changes since a given ISO timestamp."""
    query = "SELECT article, field_name, old_value, new_value, changed_at FROM field_changes WHERE changed_at > ?"
    params = [since]

    if field_filter:
        query += " AND field_name = ?"
        params.append(field_filter)

    query += " ORDER BY changed_at DESC"
    rows = conn.execute(query, params).fetchall()

    return [
        {"article": r[0], "field": r[1], "old": r[2], "new": r[3], "at": r[4]}
        for r in rows
    ]
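The diff step inside save_snapshot can be verified on its own. This standalone snippet (with toy values) reproduces the comparison over the union of old and new keys:

```python
old_fields = {"revenue": "$2.1 billion", "employees": "150,000"}
new_fields = {"revenue": "$2.3 billion", "employees": "150,000", "ceo": "J. Doe"}

# Compare every key present in either snapshot; unchanged keys are skipped
changes = [
    (key, old_fields.get(key), new_fields.get(key))
    for key in sorted(set(old_fields) | set(new_fields))
    if old_fields.get(key) != new_fields.get(key)
]
# Both the changed field and the newly added field are detected
```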

Use Cases

Company intelligence. The Infobox company schema across 50,000+ Wikipedia articles gives you founding dates, current revenue figures, headquarters locations, and key personnel — searchable and exportable. Combine with Infobox settlement for geographic analysis.

Biographical datasets. Infobox person across categories like "American computer scientists" or "Nobel Prize in Physics laureates" gives you birth dates, nationalities, institutional affiliations, and career facts for NLP research or knowledge graph construction.

Population and geography. Infobox settlement for all articles in "Populated places in Germany" (or any country) gives you population, area, and elevation data for tens of thousands of locations — often more current than official statistics since Wikipedia editors update after each census.

Film and media analysis. Infobox film across "2020s films" gives you budget, box office gross, director, studio, and release date for thousands of films — useful for entertainment industry research.

Change monitoring. Organizations that maintain Wikipedia articles (companies, universities, public figures) update them when significant facts change — new CEO, completed acquisition, updated revenue. Polling infoboxes and diffing lets you detect these changes programmatically.

The data is freely licensed under CC BY-SA, consistently structured, and maintained by Wikipedia's editor community. The MediaWiki API is stable and well-documented. wikitextparser handles the wikitext parsing correctly where regex would fail. Start with a small category, validate your output by spot-checking a few articles manually, then scale up.