Bulk Scrape Wikipedia: Category Trees, Infoboxes & Cross-Language Links (2026)

Wikipedia is one of the few major sites that actively wants you to use their data. They have a proper API, reasonable rate limits, openly encourage reuse under CC BY-SA, and publish complete database dumps for bulk access. The data quality is exceptional — millions of structured articles covering every topic imaginable, maintained by tens of thousands of editors, available in 300+ languages.

Doing this at scale — pulling thousands of articles across category trees, extracting structured infobox data, mapping cross-language equivalents — still requires planning. The MediaWiki API has quirks, pagination is mandatory for large result sets, and infobox parsing requires dealing with years of inconsistent editor formatting.

This guide covers every major use case: category tree traversal, infobox extraction, article metadata, cross-language links, bulk fetching, SQLite storage, and full pipeline assembly.

Why Wikipedia Is an Exceptional Data Source

Before getting into code, it is worth understanding what makes Wikipedia different from other web scraping targets:

Officially encouraged. Wikipedia's API exists because the Wikimedia Foundation wants people to build on their data. They still ask that you set a descriptive User-Agent with contact information — it identifies your client and keeps your access from being blocked.

Freely licensed. Article content is CC BY-SA 4.0. You can republish, transform, and build commercial products on it as long as you attribute and share alike. This is extremely rare among data sources of this quality.

Structured data layer. Infoboxes contain semi-structured data for millions of articles. Countries, cities, companies, chemicals, species, films, albums — all have typed infobox templates with named fields.

Multilingual. The same entity exists in 300+ language editions. Cross-language links via Wikidata let you map "Python (programming language)" to its German, Japanese, and Polish equivalents in one API call.

Wikidata integration. Every notable Wikipedia article links to a Wikidata entity, giving you access to even more structured data through a separate SPARQL query interface.

The MediaWiki API

Wikipedia runs on MediaWiki, and the API lives at https://en.wikipedia.org/w/api.php. No API key, no authentication. You make requests and get JSON.

The API uses an action parameter to determine the operation. The two you will use most are query (lists, metadata, links) and parse (wikitext and rendered content):

import httpx
import time

API_URL = "https://en.wikipedia.org/w/api.php"

# Always set a descriptive User-Agent with contact info — Wikimedia's User-Agent policy requires it
HEADERS = {
    "User-Agent": "YourProjectName/1.0 ([email protected]) python-httpx/0.27",
}

def wiki_query(params: dict) -> dict:
    """Make a MediaWiki API request with default params."""
    defaults = {
        "format": "json",
        "formatversion": "2",
        # Politely back off when server replication lag is high (see error handling below)
        "maxlag": "5",
    }
    params = {**defaults, **params}

    response = httpx.get(
        API_URL, params=params, headers=HEADERS, timeout=30
    )
    response.raise_for_status()
    return response.json()
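Before wiring this into anything, it helps to see what a merged request looks like on the wire. This standalone sketch builds the query string with the standard library alone — httpx percent-encodes parameters the same way, so spaces and parentheses in titles are safe:

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

defaults = {"format": "json", "formatversion": "2"}
params = {**defaults, "action": "query", "titles": "Python (programming language)"}

query_string = urlencode(params)
print(f"{API_URL}?{query_string}")
# Spaces become '+', parentheses become %28/%29
```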

Crawling Category Trees

Wikipedia categories are hierarchical. Category:Programming languages contains subcategories like Category:Python (programming language) which contains individual articles. To get all articles in a category tree, you need to walk it breadth-first, handling pagination at every level.

from collections import deque

def get_category_members(category: str, depth: int = 3) -> dict:
    """
    Walk a category tree breadth-first, collecting articles and subcategories.
    Returns {'articles': [...], 'subcategories': [...]}.
    """
    articles = []
    subcategories = []
    visited = set()
    queue = deque([(category, 0)])

    while queue:
        cat, current_depth = queue.popleft()

        if cat in visited or current_depth > depth:
            continue
        visited.add(cat)

        cmcontinue = None
        page_articles = 0

        while True:
            params = {
                "action": "query",
                "list": "categorymembers",
                "cmtitle": cat,
                "cmlimit": "500",
                "cmprop": "title|type|timestamp",
            }
            if cmcontinue:
                params["cmcontinue"] = cmcontinue

            data = wiki_query(params)
            members = data.get("query", {}).get("categorymembers", [])

            for member in members:
                if member["type"] == "subcat":
                    subcategories.append(member["title"])
                    if current_depth < depth:
                        queue.append((member["title"], current_depth + 1))
                elif member["type"] == "page":
                    articles.append({
                        "title": member["title"],
                        "category": cat,
                        "depth": current_depth,
                        "timestamp": member.get("timestamp"),
                    })
                    page_articles += 1

            # Handle pagination — categories can have 500+ members
            if "continue" in data:
                cmcontinue = data["continue"]["cmcontinue"]
            else:
                break

            time.sleep(0.1)

        print(f"  {cat}: {page_articles} articles")
        time.sleep(0.2)

    return {"articles": articles, "subcategories": subcategories}

# Usage
result = get_category_members("Category:Python (programming language)", depth=2)
print(f"Found {len(result['articles'])} articles in {len(result['subcategories'])} subcategories")

A note on depth: Wikipedia categories are loosely organized and deeply nested. Going beyond depth 3 can pull tens of thousands of articles because high-level categories like Category:Science ultimately contain everything. Start at depth 1 or 2, inspect what you have, then increase if needed.

The cmcontinue token is essential — category listings cap at 500 members per request. Any category with more members requires multiple requests with continuation tokens.
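The same continuation dance applies to every list= endpoint, so it is worth factoring out. The sketch below takes the fetch function as a parameter — wiki_query from earlier slots straight in, and a stub works for testing without network access:

```python
from typing import Callable, Iterator

def iterate_list(fetch: Callable[[dict], dict], params: dict, list_name: str) -> Iterator[dict]:
    """Yield every member of a list= query, following continuation tokens.

    `fetch` is any callable that takes MediaWiki params and returns the
    parsed JSON response (e.g. wiki_query from above).
    """
    params = dict(params)  # don't mutate the caller's dict
    while True:
        data = fetch(params)
        yield from data.get("query", {}).get(list_name, [])
        if "continue" not in data:
            break
        # Merge ALL continuation keys, not just cmcontinue — some queries
        # return several at once (plcontinue, clcontinue, the sentinel itself)
        params.update(data["continue"])
```

The continue object can carry several keys at once, which is why the whole dict is merged back into the parameters rather than a single token being copied.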

Extracting Infoboxes

Infoboxes are the structured data panels on the right side of Wikipedia articles. They contain the most machine-readable information — population figures, coordinates, release dates, chemical formulas, film budgets, sports statistics. Country articles have geographic infoboxes; company articles have business infoboxes; film articles have film infoboxes — each with predictable field names.

The approach: use the parse action with the wikitext property, then parse the infobox template with regex or a proper wikitext parser.

import re

def extract_infobox(title: str) -> dict | None:
    """Extract infobox data from a Wikipedia article."""
    data = wiki_query({
        "action": "parse",
        "page": title,
        "prop": "wikitext",
    })

    wikitext = data.get("parse", {}).get("wikitext", "")
    if not wikitext:
        return None

    # Find the infobox template — case-insensitive, since editors write both Infobox and infobox
    infobox_match = re.search(
        r"\{\{Infobox(.+?)(?:\n\}\})", wikitext, re.DOTALL | re.IGNORECASE
    )
    if not infobox_match:
        return None

    infobox_text = infobox_match.group(1)
    result = {"_type": "Infobox"}

    # Extract type from first line
    first_line = infobox_text.split("\n")[0].strip()
    if first_line:
        result["_type"] = f"Infobox {first_line}"

    # Parse key-value pairs
    for match in re.finditer(
        r"\|\s*(\w[\w\s]*?)\s*=\s*(.+?)(?=\n\||\n\}\}|$)",
        infobox_text, re.DOTALL
    ):
        key = match.group(1).strip().lower().replace(" ", "_")
        value = match.group(2).strip()

        # Clean up wiki markup
        value = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", value)  # [[Link|Text]] -> Text
        value = re.sub(r"\{\{.*?\}\}", "", value)  # Remove templates
        value = re.sub(r"<ref[^>]*>.*?</ref>", "", value, flags=re.DOTALL)
        value = re.sub(r"<[^>]+>", "", value)  # Remove HTML tags
        value = value.strip()

        if value:
            result[key] = value

    return result

# Usage
info = extract_infobox("Python (programming language)")
if info:
    for k, v in list(info.items())[:8]:
        print(f"  {k}: {v[:80]}")

Infobox parsing is inherently messy. Editors use inconsistent formatting, nested templates, and inline HTML. The regex approach handles 80-90% of cases. For production use, the mwparserfromhell library parses wikitext as a proper grammar rather than with regex:

# pip install mwparserfromhell
import mwparserfromhell

def extract_infobox_robust(wikitext: str) -> dict | None:
    """Parse infobox using mwparserfromhell for reliable extraction."""
    parsed = mwparserfromhell.parse(wikitext)

    for template in parsed.filter_templates():
        name = str(template.name).strip().lower()
        if name.startswith("infobox"):
            result = {"_type": str(template.name).strip()}
            for param in template.params:
                key = str(param.name).strip().lower().replace(" ", "_")
                # strip_code() removes nested wiki markup
                value = param.value.strip_code().strip()
                if value:
                    result[key] = value
            return result

    return None

strip_code() recursively strips all nested wikitext markup — links, templates, references — leaving only the plain text value. This is much more reliable than regex for deeply nested infobox fields.

Article Metadata in Bulk

The API lets you fetch metadata for up to 50 pages per request using pipe-separated titles. This is dramatically faster than one request per page:

def get_article_metadata(titles: list[str]) -> list[dict]:
    """Fetch metadata for multiple articles in batches of 50."""
    all_metadata = []

    for i in range(0, len(titles), 50):
        batch = titles[i:i + 50]

        data = wiki_query({
            "action": "query",
            "titles": "|".join(batch),
            "prop": "info|pageprops|langlinks|categories",
            "inprop": "protection|url",
            "ppprop": "wikibase_item",
            "lllimit": "500",
            "cllimit": "50",
        })

        pages = data.get("query", {}).get("pages", [])
        for page in pages:
            if "missing" in page:
                continue

            metadata = {
                "title": page["title"],
                "pageid": page["pageid"],
                "length": page.get("length", 0),
                "last_edited": page.get("touched"),
                "url": page.get("canonicalurl", ""),
                "wikidata_id": page.get("pageprops", {}).get("wikibase_item"),
                "languages": [
                    {"lang": ll["lang"], "title": ll["title"]}
                    for ll in page.get("langlinks", [])
                ],
                "language_count": len(page.get("langlinks", [])),
                "categories": [c["title"] for c in page.get("categories", [])],
            }
            all_metadata.append(metadata)

        time.sleep(0.5)
        print(f"  Metadata: {i + len(batch)}/{len(titles)}")

    return all_metadata

The language_count field is a useful proxy for article importance. Major topics tend to have articles in 100+ languages. A topic present in only 3 languages is niche; one present in 80+ is globally significant. This is a quick filter for building importance-ranked datasets.
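As a concrete version of that filter, here is a ranking pass over the dicts that get_article_metadata returns (the threshold of 20 languages is arbitrary — tune it per domain):

```python
def rank_by_importance(metadata: list[dict], min_languages: int = 20) -> list[dict]:
    """Keep articles present in at least min_languages editions, most global first."""
    keep = [m for m in metadata if m.get("language_count", 0) >= min_languages]
    return sorted(keep, key=lambda m: m["language_count"], reverse=True)
```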

Cross-Language Links

Wikipedia's interlanguage link system is one of its most powerful features. Every article links to its equivalent in other languages through Wikidata entity IDs. You can use this to build multilingual datasets without having to match articles by translated titles:

def get_cross_language_titles(title: str, target_langs: list[str] | None = None) -> dict:
    """Get article titles in other languages."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "langlinks",
        "lllimit": "500",
    }

    data = wiki_query(params)
    pages = data.get("query", {}).get("pages", [])

    if not pages:
        return {}

    langlinks = pages[0].get("langlinks", [])
    result = {"en": title}

    for ll in langlinks:
        lang = ll["lang"]
        if target_langs is None or lang in target_langs:
            result[lang] = ll["title"]

    return result


# Get Python article in 5 languages
langs = get_cross_language_titles(
    "Python (programming language)",
    target_langs=["de", "fr", "ja", "pl", "zh"]
)
for lang, title in langs.items():
    print(f"  [{lang}] {title}")
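When you are doing this for many articles, it pays to group titles by edition first, so each language Wikipedia can be hit with batched requests rather than one article at a time. A sketch over plain dicts shaped like get_cross_language_titles output:

```python
from collections import defaultdict

def group_by_edition(lang_maps: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Invert {en_title: {lang: local_title}} into {lang: [local_titles]}."""
    by_lang: defaultdict[str, list[str]] = defaultdict(list)
    for lang_map in lang_maps.values():
        for lang, local_title in lang_map.items():
            if lang != "en":  # the English titles are our keys already
                by_lang[lang].append(local_title)
    return dict(by_lang)
```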

You can then fetch those articles from their respective Wikipedia language editions:

def get_article_in_language(title: str, lang: str) -> dict:
    """Fetch article from a non-English Wikipedia edition."""
    lang_api = f"https://{lang}.wikipedia.org/w/api.php"

    resp = httpx.get(lang_api, params={
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",
    }, headers=HEADERS, timeout=30)
    resp.raise_for_status()

    data = resp.json()
    return {
        "title": title,
        "lang": lang,
        "wikitext": data.get("parse", {}).get("wikitext", ""),
    }

Searching Articles

The MediaWiki API supports full-text search across all article content:

def search_articles(query: str, limit: int = 10) -> list[dict]:
    """Search Wikipedia articles by keyword."""
    data = wiki_query({
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "srnamespace": 0,  # Main namespace only
        "srprop": "snippet|titlesnippet|wordcount|timestamp",
    })

    results = []
    for r in data.get("query", {}).get("search", []):
        results.append({
            "title": r["title"],
            "pageid": r["pageid"],
            "wordcount": r.get("wordcount", 0),
            "snippet": re.sub(r"<[^>]+>", "", r.get("snippet", "")),
            "timestamp": r.get("timestamp"),
        })

    return results


# Search example
results = search_articles("transformer neural network attention", limit=20)
for r in results:
    print(f"  {r['title']} ({r['wordcount']} words)")
    print(f"    {r['snippet'][:100]}...")

Fetching Article Sections

For long articles, you often only need specific sections. The parse action with section support lets you target content precisely:

def get_article_sections(title: str) -> list[dict]:
    """Get the section structure of an article."""
    data = wiki_query({
        "action": "parse",
        "page": title,
        "prop": "sections",
    })
    return data.get("parse", {}).get("sections", [])


def get_section_wikitext(title: str, section_index: int) -> str:
    """Fetch wikitext for a specific section."""
    data = wiki_query({
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "section": section_index,
    })
    return data.get("parse", {}).get("wikitext", "")


# Example: get only the History section
sections = get_article_sections("Python (programming language)")
for s in sections:
    print(f"  [{s['index']}] {'  ' * (int(s['level'])-2)}{s['line']}")

history_section = next(
    (s for s in sections if "history" in s["line"].lower()), None
)
if history_section:
    wikitext = get_section_wikitext("Python (programming language)", int(history_section["index"]))
    print(f"\nHistory section: {len(wikitext)} chars")
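Section headings drift between articles — "History", "Early history", "Release history" — so a small fuzzy matcher over the sections list saves per-article special-casing. A sketch, demonstrated against a hand-built list shaped like the API's sections response:

```python
def find_sections(sections: list[dict], *keywords: str) -> list[dict]:
    """Return sections whose heading contains any keyword (case-insensitive)."""
    lowered = tuple(k.lower() for k in keywords)
    return [
        s for s in sections
        if any(k in s["line"].lower() for k in lowered)
    ]
```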

SQLite Schema for Wikipedia Data

import sqlite3
import json

def init_wikipedia_db(db_path: str = "wikipedia.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS articles (
            title TEXT PRIMARY KEY,
            pageid INTEGER UNIQUE,
            length INTEGER,
            last_edited TEXT,
            url TEXT,
            wikidata_id TEXT,
            language_count INTEGER,
            wikitext TEXT,
            infobox TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS article_languages (
            title TEXT NOT NULL,
            lang TEXT NOT NULL,
            lang_title TEXT NOT NULL,
            PRIMARY KEY (title, lang),
            FOREIGN KEY (title) REFERENCES articles(title)
        );

        CREATE TABLE IF NOT EXISTS category_memberships (
            category TEXT NOT NULL,
            article_title TEXT NOT NULL,
            depth INTEGER DEFAULT 0,
            PRIMARY KEY (category, article_title)
        );

        CREATE INDEX IF NOT EXISTS idx_articles_wikidata
            ON articles(wikidata_id);

        CREATE INDEX IF NOT EXISTS idx_articles_lang_count
            ON articles(language_count DESC);

        CREATE INDEX IF NOT EXISTS idx_cat_article
            ON category_memberships(article_title);
    """)
    conn.commit()
    return conn


def save_article(conn: sqlite3.Connection, article: dict):
    conn.execute(
        """INSERT OR REPLACE INTO articles
           (title, pageid, length, last_edited, url, wikidata_id,
            language_count, wikitext, infobox)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            article.get("title"),
            article.get("pageid"),
            article.get("length", 0),
            article.get("last_edited"),
            article.get("url"),
            article.get("wikidata_id"),
            article.get("language_count", 0),
            article.get("wikitext"),
            json.dumps(article.get("infobox")) if article.get("infobox") else None,
        ),
    )
    conn.commit()


def save_language_links(conn: sqlite3.Connection, title: str, languages: list[dict]):
    conn.executemany(
        "INSERT OR IGNORE INTO article_languages (title, lang, lang_title) VALUES (?, ?, ?)",
        [(title, ll["lang"], ll["title"]) for ll in languages],
    )
    conn.commit()
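With data loaded, the schema answers real questions directly in SQL. For example, the most widely translated articles in a given category — a sketch that assumes articles and category_memberships have been populated as above:

```python
import sqlite3

def top_articles_in_category(conn: sqlite3.Connection, category: str, limit: int = 10) -> list[tuple]:
    """Most widely translated articles in one category, as (title, language_count) rows."""
    return conn.execute(
        """SELECT a.title, a.language_count
           FROM articles a
           JOIN category_memberships cm ON cm.article_title = a.title
           WHERE cm.category = ?
           ORDER BY a.language_count DESC
           LIMIT ?""",
        (category, limit),
    ).fetchall()
```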

Rate Limiting and Best Practices

Wikipedia's API etiquette guidelines boil down to: set a proper User-Agent with contact info, make requests in series rather than in parallel, and back off when the server asks (the maxlag mechanism). There is no published hard rate limit for reads, but in practice 5-10 requests per second is a comfortable rate that won't trigger any throttling.

class WikiThrottle:
    """Rate limiter for MediaWiki API requests."""

    def __init__(self, requests_per_second: float = 5.0):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0.0

    def wait(self):
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()


throttle = WikiThrottle(requests_per_second=5)

For truly large-scale operations — millions of articles across multiple language editions — consider distributing requests through ThorData's residential proxies. Wikipedia does not aggressively block scraping, but spreading load across IPs is good citizenship for any bulk collection that puts meaningful load on their servers. Each proxy IP gets its own rate window.

The practical throughput with proper batching: 50 pages per batch × 5 batches per second = 250 article-equivalents per second. For most projects, storage and processing will be the bottleneck, not the API.

Error Handling

The MediaWiki API generally returns 200 OK even for error conditions — errors are encoded in the JSON body:

import time
import random

def wiki_query_safe(params: dict, max_retries: int = 3) -> dict:
    """MediaWiki API request with error handling and retry."""
    for attempt in range(max_retries):
        try:
            data = wiki_query(params)

            # Check for API-level errors
            if "error" in data:
                code = data["error"].get("code", "unknown")
                info = data["error"].get("info", "")

                if code == "maxlag":
                    # Server under load — back off
                    lag = int(data["error"].get("lag", 5))
                    wait = min(lag * 2, 30)
                    print(f"  API maxlag ({lag}s), waiting {wait}s")
                    time.sleep(wait)
                    continue
                elif code == "ratelimited":
                    time.sleep(random.uniform(10, 20))
                    continue
                else:
                    print(f"  API error: {code} — {info}")
                    return {}

            return data

        except httpx.HTTPStatusError as e:
            if e.response.status_code in (429, 503) and attempt < max_retries - 1:
                time.sleep(2 ** attempt * 5)
                continue
            raise
        except (httpx.ConnectError, httpx.TimeoutException):
            if attempt < max_retries - 1:
                time.sleep(5)
                continue
            raise

    return {}

Wikipedia Dumps for Bulk Access

For operations on tens of millions of articles, skip the API entirely. Wikipedia publishes complete database dumps at dumps.wikimedia.org updated every few weeks. The English Wikipedia compressed dump is approximately 22GB; the full XML with text is around 85GB uncompressed.

# pip install mwxml mwparserfromhell
import mwxml

def process_dump(dump_path: str, output_db: str = "wiki_dump.db", max_articles: int = 0):
    """
    Process a Wikipedia XML dump file and extract infoboxes.
    max_articles=0 means no limit.
    """
    conn = init_wikipedia_db(output_db)
    # For .bz2 dumps, pass bz2.open(dump_path, "rb") here instead of open()
    dump = mwxml.Dump.from_file(open(dump_path, "rb"))

    count = 0
    for page in dump:
        if page.namespace != 0:  # Only main namespace
            continue

        for revision in page:
            wikitext = revision.text or ""
            infobox = extract_infobox_robust(wikitext) if wikitext else None

            save_article(conn, {
                "title": page.title,
                "pageid": page.id,
                "length": len(wikitext),
                "wikitext": wikitext,
                "infobox": infobox,
            })

            count += 1
            if count % 10000 == 0:
                print(f"  Processed {count} articles")
            break  # Only latest revision

        if max_articles and count >= max_articles:
            break

    conn.close()
    print(f"Processed {count} articles total")

Use dumps when you need more than approximately 100,000 articles, want to avoid API rate limits entirely, or need consistent point-in-time data for a research dataset.
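To get a feel for the dump structure without downloading 22GB, here is a toy excerpt walked with the stdlib XML parser. Real dumps declare an XML namespace and carry many more fields per revision — mwxml deals with all of that — but the page/ns/revision/text shape is the same:

```python
import xml.etree.ElementTree as ET

# Toy excerpt mimicking the dump layout (real dumps have an xmlns declaration)
TOY_DUMP = """<mediawiki>
  <page>
    <title>Python (programming language)</title>
    <ns>0</ns>
    <revision><text>{{Infobox programming language | name = Python }}</text></revision>
  </page>
  <page>
    <title>Talk:Python</title>
    <ns>1</ns>
    <revision><text>discussion page</text></revision>
  </page>
</mediawiki>"""

root = ET.fromstring(TOY_DUMP)
articles = [
    page.findtext("title")
    for page in root.iter("page")
    if page.findtext("ns") == "0"  # main namespace only, as in process_dump
]
print(articles)
```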

Full Pipeline

A complete pipeline that crawls a category tree, fetches metadata and infoboxes, saves to SQLite:

def scrape_category_pipeline(
    category: str,
    depth: int = 2,
    db_path: str = "wiki_data.db",
):
    """Full pipeline: category tree -> metadata -> infoboxes -> SQLite."""
    conn = init_wikipedia_db(db_path)
    throttle = WikiThrottle(requests_per_second=5)

    # Phase 1: Category enumeration
    print(f"Crawling category tree: {category} (depth {depth})...")
    tree = get_category_members(category, depth=depth)
    titles = [a["title"] for a in tree["articles"]]
    print(f"Found {len(titles)} articles in {len(tree['subcategories'])} subcategories")

    # Save category memberships
    for a in tree["articles"]:
        conn.execute(
            "INSERT OR IGNORE INTO category_memberships (category, article_title, depth) VALUES (?,?,?)",
            (a["category"], a["title"], a["depth"]),
        )
    conn.commit()

    # Phase 2: Metadata in batches of 50
    print("Fetching article metadata...")
    throttle.wait()
    metadata_list = get_article_metadata(titles)
    meta_map = {m["title"]: m for m in metadata_list}

    # Phase 3: Full wikitext + infoboxes
    print("Fetching wikitext and extracting infoboxes...")
    for i, title in enumerate(titles):
        if i > 0 and i % 50 == 0:
            print(f"  Progress: {i}/{len(titles)}")

        throttle.wait()

        article = meta_map.get(title, {"title": title})

        data = wiki_query_safe({
            "action": "parse",
            "page": title,
            "prop": "wikitext",
        })
        wikitext = data.get("parse", {}).get("wikitext", "")
        article["wikitext"] = wikitext
        article["infobox"] = extract_infobox_robust(wikitext) if wikitext else None

        save_article(conn, article)

        if article.get("languages"):
            save_language_links(conn, title, article["languages"])

    conn.close()
    print(f"Pipeline complete. {len(titles)} articles saved to {db_path}")


# Run it
scrape_category_pipeline("Category:Machine learning", depth=2)

Wikipedia content is CC BY-SA 4.0 licensed. Use it freely as long as you attribute Wikipedia as the source and share derivative works under the same or compatible license. The API itself is open to everyone with no restrictions beyond the rate limits and User-Agent requirement. Wikipedia is one of the cleanest data sources available for both personal and commercial work — cite your source and you are good.