Bulk Scrape Wikipedia: Category Trees, Infoboxes & Cross-Language Links (2026)
Wikipedia is one of the few major sites that actively wants you to use their data. They have a proper API, reasonable rate limits, openly encourage reuse under CC BY-SA, and publish complete database dumps for bulk access. The data quality is exceptional — millions of structured articles covering every topic imaginable, maintained by tens of thousands of editors, available in 300+ languages.
Doing this at scale — pulling thousands of articles across category trees, extracting structured infobox data, mapping cross-language equivalents — still requires planning. The MediaWiki API has quirks, pagination is mandatory for large result sets, and infobox parsing requires dealing with years of inconsistent editor formatting.
This guide covers every major use case: category tree traversal, infobox extraction, article metadata, cross-language links, bulk fetching, SQLite storage, and full pipeline assembly.
Why Wikipedia Is an Exceptional Data Source
Before getting into code, it is worth understanding what makes Wikipedia different from other web scraping targets:
Officially encouraged. Wikipedia's API exists because the Wikimedia Foundation wants people to build on their data. Setting a proper User-Agent with contact information is a courtesy, not a legal shield.
Freely licensed. Article content is CC BY-SA 4.0. You can republish, transform, and build commercial products on it as long as you attribute and share alike. This is extremely rare among data sources of this quality.
Structured data layer. Infoboxes contain semi-structured data for millions of articles. Countries, cities, companies, chemicals, species, films, albums — all have typed infobox templates with named fields.
Multilingual. The same entity exists in 300+ language editions. Cross-language links via Wikidata let you map "Python (programming language)" to its German, Japanese, and Polish equivalents in one API call.
Wikidata integration. Every notable Wikipedia article links to a Wikidata entity, giving you access to even more structured data through a separate SPARQL query interface.
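The SPARQL interface lives at the real endpoint https://query.wikidata.org/sparql. As a minimal sketch of what a query looks like, here is a helper that builds a SPARQL query fetching an entity's labels in several languages; the function name `build_label_query` is illustrative, while the Q-ID Q28865 (Python, the programming language) and the `wd:`/`rdfs:label` vocabulary are real Wikidata conventions:

```python
def build_label_query(qid: str, langs: list[str]) -> str:
    """Build a SPARQL query fetching an entity's labels in the given languages."""
    lang_filter = ", ".join(f'"{lang}"' for lang in langs)
    return (
        "SELECT ?lang ?label WHERE { "
        f"wd:{qid} rdfs:label ?label . "
        "BIND(LANG(?label) AS ?lang) "
        f"FILTER(?lang IN ({lang_filter})) }}"
    )

# Q28865 is the Wikidata entity for Python (programming language)
query = build_label_query("Q28865", ["en", "de", "ja"])
print(query)
```

You would send this as a GET request to the endpoint with `format=json`, using the same headers-and-timeout pattern as the `wiki_query` helper shown later in this guide.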
The MediaWiki API
Wikipedia runs on MediaWiki, and the API lives at https://en.wikipedia.org/w/api.php. No API key, no authentication. You make requests and get JSON.
The API uses an action parameter to determine the operation. The three you will use most:
action=query — fetch article content, categories, metadata, page properties
action=parse — get rendered HTML or raw wikitext from an article
list=categorymembers (a sub-module of action=query) — list all articles and subcategories in a category
import httpx
import time
API_URL = "https://en.wikipedia.org/w/api.php"
# Always set a descriptive User-Agent with contact info, per Wikimedia's User-Agent policy
HEADERS = {
"User-Agent": "YourProjectName/1.0 ([email protected]) python-httpx/0.27",
}
def wiki_query(params: dict) -> dict:
"""Make a MediaWiki API request with default params."""
defaults = {
"format": "json",
"formatversion": "2",
}
params = {**defaults, **params}
response = httpx.get(
API_URL, params=params, headers=HEADERS, timeout=30
)
response.raise_for_status()
return response.json()
Crawling Category Trees
Wikipedia categories are hierarchical. Category:Programming languages contains subcategories like Category:Python (programming language) which contains individual articles. To get all articles in a category tree, you need recursive traversal with pagination.
from collections import deque
def get_category_members(category: str, depth: int = 3) -> dict:
"""
Recursively get all articles and subcategories in a category tree.
Returns {'articles': [...], 'subcategories': [...]}.
"""
articles = []
subcategories = []
visited = set()
queue = deque([(category, 0)])
while queue:
cat, current_depth = queue.popleft()
if cat in visited or current_depth > depth:
continue
visited.add(cat)
cmcontinue = None
page_articles = 0
while True:
params = {
"action": "query",
"list": "categorymembers",
"cmtitle": cat,
"cmlimit": "500",
"cmprop": "title|type|timestamp",
}
if cmcontinue:
params["cmcontinue"] = cmcontinue
data = wiki_query(params)
members = data.get("query", {}).get("categorymembers", [])
for member in members:
if member["type"] == "subcat":
subcategories.append(member["title"])
if current_depth < depth:
queue.append((member["title"], current_depth + 1))
elif member["type"] == "page":
articles.append({
"title": member["title"],
"category": cat,
"depth": current_depth,
"timestamp": member.get("timestamp"),
})
page_articles += 1
# Handle pagination — categories can have 500+ members
if "continue" in data:
cmcontinue = data["continue"]["cmcontinue"]
else:
break
time.sleep(0.1)
print(f" {cat}: {page_articles} articles")
time.sleep(0.2)
return {"articles": articles, "subcategories": subcategories}
# Usage
result = get_category_members("Category:Python (programming language)", depth=2)
print(f"Found {len(result['articles'])} articles in {len(result['subcategories'])} subcategories")
A note on depth: Wikipedia categories are loosely organized and deeply nested. Going beyond depth 3 can pull tens of thousands of articles because high-level categories like Category:Science ultimately contain everything. Start at depth 1 or 2, inspect what you have, then increase if needed.
The cmcontinue token is essential — category listings cap at 500 members per request. Any category with more members requires multiple requests with continuation tokens.
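The same continuation handshake applies to every list= module, not just categorymembers, so it can be factored into a generic generator. A sketch, with the fetch function injectable so it can be exercised offline; the name `iter_all` is illustrative, and in real use you would pass in the `wiki_query` helper from above:

```python
from typing import Callable, Iterator

def iter_all(fetch: Callable[[dict], dict], params: dict, list_name: str) -> Iterator[dict]:
    """Yield every item from a list= query, following continuation tokens.

    The continuation dict the API returns is merged into the next
    request verbatim, which is the documented way to resume a listing.
    """
    params = dict(params)  # don't mutate the caller's dict
    while True:
        data = fetch(params)
        yield from data.get("query", {}).get(list_name, [])
        if "continue" not in data:
            break
        # The API hands back e.g. {"cmcontinue": "...", "continue": "-||"}
        params.update(data["continue"])

# Offline demo with a stub fetcher that paginates two one-item batches
def fake_fetch(params):
    if "cmcontinue" not in params:
        return {"query": {"categorymembers": [{"title": "A"}]},
                "continue": {"cmcontinue": "tok", "continue": "-||"}}
    return {"query": {"categorymembers": [{"title": "B"}]}}

print([m["title"] for m in iter_all(fake_fetch, {}, "categorymembers")])  # ['A', 'B']
```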
Extracting Infoboxes
Infoboxes are the structured data panels on the right side of Wikipedia articles. They contain the most machine-readable information — population figures, coordinates, release dates, chemical formulas, film budgets, sports statistics. Country articles have geographic infoboxes; company articles have business infoboxes; film articles have film infoboxes — each with predictable field names.
The approach: use the parse action with the wikitext property, then parse the infobox template with regex or a proper wikitext parser.
import re
def extract_infobox(title: str) -> dict | None:
"""Extract infobox data from a Wikipedia article."""
data = wiki_query({
"action": "parse",
"page": title,
"prop": "wikitext",
})
wikitext = data.get("parse", {}).get("wikitext", "")
if not wikitext:
return None
    # Naive match: grab from "{{Infobox" to the first "\n}}". Nested
    # templates inside the infobox can terminate the match early.
    infobox_match = re.search(
        r"\{\{Infobox(.+?)(?:\n\}\})", wikitext, re.DOTALL | re.IGNORECASE
    )
if not infobox_match:
return None
infobox_text = infobox_match.group(1)
result = {"_type": "Infobox"}
# Extract type from first line
first_line = infobox_text.split("\n")[0].strip()
if first_line:
result["_type"] = f"Infobox {first_line}"
# Parse key-value pairs
for match in re.finditer(
r"\|\s*(\w[\w\s]*?)\s*=\s*(.+?)(?=\n\||\n\}\}|$)",
infobox_text, re.DOTALL
):
key = match.group(1).strip().lower().replace(" ", "_")
value = match.group(2).strip()
# Clean up wiki markup
value = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", value) # [[Link|Text]] -> Text
value = re.sub(r"\{\{.*?\}\}", "", value) # Remove templates
value = re.sub(r"<ref[^>]*>.*?</ref>", "", value, flags=re.DOTALL)
value = re.sub(r"<[^>]+>", "", value) # Remove HTML tags
value = value.strip()
if value:
result[key] = value
return result
# Usage
info = extract_infobox("Python (programming language)")
if info:
for k, v in list(info.items())[:8]:
print(f" {k}: {v[:80]}")
Infobox parsing is inherently messy. Editors use inconsistent formatting, nested templates, and inline HTML. The regex approach handles 80-90% of cases. For production use, the mwparserfromhell library parses wikitext as a proper grammar rather than with regex:
# pip install mwparserfromhell
import mwparserfromhell
def extract_infobox_robust(wikitext: str) -> dict | None:
"""Parse infobox using mwparserfromhell for reliable extraction."""
parsed = mwparserfromhell.parse(wikitext)
for template in parsed.filter_templates():
name = str(template.name).strip().lower()
if name.startswith("infobox"):
result = {"_type": str(template.name).strip()}
for param in template.params:
key = str(param.name).strip().lower().replace(" ", "_")
# strip_code() removes nested wiki markup
value = param.value.strip_code().strip()
if value:
result[key] = value
return result
return None
strip_code() recursively strips all nested wikitext markup — links, templates, references — leaving only the plain text value. This is much more reliable than regex for deeply nested infobox fields.
Article Metadata in Bulk
The API lets you fetch metadata for up to 50 pages per request using pipe-separated titles. This is dramatically faster than one request per page:
def get_article_metadata(titles: list[str]) -> list[dict]:
"""Fetch metadata for multiple articles in batches of 50."""
all_metadata = []
for i in range(0, len(titles), 50):
batch = titles[i:i + 50]
data = wiki_query({
"action": "query",
"titles": "|".join(batch),
"prop": "info|pageprops|langlinks|categories",
"inprop": "protection|url",
"ppprop": "wikibase_item",
"lllimit": "500",
"cllimit": "50",
})
pages = data.get("query", {}).get("pages", [])
for page in pages:
if "missing" in page:
continue
metadata = {
"title": page["title"],
"pageid": page["pageid"],
"length": page.get("length", 0),
"last_edited": page.get("touched"),
"url": page.get("canonicalurl", ""),
"wikidata_id": page.get("pageprops", {}).get("wikibase_item"),
"languages": [
{"lang": ll["lang"], "title": ll["title"]}
for ll in page.get("langlinks", [])
],
"language_count": len(page.get("langlinks", [])),
"categories": [c["title"] for c in page.get("categories", [])],
}
all_metadata.append(metadata)
time.sleep(0.5)
print(f" Metadata: {i + len(batch)}/{len(titles)}")
return all_metadata
The language_count field is a useful proxy for article importance. Major topics tend to have articles in 100+ languages. A topic present in only 3 languages is niche; one present in 80+ is globally significant. This is a quick filter for building importance-ranked datasets.
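That filter is a one-line SQL query against the schema shown later in this guide. A self-contained sketch with sample rows in an in-memory database; the cutoff of 40 languages is arbitrary and worth tuning for your dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT PRIMARY KEY, language_count INTEGER)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("Python (programming language)", 120),
        ("Obscure local dialect", 3),
        ("Machine learning", 95),
    ],
)

# Keep only globally significant topics; 40 is an arbitrary cutoff
rows = conn.execute(
    "SELECT title FROM articles WHERE language_count >= 40"
    " ORDER BY language_count DESC"
).fetchall()
print([r[0] for r in rows])  # ['Python (programming language)', 'Machine learning']
```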
Cross-Language Links
Wikipedia's interlanguage link system is one of its most powerful features. Every article links to its equivalent in other languages through Wikidata entity IDs. You can use this to build multilingual datasets without having to match articles by translated titles:
def get_cross_language_titles(title: str, target_langs: list[str] | None = None) -> dict:
"""Get article titles in other languages."""
params = {
"action": "query",
"titles": title,
"prop": "langlinks",
"lllimit": "500",
}
data = wiki_query(params)
pages = data.get("query", {}).get("pages", [])
if not pages:
return {}
langlinks = pages[0].get("langlinks", [])
result = {"en": title}
for ll in langlinks:
lang = ll["lang"]
if target_langs is None or lang in target_langs:
result[lang] = ll["title"]
return result
# Get Python article in 5 languages
langs = get_cross_language_titles(
"Python (programming language)",
target_langs=["de", "fr", "ja", "pl", "zh"]
)
for lang, title in langs.items():
print(f" [{lang}] {title}")
You can then fetch those articles from their respective Wikipedia language editions:
def get_article_in_language(title: str, lang: str) -> dict:
"""Fetch article from a non-English Wikipedia edition."""
lang_api = f"https://{lang}.wikipedia.org/w/api.php"
resp = httpx.get(lang_api, params={
"action": "parse",
"page": title,
"prop": "wikitext",
"format": "json",
"formatversion": "2",
}, headers=HEADERS, timeout=30)
resp.raise_for_status()
data = resp.json()
return {
"title": title,
"lang": lang,
"wikitext": data.get("parse", {}).get("wikitext", ""),
}
Full Text Search
The MediaWiki API supports full-text search across all article content:
def search_articles(query: str, limit: int = 10) -> list[dict]:
"""Search Wikipedia articles by keyword."""
data = wiki_query({
"action": "query",
"list": "search",
"srsearch": query,
"srlimit": limit,
"srnamespace": 0, # Main namespace only
"srprop": "snippet|titlesnippet|wordcount|timestamp",
})
results = []
for r in data.get("query", {}).get("search", []):
results.append({
"title": r["title"],
"pageid": r["pageid"],
"wordcount": r.get("wordcount", 0),
"snippet": re.sub(r"<[^>]+>", "", r.get("snippet", "")),
"timestamp": r.get("timestamp"),
})
return results
# Search example
results = search_articles("transformer neural network attention", limit=20)
for r in results:
print(f" {r['title']} ({r['wordcount']} words)")
print(f" {r['snippet'][:100]}...")
Fetching Article Sections
For long articles, you often only need specific sections. The parse action with section support lets you target content precisely:
def get_article_sections(title: str) -> list[dict]:
"""Get the section structure of an article."""
data = wiki_query({
"action": "parse",
"page": title,
"prop": "sections",
})
return data.get("parse", {}).get("sections", [])
def get_section_wikitext(title: str, section_index: int) -> str:
"""Fetch wikitext for a specific section."""
data = wiki_query({
"action": "parse",
"page": title,
"prop": "wikitext",
"section": section_index,
})
return data.get("parse", {}).get("wikitext", "")
# Example: get only the History section
sections = get_article_sections("Python (programming language)")
for s in sections:
print(f" [{s['index']}] {' ' * (int(s['level'])-2)}{s['line']}")
history_section = next(
(s for s in sections if "history" in s["line"].lower()), None
)
if history_section:
wikitext = get_section_wikitext("Python (programming language)", int(history_section["index"]))
print(f"\nHistory section: {len(wikitext)} chars")
SQLite Schema for Wikipedia Data
import sqlite3
import json
def init_wikipedia_db(db_path: str = "wikipedia.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS articles (
title TEXT PRIMARY KEY,
pageid INTEGER UNIQUE,
length INTEGER,
last_edited TEXT,
url TEXT,
wikidata_id TEXT,
language_count INTEGER,
wikitext TEXT,
infobox TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS article_languages (
title TEXT NOT NULL,
lang TEXT NOT NULL,
lang_title TEXT NOT NULL,
PRIMARY KEY (title, lang),
FOREIGN KEY (title) REFERENCES articles(title)
);
CREATE TABLE IF NOT EXISTS category_memberships (
category TEXT NOT NULL,
article_title TEXT NOT NULL,
depth INTEGER DEFAULT 0,
PRIMARY KEY (category, article_title)
);
CREATE INDEX IF NOT EXISTS idx_articles_wikidata
ON articles(wikidata_id);
CREATE INDEX IF NOT EXISTS idx_articles_lang_count
ON articles(language_count DESC);
CREATE INDEX IF NOT EXISTS idx_cat_article
ON category_memberships(article_title);
""")
conn.commit()
return conn
def save_article(conn: sqlite3.Connection, article: dict):
conn.execute(
"""INSERT OR REPLACE INTO articles
(title, pageid, length, last_edited, url, wikidata_id,
language_count, wikitext, infobox)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
article.get("title"),
article.get("pageid"),
article.get("length", 0),
article.get("last_edited"),
article.get("url"),
article.get("wikidata_id"),
article.get("language_count", 0),
article.get("wikitext"),
json.dumps(article.get("infobox")) if article.get("infobox") else None,
),
)
conn.commit()
def save_language_links(conn: sqlite3.Connection, title: str, languages: list[dict]):
conn.executemany(
"INSERT OR IGNORE INTO article_languages (title, lang, lang_title) VALUES (?, ?, ?)",
[(title, ll["lang"], ll["title"]) for ll in languages],
)
conn.commit()
Rate Limiting and Best Practices
Wikipedia's API etiquette asks for two things: a descriptive User-Agent with contact info, and requests made in series rather than in parallel. There is no published hard rate limit for reads. In practice, 5-10 requests per second is a comfortable rate that won't trigger any throttling.
class WikiThrottle:
"""Rate limiter for MediaWiki API requests."""
def __init__(self, requests_per_second: float = 5.0):
self.min_interval = 1.0 / requests_per_second
self.last_request = 0.0
def wait(self):
elapsed = time.time() - self.last_request
if elapsed < self.min_interval:
time.sleep(self.min_interval - elapsed)
self.last_request = time.time()
throttle = WikiThrottle(requests_per_second=5)
For truly large-scale operations — millions of articles across multiple language editions — consider distributing requests through ThorData's residential proxies. Wikipedia does not aggressively block scraping, but spreading load across IPs is good citizenship for any bulk collection that puts meaningful load on their servers. Each proxy IP gets its own rate window.
The practical throughput with proper batching: 50 pages per batch × 5 batches per second = 250 article-equivalents per second. For most projects, storage and processing will be the bottleneck, not the API.
Error Handling
The MediaWiki API generally returns 200 OK even for error conditions — errors are encoded in the JSON body:
import time
import random
def wiki_query_safe(params: dict, max_retries: int = 3) -> dict:
"""MediaWiki API request with error handling and retry."""
for attempt in range(max_retries):
try:
data = wiki_query(params)
# Check for API-level errors
if "error" in data:
code = data["error"].get("code", "unknown")
info = data["error"].get("info", "")
if code == "maxlag":
# Server under load — back off
lag = int(data["error"].get("lag", 5))
wait = min(lag * 2, 30)
print(f" API maxlag ({lag}s), waiting {wait}s")
time.sleep(wait)
continue
elif code == "ratelimited":
time.sleep(random.uniform(10, 20))
continue
else:
print(f" API error: {code} — {info}")
return {}
return data
except httpx.HTTPStatusError as e:
if e.response.status_code in (429, 503) and attempt < max_retries - 1:
time.sleep(2 ** attempt * 5)
continue
raise
except (httpx.ConnectError, httpx.TimeoutException):
if attempt < max_retries - 1:
time.sleep(5)
continue
raise
return {}
Wikipedia Dumps for Bulk Access
For operations on tens of millions of articles, skip the API entirely. Wikipedia publishes complete database dumps at dumps.wikimedia.org updated every few weeks. The English Wikipedia compressed dump is approximately 22GB; the full XML with text is around 85GB uncompressed.
# pip install mwxml mwparserfromhell
import bz2
import mwxml
def process_dump(dump_path: str, output_db: str = "wiki_dump.db", max_articles: int = 0):
    """
    Process a Wikipedia XML dump file and extract infoboxes.
    max_articles=0 means no limit.
    """
    conn = init_wikipedia_db(output_db)
    # Dumps ship as .xml.bz2; bz2.open streams them without decompressing to disk first
    opener = bz2.open if dump_path.endswith(".bz2") else open
    dump = mwxml.Dump.from_file(opener(dump_path, "rb"))
count = 0
for page in dump:
if page.namespace != 0: # Only main namespace
continue
for revision in page:
wikitext = revision.text or ""
infobox = extract_infobox_robust(wikitext) if wikitext else None
save_article(conn, {
"title": page.title,
"pageid": page.id,
"length": len(wikitext),
"wikitext": wikitext,
"infobox": infobox,
})
count += 1
if count % 10000 == 0:
print(f" Processed {count} articles")
break # Only latest revision
if max_articles and count >= max_articles:
break
conn.close()
print(f"Processed {count} articles total")
Use dumps when you need more than approximately 100,000 articles, want to avoid API rate limits entirely, or need consistent point-in-time data for a research dataset.
Full Pipeline
A complete pipeline that crawls a category tree, fetches metadata and infoboxes, saves to SQLite:
def scrape_category_pipeline(
category: str,
depth: int = 2,
db_path: str = "wiki_data.db",
):
"""Full pipeline: category tree -> metadata -> infoboxes -> SQLite."""
conn = init_wikipedia_db(db_path)
throttle = WikiThrottle(requests_per_second=5)
# Phase 1: Category enumeration
print(f"Crawling category tree: {category} (depth {depth})...")
tree = get_category_members(category, depth=depth)
titles = [a["title"] for a in tree["articles"]]
print(f"Found {len(titles)} articles in {len(tree['subcategories'])} subcategories")
# Save category memberships
for a in tree["articles"]:
conn.execute(
"INSERT OR IGNORE INTO category_memberships (category, article_title, depth) VALUES (?,?,?)",
(a["category"], a["title"], a["depth"]),
)
conn.commit()
# Phase 2: Metadata in batches of 50
print("Fetching article metadata...")
throttle.wait()
metadata_list = get_article_metadata(titles)
meta_map = {m["title"]: m for m in metadata_list}
# Phase 3: Full wikitext + infoboxes
print("Fetching wikitext and extracting infoboxes...")
for i, title in enumerate(titles):
if i > 0 and i % 50 == 0:
print(f" Progress: {i}/{len(titles)}")
throttle.wait()
article = meta_map.get(title, {"title": title})
data = wiki_query_safe({
"action": "parse",
"page": title,
"prop": "wikitext",
})
wikitext = data.get("parse", {}).get("wikitext", "")
article["wikitext"] = wikitext
article["infobox"] = extract_infobox_robust(wikitext) if wikitext else None
save_article(conn, article)
if article.get("languages"):
save_language_links(conn, title, article["languages"])
conn.close()
print(f"Pipeline complete. {len(titles)} articles saved to {db_path}")
# Run it
scrape_category_pipeline("Category:Machine learning", depth=2)
Legal Notes
Wikipedia content is CC BY-SA 4.0 licensed. Use it freely as long as you attribute Wikipedia as the source and share derivative works under the same or compatible license. The API itself is open to everyone with no restrictions beyond the rate limits and User-Agent requirement. Wikipedia is one of the cleanest data sources available for both personal and commercial work — cite your source and you are good.