Extracting Structured Data from Wikidata: SPARQL, Entities & Bulk Downloads (2026)
Wikidata is the structured database behind Wikipedia. Much of the structured data you see on Wikipedia (the population of a city, the birth date of a person, the chemical formula of a compound) lives in Wikidata as a queryable statement. Over 100 million items, each with properties and values linked to other items. Free to query, download, and use. No API key needed.
The data quality varies significantly by topic area. Popular entities are maintained by thousands of contributors. Niche entries might be sparse, inconsistent, or entirely missing. Understanding these variations is as important as knowing the technical access methods.
This guide covers the three main access patterns: SPARQL queries via the public endpoint, direct entity API lookups, and bulk dump processing for large-scale extraction.
The Wikidata Data Model
Before you write queries, a little time spent on the data model saves a lot of confusion.
Items are the core entity type. Each has a Q-ID (Q42, Q515, etc.). Items represent things: people, places, concepts, organizations, artworks.
Properties define relationships between items or item-value pairs. P-IDs (P31, P569, etc.). Properties have specific data types: entity references, strings, dates, coordinates, quantities, URLs.
Statements are property-value pairs attached to items, optionally with qualifiers (additional context) and references (sourcing).
Truthy statements (wdt: prefix in SPARQL) represent the best-known current value. The full statement model (p: / ps: prefixes) gives you access to qualifiers, references, and deprecated values. For most use cases, truthy statements are what you want.
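The difference is easiest to see side by side. Two illustrative query strings for city populations (P1082), one truthy, one using the full statement model:

```python
# Truthy form: one triple per fact, best-rank current values only.
TRUTHY_POPULATION = """
SELECT ?city ?population WHERE {
  ?city wdt:P31 wd:Q515.
  ?city wdt:P1082 ?population.
}
LIMIT 10
"""

# Full statement model: bind the statement node itself, which makes
# qualifiers (pq:) and references reachable from ?statement.
FULL_POPULATION = """
SELECT ?city ?population ?pointInTime WHERE {
  ?city wdt:P31 wd:Q515.
  ?city p:P1082 ?statement.
  ?statement ps:P1082 ?population.
  OPTIONAL { ?statement pq:P585 ?pointInTime. }
}
LIMIT 10
"""
```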
Key properties you'll encounter constantly:
- P31 — instance of (what type of thing this is)
- P279 — subclass of (taxonomic hierarchy)
- P17 — country
- P131 — located in administrative division
- P569 / P570 — date of birth / death
- P571 — inception (founding or creation date)
- P18 — image
- P856 — official website
- P625 — coordinate location
- P1082 — population
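For readable logs and reports it helps to carry a small lookup of these IDs. This sketch just restates the list above:

```python
# Labels for the core properties listed above (P-ID -> English label).
CORE_PROPERTIES = {
    "P31": "instance of",
    "P279": "subclass of",
    "P17": "country",
    "P131": "located in administrative division",
    "P569": "date of birth",
    "P570": "date of death",
    "P571": "inception",
    "P18": "image",
    "P856": "official website",
    "P625": "coordinate location",
    "P1082": "population",
}

def describe_property(pid: str) -> str:
    """Return 'P31 (instance of)' style strings; unknown IDs pass through."""
    label = CORE_PROPERTIES.get(pid)
    return f"{pid} ({label})" if label else pid
```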
SPARQL Queries — The Main Access Method
Wikidata exposes a public SPARQL 1.1 endpoint at https://query.wikidata.org/sparql. Queries return JSON, XML, CSV, or TSV.
import httpx
import time
import random
import json
from typing import Optional, Any
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"
def run_sparql(
query: str,
timeout: int = 60,
user_agent: str = "DataBot/1.0 ([email protected])",
retries: int = 3,
) -> list[dict]:
"""
Execute a SPARQL query against the Wikidata Query Service.
Returns list of result rows as dicts.
"""
headers = {
"User-Agent": user_agent,
"Accept": "application/sparql-results+json",
}
for attempt in range(retries):
try:
resp = httpx.get(
WIKIDATA_SPARQL,
params={"query": query, "format": "json"},
headers=headers,
timeout=timeout,
)
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 30))
print(f"Rate limited. Waiting {retry_after}s...")
time.sleep(retry_after)
continue
if resp.status_code == 503:
print(f"Service unavailable (attempt {attempt + 1}). Waiting...")
time.sleep(10 * (attempt + 1))
continue
resp.raise_for_status()
data = resp.json()
results = []
bindings = data.get("results", {}).get("bindings", [])
for binding in bindings:
row = {}
for key, val in binding.items():
value_type = val.get("type")
raw_value = val.get("value", "")
if value_type == "uri":
# Extract Q-ID or P-ID if it's a Wikidata URI
if "entity/Q" in raw_value or "entity/P" in raw_value:
row[key] = raw_value.rsplit("/", 1)[-1]
row[f"{key}_uri"] = raw_value
else:
row[key] = raw_value
                elif value_type == "literal":
                    datatype = val.get("datatype", "")
                    if "integer" in datatype:
                        try:
                            row[key] = int(raw_value)
                        except ValueError:
                            row[key] = raw_value
                    elif "decimal" in datatype or "double" in datatype:
                        try:
                            row[key] = float(raw_value)
                        except ValueError:
                            row[key] = raw_value
                    else:
                        row[key] = raw_value
else:
row[key] = raw_value
results.append(row)
return results
except httpx.TimeoutException:
print(f"Query timed out (attempt {attempt + 1}). Query may be too complex.")
if attempt < retries - 1:
time.sleep(5)
except httpx.HTTPStatusError as e:
if e.response.status_code == 500:
print(f"Server error on attempt {attempt + 1}: likely a query syntax issue")
raise
raise
return []
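For reference, here is the raw bindings shape that the parsing loop above consumes, with a minimal standalone flattening pass over a hand-written sample row (the numbers are illustrative):

```python
# One row from data["results"]["bindings"], as the query service returns it.
sample_binding = {
    "city": {"type": "uri", "value": "http://www.wikidata.org/entity/Q90"},
    "population": {
        "type": "literal",
        "datatype": "http://www.w3.org/2001/XMLSchema#decimal",
        "value": "2145906",
    },
}

def flatten_binding(binding: dict) -> dict:
    """Reduce each {type, value, ...} cell to a plain Python value."""
    row = {}
    for key, cell in binding.items():
        value = cell.get("value", "")
        datatype = cell.get("datatype", "")
        if cell.get("type") == "uri" and "/entity/" in value:
            row[key] = value.rsplit("/", 1)[-1]   # keep just the Q-ID
        elif "integer" in datatype:
            row[key] = int(value)
        elif "decimal" in datatype:
            row[key] = float(value)
        else:
            row[key] = value
    return row
```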
Essential SPARQL Patterns
These are the patterns you'll reuse across most Wikidata projects. Study these before writing your own queries.
# Pattern 1: Instance of + label + properties
# Find all national capitals with their country and population
CAPITALS_QUERY = """
SELECT ?city ?cityLabel ?countryLabel ?population ?coords WHERE {
?city wdt:P31 wd:Q5119. # instance of: capital city
?city wdt:P17 ?country. # in country
OPTIONAL { ?city wdt:P1082 ?population. }
OPTIONAL { ?city wdt:P625 ?coords. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr,de". }
}
ORDER BY DESC(?population)
LIMIT 200
"""
# Pattern 2: Time-bounded queries with FILTER
# Companies founded in the last 2 years with websites
RECENT_COMPANIES_QUERY = """
SELECT ?company ?companyLabel ?countryLabel ?inception ?website WHERE {
?company wdt:P31 wd:Q4830453. # instance of: business enterprise
?company wdt:P571 ?inception. # inception date
?company wdt:P17 ?country.
OPTIONAL { ?company wdt:P856 ?website. }
FILTER(YEAR(?inception) >= 2024)
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?inception)
LIMIT 100
"""
# Pattern 3: Property paths (/ and *) for subclass traversal
# Find all living humans who are politicians (including subtypes)
POLITICIANS_QUERY = """
SELECT ?person ?personLabel ?countryLabel ?birth WHERE {
?person wdt:P31 wd:Q5. # instance of: human
?person wdt:P106/wdt:P279* wd:Q82955. # occupation is politician or subclass
?person wdt:P27 ?country. # country of citizenship
?person wdt:P569 ?birth.
FILTER NOT EXISTS { ?person wdt:P570 ?death. } # still alive
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
# Pattern 4: EXISTS / NOT EXISTS for filtering
# Universities with no English Wikipedia article
UNIVERSITIES_NO_WIKI_QUERY = """
SELECT ?uni ?uniLabel ?countryLabel WHERE {
?uni wdt:P31/wdt:P279* wd:Q3918. # instance of university or subclass
?uni wdt:P17 ?country.
FILTER NOT EXISTS {
?article schema:about ?uni.
?article schema:isPartOf <https://en.wikipedia.org/>.
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
# Pattern 5: Aggregate queries with GROUP BY
# Count Nobel Prize winners by country
NOBEL_BY_COUNTRY_QUERY = """
SELECT ?countryLabel (COUNT(?person) AS ?winners) WHERE {
  ?person wdt:P166/wdt:P279* wd:Q7191. # award received: Nobel Prize or a subclass of it
?person wdt:P27 ?country.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?country ?countryLabel
ORDER BY DESC(?winners)
LIMIT 50
"""
# Pattern 6: UNION for multiple types
# Both films and TV series released in 2025
SCREEN_CONTENT_2025_QUERY = """
SELECT ?item ?itemLabel ?typeLabel WHERE {
{
?item wdt:P31 wd:Q11424. # film
BIND(wd:Q11424 AS ?type)
} UNION {
?item wdt:P31 wd:Q5398426. # television series
BIND(wd:Q5398426 AS ?type)
}
?item wdt:P577 ?release.
FILTER(YEAR(?release) = 2025)
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?itemLabel
LIMIT 200
"""
def run_all_examples():
queries = {
"capitals": CAPITALS_QUERY,
"recent_companies": RECENT_COMPANIES_QUERY,
"politicians": POLITICIANS_QUERY,
"universities_no_wiki": UNIVERSITIES_NO_WIKI_QUERY,
"nobel_by_country": NOBEL_BY_COUNTRY_QUERY,
"screen_2025": SCREEN_CONTENT_2025_QUERY,
}
results = {}
for name, query in queries.items():
print(f"Running {name}...")
results[name] = run_sparql(query)
print(f" Got {len(results[name])} results")
time.sleep(2)
return results
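Since every query comes back as a list of flat dicts, exporting to CSV is mechanical. A sketch using the stdlib csv module; the field-union step handles rows where an OPTIONAL value was absent:

```python
import csv

def rows_to_csv(rows: list[dict], path: str) -> int:
    """Write SPARQL result rows to a CSV file; returns rows written."""
    if not rows:
        return 0
    # Take the union of keys across rows, since OPTIONAL fields may be
    # missing from some rows entirely.
    fields: list[str] = []
    for row in rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```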
Entity API — Direct Lookup by Q-ID
When you already have Wikidata IDs, the MediaWiki API (action=wbgetentities) is faster than SPARQL and handles batch requests well.
def get_entities(
    qids: list[str],
    languages: Optional[list[str]] = None,
    props: Optional[list[str]] = None,
) -> dict[str, dict]:
"""
Fetch Wikidata entities by Q-IDs.
Handles batching (up to 50 IDs per request).
Returns dict mapping Q-ID -> entity data.
"""
if languages is None:
languages = ["en", "fr", "de", "es"]
if props is None:
props = ["labels", "descriptions", "claims", "sitelinks"]
all_entities = {}
batch_size = 50
for i in range(0, len(qids), batch_size):
batch = qids[i:i + batch_size]
params = {
"action": "wbgetentities",
"ids": "|".join(batch),
"format": "json",
"languages": "|".join(languages),
"props": "|".join(props),
}
resp = httpx.get(
"https://www.wikidata.org/w/api.php",
params=params,
headers={"User-Agent": "DataBot/1.0 ([email protected])"},
timeout=30,
)
resp.raise_for_status()
data = resp.json()
for qid, entity in data.get("entities", {}).items():
if "missing" not in entity:
all_entities[qid] = entity
if i + batch_size < len(qids):
time.sleep(0.5)
return all_entities
def extract_label(entity: dict, lang: str = "en") -> str:
"""Extract English label from entity."""
labels = entity.get("labels", {})
if lang in labels:
return labels[lang]["value"]
    # Fall back to related languages, then to any available label
    for alt in ["en-gb", "en-ca", "fr", "de"]:
        if alt in labels:
            return labels[alt]["value"]
    if labels:
        return next(iter(labels.values()))["value"]
    return entity.get("id", "?")
def extract_claim_values(entity: dict, property_id: str) -> list[Any]:
"""
Extract all values for a specific property from an entity.
Handles all Wikidata data types.
"""
claims = entity.get("claims", {}).get(property_id, [])
values = []
for claim in claims:
rank = claim.get("rank", "normal")
if rank == "deprecated":
continue
mainsnak = claim.get("mainsnak", {})
if mainsnak.get("snaktype") != "value":
continue
datavalue = mainsnak.get("datavalue", {})
dtype = datavalue.get("type")
val = datavalue.get("value")
if dtype == "wikibase-entityid":
values.append(val.get("id"))
elif dtype == "string":
values.append(val)
elif dtype == "monolingualtext":
values.append({"text": val.get("text"), "lang": val.get("language")})
elif dtype == "time":
# Return ISO-8601-like string
time_str = val.get("time", "")
# Wikidata format: +1969-07-20T00:00:00Z
values.append(time_str.lstrip("+"))
elif dtype == "quantity":
amount = val.get("amount", "0").lstrip("+")
unit_uri = val.get("unit", "")
unit_qid = unit_uri.rsplit("/", 1)[-1] if "entity/" in unit_uri else None
values.append({"amount": float(amount), "unit": unit_qid})
elif dtype == "globecoordinate":
values.append({
"lat": val.get("latitude"),
"lon": val.get("longitude"),
"precision": val.get("precision"),
})
elif dtype == "url":
values.append(val)
return values
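One subtlety the time branch above glosses over: Wikidata time values also carry a precision field (9 = year, 10 = month, 11 = day), and below day precision the remaining timestamp digits are padding, not data. A standalone helper, assuming the usual +YYYY-MM-DDThh:mm:ssZ form (BCE dates would need extra care):

```python
def format_wikidata_time(time_str: str, precision: int) -> str:
    """Trim a Wikidata timestamp to its stated precision.

    precision 9 = year, 10 = month, 11 = day (the common cases).
    """
    clean = time_str.lstrip("+")
    date_part = clean.split("T", 1)[0]        # e.g. "1969-07-20"
    year, month, day = date_part.split("-")
    if precision <= 9:
        return year
    if precision == 10:
        return f"{year}-{month}"
    return f"{year}-{month}-{day}"
```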
def enrich_entity(entity: dict) -> dict:
"""Build a flat, useful dict from a raw Wikidata entity."""
qid = entity.get("id", "")
label = extract_label(entity)
description = entity.get("descriptions", {}).get("en", {}).get("value", "")
return {
"qid": qid,
"label": label,
"description": description,
"instance_of": extract_claim_values(entity, "P31"),
"country": extract_claim_values(entity, "P17"),
"inception": extract_claim_values(entity, "P571"),
"website": extract_claim_values(entity, "P856"),
"image": extract_claim_values(entity, "P18"),
"wikipedia_en": entity.get("sitelinks", {}).get("enwiki", {}).get("title", ""),
"coordinates": extract_claim_values(entity, "P625"),
"population": extract_claim_values(entity, "P1082"),
}
Advanced SPARQL: Working with Qualifiers and References
Qualifiers add context to statements — "population was 3.5M as of 2020". References give sourcing. Most use cases don't need them, but they matter for historical data and data quality assessment.
# Query with qualifiers: population at a specific point in time
POPULATION_HISTORY_QUERY = """
SELECT ?city ?cityLabel ?population ?pointInTime WHERE {
?city wdt:P31 wd:Q515.
?city p:P1082 ?statement.
?statement ps:P1082 ?population.
?statement pq:P585 ?pointInTime. # pq: = qualifier predicate
FILTER(YEAR(?pointInTime) >= 2010)
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?city DESC(?pointInTime)
LIMIT 200
"""
# Query with references: only statements cited from academic sources
CITED_FACTS_QUERY = """
SELECT ?item ?itemLabel ?property ?value WHERE {
?item wdt:P31 wd:Q5.
?item p:P569 ?birthStatement.
?birthStatement ps:P569 ?value.
?birthStatement prov:wasDerivedFrom ?ref.
?ref pr:P248 ?source. # stated in
?source wdt:P31/wdt:P279* wd:Q5633421. # source is a scientific journal
  BIND(wd:P569 AS ?property)  # the property itself, as an entity
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
# Federated query: cross-reference with DBpedia via SERVICE
# (dbpedia.org is on the WDQS federation allowlist)
FEDERATED_QUERY = """
SELECT ?city ?cityLabel ?population ?dbpediaLink WHERE {
  ?city wdt:P31 wd:Q515.
  ?city wdt:P1082 ?population.
  FILTER(?population > 1000000)
  OPTIONAL {
    SERVICE <https://dbpedia.org/sparql> {
      ?dbpediaLink owl:sameAs ?city.
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
LIMIT 20
"""
"""
def run_sparql_with_paging(
base_query: str,
page_size: int = 1000,
max_items: int = 10000,
delay: float = 3.0,
) -> list[dict]:
"""
Run a SPARQL query with LIMIT/OFFSET pagination.
Note: Wikidata SPARQL has a hard limit of 10,000 rows — for larger
datasets, use multiple filtered queries or the bulk dump.
"""
all_results = []
offset = 0
while len(all_results) < max_items:
paginated = f"{base_query}\nLIMIT {page_size} OFFSET {offset}"
try:
batch = run_sparql(paginated)
except Exception as e:
print(f"Error at offset {offset}: {e}")
break
if not batch:
break
all_results.extend(batch)
print(f" Fetched {len(all_results)} total results...")
if len(batch) < page_size:
break # last page
offset += page_size
time.sleep(delay)
return all_results[:max_items]
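When a single query times out no matter what, the practical move is to shard it along some dimension and run the pieces separately. A hypothetical sketch that expands a template containing a {filter} placeholder into inception-year shards (the placeholder convention is this helper's own, not part of the queries above):

```python
def shard_query_by_year(template: str, start: int, end: int, step: int = 5) -> list[str]:
    """Expand a query template containing '{filter}' into year-range shards.

    Each shard covers [lo, lo + step), so the shards partition [start, end).
    """
    shards = []
    for lo in range(start, end, step):
        hi = min(lo + step, end)
        year_filter = f"FILTER(YEAR(?inception) >= {lo} && YEAR(?inception) < {hi})"
        shards.append(template.replace("{filter}", year_filter))
    return shards
```

Each shard can then be fed through run_sparql individually, with a polite delay between them.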
Rate Limiting, Proxies, and Batch Operations
The Wikidata Query Service enforces rate limits per IP. Authenticated requests (with Wikimedia OAuth) get higher limits, but for most use cases anonymous access is sufficient.
For heavy batch jobs — enriching a dataset of thousands of companies, building training data, running research queries — the per-IP rate limits become the bottleneck. ThorData's residential proxy network distributes requests across multiple IPs, keeping each individual IP well under the threshold.
import requests as req_lib # standard requests for proxy sessions
def batch_sparql_queries(
queries: list[str],
proxy_url: Optional[str] = None,
delay_range: tuple = (2.0, 5.0),
user_agent: str = "DataEnrichmentBot/2.0 ([email protected])",
) -> list[list[dict]]:
"""
Run multiple SPARQL queries with rate limiting and optional proxy rotation.
Each query gets a fresh proxy connection.
"""
all_results = []
for i, query in enumerate(queries):
session = req_lib.Session()
session.headers["User-Agent"] = user_agent
session.headers["Accept"] = "application/sparql-results+json"
if proxy_url:
session.proxies = {"http": proxy_url, "https": proxy_url}
try:
resp = session.get(
WIKIDATA_SPARQL,
params={"query": query, "format": "json"},
timeout=60,
)
if resp.status_code == 429:
wait = int(resp.headers.get("Retry-After", 30))
print(f"Rate limited on query {i}. Waiting {wait}s...")
time.sleep(wait)
resp = session.get(
WIKIDATA_SPARQL,
params={"query": query, "format": "json"},
timeout=60,
)
resp.raise_for_status()
bindings = resp.json().get("results", {}).get("bindings", [])
results = [{k: v["value"] for k, v in row.items()} for row in bindings]
all_results.append(results)
except Exception as e:
print(f"Query {i} failed: {e}")
all_results.append([])
delay = random.uniform(*delay_range)
time.sleep(delay)
session.close()
return all_results
def enrich_entities_from_sparql(
qids: list[str],
properties: list[str],
proxy_url: Optional[str] = None,
batch_size: int = 50,
) -> dict[str, dict]:
"""
Enrich a list of Q-IDs with specific properties using batched SPARQL queries.
More efficient than individual API calls for large batches.
"""
enriched = {}
props_sparql = " ".join([f"OPTIONAL {{ ?item wdt:{p} ?val_{p}. }}" for p in properties])
select_vars = " ".join([f"?val_{p}" for p in properties])
for i in range(0, len(qids), batch_size):
batch = qids[i:i + batch_size]
values_clause = " ".join([f"wd:{qid}" for qid in batch])
query = f"""
SELECT ?item ?itemLabel {select_vars} WHERE {{
VALUES ?item {{ {values_clause} }}
{props_sparql}
SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""
try:
results = run_sparql(query)
for row in results:
qid = row.get("item", "")
if qid:
enriched[qid] = row
except Exception as e:
print(f"Enrichment batch {i//batch_size} failed: {e}")
time.sleep(2.0)
return enriched
Bulk Downloads: Processing the JSON Dumps
For truly large operations (building a local knowledge graph, training data, research datasets), the weekly JSON dump is the practical choice. The full dump is on the order of 100 GB compressed and takes hours to process even on fast hardware.
import gzip
import subprocess
from pathlib import Path
# Wikidata publishes full JSON dumps, regenerated roughly weekly
DUMP_URL = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz"
def download_dump(output_path: str, show_progress: bool = True):
"""
Download Wikidata dump with progress tracking.
Uses wget for resumable downloads.
"""
cmd = ["wget", "--continue", "--show-progress", "-O", output_path, DUMP_URL]
if not show_progress:
cmd = ["wget", "--quiet", "--continue", "-O", output_path, DUMP_URL]
subprocess.run(cmd, check=True)
def stream_wikidata_dump(
dump_path: str,
entity_filter=None,
max_entities: Optional[int] = None,
) -> Any:
"""
Stream and optionally filter a Wikidata JSON dump.
The full dump is ~100GB compressed. Streaming is the only practical approach.
entity_filter: callable(entity) -> bool, or None to return all entities.
"""
count = 0
with gzip.open(dump_path, "rt", encoding="utf-8") as f:
for line in f:
if max_entities and count >= max_entities:
break
line = line.strip().rstrip(",")
if not line or line in ("[", "]"):
continue
try:
entity = json.loads(line)
except json.JSONDecodeError:
continue
if entity_filter is None or entity_filter(entity):
yield entity
count += 1
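The strip-and-skip logic above encodes the dump's line format: one giant JSON array with one entity per line, plus [ and ] lines and trailing commas. A tiny synthetic dump makes it concrete (the entities here are made up):

```python
import gzip
import json

def write_tiny_dump(path: str) -> None:
    """Write a two-entity file in the same line format as latest-all.json.gz."""
    lines = [
        "[",
        json.dumps({"id": "Q1", "type": "item", "claims": {}}) + ",",
        json.dumps({"id": "Q2", "type": "item", "claims": {}}),
        "]",
    ]
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("\n".join(lines))

def read_tiny_dump(path: str) -> list[dict]:
    """Same strip/skip logic as stream_wikidata_dump, inlined for clarity."""
    entities = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            entities.append(json.loads(line))
    return entities
```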
def is_instance_of(qid: str):
"""Create a filter function for entities that are instances of a given type."""
def filter_fn(entity: dict) -> bool:
claims = entity.get("claims", {})
p31 = claims.get("P31", [])
for claim in p31:
try:
val_qid = claim["mainsnak"]["datavalue"]["value"]["id"]
if val_qid == qid:
return True
except (KeyError, TypeError):
continue
return False
return filter_fn
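Filters built this way compose cleanly. A small combinator plus two extra predicates, which can wrap filters like is_instance_of above (the entities passed in would be ordinary dump dicts):

```python
def all_of(*filters):
    """Combine entity filters: an entity passes only if every filter passes."""
    def combined(entity: dict) -> bool:
        return all(f(entity) for f in filters)
    return combined

def is_item(entity: dict) -> bool:
    """True for items, as opposed to properties or lexemes in the dump."""
    return entity.get("type") == "item"

def has_english_label(entity: dict) -> bool:
    """True if the entity carries an English label."""
    return "en" in entity.get("labels", {})
```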
def extract_dump_to_jsonl(
dump_path: str,
output_path: str,
entity_filter=None,
max_entities: Optional[int] = None,
log_every: int = 100000,
):
"""
Extract filtered entities from dump to JSONL file.
JSONL (one JSON object per line) is easier to process than the full dump.
"""
output = Path(output_path)
count = 0
total_seen = 0
with gzip.open(dump_path, "rt", encoding="utf-8") as infile, \
open(output, "w", encoding="utf-8") as outfile:
for line in infile:
total_seen += 1
if total_seen % log_every == 0:
print(f"Processed {total_seen:,} entities, extracted {count:,}")
if max_entities and count >= max_entities:
break
line = line.strip().rstrip(",")
if not line or line in ("[", "]"):
continue
try:
entity = json.loads(line)
except json.JSONDecodeError:
continue
if entity_filter is None or entity_filter(entity):
outfile.write(json.dumps(entity, ensure_ascii=False) + "\n")
count += 1
print(f"Done. Extracted {count:,} entities from {total_seen:,} total.")
return count
# Example: Extract all software entries from the dump
def extract_all_software(dump_path: str):
is_software = is_instance_of("Q7397") # Q7397 = software
def software_filter(entity: dict) -> bool:
if entity.get("type") != "item":
return False
return is_software(entity)
extract_dump_to_jsonl(
dump_path,
"wikidata_software.jsonl",
entity_filter=software_filter,
)
# Efficient: extract multiple entity types in one pass
def extract_multiple_types(dump_path: str, type_qids: list[str], output_path: str):
"""Extract entities matching any of the given types in a single dump pass."""
type_set = set(type_qids)
def multi_type_filter(entity: dict) -> bool:
if entity.get("type") != "item":
return False
claims = entity.get("claims", {})
p31 = claims.get("P31", [])
for claim in p31:
try:
val_qid = claim["mainsnak"]["datavalue"]["value"]["id"]
if val_qid in type_set:
return True
except (KeyError, TypeError):
continue
return False
extract_dump_to_jsonl(dump_path, output_path, entity_filter=multi_type_filter)
Building a Local Wikidata Query Cache
For applications that run the same queries repeatedly, caching SPARQL results locally reduces latency and API load:
import sqlite3
import hashlib
def setup_sparql_cache(db_path: str = "wikidata_cache.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS sparql_cache (
query_hash TEXT PRIMARY KEY,
query_text TEXT,
result_json TEXT,
row_count INTEGER,
cached_at TEXT DEFAULT (datetime('now')),
expires_at TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS entity_cache (
qid TEXT PRIMARY KEY,
entity_json TEXT,
cached_at TEXT DEFAULT (datetime('now')),
expires_at TEXT
)
""")
conn.commit()
return conn
def cached_sparql(
    query: str,
    cache_conn: sqlite3.Connection,
    ttl_hours: int = 24,
) -> list[dict]:
"""Run a SPARQL query with local caching."""
query_hash = hashlib.sha256(query.encode()).hexdigest()
# Check cache
row = cache_conn.execute(
"SELECT result_json, expires_at FROM sparql_cache WHERE query_hash = ?",
(query_hash,)
).fetchone()
if row:
from datetime import datetime
expires_at = row[1]
if expires_at and datetime.fromisoformat(expires_at) > datetime.utcnow():
return json.loads(row[0])
# Cache miss — run query
results = run_sparql(query)
# Store in cache
from datetime import datetime, timedelta
expires = (datetime.utcnow() + timedelta(hours=ttl_hours)).isoformat()
cache_conn.execute("""
INSERT OR REPLACE INTO sparql_cache
(query_hash, query_text, result_json, row_count, expires_at)
VALUES (?, ?, ?, ?, ?)
""", (query_hash, query[:500], json.dumps(results), len(results), expires))
cache_conn.commit()
return results
def cached_entity(
qid: str,
cache_conn: sqlite3.Connection,
ttl_hours: int = 168, # 1 week default
) -> Optional[dict]:
"""Fetch a Wikidata entity with local caching."""
from datetime import datetime, timedelta
row = cache_conn.execute(
"SELECT entity_json, expires_at FROM entity_cache WHERE qid = ?",
(qid,)
).fetchone()
if row:
expires_at = row[1]
if expires_at and datetime.fromisoformat(expires_at) > datetime.utcnow():
return json.loads(row[0])
# Fetch fresh
entities = get_entities([qid])
entity = entities.get(qid)
if entity:
expires = (datetime.utcnow() + timedelta(hours=ttl_hours)).isoformat()
cache_conn.execute(
"INSERT OR REPLACE INTO entity_cache (qid, entity_json, expires_at) VALUES (?, ?, ?)",
(qid, json.dumps(entity), expires)
)
cache_conn.commit()
return entity
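Expired rows accumulate over time; a small maintenance pass keeps the database compact. Table and column names match the schema above; the replace() call normalizes the 'T' separator in stored ISO timestamps so they compare correctly against SQLite's datetime('now'):

```python
import sqlite3

def purge_expired(conn: sqlite3.Connection) -> int:
    """Delete expired rows from both cache tables; returns rows removed."""
    removed = 0
    for table in ("sparql_cache", "entity_cache"):
        cur = conn.execute(
            f"DELETE FROM {table} WHERE expires_at IS NOT NULL "
            "AND replace(expires_at, 'T', ' ') < datetime('now')"
        )
        removed += cur.rowcount
    conn.commit()
    return removed
```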
Practical Use Cases and Data Quality Notes
Property discovery. When exploring a new entity type, run this query to find what properties are most commonly used:
def discover_properties_for_type(type_qid: str, sample_size: int = 100) -> list[dict]:
    """Find the most common properties on a sample of entities of a given type."""
    query = f"""
    SELECT ?prop ?propLabel (COUNT(?item) AS ?count) WHERE {{
      {{
        SELECT ?item WHERE {{ ?item wdt:P31 wd:{type_qid}. }}
        LIMIT {sample_size}
      }}
      ?item ?wdtProp ?value.
      ?prop wikibase:directClaim ?wdtProp.
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    GROUP BY ?prop ?propLabel
    ORDER BY DESC(?count)
    LIMIT 30
    """
    results = run_sparql(query)
return [
{
"property_id": r.get("prop", ""),
"label": r.get("propLabel", ""),
"usage_count": r.get("count", 0),
}
for r in results
]
# Example: What properties do software items commonly have?
# software_props = discover_properties_for_type("Q7397")
Data quality varies by topic. English-language Wikipedia editors tend to maintain English entity coverage well. Entities with active WikiProjects (cities, species, chemicals, films) are usually high-quality. Entities in smaller languages, very new things, or niche areas can be sparse or inconsistent.
Timeouts, not row caps. The public SPARQL endpoint enforces a 60-second timeout per query rather than a fixed result-size limit; queries that scan or return too much data simply time out. For larger datasets, split queries by time range, country, or other dimensions, or use the bulk dump.
Identifier cross-referencing. Wikidata is an excellent hub for linking identifiers across databases. Common properties: P213 (ISNI), P214 (VIAF), P244 (Library of Congress), P356 (DOI), P496 (ORCID), P549 (Mathematics Genealogy Project), P2002 (Twitter username), P4033 (Mastodon address).
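Fetching those identifiers for a known item is a one-query job. A sketch that builds the lookup for a given Q-ID; the property dict restates the list above:

```python
# External-identifier properties from the list above (P-ID -> database name).
EXTERNAL_ID_PROPS = {
    "P213": "ISNI",
    "P214": "VIAF",
    "P244": "Library of Congress",
    "P356": "DOI",
    "P496": "ORCID",
    "P549": "Mathematics Genealogy Project",
    "P2002": "Twitter username",
    "P4033": "Mastodon address",
}

def build_external_id_query(qid: str) -> str:
    """Build SPARQL that returns every matching external identifier for one item."""
    predicates = " ".join(f"wdt:{pid}" for pid in EXTERNAL_ID_PROPS)
    return f"""
SELECT ?prop ?value WHERE {{
  VALUES ?prop {{ {predicates} }}
  wd:{qid} ?prop ?value.
}}
"""
```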
CC0 license. The entire Wikidata dataset is CC0 — public domain. No attribution required (though good practice to credit contributors). This makes it suitable for any use case, including training ML models and commercial applications.
The SPARQL endpoint is the entry point for most work. The entity API is for known-ID lookups. The bulk dump is for truly large-scale extraction. Pick the right tool for the job, add caching for repeated queries, and be thoughtful about query complexity to stay within the endpoint's timeout limits.