Extracting Structured Data from Wikidata: SPARQL, Entities & Bulk Downloads (2026)
Wikidata is the structured database behind Wikipedia. Much of the structured data you see on Wikipedia (the population of a city, the birth date of a person, the chemical formula of a compound) lives in Wikidata as a queryable statement. Over 100 million items, each with properties and values linked to other items. Free to query, download, and use. No API key needed.
The data quality varies significantly by topic area. Popular entities are maintained by thousands of contributors. Niche entries might be sparse, inconsistent, or entirely missing. Understanding these variations is as important as knowing the technical access methods.
This guide covers the three main access patterns: SPARQL queries via the public endpoint, direct entity API lookups, and bulk dump processing for large-scale extraction.
The Wikidata Data Model
Before you write queries, a little time spent on the data model saves a lot of confusion.
Items are the core entity type. Each has a Q-ID (Q42, Q515, etc.). Items represent things: people, places, concepts, organizations, artworks.
Properties define relationships between items or item-value pairs. P-IDs (P31, P569, etc.). Properties have specific data types: entity references, strings, dates, coordinates, quantities, URLs.
Statements are property-value pairs attached to items, optionally with qualifiers (additional context) and references (sourcing).
Truthy statements (wdt: prefix in SPARQL) represent the best-known current value. The full statement model (p: / ps: prefixes) gives you access to qualifiers, references, and deprecated values. For most use cases, truthy statements are what you want.
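The difference is easiest to see side by side. Two illustrative query strings for city populations (P1082), one truthy, one using the full statement model:

```python
# Truthy form: one triple per fact, best-rank current values only.
TRUTHY_POPULATION = """
SELECT ?city ?population WHERE {
  ?city wdt:P31 wd:Q515.
  ?city wdt:P1082 ?population.
}
LIMIT 10
"""

# Full statement model: bind the statement node itself, which makes
# qualifiers (pq:) and references reachable from ?statement.
FULL_POPULATION = """
SELECT ?city ?population ?pointInTime WHERE {
  ?city wdt:P31 wd:Q515.
  ?city p:P1082 ?statement.
  ?statement ps:P1082 ?population.
  OPTIONAL { ?statement pq:P585 ?pointInTime. }
}
LIMIT 10
"""
```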
Key properties you'll encounter constantly:
- P31 — instance of (what type of thing this is)
- P279 — subclass of (taxonomic hierarchy)
- P17 — country
- P131 — located in administrative division
- P569 / P570 — date of birth / death
- P571 — inception (founding or creation date)
- P18 — image
- P856 — official website
- P625 — coordinate location
- P1082 — population
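For readable logs and reports it helps to carry a small lookup of these IDs. This sketch just restates the list above:

```python
# Labels for the core properties listed above (P-ID -> English label).
CORE_PROPERTIES = {
    "P31": "instance of",
    "P279": "subclass of",
    "P17": "country",
    "P131": "located in administrative division",
    "P569": "date of birth",
    "P570": "date of death",
    "P571": "inception",
    "P18": "image",
    "P856": "official website",
    "P625": "coordinate location",
    "P1082": "population",
}

def describe_property(pid: str) -> str:
    """Return 'P31 (instance of)' style strings; unknown IDs pass through."""
    label = CORE_PROPERTIES.get(pid)
    return f"{pid} ({label})" if label else pid
```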
SPARQL Queries — The Main Access Method
Wikidata exposes a public SPARQL 1.1 endpoint at https://query.wikidata.org/sparql. Queries return JSON, XML, CSV, or TSV.
import httpx
import time
import random
import json
from typing import Optional, Any
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"
def run_sparql(
query: str,
timeout: int = 60,
user_agent: str = "DataBot/1.0 ([email protected])",
retries: int = 3,
) -> list[dict]:
"""
Execute a SPARQL query against the Wikidata Query Service.
Returns list of result rows as dicts.
"""
headers = {
"User-Agent": user_agent,
"Accept": "application/sparql-results+json",
}
for attempt in range(retries):
try:
resp = httpx.get(
WIKIDATA_SPARQL,
params={"query": query, "format": "json"},
headers=headers,
timeout=timeout,
)
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 30))
print(f"Rate limited. Waiting {retry_after}s...")
time.sleep(retry_after)
continue
if resp.status_code == 503:
print(f"Service unavailable (attempt {attempt + 1}). Waiting...")
time.sleep(10 * (attempt + 1))
continue
resp.raise_for_status()
data = resp.json()
results = []
bindings = data.get("results", {}).get("bindings", [])
for binding in bindings:
row = {}
for key, val in binding.items():
value_type = val.get("type")
raw_value = val.get("value", "")
if value_type == "uri":
# Extract Q-ID or P-ID if it's a Wikidata URI
if "entity/Q" in raw_value or "entity/P" in raw_value:
row[key] = raw_value.rsplit("/", 1)[-1]
row[f"{key}_uri"] = raw_value
else:
row[key] = raw_value
                elif value_type == "literal":
                    datatype = val.get("datatype", "")
                    if "integer" in datatype:
                        try:
                            row[key] = int(raw_value)
                        except ValueError:
                            row[key] = raw_value
                    elif "decimal" in datatype or "double" in datatype:
                        try:
                            row[key] = float(raw_value)
                        except ValueError:
                            row[key] = raw_value
                    else:
                        row[key] = raw_value
else:
row[key] = raw_value
results.append(row)
return results
except httpx.TimeoutException:
print(f"Query timed out (attempt {attempt + 1}). Query may be too complex.")
if attempt < retries - 1:
time.sleep(5)
except httpx.HTTPStatusError as e:
if e.response.status_code == 500:
print(f"Server error on attempt {attempt + 1}: likely a query syntax issue")
raise
raise
return []
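For reference, here is the raw bindings shape that the parsing loop above consumes, with a minimal standalone flattening pass over a hand-written sample row (the numbers are illustrative):

```python
# One row from data["results"]["bindings"], as the query service returns it.
sample_binding = {
    "city": {"type": "uri", "value": "http://www.wikidata.org/entity/Q90"},
    "population": {
        "type": "literal",
        "datatype": "http://www.w3.org/2001/XMLSchema#decimal",
        "value": "2145906",
    },
}

def flatten_binding(binding: dict) -> dict:
    """Reduce each {type, value, ...} cell to a plain Python value."""
    row = {}
    for key, cell in binding.items():
        value = cell.get("value", "")
        datatype = cell.get("datatype", "")
        if cell.get("type") == "uri" and "/entity/" in value:
            row[key] = value.rsplit("/", 1)[-1]   # keep just the Q-ID
        elif "integer" in datatype:
            row[key] = int(value)
        elif "decimal" in datatype:
            row[key] = float(value)
        else:
            row[key] = value
    return row
```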
Essential SPARQL Patterns
These are the patterns you'll reuse across most Wikidata projects. Study these before writing your own queries.
# Pattern 1: Instance of + label + properties
# Find all national capitals with their country and population
CAPITALS_QUERY = """
SELECT ?city ?cityLabel ?countryLabel ?population ?coords WHERE {
?city wdt:P31 wd:Q5119. # instance of: capital city
?city wdt:P17 ?country. # in country
OPTIONAL { ?city wdt:P1082 ?population. }
OPTIONAL { ?city wdt:P625 ?coords. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,fr,de". }
}
ORDER BY DESC(?population)
LIMIT 200
"""
# Pattern 2: Time-bounded queries with FILTER
# Companies founded in the last 2 years with websites
RECENT_COMPANIES_QUERY = """
SELECT ?company ?companyLabel ?countryLabel ?inception ?website WHERE {
?company wdt:P31 wd:Q4830453. # instance of: business enterprise
?company wdt:P571 ?inception. # inception date
?company wdt:P17 ?country.
OPTIONAL { ?company wdt:P856 ?website. }
FILTER(YEAR(?inception) >= 2024)
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?inception)
LIMIT 100
"""
# Pattern 3: Property paths (/ and *) for subclass traversal
# Find all living humans who are politicians (including subtypes)
POLITICIANS_QUERY = """
SELECT ?person ?personLabel ?countryLabel ?birth WHERE {
?person wdt:P31 wd:Q5. # instance of: human
?person wdt:P106/wdt:P279* wd:Q82955. # occupation is politician or subclass
?person wdt:P27 ?country. # country of citizenship
?person wdt:P569 ?birth.
FILTER NOT EXISTS { ?person wdt:P570 ?death. } # still alive
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
# Pattern 4: EXISTS / NOT EXISTS for filtering
# Universities with no English Wikipedia article
UNIVERSITIES_NO_WIKI_QUERY = """
SELECT ?uni ?uniLabel ?countryLabel WHERE {
?uni wdt:P31/wdt:P279* wd:Q3918. # instance of university or subclass
?uni wdt:P17 ?country.
FILTER NOT EXISTS {
?article schema:about ?uni.
?article schema:isPartOf <https://en.wikipedia.org/>.
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
# Pattern 5: Aggregate queries with GROUP BY
# Count Nobel Prize winners by country
NOBEL_BY_COUNTRY_QUERY = """
SELECT ?countryLabel (COUNT(?person) AS ?winners) WHERE {
  ?person wdt:P166/wdt:P279* wd:Q7191. # award received: Nobel Prize or a subclass of it
?person wdt:P27 ?country.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?country ?countryLabel
ORDER BY DESC(?winners)
LIMIT 50
"""
# Pattern 6: UNION for multiple types
# Both films and TV series released in 2025
SCREEN_CONTENT_2025_QUERY = """
SELECT ?item ?itemLabel ?typeLabel WHERE {
{
?item wdt:P31 wd:Q11424. # film
BIND(wd:Q11424 AS ?type)
} UNION {
?item wdt:P31 wd:Q5398426. # television series
BIND(wd:Q5398426 AS ?type)
}
?item wdt:P577 ?release.
FILTER(YEAR(?release) = 2025)
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?itemLabel
LIMIT 200
"""
def run_all_examples():
queries = {
"capitals": CAPITALS_QUERY,
"recent_companies": RECENT_COMPANIES_QUERY,
"politicians": POLITICIANS_QUERY,
"universities_no_wiki": UNIVERSITIES_NO_WIKI_QUERY,
"nobel_by_country": NOBEL_BY_COUNTRY_QUERY,
"screen_2025": SCREEN_CONTENT_2025_QUERY,
}
results = {}
for name, query in queries.items():
print(f"Running {name}...")
results[name] = run_sparql(query)
print(f" Got {len(results[name])} results")
time.sleep(2)
return results
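Since every query comes back as a list of flat dicts, exporting to CSV is mechanical. A sketch using the stdlib csv module; the field-union step handles rows where an OPTIONAL value was absent:

```python
import csv

def rows_to_csv(rows: list[dict], path: str) -> int:
    """Write SPARQL result rows to a CSV file; returns rows written."""
    if not rows:
        return 0
    # Take the union of keys across rows, since OPTIONAL fields may be
    # missing from some rows entirely.
    fields: list[str] = []
    for row in rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```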
Entity API — Direct Lookup by Q-ID
When you already have Wikidata IDs, the MediaWiki API (action=wbgetentities) is faster than SPARQL and handles batch requests well.
def get_entities(
    qids: list[str],
    languages: Optional[list[str]] = None,
    props: Optional[list[str]] = None,
) -> dict[str, dict]:
"""
Fetch Wikidata entities by Q-IDs.
Handles batching (up to 50 IDs per request).
Returns dict mapping Q-ID -> entity data.
"""
if languages is None:
languages = ["en", "fr", "de", "es"]
if props is None:
props = ["labels", "descriptions", "claims", "sitelinks"]
all_entities = {}
batch_size = 50
for i in range(0, len(qids), batch_size):
batch = qids[i:i + batch_size]
params = {
"action": "wbgetentities",
"ids": "|".join(batch),
"format": "json",
"languages": "|".join(languages),
"props": "|".join(props),
}
resp = httpx.get(
"https://www.wikidata.org/w/api.php",
params=params,
headers={"User-Agent": "DataBot/1.0 ([email protected])"},
timeout=30,
)
resp.raise_for_status()
data = resp.json()
for qid, entity in data.get("entities", {}).items():
if "missing" not in entity:
all_entities[qid] = entity
if i + batch_size < len(qids):
time.sleep(0.5)
return all_entities
def extract_label(entity: dict, lang: str = "en") -> str:
"""Extract English label from entity."""
labels = entity.get("labels", {})
if lang in labels:
return labels[lang]["value"]
    # Fall back to related languages, then to any available label
    for alt in ["en-gb", "en-ca", "fr", "de"]:
        if alt in labels:
            return labels[alt]["value"]
    if labels:
        return next(iter(labels.values()))["value"]
    return entity.get("id", "?")
def extract_claim_values(entity: dict, property_id: str) -> list[Any]:
"""
Extract all values for a specific property from an entity.
Handles all Wikidata data types.
"""
claims = entity.get("claims", {}).get(property_id, [])
values = []
for claim in claims:
rank = claim.get("rank", "normal")
if rank == "deprecated":
continue
mainsnak = claim.get("mainsnak", {})
if mainsnak.get("snaktype") != "value":
continue
datavalue = mainsnak.get("datavalue", {})
dtype = datavalue.get("type")
val = datavalue.get("value")
if dtype == "wikibase-entityid":
values.append(val.get("id"))
elif dtype == "string":
values.append(val)
elif dtype == "monolingualtext":
values.append({"text": val.get("text"), "lang": val.get("language")})
elif dtype == "time":
# Return ISO-8601-like string
time_str = val.get("time", "")
# Wikidata format: +1969-07-20T00:00:00Z
values.append(time_str.lstrip("+"))
elif dtype == "quantity":
amount = val.get("amount", "0").lstrip("+")
unit_uri = val.get("unit", "")
unit_qid = unit_uri.rsplit("/", 1)[-1] if "entity/" in unit_uri else None
values.append({"amount": float(amount), "unit": unit_qid})
elif dtype == "globecoordinate":
values.append({
"lat": val.get("latitude"),
"lon": val.get("longitude"),
"precision": val.get("precision"),
})
elif dtype == "url":
values.append(val)
return values
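One subtlety the time branch above glosses over: Wikidata time values also carry a precision field (9 = year, 10 = month, 11 = day), and below day precision the remaining timestamp digits are padding, not data. A standalone helper, assuming the usual +YYYY-MM-DDThh:mm:ssZ form (BCE dates would need extra care):

```python
def format_wikidata_time(time_str: str, precision: int) -> str:
    """Trim a Wikidata timestamp to its stated precision.

    precision 9 = year, 10 = month, 11 = day (the common cases).
    """
    clean = time_str.lstrip("+")
    date_part = clean.split("T", 1)[0]        # e.g. "1969-07-20"
    year, month, day = date_part.split("-")
    if precision <= 9:
        return year
    if precision == 10:
        return f"{year}-{month}"
    return f"{year}-{month}-{day}"
```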
def enrich_entity(entity: dict) -> dict:
"""Build a flat, useful dict from a raw Wikidata entity."""
qid = entity.get("id", "")
label = extract_label(entity)
description = entity.get("descriptions", {}).get("en", {}).get("value", "")
return {
"qid": qid,
"label": label,
"description": description,
"instance_of": extract_claim_values(entity, "P31"),
"country": extract_claim_values(entity, "P17"),
"inception": extract_claim_values(entity, "P571"),
"website": extract_claim_values(entity, "P856"),
"image": extract_claim_values(entity, "P18"),
"wikipedia_en": entity.get("sitelinks", {}).get("enwiki", {}).get("title", ""),
"coordinates": extract_claim_values(entity, "P625"),
"population": extract_claim_values(entity, "P1082"),
}
Advanced SPARQL: Working with Qualifiers and References
Qualifiers add context to statements — "population was 3.5M as of 2020". References give sourcing. Most use cases don't need them, but they matter for historical data and data quality assessment.
# Query with qualifiers: population at a specific point in time
POPULATION_HISTORY_QUERY = """
SELECT ?city ?cityLabel ?population ?pointInTime WHERE {
?city wdt:P31 wd:Q515.
?city p:P1082 ?statement.
?statement ps:P1082 ?population.
?statement pq:P585 ?pointInTime. # pq: = qualifier predicate
FILTER(YEAR(?pointInTime) >= 2010)
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?city DESC(?pointInTime)
LIMIT 200
"""
# Query with references: only statements cited from academic sources
CITED_FACTS_QUERY = """
SELECT ?item ?itemLabel ?property ?value WHERE {
?item wdt:P31 wd:Q5.
?item p:P569 ?birthStatement.
?birthStatement ps:P569 ?value.
?birthStatement prov:wasDerivedFrom ?ref.
?ref pr:P248 ?source. # stated in
?source wdt:P31/wdt:P279* wd:Q5633421. # source is a scientific journal
  BIND(wd:P569 AS ?property)  # the property itself, as an entity
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""
# Federated query: cross-reference with DBpedia via SERVICE
# (dbpedia.org is on the WDQS federation allowlist)
FEDERATED_QUERY = """
SELECT ?city ?cityLabel ?population ?dbpediaLink WHERE {
  ?city wdt:P31 wd:Q515.
  ?city wdt:P1082 ?population.
  FILTER(?population > 1000000)
  OPTIONAL {
    SERVICE <https://dbpedia.org/sparql> {
      ?dbpediaLink owl:sameAs ?city.
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?population)
LIMIT 20
"""
"""
def run_sparql_with_paging(
base_query: str,
page_size: int = 1000,
max_items: int = 10000,
delay: float = 3.0,
) -> list[dict]:
"""
Run a SPARQL query with LIMIT/OFFSET pagination.
Note: Wikidata SPARQL has a hard limit of 10,000 rows — for larger
datasets, use multiple filtered queries or the bulk dump.
"""
all_results = []
offset = 0
while len(all_results) < max_items:
paginated = f"{base_query}\nLIMIT {page_size} OFFSET {offset}"
try:
batch = run_sparql(paginated)
except Exception as e:
print(f"Error at offset {offset}: {e}")
break
if not batch:
break
all_results.extend(batch)
print(f" Fetched {len(all_results)} total results...")
if len(batch) < page_size:
break # last page
offset += page_size
time.sleep(delay)
return all_results[:max_items]
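When a single query times out no matter what, the practical move is to shard it along some dimension and run the pieces separately. A hypothetical sketch that expands a template containing a {filter} placeholder into inception-year shards (the placeholder convention is this helper's own, not part of the queries above):

```python
def shard_query_by_year(template: str, start: int, end: int, step: int = 5) -> list[str]:
    """Expand a query template containing '{filter}' into year-range shards.

    Each shard covers [lo, lo + step), so the shards partition [start, end).
    """
    shards = []
    for lo in range(start, end, step):
        hi = min(lo + step, end)
        year_filter = f"FILTER(YEAR(?inception) >= {lo} && YEAR(?inception) < {hi})"
        shards.append(template.replace("{filter}", year_filter))
    return shards
```

Each shard can then be fed through run_sparql individually, with a polite delay between them.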
Rate Limiting, Proxies, and Batch Operations
The Wikidata Query Service enforces rate limits per IP. Authenticated requests (with Wikimedia OAuth) get higher limits, but for most use cases anonymous access is sufficient.
For heavy batch jobs — enriching a dataset of thousands of companies, building training data, running research queries — the per-IP rate limits become the bottleneck. ThorData's residential proxy network distributes requests across multiple IPs, keeping each individual IP well under the threshold.
import requests as req_lib # standard requests for proxy sessions
def batch_sparql_queries(
queries: list[str],
proxy_url: Optional[str] = None,
delay_range: tuple = (2.0, 5.0),
user_agent: str = "DataEnrichmentBot/2.0 ([email protected])",
) -> list[list[dict]]:
"""
Run multiple SPARQL queries with rate limiting and optional proxy rotation.
Each query gets a fresh proxy connection.
"""
all_results = []
for i, query in enumerate(queries):
session = req_lib.Session()
session.headers["User-Agent"] = user_agent
session.headers["Accept"] = "application/sparql-results+json"
if proxy_url:
session.proxies = {"http": proxy_url, "https": proxy_url}
try:
resp = session.get(
WIKIDATA_SPARQL,
params={"query": query, "format": "json"},
timeout=60,
)
if resp.status_code == 429:
wait = int(resp.headers.get("Retry-After", 30))
print(f"Rate limited on query {i}. Waiting {wait}s...")
time.sleep(wait)
resp = session.get(
WIKIDATA_SPARQL,
params={"query": query, "format": "json"},
timeout=60,
)
resp.raise_for_status()
bindings = resp.json().get("results", {}).get("bindings", [])
results = [{k: v["value"] for k, v in row.items()} for row in bindings]
all_results.append(results)
except Exception as e:
print(f"Query {i} failed: {e}")
all_results.append([])
delay = random.uniform(*delay_range)
time.sleep(delay)
session.close()
return all_results
def enrich_entities_from_sparql(
qids: list[str],
properties: list[str],
proxy_url: Optional[str] = None,
batch_size: int = 50,
) -> dict[str, dict]:
"""
Enrich a list of Q-IDs with specific properties using batched SPARQL queries.
More efficient than individual API calls for large batches.
"""
enriched = {}
props_sparql = " ".join([f"OPTIONAL {{ ?item wdt:{p} ?val_{p}. }}" for p in properties])
select_vars = " ".join([f"?val_{p}" for p in properties])
for i in range(0, len(qids), batch_size):
batch = qids[i:i + batch_size]
values_clause = " ".join([f"wd:{qid}" for qid in batch])
query = f"""
SELECT ?item ?itemLabel {select_vars} WHERE {{
VALUES ?item {{ {values_clause} }}
{props_sparql}
SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""
try:
results = run_sparql(query)
for row in results:
qid = row.get("item", "")
if qid:
enriched[qid] = row
except Exception as e:
print(f"Enrichment batch {i//batch_size} failed: {e}")
time.sleep(2.0)
return enriched
Bulk Downloads: Processing the JSON Dumps
For truly large operations (building a local knowledge graph, training data, research datasets), the weekly JSON dump is the practical choice. The full dump is on the order of 100 GB compressed and takes hours to process even on fast hardware.
import gzip
import subprocess
from pathlib import Path
# Wikidata publishes full JSON dumps, regenerated roughly weekly
DUMP_URL = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz"
def download_dump(output_path: str, show_progress: bool = True):
"""
Download Wikidata dump with progress tracking.
Uses wget for resumable downloads.
"""
cmd = ["wget", "--continue", "--show-progress", "-O", output_path, DUMP_URL]
if not show_progress:
cmd = ["wget", "--quiet", "--continue", "-O", output_path, DUMP_URL]
subprocess.run(cmd, check=True)
def stream_wikidata_dump(
dump_path: str,
entity_filter=None,
max_entities: Optional[int] = None,
) -> Any:
"""
Stream and optionally filter a Wikidata JSON dump.
The full dump is ~100GB compressed. Streaming is the only practical approach.
entity_filter: callable(entity) -> bool, or None to return all entities.
"""
count = 0
with gzip.open(dump_path, "rt", encoding="utf-8") as f:
for line in f:
if max_entities and count >= max_entities:
break
line = line.strip().rstrip(",")
if not line or line in ("[", "]"):
continue
try:
entity = json.loads(line)
except json.JSONDecodeError:
continue
if entity_filter is None or entity_filter(entity):
yield entity
count += 1
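The strip-and-skip logic above encodes the dump's line format: one giant JSON array with one entity per line, plus [ and ] lines and trailing commas. A tiny synthetic dump makes it concrete (the entities here are made up):

```python
import gzip
import json

def write_tiny_dump(path: str) -> None:
    """Write a two-entity file in the same line format as latest-all.json.gz."""
    lines = [
        "[",
        json.dumps({"id": "Q1", "type": "item", "claims": {}}) + ",",
        json.dumps({"id": "Q2", "type": "item", "claims": {}}),
        "]",
    ]
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("\n".join(lines))

def read_tiny_dump(path: str) -> list[dict]:
    """Same strip/skip logic as stream_wikidata_dump, inlined for clarity."""
    entities = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            entities.append(json.loads(line))
    return entities
```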
def is_instance_of(qid: str):
"""Create a filter function for entities that are instances of a given type."""
def filter_fn(entity: dict) -> bool:
claims = entity.get("claims", {})
p31 = claims.get("P31", [])
for claim in p31:
try:
val_qid = claim["mainsnak"]["datavalue"]["value"]["id"]
if val_qid == qid:
return True
except (KeyError, TypeError):
continue
return False
return filter_fn
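Filters built this way compose cleanly. A small combinator plus two extra predicates, which can wrap filters like is_instance_of above (the entities passed in would be ordinary dump dicts):

```python
def all_of(*filters):
    """Combine entity filters: an entity passes only if every filter passes."""
    def combined(entity: dict) -> bool:
        return all(f(entity) for f in filters)
    return combined

def is_item(entity: dict) -> bool:
    """True for items, as opposed to properties or lexemes in the dump."""
    return entity.get("type") == "item"

def has_english_label(entity: dict) -> bool:
    """True if the entity carries an English label."""
    return "en" in entity.get("labels", {})
```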
def extract_dump_to_jsonl(
dump_path: str,
output_path: str,
entity_filter=None,
max_entities: Optional[int] = None,
log_every: int = 100000,
):
"""
Extract filtered entities from dump to JSONL file.
JSONL (one JSON object per line) is easier to process than the full dump.
"""
output = Path(output_path)
count = 0
total_seen = 0
with gzip.open(dump_path, "rt", encoding="utf-8") as infile, \
open(output, "w", encoding="utf-8") as outfile:
for line in infile:
total_seen += 1
if total_seen % log_every == 0:
print(f"Processed {total_seen:,} entities, extracted {count:,}")
if max_entities and count >= max_entities:
break
line = line.strip().rstrip(",")
if not line or line in ("[", "]"):
continue
try:
entity = json.loads(line)
except json.JSONDecodeError:
continue
if entity_filter is None or entity_filter(entity):
outfile.write(json.dumps(entity, ensure_ascii=False) + "\n")
count += 1
print(f"Done. Extracted {count:,} entities from {total_seen:,} total.")
return count
# Example: Extract all software entries from the dump
def extract_all_software(dump_path: str):
is_software = is_instance_of("Q7397") # Q7397 = software
def software_filter(entity: dict) -> bool:
if entity.get("type") != "item":
return False
return is_software(entity)
extract_dump_to_jsonl(
dump_path,
"wikidata_software.jsonl",
entity_filter=software_filter,
)
# Efficient: extract multiple entity types in one pass
def extract_multiple_types(dump_path: str, type_qids: list[str], output_path: str):
"""Extract entities matching any of the given types in a single dump pass."""
type_set = set(type_qids)
def multi_type_filter(entity: dict) -> bool:
if entity.get("type") != "item":
return False
claims = entity.get("claims", {})
p31 = claims.get("P31", [])
for claim in p31:
try:
val_qid = claim["mainsnak"]["datavalue"]["value"]["id"]
if val_qid in type_set:
return True
except (KeyError, TypeError):
continue
return False
extract_dump_to_jsonl(dump_path, output_path, entity_filter=multi_type_filter)
Building a Local Wikidata Query Cache
For applications that run the same queries repeatedly, caching SPARQL results locally reduces latency and API load:
import sqlite3
import hashlib
def setup_sparql_cache(db_path: str = "wikidata_cache.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS sparql_cache (
query_hash TEXT PRIMARY KEY,
query_text TEXT,
result_json TEXT,
row_count INTEGER,
cached_at TEXT DEFAULT (datetime('now')),
expires_at TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS entity_cache (
qid TEXT PRIMARY KEY,
entity_json TEXT,
cached_at TEXT DEFAULT (datetime('now')),
expires_at TEXT
)
""")
conn.commit()
return conn
def cached_sparql(
    query: str,
    cache_conn: sqlite3.Connection,
    ttl_hours: int = 24,
) -> list[dict]:
"""Run a SPARQL query with local caching."""
query_hash = hashlib.sha256(query.encode()).hexdigest()
# Check cache
row = cache_conn.execute(
"SELECT result_json, expires_at FROM sparql_cache WHERE query_hash = ?",
(query_hash,)
).fetchone()
if row:
from datetime import datetime
expires_at = row[1]
if expires_at and datetime.fromisoformat(expires_at) > datetime.utcnow():
return json.loads(row[0])
# Cache miss — run query
results = run_sparql(query)
# Store in cache
from datetime import datetime, timedelta
expires = (datetime.utcnow() + timedelta(hours=ttl_hours)).isoformat()
cache_conn.execute("""
INSERT OR REPLACE INTO sparql_cache
(query_hash, query_text, result_json, row_count, expires_at)
VALUES (?, ?, ?, ?, ?)
""", (query_hash, query[:500], json.dumps(results), len(results), expires))
cache_conn.commit()
return results
def cached_entity(
qid: str,
cache_conn: sqlite3.Connection,
ttl_hours: int = 168, # 1 week default
) -> Optional[dict]:
"""Fetch a Wikidata entity with local caching."""
from datetime import datetime, timedelta
row = cache_conn.execute(
"SELECT entity_json, expires_at FROM entity_cache WHERE qid = ?",
(qid,)
).fetchone()
if row:
expires_at = row[1]
if expires_at and datetime.fromisoformat(expires_at) > datetime.utcnow():
return json.loads(row[0])
# Fetch fresh
entities = get_entities([qid])
entity = entities.get(qid)
if entity:
expires = (datetime.utcnow() + timedelta(hours=ttl_hours)).isoformat()
cache_conn.execute(
"INSERT OR REPLACE INTO entity_cache (qid, entity_json, expires_at) VALUES (?, ?, ?)",
(qid, json.dumps(entity), expires)
)
cache_conn.commit()
return entity
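Expired rows accumulate over time; a small maintenance pass keeps the database compact. Table and column names match the schema above; the replace() call normalizes the 'T' separator in stored ISO timestamps so they compare correctly against SQLite's datetime('now'):

```python
import sqlite3

def purge_expired(conn: sqlite3.Connection) -> int:
    """Delete expired rows from both cache tables; returns rows removed."""
    removed = 0
    for table in ("sparql_cache", "entity_cache"):
        cur = conn.execute(
            f"DELETE FROM {table} WHERE expires_at IS NOT NULL "
            "AND replace(expires_at, 'T', ' ') < datetime('now')"
        )
        removed += cur.rowcount
    conn.commit()
    return removed
```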
Practical Use Cases and Data Quality Notes
Property discovery. When exploring a new entity type, run this query to find what properties are most commonly used:
def discover_properties_for_type(type_qid: str, sample_size: int = 100) -> list[dict]:
    """Find the most common properties on a sample of entities of a given type."""
    query = f"""
    SELECT ?prop ?propLabel (COUNT(?item) AS ?count) WHERE {{
      {{
        SELECT ?item WHERE {{ ?item wdt:P31 wd:{type_qid}. }}
        LIMIT {sample_size}
      }}
      ?item ?wdtProp ?value.
      ?prop wikibase:directClaim ?wdtProp.
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    GROUP BY ?prop ?propLabel
    ORDER BY DESC(?count)
    LIMIT 30
    """
    results = run_sparql(query)
return [
{
"property_id": r.get("prop", ""),
"label": r.get("propLabel", ""),
"usage_count": r.get("count", 0),
}
for r in results
]
# Example: What properties do software items commonly have?
# software_props = discover_properties_for_type("Q7397")
Data quality varies by topic. English-language Wikipedia editors tend to maintain English entity coverage well. Entities with active WikiProjects (cities, species, chemicals, films) are usually high-quality. Entities in smaller languages, very new things, or niche areas can be sparse or inconsistent.
Timeouts, not row caps. The public SPARQL endpoint enforces a 60-second timeout per query rather than a fixed result-size limit; queries that scan or return too much data simply time out. For larger datasets, split queries by time range, country, or other dimensions, or use the bulk dump.
Identifier cross-referencing. Wikidata is an excellent hub for linking identifiers across databases. Common properties: P213 (ISNI), P214 (VIAF), P244 (Library of Congress), P356 (DOI), P496 (ORCID), P549 (Mathematics Genealogy Project), P2002 (Twitter username), P4033 (Mastodon address).
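Fetching those identifiers for a known item is a one-query job. A sketch that builds the lookup for a given Q-ID; the property dict restates the list above:

```python
# External-identifier properties from the list above (P-ID -> database name).
EXTERNAL_ID_PROPS = {
    "P213": "ISNI",
    "P214": "VIAF",
    "P244": "Library of Congress",
    "P356": "DOI",
    "P496": "ORCID",
    "P549": "Mathematics Genealogy Project",
    "P2002": "Twitter username",
    "P4033": "Mastodon address",
}

def build_external_id_query(qid: str) -> str:
    """Build SPARQL that returns every matching external identifier for one item."""
    predicates = " ".join(f"wdt:{pid}" for pid in EXTERNAL_ID_PROPS)
    return f"""
SELECT ?prop ?value WHERE {{
  VALUES ?prop {{ {predicates} }}
  wd:{qid} ?prop ?value.
}}
"""
```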
CC0 license. The entire Wikidata dataset is CC0 — public domain. No attribution required (though good practice to credit contributors). This makes it suitable for any use case, including training ML models and commercial applications.
The SPARQL endpoint is the entry point for most work. The entity API is for known-ID lookups. The bulk dump is for truly large-scale extraction. Pick the right tool for the job, add caching for repeated queries, and be thoughtful about query complexity to stay within the endpoint's timeout limits.