Extracting OpenStreetMap Data: POIs, Overpass API & Bulk Processing (2026)
OpenStreetMap is a free, editable map of the entire world. That's not just roads — it's restaurants, hospitals, parks, building outlines, hiking trails, fire hydrants, EV chargers, transit stops, ATMs, bike lanes, and effectively any geographic feature someone has mapped.
The data is open (ODbL license) and well-structured. For most use cases, the Overpass API is where you start. For country-scale or larger datasets, grab the bulk exports from Geofabrik. This guide covers both approaches with production-ready Python code.
Data Model Overview
Understanding OSM's data model is essential before writing queries.
Elements are the basic building blocks:
- Nodes — single points (lat/lon). A restaurant, ATM, or tree.
- Ways — ordered lists of nodes forming lines or areas. A road, building outline, or park boundary.
- Relations — groups of nodes/ways with semantic meaning. A bus route, administrative boundary, or multipolygon.
Tags are key-value pairs attached to any element. A coffee shop might have amenity=cafe, name=Blue Bottle Coffee, opening_hours=Mo-Fr 07:00-19:00, wifi=yes.
The OSM Wiki at wiki.openstreetmap.org/wiki/Map_features is your reference for what tags mean and which values are used. Always check the wiki before writing queries — there's often a correct tag and several deprecated alternatives.
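The element/tag model maps directly onto plain Python dicts once Overpass returns JSON. Here is roughly what a single node looks like — the ID and tag values below are illustrative, not a real OSM object:

```python
# A single OSM node as it appears in Overpass JSON output (illustrative values).
sample_node = {
    "type": "node",
    "id": 123456789,  # hypothetical OSM ID
    "lat": 52.5163,
    "lon": 13.3777,
    "tags": {
        "amenity": "cafe",
        "name": "Blue Bottle Coffee",
        "opening_hours": "Mo-Fr 07:00-19:00",
        "wifi": "yes",
    },
}

# Tags are a flat string-to-string mapping — reading them is plain dict access.
tags = sample_node.get("tags", {})
print(tags.get("name"))            # Blue Bottle Coffee
print(tags.get("cuisine", "n/a"))  # n/a — absent tags simply aren't there
```

Absent tags are genuinely absent rather than null, so defensive `.get()` access with defaults is the idiom used throughout this guide.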
Overpass API — Querying Map Data
Overpass is like SQL for OpenStreetMap. You write a query describing what geographic features you want, in what area, and it returns them as structured data.
The public endpoint at overpass-api.de is free but shared. For production workloads, consider running a local Overpass instance or using a private endpoint.
import httpx
import time
import json
import csv
import sqlite3
OVERPASS_URL = "https://overpass-api.de/api/interpreter"
def overpass_query(query: str, timeout: int = 30) -> list[dict]:
"""Run an Overpass QL query and return elements list."""
resp = httpx.post(
OVERPASS_URL,
data={"data": query},
timeout=timeout + 15,
headers={
"User-Agent": "OSMDataPipeline/1.0 ([email protected])",
},
)
resp.raise_for_status()
data = resp.json()
return data.get("elements", [])
def overpass_query_with_retry(
query: str,
timeout: int = 45,
max_retries: int = 3,
proxy: str | None = None,
) -> list[dict]:
"""Overpass query with exponential backoff on rate limiting."""
for attempt in range(max_retries):
try:
            # httpx.post() has no transport= parameter; recent httpx (>= 0.26)
            # accepts proxy= directly on top-level requests. None means no proxy.
            resp = httpx.post(
                OVERPASS_URL,
                data={"data": query},
                timeout=timeout + 15,
                headers={
                    "User-Agent": "OSMDataPipeline/1.0 ([email protected])",
                },
                proxy=proxy,
            )
if resp.status_code == 429:
wait = 2 ** (attempt + 2) # 4s, 8s, 16s
print(f"Rate limited. Waiting {wait}s (attempt {attempt + 1})...")
time.sleep(wait)
continue
if resp.status_code == 504:
print("Gateway timeout — query too complex or server overloaded")
time.sleep(30)
continue
resp.raise_for_status()
return resp.json().get("elements", [])
except httpx.TimeoutException:
print(f"Timeout on attempt {attempt + 1}")
if attempt < max_retries - 1:
time.sleep(10)
return []
# Example: all cafes in central Berlin
cafes_query = """
[out:json][timeout:30];
area["name"="Berlin"]["admin_level"="4"]->.b;
(
node["amenity"="cafe"](area.b);
way["amenity"="cafe"](area.b);
);
out center body;
"""
cafes = overpass_query(cafes_query)
print(f"Found {len(cafes)} cafes in Berlin")
for cafe in cafes[:5]:
tags = cafe.get("tags", {})
lat = cafe.get("lat") or cafe.get("center", {}).get("lat")
lon = cafe.get("lon") or cafe.get("center", {}).get("lon")
print(f" {tags.get('name', 'Unnamed')}: {lat:.4f}, {lon:.4f}")
if tags.get("opening_hours"):
print(f" Hours: {tags['opening_hours']}")
Key Overpass QL Syntax
[out:json] — JSON output (vs XML default)
[timeout:30] — server-side timeout in seconds
area["name"="Berlin"]["admin_level"="4"]->.b; — define area by tag match
node["amenity"="cafe"](area.b); — filter nodes by tag within area
way["amenity"="restaurant"](south,west,north,east); — filter ways within a bounding box (coordinates in that order)
out center; — for ways, output center point instead of full geometry
out body; — include all tags
out geom; — include full geometry (all nodes for ways)
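These pieces compose mechanically, so it's worth wrapping the boilerplate in a small builder. This is a sketch — the function name and structure are my own, and it only produces the query string without making a request:

```python
def build_query(tag: str, value: str,
                bbox: tuple[float, float, float, float],
                timeout: int = 30) -> str:
    """Compose an Overpass QL query string from a tag filter and a bounding box."""
    s, w, n, e = bbox  # south, west, north, east
    return f"""
[out:json][timeout:{timeout}];
(
  node["{tag}"="{value}"]({s},{w},{n},{e});
  way["{tag}"="{value}"]({s},{w},{n},{e});
);
out center body;
"""

# Restaurants in a box over central Paris:
q = build_query("amenity", "restaurant", (48.84, 2.27, 48.88, 2.32))
print(q)
```

Generating queries this way keeps the `[out:json][timeout:…]` header and the `out center body;` footer consistent across a pipeline instead of copy-pasting them.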
POI Extraction by Category
Reusable function for extracting different POI types from a named area:
def get_pois_by_area(
area_name: str,
poi_tag: str,
poi_value: str,
admin_level: str = "4",
additional_tags: dict | None = None,
timeout: int = 60,
) -> list[dict]:
"""
Extract POIs of a specific type from a named administrative area.
poi_tag: OSM tag key, e.g., 'amenity', 'shop', 'tourism'
poi_value: OSM tag value, e.g., 'restaurant', 'supermarket', 'hotel'
admin_level: '4' for state/region, '6' for county, '8' for city/municipality
"""
extra_filters = ""
if additional_tags:
for k, v in additional_tags.items():
extra_filters += f'["{k}"="{v}"]'
query = f"""
[out:json][timeout:{timeout}];
area["name"="{area_name}"]["admin_level"="{admin_level}"]->.a;
(
node["{poi_tag}"="{poi_value}"]{extra_filters}(area.a);
way["{poi_tag}"="{poi_value}"]{extra_filters}(area.a);
);
out center body;
"""
elements = overpass_query_with_retry(query, timeout=timeout)
return normalize_elements(elements)
def normalize_elements(elements: list[dict]) -> list[dict]:
"""Normalize raw Overpass elements to clean POI dicts."""
pois = []
for el in elements:
tags = el.get("tags", {})
lat = el.get("lat") or el.get("center", {}).get("lat")
lon = el.get("lon") or el.get("center", {}).get("lon")
if lat is None or lon is None:
continue
pois.append({
"osm_id": el["id"],
"osm_type": el["type"],
"lat": lat,
"lon": lon,
"name": tags.get("name", ""),
"name_en": tags.get("name:en", ""),
"brand": tags.get("brand", ""),
"operator": tags.get("operator", ""),
"address": {
"street": tags.get("addr:street", ""),
"housenumber": tags.get("addr:housenumber", ""),
"postcode": tags.get("addr:postcode", ""),
"city": tags.get("addr:city", ""),
"country": tags.get("addr:country", ""),
},
"phone": tags.get("phone", "") or tags.get("contact:phone", ""),
"website": tags.get("website", "") or tags.get("contact:website", ""),
"email": tags.get("email", "") or tags.get("contact:email", ""),
"opening_hours": tags.get("opening_hours", ""),
"wheelchair": tags.get("wheelchair", ""),
"smoking": tags.get("smoking", ""),
"wifi": tags.get("internet_access", ""),
"cuisine": tags.get("cuisine", ""), # for restaurants
"stars": tags.get("stars", ""), # for hotels
"capacity": tags.get("capacity", ""),
"all_tags": tags,
})
return pois
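One practical wrinkle: querying both nodes and ways can surface the same venue twice — for example, a building mapped as a way plus a separately mapped entrance node. A rough dedup pass keyed on name plus rounded coordinates helps; the "same POI" definition here is an assumption, so tune the precision to your data:

```python
def dedupe_pois(pois: list[dict], precision: int = 4) -> list[dict]:
    """Drop POIs sharing a lowercased name and coordinate cell (~11 m at precision=4)."""
    seen: set[tuple] = set()
    unique = []
    for poi in pois:
        key = (
            poi.get("name", "").strip().lower(),
            round(poi["lat"], precision),
            round(poi["lon"], precision),
        )
        if key in seen:
            continue  # duplicate of something we already kept
        seen.add(key)
        unique.append(poi)
    return unique
```

Run this after `normalize_elements` when merging node and way results; unnamed POIs all share the empty-name key, so consider skipping them or keying on `(osm_type, osm_id)` instead if your data has many.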
# Common extraction examples
def get_restaurants(city: str, admin_level: str = "8") -> list[dict]:
return get_pois_by_area(city, "amenity", "restaurant", admin_level)
def get_hospitals(city: str) -> list[dict]:
    return get_pois_by_area(city, "amenity", "hospital", admin_level="8")
def get_hotels(city: str) -> list[dict]:
    return get_pois_by_area(city, "tourism", "hotel", admin_level="8")
def get_supermarkets(city: str) -> list[dict]:
    return get_pois_by_area(city, "shop", "supermarket", admin_level="8")
def get_ev_chargers(city: str) -> list[dict]:
    return get_pois_by_area(city, "amenity", "charging_station", admin_level="8")
def get_atms(city: str) -> list[dict]:
    return get_pois_by_area(city, "amenity", "atm", admin_level="8")
def get_bus_stops(city: str) -> list[dict]:
    return get_pois_by_area(city, "highway", "bus_stop", admin_level="8")
Bounding Box Queries
When you need data for a specific coordinate area rather than a named region:
def get_pois_bbox(
south: float,
west: float,
north: float,
east: float,
tags: dict,
timeout: int = 60,
) -> list[dict]:
"""
Get POIs within a bounding box.
tags: dict like {"amenity": "restaurant"} or {"shop": "*"}
Use "*" as value to match any value for a given key.
"""
bbox = f"{south},{west},{north},{east}"
tag_filters = ""
for key, value in tags.items():
if value == "*":
tag_filters += f' node["{key}"]({bbox});\n'
tag_filters += f' way["{key}"]({bbox});\n'
else:
tag_filters += f' node["{key}"="{value}"]({bbox});\n'
tag_filters += f' way["{key}"="{value}"]({bbox});\n'
query = f"""
[out:json][timeout:{timeout}];
(
{tag_filters}
);
out center body;
"""
elements = overpass_query_with_retry(query, timeout=timeout)
return normalize_elements(elements)
def bbox_from_city_center(
lat: float,
lon: float,
radius_km: float = 5.0,
) -> tuple[float, float, float, float]:
"""Create a bounding box around a lat/lon point."""
# Approximate: 1 degree lat ≈ 111km, 1 degree lon ≈ 111km * cos(lat)
import math
lat_delta = radius_km / 111.0
lon_delta = radius_km / (111.0 * math.cos(math.radians(lat)))
return (
lat - lat_delta, # south
lon - lon_delta, # west
lat + lat_delta, # north
lon + lon_delta, # east
)
# All restaurants within 5km of the Eiffel Tower
bbox = bbox_from_city_center(48.8584, 2.2945, radius_km=5)
restaurants = get_pois_bbox(*bbox, {"amenity": "restaurant"})
print(f"Found {len(restaurants)} restaurants near Eiffel Tower")
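The flat-earth approximation in `bbox_from_city_center` (1° latitude ≈ 111 km) is good enough at city scale. A quick haversine check — the standard great-circle distance formula, inlined here so the snippet stands alone — confirms the box edges land within about 1% of the requested radius:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Rebuild the 5 km box around the Eiffel Tower and measure its edges.
lat, lon, radius_km = 48.8584, 2.2945, 5.0
lat_delta = radius_km / 111.0
lon_delta = radius_km / (111.0 * math.cos(math.radians(lat)))
d_north = haversine_km(lat, lon, lat + lat_delta, lon)  # distance to the north edge
d_east = haversine_km(lat, lon, lat, lon + lon_delta)   # distance to the east edge
print(round(d_north, 2), round(d_east, 2))  # both ≈ 5 km
```

The small overshoot comes from using 111 km/degree instead of the more precise 111.32; at these scales it doesn't matter, and Overpass bbox queries are cheap enough that a slightly generous box is harmless.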
Multi-City Parallel Scraping
For production workloads across many cities, parallelize with a rate-limiting wrapper:
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
class OverpassRateLimiter:
"""Thread-safe rate limiter for Overpass queries."""
def __init__(self, requests_per_minute: int = 10):
self.delay = 60.0 / requests_per_minute
self._lock = threading.Lock()
self._last_request = 0
def wait(self):
with self._lock:
elapsed = time.time() - self._last_request
wait_time = self.delay - elapsed
if wait_time > 0:
time.sleep(wait_time)
self._last_request = time.time()
rate_limiter = OverpassRateLimiter(requests_per_minute=8)
def fetch_city_pois(
city_name: str,
poi_tag: str,
poi_value: str,
proxy: str | None = None,
) -> tuple[str, list[dict]]:
"""Fetch POIs for a single city (designed for parallel execution)."""
rate_limiter.wait()
query = f"""
[out:json][timeout:45];
area["name"="{city_name}"]["admin_level"~"[4-8]"]->.a;
node["{poi_tag}"="{poi_value}"](area.a);
out body;
"""
    # Recent httpx (>= 0.26) accepts proxy= directly; httpx.post() has no
    # transport= parameter, and httpx responses have no .ok (that's requests).
    resp = httpx.post(
        OVERPASS_URL,
        data={"data": query},
        timeout=60,
        proxy=proxy,
    )
    if resp.is_success:
        elements = resp.json().get("elements", [])
        return city_name, normalize_elements(elements)
    return city_name, []
def parallel_poi_fetch(
cities: list[str],
poi_tag: str,
poi_value: str,
max_workers: int = 3,
proxy: str | None = None,
) -> dict[str, list[dict]]:
"""Fetch POIs from multiple cities in parallel."""
results = {}
with ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = {
executor.submit(fetch_city_pois, city, poi_tag, poi_value, proxy): city
for city in cities
}
for future in as_completed(futures):
city = futures[future]
try:
city_name, pois = future.result()
results[city_name] = pois
print(f" {city_name}: {len(pois)} {poi_value}s")
except Exception as e:
print(f" {city} failed: {e}")
results[city] = []
return results
# Parallel scraping with ThorData proxy rotation
# [ThorData residential proxies](https://thordata.partnerstack.com/partner/0a0x4nzq) (or [Oxylabs](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=2066&url_id=174))
# let you parallelize without hitting per-IP concurrency limits on the public endpoint
PROXY = "http://USER:[email protected]:9000"
european_cities = ["Paris", "Berlin", "Madrid", "Rome", "Amsterdam", "Vienna", "Warsaw"]
results = parallel_poi_fetch(european_cities, "amenity", "hospital", proxy=PROXY)
total = sum(len(v) for v in results.values())
print(f"\nTotal hospitals across {len(results)} cities: {total}")
Bulk OSM Data Processing
For country-scale or larger datasets, don't use Overpass — the API isn't designed for it. Download PBF files from Geofabrik instead.
import subprocess
import os
def download_geofabrik_extract(region: str, output_dir: str = "/tmp/osm") -> str:
"""
Download a Geofabrik PBF extract.
region examples:
- "europe/germany"
- "north-america/us/new-york"
- "europe/great-britain"
- "asia/japan"
See download.geofabrik.de for the full directory tree.
"""
os.makedirs(output_dir, exist_ok=True)
filename = region.split("/")[-1] + "-latest.osm.pbf"
url = f"https://download.geofabrik.de/{region}-latest.osm.pbf"
output_path = os.path.join(output_dir, filename)
if os.path.exists(output_path):
print(f"Already downloaded: {output_path}")
return output_path
print(f"Downloading {url}...")
subprocess.run(
["wget", "-q", "--show-progress", "-O", output_path, url],
check=True,
)
return output_path
def extract_pois_from_pbf(
pbf_path: str,
output_csv: str,
tags_filter: dict,
) -> int:
"""
Extract POIs from a PBF file using the osmium Python library.
Requires: pip install osmium
tags_filter: dict like {"amenity": "restaurant"} or {"shop": "*"}
"""
try:
import osmium
except ImportError:
raise ImportError("Install osmium: pip install osmium")
class POIHandler(osmium.SimpleHandler):
def __init__(self, tags_filter: dict):
super().__init__()
self.pois = []
self.tags_filter = tags_filter
def _matches_filter(self, tags) -> bool:
for key, value in self.tags_filter.items():
tag_val = tags.get(key)
if value == "*":
if tag_val:
return True
else:
if tag_val == value:
return True
return False
def _extract_poi(self, osm_id: int, osm_type: str, lat: float, lon: float, tags) -> dict:
            # osmium's TagList isn't a dict — copy its key/value pairs out explicitly
            tags_dict = {t.k: t.v for t in tags}
return {
"osm_id": osm_id,
"osm_type": osm_type,
"name": tags_dict.get("name", ""),
"lat": lat,
"lon": lon,
"amenity": tags_dict.get("amenity", ""),
"shop": tags_dict.get("shop", ""),
"tourism": tags_dict.get("tourism", ""),
"phone": tags_dict.get("phone", ""),
"website": tags_dict.get("website", ""),
"opening_hours": tags_dict.get("opening_hours", ""),
"addr_street": tags_dict.get("addr:street", ""),
"addr_city": tags_dict.get("addr:city", ""),
"addr_postcode": tags_dict.get("addr:postcode", ""),
}
def node(self, n):
if not n.location.valid():
return
tags = n.tags
if self._matches_filter(tags):
self.pois.append(
self._extract_poi(n.id, "node", n.location.lat, n.location.lon, tags)
)
def way(self, w):
# Ways don't have direct lat/lon — skip for now
# For way centers, use overpass instead
pass
handler = POIHandler(tags_filter=tags_filter)
handler.apply_file(pbf_path, locations=True)
if handler.pois:
with open(output_csv, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=handler.pois[0].keys())
writer.writeheader()
writer.writerows(handler.pois)
return len(handler.pois)
# Example: extract all restaurants from Germany
# pbf_path = download_geofabrik_extract("europe/germany")
# count = extract_pois_from_pbf(pbf_path, "germany_restaurants.csv", {"amenity": "restaurant"})
# print(f"Extracted {count} restaurants from Germany")
GeoJSON Export
Most mapping tools and databases expect GeoJSON. Convert your results for easy visualization:
def pois_to_geojson(pois: list[dict], output_path: str | None = None) -> dict:
"""Convert POI list to GeoJSON FeatureCollection."""
features = []
for poi in pois:
lat = poi.get("lat")
lon = poi.get("lon")
if lat is None or lon is None:
continue
# Build clean properties (exclude lat/lon and nested dicts)
props = {}
for k, v in poi.items():
if k in ("lat", "lon", "all_tags"):
continue
if isinstance(v, dict):
# Flatten address
if k == "address":
for ak, av in v.items():
if av:
props[f"addr_{ak}"] = av
elif v:
props[k] = v
features.append({
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [lon, lat],
},
"properties": props,
})
geojson = {
"type": "FeatureCollection",
"features": features,
"metadata": {
"count": len(features),
"source": "OpenStreetMap contributors (ODbL)",
},
}
if output_path:
with open(output_path, "w", encoding="utf-8") as f:
json.dump(geojson, f, indent=2, ensure_ascii=False)
print(f"Saved {len(features)} features to {output_path}")
return geojson
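Note the coordinate order: GeoJSON is `[longitude, latitude]`, the reverse of the lat/lon convention used everywhere else in this guide, and swapping them is the single most common GeoJSON bug. A small heuristic check (my own helper, not part of any spec) catches the cases it can before you ship a file:

```python
def looks_swapped(features: list[dict]) -> list[dict]:
    """Flag Point features whose coordinates are probably [lat, lon] by mistake.

    Heuristic: |latitude| can never exceed 90, so if the second coordinate
    does, the pair is almost certainly swapped. Swapped pairs where both
    values are under 90 can't be detected this way.
    """
    flagged = []
    for feat in features:
        geom = feat.get("geometry", {})
        if geom.get("type") != "Point":
            continue
        lon, lat = geom["coordinates"]
        if abs(lat) > 90:
            flagged.append(feat)
    return flagged

good = {"type": "Feature",
        "geometry": {"type": "Point", "coordinates": [13.40, 52.52]}}   # Berlin, correct
bad = {"type": "Feature",
       "geometry": {"type": "Point", "coordinates": [35.68, 139.69]}}  # Tokyo, swapped
print(len(looks_swapped([good, bad])))  # 1
```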
Storing POI Data in SQLite
For persistent storage and querying across multiple scraping runs:
def init_poi_db(db_path: str) -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS pois (
osm_id INTEGER,
osm_type TEXT,
lat REAL,
lon REAL,
name TEXT,
name_en TEXT,
brand TEXT,
amenity TEXT,
shop TEXT,
tourism TEXT,
phone TEXT,
website TEXT,
opening_hours TEXT,
addr_street TEXT,
addr_city TEXT,
addr_postcode TEXT,
addr_country TEXT,
cuisine TEXT,
wheelchair TEXT,
tags_json TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (osm_id, osm_type)
);
CREATE INDEX IF NOT EXISTS idx_poi_location ON pois(lat, lon);
CREATE INDEX IF NOT EXISTS idx_poi_amenity ON pois(amenity);
CREATE INDEX IF NOT EXISTS idx_poi_city ON pois(addr_city);
CREATE INDEX IF NOT EXISTS idx_poi_name ON pois(name);
""")
conn.commit()
return conn
def store_pois(conn: sqlite3.Connection, pois: list[dict]):
"""Store a list of POIs in the database."""
now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
inserted = 0
for poi in pois:
addr = poi.get("address", {})
try:
conn.execute(
"""INSERT OR REPLACE INTO pois
(osm_id, osm_type, lat, lon, name, name_en, brand, amenity,
shop, tourism, phone, website, opening_hours,
addr_street, addr_city, addr_postcode, addr_country,
cuisine, wheelchair, tags_json, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
poi["osm_id"], poi["osm_type"], poi["lat"], poi["lon"],
poi.get("name"), poi.get("name_en"), poi.get("brand"),
poi.get("all_tags", {}).get("amenity"),
poi.get("all_tags", {}).get("shop"),
poi.get("all_tags", {}).get("tourism"),
poi.get("phone"), poi.get("website"), poi.get("opening_hours"),
addr.get("street"), addr.get("city"),
addr.get("postcode"), addr.get("country"),
poi.get("cuisine"), poi.get("wheelchair"),
json.dumps(poi.get("all_tags", {})),
now
)
)
inserted += 1
except sqlite3.IntegrityError:
pass
conn.commit()
return inserted
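`INSERT OR REPLACE` is what makes re-runs idempotent: scraping the same city twice updates rows in place rather than duplicating them, because `(osm_id, osm_type)` is the primary key. A minimal demonstration of that upsert behavior on a trimmed-down schema (the full schema above behaves the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pois (
        osm_id INTEGER, osm_type TEXT, name TEXT,
        PRIMARY KEY (osm_id, osm_type)
    )
""")
# First run inserts; the second run hits the same primary key and replaces the row.
for name in ("Old Name", "New Name"):
    conn.execute(
        "INSERT OR REPLACE INTO pois (osm_id, osm_type, name) VALUES (?, ?, ?)",
        (42, "node", name),
    )
row_count, = conn.execute("SELECT COUNT(*) FROM pois").fetchone()
stored_name, = conn.execute("SELECT name FROM pois").fetchone()
print(row_count, stored_name)  # 1 New Name
```

One caveat: `OR REPLACE` deletes and re-inserts the row, so any column you don't supply (like a first-seen timestamp) is reset. If you need to preserve columns across updates, SQLite's `INSERT ... ON CONFLICT DO UPDATE` upsert is the finer-grained alternative.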
def find_pois_near(
conn: sqlite3.Connection,
lat: float,
lon: float,
radius_km: float,
poi_type: str | None = None,
) -> list[dict]:
"""Find POIs within radius_km of a coordinate (approximate)."""
import math
lat_delta = radius_km / 111.0
lon_delta = radius_km / (111.0 * math.cos(math.radians(lat)))
query = """
SELECT osm_id, name, lat, lon, amenity, phone, website, opening_hours,
addr_street, addr_city
FROM pois
WHERE lat BETWEEN ? AND ?
AND lon BETWEEN ? AND ?
"""
params = [lat - lat_delta, lat + lat_delta, lon - lon_delta, lon + lon_delta]
if poi_type:
query += " AND amenity = ?"
params.append(poi_type)
rows = conn.execute(query, params).fetchall()
keys = ["osm_id", "name", "lat", "lon", "amenity", "phone",
"website", "opening_hours", "street", "city"]
return [dict(zip(keys, row)) for row in rows]
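The bounding-box `WHERE` clause returns a square, so a POI at a corner of the box can be ~1.4x the radius away from the center. When you need a true circular radius, post-filter the rows with an exact haversine distance check — sketched here over plain dicts so it runs standalone, but it applies directly to the rows `find_pois_near` returns:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def within_radius(rows: list[dict], lat: float, lon: float,
                  radius_km: float) -> list[dict]:
    """Exact circular filter applied after the coarse box query."""
    return [row for row in rows
            if haversine_km(lat, lon, row["lat"], row["lon"]) <= radius_km]

rows = [
    {"name": "near", "lat": 52.520, "lon": 13.405},
    {"name": "corner", "lat": 52.56, "lon": 13.47},  # inside a 5 km box, ~6.2 km out
]
print([r["name"] for r in within_radius(rows, 52.520, 13.405, 5.0)])  # ['near']
```

The two-phase pattern (cheap index-backed box scan, then exact filter) is the standard way to do radius queries without a spatial extension like SpatiaLite.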
Practical Notes on OSM Data Quality
Quality varies by region. Western Europe (especially Germany, the Netherlands, and the UK) and Japan have exceptional OSM coverage — most businesses, bike lanes, and even individual trees are mapped. In many parts of Africa, Southeast Asia, and rural areas globally, coverage can be sparse or outdated.
Verify tag conventions. Before querying, check the OSM wiki for the canonical tag. A grocery store might legitimately be shop=supermarket, shop=convenience, shop=grocery, or shop=greengrocer depending on its type. Use shop=* to catch all, then filter by value afterwards.
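The "query broad, filter later" approach is easy to operationalize: fetch everything with the key, then tally values with `collections.Counter` to see what's actually mapped in your area before deciding which values to keep. A sketch over already-fetched elements (the sample data below is illustrative):

```python
from collections import Counter

def tag_value_counts(elements: list[dict], key: str) -> Counter:
    """Tally the values of one tag key across raw Overpass elements."""
    return Counter(
        el.get("tags", {}).get(key)
        for el in elements
        if el.get("tags", {}).get(key)
    )

# Illustrative elements, as a shop=* query might return them:
elements = [
    {"tags": {"shop": "supermarket", "name": "A"}},
    {"tags": {"shop": "convenience"}},
    {"tags": {"shop": "supermarket"}},
    {"tags": {"amenity": "cafe"}},  # no shop tag — ignored
]
print(tag_value_counts(elements, "shop").most_common())
# [('supermarket', 2), ('convenience', 1)]
```

Running this once per region tells you which values are worth dedicated queries and which are noise, without guessing from the wiki alone.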
ODbL license requirements. OpenStreetMap data is licensed under the Open Database License (ODbL). You can use it commercially, but you must:
1. Attribute the source: "© OpenStreetMap contributors"
2. Keep the data open if you distribute a derivative database
3. Share-alike: datasets derived from OSM must also be licensed ODbL
The Overpass API is shared infrastructure. Be a good citizen: add [timeout:30] to all queries, avoid running complex queries that consume many CPU-seconds, and add delays between requests. For high-frequency production scraping, ThorData proxies distribute load across IPs to avoid hitting per-IP concurrency limits on the public endpoint.
Data freshness. OSM is updated continuously — a new restaurant might be added within hours of opening, though closed businesses linger until a mapper notices, which can take days or far longer. The main public Overpass endpoint applies minutely diffs, so its data typically lags the live database by only a few minutes.
Not everything is in OSM. Major chains (McDonald's, Starbucks) are well-represented. Smaller local businesses have patchier coverage. Don't use OSM as your sole source of truth for complete business directories — cross-reference with other data sources for completeness.