Extracting OpenStreetMap Data: POIs, Overpass API & Bulk Processing (2026)
OpenStreetMap is a free, editable map of the entire world. That's not just roads — it's restaurants, hospitals, parks, building outlines, hiking trails, fire hydrants, EV chargers, transit stops, ATMs, bike lanes, and effectively any geographic feature someone has mapped.
The data is open (ODbL license) and well-structured. For most use cases, the Overpass API is where you start. For country-scale or larger datasets, grab the bulk exports from Geofabrik. This guide covers both approaches with production-ready Python code.
Data Model Overview
Understanding OSM's data model is essential before writing queries.
Elements are the basic building blocks:
- Nodes — single points (lat/lon). A restaurant, ATM, or tree.
- Ways — ordered lists of nodes forming lines or areas. A road, building outline, or park boundary.
- Relations — groups of nodes/ways with semantic meaning. A bus route, administrative boundary, or multipolygon.
Tags are key-value pairs attached to any element. A coffee shop might have amenity=cafe, name=Blue Bottle Coffee, opening_hours=Mo-Fr 07:00-19:00, wifi=yes.
The OSM Wiki at wiki.openstreetmap.org/wiki/Map_features is your reference for what tags mean and which values are used. Always check the wiki before writing queries — there's often a correct tag and several deprecated alternatives.
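The element/tag model maps directly onto plain Python dicts once Overpass returns JSON. Here is roughly what a single node looks like — the ID and tag values below are illustrative, not a real OSM object:

```python
# A single OSM node as it appears in Overpass JSON output (illustrative values).
sample_node = {
    "type": "node",
    "id": 123456789,  # hypothetical OSM ID
    "lat": 52.5163,
    "lon": 13.3777,
    "tags": {
        "amenity": "cafe",
        "name": "Blue Bottle Coffee",
        "opening_hours": "Mo-Fr 07:00-19:00",
        "wifi": "yes",
    },
}

# Tags are a flat string-to-string mapping — reading them is plain dict access.
tags = sample_node.get("tags", {})
print(tags.get("name"))            # Blue Bottle Coffee
print(tags.get("cuisine", "n/a"))  # n/a — absent tags simply aren't there
```

Absent tags are genuinely absent rather than null, so defensive `.get()` access with defaults is the idiom used throughout this guide.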
Overpass API — Querying Map Data
Overpass is like SQL for OpenStreetMap. You write a query describing what geographic features you want, in what area, and it returns them as structured data.
The public endpoint at overpass-api.de is free but shared. For production workloads, consider running a local Overpass instance or using a private endpoint.
import httpx
import time
import json
import csv
import sqlite3
OVERPASS_URL = "https://overpass-api.de/api/interpreter"
def overpass_query(query: str, timeout: int = 30) -> list[dict]:
"""Run an Overpass QL query and return elements list."""
resp = httpx.post(
OVERPASS_URL,
data={"data": query},
timeout=timeout + 15,
headers={
"User-Agent": "OSMDataPipeline/1.0 ([email protected])",
},
)
resp.raise_for_status()
data = resp.json()
return data.get("elements", [])
def overpass_query_with_retry(
query: str,
timeout: int = 45,
max_retries: int = 3,
proxy: str | None = None,
) -> list[dict]:
"""Overpass query with exponential backoff on rate limiting."""
for attempt in range(max_retries):
try:
            # httpx.post() has no transport= parameter; recent httpx (>= 0.26)
            # accepts proxy= directly on top-level requests. None means no proxy.
            resp = httpx.post(
                OVERPASS_URL,
                data={"data": query},
                timeout=timeout + 15,
                headers={
                    "User-Agent": "OSMDataPipeline/1.0 ([email protected])",
                },
                proxy=proxy,
            )
if resp.status_code == 429:
wait = 2 ** (attempt + 2) # 4s, 8s, 16s
print(f"Rate limited. Waiting {wait}s (attempt {attempt + 1})...")
time.sleep(wait)
continue
if resp.status_code == 504:
print("Gateway timeout — query too complex or server overloaded")
time.sleep(30)
continue
resp.raise_for_status()
return resp.json().get("elements", [])
except httpx.TimeoutException:
print(f"Timeout on attempt {attempt + 1}")
if attempt < max_retries - 1:
time.sleep(10)
return []
# Example: all cafes in central Berlin
cafes_query = """
[out:json][timeout:30];
area["name"="Berlin"]["admin_level"="4"]->.b;
(
node["amenity"="cafe"](area.b);
way["amenity"="cafe"](area.b);
);
out center body;
"""
cafes = overpass_query(cafes_query)
print(f"Found {len(cafes)} cafes in Berlin")
for cafe in cafes[:5]:
tags = cafe.get("tags", {})
lat = cafe.get("lat") or cafe.get("center", {}).get("lat")
lon = cafe.get("lon") or cafe.get("center", {}).get("lon")
print(f" {tags.get('name', 'Unnamed')}: {lat:.4f}, {lon:.4f}")
if tags.get("opening_hours"):
print(f" Hours: {tags['opening_hours']}")
Key Overpass QL Syntax
[out:json] — JSON output (vs XML default)
[timeout:30] — server-side timeout in seconds
area["name"="Berlin"]["admin_level"="4"]->.b; — define area by tag match
node["amenity"="cafe"](area.b); — filter nodes by tag within area
way["amenity"="restaurant"](south,west,north,east); — filter ways within a bounding box (coordinates in that order)
out center; — for ways, output center point instead of full geometry
out body; — include all tags
out geom; — include full geometry (all nodes for ways)
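These pieces compose mechanically, so it's worth wrapping the boilerplate in a small builder. This is a sketch — the function name and structure are my own, and it only produces the query string without making a request:

```python
def build_query(tag: str, value: str,
                bbox: tuple[float, float, float, float],
                timeout: int = 30) -> str:
    """Compose an Overpass QL query string from a tag filter and a bounding box."""
    s, w, n, e = bbox  # south, west, north, east
    return f"""
[out:json][timeout:{timeout}];
(
  node["{tag}"="{value}"]({s},{w},{n},{e});
  way["{tag}"="{value}"]({s},{w},{n},{e});
);
out center body;
"""

# Restaurants in a box over central Paris:
q = build_query("amenity", "restaurant", (48.84, 2.27, 48.88, 2.32))
print(q)
```

Generating queries this way keeps the `[out:json][timeout:…]` header and the `out center body;` footer consistent across a pipeline instead of copy-pasting them.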
POI Extraction by Category
Reusable function for extracting different POI types from a named area:
def get_pois_by_area(
area_name: str,
poi_tag: str,
poi_value: str,
admin_level: str = "4",
additional_tags: dict | None = None,
timeout: int = 60,
) -> list[dict]:
"""
Extract POIs of a specific type from a named administrative area.
poi_tag: OSM tag key, e.g., 'amenity', 'shop', 'tourism'
poi_value: OSM tag value, e.g., 'restaurant', 'supermarket', 'hotel'
admin_level: '4' for state/region, '6' for county, '8' for city/municipality
"""
extra_filters = ""
if additional_tags:
for k, v in additional_tags.items():
extra_filters += f'["{k}"="{v}"]'
query = f"""
[out:json][timeout:{timeout}];
area["name"="{area_name}"]["admin_level"="{admin_level}"]->.a;
(
node["{poi_tag}"="{poi_value}"]{extra_filters}(area.a);
way["{poi_tag}"="{poi_value}"]{extra_filters}(area.a);
);
out center body;
"""
elements = overpass_query_with_retry(query, timeout=timeout)
return normalize_elements(elements)
def normalize_elements(elements: list[dict]) -> list[dict]:
"""Normalize raw Overpass elements to clean POI dicts."""
pois = []
for el in elements:
tags = el.get("tags", {})
lat = el.get("lat") or el.get("center", {}).get("lat")
lon = el.get("lon") or el.get("center", {}).get("lon")
if lat is None or lon is None:
continue
pois.append({
"osm_id": el["id"],
"osm_type": el["type"],
"lat": lat,
"lon": lon,
"name": tags.get("name", ""),
"name_en": tags.get("name:en", ""),
"brand": tags.get("brand", ""),
"operator": tags.get("operator", ""),
"address": {
"street": tags.get("addr:street", ""),
"housenumber": tags.get("addr:housenumber", ""),
"postcode": tags.get("addr:postcode", ""),
"city": tags.get("addr:city", ""),
"country": tags.get("addr:country", ""),
},
"phone": tags.get("phone", "") or tags.get("contact:phone", ""),
"website": tags.get("website", "") or tags.get("contact:website", ""),
"email": tags.get("email", "") or tags.get("contact:email", ""),
"opening_hours": tags.get("opening_hours", ""),
"wheelchair": tags.get("wheelchair", ""),
"smoking": tags.get("smoking", ""),
"wifi": tags.get("internet_access", ""),
"cuisine": tags.get("cuisine", ""), # for restaurants
"stars": tags.get("stars", ""), # for hotels
"capacity": tags.get("capacity", ""),
"all_tags": tags,
})
return pois
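One practical wrinkle: querying both nodes and ways can surface the same venue twice — for example, a building mapped as a way plus a separately mapped entrance node. A rough dedup pass keyed on name plus rounded coordinates helps; the "same POI" definition here is an assumption, so tune the precision to your data:

```python
def dedupe_pois(pois: list[dict], precision: int = 4) -> list[dict]:
    """Drop POIs sharing a lowercased name and coordinate cell (~11 m at precision=4)."""
    seen: set[tuple] = set()
    unique = []
    for poi in pois:
        key = (
            poi.get("name", "").strip().lower(),
            round(poi["lat"], precision),
            round(poi["lon"], precision),
        )
        if key in seen:
            continue  # duplicate of something we already kept
        seen.add(key)
        unique.append(poi)
    return unique
```

Run this after `normalize_elements` when merging node and way results; unnamed POIs all share the empty-name key, so consider skipping them or keying on `(osm_type, osm_id)` instead if your data has many.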
# Common extraction examples
def get_restaurants(city: str, admin_level: str = "8") -> list[dict]:
return get_pois_by_area(city, "amenity", "restaurant", admin_level)
def get_hospitals(city: str) -> list[dict]:
    return get_pois_by_area(city, "amenity", "hospital", admin_level="8")
def get_hotels(city: str) -> list[dict]:
    return get_pois_by_area(city, "tourism", "hotel", admin_level="8")
def get_supermarkets(city: str) -> list[dict]:
    return get_pois_by_area(city, "shop", "supermarket", admin_level="8")
def get_ev_chargers(city: str) -> list[dict]:
    return get_pois_by_area(city, "amenity", "charging_station", admin_level="8")
def get_atms(city: str) -> list[dict]:
    return get_pois_by_area(city, "amenity", "atm", admin_level="8")
def get_bus_stops(city: str) -> list[dict]:
    return get_pois_by_area(city, "highway", "bus_stop", admin_level="8")
Bounding Box Queries
When you need data for a specific coordinate area rather than a named region:
def get_pois_bbox(
south: float,
west: float,
north: float,
east: float,
tags: dict,
timeout: int = 60,
) -> list[dict]:
"""
Get POIs within a bounding box.
tags: dict like {"amenity": "restaurant"} or {"shop": "*"}
Use "*" as value to match any value for a given key.
"""
bbox = f"{south},{west},{north},{east}"
tag_filters = ""
for key, value in tags.items():
if value == "*":
tag_filters += f' node["{key}"]({bbox});\n'
tag_filters += f' way["{key}"]({bbox});\n'
else:
tag_filters += f' node["{key}"="{value}"]({bbox});\n'
tag_filters += f' way["{key}"="{value}"]({bbox});\n'
query = f"""
[out:json][timeout:{timeout}];
(
{tag_filters}
);
out center body;
"""
elements = overpass_query_with_retry(query, timeout=timeout)
return normalize_elements(elements)
def bbox_from_city_center(
lat: float,
lon: float,
radius_km: float = 5.0,
) -> tuple[float, float, float, float]:
"""Create a bounding box around a lat/lon point."""
# Approximate: 1 degree lat ≈ 111km, 1 degree lon ≈ 111km * cos(lat)
import math
lat_delta = radius_km / 111.0
lon_delta = radius_km / (111.0 * math.cos(math.radians(lat)))
return (
lat - lat_delta, # south
lon - lon_delta, # west
lat + lat_delta, # north
lon + lon_delta, # east
)
# All restaurants within 5km of the Eiffel Tower
bbox = bbox_from_city_center(48.8584, 2.2945, radius_km=5)
restaurants = get_pois_bbox(*bbox, {"amenity": "restaurant"})
print(f"Found {len(restaurants)} restaurants near Eiffel Tower")
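The flat-earth approximation in `bbox_from_city_center` (1° latitude ≈ 111 km) is good enough at city scale. A quick haversine check — the standard great-circle distance formula, inlined here so the snippet stands alone — confirms the box edges land within about 1% of the requested radius:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Rebuild the 5 km box around the Eiffel Tower and measure its edges.
lat, lon, radius_km = 48.8584, 2.2945, 5.0
lat_delta = radius_km / 111.0
lon_delta = radius_km / (111.0 * math.cos(math.radians(lat)))
d_north = haversine_km(lat, lon, lat + lat_delta, lon)  # distance to the north edge
d_east = haversine_km(lat, lon, lat, lon + lon_delta)   # distance to the east edge
print(round(d_north, 2), round(d_east, 2))  # both ≈ 5 km
```

The small overshoot comes from using 111 km/degree instead of the more precise 111.32; at these scales it doesn't matter, and Overpass bbox queries are cheap enough that a slightly generous box is harmless.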
Multi-City Parallel Scraping
For production workloads across many cities, parallelize with a rate-limiting wrapper:
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
class OverpassRateLimiter:
"""Thread-safe rate limiter for Overpass queries."""
def __init__(self, requests_per_minute: int = 10):
self.delay = 60.0 / requests_per_minute
self._lock = threading.Lock()
self._last_request = 0
def wait(self):
with self._lock:
elapsed = time.time() - self._last_request
wait_time = self.delay - elapsed
if wait_time > 0:
time.sleep(wait_time)
self._last_request = time.time()
rate_limiter = OverpassRateLimiter(requests_per_minute=8)
def fetch_city_pois(
city_name: str,
poi_tag: str,
poi_value: str,
proxy: str | None = None,
) -> tuple[str, list[dict]]:
"""Fetch POIs for a single city (designed for parallel execution)."""
rate_limiter.wait()
query = f"""
[out:json][timeout:45];
area["name"="{city_name}"]["admin_level"~"[4-8]"]->.a;
node["{poi_tag}"="{poi_value}"](area.a);
out body;
"""
    # Recent httpx (>= 0.26) accepts proxy= directly; httpx.post() has no
    # transport= parameter, and httpx responses have no .ok (that's requests).
    resp = httpx.post(
        OVERPASS_URL,
        data={"data": query},
        timeout=60,
        proxy=proxy,
    )
    if resp.is_success:
        elements = resp.json().get("elements", [])
        return city_name, normalize_elements(elements)
    return city_name, []
def parallel_poi_fetch(
cities: list[str],
poi_tag: str,
poi_value: str,
max_workers: int = 3,
proxy: str | None = None,
) -> dict[str, list[dict]]:
"""Fetch POIs from multiple cities in parallel."""
results = {}
with ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = {
executor.submit(fetch_city_pois, city, poi_tag, poi_value, proxy): city
for city in cities
}
for future in as_completed(futures):
city = futures[future]
try:
city_name, pois = future.result()
results[city_name] = pois
print(f" {city_name}: {len(pois)} {poi_value}s")
except Exception as e:
print(f" {city} failed: {e}")
results[city] = []
return results
# Parallel scraping with ThorData proxy rotation
# [ThorData residential proxies](https://thordata.partnerstack.com/partner/0a0x4nzq) (or [Oxylabs](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=2066&url_id=174))
# let you parallelize without hitting per-IP concurrency limits on the public endpoint
PROXY = "http://USER:[email protected]:9000"
european_cities = ["Paris", "Berlin", "Madrid", "Rome", "Amsterdam", "Vienna", "Warsaw"]
results = parallel_poi_fetch(european_cities, "amenity", "hospital", proxy=PROXY)
total = sum(len(v) for v in results.values())
print(f"\nTotal hospitals across {len(results)} cities: {total}")
Bulk OSM Data Processing
For country-scale or larger datasets, don't use Overpass — the API isn't designed for it. Download PBF files from Geofabrik instead.
import subprocess
import os
def download_geofabrik_extract(region: str, output_dir: str = "/tmp/osm") -> str:
"""
Download a Geofabrik PBF extract.
region examples:
- "europe/germany"
- "north-america/us/new-york"
- "europe/great-britain"
- "asia/japan"
See download.geofabrik.de for the full directory tree.
"""
os.makedirs(output_dir, exist_ok=True)
filename = region.split("/")[-1] + "-latest.osm.pbf"
url = f"https://download.geofabrik.de/{region}-latest.osm.pbf"
output_path = os.path.join(output_dir, filename)
if os.path.exists(output_path):
print(f"Already downloaded: {output_path}")
return output_path
print(f"Downloading {url}...")
subprocess.run(
["wget", "-q", "--show-progress", "-O", output_path, url],
check=True,
)
return output_path
def extract_pois_from_pbf(
pbf_path: str,
output_csv: str,
tags_filter: dict,
) -> int:
"""
Extract POIs from a PBF file using the osmium Python library.
Requires: pip install osmium
tags_filter: dict like {"amenity": "restaurant"} or {"shop": "*"}
"""
try:
import osmium
except ImportError:
raise ImportError("Install osmium: pip install osmium")
class POIHandler(osmium.SimpleHandler):
def __init__(self, tags_filter: dict):
super().__init__()
self.pois = []
self.tags_filter = tags_filter
def _matches_filter(self, tags) -> bool:
for key, value in self.tags_filter.items():
tag_val = tags.get(key)
if value == "*":
if tag_val:
return True
else:
if tag_val == value:
return True
return False
def _extract_poi(self, osm_id: int, osm_type: str, lat: float, lon: float, tags) -> dict:
            # osmium's TagList isn't a dict — copy its key/value pairs out explicitly
            tags_dict = {t.k: t.v for t in tags}
return {
"osm_id": osm_id,
"osm_type": osm_type,
"name": tags_dict.get("name", ""),
"lat": lat,
"lon": lon,
"amenity": tags_dict.get("amenity", ""),
"shop": tags_dict.get("shop", ""),
"tourism": tags_dict.get("tourism", ""),
"phone": tags_dict.get("phone", ""),
"website": tags_dict.get("website", ""),
"opening_hours": tags_dict.get("opening_hours", ""),
"addr_street": tags_dict.get("addr:street", ""),
"addr_city": tags_dict.get("addr:city", ""),
"addr_postcode": tags_dict.get("addr:postcode", ""),
}
def node(self, n):
if not n.location.valid():
return
tags = n.tags
if self._matches_filter(tags):
self.pois.append(
self._extract_poi(n.id, "node", n.location.lat, n.location.lon, tags)
)
def way(self, w):
# Ways don't have direct lat/lon — skip for now
# For way centers, use overpass instead
pass
handler = POIHandler(tags_filter=tags_filter)
handler.apply_file(pbf_path, locations=True)
if handler.pois:
with open(output_csv, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=handler.pois[0].keys())
writer.writeheader()
writer.writerows(handler.pois)
return len(handler.pois)
# Example: extract all restaurants from Germany
# pbf_path = download_geofabrik_extract("europe/germany")
# count = extract_pois_from_pbf(pbf_path, "germany_restaurants.csv", {"amenity": "restaurant"})
# print(f"Extracted {count} restaurants from Germany")
GeoJSON Export
Most mapping tools and databases expect GeoJSON. Convert your results for easy visualization:
def pois_to_geojson(pois: list[dict], output_path: str | None = None) -> dict:
"""Convert POI list to GeoJSON FeatureCollection."""
features = []
for poi in pois:
lat = poi.get("lat")
lon = poi.get("lon")
if lat is None or lon is None:
continue
# Build clean properties (exclude lat/lon and nested dicts)
props = {}
for k, v in poi.items():
if k in ("lat", "lon", "all_tags"):
continue
if isinstance(v, dict):
# Flatten address
if k == "address":
for ak, av in v.items():
if av:
props[f"addr_{ak}"] = av
elif v:
props[k] = v
features.append({
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [lon, lat],
},
"properties": props,
})
geojson = {
"type": "FeatureCollection",
"features": features,
"metadata": {
"count": len(features),
"source": "OpenStreetMap contributors (ODbL)",
},
}
if output_path:
with open(output_path, "w", encoding="utf-8") as f:
json.dump(geojson, f, indent=2, ensure_ascii=False)
print(f"Saved {len(features)} features to {output_path}")
return geojson
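Note the coordinate order: GeoJSON is `[longitude, latitude]`, the reverse of the lat/lon convention used everywhere else in this guide, and swapping them is the single most common GeoJSON bug. A small heuristic check (my own helper, not part of any spec) catches the cases it can before you ship a file:

```python
def looks_swapped(features: list[dict]) -> list[dict]:
    """Flag Point features whose coordinates are probably [lat, lon] by mistake.

    Heuristic: |latitude| can never exceed 90, so if the second coordinate
    does, the pair is almost certainly swapped. Swapped pairs where both
    values are under 90 can't be detected this way.
    """
    flagged = []
    for feat in features:
        geom = feat.get("geometry", {})
        if geom.get("type") != "Point":
            continue
        lon, lat = geom["coordinates"]
        if abs(lat) > 90:
            flagged.append(feat)
    return flagged

good = {"type": "Feature",
        "geometry": {"type": "Point", "coordinates": [13.40, 52.52]}}   # Berlin, correct
bad = {"type": "Feature",
       "geometry": {"type": "Point", "coordinates": [35.68, 139.69]}}  # Tokyo, swapped
print(len(looks_swapped([good, bad])))  # 1
```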
Storing POI Data in SQLite
For persistent storage and querying across multiple scraping runs:
def init_poi_db(db_path: str) -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS pois (
osm_id INTEGER,
osm_type TEXT,
lat REAL,
lon REAL,
name TEXT,
name_en TEXT,
brand TEXT,
amenity TEXT,
shop TEXT,
tourism TEXT,
phone TEXT,
website TEXT,
opening_hours TEXT,
addr_street TEXT,
addr_city TEXT,
addr_postcode TEXT,
addr_country TEXT,
cuisine TEXT,
wheelchair TEXT,
tags_json TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (osm_id, osm_type)
);
CREATE INDEX IF NOT EXISTS idx_poi_location ON pois(lat, lon);
CREATE INDEX IF NOT EXISTS idx_poi_amenity ON pois(amenity);
CREATE INDEX IF NOT EXISTS idx_poi_city ON pois(addr_city);
CREATE INDEX IF NOT EXISTS idx_poi_name ON pois(name);
""")
conn.commit()
return conn
def store_pois(conn: sqlite3.Connection, pois: list[dict]):
"""Store a list of POIs in the database."""
now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
inserted = 0
for poi in pois:
addr = poi.get("address", {})
try:
conn.execute(
"""INSERT OR REPLACE INTO pois
(osm_id, osm_type, lat, lon, name, name_en, brand, amenity,
shop, tourism, phone, website, opening_hours,
addr_street, addr_city, addr_postcode, addr_country,
cuisine, wheelchair, tags_json, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
poi["osm_id"], poi["osm_type"], poi["lat"], poi["lon"],
poi.get("name"), poi.get("name_en"), poi.get("brand"),
poi.get("all_tags", {}).get("amenity"),
poi.get("all_tags", {}).get("shop"),
poi.get("all_tags", {}).get("tourism"),
poi.get("phone"), poi.get("website"), poi.get("opening_hours"),
addr.get("street"), addr.get("city"),
addr.get("postcode"), addr.get("country"),
poi.get("cuisine"), poi.get("wheelchair"),
json.dumps(poi.get("all_tags", {})),
now
)
)
inserted += 1
except sqlite3.IntegrityError:
pass
conn.commit()
return inserted
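`INSERT OR REPLACE` is what makes re-runs idempotent: scraping the same city twice updates rows in place rather than duplicating them, because `(osm_id, osm_type)` is the primary key. A minimal demonstration of that upsert behavior on a trimmed-down schema (the full schema above behaves the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pois (
        osm_id INTEGER, osm_type TEXT, name TEXT,
        PRIMARY KEY (osm_id, osm_type)
    )
""")
# First run inserts; the second run hits the same primary key and replaces the row.
for name in ("Old Name", "New Name"):
    conn.execute(
        "INSERT OR REPLACE INTO pois (osm_id, osm_type, name) VALUES (?, ?, ?)",
        (42, "node", name),
    )
row_count, = conn.execute("SELECT COUNT(*) FROM pois").fetchone()
stored_name, = conn.execute("SELECT name FROM pois").fetchone()
print(row_count, stored_name)  # 1 New Name
```

One caveat: `OR REPLACE` deletes and re-inserts the row, so any column you don't supply (like a first-seen timestamp) is reset. If you need to preserve columns across updates, SQLite's `INSERT ... ON CONFLICT DO UPDATE` upsert is the finer-grained alternative.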
def find_pois_near(
conn: sqlite3.Connection,
lat: float,
lon: float,
radius_km: float,
poi_type: str | None = None,
) -> list[dict]:
"""Find POIs within radius_km of a coordinate (approximate)."""
import math
lat_delta = radius_km / 111.0
lon_delta = radius_km / (111.0 * math.cos(math.radians(lat)))
query = """
SELECT osm_id, name, lat, lon, amenity, phone, website, opening_hours,
addr_street, addr_city
FROM pois
WHERE lat BETWEEN ? AND ?
AND lon BETWEEN ? AND ?
"""
params = [lat - lat_delta, lat + lat_delta, lon - lon_delta, lon + lon_delta]
if poi_type:
query += " AND amenity = ?"
params.append(poi_type)
rows = conn.execute(query, params).fetchall()
keys = ["osm_id", "name", "lat", "lon", "amenity", "phone",
"website", "opening_hours", "street", "city"]
return [dict(zip(keys, row)) for row in rows]
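The bounding-box `WHERE` clause returns a square, so a POI at a corner of the box can be ~1.4x the radius away from the center. When you need a true circular radius, post-filter the rows with an exact haversine distance check — sketched here over plain dicts so it runs standalone, but it applies directly to the rows `find_pois_near` returns:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def within_radius(rows: list[dict], lat: float, lon: float,
                  radius_km: float) -> list[dict]:
    """Exact circular filter applied after the coarse box query."""
    return [row for row in rows
            if haversine_km(lat, lon, row["lat"], row["lon"]) <= radius_km]

rows = [
    {"name": "near", "lat": 52.520, "lon": 13.405},
    {"name": "corner", "lat": 52.56, "lon": 13.47},  # inside a 5 km box, ~6.2 km out
]
print([r["name"] for r in within_radius(rows, 52.520, 13.405, 5.0)])  # ['near']
```

The two-phase pattern (cheap index-backed box scan, then exact filter) is the standard way to do radius queries without a spatial extension like SpatiaLite.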
Practical Notes on OSM Data Quality
Quality varies by region. Western Europe (especially Germany, the Netherlands, and the UK) and Japan have exceptional OSM coverage — most businesses, bike lanes, and even individual trees are mapped. In many parts of Africa, Southeast Asia, and rural areas globally, coverage can be sparse or outdated.
Verify tag conventions. Before querying, check the OSM wiki for the canonical tag. A grocery store might legitimately be shop=supermarket, shop=convenience, shop=grocery, or shop=greengrocer depending on its type. Use shop=* to catch all, then filter by value afterwards.
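The "query broad, filter later" approach is easy to operationalize: fetch everything with the key, then tally values with `collections.Counter` to see what's actually mapped in your area before deciding which values to keep. A sketch over already-fetched elements (the sample data below is illustrative):

```python
from collections import Counter

def tag_value_counts(elements: list[dict], key: str) -> Counter:
    """Tally the values of one tag key across raw Overpass elements."""
    return Counter(
        el.get("tags", {}).get(key)
        for el in elements
        if el.get("tags", {}).get(key)
    )

# Illustrative elements, as a shop=* query might return them:
elements = [
    {"tags": {"shop": "supermarket", "name": "A"}},
    {"tags": {"shop": "convenience"}},
    {"tags": {"shop": "supermarket"}},
    {"tags": {"amenity": "cafe"}},  # no shop tag — ignored
]
print(tag_value_counts(elements, "shop").most_common())
# [('supermarket', 2), ('convenience', 1)]
```

Running this once per region tells you which values are worth dedicated queries and which are noise, without guessing from the wiki alone.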
ODbL license requirements. OpenStreetMap data is licensed under the Open Database License (ODbL). You can use it commercially, but you must:
1. Attribute the source: "© OpenStreetMap contributors"
2. Keep the data open if you distribute a derivative database
3. Share-alike: datasets derived from OSM must also be licensed ODbL
The Overpass API is shared infrastructure. Be a good citizen: add [timeout:30] to all queries, avoid running complex queries that consume many CPU-seconds, and add delays between requests. For high-frequency production scraping, ThorData proxies distribute load across IPs to avoid hitting per-IP concurrency limits on the public endpoint.
Data freshness. OSM is updated continuously — a new restaurant might be added within hours of opening, though closed businesses linger until a mapper notices, which can take days or far longer. The main public Overpass endpoint applies minutely diffs, so its data typically lags the live database by only a few minutes.
Not everything is in OSM. Major chains (McDonald's, Starbucks) are well-represented. Smaller local businesses have patchier coverage. Don't use OSM as your sole source of truth for complete business directories — cross-reference with other data sources for completeness.