Scraping Numbeo Cost of Living Data for City Comparisons (2026)
Numbeo is the largest crowdsourced database of cost of living data, covering prices for groceries, rent, restaurants, transportation, and more across thousands of cities worldwide. If you are building a relocation tool, doing market research, comparing cities for remote work decisions, or analyzing purchasing power differentials, Numbeo's data is hard to find anywhere else on the public web.
There is no official public API. Numbeo does license their data commercially, but for smaller projects and research, scraping the public pages is the practical approach. This guide covers every page type Numbeo exposes, their anti-bot defenses, how to work around them, and a complete SQLite-backed data pipeline.
What Data Is Available
Numbeo publishes several distinct page types, each with a predictable URL structure:
Cost of living index rankings — numbeo.com/cost-of-living/rankings.jsp
Ranked list of cities with composite index scores. Covers cost of living index, rent index, combined index, groceries index, restaurant price index, and purchasing power index.
City detail pages — numbeo.com/cost-of-living/in/CITY-NAME
Individual prices for 50+ categories: meals at restaurants, groceries by item, rent for 1BR/3BR apartments, utilities, transportation, clothing, childcare, salaries, and more. This is the richest data Numbeo offers.
City comparison — numbeo.com/cost-of-living/compare_cities.jsp
Side-by-side breakdown of two cities with percentage differences for every line item.
Quality of life index — numbeo.com/quality-of-life/rankings.jsp
Composite score incorporating purchasing power, safety, healthcare quality, cost of living, pollution, climate, traffic commute time, and property price to income ratio.
Property prices — numbeo.com/property-investment/rankings.jsp
Price-to-income ratio, gross rental yield, mortgage affordability, and price per square meter by city.
Crime index — numbeo.com/crime/rankings.jsp
Perceived crime and safety scores by city.
Each of these follows the same URL pattern and table structure, making them straightforward to scrape with shared code.
Scraping the Cost of Living Index Rankings
The main rankings page lists cities with their composite index values:
import httpx
from selectolax.parser import HTMLParser
import time
import random
# Pool of realistic desktop browser User-Agent strings. A random one is
# picked each time a client is built so repeated requests don't all share
# a single, easily fingerprinted UA.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
]
def scrape_cost_of_living_index(proxy_url: str | None = None) -> list[dict]:
    """Scrape Numbeo's cost of living index rankings.

    Args:
        proxy_url: Optional proxy URL, routed through an httpx transport.

    Returns:
        One dict per city with rank, name, and the six composite index
        values. Empty list when the expected rankings table is absent
        (unreachable page or a CAPTCHA/challenge interstitial).
    """
    url = "https://www.numbeo.com/cost-of-living/rankings.jsp"
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
    # `with` guarantees the client is closed even if the request raises,
    # replacing the manual try/finally + close() pattern.
    with httpx.Client(
        transport=transport,
        timeout=25,
        follow_redirects=True,
        headers={
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Cache-Control": "no-cache",
        },
    ) as client:
        resp = client.get(url)
        resp.raise_for_status()
    tree = HTMLParser(resp.text)
    cities = []
    table = tree.css_first("table#t2")
    if not table:
        # Challenge/CAPTCHA pages do not contain the rankings table.
        return cities
    for row in table.css("tbody tr"):
        cells = row.css("td")
        if len(cells) < 8:
            continue
        try:
            cities.append({
                "rank": int(cells[0].text(strip=True)),
                "city": cells[1].text(strip=True),
                "cost_of_living_index": float(cells[2].text(strip=True).replace(",", "")),
                "rent_index": float(cells[3].text(strip=True).replace(",", "")),
                "col_plus_rent_index": float(cells[4].text(strip=True).replace(",", "")),
                "groceries_index": float(cells[5].text(strip=True).replace(",", "")),
                "restaurant_price_index": float(cells[6].text(strip=True).replace(",", "")),
                "purchasing_power_index": float(cells[7].text(strip=True).replace(",", "")),
            })
        except (ValueError, IndexError):
            # Skip rows that are not well-formed data rows.
            continue
    return cities
# Usage — NOTE: performs a live request against numbeo.com when the script
# runs, then prints the top 10 cities by rank.
rankings = scrape_cost_of_living_index()
for city in rankings[:10]:
    print(f"{city['rank']:3d}. {city['city']:<30} COL: {city['cost_of_living_index']:.1f}")
Scraping Detailed City Prices
Each city has a detail page with 50+ specific price items. The URL pattern is /cost-of-living/in/CITY-NAME where the city name uses hyphens and title case:
def scrape_city_prices(city_slug: str, proxy_url: str | None = None) -> dict:
    """Scrape individual price items for a city.

    Args:
        city_slug: Hyphenated, title-cased city name as used in Numbeo URLs,
            e.g. 'Warsaw', 'New-York', 'Buenos-Aires', 'Ho-Chi-Minh-City'.
        proxy_url: Optional proxy URL, routed through an httpx transport.

    Returns:
        Mapping of item name -> {"price_usd", "category", "range"}. Empty
        dict when the price table is missing (unknown slug or a block page).
    """
    url = f"https://www.numbeo.com/cost-of-living/in/{city_slug}"
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
    # `with` guarantees the client is closed even if the request raises.
    with httpx.Client(
        transport=transport,
        timeout=25,
        follow_redirects=True,
        headers={
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Referer": "https://www.numbeo.com/cost-of-living/",
        },
    ) as client:
        resp = client.get(url)
        resp.raise_for_status()
    tree = HTMLParser(resp.text)
    prices = {}
    table = tree.css_first("table.data_wide_table")
    if not table:
        return prices
    current_category = "General"
    for row in table.css("tr"):
        header = row.css_first("th")
        if header:
            # <th> rows act as category separators ("Restaurants", "Markets", ...).
            current_category = header.text(strip=True)
            continue
        cells = row.css("td")
        if len(cells) < 2:
            continue
        item_name = cells[0].text(strip=True)
        # Price format: "12.50 $" or "1,234.56 $" (may contain NBSPs).
        price_text = cells[1].text(strip=True).replace(",", "").replace("\xa0", "").strip()
        try:
            price = float(price_text.split()[0])
        except (ValueError, IndexError):
            # Items with no submitted data (shown as "?") are skipped.
            continue
        # Extract range if available (min-max in the third column).
        # Note: css() nodes are always truthy, so no extra existence check.
        price_range = None
        if len(cells) >= 3:
            range_text = cells[2].text(strip=True)
            if range_text and "-" in range_text:
                price_range = range_text
        prices[item_name] = {
            "price_usd": price,
            "category": current_category,
            "range": price_range,
        }
    return prices
# Example — live request at run time; prints the first 10 scraped items.
warsaw_prices = scrape_city_prices("Warsaw")
for item, data in list(warsaw_prices.items())[:10]:
    print(f" [{data['category']}] {item}: ${data['price_usd']:.2f}")
City Comparison Scraper
Numbeo's comparison page shows a side-by-side breakdown with percentage differences:
def scrape_city_comparison(
    city1: str,
    city2: str,
    proxy_url: str | None = None,
) -> list[dict]:
    """Scrape side-by-side price comparison between two cities.

    Args:
        city1: First city name as Numbeo expects it (e.g. "Lisbon").
        city2: Second city name.
        proxy_url: Optional proxy URL, routed through an httpx transport.

    Returns:
        One dict per line item with category, item name, both prices
        (keyed "<city>_price"), and the percentage difference of city2
        relative to city1.
    """
    # Local import keeps the article's top-of-file import block unchanged.
    from urllib.parse import quote_plus
    # URL-encode the city names: multi-word cities like "New York" would
    # otherwise produce an invalid query string.
    url = (
        "https://www.numbeo.com/cost-of-living/compare_cities.jsp"
        f"?country1=&city1={quote_plus(city1)}&country2=&city2={quote_plus(city2)}"
    )
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
    # `with` guarantees the client is closed even if the request raises.
    with httpx.Client(
        transport=transport,
        timeout=25,
        follow_redirects=True,
        headers={
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Referer": "https://www.numbeo.com/cost-of-living/",
        },
    ) as client:
        resp = client.get(url)
        resp.raise_for_status()
    tree = HTMLParser(resp.text)
    comparisons = []
    table = tree.css_first("table.data_wide_table")
    if not table:
        return comparisons
    current_category = "General"
    for row in table.css("tr"):
        header = row.css_first("th.category_header")
        if header:
            current_category = header.text(strip=True)
            continue
        cells = row.css("td")
        if len(cells) < 3:
            continue
        item = cells[0].text(strip=True)
        try:
            price1 = float(
                cells[1].text(strip=True).replace(",", "").replace("\xa0", "").split()[0]
            )
            price2 = float(
                cells[2].text(strip=True).replace(",", "").replace("\xa0", "").split()[0]
            )
        except (ValueError, IndexError):
            # Rows where either city has no data ("?") are skipped.
            continue
        # Guard against division by zero when city1 has a zero price.
        diff_pct = round((price2 - price1) / price1 * 100, 1) if price1 > 0 else 0
        comparisons.append({
            "category": current_category,
            "item": item,
            f"{city1}_price": price1,
            f"{city2}_price": price2,
            "difference_pct": diff_pct,
        })
    return comparisons
# Compare Lisbon vs Berlin. difference_pct is city2 (Berlin) relative to
# city1 (Lisbon), so sorting descending surfaces where Berlin costs more.
comparison = scrape_city_comparison("Lisbon", "Berlin")
expensive_items = sorted(comparison, key=lambda x: x["difference_pct"], reverse=True)
for item in expensive_items[:10]:
    print(f" {item['item']}: Berlin is {item['difference_pct']:+.1f}% vs Lisbon")
Quality of Life Index Scraper
The quality of life rankings use the same table structure but different columns:
def scrape_quality_of_life_index(proxy_url: str | None = None) -> list[dict]:
    """Scrape Numbeo's quality of life index rankings.

    Args:
        proxy_url: Optional proxy URL, routed through an httpx transport.

    Returns:
        One dict per city with "rank", "city", and one key per remaining
        table column. Column keys are derived from the header captions
        (lower-cased, spaces -> underscores), so the output adapts if
        Numbeo reorders or renames columns.
    """
    url = "https://www.numbeo.com/quality-of-life/rankings.jsp"
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
    # `with` guarantees the client is closed even if the request raises.
    with httpx.Client(
        transport=transport,
        timeout=25,
        follow_redirects=True,
        headers={
            "User-Agent": random.choice(USER_AGENTS),
            "Accept-Language": "en-US,en;q=0.9",
        },
    ) as client:
        resp = client.get(url)
        resp.raise_for_status()
    tree = HTMLParser(resp.text)
    cities = []
    table = tree.css_first("table#t2")
    if not table:
        return cities
    # Parse column headers and normalize them into dict keys once, up
    # front, instead of re-deriving the key for every cell of every row.
    headers = [th.text(strip=True) for th in table.css("thead th")]
    keys = [h.lower().replace(" ", "_") for h in headers]
    for row in table.css("tbody tr"):
        cells = row.css("td")
        if len(cells) < 9:
            continue
        try:
            city_data = {
                "rank": int(cells[0].text(strip=True)),
                "city": cells[1].text(strip=True),
            }
            # Map remaining cells to header-derived keys; keep the raw text
            # when a cell is not numeric.
            for i, key in enumerate(keys[2:], start=2):
                if i < len(cells):
                    val_text = cells[i].text(strip=True).replace(",", "")
                    try:
                        city_data[key] = float(val_text)
                    except ValueError:
                        city_data[key] = val_text
            cities.append(city_data)
        except (ValueError, IndexError):
            continue
    return cities
Handling Numbeo's Anti-Bot Measures
Numbeo uses several layers of bot detection, listed from most to least impactful:
IP-based rate limiting. More than a few requests per minute from the same IP triggers a CAPTCHA interstitial. You will see a "please verify you are human" page instead of data. The key symptom is receiving an HTML page that does not contain the expected table#t2 or table.data_wide_table selector.
Cookie validation. Numbeo sets tracking cookies on first visit. Requests without those cookies can get blocked. Using httpx.Client() persists cookies automatically within a session.
JavaScript challenge. Some pages require a JavaScript challenge cookie before the real content loads. This is the hardest to defeat without a headless browser.
For serious data collection, rotating residential proxies are required. Numbeo specifically blocks datacenter IP ranges, and rotating through residential IPs with different geolocations also lets you observe region-specific pricing that Numbeo adjusts based on visitor location.
ThorData's residential proxy network works well here — the IPs come from real ISPs across multiple countries and pass Numbeo's IP reputation checks without triggering challenges.
import random  # NOTE(review): already imported earlier in the file; duplicate is harmless.

# Proxy endpoints to rotate through. Replace USER/PASS with real
# credentials; one entry works but more endpoints improve rotation.
PROXY_LIST = [
    "http://USER:[email protected]:9000",
    # Add more proxy endpoints for rotation
]
def create_numbeo_session(proxy_url: str) -> httpx.Client:
    """Build an httpx.Client wired to a proxy with browser-like headers.

    The header set mimics a normal Chrome navigation request; the
    User-Agent is drawn at random from USER_AGENTS per session.
    """
    browser_headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Cache-Control": "no-cache",
    }
    return httpx.Client(
        transport=httpx.HTTPTransport(proxy=proxy_url),
        timeout=25,
        follow_redirects=True,
        headers=browser_headers,
    )
def is_captcha_page(html: str) -> bool:
    """Detect whether Numbeo served a CAPTCHA/challenge page instead of data.

    Checks the page text (case-insensitively) for phrases that appear on
    Numbeo and Cloudflare interstitials.
    """
    challenge_markers = (
        "verify you are human",
        "captcha",
        "are you a robot",
        "automated request",
        "cloudflare",
    )
    page = html.lower()
    for marker in challenge_markers:
        if marker in page:
            return True
    return False
def scrape_with_retry(
    url: str,
    proxy_list: list[str],
    max_attempts: int = 5,
) -> str | None:
    """Fetch *url* through randomly chosen proxies until one succeeds.

    Rotates to a fresh proxy (with a randomized back-off) after HTTP
    errors or CAPTCHA pages. Returns the page HTML, or None once
    max_attempts proxies have failed.
    """
    attempt = 0
    while attempt < max_attempts:
        session = create_numbeo_session(random.choice(proxy_list))
        try:
            resp = session.get(url)
            resp.raise_for_status()
            if not is_captcha_page(resp.text):
                return resp.text
            print(f" CAPTCHA on attempt {attempt + 1}, rotating proxy...")
            time.sleep(random.uniform(5, 15))
        except httpx.HTTPError as e:
            print(f" HTTP error on attempt {attempt + 1}: {e}")
            time.sleep(random.uniform(3, 8))
        finally:
            # Each proxy gets its own session; always release it.
            session.close()
        attempt += 1
    return None
SQLite Schema
import sqlite3
def init_numbeo_db(db_path: str = "numbeo.db") -> sqlite3.Connection:
    """Open (creating if necessary) the Numbeo SQLite database.

    Ensures the rankings, per-item price, and quality-of-life tables exist,
    along with lookup indexes on city_prices, then returns the connection.
    """
    ddl = """
    CREATE TABLE IF NOT EXISTS city_rankings (
        city TEXT PRIMARY KEY,
        rank INTEGER,
        cost_of_living_index REAL,
        rent_index REAL,
        col_plus_rent_index REAL,
        groceries_index REAL,
        restaurant_price_index REAL,
        purchasing_power_index REAL,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE IF NOT EXISTS city_prices (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        city TEXT NOT NULL,
        category TEXT NOT NULL,
        item TEXT NOT NULL,
        price_usd REAL,
        price_range TEXT,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        UNIQUE(city, item)
    );
    CREATE TABLE IF NOT EXISTS quality_of_life (
        city TEXT PRIMARY KEY,
        rank INTEGER,
        quality_of_life_index REAL,
        purchasing_power_index REAL,
        safety_index REAL,
        health_care_index REAL,
        cost_of_living_index REAL,
        pollution_index REAL,
        traffic_commute_time_index REAL,
        climate_index REAL,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    CREATE INDEX IF NOT EXISTS idx_city_prices_city
        ON city_prices(city);
    CREATE INDEX IF NOT EXISTS idx_city_prices_item
        ON city_prices(item);
    """
    connection = sqlite3.connect(db_path)
    connection.executescript(ddl)
    connection.commit()
    return connection
def save_city_ranking(conn: sqlite3.Connection, city_data: dict):
    """Upsert one city's composite-index row (keyed by city name).

    Missing keys in city_data are stored as NULL via dict.get().
    """
    column_order = (
        "city",
        "rank",
        "cost_of_living_index",
        "rent_index",
        "col_plus_rent_index",
        "groceries_index",
        "restaurant_price_index",
        "purchasing_power_index",
    )
    row = tuple(city_data.get(col) for col in column_order)
    conn.execute(
        """INSERT OR REPLACE INTO city_rankings
        (city, rank, cost_of_living_index, rent_index, col_plus_rent_index,
        groceries_index, restaurant_price_index, purchasing_power_index)
        VALUES (?,?,?,?,?,?,?,?)""",
        row,
    )
    conn.commit()
def save_city_prices(conn: sqlite3.Connection, city: str, prices: dict):
    """Bulk-upsert a city's scraped price items (unique on city + item)."""
    rows = []
    for item_name, info in prices.items():
        rows.append(
            (city, info["category"], item_name, info["price_usd"], info.get("range"))
        )
    conn.executemany(
        """INSERT OR REPLACE INTO city_prices
        (city, category, item, price_usd, price_range)
        VALUES (?,?,?,?,?)""",
        rows,
    )
    conn.commit()
Building a Multi-City Dataset
Here is a complete pipeline that scrapes rankings, then fetches detailed prices for the top cities:
def build_cost_dataset(
    cities: list[str],
    proxy_list: list[str],
    db_path: str = "numbeo.db",
) -> int:
    """Build a dataset of prices across multiple cities.

    Scrapes the global rankings first, then the detail page for each city
    in *cities*, persisting everything to SQLite with long randomized
    delays between requests.

    Args:
        cities: Numbeo city slugs to fetch detailed prices for.
        proxy_list: Proxy URLs; one is chosen at random per request.
        db_path: SQLite file to create/update.

    Returns:
        Total number of price records saved.
    """
    conn = init_numbeo_db(db_path)
    total_records = 0
    try:
        # First, scrape the rankings
        print("Scraping rankings...")
        proxy = random.choice(proxy_list)
        rankings = scrape_cost_of_living_index(proxy_url=proxy)
        for city_data in rankings:
            save_city_ranking(conn, city_data)
        print(f" Saved {len(rankings)} city rankings")
        time.sleep(random.uniform(8, 15))
        # Then fetch detailed prices for each city
        for city in cities:
            print(f"Scraping detailed prices: {city}")
            proxy = random.choice(proxy_list)
            prices = scrape_city_prices(city, proxy_url=proxy)
            if not prices:
                print(f" No data for {city} (possible block)")
                time.sleep(random.uniform(30, 60))  # Back off longer after a failed request
                continue
            save_city_prices(conn, city, prices)
            total_records += len(prices)
            print(f" Saved {len(prices)} price items")
            # Intentionally long delays — Numbeo monitors patterns closely
            delay = random.uniform(10, 20)
            print(f" Waiting {delay:.0f}s before next city...")
            time.sleep(delay)
    finally:
        # Fix: previously an exception mid-run leaked the open SQLite
        # connection; always close it.
        conn.close()
    return total_records
# City slugs in Numbeo's hyphenated, title-cased URL format.
CITIES = [
    "Warsaw", "Lisbon", "Berlin", "Barcelona", "Prague",
    "Budapest", "Vienna", "Amsterdam", "Copenhagen", "Dublin",
    "Krakow", "Porto", "Zagreb", "Ljubljana", "Tallinn",
]
# Replace USER/PASS with real residential proxy credentials before running.
PROXIES = ["http://USER:[email protected]:9000"]
# NOTE: kicks off the full (slow, network-bound) pipeline when the script runs.
total = build_cost_dataset(CITIES, PROXIES, db_path="numbeo.db")
print(f"\nTotal price records collected: {total}")
The long delays between cities (10-20 seconds) are intentional. Numbeo's monitoring is sophisticated and 1-2 second delays get flagged quickly. Treat this as a slow overnight job rather than something to blast through in minutes.
Useful SQL Queries on the Dataset
Once you have data in SQLite, here are queries that reveal interesting patterns:
import sqlite3

# Assumes numbeo.db was populated by build_cost_dataset() above.
conn = sqlite3.connect("numbeo.db")

# Most expensive cities for rent.
# Item names must match Numbeo's labels exactly, punctuation included.
rent_expensive = conn.execute("""
    SELECT city, price_usd FROM city_prices
    WHERE item = 'Apartment (1 bedroom) in City Centre'
    ORDER BY price_usd DESC LIMIT 10
""").fetchall()

# Best purchasing power vs cost of living ratio (higher = salary stretches
# further relative to local prices).
ratio_query = conn.execute("""
    SELECT city, cost_of_living_index, purchasing_power_index,
    ROUND(purchasing_power_index / cost_of_living_index, 2) AS pp_ratio
    FROM city_rankings
    ORDER BY pp_ratio DESC LIMIT 15
""").fetchall()

# Price of a specific item across all cities, cheapest first.
meal_prices = conn.execute("""
    SELECT city, price_usd FROM city_prices
    WHERE item = 'Meal, Inexpensive Restaurant'
    ORDER BY price_usd ASC
""").fetchall()
for city, price in meal_prices[:10]:
    print(f" {city}: ${price:.2f}")
Comparing Purchasing Power vs Salary Data
One of the most insightful analyses you can build from Numbeo is the salary-adjusted cost comparison. Numbeo publishes net monthly salary estimates alongside all the cost data:
def analyze_purchasing_power(cities: list[str], proxy_url: str | None = None) -> list[dict]:
    """
    Compare purchasing power across cities.
    Fetches salary estimates and computes months of rent, restaurant meals, etc.
    """
    results = []
    for city in cities:
        prices = scrape_city_prices(city, proxy_url=proxy_url)
        if not prices:
            continue

        def lookup(item_name: str):
            # Numbeo item labels must match exactly; missing items -> None.
            return prices.get(item_name, {}).get("price_usd")

        salary = lookup("Average Monthly Net Salary (After Tax)")
        rent_1br_center = lookup("Apartment (1 bedroom) in City Centre")
        restaurant_meal = lookup("Meal, Inexpensive Restaurant")
        big_mac = lookup("McMeal at McDonalds (or Equivalent Combo Meal)")
        if salary and salary > 0:
            results.append({
                "city": city,
                "net_salary_usd": salary,
                "rent_1br_center_usd": rent_1br_center,
                "rent_to_salary_pct": round(rent_1br_center / salary * 100, 1) if rent_1br_center else None,
                "restaurant_meal_usd": restaurant_meal,
                "meals_per_salary": round(salary / restaurant_meal, 0) if restaurant_meal else None,
                "big_mac_usd": big_mac,
            })
        # Polite pacing between city pages.
        time.sleep(random.uniform(8, 15))
    # Cities with unknown rent ratio sort last (999 sentinel).
    return sorted(results, key=lambda row: row.get("rent_to_salary_pct") or 999)
# Run the analysis (live network calls; replace proxy credentials first)
cities = ["Warsaw", "Lisbon", "Berlin", "Amsterdam", "Prague", "Budapest"]
analysis = analyze_purchasing_power(cities, proxy_url="http://USER:[email protected]:9000")
print(f"{'City':<15} {'Salary':>10} {'Rent (1BR)':>12} {'Rent%':>8} {'Meals':>8}")
print("-" * 58)
for r in analysis:
    # The keys are always present but may hold None when an item was
    # missing on the city page. `.get(key, 0)` does NOT help there (the
    # key exists with value None), so use `or 0` to avoid a TypeError in
    # the numeric format specs.
    print(
        f"{r['city']:<15} "
        f"${r['net_salary_usd']:>9,.0f} "
        f"${r['rent_1br_center_usd'] or 0:>11,.0f} "
        f"{r['rent_to_salary_pct'] or 0:>7.1f}% "
        f"{r['meals_per_salary'] or 0:>7.0f}"
    )
This kind of analysis is exactly what relocation tools and remote work comparison sites are built on. Warsaw and Prague consistently show up with better rent-to-salary ratios than Western European capitals, which is why they attract remote workers earning Northern European or US salaries.
Tracking Price Changes Over Time
If you run the scraper periodically and timestamp your data, you can build a price history:
import sqlite3
def init_numbeo_timeseries_db(db_path: str = "numbeo_ts.db") -> sqlite3.Connection:
    """Initialize database for tracking price changes over time.

    Creates an append-only snapshot table (no UNIQUE constraint, unlike
    city_prices) plus a covering index for trend queries.
    """
    ddl = """
    CREATE TABLE IF NOT EXISTS price_snapshots (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        city TEXT NOT NULL,
        category TEXT NOT NULL,
        item TEXT NOT NULL,
        price_usd REAL,
        scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    CREATE INDEX IF NOT EXISTS idx_snapshots_city_item
        ON price_snapshots(city, item, scraped_at);
    """
    connection = sqlite3.connect(db_path)
    connection.executescript(ddl)
    connection.commit()
    return connection
def save_snapshot(conn: sqlite3.Connection, city: str, prices: dict):
    """Save a timestamped price snapshot.

    Appends one row per item; scraped_at is filled by the table's
    CURRENT_TIMESTAMP default.
    """
    rows = []
    for item_name, info in prices.items():
        rows.append((city, info["category"], item_name, info["price_usd"]))
    conn.executemany(
        """INSERT INTO price_snapshots (city, category, item, price_usd)
        VALUES (?, ?, ?, ?)""",
        rows,
    )
    conn.commit()
def get_price_trend(
    conn: sqlite3.Connection,
    city: str,
    item: str,
    months: int = 6,
) -> list[dict]:
    """Get price history for a specific item in a city.

    Returns chronological {"price_usd", "date"} dicts for snapshots taken
    within the last *months* months (SQLite datetime offset).
    """
    window = f"-{months} months"
    cursor = conn.execute(
        """SELECT price_usd, scraped_at FROM price_snapshots
        WHERE city = ? AND item = ?
        AND scraped_at >= datetime('now', ?)
        ORDER BY scraped_at ASC""",
        (city, item, window),
    )
    return [{"price_usd": price, "date": stamp} for price, stamp in cursor]
Running this monthly builds a price history that can detect trends — rising rents, inflation in specific categories, or sudden changes like post-COVID price normalization.
Legal and Ethical Notes
Numbeo's Terms of Service prohibit automated scraping. Their data is crowdsourced from user submissions, but the aggregation, indices, and historical data are proprietary to Numbeo. For commercial applications, contact Numbeo about their data licensing program. For personal research and one-off analysis, keep your volume low, do not republish their data as your own product, and give their servers adequate time between requests.