
How to Scrape itch.io Game Data in 2026: Metadata, Ratings & Downloads

itch.io is the largest indie game marketplace — over 900,000 games, a huge jam culture, and a pay-what-you-want model that generates genuinely interesting pricing data. If you're building a game discovery tool, tracking jam trends, or researching the indie market, itch.io is a goldmine.

The catch: itch.io's public data access story is a mess. There's a partially documented API at itch.io/api/1/ that covers some user-level data, but most of what you want — download counts, ratings, bundle participation, jam entries — requires scraping game pages directly. This guide covers both.

What Data Is Available

Between the API and page scraping, you can collect:

  - Core metadata — title, URL, description, cover image, platforms, classification
  - Pricing — minimum price (reported in cents by the API) and whether the game can be bought
  - Engagement — star ratings, rating counts, download counts, view counts
  - Community — tags, jam entries, and jam participation stats

Download counts are the trickiest. Developers can hide them. When visible, they appear as text on the game page. When hidden, you're working with view counts only, which are always visible.

itch.io's Anti-Bot Measures

itch.io is not Steam. Their infrastructure is smaller, which means rate limits kick in fast:

  1. Rate limiting — Aggressive. Hit the same IP repeatedly and you'll start getting 429s within a few minutes. The threshold is somewhere around 60-80 requests per minute before they start throttling.
  2. Cloudflare — itch.io sits behind Cloudflare. The main challenge layer is usually JS challenge for suspicious traffic patterns, not full CAPTCHA. A good user-agent and normal headers get you past most of it.
  3. No official public API for most data — The documented API at itch.io/api/1/ requires an API key and is scoped to authenticated user data. It won't hand you a list of all games.
  4. Search pagination limits — The browse/search pages paginate, but itch.io will start returning empty results after you go deep enough (around page 200+ for some queries).
  5. Inconsistent page structure — itch.io lets developers customize their game pages heavily. Field presence is not guaranteed.
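Given those thresholds, it's worth centralizing the pacing and backoff math in one place. This is a minimal sketch — the 50-requests-per-minute budget is an assumption derived from the ~60-80 req/min throttling threshold above, and `pacing_interval`/`backoff_delay` are hypothetical helper names:

```python
REQUESTS_PER_MINUTE = 50  # assumed safety margin under the ~60-80 req/min threshold

def pacing_interval(rpm: int = REQUESTS_PER_MINUTE) -> float:
    """Seconds to sleep between requests to stay under the budget."""
    return 60.0 / rpm

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 60.0) -> float:
    """Exponential backoff after a 429: 5s, 10s, 20s, ... capped at 60s."""
    return min(base * (2 ** attempt), cap)
```

Sleep `pacing_interval()` between normal requests; when a 429 slips through anyway, sleep `backoff_delay(attempt)` and retry.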

Using the itch.io API

The API endpoint is https://itch.io/api/1/KEY/. You get a key from your itch.io account under Settings > API keys.

What it actually covers: your own profile data, games you've uploaded, purchases, and some game lookup by ID. It's not a public catalog API. That said, game/{id} lookups are useful once you have IDs from scraping.

import requests
import time

ITCH_API_KEY = "YOUR_API_KEY"
BASE_API = "https://itch.io/api/1"

def get_game_by_id(game_id: int, retries: int = 3) -> dict | None:
    """Fetch game details from the itch.io API by game ID."""
    url = f"{BASE_API}/{ITCH_API_KEY}/game/{game_id}"
    resp = requests.get(url, timeout=15)

    if resp.status_code == 404:
        return None
    if resp.status_code == 429:
        if retries <= 0:
            return None
        time.sleep(10)
        return get_game_by_id(game_id, retries - 1)

    data = resp.json()
    game = data.get("game", {})
    return {
        "id": game.get("id"),
        "title": game.get("title"),
        "url": game.get("url"),
        "description": game.get("short_text"),
        "cover_url": game.get("cover_url"),
        "min_price": (game.get("min_price") or 0) / 100,  # API reports cents
        "published": game.get("published_at"),
        "platforms": {
            "windows": game.get("p_windows", False),
            "mac": game.get("p_osx", False),
            "linux": game.get("p_linux", False),
            "android": game.get("p_android", False),
        },
        "classification": game.get("classification"),
        "can_be_bought": game.get("can_be_bought", False),
    }

The API returns prices in cents. Game IDs are integers you can find embedded in page HTML or discover by enumerating (though sequential ID enumeration is slow and wasteful — pull IDs from browse pages instead).

Scraping Game Pages with BeautifulSoup

For ratings, download counts, jam entries, and bundle flags, you're scraping HTML. itch.io game pages are mostly server-rendered, which makes BeautifulSoup straightforward.

pip install requests beautifulsoup4 lxml
from bs4 import BeautifulSoup
import requests
import re

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def scrape_game_page(game_url: str) -> dict:
    """Scrape metadata from an itch.io game page."""
    resp = requests.get(game_url, headers=HEADERS, timeout=20)
    if resp.status_code != 200:
        return {"error": resp.status_code, "url": game_url}

    soup = BeautifulSoup(resp.text, "lxml")

    # Game title
    title_el = soup.select_one(".game_title h1") or soup.select_one("h1.game_title")
    title = title_el.get_text(strip=True) if title_el else None

    # Rating — appears as "Rated X.X out of 5 stars" in a span
    rating = None
    rating_count = None
    rating_el = soup.select_one(".aggregate_rating")
    if rating_el:
        stars_el = rating_el.select_one(".rating_value")
        count_el = rating_el.select_one(".rating_count")
        if stars_el:
            rating = float(stars_el.get_text(strip=True))
        if count_el:
            count_text = count_el.get_text(strip=True)
            match = re.search(r"[\d,]+", count_text)
            if match:
                rating_count = int(match.group().replace(",", ""))

    # Download / view counts from the stat table
    downloads = None
    views = None
    for row in soup.select(".game_info_panel_widget table tr"):
        cells = row.select("td")
        if len(cells) == 2:
            label = cells[0].get_text(strip=True).lower()
            value_text = cells[1].get_text(strip=True).replace(",", "")
            if "download" in label:
                match = re.search(r"\d+", value_text)
                if match:
                    downloads = int(match.group())
            elif "view" in label:
                match = re.search(r"\d+", value_text)
                if match:
                    views = int(match.group())

    # Tags
    tags = [a.get_text(strip=True) for a in soup.select(".tags a")]

    # Jam entries — listed in a sidebar section
    jams = []
    jam_section = soup.select_one(".game_info_panel_widget .jam_links")
    if jam_section:
        for a in jam_section.select("a"):
            jams.append({"name": a.get_text(strip=True), "url": a.get("href")})

    # Extract game ID from page source
    game_id = None
    id_match = re.search(r'"id":(\d+)', resp.text[:5000])
    if id_match:
        game_id = int(id_match.group(1))

    return {
        "id": game_id,
        "url": game_url,
        "title": title,
        "rating": rating,
        "rating_count": rating_count,
        "downloads": downloads,
        "views": views,
        "tags": tags,
        "jams": jams,
    }

Discovering Games: Browse Page Scraping

The itch.io browse pages at https://itch.io/games are your starting point for building a game list.

def scrape_browse_page(page: int = 1, tag: str = None, sort: str = "popular") -> list:
    """Scrape a page of games from itch.io browse."""
    url = "https://itch.io/games"
    params = {"page": page, "sort": sort}
    if tag:
        url = f"https://itch.io/games/tag-{tag}"

    resp = requests.get(url, params=params, headers=HEADERS, timeout=20)
    soup = BeautifulSoup(resp.text, "lxml")

    games = []
    for cell in soup.select(".game_cell"):
        link = cell.select_one("a.title")
        price_el = cell.select_one(".price_value")
        thumb_el = cell.select_one(".thumb_link")

        game_url = link.get("href") if link else None
        # Browse cells expose the numeric game ID as a data attribute;
        # game URLs themselves (author.itch.io/slug) contain no numeric ID.
        game_id = cell.get("data-game_id")

        games.append({
            "id": int(game_id) if game_id else None,
            "title": link.get_text(strip=True) if link else None,
            "url": game_url,
            "price_text": price_el.get_text(strip=True) if price_el else "Free",
            "thumb": thumb_el.get("data-background_image") if thumb_el else None,
        })

    return games


def scrape_all_browse(tag: str = None, max_pages: int = 50) -> list:
    """Paginate through browse results, respecting rate limits."""
    all_games = []
    for page in range(1, max_pages + 1):
        batch = scrape_browse_page(page=page, tag=tag)
        if not batch:
            break
        all_games.extend(batch)
        print(f"Page {page}: {len(batch)} games, total {len(all_games)}")
        time.sleep(2)  # Stay well under the rate limit threshold
    return all_games

Scraping Game Jam Pages

Game jams are a unique data source on itch.io — they generate concentrated activity and are indexed publicly. The jam listing page at https://itch.io/jams is fully scrapeable:

def scrape_jam_listings(page: int = 1) -> list:
    """Scrape active and recent game jams from itch.io."""
    url = "https://itch.io/jams"
    params = {"page": page}
    resp = requests.get(url, params=params, headers=HEADERS, timeout=20)
    soup = BeautifulSoup(resp.text, "lxml")

    jams = []
    for cell in soup.select(".jam_list_widget .jam"):
        title_el = cell.select_one(".title a")
        participants_el = cell.select_one(".stat_value")
        date_el = cell.select_one(".jam_dates")
        host_el = cell.select_one(".hosted_by a")

        jam_url = title_el.get("href") if title_el else None
        participants_text = participants_el.get_text(strip=True) if participants_el else "0"
        participants_match = re.search(r"\d+", participants_text.replace(",", ""))

        jams.append({
            "title": title_el.get_text(strip=True) if title_el else None,
            "url": jam_url,
            "participants": int(participants_match.group()) if participants_match else 0,
            "dates": date_el.get_text(strip=True) if date_el else None,
            "host": host_el.get_text(strip=True) if host_el else None,
        })

    return jams


def scrape_jam_entries(jam_url: str, max_pages: int = 10) -> list:
    """Scrape game entries from a specific jam page."""
    all_entries = []

    for page in range(1, max_pages + 1):
        resp = requests.get(
            jam_url,
            params={"page": page},
            headers=HEADERS,
            timeout=20,
        )
        soup = BeautifulSoup(resp.text, "lxml")

        entries = []
        for cell in soup.select(".game_cell"):
            link = cell.select_one("a.title")
            rating_el = cell.select_one(".aggregate_rating .rating_value")
            entries.append({
                "title": link.get_text(strip=True) if link else None,
                "url": link.get("href") if link else None,
                "rating": float(rating_el.get_text(strip=True)) if rating_el else None,
            })

        if not entries:
            break

        all_entries.extend(entries)
        print(f"Jam page {page}: {len(entries)} entries")
        time.sleep(2)

    return all_entries

Proxy Rotation for Scale

If you're scraping more than a few hundred games per run, you'll need proxy rotation. itch.io's rate limits are IP-based, and a single residential IP will hit throttling quickly.

Residential proxies make a real difference here because Cloudflare treats datacenter IPs with more suspicion. For high-volume itch.io scraping, ThorData works well — their residential proxy pool covers 195+ countries and supports sticky sessions if you need to maintain a session across a paginated crawl.

import random

PROXIES = [
    "http://USER:[email protected]:9000",
    # Add more proxy endpoints or configure rotation via ThorData's dashboard
]

def get_with_proxy(url: str, retries: int = 3) -> requests.Response | None:
    """Request through a rotating proxy with retries; None if all attempts fail."""
    for attempt in range(retries):
        proxy = random.choice(PROXIES)
        try:
            resp = requests.get(
                url,
                headers=HEADERS,
                proxies={"http": proxy, "https": proxy},
                timeout=20,
            )
            if resp.status_code == 429:
                wait = 5 * (attempt + 1)
                print(f"Rate limited, waiting {wait}s...")
                time.sleep(wait)
                continue
            return resp
        except requests.exceptions.ProxyError:
            print(f"Proxy error on attempt {attempt + 1}, rotating...")
            time.sleep(2)

    return None

The sticky session feature is useful when you're scraping game pages from a browse listing — keeping the same IP across a batch of requests looks more natural than rotating on every single call.

Enriching Data: Fetching Individual Game Pages at Scale

Once you have a list of game URLs from the browse scraper, enrich each one with full metadata:

import json
import sqlite3

def batch_enrich_games(game_urls: list, db_path: str = "itch_games.db", proxy: str = None) -> int:
    """Fetch and store full details for a list of game URLs."""
    conn = init_db(db_path)
    enriched = 0

    for i, url in enumerate(game_urls):
        try:
            # scrape_game_page makes its own direct request; to route it
            # through the proxy pool, swap its requests.get call for
            # get_with_proxy. Here the proxy flag only tightens the delay.
            game_data = scrape_game_page(url)
            if game_data and not game_data.get("error"):
                save_game(conn, game_data)
                enriched += 1

            if i % 10 == 0:
                print(f"Progress: {i}/{len(game_urls)} ({enriched} saved)")

            time.sleep(2.5 if not proxy else 1.5)
        except Exception as e:
            print(f"Error on {url}: {e}")

    conn.close()
    return enriched

Combine browse data with jam participation to spot games gaining traction early:

def find_rising_games(db_path: str = "itch_games.db") -> list:
    """Find games with high jam participation and growing ratings."""
    conn = sqlite3.connect(db_path)

    rows = conn.execute("""
        SELECT g.id, g.title, g.url, g.rating, g.rating_count,
               g.downloads, g.views, g.jams,
               (g.rating_count * 1.0 / NULLIF(g.views, 0)) AS engagement_rate
        FROM games g
        WHERE g.rating >= 4.0
          AND g.rating_count >= 10
          AND g.jams != '[]'
        ORDER BY engagement_rate DESC
        LIMIT 50
    """).fetchall()

    conn.close()
    return [
        {
            "id": r[0], "title": r[1], "url": r[2],
            "rating": r[3], "rating_count": r[4],
            "downloads": r[5], "views": r[6],
            "jams": json.loads(r[7]) if r[7] else [],
            "engagement_rate": round(r[8] * 100, 3) if r[8] else 0,
        }
        for r in rows
    ]


rising = find_rising_games()
for game in rising[:10]:
    jams_count = len(game["jams"])
    print(f"{game['title']}: {game['rating']}★ ({game['rating_count']} ratings), "
          f"{jams_count} jam(s), {game['engagement_rate']}% engagement")
    print(f"  {game['url']}")

Storing Results

For an ongoing scrape, SQLite is the right choice. itch.io data changes — ratings accumulate, download counts grow, games go on sale. You want to track changes over time, not just snapshot once.

import sqlite3
import json

def init_db(path: str = "itch_games.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS games (
            id INTEGER PRIMARY KEY,
            title TEXT,
            url TEXT UNIQUE,
            rating REAL,
            rating_count INTEGER,
            downloads INTEGER,
            views INTEGER,
            tags TEXT,
            jams TEXT,
            min_price REAL,
            platforms TEXT,
            first_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS snapshots (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            game_id INTEGER,
            rating REAL,
            rating_count INTEGER,
            downloads INTEGER,
            views INTEGER,
            recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    return conn


def save_game(conn: sqlite3.Connection, game: dict):
    conn.execute("""
        INSERT INTO games (id, title, url, rating, rating_count, downloads, views, tags, jams, min_price)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            rating=excluded.rating,
            rating_count=excluded.rating_count,
            downloads=excluded.downloads,
            views=excluded.views,
            last_updated=CURRENT_TIMESTAMP
    """, (
        game.get("id"), game.get("title"), game.get("url"),
        game.get("rating"), game.get("rating_count"),
        game.get("downloads"), game.get("views"),
        json.dumps(game.get("tags", [])),
        json.dumps(game.get("jams", [])),
        game.get("min_price"),
    ))
    # Record a snapshot for trend tracking
    if game.get("id"):
        conn.execute("""
            INSERT INTO snapshots (game_id, rating, rating_count, downloads, views)
            VALUES (?, ?, ?, ?, ?)
        """, (game["id"], game.get("rating"), game.get("rating_count"),
              game.get("downloads"), game.get("views")))
    conn.commit()

If you just want a quick export to CSV:

import csv

def export_csv(conn: sqlite3.Connection, path: str = "itch_games.csv"):
    cursor = conn.execute("SELECT id, title, url, rating, rating_count, downloads, views FROM games")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "title", "url", "rating", "rating_count", "downloads", "views"])
        writer.writerows(cursor.fetchall())

Snapshots let you calculate velocity — how fast a game's rating count is growing:

from datetime import datetime

def compute_rating_velocity(conn: sqlite3.Connection, game_id: int, days: int = 7) -> dict:
    """Calculate how fast a game is accumulating ratings."""
    rows = conn.execute("""
        SELECT rating_count, recorded_at
        FROM snapshots
        WHERE game_id = ?
          AND recorded_at >= datetime('now', '-' || ? || ' days')
        ORDER BY recorded_at
    """, (game_id, days)).fetchall()

    if len(rows) < 2:
        return {"game_id": game_id, "velocity": 0, "snapshots": len(rows)}

    first_count = rows[0][0] or 0
    last_count = rows[-1][0] or 0
    delta = last_count - first_count

    first_ts = datetime.fromisoformat(rows[0][1])
    last_ts = datetime.fromisoformat(rows[-1][1])
    elapsed_days = (last_ts - first_ts).total_seconds() / 86400

    velocity = delta / elapsed_days if elapsed_days > 0 else 0

    return {
        "game_id": game_id,
        "ratings_gained": delta,
        "elapsed_days": round(elapsed_days, 1),
        "ratings_per_day": round(velocity, 2),
        "snapshots": len(rows),
    }

Complete Scraping Pipeline

Tie everything together into a weekly job:

def run_itch_pipeline(
    tags: list = None,
    max_browse_pages: int = 20,
    db_path: str = "itch_games.db",
    proxy: str = None,
) -> dict:
    """Full pipeline: browse -> collect IDs -> enrich -> store."""
    tags = tags or ["horror", "puzzle", "platformer", "roguelike", "visual-novel"]
    conn = init_db(db_path)
    all_game_urls = set()

    for tag in tags:
        print(f"\nBrowsing tag: {tag}")
        games = scrape_all_browse(tag=tag, max_pages=max_browse_pages)
        for g in games:
            if g.get("url"):
                all_game_urls.add(g["url"])
        print(f"  Found {len(games)} games for '{tag}'")
        time.sleep(3)

    print(f"\nTotal unique game URLs: {len(all_game_urls)}")
    enriched = batch_enrich_games(list(all_game_urls), db_path=db_path, proxy=proxy)

    total_games = conn.execute("SELECT COUNT(*) FROM games").fetchone()[0]
    conn.close()

    return {
        "tags_scraped": tags,
        "urls_found": len(all_game_urls),
        "enriched_this_run": enriched,
        "total_in_db": total_games,
    }


if __name__ == "__main__":
    PROXY = "http://USER:[email protected]:9000"
    results = run_itch_pipeline(proxy=PROXY)
    print(f"\nPipeline complete: {results['enriched_this_run']} enriched, {results['total_in_db']} total in DB")

itch.io's Terms of Service don't explicitly permit automated scraping, but their robots.txt doesn't blanket-block crawlers either. The site is largely open — game pages are public, and no login is required for most data.

Practical rules: don't hit them faster than a human could reasonably browse, don't scrape private data (purchases, user accounts), and don't republish the data in a way that replicates their storefront. Game metadata for research, market analysis, or building discovery tools sits in a reasonable gray area. If you're building something commercial that depends heavily on their data, reaching out to ask is worth the two minutes it takes.
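As part of "don't hit them faster than a human," it's also cheap to check robots.txt programmatically before crawling. This sketch uses the stdlib parser against a hypothetical robots.txt body — it makes no claim about itch.io's actual rules:

```python
from urllib.robotparser import RobotFileParser

def allowed_paths(robots_txt: str, paths: list, agent: str = "*") -> dict:
    """Map each path to whether the given robots.txt permits fetching it."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {p: rp.can_fetch(agent, p) for p in paths}

# Hypothetical rules, for illustration only:
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private\n"
```

In production you'd fetch the live file from `https://itch.io/robots.txt` and run your crawl targets through it once at startup.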

The jam data is the most scraper-friendly: jams are explicitly public, itch.io promotes participation counts prominently, and the community culture around jams is open. Building jam analytics tools is clearly within the spirit of the platform.
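A jam-analytics summary over scraped entries reduces to simple aggregation — `summarize_jam` is a hypothetical helper operating on dicts shaped like the `scrape_jam_entries` output above:

```python
from statistics import median

def summarize_jam(entries: list) -> dict:
    """Aggregate jam entries into headline stats, skipping unrated games."""
    rated = [e["rating"] for e in entries if e.get("rating") is not None]
    return {
        "entries": len(entries),
        "rated": len(rated),
        "median_rating": median(rated) if rated else None,
        "share_rated": round(len(rated) / len(entries), 2) if entries else 0.0,
    }
```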

Key Takeaways