How to Scrape itch.io Game Data in 2026: Metadata, Ratings & Downloads
itch.io is the largest indie game marketplace — over 900,000 games, a huge jam culture, and a pay-what-you-want model that generates genuinely interesting pricing data. If you're building a game discovery tool, tracking jam trends, or researching the indie market, itch.io is a goldmine.
The catch: itch.io's public data access story is a mess. There's a partially documented API at itch.io/api/1/ that covers some user-level data, but most of what you want — download counts, ratings, bundle participation, jam entries — requires scraping game pages directly. This guide covers both.
What Data Is Available
Between the API and page scraping, you can collect:
- Game metadata — title, description, genre tags, platform support (Windows/Mac/Linux/browser), game engine hints, release date
- Pricing — minimum price, suggested price, pay-what-you-want flag, sale status
- Rating data — star ratings and total rating count (displayed on game pages)
- Download estimates — not exposed directly, but view counts and "number of downloads" appear on some game pages when the developer opts in
- Bundle participation — whether a game was part of itch.io bundles (including the massive Bundle for Racial Justice and Equality)
- Jam entries — which game jams a game was submitted to, jam placement, number of entries in the jam
- Creator info — developer username, links, total games published
- Community data — comment counts, community posts if the game has a forum
Download counts are the trickiest. Developers can hide them. When visible, they appear as text on the game page. When hidden, you're working with view counts only, which are always visible.
itch.io's Anti-Bot Measures
itch.io is not Steam. Their infrastructure is smaller, which means rate limits kick in fast:
- Rate limiting — Aggressive. Hit the same IP repeatedly and you'll start getting 429s within a few minutes. The threshold is somewhere around 60-80 requests per minute before they start throttling.
- Cloudflare — itch.io sits behind Cloudflare. The main challenge layer is usually JS challenge for suspicious traffic patterns, not full CAPTCHA. A good user-agent and normal headers get you past most of it.
- No official public API for most data — The documented API at itch.io/api/1/ requires an API key and is scoped to authenticated user data. It won't hand you a list of all games.
- Search pagination limits — The browse/search pages paginate, but itch.io will start returning empty results after you go deep enough (around page 200+ for some queries).
- Inconsistent page structure — itch.io lets developers customize their game pages heavily. Field presence is not guaranteed.
Using the itch.io API
The API endpoint is https://itch.io/api/1/KEY/. You get a key from your itch.io account under Settings > API keys.
What it actually covers: your own profile data, games you've uploaded, purchases, and some game lookup by ID. It's not a public catalog API. That said, game/{id} lookups are useful once you have IDs from scraping.
import requests
import time
ITCH_API_KEY = "YOUR_API_KEY"
BASE_API = "https://itch.io/api/1"
def get_game_by_id(game_id: int, retries: int = 3) -> dict | None:
    """Fetch game details from the itch.io API by game ID."""
    url = f"{BASE_API}/{ITCH_API_KEY}/game/{game_id}"
    resp = requests.get(url, timeout=15)
    if resp.status_code == 404:
        return None
    if resp.status_code == 429 and retries > 0:
        # Back off and retry a bounded number of times
        # instead of recursing without a limit
        time.sleep(10)
        return get_game_by_id(game_id, retries - 1)
    resp.raise_for_status()
    data = resp.json()
    game = data.get("game", {})
return {
"id": game.get("id"),
"title": game.get("title"),
"url": game.get("url"),
"description": game.get("short_text"),
"cover_url": game.get("cover_url"),
"min_price": game.get("min_price", 0) / 100,
"published": game.get("published_at"),
"platforms": {
"windows": game.get("p_windows", False),
"mac": game.get("p_osx", False),
"linux": game.get("p_linux", False),
"android": game.get("p_android", False),
},
"classification": game.get("classification"),
"can_be_bought": game.get("can_be_bought", False),
}
The API returns prices in cents. Game IDs are integers you can find embedded in page HTML or discover by enumerating (though sequential ID enumeration is slow and wasteful — pull IDs from browse pages instead).
Scraping Game Pages with BeautifulSoup
For ratings, download counts, jam entries, and bundle flags, you're scraping HTML. itch.io game pages are mostly server-rendered, which makes BeautifulSoup straightforward.
pip install requests beautifulsoup4 lxml
from bs4 import BeautifulSoup
import requests
import re
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
}
def scrape_game_page(game_url: str) -> dict:
"""Scrape metadata from an itch.io game page."""
resp = requests.get(game_url, headers=HEADERS, timeout=20)
if resp.status_code != 200:
return {"error": resp.status_code, "url": game_url}
soup = BeautifulSoup(resp.text, "lxml")
# Game title
title_el = soup.select_one(".game_title h1") or soup.select_one("h1.game_title")
title = title_el.get_text(strip=True) if title_el else None
# Rating — appears as "Rated X.X out of 5 stars" in a span
rating = None
rating_count = None
rating_el = soup.select_one(".aggregate_rating")
if rating_el:
stars_el = rating_el.select_one(".rating_value")
count_el = rating_el.select_one(".rating_count")
if stars_el:
rating = float(stars_el.get_text(strip=True))
if count_el:
count_text = count_el.get_text(strip=True)
match = re.search(r"[\d,]+", count_text)
if match:
rating_count = int(match.group().replace(",", ""))
# Download / view counts from the stat table
downloads = None
views = None
for row in soup.select(".game_info_panel_widget table tr"):
cells = row.select("td")
if len(cells) == 2:
label = cells[0].get_text(strip=True).lower()
value_text = cells[1].get_text(strip=True).replace(",", "")
if "download" in label:
match = re.search(r"\d+", value_text)
if match:
downloads = int(match.group())
elif "view" in label:
match = re.search(r"\d+", value_text)
if match:
views = int(match.group())
# Tags
tags = [a.get_text(strip=True) for a in soup.select(".tags a")]
# Jam entries — listed in a sidebar section
jams = []
jam_section = soup.select_one(".game_info_panel_widget .jam_links")
if jam_section:
for a in jam_section.select("a"):
jams.append({"name": a.get_text(strip=True), "url": a.get("href")})
# Extract game ID from page source
game_id = None
id_match = re.search(r'"id":(\d+)', resp.text[:5000])
if id_match:
game_id = int(id_match.group(1))
return {
"id": game_id,
"url": game_url,
"title": title,
"rating": rating,
"rating_count": rating_count,
"downloads": downloads,
"views": views,
"tags": tags,
"jams": jams,
}
Discovering Games: Browse Page Scraping
The itch.io browse pages at https://itch.io/games are your starting point for building a game list.
def scrape_browse_page(page: int = 1, tag: str = None, sort: str = "popular") -> list:
"""Scrape a page of games from itch.io browse."""
url = "https://itch.io/games"
params = {"page": page, "sort": sort}
if tag:
url = f"https://itch.io/games/tag-{tag}"
resp = requests.get(url, params=params, headers=HEADERS, timeout=20)
soup = BeautifulSoup(resp.text, "lxml")
games = []
for cell in soup.select(".game_cell"):
link = cell.select_one("a.title")
price_el = cell.select_one(".price_value")
thumb_el = cell.select_one(".thumb_link")
        game_url = link.get("href") if link else None
        # Game URLs are vanity slugs with no numeric ID in them; the
        # cell's data-game_id attribute is the more reliable source
        # when itch.io includes it
        game_id = cell.get("data-game_id")
        games.append({
            "id": int(game_id) if game_id else None,
"title": link.get_text(strip=True) if link else None,
"url": game_url,
"price_text": price_el.get_text(strip=True) if price_el else "Free",
"thumb": thumb_el.get("data-background_image") if thumb_el else None,
})
return games
def scrape_all_browse(tag: str = None, max_pages: int = 50) -> list:
"""Paginate through browse results, respecting rate limits."""
all_games = []
for page in range(1, max_pages + 1):
batch = scrape_browse_page(page=page, tag=tag)
if not batch:
break
all_games.extend(batch)
print(f"Page {page}: {len(batch)} games, total {len(all_games)}")
time.sleep(2) # Stay well under the rate limit threshold
return all_games
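The `price_text` field comes back as display text ("$4.99", "Free"). A small parser normalizes it for analysis — a sketch covering the common dollar formats; other currencies and sale markup are left as an exercise:

```python
import re

def parse_price_text(price_text: str) -> float:
    """Convert itch.io display prices like '$4.99' or 'Free' to a
    float dollar amount; returns 0.0 for free / pay-what-you-want."""
    if not price_text:
        return 0.0
    match = re.search(r"[\d.]+", price_text.replace(",", ""))
    return float(match.group()) if match else 0.0
```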
Scraping Game Jam Pages
Game jams are a unique data source on itch.io — they generate concentrated activity and are indexed publicly. The jam listing page at https://itch.io/jams is fully scrapeable:
def scrape_jam_listings(page: int = 1) -> list:
"""Scrape active and recent game jams from itch.io."""
url = "https://itch.io/jams"
params = {"page": page}
resp = requests.get(url, params=params, headers=HEADERS, timeout=20)
soup = BeautifulSoup(resp.text, "lxml")
jams = []
for cell in soup.select(".jam_list_widget .jam"):
title_el = cell.select_one(".title a")
participants_el = cell.select_one(".stat_value")
date_el = cell.select_one(".jam_dates")
host_el = cell.select_one(".hosted_by a")
jam_url = title_el.get("href") if title_el else None
participants_text = participants_el.get_text(strip=True) if participants_el else "0"
participants_match = re.search(r"\d+", participants_text.replace(",", ""))
jams.append({
"title": title_el.get_text(strip=True) if title_el else None,
"url": jam_url,
"participants": int(participants_match.group()) if participants_match else 0,
"dates": date_el.get_text(strip=True) if date_el else None,
"host": host_el.get_text(strip=True) if host_el else None,
})
return jams
def scrape_jam_entries(jam_url: str, max_pages: int = 10) -> list:
"""Scrape game entries from a specific jam page."""
all_entries = []
for page in range(1, max_pages + 1):
resp = requests.get(
jam_url,
params={"page": page},
headers=HEADERS,
timeout=20,
)
soup = BeautifulSoup(resp.text, "lxml")
entries = []
for cell in soup.select(".game_cell"):
link = cell.select_one("a.title")
rating_el = cell.select_one(".aggregate_rating .rating_value")
entries.append({
"title": link.get_text(strip=True) if link else None,
"url": link.get("href") if link else None,
"rating": float(rating_el.get_text(strip=True)) if rating_el else None,
})
if not entries:
break
all_entries.extend(entries)
print(f"Jam page {page}: {len(entries)} entries")
time.sleep(2)
return all_entries
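With entries in hand, a small pure helper can summarize a jam's rating distribution. This is a sketch that assumes the entry dicts returned by `scrape_jam_entries` above (with `rating` possibly `None` for unrated games):

```python
def summarize_jam(entries: list) -> dict:
    """Aggregate rating stats for a list of jam entry dicts."""
    rated = [e["rating"] for e in entries if e.get("rating") is not None]
    return {
        "entries": len(entries),
        "rated": len(rated),
        "avg_rating": round(sum(rated) / len(rated), 2) if rated else None,
        "top": max(rated) if rated else None,
    }
```

Comparing `rated / entries` across jams gives a rough sense of how actively each jam's community rates its submissions.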
Proxy Rotation for Scale
If you're scraping more than a few hundred games per run, you'll need proxy rotation. itch.io's rate limits are IP-based, and a single IP will hit throttling quickly.
Residential proxies make a real difference here because Cloudflare treats datacenter IPs with more suspicion. For high-volume itch.io scraping, ThorData works well — their residential proxy pool covers 195+ countries and supports sticky sessions if you need to maintain a session across a paginated crawl.
import random
PROXIES = [
"http://USER:[email protected]:9000",
# Add more proxy endpoints or configure rotation via ThorData's dashboard
]
def get_with_proxy(url: str, retries: int = 3) -> requests.Response | None:
    """Make a request through a rotating proxy with retry logic."""
for attempt in range(retries):
proxy = random.choice(PROXIES)
try:
resp = requests.get(
url,
headers=HEADERS,
proxies={"http": proxy, "https": proxy},
timeout=20,
)
if resp.status_code == 429:
wait = 5 * (attempt + 1)
print(f"Rate limited, waiting {wait}s...")
time.sleep(wait)
continue
return resp
except requests.exceptions.ProxyError:
print(f"Proxy error on attempt {attempt + 1}, rotating...")
time.sleep(2)
return None
The sticky session feature is useful when you're scraping game pages from a browse listing — keeping the same IP across a batch of requests looks more natural than rotating on every single call.
Enriching Data: Fetching Individual Game Pages at Scale
Once you have a list of game URLs from the browse scraper, enrich each one with full metadata:
import json
import sqlite3
def batch_enrich_games(game_urls: list, db_path: str = "itch_games.db", proxy: str = None) -> int:
"""Fetch and store full details for a list of game URLs."""
conn = init_db(db_path)
enriched = 0
    for i, url in enumerate(game_urls):
        try:
            # scrape_game_page issues its own direct request; to route
            # it through the proxy pool, refactor it to accept pre-fetched
            # HTML instead of fetching by URL itself
            game_data = scrape_game_page(url)
if game_data and not game_data.get("error"):
save_game(conn, game_data)
enriched += 1
if i % 10 == 0:
print(f"Progress: {i}/{len(game_urls)} ({enriched} saved)")
time.sleep(2.5 if not proxy else 1.5)
except Exception as e:
print(f"Error on {url}: {e}")
conn.close()
return enriched
Detecting Trending Games Before They Go Viral
Combine browse data with jam participation to spot games gaining traction early:
def find_rising_games(db_path: str = "itch_games.db") -> list:
"""Find games with high jam participation and growing ratings."""
conn = sqlite3.connect(db_path)
rows = conn.execute("""
SELECT g.id, g.title, g.url, g.rating, g.rating_count,
g.downloads, g.views, g.jams,
(g.rating_count * 1.0 / NULLIF(g.views, 0)) AS engagement_rate
FROM games g
WHERE g.rating >= 4.0
AND g.rating_count >= 10
AND g.jams != '[]'
ORDER BY engagement_rate DESC
LIMIT 50
""").fetchall()
conn.close()
return [
{
"id": r[0], "title": r[1], "url": r[2],
"rating": r[3], "rating_count": r[4],
"downloads": r[5], "views": r[6],
"jams": json.loads(r[7]) if r[7] else [],
"engagement_rate": round(r[8] * 100, 3) if r[8] else 0,
}
for r in rows
]
rising = find_rising_games()
for game in rising[:10]:
jams_count = len(game["jams"])
print(f"{game['title']}: {game['rating']}★ ({game['rating_count']} ratings), "
f"{jams_count} jam(s), {game['engagement_rate']}% engagement")
print(f" {game['url']}")
Storing Results
For an ongoing scrape, SQLite is the right choice. itch.io data changes — ratings accumulate, download counts grow, games go on sale. You want to track changes over time, not just snapshot once.
import sqlite3
import json
def init_db(path: str = "itch_games.db") -> sqlite3.Connection:
conn = sqlite3.connect(path)
conn.execute("""
CREATE TABLE IF NOT EXISTS games (
id INTEGER PRIMARY KEY,
title TEXT,
url TEXT UNIQUE,
rating REAL,
rating_count INTEGER,
downloads INTEGER,
views INTEGER,
tags TEXT,
jams TEXT,
min_price REAL,
platforms TEXT,
first_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
game_id INTEGER,
rating REAL,
rating_count INTEGER,
downloads INTEGER,
views INTEGER,
recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.commit()
return conn
def save_game(conn: sqlite3.Connection, game: dict):
conn.execute("""
INSERT INTO games (id, title, url, rating, rating_count, downloads, views, tags, jams, min_price)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
rating=excluded.rating,
rating_count=excluded.rating_count,
downloads=excluded.downloads,
views=excluded.views,
last_updated=CURRENT_TIMESTAMP
""", (
game.get("id"), game.get("title"), game.get("url"),
game.get("rating"), game.get("rating_count"),
game.get("downloads"), game.get("views"),
json.dumps(game.get("tags", [])),
json.dumps(game.get("jams", [])),
game.get("min_price"),
))
# Record a snapshot for trend tracking
if game.get("id"):
conn.execute("""
INSERT INTO snapshots (game_id, rating, rating_count, downloads, views)
VALUES (?, ?, ?, ?, ?)
""", (game["id"], game.get("rating"), game.get("rating_count"),
game.get("downloads"), game.get("views")))
conn.commit()
If you just want a quick export to CSV:
import csv
def export_csv(conn: sqlite3.Connection, path: str = "itch_games.csv"):
cursor = conn.execute("SELECT id, title, url, rating, rating_count, downloads, views FROM games")
with open(path, "w", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(["id", "title", "url", "rating", "rating_count", "downloads", "views"])
writer.writerows(cursor.fetchall())
Tracking Rating Trends Over Time
Snapshots let you calculate velocity — how fast a game's rating count is growing:
from datetime import datetime
def compute_rating_velocity(conn: sqlite3.Connection, game_id: int, days: int = 7) -> dict:
"""Calculate how fast a game is accumulating ratings."""
rows = conn.execute("""
SELECT rating_count, recorded_at
FROM snapshots
WHERE game_id = ?
AND recorded_at >= datetime('now', '-' || ? || ' days')
ORDER BY recorded_at
""", (game_id, days)).fetchall()
if len(rows) < 2:
return {"game_id": game_id, "velocity": 0, "snapshots": len(rows)}
first_count = rows[0][0] or 0
last_count = rows[-1][0] or 0
delta = last_count - first_count
first_ts = datetime.fromisoformat(rows[0][1])
last_ts = datetime.fromisoformat(rows[-1][1])
elapsed_days = (last_ts - first_ts).total_seconds() / 86400
velocity = delta / elapsed_days if elapsed_days > 0 else 0
return {
"game_id": game_id,
"ratings_gained": delta,
"elapsed_days": round(elapsed_days, 1),
"ratings_per_day": round(velocity, 2),
"snapshots": len(rows),
}
Complete Scraping Pipeline
Tie everything together into a weekly job:
def run_itch_pipeline(
tags: list = None,
max_browse_pages: int = 20,
db_path: str = "itch_games.db",
proxy: str = None,
) -> dict:
"""Full pipeline: browse -> collect IDs -> enrich -> store."""
tags = tags or ["horror", "puzzle", "platformer", "roguelike", "visual-novel"]
conn = init_db(db_path)
all_game_urls = set()
for tag in tags:
print(f"\nBrowsing tag: {tag}")
games = scrape_all_browse(tag=tag, max_pages=max_browse_pages)
for g in games:
if g.get("url"):
all_game_urls.add(g["url"])
print(f" Found {len(games)} games for '{tag}'")
time.sleep(3)
print(f"\nTotal unique game URLs: {len(all_game_urls)}")
enriched = batch_enrich_games(list(all_game_urls), db_path=db_path, proxy=proxy)
total_games = conn.execute("SELECT COUNT(*) FROM games").fetchone()[0]
conn.close()
return {
"tags_scraped": tags,
"urls_found": len(all_game_urls),
"enriched_this_run": enriched,
"total_in_db": total_games,
}
if __name__ == "__main__":
PROXY = "http://USER:[email protected]:9000"
results = run_itch_pipeline(proxy=PROXY)
print(f"\nPipeline complete: {results['enriched_this_run']} enriched, {results['total_in_db']} total in DB")
Legal and Ethical Notes
itch.io's Terms of Service don't explicitly permit automated scraping, but they don't have a robots.txt that blanket-blocks scrapers either. The site is largely open — game pages are public, no login required for most data.
Practical rules: don't hit them faster than a human could reasonably browse, don't scrape private data (purchases, user accounts), and don't republish the data in a way that replicates their storefront. Game metadata for research, market analysis, or building discovery tools sits in a reasonable gray area. If you're building something commercial that depends heavily on their data, reaching out to ask is worth the two minutes it takes.
The jam data is the most scraper-friendly: jams are explicitly public, itch.io promotes participation counts prominently, and the community culture around jams is open. Building jam analytics tools is clearly within the spirit of the platform.
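A cheap safeguard worth adding regardless: check candidate URLs against the site's robots.txt before queueing them. Python's stdlib parser handles the format — the sketch takes the robots.txt body as a string so the one-time fetch of https://itch.io/robots.txt stays inside your rate-limited request path:

```python
from urllib import robotparser

def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against a robots.txt body before crawling it."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```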
Key Takeaways
- The itch.io/api/1/ API is limited to authenticated user-level data. It's useful for resolving game IDs but not for bulk catalog discovery.
- Browse pages and individual game pages are server-rendered and parse cleanly with BeautifulSoup.
- Download counts are only visible when developers opt to show them. Don't assume they'll be present.
- itch.io rate-limits by IP and it happens fast. Keep delays at 2+ seconds between requests for sustained crawls, and use ThorData's residential proxy rotation if you're scraping at any real scale.
- Store in SQLite and record snapshots — this data is most valuable as a time series, not a one-time dump.
- Jam data is a unique edge: it connects games to community events and competition placement, giving you context that pure download counts can't provide.
- Combine browse scraping with jam scraping to build a multi-dimensional picture of the indie market — which tags are producing the most jams, which jam participants get the most community ratings, and what price points drive the highest engagement rates.