Scraping Steam Game Reviews in 2026: API, Cursors, and Sentiment Analysis
Steam has over 70,000 games and millions of player reviews — an incredible dataset for sentiment analysis, market research, and indie game developers trying to understand what players actually care about. The good news? Valve exposes a semi-public JSON API for reviews that most people don't know about. The bad news? It has quirks that will trip you up if you don't understand cursor-based pagination.
This guide covers the complete pipeline: fetching reviews, handling pagination, filtering quality data, multi-language collection, proxy usage for scale, SQLite storage, and basic sentiment analysis.
The Steam Reviews API
Valve provides an undocumented JSON endpoint that returns reviews for any game by its App ID:
https://store.steampowered.com/appreviews/{appid}?json=1
For example, Counter-Strike 2 (App ID 730):
https://store.steampowered.com/appreviews/730?json=1
No API key. No authentication. It returns JSON. This is unusually generous compared to most platforms in 2026, but Valve imposes rate limits — hit it too fast and you'll get empty responses or 429 errors.
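Before writing a full scraper, it helps to see the smallest possible request. The endpoint shape and the `cursor=*` convention come from the section above; `build_review_request` itself is just an illustrative helper, not part of any Steam library:

```python
def build_review_request(appid: int, cursor: str = "*") -> tuple:
    """Return (url, params) for one page of reviews from the
    appreviews endpoint. cursor='*' requests the first page."""
    url = f"https://store.steampowered.com/appreviews/{appid}"
    params = {"json": 1, "num_per_page": 100, "cursor": cursor}
    return url, params

url, params = build_review_request(730)  # Counter-Strike 2
print(url)               # https://store.steampowered.com/appreviews/730
print(params["cursor"])  # *
```

Pass `url` and `params` to `requests.get` and you get back JSON with a `reviews` array, a `query_summary`, and the next `cursor`.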
Finding App IDs
import requests
import json
import re
def search_steam_apps(query: str) -> list:
"""Search for Steam apps by name."""
resp = requests.get(
"https://store.steampowered.com/api/storesearch/",
params={"term": query, "l": "english", "cc": "us"},
timeout=15
)
resp.raise_for_status()
items = resp.json().get("items", [])
return [{"appid": item["id"], "name": item["name"]} for item in items]
def get_app_details(appid: int) -> dict:
"""Get detailed app info from the Steam store API."""
resp = requests.get(
"https://store.steampowered.com/api/appdetails",
params={"appids": appid, "cc": "us", "l": "english"},
timeout=15
)
resp.raise_for_status()
data = resp.json().get(str(appid), {})
if not data.get("success"):
return {}
d = data["data"]
return {
"appid": appid,
"name": d.get("name"),
"type": d.get("type"),
"developer": ", ".join(d.get("developers", [])),
"publisher": ", ".join(d.get("publishers", [])),
"genres": ", ".join(g["description"] for g in d.get("genres", [])),
"categories": ", ".join(c["description"] for c in d.get("categories", [])),
"release_date": d.get("release_date", {}).get("date"),
"price": d.get("price_overview", {}).get("final_formatted", "Free"),
"total_reviews": None, # filled by appreviews endpoint
"positive_ratio": None,
}
# Find Elden Ring
results = search_steam_apps("Elden Ring")
for r in results[:3]:
print(f"AppID {r['appid']}: {r['name']}")
Key Parameters
| Parameter | Values | Description |
|---|---|---|
| filter | recent, updated, all | Sort order — all uses the relevance algorithm |
| language | english, spanish, german, schinese, etc. | Filter by review language |
| cursor | URL-encoded string | Pagination cursor (returned in each response) |
| num_per_page | 1-100 | Reviews per page (default 20, max 100) |
| review_type | all, positive, negative | Filter by recommendation |
| purchase_type | all, steam, non_steam_purchase | Filter by purchase method |
| day_range | integer | Only reviews from the last N days |
| start_date | Unix timestamp | Reviews after this date |
| end_date | Unix timestamp | Reviews before this date |
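These parameters are ordinary query-string fields; the `params` argument of `requests.get` (or the standard library's `urlencode`) handles the encoding for you, including percent-encoding the `*` cursor. A quick illustration with example filter values:

```python
from urllib.parse import urlencode

# Recent, negative, English reviews from the last 30 days.
# Values are illustrative; parameter names come from the table above.
params = {
    "json": 1,
    "filter": "recent",
    "language": "english",
    "review_type": "negative",
    "day_range": 30,
    "num_per_page": 100,
    "cursor": "*",
}
query = urlencode(params)
print(query.split("&")[-1])  # cursor=%2A  (the * is percent-encoded)
```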
Understanding Cursor Pagination
This is where most people get stuck. Steam doesn't use page numbers — it uses opaque cursor strings. The first request uses cursor=* (URL-encoded as cursor=%2A). Each response includes a cursor field you must pass into the next request.
The trap: If you forget to pass the cursor, you'll get the same first page forever and think the API is broken.
import requests
import time
import random
import sqlite3
from datetime import datetime
def scrape_steam_reviews(app_id: int, max_reviews: int = 1000,
language: str = "english",
filter_type: str = "recent",
review_type: str = "all",
purchase_type: str = "steam",
proxy_url: str = None) -> list:
"""
Scrape Steam reviews using cursor-based pagination.
filter_type: recent | updated | all
review_type: all | positive | negative
"""
url = f"https://store.steampowered.com/appreviews/{app_id}"
cursor = "*" # initial cursor — REQUIRED
all_reviews = []
params = {
"json": 1,
"filter": filter_type,
"language": language,
"day_range": 9223372036854775807, # all time
"num_per_page": 100,
"purchase_type": purchase_type,
"review_type": review_type,
"cursor": cursor
}
kwargs = {"params": params, "timeout": 15}
if proxy_url:
kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}
consecutive_empty = 0
while len(all_reviews) < max_reviews:
params["cursor"] = cursor
try:
resp = requests.get(url, **kwargs)
except requests.RequestException as e:
print(f"Request failed: {e}")
break
if resp.status_code == 429:
print("Rate limited — waiting 30 seconds")
time.sleep(30)
continue
if resp.status_code != 200:
print(f"HTTP {resp.status_code}, stopping")
break
data = resp.json()
if data.get("success") != 1:
print("API returned success=0, stopping")
break
reviews = data.get("reviews", [])
        # Empty-response trap: when soft-limited, Steam returns success=1
        # with an empty reviews array rather than a 429
        summary = data.get("query_summary", {})
if not reviews:
consecutive_empty += 1
if consecutive_empty >= 2:
print("No more reviews available")
break
time.sleep(5)
continue
consecutive_empty = 0
all_reviews.extend(reviews)
cursor = data.get("cursor", "")
total_available = summary.get("total_reviews", "?")
print(f"Fetched {len(all_reviews)}/{max_reviews} reviews "
f"(total available: {total_available})")
if not cursor:
print("No next cursor — end of reviews")
break
# Respect rate limits: 1 request per 1.5 seconds
time.sleep(random.uniform(1.2, 2.0))
return all_reviews[:max_reviews]
What You Get Back
Each review object contains rich metadata:
def parse_review(r: dict, app_id: int) -> dict:
"""Parse a raw Steam review into a clean dict."""
author = r.get("author", {})
return {
"recommendation_id": r.get("recommendationid"),
"app_id": app_id,
"steam_id": author.get("steamid"),
"num_games_owned": author.get("num_games_owned", 0),
"num_reviews": author.get("num_reviews", 0),
"playtime_forever_hours": round(author.get("playtime_forever", 0) / 60, 1),
"playtime_at_review_hours": round(author.get("playtime_at_review", 0) / 60, 1),
"last_played": author.get("last_played"),
"language": r.get("language"),
"review_text": r.get("review", ""),
"review_length": len(r.get("review", "")),
"voted_up": r.get("voted_up", False), # True = positive, False = negative
"votes_up": r.get("votes_up", 0), # helpful votes
"votes_funny": r.get("votes_funny", 0),
"weighted_vote_score": r.get("weighted_vote_score", 0),
"comment_count": r.get("comment_count", 0),
"timestamp_created": r.get("timestamp_created"),
"timestamp_updated": r.get("timestamp_updated"),
"steam_purchase": r.get("steam_purchase", True),
"received_for_free": r.get("received_for_free", False),
"written_during_early_access": r.get("written_during_early_access", False),
"developer_response": r.get("developer_response", ""),
}
def parse_all_reviews(raw_reviews: list, app_id: int) -> list:
return [parse_review(r, app_id) for r in raw_reviews]
Filtering for Quality Reviews
Raw Steam reviews include a lot of noise: meme reviews, one-word responses, reviews from players with 5 minutes of playtime. Here's how to filter for genuinely useful reviews:
def filter_quality_reviews(reviews: list,
min_playtime_hours: float = 5.0,
min_text_length: int = 100,
max_text_length: int = 10000,
min_helpful_ratio: float = 0.0,
exclude_free: bool = False,
exclude_early_access: bool = False) -> list:
"""Filter reviews by multiple quality criteria."""
quality = []
for r in reviews:
# Minimum playtime filter
if r.get("playtime_at_review_hours", 0) < min_playtime_hours:
continue
# Text length filter
text = r.get("review_text", "")
if len(text) < min_text_length or len(text) > max_text_length:
continue
# Helpful vote ratio filter (only apply if enough votes to be meaningful)
total_votes = r.get("votes_up", 0) + r.get("votes_funny", 0)
if total_votes >= 10:
helpful_ratio = r.get("votes_up", 0) / total_votes
if helpful_ratio < min_helpful_ratio:
continue
# Optional filters
if exclude_free and r.get("received_for_free"):
continue
if exclude_early_access and r.get("written_during_early_access"):
continue
quality.append(r)
print(f"Quality filter: {len(quality)}/{len(reviews)} reviews passed")
return quality
# Example: get high-quality reviews for sentiment analysis
raw = scrape_steam_reviews(1245620, max_reviews=500) # Elden Ring
parsed = parse_all_reviews(raw, 1245620)
quality = filter_quality_reviews(
parsed,
min_playtime_hours=10,
min_text_length=150,
min_helpful_ratio=0.5,
exclude_free=True,
)
print(f"High-quality reviews for analysis: {len(quality)}")
Multi-Language Review Scraping
Steam supports dozens of review languages. For cross-market sentiment analysis:
STEAM_LANGUAGES = [
("english", "en"),
("spanish", "es"),
("latam", "es-419"),
("german", "de"),
("french", "fr"),
("portuguese", "pt"),
("brazilian", "pt-BR"),
("russian", "ru"),
("japanese", "ja"),
("koreana", "ko"),
("schinese", "zh-CN"),
("tchinese", "zh-TW"),
("polish", "pl"),
("italian", "it"),
("dutch", "nl"),
("swedish", "sv"),
]
def scrape_multilang_reviews(app_id: int, reviews_per_lang: int = 200,
languages: list = None,
proxy_url: str = None) -> dict:
"""Scrape reviews across multiple languages."""
if not languages:
languages = STEAM_LANGUAGES[:8] # top 8 by default
all_data = {}
for lang_key, lang_code in languages:
print(f"\nScraping {lang_key} reviews for app {app_id}...")
reviews = scrape_steam_reviews(
app_id,
max_reviews=reviews_per_lang,
language=lang_key,
proxy_url=proxy_url,
)
parsed = parse_all_reviews(reviews, app_id)
all_data[lang_key] = {
"lang_code": lang_code,
"count": len(parsed),
"reviews": parsed,
}
print(f" Got {len(parsed)} {lang_key} reviews")
time.sleep(random.uniform(3, 6)) # extra pause between languages
return all_data
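To actually compare markets, a small summarizer can collapse the structure returned by `scrape_multilang_reviews` into per-language positive ratios. The helper below is a hypothetical addition — pure post-processing, shown here with fabricated sample data:

```python
def positive_ratio_by_language(all_data: dict) -> dict:
    """Summarize scrape_multilang_reviews() output as
    {language: positive %}. Languages with no reviews are skipped."""
    out = {}
    for lang, bundle in all_data.items():
        reviews = bundle["reviews"]
        if not reviews:
            continue
        pos = sum(1 for r in reviews if r.get("voted_up"))
        out[lang] = round(100 * pos / len(reviews), 1)
    return out

# Fabricated sample mirroring the structure built above
sample = {
    "english": {"lang_code": "en", "count": 4,
                "reviews": [{"voted_up": True}] * 3 + [{"voted_up": False}]},
    "german": {"lang_code": "de", "count": 2,
               "reviews": [{"voted_up": False}] * 2},
}
print(positive_ratio_by_language(sample))  # {'english': 75.0, 'german': 0.0}
```

A large gap between languages often points to localization problems or region-specific pricing complaints.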
Anti-Bot Measures and Proxy Usage
Steam's API protection is moderate but real:
- Rate limiting — sustained traffic above ~1 request per second will trigger 429 responses. Stick to delays of 1.5 seconds or more.
- Empty-response trap — when soft-rate-limited, Steam returns valid JSON with an empty reviews array instead of a 429. Always check query_summary.num_reviews, not just the HTTP status.
- IP-based throttling — scraping reviews across hundreds of games daily accumulates rate limits per IP. Datacenter IPs can work at moderate volume, but residential IPs are safer for sustained use.
- Geo-blocking — some review content is filtered based on your IP's location. Country-targeted proxies solve this.
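A generic backoff wrapper can absorb both the hard 429s and the soft empty-response limit. This is a sketch, not Steam-specific code: `fetch` is any zero-argument callable you supply that returns `(status_code, parsed_json)`:

```python
import time

def fetch_with_backoff(fetch, max_tries: int = 5, base_delay: float = 2.0):
    """Call fetch() with exponential backoff on rate-limit signals.

    Treats both HTTP 429 and a 200 with empty/missing JSON (Steam's
    soft limit) as retryable. Returns the parsed data, or None if
    every attempt was limited.
    """
    for attempt in range(max_tries):
        status, data = fetch()
        soft_limited = status == 200 and not data  # empty-response trap
        if status == 429 or soft_limited:
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...
            continue
        return data
    return None

# Simulated fetch: hard-limited once, soft-limited once, then succeeds
calls = iter([(429, None), (200, {}), (200, {"reviews": ["ok"]})])
result = fetch_with_backoff(lambda: next(calls), base_delay=0.01)
print(result)  # {'reviews': ['ok']}
```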
For serious volume — tracking reviews across the top 500 Steam games daily, or building historical datasets — residential proxies distribute the load. ThorData's residential proxy network provides rotating residential IPs that look like normal user traffic to Steam's servers.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
def get_proxy(country: str = "us", sticky_id: str = None) -> str:
"""Build ThorData proxy URL with optional country targeting."""
user = f"{THORDATA_USER}-country-{country}"
if sticky_id:
user += f"-session-{sticky_id}"
return f"http://{user}:{THORDATA_PASS}@proxy.thordata.net:9000"
# Rotate proxies across different game scrapes
import uuid
proxy = get_proxy(country="us", sticky_id=str(uuid.uuid4())[:8])
reviews = scrape_steam_reviews(1245620, max_reviews=1000, proxy_url=proxy)
Saving to SQLite
For any real analysis, dump reviews into SQLite:
def setup_steam_db(db_path: str = "steam_reviews.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS apps (
appid INTEGER PRIMARY KEY,
name TEXT, developer TEXT, publisher TEXT,
genres TEXT, release_date TEXT, price TEXT,
scraped_at TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
recommendation_id TEXT PRIMARY KEY,
app_id INTEGER,
steam_id TEXT,
num_games_owned INTEGER,
num_reviews INTEGER,
playtime_forever_hours REAL,
playtime_at_review_hours REAL,
language TEXT,
review_text TEXT,
review_length INTEGER,
voted_up BOOLEAN,
votes_up INTEGER,
votes_funny INTEGER,
weighted_vote_score REAL,
timestamp_created INTEGER,
steam_purchase BOOLEAN,
received_for_free BOOLEAN,
written_during_early_access BOOLEAN,
developer_response TEXT,
scraped_at TEXT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_app ON reviews(app_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_lang ON reviews(language)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_voted ON reviews(voted_up)")
conn.commit()
return conn
def save_reviews_to_db(conn: sqlite3.Connection, reviews: list):
now = datetime.utcnow().isoformat()
rows = []
for r in reviews:
rows.append((
r.get("recommendation_id"),
r.get("app_id"),
r.get("steam_id"),
r.get("num_games_owned"),
r.get("num_reviews"),
r.get("playtime_forever_hours"),
r.get("playtime_at_review_hours"),
r.get("language"),
r.get("review_text"),
r.get("review_length"),
r.get("voted_up"),
r.get("votes_up"),
r.get("votes_funny"),
r.get("weighted_vote_score"),
r.get("timestamp_created"),
r.get("steam_purchase"),
r.get("received_for_free"),
r.get("written_during_early_access"),
r.get("developer_response"),
now,
))
conn.executemany("""
INSERT OR IGNORE INTO reviews
(recommendation_id, app_id, steam_id, num_games_owned, num_reviews,
playtime_forever_hours, playtime_at_review_hours, language, review_text,
review_length, voted_up, votes_up, votes_funny, weighted_vote_score,
timestamp_created, steam_purchase, received_for_free,
written_during_early_access, developer_response, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
""", rows)
conn.commit()
return len(rows)
def save_app_to_db(conn: sqlite3.Connection, app_data: dict):
now = datetime.utcnow().isoformat()
conn.execute("""
INSERT OR REPLACE INTO apps
(appid, name, developer, publisher, genres, release_date, price, scraped_at)
VALUES (?,?,?,?,?,?,?,?)
""", (
app_data.get("appid"), app_data.get("name"),
app_data.get("developer"), app_data.get("publisher"),
app_data.get("genres"), app_data.get("release_date"),
app_data.get("price"), now
))
conn.commit()
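With reviews in SQLite, per-game aggregates are one GROUP BY away. This sketch uses an in-memory database with a trimmed-down version of the reviews table and fabricated rows, purely to show the query shape:

```python
import sqlite3

# In-memory DB with a subset of the reviews schema, for illustration
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reviews
    (app_id INTEGER, voted_up BOOLEAN, playtime_at_review_hours REAL)""")
conn.executemany("INSERT INTO reviews VALUES (?,?,?)",
                 [(730, 1, 50.0), (730, 0, 0.2), (730, 1, 12.0)])

row = conn.execute("""
    SELECT app_id,
           COUNT(*) AS n,
           ROUND(100.0 * SUM(voted_up) / COUNT(*), 1) AS positive_pct,
           ROUND(AVG(playtime_at_review_hours), 1) AS avg_hours
    FROM reviews
    GROUP BY app_id
""").fetchone()
print(row)  # (730, 3, 66.7, 20.7)
```

The same query runs unchanged against the full `steam_reviews.db` built above, returning one row per scraped game.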
Review Statistics and Sentiment Analysis
from collections import Counter
import statistics
def compute_review_stats(reviews: list) -> dict:
"""Compute statistics from a set of parsed Steam reviews."""
if not reviews:
return {}
positive = [r for r in reviews if r.get("voted_up")]
negative = [r for r in reviews if not r.get("voted_up")]
playtimes = [r["playtime_at_review_hours"] for r in reviews
if r.get("playtime_at_review_hours", 0) > 0]
helpful_votes = [r.get("votes_up", 0) for r in reviews]
text_lengths = [r.get("review_length", 0) for r in reviews]
lang_dist = Counter(r.get("language") for r in reviews)
return {
"total": len(reviews),
"positive": len(positive),
"negative": len(negative),
"positive_ratio": round(len(positive) / len(reviews) * 100, 1),
"avg_playtime_hours": round(statistics.mean(playtimes), 1) if playtimes else 0,
"median_playtime_hours": round(statistics.median(playtimes), 1) if playtimes else 0,
"avg_review_length": round(statistics.mean(text_lengths), 0) if text_lengths else 0,
"total_helpful_votes": sum(helpful_votes),
"early_access_reviews": sum(1 for r in reviews if r.get("written_during_early_access")),
"free_key_reviews": sum(1 for r in reviews if r.get("received_for_free")),
"language_distribution": dict(lang_dist.most_common(10)),
}
def extract_common_themes(reviews: list, n: int = 20) -> dict:
"""Extract most common words/phrases from positive and negative reviews."""
import re
# Simple stopword list
stopwords = {
"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for",
"of", "with", "is", "it", "this", "that", "was", "are", "be", "have",
"has", "had", "not", "i", "you", "me", "my", "your", "we", "they",
"game", "games", "play", "playing", "played", "hours", "time", "very",
}
def get_words(text: str) -> list:
words = re.findall(r"[a-z]+", text.lower())
return [w for w in words if w not in stopwords and len(w) > 3]
pos_words = []
neg_words = []
for r in reviews:
words = get_words(r.get("review_text", ""))
if r.get("voted_up"):
pos_words.extend(words)
else:
neg_words.extend(words)
return {
"top_positive_words": Counter(pos_words).most_common(n),
"top_negative_words": Counter(neg_words).most_common(n),
}
# Example analysis
conn = setup_steam_db()
app_id = 1245620 # Elden Ring
raw_reviews = scrape_steam_reviews(app_id, max_reviews=500)
parsed = parse_all_reviews(raw_reviews, app_id)
app_details = get_app_details(app_id)
save_app_to_db(conn, app_details)
saved = save_reviews_to_db(conn, parsed)
print(f"Saved {saved} reviews")
stats = compute_review_stats(parsed)
print(f"\nReview Statistics for App {app_id}:")
for key, val in stats.items():
print(f" {key}: {val}")
quality = filter_quality_reviews(parsed, min_playtime_hours=10, min_text_length=200)
themes = extract_common_themes(quality)
print(f"\nTop positive words: {themes['top_positive_words'][:5]}")
print(f"Top negative words: {themes['top_negative_words'][:5]}")
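Beyond raw frequency counts, you can rank words by how disproportionately they appear in positive versus negative reviews. This "lift" score is a simple add-one-smoothed ratio, not a trained model; the function and the sample counts below are illustrative:

```python
from collections import Counter

def word_sentiment_lift(pos_counts: Counter, neg_counts: Counter,
                        min_count: int = 5) -> list:
    """Rank words by (smoothed) relative frequency in positive vs
    negative reviews. Lift > 1 means positive-leaning, < 1 negative."""
    total_pos = sum(pos_counts.values()) or 1
    total_neg = sum(neg_counts.values()) or 1
    scores = []
    for word in set(pos_counts) | set(neg_counts):
        p, n = pos_counts[word], neg_counts[word]
        if p + n < min_count:  # skip rare words
            continue
        lift = ((p + 1) / total_pos) / ((n + 1) / total_neg)
        scores.append((word, round(lift, 2)))
    return sorted(scores, key=lambda x: -x[1])

# Fabricated word counts for illustration
pos = Counter({"masterpiece": 8, "combat": 6, "buggy": 1})
neg = Counter({"buggy": 7, "combat": 5, "refund": 4})
print(word_sentiment_lift(pos, neg, min_count=5)[0][0])  # masterpiece
```

Feed it the `pos_words` and `neg_words` counters from `extract_common_themes` to get a ranking that filters out words common to both sides (like "combat" above).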
Competitive Analysis: Comparing Multiple Games
def compare_games(app_ids: list, reviews_per_game: int = 200) -> dict:
"""Scrape and compare reviews across multiple games."""
conn = setup_steam_db()
comparison = {}
for app_id in app_ids:
print(f"\nProcessing app {app_id}...")
app_info = get_app_details(app_id)
save_app_to_db(conn, app_info)
raw = scrape_steam_reviews(app_id, max_reviews=reviews_per_game)
parsed = parse_all_reviews(raw, app_id)
save_reviews_to_db(conn, parsed)
stats = compute_review_stats(parsed)
comparison[app_id] = {
"name": app_info.get("name", f"App {app_id}"),
**stats,
}
time.sleep(random.uniform(3, 8))
conn.close()
return comparison
# Compare top RPGs
rpg_ids = [1245620, 1086940, 1145360] # Elden Ring, Baldur's Gate 3, Hades
results = compare_games(rpg_ids, reviews_per_game=300)
print("\nGame Comparison:")
for app_id, data in results.items():
print(f"\n{data['name']} (App {app_id})")
print(f" Positive: {data['positive_ratio']}%")
print(f" Avg playtime at review: {data['avg_playtime_hours']}h")
print(f" Avg review length: {data['avg_review_length']} chars")
Getting the Top Selling Games List
def get_top_sellers(count: int = 100) -> list:
"""Get Steam top sellers list with app IDs."""
resp = requests.get(
"https://store.steampowered.com/api/featuredcategories/",
params={"cc": "us", "l": "english"},
timeout=15
)
resp.raise_for_status()
data = resp.json()
apps = []
    for item in data.get("top_sellers", {}).get("items", []):
        apps.append({
            "appid": item.get("id"),
            "name": item.get("name"),
        })
return apps[:count]
top = get_top_sellers(50)
print(f"Top {len(top)} Steam sellers:")
for app in top[:10]:
print(f" {app['appid']}: {app['name']}")
Final Thoughts
Steam's review API is one of the most accessible data sources in gaming:
- No authentication required
- Rich metadata (playtime, helpful votes, developer responses)
- Cursor pagination works reliably once you understand it
- Multi-language support built in
Key points to remember:
1. Always pass the cursor — this is the most common mistake
2. Watch for empty responses — check query_summary, not just the reviews array
3. 1.5+ second delays — Steam rate limits are real
4. Filter by playtime — separates thoughtful reviews from drive-by posts
5. Use residential proxies for scale — ThorData works well for distributing load across hundreds of games
6. Store in SQLite — enables powerful cross-game analysis with SQL
The playtime and helpful-vote data make Steam reviews uniquely valuable compared to most user-generated content sources — you can actually distinguish thoughtful criticism from low-effort negativity, which makes them excellent training data for sentiment classifiers.