Scraping Steam Game Reviews in 2026: API, Cursors, and Sentiment Analysis
Steam has over 70,000 games and millions of player reviews — an incredible dataset for sentiment analysis, market research, and indie game developers trying to understand what players actually care about. The good news? Valve exposes a semi-public JSON API for reviews that most people don't know about. The bad news? It has quirks that will trip you up if you don't understand cursor-based pagination.
This guide covers the complete pipeline: fetching reviews, handling pagination, filtering quality data, multi-language collection, proxy usage for scale, SQLite storage, and basic sentiment analysis.
The Steam Reviews API
Valve provides an undocumented JSON endpoint that returns reviews for any game by its App ID:
https://store.steampowered.com/appreviews/{appid}?json=1
For example, Counter-Strike 2 (App ID 730):
https://store.steampowered.com/appreviews/730?json=1
No API key. No authentication. It returns JSON. This is unusually generous compared to most platforms in 2026, but Valve imposes rate limits — hit it too fast and you'll get empty responses or 429 errors.
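Before writing a full scraper, it helps to see the smallest possible request. The endpoint shape and the `cursor=*` convention come from the section above; `build_review_request` itself is just an illustrative helper, not part of any Steam library:

```python
def build_review_request(appid: int, cursor: str = "*") -> tuple:
    """Return (url, params) for one page of reviews from the
    appreviews endpoint. cursor='*' requests the first page."""
    url = f"https://store.steampowered.com/appreviews/{appid}"
    params = {"json": 1, "num_per_page": 100, "cursor": cursor}
    return url, params

url, params = build_review_request(730)  # Counter-Strike 2
print(url)               # https://store.steampowered.com/appreviews/730
print(params["cursor"])  # *
```

Pass `url` and `params` to `requests.get` and you get back JSON with a `reviews` array, a `query_summary`, and the next `cursor`.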
Finding App IDs
import requests
import json
import re
def search_steam_apps(query: str) -> list:
"""Search for Steam apps by name."""
resp = requests.get(
"https://store.steampowered.com/api/storesearch/",
params={"term": query, "l": "english", "cc": "us"},
timeout=15
)
resp.raise_for_status()
items = resp.json().get("items", [])
return [{"appid": item["id"], "name": item["name"]} for item in items]
def get_app_details(appid: int) -> dict:
"""Get detailed app info from the Steam store API."""
resp = requests.get(
"https://store.steampowered.com/api/appdetails",
params={"appids": appid, "cc": "us", "l": "english"},
timeout=15
)
resp.raise_for_status()
data = resp.json().get(str(appid), {})
if not data.get("success"):
return {}
d = data["data"]
return {
"appid": appid,
"name": d.get("name"),
"type": d.get("type"),
"developer": ", ".join(d.get("developers", [])),
"publisher": ", ".join(d.get("publishers", [])),
"genres": ", ".join(g["description"] for g in d.get("genres", [])),
"categories": ", ".join(c["description"] for c in d.get("categories", [])),
"release_date": d.get("release_date", {}).get("date"),
"price": d.get("price_overview", {}).get("final_formatted", "Free"),
"total_reviews": None, # filled by appreviews endpoint
"positive_ratio": None,
}
# Find Elden Ring
results = search_steam_apps("Elden Ring")
for r in results[:3]:
print(f"AppID {r['appid']}: {r['name']}")
Key Parameters
| Parameter | Values | Description |
|---|---|---|
| filter | recent, updated, all | Sort order — all uses the relevance algorithm |
| language | english, spanish, german, schinese, etc. | Filter by review language |
| cursor | URL-encoded string | Pagination cursor (returned in each response) |
| num_per_page | 1-100 | Reviews per page (default 20, max 100) |
| review_type | all, positive, negative | Filter by recommendation |
| purchase_type | all, steam, non_steam_purchase | Filter by purchase method |
| day_range | integer | Only reviews from the last N days |
| start_date | Unix timestamp | Reviews after this date |
| end_date | Unix timestamp | Reviews before this date |
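These parameters are ordinary query-string fields; the `params` argument of `requests.get` (or the standard library's `urlencode`) handles the encoding for you, including percent-encoding the `*` cursor. A quick illustration with example filter values:

```python
from urllib.parse import urlencode

# Recent, negative, English reviews from the last 30 days.
# Values are illustrative; parameter names come from the table above.
params = {
    "json": 1,
    "filter": "recent",
    "language": "english",
    "review_type": "negative",
    "day_range": 30,
    "num_per_page": 100,
    "cursor": "*",
}
query = urlencode(params)
print(query.split("&")[-1])  # cursor=%2A  (the * is percent-encoded)
```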
Understanding Cursor Pagination
This is where most people get stuck. Steam doesn't use page numbers — it uses opaque cursor strings. The first request uses cursor=* (URL-encoded as cursor=%2A). Each response includes a cursor field you must pass into the next request.
The trap: If you forget to pass the cursor, you'll get the same first page forever and think the API is broken.
import requests
import time
import random
import sqlite3
from datetime import datetime
def scrape_steam_reviews(app_id: int, max_reviews: int = 1000,
language: str = "english",
filter_type: str = "recent",
review_type: str = "all",
purchase_type: str = "steam",
proxy_url: str = None) -> list:
"""
Scrape Steam reviews using cursor-based pagination.
filter_type: recent | updated | all
review_type: all | positive | negative
"""
url = f"https://store.steampowered.com/appreviews/{app_id}"
cursor = "*" # initial cursor — REQUIRED
all_reviews = []
params = {
"json": 1,
"filter": filter_type,
"language": language,
"day_range": 9223372036854775807, # all time
"num_per_page": 100,
"purchase_type": purchase_type,
"review_type": review_type,
"cursor": cursor
}
kwargs = {"params": params, "timeout": 15}
if proxy_url:
kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}
consecutive_empty = 0
while len(all_reviews) < max_reviews:
params["cursor"] = cursor
try:
resp = requests.get(url, **kwargs)
except requests.RequestException as e:
print(f"Request failed: {e}")
break
if resp.status_code == 429:
print("Rate limited — waiting 30 seconds")
time.sleep(30)
continue
if resp.status_code != 200:
print(f"HTTP {resp.status_code}, stopping")
break
data = resp.json()
if data.get("success") != 1:
print("API returned success=0, stopping")
break
reviews = data.get("reviews", [])
        # Empty-response trap: when soft-limited, Steam returns success=1
        # with an empty reviews array rather than a 429
        summary = data.get("query_summary", {})
if not reviews:
consecutive_empty += 1
if consecutive_empty >= 2:
print("No more reviews available")
break
time.sleep(5)
continue
consecutive_empty = 0
all_reviews.extend(reviews)
cursor = data.get("cursor", "")
total_available = summary.get("total_reviews", "?")
print(f"Fetched {len(all_reviews)}/{max_reviews} reviews "
f"(total available: {total_available})")
if not cursor:
print("No next cursor — end of reviews")
break
# Respect rate limits: 1 request per 1.5 seconds
time.sleep(random.uniform(1.2, 2.0))
return all_reviews[:max_reviews]
What You Get Back
Each review object contains rich metadata:
def parse_review(r: dict, app_id: int) -> dict:
"""Parse a raw Steam review into a clean dict."""
author = r.get("author", {})
return {
"recommendation_id": r.get("recommendationid"),
"app_id": app_id,
"steam_id": author.get("steamid"),
"num_games_owned": author.get("num_games_owned", 0),
"num_reviews": author.get("num_reviews", 0),
"playtime_forever_hours": round(author.get("playtime_forever", 0) / 60, 1),
"playtime_at_review_hours": round(author.get("playtime_at_review", 0) / 60, 1),
"last_played": author.get("last_played"),
"language": r.get("language"),
"review_text": r.get("review", ""),
"review_length": len(r.get("review", "")),
"voted_up": r.get("voted_up", False), # True = positive, False = negative
"votes_up": r.get("votes_up", 0), # helpful votes
"votes_funny": r.get("votes_funny", 0),
"weighted_vote_score": r.get("weighted_vote_score", 0),
"comment_count": r.get("comment_count", 0),
"timestamp_created": r.get("timestamp_created"),
"timestamp_updated": r.get("timestamp_updated"),
"steam_purchase": r.get("steam_purchase", True),
"received_for_free": r.get("received_for_free", False),
"written_during_early_access": r.get("written_during_early_access", False),
"developer_response": r.get("developer_response", ""),
}
def parse_all_reviews(raw_reviews: list, app_id: int) -> list:
return [parse_review(r, app_id) for r in raw_reviews]
Filtering for Quality Reviews
Raw Steam reviews include a lot of noise: meme reviews, one-word responses, reviews from players with 5 minutes of playtime. Here's how to filter for genuinely useful reviews:
def filter_quality_reviews(reviews: list,
min_playtime_hours: float = 5.0,
min_text_length: int = 100,
max_text_length: int = 10000,
min_helpful_ratio: float = 0.0,
exclude_free: bool = False,
exclude_early_access: bool = False) -> list:
"""Filter reviews by multiple quality criteria."""
quality = []
for r in reviews:
# Minimum playtime filter
if r.get("playtime_at_review_hours", 0) < min_playtime_hours:
continue
# Text length filter
text = r.get("review_text", "")
if len(text) < min_text_length or len(text) > max_text_length:
continue
# Helpful vote ratio filter (only apply if enough votes to be meaningful)
total_votes = r.get("votes_up", 0) + r.get("votes_funny", 0)
if total_votes >= 10:
helpful_ratio = r.get("votes_up", 0) / total_votes
if helpful_ratio < min_helpful_ratio:
continue
# Optional filters
if exclude_free and r.get("received_for_free"):
continue
if exclude_early_access and r.get("written_during_early_access"):
continue
quality.append(r)
print(f"Quality filter: {len(quality)}/{len(reviews)} reviews passed")
return quality
# Example: get high-quality reviews for sentiment analysis
raw = scrape_steam_reviews(1245620, max_reviews=500) # Elden Ring
parsed = parse_all_reviews(raw, 1245620)
quality = filter_quality_reviews(
parsed,
min_playtime_hours=10,
min_text_length=150,
min_helpful_ratio=0.5,
exclude_free=True,
)
print(f"High-quality reviews for analysis: {len(quality)}")
Multi-Language Review Scraping
Steam supports dozens of review languages. For cross-market sentiment analysis:
STEAM_LANGUAGES = [
("english", "en"),
("spanish", "es"),
("latam", "es-419"),
("german", "de"),
("french", "fr"),
("portuguese", "pt"),
("brazilian", "pt-BR"),
("russian", "ru"),
("japanese", "ja"),
("koreana", "ko"),
("schinese", "zh-CN"),
("tchinese", "zh-TW"),
("polish", "pl"),
("italian", "it"),
("dutch", "nl"),
("swedish", "sv"),
]
def scrape_multilang_reviews(app_id: int, reviews_per_lang: int = 200,
languages: list = None,
proxy_url: str = None) -> dict:
"""Scrape reviews across multiple languages."""
if not languages:
languages = STEAM_LANGUAGES[:8] # top 8 by default
all_data = {}
for lang_key, lang_code in languages:
print(f"\nScraping {lang_key} reviews for app {app_id}...")
reviews = scrape_steam_reviews(
app_id,
max_reviews=reviews_per_lang,
language=lang_key,
proxy_url=proxy_url,
)
parsed = parse_all_reviews(reviews, app_id)
all_data[lang_key] = {
"lang_code": lang_code,
"count": len(parsed),
"reviews": parsed,
}
print(f" Got {len(parsed)} {lang_key} reviews")
time.sleep(random.uniform(3, 6)) # extra pause between languages
return all_data
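To actually compare markets, a small summarizer can collapse the structure returned by `scrape_multilang_reviews` into per-language positive ratios. The helper below is a hypothetical addition — pure post-processing, shown here with fabricated sample data:

```python
def positive_ratio_by_language(all_data: dict) -> dict:
    """Summarize scrape_multilang_reviews() output as
    {language: positive %}. Languages with no reviews are skipped."""
    out = {}
    for lang, bundle in all_data.items():
        reviews = bundle["reviews"]
        if not reviews:
            continue
        pos = sum(1 for r in reviews if r.get("voted_up"))
        out[lang] = round(100 * pos / len(reviews), 1)
    return out

# Fabricated sample mirroring the structure built above
sample = {
    "english": {"lang_code": "en", "count": 4,
                "reviews": [{"voted_up": True}] * 3 + [{"voted_up": False}]},
    "german": {"lang_code": "de", "count": 2,
               "reviews": [{"voted_up": False}] * 2},
}
print(positive_ratio_by_language(sample))  # {'english': 75.0, 'german': 0.0}
```

A large gap between languages often points to localization problems or region-specific pricing complaints.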
Anti-Bot Measures and Proxy Usage
Steam's API protection is moderate but real:
- Rate limiting — sustained traffic above ~1 request per second will trigger 429 responses. Stick to delays of 1.5 seconds or more.
- Empty-response trap — when soft-rate-limited, Steam returns valid JSON with an empty reviews array instead of a 429. Always check query_summary.num_reviews, not just the HTTP status.
- IP-based throttling — scraping reviews across hundreds of games daily accumulates rate limits per IP. Datacenter IPs can work at moderate volume, but residential IPs are safer for sustained use.
- Geo-blocking — some review content is filtered based on your IP's location. Country-targeted proxies solve this.
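A generic backoff wrapper can absorb both the hard 429s and the soft empty-response limit. This is a sketch, not Steam-specific code: `fetch` is any zero-argument callable you supply that returns `(status_code, parsed_json)`:

```python
import time

def fetch_with_backoff(fetch, max_tries: int = 5, base_delay: float = 2.0):
    """Call fetch() with exponential backoff on rate-limit signals.

    Treats both HTTP 429 and a 200 with empty/missing JSON (Steam's
    soft limit) as retryable. Returns the parsed data, or None if
    every attempt was limited.
    """
    for attempt in range(max_tries):
        status, data = fetch()
        soft_limited = status == 200 and not data  # empty-response trap
        if status == 429 or soft_limited:
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...
            continue
        return data
    return None

# Simulated fetch: hard-limited once, soft-limited once, then succeeds
calls = iter([(429, None), (200, {}), (200, {"reviews": ["ok"]})])
result = fetch_with_backoff(lambda: next(calls), base_delay=0.01)
print(result)  # {'reviews': ['ok']}
```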
For serious volume — tracking reviews across the top 500 Steam games daily, or building historical datasets — residential proxies distribute the load. ThorData's residential proxy network provides rotating residential IPs that look like normal user traffic to Steam's servers.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
def get_proxy(country: str = "us", sticky_id: str = None) -> str:
"""Build ThorData proxy URL with optional country targeting."""
user = f"{THORDATA_USER}-country-{country}"
if sticky_id:
user += f"-session-{sticky_id}"
return f"http://{user}:{THORDATA_PASS}@proxy.thordata.net:9000"
# Rotate proxies across different game scrapes
import uuid
proxy = get_proxy(country="us", sticky_id=str(uuid.uuid4())[:8])
reviews = scrape_steam_reviews(1245620, max_reviews=1000, proxy_url=proxy)
Saving to SQLite
For any real analysis, dump reviews into SQLite:
def setup_steam_db(db_path: str = "steam_reviews.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS apps (
appid INTEGER PRIMARY KEY,
name TEXT, developer TEXT, publisher TEXT,
genres TEXT, release_date TEXT, price TEXT,
scraped_at TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
recommendation_id TEXT PRIMARY KEY,
app_id INTEGER,
steam_id TEXT,
num_games_owned INTEGER,
num_reviews INTEGER,
playtime_forever_hours REAL,
playtime_at_review_hours REAL,
language TEXT,
review_text TEXT,
review_length INTEGER,
voted_up BOOLEAN,
votes_up INTEGER,
votes_funny INTEGER,
weighted_vote_score REAL,
timestamp_created INTEGER,
steam_purchase BOOLEAN,
received_for_free BOOLEAN,
written_during_early_access BOOLEAN,
developer_response TEXT,
scraped_at TEXT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_app ON reviews(app_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_lang ON reviews(language)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_voted ON reviews(voted_up)")
conn.commit()
return conn
def save_reviews_to_db(conn: sqlite3.Connection, reviews: list):
now = datetime.utcnow().isoformat()
rows = []
for r in reviews:
rows.append((
r.get("recommendation_id"),
r.get("app_id"),
r.get("steam_id"),
r.get("num_games_owned"),
r.get("num_reviews"),
r.get("playtime_forever_hours"),
r.get("playtime_at_review_hours"),
r.get("language"),
r.get("review_text"),
r.get("review_length"),
r.get("voted_up"),
r.get("votes_up"),
r.get("votes_funny"),
r.get("weighted_vote_score"),
r.get("timestamp_created"),
r.get("steam_purchase"),
r.get("received_for_free"),
r.get("written_during_early_access"),
r.get("developer_response"),
now,
))
conn.executemany("""
INSERT OR IGNORE INTO reviews
(recommendation_id, app_id, steam_id, num_games_owned, num_reviews,
playtime_forever_hours, playtime_at_review_hours, language, review_text,
review_length, voted_up, votes_up, votes_funny, weighted_vote_score,
timestamp_created, steam_purchase, received_for_free,
written_during_early_access, developer_response, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
""", rows)
conn.commit()
return len(rows)
def save_app_to_db(conn: sqlite3.Connection, app_data: dict):
now = datetime.utcnow().isoformat()
conn.execute("""
INSERT OR REPLACE INTO apps
(appid, name, developer, publisher, genres, release_date, price, scraped_at)
VALUES (?,?,?,?,?,?,?,?)
""", (
app_data.get("appid"), app_data.get("name"),
app_data.get("developer"), app_data.get("publisher"),
app_data.get("genres"), app_data.get("release_date"),
app_data.get("price"), now
))
conn.commit()
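With reviews in SQLite, per-game aggregates are one GROUP BY away. This sketch uses an in-memory database with a trimmed-down version of the reviews table and fabricated rows, purely to show the query shape:

```python
import sqlite3

# In-memory DB with a subset of the reviews schema, for illustration
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reviews
    (app_id INTEGER, voted_up BOOLEAN, playtime_at_review_hours REAL)""")
conn.executemany("INSERT INTO reviews VALUES (?,?,?)",
                 [(730, 1, 50.0), (730, 0, 0.2), (730, 1, 12.0)])

row = conn.execute("""
    SELECT app_id,
           COUNT(*) AS n,
           ROUND(100.0 * SUM(voted_up) / COUNT(*), 1) AS positive_pct,
           ROUND(AVG(playtime_at_review_hours), 1) AS avg_hours
    FROM reviews
    GROUP BY app_id
""").fetchone()
print(row)  # (730, 3, 66.7, 20.7)
```

The same query runs unchanged against the full `steam_reviews.db` built above, returning one row per scraped game.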
Review Statistics and Sentiment Analysis
from collections import Counter
import statistics
def compute_review_stats(reviews: list) -> dict:
"""Compute statistics from a set of parsed Steam reviews."""
if not reviews:
return {}
positive = [r for r in reviews if r.get("voted_up")]
negative = [r for r in reviews if not r.get("voted_up")]
playtimes = [r["playtime_at_review_hours"] for r in reviews
if r.get("playtime_at_review_hours", 0) > 0]
helpful_votes = [r.get("votes_up", 0) for r in reviews]
text_lengths = [r.get("review_length", 0) for r in reviews]
lang_dist = Counter(r.get("language") for r in reviews)
return {
"total": len(reviews),
"positive": len(positive),
"negative": len(negative),
"positive_ratio": round(len(positive) / len(reviews) * 100, 1),
"avg_playtime_hours": round(statistics.mean(playtimes), 1) if playtimes else 0,
"median_playtime_hours": round(statistics.median(playtimes), 1) if playtimes else 0,
"avg_review_length": round(statistics.mean(text_lengths), 0) if text_lengths else 0,
"total_helpful_votes": sum(helpful_votes),
"early_access_reviews": sum(1 for r in reviews if r.get("written_during_early_access")),
"free_key_reviews": sum(1 for r in reviews if r.get("received_for_free")),
"language_distribution": dict(lang_dist.most_common(10)),
}
def extract_common_themes(reviews: list, n: int = 20) -> dict:
"""Extract most common words/phrases from positive and negative reviews."""
import re
# Simple stopword list
stopwords = {
"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for",
"of", "with", "is", "it", "this", "that", "was", "are", "be", "have",
"has", "had", "not", "i", "you", "me", "my", "your", "we", "they",
"game", "games", "play", "playing", "played", "hours", "time", "very",
}
def get_words(text: str) -> list:
words = re.findall(r"[a-z]+", text.lower())
return [w for w in words if w not in stopwords and len(w) > 3]
pos_words = []
neg_words = []
for r in reviews:
words = get_words(r.get("review_text", ""))
if r.get("voted_up"):
pos_words.extend(words)
else:
neg_words.extend(words)
return {
"top_positive_words": Counter(pos_words).most_common(n),
"top_negative_words": Counter(neg_words).most_common(n),
}
# Example analysis
conn = setup_steam_db()
app_id = 1245620 # Elden Ring
raw_reviews = scrape_steam_reviews(app_id, max_reviews=500)
parsed = parse_all_reviews(raw_reviews, app_id)
app_details = get_app_details(app_id)
save_app_to_db(conn, app_details)
saved = save_reviews_to_db(conn, parsed)
print(f"Saved {saved} reviews")
stats = compute_review_stats(parsed)
print(f"\nReview Statistics for App {app_id}:")
for key, val in stats.items():
print(f" {key}: {val}")
quality = filter_quality_reviews(parsed, min_playtime_hours=10, min_text_length=200)
themes = extract_common_themes(quality)
print(f"\nTop positive words: {themes['top_positive_words'][:5]}")
print(f"Top negative words: {themes['top_negative_words'][:5]}")
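Beyond raw frequency counts, you can rank words by how disproportionately they appear in positive versus negative reviews. This "lift" score is a simple add-one-smoothed ratio, not a trained model; the function and the sample counts below are illustrative:

```python
from collections import Counter

def word_sentiment_lift(pos_counts: Counter, neg_counts: Counter,
                        min_count: int = 5) -> list:
    """Rank words by (smoothed) relative frequency in positive vs
    negative reviews. Lift > 1 means positive-leaning, < 1 negative."""
    total_pos = sum(pos_counts.values()) or 1
    total_neg = sum(neg_counts.values()) or 1
    scores = []
    for word in set(pos_counts) | set(neg_counts):
        p, n = pos_counts[word], neg_counts[word]
        if p + n < min_count:  # skip rare words
            continue
        lift = ((p + 1) / total_pos) / ((n + 1) / total_neg)
        scores.append((word, round(lift, 2)))
    return sorted(scores, key=lambda x: -x[1])

# Fabricated word counts for illustration
pos = Counter({"masterpiece": 8, "combat": 6, "buggy": 1})
neg = Counter({"buggy": 7, "combat": 5, "refund": 4})
print(word_sentiment_lift(pos, neg, min_count=5)[0][0])  # masterpiece
```

Feed it the `pos_words` and `neg_words` counters from `extract_common_themes` to get a ranking that filters out words common to both sides (like "combat" above).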
Competitive Analysis: Comparing Multiple Games
def compare_games(app_ids: list, reviews_per_game: int = 200) -> dict:
"""Scrape and compare reviews across multiple games."""
conn = setup_steam_db()
comparison = {}
for app_id in app_ids:
print(f"\nProcessing app {app_id}...")
app_info = get_app_details(app_id)
save_app_to_db(conn, app_info)
raw = scrape_steam_reviews(app_id, max_reviews=reviews_per_game)
parsed = parse_all_reviews(raw, app_id)
save_reviews_to_db(conn, parsed)
stats = compute_review_stats(parsed)
comparison[app_id] = {
"name": app_info.get("name", f"App {app_id}"),
**stats,
}
time.sleep(random.uniform(3, 8))
conn.close()
return comparison
# Compare top RPGs
rpg_ids = [1245620, 1086940, 1145360] # Elden Ring, Baldur's Gate 3, Hades
results = compare_games(rpg_ids, reviews_per_game=300)
print("\nGame Comparison:")
for app_id, data in results.items():
print(f"\n{data['name']} (App {app_id})")
print(f" Positive: {data['positive_ratio']}%")
print(f" Avg playtime at review: {data['avg_playtime_hours']}h")
print(f" Avg review length: {data['avg_review_length']} chars")
Getting the Top Selling Games List
def get_top_sellers(count: int = 100) -> list:
"""Get Steam top sellers list with app IDs."""
resp = requests.get(
"https://store.steampowered.com/api/featuredcategories/",
params={"cc": "us", "l": "english"},
timeout=15
)
resp.raise_for_status()
data = resp.json()
apps = []
    for item in data.get("top_sellers", {}).get("items", []):
        apps.append({
            "appid": item.get("id"),
            "name": item.get("name"),
        })
return apps[:count]
top = get_top_sellers(50)
print(f"Top {len(top)} Steam sellers:")
for app in top[:10]:
print(f" {app['appid']}: {app['name']}")
Final Thoughts
Steam's review API is one of the most accessible data sources in gaming:
- No authentication required
- Rich metadata (playtime, helpful votes, developer responses)
- Cursor pagination works reliably once you understand it
- Multi-language support built in
Key points to remember:
1. Always pass the cursor — this is the most common mistake
2. Watch for empty responses — check query_summary, not just the reviews array
3. 1.5+ second delays — Steam rate limits are real
4. Filter by playtime — separates thoughtful reviews from drive-by posts
5. Use residential proxies for scale — ThorData works well for distributing load across hundreds of games
6. Store in SQLite — enables powerful cross-game analysis with SQL
The playtime and helpful-vote data make Steam reviews uniquely valuable compared to most user-generated content sources — you can actually distinguish thoughtful criticism from low-effort negativity, which makes them excellent training data for sentiment classifiers.