How to Scrape Reddit Without the API in 2026 (Complete Python Guide)


Reddit's API drama of 2023 was a turning point. In June 2023, Reddit jacked up API pricing so aggressively that beloved apps like Apollo, Reddit is Fun, and Sync shut down within weeks. The goal: force everyone through official channels, monetize the data, and cut off third-party access.

But here's the thing — Reddit still runs on the same infrastructure, and a surprising amount of public data is still accessible without authentication. The old.reddit.com JSON endpoints, RSS feeds, and Pushshift alternatives give you enough to build production-grade scrapers for research, monitoring, and analysis.

This guide covers every method that works in 2026, with complete Python code you can run today.

What Reddit Shut Down (and What Still Works)

The 2023 changes killed affordable OAuth-based API access for third-party apps. Reddit introduced tiered pricing that priced out indie developers almost overnight.

This effectively ended Apollo, Reddit is Fun, and similar apps. But it didn't shut down Reddit's own public-facing data endpoints.

What still works without authentication in 2026:

- Public subreddit listings (hot, new, top, rising)
- Post metadata (title, score, author, timestamps, flair)
- Comment trees on public posts
- User profile pages and comment history
- Subreddit search within specific communities
- Reddit-wide search across all public subreddits
- RSS feeds for any subreddit or user

What requires authentication:

- Private/quarantined subreddits
- Voting, posting, commenting (write operations)
- Saved posts, subscriptions, messaging
- NSFW content (age-gated behind login)
- Detailed user karma breakdowns

Method 1: The Old Reddit JSON Endpoints

Reddit's old interface (old.reddit.com) exposes a simple trick that most people don't know about: append .json to any Reddit URL and you get structured JSON instead of HTML.

This isn't a backdoor — it's an intentional feature Reddit has had for over a decade. It's rate-limited and they know about it, but it's still functional for moderate scraping needs.
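The trick is nothing more than a URL rewrite. Here's a small helper (a convenience function of my own, not part of any Reddit library) that converts a page URL to its JSON equivalent:

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url: str) -> str:
    """Return the .json equivalent of a Reddit page URL, preserving any query string."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")  # avoid "r/python/.json" from trailing slashes
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


print(to_json_url("https://old.reddit.com/r/python/hot"))
# https://old.reddit.com/r/python/hot.json
```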

Scraping Subreddit Posts

import httpx
import time
from dataclasses import dataclass
from datetime import datetime


@dataclass
class RedditPost:
    post_id: str
    title: str
    author: str
    score: int
    num_comments: int
    url: str
    selftext: str
    permalink: str
    subreddit: str
    created_utc: float
    is_self: bool
    link_flair_text: str
    over_18: bool
    upvote_ratio: float


HEADERS = {
    "User-Agent": "PythonScraper/2.0 (research project; [email protected])"
}


def scrape_subreddit(
    subreddit: str,
    sort: str = "hot",
    max_posts: int = 100,
    time_filter: str = "all",
) -> list[RedditPost]:
    """
    Scrape posts from a subreddit using the old.reddit.com JSON API.

    Args:
        subreddit: Subreddit name without the r/ prefix.
        sort: One of 'hot', 'new', 'top', 'rising', 'controversial'.
        max_posts: Maximum number of posts to fetch (up to ~1000).
        time_filter: For 'top' and 'controversial': 'hour', 'day',
                     'week', 'month', 'year', 'all'.

    Returns:
        List of RedditPost objects with full metadata.
    """
    posts = []
    after = None
    base_url = f"https://old.reddit.com/r/{subreddit}/{sort}.json"

    client = httpx.Client(headers=HEADERS, timeout=15, follow_redirects=True)

    while len(posts) < max_posts:
        params = {"limit": 100, "raw_json": 1}
        if after:
            params["after"] = after
        if sort in ("top", "controversial"):
            params["t"] = time_filter

        resp = client.get(base_url, params=params)

        if resp.status_code == 429:
            retry_after = int(resp.headers.get("Retry-After", 60))
            print(f"Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after)
            continue

        resp.raise_for_status()
        data = resp.json()["data"]

        for child in data["children"]:
            post = child["data"]
            posts.append(RedditPost(
                post_id=post["id"],
                title=post["title"],
                author=post["author"],
                score=post["score"],
                num_comments=post["num_comments"],
                url=post["url"],
                selftext=post.get("selftext", ""),
                permalink=post["permalink"],
                subreddit=post["subreddit"],
                created_utc=post["created_utc"],
                is_self=post["is_self"],
                link_flair_text=post.get("link_flair_text", ""),
                over_18=post.get("over_18", False),
                upvote_ratio=post.get("upvote_ratio", 0.0),
            ))

        after = data.get("after")
        if not after:
            break

        time.sleep(1.2)  # Respect the 1 req/sec rate limit

    client.close()
    return posts[:max_posts]


# Usage
posts = scrape_subreddit("python", sort="top", max_posts=50, time_filter="week")
for p in posts[:5]:
    print(f"{p.score:>6} | {p.title[:60]}")
    print(f"       | {p.num_comments} comments | by u/{p.author}")

Understanding the JSON Response Structure

Every Reddit JSON endpoint returns data in the same structure:

response["data"]["children"]     →  list of items (posts, comments, etc.)
  ["data"]["title"]              →  post title
  ["data"]["score"]              →  net upvotes
  ["data"]["url"]                →  linked URL (or self-post URL)
  ["data"]["num_comments"]       →  comment count
  ["data"]["created_utc"]        →  Unix timestamp
  ["data"]["author"]             →  username (string)
  ["data"]["selftext"]           →  body text (self posts only)
  ["data"]["is_self"]            →  true for text posts, false for links
  ["data"]["link_flair_text"]    →  post flair label
  ["data"]["upvote_ratio"]       →  0.0 to 1.0 — percentage of upvotes
  ["data"]["over_18"]            →  NSFW flag
response["data"]["after"]        →  cursor for next page (null = last page)

The after value is a fullname like t3_abc123 — the t3_ prefix indicates it's a link (post). Use it as the after parameter to paginate. Keep fetching until after is null.
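To make the shape concrete, here's that structure navigated on a hand-built sample payload (the IDs and titles below are illustrative, not real posts):

```python
# Minimal walk of the response shape above, on a fabricated sample payload.
sample = {
    "data": {
        "children": [
            {"kind": "t3", "data": {"id": "abc123", "title": "First post", "score": 42}},
            {"kind": "t3", "data": {"id": "def456", "title": "Second post", "score": 7}},
        ],
        "after": "t3_def456",  # pass as ?after=... to fetch the next page
    }
}

titles = [child["data"]["title"] for child in sample["data"]["children"]]
cursor = sample["data"]["after"]

print(titles)   # ['First post', 'Second post']
print(cursor)   # t3_def456
```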

Method 2: Scraping Comment Trees

Comments are where Reddit's real value lives. Here's how to extract full comment trees:

from dataclasses import dataclass


@dataclass
class RedditComment:
    comment_id: str
    author: str
    body: str
    score: int
    created_utc: float
    depth: int
    parent_id: str
    replies_count: int


def scrape_comments(
    subreddit: str,
    post_id: str,
    sort: str = "best",
) -> list[RedditComment]:
    """
    Scrape all visible comments from a Reddit post.

    Args:
        subreddit: Subreddit name without r/ prefix.
        post_id: The post's base36 ID (from the URL).
        sort: Comment sort order: 'best', 'top', 'new', 'controversial', 'old'.

    Returns:
        Flattened list of comments with depth information.
    """
    url = f"https://old.reddit.com/r/{subreddit}/comments/{post_id}.json"
    params = {"sort": sort, "limit": 500, "raw_json": 1}

    resp = httpx.get(url, headers=HEADERS, params=params, timeout=15)
    resp.raise_for_status()
    data = resp.json()

    # Response is a two-element array:
    # [0] = post metadata, [1] = comment tree
    comment_tree = data[1]["data"]["children"]

    comments = []
    _flatten_comments(comment_tree, comments, depth=0)
    return comments


def _flatten_comments(
    children: list,
    output: list[RedditComment],
    depth: int,
):
    """Recursively flatten nested comment tree into a flat list."""
    for child in children:
        if child["kind"] != "t1":  # t1 = comment, t3 = post
            continue

        c = child["data"]
        replies_data = c.get("replies", "")
        reply_children = []
        if isinstance(replies_data, dict):
            reply_children = replies_data["data"]["children"]

        output.append(RedditComment(
            comment_id=c["id"],
            author=c.get("author", "[deleted]"),
            body=c.get("body", ""),
            score=c.get("score", 0),
            created_utc=c.get("created_utc", 0),
            depth=depth,
            parent_id=c.get("parent_id", ""),
            replies_count=len([
                r for r in reply_children if r["kind"] == "t1"
            ]),
        ))

        # Recurse into nested replies
        if reply_children:
            _flatten_comments(reply_children, output, depth + 1)


# Usage
comments = scrape_comments("python", "abc123", sort="top")
for c in comments[:10]:
    indent = "  " * c.depth
    print(f"{indent}[{c.score:>4}] u/{c.author}: {c.body[:60]}...")

Method 3: Search Across Reddit

Reddit's search endpoint also supports the .json trick:

def search_reddit(
    query: str,
    subreddit: str | None = None,
    sort: str = "relevance",
    time_filter: str = "all",
    max_results: int = 100,
) -> list[RedditPost]:
    """
    Search Reddit for posts matching a query.

    Args:
        query: Search terms.
        subreddit: Restrict search to this subreddit (optional).
        sort: 'relevance', 'hot', 'top', 'new', 'comments'.
        time_filter: 'hour', 'day', 'week', 'month', 'year', 'all'.
        max_results: Maximum number of results.

    Returns:
        List of matching RedditPost objects.
    """
    if subreddit:
        base_url = (
            f"https://old.reddit.com/r/{subreddit}/search.json"
        )
        params = {
            "q": query, "restrict_sr": 1, "sort": sort,
            "t": time_filter, "limit": 100, "raw_json": 1,
        }
    else:
        base_url = "https://old.reddit.com/search.json"
        params = {
            "q": query, "sort": sort,
            "t": time_filter, "limit": 100, "raw_json": 1,
        }

    client = httpx.Client(headers=HEADERS, timeout=15, follow_redirects=True)
    posts = []
    after = None

    while len(posts) < max_results:
        if after:
            params["after"] = after

        resp = client.get(base_url, params=params)
        if resp.status_code == 429:
            time.sleep(60)
            continue
        resp.raise_for_status()

        data = resp.json()["data"]
        for child in data["children"]:
            post = child["data"]
            posts.append(RedditPost(
                post_id=post["id"],
                title=post["title"],
                author=post["author"],
                score=post["score"],
                num_comments=post["num_comments"],
                url=post["url"],
                selftext=post.get("selftext", ""),
                permalink=post["permalink"],
                subreddit=post["subreddit"],
                created_utc=post["created_utc"],
                is_self=post["is_self"],
                link_flair_text=post.get("link_flair_text", ""),
                over_18=post.get("over_18", False),
                upvote_ratio=post.get("upvote_ratio", 0.0),
            ))

        after = data.get("after")
        if not after:
            break
        time.sleep(1.2)

    client.close()
    return posts[:max_results]


# Usage — search within a specific subreddit
results = search_reddit("asyncio tutorial", subreddit="python", sort="top")
for r in results[:5]:
    print(f"{r.score:>5} | {r.title[:60]}")

# Usage — search across all of Reddit
results = search_reddit("best python web framework 2026", sort="relevance")

Method 4: User Profile Scraping

You can scrape any public user's post and comment history:

def scrape_user_history(
    username: str,
    content_type: str = "overview",
    max_items: int = 100,
) -> list[dict]:
    """
    Scrape a Reddit user's public post/comment history.

    Args:
        username: Reddit username without u/ prefix.
        content_type: 'overview', 'submitted' (posts only),
                      'comments' (comments only).
        max_items: Maximum items to fetch.

    Returns:
        List of dicts with post or comment data.
    """
    base_url = (
        f"https://old.reddit.com/user/{username}/{content_type}.json"
    )
    client = httpx.Client(headers=HEADERS, timeout=15, follow_redirects=True)
    items = []
    after = None

    while len(items) < max_items:
        params = {"limit": 100, "raw_json": 1}
        if after:
            params["after"] = after

        resp = client.get(base_url, params=params)
        if resp.status_code == 404:
            print(f"User u/{username} not found or suspended.")
            break
        if resp.status_code == 429:
            time.sleep(60)
            continue
        resp.raise_for_status()

        data = resp.json()["data"]
        for child in data["children"]:
            item = child["data"]
            if child["kind"] == "t3":  # Post
                items.append({
                    "type": "post",
                    "title": item["title"],
                    "subreddit": item["subreddit"],
                    "score": item["score"],
                    "created_utc": item["created_utc"],
                    "num_comments": item["num_comments"],
                    "permalink": item["permalink"],
                })
            elif child["kind"] == "t1":  # Comment
                items.append({
                    "type": "comment",
                    "body": item["body"][:500],
                    "subreddit": item["subreddit"],
                    "score": item["score"],
                    "created_utc": item["created_utc"],
                    "permalink": item["permalink"],
                })

        after = data.get("after")
        if not after:
            break
        time.sleep(1.2)

    client.close()
    return items[:max_items]


# Usage
history = scrape_user_history("spez", content_type="submitted", max_items=20)
for item in history[:5]:
    ts = datetime.fromtimestamp(item["created_utc"]).strftime("%Y-%m-%d")
    print(f"[{ts}] r/{item['subreddit']} — {item.get('title', item.get('body', '')[:60])}")

Storing Reddit Data in SQLite

For any collection beyond a quick experiment, store your data properly:

import sqlite3
from datetime import datetime


def init_reddit_db(db_path: str = "reddit_data.db") -> sqlite3.Connection:
    """Create SQLite tables for Reddit posts and comments."""
    db = sqlite3.connect(db_path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS posts (
            post_id TEXT PRIMARY KEY,
            subreddit TEXT NOT NULL,
            title TEXT,
            author TEXT,
            score INTEGER,
            num_comments INTEGER,
            url TEXT,
            selftext TEXT,
            permalink TEXT,
            created_utc REAL,
            is_self BOOLEAN,
            flair TEXT,
            upvote_ratio REAL,
            scraped_at TEXT
        );

        CREATE TABLE IF NOT EXISTS comments (
            comment_id TEXT PRIMARY KEY,
            post_id TEXT,
            subreddit TEXT,
            author TEXT,
            body TEXT,
            score INTEGER,
            depth INTEGER,
            created_utc REAL,
            parent_id TEXT,
            scraped_at TEXT,
            FOREIGN KEY (post_id) REFERENCES posts(post_id)
        );

        CREATE INDEX IF NOT EXISTS idx_posts_subreddit
            ON posts(subreddit);
        CREATE INDEX IF NOT EXISTS idx_posts_score
            ON posts(score DESC);
        CREATE INDEX IF NOT EXISTS idx_comments_post
            ON comments(post_id);
        CREATE INDEX IF NOT EXISTS idx_comments_author
            ON comments(author);
    """)
    return db


def save_posts(db: sqlite3.Connection, posts: list[RedditPost]):
    """Batch insert posts into the database."""
    db.executemany(
        """INSERT OR REPLACE INTO posts
           (post_id, subreddit, title, author, score, num_comments,
            url, selftext, permalink, created_utc, is_self,
            flair, upvote_ratio, scraped_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        [
            (p.post_id, p.subreddit, p.title, p.author, p.score,
             p.num_comments, p.url, p.selftext, p.permalink,
             p.created_utc, p.is_self, p.link_flair_text,
             p.upvote_ratio, datetime.utcnow().isoformat())
            for p in posts
        ],
    )
    db.commit()
    print(f"Saved {len(posts)} posts to database.")


def save_comments(
    db: sqlite3.Connection,
    comments: list[RedditComment],
    post_id: str,
    subreddit: str,
):
    """Batch insert comments into the database."""
    db.executemany(
        """INSERT OR REPLACE INTO comments
           (comment_id, post_id, subreddit, author, body, score,
            depth, created_utc, parent_id, scraped_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        [
            (c.comment_id, post_id, subreddit, c.author, c.body,
             c.score, c.depth, c.created_utc, c.parent_id,
             datetime.utcnow().isoformat())
            for c in comments
        ],
    )
    db.commit()


# Example: Scrape a subreddit and store everything
db = init_reddit_db()
posts = scrape_subreddit("machinelearning", sort="top", max_posts=50, time_filter="week")
save_posts(db, posts)

# Then scrape comments for the top 10 posts
for post in sorted(posts, key=lambda p: p.score, reverse=True)[:10]:
    print(f"Scraping comments for: {post.title[:50]}...")
    comments = scrape_comments("machinelearning", post.post_id)
    save_comments(db, comments, post.post_id, "machinelearning")
    time.sleep(2)  # Be respectful between requests

Rate Limits and Anti-Blocking Strategies

Reddit's rate limits for the JSON endpoints are strict and well enforced: unauthenticated clients get roughly one request per second per IP before 429 responses start appearing.

The time.sleep(1.2) in the code examples above is non-negotiable. That extra 0.2 seconds of margin prevents you from hitting the exact threshold during response processing.
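When you hit repeated 429s despite the fixed delay, back off harder before retrying. Here's a small helper for a capped exponential schedule (an illustrative pattern, not something Reddit prescribes):

```python
def backoff_delay(attempt: int, base: float = 1.2, cap: float = 60.0) -> float:
    """Delay in seconds for the Nth consecutive failure (attempt 0 = first retry).

    Doubles the base delay on each consecutive 429, capped at `cap`.
    Reset the attempt counter after any successful response.
    """
    return min(base * (2 ** attempt), cap)


for attempt in range(6):
    print(f"attempt {attempt}: sleep {backoff_delay(attempt):.1f}s")
# attempt 0: sleep 1.2s ... attempt 5: sleep 38.4s
```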

Scaling Beyond Rate Limits

For production workloads — thousands of posts, full comment trees, historical data — the 1 req/sec limit becomes a real bottleneck. Options for scaling:

Rotating residential proxies: Distribute requests across many IPs to multiply your effective throughput. ThorData provides residential proxy pools that work well for Reddit — the IPs come from real residential connections, so Reddit treats them as regular user traffic rather than bot activity. Each IP gets its own rate limit allocation.

# Using httpx with rotating proxy
proxy_url = "http://USER:[email protected]:9000"
client = httpx.Client(
    headers=HEADERS,
    proxy=proxy_url,
    timeout=20,
)

RSS feeds as a supplement: Every subreddit and user has an RSS feed at https://old.reddit.com/r/SUBREDDIT/.rss. These update less frequently but don't count against JSON API rate limits. Useful for monitoring new posts without burning your request budget.
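Reddit's RSS endpoints return Atom XML, which the standard library parses without extra dependencies. A sketch using a trimmed stand-in for what the /.rss endpoint returns (the feed entry below is fabricated for illustration):

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom tags are namespace-qualified

# Stand-in for the body of https://old.reddit.com/r/python/.rss
feed_xml = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Example post title</title>
    <link href="https://old.reddit.com/r/python/comments/abc123/example/"/>
    <updated>2026-01-15T12:00:00+00:00</updated>
  </entry>
</feed>"""

root = ET.fromstring(feed_xml)
entries = [
    (e.findtext(f"{ATOM}title"), e.find(f"{ATOM}link").get("href"))
    for e in root.findall(f"{ATOM}entry")
]
print(entries[0][0])  # Example post title
```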

Managed scraping tools: If you want to skip the proxy infrastructure, there's a free Reddit Scraper on Apify that handles pagination, rate limiting, and proxy rotation automatically.

Business Use Cases for Reddit Data

Reddit data has genuine commercial value across several domains, from brand and sentiment monitoring to market research, trend detection, and competitive analysis.

Summary

Reddit's 2023 API changes were painful but didn't kill scraping. The old.reddit.com/.json endpoints remain functional in 2026 and cover most use cases:

| Feature | Status | Method |
| --- | --- | --- |
| Public subreddit posts | Works | .json endpoint |
| Post metadata and scores | Works | .json endpoint |
| Comment trees (nested) | Works | /comments/{id}.json |
| User profiles and history | Works | /user/{name}.json |
| Search (global and per-sub) | Works | /search.json |
| RSS feeds | Works | .rss endpoint |
| Private subreddits | Requires auth | OAuth API only |
| Write operations | Requires auth | OAuth API only |

Stick to 1 request/second, always send a descriptive User-Agent, add the raw_json=1 parameter to every request, and you'll be pulling Reddit data reliably with a few dozen lines of Python.