How to Scrape Reddit Without the API in 2026 (Complete Python Guide)
Reddit's API drama of 2023 was a turning point. In June 2023, Reddit jacked up API pricing so aggressively that beloved apps like Apollo, Reddit is Fun, and Sync shut down within weeks. The goal: force everyone through official channels, monetize the data, and cut off third-party access.
But here's the thing — Reddit still runs on the same infrastructure, and a surprising amount of public data is still accessible without authentication. The old.reddit.com JSON endpoints, RSS feeds, and Pushshift alternatives give you enough to build production-grade scrapers for research, monitoring, and analysis.
This guide covers every method that works in 2026, with complete Python code you can run today.
What Reddit Shut Down (and What Still Works)
The 2023 changes killed OAuth-based API access for third-party apps. Reddit implemented tiered pricing that priced out indie developers:
- Free tier: 100 requests/minute (barely useful for any real workload)
- Paid tier: $0.24 per 1,000 API calls (expensive at scale — 1M calls = $240)
This effectively ended Apollo, Reddit is Fun, and similar apps. But it didn't shut down Reddit's own public-facing data endpoints.
What still works without authentication in 2026:
- Public subreddit listings (hot, new, top, rising)
- Post metadata (title, score, author, timestamps, flair)
- Comment trees on public posts
- User profile pages and comment history
- Subreddit search within specific communities
- Reddit-wide search across all public subreddits
- RSS feeds for any subreddit or user
What requires authentication:
- Private/quarantined subreddits
- Voting, posting, commenting (write operations)
- Saved posts, subscriptions, messaging
- NSFW content (age-gated behind login)
- Detailed user karma breakdowns
Method 1: The Old Reddit JSON Endpoints
Reddit's old interface (old.reddit.com) exposes a simple trick that most people don't know about: append .json to any Reddit URL and you get structured JSON instead of HTML.
This isn't a backdoor — it's an intentional feature Reddit has had for over a decade. It's rate-limited and they know about it, but it's still functional for moderate scraping needs.
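At its core the trick is pure URL manipulation. A minimal sketch (the helper name is mine, not a Reddit convention):

```python
def to_json_url(url: str) -> str:
    """Turn any public Reddit page URL into its JSON equivalent."""
    return url.rstrip("/") + ".json"

json_url = to_json_url("https://old.reddit.com/r/python/comments/abc123/some_post")
# → "https://old.reddit.com/r/python/comments/abc123/some_post.json"

# Fetch it like any other endpoint (always send a User-Agent and raw_json=1):
# resp = httpx.get(json_url, headers=HEADERS, params={"raw_json": 1})
# data = resp.json()
```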
Scraping Subreddit Posts
import httpx
import time
from dataclasses import dataclass
from datetime import datetime
@dataclass
class RedditPost:
post_id: str
title: str
author: str
score: int
num_comments: int
url: str
selftext: str
permalink: str
subreddit: str
created_utc: float
is_self: bool
link_flair_text: str
over_18: bool
upvote_ratio: float
HEADERS = {
"User-Agent": "PythonScraper/2.0 (research project; [email protected])"
}
def scrape_subreddit(
subreddit: str,
sort: str = "hot",
max_posts: int = 100,
time_filter: str = "all",
) -> list[RedditPost]:
"""
Scrape posts from a subreddit using the old.reddit.com JSON API.
Args:
subreddit: Subreddit name without the r/ prefix.
sort: One of 'hot', 'new', 'top', 'rising', 'controversial'.
max_posts: Maximum number of posts to fetch (up to ~1000).
time_filter: For 'top' and 'controversial': 'hour', 'day',
'week', 'month', 'year', 'all'.
Returns:
List of RedditPost objects with full metadata.
"""
posts = []
after = None
base_url = f"https://old.reddit.com/r/{subreddit}/{sort}.json"
client = httpx.Client(headers=HEADERS, timeout=15, follow_redirects=True)
while len(posts) < max_posts:
params = {"limit": 100, "raw_json": 1}
if after:
params["after"] = after
if sort in ("top", "controversial"):
params["t"] = time_filter
resp = client.get(base_url, params=params)
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 60))
print(f"Rate limited. Waiting {retry_after}s...")
time.sleep(retry_after)
continue
resp.raise_for_status()
data = resp.json()["data"]
for child in data["children"]:
post = child["data"]
posts.append(RedditPost(
post_id=post["id"],
title=post["title"],
author=post["author"],
score=post["score"],
num_comments=post["num_comments"],
url=post["url"],
selftext=post.get("selftext", ""),
permalink=post["permalink"],
subreddit=post["subreddit"],
created_utc=post["created_utc"],
is_self=post["is_self"],
link_flair_text=post.get("link_flair_text", ""),
over_18=post.get("over_18", False),
upvote_ratio=post.get("upvote_ratio", 0.0),
))
after = data.get("after")
if not after:
break
time.sleep(1.2) # Respect the 1 req/sec rate limit
client.close()
return posts[:max_posts]
# Usage
posts = scrape_subreddit("python", sort="top", max_posts=50, time_filter="week")
for p in posts[:5]:
print(f"{p.score:>6} | {p.title[:60]}")
print(f" | {p.num_comments} comments | by u/{p.author}")
Understanding the JSON Response Structure
Every Reddit JSON endpoint returns data in the same structure:
response["data"]["children"] → list of items (posts, comments, etc.)
["data"]["title"] → post title
["data"]["score"] → net upvotes
["data"]["url"] → linked URL (or self-post URL)
["data"]["num_comments"] → comment count
["data"]["created_utc"] → Unix timestamp
["data"]["author"] → username (string)
["data"]["selftext"] → body text (self posts only)
["data"]["is_self"] → true for text posts, false for links
["data"]["link_flair_text"] → post flair label
["data"]["upvote_ratio"] → 0.0 to 1.0 — percentage of upvotes
["data"]["over_18"] → NSFW flag
response["data"]["after"] → cursor for next page (null = last page)
The after value is a fullname like t3_abc123 — the t3_ prefix indicates it's a link (post). Use it as the after parameter to paginate. Keep fetching until after is null.
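For reference, the fullname prefixes map to Reddit object types as follows (the prefixes are stable Reddit conventions; the parsing helper itself is just illustration):

```python
# Reddit "fullnames" combine a type prefix with a base36 ID.
KIND_PREFIXES = {
    "t1": "comment",
    "t2": "account",
    "t3": "link (post)",
    "t4": "message",
    "t5": "subreddit",
}

def parse_fullname(fullname: str) -> tuple[str, str]:
    """Split a fullname like 't3_abc123' into (object type, base36 id)."""
    prefix, _, base36_id = fullname.partition("_")
    return KIND_PREFIXES.get(prefix, "unknown"), base36_id

parse_fullname("t3_abc123")
# → ("link (post)", "abc123")
```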
Method 2: Scraping Comment Trees
Comments are where Reddit's real value lives. Here's how to extract full comment trees:
from dataclasses import dataclass
@dataclass
class RedditComment:
comment_id: str
author: str
body: str
score: int
created_utc: float
depth: int
parent_id: str
replies_count: int
def scrape_comments(
subreddit: str,
post_id: str,
sort: str = "best",
) -> list[RedditComment]:
"""
Scrape all visible comments from a Reddit post.
Args:
subreddit: Subreddit name without r/ prefix.
post_id: The post's base36 ID (from the URL).
sort: Comment sort order: 'best', 'top', 'new', 'controversial', 'old'.
Returns:
Flattened list of comments with depth information.
"""
url = f"https://old.reddit.com/r/{subreddit}/comments/{post_id}.json"
params = {"sort": sort, "limit": 500, "raw_json": 1}
resp = httpx.get(url, headers=HEADERS, params=params, timeout=15)
resp.raise_for_status()
data = resp.json()
# Response is a two-element array:
# [0] = post metadata, [1] = comment tree
comment_tree = data[1]["data"]["children"]
comments = []
_flatten_comments(comment_tree, comments, depth=0)
return comments
def _flatten_comments(
children: list,
output: list[RedditComment],
depth: int,
):
"""Recursively flatten nested comment tree into a flat list."""
for child in children:
if child["kind"] != "t1":  # t1 = comment; skips the t3 post and "more" stubs
continue
c = child["data"]
replies_data = c.get("replies", "")
reply_children = []
if isinstance(replies_data, dict):
reply_children = replies_data["data"]["children"]
output.append(RedditComment(
comment_id=c["id"],
author=c.get("author", "[deleted]"),
body=c.get("body", ""),
score=c.get("score", 0),
created_utc=c.get("created_utc", 0),
depth=depth,
parent_id=c.get("parent_id", ""),
replies_count=len([
r for r in reply_children if r["kind"] == "t1"
]),
))
# Recurse into nested replies
if reply_children:
_flatten_comments(reply_children, output, depth + 1)
# Usage
comments = scrape_comments("python", "abc123", sort="top")
for c in comments[:10]:
indent = " " * c.depth
print(f"{indent}[{c.score:>4}] u/{c.author}: {c.body[:60]}...")
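If you later need threaded output, the depth field is enough to regroup the flat list into a tree, because the flattener emits parents before their children. A sketch using plain dicts (any objects carrying a depth attribute or key work the same way):

```python
def build_tree(flat_comments: list[dict]) -> list[dict]:
    """Regroup a depth-annotated, pre-ordered flat comment list into a
    nested tree. Assumes parents always precede their children."""
    roots: list[dict] = []
    stack: list[dict] = []  # current chain of ancestors
    for c in flat_comments:
        node = {**c, "children": []}
        # Pop back up until the top of the stack is this node's parent
        while stack and stack[-1]["depth"] >= node["depth"]:
            stack.pop()
        if stack:
            stack[-1]["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots

flat = [
    {"comment_id": "a", "depth": 0},
    {"comment_id": "b", "depth": 1},
    {"comment_id": "c", "depth": 1},
    {"comment_id": "d", "depth": 0},
]
tree = build_tree(flat)
# tree[0]["children"] holds "b" and "c"; "d" is a second top-level comment
```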
Method 3: Search Across Reddit
Reddit's search endpoint also supports the .json trick:
def search_reddit(
query: str,
subreddit: str | None = None,
sort: str = "relevance",
time_filter: str = "all",
max_results: int = 100,
) -> list[RedditPost]:
"""
Search Reddit for posts matching a query.
Args:
query: Search terms.
subreddit: Restrict search to this subreddit (optional).
sort: 'relevance', 'hot', 'top', 'new', 'comments'.
time_filter: 'hour', 'day', 'week', 'month', 'year', 'all'.
max_results: Maximum number of results.
Returns:
List of matching RedditPost objects.
"""
if subreddit:
base_url = (
f"https://old.reddit.com/r/{subreddit}/search.json"
)
params = {
"q": query, "restrict_sr": 1, "sort": sort,
"t": time_filter, "limit": 100, "raw_json": 1,
}
else:
base_url = "https://old.reddit.com/search.json"
params = {
"q": query, "sort": sort,
"t": time_filter, "limit": 100, "raw_json": 1,
}
client = httpx.Client(headers=HEADERS, timeout=15, follow_redirects=True)
posts = []
after = None
while len(posts) < max_results:
if after:
params["after"] = after
resp = client.get(base_url, params=params)
if resp.status_code == 429:
time.sleep(60)
continue
resp.raise_for_status()
data = resp.json()["data"]
for child in data["children"]:
post = child["data"]
posts.append(RedditPost(
post_id=post["id"],
title=post["title"],
author=post["author"],
score=post["score"],
num_comments=post["num_comments"],
url=post["url"],
selftext=post.get("selftext", ""),
permalink=post["permalink"],
subreddit=post["subreddit"],
created_utc=post["created_utc"],
is_self=post["is_self"],
link_flair_text=post.get("link_flair_text", ""),
over_18=post.get("over_18", False),
upvote_ratio=post.get("upvote_ratio", 0.0),
))
after = data.get("after")
if not after:
break
time.sleep(1.2)
client.close()
return posts[:max_results]
# Usage — search within a specific subreddit
results = search_reddit("asyncio tutorial", subreddit="python", sort="top")
for r in results[:5]:
print(f"{r.score:>5} | {r.title[:60]}")
# Usage — search across all of Reddit
results = search_reddit("best python web framework 2026", sort="relevance")
Method 4: User Profile Scraping
You can scrape any public user's post and comment history:
def scrape_user_history(
username: str,
content_type: str = "overview",
max_items: int = 100,
) -> list[dict]:
"""
Scrape a Reddit user's public post/comment history.
Args:
username: Reddit username without u/ prefix.
content_type: 'overview', 'submitted' (posts only),
'comments' (comments only).
max_items: Maximum items to fetch.
Returns:
List of dicts with post or comment data.
"""
base_url = (
f"https://old.reddit.com/user/{username}/{content_type}.json"
)
client = httpx.Client(headers=HEADERS, timeout=15, follow_redirects=True)
items = []
after = None
while len(items) < max_items:
params = {"limit": 100, "raw_json": 1}
if after:
params["after"] = after
resp = client.get(base_url, params=params)
if resp.status_code == 404:
print(f"User u/{username} not found or suspended.")
break
if resp.status_code == 429:
time.sleep(60)
continue
resp.raise_for_status()
data = resp.json()["data"]
for child in data["children"]:
item = child["data"]
if child["kind"] == "t3": # Post
items.append({
"type": "post",
"title": item["title"],
"subreddit": item["subreddit"],
"score": item["score"],
"created_utc": item["created_utc"],
"num_comments": item["num_comments"],
"permalink": item["permalink"],
})
elif child["kind"] == "t1": # Comment
items.append({
"type": "comment",
"body": item["body"][:500],
"subreddit": item["subreddit"],
"score": item["score"],
"created_utc": item["created_utc"],
"permalink": item["permalink"],
})
after = data.get("after")
if not after:
break
time.sleep(1.2)
client.close()
return items[:max_items]
# Usage
history = scrape_user_history("spez", content_type="submitted", max_items=20)
for item in history[:5]:
ts = datetime.fromtimestamp(item["created_utc"]).strftime("%Y-%m-%d")
print(f"[{ts}] r/{item['subreddit']} — {item.get('title', item.get('body', '')[:60])}")
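A common next step is summarizing where a user is active. Since every item from scrape_user_history carries a subreddit key, a Counter does the job:

```python
from collections import Counter

def activity_by_subreddit(items: list[dict]) -> Counter:
    """Count a user's posts and comments per subreddit."""
    return Counter(item["subreddit"] for item in items)

history = [
    {"type": "comment", "subreddit": "python"},
    {"type": "post", "subreddit": "python"},
    {"type": "comment", "subreddit": "learnpython"},
]
activity_by_subreddit(history).most_common(2)
# → [('python', 2), ('learnpython', 1)]
```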
Storing Reddit Data in SQLite
For any collection beyond a quick experiment, store your data properly:
import sqlite3
from datetime import datetime
def init_reddit_db(db_path: str = "reddit_data.db") -> sqlite3.Connection:
"""Create SQLite tables for Reddit posts and comments."""
db = sqlite3.connect(db_path)
db.executescript("""
CREATE TABLE IF NOT EXISTS posts (
post_id TEXT PRIMARY KEY,
subreddit TEXT NOT NULL,
title TEXT,
author TEXT,
score INTEGER,
num_comments INTEGER,
url TEXT,
selftext TEXT,
permalink TEXT,
created_utc REAL,
is_self BOOLEAN,
flair TEXT,
upvote_ratio REAL,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS comments (
comment_id TEXT PRIMARY KEY,
post_id TEXT,
subreddit TEXT,
author TEXT,
body TEXT,
score INTEGER,
depth INTEGER,
created_utc REAL,
parent_id TEXT,
scraped_at TEXT,
FOREIGN KEY (post_id) REFERENCES posts(post_id)
);
CREATE INDEX IF NOT EXISTS idx_posts_subreddit
ON posts(subreddit);
CREATE INDEX IF NOT EXISTS idx_posts_score
ON posts(score DESC);
CREATE INDEX IF NOT EXISTS idx_comments_post
ON comments(post_id);
CREATE INDEX IF NOT EXISTS idx_comments_author
ON comments(author);
""")
return db
def save_posts(db: sqlite3.Connection, posts: list[RedditPost]):
"""Batch insert posts into the database."""
db.executemany(
"""INSERT OR REPLACE INTO posts
(post_id, subreddit, title, author, score, num_comments,
url, selftext, permalink, created_utc, is_self,
flair, upvote_ratio, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
[
(p.post_id, p.subreddit, p.title, p.author, p.score,
p.num_comments, p.url, p.selftext, p.permalink,
p.created_utc, p.is_self, p.link_flair_text,
p.upvote_ratio, datetime.utcnow().isoformat())
for p in posts
],
)
db.commit()
print(f"Saved {len(posts)} posts to database.")
def save_comments(
db: sqlite3.Connection,
comments: list[RedditComment],
post_id: str,
subreddit: str,
):
"""Batch insert comments into the database."""
db.executemany(
"""INSERT OR REPLACE INTO comments
(comment_id, post_id, subreddit, author, body, score,
depth, created_utc, parent_id, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
[
(c.comment_id, post_id, subreddit, c.author, c.body,
c.score, c.depth, c.created_utc, c.parent_id,
datetime.utcnow().isoformat())
for c in comments
],
)
db.commit()
# Example: Scrape a subreddit and store everything
db = init_reddit_db()
posts = scrape_subreddit("machinelearning", sort="top", max_posts=50, time_filter="week")
save_posts(db, posts)
# Then scrape comments for the top 10 posts
for post in sorted(posts, key=lambda p: p.score, reverse=True)[:10]:
print(f"Scraping comments for: {post.title[:50]}...")
comments = scrape_comments("machinelearning", post.post_id)
save_comments(db, comments, post.post_id, "machinelearning")
time.sleep(2) # Be respectful between requests
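Getting the data back out is a single query against the schema above (the function name is mine):

```python
import sqlite3

def top_posts(db: sqlite3.Connection, subreddit: str, limit: int = 5) -> list[tuple]:
    """Return the highest-scoring stored posts for one subreddit."""
    return db.execute(
        """SELECT title, score, num_comments
           FROM posts
           WHERE subreddit = ?
           ORDER BY score DESC
           LIMIT ?""",
        (subreddit, limit),
    ).fetchall()

# for title, score, n in top_posts(db, "machinelearning"):
#     print(f"{score:>6} | {n:>4} comments | {title[:60]}")
```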
Rate Limits and Anti-Blocking Strategies
Reddit's rate limits for the JSON API are strict and well-enforced:
- 1 request per second — hard limit. Exceeding this triggers 429 errors immediately.
- User-Agent is mandatory — requests without a descriptive User-Agent get 429 or 403 responses. Use a string like MyBot/1.0 (purpose; contact).
- IP-based throttling — sustained traffic from one IP gets temporarily blocked, even within rate limits.
- No raw_json=1 — without this parameter, HTML entities in responses are double-encoded, breaking your parser.
The time.sleep(1.2) in the code examples above is non-negotiable. That extra 0.2 seconds of margin prevents you from hitting the exact threshold during response processing.
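If you'd rather not sprinkle sleep calls throughout your code, a tiny limiter object centralizes the pacing. It measures from when the previous request started, so slow responses don't silently eat into the margin (a sketch, not a library API):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval: float = 1.2):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# limiter = RateLimiter(1.2)
# Inside a scrape loop, call limiter.wait() before each client.get(...)
```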
Scaling Beyond Rate Limits
For production workloads — thousands of posts, full comment trees, historical data — the 1 req/sec limit becomes a real bottleneck. Options for scaling:
Rotating residential proxies: Distribute requests across many IPs to multiply your effective throughput. ThorData provides residential proxy pools that work well for Reddit — the IPs come from real residential connections, so Reddit treats them as regular user traffic rather than bot activity. Each IP gets its own rate limit allocation.
# Using httpx with rotating proxy
proxy_url = "http://USER:[email protected]:9000"
client = httpx.Client(
headers=HEADERS,
proxy=proxy_url,
timeout=20,
)
RSS feeds as a supplement: Every subreddit and user has an RSS feed at https://old.reddit.com/r/SUBREDDIT/.rss. These update less frequently but don't count against JSON API rate limits. Useful for monitoring new posts without burning your request budget.
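Reddit's feeds are Atom XML, so the standard library parses them without extra dependencies. A minimal sketch (the field selection is mine; inspect the live feed for everything available):

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

def parse_reddit_feed(xml_text: str) -> list[dict]:
    """Extract title, link, and timestamp from a Reddit Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "title": entry.findtext("atom:title", "", ATOM_NS),
            "link": link.get("href") if link is not None else "",
            "updated": entry.findtext("atom:updated", "", ATOM_NS),
        })
    return entries

# resp = httpx.get("https://old.reddit.com/r/python/new/.rss", headers=HEADERS)
# for e in parse_reddit_feed(resp.text)[:5]:
#     print(e["updated"], e["title"])
```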
Managed scraping tools: If you want to skip the proxy infrastructure, there's a free Reddit Scraper on Apify that handles pagination, rate limiting, and proxy rotation automatically.
Business Use Cases for Reddit Data
Reddit data has genuine commercial value across several domains:
- Market research — Track what customers say about your product (and competitors) in their own words, unfiltered by survey design. Subreddits like r/SaaS, r/startups, and r/Entrepreneur surface honest product feedback.
- Content strategy — Identify what questions and topics get the most engagement in your niche. The top posts in relevant subreddits tell you exactly what your audience cares about.
- Sentiment analysis — Monitor brand perception over time by scraping mentions and running NLP analysis. Comment scores give you a built-in quality signal that surveys lack.
- Competitor intelligence — Track competitor product launches, pricing changes, and customer complaints as they happen in real-time.
- SEO keyword research — Reddit threads rank increasingly well on Google. Understanding what terminology and questions appear frequently helps you target the same search intent.
- Recruitment — Subreddits like r/cscareerquestions and r/ExperiencedDevs reveal what candidates actually care about, beyond what they say in interviews.
Summary
Reddit's 2023 API changes were painful but didn't kill scraping. The old.reddit.com .json endpoints remain functional in 2026 and cover most use cases:
| Feature | Status | Method |
|---|---|---|
| Public subreddit posts | Works | .json endpoint |
| Post metadata and scores | Works | .json endpoint |
| Comment trees (nested) | Works | /comments/{id}.json |
| User profiles and history | Works | /user/{name}.json |
| Search (global and per-sub) | Works | /search.json |
| RSS feeds | Works | .rss endpoint |
| Private subreddits | Requires auth | OAuth API only |
| Write operations | Requires auth | OAuth API only |
Stick to 1 request/second, always send a descriptive User-Agent, add the raw_json=1 parameter to every request, and you'll be pulling Reddit data reliably with a few dozen lines of Python.