How to Scrape Substack Newsletters with Python (2026 Guide)
Substack has become one of the dominant platforms for independent writers, with tens of thousands of newsletters ranging from niche hobbyist projects to publications pulling millions in annual revenue. Whether you're doing competitive intelligence, building a content aggregator, tracking newsletter growth trends, or just archiving a publication you love, scraping Substack at scale is a practical skill to have.
The good news: Substack doesn't have a public API, but it has something almost as good — a consistent internal API that every Substack site uses. Each newsletter runs on its own subdomain (newsletter.substack.com), and they all share the same backend endpoints. The bad news: Substack sits behind Cloudflare, and scraping hundreds of publications from a single IP address will get you blocked.
This guide walks through every major data extraction approach, from simple RSS polling to full archive collection, with working Python code and real anti-detection techniques.
Understanding Substack's Architecture
Before writing a single line of code, it helps to understand how Substack is built. Every newsletter is a separate subdomain — platformer.substack.com, astralcodexten.substack.com, and so on. Some newsletters use custom domains, but they still run the same Substack backend.
The key insight is that all subdomains share the same API structure. The endpoints follow the pattern https://{publication}.substack.com/api/v1/.... For read-only operations on public content, no authentication is needed.
The main endpoints you'll use:
- /api/v1/archive — paginated list of posts
- /api/v1/posts/{slug} — full post content
- /api/v1/recommendations — newsletters this author recommends
- /api/v1/publication/all_categories — publication metadata
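Since every endpoint follows the same pattern, a tiny URL builder keeps the rest of your code tidy. This helper is our own convenience function, not part of Substack's API:

```python
def api_url(publication: str, endpoint: str) -> str:
    """Build a Substack internal API URL for a given publication subdomain."""
    return f"https://{publication}.substack.com/api/v1/{endpoint.lstrip('/')}"
```

For example, `api_url("platformer", "archive")` yields `https://platformer.substack.com/api/v1/archive`.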
Fetching Newsletter Posts
The archive endpoint returns posts in reverse chronological order:
```python
import requests

publication = "platformer"  # the subdomain
url = f"https://{publication}.substack.com/api/v1/archive"

response = requests.get(url, params={
    "sort": "new",
    "search": "",
    "offset": 0,
    "limit": 12
})
posts = response.json()

for post in posts:
    print(f"{post['title']}")
    print(f"  Date: {post['post_date']}")
    print(f"  Slug: {post['slug']}")
    print(f"  Type: {post['type']}")          # newsletter, podcast, thread
    print(f"  Audience: {post['audience']}")  # everyone, only_paid
    print(f"  Reactions: {post.get('reactions', {}).get('❤', 0)}")
    print(f"  Comments: {post.get('comment_count', 0)}")
    print()
```
The audience field tells you if a post is free (everyone) or paywalled (only_paid). You can only get full content for free posts. The type field distinguishes regular newsletter posts from podcasts and threads (Substack's Twitter-style short posts).
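Since the `audience` field decides whether a full body is retrievable, it's handy to partition the archive list before fetching content. A small helper (our own naming, based on the field values above):

```python
def split_by_audience(posts: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition archive entries into free and paywalled lists."""
    free = [p for p in posts if p.get("audience") == "everyone"]
    paid = [p for p in posts if p.get("audience") == "only_paid"]
    return free, paid
```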
Getting Full Post Content
Each post has its own endpoint that returns the complete HTML body:
```python
def get_post(publication, slug):
    url = f"https://{publication}.substack.com/api/v1/posts/{slug}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json",
        "Referer": f"https://{publication}.substack.com"
    }
    response = requests.get(url, headers=headers, timeout=15)
    if response.status_code == 404:
        return None  # post deleted or slug changed
    if response.status_code == 403:
        return None  # paywalled or access restricted
    response.raise_for_status()

    post = response.json()
    return {
        "title": post["title"],
        "subtitle": post.get("subtitle"),
        "author": post["publishedBylines"][0]["name"] if post.get("publishedBylines") else None,
        "author_id": post["publishedBylines"][0].get("id") if post.get("publishedBylines") else None,
        "date": post["post_date"],
        "body_html": post.get("body_html", ""),
        "word_count": post.get("wordcount", 0),
        "audience": post["audience"],
        "canonical_url": post["canonical_url"],
        "cover_image": post.get("cover_image"),
        "reactions": post.get("reactions", {}),
        "comment_count": post.get("comment_count", 0),
        "podcast_url": post.get("podcast_url"),  # for audio posts
        "description": post.get("description"),
    }

post = get_post("platformer", "some-post-slug")
if post:
    print(f"{post['title']} ({post['word_count']} words)")
    print(f"By {post['author']} on {post['date']}")
```
For posts with body_html, you'll often want to convert to plain text or markdown. The HTML uses standard tags — headings, paragraphs, blockquotes, and anchor tags for links. A library like html2text or markdownify handles the conversion cleanly.
Paginating the Full Archive
To get every post from a newsletter:
```python
import time
import random

def get_all_posts(publication, max_pages=None):
    """Fetch the complete post list from a Substack publication."""
    all_posts = []
    offset = 0
    limit = 25
    page = 0
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json",
    }
    while True:
        if max_pages and page >= max_pages:
            break
        response = requests.get(
            f"https://{publication}.substack.com/api/v1/archive",
            params={"sort": "new", "offset": offset, "limit": limit},
            headers=headers,
            timeout=15
        )
        if response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", 60))
            print(f"  Rate limited, waiting {retry_after}s...")
            time.sleep(retry_after)
            continue  # retry the same page
        if response.status_code != 200:
            print(f"  Error: {response.status_code}")
            break
        batch = response.json()
        if not batch:
            break  # empty page = we hit the end
        all_posts.extend(batch)
        print(f"Fetched {len(all_posts)} posts...")
        offset += limit
        page += 1
        time.sleep(random.uniform(0.8, 1.5))  # polite, randomized delay
    return all_posts

posts = get_all_posts("platformer")
print(f"Total: {len(posts)} posts")

# Separate free vs paid
free_posts = [p for p in posts if p.get("audience") == "everyone"]
paid_posts = [p for p in posts if p.get("audience") == "only_paid"]
print(f"Free: {len(free_posts)}, Paywalled: {len(paid_posts)}")
```
Newsletter Metadata and Author Profiles
The publication's homepage has metadata embedded in the page source. The Substack frontend stores all initial data in a window._preloads variable:
```python
import re
import json

def get_publication_info(publication):
    url = f"https://{publication}.substack.com"
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
    }, timeout=15)

    # Extract the embedded JSON config
    match = re.search(
        r'window\._preloads\s*=\s*JSON\.parse\("(.+?)"\)',
        response.text
    )
    if match:
        # The JSON is escaped -- decode the unicode escapes first
        raw = match.group(1).encode().decode('unicode_escape')
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # Sometimes needs a second pass of unescaping
            data = json.loads(raw.encode('latin1').decode('utf-8'))
        pub = data.get("publication", {})
        return {
            "name": pub.get("name"),
            "subdomain": pub.get("subdomain"),
            "custom_domain": pub.get("custom_domain"),
            "author_name": pub.get("author_name"),
            "hero_text": pub.get("hero_text"),
            "created_at": pub.get("created_at"),
            "base_url": pub.get("base_url"),
            "logo_url": pub.get("logo_url"),
            "cover_photo_url": pub.get("cover_photo_url"),
            "type": pub.get("type"),  # newsletter, publication
            "language": pub.get("language"),
        }
    return None

info = get_publication_info("platformer")
if info:
    print(json.dumps(info, indent=2))
```
Substack doesn't expose subscriber counts publicly. But you can estimate them from the "Leaderboard" data or by checking the subscriberCountExceeds field that sometimes appears in the page source. Some newsletters also announce their subscriber milestones in posts, which you can parse from the content.
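Parsing milestone announcements out of post text is inherently heuristic. Here's one sketch; the regex and unit normalization are our own assumptions about how authors typically phrase these announcements ("100,000 subscribers", "20k subscribers", and so on):

```python
import re

# Matches "100,000 subscribers", "20k subscribers", "1.5 million subscribers", etc.
MILESTONE_RE = re.compile(
    r'([\d,]+(?:\.\d+)?)\s*(k|thousand|million)?\s*(?:paid\s+)?subscribers',
    re.IGNORECASE,
)

def find_subscriber_mentions(text: str) -> list[int]:
    """Return approximate subscriber counts mentioned in a block of text."""
    counts = []
    for number, unit in MILESTONE_RE.findall(text):
        value = float(number.replace(",", ""))
        unit = unit.lower()
        if unit in ("k", "thousand"):
            value *= 1_000
        elif unit == "million":
            value *= 1_000_000
        counts.append(int(value))
    return counts
```

Run it over `body_html` converted to plain text; expect false positives, so treat the output as leads to verify rather than ground truth.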
Accessing Author Bio and Multiple Publications
Authors on Substack can run multiple publications. Their profile page lists all of them:
```python
def get_author_publications(author_handle):
    """Get all publications by a Substack author via their profile URL."""
    url = f"https://substack.com/profile/{author_handle}"
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json"
    }, timeout=15)

    # Profile pages embed JSON too
    match = re.search(r'window\._preloads\s*=\s*JSON\.parse\("(.+?)"\)', response.text)
    if match:
        raw = match.group(1).encode().decode('unicode_escape')
        data = json.loads(raw)
        profile = data.get("userProfile", {})
        return {
            "name": profile.get("name"),
            "bio": profile.get("bio"),
            "photo": profile.get("photo_url"),
            "publications": [
                {
                    "name": p.get("name"),
                    "subdomain": p.get("subdomain"),
                    "subscriber_count_range": p.get("subscriberCountExceeds"),
                }
                for p in profile.get("publications", [])
            ]
        }
    return None
```
RSS as an Alternative
Every Substack newsletter has an RSS feed — sometimes the simplest approach for monitoring new posts:
```python
import xml.etree.ElementTree as ET

# XML namespaces used in Substack's RSS feeds
DC_NS = "{http://purl.org/dc/elements/1.1/}"
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

def parse_rss(publication):
    url = f"https://{publication}.substack.com/feed"
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
    }, timeout=15)

    root = ET.fromstring(response.content)
    channel = root.find("channel")
    newsletter = {
        "title": channel.find("title").text,
        "description": channel.find("description").text,
        "link": channel.find("link").text,
        "posts": []
    }
    for item in channel.findall("item"):
        dc_creator = item.find(f"{DC_NS}creator")
        content_encoded = item.find(f"{CONTENT_NS}encoded")
        newsletter["posts"].append({
            "title": item.find("title").text,
            "link": item.find("link").text,
            "pub_date": item.find("pubDate").text,
            "description": item.find("description").text[:500] if item.find("description") is not None else None,
            "content_html": content_encoded.text[:1000] if content_encoded is not None else None,
            "author": dc_creator.text if dc_creator is not None else None
        })
    return newsletter

feed = parse_rss("platformer")
print(f"{feed['title']}: {len(feed['posts'])} posts in feed")
for post in feed["posts"][:3]:
    print(f"  {post['pub_date']}: {post['title']}")
```
RSS gives you the latest 20-30 posts without pagination. Good for monitoring new releases in near-real-time, not for building complete historical archives.
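For monitoring, the main bookkeeping is remembering which items you've already seen. A minimal dedup helper, assuming the post dicts produced by `parse_rss` above and using the `link` field as the identity key:

```python
def new_posts(feed_posts: list[dict], seen_links: set[str]) -> list[dict]:
    """Return feed entries not seen before, updating seen_links in place."""
    fresh = []
    for post in feed_posts:
        link = post.get("link")
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append(post)
    return fresh
```

Persist `seen_links` between polling runs (a JSON file or a table is enough) and call this on each fetch; whatever it returns is what's new since last time.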
Scraping Recommendations (Newsletter Graph)
One underused Substack endpoint is the recommendations list — newsletters that an author publicly endorses. This is useful for mapping the "graph" of the Substack ecosystem:
```python
def get_recommendations(publication):
    """Get newsletters that this publication recommends."""
    url = f"https://{publication}.substack.com/api/v1/recommendations"
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json"
    }, timeout=15)
    if response.status_code != 200:
        return []

    data = response.json()
    recs = []
    for rec in data:
        pub = rec.get("recommendedPublication", {})
        if pub:
            recs.append({
                "name": pub.get("name"),
                "subdomain": pub.get("subdomain"),
                "author": pub.get("author_name"),
                "description": pub.get("hero_text"),
            })
    return recs

recs = get_recommendations("platformer")
print(f"Platformer recommends {len(recs)} newsletters:")
for r in recs[:5]:
    print(f"  {r['name']} by {r['author']}")
```
By following recommendation chains recursively (A recommends B, B recommends C, etc.), you can discover the entire connected graph of newsletters in a niche.
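The recursive discovery described above is a plain breadth-first search over recommendation edges. A sketch, with the fetch function injected so you can rate-limit it, route it through proxies, or stub it out; `fetch_recs` is assumed to return dicts shaped like the output of `get_recommendations` above:

```python
from collections import deque

def crawl_recommendation_graph(seed: str, fetch_recs, max_pubs: int = 100):
    """Breadth-first walk over the recommendation graph from a seed subdomain.

    fetch_recs(subdomain) -> list of {"subdomain": ...} dicts.
    Returns (set of discovered subdomains, list of (source, target) edges).
    """
    seen = {seed}
    queue = deque([seed])
    edges = []
    while queue and len(seen) < max_pubs:
        pub = queue.popleft()
        for rec in fetch_recs(pub):
            target = rec.get("subdomain")
            if not target:
                continue
            edges.append((pub, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, edges
```

The `seen` set prevents infinite loops when newsletters recommend each other, and `max_pubs` caps the crawl so a dense niche doesn't balloon into thousands of requests.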
Rate Limits and Anti-Bot Measures
Substack sits behind Cloudflare. The API endpoints are more permissive than the website, but you'll still hit limits if you're aggressive.
Common blocks you'll encounter:
- `429 Too Many Requests` — Back off and retry. Respect the `Retry-After` header. Usually clears in 30-120 seconds.
- Cloudflare challenge pages — Your IP has been flagged. You'll get a 403 with a Cloudflare "checking your browser" page instead of JSON. Raw HTTP requests without proper browser headers trigger this.
- `403 Forbidden` — Often means your User-Agent is blocked or the publication has restricted access.
- Empty arrays — Sometimes Substack returns HTTP 200 with an empty array instead of an error. This usually means soft rate limiting or the publication has no public posts.
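It helps to centralize this triage in one place so every fetch path handles blocks the same way. The labels and checks below are our own heuristics based on the failure modes listed above, not official Substack semantics:

```python
def classify_response(response) -> str:
    """Rough triage of a Substack API response (heuristic, not official)."""
    if response.status_code == 429:
        return "rate_limited"
    if response.status_code == 403:
        # A Cloudflare challenge page comes back as HTML, not JSON
        ctype = response.headers.get("Content-Type", "")
        return "cloudflare_challenge" if "text/html" in ctype else "forbidden"
    if response.status_code != 200:
        return "error"
    try:
        body = response.json()
    except ValueError:
        return "cloudflare_challenge"  # 200 with a non-JSON body
    return "empty" if body == [] else "ok"
```

Map each label to a policy: sleep on `rate_limited`, rotate IPs on `cloudflare_challenge`, skip the publication on `forbidden`, and retry with backoff on `empty`.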
For scraping a single newsletter, the simple requests approach with polite delays works fine. For scraping across many publications — say, building a dataset of the top 500 Substack newsletters — you need IP rotation. Sending hundreds of requests from a single IP to subdomains that all resolve to the same Cloudflare-protected backend will get you blocked within minutes.
A residential proxy service like ThorData solves this by routing each request through a different household IP. Cloudflare sees distributed traffic from real ISP ranges instead of concentrated requests from a single address.
```python
import requests

def make_proxied_request(url, params=None, proxy_user="USER", proxy_pass="PASS"):
    """Make a request via ThorData rotating residential proxies."""
    proxies = {
        "http": f"http://{proxy_user}:{proxy_pass}@proxy.thordata.com:9000",
        "https": f"http://{proxy_user}:{proxy_pass}@proxy.thordata.com:9000"
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = requests.get(
        url,
        params=params,
        headers=headers,
        proxies=proxies,
        timeout=20
    )
    return response

# Use it like the regular requests calls above
response = make_proxied_request(
    "https://platformer.substack.com/api/v1/archive",
    params={"sort": "new", "limit": 25, "offset": 0}
)
posts = response.json()
```
Each request through ThorData's pool rotates to a fresh IP, making your scraping pattern look like distributed human traffic rather than a bot.
Scraping Multiple Newsletters at Scale
For building a dataset across many publications:
```python
import json
import time
import random

def scrape_newsletters(publications, posts_per_pub=50):
    """Scrape multiple Substack publications."""
    results = {}
    for i, pub in enumerate(publications):
        print(f"[{i+1}/{len(publications)}] Scraping {pub}...")

        # Get metadata first
        info = get_publication_info(pub)

        posts = []
        offset = 0
        failures = 0
        while len(posts) < posts_per_pub and failures < 3:
            try:
                response = requests.get(
                    f"https://{pub}.substack.com/api/v1/archive",
                    params={"sort": "new", "offset": offset, "limit": 25},
                    headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"},
                    timeout=15
                )
                if response.status_code == 429:
                    wait = int(response.headers.get("Retry-After", 60))
                    print(f"  Rate limited, waiting {wait}s...")
                    time.sleep(wait)
                    continue
                if response.status_code == 403:
                    print(f"  Blocked -- may need proxy rotation")
                    failures += 1
                    time.sleep(random.uniform(5, 15))
                    continue
                if response.status_code != 200:
                    print(f"  HTTP {response.status_code}")
                    failures += 1
                    continue

                batch = response.json()
                if not batch:
                    break  # no more posts
                posts.extend(batch)
                offset += 25
                failures = 0  # reset on success
                time.sleep(random.uniform(0.8, 2.0))
            except requests.exceptions.Timeout:
                print(f"  Timeout on {pub}")
                failures += 1
                time.sleep(5)

        results[pub] = {
            "metadata": info,
            "post_count": len(posts),
            "posts": [
                {
                    "title": p["title"],
                    "date": p["post_date"],
                    "slug": p["slug"],
                    "type": p.get("type"),
                    "audience": p.get("audience"),
                    "comment_count": p.get("comment_count", 0),
                    "reactions": p.get("reactions", {}),
                }
                for p in posts
            ]
        }
        print(f"  Got {len(posts)} posts")
        time.sleep(random.uniform(2, 5))
    return results

pubs = ["platformer", "stratechery", "pragmaticengineer", "lennysnewsletter", "thebrowser"]
data = scrape_newsletters(pubs, posts_per_pub=50)

with open("substack_data.json", "w") as f:
    json.dump(data, f, indent=2, default=str)
print(f"Saved data for {len(data)} publications")
```
Storing and Querying the Data
Once you have the raw data, SQLite is the most practical local storage format:
```python
import sqlite3
from datetime import datetime

def init_db(db_path="substack.db"):
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS publications (
            subdomain TEXT PRIMARY KEY,
            name TEXT,
            author_name TEXT,
            hero_text TEXT,
            custom_domain TEXT,
            created_at TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS posts (
            id TEXT,
            publication TEXT,
            title TEXT,
            slug TEXT,
            post_date TEXT,
            audience TEXT,
            type TEXT,
            word_count INTEGER,
            comment_count INTEGER,
            reaction_count INTEGER,
            canonical_url TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (id, publication)
        );

        CREATE INDEX IF NOT EXISTS idx_posts_publication ON posts(publication);
        CREATE INDEX IF NOT EXISTS idx_posts_date ON posts(post_date);
        CREATE INDEX IF NOT EXISTS idx_posts_audience ON posts(audience);
    """)
    conn.commit()
    return conn

def save_posts(conn, publication, posts):
    for p in posts:
        conn.execute(
            "INSERT OR REPLACE INTO posts VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
            (
                p.get("id", p["slug"]),
                publication,
                p["title"],
                p["slug"],
                p["post_date"],
                p.get("audience"),
                p.get("type"),
                p.get("wordcount", 0),
                p.get("comment_count", 0),
                sum(p.get("reactions", {}).values()),
                p.get("canonical_url"),
                datetime.utcnow().isoformat()
            )
        )
    conn.commit()

conn = init_db()
# Now you can query:
#   SELECT publication, COUNT(*) as post_count FROM posts GROUP BY publication
#   SELECT * FROM posts WHERE audience = 'only_paid' ORDER BY comment_count DESC LIMIT 20
```
Handling Cloudflare Challenges
For heavier scraping jobs where even proxied requests get challenged, you can use curl_cffi which mimics real Chrome TLS fingerprints:
```python
# pip install curl_cffi
from curl_cffi import requests as cffi_requests

def fetch_with_tls_fingerprint(url, params=None):
    """Fetch using a Chrome TLS fingerprint to pass Cloudflare checks."""
    response = cffi_requests.get(
        url,
        params=params,
        impersonate="chrome124",  # mimic Chrome 124
        timeout=20
    )
    return response

# Drop-in replacement for the standard requests calls above
response = fetch_with_tls_fingerprint(
    "https://platformer.substack.com/api/v1/archive",
    params={"sort": "new", "limit": 25}
)
posts = response.json()
```
curl_cffi uses libcurl under the hood and replicates the exact TLS handshake, cipher suites, and HTTP/2 fingerprint of a real browser. This passes the majority of Cloudflare bot detection without needing a headless browser.
Parsing Post HTML to Plain Text
When you fetch full post content, you get HTML. Converting to clean text is usually necessary for NLP tasks:
```python
from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    """Simple HTML-to-text converter without dependencies."""

    def __init__(self):
        super().__init__()
        self.result = []
        self.skip_tags = {'script', 'style', 'head'}
        self.current_skip = None

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.current_skip = tag
        if tag in ('p', 'br', 'h1', 'h2', 'h3', 'h4', 'li'):
            self.result.append('\n')

    def handle_endtag(self, tag):
        if tag == self.current_skip:
            self.current_skip = None

    def handle_data(self, data):
        if not self.current_skip:
            self.result.append(data)

    def get_text(self):
        return ''.join(self.result).strip()

def html_to_text(html):
    extractor = HTMLTextExtractor()
    extractor.feed(html)
    return extractor.get_text()

# Or use the popular html2text library:
# pip install html2text
# import html2text
# h = html2text.HTML2Text()
# h.ignore_links = True
# plain_text = h.handle(post["body_html"])
```
What You Can't Easily Get
- Subscriber counts — Substack keeps these private. You can sometimes find ranges in the `subscriberCountExceeds` field, or in press coverage and the author's own posts.
- Paid post content — Requires an active subscription. The API returns truncated content with `"audience": "only_paid"`.
- Revenue data — Only visible to the newsletter owner.
- Email open rates — Internal metrics, not exposed anywhere.
- Subscriber growth over time — There's no historical subscriber data in the API. You'd need to snapshot periodically and build the timeline yourself.
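Building that timeline yourself is mostly a matter of writing down whatever metrics you can observe (post counts, reaction totals, any subscriber ranges you find) on a schedule. A minimal sketch using an append-only SQLite table of our own design:

```python
import sqlite3
from datetime import datetime, timezone

def record_snapshot(conn, publication: str, metric: str, value: int):
    """Append a timestamped metric snapshot; repeated runs build a timeline."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS snapshots (
            publication TEXT,
            metric TEXT,
            value INTEGER,
            taken_at TEXT
        )
    """)
    conn.execute(
        "INSERT INTO snapshots VALUES (?, ?, ?, ?)",
        (publication, metric, value, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```

Run it from a daily cron job; after a few weeks, `SELECT taken_at, value FROM snapshots WHERE metric = 'post_count' ORDER BY taken_at` gives you a growth curve no API exposes.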
Practical Use Cases
Newsletter research tool — Build a dashboard that tracks engagement metrics (comments, reactions) across dozens of newsletters in a niche. Identify which topics drive the most engagement by analyzing post titles and reaction counts.
Content monitoring — Set up programmatic monitoring for specific newsletters and get alerts when they post about keywords you're tracking. RSS polling plus keyword matching is the simplest implementation.
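The keyword-matching half of that pipeline can be a few lines. This sketch assumes the post dicts produced by the RSS or archive code earlier (with `title` and `description` fields) and does simple case-insensitive substring matching:

```python
def match_keywords(posts: list[dict], keywords: list[str]) -> list[dict]:
    """Return posts whose title or description mentions any tracked keyword."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for post in posts:
        haystack = f"{post.get('title', '')} {post.get('description', '')}".lower()
        if any(k in haystack for k in lowered):
            hits.append(post)
    return hits
```

Substring matching is crude (it will match "rust" inside "trust"); switch to word-boundary regexes if false positives become a problem.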
Competitive analysis — If you're running a newsletter, scrape similar publications to understand their posting cadence, content mix (free vs paid), and engagement patterns.
Building a newsletter directory — Combine metadata, RSS data, and post analysis to create a searchable directory of newsletters by topic.
Conclusion
Substack's internal API is surprisingly stable and generous for a platform without official documentation — endpoints haven't changed significantly since 2023. The main challenges are Cloudflare blocking on high-volume scraping and the absence of subscriber count data. With polite delays for single-newsletter scraping, or rotating residential proxies via ThorData for multi-publication datasets, you can build robust Substack data pipelines that run reliably in production.
Start with the archive endpoint, add RSS monitoring for real-time updates, and extend to full content extraction as your use case demands. The data is there — you just need to ask nicely (and not too quickly).