How to Scrape Substack Newsletters with Python (2026 Guide)
Substack has become one of the dominant platforms for independent writers, with tens of thousands of newsletters ranging from niche hobbyist projects to publications pulling millions in annual revenue. Whether you're doing competitive intelligence, building a content aggregator, tracking newsletter growth trends, or just archiving a publication you love, scraping Substack at scale is a practical skill to have.
The good news: Substack doesn't have a public API, but it has something almost as good — a consistent internal API that every Substack site uses. Each newsletter runs on its own subdomain (newsletter.substack.com), and they all share the same backend endpoints. The bad news: Substack sits behind Cloudflare, and scraping hundreds of publications from a single IP address will get you blocked.
This guide walks through every major data extraction approach, from simple RSS polling to full archive collection, with working Python code and real anti-detection techniques.
Understanding Substack's Architecture
Before writing a single line of code, it helps to understand how Substack is built. Every newsletter is a separate subdomain — platformer.substack.com, astralcodexten.substack.com, and so on. Some newsletters use custom domains, but they still run the same Substack backend.
The key insight is that all subdomains share the same API structure. The endpoints follow the pattern https://{publication}.substack.com/api/v1/.... For read-only operations on public content, no authentication is needed.
The main endpoints you'll use:
- /api/v1/archive — paginated list of posts
- /api/v1/posts/{slug} — full post content
- /api/v1/recommendations — newsletters this author recommends
- /api/v1/publication/all_categories — publication metadata
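Since every endpoint follows the same pattern, a tiny URL builder keeps the rest of your code tidy. This helper is our own convenience function, not part of Substack's API:

```python
def api_url(publication: str, endpoint: str) -> str:
    """Build a Substack internal API URL for a given publication subdomain."""
    return f"https://{publication}.substack.com/api/v1/{endpoint.lstrip('/')}"
```

For example, `api_url("platformer", "archive")` yields `https://platformer.substack.com/api/v1/archive`.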
Fetching Newsletter Posts
The archive endpoint returns posts in reverse chronological order:
```python
import requests

publication = "platformer"  # the subdomain
url = f"https://{publication}.substack.com/api/v1/archive"

response = requests.get(url, params={
    "sort": "new",
    "search": "",
    "offset": 0,
    "limit": 12
})
posts = response.json()

for post in posts:
    print(f"{post['title']}")
    print(f"  Date: {post['post_date']}")
    print(f"  Slug: {post['slug']}")
    print(f"  Type: {post['type']}")          # newsletter, podcast, thread
    print(f"  Audience: {post['audience']}")  # everyone, only_paid
    print(f"  Reactions: {post.get('reactions', {}).get('❤', 0)}")
    print(f"  Comments: {post.get('comment_count', 0)}")
    print()
```
The audience field tells you if a post is free (everyone) or paywalled (only_paid). You can only get full content for free posts. The type field distinguishes regular newsletter posts from podcasts and threads (Substack's Twitter-style short posts).
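Since the `audience` field decides whether a full body is retrievable, it's handy to partition the archive list before fetching content. A small helper (our own naming, based on the field values above):

```python
def split_by_audience(posts: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition archive entries into free and paywalled lists."""
    free = [p for p in posts if p.get("audience") == "everyone"]
    paid = [p for p in posts if p.get("audience") == "only_paid"]
    return free, paid
```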
Getting Full Post Content
Each post has its own endpoint that returns the complete HTML body:
```python
def get_post(publication, slug):
    url = f"https://{publication}.substack.com/api/v1/posts/{slug}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json",
        "Referer": f"https://{publication}.substack.com"
    }
    response = requests.get(url, headers=headers, timeout=15)
    if response.status_code == 404:
        return None  # post deleted or slug changed
    if response.status_code == 403:
        return None  # paywalled or access restricted
    response.raise_for_status()

    post = response.json()
    return {
        "title": post["title"],
        "subtitle": post.get("subtitle"),
        "author": post["publishedBylines"][0]["name"] if post.get("publishedBylines") else None,
        "author_id": post["publishedBylines"][0].get("id") if post.get("publishedBylines") else None,
        "date": post["post_date"],
        "body_html": post.get("body_html", ""),
        "word_count": post.get("wordcount", 0),
        "audience": post["audience"],
        "canonical_url": post["canonical_url"],
        "cover_image": post.get("cover_image"),
        "reactions": post.get("reactions", {}),
        "comment_count": post.get("comment_count", 0),
        "podcast_url": post.get("podcast_url"),  # for audio posts
        "description": post.get("description"),
    }

post = get_post("platformer", "some-post-slug")
if post:
    print(f"{post['title']} ({post['word_count']} words)")
    print(f"By {post['author']} on {post['date']}")
```
For posts with body_html, you'll often want to convert to plain text or markdown. The HTML uses standard tags — headings, paragraphs, blockquotes, and anchor tags for links. A library like html2text or markdownify handles the conversion cleanly.
Paginating the Full Archive
To get every post from a newsletter:
```python
import time
import random

def get_all_posts(publication, max_pages=None):
    """Fetch the complete post list from a Substack publication."""
    all_posts = []
    offset = 0
    limit = 25
    page = 0
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json",
    }
    while True:
        if max_pages and page >= max_pages:
            break
        response = requests.get(
            f"https://{publication}.substack.com/api/v1/archive",
            params={"sort": "new", "offset": offset, "limit": limit},
            headers=headers,
            timeout=15
        )
        if response.status_code == 429:
            retry_after = int(response.headers.get("Retry-After", 60))
            print(f"  Rate limited, waiting {retry_after}s...")
            time.sleep(retry_after)
            continue  # retry the same page
        if response.status_code != 200:
            print(f"  Error: {response.status_code}")
            break
        batch = response.json()
        if not batch:
            break  # empty page = we hit the end
        all_posts.extend(batch)
        print(f"Fetched {len(all_posts)} posts...")
        offset += limit
        page += 1
        time.sleep(random.uniform(0.8, 1.5))  # polite, randomized delay
    return all_posts

posts = get_all_posts("platformer")
print(f"Total: {len(posts)} posts")

# Separate free vs paid
free_posts = [p for p in posts if p.get("audience") == "everyone"]
paid_posts = [p for p in posts if p.get("audience") == "only_paid"]
print(f"Free: {len(free_posts)}, Paywalled: {len(paid_posts)}")
```
Newsletter Metadata and Author Profiles
The publication's homepage has metadata embedded in the page source. The Substack frontend stores all initial data in a window._preloads variable:
```python
import re
import json

def get_publication_info(publication):
    url = f"https://{publication}.substack.com"
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
    }, timeout=15)

    # Extract the embedded JSON config
    match = re.search(
        r'window\._preloads\s*=\s*JSON\.parse\("(.+?)"\)',
        response.text
    )
    if match:
        # The JSON is escaped -- decode the unicode escapes first
        raw = match.group(1).encode().decode('unicode_escape')
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            # Sometimes needs a second pass of unescaping
            data = json.loads(raw.encode('latin1').decode('utf-8'))
        pub = data.get("publication", {})
        return {
            "name": pub.get("name"),
            "subdomain": pub.get("subdomain"),
            "custom_domain": pub.get("custom_domain"),
            "author_name": pub.get("author_name"),
            "hero_text": pub.get("hero_text"),
            "created_at": pub.get("created_at"),
            "base_url": pub.get("base_url"),
            "logo_url": pub.get("logo_url"),
            "cover_photo_url": pub.get("cover_photo_url"),
            "type": pub.get("type"),  # newsletter, publication
            "language": pub.get("language"),
        }
    return None

info = get_publication_info("platformer")
if info:
    print(json.dumps(info, indent=2))
```
Substack doesn't expose subscriber counts publicly. But you can estimate them from the "Leaderboard" data or by checking the subscriberCountExceeds field that sometimes appears in the page source. Some newsletters also announce their subscriber milestones in posts, which you can parse from the content.
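Parsing milestone announcements out of post text is inherently heuristic. Here's one sketch; the regex and unit normalization are our own assumptions about how authors typically phrase these announcements ("100,000 subscribers", "20k subscribers", and so on):

```python
import re

# Matches "100,000 subscribers", "20k subscribers", "1.5 million subscribers", etc.
MILESTONE_RE = re.compile(
    r'([\d,]+(?:\.\d+)?)\s*(k|thousand|million)?\s*(?:paid\s+)?subscribers',
    re.IGNORECASE,
)

def find_subscriber_mentions(text: str) -> list[int]:
    """Return approximate subscriber counts mentioned in a block of text."""
    counts = []
    for number, unit in MILESTONE_RE.findall(text):
        value = float(number.replace(",", ""))
        unit = unit.lower()
        if unit in ("k", "thousand"):
            value *= 1_000
        elif unit == "million":
            value *= 1_000_000
        counts.append(int(value))
    return counts
```

Run it over `body_html` converted to plain text; expect false positives, so treat the output as leads to verify rather than ground truth.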
Accessing Author Bio and Multiple Publications
Authors on Substack can run multiple publications. Their profile page lists all of them:
```python
def get_author_publications(author_handle):
    """Get all publications by a Substack author via their profile URL."""
    url = f"https://substack.com/profile/{author_handle}"
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json"
    }, timeout=15)

    # Profile pages embed JSON too
    match = re.search(r'window\._preloads\s*=\s*JSON\.parse\("(.+?)"\)', response.text)
    if match:
        raw = match.group(1).encode().decode('unicode_escape')
        data = json.loads(raw)
        profile = data.get("userProfile", {})
        return {
            "name": profile.get("name"),
            "bio": profile.get("bio"),
            "photo": profile.get("photo_url"),
            "publications": [
                {
                    "name": p.get("name"),
                    "subdomain": p.get("subdomain"),
                    "subscriber_count_range": p.get("subscriberCountExceeds"),
                }
                for p in profile.get("publications", [])
            ]
        }
    return None
```
RSS as an Alternative
Every Substack newsletter has an RSS feed — sometimes the simplest approach for monitoring new posts:
```python
import xml.etree.ElementTree as ET

# XML namespaces used in Substack's RSS feeds
DC_NS = "{http://purl.org/dc/elements/1.1/}"
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

def parse_rss(publication):
    url = f"https://{publication}.substack.com/feed"
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
    }, timeout=15)

    root = ET.fromstring(response.content)
    channel = root.find("channel")
    newsletter = {
        "title": channel.find("title").text,
        "description": channel.find("description").text,
        "link": channel.find("link").text,
        "posts": []
    }
    for item in channel.findall("item"):
        dc_creator = item.find(f"{DC_NS}creator")
        content_encoded = item.find(f"{CONTENT_NS}encoded")
        newsletter["posts"].append({
            "title": item.find("title").text,
            "link": item.find("link").text,
            "pub_date": item.find("pubDate").text,
            "description": item.find("description").text[:500] if item.find("description") is not None else None,
            "content_html": content_encoded.text[:1000] if content_encoded is not None else None,
            "author": dc_creator.text if dc_creator is not None else None
        })
    return newsletter

feed = parse_rss("platformer")
print(f"{feed['title']}: {len(feed['posts'])} posts in feed")
for post in feed["posts"][:3]:
    print(f"  {post['pub_date']}: {post['title']}")
```
RSS gives you the latest 20-30 posts without pagination. Good for monitoring new releases in near-real-time, not for building complete historical archives.
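For monitoring, the main bookkeeping is remembering which items you've already seen. A minimal dedup helper, assuming the post dicts produced by `parse_rss` above and using the `link` field as the identity key:

```python
def new_posts(feed_posts: list[dict], seen_links: set[str]) -> list[dict]:
    """Return feed entries not seen before, updating seen_links in place."""
    fresh = []
    for post in feed_posts:
        link = post.get("link")
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append(post)
    return fresh
```

Persist `seen_links` between polling runs (a JSON file or a table is enough) and call this on each fetch; whatever it returns is what's new since last time.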
Scraping Recommendations (Newsletter Graph)
One underused Substack endpoint is the recommendations list — newsletters that an author publicly endorses. This is useful for mapping the "graph" of the Substack ecosystem:
```python
def get_recommendations(publication):
    """Get newsletters that this publication recommends."""
    url = f"https://{publication}.substack.com/api/v1/recommendations"
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json"
    }, timeout=15)
    if response.status_code != 200:
        return []

    data = response.json()
    recs = []
    for rec in data:
        pub = rec.get("recommendedPublication", {})
        if pub:
            recs.append({
                "name": pub.get("name"),
                "subdomain": pub.get("subdomain"),
                "author": pub.get("author_name"),
                "description": pub.get("hero_text"),
            })
    return recs

recs = get_recommendations("platformer")
print(f"Platformer recommends {len(recs)} newsletters:")
for r in recs[:5]:
    print(f"  {r['name']} by {r['author']}")
```
By following recommendation chains recursively (A recommends B, B recommends C, etc.), you can discover the entire connected graph of newsletters in a niche.
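The recursive discovery described above is a plain breadth-first search over recommendation edges. A sketch, with the fetch function injected so you can rate-limit it, route it through proxies, or stub it out; `fetch_recs` is assumed to return dicts shaped like the output of `get_recommendations` above:

```python
from collections import deque

def crawl_recommendation_graph(seed: str, fetch_recs, max_pubs: int = 100):
    """Breadth-first walk over the recommendation graph from a seed subdomain.

    fetch_recs(subdomain) -> list of {"subdomain": ...} dicts.
    Returns (set of discovered subdomains, list of (source, target) edges).
    """
    seen = {seed}
    queue = deque([seed])
    edges = []
    while queue and len(seen) < max_pubs:
        pub = queue.popleft()
        for rec in fetch_recs(pub):
            target = rec.get("subdomain")
            if not target:
                continue
            edges.append((pub, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, edges
```

The `seen` set prevents infinite loops when newsletters recommend each other, and `max_pubs` caps the crawl so a dense niche doesn't balloon into thousands of requests.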
Rate Limits and Anti-Bot Measures
Substack sits behind Cloudflare. The API endpoints are more permissive than the website, but you'll still hit limits if you're aggressive.
Common blocks you'll encounter:
- `429 Too Many Requests` — Back off and retry. Respect the `Retry-After` header. Usually clears in 30-120 seconds.
- Cloudflare challenge pages — Your IP has been flagged. You'll get a 403 with a Cloudflare "checking your browser" page instead of JSON. Raw HTTP requests without proper browser headers trigger this.
- `403 Forbidden` — Often means your User-Agent is blocked or the publication has restricted access.
- Empty arrays — Sometimes Substack returns HTTP 200 with an empty array instead of an error. This usually means soft rate limiting or the publication has no public posts.
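It helps to centralize this triage in one place so every fetch path handles blocks the same way. The labels and checks below are our own heuristics based on the failure modes listed above, not official Substack semantics:

```python
def classify_response(response) -> str:
    """Rough triage of a Substack API response (heuristic, not official)."""
    if response.status_code == 429:
        return "rate_limited"
    if response.status_code == 403:
        # A Cloudflare challenge page comes back as HTML, not JSON
        ctype = response.headers.get("Content-Type", "")
        return "cloudflare_challenge" if "text/html" in ctype else "forbidden"
    if response.status_code != 200:
        return "error"
    try:
        body = response.json()
    except ValueError:
        return "cloudflare_challenge"  # 200 with a non-JSON body
    return "empty" if body == [] else "ok"
```

Map each label to a policy: sleep on `rate_limited`, rotate IPs on `cloudflare_challenge`, skip the publication on `forbidden`, and retry with backoff on `empty`.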
For scraping a single newsletter, the simple requests approach with polite delays works fine. For scraping across many publications — say, building a dataset of the top 500 Substack newsletters — you need IP rotation. Sending hundreds of requests from a single IP to subdomains that all resolve to the same Cloudflare-protected backend will get you blocked within minutes.
A residential proxy service like ThorData solves this by routing each request through a different household IP. Cloudflare sees distributed traffic from real ISP ranges instead of concentrated requests from a single address.
```python
import requests

def make_proxied_request(url, params=None, proxy_user="USER", proxy_pass="PASS"):
    """Make a request via ThorData rotating residential proxies."""
    proxies = {
        "http": f"http://{proxy_user}:{proxy_pass}@proxy.thordata.com:9000",
        "https": f"http://{proxy_user}:{proxy_pass}@proxy.thordata.com:9000"
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.9",
    }
    response = requests.get(
        url,
        params=params,
        headers=headers,
        proxies=proxies,
        timeout=20
    )
    return response

# Use it like the regular requests calls above
response = make_proxied_request(
    "https://platformer.substack.com/api/v1/archive",
    params={"sort": "new", "limit": 25, "offset": 0}
)
posts = response.json()
```
Each request through ThorData's pool rotates to a fresh IP, making your scraping pattern look like distributed human traffic rather than a bot.
Scraping Multiple Newsletters at Scale
For building a dataset across many publications:
```python
import json
import time
import random

def scrape_newsletters(publications, posts_per_pub=50):
    """Scrape multiple Substack publications."""
    results = {}
    for i, pub in enumerate(publications):
        print(f"[{i+1}/{len(publications)}] Scraping {pub}...")

        # Get metadata first
        info = get_publication_info(pub)

        posts = []
        offset = 0
        failures = 0
        while len(posts) < posts_per_pub and failures < 3:
            try:
                response = requests.get(
                    f"https://{pub}.substack.com/api/v1/archive",
                    params={"sort": "new", "offset": offset, "limit": 25},
                    headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"},
                    timeout=15
                )
                if response.status_code == 429:
                    wait = int(response.headers.get("Retry-After", 60))
                    print(f"  Rate limited, waiting {wait}s...")
                    time.sleep(wait)
                    continue
                if response.status_code == 403:
                    print(f"  Blocked -- may need proxy rotation")
                    failures += 1
                    time.sleep(random.uniform(5, 15))
                    continue
                if response.status_code != 200:
                    print(f"  HTTP {response.status_code}")
                    failures += 1
                    continue

                batch = response.json()
                if not batch:
                    break  # no more posts
                posts.extend(batch)
                offset += 25
                failures = 0  # reset on success
                time.sleep(random.uniform(0.8, 2.0))
            except requests.exceptions.Timeout:
                print(f"  Timeout on {pub}")
                failures += 1
                time.sleep(5)

        results[pub] = {
            "metadata": info,
            "post_count": len(posts),
            "posts": [
                {
                    "title": p["title"],
                    "date": p["post_date"],
                    "slug": p["slug"],
                    "type": p.get("type"),
                    "audience": p.get("audience"),
                    "comment_count": p.get("comment_count", 0),
                    "reactions": p.get("reactions", {}),
                }
                for p in posts
            ]
        }
        print(f"  Got {len(posts)} posts")
        time.sleep(random.uniform(2, 5))
    return results

pubs = ["platformer", "stratechery", "pragmaticengineer", "lennysnewsletter", "thebrowser"]
data = scrape_newsletters(pubs, posts_per_pub=50)

with open("substack_data.json", "w") as f:
    json.dump(data, f, indent=2, default=str)
print(f"Saved data for {len(data)} publications")
```
Storing and Querying the Data
Once you have the raw data, SQLite is the most practical local storage format:
```python
import sqlite3
from datetime import datetime

def init_db(db_path="substack.db"):
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS publications (
            subdomain TEXT PRIMARY KEY,
            name TEXT,
            author_name TEXT,
            hero_text TEXT,
            custom_domain TEXT,
            created_at TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS posts (
            id TEXT,
            publication TEXT,
            title TEXT,
            slug TEXT,
            post_date TEXT,
            audience TEXT,
            type TEXT,
            word_count INTEGER,
            comment_count INTEGER,
            reaction_count INTEGER,
            canonical_url TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (id, publication)
        );

        CREATE INDEX IF NOT EXISTS idx_posts_publication ON posts(publication);
        CREATE INDEX IF NOT EXISTS idx_posts_date ON posts(post_date);
        CREATE INDEX IF NOT EXISTS idx_posts_audience ON posts(audience);
    """)
    conn.commit()
    return conn

def save_posts(conn, publication, posts):
    for p in posts:
        conn.execute(
            "INSERT OR REPLACE INTO posts VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
            (
                p.get("id", p["slug"]),
                publication,
                p["title"],
                p["slug"],
                p["post_date"],
                p.get("audience"),
                p.get("type"),
                p.get("wordcount", 0),
                p.get("comment_count", 0),
                sum(p.get("reactions", {}).values()),
                p.get("canonical_url"),
                datetime.utcnow().isoformat()
            )
        )
    conn.commit()

conn = init_db()
# Now you can query:
#   SELECT publication, COUNT(*) as post_count FROM posts GROUP BY publication
#   SELECT * FROM posts WHERE audience = 'only_paid' ORDER BY comment_count DESC LIMIT 20
```
Handling Cloudflare Challenges
For heavier scraping jobs where even proxied requests get challenged, you can use curl_cffi which mimics real Chrome TLS fingerprints:
```python
# pip install curl_cffi
from curl_cffi import requests as cffi_requests

def fetch_with_tls_fingerprint(url, params=None):
    """Fetch using a Chrome TLS fingerprint to pass Cloudflare checks."""
    response = cffi_requests.get(
        url,
        params=params,
        impersonate="chrome124",  # mimic Chrome 124
        timeout=20
    )
    return response

# Drop-in replacement for the standard requests calls above
response = fetch_with_tls_fingerprint(
    "https://platformer.substack.com/api/v1/archive",
    params={"sort": "new", "limit": 25}
)
posts = response.json()
```
curl_cffi uses libcurl under the hood and replicates the exact TLS handshake, cipher suites, and HTTP/2 fingerprint of a real browser. This passes the majority of Cloudflare bot detection without needing a headless browser.
Parsing Post HTML to Plain Text
When you fetch full post content, you get HTML. Converting to clean text is usually necessary for NLP tasks:
```python
from html.parser import HTMLParser

class HTMLTextExtractor(HTMLParser):
    """Simple HTML-to-text converter without dependencies."""

    def __init__(self):
        super().__init__()
        self.result = []
        self.skip_tags = {'script', 'style', 'head'}
        self.current_skip = None

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.current_skip = tag
        if tag in ('p', 'br', 'h1', 'h2', 'h3', 'h4', 'li'):
            self.result.append('\n')

    def handle_endtag(self, tag):
        if tag == self.current_skip:
            self.current_skip = None

    def handle_data(self, data):
        if not self.current_skip:
            self.result.append(data)

    def get_text(self):
        return ''.join(self.result).strip()

def html_to_text(html):
    extractor = HTMLTextExtractor()
    extractor.feed(html)
    return extractor.get_text()

# Or use the popular html2text library:
# pip install html2text
# import html2text
# h = html2text.HTML2Text()
# h.ignore_links = True
# plain_text = h.handle(post["body_html"])
```
What You Can't Easily Get
- Subscriber counts — Substack keeps these private. You can sometimes find ranges in the `subscriberCountExceeds` field, or in press coverage and the author's own posts.
- Paid post content — Requires an active subscription. The API returns truncated content with `"audience": "only_paid"`.
- Revenue data — Only visible to the newsletter owner.
- Email open rates — Internal metrics, not exposed anywhere.
- Subscriber growth over time — There's no historical subscriber data in the API. You'd need to snapshot periodically and build the timeline yourself.
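Building that timeline yourself is mostly a matter of writing down whatever metrics you can observe (post counts, reaction totals, any subscriber ranges you find) on a schedule. A minimal sketch using an append-only SQLite table of our own design:

```python
import sqlite3
from datetime import datetime, timezone

def record_snapshot(conn, publication: str, metric: str, value: int):
    """Append a timestamped metric snapshot; repeated runs build a timeline."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS snapshots (
            publication TEXT,
            metric TEXT,
            value INTEGER,
            taken_at TEXT
        )
    """)
    conn.execute(
        "INSERT INTO snapshots VALUES (?, ?, ?, ?)",
        (publication, metric, value, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```

Run it from a daily cron job; after a few weeks, `SELECT taken_at, value FROM snapshots WHERE metric = 'post_count' ORDER BY taken_at` gives you a growth curve no API exposes.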
Practical Use Cases
Newsletter research tool — Build a dashboard that tracks engagement metrics (comments, reactions) across dozens of newsletters in a niche. Identify which topics drive the most engagement by analyzing post titles and reaction counts.
Content monitoring — Set up programmatic monitoring for specific newsletters and get alerts when they post about keywords you're tracking. RSS polling plus keyword matching is the simplest implementation.
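The keyword-matching half of that pipeline can be a few lines. This sketch assumes the post dicts produced by the RSS or archive code earlier (with `title` and `description` fields) and does simple case-insensitive substring matching:

```python
def match_keywords(posts: list[dict], keywords: list[str]) -> list[dict]:
    """Return posts whose title or description mentions any tracked keyword."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for post in posts:
        haystack = f"{post.get('title', '')} {post.get('description', '')}".lower()
        if any(k in haystack for k in lowered):
            hits.append(post)
    return hits
```

Substring matching is crude (it will match "rust" inside "trust"); switch to word-boundary regexes if false positives become a problem.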
Competitive analysis — If you're running a newsletter, scrape similar publications to understand their posting cadence, content mix (free vs paid), and engagement patterns.
Building a newsletter directory — Combine metadata, RSS data, and post analysis to create a searchable directory of newsletters by topic.
Conclusion
Substack's internal API is surprisingly stable and generous for a platform without official documentation — endpoints haven't changed significantly since 2023. The main challenges are Cloudflare blocking on high-volume scraping and the absence of subscriber count data. With polite delays for single-newsletter scraping, or rotating residential proxies via ThorData for multi-publication datasets, you can build robust Substack data pipelines that run reliably in production.
Start with the archive endpoint, add RSS monitoring for real-time updates, and extend to full content extraction as your use case demands. The data is there — you just need to ask nicely (and not too quickly).