← Back to blog

Scraping Twitter/X Data in 2026: API Tiers, Nitter, and What Still Works

Scraping Twitter/X Data in 2026: API Tiers, Nitter, and What Still Works

Twitter - or X, depending on how petty you feel about branding - has the most hostile API pricing in the industry. What used to be a free, developer-friendly platform now charges $100/month for basic read access. And the free tier? Basically useless for data work.

Let's go through what actually works in 2026 if you need Twitter data.

The Official API: Three Tiers, One Problem

Twitter's API v2 has three pricing tiers, and none of them are generous.

Free tier ($0/month): You get 500 tweet reads per month and 1 app. That's it. You can post tweets (1,500/month) and delete them, but for data collection? 500 reads is gone in a single API call if you're pulling timelines. The free tier exists so Twitter can say they have a free API. It's not for actual work.

Basic tier ($100/month): 10,000 tweet reads per month. This is the minimum viable option if you need data for a small project. You get the search endpoint, user lookup, and timeline fetching. Still tight - 10K reads means you need to be careful about what you query.

Pro tier ($5,000/month): 1 million tweet reads, full-archive search, and the streaming endpoints. If you're doing serious research or running a product that depends on Twitter data, this is where you end up. The price jump from $100 to $5,000 is absurd, but that's where we are.

What Data Is Available

Through the official API (depending on tier):

What's NOT in the official API: - Like counts (removed from public API surface) - Real-time streaming (Pro only) - Historical data beyond 7 days (Pro only) - DM data (requires user OAuth)

OAuth 2.0 Setup

Twitter v2 uses OAuth 2.0 with PKCE for user context, or simple Bearer tokens for app-only access. For read operations — which is what scraping is about — app-only auth is enough.

Go to developer.x.com, create a project and app, and grab your Bearer Token from the app settings.

Basic Tweet Search with Python

import requests
import time

# App-only (Bearer token) credentials for Twitter API v2 read endpoints.
# Replace with the token from your app's settings on developer.x.com.
BEARER_TOKEN = "your-bearer-token"
HEADERS = {"Authorization": f"Bearer {BEARER_TOKEN}"}

def search_tweets(query: str, max_results: int = 10,
                  start_time: str | None = None, end_time: str | None = None) -> dict:
    """Search recent tweets (last 7 days) matching a query.

    Args:
        query: Search string using Twitter's search operator syntax.
        max_results: Tweets per page; clamped to the API's 10-100 range.
        start_time: Optional ISO-8601 lower bound (e.g. "2026-01-01T00:00:00Z").
        end_time: Optional ISO-8601 upper bound.

    Returns:
        The raw JSON response dict (tweets under "data", expanded users
        under "includes").

    Raises:
        requests.HTTPError: On a non-2xx API response.
    """
    url = "https://api.x.com/2/tweets/search/recent"
    params = {
        "query": query,
        # The recent-search endpoint only accepts max_results between 10 and 100.
        "max_results": max(10, min(max_results, 100)),
        "tweet.fields": "created_at,public_metrics,author_id,lang,entities,conversation_id",
        "expansions": "author_id",
        "user.fields": "username,name,public_metrics,description,verified",
    }
    if start_time:
        params["start_time"] = start_time
    if end_time:
        params["end_time"] = end_time

    # Explicit timeout: without one, requests can block forever on a stalled
    # connection.
    response = requests.get(url, headers=HEADERS, params=params, timeout=15)
    response.raise_for_status()
    return response.json()

# Demo: fetch one page of results and print each tweet with basic
# engagement metrics from public_metrics.
results = search_tweets("python web scraping", max_results=50)
for tweet in results.get("data", []):
    print(f"{tweet['created_at']}: {tweet['text'][:100]}")
    metrics = tweet.get("public_metrics", {})
    print(f"  RT: {metrics.get('retweet_count', 0)}, Replies: {metrics.get('reply_count', 0)}")

For collecting more than 100 tweets, implement cursor-based pagination:

def search_tweets_paginated(query: str, max_total: int = 500) -> list[dict]:
    """Collect up to *max_total* tweets across multiple result pages.

    Follows the API's `next_token` cursor and honours the rate-limit
    headers, sleeping until the window resets when the quota runs low.

    Args:
        query: Search string using Twitter's search operator syntax.
        max_total: Upper bound on the number of tweets returned.

    Raises:
        requests.HTTPError: On a non-2xx API response.
    """
    url = "https://api.x.com/2/tweets/search/recent"
    params = {
        "query": query,
        "max_results": 100,
        "tweet.fields": "created_at,public_metrics,author_id,lang",
        "expansions": "author_id",
        "user.fields": "username,name,public_metrics",
    }

    all_tweets = []
    next_token = None

    while len(all_tweets) < max_total:
        if next_token:
            params["next_token"] = next_token

        response = requests.get(url, headers=HEADERS, params=params, timeout=15)
        # Fail loudly on auth/quota errors instead of silently treating the
        # error payload as an empty result page (the original skipped this).
        response.raise_for_status()
        data = response.json()

        all_tweets.extend(data.get("data", []))

        next_token = data.get("meta", {}).get("next_token")
        if not next_token:
            break

        # Honour the per-window rate limit advertised in the response headers.
        remaining = int(response.headers.get("x-rate-limit-remaining", 1))
        if remaining <= 1:
            reset = int(response.headers.get("x-rate-limit-reset", 0))
            wait = max(reset - int(time.time()), 1)
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        else:
            time.sleep(0.5)

    return all_tweets[:max_total]

# Collect up to 300 tweets about Python scraping
# (pagination transparently handles the API's 100-tweets-per-request cap).
tweets = search_tweets_paginated("#webscraping python", max_total=300)
print(f"Collected {len(tweets)} tweets")

User Timeline Fetch

def get_user_id(username: str) -> str:
    """Resolve a Twitter handle to its immutable numeric user ID.

    Raises:
        requests.HTTPError: On a non-2xx API response (e.g. unknown handle).
    """
    url = f"https://api.x.com/2/users/by/username/{username}"
    # timeout keeps a stalled connection from blocking forever.
    response = requests.get(url, headers=HEADERS, timeout=15)
    response.raise_for_status()
    return response.json()["data"]["id"]

def get_user_tweets(user_id: str, max_results: int = 100,
                    exclude: str = "retweets,replies") -> list[dict]:
    """Fetch a user's recent tweets from their timeline.

    Args:
        user_id: Numeric user ID (see `get_user_id`).
        max_results: Page size; clamped to the endpoint's 5-100 range.
        exclude: Comma-separated tweet types to omit from the timeline.

    Returns:
        A list of tweet dicts (empty if the user has no matching tweets).

    Raises:
        requests.HTTPError: On a non-2xx API response.
    """
    url = f"https://api.x.com/2/users/{user_id}/tweets"
    params = {
        # The user-tweets endpoint rejects max_results outside 5-100, so
        # clamp both ends (the original only capped the upper bound).
        "max_results": max(5, min(max_results, 100)),
        "tweet.fields": "created_at,public_metrics,entities",
        "exclude": exclude,
    }
    response = requests.get(url, headers=HEADERS, params=params, timeout=15)
    response.raise_for_status()
    return response.json().get("data", [])

# Demo: resolve a handle to its numeric ID, then print a few recent tweets.
uid = get_user_id("pypi")
tweets = get_user_tweets(uid, max_results=50)
for t in tweets[:5]:
    print(f"  {t['created_at']}: {t['text'][:80]}")

User Lookup

def lookup_users(usernames: list[str]) -> list[dict]:
    """Batch-lookup public profiles for a list of usernames.

    The endpoint accepts up to 100 usernames per request, so the input is
    processed in chunks of 100 with a short pause between batches.

    Args:
        usernames: Handles to look up (without the @). An empty list
            returns an empty result without making any API call.

    Raises:
        requests.HTTPError: On a non-2xx API response.
    """
    url = "https://api.x.com/2/users/by"
    results = []

    # API supports up to 100 usernames per request
    for i in range(0, len(usernames), 100):
        batch = usernames[i:i + 100]
        params = {
            "usernames": ",".join(batch),
            "user.fields": "created_at,description,public_metrics,verified,profile_image_url,location",
        }
        # Explicit timeout so a stalled connection cannot hang the batch loop.
        response = requests.get(url, headers=HEADERS, params=params, timeout=15)
        response.raise_for_status()
        results.extend(response.json().get("data", []))
        time.sleep(0.3)  # brief pause between batches to stay under rate limits

    return results

# Demo: batch-lookup three well-known accounts and print follower counts.
profiles = lookup_users(["python", "github", "docker"])
for user in profiles:
    metrics = user.get("public_metrics", {})
    print(f"@{user['username']}: {metrics.get('followers_count', 0):,} followers")

Rate Limit Handling

import time

def rate_limited_request(url: str, params: dict) -> dict:
    """GET *url* with automatic handling of Twitter's rate-limit headers.

    Retries up to 3 times on HTTP 429, sleeping until the advertised reset.
    When the remaining quota hits zero on a successful response, sleeps
    before returning so the *next* call doesn't immediately 429.

    Raises:
        requests.HTTPError: On non-2xx responses other than 429.
        RuntimeError: When all retries are exhausted.
    """
    max_retries = 3
    for attempt in range(max_retries):
        # timeout prevents an indefinite hang on a stalled connection.
        response = requests.get(url, headers=HEADERS, params=params, timeout=15)

        remaining = int(response.headers.get("x-rate-limit-remaining", 1))
        reset_time = int(response.headers.get("x-rate-limit-reset", 0))

        if response.status_code == 429:
            # Wait at least 60s even if the reset header is missing or stale.
            wait = max(reset_time - int(time.time()), 60)
            print(f"Rate limited. Waiting {wait}s (attempt {attempt+1}/{max_retries})...")
            time.sleep(wait)
            continue

        if remaining == 0:
            wait = max(reset_time - int(time.time()), 1)
            print(f"Quota exhausted. Waiting {wait}s...")
            time.sleep(wait)

        response.raise_for_status()
        return response.json()

    # RuntimeError subclasses Exception, so callers catching Exception
    # still work; a bare Exception was too generic to catch selectively.
    raise RuntimeError("Max retries exceeded")

Nitter: The Open-Source Alternative

Nitter is an open-source Twitter frontend that lets you view tweets without logging in and without using the official API. It parses Twitter's frontend and serves a clean, JavaScript-free version.

The appeal for scraping is obvious - no auth tokens, no rate limits tied to your account, no monthly fees. You can parse Nitter's HTML with BeautifulSoup and grab tweets, profiles, and search results.

import requests
from bs4 import BeautifulSoup
import time, random

# Check current instance availability: https://status.d420.de/
# NOTE(review): public Nitter instances churn frequently — refresh this
# list against the status page before relying on it in production.
NITTER_INSTANCES = [
    "https://nitter.privacydev.net",
    "https://nitter.poast.org",
    "https://nitter.woodland.cafe",
]

def get_working_nitter() -> str | None:
    """Return the first reachable Nitter instance, or None if all are down."""
    for instance in NITTER_INSTANCES:
        try:
            resp = requests.get(f"{instance}/twitter", timeout=5)
            if resp.status_code == 200:
                return instance
        except requests.RequestException:
            # Connection errors and timeouts just mean "try the next one";
            # a bare `except Exception` would also hide programming errors.
            continue
    return None

def scrape_nitter_user(username: str, instance: str | None = None) -> list[dict]:
    """Scrape a Twitter user's recent tweets from a Nitter instance.

    Args:
        username: Twitter handle (without the @).
        instance: Nitter base URL; auto-discovered via get_working_nitter()
            when omitted.

    Returns:
        A list of dicts with text/date/link and, when the stats block is
        present, retweet and reply counts. Empty list when no instance is
        reachable or the profile page fails to load.
    """
    import re  # hoisted: the original re-imported this inside the per-tweet loop

    if not instance:
        instance = get_working_nitter()
    if not instance:
        return []

    # Compile once instead of rescanning the pattern source per tweet.
    rt_pattern = re.compile(r"(\d+)\s*(?:Retweet|RT)")
    reply_pattern = re.compile(r"(\d+)\s*(?:Reply|Replies)")

    url = f"{instance}/{username}"
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"}
    response = requests.get(url, headers=headers, timeout=15)
    if response.status_code != 200:
        return []

    soup = BeautifulSoup(response.text, "html.parser")
    tweets = []

    for item in soup.select(".timeline-item"):
        content = item.select_one(".tweet-content")
        date = item.select_one(".tweet-date a")
        stats = item.select_one(".tweet-stats")

        if not content:
            continue

        tweet = {
            "text": content.get_text(strip=True),
            "date": date.get("title", "") if date else "",
            "link": date.get("href", "") if date else "",
        }

        if stats:
            stat_text = stats.get_text()
            rt_match = rt_pattern.search(stat_text)
            reply_match = reply_pattern.search(stat_text)
            tweet["retweets"] = int(rt_match.group(1)) if rt_match else 0
            tweet["replies"] = int(reply_match.group(1)) if reply_match else 0

        tweets.append(tweet)

    return tweets

def scrape_nitter_search(query: str, instance: str | None = None, max_pages: int = 3) -> list[dict]:
    """Search tweets via a Nitter instance, following up to *max_pages* cursors.

    Args:
        query: Search string passed to Nitter's search form.
        instance: Nitter base URL; auto-discovered when omitted.
        max_pages: Maximum number of result pages to walk.

    Returns:
        A list of dicts with text/date/author; empty when no instance is
        reachable or the search page fails to load.
    """
    import re  # hoisted: the original re-imported this on every page

    if not instance:
        instance = get_working_nitter()
    if not instance:
        return []

    # Compile the cursor extractor once, outside the page loop.
    cursor_pattern = re.compile(r"cursor=([^&]+)")

    all_tweets = []
    cursor = None

    for page in range(max_pages):
        url = f"{instance}/search"
        params = {"q": query, "f": "tweets"}
        if cursor:
            params["cursor"] = cursor

        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
        resp = requests.get(url, params=params, headers=headers, timeout=15)
        if resp.status_code != 200:
            break

        soup = BeautifulSoup(resp.text, "html.parser")
        items = soup.select(".timeline-item")
        if not items:
            break

        for item in items:
            content = item.select_one(".tweet-content")
            date = item.select_one(".tweet-date a")
            username = item.select_one(".username")
            if content:
                all_tweets.append({
                    "text": content.get_text(strip=True),
                    "date": date.get("title", "") if date else "",
                    "author": username.get_text(strip=True) if username else "",
                })

        # Look for next cursor
        next_el = soup.select_one("div.show-more a")
        if next_el:
            m = cursor_pattern.search(next_el.get("href", ""))
            cursor = m.group(1) if m else None
        else:
            break

        # Randomized delay between pages to look less like a bot.
        time.sleep(random.uniform(2, 4))

    return all_tweets

The problem: Nitter instances are unreliable. Twitter keeps breaking the endpoints Nitter relies on. Check the Nitter status page for current instance availability.

Direct Scraping and Guest Tokens

There's one semi-working approach for anonymous access: guest tokens.

def get_guest_token() -> str:
    """Obtain a temporary guest token for anonymous API access.

    Returns an empty string when activation fails, so callers can treat
    "no token" and "activation error" uniformly. The original would raise
    from `.json()` on a non-JSON error response and could hang without a
    timeout.
    """
    url = "https://api.x.com/1.1/guest/activate.json"
    headers = {
        # Twitter's own web client Bearer token (public, stable for years)
        "Authorization": "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs=1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA",
    }
    response = requests.post(url, headers=headers, timeout=10)
    if response.status_code != 200:
        return ""
    return response.json().get("guest_token", "")

def get_tweet_guest(tweet_id: str) -> dict:
    """Fetch a single tweet anonymously via a guest token.

    Returns {} when no guest token could be obtained or the lookup fails,
    so callers never see a half-authenticated request.
    """
    token = get_guest_token()
    if not token:
        # Don't send the request with an empty x-guest-token header.
        return {}
    url = f"https://api.x.com/2/tweets/{tweet_id}"
    headers = {
        "Authorization": "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs=1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA",
        "x-guest-token": token,
    }
    params = {"tweet.fields": "created_at,public_metrics,author_id"}
    resp = requests.get(url, headers=headers, params=params, timeout=10)
    if resp.status_code == 200:
        return resp.json()
    return {}

Guest tokens expire quickly and have tight rate limits. Don't build products on them.

IP-Based Rate Limits and Proxy Rotation

Even with valid API credentials, some rate limits are enforced per IP address. If you're running multiple API keys or scraping Nitter instances from the same server, you'll get blocked fast.

Residential proxy rotation helps here - services like ThorData give you a pool of residential IPs that look like normal user traffic. It's not a magic fix for API quotas - those are tracked by your Bearer token - but for IP-based blocks on direct requests or Nitter scraping, rotating through residential IPs keeps your server from being blacklisted.

import httpx

# Rotating residential proxy endpoint; credentials go in the URL userinfo.
PROXY = "http://USER:[email protected]:9000"

# For IP-rotation on supplementary scraping (not the main API)
# Shared client: every request is routed through the proxy with
# browser-like headers; timeout caps each request at 20 seconds.
proxied_client = httpx.Client(
    proxy=PROXY,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
    timeout=20,
)

def scrape_nitter_proxied(username: str, nitter_instance: str) -> list[dict]:
    """Fetch a user's Nitter profile page through the rotating residential
    proxy and return the text of each tweet found on it.

    Returns an empty list when the page does not load with HTTP 200.
    """
    resp = proxied_client.get(f"{nitter_instance}/{username}")
    if resp.status_code != 200:
        return []

    page = BeautifulSoup(resp.text, "html.parser")
    contents = (entry.select_one(".tweet-content") for entry in page.select(".timeline-item"))
    return [{"text": body.get_text(strip=True)} for body in contents if body]

Storing Twitter Data

import sqlite3
import json
from datetime import datetime

def init_twitter_db(path: str = "twitter_data.db") -> sqlite3.Connection:
    """Open (creating on first use) the tweet archive database at *path*.

    Ensures the `tweets` and `users` tables and their lookup indexes
    exist, commits the DDL, and returns the live connection.
    """
    schema = """
        CREATE TABLE IF NOT EXISTS tweets (
            tweet_id TEXT PRIMARY KEY,
            author_id TEXT,
            author_username TEXT,
            text TEXT,
            created_at TEXT,
            retweet_count INTEGER DEFAULT 0,
            reply_count INTEGER DEFAULT 0,
            like_count INTEGER DEFAULT 0,
            quote_count INTEGER DEFAULT 0,
            lang TEXT,
            query TEXT,
            scraped_at TEXT DEFAULT (datetime('now'))
        );

        CREATE TABLE IF NOT EXISTS users (
            user_id TEXT PRIMARY KEY,
            username TEXT,
            display_name TEXT,
            followers_count INTEGER,
            following_count INTEGER,
            tweet_count INTEGER,
            description TEXT,
            verified INTEGER,
            created_at TEXT,
            scraped_at TEXT DEFAULT (datetime('now'))
        );

        CREATE INDEX IF NOT EXISTS idx_tweets_author ON tweets(author_id);
        CREATE INDEX IF NOT EXISTS idx_tweets_created ON tweets(created_at);
        CREATE INDEX IF NOT EXISTS idx_tweets_query ON tweets(query);
    """
    database = sqlite3.connect(path)
    database.executescript(schema)
    database.commit()
    return database

def save_tweets(conn: sqlite3.Connection, tweets: list[dict],
                users: list[dict] = None, query: str = None):
    user_map = {}
    if users:
        for u in users:
            user_map[u["id"]] = u
            metrics = u.get("public_metrics", {})
            conn.execute("""
                INSERT OR REPLACE INTO users
                (user_id, username, display_name, followers_count, following_count,
                 tweet_count, description, verified, created_at, scraped_at)
                VALUES (?,?,?,?,?,?,?,?,?,datetime('now'))
            """, (
                u["id"], u.get("username"), u.get("name"),
                metrics.get("followers_count", 0), metrics.get("following_count", 0),
                metrics.get("tweet_count", 0), u.get("description"),
                int(u.get("verified", False)), u.get("created_at"),
            ))

    for t in tweets:
        metrics = t.get("public_metrics", {})
        conn.execute("""
            INSERT OR REPLACE INTO tweets
            (tweet_id, author_id, text, created_at, retweet_count, reply_count,
             like_count, quote_count, lang, query, scraped_at)
            VALUES (?,?,?,?,?,?,?,?,?,?,datetime('now'))
        """, (
            t["id"], t.get("author_id"), t.get("text"), t.get("created_at"),
            metrics.get("retweet_count", 0), metrics.get("reply_count", 0),
            metrics.get("like_count", 0), metrics.get("quote_count", 0),
            t.get("lang"), query,
        ))

    conn.commit()

The Honest Take on Limitations

For comparison, Reddit's API is free for non-commercial use with reasonable rate limits. Mastodon's API is open. Twitter went the other direction entirely.

What I Would Actually Recommend

If you need a small amount of Twitter data for a project, get the Basic tier for one month. Collect what you need. Cancel. Don't try to build a continuously-running pipeline on it unless you're prepared for the Pro tier cost.

If you need data for research or a one-off analysis, check if an academic dataset already exists. Datasets collected before the API changes are floating around on Zenodo and Kaggle.

If you need real-time monitoring of specific accounts or keywords, the Pro tier streaming endpoints are genuinely good. But at $5,000/month, make sure the ROI is there before you commit.

The days of casually scraping Twitter for free are over. Plan your data collection carefully, use your API reads efficiently, and always have a fallback for when your current approach breaks. Because with Twitter, something always breaks.

Key Takeaways

Efficient Read Budget Planning

The biggest mistake beginners make with Twitter's paid API is blowing through their monthly read budget in the first week with exploratory queries. Here's a framework for planning your usage:

Audit before you query. Before writing any production code, figure out exactly what data you need and whether you actually need fresh Twitter data. For historical analysis, academic datasets on Zenodo or Kaggle often cover the same ground. For trend analysis, pre-computed aggregates from social analytics tools may be cheaper than building your own pipeline.

Use max_results strategically. The Basic tier gives you 10,000 reads per month. If you set max_results=100 in search queries, each search burns 100 reads. If you set max_results=10, it burns 10. Many use cases work fine with 10-tweet samples — monitoring brand mentions, checking if a topic is trending, validating that a username exists.

Incremental collection. For monitoring use cases, collect the latest N tweets and track the newest tweet ID you've seen. On the next run, use the since_id parameter to fetch only new tweets since your last collection. This avoids redundant reads.

def incremental_collect(query: str, conn: sqlite3.Connection,
                        max_per_run: int = 100) -> int:
    """Collect only tweets newer than the last run for *query*.

    Tracks previously-seen tweet IDs in a `search_results` table (created
    on first use — the original referenced this table but no schema in
    this file creates it) and passes the newest one as `since_id`, so each
    run spends API reads only on genuinely new tweets.

    Returns:
        The number of tweets fetched this run.

    Raises:
        requests.HTTPError: On a non-2xx API response.
    """
    # Bookkeeping table: which tweet IDs each query has already returned.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS search_results ("
        "query TEXT, tweet_id TEXT, PRIMARY KEY (query, tweet_id))"
    )
    # since_id must be the numerically largest tweet ID seen; tweet IDs are
    # snowflake integers that increase over time.
    row = conn.execute(
        "SELECT MAX(CAST(tweet_id AS INTEGER)) FROM search_results WHERE query = ?",
        (query,),
    ).fetchone()

    params = {
        "query": f"{query} -is:retweet lang:en",
        "max_results": max_per_run,
        "tweet.fields": "created_at,public_metrics,author_id",
        "expansions": "author_id",
        "user.fields": "username,name",
    }
    if row and row[0] is not None:
        params["since_id"] = str(row[0])

    url = "https://api.x.com/2/tweets/search/recent"
    response = requests.get(url, headers=HEADERS, params=params, timeout=15)
    response.raise_for_status()
    data = response.json()

    tweets = data.get("data", [])
    # save_tweets expects the author profiles as a list of user dicts
    # (the original passed an id->user dict, which it cannot iterate).
    users = data.get("includes", {}).get("users", [])
    save_tweets(conn, tweets, users, query=query)

    for tweet in tweets:
        conn.execute(
            "INSERT OR IGNORE INTO search_results (query, tweet_id) "
            "VALUES (?,?)",
            (query, tweet["id"])
        )
    conn.commit()
    # Compute the count locally; save_tweets historically returned None.
    return len(tweets)

Cache user lookups. User metadata (follower count, bio, join date) doesn't change often. If you're looking up the same accounts repeatedly, store the results in SQLite and only re-fetch when the data is more than 24 hours old. User lookups cost 1 read each — fetching 100 users daily instead of weekly costs roughly 2,600 extra reads per month (about 3,000 reads for daily runs versus about 400 for weekly).

Alternative Data Sources When Twitter Is Too Expensive

Sometimes the right answer is to use a different platform:

These alternatives won't give you the same data as Twitter, but for many use cases — sentiment analysis, trend detection, public opinion research — they're sufficient and orders of magnitude cheaper.

SQLite Schema for Twitter Data

For any sustained collection, a clean schema makes querying much easier:

import sqlite3

def init_twitter_db(db_path: str = "twitter.db") -> sqlite3.Connection:
    """Create (if needed) and return a connection to the collection DB.

    Tables:
      users          - one row per profile; latest snapshot wins
      tweets         - one row per tweet
      search_runs    - log of search executions
      search_results - which tweet IDs each query already returned; this
                       backs incremental collection via since_id (the
                       original schema omitted it even though
                       incremental_collect queries it)
    """
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS users (
            id TEXT PRIMARY KEY,
            username TEXT,
            name TEXT,
            description TEXT,
            followers INTEGER DEFAULT 0,
            following INTEGER DEFAULT 0,
            tweet_count INTEGER DEFAULT 0,
            verified INTEGER DEFAULT 0,
            created_at TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS tweets (
            id TEXT PRIMARY KEY,
            author_id TEXT,
            text TEXT,
            created_at TEXT,
            likes INTEGER DEFAULT 0,
            retweets INTEGER DEFAULT 0,
            replies INTEGER DEFAULT 0,
            quotes INTEGER DEFAULT 0,
            lang TEXT,
            source TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS search_runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            query TEXT,
            tweets_found INTEGER,
            run_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS search_results (
            query TEXT,
            tweet_id TEXT,
            PRIMARY KEY (query, tweet_id)
        );

        CREATE INDEX IF NOT EXISTS idx_tweets_author ON tweets(author_id);
        CREATE INDEX IF NOT EXISTS idx_tweets_created ON tweets(created_at);
    """)
    conn.commit()
    return conn

Summary

Twitter/X in 2026 requires either a paid API subscription ($100-$5,000/month), community Nitter instances (unreliable), or guest tokens (ephemeral). For most data projects, the Basic tier at $100/month is the entry point — plan your reads carefully, use incremental collection to avoid re-fetching, and supplement with ThorData residential proxies for IP-based rate limit bypass when scraping Nitter or running parallel key setups.