How to Scrape Twitter/X Without the API in 2026 (Complete Guide)

Twitter's API pricing has pushed most developers toward scraping. The Basic tier runs $100/month for a mere 10,000 tweet reads. Pro costs $5,000/month. Enterprise starts at $42,000/month. For researchers, marketers, and data analysts, those numbers kill most projects before they start.

The good news: Twitter's public web interface still loads data in your browser, which means a well-built scraper can extract profiles, tweets, search results, and engagement metrics without paying a dime.

This guide covers what's actually accessible in 2026, two distinct scraping approaches (lightweight HTTP interception and full browser automation), anti-detection strategies that work against Twitter's bot defenses, and how to store the data for analysis.

What's Publicly Accessible Without Login

Before writing any code, you need to understand what Twitter exposes to unauthenticated visitors:

Accessible without login:

- Public profile pages: display name, bio, follower/following counts, profile image, banner, verified status, join date
- Individual tweet pages: full tweet text, media attachments, engagement counts (likes, retweets, replies, bookmarks, views), timestamp
- Reply threads: visible replies on public tweet pages (limited depth)
- Embedded tweets: tweets embedded on other websites load via Twitter's oEmbed endpoint (see the snippet below)
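The oEmbed endpoint is the lowest-friction item on that list. Here is a minimal sketch using httpx; publish.twitter.com/oembed is Twitter's documented oEmbed endpoint, fetch_oembed is a hypothetical helper, and the example URL is just a placeholder:

import httpx


def fetch_oembed(tweet_url: str) -> dict:
    """Fetch a tweet's oEmbed payload (author, embed HTML) without any auth."""
    resp = httpx.get(
        "https://publish.twitter.com/oembed",
        params={"url": tweet_url, "omit_script": "true"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()


embed = fetch_oembed("https://twitter.com/jack/status/20")
print(embed["author_name"], "-", embed["html"][:80])

You only get the tweet text wrapped in embed HTML and the author name, no engagement counts, but it requires no tokens at all.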

Requires authentication (can't scrape without an account):

- Search results: Twitter's search and explore features require login, even for public content
- Timeline feeds: home timeline, lists, "For You" recommendations
- Follower/following lists: visible on the web but gated behind auth
- Spaces, Communities, DMs: all require authentication
- Analytics and engagement breakdowns beyond top-level counts

This distinction matters because it defines your scraping ceiling. If you need search or follower data, you'll need to work with logged-in sessions, which carries account suspension risk.
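If you do go the logged-in route, one lower-risk pattern is logging in by hand once and reusing the saved session with Playwright's storage_state, rather than automating the login flow itself, which trips detection far more often. A minimal sketch; the function names and the state.json path are hypothetical:

import asyncio
from playwright.async_api import async_playwright


async def save_login_state():
    """Log in by hand once, then persist cookies/localStorage to disk."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://x.com/login")
        input("Log in manually in the browser window, then press Enter...")
        await context.storage_state(path="state.json")
        await browser.close()


async def scrape_with_session(url: str) -> str:
    """Reuse the saved session; storage_state restores cookies on launch."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(storage_state="state.json")
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content


asyncio.run(save_login_state())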

Approach 1: Intercepting Twitter's Internal API

Twitter's web client communicates with a GraphQL API at x.com/i/api/graphql/. When you load any profile or tweet page, the browser fires requests to these endpoints with temporary authentication tokens.

Historically, "guest tokens" (temporary auth tokens issued without login) let you call these endpoints programmatically. In 2026, this still partially works, though most endpoints, including search and timelines, now reject guest tokens and require a real account.

For individual tweet lookups, guest tokens can still work. Here's a lightweight approach:

import httpx
import json
from dataclasses import dataclass


@dataclass
class TweetData:
    tweet_id: str
    text: str
    author: str
    author_handle: str
    likes: int
    retweets: int
    replies: int
    views: int
    created_at: str
    media_urls: list[str]


BEARER = (
    "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejR"
    "COuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA"
)

def get_guest_token(client: httpx.Client) -> str:
    """Fetch a temporary guest token from Twitter's activation endpoint."""
    resp = client.post(
        "https://api.x.com/1.1/guest/activate.json",
        headers={"Authorization": BEARER},
    )
    resp.raise_for_status()
    return resp.json()["guest_token"]


def fetch_tweet_by_id(tweet_id: str) -> TweetData | None:
    """Fetch a single tweet's data using the guest token approach."""
    client = httpx.Client(
        headers={
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/131.0.0.0 Safari/537.36"
            ),
        },
        timeout=15,
    )

    try:
        guest_token = get_guest_token(client)
    except httpx.HTTPStatusError:
        print("Failed to get guest token — Twitter may have rotated the bearer.")
        return None

    # TweetResultByRestId endpoint — hash changes on deploys, check DevTools
    endpoint = (
        "https://x.com/i/api/graphql/0hWvDhmW8YQ-S_ib3azIrw/TweetResultByRestId"
    )
    params = {
        "variables": json.dumps({"tweetId": tweet_id, "withCommunity": False}),
        "features": json.dumps(
            {"creator_subscriptions_tweet_preview_api_enabled": True}
        ),
    }
    headers = {
        "Authorization": BEARER,
        "X-Guest-Token": guest_token,
        "Content-Type": "application/json",
    }

    resp = client.get(endpoint, params=params, headers=headers)
    if resp.status_code != 200:
        print(f"API returned {resp.status_code}")
        return None

    result = resp.json()
    # Payload shape changes on deploys; expect KeyErrors here when it shifts
    tweet = result["data"]["tweetResult"]["result"]
    legacy = tweet["legacy"]
    user = tweet["core"]["user_results"]["result"]["legacy"]

    media_urls = []
    if "extended_entities" in legacy:
        for m in legacy["extended_entities"].get("media", []):
            media_urls.append(m.get("media_url_https", ""))

    return TweetData(
        tweet_id=tweet_id,
        text=legacy["full_text"],
        author=user["name"],
        author_handle=user["screen_name"],
        likes=legacy["favorite_count"],
        retweets=legacy["retweet_count"],
        replies=legacy["reply_count"],
        views=int(tweet.get("views", {}).get("count", 0)),
        created_at=legacy["created_at"],
        media_urls=media_urls,
    )


# Usage
tweet = fetch_tweet_by_id("1234567890123456789")
if tweet:
    print(f"@{tweet.author_handle}: {tweet.text[:80]}...")
    print(f"  {tweet.likes} likes | {tweet.retweets} RTs | {tweet.views} views")

Important caveat: The GraphQL endpoint hash (0hWvDhmW8YQ-S_ib3azIrw) changes when Twitter deploys new code. You'll need to update it by inspecting network requests in your browser's DevTools. This is the main maintenance burden of this approach.

Approach 2: Full Browser Automation with Playwright

For more reliable scraping — especially profiles and tweet threads — Playwright renders the actual page like a real user. This bypasses API-level restrictions entirely because you're interacting with the rendered DOM.

import asyncio
import random
from datetime import datetime
from playwright.async_api import async_playwright


async def scrape_profile(username: str) -> dict | None:
    """Scrape a Twitter profile page for bio, stats, and recent tweets."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1280, "height": 900},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/131.0.0.0 Safari/537.36"
            ),
        )
        page = await context.new_page()

        try:
            await page.goto(
                f"https://x.com/{username}",
                wait_until="networkidle",
                timeout=30000,
            )

            # Wait for profile data to render
            await page.wait_for_selector(
                '[data-testid="UserName"]', timeout=10000
            )

            name = await page.text_content('[data-testid="UserName"]')
            bio_el = await page.query_selector('[data-testid="UserDescription"]')
            bio = await bio_el.text_content() if bio_el else ""

            # Extract follower/following counts
            followers_link = await page.query_selector(
                f'a[href="/{username}/verified_followers"]'
            )
            following_link = await page.query_selector(
                f'a[href="/{username}/following"]'
            )

            followers_text = ""
            following_text = ""
            if followers_link:
                followers_text = await followers_link.text_content()
            if following_link:
                following_text = await following_link.text_content()

            # Extract tweets with engagement data
            tweet_articles = await page.query_selector_all(
                'article[data-testid="tweet"]'
            )
            tweet_data = []
            for article in tweet_articles[:10]:
                text_el = await article.query_selector(
                    '[data-testid="tweetText"]'
                )
                text = await text_el.text_content() if text_el else ""

                # Engagement buttons contain aria-labels with counts
                like_btn = await article.query_selector('[data-testid="like"]')
                retweet_btn = await article.query_selector(
                    '[data-testid="retweet"]'
                )
                reply_btn = await article.query_selector('[data-testid="reply"]')

                like_label = (
                    await like_btn.get_attribute("aria-label") if like_btn else ""
                )
                rt_label = (
                    await retweet_btn.get_attribute("aria-label")
                    if retweet_btn else ""
                )
                reply_label = (
                    await reply_btn.get_attribute("aria-label")
                    if reply_btn else ""
                )

                tweet_data.append({
                    "text": text.strip()[:200],
                    "likes": like_label,
                    "retweets": rt_label,
                    "replies": reply_label,
                })

            return {
                "username": username,
                "name": name.strip() if name else None,
                "bio": bio.strip(),
                "followers": followers_text.strip(),
                "following": following_text.strip(),
                "recent_tweets": tweet_data,
                "scraped_at": datetime.utcnow().isoformat(),
            }

        except Exception as e:
            print(f"Failed to scrape @{username}: {e}")
            return None
        finally:
            await browser.close()


async def scrape_multiple_profiles(usernames: list[str]) -> list[dict]:
    """Scrape multiple profiles with delays between requests."""
    results = []
    for username in usernames:
        print(f"Scraping @{username}...")
        data = await scrape_profile(username)
        if data:
            results.append(data)
        await asyncio.sleep(random.uniform(10, 15))  # randomized gap between profiles
    return results


# Usage
profiles = asyncio.run(
    scrape_multiple_profiles(["elonmusk", "github", "ycombinator"])
)
for p in profiles:
    print(f"@{p['username']}: {p['followers']} followers")
    print(f"  Bio: {p['bio'][:80]}...")
    print(f"  Tweets collected: {len(p['recent_tweets'])}")

Scraping Tweet Threads and Replies

Individual tweet pages show the original tweet plus visible replies. Here's how to extract a conversation thread:

async def scrape_tweet_thread(tweet_url: str) -> dict:
    """Extract a tweet and its visible reply thread."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1280, "height": 900},
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/131.0.0.0 Safari/537.36"
            ),
        )
        page = await context.new_page()
        await page.goto(tweet_url, wait_until="networkidle", timeout=30000)

        # Scroll down to load more replies
        for _ in range(3):
            await page.evaluate("window.scrollBy(0, 800)")
            await asyncio.sleep(1.5)

        articles = await page.query_selector_all(
            'article[data-testid="tweet"]'
        )

        thread = []
        for article in articles:
            user_el = await article.query_selector(
                '[data-testid="User-Name"] a[role="link"]'
            )
            text_el = await article.query_selector('[data-testid="tweetText"]')
            time_el = await article.query_selector("time")

            handle = ""
            if user_el:
                href = await user_el.get_attribute("href")
                handle = href.strip("/") if href else ""

            text = await text_el.text_content() if text_el else ""
            timestamp = (
                await time_el.get_attribute("datetime") if time_el else ""
            )

            thread.append({
                "author": handle,
                "text": text.strip(),
                "timestamp": timestamp,
            })

        await browser.close()
        return {"url": tweet_url, "thread": thread}
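As with the earlier blocks, a quick usage example; the status URL is a placeholder:

# Usage
result = asyncio.run(
    scrape_tweet_thread("https://x.com/jack/status/20")
)
print(f"Collected {len(result['thread'])} posts in the thread")
for post in result["thread"][:3]:
    print(f"@{post['author']} at {post['timestamp']}: {post['text'][:60]}")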

Storing Scraped Twitter Data in SQLite

For any serious collection, you need structured storage. SQLite works perfectly for moderate volumes:

import sqlite3
import json
from datetime import datetime


def init_twitter_db(db_path: str = "twitter_data.db") -> sqlite3.Connection:
    """Initialize SQLite database for storing scraped Twitter data."""
    db = sqlite3.connect(db_path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS profiles (
            username TEXT PRIMARY KEY,
            display_name TEXT,
            bio TEXT,
            followers TEXT,
            following TEXT,
            scraped_at TEXT
        );

        CREATE TABLE IF NOT EXISTS tweets (
            tweet_id TEXT PRIMARY KEY,
            author_handle TEXT,
            text TEXT,
            likes INTEGER DEFAULT 0,
            retweets INTEGER DEFAULT 0,
            replies INTEGER DEFAULT 0,
            views INTEGER DEFAULT 0,
            created_at TEXT,
            media_urls TEXT,
            scraped_at TEXT
        );

        CREATE INDEX IF NOT EXISTS idx_tweets_author
            ON tweets(author_handle);
        CREATE INDEX IF NOT EXISTS idx_tweets_created
            ON tweets(created_at);
    """)
    return db


def save_profile(db: sqlite3.Connection, profile: dict):
    """Upsert a scraped profile into the database."""
    db.execute(
        """INSERT OR REPLACE INTO profiles
           (username, display_name, bio, followers, following, scraped_at)
           VALUES (?, ?, ?, ?, ?, ?)""",
        (
            profile["username"],
            profile.get("name", ""),
            profile.get("bio", ""),
            profile.get("followers", ""),
            profile.get("following", ""),
            datetime.utcnow().isoformat(),
        ),
    )
    db.commit()


def save_tweet(db: sqlite3.Connection, tweet: TweetData):
    """Insert or update a tweet record."""
    db.execute(
        """INSERT OR REPLACE INTO tweets
           (tweet_id, author_handle, text, likes, retweets,
            replies, views, created_at, media_urls, scraped_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            tweet.tweet_id, tweet.author_handle, tweet.text,
            tweet.likes, tweet.retweets, tweet.replies, tweet.views,
            tweet.created_at, json.dumps(tweet.media_urls),
            datetime.utcnow().isoformat(),
        ),
    )
    db.commit()


def get_top_tweets(db: sqlite3.Connection, handle: str, limit: int = 10):
    """Retrieve top tweets by engagement for a given author."""
    cursor = db.execute(
        """SELECT text, likes, retweets, views FROM tweets
           WHERE author_handle = ?
           ORDER BY likes + retweets DESC
           LIMIT ?""",
        (handle, limit),
    )
    return cursor.fetchall()
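Tying the storage layer to the guest-token fetcher from earlier, a short usage sketch; the tweet ID and handle are placeholders:

# Usage
db = init_twitter_db()
tweet = fetch_tweet_by_id("1234567890123456789")
if tweet:
    save_tweet(db, tweet)

for text, likes, retweets, views in get_top_tweets(db, "github", limit=5):
    print(f"{likes + retweets} engagement | {text[:60]}")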

Anti-Detection Strategies That Actually Work

Twitter invests heavily in bot detection. Here's what matters:

Browser fingerprinting: Twitter checks your TLS fingerprint (JA3/JA4), canvas fingerprint, WebGL renderer, and timezone. Playwright's default Chromium fingerprint is widely known — Twitter may flag it. Using playwright-stealth helps mask automation signals:

# pip install playwright-stealth
from playwright_stealth import stealth_async

async def create_stealth_page(browser):
    context = await browser.new_context(
        viewport={"width": 1280, "height": 900},
        locale="en-US",
        timezone_id="America/New_York",
    )
    page = await context.new_page()
    await stealth_async(page)
    return page

IP rotation is non-negotiable for volume: A single IP gets rate-limited after 20-30 Playwright page loads. For anything beyond casual scraping, you need residential proxies. ThorData's residential network works well here because the IPs originate from real consumer connections — Twitter's detection systems treat them as normal user traffic rather than datacenter bot traffic.

# Playwright with rotating residential proxy
context = await browser.new_context(
    proxy={
        "server": "http://proxy.thordata.com:9000",
        "username": "YOUR_USER",
        "password": "YOUR_PASS",
    },
)

Behavioral patterns matter: Real users don't load 50 profiles in 50 seconds. Add realistic delays, scroll behavior, and occasional "idle" periods:
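A minimal sketch of what that can look like; humanized_visit is a hypothetical helper and the exact timings are illustrative, not tuned values:

import asyncio
import random


async def humanized_visit(page, url: str):
    """Load a page with randomized scrolling, pauses, and occasional idling."""
    await page.goto(url, wait_until="networkidle", timeout=30000)

    # Scroll in small, uneven steps, like a reader skimming the page
    for _ in range(random.randint(3, 7)):
        await page.evaluate(f"window.scrollBy(0, {random.randint(300, 900)})")
        await asyncio.sleep(random.uniform(0.8, 2.5))

    # Occasionally pause as if the user got distracted
    if random.random() < 0.2:
        await asyncio.sleep(random.uniform(8, 20))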

Why Twitter Scrapers Break (and How to Handle It)

Any Twitter scraper you build will need maintenance within 2-6 weeks, because Twitter actively fights scraping: GraphQL endpoint hashes rotate on every deploy, DOM structure and data-testid attributes change without notice, per-IP rate limits tighten, and fingerprint checks keep expanding.

Mitigation strategy: Store your selectors in a config file instead of hardcoding them. When something breaks, update the config rather than editing code throughout your codebase. Add health checks that alert you when scrape success rate drops below 90%.
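A minimal sketch of both ideas, config-driven selectors plus a success-rate monitor; the selectors.json file, the ScrapeHealth class, and the 20-attempt minimum are illustrative choices, with the 90% threshold taken from above:

import json

# selectors.json lives next to the script, e.g.:
# {"tweet": "article[data-testid=\"tweet\"]", "text": "[data-testid=\"tweetText\"]"}
with open("selectors.json") as f:
    SELECTORS = json.load(f)


class ScrapeHealth:
    """Track scrape success rate and flag when it drops below a threshold."""

    def __init__(self, alert_below: float = 0.9):
        self.attempts = 0
        self.successes = 0
        self.alert_below = alert_below

    def record(self, ok: bool):
        self.attempts += 1
        self.successes += ok

    @property
    def rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 1.0

    def check(self):
        if self.attempts >= 20 and self.rate < self.alert_below:
            # Swap in real alerting (email, Slack webhook) here
            print(f"ALERT: scrape success rate at {self.rate:.0%}")


health = ScrapeHealth()

In the scraping loop, call health.record(data is not None) after each attempt and health.check() once per batch, so a selector change shows up as an alert instead of weeks of silently empty rows.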

Business Use Cases for Twitter Scraping

Companies scrape Twitter data for measurable value in a few recurring areas: brand and sentiment monitoring, competitor and influencer tracking, trend and market research, and audience analysis for marketing campaigns.

Alternatives to Building Your Own Scraper

If maintaining a scraper sounds like too much work, the practical alternatives are managed scraper APIs that return structured tweet data for a per-request fee, and pre-collected datasets for historical analysis. Both trade money for the maintenance burden this guide describes.

Final Thoughts

Scraping Twitter without the API is a cat-and-mouse game. For small-scale, one-time data collection — a few hundred profiles, a thousand tweets — the Playwright approach works reliably. For ongoing production workloads, budget for continuous maintenance: plan to update selectors monthly, rotate proxies, and monitor your success rate.

The code in this guide gives you a working foundation. Build on it incrementally, test against real pages before scaling up, and respect the platform's rate limits. Twitter scraping works in 2026 — it just requires more sophistication than it did three years ago.