How to Scrape Twitter/X Without the API in 2026 (Complete Guide)
Twitter's API pricing has pushed most developers toward scraping. The Basic tier runs $100/month for a mere 10,000 tweet reads. Pro costs $5,000/month. Enterprise starts at $42,000/month. For researchers, marketers, and data analysts, those numbers kill most projects before they start.
The good news: Twitter's public web interface still loads data in your browser, which means a well-built scraper can extract profiles, tweets, search results, and engagement metrics without paying a dime.
This guide covers what's actually accessible in 2026, two distinct scraping approaches (lightweight HTTP interception and full browser automation), anti-detection strategies that work against Twitter's bot defenses, and how to store the data for analysis.
What's Publicly Accessible Without Login
Before writing any code, you need to understand what Twitter exposes to unauthenticated visitors:
Accessible without login:

- Public profile pages — display name, bio, follower/following counts, profile image, banner, verified status, join date
- Individual tweet pages — full tweet text, media attachments, engagement counts (likes, retweets, replies, bookmarks, views), timestamp
- Reply threads — visible replies on public tweet pages (limited depth)
- Embedded tweets — tweets embedded on other websites load via Twitter's oEmbed endpoint
Requires authentication (can't scrape without an account):

- Search results — Twitter's search and explore features require login, even for public content
- Timeline feeds — home timeline, lists, "For You" recommendations
- Follower/following lists — visible on the web but gated behind auth
- Spaces, Communities, DMs — all require authentication
- Analytics and engagement breakdowns beyond top-level counts
This distinction matters because it defines your scraping ceiling. If you need search or follower data, you'll need to work with logged-in sessions, which carries account suspension risk.
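The oEmbed endpoint mentioned above is the one genuinely public JSON API left: it returns the author name and embed HTML for any public tweet, no token required. A stdlib-only sketch (publish.twitter.com is the long-standing oEmbed host; treat the response fields as subject to change):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OEMBED_BASE = "https://publish.twitter.com/oembed"


def build_oembed_url(tweet_url: str, omit_script: bool = True) -> str:
    """Build an oEmbed request URL for a public tweet."""
    params = {"url": tweet_url, "omit_script": str(omit_script).lower()}
    return f"{OEMBED_BASE}?{urlencode(params)}"


def fetch_oembed(tweet_url: str) -> dict:
    """Fetch oEmbed JSON (author_name, html, provider_url, ...) for a tweet."""
    with urlopen(build_oembed_url(tweet_url), timeout=15) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

This gets you tweet text (inside the embed HTML) and the author without any scraping infrastructure at all, but nothing about engagement counts.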
Approach 1: Intercepting Twitter's Internal API
Twitter's web client communicates with a GraphQL API at x.com/i/api/graphql/. When you load any profile or tweet page, the browser fires requests to these endpoints with temporary authentication tokens.
Historically, "guest tokens" — temporary auth tokens issued without login — let you call these endpoints programmatically. In 2026, this still partially works but with severe limitations:
- Guest tokens expire within 2-5 minutes (down from hours in 2024)
- Rate limits are brutally low: roughly 50 requests per token
- Many endpoints now require full cookie-based authentication
- Twitter rotates endpoint hashes frequently, breaking hardcoded URLs
For individual tweet lookups, guest tokens can still work. Here's a lightweight approach:
import httpx
import time
from dataclasses import dataclass, asdict


@dataclass
class TweetData:
    tweet_id: str
    text: str
    author: str
    author_handle: str
    likes: int
    retweets: int
    replies: int
    views: int
    created_at: str
    media_urls: list[str]


BEARER = (
    "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejR"
    "COuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA"
)


def get_guest_token(client: httpx.Client) -> str:
    """Fetch a temporary guest token from Twitter's activation endpoint."""
    resp = client.post(
        "https://api.x.com/1.1/guest/activate.json",
        headers={"Authorization": BEARER},
    )
    resp.raise_for_status()
    return resp.json()["guest_token"]


def fetch_tweet_by_id(tweet_id: str) -> TweetData | None:
    """Fetch a single tweet's data using the guest token approach."""
    client = httpx.Client(
        headers={
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/131.0.0.0 Safari/537.36"
            ),
        },
        timeout=15,
    )
    try:
        guest_token = get_guest_token(client)
    except httpx.HTTPStatusError:
        print("Failed to get guest token — Twitter may have rotated the bearer.")
        return None

    # TweetResultByRestId endpoint — hash changes on deploys, check DevTools
    endpoint = (
        "https://x.com/i/api/graphql/0hWvDhmW8YQ-S_ib3azIrw/TweetResultByRestId"
    )
    params = {
        "variables": f'{{"tweetId":"{tweet_id}","withCommunity":false}}',
        "features": '{"creator_subscriptions_tweet_preview_api_enabled":true}',
    }
    headers = {
        "Authorization": BEARER,
        "X-Guest-Token": guest_token,
        "Content-Type": "application/json",
    }
    resp = client.get(endpoint, params=params, headers=headers)
    if resp.status_code != 200:
        print(f"API returned {resp.status_code}")
        return None

    result = resp.json()
    tweet = result["data"]["tweetResult"]["result"]
    legacy = tweet["legacy"]
    user = tweet["core"]["user_results"]["result"]["legacy"]

    media_urls = []
    if "extended_entities" in legacy:
        for m in legacy["extended_entities"].get("media", []):
            media_urls.append(m.get("media_url_https", ""))

    return TweetData(
        tweet_id=tweet_id,
        text=legacy["full_text"],
        author=user["name"],
        author_handle=user["screen_name"],
        likes=legacy["favorite_count"],
        retweets=legacy["retweet_count"],
        replies=legacy["reply_count"],
        views=int(tweet.get("views", {}).get("count", 0)),
        created_at=legacy["created_at"],
        media_urls=media_urls,
    )


# Usage
tweet = fetch_tweet_by_id("1234567890123456789")
if tweet:
    print(f"@{tweet.author_handle}: {tweet.text[:80]}...")
    print(f"  {tweet.likes} likes | {tweet.retweets} RTs | {tweet.views} views")
Important caveat: The GraphQL endpoint hash (0hWvDhmW8YQ-S_ib3azIrw) changes when Twitter deploys new code. You'll need to update it by inspecting network requests in your browser's DevTools. This is the main maintenance burden of this approach.
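Because guest tokens die within minutes, production code usually wraps each call in a retry that activates a fresh token on an auth failure. A minimal, transport-agnostic sketch; the status codes Twitter returns on expiry (401/403/429 here) are an assumption you should verify against live responses:

```python
import time


def with_fresh_token(call, activate, max_attempts: int = 3):
    """Run `call(token)`; on an auth failure, activate a new token and retry.

    `call` takes a token and returns (status_code, payload);
    `activate` returns a new guest token.
    """
    token = activate()
    for attempt in range(max_attempts):
        status, payload = call(token)
        # Assumed expiry/rate-limit codes — confirm what Twitter actually sends
        if status in (401, 403, 429):
            time.sleep(2 ** attempt)  # simple exponential backoff
            token = activate()
            continue
        return status, payload
    return status, payload
```

Wiring `get_guest_token` in as `activate` and the GraphQL GET as `call` keeps the token churn out of your parsing code.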
Approach 2: Full Browser Automation with Playwright
For more reliable scraping — especially profiles and tweet threads — Playwright renders the actual page like a real user. This bypasses API-level restrictions entirely because you're interacting with the rendered DOM.
import asyncio
import random
from datetime import datetime

from playwright.async_api import async_playwright


async def scrape_profile(username: str) -> dict | None:
    """Scrape a Twitter profile page for bio, stats, and recent tweets."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1280, "height": 900},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/131.0.0.0 Safari/537.36"
            ),
        )
        page = await context.new_page()
        try:
            await page.goto(
                f"https://x.com/{username}",
                wait_until="networkidle",
                timeout=30000,
            )
            # Wait for profile data to render
            await page.wait_for_selector(
                '[data-testid="UserName"]', timeout=10000
            )

            name = await page.text_content('[data-testid="UserName"]')
            bio_el = await page.query_selector('[data-testid="UserDescription"]')
            bio = await bio_el.text_content() if bio_el else ""

            # Extract follower/following counts
            followers_link = await page.query_selector(
                f'a[href="/{username}/verified_followers"]'
            )
            following_link = await page.query_selector(
                f'a[href="/{username}/following"]'
            )
            followers_text = ""
            following_text = ""
            if followers_link:
                followers_text = await followers_link.text_content()
            if following_link:
                following_text = await following_link.text_content()

            # Extract tweets with engagement data
            tweet_articles = await page.query_selector_all(
                'article[data-testid="tweet"]'
            )
            tweet_data = []
            for article in tweet_articles[:10]:
                text_el = await article.query_selector(
                    '[data-testid="tweetText"]'
                )
                text = await text_el.text_content() if text_el else ""

                # Engagement buttons contain aria-labels with counts
                like_btn = await article.query_selector('[data-testid="like"]')
                retweet_btn = await article.query_selector(
                    '[data-testid="retweet"]'
                )
                reply_btn = await article.query_selector('[data-testid="reply"]')
                like_label = (
                    await like_btn.get_attribute("aria-label") if like_btn else ""
                )
                rt_label = (
                    await retweet_btn.get_attribute("aria-label")
                    if retweet_btn else ""
                )
                reply_label = (
                    await reply_btn.get_attribute("aria-label")
                    if reply_btn else ""
                )

                tweet_data.append({
                    "text": text.strip()[:200],
                    "likes": like_label,
                    "retweets": rt_label,
                    "replies": reply_label,
                })

            return {
                "username": username,
                "name": name.strip() if name else None,
                "bio": bio.strip(),
                "followers": followers_text.strip(),
                "following": following_text.strip(),
                "recent_tweets": tweet_data,
                "scraped_at": datetime.utcnow().isoformat(),
            }
        except Exception as e:
            print(f"Failed to scrape @{username}: {e}")
            return None
        finally:
            await browser.close()


async def scrape_multiple_profiles(usernames: list[str]) -> list[dict]:
    """Scrape multiple profiles with delays between requests."""
    results = []
    for username in usernames:
        print(f"Scraping @{username}...")
        data = await scrape_profile(username)
        if data:
            results.append(data)
        await asyncio.sleep(random.uniform(10, 15))  # 10-15 second gap
    return results


# Usage
profiles = asyncio.run(
    scrape_multiple_profiles(["elonmusk", "github", "ycombinator"])
)
for p in profiles:
    print(f"@{p['username']}: {p['followers']} followers")
    print(f"  Bio: {p['bio'][:80]}...")
    print(f"  Tweets collected: {len(p['recent_tweets'])}")
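Note that the follower and following values come back as display strings like "12.5K Followers". A small parser normalizes them to integers before storage; the K/M/B suffixes reflect Twitter's current display convention, an assumption that may need updating:

```python
import re

_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text: str) -> int:
    """Convert a display count like '12.5K Followers' or '1,024' to an int."""
    match = re.search(r"([\d.,]+)\s*([KMB])?", text.upper())
    if not match:
        return 0
    number = float(match.group(1).replace(",", ""))
    multiplier = _SUFFIXES.get(match.group(2) or "", 1)
    return round(number * multiplier)
```

Abbreviated counts are lossy by design ("12.5K" could be anything from 12,450 to 12,549), so store the raw string alongside the parsed value if precision matters.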
Scraping Tweet Threads and Replies
Individual tweet pages show the original tweet plus visible replies. Here's how to extract a conversation thread:
async def scrape_tweet_thread(tweet_url: str) -> dict:
    """Extract a tweet and its visible reply thread."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1280, "height": 900},
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/131.0.0.0 Safari/537.36"
            ),
        )
        page = await context.new_page()
        await page.goto(tweet_url, wait_until="networkidle", timeout=30000)

        # Scroll down to load more replies
        for _ in range(3):
            await page.evaluate("window.scrollBy(0, 800)")
            await asyncio.sleep(1.5)

        articles = await page.query_selector_all(
            'article[data-testid="tweet"]'
        )
        thread = []
        for article in articles:
            user_el = await article.query_selector(
                '[data-testid="User-Name"] a[role="link"]'
            )
            text_el = await article.query_selector('[data-testid="tweetText"]')
            time_el = await article.query_selector("time")

            handle = ""
            if user_el:
                href = await user_el.get_attribute("href")
                handle = href.strip("/") if href else ""
            text = await text_el.text_content() if text_el else ""
            timestamp = (
                await time_el.get_attribute("datetime") if time_el else ""
            )

            thread.append({
                "author": handle,
                "text": text.strip(),
                "timestamp": timestamp,
            })

        await browser.close()
        return {"url": tweet_url, "thread": thread}
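One wrinkle: X virtualizes long timelines, so re-querying the articles after each scroll can yield the same reply twice. Deduplicating on (author, timestamp) before returning is a workable heuristic, though two same-second replies by one author would collide:

```python
def dedupe_thread(thread: list[dict]) -> list[dict]:
    """Drop duplicate thread entries, keeping first occurrence and order."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for entry in thread:
        key = (entry["author"], entry["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique
```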
Storing Scraped Twitter Data in SQLite
For any serious collection, you need structured storage. SQLite works perfectly for moderate volumes:
import sqlite3
import json
from datetime import datetime


def init_twitter_db(db_path: str = "twitter_data.db") -> sqlite3.Connection:
    """Initialize SQLite database for storing scraped Twitter data."""
    db = sqlite3.connect(db_path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS profiles (
            username TEXT PRIMARY KEY,
            display_name TEXT,
            bio TEXT,
            followers TEXT,
            following TEXT,
            scraped_at TEXT
        );

        CREATE TABLE IF NOT EXISTS tweets (
            tweet_id TEXT PRIMARY KEY,
            author_handle TEXT,
            text TEXT,
            likes INTEGER DEFAULT 0,
            retweets INTEGER DEFAULT 0,
            replies INTEGER DEFAULT 0,
            views INTEGER DEFAULT 0,
            created_at TEXT,
            media_urls TEXT,
            scraped_at TEXT
        );

        CREATE INDEX IF NOT EXISTS idx_tweets_author
            ON tweets(author_handle);
        CREATE INDEX IF NOT EXISTS idx_tweets_created
            ON tweets(created_at);
    """)
    return db


def save_profile(db: sqlite3.Connection, profile: dict):
    """Upsert a scraped profile into the database."""
    db.execute(
        """INSERT OR REPLACE INTO profiles
           (username, display_name, bio, followers, following, scraped_at)
           VALUES (?, ?, ?, ?, ?, ?)""",
        (
            profile["username"],
            profile.get("name", ""),
            profile.get("bio", ""),
            profile.get("followers", ""),
            profile.get("following", ""),
            datetime.utcnow().isoformat(),
        ),
    )
    db.commit()


def save_tweet(db: sqlite3.Connection, tweet: TweetData):
    """Insert or update a tweet record."""
    db.execute(
        """INSERT OR REPLACE INTO tweets
           (tweet_id, author_handle, text, likes, retweets,
            replies, views, created_at, media_urls, scraped_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            tweet.tweet_id, tweet.author_handle, tweet.text,
            tweet.likes, tweet.retweets, tweet.replies, tweet.views,
            tweet.created_at, json.dumps(tweet.media_urls),
            datetime.utcnow().isoformat(),
        ),
    )
    db.commit()


def get_top_tweets(db: sqlite3.Connection, handle: str, limit: int = 10):
    """Retrieve top tweets by engagement for a given author."""
    cursor = db.execute(
        """SELECT text, likes, retweets, views FROM tweets
           WHERE author_handle = ?
           ORDER BY likes + retweets DESC
           LIMIT ?""",
        (handle, limit),
    )
    return cursor.fetchall()
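As a sanity check on the schema, the query pattern behind get_top_tweets can be exercised end to end against an in-memory database with a cut-down table (the sample rows are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE tweets (
        tweet_id TEXT PRIMARY KEY, author_handle TEXT, text TEXT,
        likes INTEGER DEFAULT 0, retweets INTEGER DEFAULT 0
    )
""")
# Invented sample rows standing in for scraped data
rows = [
    ("1", "github", "release day", 500, 120),
    ("2", "github", "bug fix", 50, 4),
    ("3", "github", "keynote", 2000, 900),
]
db.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?, ?)", rows)

top = db.execute(
    """SELECT text, likes + retweets AS engagement FROM tweets
       WHERE author_handle = ? ORDER BY engagement DESC LIMIT 2""",
    ("github",),
).fetchall()
print(top)  # [('keynote', 2900), ('release day', 620)]
```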
Anti-Detection Strategies That Actually Work
Twitter invests heavily in bot detection. Here's what matters:
Browser fingerprinting: Twitter checks your TLS fingerprint (JA3/JA4), canvas fingerprint, WebGL renderer, and timezone. Playwright's default Chromium fingerprint is widely known — Twitter may flag it. Using playwright-stealth helps mask automation signals:
# pip install playwright-stealth
from playwright_stealth import stealth_async


async def create_stealth_page(browser):
    context = await browser.new_context(
        viewport={"width": 1280, "height": 900},
        locale="en-US",
        timezone_id="America/New_York",
    )
    page = await context.new_page()
    await stealth_async(page)
    return page
IP rotation is non-negotiable for volume: A single IP gets rate-limited after 20-30 Playwright page loads. For anything beyond casual scraping, you need residential proxies. ThorData's residential network works well here because the IPs originate from real consumer connections — Twitter's detection systems treat them as normal user traffic rather than datacenter bot traffic.
# Playwright with rotating residential proxy
context = await browser.new_context(
    proxy={
        "server": "http://proxy.thordata.com:9000",
        "username": "YOUR_USER",
        "password": "YOUR_PASS",
    },
)
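If your provider hands you a pool of fixed endpoints rather than a single rotating gateway, cycling through them per browser context takes a few lines (the hostnames below are placeholders, not real proxies):

```python
from itertools import cycle

# Placeholder endpoints — substitute your provider's actual proxy pool
PROXY_POOL = cycle([
    {"server": "http://proxy1.example.com:9000"},
    {"server": "http://proxy2.example.com:9000"},
    {"server": "http://proxy3.example.com:9000"},
])


def next_proxy() -> dict:
    """Return the next proxy config for browser.new_context(proxy=...)."""
    return next(PROXY_POOL)
```

Pass `next_proxy()` as the `proxy` argument each time you create a fresh context.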
Behavioral patterns matter: Real users don't load 50 profiles in 50 seconds. Add realistic delays, scroll behavior, and occasional "idle" periods:
- 10-15 seconds between profile loads
- Scroll down slowly (2-3 small scrolls, not one giant jump)
- Vary your timing with `random.uniform(8, 18)` seconds between requests
- After every 20-30 requests, pause for 2-5 minutes
- Rotate your browser context (cookies, local storage) every 15-20 requests
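The pacing rules above fit naturally into a small helper so every request path shares one throttle; the thresholds mirror the list and are starting points, not guarantees:

```python
import random


class RequestPacer:
    """Decide how long to sleep before each request, with periodic long pauses."""

    def __init__(self, short=(8, 18), idle=(120, 300), burst=25):
        self.short = short   # seconds between ordinary requests
        self.idle = idle     # seconds for the periodic "idle" break
        self.burst = burst   # requests between long breaks
        self.count = 0

    def next_delay(self) -> float:
        self.count += 1
        if self.count % self.burst == 0:
            return random.uniform(*self.idle)   # 2-5 minute idle period
        return random.uniform(*self.short)      # normal 8-18s gap
```

Call `await asyncio.sleep(pacer.next_delay())` before each page load so the rhythm stays consistent across your whole scraper.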
Why Twitter Scrapers Break (and How to Handle It)
Any Twitter scraper you build will need maintenance within 2-6 weeks. Twitter actively fights scraping through:
- DOM selector changes — `data-testid` attributes get renamed without notice
- GraphQL endpoint rotation — the hash in the GraphQL URL changes on every deploy, sometimes weekly
- New JavaScript challenges — Twitter periodically adds Cloudflare-style challenges
- Rate limit adjustments — Thresholds decrease without warning
Mitigation strategy: Store your selectors in a config file instead of hardcoding them. When something breaks, update the config rather than editing code throughout your codebase. Add health checks that alert you when scrape success rate drops below 90%.
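Here's one way to implement the config-file approach: selectors live in JSON and are merged over built-in defaults at startup, so a data-testid rename becomes a one-line config edit (the default values match the selectors used earlier in this guide):

```python
import json
from pathlib import Path

DEFAULT_SELECTORS = {
    "tweet": 'article[data-testid="tweet"]',
    "tweet_text": '[data-testid="tweetText"]',
    "user_name": '[data-testid="UserName"]',
    "user_bio": '[data-testid="UserDescription"]',
    "like_button": '[data-testid="like"]',
}


def load_selectors(path: str = "selectors.json") -> dict:
    """Load selectors from a JSON config, falling back to built-in defaults."""
    p = Path(path)
    if p.exists():
        return {**DEFAULT_SELECTORS, **json.loads(p.read_text())}
    return dict(DEFAULT_SELECTORS)
```

When Twitter renames an attribute, you edit `selectors.json` and redeploy nothing.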
Business Use Cases for Twitter Scraping
Companies scrape Twitter data for measurable value across several areas:
- Brand monitoring — Track mentions of your brand and competitors. Feed sentiment analysis on scraped tweets into marketing dashboards for real-time reputation tracking.
- Influencer research — Evaluate potential partners by analyzing actual engagement rates (likes-to-followers ratio), posting frequency, and content themes.
- Market intelligence — Monitor public conversations about product launches, pricing changes, and industry trends. Faster and more honest than surveys.
- Academic research — Study information spread patterns, political discourse, public health narratives. Universities scrape Twitter extensively for social science research.
- Lead generation — Find people publicly asking for product recommendations. Someone tweeting "looking for a good CRM" is a warm lead.
- Crisis detection — Real-time monitoring for PR issues or viral complaints before they escalate.
Alternatives to Building Your Own Scraper
If maintaining a scraper sounds like too much work:
- Common Crawl / Wayback Machine — Historical web data including cached Twitter pages. Zero blocking risk but data is weeks to months old.
- Academic API access — Twitter's academic research program offers better rates. Worth applying if your use case qualifies.
- Third-party datasets — Internet Archive and academic institutions maintain curated Twitter datasets for research.
- Managed scraping services — Pay someone else to maintain the selectors and proxy infrastructure.
Final Thoughts
Scraping Twitter without the API is a cat-and-mouse game. For small-scale, one-time data collection — a few hundred profiles, a thousand tweets — the Playwright approach works reliably. For ongoing production workloads, budget for continuous maintenance: plan to update selectors monthly, rotate proxies, and monitor your success rate.
The code in this guide gives you a working foundation. Build on it incrementally, test against real pages before scaling up, and respect the platform's rate limits. Twitter scraping works in 2026 — it just requires more sophistication than it did three years ago.