How to Find and Use Unofficial APIs for Web Scraping (2026 Complete Guide)
Every modern web application is fundamentally a thin UI layer on top of an API. When you scrape the HTML of a React or Vue app, you are parsing the rendered output of a process that already fetched the data you want as clean, structured JSON and injected it into the page. You are decoding the end product of a pipeline rather than tapping into the pipeline itself.
The unofficial API approach bypasses all of that. Instead of fighting CSS selectors that change weekly, JavaScript rendering that requires a full browser, anti-bot walls that watch for headless Chrome signals, and HTML structure that varies by viewport size — you talk directly to the same JSON endpoint that the frontend does. The response is already structured, loads in milliseconds rather than seconds, and tends to be far more stable than the frontend markup because changing the API breaks the app while changing the HTML only breaks your scraper.
This guide is the complete playbook. We cover finding unofficial APIs through browser DevTools and mitmproxy interception, reproducing them in Python with the minimum required headers, handling authentication token flows and refresh patterns, navigating cursor-based pagination, managing rate limits with residential proxy rotation via ThorData, and designing scrapers resilient enough to survive unannounced API changes. The code examples are self-contained and ready to adapt to your target.
The Core Insight: Every SPA is an API Client
When you open a modern web application and see a product listing, your browser has already done the work you want to do. It sent an HTTP request to an endpoint like /api/v2/products?category=electronics&sort=popular&limit=50, received a JSON response with all the product data, and rendered it into HTML you can read. The JSON response was already there — you just never saw it because the browser swallowed it.
Your job as a scraper is simply to intercept that request and replay it yourself. Once you can do that, you can call the endpoint as many times as you want with whatever parameters you want, without running a browser at all. Pagination that would require scrolling through dozens of pages of HTML is just incrementing a page parameter or following a cursor field. Filtering that would require manipulating complex UI widgets is just changing query parameters.
The browser DevTools Network tab is the window into this layer. Every request the browser makes is visible there — the URL, the method, the request headers, the request body, and the full response. For most modern web apps, you will find the exact endpoint returning the data you want within five minutes of opening DevTools.
Finding APIs with Browser DevTools
Open the site you want to scrape. Right-click anywhere → Inspect (or F12) → Network tab. You will see a stream of requests. Most are irrelevant: static assets (images, CSS, JavaScript bundles), analytics pings, ad tracking, font downloads. Filter them out by clicking the "Fetch/XHR" filter button. This shows only XMLHttpRequest and Fetch API calls — the programmatic HTTP requests the JavaScript code makes to load data.
Now interact with the page. Scroll down to load more results. Use the search box. Apply a filter. Click to a detail view. Watch the Network tab. You are looking for requests that return JSON arrays or objects with the actual data you care about — product records, user profiles, search results, whatever.
Identifying the right request:
- Look for responses with Content-Type: application/json
- Response bodies that are arrays ([{...}, {...}]) are often paginated lists
- Response bodies that are objects ({"data": {...}, "meta": {...}}) are often detail views or wrapped lists
- Ignore responses under 1KB — probably analytics pings
- Check the Response tab in DevTools to preview the content
What to record once you find it:
- The full URL (including query parameters)
- The HTTP method (GET or POST)
- All request headers (especially Authorization, Cookie, and any custom headers like X-Api-Key or X-Client-Version)
- The request body if it is a POST request
- Example response to understand the structure
Right-click the request → "Copy as cURL" to get a working command you can immediately test in your terminal.
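A copied cURL command can also be converted into httpx-ready pieces programmatically. The sketch below is a rough parser, not a complete cURL implementation: it handles only the common flags (-H, -X, -d/--data-raw) and ignores the rest, and the example command and token are invented.

```python
import shlex

def parse_curl(command: str) -> dict:
    """Turn a DevTools 'Copy as cURL' command into method/url/headers/data.

    A sketch: handles only -H/--header, -X/--request, and -d/--data-raw;
    other flags (--compressed, -b, etc.) are skipped.
    """
    tokens = shlex.split(command)
    result = {"method": "GET", "url": None, "headers": {}, "data": None}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            result["headers"][name.strip()] = value.strip()
            i += 2
        elif tok in ("-X", "--request"):
            result["method"] = tokens[i + 1]
            i += 2
        elif tok in ("-d", "--data", "--data-raw"):
            result["data"] = tokens[i + 1]
            result["method"] = "POST"  # simplification; -X wins in real cURL
            i += 2
        elif not tok.startswith("-"):
            result["url"] = tok
            i += 1
        else:
            i += 1  # flag we don't model
    return result

cmd = ("curl 'https://api.example.com/v2/search?q=python' "
       "-H 'Accept: application/json' "
       "-H 'Authorization: Bearer TOKEN' --compressed")
parsed = parse_curl(cmd)
```

The parsed dict maps directly onto `httpx.request(parsed["method"], parsed["url"], headers=parsed["headers"], content=parsed["data"])`.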
Live Examples You Can Try Right Now
Reddit JSON API
Append .json to almost any Reddit URL and get a structured JSON response:
curl "https://www.reddit.com/r/python/top.json?t=week&limit=25" \
-H "User-Agent: MyResearchBot/1.0 ([email protected])"
This returns a data.children array with post objects containing title, score, url, author, created_utc, num_comments, and much more. Pagination uses the after parameter — you get the after value from the response and pass it in the next request:
import httpx

def get_reddit_posts(subreddit: str, limit: int = 100) -> list[dict]:
    url = f"https://www.reddit.com/r/{subreddit}/top.json"
    headers = {"User-Agent": "ResearchBot/1.0 ([email protected])"}
    posts = []
    after = None
    while len(posts) < limit:
        params = {"t": "month", "limit": 100}
        if after:
            params["after"] = after
        resp = httpx.get(url, params=params, headers=headers)
        data = resp.json()
        children = data["data"]["children"]
        if not children:
            break
        for child in children:
            post = child["data"]
            posts.append({
                "id": post["id"],
                "title": post["title"],
                "score": post["score"],
                "url": post["url"],
                "author": post["author"],
                "subreddit": post["subreddit"],
                "num_comments": post["num_comments"],
                "created_utc": post["created_utc"],
                "is_self": post["is_self"],
                "selftext": post.get("selftext", ""),
            })
        after = data["data"]["after"]
        if not after:
            break
    return posts[:limit]
Hacker News Firebase API
The official HN API is public and well-documented but most people have never used it:
import httpx
import asyncio

BASE = "https://hacker-news.firebaseio.com/v0"

async def get_top_stories(count: int = 30) -> list[dict]:
    async with httpx.AsyncClient() as client:
        # Get list of top story IDs
        resp = await client.get(f"{BASE}/topstories.json")
        ids = resp.json()[:count]

        # Fetch each story concurrently
        async def get_item(item_id: int) -> dict:
            r = await client.get(f"{BASE}/item/{item_id}.json")
            return r.json()

        stories = await asyncio.gather(*[get_item(i) for i in ids])
        return [s for s in stories if s and s.get("type") == "story"]
YouTube InnerTube Search API
Open YouTube, search for something, and watch the Network tab for POST requests to www.youtube.com/youtubei/v1/search. The request body contains your query and the response contains the full search results with video IDs, titles, view counts, upload dates, and channel info:
import httpx

def youtube_search(query: str, max_results: int = 20) -> list[dict]:
    url = "https://www.youtube.com/youtubei/v1/search"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Content-Type": "application/json",
        "X-YouTube-Client-Name": "1",
        "X-YouTube-Client-Version": "2.20240101.00.00",
    }
    body = {
        "context": {
            "client": {
                "clientName": "WEB",
                "clientVersion": "2.20240101.00.00",
                "hl": "en",
                "gl": "US",
            }
        },
        "query": query,
    }
    resp = httpx.post(url, headers=headers, json=body)
    data = resp.json()
    videos = []
    try:
        contents = data["contents"]["twoColumnSearchResultsRenderer"]["primaryContents"]
        items = contents["sectionListRenderer"]["contents"][0]["itemSectionRenderer"]["contents"]
        for item in items:
            if "videoRenderer" not in item:
                continue
            v = item["videoRenderer"]
            videos.append({
                "video_id": v.get("videoId"),
                "title": v.get("title", {}).get("runs", [{}])[0].get("text", ""),
                "channel": v.get("ownerText", {}).get("runs", [{}])[0].get("text", ""),
                "view_count": v.get("viewCountText", {}).get("simpleText", ""),
                "published": v.get("publishedTimeText", {}).get("simpleText", ""),
                "duration": v.get("lengthText", {}).get("simpleText", ""),
                "url": f"https://youtube.com/watch?v={v.get('videoId')}",
            })
    except (KeyError, IndexError):
        pass  # Structure may change — check data shape on failure
    return videos[:max_results]
Finding Mobile App APIs with mitmproxy
Native mobile applications have no HTML layer to scrape — they communicate with their backend exclusively through APIs. This makes them ideal targets: the APIs are clean, well-structured (the app team designed them for their own use), and often return more data than the web version.
The technique is a man-in-the-middle SSL proxy. You route the phone's traffic through your laptop, decrypt it with a custom CA certificate, and watch every request the app makes.
Setup (one-time):
pip install mitmproxy
mitmproxy # Starts on port 8080
On your iPhone or Android:
1. Go to Settings → Wi-Fi → tap your network → HTTP Proxy → Manual
2. Set Server to your laptop's local IP, Port to 8080
3. Visit mitm.it in the phone browser and install the CA certificate for your platform
4. Trust the certificate in Settings (iPhone: Settings → General → About → Certificate Trust Settings)
Now open any app and use it normally. Every HTTPS request appears in mitmproxy, fully decrypted.
What you will find:
- The exact API endpoints the app calls
- Required request headers (often including version headers, device IDs, and signatures)
- Authentication flows (how the app logs in, how tokens are obtained)
- Response schemas for every data type
- Pagination patterns
Common mobile API characteristics:
Mobile APIs are designed for bandwidth efficiency. They typically return paginated responses of 20-50 items, use cursor-based pagination rather than page numbers (because insertions between pages break offset-based pagination), and version their endpoints with /v1/, /v2/ patterns. Authentication usually involves a long-lived refresh token and a short-lived access token.
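The insertion problem is easy to demonstrate. Below is a toy simulation with an in-memory newest-first feed and invented post IDs: an offset crawl re-reads an item after a new post is inserted, while a cursor crawl (resume strictly after the last seen ID) does not.

```python
# Simulate why offset pagination breaks when new items arrive mid-crawl.

def page_by_offset(feed: list, offset: int, limit: int) -> list:
    return feed[offset:offset + limit]

def page_by_cursor(feed: list, cursor, limit: int) -> list:
    # Cursor = the last item seen; resume strictly after it.
    if cursor is None:
        start = 0
    else:
        start = next(i for i, item in enumerate(feed) if item == cursor) + 1
    return feed[start:start + limit]

feed = [f"post_{i}" for i in range(10, 0, -1)]  # post_10 (newest) .. post_1

# Offset crawl: read page 1, then a new post arrives, then read page 2.
page1 = page_by_offset(feed, 0, 3)               # post_10, post_9, post_8
feed_after_insert = ["post_11"] + feed           # new post shifts everything
page2 = page_by_offset(feed_after_insert, 3, 3)  # post_8 appears again

# Cursor crawl is unaffected: resume after the last ID we saw.
cursor_page2 = page_by_cursor(feed_after_insert, page1[-1], 3)
```

Offset page 2 duplicates `post_8`; the cursor page picks up cleanly at `post_7`.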
Here is a generic Python client built from mitmproxy-intercepted mobile API calls:
import httpx
import time
from typing import Optional

class MobileAPIClient:
    def __init__(self, base_url: str, app_version: str, device_id: str):
        self.base_url = base_url
        self.session = httpx.Client(
            headers={
                "User-Agent": f"App/{app_version} (iPhone; iOS 17.0)",
                "Accept": "application/json",
                "X-App-Version": app_version,
                "X-Device-ID": device_id,
                "X-Platform": "iOS",
            },
            timeout=15.0,
        )
        self.access_token: Optional[str] = None
        self.refresh_token: Optional[str] = None
        self.token_expires_at: float = 0

    def login(self, username: str, password: str) -> bool:
        resp = self.session.post(f"{self.base_url}/auth/login", json={
            "username": username,
            "password": password,
        })
        if resp.status_code != 200:
            return False
        data = resp.json()
        self.access_token = data["access_token"]
        self.refresh_token = data["refresh_token"]
        self.token_expires_at = time.time() + data.get("expires_in", 3600)
        self.session.headers["Authorization"] = f"Bearer {self.access_token}"
        return True

    def _refresh_if_needed(self):
        if time.time() >= self.token_expires_at - 60:  # Refresh 60s early
            resp = self.session.post(f"{self.base_url}/auth/refresh", json={
                "refresh_token": self.refresh_token,
            })
            if resp.status_code == 200:
                data = resp.json()
                self.access_token = data["access_token"]
                self.token_expires_at = time.time() + data.get("expires_in", 3600)
                self.session.headers["Authorization"] = f"Bearer {self.access_token}"

    def get(self, path: str, **kwargs) -> httpx.Response:
        self._refresh_if_needed()
        resp = self.session.get(f"{self.base_url}{path}", **kwargs)
        if resp.status_code == 401:
            self._refresh_if_needed()
            resp = self.session.get(f"{self.base_url}{path}", **kwargs)
        return resp

    def paginate(self, path: str, data_key: str = "items") -> list[dict]:
        """Fetch all pages using cursor-based pagination."""
        all_items = []
        cursor = None
        while True:
            params = {"limit": 50}
            if cursor:
                params["cursor"] = cursor
            resp = self.get(path, params=params)
            if resp.status_code != 200:
                break
            data = resp.json()
            items = data.get(data_key, [])
            all_items.extend(items)
            cursor = data.get("next_cursor") or data.get("pagination", {}).get("next_cursor")
            if not cursor or not items:
                break
        return all_items
Reproducing API Calls in Python with httpx
httpx is a modern alternative to requests. It supports both sync and async APIs, has HTTP/2 support (required by some modern APIs), handles redirects and cookies cleanly, and integrates well with proxy configuration.
The general workflow for reproducing a captured API call:
- Start with every header from the DevTools copy
- Make the request work
- Strip headers one at a time to find the minimum required set
- Parameterize it
import httpx
import json

# Step 1: Start with all headers from DevTools
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.example.com/search",
    "Origin": "https://www.example.com",
    "X-Requested-With": "XMLHttpRequest",
    "X-Api-Version": "2024-01",
    "Authorization": "Bearer REPLACE_WITH_ACTUAL_TOKEN",
    "Cookie": "session=REPLACE_WITH_ACTUAL_COOKIE",
}

# Step 2: Make it work, then strip headers to find minimum set
# Often only User-Agent + Authorization are required
minimal_headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Authorization": "Bearer TOKEN",
}

with httpx.Client() as client:
    resp = client.get(
        "https://api.example.com/v2/search",
        headers=minimal_headers,
        params={"q": "python", "page": 1, "limit": 50},
    )
    print(resp.status_code)
    print(json.dumps(resp.json(), indent=2)[:500])  # Preview first 500 chars
Handling Guest Tokens
Many APIs issue a temporary "guest" token on the first page load, then use it for subsequent API calls. This is common on social platforms and e-commerce sites. Here is how to automate the token acquisition:
import httpx
import re

def get_guest_token(homepage_url: str) -> str:
    """
    Fetch the homepage and extract a guest API token.
    Token is often embedded in a <script> tag or returned by an /init endpoint.
    """
    resp = httpx.get(homepage_url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    })

    # Pattern 1: Token in script tag as a JSON value
    match = re.search(r'"token"\s*:\s*"([A-Za-z0-9\-_\.]+)"', resp.text)
    if match:
        return match.group(1)

    # Pattern 2: Token set as a cookie
    if "api_token" in resp.cookies:
        return resp.cookies["api_token"]

    # Pattern 3: Call a dedicated init endpoint
    init_resp = httpx.post(
        f"{homepage_url}/api/init",
        headers={"User-Agent": "Mozilla/5.0 ..."},
        json={"client": "web"},
    )
    return init_resp.json().get("guest_token", "")

def make_api_request(endpoint: str, token: str, params: dict) -> dict:
    resp = httpx.get(
        endpoint,
        headers={"Authorization": f"Bearer {token}"},
        params=params,
    )
    if resp.status_code == 401:
        # Token expired — get a new one and retry
        new_token = get_guest_token("https://www.example.com")
        resp = httpx.get(
            endpoint,
            headers={"Authorization": f"Bearer {new_token}"},
            params=params,
        )
    return resp.json()
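To sanity-check the script-tag pattern offline, you can run the same regex against a captured homepage body. The HTML snippet and token value below are invented for illustration:

```python
import re

# A toy homepage payload with the token embedded in a script tag, the shape
# the "Pattern 1" regex above targets. Token and config keys are invented.
html = """
<html><head>
<script>
window.__CONFIG__ = {"api":{"token":"eyJhbGciOiJub25lIn0.guest-42","ttl":3600}};
</script>
</head><body>...</body></html>
"""

match = re.search(r'"token"\s*:\s*"([A-Za-z0-9\-_\.]+)"', html)
token = match.group(1) if match else None
```

If the regex comes back empty on a real site, search the raw HTML for a known fragment of the token (copied from DevTools) to find the actual key name.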
Complete Examples by API Pattern
Cursor-Based Pagination
import httpx
from typing import Iterator

def paginate_cursor(
    client: httpx.Client,
    url: str,
    headers: dict,
    params: dict,
    items_key: str = "items",
    cursor_key: str = "next_cursor",
    limit: int = 50,
) -> Iterator[dict]:
    """Generic cursor-based pagination — yields individual items."""
    cursor = None
    while True:
        page_params = {**params, "limit": limit}
        if cursor:
            page_params["cursor"] = cursor
        resp = client.get(url, headers=headers, params=page_params)
        if resp.status_code != 200:
            print(f"Request failed: {resp.status_code}")
            break
        data = resp.json()
        items = data.get(items_key, [])
        for item in items:
            yield item
        cursor = data.get(cursor_key)
        if not cursor or not items:
            break

# Usage
with httpx.Client() as client:
    count = 0
    for item in paginate_cursor(
        client,
        url="https://api.example.com/v1/posts",
        headers={"Authorization": "Bearer TOKEN"},
        params={"category": "tech"},
    ):
        print(item.get("title", ""))
        count += 1
        if count >= 200:  # Safety limit
            break
GraphQL Endpoints
Twitter/X, Facebook, Instagram, and many modern apps use GraphQL. The URL is always the same; only the query body changes.
import httpx

def graphql_query(
    endpoint: str,
    query: str,
    variables: dict,
    headers: dict,
) -> dict:
    """Execute a GraphQL query against an unofficial endpoint."""
    resp = httpx.post(
        endpoint,
        headers={**headers, "Content-Type": "application/json"},
        json={"query": query, "variables": variables},
    )
    resp.raise_for_status()
    data = resp.json()
    if "errors" in data:
        raise ValueError(f"GraphQL errors: {data['errors']}")
    return data.get("data", {})

# Example: Instagram-style GraphQL for user posts
POSTS_QUERY = """
query UserPosts($userId: String!, $count: Int!, $after: String) {
  user(id: $userId) {
    timeline(first: $count, after: $after) {
      pageInfo {
        hasNextPage
        endCursor
      }
      edges {
        node {
          id
          caption
          likeCount
          commentCount
          timestamp
          mediaUrl
        }
      }
    }
  }
}
"""

def get_all_user_posts(user_id: str, headers: dict) -> list[dict]:
    posts = []
    after = None
    while True:
        variables = {"userId": user_id, "count": 50}
        if after:
            variables["after"] = after
        data = graphql_query(
            "https://api.example.com/graphql",
            POSTS_QUERY,
            variables,
            headers,
        )
        timeline = data.get("user", {}).get("timeline", {})
        edges = timeline.get("edges", [])
        posts.extend(edge["node"] for edge in edges)
        page_info = timeline.get("pageInfo", {})
        if not page_info.get("hasNextPage"):
            break
        after = page_info.get("endCursor")
    return posts
Async Batch Scraping
import asyncio
import httpx

async def fetch_items_async(
    item_ids: list[str],
    base_url: str,
    headers: dict,
    concurrency: int = 10,
) -> list[dict]:
    """Fetch many item detail pages concurrently."""
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch_one(client: httpx.AsyncClient, item_id: str) -> dict:
        async with semaphore:
            try:
                resp = await client.get(
                    f"{base_url}/items/{item_id}",
                    headers=headers,
                    timeout=15.0,
                )
                if resp.status_code == 200:
                    return {"id": item_id, "status": "ok", "data": resp.json()}
                return {"id": item_id, "status": f"http_{resp.status_code}", "data": None}
            except Exception as e:
                return {"id": item_id, "status": "error", "error": str(e), "data": None}

    async with httpx.AsyncClient() as client:
        tasks = [fetch_one(client, item_id) for item_id in item_ids]
        results = await asyncio.gather(*tasks)
    return list(results)
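The semaphore is what keeps a batch like this from opening hundreds of sockets at once. Here is a standalone sketch that verifies the cap by tracking how many workers are inside the guarded section at once (a short sleep stands in for the HTTP call):

```python
import asyncio

async def run_capped(n_tasks: int = 12, concurrency: int = 4) -> int:
    """Run n_tasks workers through a semaphore and return peak concurrency."""
    sem = asyncio.Semaphore(concurrency)
    active = 0
    peak = 0

    async def worker():
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the HTTP request
            active -= 1

    await asyncio.gather(*[worker() for _ in range(n_tasks)])
    return peak

peak = asyncio.run(run_capped())
```

`peak` never exceeds the semaphore limit, which is exactly the property you rely on when tuning `concurrency` against an API's rate limits.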
Proxy Rotation with ThorData for Rate Limit Bypass
When you make many requests from a single IP, even the most carefully crafted headers eventually trigger IP-based rate limits or bans. This is separate from the fingerprinting problem; it is purely volume-based. Rotating residential proxies solves it by spreading your requests across thousands of real ISP-assigned IP addresses.
ThorData provides residential proxy pools that look like genuine user traffic to API servers. Each request can exit from a different residential IP, so no single address accumulates enough volume to trip rate limits.
Basic httpx integration:
import httpx

THORDATA_PROXY = "http://USERNAME:[email protected]:9000"

# Synchronous
with httpx.Client(proxy=THORDATA_PROXY) as client:
    resp = client.get("https://api.example.com/data", headers={"User-Agent": "Mozilla/5.0 ..."})
    print(resp.json())

# Async (run inside an event loop, e.g. via asyncio.run)
async def fetch_via_proxy():
    async with httpx.AsyncClient(proxy=THORDATA_PROXY) as client:
        resp = await client.get("https://api.example.com/data", headers={"User-Agent": "Mozilla/5.0 ..."})
        print(resp.json())
For sticky sessions (needed when the API requires the same IP across multiple requests for session continuity):
# ThorData sticky session — same IP for the duration of the session ID
session_id = "my_session_abc123"
sticky_proxy = f"http://USERNAME-session-{session_id}:[email protected]:9000"
For geo-targeted requests (needed when scraping content that varies by country):
us_proxy = "http://USERNAME-country-us:[email protected]:9000"
uk_proxy = "http://USERNAME-country-gb:[email protected]:9000"
de_proxy = "http://USERNAME-country-de:[email protected]:9000"
A robust scraper that rotates proxies and handles failures:
import httpx
import asyncio
import random
from typing import Optional

THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"

def make_proxy_url(country: Optional[str] = None, session: Optional[str] = None) -> str:
    user = THORDATA_USER
    if country:
        user += f"-country-{country}"
    if session:
        user += f"-session-{session}"
    return f"http://{user}:{THORDATA_PASS}@proxy.thordata.com:9000"

async def scrape_with_rotation(
    urls: list[str],
    headers: dict,
    concurrency: int = 5,
    country: str = "us",
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def scrape_one(url: str) -> dict:
        async with semaphore:
            proxy = make_proxy_url(country=country)
            async with httpx.AsyncClient(proxy=proxy, timeout=20.0) as client:
                for attempt in range(3):
                    try:
                        resp = await client.get(url, headers=headers)
                        if resp.status_code == 429:
                            wait = 5 * (2 ** attempt) + random.uniform(0, 2)
                            await asyncio.sleep(wait)
                            continue
                        resp.raise_for_status()
                        return {"url": url, "status": "ok", "data": resp.json()}
                    except httpx.HTTPError as e:
                        if attempt == 2:
                            return {"url": url, "status": "error", "error": str(e)}
                        await asyncio.sleep(2 ** attempt)
                return {"url": url, "status": "failed", "error": "max retries exceeded"}

    tasks = [scrape_one(url) for url in urls]
    results = await asyncio.gather(*tasks)
    return list(results)
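The 429 branch above waits according to a jittered exponential schedule. Here is that schedule in isolation, with the same base and jitter as the scraper; the random jitter spreads retries out so concurrent workers don't all hammer the API at the same instant:

```python
import random

def backoff_delay(attempt: int, base: float = 5.0, max_jitter: float = 2.0) -> float:
    """Jittered exponential backoff: base * 2^attempt plus up to max_jitter seconds."""
    return base * (2 ** attempt) + random.uniform(0, max_jitter)

# Delays for attempts 0, 1, 2: roughly 5s, 10s, 20s, each plus 0-2s of jitter
delays = [backoff_delay(a) for a in range(3)]
```

When an API sends a Retry-After header on 429 responses, prefer honoring that value over the computed delay.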
Validating and Monitoring API Response Structure
Unofficial APIs change without notice. When the app team ships a refactor, your scraper silently starts returning wrong data or crashing on unexpected shapes. Build structure validation in from the start:
from typing import Any, Optional
import logging

logger = logging.getLogger(__name__)

def safe_get(obj: dict, *keys, default=None) -> Any:
    """Safely traverse nested dict keys."""
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key, default)
        if obj is None:
            return default
    return obj

def validate_product_response(data: dict) -> Optional[dict]:
    """
    Validate the shape of an API response before parsing.
    Returns None if the structure is unexpected (API changed).
    """
    required_keys = ["id", "title", "price"]
    missing = [k for k in required_keys if k not in data]
    if missing:
        logger.warning(f"API response missing expected keys: {missing}. Got: {list(data.keys())}")
        return None

    # Validate types
    if not isinstance(data.get("id"), (str, int)):
        logger.warning(f"Unexpected type for 'id': {type(data.get('id'))}")
        return None

    return {
        "id": str(data["id"]),
        "title": data.get("title", ""),
        "price": data.get("price"),  # Accept None — might be sold out
        "brand": data.get("brand") or data.get("brand_name") or data.get("manufacturer"),
        # Handle field renames gracefully
        "category": data.get("category") or data.get("category_name") or data.get("categoryId"),
        "in_stock": data.get("in_stock") or data.get("available") or data.get("is_available"),
    }

def monitor_api_health(responses: list[dict]) -> dict:
    """Aggregate response validation metrics."""
    total = len(responses)
    valid = sum(1 for r in responses if validate_product_response(r) is not None)
    return {
        "total": total,
        "valid": valid,
        "invalid": total - valid,
        "validity_rate": valid / total if total else 0,
    }
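The point of safe_get is that it degrades to a default instead of raising when any level of the nested path disappears. A standalone demonstration (the helper is repeated here so the snippet runs on its own, and the response dict is invented):

```python
from typing import Any

def safe_get(obj: dict, *keys, default=None) -> Any:
    # Same helper as above, repeated so this snippet is self-contained.
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key, default)
        if obj is None:
            return default
    return obj

resp = {"data": {"product": {"pricing": {"amount": 1999}}}}

# Full path present: returns the leaf value
amount = safe_get(resp, "data", "product", "pricing", "amount")

# "reviews" does not exist: returns the default instead of raising KeyError
missing = safe_get(resp, "data", "product", "reviews", "count", default=0)
```

Compare with `resp["data"]["product"]["reviews"]["count"]`, which would crash your pipeline the moment the API drops a field.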
Output Schema for API-Scraped Data
from dataclasses import dataclass, field, asdict
from typing import Optional
import json
from datetime import datetime, timezone

@dataclass
class APIScrapedItem:
    # Identity
    source_api: str  # "reddit", "youtube", "custom_retailer"
    item_id: str     # Native ID from the API
    scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    # Core content
    title: str = ""
    body: Optional[str] = None
    url: Optional[str] = None

    # Engagement metrics
    score: Optional[int] = None
    comment_count: Optional[int] = None
    view_count: Optional[int] = None
    like_count: Optional[int] = None

    # Author
    author_id: Optional[str] = None
    author_name: Optional[str] = None

    # Timestamps
    published_at: Optional[str] = None
    updated_at: Optional[str] = None

    # Commerce (for product APIs)
    price_raw: Optional[str] = None
    price_cents: Optional[int] = None
    in_stock: Optional[bool] = None
    category: Optional[str] = None

    # Scraper metadata
    api_endpoint: Optional[str] = None
    api_version: Optional[str] = None
    proxy_country: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, default=str)

# Example output
example = APIScrapedItem(
    source_api="reddit",
    item_id="t3_abc123",
    title="Best Python web scraping libraries in 2026",
    url="https://reddit.com/r/Python/comments/abc123",
    score=847,
    comment_count=93,
    author_name="pythonista_2026",
    published_at="2026-03-28T14:22:00Z",
    api_endpoint="https://www.reddit.com/r/Python/top.json",
)
print(example.to_json())
Output:
{
  "source_api": "reddit",
  "item_id": "t3_abc123",
  "scraped_at": "2026-03-31T14:22:00.123456+00:00",
  "title": "Best Python web scraping libraries in 2026",
  "body": null,
  "url": "https://reddit.com/r/Python/comments/abc123",
  "score": 847,
  "comment_count": 93,
  "view_count": null,
  "like_count": null,
  "author_id": null,
  "author_name": "pythonista_2026",
  "published_at": "2026-03-28T14:22:00Z",
  "updated_at": null,
  "price_raw": null,
  "price_cents": null,
  "in_stock": null,
  "category": null,
  "api_endpoint": "https://www.reddit.com/r/Python/top.json",
  "api_version": null,
  "proxy_country": null
}
7 Real-World Use Cases
1. Social Media Analytics Dashboard
Aggregate engagement data across platforms to track brand mentions, sentiment, and competitor performance:
import httpx
import asyncio
from datetime import datetime, timezone

async def collect_brand_mentions(brand: str) -> dict:
    async with httpx.AsyncClient() as client:
        # Reddit
        reddit_resp = await client.get(
            "https://www.reddit.com/search.json",
            params={"q": brand, "sort": "new", "limit": 25},
            headers={"User-Agent": "BrandMonitor/1.0"},
        )
        reddit_data = reddit_resp.json()
        reddit_mentions = [
            {"platform": "reddit", "title": c["data"]["title"], "score": c["data"]["score"]}
            for c in reddit_data.get("data", {}).get("children", [])
        ]
    return {
        "brand": brand,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "reddit_mentions": reddit_mentions,
        "total_mentions": len(reddit_mentions),
    }
2. E-commerce Price Intelligence
Track prices across multiple retailers using their mobile APIs (intercepted via mitmproxy) for real-time competitive intelligence:
import httpx
import asyncio

async def collect_price_data(product_ids: list[str], retailers: list[dict]) -> list[dict]:
    """
    retailers: [{"name": "retailer_a", "url": "https://api.retailer-a.com/products/{id}", "headers": {...}}]
    """
    results = []
    async with httpx.AsyncClient(proxy="http://USER:[email protected]:9000") as client:
        for product_id in product_ids:
            for retailer in retailers:
                url = retailer["url"].format(id=product_id)
                resp = await client.get(url, headers=retailer["headers"])
                if resp.status_code == 200:
                    data = resp.json()
                    results.append({
                        "product_id": product_id,
                        "retailer": retailer["name"],
                        "price": data.get("price") or data.get("currentPrice"),
                        "in_stock": data.get("inStock") or data.get("available"),
                    })
                await asyncio.sleep(0.2)
    return results
3. Job Market Research
Aggregate job postings from multiple job boards using their unofficial JSON APIs to analyze salary trends, required skills, and location data:
import httpx

def scrape_job_listings(title: str, location: str) -> list[dict]:
    """Skeleton: capture the real endpoint and parameters in DevTools first."""
    jobs = []
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
    # Indeed's web app loads results via internal JSON requests; the exact
    # URL and parameters vary by session, so confirm them in the Network tab
    resp = httpx.get(
        "https://www.indeed.com/jobs",
        params={"q": title, "l": location, "format": "json"},
        headers=headers,
    )
    # Parse response here; inspect resp.json() / resp.text before
    # committing to a schema, since the structure varies by session
    return jobs
4. Financial Data Collection
Pull stock prices, crypto rates, and financial metrics from exchange APIs:
import httpx

def get_crypto_prices(coin_ids: list[str]) -> dict[str, dict]:
    """CoinGecko has a free public API — no auth required. Pass CoinGecko coin IDs."""
    ids = ",".join(c.lower() for c in coin_ids)
    resp = httpx.get(
        "https://api.coingecko.com/api/v3/simple/price",
        params={"ids": ids, "vs_currencies": "usd"},
    )
    return resp.json()

prices = get_crypto_prices(["bitcoin", "ethereum", "solana"])
# {"bitcoin": {"usd": 65420.0}, "ethereum": {"usd": 3120.0}, ...}
5. News and Media Monitoring
RSS feeds are the original unofficial API — structured, stable, and free:
import httpx
from xml.etree import ElementTree

def scrape_rss_feed(url: str) -> list[dict]:
    resp = httpx.get(url, headers={"User-Agent": "RSSReader/1.0"})
    root = ElementTree.fromstring(resp.content)
    items = []
    for item in root.findall(".//item"):
        items.append({
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "description": item.findtext("description", ""),
            "pubDate": item.findtext("pubDate", ""),
            "category": item.findtext("category", ""),
        })
    return items

# Combine multiple sources
tech_news = []
for feed_url in [
    "https://feeds.arstechnica.com/arstechnica/technology-lab",
    "https://www.wired.com/feed/rss",
    "https://techcrunch.com/feed/",
]:
    tech_news.extend(scrape_rss_feed(feed_url))
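To verify the field extraction without touching the network, run the same ElementTree logic against a literal feed. The snippet below is an invented minimal RSS 2.0 document:

```python
from xml.etree import ElementTree

# A minimal, invented RSS feed exercising the same parsing path as above.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item>
    <title>Unofficial APIs 101</title>
    <link>https://example.com/apis-101</link>
    <pubDate>Mon, 30 Mar 2026 10:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

root = ElementTree.fromstring(SAMPLE_RSS)
items = [
    {
        "title": i.findtext("title", ""),
        "link": i.findtext("link", ""),
        "pubDate": i.findtext("pubDate", ""),
    }
    for i in root.findall(".//item")
]
```

Note that some feeds use Atom rather than RSS (entry/updated instead of item/pubDate), so check the root element before assuming this schema.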
6. Review Aggregation
Collect product reviews from multiple platforms to build comprehensive sentiment datasets:
import httpx

async def aggregate_product_reviews(product_name: str, asin: str) -> list[dict]:
    reviews = []
    async with httpx.AsyncClient() as client:
        # Amazon's mobile app has its own reviews endpoint;
        # intercept via mitmproxy for the actual URL and headers
        headers = {
            "User-Agent": "Amazon/15.0 (iPhone; iOS 17.0)",
            "X-Amzn-RequestId": "unique-request-id",
        }
        resp = await client.get(
            f"https://api.amazon.com/products/{asin}/reviews",
            headers=headers,
            params={"pageSize": 20, "sortBy": "RECENT"},
        )
        if resp.status_code == 200:
            data = resp.json()
            for review in data.get("reviews", []):
                reviews.append({
                    "source": "amazon",
                    "product": product_name,
                    "rating": review.get("rating"),
                    "title": review.get("title"),
                    "body": review.get("body"),
                    "helpful_votes": review.get("helpfulVotes"),
                    "verified": review.get("verifiedPurchase"),
                    "date": review.get("date"),
                })
    return reviews
7. Research Data Collection
Academic APIs, government data portals, and research databases often have JSON APIs:
import httpx

def search_semantic_scholar(query: str, year_from: int = 2024, limit: int = 50) -> list[dict]:
    """Semantic Scholar has a free public API."""
    resp = httpx.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={
            "query": query,
            "year": f"{year_from}-",
            "limit": limit,
            "fields": "title,abstract,year,citationCount,authors,url",
        },
        headers={"User-Agent": "ResearchBot/1.0 ([email protected])"},
    )
    data = resp.json()
    return [
        {
            "title": p.get("title"),
            "abstract": p.get("abstract"),
            "year": p.get("year"),
            "citations": p.get("citationCount"),
            "authors": [a["name"] for a in p.get("authors", [])],
            "url": p.get("url"),
        }
        for p in data.get("data", [])
    ]
The Complete Workflow
The approach that works on every modern web and mobile application:
1. Intercept — DevTools Network tab for web, mitmproxy for mobile. Filter for Fetch/XHR. Interact with the page and watch for JSON responses containing your target data.
2. Isolate — Right-click the request → Copy as cURL. Test it in your terminal. Confirm you get the expected data.
3. Reproduce — Port to Python with httpx. Start with all headers. Confirm it works. Then strip headers one by one to find the minimum required set.
4. Parameterize — Replace hardcoded values (search terms, IDs, cursors) with variables. Test with different inputs.
5. Paginate — Implement the pagination pattern: cursor-based (most common), offset-based, or time-based. Test that you can retrieve multiple pages.
6. Harden — Add token refresh logic. Add retry with exponential backoff. Add response structure validation. Add rate limit detection and adaptive backoff.
7. Scale — Add proxy rotation via ThorData if needed. Implement concurrency with asyncio semaphores. Add output validation and monitoring.
The data is already there, structured and waiting, flowing between the browser and the server on every page load. You just need to know where to look.