How to Find and Use Unofficial APIs for Web Scraping (2026 Complete Guide)
Every modern web application is fundamentally a thin UI layer on top of an API. When you scrape the HTML of a React or Vue app, you are parsing the rendered output of a process that already fetched the data you want as clean, structured JSON and injected it into the page. You are decoding the end product of a pipeline rather than tapping into the pipeline itself.
The unofficial API approach bypasses all of that. Instead of fighting CSS selectors that change weekly, JavaScript rendering that requires a full browser, anti-bot walls that watch for headless Chrome signals, and HTML structure that varies by viewport size — you talk directly to the same JSON endpoint that the frontend does. The response is already structured, loads in milliseconds rather than seconds, and tends to be far more stable than the frontend markup because changing the API breaks the app while changing the HTML only breaks your scraper.
This guide is the complete playbook. We cover finding unofficial APIs through browser DevTools and mitmproxy interception, reproducing them in Python with the minimum required headers, handling authentication token flows and refresh patterns, navigating cursor-based pagination, managing rate limits with residential proxy rotation via ThorData, and designing scrapers resilient enough to survive unannounced API changes. The code examples are self-contained and ready to adapt to your target.
The Core Insight: Every SPA is an API Client
When you open a modern web application and see a product listing, your browser has already done the work you want to do. It sent an HTTP request to an endpoint like /api/v2/products?category=electronics&sort=popular&limit=50, received a JSON response with all the product data, and rendered it into HTML you can read. The JSON response was already there — you just never saw it because the browser swallowed it.
Your job as a scraper is simply to intercept that request and replay it yourself. Once you can do that, you can call the endpoint as many times as you want with whatever parameters you want, without running a browser at all. Pagination that would require scrolling through dozens of pages of HTML is just incrementing a page parameter or following a cursor field. Filtering that would require manipulating complex UI widgets is just changing query parameters.
The browser DevTools Network tab is the window into this layer. Every request the browser makes is visible there — the URL, the method, the request headers, the request body, and the full response. For most modern web apps, you will find the exact endpoint returning the data you want within five minutes of opening DevTools.
Finding APIs with Browser DevTools
Open the site you want to scrape. Right-click anywhere → Inspect (or F12) → Network tab. You will see a stream of requests. Most are irrelevant: static assets (images, CSS, JavaScript bundles), analytics pings, ad tracking, font downloads. Filter them out by clicking the "Fetch/XHR" filter button. This shows only XMLHttpRequest and Fetch API calls — the programmatic HTTP requests the JavaScript code makes to load data.
Now interact with the page. Scroll down to load more results. Use the search box. Apply a filter. Click to a detail view. Watch the Network tab. You are looking for requests that return JSON arrays or objects with the actual data you care about — product records, user profiles, search results, whatever.
Identifying the right request:
- Look for responses with Content-Type: application/json
- Response bodies that are arrays ([{...}, {...}]) are often paginated lists
- Response bodies that are objects ({"data": {...}, "meta": {...}}) are often detail views or wrapped lists
- Ignore responses under 1KB — probably analytics pings
- Check the Response tab in DevTools to preview the content
What to record once you find it:
- The full URL (including query parameters)
- The HTTP method (GET or POST)
- All request headers (especially Authorization, Cookie, and any custom headers like X-Api-Key or X-Client-Version)
- The request body if it is a POST request
- Example response to understand the structure
Right-click the request → "Copy as cURL" to get a working command you can immediately test in your terminal.
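A copied cURL command can also be converted into httpx-ready pieces programmatically. The sketch below is a rough parser, not a complete cURL implementation: it handles only the common flags (-H, -X, -d/--data-raw) and ignores the rest, and the example command and token are invented.

```python
import shlex

def parse_curl(command: str) -> dict:
    """Turn a DevTools 'Copy as cURL' command into method/url/headers/data.

    A sketch: handles only -H/--header, -X/--request, and -d/--data-raw;
    other flags (--compressed, -b, etc.) are skipped.
    """
    tokens = shlex.split(command)
    result = {"method": "GET", "url": None, "headers": {}, "data": None}
    i = 1  # skip the leading "curl"
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            result["headers"][name.strip()] = value.strip()
            i += 2
        elif tok in ("-X", "--request"):
            result["method"] = tokens[i + 1]
            i += 2
        elif tok in ("-d", "--data", "--data-raw"):
            result["data"] = tokens[i + 1]
            result["method"] = "POST"  # simplification; -X wins in real cURL
            i += 2
        elif not tok.startswith("-"):
            result["url"] = tok
            i += 1
        else:
            i += 1  # flag we don't model
    return result

cmd = ("curl 'https://api.example.com/v2/search?q=python' "
       "-H 'Accept: application/json' "
       "-H 'Authorization: Bearer TOKEN' --compressed")
parsed = parse_curl(cmd)
```

The parsed dict maps directly onto `httpx.request(parsed["method"], parsed["url"], headers=parsed["headers"], content=parsed["data"])`.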
Live Examples You Can Try Right Now
Reddit JSON API
Append .json to almost any Reddit URL and get a structured JSON response:
curl "https://www.reddit.com/r/python/top.json?t=week&limit=25" \
-H "User-Agent: MyResearchBot/1.0 ([email protected])"
This returns a data.children array with post objects containing title, score, url, author, created_utc, num_comments, and much more. Pagination uses the after parameter — you get the after value from the response and pass it in the next request:
import httpx

def get_reddit_posts(subreddit: str, limit: int = 100) -> list[dict]:
    url = f"https://www.reddit.com/r/{subreddit}/top.json"
    headers = {"User-Agent": "ResearchBot/1.0 ([email protected])"}
    posts = []
    after = None
    while len(posts) < limit:
        params = {"t": "month", "limit": 100}
        if after:
            params["after"] = after
        resp = httpx.get(url, params=params, headers=headers)
        data = resp.json()
        children = data["data"]["children"]
        if not children:
            break
        for child in children:
            post = child["data"]
            posts.append({
                "id": post["id"],
                "title": post["title"],
                "score": post["score"],
                "url": post["url"],
                "author": post["author"],
                "subreddit": post["subreddit"],
                "num_comments": post["num_comments"],
                "created_utc": post["created_utc"],
                "is_self": post["is_self"],
                "selftext": post.get("selftext", ""),
            })
        after = data["data"]["after"]
        if not after:
            break
    return posts[:limit]
Hacker News Firebase API
The official HN API is public and well-documented but most people have never used it:
import httpx
import asyncio

BASE = "https://hacker-news.firebaseio.com/v0"

async def get_top_stories(count: int = 30) -> list[dict]:
    async with httpx.AsyncClient() as client:
        # Get list of top story IDs
        resp = await client.get(f"{BASE}/topstories.json")
        ids = resp.json()[:count]

        # Fetch each story concurrently
        async def get_item(item_id: int) -> dict:
            r = await client.get(f"{BASE}/item/{item_id}.json")
            return r.json()

        stories = await asyncio.gather(*[get_item(i) for i in ids])
        return [s for s in stories if s and s.get("type") == "story"]
YouTube InnerTube Search API
Open YouTube, search for something, and watch the Network tab for POST requests to www.youtube.com/youtubei/v1/search. The request body contains your query and the response contains the full search results with video IDs, titles, view counts, upload dates, and channel info:
import httpx

def youtube_search(query: str, max_results: int = 20) -> list[dict]:
    url = "https://www.youtube.com/youtubei/v1/search"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Content-Type": "application/json",
        "X-YouTube-Client-Name": "1",
        "X-YouTube-Client-Version": "2.20240101.00.00",
    }
    body = {
        "context": {
            "client": {
                "clientName": "WEB",
                "clientVersion": "2.20240101.00.00",
                "hl": "en",
                "gl": "US",
            }
        },
        "query": query,
    }
    resp = httpx.post(url, headers=headers, json=body)
    data = resp.json()
    videos = []
    try:
        contents = data["contents"]["twoColumnSearchResultsRenderer"]["primaryContents"]
        items = contents["sectionListRenderer"]["contents"][0]["itemSectionRenderer"]["contents"]
        for item in items:
            if "videoRenderer" not in item:
                continue
            v = item["videoRenderer"]
            videos.append({
                "video_id": v.get("videoId"),
                "title": v.get("title", {}).get("runs", [{}])[0].get("text", ""),
                "channel": v.get("ownerText", {}).get("runs", [{}])[0].get("text", ""),
                "view_count": v.get("viewCountText", {}).get("simpleText", ""),
                "published": v.get("publishedTimeText", {}).get("simpleText", ""),
                "duration": v.get("lengthText", {}).get("simpleText", ""),
                "url": f"https://youtube.com/watch?v={v.get('videoId')}",
            })
    except (KeyError, IndexError):
        pass  # Structure may change — check data shape on failure
    return videos[:max_results]
Finding Mobile App APIs with mitmproxy
Native mobile applications have no HTML layer to scrape — they communicate with their backend exclusively through APIs. This makes them ideal targets: the APIs are clean, well-structured (the app team designed them for their own use), and often return more data than the web version.
The technique is a man-in-the-middle SSL proxy. You route the phone's traffic through your laptop, decrypt it with a custom CA certificate, and watch every request the app makes.
Setup (one-time):
pip install mitmproxy
mitmproxy # Starts on port 8080
On your iPhone or Android:
1. Go to Settings → Wi-Fi → tap your network → HTTP Proxy → Manual
2. Set Server to your laptop's local IP, Port to 8080
3. Visit mitm.it in the phone browser and install the CA certificate for your platform
4. Trust the certificate in Settings (iPhone: Settings → General → About → Certificate Trust Settings)
Now open any app and use it normally. Every HTTPS request appears in mitmproxy, fully decrypted.
What you will find:
- The exact API endpoints the app calls
- Required request headers (often including version headers, device IDs, and signatures)
- Authentication flows (how the app logs in, how tokens are obtained)
- Response schemas for every data type
- Pagination patterns
Common mobile API characteristics:
Mobile APIs are designed for bandwidth efficiency. They typically return paginated responses of 20-50 items, use cursor-based pagination rather than page numbers (because insertions between pages break offset-based pagination), and version their endpoints with /v1/, /v2/ patterns. Authentication usually involves a long-lived refresh token and a short-lived access token.
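The insertion problem is easy to demonstrate. Below is a toy simulation with an in-memory newest-first feed and invented post IDs: an offset crawl re-reads an item after a new post is inserted, while a cursor crawl (resume strictly after the last seen ID) does not.

```python
# Simulate why offset pagination breaks when new items arrive mid-crawl.

def page_by_offset(feed: list, offset: int, limit: int) -> list:
    return feed[offset:offset + limit]

def page_by_cursor(feed: list, cursor, limit: int) -> list:
    # Cursor = the last item seen; resume strictly after it.
    if cursor is None:
        start = 0
    else:
        start = next(i for i, item in enumerate(feed) if item == cursor) + 1
    return feed[start:start + limit]

feed = [f"post_{i}" for i in range(10, 0, -1)]  # post_10 (newest) .. post_1

# Offset crawl: read page 1, then a new post arrives, then read page 2.
page1 = page_by_offset(feed, 0, 3)               # post_10, post_9, post_8
feed_after_insert = ["post_11"] + feed           # new post shifts everything
page2 = page_by_offset(feed_after_insert, 3, 3)  # post_8 appears again

# Cursor crawl is unaffected: resume after the last ID we saw.
cursor_page2 = page_by_cursor(feed_after_insert, page1[-1], 3)
```

Offset page 2 duplicates `post_8`; the cursor page picks up cleanly at `post_7`.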
Here is a generic Python client built from mitmproxy-intercepted mobile API calls:
import httpx
import time
from typing import Optional

class MobileAPIClient:
    def __init__(self, base_url: str, app_version: str, device_id: str):
        self.base_url = base_url
        self.session = httpx.Client(
            headers={
                "User-Agent": f"App/{app_version} (iPhone; iOS 17.0)",
                "Accept": "application/json",
                "X-App-Version": app_version,
                "X-Device-ID": device_id,
                "X-Platform": "iOS",
            },
            timeout=15.0,
        )
        self.access_token: Optional[str] = None
        self.refresh_token: Optional[str] = None
        self.token_expires_at: float = 0

    def login(self, username: str, password: str) -> bool:
        resp = self.session.post(f"{self.base_url}/auth/login", json={
            "username": username,
            "password": password,
        })
        if resp.status_code != 200:
            return False
        data = resp.json()
        self.access_token = data["access_token"]
        self.refresh_token = data["refresh_token"]
        self.token_expires_at = time.time() + data.get("expires_in", 3600)
        self.session.headers["Authorization"] = f"Bearer {self.access_token}"
        return True

    def _refresh_if_needed(self):
        if time.time() >= self.token_expires_at - 60:  # Refresh 60s early
            resp = self.session.post(f"{self.base_url}/auth/refresh", json={
                "refresh_token": self.refresh_token,
            })
            if resp.status_code == 200:
                data = resp.json()
                self.access_token = data["access_token"]
                self.token_expires_at = time.time() + data.get("expires_in", 3600)
                self.session.headers["Authorization"] = f"Bearer {self.access_token}"

    def get(self, path: str, **kwargs) -> httpx.Response:
        self._refresh_if_needed()
        resp = self.session.get(f"{self.base_url}{path}", **kwargs)
        if resp.status_code == 401:
            self._refresh_if_needed()
            resp = self.session.get(f"{self.base_url}{path}", **kwargs)
        return resp

    def paginate(self, path: str, data_key: str = "items") -> list[dict]:
        """Fetch all pages using cursor-based pagination."""
        all_items = []
        cursor = None
        while True:
            params = {"limit": 50}
            if cursor:
                params["cursor"] = cursor
            resp = self.get(path, params=params)
            if resp.status_code != 200:
                break
            data = resp.json()
            items = data.get(data_key, [])
            all_items.extend(items)
            cursor = data.get("next_cursor") or data.get("pagination", {}).get("next_cursor")
            if not cursor or not items:
                break
        return all_items
Reproducing API Calls in Python with httpx
httpx is a modern alternative to requests. It supports both sync and async APIs, has HTTP/2 support (required by some modern APIs), handles redirects and cookies cleanly, and integrates well with proxy configuration.
The general workflow for reproducing a captured API call:
- Start with every header from the DevTools copy
- Make the request work
- Strip headers one at a time to find the minimum required set
- Parameterize it
import httpx
import json

# Step 1: Start with all headers from DevTools
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.example.com/search",
    "Origin": "https://www.example.com",
    "X-Requested-With": "XMLHttpRequest",
    "X-Api-Version": "2024-01",
    "Authorization": "Bearer REPLACE_WITH_ACTUAL_TOKEN",
    "Cookie": "session=REPLACE_WITH_ACTUAL_COOKIE",
}

# Step 2: Make it work, then strip headers to find minimum set
# Often only User-Agent + Authorization are required
minimal_headers = {
    "User-Agent": "Mozilla/5.0 ...",
    "Authorization": "Bearer TOKEN",
}

with httpx.Client() as client:
    resp = client.get(
        "https://api.example.com/v2/search",
        headers=minimal_headers,
        params={"q": "python", "page": 1, "limit": 50},
    )
    print(resp.status_code)
    print(json.dumps(resp.json(), indent=2)[:500])  # Preview first 500 chars
Handling Guest Tokens
Many APIs issue a temporary "guest" token on the first page load, then use it for subsequent API calls. This is common on social platforms and e-commerce sites. Here is how to automate the token acquisition:
import httpx
import re

def get_guest_token(homepage_url: str) -> str:
    """
    Fetch the homepage and extract a guest API token.
    Token is often embedded in a <script> tag or returned by an /init endpoint.
    """
    resp = httpx.get(homepage_url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    })

    # Pattern 1: Token in script tag as a JSON value
    match = re.search(r'"token"\s*:\s*"([A-Za-z0-9\-_\.]+)"', resp.text)
    if match:
        return match.group(1)

    # Pattern 2: Token set as a cookie
    if "api_token" in resp.cookies:
        return resp.cookies["api_token"]

    # Pattern 3: Call a dedicated init endpoint
    init_resp = httpx.post(
        f"{homepage_url}/api/init",
        headers={"User-Agent": "Mozilla/5.0 ..."},
        json={"client": "web"},
    )
    return init_resp.json().get("guest_token", "")

def make_api_request(endpoint: str, token: str, params: dict) -> dict:
    resp = httpx.get(
        endpoint,
        headers={"Authorization": f"Bearer {token}"},
        params=params,
    )
    if resp.status_code == 401:
        # Token expired — get a new one and retry
        new_token = get_guest_token("https://www.example.com")
        resp = httpx.get(
            endpoint,
            headers={"Authorization": f"Bearer {new_token}"},
            params=params,
        )
    return resp.json()
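To sanity-check the script-tag pattern offline, you can run the same regex against a captured homepage body. The HTML snippet and token value below are invented for illustration:

```python
import re

# A toy homepage payload with the token embedded in a script tag, the shape
# the "Pattern 1" regex above targets. Token and config keys are invented.
html = """
<html><head>
<script>
window.__CONFIG__ = {"api":{"token":"eyJhbGciOiJub25lIn0.guest-42","ttl":3600}};
</script>
</head><body>...</body></html>
"""

match = re.search(r'"token"\s*:\s*"([A-Za-z0-9\-_\.]+)"', html)
token = match.group(1) if match else None
```

If the regex comes back empty on a real site, search the raw HTML for a known fragment of the token (copied from DevTools) to find the actual key name.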
Complete Examples by API Pattern
Cursor-Based Pagination
import httpx
from typing import Iterator

def paginate_cursor(
    client: httpx.Client,
    url: str,
    headers: dict,
    params: dict,
    items_key: str = "items",
    cursor_key: str = "next_cursor",
    limit: int = 50,
) -> Iterator[dict]:
    """Generic cursor-based pagination — yields individual items."""
    cursor = None
    while True:
        page_params = {**params, "limit": limit}
        if cursor:
            page_params["cursor"] = cursor
        resp = client.get(url, headers=headers, params=page_params)
        if resp.status_code != 200:
            print(f"Request failed: {resp.status_code}")
            break
        data = resp.json()
        items = data.get(items_key, [])
        for item in items:
            yield item
        cursor = data.get(cursor_key)
        if not cursor or not items:
            break

# Usage
with httpx.Client() as client:
    count = 0
    for item in paginate_cursor(
        client,
        url="https://api.example.com/v1/posts",
        headers={"Authorization": "Bearer TOKEN"},
        params={"category": "tech"},
    ):
        print(item.get("title", ""))
        count += 1
        if count >= 200:  # Safety limit
            break
GraphQL Endpoints
Twitter/X, Facebook, Instagram, and many modern apps use GraphQL. The URL is always the same; only the query body changes.
import httpx

def graphql_query(
    endpoint: str,
    query: str,
    variables: dict,
    headers: dict,
) -> dict:
    """Execute a GraphQL query against an unofficial endpoint."""
    resp = httpx.post(
        endpoint,
        headers={**headers, "Content-Type": "application/json"},
        json={"query": query, "variables": variables},
    )
    resp.raise_for_status()
    data = resp.json()
    if "errors" in data:
        raise ValueError(f"GraphQL errors: {data['errors']}")
    return data.get("data", {})

# Example: Instagram-style GraphQL for user posts
POSTS_QUERY = """
query UserPosts($userId: String!, $count: Int!, $after: String) {
  user(id: $userId) {
    timeline(first: $count, after: $after) {
      pageInfo {
        hasNextPage
        endCursor
      }
      edges {
        node {
          id
          caption
          likeCount
          commentCount
          timestamp
          mediaUrl
        }
      }
    }
  }
}
"""

def get_all_user_posts(user_id: str, headers: dict) -> list[dict]:
    posts = []
    after = None
    while True:
        variables = {"userId": user_id, "count": 50}
        if after:
            variables["after"] = after
        data = graphql_query(
            "https://api.example.com/graphql",
            POSTS_QUERY,
            variables,
            headers,
        )
        timeline = data.get("user", {}).get("timeline", {})
        edges = timeline.get("edges", [])
        posts.extend(edge["node"] for edge in edges)
        page_info = timeline.get("pageInfo", {})
        if not page_info.get("hasNextPage"):
            break
        after = page_info.get("endCursor")
    return posts
Async Batch Scraping
import asyncio
import httpx

async def fetch_items_async(
    item_ids: list[str],
    base_url: str,
    headers: dict,
    concurrency: int = 10,
) -> list[dict]:
    """Fetch many item detail pages concurrently."""
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch_one(client: httpx.AsyncClient, item_id: str) -> dict:
        async with semaphore:
            try:
                resp = await client.get(
                    f"{base_url}/items/{item_id}",
                    headers=headers,
                    timeout=15.0,
                )
                if resp.status_code == 200:
                    return {"id": item_id, "status": "ok", "data": resp.json()}
                return {"id": item_id, "status": f"http_{resp.status_code}", "data": None}
            except Exception as e:
                return {"id": item_id, "status": "error", "error": str(e), "data": None}

    async with httpx.AsyncClient() as client:
        tasks = [fetch_one(client, item_id) for item_id in item_ids]
        results = await asyncio.gather(*tasks)
    return list(results)
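The semaphore is what keeps a batch like this from opening hundreds of sockets at once. Here is a standalone sketch that verifies the cap by tracking how many workers are inside the guarded section at once (a short sleep stands in for the HTTP call):

```python
import asyncio

async def run_capped(n_tasks: int = 12, concurrency: int = 4) -> int:
    """Run n_tasks workers through a semaphore and return peak concurrency."""
    sem = asyncio.Semaphore(concurrency)
    active = 0
    peak = 0

    async def worker():
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the HTTP request
            active -= 1

    await asyncio.gather(*[worker() for _ in range(n_tasks)])
    return peak

peak = asyncio.run(run_capped())
```

`peak` never exceeds the semaphore limit, which is exactly the property you rely on when tuning `concurrency` against an API's rate limits.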
Proxy Rotation with ThorData for Rate Limit Bypass
When you make many requests from a single IP, even the most carefully crafted headers eventually trigger IP-based rate limits or bans. This is separate from the fingerprinting problem; it is purely volume-based. Rotating residential proxies solves it by spreading your requests across thousands of real ISP-assigned IP addresses.
ThorData provides residential proxy pools that look like genuine user traffic to API servers. Each request can exit from a different residential IP, so no single address accumulates enough volume to trip rate limits.
Basic httpx integration:
import httpx

THORDATA_PROXY = "http://USERNAME:[email protected]:9000"

# Synchronous
with httpx.Client(proxy=THORDATA_PROXY) as client:
    resp = client.get("https://api.example.com/data", headers={"User-Agent": "Mozilla/5.0 ..."})
    print(resp.json())

# Async (run inside an event loop, e.g. via asyncio.run)
async def fetch_via_proxy():
    async with httpx.AsyncClient(proxy=THORDATA_PROXY) as client:
        resp = await client.get("https://api.example.com/data", headers={"User-Agent": "Mozilla/5.0 ..."})
        print(resp.json())
For sticky sessions (needed when the API requires the same IP across multiple requests for session continuity):
# ThorData sticky session — same IP for the duration of the session ID
session_id = "my_session_abc123"
sticky_proxy = f"http://USERNAME-session-{session_id}:[email protected]:9000"
For geo-targeted requests (needed when scraping content that varies by country):
us_proxy = "http://USERNAME-country-us:[email protected]:9000"
uk_proxy = "http://USERNAME-country-gb:[email protected]:9000"
de_proxy = "http://USERNAME-country-de:[email protected]:9000"
A robust scraper that rotates proxies and handles failures:
import httpx
import asyncio
import random
from typing import Optional

THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"

def make_proxy_url(country: Optional[str] = None, session: Optional[str] = None) -> str:
    user = THORDATA_USER
    if country:
        user += f"-country-{country}"
    if session:
        user += f"-session-{session}"
    return f"http://{user}:{THORDATA_PASS}@proxy.thordata.com:9000"

async def scrape_with_rotation(
    urls: list[str],
    headers: dict,
    concurrency: int = 5,
    country: str = "us",
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def scrape_one(url: str) -> dict:
        async with semaphore:
            proxy = make_proxy_url(country=country)
            async with httpx.AsyncClient(proxy=proxy, timeout=20.0) as client:
                for attempt in range(3):
                    try:
                        resp = await client.get(url, headers=headers)
                        if resp.status_code == 429:
                            wait = 5 * (2 ** attempt) + random.uniform(0, 2)
                            await asyncio.sleep(wait)
                            continue
                        resp.raise_for_status()
                        return {"url": url, "status": "ok", "data": resp.json()}
                    except httpx.HTTPError as e:
                        if attempt == 2:
                            return {"url": url, "status": "error", "error": str(e)}
                        await asyncio.sleep(2 ** attempt)
                return {"url": url, "status": "failed", "error": "max retries exceeded"}

    tasks = [scrape_one(url) for url in urls]
    results = await asyncio.gather(*tasks)
    return list(results)
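The 429 branch above waits according to a jittered exponential schedule. Here is that schedule in isolation, with the same base and jitter as the scraper; the random jitter spreads retries out so concurrent workers don't all hammer the API at the same instant:

```python
import random

def backoff_delay(attempt: int, base: float = 5.0, max_jitter: float = 2.0) -> float:
    """Jittered exponential backoff: base * 2^attempt plus up to max_jitter seconds."""
    return base * (2 ** attempt) + random.uniform(0, max_jitter)

# Delays for attempts 0, 1, 2: roughly 5s, 10s, 20s, each plus 0-2s of jitter
delays = [backoff_delay(a) for a in range(3)]
```

When an API sends a Retry-After header on 429 responses, prefer honoring that value over the computed delay.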
Validating and Monitoring API Response Structure
Unofficial APIs change without notice. When the app team ships a refactor, your scraper silently starts returning wrong data or crashing on unexpected shapes. Build structure validation in from the start:
from typing import Any, Optional
import logging

logger = logging.getLogger(__name__)

def safe_get(obj: dict, *keys, default=None) -> Any:
    """Safely traverse nested dict keys."""
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key, default)
        if obj is None:
            return default
    return obj

def validate_product_response(data: dict) -> Optional[dict]:
    """
    Validate the shape of an API response before parsing.
    Returns None if the structure is unexpected (API changed).
    """
    required_keys = ["id", "title", "price"]
    missing = [k for k in required_keys if k not in data]
    if missing:
        logger.warning(f"API response missing expected keys: {missing}. Got: {list(data.keys())}")
        return None

    # Validate types
    if not isinstance(data.get("id"), (str, int)):
        logger.warning(f"Unexpected type for 'id': {type(data.get('id'))}")
        return None

    return {
        "id": str(data["id"]),
        "title": data.get("title", ""),
        "price": data.get("price"),  # Accept None — might be sold out
        "brand": data.get("brand") or data.get("brand_name") or data.get("manufacturer"),
        # Handle field renames gracefully
        "category": data.get("category") or data.get("category_name") or data.get("categoryId"),
        "in_stock": data.get("in_stock") or data.get("available") or data.get("is_available"),
    }

def monitor_api_health(responses: list[dict]) -> dict:
    """Aggregate response validation metrics."""
    total = len(responses)
    valid = sum(1 for r in responses if validate_product_response(r) is not None)
    return {
        "total": total,
        "valid": valid,
        "invalid": total - valid,
        "validity_rate": valid / total if total else 0,
    }
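The point of safe_get is that it degrades to a default instead of raising when any level of the nested path disappears. A standalone demonstration (the helper is repeated here so the snippet runs on its own, and the response dict is invented):

```python
from typing import Any

def safe_get(obj: dict, *keys, default=None) -> Any:
    # Same helper as above, repeated so this snippet is self-contained.
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key, default)
        if obj is None:
            return default
    return obj

resp = {"data": {"product": {"pricing": {"amount": 1999}}}}

# Full path present: returns the leaf value
amount = safe_get(resp, "data", "product", "pricing", "amount")

# "reviews" does not exist: returns the default instead of raising KeyError
missing = safe_get(resp, "data", "product", "reviews", "count", default=0)
```

Compare with `resp["data"]["product"]["reviews"]["count"]`, which would crash your pipeline the moment the API drops a field.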
Output Schema for API-Scraped Data
from dataclasses import dataclass, field, asdict
from typing import Optional
import json
from datetime import datetime, timezone

@dataclass
class APIScrapedItem:
    # Identity
    source_api: str  # "reddit", "youtube", "custom_retailer"
    item_id: str     # Native ID from the API
    scraped_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    # Core content
    title: str = ""
    body: Optional[str] = None
    url: Optional[str] = None

    # Engagement metrics
    score: Optional[int] = None
    comment_count: Optional[int] = None
    view_count: Optional[int] = None
    like_count: Optional[int] = None

    # Author
    author_id: Optional[str] = None
    author_name: Optional[str] = None

    # Timestamps
    published_at: Optional[str] = None
    updated_at: Optional[str] = None

    # Commerce (for product APIs)
    price_raw: Optional[str] = None
    price_cents: Optional[int] = None
    in_stock: Optional[bool] = None
    category: Optional[str] = None

    # Scraper metadata
    api_endpoint: Optional[str] = None
    api_version: Optional[str] = None
    proxy_country: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, default=str)

# Example output
example = APIScrapedItem(
    source_api="reddit",
    item_id="t3_abc123",
    title="Best Python web scraping libraries in 2026",
    url="https://reddit.com/r/Python/comments/abc123",
    score=847,
    comment_count=93,
    author_name="pythonista_2026",
    published_at="2026-03-28T14:22:00Z",
    api_endpoint="https://www.reddit.com/r/Python/top.json",
)
print(example.to_json())
Output:
{
  "source_api": "reddit",
  "item_id": "t3_abc123",
  "scraped_at": "2026-03-31T14:22:00.123456+00:00",
  "title": "Best Python web scraping libraries in 2026",
  "body": null,
  "url": "https://reddit.com/r/Python/comments/abc123",
  "score": 847,
  "comment_count": 93,
  "view_count": null,
  "like_count": null,
  "author_id": null,
  "author_name": "pythonista_2026",
  "published_at": "2026-03-28T14:22:00Z",
  "updated_at": null,
  "price_raw": null,
  "price_cents": null,
  "in_stock": null,
  "category": null,
  "api_endpoint": "https://www.reddit.com/r/Python/top.json",
  "api_version": null,
  "proxy_country": null
}
7 Real-World Use Cases
1. Social Media Analytics Dashboard
Aggregate engagement data across platforms to track brand mentions, sentiment, and competitor performance:
import httpx
import asyncio
from datetime import datetime, timezone

async def collect_brand_mentions(brand: str) -> dict:
    async with httpx.AsyncClient() as client:
        # Reddit
        reddit_resp = await client.get(
            "https://www.reddit.com/search.json",
            params={"q": brand, "sort": "new", "limit": 25},
            headers={"User-Agent": "BrandMonitor/1.0"},
        )
        reddit_data = reddit_resp.json()
        reddit_mentions = [
            {"platform": "reddit", "title": c["data"]["title"], "score": c["data"]["score"]}
            for c in reddit_data.get("data", {}).get("children", [])
        ]
    return {
        "brand": brand,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "reddit_mentions": reddit_mentions,
        "total_mentions": len(reddit_mentions),
    }
2. E-commerce Price Intelligence
Track prices across multiple retailers using their mobile APIs (intercepted via mitmproxy) for real-time competitive intelligence:
import httpx
import asyncio

async def collect_price_data(product_ids: list[str], retailers: list[dict]) -> list[dict]:
    """
    retailers: [{"name": "retailer_a", "url": "https://api.retailer-a.com/products/{id}", "headers": {...}}]
    """
    results = []
    async with httpx.AsyncClient(proxy="http://USER:[email protected]:9000") as client:
        for product_id in product_ids:
            for retailer in retailers:
                url = retailer["url"].format(id=product_id)
                resp = await client.get(url, headers=retailer["headers"])
                if resp.status_code == 200:
                    data = resp.json()
                    results.append({
                        "product_id": product_id,
                        "retailer": retailer["name"],
                        "price": data.get("price") or data.get("currentPrice"),
                        "in_stock": data.get("inStock") or data.get("available"),
                    })
                await asyncio.sleep(0.2)
    return results
3. Job Market Research
Aggregate job postings from multiple job boards using their unofficial JSON APIs to analyze salary trends, required skills, and location data:
import httpx

def scrape_job_listings(title: str, location: str) -> list[dict]:
    """Skeleton: capture the real endpoint and parameters in DevTools first."""
    jobs = []
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}
    # Indeed's web app loads results via internal JSON requests; the exact
    # URL and parameters vary by session, so confirm them in the Network tab
    resp = httpx.get(
        "https://www.indeed.com/jobs",
        params={"q": title, "l": location, "format": "json"},
        headers=headers,
    )
    # Parse response here; inspect resp.json() / resp.text before
    # committing to a schema, since the structure varies by session
    return jobs
4. Financial Data Collection
Pull stock prices, crypto rates, and financial metrics from exchange APIs:
import httpx

def get_crypto_prices(coin_ids: list[str]) -> dict[str, dict]:
    """CoinGecko has a free public API — no auth required. Pass CoinGecko coin IDs."""
    ids = ",".join(c.lower() for c in coin_ids)
    resp = httpx.get(
        "https://api.coingecko.com/api/v3/simple/price",
        params={"ids": ids, "vs_currencies": "usd"},
    )
    return resp.json()

prices = get_crypto_prices(["bitcoin", "ethereum", "solana"])
# {"bitcoin": {"usd": 65420.0}, "ethereum": {"usd": 3120.0}, ...}
5. News and Media Monitoring
RSS feeds are the original unofficial API — structured, stable, and free:
import httpx
from xml.etree import ElementTree

def scrape_rss_feed(url: str) -> list[dict]:
    resp = httpx.get(url, headers={"User-Agent": "RSSReader/1.0"})
    root = ElementTree.fromstring(resp.content)
    items = []
    for item in root.findall(".//item"):
        items.append({
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "description": item.findtext("description", ""),
            "pubDate": item.findtext("pubDate", ""),
            "category": item.findtext("category", ""),
        })
    return items

# Combine multiple sources
tech_news = []
for feed_url in [
    "https://feeds.arstechnica.com/arstechnica/technology-lab",
    "https://www.wired.com/feed/rss",
    "https://techcrunch.com/feed/",
]:
    tech_news.extend(scrape_rss_feed(feed_url))
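To verify the field extraction without touching the network, run the same ElementTree logic against a literal feed. The snippet below is an invented minimal RSS 2.0 document:

```python
from xml.etree import ElementTree

# A minimal, invented RSS feed exercising the same parsing path as above.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item>
    <title>Unofficial APIs 101</title>
    <link>https://example.com/apis-101</link>
    <pubDate>Mon, 30 Mar 2026 10:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

root = ElementTree.fromstring(SAMPLE_RSS)
items = [
    {
        "title": i.findtext("title", ""),
        "link": i.findtext("link", ""),
        "pubDate": i.findtext("pubDate", ""),
    }
    for i in root.findall(".//item")
]
```

Note that some feeds use Atom rather than RSS (entry/updated instead of item/pubDate), so check the root element before assuming this schema.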
6. Review Aggregation
Collect product reviews from multiple platforms to build comprehensive sentiment datasets:
import httpx

async def aggregate_product_reviews(product_name: str, asin: str) -> list[dict]:
    reviews = []
    async with httpx.AsyncClient() as client:
        # Amazon's mobile app has its own reviews endpoint;
        # intercept via mitmproxy for the actual URL and headers
        headers = {
            "User-Agent": "Amazon/15.0 (iPhone; iOS 17.0)",
            "X-Amzn-RequestId": "unique-request-id",
        }
        resp = await client.get(
            f"https://api.amazon.com/products/{asin}/reviews",
            headers=headers,
            params={"pageSize": 20, "sortBy": "RECENT"},
        )
        if resp.status_code == 200:
            data = resp.json()
            for review in data.get("reviews", []):
                reviews.append({
                    "source": "amazon",
                    "product": product_name,
                    "rating": review.get("rating"),
                    "title": review.get("title"),
                    "body": review.get("body"),
                    "helpful_votes": review.get("helpfulVotes"),
                    "verified": review.get("verifiedPurchase"),
                    "date": review.get("date"),
                })
    return reviews
7. Research Data Collection
Academic APIs, government data portals, and research databases often have JSON APIs:
import httpx

def search_semantic_scholar(query: str, year_from: int = 2024, limit: int = 50) -> list[dict]:
    """Semantic Scholar has a free public API."""
    resp = httpx.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={
            "query": query,
            "year": f"{year_from}-",
            "limit": limit,
            "fields": "title,abstract,year,citationCount,authors,url",
        },
        headers={"User-Agent": "ResearchBot/1.0 ([email protected])"},
    )
    data = resp.json()
    return [
        {
            "title": p.get("title"),
            "abstract": p.get("abstract"),
            "year": p.get("year"),
            "citations": p.get("citationCount"),
            "authors": [a["name"] for a in p.get("authors", [])],
            "url": p.get("url"),
        }
        for p in data.get("data", [])
    ]
The Complete Workflow
The approach that works on every modern web and mobile application:
1. Intercept — DevTools Network tab for web, mitmproxy for mobile. Filter for Fetch/XHR. Interact with the page and watch for JSON responses containing your target data.
2. Isolate — Right-click the request → Copy as cURL. Test it in your terminal. Confirm you get the expected data.
3. Reproduce — Port to Python with httpx. Start with all headers. Confirm it works. Then strip headers one by one to find the minimum required set.
4. Parameterize — Replace hardcoded values (search terms, IDs, cursors) with variables. Test with different inputs.
5. Paginate — Implement the pagination pattern: cursor-based (most common), offset-based, or time-based. Test that you can retrieve multiple pages.
6. Harden — Add token refresh logic. Add retry with exponential backoff. Add response structure validation. Add rate limit detection and adaptive backoff.
7. Scale — Add proxy rotation via ThorData if needed. Implement concurrency with asyncio semaphores. Add output validation and monitoring.
The data is already there, structured and waiting, flowing between the browser and the server on every page load. You just need to know where to look.