Scraping TikTok in 2026: Video Data, Profiles, and the Unofficial API
TikTok is arguably the hardest major platform to scrape in 2026. It's not just "hard like Instagram" -- TikTok has an entirely different level of anti-bot infrastructure. Every API request requires cryptographic signatures that rotate every few minutes. The platform fingerprints your browser at the device level. And ByteDance has a legal team that actively sues scraping operations, with several successful lawsuits on record.
If you're thinking about scraping TikTok, you need to understand what you're dealing with before you write a single line of code.
Table of Contents
- Why TikTok Is So Difficult to Scrape
- The Official TikTok API: Extremely Limited
- The TikTok Research API
- The msToken and X-Bogus Signature System
- Public Profile og:meta Scraping
- Video Page Embedded JSON Extraction
- Hashtag and Trending Page Scraping
- Pagination Strategies
- Downloading TikTok Videos
- Proxy Strategy for TikTok
- Anti-Detection Techniques
- Storing TikTok Data
- Real Use Cases and What's Feasible
- Risks: Legal and Technical
- Realistic Alternatives for 2026
1. Why TikTok Is So Difficult to Scrape {#why-hard}
Most social platforms protect their data with rate limits, login walls, and maybe a CAPTCHA. TikTok does all of that, plus:
Cryptographic signature validation on every request. TikTok's internal API requires multiple parameters -- msToken, X-Bogus, _signature, and X-Tt-Params -- attached to every single request. These are generated client-side by obfuscated JavaScript that changes with each app update. Without valid signatures, the API returns empty responses or 403 errors.
Device fingerprinting at the hardware level. TikTok collects an extensive fingerprint: canvas hashes, WebGL renderer, installed fonts, screen resolution, battery status, touch support, and more. This fingerprint is tied to your session. A mismatch between what TikTok expects from a real device and what your scraper sends gets you flagged immediately.
Behavioral analysis beyond fingerprinting. Mouse movements, scroll patterns, time between requests, and interaction sequences are all monitored. Headless browsers are detected even with stealth plugins. Playwright and Puppeteer get caught within minutes without significant modification.
Cloudflare Bot Management integration. TikTok uses Cloudflare's enterprise bot protection, which adds another layer of JavaScript challenges, CAPTCHA triggers, and IP reputation checks on top of TikTok's native protection.
Legal enforcement. ByteDance has sued multiple scraping companies and won. They actively monitor for large-scale data collection and issue cease-and-desist letters. This isn't theoretical -- it's happening regularly in 2026.
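When a request fails, it helps to know which layer blocked it: Cloudflare's challenge or TikTok's native protection. A rough heuristic follows; the marker strings are assumptions drawn from commonly observed Cloudflare challenge pages, not an official list, so treat this as a starting point rather than a guarantee:

```python
# Heuristic check for Cloudflare challenge pages vs. ordinary blocks.
# Marker strings below are assumptions based on typical Cloudflare
# challenge HTML and headers -- verify against your own captures.
def looks_like_cf_challenge(status_code: int, body: str, headers: dict) -> bool:
    """Guess whether a blocked response is a Cloudflare JS challenge."""
    if status_code not in (403, 429, 503):
        return False
    markers = ("cf-chl", "challenge-platform", "Just a moment", "_cf_chl_opt")
    if any(m in body for m in markers):
        return True
    # Cloudflare normally stamps responses it serves with a CF-RAY header
    return "cf-ray" in {k.lower() for k in headers}
```

Distinguishing the two matters because the remedies differ: a Cloudflare challenge usually calls for a cleaner proxy or browser automation, while a TikTok-native block often means the fingerprint or signature is wrong.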
2. The Official TikTok API: Extremely Limited {#official-api}
TikTok offers an official API through the TikTok for Developers platform. To use it, you need to create an app, get it approved, and request specific scopes. The approval process is strict and primarily intended for apps that integrate with TikTok (posting content, managing ads, etc.).
For data collection, the official API is almost useless. You can access your own account data and limited information about videos your app's users have explicitly shared. There's no endpoint to search for users, browse trending videos, or pull data about arbitrary public profiles.
What the Official API Does Provide
import requests

TIKTOK_ACCESS_TOKEN = "your_oauth_access_token"
TIKTOK_BASE = "https://open.tiktokapis.com/v2"

def get_own_videos(max_count: int = 20) -> list[dict]:
    """Get videos from your own TikTok account via official API."""
    resp = requests.post(
        f"{TIKTOK_BASE}/video/list/",
        headers={
            "Authorization": f"Bearer {TIKTOK_ACCESS_TOKEN}",
            "Content-Type": "application/json",
        },
        json={
            "fields": ["id", "title", "video_description", "duration",
                       "cover_image_url", "share_url", "view_count",
                       "like_count", "comment_count", "share_count"],
            "max_count": max_count,
        }
    )
    resp.raise_for_status()
    return resp.json().get("data", {}).get("videos", [])

def get_user_info_official() -> dict:
    """Get info about the authenticated user."""
    resp = requests.get(
        f"{TIKTOK_BASE}/user/info/",
        headers={"Authorization": f"Bearer {TIKTOK_ACCESS_TOKEN}"},
        params={"fields": "open_id,union_id,avatar_url,display_name,"
                          "bio_description,profile_deep_link,is_verified,"
                          "follower_count,following_count,likes_count,video_count"}
    )
    resp.raise_for_status()
    return resp.json().get("data", {}).get("user", {})
This is only useful if you're building a TikTok integration where users authenticate with your app. For research or data collection, you need other methods.
3. The TikTok Research API {#research-api}
TikTok launched a Research API aimed at academic researchers. It provides access to public video data, comments, and user information through a structured query interface.
Requirements for Access
- Must be affiliated with a research institution (university, recognized research org)
- Submit a detailed application explaining research goals
- Wait 4-12 weeks for approval
- Pass IRB/ethics review if required by your institution
If you get access, the Research API is genuinely useful:
import requests

RESEARCH_ACCESS_TOKEN = "your_research_api_token"
RESEARCH_BASE = "https://open.tiktokapis.com/v2/research"

def search_videos_research(keywords: list[str],
                           start_date: str, end_date: str,
                           max_count: int = 100) -> list[dict]:
    """Query TikTok Research API for videos matching keywords."""
    resp = requests.post(
        f"{RESEARCH_BASE}/video/query/",
        headers={
            "Authorization": f"Bearer {RESEARCH_ACCESS_TOKEN}",
            "Content-Type": "application/json",
        },
        json={
            "query": {
                "and": [
                    {"operation": "IN", "field_name": "keyword",
                     "field_values": keywords},
                ]
            },
            "start_date": start_date,  # YYYYMMDD
            "end_date": end_date,
            "max_count": max_count,
            "fields": "id,video_description,create_time,region_code,"
                      "share_count,view_count,like_count,comment_count,"
                      "music_id,hashtag_names,username,effect_ids",
        }
    )
    resp.raise_for_status()
    return resp.json().get("data", {}).get("videos", [])

def get_video_comments_research(video_id: str,
                                max_count: int = 100) -> list[dict]:
    """Get comments for a specific video via Research API."""
    resp = requests.post(
        f"{RESEARCH_BASE}/video/comment/list/",
        headers={
            "Authorization": f"Bearer {RESEARCH_ACCESS_TOKEN}",
            "Content-Type": "application/json",
        },
        json={
            "video_id": video_id,
            "max_count": max_count,
            "fields": "id,video_id,text,like_count,reply_count,"
                      "parent_comment_id,create_time,username",
        }
    )
    resp.raise_for_status()
    return resp.json().get("data", {}).get("comments", [])
For most people reading this, the Research API won't be available. But if you're at a university, apply -- it's the only stable, legal way to access TikTok data at scale.
4. The msToken and X-Bogus Signature System {#signatures}
Here's where things get technical. TikTok's web frontend makes requests to internal API endpoints like /api/user/detail/ and /api/post/item_list/. Every request must include:
- msToken: Generated by TikTok's anti-bot JavaScript, tied to session, rotates frequently
- X-Bogus: Signature computed from URL + device fingerprint + timestamp, algorithm changes with every TikTok update
- _signature: Older parameter still required on some endpoints
- webid: Device identifier derived from browser fingerprint
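To make the moving parts concrete, here is roughly what an assembled item-list request URL looks like. Everything in this sketch is illustrative: the endpoint shape follows the paths mentioned above, the `aid` value is one commonly seen in web traffic captures (an assumption, not documented), and the token arguments are placeholders that must come from TikTok's own JavaScript at request time; hard-coded values will be rejected:

```python
from urllib.parse import urlencode

def build_item_list_url(sec_uid: str, cursor: int,
                        ms_token: str, x_bogus: str) -> str:
    """Assemble an /api/post/item_list/ style URL for illustration.
    ms_token and x_bogus are placeholders -- in practice they must be
    generated dynamically by TikTok's obfuscated JS for each request."""
    base = "https://www.tiktok.com/api/post/item_list/"
    params = {
        "aid": "1988",               # web app id seen in captures (assumption)
        "count": 35,
        "cursor": cursor,
        "secUid": sec_uid,
        "device_platform": "web_pc",
        "msToken": ms_token,
    }
    # X-Bogus is appended last, computed over the final query string
    return f"{base}?{urlencode(params)}&X-Bogus={x_bogus}"
```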
Open-source projects like TikTokApi attempt to replicate these signatures by running TikTok's own JavaScript in a headless browser. The approach works -- until TikTok pushes an update to their obfuscated JS, which happens every few days to weeks.
The fundamental problem: Any solution that depends on reverse-engineering the signature system has a shelf life measured in days to weeks. This is why most open-source TikTok scrapers are perpetually broken or require constant maintenance.
If you need the signed API approach, using a maintained library with active developers is the only practical path:
# TikTokApi library approach (requires Playwright for signature generation)
# Note: May be broken at time of reading -- check library's GitHub for status
from TikTokApi import TikTokApi
import asyncio

async def get_tiktok_user_videos(username: str, count: int = 30) -> list[dict]:
    async with TikTokApi() as api:
        await api.create_sessions(
            ms_tokens=["your_ms_token"],
            num_sessions=1,
            sleep_after=3
        )
        user = api.user(username)
        videos = []
        async for video in user.videos(count=count):
            videos.append({
                "id": video.id,
                "description": video.as_dict.get("desc", ""),
                "author": video.author.username,
                "stats": video.stats,
            })
        return videos

# videos = asyncio.run(get_tiktok_user_videos("charlidamelio", 30))
5. Public Profile og:meta Scraping {#og-meta}
Like most social platforms, TikTok serves server-rendered HTML for public profile pages. These pages include OpenGraph meta tags with basic profile data.
import requests
from html.parser import HTMLParser
import time
import random

class OGParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.og_data = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs_dict = dict(attrs)
            prop = attrs_dict.get("property", "")
            name = attrs_dict.get("name", "")
            if prop.startswith("og:"):
                self.og_data[prop] = attrs_dict.get("content", "")
            elif name in ("description", "twitter:description", "twitter:title"):
                self.og_data[name] = attrs_dict.get("content", "")

MOBILE_UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
             "AppleWebKit/605.1.15 (KHTML, like Gecko) "
             "Version/17.0 Mobile/15E148 Safari/604.1")

def scrape_tiktok_profile(username: str, proxy: str = None,
                          retries: int = 3) -> dict:
    """Scrape basic TikTok profile data from public page og:meta tags."""
    url = f"https://www.tiktok.com/@{username}"
    headers = {
        "User-Agent": MOBILE_UA,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }
    kwargs = {"headers": headers, "timeout": 15}
    if proxy:
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    for attempt in range(retries):
        try:
            response = requests.get(url, **kwargs)
            if response.status_code == 403:
                # Cloudflare challenge -- need better proxy or browser automation
                time.sleep(5 * (attempt + 1))
                continue
            response.raise_for_status()
            parser = OGParser()
            parser.feed(response.text)
            return {
                "username": username,
                "title": parser.og_data.get("og:title", ""),
                "description": parser.og_data.get("og:description", ""),
                "image": parser.og_data.get("og:image", ""),
                "url": parser.og_data.get("og:url", url),
                "twitter_desc": parser.og_data.get("twitter:description", ""),
            }
        except Exception:
            if attempt < retries - 1:
                time.sleep(random.uniform(3, 7))
            else:
                raise
    return {"username": username, "error": "failed after retries"}

# Batch profile scraping with delays
def batch_scrape_profiles(usernames: list[str],
                          proxies: list[str] = None,
                          delay_range: tuple = (3, 7)) -> list[dict]:
    results = []
    for i, username in enumerate(usernames):
        proxy = proxies[i % len(proxies)] if proxies else None
        result = scrape_tiktok_profile(username, proxy=proxy)
        results.append(result)
        time.sleep(random.uniform(*delay_range))
    return results
6. Video Page Embedded JSON Extraction {#video-json}
Individual TikTok video pages contain a rich JSON blob embedded in a script tag. This blob is called __UNIVERSAL_DATA_FOR_REHYDRATION__ and contains video metadata, stats, author info, and music data.
import json
import re

def scrape_tiktok_video_page(username: str, video_id: str,
                             proxy: str = None) -> dict:
    """Extract full video metadata from TikTok video page."""
    url = f"https://www.tiktok.com/@{username}/video/{video_id}"
    headers = {
        "User-Agent": MOBILE_UA,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": f"https://www.tiktok.com/@{username}",
    }
    kwargs = {"headers": headers, "timeout": 20}
    if proxy:
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    resp = requests.get(url, **kwargs)
    resp.raise_for_status()
    html = resp.text
    # Method 1: Extract __UNIVERSAL_DATA_FOR_REHYDRATION__
    marker = '__UNIVERSAL_DATA_FOR_REHYDRATION__">'
    if marker in html:
        start = html.index(marker) + len(marker)
        end = html.index("</script>", start)
        try:
            data = json.loads(html[start:end])
            return _parse_universal_data(data, video_id)
        except (json.JSONDecodeError, KeyError):
            pass
    # Method 2: Search for video data in any script tags
    patterns = [
        r'window\.__INIT_PROPS__\s*=\s*(\{.*?\})\s*(?:;|</script>)',
        r'"ItemModule":\s*(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})',
    ]
    for pattern in patterns:
        m = re.search(pattern, html, re.DOTALL)
        if m:
            try:
                return json.loads(m.group(1))
            except json.JSONDecodeError:
                continue
    # Method 3: Extract og:meta as fallback
    parser = OGParser()
    parser.feed(html)
    return {
        "video_id": video_id,
        "title": parser.og_data.get("og:title", ""),
        "description": parser.og_data.get("og:description", ""),
        "thumbnail": parser.og_data.get("og:image", ""),
        "source": "og_meta_fallback",
    }

def _parse_universal_data(data: dict, video_id: str) -> dict:
    """Navigate the nested UNIVERSAL_DATA structure to extract video info."""
    # Structure varies -- try multiple paths
    video = None
    paths = [
        ["__DEFAULT_SCOPE__", "webapp.video-detail", "itemInfo", "itemStruct"],
        ["ItemModule", video_id],
    ]
    for path in paths:
        try:
            node = data
            for key in path:
                node = node[key]
            video = node
            break
        except (KeyError, TypeError):
            continue
    if not video:
        return {"video_id": video_id, "raw": data}
    author = video.get("author", {})
    stats = video.get("stats", {})
    music = video.get("music", {})
    return {
        "video_id": video.get("id", video_id),
        "description": video.get("desc", ""),
        "author_username": author.get("uniqueId", ""),
        "author_id": author.get("id", ""),
        "author_nickname": author.get("nickname", ""),
        "author_verified": author.get("verified", False),
        "play_count": stats.get("playCount", 0),
        "like_count": stats.get("diggCount", 0),
        "comment_count": stats.get("commentCount", 0),
        "share_count": stats.get("shareCount", 0),
        "collect_count": stats.get("collectCount", 0),
        "duration": video.get("video", {}).get("duration", 0),
        "created_time": video.get("createTime", 0),
        "music_title": music.get("title", ""),
        "music_author": music.get("authorName", ""),
        "music_id": music.get("id", ""),
        "hashtags": [h["hashtagName"] for h in video.get("textExtra", [])
                     if h.get("hashtagName")],
        "thumbnail_url": video.get("video", {}).get("cover", ""),
    }
7. Hashtag and Trending Page Scraping {#hashtag}
TikTok's hashtag pages at https://www.tiktok.com/tag/{hashtag} also contain embedded JSON. However, they often require JavaScript execution to render the actual video list.
For hashtag data without JavaScript execution:
def scrape_hashtag_page(hashtag: str, proxy: str = None) -> dict:
    """Scrape basic hashtag info from TikTok tag page."""
    url = f"https://www.tiktok.com/tag/{hashtag}"
    headers = {
        "User-Agent": MOBILE_UA,
        "Accept-Language": "en-US,en;q=0.9",
    }
    kwargs = {"headers": headers, "timeout": 15}
    if proxy:
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    resp = requests.get(url, **kwargs)
    resp.raise_for_status()
    html = resp.text
    # Extract challenge/hashtag data from embedded JSON
    marker = '__UNIVERSAL_DATA_FOR_REHYDRATION__">'
    if marker in html:
        start = html.index(marker) + len(marker)
        end = html.index("</script>", start)
        try:
            data = json.loads(html[start:end])
            # Navigate to challenge info
            challenge = (data.get("__DEFAULT_SCOPE__", {})
                         .get("webapp.challenge-detail", {})
                         .get("challengeInfo", {}))
            stats = challenge.get("stats", {})
            info = challenge.get("challenge", {})
            return {
                "hashtag": hashtag,
                "id": info.get("id"),
                "title": info.get("title"),
                "description": info.get("desc"),
                "view_count": stats.get("viewCount", 0),
                "video_count": stats.get("videoCount", 0),
            }
        except (json.JSONDecodeError, KeyError):
            pass
    # Fallback to og:meta
    parser = OGParser()
    parser.feed(html)
    return {
        "hashtag": hashtag,
        "title": parser.og_data.get("og:title", ""),
        "description": parser.og_data.get("og:description", ""),
    }
TikTok Creative Center -- Free Official Data
Before scraping, check TikTok's own analytics tool: https://ads.tiktok.com/business/creativecenter/. It exposes trending hashtags, songs, and creator data without any scraping needed.
def get_creative_center_trending(period: str = "7", region: str = "US") -> list[dict]:
    """Fetch trending hashtags from TikTok Creative Center API."""
    url = "https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list"
    headers = {
        "User-Agent": MOBILE_UA,
        "Referer": "https://ads.tiktok.com/business/creativecenter/",
    }
    params = {
        "period": period,  # 7, 30, 120
        "region": region,
        "page": 1,
        "limit": 50,
    }
    resp = requests.get(url, headers=headers, params=params, timeout=10)
    if resp.status_code == 200:
        return resp.json().get("data", {}).get("list", [])
    return []
8. Pagination Strategies {#pagination}
TikTok uses cursor-based pagination for most feeds. The cursor is typically returned as cursor or min_cursor in the response.
def paginate_user_videos(username: str, max_pages: int = 10,
                         proxy: str = None) -> list[dict]:
    """Paginate through a user's videos using the unofficial API
    (requires valid signatures)."""
    # NOTE: This requires a working signature implementation.
    # Shown here as a structural example only.
    all_videos = []
    cursor = 0
    has_more = True
    for page in range(max_pages):
        if not has_more:
            break
        # The actual API call requires an X-Bogus signature
        params = {
            "uniqueId": username,
            "count": 35,
            "cursor": cursor,
            "app_language": "en",
            "device_platform": "web_pc",
            # msToken, X-Bogus, _signature must be generated dynamically
        }
        # Simulated response structure --
        # in practice you'd call the TikTok API here
        data = {"itemList": [], "hasMore": False, "cursor": 0}
        for item in data.get("itemList", []):
            all_videos.append({
                "id": item.get("id"),
                "description": item.get("desc"),
                "stats": item.get("stats", {}),
                "created_time": item.get("createTime"),
            })
        has_more = data.get("hasMore", False)
        cursor = data.get("cursor", 0)
        time.sleep(random.uniform(2, 4))
    return all_videos
9. Downloading TikTok Videos {#downloading}
TikTok videos can be downloaded without watermark from certain endpoints. The video URL is embedded in the page JSON:
import os
import requests

def download_tiktok_video(video_url: str, output_path: str,
                          proxy: str = None) -> bool:
    """Download a TikTok video to local file."""
    if not video_url:
        return False
    headers = {
        "User-Agent": MOBILE_UA,
        "Referer": "https://www.tiktok.com/",
        "Range": "bytes=0-",
    }
    kwargs = {
        "headers": headers,
        "stream": True,
        "timeout": 60,
    }
    if proxy:
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    try:
        resp = requests.get(video_url, **kwargs)
        if resp.status_code not in (200, 206):
            return False
        os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
        with open(output_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=65536):
                f.write(chunk)
        return True
    except Exception as e:
        print(f"Download failed for {video_url[:50]}: {e}")
        return False

def extract_video_url_from_page_data(page_data: dict) -> str:
    """Extract the best quality video download URL."""
    video_info = page_data.get("video", {})
    # Try various URL fields in order of preference
    for field in ["playAddr", "downloadAddr", "play_addr", "download_addr"]:
        url = video_info.get(field, "")
        if url and url.startswith("http"):
            return url
    return ""
Note: Downloading TikTok videos may violate their ToS depending on how you use the content. For personal archiving it's generally tolerated; commercial redistribution is not.
10. Proxy Strategy for TikTok {#proxies}
TikTok's bot detection is more sophisticated than most platforms'. You need residential proxies -- datacenter IPs are usually blocked before a single request completes.
ThorData provides rotating residential proxy pools with country targeting. For TikTok, country-matching proxies are particularly important -- requests from the same country as your target content perform better and raise fewer flags.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000

def get_proxy(country: str = "US", rotate: bool = True) -> str:
    """Build ThorData proxy URL with optional country targeting."""
    if rotate:
        user = f"{THORDATA_USER}-country-{country.lower()}"
    else:
        session_id = f"sess{random.randint(10000, 99999)}"
        user = f"{THORDATA_USER}-country-{country.lower()}-session-{session_id}"
    return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"

def test_proxy_for_tiktok(proxy: str) -> bool:
    """Test if a proxy can access TikTok without being blocked."""
    try:
        resp = requests.get(
            "https://www.tiktok.com/",
            headers={"User-Agent": MOBILE_UA},
            proxies={"https": proxy},
            timeout=10,
            allow_redirects=False
        )
        # 200 or redirect is fine; 403 means blocked
        return resp.status_code in (200, 301, 302)
    except Exception:
        return False
Key insights for TikTok proxy usage:
- Residential IPs are not optional -- they're a hard requirement
- Rotate per-request for profile pages; use sticky sessions for paginated API calls
- US residential IPs work best for US-targeted content
- Budget 3-5x more proxy bandwidth for TikTok than for other platforms, since Cloudflare challenge pages waste bandwidth
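The sticky-session point can be sketched as a small pool helper. The username syntax below mirrors the `get_proxy()` example and is an assumption about the provider's format -- verify it against your provider's documentation:

```python
import itertools
import random

# Placeholder ThorData-style credentials (assumptions for illustration)
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_HOST = "proxy.thordata.com"
PROXY_PORT = 9000

def make_sticky_pool(country: str = "us", size: int = 5) -> list[str]:
    """Build a pool of sticky-session proxy URLs, one session per entry."""
    pool = []
    for _ in range(size):
        sess = f"sess{random.randint(10000, 99999)}"
        user = f"{PROXY_USER}-country-{country}-session-{sess}"
        pool.append(f"http://{user}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}")
    return pool

def proxy_cycler(pool: list[str]):
    """Round-robin the pool so each paginated crawl keeps its own sticky
    IP while different crawls are spread across sessions."""
    return itertools.cycle(pool)
```

A paginated crawl would pull one proxy from the cycler and reuse it for every page of that crawl, rather than rotating mid-pagination.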
11. Anti-Detection Techniques {#anti-detection}
Beyond proxies, TikTok specifically checks:
Browser Fingerprint Consistency
When using Playwright or similar tools:
from playwright.async_api import async_playwright

async def create_tiktok_browser_context(proxy_url: str = None):
    """Create a Playwright context configured to avoid TikTok detection.
    The caller is responsible for closing the browser when done."""
    p = await async_playwright().start()
    launch_args = {
        "headless": True,
        "args": [
            "--disable-blink-features=AutomationControlled",
            "--disable-dev-shm-usage",
            "--no-sandbox",
            "--disable-setuid-sandbox",
            "--disable-accelerated-2d-canvas",
            "--no-first-run",
            "--no-zygote",
            "--disable-gpu",
        ]
    }
    if proxy_url:
        launch_args["proxy"] = {"server": proxy_url}
    browser = await p.chromium.launch(**launch_args)
    context = await browser.new_context(
        viewport={"width": 390, "height": 844},  # iPhone 14 Pro
        user_agent=(
            "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
            "AppleWebKit/605.1.15 (KHTML, like Gecko) "
            "Version/17.0 Mobile/15E148 Safari/604.1"
        ),
        locale="en-US",
        timezone_id="America/New_York",
    )
    # Remove webdriver indicators
    await context.add_init_script("""
        delete Object.getPrototypeOf(navigator).webdriver;
        Object.defineProperty(navigator, 'platform', {get: () => 'iPhone'});
        Object.defineProperty(navigator, 'maxTouchPoints', {get: () => 5});
    """)
    return browser, context
Request Timing Patterns
Human users don't make perfectly spaced requests. Add noise:
import random
import time

def human_like_delay(base_seconds: float = 3.0, variance: float = 0.5):
    """Sleep for a randomized human-like delay."""
    delay = base_seconds + random.gauss(0, variance)
    delay = max(1.0, delay)  # never less than 1 second
    time.sleep(delay)

def simulate_reading_time(content_length: int) -> float:
    """Estimate how long a human would take to 'read' content."""
    words = content_length / 5  # rough word count
    reading_speed = random.uniform(200, 300)  # words per minute
    return (words / reading_speed) * 60
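One gap in the helpers above: when TikTok or Cloudflare does return a 403/429, retrying on a human-like schedule just burns proxy bandwidth. A minimal exponential-backoff-with-full-jitter sketch (the base and cap values are arbitrary starting points, not tuned thresholds):

```python
import random

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Exponential backoff with full jitter for 403/429 responses.
    Returns a delay in seconds; the caller sleeps for it.
    attempt=0 gives up to `base` seconds, doubling per retry up to `cap`."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (uniform over the whole window, rather than a fixed delay plus noise) also helps avoid several workers retrying in lockstep.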
12. Storing TikTok Data {#storage}
import sqlite3
import time

def init_tiktok_db(db_path: str = "tiktok_data.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS profiles (
            username TEXT PRIMARY KEY,
            user_id TEXT,
            nickname TEXT,
            bio TEXT,
            followers INTEGER,
            following INTEGER,
            video_count INTEGER,
            like_count INTEGER,
            is_verified INTEGER,
            scraped_at REAL
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS videos (
            video_id TEXT PRIMARY KEY,
            author_username TEXT,
            description TEXT,
            play_count INTEGER,
            like_count INTEGER,
            comment_count INTEGER,
            share_count INTEGER,
            duration INTEGER,
            created_time INTEGER,
            music_title TEXT,
            music_author TEXT,
            hashtags TEXT,
            thumbnail_url TEXT,
            video_path TEXT,
            scraped_at REAL
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS hashtags (
            hashtag TEXT PRIMARY KEY,
            view_count INTEGER,
            video_count INTEGER,
            scraped_at REAL
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_videos_author ON videos(author_username)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_videos_created ON videos(created_time)")
    conn.commit()
    return conn

def save_video(conn: sqlite3.Connection, video: dict):
    import json
    conn.execute("""
        INSERT OR REPLACE INTO videos VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
    """, (
        video.get("video_id"), video.get("author_username"),
        video.get("description"), video.get("play_count", 0),
        video.get("like_count", 0), video.get("comment_count", 0),
        video.get("share_count", 0), video.get("duration", 0),
        video.get("created_time"), video.get("music_title"),
        video.get("music_author"),
        json.dumps(video.get("hashtags", [])),
        video.get("thumbnail_url"), video.get("video_path"),
        time.time()
    ))
    conn.commit()
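With the schema in place, analysis reduces to SQL. As one example (assuming the `videos` table created by `init_tiktok_db()` above), ranking stored videos by a simple engagement rate:

```python
import sqlite3

def top_videos_by_engagement(conn: sqlite3.Connection,
                             limit: int = 10) -> list[tuple]:
    """Rank stored videos by (likes + comments + shares) / plays.
    Assumes the videos table from init_tiktok_db()."""
    return conn.execute("""
        SELECT video_id, author_username,
               (like_count + comment_count + share_count) * 1.0
                   / MAX(play_count, 1) AS engagement_rate
        FROM videos
        WHERE play_count > 0
        ORDER BY engagement_rate DESC
        LIMIT ?
    """, (limit,)).fetchall()
```

The `* 1.0` forces floating-point division, and the scalar `MAX(play_count, 1)` guards against division by zero.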
13. Real Use Cases and What's Feasible {#use-cases}
What's Realistic in 2026
| Goal | Feasibility | Method |
|---|---|---|
| Basic profile info (bio, approx followers) | High | og:meta scraping |
| Single video metadata | High | Video page JSON extraction |
| Trending hashtag data | High | Creative Center API |
| User's recent videos (with Research API) | High (if approved) | Research API |
| User's full video list | Medium | TikTokApi library (breaks with updates) |
| Comments at scale | Low | Research API (only reliable method) |
| Video downloads | Medium | Direct URL from page JSON |
| Follower lists | Very Low | Requires working signature system |
Competitor Content Analysis
def analyze_creator_content(username: str, video_data: list[dict]) -> dict:
    """Analyze posting patterns and performance from scraped video data."""
    if not video_data:
        return {}
    import statistics
    play_counts = [v.get("play_count", 0) for v in video_data if v.get("play_count")]
    like_counts = [v.get("like_count", 0) for v in video_data if v.get("like_count")]
    # Hashtag frequency
    hashtag_freq = {}
    for v in video_data:
        for tag in v.get("hashtags", []):
            hashtag_freq[tag] = hashtag_freq.get(tag, 0) + 1
    return {
        "video_count": len(video_data),
        "avg_views": statistics.mean(play_counts) if play_counts else 0,
        "median_views": statistics.median(play_counts) if play_counts else 0,
        "avg_likes": statistics.mean(like_counts) if like_counts else 0,
        "top_hashtags": sorted(hashtag_freq.items(), key=lambda x: -x[1])[:10],
        "avg_duration": statistics.mean([v.get("duration", 0) for v in video_data]),
    }
14. Risks: Legal and Technical {#risks}
Legal risks:
- ByteDance has successfully sued scraping operations (multiple US and EU cases)
- CFAA exposure for bypassing access controls
- GDPR/CCPA apply to personal data collected from EU/California users
- ToS violations create breach-of-contract exposure
Technical risks:
- Everything breaks constantly -- the signature system, HTML structure, and JSON format all change regularly
- TikTok's detection is among the most sophisticated on the internet
- Getting flagged means IP bans that can extend to entire subnets
- Residential proxy costs add up fast given TikTok's challenge page frequency
15. Realistic Alternatives for 2026 {#alternatives}
TikTok Research API -- if you're affiliated with a university or research institution, apply. It's the only stable, sanctioned way to access TikTok data at scale.
TikTok Creative Center -- Free, official, no scraping needed. Trending hashtags, songs, and creator stats: ads.tiktok.com/business/creativecenter/
Paid data providers -- Bright Data, Oxylabs, and others offer TikTok datasets or maintained scraping APIs. Expensive but actually works consistently. They handle the signature system and proxy rotation.
yt-dlp -- For video downloading specifically, yt-dlp handles TikTok better than custom scrapers and is actively maintained:
# Install
pip install yt-dlp
# Download a TikTok video
yt-dlp "https://www.tiktok.com/@username/video/VIDEO_ID" -o "%(id)s.%(ext)s"
# Download without watermark (may require login)
yt-dlp --cookies cookies.txt "https://www.tiktok.com/@username/video/VIDEO_ID"
import subprocess
import json

def download_with_ytdlp(url: str, output_dir: str = ".", proxy: str = None) -> dict:
    """Download TikTok video using yt-dlp."""
    cmd = ["yt-dlp", "--dump-json", url]
    if proxy:
        cmd.extend(["--proxy", proxy])
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        info = json.loads(result.stdout)
        # Now download the actual file
        dl_cmd = ["yt-dlp", "-o", f"{output_dir}/%(id)s.%(ext)s", url]
        if proxy:
            dl_cmd.extend(["--proxy", proxy])
        subprocess.run(dl_cmd)
        return info
    return {"error": result.stderr}
For most legitimate use cases -- content research, trend analysis, competitor monitoring -- combining the Creative Center's free data with the Research API (if accessible) and yt-dlp for specific video downloads covers 90% of needs without the risk and maintenance burden of custom scrapers.
If you do build custom TikTok scrapers, use ThorData's residential proxies as a baseline infrastructure investment -- there's simply no viable path to consistent TikTok access without residential IPs, and ThorData's rotating pool handles the country-targeting granularity that TikTok's geo-based rate limiting requires.