Scraping TikTok in 2026: Video Data, Profiles, and the Unofficial API
TikTok is arguably the hardest major platform to scrape in 2026. It's not just "hard like Instagram" -- TikTok has an entirely different level of anti-bot infrastructure. Every API request requires cryptographic signatures that rotate every few minutes. The platform fingerprints your browser at the device level. And ByteDance has a legal team that actively sues scraping operations, with several successful lawsuits on record.
If you're thinking about scraping TikTok, you need to understand what you're dealing with before you write a single line of code.
Table of Contents
- Why TikTok Is So Difficult to Scrape
- The Official TikTok API: Extremely Limited
- The TikTok Research API
- The msToken and X-Bogus Signature System
- Public Profile og:meta Scraping
- Video Page Embedded JSON Extraction
- Hashtag and Trending Page Scraping
- Pagination Strategies
- Downloading TikTok Videos
- Proxy Strategy for TikTok
- Anti-Detection Techniques
- Storing TikTok Data
- Real Use Cases and What's Feasible
- Risks: Legal and Technical
- Realistic Alternatives for 2026
1. Why TikTok Is So Difficult to Scrape {#why-hard}
Most social platforms protect their data with rate limits, login walls, and maybe a CAPTCHA. TikTok does all of that, plus:
Cryptographic signature validation on every request. TikTok's internal API requires multiple parameters -- msToken, X-Bogus, _signature, and X-Tt-Params -- attached to every single request. These are generated client-side by obfuscated JavaScript that changes with each app update. Without valid signatures, the API returns empty responses or 403 errors.
Device fingerprinting at the hardware level. TikTok collects an extensive fingerprint: canvas hashes, WebGL renderer, installed fonts, screen resolution, battery status, touch support, and more. This fingerprint is tied to your session. A mismatch between what TikTok expects from a real device and what your scraper sends gets you flagged immediately.
Behavioral analysis beyond fingerprinting. Mouse movements, scroll patterns, time between requests, and interaction sequences are all monitored. Headless browsers are detected even with stealth plugins. Playwright and Puppeteer get caught within minutes without significant modification.
Cloudflare Bot Management integration. TikTok uses Cloudflare's enterprise bot protection, which adds another layer of JavaScript challenges, CAPTCHA triggers, and IP reputation checks on top of TikTok's native protection.
Legal enforcement. ByteDance has sued multiple scraping companies and won. They actively monitor for large-scale data collection and issue cease-and-desist letters. This isn't theoretical -- it's happening regularly in 2026.
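When a request fails, it helps to know which layer blocked it: Cloudflare's challenge or TikTok's native protection. A rough heuristic follows; the marker strings are assumptions drawn from commonly observed Cloudflare challenge pages, not an official list, so treat this as a starting point rather than a guarantee:

```python
# Heuristic check for Cloudflare challenge pages vs. ordinary blocks.
# Marker strings below are assumptions based on typical Cloudflare
# challenge HTML and headers -- verify against your own captures.
def looks_like_cf_challenge(status_code: int, body: str, headers: dict) -> bool:
    """Guess whether a blocked response is a Cloudflare JS challenge."""
    if status_code not in (403, 429, 503):
        return False
    markers = ("cf-chl", "challenge-platform", "Just a moment", "_cf_chl_opt")
    if any(m in body for m in markers):
        return True
    # Cloudflare normally stamps responses it serves with a CF-RAY header
    return "cf-ray" in {k.lower() for k in headers}
```

Distinguishing the two matters because the remedies differ: a Cloudflare challenge usually calls for a cleaner proxy or browser automation, while a TikTok-native block often means the fingerprint or signature is wrong.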
2. The Official TikTok API: Extremely Limited {#official-api}
TikTok offers an official API through the TikTok for Developers platform. To use it, you need to create an app, get it approved, and request specific scopes. The approval process is strict and primarily intended for apps that integrate with TikTok (posting content, managing ads, etc.).
For data collection, the official API is almost useless. You can access your own account data and limited information about videos your app's users have explicitly shared. There's no endpoint to search for users, browse trending videos, or pull data about arbitrary public profiles.
What the Official API Does Provide
import requests

TIKTOK_ACCESS_TOKEN = "your_oauth_access_token"
TIKTOK_BASE = "https://open.tiktokapis.com/v2"

def get_own_videos(max_count: int = 20) -> list[dict]:
    """Get videos from your own TikTok account via official API."""
    resp = requests.post(
        f"{TIKTOK_BASE}/video/list/",
        headers={
            "Authorization": f"Bearer {TIKTOK_ACCESS_TOKEN}",
            "Content-Type": "application/json",
        },
        json={
            "fields": ["id", "title", "video_description", "duration",
                       "cover_image_url", "share_url", "view_count",
                       "like_count", "comment_count", "share_count"],
            "max_count": max_count,
        }
    )
    resp.raise_for_status()
    return resp.json().get("data", {}).get("videos", [])

def get_user_info_official() -> dict:
    """Get info about the authenticated user."""
    resp = requests.get(
        f"{TIKTOK_BASE}/user/info/",
        headers={"Authorization": f"Bearer {TIKTOK_ACCESS_TOKEN}"},
        params={"fields": "open_id,union_id,avatar_url,display_name,"
                          "bio_description,profile_deep_link,is_verified,"
                          "follower_count,following_count,likes_count,video_count"}
    )
    resp.raise_for_status()
    return resp.json().get("data", {}).get("user", {})
This is only useful if you're building a TikTok integration where users authenticate with your app. For research or data collection, you need other methods.
3. The TikTok Research API {#research-api}
TikTok launched a Research API aimed at academic researchers. It provides access to public video data, comments, and user information through a structured query interface.
Requirements for Access
- Must be affiliated with a research institution (university, recognized research org)
- Submit a detailed application explaining research goals
- Wait 4-12 weeks for approval
- Pass IRB/ethics review if required by your institution
If you get access, the Research API is genuinely useful:
import requests

RESEARCH_ACCESS_TOKEN = "your_research_api_token"
RESEARCH_BASE = "https://open.tiktokapis.com/v2/research"

def search_videos_research(keywords: list[str],
                           start_date: str, end_date: str,
                           max_count: int = 100) -> list[dict]:
    """Query TikTok Research API for videos matching keywords."""
    resp = requests.post(
        f"{RESEARCH_BASE}/video/query/",
        headers={
            "Authorization": f"Bearer {RESEARCH_ACCESS_TOKEN}",
            "Content-Type": "application/json",
        },
        json={
            "query": {
                "and": [
                    {"operation": "IN", "field_name": "keyword",
                     "field_values": keywords},
                ]
            },
            "start_date": start_date,  # YYYYMMDD
            "end_date": end_date,
            "max_count": max_count,
            "fields": "id,video_description,create_time,region_code,"
                      "share_count,view_count,like_count,comment_count,"
                      "music_id,hashtag_names,username,effect_ids",
        }
    )
    resp.raise_for_status()
    return resp.json().get("data", {}).get("videos", [])

def get_video_comments_research(video_id: str,
                                max_count: int = 100) -> list[dict]:
    """Get comments for a specific video via Research API."""
    resp = requests.post(
        f"{RESEARCH_BASE}/video/comment/list/",
        headers={
            "Authorization": f"Bearer {RESEARCH_ACCESS_TOKEN}",
            "Content-Type": "application/json",
        },
        json={
            "video_id": video_id,
            "max_count": max_count,
            "fields": "id,video_id,text,like_count,reply_count,"
                      "parent_comment_id,create_time,username",
        }
    )
    resp.raise_for_status()
    return resp.json().get("data", {}).get("comments", [])
For most people reading this, the Research API won't be available. But if you're at a university, apply -- it's the only stable, legal way to access TikTok data at scale.
4. The msToken and X-Bogus Signature System {#signatures}
Here's where things get technical. TikTok's web frontend makes requests to internal API endpoints like /api/user/detail/ and /api/post/item_list/. Every request must include:
- msToken: Generated by TikTok's anti-bot JavaScript, tied to session, rotates frequently
- X-Bogus: Signature computed from URL + device fingerprint + timestamp, algorithm changes with every TikTok update
- _signature: Older parameter still required on some endpoints
- webid: Device identifier derived from browser fingerprint
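To make the moving parts concrete, here is roughly what an assembled item-list request URL looks like. Everything in this sketch is illustrative: the endpoint shape follows the paths mentioned above, the `aid` value is one commonly seen in web traffic captures (an assumption, not documented), and the token arguments are placeholders that must come from TikTok's own JavaScript at request time; hard-coded values will be rejected:

```python
from urllib.parse import urlencode

def build_item_list_url(sec_uid: str, cursor: int,
                        ms_token: str, x_bogus: str) -> str:
    """Assemble an /api/post/item_list/ style URL for illustration.
    ms_token and x_bogus are placeholders -- in practice they must be
    generated dynamically by TikTok's obfuscated JS for each request."""
    base = "https://www.tiktok.com/api/post/item_list/"
    params = {
        "aid": "1988",               # web app id seen in captures (assumption)
        "count": 35,
        "cursor": cursor,
        "secUid": sec_uid,
        "device_platform": "web_pc",
        "msToken": ms_token,
    }
    # X-Bogus is appended last, computed over the final query string
    return f"{base}?{urlencode(params)}&X-Bogus={x_bogus}"
```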
Open-source projects like TikTokApi attempt to replicate these signatures by running TikTok's own JavaScript in a headless browser. The approach works -- until TikTok pushes an update to their obfuscated JS, which happens every few days to weeks.
The fundamental problem: Any solution that depends on reverse-engineering the signature system has a shelf life measured in days to weeks. This is why most open-source TikTok scrapers are perpetually broken or require constant maintenance.
If you need the signed API approach, using a maintained library with active developers is the only practical path:
# TikTokApi library approach (requires Playwright for signature generation)
# Note: May be broken at time of reading -- check library's GitHub for status
from TikTokApi import TikTokApi
import asyncio

async def get_tiktok_user_videos(username: str, count: int = 30) -> list[dict]:
    async with TikTokApi() as api:
        await api.create_sessions(
            ms_tokens=["your_ms_token"],
            num_sessions=1,
            sleep_after=3
        )
        user = api.user(username)
        videos = []
        async for video in user.videos(count=count):
            videos.append({
                "id": video.id,
                "description": video.as_dict.get("desc", ""),
                "author": video.author.username,
                "stats": video.stats,
            })
        return videos

# videos = asyncio.run(get_tiktok_user_videos("charlidamelio", 30))
5. Public Profile og:meta Scraping {#og-meta}
Like most social platforms, TikTok serves server-rendered HTML for public profile pages. These pages include OpenGraph meta tags with basic profile data.
import requests
from html.parser import HTMLParser
import time
import random

class OGParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.og_data = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs_dict = dict(attrs)
            prop = attrs_dict.get("property", "")
            name = attrs_dict.get("name", "")
            if prop.startswith("og:"):
                self.og_data[prop] = attrs_dict.get("content", "")
            elif name in ("description", "twitter:description", "twitter:title"):
                self.og_data[name] = attrs_dict.get("content", "")

MOBILE_UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
             "AppleWebKit/605.1.15 (KHTML, like Gecko) "
             "Version/17.0 Mobile/15E148 Safari/604.1")

def scrape_tiktok_profile(username: str, proxy: str = None,
                          retries: int = 3) -> dict:
    """Scrape basic TikTok profile data from public page og:meta tags."""
    url = f"https://www.tiktok.com/@{username}"
    headers = {
        "User-Agent": MOBILE_UA,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }
    kwargs = {"headers": headers, "timeout": 15}
    if proxy:
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    for attempt in range(retries):
        try:
            response = requests.get(url, **kwargs)
            if response.status_code == 403:
                # Cloudflare challenge -- need better proxy or browser automation
                time.sleep(5 * (attempt + 1))
                continue
            response.raise_for_status()
            parser = OGParser()
            parser.feed(response.text)
            return {
                "username": username,
                "title": parser.og_data.get("og:title", ""),
                "description": parser.og_data.get("og:description", ""),
                "image": parser.og_data.get("og:image", ""),
                "url": parser.og_data.get("og:url", url),
                "twitter_desc": parser.og_data.get("twitter:description", ""),
            }
        except Exception:
            if attempt < retries - 1:
                time.sleep(random.uniform(3, 7))
            else:
                raise
    return {"username": username, "error": "failed after retries"}

# Batch profile scraping with delays
def batch_scrape_profiles(usernames: list[str],
                          proxies: list[str] = None,
                          delay_range: tuple = (3, 7)) -> list[dict]:
    results = []
    for i, username in enumerate(usernames):
        proxy = proxies[i % len(proxies)] if proxies else None
        result = scrape_tiktok_profile(username, proxy=proxy)
        results.append(result)
        time.sleep(random.uniform(*delay_range))
    return results
6. Video Page Embedded JSON Extraction {#video-json}
Individual TikTok video pages contain a rich JSON blob embedded in a script tag. This blob is called __UNIVERSAL_DATA_FOR_REHYDRATION__ and contains video metadata, stats, author info, and music data.
import json
import re

def scrape_tiktok_video_page(username: str, video_id: str,
                             proxy: str = None) -> dict:
    """Extract full video metadata from TikTok video page."""
    url = f"https://www.tiktok.com/@{username}/video/{video_id}"
    headers = {
        "User-Agent": MOBILE_UA,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": f"https://www.tiktok.com/@{username}",
    }
    kwargs = {"headers": headers, "timeout": 20}
    if proxy:
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    resp = requests.get(url, **kwargs)
    resp.raise_for_status()
    html = resp.text
    # Method 1: Extract __UNIVERSAL_DATA_FOR_REHYDRATION__
    marker = '__UNIVERSAL_DATA_FOR_REHYDRATION__">'
    if marker in html:
        start = html.index(marker) + len(marker)
        end = html.index("</script>", start)
        try:
            data = json.loads(html[start:end])
            return _parse_universal_data(data, video_id)
        except (json.JSONDecodeError, KeyError):
            pass
    # Method 2: Search for video data in any script tags
    patterns = [
        r'window\.__INIT_PROPS__\s*=\s*(\{.*?\})\s*(?:;|</script>)',
        r'"ItemModule":\s*(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})',
    ]
    for pattern in patterns:
        m = re.search(pattern, html, re.DOTALL)
        if m:
            try:
                return json.loads(m.group(1))
            except json.JSONDecodeError:
                continue
    # Method 3: Extract og:meta as fallback
    parser = OGParser()
    parser.feed(html)
    return {
        "video_id": video_id,
        "title": parser.og_data.get("og:title", ""),
        "description": parser.og_data.get("og:description", ""),
        "thumbnail": parser.og_data.get("og:image", ""),
        "source": "og_meta_fallback",
    }

def _parse_universal_data(data: dict, video_id: str) -> dict:
    """Navigate the nested UNIVERSAL_DATA structure to extract video info."""
    # Structure varies -- try multiple paths
    video = None
    paths = [
        ["__DEFAULT_SCOPE__", "webapp.video-detail", "itemInfo", "itemStruct"],
        ["ItemModule", video_id],
    ]
    for path in paths:
        try:
            node = data
            for key in path:
                node = node[key]
            video = node
            break
        except (KeyError, TypeError):
            continue
    if not video:
        return {"video_id": video_id, "raw": data}
    author = video.get("author", {})
    stats = video.get("stats", {})
    music = video.get("music", {})
    return {
        "video_id": video.get("id", video_id),
        "description": video.get("desc", ""),
        "author_username": author.get("uniqueId", ""),
        "author_id": author.get("id", ""),
        "author_nickname": author.get("nickname", ""),
        "author_verified": author.get("verified", False),
        "play_count": stats.get("playCount", 0),
        "like_count": stats.get("diggCount", 0),
        "comment_count": stats.get("commentCount", 0),
        "share_count": stats.get("shareCount", 0),
        "collect_count": stats.get("collectCount", 0),
        "duration": video.get("video", {}).get("duration", 0),
        "created_time": video.get("createTime", 0),
        "music_title": music.get("title", ""),
        "music_author": music.get("authorName", ""),
        "music_id": music.get("id", ""),
        "hashtags": [h["hashtagName"] for h in video.get("textExtra", [])
                     if h.get("hashtagName")],
        "thumbnail_url": video.get("video", {}).get("cover", ""),
    }
7. Hashtag and Trending Page Scraping {#hashtag}
TikTok's hashtag pages at https://www.tiktok.com/tag/{hashtag} also contain embedded JSON. However, they often require JavaScript execution to render the actual video list.
For hashtag data without JavaScript execution:
def scrape_hashtag_page(hashtag: str, proxy: str = None) -> dict:
    """Scrape basic hashtag info from TikTok tag page."""
    url = f"https://www.tiktok.com/tag/{hashtag}"
    headers = {
        "User-Agent": MOBILE_UA,
        "Accept-Language": "en-US,en;q=0.9",
    }
    kwargs = {"headers": headers, "timeout": 15}
    if proxy:
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    resp = requests.get(url, **kwargs)
    resp.raise_for_status()
    html = resp.text
    # Extract challenge/hashtag data from embedded JSON
    marker = '__UNIVERSAL_DATA_FOR_REHYDRATION__">'
    if marker in html:
        start = html.index(marker) + len(marker)
        end = html.index("</script>", start)
        try:
            data = json.loads(html[start:end])
            # Navigate to challenge info
            challenge = (data.get("__DEFAULT_SCOPE__", {})
                         .get("webapp.challenge-detail", {})
                         .get("challengeInfo", {}))
            stats = challenge.get("stats", {})
            info = challenge.get("challenge", {})
            return {
                "hashtag": hashtag,
                "id": info.get("id"),
                "title": info.get("title"),
                "description": info.get("desc"),
                "view_count": stats.get("viewCount", 0),
                "video_count": stats.get("videoCount", 0),
            }
        except (json.JSONDecodeError, KeyError):
            pass
    # Fallback to og:meta
    parser = OGParser()
    parser.feed(html)
    return {
        "hashtag": hashtag,
        "title": parser.og_data.get("og:title", ""),
        "description": parser.og_data.get("og:description", ""),
    }
TikTok Creative Center -- Free Official Data
Before scraping, check TikTok's own analytics tool: https://ads.tiktok.com/business/creativecenter/. It exposes trending hashtags, songs, and creator data without any scraping needed.
def get_creative_center_trending(period: str = "7", region: str = "US") -> list[dict]:
    """Fetch trending hashtags from TikTok Creative Center API."""
    url = "https://ads.tiktok.com/creative_radar_api/v1/popular_trend/hashtag/list"
    headers = {
        "User-Agent": MOBILE_UA,
        "Referer": "https://ads.tiktok.com/business/creativecenter/",
    }
    params = {
        "period": period,  # 7, 30, 120
        "region": region,
        "page": 1,
        "limit": 50,
    }
    resp = requests.get(url, headers=headers, params=params, timeout=10)
    if resp.status_code == 200:
        return resp.json().get("data", {}).get("list", [])
    return []
8. Pagination Strategies {#pagination}
TikTok uses cursor-based pagination for most feeds. The cursor is typically returned as cursor or min_cursor in the response.
def paginate_user_videos(username: str, max_pages: int = 10,
                         proxy: str = None) -> list[dict]:
    """Paginate through a user's videos using the unofficial API
    (requires valid signatures)."""
    # NOTE: This requires a working signature implementation.
    # Shown here as a structural example only.
    all_videos = []
    cursor = 0
    has_more = True
    for page in range(max_pages):
        if not has_more:
            break
        # The actual API call requires an X-Bogus signature
        params = {
            "uniqueId": username,
            "count": 35,
            "cursor": cursor,
            "app_language": "en",
            "device_platform": "web_pc",
            # msToken, X-Bogus, _signature must be generated dynamically
        }
        # Simulated response structure --
        # in practice you'd call the TikTok API here
        data = {"itemList": [], "hasMore": False, "cursor": 0}
        for item in data.get("itemList", []):
            all_videos.append({
                "id": item.get("id"),
                "description": item.get("desc"),
                "stats": item.get("stats", {}),
                "created_time": item.get("createTime"),
            })
        has_more = data.get("hasMore", False)
        cursor = data.get("cursor", 0)
        time.sleep(random.uniform(2, 4))
    return all_videos
9. Downloading TikTok Videos {#downloading}
TikTok videos can be downloaded without watermark from certain endpoints. The video URL is embedded in the page JSON:
import os
import requests

def download_tiktok_video(video_url: str, output_path: str,
                          proxy: str = None) -> bool:
    """Download a TikTok video to local file."""
    if not video_url:
        return False
    headers = {
        "User-Agent": MOBILE_UA,
        "Referer": "https://www.tiktok.com/",
        "Range": "bytes=0-",
    }
    kwargs = {
        "headers": headers,
        "stream": True,
        "timeout": 60,
    }
    if proxy:
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    try:
        resp = requests.get(video_url, **kwargs)
        if resp.status_code not in (200, 206):
            return False
        os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
        with open(output_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=65536):
                f.write(chunk)
        return True
    except Exception as e:
        print(f"Download failed for {video_url[:50]}: {e}")
        return False

def extract_video_url_from_page_data(page_data: dict) -> str:
    """Extract the best quality video download URL."""
    video_info = page_data.get("video", {})
    # Try various URL fields in order of preference
    for field in ["playAddr", "downloadAddr", "play_addr", "download_addr"]:
        url = video_info.get(field, "")
        if url and url.startswith("http"):
            return url
    return ""
Note: Downloading TikTok videos may violate their ToS depending on how you use the content. For personal archiving it's generally tolerated; commercial redistribution is not.
10. Proxy Strategy for TikTok {#proxies}
TikTok's bot detection is more sophisticated than most platforms'. You need residential proxies -- datacenter IPs are usually blocked before a single request completes.
ThorData provides rotating residential proxy pools with country targeting. For TikTok, country-matching proxies are particularly important -- requests from the same country as your target content perform better and raise fewer flags.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000

def get_proxy(country: str = "US", rotate: bool = True) -> str:
    """Build ThorData proxy URL with optional country targeting."""
    if rotate:
        user = f"{THORDATA_USER}-country-{country.lower()}"
    else:
        session_id = f"sess{random.randint(10000, 99999)}"
        user = f"{THORDATA_USER}-country-{country.lower()}-session-{session_id}"
    return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"

def test_proxy_for_tiktok(proxy: str) -> bool:
    """Test if a proxy can access TikTok without being blocked."""
    try:
        resp = requests.get(
            "https://www.tiktok.com/",
            headers={"User-Agent": MOBILE_UA},
            proxies={"https": proxy},
            timeout=10,
            allow_redirects=False
        )
        # 200 or redirect is fine; 403 means blocked
        return resp.status_code in (200, 301, 302)
    except Exception:
        return False
Key insights for TikTok proxy usage:
- Residential IPs are not optional -- they're a hard requirement
- Rotate per-request for profile pages; use sticky sessions for paginated API calls
- US residential IPs work best for US-targeted content
- Budget 3-5x more proxy bandwidth for TikTok than for other platforms, since Cloudflare challenge pages waste bandwidth
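The sticky-session point can be sketched as a small pool helper. The username syntax below mirrors the `get_proxy()` example and is an assumption about the provider's format -- verify it against your provider's documentation:

```python
import itertools
import random

# Placeholder ThorData-style credentials (assumptions for illustration)
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_HOST = "proxy.thordata.com"
PROXY_PORT = 9000

def make_sticky_pool(country: str = "us", size: int = 5) -> list[str]:
    """Build a pool of sticky-session proxy URLs, one session per entry."""
    pool = []
    for _ in range(size):
        sess = f"sess{random.randint(10000, 99999)}"
        user = f"{PROXY_USER}-country-{country}-session-{sess}"
        pool.append(f"http://{user}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}")
    return pool

def proxy_cycler(pool: list[str]):
    """Round-robin the pool so each paginated crawl keeps its own sticky
    IP while different crawls are spread across sessions."""
    return itertools.cycle(pool)
```

A paginated crawl would pull one proxy from the cycler and reuse it for every page of that crawl, rather than rotating mid-pagination.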
11. Anti-Detection Techniques {#anti-detection}
Beyond proxies, TikTok specifically checks:
Browser Fingerprint Consistency
When using Playwright or similar tools:
from playwright.async_api import async_playwright

async def create_tiktok_browser_context(proxy_url: str = None):
    """Create a Playwright context configured to avoid TikTok detection.
    The caller is responsible for closing the browser when done."""
    p = await async_playwright().start()
    launch_args = {
        "headless": True,
        "args": [
            "--disable-blink-features=AutomationControlled",
            "--disable-dev-shm-usage",
            "--no-sandbox",
            "--disable-setuid-sandbox",
            "--disable-accelerated-2d-canvas",
            "--no-first-run",
            "--no-zygote",
            "--disable-gpu",
        ]
    }
    if proxy_url:
        launch_args["proxy"] = {"server": proxy_url}
    browser = await p.chromium.launch(**launch_args)
    context = await browser.new_context(
        viewport={"width": 390, "height": 844},  # iPhone 14 Pro
        user_agent=(
            "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
            "AppleWebKit/605.1.15 (KHTML, like Gecko) "
            "Version/17.0 Mobile/15E148 Safari/604.1"
        ),
        locale="en-US",
        timezone_id="America/New_York",
    )
    # Remove webdriver indicators
    await context.add_init_script("""
        delete Object.getPrototypeOf(navigator).webdriver;
        Object.defineProperty(navigator, 'platform', {get: () => 'iPhone'});
        Object.defineProperty(navigator, 'maxTouchPoints', {get: () => 5});
    """)
    return browser, context
Request Timing Patterns
Human users don't make perfectly spaced requests. Add noise:
import random
import time

def human_like_delay(base_seconds: float = 3.0, variance: float = 0.5):
    """Sleep for a randomized human-like delay."""
    delay = base_seconds + random.gauss(0, variance)
    delay = max(1.0, delay)  # never less than 1 second
    time.sleep(delay)

def simulate_reading_time(content_length: int) -> float:
    """Estimate how long a human would take to 'read' content."""
    words = content_length / 5  # rough word count
    reading_speed = random.uniform(200, 300)  # words per minute
    return (words / reading_speed) * 60
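One gap in the helpers above: when TikTok or Cloudflare does return a 403/429, retrying on a human-like schedule just burns proxy bandwidth. A minimal exponential-backoff-with-full-jitter sketch (the base and cap values are arbitrary starting points, not tuned thresholds):

```python
import random

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Exponential backoff with full jitter for 403/429 responses.
    Returns a delay in seconds; the caller sleeps for it.
    attempt=0 gives up to `base` seconds, doubling per retry up to `cap`."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (uniform over the whole window, rather than a fixed delay plus noise) also helps avoid several workers retrying in lockstep.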
12. Storing TikTok Data {#storage}
import sqlite3
import time

def init_tiktok_db(db_path: str = "tiktok_data.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS profiles (
            username TEXT PRIMARY KEY,
            user_id TEXT,
            nickname TEXT,
            bio TEXT,
            followers INTEGER,
            following INTEGER,
            video_count INTEGER,
            like_count INTEGER,
            is_verified INTEGER,
            scraped_at REAL
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS videos (
            video_id TEXT PRIMARY KEY,
            author_username TEXT,
            description TEXT,
            play_count INTEGER,
            like_count INTEGER,
            comment_count INTEGER,
            share_count INTEGER,
            duration INTEGER,
            created_time INTEGER,
            music_title TEXT,
            music_author TEXT,
            hashtags TEXT,
            thumbnail_url TEXT,
            video_path TEXT,
            scraped_at REAL
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS hashtags (
            hashtag TEXT PRIMARY KEY,
            view_count INTEGER,
            video_count INTEGER,
            scraped_at REAL
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_videos_author ON videos(author_username)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_videos_created ON videos(created_time)")
    conn.commit()
    return conn

def save_video(conn: sqlite3.Connection, video: dict):
    import json
    conn.execute("""
        INSERT OR REPLACE INTO videos VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
    """, (
        video.get("video_id"), video.get("author_username"),
        video.get("description"), video.get("play_count", 0),
        video.get("like_count", 0), video.get("comment_count", 0),
        video.get("share_count", 0), video.get("duration", 0),
        video.get("created_time"), video.get("music_title"),
        video.get("music_author"),
        json.dumps(video.get("hashtags", [])),
        video.get("thumbnail_url"), video.get("video_path"),
        time.time()
    ))
    conn.commit()
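With the schema in place, analysis reduces to SQL. As one example (assuming the `videos` table created by `init_tiktok_db()` above), ranking stored videos by a simple engagement rate:

```python
import sqlite3

def top_videos_by_engagement(conn: sqlite3.Connection,
                             limit: int = 10) -> list[tuple]:
    """Rank stored videos by (likes + comments + shares) / plays.
    Assumes the videos table from init_tiktok_db()."""
    return conn.execute("""
        SELECT video_id, author_username,
               (like_count + comment_count + share_count) * 1.0
                   / MAX(play_count, 1) AS engagement_rate
        FROM videos
        WHERE play_count > 0
        ORDER BY engagement_rate DESC
        LIMIT ?
    """, (limit,)).fetchall()
```

The `* 1.0` forces floating-point division, and the scalar `MAX(play_count, 1)` guards against division by zero.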
13. Real Use Cases and What's Feasible {#use-cases}
What's Realistic in 2026
| Goal | Feasibility | Method |
|---|---|---|
| Basic profile info (bio, approx followers) | High | og:meta scraping |
| Single video metadata | High | Video page JSON extraction |
| Trending hashtag data | High | Creative Center API |
| User's recent videos (with Research API) | High (if approved) | Research API |
| User's full video list | Medium | TikTokApi library (breaks with updates) |
| Comments at scale | Low | Research API (only reliable method) |
| Video downloads | Medium | Direct URL from page JSON |
| Follower lists | Very Low | Requires working signature system |
Competitor Content Analysis
def analyze_creator_content(username: str, video_data: list[dict]) -> dict:
    """Analyze posting patterns and performance from scraped video data."""
    if not video_data:
        return {}
    import statistics
    play_counts = [v.get("play_count", 0) for v in video_data if v.get("play_count")]
    like_counts = [v.get("like_count", 0) for v in video_data if v.get("like_count")]
    # Hashtag frequency
    hashtag_freq = {}
    for v in video_data:
        for tag in v.get("hashtags", []):
            hashtag_freq[tag] = hashtag_freq.get(tag, 0) + 1
    return {
        "video_count": len(video_data),
        "avg_views": statistics.mean(play_counts) if play_counts else 0,
        "median_views": statistics.median(play_counts) if play_counts else 0,
        "avg_likes": statistics.mean(like_counts) if like_counts else 0,
        "top_hashtags": sorted(hashtag_freq.items(), key=lambda x: -x[1])[:10],
        "avg_duration": statistics.mean([v.get("duration", 0) for v in video_data]),
    }
14. Risks: Legal and Technical {#risks}
Legal risks:
- ByteDance has successfully sued scraping operations (multiple US and EU cases)
- CFAA exposure for bypassing access controls
- GDPR/CCPA apply to personal data collected from EU/California users
- ToS violations create breach-of-contract exposure
Technical risks:
- Everything breaks constantly -- the signature system, HTML structure, and JSON format all change regularly
- TikTok's detection is among the most sophisticated on the internet
- Getting flagged means IP bans that can extend to entire subnets
- Residential proxy costs add up fast given TikTok's challenge page frequency
15. Realistic Alternatives for 2026 {#alternatives}
TikTok Research API -- if you're affiliated with a university or research institution, apply. It's the only stable, sanctioned way to access TikTok data at scale.
TikTok Creative Center -- Free, official, no scraping needed. Trending hashtags, songs, and creator stats: ads.tiktok.com/business/creativecenter/
Paid data providers -- Bright Data, Oxylabs, and others offer TikTok datasets or maintained scraping APIs. Expensive but actually works consistently. They handle the signature system and proxy rotation.
yt-dlp -- For video downloading specifically, yt-dlp handles TikTok better than custom scrapers and is actively maintained:
# Install
pip install yt-dlp
# Download a TikTok video
yt-dlp "https://www.tiktok.com/@username/video/VIDEO_ID" -o "%(id)s.%(ext)s"
# Download without watermark (may require login)
yt-dlp --cookies cookies.txt "https://www.tiktok.com/@username/video/VIDEO_ID"
import subprocess
import json

def download_with_ytdlp(url: str, output_dir: str = ".", proxy: str = None) -> dict:
    """Download TikTok video using yt-dlp."""
    cmd = ["yt-dlp", "--dump-json", url]
    if proxy:
        cmd.extend(["--proxy", proxy])
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        info = json.loads(result.stdout)
        # Now download the actual file
        dl_cmd = ["yt-dlp", "-o", f"{output_dir}/%(id)s.%(ext)s", url]
        if proxy:
            dl_cmd.extend(["--proxy", proxy])
        subprocess.run(dl_cmd)
        return info
    return {"error": result.stderr}
For most legitimate use cases -- content research, trend analysis, competitor monitoring -- combining the Creative Center's free data with the Research API (if accessible) and yt-dlp for specific video downloads covers 90% of needs without the risk and maintenance burden of custom scrapers.
If you do build custom TikTok scrapers, use ThorData's residential proxies as a baseline infrastructure investment -- there's simply no viable path to consistent TikTok access without residential IPs, and ThorData's rotating pool handles the country-targeting granularity that TikTok's geo-based rate limiting requires.