Scraping Instagram Data in 2026: Profiles, Posts, Reels, and the Mobile API
Instagram is the hardest major social platform to scrape. Meta has spent years building layers of anti-bot protection -- aggressive rate limiting, login walls, browser fingerprinting, and machine learning models that flag automated behavior within minutes. If you're used to scraping sites where a rotating User-Agent and some delays get the job done, Instagram will humble you fast.
Let's look at what actually works in 2026, from the safest public-data methods down to the private API and how to handle CDN expiry, pagination, data storage, and proxy rotation at scale.
Table of Contents
- The Official Instagram Graph API
- Public Profile Data via og:meta Tags
- Parsing the Shared Data JSON Blob
- The Mobile Private API
- Getting a Session Cookie
- Paginating Post Feeds
- Scraping Reels Metadata
- CDN Media URL Expiry -- Download Immediately
- Rate Limits and Soft Blocks
- Proxy Strategy for Instagram
- Storing Instagram Data: Schema and Best Practices
- Handling Edge Cases: Private Accounts, Restricted Content
- Real Use Cases
- Legal Reality
- What Actually Works: Practical Strategy for 2026
1. The Official Instagram Graph API {#graph-api}
Meta offers the Instagram Graph API for business and creator accounts. On paper it sounds perfect -- structured data, no scraping needed. In practice it's one of the most restricted APIs in existence.
To use it, you need a Meta app with Instagram permissions, which requires App Review. Meta's review process takes weeks and rejects most applications that aren't clearly tied to a published product. Even if you get approved, the API only works with accounts that have granted your app permission. You cannot use the Graph API to look up arbitrary public profiles.
What you can do:

- Pull your own posts and metrics
- Read comments on your own content
- Get basic business discovery data for other business accounts (username, bio, media count, follower count only)
- Access Instagram Shopping catalogs
What you cannot do:

- Scrape arbitrary public profiles
- Access follower/following lists at scale
- Pull Reels data for accounts you don't own
- Read DMs in bulk
If you're building a social media management tool and have customers willing to connect their accounts, the Graph API is fine. For data collection across many profiles, it's essentially useless.
Basic Graph API Setup
import requests
ACCESS_TOKEN = "your_page_access_token"
GRAPH_BASE = "https://graph.facebook.com/v19.0"
def get_ig_business_account(facebook_page_id: str) -> str:
"""Get the Instagram Business Account ID linked to a Facebook Page."""
resp = requests.get(
f"{GRAPH_BASE}/{facebook_page_id}",
params={
"fields": "instagram_business_account",
"access_token": ACCESS_TOKEN,
}
)
resp.raise_for_status()
data = resp.json()
return data["instagram_business_account"]["id"]
def get_own_media(ig_user_id: str, limit: int = 50) -> list[dict]:
"""Get your own Instagram posts via Graph API."""
posts = []
url = f"{GRAPH_BASE}/{ig_user_id}/media"
params = {
"fields": "id,caption,media_type,media_url,thumbnail_url,timestamp,"
"like_count,comments_count,permalink",
"limit": min(limit, 100),
"access_token": ACCESS_TOKEN,
}
while True:
resp = requests.get(url, params=params)
resp.raise_for_status()
data = resp.json()
posts.extend(data.get("data", []))
paging = data.get("paging", {})
if "next" not in paging or len(posts) >= limit:
break
url = paging["next"]
params = {}
return posts[:limit]
2. Public Profile Data via og:meta Tags {#og-meta}
Here's what most people don't realize: Instagram still serves public profile pages as server-rendered HTML to web crawlers. And those pages contain OpenGraph meta tags with structured data.
When you hit https://www.instagram.com/username/ with a clean request (no cookies, standard headers), Instagram returns a page with og:title, og:description, and og:image tags. The description tag typically contains the bio, follower count, following count, and post count.
import requests
from html.parser import HTMLParser
class OGParser(HTMLParser):
def __init__(self):
super().__init__()
self.og_data = {}
def handle_starttag(self, tag, attrs):
if tag == "meta":
attrs_dict = dict(attrs)
prop = attrs_dict.get("property", "")
name = attrs_dict.get("name", "")
if prop.startswith("og:"):
self.og_data[prop] = attrs_dict.get("content", "")
elif name == "description":
self.og_data["description"] = attrs_dict.get("content", "")
def scrape_instagram_profile_public(username: str,
proxy: str = None) -> dict:
"""Scrape a public Instagram profile using og:meta -- no auth needed."""
url = f"https://www.instagram.com/{username}/"
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
kwargs = {"headers": headers, "timeout": 10}
if proxy:
kwargs["proxies"] = {"http": proxy, "https": proxy}
response = requests.get(url, **kwargs)
response.raise_for_status()
parser = OGParser()
parser.feed(response.text)
# og:description format: "X Followers, Y Following, Z Posts - ..."
desc = parser.og_data.get("og:description", "")
title = parser.og_data.get("og:title", "")
image = parser.og_data.get("og:image", "")
# Parse follower/following/posts from description
followers = following = post_count = None
import re
    m = re.match(r"([\d.,KM]+) Followers,\s*([\d.,KM]+) Following,\s*([\d.,KM]+) Posts", desc)
if m:
def parse_count(s):
s = s.replace(",", "")
if s.endswith("M"):
return int(float(s[:-1]) * 1_000_000)
if s.endswith("K"):
return int(float(s[:-1]) * 1_000)
return int(s)
followers = parse_count(m.group(1))
following = parse_count(m.group(2))
post_count = parse_count(m.group(3))
return {
"username": username,
"title": title,
"description": desc,
"profile_pic": image,
"followers": followers,
"following": following,
"post_count": post_count,
}
# Usage
profile = scrape_instagram_profile_public("natgeo")
print(f"{profile['username']}: {profile['followers']:,} followers")
This works without authentication. The catch: you get very limited data. And Meta throttles this aggressively -- after 20-30 requests from the same IP in a short window, you'll start getting 429 responses or login redirects.
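The two failure modes are worth telling apart, because the right response differs: a 429 means back off, while a login redirect means the IP itself is walled and needs rotating. A small classifier sketch -- the login-wall check assumes `requests` followed redirects, so `response.url` reflects the final page (Instagram redirects throttled anonymous requests to `/accounts/login/`):

```python
def classify_response(status_code: int, final_url: str) -> str:
    """Classify an Instagram web response as ok, rate-limited,
    login-walled, or a generic error.

    final_url should be response.url after redirects.
    """
    if status_code == 429:
        return "rate_limited"
    if "/accounts/login" in final_url:
        return "login_wall"
    if status_code >= 400:
        return "error"
    return "ok"
```

On `rate_limited`, sleep and retry later; on `login_wall`, retrying from the same IP is pointless -- rotate to a fresh residential IP first.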
3. Parsing the Shared Data JSON Blob {#shared-data}
Instagram's profile pages embed a JSON blob inside the HTML that contains much richer data than og:meta tags. It's inside a <script> tag and the structure has changed several times, but in 2026 it still works on public profiles:
import json
import re
def extract_shared_data(html: str) -> dict:
"""Extract __additionalDataLoaded or window._sharedData from Instagram HTML."""
# Try newer format first
pattern1 = r'window\.__additionalDataLoaded\s*\(\s*[^,]+,\s*(\{.*?\})\s*\);'
m = re.search(pattern1, html, re.DOTALL)
if m:
try:
return json.loads(m.group(1))
except json.JSONDecodeError:
pass
# Try older format
pattern2 = r'<script[^>]+>\s*window\._sharedData\s*=\s*(\{.*?\})\s*;</script>'
m = re.search(pattern2, html, re.DOTALL)
if m:
try:
return json.loads(m.group(1))
except json.JSONDecodeError:
pass
# Try script tags with type="application/json"
pattern3 = r'<script type="application/json"[^>]*>(\{.*?\})</script>'
for m in re.finditer(pattern3, html, re.DOTALL):
try:
data = json.loads(m.group(1))
if "user" in str(data)[:200]:
return data
except json.JSONDecodeError:
continue
return {}
def get_profile_from_shared_data(username: str) -> dict:
"""Get enriched profile data including recent posts if available."""
response = requests.get(
f"https://www.instagram.com/{username}/",
headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
timeout=10
)
data = extract_shared_data(response.text)
# Navigate the nested structure (changes with IG updates)
user = None
for path in [
["entry_data", "ProfilePage", 0, "graphql", "user"],
["data", "user"],
["user"],
]:
try:
node = data
for key in path:
node = node[key]
user = node
break
except (KeyError, IndexError, TypeError):
continue
if not user:
return {}
return {
"id": user.get("id"),
"username": user.get("username"),
"full_name": user.get("full_name"),
"biography": user.get("biography"),
"followers": user.get("edge_followed_by", {}).get("count"),
"following": user.get("edge_follow", {}).get("count"),
"post_count": user.get("edge_owner_to_timeline_media", {}).get("count"),
"is_verified": user.get("is_verified"),
"is_business": user.get("is_business_account"),
"category": user.get("business_category_name"),
"external_url": user.get("external_url"),
"profile_pic_hd": user.get("profile_pic_url_hd"),
}
4. The Mobile Private API {#mobile-api}
Instagram's mobile app communicates with Meta's servers through a private API -- a set of undocumented REST endpoints that return JSON. This API exposes far more data than any official channel.
Key endpoints still working in 2026:
| Endpoint | Description |
|---|---|
| `/api/v1/users/web_profile_info/?username=X` | Full profile by username |
| `/api/v1/users/{user_id}/info/` | Profile data by user ID |
| `/api/v1/feed/user/{user_id}/` | User's post feed |
| `/api/v1/usertags/{user_id}/feed/` | Posts the user is tagged in |
| `/api/v1/feed/reels_media/` | Reels metadata |
| `/api/v1/media/{media_id}/comments/` | Comments on a post |
| `/api/v1/friendships/{user_id}/followers/` | Follower list (auth required) |
These endpoints require authentication -- a valid sessionid cookie from a logged-in Instagram account.
Making Private API Requests
import requests
SESSION_ID = "your-session-id-from-browser"
MOBILE_HEADERS = {
"User-Agent": "Instagram 317.0.0.34.109 Android (30/11; 420dpi; "
"1080x2220; samsung; SM-G991B; o1s; exynos2100)",
"X-IG-App-ID": "936619743392459",
"X-IG-Capabilities": "3brTvw==",
"X-IG-Connection-Type": "WIFI",
"Accept-Language": "en-US",
"Accept-Encoding": "gzip, deflate",
}
def ig_api_get(endpoint: str, params: dict = None,
proxy: str = None) -> dict:
"""Make authenticated request to Instagram private API."""
url = f"https://i.instagram.com{endpoint}"
cookies = {"sessionid": SESSION_ID}
kwargs = {
"headers": MOBILE_HEADERS,
"cookies": cookies,
"params": params or {},
"timeout": 15,
}
if proxy:
kwargs["proxies"] = {"http": proxy, "https": proxy}
resp = requests.get(url, **kwargs)
if resp.status_code == 429:
import time
retry_after = int(resp.headers.get("Retry-After", 30))
print(f"Rate limited, waiting {retry_after}s")
time.sleep(retry_after)
return ig_api_get(endpoint, params, proxy)
if resp.status_code == 400:
# Often means account needs re-authentication
raise Exception(f"400 on {endpoint} - session may be expired")
resp.raise_for_status()
return resp.json()
def get_profile_info(username: str) -> dict:
"""Get full profile data by username."""
data = ig_api_get("/api/v1/users/web_profile_info/",
params={"username": username})
user = data.get("data", {}).get("user", {})
return {
"id": user.get("id"),
"username": user.get("username"),
"full_name": user.get("full_name"),
"biography": user.get("biography"),
"followers": user.get("edge_followed_by", {}).get("count"),
"following": user.get("edge_follow", {}).get("count"),
"post_count": user.get("edge_owner_to_timeline_media", {}).get("count"),
"is_verified": user.get("is_verified"),
"is_business": user.get("is_business_account"),
"external_url": user.get("external_url"),
"profile_pic_hd": user.get("profile_pic_url_hd"),
"category": user.get("business_category_name"),
"is_private": user.get("is_private"),
}
5. Getting a Session Cookie {#session-cookie}
The practical way to get a session cookie: log into Instagram in a regular browser, open DevTools > Application > Cookies > instagram.com, and copy the sessionid value.
The session cookie lasts several weeks before expiring. When it expires, you'll start getting 400 or 401 responses.
Never automate the login flow. Instagram's ML systems detect Playwright/Puppeteer logins even with stealth plugins and will immediately flag the account, often requiring phone verification or triggering a permanent ban.
For bulk operations that need multiple session cookies, use separate Instagram accounts and log into each one manually in different browsers. Meta allows up to 5 accounts per person under their ToS.
6. Paginating Post Feeds {#pagination}
Instagram's post feed API returns 12-33 posts per page with a cursor for the next page:
import time
def get_user_posts(user_id: str, max_pages: int = 10,
delay: float = 2.0) -> list[dict]:
"""Paginate through a user's post feed."""
posts = []
max_id = None
for page_num in range(max_pages):
params = {}
if max_id:
params["max_id"] = max_id
data = ig_api_get(f"/api/v1/feed/user/{user_id}/", params=params)
for item in data.get("items", []):
post = {
"id": item.get("pk"),
"shortcode": item.get("code"),
"caption": (item.get("caption") or {}).get("text", ""),
"like_count": item.get("like_count", 0),
"comment_count": item.get("comment_count", 0),
"taken_at": item.get("taken_at"),
"media_type": item.get("media_type"), # 1=photo, 2=video, 8=carousel
"play_count": item.get("play_count"), # videos only
"view_count": item.get("view_count"),
"location": (item.get("location") or {}).get("name"),
"image_url": _extract_image_url(item),
"video_url": _extract_video_url(item),
"is_paid_partnership": item.get("is_paid_partnership"),
"tagged_users": [
u["user"]["username"]
for u in item.get("usertags", {}).get("in", [])
],
}
posts.append(post)
if not data.get("more_available"):
break
max_id = data.get("next_max_id")
if not max_id:
break
time.sleep(delay + (page_num * 0.3)) # Increasing delay
return posts
def _extract_image_url(item: dict) -> str:
"""Extract best-quality image URL from a post item."""
# Single image
candidates = item.get("image_versions2", {}).get("candidates", [])
if candidates:
return candidates[0].get("url", "")
# Carousel - get first image
carousel = item.get("carousel_media", [])
if carousel:
candidates = carousel[0].get("image_versions2", {}).get("candidates", [])
if candidates:
return candidates[0].get("url", "")
return ""
def _extract_video_url(item: dict) -> str:
"""Extract video URL if post is a video."""
if item.get("media_type") == 2: # video
versions = item.get("video_versions", [])
if versions:
return versions[0].get("url", "")
return ""
Getting Comments for a Post
def get_post_comments(media_id: str, max_pages: int = 5) -> list[dict]:
"""Get comments for a specific post."""
comments = []
min_id = None
for _ in range(max_pages):
params = {"can_support_threading": "true"}
if min_id:
params["min_id"] = min_id
data = ig_api_get(f"/api/v1/media/{media_id}/comments/", params=params)
for c in data.get("comments", []):
comments.append({
"id": c.get("pk"),
"text": c.get("text", ""),
"author": c.get("user", {}).get("username"),
"author_id": c.get("user", {}).get("pk"),
"created_at": c.get("created_at_utc"),
"like_count": c.get("comment_like_count", 0),
"reply_count": c.get("child_comment_count", 0),
})
if not data.get("has_more_comments"):
break
min_id = data.get("next_min_id")
if not min_id:
break
time.sleep(1.5)
return comments
7. Scraping Reels Metadata {#reels}
Reels data is accessible through the clips endpoint:
def get_user_reels(user_id: str, max_pages: int = 5) -> list[dict]:
"""Get Reels from a user's profile."""
reels = []
max_id = None
for _ in range(max_pages):
params = {"target_user_id": user_id, "page_size": "12"}
if max_id:
params["max_id"] = max_id
data = ig_api_get("/api/v1/clips/user/", params=params)
for item in data.get("items", []):
media = item.get("media", {})
reel = {
"id": media.get("pk"),
"code": media.get("code"),
"caption": (media.get("caption") or {}).get("text", ""),
"play_count": media.get("play_count", 0),
"like_count": media.get("like_count", 0),
"comment_count": media.get("comment_count", 0),
"duration": media.get("video_duration"),
"taken_at": media.get("taken_at"),
"music": {
"title": media.get("clips_metadata", {})
.get("music_info", {})
.get("music_asset_info", {})
.get("title", ""),
"artist": media.get("clips_metadata", {})
.get("music_info", {})
.get("music_asset_info", {})
.get("display_artist", ""),
},
"thumbnail": (media.get("image_versions2", {})
.get("candidates", [{}])[0]
.get("url", "")),
}
reels.append(reel)
if not data.get("paging_info", {}).get("more_available"):
break
max_id = data.get("paging_info", {}).get("max_id")
if not max_id:
break
time.sleep(2.0)
return reels
8. CDN Media URL Expiry -- Download Immediately {#cdn-expiry}
One of the biggest gotchas with Instagram scraping: media URLs are temporary. Every image and video URL from Instagram's CDN (scontent-*.cdninstagram.com) contains signed parameters with an expiration timestamp. Typically these expire within 24-48 hours.
If you store CDN URLs and try to use them next week, you'll get 403 errors. Always download media files immediately after collecting URLs.
import requests
import os
import re
from urllib.parse import urlparse, parse_qs
import time
def check_cdn_expiry(url: str) -> int | None:
"""Extract expiry timestamp from Instagram CDN URL."""
parsed = urlparse(url)
# Try 'oe' query param (hex timestamp)
params = parse_qs(parsed.query)
if "oe" in params:
try:
return int(params["oe"][0], 16)
except ValueError:
pass
# Try path-encoded expiry
m = re.search(r"[/_]e(\d{10})", url)
if m:
return int(m.group(1))
return None
def is_url_expired(url: str) -> bool:
"""Check if an Instagram CDN URL has expired."""
expiry = check_cdn_expiry(url)
if expiry is None:
return False
return time.time() > expiry
def download_media(url: str, output_path: str,
proxy: str = None) -> bool:
"""Download Instagram media to local file. Returns True on success."""
if is_url_expired(url):
print(f"URL already expired: {url[:60]}...")
return False
kwargs = {
"headers": {"User-Agent": "Instagram/317.0.0.34 Android"},
"timeout": 30,
"stream": True,
}
if proxy:
kwargs["proxies"] = {"http": proxy, "https": proxy}
try:
resp = requests.get(url, **kwargs)
resp.raise_for_status()
with open(output_path, "wb") as f:
for chunk in resp.iter_content(chunk_size=8192):
f.write(chunk)
return True
except Exception as e:
print(f"Download failed: {e}")
return False
def batch_download_posts(posts: list[dict], output_dir: str,
delay: float = 0.5):
"""Download all images/videos from a list of scraped posts."""
os.makedirs(output_dir, exist_ok=True)
for post in posts:
post_id = post.get("id", "unknown")
img_url = post.get("image_url", "")
vid_url = post.get("video_url", "")
if img_url:
ext = "jpg"
path = os.path.join(output_dir, f"{post_id}.{ext}")
if not os.path.exists(path):
download_media(img_url, path)
time.sleep(delay)
if vid_url:
path = os.path.join(output_dir, f"{post_id}.mp4")
if not os.path.exists(path):
download_media(vid_url, path)
time.sleep(delay)
9. Rate Limits and Soft Blocks {#rate-limits}
Instagram's rate limiting on the private API is strict. From a single session cookie and IP:
- Roughly 100-200 requests per day before soft block
- Soft block: requests return 429 or empty `items: []` for 24-48 hours
- Hard block: account temporarily suspended (usually 1-7 days)
- Permanent ban: reserved for high-volume automated access
Meta's detection looks at request patterns, not just volume -- 50 requests in 5 minutes is worse than 200 spread across a full day.
import time
import random
class InstagramRateLimiter:
"""Conservative rate limiter for Instagram private API."""
def __init__(self, requests_per_hour: int = 80):
self.interval = 3600.0 / requests_per_hour
self.last_request = 0.0
self.request_count = 0
self.session_start = time.time()
def wait(self):
# Base interval
elapsed = time.time() - self.last_request
if elapsed < self.interval:
time.sleep(self.interval - elapsed)
# Add jitter to avoid predictable patterns
time.sleep(random.uniform(0.5, 2.0))
self.last_request = time.time()
self.request_count += 1
# Take a longer break every 50 requests
if self.request_count % 50 == 0:
print(f"Took {self.request_count} requests, resting 5 minutes")
time.sleep(300)
    def daily_limit_check(self, limit: int = 150):
        """Raise once the daily request budget is spent; resets after 24h."""
        # Roll the window forward so the counter doesn't stay maxed forever
        if time.time() - self.session_start >= 86400:
            self.session_start = time.time()
            self.request_count = 0
        if self.request_count >= limit:
            raise Exception(f"Daily limit of {limit} requests reached")
rate_limiter = InstagramRateLimiter(requests_per_hour=60)
10. Proxy Strategy for Instagram {#proxies}
Residential proxies are essential for any Instagram scraping at volume. Instagram fingerprints datacenter IPs and blocks them aggressively. Even with valid session cookies, datacenter IPs trigger additional verification challenges.
ThorData provides rotating residential proxy pools with country targeting. Their residential IPs appear as regular household connections to Instagram's detection systems.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
def get_rotating_proxy(country: str = "US") -> str:
"""Get a rotating residential proxy URL."""
user = f"{THORDATA_USER}-country-{country.lower()}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
def get_sticky_proxy(session_id: str, country: str = "US") -> str:
"""Get a sticky session proxy (same IP for duration of session)."""
user = f"{THORDATA_USER}-country-{country.lower()}-session-{session_id}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
# Use sticky sessions for Instagram -- you want the same IP
# for all requests in a scraping session to avoid detection
import uuid
session_proxy = get_sticky_proxy(str(uuid.uuid4())[:8])
print(f"Using proxy session: {session_proxy}")
# Verify proxy works
resp = requests.get("https://httpbin.org/ip", proxies={"https": session_proxy})
print(f"Outbound IP: {resp.json()['origin']}")
Key proxy advice for Instagram:

- Use sticky sessions -- the same IP for an entire scraping session. Rapidly switching IPs is more suspicious than staying on one residential IP.
- Target the same country as your Instagram account's registered location.
- Keep one proxy session per Instagram account to maintain a consistent fingerprint.
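That last point -- one proxy session per account -- is easy to break across process restarts if session IDs are random. A sketch that derives the session ID deterministically from the account name, reusing the ThorData username-suffix convention shown above (the credentials are placeholders):

```python
import uuid

THORDATA_USER = "your_username"   # placeholder credentials
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000

def sticky_proxy_for_account(account: str, country: str = "US") -> str:
    """Map an Instagram account to a stable sticky proxy session.

    uuid5 is deterministic, so the same account name always yields the
    same session ID -- and therefore the same IP/fingerprint pairing --
    across restarts.
    """
    session_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, account))[:8]
    user = f"{THORDATA_USER}-country-{country.lower()}-session-{session_id}"
    return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
```

Because the mapping is a pure function of the account name, you can recompute it anywhere in the pipeline without storing proxy state.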
11. Storing Instagram Data: Schema and Best Practices {#storage}
import sqlite3
import json
import time
def init_instagram_db(db_path: str = "instagram.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("""
CREATE TABLE IF NOT EXISTS profiles (
id TEXT PRIMARY KEY,
username TEXT UNIQUE NOT NULL,
full_name TEXT,
biography TEXT,
followers INTEGER,
following INTEGER,
post_count INTEGER,
is_verified INTEGER,
is_business INTEGER,
category TEXT,
external_url TEXT,
is_private INTEGER,
scraped_at REAL
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS posts (
id TEXT PRIMARY KEY,
user_id TEXT,
username TEXT,
shortcode TEXT,
caption TEXT,
media_type INTEGER,
like_count INTEGER,
comment_count INTEGER,
play_count INTEGER,
taken_at INTEGER,
location TEXT,
is_paid_partnership INTEGER,
image_path TEXT,
video_path TEXT,
scraped_at REAL,
FOREIGN KEY (user_id) REFERENCES profiles(id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS comments (
id TEXT PRIMARY KEY,
post_id TEXT,
author TEXT,
author_id TEXT,
text TEXT,
like_count INTEGER,
created_at INTEGER,
scraped_at REAL,
FOREIGN KEY (post_id) REFERENCES posts(id)
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_posts_user ON posts(user_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_posts_taken ON posts(taken_at)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_comments_post ON comments(post_id)")
conn.commit()
return conn
def save_profile(conn: sqlite3.Connection, profile: dict):
conn.execute("""
INSERT OR REPLACE INTO profiles VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
profile.get("id"), profile.get("username"), profile.get("full_name"),
profile.get("biography"), profile.get("followers"), profile.get("following"),
profile.get("post_count"), int(profile.get("is_verified") or 0),
int(profile.get("is_business") or 0), profile.get("category"),
profile.get("external_url"), int(profile.get("is_private") or 0),
time.time()
))
conn.commit()
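The profiles helper has a natural counterpart for posts. A sketch matching the `posts` schema above and the dicts produced by `get_user_posts()` -- note it stores local file paths from your download step, not the CDN URLs, since those expire within days:

```python
import sqlite3
import time

def save_post(conn: sqlite3.Connection, post: dict,
              user_id: str, username: str,
              image_path: str = None, video_path: str = None):
    """Upsert one scraped post. The placeholder order must match the
    column order in the posts CREATE TABLE statement."""
    conn.execute("""
        INSERT OR REPLACE INTO posts VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
    """, (
        post.get("id"), user_id, username, post.get("shortcode"),
        post.get("caption"), post.get("media_type"),
        post.get("like_count", 0), post.get("comment_count", 0),
        post.get("play_count"), post.get("taken_at"),
        post.get("location"), int(post.get("is_paid_partnership") or 0),
        image_path, video_path, time.time(),
    ))
    conn.commit()
```

Keeping the positional `VALUES` form means any schema change must touch both `init_instagram_db` and this helper; listing column names explicitly in the INSERT is the safer variant if the schema is still evolving.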
12. Handling Edge Cases: Private Accounts, Restricted Content {#edge-cases}
def check_account_accessibility(username: str) -> dict:
"""Check if an account is accessible and what data is available."""
try:
profile = get_profile_info(username)
return {
"accessible": True,
"is_private": profile.get("is_private", False),
"is_verified": profile.get("is_verified", False),
"posts_available": not profile.get("is_private", False),
"profile": profile,
}
except requests.exceptions.HTTPError as e:
if e.response.status_code == 404:
return {"accessible": False, "reason": "account_not_found"}
if e.response.status_code == 403:
return {"accessible": False, "reason": "blocked_or_restricted"}
return {"accessible": False, "reason": f"http_{e.response.status_code}"}
except Exception as e:
return {"accessible": False, "reason": str(e)}
def handle_private_account(user_id: str) -> dict:
"""For private accounts, return only what's publicly available."""
# Only profile info is accessible for private accounts
# Posts, reels, and follower lists require being an approved follower
profile = get_profile_info_by_id(user_id)
return {
"username": profile.get("username"),
"full_name": profile.get("full_name"),
"post_count": profile.get("post_count"),
"followers": profile.get("followers"),
"is_private": True,
"posts": [], # Not accessible
}
def get_profile_info_by_id(user_id: str) -> dict:
data = ig_api_get(f"/api/v1/users/{user_id}/info/")
user = data.get("user", {})
return {
"id": user.get("pk"),
"username": user.get("username"),
"full_name": user.get("full_name"),
"biography": user.get("biography"),
"followers": user.get("follower_count"),
"following": user.get("following_count"),
"post_count": user.get("media_count"),
"is_private": user.get("is_private"),
"is_verified": user.get("is_verified"),
}
13. Real Use Cases {#use-cases}
Influencer Research
Find verified accounts in a niche and benchmark their engagement rates:
def calculate_engagement_rate(profile: dict, posts: list[dict]) -> float:
"""Calculate average engagement rate across recent posts."""
if not posts or not profile.get("followers"):
return 0.0
total_engagement = sum(
p.get("like_count", 0) + p.get("comment_count", 0)
for p in posts
)
avg_engagement = total_engagement / len(posts)
return (avg_engagement / profile["followers"]) * 100
# Benchmark an account
profile = get_profile_info("natgeo")
user_id = profile["id"]
posts = get_user_posts(user_id, max_pages=3)
er = calculate_engagement_rate(profile, posts)
print(f"Engagement rate: {er:.2f}%")
Hashtag Content Archiving
def search_hashtag(hashtag: str, max_pages: int = 5) -> list[dict]:
"""Get recent posts for a hashtag via private API."""
results = []
max_id = None
for _ in range(max_pages):
params = {}
if max_id:
params["max_id"] = max_id
data = ig_api_get(f"/api/v1/feed/tag/{hashtag}/", params=params)
for item in data.get("items", []):
results.append({
"id": item.get("pk"),
"code": item.get("code"),
"like_count": item.get("like_count", 0),
"comment_count": item.get("comment_count", 0),
"author": item.get("user", {}).get("username"),
"caption": (item.get("caption") or {}).get("text", ""),
})
if not data.get("more_available"):
break
max_id = data.get("next_max_id")
if not max_id:
break
time.sleep(2.0)
return results
14. Legal Reality {#legal}
Using the private API violates Instagram's Terms of Service, and Meta has sued scraping companies. Depending on jurisdiction, the US Computer Fraud and Abuse Act (CFAA) and the EU Database Directive add further legal exposure.
In practice: Meta primarily goes after companies scraping at commercial scale -- data brokers, surveillance firms, and competitors. Individual developers doing research or building personal tools rarely face legal action, though account bans are common.
Public data (og:meta tags, the embedded shared-data JSON) sits in grayer legal territory. Courts have generally held -- most notably in hiQ v. LinkedIn -- that scraping publicly accessible data is not a CFAA violation, especially for research.
If you're scraping Instagram for academic research, competitive analysis of your own market, or archiving your own data, you're on relatively solid ground. If you're building a surveillance product or selling scraped data, expect legal attention.
15. What Actually Works: Practical Strategy for 2026 {#summary}
| Data Needed | Method | Auth Required | Risk Level |
|---|---|---|---|
| Basic profile (bio, follower count) | og:meta tags | No | Low |
| Verified profile + post count | og:meta + shared data JSON | No | Low |
| Full profile + all posts | Private mobile API | Session cookie | Medium |
| Comments, Reels, engagement | Private mobile API | Session cookie | Medium |
| Follower/following lists | Private mobile API | Session cookie | High |
| Own account data | Official Graph API | OAuth app | Low (official) |
For most legitimate use cases in 2026:
- Use og:meta for basic profile data (bio, follower counts) -- it's public, doesn't require auth, and is legally the safest
- Use the Graph API if you need data from accounts willing to grant permission
- Use the private API sparingly for specific data you can't get any other way -- keep volume low, use residential sticky proxies from ThorData, and understand the risk
- Download media immediately -- CDN URLs expire within 24-48 hours
- Respect rate limits -- 80-100 requests/day per session, with 2-5 second random delays
- Never automate Instagram login -- always export cookies manually from a real browser session
Instagram scraping in 2026 is a game of patience. The days of pulling 10,000 profiles in an afternoon are gone. But if you're thoughtful about volume, use residential proxies for consistent IP reputation, and stick to public-facing data wherever possible, meaningful data collection remains feasible.