Scraping Instagram Data in 2026: Profiles, Posts, Reels, and the Mobile API
Instagram is the hardest major social platform to scrape. Meta has spent years building layers of anti-bot protection -- aggressive rate limiting, login walls, browser fingerprinting, and machine learning models that flag automated behavior within minutes. If you're used to scraping sites where a rotating User-Agent and some delays get the job done, Instagram will humble you fast.
Let's look at what actually works in 2026, from the safest public-data methods down to the private API and how to handle CDN expiry, pagination, data storage, and proxy rotation at scale.
Table of Contents
- The Official Instagram Graph API
- Public Profile Data via og:meta Tags
- Parsing the Shared Data JSON Blob
- The Mobile Private API
- Getting a Session Cookie
- Paginating Post Feeds
- Scraping Reels Metadata
- CDN Media URL Expiry -- Download Immediately
- Rate Limits and Soft Blocks
- Proxy Strategy for Instagram
- Storing Instagram Data: Schema and Best Practices
- Handling Edge Cases: Private Accounts, Restricted Content
- Real Use Cases
- Legal Reality
- What Actually Works: Practical Strategy for 2026
1. The Official Instagram Graph API {#graph-api}
Meta offers the Instagram Graph API for business and creator accounts. On paper it sounds perfect -- structured data, no scraping needed. In practice it's one of the most restricted APIs in existence.
To use it, you need a Meta app with Instagram permissions, which requires App Review. Meta's review process takes weeks and rejects most applications that aren't clearly tied to a published product. Even if you get approved, the API only works with accounts that have granted your app permission. You cannot use the Graph API to look up arbitrary public profiles.
What you can do:

- Pull your own posts and metrics
- Read comments on your own content
- Get basic business discovery data for other business accounts (username, bio, media count, follower count only)
- Access Instagram Shopping catalogs
What you cannot do:

- Scrape arbitrary public profiles
- Access follower/following lists at scale
- Pull Reels data for accounts you don't own
- Read DMs in bulk
If you're building a social media management tool and have customers willing to connect their accounts, the Graph API is fine. For data collection across many profiles, it's essentially useless.
Basic Graph API Setup
import requests
ACCESS_TOKEN = "your_page_access_token"
GRAPH_BASE = "https://graph.facebook.com/v19.0"
def get_ig_business_account(facebook_page_id: str) -> str:
"""Get the Instagram Business Account ID linked to a Facebook Page."""
resp = requests.get(
f"{GRAPH_BASE}/{facebook_page_id}",
params={
"fields": "instagram_business_account",
"access_token": ACCESS_TOKEN,
}
)
resp.raise_for_status()
data = resp.json()
return data["instagram_business_account"]["id"]
def get_own_media(ig_user_id: str, limit: int = 50) -> list[dict]:
"""Get your own Instagram posts via Graph API."""
posts = []
url = f"{GRAPH_BASE}/{ig_user_id}/media"
params = {
"fields": "id,caption,media_type,media_url,thumbnail_url,timestamp,"
"like_count,comments_count,permalink",
"limit": min(limit, 100),
"access_token": ACCESS_TOKEN,
}
while True:
resp = requests.get(url, params=params)
resp.raise_for_status()
data = resp.json()
posts.extend(data.get("data", []))
paging = data.get("paging", {})
if "next" not in paging or len(posts) >= limit:
break
url = paging["next"]
params = {}
return posts[:limit]
2. Public Profile Data via og:meta Tags {#og-meta}
Here's what most people don't realize: Instagram still serves public profile pages as server-rendered HTML to web crawlers. And those pages contain OpenGraph meta tags with structured data.
When you hit https://www.instagram.com/username/ with a clean request (no cookies, standard headers), Instagram returns a page with og:title, og:description, and og:image tags. The description tag typically contains the bio, follower count, following count, and post count.
import requests
from html.parser import HTMLParser
class OGParser(HTMLParser):
def __init__(self):
super().__init__()
self.og_data = {}
def handle_starttag(self, tag, attrs):
if tag == "meta":
attrs_dict = dict(attrs)
prop = attrs_dict.get("property", "")
name = attrs_dict.get("name", "")
if prop.startswith("og:"):
self.og_data[prop] = attrs_dict.get("content", "")
elif name == "description":
self.og_data["description"] = attrs_dict.get("content", "")
def scrape_instagram_profile_public(username: str,
proxy: str = None) -> dict:
"""Scrape a public Instagram profile using og:meta -- no auth needed."""
url = f"https://www.instagram.com/{username}/"
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/131.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
kwargs = {"headers": headers, "timeout": 10}
if proxy:
kwargs["proxies"] = {"http": proxy, "https": proxy}
response = requests.get(url, **kwargs)
response.raise_for_status()
parser = OGParser()
parser.feed(response.text)
# og:description format: "X Followers, Y Following, Z Posts - ..."
desc = parser.og_data.get("og:description", "")
title = parser.og_data.get("og:title", "")
image = parser.og_data.get("og:image", "")
# Parse follower/following/posts from description
followers = following = post_count = None
import re
    m = re.match(r"([\d.,KM]+) Followers,\s*([\d.,KM]+) Following,\s*([\d.,KM]+) Posts", desc)
if m:
def parse_count(s):
s = s.replace(",", "")
if s.endswith("M"):
return int(float(s[:-1]) * 1_000_000)
if s.endswith("K"):
return int(float(s[:-1]) * 1_000)
return int(s)
followers = parse_count(m.group(1))
following = parse_count(m.group(2))
post_count = parse_count(m.group(3))
return {
"username": username,
"title": title,
"description": desc,
"profile_pic": image,
"followers": followers,
"following": following,
"post_count": post_count,
}
# Usage
profile = scrape_instagram_profile_public("natgeo")
print(f"{profile['username']}: {profile['followers']:,} followers")
This works without authentication. The catch: you get very limited data. And Meta throttles this aggressively -- after 20-30 requests from the same IP in a short window, you'll start getting 429 responses or login redirects.
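The two failure modes are worth telling apart, because the right response differs: a 429 means back off, while a login redirect means the IP itself is walled and needs rotating. A small classifier sketch -- the login-wall check assumes `requests` followed redirects, so `response.url` reflects the final page (Instagram redirects throttled anonymous requests to `/accounts/login/`):

```python
def classify_response(status_code: int, final_url: str) -> str:
    """Classify an Instagram web response as ok, rate-limited,
    login-walled, or a generic error.

    final_url should be response.url after redirects.
    """
    if status_code == 429:
        return "rate_limited"
    if "/accounts/login" in final_url:
        return "login_wall"
    if status_code >= 400:
        return "error"
    return "ok"
```

On `rate_limited`, sleep and retry later; on `login_wall`, retrying from the same IP is pointless -- rotate to a fresh residential IP first.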
3. Parsing the Shared Data JSON Blob {#shared-data}
Instagram's profile pages embed a JSON blob inside the HTML that contains much richer data than og:meta tags. It's inside a <script> tag and the structure has changed several times, but in 2026 it still works on public profiles:
import json
import re
def extract_shared_data(html: str) -> dict:
"""Extract __additionalDataLoaded or window._sharedData from Instagram HTML."""
# Try newer format first
pattern1 = r'window\.__additionalDataLoaded\s*\(\s*[^,]+,\s*(\{.*?\})\s*\);'
m = re.search(pattern1, html, re.DOTALL)
if m:
try:
return json.loads(m.group(1))
except json.JSONDecodeError:
pass
# Try older format
pattern2 = r'<script[^>]+>\s*window\._sharedData\s*=\s*(\{.*?\})\s*;</script>'
m = re.search(pattern2, html, re.DOTALL)
if m:
try:
return json.loads(m.group(1))
except json.JSONDecodeError:
pass
# Try script tags with type="application/json"
pattern3 = r'<script type="application/json"[^>]*>(\{.*?\})</script>'
for m in re.finditer(pattern3, html, re.DOTALL):
try:
data = json.loads(m.group(1))
if "user" in str(data)[:200]:
return data
except json.JSONDecodeError:
continue
return {}
def get_profile_from_shared_data(username: str) -> dict:
"""Get enriched profile data including recent posts if available."""
response = requests.get(
f"https://www.instagram.com/{username}/",
headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
timeout=10
)
data = extract_shared_data(response.text)
# Navigate the nested structure (changes with IG updates)
user = None
for path in [
["entry_data", "ProfilePage", 0, "graphql", "user"],
["data", "user"],
["user"],
]:
try:
node = data
for key in path:
node = node[key]
user = node
break
except (KeyError, IndexError, TypeError):
continue
if not user:
return {}
return {
"id": user.get("id"),
"username": user.get("username"),
"full_name": user.get("full_name"),
"biography": user.get("biography"),
"followers": user.get("edge_followed_by", {}).get("count"),
"following": user.get("edge_follow", {}).get("count"),
"post_count": user.get("edge_owner_to_timeline_media", {}).get("count"),
"is_verified": user.get("is_verified"),
"is_business": user.get("is_business_account"),
"category": user.get("business_category_name"),
"external_url": user.get("external_url"),
"profile_pic_hd": user.get("profile_pic_url_hd"),
}
4. The Mobile Private API {#mobile-api}
Instagram's mobile app communicates with Meta's servers through a private API -- a set of undocumented REST endpoints that return JSON. This API exposes far more data than any official channel.
Key endpoints still working in 2026:
| Endpoint | Description |
|---|---|
| `/api/v1/users/web_profile_info/?username=X` | Full profile by username |
| `/api/v1/users/{user_id}/info/` | Profile data by user ID |
| `/api/v1/feed/user/{user_id}/` | User's post feed |
| `/api/v1/usertags/{user_id}/feed/` | Posts the user is tagged in |
| `/api/v1/feed/reels_media/` | Reels metadata |
| `/api/v1/media/{media_id}/comments/` | Comments on a post |
| `/api/v1/friendships/{user_id}/followers/` | Follower list (auth required) |
These endpoints require authentication -- a valid sessionid cookie from a logged-in Instagram account.
Making Private API Requests
import requests
SESSION_ID = "your-session-id-from-browser"
MOBILE_HEADERS = {
"User-Agent": "Instagram 317.0.0.34.109 Android (30/11; 420dpi; "
"1080x2220; samsung; SM-G991B; o1s; exynos2100)",
"X-IG-App-ID": "936619743392459",
"X-IG-Capabilities": "3brTvw==",
"X-IG-Connection-Type": "WIFI",
"Accept-Language": "en-US",
"Accept-Encoding": "gzip, deflate",
}
def ig_api_get(endpoint: str, params: dict = None,
proxy: str = None) -> dict:
"""Make authenticated request to Instagram private API."""
url = f"https://i.instagram.com{endpoint}"
cookies = {"sessionid": SESSION_ID}
kwargs = {
"headers": MOBILE_HEADERS,
"cookies": cookies,
"params": params or {},
"timeout": 15,
}
if proxy:
kwargs["proxies"] = {"http": proxy, "https": proxy}
resp = requests.get(url, **kwargs)
if resp.status_code == 429:
import time
retry_after = int(resp.headers.get("Retry-After", 30))
print(f"Rate limited, waiting {retry_after}s")
time.sleep(retry_after)
return ig_api_get(endpoint, params, proxy)
if resp.status_code == 400:
# Often means account needs re-authentication
raise Exception(f"400 on {endpoint} - session may be expired")
resp.raise_for_status()
return resp.json()
def get_profile_info(username: str) -> dict:
"""Get full profile data by username."""
data = ig_api_get("/api/v1/users/web_profile_info/",
params={"username": username})
user = data.get("data", {}).get("user", {})
return {
"id": user.get("id"),
"username": user.get("username"),
"full_name": user.get("full_name"),
"biography": user.get("biography"),
"followers": user.get("edge_followed_by", {}).get("count"),
"following": user.get("edge_follow", {}).get("count"),
"post_count": user.get("edge_owner_to_timeline_media", {}).get("count"),
"is_verified": user.get("is_verified"),
"is_business": user.get("is_business_account"),
"external_url": user.get("external_url"),
"profile_pic_hd": user.get("profile_pic_url_hd"),
"category": user.get("business_category_name"),
"is_private": user.get("is_private"),
}
5. Getting a Session Cookie {#session-cookie}
The practical way to get a session cookie: log into Instagram in a regular browser, open DevTools > Application > Cookies > instagram.com, and copy the sessionid value.
The session cookie lasts several weeks before expiring. When it expires, you'll start getting 400 or 401 responses.
Never automate the login flow. Instagram's ML systems detect Playwright/Puppeteer logins even with stealth plugins and will immediately flag the account, often requiring phone verification or triggering a permanent ban.
For bulk operations that need multiple session cookies, use separate Instagram accounts and log into each one manually in different browsers. Meta allows up to 5 accounts per person under their ToS.
6. Paginating Post Feeds {#pagination}
Instagram's post feed API returns 12-33 posts per page with a cursor for the next page:
import time
def get_user_posts(user_id: str, max_pages: int = 10,
delay: float = 2.0) -> list[dict]:
"""Paginate through a user's post feed."""
posts = []
max_id = None
for page_num in range(max_pages):
params = {}
if max_id:
params["max_id"] = max_id
data = ig_api_get(f"/api/v1/feed/user/{user_id}/", params=params)
for item in data.get("items", []):
post = {
"id": item.get("pk"),
"shortcode": item.get("code"),
"caption": (item.get("caption") or {}).get("text", ""),
"like_count": item.get("like_count", 0),
"comment_count": item.get("comment_count", 0),
"taken_at": item.get("taken_at"),
"media_type": item.get("media_type"), # 1=photo, 2=video, 8=carousel
"play_count": item.get("play_count"), # videos only
"view_count": item.get("view_count"),
"location": (item.get("location") or {}).get("name"),
"image_url": _extract_image_url(item),
"video_url": _extract_video_url(item),
"is_paid_partnership": item.get("is_paid_partnership"),
"tagged_users": [
u["user"]["username"]
for u in item.get("usertags", {}).get("in", [])
],
}
posts.append(post)
if not data.get("more_available"):
break
max_id = data.get("next_max_id")
if not max_id:
break
time.sleep(delay + (page_num * 0.3)) # Increasing delay
return posts
def _extract_image_url(item: dict) -> str:
"""Extract best-quality image URL from a post item."""
# Single image
candidates = item.get("image_versions2", {}).get("candidates", [])
if candidates:
return candidates[0].get("url", "")
# Carousel - get first image
carousel = item.get("carousel_media", [])
if carousel:
candidates = carousel[0].get("image_versions2", {}).get("candidates", [])
if candidates:
return candidates[0].get("url", "")
return ""
def _extract_video_url(item: dict) -> str:
"""Extract video URL if post is a video."""
if item.get("media_type") == 2: # video
versions = item.get("video_versions", [])
if versions:
return versions[0].get("url", "")
return ""
Getting Comments for a Post
def get_post_comments(media_id: str, max_pages: int = 5) -> list[dict]:
"""Get comments for a specific post."""
comments = []
min_id = None
for _ in range(max_pages):
params = {"can_support_threading": "true"}
if min_id:
params["min_id"] = min_id
data = ig_api_get(f"/api/v1/media/{media_id}/comments/", params=params)
for c in data.get("comments", []):
comments.append({
"id": c.get("pk"),
"text": c.get("text", ""),
"author": c.get("user", {}).get("username"),
"author_id": c.get("user", {}).get("pk"),
"created_at": c.get("created_at_utc"),
"like_count": c.get("comment_like_count", 0),
"reply_count": c.get("child_comment_count", 0),
})
if not data.get("has_more_comments"):
break
min_id = data.get("next_min_id")
if not min_id:
break
time.sleep(1.5)
return comments
7. Scraping Reels Metadata {#reels}
Reels data is accessible through the clips endpoint:
def get_user_reels(user_id: str, max_pages: int = 5) -> list[dict]:
"""Get Reels from a user's profile."""
reels = []
max_id = None
for _ in range(max_pages):
params = {"target_user_id": user_id, "page_size": "12"}
if max_id:
params["max_id"] = max_id
data = ig_api_get("/api/v1/clips/user/", params=params)
for item in data.get("items", []):
media = item.get("media", {})
reel = {
"id": media.get("pk"),
"code": media.get("code"),
"caption": (media.get("caption") or {}).get("text", ""),
"play_count": media.get("play_count", 0),
"like_count": media.get("like_count", 0),
"comment_count": media.get("comment_count", 0),
"duration": media.get("video_duration"),
"taken_at": media.get("taken_at"),
"music": {
"title": media.get("clips_metadata", {})
.get("music_info", {})
.get("music_asset_info", {})
.get("title", ""),
"artist": media.get("clips_metadata", {})
.get("music_info", {})
.get("music_asset_info", {})
.get("display_artist", ""),
},
"thumbnail": (media.get("image_versions2", {})
.get("candidates", [{}])[0]
.get("url", "")),
}
reels.append(reel)
if not data.get("paging_info", {}).get("more_available"):
break
max_id = data.get("paging_info", {}).get("max_id")
if not max_id:
break
time.sleep(2.0)
return reels
8. CDN Media URL Expiry -- Download Immediately {#cdn-expiry}
One of the biggest gotchas with Instagram scraping: media URLs are temporary. Every image and video URL from Instagram's CDN (scontent-*.cdninstagram.com) contains signed parameters with an expiration timestamp. Typically these expire within 24-48 hours.
If you store CDN URLs and try to use them next week, you'll get 403 errors. Always download media files immediately after collecting URLs.
import requests
import os
import re
from urllib.parse import urlparse, parse_qs
import time
def check_cdn_expiry(url: str) -> int | None:
"""Extract expiry timestamp from Instagram CDN URL."""
parsed = urlparse(url)
# Try 'oe' query param (hex timestamp)
params = parse_qs(parsed.query)
if "oe" in params:
try:
return int(params["oe"][0], 16)
except ValueError:
pass
# Try path-encoded expiry
m = re.search(r"[/_]e(\d{10})", url)
if m:
return int(m.group(1))
return None
def is_url_expired(url: str) -> bool:
"""Check if an Instagram CDN URL has expired."""
expiry = check_cdn_expiry(url)
if expiry is None:
return False
return time.time() > expiry
def download_media(url: str, output_path: str,
proxy: str = None) -> bool:
"""Download Instagram media to local file. Returns True on success."""
if is_url_expired(url):
print(f"URL already expired: {url[:60]}...")
return False
kwargs = {
"headers": {"User-Agent": "Instagram/317.0.0.34 Android"},
"timeout": 30,
"stream": True,
}
if proxy:
kwargs["proxies"] = {"http": proxy, "https": proxy}
try:
resp = requests.get(url, **kwargs)
resp.raise_for_status()
with open(output_path, "wb") as f:
for chunk in resp.iter_content(chunk_size=8192):
f.write(chunk)
return True
except Exception as e:
print(f"Download failed: {e}")
return False
def batch_download_posts(posts: list[dict], output_dir: str,
delay: float = 0.5):
"""Download all images/videos from a list of scraped posts."""
os.makedirs(output_dir, exist_ok=True)
for post in posts:
post_id = post.get("id", "unknown")
img_url = post.get("image_url", "")
vid_url = post.get("video_url", "")
if img_url:
ext = "jpg"
path = os.path.join(output_dir, f"{post_id}.{ext}")
if not os.path.exists(path):
download_media(img_url, path)
time.sleep(delay)
if vid_url:
path = os.path.join(output_dir, f"{post_id}.mp4")
if not os.path.exists(path):
download_media(vid_url, path)
time.sleep(delay)
9. Rate Limits and Soft Blocks {#rate-limits}
Instagram's rate limiting on the private API is strict. From a single session cookie and IP:
- Roughly 100-200 requests per day before soft block
- Soft block: requests return 429 or empty `items: []` for 24-48 hours
- Hard block: account temporarily suspended (usually 1-7 days)
- Permanent ban: reserved for high-volume automated access
Meta's detection looks at request patterns, not just volume -- 50 requests in 5 minutes is worse than 200 spread across a full day.
import time
import random
class InstagramRateLimiter:
"""Conservative rate limiter for Instagram private API."""
def __init__(self, requests_per_hour: int = 80):
self.interval = 3600.0 / requests_per_hour
self.last_request = 0.0
self.request_count = 0
self.session_start = time.time()
def wait(self):
# Base interval
elapsed = time.time() - self.last_request
if elapsed < self.interval:
time.sleep(self.interval - elapsed)
# Add jitter to avoid predictable patterns
time.sleep(random.uniform(0.5, 2.0))
self.last_request = time.time()
self.request_count += 1
# Take a longer break every 50 requests
if self.request_count % 50 == 0:
print(f"Took {self.request_count} requests, resting 5 minutes")
time.sleep(300)
    def daily_limit_check(self, limit: int = 150):
        """Raise once the daily request budget is spent; resets after 24h."""
        # Roll the window forward so the counter doesn't stay maxed forever
        if time.time() - self.session_start >= 86400:
            self.session_start = time.time()
            self.request_count = 0
        if self.request_count >= limit:
            raise Exception(f"Daily limit of {limit} requests reached")
rate_limiter = InstagramRateLimiter(requests_per_hour=60)
10. Proxy Strategy for Instagram {#proxies}
Residential proxies are essential for any Instagram scraping at volume. Instagram fingerprints datacenter IPs and blocks them aggressively. Even with valid session cookies, datacenter IPs trigger additional verification challenges.
ThorData provides rotating residential proxy pools with country targeting. Their residential IPs appear as regular household connections to Instagram's detection systems.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
def get_rotating_proxy(country: str = "US") -> str:
"""Get a rotating residential proxy URL."""
user = f"{THORDATA_USER}-country-{country.lower()}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
def get_sticky_proxy(session_id: str, country: str = "US") -> str:
"""Get a sticky session proxy (same IP for duration of session)."""
user = f"{THORDATA_USER}-country-{country.lower()}-session-{session_id}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
# Use sticky sessions for Instagram -- you want the same IP
# for all requests in a scraping session to avoid detection
import uuid
session_proxy = get_sticky_proxy(str(uuid.uuid4())[:8])
print(f"Using proxy session: {session_proxy}")
# Verify proxy works
resp = requests.get("https://httpbin.org/ip", proxies={"https": session_proxy})
print(f"Outbound IP: {resp.json()['origin']}")
Key proxy advice for Instagram:

- Use sticky sessions -- the same IP for an entire scraping session. Rapidly switching IPs is more suspicious than staying on one residential IP.
- Target the same country as your Instagram account's registered location.
- Keep one proxy session per Instagram account to maintain a consistent fingerprint.
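That last point -- one proxy session per account -- is easy to break across process restarts if session IDs are random. A sketch that derives the session ID deterministically from the account name, reusing the ThorData username-suffix convention shown above (the credentials are placeholders):

```python
import uuid

THORDATA_USER = "your_username"   # placeholder credentials
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000

def sticky_proxy_for_account(account: str, country: str = "US") -> str:
    """Map an Instagram account to a stable sticky proxy session.

    uuid5 is deterministic, so the same account name always yields the
    same session ID -- and therefore the same IP/fingerprint pairing --
    across restarts.
    """
    session_id = str(uuid.uuid5(uuid.NAMESPACE_DNS, account))[:8]
    user = f"{THORDATA_USER}-country-{country.lower()}-session-{session_id}"
    return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
```

Because the mapping is a pure function of the account name, you can recompute it anywhere in the pipeline without storing proxy state.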
11. Storing Instagram Data: Schema and Best Practices {#storage}
import sqlite3
import json
import time
def init_instagram_db(db_path: str = "instagram.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("""
CREATE TABLE IF NOT EXISTS profiles (
id TEXT PRIMARY KEY,
username TEXT UNIQUE NOT NULL,
full_name TEXT,
biography TEXT,
followers INTEGER,
following INTEGER,
post_count INTEGER,
is_verified INTEGER,
is_business INTEGER,
category TEXT,
external_url TEXT,
is_private INTEGER,
scraped_at REAL
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS posts (
id TEXT PRIMARY KEY,
user_id TEXT,
username TEXT,
shortcode TEXT,
caption TEXT,
media_type INTEGER,
like_count INTEGER,
comment_count INTEGER,
play_count INTEGER,
taken_at INTEGER,
location TEXT,
is_paid_partnership INTEGER,
image_path TEXT,
video_path TEXT,
scraped_at REAL,
FOREIGN KEY (user_id) REFERENCES profiles(id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS comments (
id TEXT PRIMARY KEY,
post_id TEXT,
author TEXT,
author_id TEXT,
text TEXT,
like_count INTEGER,
created_at INTEGER,
scraped_at REAL,
FOREIGN KEY (post_id) REFERENCES posts(id)
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_posts_user ON posts(user_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_posts_taken ON posts(taken_at)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_comments_post ON comments(post_id)")
conn.commit()
return conn
def save_profile(conn: sqlite3.Connection, profile: dict):
conn.execute("""
INSERT OR REPLACE INTO profiles VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
profile.get("id"), profile.get("username"), profile.get("full_name"),
profile.get("biography"), profile.get("followers"), profile.get("following"),
profile.get("post_count"), int(profile.get("is_verified") or 0),
int(profile.get("is_business") or 0), profile.get("category"),
profile.get("external_url"), int(profile.get("is_private") or 0),
time.time()
))
conn.commit()
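The profiles helper has a natural counterpart for posts. A sketch matching the `posts` schema above and the dicts produced by `get_user_posts()` -- note it stores local file paths from your download step, not the CDN URLs, since those expire within days:

```python
import sqlite3
import time

def save_post(conn: sqlite3.Connection, post: dict,
              user_id: str, username: str,
              image_path: str = None, video_path: str = None):
    """Upsert one scraped post. The placeholder order must match the
    column order in the posts CREATE TABLE statement."""
    conn.execute("""
        INSERT OR REPLACE INTO posts VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
    """, (
        post.get("id"), user_id, username, post.get("shortcode"),
        post.get("caption"), post.get("media_type"),
        post.get("like_count", 0), post.get("comment_count", 0),
        post.get("play_count"), post.get("taken_at"),
        post.get("location"), int(post.get("is_paid_partnership") or 0),
        image_path, video_path, time.time(),
    ))
    conn.commit()
```

Keeping the positional `VALUES` form means any schema change must touch both `init_instagram_db` and this helper; listing column names explicitly in the INSERT is the safer variant if the schema is still evolving.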
12. Handling Edge Cases: Private Accounts, Restricted Content {#edge-cases}
def check_account_accessibility(username: str) -> dict:
"""Check if an account is accessible and what data is available."""
try:
profile = get_profile_info(username)
return {
"accessible": True,
"is_private": profile.get("is_private", False),
"is_verified": profile.get("is_verified", False),
"posts_available": not profile.get("is_private", False),
"profile": profile,
}
except requests.exceptions.HTTPError as e:
if e.response.status_code == 404:
return {"accessible": False, "reason": "account_not_found"}
if e.response.status_code == 403:
return {"accessible": False, "reason": "blocked_or_restricted"}
return {"accessible": False, "reason": f"http_{e.response.status_code}"}
except Exception as e:
return {"accessible": False, "reason": str(e)}
def handle_private_account(user_id: str) -> dict:
"""For private accounts, return only what's publicly available."""
# Only profile info is accessible for private accounts
# Posts, reels, and follower lists require being an approved follower
profile = get_profile_info_by_id(user_id)
return {
"username": profile.get("username"),
"full_name": profile.get("full_name"),
"post_count": profile.get("post_count"),
"followers": profile.get("followers"),
"is_private": True,
"posts": [], # Not accessible
}
def get_profile_info_by_id(user_id: str) -> dict:
data = ig_api_get(f"/api/v1/users/{user_id}/info/")
user = data.get("user", {})
return {
"id": user.get("pk"),
"username": user.get("username"),
"full_name": user.get("full_name"),
"biography": user.get("biography"),
"followers": user.get("follower_count"),
"following": user.get("following_count"),
"post_count": user.get("media_count"),
"is_private": user.get("is_private"),
"is_verified": user.get("is_verified"),
}
13. Real Use Cases {#use-cases}
Influencer Research
Find verified accounts in a niche and benchmark their engagement rates:
def calculate_engagement_rate(profile: dict, posts: list[dict]) -> float:
"""Calculate average engagement rate across recent posts."""
if not posts or not profile.get("followers"):
return 0.0
total_engagement = sum(
p.get("like_count", 0) + p.get("comment_count", 0)
for p in posts
)
avg_engagement = total_engagement / len(posts)
return (avg_engagement / profile["followers"]) * 100
# Benchmark an account
profile = get_profile_info("natgeo")
user_id = profile["id"]
posts = get_user_posts(user_id, max_pages=3)
er = calculate_engagement_rate(profile, posts)
print(f"Engagement rate: {er:.2f}%")
Hashtag Content Archiving
def search_hashtag(hashtag: str, max_pages: int = 5) -> list[dict]:
"""Get recent posts for a hashtag via private API."""
results = []
max_id = None
for _ in range(max_pages):
params = {}
if max_id:
params["max_id"] = max_id
data = ig_api_get(f"/api/v1/feed/tag/{hashtag}/", params=params)
for item in data.get("items", []):
results.append({
"id": item.get("pk"),
"code": item.get("code"),
"like_count": item.get("like_count", 0),
"comment_count": item.get("comment_count", 0),
"author": item.get("user", {}).get("username"),
"caption": (item.get("caption") or {}).get("text", ""),
})
if not data.get("more_available"):
break
max_id = data.get("next_max_id")
if not max_id:
break
time.sleep(2.0)
return results
14. Legal Reality {#legal}
Using the private API violates Instagram's Terms of Service, and Meta has sued scraping companies. Depending on jurisdiction, the US Computer Fraud and Abuse Act (CFAA) and the EU Database Directive add further legal exposure.
In practice: Meta primarily goes after companies scraping at commercial scale -- data brokers, surveillance firms, and competitors. Individual developers doing research or building personal tools rarely face legal action, though account bans are common.
Public data (og:meta tags, the embedded shared-data JSON) sits in grayer legal territory. Courts have generally held -- most notably in hiQ v. LinkedIn -- that scraping publicly accessible data is not a CFAA violation, especially for research.
If you're scraping Instagram for academic research, competitive analysis of your own market, or archiving your own data, you're on relatively solid ground. If you're building a surveillance product or selling scraped data, expect legal attention.
15. What Actually Works: Practical Strategy for 2026 {#summary}
| Data Needed | Method | Auth Required | Risk Level |
|---|---|---|---|
| Basic profile (bio, follower count) | og:meta tags | No | Low |
| Verified profile + post count | og:meta + shared data JSON | No | Low |
| Full profile + all posts | Private mobile API | Session cookie | Medium |
| Comments, Reels, engagement | Private mobile API | Session cookie | Medium |
| Follower/following lists | Private mobile API | Session cookie | High |
| Own account data | Official Graph API | OAuth app | Low (official) |
For most legitimate use cases in 2026:
- Use og:meta for basic profile data (bio, follower counts) -- it's public, doesn't require auth, and is legally the safest
- Use the Graph API if you need data from accounts willing to grant permission
- Use the private API sparingly for specific data you can't get any other way -- keep volume low, use residential sticky proxies from ThorData, and understand the risk
- Download media immediately -- CDN URLs expire within 24-48 hours
- Respect rate limits -- 80-100 requests/day per session, with 2-5 second random delays
- Never automate Instagram login -- always export cookies manually from a real browser session
Instagram scraping in 2026 is a game of patience. The days of pulling 10,000 profiles in an afternoon are gone. But if you're thoughtful about volume, use residential proxies for consistent IP reputation, and stick to public-facing data wherever possible, meaningful data collection remains feasible.