Scrape Facebook Public Pages & Post Engagement with Python (2026)
Facebook scraping in 2026 is a mixed bag. The Meta Graph API exists and gives you structured access to public page data, but its scope has been shrinking steadily since the Cambridge Analytica fallout. Many endpoints that were open in 2020 now require app review or special permissions. For public content that the API doesn't expose, browser-based scraping with Playwright is still an option — though Facebook's anti-bot detection is among the most aggressive on the web.
This guide covers both approaches with complete working code. We use the Graph API where it works, and fall back to Playwright for what remains. Only public data — no private profiles, no closed groups, no direct messages.
What's Available via the Graph API
With a basic Facebook App token, you can access:
- Public page info — name, category, follower count, about text, website, phone, address
- Page posts — text, type, created time (engagement counts need a Page token)
- Page events — public events hosted by pages
- Public videos — title, description, views on public page videos
- Page insights — available to page admins only
What you cannot get without special permissions:
- Individual user reactions (who reacted, what reaction type)
- Comment content on most posts
- Detailed engagement breakdowns
- Anything related to user profiles
- Group member lists
Setting Up the Graph API
You need a Facebook App. Go to developers.facebook.com, create an app, and grab your App ID and App Secret. Then generate an access token.
There are two token types for public page data:
1. App Access Token — APP_ID|APP_SECRET format. Works for public page queries. Simple to generate.
2. User Access Token — required for some endpoints. Generated via OAuth flow.
For most public page scraping, the App Access Token is sufficient:
import httpx
import time
import json
import random
import sqlite3
import logging
import re
import asyncio
from typing import Optional
from datetime import datetime
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger(__name__)
GRAPH_BASE = "https://graph.facebook.com/v19.0"
APP_ID = "your_app_id"
APP_SECRET = "your_app_secret"
# App access token = APP_ID|APP_SECRET (simplest form)
APP_TOKEN = f"{APP_ID}|{APP_SECRET}"
# Or generate a proper token via: GET /oauth/access_token
# params: client_id, client_secret, grant_type=client_credentials
def make_client(proxy_url: Optional[str] = None) -> httpx.Client:
"""Create HTTP client with optional proxy."""
kwargs = {
"timeout": httpx.Timeout(30.0, connect=10.0),
"follow_redirects": True,
}
if proxy_url:
kwargs["proxy"] = proxy_url
return httpx.Client(**kwargs)
def graph_get(
client: httpx.Client,
endpoint: str,
    params: Optional[dict] = None,
    token: Optional[str] = None,
max_retries: int = 4,
) -> Optional[dict]:
"""
Graph API GET request with error handling and retry.
"""
if params is None:
params = {}
params["access_token"] = token or APP_TOKEN
url = f"{GRAPH_BASE}/{endpoint.lstrip('/')}"
for attempt in range(max_retries):
try:
resp = client.get(url, params=params)
if resp.status_code == 200:
data = resp.json()
if "error" in data:
error = data["error"]
code = error.get("code")
logger.error(f"Graph API error {code}: {error.get('message')}")
# Code 4: rate limited, code 10: permission error
if code == 4:
time.sleep(60)
continue
return None
return data
elif resp.status_code == 429:
wait = int(resp.headers.get("retry-after", 60))
logger.warning(f"Rate limited (429), waiting {wait}s")
time.sleep(wait)
continue
elif resp.status_code in (400, 403):
try:
error_data = resp.json()
logger.error(f"API error: {error_data.get('error', {}).get('message')}")
except Exception:
logger.error(f"HTTP {resp.status_code}: {endpoint}")
return None
else:
logger.warning(f"HTTP {resp.status_code}: {endpoint}")
time.sleep(2 ** attempt)
except httpx.TimeoutException:
wait = 2 ** attempt + 2
logger.warning(f"Timeout, retry in {wait}s (attempt {attempt+1})")
time.sleep(wait)
except httpx.NetworkError as e:
logger.error(f"Network error: {e}")
time.sleep(5)
return None
Public Page Information
def get_page_info(
client: httpx.Client,
page_id: str,
    extra_fields: Optional[list] = None,
) -> Optional[dict]:
"""
Get public page information.
page_id: page name (e.g., 'NASA') or numeric ID
"""
base_fields = [
"id", "name", "category", "fan_count",
"about", "website", "location", "single_line_address",
"phone", "email", "founded", "general_info",
"mission", "description", "cover", "picture",
"verification_status", "category_list",
]
if extra_fields:
base_fields.extend(extra_fields)
data = graph_get(client, page_id, params={"fields": ",".join(base_fields)})
if not data:
return None
return {
"id": data.get("id"),
"name": data.get("name"),
"category": data.get("category"),
"fan_count": data.get("fan_count", 0),
"about": data.get("about", ""),
"website": data.get("website"),
"phone": data.get("phone"),
"address": data.get("single_line_address"),
"founded": data.get("founded"),
"description": data.get("description", ""),
"verification_status": data.get("verification_status"),
"cover_url": (data.get("cover") or {}).get("source"),
"profile_pic_url": (data.get("picture") or {}).get("data", {}).get("url"),
"categories": [c.get("name") for c in data.get("category_list", [])],
}
def check_rate_limit_headers(resp_headers: dict) -> dict:
"""Extract rate limit info from response headers."""
usage_header = resp_headers.get("x-app-usage", "{}")
page_usage_header = resp_headers.get("x-page-usage", "{}")
try:
app_usage = json.loads(usage_header)
page_usage = json.loads(page_usage_header)
except json.JSONDecodeError:
return {}
return {
"app_call_count": app_usage.get("call_count", 0),
"app_total_time": app_usage.get("total_time", 0),
"app_total_cputime": app_usage.get("total_cputime", 0),
"page_call_count": page_usage.get("call_count", 0),
}
Fetching Page Posts with Pagination
def get_page_posts(
client: httpx.Client,
page_id: str,
limit: int = 100,
    since: Optional[str] = None,
    until: Optional[str] = None,
) -> list:
"""
Get posts from a public page.
Paginates automatically up to limit.
since/until: Unix timestamp or YYYY-MM-DD format
"""
posts = []
    # Note: the post "type" field is deprecated in Graph API v3.3+;
    # if the request fails, drop it or use attachments{media_type}
    fields = (
        "id,message,created_time,type,permalink_url,"
        "shares,full_picture,name,description,story"
    )
params = {
"fields": fields,
        "limit": 25,  # conservative page size; pagination fetches the rest
}
if since:
params["since"] = since
if until:
params["until"] = until
# Initial request
data = graph_get(client, f"{page_id}/posts", params=params)
while data and len(posts) < limit:
for post in data.get("data", []):
posts.append({
"id": post.get("id"),
"message": post.get("message", ""),
"story": post.get("story", ""),
"created_time": post.get("created_time"),
"type": post.get("type"),
"url": post.get("permalink_url"),
"shares": (post.get("shares") or {}).get("count", 0),
"image_url": post.get("full_picture"),
"name": post.get("name"),
"description": post.get("description", ""),
})
# Paginate via cursor
paging = data.get("paging", {})
next_url = paging.get("next")
if not next_url or len(posts) >= limit:
break
        # Follow the pagination URL directly (it already embeds the token)
        try:
            resp = client.get(next_url)
            data = resp.json() if resp.status_code == 200 else None
except httpx.RequestError:
break
time.sleep(0.5)
return posts[:limit]
def get_page_events(
client: httpx.Client,
page_id: str,
limit: int = 50,
) -> list:
"""Get public events hosted by a page."""
fields = (
"id,name,description,start_time,end_time,"
"place,cover,attending_count,maybe_count,interested_count"
)
data = graph_get(
client,
f"{page_id}/events",
params={"fields": fields, "limit": min(limit, 50)},
)
if not data:
return []
events = []
for event in data.get("data", []):
place = event.get("place", {})
location = place.get("location", {})
events.append({
"id": event.get("id"),
"name": event.get("name"),
"description": event.get("description", "")[:500],
"start_time": event.get("start_time"),
"end_time": event.get("end_time"),
"venue_name": place.get("name"),
"city": location.get("city"),
"country": location.get("country"),
"attending_count": event.get("attending_count", 0),
"maybe_count": event.get("maybe_count", 0),
"interested_count": event.get("interested_count", 0),
})
return events
Batch Requests for Efficiency
For efficiency, batch multiple queries into a single HTTP request (up to 50 sub-requests):
def batch_page_info(
client: httpx.Client,
page_ids: list,
fields: str = "id,name,category,fan_count,about,website",
) -> list:
"""
Fetch info for multiple pages in a single batch request.
Up to 50 pages per batch.
"""
results = []
# Process in batches of 50
for i in range(0, len(page_ids), 50):
batch_ids = page_ids[i:i + 50]
batch = [
{
"method": "GET",
"relative_url": f"{pid}?fields={fields}",
}
for pid in batch_ids
]
try:
resp = client.post(
GRAPH_BASE,
data={
"access_token": APP_TOKEN,
"batch": json.dumps(batch),
"include_headers": "false",
},
timeout=30,
)
resp.raise_for_status()
for item in resp.json():
if item.get("code") == 200:
try:
results.append(json.loads(item["body"]))
except (json.JSONDecodeError, KeyError):
pass
else:
logger.debug(f"Batch item returned HTTP {item.get('code')}")
except httpx.RequestError as e:
logger.error(f"Batch request error: {e}")
time.sleep(1.0)
return results
def get_multiple_pages_concurrent(
client: httpx.Client,
page_ids: list,
delay: float = 0.3,
) -> list:
    """
    Fetch info for many pages, one batch request at a time.
    Rate limit: 200 API calls per hour (user token).
    Each batch request counts as 1 call and returns up to 50 results.
    """
all_results = []
for i in range(0, len(page_ids), 50):
batch = page_ids[i:i + 50]
results = batch_page_info(client, batch)
all_results.extend(results)
logger.info(f"Batch progress: {len(all_results)}/{len(page_ids)} pages")
if i + 50 < len(page_ids):
time.sleep(delay)
return all_results
Rate Limits
Facebook's rate limiting is per-app and per-user-token:
- App-level: 200 calls per hour per user (user token)
- App-level without user: 200 calls per hour (app token, per IP)
- Page tokens: 4,800 calls per 24 hours
- Batch requests: Each batch counts as 1 call but executes up to 50 sub-requests
Practical strategy:
1. Use batch requests aggressively — 1 API call for 50 pages worth of data
2. Cache aggressively — page info doesn't change hourly
3. Use App Token for public page scraping (simpler, no OAuth flow)
4. Monitor x-app-usage header to track quota consumption
def monitor_api_usage(client: httpx.Client) -> dict:
"""Check current API usage quota."""
try:
resp = client.get(
f"{GRAPH_BASE}/me",
params={"access_token": APP_TOKEN, "fields": "id"}
)
usage = resp.headers.get("x-app-usage", "{}")
return json.loads(usage)
except Exception:
return {}
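Point 2 of the strategy above (cache aggressively) can be sketched as a small SQLite-backed TTL cache. The helper below is illustrative (cached_page_info, the api_cache table, and the six-hour default are all assumptions, not part of the pipeline): it wraps any zero-argument fetch callable and only spends an API call on a cache miss.

```python
import json
import sqlite3
import time
from typing import Optional

def cached_page_info(
    conn: sqlite3.Connection,
    page_id: str,
    fetch_fn,
    ttl_seconds: int = 6 * 3600,
) -> Optional[dict]:
    """Return cached page info if fresh, else call fetch_fn and cache it.

    fetch_fn is any zero-argument callable returning a dict or None,
    e.g. lambda: get_page_info(client, page_id).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS api_cache "
        "(key TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)"
    )
    row = conn.execute(
        "SELECT payload, fetched_at FROM api_cache WHERE key = ?", (page_id,)
    ).fetchone()
    if row and time.time() - row[1] < ttl_seconds:
        return json.loads(row[0])  # cache hit: no API call spent
    data = fetch_fn()
    if data is not None:
        conn.execute(
            "INSERT OR REPLACE INTO api_cache VALUES (?, ?, ?)",
            (page_id, json.dumps(data), time.time()),
        )
        conn.commit()
    return data
```

Usage: `cached_page_info(conn, "NASA", lambda: get_page_info(client, "NASA"))` returns the API response the first time and the cached copy for the next six hours.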
Playwright Scraping for Content the API Misses
When the Graph API doesn't give you what you need — engagement metrics, comment counts, post reactions — Playwright can fill the gaps. Facebook's detection is serious but can be navigated:
from playwright.async_api import async_playwright
PROXY_CONFIG = {
"server": "http://proxy.thordata.com:9000",
"username": "your_thordata_user",
"password": "your_thordata_pass",
}
async def create_stealth_context(playwright, proxy: Optional[dict] = None):
"""
Create a browser context configured to minimize Facebook's bot detection.
"""
launch_args = {
"headless": True,
"args": [
"--disable-blink-features=AutomationControlled",
"--disable-features=IsolateOrigins,site-per-process",
"--no-first-run",
"--no-default-browser-check",
"--disable-extensions",
],
}
if proxy:
launch_args["proxy"] = proxy
browser = await playwright.chromium.launch(**launch_args)
# Use realistic browser context
context = await browser.new_context(
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
viewport={"width": 1366, "height": 768},
locale="en-US",
timezone_id="America/New_York",
color_scheme="light",
java_script_enabled=True,
)
# Override bot detection signals
await context.add_init_script("""
// Remove webdriver flag
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
// Add realistic plugin array
Object.defineProperty(navigator, 'plugins', {
get: () => [{name: 'Chrome PDF Plugin'}, {name: 'Chrome PDF Viewer'}]
});
// Set realistic language array
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en']
});
// Remove Playwright-specific globals
delete window.__playwright;
delete window.__pw_manual;
""")
return browser, context
async def scrape_public_page(
page_url: str,
    proxy: Optional[dict] = None,
) -> dict:
"""
Scrape a public Facebook page without logging in.
Returns page metadata and visible engagement signals.
"""
async with async_playwright() as p:
browser, context = await create_stealth_context(p, proxy=proxy)
page = await context.new_page()
# Block tracking pixels and video resources to speed up loading
await page.route("**/*facebook.com/tr*", lambda route: route.abort())
await page.route("**/*.mp4", lambda route: route.abort())
await page.route("**/*.webm", lambda route: route.abort())
try:
await page.goto(page_url, wait_until="domcontentloaded", timeout=30000)
await asyncio.sleep(3)
# Dismiss login popup if it appears
for dismiss_selector in [
"[aria-label='Close']",
"[data-testid='close-button']",
"div[role='dialog'] button",
]:
try:
btn = await page.query_selector(dismiss_selector)
if btn and await btn.is_visible():
await btn.click()
await asyncio.sleep(1)
break
except Exception:
pass
result = {"url": page_url}
# Page name
name_el = await page.query_selector("h1")
if name_el:
result["name"] = (await name_el.inner_text()).strip()
# Follower/like counts
spans = await page.query_selector_all("span")
for span in spans:
try:
text = await span.inner_text()
if re.search(r"[\d,.]+[KMB]?\s*(follower|like)", text, re.I):
match = re.search(r"([\d,.]+[KMB]?)", text)
if match:
val = match.group(1)
if "follower" in text.lower():
result["followers_text"] = val
elif "like" in text.lower():
result["likes_text"] = val
except Exception:
pass
# About section text
about_el = await page.query_selector("[data-testid='about_section']")
if not about_el:
about_el = await page.query_selector("div[id*='about']")
if about_el:
result["about"] = (await about_el.inner_text()).strip()[:500]
return result
except Exception as e:
logger.error(f"Error scraping {page_url}: {e}")
return {"url": page_url, "error": str(e)}
finally:
await browser.close()
Scraping Post Engagement
Public posts on Facebook pages show reaction counts, comment counts, and share counts without login:
async def scrape_page_posts(
page_url: str,
max_posts: int = 20,
    proxy: Optional[dict] = None,
) -> list:
"""
Scrape visible posts from a public Facebook page.
Loads more posts by scrolling.
"""
async with async_playwright() as p:
browser, context = await create_stealth_context(p, proxy=proxy)
page = await context.new_page()
# Block heavy resources
await page.route("**/*.mp4", lambda route: route.abort())
await page.route("**/*.webm", lambda route: route.abort())
try:
posts_url = (
page_url.rstrip("/") + "/posts"
if "/posts" not in page_url
else page_url
)
await page.goto(posts_url, wait_until="domcontentloaded", timeout=35000)
await asyncio.sleep(5)
# Dismiss popups
for dismiss_selector in ["[aria-label='Close']", "[data-testid='close-button']"]:
try:
btn = await page.query_selector(dismiss_selector)
if btn and await btn.is_visible():
await btn.click()
await asyncio.sleep(1)
break
except Exception:
pass
posts = []
scroll_count = 0
last_count = 0
while len(posts) < max_posts and scroll_count < 40:
# Facebook uses role="article" for feed posts
post_elements = await page.query_selector_all("[role='article']")
current_count = len(post_elements)
# Process newly loaded posts
for el in post_elements[last_count:]:
try:
post = {}
all_text = await el.inner_text()
# Post content
text_els = await el.query_selector_all(
"div[data-ad-preview='message'], div[dir='auto']"
)
if text_els:
texts = []
for text_el in text_els[:3]:
t = (await text_el.inner_text()).strip()
if t and len(t) > 10:
texts.append(t)
post["text"] = " ".join(texts)[:500]
# Engagement counts via regex on full element text
reactions_m = re.search(
r"([\d,]+[KM]?)\s*(?:reaction|like)", all_text, re.I
)
if reactions_m:
post["reactions_text"] = reactions_m.group(1)
comments_m = re.search(
r"([\d,]+[KM]?)\s*comment", all_text, re.I
)
if comments_m:
post["comments_text"] = comments_m.group(1)
shares_m = re.search(
r"([\d,]+[KM]?)\s*share", all_text, re.I
)
if shares_m:
post["shares_text"] = shares_m.group(1)
if post.get("text") or post.get("reactions_text"):
posts.append(post)
except Exception:
pass
                # Stop if scrolling no longer loads new posts
                if scroll_count > 5 and current_count == last_count:
                    break
                last_count = current_count
                if len(posts) >= max_posts:
                    break
                # Scroll to load more
                await page.evaluate("window.scrollBy(0, window.innerHeight)")
                await asyncio.sleep(2.5 + random.uniform(0, 1.5))
                scroll_count += 1
return posts[:max_posts]
except Exception as e:
logger.error(f"Error scraping posts from {page_url}: {e}")
return []
finally:
await browser.close()
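The engagement values scraped above arrive as display strings like "1.2K" or "4,521". A small normalizer (hypothetical; not wired into the scraper above) converts them to integers for storage and comparison:

```python
import re
from typing import Optional

_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_count(text: Optional[str]) -> Optional[int]:
    """Convert a display count like '1.2K', '3M', or '4,521' to an int.

    Returns None when the text does not look like a count.
    """
    if not text:
        return None
    m = re.match(r"^([\d,]+(?:\.\d+)?)\s*([KMB])?$", text.strip(), re.I)
    if not m:
        return None
    number = float(m.group(1).replace(",", ""))
    suffix = (m.group(2) or "").upper()
    return int(number * _MULTIPLIERS.get(suffix, 1))
```

Storing both the raw text and the parsed integer makes it easy to audit parsing errors later.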
Facebook's Anti-Bot Detection
Facebook runs one of the most aggressive anti-bot systems. Understanding what they check helps you avoid detection:
Browser fingerprinting: They hash canvas rendering output, WebGL renderer, AudioContext fingerprint, and dozens of navigator properties. Headless Chromium has detectable differences from real Chrome — the init script overrides above address the most common signals.
Behavioral analysis:
Scroll speed, mouse movement patterns, and click timing are analyzed. Automated scrolling with perfectly uniform timing (exactly 1.0 second delays) is a dead giveaway. Use random.uniform() for all delays.
JavaScript challenges:
Facebook injects obfuscated JS that probes for automation frameworks. It checks for navigator.webdriver, Playwright-specific properties (window.__playwright), and Chrome DevTools Protocol markers. The add_init_script above handles these.
Rate limiting by IP: Aggressive throttling on repeated requests. Even viewing public pages too quickly triggers login walls and CAPTCHA challenges.
Session consistency: Facebook tracks session behavior over time. A clean IP that immediately starts making many requests looks different from a regular user. Warm up new sessions with a few normal-looking interactions before aggressive scraping.
Practical mitigation strategy:
def get_thordata_proxy(country: str = "US", session_id: Optional[str] = None) -> dict:
"""
Get ThorData residential proxy config for Playwright.
Each unique session_id gets a consistent IP.
Rotate session_id between scraping sessions (not mid-session).
"""
user = "your_thordata_user"
password = "your_thordata_pass"
if country:
user += f"-country-{country}"
if session_id:
user += f"-session-{session_id}"
return {
        "server": "http://proxy.thordata.com:9000",
"username": user,
"password": password,
}
async def warm_up_session(page, delay: float = 3.0) -> None:
"""
Simulate human-like behavior before scraping.
Small mouse movements and a pause look more human.
"""
try:
# Random mouse movement
await page.mouse.move(
random.randint(200, 800),
random.randint(200, 600),
)
await asyncio.sleep(random.uniform(0.5, 1.5))
await page.mouse.move(
random.randint(200, 800),
random.randint(200, 600),
)
await asyncio.sleep(delay)
except Exception:
pass
ThorData residential proxies work well for Facebook because their IPs pass Facebook's reputation checks — datacenter IPs get blocked almost instantly. Rotate between sessions (not mid-session), since Facebook tracks session consistency:
# Good: consistent IP for entire scraping session
proxy = get_thordata_proxy(country="US", session_id="session_001")
results = await scrape_page_posts(url, proxy=proxy)
# Then for the next batch, use a different session
proxy = get_thordata_proxy(country="US", session_id="session_002")
results2 = await scrape_page_posts(url2, proxy=proxy)
Data Storage with SQLite
def init_database(db_path: str = "facebook_pages.db") -> sqlite3.Connection:
"""Initialize SQLite schema for Facebook page data."""
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS pages (
id TEXT PRIMARY KEY,
name TEXT,
category TEXT,
fan_count INTEGER DEFAULT 0,
about TEXT,
website TEXT,
phone TEXT,
address TEXT,
verification_status TEXT,
cover_url TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_category ON pages(category);
CREATE INDEX IF NOT EXISTS idx_fan_count ON pages(fan_count);
CREATE TABLE IF NOT EXISTS posts (
id TEXT PRIMARY KEY,
page_id TEXT,
message TEXT,
story TEXT,
created_time TEXT,
post_type TEXT,
url TEXT,
shares INTEGER DEFAULT 0,
image_url TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_post_page ON posts(page_id);
CREATE INDEX IF NOT EXISTS idx_post_time ON posts(created_time);
CREATE TABLE IF NOT EXISTS scraped_post_engagement (
id INTEGER PRIMARY KEY AUTOINCREMENT,
page_id TEXT,
page_url TEXT,
post_text TEXT,
reactions_text TEXT,
comments_text TEXT,
shares_text TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS page_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
page_id TEXT,
fan_count INTEGER,
snapshot_date TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_snap_date
ON page_snapshots(page_id, snapshot_date);
""")
conn.commit()
return conn
def save_page(conn: sqlite3.Connection, page: dict) -> None:
"""Save page info and a daily snapshot."""
try:
conn.execute("""
INSERT OR REPLACE INTO pages
(id, name, category, fan_count, about, website, phone, address,
verification_status, cover_url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
page.get("id"), page.get("name"), page.get("category"),
page.get("fan_count", 0), page.get("about"), page.get("website"),
page.get("phone"), page.get("address"),
page.get("verification_status"), page.get("cover_url"),
))
today = datetime.now().strftime("%Y-%m-%d")
conn.execute("""
INSERT OR REPLACE INTO page_snapshots (page_id, fan_count, snapshot_date)
VALUES (?, ?, ?)
""", (page.get("id"), page.get("fan_count", 0), today))
conn.commit()
except sqlite3.Error as e:
logger.error(f"DB error saving page: {e}")
def save_posts(conn: sqlite3.Connection, page_id: str, posts: list) -> int:
"""Save API posts to database."""
saved = 0
for post in posts:
try:
conn.execute("""
INSERT OR REPLACE INTO posts
(id, page_id, message, story, created_time, post_type, url,
shares, image_url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
post.get("id"), page_id,
post.get("message"), post.get("story"),
post.get("created_time"), post.get("type"),
post.get("url"), post.get("shares", 0),
post.get("image_url"),
))
saved += 1
except sqlite3.Error as e:
logger.error(f"DB error saving post: {e}")
conn.commit()
return saved
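Because save_page writes one page_snapshots row per day, follower growth falls out of a simple query over consecutive snapshots. A sketch (the fan_growth helper is illustrative, assuming the schema from init_database):

```python
import sqlite3
from typing import Optional

def fan_growth(conn: sqlite3.Connection, page_id: str) -> list:
    """Daily fan_count deltas for one page, oldest first.

    Each row: (snapshot_date, fan_count, change_since_previous),
    where the change is None for the first snapshot.
    """
    rows = conn.execute(
        """
        SELECT snapshot_date, fan_count
        FROM page_snapshots
        WHERE page_id = ?
        ORDER BY snapshot_date
        """,
        (page_id,),
    ).fetchall()
    out: list = []
    prev: Optional[int] = None
    for date, count in rows:
        out.append((date, count, None if prev is None else count - prev))
        prev = count
    return out
```

The unique index on (page_id, snapshot_date) guarantees at most one row per day, so the deltas are day-over-day changes.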
Complete Analysis Pipeline
async def run_page_analysis(
page_ids: list,
scrape_posts_playwright: bool = True,
db_path: str = "facebook.db",
) -> None:
"""
Complete Facebook page analysis pipeline.
Combines Graph API data with optional Playwright scraping.
"""
conn = init_database(db_path)
client = make_client()
# Phase 1: Batch API collection
logger.info(f"Fetching info for {len(page_ids)} pages via Graph API...")
pages_data = get_multiple_pages_concurrent(client, page_ids)
for page in pages_data:
if page.get("id"):
save_page(conn, page)
logger.info(f"Saved {len(pages_data)} pages")
# Phase 2: Posts via API
logger.info("Fetching posts...")
for page_info in pages_data:
page_id = page_info.get("id")
name = page_info.get("name", page_id)
if not page_id:
continue
posts = get_page_posts(client, page_id, limit=50)
saved = save_posts(conn, page_id, posts)
logger.info(f"@{name}: {saved} posts saved")
time.sleep(0.5)
# Phase 3: Playwright for engagement data (if enabled)
if scrape_posts_playwright:
logger.info("Scraping engagement data with Playwright...")
for page_info in pages_data[:10]: # Limit for demo
page_id = page_info.get("id")
name = page_info.get("name", "unknown")
page_url = f"https://www.facebook.com/{page_id}"
proxy = get_thordata_proxy(country="US", session_id=f"fb-{page_id[:8]}")
try:
posts = await scrape_page_posts(
page_url, max_posts=15, proxy=proxy
)
for post in posts:
conn.execute("""
INSERT INTO scraped_post_engagement
(page_id, page_url, post_text, reactions_text,
comments_text, shares_text)
VALUES (?, ?, ?, ?, ?, ?)
""", (
page_id, page_url,
post.get("text"), post.get("reactions_text"),
post.get("comments_text"), post.get("shares_text"),
))
conn.commit()
logger.info(f"@{name}: {len(posts)} posts scraped")
except Exception as e:
logger.error(f"Playwright error for {name}: {e}")
# Long pause between pages to avoid detection
await asyncio.sleep(random.uniform(8, 15))
# Summary report
cursor = conn.execute("""
SELECT name, fan_count, category
FROM pages
ORDER BY fan_count DESC
LIMIT 20
""")
print("\n=== Top Pages by Follower Count ===")
for row in cursor.fetchall():
print(f" {row[0]:<40} {row[1]:>12,} followers [{row[2] or 'N/A'}]")
conn.close()
client.close()
if __name__ == "__main__":
page_ids = ["NASA", "SpaceX", "NationalGeographic", "BBCNews", "CNN"]
asyncio.run(run_page_analysis(
page_ids=page_ids,
scrape_posts_playwright=True,
db_path="facebook_pages.db",
))
Legal Considerations
This deserves a direct statement. Facebook's Terms of Service prohibit automated data collection. The hiQ Labs v. LinkedIn case (decided by the Ninth Circuit in 2022, after a Supreme Court remand) established some legal ground for scraping public data, but that ruling was specific to LinkedIn and the CFAA — it doesn't provide blanket permission for Facebook.
In practice, common-sense guidelines:
- Stick to public data — page content visible without logging in is the safest category
- Never scrape private profiles or closed group content
- Don't store personal data (individual names, photos) without a legitimate legal basis under GDPR/CCPA
- Use the Graph API first — it's the sanctioned path, even if limited
- Rate limit aggressively — don't hammer their servers
- Use the data for analysis, not replication — building a competing social network from scraped data is a different risk category than competitive intelligence
For brand monitoring, competitor analysis, and public sentiment research, scraping public Facebook pages is common practice. Be sensible about what you collect and why.
Summary
The combined approach — Graph API for structured data, Playwright for engagement signals — gives you comprehensive coverage:
- Graph API for page info, follower counts, post history, and events. Use batch requests for efficiency; the 200-calls-per-hour quota is tight otherwise.
- Playwright with stealth configuration for engagement counts, reactions, and post content that the restricted API misses.
- ThorData residential proxies for Playwright scraping. Rotate between sessions (not mid-session). Facebook's detection tracks session consistency.
- SQLite for storage with daily fan count snapshots for growth tracking.
- Public data only — page content visible without login. Never attempt to access private or restricted content.