Scrape Facebook Public Pages & Post Engagement with Python (2026)
Facebook scraping in 2026 is a mixed bag. The Meta Graph API exists and gives you structured access to public page data, but its scope has been shrinking steadily since the Cambridge Analytica fallout. Many endpoints that were open in 2020 now require app review or special permissions. For public content that the API doesn't expose, browser-based scraping with Playwright is still an option — though Facebook's anti-bot detection is among the most aggressive on the web.
This guide covers both approaches with complete working code. We use the Graph API where it works, and fall back to Playwright for what remains. Only public data — no private profiles, no closed groups, no direct messages.
What's Available via the Graph API
With a basic Facebook App token, you can access:
- Public page info — name, category, follower count, about text, website, phone, address
- Page posts — text, type, created time (engagement counts need a Page token)
- Page events — public events hosted by pages
- Public videos — title, description, views on public page videos
- Page insights — available to page admins only
What you cannot get without special permissions:
- Individual user reactions (who reacted, what reaction type)
- Comment content on most posts
- Detailed engagement breakdowns
- Anything related to user profiles
- Group member lists
Setting Up the Graph API
You need a Facebook App. Go to developers.facebook.com, create an app, and grab your App ID and App Secret. Then generate an access token.
There are two token types for public page data:
1. App Access Token — APP_ID|APP_SECRET format. Works for public page queries. Simple to generate.
2. User Access Token — required for some endpoints. Generated via OAuth flow.
For most public page scraping, the App Access Token is sufficient:
import httpx
import time
import json
import random
import sqlite3
import logging
import re
import asyncio
from typing import Optional
from datetime import datetime
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger(__name__)
GRAPH_BASE = "https://graph.facebook.com/v19.0"
APP_ID = "your_app_id"
APP_SECRET = "your_app_secret"
# App access token = APP_ID|APP_SECRET (simplest form)
APP_TOKEN = f"{APP_ID}|{APP_SECRET}"
# Or generate a proper token via: GET /oauth/access_token
# params: client_id, client_secret, grant_type=client_credentials
def make_client(proxy_url: Optional[str] = None) -> httpx.Client:
"""Create HTTP client with optional proxy."""
kwargs = {
"timeout": httpx.Timeout(30.0, connect=10.0),
"follow_redirects": True,
}
if proxy_url:
kwargs["proxy"] = proxy_url
return httpx.Client(**kwargs)
def graph_get(
client: httpx.Client,
endpoint: str,
    params: Optional[dict] = None,
    token: Optional[str] = None,
max_retries: int = 4,
) -> Optional[dict]:
"""
Graph API GET request with error handling and retry.
"""
if params is None:
params = {}
params["access_token"] = token or APP_TOKEN
url = f"{GRAPH_BASE}/{endpoint.lstrip('/')}"
for attempt in range(max_retries):
try:
resp = client.get(url, params=params)
if resp.status_code == 200:
data = resp.json()
if "error" in data:
error = data["error"]
code = error.get("code")
logger.error(f"Graph API error {code}: {error.get('message')}")
# Code 4: rate limited, code 10: permission error
if code == 4:
time.sleep(60)
continue
return None
return data
elif resp.status_code == 429:
wait = int(resp.headers.get("retry-after", 60))
logger.warning(f"Rate limited (429), waiting {wait}s")
time.sleep(wait)
continue
elif resp.status_code in (400, 403):
try:
error_data = resp.json()
logger.error(f"API error: {error_data.get('error', {}).get('message')}")
except Exception:
logger.error(f"HTTP {resp.status_code}: {endpoint}")
return None
else:
logger.warning(f"HTTP {resp.status_code}: {endpoint}")
time.sleep(2 ** attempt)
except httpx.TimeoutException:
wait = 2 ** attempt + 2
logger.warning(f"Timeout, retry in {wait}s (attempt {attempt+1})")
time.sleep(wait)
except httpx.NetworkError as e:
logger.error(f"Network error: {e}")
time.sleep(5)
return None
Public Page Information
def get_page_info(
client: httpx.Client,
page_id: str,
    extra_fields: Optional[list] = None,
) -> Optional[dict]:
"""
Get public page information.
page_id: page name (e.g., 'NASA') or numeric ID
"""
base_fields = [
"id", "name", "category", "fan_count",
"about", "website", "location", "single_line_address",
"phone", "email", "founded", "general_info",
"mission", "description", "cover", "picture",
"verification_status", "category_list",
]
if extra_fields:
base_fields.extend(extra_fields)
data = graph_get(client, page_id, params={"fields": ",".join(base_fields)})
if not data:
return None
return {
"id": data.get("id"),
"name": data.get("name"),
"category": data.get("category"),
"fan_count": data.get("fan_count", 0),
"about": data.get("about", ""),
"website": data.get("website"),
"phone": data.get("phone"),
"address": data.get("single_line_address"),
"founded": data.get("founded"),
"description": data.get("description", ""),
"verification_status": data.get("verification_status"),
"cover_url": (data.get("cover") or {}).get("source"),
"profile_pic_url": (data.get("picture") or {}).get("data", {}).get("url"),
"categories": [c.get("name") for c in data.get("category_list", [])],
}
def check_rate_limit_headers(resp_headers: dict) -> dict:
"""Extract rate limit info from response headers."""
usage_header = resp_headers.get("x-app-usage", "{}")
page_usage_header = resp_headers.get("x-page-usage", "{}")
try:
app_usage = json.loads(usage_header)
page_usage = json.loads(page_usage_header)
except json.JSONDecodeError:
return {}
return {
"app_call_count": app_usage.get("call_count", 0),
"app_total_time": app_usage.get("total_time", 0),
"app_total_cputime": app_usage.get("total_cputime", 0),
"page_call_count": page_usage.get("call_count", 0),
}
Fetching Page Posts with Pagination
def get_page_posts(
client: httpx.Client,
page_id: str,
limit: int = 100,
    since: Optional[str] = None,
    until: Optional[str] = None,
) -> list:
"""
Get posts from a public page.
Paginates automatically up to limit.
since/until: Unix timestamp or YYYY-MM-DD format
"""
posts = []
    # Note: the post "type" field is deprecated in Graph API v3.3+;
    # if the request fails, drop it or use attachments{media_type}
    fields = (
        "id,message,created_time,type,permalink_url,"
        "shares,full_picture,name,description,story"
    )
params = {
"fields": fields,
        "limit": 25,  # conservative page size; pagination fetches the rest
}
if since:
params["since"] = since
if until:
params["until"] = until
# Initial request
data = graph_get(client, f"{page_id}/posts", params=params)
while data and len(posts) < limit:
for post in data.get("data", []):
posts.append({
"id": post.get("id"),
"message": post.get("message", ""),
"story": post.get("story", ""),
"created_time": post.get("created_time"),
"type": post.get("type"),
"url": post.get("permalink_url"),
"shares": (post.get("shares") or {}).get("count", 0),
"image_url": post.get("full_picture"),
"name": post.get("name"),
"description": post.get("description", ""),
})
# Paginate via cursor
paging = data.get("paging", {})
next_url = paging.get("next")
if not next_url or len(posts) >= limit:
break
        # Follow the pagination URL directly (it already embeds the token)
        try:
            resp = client.get(next_url)
            data = resp.json() if resp.status_code == 200 else None
except httpx.RequestError:
break
time.sleep(0.5)
return posts[:limit]
def get_page_events(
client: httpx.Client,
page_id: str,
limit: int = 50,
) -> list:
"""Get public events hosted by a page."""
fields = (
"id,name,description,start_time,end_time,"
"place,cover,attending_count,maybe_count,interested_count"
)
data = graph_get(
client,
f"{page_id}/events",
params={"fields": fields, "limit": min(limit, 50)},
)
if not data:
return []
events = []
for event in data.get("data", []):
place = event.get("place", {})
location = place.get("location", {})
events.append({
"id": event.get("id"),
"name": event.get("name"),
"description": event.get("description", "")[:500],
"start_time": event.get("start_time"),
"end_time": event.get("end_time"),
"venue_name": place.get("name"),
"city": location.get("city"),
"country": location.get("country"),
"attending_count": event.get("attending_count", 0),
"maybe_count": event.get("maybe_count", 0),
"interested_count": event.get("interested_count", 0),
})
return events
Batch Requests for Efficiency
For efficiency, batch multiple queries into a single HTTP request (up to 50 sub-requests):
def batch_page_info(
client: httpx.Client,
page_ids: list,
fields: str = "id,name,category,fan_count,about,website",
) -> list:
"""
Fetch info for multiple pages in a single batch request.
Up to 50 pages per batch.
"""
results = []
# Process in batches of 50
for i in range(0, len(page_ids), 50):
batch_ids = page_ids[i:i + 50]
batch = [
{
"method": "GET",
"relative_url": f"{pid}?fields={fields}",
}
for pid in batch_ids
]
try:
resp = client.post(
GRAPH_BASE,
data={
"access_token": APP_TOKEN,
"batch": json.dumps(batch),
"include_headers": "false",
},
timeout=30,
)
resp.raise_for_status()
for item in resp.json():
if item.get("code") == 200:
try:
results.append(json.loads(item["body"]))
except (json.JSONDecodeError, KeyError):
pass
else:
logger.debug(f"Batch item returned HTTP {item.get('code')}")
except httpx.RequestError as e:
logger.error(f"Batch request error: {e}")
time.sleep(1.0)
return results
def get_multiple_pages_concurrent(
client: httpx.Client,
page_ids: list,
delay: float = 0.3,
) -> list:
    """
    Fetch info for many pages, one batch request at a time.
    Rate limit: 200 API calls per hour (user token).
    Each batch request counts as 1 call and returns up to 50 results.
    """
all_results = []
for i in range(0, len(page_ids), 50):
batch = page_ids[i:i + 50]
results = batch_page_info(client, batch)
all_results.extend(results)
logger.info(f"Batch progress: {len(all_results)}/{len(page_ids)} pages")
if i + 50 < len(page_ids):
time.sleep(delay)
return all_results
Rate Limits
Facebook's rate limiting is per-app and per-user-token:
- App-level: 200 calls per hour per user (user token)
- App-level without user: 200 calls per hour (app token, per IP)
- Page tokens: 4,800 calls per 24 hours
- Batch requests: Each batch counts as 1 call but executes up to 50 sub-requests
Practical strategy:
1. Use batch requests aggressively — 1 API call for 50 pages worth of data
2. Cache aggressively — page info doesn't change hourly
3. Use App Token for public page scraping (simpler, no OAuth flow)
4. Monitor x-app-usage header to track quota consumption
def monitor_api_usage(client: httpx.Client) -> dict:
"""Check current API usage quota."""
try:
resp = client.get(
f"{GRAPH_BASE}/me",
params={"access_token": APP_TOKEN, "fields": "id"}
)
usage = resp.headers.get("x-app-usage", "{}")
return json.loads(usage)
except Exception:
return {}
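Point 2 of the strategy above (cache aggressively) can be sketched as a small SQLite-backed TTL cache. The helper below is illustrative (cached_page_info, the api_cache table, and the six-hour default are all assumptions, not part of the pipeline): it wraps any zero-argument fetch callable and only spends an API call on a cache miss.

```python
import json
import sqlite3
import time
from typing import Optional

def cached_page_info(
    conn: sqlite3.Connection,
    page_id: str,
    fetch_fn,
    ttl_seconds: int = 6 * 3600,
) -> Optional[dict]:
    """Return cached page info if fresh, else call fetch_fn and cache it.

    fetch_fn is any zero-argument callable returning a dict or None,
    e.g. lambda: get_page_info(client, page_id).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS api_cache "
        "(key TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)"
    )
    row = conn.execute(
        "SELECT payload, fetched_at FROM api_cache WHERE key = ?", (page_id,)
    ).fetchone()
    if row and time.time() - row[1] < ttl_seconds:
        return json.loads(row[0])  # cache hit: no API call spent
    data = fetch_fn()
    if data is not None:
        conn.execute(
            "INSERT OR REPLACE INTO api_cache VALUES (?, ?, ?)",
            (page_id, json.dumps(data), time.time()),
        )
        conn.commit()
    return data
```

Usage: `cached_page_info(conn, "NASA", lambda: get_page_info(client, "NASA"))` returns the API response the first time and the cached copy for the next six hours.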
Playwright Scraping for Content the API Misses
When the Graph API doesn't give you what you need — engagement metrics, comment counts, post reactions — Playwright can fill the gaps. Facebook's detection is serious but can be navigated:
from playwright.async_api import async_playwright
PROXY_CONFIG = {
"server": "http://proxy.thordata.com:9000",
"username": "your_thordata_user",
"password": "your_thordata_pass",
}
async def create_stealth_context(playwright, proxy: Optional[dict] = None):
"""
Create a browser context configured to minimize Facebook's bot detection.
"""
launch_args = {
"headless": True,
"args": [
"--disable-blink-features=AutomationControlled",
"--disable-features=IsolateOrigins,site-per-process",
"--no-first-run",
"--no-default-browser-check",
"--disable-extensions",
],
}
if proxy:
launch_args["proxy"] = proxy
browser = await playwright.chromium.launch(**launch_args)
# Use realistic browser context
context = await browser.new_context(
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
viewport={"width": 1366, "height": 768},
locale="en-US",
timezone_id="America/New_York",
color_scheme="light",
java_script_enabled=True,
)
# Override bot detection signals
await context.add_init_script("""
// Remove webdriver flag
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
// Add realistic plugin array
Object.defineProperty(navigator, 'plugins', {
get: () => [{name: 'Chrome PDF Plugin'}, {name: 'Chrome PDF Viewer'}]
});
// Set realistic language array
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en']
});
// Remove Playwright-specific globals
delete window.__playwright;
delete window.__pw_manual;
""")
return browser, context
async def scrape_public_page(
page_url: str,
    proxy: Optional[dict] = None,
) -> dict:
"""
Scrape a public Facebook page without logging in.
Returns page metadata and visible engagement signals.
"""
async with async_playwright() as p:
browser, context = await create_stealth_context(p, proxy=proxy)
page = await context.new_page()
# Block tracking pixels and video resources to speed up loading
await page.route("**/*facebook.com/tr*", lambda route: route.abort())
await page.route("**/*.mp4", lambda route: route.abort())
await page.route("**/*.webm", lambda route: route.abort())
try:
await page.goto(page_url, wait_until="domcontentloaded", timeout=30000)
await asyncio.sleep(3)
# Dismiss login popup if it appears
for dismiss_selector in [
"[aria-label='Close']",
"[data-testid='close-button']",
"div[role='dialog'] button",
]:
try:
btn = await page.query_selector(dismiss_selector)
if btn and await btn.is_visible():
await btn.click()
await asyncio.sleep(1)
break
except Exception:
pass
result = {"url": page_url}
# Page name
name_el = await page.query_selector("h1")
if name_el:
result["name"] = (await name_el.inner_text()).strip()
# Follower/like counts
spans = await page.query_selector_all("span")
for span in spans:
try:
text = await span.inner_text()
if re.search(r"[\d,.]+[KMB]?\s*(follower|like)", text, re.I):
match = re.search(r"([\d,.]+[KMB]?)", text)
if match:
val = match.group(1)
if "follower" in text.lower():
result["followers_text"] = val
elif "like" in text.lower():
result["likes_text"] = val
except Exception:
pass
# About section text
about_el = await page.query_selector("[data-testid='about_section']")
if not about_el:
about_el = await page.query_selector("div[id*='about']")
if about_el:
result["about"] = (await about_el.inner_text()).strip()[:500]
return result
except Exception as e:
logger.error(f"Error scraping {page_url}: {e}")
return {"url": page_url, "error": str(e)}
finally:
await browser.close()
Scraping Post Engagement
Public posts on Facebook pages show reaction counts, comment counts, and share counts without login:
async def scrape_page_posts(
page_url: str,
max_posts: int = 20,
    proxy: Optional[dict] = None,
) -> list:
"""
Scrape visible posts from a public Facebook page.
Loads more posts by scrolling.
"""
async with async_playwright() as p:
browser, context = await create_stealth_context(p, proxy=proxy)
page = await context.new_page()
# Block heavy resources
await page.route("**/*.mp4", lambda route: route.abort())
await page.route("**/*.webm", lambda route: route.abort())
try:
posts_url = (
page_url.rstrip("/") + "/posts"
if "/posts" not in page_url
else page_url
)
await page.goto(posts_url, wait_until="domcontentloaded", timeout=35000)
await asyncio.sleep(5)
# Dismiss popups
for dismiss_selector in ["[aria-label='Close']", "[data-testid='close-button']"]:
try:
btn = await page.query_selector(dismiss_selector)
if btn and await btn.is_visible():
await btn.click()
await asyncio.sleep(1)
break
except Exception:
pass
posts = []
scroll_count = 0
last_count = 0
while len(posts) < max_posts and scroll_count < 40:
# Facebook uses role="article" for feed posts
post_elements = await page.query_selector_all("[role='article']")
current_count = len(post_elements)
# Process newly loaded posts
for el in post_elements[last_count:]:
try:
post = {}
all_text = await el.inner_text()
# Post content
text_els = await el.query_selector_all(
"div[data-ad-preview='message'], div[dir='auto']"
)
if text_els:
texts = []
for text_el in text_els[:3]:
t = (await text_el.inner_text()).strip()
if t and len(t) > 10:
texts.append(t)
post["text"] = " ".join(texts)[:500]
# Engagement counts via regex on full element text
reactions_m = re.search(
r"([\d,]+[KM]?)\s*(?:reaction|like)", all_text, re.I
)
if reactions_m:
post["reactions_text"] = reactions_m.group(1)
comments_m = re.search(
r"([\d,]+[KM]?)\s*comment", all_text, re.I
)
if comments_m:
post["comments_text"] = comments_m.group(1)
shares_m = re.search(
r"([\d,]+[KM]?)\s*share", all_text, re.I
)
if shares_m:
post["shares_text"] = shares_m.group(1)
if post.get("text") or post.get("reactions_text"):
posts.append(post)
except Exception:
pass
                # Stop if scrolling no longer loads new posts
                if scroll_count > 5 and current_count == last_count:
                    break
                last_count = current_count
                if len(posts) >= max_posts:
                    break
                # Scroll to load more
                await page.evaluate("window.scrollBy(0, window.innerHeight)")
                await asyncio.sleep(2.5 + random.uniform(0, 1.5))
                scroll_count += 1
return posts[:max_posts]
except Exception as e:
logger.error(f"Error scraping posts from {page_url}: {e}")
return []
finally:
await browser.close()
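The engagement values scraped above arrive as display strings like "1.2K" or "4,521". A small normalizer (hypothetical; not wired into the scraper above) converts them to integers for storage and comparison:

```python
import re
from typing import Optional

_MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_count(text: Optional[str]) -> Optional[int]:
    """Convert a display count like '1.2K', '3M', or '4,521' to an int.

    Returns None when the text does not look like a count.
    """
    if not text:
        return None
    m = re.match(r"^([\d,]+(?:\.\d+)?)\s*([KMB])?$", text.strip(), re.I)
    if not m:
        return None
    number = float(m.group(1).replace(",", ""))
    suffix = (m.group(2) or "").upper()
    return int(number * _MULTIPLIERS.get(suffix, 1))
```

Storing both the raw text and the parsed integer makes it easy to audit parsing errors later.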
Facebook's Anti-Bot Detection
Facebook runs one of the most aggressive anti-bot systems. Understanding what they check helps you avoid detection:
Browser fingerprinting: They hash canvas rendering output, WebGL renderer, AudioContext fingerprint, and dozens of navigator properties. Headless Chromium has detectable differences from real Chrome — the init script overrides above address the most common signals.
Behavioral analysis:
Scroll speed, mouse movement patterns, and click timing are analyzed. Automated scrolling with perfectly uniform timing (exactly 1.0 second delays) is a dead giveaway. Use random.uniform() for all delays.
JavaScript challenges:
Facebook injects obfuscated JS that probes for automation frameworks. It checks for navigator.webdriver, Playwright-specific properties (window.__playwright), and Chrome DevTools Protocol markers. The add_init_script above handles these.
Rate limiting by IP: Aggressive throttling on repeated requests. Even viewing public pages too quickly triggers login walls and CAPTCHA challenges.
Session consistency: Facebook tracks session behavior over time. A clean IP that immediately starts making many requests looks different from a regular user. Warm up new sessions with a few normal-looking interactions before aggressive scraping.
Practical mitigation strategy:
def get_thordata_proxy(country: str = "US", session_id: Optional[str] = None) -> dict:
"""
Get ThorData residential proxy config for Playwright.
Each unique session_id gets a consistent IP.
Rotate session_id between scraping sessions (not mid-session).
"""
user = "your_thordata_user"
password = "your_thordata_pass"
if country:
user += f"-country-{country}"
if session_id:
user += f"-session-{session_id}"
return {
        "server": "http://proxy.thordata.com:9000",
"username": user,
"password": password,
}
async def warm_up_session(page, delay: float = 3.0) -> None:
"""
Simulate human-like behavior before scraping.
Small mouse movements and a pause look more human.
"""
try:
# Random mouse movement
await page.mouse.move(
random.randint(200, 800),
random.randint(200, 600),
)
await asyncio.sleep(random.uniform(0.5, 1.5))
await page.mouse.move(
random.randint(200, 800),
random.randint(200, 600),
)
await asyncio.sleep(delay)
except Exception:
pass
ThorData residential proxies work well for Facebook because their IPs pass Facebook's reputation checks — datacenter IPs get blocked almost instantly. Rotate between sessions (not mid-session), since Facebook tracks session consistency:
# Good: consistent IP for entire scraping session
proxy = get_thordata_proxy(country="US", session_id="session_001")
results = await scrape_page_posts(url, proxy=proxy)
# Then for the next batch, use a different session
proxy = get_thordata_proxy(country="US", session_id="session_002")
results2 = await scrape_page_posts(url2, proxy=proxy)
Data Storage with SQLite
def init_database(db_path: str = "facebook_pages.db") -> sqlite3.Connection:
"""Initialize SQLite schema for Facebook page data."""
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS pages (
id TEXT PRIMARY KEY,
name TEXT,
category TEXT,
fan_count INTEGER DEFAULT 0,
about TEXT,
website TEXT,
phone TEXT,
address TEXT,
verification_status TEXT,
cover_url TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_category ON pages(category);
CREATE INDEX IF NOT EXISTS idx_fan_count ON pages(fan_count);
CREATE TABLE IF NOT EXISTS posts (
id TEXT PRIMARY KEY,
page_id TEXT,
message TEXT,
story TEXT,
created_time TEXT,
post_type TEXT,
url TEXT,
shares INTEGER DEFAULT 0,
image_url TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_post_page ON posts(page_id);
CREATE INDEX IF NOT EXISTS idx_post_time ON posts(created_time);
CREATE TABLE IF NOT EXISTS scraped_post_engagement (
id INTEGER PRIMARY KEY AUTOINCREMENT,
page_id TEXT,
page_url TEXT,
post_text TEXT,
reactions_text TEXT,
comments_text TEXT,
shares_text TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS page_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
page_id TEXT,
fan_count INTEGER,
snapshot_date TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_snap_date
ON page_snapshots(page_id, snapshot_date);
""")
conn.commit()
return conn
def save_page(conn: sqlite3.Connection, page: dict) -> None:
"""Save page info and a daily snapshot."""
try:
conn.execute("""
INSERT OR REPLACE INTO pages
(id, name, category, fan_count, about, website, phone, address,
verification_status, cover_url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
page.get("id"), page.get("name"), page.get("category"),
page.get("fan_count", 0), page.get("about"), page.get("website"),
page.get("phone"), page.get("address"),
page.get("verification_status"), page.get("cover_url"),
))
today = datetime.now().strftime("%Y-%m-%d")
conn.execute("""
INSERT OR REPLACE INTO page_snapshots (page_id, fan_count, snapshot_date)
VALUES (?, ?, ?)
""", (page.get("id"), page.get("fan_count", 0), today))
conn.commit()
except sqlite3.Error as e:
logger.error(f"DB error saving page: {e}")
def save_posts(conn: sqlite3.Connection, page_id: str, posts: list) -> int:
"""Save API posts to database."""
saved = 0
for post in posts:
try:
conn.execute("""
INSERT OR REPLACE INTO posts
(id, page_id, message, story, created_time, post_type, url,
shares, image_url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
post.get("id"), page_id,
post.get("message"), post.get("story"),
post.get("created_time"), post.get("type"),
post.get("url"), post.get("shares", 0),
post.get("image_url"),
))
saved += 1
except sqlite3.Error as e:
logger.error(f"DB error saving post: {e}")
conn.commit()
return saved
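Because save_page writes one page_snapshots row per day, follower growth falls out of a simple query over consecutive snapshots. A sketch (the fan_growth helper is illustrative, assuming the schema from init_database):

```python
import sqlite3
from typing import Optional

def fan_growth(conn: sqlite3.Connection, page_id: str) -> list:
    """Daily fan_count deltas for one page, oldest first.

    Each row: (snapshot_date, fan_count, change_since_previous),
    where the change is None for the first snapshot.
    """
    rows = conn.execute(
        """
        SELECT snapshot_date, fan_count
        FROM page_snapshots
        WHERE page_id = ?
        ORDER BY snapshot_date
        """,
        (page_id,),
    ).fetchall()
    out: list = []
    prev: Optional[int] = None
    for date, count in rows:
        out.append((date, count, None if prev is None else count - prev))
        prev = count
    return out
```

The unique index on (page_id, snapshot_date) guarantees at most one row per day, so the deltas are day-over-day changes.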
Complete Analysis Pipeline
async def run_page_analysis(
page_ids: list,
scrape_posts_playwright: bool = True,
db_path: str = "facebook.db",
) -> None:
"""
Complete Facebook page analysis pipeline.
Combines Graph API data with optional Playwright scraping.
"""
conn = init_database(db_path)
client = make_client()
# Phase 1: Batch API collection
logger.info(f"Fetching info for {len(page_ids)} pages via Graph API...")
pages_data = get_multiple_pages_concurrent(client, page_ids)
for page in pages_data:
if page.get("id"):
save_page(conn, page)
logger.info(f"Saved {len(pages_data)} pages")
# Phase 2: Posts via API
logger.info("Fetching posts...")
for page_info in pages_data:
page_id = page_info.get("id")
name = page_info.get("name", page_id)
if not page_id:
continue
posts = get_page_posts(client, page_id, limit=50)
saved = save_posts(conn, page_id, posts)
logger.info(f"@{name}: {saved} posts saved")
time.sleep(0.5)
# Phase 3: Playwright for engagement data (if enabled)
if scrape_posts_playwright:
logger.info("Scraping engagement data with Playwright...")
for page_info in pages_data[:10]: # Limit for demo
page_id = page_info.get("id")
name = page_info.get("name", "unknown")
page_url = f"https://www.facebook.com/{page_id}"
proxy = get_thordata_proxy(country="US", session_id=f"fb-{page_id[:8]}")
try:
posts = await scrape_page_posts(
page_url, max_posts=15, proxy=proxy
)
for post in posts:
conn.execute("""
INSERT INTO scraped_post_engagement
(page_id, page_url, post_text, reactions_text,
comments_text, shares_text)
VALUES (?, ?, ?, ?, ?, ?)
""", (
page_id, page_url,
post.get("text"), post.get("reactions_text"),
post.get("comments_text"), post.get("shares_text"),
))
conn.commit()
logger.info(f"@{name}: {len(posts)} posts scraped")
except Exception as e:
logger.error(f"Playwright error for {name}: {e}")
# Long pause between pages to avoid detection
await asyncio.sleep(random.uniform(8, 15))
# Summary report
cursor = conn.execute("""
SELECT name, fan_count, category
FROM pages
ORDER BY fan_count DESC
LIMIT 20
""")
print("\n=== Top Pages by Follower Count ===")
for row in cursor.fetchall():
print(f" {row[0]:<40} {row[1]:>12,} followers [{row[2] or 'N/A'}]")
conn.close()
client.close()
if __name__ == "__main__":
page_ids = ["NASA", "SpaceX", "NationalGeographic", "BBCNews", "CNN"]
asyncio.run(run_page_analysis(
page_ids=page_ids,
scrape_posts_playwright=True,
db_path="facebook_pages.db",
))
Legal Considerations
This deserves a direct statement. Facebook's Terms of Service prohibit automated data collection. The hiQ Labs v. LinkedIn case (decided by the Ninth Circuit in 2022, after a Supreme Court remand) established some legal ground for scraping public data, but that ruling was specific to LinkedIn and the CFAA — it doesn't provide blanket permission for Facebook.
In practice, common-sense guidelines:
- Stick to public data — page content visible without logging in is the safest category
- Never scrape private profiles or closed group content
- Don't store personal data (individual names, photos) without a legitimate legal basis under GDPR/CCPA
- Use the Graph API first — it's the sanctioned path, even if limited
- Rate limit aggressively — don't hammer their servers
- Use the data for analysis, not replication — building a competing social network from scraped data is a different risk category than competitive intelligence
For brand monitoring, competitor analysis, and public sentiment research, scraping public Facebook pages is common practice. Be sensible about what you collect and why.
Summary
The combined approach — Graph API for structured data, Playwright for engagement signals — gives you comprehensive coverage:
- Graph API for page info, follower counts, post history, and events. Use batch requests for efficiency; the 200-calls-per-hour quota is tight otherwise.
- Playwright with stealth configuration for engagement counts, reactions, and post content that the restricted API misses.
- ThorData residential proxies for Playwright scraping. Rotate between sessions (not mid-session). Facebook's detection tracks session consistency.
- SQLite for storage with daily fan count snapshots for growth tracking.
- Public data only — page content visible without login. Never attempt to access private or restricted content.