Scrape LinkedIn Post Engagement Data with Python (2026)
LinkedIn doesn't give you a real API for engagement data. The official API is scoped to recruiting and advertising — useless if you want post-level analytics for competitor research or content strategy work. So you either pay for an expensive SaaS tool, or you scrape it yourself.
This guide covers both major approaches in depth: hitting LinkedIn's internal Voyager API for structured engagement metrics, and scraping public company pages with Playwright for unauthenticated data collection. Both approaches are production-viable with the right anti-detection setup.
What You Can Extract
Via Voyager API (authenticated):

- Reaction counts by type (Like, Celebrate, Support, Love, Insightful, Funny)
- Comment count and full comment threads (text, author, timestamp)
- Share count
- Post text and embedded media references
- Hashtags used in posts
- Post publication timestamp
- Company and author URN identifiers

Via Playwright HTML scraping (public pages):

- Total reaction count (aggregate, not broken down by type)
- Comment count (visible button label)
- Post text preview (truncated at "see more")
- Hashtags parsed from post text
- Post relative timestamps ("3 days ago", "1 week ago")
- Company display name and follower count visible on page

What requires premium access or is blocked:

- Exact follower count via Voyager (visible on public page but not always in API)
- Full audience analytics (impressions, reach) — requires Company Page admin access
- Individual liker identities — available via API but rate-limited heavily
LinkedIn's Anti-Bot Measures
Before touching code, understand what you're up against.
Login walls. Most engagement data requires authentication. For most content types, reaction counts, comment threads, and share counts are not visible to unauthenticated visitors. The exception is company pages, where basic counts are sometimes visible.
The li_at cookie. LinkedIn's session authentication lives in a cookie called li_at. Every authenticated request needs this. It's tied to your account session and expires or rotates after suspicious activity. If you're doing serious volume, burning a scraping account is a real risk.
CSRF tokens. Voyager API requests require a csrf-token header that matches the JSESSIONID cookie value. LinkedIn validates these together server-side. You cannot fake one without the other.
Rate limiting per session. LinkedIn rate-limits at the session level, not just by IP. Even with fresh IPs, hammering requests from the same li_at value triggers 429s and eventually account flags. Conservative pacing (2-3 seconds between requests) is mandatory.
Browser fingerprinting. LinkedIn's frontend JavaScript collects canvas fingerprints, WebGL renderer strings, screen dimensions, font enumeration, and navigator properties. Headless browsers without stealth configuration get flagged within the first page load.
Voyager endpoint protection. Voyager endpoints return 999 status codes if you hit them without the right headers (x-restli-protocol-version, x-li-lang, x-li-page-instance, x-li-track). These headers must be present and structurally valid.
IP reputation. LinkedIn blocks datacenter IP ranges almost entirely. Residential proxies are required for any sustained scraping. Even residential IPs get flagged if they originate from ranges commonly associated with proxy providers.
Approach 1: Voyager API (Authenticated)
The Voyager API is LinkedIn's internal REST layer. It lives at www.linkedin.com/voyager/api/. You need a valid li_at cookie and matching CSRF token extracted from a browser session.
Extracting credentials from your browser:
1. Open LinkedIn in Chrome, log in normally
2. Open DevTools (F12) → Application → Storage → Cookies → www.linkedin.com
3. Copy the value of li_at cookie
4. Copy the value of JSESSIONID cookie (the CSRF token is this value stripped of quotes)
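The quote-stripping in step 4 is easy to get wrong, so it helps to make the rule explicit. A small helper (the function name is mine, and the session value below is made up, not a real LinkedIn token):

```python
def extract_csrf_token(jsessionid: str) -> str:
    """Derive the csrf-token header value from the raw JSESSIONID cookie.

    The cookie value is stored with surrounding double quotes
    (e.g. '"ajax:1234567890"'); the header wants it without them.
    """
    return jsessionid.strip().strip('"')


# Example with a made-up session value:
print(extract_csrf_token('"ajax:1234567890123456789"'))
# ajax:1234567890123456789
```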
```python
import json
import random
import re
import time
from typing import Optional

import requests


class LinkedInVoyager:
    """Client for LinkedIn's internal Voyager API."""

    BASE = "https://www.linkedin.com/voyager/api"

    def __init__(self, li_at: str, jsessionid: str):
        self.session = requests.Session()
        self.session.cookies.set("li_at", li_at, domain=".linkedin.com")
        self.session.cookies.set("JSESSIONID", jsessionid, domain=".linkedin.com")
        # CSRF token is the JSESSIONID value with surrounding quotes stripped
        csrf = jsessionid.strip('"')
        self.session.headers.update({
            "csrf-token": csrf,
            "x-restli-protocol-version": "2.0.0",
            "x-li-lang": "en_US",
            "x-li-page-instance": "urn:li:page:d_flagship3_feed;ABC123",
            "x-li-track": json.dumps({
                "clientVersion": "1.13.5",
                "mpVersion": "1.13.5",
                "osName": "web",
                "timezoneOffset": 0,
                "timezone": "UTC",
                "deviceFormFactor": "DESKTOP",
                "mpName": "voyager-web",
                "displayDensity": 1,
                "displayWidth": 1920,
                "displayHeight": 1080,
            }),
            "user-agent": (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/126.0.0.0 Safari/537.36"
            ),
            "accept": "application/vnd.linkedin.normalized+json+2.1",
            "accept-language": "en-US,en;q=0.9",
            "referer": "https://www.linkedin.com/feed/",
            "sec-fetch-dest": "empty",
            "sec-fetch-mode": "cors",
            "sec-fetch-site": "same-origin",
        })
        self._request_count = 0
        self._last_request_time = 0.0

    def _throttle(self, min_delay: float = 2.0, jitter: float = 1.5):
        """Enforce a randomized minimum delay between requests."""
        elapsed = time.time() - self._last_request_time
        required = min_delay + random.uniform(0, jitter)
        if elapsed < required:
            time.sleep(required - elapsed)
        self._last_request_time = time.time()
        self._request_count += 1

    def get_company_posts(
        self,
        company_urn: str,
        count: int = 20,
        start: int = 0,
    ) -> dict:
        """Fetch posts from a company feed."""
        self._throttle()
        url = f"{self.BASE}/feed/updatesV2"
        params = {
            "variables": (
                f"(start:{start},count:{count},"
                f"feedType:COMPANY_FEED,"
                f"companyUrn:urn%3Ali%3Aorganization%3A{company_urn})"
            ),
            "queryId": "voyagerFeedDashUpdates.COMPANY_FEED",
        }
        resp = self.session.get(url, params=params, timeout=15)
        if resp.status_code == 429:
            retry_after = int(resp.headers.get("Retry-After", 60))
            print(f"Rate limited. Waiting {retry_after}s...")
            time.sleep(retry_after)
            return self.get_company_posts(company_urn, count, start)
        resp.raise_for_status()
        return resp.json()

    def parse_post_metrics(self, raw: dict) -> list[dict]:
        """Extract structured post metrics from a Voyager feed response."""
        posts = []
        elements = raw.get("included", [])
        for el in elements:
            if el.get("$type") != "com.linkedin.voyager.feed.render.UpdateV2":
                continue
            social = el.get("socialDetail", {})
            reaction_summaries = social.get("reactionSummaries", [])
            total_reactions = sum(r.get("count", 0) for r in reaction_summaries)
            reaction_breakdown = {
                r.get("reactionType", "UNKNOWN"): r.get("count", 0)
                for r in reaction_summaries
            }
            # Extract post text
            commentary = el.get("commentary", {})
            text_obj = commentary.get("text", {})
            post_text = text_obj.get("text", "")
            # Extract hashtags from text attributes
            hashtags = []
            for seg in text_obj.get("attributesV2", []):
                if seg.get("type") == "HASHTAG":
                    tag = seg.get("text", "").lstrip("#")
                    if tag:
                        hashtags.append(tag.lower())
            # Activity counts
            activity_counts = social.get("totalSocialActivityCounts", {})
            # URN and timestamp
            update_meta = el.get("updateMetadata", {})
            actor = el.get("actor", {})
            time_text = actor.get("subDescription", {}).get("text", "")
            posts.append({
                "urn": update_meta.get("urn", ""),
                "share_urn": update_meta.get("shareUrn", ""),
                "text": post_text,
                "hashtags": hashtags,
                "reactions_total": total_reactions,
                "reaction_breakdown": reaction_breakdown,
                "comments": activity_counts.get("numComments", 0),
                "shares": activity_counts.get("numShares", 0),
                "posted_at_text": time_text,
            })
        return posts

    def paginate_company_posts(
        self,
        company_urn: str,
        max_posts: int = 100,
    ) -> list[dict]:
        """Collect all posts for a company, up to max_posts."""
        all_posts = []
        start = 0
        page_size = 20
        while len(all_posts) < max_posts:
            raw = self.get_company_posts(company_urn, count=page_size, start=start)
            batch = self.parse_post_metrics(raw)
            if not batch:
                break
            all_posts.extend(batch)
            start += page_size
            print(f"  Collected {len(all_posts)} posts so far...")
        return all_posts[:max_posts]

    def get_post_comments(self, post_urn: str, count: int = 20) -> list[dict]:
        """Fetch comments for a specific post URN."""
        self._throttle()
        encoded_urn = requests.utils.quote(post_urn, safe="")
        url = f"{self.BASE}/feed/comments"
        params = {
            "variables": f"(count:{count},start:0,updateUrn:{encoded_urn})",
            "queryId": "voyagerFeedDashComments.FETCH",
        }
        resp = self.session.get(url, params=params, timeout=15)
        if resp.status_code not in (200, 201):
            return []
        data = resp.json()
        comments = []
        for el in data.get("included", []):
            if "commentV2" not in el.get("$type", ""):
                continue
            text = el.get("commentV2", {}).get("text", {}).get("text", "")
            if text:
                comments.append({
                    "text": text,
                    "urn": el.get("entityUrn", ""),
                })
        return comments
```
Approach 2: HTML Scraping with Playwright
For public company pages where you don't want to risk an account, Playwright with stealth settings extracts basic engagement data. You get totals but not reaction-type breakdowns.
```python
import asyncio
import random
import re

from playwright.async_api import async_playwright


async def scrape_company_page(
    company_slug: str,
    proxy_config: dict | None = None,
    max_posts: int = 15,
) -> list[dict]:
    """
    Scrape engagement metrics from a LinkedIn company's posts page.

    company_slug: the URL slug, e.g. 'openai' from linkedin.com/company/openai
    proxy_config: optional dict with 'server', 'username', 'password'
    """
    async with async_playwright() as p:
        launch_kwargs = {
            "headless": True,
            "args": [
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
            ],
        }
        if proxy_config:
            launch_kwargs["proxy"] = proxy_config
        browser = await p.chromium.launch(**launch_kwargs)
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/126.0.0.0 Safari/537.36"
            ),
            viewport={"width": 1280, "height": 900},
            locale="en-US",
            timezone_id="America/New_York",
        )
        # Suppress common automation-detection signals
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
            Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
            Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
            window.chrome = { runtime: {} };
        """)
        page = await context.new_page()
        url = f"https://www.linkedin.com/company/{company_slug}/posts/"
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await page.wait_for_timeout(3000)

        posts = []
        post_cards = await page.query_selector_all("div[data-id^='urn:li:activity']")

        def parse_count(raw: str) -> int:
            """Parse LinkedIn's abbreviated count format (e.g., '1.2K', '3,456')."""
            if not raw:
                return 0
            raw = raw.strip().replace(",", "")
            match = re.search(r"([\d.]+)\s*([KkMm]?)", raw)
            if not match:
                return 0
            num = float(match.group(1))
            suffix = match.group(2).upper()
            if suffix == "K":
                return int(num * 1000)
            if suffix == "M":
                return int(num * 1_000_000)
            return int(num)

        for card in post_cards[:max_posts]:
            try:
                # Post text
                text_el = await card.query_selector(
                    ".attributed-text-segment-list__content, "
                    "[class*='break-words'], .feed-shared-text"
                )
                text = (await text_el.inner_text()).strip() if text_el else ""
                hashtags = re.findall(r"#(\w+)", text)

                # Reactions
                reaction_el = await card.query_selector(
                    "[aria-label*='reaction'], "
                    "[class*='reactions-count'], "
                    "button[aria-label*='like']"
                )
                reactions_raw = (
                    await reaction_el.get_attribute("aria-label")
                    or await reaction_el.inner_text()
                ) if reaction_el else "0"
                count_match = re.search(r"([\d,.KkMm]+)", reactions_raw or "0")
                reactions = parse_count(count_match.group(1)) if count_match else 0

                # Comments
                comment_el = await card.query_selector(
                    "button[aria-label*='comment'], "
                    "[class*='comments-count']"
                )
                comments_raw = (await comment_el.inner_text()).strip() if comment_el else "0"
                count_match2 = re.search(r"([\d,.KkMm]+)", comments_raw)
                comments = parse_count(count_match2.group(1)) if count_match2 else 0

                # Timestamp (relative)
                time_el = await card.query_selector(
                    "time, [class*='ago'], span[aria-label*='ago']"
                )
                timestamp = (await time_el.inner_text()).strip() if time_el else None

                posts.append({
                    "company_slug": company_slug,
                    "text": text[:500],  # truncate for storage
                    "hashtags": hashtags,
                    "reactions": reactions,
                    "comments": comments,
                    "timestamp_text": timestamp,
                })
                await page.wait_for_timeout(random.randint(800, 1500))
            except Exception as e:
                print(f"Error extracting post card: {e}")
                continue

        await browser.close()
        return posts


# Run with proxy
proxy = {
    "server": "http://proxy.thordata.com:9000",
    "username": "YOUR_USER",
    "password": "YOUR_PASS",
}
posts = asyncio.run(scrape_company_page("openai", proxy_config=proxy, max_posts=10))
```
Proxy and Anti-Detection Setup
LinkedIn aggressively blocks datacenter IPs. A fresh AWS or DigitalOcean IP gets a 999 status or redirects to a challenge page within minutes of any scraping activity. For any sustained scraping, residential proxies are required.
ThorData's residential proxies handle LinkedIn's geo-checks reliably. Their automatic rotation keeps sessions alive longer by preventing the IP-level rate limits that accumulate with a single residential address. Wire them into either approach:
```python
# For Voyager (requests-based)
voyager = LinkedInVoyager(li_at="YOUR_LI_AT", jsessionid="YOUR_JSESSIONID")
voyager.session.proxies = {
    "http": "http://USER:[email protected]:9000",
    "https": "http://USER:[email protected]:9000",
}

# For Playwright: pass this dict as proxy_config to scrape_company_page()
proxy_config = {
    "server": "http://proxy.thordata.com:9000",
    "username": "USER",
    "password": "PASS",
}
```
Best practices:
- Rotate proxies every 50-100 requests
- Randomize request intervals between 2 and 5 seconds for Voyager, 1.5-4 seconds for Playwright
- Never reuse a li_at cookie across multiple proxy IPs in the same session
- Use geo-consistent proxies (match country to your account's registered location)
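The first two practices can be combined into a small rotation helper. This is a sketch of the 50-100 request rotation window described above; the pool entries are placeholders, not real endpoints:

```python
import itertools
import random


class ProxyRotator:
    """Cycle through a proxy pool, switching after a randomized request budget.

    Randomizing the budget (rather than rotating every N requests exactly)
    avoids a detectable fixed-period rotation signature.
    """

    def __init__(self, proxies: list[str], min_requests: int = 50, max_requests: int = 100):
        self._cycle = itertools.cycle(proxies)
        self._min = min_requests
        self._max = max_requests
        self._budget = 0
        self._current = None

    def get(self) -> str:
        """Return the proxy to use for the next request."""
        if self._budget <= 0:
            self._current = next(self._cycle)
            self._budget = random.randint(self._min, self._max)
        self._budget -= 1
        return self._current


# Placeholder pool
rotator = ProxyRotator([
    "http://USER:[email protected]:9000",
    "http://USER:[email protected]:9000",
])
```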
Engagement Rate Formulas
Consistent formulas matter for cross-company comparison:
```python
import math
from collections import defaultdict
from datetime import datetime


def engagement_rate(
    reactions: int,
    comments: int,
    shares: int,
    followers: int,
) -> float:
    """
    Standard engagement rate: total engagements / follower count.
    Returns a percentage (0-100 scale).
    """
    if followers == 0:
        return 0.0
    total = reactions + comments + shares
    return round(total / followers * 100, 4)


def weighted_engagement_rate(
    reactions: int,
    comments: int,
    shares: int,
    followers: int,
) -> float:
    """
    Weighted engagement rate where comments and shares score higher than reactions.
    Weights: reactions=1, comments=3, shares=5.
    """
    if followers == 0:
        return 0.0
    score = reactions * 1 + comments * 3 + shares * 5
    return round(score / followers * 100, 4)


def reaction_diversity_score(reaction_breakdown: dict) -> float:
    """
    Score based on diversity of reaction types, computed as normalized
    Shannon entropy of the reaction distribution.

    A post with varied reactions (Celebrate, Insightful, etc.) scores higher
    than one with only Likes, indicating deeper emotional resonance.
    """
    if not reaction_breakdown:
        return 0.0
    total = sum(reaction_breakdown.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in reaction_breakdown.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    # Normalize by the maximum entropy for the number of types present
    max_types = len(reaction_breakdown)
    max_entropy = math.log2(max_types) if max_types > 1 else 1.0
    return round(entropy / max_entropy, 4) if max_entropy > 0 else 0.0


def posting_frequency(post_timestamps: list[float]) -> float:
    """
    Average posts per week given a list of Unix timestamps.
    Returns 0.0 if fewer than 2 timestamps are provided.
    """
    if len(post_timestamps) < 2:
        return 0.0
    span_days = (max(post_timestamps) - min(post_timestamps)) / 86400
    if span_days < 1:
        return 0.0
    return round(len(post_timestamps) / (span_days / 7), 2)


def best_posting_day(post_timestamps: list[float]) -> str:
    """
    Identify which day of the week the account posts most often.

    Note: with timestamps alone this ranks days by posting volume;
    ranking by engagement would also require per-post reaction counts.
    Returns a day name (Monday-Sunday), or "unknown" for empty input.
    """
    day_posts = defaultdict(list)
    for ts in post_timestamps:
        day_name = datetime.fromtimestamp(ts).strftime("%A")
        day_posts[day_name].append(ts)
    if not day_posts:
        return "unknown"
    return max(day_posts.items(), key=lambda x: len(x[1]))[0]
```
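A quick sanity check of the two rate formulas with made-up numbers (120 reactions, 30 comments, 10 shares, 50,000 followers):

```python
reactions, comments, shares, followers = 120, 30, 10, 50_000

# Standard rate: (120 + 30 + 10) / 50,000 * 100 = 0.32%
er = round((reactions + comments + shares) / followers * 100, 4)

# Weighted rate: (120*1 + 30*3 + 10*5) / 50,000 * 100 = 0.52%
wer = round((reactions * 1 + comments * 3 + shares * 5) / followers * 100, 4)

print(er, wer)
# 0.32 0.52
```

Note how the weighted rate rewards the same post more heavily because a third of its engagement came from comments and shares.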
Hashtag Performance Analysis
Track which hashtags drive higher engagement across posts:
```python
import statistics
from collections import defaultdict


def analyze_hashtags(posts: list[dict]) -> list[dict]:
    """
    Aggregate engagement metrics by hashtag across all posts.
    Returns a list sorted by average reactions, descending.
    """
    hashtag_stats = defaultdict(lambda: {
        "uses": 0,
        "total_reactions": 0,
        "total_comments": 0,
        "total_shares": 0,
        "reaction_counts": [],
    })
    for post in posts:
        for tag in post.get("hashtags", []):
            tag = tag.lower().strip()
            if not tag:
                continue
            hashtag_stats[tag]["uses"] += 1
            hashtag_stats[tag]["total_reactions"] += post.get("reactions_total", 0)
            hashtag_stats[tag]["total_comments"] += post.get("comments", 0)
            hashtag_stats[tag]["total_shares"] += post.get("shares", 0)
            hashtag_stats[tag]["reaction_counts"].append(post.get("reactions_total", 0))
    results = []
    for tag, stats in hashtag_stats.items():
        uses = stats["uses"]
        counts = stats["reaction_counts"]
        results.append({
            "hashtag": tag,
            "uses": uses,
            "avg_reactions": round(stats["total_reactions"] / uses, 1),
            "avg_comments": round(stats["total_comments"] / uses, 1),
            "avg_shares": round(stats["total_shares"] / uses, 1),
            "median_reactions": statistics.median(counts) if counts else 0,
            "max_reactions": max(counts) if counts else 0,
        })
    return sorted(results, key=lambda x: x["avg_reactions"], reverse=True)
```
SQLite Storage Schema
```python
import json
import sqlite3
from datetime import datetime, timezone


def init_linkedin_db(db_path: str = "linkedin_engagement.db") -> sqlite3.Connection:
    """Initialize the LinkedIn engagement tracking database."""
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS companies (
            slug TEXT PRIMARY KEY,
            name TEXT,
            follower_count INTEGER,
            description TEXT,
            industry TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS posts (
            urn TEXT PRIMARY KEY,
            company_slug TEXT NOT NULL,
            text TEXT,
            hashtags TEXT,            -- JSON array
            reactions_total INTEGER DEFAULT 0,
            reaction_breakdown TEXT,  -- JSON object
            comments INTEGER DEFAULT 0,
            shares INTEGER DEFAULT 0,
            engagement_rate REAL,
            weighted_er REAL,
            posted_at_text TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            FOREIGN KEY (company_slug) REFERENCES companies(slug)
        );

        CREATE TABLE IF NOT EXISTS hashtag_stats (
            company_slug TEXT NOT NULL,
            hashtag TEXT NOT NULL,
            uses INTEGER,
            avg_reactions REAL,
            avg_comments REAL,
            computed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (company_slug, hashtag)
        );

        CREATE INDEX IF NOT EXISTS idx_posts_company ON posts(company_slug);
        CREATE INDEX IF NOT EXISTS idx_posts_engagement ON posts(engagement_rate DESC);
        CREATE INDEX IF NOT EXISTS idx_hashtag_stats_tag ON hashtag_stats(hashtag);
    """)
    conn.commit()
    return conn


def upsert_post(
    conn: sqlite3.Connection,
    post: dict,
    company_slug: str,
    follower_count: int = 0,
):
    """Insert or update a post with computed engagement metrics."""
    er = engagement_rate(
        post.get("reactions_total", 0),
        post.get("comments", 0),
        post.get("shares", 0),
        follower_count,
    )
    wer = weighted_engagement_rate(
        post.get("reactions_total", 0),
        post.get("comments", 0),
        post.get("shares", 0),
        follower_count,
    )
    conn.execute("""
        INSERT OR REPLACE INTO posts
            (urn, company_slug, text, hashtags, reactions_total,
             reaction_breakdown, comments, shares, engagement_rate,
             weighted_er, posted_at_text, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        post.get("urn") or post.get("text", "")[:50],  # fallback key for HTML-scraped posts
        company_slug,
        post.get("text"),
        json.dumps(post.get("hashtags", [])),
        post.get("reactions_total", post.get("reactions", 0)),
        json.dumps(post.get("reaction_breakdown", {})),
        post.get("comments", 0),
        post.get("shares", 0),
        er,
        wer,
        post.get("posted_at_text"),
        datetime.now(timezone.utc).isoformat(),
    ))
    conn.commit()


def get_top_posts(
    conn: sqlite3.Connection,
    company_slug: str,
    metric: str = "engagement_rate",
    limit: int = 10,
) -> list:
    """Get the top-performing posts for a company, sorted by metric."""
    # The column name is interpolated into the SQL string, so it must be
    # checked against a whitelist; it cannot be a bound parameter.
    valid_metrics = {"engagement_rate", "weighted_er", "reactions_total", "comments"}
    if metric not in valid_metrics:
        metric = "engagement_rate"
    return conn.execute(f"""
        SELECT text, hashtags, reactions_total, comments, shares, {metric}
        FROM posts
        WHERE company_slug = ?
        ORDER BY {metric} DESC
        LIMIT ?
    """, (company_slug, limit)).fetchall()
```
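The whitelist in get_top_posts exists because an ORDER BY column name cannot be passed as a bound parameter; it must be validated before string interpolation. A minimal standalone demonstration of the pattern, using an in-memory database and a toy schema of my own (table and column names here are illustrative only):

```python
import sqlite3


def top_rows(conn: sqlite3.Connection, column: str, limit: int = 3) -> list:
    """Sort by a caller-chosen column, safely.

    The column name is checked against an explicit allow-list before being
    interpolated; anything else (including injection attempts) is rejected.
    """
    allowed = {"score", "views"}
    if column not in allowed:
        raise ValueError(f"unsupported sort column: {column}")
    return conn.execute(
        f"SELECT name, {column} FROM demo ORDER BY {column} DESC LIMIT ?",
        (limit,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (name TEXT, score REAL, views INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?, ?, ?)",
                 [("a", 1.0, 10), ("b", 3.0, 5), ("c", 2.0, 8)])

print(top_rows(conn, "score"))
# [('b', 3.0), ('c', 2.0), ('a', 1.0)]
```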
Use Cases and Applications
Content strategy optimization. Collect 6-12 months of posts from your own company page using the Voyager approach (with your own li_at). Run the hashtag analysis and engagement rate calculations. You will quickly see which content formats (native video, carousels, text-only) and which topic clusters drive the highest weighted engagement. Adjust your posting calendar around what the data shows, not gut feeling.
Competitor benchmarking. Track 5-10 competitor companies weekly. Measure their posting frequency, average engagement rate per format type, and top-performing hashtags. Spot inflection points — when a competitor's engagement rate jumps significantly, dig into what changed in their content mix. Often it is a new content type or a campaign that you can analyze and adapt.
Influencer and partner identification. Pull posts mentioning specific hashtags or topics in your niche using the Voyager hashtag feed endpoint. Rank post authors by weighted engagement rate rather than follower count. High engagement rate on a smaller following often signals a more authentic audience — more valuable for partnership decisions than raw follower numbers.
Sales intelligence. Track posts from prospect companies. Spikes in hiring announcements, product launches, or executive commentary indicate buying signals. A company posting heavily about a pain point you solve is a warm outreach opportunity.
Industry benchmarking. Collect post-level data across 50-100 companies in a vertical. Build a benchmark report showing industry-average engagement rates, top hashtag themes, and posting frequency norms. This kind of benchmark content performs well as gated content or as a foundation for thought leadership.
Ethical Considerations and Rate Limits
LinkedIn's Terms of Service prohibit automated scraping. Understand the risk profile before running this in production:
- Accounts used for Voyager scraping can be restricted or terminated without notice
- Scraping public company pages with Playwright carries lower risk than API extraction
- Scraping competitor company pages carries a different risk and ethics profile than mass-scraping personal profiles
Minimum safe delays: 2.5 seconds between Voyager calls with 1-2 seconds of additional jitter; 1.5 seconds between Playwright page interactions. If you see 429 responses, back off for at least 60 seconds before retrying.
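The fixed 60-second back-off generalizes to exponential back-off with jitter for repeated 429s. A sketch; the doubling schedule and 15-minute cap are my choices, not LinkedIn-documented values:

```python
import random
import time


def backoff_delay(attempt: int, base: float = 60.0, cap: float = 900.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    Doubles the base delay each attempt, caps it, then adds up to 10%
    jitter so parallel workers don't retry in lockstep.
    """
    delay = min(base * (2 ** attempt), cap)
    return delay + random.uniform(0, delay * 0.1)


# In a retry loop, after a 429 response:
#     time.sleep(backoff_delay(attempt))
```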
Do not store personally identifiable data on employees or individual users. Do not use scraped engagement data to build shadow profiles or power unsolicited outreach at scale. The techniques documented here are for content analytics, competitive research, and personal strategy work.