
Scraping LinkedIn Company Pages: Python Guide (2026)

LinkedIn is one of the hardest websites to scrape. The company invests heavily in anti-bot systems, fingerprints browsers down to the TLS handshake, and bans accounts that automate too aggressively. But company page data — employee counts, industry, headquarters, job postings — is genuinely useful for sales prospecting, competitive intelligence, HR research, and market analysis.

This guide covers what's possible in 2026: the guest API for public data, the Voyager API structure for authenticated access, anti-bot countermeasures, SQLite storage, and proxy integration for sustained operation.

What Data Is Available

LinkedIn company pages expose several categories of data: the company overview (name, description, industry, specialties), employee counts and size ranges, headquarters and office locations, founding year, follower counts, active job postings, and recent company posts.

What you cannot get without deep account access: email addresses, personal contact details, connection-level data, or anything behind LinkedIn's premium analytics wall.

The Guest API

LinkedIn exposes some company data through unauthenticated "guest" endpoints. These don't require login but are rate-limited and heavily fingerprinted.

import httpx
import time
import json
from bs4 import BeautifulSoup

def get_company_public(company_slug: str, proxy_url: str | None = None) -> str:
    """Fetch public company page HTML from LinkedIn."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Cache-Control": "max-age=0",
    }

    with httpx.Client(
        headers=headers,
        proxy=proxy_url,
        follow_redirects=True,
        timeout=30,
        http2=True,
    ) as client:  # context manager closes the connection pool when done
        resp = client.get(f"https://www.linkedin.com/company/{company_slug}/")
        resp.raise_for_status()
        return resp.text

The HTML response contains embedded JSON-LD with structured company data:

def parse_company_jsonld(html: str) -> dict | None:
    """Extract JSON-LD structured data from LinkedIn company page."""
    soup = BeautifulSoup(html, "lxml")
    scripts = soup.find_all("script", type="application/ld+json")

    for script in scripts:
        if not script.string:
            continue
        try:
            data = json.loads(script.string)
            if data.get("@type") == "Organization":
                return {
                    "name": data.get("name"),
                    "description": data.get("description"),
                    "url": data.get("url"),
                    "industry": data.get("industry"),
                    "employee_count": data.get("numberOfEmployees", {}).get("value"),
                    "logo": data.get("logo"),
                    "founded": data.get("foundingDate"),
                    "headquarters": data.get("address", {}).get("addressLocality"),
                    "country": data.get("address", {}).get("addressCountry"),
                    "social_profiles": data.get("sameAs", []),
                }
        except json.JSONDecodeError:
            continue
    return None


def parse_company_meta(html: str) -> dict:
    """Extract Open Graph and meta tags for basic company info."""
    soup = BeautifulSoup(html, "lxml")

    def get_meta(name: str, attr: str = "name") -> str:
        tag = soup.find("meta", attrs={attr: name})
        return tag.get("content", "") if tag else ""

    return {
        "og_title": get_meta("og:title", "property"),
        "og_description": get_meta("og:description", "property"),
        "og_image": get_meta("og:image", "property"),
        "twitter_title": get_meta("twitter:title"),
        "canonical_url": soup.find("link", rel="canonical").get("href", "") if soup.find("link", rel="canonical") else "",
    }


# Example usage
html = get_company_public("openai")
company = parse_company_jsonld(html)
if company:
    print(f"{company['name']} — {company['employee_count']} employees — {company['headquarters']}")

The Voyager API

LinkedIn's internal API is called Voyager. It powers the entire frontend and returns rich JSON responses. The endpoints follow a pattern:

https://www.linkedin.com/voyager/api/voyagerOrganizationDashCompanies?q=universalName&universalName={slug}

Voyager requires authentication — you need valid li_at and JSESSIONID cookies from a logged-in session:

def get_voyager_company(company_slug: str, li_at: str, jsessionid: str,
                         proxy_url: str | None = None) -> dict:
    """Fetch detailed company data from Voyager API."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept": "application/vnd.linkedin.normalized+json+2.1",
        "Csrf-Token": jsessionid,
        "X-Restli-Protocol-Version": "2.0.0",
        "X-Li-Track": '{"clientVersion":"1.13.18071","mpVersion":"1.13.18071","osName":"web","timezoneOffset":-5,"timezone":"America/New_York","deviceFormFactor":"DESKTOP","mpName":"voyager-web"}',
    }
    cookies = {
        "li_at": li_at,
        "JSESSIONID": f'"{jsessionid}"',
    }

    with httpx.Client(proxy=proxy_url, timeout=30) as client:
        resp = client.get(
            "https://www.linkedin.com/voyager/api/voyagerOrganizationDashCompanies",
            params={"q": "universalName", "universalName": company_slug},
            headers=headers,
            cookies=cookies,
        )
        resp.raise_for_status()
        return resp.json()


def parse_voyager_company(data: dict) -> dict | None:
    """Parse Voyager company response into clean dict."""
    elements = data.get("included", [])
    company = None
    for el in elements:
        if el.get("$type") == "com.linkedin.voyager.dash.organization.Company":
            company = el
            break

    if not company:
        return None

    return {
        "name": company.get("name"),
        "tagline": company.get("tagline"),
        "description": company.get("description"),
        "industry": company.get("industryV2", {}).get("name"),
        "company_size_start": company.get("staffCountRange", {}).get("start"),
        "company_size_end": company.get("staffCountRange", {}).get("end"),
        "staff_count": company.get("staffCount"),
        "headquarters_city": company.get("headquarter", {}).get("city"),
        "headquarters_country": company.get("headquarter", {}).get("country"),
        "founded_year": company.get("foundedOn", {}).get("year"),
        "specialties": company.get("specialities", []),
        "website": company.get("companyPageUrl"),
        "follower_count": company.get("followingInfo", {}).get("followerCount"),
        "logo_url": company.get("logo", {}).get("image", {}).get("rootUrl"),
        "entity_urn": company.get("entityUrn", ""),
        "confirmed_locations": company.get("confirmedLocations", []),
    }


def extract_company_id_from_urn(entity_urn: str) -> str | None:
    """Extract numeric company ID from LinkedIn URN.

    Example: 'urn:li:fsd_company:1441' -> '1441'
    """
    import re
    match = re.search(r":(\d+)$", entity_urn)
    return match.group(1) if match else None

Scraping Job Postings

LinkedIn's job API is accessible through Voyager with authenticated requests:

def get_company_jobs(company_id: str, li_at: str, jsessionid: str,
                      start: int = 0, count: int = 25,
                      proxy_url: str | None = None) -> list[dict]:
    """Fetch job postings for a company."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/vnd.linkedin.normalized+json+2.1",
        "Csrf-Token": jsessionid,
        "X-Restli-Protocol-Version": "2.0.0",
    }
    cookies = {"li_at": li_at, "JSESSIONID": f'"{jsessionid}"'}

    params = {
        "decorationId": "com.linkedin.voyager.dash.deco.jobs.search.JobSearchCard-51",
        "q": "jobSearch",
        "start": start,
        "count": count,
        "filters": f"List(resultType->JOBS,company->List({company_id}))",
    }

    with httpx.Client(proxy=proxy_url, timeout=30) as client:
        resp = client.get(
            "https://www.linkedin.com/voyager/api/voyagerJobsDashJobCards",
            params=params,
            headers=headers,
            cookies=cookies,
        )
        resp.raise_for_status()
        data = resp.json()

    jobs = []
    for el in data.get("included", []):
        if "jobPostingTitle" in el:
            jobs.append({
                "title": el.get("jobPostingTitle"),
                "location": el.get("formattedLocation"),
                "listed_at": el.get("listedAt"),
                "remote": el.get("workRemoteAllowed", False),
                "employment_type": el.get("employmentStatus"),
                "job_url": el.get("jobPostingUrl", ""),
                "entity_urn": el.get("entityUrn", ""),
            })
    return jobs


def get_all_company_jobs(company_id: str, li_at: str, jsessionid: str,
                          proxy_url: str | None = None, max_jobs: int = 500) -> list[dict]:
    """Paginate through all active job postings for a company."""
    all_jobs = []
    start = 0
    batch_size = 25

    while len(all_jobs) < max_jobs:
        batch = get_company_jobs(
            company_id, li_at, jsessionid,
            start=start, count=batch_size,
            proxy_url=proxy_url,
        )
        if not batch:
            break

        all_jobs.extend(batch)
        print(f"  Jobs: {len(all_jobs)} (page start={start})")
        start += batch_size
        time.sleep(3)  # Critical — LinkedIn watches request timing

    return all_jobs

Scraping Company Updates

Company posts and announcements are available via Voyager:

def get_company_updates(company_id: str, li_at: str, jsessionid: str,
                          start: int = 0, count: int = 20,
                          proxy_url: str | None = None) -> list[dict]:
    """Fetch company feed updates (posts, announcements)."""
    headers = {
        "Accept": "application/vnd.linkedin.normalized+json+2.1",
        "Csrf-Token": jsessionid,
    }
    cookies = {"li_at": li_at, "JSESSIONID": f'"{jsessionid}"'}

    params = {
        "q": "memberShareFeed",
        "moduleKey": "member-shares:phone",
        "count": count,
        "start": start,
        "companyId": company_id,
    }

    with httpx.Client(proxy=proxy_url, timeout=30) as client:
        resp = client.get(
            "https://www.linkedin.com/voyager/api/feed/updates",
            params=params,
            headers=headers,
            cookies=cookies,
        )
        resp.raise_for_status()
        data = resp.json()

    updates = []
    for el in data.get("included", []):
        if el.get("$type") == "com.linkedin.voyager.feed.Update":
            commentary = el.get("commentary", {})
            updates.append({
                "urn": el.get("entityUrn"),
                "text": commentary.get("text", {}).get("text", ""),
                "published_at": el.get("created", {}).get("time"),
                "num_likes": el.get("socialDetail", {}).get("totalSocialActivityCounts", {}).get("numLikes", 0),
                "num_comments": el.get("socialDetail", {}).get("totalSocialActivityCounts", {}).get("numComments", 0),
            })
    return updates

Anti-Bot Measures

LinkedIn runs one of the most sophisticated anti-bot systems on the web. Here's what you're up against in 2026:

TLS Fingerprinting

LinkedIn fingerprints your TLS handshake — the cipher suites, extensions, and their order. Standard Python httpx or requests libraries produce a TLS fingerprint that's instantly recognizable as non-browser. This alone can get you blocked before your request even reaches the server.

# Standard httpx gets flagged by TLS fingerprint
# For production scraping, use curl_cffi which mimics browser TLS fingerprints

from curl_cffi import requests as curl_requests

resp = curl_requests.get(
    "https://www.linkedin.com/company/openai/",
    impersonate="chrome124",
    timeout=30,
)

Browser Behavioral Analysis

LinkedIn tracks mouse movements, scroll patterns, and timing between actions. Automated sessions that navigate too quickly or too predictably get flagged within minutes.
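For a pure-HTTP scraper, the main defense against timing analysis is irregular pacing. Here is a minimal sketch — `human_delay` is a hypothetical helper, not anything LinkedIn-specific — that samples right-skewed, human-like delays instead of sleeping a fixed interval:

```python
import random


def human_delay(base: float = 4.0) -> float:
    """Sample a human-like delay in seconds (illustrative helper).

    A log-normal draw is right-skewed: mostly short waits, occasionally
    much longer ones, which looks less machine-like than a fixed sleep.
    """
    delay = random.lognormvariate(0, 0.5) * base
    # Occasionally pause much longer, as if the "user" got distracted.
    if random.random() < 0.08:
        delay += random.uniform(20, 60)
    return delay
```

Call `time.sleep(human_delay())` between requests wherever you would otherwise use a constant `time.sleep(3)`.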

Rate Limiting and Account Restrictions

Accounts that make too many profile or company page views get restricted. LinkedIn's commercial use limit kicks in around 80-100 profile views per day for free accounts. Hit that wall and you'll get soft bans — CAPTCHAs on every page, then temporary lockout.
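A cheap way to stay under those limits is a per-day request budget checked before every authenticated call. This is a minimal sketch — `DailyQuota` and the 60-request ceiling are illustrative choices, deliberately set below the ~80-100 view range where restrictions kick in:

```python
from datetime import date


class DailyQuota:
    """Count requests per calendar day and refuse once a budget is spent."""

    def __init__(self, limit: int = 60):
        self.limit = limit
        self.day = date.today()
        self.count = 0

    def allow(self) -> bool:
        """Return True and consume one unit, or False if today's budget is gone."""
        today = date.today()
        if today != self.day:  # new calendar day: reset the counter
            self.day = today
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True
```

In a long-running scraper you would persist the count (e.g. in SQLite) so a restart doesn't reset the budget.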

IP Reputation

Datacenter IPs are blacklisted almost immediately. LinkedIn maintains extensive IP reputation databases. If you're doing any volume of LinkedIn scraping, you need residential proxies.

ThorData's residential proxy network is built for exactly this use case. Their rotating residential IPs mimic real user connections and survive LinkedIn's IP reputation checks. Combined with TLS fingerprint spoofing from curl_cffi, this handles the two biggest technical barriers:

from curl_cffi import requests as curl_requests

# Guest page scraping with TLS spoofing + residential proxy
resp = curl_requests.get(
    "https://www.linkedin.com/company/openai/",
    impersonate="chrome124",
    proxies={"https": "http://user:[email protected]:9000"},
    timeout=30,
)
html = resp.text

Employee Count History Tracking

One valuable dataset is tracking how a company's employee count changes over time. LinkedIn surfaces this in their premium analytics, but you can approximate it by scraping periodically:

import sqlite3
from datetime import datetime


def init_linkedin_db(db_path: str = "linkedin_data.db") -> sqlite3.Connection:
    """Initialize SQLite database for LinkedIn data."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")

    conn.execute("""
        CREATE TABLE IF NOT EXISTS companies (
            company_slug TEXT PRIMARY KEY,
            name TEXT,
            industry TEXT,
            headquarters_city TEXT,
            headquarters_country TEXT,
            founded_year INTEGER,
            specialties TEXT,  -- JSON array
            website TEXT,
            entity_urn TEXT,
            last_scraped TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS employee_snapshots (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            company_slug TEXT NOT NULL,
            staff_count_min INTEGER,
            staff_count_max INTEGER,
            staff_count_exact INTEGER,
            follower_count INTEGER,
            recorded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            FOREIGN KEY (company_slug) REFERENCES companies(company_slug)
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS job_postings (
            entity_urn TEXT PRIMARY KEY,
            company_slug TEXT NOT NULL,
            title TEXT,
            location TEXT,
            remote INTEGER DEFAULT 0,
            employment_type TEXT,
            job_url TEXT,
            listed_at INTEGER,
            first_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            last_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS company_updates (
            urn TEXT PRIMARY KEY,
            company_slug TEXT NOT NULL,
            text TEXT,
            published_at INTEGER,
            num_likes INTEGER DEFAULT 0,
            num_comments INTEGER DEFAULT 0,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    conn.execute("CREATE INDEX IF NOT EXISTS idx_snapshots_company ON employee_snapshots(company_slug)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_company ON job_postings(company_slug)")
    conn.commit()
    return conn


def record_employee_snapshot(conn: sqlite3.Connection, company_slug: str,
                               company_data: dict) -> None:
    """Store employee count snapshot for trend tracking."""
    conn.execute(
        """INSERT INTO employee_snapshots
           (company_slug, staff_count_min, staff_count_max, staff_count_exact, follower_count)
           VALUES (?, ?, ?, ?, ?)""",
        (
            company_slug,
            company_data.get("company_size_start"),
            company_data.get("company_size_end"),
            company_data.get("staff_count"),
            company_data.get("follower_count"),
        )
    )
    conn.commit()


def save_company(conn: sqlite3.Connection, company_slug: str, company_data: dict) -> None:
    """Upsert company record."""
    conn.execute(
        """INSERT OR REPLACE INTO companies
           (company_slug, name, industry, headquarters_city, headquarters_country,
            founded_year, specialties, website, entity_urn)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            company_slug,
            company_data.get("name"),
            company_data.get("industry"),
            company_data.get("headquarters_city"),
            company_data.get("headquarters_country"),
            company_data.get("founded_year"),
            json.dumps(company_data.get("specialties", [])),
            company_data.get("website"),
            company_data.get("entity_urn"),
        )
    )
    conn.commit()


def bulk_save_jobs(conn: sqlite3.Connection, company_slug: str,
                    jobs: list[dict]) -> int:
    """Bulk insert job postings, updating last_seen for existing entries."""
    now = datetime.now().isoformat()
    new_count = 0

    for job in jobs:
        urn = job.get("entity_urn", "")
        if not urn:
            continue

        existing = conn.execute(
            "SELECT 1 FROM job_postings WHERE entity_urn = ?", (urn,)
        ).fetchone()

        if existing:
            conn.execute(
                "UPDATE job_postings SET last_seen = ? WHERE entity_urn = ?",
                (now, urn)
            )
        else:
            conn.execute(
                """INSERT INTO job_postings
                   (entity_urn, company_slug, title, location, remote,
                    employment_type, job_url, listed_at)
                   VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
                (
                    urn, company_slug,
                    job.get("title"), job.get("location"),
                    1 if job.get("remote") else 0,
                    job.get("employment_type"),
                    job.get("job_url"),
                    job.get("listed_at"),
                )
            )
            new_count += 1

    conn.commit()
    return new_count

With periodic scraping stored in SQLite, you can analyze headcount trends:

def get_employee_trend(db_path: str, company_slug: str) -> list[dict]:
    """Get employee count history for a company."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT recorded_at, staff_count_exact, staff_count_min, staff_count_max, follower_count
           FROM employee_snapshots
           WHERE company_slug = ?
           ORDER BY recorded_at""",
        (company_slug,)
    ).fetchall()
    conn.close()

    return [
        {
            "date": row[0],
            "staff_exact": row[1],
            "staff_min": row[2],
            "staff_max": row[3],
            "followers": row[4],
        }
        for row in rows
    ]


def get_job_velocity(db_path: str, company_slug: str, days: int = 30) -> dict:
    """Count new job postings in last N days as a hiring velocity signal."""
    conn = sqlite3.connect(db_path)
    result = conn.execute(
        """SELECT COUNT(*) FROM job_postings
           WHERE company_slug = ?
             AND first_seen >= datetime('now', ?)""",
        (company_slug, f"-{days} days")
    ).fetchone()
    conn.close()

    return {"company": company_slug, "new_jobs_last_n_days": result[0], "period_days": days}


def compare_companies(db_path: str, slugs: list[str]) -> list[dict]:
    """Compare multiple companies by latest employee count and job velocity."""
    conn = sqlite3.connect(db_path)
    comparison = []

    for slug in slugs:
        latest = conn.execute(
            """SELECT staff_count_min, staff_count_max, staff_count_exact, follower_count
               FROM employee_snapshots
               WHERE company_slug = ?
               ORDER BY recorded_at DESC LIMIT 1""",
            (slug,)
        ).fetchone()

        company_row = conn.execute(
            "SELECT name, industry, founded_year FROM companies WHERE company_slug = ?",
            (slug,)
        ).fetchone()

        if company_row:
            comparison.append({
                "slug": slug,
                "name": company_row[0],
                "industry": company_row[1],
                "founded": company_row[2],
                "staff_min": latest[0] if latest else None,
                "staff_max": latest[1] if latest else None,
                "staff_exact": latest[2] if latest else None,
                "followers": latest[3] if latest else None,
            })

    conn.close()
    return comparison

Full Scraping Pipeline

A complete script that scrapes multiple companies and stores the data:

import random

TARGET_COMPANIES = [
    "openai", "anthropic", "google", "meta", "microsoft",
    "nvidia", "apple", "amazon", "tesla", "stripe",
]

def run_company_pipeline(
    db_path: str = "linkedin_data.db",
    li_at: str | None = None,
    jsessionid: str | None = None,
    proxy_url: str = "http://USER:[email protected]:9000",
):
    """Scrape LinkedIn company data with full pipeline."""
    conn = init_linkedin_db(db_path)

    for i, slug in enumerate(TARGET_COMPANIES):
        print(f"\n[{i+1}/{len(TARGET_COMPANIES)}] {slug}")

        # 1. Guest API for public data (no auth needed)
        try:
            html = get_company_public(slug, proxy_url=proxy_url)
            basic_data = parse_company_jsonld(html)

            if basic_data:
                print(f"  Guest API: {basic_data['name']}, {basic_data['employee_count']} employees")
        except Exception as e:
            print(f"  Guest API failed: {e}")
            basic_data = None

        # 2. Voyager API for detailed data (requires auth cookies)
        if li_at and jsessionid:
            try:
                voyager_raw = get_voyager_company(slug, li_at, jsessionid, proxy_url)
                voyager_data = parse_voyager_company(voyager_raw)

                if voyager_data:
                    save_company(conn, slug, voyager_data)
                    record_employee_snapshot(conn, slug, voyager_data)
                    print(f"  Voyager: {voyager_data['staff_count']} staff, {voyager_data['follower_count']} followers")

                    # Get jobs if we have company ID
                    company_id = extract_company_id_from_urn(voyager_data.get("entity_urn", ""))
                    if company_id:
                        jobs = get_all_company_jobs(company_id, li_at, jsessionid, proxy_url)
                        new_jobs = bulk_save_jobs(conn, slug, jobs)
                        print(f"  Jobs: {len(jobs)} active ({new_jobs} new)")

            except Exception as e:
                print(f"  Voyager failed: {e}")

        # Respectful delay — LinkedIn watches timing
        delay = random.uniform(5, 12)
        print(f"  Waiting {delay:.1f}s...")
        time.sleep(delay)

    # Print summary
    for slug in TARGET_COMPANIES:
        trend = get_employee_trend(db_path, slug)
        velocity = get_job_velocity(db_path, slug)
        if trend:
            latest = trend[-1]
            print(f"  {slug}: {latest['staff_exact'] or latest['staff_min']} employees, "
                  f"{velocity['new_jobs_last_n_days']} new jobs (30d)")

    conn.close()


if __name__ == "__main__":
    # Set your auth cookies from a logged-in LinkedIn session
    LI_AT = "YOUR_LI_AT_COOKIE"
    JSESSIONID = "YOUR_JSESSIONID"
    PROXY = "http://USER:[email protected]:9000"

    run_company_pipeline("linkedin_companies.db", LI_AT, JSESSIONID, PROXY)

Practical Tips

Start with the guest API. JSON-LD data from public pages gets you company basics without authentication. Only use Voyager when you need deeper data.

Rotate accounts. If you're using authenticated Voyager requests, spread the load across multiple accounts. One account doing hundreds of API calls per day will get restricted within 24-48 hours.
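One way to spread the load is a simple round-robin over cookie pairs. The cookie values below are placeholders — in practice each account's cookies come from a real logged-in session, ideally pinned to its own proxy:

```python
import itertools

# Placeholder cookie pairs — substitute real li_at / JSESSIONID values.
ACCOUNTS = [
    {"li_at": "COOKIE_A", "jsessionid": "ajax:111"},
    {"li_at": "COOKIE_B", "jsessionid": "ajax:222"},
    {"li_at": "COOKIE_C", "jsessionid": "ajax:333"},
]

_account_cycle = itertools.cycle(ACCOUNTS)


def next_account() -> dict:
    """Rotate accounts round-robin so no single cookie carries all traffic."""
    return next(_account_cycle)
```

Each Voyager call then pulls its credentials with `acct = next_account()` and passes `acct["li_at"]` and `acct["jsessionid"]`.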

Use curl_cffi, not requests. Standard Python HTTP libraries get TLS-fingerprinted and blocked by LinkedIn's edge layer. curl_cffi with impersonate="chrome124" mimics a real Chrome browser at the TLS level and passes LinkedIn's fingerprint checks.

Respect LinkedIn's ToS. LinkedIn actively litigates against scrapers. The hiQ Labs v. LinkedIn rulings suggested that scraping publicly accessible data doesn't violate the CFAA in the US, but hiQ ultimately lost on breach-of-contract grounds, and LinkedIn can still enforce its ToS through cease-and-desist letters and IP blocks. Know the legal landscape before scraping at scale.

Space your requests. 3-5 seconds between requests minimum for Voyager. LinkedIn's behavioral analysis flags regular, machine-like timing patterns. Add random jitter to your delays. For guest API requests, 2-3 seconds is usually sufficient.

Cache everything. Store raw responses before parsing. When LinkedIn changes its JSON structure (it does, regularly), you'll want to re-parse from cached data rather than re-scrape.

LinkedIn scraping is a cat-and-mouse game. The combination of guest API access for public data, Voyager for authenticated data, TLS fingerprint spoofing, and residential proxies from ThorData covers the fundamentals for 2026.