LinkedIn Profile Data Without the API: Using Meta Tags and JSON-LD
If you have ever tried to get profile data from LinkedIn programmatically, you know the frustration. The official API is locked behind a partner program that rejects almost everyone, and the data you can access through it is extremely limited anyway.
But LinkedIn public profiles still render HTML. And that HTML contains structured data -- Open Graph meta tags and JSON-LD schema markup -- that gives you a surprising amount of information without needing any API key at all.
The LinkedIn API Problem
LinkedIn shut down most of its public API access years ago. What remains -- the Marketing and Compliance APIs -- requires:
- A LinkedIn developer application approved through their partner program
- A verified company page with a legitimate business use case
- A review process that takes weeks and rejects most independent developers
- OAuth tokens that only grant access to the authenticated user's own profile in most cases
If you are building a recruiting tool backed by a funded company, you might get approved. If you are an independent developer who wants to pull public profile info for a side project, a research tool, or a data pipeline -- you are out of luck through official channels.
This is where the public-facing HTML becomes useful.
What Public Profiles Expose
When you load a LinkedIn profile in a browser, the page source contains Open Graph meta tags designed for link previews. These tags are present in the initial HTML response, no JavaScript rendering required:
```html
<meta property="og:title" content="John Smith - Senior Developer at Acme Corp">
<meta property="og:description" content="Experience: Senior Developer at Acme Corp...">
<meta property="og:image" content="https://media.licdn.com/dms/image/...">
<meta property="og:url" content="https://www.linkedin.com/in/johnsmith">
<meta property="og:type" content="profile">
<meta property="profile:first_name" content="John">
<meta property="profile:last_name" content="Smith">
```
Beyond the OG tags, many profiles also include a JSON-LD block with `@type: Person` schema that contains structured data about the person, their current job title, employer, and sometimes their location and education history.
This is not hidden data. It is the same information LinkedIn serves to Google's crawler, to Facebook and Twitter for link previews, and to any HTTP client that requests the page. The profile owner chose to make it public.
Data Fields You Can Extract
Here is a complete breakdown of what is realistically available from public LinkedIn profiles via meta tags and embedded structured data:
| Field | Source | Reliability |
|---|---|---|
| Full name | og:title, profile:first_name/last_name | High |
| Current job title | og:title (parsed) | Medium |
| Current employer | og:title, JSON-LD worksFor | Medium |
| Profile photo URL | og:image | High |
| Profile URL | og:url | High |
| Experience summary | og:description (truncated) | Low |
| Location | JSON-LD address | Low (not always present) |
| Education | JSON-LD alumniOf | Low (sometimes present) |
| Skills | Not available via meta tags | N/A |
| Full work history | Not available via meta tags | N/A |
| Connection count | Not available | N/A |
| Contact info | Not available | N/A |
The truncated summary in og:description typically reads like: "Experience: Senior Developer at Acme Corp. Education: MIT." That is not the full resume -- it is a preview snippet.
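That preview snippet is still machine-parseable. A small sketch (the helper name is mine, and it assumes the "Experience: ... Education: ..." layout, which LinkedIn does not guarantee) splits it into labeled fields:

```python
import re

def parse_description_snippet(description: str) -> dict:
    """Split the og:description preview into labeled sections.

    Assumes the "Experience: ... Education: ..." layout shown above;
    treat missing keys as normal, since the format varies.
    """
    result = {}
    # Capture text after each label, up to the next label or end of string
    for label, key in (("Experience", "experience"), ("Education", "education")):
        match = re.search(rf"{label}:\s*(.+?)(?=(?:Experience|Education):|$)", description)
        if match:
            result[key] = match.group(1).strip().rstrip(".")
    return result
```

Anything the regex does not recognize simply comes back as an empty dict, which is the right failure mode for a low-reliability field.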
Fetching Profile Data with Python
Here is a working example using httpx and BeautifulSoup:
```python
# linkedin_profile.py
import json
import random
import time

import httpx
from bs4 import BeautifulSoup

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def fetch_linkedin_profile(profile_url: str, proxy: str | None = None) -> dict:
    """Fetch public LinkedIn profile data from meta tags and JSON-LD."""
    ua = random.choice(USER_AGENTS)
    headers = {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }
    client_kwargs = {
        "headers": headers,
        "follow_redirects": True,
        "timeout": 15,
        "http2": True,  # HTTP/2 is more browser-like; requires `pip install httpx[http2]`
    }
    if proxy:
        # httpx >= 0.26 takes a single `proxy` URL; older versions
        # used proxies={"all://": proxy}
        client_kwargs["proxy"] = proxy

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(profile_url)

    if resp.status_code == 999:
        raise Exception("LinkedIn returned 999 -- bot detection triggered")
    if resp.status_code == 429:
        raise Exception("Rate limited (HTTP 429)")
    if resp.status_code != 200:
        raise Exception(f"HTTP {resp.status_code}")
    # Check for authwall redirect
    if "authwall" in str(resp.url) or "login" in str(resp.url):
        raise Exception("Redirected to auth wall -- profile may be private or IP flagged")

    soup = BeautifulSoup(resp.text, "html.parser")
    profile = {}

    # Extract Open Graph meta tags
    og_mappings = {
        "og:title": "title",
        "og:description": "description",
        "og:image": "image_url",
        "og:url": "profile_url",
        "profile:first_name": "first_name",
        "profile:last_name": "last_name",
    }
    for prop, key in og_mappings.items():
        tag = soup.find("meta", property=prop)
        if tag and tag.get("content"):
            profile[key] = tag["content"]

    # Also check name-attribute meta tags
    name_mappings = {
        "description": "meta_description",
        "twitter:title": "twitter_title",
        "twitter:description": "twitter_description",
    }
    for name_attr, key in name_mappings.items():
        tag = soup.find("meta", attrs={"name": name_attr})
        if tag and tag.get("content"):
            profile[key] = tag["content"]

    # Extract JSON-LD structured data
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string)
        except (json.JSONDecodeError, TypeError):
            continue
        if isinstance(data, list):
            for item in data:
                if isinstance(item, dict) and item.get("@type") == "Person":
                    profile["json_ld"] = item
                    break
        elif isinstance(data, dict) and data.get("@type") == "Person":
            profile["json_ld"] = data
            break

    return profile


def parse_title_components(title: str) -> dict:
    """Parse 'Name - Title at Company' format from og:title."""
    if not title:
        return {}
    result = {}
    # LinkedIn titles follow "Full Name - Job Title at Company"
    if " - " in title:
        parts = title.split(" - ", 1)
        result["parsed_name"] = parts[0].strip()
        if " at " in parts[1]:
            title_company = parts[1].split(" at ", 1)
            result["parsed_title"] = title_company[0].strip()
            result["parsed_company"] = title_company[1].strip()
        else:
            result["parsed_title"] = parts[1].strip()
    else:
        result["parsed_name"] = title.strip()
    return result


if __name__ == "__main__":
    url = "https://www.linkedin.com/in/williamhgates"
    result = fetch_linkedin_profile(url)
    for k, v in result.items():
        if k != "json_ld":
            print(f"{k}: {v}")
    if "json_ld" in result:
        print("\nJSON-LD data:")
        print(json.dumps(result["json_ld"], indent=2))
    components = parse_title_components(result.get("title", ""))
    if components:
        print("\nParsed title components:")
        for k, v in components.items():
            print(f"  {k}: {v}")
What You Get Back
From a typical public profile, this extracts:
- Name -- full name, first and last separately via `profile:first_name` and `profile:last_name`
- Headline -- the `og:title` typically contains "Name - Title at Company"
- Summary -- `og:description` includes a truncated version of their experience
- Profile photo URL -- a CDN link to their profile picture (typically valid for 24-48 hours)
- Structured job data -- from JSON-LD when available, including employer name and job title as schema.org objects
What you will not get: full work history, skills list, connection count, or contact info. That data requires JavaScript rendering and authenticated access.
Parsing JSON-LD Structured Data
When a profile includes a JSON-LD block, it follows the schema.org Person format:
```json
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "John Smith",
  "jobTitle": "Senior Developer",
  "worksFor": {
    "@type": "Organization",
    "name": "Acme Corp"
  },
  "url": "https://www.linkedin.com/in/johnsmith",
  "image": "https://media.licdn.com/dms/image/...",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "San Francisco",
    "addressRegion": "CA",
    "addressCountry": "US"
  },
  "alumniOf": [
    {
      "@type": "Organization",
      "name": "MIT"
    }
  ]
}
```
This is cleaner to parse than scraping visible HTML elements, and it is less likely to break when LinkedIn redesigns their frontend. The schema format is standardized and LinkedIn maintains it for SEO purposes.
Not every profile has this block. In testing as of 2026, roughly 60-70% of public profiles include it. When it is present, it is the most reliable data source on the page.
```python
def extract_json_ld_data(json_ld: dict) -> dict:
    """Parse the JSON-LD Person schema into a flat structure."""
    if not json_ld or json_ld.get("@type") != "Person":
        return {}
    result = {
        "name": json_ld.get("name"),
        "job_title": json_ld.get("jobTitle"),
        "profile_url": json_ld.get("url"),
        "image_url": json_ld.get("image"),
    }
    # Current employer
    works_for = json_ld.get("worksFor")
    if isinstance(works_for, dict):
        result["employer"] = works_for.get("name")
    elif isinstance(works_for, list) and works_for:
        result["employer"] = works_for[0].get("name")
    # Location
    address = json_ld.get("address")
    if isinstance(address, dict):
        parts = [
            address.get("addressLocality"),
            address.get("addressRegion"),
            address.get("addressCountry"),
        ]
        result["location"] = ", ".join(p for p in parts if p)
    # Education (alumni)
    alumni_of = json_ld.get("alumniOf", [])
    if isinstance(alumni_of, dict):
        alumni_of = [alumni_of]
    result["education"] = [
        org.get("name") for org in alumni_of
        if isinstance(org, dict) and org.get("name")
    ]
    return result
```
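One pattern in that function is worth isolating: several schema.org fields (`worksFor`, `alumniOf`) may arrive as either a single object or an array. A generic normalizer (helper names are mine, not from the article) keeps that handling in one place:

```python
def as_list(value) -> list:
    """Normalize a schema.org field that may be an object, a list, or absent."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

def org_names(value) -> list:
    """Extract organization names from a worksFor/alumniOf-style field."""
    return [
        org["name"] for org in as_list(value)
        if isinstance(org, dict) and org.get("name")
    ]
```

With these, both the employer and education branches collapse to a single call, which matters as LinkedIn occasionally switches a field between object and array form.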
Batch Profile Fetching with Rate Limiting
For collecting multiple profiles, you need careful rate control. LinkedIn's bot detection is cumulative -- 10 requests in 5 minutes from one IP is much safer than 10 requests in 10 seconds.
```python
from datetime import datetime


def fetch_profiles_batch(
    profile_urls: list,
    proxy: str | None = None,
    min_delay: float = 8.0,
    max_delay: float = 20.0,
) -> list:
    """Fetch multiple LinkedIn profiles with rate limiting.

    Args:
        profile_urls: List of LinkedIn profile URLs
        proxy: Optional proxy URL (residential proxies required at scale)
        min_delay: Minimum seconds between requests
        max_delay: Maximum seconds between requests

    Returns:
        List of dicts with profile data and error info
    """
    results = []
    for i, url in enumerate(profile_urls):
        print(f"[{datetime.now().strftime('%H:%M:%S')}] {i+1}/{len(profile_urls)}: {url}")
        try:
            profile = fetch_linkedin_profile(url, proxy=proxy)
            # Parse title components
            components = parse_title_components(profile.get("title", ""))
            profile.update(components)
            profile["url"] = url
            results.append(profile)
            print(f"  OK: {profile.get('title', 'no title')[:60]}")
        except Exception as e:
            error_str = str(e)
            print(f"  Error: {error_str}")
            results.append({"url": url, "error": error_str})
            # Extended backoff on bot detection
            if "999" in error_str or "bot" in error_str.lower():
                extra_wait = random.uniform(60, 120)
                print(f"  Bot detected -- backing off {extra_wait:.0f}s")
                time.sleep(extra_wait)
                continue
        # Random delay between requests
        if i < len(profile_urls) - 1:
            delay = random.uniform(min_delay, max_delay)
            # Occasionally add a longer pause to seem more human
            if random.random() < 0.1:
                delay += random.uniform(30, 60)
            time.sleep(delay)
    return results
```
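Since the detection is cumulative, it can also help to enforce a rolling hourly cap on top of the per-request delays above. A minimal sliding-window budget (the class name and the 20/hour default are my assumptions, echoing the limits discussed in the bot-detection section):

```python
import time
from collections import deque

class HourlyBudget:
    """Block when more than max_per_hour requests fall within the last hour."""

    def __init__(self, max_per_hour: int = 20):
        self.max_per_hour = max_per_hour
        self.timestamps = deque()

    def wait_if_needed(self):
        now = time.monotonic()
        # Drop timestamps older than one hour from the window
        while self.timestamps and now - self.timestamps[0] > 3600:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_per_hour:
            # Sleep until the oldest request ages out of the window
            time.sleep(3600 - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())
```

Calling `wait_if_needed()` before each `fetch_linkedin_profile()` guarantees the hourly ceiling regardless of how the random delays happen to land.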
Storage Schema
```python
import sqlite3


def init_db(db_path: str = "linkedin_profiles.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS profiles (
            url TEXT PRIMARY KEY,
            first_name TEXT,
            last_name TEXT,
            full_name TEXT,
            job_title TEXT,
            employer TEXT,
            location TEXT,
            image_url TEXT,
            description TEXT,
            education TEXT,
            raw_title TEXT,
            has_json_ld INTEGER DEFAULT 0,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
        CREATE INDEX IF NOT EXISTS idx_employer ON profiles(employer);
        CREATE INDEX IF NOT EXISTS idx_job_title ON profiles(job_title);
        CREATE INDEX IF NOT EXISTS idx_scraped ON profiles(scraped_at);
    """)
    conn.commit()
    return conn


def save_profile(conn: sqlite3.Connection, profile: dict):
    conn.execute(
        """INSERT OR REPLACE INTO profiles
           (url, first_name, last_name, full_name, job_title, employer,
            location, image_url, description, education, raw_title, has_json_ld)
           VALUES (?,?,?,?,?,?,?,?,?,?,?,?)""",
        (
            profile.get("url") or profile.get("profile_url"),
            profile.get("first_name"),
            profile.get("last_name"),
            profile.get("name") or profile.get("parsed_name"),
            profile.get("job_title") or profile.get("parsed_title"),
            profile.get("employer") or profile.get("parsed_company"),
            profile.get("location"),
            profile.get("image_url"),
            profile.get("description"),
            json.dumps(profile.get("education", [])),
            profile.get("title"),
            # job_title/employer keys only exist after JSON-LD flattening
            int("job_title" in profile or "employer" in profile),
        ),
    )
    conn.commit()
```
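With profiles stored, the employer index makes aggregations cheap. A self-contained sketch against a trimmed-down copy of the table (the three-column schema and sample rows here are illustrative, not the full schema above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Minimal stand-in for the profiles table, for illustration only
conn.execute("CREATE TABLE profiles (url TEXT PRIMARY KEY, employer TEXT, job_title TEXT)")
conn.executemany(
    "INSERT INTO profiles VALUES (?,?,?)",
    [
        ("https://www.linkedin.com/in/a", "Acme Corp", "Engineer"),
        ("https://www.linkedin.com/in/b", "Acme Corp", "Designer"),
        ("https://www.linkedin.com/in/c", "Globex", "Engineer"),
    ],
)

# Top employers by profile count, served by the idx_employer index in the real schema
rows = conn.execute(
    "SELECT employer, COUNT(*) AS n FROM profiles GROUP BY employer ORDER BY n DESC"
).fetchall()
# rows == [("Acme Corp", 2), ("Globex", 1)]
```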
Bot Detection and Rate Limits
LinkedIn is aggressive about blocking automated access. Here is what you will run into:
HTTP 999 -- LinkedIn's custom status code for "we think you are a bot." You will see this after just a handful of requests from a datacenter IP. On residential IPs, you can typically get 10-20 requests per hour before hitting it.
Authwall redirects -- some profiles redirect to a login page even when set to public. This varies by the requester's IP reputation and geolocation. European IPs seem to trigger this more often, possibly due to GDPR-related gating policies.
Rate limiting -- even with residential IPs, more than 20-30 requests per hour from the same IP will likely trigger blocks.
TLS fingerprinting -- LinkedIn inspects TLS handshake signatures. Standard Python HTTP libraries have recognizable fingerprints. Using httpx with http2=True presents a more browser-like handshake.
Cookie requirements -- LinkedIn's newer bot detection checks for session cookies that a real browser would have accumulated from previous visits.
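These failure modes all respond well to a generic retry wrapper with exponential backoff. This sketch is my own addition, not part of the article's pipeline; the 30-second base mirrors the batch fetcher's backoff range:

```python
import random
import time

def retry_with_backoff(func, attempts: int = 3, base_delay: float = 30.0):
    """Call func(), retrying on any exception with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries -- surface the last error
            # base, 2x base, 4x base... plus jitter proportional to the base
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Wrapping the fetch as `retry_with_backoff(lambda: fetch_linkedin_profile(url))` handles transient 429s; a 999 is usually worth treating as terminal for that IP rather than retrying.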
Proxy Configuration
For anything beyond a few profiles, you need proxy rotation. Residential proxies are essential here -- datacenter IPs get blocked almost immediately.
ThorData's rotating residential proxies work well for this use case. Their pool includes IPs from ISPs that LinkedIn does not flag as aggressively as typical proxy network ranges. The per-GB pricing model makes sense when you are fetching individual profile pages rather than bulk downloading.
```python
# ThorData proxy configuration for LinkedIn
PROXY_ROTATING = "http://USERNAME:[email protected]:9000"

# US geo-targeting (LinkedIn serves different content by region)
PROXY_US = "http://USERNAME-country-us:[email protected]:9000"

with httpx.Client(
    proxy=PROXY_ROTATING,  # httpx >= 0.26; older versions: proxies={"all://": PROXY_ROTATING}
    http2=True,
    timeout=15,
) as client:
    resp = client.get(
        "https://www.linkedin.com/in/target-profile",
        headers={"User-Agent": random.choice(USER_AGENTS)},
    )
```
Tip: Add random delays of 8-20 seconds between requests. LinkedIn's detection is partly timing-based, and uniform intervals are a strong bot signal.
Advanced: Playwright for JavaScript-Rendered Data
For profiles that require JavaScript rendering, switch to Playwright:
```python
import asyncio

from playwright.async_api import async_playwright


async def fetch_linkedin_playwright(
    profile_url: str,
    proxy: dict | None = None,
) -> dict:
    """Fetch LinkedIn profile with full browser rendering."""
    async with async_playwright() as p:
        launch_kwargs = {
            "headless": True,
            "args": [
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
            ],
        }
        if proxy:
            launch_kwargs["proxy"] = proxy
        browser = await p.chromium.launch(**launch_kwargs)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            ),
            locale="en-US",
        )
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
        """)
        page = await context.new_page()
        try:
            await page.goto(profile_url, wait_until="domcontentloaded", timeout=25000)
            await asyncio.sleep(random.uniform(2, 4))
            # Check for authwall
            if "authwall" in page.url or "login" in page.url:
                await browser.close()
                return {"error": "authwall", "url": profile_url}
            html = await page.content()
            soup = BeautifulSoup(html, "html.parser")
            profile = {}
            # Extract OG tags
            for prop, key in {
                "og:title": "title",
                "og:description": "description",
                "og:image": "image_url",
                "profile:first_name": "first_name",
                "profile:last_name": "last_name",
            }.items():
                tag = soup.find("meta", property=prop)
                if tag and tag.get("content"):
                    profile[key] = tag["content"]
            # Extract JSON-LD
            for script in soup.find_all("script", type="application/ld+json"):
                try:
                    data = json.loads(script.string)
                except (json.JSONDecodeError, TypeError):
                    continue
                if isinstance(data, dict) and data.get("@type") == "Person":
                    profile.update(extract_json_ld_data(data))
                    break
            profile["url"] = profile_url
        except Exception as e:
            profile = {"url": profile_url, "error": str(e)}
        await browser.close()
        return profile
```
Complete Example Pipeline
```python
def run_linkedin_pipeline(
    profile_urls: list,
    proxy: str | None = None,
    db_path: str = "linkedin_profiles.db",
):
    """Full pipeline: fetch profiles, parse data, store in SQLite."""
    conn = init_db(db_path)
    print(f"Processing {len(profile_urls)} profiles")
    results = fetch_profiles_batch(profile_urls, proxy=proxy)
    saved = 0
    errors = 0
    for result in results:
        if "error" in result:
            errors += 1
        else:
            save_profile(conn, result)
            saved += 1
    conn.close()
    print(f"\nDone: {saved} saved, {errors} errors")
    return results


# Usage
PROXY = "http://USER:[email protected]:9000"
urls = [
    "https://www.linkedin.com/in/williamhgates",
    "https://www.linkedin.com/in/jeffweiner08",
    "https://www.linkedin.com/in/reidhoffman",
]
run_linkedin_pipeline(urls, proxy=PROXY)
Understanding What You Actually Get in 2026
LinkedIn has progressively reduced the information in their public-facing structured data. In 2022, the JSON-LD blocks contained substantial work history. Today, most profiles show only current employer and title.
What still works reliably:
- Name extraction from og:title and profile:first_name/last_name
- Current job title and employer from og:title
- Profile photo URL from og:image
- Confirmation that a profile exists
What is increasingly unreliable:
- Location data (often absent from JSON-LD)
- Education history (stripped from most profiles)
- Full description/summary (truncated severely)
For comprehensive profile data, your realistic options are:
1. Authenticated access with Playwright (most data, highest risk)
2. Official API partner program (limited data, legitimate)
3. Managed scrapers like Apify that maintain their own infrastructure
Legal and Ethical Considerations
The legal situation here is worth being direct about:
- The hiQ v. LinkedIn case (2022) established that scraping publicly available data is not a violation of the CFAA. This is a significant precedent but not a blanket permission.
- LinkedIn's Terms of Service explicitly prohibit scraping. Violating ToS is a civil matter, not criminal, but LinkedIn has sent cease-and-desist letters and pursued litigation.
- Under GDPR (if you target EU users), collecting personal data requires a legitimate interest basis and compliance with data subject rights.
- The data in meta tags and JSON-LD is intentionally made public by both LinkedIn (for SEO) and the profile owner (who chose a public profile setting).
Be responsible: Do not build tools that enable harassment, spam, or mass surveillance. Do not scrape private profiles. Do not store data longer than necessary.
Key Takeaways
- LinkedIn public profiles expose name, current role, employer, and photo via Open Graph meta tags -- no API key required
- JSON-LD `@type: Person` blocks appear on 60-70% of profiles and provide cleaner structured data than HTML parsing
- HTTP 999 is LinkedIn's bot detection code -- expect it quickly on datacenter IPs and after more than 20 requests/hour on residential IPs
- Use 8-20 second random delays between requests and rotate User-Agents
- Residential proxies are non-negotiable at scale; ThorData's rotating residential proxies work well for LinkedIn's IP reputation checks
- The data available via meta tags in 2026 is more limited than previous years -- current role and name are reliable, full work history requires browser automation and authentication
Handling Authwall and Private Profile Detection
One common pain point is profiles that appear public but actually redirect to a login wall for certain IP addresses. Here is a robust detection and fallback system:
```python
import re


def is_linkedin_authwall(html: str, url: str) -> bool:
    """Detect various forms of LinkedIn authentication walls."""
    authwall_signals = [
        "authwall" in url.lower(),
        "login" in url.lower() and "linkedin.com" in url.lower(),
        "join-linkedin" in html.lower(),
        "sign in" in html.lower() and "to see" in html.lower(),
        "uas/login" in url.lower(),
        '<meta name="robots" content="noindex' in html,
    ]
    return any(authwall_signals)


def check_profile_accessibility(html: str, url: str) -> tuple:
    """Check what level of access we got for a profile.

    Returns (accessible: bool, reason: str)
    """
    if is_linkedin_authwall(html, url):
        return False, "authwall"
    # Check if we got meaningful profile data
    has_og_title = 'property="og:title"' in html
    has_profile_meta = 'property="profile:first_name"' in html
    if not has_og_title:
        return False, "no_og_tags"
    # Check for GDPR consent walls common in EU
    if "consent" in html.lower() and "gdpr" in html.lower():
        return False, "gdpr_consent_wall"
    return True, "ok"


def fetch_with_fallback(
    profile_url: str,
    primary_proxy: str | None = None,
    fallback_proxy: str | None = None,
) -> dict:
    """Fetch a LinkedIn profile with automatic fallback on auth walls.

    Tries primary proxy first, then fallback, then direct.
    """
    attempts = [
        ("primary", primary_proxy),
        ("fallback", fallback_proxy),
        ("direct", None),
    ]
    for attempt_name, proxy in attempts:
        if proxy is None and attempt_name != "direct":
            continue
        try:
            return fetch_linkedin_profile(profile_url, proxy=proxy)
        except Exception as e:
            print(f"  {attempt_name} attempt failed: {e}")
            time.sleep(random.uniform(5, 10))
    return {"url": profile_url, "error": "all_attempts_failed"}
```
Enriching Profiles with Company Data
Once you have employer names from LinkedIn profiles, you can enrich them with additional company data from other sources:
```python
import httpx
import json


def enrich_with_company_data(profiles: list) -> list:
    """Add company domain, logo, and normalized name to profiles
    by looking up employer names via public sources.
    """
    enriched = []
    for profile in profiles:
        employer = profile.get("employer") or profile.get("parsed_company")
        if not employer:
            enriched.append(profile)
            continue
        # Try Clearbit's free company enrichment
        try:
            resp = httpx.get(
                "https://autocomplete.clearbit.com/v1/companies/suggest",
                params={"query": employer},
                timeout=10,
            )
            if resp.status_code == 200:
                companies = resp.json()
                if companies:
                    company_info = companies[0]
                    profile["company_domain"] = company_info.get("domain")
                    profile["company_logo"] = company_info.get("logo")
                    profile["company_name_normalized"] = company_info.get("name")
        except Exception:
            pass
        enriched.append(profile)
        time.sleep(0.5)
    return enriched


def deduplicate_profiles(profiles: list) -> list:
    """Remove duplicate profiles based on LinkedIn URL normalization."""
    seen_ids = set()
    unique = []
    for profile in profiles:
        url = profile.get("url") or profile.get("profile_url", "")
        # Normalize: extract the profile ID from the URL
        match = re.search(r"/in/([a-zA-Z0-9_-]+)", url)
        if match:
            profile_id = match.group(1).lower()
            if profile_id not in seen_ids:
                seen_ids.add(profile_id)
                unique.append(profile)
        else:
            unique.append(profile)
    return unique
```
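The regex in `deduplicate_profiles` does the real work. Pulled out as a standalone helper (my naming, same pattern), it also shows why normalization matters: the same profile can appear with different casing, query strings, or trailing slashes:

```python
import re

def normalize_profile_url(url: str):
    """Reduce a LinkedIn profile URL variant to a canonical lowercase ID.

    Returns None when the URL is not an /in/ profile link.
    """
    match = re.search(r"/in/([a-zA-Z0-9_-]+)", url)
    return match.group(1).lower() if match else None
```

Keying the seen-set on this ID rather than the raw URL is what prevents `/in/JohnSmith/` and `/in/johnsmith?trk=public` from being stored twice.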
Rate Analysis and Throughput Planning
Before starting a batch job, estimate how long it will take and how many IPs you will need:
```python
def estimate_scraping_time(
    num_profiles: int,
    min_delay: float = 8.0,
    max_delay: float = 20.0,
    success_rate: float = 0.75,
) -> dict:
    """Estimate time and data volume for batch LinkedIn scraping.

    Args:
        num_profiles: Total profiles to collect
        min_delay: Minimum delay between requests in seconds
        max_delay: Maximum delay between requests in seconds
        success_rate: Expected fraction of profiles that return data
            (others are authwalls, 999s, or errors)
    """
    avg_delay = (min_delay + max_delay) / 2
    attempts_needed = int(num_profiles / success_rate)
    total_seconds = attempts_needed * avg_delay
    total_hours = total_seconds / 3600
    # Estimate data volume (average LinkedIn profile page is ~80-120KB compressed)
    avg_page_size_kb = 100
    total_mb = (attempts_needed * avg_page_size_kb) / 1024
    return {
        "profiles_target": num_profiles,
        "attempts_needed": attempts_needed,
        "expected_success_rate": f"{success_rate*100:.0f}%",
        "avg_delay_seconds": avg_delay,
        "total_time_hours": round(total_hours, 1),
        "estimated_data_mb": round(total_mb, 1),
        "note": "Single-threaded. Add multiple IPs to parallelize.",
    }


# Example planning
plan = estimate_scraping_time(1000)
for k, v in plan.items():
    print(f"  {k}: {v}")

# Output example:
#   profiles_target: 1000
#   attempts_needed: 1333
#   expected_success_rate: 75%
#   avg_delay_seconds: 14.0
#   total_time_hours: 5.2
#   estimated_data_mb: 130.2
```
Comparing Output Quality Across Methods
Here is a realistic comparison of what each approach yields for a typical professional LinkedIn profile in 2026:
| Approach | Name | Title | Employer | Location | Education | Photo |
|---|---|---|---|---|---|---|
| Meta tags only | Yes | Parsed | Parsed | Rare | Rare | Yes |
| Meta + JSON-LD | Yes | Clean | Clean | Sometimes | Sometimes | Yes |
| Playwright (unauth) | Yes | Full | Full | Usually | Sometimes | Yes |
| Playwright (auth) | Yes | Full | Full | Yes | Yes | Yes |
| Official API | Yes | Full | Full | Yes | Yes | Yes |
For a name-and-employer validation pipeline (e.g., "does this person work where they claim?"), the meta tags approach is sufficient and has the lowest risk profile. For full profile enrichment, authenticated access or the official API is necessary.
Monitoring for Schema Changes
LinkedIn updates their page structure periodically. Build in change detection:
```python
def validate_profile_extraction(profile: dict) -> tuple:
    """Validate that a profile extraction got usable data.

    Returns (valid: bool, issues: list)
    """
    issues = []
    if not profile.get("title") and not profile.get("first_name"):
        issues.append("no_name_data")
    if not profile.get("image_url"):
        issues.append("no_profile_image")
    title = profile.get("title", "")
    if title and " - " not in title:
        issues.append("title_format_changed")
    # Check for LinkedIn's known error pages
    if profile.get("title") in ["LinkedIn", "Log in or sign up to view"]:
        issues.append("auth_wall_content")
    return len(issues) == 0, issues


def log_extraction_quality(results: list, log_path: str = "extraction_log.json"):
    """Log extraction quality metrics for monitoring schema changes."""
    metrics = {
        "total": len(results),
        "errors": sum(1 for r in results if "error" in r),
        "missing_name": sum(
            1 for r in results if not r.get("first_name") and not r.get("title")
        ),
        "has_json_ld": sum(1 for r in results if r.get("job_title") or r.get("employer")),
        "has_image": sum(1 for r in results if r.get("image_url")),
    }
    metrics["success_rate"] = (
        round((metrics["total"] - metrics["errors"]) / metrics["total"] * 100, 1)
        if metrics["total"]
        else 0
    )
    with open(log_path, "w") as f:
        json.dump(metrics, f, indent=2)
    return metrics
```
If `has_json_ld` drops significantly from a baseline, LinkedIn may have changed their schema. If `title_format_changed` starts appearing, the "Name - Title at Company" parsing logic needs updating.
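That baseline comparison can itself be automated. A minimal drift check over the metrics dict that `log_extraction_quality` produces (the function name and the 20% tolerance are my choices, not from the article):

```python
def detect_schema_drift(current: dict, baseline: dict, tolerance: float = 0.2) -> list:
    """Flag metrics whose rate dropped more than `tolerance` below baseline.

    Expects dicts shaped like log_extraction_quality's output: raw counts
    plus a "total" key. Rates are compared rather than raw counts so the
    check works across batches of different sizes.
    """
    alerts = []
    for key in ("has_json_ld", "has_image"):
        if not baseline.get("total") or not current.get("total"):
            continue
        base_rate = baseline.get(key, 0) / baseline["total"]
        cur_rate = current.get(key, 0) / current["total"]
        if base_rate > 0 and (base_rate - cur_rate) / base_rate > tolerance:
            alerts.append(key)
    return alerts
```

Run it after each batch and alert (or halt the pipeline) when it returns anything, so a silent LinkedIn schema change does not quietly fill the database with empty rows.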
Summary: When to Use This Approach
The meta tags and JSON-LD method is the right choice when:
- You need name and current employer confirmation for a list of profiles
- You are working with small-to-medium volumes (under 500 profiles per day)
- You want minimal infrastructure -- no browser automation, just httpx
- Legal risk tolerance is conservative -- this is the least invasive approach
Switch to authenticated Playwright scraping when:
- You need location, full work history, or education
- You are doing large-scale profile collection
- You accept the higher risk of ToS enforcement
Use the official API when:
- You need a stable, long-term data pipeline
- You have a legitimate business use case that justifies the partner application
- You need data beyond what public profiles expose