
LinkedIn Profile Data Without the API: Using Meta Tags and JSON-LD

If you have ever tried to get profile data from LinkedIn programmatically, you know the frustration. The official API is locked behind a partner program that rejects almost everyone, and the data you can access through it is extremely limited anyway.

But LinkedIn public profiles still render HTML. And that HTML contains structured data -- Open Graph meta tags and JSON-LD schema markup -- that gives you a surprising amount of information without needing any API key at all.

The LinkedIn API Problem

LinkedIn shut down most of its public API access years ago. What remains are the Marketing and Compliance APIs, both gated behind a partner program with a formal application and review process.

If you are building a recruiting tool backed by a funded company, you might get approved. If you are an independent developer who wants to pull public profile info for a side project, a research tool, or a data pipeline -- you are out of luck through official channels.

This is where the public-facing HTML becomes useful.

What Public Profiles Expose

When you load a LinkedIn profile in a browser, the page source contains Open Graph meta tags designed for link previews. These tags are present in the initial HTML response, no JavaScript rendering required:

<meta property="og:title" content="John Smith - Senior Developer at Acme Corp">
<meta property="og:description" content="Experience: Senior Developer at Acme Corp...">
<meta property="og:image" content="https://media.licdn.com/dms/image/...">
<meta property="og:url" content="https://www.linkedin.com/in/johnsmith">
<meta property="og:type" content="profile">
<meta property="profile:first_name" content="John">
<meta property="profile:last_name" content="Smith">

Beyond the OG tags, many profiles also include a JSON-LD block with @type: Person schema that contains structured data about the person, their current job title, employer, and sometimes their location and education history.

This is not hidden data. It is the same information LinkedIn serves to Google's crawler, to Facebook and Twitter for link previews, and to any HTTP client that requests the page. The profile owner chose to make it public.

Data Fields You Can Extract

Here is a complete breakdown of what is realistically available from public LinkedIn profiles via meta tags and embedded structured data:

| Field | Source | Reliability |
| --- | --- | --- |
| Full name | og:title, profile:first_name/last_name | High |
| Current job title | og:title (parsed) | Medium |
| Current employer | og:title, JSON-LD worksFor | Medium |
| Profile photo URL | og:image | High |
| Profile URL | og:url | High |
| Experience summary | og:description (truncated) | Low |
| Location | JSON-LD address | Low (not always present) |
| Education | JSON-LD alumniOf | Low (sometimes present) |
| Skills | Not available via meta tags | N/A |
| Full work history | Not available via meta tags | N/A |
| Connection count | Not available | N/A |
| Contact info | Not available | N/A |

The truncated summary in og:description typically reads like: "Experience: Senior Developer at Acme Corp. Education: MIT." That is not the full resume -- it is a preview snippet.
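Because that snippet follows a loose "Experience: ... Education: ..." convention, it can be split into labeled parts. The layout is observed behavior rather than a documented format, so treat this as a best-effort sketch:

```python
import re

def parse_og_description(description: str) -> dict:
    """Split LinkedIn's og:description preview into labeled parts.

    The "Experience: ... Education: ..." layout is an observed
    convention, not a documented format -- results are best-effort.
    """
    result = {}
    # Capture text after each label, up to the next label or end of string
    for label, key in (("Experience", "experience"), ("Education", "education")):
        match = re.search(
            rf"{label}:\s*(.+?)(?=(?:Experience|Education):|$)", description
        )
        if match:
            result[key] = match.group(1).strip().rstrip(".")
    return result

parts = parse_og_description(
    "Experience: Senior Developer at Acme Corp. Education: MIT."
)
# parts == {"experience": "Senior Developer at Acme Corp", "education": "MIT"}
```

If neither label is present, the function returns an empty dict, which is a useful signal that LinkedIn has changed the preview format.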

Fetching Profile Data with Python

Here is a working example using httpx and BeautifulSoup:

# linkedin_profile.py
import httpx
from bs4 import BeautifulSoup
import json
import time
import random

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def fetch_linkedin_profile(profile_url: str, proxy: str = None) -> dict:
    """Fetch public LinkedIn profile data from meta tags and JSON-LD."""
    ua = random.choice(USER_AGENTS)
    headers = {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }

    client_kwargs = {
        "headers": headers,
        "follow_redirects": True,
        "timeout": 15,
        "http2": True,  # HTTP/2 is more browser-like
    }
    if proxy:
        # Note: newer httpx releases renamed this to proxy=; the
        # "proxies" dict form works on the 0.x versions used here
        client_kwargs["proxies"] = {"all://": proxy}

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(profile_url)

    if resp.status_code == 999:
        raise Exception("LinkedIn returned 999 -- bot detection triggered")
    if resp.status_code == 429:
        raise Exception("Rate limited (HTTP 429)")
    if resp.status_code != 200:
        raise Exception(f"HTTP {resp.status_code}")

    # Check for authwall redirect
    if "authwall" in str(resp.url) or "login" in str(resp.url):
        raise Exception("Redirected to auth wall -- profile may be private or IP flagged")

    soup = BeautifulSoup(resp.text, "html.parser")

    # Extract Open Graph meta tags
    profile = {}
    og_mappings = {
        "og:title": "title",
        "og:description": "description",
        "og:image": "image_url",
        "og:url": "profile_url",
        "profile:first_name": "first_name",
        "profile:last_name": "last_name",
    }
    for prop, key in og_mappings.items():
        tag = soup.find("meta", property=prop)
        if tag and tag.get("content"):
            profile[key] = tag["content"]

    # Also check name-attribute meta tags
    name_mappings = {
        "description": "meta_description",
        "twitter:title": "twitter_title",
        "twitter:description": "twitter_description",
    }
    for name_attr, key in name_mappings.items():
        tag = soup.find("meta", attrs={"name": name_attr})
        if tag and tag.get("content"):
            profile[key] = tag["content"]

    # Extract JSON-LD structured data
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string)
            if isinstance(data, list):
                for item in data:
                    if item.get("@type") == "Person":
                        profile["json_ld"] = item
                        break
            elif data.get("@type") == "Person":
                profile["json_ld"] = data
                break
        except (json.JSONDecodeError, TypeError):
            continue

    return profile


def parse_title_components(title: str) -> dict:
    """Parse 'Name - Title at Company' format from og:title."""
    if not title:
        return {}

    result = {}
    # LinkedIn titles follow "Full Name - Job Title at Company"
    if " - " in title:
        parts = title.split(" - ", 1)
        result["parsed_name"] = parts[0].strip()
        if " at " in parts[1]:
            title_company = parts[1].split(" at ", 1)
            result["parsed_title"] = title_company[0].strip()
            result["parsed_company"] = title_company[1].strip()
        else:
            result["parsed_title"] = parts[1].strip()
    else:
        result["parsed_name"] = title.strip()

    return result


if __name__ == "__main__":
    url = "https://www.linkedin.com/in/williamhgates"
    result = fetch_linkedin_profile(url)

    for k, v in result.items():
        if k != "json_ld":
            print(f"{k}: {v}")

    if "json_ld" in result:
        print("\nJSON-LD data:")
        print(json.dumps(result["json_ld"], indent=2))

    components = parse_title_components(result.get("title", ""))
    if components:
        print("\nParsed title components:")
        for k, v in components.items():
            print(f"  {k}: {v}")

What You Get Back

From a typical public profile, this extracts the full name, the "Title at Company" headline (parsed from og:title), the profile photo URL, and a truncated experience summary from og:description.

What you will not get: full work history, skills list, connection count, or contact info. That data requires JavaScript rendering and authenticated access.

Parsing JSON-LD Structured Data

When a profile includes a JSON-LD block, it follows the schema.org Person format:

{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "John Smith",
  "jobTitle": "Senior Developer",
  "worksFor": {
    "@type": "Organization",
    "name": "Acme Corp"
  },
  "url": "https://www.linkedin.com/in/johnsmith",
  "image": "https://media.licdn.com/dms/image/...",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "San Francisco",
    "addressRegion": "CA",
    "addressCountry": "US"
  },
  "alumniOf": [
    {
      "@type": "Organization",
      "name": "MIT"
    }
  ]
}

This is cleaner to parse than scraping visible HTML elements, and it is less likely to break when LinkedIn redesigns their frontend. The schema format is standardized and LinkedIn maintains it for SEO purposes.

Not every profile has this block. In testing as of 2026, roughly 60-70% of public profiles include it. When it is present, it is the most reliable data source on the page.

def extract_json_ld_data(json_ld: dict) -> dict:
    """Parse the JSON-LD Person schema into a flat structure."""
    if not json_ld or json_ld.get("@type") != "Person":
        return {}

    result = {
        "name": json_ld.get("name"),
        "job_title": json_ld.get("jobTitle"),
        "profile_url": json_ld.get("url"),
        "image_url": json_ld.get("image"),
    }

    # Current employer
    works_for = json_ld.get("worksFor")
    if isinstance(works_for, dict):
        result["employer"] = works_for.get("name")
    elif isinstance(works_for, list) and works_for:
        result["employer"] = works_for[0].get("name")

    # Location
    address = json_ld.get("address")
    if isinstance(address, dict):
        parts = [
            address.get("addressLocality"),
            address.get("addressRegion"),
            address.get("addressCountry"),
        ]
        result["location"] = ", ".join(p for p in parts if p)

    # Education (alumni)
    alumni_of = json_ld.get("alumniOf", [])
    if isinstance(alumni_of, dict):
        alumni_of = [alumni_of]
    result["education"] = [
        org.get("name") for org in alumni_of
        if isinstance(org, dict) and org.get("name")
    ]

    return result

Batch Profile Fetching with Rate Limiting

For collecting multiple profiles, you need careful rate control. LinkedIn's bot detection is cumulative -- 10 requests in 5 minutes from one IP is much safer than 10 requests in 10 seconds.

from datetime import datetime

def fetch_profiles_batch(
    profile_urls: list,
    proxy: str = None,
    min_delay: float = 8.0,
    max_delay: float = 20.0,
) -> list:
    """Fetch multiple LinkedIn profiles with rate limiting.

    Args:
        profile_urls: List of LinkedIn profile URLs
        proxy: Optional proxy URL (residential proxies required at scale)
        min_delay: Minimum seconds between requests
        max_delay: Maximum seconds between requests

    Returns:
        List of dicts with profile data and error info
    """
    results = []

    for i, url in enumerate(profile_urls):
        print(f"[{datetime.now().strftime('%H:%M:%S')}] {i+1}/{len(profile_urls)}: {url}")

        try:
            profile = fetch_linkedin_profile(url, proxy=proxy)
            # Parse title components
            components = parse_title_components(profile.get("title", ""))
            profile.update(components)
            profile["url"] = url
            results.append(profile)
            print(f"  OK: {profile.get('title', 'no title')[:60]}")

        except Exception as e:
            error_str = str(e)
            print(f"  Error: {error_str}")
            results.append({"url": url, "error": error_str})

            # Extended backoff on bot detection
            if "999" in error_str or "bot" in error_str.lower():
                extra_wait = random.uniform(60, 120)
                print(f"  Bot detected -- backing off {extra_wait:.0f}s")
                time.sleep(extra_wait)
                continue

        # Random delay between requests
        if i < len(profile_urls) - 1:
            delay = random.uniform(min_delay, max_delay)
            # Occasionally add a longer pause to seem more human
            if random.random() < 0.1:
                delay += random.uniform(30, 60)
            time.sleep(delay)

    return results
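The cumulative budget described above ("10 requests in 5 minutes") can also be enforced explicitly instead of relying on per-request delays alone. Below is a minimal sliding-window limiter sketch; the 10-per-300-seconds default mirrors that budget and is an assumption to tune, not a documented LinkedIn threshold. Call acquire() before each fetch:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Cap requests to max_requests per window_seconds, sleeping until
    the oldest request ages out of the window when the budget is spent."""

    def __init__(self, max_requests: int = 10, window_seconds: float = 300.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()  # monotonic times of recent requests

    def acquire(self, now: float = None) -> float:
        """Record one request; return seconds slept (0.0 if none)."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        waited = 0.0
        if len(self.timestamps) >= self.max_requests:
            # Sleep until the oldest request exits the window
            waited = self.window - (now - self.timestamps[0])
            time.sleep(waited)
            now += waited
            self.timestamps.popleft()
        self.timestamps.append(now)
        return waited
```

Combined with the randomized per-request delays, this guarantees the hourly budget is never exceeded even if individual delays are short.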

Storage Schema

import sqlite3

def init_db(db_path: str = "linkedin_profiles.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS profiles (
            url TEXT PRIMARY KEY,
            first_name TEXT,
            last_name TEXT,
            full_name TEXT,
            job_title TEXT,
            employer TEXT,
            location TEXT,
            image_url TEXT,
            description TEXT,
            education TEXT,
            raw_title TEXT,
            has_json_ld INTEGER DEFAULT 0,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE INDEX IF NOT EXISTS idx_employer
            ON profiles(employer);

        CREATE INDEX IF NOT EXISTS idx_job_title
            ON profiles(job_title);

        CREATE INDEX IF NOT EXISTS idx_scraped
            ON profiles(scraped_at);
    """)
    conn.commit()
    return conn


def save_profile(conn: sqlite3.Connection, profile: dict):
    conn.execute(
        """INSERT OR REPLACE INTO profiles
           (url, first_name, last_name, full_name, job_title, employer,
            location, image_url, description, education, raw_title, has_json_ld)
           VALUES (?,?,?,?,?,?,?,?,?,?,?,?)""",
        (
            profile.get("url") or profile.get("profile_url"),
            profile.get("first_name"),
            profile.get("last_name"),
            profile.get("name") or profile.get("parsed_name"),
            profile.get("job_title") or profile.get("parsed_title"),
            profile.get("employer") or profile.get("parsed_company"),
            profile.get("location"),
            profile.get("image_url"),
            profile.get("description"),
            json.dumps(profile.get("education", [])),
            profile.get("title"),
            int("job_title" in profile or "employer" in profile),
        ),
    )
    conn.commit()

Bot Detection and Rate Limits

LinkedIn is aggressive about blocking automated access. Here is what you will run into:

HTTP 999 -- LinkedIn's custom status code for "we think you are a bot." You will see this after just a handful of requests from a datacenter IP. On residential IPs, you can typically get 10-20 requests per hour before hitting it.

Authwall redirects -- some profiles redirect to a login page even when set to public. This varies by the requester's IP reputation and geolocation. European IPs seem to trigger this more often, possibly due to GDPR-related gating policies.

Rate limiting -- even with residential IPs, more than 20-30 requests per hour from the same IP will likely trigger blocks.

TLS fingerprinting -- LinkedIn inspects TLS handshake signatures. Standard Python HTTP libraries have recognizable fingerprints. Using httpx with http2=True presents a more browser-like handshake.

Cookie requirements -- LinkedIn's newer bot detection checks for session cookies that a real browser would have accumulated from previous visits.
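One heuristic mitigation is to warm the session by visiting the LinkedIn homepage before requesting a profile, so the client's cookie jar holds cookies a returning browser would carry (e.g. bcookie and lang). This is a sketch, not a guarantee -- LinkedIn's checks are opaque and change over time. The kwargs builder is split out so it can be tested without network access:

```python
def build_client_kwargs(user_agent: str, proxy: str = None) -> dict:
    """Assemble httpx.Client kwargs (split out so it is testable offline)."""
    kwargs = {
        "headers": {"User-Agent": user_agent},
        "follow_redirects": True,
        "timeout": 15,
    }
    if proxy:
        kwargs["proxies"] = {"all://": proxy}
    return kwargs


def make_warmed_client(user_agent: str, proxy: str = None):
    """Return an httpx client whose cookie jar was seeded by a homepage
    visit, so the profile request carries cookies a browser would have."""
    import httpx  # deferred so build_client_kwargs stays dependency-free

    client = httpx.Client(**build_client_kwargs(user_agent, proxy))
    # httpx.Client keeps response cookies automatically, so cookies set
    # by the homepage are sent with the subsequent profile request
    client.get("https://www.linkedin.com/")
    return client
```

Whether this helps depends on LinkedIn's current detection logic; at minimum it makes the request sequence look less like a cold, single-page hit.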

Proxy Configuration

For anything beyond a few profiles, you need proxy rotation. Residential proxies are essential here -- datacenter IPs get blocked almost immediately.

ThorData's rotating residential proxies work well for this use case. Their pool includes IPs from ISPs that LinkedIn does not flag as aggressively as typical proxy network ranges. The per-GB pricing model makes sense when you are fetching individual profile pages rather than bulk downloading.

# ThorData proxy configuration for LinkedIn
PROXY_ROTATING = "http://USERNAME:[email protected]:9000"
# US geo-targeting (LinkedIn serves different content by region)
PROXY_US = "http://USERNAME-country-us:[email protected]:9000"

with httpx.Client(
    proxies={"all://": PROXY_ROTATING},
    http2=True,
    timeout=15
) as client:
    resp = client.get(
        "https://www.linkedin.com/in/target-profile",
        headers={"User-Agent": random.choice(USER_AGENTS)},
    )

Tip: Add a random delay of 8-20 seconds between requests. LinkedIn's detection is partly timing-based, and perfectly uniform intervals are a strong bot signal.

Advanced: Playwright for JavaScript-Rendered Data

For profiles that require JavaScript rendering, switch to Playwright:

from playwright.async_api import async_playwright
import asyncio

async def fetch_linkedin_playwright(
    profile_url: str,
    proxy: dict = None,
) -> dict:
    """Fetch LinkedIn profile with full browser rendering."""
    async with async_playwright() as p:
        launch_kwargs = {
            "headless": True,
            "args": [
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
            ],
        }
        if proxy:
            launch_kwargs["proxy"] = proxy

        browser = await p.chromium.launch(**launch_kwargs)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            ),
            locale="en-US",
        )

        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
        """)

        page = await context.new_page()

        try:
            await page.goto(profile_url, wait_until="domcontentloaded", timeout=25000)
            await asyncio.sleep(random.uniform(2, 4))

            # Check for authwall
            if "authwall" in page.url or "login" in page.url:
                await browser.close()
                return {"error": "authwall", "url": profile_url}

            html = await page.content()
            soup = BeautifulSoup(html, "html.parser")
            profile = {}

            # Extract OG tags
            for prop, key in {
                "og:title": "title",
                "og:description": "description",
                "og:image": "image_url",
                "profile:first_name": "first_name",
                "profile:last_name": "last_name",
            }.items():
                tag = soup.find("meta", property=prop)
                if tag and tag.get("content"):
                    profile[key] = tag["content"]

            # Extract JSON-LD
            for script in soup.find_all("script", type="application/ld+json"):
                try:
                    data = json.loads(script.string)
                    if isinstance(data, dict) and data.get("@type") == "Person":
                        profile.update(extract_json_ld_data(data))
                        break
                except (json.JSONDecodeError, TypeError):
                    continue

            profile["url"] = profile_url

        except Exception as e:
            profile = {"url": profile_url, "error": str(e)}

        await browser.close()
        return profile

Complete Example Pipeline

def run_linkedin_pipeline(
    profile_urls: list,
    proxy: str = None,
    db_path: str = "linkedin_profiles.db",
):
    """
    Full pipeline: fetch profiles, parse data, store in SQLite.
    """
    conn = init_db(db_path)
    print(f"Processing {len(profile_urls)} profiles")

    results = fetch_profiles_batch(profile_urls, proxy=proxy)

    saved = 0
    errors = 0
    for result in results:
        if "error" in result:
            errors += 1
        else:
            save_profile(conn, result)
            saved += 1

    conn.close()
    print(f"\nDone: {saved} saved, {errors} errors")
    return results


# Usage
PROXY = "http://USER:[email protected]:9000"

urls = [
    "https://www.linkedin.com/in/williamhgates",
    "https://www.linkedin.com/in/jeffweiner08",
    "https://www.linkedin.com/in/reidhoffman",
]

run_linkedin_pipeline(urls, proxy=PROXY)

Understanding What You Actually Get in 2026

LinkedIn has progressively reduced the information in their public-facing structured data. In 2022, the JSON-LD blocks contained substantial work history. Today, most profiles show only current employer and title.

What still works reliably:

- Name extraction from og:title and profile:first_name/last_name
- Current job title and employer from og:title
- Profile photo URL from og:image
- Confirmation that a profile exists

What is increasingly unreliable:

- Location data (often absent from JSON-LD)
- Education history (stripped from most profiles)
- Full description/summary (truncated severely)

For comprehensive profile data, your realistic options are:

1. Authenticated access with Playwright (most data, highest risk)
2. Official API partner program (limited data, legitimate)
3. Managed scrapers like Apify that maintain their own infrastructure

The legal situation here is worth being direct about: fetching public pages does not require credentials, but it does violate LinkedIn's User Agreement, and LinkedIn has litigated against scrapers. Treat public accessibility as a technical fact, not a legal green light.

Be responsible: Do not build tools that enable harassment, spam, or mass surveillance. Do not scrape private profiles. Do not store data longer than necessary.

Key Takeaways

- Public LinkedIn profiles expose name, headline, current employer, and photo URL through Open Graph meta tags and (on many profiles) JSON-LD, with no API key required.
- Skills, full work history, connection counts, and contact info are not available this way.
- Bot detection (HTTP 999, authwalls, TLS fingerprinting) kicks in quickly; residential proxies and randomized 8-20 second delays are the baseline.
- When a JSON-LD block is present, it is the cleanest and most stable data source on the page.

Handling Authwall and Private Profile Detection

One common pain point is profiles that appear public but actually redirect to a login wall for certain IP addresses. Here is a robust detection and fallback system:

import re

def is_linkedin_authwall(html: str, url: str) -> bool:
    """Detect various forms of LinkedIn authentication walls."""
    authwall_signals = [
        "authwall" in url.lower(),
        "login" in url.lower() and "linkedin.com" in url.lower(),
        "join-linkedin" in html.lower(),
        "sign in" in html.lower() and "to see" in html.lower(),
        "uas/login" in url.lower(),
        '<meta name="robots" content="noindex' in html,
    ]
    return any(authwall_signals)


def check_profile_accessibility(html: str, url: str) -> tuple:
    """
    Check what level of access we got for a profile.
    Returns (accessible: bool, reason: str)
    """
    if is_linkedin_authwall(html, url):
        return False, "authwall"

    # Check if we got meaningful profile data
    has_og_title = 'property="og:title"' in html
    has_profile_meta = 'property="profile:first_name"' in html

    if not has_og_title:
        return False, "no_og_tags"

    # Check for GDPR consent walls common in EU
    if "consent" in html.lower() and "gdpr" in html.lower():
        return False, "gdpr_consent_wall"

    return True, "ok"


def fetch_with_fallback(
    profile_url: str,
    primary_proxy: str = None,
    fallback_proxy: str = None,
) -> dict:
    """
    Fetch a LinkedIn profile with automatic fallback on auth walls.
    Tries primary proxy first, then fallback, then direct.
    """
    attempts = [
        ("primary", primary_proxy),
        ("fallback", fallback_proxy),
        ("direct", None),
    ]

    for attempt_name, proxy in attempts:
        if proxy is None and attempt_name != "direct":
            continue

        try:
            profile = fetch_linkedin_profile(profile_url, proxy=proxy)
            return profile
        except Exception as e:
            print(f"  {attempt_name} attempt failed: {e}")
            time.sleep(random.uniform(5, 10))

    return {"url": profile_url, "error": "all_attempts_failed"}

Enriching Profiles with Company Data

Once you have employer names from LinkedIn profiles, you can enrich them with additional company data from other sources:

import httpx
import json
import re
import time

def enrich_with_company_data(profiles: list) -> list:
    """
    Add company size, industry, and funding data to profiles
    by looking up employer names via public sources.
    """
    enriched = []

    for profile in profiles:
        employer = profile.get("employer") or profile.get("parsed_company")
        if not employer:
            enriched.append(profile)
            continue

        # Try Clearbit's free company enrichment
        try:
            resp = httpx.get(
                "https://autocomplete.clearbit.com/v1/companies/suggest",
                params={"query": employer},
                timeout=10,
            )
            if resp.status_code == 200:
                companies = resp.json()
                if companies:
                    company_info = companies[0]
                    profile["company_domain"] = company_info.get("domain")
                    profile["company_logo"] = company_info.get("logo")
                    profile["company_name_normalized"] = company_info.get("name")
        except Exception:
            pass

        enriched.append(profile)
        time.sleep(0.5)

    return enriched


def deduplicate_profiles(profiles: list) -> list:
    """Remove duplicate profiles based on LinkedIn URL normalization."""
    seen_ids = set()
    unique = []

    for profile in profiles:
        url = profile.get("url") or profile.get("profile_url", "")
        # Normalize: extract the profile ID from the URL
        match = re.search(r"/in/([a-zA-Z0-9_-]+)", url)
        if match:
            profile_id = match.group(1).lower()
            if profile_id not in seen_ids:
                seen_ids.add(profile_id)
                unique.append(profile)
        else:
            unique.append(profile)

    return unique

Rate Analysis and Throughput Planning

Before starting a batch job, estimate how long it will take and how many IPs you will need:

def estimate_scraping_time(
    num_profiles: int,
    min_delay: float = 8.0,
    max_delay: float = 20.0,
    success_rate: float = 0.75,
) -> dict:
    """
    Estimate time and cost for batch LinkedIn scraping.

    Args:
        num_profiles: Total profiles to collect
        min_delay: Minimum delay between requests in seconds
        max_delay: Maximum delay between requests in seconds
        success_rate: Expected fraction of profiles that return data
                      (others are authwalls, 999s, or errors)
    """
    avg_delay = (min_delay + max_delay) / 2
    attempts_needed = int(num_profiles / success_rate)

    total_seconds = attempts_needed * avg_delay
    total_hours = total_seconds / 3600

    # Estimate data volume (average LinkedIn profile page is ~80-120KB compressed)
    avg_page_size_kb = 100
    total_mb = (attempts_needed * avg_page_size_kb) / 1024

    return {
        "profiles_target": num_profiles,
        "attempts_needed": attempts_needed,
        "expected_success_rate": f"{success_rate*100:.0f}%",
        "avg_delay_seconds": avg_delay,
        "total_time_hours": round(total_hours, 1),
        "estimated_data_mb": round(total_mb, 1),
        "note": "Single-threaded. Add multiple IPs to parallelize."
    }

# Example planning
plan = estimate_scraping_time(1000)
for k, v in plan.items():
    print(f"  {k}: {v}")
# Output example:
# profiles_target: 1000
# attempts_needed: 1333
# expected_success_rate: 75%
# avg_delay_seconds: 14.0
# total_time_hours: 5.2
# estimated_data_mb: 130.2

Comparing Output Quality Across Methods

Here is a realistic comparison of what each approach yields for a typical professional LinkedIn profile in 2026:

| Approach | Name | Title | Employer | Location | Education | Photo |
| --- | --- | --- | --- | --- | --- | --- |
| Meta tags only | Yes | Parsed | Parsed | Rare | Rare | Yes |
| Meta + JSON-LD | Yes | Clean | Clean | Sometimes | Sometimes | Yes |
| Playwright (unauth) | Yes | Full | Full | Usually | Sometimes | Yes |
| Playwright (auth) | Yes | Full | Full | Yes | Yes | Yes |
| Official API | Yes | Full | Full | Yes | Yes | Yes |

For a name-and-employer validation pipeline (e.g., "does this person work where they claim?"), the meta tags approach is sufficient and has the lowest risk profile. For full profile enrichment, authenticated access or the official API is necessary.
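For that validation use case, the comparison step can be sketched as a small fuzzy matcher between the claimed employer and the company parsed from og:title. The suffix list and substring heuristic below are assumptions; a real pipeline may need proper entity resolution for renamed or merged companies:

```python
import re

# Common legal suffixes to ignore when comparing company names
SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "co", "company", "gmbh"}

def normalize_employer(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal suffixes."""
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(w for w in name.split() if w not in SUFFIXES)

def employer_matches(claimed: str, parsed_company: str) -> bool:
    """Heuristic check: does the claimed employer match the parsed one?"""
    a, b = normalize_employer(claimed), normalize_employer(parsed_company)
    # Substring match tolerates variants like "Acme" vs "Acme Robotics"
    return bool(a) and (a == b or a in b or b in a)

employer_matches("Acme Corp.", "Acme Corporation")  # True
employer_matches("Acme Corp", "Globex Inc")         # False
```

Feeding this the parsed_company value from parse_title_components gives a yes/no answer without touching any data beyond the meta tags.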

Monitoring for Schema Changes

LinkedIn updates their page structure periodically. Build in change detection:

def validate_profile_extraction(profile: dict) -> tuple:
    """
    Validate that a profile extraction got usable data.
    Returns (valid: bool, issues: list)
    """
    issues = []

    if not profile.get("title") and not profile.get("first_name"):
        issues.append("no_name_data")

    if not profile.get("image_url"):
        issues.append("no_profile_image")

    title = profile.get("title", "")
    if title and " - " not in title:
        issues.append("title_format_changed")

    # Check for LinkedIn's known error pages
    if profile.get("title") in ["LinkedIn", "Log in or sign up to view"]:
        issues.append("auth_wall_content")

    return len(issues) == 0, issues


# Log format changes for monitoring
def log_extraction_quality(results: list, log_path: str = "extraction_log.json"):
    """Log extraction quality metrics for monitoring schema changes."""
    metrics = {
        "total": len(results),
        "errors": sum(1 for r in results if "error" in r),
        "missing_name": sum(1 for r in results if not r.get("first_name") and not r.get("title")),
        "has_json_ld": sum(1 for r in results if r.get("job_title") or r.get("employer")),
        "has_image": sum(1 for r in results if r.get("image_url")),
    }
    metrics["success_rate"] = round(
        (metrics["total"] - metrics["errors"]) / metrics["total"] * 100, 1
    ) if metrics["total"] else 0

    with open(log_path, "w") as f:
        json.dump(metrics, f, indent=2)

    return metrics

If has_json_ld drops significantly from a baseline, LinkedIn may have changed their schema. If title_format_changed starts appearing, the "Name - Title at Company" parsing logic needs updating.
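That baseline comparison can be automated. The sketch below assumes the metric dicts produced by log_extraction_quality above; the 20-point tolerance is an arbitrary default to tune against your volume:

```python
def detect_schema_drift(current: dict, baseline: dict,
                        tolerance: float = 0.2) -> list:
    """Return the metric names whose per-profile rate dropped by more
    than `tolerance` versus a saved baseline run.

    Expects dicts shaped like the output of log_extraction_quality.
    """
    alerts = []
    total_cur = current.get("total") or 1
    total_base = baseline.get("total") or 1
    for key in ("has_json_ld", "has_image"):
        cur_rate = current.get(key, 0) / total_cur
        base_rate = baseline.get(key, 0) / total_base
        if base_rate - cur_rate > tolerance:
            alerts.append(key)
    return alerts

baseline = {"total": 100, "has_json_ld": 65, "has_image": 95}
current = {"total": 100, "has_json_ld": 30, "has_image": 90}
detect_schema_drift(current, baseline)  # ["has_json_ld"]
```

Run this after each batch and alert (or halt the job) when it returns anything, so a LinkedIn markup change stops the pipeline before it burns proxy bandwidth on empty extractions.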

Summary: When to Use This Approach

The meta tags and JSON-LD method is the right choice when:

- You only need name, current title, employer, and photo, and a truncated summary is enough
- Volume is modest and you can afford 8-20 second delays between requests
- You want the lowest-risk footprint: plain HTTP requests, no login, no JavaScript rendering

Switch to authenticated Playwright scraping when:

- You need full work history, skills, or education details that never appear in meta tags
- You accept the higher operational risk of account bans and tougher bot detection

Use the official API when:

- You qualify for the partner program and need a compliant, supported integration
- The limited data it exposes actually covers your use case