Scraping LinkedIn Profiles and Job Listings in 2026 (Without Getting Banned)
LinkedIn is arguably the hardest major website to scrape in 2026. It combines aggressive TLS fingerprinting, JavaScript-rendered content behind authentication walls, and a legal team that has actually sued scrapers (hiQ Labs v. LinkedIn spent five years in the courts and reached the Supreme Court before settling). Here's what the detection stack looks like and what actually works to get data out.
Table of Contents
- Why LinkedIn Is Hard: The Detection Stack
- The 999 Status Code: LinkedIn's Bot Signal
- Public Profiles with curl-cffi and JSON-LD
- Parsing JSON-LD Profile Data
- Scraping Public Job Listings
- Paginating and Enriching Job Results
- Authenticated Scraping with Playwright
- Managing Session Cookies Safely
- Company Page Scraping
- Rate Limits and Soft Ban Avoidance
- Proxy Strategy for LinkedIn
- Storing LinkedIn Data: SQLite Schema
- The LinkedIn Official API: What It Provides
- Real Use Cases and What Is Feasible
- Legal Considerations
- Key Takeaways
1. Why LinkedIn Is Hard: The Detection Stack {#why-hard}
LinkedIn's anti-bot system operates on multiple layers:
Layer 1: TLS fingerprinting. LinkedIn checks your TLS client hello against known browser fingerprints. Standard Python requests or httpx get blocked immediately with a 999 status code. The TLS handshake must match a known browser before the request even reaches the application layer.
Layer 2: Authentication wall. Most profile data requires a logged-in session. Public profiles show limited information: name, headline, current company, maybe education. Job listings are partially accessible without auth but paginated results beyond the first few pages require login.
Layer 3: Rate limiting per IP and per account. LinkedIn tracks request velocity per IP and per account. Exceeding roughly 80-100 profile views per hour on a single account triggers soft bans that last 24-48 hours. IP-level throttling happens independently at lower thresholds.
Layer 4: Datacenter IP blocking. LinkedIn maintains blocklists of AWS, GCP, Azure, and major VPS IP ranges. You will get 999 errors or CAPTCHA challenges from any cloud or datacenter IP regardless of how well your TLS fingerprint is crafted.
Layer 5: JavaScript rendering. Profile pages use React with heavy client-side rendering. The initial HTML contains minimal data; the full profile loads via internal GraphQL API calls after page load. Standard HTML scraping without JavaScript execution gets you almost nothing.
Layer 6: Behavioral analysis. LinkedIn monitors navigation patterns, time on page, and interaction sequences. Bots that jump directly to profile URLs without navigating through search pages first are flagged. Human users browse LinkedIn -- they search, click results, scroll profiles.
2. The 999 Status Code {#999-error}
The 999 response is LinkedIn's custom bot detection code. It is not a standard HTTP status -- it is their signal that "we know you are a bot."
Common causes:
- Datacenter or VPS IP address (most common cause)
- TLS fingerprint mismatch (Python requests, httpx, urllib)
- Missing or incorrect headers (sec-ch-ua, sec-fetch-*)
- Rapid sequential requests from the same IP
- Known bot user agent strings
from curl_cffi import requests as cffi_requests
import requests as std_requests
def diagnose_linkedin_access(url: str, proxy: str = None) -> dict:
results = {}
# Test 1: Standard requests library (will get 999 from datacenter)
try:
resp = std_requests.get(url, timeout=10)
results["requests_lib"] = resp.status_code
except Exception as e:
results["requests_lib"] = f"error: {e}"
# Test 2: curl-cffi with Chrome TLS impersonation
try:
kwargs = {"impersonate": "chrome136", "timeout": 10}
if proxy:
kwargs["proxies"] = {"https": proxy}
resp = cffi_requests.get(url, **kwargs)
results["curl_cffi_chrome136"] = resp.status_code
except Exception as e:
results["curl_cffi_chrome136"] = f"error: {e}"
return results
# Without proxy: {"requests_lib": 999, "curl_cffi_chrome136": 999}
# With residential proxy + curl-cffi: {"requests_lib": 999, "curl_cffi_chrome136": 200}
The fix is always the same: curl-cffi with chrome136 impersonation plus a residential proxy. Without both, you will get 999.
3. Public Profiles with curl-cffi {#curl-cffi}
curl-cffi impersonates real browser TLS fingerprints, bypassing LinkedIn's first detection layer. Combined with residential proxies, this works for public profile pages.
from curl_cffi import requests
import time
import random
LINKEDIN_HEADERS = {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
"image/avif,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US,en;q=0.9",
"accept-encoding": "gzip, deflate, br",
"cache-control": "no-cache",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "none",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
}
def scrape_linkedin_profile(profile_url: str,
proxy: str = None,
retries: int = 3) -> dict:
session = requests.Session()
for attempt in range(retries):
kwargs = {
"headers": LINKEDIN_HEADERS,
"impersonate": "chrome136",
"timeout": 20,
"allow_redirects": True,
}
if proxy:
kwargs["proxies"] = {"https": proxy}
resp = session.get(profile_url, **kwargs)
if resp.status_code == 999:
if proxy:
wait = (2 ** attempt) + random.uniform(1, 3)
time.sleep(wait)
continue
else:
return {"url": profile_url, "error": "999_blocked_no_proxy"}
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 30))
time.sleep(retry_after)
continue
if resp.status_code == 404:
return {"url": profile_url, "error": "profile_not_found"}
if resp.status_code == 200:
return extract_profile_data(resp.text, profile_url)
time.sleep(2 ** attempt)
return {"url": profile_url, "error": "failed_after_retries"}
def batch_scrape_profiles(profile_urls: list,
proxy: str = None,
delay_range: tuple = (4, 9)) -> list:
results = []
for i, url in enumerate(profile_urls):
result = scrape_linkedin_profile(url, proxy=proxy)
results.append(result)
print(f"[{i+1}/{len(profile_urls)}] {result.get('name', 'Unknown')}")
time.sleep(random.uniform(*delay_range))
if (i + 1) % 20 == 0 and i < len(profile_urls) - 1:
break_time = random.uniform(300, 600)
print(f"Taking {break_time:.0f}s break...")
time.sleep(break_time)
return results
4. Parsing JSON-LD Profile Data {#json-ld}
LinkedIn embeds schema.org Person data in script tags for public profiles. This is the most stable data source:
import re
import json
def extract_jsonld(html: str) -> dict:
for match in re.finditer(
r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
html, re.DOTALL
):
try:
data = json.loads(match.group(1))
if isinstance(data, list):
for item in data:
if item.get("@type") == "Person":
return _parse_person_schema(item)
elif data.get("@type") == "Person":
return _parse_person_schema(data)
except json.JSONDecodeError:
continue
return {}
def _parse_person_schema(schema: dict) -> dict:
works_for = schema.get("worksFor", [])
if isinstance(works_for, dict):
works_for = [works_for]
alumni_of = schema.get("alumniOf", [])
if isinstance(alumni_of, dict):
alumni_of = [alumni_of]
address = schema.get("address", {})
if isinstance(address, str):
location = address
else:
parts = [address.get("addressLocality"), address.get("addressRegion"),
address.get("addressCountry")]
location = ", ".join(p for p in parts if p)
return {
"name": schema.get("name"),
"job_title": schema.get("jobTitle"),
"url": schema.get("url"),
"description": schema.get("description"),
"location": location,
"profile_image": schema.get("image"),
"same_as": schema.get("sameAs", []),
"current_company": works_for[0].get("name") if works_for else None,
"current_company_url": works_for[0].get("url") if works_for else None,
"education": [
{"school": e.get("name"), "url": e.get("url")}
for e in alumni_of
],
}
def extract_og_meta(html: str) -> dict:
og = {}
for match in re.finditer(
r'<meta\s+(?:property|name)="(og:[^"]+)"\s+content="([^"]*)"',
html, re.IGNORECASE
):
og[match.group(1)] = match.group(2)
return og
def extract_profile_data(html: str, url: str) -> dict:
result = {"url": url}
json_ld = extract_jsonld(html)
if json_ld:
result.update(json_ld)
og_data = extract_og_meta(html)
result.setdefault("title", og_data.get("og:title", ""))
result.setdefault("description", og_data.get("og:description", ""))
result.setdefault("image", og_data.get("og:image", ""))
return result
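The extraction flow above can be exercised offline against a canned snippet. The HTML, names, and field values below are all invented for illustration; `parse_person` is a condensed stand-in for the `extract_jsonld` plus `_parse_person_schema` pipeline:

```python
import json
import re

# A trimmed, invented example of the schema.org Person block LinkedIn
# embeds in public profile HTML (real blocks carry more fields).
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Person",
 "name": "Jane Doe", "jobTitle": "Data Engineer",
 "worksFor": {"@type": "Organization", "name": "Acme Corp"},
 "address": {"addressLocality": "Austin", "addressRegion": "Texas"}}
</script>
</head><body></body></html>
"""

def parse_person(html: str) -> dict:
    # Same regex-plus-json.loads approach as extract_jsonld above.
    for m in re.finditer(
        r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
        html, re.DOTALL,
    ):
        data = json.loads(m.group(1))
        if data.get("@type") == "Person":
            company = data.get("worksFor") or {}
            addr = data.get("address") or {}
            return {
                "name": data.get("name"),
                "job_title": data.get("jobTitle"),
                "current_company": company.get("name"),
                "location": ", ".join(
                    p for p in (addr.get("addressLocality"),
                                addr.get("addressRegion")) if p
                ),
            }
    return {}
```

Running this against real profile HTML follows the same shape; only the regex and the schema keys matter.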
What JSON-LD Gives You from Public Profiles
From a typical LinkedIn public profile:
- Full name, job title / headline
- Current employer name and URL
- Location (city/region)
- Education institutions
- Profile photo URL
- Bio/summary (sometimes truncated)
What requires login:
- Full work history with dates and descriptions
- Skills list with endorsement counts
- Recommendations text
- Connections and follower counts
- Contact information
- Posts and activity feed
5. Scraping Public Job Listings {#jobs}
LinkedIn job listings are partially accessible without authentication. The search URL structure:
https://www.linkedin.com/jobs/search/?keywords=python+developer&location=United+States&start=0
Key URL parameters:
| Parameter | Description | Example Values |
|---|---|---|
| keywords | Job title or skills | python+developer |
| location | City, state, or country | United+States, New+York |
| f_TPR | Time posted | r86400 (24h), r604800 (7d), r2592000 (30d) |
| f_JT | Job type | F (full-time), C (contract), P (part-time) |
| f_E | Experience level | 2 (entry), 3 (assoc), 4 (mid-senior), 5 (director) |
| f_WT | Remote filter | 1 (on-site), 2 (remote), 3 (hybrid) |
| start | Pagination offset | 0, 25, 50... |
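These URLs can also be composed with the standard library instead of manual string joins; `build_jobs_url` is a helper name invented here, and `urlencode`'s default quoting turns spaces into `+` just like the examples above:

```python
from urllib.parse import urlencode

BASE = "https://www.linkedin.com/jobs/search/"

def build_jobs_url(keywords: str, location: str, start: int = 0,
                   **filters) -> str:
    # urlencode defaults to quote_plus, so spaces become '+' and
    # everything else is percent-escaped safely.
    params = {"keywords": keywords, "location": location,
              "start": start, **filters}
    return BASE + "?" + urlencode(params)

url = build_jobs_url("python developer", "United States",
                     f_TPR="r604800", f_WT="2")
```

This avoids the edge cases (ampersands, commas, non-ASCII city names) that naive `.replace(" ", "+")` misses.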
from selectolax.parser import HTMLParser
def scrape_linkedin_jobs(keywords: str, location: str,
proxy: str = None,
max_pages: int = 5,
filters: dict = None) -> list:
jobs = []
session = requests.Session()
filters = filters or {}
for page in range(max_pages):
params_parts = [
f"keywords={keywords.replace(' ', '+')}",
f"location={location.replace(' ', '+')}",
f"start={page * 25}",
]
for k, v in filters.items():
params_parts.append(f"{k}={v}")
url = "https://www.linkedin.com/jobs/search/?" + "&".join(params_parts)
kwargs = {
"impersonate": "chrome136",
"timeout": 20,
"headers": LINKEDIN_HEADERS,
}
if proxy:
kwargs["proxies"] = {"https": proxy}
resp = session.get(url, **kwargs)
if resp.status_code == 999:
print(f"Blocked on page {page}. Residential proxy required.")
break
if resp.status_code != 200:
print(f"Status {resp.status_code} on page {page}")
break
tree = HTMLParser(resp.text)
page_jobs = []
for card in tree.css("div.base-search-card"):
title_el = card.css_first("h3.base-search-card__title")
company_el = card.css_first("h4.base-search-card__subtitle a")
location_el = card.css_first("span.job-search-card__location")
link_el = card.css_first("a.base-card__full-link")
date_el = card.css_first("time")
badge_el = card.css_first("span.result-benefits__text")
job_url = link_el.attributes.get("href", "") if link_el else ""
job_id = ""
id_match = re.search(r'/jobs/view/(\d+)', job_url)
if id_match:
job_id = id_match.group(1)
page_jobs.append({
"job_id": job_id,
"title": title_el.text(strip=True) if title_el else "",
"company": company_el.text(strip=True) if company_el else "",
"location": location_el.text(strip=True) if location_el else "",
"url": job_url,
"posted_date": date_el.attributes.get("datetime", "") if date_el else "",
"posted_text": date_el.text(strip=True) if date_el else "",
"badge": badge_el.text(strip=True) if badge_el else "",
"search_keywords": keywords,
"search_location": location,
})
if not page_jobs:
break
jobs.extend(page_jobs)
time.sleep(random.uniform(3, 6))
return jobs
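The /jobs/view/ ID regex used in the card loop can be checked offline; the URLs below are made up:

```python
import re

def extract_job_id(job_url: str) -> str:
    # LinkedIn job URLs embed a numeric ID right after /jobs/view/.
    m = re.search(r"/jobs/view/(\d+)", job_url)
    return m.group(1) if m else ""

urls = [
    "https://www.linkedin.com/jobs/view/3871234567/?refId=abc",
    "https://www.linkedin.com/feed/",  # not a job page
]
ids = [extract_job_id(u) for u in urls]
```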
6. Paginating and Enriching Job Results {#job-pagination}
To get more than ~1,000 results, batch by time period. To get full job descriptions, fetch each job's detail page:
def scrape_jobs_exhaustive(keywords: str, location: str,
proxy: str = None) -> list:
all_jobs = {}
time_filters = ["r86400", "r604800", "r2592000"]
for tf in time_filters:
jobs = scrape_linkedin_jobs(
keywords, location, proxy=proxy,
max_pages=10, filters={"f_TPR": tf}
)
for job in jobs:
if job.get("job_id"):
all_jobs[job["job_id"]] = job
time.sleep(random.uniform(10, 20))
return list(all_jobs.values())
def get_job_details(job_id: str, proxy: str = None) -> dict:
url = f"https://www.linkedin.com/jobs/view/{job_id}/"
kwargs = {
"impersonate": "chrome136",
"timeout": 20,
"headers": LINKEDIN_HEADERS,
}
if proxy:
kwargs["proxies"] = {"https": proxy}
resp = requests.get(url, **kwargs)
if resp.status_code != 200:
return {"job_id": job_id, "error": f"status_{resp.status_code}"}
# Extract JSON-LD JobPosting schema
for match in re.finditer(
r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
resp.text, re.DOTALL
):
try:
data = json.loads(match.group(1))
if data.get("@type") == "JobPosting":
return {
"job_id": job_id,
"title": data.get("title"),
"company": (data.get("hiringOrganization") or {}).get("name"),
"location": ((data.get("jobLocation") or {})
.get("address", {})
.get("addressLocality")),
"description": data.get("description"),
"employment_type": data.get("employmentType"),
"posted_date": data.get("datePosted"),
"valid_through": data.get("validThrough"),
"salary_min": ((data.get("baseSalary") or {})
.get("value", {}).get("minValue")),
"salary_max": ((data.get("baseSalary") or {})
.get("value", {}).get("maxValue")),
"remote": data.get("jobLocationType") == "TELECOMMUTE",
"source": "json_ld",
}
except (json.JSONDecodeError, AttributeError):
continue
return {"job_id": job_id, "url": url, "source": "no_schema_found"}
def enrich_jobs_with_details(jobs: list, proxy: str = None,
delay_range: tuple = (3, 7)) -> list:
enriched = []
for i, job in enumerate(jobs):
if job.get("job_id"):
details = get_job_details(job["job_id"], proxy=proxy)
merged = {**job, **{k: v for k, v in details.items()
if v and k not in job}}
enriched.append(merged)
else:
enriched.append(job)
time.sleep(random.uniform(*delay_range))
return enriched
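The dict-merge inside enrich_jobs_with_details gives the search-card fields precedence: a detail-page value is only copied in when it is truthy and the key is absent from the card. A quick offline check with invented records:

```python
# Invented search-card row and detail-page row for the same posting.
card = {"job_id": "123", "title": "ML Engineer", "company": "Acme"}
details = {"job_id": "123", "title": "Machine Learning Engineer",
           "description": "Full posting text...", "salary_min": 150000}

# Same expression as in enrich_jobs_with_details: detail fields win
# only when truthy AND not already present on the card.
merged = {**card, **{k: v for k, v in details.items()
                     if v and k not in card}}
```

So the card's "ML Engineer" title survives, while description and salary are filled in from the detail page.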
7. Authenticated Scraping with Playwright {#playwright}
For full profile data and search results beyond the first few pages, you need an authenticated session:
import asyncio
import json
import random
from playwright.async_api import async_playwright
async def scrape_authenticated_profile(profile_url: str,
cookies_file: str,
proxy_url: str = None) -> dict:
async with async_playwright() as p:
launch_args = {
"headless": True,
"args": [
"--disable-blink-features=AutomationControlled",
"--disable-features=IsolateOrigins,site-per-process",
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
],
}
if proxy_url:
launch_args["proxy"] = {"server": proxy_url}
browser = await p.chromium.launch(**launch_args)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/136.0.0.0 Safari/537.36"
),
locale="en-US",
timezone_id="America/New_York",
)
# Load saved cookies from browser export
with open(cookies_file) as f:
cookies = json.load(f)
await context.add_cookies(cookies)
page = await context.new_page()
await page.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)
await page.goto(profile_url, wait_until="domcontentloaded")
await page.wait_for_timeout(2000 + int(1500 * random.random()))
# Simulate reading by scrolling
await page.evaluate("window.scrollTo(0, 500)")
await page.wait_for_timeout(1500)
await page.evaluate("window.scrollTo(0, 1200)")
await page.wait_for_timeout(1000)
name = headline = location = ""
name_el = await page.query_selector("h1.text-heading-xlarge")
if name_el:
name = (await name_el.text_content()).strip()
headline_el = await page.query_selector("div.text-body-medium.break-words")
if headline_el:
headline = (await headline_el.text_content()).strip()
location_el = await page.query_selector("span.text-body-small.inline")
if location_el:
location = (await location_el.text_content()).strip()
# Experience section
experience = []
exp_items = await page.query_selector_all(
"#experience ~ .pvs-list__outer-container li.pvs-list__paged-list-item"
)
for item in exp_items[:10]:
title_el = await item.query_selector(
"span.t-bold span[aria-hidden='true']"
)
company_el = await item.query_selector(
"span.t-normal:not(.t-black--light) span[aria-hidden='true']"
)
date_el = await item.query_selector(
"span.t-black--light span[aria-hidden='true']"
)
title = (await title_el.text_content()).strip() if title_el else ""
company = (await company_el.text_content()).strip() if company_el else ""
dates = (await date_el.text_content()).strip() if date_el else ""
if title or company:
experience.append({
"title": title, "company": company, "dates": dates
})
# Education section
education = []
edu_items = await page.query_selector_all(
"#education ~ .pvs-list__outer-container li.pvs-list__paged-list-item"
)
for item in edu_items[:10]:
school_el = await item.query_selector(
"span.t-bold span[aria-hidden='true']"
)
degree_el = await item.query_selector(
"span.t-normal:not(.t-black--light) span[aria-hidden='true']"
)
school = (await school_el.text_content()).strip() if school_el else ""
degree = (await degree_el.text_content()).strip() if degree_el else ""
if school:
education.append({"school": school, "degree": degree})
await browser.close()
return {
"name": name,
"headline": headline,
"location": location,
"experience": experience,
"education": education,
"profile_url": profile_url,
}
8. Managing Session Cookies Safely {#cookies}
Never automate the LinkedIn login flow. LinkedIn detects Playwright logins even with stealth plugins and will flag the account, requiring phone verification or triggering a permanent ban.
The safe workflow:
1. Log into LinkedIn manually in a real browser (Chrome or Firefox)
2. Install the "Cookie-Editor" or "EditThisCookie" browser extension
3. Open the extension and export all cookies as JSON
4. Save to ~/linkedin_cookies.json
5. Load this file into Playwright
The li_at cookie is LinkedIn's primary session token. It typically lasts several weeks before expiring.
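One practical wrinkle: Cookie-Editor's JSON export does not match Playwright's add_cookies format exactly -- in particular, its sameSite spellings (such as no_restriction) must be mapped onto the Strict/Lax/None values Playwright accepts. A conversion sketch, assuming the usual Cookie-Editor field names:

```python
def to_playwright_cookies(exported: list) -> list:
    # Map Cookie-Editor sameSite spellings onto the three values
    # Playwright accepts; fall back to "Lax" for anything unknown.
    samesite = {"no_restriction": "None", "none": "None",
                "lax": "Lax", "strict": "Strict"}
    cookies = []
    for c in exported:
        cookie = {
            "name": c["name"],
            "value": c["value"],
            "domain": c["domain"],
            "path": c.get("path", "/"),
            "httpOnly": bool(c.get("httpOnly", False)),
            "secure": bool(c.get("secure", True)),
            "sameSite": samesite.get(
                str(c.get("sameSite", "lax")).lower(), "Lax"),
        }
        if c.get("expirationDate"):  # absent on session cookies
            cookie["expires"] = int(c["expirationDate"])
        cookies.append(cookie)
    return cookies
```

Run the export through this once before handing it to `context.add_cookies()`.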
def validate_linkedin_cookies(cookies_file: str) -> bool:
import time
try:
with open(cookies_file) as f:
cookies = json.load(f)
li_at = next((c for c in cookies if c.get("name") == "li_at"), None)
if not li_at:
print("No li_at cookie found -- re-export from logged-in LinkedIn session")
return False
expiry = li_at.get("expirationDate", 0)
if expiry and expiry < time.time():
print("li_at cookie expired -- re-export cookies from browser")
return False
return True
except Exception as e:
print(f"Cookie validation error: {e}")
return False
For bulk scraping that needs multiple session cookies, maintain separate LinkedIn accounts and log into each one manually in different browser profiles. Be aware that LinkedIn's User Agreement prohibits operating multiple accounts, so spreading load this way multiplies ban risk along with throughput.
9. Company Page Scraping {#company-pages}
LinkedIn company pages have more public data than personal profiles. The JSON-LD schema.org Organization block is available without authentication:
def scrape_company_page(company_slug: str, proxy: str = None) -> dict:
url = f"https://www.linkedin.com/company/{company_slug}/"
kwargs = {
"impersonate": "chrome136",
"timeout": 20,
"headers": LINKEDIN_HEADERS,
}
if proxy:
kwargs["proxies"] = {"https": proxy}
resp = requests.get(url, **kwargs)
if resp.status_code != 200:
return {"error": f"status_{resp.status_code}", "url": url}
for match in re.finditer(
r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
resp.text, re.DOTALL
):
try:
data = json.loads(match.group(1))
candidates = data if isinstance(data, list) else [data]
for item in candidates:
if item.get("@type") in ("Organization", "Corporation"):
return {
"name": item.get("name"),
"url": item.get("url"),
"description": item.get("description"),
"founding_date": item.get("foundingDate"),
"employee_count": (item.get("numberOfEmployees", {})
.get("value")),
"industry": item.get("industry"),
"location": (item.get("address", {})
.get("addressLocality")),
"website": item.get("sameAs"),
"logo": (item.get("logo", {}).get("url")),
"source": "json_ld",
}
except (json.JSONDecodeError, AttributeError):
continue
og = extract_og_meta(resp.text)
return {
"name": og.get("og:title", "").replace(" | LinkedIn", ""),
"description": og.get("og:description", ""),
"image": og.get("og:image", ""),
"url": url,
"source": "og_meta",
}
def batch_scrape_companies(slugs: list, proxy: str = None) -> list:
results = []
for slug in slugs:
result = scrape_company_page(slug, proxy=proxy)
results.append(result)
time.sleep(random.uniform(3, 7))
return results
10. Rate Limits and Soft Ban Avoidance {#rate-limits}
LinkedIn rate limits operate at two levels:
- IP level: Roughly 30-50 requests per hour before IP throttling begins
- Account level: ~80-100 profile views per hour before soft ban (24-48h restriction)
- Soft ban: 24-48 hour restriction where the account can still browse but not view new profiles
- Hard ban: Account suspended -- usually from automated login or extremely high volume
import time
import random
from collections import deque
class LinkedInRateLimiter:
def __init__(self, requests_per_hour: int = 30):
self.interval = 3600.0 / requests_per_hour
self.max_requests = requests_per_hour
self.requests = deque()
self.window = 3600
def wait(self):
now = time.time()
while self.requests and now - self.requests[0] > self.window:
self.requests.popleft()
# Use the configured limit, not a hardcoded 30
if len(self.requests) >= self.max_requests:
sleep_time = self.window - (now - self.requests[0]) + 10
print(f"Hourly limit reached, waiting {sleep_time:.0f}s")
time.sleep(sleep_time)
# Gaussian jitter -- never use fixed intervals
base = self.interval
jitter = random.gauss(0, base * 0.25)
time.sleep(max(base + jitter, 5.0))
self.requests.append(time.time())
def take_break(self):
break_time = random.uniform(300, 600)
print(f"Taking a {break_time:.0f}s break...")
time.sleep(break_time)
limiter = LinkedInRateLimiter(requests_per_hour=25)
def scrape_profiles_safely(profile_urls: list,
proxy: str = None) -> list:
results = []
for i, url in enumerate(profile_urls):
limiter.wait()
if i > 0 and i % 20 == 0:
limiter.take_break()
result = scrape_linkedin_profile(url, proxy=proxy)
results.append(result)
print(f"[{i+1}/{len(profile_urls)}] {result.get('name', 'N/A')}")
return results
Anti-detection rules:
- Never use fixed-interval delays -- add Gaussian jitter
- Take 5-10 minute breaks every 20-30 requests
- Vary your scraping schedule -- don't always scrape at the same time of day
- Mix profile visits with other page types to simulate realistic browsing patterns
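The jitter rule can be sanity-checked offline: Gaussian noise around a base interval yields varied delays, while a floor keeps them from collapsing toward zero. The seed and numbers here are arbitrary:

```python
import random

random.seed(42)  # only to make this demo reproducible
base = 6.0       # seconds between requests
# Same shape as LinkedInRateLimiter.wait: base + N(0, base * 0.25),
# clamped to a 5-second minimum.
delays = [max(base + random.gauss(0, base * 0.25), 5.0)
          for _ in range(10)]
```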
11. Proxy Strategy for LinkedIn {#proxies}
Datacenter IPs are blocked at LinkedIn's network layer. Residential proxies are the only option.
ThorData provides rotating residential proxy pools with US country targeting. Their residential IPs appear as regular household connections to LinkedIn's detection infrastructure.
Critical: use sticky sessions. LinkedIn's detection specifically flags IP address changes mid-session as bot behavior. A single stable residential IP per scraping session is far less suspicious than rotating IPs on every request.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
def get_linkedin_proxy(sticky: bool = True,
session_id: str = None,
country: str = "US") -> str:
import uuid
if sticky:
if not session_id:
session_id = str(uuid.uuid4())[:8]
user = f"{THORDATA_USER}-country-{country.lower()}-session-{session_id}"
else:
user = f"{THORDATA_USER}-country-{country.lower()}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
def create_scraping_session() -> dict:
import uuid, time
session_id = str(uuid.uuid4())[:8]
return {
"session_id": session_id,
"proxy": get_linkedin_proxy(sticky=True, session_id=session_id),
"created_at": time.time(),
"request_count": 0,
"max_requests": 30,
}
def should_rotate_session(session: dict) -> bool:
return session["request_count"] >= session["max_requests"]
def rotate_session(session: dict) -> dict:
print(f"Rotating session {session['session_id']} "
f"after {session['request_count']} requests")
time.sleep(random.uniform(30, 60))
return create_scraping_session()
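The rotation threshold can be dry-run without touching the network; `simulate_rotation` is an invented helper that mirrors the should_rotate_session logic to show how many proxy sessions a batch consumes:

```python
def simulate_rotation(total_requests: int, max_per_session: int = 30) -> int:
    # Count how many sticky sessions a run of N requests consumes when
    # each session is retired after max_per_session requests.
    sessions = 1
    count = 0
    for _ in range(total_requests):
        if count >= max_per_session:  # mirrors should_rotate_session
            sessions += 1
            count = 0
        count += 1
    return sessions
```

Useful for budgeting: 300 profile fetches at 30 requests per session means ten sticky sessions per run.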
12. Storing LinkedIn Data: SQLite Schema {#storage}
import sqlite3
import json
import time
def init_linkedin_db(db_path: str = "linkedin.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute('''CREATE TABLE IF NOT EXISTS profiles (
url TEXT PRIMARY KEY,
name TEXT,
job_title TEXT,
current_company TEXT,
location TEXT,
description TEXT,
image_url TEXT,
education TEXT,
experience TEXT,
skills TEXT,
is_public INTEGER,
scraped_at REAL,
scrape_method TEXT
)''')
conn.execute('''CREATE TABLE IF NOT EXISTS jobs (
job_id TEXT PRIMARY KEY,
title TEXT,
company TEXT,
location TEXT,
url TEXT,
posted_date TEXT,
description TEXT,
employment_type TEXT,
remote INTEGER,
salary_min REAL,
salary_max REAL,
salary_currency TEXT,
search_keywords TEXT,
scraped_at REAL
)''')
conn.execute('''CREATE TABLE IF NOT EXISTS companies (
url TEXT PRIMARY KEY,
name TEXT,
description TEXT,
industry TEXT,
employee_count INTEGER,
founding_date TEXT,
website TEXT,
location TEXT,
logo_url TEXT,
scraped_at REAL
)''')
conn.execute(
"CREATE INDEX IF NOT EXISTS idx_profiles_company ON profiles(current_company)"
)
conn.execute(
"CREATE INDEX IF NOT EXISTS idx_jobs_company ON jobs(company)"
)
conn.execute(
"CREATE INDEX IF NOT EXISTS idx_jobs_title ON jobs(title)"
)
conn.commit()
return conn
def save_profile(conn: sqlite3.Connection, profile: dict):
conn.execute(
"INSERT OR REPLACE INTO profiles VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)",
(
profile.get("url") or profile.get("profile_url"),
profile.get("name"),
profile.get("job_title"),
profile.get("current_company"),
profile.get("location"),
profile.get("description"),
profile.get("image") or profile.get("profile_image"),
json.dumps(profile.get("education", [])),
json.dumps(profile.get("experience", [])),
json.dumps(profile.get("skills", [])),
int(profile.get("is_public", True)),
time.time(),
profile.get("source", "unknown"),
)
)
conn.commit()
def save_job(conn: sqlite3.Connection, job: dict):
if not job.get("job_id"):
return
conn.execute(
"INSERT OR REPLACE INTO jobs VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
(
job["job_id"], job.get("title"), job.get("company"),
job.get("location"), job.get("url"),
job.get("posted_date") or job.get("posted_text"),
job.get("description"), job.get("employment_type"),
int(job.get("remote", False)),
job.get("salary_min"), job.get("salary_max"),
job.get("salary_currency"),
job.get("search_keywords"), time.time()
)
)
conn.commit()
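Once jobs are stored, aggregates are one query away. A self-contained run against an in-memory database with invented rows, trimmed to just the columns the queries need:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    job_id TEXT PRIMARY KEY, title TEXT, company TEXT, remote INTEGER)""")
rows = [
    ("1", "ML Engineer", "Acme", 1),
    ("2", "Data Engineer", "Acme", 0),
    ("3", "ML Engineer", "Globex", 1),
]
conn.executemany("INSERT INTO jobs VALUES (?,?,?,?)", rows)

# Top hiring companies by posting count, plus the remote share.
top = conn.execute(
    "SELECT company, COUNT(*) AS n FROM jobs "
    "GROUP BY company ORDER BY n DESC, company"
).fetchall()
remote_share = conn.execute(
    "SELECT AVG(remote) FROM jobs"
).fetchone()[0]
```

The same queries run unchanged against the linkedin.db schema above; only the extra columns differ.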
13. The LinkedIn Official API {#official-api}
LinkedIn's official API is locked behind partner programs with lengthy approval processes:
- Marketing API: Ad management only
- Community Management API: Company page posting and analytics
- Consumer API: Effectively shut down. What remains is OAuth sign-in only.
If users authenticate your app via LinkedIn OAuth, you can access basic profile data about the authenticated user:
import requests as std_requests
def get_linkedin_oauth_profile(access_token: str) -> dict:
headers = {"Authorization": f"Bearer {access_token}"}
resp = std_requests.get(
"https://api.linkedin.com/v2/userinfo",
headers=headers
)
resp.raise_for_status()
data = resp.json()
return {
"sub": data.get("sub"), # LinkedIn member ID
"name": data.get("name"),
"given_name": data.get("given_name"),
"family_name": data.get("family_name"),
"picture": data.get("picture"),
"email": data.get("email"),
"email_verified": data.get("email_verified"),
}
That is the extent of the official API for data collection: name, email, and profile picture. Nothing about job history, skills, connections, or company data. For research or bulk collection, the official API is a dead end.
14. Real Use Cases and What Is Feasible {#use-cases}
| Goal | Method | Feasibility | Safe Scale |
|---|---|---|---|
| Basic profile (name, title, company, location) | curl-cffi + JSON-LD | High | 500+ per day with proxies |
| Job listings in a field | curl-cffi job search | High | 1000+ per day |
| Full work history and skills | Playwright + auth | Medium | 80-100/hour per account |
| Company data (size, industry, description) | curl-cffi company pages | High | 200+ per day |
| People search results | Playwright + auth | Medium | 200/day per account |
| Contact information | Not publicly available | None | N/A |
| Follower and connection lists | Authenticated -- high ban risk | Low | Not recommended |
Job Market Analysis
def analyze_job_market_for_role(role: str, locations: list,
proxy: str = None) -> dict:
results = {}
db = init_linkedin_db("job_market.db")
for location in locations:
jobs = scrape_jobs_exhaustive(role, location, proxy=proxy)
for job in jobs:
save_job(db, job)
from collections import Counter
# Count postings per company so "top" means most postings, not
# arbitrary set ordering.
company_counts = Counter(j["company"] for j in jobs if j.get("company"))
remote_count = sum(1 for j in jobs
if "Remote" in j.get("location", ""))
results[location] = {
"total_postings": len(jobs),
"unique_companies": len(company_counts),
"remote_percentage": (remote_count / len(jobs) * 100) if jobs else 0,
"top_companies": [c for c, _ in company_counts.most_common(10)],
}
time.sleep(random.uniform(15, 25))
return results
# Example usage
market = analyze_job_market_for_role(
"machine learning engineer",
["San Francisco Bay Area", "New York", "Remote"],
)
for loc, data in market.items():
print(f"{loc}: {data['total_postings']} postings, "
f"{data['remote_percentage']:.0f}% remote")
Competitor Employee Research
def research_competitor_employees(company_slug: str,
role_filter: str = None,
proxy: str = None) -> dict:
company = scrape_company_page(company_slug, proxy=proxy)
jobs = scrape_linkedin_jobs(
keywords=role_filter or "",
location="",
proxy=proxy,
max_pages=3,
)
company_jobs = [
j for j in jobs
if company.get("name", "").lower() in j.get("company", "").lower()
]
return {
"company": company,
"open_roles": company_jobs,
"role_count": len(company_jobs),
}
15. Legal Considerations {#legal}
The hiQ v. LinkedIn litigation set the key CFAA precedent: in 2022 the Ninth Circuit, ruling on remand from the Supreme Court, held that scraping publicly accessible data does not violate the CFAA. However:
- LinkedIn's Terms of Service explicitly prohibit scraping
- Violating ToS creates breach-of-contract exposure
- GDPR and CCPA apply to personal data -- storing scraped profiles of EU or California residents requires lawful basis
- LinkedIn actively sends cease-and-desist letters to large-scale commercial scrapers
In practice:
- Scraping public job listings for market research: very low risk
- Scraping public company pages for competitive analysis: low risk
- Scraping personal profiles for academic research: moderate risk
- Mass-scraping profiles for lead generation databases: high risk (legal threats common)
LinkedIn primarily targets commercial-scale scrapers -- data brokers, surveillance firms, and competitors. Individual developers doing research rarely face legal action, but account bans are routine if rate limits are exceeded.
16. Key Takeaways {#summary}
The non-negotiables:
- Use curl-cffi with chrome136 impersonation -- standard HTTP clients get 999'd at the TLS layer
- Extract JSON-LD from public profiles and job pages for structured data without authentication
- Playwright with saved cookies for full profile data -- never automate the login itself
- Residential proxies with sticky sessions are mandatory -- not just residential IPs, but the same IP for an entire session
Rate limit discipline:
- Stay under ~80 profile views per hour per account for authenticated scraping
- Stay under ~30-40 requests per hour per IP for unauthenticated scraping
- Add Gaussian-jittered delays (never fixed intervals)
- Take 5-10 minute breaks every 20-30 requests to break predictable patterns
What public pages give you without auth:
- Profiles: name, title, company, location, education (JSON-LD)
- Job listings: title, company, location, posted date, full description (JSON-LD)
- Companies: name, description, industry, employee count, location
The infrastructure investment: For proxy infrastructure, ThorData provides the sticky residential sessions that LinkedIn scraping specifically requires. Their US residential pool maintains consistent IP identities across an entire scraping session -- the key difference between a scraper that works and one that gets permanently banned after a dozen requests. Budget for sticky sessions (not rotating per-request) to match how real browser sessions behave.