Scraping LinkedIn Profiles and Job Listings in 2026 (Without Getting Banned)
LinkedIn is arguably the hardest major website to scrape in 2026. It combines aggressive TLS fingerprinting, JavaScript-rendered content behind authentication walls, and a legal team that has actually sued scrapers (hiQ Labs v. LinkedIn spent five years in the courts and reached the Supreme Court before settling). Here's what the detection stack looks like and what actually works to get data out.
Table of Contents
- Why LinkedIn Is Hard: The Detection Stack
- The 999 Status Code: LinkedIn's Bot Signal
- Public Profiles with curl-cffi and JSON-LD
- Parsing JSON-LD Profile Data
- Scraping Public Job Listings
- Paginating and Enriching Job Results
- Authenticated Scraping with Playwright
- Managing Session Cookies Safely
- Company Page Scraping
- Rate Limits and Soft Ban Avoidance
- Proxy Strategy for LinkedIn
- Storing LinkedIn Data: SQLite Schema
- The LinkedIn Official API: What It Provides
- Real Use Cases and What Is Feasible
- Legal Considerations
- Key Takeaways
1. Why LinkedIn Is Hard: The Detection Stack {#why-hard}
LinkedIn's anti-bot system operates on multiple layers:
Layer 1: TLS fingerprinting. LinkedIn checks your TLS client hello against known browser fingerprints. Standard Python requests or httpx get blocked immediately with a 999 status code. The TLS handshake must match a known browser before the request even reaches the application layer.
Layer 2: Authentication wall. Most profile data requires a logged-in session. Public profiles show limited information: name, headline, current company, maybe education. Job listings are partially accessible without auth but paginated results beyond the first few pages require login.
Layer 3: Rate limiting per IP and per account. LinkedIn tracks request velocity per IP and per account. Exceeding roughly 80-100 profile views per hour on a single account triggers soft bans that last 24-48 hours. IP-level throttling happens independently at lower thresholds.
Layer 4: Datacenter IP blocking. LinkedIn maintains blocklists of AWS, GCP, Azure, and major VPS IP ranges. You will get 999 errors or CAPTCHA challenges from any cloud or datacenter IP regardless of how well your TLS fingerprint is crafted.
Layer 5: JavaScript rendering. Profile pages use React with heavy client-side rendering. The initial HTML contains minimal data; the full profile loads via internal GraphQL API calls after page load. Standard HTML scraping without JavaScript execution gets you almost nothing.
Layer 6: Behavioral analysis. LinkedIn monitors navigation patterns, time on page, and interaction sequences. Bots that jump directly to profile URLs without navigating through search pages first are flagged. Human users browse LinkedIn -- they search, click results, scroll profiles.
2. The 999 Status Code {#999-error}
The 999 response is LinkedIn's custom bot detection code. It is not a standard HTTP status -- it is their signal that "we know you are a bot."
Common causes:
- Datacenter or VPS IP address (most common cause)
- TLS fingerprint mismatch (Python requests, httpx, urllib)
- Missing or incorrect headers (sec-ch-ua, sec-fetch-*)
- Rapid sequential requests from the same IP
- Known bot user agent strings
from curl_cffi import requests as cffi_requests
import requests as std_requests
def diagnose_linkedin_access(url: str, proxy: str = None) -> dict:
results = {}
# Test 1: Standard requests library (will get 999 from datacenter)
try:
resp = std_requests.get(url, timeout=10)
results["requests_lib"] = resp.status_code
except Exception as e:
results["requests_lib"] = f"error: {e}"
# Test 2: curl-cffi with Chrome TLS impersonation
try:
kwargs = {"impersonate": "chrome136", "timeout": 10}
if proxy:
kwargs["proxies"] = {"https": proxy}
resp = cffi_requests.get(url, **kwargs)
results["curl_cffi_chrome136"] = resp.status_code
except Exception as e:
results["curl_cffi_chrome136"] = f"error: {e}"
return results
# Without proxy: {"requests_lib": 999, "curl_cffi_chrome136": 999}
# With residential proxy + curl-cffi: {"requests_lib": 999, "curl_cffi_chrome136": 200}
The fix is always the same: curl-cffi with chrome136 impersonation plus a residential proxy. Without both, you will get 999.
3. Public Profiles with curl-cffi {#curl-cffi}
curl-cffi impersonates real browser TLS fingerprints, bypassing LinkedIn's first detection layer. Combined with residential proxies, this works for public profile pages.
from curl_cffi import requests
import time
import random
LINKEDIN_HEADERS = {
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
"image/avif,image/webp,image/apng,*/*;q=0.8",
"accept-language": "en-US,en;q=0.9",
"accept-encoding": "gzip, deflate, br",
"cache-control": "no-cache",
"sec-fetch-dest": "document",
"sec-fetch-mode": "navigate",
"sec-fetch-site": "none",
"sec-fetch-user": "?1",
"upgrade-insecure-requests": "1",
}
def scrape_linkedin_profile(profile_url: str,
proxy: str = None,
retries: int = 3) -> dict:
session = requests.Session()
for attempt in range(retries):
kwargs = {
"headers": LINKEDIN_HEADERS,
"impersonate": "chrome136",
"timeout": 20,
"allow_redirects": True,
}
if proxy:
kwargs["proxies"] = {"https": proxy}
resp = session.get(profile_url, **kwargs)
if resp.status_code == 999:
if proxy:
wait = (2 ** attempt) + random.uniform(1, 3)
time.sleep(wait)
continue
else:
return {"url": profile_url, "error": "999_blocked_no_proxy"}
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 30))
time.sleep(retry_after)
continue
if resp.status_code == 404:
return {"url": profile_url, "error": "profile_not_found"}
if resp.status_code == 200:
return extract_profile_data(resp.text, profile_url)
time.sleep(2 ** attempt)
return {"url": profile_url, "error": "failed_after_retries"}
def batch_scrape_profiles(profile_urls: list,
proxy: str = None,
delay_range: tuple = (4, 9)) -> list:
results = []
for i, url in enumerate(profile_urls):
result = scrape_linkedin_profile(url, proxy=proxy)
results.append(result)
print(f"[{i+1}/{len(profile_urls)}] {result.get('name', 'Unknown')}")
time.sleep(random.uniform(*delay_range))
if (i + 1) % 20 == 0 and i < len(profile_urls) - 1:
break_time = random.uniform(300, 600)
print(f"Taking {break_time:.0f}s break...")
time.sleep(break_time)
return results
4. Parsing JSON-LD Profile Data {#json-ld}
LinkedIn embeds schema.org Person data in script tags for public profiles. This is the most stable data source:
import re
import json
def extract_jsonld(html: str) -> dict:
for match in re.finditer(
r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
html, re.DOTALL
):
try:
data = json.loads(match.group(1))
if isinstance(data, list):
for item in data:
if item.get("@type") == "Person":
return _parse_person_schema(item)
elif data.get("@type") == "Person":
return _parse_person_schema(data)
except json.JSONDecodeError:
continue
return {}
def _parse_person_schema(schema: dict) -> dict:
works_for = schema.get("worksFor", [])
if isinstance(works_for, dict):
works_for = [works_for]
alumni_of = schema.get("alumniOf", [])
if isinstance(alumni_of, dict):
alumni_of = [alumni_of]
address = schema.get("address", {})
if isinstance(address, str):
location = address
else:
parts = [address.get("addressLocality"), address.get("addressRegion"),
address.get("addressCountry")]
location = ", ".join(p for p in parts if p)
return {
"name": schema.get("name"),
"job_title": schema.get("jobTitle"),
"url": schema.get("url"),
"description": schema.get("description"),
"location": location,
"profile_image": schema.get("image"),
"same_as": schema.get("sameAs", []),
"current_company": works_for[0].get("name") if works_for else None,
"current_company_url": works_for[0].get("url") if works_for else None,
"education": [
{"school": e.get("name"), "url": e.get("url")}
for e in alumni_of
],
}
def extract_og_meta(html: str) -> dict:
og = {}
for match in re.finditer(
r'<meta\s+(?:property|name)="(og:[^"]+)"\s+content="([^"]*)"',
html, re.IGNORECASE
):
og[match.group(1)] = match.group(2)
return og
def extract_profile_data(html: str, url: str) -> dict:
result = {"url": url}
json_ld = extract_jsonld(html)
if json_ld:
result.update(json_ld)
og_data = extract_og_meta(html)
result.setdefault("title", og_data.get("og:title", ""))
result.setdefault("description", og_data.get("og:description", ""))
result.setdefault("image", og_data.get("og:image", ""))
return result
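The extraction flow above can be exercised offline against a canned snippet. The HTML, names, and field values below are all invented for illustration; `parse_person` is a condensed stand-in for the `extract_jsonld` plus `_parse_person_schema` pipeline:

```python
import json
import re

# A trimmed, invented example of the schema.org Person block LinkedIn
# embeds in public profile HTML (real blocks carry more fields).
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Person",
 "name": "Jane Doe", "jobTitle": "Data Engineer",
 "worksFor": {"@type": "Organization", "name": "Acme Corp"},
 "address": {"addressLocality": "Austin", "addressRegion": "Texas"}}
</script>
</head><body></body></html>
"""

def parse_person(html: str) -> dict:
    # Same regex-plus-json.loads approach as extract_jsonld above.
    for m in re.finditer(
        r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
        html, re.DOTALL,
    ):
        data = json.loads(m.group(1))
        if data.get("@type") == "Person":
            company = data.get("worksFor") or {}
            addr = data.get("address") or {}
            return {
                "name": data.get("name"),
                "job_title": data.get("jobTitle"),
                "current_company": company.get("name"),
                "location": ", ".join(
                    p for p in (addr.get("addressLocality"),
                                addr.get("addressRegion")) if p
                ),
            }
    return {}
```

Running this against real profile HTML follows the same shape; only the regex and the schema keys matter.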
What JSON-LD Gives You from Public Profiles
From a typical LinkedIn public profile:
- Full name, job title / headline
- Current employer name and URL
- Location (city/region)
- Education institutions
- Profile photo URL
- Bio/summary (sometimes truncated)
What requires login:
- Full work history with dates and descriptions
- Skills list with endorsement counts
- Recommendations text
- Connections and follower counts
- Contact information
- Posts and activity feed
5. Scraping Public Job Listings {#jobs}
LinkedIn job listings are partially accessible without authentication. The search URL structure:
https://www.linkedin.com/jobs/search/?keywords=python+developer&location=United+States&start=0
Key URL parameters:
| Parameter | Description | Example Values |
|---|---|---|
| keywords | Job title or skills | python+developer |
| location | City, state, or country | United+States, New+York |
| f_TPR | Time posted | r86400 (24h), r604800 (7d), r2592000 (30d) |
| f_JT | Job type | F (full-time), C (contract), P (part-time) |
| f_E | Experience level | 2 (entry), 3 (assoc), 4 (mid-senior), 5 (director) |
| f_WT | Remote filter | 1 (on-site), 2 (remote), 3 (hybrid) |
| start | Pagination offset | 0, 25, 50... |
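These URLs can also be composed with the standard library instead of manual string joins; `build_jobs_url` is a helper name invented here, and `urlencode`'s default quoting turns spaces into `+` just like the examples above:

```python
from urllib.parse import urlencode

BASE = "https://www.linkedin.com/jobs/search/"

def build_jobs_url(keywords: str, location: str, start: int = 0,
                   **filters) -> str:
    # urlencode defaults to quote_plus, so spaces become '+' and
    # everything else is percent-escaped safely.
    params = {"keywords": keywords, "location": location,
              "start": start, **filters}
    return BASE + "?" + urlencode(params)

url = build_jobs_url("python developer", "United States",
                     f_TPR="r604800", f_WT="2")
```

This avoids the edge cases (ampersands, commas, non-ASCII city names) that naive `.replace(" ", "+")` misses.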
from selectolax.parser import HTMLParser
def scrape_linkedin_jobs(keywords: str, location: str,
proxy: str = None,
max_pages: int = 5,
filters: dict = None) -> list:
jobs = []
session = requests.Session()
filters = filters or {}
for page in range(max_pages):
params_parts = [
f"keywords={keywords.replace(' ', '+')}",
f"location={location.replace(' ', '+')}",
f"start={page * 25}",
]
for k, v in filters.items():
params_parts.append(f"{k}={v}")
url = "https://www.linkedin.com/jobs/search/?" + "&".join(params_parts)
kwargs = {
"impersonate": "chrome136",
"timeout": 20,
"headers": LINKEDIN_HEADERS,
}
if proxy:
kwargs["proxies"] = {"https": proxy}
resp = session.get(url, **kwargs)
if resp.status_code == 999:
print(f"Blocked on page {page}. Residential proxy required.")
break
if resp.status_code != 200:
print(f"Status {resp.status_code} on page {page}")
break
tree = HTMLParser(resp.text)
page_jobs = []
for card in tree.css("div.base-search-card"):
title_el = card.css_first("h3.base-search-card__title")
company_el = card.css_first("h4.base-search-card__subtitle a")
location_el = card.css_first("span.job-search-card__location")
link_el = card.css_first("a.base-card__full-link")
date_el = card.css_first("time")
badge_el = card.css_first("span.result-benefits__text")
job_url = link_el.attributes.get("href", "") if link_el else ""
job_id = ""
id_match = re.search(r'/jobs/view/(\d+)', job_url)
if id_match:
job_id = id_match.group(1)
page_jobs.append({
"job_id": job_id,
"title": title_el.text(strip=True) if title_el else "",
"company": company_el.text(strip=True) if company_el else "",
"location": location_el.text(strip=True) if location_el else "",
"url": job_url,
"posted_date": date_el.attributes.get("datetime", "") if date_el else "",
"posted_text": date_el.text(strip=True) if date_el else "",
"badge": badge_el.text(strip=True) if badge_el else "",
"search_keywords": keywords,
"search_location": location,
})
if not page_jobs:
break
jobs.extend(page_jobs)
time.sleep(random.uniform(3, 6))
return jobs
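The /jobs/view/ ID regex used in the card loop can be checked offline; the URLs below are made up:

```python
import re

def extract_job_id(job_url: str) -> str:
    # LinkedIn job URLs embed a numeric ID right after /jobs/view/.
    m = re.search(r"/jobs/view/(\d+)", job_url)
    return m.group(1) if m else ""

urls = [
    "https://www.linkedin.com/jobs/view/3871234567/?refId=abc",
    "https://www.linkedin.com/feed/",  # not a job page
]
ids = [extract_job_id(u) for u in urls]
```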
6. Paginating and Enriching Job Results {#job-pagination}
To get more than ~1,000 results, batch by time period. To get full job descriptions, fetch each job's detail page:
def scrape_jobs_exhaustive(keywords: str, location: str,
proxy: str = None) -> list:
all_jobs = {}
time_filters = ["r86400", "r604800", "r2592000"]
for tf in time_filters:
jobs = scrape_linkedin_jobs(
keywords, location, proxy=proxy,
max_pages=10, filters={"f_TPR": tf}
)
for job in jobs:
if job.get("job_id"):
all_jobs[job["job_id"]] = job
time.sleep(random.uniform(10, 20))
return list(all_jobs.values())
def get_job_details(job_id: str, proxy: str = None) -> dict:
url = f"https://www.linkedin.com/jobs/view/{job_id}/"
kwargs = {
"impersonate": "chrome136",
"timeout": 20,
"headers": LINKEDIN_HEADERS,
}
if proxy:
kwargs["proxies"] = {"https": proxy}
resp = requests.get(url, **kwargs)
if resp.status_code != 200:
return {"job_id": job_id, "error": f"status_{resp.status_code}"}
# Extract JSON-LD JobPosting schema
for match in re.finditer(
r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
resp.text, re.DOTALL
):
try:
data = json.loads(match.group(1))
if data.get("@type") == "JobPosting":
return {
"job_id": job_id,
"title": data.get("title"),
"company": (data.get("hiringOrganization") or {}).get("name"),
"location": ((data.get("jobLocation") or {})
.get("address", {})
.get("addressLocality")),
"description": data.get("description"),
"employment_type": data.get("employmentType"),
"posted_date": data.get("datePosted"),
"valid_through": data.get("validThrough"),
"salary_min": ((data.get("baseSalary") or {})
.get("value", {}).get("minValue")),
"salary_max": ((data.get("baseSalary") or {})
.get("value", {}).get("maxValue")),
"remote": data.get("jobLocationType") == "TELECOMMUTE",
"source": "json_ld",
}
except (json.JSONDecodeError, AttributeError):
continue
return {"job_id": job_id, "url": url, "source": "no_schema_found"}
def enrich_jobs_with_details(jobs: list, proxy: str = None,
delay_range: tuple = (3, 7)) -> list:
enriched = []
for i, job in enumerate(jobs):
if job.get("job_id"):
details = get_job_details(job["job_id"], proxy=proxy)
merged = {**job, **{k: v for k, v in details.items()
if v and k not in job}}
enriched.append(merged)
else:
enriched.append(job)
time.sleep(random.uniform(*delay_range))
return enriched
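The dict-merge inside enrich_jobs_with_details gives the search-card fields precedence: a detail-page value is only copied in when it is truthy and the key is absent from the card. A quick offline check with invented records:

```python
# Invented search-card row and detail-page row for the same posting.
card = {"job_id": "123", "title": "ML Engineer", "company": "Acme"}
details = {"job_id": "123", "title": "Machine Learning Engineer",
           "description": "Full posting text...", "salary_min": 150000}

# Same expression as in enrich_jobs_with_details: detail fields win
# only when truthy AND not already present on the card.
merged = {**card, **{k: v for k, v in details.items()
                     if v and k not in card}}
```

So the card's "ML Engineer" title survives, while description and salary are filled in from the detail page.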
7. Authenticated Scraping with Playwright {#playwright}
For full profile data and search results beyond the first few pages, you need an authenticated session:
import asyncio
import json
import random
from playwright.async_api import async_playwright
async def scrape_authenticated_profile(profile_url: str,
cookies_file: str,
proxy_url: str = None) -> dict:
async with async_playwright() as p:
launch_args = {
"headless": True,
"args": [
"--disable-blink-features=AutomationControlled",
"--disable-features=IsolateOrigins,site-per-process",
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
],
}
if proxy_url:
launch_args["proxy"] = {"server": proxy_url}
browser = await p.chromium.launch(**launch_args)
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/136.0.0.0 Safari/537.36"
),
locale="en-US",
timezone_id="America/New_York",
)
# Load saved cookies from browser export
with open(cookies_file) as f:
cookies = json.load(f)
await context.add_cookies(cookies)
page = await context.new_page()
await page.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)
await page.goto(profile_url, wait_until="domcontentloaded")
await page.wait_for_timeout(2000 + int(1500 * random.random()))
# Simulate reading by scrolling
await page.evaluate("window.scrollTo(0, 500)")
await page.wait_for_timeout(1500)
await page.evaluate("window.scrollTo(0, 1200)")
await page.wait_for_timeout(1000)
name = headline = location = ""
name_el = await page.query_selector("h1.text-heading-xlarge")
if name_el:
name = (await name_el.text_content()).strip()
headline_el = await page.query_selector("div.text-body-medium.break-words")
if headline_el:
headline = (await headline_el.text_content()).strip()
location_el = await page.query_selector("span.text-body-small.inline")
if location_el:
location = (await location_el.text_content()).strip()
# Experience section
experience = []
exp_items = await page.query_selector_all(
"#experience ~ .pvs-list__outer-container li.pvs-list__paged-list-item"
)
for item in exp_items[:10]:
title_el = await item.query_selector(
"span.t-bold span[aria-hidden='true']"
)
company_el = await item.query_selector(
"span.t-normal:not(.t-black--light) span[aria-hidden='true']"
)
date_el = await item.query_selector(
"span.t-black--light span[aria-hidden='true']"
)
title = (await title_el.text_content()).strip() if title_el else ""
company = (await company_el.text_content()).strip() if company_el else ""
dates = (await date_el.text_content()).strip() if date_el else ""
if title or company:
experience.append({
"title": title, "company": company, "dates": dates
})
# Education section
education = []
edu_items = await page.query_selector_all(
"#education ~ .pvs-list__outer-container li.pvs-list__paged-list-item"
)
for item in edu_items[:10]:
school_el = await item.query_selector(
"span.t-bold span[aria-hidden='true']"
)
degree_el = await item.query_selector(
"span.t-normal:not(.t-black--light) span[aria-hidden='true']"
)
school = (await school_el.text_content()).strip() if school_el else ""
degree = (await degree_el.text_content()).strip() if degree_el else ""
if school:
education.append({"school": school, "degree": degree})
await browser.close()
return {
"name": name,
"headline": headline,
"location": location,
"experience": experience,
"education": education,
"profile_url": profile_url,
}
8. Managing Session Cookies Safely {#cookies}
Never automate the LinkedIn login flow. LinkedIn detects Playwright logins even with stealth plugins and will flag the account, requiring phone verification or triggering a permanent ban.
The safe workflow:
1. Log into LinkedIn manually in a real browser (Chrome or Firefox)
2. Install the "Cookie-Editor" or "EditThisCookie" browser extension
3. Open the extension and export all cookies as JSON
4. Save to ~/linkedin_cookies.json
5. Load this file into Playwright
The li_at cookie is LinkedIn's primary session token. It typically lasts several weeks before expiring.
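One practical wrinkle: Cookie-Editor's JSON export does not match Playwright's add_cookies format exactly -- in particular, its sameSite spellings (such as no_restriction) must be mapped onto the Strict/Lax/None values Playwright accepts. A conversion sketch, assuming the usual Cookie-Editor field names:

```python
def to_playwright_cookies(exported: list) -> list:
    # Map Cookie-Editor sameSite spellings onto the three values
    # Playwright accepts; fall back to "Lax" for anything unknown.
    samesite = {"no_restriction": "None", "none": "None",
                "lax": "Lax", "strict": "Strict"}
    cookies = []
    for c in exported:
        cookie = {
            "name": c["name"],
            "value": c["value"],
            "domain": c["domain"],
            "path": c.get("path", "/"),
            "httpOnly": bool(c.get("httpOnly", False)),
            "secure": bool(c.get("secure", True)),
            "sameSite": samesite.get(
                str(c.get("sameSite", "lax")).lower(), "Lax"),
        }
        if c.get("expirationDate"):  # absent on session cookies
            cookie["expires"] = int(c["expirationDate"])
        cookies.append(cookie)
    return cookies
```

Run the export through this once before handing it to `context.add_cookies()`.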
def validate_linkedin_cookies(cookies_file: str) -> bool:
import time
try:
with open(cookies_file) as f:
cookies = json.load(f)
li_at = next((c for c in cookies if c.get("name") == "li_at"), None)
if not li_at:
print("No li_at cookie found -- re-export from logged-in LinkedIn session")
return False
expiry = li_at.get("expirationDate", 0)
if expiry and expiry < time.time():
print("li_at cookie expired -- re-export cookies from browser")
return False
return True
except Exception as e:
print(f"Cookie validation error: {e}")
return False
For bulk scraping that needs multiple session cookies, maintain separate LinkedIn accounts and log into each one manually in different browser profiles. Be aware that LinkedIn's User Agreement prohibits operating multiple accounts, so spreading load this way multiplies ban risk along with throughput.
9. Company Page Scraping {#company-pages}
LinkedIn company pages have more public data than personal profiles. The JSON-LD schema.org Organization block is available without authentication:
def scrape_company_page(company_slug: str, proxy: str = None) -> dict:
url = f"https://www.linkedin.com/company/{company_slug}/"
kwargs = {
"impersonate": "chrome136",
"timeout": 20,
"headers": LINKEDIN_HEADERS,
}
if proxy:
kwargs["proxies"] = {"https": proxy}
resp = requests.get(url, **kwargs)
if resp.status_code != 200:
return {"error": f"status_{resp.status_code}", "url": url}
for match in re.finditer(
r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
resp.text, re.DOTALL
):
try:
data = json.loads(match.group(1))
candidates = data if isinstance(data, list) else [data]
for item in candidates:
if item.get("@type") in ("Organization", "Corporation"):
return {
"name": item.get("name"),
"url": item.get("url"),
"description": item.get("description"),
"founding_date": item.get("foundingDate"),
"employee_count": (item.get("numberOfEmployees", {})
.get("value")),
"industry": item.get("industry"),
"location": (item.get("address", {})
.get("addressLocality")),
"website": item.get("sameAs"),
"logo": (item.get("logo", {}).get("url")),
"source": "json_ld",
}
except (json.JSONDecodeError, AttributeError):
continue
og = extract_og_meta(resp.text)
return {
"name": og.get("og:title", "").replace(" | LinkedIn", ""),
"description": og.get("og:description", ""),
"image": og.get("og:image", ""),
"url": url,
"source": "og_meta",
}
def batch_scrape_companies(slugs: list, proxy: str = None) -> list:
results = []
for slug in slugs:
result = scrape_company_page(slug, proxy=proxy)
results.append(result)
time.sleep(random.uniform(3, 7))
return results
10. Rate Limits and Soft Ban Avoidance {#rate-limits}
LinkedIn rate limits operate at two levels:
- IP level: Roughly 30-50 requests per hour before IP throttling begins
- Account level: ~80-100 profile views per hour before soft ban (24-48h restriction)
- Soft ban: 24-48 hour restriction where the account can still browse but not view new profiles
- Hard ban: Account suspended -- usually from automated login or extremely high volume
import time
import random
from collections import deque
class LinkedInRateLimiter:
def __init__(self, requests_per_hour: int = 30):
self.interval = 3600.0 / requests_per_hour
self.max_requests = requests_per_hour
self.requests = deque()
self.window = 3600
def wait(self):
now = time.time()
while self.requests and now - self.requests[0] > self.window:
self.requests.popleft()
# Use the configured limit, not a hardcoded 30
if len(self.requests) >= self.max_requests:
sleep_time = self.window - (now - self.requests[0]) + 10
print(f"Hourly limit reached, waiting {sleep_time:.0f}s")
time.sleep(sleep_time)
# Gaussian jitter -- never use fixed intervals
base = self.interval
jitter = random.gauss(0, base * 0.25)
time.sleep(max(base + jitter, 5.0))
self.requests.append(time.time())
def take_break(self):
break_time = random.uniform(300, 600)
print(f"Taking a {break_time:.0f}s break...")
time.sleep(break_time)
limiter = LinkedInRateLimiter(requests_per_hour=25)
def scrape_profiles_safely(profile_urls: list,
proxy: str = None) -> list:
results = []
for i, url in enumerate(profile_urls):
limiter.wait()
if i > 0 and i % 20 == 0:
limiter.take_break()
result = scrape_linkedin_profile(url, proxy=proxy)
results.append(result)
print(f"[{i+1}/{len(profile_urls)}] {result.get('name', 'N/A')}")
return results
Anti-detection rules:
- Never use fixed-interval delays -- add Gaussian jitter
- Take 5-10 minute breaks every 20-30 requests
- Vary your scraping schedule -- don't always scrape at the same time of day
- Mix profile visits with other page types to simulate realistic browsing patterns
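The jitter rule can be sanity-checked offline: Gaussian noise around a base interval yields varied delays, while a floor keeps them from collapsing toward zero. The seed and numbers here are arbitrary:

```python
import random

random.seed(42)  # only to make this demo reproducible
base = 6.0       # seconds between requests
# Same shape as LinkedInRateLimiter.wait: base + N(0, base * 0.25),
# clamped to a 5-second minimum.
delays = [max(base + random.gauss(0, base * 0.25), 5.0)
          for _ in range(10)]
```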
11. Proxy Strategy for LinkedIn {#proxies}
Datacenter IPs are blocked at LinkedIn's network layer. Residential proxies are the only option.
ThorData provides rotating residential proxy pools with US country targeting. Their residential IPs appear as regular household connections to LinkedIn's detection infrastructure.
Critical: use sticky sessions. LinkedIn's detection specifically flags IP address changes mid-session as bot behavior. A single stable residential IP per scraping session is far less suspicious than rotating IPs on every request.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
def get_linkedin_proxy(sticky: bool = True,
session_id: str = None,
country: str = "US") -> str:
import uuid
if sticky:
if not session_id:
session_id = str(uuid.uuid4())[:8]
user = f"{THORDATA_USER}-country-{country.lower()}-session-{session_id}"
else:
user = f"{THORDATA_USER}-country-{country.lower()}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
def create_scraping_session() -> dict:
import uuid, time
session_id = str(uuid.uuid4())[:8]
return {
"session_id": session_id,
"proxy": get_linkedin_proxy(sticky=True, session_id=session_id),
"created_at": time.time(),
"request_count": 0,
"max_requests": 30,
}
def should_rotate_session(session: dict) -> bool:
return session["request_count"] >= session["max_requests"]
def rotate_session(session: dict) -> dict:
print(f"Rotating session {session['session_id']} "
f"after {session['request_count']} requests")
time.sleep(random.uniform(30, 60))
return create_scraping_session()
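The rotation threshold can be dry-run without touching the network; `simulate_rotation` is an invented helper that mirrors the should_rotate_session logic to show how many proxy sessions a batch consumes:

```python
def simulate_rotation(total_requests: int, max_per_session: int = 30) -> int:
    # Count how many sticky sessions a run of N requests consumes when
    # each session is retired after max_per_session requests.
    sessions = 1
    count = 0
    for _ in range(total_requests):
        if count >= max_per_session:  # mirrors should_rotate_session
            sessions += 1
            count = 0
        count += 1
    return sessions
```

Useful for budgeting: 300 profile fetches at 30 requests per session means ten sticky sessions per run.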
12. Storing LinkedIn Data: SQLite Schema {#storage}
import sqlite3
import json
import time
def init_linkedin_db(db_path: str = "linkedin.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute('''CREATE TABLE IF NOT EXISTS profiles (
url TEXT PRIMARY KEY,
name TEXT,
job_title TEXT,
current_company TEXT,
location TEXT,
description TEXT,
image_url TEXT,
education TEXT,
experience TEXT,
skills TEXT,
is_public INTEGER,
scraped_at REAL,
scrape_method TEXT
)''')
conn.execute('''CREATE TABLE IF NOT EXISTS jobs (
job_id TEXT PRIMARY KEY,
title TEXT,
company TEXT,
location TEXT,
url TEXT,
posted_date TEXT,
description TEXT,
employment_type TEXT,
remote INTEGER,
salary_min REAL,
salary_max REAL,
salary_currency TEXT,
search_keywords TEXT,
scraped_at REAL
)''')
conn.execute('''CREATE TABLE IF NOT EXISTS companies (
url TEXT PRIMARY KEY,
name TEXT,
description TEXT,
industry TEXT,
employee_count INTEGER,
founding_date TEXT,
website TEXT,
location TEXT,
logo_url TEXT,
scraped_at REAL
)''')
conn.execute(
"CREATE INDEX IF NOT EXISTS idx_profiles_company ON profiles(current_company)"
)
conn.execute(
"CREATE INDEX IF NOT EXISTS idx_jobs_company ON jobs(company)"
)
conn.execute(
"CREATE INDEX IF NOT EXISTS idx_jobs_title ON jobs(title)"
)
conn.commit()
return conn
def save_profile(conn: sqlite3.Connection, profile: dict):
conn.execute(
"INSERT OR REPLACE INTO profiles VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)",
(
profile.get("url") or profile.get("profile_url"),
profile.get("name"),
profile.get("job_title"),
profile.get("current_company"),
profile.get("location"),
profile.get("description"),
profile.get("image") or profile.get("profile_image"),
json.dumps(profile.get("education", [])),
json.dumps(profile.get("experience", [])),
json.dumps(profile.get("skills", [])),
int(profile.get("is_public", True)),
time.time(),
profile.get("source", "unknown"),
)
)
conn.commit()
def save_job(conn: sqlite3.Connection, job: dict):
if not job.get("job_id"):
return
conn.execute(
"INSERT OR REPLACE INTO jobs VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
(
job["job_id"], job.get("title"), job.get("company"),
job.get("location"), job.get("url"),
job.get("posted_date") or job.get("posted_text"),
job.get("description"), job.get("employment_type"),
int(job.get("remote", False)),
job.get("salary_min"), job.get("salary_max"),
job.get("salary_currency"),
job.get("search_keywords"), time.time()
)
)
conn.commit()
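Once jobs are stored, aggregates are one query away. A self-contained run against an in-memory database with invented rows, trimmed to just the columns the queries need:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs (
    job_id TEXT PRIMARY KEY, title TEXT, company TEXT, remote INTEGER)""")
rows = [
    ("1", "ML Engineer", "Acme", 1),
    ("2", "Data Engineer", "Acme", 0),
    ("3", "ML Engineer", "Globex", 1),
]
conn.executemany("INSERT INTO jobs VALUES (?,?,?,?)", rows)

# Top hiring companies by posting count, plus the remote share.
top = conn.execute(
    "SELECT company, COUNT(*) AS n FROM jobs "
    "GROUP BY company ORDER BY n DESC, company"
).fetchall()
remote_share = conn.execute(
    "SELECT AVG(remote) FROM jobs"
).fetchone()[0]
```

The same queries run unchanged against the linkedin.db schema above; only the extra columns differ.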
13. The LinkedIn Official API {#official-api}
LinkedIn's official API is locked behind partner programs with lengthy approval processes:
- Marketing API: Ad management only
- Community Management API: Company page posting and analytics
- Consumer API: Effectively shut down. What remains is OAuth sign-in only.
If users authenticate your app via LinkedIn OAuth, you can access basic profile data about the authenticated user:
import requests as std_requests
def get_linkedin_oauth_profile(access_token: str) -> dict:
headers = {"Authorization": f"Bearer {access_token}"}
resp = std_requests.get(
"https://api.linkedin.com/v2/userinfo",
headers=headers
)
resp.raise_for_status()
data = resp.json()
return {
"sub": data.get("sub"), # LinkedIn member ID
"name": data.get("name"),
"given_name": data.get("given_name"),
"family_name": data.get("family_name"),
"picture": data.get("picture"),
"email": data.get("email"),
"email_verified": data.get("email_verified"),
}
That is the extent of the official API for data collection: name, email, and profile picture. Nothing about job history, skills, connections, or company data. For research or bulk collection, the official API is a dead end.
14. Real Use Cases and What Is Feasible {#use-cases}
| Goal | Method | Feasibility | Safe Scale |
|---|---|---|---|
| Basic profile (name, title, company, location) | curl-cffi + JSON-LD | High | 500+ per day with proxies |
| Job listings in a field | curl-cffi job search | High | 1000+ per day |
| Full work history and skills | Playwright + auth | Medium | 80-100/hour per account |
| Company data (size, industry, description) | curl-cffi company pages | High | 200+ per day |
| People search results | Playwright + auth | Medium | 200/day per account |
| Contact information | Not publicly available | None | N/A |
| Follower and connection lists | Authenticated -- high ban risk | Low | Not recommended |
Job Market Analysis
def analyze_job_market_for_role(role: str, locations: list,
proxy: str = None) -> dict:
results = {}
db = init_linkedin_db("job_market.db")
for location in locations:
jobs = scrape_jobs_exhaustive(role, location, proxy=proxy)
for job in jobs:
save_job(db, job)
from collections import Counter
# Count postings per company so "top" means most postings, not
# arbitrary set ordering.
company_counts = Counter(j["company"] for j in jobs if j.get("company"))
remote_count = sum(1 for j in jobs
if "Remote" in j.get("location", ""))
results[location] = {
"total_postings": len(jobs),
"unique_companies": len(company_counts),
"remote_percentage": (remote_count / len(jobs) * 100) if jobs else 0,
"top_companies": [c for c, _ in company_counts.most_common(10)],
}
time.sleep(random.uniform(15, 25))
return results
# Example usage
market = analyze_job_market_for_role(
"machine learning engineer",
["San Francisco Bay Area", "New York", "Remote"],
)
for loc, data in market.items():
print(f"{loc}: {data['total_postings']} postings, "
f"{data['remote_percentage']:.0f}% remote")
Competitor Employee Research
def research_competitor_employees(company_slug: str,
role_filter: str = None,
proxy: str = None) -> dict:
company = scrape_company_page(company_slug, proxy=proxy)
jobs = scrape_linkedin_jobs(
keywords=role_filter or "",
location="",
proxy=proxy,
max_pages=3,
)
company_jobs = [
j for j in jobs
if company.get("name", "").lower() in j.get("company", "").lower()
]
return {
"company": company,
"open_roles": company_jobs,
"role_count": len(company_jobs),
}
15. Legal Considerations {#legal}
The hiQ v. LinkedIn litigation set the key CFAA precedent: in 2022 the Ninth Circuit, ruling on remand from the Supreme Court, held that scraping publicly accessible data does not violate the CFAA. However:
- LinkedIn's Terms of Service explicitly prohibit scraping
- Violating ToS creates breach-of-contract exposure
- GDPR and CCPA apply to personal data -- storing scraped profiles of EU or California residents requires lawful basis
- LinkedIn actively sends cease-and-desist letters to large-scale commercial scrapers
In practice:
- Scraping public job listings for market research: very low risk
- Scraping public company pages for competitive analysis: low risk
- Scraping personal profiles for academic research: moderate risk
- Mass-scraping profiles for lead generation databases: high risk (legal threats common)
LinkedIn primarily targets commercial-scale scrapers -- data brokers, surveillance firms, and competitors. Individual developers doing research rarely face legal action, but account bans are routine if rate limits are exceeded.
16. Key Takeaways {#summary}
The non-negotiables:
- Use curl-cffi with chrome136 impersonation -- standard HTTP clients get 999'd at the TLS layer
- Extract JSON-LD from public profiles and job pages for structured data without authentication
- Playwright with saved cookies for full profile data -- never automate the login itself
- Residential proxies with sticky sessions are mandatory -- not just residential IPs, but the same IP for an entire session
Rate limit discipline:
- Stay under ~80 profile views per hour per account for authenticated scraping
- Stay under ~30-40 requests per hour per IP for unauthenticated scraping
- Add Gaussian-jittered delays (never fixed intervals)
- Take 5-10 minute breaks every 20-30 requests to break predictable patterns
What public pages give you without auth:
- Profiles: name, title, company, location, education (JSON-LD)
- Job listings: title, company, location, posted date, full description (JSON-LD)
- Companies: name, description, industry, employee count, location
The infrastructure investment: For proxy infrastructure, ThorData provides the sticky residential sessions that LinkedIn scraping specifically requires. Their US residential pool maintains consistent IP identities across an entire scraping session -- the key difference between a scraper that works and one that gets permanently banned after a dozen requests. Budget for sticky sessions (not rotating per-request) to match how real browser sessions behave.