Scraping AngelList/Wellfound Jobs (2026)
Wellfound (formerly AngelList Talent) is unmatched as a source of startup job data. Each listing combines salary ranges, equity percentages, company stage, funding history, and tech stack — information that is genuinely hard to find consolidated anywhere else. Job boards like LinkedIn or Indeed don't show equity. Crunchbase doesn't show open roles. Wellfound shows all of it.
This post covers how to extract it programmatically in 2026 — from the GraphQL API, through auth workarounds, to pagination, proxy integration, and building a usable database for tracking salary and equity trends.
What Data You Can Extract
A single Wellfound job listing exposes a surprising amount of structured data:
Job Fields
- Title — role name and slug
- Compensation — salary range as string (e.g., "$120K – $160K")
- Equity — percentage range (e.g., "0.10% – 0.50%")
- Remote flag — boolean
- Location names — cities/regions where role is based
- Role type — full-time, contract, internship
- Experience level — entry, mid, senior
- Start date — earliest start date
Company Fields (nested in each job)
- Name — company display name
- Company size — headcount bucket (1-10, 11-50, 51-200, etc.)
- High concept — one-line description
- Product description — longer description
- Funding stage — seed, Series A, B, C, growth, etc.
- Total raised — dollar amount
- Investors — named VC firms
- Tech stack — tools and frameworks used
This makes Wellfound useful not just for job hunting but for salary benchmarking, market research, investor tracking, and startup intelligence tools.
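Put together, one listing flattens into a record shaped roughly like this. This is a hypothetical example: the values are invented for illustration and the field names simply mirror the lists above.

```python
# Hypothetical example of the data a single listing exposes.
# All values are invented for illustration.
example_listing = {
    "title": "Senior Backend Engineer",
    "compensation": "$140K – $180K",   # salary range, returned as a string
    "equity": "0.10% – 0.50%",         # equity range, returned as a string
    "remote": True,
    "location_names": ["San Francisco", "Remote"],
    "role_type": "full-time",
    "experience_level": "senior",
    "company": {
        "name": "ExampleCo",
        "company_size": "11-50",
        "high_concept": "Payments infrastructure for marketplaces",
        "funding_stage": "Series A",
        "total_raised": 12_000_000,
        "investors": ["Example Ventures"],
        "tech_stack": ["Python", "PostgreSQL", "Kubernetes"],
    },
}
```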
GraphQL API Structure
Wellfound's frontend is a Next.js app that talks to a GraphQL endpoint at https://wellfound.com/graphql. The schema is not publicly documented but has been stable for years. Job search goes through talent.jobSearchResultsByPage.
import httpx
import json
import time
import random
from typing import Optional
HEADERS = {
"Content-Type": "application/json",
"Accept": "application/json",
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
),
"Referer": "https://wellfound.com/jobs",
"Origin": "https://wellfound.com",
"Accept-Language": "en-US,en;q=0.9",
}
GQL_URL = "https://wellfound.com/graphql"
def graphql_post(query, variables, proxy_url=None, timeout=20):
"""Execute a Wellfound GraphQL query."""
kwargs = {
"headers": HEADERS,
"timeout": timeout,
"follow_redirects": True,
}
    if proxy_url:
        # httpx >= 0.26 takes a single `proxy` argument; the older
        # `proxies` mapping was removed in later releases
        kwargs["proxy"] = proxy_url
try:
with httpx.Client(**kwargs) as client:
resp = client.post(GQL_URL, json={"query": query, "variables": variables})
resp.raise_for_status()
return resp.json()
except httpx.HTTPStatusError as e:
print(f"HTTP error: {e.response.status_code}")
return None
except Exception as e:
print(f"Request error: {e}")
return None
def search_jobs(
role: str = "software-engineer",
location: str = None,
remote: bool = None,
page: int = 0,
proxy: str = None,
) -> dict:
"""Search Wellfound jobs via GraphQL API."""
query = """
query JobSearchResultsByPage($slug: String!, $page: Int, $filters: JobSearchFiltersInput) {
talent {
jobSearchResultsByPage(slug: $slug, page: $page, filters: $filters) {
results {
id
title
slug
compensation
equity
remote
locationNames
roleType
experienceLevel
startup {
id
name
slug
companySize
highConcept
stage
totalRaised
markets { displayName }
techStack { displayName }
}
}
totalCount
totalPages
currentPage
}
}
}
"""
filters = {}
if remote is not None:
filters["remote"] = remote
if location:
filters["locationSlug"] = location
variables = {
"slug": role,
"page": page,
"filters": filters if filters else None,
}
result = graphql_post(query, variables, proxy)
if not result or result.get("errors"):
return {"results": [], "totalCount": 0, "totalPages": 0}
return (
result.get("data", {})
.get("talent", {})
.get("jobSearchResultsByPage", {})
)
The slug parameter maps to the role category URL segment. Common values:
| Slug | Role |
|---|---|
| software-engineer | Software Engineering |
| product-manager | Product Management |
| data-scientist | Data Science |
| machine-learning-engineer | ML Engineering |
| frontend-engineer | Frontend Development |
| backend-engineer | Backend Development |
| designer | Design |
| devops | DevOps / Infrastructure |
| marketing | Marketing |
| sales | Sales |
| operations | Operations |
Pagination is zero-indexed: page=0 for the first page, page=1 for the second, etc.
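Because indexing starts at zero, the last valid index is totalPages - 1. A small arithmetic helper makes the bound explicit (per_page is whatever the endpoint returns, typically 10 to 20):

```python
import math

def page_indices(total_count: int, per_page: int) -> range:
    """Zero-indexed page range covering total_count results."""
    if total_count <= 0 or per_page <= 0:
        return range(0)
    return range(math.ceil(total_count / per_page))

# 45 results at 20 per page spans pages 0, 1, 2
```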
Pagination Handling
Wellfound paginates at 10–20 results per page depending on the endpoint version. Handle pagination cleanly:
def scrape_all_jobs(
role: str,
location: str = None,
remote: bool = None,
max_pages: int = 50,
proxy: str = None,
) -> list[dict]:
"""Scrape all available jobs for a role, handling pagination."""
all_jobs = []
page = 0
while page < max_pages:
result = search_jobs(role, location, remote, page, proxy)
batch = result.get("results", [])
total_pages = result.get("totalPages", 0)
total_count = result.get("totalCount", 0)
if not batch:
print(f"No results on page {page}, stopping.")
break
all_jobs.extend(batch)
print(f"Page {page}: {len(batch)} jobs (total: {len(all_jobs)}/{total_count})")
if page >= total_pages - 1:
print(f"Reached last page ({total_pages})")
break
page += 1
time.sleep(random.uniform(1.5, 3.5))
return all_jobs
# Scrape all remote ML engineer jobs
jobs = scrape_all_jobs(
"machine-learning-engineer",
remote=True,
max_pages=30,
proxy="http://user:[email protected]:9000",
)
print(f"Total: {len(jobs)} jobs")
Startup Detail Queries
Each job result includes a nested startup object with summary data. For full company details — all funding rounds, full investor list, complete tech stack — use the startup detail query:
STARTUP_DETAIL_QUERY = """
query StartupDetail($slug: String!) {
startups {
startup(slug: $slug) {
id
name
slug
highConcept
productDescription
companySize
stage
totalRaised
foundedDate
websiteUrl
twitterUrl
linkedInUrl
markets {
displayName
slug
}
techStack {
displayName
slug
}
investors {
name
slug
}
fundingRounds {
roundType
raisedAmount
closedAt
investors {
name
}
}
}
}
}
"""
def get_startup_detail(company_slug: str, proxy: str = None) -> dict:
"""Get detailed startup information by slug."""
result = graphql_post(STARTUP_DETAIL_QUERY, {"slug": company_slug}, proxy)
if not result or result.get("errors"):
return {}
return (
result.get("data", {})
.get("startups", {})
.get("startup", {})
)
# Enrich job listings with startup details
def enrich_jobs_with_startup_data(jobs: list[dict], proxy: str = None) -> list[dict]:
"""Add detailed startup data to job listings."""
startup_cache = {}
for job in jobs:
company_slug = job.get("startup", {}).get("slug")
if not company_slug:
continue
if company_slug not in startup_cache:
print(f" Fetching details for {company_slug}...")
startup_cache[company_slug] = get_startup_detail(company_slug, proxy)
time.sleep(random.uniform(1.0, 2.5))
job["startup_detail"] = startup_cache[company_slug]
return jobs
Parsing Salary and Equity Data
Compensation and equity come back as formatted strings. Parse them for numeric analysis:
import re
def parse_compensation(raw: str) -> dict:
"""
Parse salary strings like "$120K - $160K" or "$90K – $130K".
Handles en-dashes, em-dashes, and various formats.
"""
if not raw:
return {"salary_min": None, "salary_max": None, "raw": raw}
# Normalize dashes
normalized = re.sub(r"[–—−]", "-", raw)
nums = re.findall(r"\$?([\d,]+)[Kk]", normalized)
if len(nums) >= 2:
return {
"salary_min": int(nums[0].replace(",", "")) * 1000,
"salary_max": int(nums[1].replace(",", "")) * 1000,
"raw": raw,
}
elif len(nums) == 1:
val = int(nums[0].replace(",", "")) * 1000
return {"salary_min": val, "salary_max": val, "raw": raw}
return {"salary_min": None, "salary_max": None, "raw": raw}
def parse_equity(raw: str) -> dict:
"""
Parse equity strings like "0.10% - 0.50%" or "1.0% – 2.0%".
"""
if not raw:
return {"equity_min": None, "equity_max": None, "raw": raw}
nums = re.findall(r"([\d.]+)%", raw)
if len(nums) >= 2:
return {
"equity_min": float(nums[0]),
"equity_max": float(nums[1]),
"raw": raw,
}
elif len(nums) == 1:
val = float(nums[0])
return {"equity_min": val, "equity_max": val, "raw": raw}
return {"equity_min": None, "equity_max": None, "raw": raw}
def flatten_job(job: dict) -> dict:
"""Flatten a raw job dict into a clean row for analysis."""
startup = job.get("startup", {})
comp = parse_compensation(job.get("compensation", ""))
equity = parse_equity(job.get("equity", ""))
return {
"job_id": job.get("id"),
"title": job.get("title"),
"slug": job.get("slug"),
"remote": job.get("remote", False),
"locations": ", ".join(job.get("locationNames", [])),
"role_type": job.get("roleType"),
"experience_level": job.get("experienceLevel"),
"salary_min": comp["salary_min"],
"salary_max": comp["salary_max"],
"equity_min": equity["equity_min"],
"equity_max": equity["equity_max"],
"company_name": startup.get("name"),
"company_slug": startup.get("slug"),
"company_size": startup.get("companySize"),
"stage": startup.get("stage"),
"total_raised": startup.get("totalRaised"),
"markets": ", ".join(m["displayName"] for m in startup.get("markets", [])),
"tech_stack": ", ".join(t["displayName"] for t in startup.get("techStack", [])),
}
# Flatten and display
flat_jobs = [flatten_job(j) for j in jobs]
for j in flat_jobs[:5]:
print(f"{j['company_name']} — {j['title']}")
if j["salary_min"]:
print(f" Salary: ${j['salary_min']:,} - ${j['salary_max']:,}")
if j["equity_min"]:
print(f" Equity: {j['equity_min']}% - {j['equity_max']}%")
print(f" Stage: {j['stage']} | Size: {j['company_size']}")
Auth Workaround: NEXT_DATA Extraction
Wellfound increasingly gates some content behind login. The __NEXT_DATA__ approach bypasses many auth requirements because the job data is embedded as JSON in the server-rendered HTML, before any client-side auth check runs:
import httpx
import json
import re
def extract_next_data(url, proxy_url=None):
"""Extract __NEXT_DATA__ JSON from server-rendered HTML."""
req_headers = {
"User-Agent": HEADERS["User-Agent"],
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://wellfound.com/",
}
client_kwargs = {"headers": req_headers, "timeout": 20, "follow_redirects": True}
    if proxy_url:
        client_kwargs["proxy"] = proxy_url  # httpx >= 0.26; older versions used `proxies`
with httpx.Client(**client_kwargs) as client:
resp = client.get(url)
if resp.status_code != 200:
return None
match = re.search(
r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
resp.text,
re.DOTALL,
)
if not match:
return None
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
return None
def scrape_job_from_html(job_slug, proxy_url=None):
"""Scrape job listing directly from HTML — no auth required."""
url = f"https://wellfound.com/jobs/{job_slug}"
data = extract_next_data(url, proxy_url)
if not data:
return None
# Dig into Next.js props structure
props = data.get("props", {}).get("pageProps", {})
job = props.get("jobListing") or props.get("job")
# Try Apollo state cache if direct path fails
if not job:
apollo = props.get("apolloState", {})
job_keys = [k for k in apollo if "JobListing:" in k or "Job:" in k]
if job_keys:
job = apollo[job_keys[0]]
return job
# List jobs for a company via HTML
def list_company_jobs_html(company_slug, proxy_url=None):
"""Get job listings from company page HTML."""
url = f"https://wellfound.com/company/{company_slug}/jobs"
data = extract_next_data(url, proxy_url)
if not data:
return []
props = data.get("props", {}).get("pageProps", {})
# Structure varies, search recursively
return find_jobs_in_props(props)
def find_jobs_in_props(obj, max_depth=5):
"""Recursively find job listing arrays in Next.js props."""
if max_depth == 0 or not isinstance(obj, dict):
return []
# Look for job-like arrays
for key, value in obj.items():
if isinstance(value, list) and value and isinstance(value[0], dict):
if any(k in value[0] for k in ["compensation", "equity", "title", "slug"]):
return value
if isinstance(value, dict):
result = find_jobs_in_props(value, max_depth - 1)
if result:
return result
return []
Anti-Bot Measures and Proxy Integration
Cloudflare Defense Layers
Wellfound uses Cloudflare with bot scoring enabled:
- IP reputation: Datacenter IPs fail almost immediately
- JS challenge: First visit may require JavaScript execution
- Rate limiting: ~60–120 GraphQL requests/minute before throttling
- Browser fingerprinting: TLS fingerprint checked on login flows
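Before retrying through a new proxy, it helps to recognize a blocked response. The markers below are heuristics commonly associated with Cloudflare challenges, not a documented contract; treat them as assumptions that can change at any time:

```python
def looks_like_cloudflare_block(status_code: int, headers: dict, body: str) -> bool:
    """Heuristically classify a response as a Cloudflare challenge/block."""
    if status_code not in (403, 429, 503):
        return False
    h = {k.lower(): str(v).lower() for k, v in headers.items()}
    # Cloudflare tags responses with a cf-ray header and a Server banner
    if "cf-ray" in h or h.get("server") == "cloudflare":
        return True
    # Challenge interstitials carry recognizable text
    lowered = body.lower()
    return "just a moment" in lowered or "attention required" in lowered
```

On a positive result, rotate to a fresh sticky session and back off before retrying.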
ThorData Residential Proxy Setup
ThorData provides rotating residential proxies with country targeting. Use the US pool for Wellfound since it's a US-focused platform:
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
def get_proxy(session_id=None, country="us"):
"""
Build proxy URL.
- session_id: None = rotate per request
- session_id provided = sticky session (same IP per session)
"""
if session_id:
user = f"{THORDATA_USER}-session-{session_id}-country-{country}"
else:
user = f"{THORDATA_USER}-country-{country}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
def resilient_job_search(
role, page=0, location=None, remote=None,
max_retries=3, use_sticky_session=True
):
"""Job search with automatic proxy rotation on failure."""
session_id = random.randint(10000, 99999) if use_sticky_session else None
for attempt in range(max_retries):
proxy = get_proxy(session_id=session_id, country="us")
result = search_jobs(role, location, remote, page, proxy)
if result and result.get("results"):
return result
# Rotate session on failure
session_id = random.randint(10000, 99999)
wait = (attempt + 1) * random.uniform(5, 15)
print(f"Attempt {attempt+1} failed, waiting {wait:.1f}s...")
time.sleep(wait)
return {"results": [], "totalCount": 0, "totalPages": 0}
Playwright with Proxy for Auth-Gated Content
from playwright.sync_api import sync_playwright
def scrape_with_browser(company_slug, proxy_config=None):
"""Use Playwright for content behind auth walls."""
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
proxy=proxy_config,
args=["--disable-blink-features=AutomationControlled"],
)
context = browser.new_context(
user_agent=HEADERS["User-Agent"],
viewport={"width": 1440, "height": 900},
locale="en-US",
)
graphql_responses = []
def capture_gql(response):
if "graphql" in response.url:
try:
data = response.json()
if data.get("data"):
graphql_responses.append(data["data"])
except Exception:
pass
page = context.new_page()
page.on("response", capture_gql)
page.goto(
f"https://wellfound.com/company/{company_slug}/jobs",
wait_until="networkidle",
)
browser.close()
return graphql_responses
proxy_config = {
"server": f"http://{THORDATA_HOST}:{THORDATA_PORT}",
"username": THORDATA_USER,
"password": THORDATA_PASS,
}
responses = scrape_with_browser("openai", proxy_config)
Data Storage: SQLite Schema
import sqlite3
import json
from datetime import datetime, date
def init_db(db_path: str = "wellfound_jobs.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS companies (
id TEXT PRIMARY KEY,
slug TEXT UNIQUE NOT NULL,
name TEXT,
high_concept TEXT,
company_size TEXT,
stage TEXT,
total_raised INTEGER,
markets TEXT,
tech_stack TEXT,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS jobs (
id TEXT PRIMARY KEY,
title TEXT,
slug TEXT,
company_id TEXT,
company_slug TEXT,
salary_min INTEGER,
salary_max INTEGER,
equity_min REAL,
equity_max REAL,
remote INTEGER DEFAULT 0,
locations TEXT,
role_type TEXT,
experience_level TEXT,
compensation_raw TEXT,
equity_raw TEXT,
scraped_at TEXT,
FOREIGN KEY (company_id) REFERENCES companies(id)
);
CREATE TABLE IF NOT EXISTS scrape_runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
role_slug TEXT,
total_jobs INTEGER,
pages_scraped INTEGER,
started_at TEXT,
completed_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_jobs_company ON jobs(company_id);
CREATE INDEX IF NOT EXISTS idx_jobs_salary ON jobs(salary_min, salary_max);
    CREATE INDEX IF NOT EXISTS idx_companies_stage ON companies(stage);
""")
conn.commit()
return conn
def save_job(conn: sqlite3.Connection, job: dict):
"""Save a single job and its company to the database."""
startup = job.get("startup", {})
flat = flatten_job(job)
# Upsert company
if startup.get("id"):
conn.execute("""
INSERT OR REPLACE INTO companies
(id, slug, name, high_concept, company_size, stage, total_raised,
markets, tech_stack, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
startup.get("id"), startup.get("slug"), startup.get("name"),
startup.get("highConcept"), startup.get("companySize"),
startup.get("stage"), startup.get("totalRaised"),
json.dumps([m["displayName"] for m in startup.get("markets", [])]),
json.dumps([t["displayName"] for t in startup.get("techStack", [])]),
datetime.utcnow().isoformat(),
))
# Upsert job
conn.execute("""
INSERT OR REPLACE INTO jobs
(id, title, slug, company_id, company_slug, salary_min, salary_max,
equity_min, equity_max, remote, locations, role_type, experience_level,
compensation_raw, equity_raw, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
flat["job_id"], flat["title"], flat["slug"],
startup.get("id"), flat["company_slug"],
flat["salary_min"], flat["salary_max"],
flat["equity_min"], flat["equity_max"],
1 if flat["remote"] else 0,
flat["locations"], flat["role_type"], flat["experience_level"],
job.get("compensation"), job.get("equity"),
datetime.utcnow().isoformat(),
))
conn.commit()
def batch_save_jobs(conn: sqlite3.Connection, jobs: list[dict]):
"""Save a batch of jobs efficiently."""
for job in jobs:
try:
save_job(conn, job)
except Exception as e:
print(f"Error saving job {job.get('id', 'unknown')}: {e}")
Building a Job Market Tracker
Track salary and equity trends over time — weekly snapshots build a useful dataset:
def run_weekly_scrape(role_slugs=None, db_path="wellfound_jobs.db"):
"""
Weekly job market scrape across multiple roles.
Run this on a schedule (e.g., every Monday).
"""
if role_slugs is None:
role_slugs = [
"software-engineer", "machine-learning-engineer",
"data-scientist", "product-manager", "devops",
]
conn = init_db(db_path)
proxy = get_proxy(country="us")
total_scraped = 0
for role in role_slugs:
        print(f"\n=== Scraping {role} ===")
        started_at = datetime.utcnow().isoformat()
        jobs = scrape_all_jobs(role, remote=True, max_pages=20, proxy=proxy)
        batch_save_jobs(conn, jobs)
        total_scraped += len(jobs)
        print(f"  Saved {len(jobs)} jobs for {role}")
        # Record run (pages_scraped stores the max_pages cap, not pages actually fetched)
        conn.execute("""
            INSERT INTO scrape_runs (role_slug, total_jobs, pages_scraped, started_at, completed_at)
            VALUES (?, ?, ?, ?, ?)
        """, (role, len(jobs), 20, started_at, datetime.utcnow().isoformat()))
conn.commit()
time.sleep(random.uniform(5, 10))
return total_scraped
def get_salary_trends(conn, role_pattern="%Engineer%", weeks_back=12):
"""Analyze salary trends over the past N weeks."""
cursor = conn.execute("""
SELECT
strftime('%Y-W%W', scraped_at) as week,
COUNT(*) as listings,
AVG(salary_min) as avg_min,
AVG(salary_max) as avg_max,
AVG(equity_min) as avg_equity_min,
AVG(equity_max) as avg_equity_max
FROM jobs
WHERE title LIKE ?
AND salary_min IS NOT NULL
AND scraped_at > datetime('now', '-' || ? || ' weeks')
GROUP BY week
ORDER BY week
""", (role_pattern, weeks_back))
return [
{
"week": row[0], "listings": row[1],
"avg_salary": (row[2] + row[3]) / 2 if row[2] and row[3] else None,
"avg_equity": (row[4] + row[5]) / 2 if row[4] and row[5] else None,
}
for row in cursor.fetchall()
]
trends = get_salary_trends(conn)
for t in trends:
salary = f"${t['avg_salary']:,.0f}" if t["avg_salary"] else "N/A"
equity = f"{t['avg_equity']:.2f}%" if t["avg_equity"] else "N/A"
print(f"Week {t['week']}: {t['listings']} listings | Avg salary: {salary} | Avg equity: {equity}")
Real-World Use Cases
1. Equity Benchmarking by Stage
def equity_by_stage(conn):
"""Compare equity ranges by company funding stage."""
cursor = conn.execute("""
SELECT
c.stage,
COUNT(DISTINCT j.id) as job_count,
AVG(j.equity_min) as avg_equity_min,
AVG(j.equity_max) as avg_equity_max,
MIN(j.equity_min) as min_equity,
MAX(j.equity_max) as max_equity
FROM jobs j
JOIN companies c ON j.company_id = c.id
WHERE j.equity_min IS NOT NULL
AND c.stage IS NOT NULL
GROUP BY c.stage
ORDER BY avg_equity_max DESC
""")
print("\nEquity ranges by company stage:")
for row in cursor.fetchall():
stage, count, avg_min, avg_max, min_eq, max_eq = row
print(f" {stage}: {avg_min:.2f}% - {avg_max:.2f}% avg "
f"(range: {min_eq:.2f}% - {max_eq:.2f}%, n={count})")
equity_by_stage(conn)
2. Tech Stack Intelligence
def find_companies_by_tech(conn, technology):
"""Find all companies using a specific technology."""
cursor = conn.execute("""
SELECT c.name, c.stage, c.company_size, c.total_raised,
COUNT(j.id) as open_roles
FROM companies c
LEFT JOIN jobs j ON c.id = j.company_id
WHERE c.tech_stack LIKE ?
GROUP BY c.id
ORDER BY c.total_raised DESC NULLS LAST
""", (f"%{technology}%",))
return [
{
"name": row[0], "stage": row[1], "size": row[2],
"raised": row[3], "open_roles": row[4],
}
for row in cursor.fetchall()
]
rust_companies = find_companies_by_tech(conn, "Rust")
print(f"\nCompanies using Rust: {len(rust_companies)}")
for c in rust_companies[:10]:
raised = f"${c['raised']:,}" if c["raised"] else "undisclosed"
print(f" {c['name']} ({c['stage']}) — raised {raised} — {c['open_roles']} open roles")
3. Salary Negotiation Intelligence
def get_offer_context(title_keyword, company_stage=None, remote=True):
"""
Given a job title and company stage, return salary percentiles
to inform salary negotiations.
"""
conn = sqlite3.connect("wellfound_jobs.db")
conditions = ["j.title LIKE ?", "j.salary_min IS NOT NULL"]
params = [f"%{title_keyword}%"]
if company_stage:
conditions.append("c.stage = ?")
params.append(company_stage)
if remote is not None:
conditions.append("j.remote = ?")
params.append(1 if remote else 0)
where = " AND ".join(conditions)
cursor = conn.execute(f"""
SELECT j.salary_min, j.salary_max, j.equity_min, j.equity_max,
c.stage, c.company_size
FROM jobs j
JOIN companies c ON j.company_id = c.id
WHERE {where}
""", params)
rows = cursor.fetchall()
if not rows:
return None
salaries = [(r[0] + r[1]) / 2 for r in rows if r[0] and r[1]]
equities = [(r[2] + r[3]) / 2 for r in rows if r[2] and r[3]]
salaries.sort()
equities.sort()
def percentile(lst, p):
if not lst:
return None
i = int(len(lst) * p / 100)
return lst[min(i, len(lst) - 1)]
return {
"sample_size": len(rows),
"salary_p25": percentile(salaries, 25),
"salary_median": percentile(salaries, 50),
"salary_p75": percentile(salaries, 75),
"equity_p25": percentile(equities, 25),
"equity_median": percentile(equities, 50),
"equity_p75": percentile(equities, 75),
}
context = get_offer_context("Machine Learning Engineer", "Series A")
if context:
print(f"Salary range (n={context['sample_size']}):")
print(f" P25: ${context['salary_p25']:,.0f}")
print(f" Median: ${context['salary_median']:,.0f}")
print(f" P75: ${context['salary_p75']:,.0f}")
print(f"Equity median: {context['equity_median']:.2f}%")
Complete Pipeline
def full_pipeline(
role_slugs=None,
output_db="wellfound_jobs.db",
max_pages=25,
):
"""
Full Wellfound jobs scraping pipeline.
Scrapes multiple roles, handles pagination, stores in SQLite.
"""
if role_slugs is None:
role_slugs = ["software-engineer", "machine-learning-engineer", "data-scientist"]
conn = init_db(output_db)
proxy = get_proxy(country="us")
for role in role_slugs:
print(f"\n=== Scraping: {role} ===")
run_start = datetime.utcnow().isoformat()
total = 0
page = 0
while page < max_pages:
result = resilient_job_search(role, page=page, remote=True)
batch = result.get("results", [])
if not batch:
break
batch_save_jobs(conn, batch)
total += len(batch)
print(f" Page {page}: {len(batch)} jobs saved (total: {total})")
if page >= result.get("totalPages", 0) - 1:
break
page += 1
time.sleep(random.uniform(2.0, 4.0))
conn.execute("""
INSERT INTO scrape_runs (role_slug, total_jobs, pages_scraped, started_at, completed_at)
VALUES (?, ?, ?, ?, ?)
""", (role, total, page + 1, run_start, datetime.utcnow().isoformat()))
conn.commit()
print(f" {role} complete: {total} jobs")
# Print summary stats
cursor = conn.execute("SELECT COUNT(*) FROM jobs")
total_jobs = cursor.fetchone()[0]
cursor = conn.execute("SELECT COUNT(*) FROM companies")
total_companies = cursor.fetchone()[0]
print(f"\nDatabase: {total_jobs:,} jobs, {total_companies:,} companies")
if __name__ == "__main__":
full_pipeline()
Legal Notes
Wellfound's Terms of Service prohibit automated scraping. This guide is for educational and research purposes.
Key considerations for your jurisdiction:
- hiQ v. LinkedIn (9th Circuit): scraping publicly accessible data generally doesn't violate the CFAA
- GDPR: EU users' personal data (names, contact info) has additional protections
- Commercial use: redistributing scraped data as a product carries higher legal risk than internal research
Practical safe use: personal salary research, academic market studies, internal tooling. Avoid: reselling data, building competing products, scraping at volumes that stress their infrastructure.
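One concrete way to keep volumes polite is to cap the average request rate with a token bucket. This is a generic sketch, not tied to any published Wellfound limit; tune rate_per_sec well below whatever throttling threshold you observe:

```python
import time

class TokenBucket:
    """Token-bucket limiter: permits short bursts, enforces an average rate."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available, refilling based on elapsed time."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Check allow() before each GraphQL call and sleep briefly when it returns False. The injectable clock keeps the limiter testable without real waiting.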
Summary
Wellfound's GraphQL API provides the cleanest access to startup job data — salary, equity, stage, funding, and tech stack in structured JSON. The main technical obstacles are Cloudflare bot protection and auth walls on some endpoints.
Core techniques:
1. GraphQL direct queries — richest data, requires residential proxies
2. __NEXT_DATA__ extraction — bypasses auth for server-rendered content
3. Playwright interception — for auth-gated or heavily dynamic pages
4. ThorData residential proxies — required for Cloudflare, essential for volume
With weekly scrapes across 5–10 role slugs, you build a useful salary benchmarking dataset within a month. Add startup detail enrichment and you have market intelligence that rivals Crunchbase — for the cost of proxy bandwidth.