Scraping AngelList/Wellfound Jobs (2026)
Wellfound (formerly AngelList Talent) is unmatched as a source of startup job data. Each listing combines salary ranges, equity percentages, company stage, funding history, and tech stack — information that is genuinely hard to find consolidated anywhere else. Job boards like LinkedIn or Indeed don't show equity. Crunchbase doesn't show open roles. Wellfound shows all of it.
This post covers how to extract it programmatically in 2026 — from the GraphQL API, through auth workarounds, to pagination, proxy integration, and building a usable database for tracking salary and equity trends.
What Data You Can Extract
A single Wellfound job listing exposes a surprising amount of structured data:
Job Fields
- Title — role name and slug
- Compensation — salary range as string (e.g., "$120K – $160K")
- Equity — percentage range (e.g., "0.10% – 0.50%")
- Remote flag — boolean
- Location names — cities/regions where role is based
- Role type — full-time, contract, internship
- Experience level — entry, mid, senior
- Start date — earliest start date
Company Fields (nested in each job)
- Name — company display name
- Company size — headcount bucket (1-10, 11-50, 51-200, etc.)
- High concept — one-line description
- Product description — longer description
- Funding stage — seed, Series A, B, C, growth, etc.
- Total raised — dollar amount
- Investors — named VC firms
- Tech stack — tools and frameworks used
This makes Wellfound useful not just for job hunting but for salary benchmarking, market research, investor tracking, and startup intelligence tools.
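Put together, one listing flattens into a record shaped roughly like this. This is a hypothetical example: the values are invented for illustration and the field names simply mirror the lists above.

```python
# Hypothetical example of the data a single listing exposes.
# All values are invented for illustration.
example_listing = {
    "title": "Senior Backend Engineer",
    "compensation": "$140K – $180K",   # salary range, returned as a string
    "equity": "0.10% – 0.50%",         # equity range, returned as a string
    "remote": True,
    "location_names": ["San Francisco", "Remote"],
    "role_type": "full-time",
    "experience_level": "senior",
    "company": {
        "name": "ExampleCo",
        "company_size": "11-50",
        "high_concept": "Payments infrastructure for marketplaces",
        "funding_stage": "Series A",
        "total_raised": 12_000_000,
        "investors": ["Example Ventures"],
        "tech_stack": ["Python", "PostgreSQL", "Kubernetes"],
    },
}
```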
GraphQL API Structure
Wellfound's frontend is a Next.js app that talks to a GraphQL endpoint at https://wellfound.com/graphql. The schema is not publicly documented but has been stable for years. Job search goes through talent.jobSearchResultsByPage.
import httpx
import json
import time
import random
from typing import Optional
HEADERS = {
"Content-Type": "application/json",
"Accept": "application/json",
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
),
"Referer": "https://wellfound.com/jobs",
"Origin": "https://wellfound.com",
"Accept-Language": "en-US,en;q=0.9",
}
GQL_URL = "https://wellfound.com/graphql"
def graphql_post(query, variables, proxy_url=None, timeout=20):
"""Execute a Wellfound GraphQL query."""
kwargs = {
"headers": HEADERS,
"timeout": timeout,
"follow_redirects": True,
}
    if proxy_url:
        # httpx >= 0.26 takes a single `proxy` argument; the older
        # `proxies` mapping was removed in later releases
        kwargs["proxy"] = proxy_url
try:
with httpx.Client(**kwargs) as client:
resp = client.post(GQL_URL, json={"query": query, "variables": variables})
resp.raise_for_status()
return resp.json()
except httpx.HTTPStatusError as e:
print(f"HTTP error: {e.response.status_code}")
return None
except Exception as e:
print(f"Request error: {e}")
return None
def search_jobs(
role: str = "software-engineer",
location: str = None,
remote: bool = None,
page: int = 0,
proxy: str = None,
) -> dict:
"""Search Wellfound jobs via GraphQL API."""
query = """
query JobSearchResultsByPage($slug: String!, $page: Int, $filters: JobSearchFiltersInput) {
talent {
jobSearchResultsByPage(slug: $slug, page: $page, filters: $filters) {
results {
id
title
slug
compensation
equity
remote
locationNames
roleType
experienceLevel
startup {
id
name
slug
companySize
highConcept
stage
totalRaised
markets { displayName }
techStack { displayName }
}
}
totalCount
totalPages
currentPage
}
}
}
"""
filters = {}
if remote is not None:
filters["remote"] = remote
if location:
filters["locationSlug"] = location
variables = {
"slug": role,
"page": page,
"filters": filters if filters else None,
}
result = graphql_post(query, variables, proxy)
if not result or result.get("errors"):
return {"results": [], "totalCount": 0, "totalPages": 0}
return (
result.get("data", {})
.get("talent", {})
.get("jobSearchResultsByPage", {})
)
The slug parameter maps to the role category URL segment. Common values:
| Slug | Role |
|---|---|
| software-engineer | Software Engineering |
| product-manager | Product Management |
| data-scientist | Data Science |
| machine-learning-engineer | ML Engineering |
| frontend-engineer | Frontend Development |
| backend-engineer | Backend Development |
| designer | Design |
| devops | DevOps / Infrastructure |
| marketing | Marketing |
| sales | Sales |
| operations | Operations |
Pagination is zero-indexed: page=0 for the first page, page=1 for the second, etc.
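Because indexing starts at zero, the last valid index is totalPages - 1. A small arithmetic helper makes the bound explicit (per_page is whatever the endpoint returns, typically 10 to 20):

```python
import math

def page_indices(total_count: int, per_page: int) -> range:
    """Zero-indexed page range covering total_count results."""
    if total_count <= 0 or per_page <= 0:
        return range(0)
    return range(math.ceil(total_count / per_page))

# 45 results at 20 per page spans pages 0, 1, 2
```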
Pagination Handling
Wellfound paginates at 10–20 results per page depending on the endpoint version. Handle pagination cleanly:
def scrape_all_jobs(
role: str,
location: str = None,
remote: bool = None,
max_pages: int = 50,
proxy: str = None,
) -> list[dict]:
"""Scrape all available jobs for a role, handling pagination."""
all_jobs = []
page = 0
while page < max_pages:
result = search_jobs(role, location, remote, page, proxy)
batch = result.get("results", [])
total_pages = result.get("totalPages", 0)
total_count = result.get("totalCount", 0)
if not batch:
print(f"No results on page {page}, stopping.")
break
all_jobs.extend(batch)
print(f"Page {page}: {len(batch)} jobs (total: {len(all_jobs)}/{total_count})")
if page >= total_pages - 1:
print(f"Reached last page ({total_pages})")
break
page += 1
time.sleep(random.uniform(1.5, 3.5))
return all_jobs
# Scrape all remote ML engineer jobs
jobs = scrape_all_jobs(
"machine-learning-engineer",
remote=True,
max_pages=30,
proxy="http://user:[email protected]:9000",
)
print(f"Total: {len(jobs)} jobs")
Startup Detail Queries
Each job result includes a nested startup object with summary data. For full company details — all funding rounds, full investor list, complete tech stack — use the startup detail query:
STARTUP_DETAIL_QUERY = """
query StartupDetail($slug: String!) {
startups {
startup(slug: $slug) {
id
name
slug
highConcept
productDescription
companySize
stage
totalRaised
foundedDate
websiteUrl
twitterUrl
linkedInUrl
markets {
displayName
slug
}
techStack {
displayName
slug
}
investors {
name
slug
}
fundingRounds {
roundType
raisedAmount
closedAt
investors {
name
}
}
}
}
}
"""
def get_startup_detail(company_slug: str, proxy: str = None) -> dict:
"""Get detailed startup information by slug."""
result = graphql_post(STARTUP_DETAIL_QUERY, {"slug": company_slug}, proxy)
if not result or result.get("errors"):
return {}
return (
result.get("data", {})
.get("startups", {})
.get("startup", {})
)
# Enrich job listings with startup details
def enrich_jobs_with_startup_data(jobs: list[dict], proxy: str = None) -> list[dict]:
"""Add detailed startup data to job listings."""
startup_cache = {}
for job in jobs:
company_slug = job.get("startup", {}).get("slug")
if not company_slug:
continue
if company_slug not in startup_cache:
print(f" Fetching details for {company_slug}...")
startup_cache[company_slug] = get_startup_detail(company_slug, proxy)
time.sleep(random.uniform(1.0, 2.5))
job["startup_detail"] = startup_cache[company_slug]
return jobs
Parsing Salary and Equity Data
Compensation and equity come back as formatted strings. Parse them for numeric analysis:
import re
def parse_compensation(raw: str) -> dict:
"""
Parse salary strings like "$120K - $160K" or "$90K – $130K".
Handles en-dashes, em-dashes, and various formats.
"""
if not raw:
return {"salary_min": None, "salary_max": None, "raw": raw}
# Normalize dashes
normalized = re.sub(r"[–—−]", "-", raw)
nums = re.findall(r"\$?([\d,]+)[Kk]", normalized)
if len(nums) >= 2:
return {
"salary_min": int(nums[0].replace(",", "")) * 1000,
"salary_max": int(nums[1].replace(",", "")) * 1000,
"raw": raw,
}
elif len(nums) == 1:
val = int(nums[0].replace(",", "")) * 1000
return {"salary_min": val, "salary_max": val, "raw": raw}
return {"salary_min": None, "salary_max": None, "raw": raw}
def parse_equity(raw: str) -> dict:
"""
Parse equity strings like "0.10% - 0.50%" or "1.0% – 2.0%".
"""
if not raw:
return {"equity_min": None, "equity_max": None, "raw": raw}
nums = re.findall(r"([\d.]+)%", raw)
if len(nums) >= 2:
return {
"equity_min": float(nums[0]),
"equity_max": float(nums[1]),
"raw": raw,
}
elif len(nums) == 1:
val = float(nums[0])
return {"equity_min": val, "equity_max": val, "raw": raw}
return {"equity_min": None, "equity_max": None, "raw": raw}
def flatten_job(job: dict) -> dict:
"""Flatten a raw job dict into a clean row for analysis."""
startup = job.get("startup", {})
comp = parse_compensation(job.get("compensation", ""))
equity = parse_equity(job.get("equity", ""))
return {
"job_id": job.get("id"),
"title": job.get("title"),
"slug": job.get("slug"),
"remote": job.get("remote", False),
"locations": ", ".join(job.get("locationNames", [])),
"role_type": job.get("roleType"),
"experience_level": job.get("experienceLevel"),
"salary_min": comp["salary_min"],
"salary_max": comp["salary_max"],
"equity_min": equity["equity_min"],
"equity_max": equity["equity_max"],
"company_name": startup.get("name"),
"company_slug": startup.get("slug"),
"company_size": startup.get("companySize"),
"stage": startup.get("stage"),
"total_raised": startup.get("totalRaised"),
"markets": ", ".join(m["displayName"] for m in startup.get("markets", [])),
"tech_stack": ", ".join(t["displayName"] for t in startup.get("techStack", [])),
}
# Flatten and display
flat_jobs = [flatten_job(j) for j in jobs]
for j in flat_jobs[:5]:
print(f"{j['company_name']} — {j['title']}")
if j["salary_min"]:
print(f" Salary: ${j['salary_min']:,} - ${j['salary_max']:,}")
if j["equity_min"]:
print(f" Equity: {j['equity_min']}% - {j['equity_max']}%")
print(f" Stage: {j['stage']} | Size: {j['company_size']}")
Auth Workaround: NEXT_DATA Extraction
Wellfound increasingly gates some content behind login. The __NEXT_DATA__ approach bypasses many auth requirements because the job data is embedded as JSON in the server-rendered HTML, before any client-side auth check runs:
import httpx
import json
import re
def extract_next_data(url, proxy_url=None):
"""Extract __NEXT_DATA__ JSON from server-rendered HTML."""
req_headers = {
"User-Agent": HEADERS["User-Agent"],
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://wellfound.com/",
}
client_kwargs = {"headers": req_headers, "timeout": 20, "follow_redirects": True}
    if proxy_url:
        client_kwargs["proxy"] = proxy_url  # httpx >= 0.26; older versions used `proxies`
with httpx.Client(**client_kwargs) as client:
resp = client.get(url)
if resp.status_code != 200:
return None
match = re.search(
r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
resp.text,
re.DOTALL,
)
if not match:
return None
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
return None
def scrape_job_from_html(job_slug, proxy_url=None):
"""Scrape job listing directly from HTML — no auth required."""
url = f"https://wellfound.com/jobs/{job_slug}"
data = extract_next_data(url, proxy_url)
if not data:
return None
# Dig into Next.js props structure
props = data.get("props", {}).get("pageProps", {})
job = props.get("jobListing") or props.get("job")
# Try Apollo state cache if direct path fails
if not job:
apollo = props.get("apolloState", {})
job_keys = [k for k in apollo if "JobListing:" in k or "Job:" in k]
if job_keys:
job = apollo[job_keys[0]]
return job
# List jobs for a company via HTML
def list_company_jobs_html(company_slug, proxy_url=None):
"""Get job listings from company page HTML."""
url = f"https://wellfound.com/company/{company_slug}/jobs"
data = extract_next_data(url, proxy_url)
if not data:
return []
props = data.get("props", {}).get("pageProps", {})
# Structure varies, search recursively
return find_jobs_in_props(props)
def find_jobs_in_props(obj, max_depth=5):
"""Recursively find job listing arrays in Next.js props."""
if max_depth == 0 or not isinstance(obj, dict):
return []
# Look for job-like arrays
for key, value in obj.items():
if isinstance(value, list) and value and isinstance(value[0], dict):
if any(k in value[0] for k in ["compensation", "equity", "title", "slug"]):
return value
if isinstance(value, dict):
result = find_jobs_in_props(value, max_depth - 1)
if result:
return result
return []
Anti-Bot Measures and Proxy Integration
Cloudflare Defense Layers
Wellfound uses Cloudflare with bot scoring enabled:
- IP reputation: Datacenter IPs fail almost immediately
- JS challenge: First visit may require JavaScript execution
- Rate limiting: ~60–120 GraphQL requests/minute before throttling
- Browser fingerprinting: TLS fingerprint checked on login flows
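Before retrying through a new proxy, it helps to recognize a blocked response. The markers below are heuristics commonly associated with Cloudflare challenges, not a documented contract; treat them as assumptions that can change at any time:

```python
def looks_like_cloudflare_block(status_code: int, headers: dict, body: str) -> bool:
    """Heuristically classify a response as a Cloudflare challenge/block."""
    if status_code not in (403, 429, 503):
        return False
    h = {k.lower(): str(v).lower() for k, v in headers.items()}
    # Cloudflare tags responses with a cf-ray header and a Server banner
    if "cf-ray" in h or h.get("server") == "cloudflare":
        return True
    # Challenge interstitials carry recognizable text
    lowered = body.lower()
    return "just a moment" in lowered or "attention required" in lowered
```

On a positive result, rotate to a fresh sticky session and back off before retrying.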
ThorData Residential Proxy Setup
ThorData provides rotating residential proxies with country targeting. Use the US pool for Wellfound since it's a US-focused platform:
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
def get_proxy(session_id=None, country="us"):
"""
Build proxy URL.
- session_id: None = rotate per request
- session_id provided = sticky session (same IP per session)
"""
if session_id:
user = f"{THORDATA_USER}-session-{session_id}-country-{country}"
else:
user = f"{THORDATA_USER}-country-{country}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
def resilient_job_search(
role, page=0, location=None, remote=None,
max_retries=3, use_sticky_session=True
):
"""Job search with automatic proxy rotation on failure."""
session_id = random.randint(10000, 99999) if use_sticky_session else None
for attempt in range(max_retries):
proxy = get_proxy(session_id=session_id, country="us")
result = search_jobs(role, location, remote, page, proxy)
if result and result.get("results"):
return result
# Rotate session on failure
session_id = random.randint(10000, 99999)
wait = (attempt + 1) * random.uniform(5, 15)
print(f"Attempt {attempt+1} failed, waiting {wait:.1f}s...")
time.sleep(wait)
return {"results": [], "totalCount": 0, "totalPages": 0}
Playwright with Proxy for Auth-Gated Content
from playwright.sync_api import sync_playwright
def scrape_with_browser(company_slug, proxy_config=None):
"""Use Playwright for content behind auth walls."""
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
proxy=proxy_config,
args=["--disable-blink-features=AutomationControlled"],
)
context = browser.new_context(
user_agent=HEADERS["User-Agent"],
viewport={"width": 1440, "height": 900},
locale="en-US",
)
graphql_responses = []
def capture_gql(response):
if "graphql" in response.url:
try:
data = response.json()
if data.get("data"):
graphql_responses.append(data["data"])
except Exception:
pass
page = context.new_page()
page.on("response", capture_gql)
page.goto(
f"https://wellfound.com/company/{company_slug}/jobs",
wait_until="networkidle",
)
browser.close()
return graphql_responses
proxy_config = {
"server": f"http://{THORDATA_HOST}:{THORDATA_PORT}",
"username": THORDATA_USER,
"password": THORDATA_PASS,
}
responses = scrape_with_browser("openai", proxy_config)
Data Storage: SQLite Schema
import sqlite3
import json
from datetime import datetime, date
def init_db(db_path: str = "wellfound_jobs.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS companies (
id TEXT PRIMARY KEY,
slug TEXT UNIQUE NOT NULL,
name TEXT,
high_concept TEXT,
company_size TEXT,
stage TEXT,
total_raised INTEGER,
markets TEXT,
tech_stack TEXT,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS jobs (
id TEXT PRIMARY KEY,
title TEXT,
slug TEXT,
company_id TEXT,
company_slug TEXT,
salary_min INTEGER,
salary_max INTEGER,
equity_min REAL,
equity_max REAL,
remote INTEGER DEFAULT 0,
locations TEXT,
role_type TEXT,
experience_level TEXT,
compensation_raw TEXT,
equity_raw TEXT,
scraped_at TEXT,
FOREIGN KEY (company_id) REFERENCES companies(id)
);
CREATE TABLE IF NOT EXISTS scrape_runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
role_slug TEXT,
total_jobs INTEGER,
pages_scraped INTEGER,
started_at TEXT,
completed_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_jobs_company ON jobs(company_id);
CREATE INDEX IF NOT EXISTS idx_jobs_salary ON jobs(salary_min, salary_max);
    CREATE INDEX IF NOT EXISTS idx_companies_stage ON companies(stage);
""")
conn.commit()
return conn
def save_job(conn: sqlite3.Connection, job: dict):
"""Save a single job and its company to the database."""
startup = job.get("startup", {})
flat = flatten_job(job)
# Upsert company
if startup.get("id"):
conn.execute("""
INSERT OR REPLACE INTO companies
(id, slug, name, high_concept, company_size, stage, total_raised,
markets, tech_stack, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
startup.get("id"), startup.get("slug"), startup.get("name"),
startup.get("highConcept"), startup.get("companySize"),
startup.get("stage"), startup.get("totalRaised"),
json.dumps([m["displayName"] for m in startup.get("markets", [])]),
json.dumps([t["displayName"] for t in startup.get("techStack", [])]),
datetime.utcnow().isoformat(),
))
# Upsert job
conn.execute("""
INSERT OR REPLACE INTO jobs
(id, title, slug, company_id, company_slug, salary_min, salary_max,
equity_min, equity_max, remote, locations, role_type, experience_level,
compensation_raw, equity_raw, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
flat["job_id"], flat["title"], flat["slug"],
startup.get("id"), flat["company_slug"],
flat["salary_min"], flat["salary_max"],
flat["equity_min"], flat["equity_max"],
1 if flat["remote"] else 0,
flat["locations"], flat["role_type"], flat["experience_level"],
job.get("compensation"), job.get("equity"),
datetime.utcnow().isoformat(),
))
conn.commit()
def batch_save_jobs(conn: sqlite3.Connection, jobs: list[dict]):
"""Save a batch of jobs efficiently."""
for job in jobs:
try:
save_job(conn, job)
except Exception as e:
print(f"Error saving job {job.get('id', 'unknown')}: {e}")
Building a Job Market Tracker
Track salary and equity trends over time — weekly snapshots build a useful dataset:
def run_weekly_scrape(role_slugs=None, db_path="wellfound_jobs.db"):
"""
Weekly job market scrape across multiple roles.
Run this on a schedule (e.g., every Monday).
"""
if role_slugs is None:
role_slugs = [
"software-engineer", "machine-learning-engineer",
"data-scientist", "product-manager", "devops",
]
conn = init_db(db_path)
proxy = get_proxy(country="us")
total_scraped = 0
for role in role_slugs:
        print(f"\n=== Scraping {role} ===")
        started_at = datetime.utcnow().isoformat()
        jobs = scrape_all_jobs(role, remote=True, max_pages=20, proxy=proxy)
        batch_save_jobs(conn, jobs)
        total_scraped += len(jobs)
        print(f"  Saved {len(jobs)} jobs for {role}")
        # Record run (pages_scraped stores the max_pages cap, not pages actually fetched)
        conn.execute("""
            INSERT INTO scrape_runs (role_slug, total_jobs, pages_scraped, started_at, completed_at)
            VALUES (?, ?, ?, ?, ?)
        """, (role, len(jobs), 20, started_at, datetime.utcnow().isoformat()))
conn.commit()
time.sleep(random.uniform(5, 10))
return total_scraped
def get_salary_trends(conn, role_pattern="%Engineer%", weeks_back=12):
"""Analyze salary trends over the past N weeks."""
cursor = conn.execute("""
SELECT
strftime('%Y-W%W', scraped_at) as week,
COUNT(*) as listings,
AVG(salary_min) as avg_min,
AVG(salary_max) as avg_max,
AVG(equity_min) as avg_equity_min,
AVG(equity_max) as avg_equity_max
FROM jobs
WHERE title LIKE ?
AND salary_min IS NOT NULL
AND scraped_at > datetime('now', '-' || ? || ' weeks')
GROUP BY week
ORDER BY week
""", (role_pattern, weeks_back))
return [
{
"week": row[0], "listings": row[1],
"avg_salary": (row[2] + row[3]) / 2 if row[2] and row[3] else None,
"avg_equity": (row[4] + row[5]) / 2 if row[4] and row[5] else None,
}
for row in cursor.fetchall()
]
trends = get_salary_trends(conn)
for t in trends:
salary = f"${t['avg_salary']:,.0f}" if t["avg_salary"] else "N/A"
equity = f"{t['avg_equity']:.2f}%" if t["avg_equity"] else "N/A"
print(f"Week {t['week']}: {t['listings']} listings | Avg salary: {salary} | Avg equity: {equity}")
Real-World Use Cases
1. Equity Benchmarking by Stage
def equity_by_stage(conn):
"""Compare equity ranges by company funding stage."""
cursor = conn.execute("""
SELECT
c.stage,
COUNT(DISTINCT j.id) as job_count,
AVG(j.equity_min) as avg_equity_min,
AVG(j.equity_max) as avg_equity_max,
MIN(j.equity_min) as min_equity,
MAX(j.equity_max) as max_equity
FROM jobs j
JOIN companies c ON j.company_id = c.id
WHERE j.equity_min IS NOT NULL
AND c.stage IS NOT NULL
GROUP BY c.stage
ORDER BY avg_equity_max DESC
""")
print("\nEquity ranges by company stage:")
for row in cursor.fetchall():
stage, count, avg_min, avg_max, min_eq, max_eq = row
print(f" {stage}: {avg_min:.2f}% - {avg_max:.2f}% avg "
f"(range: {min_eq:.2f}% - {max_eq:.2f}%, n={count})")
equity_by_stage(conn)
2. Tech Stack Intelligence
def find_companies_by_tech(conn, technology):
"""Find all companies using a specific technology."""
cursor = conn.execute("""
SELECT c.name, c.stage, c.company_size, c.total_raised,
COUNT(j.id) as open_roles
FROM companies c
LEFT JOIN jobs j ON c.id = j.company_id
WHERE c.tech_stack LIKE ?
GROUP BY c.id
ORDER BY c.total_raised DESC NULLS LAST
""", (f"%{technology}%",))
return [
{
"name": row[0], "stage": row[1], "size": row[2],
"raised": row[3], "open_roles": row[4],
}
for row in cursor.fetchall()
]
rust_companies = find_companies_by_tech(conn, "Rust")
print(f"\nCompanies using Rust: {len(rust_companies)}")
for c in rust_companies[:10]:
raised = f"${c['raised']:,}" if c["raised"] else "undisclosed"
print(f" {c['name']} ({c['stage']}) — raised {raised} — {c['open_roles']} open roles")
3. Salary Negotiation Intelligence
def get_offer_context(title_keyword, company_stage=None, remote=True):
"""
Given a job title and company stage, return salary percentiles
to inform salary negotiations.
"""
conn = sqlite3.connect("wellfound_jobs.db")
conditions = ["j.title LIKE ?", "j.salary_min IS NOT NULL"]
params = [f"%{title_keyword}%"]
if company_stage:
conditions.append("c.stage = ?")
params.append(company_stage)
if remote is not None:
conditions.append("j.remote = ?")
params.append(1 if remote else 0)
where = " AND ".join(conditions)
cursor = conn.execute(f"""
SELECT j.salary_min, j.salary_max, j.equity_min, j.equity_max,
c.stage, c.company_size
FROM jobs j
JOIN companies c ON j.company_id = c.id
WHERE {where}
""", params)
rows = cursor.fetchall()
if not rows:
return None
salaries = [(r[0] + r[1]) / 2 for r in rows if r[0] and r[1]]
equities = [(r[2] + r[3]) / 2 for r in rows if r[2] and r[3]]
salaries.sort()
equities.sort()
def percentile(lst, p):
if not lst:
return None
i = int(len(lst) * p / 100)
return lst[min(i, len(lst) - 1)]
return {
"sample_size": len(rows),
"salary_p25": percentile(salaries, 25),
"salary_median": percentile(salaries, 50),
"salary_p75": percentile(salaries, 75),
"equity_p25": percentile(equities, 25),
"equity_median": percentile(equities, 50),
"equity_p75": percentile(equities, 75),
}
context = get_offer_context("Machine Learning Engineer", "Series A")
if context:
print(f"Salary range (n={context['sample_size']}):")
print(f" P25: ${context['salary_p25']:,.0f}")
print(f" Median: ${context['salary_median']:,.0f}")
print(f" P75: ${context['salary_p75']:,.0f}")
print(f"Equity median: {context['equity_median']:.2f}%")
Complete Pipeline
def full_pipeline(
role_slugs=None,
output_db="wellfound_jobs.db",
max_pages=25,
):
"""
Full Wellfound jobs scraping pipeline.
Scrapes multiple roles, handles pagination, stores in SQLite.
"""
if role_slugs is None:
role_slugs = ["software-engineer", "machine-learning-engineer", "data-scientist"]
conn = init_db(output_db)
proxy = get_proxy(country="us")
for role in role_slugs:
print(f"\n=== Scraping: {role} ===")
run_start = datetime.utcnow().isoformat()
total = 0
page = 0
while page < max_pages:
result = resilient_job_search(role, page=page, remote=True)
batch = result.get("results", [])
if not batch:
break
batch_save_jobs(conn, batch)
total += len(batch)
print(f" Page {page}: {len(batch)} jobs saved (total: {total})")
if page >= result.get("totalPages", 0) - 1:
break
page += 1
time.sleep(random.uniform(2.0, 4.0))
conn.execute("""
INSERT INTO scrape_runs (role_slug, total_jobs, pages_scraped, started_at, completed_at)
VALUES (?, ?, ?, ?, ?)
""", (role, total, page + 1, run_start, datetime.utcnow().isoformat()))
conn.commit()
print(f" {role} complete: {total} jobs")
# Print summary stats
cursor = conn.execute("SELECT COUNT(*) FROM jobs")
total_jobs = cursor.fetchone()[0]
cursor = conn.execute("SELECT COUNT(*) FROM companies")
total_companies = cursor.fetchone()[0]
print(f"\nDatabase: {total_jobs:,} jobs, {total_companies:,} companies")
if __name__ == "__main__":
full_pipeline()
Legal Notes
Wellfound's Terms of Service prohibit automated scraping. This guide is for educational and research purposes.
Key considerations for your jurisdiction:
- hiQ v. LinkedIn (9th Circuit): scraping publicly accessible data generally doesn't violate the CFAA
- GDPR: EU users' personal data (names, contact info) has additional protections
- Commercial use: redistributing scraped data as a product carries higher legal risk than internal research
Practical safe use: personal salary research, academic market studies, internal tooling. Avoid: reselling data, building competing products, scraping at volumes that stress their infrastructure.
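One concrete way to keep volumes polite is to cap the average request rate with a token bucket. This is a generic sketch, not tied to any published Wellfound limit; tune rate_per_sec well below whatever throttling threshold you observe:

```python
import time

class TokenBucket:
    """Token-bucket limiter: permits short bursts, enforces an average rate."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available, refilling based on elapsed time."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Check allow() before each GraphQL call and sleep briefly when it returns False. The injectable clock keeps the limiter testable without real waiting.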
Summary
Wellfound's GraphQL API provides the cleanest access to startup job data — salary, equity, stage, funding, and tech stack in structured JSON. The main technical obstacles are Cloudflare bot protection and auth walls on some endpoints.
Core techniques:
1. GraphQL direct queries — richest data, requires residential proxies
2. __NEXT_DATA__ extraction — bypasses auth for server-rendered content
3. Playwright interception — for auth-gated or heavily dynamic pages
4. ThorData residential proxies — required for Cloudflare, essential for volume
With weekly scrapes across 5–10 role slugs, you build a useful salary benchmarking dataset within a month. Add startup detail enrichment and you have market intelligence that rivals Crunchbase — for the cost of proxy bandwidth.