
How to Scrape AngelList (Wellfound) Startup Data with Python (2026 Guide)


AngelList's talent arm rebranded to Wellfound, but the startup data is still some of the most valuable in tech: funding rounds, team sizes, tech stacks, investor connections, and job listings with salary ranges and equity percentages. It amounts to a structured dataset of the entire startup ecosystem. If you're doing competitive research, building a market intelligence tool, or tracking which investors back companies in a given vertical, Wellfound is the primary public source.

Unlike paid databases such as Crunchbase or PitchBook, which charge thousands of dollars per month, Wellfound exposes much of this data publicly through its web interface. There's no official API for third parties. Everything goes through a Next.js frontend backed by GraphQL, which means browser automation with Playwright and some patience with their anti-bot setup.


What Data Is Available

A Wellfound company profile surfaces the company's tagline and description, stage, team size, total raised, funding rounds with investors, markets, tech stack, and social links. Job listings add:

- Compensation: minimum and maximum salary
- Equity: percentage range offered
- Remote policy: remote, hybrid, or on-site
- Experience level: entry, mid, senior
- Role type: full-time, contract, internship


Setup

pip install playwright httpx
playwright install chromium

Understanding the Site Architecture

Wellfound is a Next.js application. The initial page load returns server-rendered HTML with data embedded in a <script id="__NEXT_DATA__"> tag. Subsequent navigation fetches via internal GraphQL endpoints at https://wellfound.com/graphql.

Two complementary approaches:

  1. __NEXT_DATA__ extraction — parse the JSON embedded in server-rendered HTML (no auth required for most data)
  2. GraphQL interception — use Playwright to capture network responses, or replay GraphQL queries directly
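Before committing to one approach for a given page, it's worth checking the raw HTML first. A minimal sketch (the `__NEXT_DATA__` marker matches what Next.js emits; the "Just a moment" string is Cloudflare's interstitial title):

```python
def pick_approach(html: str) -> str:
    """Suggest a scraping approach for a fetched page.

    Pages that embed __NEXT_DATA__ can be parsed without a browser;
    a Cloudflare interstitial means you need Playwright; otherwise
    try replaying the internal GraphQL queries.
    """
    if 'id="__NEXT_DATA__"' in html:
        return "next_data"    # parse the embedded JSON directly
    if "cf-challenge" in html or "Just a moment" in html:
        return "playwright"   # Cloudflare challenge page: needs a real browser
    return "graphql"          # fall back to GraphQL replay

# Example with synthetic HTML:
print(pick_approach('<script id="__NEXT_DATA__" type="application/json">{}</script>'))
```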

Approach 1: Extracting __NEXT_DATA__

The fastest approach — no browser required for server-rendered pages:

import httpx
import json
import re

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://wellfound.com/",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "same-origin",
}

def extract_next_data(url, proxy_url=None):
    """Extract __NEXT_DATA__ JSON from a Wellfound page."""
    client_kwargs = {
        "headers": HEADERS,
        "follow_redirects": True,
        "timeout": 20,
    }
    if proxy_url:
        # httpx >= 0.28 takes a single `proxy=` URL;
        # older versions used proxies={"all://": proxy_url}
        client_kwargs["proxy"] = proxy_url

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(url)
        resp.raise_for_status()

    # Extract the __NEXT_DATA__ script tag
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        resp.text,
        re.DOTALL,
    )
    if not match:
        return None

    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

def scrape_company_from_next_data(company_slug, proxy_url=None):
    """Scrape company data from __NEXT_DATA__ embedding."""
    url = f"https://wellfound.com/company/{company_slug}"
    data = extract_next_data(url, proxy_url)

    if not data:
        return None

    # Navigate the Next.js props structure
    props = data.get("props", {}).get("pageProps", {})

    # Different pages structure data differently
    company = (
        props.get("company")
        or props.get("startup")
        or props.get("initialData", {}).get("company")
    )

    if not company:
        # Try extracting from Apollo cache embedded in page
        apollo_state = props.get("apolloState", {})
        company_keys = [k for k in apollo_state if k.startswith("Startup:")]
        if company_keys:
            company = apollo_state[company_keys[0]]

    return company

# Example usage
company = scrape_company_from_next_data("stripe")
if company:
    print(json.dumps(company, indent=2)[:2000])
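When the company lives in the embedded Apollo cache, nested objects are often stored as flat entries referenced by {"__ref": "Market:9"}-style pointers rather than inline. A small resolver can reassemble the nested structure (a sketch; the exact key formats vary by page, so inspect real payloads first):

```python
def resolve_refs(state, node, depth=0, max_depth=5):
    """Recursively replace Apollo {"__ref": key} pointers with the
    referenced cache entries, bounded by max_depth to avoid cycles."""
    if depth > max_depth:
        return node
    if isinstance(node, dict):
        if set(node) == {"__ref"}:
            target = state.get(node["__ref"], node)
            return resolve_refs(state, target, depth + 1, max_depth)
        return {k: resolve_refs(state, v, depth + 1, max_depth) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve_refs(state, v, depth + 1, max_depth) for v in node]
    return node

# Example with a toy cache:
cache = {
    "Startup:1": {"name": "Acme", "market": {"__ref": "Market:9"}},
    "Market:9": {"displayName": "Fintech"},
}
company = resolve_refs(cache, cache["Startup:1"])
print(company["market"]["displayName"])  # Fintech
```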

Approach 2: GraphQL API Direct Queries

Wellfound's frontend communicates with a GraphQL endpoint. You can replay these queries directly:

import httpx
import json

GQL_URL = "https://wellfound.com/graphql"

GQL_HEADERS = {
    **HEADERS,
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Origin": "https://wellfound.com",
    "Referer": "https://wellfound.com/companies",
    "X-Requested-With": "XMLHttpRequest",
}

def graphql_query(query, variables, proxy_url=None):
    """Execute a GraphQL query against Wellfound's endpoint."""
    client_kwargs = {
        "headers": GQL_HEADERS,
        "follow_redirects": True,
        "timeout": 20,
    }
    if proxy_url:
        # httpx >= 0.28 takes a single `proxy=` URL;
        # older versions used proxies={"all://": proxy_url}
        client_kwargs["proxy"] = proxy_url

    payload = {"query": query, "variables": variables}

    with httpx.Client(**client_kwargs) as client:
        resp = client.post(GQL_URL, json=payload)
        resp.raise_for_status()
        return resp.json()

COMPANY_QUERY = """
query StartupDetail($slug: String!) {
  startups {
    startup(slug: $slug) {
      id
      name
      slug
      highConcept
      productDescription
      companySize
      stage
      totalRaised
      foundedDate
      twitterUrl
      linkedInUrl
      crunchbaseUrl
      websiteUrl
      markets {
        displayName
        slug
      }
      techStack {
        displayName
        slug
      }
      investors {
        name
        slug
      }
      fundingRounds {
        roundType
        raisedAmount
        closedAt
        investors {
          name
        }
      }
      jobListings {
        id
        title
        slug
        compensation
        equity
        remote
        locationNames
        roleType
        startDate
      }
    }
  }
}
"""

def get_company_data(company_slug, proxy_url=None):
    """Fetch detailed company data from Wellfound GraphQL."""
    result = graphql_query(
        COMPANY_QUERY,
        {"slug": company_slug},
        proxy_url,
    )

    errors = result.get("errors")
    if errors:
        print(f"GraphQL errors: {errors}")
        return None

    return (
        result.get("data", {})
        .get("startups", {})
        .get("startup")
    )

# Example
company = get_company_data("stripe")
if company:
    print(f"{company['name']}: ${company.get('totalRaised') or 0:,} raised")
    print(f"Stage: {company.get('stage')}")
    print(f"Tech stack: {[t['displayName'] for t in company.get('techStack', [])]}")
    print(f"Open roles: {len(company.get('jobListings', []))}")

Approach 3: Playwright Browser Automation

For pages with heavy client-side rendering or authentication walls:

from playwright.sync_api import sync_playwright
import json
import time
import random

def scrape_company_playwright(company_slug, proxy_config=None):
    """Scrape company profile using Playwright browser automation."""
    with sync_playwright() as p:
        launch_kwargs = {
            "headless": True,
            "args": [
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--no-sandbox",
            ],
        }
        if proxy_config:
            launch_kwargs["proxy"] = proxy_config

        browser = p.chromium.launch(**launch_kwargs)
        context = browser.new_context(
            viewport={"width": 1440, "height": 900},
            user_agent=HEADERS["User-Agent"],
            locale="en-US",
            timezone_id="America/New_York",
        )

        # Intercept GraphQL responses
        graphql_data = []
        def handle_response(response):
            if "graphql" in response.url.lower():
                try:
                    data = response.json()
                    if data.get("data"):
                        graphql_data.append(data["data"])
                except Exception:
                    pass

        page = context.new_page()
        page.on("response", handle_response)

        # Navigate to company page
        page.goto(
            f"https://wellfound.com/company/{company_slug}",
            wait_until="networkidle",
            timeout=30000,
        )
        time.sleep(random.uniform(2, 4))

        # Extract visible data from DOM
        company_data = page.evaluate("""
            () => {
                const getText = (sel) => document.querySelector(sel)?.textContent?.trim();
                const getAll = (sel) => Array.from(
                    document.querySelectorAll(sel)
                ).map(e => e.textContent.trim()).filter(Boolean);

                return {
                    name: getText('h1') || getText('[data-test="company-name"]'),
                    tagline: getText('[data-test="tagline"]') || getText('[class*="tagline"]'),
                    description: getText('[data-test="description"]') || getText('[class*="description"]'),
                    size: getText('[data-test="company-size"]'),
                    stage: getText('[data-test="company-stage"]'),
                    markets: getAll('[data-test="market-tag"], [class*="market"]'),
                    tech_stack: getAll('[data-test="tech-tag"], [class*="tech-stack"]'),
                    social_links: Array.from(document.querySelectorAll('a[href*="twitter"], a[href*="linkedin"]'))
                        .map(a => ({ href: a.href, text: a.textContent.trim() })),
                };
            }
        """)

        # Scroll to load more content (lazy-loaded sections)
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(1.5)

        # Get funding info (often in a separate section)
        funding_data = page.evaluate("""
            () => {
                const rounds = [];
                document.querySelectorAll('[class*="funding-round"], [data-test*="round"]').forEach(el => {
                    rounds.push({
                        text: el.textContent.trim(),
                        type: el.querySelector('[class*="round-type"]')?.textContent?.trim(),
                        amount: el.querySelector('[class*="amount"]')?.textContent?.trim(),
                        date: el.querySelector('[class*="date"]')?.textContent?.trim(),
                    });
                });
                return rounds;
            }
        """)

        company_data["funding_rounds"] = funding_data
        company_data["graphql_data"] = graphql_data  # Raw captured responses

        browser.close()
        return company_data

# Proxy config for ThorData
proxy_config = {
    "server": "http://proxy.thordata.com:9000",
    "username": "your_user",
    "password": "your_pass",
}
company = scrape_company_playwright("stripe", proxy_config)
print(f"Company: {company.get('name')}")
print(f"Stage: {company.get('stage')}")

Anti-Bot Measures

Wellfound runs Cloudflare with bot scoring. Understanding the layers:

Cloudflare Bot Management

Cloudflare checks:

- IP reputation: datacenter IPs fail immediately; residential IPs pass
- TLS fingerprint: non-browser TLS stacks get challenged
- JavaScript execution: CF injects a challenge that must execute in a real browser
- Behavioral signals: mouse movements, scroll events, request patterns

Solution: Use residential proxies. Datacenter IPs are near-universally blocked by Wellfound's Cloudflare config. ThorData's residential proxy network routes through real household IPs that pass Cloudflare's bot scoring.

PROXY_URL = "http://user:[email protected]:9000"

# For httpx direct requests (httpx >= 0.28 uses `proxy=`;
# older versions used proxies={"all://": PROXY_URL})
client = httpx.Client(
    headers=GQL_HEADERS,
    proxy=PROXY_URL,
    timeout=25,
)

# For Playwright
proxy_config = {
    "server": "http://proxy.thordata.com:9000",
    "username": "user",
    "password": "pass",
}
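Many residential providers also support sticky sessions, typically encoded in the proxy username, so one IP persists across a crawl of a single company. The `-session-<id>` suffix below is a hypothetical convention; substitute whatever syntax your provider actually documents:

```python
import random
import string

def make_session_proxy(user, password, host, port, session_id=None):
    """Build a proxy URL with a session-pinning username suffix.

    The "-session-<id>" format is an assumption for illustration;
    check your proxy provider's documentation for the real syntax.
    """
    if session_id is None:
        session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return f"http://{user}-session-{session_id}:{password}@{host}:{port}"

url = make_session_proxy("user", "pass", "proxy.thordata.com", 9000, session_id="abc123")
print(url)  # http://user-session-abc123:[email protected]:9000
```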

Rate Limiting

Wellfound's GraphQL endpoint throttles at roughly 60–120 requests per minute per session. Implement delays:

import time
import random

def rate_limited_query(query, variables, proxy_url=None, min_delay=1.0, max_delay=3.0):
    """Execute GraphQL query with rate limiting."""
    time.sleep(random.uniform(min_delay, max_delay))
    return graphql_query(query, variables, proxy_url)
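When the endpoint does return 429s, back off exponentially with jitter instead of retrying at a fixed delay. A sketch (the callable you pass in would be something like `lambda: graphql_query(QUERY, variables)` from earlier):

```python
import random
import time

def backoff_delay(attempt, base=2.0, cap=60.0):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def with_retries(fn, max_attempts=5, retry_on=(429, 502, 503)):
    """Call fn(); on a retryable HTTP status, sleep with backoff and retry.

    fn is any zero-arg callable that raises httpx.HTTPStatusError,
    e.g. from a response.raise_for_status() inside it.
    """
    import httpx  # imported lazily so the helper stays self-contained
    for attempt in range(max_attempts):
        try:
            return fn()
        except httpx.HTTPStatusError as e:
            if e.response.status_code not in retry_on or attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```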

Dynamic Class Names

CSS classes are hashed and change on deploys. Use stable selectors:

# Fragile — breaks on redeploy:
# page.query_selector(".styles_component__x7f2a")

# Stable alternatives:
page.query_selector("h1")                              # Semantic HTML
page.query_selector("[data-test='company-name']")     # data-test attributes
page.query_selector("main >> text=Funding")            # Text content selector
page.query_selector("[aria-label*='Stage']")           # Aria attributes
page.get_by_role("heading", level=1)                   # ARIA role

Login Walls

Some fields (detailed investor contacts, full salary for some roles) require authentication. Options:

  1. Session cookie injection: Log in manually, export cookies, inject via Playwright
  2. Stick to unauthenticated endpoints: Most public company data is accessible without login
  3. __NEXT_DATA__ bypass: Server-rendered data often bypasses auth checks

def inject_session_cookies(context, cookies_dict):
    """Inject authenticated session cookies into a Playwright context."""
    context.add_cookies([
        {"name": name, "value": value, "domain": ".wellfound.com", "path": "/"}
        for name, value in cookies_dict.items()
    ])

Scraping Company Listings at Scale

To scrape many companies (e.g., all Series A startups in fintech):

import time
import random
from pathlib import Path

def search_companies_by_market(
    market_slug, stage=None, proxy_url=None, max_companies=500
):
    """Search Wellfound for companies by market/vertical."""
    query = """
    query CompanySearch($market: String!, $stage: String, $page: Int) {
      startups {
        searchByMarket(market: $market, stage: $stage, page: $page) {
          results {
            id
            name
            slug
            highConcept
            companySize
            stage
            totalRaised
            markets { displayName }
          }
          totalCount
          totalPages
        }
      }
    }
    """

    all_companies = []
    page = 1

    while len(all_companies) < max_companies:
        variables = {"market": market_slug, "page": page}
        if stage:
            variables["stage"] = stage

        result = graphql_query(query, variables, proxy_url)
        if not result or result.get("errors"):
            break

        data = (
            result.get("data", {})
            .get("startups", {})
            .get("searchByMarket", {})
        )

        batch = data.get("results", [])
        if not batch:
            break

        all_companies.extend(batch)
        total_pages = data.get("totalPages", 1)

        print(f"Page {page}/{total_pages}: {len(batch)} companies (total: {len(all_companies)})")

        if page >= total_pages:
            break

        page += 1
        time.sleep(random.uniform(2.0, 4.0))

    return all_companies[:max_companies]

# Get all fintech Series A companies
proxy = "http://user:[email protected]:9000"
companies = search_companies_by_market("fintech", stage="series-a", proxy_url=proxy)
print(f"Found {len(companies)} fintech Series A companies")

Parsing Salary and Equity Data

import re

def parse_compensation(raw):
    """
    Parse salary string like "$120K - $160K" or "$90K - $130K".
    Returns {"salary_min": 120000, "salary_max": 160000}
    """
    if not raw:
        return {"salary_min": None, "salary_max": None}

    # Handle various dash types: -, –, —
    raw = re.sub(r"[–—]", "-", raw)
    nums = re.findall(r"\$?([\d,]+)[Kk]", raw)

    if len(nums) >= 2:
        return {
            "salary_min": int(nums[0].replace(",", "")) * 1000,
            "salary_max": int(nums[1].replace(",", "")) * 1000,
        }
    elif len(nums) == 1:
        val = int(nums[0].replace(",", "")) * 1000
        return {"salary_min": val, "salary_max": val}

    return {"salary_min": None, "salary_max": None}

def parse_equity(raw):
    """
    Parse equity string like "0.10% - 0.50%".
    Returns {"equity_min": 0.10, "equity_max": 0.50}
    """
    if not raw:
        return {"equity_min": None, "equity_max": None}

    nums = re.findall(r"([\d.]+)%", raw)
    if len(nums) >= 2:
        return {"equity_min": float(nums[0]), "equity_max": float(nums[1])}
    elif len(nums) == 1:
        val = float(nums[0])
        return {"equity_min": val, "equity_max": val}

    return {"equity_min": None, "equity_max": None}

def parse_total_raised(raw):
    """Parse total raised like "$4.5M" or "$250K" or "$1.2B".

    GraphQL sometimes returns a plain integer (or an integer rendered
    as a string), so those are passed through unchanged.
    """
    if raw is None or raw == "":
        return None
    if isinstance(raw, (int, float)):
        return int(raw)
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    match = re.search(r"\$?([\d.]+)([KMB])", raw, re.IGNORECASE)
    if match:
        num = float(match.group(1))
        mult = multipliers.get(match.group(2).upper(), 1)
        return int(num * mult)
    # Plain number string, e.g. "4500000"
    digits = raw.replace("$", "").replace(",", "")
    if digits.replace(".", "", 1).isdigit():
        return int(float(digits))
    return None
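Company size usually arrives as a range string as well. A hedged parser keeps it queryable; the "11-50" and "5,000+" formats are assumptions, so inspect real responses before relying on them:

```python
import re

def parse_company_size(raw):
    """Parse size strings like "11-50" or "51-200 employees" into a
    (min, max) tuple; "5,000+"-style values get an open-ended max."""
    if not raw:
        return (None, None)
    m = re.search(r"(\d[\d,]*)\s*-\s*(\d[\d,]*)", raw)
    if m:
        return (
            int(m.group(1).replace(",", "")),
            int(m.group(2).replace(",", "")),
        )
    m = re.search(r"(\d[\d,]*)\s*\+", raw)
    if m:
        return (int(m.group(1).replace(",", "")), None)
    return (None, None)

print(parse_company_size("51-200 employees"))  # (51, 200)
```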

Intercepting GraphQL Network Traffic

The most reliable approach for capturing all data — let Playwright browse and capture everything:

from playwright.sync_api import sync_playwright
import json
from collections import defaultdict

def capture_all_graphql(company_slug, proxy_config=None):
    """
    Browse a company page and capture all GraphQL responses.
    Returns a structured dict of all data returned.
    """
    captured = defaultdict(list)

    def on_response(response):
        if "graphql" not in response.url.lower():
            return
        try:
            body = response.json()
            data = body.get("data", {})
            for key, value in data.items():
                captured[key].append(value)
        except Exception:
            pass

    with sync_playwright() as p:
        launch_kwargs = {"headless": True}
        if proxy_config:
            launch_kwargs["proxy"] = proxy_config

        browser = p.chromium.launch(**launch_kwargs)
        context = browser.new_context(
            user_agent=HEADERS["User-Agent"],
            viewport={"width": 1440, "height": 900},
        )
        page = context.new_page()
        page.on("response", on_response)

        # Visit main profile
        page.goto(
            f"https://wellfound.com/company/{company_slug}",
            wait_until="networkidle",
        )

        # Visit jobs tab to trigger job listings query (tab may be absent)
        try:
            page.click("text=Jobs")
            page.wait_for_load_state("networkidle")
        except Exception:
            pass

        # Visit funding tab
        try:
            page.click("text=Funding")
            page.wait_for_load_state("networkidle")
        except Exception:
            pass

        browser.close()

    return dict(captured)

data = capture_all_graphql("stripe")
for key, values in data.items():
    print(f"{key}: {len(values)} response(s)")
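Captured responses often split one logical object across several partial payloads (profile query, jobs query, funding query). A small merge helper consolidates them; this is a sketch in which later non-null values win and lists are replaced wholesale, which is the safest default for paginated data:

```python
def merge_payloads(payloads):
    """Deep-merge a list of GraphQL `data` dicts into one dict.

    Nested dicts merge recursively; for scalars and lists, a later
    non-null value overwrites an earlier one.
    """
    def merge(a, b):
        if isinstance(a, dict) and isinstance(b, dict):
            out = dict(a)
            for k, v in b.items():
                out[k] = merge(out[k], v) if k in out else v
            return out
        return a if b is None else b

    result = {}
    for p in payloads:
        result = merge(result, p)
    return result

merged = merge_payloads([
    {"startup": {"name": "Acme", "stage": None}},
    {"startup": {"stage": "Series A", "jobs": [1, 2]}},
])
print(merged)  # {'startup': {'name': 'Acme', 'stage': 'Series A', 'jobs': [1, 2]}}
```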

Data Storage

SQLite Schema

import sqlite3
import json
from datetime import datetime

def init_db(db_path="wellfound.db"):
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS companies (
            id TEXT PRIMARY KEY,
            slug TEXT UNIQUE NOT NULL,
            name TEXT,
            tagline TEXT,
            description TEXT,
            stage TEXT,
            company_size TEXT,
            total_raised INTEGER,
            founded_date TEXT,
            website_url TEXT,
            twitter_url TEXT,
            linkedin_url TEXT,
            markets TEXT,
            tech_stack TEXT,
            scraped_at TEXT
        );

        CREATE TABLE IF NOT EXISTS funding_rounds (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            company_id TEXT,
            round_type TEXT,
            raised_amount INTEGER,
            closed_at TEXT,
            investors TEXT,
            FOREIGN KEY (company_id) REFERENCES companies(id)
        );

        CREATE TABLE IF NOT EXISTS job_listings (
            id TEXT PRIMARY KEY,
            company_id TEXT,
            title TEXT,
            compensation TEXT,
            salary_min INTEGER,
            salary_max INTEGER,
            equity_min REAL,
            equity_max REAL,
            remote INTEGER,
            location TEXT,
            role_type TEXT,
            scraped_at TEXT,
            FOREIGN KEY (company_id) REFERENCES companies(id)
        );

        CREATE INDEX IF NOT EXISTS idx_companies_stage ON companies(stage);
        CREATE INDEX IF NOT EXISTS idx_jobs_company ON job_listings(company_id);
    """)
    conn.commit()
    return conn

def save_company(conn, company_data):
    """Save company and its nested data to SQLite."""
    markets = json.dumps([m["displayName"] for m in company_data.get("markets", [])])
    tech_stack = json.dumps([t["displayName"] for t in company_data.get("techStack", [])])

    conn.execute("""
        INSERT OR REPLACE INTO companies
        (id, slug, name, tagline, description, stage, company_size,
         total_raised, founded_date, website_url, twitter_url, linkedin_url,
         markets, tech_stack, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        company_data.get("id"), company_data.get("slug"),
        company_data.get("name"), company_data.get("highConcept"),
        company_data.get("productDescription"), company_data.get("stage"),
        company_data.get("companySize"),
        parse_total_raised(str(company_data.get("totalRaised", ""))),
        company_data.get("foundedDate"), company_data.get("websiteUrl"),
        company_data.get("twitterUrl"), company_data.get("linkedInUrl"),
        markets, tech_stack,
        datetime.utcnow().isoformat(),
    ))

    # Save funding rounds
    for round_data in company_data.get("fundingRounds", []):
        investors = json.dumps([i["name"] for i in round_data.get("investors", [])])
        conn.execute("""
            INSERT INTO funding_rounds
            (company_id, round_type, raised_amount, closed_at, investors)
            VALUES (?, ?, ?, ?, ?)
        """, (
            company_data.get("id"),
            round_data.get("roundType"),
            round_data.get("raisedAmount"),
            round_data.get("closedAt"),
            investors,
        ))

    # Save job listings
    for job in company_data.get("jobListings", []):
        comp = parse_compensation(job.get("compensation", ""))
        equity = parse_equity(job.get("equity", ""))
        conn.execute("""
            INSERT OR REPLACE INTO job_listings
            (id, company_id, title, compensation, salary_min, salary_max,
             equity_min, equity_max, remote, location, role_type, scraped_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            job.get("id"), company_data.get("id"),
            job.get("title"), job.get("compensation"),
            comp["salary_min"], comp["salary_max"],
            equity["equity_min"], equity["equity_max"],
            1 if job.get("remote") else 0,
            ", ".join(job.get("locationNames", [])),
            job.get("roleType"),
            datetime.utcnow().isoformat(),
        ))

    conn.commit()

Real-World Use Cases

1. Startup Intelligence Feed

Build a daily feed of new funded startups in a vertical:

def daily_funding_monitor(markets, proxy_url=None):
    """Monitor new startups funded this week in target markets."""
    conn = init_db()
    new_companies = []

    for market in markets:
        companies = search_companies_by_market(market, proxy_url=proxy_url)
        for company in companies:
            # Check if new to our database
            existing = conn.execute(
                "SELECT id FROM companies WHERE slug = ?",
                (company.get("slug"),)
            ).fetchone()

            if not existing:
                # Fetch full details
                details = get_company_data(company["slug"], proxy_url)
                if details:
                    save_company(conn, details)
                    new_companies.append(details)
                    print(f"New: {details['name']} ({details.get('stage')})")
                time.sleep(random.uniform(2, 4))

    return new_companies

2. Salary Benchmarking Tool

Aggregate salary and equity data across roles and stages:

def build_salary_report(db_path="wellfound.db"):
    conn = sqlite3.connect(db_path)

    cursor = conn.execute("""
        SELECT
            j.title,
            c.stage,
            COUNT(*) as listings,
            AVG(j.salary_min) as avg_min,
            AVG(j.salary_max) as avg_max,
            AVG(j.equity_min) as avg_equity_min,
            AVG(j.equity_max) as avg_equity_max
        FROM job_listings j
        JOIN companies c ON j.company_id = c.id
        WHERE j.salary_min IS NOT NULL
          AND j.title LIKE '%Engineer%'
        GROUP BY j.title, c.stage
        HAVING COUNT(*) >= 3
        ORDER BY avg_max DESC
    """)

    print("\nSalary benchmarks for engineering roles:")
    for row in cursor.fetchall():
        title, stage, count, avg_min, avg_max, eq_min, eq_max = row
        # Equity averages can be NULL if no listing in the group disclosed equity
        equity = f"{eq_min:.2f}%-{eq_max:.2f}%" if eq_min is not None else "n/a"
        print(f"  {title} @ {stage}: ${avg_min:,.0f}-${avg_max:,.0f} | "
              f"Equity: {equity} ({count} listings)")

3. Investor Portfolio Tracker

Track which VCs are most active in your vertical:

def analyze_investor_activity(db_path="wellfound.db"):
    conn = sqlite3.connect(db_path)

    # Expand JSON investors array per company
    companies = conn.execute(
        "SELECT name, stage, markets FROM companies"
    ).fetchall()

    investor_counts = {}
    for name, stage, markets_json in companies:
        # This requires fetching funding rounds separately
        rounds = conn.execute(
            "SELECT investors FROM funding_rounds WHERE company_id = ("
            "SELECT id FROM companies WHERE name = ?)",
            (name,)
        ).fetchall()

        for (investors_json,) in rounds:
            try:
                for investor in json.loads(investors_json or "[]"):
                    investor_counts[investor] = investor_counts.get(investor, 0) + 1
            except json.JSONDecodeError:
                pass

    return sorted(investor_counts.items(), key=lambda x: -x[1])[:20]

top_investors = analyze_investor_activity()
print("\nMost active investors in database:")
for investor, count in top_investors:
    print(f"  {investor}: {count} portfolio companies")

Full Scrape Pipeline

import json
from pathlib import Path
import time
import random

def scrape_startup_ecosystem(
    market_slugs,
    output_dir="wellfound_data",
    proxy_url=None,
    max_per_market=200,
):
    """Complete pipeline: discover and enrich companies by market."""
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    conn = init_db(str(out / "startups.db"))

    total_saved = 0

    for market in market_slugs:
        print(f"\n=== Market: {market} ===")

        # Discover companies
        companies = search_companies_by_market(
            market, proxy_url=proxy_url, max_companies=max_per_market
        )

        for company in companies:
            slug = company.get("slug")
            if not slug:
                continue

            # Skip if already in DB (from previous run)
            existing = conn.execute(
                "SELECT scraped_at FROM companies WHERE slug = ?", (slug,)
            ).fetchone()
            if existing:
                continue

            # Fetch full details
            try:
                details = get_company_data(slug, proxy_url)
                if details:
                    save_company(conn, details)
                    total_saved += 1
                    print(f"  Saved {details.get('name')} ({details.get('stage')})")
            except Exception as e:
                print(f"  Error on {slug}: {e}")

            time.sleep(random.uniform(2.0, 5.0))

    print(f"\nComplete: {total_saved} companies saved")
    return total_saved

# Run it
proxy = "http://user:[email protected]:9000"
saved = scrape_startup_ecosystem(
    ["fintech", "ai-ml", "saas", "healthcare"],
    proxy_url=proxy,
    max_per_market=100,
)

Legal Considerations

Wellfound's Terms of Service prohibit automated scraping, and this applies regardless of whether the data is publicly visible. Keep request rates low, cache results so you never re-fetch what you already have, and get legal advice before building a commercial product on scraped data.

For production use at scale, consider Crunchbase API (paid but licensed), PitchBook, or direct partnerships with data providers.


Summary

Wellfound offers one of the richest publicly accessible startup datasets available. The technical path to accessing it:

  1. __NEXT_DATA__ extraction for server-rendered pages — no auth needed, fastest approach
  2. GraphQL direct queries — richest data, moderate rate limits, requires residential proxies
  3. Playwright with network interception — most complete, handles auth walls and dynamic content

Cloudflare bot protection is the primary obstacle. ThorData residential proxies are essential — datacenter IPs fail Cloudflare's bot scoring consistently. Store results in SQLite with proper indexing, and implement incremental scraping so you can resume after interruptions. Built correctly, this gives you a continuously updated startup intelligence database that rivals paid tools costing thousands per month.