How to Scrape AngelList (Wellfound) Startup Data with Python (2026 Guide)
AngelList rebranded its talent marketplace to Wellfound, but the startup data is still some of the most valuable in tech. Funding rounds, team sizes, tech stacks, investor connections, job listings with salary ranges, equity percentages — it's a structured dataset of the entire startup ecosystem. If you're doing competitive research, building a market intelligence tool, or tracking which investors back companies in a given vertical, Wellfound is the primary public source.
Unlike databases such as Crunchbase or PitchBook, which charge thousands per month, Wellfound exposes much of this data publicly through its web interface. There's no official API for third parties; everything goes through a Next.js frontend backed by GraphQL, which means browser automation with Playwright and some patience with their anti-bot setup.
What Data Is Available
A Wellfound company profile surfaces:
- Company basics: name, slug, founding year, description, tagline
- Stage and funding: seed, Series A–E, growth stage, pre-IPO; total amount raised
- Investors: named VC firms and angels on the cap table
- Funding rounds: round type, amount, close date
- Team size: headcount range (e.g., "11-50")
- Tech stack: specific tools and frameworks the company uses
- Markets: product categories and vertical tags
- Job listings: open roles with salary ranges and equity
- Social presence: Twitter, LinkedIn, GitHub, personal website
- Founders: names, LinkedIn profiles, previous companies
Job listings add:
- Compensation: minimum and maximum salary
- Equity: percentage range offered
- Remote policy: remote, hybrid, or on-site
- Experience level: entry, mid, senior
- Role type: full-time, contract, internship
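Normalizing these fields into a typed record keeps downstream code honest about what may be missing. A minimal sketch; the field names here are mine, not Wellfound's schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobListing:
    """One normalized Wellfound job listing (field names are illustrative)."""
    title: str
    salary_min: Optional[int] = None      # parsed from "$120K - $160K"
    salary_max: Optional[int] = None
    equity_min: Optional[float] = None    # parsed from "0.10% - 0.50%"
    equity_max: Optional[float] = None
    remote_policy: str = "unknown"        # remote / hybrid / on-site
    experience_level: str = "unknown"     # entry / mid / senior
    role_type: str = "unknown"            # full-time / contract / internship
```

Optional fields default to `None` so a listing missing salary or equity data doesn't silently become a zero in later aggregations.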
Setup
pip install playwright httpx beautifulsoup4 selectolax
playwright install chromium
Understanding the Site Architecture
Wellfound is a Next.js application. The initial page load returns server-rendered HTML with data embedded in a <script id="__NEXT_DATA__"> tag. Subsequent navigation fetches via internal GraphQL endpoints at https://wellfound.com/graphql.
Two complementary approaches:
- __NEXT_DATA__ extraction — parse the JSON embedded in server-rendered HTML (no auth required for most data)
- GraphQL interception — use Playwright to capture network responses, or replay GraphQL queries directly
Approach 1: Extracting __NEXT_DATA__
The fastest approach — no browser required for server-rendered pages:
import httpx
import json
import re
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://wellfound.com/",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "same-origin",
}
def extract_next_data(url, proxy_url=None):
"""Extract __NEXT_DATA__ JSON from a Wellfound page."""
client_kwargs = {
"headers": HEADERS,
"follow_redirects": True,
"timeout": 20,
}
    if proxy_url:
        # httpx 0.26+ takes `proxy=`; older releases used `proxies={"all://": ...}`
        client_kwargs["proxy"] = proxy_url
with httpx.Client(**client_kwargs) as client:
resp = client.get(url)
resp.raise_for_status()
# Extract the __NEXT_DATA__ script tag
match = re.search(
r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
resp.text,
re.DOTALL,
)
if not match:
return None
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
return None
def scrape_company_from_next_data(company_slug, proxy_url=None):
"""Scrape company data from __NEXT_DATA__ embedding."""
url = f"https://wellfound.com/company/{company_slug}"
data = extract_next_data(url, proxy_url)
if not data:
return None
# Navigate the Next.js props structure
props = data.get("props", {}).get("pageProps", {})
# Different pages structure data differently
company = (
props.get("company")
or props.get("startup")
or props.get("initialData", {}).get("company")
)
if not company:
# Try extracting from Apollo cache embedded in page
apollo_state = props.get("apolloState", {})
company_keys = [k for k in apollo_state if k.startswith("Startup:")]
if company_keys:
company = apollo_state[company_keys[0]]
return company
# Example usage
company = scrape_company_from_next_data("stripe")
if company:
print(json.dumps(company, indent=2)[:2000])
Approach 2: GraphQL API Direct Queries
Wellfound's frontend communicates with a GraphQL endpoint, and you can replay those queries directly. The query shapes below mirror what the frontend sends; since the schema is internal, expect it to drift between deploys:
import httpx
import json
GQL_URL = "https://wellfound.com/graphql"
GQL_HEADERS = {
**HEADERS,
"Content-Type": "application/json",
"Accept": "application/json",
"Origin": "https://wellfound.com",
"Referer": "https://wellfound.com/companies",
"X-Requested-With": "XMLHttpRequest",
}
def graphql_query(query, variables, proxy_url=None):
"""Execute a GraphQL query against Wellfound's endpoint."""
client_kwargs = {
"headers": GQL_HEADERS,
"follow_redirects": True,
"timeout": 20,
}
    if proxy_url:
        # httpx 0.26+ takes `proxy=`; older releases used `proxies={"all://": ...}`
        client_kwargs["proxy"] = proxy_url
payload = {"query": query, "variables": variables}
with httpx.Client(**client_kwargs) as client:
resp = client.post(GQL_URL, json=payload)
resp.raise_for_status()
return resp.json()
COMPANY_QUERY = """
query StartupDetail($slug: String!) {
startups {
startup(slug: $slug) {
id
name
slug
highConcept
productDescription
companySize
stage
totalRaised
foundedDate
twitterUrl
linkedInUrl
crunchbaseUrl
websiteUrl
markets {
displayName
slug
}
techStack {
displayName
slug
}
investors {
name
slug
}
fundingRounds {
roundType
raisedAmount
closedAt
investors {
name
}
}
jobListings {
id
title
slug
compensation
equity
remote
locationNames
roleType
startDate
}
}
}
}
"""
def get_company_data(company_slug, proxy_url=None):
"""Fetch detailed company data from Wellfound GraphQL."""
result = graphql_query(
COMPANY_QUERY,
{"slug": company_slug},
proxy_url,
)
errors = result.get("errors")
if errors:
print(f"GraphQL errors: {errors}")
return None
return (
result.get("data", {})
.get("startups", {})
.get("startup")
)
# Example
company = get_company_data("stripe")
if company:
print(f"{company['name']}: ${company.get('totalRaised', 0):,} raised")
print(f"Stage: {company.get('stage')}")
print(f"Tech stack: {[t['displayName'] for t in company.get('techStack', [])]}")
print(f"Open roles: {len(company.get('jobListings', []))}")
Approach 3: Playwright Browser Automation
For pages with heavy client-side rendering or authentication walls:
from playwright.sync_api import sync_playwright
import json
import time
import random
def scrape_company_playwright(company_slug, proxy_config=None):
"""Scrape company profile using Playwright browser automation."""
with sync_playwright() as p:
launch_kwargs = {
"headless": True,
"args": [
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--no-sandbox",
],
}
if proxy_config:
launch_kwargs["proxy"] = proxy_config
browser = p.chromium.launch(**launch_kwargs)
context = browser.new_context(
viewport={"width": 1440, "height": 900},
user_agent=HEADERS["User-Agent"],
locale="en-US",
timezone_id="America/New_York",
)
# Intercept GraphQL responses
graphql_data = []
def handle_response(response):
if "graphql" in response.url.lower():
try:
data = response.json()
if data.get("data"):
graphql_data.append(data["data"])
except Exception:
pass
page = context.new_page()
page.on("response", handle_response)
# Navigate to company page
page.goto(
f"https://wellfound.com/company/{company_slug}",
wait_until="networkidle",
timeout=30000,
)
time.sleep(random.uniform(2, 4))
# Extract visible data from DOM
company_data = page.evaluate("""
() => {
const getText = (sel) => document.querySelector(sel)?.textContent?.trim();
const getAll = (sel) => Array.from(
document.querySelectorAll(sel)
).map(e => e.textContent.trim()).filter(Boolean);
return {
name: getText('h1') || getText('[data-test="company-name"]'),
tagline: getText('[data-test="tagline"]') || getText('[class*="tagline"]'),
description: getText('[data-test="description"]') || getText('[class*="description"]'),
size: getText('[data-test="company-size"]'),
stage: getText('[data-test="company-stage"]'),
markets: getAll('[data-test="market-tag"], [class*="market"]'),
tech_stack: getAll('[data-test="tech-tag"], [class*="tech-stack"]'),
social_links: Array.from(document.querySelectorAll('a[href*="twitter"], a[href*="linkedin"]'))
.map(a => ({ href: a.href, text: a.textContent.trim() })),
};
}
""")
# Scroll to load more content (lazy-loaded sections)
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(1.5)
# Get funding info (often in a separate section)
funding_data = page.evaluate("""
() => {
const rounds = [];
document.querySelectorAll('[class*="funding-round"], [data-test*="round"]').forEach(el => {
rounds.push({
text: el.textContent.trim(),
type: el.querySelector('[class*="round-type"]')?.textContent?.trim(),
amount: el.querySelector('[class*="amount"]')?.textContent?.trim(),
date: el.querySelector('[class*="date"]')?.textContent?.trim(),
});
});
return rounds;
}
""")
company_data["funding_rounds"] = funding_data
company_data["graphql_data"] = graphql_data # Raw captured responses
browser.close()
return company_data
# Proxy config for ThorData
proxy_config = {
"server": "http://proxy.thordata.com:9000",
"username": "your_user",
"password": "your_pass",
}
company = scrape_company_playwright("stripe", proxy_config)
print(f"Company: {company.get('name')}")
print(f"Stage: {company.get('stage')}")
Anti-Bot Measures
Wellfound runs Cloudflare with bot scoring. Understanding the layers:
Cloudflare Bot Management
Cloudflare checks:
- IP reputation: datacenter IPs fail immediately; residential IPs pass
- TLS fingerprint: non-browser TLS stacks get challenged
- JavaScript execution: Cloudflare injects a challenge that must execute in a real browser
- Behavioral signals: mouse movements, scroll events, request patterns
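Before switching strategies it helps to detect when you've actually hit a challenge page rather than a real response. A best-effort heuristic; the status codes and body markers below are common Cloudflare signatures, not an exhaustive or guaranteed list:

```python
def looks_like_cloudflare_challenge(status_code, body):
    """Best-effort check for a Cloudflare challenge/block response.

    Markers are common Cloudflare signatures (assumption, not a spec):
    the interstitial title, challenge script identifiers, and the
    classic block-page heading.
    """
    if status_code in (403, 503):
        return True
    markers = (
        "Just a moment...",     # interstitial challenge title
        "cf-chl",               # challenge script identifiers
        "challenge-platform",   # challenge JS asset path
        "Attention Required!",  # block page title
    )
    return any(marker in body for marker in markers)
```

Run it on every response, e.g. `looks_like_cloudflare_challenge(resp.status_code, resp.text)`, and rotate the proxy or back off when it fires instead of parsing the challenge HTML as data.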
Solution: Use residential proxies. Datacenter IPs are near-universally blocked by Wellfound's Cloudflare config. ThorData's residential proxy network routes through real household IPs that pass Cloudflare's bot scoring.
PROXY_URL = "http://user:pass@proxy.thordata.com:9000"
# For httpx direct requests (httpx 0.26+ takes `proxy=`; older versions used `proxies`)
client = httpx.Client(
    headers=GQL_HEADERS,
    proxy=PROXY_URL,
    timeout=25,
)
# For Playwright
proxy_config = {
"server": "http://proxy.thordata.com:9000",
"username": "user",
"password": "pass",
}
Rate Limiting
Wellfound's GraphQL endpoint throttles at roughly 60–120 requests per minute per session. Implement delays:
import time
import random
def rate_limited_query(query, variables, proxy_url=None, min_delay=1.0, max_delay=3.0):
"""Execute GraphQL query with rate limiting."""
time.sleep(random.uniform(min_delay, max_delay))
return graphql_query(query, variables, proxy_url)
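When the throttle does trip, fixed delays alone aren't enough; an exponential-backoff retry wrapper helps. A sketch under my own framing, since Wellfound doesn't document its limit behavior — in real code you'd catch `httpx.HTTPStatusError` and re-raise as `RetryableError` when the status is 429 or 503:

```python
import random
import time

class RetryableError(Exception):
    """Stand-in for a throttled response (e.g. HTTP 429/503) -- an assumption."""

def with_backoff(fn, max_retries=5, base_delay=2.0):
    """Retry fn() with exponential backoff plus jitter.

    Delays grow as base_delay * 2**attempt, with random jitter added
    so parallel workers don't retry in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RetryableError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Usage looks like `with_backoff(lambda: graphql_query(COMPANY_QUERY, {"slug": slug}))` once your HTTP layer raises `RetryableError` on throttle responses.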
Dynamic Class Names
CSS classes are hashed and change on deploys. Use stable selectors:
# Fragile — breaks on redeploy:
# page.query_selector(".styles_component__x7f2a")
# Stable alternatives:
page.query_selector("h1") # Semantic HTML
page.query_selector("[data-test='company-name']") # data-test attributes
page.query_selector("main >> text=Funding") # Text content selector
page.query_selector("[aria-label*='Stage']") # Aria attributes
page.get_by_role("heading", level=1) # ARIA role
Login Walls
Some fields (detailed investor contacts, full salary for some roles) require authentication. Options:
- Session cookie injection: Log in manually, export cookies, inject via Playwright
- Stick to unauthenticated endpoints: Most public company data is accessible without login
- __NEXT_DATA__ bypass: server-rendered data often bypasses auth checks
def inject_session_cookies(context, cookies_dict):
    """Inject authenticated session cookies into a Playwright context."""
    context.add_cookies([
        {"name": name, "value": value, "domain": ".wellfound.com", "path": "/"}
        for name, value in cookies_dict.items()
    ])
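If you export cookies with a browser extension rather than from Playwright itself, the JSON usually needs reshaping first. A sketch assuming a typical exporter schema (`name`, `value`, `domain`, `path`, `expirationDate` keys); check your exporter's actual field names:

```python
def browser_export_to_playwright(cookies):
    """Reshape a browser-extension cookie export for context.add_cookies().

    The input field names (expirationDate etc.) are assumptions about
    the exporter's schema, not a Playwright requirement.
    """
    reshaped = []
    for c in cookies:
        entry = {
            "name": c["name"],
            "value": c["value"],
            "domain": c.get("domain", ".wellfound.com"),
            "path": c.get("path", "/"),
        }
        if "expirationDate" in c:
            # Playwright expects `expires` as Unix epoch seconds
            entry["expires"] = int(c["expirationDate"])
        reshaped.append(entry)
    return reshaped
```

Then `context.add_cookies(browser_export_to_playwright(raw_export))` replaces the manual dict loop.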
Scraping Company Listings at Scale
To scrape many companies (e.g., all Series A startups in fintech):
import time
import random
from pathlib import Path
def search_companies_by_market(
market_slug, stage=None, proxy_url=None, max_companies=500
):
"""Search Wellfound for companies by market/vertical."""
query = """
query CompanySearch($market: String!, $stage: String, $page: Int) {
startups {
searchByMarket(market: $market, stage: $stage, page: $page) {
results {
id
name
slug
highConcept
companySize
stage
totalRaised
markets { displayName }
}
totalCount
totalPages
}
}
}
"""
all_companies = []
page = 1
while len(all_companies) < max_companies:
variables = {"market": market_slug, "page": page}
if stage:
variables["stage"] = stage
result = graphql_query(query, variables, proxy_url)
if not result or result.get("errors"):
break
data = (
result.get("data", {})
.get("startups", {})
.get("searchByMarket", {})
)
batch = data.get("results", [])
if not batch:
break
all_companies.extend(batch)
total_pages = data.get("totalPages", 1)
print(f"Page {page}/{total_pages}: {len(batch)} companies (total: {len(all_companies)})")
if page >= total_pages:
break
page += 1
time.sleep(random.uniform(2.0, 4.0))
return all_companies[:max_companies]
# Get all fintech Series A companies
proxy = "http://user:pass@proxy.thordata.com:9000"
companies = search_companies_by_market("fintech", stage="series-a", proxy_url=proxy)
print(f"Found {len(companies)} fintech Series A companies")
Parsing Salary and Equity Data
import re
def parse_compensation(raw):
"""
Parse salary string like "$120K - $160K" or "$90K - $130K".
Returns {"salary_min": 120000, "salary_max": 160000}
"""
if not raw:
return {"salary_min": None, "salary_max": None}
# Handle various dash types: -, –, —
raw = re.sub(r"[–—]", "-", raw)
nums = re.findall(r"\$?([\d,]+)[Kk]", raw)
if len(nums) >= 2:
return {
"salary_min": int(nums[0].replace(",", "")) * 1000,
"salary_max": int(nums[1].replace(",", "")) * 1000,
}
elif len(nums) == 1:
val = int(nums[0].replace(",", "")) * 1000
return {"salary_min": val, "salary_max": val}
return {"salary_min": None, "salary_max": None}
def parse_equity(raw):
"""
Parse equity string like "0.10% - 0.50%".
Returns {"equity_min": 0.10, "equity_max": 0.50}
"""
if not raw:
return {"equity_min": None, "equity_max": None}
nums = re.findall(r"([\d.]+)%", raw)
if len(nums) >= 2:
return {"equity_min": float(nums[0]), "equity_max": float(nums[1])}
elif len(nums) == 1:
val = float(nums[0])
return {"equity_min": val, "equity_max": val}
return {"equity_min": None, "equity_max": None}
def parse_total_raised(raw):
"""Parse total raised like "$4.5M" or "$250K" or "$1.2B"."""
if not raw:
return None
multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
match = re.search(r"\$([\d.]+)([KMB])", raw, re.IGNORECASE)
if match:
num = float(match.group(1))
mult = multipliers.get(match.group(2).upper(), 1)
return int(num * mult)
return None
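The `companySize` headcount range ("11-50") parses the same way. A small companion helper; a sketch, since the exact label formats may vary:

```python
import re

def parse_company_size(raw):
    """Parse a headcount range like "11-50" or "5,000+" into (min, max).

    Open-ended ranges return None for the upper bound; unparseable
    input returns (None, None).
    """
    if not raw:
        return (None, None)
    # Normalize en/em dashes and strip thousands separators
    raw = re.sub(r"[–—]", "-", raw).replace(",", "")
    m = re.match(r"\s*(\d+)\s*-\s*(\d+)", raw)
    if m:
        return (int(m.group(1)), int(m.group(2)))
    m = re.match(r"\s*(\d+)\s*\+", raw)
    if m:
        return (int(m.group(1)), None)
    return (None, None)
```

Storing the bounds as two integer columns (mirroring `salary_min`/`salary_max`) makes size-based filtering a plain SQL comparison instead of string matching.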
Intercepting GraphQL Network Traffic
The most reliable approach for capturing all data — let Playwright browse and capture everything:
from playwright.sync_api import sync_playwright
import json
from collections import defaultdict
def capture_all_graphql(company_slug, proxy_config=None):
"""
Browse a company page and capture all GraphQL responses.
Returns a structured dict of all data returned.
"""
captured = defaultdict(list)
def on_response(response):
if "graphql" not in response.url.lower():
return
try:
body = response.json()
data = body.get("data", {})
for key, value in data.items():
captured[key].append(value)
except Exception:
pass
with sync_playwright() as p:
launch_kwargs = {"headless": True}
if proxy_config:
launch_kwargs["proxy"] = proxy_config
browser = p.chromium.launch(**launch_kwargs)
context = browser.new_context(
user_agent=HEADERS["User-Agent"],
viewport={"width": 1440, "height": 900},
)
page = context.new_page()
page.on("response", on_response)
# Visit main profile
page.goto(
f"https://wellfound.com/company/{company_slug}",
wait_until="networkidle",
)
        # Visit jobs tab to trigger job listings query (tab may be absent)
        try:
            page.click("text=Jobs")
            page.wait_for_load_state("networkidle")
        except Exception:
            pass
# Visit funding tab
try:
page.click("text=Funding")
page.wait_for_load_state("networkidle")
except Exception:
pass
browser.close()
return dict(captured)
data = capture_all_graphql("stripe")
for key, values in data.items():
print(f"{key}: {len(values)} response(s)")
Data Storage
SQLite Schema
import sqlite3
import json
from datetime import datetime
def init_db(db_path="wellfound.db"):
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS companies (
id TEXT PRIMARY KEY,
slug TEXT UNIQUE NOT NULL,
name TEXT,
tagline TEXT,
description TEXT,
stage TEXT,
company_size TEXT,
total_raised INTEGER,
founded_date TEXT,
website_url TEXT,
twitter_url TEXT,
linkedin_url TEXT,
markets TEXT,
tech_stack TEXT,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS funding_rounds (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_id TEXT,
round_type TEXT,
raised_amount INTEGER,
closed_at TEXT,
investors TEXT,
FOREIGN KEY (company_id) REFERENCES companies(id)
);
CREATE TABLE IF NOT EXISTS job_listings (
id TEXT PRIMARY KEY,
company_id TEXT,
title TEXT,
compensation TEXT,
salary_min INTEGER,
salary_max INTEGER,
equity_min REAL,
equity_max REAL,
remote INTEGER,
location TEXT,
role_type TEXT,
scraped_at TEXT,
FOREIGN KEY (company_id) REFERENCES companies(id)
);
CREATE INDEX IF NOT EXISTS idx_companies_stage ON companies(stage);
CREATE INDEX IF NOT EXISTS idx_jobs_company ON job_listings(company_id);
""")
conn.commit()
return conn
def save_company(conn, company_data):
"""Save company and its nested data to SQLite."""
    markets = json.dumps([m["displayName"] for m in company_data.get("markets", [])])
    tech_stack = json.dumps([t["displayName"] for t in company_data.get("techStack", [])])
    # totalRaised may arrive as an integer or as a "$4.5M"-style string
    raw_raised = company_data.get("totalRaised")
    total_raised = (
        raw_raised if isinstance(raw_raised, (int, float))
        else parse_total_raised(raw_raised)
    )
    conn.execute("""
        INSERT OR REPLACE INTO companies
        (id, slug, name, tagline, description, stage, company_size,
         total_raised, founded_date, website_url, twitter_url, linkedin_url,
         markets, tech_stack, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        company_data.get("id"), company_data.get("slug"),
        company_data.get("name"), company_data.get("highConcept"),
        company_data.get("productDescription"), company_data.get("stage"),
        company_data.get("companySize"),
        total_raised,
company_data.get("foundedDate"), company_data.get("websiteUrl"),
company_data.get("twitterUrl"), company_data.get("linkedInUrl"),
markets, tech_stack,
datetime.utcnow().isoformat(),
))
# Save funding rounds
for round_data in company_data.get("fundingRounds", []):
investors = json.dumps([i["name"] for i in round_data.get("investors", [])])
conn.execute("""
INSERT INTO funding_rounds
(company_id, round_type, raised_amount, closed_at, investors)
VALUES (?, ?, ?, ?, ?)
""", (
company_data.get("id"),
round_data.get("roundType"),
round_data.get("raisedAmount"),
round_data.get("closedAt"),
investors,
))
# Save job listings
for job in company_data.get("jobListings", []):
comp = parse_compensation(job.get("compensation", ""))
equity = parse_equity(job.get("equity", ""))
conn.execute("""
INSERT OR REPLACE INTO job_listings
(id, company_id, title, compensation, salary_min, salary_max,
equity_min, equity_max, remote, location, role_type, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
job.get("id"), company_data.get("id"),
job.get("title"), job.get("compensation"),
comp["salary_min"], comp["salary_max"],
equity["equity_min"], equity["equity_max"],
1 if job.get("remote") else 0,
", ".join(job.get("locationNames", [])),
job.get("roleType"),
datetime.utcnow().isoformat(),
))
conn.commit()
Real-World Use Cases
1. Startup Intelligence Feed
Build a daily feed of new funded startups in a vertical:
def daily_funding_monitor(markets, proxy_url=None):
"""Monitor new startups funded this week in target markets."""
conn = init_db()
new_companies = []
for market in markets:
companies = search_companies_by_market(market, proxy_url=proxy_url)
for company in companies:
# Check if new to our database
existing = conn.execute(
"SELECT id FROM companies WHERE slug = ?",
(company.get("slug"),)
).fetchone()
if not existing:
# Fetch full details
details = get_company_data(company["slug"], proxy_url)
if details:
save_company(conn, details)
new_companies.append(details)
print(f"New: {details['name']} ({details.get('stage')})")
time.sleep(random.uniform(2, 4))
return new_companies
2. Salary Benchmarking Tool
Aggregate salary and equity data across roles and stages:
def build_salary_report(db_path="wellfound.db"):
conn = sqlite3.connect(db_path)
cursor = conn.execute("""
SELECT
j.title,
c.stage,
COUNT(*) as listings,
AVG(j.salary_min) as avg_min,
AVG(j.salary_max) as avg_max,
AVG(j.equity_min) as avg_equity_min,
AVG(j.equity_max) as avg_equity_max
FROM job_listings j
JOIN companies c ON j.company_id = c.id
WHERE j.salary_min IS NOT NULL
AND j.title LIKE '%Engineer%'
GROUP BY j.title, c.stage
HAVING COUNT(*) >= 3
ORDER BY avg_max DESC
""")
print("\nSalary benchmarks for engineering roles:")
for row in cursor.fetchall():
title, stage, count, avg_min, avg_max, eq_min, eq_max = row
print(f" {title} @ {stage}: ${avg_min:,.0f}-${avg_max:,.0f} | "
f"Equity: {eq_min:.2f}%-{eq_max:.2f}% ({count} listings)")
3. Investor Portfolio Tracker
Track which VCs are most active in your vertical:
def analyze_investor_activity(db_path="wellfound.db"):
conn = sqlite3.connect(db_path)
    # Tally named investors across each company's funding rounds
    companies = conn.execute("SELECT name FROM companies").fetchall()
    investor_counts = {}
    for (name,) in companies:
# This requires fetching funding rounds separately
rounds = conn.execute(
"SELECT investors FROM funding_rounds WHERE company_id = ("
"SELECT id FROM companies WHERE name = ?)",
(name,)
).fetchall()
for (investors_json,) in rounds:
try:
for investor in json.loads(investors_json or "[]"):
investor_counts[investor] = investor_counts.get(investor, 0) + 1
except json.JSONDecodeError:
pass
return sorted(investor_counts.items(), key=lambda x: -x[1])[:20]
top_investors = analyze_investor_activity()
print("\nMost active investors in database:")
for investor, count in top_investors:
print(f" {investor}: {count} portfolio companies")
Full Scrape Pipeline
import json
from pathlib import Path
import time
import random
def scrape_startup_ecosystem(
market_slugs,
output_dir="wellfound_data",
proxy_url=None,
max_per_market=200,
):
"""Complete pipeline: discover and enrich companies by market."""
out = Path(output_dir)
out.mkdir(exist_ok=True)
conn = init_db(str(out / "startups.db"))
total_saved = 0
for market in market_slugs:
print(f"\n=== Market: {market} ===")
# Discover companies
companies = search_companies_by_market(
market, proxy_url=proxy_url, max_companies=max_per_market
)
for company in companies:
slug = company.get("slug")
if not slug:
continue
# Skip if already in DB (from previous run)
existing = conn.execute(
"SELECT scraped_at FROM companies WHERE slug = ?", (slug,)
).fetchone()
if existing:
continue
# Fetch full details
try:
details = get_company_data(slug, proxy_url)
if details:
save_company(conn, details)
total_saved += 1
print(f" Saved {details.get('name')} ({details.get('stage')})")
except Exception as e:
print(f" Error on {slug}: {e}")
time.sleep(random.uniform(2.0, 5.0))
print(f"\nComplete: {total_saved} companies saved")
return total_saved
# Run it
proxy = "http://user:[email protected]:9000"
saved = scrape_startup_ecosystem(
["fintech", "ai-ml", "saas", "healthcare"],
proxy_url=proxy,
max_per_market=100,
)
Legal Considerations
Wellfound's Terms of Service prohibit automated scraping. This applies regardless of whether the data is publicly visible. Key considerations:
- hiQ v. LinkedIn: The Ninth Circuit ruled that scraping publicly accessible data doesn't violate the CFAA, but ToS violations are still civil matters
- Data resale: Don't commercially redistribute scraped Wellfound data
- Rate limits: Scraping that degrades service is harder to defend legally
- Personal data: Be careful with founder/employee contact information under GDPR/CCPA
For production use at scale, consider Crunchbase API (paid but licensed), PitchBook, or direct partnerships with data providers.
Summary
Wellfound offers the richest publicly accessible startup dataset available. The technical path to accessing it:
- __NEXT_DATA__ extraction for server-rendered pages — no auth needed, fastest approach
- GraphQL direct queries — richest data, moderate rate limits, requires residential proxies
- Playwright with network interception — most complete, handles auth walls and dynamic content
Cloudflare bot protection is the primary obstacle. ThorData residential proxies are essential — datacenter IPs fail Cloudflare's bot scoring consistently. Store results in SQLite with proper indexing, and implement incremental scraping so you can resume after interruptions. Built correctly, this gives you a continuously updated startup intelligence database that rivals paid tools costing thousands per month.