How to Scrape AngelList (Wellfound) in 2026: Investors, Portfolios & Deal Flow
AngelList rebranded its talent platform to Wellfound in 2022, but it remains the most concentrated source of early-stage investor data on the internet. If you're doing investor outreach, building fundraising intelligence tools, or tracking startup hiring trends, the data on Wellfound is worth the effort to extract — investor theses, portfolio companies, check sizes, and job listings that reflect real hiring intent.
The platform hosts profiles for tens of thousands of active investors, from solo angels writing $25K checks to multi-stage funds deploying hundreds of millions. Each profile lists investment focus areas, portfolio companies, preferred check sizes, and sometimes direct contact preferences. For founders and researchers, this is a goldmine that would otherwise require expensive subscription services like Crunchbase Pro or PitchBook.
This guide walks through the complete extraction pipeline: Playwright setup for JS rendering, authenticated vs. unauthenticated scraping patterns, proxy configuration for sustained collection, SQLite storage design, and how to structure the data for practical use.
What You Can Extract
From investor profiles:
- Investor name, photo URL, and bio text
- Investment thesis and focus areas (free-form text)
- Portfolio companies (with links to company profiles)
- Check size range (pre-seed, seed, Series A annotations)
- Geographic focus (location, target markets)
- Sector tags (fintech, SaaS, biotech, climate, consumer, etc.)
- Number of investments shown on the profile
- Social links (Twitter, LinkedIn, personal site)

From company profiles:
- Company name, founding year, team size, funding stage
- Total funding raised and individual round details
- Current open job listings with compensation data
- Sector and market tags
- Brief description and longer company overview
- Key people and founder profiles

From job listings:
- Role title, department, remote/hybrid/onsite
- Salary range (Wellfound requires salary disclosure)
- Equity range (a typical Wellfound differentiator)
- Years of experience requirements
- Company growth stage and funding context
Wellfound's Anti-Bot Measures
Wellfound is harder to scrape than most job boards because investor data sits at the core of its business model. Here's what you're dealing with:
No public API. AngelList deprecated their developer API years ago. There is no official endpoint for investor or company data — everything has to come from the web interface.
React SPA with heavy JS rendering. All investor cards, profiles, and job listings are rendered client-side. Raw HTTP requests return an empty shell with essentially no content. You need a real browser.
Cloudflare protection. Full Cloudflare bot management with JS challenges on first visit. Cookie state from solving the challenge is required for subsequent requests. The challenge involves TLS fingerprint checks, JavaScript execution validation, and cookie chain verification.
Login walls on investor detail pages. Public investor listings are visible, but portfolio details, check sizes, and direct contact require a logged-in session. You can get basic data without logging in, but rich data needs authentication.
Behavioral fingerprinting. Wellfound tracks mouse movement patterns, scroll velocity, and click timing. Perfectly mechanical automation triggers detection within minutes. You need randomized delays and simulated human-like interaction.
IP rate limiting. A single IP starts getting blocked at roughly 30 requests per minute, and sustained scraping without rotation leads to 403s within minutes. Datacenter IPs are almost always blocked outright — Wellfound checks IP reputation against commercial databases.
Session invalidation. Even with a valid session cookie, sessions expire or get invalidated after suspicious activity. Accounts used heavily for scraping get restricted or shadow-banned, meaning you see reduced data without an explicit error.
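Several of these failure modes can be detected from the response itself before burning through a session. A minimal heuristic sketch — the marker strings are assumptions drawn from common Cloudflare interstitial pages, not anything Wellfound or Cloudflare documents, and `expected_selector_found` would come from a Playwright selector check:

```python
# Marker strings commonly seen on Cloudflare interstitial pages; treat these
# as heuristics, not a stable contract.
CHALLENGE_MARKERS = (
    "just a moment",
    "checking your browser",
    "cf-challenge",
)


def looks_blocked(page_title: str, html: str, expected_selector_found: bool) -> bool:
    """Return True if a fetched page is probably a challenge or a degraded response."""
    lowered = f"{page_title} {html[:2000]}".lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    # A page that rendered without error but contains none of the expected
    # content matches the shadow-ban pattern described above
    return not expected_selector_found
```

Calling this after each navigation lets the scraper rotate sessions proactively instead of silently collecting empty data.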
Setting Up Playwright for Wellfound
Start by launching a real browser with Playwright. A headed browser passes more fingerprint checks during the initial Cloudflare validation, so running headed can help when establishing a session; for production runs, headless=True with careful stealth configuration usually holds up:
import asyncio
import random
import sqlite3
import json
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
async def create_browser(proxy: dict = None):
"""
Launch a Playwright browser with realistic stealth settings.
proxy: dict with 'server', optionally 'username' and 'password'
Returns: (playwright, browser, context) tuple
"""
playwright = await async_playwright().start()
launch_kwargs = {
"headless": True,
"args": [
"--no-sandbox",
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--disable-gpu",
"--window-size=1366,768",
],
}
if proxy:
launch_kwargs["proxy"] = proxy
browser = await playwright.chromium.launch(**launch_kwargs)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
locale="en-US",
timezone_id="America/New_York",
geolocation={"latitude": 40.7128, "longitude": -74.0060},
permissions=["geolocation"],
extra_http_headers={
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
},
)
# Remove automation indicators that Cloudflare and Wellfound check
await context.add_init_script("""
// Remove webdriver property
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
// Fake plugin array (empty in headless)
Object.defineProperty(navigator, 'plugins', {
get: () => [
{name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer'},
{name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai'},
{name: 'Native Client', filename: 'internal-nacl-plugin'},
]
});
// Fake language array
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en']
});
// Add chrome runtime
window.chrome = {
runtime: {},
csi: () => {},
loadTimes: () => {},
};
// Fix permissions query
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) =>
parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: originalQuery(parameters);
""")
return playwright, browser, context
async def load_investors_page(context, filter_params: dict = None):
"""
Navigate to Wellfound investors listing and wait for content to load.
filter_params: optional dict with keys like 'stage', 'market', 'location'
"""
page = await context.new_page()
# Build URL with optional filters
base_url = "https://wellfound.com/investors"
if filter_params:
query = "&".join(f"{k}={v}" for k, v in filter_params.items())
url = f"{base_url}?{query}"
else:
url = base_url
try:
await page.goto(url, wait_until="networkidle", timeout=30000)
except PlaywrightTimeoutError:
# Cloudflare challenge may delay full network idle — wait for content instead
await page.wait_for_selector(
"[data-test='investor-card'], .investor-card, a[href*='/u/']",
timeout=20000
)
# Simulate human-like scrolling to trigger lazy-loaded investor cards
for _ in range(5):
await page.mouse.wheel(0, random.randint(300, 700))
await asyncio.sleep(random.uniform(1.2, 2.8))
# Wait for at least some investor cards to appear
await page.wait_for_selector("a[href*='/u/']", timeout=15000)
return page
Extracting Investor Listings
Once the investors listing page is loaded, scrape the cards. Wellfound's class names use hashed suffixes that change on deploys, so targeting structural selectors is more reliable than class names:
async def extract_investor_cards(page) -> list[dict]:
"""
Extract basic investor data from the Wellfound investors listing page.
Returns a list of dicts with name, profile_url, bio_preview, and tags.
"""
# Scroll to bottom to load all visible cards
prev_height = 0
for _ in range(8):
curr_height = await page.evaluate("document.body.scrollHeight")
if curr_height == prev_height:
break
prev_height = curr_height
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(random.uniform(1.5, 3.0))
investors = []
# Get all investor profile links
links = await page.query_selector_all("a[href*='/u/']")
seen_urls = set()
for link in links:
try:
href = await link.get_attribute("href")
if not href or href in seen_urls:
continue
seen_urls.add(href)
# Navigate to parent card element
            # evaluate_handle returns a JSHandle; convert it to an ElementHandle
            # before querying inside it
            card_handle = await link.evaluate_handle(
                "el => el.closest('div[class]') || el.parentElement"
            )
            card = card_handle.as_element()
            if card is None:
                continue
            name_el = await card.query_selector("h3, h2, [class*='name'], strong")
name = (await name_el.inner_text()).strip() if name_el else (await link.inner_text()).strip()
bio_el = await card.query_selector("p, [class*='bio'], [class*='description']")
bio = (await bio_el.inner_text()).strip() if bio_el else None
tag_els = await card.query_selector_all("[class*='tag'], [class*='sector'], [class*='market']")
tags = [(await t.inner_text()).strip() for t in tag_els if await t.inner_text()]
profile_url = f"https://wellfound.com{href}" if href.startswith("/") else href
if name and len(name) > 1:
investors.append({
"name": name,
"profile_url": profile_url,
"bio_preview": bio[:200] if bio else None,
"tags": tags[:10],
})
except Exception:
continue
return investors
async def paginate_investor_listing(context, max_pages: int = 5) -> list[dict]:
"""
Page through the investor listing to collect more profiles.
Returns all investor stubs found across all pages.
"""
all_investors = []
page = await load_investors_page(context)
for page_num in range(max_pages):
batch = await extract_investor_cards(page)
all_investors.extend(batch)
print(f"Page {page_num + 1}: found {len(batch)} investors")
# Try to find and click "Next" pagination button
next_btn = await page.query_selector(
"a[rel='next'], button:has-text('Next'), [aria-label='Next page']"
)
if not next_btn:
break
await next_btn.click()
await page.wait_for_load_state("networkidle")
await asyncio.sleep(random.uniform(2.0, 4.0))
await page.close()
return all_investors
Extracting Full Investor Profiles
Investor profile pages load portfolio data asynchronously. After navigating to the page, wait for specific content elements before scraping:
async def extract_investor_profile(context, profile_url: str) -> dict:
"""
Open an investor profile page and extract full details.
Handles dynamic loading with explicit waits and graceful timeouts.
Returns a dict with all available investor data.
"""
page = await context.new_page()
investor = {"url": profile_url, "error": None}
try:
await page.goto(profile_url, wait_until="domcontentloaded", timeout=25000)
# Wait for the profile content block specifically
try:
await page.wait_for_selector(
"[class*='profile'], main, [class*='investor-profile']",
timeout=15000
)
except PlaywrightTimeoutError:
investor["error"] = "profile_content_timeout"
return investor
# Let async data requests finish
await asyncio.sleep(random.uniform(2.5, 4.5))
# --- Core fields ---
name_el = await page.query_selector("h1")
investor["name"] = (await name_el.inner_text()).strip() if name_el else None
# Bio and investment thesis (Wellfound often has both)
bio_el = await page.query_selector(
"[class*='bio'], [class*='thesis'], [class*='description'] p, "
"section p:first-of-type"
)
investor["bio"] = (await bio_el.inner_text()).strip() if bio_el else None
# Social links
twitter_el = await page.query_selector("a[href*='twitter.com'], a[href*='x.com']")
investor["twitter"] = await twitter_el.get_attribute("href") if twitter_el else None
linkedin_el = await page.query_selector("a[href*='linkedin.com']")
investor["linkedin"] = await linkedin_el.get_attribute("href") if linkedin_el else None
website_el = await page.query_selector("a[class*='website'], a[class*='external']")
investor["website"] = await website_el.get_attribute("href") if website_el else None
# Location
loc_el = await page.query_selector("[class*='location'], [data-test*='location']")
investor["location"] = (await loc_el.inner_text()).strip() if loc_el else None
# Investment stage preferences
stage_els = await page.query_selector_all("[class*='stage'], [class*='check'], [class*='round']")
investor["stages"] = list({
(await el.inner_text()).strip()
for el in stage_els
if await el.inner_text()
})
# Market and sector tags
tag_els = await page.query_selector_all(
"[class*='tag']:not([class*='stage']), [class*='sector'], [class*='market']"
)
investor["sectors"] = [
(await el.inner_text()).strip()
for el in tag_els
if await el.inner_text()
][:20]
# Portfolio companies
portfolio_els = await page.query_selector_all("a[href*='/company/']")
portfolio_urls = list({
await el.get_attribute("href") for el in portfolio_els
})
investor["portfolio"] = [
f"https://wellfound.com{u}" if u and u.startswith("/") else u
for u in portfolio_urls
if u
][:50]
# Number of investments (if shown)
invest_count_el = await page.query_selector("[class*='investment-count'], [class*='portfolio-size']")
investor["investment_count_text"] = (
(await invest_count_el.inner_text()).strip() if invest_count_el else None
)
except PlaywrightTimeoutError:
investor["error"] = "timeout"
except Exception as e:
investor["error"] = str(e)
finally:
await page.close()
return investor
Scraping Company and Job Data
Company pages follow a similar pattern. Job listings include compensation data that Wellfound requires startups to disclose — making it valuable for compensation benchmarking:
async def extract_company_profile(context, company_slug: str) -> dict:
"""
Scrape a Wellfound company profile page.
company_slug: the URL slug e.g. 'stripe' from wellfound.com/company/stripe
"""
url = f"https://wellfound.com/company/{company_slug}"
page = await context.new_page()
company = {"slug": company_slug, "url": url}
try:
await page.goto(url, wait_until="domcontentloaded", timeout=25000)
await page.wait_for_selector("h1, [class*='company-header']", timeout=12000)
await asyncio.sleep(random.uniform(2.0, 4.0))
name_el = await page.query_selector("h1")
company["name"] = (await name_el.inner_text()).strip() if name_el else company_slug
desc_el = await page.query_selector(
"[class*='description'], [class*='about'] p, main p:first-of-type"
)
company["description"] = (await desc_el.inner_text()).strip() if desc_el else None
# Funding stage indicator
stage_el = await page.query_selector("[class*='stage'], [class*='funding-stage']")
company["stage"] = (await stage_el.inner_text()).strip() if stage_el else None
# Team size
size_el = await page.query_selector("[class*='team-size'], [class*='employee']")
company["team_size"] = (await size_el.inner_text()).strip() if size_el else None
# Total funding
funding_el = await page.query_selector("[class*='total-funding'], [class*='raised']")
company["total_funding"] = (await funding_el.inner_text()).strip() if funding_el else None
# Tech stack
tech_els = await page.query_selector_all("[class*='tech'], [class*='stack'] span")
company["tech_stack"] = [
(await el.inner_text()).strip() for el in tech_els if await el.inner_text()
][:15]
except PlaywrightTimeoutError:
company["error"] = "timeout"
finally:
await page.close()
return company
async def extract_company_jobs(context, company_slug: str) -> list[dict]:
"""
Scrape job listings from a Wellfound company jobs page.
Returns roles with salary ranges, equity, and location.
"""
url = f"https://wellfound.com/company/{company_slug}/jobs"
page = await context.new_page()
jobs = []
try:
await page.goto(url, wait_until="domcontentloaded", timeout=25000)
await page.wait_for_selector(
"[class*='job'], [data-test*='job'], a[href*='/jobs/']",
timeout=15000
)
await asyncio.sleep(random.uniform(2.0, 4.0))
# Scroll to load all jobs
for _ in range(3):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(1.5)
job_els = await page.query_selector_all(
"[class*='JobListing'], [class*='job-listing'], [class*='JobCard']"
)
for job_el in job_els:
try:
title_el = await job_el.query_selector("h2, h3, [class*='title'], [class*='role']")
title = (await title_el.inner_text()).strip() if title_el else None
salary_el = await job_el.query_selector(
"[class*='salary'], [class*='compensation'], [class*='pay']"
)
salary = (await salary_el.inner_text()).strip() if salary_el else None
equity_el = await job_el.query_selector("[class*='equity'], [class*='stock']")
equity = (await equity_el.inner_text()).strip() if equity_el else None
loc_el = await job_el.query_selector("[class*='location'], [class*='remote']")
location = (await loc_el.inner_text()).strip() if loc_el else None
type_el = await job_el.query_selector(
"[class*='job-type'], [class*='employment-type']"
)
job_type = (await type_el.inner_text()).strip() if type_el else None
link_el = await job_el.query_selector("a[href*='/jobs/']")
job_url = await link_el.get_attribute("href") if link_el else None
if title:
jobs.append({
"company_slug": company_slug,
"title": title,
"salary_range": salary,
"equity_range": equity,
"location": location,
"job_type": job_type,
"job_url": (
f"https://wellfound.com{job_url}"
if job_url and job_url.startswith("/")
else job_url
),
})
except Exception:
continue
except PlaywrightTimeoutError:
pass
finally:
await page.close()
return jobs
Rotating Proxies to Avoid Blocks
Wellfound blocks IPs aggressively because there's no API to funnel you toward. After 30 or so requests, a single residential IP starts seeing Cloudflare challenges that don't resolve cleanly. You need a pool of rotating residential addresses.
ThorData's residential proxies have been reliable for this — the IPs rotate per request or per session depending on your configuration, and they pass Cloudflare's ASN reputation checks that trip up datacenter providers. The geo-targeting feature also lets you simulate investors browsing from specific cities, which affects what content Wellfound shows.
Here is how to wire ThorData into the Playwright setup and run the full collection pipeline:
import asyncio
import random
import json
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
proxy_config = {
"server": f"http://{THORDATA_HOST}:{THORDATA_PORT}",
"username": THORDATA_USER,
"password": THORDATA_PASS,
}
async def collect_investors_with_jobs(
max_investors: int = 50,
max_jobs_per_company: int = 20,
) -> dict:
"""
Full pipeline: collect investors, their portfolio companies, and open jobs.
Returns dict with 'investors', 'companies', and 'jobs' lists.
"""
playwright, browser, context = await create_browser(proxy=proxy_config)
results = {"investors": [], "companies": [], "jobs": []}
try:
# Step 1: Get investor stubs from listing page
investors = await paginate_investor_listing(context, max_pages=3)
print(f"Found {len(investors)} investor stubs")
# Step 2: Enrich each investor with full profile data
for i, stub in enumerate(investors[:max_investors]):
profile = await extract_investor_profile(context, stub["profile_url"])
results["investors"].append({**stub, **profile})
print(f"[{i+1}/{min(max_investors, len(investors))}] "
f"{profile.get('name', 'unknown')}: "
f"{len(profile.get('portfolio', []))} portfolio companies")
# Delay between profiles
await asyncio.sleep(random.uniform(3.0, 6.0))
# Step 3: For first 3 portfolio companies of each investor, get jobs
for company_url in profile.get("portfolio", [])[:3]:
slug = company_url.rstrip("/").split("/")[-1]
if any(c["slug"] == slug for c in results["companies"]):
continue # already scraped
company = await extract_company_profile(context, slug)
results["companies"].append(company)
jobs = await extract_company_jobs(context, slug)
results["jobs"].extend(jobs)
await asyncio.sleep(random.uniform(2.0, 4.0))
finally:
await browser.close()
await playwright.stop()
return results
# Run it
data = asyncio.run(collect_investors_with_jobs(max_investors=20))
print(f"Collected: {len(data['investors'])} investors, "
f"{len(data['companies'])} companies, "
f"{len(data['jobs'])} jobs")
SQLite Storage Schema
A normalized SQLite schema handles investors, companies, funding rounds, and jobs without data duplication:
import json
import sqlite3
from datetime import datetime, timezone
def init_db(db_path: str = "wellfound.db") -> sqlite3.Connection:
"""Initialize the Wellfound SQLite database with all required tables."""
conn = sqlite3.connect(db_path)
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE IF NOT EXISTS investors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
url TEXT UNIQUE NOT NULL,
bio TEXT,
location TEXT,
twitter TEXT,
linkedin TEXT,
website TEXT,
stages TEXT, -- JSON array
sectors TEXT, -- JSON array
portfolio TEXT, -- JSON array of URLs
investment_count_text TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
error TEXT
);
CREATE TABLE IF NOT EXISTS companies (
slug TEXT PRIMARY KEY,
name TEXT,
description TEXT,
stage TEXT,
team_size TEXT,
total_funding TEXT,
tech_stack TEXT, -- JSON array
url TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
error TEXT
);
CREATE TABLE IF NOT EXISTS investor_portfolio (
investor_url TEXT NOT NULL,
company_slug TEXT NOT NULL,
PRIMARY KEY (investor_url, company_slug)
);
CREATE TABLE IF NOT EXISTS jobs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_slug TEXT NOT NULL,
title TEXT,
salary_range TEXT,
equity_range TEXT,
location TEXT,
job_type TEXT,
job_url TEXT UNIQUE,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_jobs_company ON jobs(company_slug);
CREATE INDEX IF NOT EXISTS idx_portfolio_investor ON investor_portfolio(investor_url);
CREATE INDEX IF NOT EXISTS idx_portfolio_company ON investor_portfolio(company_slug);
""")
conn.commit()
return conn
def upsert_investor(conn: sqlite3.Connection, investor: dict):
"""Insert or update an investor record."""
conn.execute("""
INSERT INTO investors
(name, url, bio, location, twitter, linkedin, website,
stages, sectors, portfolio, investment_count_text, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
bio=excluded.bio,
location=excluded.location,
stages=excluded.stages,
sectors=excluded.sectors,
portfolio=excluded.portfolio,
scraped_at=excluded.scraped_at
""", (
investor.get("name"),
investor.get("url"),
investor.get("bio"),
investor.get("location"),
investor.get("twitter"),
investor.get("linkedin"),
investor.get("website"),
json.dumps(investor.get("stages", [])),
json.dumps(investor.get("sectors", [])),
json.dumps(investor.get("portfolio", [])),
investor.get("investment_count_text"),
datetime.now(timezone.utc).isoformat(),
))
# Insert portfolio relationships
for company_url in investor.get("portfolio", []):
slug = company_url.rstrip("/").split("/")[-1]
conn.execute("""
INSERT OR IGNORE INTO investor_portfolio (investor_url, company_slug)
VALUES (?, ?)
""", (investor.get("url"), slug))
conn.commit()
def upsert_job(conn: sqlite3.Connection, job: dict):
"""Insert a job listing, ignoring duplicates."""
conn.execute("""
INSERT OR IGNORE INTO jobs
(company_slug, title, salary_range, equity_range, location, job_type, job_url, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", (
job.get("company_slug"),
job.get("title"),
job.get("salary_range"),
job.get("equity_range"),
job.get("location"),
job.get("job_type"),
job.get("job_url"),
datetime.now(timezone.utc).isoformat(),
))
conn.commit()
def query_investors_by_sector(conn: sqlite3.Connection, sector: str) -> list:
"""Find investors whose sectors JSON contains the given keyword."""
return conn.execute("""
SELECT name, url, location, sectors, investment_count_text
FROM investors
WHERE sectors LIKE ?
ORDER BY name
""", (f"%{sector}%",)).fetchall()
def get_salary_stats(conn: sqlite3.Connection) -> list:
"""Return job salary range distribution by company stage."""
return conn.execute("""
SELECT c.stage, j.salary_range, COUNT(*) as count
FROM jobs j
JOIN companies c ON j.company_slug = c.slug
WHERE j.salary_range IS NOT NULL
GROUP BY c.stage, j.salary_range
ORDER BY count DESC
""").fetchall()
Error Handling and Retry Logic
Wellfound will occasionally return errors, timeouts, or Cloudflare challenges mid-session. A robust retry wrapper handles this:
import asyncio
import random

from playwright.async_api import TimeoutError as PlaywrightTimeoutError
async def with_retry(
coro_fn,
args: tuple = (),
max_attempts: int = 3,
base_delay: float = 5.0,
backoff_factor: float = 2.0,
):
"""
Retry an async coroutine with exponential backoff.
coro_fn: async function to call
args: positional arguments to pass
max_attempts: total attempts before giving up
base_delay: initial wait between retries in seconds
"""
last_error = None
for attempt in range(max_attempts):
try:
return await coro_fn(*args)
except PlaywrightTimeoutError as e:
last_error = e
if attempt < max_attempts - 1:
delay = base_delay * (backoff_factor ** attempt) + random.uniform(0, 2)
print(f"Timeout on attempt {attempt + 1}, retrying in {delay:.1f}s")
await asyncio.sleep(delay)
except Exception as e:
last_error = e
error_str = str(e).lower()
# Don't retry on 403/404 — the resource is genuinely unavailable
if "403" in error_str or "404" in error_str:
raise
if attempt < max_attempts - 1:
delay = base_delay * (backoff_factor ** attempt)
print(f"Error on attempt {attempt + 1}: {e}, retrying in {delay:.1f}s")
await asyncio.sleep(delay)
raise last_error or Exception("Max retries exceeded")
# Usage example
async def safe_get_profile(context, url: str) -> dict:
return await with_retry(extract_investor_profile, args=(context, url))
Rate Limiting and Request Pacing
Beyond proxy rotation, request timing matters. A pattern that works well for sustained Wellfound scraping:
import asyncio
import random
import time
from collections import deque
class RateLimiter:
    """
    Sliding-window rate limiter for controlling request frequency.
    Defaults to a maximum of 20 requests per minute.
    """

    def __init__(self, requests_per_minute: int = 20):
        # Timestamps of the most recent N requests; when the deque is full
        # and the oldest entry is under a minute old, we wait
        self.request_times = deque(maxlen=requests_per_minute)
async def wait(self):
"""Wait if necessary to respect the rate limit."""
now = time.monotonic()
if len(self.request_times) == self.request_times.maxlen:
oldest = self.request_times[0]
elapsed = now - oldest
window = 60.0
if elapsed < window:
wait_time = window - elapsed + random.uniform(0.5, 2.0)
await asyncio.sleep(wait_time)
# Add small jitter to prevent synchronized bursting
await asyncio.sleep(random.uniform(1.0, 3.0))
self.request_times.append(time.monotonic())
# Use in the collection loop
limiter = RateLimiter(requests_per_minute=15)
async def collect_with_rate_limit(context, investor_urls: list[str]) -> list[dict]:
results = []
for url in investor_urls:
await limiter.wait()
profile = await extract_investor_profile(context, url)
results.append(profile)
return results
Practical Use Cases
Fundraising outreach lists. Collect investors by sector and stage, then filter by location and check size. Output a CSV of investor names, bios, portfolio companies, and profile URLs. Cross-reference portfolio companies with Crunchbase or LinkedIn to identify warm intro paths.
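A sketch of the CSV export step, flattening investor dicts of the shape produced by extract_investor_profile (the chosen columns and truncation lengths are arbitrary choices):

```python
import csv
import io


def investors_to_csv(investors: list[dict]) -> str:
    """Flatten scraped investor dicts into CSV text for an outreach list."""
    fieldnames = ["name", "url", "location", "sectors", "portfolio_count", "bio"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for inv in investors:
        writer.writerow({
            "name": inv.get("name"),
            "url": inv.get("url"),
            "location": inv.get("location"),
            # Join list fields with a semicolon so they stay in one CSV cell
            "sectors": "; ".join(inv.get("sectors", [])),
            "portfolio_count": len(inv.get("portfolio", [])),
            "bio": (inv.get("bio") or "")[:300],
        })
    return buf.getvalue()


sample = [{"name": "Ada", "url": "https://wellfound.com/u/ada",
           "location": "NYC", "sectors": ["fintech", "saas"],
           "portfolio": ["https://wellfound.com/company/acme"], "bio": "Angel."}]
print(investors_to_csv(sample))
```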
Market mapping. Scrape all investors who have funded companies in a specific market (e.g., "climate tech" or "developer tools"). Analyze which investors are most active by portfolio count and investment stage distribution. Identify whitespace — markets with few active investors despite deal flow.
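The activity analysis can start as a simple aggregation over scraped profiles. A sketch counting investors and total portfolio entries per sector tag, assuming the dict shape produced earlier:

```python
from collections import Counter, defaultdict


def sector_activity(investors: list[dict]) -> list[tuple[str, int, int]]:
    """Return (sector, investor_count, total_portfolio_entries), most active first."""
    investor_counts = Counter()
    portfolio_totals = defaultdict(int)
    for inv in investors:
        size = len(inv.get("portfolio", []))
        for sector in inv.get("sectors", []):
            investor_counts[sector] += 1
            portfolio_totals[sector] += size
    # Sort by investor count, then portfolio volume, descending
    return sorted(
        ((s, investor_counts[s], portfolio_totals[s]) for s in investor_counts),
        key=lambda row: (row[1], row[2]),
        reverse=True,
    )
```

Sectors near the bottom of this ranking despite known deal flow are the whitespace candidates mentioned above.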
Compensation benchmarking. Collect job listings across 200-500 early-stage startups filtered by sector. Analyze salary and equity ranges by role title, stage, and location. This kind of data is worth significant money to recruiting firms and HR tools.
Portfolio overlap analysis. For a given investor, identify which other investors share portfolio companies. Network density among investors in a sector often predicts syndicate behavior — who leads with whom, and who follows.
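The overlap computation itself is small. A sketch over scraped profiles, counting shared portfolio URLs for every investor pair:

```python
from itertools import combinations


def portfolio_overlap(investors: list[dict]) -> dict[tuple[str, str], int]:
    """Map (investor_a, investor_b) name pairs to their shared portfolio count."""
    overlap = {}
    for a, b in combinations(investors, 2):
        shared = set(a.get("portfolio", [])) & set(b.get("portfolio", []))
        if shared:
            overlap[(a["name"], b["name"])] = len(shared)
    return overlap
```

Pairs with high counts are the candidates for co-investment and syndicate analysis; the same query can be run in SQL against the investor_portfolio table via a self-join.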
Legal Considerations
Wellfound's Terms of Service prohibit automated scraping of their platform. The data itself — investor names, bios, and publicly listed companies — is generally not copyright-protected as standalone facts, but the database as a whole may be protected under database rights in EU jurisdictions. Keep scraping to personal research or internal tooling. Avoid building competing investor databases for commercial redistribution, and do not scrape private or locked investor data that requires account credentials you don't personally own.
The techniques documented here are for educational purposes. Respect robots.txt, use conservative request rates, and if Wellfound adds explicit opt-out signals, honor them.
Pagination and Bulk Investor Discovery
The investor listing page paginates. Here is how to systematically collect all pages:
import asyncio
import json
import random
import sqlite3
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
async def paginate_investor_list(context, max_pages: int = 20) -> list:
"""Paginate through the Wellfound investor directory."""
all_investors = []
page = await context.new_page()
for page_num in range(1, max_pages + 1):
url = f"https://wellfound.com/investors?page={page_num}"
try:
await page.goto(url, wait_until="domcontentloaded", timeout=25000)
await asyncio.sleep(random.uniform(3.0, 6.0))
# Scroll to load lazy content
for _ in range(3):
await page.mouse.wheel(0, random.randint(400, 800))
await asyncio.sleep(random.uniform(1.0, 2.0))
investors = await extract_investor_cards(page)
if not investors:
print(f" Page {page_num}: no investors found, stopping")
break
all_investors.extend(investors)
print(f" Page {page_num}: {len(investors)} investors (total: {len(all_investors)})")
except PlaywrightTimeoutError:
print(f" Page {page_num}: timeout")
break
await asyncio.sleep(random.uniform(5.0, 10.0))
await page.close()
return all_investors
Filtering by Investment Stage and Sector
Wellfound's investor listing supports filtering via URL parameters. Combine with your scraping to pre-filter:
STAGE_FILTERS = {
"pre_seed": "pre-seed",
"seed": "seed",
"series_a": "series-a",
"series_b": "series-b",
}
SECTOR_FILTERS = [
"fintech",
"ai-ml",
"saas",
"healthcare",
"climate-tech",
"crypto-web3",
"consumer",
"enterprise",
]
def build_filter_url(stage: str = None, sector: str = None, page: int = 1) -> str:
"""Build a filtered investor listing URL."""
base = "https://wellfound.com/investors"
params = []
if stage:
params.append(f"stage={stage}")
if sector:
params.append(f"market={sector}")
if page > 1:
params.append(f"page={page}")
return f"{base}?{'&'.join(params)}" if params else base
async def collect_investors_by_sector(
context,
sectors: list,
stages: list = None,
max_pages_per_filter: int = 5,
) -> list:
"""Collect investors filtered by sector and/or stage."""
all_investors = []
seen_slugs = set()
for sector in sectors:
stage_list = stages or [None]
for stage in stage_list:
print(f" Collecting: sector={sector}, stage={stage}")
page = await context.new_page()
for pg in range(1, max_pages_per_filter + 1):
url = build_filter_url(stage=stage, sector=sector, page=pg)
try:
await page.goto(url, wait_until="domcontentloaded", timeout=25000)
await asyncio.sleep(random.uniform(4.0, 8.0))
investors = await extract_investor_cards(page)
if not investors:
break
for inv in investors:
slug = inv.get("profile_url", "").split("/")[-1]
if slug not in seen_slugs:
seen_slugs.add(slug)
inv["filter_sector"] = sector
inv["filter_stage"] = stage
all_investors.append(inv)
except PlaywrightTimeoutError:
break
await asyncio.sleep(random.uniform(5.0, 10.0))
await page.close()
return all_investors
Extracting Portfolio Metrics
For investors with disclosed portfolio data, you can compute basic metrics:
def analyze_investor_portfolio(investor: dict) -> dict:
"""Compute portfolio-level metrics from an investor profile."""
portfolio = investor.get("portfolio", [])
sectors = investor.get("sectors", [])
stages = investor.get("stages", [])
return {
"name": investor.get("name"),
"portfolio_size": len(portfolio),
"primary_sectors": sectors[:3],
"investment_stages": stages,
"location": investor.get("location"),
"bio_length": len(investor.get("bio") or ""),
"has_full_profile": bool(investor.get("bio")) and len(portfolio) > 0,
}
def rank_investors_by_activity(investors: list) -> list:
"""Rank investors by portfolio size and profile completeness."""
scored = []
for inv in investors:
metrics = analyze_investor_portfolio(inv)
score = (
metrics["portfolio_size"] * 2 +
len(metrics["primary_sectors"]) +
len(metrics["investment_stages"]) +
(5 if metrics["has_full_profile"] else 0)
)
scored.append((score, inv))
scored.sort(key=lambda x: x[0], reverse=True)
return [inv for _, inv in scored]
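The score above is a simple weighted sum. Applied to two toy profiles, it behaves like this (a standalone restatement of the scoring formula, not the full functions):

```python
def activity_score(portfolio_size, n_sectors, n_stages, has_full_profile):
    """Same weighting as rank_investors_by_activity: portfolio size counts
    double, and a complete profile (bio plus portfolio) earns a flat 5-point bonus."""
    return portfolio_size * 2 + n_sectors + n_stages + (5 if has_full_profile else 0)

# A prolific angel with a complete profile vs. a sparse listing-only profile
a = activity_score(portfolio_size=12, n_sectors=3, n_stages=2, has_full_profile=True)
b = activity_score(portfolio_size=2, n_sectors=1, n_stages=1, has_full_profile=False)
print(a, b)  # 34 6
```

The weights are arbitrary; the point is that portfolio size dominates, so the ranking front-loads the investors most likely to have rich profile pages worth fetching first.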
Rate Limiting and Session Management
Wellfound blocks IP addresses that make too many requests. Here is how to manage sessions and rotate proxies effectively:
THORDATA_CONFIG = {
"server": "http://proxy.thordata.com:9000",
"username": "YOUR_USERNAME",
"password": "YOUR_PASSWORD",
}
class WellfoundScraper:
"""
Managed scraper for Wellfound investor data.
Handles session rotation, rate limiting, and retry logic.
"""
def __init__(self, proxy_config: dict = None, requests_per_session: int = 30):
self.proxy_config = proxy_config
self.requests_per_session = requests_per_session
self.session_request_count = 0
self.playwright = None
self.browser = None
self.context = None
async def start(self):
from playwright.async_api import async_playwright
self.playwright = await async_playwright().start()
await self._new_session()
async def _new_session(self):
"""Create a new browser context (effectively rotates IP with proxy rotation)."""
if self.browser:
await self.browser.close()
launch_kwargs = {
"headless": True,
"args": ["--disable-blink-features=AutomationControlled"],
}
if self.proxy_config:
launch_kwargs["proxy"] = self.proxy_config
self.browser = await self.playwright.chromium.launch(**launch_kwargs)
self.context = await self.browser.new_context(
viewport={"width": 1366, "height": 768},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
locale="en-US",
)
await self.context.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)
self.session_request_count = 0
async def get_page(self, url: str):
"""Navigate to a URL, rotating session if needed."""
if self.session_request_count >= self.requests_per_session:
print("Rotating session...")
await self._new_session()
page = await self.context.new_page()
await page.goto(url, wait_until="domcontentloaded", timeout=25000)
self.session_request_count += 1
return page
async def stop(self):
if self.browser:
await self.browser.close()
if self.playwright:
await self.playwright.stop()
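The class docstring promises retry logic, which the methods above do not yet implement. One way to bolt it on is a small exponential-backoff wrapper; a sketch, where `flaky` stands in for any coroutine such as `scraper.get_page`:

```python
import asyncio
import random

async def with_retries(fetch, url, attempts=3, base_delay=2.0):
    """Retry a coroutine with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts -- let the caller decide
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Demo with a flaky stand-in that fails twice, then succeeds
calls = {"n": 0}
async def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return f"ok:{url}"

result = asyncio.run(with_retries(flaky, "https://wellfound.com/investors", base_delay=0.01))
print(result, calls["n"])  # ok:https://wellfound.com/investors 3
```

In practice you would catch `PlaywrightTimeoutError` specifically and call `scraper._new_session()` before retrying, since a timeout on Wellfound often means the current IP has been flagged.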
Complete Data Pipeline
async def run_wellfound_pipeline(
sectors: list = None,
db_path: str = "wellfound.db",
max_investors: int = 200,
):
"""
Full pipeline:
1. Discover investors via filtered listings
2. Fetch full profiles for each investor
3. Collect job listings for portfolio companies
4. Store everything in SQLite
"""
if sectors is None:
sectors = ["fintech", "saas", "ai-ml"]
conn = init_db(db_path)
scraper = WellfoundScraper(proxy_config=THORDATA_CONFIG)
await scraper.start()
try:
# Phase 1: Discovery
print("Phase 1: Discovering investors...")
all_investors = await collect_investors_by_sector(
scraper.context,
sectors=sectors,
max_pages_per_filter=5,
)
print(f"Found {len(all_investors)} unique investors")
# Rank by activity
all_investors = rank_investors_by_activity(all_investors)
# Phase 2: Enrich profiles
print("\nPhase 2: Fetching full investor profiles...")
enriched_count = 0
for investor in all_investors[:max_investors]:
url = investor.get("profile_url")
if not url:
continue
# Skip if already in DB
existing = conn.execute(
"SELECT 1 FROM investors WHERE url = ?", (url,)
).fetchone()
if existing:
print(f" Cached: {investor.get('name')}")
continue
print(f" Fetching: {investor.get('name')}")
profile = await extract_investor_profile(scraper.context, url)
# Save to DB
conn.execute(
"""INSERT OR REPLACE INTO investors
(name, url, bio, location, stages, sectors, portfolio, scraped_at)
VALUES (?,?,?,?,?,?,?,CURRENT_TIMESTAMP)""",
(
profile.get("name"),
url,
profile.get("bio"),
profile.get("location"),
json.dumps(profile.get("stages", [])),
json.dumps(profile.get("sectors", [])),
json.dumps(profile.get("portfolio", [])),
),
)
conn.commit()
enriched_count += 1
await asyncio.sleep(random.uniform(5.0, 10.0))
print(f"\nEnriched {enriched_count} investor profiles")
finally:
await scraper.stop()
conn.close()
# Run the pipeline
asyncio.run(run_wellfound_pipeline(
sectors=["fintech", "saas", "ai-ml"],
max_investors=100,
))
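The `SELECT 1 ... WHERE url = ?` check is what makes the pipeline safely re-runnable: interrupted jobs pick up where they left off instead of re-fetching pages. A minimal in-memory demonstration of that skip-if-cached pattern (using a trimmed version of the `investors` schema above):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE investors (
    name TEXT, url TEXT UNIQUE, sectors TEXT,
    scraped_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def save_if_new(profile):
    """Insert a profile only if its URL has not been scraped before."""
    if conn.execute("SELECT 1 FROM investors WHERE url = ?", (profile["url"],)).fetchone():
        return False  # cached -- skip the expensive page fetch entirely
    conn.execute(
        "INSERT INTO investors (name, url, sectors) VALUES (?, ?, ?)",
        (profile["name"], profile["url"], json.dumps(profile.get("sectors", []))),
    )
    conn.commit()
    return True

p = {"name": "Jane Angel", "url": "https://wellfound.com/p/jane", "sectors": ["fintech"]}
print(save_if_new(p), save_if_new(p))  # True False -- the second run is a no-op
```

Checking the cache before navigating (as the pipeline does) matters more than it looks: each skipped profile saves a page load plus the 5-10 second politeness delay.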
Useful Analysis Queries
Once data is collected in SQLite:
import sqlite3
import json
conn = sqlite3.connect("wellfound.db")
# Investors by number of portfolio companies
active_investors = conn.execute("""
SELECT name, location,
json_array_length(portfolio) as portfolio_size,
sectors
FROM investors
WHERE portfolio IS NOT NULL
ORDER BY portfolio_size DESC
LIMIT 20
""").fetchall()
for row in active_investors:
print(f"{row[0]:<30} {row[2]:>3} companies {row[1]}")
# Geographic distribution of investors
by_location = conn.execute("""
SELECT location, COUNT(*) as count
FROM investors
WHERE location IS NOT NULL
GROUP BY location
ORDER BY count DESC
LIMIT 15
""").fetchall()
# Most common investment stages
import collections
all_stages = []
for row in conn.execute("SELECT stages FROM investors WHERE stages IS NOT NULL"):
try:
stages = json.loads(row[0])
all_stages.extend(stages)
except json.JSONDecodeError:
pass
stage_counts = collections.Counter(all_stages)
for stage, count in stage_counts.most_common(10):
print(f" {stage}: {count} investors")
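The stage-counting loop above can also be pushed into SQLite itself using the JSON1 `json_each` table-valued function, which expands each JSON array into rows. A sketch against the same `sectors` column, shown on an in-memory table so it runs standalone:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE investors (name TEXT, sectors TEXT)")
conn.executemany("INSERT INTO investors VALUES (?, ?)", [
    ("A", json.dumps(["fintech", "saas"])),
    ("B", json.dumps(["fintech"])),
    ("C", json.dumps(["climate-tech"])),
])

# json_each expands each array element into its own row, so plain
# GROUP BY replaces the Python-side Counter
sector_counts = conn.execute("""
    SELECT je.value AS sector, COUNT(*) AS n
    FROM investors, json_each(investors.sectors) AS je
    GROUP BY je.value
    ORDER BY n DESC, sector
""").fetchall()
print(sector_counts)  # [('fintech', 2), ('climate-tech', 1), ('saas', 1)]
```

This keeps aggregation in one query and scales better than loading every row into Python, though the `collections.Counter` version is fine at a few thousand investors.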
Anti-Detection Best Practices for Wellfound
A summary of what works in 2026:
- Use Playwright, not requests -- Wellfound requires full JS execution for all investor data
- Always remove `navigator.webdriver` -- the init script in the examples above is mandatory
- Rotate IPs via ThorData -- ThorData residential proxies pass Cloudflare's ASN checks; datacenter IPs fail immediately
- Session-based rotation -- create a new browser context every 25-30 requests; this rotates the IP and clears any session-level fingerprinting
- Simulate human scroll behavior -- use `mouse.wheel()` with random distances before extracting data
- Random delays -- minimum 4 seconds between page loads, aiming for 6-8 seconds average
- Vary request patterns -- don't always paginate sequentially; mix in filter combinations
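The last point -- varying request patterns -- can be as simple as shuffling the filter matrix before each run, so successive crawls never hit the same URLs in the same order. A sketch (seed the RNG only if you need reproducibility):

```python
import random
from itertools import product

sectors = ["fintech", "ai-ml", "saas", "climate-tech"]
stages = ["seed", "series-a"]

# Build every (sector, stage) combination, then visit them in random order
combos = list(product(sectors, stages))
random.shuffle(combos)

for sector, stage in combos:
    pass  # fetch listing pages for this combo, with randomized delays in between

print(len(combos))  # 4 sectors x 2 stages = 8 combinations, reordered each run
```

Combined with the randomized `asyncio.sleep` delays already in the collector, this makes the traffic shape much harder to distinguish from a human clicking through filters.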
Legal Considerations
Wellfound's Terms of Service prohibit automated scraping of their platform. The data itself -- investor names, bios, and publicly listed companies -- is generally not copyright-protected, but the database as a whole may be protected under database rights in some jurisdictions.
Keep scraping to personal research or internal tooling. Avoid building competing investor databases for commercial redistribution -- Wellfound's business model depends on this data, and they will enforce their terms against data resellers.