How to Scrape AngelList (Wellfound) in 2026: Investors, Portfolios & Deal Flow
AngelList rebranded its talent platform to Wellfound in 2022, but it remains the most concentrated source of early-stage investor data on the internet. If you're doing investor outreach, building fundraising intelligence tools, or tracking startup hiring trends, the data on Wellfound is worth the effort to extract — investor theses, portfolio companies, check sizes, and job listings that reflect real hiring intent.
The platform hosts profiles for tens of thousands of active investors, from solo angels writing $25K checks to multi-stage funds deploying hundreds of millions. Each profile lists investment focus areas, portfolio companies, preferred check sizes, and sometimes direct contact preferences. For founders and researchers, this is a goldmine that would otherwise require expensive subscription services like Crunchbase Pro or PitchBook.
This guide walks through the complete extraction pipeline: Playwright setup for JS rendering, authenticated vs. unauthenticated scraping patterns, proxy configuration for sustained collection, SQLite storage design, and how to structure the data for practical use.
What You Can Extract
From investor profiles:
- Investor name, photo URL, and bio text
- Investment thesis and focus areas (free-form text)
- Portfolio companies (with links to company profiles)
- Check size range (pre-seed, seed, Series A annotations)
- Geographic focus (location, target markets)
- Sector tags (fintech, SaaS, biotech, climate, consumer, etc.)
- Number of investments shown on the profile
- Social links (Twitter, LinkedIn, personal site)

From company profiles:
- Company name, founding year, team size, funding stage
- Total funding raised and individual round details
- Current open job listings with compensation data
- Sector and market tags
- Brief description and longer company overview
- Key people and founder profiles

From job listings:
- Role title, department, remote/hybrid/onsite
- Salary range (Wellfound requires salary disclosure)
- Equity range (a typical Wellfound differentiator)
- Years of experience requirements
- Company growth stage and funding context
Wellfound's Anti-Bot Measures
Wellfound is harder to scrape than most job boards because investor data sits at the core of its business model. Here's what you're dealing with:
No public API. AngelList deprecated their developer API years ago. There is no official endpoint for investor or company data — everything has to come from the web interface.
React SPA with heavy JS rendering. All investor cards, profiles, and job listings are rendered client-side. Raw HTTP requests return an empty shell with essentially no content. You need a real browser.
Cloudflare protection. Full Cloudflare bot management with JS challenges on first visit. Cookie state from solving the challenge is required for subsequent requests. The challenge involves TLS fingerprint checks, JavaScript execution validation, and cookie chain verification.
Login walls on investor detail pages. Public investor listings are visible, but portfolio details, check sizes, and direct contact require a logged-in session. You can get basic data without logging in, but rich data needs authentication.
Behavioral fingerprinting. Wellfound tracks mouse movement patterns, scroll velocity, and click timing. Perfectly mechanical automation triggers detection within minutes. You need randomized delays and simulated human-like interaction.
IP rate limiting. A single IP starts getting blocked at roughly 30 requests per minute, and sustained scraping without rotation leads to 403s within minutes. Datacenter IPs are almost always blocked outright — Wellfound checks IP reputation against commercial databases.
Session invalidation. Even with a valid session cookie, sessions expire or get invalidated after suspicious activity. Accounts used heavily for scraping get restricted or shadow-banned, meaning you see reduced data without an explicit error.
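Several of these failure modes can be detected from the response itself before burning through a session. A minimal heuristic sketch — the marker strings are assumptions drawn from common Cloudflare interstitial pages, not anything Wellfound or Cloudflare documents, and `expected_selector_found` would come from a Playwright selector check:

```python
# Marker strings commonly seen on Cloudflare interstitial pages; treat these
# as heuristics, not a stable contract.
CHALLENGE_MARKERS = (
    "just a moment",
    "checking your browser",
    "cf-challenge",
)


def looks_blocked(page_title: str, html: str, expected_selector_found: bool) -> bool:
    """Return True if a fetched page is probably a challenge or a degraded response."""
    lowered = f"{page_title} {html[:2000]}".lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    # A page that rendered without error but contains none of the expected
    # content matches the shadow-ban pattern described above
    return not expected_selector_found
```

Calling this after each navigation lets the scraper rotate sessions proactively instead of silently collecting empty data.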
Setting Up Playwright for Wellfound
Start by launching a real browser with Playwright. A headed browser passes more fingerprint checks during the initial Cloudflare validation, so running headed can help when establishing a session; for production runs, headless=True with careful stealth configuration usually holds up:
import asyncio
import random
import sqlite3
import json
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
async def create_browser(proxy: dict = None):
"""
Launch a Playwright browser with realistic stealth settings.
proxy: dict with 'server', optionally 'username' and 'password'
Returns: (playwright, browser, context) tuple
"""
playwright = await async_playwright().start()
launch_kwargs = {
"headless": True,
"args": [
"--no-sandbox",
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--disable-gpu",
"--window-size=1366,768",
],
}
if proxy:
launch_kwargs["proxy"] = proxy
browser = await playwright.chromium.launch(**launch_kwargs)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
locale="en-US",
timezone_id="America/New_York",
geolocation={"latitude": 40.7128, "longitude": -74.0060},
permissions=["geolocation"],
extra_http_headers={
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
},
)
# Remove automation indicators that Cloudflare and Wellfound check
await context.add_init_script("""
// Remove webdriver property
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
// Fake plugin array (empty in headless)
Object.defineProperty(navigator, 'plugins', {
get: () => [
{name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer'},
{name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai'},
{name: 'Native Client', filename: 'internal-nacl-plugin'},
]
});
// Fake language array
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en']
});
// Add chrome runtime
window.chrome = {
runtime: {},
csi: () => {},
loadTimes: () => {},
};
// Fix permissions query
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) =>
parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: originalQuery(parameters);
""")
return playwright, browser, context
async def load_investors_page(context, filter_params: dict = None):
"""
Navigate to Wellfound investors listing and wait for content to load.
filter_params: optional dict with keys like 'stage', 'market', 'location'
"""
page = await context.new_page()
# Build URL with optional filters
base_url = "https://wellfound.com/investors"
if filter_params:
query = "&".join(f"{k}={v}" for k, v in filter_params.items())
url = f"{base_url}?{query}"
else:
url = base_url
try:
await page.goto(url, wait_until="networkidle", timeout=30000)
except PlaywrightTimeoutError:
# Cloudflare challenge may delay full network idle — wait for content instead
await page.wait_for_selector(
"[data-test='investor-card'], .investor-card, a[href*='/u/']",
timeout=20000
)
# Simulate human-like scrolling to trigger lazy-loaded investor cards
for _ in range(5):
await page.mouse.wheel(0, random.randint(300, 700))
await asyncio.sleep(random.uniform(1.2, 2.8))
# Wait for at least some investor cards to appear
await page.wait_for_selector("a[href*='/u/']", timeout=15000)
return page
Extracting Investor Listings
Once the investors listing page is loaded, scrape the cards. Wellfound's class names use hashed suffixes that change on deploys, so targeting structural selectors is more reliable than class names:
async def extract_investor_cards(page) -> list[dict]:
"""
Extract basic investor data from the Wellfound investors listing page.
Returns a list of dicts with name, profile_url, bio_preview, and tags.
"""
# Scroll to bottom to load all visible cards
prev_height = 0
for _ in range(8):
curr_height = await page.evaluate("document.body.scrollHeight")
if curr_height == prev_height:
break
prev_height = curr_height
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(random.uniform(1.5, 3.0))
investors = []
# Get all investor profile links
links = await page.query_selector_all("a[href*='/u/']")
seen_urls = set()
for link in links:
try:
href = await link.get_attribute("href")
if not href or href in seen_urls:
continue
seen_urls.add(href)
# Navigate to parent card element
            # evaluate_handle returns a JSHandle; convert it to an ElementHandle
            # before querying inside it
            card_handle = await link.evaluate_handle(
                "el => el.closest('div[class]') || el.parentElement"
            )
            card = card_handle.as_element()
            if card is None:
                continue
            name_el = await card.query_selector("h3, h2, [class*='name'], strong")
name = (await name_el.inner_text()).strip() if name_el else (await link.inner_text()).strip()
bio_el = await card.query_selector("p, [class*='bio'], [class*='description']")
bio = (await bio_el.inner_text()).strip() if bio_el else None
tag_els = await card.query_selector_all("[class*='tag'], [class*='sector'], [class*='market']")
tags = [(await t.inner_text()).strip() for t in tag_els if await t.inner_text()]
profile_url = f"https://wellfound.com{href}" if href.startswith("/") else href
if name and len(name) > 1:
investors.append({
"name": name,
"profile_url": profile_url,
"bio_preview": bio[:200] if bio else None,
"tags": tags[:10],
})
except Exception:
continue
return investors
async def paginate_investor_listing(context, max_pages: int = 5) -> list[dict]:
"""
Page through the investor listing to collect more profiles.
Returns all investor stubs found across all pages.
"""
all_investors = []
page = await load_investors_page(context)
for page_num in range(max_pages):
batch = await extract_investor_cards(page)
all_investors.extend(batch)
print(f"Page {page_num + 1}: found {len(batch)} investors")
# Try to find and click "Next" pagination button
next_btn = await page.query_selector(
"a[rel='next'], button:has-text('Next'), [aria-label='Next page']"
)
if not next_btn:
break
await next_btn.click()
await page.wait_for_load_state("networkidle")
await asyncio.sleep(random.uniform(2.0, 4.0))
await page.close()
return all_investors
Extracting Full Investor Profiles
Investor profile pages load portfolio data asynchronously. After navigating to the page, wait for specific content elements before scraping:
async def extract_investor_profile(context, profile_url: str) -> dict:
"""
Open an investor profile page and extract full details.
Handles dynamic loading with explicit waits and graceful timeouts.
Returns a dict with all available investor data.
"""
page = await context.new_page()
investor = {"url": profile_url, "error": None}
try:
await page.goto(profile_url, wait_until="domcontentloaded", timeout=25000)
# Wait for the profile content block specifically
try:
await page.wait_for_selector(
"[class*='profile'], main, [class*='investor-profile']",
timeout=15000
)
except PlaywrightTimeoutError:
investor["error"] = "profile_content_timeout"
return investor
# Let async data requests finish
await asyncio.sleep(random.uniform(2.5, 4.5))
# --- Core fields ---
name_el = await page.query_selector("h1")
investor["name"] = (await name_el.inner_text()).strip() if name_el else None
# Bio and investment thesis (Wellfound often has both)
bio_el = await page.query_selector(
"[class*='bio'], [class*='thesis'], [class*='description'] p, "
"section p:first-of-type"
)
investor["bio"] = (await bio_el.inner_text()).strip() if bio_el else None
# Social links
twitter_el = await page.query_selector("a[href*='twitter.com'], a[href*='x.com']")
investor["twitter"] = await twitter_el.get_attribute("href") if twitter_el else None
linkedin_el = await page.query_selector("a[href*='linkedin.com']")
investor["linkedin"] = await linkedin_el.get_attribute("href") if linkedin_el else None
website_el = await page.query_selector("a[class*='website'], a[class*='external']")
investor["website"] = await website_el.get_attribute("href") if website_el else None
# Location
loc_el = await page.query_selector("[class*='location'], [data-test*='location']")
investor["location"] = (await loc_el.inner_text()).strip() if loc_el else None
# Investment stage preferences
stage_els = await page.query_selector_all("[class*='stage'], [class*='check'], [class*='round']")
investor["stages"] = list({
(await el.inner_text()).strip()
for el in stage_els
if await el.inner_text()
})
# Market and sector tags
tag_els = await page.query_selector_all(
"[class*='tag']:not([class*='stage']), [class*='sector'], [class*='market']"
)
investor["sectors"] = [
(await el.inner_text()).strip()
for el in tag_els
if await el.inner_text()
][:20]
# Portfolio companies
portfolio_els = await page.query_selector_all("a[href*='/company/']")
portfolio_urls = list({
await el.get_attribute("href") for el in portfolio_els
})
investor["portfolio"] = [
f"https://wellfound.com{u}" if u and u.startswith("/") else u
for u in portfolio_urls
if u
][:50]
# Number of investments (if shown)
invest_count_el = await page.query_selector("[class*='investment-count'], [class*='portfolio-size']")
investor["investment_count_text"] = (
(await invest_count_el.inner_text()).strip() if invest_count_el else None
)
except PlaywrightTimeoutError:
investor["error"] = "timeout"
except Exception as e:
investor["error"] = str(e)
finally:
await page.close()
return investor
Scraping Company and Job Data
Company pages follow a similar pattern. Job listings include compensation data that Wellfound requires startups to disclose — making it valuable for compensation benchmarking:
async def extract_company_profile(context, company_slug: str) -> dict:
"""
Scrape a Wellfound company profile page.
company_slug: the URL slug e.g. 'stripe' from wellfound.com/company/stripe
"""
url = f"https://wellfound.com/company/{company_slug}"
page = await context.new_page()
company = {"slug": company_slug, "url": url}
try:
await page.goto(url, wait_until="domcontentloaded", timeout=25000)
await page.wait_for_selector("h1, [class*='company-header']", timeout=12000)
await asyncio.sleep(random.uniform(2.0, 4.0))
name_el = await page.query_selector("h1")
company["name"] = (await name_el.inner_text()).strip() if name_el else company_slug
desc_el = await page.query_selector(
"[class*='description'], [class*='about'] p, main p:first-of-type"
)
company["description"] = (await desc_el.inner_text()).strip() if desc_el else None
# Funding stage indicator
stage_el = await page.query_selector("[class*='stage'], [class*='funding-stage']")
company["stage"] = (await stage_el.inner_text()).strip() if stage_el else None
# Team size
size_el = await page.query_selector("[class*='team-size'], [class*='employee']")
company["team_size"] = (await size_el.inner_text()).strip() if size_el else None
# Total funding
funding_el = await page.query_selector("[class*='total-funding'], [class*='raised']")
company["total_funding"] = (await funding_el.inner_text()).strip() if funding_el else None
# Tech stack
tech_els = await page.query_selector_all("[class*='tech'], [class*='stack'] span")
company["tech_stack"] = [
(await el.inner_text()).strip() for el in tech_els if await el.inner_text()
][:15]
except PlaywrightTimeoutError:
company["error"] = "timeout"
finally:
await page.close()
return company
async def extract_company_jobs(context, company_slug: str) -> list[dict]:
"""
Scrape job listings from a Wellfound company jobs page.
Returns roles with salary ranges, equity, and location.
"""
url = f"https://wellfound.com/company/{company_slug}/jobs"
page = await context.new_page()
jobs = []
try:
await page.goto(url, wait_until="domcontentloaded", timeout=25000)
await page.wait_for_selector(
"[class*='job'], [data-test*='job'], a[href*='/jobs/']",
timeout=15000
)
await asyncio.sleep(random.uniform(2.0, 4.0))
# Scroll to load all jobs
for _ in range(3):
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(1.5)
job_els = await page.query_selector_all(
"[class*='JobListing'], [class*='job-listing'], [class*='JobCard']"
)
for job_el in job_els:
try:
title_el = await job_el.query_selector("h2, h3, [class*='title'], [class*='role']")
title = (await title_el.inner_text()).strip() if title_el else None
salary_el = await job_el.query_selector(
"[class*='salary'], [class*='compensation'], [class*='pay']"
)
salary = (await salary_el.inner_text()).strip() if salary_el else None
equity_el = await job_el.query_selector("[class*='equity'], [class*='stock']")
equity = (await equity_el.inner_text()).strip() if equity_el else None
loc_el = await job_el.query_selector("[class*='location'], [class*='remote']")
location = (await loc_el.inner_text()).strip() if loc_el else None
type_el = await job_el.query_selector(
"[class*='job-type'], [class*='employment-type']"
)
job_type = (await type_el.inner_text()).strip() if type_el else None
link_el = await job_el.query_selector("a[href*='/jobs/']")
job_url = await link_el.get_attribute("href") if link_el else None
if title:
jobs.append({
"company_slug": company_slug,
"title": title,
"salary_range": salary,
"equity_range": equity,
"location": location,
"job_type": job_type,
"job_url": (
f"https://wellfound.com{job_url}"
if job_url and job_url.startswith("/")
else job_url
),
})
except Exception:
continue
except PlaywrightTimeoutError:
pass
finally:
await page.close()
return jobs
Rotating Proxies to Avoid Blocks
Wellfound blocks IPs aggressively because there's no API to funnel you toward. After 30 or so requests, a single residential IP starts seeing Cloudflare challenges that don't resolve cleanly. You need a pool of rotating residential addresses.
ThorData's residential proxies have been reliable for this — the IPs rotate per request or per session depending on your configuration, and they pass Cloudflare's ASN reputation checks that trip up datacenter providers. The geo-targeting feature also lets you simulate investors browsing from specific cities, which affects what content Wellfound shows.
Here is how to wire ThorData into the Playwright setup and run the full collection pipeline:
import asyncio
import random
import json
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9000
proxy_config = {
"server": f"http://{THORDATA_HOST}:{THORDATA_PORT}",
"username": THORDATA_USER,
"password": THORDATA_PASS,
}
async def collect_investors_with_jobs(
max_investors: int = 50,
max_jobs_per_company: int = 20,
) -> dict:
"""
Full pipeline: collect investors, their portfolio companies, and open jobs.
Returns dict with 'investors', 'companies', and 'jobs' lists.
"""
playwright, browser, context = await create_browser(proxy=proxy_config)
results = {"investors": [], "companies": [], "jobs": []}
try:
# Step 1: Get investor stubs from listing page
investors = await paginate_investor_listing(context, max_pages=3)
print(f"Found {len(investors)} investor stubs")
# Step 2: Enrich each investor with full profile data
for i, stub in enumerate(investors[:max_investors]):
profile = await extract_investor_profile(context, stub["profile_url"])
results["investors"].append({**stub, **profile})
print(f"[{i+1}/{min(max_investors, len(investors))}] "
f"{profile.get('name', 'unknown')}: "
f"{len(profile.get('portfolio', []))} portfolio companies")
# Delay between profiles
await asyncio.sleep(random.uniform(3.0, 6.0))
# Step 3: For first 3 portfolio companies of each investor, get jobs
for company_url in profile.get("portfolio", [])[:3]:
slug = company_url.rstrip("/").split("/")[-1]
if any(c["slug"] == slug for c in results["companies"]):
continue # already scraped
company = await extract_company_profile(context, slug)
results["companies"].append(company)
jobs = await extract_company_jobs(context, slug)
results["jobs"].extend(jobs)
await asyncio.sleep(random.uniform(2.0, 4.0))
finally:
await browser.close()
await playwright.stop()
return results
# Run it
data = asyncio.run(collect_investors_with_jobs(max_investors=20))
print(f"Collected: {len(data['investors'])} investors, "
f"{len(data['companies'])} companies, "
f"{len(data['jobs'])} jobs")
SQLite Storage Schema
A normalized SQLite schema handles investors, companies, funding rounds, and jobs without data duplication:
import json
import sqlite3
from datetime import datetime, timezone
def init_db(db_path: str = "wellfound.db") -> sqlite3.Connection:
"""Initialize the Wellfound SQLite database with all required tables."""
conn = sqlite3.connect(db_path)
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE IF NOT EXISTS investors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
url TEXT UNIQUE NOT NULL,
bio TEXT,
location TEXT,
twitter TEXT,
linkedin TEXT,
website TEXT,
stages TEXT, -- JSON array
sectors TEXT, -- JSON array
portfolio TEXT, -- JSON array of URLs
investment_count_text TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
error TEXT
);
CREATE TABLE IF NOT EXISTS companies (
slug TEXT PRIMARY KEY,
name TEXT,
description TEXT,
stage TEXT,
team_size TEXT,
total_funding TEXT,
tech_stack TEXT, -- JSON array
url TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
error TEXT
);
CREATE TABLE IF NOT EXISTS investor_portfolio (
investor_url TEXT NOT NULL,
company_slug TEXT NOT NULL,
PRIMARY KEY (investor_url, company_slug)
);
CREATE TABLE IF NOT EXISTS jobs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_slug TEXT NOT NULL,
title TEXT,
salary_range TEXT,
equity_range TEXT,
location TEXT,
job_type TEXT,
job_url TEXT UNIQUE,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_jobs_company ON jobs(company_slug);
CREATE INDEX IF NOT EXISTS idx_portfolio_investor ON investor_portfolio(investor_url);
CREATE INDEX IF NOT EXISTS idx_portfolio_company ON investor_portfolio(company_slug);
""")
conn.commit()
return conn
def upsert_investor(conn: sqlite3.Connection, investor: dict):
"""Insert or update an investor record."""
conn.execute("""
INSERT INTO investors
(name, url, bio, location, twitter, linkedin, website,
stages, sectors, portfolio, investment_count_text, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
bio=excluded.bio,
location=excluded.location,
stages=excluded.stages,
sectors=excluded.sectors,
portfolio=excluded.portfolio,
scraped_at=excluded.scraped_at
""", (
investor.get("name"),
investor.get("url"),
investor.get("bio"),
investor.get("location"),
investor.get("twitter"),
investor.get("linkedin"),
investor.get("website"),
json.dumps(investor.get("stages", [])),
json.dumps(investor.get("sectors", [])),
json.dumps(investor.get("portfolio", [])),
investor.get("investment_count_text"),
datetime.now(timezone.utc).isoformat(),
))
# Insert portfolio relationships
for company_url in investor.get("portfolio", []):
slug = company_url.rstrip("/").split("/")[-1]
conn.execute("""
INSERT OR IGNORE INTO investor_portfolio (investor_url, company_slug)
VALUES (?, ?)
""", (investor.get("url"), slug))
conn.commit()
def upsert_job(conn: sqlite3.Connection, job: dict):
"""Insert a job listing, ignoring duplicates."""
conn.execute("""
INSERT OR IGNORE INTO jobs
(company_slug, title, salary_range, equity_range, location, job_type, job_url, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", (
job.get("company_slug"),
job.get("title"),
job.get("salary_range"),
job.get("equity_range"),
job.get("location"),
job.get("job_type"),
job.get("job_url"),
datetime.now(timezone.utc).isoformat(),
))
conn.commit()
def query_investors_by_sector(conn: sqlite3.Connection, sector: str) -> list:
"""Find investors whose sectors JSON contains the given keyword."""
return conn.execute("""
SELECT name, url, location, sectors, investment_count_text
FROM investors
WHERE sectors LIKE ?
ORDER BY name
""", (f"%{sector}%",)).fetchall()
def get_salary_stats(conn: sqlite3.Connection) -> list:
"""Return job salary range distribution by company stage."""
return conn.execute("""
SELECT c.stage, j.salary_range, COUNT(*) as count
FROM jobs j
JOIN companies c ON j.company_slug = c.slug
WHERE j.salary_range IS NOT NULL
GROUP BY c.stage, j.salary_range
ORDER BY count DESC
""").fetchall()
Error Handling and Retry Logic
Wellfound will occasionally return errors, timeouts, or Cloudflare challenges mid-session. A robust retry wrapper handles this:
import asyncio
import random

from playwright.async_api import TimeoutError as PlaywrightTimeoutError
async def with_retry(
coro_fn,
args: tuple = (),
max_attempts: int = 3,
base_delay: float = 5.0,
backoff_factor: float = 2.0,
):
"""
Retry an async coroutine with exponential backoff.
coro_fn: async function to call
args: positional arguments to pass
max_attempts: total attempts before giving up
base_delay: initial wait between retries in seconds
"""
last_error = None
for attempt in range(max_attempts):
try:
return await coro_fn(*args)
except PlaywrightTimeoutError as e:
last_error = e
if attempt < max_attempts - 1:
delay = base_delay * (backoff_factor ** attempt) + random.uniform(0, 2)
print(f"Timeout on attempt {attempt + 1}, retrying in {delay:.1f}s")
await asyncio.sleep(delay)
except Exception as e:
last_error = e
error_str = str(e).lower()
# Don't retry on 403/404 — the resource is genuinely unavailable
if "403" in error_str or "404" in error_str:
raise
if attempt < max_attempts - 1:
delay = base_delay * (backoff_factor ** attempt)
print(f"Error on attempt {attempt + 1}: {e}, retrying in {delay:.1f}s")
await asyncio.sleep(delay)
raise last_error or Exception("Max retries exceeded")
# Usage example
async def safe_get_profile(context, url: str) -> dict:
return await with_retry(extract_investor_profile, args=(context, url))
Rate Limiting and Request Pacing
Beyond proxy rotation, request timing matters. A pattern that works well for sustained Wellfound scraping:
import asyncio
import random
import time
from collections import deque
class RateLimiter:
    """
    Sliding-window rate limiter for controlling request frequency.
    Defaults to a maximum of 20 requests per minute.
    """

    def __init__(self, requests_per_minute: int = 20):
        # Timestamps of the most recent N requests; when the deque is full
        # and the oldest entry is under a minute old, we wait
        self.request_times = deque(maxlen=requests_per_minute)
async def wait(self):
"""Wait if necessary to respect the rate limit."""
now = time.monotonic()
if len(self.request_times) == self.request_times.maxlen:
oldest = self.request_times[0]
elapsed = now - oldest
window = 60.0
if elapsed < window:
wait_time = window - elapsed + random.uniform(0.5, 2.0)
await asyncio.sleep(wait_time)
# Add small jitter to prevent synchronized bursting
await asyncio.sleep(random.uniform(1.0, 3.0))
self.request_times.append(time.monotonic())
# Use in the collection loop
limiter = RateLimiter(requests_per_minute=15)
async def collect_with_rate_limit(context, investor_urls: list[str]) -> list[dict]:
results = []
for url in investor_urls:
await limiter.wait()
profile = await extract_investor_profile(context, url)
results.append(profile)
return results
Practical Use Cases
Fundraising outreach lists. Collect investors by sector and stage, then filter by location and check size. Output a CSV of investor names, bios, portfolio companies, and profile URLs. Cross-reference portfolio companies with Crunchbase or LinkedIn to identify warm intro paths.
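A sketch of the CSV export step, flattening investor dicts of the shape produced by extract_investor_profile (the chosen columns and truncation lengths are arbitrary choices):

```python
import csv
import io


def investors_to_csv(investors: list[dict]) -> str:
    """Flatten scraped investor dicts into CSV text for an outreach list."""
    fieldnames = ["name", "url", "location", "sectors", "portfolio_count", "bio"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    for inv in investors:
        writer.writerow({
            "name": inv.get("name"),
            "url": inv.get("url"),
            "location": inv.get("location"),
            # Join list fields with a semicolon so they stay in one CSV cell
            "sectors": "; ".join(inv.get("sectors", [])),
            "portfolio_count": len(inv.get("portfolio", [])),
            "bio": (inv.get("bio") or "")[:300],
        })
    return buf.getvalue()


sample = [{"name": "Ada", "url": "https://wellfound.com/u/ada",
           "location": "NYC", "sectors": ["fintech", "saas"],
           "portfolio": ["https://wellfound.com/company/acme"], "bio": "Angel."}]
print(investors_to_csv(sample))
```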
Market mapping. Scrape all investors who have funded companies in a specific market (e.g., "climate tech" or "developer tools"). Analyze which investors are most active by portfolio count and investment stage distribution. Identify whitespace — markets with few active investors despite deal flow.
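The activity analysis can start as a simple aggregation over scraped profiles. A sketch counting investors and total portfolio entries per sector tag, assuming the dict shape produced earlier:

```python
from collections import Counter, defaultdict


def sector_activity(investors: list[dict]) -> list[tuple[str, int, int]]:
    """Return (sector, investor_count, total_portfolio_entries), most active first."""
    investor_counts = Counter()
    portfolio_totals = defaultdict(int)
    for inv in investors:
        size = len(inv.get("portfolio", []))
        for sector in inv.get("sectors", []):
            investor_counts[sector] += 1
            portfolio_totals[sector] += size
    # Sort by investor count, then portfolio volume, descending
    return sorted(
        ((s, investor_counts[s], portfolio_totals[s]) for s in investor_counts),
        key=lambda row: (row[1], row[2]),
        reverse=True,
    )
```

Sectors near the bottom of this ranking despite known deal flow are the whitespace candidates mentioned above.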
Compensation benchmarking. Collect job listings across 200-500 early-stage startups filtered by sector. Analyze salary and equity ranges by role title, stage, and location. This kind of data is worth significant money to recruiting firms and HR tools.
Portfolio overlap analysis. For a given investor, identify which other investors share portfolio companies. Network density among investors in a sector often predicts syndicate behavior — who leads with whom, and who follows.
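The overlap computation itself is small. A sketch over scraped profiles, counting shared portfolio URLs for every investor pair:

```python
from itertools import combinations


def portfolio_overlap(investors: list[dict]) -> dict[tuple[str, str], int]:
    """Map (investor_a, investor_b) name pairs to their shared portfolio count."""
    overlap = {}
    for a, b in combinations(investors, 2):
        shared = set(a.get("portfolio", [])) & set(b.get("portfolio", []))
        if shared:
            overlap[(a["name"], b["name"])] = len(shared)
    return overlap
```

Pairs with high counts are the candidates for co-investment and syndicate analysis; the same query can be run in SQL against the investor_portfolio table via a self-join.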
Legal Considerations
Wellfound's Terms of Service prohibit automated scraping of their platform. The data itself — investor names, bios, and publicly listed companies — is generally not copyright-protected as standalone facts, but the database as a whole may be protected under database rights in EU jurisdictions. Keep scraping to personal research or internal tooling. Avoid building competing investor databases for commercial redistribution, and do not scrape private or locked investor data that requires account credentials you don't personally own.
The techniques documented here are for educational purposes. Respect robots.txt, use conservative request rates, and if Wellfound adds explicit opt-out signals, honor them.
Pagination and Bulk Investor Discovery
The investor listing page paginates. Here is how to systematically collect all pages:
import asyncio
import json
import random
import sqlite3
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeoutError
async def paginate_investor_list(context, max_pages: int = 20) -> list:
"""Paginate through the Wellfound investor directory."""
all_investors = []
page = await context.new_page()
for page_num in range(1, max_pages + 1):
url = f"https://wellfound.com/investors?page={page_num}"
try:
await page.goto(url, wait_until="domcontentloaded", timeout=25000)
await asyncio.sleep(random.uniform(3.0, 6.0))
# Scroll to load lazy content
for _ in range(3):
await page.mouse.wheel(0, random.randint(400, 800))
await asyncio.sleep(random.uniform(1.0, 2.0))
investors = await extract_investor_cards(page)
if not investors:
print(f" Page {page_num}: no investors found, stopping")
break
all_investors.extend(investors)
print(f" Page {page_num}: {len(investors)} investors (total: {len(all_investors)})")
except PlaywrightTimeoutError:
print(f" Page {page_num}: timeout")
break
await asyncio.sleep(random.uniform(5.0, 10.0))
await page.close()
return all_investors
Filtering by Investment Stage and Sector
Wellfound's investor listing supports filtering via URL parameters. Combine with your scraping to pre-filter:
STAGE_FILTERS = {
"pre_seed": "pre-seed",
"seed": "seed",
"series_a": "series-a",
"series_b": "series-b",
}
SECTOR_FILTERS = [
"fintech",
"ai-ml",
"saas",
"healthcare",
"climate-tech",
"crypto-web3",
"consumer",
"enterprise",
]
def build_filter_url(stage: str = None, sector: str = None, page: int = 1) -> str:
"""Build a filtered investor listing URL."""
base = "https://wellfound.com/investors"
params = []
if stage:
params.append(f"stage={stage}")
if sector:
params.append(f"market={sector}")
if page > 1:
params.append(f"page={page}")
return f"{base}?{'&'.join(params)}" if params else base
async def collect_investors_by_sector(
context,
sectors: list,
stages: list = None,
max_pages_per_filter: int = 5,
) -> list:
"""Collect investors filtered by sector and/or stage."""
all_investors = []
seen_slugs = set()
for sector in sectors:
stage_list = stages or [None]
for stage in stage_list:
print(f" Collecting: sector={sector}, stage={stage}")
page = await context.new_page()
for pg in range(1, max_pages_per_filter + 1):
url = build_filter_url(stage=stage, sector=sector, page=pg)
try:
await page.goto(url, wait_until="domcontentloaded", timeout=25000)
await asyncio.sleep(random.uniform(4.0, 8.0))
investors = await extract_investor_cards(page)
if not investors:
break
for inv in investors:
slug = inv.get("profile_url", "").split("/")[-1]
if slug not in seen_slugs:
seen_slugs.add(slug)
inv["filter_sector"] = sector
inv["filter_stage"] = stage
all_investors.append(inv)
except PlaywrightTimeoutError:
break
await asyncio.sleep(random.uniform(5.0, 10.0))
await page.close()
return all_investors
Extracting Portfolio Metrics
For investors with disclosed portfolio data, you can compute basic metrics:
def analyze_investor_portfolio(investor: dict) -> dict:
"""Compute portfolio-level metrics from an investor profile."""
portfolio = investor.get("portfolio", [])
sectors = investor.get("sectors", [])
stages = investor.get("stages", [])
return {
"name": investor.get("name"),
"portfolio_size": len(portfolio),
"primary_sectors": sectors[:3],
"investment_stages": stages,
"location": investor.get("location"),
"bio_length": len(investor.get("bio") or ""),
"has_full_profile": bool(investor.get("bio")) and len(portfolio) > 0,
}
def rank_investors_by_activity(investors: list) -> list:
"""Rank investors by portfolio size and profile completeness."""
scored = []
for inv in investors:
metrics = analyze_investor_portfolio(inv)
score = (
metrics["portfolio_size"] * 2 +
len(metrics["primary_sectors"]) +
len(metrics["investment_stages"]) +
(5 if metrics["has_full_profile"] else 0)
)
scored.append((score, inv))
scored.sort(key=lambda x: x[0], reverse=True)
return [inv for _, inv in scored]
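The score above is a simple weighted sum. Applied to two toy profiles, it behaves like this (a standalone restatement of the scoring formula, not the full functions):

```python
def activity_score(portfolio_size, n_sectors, n_stages, has_full_profile):
    """Same weighting as rank_investors_by_activity: portfolio size counts
    double, and a complete profile (bio plus portfolio) earns a flat 5-point bonus."""
    return portfolio_size * 2 + n_sectors + n_stages + (5 if has_full_profile else 0)

# A prolific angel with a complete profile vs. a sparse listing-only profile
a = activity_score(portfolio_size=12, n_sectors=3, n_stages=2, has_full_profile=True)
b = activity_score(portfolio_size=2, n_sectors=1, n_stages=1, has_full_profile=False)
print(a, b)  # 34 6
```

The weights are arbitrary; the point is that portfolio size dominates, so the ranking front-loads the investors most likely to have rich profile pages worth fetching first.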
Rate Limiting and Session Management
Wellfound blocks IP addresses that make too many requests. Here is how to manage sessions and rotate proxies effectively:
THORDATA_CONFIG = {
"server": "http://proxy.thordata.com:9000",
"username": "YOUR_USERNAME",
"password": "YOUR_PASSWORD",
}
class WellfoundScraper:
"""
Managed scraper for Wellfound investor data.
Handles session rotation, rate limiting, and retry logic.
"""
def __init__(self, proxy_config: dict = None, requests_per_session: int = 30):
self.proxy_config = proxy_config
self.requests_per_session = requests_per_session
self.session_request_count = 0
self.playwright = None
self.browser = None
self.context = None
async def start(self):
from playwright.async_api import async_playwright
self.playwright = await async_playwright().start()
await self._new_session()
async def _new_session(self):
"""Create a new browser context (effectively rotates IP with proxy rotation)."""
if self.browser:
await self.browser.close()
launch_kwargs = {
"headless": True,
"args": ["--disable-blink-features=AutomationControlled"],
}
if self.proxy_config:
launch_kwargs["proxy"] = self.proxy_config
self.browser = await self.playwright.chromium.launch(**launch_kwargs)
self.context = await self.browser.new_context(
viewport={"width": 1366, "height": 768},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
locale="en-US",
)
await self.context.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)
self.session_request_count = 0
async def get_page(self, url: str):
"""Navigate to a URL, rotating session if needed."""
if self.session_request_count >= self.requests_per_session:
print("Rotating session...")
await self._new_session()
page = await self.context.new_page()
await page.goto(url, wait_until="domcontentloaded", timeout=25000)
self.session_request_count += 1
return page
async def stop(self):
if self.browser:
await self.browser.close()
if self.playwright:
await self.playwright.stop()
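The class docstring promises retry logic, which the methods above do not yet implement. One way to bolt it on is a small exponential-backoff wrapper; a sketch, where `flaky` stands in for any coroutine such as `scraper.get_page`:

```python
import asyncio
import random

async def with_retries(fetch, url, attempts=3, base_delay=2.0):
    """Retry a coroutine with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts -- let the caller decide
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Demo with a flaky stand-in that fails twice, then succeeds
calls = {"n": 0}
async def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return f"ok:{url}"

result = asyncio.run(with_retries(flaky, "https://wellfound.com/investors", base_delay=0.01))
print(result, calls["n"])  # ok:https://wellfound.com/investors 3
```

In practice you would catch `PlaywrightTimeoutError` specifically and call `scraper._new_session()` before retrying, since a timeout on Wellfound often means the current IP has been flagged.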
Complete Data Pipeline
async def run_wellfound_pipeline(
sectors: list = None,
db_path: str = "wellfound.db",
max_investors: int = 200,
):
"""
Full pipeline:
1. Discover investors via filtered listings
2. Fetch full profiles for each investor
3. Collect job listings for portfolio companies
4. Store everything in SQLite
"""
if sectors is None:
sectors = ["fintech", "saas", "ai-ml"]
conn = init_db(db_path)
scraper = WellfoundScraper(proxy_config=THORDATA_CONFIG)
await scraper.start()
try:
# Phase 1: Discovery
print("Phase 1: Discovering investors...")
all_investors = await collect_investors_by_sector(
scraper.context,
sectors=sectors,
max_pages_per_filter=5,
)
print(f"Found {len(all_investors)} unique investors")
# Rank by activity
all_investors = rank_investors_by_activity(all_investors)
# Phase 2: Enrich profiles
print("\nPhase 2: Fetching full investor profiles...")
enriched_count = 0
for investor in all_investors[:max_investors]:
url = investor.get("profile_url")
if not url:
continue
# Skip if already in DB
existing = conn.execute(
"SELECT 1 FROM investors WHERE url = ?", (url,)
).fetchone()
if existing:
print(f" Cached: {investor.get('name')}")
continue
print(f" Fetching: {investor.get('name')}")
profile = await extract_investor_profile(scraper.context, url)
# Save to DB
conn.execute(
"""INSERT OR REPLACE INTO investors
(name, url, bio, location, stages, sectors, portfolio, scraped_at)
VALUES (?,?,?,?,?,?,?,CURRENT_TIMESTAMP)""",
(
profile.get("name"),
url,
profile.get("bio"),
profile.get("location"),
json.dumps(profile.get("stages", [])),
json.dumps(profile.get("sectors", [])),
json.dumps(profile.get("portfolio", [])),
),
)
conn.commit()
enriched_count += 1
await asyncio.sleep(random.uniform(5.0, 10.0))
print(f"\nEnriched {enriched_count} investor profiles")
finally:
await scraper.stop()
conn.close()
# Run the pipeline
asyncio.run(run_wellfound_pipeline(
sectors=["fintech", "saas", "ai-ml"],
max_investors=100,
))
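The `SELECT 1 ... WHERE url = ?` check is what makes the pipeline safely re-runnable: interrupted jobs pick up where they left off instead of re-fetching pages. A minimal in-memory demonstration of that skip-if-cached pattern (using a trimmed version of the `investors` schema above):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE investors (
    name TEXT, url TEXT UNIQUE, sectors TEXT,
    scraped_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def save_if_new(profile):
    """Insert a profile only if its URL has not been scraped before."""
    if conn.execute("SELECT 1 FROM investors WHERE url = ?", (profile["url"],)).fetchone():
        return False  # cached -- skip the expensive page fetch entirely
    conn.execute(
        "INSERT INTO investors (name, url, sectors) VALUES (?, ?, ?)",
        (profile["name"], profile["url"], json.dumps(profile.get("sectors", []))),
    )
    conn.commit()
    return True

p = {"name": "Jane Angel", "url": "https://wellfound.com/p/jane", "sectors": ["fintech"]}
print(save_if_new(p), save_if_new(p))  # True False -- the second run is a no-op
```

Checking the cache before navigating (as the pipeline does) matters more than it looks: each skipped profile saves a page load plus the 5-10 second politeness delay.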
Useful Analysis Queries
Once data is collected in SQLite:
import sqlite3
import json
conn = sqlite3.connect("wellfound.db")
# Investors by number of portfolio companies
active_investors = conn.execute("""
SELECT name, location,
json_array_length(portfolio) as portfolio_size,
sectors
FROM investors
WHERE portfolio IS NOT NULL
ORDER BY portfolio_size DESC
LIMIT 20
""").fetchall()
for row in active_investors:
print(f"{row[0]:<30} {row[2]:>3} companies {row[1]}")
# Geographic distribution of investors
by_location = conn.execute("""
SELECT location, COUNT(*) as count
FROM investors
WHERE location IS NOT NULL
GROUP BY location
ORDER BY count DESC
LIMIT 15
""").fetchall()
# Most common investment stages
import collections
all_stages = []
for row in conn.execute("SELECT stages FROM investors WHERE stages IS NOT NULL"):
try:
stages = json.loads(row[0])
all_stages.extend(stages)
except json.JSONDecodeError:
pass
stage_counts = collections.Counter(all_stages)
for stage, count in stage_counts.most_common(10):
print(f" {stage}: {count} investors")
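The stage-counting loop above can also be pushed into SQLite itself using the JSON1 `json_each` table-valued function, which expands each JSON array into rows. A sketch against the same `sectors` column, shown on an in-memory table so it runs standalone:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE investors (name TEXT, sectors TEXT)")
conn.executemany("INSERT INTO investors VALUES (?, ?)", [
    ("A", json.dumps(["fintech", "saas"])),
    ("B", json.dumps(["fintech"])),
    ("C", json.dumps(["climate-tech"])),
])

# json_each expands each array element into its own row, so plain
# GROUP BY replaces the Python-side Counter
sector_counts = conn.execute("""
    SELECT je.value AS sector, COUNT(*) AS n
    FROM investors, json_each(investors.sectors) AS je
    GROUP BY je.value
    ORDER BY n DESC, sector
""").fetchall()
print(sector_counts)  # [('fintech', 2), ('climate-tech', 1), ('saas', 1)]
```

This keeps aggregation in one query and scales better than loading every row into Python, though the `collections.Counter` version is fine at a few thousand investors.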
Anti-Detection Best Practices for Wellfound
A summary of what works in 2026:
- Use Playwright, not requests -- Wellfound requires full JS execution for all investor data
- Always remove `navigator.webdriver` -- the init script in the examples above is mandatory
- Rotate IPs via ThorData -- ThorData residential proxies pass Cloudflare's ASN checks; datacenter IPs fail immediately
- Session-based rotation -- create a new browser context every 25-30 requests; this rotates the IP and clears any session-level fingerprinting
- Simulate human scroll behavior -- use `mouse.wheel()` with random distances before extracting data
- Random delays -- minimum 4 seconds between page loads, aiming for 6-8 seconds average
- Vary request patterns -- don't always paginate sequentially; mix in filter combinations
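The last point -- varying request patterns -- can be as simple as shuffling the filter matrix before each run, so successive crawls never hit the same URLs in the same order. A sketch (seed the RNG only if you need reproducibility):

```python
import random
from itertools import product

sectors = ["fintech", "ai-ml", "saas", "climate-tech"]
stages = ["seed", "series-a"]

# Build every (sector, stage) combination, then visit them in random order
combos = list(product(sectors, stages))
random.shuffle(combos)

for sector, stage in combos:
    pass  # fetch listing pages for this combo, with randomized delays in between

print(len(combos))  # 4 sectors x 2 stages = 8 combinations, reordered each run
```

Combined with the randomized `asyncio.sleep` delays already in the collector, this makes the traffic shape much harder to distinguish from a human clicking through filters.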
Legal Considerations
Wellfound's Terms of Service prohibit automated scraping of their platform. The data itself -- investor names, bios, and publicly listed companies -- is generally not copyright-protected, but the database as a whole may be protected under database rights in some jurisdictions.
Keep scraping to personal research or internal tooling. Avoid building competing investor databases for commercial redistribution -- Wellfound's business model depends on this data, and they will enforce their terms against data resellers.