How to Scrape Y Combinator Companies: Python Guide (2026)
Y Combinator's company directory is one of the richest startup datasets publicly available. It covers thousands of companies going back to 2005, with batch labels (W24, S25), funding status, industry tags, and founder details. Whether you're doing deal flow tracking, competitive intelligence, or market research into startup ecosystems, the YC directory is an essential data source.
This guide covers the full extraction pipeline: the Algolia API endpoint YC uses internally, Playwright-based scraping for JS-rendered detail pages, anti-detection setup, and practical analysis use cases.
What the YC Directory Contains
The directory at ycombinator.com/companies covers every company that has been through YC since 2005. Key fields available:
- Batch — W24, S25, IK12 (Imagine K12), W95 (pre-standardized naming)
- Status — Active, Inactive, Acquired, Public
- One-liner — 150-character description of the company
- Industry tags — up to several tags per company from a controlled vocabulary
- Team size — rough headcount band
- Top company flag — YC's internal marker for their standout portfolio companies
- Hiring status — whether they're currently recruiting
- Founder details — names, titles, LinkedIn URLs (when provided)
- Locations — city, country
- Website — the company's current URL
What you won't find here: funding amounts, investors, valuation, revenue. Those are on Crunchbase or need to be gathered from SEC filings (for public/late-stage companies).
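To make the field list concrete, here is the rough shape of a single directory record. This is a hypothetical, abbreviated example — all values are invented for illustration; only the field names come from the list above:

```python
# A hypothetical, abbreviated YC directory record. Field names match the
# directory fields listed above; the values are invented for illustration.
sample_company = {
    "name": "ExampleAI",
    "slug": "exampleai",
    "batch": "W24",
    "status": "Active",
    "one_liner": "AI copilots for example workflows",
    "industries": ["B2B", "Artificial Intelligence"],
    "team_size": "11-50",
    "top_company": False,
    "isHiring": True,
    "all_locations": ["San Francisco, CA, USA"],
    "founders": [{"full_name": "Jane Doe", "title": "CEO", "linkedin_url": ""}],
    "website": "https://example.com",
}
```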
The YC API Endpoint
YC's company search runs through Algolia, and the frontend calls it with a structured payload. The Algolia credentials are embedded in the YC frontend bundle and have been stable for years:
import httpx
import json
import time
import random
import sqlite3
from dataclasses import dataclass, field
from typing import Optional
ALGOLIA_APP_ID = "45BWZJ1SGC"
ALGOLIA_API_KEY = "Zjk5ZmE4OTRjNmFlZDdlNjFlZWFlY2EyYjczODE2NzM="
ALGOLIA_INDEX = "YCCompany_production"
ALGOLIA_URL = f"https://{ALGOLIA_APP_ID}-dsn.algolia.net/1/indexes/{ALGOLIA_INDEX}/query"
ALGOLIA_HEADERS = {
"X-Algolia-Application-Id": ALGOLIA_APP_ID,
"X-Algolia-API-Key": ALGOLIA_API_KEY,
"Content-Type": "application/json",
"Origin": "https://www.ycombinator.com",
"Referer": "https://www.ycombinator.com/",
}
# All available fields in the Algolia index
ALL_FIELDS = [
"name", "slug", "one_liner", "long_description",
"batch", "status", "industries", "tags",
"team_size", "top_company", "isHiring",
"website", "all_locations", "founders",
"stage", "objectID",
]
def build_search_payload(
query: str = "",
batch: Optional[str] = None,
status: Optional[str] = None,
industries: Optional[list[str]] = None,
page: int = 0,
hits_per_page: int = 50,
sort: Optional[str] = None,
) -> dict:
"""
Build Algolia search payload for YC companies.
batch: 'W24', 'S25', etc.
status: 'Active', 'Inactive', 'Acquired', 'Public'
industries: list of industry names
sort: None (relevance), 'top' (top companies first)
"""
filters_parts = []
if batch:
filters_parts.append(f'batch:"{batch}"')
if status:
filters_parts.append(f'status:"{status}"')
facet_filters = []
if industries:
for ind in industries:
facet_filters.append([f"industries:{ind}"])
payload = {
"query": query,
"filters": " AND ".join(filters_parts),
"facetFilters": facet_filters,
"page": page,
"hitsPerPage": hits_per_page,
"attributesToRetrieve": ALL_FIELDS,
"attributesToHighlight": [],
}
    # YC maintains a replica index sorted by "top companies". Note that the
    # single-index query endpoint takes the index name from the URL, so this
    # payload key alone has no effect: to sort by "top", POST the same payload
    # to the replica's URL instead, e.g.
    # f"https://{ALGOLIA_APP_ID}-dsn.algolia.net/1/indexes/{ALGOLIA_INDEX}_top_company_by_arr/query"
    if sort == "top":
        payload["indexName"] = f"{ALGOLIA_INDEX}_top_company_by_arr"
    return payload
def search_yc_companies(
payload: dict,
client: Optional[httpx.Client] = None,
retries: int = 3,
delay: float = 0.5,
) -> dict:
"""
Execute an Algolia search against the YC company index.
Returns raw Algolia response.
"""
if client is None:
client = httpx.Client(timeout=30)
for attempt in range(retries):
try:
resp = client.post(ALGOLIA_URL, headers=ALGOLIA_HEADERS, json=payload)
if resp.status_code == 429:
wait = float(resp.headers.get("Retry-After", 10 * (attempt + 1)))
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
resp.raise_for_status()
return resp.json()
except httpx.TimeoutException:
if attempt == retries - 1:
raise
time.sleep(3 * (attempt + 1))
except httpx.HTTPStatusError as e:
if attempt == retries - 1:
raise
print(f"HTTP error {e.response.status_code} on attempt {attempt + 1}")
time.sleep(2 * (attempt + 1))
return {}
Extracting and Parsing Company Data
@dataclass
class YCCompany:
name: str
slug: str
url: str
website: str
one_liner: str
long_description: str
batch: str
status: str
industries: list
tags: list
team_size: Optional[str]
top_company: bool
is_hiring: bool
locations: list
founders: list
object_id: str
def parse_company(hit: dict) -> YCCompany:
"""Parse an Algolia hit into a structured YCCompany object."""
founders = []
for f in hit.get("founders", []):
founders.append({
"name": f.get("full_name", ""),
"title": f.get("title", ""),
"linkedin_url": f.get("linkedin_url", ""),
"twitter_url": f.get("twitter_url", ""),
})
slug = hit.get("slug", "")
return YCCompany(
name=hit.get("name", ""),
slug=slug,
url=f"https://www.ycombinator.com/companies/{slug}" if slug else "",
website=hit.get("website", ""),
one_liner=hit.get("one_liner", ""),
long_description=(hit.get("long_description") or "")[:1000],
batch=hit.get("batch", ""),
status=hit.get("status", ""),
industries=hit.get("industries", []),
tags=hit.get("tags", []),
team_size=hit.get("team_size"),
top_company=hit.get("top_company", False),
is_hiring=hit.get("isHiring", False),
locations=hit.get("all_locations", []),
founders=founders,
object_id=hit.get("objectID", ""),
)
def get_all_companies_for_batch(
batch: str,
client: Optional[httpx.Client] = None,
delay: float = 0.5,
) -> list[YCCompany]:
"""
Retrieve all companies for a specific YC batch.
Handles Algolia pagination automatically.
"""
if client is None:
client = httpx.Client(timeout=30)
companies = []
page = 0
while True:
payload = build_search_payload(batch=batch, page=page, hits_per_page=50)
data = search_yc_companies(payload, client=client)
hits = data.get("hits", [])
if not hits:
break
companies.extend([parse_company(h) for h in hits])
nb_pages = data.get("nbPages", 1)
if page >= nb_pages - 1:
break
page += 1
time.sleep(delay)
return companies
def get_all_companies(
status_filter: Optional[str] = None,
industry_filter: Optional[list[str]] = None,
max_total: int = 10000,
delay: float = 0.6,
) -> list[YCCompany]:
"""
Retrieve the full YC company directory with optional filters.
Note: Algolia caps results at 1000 per query. To get everything,
paginate by batch or use multiple filtered queries.
"""
client = httpx.Client(timeout=30)
all_companies = []
page = 0
while len(all_companies) < max_total:
payload = build_search_payload(
status=status_filter,
industries=industry_filter,
page=page,
hits_per_page=50,
)
data = search_yc_companies(payload, client=client)
hits = data.get("hits", [])
if not hits:
break
all_companies.extend([parse_company(h) for h in hits])
nb_pages = data.get("nbPages", 1)
total = data.get("nbHits", 0)
print(f" Page {page + 1}/{nb_pages}: {len(all_companies)}/{total} companies")
if page >= nb_pages - 1 or page >= 19:
# Algolia standard indices cap at page 20 (1000 results)
# For full coverage, split by batch instead
break
page += 1
time.sleep(delay)
client.close()
return all_companies
def get_all_batches_complete(delay: float = 0.5) -> dict[str, list[YCCompany]]:
"""
Get complete data by iterating through all known batches.
This bypasses Algolia's 1000-result cap by querying per-batch.
"""
    from datetime import datetime
# Generate batch labels from W05 to current
client = httpx.Client(timeout=30)
current_year = datetime.now().year % 100
batches = []
for year in range(5, current_year + 2):
yy = str(year).zfill(2)
batches.extend([f"W{yy}", f"S{yy}"])
# Also include special batches
batches.extend(["IK12", "W95", "S98"])
all_by_batch = {}
for batch in batches:
payload = build_search_payload(batch=batch, hits_per_page=1)
test_data = search_yc_companies(payload, client=client)
total = test_data.get("nbHits", 0)
if total == 0:
continue
print(f"Batch {batch}: {total} companies")
all_by_batch[batch] = get_all_companies_for_batch(batch, client=client, delay=delay)
time.sleep(0.5)
client.close()
return all_by_batch
Playwright Fallback for Company Detail Pages
When you need data not in the Algolia index — investor lists, launch posts, job descriptions, current news — you need to render the YC website:
from playwright.sync_api import sync_playwright, TimeoutError as PWTimeoutError
from bs4 import BeautifulSoup
import re
def scrape_company_detail_page(
slug: str,
proxy_config: Optional[dict] = None,
timeout_ms: int = 30000,
) -> dict:
"""
Scrape a YC company's detail page for data not in Algolia.
proxy_config: {'server': 'http://host:port', 'username': 'u', 'password': 'p'}
"""
url = f"https://www.ycombinator.com/companies/{slug}"
with sync_playwright() as p:
launch_args = {
"headless": True,
"args": [
"--no-sandbox",
"--disable-dev-shm-usage",
"--disable-blink-features=AutomationControlled",
],
}
browser = p.chromium.launch(**launch_args)
context_args = {
"user_agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"viewport": {"width": 1440, "height": 900},
"locale": "en-US",
}
if proxy_config:
context_args["proxy"] = proxy_config
context = browser.new_context(**context_args)
# Block unnecessary resources to speed up loading
context.route("**/*.{png,jpg,gif,webp,woff,woff2,ttf,mp4}", lambda route: route.abort())
page = context.new_page()
try:
page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
page.wait_for_selector("[class*='company-name'], h1", timeout=5000)
except PWTimeoutError:
pass # Try to parse whatever loaded
html = page.content()
browser.close()
soup = BeautifulSoup(html, "lxml")
result = {"slug": slug, "url": url}
    # Investors section. Heuristic: walk the siblings of any "Investors" label
    # and collect span text. BeautifulSoup deprecated text= in favor of string=.
    investors = []
    for label in soup.find_all(string=re.compile(r"Investors", re.I)):
        parent = label.parent
        if parent is None:
            continue
        for sibling in parent.find_next_siblings():
            names = sibling.find_all("span")
            investors.extend(n.get_text(strip=True) for n in names if n.get_text(strip=True))
    result["investors"] = sorted(set(investors))
# News mentions
news_links = []
for link in soup.find_all("a", href=True):
href = link.get("href", "")
if href.startswith("http") and "ycombinator.com" not in href:
text = link.get_text(strip=True)
if len(text) > 20:
news_links.append({"url": href, "text": text[:200]})
result["news_mentions"] = news_links[:10]
# Job listings
jobs = []
for job_el in soup.find_all(attrs={"class": re.compile(r"job|role", re.I)}):
title = job_el.get_text(strip=True)
if title and len(title) < 100:
jobs.append(title)
result["active_jobs"] = list(set(jobs))
    # YC batch from page text (string= replaces the deprecated text= argument;
    # the single pattern [WS]\d{2,4} covers both two- and four-digit labels)
    batch_el = soup.find(string=re.compile(r"[WS]\d{2,4}"))
    if batch_el:
        match = re.search(r"([WS]\d{2,4})", str(batch_el))
        if match:
            result["batch_from_page"] = match.group(1)
return result
def batch_scrape_detail_pages(
    slugs: list[str],
    proxy_config: Optional[dict] = None,
    delay: float = 3.0,
) -> list[dict]:
    """Scrape multiple YC company detail pages sequentially, with jittered delays."""
results = []
for i, slug in enumerate(slugs):
print(f"Scraping {i + 1}/{len(slugs)}: {slug}")
try:
detail = scrape_company_detail_page(slug, proxy_config=proxy_config)
results.append(detail)
except Exception as e:
print(f" Failed {slug}: {e}")
results.append({"slug": slug, "error": str(e)})
time.sleep(delay + random.uniform(0, 1))
return results
Anti-Detection Setup
YC's defenses are layered but not as aggressive as commercial platforms:
Cloudflare on the main site. Requests without browser-like headers get challenged or blocked. The Algolia endpoint is more permissive — it's a CDN-cached API endpoint, not the main site.
Algolia rate limits. The YC Algolia integration allows roughly 2-5 requests per second before you see 429 responses. For batch collection, stay at 1-2/sec.
IP blocking on ycombinator.com. Direct HTML scraping at volume from a single IP triggers Cloudflare blocks. For Playwright-based page scraping, residential proxies are recommended for anything beyond a few hundred pages.
import random
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
def polite_delay(base: float = 0.6, jitter: float = 0.4):
"""Sleep for base + random jitter seconds."""
time.sleep(base + random.uniform(0, jitter))
def algolia_request_with_backoff(
payload: dict,
client: httpx.Client,
max_retries: int = 5,
) -> dict:
"""
Execute an Algolia request with exponential backoff.
Handles rate limiting, timeouts, and transient server errors.
"""
for attempt in range(max_retries):
try:
resp = client.post(ALGOLIA_URL, headers=ALGOLIA_HEADERS, json=payload)
if resp.status_code == 200:
return resp.json()
elif resp.status_code == 429:
retry_after = float(resp.headers.get("Retry-After", 2 ** attempt * 5))
print(f"Rate limited. Waiting {retry_after:.1f}s (attempt {attempt + 1})")
time.sleep(retry_after)
elif resp.status_code in (500, 502, 503):
wait = 2 ** attempt * 2
print(f"Server error {resp.status_code}. Waiting {wait}s...")
time.sleep(wait)
else:
resp.raise_for_status()
except httpx.TimeoutException:
wait = 5 * (attempt + 1)
print(f"Timeout on attempt {attempt + 1}. Waiting {wait}s...")
time.sleep(wait)
    raise RuntimeError("Max retries exceeded for Algolia request")
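When no Retry-After header is present, the fallback wait doubles per attempt. The schedule for the default five retries, written out as its own helper:

```python
def backoff_schedule(max_retries: int, base: float = 5.0) -> list[float]:
    """Fallback waits after successive 429s without a Retry-After header:
    2**attempt * base seconds, i.e. 5, 10, 20, 40, 80 for five retries."""
    return [2 ** attempt * base for attempt in range(max_retries)]
```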
# ThorData proxy config for Playwright-based scraping
THORDATA_PROXY = {
"server": "http://proxy.thordata.com:9001",
"username": "YOUR_USERNAME",
"password": "YOUR_PASSWORD",
}
# For the Algolia API endpoint, proxies are optional at low volume
# At high volume (bulk batch downloads), distribute load:
ALGOLIA_PROXIES = {
    "http://": "http://YOUR_USERNAME:YOUR_PASSWORD@proxy.thordata.com:9001",
    "https://": "http://YOUR_USERNAME:YOUR_PASSWORD@proxy.thordata.com:9001",
}
def build_proxied_httpx_client() -> httpx.Client:
    """Build an httpx client routing through ThorData.

    Note: httpx deprecated and later removed the `proxies=` argument;
    on current versions use per-scheme `mounts=` (or `proxy=` for a single URL).
    """
    return httpx.Client(
        mounts={
            scheme: httpx.HTTPTransport(proxy=url)
            for scheme, url in ALGOLIA_PROXIES.items()
        },
        headers={
            "Origin": "https://www.ycombinator.com",
            "Referer": "https://www.ycombinator.com/",
            "User-Agent": random.choice(USER_AGENTS),
        },
        timeout=30,
    )
Storage and Analysis
def setup_yc_database(db_path: str = "yc_companies.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS companies (
slug TEXT PRIMARY KEY,
name TEXT,
url TEXT,
website TEXT,
one_liner TEXT,
long_description TEXT,
batch TEXT,
status TEXT,
team_size TEXT,
top_company INTEGER DEFAULT 0,
is_hiring INTEGER DEFAULT 0,
industries TEXT,
tags TEXT,
locations TEXT,
founder_count INTEGER DEFAULT 0,
object_id TEXT,
scraped_at TEXT DEFAULT (datetime('now')),
updated_at TEXT
);
CREATE TABLE IF NOT EXISTS founders (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_slug TEXT NOT NULL,
name TEXT,
title TEXT,
linkedin_url TEXT,
twitter_url TEXT,
FOREIGN KEY (company_slug) REFERENCES companies(slug)
);
CREATE TABLE IF NOT EXISTS company_details (
slug TEXT PRIMARY KEY,
investors TEXT,
news_mentions TEXT,
active_jobs TEXT,
scraped_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY (slug) REFERENCES companies(slug)
);
CREATE INDEX IF NOT EXISTS idx_companies_batch ON companies(batch);
CREATE INDEX IF NOT EXISTS idx_companies_status ON companies(status);
CREATE INDEX IF NOT EXISTS idx_companies_top ON companies(top_company);
CREATE INDEX IF NOT EXISTS idx_founders_company ON founders(company_slug);
CREATE INDEX IF NOT EXISTS idx_companies_hiring ON companies(is_hiring);
""")
conn.commit()
return conn
def save_company(conn: sqlite3.Connection, company: YCCompany):
conn.execute("""
INSERT OR REPLACE INTO companies
(slug, name, url, website, one_liner, long_description, batch, status,
team_size, top_company, is_hiring, industries, tags, locations,
founder_count, object_id, updated_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, datetime('now'))
""", (
company.slug, company.name, company.url, company.website,
company.one_liner, company.long_description, company.batch, company.status,
company.team_size, int(company.top_company), int(company.is_hiring),
json.dumps(company.industries), json.dumps(company.tags),
json.dumps(company.locations), len(company.founders),
company.object_id,
))
# Save founders
conn.execute("DELETE FROM founders WHERE company_slug = ?", (company.slug,))
for f in company.founders:
conn.execute(
"INSERT INTO founders (company_slug, name, title, linkedin_url, twitter_url) VALUES (?, ?, ?, ?, ?)",
(company.slug, f.get("name"), f.get("title"), f.get("linkedin_url"), f.get("twitter_url")),
)
conn.commit()
def analyze_batch(conn: sqlite3.Connection, batch: str) -> dict:
"""Generate summary stats for a YC batch."""
rows = conn.execute(
"SELECT status, COUNT(*) as count FROM companies WHERE batch = ? GROUP BY status",
(batch,)
).fetchall()
status_counts = {r[0]: r[1] for r in rows}
hiring = conn.execute(
"SELECT COUNT(*) FROM companies WHERE batch = ? AND is_hiring = 1",
(batch,)
).fetchone()[0]
top = conn.execute(
"SELECT COUNT(*) FROM companies WHERE batch = ? AND top_company = 1",
(batch,)
).fetchone()[0]
total = conn.execute(
"SELECT COUNT(*) FROM companies WHERE batch = ?",
(batch,)
).fetchone()[0]
# Top industries
industry_rows = conn.execute(
"SELECT industries FROM companies WHERE batch = ?",
(batch,)
).fetchall()
industry_counts = {}
for row in industry_rows:
try:
industries = json.loads(row[0] or "[]")
for ind in industries:
industry_counts[ind] = industry_counts.get(ind, 0) + 1
except (json.JSONDecodeError, TypeError):
pass
top_industries = sorted(industry_counts.items(), key=lambda x: x[1], reverse=True)[:10]
return {
"batch": batch,
"total_companies": total,
"status_breakdown": status_counts,
"currently_hiring": hiring,
"top_company_count": top,
"top_industries": top_industries,
"survival_rate": round(
(status_counts.get("Active", 0) / total * 100) if total else 0, 1
),
}
Practical Analysis Patterns
Deal flow tracking. Pull each new YC batch within days of announcement, parse industry tags, and filter for sectors you're monitoring:
def find_companies_by_industry(
conn: sqlite3.Connection,
industry_keywords: list[str],
status: str = "Active",
batches: Optional[list[str]] = None,
) -> list[dict]:
"""Find companies matching any of the given industry keywords."""
query = "SELECT slug, name, batch, status, one_liner, website, industries, is_hiring FROM companies WHERE 1=1"
params = []
if status:
query += " AND status = ?"
params.append(status)
if batches:
placeholders = ",".join(["?"] * len(batches))
query += f" AND batch IN ({placeholders})"
params.extend(batches)
rows = conn.execute(query, params).fetchall()
results = []
for row in rows:
try:
industries = json.loads(row[6] or "[]")
except (json.JSONDecodeError, TypeError):
industries = []
if any(
any(kw.lower() in ind.lower() for kw in industry_keywords)
for ind in industries
):
results.append({
"slug": row[0],
"name": row[1],
"batch": row[2],
"status": row[3],
"one_liner": row[4],
"website": row[5],
"industries": industries,
"is_hiring": bool(row[7]),
})
return results
def find_serial_founders(conn: sqlite3.Connection) -> list[dict]:
"""Find founders who appear in multiple YC-backed companies."""
rows = conn.execute("""
SELECT name, COUNT(DISTINCT company_slug) as company_count,
GROUP_CONCAT(company_slug, ', ') as companies
FROM founders
WHERE name != '' AND name IS NOT NULL
GROUP BY name
HAVING company_count > 1
ORDER BY company_count DESC
LIMIT 50
""").fetchall()
return [
{"founder": r[0], "company_count": r[1], "companies": r[2].split(", ")}
for r in rows
]
def track_batch_over_time(
conn: sqlite3.Connection,
batch: str,
) -> dict:
"""
Compare current state of a batch against historical data.
Useful for tracking which companies went inactive, got acquired, etc.
"""
# This assumes you've run multiple collection rounds and stored timestamps
current = conn.execute(
"SELECT slug, status, updated_at FROM companies WHERE batch = ? ORDER BY updated_at DESC",
(batch,)
).fetchall()
return {
"batch": batch,
"total": len(current),
"as_of": current[0][2] if current else None,
"statuses": {r[0]: r[1] for r in current},
}
Running a Complete Collection
def run_full_yc_collection(
db_path: str = "yc_companies.db",
use_proxy: bool = False,
include_detail_pages: bool = False,
) -> dict:
"""
Full pipeline: collect all YC companies from Algolia,
store in SQLite, optionally scrape detail pages.
"""
conn = setup_yc_database(db_path)
client = build_proxied_httpx_client() if use_proxy else httpx.Client(timeout=30)
stats = {"batches_processed": 0, "companies_saved": 0, "errors": 0}
# Get all batches
print("Collecting all batches...")
all_batches_data = get_all_batches_complete(delay=0.6)
for batch, companies in all_batches_data.items():
print(f"Saving batch {batch}: {len(companies)} companies")
for company in companies:
try:
save_company(conn, company)
stats["companies_saved"] += 1
except Exception as e:
print(f" Error saving {company.slug}: {e}")
stats["errors"] += 1
stats["batches_processed"] += 1
# Optionally scrape detail pages for top companies
if include_detail_pages:
top_companies = conn.execute(
"SELECT slug FROM companies WHERE top_company = 1 AND status = 'Active'"
).fetchall()
proxy_config = THORDATA_PROXY if use_proxy else None
slugs = [r[0] for r in top_companies]
print(f"\nScraping {len(slugs)} top company detail pages...")
details = batch_scrape_detail_pages(
slugs[:50], # Start with first 50 to test
proxy_config=proxy_config,
)
for detail in details:
if "error" not in detail:
conn.execute("""
INSERT OR REPLACE INTO company_details (slug, investors, news_mentions, active_jobs)
VALUES (?, ?, ?, ?)
""", (
detail.get("slug"),
json.dumps(detail.get("investors", [])),
json.dumps(detail.get("news_mentions", [])),
json.dumps(detail.get("active_jobs", [])),
))
conn.commit()
# Generate summary
total = conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
active = conn.execute("SELECT COUNT(*) FROM companies WHERE status = 'Active'").fetchone()[0]
top = conn.execute("SELECT COUNT(*) FROM companies WHERE top_company = 1").fetchone()[0]
hiring = conn.execute("SELECT COUNT(*) FROM companies WHERE is_hiring = 1").fetchone()[0]
print(f"\n=== Collection Complete ===")
print(f"Total companies: {total:,}")
print(f"Active: {active:,}")
print(f"Top companies: {top:,}")
print(f"Currently hiring: {hiring:,}")
client.close()
conn.close()
return {**stats, "total_in_db": total, "active": active, "top": top, "hiring": hiring}
What You Can Build With This Data
Deal flow screening. Filter by batch recency, industry, and top_company flag to build a shortlist. Cross-reference founders against LinkedIn profiles and prior company exits for signal on team quality.
Competitive landscape mapping. Search by industry tags to find all YC companies in a specific market segment. The status field tells you which ones are still active, acquired, or public — useful for understanding how the competitive landscape has evolved.
Hiring trend analysis. The isHiring flag combined with team size gives a rough signal on which companies are in growth mode. A company that was showing 1-10 employees six months ago and is now at 11-50 with open roles suggests product-market fit and funding.
Founder network analysis. The find_serial_founders function above surfaces people who have built multiple YC companies. Cross-referencing with batch years can tell you which founders are repeat YC participants vs. joining as a co-founder on a second company.
Batch benchmarking. Compare cohort-to-cohort survival rates, industry mix shifts, and top company concentration over time. Each batch is a natural experiment in what markets were attractive to high-quality founders at that moment.
One practical note on data freshness: the Algolia index updates periodically, not in real time. Status changes (acquisitions, shutdowns) may lag the actual event by weeks. For anything time-sensitive, supplement with Crunchbase or news monitoring. The YC directory is authoritative for batch membership and initial company details; it's less reliable as a real-time status tracker.