How to Scrape Y Combinator Companies: Python Guide (2026)
Y Combinator's company directory is one of the richest startup datasets publicly available. It covers thousands of companies going back to 2005, with batch labels (W24, S25), funding status, industry tags, and founder details. Whether you're doing deal flow tracking, competitive intelligence, or market research into startup ecosystems, the YC directory is an essential data source.
This guide covers the full extraction pipeline: the Algolia API endpoint YC uses internally, Playwright-based scraping for JS-rendered detail pages, anti-detection setup, and practical analysis use cases.
What the YC Directory Contains
The directory at ycombinator.com/companies covers every company that has been through YC since 2005. Key fields available:
- Batch — W24, S25, IK12 (Imagine K12), W95 (pre-standardized naming)
- Status — Active, Inactive, Acquired, Public
- One-liner — 150-character description of the company
- Industry tags — up to several tags per company from a controlled vocabulary
- Team size — rough headcount band
- Top company flag — YC's internal marker for their standout portfolio companies
- Hiring status — whether they're currently recruiting
- Founder details — names, titles, LinkedIn URLs (when provided)
- Locations — city, country
- Website — the company's current URL
What you won't find here: funding amounts, investors, valuation, revenue. Those are on Crunchbase or need to be gathered from SEC filings (for public/late-stage companies).
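To make the field list concrete, here is the rough shape of a single directory record. This is a hypothetical, abbreviated example — all values are invented for illustration; only the field names come from the list above:

```python
# A hypothetical, abbreviated YC directory record. Field names match the
# directory fields listed above; the values are invented for illustration.
sample_company = {
    "name": "ExampleAI",
    "slug": "exampleai",
    "batch": "W24",
    "status": "Active",
    "one_liner": "AI copilots for example workflows",
    "industries": ["B2B", "Artificial Intelligence"],
    "team_size": "11-50",
    "top_company": False,
    "isHiring": True,
    "all_locations": ["San Francisco, CA, USA"],
    "founders": [{"full_name": "Jane Doe", "title": "CEO", "linkedin_url": ""}],
    "website": "https://example.com",
}
```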
The YC API Endpoint
YC's company search runs through Algolia, and the frontend calls it with a structured payload. The Algolia credentials are embedded in the YC frontend bundle and have been stable for years:
import httpx
import json
import time
import random
import sqlite3
from dataclasses import dataclass, field
from typing import Optional
ALGOLIA_APP_ID = "45BWZJ1SGC"
ALGOLIA_API_KEY = "Zjk5ZmE4OTRjNmFlZDdlNjFlZWFlY2EyYjczODE2NzM="
ALGOLIA_INDEX = "YCCompany_production"
ALGOLIA_URL = f"https://{ALGOLIA_APP_ID}-dsn.algolia.net/1/indexes/{ALGOLIA_INDEX}/query"
ALGOLIA_HEADERS = {
"X-Algolia-Application-Id": ALGOLIA_APP_ID,
"X-Algolia-API-Key": ALGOLIA_API_KEY,
"Content-Type": "application/json",
"Origin": "https://www.ycombinator.com",
"Referer": "https://www.ycombinator.com/",
}
# All available fields in the Algolia index
ALL_FIELDS = [
"name", "slug", "one_liner", "long_description",
"batch", "status", "industries", "tags",
"team_size", "top_company", "isHiring",
"website", "all_locations", "founders",
"stage", "objectID",
]
def build_search_payload(
query: str = "",
batch: Optional[str] = None,
status: Optional[str] = None,
industries: Optional[list[str]] = None,
page: int = 0,
hits_per_page: int = 50,
sort: Optional[str] = None,
) -> dict:
"""
Build Algolia search payload for YC companies.
batch: 'W24', 'S25', etc.
status: 'Active', 'Inactive', 'Acquired', 'Public'
industries: list of industry names
sort: None (relevance), 'top' (top companies first)
"""
filters_parts = []
if batch:
filters_parts.append(f'batch:"{batch}"')
if status:
filters_parts.append(f'status:"{status}"')
facet_filters = []
if industries:
for ind in industries:
facet_filters.append([f"industries:{ind}"])
payload = {
"query": query,
"filters": " AND ".join(filters_parts),
"facetFilters": facet_filters,
"page": page,
"hitsPerPage": hits_per_page,
"attributesToRetrieve": ALL_FIELDS,
"attributesToHighlight": [],
}
    # YC maintains a replica index sorted by "top companies". Note that the
    # single-index query endpoint takes the index name from the URL, so this
    # payload key alone has no effect: to sort by "top", POST the same payload
    # to the replica's URL instead, e.g.
    # f"https://{ALGOLIA_APP_ID}-dsn.algolia.net/1/indexes/{ALGOLIA_INDEX}_top_company_by_arr/query"
    if sort == "top":
        payload["indexName"] = f"{ALGOLIA_INDEX}_top_company_by_arr"
    return payload
def search_yc_companies(
payload: dict,
client: Optional[httpx.Client] = None,
retries: int = 3,
delay: float = 0.5,
) -> dict:
"""
Execute an Algolia search against the YC company index.
Returns raw Algolia response.
"""
if client is None:
client = httpx.Client(timeout=30)
for attempt in range(retries):
try:
resp = client.post(ALGOLIA_URL, headers=ALGOLIA_HEADERS, json=payload)
if resp.status_code == 429:
wait = float(resp.headers.get("Retry-After", 10 * (attempt + 1)))
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
resp.raise_for_status()
return resp.json()
except httpx.TimeoutException:
if attempt == retries - 1:
raise
time.sleep(3 * (attempt + 1))
except httpx.HTTPStatusError as e:
if attempt == retries - 1:
raise
print(f"HTTP error {e.response.status_code} on attempt {attempt + 1}")
time.sleep(2 * (attempt + 1))
return {}
Extracting and Parsing Company Data
@dataclass
class YCCompany:
name: str
slug: str
url: str
website: str
one_liner: str
long_description: str
batch: str
status: str
industries: list
tags: list
team_size: Optional[str]
top_company: bool
is_hiring: bool
locations: list
founders: list
object_id: str
def parse_company(hit: dict) -> YCCompany:
"""Parse an Algolia hit into a structured YCCompany object."""
founders = []
for f in hit.get("founders", []):
founders.append({
"name": f.get("full_name", ""),
"title": f.get("title", ""),
"linkedin_url": f.get("linkedin_url", ""),
"twitter_url": f.get("twitter_url", ""),
})
slug = hit.get("slug", "")
return YCCompany(
name=hit.get("name", ""),
slug=slug,
url=f"https://www.ycombinator.com/companies/{slug}" if slug else "",
website=hit.get("website", ""),
one_liner=hit.get("one_liner", ""),
long_description=(hit.get("long_description") or "")[:1000],
batch=hit.get("batch", ""),
status=hit.get("status", ""),
industries=hit.get("industries", []),
tags=hit.get("tags", []),
team_size=hit.get("team_size"),
top_company=hit.get("top_company", False),
is_hiring=hit.get("isHiring", False),
locations=hit.get("all_locations", []),
founders=founders,
object_id=hit.get("objectID", ""),
)
def get_all_companies_for_batch(
batch: str,
client: Optional[httpx.Client] = None,
delay: float = 0.5,
) -> list[YCCompany]:
"""
Retrieve all companies for a specific YC batch.
Handles Algolia pagination automatically.
"""
if client is None:
client = httpx.Client(timeout=30)
companies = []
page = 0
while True:
payload = build_search_payload(batch=batch, page=page, hits_per_page=50)
data = search_yc_companies(payload, client=client)
hits = data.get("hits", [])
if not hits:
break
companies.extend([parse_company(h) for h in hits])
nb_pages = data.get("nbPages", 1)
if page >= nb_pages - 1:
break
page += 1
time.sleep(delay)
return companies
def get_all_companies(
status_filter: Optional[str] = None,
industry_filter: Optional[list[str]] = None,
max_total: int = 10000,
delay: float = 0.6,
) -> list[YCCompany]:
"""
Retrieve the full YC company directory with optional filters.
Note: Algolia caps results at 1000 per query. To get everything,
paginate by batch or use multiple filtered queries.
"""
client = httpx.Client(timeout=30)
all_companies = []
page = 0
while len(all_companies) < max_total:
payload = build_search_payload(
status=status_filter,
industries=industry_filter,
page=page,
hits_per_page=50,
)
data = search_yc_companies(payload, client=client)
hits = data.get("hits", [])
if not hits:
break
all_companies.extend([parse_company(h) for h in hits])
nb_pages = data.get("nbPages", 1)
total = data.get("nbHits", 0)
print(f" Page {page + 1}/{nb_pages}: {len(all_companies)}/{total} companies")
if page >= nb_pages - 1 or page >= 19:
# Algolia standard indices cap at page 20 (1000 results)
# For full coverage, split by batch instead
break
page += 1
time.sleep(delay)
client.close()
return all_companies
def get_all_batches_complete(delay: float = 0.5) -> dict[str, list[YCCompany]]:
"""
Get complete data by iterating through all known batches.
This bypasses Algolia's 1000-result cap by querying per-batch.
"""
    from datetime import datetime
# Generate batch labels from W05 to current
client = httpx.Client(timeout=30)
current_year = datetime.now().year % 100
batches = []
for year in range(5, current_year + 2):
yy = str(year).zfill(2)
batches.extend([f"W{yy}", f"S{yy}"])
# Also include special batches
batches.extend(["IK12", "W95", "S98"])
all_by_batch = {}
for batch in batches:
payload = build_search_payload(batch=batch, hits_per_page=1)
test_data = search_yc_companies(payload, client=client)
total = test_data.get("nbHits", 0)
if total == 0:
continue
print(f"Batch {batch}: {total} companies")
all_by_batch[batch] = get_all_companies_for_batch(batch, client=client, delay=delay)
time.sleep(0.5)
client.close()
return all_by_batch
Playwright Fallback for Company Detail Pages
When you need data not in the Algolia index — investor lists, launch posts, job descriptions, current news — you need to render the YC website:
from playwright.sync_api import sync_playwright, TimeoutError as PWTimeoutError
from bs4 import BeautifulSoup
import re
def scrape_company_detail_page(
slug: str,
proxy_config: Optional[dict] = None,
timeout_ms: int = 30000,
) -> dict:
"""
Scrape a YC company's detail page for data not in Algolia.
proxy_config: {'server': 'http://host:port', 'username': 'u', 'password': 'p'}
"""
url = f"https://www.ycombinator.com/companies/{slug}"
with sync_playwright() as p:
launch_args = {
"headless": True,
"args": [
"--no-sandbox",
"--disable-dev-shm-usage",
"--disable-blink-features=AutomationControlled",
],
}
browser = p.chromium.launch(**launch_args)
context_args = {
"user_agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"viewport": {"width": 1440, "height": 900},
"locale": "en-US",
}
if proxy_config:
context_args["proxy"] = proxy_config
context = browser.new_context(**context_args)
# Block unnecessary resources to speed up loading
context.route("**/*.{png,jpg,gif,webp,woff,woff2,ttf,mp4}", lambda route: route.abort())
page = context.new_page()
try:
page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
page.wait_for_selector("[class*='company-name'], h1", timeout=5000)
except PWTimeoutError:
pass # Try to parse whatever loaded
html = page.content()
browser.close()
soup = BeautifulSoup(html, "lxml")
result = {"slug": slug, "url": url}
    # Investors section. Heuristic: walk the siblings of any "Investors" label
    # and collect span text. BeautifulSoup deprecated text= in favor of string=.
    investors = []
    for label in soup.find_all(string=re.compile(r"Investors", re.I)):
        parent = label.parent
        if parent is None:
            continue
        for sibling in parent.find_next_siblings():
            names = sibling.find_all("span")
            investors.extend(n.get_text(strip=True) for n in names if n.get_text(strip=True))
    result["investors"] = sorted(set(investors))
# News mentions
news_links = []
for link in soup.find_all("a", href=True):
href = link.get("href", "")
if href.startswith("http") and "ycombinator.com" not in href:
text = link.get_text(strip=True)
if len(text) > 20:
news_links.append({"url": href, "text": text[:200]})
result["news_mentions"] = news_links[:10]
# Job listings
jobs = []
for job_el in soup.find_all(attrs={"class": re.compile(r"job|role", re.I)}):
title = job_el.get_text(strip=True)
if title and len(title) < 100:
jobs.append(title)
result["active_jobs"] = list(set(jobs))
    # YC batch from page text (string= replaces the deprecated text= argument;
    # the single pattern [WS]\d{2,4} covers both two- and four-digit labels)
    batch_el = soup.find(string=re.compile(r"[WS]\d{2,4}"))
    if batch_el:
        match = re.search(r"([WS]\d{2,4})", str(batch_el))
        if match:
            result["batch_from_page"] = match.group(1)
return result
def batch_scrape_detail_pages(
    slugs: list[str],
    proxy_config: Optional[dict] = None,
    delay: float = 3.0,
) -> list[dict]:
    """Scrape multiple YC company detail pages sequentially, with jittered delays."""
results = []
for i, slug in enumerate(slugs):
print(f"Scraping {i + 1}/{len(slugs)}: {slug}")
try:
detail = scrape_company_detail_page(slug, proxy_config=proxy_config)
results.append(detail)
except Exception as e:
print(f" Failed {slug}: {e}")
results.append({"slug": slug, "error": str(e)})
time.sleep(delay + random.uniform(0, 1))
return results
Anti-Detection Setup
YC's defenses are layered but not as aggressive as commercial platforms:
Cloudflare on the main site. Requests without browser-like headers get challenged or blocked. The Algolia endpoint is more permissive — it's a CDN-cached API endpoint, not the main site.
Algolia rate limits. The YC Algolia integration allows roughly 2-5 requests per second before you see 429 responses. For batch collection, stay at 1-2/sec.
IP blocking on ycombinator.com. Direct HTML scraping at volume from a single IP triggers Cloudflare blocks. For Playwright-based page scraping, residential proxies are recommended for anything beyond a few hundred pages.
import random
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
def polite_delay(base: float = 0.6, jitter: float = 0.4):
"""Sleep for base + random jitter seconds."""
time.sleep(base + random.uniform(0, jitter))
def algolia_request_with_backoff(
payload: dict,
client: httpx.Client,
max_retries: int = 5,
) -> dict:
"""
Execute an Algolia request with exponential backoff.
Handles rate limiting, timeouts, and transient server errors.
"""
for attempt in range(max_retries):
try:
resp = client.post(ALGOLIA_URL, headers=ALGOLIA_HEADERS, json=payload)
if resp.status_code == 200:
return resp.json()
elif resp.status_code == 429:
retry_after = float(resp.headers.get("Retry-After", 2 ** attempt * 5))
print(f"Rate limited. Waiting {retry_after:.1f}s (attempt {attempt + 1})")
time.sleep(retry_after)
elif resp.status_code in (500, 502, 503):
wait = 2 ** attempt * 2
print(f"Server error {resp.status_code}. Waiting {wait}s...")
time.sleep(wait)
else:
resp.raise_for_status()
except httpx.TimeoutException:
wait = 5 * (attempt + 1)
print(f"Timeout on attempt {attempt + 1}. Waiting {wait}s...")
time.sleep(wait)
    raise RuntimeError("Max retries exceeded for Algolia request")
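When no Retry-After header is present, the fallback wait doubles per attempt. The schedule for the default five retries, written out as its own helper:

```python
def backoff_schedule(max_retries: int, base: float = 5.0) -> list[float]:
    """Fallback waits after successive 429s without a Retry-After header:
    2**attempt * base seconds, i.e. 5, 10, 20, 40, 80 for five retries."""
    return [2 ** attempt * base for attempt in range(max_retries)]
```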
# ThorData proxy config for Playwright-based scraping
THORDATA_PROXY = {
"server": "http://proxy.thordata.com:9001",
"username": "YOUR_USERNAME",
"password": "YOUR_PASSWORD",
}
# For the Algolia API endpoint, proxies are optional at low volume
# At high volume (bulk batch downloads), distribute load:
ALGOLIA_PROXIES = {
    "http://": "http://YOUR_USERNAME:YOUR_PASSWORD@proxy.thordata.com:9001",
    "https://": "http://YOUR_USERNAME:YOUR_PASSWORD@proxy.thordata.com:9001",
}
def build_proxied_httpx_client() -> httpx.Client:
    """Build an httpx client routing through ThorData.

    Note: httpx deprecated and later removed the `proxies=` argument;
    on current versions use per-scheme `mounts=` (or `proxy=` for a single URL).
    """
    return httpx.Client(
        mounts={
            scheme: httpx.HTTPTransport(proxy=url)
            for scheme, url in ALGOLIA_PROXIES.items()
        },
        headers={
            "Origin": "https://www.ycombinator.com",
            "Referer": "https://www.ycombinator.com/",
            "User-Agent": random.choice(USER_AGENTS),
        },
        timeout=30,
    )
Storage and Analysis
def setup_yc_database(db_path: str = "yc_companies.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS companies (
slug TEXT PRIMARY KEY,
name TEXT,
url TEXT,
website TEXT,
one_liner TEXT,
long_description TEXT,
batch TEXT,
status TEXT,
team_size TEXT,
top_company INTEGER DEFAULT 0,
is_hiring INTEGER DEFAULT 0,
industries TEXT,
tags TEXT,
locations TEXT,
founder_count INTEGER DEFAULT 0,
object_id TEXT,
scraped_at TEXT DEFAULT (datetime('now')),
updated_at TEXT
);
CREATE TABLE IF NOT EXISTS founders (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_slug TEXT NOT NULL,
name TEXT,
title TEXT,
linkedin_url TEXT,
twitter_url TEXT,
FOREIGN KEY (company_slug) REFERENCES companies(slug)
);
CREATE TABLE IF NOT EXISTS company_details (
slug TEXT PRIMARY KEY,
investors TEXT,
news_mentions TEXT,
active_jobs TEXT,
scraped_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY (slug) REFERENCES companies(slug)
);
CREATE INDEX IF NOT EXISTS idx_companies_batch ON companies(batch);
CREATE INDEX IF NOT EXISTS idx_companies_status ON companies(status);
CREATE INDEX IF NOT EXISTS idx_companies_top ON companies(top_company);
CREATE INDEX IF NOT EXISTS idx_founders_company ON founders(company_slug);
CREATE INDEX IF NOT EXISTS idx_companies_hiring ON companies(is_hiring);
""")
conn.commit()
return conn
def save_company(conn: sqlite3.Connection, company: YCCompany):
conn.execute("""
INSERT OR REPLACE INTO companies
(slug, name, url, website, one_liner, long_description, batch, status,
team_size, top_company, is_hiring, industries, tags, locations,
founder_count, object_id, updated_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, datetime('now'))
""", (
company.slug, company.name, company.url, company.website,
company.one_liner, company.long_description, company.batch, company.status,
company.team_size, int(company.top_company), int(company.is_hiring),
json.dumps(company.industries), json.dumps(company.tags),
json.dumps(company.locations), len(company.founders),
company.object_id,
))
# Save founders
conn.execute("DELETE FROM founders WHERE company_slug = ?", (company.slug,))
for f in company.founders:
conn.execute(
"INSERT INTO founders (company_slug, name, title, linkedin_url, twitter_url) VALUES (?, ?, ?, ?, ?)",
(company.slug, f.get("name"), f.get("title"), f.get("linkedin_url"), f.get("twitter_url")),
)
conn.commit()
def analyze_batch(conn: sqlite3.Connection, batch: str) -> dict:
"""Generate summary stats for a YC batch."""
rows = conn.execute(
"SELECT status, COUNT(*) as count FROM companies WHERE batch = ? GROUP BY status",
(batch,)
).fetchall()
status_counts = {r[0]: r[1] for r in rows}
hiring = conn.execute(
"SELECT COUNT(*) FROM companies WHERE batch = ? AND is_hiring = 1",
(batch,)
).fetchone()[0]
top = conn.execute(
"SELECT COUNT(*) FROM companies WHERE batch = ? AND top_company = 1",
(batch,)
).fetchone()[0]
total = conn.execute(
"SELECT COUNT(*) FROM companies WHERE batch = ?",
(batch,)
).fetchone()[0]
# Top industries
industry_rows = conn.execute(
"SELECT industries FROM companies WHERE batch = ?",
(batch,)
).fetchall()
industry_counts = {}
for row in industry_rows:
try:
industries = json.loads(row[0] or "[]")
for ind in industries:
industry_counts[ind] = industry_counts.get(ind, 0) + 1
except (json.JSONDecodeError, TypeError):
pass
top_industries = sorted(industry_counts.items(), key=lambda x: x[1], reverse=True)[:10]
return {
"batch": batch,
"total_companies": total,
"status_breakdown": status_counts,
"currently_hiring": hiring,
"top_company_count": top,
"top_industries": top_industries,
"survival_rate": round(
(status_counts.get("Active", 0) / total * 100) if total else 0, 1
),
}
Practical Analysis Patterns
Deal flow tracking. Pull each new YC batch within days of announcement, parse industry tags, and filter for sectors you're monitoring:
def find_companies_by_industry(
conn: sqlite3.Connection,
industry_keywords: list[str],
status: str = "Active",
batches: Optional[list[str]] = None,
) -> list[dict]:
"""Find companies matching any of the given industry keywords."""
query = "SELECT slug, name, batch, status, one_liner, website, industries, is_hiring FROM companies WHERE 1=1"
params = []
if status:
query += " AND status = ?"
params.append(status)
if batches:
placeholders = ",".join(["?"] * len(batches))
query += f" AND batch IN ({placeholders})"
params.extend(batches)
rows = conn.execute(query, params).fetchall()
results = []
for row in rows:
try:
industries = json.loads(row[6] or "[]")
except (json.JSONDecodeError, TypeError):
industries = []
if any(
any(kw.lower() in ind.lower() for kw in industry_keywords)
for ind in industries
):
results.append({
"slug": row[0],
"name": row[1],
"batch": row[2],
"status": row[3],
"one_liner": row[4],
"website": row[5],
"industries": industries,
"is_hiring": bool(row[7]),
})
return results
def find_serial_founders(conn: sqlite3.Connection) -> list[dict]:
"""Find founders who appear in multiple YC-backed companies."""
rows = conn.execute("""
SELECT name, COUNT(DISTINCT company_slug) as company_count,
GROUP_CONCAT(company_slug, ', ') as companies
FROM founders
WHERE name != '' AND name IS NOT NULL
GROUP BY name
HAVING company_count > 1
ORDER BY company_count DESC
LIMIT 50
""").fetchall()
return [
{"founder": r[0], "company_count": r[1], "companies": r[2].split(", ")}
for r in rows
]
def track_batch_over_time(
conn: sqlite3.Connection,
batch: str,
) -> dict:
"""
Compare current state of a batch against historical data.
Useful for tracking which companies went inactive, got acquired, etc.
"""
# This assumes you've run multiple collection rounds and stored timestamps
current = conn.execute(
"SELECT slug, status, updated_at FROM companies WHERE batch = ? ORDER BY updated_at DESC",
(batch,)
).fetchall()
return {
"batch": batch,
"total": len(current),
"as_of": current[0][2] if current else None,
"statuses": {r[0]: r[1] for r in current},
}
Running a Complete Collection
def run_full_yc_collection(
db_path: str = "yc_companies.db",
use_proxy: bool = False,
include_detail_pages: bool = False,
) -> dict:
"""
Full pipeline: collect all YC companies from Algolia,
store in SQLite, optionally scrape detail pages.
"""
conn = setup_yc_database(db_path)
client = build_proxied_httpx_client() if use_proxy else httpx.Client(timeout=30)
stats = {"batches_processed": 0, "companies_saved": 0, "errors": 0}
# Get all batches
print("Collecting all batches...")
all_batches_data = get_all_batches_complete(delay=0.6)
for batch, companies in all_batches_data.items():
print(f"Saving batch {batch}: {len(companies)} companies")
for company in companies:
try:
save_company(conn, company)
stats["companies_saved"] += 1
except Exception as e:
print(f" Error saving {company.slug}: {e}")
stats["errors"] += 1
stats["batches_processed"] += 1
# Optionally scrape detail pages for top companies
if include_detail_pages:
top_companies = conn.execute(
"SELECT slug FROM companies WHERE top_company = 1 AND status = 'Active'"
).fetchall()
proxy_config = THORDATA_PROXY if use_proxy else None
slugs = [r[0] for r in top_companies]
print(f"\nScraping {len(slugs)} top company detail pages...")
details = batch_scrape_detail_pages(
slugs[:50], # Start with first 50 to test
proxy_config=proxy_config,
)
for detail in details:
if "error" not in detail:
conn.execute("""
INSERT OR REPLACE INTO company_details (slug, investors, news_mentions, active_jobs)
VALUES (?, ?, ?, ?)
""", (
detail.get("slug"),
json.dumps(detail.get("investors", [])),
json.dumps(detail.get("news_mentions", [])),
json.dumps(detail.get("active_jobs", [])),
))
conn.commit()
# Generate summary
total = conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
active = conn.execute("SELECT COUNT(*) FROM companies WHERE status = 'Active'").fetchone()[0]
top = conn.execute("SELECT COUNT(*) FROM companies WHERE top_company = 1").fetchone()[0]
hiring = conn.execute("SELECT COUNT(*) FROM companies WHERE is_hiring = 1").fetchone()[0]
print(f"\n=== Collection Complete ===")
print(f"Total companies: {total:,}")
print(f"Active: {active:,}")
print(f"Top companies: {top:,}")
print(f"Currently hiring: {hiring:,}")
client.close()
conn.close()
return {**stats, "total_in_db": total, "active": active, "top": top, "hiring": hiring}
What You Can Build With This Data
Deal flow screening. Filter by batch recency, industry, and top_company flag to build a shortlist. Cross-reference founders against LinkedIn profiles and prior company exits for signal on team quality.
Competitive landscape mapping. Search by industry tags to find all YC companies in a specific market segment. The status field tells you which ones are still active, acquired, or public — useful for understanding how the competitive landscape has evolved.
Hiring trend analysis. The isHiring flag combined with team size gives a rough signal on which companies are in growth mode. A company that was showing 1-10 employees six months ago and is now at 11-50 with open roles suggests product-market fit and funding.
Founder network analysis. The find_serial_founders function above surfaces people who have built multiple YC companies. Cross-referencing with batch years can tell you which founders are repeat YC participants vs. joining as a co-founder on a second company.
Batch benchmarking. Compare cohort-to-cohort survival rates, industry mix shifts, and top company concentration over time. Each batch is a natural experiment in what markets were attractive to high-quality founders at that moment.
One practical note on data freshness: the Algolia index updates periodically, not in real time. Status changes (acquisitions, shutdowns) may lag the actual event by weeks. For anything time-sensitive, supplement with Crunchbase or news monitoring. The YC directory is authoritative for batch membership and initial company details; it's less reliable as a real-time status tracker.