How to Scrape Crunchbase Company Data in 2026: Autocomplete, Funding & Investors
Crunchbase is where startup data lives — funding rounds, investor networks, employee headcounts, acquisition histories, and company timelines. It is the first place VCs, analysts, and founders check when researching any company in the tech ecosystem.
The challenge: Crunchbase's paid API starts at $49/month and the free tier caps at 200 calls per day. However, Crunchbase exposes several endpoints that the frontend uses — most notably an autocomplete search endpoint that returns structured JSON without authentication. Combined with the official free-tier API for organization details, you can build a solid data pipeline.
This guide covers the autocomplete endpoint, the free REST API, page scraping as a fallback, funding round extraction, proxy configuration, and a complete SQLite-backed pipeline.
What Data Crunchbase Contains
Company profiles on Crunchbase are dense with structured information:
- Company overview — name, description, founded date, operating status (active, closed, acquired)
- Funding rounds — date, amount, type (Pre-Seed through Series H, Convertible Note, Grant, etc.)
- Total funding raised — aggregate in USD across all rounds
- Lead investors — name and fund details for each round
- Employee headcount — range buckets (1-10, 11-50, 51-200, 201-500, 501-1000, 1001-5000, 5001-10000, 10000+)
- Headquarters — city, region, country
- Categories — Crunchbase's industry taxonomy (up to 10 categories per company)
- Key people — founders, C-suite executives, board members
- Acquisitions — acquirer, price where disclosed, date
- IPO details — stock ticker, valuation, exchange, date
Crunchbase's Anti-Bot Architecture
Crunchbase is one of the more heavily defended scraping targets:
Cloudflare Bot Management. Every request passes through Cloudflare's full challenge pipeline. Fresh IPs with no browsing history, linear crawling patterns, and unusual header combinations all draw JS challenges or Turnstile CAPTCHAs.
Aggressive rate limiting. 30-40 requests per minute from a single IP triggers block pages. The autocomplete endpoint is slightly more lenient (~60/min) but still monitored.
Content gating. After 5-10 profile views without login, a paywall modal covers the content. The gate is cookie-based, so it resets when you clear cookies — each fresh httpx client starts with an empty cookie jar, which is why rotating proxies with new clients sidesteps it.
Session token requirements. The internal GraphQL API requires valid session tokens with CSRF headers. The free-tier REST API uses a simpler API key scheme.
Legal enforcement. Crunchbase actively pursues scrapers with cease-and-desist letters and has filed lawsuits against data resellers. Their business model depends on selling this data.
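Given the rate limits above (~30-40 requests/min on pages, ~60/min on autocomplete), a per-endpoint throttle keeps a crawler under the threshold while adding jitter to avoid the linear patterns Cloudflare flags. A minimal sketch — the limits themselves are observations, not documented guarantees:

```python
import time
import random

class Throttle:
    """Enforce a minimum randomized gap between requests per endpoint class."""

    def __init__(self, min_gap: float, jitter: float = 2.0):
        self.min_gap = min_gap   # floor, in seconds, between requests
        self.jitter = jitter     # extra random delay to break up linear timing
        self._last = 0.0

    def wait(self):
        gap = self.min_gap + random.uniform(0, self.jitter)
        elapsed = time.monotonic() - self._last
        if elapsed < gap:
            time.sleep(gap - elapsed)
        self._last = time.monotonic()

# ~60/min for autocomplete -> 1s floor; profile pages get a wider gap
autocomplete_throttle = Throttle(min_gap=1.0)
page_throttle = Throttle(min_gap=2.5, jitter=4.0)
```

Call `autocomplete_throttle.wait()` before each autocomplete request and `page_throttle.wait()` before each page fetch.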
Method 1: The Autocomplete Search Endpoint
Crunchbase's search autocomplete returns structured company data without any authentication. It is designed for the search bar but returns enough data for basic research:
import httpx
import json
import time
import random
from fake_useragent import UserAgent
ua = UserAgent()
def search_crunchbase_autocomplete(
query: str,
proxy: str = None,
limit: int = 25,
) -> list[dict]:
"""
Search Crunchbase via the autocomplete endpoint.
Returns company names, slugs, short descriptions, and entity types.
No authentication required.
"""
url = "https://www.crunchbase.com/v4/data/autocompletes"
params = {
"query": query,
"collection_ids": "organizations",
"limit": limit,
"source": "topSearch",
}
headers = {
"User-Agent": ua.random,
"Accept": "application/json",
"Referer": "https://www.crunchbase.com/",
"X-Requested-With": "XMLHttpRequest",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
}
client_kwargs = {"headers": headers, "follow_redirects": True, "timeout": 15}
if proxy:
        client_kwargs["proxy"] = proxy  # httpx >= 0.26; older versions used proxies={"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(url, params=params)
if resp.status_code != 200:
return []
data = resp.json()
results = []
for entity in data.get("entities", []):
props = entity.get("identifier", {})
results.append({
"name": props.get("value"),
"slug": props.get("permalink"),
"entity_type": props.get("entity_def_id"),
"uuid": props.get("uuid"),
"short_description": entity.get("short_description"),
"facet_ids": entity.get("facet_ids", []),
"image_url": entity.get("image_url"),
})
return results
def search_multiple_queries(queries: list[str], proxy: str = None) -> list[dict]:
"""Search multiple queries, deduplicate by slug."""
seen_slugs = set()
all_results = []
for query in queries:
results = search_crunchbase_autocomplete(query, proxy=proxy)
for r in results:
if r.get("slug") and r["slug"] not in seen_slugs:
seen_slugs.add(r["slug"])
all_results.append(r)
time.sleep(random.uniform(5, 12))
return all_results
Method 2: Free-Tier REST API
Crunchbase offers a free API tier (200 calls/day) that returns full organization data. Register at crunchbase.com/accelerator/application:
def fetch_organization(slug: str, api_key: str, proxy: str = None) -> dict:
"""
Fetch full organization data from Crunchbase's free REST API.
Requires a free API key from the Crunchbase Basic plan (200 calls/day).
field_ids documentation: https://data.crunchbase.com/docs/field-ids
"""
url = f"https://api.crunchbase.com/api/v4/entities/organizations/{slug}"
field_ids = [
"short_description",
"founded_on",
"num_employees_enum",
"funding_total",
"last_funding_type",
"last_funding_at",
"num_funding_rounds",
"categories",
"location_identifiers",
"founder_identifiers",
"website",
"linkedin",
"status",
"operating_status",
"ipo_status",
]
params = {
"user_key": api_key,
"field_ids": ",".join(field_ids),
}
headers = {
"User-Agent": ua.random,
"Accept": "application/json",
}
client_kwargs = {"headers": headers, "timeout": 20}
if proxy:
        client_kwargs["proxy"] = proxy  # httpx >= 0.26; older versions used proxies={"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(url, params=params)
if resp.status_code == 404:
return {"error": "not_found", "slug": slug}
if resp.status_code == 429:
return {"error": "rate_limited", "slug": slug}
if resp.status_code != 200:
return {"error": f"http_{resp.status_code}", "slug": slug}
data = resp.json()
props = data.get("properties", {})
org = {
"slug": slug,
"name": props.get("name") or slug,
"short_description": props.get("short_description"),
"founded_on": props.get("founded_on"),
"num_employees": props.get("num_employees_enum"),
"status": props.get("status"),
"operating_status": props.get("operating_status"),
"ipo_status": props.get("ipo_status"),
}
# Website
website = props.get("website")
if isinstance(website, dict):
org["website"] = website.get("value")
else:
org["website"] = website
# Funding
funding = props.get("funding_total", {})
if isinstance(funding, dict):
org["total_funding_usd"] = funding.get("value_usd")
org["total_funding_currency"] = funding.get("currency")
org["last_funding_type"] = props.get("last_funding_type")
org["last_funding_at"] = props.get("last_funding_at")
org["num_funding_rounds"] = props.get("num_funding_rounds", 0)
# Categories
cats = props.get("categories", [])
org["categories"] = [
c.get("value") for c in cats if isinstance(c, dict)
]
# Location
locs = props.get("location_identifiers", [])
if locs:
loc = locs[0]
org["location"] = loc.get("value") if isinstance(loc, dict) else str(loc)
# Founders
founders = props.get("founder_identifiers", [])
org["founders"] = [
f.get("value") for f in founders if isinstance(f, dict)
]
return org
Fetching Funding Rounds
Funding history is the most valuable Crunchbase data. The free API provides this through a sub-entity endpoint:
def fetch_funding_rounds(slug: str, api_key: str, proxy: str = None) -> list[dict]:
"""
Fetch all funding rounds for a company.
Returns rounds sorted by date descending.
"""
url = f"https://api.crunchbase.com/api/v4/entities/organizations/{slug}/funding_rounds"
field_ids = [
"announced_on",
"funding_type",
"money_raised",
"lead_investor_identifiers",
"investor_identifiers",
"num_investors",
"pre_money_valuation",
"post_money_valuation",
"is_equity",
]
params = {
"user_key": api_key,
"field_ids": ",".join(field_ids),
}
headers = {"User-Agent": ua.random, "Accept": "application/json"}
client_kwargs = {"headers": headers, "timeout": 20}
if proxy:
        client_kwargs["proxy"] = proxy  # httpx >= 0.26; older versions used proxies={"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(url, params=params)
if resp.status_code != 200:
return []
rounds = []
for entity in resp.json().get("entities", []):
props = entity.get("properties", {})
money = props.get("money_raised", {})
valuation_pre = props.get("pre_money_valuation", {})
valuation_post = props.get("post_money_valuation", {})
lead_investors = [
inv.get("value") for inv in props.get("lead_investor_identifiers", [])
if isinstance(inv, dict)
]
all_investors = [
inv.get("value") for inv in props.get("investor_identifiers", [])
if isinstance(inv, dict)
]
rounds.append({
"type": props.get("funding_type"),
"date": props.get("announced_on"),
"amount_usd": money.get("value_usd") if isinstance(money, dict) else None,
"currency": money.get("currency") if isinstance(money, dict) else None,
"lead_investors": lead_investors,
"all_investors": all_investors,
"num_investors": props.get("num_investors"),
"pre_money_valuation_usd": valuation_pre.get("value_usd") if isinstance(valuation_pre, dict) else None,
"post_money_valuation_usd": valuation_post.get("value_usd") if isinstance(valuation_post, dict) else None,
"is_equity": props.get("is_equity"),
})
return sorted(rounds, key=lambda r: r.get("date") or "", reverse=True)
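Since `fetch_funding_rounds` returns structured rounds with ISO dates, it is easy to derive cadence metrics such as the average gap between raises. A small helper sketch (the `funding_cadence` name and output shape are my own, not part of any Crunchbase API):

```python
from datetime import date

def funding_cadence(rounds: list[dict]) -> dict:
    """Summarize a company's funding history: round count, total raised,
    and average months between consecutive rounds."""
    dates = sorted(
        date.fromisoformat(r["date"]) for r in rounds if r.get("date")
    )
    # Gap between each consecutive pair of rounds, in approximate months
    gaps = [(b - a).days / 30.4 for a, b in zip(dates, dates[1:])]
    total = sum(r["amount_usd"] for r in rounds if r.get("amount_usd"))
    return {
        "num_rounds": len(rounds),
        "avg_months_between_rounds": round(sum(gaps) / len(gaps), 1) if gaps else None,
        "total_raised_usd": total,
    }
```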
Method 3: Scraping Profile Pages (Fallback)
When API calls are exhausted, scrape the web pages. Crunchbase embeds JSON-LD and a hydration state in the page source:
import re
def scrape_crunchbase_page(slug: str, proxy: str = None) -> dict:
"""
Scrape a Crunchbase organization page for embedded data.
Use this as fallback when API rate limit is hit.
"""
url = f"https://www.crunchbase.com/organization/{slug}"
headers = {
"User-Agent": ua.random,
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.google.com/",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "cross-site",
"Cache-Control": "no-cache",
"Upgrade-Insecure-Requests": "1",
}
client_kwargs = {
"headers": headers,
"follow_redirects": True,
"timeout": 20,
}
if proxy:
        client_kwargs["proxy"] = proxy  # httpx >= 0.26; older versions used proxies={"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(url)
if resp.status_code != 200:
return {"error": f"Status {resp.status_code}", "slug": slug}
company = {"slug": slug, "url": url}
# Extract JSON-LD structured data
ld_match = re.search(
r'<script type="application/ld\+json">(.*?)</script>',
resp.text, re.DOTALL,
)
if ld_match:
try:
ld = json.loads(ld_match.group(1))
company["name"] = ld.get("name")
company["description"] = ld.get("description")
company["founded"] = ld.get("foundingDate")
if "address" in ld:
addr = ld["address"]
company["city"] = addr.get("addressLocality")
company["country"] = addr.get("addressCountry")
founders_raw = ld.get("founder", [])
if founders_raw:
if isinstance(founders_raw, dict):
founders_raw = [founders_raw]
                company["founders"] = [f.get("name") for f in founders_raw if isinstance(f, dict)]
except (json.JSONDecodeError, TypeError):
pass
# Extract ng-state hydration data
state_match = re.search(
r'<script id="ng-state" type="application/json">(.*?)</script>',
resp.text, re.DOTALL,
)
if state_match:
try:
state = json.loads(state_match.group(1))
for key, value in state.items():
if isinstance(value, dict) and "properties" in value:
props = value["properties"]
company.setdefault("short_description", props.get("short_description"))
company.setdefault("num_employees", props.get("num_employees_enum"))
funding_total = props.get("funding_total", {})
if isinstance(funding_total, dict):
company.setdefault("total_funding_usd", funding_total.get("value_usd"))
company.setdefault("last_funding_type", props.get("last_funding_type"))
company.setdefault("status", props.get("status"))
break
except (json.JSONDecodeError, TypeError):
pass
return company
def detect_paywall(html: str) -> bool:
"""Check if Crunchbase returned a gated/paywall page."""
return any(
marker in html.lower()
for marker in [
"sign up to see",
"create a free account",
"upgrade to crunchbase pro",
"sign in to view",
]
)
SQLite Schema
import sqlite3
def init_crunchbase_db(db_path: str = "crunchbase.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS companies (
slug TEXT PRIMARY KEY,
name TEXT,
description TEXT,
founded TEXT,
location TEXT,
num_employees TEXT,
total_funding_usd REAL,
last_funding_type TEXT,
last_funding_at TEXT,
num_funding_rounds INTEGER,
status TEXT,
operating_status TEXT,
ipo_status TEXT,
website TEXT,
categories TEXT,
founders TEXT,
source TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS funding_rounds (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_slug TEXT NOT NULL,
funding_type TEXT,
announced_date TEXT,
amount_usd REAL,
currency TEXT,
lead_investors TEXT,
all_investors TEXT,
num_investors INTEGER,
pre_money_valuation_usd REAL,
post_money_valuation_usd REAL,
is_equity INTEGER,
FOREIGN KEY (company_slug) REFERENCES companies(slug)
);
CREATE INDEX IF NOT EXISTS idx_companies_funding
ON companies(total_funding_usd DESC);
CREATE INDEX IF NOT EXISTS idx_companies_last_round
ON companies(last_funding_type, last_funding_at);
CREATE INDEX IF NOT EXISTS idx_rounds_slug
ON funding_rounds(company_slug);
CREATE INDEX IF NOT EXISTS idx_rounds_date
ON funding_rounds(announced_date DESC);
""")
conn.commit()
return conn
def save_company(conn: sqlite3.Connection, company: dict, source: str = "api"):
conn.execute(
"""INSERT OR REPLACE INTO companies
(slug, name, description, founded, location, num_employees,
total_funding_usd, last_funding_type, last_funding_at,
num_funding_rounds, status, operating_status, ipo_status,
website, categories, founders, source)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
company.get("slug"),
company.get("name"),
company.get("short_description") or company.get("description"),
company.get("founded_on") or company.get("founded"),
company.get("location"),
company.get("num_employees"),
company.get("total_funding_usd"),
company.get("last_funding_type"),
company.get("last_funding_at"),
company.get("num_funding_rounds"),
company.get("status"),
company.get("operating_status"),
company.get("ipo_status"),
company.get("website"),
json.dumps(company.get("categories", [])),
json.dumps(company.get("founders", [])),
source,
),
)
conn.commit()
def save_funding_rounds(conn: sqlite3.Connection, company_slug: str, rounds: list[dict]):
conn.executemany(
"""INSERT INTO funding_rounds
(company_slug, funding_type, announced_date, amount_usd, currency,
lead_investors, all_investors, num_investors, pre_money_valuation_usd,
post_money_valuation_usd, is_equity)
VALUES (?,?,?,?,?,?,?,?,?,?,?)""",
[
(
company_slug,
r.get("type"),
r.get("date"),
r.get("amount_usd"),
r.get("currency"),
json.dumps(r.get("lead_investors", [])),
json.dumps(r.get("all_investors", [])),
r.get("num_investors"),
r.get("pre_money_valuation_usd"),
r.get("post_money_valuation_usd"),
int(r.get("is_equity") or False),
)
for r in rounds
],
)
conn.commit()
Error Handling and Retry Logic
def fetch_with_retry(
func,
*args,
max_retries: int = 3,
base_delay: float = 5.0,
**kwargs,
):
"""Execute a fetch function with exponential backoff retry."""
for attempt in range(max_retries):
result = func(*args, **kwargs)
if isinstance(result, dict) and result.get("error") == "rate_limited":
wait = base_delay * (2 ** attempt) + random.uniform(0, 5)
print(f" Rate limited (attempt {attempt + 1}), waiting {wait:.0f}s")
time.sleep(wait)
continue
return result
return {"error": "max_retries_exceeded"}
Proxy Configuration
Crunchbase sits behind Cloudflare's full bot management suite. Datacenter IPs get Turnstile CAPTCHAs before any content loads. Residential proxies are non-negotiable for web scraping.
ThorData's residential proxy network works for Crunchbase because the IPs pass Cloudflare's ASN reputation checks. The autocomplete endpoint is the most proxy-friendly and tolerates slightly higher request rates than profile pages. For the free API, proxies help when spreading requests across multiple API keys.
API_KEY = "your_free_api_key"
PROXY = "http://USER:[email protected]:9000"
# Step 1: Discover companies via autocomplete
results = search_crunchbase_autocomplete("fintech payments", proxy=PROXY)
print(f"Found {len(results)} companies")
# Step 2: Fetch full data via API
conn = init_crunchbase_db()
for result in results[:20]:
slug = result.get("slug")
if not slug:
continue
print(f" Fetching: {result['name']}")
org = fetch_with_retry(fetch_organization, slug, API_KEY, proxy=PROXY)
if "error" not in org:
save_company(conn, org, source="api")
# Fetch funding rounds for companies with funding
if org.get("num_funding_rounds", 0) > 0:
rounds = fetch_with_retry(fetch_funding_rounds, slug, API_KEY, proxy=PROXY)
if rounds:
save_funding_rounds(conn, slug, rounds)
print(f" {len(rounds)} funding rounds saved")
# Conservative delays — Crunchbase monitors patterns aggressively
time.sleep(random.uniform(10, 20))
conn.close()
Useful SQL Queries
conn = sqlite3.connect("crunchbase.db")
# Companies by total funding, most funded first
top_funded = conn.execute("""
SELECT name, location, total_funding_usd, last_funding_type, num_funding_rounds
FROM companies
WHERE total_funding_usd IS NOT NULL
ORDER BY total_funding_usd DESC
LIMIT 20
""").fetchall()
# Recent funding rounds above $10M
recent_large = conn.execute("""
SELECT c.name, r.funding_type, r.announced_date, r.amount_usd,
r.lead_investors
FROM funding_rounds r
JOIN companies c ON c.slug = r.company_slug
WHERE r.amount_usd >= 10000000
ORDER BY r.announced_date DESC
LIMIT 50
""").fetchall()
# Companies by category
ai_companies = conn.execute("""
SELECT name, total_funding_usd, num_employees, location
FROM companies
WHERE categories LIKE '%Artificial Intelligence%'
ORDER BY total_funding_usd DESC NULLS LAST
LIMIT 30
""").fetchall()
Complete Pipeline
def run_crunchbase_pipeline(
search_queries: list[str],
api_key: str,
db_path: str = "crunchbase.db",
proxy: str = None,
):
"""
Full pipeline:
1. Search for companies using multiple queries (autocomplete)
2. Fetch full organization data via REST API
3. Fetch funding rounds for companies with funding history
4. Store everything in SQLite
"""
conn = init_crunchbase_db(db_path)
# Phase 1: Discovery
print("Discovering companies...")
all_results = search_multiple_queries(search_queries, proxy=proxy)
print(f"Found {len(all_results)} unique companies")
# Phase 2: Enrich via API
api_calls_used = 0
for result in all_results:
slug = result.get("slug")
if not slug:
continue
        # Skip if already in the DB (no freshness check -- delete the row to force a re-fetch)
existing = conn.execute(
"SELECT scraped_at FROM companies WHERE slug = ?", (slug,)
).fetchone()
if existing:
print(f" Skip (cached): {result['name']}")
continue
# Save basic data from autocomplete
save_company(conn, {
"slug": slug,
"name": result.get("name"),
"short_description": result.get("short_description"),
}, source="autocomplete")
if api_calls_used >= 150: # Reserve buffer before hitting 200/day limit
print("API call budget nearly exhausted, stopping enrichment")
break
# Enrich with full API data
print(f" API fetch: {result['name']}")
org = fetch_with_retry(fetch_organization, slug, api_key, proxy=proxy)
api_calls_used += 1
if "error" not in org:
save_company(conn, org, source="api")
if org.get("num_funding_rounds", 0) > 0 and api_calls_used < 150:
rounds = fetch_with_retry(fetch_funding_rounds, slug, api_key, proxy=proxy)
api_calls_used += 1
if rounds:
save_funding_rounds(conn, slug, rounds)
time.sleep(random.uniform(12, 25))
conn.close()
print(f"Pipeline complete. API calls used: {api_calls_used}/200")
# Run it
run_crunchbase_pipeline(
search_queries=[
"artificial intelligence startup",
"fintech payments",
"climate tech carbon",
"biotech drug discovery",
],
api_key="your_api_key_here",
proxy=PROXY,
db_path="crunchbase.db",
)
Legal Considerations
Crunchbase explicitly prohibits scraping in their Terms of Service and pursues violators. Their business model depends on selling this data, giving them a strong legal position for enforcement. The appropriate access levels are:
- Free REST API (200 calls/day): Sanctioned for personal research
- Pro plan ($49/month): Appropriate for commercial use and higher volumes
- Enterprise licensing: For building products that include Crunchbase data
Use autocomplete for discovery, the free API for enrichment, and page scraping as a last resort for data you cannot get any other way. Never build competing data products using scraped Crunchbase data — that is precisely what their enforcement targets.
Sector Intelligence Reports
Use the collected data to generate automated sector reports:
import sqlite3
import json
from datetime import datetime, timedelta
def generate_sector_report(
sector_keyword: str,
db_path: str = "crunchbase.db",
months_back: int = 12,
) -> dict:
"""
Generate a funding intelligence report for a sector.
Returns aggregated metrics, top companies, and recent rounds.
"""
conn = sqlite3.connect(db_path)
cutoff_date = (datetime.now() - timedelta(days=months_back * 30)).strftime("%Y-%m-%d")
# Top funded companies in sector
top_companies = conn.execute("""
SELECT name, total_funding_usd, num_funding_rounds,
last_funding_type, location, num_employees
FROM companies
WHERE categories LIKE ?
AND total_funding_usd IS NOT NULL
ORDER BY total_funding_usd DESC
LIMIT 20
""", (f'%{sector_keyword}%',)).fetchall()
# Recent rounds in sector
recent_rounds = conn.execute("""
SELECT c.name, r.funding_type, r.announced_date, r.amount_usd,
r.lead_investors
FROM funding_rounds r
JOIN companies c ON c.slug = r.company_slug
WHERE c.categories LIKE ?
AND r.announced_date >= ?
AND r.amount_usd IS NOT NULL
ORDER BY r.announced_date DESC
LIMIT 50
""", (f'%{sector_keyword}%', cutoff_date)).fetchall()
# Funding by stage distribution
stage_dist = conn.execute("""
SELECT last_funding_type, COUNT(*) as count,
AVG(total_funding_usd) as avg_total_funding
FROM companies
WHERE categories LIKE ?
AND last_funding_type IS NOT NULL
GROUP BY last_funding_type
ORDER BY count DESC
""", (f'%{sector_keyword}%',)).fetchall()
conn.close()
return {
"sector": sector_keyword,
"generated_at": datetime.now().isoformat(),
"top_companies": [
{
"name": row[0],
"total_funding_m": round(row[1] / 1e6, 1) if row[1] else None,
"rounds": row[2],
"last_stage": row[3],
"location": row[4],
"employees": row[5],
}
for row in top_companies
],
"recent_rounds": [
{
"company": row[0],
"type": row[1],
"date": row[2],
"amount_m": round(row[3] / 1e6, 1) if row[3] else None,
"lead_investors": json.loads(row[4] or "[]"),
}
for row in recent_rounds
],
"stage_distribution": [
{"stage": row[0], "count": row[1], "avg_total_funding_m": round((row[2] or 0) / 1e6, 1)}
for row in stage_dist
],
}
# Generate report for AI sector
report = generate_sector_report("Artificial Intelligence")
print(f"\n{report['sector']} Sector Report")
print(f"Generated: {report['generated_at']}\n")
print("Top 5 funded companies:")
for c in report['top_companies'][:5]:
print(f" {c['name']:<35} ${c['total_funding_m']}M {c['last_stage']}")
Finding Active Investors by Sector
Cross-reference funding rounds with investor names to find the most active VCs in a space:
import json
import sqlite3
from collections import Counter
def find_active_investors(
sector_keyword: str,
min_investments: int = 3,
db_path: str = "crunchbase.db",
) -> list:
"""Find the most active investors in a sector."""
conn = sqlite3.connect(db_path)
rows = conn.execute("""
SELECT r.lead_investors, r.all_investors,
r.funding_type, r.announced_date
FROM funding_rounds r
JOIN companies c ON c.slug = r.company_slug
WHERE c.categories LIKE ?
AND r.amount_usd IS NOT NULL
""", (f'%{sector_keyword}%',)).fetchall()
conn.close()
investor_counts = Counter()
investor_stages = {}
for row in rows:
lead = json.loads(row[0] or "[]")
all_inv = json.loads(row[1] or "[]")
stage = row[2]
for inv in lead:
investor_counts[inv] += 2 # Lead counts double
if inv not in investor_stages:
investor_stages[inv] = []
investor_stages[inv].append(stage)
for inv in all_inv:
investor_counts[inv] += 1
results = [
{
"name": name,
"investment_score": count,
"preferred_stages": Counter(investor_stages.get(name, [])).most_common(3),
}
for name, count in investor_counts.most_common(30)
if investor_counts[name] >= min_investments
]
return results
# Find most active AI investors
investors = find_active_investors("Artificial Intelligence", min_investments=2)
print("Most active AI investors:")
for inv in investors[:10]:
stages = ", ".join(f"{s[0]}({s[1]})" for s in inv["preferred_stages"])
print(f" {inv['name']:<30} score={inv['investment_score']} stages: {stages}")
Startup Discovery Pipeline
Combine autocomplete search with trend detection to discover emerging startups:
import time
import random
from fake_useragent import UserAgent
ua = UserAgent()
EMERGING_KEYWORDS = [
"AI agents 2026",
"quantum computing startup",
"climate fintech",
"synthetic biology",
"space tech startup",
"web3 infrastructure",
"robotics automation",
"longevity biotech",
]
def discover_emerging_startups(
keywords: list,
proxy: str = None,
db_path: str = "crunchbase.db",
) -> list:
"""
Search Crunchbase for companies matching emerging tech keywords.
Filters to recently founded or recently funded companies.
"""
conn = init_crunchbase_db(db_path)
discovered = []
for keyword in keywords:
print(f"Searching: {keyword}")
results = search_crunchbase_autocomplete(keyword, proxy=proxy)
for r in results:
slug = r.get("slug")
if not slug:
continue
# Quick save from autocomplete
save_company(conn, {
"slug": slug,
"name": r.get("name"),
"short_description": r.get("short_description"),
}, source="autocomplete_emerging")
discovered.append(r)
time.sleep(random.uniform(8, 15))
conn.close()
return discovered
# Run discovery
new_companies = discover_emerging_startups(EMERGING_KEYWORDS)
print(f"Discovered {len(new_companies)} companies in emerging sectors")
Handling Paywalls and Content Gating
Crunchbase increasingly gates content. Here is how to detect and handle it:
import re
def detect_crunchbase_paywall(html: str) -> str:
"""Detect what type of content restriction is in place."""
if "upgrade to crunchbase pro" in html.lower():
return "pro_paywall"
if "sign up to see" in html.lower():
return "signup_required"
if "create a free account" in html.lower():
return "free_account_required"
if "log in" in html.lower() and "crunchbase" in html.lower():
return "login_required"
if not re.search(r'"name"\s*:\s*"[^"]+"', html):
return "empty_response"
return "ok"
def scrape_with_paywall_fallback(
slug: str,
api_key: str,
proxy: str = None,
) -> dict:
"""
Try API first, fall back to page scraping, handle paywalls gracefully.
"""
# Try official API first
org = fetch_with_retry(fetch_organization, slug, api_key, proxy=proxy)
if "error" not in org:
return org
# API failed or rate limited -- try page scraping
print(f" API failed for {slug}, trying page scrape")
scraped = scrape_crunchbase_page(slug, proxy=proxy)
if "error" in scraped:
return scraped
    # The paywall detector expects raw HTML, but the scraper returns a dict,
    # so check the extracted fields instead -- gated pages lack JSON-LD
    if not scraped.get("name"):
        return {"slug": slug, "error": "paywall_or_empty"}
    return scraped
Complete Reference: Field Availability by Method
| Field | Free API | Autocomplete | Page Scrape |
|---|---|---|---|
| Company name | Yes | Yes | Yes |
| Short description | Yes | Yes | Yes |
| Founded date | Yes | No | Sometimes |
| Headquarters | Yes | No | Sometimes |
| Employee count | Yes | No | Sometimes |
| Total funding | Yes | No | Sometimes |
| Last funding type | Yes | No | Sometimes |
| Number of rounds | Yes | No | No |
| Categories | Yes | Yes (tags) | Sometimes |
| Website | Yes | No | No |
| Founders | Yes | No | Sometimes |
| LinkedIn | Yes | No | No |
| IPO status | Yes | No | No |
| Funding rounds detail | Yes (separate endpoint) | No | No |
| Investor names | Yes (separate endpoint) | No | No |
The free API at 200 calls/day is by far the most data-rich approach. Autocomplete is useful for bulk discovery. Page scraping is a last resort for data not in the API.
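The table implies a simple dispatch rule: spend an API call only when a required field is API-only, and use the cheaper method otherwise. A rough sketch encoding a subset of the table above (field names and the `choose_method` helper are illustrative):

```python
# Fields available per method, per the table above (subset)
AUTOCOMPLETE_FIELDS = {"name", "short_description", "categories"}
API_ONLY_FIELDS = {"website", "linkedin", "ipo_status", "num_funding_rounds",
                   "funding_rounds", "investors"}

def choose_method(required_fields: set[str], api_budget_left: int) -> str:
    """Pick the cheapest access method that covers the required fields."""
    if required_fields <= AUTOCOMPLETE_FIELDS:
        return "autocomplete"
    if required_fields & API_ONLY_FIELDS:
        return "api" if api_budget_left > 0 else "unavailable"
    # Remaining fields are only 'sometimes' on pages -- prefer the API,
    # fall back to page scraping when the daily budget is gone
    return "api" if api_budget_left > 0 else "page_scrape"
```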
Key Takeaways
- Crunchbase's autocomplete endpoint at https://www.crunchbase.com/v4/data/autocompletes returns basic company data without authentication -- useful for bulk discovery
- The free REST API (200 calls/day) is the best approach for enriched data including funding rounds and investor details
- Cloudflare with full bot management protects all Crunchbase pages -- residential proxies are required for web scraping
- ThorData's residential proxy network passes Cloudflare's ASN checks; use it for both autocomplete requests and page scraping
- Store data in SQLite with separate tables for companies and funding rounds, linked by slug
- Crunchbase aggressively enforces their ToS against data resellers -- use the data for internal research, not for building competing databases