Scraping Crunchbase Funding Rounds and Investor Data with Python (2026)
Crunchbase is the standard reference for startup funding data, but it has moved aggressively behind a paywall. The Pro plan runs $49/month, and the Enterprise API starts at $99/month for basic access. This pricing shuts out individual researchers, developers building side projects, and companies that need only occasional funding data without a full subscription.
The good news: a meaningful amount of company data is still crawlable from their public-facing pages — funding rounds with amounts, dates, and investor names; company descriptions and categories; acquisition history; key team members and titles. This guide covers the full technical stack for extracting that data in 2026, with working code, proxy configuration, and storage patterns for building a startup funding database.
What Crunchbase Exposes Without Login
Before building a scraper, understand the boundaries. Without authentication, Crunchbase shows:
Organization pages (crunchbase.com/organization/{slug}):
- Company name, description, and category tags
- Total funding amount and last funding round type
- Founded date, HQ location
- Number of funding rounds
- Summary of most recent 2-3 funding rounds (amount, type, lead investor)
- Key team members and their titles (first 3-5 visible)
- Acquisition history (acquiree name, approximate date)
What requires login:
- Full investor lists per round
- Contact information
- Detailed employee counts over time
- Data export
- Company financials
The summary cards visible without login are still highly useful for building a startup watchlist, tracking specific companies, or researching the competitive landscape.
Architecture: Why You Need Playwright
Crunchbase is a React SPA. The initial HTML response is an empty shell — all data is fetched via their internal API after JavaScript executes. Naive HTTP requests get either an empty page or a Cloudflare challenge.
Playwright running real Chromium is the reliable approach. Their internal API uses GraphQL-like calls with changing schema — intercepting those responses from a real browser session is more stable than trying to reverse-engineer and directly call their API endpoints.
pip install playwright
playwright install chromium
import asyncio
import json
import random
import time
from typing import Optional
from playwright.async_api import async_playwright, BrowserContext, Page
# Typical Chrome User-Agent
UA = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/127.0.0.0 Safari/537.36"
)
Stealth Browser Context
Crunchbase uses Cloudflare and its own bot detection layer. Configure the browser context to minimize automation signals:
async def create_crunchbase_context(
playwright,
proxy_config: dict = None,
) -> tuple:
"""
Create a Playwright browser context configured for Crunchbase scraping.
Returns (browser, context).
"""
launch_opts = {"headless": True}
if proxy_config:
launch_opts["proxy"] = proxy_config
browser = await playwright.chromium.launch(**launch_opts)
context_opts = {
"user_agent": UA,
"viewport": {"width": 1366, "height": 768},
"locale": "en-US",
"timezone_id": "America/New_York",
"color_scheme": "light",
"device_scale_factor": 1,
"has_touch": False,
"is_mobile": False,
}
context = await browser.new_context(**context_opts)
# Override automation tells
await context.add_init_script("""
// Remove webdriver
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
// Realistic plugins
Object.defineProperty(navigator, 'plugins', {
get: () => [{name: 'Chrome PDF Viewer'}, {name: 'Chromium PDF Viewer'}]
});
// Languages
Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
// Chrome object
window.chrome = window.chrome || {runtime: {}};
// Screen
Object.defineProperty(screen, 'width', {get: () => 1366});
Object.defineProperty(screen, 'height', {get: () => 768});
""")
# Block heavy resources we don't need
await context.route(
"**/*.{png,jpg,jpeg,gif,webp,svg,woff,woff2,ttf,eot,otf}",
lambda route: route.abort()
)
# Allow data: URLs (sometimes used for inline assets)
return browser, context
Scraping an Organization Page
async def scrape_organization(
slug: str,
page: Page,
capture_api_responses: bool = True,
) -> dict:
"""
Scrape a Crunchbase organization page.
slug: the organization's URL slug (e.g., 'stripe', 'openai', 'anthropic')
"""
url = f"https://www.crunchbase.com/organization/{slug}"
# Capture internal API responses
api_data = []
if capture_api_responses:
async def handle_response(response):
if (
"api.crunchbase.com" in response.url or
"crunchbase.com/v4/" in response.url or
"graphql" in response.url.lower()
):
try:
data = await response.json()
api_data.append({"url": response.url, "data": data})
except Exception:
pass
page.on("response", handle_response)
# Navigate — use longer timeout, Crunchbase can be slow
try:
await page.goto(url, wait_until="networkidle", timeout=35000)
except Exception:
# Retry with domcontentloaded which is less strict
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
await asyncio.sleep(4)
# Wait for the main content card to appear
try:
await page.wait_for_selector(
"profile-section, .page-centered-card, h1.profile-name",
timeout=15000
)
except Exception:
pass
await asyncio.sleep(random.uniform(2, 3.5))
org = {
"slug": slug,
"url": url,
"funding_rounds": [],
"investors": [],
"acquisitions": [],
"team": [],
"api_data_captured": len(api_data),
}
# Extract company name
name = await _safe_text(page, "h1.profile-name, h1, [class*='ProfileName']")
org["name"] = name or slug
# Extract fields card (total funding, last round type, founded date, etc.)
await _extract_fields_card(page, org)
# Extract funding rounds section
org["funding_rounds"] = await _extract_funding_rounds(page)
# Extract key people/team
org["team"] = await _extract_team_members(page)
# Extract acquisitions
org["acquisitions"] = await _extract_acquisitions(page)
# Extract categories/tags
org["categories"] = await _extract_categories(page)
# Try to parse API responses for richer data
if api_data:
_enrich_from_api_data(org, api_data)
    # Detach the response listener so repeated scrapes on the same page don't stack handlers
    if capture_api_responses:
        page.remove_listener("response", handle_response)
    return org
async def _safe_text(page: Page, selector: str) -> str:
"""Try to get text from a selector, return empty string on failure."""
try:
el = await page.query_selector(selector)
if el:
return (await el.text_content() or "").strip()
except Exception:
pass
return ""
async def _extract_fields_card(page: Page, org: dict):
"""Extract key-value fields from the company card."""
field_selectors = [
"fields-card .field-row",
"[class*='KeyValuePair']",
".profile-card-item",
]
for selector in field_selectors:
fields = await page.query_selector_all(selector)
if not fields:
continue
for field in fields:
try:
label_el = await field.query_selector(
".field-label, [class*='label'], [class*='key']"
)
value_el = await field.query_selector(
".field-value, [class*='value']"
)
if not (label_el and value_el):
# Try splitting the field text
full_text = (await field.text_content() or "").strip()
if ":" in full_text:
parts = full_text.split(":", 1)
label, value = parts[0].strip(), parts[1].strip()
else:
continue
else:
label = (await label_el.text_content() or "").strip()
value = (await value_el.text_content() or "").strip()
if not label or not value:
continue
# Map to structured fields
label_lower = label.lower()
if "total funding" in label_lower:
org["total_funding"] = value
elif "last funding" in label_lower and "type" in label_lower:
org["last_funding_type"] = value
elif "last funding" in label_lower and "date" in label_lower:
org["last_funding_date"] = value
elif "founded" in label_lower:
org["founded_date"] = value
elif "headquarter" in label_lower or "location" in label_lower:
org["headquarters"] = value
elif "employee" in label_lower:
org["employee_range"] = value
elif "website" in label_lower:
org["website"] = value
elif "stock symbol" in label_lower or "ticker" in label_lower:
org["ticker"] = value
elif "ipo" in label_lower:
org["ipo_date"] = value
elif "funding rounds" in label_lower:
org["num_funding_rounds"] = value
except Exception:
continue
if org.get("total_funding") or org.get("founded_date"):
break # Got something, stop trying other selectors
async def _extract_funding_rounds(page: Page) -> list[dict]:
"""Extract funding round summary data visible on the page."""
rounds = []
# Try multiple section selectors
section_selectors = [
"funding-rounds-list",
"[section-id='funding_rounds']",
"[class*='funding-rounds']",
        "section-card:has-text('Funding Rounds')",
]
for sel in section_selectors:
try:
section = await page.query_selector(sel)
if not section:
continue
# Look for round rows/items
row_selectors = [
"a[href*='/funding_round/']",
".cb-card",
"[class*='FundingRound']",
]
for row_sel in row_selectors:
rows = await section.query_selector_all(row_sel)
if rows:
for row in rows[:10]:
round_data = await _parse_round_row(row)
if round_data:
rounds.append(round_data)
break
if rounds:
break
except Exception:
continue
return rounds
async def _parse_round_row(el) -> Optional[dict]:
"""Parse a single funding round row element."""
try:
# Get all text content
full_text = (await el.text_content() or "").strip()
if not full_text or len(full_text) < 5:
return None
# The round link URL usually contains the round ID
link = await el.query_selector("a[href*='/funding_round/']")
round_url = ""
if link:
href = await link.get_attribute("href") or ""
round_url = f"https://www.crunchbase.com{href}" if href.startswith("/") else href
return {
"raw_text": full_text,
"round_url": round_url,
}
except Exception:
return None
async def _extract_team_members(page: Page) -> list[dict]:
"""Extract visible team members from the page."""
team = []
person_selectors = [
"a[href*='/person/']",
"[class*='PersonProfile']",
]
seen_hrefs = set()
for sel in person_selectors:
elements = await page.query_selector_all(sel)
for el in elements[:20]:
try:
href = await el.get_attribute("href") or ""
if not href or href in seen_hrefs:
continue
seen_hrefs.add(href)
name_el = await el.query_selector(
"[class*='name'], [class*='Name'], strong, b"
)
title_el = await el.query_selector(
"[class*='title'], [class*='role'], [class*='Title']"
)
                name = (((await name_el.text_content()) if name_el else await el.text_content()) or "").strip()
                title = ((await title_el.text_content()) or "").strip() if title_el else ""
if name and "/person/" in href:
team.append({
"name": name[:100],
"title": title[:100],
"profile_url": f"https://www.crunchbase.com{href}" if href.startswith("/") else href,
})
except Exception:
continue
# Deduplicate by profile URL
seen = set()
unique = []
for p in team:
key = p["profile_url"]
if key not in seen:
seen.add(key)
unique.append(p)
return unique[:10]
async def _extract_acquisitions(page: Page) -> list[dict]:
"""Extract acquisition history."""
acquisitions = []
acq_selectors = [
"acquisitions-list a[href*='/organization/']",
"[section-id='acquisitions'] a",
]
for sel in acq_selectors:
links = await page.query_selector_all(sel)
if links:
for link in links[:20]:
try:
href = await link.get_attribute("href") or ""
text = (await link.text_content() or "").strip()
if href and text and "/organization/" in href:
acquisitions.append({
"acquiree": text,
"slug": href.strip("/").split("/")[-1],
"url": f"https://www.crunchbase.com{href}",
})
except Exception:
continue
break
return acquisitions[:10]
async def _extract_categories(page: Page) -> list[str]:
"""Extract category/industry tags."""
return await page.evaluate("""
() => {
const links = document.querySelectorAll('a[href*="/hub/companies/"]');
return [...new Set(Array.from(links).map(l => l.textContent.trim()))].filter(Boolean);
}
""")
def _enrich_from_api_data(org: dict, api_data: list[dict]):
"""
Try to extract richer data from intercepted API responses.
API schema changes, so this is best-effort.
"""
for item in api_data:
data = item.get("data", {})
# Navigate common response shapes
entity = (
data.get("data", {})
.get("organization", data.get("organization", {}))
)
if not entity:
continue
# Funding rounds from API
rounds_data = entity.get("fundingRounds", {}).get("edges", [])
if rounds_data:
for edge in rounds_data[:10]:
node = edge.get("node", {})
if node:
org["funding_rounds"].append({
"round_type": node.get("fundingType", ""),
"amount": node.get("money", {}).get("value"),
"currency": node.get("money", {}).get("currency"),
                        "announced_date": node.get("announcedOn", ""),  # keyed to match save_organization
                        "closed_date": node.get("closedOn", ""),
"lead_investors": [
inv.get("name", "")
for inv in node.get("leadInvestors", [])[:3]
],
"source": "api",
})
break
Scraping Individual Funding Round Pages
Each funding round has its own page with more detail:
async def scrape_funding_round(
round_url: str,
page: Page,
) -> dict:
"""
Scrape an individual funding round page.
round_url: full Crunchbase URL like /funding_round/{uuid}
"""
if not round_url.startswith("http"):
round_url = f"https://www.crunchbase.com{round_url}"
try:
await page.goto(round_url, wait_until="networkidle", timeout=30000)
await asyncio.sleep(random.uniform(2, 3.5))
except Exception:
return {"url": round_url, "error": "navigation_failed"}
round_data = {"url": round_url}
# Extract field card data (amount, date, round type, etc.)
fields = await page.query_selector_all("fields-card .field-row, [class*='KeyValuePair']")
for field in fields:
try:
label_el = await field.query_selector(".field-label, [class*='label']")
value_el = await field.query_selector(".field-value, [class*='value']")
if label_el and value_el:
label = (await label_el.text_content() or "").strip().lower()
value = (await value_el.text_content() or "").strip()
if "money raised" in label or "funding amount" in label:
round_data["amount"] = value
elif "announced" in label:
round_data["announced_date"] = value
elif "closed" in label and "date" in label:
round_data["closed_date"] = value
elif "funding type" in label or "series" in label:
round_data["round_type"] = value
elif "pre-money" in label:
round_data["pre_money_valuation"] = value
elif "post-money" in label:
round_data["post_money_valuation"] = value
except Exception:
continue
# Extract investor list (both lead and participating)
investor_links = await page.query_selector_all(
"a[href*='/organization/'], a[href*='/person/']"
)
investors = []
seen_hrefs = set()
for link in investor_links[:30]:
try:
href = await link.get_attribute("href") or ""
if href in seen_hrefs:
continue
seen_hrefs.add(href)
text = (await link.text_content() or "").strip()
entity_type = "organization" if "/organization/" in href else "person"
if text and (entity_type == "organization" or "/person/" in href):
investors.append({
"name": text[:100],
"type": entity_type,
"slug": href.strip("/").split("/")[-1],
})
except Exception:
continue
round_data["investors"] = investors[:20]
# Extract company being funded
company_link = await page.query_selector(
"a.profile-link[href*='/organization/']"
)
if company_link:
href = await company_link.get_attribute("href") or ""
text = (await company_link.text_content() or "").strip()
round_data["company"] = {"name": text, "slug": href.strip("/").split("/")[-1]}
return round_data
Batch Processing Multiple Companies
async def scrape_company_batch(
slugs: list[str],
proxy_config: dict = None,
min_delay: float = 8.0,
max_delay: float = 18.0,
context_rotate_every: int = 15,
) -> list[dict]:
"""
Scrape multiple Crunchbase organizations.
Rotates browser context every N companies to reset session state.
Uses randomized delays to avoid detection.
"""
results = []
async with async_playwright() as p:
browser, context = await create_crunchbase_context(p, proxy_config)
page = await context.new_page()
requests_in_context = 0
for i, slug in enumerate(slugs):
            # Context rotation: close the old browser so we don't leak processes
            if requests_in_context > 0 and requests_in_context % context_rotate_every == 0:
                print(f"  Rotating context after {requests_in_context} requests...")
                await browser.close()
                await asyncio.sleep(random.uniform(5, 10))
                browser, context = await create_crunchbase_context(p, proxy_config)
                page = await context.new_page()
                requests_in_context = 0  # fresh context re-triggers the homepage warm-up below
print(f"[{i+1}/{len(slugs)}] Scraping: {slug}")
try:
# Simulate natural browsing: start from homepage occasionally
if requests_in_context == 0:
await page.goto("https://www.crunchbase.com", wait_until="domcontentloaded")
await asyncio.sleep(random.uniform(2, 4))
org = await scrape_organization(slug, page)
results.append(org)
print(f" {org.get('name', slug)}: {org.get('total_funding', 'N/A')} total funding")
except Exception as e:
print(f" Error: {e}")
results.append({"slug": slug, "error": str(e)})
requests_in_context += 1
# Randomized delay — critical for avoiding detection
delay = random.uniform(min_delay, max_delay)
print(f" Waiting {delay:.1f}s...")
await asyncio.sleep(delay)
await browser.close()
return results
Database Schema
import sqlite3
from datetime import datetime
import json
def init_crunchbase_db(db_path: str = "crunchbase.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS organizations (
slug TEXT PRIMARY KEY,
name TEXT,
url TEXT,
description TEXT,
categories TEXT,
total_funding TEXT,
last_funding_type TEXT,
last_funding_date TEXT,
founded_date TEXT,
headquarters TEXT,
employee_range TEXT,
website TEXT,
ticker TEXT,
ipo_date TEXT,
num_funding_rounds TEXT,
scraped_at TEXT,
scrape_status TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS funding_rounds (
id INTEGER PRIMARY KEY AUTOINCREMENT,
org_slug TEXT NOT NULL,
round_type TEXT,
amount TEXT,
currency TEXT,
announced_date TEXT,
closed_date TEXT,
pre_money_valuation TEXT,
post_money_valuation TEXT,
lead_investors TEXT,
all_investors TEXT,
round_url TEXT,
source TEXT,
scraped_at TEXT,
FOREIGN KEY (org_slug) REFERENCES organizations(slug)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS team_members (
id INTEGER PRIMARY KEY AUTOINCREMENT,
org_slug TEXT NOT NULL,
name TEXT,
title TEXT,
profile_url TEXT,
UNIQUE(org_slug, profile_url)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS acquisitions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
acquirer_slug TEXT NOT NULL,
acquiree_name TEXT,
acquiree_slug TEXT,
acquiree_url TEXT,
scraped_at TEXT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_org_slug ON organizations(slug)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_round_org ON funding_rounds(org_slug)")
conn.commit()
return conn
def save_organization(conn: sqlite3.Connection, org: dict):
"""Save a scraped organization to the database."""
now = datetime.utcnow().isoformat()
conn.execute("""
INSERT OR REPLACE INTO organizations
(slug, name, url, categories, total_funding, last_funding_type,
last_funding_date, founded_date, headquarters, employee_range,
website, ticker, ipo_date, num_funding_rounds, scraped_at, scrape_status)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
org.get("slug"), org.get("name"), org.get("url"),
json.dumps(org.get("categories", [])),
org.get("total_funding"), org.get("last_funding_type"),
org.get("last_funding_date"), org.get("founded_date"),
org.get("headquarters"), org.get("employee_range"),
org.get("website"), org.get("ticker"),
org.get("ipo_date"), org.get("num_funding_rounds"),
now, "success" if not org.get("error") else "error"
))
# Save funding rounds
for round_data in org.get("funding_rounds", []):
conn.execute("""
INSERT INTO funding_rounds
(org_slug, round_type, amount, currency, announced_date, closed_date,
pre_money_valuation, post_money_valuation, lead_investors, all_investors,
round_url, source, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
org["slug"],
round_data.get("round_type"), round_data.get("amount"),
round_data.get("currency"), round_data.get("announced_date"),
round_data.get("closed_date"),
round_data.get("pre_money_valuation"), round_data.get("post_money_valuation"),
json.dumps(round_data.get("lead_investors", [])),
json.dumps([inv.get("name") for inv in round_data.get("investors", [])]),
round_data.get("round_url"), round_data.get("source", "page"),
now
))
# Save team members
for person in org.get("team", []):
try:
conn.execute("""
INSERT OR IGNORE INTO team_members (org_slug, name, title, profile_url)
VALUES (?, ?, ?, ?)
""", (org["slug"], person["name"], person["title"], person["profile_url"]))
except Exception:
pass
# Save acquisitions
for acq in org.get("acquisitions", []):
conn.execute("""
INSERT INTO acquisitions (acquirer_slug, acquiree_name, acquiree_slug, acquiree_url, scraped_at)
VALUES (?, ?, ?, ?, ?)
""", (
org["slug"], acq.get("acquiree"), acq.get("slug"),
acq.get("url"), now
))
conn.commit()
Anti-Bot Strategy Details
Crunchbase in 2026 uses a multi-layer detection stack:
Cloudflare handles the perimeter. They check:
- IP reputation (datacenter IPs fail immediately)
- TLS fingerprint (matches against known browser vs. library patterns)
- HTTP/2 fingerprint
- JavaScript challenge capability

Session behavioral tracking monitors:
- Time between requests
- Navigation pattern (jumping directly to deep URLs vs. browsing naturally)
- Cookie presence and age
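A cheap first line of defense against wasting your request budget is detecting when a navigation landed on a challenge page instead of real content. The marker strings below are assumptions based on common Cloudflare interstitials, not an official list, and will need occasional updating:

```python
# Marker strings commonly seen on Cloudflare interstitials (assumptions;
# update them as the challenge pages evolve).
CHALLENGE_MARKERS = (
    "just a moment",
    "checking your browser",
    "cf-chl",
    "challenge-platform",
)


def looks_like_challenge(html: str, title: str = "") -> bool:
    """Heuristic: does this page resemble a bot challenge rather than content?"""
    # Challenge pages are small, so only the head of the document matters
    haystack = (html[:5000] + " " + title).lower()
    return any(marker in haystack for marker in CHALLENGE_MARKERS)
```

Call it right after navigation (`looks_like_challenge(await page.content(), await page.title())`) and, on a hit, back off for several minutes and rotate the proxy instead of retrying immediately.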
For IP rotation: ThorData's residential proxy network provides IPs that pass Cloudflare's reputation check. City-level targeting lets you match proxy location to a realistic browsing pattern:
# ThorData proxy configuration
THORDATA_CONFIGS = {
"us_ny": {
"server": "http://proxy.thordata.com:9000",
"username": "YOUR_USER-country-us-city-newyork",
"password": "YOUR_PASS",
},
"us_sf": {
"server": "http://proxy.thordata.com:9000",
"username": "YOUR_USER-country-us-city-sanfrancisco",
"password": "YOUR_PASS",
},
"us_rotation": {
"server": "http://proxy.thordata.com:9000",
"username": "YOUR_USER-country-us",
"password": "YOUR_PASS",
},
}
Sustainable scraping cadence for Crunchbase:
- Minimum 8-15 seconds between page loads (not requests — full page loads)
- Maximum 15-20 companies per context before rotating
- Rotate proxy on each context rotation
- Start from the homepage on the first request of each new context
- Total daily volume: under 150-200 organization pages per IP pool
Exceeding these thresholds doesn't just get your current session blocked — Crunchbase can add your proxy IPs to a longer-term denylist. Conservative rates are worth it.
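Those limits are easy to drift past in a long-running job, so it helps to encode them in one place. A minimal sketch (the class name and defaults are mine, mirroring the thresholds above):

```python
import random


class ScrapeBudget:
    """Track pacing rules: randomized delays, rotation cadence, daily cap."""

    def __init__(self, min_delay: float = 8.0, max_delay: float = 15.0,
                 rotate_every: int = 15, daily_cap: int = 150):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.rotate_every = rotate_every
        self.daily_cap = daily_cap
        self.pages = 0

    def next_delay(self) -> float:
        """Randomized sleep duration before the next page load."""
        return random.uniform(self.min_delay, self.max_delay)

    def record_page(self) -> dict:
        """Call after each page load; says whether to rotate context or stop."""
        self.pages += 1
        return {
            "rotate": self.pages % self.rotate_every == 0,
            "stop": self.pages >= self.daily_cap,
        }
```

The batch scraper's loop can consult `record_page()` after each organization and sleep for `next_delay()` before the next one.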
Finding Company Slugs
To scrape Crunchbase data, you need to know the slug (URL identifier) for each company. Several approaches:
From known URLs: If you have Crunchbase URLs (from news articles, LinkedIn, etc.), extract the slug from the path: crunchbase.com/organization/SLUG.
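A small helper makes that extraction robust to query strings and trailing path segments (the function name is mine):

```python
import re
from typing import Optional


def slug_from_url(url: str) -> Optional[str]:
    """Extract the organization slug from a Crunchbase organization URL."""
    # Stop at the next path segment, query string, or fragment
    m = re.search(r"crunchbase\.com/organization/([^/?#]+)", url, re.IGNORECASE)
    return m.group(1) if m else None
```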
From company names via Crunchbase's public autocomplete endpoint (unauthenticated):
import requests
def find_crunchbase_slug(company_name: str) -> Optional[str]:
"""
    Search for a company's Crunchbase slug using their public autocomplete API.
This endpoint is unauthenticated and works without a Crunchbase account.
"""
url = "https://api.crunchbase.com/api/v4/autocompletes"
params = {
"query": company_name,
"collection_ids": "organizations",
"limit": 5,
# Note: Crunchbase's autocomplete API key is embedded in their public JS
# and changes periodically. Check browser network tab for current key.
"user_key": "CURRENT_PUBLIC_KEY",
}
headers = {"User-Agent": UA, "Referer": "https://www.crunchbase.com/"}
try:
resp = requests.get(url, params=params, headers=headers, timeout=10)
if resp.status_code == 200:
data = resp.json()
entities = data.get("entities", [])
if entities:
# Return the identifier (slug) of the best match
return entities[0].get("identifier", {}).get("permalink")
except Exception:
pass
return None
# Alternative: search Google for the Crunchbase page
def find_slug_via_search(company_name: str) -> Optional[str]:
"""
    Best-effort fallback: query the DuckDuckGo Instant Answer API for the
    Crunchbase page. Results for site: queries are inconsistent, so treat a
    miss as inconclusive rather than proof the company has no profile.
"""
search_url = "https://api.duckduckgo.com/"
params = {
"q": f"site:crunchbase.com/organization {company_name}",
"format": "json",
"no_html": 1,
}
try:
resp = requests.get(search_url, params=params, timeout=10)
data = resp.json()
for result in data.get("RelatedTopics", []):
first_url = result.get("FirstURL", "")
if "crunchbase.com/organization/" in first_url:
slug = first_url.split("/organization/")[-1].rstrip("/")
return slug.split("/")[0] # Remove any trailing path
except Exception:
pass
return None
Complete Pipeline with Monitoring
async def run_funding_research(
company_names: list[str],
db_path: str = "crunchbase.db",
proxy_config: dict = None,
) -> dict:
"""
Full pipeline: find slugs -> scrape orgs -> save to DB.
Returns summary stats.
"""
db = init_crunchbase_db(db_path)
# Find slugs for company names
print("Finding Crunchbase slugs...")
slugs = []
for name in company_names:
slug = find_crunchbase_slug(name) or find_slug_via_search(name)
if slug:
slugs.append(slug)
print(f" {name} -> {slug}")
else:
print(f" {name} -> NOT FOUND")
time.sleep(1)
print(f"\nFound {len(slugs)}/{len(company_names)} slugs")
# Scrape organizations
results = await scrape_company_batch(
slugs,
proxy_config=proxy_config or THORDATA_CONFIGS.get("us_rotation"),
)
# Save to database
saved = 0
for org in results:
if not org.get("error"):
save_organization(db, org)
saved += 1
# Summary stats
total_rounds = db.execute("SELECT COUNT(*) FROM funding_rounds").fetchone()[0]
total_orgs = db.execute("SELECT COUNT(*) FROM organizations").fetchone()[0]
print(f"\nDatabase: {total_orgs} organizations, {total_rounds} funding rounds")
db.close()
return {
"companies_requested": len(company_names),
"slugs_found": len(slugs),
"scraped_successfully": saved,
"db_path": db_path,
}
# Example usage
if __name__ == "__main__":
companies = [
"Stripe", "Plaid", "Brex", "Ramp", "Mercury",
"Rippling", "Gusto", "Deel", "Remote.com", "Lattice",
]
result = asyncio.run(run_funding_research(
companies,
db_path="fintech_funding.db",
))
print(f"\nResult: {result}")
Querying the Funding Database
Once you have data in SQLite, run analytical queries:
def analyze_funding_rounds(db_path: str = "crunchbase.db"):
conn = sqlite3.connect(db_path)
print("\n=== Funding Analysis ===")
# Companies by total funding visibility
orgs = conn.execute("""
SELECT name, total_funding, last_funding_type, last_funding_date, founded_date
FROM organizations
WHERE scrape_status = 'success'
ORDER BY scraped_at DESC
""").fetchall()
print(f"\nScraped organizations ({len(orgs)}):")
for o in orgs[:20]:
print(f" {o[0]:35} | {o[1] or 'N/A':15} | {o[2] or 'N/A':12} | Founded: {o[4] or 'N/A'}")
# Funding rounds distribution
rounds = conn.execute("""
SELECT round_type, COUNT(*) AS count
FROM funding_rounds
WHERE round_type != ''
GROUP BY round_type
ORDER BY count DESC
""").fetchall()
print("\nFunding rounds by type:")
for r in rounds:
print(f" {r[0]:20} {r[1]} rounds")
conn.close()
Legal and Practical Considerations
Crunchbase's Terms of Service prohibit scraping. The practical implications:
- Don't scrape data behind authentication (this guide only covers public pages)
- Keep volume low — we're talking tens to low hundreds of companies, not millions
- For production systems handling commercial decisions, their official API is worth the $99/month: it's more reliable, returns richer data, and eliminates legal and operational risk
- For research and small-scale competitive intelligence, the public page data shown here is proportionate and reasonable
Crunchbase actively maintains their bot defenses. When your scraper starts failing, the most common causes are: context/proxy not being rotated frequently enough, delays being too short, or a Cloudflare rule update that requires header adjustments. Budget time for maintenance.
Conclusion
Crunchbase public pages provide a useful subset of their full dataset without authentication. Playwright with proper stealth configuration, residential proxies (like ThorData for IP reputation), conservative request pacing (8-18 seconds between pages), and regular context rotation are the core requirements.
The data you get — funding totals, round types and dates, visible investors, founding dates — is sufficient for competitive landscape mapping, investment research on specific sectors, or building a startup watchlist. For deeper data (full investor lists, employee counts over time, contact details), the official API is the right answer.