Scraping Crunchbase Funding Rounds and Investor Data with Python (2026)
Crunchbase is the standard reference for startup funding data, but it has moved aggressively behind a paywall. The Pro plan runs $49/month, and the Enterprise API starts at $99/month for basic access. This pricing shuts out individual researchers, developers building side projects, and companies that need only occasional funding data without a full subscription.
The good news: a meaningful amount of company data is still crawlable from their public-facing pages — funding rounds with amounts, dates, and investor names; company descriptions and categories; acquisition history; key team members and titles. This guide covers the full technical stack for extracting that data in 2026, with working code, proxy configuration, and storage patterns for building a startup funding database.
What Crunchbase Exposes Without Login
Before building a scraper, understand the boundaries. Without authentication, Crunchbase shows:
Organization pages (crunchbase.com/organization/{slug}):
- Company name, description, and category tags
- Total funding amount and last funding round type
- Founded date, HQ location
- Number of funding rounds
- Summary of most recent 2-3 funding rounds (amount, type, lead investor)
- Key team members and their titles (first 3-5 visible)
- Acquisition history (acquiree name, approximate date)
What requires login:
- Full investor lists per round
- Contact information
- Detailed employee counts over time
- Data export
- Company financials
The summary cards visible without login are still highly useful for building a startup watchlist, tracking specific companies, or researching the competitive landscape.
Architecture: Why You Need Playwright
Crunchbase is a React SPA. The initial HTML response is an empty shell — all data is fetched via their internal API after JavaScript executes. Naive HTTP requests get either an empty page or a Cloudflare challenge.
Playwright running real Chromium is the reliable approach. Their internal API uses GraphQL-like calls with changing schema — intercepting those responses from a real browser session is more stable than trying to reverse-engineer and directly call their API endpoints.
pip install playwright
playwright install chromium
import asyncio
import json
import random
import time
from typing import Optional
from playwright.async_api import async_playwright, BrowserContext, Page
# Typical Chrome User-Agent
UA = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/127.0.0.0 Safari/537.36"
)
Stealth Browser Context
Crunchbase uses Cloudflare and its own bot detection layer. Configure the browser context to minimize automation signals:
async def create_crunchbase_context(
playwright,
proxy_config: dict = None,
) -> tuple:
"""
Create a Playwright browser context configured for Crunchbase scraping.
Returns (browser, context).
"""
launch_opts = {"headless": True}
if proxy_config:
launch_opts["proxy"] = proxy_config
browser = await playwright.chromium.launch(**launch_opts)
context_opts = {
"user_agent": UA,
"viewport": {"width": 1366, "height": 768},
"locale": "en-US",
"timezone_id": "America/New_York",
"color_scheme": "light",
"device_scale_factor": 1,
"has_touch": False,
"is_mobile": False,
}
context = await browser.new_context(**context_opts)
# Override automation tells
await context.add_init_script("""
// Remove webdriver
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
// Realistic plugins
Object.defineProperty(navigator, 'plugins', {
get: () => [{name: 'Chrome PDF Viewer'}, {name: 'Chromium PDF Viewer'}]
});
// Languages
Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
// Chrome object
window.chrome = window.chrome || {runtime: {}};
// Screen
Object.defineProperty(screen, 'width', {get: () => 1366});
Object.defineProperty(screen, 'height', {get: () => 768});
""")
# Block heavy resources we don't need
await context.route(
"**/*.{png,jpg,jpeg,gif,webp,svg,woff,woff2,ttf,eot,otf}",
lambda route: route.abort()
)
# Allow data: URLs (sometimes used for inline assets)
return browser, context
Scraping an Organization Page
async def scrape_organization(
slug: str,
page: Page,
capture_api_responses: bool = True,
) -> dict:
"""
Scrape a Crunchbase organization page.
slug: the organization's URL slug (e.g., 'stripe', 'openai', 'anthropic')
"""
url = f"https://www.crunchbase.com/organization/{slug}"
# Capture internal API responses
api_data = []
if capture_api_responses:
async def handle_response(response):
if (
"api.crunchbase.com" in response.url or
"crunchbase.com/v4/" in response.url or
"graphql" in response.url.lower()
):
try:
data = await response.json()
api_data.append({"url": response.url, "data": data})
except Exception:
pass
page.on("response", handle_response)
# Navigate — use longer timeout, Crunchbase can be slow
try:
await page.goto(url, wait_until="networkidle", timeout=35000)
except Exception:
# Retry with domcontentloaded which is less strict
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
await asyncio.sleep(4)
# Wait for the main content card to appear
try:
await page.wait_for_selector(
"profile-section, .page-centered-card, h1.profile-name",
timeout=15000
)
except Exception:
pass
await asyncio.sleep(random.uniform(2, 3.5))
org = {
"slug": slug,
"url": url,
"funding_rounds": [],
"investors": [],
"acquisitions": [],
"team": [],
"api_data_captured": len(api_data),
}
# Extract company name
name = await _safe_text(page, "h1.profile-name, h1, [class*='ProfileName']")
org["name"] = name or slug
# Extract fields card (total funding, last round type, founded date, etc.)
await _extract_fields_card(page, org)
# Extract funding rounds section
org["funding_rounds"] = await _extract_funding_rounds(page)
# Extract key people/team
org["team"] = await _extract_team_members(page)
# Extract acquisitions
org["acquisitions"] = await _extract_acquisitions(page)
# Extract categories/tags
org["categories"] = await _extract_categories(page)
# Try to parse API responses for richer data
if api_data:
_enrich_from_api_data(org, api_data)
    # Detach the response listener so repeated scrapes on the same page don't stack handlers
    if capture_api_responses:
        page.remove_listener("response", handle_response)
    return org
async def _safe_text(page: Page, selector: str) -> str:
"""Try to get text from a selector, return empty string on failure."""
try:
el = await page.query_selector(selector)
if el:
return (await el.text_content() or "").strip()
except Exception:
pass
return ""
async def _extract_fields_card(page: Page, org: dict):
"""Extract key-value fields from the company card."""
field_selectors = [
"fields-card .field-row",
"[class*='KeyValuePair']",
".profile-card-item",
]
for selector in field_selectors:
fields = await page.query_selector_all(selector)
if not fields:
continue
for field in fields:
try:
label_el = await field.query_selector(
".field-label, [class*='label'], [class*='key']"
)
value_el = await field.query_selector(
".field-value, [class*='value']"
)
if not (label_el and value_el):
# Try splitting the field text
full_text = (await field.text_content() or "").strip()
if ":" in full_text:
parts = full_text.split(":", 1)
label, value = parts[0].strip(), parts[1].strip()
else:
continue
else:
label = (await label_el.text_content() or "").strip()
value = (await value_el.text_content() or "").strip()
if not label or not value:
continue
# Map to structured fields
label_lower = label.lower()
if "total funding" in label_lower:
org["total_funding"] = value
elif "last funding" in label_lower and "type" in label_lower:
org["last_funding_type"] = value
elif "last funding" in label_lower and "date" in label_lower:
org["last_funding_date"] = value
elif "founded" in label_lower:
org["founded_date"] = value
elif "headquarter" in label_lower or "location" in label_lower:
org["headquarters"] = value
elif "employee" in label_lower:
org["employee_range"] = value
elif "website" in label_lower:
org["website"] = value
elif "stock symbol" in label_lower or "ticker" in label_lower:
org["ticker"] = value
elif "ipo" in label_lower:
org["ipo_date"] = value
elif "funding rounds" in label_lower:
org["num_funding_rounds"] = value
except Exception:
continue
if org.get("total_funding") or org.get("founded_date"):
break # Got something, stop trying other selectors
async def _extract_funding_rounds(page: Page) -> list[dict]:
"""Extract funding round summary data visible on the page."""
rounds = []
# Try multiple section selectors
section_selectors = [
"funding-rounds-list",
"[section-id='funding_rounds']",
"[class*='funding-rounds']",
        "section-card:has-text('Funding Rounds')",
]
for sel in section_selectors:
try:
section = await page.query_selector(sel)
if not section:
continue
# Look for round rows/items
row_selectors = [
"a[href*='/funding_round/']",
".cb-card",
"[class*='FundingRound']",
]
for row_sel in row_selectors:
rows = await section.query_selector_all(row_sel)
if rows:
for row in rows[:10]:
round_data = await _parse_round_row(row)
if round_data:
rounds.append(round_data)
break
if rounds:
break
except Exception:
continue
return rounds
async def _parse_round_row(el) -> Optional[dict]:
"""Parse a single funding round row element."""
try:
# Get all text content
full_text = (await el.text_content() or "").strip()
if not full_text or len(full_text) < 5:
return None
# The round link URL usually contains the round ID
link = await el.query_selector("a[href*='/funding_round/']")
round_url = ""
if link:
href = await link.get_attribute("href") or ""
round_url = f"https://www.crunchbase.com{href}" if href.startswith("/") else href
return {
"raw_text": full_text,
"round_url": round_url,
}
except Exception:
return None
async def _extract_team_members(page: Page) -> list[dict]:
"""Extract visible team members from the page."""
team = []
person_selectors = [
"a[href*='/person/']",
"[class*='PersonProfile']",
]
seen_hrefs = set()
for sel in person_selectors:
elements = await page.query_selector_all(sel)
for el in elements[:20]:
try:
href = await el.get_attribute("href") or ""
if not href or href in seen_hrefs:
continue
seen_hrefs.add(href)
name_el = await el.query_selector(
"[class*='name'], [class*='Name'], strong, b"
)
title_el = await el.query_selector(
"[class*='title'], [class*='role'], [class*='Title']"
)
                name = (((await name_el.text_content()) if name_el else await el.text_content()) or "").strip()
                title = ((await title_el.text_content()) or "").strip() if title_el else ""
if name and "/person/" in href:
team.append({
"name": name[:100],
"title": title[:100],
"profile_url": f"https://www.crunchbase.com{href}" if href.startswith("/") else href,
})
except Exception:
continue
# Deduplicate by profile URL
seen = set()
unique = []
for p in team:
key = p["profile_url"]
if key not in seen:
seen.add(key)
unique.append(p)
return unique[:10]
async def _extract_acquisitions(page: Page) -> list[dict]:
"""Extract acquisition history."""
acquisitions = []
acq_selectors = [
"acquisitions-list a[href*='/organization/']",
"[section-id='acquisitions'] a",
]
for sel in acq_selectors:
links = await page.query_selector_all(sel)
if links:
for link in links[:20]:
try:
href = await link.get_attribute("href") or ""
text = (await link.text_content() or "").strip()
if href and text and "/organization/" in href:
acquisitions.append({
"acquiree": text,
"slug": href.strip("/").split("/")[-1],
"url": f"https://www.crunchbase.com{href}",
})
except Exception:
continue
break
return acquisitions[:10]
async def _extract_categories(page: Page) -> list[str]:
"""Extract category/industry tags."""
return await page.evaluate("""
() => {
const links = document.querySelectorAll('a[href*="/hub/companies/"]');
return [...new Set(Array.from(links).map(l => l.textContent.trim()))].filter(Boolean);
}
""")
def _enrich_from_api_data(org: dict, api_data: list[dict]):
"""
Try to extract richer data from intercepted API responses.
API schema changes, so this is best-effort.
"""
for item in api_data:
data = item.get("data", {})
# Navigate common response shapes
entity = (
data.get("data", {})
.get("organization", data.get("organization", {}))
)
if not entity:
continue
# Funding rounds from API
rounds_data = entity.get("fundingRounds", {}).get("edges", [])
if rounds_data:
for edge in rounds_data[:10]:
node = edge.get("node", {})
if node:
org["funding_rounds"].append({
"round_type": node.get("fundingType", ""),
"amount": node.get("money", {}).get("value"),
"currency": node.get("money", {}).get("currency"),
                        "announced_date": node.get("announcedOn", ""),  # keyed to match save_organization
                        "closed_date": node.get("closedOn", ""),
"lead_investors": [
inv.get("name", "")
for inv in node.get("leadInvestors", [])[:3]
],
"source": "api",
})
break
Scraping Individual Funding Round Pages
Each funding round has its own page with more detail:
async def scrape_funding_round(
round_url: str,
page: Page,
) -> dict:
"""
Scrape an individual funding round page.
round_url: full Crunchbase URL like /funding_round/{uuid}
"""
if not round_url.startswith("http"):
round_url = f"https://www.crunchbase.com{round_url}"
try:
await page.goto(round_url, wait_until="networkidle", timeout=30000)
await asyncio.sleep(random.uniform(2, 3.5))
except Exception:
return {"url": round_url, "error": "navigation_failed"}
round_data = {"url": round_url}
# Extract field card data (amount, date, round type, etc.)
fields = await page.query_selector_all("fields-card .field-row, [class*='KeyValuePair']")
for field in fields:
try:
label_el = await field.query_selector(".field-label, [class*='label']")
value_el = await field.query_selector(".field-value, [class*='value']")
if label_el and value_el:
label = (await label_el.text_content() or "").strip().lower()
value = (await value_el.text_content() or "").strip()
if "money raised" in label or "funding amount" in label:
round_data["amount"] = value
elif "announced" in label:
round_data["announced_date"] = value
elif "closed" in label and "date" in label:
round_data["closed_date"] = value
elif "funding type" in label or "series" in label:
round_data["round_type"] = value
elif "pre-money" in label:
round_data["pre_money_valuation"] = value
elif "post-money" in label:
round_data["post_money_valuation"] = value
except Exception:
continue
# Extract investor list (both lead and participating)
investor_links = await page.query_selector_all(
"a[href*='/organization/'], a[href*='/person/']"
)
investors = []
seen_hrefs = set()
for link in investor_links[:30]:
try:
href = await link.get_attribute("href") or ""
if href in seen_hrefs:
continue
seen_hrefs.add(href)
text = (await link.text_content() or "").strip()
entity_type = "organization" if "/organization/" in href else "person"
if text and (entity_type == "organization" or "/person/" in href):
investors.append({
"name": text[:100],
"type": entity_type,
"slug": href.strip("/").split("/")[-1],
})
except Exception:
continue
round_data["investors"] = investors[:20]
# Extract company being funded
company_link = await page.query_selector(
"a.profile-link[href*='/organization/']"
)
if company_link:
href = await company_link.get_attribute("href") or ""
text = (await company_link.text_content() or "").strip()
round_data["company"] = {"name": text, "slug": href.strip("/").split("/")[-1]}
return round_data
Batch Processing Multiple Companies
async def scrape_company_batch(
slugs: list[str],
proxy_config: dict = None,
min_delay: float = 8.0,
max_delay: float = 18.0,
context_rotate_every: int = 15,
) -> list[dict]:
"""
Scrape multiple Crunchbase organizations.
Rotates browser context every N companies to reset session state.
Uses randomized delays to avoid detection.
"""
results = []
async with async_playwright() as p:
browser, context = await create_crunchbase_context(p, proxy_config)
page = await context.new_page()
requests_in_context = 0
for i, slug in enumerate(slugs):
            # Context rotation: close the old browser so we don't leak processes
            if requests_in_context > 0 and requests_in_context % context_rotate_every == 0:
                print(f"  Rotating context after {requests_in_context} requests...")
                await browser.close()
                await asyncio.sleep(random.uniform(5, 10))
                browser, context = await create_crunchbase_context(p, proxy_config)
                page = await context.new_page()
                requests_in_context = 0  # fresh context re-triggers the homepage warm-up below
print(f"[{i+1}/{len(slugs)}] Scraping: {slug}")
try:
# Simulate natural browsing: start from homepage occasionally
if requests_in_context == 0:
await page.goto("https://www.crunchbase.com", wait_until="domcontentloaded")
await asyncio.sleep(random.uniform(2, 4))
org = await scrape_organization(slug, page)
results.append(org)
print(f" {org.get('name', slug)}: {org.get('total_funding', 'N/A')} total funding")
except Exception as e:
print(f" Error: {e}")
results.append({"slug": slug, "error": str(e)})
requests_in_context += 1
# Randomized delay — critical for avoiding detection
delay = random.uniform(min_delay, max_delay)
print(f" Waiting {delay:.1f}s...")
await asyncio.sleep(delay)
await browser.close()
return results
Database Schema
import sqlite3
from datetime import datetime
import json
def init_crunchbase_db(db_path: str = "crunchbase.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS organizations (
slug TEXT PRIMARY KEY,
name TEXT,
url TEXT,
description TEXT,
categories TEXT,
total_funding TEXT,
last_funding_type TEXT,
last_funding_date TEXT,
founded_date TEXT,
headquarters TEXT,
employee_range TEXT,
website TEXT,
ticker TEXT,
ipo_date TEXT,
num_funding_rounds TEXT,
scraped_at TEXT,
scrape_status TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS funding_rounds (
id INTEGER PRIMARY KEY AUTOINCREMENT,
org_slug TEXT NOT NULL,
round_type TEXT,
amount TEXT,
currency TEXT,
announced_date TEXT,
closed_date TEXT,
pre_money_valuation TEXT,
post_money_valuation TEXT,
lead_investors TEXT,
all_investors TEXT,
round_url TEXT,
source TEXT,
scraped_at TEXT,
FOREIGN KEY (org_slug) REFERENCES organizations(slug)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS team_members (
id INTEGER PRIMARY KEY AUTOINCREMENT,
org_slug TEXT NOT NULL,
name TEXT,
title TEXT,
profile_url TEXT,
UNIQUE(org_slug, profile_url)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS acquisitions (
id INTEGER PRIMARY KEY AUTOINCREMENT,
acquirer_slug TEXT NOT NULL,
acquiree_name TEXT,
acquiree_slug TEXT,
acquiree_url TEXT,
scraped_at TEXT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_org_slug ON organizations(slug)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_round_org ON funding_rounds(org_slug)")
conn.commit()
return conn
def save_organization(conn: sqlite3.Connection, org: dict):
"""Save a scraped organization to the database."""
now = datetime.utcnow().isoformat()
conn.execute("""
INSERT OR REPLACE INTO organizations
(slug, name, url, categories, total_funding, last_funding_type,
last_funding_date, founded_date, headquarters, employee_range,
website, ticker, ipo_date, num_funding_rounds, scraped_at, scrape_status)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
org.get("slug"), org.get("name"), org.get("url"),
json.dumps(org.get("categories", [])),
org.get("total_funding"), org.get("last_funding_type"),
org.get("last_funding_date"), org.get("founded_date"),
org.get("headquarters"), org.get("employee_range"),
org.get("website"), org.get("ticker"),
org.get("ipo_date"), org.get("num_funding_rounds"),
now, "success" if not org.get("error") else "error"
))
# Save funding rounds
for round_data in org.get("funding_rounds", []):
conn.execute("""
INSERT INTO funding_rounds
(org_slug, round_type, amount, currency, announced_date, closed_date,
pre_money_valuation, post_money_valuation, lead_investors, all_investors,
round_url, source, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
org["slug"],
round_data.get("round_type"), round_data.get("amount"),
round_data.get("currency"), round_data.get("announced_date"),
round_data.get("closed_date"),
round_data.get("pre_money_valuation"), round_data.get("post_money_valuation"),
json.dumps(round_data.get("lead_investors", [])),
json.dumps([inv.get("name") for inv in round_data.get("investors", [])]),
round_data.get("round_url"), round_data.get("source", "page"),
now
))
# Save team members
for person in org.get("team", []):
try:
conn.execute("""
INSERT OR IGNORE INTO team_members (org_slug, name, title, profile_url)
VALUES (?, ?, ?, ?)
""", (org["slug"], person["name"], person["title"], person["profile_url"]))
except Exception:
pass
# Save acquisitions
for acq in org.get("acquisitions", []):
conn.execute("""
INSERT INTO acquisitions (acquirer_slug, acquiree_name, acquiree_slug, acquiree_url, scraped_at)
VALUES (?, ?, ?, ?, ?)
""", (
org["slug"], acq.get("acquiree"), acq.get("slug"),
acq.get("url"), now
))
conn.commit()
Anti-Bot Strategy Details
Crunchbase in 2026 uses a multi-layer detection stack:
Cloudflare handles the perimeter. They check:
- IP reputation (datacenter IPs fail immediately)
- TLS fingerprint (matches against known browser vs. library patterns)
- HTTP/2 fingerprint
- JavaScript challenge capability

Session behavioral tracking monitors:
- Time between requests
- Navigation pattern (jumping directly to deep URLs vs. browsing naturally)
- Cookie presence and age
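A cheap first line of defense against wasting your request budget is detecting when a navigation landed on a challenge page instead of real content. The marker strings below are assumptions based on common Cloudflare interstitials, not an official list, and will need occasional updating:

```python
# Marker strings commonly seen on Cloudflare interstitials (assumptions;
# update them as the challenge pages evolve).
CHALLENGE_MARKERS = (
    "just a moment",
    "checking your browser",
    "cf-chl",
    "challenge-platform",
)


def looks_like_challenge(html: str, title: str = "") -> bool:
    """Heuristic: does this page resemble a bot challenge rather than content?"""
    # Challenge pages are small, so only the head of the document matters
    haystack = (html[:5000] + " " + title).lower()
    return any(marker in haystack for marker in CHALLENGE_MARKERS)
```

Call it right after navigation (`looks_like_challenge(await page.content(), await page.title())`) and, on a hit, back off for several minutes and rotate the proxy instead of retrying immediately.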
For IP rotation: ThorData's residential proxy network provides IPs that pass Cloudflare's reputation check. City-level targeting lets you match proxy location to a realistic browsing pattern:
# ThorData proxy configuration
THORDATA_CONFIGS = {
"us_ny": {
"server": "http://proxy.thordata.com:9000",
"username": "YOUR_USER-country-us-city-newyork",
"password": "YOUR_PASS",
},
"us_sf": {
"server": "http://proxy.thordata.com:9000",
"username": "YOUR_USER-country-us-city-sanfrancisco",
"password": "YOUR_PASS",
},
"us_rotation": {
"server": "http://proxy.thordata.com:9000",
"username": "YOUR_USER-country-us",
"password": "YOUR_PASS",
},
}
Sustainable scraping cadence for Crunchbase:
- Minimum 8-15 seconds between page loads (not requests — full page loads)
- Maximum 15-20 companies per context before rotating
- Rotate proxy on each context rotation
- Start from the homepage on the first request of each new context
- Total daily volume: under 150-200 organization pages per IP pool
Exceeding these thresholds doesn't just get your current session blocked — Crunchbase can add your proxy IPs to a longer-term denylist. Conservative rates are worth it.
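Those limits are easy to drift past in a long-running job, so it helps to encode them in one place. A minimal sketch (the class name and defaults are mine, mirroring the thresholds above):

```python
import random


class ScrapeBudget:
    """Track pacing rules: randomized delays, rotation cadence, daily cap."""

    def __init__(self, min_delay: float = 8.0, max_delay: float = 15.0,
                 rotate_every: int = 15, daily_cap: int = 150):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.rotate_every = rotate_every
        self.daily_cap = daily_cap
        self.pages = 0

    def next_delay(self) -> float:
        """Randomized sleep duration before the next page load."""
        return random.uniform(self.min_delay, self.max_delay)

    def record_page(self) -> dict:
        """Call after each page load; says whether to rotate context or stop."""
        self.pages += 1
        return {
            "rotate": self.pages % self.rotate_every == 0,
            "stop": self.pages >= self.daily_cap,
        }
```

The batch scraper's loop can consult `record_page()` after each organization and sleep for `next_delay()` before the next one.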
Finding Company Slugs
To scrape Crunchbase data, you need to know the slug (URL identifier) for each company. Several approaches:
From known URLs: If you have Crunchbase URLs (from news articles, LinkedIn, etc.), extract the slug from the path: crunchbase.com/organization/SLUG.
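A small helper makes that extraction robust to query strings and trailing path segments (the function name is mine):

```python
import re
from typing import Optional


def slug_from_url(url: str) -> Optional[str]:
    """Extract the organization slug from a Crunchbase organization URL."""
    # Stop at the next path segment, query string, or fragment
    m = re.search(r"crunchbase\.com/organization/([^/?#]+)", url, re.IGNORECASE)
    return m.group(1) if m else None
```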
From company names via Crunchbase's public autocomplete endpoint (unauthenticated):
import requests
def find_crunchbase_slug(company_name: str) -> Optional[str]:
"""
    Search for a company's Crunchbase slug using their public autocomplete API.
This endpoint is unauthenticated and works without a Crunchbase account.
"""
url = "https://api.crunchbase.com/api/v4/autocompletes"
params = {
"query": company_name,
"collection_ids": "organizations",
"limit": 5,
# Note: Crunchbase's autocomplete API key is embedded in their public JS
# and changes periodically. Check browser network tab for current key.
"user_key": "CURRENT_PUBLIC_KEY",
}
headers = {"User-Agent": UA, "Referer": "https://www.crunchbase.com/"}
try:
resp = requests.get(url, params=params, headers=headers, timeout=10)
if resp.status_code == 200:
data = resp.json()
entities = data.get("entities", [])
if entities:
# Return the identifier (slug) of the best match
return entities[0].get("identifier", {}).get("permalink")
except Exception:
pass
return None
# Alternative: search Google for the Crunchbase page
def find_slug_via_search(company_name: str) -> Optional[str]:
"""
    Best-effort fallback: query the DuckDuckGo Instant Answer API for the
    Crunchbase page. Results for site: queries are inconsistent, so treat a
    miss as inconclusive rather than proof the company has no profile.
"""
search_url = "https://api.duckduckgo.com/"
params = {
"q": f"site:crunchbase.com/organization {company_name}",
"format": "json",
"no_html": 1,
}
try:
resp = requests.get(search_url, params=params, timeout=10)
data = resp.json()
for result in data.get("RelatedTopics", []):
first_url = result.get("FirstURL", "")
if "crunchbase.com/organization/" in first_url:
slug = first_url.split("/organization/")[-1].rstrip("/")
return slug.split("/")[0] # Remove any trailing path
except Exception:
pass
return None
Complete Pipeline with Monitoring
async def run_funding_research(
company_names: list[str],
db_path: str = "crunchbase.db",
proxy_config: dict = None,
) -> dict:
"""
Full pipeline: find slugs -> scrape orgs -> save to DB.
Returns summary stats.
"""
db = init_crunchbase_db(db_path)
# Find slugs for company names
print("Finding Crunchbase slugs...")
slugs = []
for name in company_names:
slug = find_crunchbase_slug(name) or find_slug_via_search(name)
if slug:
slugs.append(slug)
print(f" {name} -> {slug}")
else:
print(f" {name} -> NOT FOUND")
time.sleep(1)
print(f"\nFound {len(slugs)}/{len(company_names)} slugs")
# Scrape organizations
results = await scrape_company_batch(
slugs,
proxy_config=proxy_config or THORDATA_CONFIGS.get("us_rotation"),
)
# Save to database
saved = 0
for org in results:
if not org.get("error"):
save_organization(db, org)
saved += 1
# Summary stats
total_rounds = db.execute("SELECT COUNT(*) FROM funding_rounds").fetchone()[0]
total_orgs = db.execute("SELECT COUNT(*) FROM organizations").fetchone()[0]
print(f"\nDatabase: {total_orgs} organizations, {total_rounds} funding rounds")
db.close()
return {
"companies_requested": len(company_names),
"slugs_found": len(slugs),
"scraped_successfully": saved,
"db_path": db_path,
}
# Example usage
if __name__ == "__main__":
companies = [
"Stripe", "Plaid", "Brex", "Ramp", "Mercury",
"Rippling", "Gusto", "Deel", "Remote.com", "Lattice",
]
result = asyncio.run(run_funding_research(
companies,
db_path="fintech_funding.db",
))
print(f"\nResult: {result}")
Querying the Funding Database
Once you have data in SQLite, run analytical queries:
def analyze_funding_rounds(db_path: str = "crunchbase.db"):
conn = sqlite3.connect(db_path)
print("\n=== Funding Analysis ===")
# Companies by total funding visibility
orgs = conn.execute("""
SELECT name, total_funding, last_funding_type, last_funding_date, founded_date
FROM organizations
WHERE scrape_status = 'success'
ORDER BY scraped_at DESC
""").fetchall()
print(f"\nScraped organizations ({len(orgs)}):")
for o in orgs[:20]:
print(f" {o[0]:35} | {o[1] or 'N/A':15} | {o[2] or 'N/A':12} | Founded: {o[4] or 'N/A'}")
# Funding rounds distribution
rounds = conn.execute("""
SELECT round_type, COUNT(*) AS count
FROM funding_rounds
WHERE round_type != ''
GROUP BY round_type
ORDER BY count DESC
""").fetchall()
print("\nFunding rounds by type:")
for r in rounds:
print(f" {r[0]:20} {r[1]} rounds")
conn.close()
Legal and Practical Considerations
Crunchbase's Terms of Service prohibit scraping. The practical implications:
- Don't scrape data behind authentication (this guide only covers public pages)
- Keep volume low — we're talking tens to low hundreds of companies, not millions
- For production systems handling commercial decisions, their official API is worth the $99/month: it's more reliable, returns richer data, and eliminates legal and operational risk
- For research and small-scale competitive intelligence, the public page data shown here is proportionate and reasonable
Crunchbase actively maintains their bot defenses. When your scraper starts failing, the most common causes are: context/proxy not being rotated frequently enough, delays being too short, or a Cloudflare rule update that requires header adjustments. Budget time for maintenance.
Conclusion
Crunchbase public pages provide a useful subset of their full dataset without authentication. Playwright with proper stealth configuration, residential proxies (like ThorData for IP reputation), conservative request pacing (8-18 seconds between pages), and regular context rotation are the core requirements.
The data you get — funding totals, round types and dates, visible investors, founding dates — is sufficient for competitive landscape mapping, investment research on specific sectors, or building a startup watchlist. For deeper data (full investor lists, employee counts over time, contact details), the official API is the right answer.