How to Scrape Crunchbase Company Data in 2026: Autocomplete, Funding & Investors
Crunchbase is where startup data lives — funding rounds, investor networks, employee headcounts, acquisition histories, and company timelines. It is the first place VCs, analysts, and founders check when researching any company in the tech ecosystem.
The challenge: Crunchbase's paid API starts at $49/month and the free tier caps at 200 calls per day. However, Crunchbase exposes several endpoints that the frontend uses — most notably an autocomplete search endpoint that returns structured JSON without authentication. Combined with the official free-tier API for organization details, you can build a solid data pipeline.
This guide covers the autocomplete endpoint, the free REST API, page scraping as a fallback, funding round extraction, proxy configuration, and a complete SQLite-backed pipeline.
What Data Crunchbase Contains
Company profiles on Crunchbase are dense with structured information:
- Company overview — name, description, founded date, operating status (active, closed, acquired)
- Funding rounds — date, amount, type (Pre-Seed through Series H, Convertible Note, Grant, etc.)
- Total funding raised — aggregate in USD across all rounds
- Lead investors — name and fund details for each round
- Employee headcount — range buckets (1-10, 11-50, 51-200, 201-500, 501-1000, 1001-5000, 5001-10000, 10000+)
- Headquarters — city, region, country
- Categories — Crunchbase's industry taxonomy (up to 10 categories per company)
- Key people — founders, C-suite executives, board members
- Acquisitions — acquirer, price where disclosed, date
- IPO details — stock ticker, valuation, exchange, date
Crunchbase's Anti-Bot Architecture
Crunchbase is one of the more heavily defended scraping targets:
Cloudflare Bot Management. Every request passes through Cloudflare's full challenge pipeline. Fresh IPs with no browsing history, linear crawling patterns, and unusual header combinations all draw JS challenges or Turnstile CAPTCHAs.
Aggressive rate limiting. 30-40 requests per minute from a single IP triggers block pages. The autocomplete endpoint is slightly more lenient (~60/min) but still monitored.
Content gating. After 5-10 profile views without login, a paywall modal covers the content. The gate is cookie-based, so it resets when you clear cookies — each fresh httpx client starts with an empty cookie jar, which is why rotating proxies with new clients sidesteps it.
Session token requirements. The internal GraphQL API requires valid session tokens with CSRF headers. The free-tier REST API uses a simpler API key scheme.
Legal enforcement. Crunchbase actively pursues scrapers with cease-and-desist letters and has filed lawsuits against data resellers. Their business model depends on selling this data.
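Given the rate limits above (~30-40 requests/min on pages, ~60/min on autocomplete), a per-endpoint throttle keeps a crawler under the threshold while adding jitter to avoid the linear patterns Cloudflare flags. A minimal sketch — the limits themselves are observations, not documented guarantees:

```python
import time
import random

class Throttle:
    """Enforce a minimum randomized gap between requests per endpoint class."""

    def __init__(self, min_gap: float, jitter: float = 2.0):
        self.min_gap = min_gap   # floor, in seconds, between requests
        self.jitter = jitter     # extra random delay to break up linear timing
        self._last = 0.0

    def wait(self):
        gap = self.min_gap + random.uniform(0, self.jitter)
        elapsed = time.monotonic() - self._last
        if elapsed < gap:
            time.sleep(gap - elapsed)
        self._last = time.monotonic()

# ~60/min for autocomplete -> 1s floor; profile pages get a wider gap
autocomplete_throttle = Throttle(min_gap=1.0)
page_throttle = Throttle(min_gap=2.5, jitter=4.0)
```

Call `autocomplete_throttle.wait()` before each autocomplete request and `page_throttle.wait()` before each page fetch.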
Method 1: The Autocomplete Search Endpoint
Crunchbase's search autocomplete returns structured company data without any authentication. It is designed for the search bar but returns enough data for basic research:
import httpx
import json
import time
import random
from fake_useragent import UserAgent
ua = UserAgent()
def search_crunchbase_autocomplete(
query: str,
proxy: str = None,
limit: int = 25,
) -> list[dict]:
"""
Search Crunchbase via the autocomplete endpoint.
Returns company names, slugs, short descriptions, and entity types.
No authentication required.
"""
url = "https://www.crunchbase.com/v4/data/autocompletes"
params = {
"query": query,
"collection_ids": "organizations",
"limit": limit,
"source": "topSearch",
}
headers = {
"User-Agent": ua.random,
"Accept": "application/json",
"Referer": "https://www.crunchbase.com/",
"X-Requested-With": "XMLHttpRequest",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
}
client_kwargs = {"headers": headers, "follow_redirects": True, "timeout": 15}
if proxy:
        client_kwargs["proxy"] = proxy  # httpx >= 0.26; older versions used proxies={"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(url, params=params)
if resp.status_code != 200:
return []
data = resp.json()
results = []
for entity in data.get("entities", []):
props = entity.get("identifier", {})
results.append({
"name": props.get("value"),
"slug": props.get("permalink"),
"entity_type": props.get("entity_def_id"),
"uuid": props.get("uuid"),
"short_description": entity.get("short_description"),
"facet_ids": entity.get("facet_ids", []),
"image_url": entity.get("image_url"),
})
return results
def search_multiple_queries(queries: list[str], proxy: str = None) -> list[dict]:
"""Search multiple queries, deduplicate by slug."""
seen_slugs = set()
all_results = []
for query in queries:
results = search_crunchbase_autocomplete(query, proxy=proxy)
for r in results:
if r.get("slug") and r["slug"] not in seen_slugs:
seen_slugs.add(r["slug"])
all_results.append(r)
time.sleep(random.uniform(5, 12))
return all_results
Method 2: Free-Tier REST API
Crunchbase offers a free API tier (200 calls/day) that returns full organization data. Register at crunchbase.com/accelerator/application:
def fetch_organization(slug: str, api_key: str, proxy: str = None) -> dict:
"""
Fetch full organization data from Crunchbase's free REST API.
Requires a free API key from the Crunchbase Basic plan (200 calls/day).
field_ids documentation: https://data.crunchbase.com/docs/field-ids
"""
url = f"https://api.crunchbase.com/api/v4/entities/organizations/{slug}"
field_ids = [
"short_description",
"founded_on",
"num_employees_enum",
"funding_total",
"last_funding_type",
"last_funding_at",
"num_funding_rounds",
"categories",
"location_identifiers",
"founder_identifiers",
"website",
"linkedin",
"status",
"operating_status",
"ipo_status",
]
params = {
"user_key": api_key,
"field_ids": ",".join(field_ids),
}
headers = {
"User-Agent": ua.random,
"Accept": "application/json",
}
client_kwargs = {"headers": headers, "timeout": 20}
if proxy:
        client_kwargs["proxy"] = proxy  # httpx >= 0.26; older versions used proxies={"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(url, params=params)
if resp.status_code == 404:
return {"error": "not_found", "slug": slug}
if resp.status_code == 429:
return {"error": "rate_limited", "slug": slug}
if resp.status_code != 200:
return {"error": f"http_{resp.status_code}", "slug": slug}
data = resp.json()
props = data.get("properties", {})
org = {
"slug": slug,
"name": props.get("name") or slug,
"short_description": props.get("short_description"),
"founded_on": props.get("founded_on"),
"num_employees": props.get("num_employees_enum"),
"status": props.get("status"),
"operating_status": props.get("operating_status"),
"ipo_status": props.get("ipo_status"),
}
# Website
website = props.get("website")
if isinstance(website, dict):
org["website"] = website.get("value")
else:
org["website"] = website
# Funding
funding = props.get("funding_total", {})
if isinstance(funding, dict):
org["total_funding_usd"] = funding.get("value_usd")
org["total_funding_currency"] = funding.get("currency")
org["last_funding_type"] = props.get("last_funding_type")
org["last_funding_at"] = props.get("last_funding_at")
org["num_funding_rounds"] = props.get("num_funding_rounds", 0)
# Categories
cats = props.get("categories", [])
org["categories"] = [
c.get("value") for c in cats if isinstance(c, dict)
]
# Location
locs = props.get("location_identifiers", [])
if locs:
loc = locs[0]
org["location"] = loc.get("value") if isinstance(loc, dict) else str(loc)
# Founders
founders = props.get("founder_identifiers", [])
org["founders"] = [
f.get("value") for f in founders if isinstance(f, dict)
]
return org
Fetching Funding Rounds
Funding history is the most valuable Crunchbase data. The free API provides this through a sub-entity endpoint:
def fetch_funding_rounds(slug: str, api_key: str, proxy: str = None) -> list[dict]:
"""
Fetch all funding rounds for a company.
Returns rounds sorted by date descending.
"""
url = f"https://api.crunchbase.com/api/v4/entities/organizations/{slug}/funding_rounds"
field_ids = [
"announced_on",
"funding_type",
"money_raised",
"lead_investor_identifiers",
"investor_identifiers",
"num_investors",
"pre_money_valuation",
"post_money_valuation",
"is_equity",
]
params = {
"user_key": api_key,
"field_ids": ",".join(field_ids),
}
headers = {"User-Agent": ua.random, "Accept": "application/json"}
client_kwargs = {"headers": headers, "timeout": 20}
if proxy:
        client_kwargs["proxy"] = proxy  # httpx >= 0.26; older versions used proxies={"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(url, params=params)
if resp.status_code != 200:
return []
rounds = []
for entity in resp.json().get("entities", []):
props = entity.get("properties", {})
money = props.get("money_raised", {})
valuation_pre = props.get("pre_money_valuation", {})
valuation_post = props.get("post_money_valuation", {})
lead_investors = [
inv.get("value") for inv in props.get("lead_investor_identifiers", [])
if isinstance(inv, dict)
]
all_investors = [
inv.get("value") for inv in props.get("investor_identifiers", [])
if isinstance(inv, dict)
]
rounds.append({
"type": props.get("funding_type"),
"date": props.get("announced_on"),
"amount_usd": money.get("value_usd") if isinstance(money, dict) else None,
"currency": money.get("currency") if isinstance(money, dict) else None,
"lead_investors": lead_investors,
"all_investors": all_investors,
"num_investors": props.get("num_investors"),
"pre_money_valuation_usd": valuation_pre.get("value_usd") if isinstance(valuation_pre, dict) else None,
"post_money_valuation_usd": valuation_post.get("value_usd") if isinstance(valuation_post, dict) else None,
"is_equity": props.get("is_equity"),
})
return sorted(rounds, key=lambda r: r.get("date") or "", reverse=True)
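Since `fetch_funding_rounds` returns structured rounds with ISO dates, it is easy to derive cadence metrics such as the average gap between raises. A small helper sketch (the `funding_cadence` name and output shape are my own, not part of any Crunchbase API):

```python
from datetime import date

def funding_cadence(rounds: list[dict]) -> dict:
    """Summarize a company's funding history: round count, total raised,
    and average months between consecutive rounds."""
    dates = sorted(
        date.fromisoformat(r["date"]) for r in rounds if r.get("date")
    )
    # Gap between each consecutive pair of rounds, in approximate months
    gaps = [(b - a).days / 30.4 for a, b in zip(dates, dates[1:])]
    total = sum(r["amount_usd"] for r in rounds if r.get("amount_usd"))
    return {
        "num_rounds": len(rounds),
        "avg_months_between_rounds": round(sum(gaps) / len(gaps), 1) if gaps else None,
        "total_raised_usd": total,
    }
```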
Method 3: Scraping Profile Pages (Fallback)
When API calls are exhausted, scrape the web pages. Crunchbase embeds JSON-LD and a hydration state in the page source:
import re
def scrape_crunchbase_page(slug: str, proxy: str = None) -> dict:
"""
Scrape a Crunchbase organization page for embedded data.
Use this as fallback when API rate limit is hit.
"""
url = f"https://www.crunchbase.com/organization/{slug}"
headers = {
"User-Agent": ua.random,
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.google.com/",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "cross-site",
"Cache-Control": "no-cache",
"Upgrade-Insecure-Requests": "1",
}
client_kwargs = {
"headers": headers,
"follow_redirects": True,
"timeout": 20,
}
if proxy:
        client_kwargs["proxy"] = proxy  # httpx >= 0.26; older versions used proxies={"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(url)
if resp.status_code != 200:
return {"error": f"Status {resp.status_code}", "slug": slug}
company = {"slug": slug, "url": url}
# Extract JSON-LD structured data
ld_match = re.search(
r'<script type="application/ld\+json">(.*?)</script>',
resp.text, re.DOTALL,
)
if ld_match:
try:
ld = json.loads(ld_match.group(1))
company["name"] = ld.get("name")
company["description"] = ld.get("description")
company["founded"] = ld.get("foundingDate")
if "address" in ld:
addr = ld["address"]
company["city"] = addr.get("addressLocality")
company["country"] = addr.get("addressCountry")
founders_raw = ld.get("founder", [])
if founders_raw:
if isinstance(founders_raw, dict):
founders_raw = [founders_raw]
                company["founders"] = [f.get("name") for f in founders_raw if isinstance(f, dict)]
except (json.JSONDecodeError, TypeError):
pass
# Extract ng-state hydration data
state_match = re.search(
r'<script id="ng-state" type="application/json">(.*?)</script>',
resp.text, re.DOTALL,
)
if state_match:
try:
state = json.loads(state_match.group(1))
for key, value in state.items():
if isinstance(value, dict) and "properties" in value:
props = value["properties"]
company.setdefault("short_description", props.get("short_description"))
company.setdefault("num_employees", props.get("num_employees_enum"))
funding_total = props.get("funding_total", {})
if isinstance(funding_total, dict):
company.setdefault("total_funding_usd", funding_total.get("value_usd"))
company.setdefault("last_funding_type", props.get("last_funding_type"))
company.setdefault("status", props.get("status"))
break
except (json.JSONDecodeError, TypeError):
pass
return company
def detect_paywall(html: str) -> bool:
"""Check if Crunchbase returned a gated/paywall page."""
return any(
marker in html.lower()
for marker in [
"sign up to see",
"create a free account",
"upgrade to crunchbase pro",
"sign in to view",
]
)
SQLite Schema
import sqlite3
def init_crunchbase_db(db_path: str = "crunchbase.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS companies (
slug TEXT PRIMARY KEY,
name TEXT,
description TEXT,
founded TEXT,
location TEXT,
num_employees TEXT,
total_funding_usd REAL,
last_funding_type TEXT,
last_funding_at TEXT,
num_funding_rounds INTEGER,
status TEXT,
operating_status TEXT,
ipo_status TEXT,
website TEXT,
categories TEXT,
founders TEXT,
source TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS funding_rounds (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_slug TEXT NOT NULL,
funding_type TEXT,
announced_date TEXT,
amount_usd REAL,
currency TEXT,
lead_investors TEXT,
all_investors TEXT,
num_investors INTEGER,
pre_money_valuation_usd REAL,
post_money_valuation_usd REAL,
is_equity INTEGER,
FOREIGN KEY (company_slug) REFERENCES companies(slug)
);
CREATE INDEX IF NOT EXISTS idx_companies_funding
ON companies(total_funding_usd DESC);
CREATE INDEX IF NOT EXISTS idx_companies_last_round
ON companies(last_funding_type, last_funding_at);
CREATE INDEX IF NOT EXISTS idx_rounds_slug
ON funding_rounds(company_slug);
CREATE INDEX IF NOT EXISTS idx_rounds_date
ON funding_rounds(announced_date DESC);
""")
conn.commit()
return conn
def save_company(conn: sqlite3.Connection, company: dict, source: str = "api"):
conn.execute(
"""INSERT OR REPLACE INTO companies
(slug, name, description, founded, location, num_employees,
total_funding_usd, last_funding_type, last_funding_at,
num_funding_rounds, status, operating_status, ipo_status,
website, categories, founders, source)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
company.get("slug"),
company.get("name"),
company.get("short_description") or company.get("description"),
company.get("founded_on") or company.get("founded"),
company.get("location"),
company.get("num_employees"),
company.get("total_funding_usd"),
company.get("last_funding_type"),
company.get("last_funding_at"),
company.get("num_funding_rounds"),
company.get("status"),
company.get("operating_status"),
company.get("ipo_status"),
company.get("website"),
json.dumps(company.get("categories", [])),
json.dumps(company.get("founders", [])),
source,
),
)
conn.commit()
def save_funding_rounds(conn: sqlite3.Connection, company_slug: str, rounds: list[dict]):
conn.executemany(
"""INSERT INTO funding_rounds
(company_slug, funding_type, announced_date, amount_usd, currency,
lead_investors, all_investors, num_investors, pre_money_valuation_usd,
post_money_valuation_usd, is_equity)
VALUES (?,?,?,?,?,?,?,?,?,?,?)""",
[
(
company_slug,
r.get("type"),
r.get("date"),
r.get("amount_usd"),
r.get("currency"),
json.dumps(r.get("lead_investors", [])),
json.dumps(r.get("all_investors", [])),
r.get("num_investors"),
r.get("pre_money_valuation_usd"),
r.get("post_money_valuation_usd"),
int(r.get("is_equity") or False),
)
for r in rounds
],
)
conn.commit()
Error Handling and Retry Logic
def fetch_with_retry(
func,
*args,
max_retries: int = 3,
base_delay: float = 5.0,
**kwargs,
):
"""Execute a fetch function with exponential backoff retry."""
for attempt in range(max_retries):
result = func(*args, **kwargs)
if isinstance(result, dict) and result.get("error") == "rate_limited":
wait = base_delay * (2 ** attempt) + random.uniform(0, 5)
print(f" Rate limited (attempt {attempt + 1}), waiting {wait:.0f}s")
time.sleep(wait)
continue
return result
return {"error": "max_retries_exceeded"}
Proxy Configuration
Crunchbase sits behind Cloudflare's full bot management suite. Datacenter IPs get Turnstile CAPTCHAs before any content loads. Residential proxies are non-negotiable for web scraping.
ThorData's residential proxy network works for Crunchbase because the IPs pass Cloudflare's ASN reputation checks. The autocomplete endpoint is the most proxy-friendly and tolerates slightly higher request rates than profile pages. For the free API, proxies help when spreading requests across multiple API keys.
API_KEY = "your_free_api_key"
PROXY = "http://USER:[email protected]:9000"
# Step 1: Discover companies via autocomplete
results = search_crunchbase_autocomplete("fintech payments", proxy=PROXY)
print(f"Found {len(results)} companies")
# Step 2: Fetch full data via API
conn = init_crunchbase_db()
for result in results[:20]:
slug = result.get("slug")
if not slug:
continue
print(f" Fetching: {result['name']}")
org = fetch_with_retry(fetch_organization, slug, API_KEY, proxy=PROXY)
if "error" not in org:
save_company(conn, org, source="api")
# Fetch funding rounds for companies with funding
if org.get("num_funding_rounds", 0) > 0:
rounds = fetch_with_retry(fetch_funding_rounds, slug, API_KEY, proxy=PROXY)
if rounds:
save_funding_rounds(conn, slug, rounds)
print(f" {len(rounds)} funding rounds saved")
# Conservative delays — Crunchbase monitors patterns aggressively
time.sleep(random.uniform(10, 20))
conn.close()
Useful SQL Queries
conn = sqlite3.connect("crunchbase.db")
# Companies by total funding, most funded first
top_funded = conn.execute("""
SELECT name, location, total_funding_usd, last_funding_type, num_funding_rounds
FROM companies
WHERE total_funding_usd IS NOT NULL
ORDER BY total_funding_usd DESC
LIMIT 20
""").fetchall()
# Recent funding rounds above $10M
recent_large = conn.execute("""
SELECT c.name, r.funding_type, r.announced_date, r.amount_usd,
r.lead_investors
FROM funding_rounds r
JOIN companies c ON c.slug = r.company_slug
WHERE r.amount_usd >= 10000000
ORDER BY r.announced_date DESC
LIMIT 50
""").fetchall()
# Companies by category
ai_companies = conn.execute("""
SELECT name, total_funding_usd, num_employees, location
FROM companies
WHERE categories LIKE '%Artificial Intelligence%'
ORDER BY total_funding_usd DESC NULLS LAST
LIMIT 30
""").fetchall()
Complete Pipeline
def run_crunchbase_pipeline(
search_queries: list[str],
api_key: str,
db_path: str = "crunchbase.db",
proxy: str = None,
):
"""
Full pipeline:
1. Search for companies using multiple queries (autocomplete)
2. Fetch full organization data via REST API
3. Fetch funding rounds for companies with funding history
4. Store everything in SQLite
"""
conn = init_crunchbase_db(db_path)
# Phase 1: Discovery
print("Discovering companies...")
all_results = search_multiple_queries(search_queries, proxy=proxy)
print(f"Found {len(all_results)} unique companies")
# Phase 2: Enrich via API
api_calls_used = 0
for result in all_results:
slug = result.get("slug")
if not slug:
continue
        # Skip if already in the DB (no freshness check -- delete the row to force a re-fetch)
existing = conn.execute(
"SELECT scraped_at FROM companies WHERE slug = ?", (slug,)
).fetchone()
if existing:
print(f" Skip (cached): {result['name']}")
continue
# Save basic data from autocomplete
save_company(conn, {
"slug": slug,
"name": result.get("name"),
"short_description": result.get("short_description"),
}, source="autocomplete")
if api_calls_used >= 150: # Reserve buffer before hitting 200/day limit
print("API call budget nearly exhausted, stopping enrichment")
break
# Enrich with full API data
print(f" API fetch: {result['name']}")
org = fetch_with_retry(fetch_organization, slug, api_key, proxy=proxy)
api_calls_used += 1
if "error" not in org:
save_company(conn, org, source="api")
if org.get("num_funding_rounds", 0) > 0 and api_calls_used < 150:
rounds = fetch_with_retry(fetch_funding_rounds, slug, api_key, proxy=proxy)
api_calls_used += 1
if rounds:
save_funding_rounds(conn, slug, rounds)
time.sleep(random.uniform(12, 25))
conn.close()
print(f"Pipeline complete. API calls used: {api_calls_used}/200")
# Run it
run_crunchbase_pipeline(
search_queries=[
"artificial intelligence startup",
"fintech payments",
"climate tech carbon",
"biotech drug discovery",
],
api_key="your_api_key_here",
proxy=PROXY,
db_path="crunchbase.db",
)
Legal Considerations
Crunchbase explicitly prohibits scraping in their Terms of Service and pursues violators. Their business model depends on selling this data, giving them a strong legal position for enforcement. The appropriate access levels are:
- Free REST API (200 calls/day): Sanctioned for personal research
- Pro plan ($49/month): Appropriate for commercial use and higher volumes
- Enterprise licensing: For building products that include Crunchbase data
Use autocomplete for discovery, the free API for enrichment, and page scraping as a last resort for data you cannot get any other way. Never build competing data products using scraped Crunchbase data — that is precisely what their enforcement targets.
Sector Intelligence Reports
Use the collected data to generate automated sector reports:
import sqlite3
import json
from datetime import datetime, timedelta
def generate_sector_report(
sector_keyword: str,
db_path: str = "crunchbase.db",
months_back: int = 12,
) -> dict:
"""
Generate a funding intelligence report for a sector.
Returns aggregated metrics, top companies, and recent rounds.
"""
conn = sqlite3.connect(db_path)
cutoff_date = (datetime.now() - timedelta(days=months_back * 30)).strftime("%Y-%m-%d")
# Top funded companies in sector
top_companies = conn.execute("""
SELECT name, total_funding_usd, num_funding_rounds,
last_funding_type, location, num_employees
FROM companies
WHERE categories LIKE ?
AND total_funding_usd IS NOT NULL
ORDER BY total_funding_usd DESC
LIMIT 20
""", (f'%{sector_keyword}%',)).fetchall()
# Recent rounds in sector
recent_rounds = conn.execute("""
SELECT c.name, r.funding_type, r.announced_date, r.amount_usd,
r.lead_investors
FROM funding_rounds r
JOIN companies c ON c.slug = r.company_slug
WHERE c.categories LIKE ?
AND r.announced_date >= ?
AND r.amount_usd IS NOT NULL
ORDER BY r.announced_date DESC
LIMIT 50
""", (f'%{sector_keyword}%', cutoff_date)).fetchall()
# Funding by stage distribution
stage_dist = conn.execute("""
SELECT last_funding_type, COUNT(*) as count,
AVG(total_funding_usd) as avg_total_funding
FROM companies
WHERE categories LIKE ?
AND last_funding_type IS NOT NULL
GROUP BY last_funding_type
ORDER BY count DESC
""", (f'%{sector_keyword}%',)).fetchall()
conn.close()
return {
"sector": sector_keyword,
"generated_at": datetime.now().isoformat(),
"top_companies": [
{
"name": row[0],
"total_funding_m": round(row[1] / 1e6, 1) if row[1] else None,
"rounds": row[2],
"last_stage": row[3],
"location": row[4],
"employees": row[5],
}
for row in top_companies
],
"recent_rounds": [
{
"company": row[0],
"type": row[1],
"date": row[2],
"amount_m": round(row[3] / 1e6, 1) if row[3] else None,
"lead_investors": json.loads(row[4] or "[]"),
}
for row in recent_rounds
],
"stage_distribution": [
{"stage": row[0], "count": row[1], "avg_total_funding_m": round((row[2] or 0) / 1e6, 1)}
for row in stage_dist
],
}
# Generate report for AI sector
report = generate_sector_report("Artificial Intelligence")
print(f"\n{report['sector']} Sector Report")
print(f"Generated: {report['generated_at']}\n")
print("Top 5 funded companies:")
for c in report['top_companies'][:5]:
print(f" {c['name']:<35} ${c['total_funding_m']}M {c['last_stage']}")
Finding Active Investors by Sector
Cross-reference funding rounds with investor names to find the most active VCs in a space:
import json
import sqlite3
from collections import Counter
def find_active_investors(
sector_keyword: str,
min_investments: int = 3,
db_path: str = "crunchbase.db",
) -> list:
"""Find the most active investors in a sector."""
conn = sqlite3.connect(db_path)
rows = conn.execute("""
SELECT r.lead_investors, r.all_investors,
r.funding_type, r.announced_date
FROM funding_rounds r
JOIN companies c ON c.slug = r.company_slug
WHERE c.categories LIKE ?
AND r.amount_usd IS NOT NULL
""", (f'%{sector_keyword}%',)).fetchall()
conn.close()
investor_counts = Counter()
investor_stages = {}
for row in rows:
lead = json.loads(row[0] or "[]")
all_inv = json.loads(row[1] or "[]")
stage = row[2]
for inv in lead:
investor_counts[inv] += 2 # Lead counts double
if inv not in investor_stages:
investor_stages[inv] = []
investor_stages[inv].append(stage)
for inv in all_inv:
investor_counts[inv] += 1
results = [
{
"name": name,
"investment_score": count,
"preferred_stages": Counter(investor_stages.get(name, [])).most_common(3),
}
for name, count in investor_counts.most_common(30)
if investor_counts[name] >= min_investments
]
return results
# Find most active AI investors
investors = find_active_investors("Artificial Intelligence", min_investments=2)
print("Most active AI investors:")
for inv in investors[:10]:
stages = ", ".join(f"{s[0]}({s[1]})" for s in inv["preferred_stages"])
print(f" {inv['name']:<30} score={inv['investment_score']} stages: {stages}")
Startup Discovery Pipeline
Combine autocomplete search with trend detection to discover emerging startups:
import time
import random
from fake_useragent import UserAgent
ua = UserAgent()
EMERGING_KEYWORDS = [
"AI agents 2026",
"quantum computing startup",
"climate fintech",
"synthetic biology",
"space tech startup",
"web3 infrastructure",
"robotics automation",
"longevity biotech",
]
def discover_emerging_startups(
keywords: list,
proxy: str = None,
db_path: str = "crunchbase.db",
) -> list:
"""
Search Crunchbase for companies matching emerging tech keywords.
Filters to recently founded or recently funded companies.
"""
conn = init_crunchbase_db(db_path)
discovered = []
for keyword in keywords:
print(f"Searching: {keyword}")
results = search_crunchbase_autocomplete(keyword, proxy=proxy)
for r in results:
slug = r.get("slug")
if not slug:
continue
# Quick save from autocomplete
save_company(conn, {
"slug": slug,
"name": r.get("name"),
"short_description": r.get("short_description"),
}, source="autocomplete_emerging")
discovered.append(r)
time.sleep(random.uniform(8, 15))
conn.close()
return discovered
# Run discovery
new_companies = discover_emerging_startups(EMERGING_KEYWORDS)
print(f"Discovered {len(new_companies)} companies in emerging sectors")
Handling Paywalls and Content Gating
Crunchbase increasingly gates content. Here is how to detect and handle it:
import re
def detect_crunchbase_paywall(html: str) -> str:
"""Detect what type of content restriction is in place."""
if "upgrade to crunchbase pro" in html.lower():
return "pro_paywall"
if "sign up to see" in html.lower():
return "signup_required"
if "create a free account" in html.lower():
return "free_account_required"
if "log in" in html.lower() and "crunchbase" in html.lower():
return "login_required"
if not re.search(r'"name"\s*:\s*"[^"]+"', html):
return "empty_response"
return "ok"
def scrape_with_paywall_fallback(
slug: str,
api_key: str,
proxy: str = None,
) -> dict:
"""
Try API first, fall back to page scraping, handle paywalls gracefully.
"""
# Try official API first
org = fetch_with_retry(fetch_organization, slug, api_key, proxy=proxy)
if "error" not in org:
return org
# API failed or rate limited -- try page scraping
print(f" API failed for {slug}, trying page scrape")
scraped = scrape_crunchbase_page(slug, proxy=proxy)
if "error" in scraped:
return scraped
    # The paywall detector expects raw HTML, but the scraper returns a dict,
    # so check the extracted fields instead -- gated pages lack JSON-LD
    if not scraped.get("name"):
        return {"slug": slug, "error": "paywall_or_empty"}
    return scraped
Complete Reference: Field Availability by Method
| Field | Free API | Autocomplete | Page Scrape |
|---|---|---|---|
| Company name | Yes | Yes | Yes |
| Short description | Yes | Yes | Yes |
| Founded date | Yes | No | Sometimes |
| Headquarters | Yes | No | Sometimes |
| Employee count | Yes | No | Sometimes |
| Total funding | Yes | No | Sometimes |
| Last funding type | Yes | No | Sometimes |
| Number of rounds | Yes | No | No |
| Categories | Yes | Yes (tags) | Sometimes |
| Website | Yes | No | No |
| Founders | Yes | No | Sometimes |
| LinkedIn | Yes | No | No |
| IPO status | Yes | No | No |
| Funding rounds detail | Yes (separate endpoint) | No | No |
| Investor names | Yes (separate endpoint) | No | No |
The free API at 200 calls/day is by far the most data-rich approach. Autocomplete is useful for bulk discovery. Page scraping is a last resort for data not in the API.
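The table implies a simple dispatch rule: spend an API call only when a required field is API-only, and use the cheaper method otherwise. A rough sketch encoding a subset of the table above (field names and the `choose_method` helper are illustrative):

```python
# Fields available per method, per the table above (subset)
AUTOCOMPLETE_FIELDS = {"name", "short_description", "categories"}
API_ONLY_FIELDS = {"website", "linkedin", "ipo_status", "num_funding_rounds",
                   "funding_rounds", "investors"}

def choose_method(required_fields: set[str], api_budget_left: int) -> str:
    """Pick the cheapest access method that covers the required fields."""
    if required_fields <= AUTOCOMPLETE_FIELDS:
        return "autocomplete"
    if required_fields & API_ONLY_FIELDS:
        return "api" if api_budget_left > 0 else "unavailable"
    # Remaining fields are only 'sometimes' on pages -- prefer the API,
    # fall back to page scraping when the daily budget is gone
    return "api" if api_budget_left > 0 else "page_scrape"
```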
Key Takeaways
- Crunchbase's autocomplete endpoint at https://www.crunchbase.com/v4/data/autocompletes returns basic company data without authentication -- useful for bulk discovery
- The free REST API (200 calls/day) is the best approach for enriched data including funding rounds and investor details
- Cloudflare with full bot management protects all Crunchbase pages -- residential proxies are required for web scraping
- ThorData's residential proxy network passes Cloudflare's ASN checks; use it for both autocomplete requests and page scraping
- Store data in SQLite with separate tables for companies and funding rounds, linked by slug
- Crunchbase aggressively enforces their ToS against data resellers -- use the data for internal research, not for building competing databases