
How to Scrape F6S in 2026: Startups, Funding & Accelerator Programs


F6S is one of the largest startup platforms on the web — over 5 million founders, thousands of active accelerator programs, and a deep catalog of funding data. If you are doing deal sourcing, competitive research, or building an accelerator discovery tool, this is where the data lives. Here is how to get it out.

What You Can Extract

F6S exposes a wide range of structured data across multiple sections of their platform:

Startup profiles:

- Startup name, tagline, and full description
- Team size and employee count
- Funding stage (pre-seed, seed, Series A, Series B, etc.)
- Total capital raised (when disclosed)
- Industry and sector tags
- Country and city of operation
- Founding date and years in operation
- Website URL and social media links

Founder and team data:

- Founder names and roles
- LinkedIn profile URLs
- Prior company experience
- Education background (when listed)
- Number of previous exits

Accelerator programs:

- Program names and organizing entities
- Application open/close deadlines
- Equity percentage taken
- Cash stipend per cohort
- Stage requirements (pre-seed only, post-revenue, etc.)
- Program location or remote status
- Mentor and investor network size
- Cohort size (number of companies accepted)
- Alumni companies (searchable by program)

Deal flow and investment signals:

- Recent funding rounds posted publicly
- Investor names attached to rounds
- Lead investor identification
- Co-investor lists


F6S Anti-Bot Measures

F6S is server-side rendered, which makes parsing easier than dealing with a JavaScript-heavy SPA. That said, they have meaningful protections in place that you need to understand before writing your first request.

1. Rate Limiting on Directory Endpoints

The startup directory and program listing endpoints throttle requests aggressively. In testing, a single IP can make roughly 30-40 requests per minute before seeing HTTP 429 responses. Individual profile pages are more lenient (around 20-25 per minute) but still tracked.
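Those numbers argue for pacing yourself client-side rather than reacting to 429s after the fact. Here is a minimal throttle sketch; the interval defaults are judgment calls based on the thresholds above, not documented limits:

```python
import time
import random

class Throttle:
    """Enforce a minimum delay (plus jitter) between outgoing requests."""

    def __init__(self, min_interval: float = 3.0, jitter: float = 1.5):
        # ~3-4.5s between requests works out to roughly 15-20 req/min,
        # comfortably under the ~30-40/min directory threshold observed above
        self.min_interval = min_interval
        self.jitter = jitter
        self._last = 0.0

    def wait(self) -> None:
        """Block until enough time has passed since the previous request."""
        target = self.min_interval + random.uniform(0, self.jitter)
        elapsed = time.monotonic() - self._last
        if elapsed < target:
            time.sleep(target - elapsed)
        self._last = time.monotonic()

throttle = Throttle()
# Call throttle.wait() immediately before every request to F6S.
```

Because the throttle tracks elapsed time rather than sleeping a flat amount, any time spent parsing responses counts toward the interval for free.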

2. Session Cookie Requirements

Sessions are tracked server-side using cookies set during your first visit. Scraping without a valid session cookie will return degraded responses — either empty arrays where you expect data, or redirects to the login page. Always seed your session by visiting the main page first before hitting API endpoints.

3. CAPTCHA on Profile Pages

Rapid sequential requests to /company/ profile paths will surface a CAPTCHA challenge. The trigger threshold is roughly 10 profile requests in under 30 seconds from the same IP. Space your requests with at least 2-5 seconds of jitter.

4. User-Agent Filtering

Default user-agent strings from Python libraries (like python-httpx/0.x.x or python-requests/2.x.x) are filtered. You will receive either empty responses or HTTP 403. Always use a realistic browser user-agent string from a recent Chrome or Firefox version.

5. Referrer Header Validation

Some internal API endpoints check the Referer header and expect requests to originate from an F6S page (e.g. https://www.f6s.com/companies). Missing or mismatched referrers cause empty responses even when authentication is otherwise valid.

6. IP-Based Blocking

Sustained scraping over 15-20 minutes from a single IP leads to a soft block. Soft blocks manifest as HTTP 429 responses or silent redirects to a challenge page (not a real CAPTCHA, just a JS challenge page that scraper clients cannot handle). Once blocked, the IP stays blocked for 4-12 hours in testing.


Method 1: The Internal JSON API

F6S loads its startup directory via XHR calls to internal JSON endpoints. Open Chrome DevTools, navigate to the Network tab, filter by Fetch/XHR, and browse to f6s.com/companies. You will see calls to something like /api/v2/companies with query parameters for pagination and filtering.

This is the cleanest extraction path — structured JSON, no HTML parsing required.

import httpx
import time
import random
import json

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.f6s.com/companies",
    "X-Requested-With": "XMLHttpRequest",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
}

def seed_session(client: httpx.Client) -> None:
    """Visit the main directory page to establish a valid session cookie."""
    client.get("https://www.f6s.com/companies", headers={
        **HEADERS,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    })
    # Give the server a moment to register the session
    time.sleep(random.uniform(1.5, 3.0))

def fetch_startups_page(client: httpx.Client, page: int, filters: dict | None = None) -> list[dict]:
    """Fetch one page of startup listings from the internal API."""
    url = "https://www.f6s.com/api/v2/companies"
    params = {
        "page": page,
        "per_page": 24,
        "sort": "trending",
    }
    if filters:
        params.update(filters)

    resp = client.get(url, params=params, headers=HEADERS, timeout=20)
    resp.raise_for_status()
    data = resp.json()
    return data.get("companies", [])

def scrape_all_startups(max_pages: int = 10, filters: dict | None = None) -> list[dict]:
    """Scrape startup listings with pagination and polite delays."""
    results = []

    with httpx.Client(follow_redirects=True) as client:
        seed_session(client)

        for page in range(1, max_pages + 1):
            try:
                companies = fetch_startups_page(client, page, filters)
                if not companies:
                    print(f"No results on page {page}, stopping.")
                    break

                results.extend(companies)
                print(f"Page {page}: fetched {len(companies)} companies (total: {len(results)})")

            except httpx.HTTPStatusError as e:
                if e.response.status_code == 429:
                    print(f"Rate limited on page {page}. Waiting 60 seconds.")
                    time.sleep(60)
                    # Retry once after waiting
                    try:
                        companies = fetch_startups_page(client, page, filters)
                        results.extend(companies)
                    except Exception:
                        print("Still rate limited after retry. Stopping.")
                        break
                else:
                    print(f"HTTP {e.response.status_code} on page {page}")
                    break

            except httpx.TimeoutException:
                print(f"Timeout on page {page}, retrying once...")
                time.sleep(5)
                try:
                    companies = fetch_startups_page(client, page, filters)
                    results.extend(companies)
                except Exception:
                    print(f"Retry after timeout failed on page {page}, skipping.")

            # Polite delay between pages: 2-5 seconds with jitter
            time.sleep(random.uniform(2, 5))

    return results

# Example: scrape fintech startups in Europe
fintech_results = scrape_all_startups(
    max_pages=5,
    filters={"sector": "fintech", "country": "Europe"}
)
print(f"Total: {len(fintech_results)} startups")

Filtering by Stage and Sector

The F6S API accepts filter parameters that correspond to the faceted search UI on the website:

# Available filter parameters (discovered via DevTools inspection)
STAGE_FILTERS = {
    "idea": 1,
    "pre_seed": 2,
    "seed": 3,
    "series_a": 4,
    "series_b_plus": 5,
    "profitable": 6,
}

SECTOR_FILTERS = [
    "fintech", "healthtech", "edtech", "saas", "marketplace",
    "ai-ml", "crypto-web3", "cleantech", "govtech", "legaltech",
    "proptech", "foodtech", "regtech", "insurtech", "biotech",
]

# Filter to seed-stage AI/ML startups
results = scrape_all_startups(
    max_pages=10,
    filters={"stage": STAGE_FILTERS["seed"], "sector": "ai-ml", "sort": "newest"}
)

Method 2: HTML Parsing Individual Profiles

For detailed profile data that does not appear in the JSON API response — long-form descriptions, team bios, program history — you scrape the HTML profile pages directly using BeautifulSoup.

import httpx
from bs4 import BeautifulSoup
import time
import random
import re

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.f6s.com/companies",
}

def parse_startup_profile(html: str) -> dict:
    """Extract structured data from an F6S company profile page."""
    soup = BeautifulSoup(html, "html.parser")
    data = {}

    # Company name (usually in an h1)
    name_el = soup.select_one("h1.profile-name, h1[itemprop=name], .company-name h1")
    data["name"] = name_el.get_text(strip=True) if name_el else None

    # Short tagline / one-liner
    tagline_el = soup.select_one(".profile-tagline, .company-tagline, meta[name=description]")
    if tagline_el:
        if tagline_el.name == "meta":
            data["tagline"] = tagline_el.get("content", "")
        else:
            data["tagline"] = tagline_el.get_text(strip=True)
    else:
        data["tagline"] = None

    # Full description text
    desc_el = soup.select_one("div.profile-description, div.company-description, [itemprop=description]")
    data["description"] = desc_el.get_text(" ", strip=True) if desc_el else None

    # Location
    location_el = soup.select_one("span.profile-location, .company-location, [itemprop=addressLocality]")
    data["location"] = location_el.get_text(strip=True) if location_el else None

    # Funding stage label
    stage_el = soup.select_one("span[data-field=funding_stage], .funding-stage, .stage-badge")
    data["funding_stage"] = stage_el.get_text(strip=True) if stage_el else None

    # Founded year
    founded_el = soup.select_one("span[data-field=founded], .founded-year, [itemprop=foundingDate]")
    data["founded"] = founded_el.get_text(strip=True) if founded_el else None

    # Employee count
    employees_el = soup.select_one("span[data-field=team_size], .employee-count")
    data["employees"] = employees_el.get_text(strip=True) if employees_el else None

    # Website URL
    website_el = soup.select_one("a[data-field=website], a.company-website, [itemprop=url]")
    data["website"] = website_el.get("href") if website_el else None

    # Sector/industry tags
    data["tags"] = [t.get_text(strip=True) for t in soup.select("a.profile-tag, .sector-tag, .industry-tag")]

    # Founder/team members
    founders = []
    for card in soup.select(".team-member, .founder-card, [itemprop=employee]"):
        name_el = card.select_one(".member-name, [itemprop=name]")
        role_el = card.select_one(".member-role, [itemprop=jobTitle]")
        linkedin_el = card.select_one("a[href*=linkedin]")
        founders.append({
            "name": name_el.get_text(strip=True) if name_el else None,
            "role": role_el.get_text(strip=True) if role_el else None,
            "linkedin": linkedin_el.get("href") if linkedin_el else None,
        })
    data["founders"] = founders

    # Social media links
    data["linkedin"] = None
    data["twitter"] = None
    for link in soup.select('a[href*="linkedin.com/company"], a[href*="twitter.com"], a[href*="x.com"]'):
        href = link.get("href", "")
        if "linkedin" in href:
            data["linkedin"] = href
        elif "twitter" in href or "x.com" in href:
            data["twitter"] = href

    return data

def scrape_profile(client: httpx.Client, slug: str) -> dict:
    """Scrape a single company profile by its slug."""
    url = f"https://www.f6s.com/company/{slug}"
    resp = client.get(url, headers=HEADERS, timeout=20)
    resp.raise_for_status()
    profile = parse_startup_profile(resp.text)
    profile["slug"] = slug
    profile["url"] = url
    return profile

def scrape_profiles_batch(slugs: list[str], delay_min: float = 2.5, delay_max: float = 5.0) -> list[dict]:
    """Scrape multiple company profiles with polite delays."""
    results = []
    with httpx.Client(follow_redirects=True) as client:
        # Seed a session first
        client.get("https://www.f6s.com/companies", headers=HEADERS)
        time.sleep(2)

        for i, slug in enumerate(slugs):
            try:
                profile = scrape_profile(client, slug)
                results.append(profile)
                print(f"[{i+1}/{len(slugs)}] Scraped: {profile.get('name', slug)}")
            except httpx.HTTPStatusError as e:
                print(f"HTTP {e.response.status_code} for {slug}")
            except Exception as ex:
                print(f"Error scraping {slug}: {ex}")

            time.sleep(random.uniform(delay_min, delay_max))

    return results

Scraping Accelerator Programs

The F6S accelerator directory at /programs is one of the most valuable sections. Program cards expose deadlines, equity terms, cohort details, and alumni data in structured HTML.

def parse_program_card(card) -> dict:
    """Extract data from a single program listing card."""
    program = {}

    name_el = card.select_one("h3.program-title, .program-name, h2 a")
    program["name"] = name_el.get_text(strip=True) if name_el else None

    organizer_el = card.select_one(".program-organizer, .accelerator-name")
    program["organizer"] = organizer_el.get_text(strip=True) if organizer_el else None

    deadline_el = card.select_one("span.program-deadline, .deadline-label, time[datetime]")
    if deadline_el:
        program["deadline"] = deadline_el.get("datetime") or deadline_el.get_text(strip=True)
    else:
        program["deadline"] = None

    equity_el = card.select_one("span.program-equity, .equity-take, [data-field=equity]")
    program["equity"] = equity_el.get_text(strip=True) if equity_el else None

    stipend_el = card.select_one(".program-stipend, [data-field=investment]")
    program["stipend"] = stipend_el.get_text(strip=True) if stipend_el else None

    location_el = card.select_one(".program-location, [data-field=location]")
    program["location"] = location_el.get_text(strip=True) if location_el else None

    stage_el = card.select_one(".program-stage, [data-field=stage_requirements]")
    program["stage_requirements"] = stage_el.get_text(strip=True) if stage_el else None

    link_el = card.select_one("a.program-link, h3 a, .program-cta")
    program["url"] = link_el.get("href") if link_el else None

    return program

def scrape_programs(max_pages: int = 10, status: str = "open") -> list[dict]:
    """Scrape accelerator and incubator program listings."""
    programs = []

    with httpx.Client(follow_redirects=True) as client:
        client.get("https://www.f6s.com/programs", headers=HEADERS)
        time.sleep(2)

        for page in range(1, max_pages + 1):
            url = f"https://www.f6s.com/programs?page={page}&status={status}"
            resp = client.get(url, headers=HEADERS, timeout=20)
            resp.raise_for_status()

            soup = BeautifulSoup(resp.text, "html.parser")
            cards = soup.select("div.program-card, article.program-listing, .accelerator-card")

            if not cards:
                print(f"No program cards found on page {page}. Stopping.")
                break

            for card in cards:
                prog = parse_program_card(card)
                if prog.get("name"):
                    programs.append(prog)

            print(f"Page {page}: found {len(cards)} programs (total: {len(programs)})")
            time.sleep(random.uniform(2, 4))

    return programs

# Scrape all currently open programs
open_programs = scrape_programs(max_pages=5, status="open")
print(f"Open accelerator programs: {len(open_programs)}")

# Filter to equity-free programs
equity_free = [p for p in open_programs if p.get("equity") in (None, "0%", "No equity")]
print(f"Equity-free programs: {len(equity_free)}")

Extracting Founder Profiles and LinkedIn Connections

Founder data on F6S is particularly valuable for investor deal flow and talent research:

def parse_founder_profile(html: str) -> dict:
    """Extract data from an F6S founder/person profile."""
    soup = BeautifulSoup(html, "html.parser")
    data = {}

    data["name"] = None
    name_el = soup.select_one("h1.founder-name, h1[itemprop=name], .profile-header h1")
    if name_el:
        data["name"] = name_el.get_text(strip=True)

    data["title"] = None
    title_el = soup.select_one(".founder-title, [itemprop=jobTitle], .current-role")
    if title_el:
        data["title"] = title_el.get_text(strip=True)

    data["bio"] = None
    bio_el = soup.select_one(".founder-bio, [itemprop=description], .profile-bio")
    if bio_el:
        data["bio"] = bio_el.get_text(" ", strip=True)

    data["location"] = None
    loc_el = soup.select_one(".founder-location, [itemprop=addressLocality]")
    if loc_el:
        data["location"] = loc_el.get_text(strip=True)

    # Skills and expertise tags
    data["skills"] = [s.get_text(strip=True) for s in soup.select(".skill-tag, .expertise-tag, .founder-skill")]

    # Companies they are associated with
    data["companies"] = []
    for co in soup.select(".founder-company, .associated-startup, [itemprop=worksFor]"):
        name = co.select_one(".company-name, [itemprop=name]")
        role = co.select_one(".role-label, [itemprop=jobTitle]")
        data["companies"].append({
            "name": name.get_text(strip=True) if name else None,
            "role": role.get_text(strip=True) if role else None,
        })

    # Social links
    data["linkedin"] = None
    data["twitter"] = None
    data["github"] = None
    for link in soup.select("a[href*=linkedin], a[href*=twitter], a[href*=github]"):
        href = link.get("href", "")
        if "linkedin" in href:
            data["linkedin"] = href
        elif "twitter" in href or "x.com" in href:
            data["twitter"] = href
        elif "github" in href:
            data["github"] = href

    return data

def scrape_startup_founders(startup_slug: str, client: httpx.Client) -> list[dict]:
    """Get founder profiles for a specific startup."""
    profile_html = client.get(
        f"https://www.f6s.com/company/{startup_slug}",
        headers=HEADERS, timeout=20
    ).text

    soup = BeautifulSoup(profile_html, "html.parser")
    founder_links = []

    for link in soup.select('a[href*="/people/"], a[href*="/founder/"]'):
        href = link.get("href", "")
        if href and href not in founder_links:
            founder_links.append(href)

    founders = []
    for link in founder_links[:5]:  # limit to top 5
        full_url = link if link.startswith("http") else f"https://www.f6s.com{link}"
        try:
            resp = client.get(full_url, headers=HEADERS, timeout=20)
            founder_data = parse_founder_profile(resp.text)
            founders.append(founder_data)
            time.sleep(random.uniform(2, 4))
        except Exception as e:
            print(f"Error scraping founder {link}: {e}")

    return founders

Bypassing Rate Limits with Proxy Rotation

After a few hundred requests, F6S will start blocking your IP with soft blocks that can last 4-12 hours. For any serious data collection effort, you need IP rotation.

I route my requests through ThorData's residential proxy pool to keep sessions clean and avoid blocks. Residential IPs are legitimate household internet connections — they look completely normal to F6S's bot detection systems, unlike datacenter IPs which are trivially fingerprinted.

ThorData offers sticky sessions (keeping the same IP for extended periods) and rotating sessions (new IP per request or every N seconds). For F6S scraping, sticky sessions with rotation every 3-5 minutes work best: you build up a valid session cookie, make 20-30 requests, then rotate to a fresh IP before hitting the per-IP threshold.

import httpx
import time
import random

# ThorData proxy configuration
# Get credentials at: https://thordata.partnerstack.com/partner/0a0x4nzh
THORDATA_USER = "your_thordata_username"
THORDATA_PASS = "your_thordata_password"
THORDATA_GATEWAY = "gate.thordata.net"
THORDATA_PORT = 7777

def build_proxy_url(session_id: str | None = None, country: str = "US") -> str:
    """Build a ThorData proxy URL with optional sticky session."""
    if session_id:
        # Sticky session: same IP for duration of session_id
        user = f"{THORDATA_USER}-session-{session_id}-country-{country}"
    else:
        # Rotating: new IP per request
        user = f"{THORDATA_USER}-country-{country}"

    return f"http://{user}:{THORDATA_PASS}@{THORDATA_GATEWAY}:{THORDATA_PORT}"

class F6SScraper:
    """F6S scraper with automatic proxy rotation and session management."""

    REQUESTS_PER_SESSION = 25  # rotate IP every 25 requests
    SESSION_SEED_DELAY = 2.0    # seconds to wait after seeding session

    def __init__(self, use_proxy: bool = True):
        self.use_proxy = use_proxy
        self.request_count = 0
        self.session_id = random.randint(100000, 999999)
        self.client = self._build_client()

    def _build_client(self) -> httpx.Client:
        """Build httpx client with or without proxy."""
        kwargs = {"follow_redirects": True, "timeout": 20}

        if self.use_proxy:
            proxy_url = build_proxy_url(
                session_id=str(self.session_id),
                country="US"
            )
            # httpx 0.26+ takes a single `proxy` argument; on older versions,
            # pass proxies={"http://": proxy_url, "https://": proxy_url} instead
            kwargs["proxy"] = proxy_url

        client = httpx.Client(**kwargs)

        # Seed the session
        try:
            client.get("https://www.f6s.com/companies", headers=HEADERS)
            time.sleep(self.SESSION_SEED_DELAY)
        except Exception as e:
            print(f"Session seed failed: {e}")

        return client

    def _maybe_rotate(self) -> None:
        """Rotate proxy session after threshold is reached."""
        if self.use_proxy and self.request_count >= self.REQUESTS_PER_SESSION:
            print(f"Rotating proxy after {self.request_count} requests...")
            self.client.close()
            self.session_id = random.randint(100000, 999999)
            self.request_count = 0
            self.client = self._build_client()

    def get(self, url: str, **kwargs) -> httpx.Response:
        """Make a GET request with automatic proxy rotation."""
        self._maybe_rotate()
        resp = self.client.get(url, headers=HEADERS, **kwargs)
        self.request_count += 1
        return resp

    def close(self):
        self.client.close()

# Usage
scraper = F6SScraper(use_proxy=True)
try:
    resp = scraper.get("https://www.f6s.com/api/v2/companies?page=1")
    data = resp.json()
    print(f"Fetched {len(data.get('companies', []))} companies")
finally:
    scraper.close()

Building a Startup Intelligence Database

For ongoing deal flow monitoring, store everything in SQLite so you can query across runs, track changes, and deduplicate efficiently.

import sqlite3
from datetime import datetime, timezone

def init_db(path: str = "f6s_startups.db") -> sqlite3.Connection:
    """Initialize the startup intelligence database with all required tables."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row

    conn.executescript("""
        CREATE TABLE IF NOT EXISTS startups (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            slug TEXT UNIQUE NOT NULL,
            name TEXT,
            tagline TEXT,
            description TEXT,
            location TEXT,
            country TEXT,
            funding_stage TEXT,
            founded_year INTEGER,
            employee_count TEXT,
            website TEXT,
            linkedin TEXT,
            twitter TEXT,
            fetched_at TEXT NOT NULL,
            updated_at TEXT
        );

        CREATE TABLE IF NOT EXISTS startup_tags (
            startup_slug TEXT NOT NULL,
            tag TEXT NOT NULL,
            UNIQUE(startup_slug, tag),
            FOREIGN KEY (startup_slug) REFERENCES startups(slug)
        );

        CREATE TABLE IF NOT EXISTS founders (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            startup_slug TEXT NOT NULL,
            name TEXT,
            role TEXT,
            linkedin TEXT,
            twitter TEXT,
            github TEXT,
            bio TEXT,
            FOREIGN KEY (startup_slug) REFERENCES startups(slug)
        );

        CREATE TABLE IF NOT EXISTS programs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT NOT NULL,
            organizer TEXT,
            deadline TEXT,
            equity TEXT,
            stipend TEXT,
            location TEXT,
            stage_requirements TEXT,
            url TEXT,
            status TEXT DEFAULT 'open',
            fetched_at TEXT NOT NULL,
            UNIQUE(name, organizer)
        );

        CREATE TABLE IF NOT EXISTS startup_programs (
            startup_slug TEXT,
            program_id INTEGER,
            status TEXT,  -- applied, accepted, rejected, alum
            UNIQUE(startup_slug, program_id),
            FOREIGN KEY (startup_slug) REFERENCES startups(slug),
            FOREIGN KEY (program_id) REFERENCES programs(id)
        );

        CREATE INDEX IF NOT EXISTS idx_startups_stage ON startups(funding_stage);
        CREATE INDEX IF NOT EXISTS idx_startups_country ON startups(country);
        CREATE INDEX IF NOT EXISTS idx_startups_fetched ON startups(fetched_at);
        CREATE INDEX IF NOT EXISTS idx_programs_deadline ON programs(deadline);
    """)

    conn.commit()
    return conn

def insert_startup(conn: sqlite3.Connection, data: dict) -> None:
    """Insert or update a startup record."""
    now = datetime.now(timezone.utc).isoformat()

    conn.execute("""
        INSERT INTO startups (slug, name, tagline, description, location, funding_stage,
                              founded_year, employee_count, website, linkedin, twitter, fetched_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(slug) DO UPDATE SET
            name = excluded.name,
            tagline = excluded.tagline,
            description = excluded.description,
            location = excluded.location,
            funding_stage = excluded.funding_stage,
            founded_year = excluded.founded_year,
            employee_count = excluded.employee_count,
            website = excluded.website,
            linkedin = excluded.linkedin,
            twitter = excluded.twitter,
            updated_at = ?
    """, (
        data.get("slug"), data.get("name"), data.get("tagline"),
        data.get("description"), data.get("location"), data.get("funding_stage"),
        data.get("founded"), data.get("employees"), data.get("website"),
        data.get("linkedin"), data.get("twitter"), now, now
    ))

    # Insert tags
    for tag in data.get("tags", []):
        conn.execute(
            "INSERT OR IGNORE INTO startup_tags (startup_slug, tag) VALUES (?, ?)",
            (data.get("slug"), tag)
        )

    conn.commit()

def insert_founder(conn: sqlite3.Connection, startup_slug: str, founder: dict) -> None:
    """Insert a founder record."""
    conn.execute("""
        INSERT OR IGNORE INTO founders (startup_slug, name, role, linkedin, twitter, github, bio)
        VALUES (?, ?, ?, ?, ?, ?, ?)
    """, (
        startup_slug, founder.get("name"), founder.get("role"),
        founder.get("linkedin"), founder.get("twitter"), founder.get("github"),
        founder.get("bio")
    ))
    conn.commit()

def insert_program(conn: sqlite3.Connection, program: dict) -> int | None:
    """Insert or update a program record, return the row ID."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute("""
        INSERT INTO programs (name, organizer, deadline, equity, stipend, location,
                              stage_requirements, url, fetched_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(name, organizer) DO UPDATE SET
            deadline = excluded.deadline,
            equity = excluded.equity,
            stipend = excluded.stipend,
            location = excluded.location
    """, (
        program.get("name"), program.get("organizer"), program.get("deadline"),
        program.get("equity"), program.get("stipend"), program.get("location"),
        program.get("stage_requirements"), program.get("url"), now
    ))
    conn.commit()
    # lastrowid is unreliable on the conflict/update path, so look the row up
    row = conn.execute(
        "SELECT id FROM programs WHERE name = ? AND organizer IS ?",
        (program.get("name"), program.get("organizer"))
    ).fetchone()
    return row["id"] if row else None

# Query examples
def query_startups_by_stage(conn: sqlite3.Connection, stage: str) -> list[sqlite3.Row]:
    return conn.execute(
        "SELECT * FROM startups WHERE funding_stage LIKE ? ORDER BY fetched_at DESC",
        (f"%{stage}%",)
    ).fetchall()

def query_programs_closing_soon(conn: sqlite3.Connection, days: int = 30) -> list[sqlite3.Row]:
    """Find programs with deadlines in the next N days."""
    from datetime import timedelta
    cutoff = (datetime.now(timezone.utc) + timedelta(days=days)).date().isoformat()
    return conn.execute(
        "SELECT * FROM programs WHERE deadline <= ? AND deadline >= date('now') ORDER BY deadline",
        (cutoff,)
    ).fetchall()

Complete Pipeline: Scrape, Store, Export

Putting it all together into a complete pipeline:

import json
import csv
import time
import random
from pathlib import Path

def run_pipeline(
    max_startup_pages: int = 5,
    max_program_pages: int = 3,
    db_path: str = "f6s_data.db",
    export_dir: str = "./exports",
    use_proxy: bool = True,
) -> dict:
    """
    Full F6S scraping pipeline.
    Returns summary statistics.
    """
    stats = {"startups": 0, "founders": 0, "programs": 0, "errors": 0}

    Path(export_dir).mkdir(exist_ok=True)
    conn = init_db(db_path)
    scraper = F6SScraper(use_proxy=use_proxy)

    try:
        # 1. Scrape startup directory
        print("--- Phase 1: Scraping startup directory ---")
        raw_startups = scrape_all_startups(max_pages=max_startup_pages)

        # 2. Enrich each startup with profile data
        print("--- Phase 2: Enriching startup profiles ---")
        for raw in raw_startups:
            slug = raw.get("slug") or raw.get("id")
            if not slug:
                continue

            try:
                resp = scraper.get(f"https://www.f6s.com/company/{slug}")
                profile = parse_startup_profile(resp.text)
                profile["slug"] = slug
                insert_startup(conn, profile)
                stats["startups"] += 1

                # Scrape founders too (throttled)
                founders = scrape_startup_founders(slug, scraper.client)
                for f in founders:
                    insert_founder(conn, slug, f)
                    stats["founders"] += 1

            except Exception as e:
                print(f"Error enriching {slug}: {e}")
                stats["errors"] += 1

            time.sleep(random.uniform(3, 6))

        # 3. Scrape programs
        print("--- Phase 3: Scraping accelerator programs ---")
        programs = scrape_programs(max_pages=max_program_pages)
        for program in programs:
            insert_program(conn, program)
            stats["programs"] += 1

        # 4. Export to CSV
        print("--- Phase 4: Exporting data ---")
        startups_data = conn.execute("SELECT * FROM startups").fetchall()
        if startups_data:
            with open(f"{export_dir}/startups.csv", "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                writer.writerow(startups_data[0].keys())
                writer.writerows(startups_data)
            print(f"Exported {len(startups_data)} startups to {export_dir}/startups.csv")

        programs_data = conn.execute("SELECT * FROM programs").fetchall()
        if programs_data:
            with open(f"{export_dir}/programs.csv", "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                writer.writerow(programs_data[0].keys())
                writer.writerows(programs_data)
            print(f"Exported {len(programs_data)} programs to {export_dir}/programs.csv")

    finally:
        scraper.close()
        conn.close()

    print(f"Pipeline complete: {stats}")
    return stats

if __name__ == "__main__":
    results = run_pipeline(
        max_startup_pages=10,
        max_program_pages=5,
        use_proxy=True,
    )
    print(json.dumps(results, indent=2))
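
One practical refinement to the pipeline above: long runs get interrupted, so it pays to make enrichment idempotent. A minimal sketch of a resume check (assuming your startups table is keyed by slug; adjust to your actual schema) that lets a re-run skip profiles already in the database:

```python
import sqlite3

# Stand-in for the pipeline's database; in a real run, reuse `conn`
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE startups (slug TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO startups VALUES ('acme-co', 'Acme')")

def already_scraped(conn: sqlite3.Connection, slug: str) -> bool:
    """True if this slug is already stored locally, so an interrupted
    pipeline run can resume without re-fetching finished profiles."""
    row = conn.execute(
        "SELECT 1 FROM startups WHERE slug = ?", (slug,)
    ).fetchone()
    return row is not None
```

Adding `if already_scraped(conn, slug): continue` at the top of the enrichment loop turns a crash halfway through into a cheap restart instead of a full re-scrape.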

Alternative: Using a Scraping API Instead of DIY

If you do not want to manage proxy rotation, session handling, and CAPTCHA avoidance yourself, a commercial scraping API can handle all of that infrastructure for you. ThorData provides a proxy API where you send your requests through their gateway and get back the response after residential IP rotation, CAPTCHA solving, and JavaScript rendering.

The integration is minimal:

import httpx

THORDATA_API_KEY = "your_api_key_here"

def thordata_fetch(url: str, render_js: bool = False) -> str:
    """Fetch a URL through ThorData proxy infrastructure."""
    endpoint = "https://api.thordata.com/scrape"
    payload = {
        "url": url,
        "render_js": render_js,
        "country": "US",
        "session_type": "sticky",
    }

    resp = httpx.post(
        endpoint,
        json=payload,
        headers={"Authorization": f"Bearer {THORDATA_API_KEY}"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json().get("html", "")

# Parse the returned HTML just as you would a direct response
from bs4 import BeautifulSoup

html = thordata_fetch("https://www.f6s.com/companies", render_js=True)
soup = BeautifulSoup(html, "html.parser")

This approach trades control for convenience — useful if you are doing occasional research rather than building a continuous pipeline.
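
A hybrid pattern splits the difference: make the cheap direct request first and route through the paid API only when you get blocked. A sketch with stand-in fetchers (`BlockedError` and both fetcher functions are illustrative names, not part of any library; in practice `direct_fetch` would wrap your `scraper.get()` and `api_fetch` would wrap `thordata_fetch()`):

```python
class BlockedError(Exception):
    """Raised when a direct request hits a 403/429 or a CAPTCHA wall."""

def fetch_with_fallback(url: str, direct_fetch, api_fetch) -> str:
    """Try the cheap direct fetcher first; pay for the API gateway
    only when the direct path is blocked."""
    try:
        return direct_fetch(url)
    except BlockedError:
        return api_fetch(url)

# Stand-in fetchers for the demo
def direct(url: str) -> str:
    raise BlockedError("429 from F6S")

def via_api(url: str) -> str:
    return "<html>program list</html>"

html = fetch_with_fallback("https://www.f6s.com/programs", direct, via_api)
```

This keeps per-request costs down: most pages come back on the first, free attempt, and only blocked ones incur an API call.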


Handling JavaScript-Heavy Pages with Playwright

Some F6S pages (particularly newer program application flows) are JavaScript-rendered and will not parse correctly with httpx alone. Playwright handles these:

from playwright.async_api import async_playwright
import asyncio

async def scrape_with_playwright(url: str, proxy_url: str | None = None) -> str:
    """Fetch a JavaScript-rendered F6S page using Playwright."""
    launch_opts = {"headless": True}
    if proxy_url:
        launch_opts["proxy"] = {"server": proxy_url}

    async with async_playwright() as p:
        browser = await p.chromium.launch(**launch_opts)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800},
        )

        page = await context.new_page()

        # Block unnecessary resources to speed up loading
        await page.route("**/*.{png,jpg,jpeg,gif,svg,ico,woff,woff2}", lambda r: r.abort())

        await page.goto(url, wait_until="networkidle", timeout=30000)

        # Wait for main content container
        try:
            await page.wait_for_selector(".company-listing, .program-card", timeout=10000)
        except Exception:
            pass  # Continue even if selector not found

        html = await page.content()
        await browser.close()
        return html

# Sync wrapper for use in regular scripts. Note: asyncio.run() cannot be
# called from code that is already running inside an event loop.
def scrape_js_page(url: str, proxy_url: str | None = None) -> str:
    return asyncio.run(scrape_with_playwright(url, proxy_url))

# Example usage (build_proxy_url comes from the proxy setup earlier
# in this guide)
from bs4 import BeautifulSoup

html = scrape_js_page(
    "https://www.f6s.com/programs",
    proxy_url=build_proxy_url(session_id="12345")
)
soup = BeautifulSoup(html, "html.parser")

What to Do with F6S Data

F6S data has clear commercial value in several contexts:

Deal flow tools for VCs and angels: Build a dashboard showing new seed-stage startups by sector, updated daily. Filter by geography, team size, and program participation. VCs pay for curated deal flow.
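
Once the pipeline's SQLite database is populated, the filtering layer is a simple query. A sketch against an assumed schema (the column names here are illustrative; match them to whatever your parse_startup_profile actually extracts):

```python
import sqlite3

# Assumed columns -- adjust to match your startups table
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE startups (
    name TEXT, country TEXT, funding_stage TEXT, team_size INTEGER)""")
conn.executemany(
    "INSERT INTO startups VALUES (?, ?, ?, ?)",
    [("Acme", "DE", "seed", 8),
     ("Beta", "US", "series_a", 40),
     ("Gamma", "DE", "seed", 3)],
)

# Seed-stage startups in Germany, small enough to still be early
rows = conn.execute(
    """SELECT name, team_size FROM startups
       WHERE funding_stage = 'seed' AND country = 'DE' AND team_size <= 10
       ORDER BY team_size DESC"""
).fetchall()
# rows -> [("Acme", 8), ("Gamma", 3)]
```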

Accelerator program tracking: Build a calendar of all open accelerator programs with deadlines, equity terms, and stage requirements. Founders pay for this aggregation because finding programs manually is time-consuming.
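
The aggregation itself is a one-line query once the programs table is populated. A sketch with an assumed schema (name, deadline, equity_pct; ISO 8601 date strings sort correctly as plain text, which is why the comparison below works):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE programs (name TEXT, deadline TEXT, equity_pct REAL)")
conn.executemany("INSERT INTO programs VALUES (?, ?, ?)", [
    ("Alpha Accelerator", "2026-03-01", 6.0),
    ("Beta Batch", "2026-01-15", 7.0),
    ("Closed Cohort", "2025-06-30", 5.0),
])

today = "2026-01-01"  # fixed for the example; use date.today().isoformat() live

# Open programs, soonest deadline first -- expired ones drop out
upcoming = conn.execute(
    """SELECT name, deadline FROM programs
       WHERE deadline >= ? ORDER BY deadline""",
    (today,),
).fetchall()
# upcoming -> [("Beta Batch", "2026-01-15"), ("Alpha Accelerator", "2026-03-01")]
```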

Competitive intelligence: Track when competitors raise funding rounds or hire key executives by monitoring their F6S profiles for changes.
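
Change detection does not require storing full history: a fingerprint of the watched fields is enough to know when a profile deserves a closer diff. A sketch using a stable hash (the field names are assumptions; use whatever your parser emits):

```python
import hashlib
import json

def profile_fingerprint(profile: dict) -> str:
    """Stable hash of the fields worth watching, so a re-scrape can be
    compared against the stored snapshot with a single string equality."""
    watched = {k: profile.get(k) for k in ("funding_stage", "total_raised", "team_size")}
    # sort_keys makes the serialization deterministic across runs
    blob = json.dumps(watched, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

old = profile_fingerprint({"funding_stage": "seed", "total_raised": "1M", "team_size": 5})
new = profile_fingerprint({"funding_stage": "series_a", "total_raised": "5M", "team_size": 5})
changed = old != new  # True -> the profile moved since the last scrape
```

Store the fingerprint alongside each profile row; a mismatch on re-scrape is your trigger to record the delta and alert.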

Founder network mapping: Graph the connections between founders, investors, and accelerators to identify key nodes and communities within startup ecosystems.
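
A minimal starting point is co-participation edges: link two founders whenever they went through the same program. A sketch over hypothetical scraped (program, founder) pairs:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (program, founder) pairs extracted from scraped profiles
participation = [
    ("Accel-X", "alice"), ("Accel-X", "bob"),
    ("SeedCamp", "bob"), ("SeedCamp", "carol"),
]

# Group founders by program
by_program = defaultdict(set)
for program, founder in participation:
    by_program[program].add(founder)

# Undirected edges: every pair of founders who shared a program
edges = set()
for founders in by_program.values():
    for a, b in combinations(sorted(founders), 2):
        edges.add((a, b))
# edges -> {("alice", "bob"), ("bob", "carol")}
```

From here the edge set drops straight into a graph library for community detection or centrality scoring.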

Job boards for startup jobs: Many F6S company profiles include hiring signals. Aggregate these into a startup-focused job board.


Legal and Ethical Considerations

F6S's terms of service prohibit automated scraping of the platform. This guide is intended for research, personal use, and private internal tooling, not for bulk commercial redistribution of F6S data to third parties.

Public data accessed without authentication sits in a legal gray zone in most jurisdictions. In the EU, sui generis database rights may apply even to publicly accessible data. In the US, Computer Fraud and Abuse Act (CFAA) case law is still developing, but hiQ v. LinkedIn suggests that accessing data available without authentication generally falls outside the statute.

If you are building a product that surfaces F6S data to end users, consult a lawyer. If you are doing one-off research for personal or academic purposes, the risk profile is much lower.

Always:

- Respect the robots.txt at f6s.com/robots.txt
- Do not hammer their servers (use delays)
- Do not scrape behind authentication
- Do not redistribute bulk data commercially