How to Scrape F6S in 2026: Startups, Funding & Accelerator Programs
F6S bills itself as the largest startup community on the internet, with over 5 million founders, thousands of active accelerator programs, and a deep catalog of funding data. If you are doing deal sourcing, competitive research, or building an accelerator discovery tool, this is where the data lives. Here is how to get it out.
What You Can Extract
F6S exposes a wide range of structured data across multiple sections of their platform:
Startup profiles:
- Startup name, tagline, and full description
- Team size and employee count
- Funding stage (pre-seed, seed, Series A, Series B, etc.)
- Total capital raised (when disclosed)
- Industry and sector tags
- Country and city of operation
- Founding date and years in operation
- Website URL and social media links

Founder and team data:
- Founder names and roles
- LinkedIn profile URLs
- Prior company experience
- Education background (when listed)
- Number of previous exits

Accelerator programs:
- Program names and organizing entities
- Application open/close deadlines
- Equity percentage taken
- Cash stipend per cohort
- Stage requirements (pre-seed only, post-revenue, etc.)
- Program location or remote status
- Mentor and investor network size
- Cohort size (number of companies accepted)
- Alumni companies (searchable by program)

Deal flow and investment signals:
- Recent funding rounds posted publicly
- Investor names attached to rounds
- Lead investor identification
- Co-investor lists
F6S Anti-Bot Measures
F6S is server-side rendered, which makes parsing easier than dealing with a JavaScript-heavy SPA. That said, they have meaningful protections in place that you need to understand before writing your first request.
1. Rate Limiting on Directory Endpoints
The startup directory and program listing endpoints throttle requests aggressively. In testing, a single IP can make roughly 30-40 requests per minute before seeing HTTP 429 responses. Individual profile pages are more lenient (around 20-25 per minute) but still tracked.
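These limits are easy to respect with a small client-side throttle instead of guessing at fixed sleeps. A minimal sketch; the ceiling comes from the testing numbers above, so tune it to what you actually observe:

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window throttle: call wait() before each request and it
    blocks just long enough to stay under the per-minute ceiling."""

    def __init__(self, max_requests: int, per_seconds: float = 60.0):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.timestamps = deque()  # monotonic times of recent requests

    def wait(self) -> None:
        now = time.monotonic()
        # Drop timestamps that have fallen out of the window
        while self.timestamps and now - self.timestamps[0] >= self.per_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Sleep until the oldest request exits the window
            time.sleep(max(self.per_seconds - (now - self.timestamps[0]), 0))
        self.timestamps.append(time.monotonic())

# Directory endpoints tolerated roughly 30-40 requests/minute in testing,
# so 30 is a conservative ceiling
directory_limiter = RateLimiter(max_requests=30)
```

Call `directory_limiter.wait()` immediately before each directory request; a second limiter with a lower ceiling covers profile pages.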
2. Cookie-Based Session Tracking
Sessions are tracked server-side using cookies set during your first visit. Scraping without a valid session cookie will return degraded responses — either empty arrays where you expect data, or redirects to the login page. Always seed your session by visiting the main page first before hitting API endpoints.
3. CAPTCHA on Profile Pages
Rapid sequential requests to /company/ profile paths will surface a CAPTCHA challenge. The trigger threshold is roughly 10 profile requests in under 30 seconds from the same IP. Space your requests with at least 2-5 seconds of jitter.
4. User-Agent Filtering
Default user-agent strings from Python libraries (like python-httpx/0.x.x or python-requests/2.x.x) are filtered. You will receive either empty responses or HTTP 403. Always use a realistic browser user-agent string from a recent Chrome or Firefox version.
5. Referrer Header Validation
Some internal API endpoints check the Referer header and expect requests to originate from an F6S page (e.g. https://www.f6s.com/companies). Missing or mismatched referrers cause empty responses even when authentication is otherwise valid.
6. IP-Based Blocking
Sustained scraping over 15-20 minutes from a single IP leads to a soft block. Soft blocks manifest as HTTP 429 responses or silent redirects to a challenge page (not a real CAPTCHA, just a JS challenge page that scraper clients cannot handle). Once blocked, the IP stays blocked for 4-12 hours in testing.
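Soft blocks are worth detecting explicitly so your scraper backs off instead of burning the IP further. A sketch; the challenge-page markers are assumptions, so inspect a real blocked response from F6S and adjust them:

```python
import random

# Strings that tend to appear on challenge pages -- verify against
# an actual blocked response
CHALLENGE_MARKERS = ("verify you are human", "just a moment", "challenge-form")

def looks_soft_blocked(status_code: int, body: str, final_url: str) -> bool:
    """Heuristic check for a soft block: HTTP 429, challenge-page
    markers in the body, or a silent redirect to a challenge path."""
    if status_code == 429:
        return True
    lowered = body.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return True
    return "/challenge" in final_url

def backoff_delay(attempt: int, base: float = 30.0, cap: float = 900.0) -> float:
    """Exponential backoff with jitter for retrying after a soft block."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.8, 1.2)
```

Given the 4-12 hour block duration, the sane response to a confirmed soft block is rotating to a fresh IP (covered below), not waiting it out.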
Method 1: The Internal JSON API
F6S loads its startup directory via XHR calls to internal JSON endpoints. Open Chrome DevTools, navigate to the Network tab, filter by Fetch/XHR, and browse to f6s.com/companies. You will see calls to something like /api/v2/companies with query parameters for pagination and filtering.
This is the cleanest extraction path — structured JSON, no HTML parsing required.
import httpx
import time
import random
import json
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept": "application/json, text/plain, */*",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.f6s.com/companies",
"X-Requested-With": "XMLHttpRequest",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
}
def seed_session(client: httpx.Client) -> None:
"""Visit the main directory page to establish a valid session cookie."""
client.get("https://www.f6s.com/companies", headers={
**HEADERS,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
})
# Give the server a moment to register the session
time.sleep(random.uniform(1.5, 3.0))
def fetch_startups_page(client: httpx.Client, page: int, filters: dict = None) -> list[dict]:
"""Fetch one page of startup listings from the internal API."""
url = "https://www.f6s.com/api/v2/companies"
params = {
"page": page,
"per_page": 24,
"sort": "trending",
}
if filters:
params.update(filters)
resp = client.get(url, params=params, headers=HEADERS, timeout=20)
resp.raise_for_status()
data = resp.json()
return data.get("companies", [])
def scrape_all_startups(max_pages: int = 10, filters: dict = None) -> list[dict]:
"""Scrape startup listings with pagination and polite delays."""
results = []
with httpx.Client(follow_redirects=True) as client:
seed_session(client)
for page in range(1, max_pages + 1):
try:
companies = fetch_startups_page(client, page, filters)
if not companies:
print(f"No results on page {page}, stopping.")
break
results.extend(companies)
print(f"Page {page}: fetched {len(companies)} companies (total: {len(results)})")
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
print(f"Rate limited on page {page}. Waiting 60 seconds.")
time.sleep(60)
# Retry once after waiting
try:
companies = fetch_startups_page(client, page, filters)
results.extend(companies)
except Exception:
print("Still rate limited after retry. Stopping.")
break
else:
print(f"HTTP {e.response.status_code} on page {page}")
break
            except httpx.TimeoutException:
                print(f"Timeout on page {page}, retrying once...")
                time.sleep(5)
                try:
                    results.extend(fetch_startups_page(client, page, filters))
                except Exception:
                    print(f"Retry failed for page {page}, skipping it.")
# Polite delay between pages: 2-5 seconds with jitter
time.sleep(random.uniform(2, 5))
return results
# Example: scrape fintech startups in the UK (filter values are illustrative --
# confirm the accepted values in DevTools against the live API)
fintech_results = scrape_all_startups(
    max_pages=5,
    filters={"sector": "fintech", "country": "united-kingdom"}
)
print(f"Total: {len(fintech_results)} startups")
Filtering by Stage and Sector
The F6S API accepts filter parameters that correspond to the faceted search UI on the website:
# Available filter parameters (discovered via DevTools inspection)
STAGE_FILTERS = {
"idea": "stage=1",
"pre_seed": "stage=2",
"seed": "stage=3",
"series_a": "stage=4",
"series_b_plus": "stage=5",
"profitable": "stage=6",
}
SECTOR_FILTERS = [
"fintech", "healthtech", "edtech", "saas", "marketplace",
"ai-ml", "crypto-web3", "cleantech", "govtech", "legaltech",
"proptech", "foodtech", "regtech", "insurtech", "biotech",
]
# Filter to seed-stage AI/ML startups (stage=3 in the mapping above)
results = scrape_all_startups(
    max_pages=10,
    filters={"stage": 3, "sector": "ai-ml", "sort": "newest"}
)
Method 2: HTML Parsing Individual Profiles
For detailed profile data that does not appear in the JSON API response — long-form descriptions, team bios, program history — you scrape the HTML profile pages directly using BeautifulSoup.
import httpx
from bs4 import BeautifulSoup
import time
import random
import re
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.f6s.com/companies",
}
def parse_startup_profile(html: str) -> dict:
"""Extract structured data from an F6S company profile page."""
soup = BeautifulSoup(html, "html.parser")
data = {}
# Company name (usually in an h1)
name_el = soup.select_one("h1.profile-name, h1[itemprop=name], .company-name h1")
data["name"] = name_el.get_text(strip=True) if name_el else None
# Short tagline / one-liner
tagline_el = soup.select_one(".profile-tagline, .company-tagline, meta[name=description]")
if tagline_el:
if tagline_el.name == "meta":
data["tagline"] = tagline_el.get("content", "")
else:
data["tagline"] = tagline_el.get_text(strip=True)
else:
data["tagline"] = None
# Full description text
desc_el = soup.select_one("div.profile-description, div.company-description, [itemprop=description]")
data["description"] = desc_el.get_text(" ", strip=True) if desc_el else None
# Location
location_el = soup.select_one("span.profile-location, .company-location, [itemprop=addressLocality]")
data["location"] = location_el.get_text(strip=True) if location_el else None
# Funding stage label
stage_el = soup.select_one("span[data-field=funding_stage], .funding-stage, .stage-badge")
data["funding_stage"] = stage_el.get_text(strip=True) if stage_el else None
# Founded year
founded_el = soup.select_one("span[data-field=founded], .founded-year, [itemprop=foundingDate]")
data["founded"] = founded_el.get_text(strip=True) if founded_el else None
# Employee count
employees_el = soup.select_one("span[data-field=team_size], .employee-count")
data["employees"] = employees_el.get_text(strip=True) if employees_el else None
# Website URL
website_el = soup.select_one("a[data-field=website], a.company-website, [itemprop=url]")
data["website"] = website_el.get("href") if website_el else None
# Sector/industry tags
data["tags"] = [t.get_text(strip=True) for t in soup.select("a.profile-tag, .sector-tag, .industry-tag")]
# Founder/team members
founders = []
for card in soup.select(".team-member, .founder-card, [itemprop=employee]"):
name_el = card.select_one(".member-name, [itemprop=name]")
role_el = card.select_one(".member-role, [itemprop=jobTitle]")
linkedin_el = card.select_one("a[href*=linkedin]")
founders.append({
"name": name_el.get_text(strip=True) if name_el else None,
"role": role_el.get_text(strip=True) if role_el else None,
"linkedin": linkedin_el.get("href") if linkedin_el else None,
})
data["founders"] = founders
# Social media links
data["linkedin"] = None
data["twitter"] = None
    for link in soup.select('a[href*="linkedin.com/company"], a[href*="twitter.com"], a[href*="x.com"]'):
href = link.get("href", "")
if "linkedin" in href:
data["linkedin"] = href
elif "twitter" in href or "x.com" in href:
data["twitter"] = href
return data
def scrape_profile(client: httpx.Client, slug: str) -> dict:
"""Scrape a single company profile by its slug."""
url = f"https://www.f6s.com/company/{slug}"
resp = client.get(url, headers=HEADERS, timeout=20)
resp.raise_for_status()
profile = parse_startup_profile(resp.text)
profile["slug"] = slug
profile["url"] = url
return profile
def scrape_profiles_batch(slugs: list[str], delay_min: float = 2.5, delay_max: float = 5.0) -> list[dict]:
"""Scrape multiple company profiles with polite delays."""
results = []
with httpx.Client(follow_redirects=True) as client:
# Seed a session first
client.get("https://www.f6s.com/companies", headers=HEADERS)
time.sleep(2)
for i, slug in enumerate(slugs):
try:
profile = scrape_profile(client, slug)
results.append(profile)
print(f"[{i+1}/{len(slugs)}] Scraped: {profile.get('name', slug)}")
except httpx.HTTPStatusError as e:
print(f"HTTP {e.response.status_code} for {slug}")
except Exception as ex:
print(f"Error scraping {slug}: {ex}")
time.sleep(random.uniform(delay_min, delay_max))
return results
Scraping Accelerator Programs
The F6S accelerator directory at /programs is one of the most valuable sections. Program cards expose deadlines, equity terms, cohort details, and alumni data in structured HTML.
def parse_program_card(card) -> dict:
"""Extract data from a single program listing card."""
program = {}
name_el = card.select_one("h3.program-title, .program-name, h2 a")
program["name"] = name_el.get_text(strip=True) if name_el else None
organizer_el = card.select_one(".program-organizer, .accelerator-name")
program["organizer"] = organizer_el.get_text(strip=True) if organizer_el else None
deadline_el = card.select_one("span.program-deadline, .deadline-label, time[datetime]")
if deadline_el:
program["deadline"] = deadline_el.get("datetime") or deadline_el.get_text(strip=True)
else:
program["deadline"] = None
equity_el = card.select_one("span.program-equity, .equity-take, [data-field=equity]")
program["equity"] = equity_el.get_text(strip=True) if equity_el else None
stipend_el = card.select_one(".program-stipend, [data-field=investment]")
program["stipend"] = stipend_el.get_text(strip=True) if stipend_el else None
location_el = card.select_one(".program-location, [data-field=location]")
program["location"] = location_el.get_text(strip=True) if location_el else None
stage_el = card.select_one(".program-stage, [data-field=stage_requirements]")
program["stage_requirements"] = stage_el.get_text(strip=True) if stage_el else None
link_el = card.select_one("a.program-link, h3 a, .program-cta")
program["url"] = link_el.get("href") if link_el else None
return program
def scrape_programs(max_pages: int = 10, status: str = "open") -> list[dict]:
"""Scrape accelerator and incubator program listings."""
programs = []
with httpx.Client(follow_redirects=True) as client:
client.get("https://www.f6s.com/programs", headers=HEADERS)
time.sleep(2)
for page in range(1, max_pages + 1):
url = f"https://www.f6s.com/programs?page={page}&status={status}"
resp = client.get(url, headers=HEADERS, timeout=20)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
cards = soup.select("div.program-card, article.program-listing, .accelerator-card")
if not cards:
print(f"No program cards found on page {page}. Stopping.")
break
for card in cards:
prog = parse_program_card(card)
if prog.get("name"):
programs.append(prog)
print(f"Page {page}: found {len(cards)} programs (total: {len(programs)})")
time.sleep(random.uniform(2, 4))
return programs
# Scrape all currently open programs
open_programs = scrape_programs(max_pages=5, status="open")
print(f"Open accelerator programs: {len(open_programs)}")
# Filter to equity-free programs
equity_free = [p for p in open_programs if p.get("equity") in (None, "0%", "No equity")]
print(f"Equity-free programs: {len(equity_free)}")
Extracting Founder Profiles and LinkedIn Connections
Founder data on F6S is particularly valuable for investor deal flow and talent research:
def parse_founder_profile(html: str) -> dict:
"""Extract data from an F6S founder/person profile."""
soup = BeautifulSoup(html, "html.parser")
data = {}
data["name"] = None
name_el = soup.select_one("h1.founder-name, h1[itemprop=name], .profile-header h1")
if name_el:
data["name"] = name_el.get_text(strip=True)
data["title"] = None
title_el = soup.select_one(".founder-title, [itemprop=jobTitle], .current-role")
if title_el:
data["title"] = title_el.get_text(strip=True)
data["bio"] = None
bio_el = soup.select_one(".founder-bio, [itemprop=description], .profile-bio")
if bio_el:
data["bio"] = bio_el.get_text(" ", strip=True)
data["location"] = None
loc_el = soup.select_one(".founder-location, [itemprop=addressLocality]")
if loc_el:
data["location"] = loc_el.get_text(strip=True)
# Skills and expertise tags
data["skills"] = [s.get_text(strip=True) for s in soup.select(".skill-tag, .expertise-tag, .founder-skill")]
# Companies they are associated with
data["companies"] = []
for co in soup.select(".founder-company, .associated-startup, [itemprop=worksFor]"):
name = co.select_one(".company-name, [itemprop=name]")
role = co.select_one(".role-label, [itemprop=jobTitle]")
data["companies"].append({
"name": name.get_text(strip=True) if name else None,
"role": role.get_text(strip=True) if role else None,
})
# Social links
data["linkedin"] = None
data["twitter"] = None
data["github"] = None
for link in soup.select("a[href*=linkedin], a[href*=twitter], a[href*=github]"):
href = link.get("href", "")
if "linkedin" in href:
data["linkedin"] = href
elif "twitter" in href or "x.com" in href:
data["twitter"] = href
elif "github" in href:
data["github"] = href
return data
def scrape_startup_founders(startup_slug: str, client: httpx.Client) -> list[dict]:
"""Get founder profiles for a specific startup."""
profile_html = client.get(
f"https://www.f6s.com/company/{startup_slug}",
headers=HEADERS, timeout=20
).text
soup = BeautifulSoup(profile_html, "html.parser")
founder_links = []
for link in soup.select("a[href*=/people/], a[href*=/founder/]"):
href = link.get("href", "")
if href and href not in founder_links:
founder_links.append(href)
founders = []
for link in founder_links[:5]: # limit to top 5
full_url = link if link.startswith("http") else f"https://www.f6s.com{link}"
try:
resp = client.get(full_url, headers=HEADERS, timeout=20)
founder_data = parse_founder_profile(resp.text)
founders.append(founder_data)
time.sleep(random.uniform(2, 4))
except Exception as e:
print(f"Error scraping founder {link}: {e}")
return founders
Bypassing Rate Limits with Proxy Rotation
After a few hundred requests, F6S will start blocking your IP with soft blocks that can last 4-12 hours. For any serious data collection effort, you need IP rotation.
I route my requests through ThorData's residential proxy pool to keep sessions clean and avoid blocks. Residential IPs are legitimate household internet connections — they look completely normal to F6S's bot detection systems, unlike datacenter IPs which are trivially fingerprinted.
ThorData offers sticky sessions (keeping the same IP for extended periods) and rotating sessions (new IP per request or every N seconds). For F6S scraping, sticky sessions with rotation every 3-5 minutes work best: you build up a valid session cookie, make 20-30 requests, then rotate to a fresh IP before hitting the per-IP threshold.
import httpx
import time
import random
# ThorData proxy configuration
# Get credentials at: https://thordata.partnerstack.com/partner/0a0x4nzh
THORDATA_USER = "your_thordata_username"
THORDATA_PASS = "your_thordata_password"
THORDATA_GATEWAY = "gate.thordata.net"
THORDATA_PORT = 7777
def build_proxy_url(session_id: str = None, country: str = "US") -> str:
"""Build a ThorData proxy URL with optional sticky session."""
if session_id:
# Sticky session: same IP for duration of session_id
user = f"{THORDATA_USER}-session-{session_id}-country-{country}"
else:
# Rotating: new IP per request
user = f"{THORDATA_USER}-country-{country}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_GATEWAY}:{THORDATA_PORT}"
class F6SScraper:
"""F6S scraper with automatic proxy rotation and session management."""
REQUESTS_PER_SESSION = 25 # rotate IP every 25 requests
SESSION_SEED_DELAY = 2.0 # seconds to wait after seeding session
def __init__(self, use_proxy: bool = True):
self.use_proxy = use_proxy
self.request_count = 0
self.session_id = random.randint(100000, 999999)
self.client = self._build_client()
def _build_client(self) -> httpx.Client:
"""Build httpx client with or without proxy."""
kwargs = {"follow_redirects": True, "timeout": 20}
if self.use_proxy:
proxy_url = build_proxy_url(
session_id=str(self.session_id),
country="US"
)
            # httpx >= 0.26 takes a single `proxy=` argument
            # (older versions used a `proxies` mapping instead)
            kwargs["proxy"] = proxy_url
client = httpx.Client(**kwargs)
# Seed the session
try:
client.get("https://www.f6s.com/companies", headers=HEADERS)
time.sleep(self.SESSION_SEED_DELAY)
except Exception as e:
print(f"Session seed failed: {e}")
return client
def _maybe_rotate(self) -> None:
"""Rotate proxy session after threshold is reached."""
if self.use_proxy and self.request_count >= self.REQUESTS_PER_SESSION:
print(f"Rotating proxy after {self.request_count} requests...")
self.client.close()
self.session_id = random.randint(100000, 999999)
self.request_count = 0
self.client = self._build_client()
def get(self, url: str, **kwargs) -> httpx.Response:
"""Make a GET request with automatic proxy rotation."""
self._maybe_rotate()
resp = self.client.get(url, headers=HEADERS, **kwargs)
self.request_count += 1
return resp
def close(self):
self.client.close()
# Usage
scraper = F6SScraper(use_proxy=True)
try:
resp = scraper.get("https://www.f6s.com/api/v2/companies?page=1")
data = resp.json()
print(f"Fetched {len(data.get('companies', []))} companies")
finally:
scraper.close()
Building a Startup Intelligence Database
For ongoing deal flow monitoring, store everything in SQLite so you can query across runs, track changes, and deduplicate efficiently.
import sqlite3
from datetime import datetime, timezone
def init_db(path: str = "f6s_startups.db") -> sqlite3.Connection:
"""Initialize the startup intelligence database with all required tables."""
conn = sqlite3.connect(path)
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE IF NOT EXISTS startups (
id INTEGER PRIMARY KEY AUTOINCREMENT,
slug TEXT UNIQUE NOT NULL,
name TEXT,
tagline TEXT,
description TEXT,
location TEXT,
country TEXT,
funding_stage TEXT,
founded_year INTEGER,
employee_count TEXT,
website TEXT,
linkedin TEXT,
twitter TEXT,
fetched_at TEXT NOT NULL,
updated_at TEXT
);
CREATE TABLE IF NOT EXISTS startup_tags (
startup_slug TEXT NOT NULL,
tag TEXT NOT NULL,
UNIQUE(startup_slug, tag),
FOREIGN KEY (startup_slug) REFERENCES startups(slug)
);
CREATE TABLE IF NOT EXISTS founders (
id INTEGER PRIMARY KEY AUTOINCREMENT,
startup_slug TEXT NOT NULL,
name TEXT,
role TEXT,
linkedin TEXT,
twitter TEXT,
github TEXT,
    bio TEXT,
    UNIQUE(startup_slug, name),
    FOREIGN KEY (startup_slug) REFERENCES startups(slug)
);
CREATE TABLE IF NOT EXISTS programs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL,
organizer TEXT,
deadline TEXT,
equity TEXT,
stipend TEXT,
location TEXT,
stage_requirements TEXT,
url TEXT,
status TEXT DEFAULT 'open',
fetched_at TEXT NOT NULL,
UNIQUE(name, organizer)
);
CREATE TABLE IF NOT EXISTS startup_programs (
startup_slug TEXT,
program_id INTEGER,
status TEXT, -- applied, accepted, rejected, alum
UNIQUE(startup_slug, program_id),
FOREIGN KEY (startup_slug) REFERENCES startups(slug),
FOREIGN KEY (program_id) REFERENCES programs(id)
);
CREATE INDEX IF NOT EXISTS idx_startups_stage ON startups(funding_stage);
CREATE INDEX IF NOT EXISTS idx_startups_country ON startups(country);
CREATE INDEX IF NOT EXISTS idx_startups_fetched ON startups(fetched_at);
CREATE INDEX IF NOT EXISTS idx_programs_deadline ON programs(deadline);
""")
conn.commit()
return conn
def insert_startup(conn: sqlite3.Connection, data: dict) -> None:
"""Insert or update a startup record."""
now = datetime.now(timezone.utc).isoformat()
conn.execute("""
INSERT INTO startups (slug, name, tagline, description, location, funding_stage,
founded_year, employee_count, website, linkedin, twitter, fetched_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(slug) DO UPDATE SET
name = excluded.name,
tagline = excluded.tagline,
description = excluded.description,
location = excluded.location,
funding_stage = excluded.funding_stage,
founded_year = excluded.founded_year,
employee_count = excluded.employee_count,
website = excluded.website,
linkedin = excluded.linkedin,
twitter = excluded.twitter,
updated_at = ?
""", (
data.get("slug"), data.get("name"), data.get("tagline"),
data.get("description"), data.get("location"), data.get("funding_stage"),
data.get("founded"), data.get("employees"), data.get("website"),
data.get("linkedin"), data.get("twitter"), now, now
))
# Insert tags
for tag in data.get("tags", []):
conn.execute(
"INSERT OR IGNORE INTO startup_tags (startup_slug, tag) VALUES (?, ?)",
(data.get("slug"), tag)
)
conn.commit()
def insert_founder(conn: sqlite3.Connection, startup_slug: str, founder: dict) -> None:
"""Insert a founder record."""
conn.execute("""
INSERT OR IGNORE INTO founders (startup_slug, name, role, linkedin, twitter, github, bio)
VALUES (?, ?, ?, ?, ?, ?, ?)
""", (
startup_slug, founder.get("name"), founder.get("role"),
founder.get("linkedin"), founder.get("twitter"), founder.get("github"),
founder.get("bio")
))
conn.commit()
def insert_program(conn: sqlite3.Connection, program: dict) -> int:
"""Insert or update a program record, return the row ID."""
now = datetime.now(timezone.utc).isoformat()
cursor = conn.execute("""
INSERT INTO programs (name, organizer, deadline, equity, stipend, location,
stage_requirements, url, fetched_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(name, organizer) DO UPDATE SET
deadline = excluded.deadline,
equity = excluded.equity,
stipend = excluded.stipend,
location = excluded.location
""", (
program.get("name"), program.get("organizer"), program.get("deadline"),
program.get("equity"), program.get("stipend"), program.get("location"),
program.get("stage_requirements"), program.get("url"), now
))
    conn.commit()
    # lastrowid is unreliable when ON CONFLICT fires, so look the row up
    row = conn.execute(
        "SELECT id FROM programs WHERE name = ? AND organizer IS ?",
        (program.get("name"), program.get("organizer")),
    ).fetchone()
    return row["id"] if row else cursor.lastrowid
# Query examples
def query_startups_by_stage(conn: sqlite3.Connection, stage: str) -> list[sqlite3.Row]:
return conn.execute(
"SELECT * FROM startups WHERE funding_stage LIKE ? ORDER BY fetched_at DESC",
(f"%{stage}%",)
).fetchall()
def query_programs_closing_soon(conn: sqlite3.Connection, days: int = 30) -> list[sqlite3.Row]:
"""Find programs with deadlines in the next N days."""
from datetime import timedelta
    cutoff = (datetime.now(timezone.utc) + timedelta(days=days)).date().isoformat()
    # Note: this lexicographic comparison only works if deadlines were stored as ISO dates
return conn.execute(
"SELECT * FROM programs WHERE deadline <= ? AND deadline >= date('now') ORDER BY deadline",
(cutoff,)
).fetchall()
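The updated_at column tells you something changed, but for deal-flow signals you want to know which fields moved. A small sketch against the startups schema above; run it before insert_startup and log whatever it returns:

```python
import sqlite3

# Fields worth surfacing as signals when they change between runs
TRACKED_FIELDS = ("funding_stage", "employee_count", "location", "website")

def detect_changes(conn: sqlite3.Connection, slug: str, new_data: dict) -> dict:
    """Compare an incoming profile dict against the stored row and return
    {field: (old, new)} for tracked fields that differ. Requires
    conn.row_factory = sqlite3.Row, as set in init_db."""
    row = conn.execute("SELECT * FROM startups WHERE slug = ?", (slug,)).fetchone()
    if row is None:
        return {}  # first sighting, nothing to diff
    changes = {}
    for field in TRACKED_FIELDS:
        old, new = row[field], new_data.get(field)
        if new is not None and new != old:
            changes[field] = (old, new)
    return changes
```

A funding_stage change from "Seed" to "Series A" between runs is exactly the kind of event a deal-flow dashboard should flag.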
Complete Pipeline: Scrape, Store, Export
Putting it all together into a complete pipeline:
import csv
import json
import random
import time
from pathlib import Path
def run_pipeline(
max_startup_pages: int = 5,
max_program_pages: int = 3,
db_path: str = "f6s_data.db",
export_dir: str = "./exports",
use_proxy: bool = True,
) -> dict:
"""
Full F6S scraping pipeline.
Returns summary statistics.
"""
stats = {"startups": 0, "founders": 0, "programs": 0, "errors": 0}
Path(export_dir).mkdir(exist_ok=True)
conn = init_db(db_path)
scraper = F6SScraper(use_proxy=use_proxy)
try:
# 1. Scrape startup directory
print("--- Phase 1: Scraping startup directory ---")
raw_startups = scrape_all_startups(max_pages=max_startup_pages)
# 2. Enrich each startup with profile data
print("--- Phase 2: Enriching startup profiles ---")
for raw in raw_startups:
slug = raw.get("slug") or raw.get("id")
if not slug:
continue
try:
resp = scraper.get(f"https://www.f6s.com/company/{slug}")
profile = parse_startup_profile(resp.text)
profile["slug"] = slug
insert_startup(conn, profile)
stats["startups"] += 1
# Scrape founders too (throttled)
founders = scrape_startup_founders(slug, scraper.client)
for f in founders:
insert_founder(conn, slug, f)
stats["founders"] += 1
except Exception as e:
print(f"Error enriching {slug}: {e}")
stats["errors"] += 1
time.sleep(random.uniform(3, 6))
# 3. Scrape programs
print("--- Phase 3: Scraping accelerator programs ---")
programs = scrape_programs(max_pages=max_program_pages)
for program in programs:
insert_program(conn, program)
stats["programs"] += 1
# 4. Export to CSV
print("--- Phase 4: Exporting data ---")
startups_data = conn.execute("SELECT * FROM startups").fetchall()
if startups_data:
with open(f"{export_dir}/startups.csv", "w", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(startups_data[0].keys())
writer.writerows(startups_data)
print(f"Exported {len(startups_data)} startups to {export_dir}/startups.csv")
programs_data = conn.execute("SELECT * FROM programs").fetchall()
if programs_data:
with open(f"{export_dir}/programs.csv", "w", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(programs_data[0].keys())
writer.writerows(programs_data)
print(f"Exported {len(programs_data)} programs to {export_dir}/programs.csv")
finally:
scraper.close()
conn.close()
print(f"Pipeline complete: {stats}")
return stats
if __name__ == "__main__":
results = run_pipeline(
max_startup_pages=10,
max_program_pages=5,
use_proxy=True,
)
print(json.dumps(results, indent=2))
Alternative: Using a Scraping API Instead of DIY
If you do not want to manage proxy rotation, session handling, and CAPTCHA avoidance yourself, a commercial scraping API can handle all of that infrastructure for you. ThorData provides a proxy API where you send your requests through their gateway and get back the response after residential IP rotation, CAPTCHA solving, and JavaScript rendering.
The integration is minimal:
import httpx
from bs4 import BeautifulSoup
THORDATA_API_KEY = "your_api_key_here"
def thordata_fetch(url: str, render_js: bool = False) -> str:
"""Fetch a URL through ThorData proxy infrastructure."""
endpoint = "https://api.thordata.com/scrape"
payload = {
"url": url,
"render_js": render_js,
"country": "US",
"session_type": "sticky",
}
resp = httpx.post(
endpoint,
json=payload,
headers={"Authorization": f"Bearer {THORDATA_API_KEY}"},
timeout=60,
)
resp.raise_for_status()
return resp.json().get("html", "")
# Use exactly like a regular HTTP response
html = thordata_fetch("https://www.f6s.com/companies", render_js=True)
soup = BeautifulSoup(html, "html.parser")
This approach trades control for convenience — useful if you are doing occasional research rather than building a continuous pipeline.
Handling JavaScript-Heavy Pages with Playwright
Some F6S pages (particularly newer program application flows) are JavaScript-rendered and will not parse correctly with httpx alone. Playwright handles these:
from playwright.async_api import async_playwright
import asyncio
async def scrape_with_playwright(url: str, proxy_url: str = None) -> str:
"""Fetch a JavaScript-rendered F6S page using Playwright."""
launch_opts = {"headless": True}
if proxy_url:
launch_opts["proxy"] = {"server": proxy_url}
async with async_playwright() as p:
browser = await p.chromium.launch(**launch_opts)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
viewport={"width": 1280, "height": 800},
)
page = await context.new_page()
# Block unnecessary resources to speed up loading
await page.route("**/*.{png,jpg,jpeg,gif,svg,ico,woff,woff2}", lambda r: r.abort())
await page.goto(url, wait_until="networkidle", timeout=30000)
# Wait for main content container
try:
await page.wait_for_selector(".company-listing, .program-card", timeout=10000)
except Exception:
pass # Continue even if selector not found
html = await page.content()
await browser.close()
return html
# Sync wrapper for use in regular scripts
def scrape_js_page(url: str, proxy_url: str = None) -> str:
return asyncio.run(scrape_with_playwright(url, proxy_url))
# Example usage
html = scrape_js_page(
"https://www.f6s.com/programs",
proxy_url=build_proxy_url(session_id="12345")
)
soup = BeautifulSoup(html, "html.parser")
What to Do with F6S Data
F6S data has clear commercial value in several contexts:
Deal flow tools for VCs and angels: Build a dashboard showing new seed-stage startups by sector, updated daily. Filter by geography, team size, and program participation. VCs pay for curated deal flow.
Accelerator program tracking: Build a calendar of all open accelerator programs with deadlines, equity terms, and stage requirements. Founders pay for this aggregation because finding programs manually is time-consuming.
Competitive intelligence: Track when competitors raise funding rounds or hire key executives by monitoring their F6S profiles for changes.
Founder network mapping: Graph the connections between founders, investors, and accelerators to identify key nodes and communities within startup ecosystems.
Job boards for startup jobs: Many F6S company profiles include hiring signals. Aggregate these into a startup-focused job board.
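The network-mapping idea does not need a graph library to get started: the founders table already links people to startups, so a dict-of-sets adjacency is enough for a first pass. A sketch assuming rows shaped like the insert_founder records:

```python
from collections import defaultdict
from itertools import combinations

def build_cofounder_graph(founder_rows: list[dict]) -> dict[str, set[str]]:
    """Build an undirected co-founder graph: two founders are linked
    when they appear on the same startup. Plain dict-of-sets, so you can
    hand it to networkx later for centrality or community detection."""
    by_startup = defaultdict(list)
    for row in founder_rows:
        if row.get("name"):
            by_startup[row["startup_slug"]].append(row["name"])
    graph = defaultdict(set)
    for names in by_startup.values():
        # Link every pair of founders who share a startup
        for a, b in combinations(sorted(set(names)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)
```

Feed it `conn.execute("SELECT startup_slug, name FROM founders").fetchall()` (converted to dicts) and the highest-degree nodes are your serial founders and super-connectors.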
Legal Considerations
F6S's terms of service prohibit automated scraping of their platform. This guide is intended for research, personal use, and building tools for private use — not bulk commercial redistribution of F6S data to third parties.
Public data accessed without authentication sits in a legal gray zone in most jurisdictions. In the EU, database rights (sui generis) may apply even to publicly accessible data. In the US, the Computer Fraud and Abuse Act (CFAA) case law is mixed but generally allows accessing public data without authentication.
If you are building a product that surfaces F6S data to end users, consult a lawyer. If you are doing one-off research for personal or academic purposes, the risk profile is much lower.
Always:
- Respect the robots.txt at f6s.com/robots.txt
- Do not hammer their servers (use delays)
- Do not scrape behind authentication
- Do not redistribute bulk data commercially
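The robots.txt check is easy to automate with the stdlib parser. This sketch takes the file body as a string, so you can fetch https://www.f6s.com/robots.txt once, cache it, and gate every request:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```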