How to Scrape GitHub Repositories with Python (2026)
GitHub is one of the richest public datasets on the internet. Millions of repositories, contributor graphs, topic tags, star counts, commit histories — all accessible through a clean REST API. Whether you're building a tool to track trending libraries, doing competitive research, scraping data for a dataset, or just want to know which repos in a niche are gaining traction, the GitHub API is the right starting point.
This post covers what actually works in 2026: searching repos, pulling contributor lists, using code search, handling pagination without hitting rate limits, bulk collection with rotating proxies, and building a real dataset pipeline.
Rate Limits and Authentication
Before you write a single line of code, understand the rate limit situation.
Unauthenticated: 60 requests per hour. Basically useless for any real work.
Authenticated with a personal access token: 5,000 requests per hour. Workable for most tasks.
Search API specifically: 30 requests per minute authenticated for most search endpoints, but only 10/min for code search. This is a separate cap from the main 5,000/hr limit.
To get a token, go to GitHub Settings > Developer settings > Personal access tokens > Tokens (classic). For read-only public repo access, you only need the public_repo scope — or no scopes at all if you just want public data.
Store it in an environment variable, not in your code:
import os
import time
import re
import csv
import json
import sqlite3
import requests
from pathlib import Path
from datetime import datetime, timezone
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN")
session = requests.Session()
session.headers.update({
"Authorization": f"Bearer {GITHUB_TOKEN}",
"Accept": "application/vnd.github+json",
"X-GitHub-Api-Version": "2022-11-28",
})
Using a Session object means you don't repeat headers on every call, and it reuses the underlying TCP connection which speeds things up slightly.
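Before collecting anything, it's worth confirming the token is actually being picked up. GET /rate_limit reports your current quotas and never counts against any of them. A quick sketch using the session configured above (the helper names here are mine, not GitHub's):

```python
def summarize_rate_limit(payload):
    """Pull the useful numbers out of a /rate_limit response body."""
    res = payload["resources"]
    return {
        "core_limit": res["core"]["limit"],        # 5,000/hr with a token, 60 without
        "core_remaining": res["core"]["remaining"],
        "search_limit": res["search"]["limit"],    # separate pool for /search/* endpoints
    }

def verify_token(session):
    """Hit GET /rate_limit — this call is free and does not count against quota."""
    r = session.get("https://api.github.com/rate_limit", timeout=30)
    r.raise_for_status()
    return summarize_rate_limit(r.json())
```

If `verify_token(session)["core_limit"]` comes back as 60, the Authorization header isn't being sent.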
Understanding Rate Limit Headers
Every GitHub API response includes rate limit information in its headers. Read them on every request:
def check_rate_limit(response):
"""Parse rate limit headers and sleep if we're about to be blocked."""
remaining = int(response.headers.get("X-RateLimit-Remaining", 1))
limit = int(response.headers.get("X-RateLimit-Limit", 5000))
reset_at = int(response.headers.get("X-RateLimit-Reset", 0))
if remaining == 0:
wait = max(0, reset_at - int(time.time())) + 2
print(f"Rate limit exhausted ({limit}/hr). Waiting {wait}s...")
time.sleep(wait)
elif remaining < 100:
# Slow down when getting close to the limit
time.sleep(0.5)
return remaining
def safe_get(url, params=None, retries=3):
"""GET with retry logic and rate limit handling."""
for attempt in range(retries):
try:
r = session.get(url, params=params, timeout=30)
check_rate_limit(r)
if r.status_code == 403:
# Could be rate limit or abuse detection
retry_after = int(r.headers.get("Retry-After", 60))
print(f"403 received. Waiting {retry_after}s...")
time.sleep(retry_after)
continue
if r.status_code == 422:
# Unprocessable — usually a bad query, don't retry
print(f"422 Unprocessable: {r.json().get('message', '')}")
return None
r.raise_for_status()
return r
except requests.exceptions.ConnectionError:
wait = 2 ** attempt
print(f"Connection error. Retry {attempt+1}/{retries} in {wait}s...")
time.sleep(wait)
except requests.exceptions.Timeout:
print(f"Timeout on attempt {attempt+1}")
time.sleep(5)
return None
Searching Repositories
The search endpoint is GET /search/repositories. It accepts a q parameter using GitHub's search syntax and returns up to 1,000 results per query (100 per page, paginated).
def search_repos(query, sort="stars", order="desc", per_page=100):
"""
Search GitHub repositories by query string.
Args:
query: GitHub search syntax string
Examples:
- "topic:fastapi stars:>500 language:python"
- "machine learning created:>2024-01-01 stars:>100"
- "org:microsoft language:typescript"
sort: stars, forks, help-wanted-issues, updated
order: asc or desc
per_page: results per page (max 100)
Returns:
List of repository dicts
"""
url = "https://api.github.com/search/repositories"
params = {
"q": query,
"sort": sort,
"order": order,
"per_page": per_page,
}
r = safe_get(url, params=params)
if r is None:
return []
data = r.json()
total = data.get("total_count", 0)
items = data.get("items", [])
print(f"Query matched {total:,} repos, returning first {len(items)}")
return items
def extract_repo_fields(repo):
"""Extract the most useful fields from a raw GitHub repo response."""
return {
"id": repo["id"],
"full_name": repo["full_name"],
"name": repo["name"],
"owner": repo["owner"]["login"],
"owner_type": repo["owner"]["type"], # User or Organization
        "description": repo.get("description") or "",  # null in the API when unset, so guard with `or`
        "homepage": repo.get("homepage") or "",
        "stars": repo["stargazers_count"],
        "forks": repo["forks_count"],
        "watchers": repo["watchers_count"],
        "open_issues": repo["open_issues_count"],
        "language": repo.get("language") or "",
        "topics": ", ".join(repo.get("topics") or []),
        "license": (repo.get("license") or {}).get("spdx_id", ""),
        "default_branch": repo.get("default_branch", "main"),
"size_kb": repo["size"],
"is_fork": repo["fork"],
"is_archived": repo.get("archived", False),
"is_template": repo.get("is_template", False),
"has_wiki": repo.get("has_wiki", False),
"has_issues": repo.get("has_issues", True),
"has_pages": repo.get("has_pages", False),
"created_at": repo["created_at"],
"updated_at": repo["updated_at"],
"pushed_at": repo["pushed_at"],
"url": repo["html_url"],
"clone_url": repo["clone_url"],
"api_url": repo["url"],
"collected_at": datetime.now(timezone.utc).isoformat(),
}
# Example: trending Python ML repos from the past year
repos = search_repos(
"topic:machine-learning language:python stars:>1000 pushed:>2025-01-01"
)
for repo in repos[:5]:
r = extract_repo_fields(repo)
print(f"{r['full_name']}: {r['stars']:,} stars, {r['language']}, topics: {r['topics'][:50]}")
Getting Full Repository Details
The search endpoint returns a subset of fields. For the complete metadata, including subscriber count, network count, and full topics list, hit the individual repo endpoint:
def get_repo_details(full_name):
"""
Fetch complete repository metadata.
Args:
full_name: "owner/repo" string (e.g. "tiangolo/fastapi")
Returns:
Full repo dict with all fields, or None on error
"""
url = f"https://api.github.com/repos/{full_name}"
r = safe_get(url)
if r is None:
return None
    repo = r.json()
    details = extract_repo_fields(repo)
    # These fields only appear on the individual-repo endpoint, not in search results
    details["subscribers_count"] = repo.get("subscribers_count", 0)
    details["network_count"] = repo.get("network_count", 0)
    return details
def get_repo_topics(full_name):
    """Get all topics for a repository (returned under "names" in the response body)."""
url = f"https://api.github.com/repos/{full_name}/topics"
r = safe_get(url)
if r is None:
return []
return r.json().get("names", [])
def get_repo_languages(full_name):
"""Get byte counts by language for a repo."""
url = f"https://api.github.com/repos/{full_name}/languages"
r = safe_get(url)
if r is None:
return {}
return r.json()
# Enrich search results with full details
def enrich_repos(repos, delay=0.2):
"""Fetch full details for each repo in the list."""
enriched = []
for i, repo in enumerate(repos):
        full_name = repo["full_name"] if isinstance(repo, dict) else repo  # accept dicts or "owner/repo" strings
details = get_repo_details(full_name)
if details:
# Also get language breakdown
langs = get_repo_languages(full_name)
details["languages_json"] = json.dumps(langs)
enriched.append(details)
if (i + 1) % 20 == 0:
print(f" Enriched {i+1}/{len(repos)} repos...")
time.sleep(delay)
return enriched
Getting Contributors
Once you have a repo's full_name (e.g. tiangolo/fastapi), you can pull the contributor list from GET /repos/{owner}/{repo}/contributors. The helper below relies on get_next_page_url, defined in the Pagination section further down — define that first if you're running the snippets in order.
def get_contributors(full_name, max_pages=5, include_anon=False):
"""
Get contributors for a repository.
Args:
full_name: "owner/repo" string
max_pages: cap on pagination (each page = 100 contributors)
include_anon: include anonymous contributors
Returns:
List of contributor dicts sorted by commit count descending
"""
owner, repo = full_name.split("/", 1)
url = f"https://api.github.com/repos/{owner}/{repo}/contributors"
contributors = []
page = 1
while page <= max_pages:
params = {
"per_page": 100,
"page": page,
"anon": "1" if include_anon else "0",
}
r = safe_get(url, params=params)
if r is None:
break
batch = r.json()
if not batch:
break
for c in batch:
if c.get("type") == "Anonymous":
contributors.append({
"login": c.get("email", "anonymous"),
"contributions": c["contributions"],
"type": "anonymous",
"profile": "",
})
else:
contributors.append({
"login": c["login"],
"contributions": c["contributions"],
"type": c["type"], # User or Bot
"profile": c["html_url"],
"avatar": c["avatar_url"],
})
# Check if there are more pages
if get_next_page_url(r) is None:
break
page += 1
time.sleep(0.3)
return contributors
# Example: top contributors to a major project
fastapi_contributors = get_contributors("tiangolo/fastapi", max_pages=2)
print(f"FastAPI has {len(fastapi_contributors)} contributors")
for c in fastapi_contributors[:5]:
print(f" {c['login']}: {c['contributions']} commits")
Getting Commit History
For tracking project velocity and contributor patterns over time:
def get_commits(full_name, since=None, until=None, author=None, max_pages=10):
"""
Get commit history for a repository.
Args:
full_name: "owner/repo" string
since: ISO 8601 datetime string (e.g. "2025-01-01T00:00:00Z")
until: ISO 8601 datetime string
author: GitHub username to filter by
max_pages: pagination cap
Returns:
List of commit dicts
"""
owner, repo = full_name.split("/", 1)
url = f"https://api.github.com/repos/{owner}/{repo}/commits"
params = {"per_page": 100}
if since:
params["since"] = since
if until:
params["until"] = until
if author:
params["author"] = author
commits = []
page = 1
while page <= max_pages:
params["page"] = page
r = safe_get(url, params=params)
if r is None:
break
batch = r.json()
if not batch:
break
for c in batch:
commit_data = c.get("commit", {})
author_data = commit_data.get("author", {})
committer_data = commit_data.get("committer", {})
github_author = c.get("author") or {}
commits.append({
"sha": c["sha"][:8],
"full_sha": c["sha"],
"message": commit_data.get("message", "").split("\n")[0][:200],
"author_name": author_data.get("name", ""),
"author_email": author_data.get("email", ""),
"author_date": author_data.get("date", ""),
"committer_date": committer_data.get("date", ""),
"github_login": github_author.get("login", ""),
                # Note: the list-commits endpoint omits per-commit stats, so these
                # stay 0 unless you fetch each commit individually
                "additions": c.get("stats", {}).get("additions", 0),
                "deletions": c.get("stats", {}).get("deletions", 0),
"comment_count": commit_data.get("comment_count", 0),
})
if get_next_page_url(r) is None:
break
page += 1
time.sleep(0.3)
return commits
# Example: get 2025 commit activity
commits_2025 = get_commits(
"tiangolo/fastapi",
since="2025-01-01T00:00:00Z",
until="2025-12-31T23:59:59Z",
max_pages=5
)
print(f"FastAPI 2025 commits: {len(commits_2025)}")
Getting Issues and Pull Requests
def get_issues(full_name, state="open", labels=None, max_pages=5):
"""
Get issues (and optionally PRs) for a repository.
The GitHub API returns PRs mixed in with issues by default.
Filter by checking for 'pull_request' key in each item.
Args:
full_name: "owner/repo" string
state: open, closed, or all
labels: comma-separated label names (e.g. "bug,help wanted")
max_pages: pagination cap
Returns:
List of issue dicts
"""
owner, repo = full_name.split("/", 1)
url = f"https://api.github.com/repos/{owner}/{repo}/issues"
params = {
"state": state,
"per_page": 100,
"sort": "created",
"direction": "desc",
}
if labels:
params["labels"] = labels
issues = []
page = 1
while page <= max_pages:
params["page"] = page
r = safe_get(url, params=params)
if r is None:
break
batch = r.json()
if not batch:
break
for item in batch:
is_pr = "pull_request" in item
issues.append({
"number": item["number"],
"type": "pr" if is_pr else "issue",
"title": item["title"],
"state": item["state"],
"author": (item.get("user") or {}).get("login", ""),
"created_at": item["created_at"],
"updated_at": item["updated_at"],
"closed_at": item.get("closed_at", ""),
"labels": ", ".join(l["name"] for l in item.get("labels", [])),
"comments": item.get("comments", 0),
"body_preview": (item.get("body") or "")[:300],
"url": item["html_url"],
"is_pr": is_pr,
})
if get_next_page_url(r) is None:
break
page += 1
time.sleep(0.3)
return issues
# Example: help-wanted issues (good for finding contribution opportunities)
help_wanted = get_issues(
"tiangolo/fastapi",
state="open",
labels="help wanted",
max_pages=2
)
print(f"FastAPI help-wanted issues: {len(help_wanted)}")
Code Search
The code search endpoint lets you search across file contents on GitHub. This is useful for finding repos that use a specific library, pattern, or configuration value.
def search_code(query, per_page=30):
"""
Search code across all public GitHub repositories.
Args:
query: GitHub code search syntax
Examples:
- "import thordata language:python"
- "GITHUB_TOKEN filename:.env"
- "org:django extension:py def middleware"
per_page: results per page (max 30 for code search)
Returns:
List of code match dicts
"""
url = "https://api.github.com/search/code"
params = {
"q": query,
"per_page": per_page,
}
# Code search has stricter rate limits — add extra delay
time.sleep(6)
r = safe_get(url, params=params)
if r is None:
return []
data = r.json()
results = []
for item in data.get("items", []):
results.append({
            "repo": item["repository"]["full_name"],
            # The embedded repository object is minimal; star counts are usually absent
            "repo_stars": item["repository"].get("stargazers_count", 0),
            "file": item["name"],
            "path": item["path"],
            "url": item["html_url"],
            "raw_url": item.get("download_url", ""),
            # Only populated when the request uses the
            # "application/vnd.github.text-match+json" Accept header
            "text_matches": [
                m.get("fragment", "") for m in item.get("text_matches", [])
            ],
})
return results
def get_file_contents(full_name, path):
"""
Download the raw contents of a file from a repository.
Args:
full_name: "owner/repo" string
path: file path within the repo (e.g. "src/main.py")
Returns:
File content as string, or None
"""
owner, repo = full_name.split("/", 1)
url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
r = safe_get(url)
if r is None:
return None
data = r.json()
if data.get("encoding") == "base64":
import base64
return base64.b64decode(data["content"]).decode("utf-8", errors="replace")
return data.get("content", "")
# Example: find Python files importing a specific package
results = search_code("import thordata language:python")
for r in results[:5]:
print(f" {r['repo']} — {r['path']}")
Code search is the most rate-limited endpoint — if you're making multiple calls, pace them to roughly one request every 6-7 seconds to stay inside the 10/min cap.
Pagination with Link Headers
GitHub uses Link headers for pagination rather than returning total page counts in the body. The header looks like this:
Link: <https://api.github.com/search/repositories?q=...&page=2>; rel="next",
<https://api.github.com/search/repositories?q=...&page=10>; rel="last"
Parse it like this:
def get_next_page_url(response):
"""Extract the next-page URL from a GitHub Link header."""
link_header = response.headers.get("Link", "")
if not link_header:
return None
match = re.search(r'<([^>]+)>;\s*rel="next"', link_header)
return match.group(1) if match else None
def get_last_page_number(response):
"""Extract the last page number from a GitHub Link header."""
link_header = response.headers.get("Link", "")
if not link_header:
return 1
match = re.search(r'<[^>]+[?&]page=(\d+)>;\s*rel="last"', link_header)
return int(match.group(1)) if match else 1
def search_repos_all_pages(query, max_pages=10, delay=1.0):
"""
Paginate through all search results for a query.
GitHub caps total results at 1,000 per query even with pagination.
Narrow your query if you need more than that.
Args:
query: GitHub search syntax string
max_pages: safety cap (GitHub allows max 10 pages of 100)
delay: seconds between page requests
Returns:
List of all repo dicts
"""
url = "https://api.github.com/search/repositories"
params = {"q": query, "per_page": 100, "sort": "stars", "order": "desc"}
all_repos = []
page_count = 0
while url and page_count < max_pages:
r = safe_get(url, params=params)
if r is None:
break
data = r.json()
batch = data.get("items", [])
all_repos.extend(batch)
total_count = data.get("total_count", 0)
print(f"Page {page_count+1}: +{len(batch)} repos (total fetched: {len(all_repos)}/{min(total_count, 1000)})")
url = get_next_page_url(r)
params = {} # URL already has params encoded after first request
page_count += 1
if url:
time.sleep(delay)
return all_repos
Note that GitHub caps search results at 1,000 total even with pagination. If your query matches more than that, narrow it down with additional filters (language:, created:, stars:>N).
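If a query matches far more than 1,000 repos, the usual workaround is to split it into created-date windows that each fall under the cap and union the results. A sketch under my own naming (the 30-day default window size is arbitrary; search_beyond_cap calls the search_repos_all_pages function defined above):

```python
from datetime import date, timedelta

def date_windows(start, end, days=30):
    """Yield created:START..END filter strings covering [start, end] in fixed windows."""
    cur = start
    while cur <= end:
        window_end = min(cur + timedelta(days=days - 1), end)
        yield f"created:{cur.isoformat()}..{window_end.isoformat()}"
        cur = window_end + timedelta(days=1)

def search_beyond_cap(base_query, start, end, days=30):
    """Run one capped search per date window and merge results by full_name."""
    merged = {}
    for window in date_windows(start, end, days=days):
        for repo in search_repos_all_pages(f"{base_query} {window}", max_pages=10):
            merged[repo["full_name"]] = repo
    return list(merged.values())
```

Shrink the window size until no single window reports more than 1,000 matches.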
Anti-Detection and Proxy Setup
For large-scale collection, single-IP scraping gets flagged by GitHub's abuse detection. The API token limits are per-token, but the secondary rate limit (abuse detection) is per-IP. To work around this safely:
import random
# Header rotation pool — vary User-Agent and other headers
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
def make_session_with_proxy(proxy_url=None, token=None):
"""
Create a requests session optionally configured with a proxy.
For rotating residential proxies, use ThorData:
https://thordata.partnerstack.com/partner/0a0x4nzh
Args:
proxy_url: Full proxy URL (e.g. "http://user:pass@host:port")
token: GitHub personal access token
Returns:
Configured requests.Session
"""
s = requests.Session()
s.headers.update({
"Accept": "application/vnd.github+json",
"X-GitHub-Api-Version": "2022-11-28",
"User-Agent": random.choice(USER_AGENTS),
})
if token:
s.headers["Authorization"] = f"Bearer {token}"
if proxy_url:
s.proxies = {
"http": proxy_url,
"https": proxy_url,
}
return s
def build_thordata_proxy(username, password, country="US", sticky=False, session_id=None):
"""
Build a ThorData proxy URL with optional sticky sessions.
ThorData residential proxies: https://thordata.partnerstack.com/partner/0a0x4nzh
Args:
username: ThorData account username
password: ThorData account password
country: 2-letter country code for geo-targeting
sticky: Use sticky session (same IP for duration)
session_id: Session ID for sticky sessions (random if not provided)
Returns:
Proxy URL string
"""
import uuid
if sticky:
if session_id is None:
session_id = str(uuid.uuid4())[:8]
user = f"{username}-session-{session_id}-country-{country}"
else:
user = f"{username}-country-{country}"
return f"http://{user}:{password}@gate.thordata.net:7777"
# Using proxies with multiple tokens for high-volume collection
class MultiTokenClient:
"""Rotate between multiple GitHub tokens to maximize throughput."""
def __init__(self, tokens, proxy_url=None):
self.sessions = [
make_session_with_proxy(proxy_url=proxy_url, token=t)
for t in tokens
]
self.current = 0
def get(self, url, params=None):
"""Make a GET request using the next available session."""
s = self.sessions[self.current]
self.current = (self.current + 1) % len(self.sessions)
r = s.get(url, params=params, timeout=30)
check_rate_limit(r)
r.raise_for_status()
return r
SQLite Database for Repository Storage
Persist your collection in SQLite for deduplication and analysis:
def init_repos_db(path="github_repos.db"):
"""Initialize SQLite database for repository storage."""
conn = sqlite3.connect(path)
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE IF NOT EXISTS repos (
id INTEGER PRIMARY KEY,
full_name TEXT UNIQUE NOT NULL,
name TEXT,
owner TEXT,
owner_type TEXT,
description TEXT,
stars INTEGER DEFAULT 0,
forks INTEGER DEFAULT 0,
watchers INTEGER DEFAULT 0,
open_issues INTEGER DEFAULT 0,
language TEXT,
topics TEXT,
license TEXT,
size_kb INTEGER DEFAULT 0,
is_fork INTEGER DEFAULT 0,
is_archived INTEGER DEFAULT 0,
created_at TEXT,
pushed_at TEXT,
url TEXT,
languages_json TEXT,
collected_at TEXT
);
CREATE TABLE IF NOT EXISTS contributors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
repo_full_name TEXT NOT NULL,
login TEXT NOT NULL,
contributions INTEGER DEFAULT 0,
type TEXT,
profile TEXT,
collected_at TEXT,
UNIQUE(repo_full_name, login)
);
CREATE TABLE IF NOT EXISTS commits (
id INTEGER PRIMARY KEY AUTOINCREMENT,
repo_full_name TEXT NOT NULL,
sha TEXT NOT NULL,
author_login TEXT,
author_date TEXT,
message TEXT,
UNIQUE(repo_full_name, sha)
);
CREATE INDEX IF NOT EXISTS idx_repos_stars ON repos(stars DESC);
CREATE INDEX IF NOT EXISTS idx_repos_language ON repos(language);
CREATE INDEX IF NOT EXISTS idx_repos_pushed ON repos(pushed_at);
CREATE INDEX IF NOT EXISTS idx_contributors_repo ON contributors(repo_full_name);
""")
conn.commit()
return conn
def upsert_repo(conn, repo_data):
    """Insert or update a repository record, keeping only columns the repos table defines."""
    # extract_repo_fields returns more fields than the table stores (homepage,
    # clone_url, etc.) — filter to actual columns or SQLite will reject the INSERT
    cols = {row[1] for row in conn.execute("PRAGMA table_info(repos)")}
    data = {k: v for k, v in repo_data.items() if k in cols}
    fields = list(data.keys())
    placeholders = ", ".join(["?"] * len(fields))
    updates = ", ".join([f"{f} = excluded.{f}" for f in fields if f != "id"])
    sql = f"""
        INSERT INTO repos ({", ".join(fields)})
        VALUES ({placeholders})
        ON CONFLICT(full_name) DO UPDATE SET {updates}
    """
    conn.execute(sql, list(data.values()))
    conn.commit()
def upsert_contributors(conn, repo_full_name, contributors):
"""Batch insert contributors for a repository."""
now = datetime.now(timezone.utc).isoformat()
conn.executemany(
"""
INSERT INTO contributors (repo_full_name, login, contributions, type, profile, collected_at)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(repo_full_name, login) DO UPDATE SET
contributions = excluded.contributions
""",
[
(repo_full_name, c["login"], c["contributions"],
c.get("type", ""), c.get("profile", ""), now)
for c in contributors
]
)
conn.commit()
Real-World Use Cases
Use Case 1: Track Rising Python Libraries
Build a weekly report of Python libraries gaining momentum — useful for content strategy, investment research, or staying current with the ecosystem.
def find_rising_python_libs(min_stars=100, months_back=6):
"""
Find Python repositories that gained significant stars recently.
Looks for repos created within months_back that already have min_stars.
"""
from datetime import timedelta
cutoff = datetime.now(timezone.utc) - timedelta(days=30 * months_back)
since_str = cutoff.strftime("%Y-%m-%d")
query = f"language:python stars:>{min_stars} created:>{since_str}"
repos = search_repos_all_pages(query, max_pages=5)
# Sort by stars per day since creation
results = []
for repo in repos:
r = extract_repo_fields(repo)
created = datetime.fromisoformat(r["created_at"].replace("Z", "+00:00"))
days_old = max(1, (datetime.now(timezone.utc) - created).days)
r["stars_per_day"] = r["stars"] / days_old
results.append(r)
results.sort(key=lambda x: x["stars_per_day"], reverse=True)
print("\n=== Fastest Rising Python Libraries ===")
for r in results[:15]:
        print(f" {r['full_name']}: {r['stars']:,} stars, {r['stars_per_day']:.1f}/day | {(r['description'] or '')[:60]}")
return results
rising = find_rising_python_libs(min_stars=200, months_back=3)
Use Case 2: Competitive Intelligence for a Tech Stack
Map the open-source ecosystem around a technology to understand adoption, key players, and momentum:
def map_tech_ecosystem(tech_name, language=None):
"""
Map repositories related to a technology.
Returns a picture of the ecosystem: main projects, forks, tooling, tutorials.
"""
queries = [
f"topic:{tech_name}",
        f"{tech_name} tutorial language:{language or 'python'}",
f"{tech_name} integration stars:>50",
]
all_repos = []
seen = set()
for q in queries:
repos = search_repos(q, per_page=50)
for repo in repos:
fn = repo["full_name"]
if fn not in seen:
seen.add(fn)
all_repos.append(extract_repo_fields(repo))
time.sleep(2)
# Analyze the ecosystem
total_stars = sum(r["stars"] for r in all_repos)
languages = {}
for r in all_repos:
lang = r["language"] or "Unknown"
languages[lang] = languages.get(lang, 0) + 1
print(f"\n=== {tech_name} Ecosystem ===")
print(f"Repos found: {len(all_repos)}, Total stars: {total_stars:,}")
print(f"Top languages: {sorted(languages.items(), key=lambda x: -x[1])[:5]}")
return sorted(all_repos, key=lambda x: x["stars"], reverse=True)
fastapi_ecosystem = map_tech_ecosystem("fastapi", language="python")
Use Case 3: Developer Contact Research
Find active contributors to relevant projects for recruiting or partnership outreach:
def find_active_contributors(repo_list, min_commits=10):
"""
Find developers who are actively contributing to a set of repos.
Useful for recruiting, outreach, or identifying experts in a domain.
Returns contributors with min_commits or more across the repo set.
"""
contributor_stats = {}
for full_name in repo_list:
print(f"Getting contributors for {full_name}...")
contributors = get_contributors(full_name, max_pages=2)
for c in contributors:
login = c["login"]
if login not in contributor_stats:
contributor_stats[login] = {
"login": login,
"profile": c.get("profile", ""),
"total_contributions": 0,
"active_repos": [],
"type": c.get("type", "User"),
}
contributor_stats[login]["total_contributions"] += c["contributions"]
contributor_stats[login]["active_repos"].append(full_name)
time.sleep(1)
# Filter by minimum contributions and exclude bots
active = [
c for c in contributor_stats.values()
if c["total_contributions"] >= min_commits
and c["type"] != "Bot"
and "[bot]" not in c["login"]
]
return sorted(active, key=lambda x: x["total_contributions"], reverse=True)
python_web_contributors = find_active_contributors(
["tiangolo/fastapi", "encode/httpx", "pydantic/pydantic"],
min_commits=20
)
print(f"\nFound {len(python_web_contributors)} active contributors")
for c in python_web_contributors[:10]:
repos = ", ".join(c["active_repos"])
print(f" {c['login']}: {c['total_contributions']} commits across {repos}")
Use Case 4: Security Research — Finding Exposed Credentials
A legitimate use of code search is scanning your own organization's repos for accidentally committed secrets:
def scan_org_for_secrets(org_name):
"""
Scan an organization's public repos for common accidentally-committed secrets.
Useful for security audits of your own organization.
IMPORTANT: Only use this on your own organization. Do not use against others.
"""
secret_patterns = [
f"org:{org_name} filename:.env",
f"org:{org_name} password filename:config.py",
f"org:{org_name} PRIVATE_KEY extension:pem",
f"org:{org_name} AWS_SECRET_ACCESS_KEY",
f"org:{org_name} api_key filename:settings",
]
findings = []
for pattern in secret_patterns:
print(f"Checking: {pattern}")
results = search_code(pattern, per_page=10)
if results:
findings.append({
"pattern": pattern,
"matches": results,
})
print(f" WARNING: {len(results)} potential matches found!")
time.sleep(8) # Respect code search rate limit
return findings
Exporting to CSV and JSON
def export_to_csv(repos, output_path="github_repos.csv"):
"""Export repository list to CSV."""
if not repos:
print("No repos to export")
return
output = Path(output_path)
fieldnames = list(repos[0].keys())
with open(output, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(repos)
print(f"Exported {len(repos)} repos to {output} ({output.stat().st_size:,} bytes)")
def export_to_jsonl(repos, output_path="github_repos.jsonl"):
"""Export repository list to JSON Lines format."""
output = Path(output_path)
with open(output, "w", encoding="utf-8") as f:
for repo in repos:
f.write(json.dumps(repo) + "\n")
print(f"Exported {len(repos)} repos to {output}")
Full Pipeline: Collecting a Dataset
Here is a complete end-to-end pipeline that collects trending repositories, enriches them with contributor data, and saves everything to SQLite:
def run_collection_pipeline(queries, output_db="github_dataset.db", max_repos=500):
"""
Full collection pipeline:
1. Search repos using multiple queries
2. Deduplicate
3. Fetch full details
4. Fetch contributor lists
5. Save to SQLite
Args:
queries: List of GitHub search query strings
output_db: Path to SQLite database
max_repos: Maximum repos to collect
Returns:
Collection statistics dict
"""
conn = init_repos_db(output_db)
all_repos = {} # full_name -> repo dict
stats = {
"queries_run": 0,
"repos_collected": 0,
"contributors_collected": 0,
"errors": 0,
"started_at": datetime.now(timezone.utc).isoformat(),
}
# Phase 1: Search
print("=== Phase 1: Searching repositories ===")
for query in queries:
print(f"\nQuery: {query}")
repos = search_repos_all_pages(query, max_pages=5)
for repo in repos:
fn = repo["full_name"]
if fn not in all_repos:
all_repos[fn] = repo
stats["queries_run"] += 1
print(f" Running total: {len(all_repos)} unique repos")
time.sleep(2)
# Phase 2: Enrich and save
print(f"\n=== Phase 2: Enriching {len(all_repos)} repos ===")
repo_list = list(all_repos.values())[:max_repos]
for i, repo in enumerate(repo_list):
full_name = repo["full_name"]
try:
# Get full details
details = get_repo_details(full_name)
if details:
langs = get_repo_languages(full_name)
details["languages_json"] = json.dumps(langs)
upsert_repo(conn, details)
stats["repos_collected"] += 1
# Get contributors
contributors = get_contributors(full_name, max_pages=2)
if contributors:
upsert_contributors(conn, full_name, contributors)
stats["contributors_collected"] += len(contributors)
if (i + 1) % 20 == 0:
print(f" Progress: {i+1}/{len(repo_list)} repos enriched")
time.sleep(0.5)
except Exception as e:
print(f" Error on {full_name}: {e}")
stats["errors"] += 1
stats["completed_at"] = datetime.now(timezone.utc).isoformat()
stats["database"] = output_db
print(f"\n=== Collection Complete ===")
print(f"Repos collected: {stats['repos_collected']}")
print(f"Contributors collected: {stats['contributors_collected']}")
print(f"Errors: {stats['errors']}")
print(f"Database: {output_db}")
return stats
# Example: build a Python web framework ecosystem dataset
results = run_collection_pipeline(
queries=[
"topic:fastapi language:python stars:>100",
"topic:django language:python stars:>200",
"topic:flask language:python stars:>100 pushed:>2025-01-01",
"web framework python stars:>500 language:python",
],
output_db="python_web_frameworks.db",
max_repos=300,
)
Handling Heavy Scraping Beyond API Limits
For most use cases, 5,000 requests per hour is enough. But if you're building something that needs to scrape thousands of repos continuously, you'll hit the ceiling fast.
The typical solution is rotating proxies. Each request arrives from a different IP, which keeps you clear of the per-IP secondary limit; the per-token 5,000/hr cap still applies, which is why pairing proxies with multiple tokens (as in MultiTokenClient above) works best. I've had good results with ThorData's rotating residential proxies — their residential pool works cleanly with the requests library and doesn't trip GitHub's bot detection the way datacenter IPs sometimes do.
# Configure a session with ThorData rotating residential proxies
# Sign up at: https://thordata.partnerstack.com/partner/0a0x4nzh
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
proxy_url = build_thordata_proxy(THORDATA_USER, THORDATA_PASS, country="US")
proxy_session = make_session_with_proxy(proxy_url=proxy_url, token=GITHUB_TOKEN)
# Now use proxy_session instead of session for all requests
# Each request automatically routes through a different residential IP
With rotating proxies, you spread load across IPs automatically. That said, for most scraping tasks the public API with a token is sufficient — reach for proxies when you actually need scale, not by default.
Summary
The GitHub REST API is solid and well-documented. The main things to get right:
- Always authenticate — 60 req/hr unauthenticated is nothing
- Check X-RateLimit-Remaining on every response and back off when it hits zero
- Use the Link header for pagination, not manual page counting
- Code search has a separate, stricter rate limit — slow down for that endpoint
- Use exponential backoff on 403/429 responses
- For large-scale collection, rotating residential proxies spread load across IPs
The endpoints covered here — /search/repositories, /repos/{owner}/{repo}/contributors, /repos/{owner}/{repo}/commits, /repos/{owner}/{repo}/issues, and /search/code — cover the majority of what you'd want for repo analysis, competitive intelligence, or dataset building. The full API reference at docs.github.com/en/rest has the complete field listings if you need something more specific.
For real-world data pipelines, combine the SQLite storage layer with a scheduled job (cron or a simple sleep loop) and you have a continuously updating dataset that costs nothing to run beyond the occasional proxy bill.
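The sleep-loop version of that scheduler fits in a few lines. run_on_schedule is a generic name I'm using here; pass it any zero-argument job, such as a lambda wrapping the run_collection_pipeline function from earlier:

```python
import time

def run_on_schedule(job, interval_seconds, max_cycles=None):
    """Call job() repeatedly, sleeping between runs; stop after max_cycles if given."""
    results = []
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        results.append(job())
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(interval_seconds)
    return results

# e.g. one collection cycle per day:
# run_on_schedule(lambda: run_collection_pipeline(queries, output_db="github_dataset.db"),
#                 interval_seconds=24 * 3600)
```

For anything longer-running than a laptop session, cron calling a plain script is sturdier than an in-process loop.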