Scraping GitHub: Repos, Stars, Issues, and User Profiles in 2026
GitHub hosts over 400 million repositories. Whether you are building a developer tool, analyzing open-source trends, or doing academic research, you will eventually need to pull data out of it at scale. Here is how to do it properly in 2026 without getting your tokens revoked.
Two APIs, Different Trade-offs
GitHub provides two official APIs:
REST API v3 — straightforward, resource-based endpoints. You request /repos/torvalds/linux and get back JSON. Simple to use, but inefficient when you need nested data: a repo's top contributors AND their recent activity requires multiple round trips.
GraphQL API v4 — a single endpoint where you specify exactly which fields you want. One query can return a repo's stars, its last 10 issues, and each issue's first 5 comments. Less bandwidth, fewer requests, steeper learning curve.
Both require the same authentication, but their budgets are tracked separately: REST requests draw from the core pool, while GraphQL has its own points-based limit.
Authentication and Rate Limits
Without authentication, you get 60 requests per hour. That is barely enough for manual testing — you will burn through it in under a minute with any kind of loop.
With a Personal Access Token (PAT), you get 5,000 requests per hour. Generate one at github.com/settings/tokens. For read-only access to public data, a classic token with no scopes at all is enough; add scopes only when you need private data or write access.
import requests
import time

TOKEN = "ghp_your_token_here"

headers = {
    "Authorization": f"token {TOKEN}",
    "Accept": "application/vnd.github+json"
}

# Check your current rate limit status
r = requests.get("https://api.github.com/rate_limit", headers=headers)
limits = r.json()["resources"]["core"]
print(f"Remaining: {limits['remaining']}/{limits['limit']}")
# The reset value is a Unix timestamp (seconds since epoch)
print(f"Resets at: {time.strftime('%H:%M:%S', time.localtime(limits['reset']))}")
The X-RateLimit-Remaining header comes back on every response. Watch it as you go and back off before the budget runs out rather than waiting for a 403.
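The back-off decision can be computed directly from those headers. A minimal sketch — the helper name `backoff_seconds` is my own, not part of any library:

```python
import time

def backoff_seconds(headers, min_remaining=5, now=None):
    """Seconds to sleep given GitHub rate-limit headers; 0.0 means proceed."""
    remaining = int(headers.get("X-RateLimit-Remaining", "999"))
    if remaining >= min_remaining:
        return 0.0
    reset = int(headers.get("X-RateLimit-Reset", "0"))
    now = time.time() if now is None else now
    return max(reset - now, 0.0) + 1.0

# Usage with requests:
# resp = requests.get(url, headers=headers)
# wait = backoff_seconds(resp.headers)
# if wait:
#     time.sleep(wait)
```

Injecting `now` makes the logic testable without real clock or network calls.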
Pulling Repo Data with REST API v3
The REST API is the fastest path to basic repo stats. Each repository object from the API includes:
- stargazers_count — total stars
- forks_count — total forks
- watchers_count — watchers (usually same as stars)
- open_issues_count — open issues and PRs combined
- language — detected primary language
- topics — repository topic tags
- license — SPDX identifier
- created_at, updated_at, pushed_at — timestamps
- size — repo size in KB
- default_branch — the default branch name
import re
import requests
import time

TOKEN = "ghp_your_token_here"

HEADERS = {
    "Authorization": f"token {TOKEN}",
    "Accept": "application/vnd.github+json"
}

def parse_next_link(link_header):
    """Parse GitHub Link header for next page URL."""
    if not link_header:
        return None
    for part in link_header.split(","):
        match = re.match(r'<([^>]+)>;\s*rel="next"', part.strip())
        if match:
            return match.group(1)
    return None

def get_user_repos(username):
    """Fetch all public repos for a user, handling pagination."""
    repos = []
    url = f"https://api.github.com/users/{username}/repos"
    params = {"per_page": 100, "sort": "updated"}
    while url:
        resp = requests.get(url, headers=HEADERS, params=params)
        remaining = int(resp.headers.get("X-RateLimit-Remaining", 0))
        if remaining < 10:
            reset_time = int(resp.headers["X-RateLimit-Reset"])
            sleep_seconds = max(reset_time - time.time(), 0) + 1
            print(f"Rate limit low ({remaining}). Sleeping {sleep_seconds:.0f}s")
            time.sleep(sleep_seconds)
        resp.raise_for_status()
        repos.extend(resp.json())
        # Follow pagination via Link header
        url = parse_next_link(resp.headers.get("Link", ""))
        params = {}  # params already encoded in the next URL
    return repos

repos = get_user_repos("torvalds")
for repo in repos:
    print(f"{repo['name']}: {repo['stargazers_count']} stars, "
          f"{repo['forks_count']} forks, {repo['language'] or 'unknown'}")
Fetching Commit History
def get_commit_history(owner, repo, since=None, until=None, author=None):
    """Fetch commit history with optional date and author filters."""
    url = f"https://api.github.com/repos/{owner}/{repo}/commits"
    params = {"per_page": 100}
    if since:
        params["since"] = since  # ISO 8601: "2026-01-01T00:00:00Z"
    if until:
        params["until"] = until
    if author:
        params["author"] = author
    commits = []
    while url:
        resp = requests.get(url, headers=HEADERS, params=params)
        resp.raise_for_status()
        for commit in resp.json():
            commits.append({
                "sha": commit["sha"],
                "message": commit["commit"]["message"].split("\n")[0][:100],
                "author": commit["commit"]["author"]["name"],
                "date": commit["commit"]["author"]["date"],
                "url": commit["html_url"],
            })
        url = parse_next_link(resp.headers.get("Link", ""))
        params = {}
    return commits

commits = get_commit_history("torvalds", "linux", since="2026-01-01T00:00:00Z")
print(f"Commits since Jan 2026: {len(commits)}")
Pulling Issues and Pull Requests
Issues and PRs share the same endpoint: every pull request also appears as an issue. PRs carry a pull_request key in their JSON, which is how you tell them apart.
def get_repo_issues(owner, repo, state="all", labels=None, since=None, max_pages=20):
    """Fetch issues for a repository."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    params = {
        "per_page": 100,
        "state": state,  # "open", "closed", or "all"
        "sort": "updated",
        "direction": "desc",
    }
    if labels:
        params["labels"] = ",".join(labels)
    if since:
        params["since"] = since
    issues = []
    page = 0
    while url and page < max_pages:
        resp = requests.get(url, headers=HEADERS, params=params if page == 0 else None)
        resp.raise_for_status()
        for issue in resp.json():
            # Flag pull requests (they appear in the issues endpoint too)
            is_pr = "pull_request" in issue
            issues.append({
                "number": issue["number"],
                "title": issue["title"],
                "state": issue["state"],
                "is_pr": is_pr,
                "author": issue["user"]["login"],
                "labels": [l["name"] for l in issue.get("labels", [])],
                "comments": issue["comments"],
                "created_at": issue["created_at"],
                "updated_at": issue["updated_at"],
                "closed_at": issue.get("closed_at"),
                "body_length": len(issue.get("body") or ""),
            })
        url = parse_next_link(resp.headers.get("Link", ""))
        page += 1
        time.sleep(0.3)
    return issues

issues = get_repo_issues("microsoft", "vscode", state="open", labels=["bug"])
print(f"Open vscode bug reports: {len(issues)}")
Scraping User Profiles
User profile data includes follower counts, following, public repo count, bio, company, location, and activity timestamps.
def get_user_profile(username):
    """Fetch detailed user profile."""
    resp = requests.get(
        f"https://api.github.com/users/{username}",
        headers=HEADERS
    )
    resp.raise_for_status()
    u = resp.json()
    return {
        "login": u["login"],
        "id": u["id"],
        "name": u.get("name"),
        "company": u.get("company"),
        "blog": u.get("blog"),
        "location": u.get("location"),
        "email": u.get("email"),
        "bio": u.get("bio"),
        "twitter_username": u.get("twitter_username"),
        "public_repos": u["public_repos"],
        "public_gists": u["public_gists"],
        "followers": u["followers"],
        "following": u["following"],
        "created_at": u["created_at"],
        "updated_at": u["updated_at"],
    }

def get_user_followers(username, max_pages=10):
    """Fetch list of user followers."""
    url = f"https://api.github.com/users/{username}/followers"
    params = {"per_page": 100}
    followers = []
    page = 0
    while url and page < max_pages:
        resp = requests.get(url, headers=HEADERS, params=params if page == 0 else None)
        resp.raise_for_status()
        followers.extend([u["login"] for u in resp.json()])
        url = parse_next_link(resp.headers.get("Link", ""))
        page += 1
        time.sleep(0.3)
    return followers

profile = get_user_profile("antirez")
print(f"{profile['name']} — {profile['followers']} followers, {profile['public_repos']} repos")
Complex Queries with GraphQL API v4
When you need data that spans multiple resources, GraphQL eliminates the N+1 request problem.
import requests

TOKEN = "ghp_your_token_here"
GRAPHQL_URL = "https://api.github.com/graphql"
HEADERS_GQL = {"Authorization": f"bearer {TOKEN}"}

def graphql_query(query, variables=None):
    """Execute a GraphQL query against the GitHub v4 API."""
    payload = {"query": query}
    if variables:
        payload["variables"] = variables
    resp = requests.post(GRAPHQL_URL, json=payload, headers=HEADERS_GQL, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if "errors" in data:
        for err in data["errors"]:
            print(f"GraphQL error: {err['message']}")
    return data.get("data")

# Fetch an org's top repos with rich metadata in one request
query = """
{
  organization(login: "facebook") {
    repositories(first: 10, orderBy: {field: STARGAZERS, direction: DESC}) {
      nodes {
        name
        stargazerCount
        forkCount
        issues(states: OPEN) { totalCount }
        pullRequests(states: OPEN) { totalCount }
        primaryLanguage { name }
        repositoryTopics(first: 5) {
          nodes { topic { name } }
        }
        licenseInfo { spdxId }
        updatedAt
        diskUsage
      }
    }
  }
}
"""

data = graphql_query(query)
repos = data["organization"]["repositories"]["nodes"]
for repo in repos:
    topics = [t["topic"]["name"] for t in repo["repositoryTopics"]["nodes"]]
    lang = repo["primaryLanguage"]["name"] if repo.get("primaryLanguage") else "unknown"
    print(f"{repo['name']}: {repo['stargazerCount']} stars | "
          f"{repo['issues']['totalCount']} issues | "
          f"{lang} | topics: {', '.join(topics)}")
One request. Ten repos with stars, forks, open issues, open PRs, language, topics, and license. The REST equivalent would take 30+ requests.
GraphQL Pagination with Cursors
GraphQL uses cursor-based pagination, which is more reliable than offset pagination for large datasets.
def get_all_org_repos(org_name, max_repos=500):
    """Fetch all repos for an organization using GraphQL cursor pagination."""
    query = """
    query($org: String!, $cursor: String) {
      organization(login: $org) {
        repositories(
          first: 100,
          after: $cursor,
          orderBy: {field: STARGAZERS, direction: DESC}
        ) {
          pageInfo {
            hasNextPage
            endCursor
          }
          nodes {
            name
            stargazerCount
            forkCount
            primaryLanguage { name }
            createdAt
          }
        }
      }
    }
    """
    all_repos = []
    cursor = None
    while len(all_repos) < max_repos:
        data = graphql_query(query, variables={"org": org_name, "cursor": cursor})
        if not data:
            break
        page = data["organization"]["repositories"]
        all_repos.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            break
        cursor = page["pageInfo"]["endCursor"]
        time.sleep(0.5)
    return all_repos[:max_repos]

repos = get_all_org_repos("microsoft")
print(f"Microsoft has {len(repos)} repos")
print(f"Most starred: {repos[0]['name']} ({repos[0]['stargazerCount']} stars)")
The Search API and Its 1,000-Result Ceiling
The GitHub Search API (/search/repositories, /search/code, etc.) caps results at 1,000 per query, no matter how many matches actually exist. The workaround is partitioning queries by a sortable field so that each partition stays under the cap.
import time

def search_repos_all(language, min_stars=100, max_repos=10_000):
    """
    Search repositories with star-range partitioning to bypass the 1000-result limit.
    Each sub-query covers a star count range narrow enough to stay under the cap.
    """
    # Define star count buckets - adjust based on distribution
    star_ranges = [
        (100, 200),
        (201, 500),
        (501, 1000),
        (1001, 3000),
        (3001, 10000),
        (10001, 50000),
        (50001, 999999),
    ]
    all_repos = []
    for low, high in star_ranges:
        if high < min_stars:
            continue  # skip buckets entirely below the requested floor
        low = max(low, min_stars)
        url = "https://api.github.com/search/repositories"
        params = {
            "q": f"language:{language} stars:{low}..{high}",
            "per_page": 100,
            "sort": "stars",
            "order": "desc",
        }
        page_repos = []
        page_url = url
        while page_url and len(page_repos) < 1000:
            resp = requests.get(page_url, headers=HEADERS,
                                params=params if page_url == url else None)
            if resp.status_code == 422:
                print(f"Range {low}-{high}: query too broad, skipping")
                break
            if resp.status_code == 403:
                # Secondary/search rate limit: honor Retry-After if present
                # (search allows 30 requests/min when authenticated)
                time.sleep(int(resp.headers.get("Retry-After", 10)))
                continue
            resp.raise_for_status()
            data = resp.json()
            page_repos.extend(data.get("items", []))
            print(f"  Range {low}-{high}: {len(page_repos)}/{data.get('total_count', '?')} repos")
            page_url = parse_next_link(resp.headers.get("Link", ""))
            time.sleep(2.5)  # Search API rate limit: 30 req/min
        all_repos.extend(page_repos)
        if len(all_repos) >= max_repos:
            break
    return all_repos[:max_repos]

python_repos = search_repos_all("python", min_stars=100)
print(f"Total Python repos scraped: {len(python_repos)}")
Anti-Detection and Proxy Rotation
GitHub monitors for high request velocity, repeated identical User-Agents, and scraping patterns. Datacenter IPs get tighter restrictions than residential IPs.
For small projects (a few thousand requests), a PAT and polite delays are sufficient. For bulk operations — crawling millions of repos, monitoring all commits to thousands of projects, or running multiple tokens simultaneously — residential proxies help distribute the load.
ThorData provides rotating residential proxies that work for GitHub API calls. Each request exits from a different residential IP, so you avoid per-IP throttling while each individual token stays within its own rate budget.
import random

# Placeholder endpoint - substitute your provider's proxy host and credentials
PROXY_URL = "http://user:pass@your-proxy-host:9000"

GITHUB_TOKENS = [
    "ghp_token1",
    "ghp_token2",
    "ghp_token3",
]

class MultiTokenSession:
    """Round-robin across multiple GitHub tokens with proxy rotation."""

    def __init__(self, tokens, proxy_url=None):
        self.tokens = tokens
        self.proxy_url = proxy_url
        self.token_index = 0
        self._sessions = {}

    def _get_session(self, token):
        if token not in self._sessions:
            s = requests.Session()
            s.headers.update({
                "Authorization": f"token {token}",
                "Accept": "application/vnd.github+json",
                "User-Agent": f"github-scraper/{random.randint(1, 100)}",
            })
            if self.proxy_url:
                s.proxies = {"http": self.proxy_url, "https": self.proxy_url}
            self._sessions[token] = s
        return self._sessions[token]

    def get(self, url, **kwargs):
        token = self.tokens[self.token_index % len(self.tokens)]
        self.token_index += 1
        session = self._get_session(token)
        resp = session.get(url, **kwargs)
        remaining = int(resp.headers.get("X-RateLimit-Remaining", 999))
        if remaining < 100:
            # Skip ahead so the depleted token gets a rest
            self.token_index += 1
            print(f"Token rotation: {remaining} remaining, switching token")
        return resp

multi = MultiTokenSession(GITHUB_TOKENS, proxy_url=PROXY_URL)
resp = multi.get("https://api.github.com/repos/torvalds/linux")
print(resp.json()["stargazers_count"])
Scaling Beyond 5,000 Requests/Hour
Options for higher-volume collection:
- Conditional requests: use If-None-Match with ETags. Requests that return 304 Not Modified do not count against your limit.
- GraphQL batching: pack more data into fewer requests using fragments and aliased queries.
- Multiple tokens: GitHub allows multiple PATs per account. Each gets its own limit.
- GitHub Archive: for historical event data, skip the API entirely.
def get_with_etag(url, session, etag_cache):
    """Make a request using ETag for conditional caching."""
    headers = {}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    resp = session.get(url, headers=headers, timeout=15)
    if resp.status_code == 304:
        return None  # Not modified, use cached data
    if resp.status_code == 200:
        etag = resp.headers.get("ETag")
        if etag:
            etag_cache[url] = etag
        return resp.json()
    resp.raise_for_status()
    return resp.json()
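GraphQL batching deserves a concrete illustration: field aliases let one query fetch many independent repositories at once. A sketch under the GitHub GraphQL schema; `build_batched_repo_query` is my own illustrative helper, not an official client:

```python
def build_batched_repo_query(repos):
    """Build one GraphQL query fetching several repositories via field aliases."""
    fields = []
    for i, (owner, name) in enumerate(repos):
        fields.append(
            f'r{i}: repository(owner: "{owner}", name: "{name}") '
            "{ nameWithOwner stargazerCount forkCount }"
        )
    return "query {\n  " + "\n  ".join(fields) + "\n}"

query = build_batched_repo_query([("torvalds", "linux"), ("python", "cpython")])
# POST {"query": query} to https://api.github.com/graphql;
# the response data contains one key per alias: r0, r1, ...
```

Twenty repos per query means one request where REST would need twenty.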
GitHub Archive for Historical Data
If you need data older than what the API conveniently serves, or you want to analyze events at scale without hitting rate limits at all, use GitHub Archive at gharchive.org.
GH Archive records every public GitHub event (pushes, stars, forks, issues, PRs, comments) as newline-delimited JSON, compressed into hourly files. It has been running since 2011.
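The hourly files follow a predictable URL pattern (note that the hour component is not zero-padded), so fetching and tallying one hour of events needs only the standard library. A sketch with my own helper names; the example date is arbitrary:

```python
import gzip
import json
from collections import Counter
from urllib.request import urlopen

def archive_url(year, month, day, hour):
    """URL for one hourly GH Archive file (hour is 0-23, not zero-padded)."""
    return f"https://data.gharchive.org/{year}-{month:02d}-{day:02d}-{hour}.json.gz"

def count_event_types(ndjson_lines):
    """Tally event types from newline-delimited JSON event records."""
    counts = Counter()
    for line in ndjson_lines:
        if line.strip():
            counts[json.loads(line)["type"]] += 1
    return counts

# Example (network required):
# with urlopen(archive_url(2026, 3, 1, 12)) as f:
#     lines = gzip.decompress(f.read()).decode().splitlines()
# print(count_event_types(lines).most_common(5))
```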
The data is also loaded into Google BigQuery as the githubarchive public dataset. You can query years of GitHub activity with SQL:
-- Most-starred repos in March 2026
SELECT repo.name, COUNT(*) as stars
FROM `githubarchive.month.202603`
WHERE type = 'WatchEvent'
GROUP BY repo.name
ORDER BY stars DESC
LIMIT 20
-- Most active users by push events in 2025
-- (GH Archive events do not carry a language field, so filter by
--  repo name or owner if you need a narrower slice)
SELECT actor.login, COUNT(*) as pushes
FROM `githubarchive.year.2025`
WHERE type = 'PushEvent'
GROUP BY actor.login
ORDER BY pushes DESC
LIMIT 50
BigQuery gives you 1TB of free queries per month, which covers most research use cases. This returns the most-starred repos for a given period without making a single API call.
Storing Results in SQLite
import sqlite3
from datetime import datetime, timezone

def init_github_db(db_path="github.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS repos (
            id INTEGER PRIMARY KEY,
            owner TEXT NOT NULL,
            name TEXT NOT NULL,
            full_name TEXT UNIQUE,
            description TEXT,
            language TEXT,
            stars INTEGER,
            forks INTEGER,
            open_issues INTEGER,
            topics TEXT,
            license TEXT,
            created_at TEXT,
            updated_at TEXT,
            pushed_at TEXT,
            size_kb INTEGER,
            default_branch TEXT,
            fetched_at TEXT
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS commits (
            sha TEXT PRIMARY KEY,
            repo_full_name TEXT,
            message TEXT,
            author_name TEXT,
            author_email TEXT,
            committed_at TEXT,
            fetched_at TEXT
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS issues (
            id INTEGER PRIMARY KEY,
            repo_full_name TEXT,
            number INTEGER,
            title TEXT,
            state TEXT,
            is_pr INTEGER,
            author TEXT,
            labels TEXT,
            comments INTEGER,
            created_at TEXT,
            updated_at TEXT,
            closed_at TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_repos_language ON repos(language)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_repos_stars ON repos(stars)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_commits_repo ON commits(repo_full_name)")
    conn.commit()
    return conn

def save_repos(conn, repos):
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany("""
        INSERT OR REPLACE INTO repos
        (id, owner, name, full_name, description, language, stars, forks,
         open_issues, topics, license, created_at, updated_at, pushed_at,
         size_kb, default_branch, fetched_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, [
        (
            r["id"],
            r["owner"]["login"],
            r["name"],
            r["full_name"],
            r.get("description"),
            r.get("language"),
            r.get("stargazers_count", 0),
            r.get("forks_count", 0),
            r.get("open_issues_count", 0),
            ",".join(r.get("topics", [])),
            (r.get("license") or {}).get("spdx_id"),
            r.get("created_at"),
            r.get("updated_at"),
            r.get("pushed_at"),
            r.get("size"),
            r.get("default_branch"),
            now,
        )
        for r in repos
    ])
    conn.commit()
    print(f"Saved {len(repos)} repos")
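With the data in SQLite, analysis becomes plain SQL. An illustrative query against the repos schema above; `top_languages` is my own helper name:

```python
import sqlite3

def top_languages(conn, limit=10):
    """Languages ranked by total stars across the stored repos."""
    return conn.execute("""
        SELECT language, COUNT(*) AS repo_count, SUM(stars) AS total_stars
        FROM repos
        WHERE language IS NOT NULL
        GROUP BY language
        ORDER BY total_stars DESC
        LIMIT ?
    """, (limit,)).fetchall()

# for language, repo_count, total_stars in top_languages(conn):
#     print(f"{language}: {repo_count} repos, {total_stars} stars")
```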
Practical Tips
Cache aggressively. GitHub responses include ETag and Last-Modified headers. Use them. A local SQLite database mapping URLs to responses will cut your API usage dramatically.
Respect Retry-After headers. When you hit a secondary rate limit (abuse detection), GitHub returns a Retry-After header. Honor it or risk getting your token suspended.
Use per_page=100 always. The default is 30. Setting it to 100 (the maximum) cuts your pagination requests by 70%.
Check X-RateLimit-Resource. GitHub separates rate limits by resource type (core, search, graphql, code_search). Your GraphQL budget is separate from your REST budget — use both pools strategically.
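The /rate_limit endpoint shown earlier reports each pool separately, so you can inspect all budgets in one call. A sketch over the documented response shape; `summarize_pools` is my own helper:

```python
def summarize_pools(rate_limit_json):
    """Map each rate-limit resource pool to (remaining, limit)."""
    return {
        name: (res["remaining"], res["limit"])
        for name, res in rate_limit_json["resources"].items()
    }

# Live check, with HEADERS as defined earlier:
# resp = requests.get("https://api.github.com/rate_limit", headers=HEADERS)
# for pool, (remaining, limit) in sorted(summarize_pools(resp.json()).items()):
#     print(f"{pool}: {remaining}/{limit}")
```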
Do not scrape what you can download. GitHub provides data exports for many things: repo archives, npm packages are mirrored, and GH Archive covers event data. Always check if a bulk download exists before hitting the API.
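For full repository snapshots specifically, GitHub serves tarballs and zipballs at a stable URL pattern, entirely outside the API quota. A sketch; `repo_archive_url` is my own helper name:

```python
def repo_archive_url(owner, repo, ref="HEAD", fmt="tar.gz"):
    """Snapshot URL for a repo at a given ref; downloading it uses no API quota."""
    return f"https://github.com/{owner}/{repo}/archive/{ref}.{fmt}"

# Stream the tarball (no token needed for public repos):
# resp = requests.get(repo_archive_url("torvalds", "linux", "master"), stream=True)
```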
Topics no longer need a preview header. Older guides say to add Accept: application/vnd.github.mercy-preview+json to get repository topics via REST; that preview has since graduated, and topics come back with the standard application/vnd.github+json header. GraphQL returns them natively as well.
What to Build With This Data
Developer network analysis: Map follower/following relationships to identify influential nodes in the developer community. Who bridges JavaScript and Python ecosystems?
License compliance monitoring: Scan organizations for repos using GPL or other copyleft licenses in commercial products.
OSS health metrics: Build dashboards tracking issue response time, PR merge rate, and contributor diversity across projects you depend on.
Trend detection: Track which topics are gaining repos month-over-month. Which frameworks are developers gravitating toward in 2026?
Hiring intelligence: Find developers by the languages they commit in, the repos they contribute to, and their activity patterns.
GitHub's APIs are well-designed and generous enough for most projects. The REST API handles simple lookups, GraphQL handles complex queries, and GH Archive covers historical analysis at scale. Start with the smallest approach that works — most projects never actually need more than a few hundred API calls per day.