Scraping GitHub: Repos, Stars, Issues, and User Profiles in 2026
GitHub hosts over 400 million repositories. Whether you are building a developer tool, analyzing open-source trends, or doing academic research, you will eventually need to pull data out of it at scale. Here is how to do it properly in 2026 without getting your tokens revoked.
Two APIs, Different Trade-offs
GitHub provides two official APIs:
REST API v3 — straightforward, resource-based endpoints. You request /repos/torvalds/linux and get back JSON. Simple to use, but inefficient when you need nested data: a repo's top contributors AND their recent activity requires multiple round trips.
GraphQL API v4 — a single endpoint where you specify exactly which fields you want. One query can return a repo's stars, its last 10 issues, and each issue's first 5 comments. Less bandwidth, fewer requests, steeper learning curve.
Both require the same authentication, but their budgets are tracked separately: REST requests draw from the core pool, while GraphQL has its own points-based limit.
Authentication and Rate Limits
Without authentication, you get 60 requests per hour. That is barely enough for manual testing — you will burn through it in under a minute with any kind of loop.
With a Personal Access Token (PAT), you get 5,000 requests per hour. Generate one at github.com/settings/tokens. For read-only access to public data, a classic token with no scopes at all is enough; add scopes only when you need private data or write access.
import requests
import time

TOKEN = "ghp_your_token_here"

headers = {
    "Authorization": f"token {TOKEN}",
    "Accept": "application/vnd.github+json"
}

# Check your current rate limit status
r = requests.get("https://api.github.com/rate_limit", headers=headers)
limits = r.json()["resources"]["core"]
print(f"Remaining: {limits['remaining']}/{limits['limit']}")
# The reset value is a Unix timestamp (seconds since epoch)
print(f"Resets at: {time.strftime('%H:%M:%S', time.localtime(limits['reset']))}")
The X-RateLimit-Remaining header comes back on every response. Watch it as you go and back off before the budget runs out rather than waiting for a 403.
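The back-off decision can be computed directly from those headers. A minimal sketch — the helper name `backoff_seconds` is my own, not part of any library:

```python
import time

def backoff_seconds(headers, min_remaining=5, now=None):
    """Seconds to sleep given GitHub rate-limit headers; 0.0 means proceed."""
    remaining = int(headers.get("X-RateLimit-Remaining", "999"))
    if remaining >= min_remaining:
        return 0.0
    reset = int(headers.get("X-RateLimit-Reset", "0"))
    now = time.time() if now is None else now
    return max(reset - now, 0.0) + 1.0

# Usage with requests:
# resp = requests.get(url, headers=headers)
# wait = backoff_seconds(resp.headers)
# if wait:
#     time.sleep(wait)
```

Injecting `now` makes the logic testable without real clock or network calls.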
Pulling Repo Data with REST API v3
The REST API is the fastest path to basic repo stats. Each repository object from the API includes:
- stargazers_count — total stars
- forks_count — total forks
- watchers_count — watchers (usually same as stars)
- open_issues_count — open issues and PRs combined
- language — detected primary language
- topics — repository topic tags
- license — SPDX identifier
- created_at, updated_at, pushed_at — timestamps
- size — repo size in KB
- default_branch — the default branch name
import re
import requests
import time

TOKEN = "ghp_your_token_here"

HEADERS = {
    "Authorization": f"token {TOKEN}",
    "Accept": "application/vnd.github+json"
}

def parse_next_link(link_header):
    """Parse GitHub Link header for next page URL."""
    if not link_header:
        return None
    for part in link_header.split(","):
        match = re.match(r'<([^>]+)>;\s*rel="next"', part.strip())
        if match:
            return match.group(1)
    return None

def get_user_repos(username):
    """Fetch all public repos for a user, handling pagination."""
    repos = []
    url = f"https://api.github.com/users/{username}/repos"
    params = {"per_page": 100, "sort": "updated"}
    while url:
        resp = requests.get(url, headers=HEADERS, params=params)
        remaining = int(resp.headers.get("X-RateLimit-Remaining", 0))
        if remaining < 10:
            reset_time = int(resp.headers["X-RateLimit-Reset"])
            sleep_seconds = max(reset_time - time.time(), 0) + 1
            print(f"Rate limit low ({remaining}). Sleeping {sleep_seconds:.0f}s")
            time.sleep(sleep_seconds)
        resp.raise_for_status()
        repos.extend(resp.json())
        # Follow pagination via Link header
        url = parse_next_link(resp.headers.get("Link", ""))
        params = {}  # params already encoded in the next URL
    return repos

repos = get_user_repos("torvalds")
for repo in repos:
    print(f"{repo['name']}: {repo['stargazers_count']} stars, "
          f"{repo['forks_count']} forks, {repo['language'] or 'unknown'}")
Fetching Commit History
def get_commit_history(owner, repo, since=None, until=None, author=None):
    """Fetch commit history with optional date and author filters."""
    url = f"https://api.github.com/repos/{owner}/{repo}/commits"
    params = {"per_page": 100}
    if since:
        params["since"] = since  # ISO 8601: "2026-01-01T00:00:00Z"
    if until:
        params["until"] = until
    if author:
        params["author"] = author
    commits = []
    while url:
        resp = requests.get(url, headers=HEADERS, params=params)
        resp.raise_for_status()
        for commit in resp.json():
            commits.append({
                "sha": commit["sha"],
                "message": commit["commit"]["message"].split("\n")[0][:100],
                "author": commit["commit"]["author"]["name"],
                "date": commit["commit"]["author"]["date"],
                "url": commit["html_url"],
            })
        url = parse_next_link(resp.headers.get("Link", ""))
        params = {}
    return commits

commits = get_commit_history("torvalds", "linux", since="2026-01-01T00:00:00Z")
print(f"Commits since Jan 2026: {len(commits)}")
Pulling Issues and Pull Requests
Issues and PRs share the same endpoint: every pull request also appears as an issue. PRs carry a pull_request key in their JSON, which is how you tell them apart.
def get_repo_issues(owner, repo, state="all", labels=None, since=None, max_pages=20):
    """Fetch issues for a repository."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues"
    params = {
        "per_page": 100,
        "state": state,  # "open", "closed", or "all"
        "sort": "updated",
        "direction": "desc",
    }
    if labels:
        params["labels"] = ",".join(labels)
    if since:
        params["since"] = since
    issues = []
    page = 0
    while url and page < max_pages:
        resp = requests.get(url, headers=HEADERS, params=params if page == 0 else None)
        resp.raise_for_status()
        for issue in resp.json():
            # Flag pull requests (they appear in the issues endpoint too)
            is_pr = "pull_request" in issue
            issues.append({
                "number": issue["number"],
                "title": issue["title"],
                "state": issue["state"],
                "is_pr": is_pr,
                "author": issue["user"]["login"],
                "labels": [l["name"] for l in issue.get("labels", [])],
                "comments": issue["comments"],
                "created_at": issue["created_at"],
                "updated_at": issue["updated_at"],
                "closed_at": issue.get("closed_at"),
                "body_length": len(issue.get("body") or ""),
            })
        url = parse_next_link(resp.headers.get("Link", ""))
        page += 1
        time.sleep(0.3)
    return issues

issues = get_repo_issues("microsoft", "vscode", state="open", labels=["bug"])
print(f"Open vscode bug reports: {len(issues)}")
Scraping User Profiles
User profile data includes follower counts, following, public repo count, bio, company, location, and activity timestamps.
def get_user_profile(username):
    """Fetch detailed user profile."""
    resp = requests.get(
        f"https://api.github.com/users/{username}",
        headers=HEADERS
    )
    resp.raise_for_status()
    u = resp.json()
    return {
        "login": u["login"],
        "id": u["id"],
        "name": u.get("name"),
        "company": u.get("company"),
        "blog": u.get("blog"),
        "location": u.get("location"),
        "email": u.get("email"),
        "bio": u.get("bio"),
        "twitter_username": u.get("twitter_username"),
        "public_repos": u["public_repos"],
        "public_gists": u["public_gists"],
        "followers": u["followers"],
        "following": u["following"],
        "created_at": u["created_at"],
        "updated_at": u["updated_at"],
    }

def get_user_followers(username, max_pages=10):
    """Fetch list of user followers."""
    url = f"https://api.github.com/users/{username}/followers"
    params = {"per_page": 100}
    followers = []
    page = 0
    while url and page < max_pages:
        resp = requests.get(url, headers=HEADERS, params=params if page == 0 else None)
        resp.raise_for_status()
        followers.extend([u["login"] for u in resp.json()])
        url = parse_next_link(resp.headers.get("Link", ""))
        page += 1
        time.sleep(0.3)
    return followers

profile = get_user_profile("antirez")
print(f"{profile['name']} — {profile['followers']} followers, {profile['public_repos']} repos")
Complex Queries with GraphQL API v4
When you need data that spans multiple resources, GraphQL eliminates the N+1 request problem.
import requests

TOKEN = "ghp_your_token_here"
GRAPHQL_URL = "https://api.github.com/graphql"
HEADERS_GQL = {"Authorization": f"bearer {TOKEN}"}

def graphql_query(query, variables=None):
    """Execute a GraphQL query against the GitHub v4 API."""
    payload = {"query": query}
    if variables:
        payload["variables"] = variables
    resp = requests.post(GRAPHQL_URL, json=payload, headers=HEADERS_GQL, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if "errors" in data:
        for err in data["errors"]:
            print(f"GraphQL error: {err['message']}")
    return data.get("data")

# Fetch an org's top repos with rich metadata in one request
query = """
{
  organization(login: "facebook") {
    repositories(first: 10, orderBy: {field: STARGAZERS, direction: DESC}) {
      nodes {
        name
        stargazerCount
        forkCount
        issues(states: OPEN) { totalCount }
        pullRequests(states: OPEN) { totalCount }
        primaryLanguage { name }
        repositoryTopics(first: 5) {
          nodes { topic { name } }
        }
        licenseInfo { spdxId }
        updatedAt
        diskUsage
      }
    }
  }
}
"""

data = graphql_query(query)
repos = data["organization"]["repositories"]["nodes"]
for repo in repos:
    topics = [t["topic"]["name"] for t in repo["repositoryTopics"]["nodes"]]
    lang = repo["primaryLanguage"]["name"] if repo.get("primaryLanguage") else "unknown"
    print(f"{repo['name']}: {repo['stargazerCount']} stars | "
          f"{repo['issues']['totalCount']} issues | "
          f"{lang} | topics: {', '.join(topics)}")
One request. Ten repos with stars, forks, open issues, open PRs, language, topics, and license. The REST equivalent would take 30+ requests.
GraphQL Pagination with Cursors
GraphQL uses cursor-based pagination, which is more reliable than offset pagination for large datasets.
def get_all_org_repos(org_name, max_repos=500):
    """Fetch all repos for an organization using GraphQL cursor pagination."""
    query = """
    query($org: String!, $cursor: String) {
      organization(login: $org) {
        repositories(
          first: 100,
          after: $cursor,
          orderBy: {field: STARGAZERS, direction: DESC}
        ) {
          pageInfo {
            hasNextPage
            endCursor
          }
          nodes {
            name
            stargazerCount
            forkCount
            primaryLanguage { name }
            createdAt
          }
        }
      }
    }
    """
    all_repos = []
    cursor = None
    while len(all_repos) < max_repos:
        data = graphql_query(query, variables={"org": org_name, "cursor": cursor})
        if not data:
            break
        page = data["organization"]["repositories"]
        all_repos.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            break
        cursor = page["pageInfo"]["endCursor"]
        time.sleep(0.5)
    return all_repos[:max_repos]

repos = get_all_org_repos("microsoft")
print(f"Microsoft has {len(repos)} repos")
print(f"Most starred: {repos[0]['name']} ({repos[0]['stargazerCount']} stars)")
The Search API and Its 1,000-Result Ceiling
The GitHub Search API (/search/repositories, /search/code, etc.) caps results at 1,000 per query, no matter how many matches actually exist. The workaround is partitioning queries by a sortable field so that each partition stays under the cap.
import time

def search_repos_all(language, min_stars=100, max_repos=10_000):
    """
    Search repositories with star-range partitioning to bypass the 1000-result limit.
    Each sub-query covers a star count range narrow enough to stay under the cap.
    """
    # Define star count buckets - adjust based on distribution
    star_ranges = [
        (100, 200),
        (201, 500),
        (501, 1000),
        (1001, 3000),
        (3001, 10000),
        (10001, 50000),
        (50001, 999999),
    ]
    all_repos = []
    for low, high in star_ranges:
        if high < min_stars:
            continue  # skip buckets entirely below the requested floor
        low = max(low, min_stars)
        url = "https://api.github.com/search/repositories"
        params = {
            "q": f"language:{language} stars:{low}..{high}",
            "per_page": 100,
            "sort": "stars",
            "order": "desc",
        }
        page_repos = []
        page_url = url
        while page_url and len(page_repos) < 1000:
            resp = requests.get(page_url, headers=HEADERS,
                                params=params if page_url == url else None)
            if resp.status_code == 422:
                print(f"Range {low}-{high}: query too broad, skipping")
                break
            if resp.status_code == 403:
                # Secondary/search rate limit: honor Retry-After if present
                # (search allows 30 requests/min when authenticated)
                time.sleep(int(resp.headers.get("Retry-After", 10)))
                continue
            resp.raise_for_status()
            data = resp.json()
            page_repos.extend(data.get("items", []))
            print(f"  Range {low}-{high}: {len(page_repos)}/{data.get('total_count', '?')} repos")
            page_url = parse_next_link(resp.headers.get("Link", ""))
            time.sleep(2.5)  # Search API rate limit: 30 req/min
        all_repos.extend(page_repos)
        if len(all_repos) >= max_repos:
            break
    return all_repos[:max_repos]

python_repos = search_repos_all("python", min_stars=100)
print(f"Total Python repos scraped: {len(python_repos)}")
Anti-Detection and Proxy Rotation
GitHub monitors for high request velocity, repeated identical User-Agents, and scraping patterns. Datacenter IPs get tighter restrictions than residential IPs.
For small projects (a few thousand requests), a PAT and polite delays are sufficient. For bulk operations — crawling millions of repos, monitoring all commits to thousands of projects, or running multiple tokens simultaneously — residential proxies help distribute the load.
ThorData provides rotating residential proxies that work for GitHub API calls. Each request exits from a different residential IP, so you avoid per-IP throttling while each individual token stays within its own rate budget.
import random

# Placeholder endpoint - substitute your provider's proxy host and credentials
PROXY_URL = "http://user:pass@your-proxy-host:9000"

GITHUB_TOKENS = [
    "ghp_token1",
    "ghp_token2",
    "ghp_token3",
]

class MultiTokenSession:
    """Round-robin across multiple GitHub tokens with proxy rotation."""

    def __init__(self, tokens, proxy_url=None):
        self.tokens = tokens
        self.proxy_url = proxy_url
        self.token_index = 0
        self._sessions = {}

    def _get_session(self, token):
        if token not in self._sessions:
            s = requests.Session()
            s.headers.update({
                "Authorization": f"token {token}",
                "Accept": "application/vnd.github+json",
                "User-Agent": f"github-scraper/{random.randint(1, 100)}",
            })
            if self.proxy_url:
                s.proxies = {"http": self.proxy_url, "https": self.proxy_url}
            self._sessions[token] = s
        return self._sessions[token]

    def get(self, url, **kwargs):
        token = self.tokens[self.token_index % len(self.tokens)]
        self.token_index += 1
        session = self._get_session(token)
        resp = session.get(url, **kwargs)
        remaining = int(resp.headers.get("X-RateLimit-Remaining", 999))
        if remaining < 100:
            # Skip ahead so the depleted token gets a rest
            self.token_index += 1
            print(f"Token rotation: {remaining} remaining, switching token")
        return resp

multi = MultiTokenSession(GITHUB_TOKENS, proxy_url=PROXY_URL)
resp = multi.get("https://api.github.com/repos/torvalds/linux")
print(resp.json()["stargazers_count"])
Scaling Beyond 5,000 Requests/Hour
Options for higher-volume collection:
- Conditional requests: use If-None-Match with ETags. Requests that return 304 Not Modified do not count against your limit.
- GraphQL batching: pack more data into fewer requests using fragments and aliased queries.
- Multiple tokens: GitHub allows multiple PATs per account. Each gets its own limit.
- GitHub Archive: for historical event data, skip the API entirely.
def get_with_etag(url, session, etag_cache):
    """Make a request using ETag for conditional caching."""
    headers = {}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    resp = session.get(url, headers=headers, timeout=15)
    if resp.status_code == 304:
        return None  # Not modified, use cached data
    if resp.status_code == 200:
        etag = resp.headers.get("ETag")
        if etag:
            etag_cache[url] = etag
        return resp.json()
    resp.raise_for_status()
    return resp.json()
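GraphQL batching deserves a concrete illustration: field aliases let one query fetch many independent repositories at once. A sketch under the GitHub GraphQL schema; `build_batched_repo_query` is my own illustrative helper, not an official client:

```python
def build_batched_repo_query(repos):
    """Build one GraphQL query fetching several repositories via field aliases."""
    fields = []
    for i, (owner, name) in enumerate(repos):
        fields.append(
            f'r{i}: repository(owner: "{owner}", name: "{name}") '
            "{ nameWithOwner stargazerCount forkCount }"
        )
    return "query {\n  " + "\n  ".join(fields) + "\n}"

query = build_batched_repo_query([("torvalds", "linux"), ("python", "cpython")])
# POST {"query": query} to https://api.github.com/graphql;
# the response data contains one key per alias: r0, r1, ...
```

Twenty repos per query means one request where REST would need twenty.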
GitHub Archive for Historical Data
If you need data older than what the API conveniently serves, or you want to analyze events at scale without hitting rate limits at all, use GitHub Archive at gharchive.org.
GH Archive records every public GitHub event (pushes, stars, forks, issues, PRs, comments) as newline-delimited JSON, compressed into hourly files. It has been running since 2011.
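The hourly files follow a predictable URL pattern (note that the hour component is not zero-padded), so fetching and tallying one hour of events needs only the standard library. A sketch with my own helper names; the example date is arbitrary:

```python
import gzip
import json
from collections import Counter
from urllib.request import urlopen

def archive_url(year, month, day, hour):
    """URL for one hourly GH Archive file (hour is 0-23, not zero-padded)."""
    return f"https://data.gharchive.org/{year}-{month:02d}-{day:02d}-{hour}.json.gz"

def count_event_types(ndjson_lines):
    """Tally event types from newline-delimited JSON event records."""
    counts = Counter()
    for line in ndjson_lines:
        if line.strip():
            counts[json.loads(line)["type"]] += 1
    return counts

# Example (network required):
# with urlopen(archive_url(2026, 3, 1, 12)) as f:
#     lines = gzip.decompress(f.read()).decode().splitlines()
# print(count_event_types(lines).most_common(5))
```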
The data is also loaded into Google BigQuery as the githubarchive public dataset. You can query years of GitHub activity with SQL:
-- Most-starred repos in March 2026
SELECT repo.name, COUNT(*) as stars
FROM `githubarchive.month.202603`
WHERE type = 'WatchEvent'
GROUP BY repo.name
ORDER BY stars DESC
LIMIT 20
-- Most active users by push events in 2025
-- (GH Archive events do not carry a language field, so filter by
--  repo name or owner if you need a narrower slice)
SELECT actor.login, COUNT(*) as pushes
FROM `githubarchive.year.2025`
WHERE type = 'PushEvent'
GROUP BY actor.login
ORDER BY pushes DESC
LIMIT 50
BigQuery gives you 1TB of free queries per month, which covers most research use cases. This returns the most-starred repos for a given period without making a single API call.
Storing Results in SQLite
import sqlite3
from datetime import datetime, timezone

def init_github_db(db_path="github.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS repos (
            id INTEGER PRIMARY KEY,
            owner TEXT NOT NULL,
            name TEXT NOT NULL,
            full_name TEXT UNIQUE,
            description TEXT,
            language TEXT,
            stars INTEGER,
            forks INTEGER,
            open_issues INTEGER,
            topics TEXT,
            license TEXT,
            created_at TEXT,
            updated_at TEXT,
            pushed_at TEXT,
            size_kb INTEGER,
            default_branch TEXT,
            fetched_at TEXT
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS commits (
            sha TEXT PRIMARY KEY,
            repo_full_name TEXT,
            message TEXT,
            author_name TEXT,
            author_email TEXT,
            committed_at TEXT,
            fetched_at TEXT
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS issues (
            id INTEGER PRIMARY KEY,
            repo_full_name TEXT,
            number INTEGER,
            title TEXT,
            state TEXT,
            is_pr INTEGER,
            author TEXT,
            labels TEXT,
            comments INTEGER,
            created_at TEXT,
            updated_at TEXT,
            closed_at TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_repos_language ON repos(language)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_repos_stars ON repos(stars)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_commits_repo ON commits(repo_full_name)")
    conn.commit()
    return conn

def save_repos(conn, repos):
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany("""
        INSERT OR REPLACE INTO repos
        (id, owner, name, full_name, description, language, stars, forks,
         open_issues, topics, license, created_at, updated_at, pushed_at,
         size_kb, default_branch, fetched_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, [
        (
            r["id"],
            r["owner"]["login"],
            r["name"],
            r["full_name"],
            r.get("description"),
            r.get("language"),
            r.get("stargazers_count", 0),
            r.get("forks_count", 0),
            r.get("open_issues_count", 0),
            ",".join(r.get("topics", [])),
            (r.get("license") or {}).get("spdx_id"),
            r.get("created_at"),
            r.get("updated_at"),
            r.get("pushed_at"),
            r.get("size"),
            r.get("default_branch"),
            now,
        )
        for r in repos
    ])
    conn.commit()
    print(f"Saved {len(repos)} repos")
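With the data in SQLite, analysis becomes plain SQL. An illustrative query against the repos schema above; `top_languages` is my own helper name:

```python
import sqlite3

def top_languages(conn, limit=10):
    """Languages ranked by total stars across the stored repos."""
    return conn.execute("""
        SELECT language, COUNT(*) AS repo_count, SUM(stars) AS total_stars
        FROM repos
        WHERE language IS NOT NULL
        GROUP BY language
        ORDER BY total_stars DESC
        LIMIT ?
    """, (limit,)).fetchall()

# for language, repo_count, total_stars in top_languages(conn):
#     print(f"{language}: {repo_count} repos, {total_stars} stars")
```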
Practical Tips
Cache aggressively. GitHub responses include ETag and Last-Modified headers. Use them. A local SQLite database mapping URLs to responses will cut your API usage dramatically.
Respect Retry-After headers. When you hit a secondary rate limit (abuse detection), GitHub returns a Retry-After header. Honor it or risk getting your token suspended.
Use per_page=100 always. The default is 30. Setting it to 100 (the maximum) cuts your pagination requests by 70%.
Check X-RateLimit-Resource. GitHub separates rate limits by resource type (core, search, graphql, code_search). Your GraphQL budget is separate from your REST budget — use both pools strategically.
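The /rate_limit endpoint shown earlier reports each pool separately, so you can inspect all budgets in one call. A sketch over the documented response shape; `summarize_pools` is my own helper:

```python
def summarize_pools(rate_limit_json):
    """Map each rate-limit resource pool to (remaining, limit)."""
    return {
        name: (res["remaining"], res["limit"])
        for name, res in rate_limit_json["resources"].items()
    }

# Live check, with HEADERS as defined earlier:
# resp = requests.get("https://api.github.com/rate_limit", headers=HEADERS)
# for pool, (remaining, limit) in sorted(summarize_pools(resp.json()).items()):
#     print(f"{pool}: {remaining}/{limit}")
```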
Do not scrape what you can download. GitHub provides data exports for many things: repo archives, npm packages are mirrored, and GH Archive covers event data. Always check if a bulk download exists before hitting the API.
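For full repository snapshots specifically, GitHub serves tarballs and zipballs at a stable URL pattern, entirely outside the API quota. A sketch; `repo_archive_url` is my own helper name:

```python
def repo_archive_url(owner, repo, ref="HEAD", fmt="tar.gz"):
    """Snapshot URL for a repo at a given ref; downloading it uses no API quota."""
    return f"https://github.com/{owner}/{repo}/archive/{ref}.{fmt}"

# Stream the tarball (no token needed for public repos):
# resp = requests.get(repo_archive_url("torvalds", "linux", "master"), stream=True)
```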
Topics no longer need a preview header. Older guides say to add Accept: application/vnd.github.mercy-preview+json to get repository topics via REST; that preview has since graduated, and topics come back with the standard application/vnd.github+json header. GraphQL returns them natively as well.
What to Build With This Data
Developer network analysis: Map follower/following relationships to identify influential nodes in the developer community. Who bridges JavaScript and Python ecosystems?
License compliance monitoring: Scan organizations for repos using GPL or other copyleft licenses in commercial products.
OSS health metrics: Build dashboards tracking issue response time, PR merge rate, and contributor diversity across projects you depend on.
Trend detection: Track which topics are gaining repos month-over-month. Which frameworks are developers gravitating toward in 2026?
Hiring intelligence: Find developers by the languages they commit in, the repos they contribute to, and their activity patterns.
GitHub's APIs are well-designed and generous enough for most projects. The REST API handles simple lookups, GraphQL handles complex queries, and GH Archive covers historical analysis at scale. Start with the smallest approach that works — most projects never actually need more than a few hundred API calls per day.