How to Scrape GitHub Repositories with Python (2026)
GitHub is one of the richest public datasets on the internet. Millions of repositories, contributor graphs, topic tags, star counts, commit histories — all accessible through a clean REST API. Whether you're building a tool to track trending libraries, doing competitive research, scraping data for a dataset, or just want to know which repos in a niche are gaining traction, the GitHub API is the right starting point.
This post covers what actually works in 2026: searching repos, pulling contributor lists, using code search, handling pagination without hitting rate limits, bulk collection with rotating proxies, and building a real dataset pipeline.
Rate Limits and Authentication
Before you write a single line of code, understand the rate limit situation.
Unauthenticated: 60 requests per hour. Basically useless for any real work.
Authenticated with a personal access token: 5,000 requests per hour. Workable for most tasks.
Search API specifically: 30 requests per minute authenticated for most search endpoints, but only 10/min for code search. This is a separate cap from the main 5,000/hr limit.
To get a token, go to GitHub Settings > Developer settings > Personal access tokens > Tokens (classic). For read-only public repo access, you only need the public_repo scope — or no scopes at all if you just want public data.
Store it in an environment variable, not in your code:
import os
import time
import re
import csv
import json
import sqlite3
import requests
from pathlib import Path
from datetime import datetime, timezone
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN")
session = requests.Session()
session.headers.update({
"Authorization": f"Bearer {GITHUB_TOKEN}",
"Accept": "application/vnd.github+json",
"X-GitHub-Api-Version": "2022-11-28",
})
Using a Session object means you don't repeat headers on every call, and it reuses the underlying TCP connection which speeds things up slightly.
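Before collecting anything, it's worth confirming the token is actually being picked up. GET /rate_limit reports your current quotas and never counts against any of them. A quick sketch using the session configured above (the helper names here are mine, not GitHub's):

```python
def summarize_rate_limit(payload):
    """Pull the useful numbers out of a /rate_limit response body."""
    res = payload["resources"]
    return {
        "core_limit": res["core"]["limit"],        # 5,000/hr with a token, 60 without
        "core_remaining": res["core"]["remaining"],
        "search_limit": res["search"]["limit"],    # separate pool for /search/* endpoints
    }

def verify_token(session):
    """Hit GET /rate_limit — this call is free and does not count against quota."""
    r = session.get("https://api.github.com/rate_limit", timeout=30)
    r.raise_for_status()
    return summarize_rate_limit(r.json())
```

If `verify_token(session)["core_limit"]` comes back as 60, the Authorization header isn't being sent.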
Understanding Rate Limit Headers
Every GitHub API response includes rate limit information in its headers. Read them on every request:
def check_rate_limit(response):
"""Parse rate limit headers and sleep if we're about to be blocked."""
remaining = int(response.headers.get("X-RateLimit-Remaining", 1))
limit = int(response.headers.get("X-RateLimit-Limit", 5000))
reset_at = int(response.headers.get("X-RateLimit-Reset", 0))
if remaining == 0:
wait = max(0, reset_at - int(time.time())) + 2
print(f"Rate limit exhausted ({limit}/hr). Waiting {wait}s...")
time.sleep(wait)
elif remaining < 100:
# Slow down when getting close to the limit
time.sleep(0.5)
return remaining
def safe_get(url, params=None, retries=3):
"""GET with retry logic and rate limit handling."""
for attempt in range(retries):
try:
r = session.get(url, params=params, timeout=30)
check_rate_limit(r)
if r.status_code == 403:
# Could be rate limit or abuse detection
retry_after = int(r.headers.get("Retry-After", 60))
print(f"403 received. Waiting {retry_after}s...")
time.sleep(retry_after)
continue
if r.status_code == 422:
# Unprocessable — usually a bad query, don't retry
print(f"422 Unprocessable: {r.json().get('message', '')}")
return None
r.raise_for_status()
return r
except requests.exceptions.ConnectionError:
wait = 2 ** attempt
print(f"Connection error. Retry {attempt+1}/{retries} in {wait}s...")
time.sleep(wait)
except requests.exceptions.Timeout:
print(f"Timeout on attempt {attempt+1}")
time.sleep(5)
return None
Searching Repositories
The search endpoint is GET /search/repositories. It accepts a q parameter using GitHub's search syntax and returns up to 1,000 results per query (100 per page, paginated).
def search_repos(query, sort="stars", order="desc", per_page=100):
"""
Search GitHub repositories by query string.
Args:
query: GitHub search syntax string
Examples:
- "topic:fastapi stars:>500 language:python"
- "machine learning created:>2024-01-01 stars:>100"
- "org:microsoft language:typescript"
sort: stars, forks, help-wanted-issues, updated
order: asc or desc
per_page: results per page (max 100)
Returns:
List of repository dicts
"""
url = "https://api.github.com/search/repositories"
params = {
"q": query,
"sort": sort,
"order": order,
"per_page": per_page,
}
r = safe_get(url, params=params)
if r is None:
return []
data = r.json()
total = data.get("total_count", 0)
items = data.get("items", [])
print(f"Query matched {total:,} repos, returning first {len(items)}")
return items
def extract_repo_fields(repo):
"""Extract the most useful fields from a raw GitHub repo response."""
return {
"id": repo["id"],
"full_name": repo["full_name"],
"name": repo["name"],
"owner": repo["owner"]["login"],
"owner_type": repo["owner"]["type"], # User or Organization
        "description": repo.get("description") or "",  # null in the API when unset, so guard with `or`
        "homepage": repo.get("homepage") or "",
        "stars": repo["stargazers_count"],
        "forks": repo["forks_count"],
        "watchers": repo["watchers_count"],
        "open_issues": repo["open_issues_count"],
        "language": repo.get("language") or "",
        "topics": ", ".join(repo.get("topics") or []),
        "license": (repo.get("license") or {}).get("spdx_id", ""),
        "default_branch": repo.get("default_branch", "main"),
"size_kb": repo["size"],
"is_fork": repo["fork"],
"is_archived": repo.get("archived", False),
"is_template": repo.get("is_template", False),
"has_wiki": repo.get("has_wiki", False),
"has_issues": repo.get("has_issues", True),
"has_pages": repo.get("has_pages", False),
"created_at": repo["created_at"],
"updated_at": repo["updated_at"],
"pushed_at": repo["pushed_at"],
"url": repo["html_url"],
"clone_url": repo["clone_url"],
"api_url": repo["url"],
"collected_at": datetime.now(timezone.utc).isoformat(),
}
# Example: trending Python ML repos from the past year
repos = search_repos(
"topic:machine-learning language:python stars:>1000 pushed:>2025-01-01"
)
for repo in repos[:5]:
r = extract_repo_fields(repo)
print(f"{r['full_name']}: {r['stars']:,} stars, {r['language']}, topics: {r['topics'][:50]}")
Getting Full Repository Details
The search endpoint returns a subset of fields. For the complete metadata, including subscriber count, network count, and full topics list, hit the individual repo endpoint:
def get_repo_details(full_name):
"""
Fetch complete repository metadata.
Args:
full_name: "owner/repo" string (e.g. "tiangolo/fastapi")
Returns:
Full repo dict with all fields, or None on error
"""
url = f"https://api.github.com/repos/{full_name}"
r = safe_get(url)
if r is None:
return None
    repo = r.json()
    details = extract_repo_fields(repo)
    # These fields only appear on the individual-repo endpoint, not in search results
    details["subscribers_count"] = repo.get("subscribers_count", 0)
    details["network_count"] = repo.get("network_count", 0)
    return details
def get_repo_topics(full_name):
    """Get all topics for a repository (returned under "names" in the response body)."""
url = f"https://api.github.com/repos/{full_name}/topics"
r = safe_get(url)
if r is None:
return []
return r.json().get("names", [])
def get_repo_languages(full_name):
"""Get byte counts by language for a repo."""
url = f"https://api.github.com/repos/{full_name}/languages"
r = safe_get(url)
if r is None:
return {}
return r.json()
# Enrich search results with full details
def enrich_repos(repos, delay=0.2):
"""Fetch full details for each repo in the list."""
enriched = []
for i, repo in enumerate(repos):
        full_name = repo["full_name"] if isinstance(repo, dict) else repo  # accept dicts or "owner/repo" strings
details = get_repo_details(full_name)
if details:
# Also get language breakdown
langs = get_repo_languages(full_name)
details["languages_json"] = json.dumps(langs)
enriched.append(details)
if (i + 1) % 20 == 0:
print(f" Enriched {i+1}/{len(repos)} repos...")
time.sleep(delay)
return enriched
Getting Contributors
Once you have a repo's full_name (e.g. tiangolo/fastapi), you can pull the contributor list from GET /repos/{owner}/{repo}/contributors. The helper below relies on get_next_page_url, defined in the Pagination section further down — define that first if you're running the snippets in order.
def get_contributors(full_name, max_pages=5, include_anon=False):
"""
Get contributors for a repository.
Args:
full_name: "owner/repo" string
max_pages: cap on pagination (each page = 100 contributors)
include_anon: include anonymous contributors
Returns:
List of contributor dicts sorted by commit count descending
"""
owner, repo = full_name.split("/", 1)
url = f"https://api.github.com/repos/{owner}/{repo}/contributors"
contributors = []
page = 1
while page <= max_pages:
params = {
"per_page": 100,
"page": page,
"anon": "1" if include_anon else "0",
}
r = safe_get(url, params=params)
if r is None:
break
batch = r.json()
if not batch:
break
for c in batch:
if c.get("type") == "Anonymous":
contributors.append({
"login": c.get("email", "anonymous"),
"contributions": c["contributions"],
"type": "anonymous",
"profile": "",
})
else:
contributors.append({
"login": c["login"],
"contributions": c["contributions"],
"type": c["type"], # User or Bot
"profile": c["html_url"],
"avatar": c["avatar_url"],
})
# Check if there are more pages
if get_next_page_url(r) is None:
break
page += 1
time.sleep(0.3)
return contributors
# Example: top contributors to a major project
fastapi_contributors = get_contributors("tiangolo/fastapi", max_pages=2)
print(f"FastAPI has {len(fastapi_contributors)} contributors")
for c in fastapi_contributors[:5]:
print(f" {c['login']}: {c['contributions']} commits")
Getting Commit History
For tracking project velocity and contributor patterns over time:
def get_commits(full_name, since=None, until=None, author=None, max_pages=10):
"""
Get commit history for a repository.
Args:
full_name: "owner/repo" string
since: ISO 8601 datetime string (e.g. "2025-01-01T00:00:00Z")
until: ISO 8601 datetime string
author: GitHub username to filter by
max_pages: pagination cap
Returns:
List of commit dicts
"""
owner, repo = full_name.split("/", 1)
url = f"https://api.github.com/repos/{owner}/{repo}/commits"
params = {"per_page": 100}
if since:
params["since"] = since
if until:
params["until"] = until
if author:
params["author"] = author
commits = []
page = 1
while page <= max_pages:
params["page"] = page
r = safe_get(url, params=params)
if r is None:
break
batch = r.json()
if not batch:
break
for c in batch:
commit_data = c.get("commit", {})
author_data = commit_data.get("author", {})
committer_data = commit_data.get("committer", {})
github_author = c.get("author") or {}
commits.append({
"sha": c["sha"][:8],
"full_sha": c["sha"],
"message": commit_data.get("message", "").split("\n")[0][:200],
"author_name": author_data.get("name", ""),
"author_email": author_data.get("email", ""),
"author_date": author_data.get("date", ""),
"committer_date": committer_data.get("date", ""),
"github_login": github_author.get("login", ""),
                # Note: the list-commits endpoint omits per-commit stats, so these
                # stay 0 unless you fetch each commit individually
                "additions": c.get("stats", {}).get("additions", 0),
                "deletions": c.get("stats", {}).get("deletions", 0),
"comment_count": commit_data.get("comment_count", 0),
})
if get_next_page_url(r) is None:
break
page += 1
time.sleep(0.3)
return commits
# Example: get 2025 commit activity
commits_2025 = get_commits(
"tiangolo/fastapi",
since="2025-01-01T00:00:00Z",
until="2025-12-31T23:59:59Z",
max_pages=5
)
print(f"FastAPI 2025 commits: {len(commits_2025)}")
Getting Issues and Pull Requests
def get_issues(full_name, state="open", labels=None, max_pages=5):
"""
Get issues (and optionally PRs) for a repository.
The GitHub API returns PRs mixed in with issues by default.
Filter by checking for 'pull_request' key in each item.
Args:
full_name: "owner/repo" string
state: open, closed, or all
labels: comma-separated label names (e.g. "bug,help wanted")
max_pages: pagination cap
Returns:
List of issue dicts
"""
owner, repo = full_name.split("/", 1)
url = f"https://api.github.com/repos/{owner}/{repo}/issues"
params = {
"state": state,
"per_page": 100,
"sort": "created",
"direction": "desc",
}
if labels:
params["labels"] = labels
issues = []
page = 1
while page <= max_pages:
params["page"] = page
r = safe_get(url, params=params)
if r is None:
break
batch = r.json()
if not batch:
break
for item in batch:
is_pr = "pull_request" in item
issues.append({
"number": item["number"],
"type": "pr" if is_pr else "issue",
"title": item["title"],
"state": item["state"],
"author": (item.get("user") or {}).get("login", ""),
"created_at": item["created_at"],
"updated_at": item["updated_at"],
"closed_at": item.get("closed_at", ""),
"labels": ", ".join(l["name"] for l in item.get("labels", [])),
"comments": item.get("comments", 0),
"body_preview": (item.get("body") or "")[:300],
"url": item["html_url"],
"is_pr": is_pr,
})
if get_next_page_url(r) is None:
break
page += 1
time.sleep(0.3)
return issues
# Example: help-wanted issues (good for finding contribution opportunities)
help_wanted = get_issues(
"tiangolo/fastapi",
state="open",
labels="help wanted",
max_pages=2
)
print(f"FastAPI help-wanted issues: {len(help_wanted)}")
Code Search
The code search endpoint lets you search across file contents on GitHub. This is useful for finding repos that use a specific library, pattern, or configuration value.
def search_code(query, per_page=30):
"""
Search code across all public GitHub repositories.
Args:
query: GitHub code search syntax
Examples:
- "import thordata language:python"
- "GITHUB_TOKEN filename:.env"
- "org:django extension:py def middleware"
per_page: results per page (max 30 for code search)
Returns:
List of code match dicts
"""
url = "https://api.github.com/search/code"
params = {
"q": query,
"per_page": per_page,
}
# Code search has stricter rate limits — add extra delay
time.sleep(6)
r = safe_get(url, params=params)
if r is None:
return []
data = r.json()
results = []
for item in data.get("items", []):
results.append({
            "repo": item["repository"]["full_name"],
            # The embedded repository object is minimal; star counts are usually absent
            "repo_stars": item["repository"].get("stargazers_count", 0),
            "file": item["name"],
            "path": item["path"],
            "url": item["html_url"],
            "raw_url": item.get("download_url", ""),
            # Only populated when the request uses the
            # "application/vnd.github.text-match+json" Accept header
            "text_matches": [
                m.get("fragment", "") for m in item.get("text_matches", [])
            ],
})
return results
def get_file_contents(full_name, path):
"""
Download the raw contents of a file from a repository.
Args:
full_name: "owner/repo" string
path: file path within the repo (e.g. "src/main.py")
Returns:
File content as string, or None
"""
owner, repo = full_name.split("/", 1)
url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
r = safe_get(url)
if r is None:
return None
data = r.json()
if data.get("encoding") == "base64":
import base64
return base64.b64decode(data["content"]).decode("utf-8", errors="replace")
return data.get("content", "")
# Example: find Python files importing a specific package
results = search_code("import thordata language:python")
for r in results[:5]:
print(f" {r['repo']} — {r['path']}")
Code search is the most rate-limited endpoint — if you're making multiple calls, pace them to roughly one request every 6-7 seconds to stay inside the 10/min cap.
Pagination with Link Headers
GitHub uses Link headers for pagination rather than returning total page counts in the body. The header looks like this:
Link: <https://api.github.com/search/repositories?q=...&page=2>; rel="next",
<https://api.github.com/search/repositories?q=...&page=10>; rel="last"
Parse it like this:
def get_next_page_url(response):
"""Extract the next-page URL from a GitHub Link header."""
link_header = response.headers.get("Link", "")
if not link_header:
return None
match = re.search(r'<([^>]+)>;\s*rel="next"', link_header)
return match.group(1) if match else None
def get_last_page_number(response):
"""Extract the last page number from a GitHub Link header."""
link_header = response.headers.get("Link", "")
if not link_header:
return 1
match = re.search(r'<[^>]+[?&]page=(\d+)>;\s*rel="last"', link_header)
return int(match.group(1)) if match else 1
def search_repos_all_pages(query, max_pages=10, delay=1.0):
"""
Paginate through all search results for a query.
GitHub caps total results at 1,000 per query even with pagination.
Narrow your query if you need more than that.
Args:
query: GitHub search syntax string
max_pages: safety cap (GitHub allows max 10 pages of 100)
delay: seconds between page requests
Returns:
List of all repo dicts
"""
url = "https://api.github.com/search/repositories"
params = {"q": query, "per_page": 100, "sort": "stars", "order": "desc"}
all_repos = []
page_count = 0
while url and page_count < max_pages:
r = safe_get(url, params=params)
if r is None:
break
data = r.json()
batch = data.get("items", [])
all_repos.extend(batch)
total_count = data.get("total_count", 0)
print(f"Page {page_count+1}: +{len(batch)} repos (total fetched: {len(all_repos)}/{min(total_count, 1000)})")
url = get_next_page_url(r)
params = {} # URL already has params encoded after first request
page_count += 1
if url:
time.sleep(delay)
return all_repos
Note that GitHub caps search results at 1,000 total even with pagination. If your query matches more than that, narrow it down with additional filters (language:, created:, stars:>N).
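If a query matches far more than 1,000 repos, the usual workaround is to split it into created-date windows that each fall under the cap and union the results. A sketch under my own naming (the 30-day default window size is arbitrary; search_beyond_cap calls the search_repos_all_pages function defined above):

```python
from datetime import date, timedelta

def date_windows(start, end, days=30):
    """Yield created:START..END filter strings covering [start, end] in fixed windows."""
    cur = start
    while cur <= end:
        window_end = min(cur + timedelta(days=days - 1), end)
        yield f"created:{cur.isoformat()}..{window_end.isoformat()}"
        cur = window_end + timedelta(days=1)

def search_beyond_cap(base_query, start, end, days=30):
    """Run one capped search per date window and merge results by full_name."""
    merged = {}
    for window in date_windows(start, end, days=days):
        for repo in search_repos_all_pages(f"{base_query} {window}", max_pages=10):
            merged[repo["full_name"]] = repo
    return list(merged.values())
```

Shrink the window size until no single window reports more than 1,000 matches.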
Anti-Detection and Proxy Setup
For large-scale collection, single-IP scraping gets flagged by GitHub's abuse detection. The API token limits are per-token, but the secondary rate limit (abuse detection) is per-IP. To work around this safely:
import random
# Header rotation pool — vary User-Agent and other headers
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
def make_session_with_proxy(proxy_url=None, token=None):
"""
Create a requests session optionally configured with a proxy.
For rotating residential proxies, use ThorData:
https://thordata.partnerstack.com/partner/0a0x4nzh
Args:
proxy_url: Full proxy URL (e.g. "http://user:pass@host:port")
token: GitHub personal access token
Returns:
Configured requests.Session
"""
s = requests.Session()
s.headers.update({
"Accept": "application/vnd.github+json",
"X-GitHub-Api-Version": "2022-11-28",
"User-Agent": random.choice(USER_AGENTS),
})
if token:
s.headers["Authorization"] = f"Bearer {token}"
if proxy_url:
s.proxies = {
"http": proxy_url,
"https": proxy_url,
}
return s
def build_thordata_proxy(username, password, country="US", sticky=False, session_id=None):
"""
Build a ThorData proxy URL with optional sticky sessions.
ThorData residential proxies: https://thordata.partnerstack.com/partner/0a0x4nzh
Args:
username: ThorData account username
password: ThorData account password
country: 2-letter country code for geo-targeting
sticky: Use sticky session (same IP for duration)
session_id: Session ID for sticky sessions (random if not provided)
Returns:
Proxy URL string
"""
import uuid
if sticky:
if session_id is None:
session_id = str(uuid.uuid4())[:8]
user = f"{username}-session-{session_id}-country-{country}"
else:
user = f"{username}-country-{country}"
return f"http://{user}:{password}@gate.thordata.net:7777"
# Using proxies with multiple tokens for high-volume collection
class MultiTokenClient:
"""Rotate between multiple GitHub tokens to maximize throughput."""
def __init__(self, tokens, proxy_url=None):
self.sessions = [
make_session_with_proxy(proxy_url=proxy_url, token=t)
for t in tokens
]
self.current = 0
def get(self, url, params=None):
"""Make a GET request using the next available session."""
s = self.sessions[self.current]
self.current = (self.current + 1) % len(self.sessions)
r = s.get(url, params=params, timeout=30)
check_rate_limit(r)
r.raise_for_status()
return r
SQLite Database for Repository Storage
Persist your collection in SQLite for deduplication and analysis:
def init_repos_db(path="github_repos.db"):
"""Initialize SQLite database for repository storage."""
conn = sqlite3.connect(path)
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE IF NOT EXISTS repos (
id INTEGER PRIMARY KEY,
full_name TEXT UNIQUE NOT NULL,
name TEXT,
owner TEXT,
owner_type TEXT,
description TEXT,
stars INTEGER DEFAULT 0,
forks INTEGER DEFAULT 0,
watchers INTEGER DEFAULT 0,
open_issues INTEGER DEFAULT 0,
language TEXT,
topics TEXT,
license TEXT,
size_kb INTEGER DEFAULT 0,
is_fork INTEGER DEFAULT 0,
is_archived INTEGER DEFAULT 0,
created_at TEXT,
pushed_at TEXT,
url TEXT,
languages_json TEXT,
collected_at TEXT
);
CREATE TABLE IF NOT EXISTS contributors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
repo_full_name TEXT NOT NULL,
login TEXT NOT NULL,
contributions INTEGER DEFAULT 0,
type TEXT,
profile TEXT,
collected_at TEXT,
UNIQUE(repo_full_name, login)
);
CREATE TABLE IF NOT EXISTS commits (
id INTEGER PRIMARY KEY AUTOINCREMENT,
repo_full_name TEXT NOT NULL,
sha TEXT NOT NULL,
author_login TEXT,
author_date TEXT,
message TEXT,
UNIQUE(repo_full_name, sha)
);
CREATE INDEX IF NOT EXISTS idx_repos_stars ON repos(stars DESC);
CREATE INDEX IF NOT EXISTS idx_repos_language ON repos(language);
CREATE INDEX IF NOT EXISTS idx_repos_pushed ON repos(pushed_at);
CREATE INDEX IF NOT EXISTS idx_contributors_repo ON contributors(repo_full_name);
""")
conn.commit()
return conn
def upsert_repo(conn, repo_data):
    """Insert or update a repository record, keeping only columns the repos table defines."""
    # extract_repo_fields returns more fields than the table stores (homepage,
    # clone_url, etc.) — filter to actual columns or SQLite will reject the INSERT
    cols = {row[1] for row in conn.execute("PRAGMA table_info(repos)")}
    data = {k: v for k, v in repo_data.items() if k in cols}
    fields = list(data.keys())
    placeholders = ", ".join(["?"] * len(fields))
    updates = ", ".join([f"{f} = excluded.{f}" for f in fields if f != "id"])
    sql = f"""
        INSERT INTO repos ({", ".join(fields)})
        VALUES ({placeholders})
        ON CONFLICT(full_name) DO UPDATE SET {updates}
    """
    conn.execute(sql, list(data.values()))
    conn.commit()
def upsert_contributors(conn, repo_full_name, contributors):
"""Batch insert contributors for a repository."""
now = datetime.now(timezone.utc).isoformat()
conn.executemany(
"""
INSERT INTO contributors (repo_full_name, login, contributions, type, profile, collected_at)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT(repo_full_name, login) DO UPDATE SET
contributions = excluded.contributions
""",
[
(repo_full_name, c["login"], c["contributions"],
c.get("type", ""), c.get("profile", ""), now)
for c in contributors
]
)
conn.commit()
Real-World Use Cases
Use Case 1: Track Rising Python Libraries
Build a weekly report of Python libraries gaining momentum — useful for content strategy, investment research, or staying current with the ecosystem.
def find_rising_python_libs(min_stars=100, months_back=6):
"""
Find Python repositories that gained significant stars recently.
Looks for repos created within months_back that already have min_stars.
"""
from datetime import timedelta
cutoff = datetime.now(timezone.utc) - timedelta(days=30 * months_back)
since_str = cutoff.strftime("%Y-%m-%d")
query = f"language:python stars:>{min_stars} created:>{since_str}"
repos = search_repos_all_pages(query, max_pages=5)
# Sort by stars per day since creation
results = []
for repo in repos:
r = extract_repo_fields(repo)
created = datetime.fromisoformat(r["created_at"].replace("Z", "+00:00"))
days_old = max(1, (datetime.now(timezone.utc) - created).days)
r["stars_per_day"] = r["stars"] / days_old
results.append(r)
results.sort(key=lambda x: x["stars_per_day"], reverse=True)
print("\n=== Fastest Rising Python Libraries ===")
for r in results[:15]:
        print(f" {r['full_name']}: {r['stars']:,} stars, {r['stars_per_day']:.1f}/day | {(r['description'] or '')[:60]}")
return results
rising = find_rising_python_libs(min_stars=200, months_back=3)
Use Case 2: Competitive Intelligence for a Tech Stack
Map the open-source ecosystem around a technology to understand adoption, key players, and momentum:
def map_tech_ecosystem(tech_name, language=None):
"""
Map repositories related to a technology.
Returns a picture of the ecosystem: main projects, forks, tooling, tutorials.
"""
queries = [
f"topic:{tech_name}",
        f"{tech_name} tutorial language:{language or 'python'}",
f"{tech_name} integration stars:>50",
]
all_repos = []
seen = set()
for q in queries:
repos = search_repos(q, per_page=50)
for repo in repos:
fn = repo["full_name"]
if fn not in seen:
seen.add(fn)
all_repos.append(extract_repo_fields(repo))
time.sleep(2)
# Analyze the ecosystem
total_stars = sum(r["stars"] for r in all_repos)
languages = {}
for r in all_repos:
lang = r["language"] or "Unknown"
languages[lang] = languages.get(lang, 0) + 1
print(f"\n=== {tech_name} Ecosystem ===")
print(f"Repos found: {len(all_repos)}, Total stars: {total_stars:,}")
print(f"Top languages: {sorted(languages.items(), key=lambda x: -x[1])[:5]}")
return sorted(all_repos, key=lambda x: x["stars"], reverse=True)
fastapi_ecosystem = map_tech_ecosystem("fastapi", language="python")
Use Case 3: Developer Contact Research
Find active contributors to relevant projects for recruiting or partnership outreach:
def find_active_contributors(repo_list, min_commits=10):
"""
Find developers who are actively contributing to a set of repos.
Useful for recruiting, outreach, or identifying experts in a domain.
Returns contributors with min_commits or more across the repo set.
"""
contributor_stats = {}
for full_name in repo_list:
print(f"Getting contributors for {full_name}...")
contributors = get_contributors(full_name, max_pages=2)
for c in contributors:
login = c["login"]
if login not in contributor_stats:
contributor_stats[login] = {
"login": login,
"profile": c.get("profile", ""),
"total_contributions": 0,
"active_repos": [],
"type": c.get("type", "User"),
}
contributor_stats[login]["total_contributions"] += c["contributions"]
contributor_stats[login]["active_repos"].append(full_name)
time.sleep(1)
# Filter by minimum contributions and exclude bots
active = [
c for c in contributor_stats.values()
if c["total_contributions"] >= min_commits
and c["type"] != "Bot"
and "[bot]" not in c["login"]
]
return sorted(active, key=lambda x: x["total_contributions"], reverse=True)
python_web_contributors = find_active_contributors(
["tiangolo/fastapi", "encode/httpx", "pydantic/pydantic"],
min_commits=20
)
print(f"\nFound {len(python_web_contributors)} active contributors")
for c in python_web_contributors[:10]:
repos = ", ".join(c["active_repos"])
print(f" {c['login']}: {c['total_contributions']} commits across {repos}")
Use Case 4: Security Research — Finding Exposed Credentials
A legitimate use of code search is scanning your own organization's repos for accidentally committed secrets:
def scan_org_for_secrets(org_name):
"""
Scan an organization's public repos for common accidentally-committed secrets.
Useful for security audits of your own organization.
IMPORTANT: Only use this on your own organization. Do not use against others.
"""
secret_patterns = [
f"org:{org_name} filename:.env",
f"org:{org_name} password filename:config.py",
f"org:{org_name} PRIVATE_KEY extension:pem",
f"org:{org_name} AWS_SECRET_ACCESS_KEY",
f"org:{org_name} api_key filename:settings",
]
findings = []
for pattern in secret_patterns:
print(f"Checking: {pattern}")
results = search_code(pattern, per_page=10)
if results:
findings.append({
"pattern": pattern,
"matches": results,
})
print(f" WARNING: {len(results)} potential matches found!")
time.sleep(8) # Respect code search rate limit
return findings
Exporting to CSV and JSON
def export_to_csv(repos, output_path="github_repos.csv"):
"""Export repository list to CSV."""
if not repos:
print("No repos to export")
return
output = Path(output_path)
fieldnames = list(repos[0].keys())
with open(output, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(repos)
print(f"Exported {len(repos)} repos to {output} ({output.stat().st_size:,} bytes)")
def export_to_jsonl(repos, output_path="github_repos.jsonl"):
"""Export repository list to JSON Lines format."""
output = Path(output_path)
with open(output, "w", encoding="utf-8") as f:
for repo in repos:
f.write(json.dumps(repo) + "\n")
print(f"Exported {len(repos)} repos to {output}")
Full Pipeline: Collecting a Dataset
Here is a complete end-to-end pipeline that collects trending repositories, enriches them with contributor data, and saves everything to SQLite:
def run_collection_pipeline(queries, output_db="github_dataset.db", max_repos=500):
"""
Full collection pipeline:
1. Search repos using multiple queries
2. Deduplicate
3. Fetch full details
4. Fetch contributor lists
5. Save to SQLite
Args:
queries: List of GitHub search query strings
output_db: Path to SQLite database
max_repos: Maximum repos to collect
Returns:
Collection statistics dict
"""
conn = init_repos_db(output_db)
all_repos = {} # full_name -> repo dict
stats = {
"queries_run": 0,
"repos_collected": 0,
"contributors_collected": 0,
"errors": 0,
"started_at": datetime.now(timezone.utc).isoformat(),
}
# Phase 1: Search
print("=== Phase 1: Searching repositories ===")
for query in queries:
print(f"\nQuery: {query}")
repos = search_repos_all_pages(query, max_pages=5)
for repo in repos:
fn = repo["full_name"]
if fn not in all_repos:
all_repos[fn] = repo
stats["queries_run"] += 1
print(f" Running total: {len(all_repos)} unique repos")
time.sleep(2)
# Phase 2: Enrich and save
print(f"\n=== Phase 2: Enriching {len(all_repos)} repos ===")
repo_list = list(all_repos.values())[:max_repos]
for i, repo in enumerate(repo_list):
full_name = repo["full_name"]
try:
# Get full details
details = get_repo_details(full_name)
if details:
langs = get_repo_languages(full_name)
details["languages_json"] = json.dumps(langs)
upsert_repo(conn, details)
stats["repos_collected"] += 1
# Get contributors
contributors = get_contributors(full_name, max_pages=2)
if contributors:
upsert_contributors(conn, full_name, contributors)
stats["contributors_collected"] += len(contributors)
if (i + 1) % 20 == 0:
print(f" Progress: {i+1}/{len(repo_list)} repos enriched")
time.sleep(0.5)
except Exception as e:
print(f" Error on {full_name}: {e}")
stats["errors"] += 1
stats["completed_at"] = datetime.now(timezone.utc).isoformat()
stats["database"] = output_db
print(f"\n=== Collection Complete ===")
print(f"Repos collected: {stats['repos_collected']}")
print(f"Contributors collected: {stats['contributors_collected']}")
print(f"Errors: {stats['errors']}")
print(f"Database: {output_db}")
return stats
# Example: build a Python web framework ecosystem dataset
results = run_collection_pipeline(
queries=[
"topic:fastapi language:python stars:>100",
"topic:django language:python stars:>200",
"topic:flask language:python stars:>100 pushed:>2025-01-01",
"web framework python stars:>500 language:python",
],
output_db="python_web_frameworks.db",
max_repos=300,
)
Handling Heavy Scraping Beyond API Limits
For most use cases, 5,000 requests per hour is enough. But if you're building something that needs to scrape thousands of repos continuously, you'll hit the ceiling fast.
The typical solution is rotating proxies. Each request arrives from a different IP, which keeps you clear of the per-IP secondary limit; the per-token 5,000/hr cap still applies, which is why pairing proxies with multiple tokens (as in MultiTokenClient above) works best. I've had good results with ThorData's rotating residential proxies — their residential pool works cleanly with the requests library and doesn't trip GitHub's bot detection the way datacenter IPs sometimes do.
# Configure a session with ThorData rotating residential proxies
# Sign up at: https://thordata.partnerstack.com/partner/0a0x4nzh
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
proxy_url = build_thordata_proxy(THORDATA_USER, THORDATA_PASS, country="US")
proxy_session = make_session_with_proxy(proxy_url=proxy_url, token=GITHUB_TOKEN)
# Now use proxy_session instead of session for all requests
# Each request automatically routes through a different residential IP
With rotating proxies, you spread load across IPs automatically. That said, for most scraping tasks the public API with a token is sufficient — reach for proxies when you actually need scale, not by default.
Summary
The GitHub REST API is solid and well-documented. The main things to get right:
- Always authenticate — 60 req/hr unauthenticated is nothing
- Check X-RateLimit-Remaining on every response and back off when it hits zero
- Use the Link header for pagination, not manual page counting
- Code search has a separate, stricter rate limit — slow down for that endpoint
- Use exponential backoff on 403/429 responses
- For large-scale collection, rotating residential proxies spread load across IPs
The endpoints covered here — /search/repositories, /repos/{owner}/{repo}/contributors, /repos/{owner}/{repo}/commits, /repos/{owner}/{repo}/issues, and /search/code — cover the majority of what you'd want for repo analysis, competitive intelligence, or dataset building. The full API reference at docs.github.com/en/rest has the complete field listings if you need something more specific.
For real-world data pipelines, combine the SQLite storage layer with a scheduled job (cron or a simple sleep loop) and you have a continuously updating dataset that costs nothing to run beyond the occasional proxy bill.
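The sleep-loop version of that scheduler fits in a few lines. run_on_schedule is a generic name I'm using here; pass it any zero-argument job, such as a lambda wrapping the run_collection_pipeline function from earlier:

```python
import time

def run_on_schedule(job, interval_seconds, max_cycles=None):
    """Call job() repeatedly, sleeping between runs; stop after max_cycles if given."""
    results = []
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        results.append(job())
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(interval_seconds)
    return results

# e.g. one collection cycle per day:
# run_on_schedule(lambda: run_collection_pipeline(queries, output_db="github_dataset.db"),
#                 interval_seconds=24 * 3600)
```

For anything longer-running than a laptop session, cron calling a plain script is sturdier than an in-process loop.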