How to Scrape GitHub Trending Repos in 2026: Stars, Languages & Contributor Analysis

2026-04-09 ["github" "web scraping" "python" "trending repos" "github api"]

GitHub's trending page is one of the best publicly available signals for developer momentum. Unlike GitHub's overall star counts — which favor established projects that have been accumulating stars for years — the trending page specifically surfaces what's gaining traction right now. Projects that appear there have demonstrated real velocity: engineers are discovering them, sharing them, and starring them within a compressed time window.

This makes the data valuable for several use cases: spotting emerging tools before they hit mainstream awareness, tracking which programming languages are gaining developer interest, competitive research on new libraries in your space, and building a newsletter or analytics product around developer trends. The challenge is that there's no official API for trending data. This guide covers how to get it programmatically, enrich it via the GitHub REST API, track it over time in SQLite, and build real insights from the data.

The trending page lives at github.com/trending with several filter parameters:

https://github.com/trending                         # all languages, daily
https://github.com/trending?since=weekly            # all languages, weekly
https://github.com/trending?since=monthly           # all languages, monthly
https://github.com/trending/python                  # Python only, daily
https://github.com/trending/python?since=weekly     # Python only, weekly
https://github.com/trending/rust?since=monthly      # Rust only, monthly

The since parameter accepts: daily, weekly, monthly. Language slugs are lowercase: python, javascript, typescript, rust, go, java, c, c++, c#, ruby, swift, kotlin, scala, elixir, haskell, zig, etc.

The good news: GitHub trending is server-rendered HTML. Unlike React SPAs, you don't need a headless browser — requests with a realistic User-Agent works reliably.

import requests
from bs4 import BeautifulSoup
import time
import re
from typing import Optional

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Cache-Control": "no-cache",
}

session = requests.Session()
session.headers.update(HEADERS)


def scrape_trending(
    language: str = "",
    since: str = "daily",
    spoken_language: str = "",
) -> list[dict]:
    """
    Scrape GitHub trending page.

    Args:
        language: programming language slug (e.g. 'python', 'rust') or '' for all
        since: 'daily', 'weekly', or 'monthly'
        spoken_language: filter by spoken language (e.g. 'en', 'zh', 'ja')

    Returns:
        List of repository dicts with metadata
    """
    base = "https://github.com/trending"
    url = f"{base}/{language}" if language else base
    params = {"since": since}
    if spoken_language:
        params["spoken_language_code"] = spoken_language

    resp = session.get(url, params=params, timeout=15)
    if resp.status_code == 429:
        print(f"Rate limited, waiting 60s...")
        time.sleep(60)
        resp = session.get(url, params=params, timeout=15)

    if resp.status_code != 200:
        raise RuntimeError(f"HTTP {resp.status_code} fetching trending page: {url}")

    return _parse_trending_html(resp.text, since)


def _parse_trending_html(html: str, since: str) -> list[dict]:
    """Parse the trending page HTML into structured repo data."""
    soup = BeautifulSoup(html, "html.parser")
    repos = []

    for article in soup.select("article.Box-row"):
        try:
            repo = _parse_repo_article(article, since)
            if repo:
                repos.append(repo)
        except Exception as e:
            print(f"  Parse error: {e}")
            continue

    return repos


def _parse_repo_article(article, since: str) -> Optional[dict]:
    """Parse a single repo article element."""

    # Full name from the h2 link
    h2_link = article.select_one("h2 a")
    if not h2_link:
        return None

    full_name = h2_link.get("href", "").strip("/")
    parts = full_name.split("/")
    if len(parts) != 2:
        return None

    owner, repo_name = parts

    # Description
    desc_el = article.select_one("p")
    description = desc_el.get_text(strip=True) if desc_el else ""

    # Programming language
    lang_el = article.select_one("[itemprop='programmingLanguage']")
    language = lang_el.get_text(strip=True) if lang_el else ""

    # Language color
    lang_color_el = article.select_one(".repo-language-color")
    lang_color = ""
    if lang_color_el:
        style = lang_color_el.get("style", "")
        color_match = re.search(r"background-color:\s*(#[0-9a-fA-F]+)", style)
        if color_match:
            lang_color = color_match.group(1)

    # Stars today (the "N stars today" text in the footer)
    stars_today_text = ""
    for span in article.select("span"):
        text = span.get_text(strip=True)
        if "star" in text.lower() and ("today" in text.lower() or "this" in text.lower()):
            stars_today_text = text
            break

    stars_today = _parse_number(stars_today_text.split()[0]) if stars_today_text else 0

    # Total stars and forks (the two link stats in the footer)
    stat_links = article.select("a.Link--muted")
    total_stars = 0
    forks = 0
    if len(stat_links) >= 1:
        total_stars = _parse_number(stat_links[0].get_text(strip=True))
    if len(stat_links) >= 2:
        forks = _parse_number(stat_links[1].get_text(strip=True))

    # Top contributors (avatar images)
    contrib_images = article.select("a img[data-hovercard-type='user']")
    contributors = [
        img.get("alt", "").lstrip("@")
        for img in contrib_images
        if img.get("alt", "").startswith("@")
    ]

    # Built-by section (contributor profile links)
    built_by_links = article.select("a[data-hovercard-type='user']")
    contributor_usernames = list({
        link.get("href", "").strip("/")
        for link in built_by_links
        if link.get("href", "").count("/") == 1
    })

    return {
        "owner": owner,
        "repo": repo_name,
        "full_name": full_name,
        "github_url": f"https://github.com/{full_name}",
        "description": description,
        "language": language,
        "language_color": lang_color,
        "stars_today": stars_today,
        "stars_today_text": stars_today_text,
        "total_stars": total_stars,
        "forks": forks,
        "contributors": contributor_usernames[:5],
        "since": since,
    }


def _parse_number(text: str) -> int:
    """Parse '1,234' or '12.3k' or '1.2k' style numbers to int."""
    text = text.strip().replace(",", "").replace(" ", "").lower()
    if not text:
        return 0
    try:
        if text.endswith("k"):
            return int(float(text[:-1]) * 1000)
        if text.endswith("m"):
            return int(float(text[:-1]) * 1_000_000)
        return int(text)
    except (ValueError, IndexError):
        return 0

Scraping Multiple Languages and Time Windows

For comprehensive data, collect across languages and time windows:

POPULAR_LANGUAGES = [
    "",           # All languages
    "python",
    "javascript",
    "typescript",
    "rust",
    "go",
    "java",
    "c",
    "c++",
    "c#",
    "swift",
    "kotlin",
    "ruby",
    "php",
    "scala",
    "elixir",
    "zig",
    "haskell",
    "julia",
    "r",
]

TIME_WINDOWS = ["daily", "weekly", "monthly"]


def scrape_all_combinations(
    languages: list[str] = None,
    since_values: list[str] = None,
    delay_between: float = 2.0,
) -> list[dict]:
    """
    Scrape trending across all language/time combinations.
    Returns deduplicated list of repos with context metadata.
    """
    if languages is None:
        languages = ["", "python", "javascript", "typescript", "rust", "go"]
    if since_values is None:
        since_values = ["daily"]

    all_repos = {}

    for since in since_values:
        for lang in languages:
            label = f"{lang or 'all'}/{since}"
            print(f"Scraping {label}...")

            try:
                repos = scrape_trending(language=lang, since=since)
                for repo in repos:
                    key = repo["full_name"]
                    if key not in all_repos:
                        all_repos[key] = repo
                        all_repos[key]["appearances"] = []
                    all_repos[key]["appearances"].append(f"{lang or 'all'}/{since}")

                print(f"  Found {len(repos)} repos")
            except Exception as e:
                print(f"  Failed: {e}")

            time.sleep(delay_between)

    return list(all_repos.values())

Enriching with the GitHub REST API

The trending page gives you a snapshot. The REST API fills in the gaps: open issue count, license, repository topics, push date, homepage URL, and accurate aggregate stats.

GitHub's REST API is well-documented. Without a token: 60 req/hour. With a personal access token (PAT): 5,000 req/hour. For any real workload, get a token — it's free and takes 2 minutes.

import os
import time
from typing import Optional

GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "")
GITHUB_API_BASE = "https://api.github.com"

def github_request(
    path: str,
    token: str = GITHUB_TOKEN,
    params: dict = None,
) -> Optional[dict | list]:
    """
    Make a GitHub API request with auth and error handling.
    Returns parsed JSON or None on error.
    """
    url = f"{GITHUB_API_BASE}{path}"
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"

    resp = requests.get(url, headers=headers, params=params, timeout=15)

    # Check rate limit headers
    remaining = int(resp.headers.get("X-RateLimit-Remaining", 999))
    if remaining < 10:
        reset_time = int(resp.headers.get("X-RateLimit-Reset", 0))
        wait_seconds = max(0, reset_time - time.time()) + 5
        print(f"  Rate limit nearly exhausted ({remaining} remaining). Waiting {wait_seconds:.0f}s")
        time.sleep(wait_seconds)

    if resp.status_code == 403:
        # Secondary rate limit or auth issue
        retry_after = int(resp.headers.get("Retry-After", 60))
        print(f"  403: waiting {retry_after}s")
        time.sleep(retry_after)
        return None

    if resp.status_code == 404:
        return None

    if resp.status_code == 422:
        # Unprocessable entity (e.g., repo too large for some endpoints)
        return None

    if resp.status_code != 200:
        print(f"  API {resp.status_code} for {path}")
        return None

    return resp.json()


def check_rate_limit() -> dict:
    """Check your current GitHub API quota."""
    data = github_request("/rate_limit")
    if not data:
        return {}

    core = data["resources"]["core"]
    search = data["resources"]["search"]

    return {
        "core_remaining": core["remaining"],
        "core_limit": core["limit"],
        "core_reset_in": max(0, core["reset"] - int(time.time())),
        "search_remaining": search["remaining"],
    }


def enrich_repo(full_name: str) -> dict:
    """
    Fetch additional metadata from the GitHub API.
    Returns a dict of enrichment fields, or empty dict on failure.
    """
    # Core repo data
    repo_data = github_request(f"/repos/{full_name}")
    if not repo_data:
        return {"api_error": True}

    # Top contributors (up to 5)
    contrib_data = github_request(f"/repos/{full_name}/contributors", params={"per_page": 5}) or []
    top_contributors = [
        {"login": c["login"], "contributions": c.get("contributions", 0)}
        for c in contrib_data[:5]
        if c.get("type") != "Anonymous"
    ]

    # Recent releases
    releases = github_request(f"/repos/{full_name}/releases", params={"per_page": 3}) or []
    recent_releases = [
        {
            "tag": r.get("tag_name", ""),
            "name": r.get("name", ""),
            "published": r.get("published_at", ""),
            "prerelease": r.get("prerelease", False),
        }
        for r in releases[:3]
    ]

    # Latest commit activity (for freshness signal)
    commits = github_request(f"/repos/{full_name}/commits", params={"per_page": 1}) or []
    latest_commit = {}
    if commits:
        c = commits[0]
        latest_commit = {
            "sha": c.get("sha", "")[:8],
            "message": (c.get("commit", {}).get("message", "") or "").split("\n")[0][:80],
            "date": c.get("commit", {}).get("committer", {}).get("date", ""),
        }

    return {
        "open_issues": repo_data.get("open_issues_count", 0),
        "watchers": repo_data.get("watchers_count", 0),
        "topics": repo_data.get("topics", []),
        "license": (repo_data.get("license") or {}).get("spdx_id", ""),
        "license_name": (repo_data.get("license") or {}).get("name", ""),
        "homepage": repo_data.get("homepage") or "",
        "pushed_at": repo_data.get("pushed_at", ""),
        "created_at": repo_data.get("created_at", ""),
        "updated_at": repo_data.get("updated_at", ""),
        "default_branch": repo_data.get("default_branch", "main"),
        "size_kb": repo_data.get("size", 0),
        "is_fork": repo_data.get("fork", False),
        "is_archived": repo_data.get("archived", False),
        "is_template": repo_data.get("is_template", False),
        "has_wiki": repo_data.get("has_wiki", False),
        "has_discussions": repo_data.get("has_discussions", False),
        "subscriber_count": repo_data.get("subscribers_count", 0),
        "top_contributors": top_contributors,
        "recent_releases": recent_releases,
        "latest_commit": latest_commit,
        "network_count": repo_data.get("network_count", 0),
    }


def scrape_and_enrich(
    language: str = "",
    since: str = "daily",
    delay: float = 0.8,
) -> list[dict]:
    """
    Full pipeline: scrape trending page then enrich each repo with API data.
    """
    print(f"Scraping trending/{language or 'all'} ({since})...")
    repos = scrape_trending(language=language, since=since)
    print(f"Found {len(repos)} trending repos")

    # Check API quota first
    quota = check_rate_limit()
    if quota.get("core_remaining", 0) < len(repos) * 3:
        print(f"Warning: Low API quota ({quota.get('core_remaining')} remaining)")
        print("Skipping API enrichment to preserve quota")
        return repos

    enriched = []
    for i, repo in enumerate(repos):
        full_name = repo["full_name"]
        print(f"  [{i+1}/{len(repos)}] Enriching {full_name}...")

        extra = enrich_repo(full_name)
        combined = {**repo, **extra}
        enriched.append(combined)

        time.sleep(delay)

    return enriched

Star Velocity Tracking with SQLite

Single snapshots are useful. A series of snapshots over weeks is where it gets powerful. Star velocity — how consistently a project appears in trending, and how its star gain changes day to day — is a far stronger signal than any individual metric.

import sqlite3
from datetime import date, datetime, timedelta

DB_PATH = "github_trending.db"


def init_db(db_path: str = DB_PATH) -> sqlite3.Connection:
    """Initialize the trending database."""
    conn = sqlite3.connect(db_path)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS repos (
            full_name TEXT PRIMARY KEY,
            owner TEXT,
            repo TEXT,
            description TEXT,
            language TEXT,
            homepage TEXT,
            license TEXT,
            created_at TEXT,
            is_fork INTEGER DEFAULT 0,
            first_seen DATE,
            last_seen DATE
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS trending_snapshots (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            full_name TEXT NOT NULL,
            stars_today INTEGER,
            total_stars INTEGER,
            forks INTEGER,
            open_issues INTEGER,
            watchers INTEGER,
            topics TEXT,
            contributors TEXT,
            since TEXT NOT NULL,
            language_filter TEXT,
            snapshot_date DATE NOT NULL,
            scraped_at TEXT,
            UNIQUE(full_name, since, language_filter, snapshot_date)
        )
    """)

    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_snap_name_date
        ON trending_snapshots(full_name, snapshot_date)
    """)
    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_snap_date
        ON trending_snapshots(snapshot_date)
    """)
    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_snap_language
        ON trending_snapshots(language_filter, snapshot_date)
    """)

    conn.commit()
    return conn


def save_snapshot(
    conn: sqlite3.Connection,
    repos: list[dict],
    since: str = "daily",
    language_filter: str = "",
    db_path: str = DB_PATH,
):
    """Save a trending snapshot batch to the database."""
    today = date.today().isoformat()
    now = datetime.utcnow().isoformat()

    for r in repos:
        full_name = r.get("full_name", "")
        if not full_name:
            continue

        # Upsert repo record
        conn.execute("""
            INSERT INTO repos (full_name, owner, repo, description, language,
                               homepage, license, created_at, is_fork, first_seen, last_seen)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(full_name) DO UPDATE SET
                description=excluded.description,
                last_seen=excluded.last_seen,
                language=COALESCE(excluded.language, repos.language)
        """, (
            full_name, r.get("owner"), r.get("repo"),
            r.get("description", ""), r.get("language", ""),
            r.get("homepage", ""), r.get("license", ""),
            r.get("created_at", ""), int(r.get("is_fork", False)),
            today, today
        ))

        # Insert snapshot
        conn.execute("""
            INSERT OR REPLACE INTO trending_snapshots
            (full_name, stars_today, total_stars, forks, open_issues, watchers,
             topics, contributors, since, language_filter, snapshot_date, scraped_at)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            full_name, r.get("stars_today", 0), r.get("total_stars", 0),
            r.get("forks", 0), r.get("open_issues", 0), r.get("watchers", 0),
            ",".join(r.get("topics", [])),
            ",".join(str(c) if isinstance(c, str) else c.get("login", "")
                     for c in r.get("top_contributors", r.get("contributors", []))),
            since, language_filter, today, now,
        ))

    conn.commit()
    print(f"Saved {len(repos)} snapshots for {today} ({since}/{language_filter or 'all'})")


def get_star_velocity(
    conn: sqlite3.Connection,
    full_name: str,
    days: int = 30,
) -> list[dict]:
    """Get daily star gain history for a repo."""
    rows = conn.execute("""
        SELECT snapshot_date, stars_today, total_stars, since
        FROM trending_snapshots
        WHERE full_name = ? AND since = 'daily'
        ORDER BY snapshot_date DESC
        LIMIT ?
    """, (full_name, days)).fetchall()

    return [
        {"date": r[0], "stars_today": r[1], "total_stars": r[2]}
        for r in rows
    ]


def get_trending_frequency(
    conn: sqlite3.Connection,
    days: int = 30,
    since: str = "daily",
) -> list[dict]:
    """
    Rank repos by how often they appeared in trending within the last N days.
    Consistently trending repos are higher signal than one-off appearances.
    """
    cutoff = (date.today() - timedelta(days=days)).isoformat()

    rows = conn.execute("""
        SELECT s.full_name, r.description, r.language, r.license,
               COUNT(DISTINCT s.snapshot_date) AS days_trending,
               MAX(s.stars_today) AS peak_stars_today,
               AVG(s.stars_today) AS avg_stars_today,
               MAX(s.total_stars) AS latest_total_stars,
               MAX(s.snapshot_date) AS last_seen_date
        FROM trending_snapshots s
        JOIN repos r ON r.full_name = s.full_name
        WHERE s.snapshot_date >= ? AND s.since = ?
        GROUP BY s.full_name
        ORDER BY days_trending DESC, avg_stars_today DESC
    """, (cutoff, since)).fetchall()

    return [
        {
            "full_name": r[0], "description": r[1], "language": r[2],
            "license": r[3], "days_trending": r[4], "peak_stars_today": r[5],
            "avg_stars_today": round(r[6], 1), "total_stars": r[7],
            "last_seen": r[8],
        }
        for r in rows
    ]


def get_language_trends(
    conn: sqlite3.Connection,
    days: int = 30,
) -> list[dict]:
    """
    Analyze which languages are trending most frequently.
    """
    cutoff = (date.today() - timedelta(days=days)).isoformat()

    rows = conn.execute("""
        SELECT r.language,
               COUNT(DISTINCT s.full_name) AS unique_repos,
               COUNT(*) AS total_appearances,
               SUM(s.stars_today) AS total_stars_gained,
               AVG(s.stars_today) AS avg_stars_per_repo
        FROM trending_snapshots s
        JOIN repos r ON r.full_name = s.full_name
        WHERE s.snapshot_date >= ? AND r.language != ''
        GROUP BY r.language
        ORDER BY total_stars_gained DESC
    """, (cutoff,)).fetchall()

    return [
        {
            "language": r[0], "unique_repos": r[1], "appearances": r[2],
            "total_stars": r[3], "avg_stars": round(r[4], 1),
        }
        for r in rows
    ]

Daily Collection Script

This is the cron job you want running:

import json
import csv
from pathlib import Path


def run_daily_collection(
    languages: list[str] = None,
    db_path: str = DB_PATH,
    output_dir: str = ".",
):
    """
    Daily pipeline: scrape trending, enrich with API, save to SQLite.
    Designed to run as a cron job.
    """
    if languages is None:
        languages = ["", "python", "typescript", "rust", "go"]

    conn = init_db(db_path)

    all_enriched = []

    for lang in languages:
        label = lang if lang else "all"
        print(f"\n--- Trending/{label}/daily ---")

        try:
            repos = scrape_and_enrich(language=lang, since="daily", delay=0.8)
            save_snapshot(conn, repos, since="daily", language_filter=lang)
            all_enriched.extend(repos)
            print(f"  Saved {len(repos)} repos")
        except Exception as e:
            print(f"  FAILED: {e}")

        time.sleep(3)  # Polite gap between language queries

    # Export today's snapshot to CSV
    today = date.today().isoformat()
    csv_path = Path(output_dir) / f"trending_{today}.csv"

    if all_enriched:
        # Deduplicate
        seen = set()
        unique = []
        for r in all_enriched:
            if r["full_name"] not in seen:
                seen.add(r["full_name"])
                unique.append(r)

        flat = []
        for r in unique:
            flat.append({
                "full_name": r["full_name"],
                "description": r.get("description", ""),
                "language": r.get("language", ""),
                "stars_today": r.get("stars_today", 0),
                "total_stars": r.get("total_stars", 0),
                "forks": r.get("forks", 0),
                "open_issues": r.get("open_issues", 0),
                "topics": ",".join(r.get("topics", [])),
                "license": r.get("license", ""),
                "created_at": r.get("created_at", ""),
                "pushed_at": r.get("pushed_at", ""),
                "is_fork": r.get("is_fork", False),
                "homepage": r.get("homepage", ""),
            })

        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=flat[0].keys())
            writer.writeheader()
            writer.writerows(flat)
        print(f"\nExported {len(flat)} repos to {csv_path}")

    conn.close()
    print(f"\nDaily collection complete. Total repos: {len(all_enriched)}")


# Analytics query
def print_weekly_summary(db_path: str = DB_PATH):
    conn = init_db(db_path)

    print("\n=== 7-Day Trending Summary ===")
    top_repos = get_trending_frequency(conn, days=7)[:10]
    print("\nTop repos by trending frequency (7 days):")
    for r in top_repos:
        print(f"  {r['full_name']:40} | {r['days_trending']} days | "
              f"avg {r['avg_stars_today']:.0f} stars/day | {r['language']}")

    print("\nLanguage breakdown (7 days):")
    langs = get_language_trends(conn, days=7)[:10]
    for l in langs:
        print(f"  {l['language']:20} | {l['unique_repos']} repos | "
              f"{l['total_stars']:,} total stars gained")

    conn.close()


if __name__ == "__main__":
    run_daily_collection()
    print_weekly_summary()

Schedule with cron:

0 8 * * * /usr/bin/python3 /path/to/script.py >> /var/log/github_trending.log 2>&1

Handling GitHub's Rate Limits

GitHub's rate limiting has two tiers that catch people by surprise:

Primary rate limit: 60 req/hour unauthenticated, 5,000/hour with a PAT. Check remaining quota:

def wait_if_low_quota(threshold: int = 50):
    """Pause if API quota is running low."""
    quota = check_rate_limit()
    remaining = quota.get("core_remaining", 999)
    if remaining < threshold:
        wait = quota.get("core_reset_in", 300) + 5
        print(f"  Low quota ({remaining}). Waiting {wait}s for reset...")
        time.sleep(wait)

Secondary rate limits: These activate when you make too many requests in a short window regardless of hourly quota. They're not clearly documented, but practical thresholds are: - No more than 10 requests/second to a single endpoint - No more than 900 requests/hour to any endpoint - Large content responses (repos with thousands of commits) trigger them faster

Handle the 403 with Retry-After:

def github_request_with_backoff(path: str, max_retries: int = 4) -> Optional[dict]:
    """GitHub API request with secondary rate limit handling."""
    for attempt in range(max_retries):
        result = github_request(path)
        if result is not None:
            return result

        if attempt < max_retries - 1:
            backoff = (2 ** attempt) * 10  # 10s, 20s, 40s
            print(f"  Retry {attempt + 1}/{max_retries} in {backoff}s")
            time.sleep(backoff)

    return None

Practical advice: With a PAT and 0.8s delays, you'll process 25 repos/run in about 30 seconds and never hit the secondary limit. The 5,000/hour primary limit is almost impossible to exhaust with trending data alone.

Anti-Bot and Proxy Considerations

GitHub is one of the most scraper-friendly major platforms. They don't use aggressive bot detection on the trending HTML page, they're transparent about API rate limits, and they respect developers accessing public data.

That said, a few scenarios where proxy rotation helps:

Unauthenticated at scale: Without a token, you burn through 60 req/hour quickly. Each API call (search, repo detail, contributors) counts separately. IP rotation lets multiple unauthenticated "users" share the work. ThorData residential proxies work well here.

Multi-account authenticated access: If you have multiple GitHub accounts (personal, work, bot accounts), you can use their separate tokens to multiply your quota. Each token gets its own 5,000/hour allowance.

Avoiding the unauthenticated limit on the HTML page: The trending HTML page doesn't have strict rate limits, but hitting it hundreds of times per minute from one IP will get you throttled. At 1 trending page per 2 seconds you'll never encounter issues.

For the vast majority of trending data use cases, just use a single PAT — it's simpler and more reliable than proxy rotation.

Real-World Use Cases

Early stage project discovery. Run daily. Look for repos that appear across multiple languages/time windows simultaneously — they're genuinely crossing communities. A project trending in "all languages", "python", and "ai-ml" on the same day has unusually broad appeal.

Developer tooling market research. Track which subcategories of developer tools are consistently producing trending repos. CLI tools trending consistently? Database projects? Testing frameworks? This maps the build momentum in the ecosystem.

Newsletter/content curation. The consistently trending repos (appearing 3+ times in 7 days) are your most reliable signal. Single-day trending repos might be from social spikes; persistent trending reflects sustained organic interest.

Hiring signal. Consistent trending contributors — people who show up in the "Built by" section repeatedly — are active open source builders. Track them for recruiting research.

Language adoption tracking. The language trends query over 30-90 days shows which languages are generating momentum. Zig, Gleam, and Elixir have shown strong per-repo star velocity relative to their absolute numbers in recent data.

Conclusion

GitHub trending is excellent raw material for developer ecosystem analysis. Server-rendered HTML means simple requests + BeautifulSoup extraction, no headless browser needed. The REST API enrichment adds the depth that makes the data genuinely useful. And SQLite with a daily cron gives you the time series that turns snapshots into trend analysis. Run it for two weeks and you have enough data to answer interesting questions. Run it for three months and you have a real dataset.

How to Scrape GitHub Trending Repos in 2026: Stars, Languages & Contributor Analysis

How to Scrape GitHub Trending Repos in 2026: Stars, Languages & Contributor Analysis

GitHub Trending Page Structure

Basic Trending Scraper

Scraping Multiple Languages and Time Windows

Enriching with the GitHub REST API

Star Velocity Tracking with SQLite

Daily Collection Script

Handling GitHub's Rate Limits

Anti-Bot and Proxy Considerations

Real-World Use Cases

Conclusion