Scrape GitHub Actions Workflows: Run Data & Job Statistics via API (2026)

GitHub Actions is the most widely used CI/CD platform for open-source software. The workflow run data — build times, failure rates, job duration patterns, trigger distributions — is valuable for DevOps analytics, developer tooling, OSS health metrics, and understanding how projects actually build and test their code.

Unlike most scraping targets, GitHub has a well-documented REST API that gives you structured data without HTML parsing. The API is generous with rate limits (5,000 requests/hour authenticated) and returns clean JSON. This guide covers extracting workflow run data, job-level statistics, artifacts, and marketplace action metadata — plus the patterns that make high-volume collection practical.

What's Available Through the API

The GitHub Actions REST API exposes:

- Workflow definitions: every workflow file in a repository, with path and state
- Workflow runs: status, conclusion, trigger event, branch, and timestamps, filterable by branch, event, status, and creation date
- Jobs and steps: per-job runner details plus step-level start and end times
- Run timing: billable minutes broken down by runner OS
- Artifacts: names, sizes, and download URLs for uploaded build outputs

For public repositories, the Actions API requires no authentication — though unauthenticated calls are limited to 60 req/hour, which is unusably low for any real collection work.

Authentication Setup

Create a personal access token (PAT) at github.com/settings/tokens. For Actions data on public repos, a classic token needs no additional scopes (public_repo also works). Private repos require the repo scope, and organization-level analytics needs read:org on top.

import httpx
import os
import time
from datetime import datetime, timedelta, timezone
from typing import Optional

class GitHubActionsClient:
    """Client for the GitHub Actions REST API with rate limit handling."""

    BASE = "https://api.github.com"

    def __init__(self, token: Optional[str] = None):
        self.token = token or os.environ.get("GITHUB_TOKEN")
        if not self.token:
            print("Warning: No GitHub token. Limited to 60 req/hour.")

        headers = {
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        }
        if self.token:
            headers["Authorization"] = f"Bearer {self.token}"

        self.client = httpx.Client(headers=headers, timeout=20)
        self._rate_limit_remaining = 5000
        self._rate_limit_reset = 0

    def _get(self, url: str, params: dict = None) -> httpx.Response:
        """Make a GET request with automatic rate limit handling."""
        if self._rate_limit_remaining < 10:
            wait = max(0, self._rate_limit_reset - time.time()) + 5
            print(f"  Rate limit low. Sleeping {wait:.0f}s until reset.")
            time.sleep(wait)

        resp = self.client.get(url, params=params)

        remaining = resp.headers.get("X-RateLimit-Remaining")
        reset = resp.headers.get("X-RateLimit-Reset")
        if remaining:
            self._rate_limit_remaining = int(remaining)
        if reset:
            self._rate_limit_reset = int(reset)

        if resp.status_code == 429:
            retry_after = int(resp.headers.get("Retry-After", 60))
            print(f"  429 rate limited. Sleeping {retry_after}s.")
            time.sleep(retry_after)
            return self._get(url, params)

        if resp.status_code == 403 and "rate limit" in resp.text.lower():
            wait = max(0, self._rate_limit_reset - time.time()) + 5
            print(f"  403 rate limit exceeded. Sleeping {wait:.0f}s.")
            time.sleep(wait)
            return self._get(url, params)

        return resp

    def _get_paginated(
        self,
        url: str,
        params: dict = None,
        result_key: str = None,
        max_pages: int = 10,
    ) -> list:
        """Fetch all pages of a paginated endpoint."""
        params = params or {}
        params.setdefault("per_page", 100)

        all_items = []

        for page_num in range(1, max_pages + 1):
            params["page"] = page_num
            resp = self._get(url, params=params)

            if resp.status_code == 404:
                break
            resp.raise_for_status()

            data = resp.json()

            if result_key:
                items = data.get(result_key, [])
            elif isinstance(data, list):
                items = data
            else:
                for key in ["workflow_runs", "jobs", "artifacts", "items", "workflows"]:
                    if key in data:
                        items = data[key]
                        break
                else:
                    items = []

            if not items:
                break

            all_items.extend(items)

            if "next" not in resp.headers.get("link", ""):
                break

        return all_items

    def check_rate_limit(self) -> dict:
        """Check current API rate limit status."""
        resp = self._get(f"{self.BASE}/rate_limit")
        data = resp.json()
        core = data["resources"]["core"]
        reset_dt = datetime.fromtimestamp(core["reset"], tz=timezone.utc)

        return {
            "remaining": core["remaining"],
            "limit": core["limit"],
            "used": core["used"],
            "resets_at": reset_dt.isoformat(),
            "minutes_until_reset": round(
                (reset_dt - datetime.now(tz=timezone.utc)).total_seconds() / 60, 1
            ),
        }
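The envelope-key fallback in _get_paginated is easy to exercise in isolation. Here is the same branch logic lifted into a standalone helper and run against canned payloads (a sketch that mirrors the method above, nothing new):

```python
def pick_items(data, result_key=None):
    # Mirrors the fallback in _get_paginated: explicit key first,
    # then bare lists, then the known GitHub envelope keys.
    if result_key:
        return data.get(result_key, [])
    if isinstance(data, list):
        return data
    for key in ["workflow_runs", "jobs", "artifacts", "items", "workflows"]:
        if key in data:
            return data[key]
    return []

# GitHub wraps list endpoints in an envelope with a total_count field:
print(pick_items({"total_count": 2, "workflow_runs": [{"id": 1}, {"id": 2}]}))
# Some endpoints return a bare list:
print(pick_items([{"id": 3}]))
# Unknown shapes fall through to an empty list:
print(pick_items({"unexpected": True}))  # []
```

The for/else in the original achieves the same thing; the explicit returns here just make the precedence order readable.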

Fetching Workflow Runs

    def list_workflows(self, owner: str, repo: str) -> list[dict]:
        """List all workflow files defined in a repository."""
        url = f"{self.BASE}/repos/{owner}/{repo}/actions/workflows"
        items = self._get_paginated(url, result_key="workflows")

        return [
            {
                "id": w["id"],
                "name": w["name"],
                "path": w["path"],
                "state": w["state"],
                "created_at": w["created_at"],
                "updated_at": w["updated_at"],
            }
            for w in items
        ]

    def get_workflow_runs(
        self,
        owner: str,
        repo: str,
        workflow_id: str | int = None,
        branch: str = None,
        event: str = None,
        status: str = None,
        created_after: datetime = None,
        max_pages: int = 10,
    ) -> list[dict]:
        """
        Fetch workflow runs for a repository.

        status options: queued, in_progress, completed, waiting, requested, pending
        event options: push, pull_request, schedule, workflow_dispatch, etc.
        """
        if workflow_id:
            url = f"{self.BASE}/repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs"
        else:
            url = f"{self.BASE}/repos/{owner}/{repo}/actions/runs"

        params = {}
        if branch:
            params["branch"] = branch
        if event:
            params["event"] = event
        if status:
            params["status"] = status
        if created_after:
            params["created"] = f">={created_after.strftime('%Y-%m-%d')}"

        runs = self._get_paginated(
            url, params=params, result_key="workflow_runs", max_pages=max_pages
        )

        return [
            {
                "id": r["id"],
                "name": r["name"],
                "workflow_id": r["workflow_id"],
                "head_branch": r["head_branch"],
                "head_sha": r["head_sha"],
                "event": r["event"],
                "status": r["status"],
                "conclusion": r["conclusion"],
                "run_number": r["run_number"],
                "run_attempt": r["run_attempt"],
                "created_at": r["created_at"],
                "updated_at": r["updated_at"],
                "run_started_at": r.get("run_started_at"),
                "actor_login": (
                    r.get("triggering_actor", {}).get("login")
                    if r.get("triggering_actor") else None
                ),
                "repo": r["repository"]["full_name"],
            }
            for r in runs
        ]

    def get_run_jobs(
        self,
        owner: str,
        repo: str,
        run_id: int,
        filter_jobs: str = "all",
    ) -> list[dict]:
        """Fetch jobs for a specific workflow run."""
        url = f"{self.BASE}/repos/{owner}/{repo}/actions/runs/{run_id}/jobs"
        jobs = self._get_paginated(url, params={"filter": filter_jobs}, result_key="jobs")

        return [
            {
                "id": j["id"],
                "run_id": j["run_id"],
                "name": j["name"],
                "status": j["status"],
                "conclusion": j["conclusion"],
                "started_at": j.get("started_at"),
                "completed_at": j.get("completed_at"),
                "runner_name": j.get("runner_name"),
                "runner_group_name": j.get("runner_group_name"),
                "workflow_name": j.get("workflow_name"),
                "steps": [
                    {
                        "name": s["name"],
                        "status": s["status"],
                        "conclusion": s.get("conclusion"),
                        "number": s["number"],
                        "started_at": s.get("started_at"),
                        "completed_at": s.get("completed_at"),
                    }
                    for s in j.get("steps", [])
                ],
            }
            for j in jobs
        ]

    def get_run_timing(self, owner: str, repo: str, run_id: int) -> dict:
        """Get billable minutes for a workflow run broken down by OS."""
        resp = self._get(f"{self.BASE}/repos/{owner}/{repo}/actions/runs/{run_id}/timing")
        if resp.status_code != 200:
            return {}
        return resp.json()

Analyzing Build Performance

Aggregate patterns reveal CI/CD health metrics that can't be seen from individual runs:

from collections import defaultdict
from statistics import mean, median, stdev


def compute_run_duration_seconds(run: dict) -> float | None:
    """Calculate duration of a workflow run in seconds."""
    start_str = run.get("run_started_at") or run.get("created_at")
    end_str = run.get("updated_at")

    if not start_str or not end_str:
        return None

    fmt = "%Y-%m-%dT%H:%M:%SZ"
    try:
        start = datetime.strptime(start_str, fmt)
        end = datetime.strptime(end_str, fmt)
        return (end - start).total_seconds()
    except ValueError:
        return None


def analyze_workflow_performance(
    client: GitHubActionsClient,
    owner: str,
    repo: str,
    days: int = 30,
    workflow_id: str = None,
) -> dict:
    """Compute CI/CD performance metrics for a repository."""
    since = datetime.now(timezone.utc) - timedelta(days=days)
    runs = client.get_workflow_runs(
        owner, repo,
        workflow_id=workflow_id,
        status="completed",
        created_after=since,
    )

    if not runs:
        return {"error": "No completed runs in time window"}

    durations = []
    conclusions = defaultdict(int)
    events = defaultdict(int)
    branches = defaultdict(int)
    daily_counts = defaultdict(int)

    for run in runs:
        conclusions[run["conclusion"] or "unknown"] += 1
        events[run["event"]] += 1
        branches[run["head_branch"]] += 1

        duration = compute_run_duration_seconds(run)
        if duration and duration > 0:
            durations.append(duration)

        if run["created_at"]:
            day = run["created_at"][:10]
            daily_counts[day] += 1

    total = len(runs)
    success_count = conclusions.get("success", 0)

    stats = {
        "total_runs": total,
        "date_range": f"last {days} days",
        "success_rate": round(success_count / total * 100, 1) if total else 0,
        "conclusion_breakdown": dict(sorted(conclusions.items(), key=lambda x: -x[1])),
        "event_breakdown": dict(sorted(events.items(), key=lambda x: -x[1])),
        "top_branches": dict(sorted(branches.items(), key=lambda x: -x[1])[:10]),
        "daily_run_counts": dict(sorted(daily_counts.items())),
    }

    if durations:
        sorted_d = sorted(durations)
        stats["duration_seconds"] = {
            "mean": round(mean(durations)),
            "median": round(median(durations)),
            "p90": round(sorted_d[int(len(sorted_d) * 0.90)]),
            "p95": round(sorted_d[int(len(sorted_d) * 0.95)]),
            "p99": round(sorted_d[int(len(sorted_d) * 0.99)]),
            "max": round(max(durations)),
            "stdev": round(stdev(durations)) if len(durations) > 1 else 0,
        }
        stats["duration_minutes"] = {
            k: round(v / 60, 1)
            for k, v in stats["duration_seconds"].items()
        }

    return stats


def analyze_job_breakdown(
    client: GitHubActionsClient,
    owner: str,
    repo: str,
    sample_runs: int = 30,
) -> list[dict]:
    """Break down job-level duration and failure rates from recent runs."""
    runs = client.get_workflow_runs(
        owner, repo,
        status="completed",
        created_after=datetime.now(timezone.utc) - timedelta(days=14),
    )[:sample_runs]

    job_stats = defaultdict(list)

    for run in runs:
        jobs = client.get_run_jobs(owner, repo, run["id"])

        for job in jobs:
            if (
                job["status"] != "completed"
                or not job["started_at"]
                or not job["completed_at"]
            ):
                continue

            fmt = "%Y-%m-%dT%H:%M:%SZ"
            try:
                started = datetime.strptime(job["started_at"], fmt)
                completed = datetime.strptime(job["completed_at"], fmt)
                duration = (completed - started).total_seconds()
            except ValueError:
                continue

            step_durations = {}
            for step in job.get("steps", []):
                if step.get("started_at") and step.get("completed_at"):
                    try:
                        s_start = datetime.strptime(step["started_at"], fmt)
                        s_end = datetime.strptime(step["completed_at"], fmt)
                        step_durations[step["name"]] = (s_end - s_start).total_seconds()
                    except ValueError:
                        pass

            job_stats[job["name"]].append({
                "duration": duration,
                "conclusion": job["conclusion"],
                "runner": job.get("runner_name", "unknown"),
                "step_durations": step_durations,
            })

    summary = []
    for job_name, job_runs in job_stats.items():
        durations = [r["duration"] for r in job_runs]
        failures = sum(1 for r in job_runs if r["conclusion"] not in ("success", "skipped"))

        all_steps = defaultdict(list)
        for r in job_runs:
            for step, dur in r.get("step_durations", {}).items():
                all_steps[step].append(dur)

        top_steps = [
            {"step": s, "avg_seconds": round(mean(durs))}
            for s, durs in sorted(all_steps.items(), key=lambda x: -mean(x[1]))
            if mean(durs) > 5
        ][:5]

        summary.append({
            "job_name": job_name,
            "sample_runs": len(job_runs),
            "avg_duration_sec": round(mean(durations)),
            "median_duration_sec": round(median(durations)),
            "max_duration_sec": round(max(durations)),
            "failure_count": failures,
            "failure_rate_pct": round(failures / len(job_runs) * 100, 1),
            "top_time_consuming_steps": top_steps,
        })

    return sorted(summary, key=lambda x: -x["avg_duration_sec"])
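The index-based percentile lookup used in analyze_workflow_performance is worth a quick sanity check. On synthetic durations (all numbers made up), it behaves like this:

```python
from statistics import mean, median

# 100 synthetic run durations: 60s, 61s, ..., 159s
durations = [60 + i for i in range(100)]
sorted_d = sorted(durations)

# Same index-based percentile as analyze_workflow_performance:
# int(n * q) indexes the element q of the way through the sorted list
p90 = sorted_d[int(len(sorted_d) * 0.90)]  # 150
p95 = sorted_d[int(len(sorted_d) * 0.95)]  # 155

summary = {
    "mean": round(mean(durations)),
    "median": round(median(durations)),
    "p90": p90,
    "p95": p95,
}
print(summary)
```

For small samples this index method is a coarse approximation (with 10 runs, p99 simply returns the max), which is fine for dashboards but worth knowing before you compare numbers across repos.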

Scraping Workflow YAML Files

The workflow YAML definitions show which actions and versions a repository uses — useful for security auditing or action adoption analysis:

import base64
import yaml


def get_workflow_yaml(
    client: GitHubActionsClient,
    owner: str,
    repo: str,
    workflow_path: str,
) -> dict | None:
    """Fetch and parse a workflow YAML file from the repository."""
    url = f"{client.BASE}/repos/{owner}/{repo}/contents/{workflow_path.lstrip('/')}"
    resp = client._get(url)

    if resp.status_code == 404:
        return None

    resp.raise_for_status()
    content_b64 = resp.json().get("content", "")
    decoded = base64.b64decode(content_b64).decode("utf-8")

    try:
        return yaml.safe_load(decoded)
    except yaml.YAMLError as e:
        print(f"  YAML parse error for {workflow_path}: {e}")
        return None


def extract_actions_used(workflow_yaml: dict) -> list[str]:
    """Extract all external action references from a workflow definition."""
    if not workflow_yaml or not isinstance(workflow_yaml, dict):
        return []

    actions = set()
    for job_name, job in (workflow_yaml.get("jobs") or {}).items():
        if not isinstance(job, dict):
            continue
        for step in job.get("steps", []):
            if isinstance(step, dict) and "uses" in step:
                action_ref = step["uses"]
                if not action_ref.startswith("./"):
                    action_name = action_ref.split("@")[0]
                    actions.add(action_name)

    return sorted(actions)


def audit_action_versions(
    client: GitHubActionsClient,
    owner: str,
    repo: str,
) -> list[dict]:
    """Audit which action versions are used across all workflows in a repo."""
    workflows = client.list_workflows(owner, repo)
    audit = []

    for workflow in workflows:
        wf_yaml = get_workflow_yaml(client, owner, repo, workflow["path"])
        if not wf_yaml:
            continue

        actions_refs = {}
        for job_name, job in (wf_yaml.get("jobs") or {}).items():
            if not isinstance(job, dict):
                continue
            for step in (job.get("steps") or []):
                if isinstance(step, dict) and "uses" in step:
                    ref = step["uses"]
                    if not ref.startswith("./"):
                        name, _, version = ref.partition("@")
                        actions_refs[name] = version or "unversioned"

        audit.append({
            "workflow": workflow["name"],
            "path": workflow["path"],
            "actions": actions_refs,
        })

    return audit
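To see what extract_actions_used produces, here is the same walk applied to a hand-written dict standing in for a parsed workflow file (no network or PyYAML needed; the workflow content is invented for illustration):

```python
# A parsed workflow, shaped like what yaml.safe_load returns
workflow = {
    "name": "CI",
    "jobs": {
        "test": {
            "steps": [
                {"uses": "actions/checkout@v4"},
                {"uses": "actions/setup-python@v5"},
                {"run": "pytest"},           # run steps carry no action ref
                {"uses": "./local-action"},  # local actions are skipped
            ]
        }
    },
}

# Same traversal as extract_actions_used
actions = set()
for job in (workflow.get("jobs") or {}).values():
    if not isinstance(job, dict):
        continue
    for step in job.get("steps", []):
        if isinstance(step, dict) and "uses" in step and not step["uses"].startswith("./"):
            actions.add(step["uses"].split("@")[0])

print(sorted(actions))  # ['actions/checkout', 'actions/setup-python']
```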

Searching for Actions in the Marketplace

The marketplace itself isn't directly API-accessible, but you can search GitHub for action repositories by topic:

def search_marketplace_actions(
    client: GitHubActionsClient,
    query: str,
    min_stars: int = 50,
    max_results: int = 100,
) -> list[dict]:
    """Search for GitHub Actions repositories by topic/keyword."""
    url = f"{client.BASE}/search/repositories"
    params = {
        "q": f"{query} topic:github-actions in:name,description",
        "sort": "stars",
        "order": "desc",
        "per_page": min(max_results, 100),
    }

    resp = client._get(url, params=params)
    resp.raise_for_status()
    data = resp.json()

    return [
        {
            "full_name": r["full_name"],
            "description": r.get("description"),
            "stars": r["stargazers_count"],
            "forks": r["forks_count"],
            "topics": r.get("topics", []),
            "language": r.get("language"),
            "updated_at": r["updated_at"],
            "url": r["html_url"],
        }
        for r in data.get("items", [])
        if r["stargazers_count"] >= min_stars
    ]


def get_action_metadata(
    client: GitHubActionsClient,
    owner: str,
    repo: str,
) -> dict | None:
    """Fetch action.yml metadata for a GitHub Action repository."""
    for filename in ["action.yml", "action.yaml"]:
        url = f"{client.BASE}/repos/{owner}/{repo}/contents/{filename}"
        resp = client._get(url)

        if resp.status_code == 200:
            content_b64 = resp.json().get("content", "")
            content = base64.b64decode(content_b64).decode("utf-8")
            try:
                return yaml.safe_load(content)
            except yaml.YAMLError:
                return None

    return None

ETag Caching for Rate Limit Efficiency

Conditional requests with ETags let you check if data has changed without burning rate limit quota. 304 Not Modified responses don't count against your hourly limit:

class CachedActionsClient(GitHubActionsClient):
    """Extension with ETag-based response caching."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._etag_cache: dict = {}

    def get_with_cache(self, url: str, params: dict = None) -> tuple[dict, bool]:
        """
        GET with ETag caching.
        Returns (data, from_cache). from_cache=True means 304 response —
        no rate limit cost for this call.
        """
        cache_key = f"{url}?{sorted((params or {}).items())}"
        headers = {}

        if cache_key in self._etag_cache:
            headers["If-None-Match"] = self._etag_cache[cache_key]["etag"]

        # Use the underlying httpx client directly to pass custom headers
        resp = self.client.get(url, params=params, headers=headers)

        remaining = resp.headers.get("X-RateLimit-Remaining")
        if remaining:
            self._rate_limit_remaining = int(remaining)

        if resp.status_code == 304:
            return self._etag_cache[cache_key]["data"], True

        resp.raise_for_status()
        data = resp.json()

        etag = resp.headers.get("ETag")
        if etag:
            self._etag_cache[cache_key] = {"etag": etag, "data": data}

        return data, False

Storing Results for Long-Term Analysis

import sqlite3


def init_actions_db(path: str = "github_actions.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS workflow_runs (
            id              INTEGER PRIMARY KEY,
            repo            TEXT,
            workflow_id     INTEGER,
            workflow_name   TEXT,
            head_branch     TEXT,
            head_sha        TEXT,
            event           TEXT,
            status          TEXT,
            conclusion      TEXT,
            run_number      INTEGER,
            run_attempt     INTEGER,
            created_at      TEXT,
            updated_at      TEXT,
            duration_seconds REAL,
            collected_at    TEXT DEFAULT (datetime('now'))
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS run_jobs (
            id              INTEGER PRIMARY KEY,
            run_id          INTEGER,
            repo            TEXT,
            job_name        TEXT,
            status          TEXT,
            conclusion      TEXT,
            started_at      TEXT,
            completed_at    TEXT,
            duration_seconds REAL,
            runner_name     TEXT,
            step_count      INTEGER,
            FOREIGN KEY (run_id) REFERENCES workflow_runs(id)
        )
    """)

    conn.execute("CREATE INDEX IF NOT EXISTS idx_runs_repo ON workflow_runs(repo)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_runs_created ON workflow_runs(created_at)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_run ON run_jobs(run_id)")
    conn.commit()
    return conn


def store_runs(conn: sqlite3.Connection, runs: list[dict], owner: str, repo: str):
    """Store workflow run records to SQLite."""
    repo_full = f"{owner}/{repo}"

    for run in runs:
        duration = compute_run_duration_seconds(run)

        conn.execute("""
            INSERT OR REPLACE INTO workflow_runs
            (id, repo, workflow_id, workflow_name, head_branch, head_sha,
             event, status, conclusion, run_number, run_attempt,
             created_at, updated_at, duration_seconds)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            run["id"], repo_full, run.get("workflow_id"), run.get("name"),
            run.get("head_branch"), run.get("head_sha"),
            run.get("event"), run.get("status"), run.get("conclusion"),
            run.get("run_number"), run.get("run_attempt"),
            run.get("created_at"), run.get("updated_at"), duration,
        ))

    conn.commit()


def get_slowest_runs(conn: sqlite3.Connection, repo: str, limit: int = 20) -> list[dict]:
    """Find the slowest workflow runs for a repository."""
    rows = conn.execute("""
        SELECT id, workflow_name, head_branch, conclusion,
               duration_seconds, created_at
        FROM workflow_runs
        WHERE repo = ? AND status = 'completed'
        ORDER BY duration_seconds DESC NULLS LAST
        LIMIT ?
    """, (repo, limit)).fetchall()

    return [
        {
            "run_id": r[0],
            "workflow": r[1],
            "branch": r[2],
            "conclusion": r[3],
            "duration_minutes": round(r[4] / 60, 1) if r[4] else None,
            "created_at": r[5],
        }
        for r in rows
    ]


def get_failure_trend(conn: sqlite3.Connection, repo: str, days: int = 30) -> list[dict]:
    """Get daily failure rates for a repository."""
    rows = conn.execute("""
        SELECT
            DATE(created_at) as day,
            COUNT(*) as total,
            SUM(CASE WHEN conclusion = 'success' THEN 1 ELSE 0 END) as successes,
            SUM(CASE WHEN conclusion = 'failure' THEN 1 ELSE 0 END) as failures
        FROM workflow_runs
        WHERE repo = ?
          AND created_at >= DATE('now', ? || ' days')
          AND status = 'completed'
        GROUP BY day
        ORDER BY day
    """, (repo, f"-{days}")).fetchall()

    return [
        {
            "day": r[0],
            "total": r[1],
            "success_rate": round(r[2] / r[1] * 100, 1) if r[1] else 0,
            "failure_count": r[3],
        }
        for r in rows
    ]
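The storage layer can be exercised end-to-end without any API calls by loading synthetic rows into an in-memory database. This sketch trims the schema to the columns the failure-trend query reads; the repo name and timestamps are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE workflow_runs (
        id INTEGER PRIMARY KEY, repo TEXT, status TEXT,
        conclusion TEXT, created_at TEXT
    )
""")

# Two days of synthetic runs for one repo
rows = [
    (1, "acme/app", "completed", "success", "2026-01-01T10:00:00Z"),
    (2, "acme/app", "completed", "failure", "2026-01-01T11:00:00Z"),
    (3, "acme/app", "completed", "success", "2026-01-02T10:00:00Z"),
    (4, "acme/app", "completed", "success", "2026-01-02T12:00:00Z"),
]
conn.executemany("INSERT INTO workflow_runs VALUES (?, ?, ?, ?, ?)", rows)

# Same shape as get_failure_trend: daily totals and success counts.
# SQLite's DATE() accepts the API's ISO 8601 timestamps directly.
trend = conn.execute("""
    SELECT DATE(created_at) AS day,
           COUNT(*) AS total,
           SUM(CASE WHEN conclusion = 'success' THEN 1 ELSE 0 END) AS successes
    FROM workflow_runs
    WHERE repo = 'acme/app' AND status = 'completed'
    GROUP BY day ORDER BY day
""").fetchall()

for day, total, successes in trend:
    print(day, f"{successes}/{total} succeeded")
```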

Multi-Repo Organization Analysis

For organization-wide DevOps health dashboards:

def list_org_repos_with_actions(
    client: GitHubActionsClient,
    org: str,
    min_stars: int = 10,
) -> list[str]:
    """
    List active organization repositories worth analyzing for Actions data.

    Note: confirming that a repo actually defines workflows would cost one
    extra API call per repo (list_workflows), so we filter only on stars and
    archive status here and let downstream analysis skip repos with no runs.
    """
    url = f"{client.BASE}/orgs/{org}/repos"
    repos = client._get_paginated(url, params={"type": "public"}, max_pages=20)

    return [
        repo["full_name"]
        for repo in repos
        if repo.get("stargazers_count", 0) >= min_stars
        and not repo.get("archived", False)
    ]


def org_ci_health_report(
    client: GitHubActionsClient,
    org: str,
    days: int = 7,
    min_stars: int = 50,
) -> list[dict]:
    """
    Generate CI health summary for all significant repos in an organization.
    """
    repos = list_org_repos_with_actions(client, org, min_stars=min_stars)
    print(f"Analyzing {len(repos)} repos in {org}...")

    report = []
    for repo_full_name in repos:
        owner, repo = repo_full_name.split("/", 1)
        print(f"  {repo_full_name}...")

        try:
            stats = analyze_workflow_performance(client, owner, repo, days=days)
            stats["repo"] = repo_full_name
            report.append(stats)
        except Exception as e:
            report.append({"repo": repo_full_name, "error": str(e)})

        time.sleep(0.3)  # Polite delay

    # Sort ascending by success rate, so the worst CI health comes first
    return sorted(
        [r for r in report if "error" not in r],
        key=lambda x: x.get("success_rate", 100),
    )

Rate Limits and When Proxies Help

GitHub's API rate limits are tiered:

- Unauthenticated: 60 requests/hour per IP
- Personal access token: 5,000 requests/hour per token
- GitHub Apps and Enterprise Cloud: higher ceilings, up to 15,000 requests/hour
- Search API: a separate, much lower limit (30 requests/minute authenticated)

For most analytics work, a single PAT at 5,000 req/hour is sufficient. But these scenarios change the math:

Organization-wide historical backfill. Pulling 90 days of run history for 500 repos requires at minimum 500 API calls for the run list alone — and probably 5,000+ if you're also fetching job details. Distributed across multiple PATs, each routed through ThorData residential proxies, you can parallelize without hitting per-token rate limits.
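The back-of-envelope math in that scenario can be written out directly. The workload numbers below are assumptions matching the conservative floor above (10 tracked runs per repo); tune them to your actual collection plan:

```python
# Rough API-call budget for a historical backfill (assumed workload numbers)
repos = 500
runs_per_repo = 10   # conservative floor; active repos run far more
per_page = 100       # max page size for the runs endpoint

# ceil division via negation: pages needed to list each repo's runs
run_list_calls = repos * -(-runs_per_repo // per_page)
job_detail_calls = repos * runs_per_repo  # one jobs call per run

total_calls = run_list_calls + job_detail_calls
hours_single_token = total_calls / 5000   # one PAT: 5,000 req/hour

print(f"{total_calls:,} calls, about {hours_single_token:.1f} hours on one token")
# With N tokens running in parallel, divide the wall-clock time by N.
```

Raise runs_per_repo to a few hundred for active repositories and the total jumps into the hundreds of thousands of calls, which is where multi-token parallelism stops being optional.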

Unauthenticated public tools. If you're building a public CI dashboard that doesn't require users to connect GitHub, you're limited to 60 req/hour per IP. Rotating through many residential IPs effectively multiplies your unauthenticated quota.

Non-API data collection. GitHub's Actions marketplace page, billing dashboards, and certain organizational views aren't fully accessible via the REST API. For those pages, you're scraping HTML with all the usual anti-bot considerations — GitHub uses Akamai for their web frontend, which checks TLS fingerprints and IP reputation. Residential proxies from ThorData pass these checks where datacenter ranges get blocked.

The GitHub Terms of Service permit automated API access for building tools, research, and analysis. The key constraints:

- Respect the published rate limits and back off on Retry-After rather than hammering through errors
- Don't use collected data to spam users or contact them without consent
- Honor repository licenses when redistributing code or datasets derived from it

GitHub Actions analytics is one of the cleanest, most accessible DevOps data sources. The API is well-documented, rate limits are generous for most purposes, and the data is clean JSON. Start with the REST API, use ETag caching for efficiency, and reach for GraphQL when you need cross-repository queries at scale.