Scrape GitHub Actions Workflows: Run Data & Job Statistics via API (2026)
GitHub Actions is the most widely used CI/CD platform for open-source software. The workflow run data — build times, failure rates, job duration patterns, trigger distributions — is valuable for DevOps analytics, developer tooling, OSS health metrics, and understanding how projects actually build and test their code.
Unlike most scraping targets, GitHub has a well-documented REST API that gives you structured data without HTML parsing. The API is generous with rate limits (5,000 requests/hour authenticated) and returns clean JSON. This guide covers extracting workflow run data, job-level statistics, artifacts, and marketplace action metadata — plus the patterns that make high-volume collection practical.
What's Available Through the API
The GitHub Actions REST API exposes:
- Workflow runs — every execution of every workflow in a repository, with status, conclusion, duration, trigger event, branch, commit SHA, and run attempt number
- Jobs — individual jobs within a run, with per-step timing, runner details, and log URLs
- Workflow files — the YAML workflow definitions themselves, via the repository contents API
- Artifacts — files uploaded during runs, with download URLs (time-limited) and expiration timestamps
- Billing/timing — billable minutes broken down by OS type (ubuntu, macos, windows) per run
- Secrets and variables — names only (not values) of repository and environment secrets
- Runners — self-hosted runner details for organizations with admin access
- Marketplace actions — action metadata via repository contents (action.yml files)
For public repositories, the Actions API requires no authentication — though unauthenticated calls are limited to 60 req/hour, which is unusably low for any real collection work.
Authentication Setup
Create a personal access token (PAT) at github.com/settings/tokens. For Actions data on public repositories, a classic token needs either the public_repo scope or no scopes at all. For private repositories, you need repo. For organization-level analytics, add read:org.
import httpx
import os
import time
from datetime import datetime, timedelta, timezone
from typing import Optional


class GitHubActionsClient:
    """Client for the GitHub Actions REST API with rate limit handling."""

    BASE = "https://api.github.com"

    def __init__(self, token: Optional[str] = None):
        self.token = token or os.environ.get("GITHUB_TOKEN")
        if not self.token:
            print("Warning: No GitHub token. Limited to 60 req/hour.")
        headers = {
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        }
        if self.token:
            headers["Authorization"] = f"Bearer {self.token}"
        self.client = httpx.Client(headers=headers, timeout=20)
        # Optimistic defaults; both are overwritten from response headers.
        self._rate_limit_remaining = 5000
        self._rate_limit_reset = 0
    def _get(
        self, url: str, params: Optional[dict] = None, _retries: int = 3
    ) -> httpx.Response:
        """Make a GET request with automatic rate limit handling."""
        if self._rate_limit_remaining < 10:
            wait = max(0, self._rate_limit_reset - time.time()) + 5
            print(f" Rate limit low. Sleeping {wait:.0f}s until reset.")
            time.sleep(wait)
        resp = self.client.get(url, params=params)
        remaining = resp.headers.get("X-RateLimit-Remaining")
        reset = resp.headers.get("X-RateLimit-Reset")
        if remaining:
            self._rate_limit_remaining = int(remaining)
        if reset:
            self._rate_limit_reset = int(reset)
        # Retries are bounded so a persistent limit cannot recurse forever.
        if resp.status_code == 429 and _retries > 0:
            retry_after = int(resp.headers.get("Retry-After", 60))
            print(f" 429 rate limited. Sleeping {retry_after}s.")
            time.sleep(retry_after)
            return self._get(url, params, _retries - 1)
        if (
            resp.status_code == 403
            and "rate limit" in resp.text.lower()
            and _retries > 0
        ):
            wait = max(0, self._rate_limit_reset - time.time()) + 5
            print(f" 403 rate limit exceeded. Sleeping {wait:.0f}s.")
            time.sleep(wait)
            return self._get(url, params, _retries - 1)
        return resp
    def _get_paginated(
        self,
        url: str,
        params: Optional[dict] = None,
        result_key: Optional[str] = None,
        max_pages: int = 10,
    ) -> list:
        """Fetch all pages of a paginated endpoint."""
        params = dict(params or {})  # copy so the caller's dict is not mutated
        params.setdefault("per_page", 100)
        all_items = []
        for page_num in range(1, max_pages + 1):
            params["page"] = page_num
            resp = self._get(url, params=params)
            if resp.status_code == 404:
                break
            resp.raise_for_status()
            data = resp.json()
            if result_key:
                items = data.get(result_key, [])
            elif isinstance(data, list):
                items = data
            else:
                # Fall back to the first known wrapper key in the payload.
                for key in ["workflow_runs", "jobs", "artifacts", "items", "workflows"]:
                    if key in data:
                        items = data[key]
                        break
                else:
                    items = []
            if not items:
                break
            all_items.extend(items)
            if "next" not in resp.headers.get("link", ""):
                break
        return all_items
    def check_rate_limit(self) -> dict:
        """Check current API rate limit status."""
        resp = self._get(f"{self.BASE}/rate_limit")
        data = resp.json()
        core = data["resources"]["core"]
        reset_dt = datetime.fromtimestamp(core["reset"], tz=timezone.utc)
        return {
            "remaining": core["remaining"],
            "limit": core["limit"],
            "used": core["used"],
            "resets_at": reset_dt.isoformat(),
            "minutes_until_reset": round(
                (reset_dt - datetime.now(tz=timezone.utc)).total_seconds() / 60, 1
            ),
        }
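The pagination loop above stops when the substring "next" is missing from the Link header, which would also match a URL that merely contains that word. A stricter variant parses the header and looks up the rel="next" entry. The following is a minimal sketch; `parse_link_header` is a hypothetical helper that is not part of the client above, and the sample header is made up for illustration.

```python
import re

# Hypothetical helper: split a Link header into a {rel: url} mapping.
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_link_header(header: str) -> dict[str, str]:
    """Return a mapping such as {"next": url, "last": url}."""
    return {rel: url for url, rel in _LINK_RE.findall(header or "")}


# Illustrative header in the shape GitHub uses for paginated responses.
sample = (
    '<https://api.github.com/repositories/1/actions/runs?per_page=100&page=2>; rel="next", '
    '<https://api.github.com/repositories/1/actions/runs?per_page=100&page=7>; rel="last"'
)
links = parse_link_header(sample)
print(links.get("next"))
```

With a helper like this, the loop could follow `links["next"]` directly instead of incrementing a page counter. httpx also exposes parsed links as `resp.links`, which may be enough on its own.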
Fetching Workflow Runs
    # Methods of GitHubActionsClient (continued)

    def list_workflows(self, owner: str, repo: str) -> list[dict]:
        """List all workflow files defined in a repository."""
        url = f"{self.BASE}/repos/{owner}/{repo}/actions/workflows"
        items = self._get_paginated(url, result_key="workflows")
        return [
            {
                "id": w["id"],
                "name": w["name"],
                "path": w["path"],
                "state": w["state"],
                "created_at": w["created_at"],
                "updated_at": w["updated_at"],
            }
            for w in items
        ]
    def get_workflow_runs(
        self,
        owner: str,
        repo: str,
        workflow_id: str | int | None = None,
        branch: Optional[str] = None,
        event: Optional[str] = None,
        status: Optional[str] = None,
        created_after: Optional[datetime] = None,
        max_pages: int = 10,
    ) -> list[dict]:
        """
        Fetch workflow runs for a repository.

        status options: queued, in_progress, completed, waiting, requested, pending
        event options: push, pull_request, schedule, workflow_dispatch, etc.
        """
        if workflow_id:
            url = f"{self.BASE}/repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs"
        else:
            url = f"{self.BASE}/repos/{owner}/{repo}/actions/runs"
        params = {}
        if branch:
            params["branch"] = branch
        if event:
            params["event"] = event
        if status:
            params["status"] = status
        if created_after:
            params["created"] = f">={created_after.strftime('%Y-%m-%d')}"
        runs = self._get_paginated(
            url, params=params, result_key="workflow_runs", max_pages=max_pages
        )
        return [
            {
                "id": r["id"],
                "name": r["name"],
                "workflow_id": r["workflow_id"],
                "head_branch": r["head_branch"],
                "head_sha": r["head_sha"],
                "event": r["event"],
                "status": r["status"],
                "conclusion": r["conclusion"],
                "run_number": r["run_number"],
                "run_attempt": r["run_attempt"],
                "created_at": r["created_at"],
                "updated_at": r["updated_at"],
                "run_started_at": r.get("run_started_at"),
                "actor_login": (
                    r.get("triggering_actor", {}).get("login")
                    if r.get("triggering_actor") else None
                ),
                "repo": r["repository"]["full_name"],
            }
            for r in runs
        ]
    def get_run_jobs(
        self,
        owner: str,
        repo: str,
        run_id: int,
        filter_jobs: str = "all",
    ) -> list[dict]:
        """Fetch jobs for a specific workflow run."""
        url = f"{self.BASE}/repos/{owner}/{repo}/actions/runs/{run_id}/jobs"
        jobs = self._get_paginated(url, params={"filter": filter_jobs}, result_key="jobs")
        return [
            {
                "id": j["id"],
                "run_id": j["run_id"],
                "name": j["name"],
                "status": j["status"],
                "conclusion": j["conclusion"],
                "started_at": j.get("started_at"),
                "completed_at": j.get("completed_at"),
                "runner_name": j.get("runner_name"),
                "runner_group_name": j.get("runner_group_name"),
                "workflow_name": j.get("workflow_name"),
                "steps": [
                    {
                        "name": s["name"],
                        "status": s["status"],
                        "conclusion": s.get("conclusion"),
                        "number": s["number"],
                        "started_at": s.get("started_at"),
                        "completed_at": s.get("completed_at"),
                    }
                    for s in j.get("steps", [])
                ],
            }
            for j in jobs
        ]

    def get_run_timing(self, owner: str, repo: str, run_id: int) -> dict:
        """Get billable minutes for a workflow run broken down by OS."""
        resp = self._get(f"{self.BASE}/repos/{owner}/{repo}/actions/runs/{run_id}/timing")
        if resp.status_code != 200:
            return {}
        return resp.json()
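The dictionary returned by get_run_timing is raw milliseconds, so a small post-processing step helps before it goes into a report. The sketch below assumes the payload has a `billable` mapping keyed by OS name with a `total_ms` field, plus a top-level `run_duration_ms`; that shape is an assumption here, and `summarize_timing` is a hypothetical helper. The sample values are invented.

```python
def summarize_timing(timing: dict) -> dict:
    """Convert a timing payload into minutes per OS plus totals."""
    per_os = {
        os_name.lower(): round(info.get("total_ms", 0) / 60_000, 2)
        for os_name, info in (timing.get("billable") or {}).items()
    }
    return {
        "billable_minutes": per_os,
        "billable_total": round(sum(per_os.values()), 2),
        "wall_clock_minutes": round(timing.get("run_duration_ms", 0) / 60_000, 2),
    }


# Invented payload in the assumed shape.
sample = {
    "billable": {
        "UBUNTU": {"total_ms": 180_000, "jobs": 1},
        "MACOS": {"total_ms": 240_000, "jobs": 4},
    },
    "run_duration_ms": 510_000,
}
summary = summarize_timing(sample)
print(summary)
```

Because get_run_timing returns an empty dict on any non-200 response, the helper tolerates missing keys rather than raising.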
Analyzing Build Performance
Aggregate patterns reveal CI/CD health metrics that can't be seen from individual runs:
from collections import defaultdict
from statistics import mean, median, stdev


def compute_run_duration_seconds(run: dict) -> float | None:
    """Calculate duration of a workflow run in seconds."""
    start_str = run.get("run_started_at") or run.get("created_at")
    # updated_at approximates the finish time; later edits to the run can shift it.
    end_str = run.get("updated_at")
    if not start_str or not end_str:
        return None
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    try:
        start = datetime.strptime(start_str, fmt)
        end = datetime.strptime(end_str, fmt)
        return (end - start).total_seconds()
    except ValueError:
        return None
def analyze_workflow_performance(
    client: GitHubActionsClient,
    owner: str,
    repo: str,
    days: int = 30,
    workflow_id: str | int | None = None,
) -> dict:
    """Compute CI/CD performance metrics for a repository."""
    # datetime.utcnow() is deprecated since Python 3.12; use an aware datetime.
    since = datetime.now(timezone.utc) - timedelta(days=days)
    runs = client.get_workflow_runs(
        owner, repo,
        workflow_id=workflow_id,
        status="completed",
        created_after=since,
    )
    if not runs:
        return {"error": "No completed runs in time window"}
    durations = []
    conclusions = defaultdict(int)
    events = defaultdict(int)
    branches = defaultdict(int)
    daily_counts = defaultdict(int)
    for run in runs:
        conclusions[run["conclusion"] or "unknown"] += 1
        events[run["event"]] += 1
        branches[run["head_branch"]] += 1
        duration = compute_run_duration_seconds(run)
        if duration and duration > 0:
            durations.append(duration)
        if run["created_at"]:
            day = run["created_at"][:10]
            daily_counts[day] += 1
    total = len(runs)
    success_count = conclusions.get("success", 0)
    stats = {
        "total_runs": total,
        "date_range": f"last {days} days",
        "success_rate": round(success_count / total * 100, 1) if total else 0,
        "conclusion_breakdown": dict(sorted(conclusions.items(), key=lambda x: -x[1])),
        "event_breakdown": dict(sorted(events.items(), key=lambda x: -x[1])),
        "top_branches": dict(sorted(branches.items(), key=lambda x: -x[1])[:10]),
        "daily_run_counts": dict(sorted(daily_counts.items())),
    }
    if durations:
        sorted_d = sorted(durations)
        stats["duration_seconds"] = {
            "mean": round(mean(durations)),
            "median": round(median(durations)),
            "p90": round(sorted_d[int(len(sorted_d) * 0.90)]),
            "p95": round(sorted_d[int(len(sorted_d) * 0.95)]),
            "p99": round(sorted_d[int(len(sorted_d) * 0.99)]),
            "max": round(max(durations)),
            "stdev": round(stdev(durations)) if len(durations) > 1 else 0,
        }
        stats["duration_minutes"] = {
            k: round(v / 60, 1)
            for k, v in stats["duration_seconds"].items()
        }
    return stats
def analyze_job_breakdown(
    client: GitHubActionsClient,
    owner: str,
    repo: str,
    sample_runs: int = 30,
) -> list[dict]:
    """Break down job-level duration and failure rates from recent runs."""
    runs = client.get_workflow_runs(
        owner, repo,
        status="completed",
        # datetime.utcnow() is deprecated since Python 3.12.
        created_after=datetime.now(timezone.utc) - timedelta(days=14),
    )[:sample_runs]
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    job_stats = defaultdict(list)
    for run in runs:
        jobs = client.get_run_jobs(owner, repo, run["id"])
        for job in jobs:
            if (
                job["status"] != "completed"
                or not job["started_at"]
                or not job["completed_at"]
            ):
                continue
            try:
                started = datetime.strptime(job["started_at"], fmt)
                completed = datetime.strptime(job["completed_at"], fmt)
                duration = (completed - started).total_seconds()
            except ValueError:
                continue
            step_durations = {}
            for step in job.get("steps", []):
                if step.get("started_at") and step.get("completed_at"):
                    try:
                        s_start = datetime.strptime(step["started_at"], fmt)
                        s_end = datetime.strptime(step["completed_at"], fmt)
                        step_durations[step["name"]] = (s_end - s_start).total_seconds()
                    except ValueError:
                        pass
            job_stats[job["name"]].append({
                "duration": duration,
                "conclusion": job["conclusion"],
                # runner_name can be present but null, so fall back explicitly.
                "runner": job.get("runner_name") or "unknown",
                "step_durations": step_durations,
            })
    summary = []
    for job_name, job_runs in job_stats.items():
        durations = [r["duration"] for r in job_runs]
        failures = sum(1 for r in job_runs if r["conclusion"] not in ("success", "skipped"))
        all_steps = defaultdict(list)
        for r in job_runs:
            for step, dur in r.get("step_durations", {}).items():
                all_steps[step].append(dur)
        top_steps = [
            {"step": s, "avg_seconds": round(mean(durs))}
            for s, durs in sorted(all_steps.items(), key=lambda x: -mean(x[1]))
            if mean(durs) > 5
        ][:5]
        summary.append({
            "job_name": job_name,
            "sample_runs": len(job_runs),
            "avg_duration_sec": round(mean(durations)),
            "median_duration_sec": round(median(durations)),
            "max_duration_sec": round(max(durations)),
            "failure_count": failures,
            "failure_rate_pct": round(failures / len(job_runs) * 100, 1),
            "top_time_consuming_steps": top_steps,
        })
    return sorted(summary, key=lambda x: -x["avg_duration_sec"])
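The p90/p95/p99 figures above index the sorted list at `int(n * q)`, which leans toward the upper end on small samples: with 10 runs, p90 lands on the maximum. One alternative is the nearest-rank definition, sketched below. `percentile_nearest_rank` is a hypothetical helper, the sample durations are invented, and this is one common convention rather than the only valid one.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Invented run durations in seconds, including one slow outlier.
durations = [120, 95, 300, 110, 105, 130, 98, 101, 115, 900]
p50 = percentile_nearest_rank(durations, 50)
p90 = percentile_nearest_rank(durations, 90)
p99 = percentile_nearest_rank(durations, 99)
print(p50, p90, p99)
```

On this sample the nearest-rank p90 is the ninth-smallest value, while the `int(n * q)` indexing used earlier would report the single 900-second outlier.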
Scraping Workflow YAML Files
The workflow YAML definitions show which actions and versions a repository uses — useful for security auditing or action adoption analysis:
import base64
import yaml


def get_workflow_yaml(
    client: GitHubActionsClient,
    owner: str,
    repo: str,
    workflow_path: str,
) -> dict | None:
    """Fetch and parse a workflow YAML file from the repository."""
    url = f"{client.BASE}/repos/{owner}/{repo}/contents/{workflow_path.lstrip('/')}"
    resp = client._get(url)
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    content_b64 = resp.json().get("content", "")
    decoded = base64.b64decode(content_b64).decode("utf-8")
    try:
        return yaml.safe_load(decoded)
    except yaml.YAMLError as e:
        print(f" YAML parse error for {workflow_path}: {e}")
        return None
def extract_actions_used(workflow_yaml: dict) -> list[str]:
    """Extract all external action references from a workflow definition."""
    if not workflow_yaml or not isinstance(workflow_yaml, dict):
        return []
    actions = set()
    for job_name, job in (workflow_yaml.get("jobs") or {}).items():
        if not isinstance(job, dict):
            continue
        # "steps" may be present but empty (None after parsing), so guard it.
        for step in (job.get("steps") or []):
            if isinstance(step, dict) and "uses" in step:
                action_ref = str(step["uses"])
                if not action_ref.startswith("./"):
                    action_name = action_ref.split("@")[0]
                    actions.add(action_name)
    return sorted(actions)
def audit_action_versions(
    client: GitHubActionsClient,
    owner: str,
    repo: str,
) -> list[dict]:
    """Audit which action versions are used across all workflows in a repo."""
    workflows = client.list_workflows(owner, repo)
    audit = []
    for workflow in workflows:
        wf_yaml = get_workflow_yaml(client, owner, repo, workflow["path"])
        if not wf_yaml:
            continue
        actions_refs = {}
        for job_name, job in (wf_yaml.get("jobs") or {}).items():
            if not isinstance(job, dict):
                continue
            for step in (job.get("steps") or []):
                if isinstance(step, dict) and "uses" in step:
                    ref = step["uses"]
                    if not ref.startswith("./"):
                        name, _, version = ref.partition("@")
                        actions_refs[name] = version or "unversioned"
        audit.append({
            "workflow": workflow["name"],
            "path": workflow["path"],
            "actions": actions_refs,
        })
    return audit
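For security auditing, the version string alone is less useful than knowing how each reference is pinned. The sketch below classifies a `uses:` value as a full commit SHA, a version-like tag, or something else. `classify_action_ref` and its category names are hypothetical, and the tag pattern is a heuristic: a ref such as `v4` could in principle be a branch.

```python
import re

_SHA_RE = re.compile(r"^[0-9a-f]{40}$")
_VERSION_RE = re.compile(r"^v?\d+(\.\d+){0,2}$")


def classify_action_ref(uses: str) -> dict:
    """Classify how a workflow step's `uses:` reference is pinned."""
    if uses.startswith("./"):
        return {"action": uses, "ref": None, "pinning": "local"}
    if uses.startswith("docker://"):
        return {"action": uses, "ref": None, "pinning": "docker"}
    name, _, ref = uses.partition("@")
    if not ref:
        pinning = "unversioned"
    elif _SHA_RE.match(ref):
        pinning = "sha"  # immutable: the strictest form of pinning
    elif _VERSION_RE.match(ref):
        pinning = "tag"  # heuristic: looks like v4 or v4.1.2
    else:
        pinning = "branch-or-other"
    return {"action": name, "ref": ref or None, "pinning": pinning}


for example in ["actions/checkout@v4", "owner/tool@main", "./.github/actions/build"]:
    print(classify_action_ref(example))
```

Feeding each `uses` value collected by audit_action_versions through a function like this makes it easy to count how many references are SHA-pinned versus floating.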
Searching for Actions in the Marketplace
The marketplace itself isn't directly API-accessible, but you can search GitHub for action repositories by topic:
def search_marketplace_actions(
    client: GitHubActionsClient,
    query: str,
    min_stars: int = 50,
    max_results: int = 100,
) -> list[dict]:
    """Search for GitHub Actions repositories by topic/keyword."""
    url = f"{client.BASE}/search/repositories"
    params = {
        "q": f"{query} topic:github-actions in:name,description",
        "sort": "stars",
        "order": "desc",
        "per_page": min(max_results, 100),
    }
    resp = client._get(url, params=params)
    resp.raise_for_status()
    data = resp.json()
    return [
        {
            "full_name": r["full_name"],
            "description": r.get("description"),
            "stars": r["stargazers_count"],
            "forks": r["forks_count"],
            "topics": r.get("topics", []),
            "language": r.get("language"),
            "updated_at": r["updated_at"],
            "url": r["html_url"],
        }
        for r in data.get("items", [])
        if r["stargazers_count"] >= min_stars
    ]
def get_action_metadata(
    client: GitHubActionsClient,
    owner: str,
    repo: str,
) -> dict | None:
    """Fetch action.yml metadata for a GitHub Action repository."""
    for filename in ["action.yml", "action.yaml"]:
        url = f"{client.BASE}/repos/{owner}/{repo}/contents/{filename}"
        resp = client._get(url)
        if resp.status_code == 200:
            content_b64 = resp.json().get("content", "")
            content = base64.b64decode(content_b64).decode("utf-8")
            try:
                return yaml.safe_load(content)
            except yaml.YAMLError:
                return None
    return None
ETag Caching for Rate Limit Efficiency
Conditional requests with ETags let you check whether data has changed without burning rate limit quota. A 304 Not Modified response to an authenticated request doesn't count against your primary hourly limit:
class CachedActionsClient(GitHubActionsClient):
    """Extension with ETag-based response caching."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._etag_cache: dict = {}

    def get_with_cache(self, url: str, params: dict = None) -> tuple[dict, bool]:
        """
        GET with ETag caching.

        Returns (data, from_cache). from_cache=True means a 304 response,
        so there is no rate limit cost for this call.
        """
        cache_key = f"{url}?{sorted((params or {}).items())}"
        headers = {}
        if cache_key in self._etag_cache:
            headers["If-None-Match"] = self._etag_cache[cache_key]["etag"]
        # Use the underlying httpx client directly to pass custom headers
        resp = self.client.get(url, params=params, headers=headers)
        remaining = resp.headers.get("X-RateLimit-Remaining")
        if remaining:
            self._rate_limit_remaining = int(remaining)
        if resp.status_code == 304:
            return self._etag_cache[cache_key]["data"], True
        resp.raise_for_status()
        data = resp.json()
        etag = resp.headers.get("ETag")
        if etag:
            self._etag_cache[cache_key] = {"etag": etag, "data": data}
        return data, False
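The conditional-request flow in get_with_cache can be exercised without touching the network by injecting the transport. The sketch below is a stripped-down stand-in, not the client above: `ETagCache`, the `fetch` callable signature, and `fake_fetch` are all hypothetical names used only to show the If-None-Match / 304 round trip.

```python
from typing import Callable

# Assumed transport signature: fetch(url, headers) -> (status, response_headers, body)
Fetch = Callable[[str, dict], tuple[int, dict, object]]


class ETagCache:
    """Minimal ETag cache around an injected fetch function."""

    def __init__(self, fetch: Fetch):
        self._fetch = fetch
        self._store: dict[str, tuple[str, object]] = {}  # url -> (etag, body)

    def get(self, url: str) -> tuple[object, bool]:
        headers = {}
        cached = self._store.get(url)
        if cached:
            headers["If-None-Match"] = cached[0]
        status, resp_headers, body = self._fetch(url, headers)
        if status == 304 and cached:
            return cached[1], True  # unchanged: serve the stored body
        etag = resp_headers.get("ETag")
        if etag:
            self._store[url] = (etag, body)
        return body, False


calls = []


def fake_fetch(url: str, headers: dict):
    """Stand-in server: answers 304 when the client presents the current ETag."""
    calls.append(dict(headers))
    if headers.get("If-None-Match") == '"abc"':
        return 304, {}, None
    return 200, {"ETag": '"abc"'}, {"total_count": 1}


cache = ETagCache(fake_fetch)
url = "https://api.github.com/repos/octo/demo/actions/runs"
first = cache.get(url)
second = cache.get(url)
print(first, second)
```

The second call sends the stored ETag and is served from the cache, which is the case that avoids spending quota against the real API.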
Storing Results for Long-Term Analysis
import sqlite3


def init_actions_db(path: str = "github_actions.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS workflow_runs (
            id INTEGER PRIMARY KEY,
            repo TEXT,
            workflow_id INTEGER,
            workflow_name TEXT,
            head_branch TEXT,
            head_sha TEXT,
            event TEXT,
            status TEXT,
            conclusion TEXT,
            run_number INTEGER,
            run_attempt INTEGER,
            created_at TEXT,
            updated_at TEXT,
            duration_seconds REAL,
            collected_at TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS run_jobs (
            id INTEGER PRIMARY KEY,
            run_id INTEGER,
            repo TEXT,
            job_name TEXT,
            status TEXT,
            conclusion TEXT,
            started_at TEXT,
            completed_at TEXT,
            duration_seconds REAL,
            runner_name TEXT,
            step_count INTEGER,
            FOREIGN KEY (run_id) REFERENCES workflow_runs(id)
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_runs_repo ON workflow_runs(repo)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_runs_created ON workflow_runs(created_at)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_run ON run_jobs(run_id)")
    conn.commit()
    return conn
def store_runs(conn: sqlite3.Connection, runs: list[dict], owner: str, repo: str):
    """Store workflow run records to SQLite."""
    repo_full = f"{owner}/{repo}"
    for run in runs:
        duration = compute_run_duration_seconds(run)
        conn.execute("""
            INSERT OR REPLACE INTO workflow_runs
            (id, repo, workflow_id, workflow_name, head_branch, head_sha,
             event, status, conclusion, run_number, run_attempt,
             created_at, updated_at, duration_seconds)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        """, (
            run["id"], repo_full, run.get("workflow_id"), run.get("name"),
            run.get("head_branch"), run.get("head_sha"),
            run.get("event"), run.get("status"), run.get("conclusion"),
            run.get("run_number"), run.get("run_attempt"),
            run.get("created_at"), run.get("updated_at"), duration,
        ))
    conn.commit()
def get_slowest_runs(conn: sqlite3.Connection, repo: str, limit: int = 20) -> list[dict]:
    """Find the slowest workflow runs for a repository."""
    rows = conn.execute("""
        SELECT id, workflow_name, head_branch, conclusion,
               duration_seconds, created_at
        FROM workflow_runs
        WHERE repo = ? AND status = 'completed'
        ORDER BY duration_seconds DESC NULLS LAST
        LIMIT ?
    """, (repo, limit)).fetchall()
    return [
        {
            "run_id": r[0],
            "workflow": r[1],
            "branch": r[2],
            "conclusion": r[3],
            "duration_minutes": round(r[4] / 60, 1) if r[4] else None,
            "created_at": r[5],
        }
        for r in rows
    ]
def get_failure_trend(conn: sqlite3.Connection, repo: str, days: int = 30) -> list[dict]:
    """Get daily failure rates for a repository."""
    rows = conn.execute("""
        SELECT
            DATE(created_at) as day,
            COUNT(*) as total,
            SUM(CASE WHEN conclusion = 'success' THEN 1 ELSE 0 END) as successes,
            SUM(CASE WHEN conclusion = 'failure' THEN 1 ELSE 0 END) as failures
        FROM workflow_runs
        WHERE repo = ?
          AND created_at >= DATE('now', ? || ' days')
          AND status = 'completed'
        GROUP BY day
        ORDER BY day
    """, (repo, f"-{days}")).fetchall()
    return [
        {
            "day": r[0],
            "total": r[1],
            "success_rate": round(r[2] / r[1] * 100, 1) if r[1] else 0,
            "failure_count": r[3],
        }
        for r in rows
    ]
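Once runs are in SQLite, further aggregates are a single query away. As one illustration, the sketch below groups by workflow name to compare success rate and average duration. It builds its own in-memory table with only the columns it needs, so the rows, the `octo/demo` repository name, and the `workflow_summary` helper are all invented for the example.

```python
import sqlite3

# In-memory table with a subset of the workflow_runs columns, for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE workflow_runs (
        id INTEGER PRIMARY KEY,
        repo TEXT,
        workflow_name TEXT,
        conclusion TEXT,
        duration_seconds REAL
    )
""")
conn.executemany(
    "INSERT INTO workflow_runs VALUES (?, ?, ?, ?, ?)",
    [
        (1, "octo/demo", "CI", "success", 300.0),
        (2, "octo/demo", "CI", "failure", 420.0),
        (3, "octo/demo", "CI", "success", 360.0),
        (4, "octo/demo", "Docs", "success", 60.0),
    ],
)


def workflow_summary(conn: sqlite3.Connection, repo: str) -> list[dict]:
    """Per-workflow run count, success rate, and mean duration, worst first."""
    rows = conn.execute("""
        SELECT workflow_name,
               COUNT(*) AS total,
               ROUND(100.0 * SUM(conclusion = 'success') / COUNT(*), 1) AS success_rate,
               ROUND(AVG(duration_seconds) / 60, 1) AS avg_minutes
        FROM workflow_runs
        WHERE repo = ?
        GROUP BY workflow_name
        ORDER BY success_rate ASC
    """, (repo,)).fetchall()
    keys = ("workflow", "total", "success_rate", "avg_minutes")
    return [dict(zip(keys, row)) for row in rows]


result = workflow_summary(conn, "octo/demo")
print(result)
```

The same query should work against the full schema created by init_actions_db, since it only touches columns that exist there.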
Multi-Repo Organization Analysis
For organization-wide DevOps health dashboards:
def list_org_repos_with_actions(
    client: GitHubActionsClient,
    org: str,
    min_stars: int = 10,
) -> list[str]:
    """List organization repositories that have GitHub Actions workflows."""
    url = f"{client.BASE}/orgs/{org}/repos"
    repos = client._get_paginated(url, params={"type": "public"}, max_pages=20)
    results = []
    for repo in repos:
        if repo.get("stargazers_count", 0) < min_stars:
            continue
        if repo.get("archived", False):
            continue
        # has_projects describes the Projects feature, not Actions, so check
        # for workflow files directly (one extra API call per repository).
        owner, name = repo["full_name"].split("/", 1)
        if not client.list_workflows(owner, name):
            continue
        results.append(repo["full_name"])
    return results
def org_ci_health_report(
    client: GitHubActionsClient,
    org: str,
    days: int = 7,
    min_stars: int = 50,
) -> list[dict]:
    """
    Generate CI health summary for all significant repos in an organization.
    """
    repos = list_org_repos_with_actions(client, org, min_stars=min_stars)
    print(f"Analyzing {len(repos)} repos in {org}...")
    report = []
    for repo_full_name in repos:
        owner, repo = repo_full_name.split("/", 1)
        print(f" {repo_full_name}...")
        try:
            stats = analyze_workflow_performance(client, owner, repo, days=days)
            stats["repo"] = repo_full_name
            report.append(stats)
        except Exception as e:
            report.append({"repo": repo_full_name, "error": str(e)})
        time.sleep(0.3)  # Polite delay
    # Sort by failure rate descending (worst health first)
    return sorted(
        [r for r in report if "error" not in r],
        key=lambda x: -(100 - x.get("success_rate", 100))
    )
Rate Limits and When Proxies Help
GitHub's API rate limits are tiered:
- Unauthenticated: 60 requests/hour — unusable for any real work
- Personal Access Token: 5,000 requests/hour
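- GitHub App installation token: 5,000 requests/hour by default, scaling with the number of repositories and users up to 12,500; installations on GitHub Enterprise Cloud organizations get 15,000 requests/hour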
- Secondary rate limits: GitHub also limits concurrent requests and requests per minute to specific endpoints, independent of the hourly quota
For most analytics work, a single PAT at 5,000 req/hour is sufficient. But these scenarios change the math:
Organization-wide historical backfill. Pulling 90 days of run history for 500 repos requires at minimum 500 API calls for the run list alone — and probably 5,000+ if you're also fetching job details. Distributed across multiple PATs, each routed through ThorData residential proxies, you can parallelize without hitting per-token rate limits.
Unauthenticated public tools. If you're building a public CI dashboard that doesn't require users to connect GitHub, you're limited to 60 req/hour per IP. Rotating through many residential IPs effectively multiplies your unauthenticated quota.
Non-API data collection. GitHub's Actions marketplace page, billing dashboards, and certain organizational views aren't fully accessible via the REST API. For those pages, you're scraping HTML with all the usual anti-bot considerations — GitHub uses Akamai for their web frontend, which checks TLS fingerprints and IP reputation. Residential proxies from ThorData pass these checks where datacenter ranges get blocked.
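The backfill arithmetic above is easy to make explicit before starting a large collection job. The sketch below estimates request counts under stated assumptions: one list request per 100 runs, one jobs request per run (true only when a run has at most 100 jobs), and a hypothetical average of 90 runs per repository. `estimate_backfill` and its parameters are illustrative names, not part of any API.

```python
import math


def estimate_backfill(
    repos: int,
    runs_per_repo: int,
    fetch_jobs: bool = True,
    per_page: int = 100,
    hourly_limit: int = 5000,
) -> dict:
    """Rough request budget for pulling run history across many repositories."""
    # At least one list call per repo, even when it has few or no runs.
    list_calls = repos * max(1, math.ceil(runs_per_repo / per_page))
    # Assumes one jobs request per run (runs with more than per_page jobs need more).
    job_calls = repos * runs_per_repo if fetch_jobs else 0
    total = list_calls + job_calls
    return {
        "list_calls": list_calls,
        "job_calls": job_calls,
        "total_calls": total,
        "hours_at_limit": round(total / hourly_limit, 2),
    }


# 500 repos with an assumed 90 runs each over the window.
with_jobs = estimate_backfill(repos=500, runs_per_repo=90)
lists_only = estimate_backfill(repos=500, runs_per_repo=90, fetch_jobs=False)
print(with_jobs)
print(lists_only)
```

Under these assumptions the run lists alone fit comfortably in one hour of a single token's quota, while fetching job details for every run is what pushes the job into multiple hours.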
Legal and Ethical Considerations
The GitHub Terms of Service permit automated API access for building tools, research, and analysis. The key constraints:
- Don't use the API to circumvent access controls (trying to access private repos you aren't authorized to see)
- Respect rate limits — repeated violations can result in token or IP suspension
- Don't scrape the HTML site aggressively — use the API, which is the designed access method
- Workflow run data for public repositories is public — it's the same data visible in any browser without authentication
GitHub Actions analytics is one of the cleanest, most accessible DevOps data sources. The API is well-documented, rate limits are generous for most purposes, and the data is clean JSON. Start with the REST API, use ETag caching for efficiency, and reach for GraphQL when you need cross-repository queries at scale.