
Scrape Hacker News Who's Hiring: Job Data & Hiring Trends (2026)

Every month, Y Combinator's Hacker News publishes a "Who's Hiring?" thread where companies post job openings. These threads are a goldmine for labor market analysis — they reveal what tech stacks companies actually use in production, what roles are in demand right now, which companies are growing, and sometimes even transparent salary ranges.

The data goes back to 2011. That's 15+ years of monthly snapshots of the tech job market. You can build time series showing how the tech stack landscape shifted from Rails-dominated in 2012 to Go and Rust today, how remote work percentages changed before and after 2020, or how LLM/AI roles exploded in 2023-2024. The HN hiring corpus is one of the better longitudinal datasets for tech labor market research.

This guide covers the full pipeline: accessing the data via the Algolia API, parsing the unstructured job post text with regex and keyword matching, building a SQLite database for trend analysis, and running queries that yield actual insights.

The Two HN APIs

Hacker News exposes two APIs with completely different capabilities:

Firebase API (hacker-news.firebaseio.com/v0/): Real-time, official, maintained by HN/YC. Lets you fetch individual items by ID and watch the live feed. No search, no pagination by query — you'd have to know item IDs in advance or traverse the new/best/top streams to find hiring threads.
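
For completeness, a minimal sketch of fetching a single item by ID through the Firebase API (the endpoint shape is the documented one; error handling is minimal):

```python
import requests

FIREBASE_BASE = "https://hacker-news.firebaseio.com/v0"

def fetch_item(item_id: int) -> dict:
    """Fetch one HN item (story, comment, job, ...) by ID via the Firebase API."""
    resp = requests.get(f"{FIREBASE_BASE}/item/{item_id}.json", timeout=30)
    resp.raise_for_status()
    # Returns a dict with keys like 'id', 'type', 'by', 'time', 'text', 'kids'
    return resp.json()
```

This is fine for spot-checking individual posts, but with no search there's no way to *find* hiring threads without already knowing their IDs, which is why the rest of this guide uses Algolia.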

Algolia API (hn.algolia.com/api/v1/): Unofficial index maintained by Algolia. Full-text search, date filtering, pagination, author filtering. This is what you want for historical analysis.

The Algolia API is free, no authentication required, and has generous rate limits. The only documented constraint is 10,000 requests per hour.

Finding Historical Hiring Threads

The HN user whoishiring posts the monthly "Who is hiring?" threads. Filter by that author to find all of them:

import requests
import time
from datetime import datetime, timezone
from typing import Optional

ALGOLIA_BASE = "https://hn.algolia.com/api/v1"

def find_hiring_threads(
    year: Optional[int] = None,
    month: Optional[int] = None,
    max_results: int = 24,
) -> list[dict]:
    """
    Find 'Who is hiring?' threads.
    Year/month: filter to specific period. None = all available.
    """
    url = f"{ALGOLIA_BASE}/search"
    params = {
        "query": '"Who is hiring?"',
        "tags": "story,author_whoishiring",
        "hitsPerPage": max_results,
        "attributesToRetrieve": "objectID,title,created_at,num_comments,url",
        "attributesToHighlight": "",
    }

    # Date filtering
    if year and month:
        start_dt = datetime(year, month, 1, tzinfo=timezone.utc)
        if month == 12:
            end_dt = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
        else:
            end_dt = datetime(year, month + 1, 1, tzinfo=timezone.utc)
        params["numericFilters"] = (
            f"created_at_i>={int(start_dt.timestamp())},"
            f"created_at_i<{int(end_dt.timestamp())}"
        )
    elif year:
        start_dt = datetime(year, 1, 1, tzinfo=timezone.utc)
        end_dt = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
        params["numericFilters"] = (
            f"created_at_i>={int(start_dt.timestamp())},"
            f"created_at_i<{int(end_dt.timestamp())}"
        )

    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    threads = []
    for hit in data.get("hits", []):
        threads.append({
            "id": hit["objectID"],
            "title": hit.get("title", ""),
            "created_at": hit.get("created_at", ""),
            "num_comments": hit.get("num_comments", 0),
            "url": f"https://news.ycombinator.com/item?id={hit['objectID']}",
        })

    return sorted(threads, key=lambda x: x["created_at"])


# Find all 2026 threads
threads_2026 = find_hiring_threads(year=2026)
for t in threads_2026:
    print(f"{t['created_at'][:10]}  {t['title']:50}  ({t['num_comments']} posts)")

# Get specific month
march_2026 = find_hiring_threads(year=2026, month=3)
print(f"\nMarch 2026: {len(march_2026)} thread(s)")

Fetching Job Posts from a Thread

Each top-level comment in a hiring thread is one company's job posting. Algolia returns comments with their parent story ID:

def get_thread_comments(
    thread_id: str,
    page: int = 0,
    page_size: int = 100,
) -> tuple[list[dict], int]:
    """
    Fetch comments for a thread. Returns (comments_list, total_pages).
    """
    url = f"{ALGOLIA_BASE}/search"
    params = {
        "tags": f"comment,story_{thread_id}",
        "hitsPerPage": page_size,
        "page": page,
        "attributesToRetrieve": (
            "objectID,author,comment_text,created_at,"
            "parent_id,story_id,points"
        ),
        "attributesToHighlight": "",
    }

    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    return data.get("hits", []), data.get("nbPages", 1)


def get_all_job_posts(
    thread_id: str,
    delay: float = 0.5,
) -> list[dict]:
    """
    Fetch all top-level comments (job posts) from a hiring thread.
    Top-level = directly replies to the thread, not replies to other comments.
    """
    all_posts = []
    page = 0

    while True:
        posts, total_pages = get_thread_comments(thread_id, page=page)

        if not posts:
            break

        # Keep only direct replies to the thread: a top-level comment's
        # parent_id is the thread itself
        top_level = [
            p for p in posts
            if str(p.get("parent_id")) == str(thread_id)
        ]

        all_posts.extend(top_level)

        page += 1
        if page >= total_pages:
            break

        time.sleep(delay)

    return all_posts


# Fetch posts from the latest thread
if threads_2026:
    latest = threads_2026[-1]
    posts = get_all_job_posts(latest["id"])
    print(f"\n{latest['title']}")
    print(f"Total top-level job posts: {len(posts)}")
    print(f"\nFirst 3 posts:")
    for p in posts[:3]:
        text = p.get("comment_text", "")[:200]
        print(f"  Author: {p['author']} ({p['created_at'][:10]})")
        print(f"  Preview: {text}...")
        print()

Parsing Job Post Text

Job posts don't follow a strict schema — they're freeform text with conventions. A typical post looks like:

CompanyName | Role | Location | Salary range (optional)

[Company description paragraph]

Tech stack: Python, Django, PostgreSQL, AWS, k8s

Looking for: Senior backend engineer, ML engineer

Apply: [email protected] or https://company.com/careers
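
The pipe-delimited header line alone carries a lot of signal. A minimal sketch that splits it into fields, assuming the post follows the convention above (posts that skip the convention just land everything in `company`):

```python
def split_header(first_line: str) -> dict:
    """Split a 'Company | Role | Location | Salary' header into fields."""
    parts = [p.strip() for p in first_line.split("|")]
    return {
        "company": parts[0] if parts else "",
        "role": parts[1] if len(parts) > 1 else "",
        "location": parts[2] if len(parts) > 2 else "",
        "rest": parts[3:],  # salary range, visa notes, etc., when present
    }
```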

Here's a comprehensive parser:

import re
from html import unescape
from typing import Optional


def clean_html(html_text: str) -> str:
    """Strip HTML tags, decode entities, normalize whitespace."""
    if not html_text:
        return ""
    # Replace <p> with newlines for structure
    text = re.sub(r"<p>", "\n", html_text, flags=re.IGNORECASE)
    text = re.sub(r"</p>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    # Strip remaining tags
    text = re.sub(r"<[^>]+>", " ", text)
    text = unescape(text)
    return re.sub(r" {2,}", " ", text).strip()


# Comprehensive tech keyword list
TECH_KEYWORDS = {
    # Languages ("C lang" / "R lang" sidestep single-letter false positives,
    # at the cost of missing most real C and R mentions)
    "languages": [
        "Python", "JavaScript", "TypeScript", "Go", "Golang", "Rust",
        "Java", "Kotlin", "Swift", "Ruby", "Elixir", "Erlang", "Scala",
        "Haskell", "OCaml", "Clojure", "F#", "C#", "C\\+\\+", "C lang",
        "PHP", "Perl", "Julia", "R lang", "COBOL", "Fortran",
        "Dart", "Zig", "Nim",
    ],
    # Frontend
    "frontend": [
        "React", "Vue\\.js", "Angular", "Svelte", "Next\\.js", "Nuxt",
        "Redux", "TypeScript", "WebAssembly", "Wasm", "Tailwind",
        "Webpack", "Vite", "HTMX",
    ],
    # Backend
    "backend": [
        "Node\\.js", "Django", "FastAPI", "Flask", "Rails", "Spring Boot",
        "Express", "Fastify", "NestJS", "Laravel", "Symfony", "Phoenix",
        "Gin", "Echo", "Fiber", "Axum",
    ],
    # Databases
    "databases": [
        "PostgreSQL", "Postgres", "MySQL", "MariaDB", "SQLite",
        "MongoDB", "Redis", "DynamoDB", "Cassandra", "CockroachDB",
        "Snowflake", "BigQuery", "Redshift", "ClickHouse", "TimescaleDB",
        "Elasticsearch", "OpenSearch", "Pinecone", "Weaviate",
    ],
    # Infrastructure
    "infra": [
        "AWS", "GCP", "Azure", "Kubernetes", "K8s", "Docker",
        "Terraform", "Pulumi", "Ansible", "Helm", "ArgoCD",
        "GitHub Actions", "CircleCI", "Jenkins", "Datadog",
        "Prometheus", "Grafana", "OpenTelemetry",
    ],
    # Data & ML
    "data_ml": [
        "Kafka", "Spark", "Flink", "Airflow", "dbt", "Dagster",
        "PyTorch", "TensorFlow", "JAX", "scikit-learn",
        "LLM", "GPT", "Claude", "Gemini", "RAG", "vector database",
        "machine learning", "deep learning", "NLP", "computer vision",
        "MLOps", "LangChain", "LlamaIndex",
    ],
}

# The keyword lists above are pre-escaped regex fragments; match on them directly
ALL_TECH = TECH_KEYWORDS


def detect_tech_stack(text: str) -> dict:
    """Detect tech keywords in job post text, categorized."""
    text_lower = text.lower()
    found = {}

    for category, keywords in ALL_TECH.items():
        hits = []
        for kw in keywords:
            # Keywords are already regex-escaped (e.g. "Vue\\.js"), so use them
            # directly; lookarounds act like \b but also work for "C++"
            pattern = r'(?<!\w)' + kw.lower() + r'(?!\w)'
            if re.search(pattern, text_lower):
                hits.append(kw.replace("\\", ""))  # canonical, unescaped name
        if hits:
            found[category] = hits

    return found


def parse_salary(text: str) -> Optional[dict]:
    """
    Extract salary information from job post text.
    Handles formats like: $150k-$200k, $120,000-$160,000, $150k+, €80k-€110k
    """
    # Write the "to"/dash separator explicitly: a character class like
    # [-to] would also match stray 't' and 'o' characters
    sep = r'\s*(?:[-–—]+|to)\s*'
    patterns = [
        # $150k-$200k or $150K to $200K
        r'[\$€£]?\s*(\d+\.?\d*)\s*[kK]' + sep + r'[\$€£]?\s*(\d+\.?\d*)\s*[kK]',
        # $120,000-$160,000
        r'[\$€£]\s*([\d,]+)' + sep + r'[\$€£]?\s*([\d,]+)',
        # $200k+
        r'[\$€£]\s*(\d+\.?\d*)\s*[kK]\+',
        # $200,000+
        r'[\$€£]\s*([\d,]+)\+',
    ]

    for i, pattern in enumerate(patterns):
        m = re.search(pattern, text, re.IGNORECASE)
        if m:
            try:
                if i < 2:
                    low_str = m.group(1).replace(",", "")
                    high_str = m.group(2).replace(",", "")
                    low = float(low_str)
                    high = float(high_str)

                    # Normalize K notation
                    if i == 0:  # K format
                        low *= 1000
                        high *= 1000

                    # Sanity check: salary should be 20k-2M
                    if 20000 <= low <= 2_000_000 and 20000 <= high <= 2_000_000:
                        return {
                            "low": int(low),
                            "high": int(high),
                            "midpoint": int((low + high) / 2),
                            "raw": m.group(0),
                        }
                else:
                    # Single value with a trailing '+'
                    val_str = m.group(1).replace(",", "")
                    val = float(val_str)
                    if i == 2:  # the $200k+ pattern uses K notation
                        val *= 1000
                    if 20000 <= val <= 2_000_000:
                        return {
                            "low": int(val),
                            "high": None,
                            "midpoint": None,
                            "raw": m.group(0),
                        }

            except (ValueError, IndexError):
                continue

    return None


def detect_locations(text: str) -> dict:
    """Detect location and work type signals."""
    text_lower = text.lower()

    is_remote = bool(re.search(r'\bremote\b', text_lower))
    is_onsite = bool(re.search(r'\bon[- ]?site\b|\bonsite\b', text_lower))
    is_hybrid = bool(re.search(r'\bhybrid\b', text_lower))

    # Major tech cities
    city_patterns = {
        "San Francisco": r'\bsan francisco\b|\bsf\b|\bsfba\b|\bbay area\b',
        "New York": r'\bnew york\b|\bnyc\b|\bny\b',
        "Seattle": r'\bseattle\b|\bbellevue\b',
        "Boston": r'\bboston\b|\bcambridge, ma\b',
        "Austin": r'\baustin\b',
        "Los Angeles": r'\blos angeles\b|\bla\b',
        "Chicago": r'\bchicago\b',
        "Denver": r'\bdenver\b|\bboulder\b',
        "London": r'\blondon\b',
        "Berlin": r'\bberlin\b',
        "Amsterdam": r'\bamsterdam\b',
        "Toronto": r'\btoronto\b',
        "Singapore": r'\bsingapore\b',
        "Dublin": r'\bdublin\b',
        "Remote US": r'\bremote us\b|\bus remote\b|\bunited states remote\b',
        "Worldwide Remote": r'\bworldwide remote\b|\bglobal remote\b|\banywhere\b',
    }

    detected_cities = [
        city for city, pattern in city_patterns.items()
        if re.search(pattern, text_lower)
    ]

    return {
        "is_remote": is_remote,
        "is_onsite": is_onsite,
        "is_hybrid": is_hybrid,
        "cities": detected_cities,
        "work_type": (
            "hybrid" if is_hybrid
            else "remote" if is_remote and not is_onsite
            else "onsite" if is_onsite
            else "unknown"
        ),
    }


def detect_role_types(text: str) -> list[str]:
    """Classify job roles mentioned in the post."""
    text_lower = text.lower()

    role_patterns = {
        "backend": r'\bback[- ]?end\b|\bbackend\b',
        "frontend": r'\bfront[- ]?end\b|\bfrontend\b',
        "fullstack": r'\bfull[- ]?stack\b',
        "devops": r'\bdevops\b|\bsre\b|\bplatform engineer\b|\binfrastructure\b',
        "data_engineer": r'\bdata engineer\b|\bdata pipeline\b|\betl\b',
        "data_scientist": r'\bdata scientist\b|\bstatistician\b',
        "ml_ai": r'\bmachine learning\b|\bml engineer\b|\bai engineer\b|\bllm\b|\bgenai\b',
        "security": r'\bsecurity engineer\b|\binfosec\b|\bpenetration\b|\bdevsecops\b',
        "mobile": r'\bmobile\b|\bios developer\b|\bandroid developer\b|\breact native\b|\bflutter\b',
        "embedded": r'\bembedded\b|\bfirmware\b|\breal[- ]?time\b',
        "product": r'\bproduct manager\b|\bproduct owner\b|\bpm\b',
        "design": r'\bux\b|\bui designer\b|\bproduct designer\b',
        "engineering_manager": r'\bengineering manager\b|\bem\b|\btechnical lead\b|\bstaff engineer\b',
        "founder": r'\bco[- ]?founder\b|\bcto\b|\bhead of engineering\b',
    }

    return [
        role for role, pattern in role_patterns.items()
        if re.search(pattern, text_lower)
    ]


def detect_experience_level(text: str) -> str:
    """Infer seniority level from job post text."""
    text_lower = text.lower()

    if re.search(r'\bstaff\b|\bprincipal\b|\bdistinguished\b', text_lower):
        return "staff+"
    if re.search(r'\bsenior\b|\bsr\.\b|\blead\b|\bexperienced\b', text_lower):
        return "senior"
    if re.search(r'\bjunior\b|\bjr\.\b|\bentry[- ]?level\b|\bnew grad\b|\bgraduate\b', text_lower):
        return "junior"
    if re.search(r'\bmid[- ]?level\b|\bintermediate\b', text_lower):
        return "mid"
    return "unspecified"


def extract_email_and_url(text: str) -> dict:
    """Extract apply URLs and email addresses from post text."""
    emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
    urls = re.findall(r'https?://[^\s<>"]+', text)

    # Filter URLs to likely application/careers links
    career_urls = [
        u for u in urls
        if any(kw in u.lower() for kw in ["career", "jobs", "apply", "hire", "recruit", "lever.co", "greenhouse.io", "ashbyhq.com", "workable.com"])
    ]

    return {
        "emails": list(set(emails))[:3],
        "career_urls": list(set(career_urls))[:3],
        "all_urls": list(set(urls))[:5],
    }


def parse_job_post(comment: dict) -> Optional[dict]:
    """
    Full parser: extract structured data from a raw HN job comment.
    Returns None for empty or trivially short posts.
    """
    raw_html = comment.get("comment_text", "")
    clean_text = clean_html(raw_html)

    if len(clean_text) < 30:
        return None  # Skip empty or trivially short posts

    # Company name extraction (usually first line, before pipe/dash/newline)
    first_line = clean_text.split("\n")[0].strip()
    company = ""
    company_patterns = [
        r'^([A-Za-z][\w\s&.,\-\(\)]+?)\s*[|–—-]\s',  # Company | something / Company - something
        r'^([A-Za-z][\w\s&.,]+?)\s*\(',              # Company (location)
        r'^([A-Za-z][\w\s&.,]+?)\s+is\s+hir',        # Company is hiring
    ]
    for pattern in company_patterns:
        m = re.match(pattern, first_line)
        if m:
            company = m.group(1).strip()[:80]
            break
    if not company:
        company = first_line[:60]

    # Parse all signals
    tech_stack = detect_tech_stack(clean_text)
    salary = parse_salary(clean_text)
    locations = detect_locations(clean_text)
    roles = detect_role_types(clean_text)
    level = detect_experience_level(clean_text)
    contact = extract_email_and_url(clean_text)

    # Equity mentions
    has_equity = bool(re.search(r'\bequity\b|\bstock\b|\beso[sp]?\b|\bvesting\b', clean_text.lower()))

    # Visa/sponsorship mentions
    sponsors_visa = bool(re.search(r'\bvisa\b|\bsponsorship\b|\bwork[ -]authoriz', clean_text.lower()))

    return {
        "hn_id": comment.get("objectID"),
        "author": comment.get("author"),
        "created_at": comment.get("created_at"),
        "company": company,
        "clean_text": clean_text,
        "raw_html": raw_html,
        "char_count": len(clean_text),
        "is_remote": locations["is_remote"],
        "work_type": locations["work_type"],
        "cities": locations["cities"],
        "tech_languages": tech_stack.get("languages", []),
        "tech_frontend": tech_stack.get("frontend", []),
        "tech_backend": tech_stack.get("backend", []),
        "tech_databases": tech_stack.get("databases", []),
        "tech_infra": tech_stack.get("infra", []),
        "tech_data_ml": tech_stack.get("data_ml", []),
        "all_tech": [t for ts in tech_stack.values() for t in ts],
        "role_types": roles,
        "experience_level": level,
        "salary": salary,
        "has_equity": has_equity,
        "sponsors_visa": sponsors_visa,
        "contact": contact,
    }

Database Schema and Storage

import sqlite3
import json
from datetime import datetime

def init_jobs_db(db_path: str = "hn_jobs.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS hiring_threads (
            id TEXT PRIMARY KEY,
            title TEXT,
            created_at TEXT,
            num_comments INTEGER,
            url TEXT,
            scraped_at TEXT
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS job_posts (
            hn_id TEXT PRIMARY KEY,
            thread_id TEXT,
            author TEXT,
            company TEXT,
            created_at TEXT,
            char_count INTEGER,
            is_remote INTEGER,
            work_type TEXT,
            cities TEXT,
            tech_languages TEXT,
            tech_frontend TEXT,
            tech_backend TEXT,
            tech_databases TEXT,
            tech_infra TEXT,
            tech_data_ml TEXT,
            all_tech TEXT,
            role_types TEXT,
            experience_level TEXT,
            salary_low INTEGER,
            salary_high INTEGER,
            salary_mid INTEGER,
            salary_raw TEXT,
            has_equity INTEGER,
            sponsors_visa INTEGER,
            emails TEXT,
            career_urls TEXT,
            clean_text TEXT,
            scraped_at TEXT,
            FOREIGN KEY (thread_id) REFERENCES hiring_threads(id)
        )
    """)

    # Full-text search. Note: an external-content FTS table is not kept in
    # sync automatically; rebuild it (or add triggers) after bulk inserts.
    conn.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS job_posts_fts
        USING fts5(company, clean_text, all_tech, content='job_posts', content_rowid='rowid')
    """)

    conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_thread ON job_posts(thread_id)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_date ON job_posts(created_at)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_jobs_remote ON job_posts(is_remote)")

    conn.commit()
    return conn
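
Because `job_posts_fts` is an external-content table, it has to be repopulated explicitly after a bulk load. A sketch using the standard FTS5 `'rebuild'` command, plus a small search helper:

```python
import sqlite3

def rebuild_fts(conn: sqlite3.Connection) -> None:
    """Repopulate job_posts_fts from the job_posts content table."""
    conn.execute("INSERT INTO job_posts_fts(job_posts_fts) VALUES('rebuild')")
    conn.commit()

def search_posts(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[tuple]:
    """Full-text search over company names, post text, and tech lists."""
    return conn.execute(
        "SELECT company FROM job_posts_fts WHERE job_posts_fts MATCH ? LIMIT ?",
        (query, limit),
    ).fetchall()
```

Run `rebuild_fts` once at the end of a collection run; rebuilding after every insert would be wasteful.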


def save_thread(conn: sqlite3.Connection, thread: dict):
    conn.execute("""
        INSERT OR REPLACE INTO hiring_threads (id, title, created_at, num_comments, url, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?)
    """, (
        thread["id"], thread["title"], thread["created_at"],
        thread["num_comments"], thread.get("url", ""),
        datetime.utcnow().isoformat()
    ))
    conn.commit()


def save_job_post(conn: sqlite3.Connection, job: dict, thread_id: str):
    if not job or not job.get("hn_id"):
        return

    salary = job.get("salary") or {}
    contact = job.get("contact") or {}

    conn.execute("""
        INSERT OR REPLACE INTO job_posts
        (hn_id, thread_id, author, company, created_at, char_count,
         is_remote, work_type, cities, tech_languages, tech_frontend,
         tech_backend, tech_databases, tech_infra, tech_data_ml, all_tech,
         role_types, experience_level, salary_low, salary_high, salary_mid,
         salary_raw, has_equity, sponsors_visa, emails, career_urls,
         clean_text, scraped_at)
        VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
    """, (
        job["hn_id"], thread_id, job.get("author"), job.get("company"),
        job.get("created_at"), job.get("char_count", 0),
        int(job.get("is_remote", False)), job.get("work_type", ""),
        json.dumps(job.get("cities", [])),
        json.dumps(job.get("tech_languages", [])),
        json.dumps(job.get("tech_frontend", [])),
        json.dumps(job.get("tech_backend", [])),
        json.dumps(job.get("tech_databases", [])),
        json.dumps(job.get("tech_infra", [])),
        json.dumps(job.get("tech_data_ml", [])),
        json.dumps(job.get("all_tech", [])),
        json.dumps(job.get("role_types", [])),
        job.get("experience_level", ""),
        salary.get("low"), salary.get("high"), salary.get("midpoint"),
        salary.get("raw", ""),
        int(job.get("has_equity", False)),
        int(job.get("sponsors_visa", False)),
        json.dumps(contact.get("emails", [])),
        json.dumps(contact.get("career_urls", [])),
        job.get("clean_text", ""),
        datetime.utcnow().isoformat(),
    ))
    conn.commit()

Trend Analysis Queries

With data in the database, you can run real analysis:

from collections import Counter
import json


def tech_trend_by_month(conn: sqlite3.Connection, tech: str) -> list[dict]:
    """
    Track mentions of a specific technology across months.
    """
    # all_tech is stored as a JSON array of quoted strings, so match the
    # quoted token: a bare LIKE '%Go%' would also hit "MongoDB" or "Django"
    rows = conn.execute("""
        SELECT strftime('%Y-%m', j.created_at) AS month,
               COUNT(*) AS total_posts,
               SUM(CASE WHEN j.all_tech LIKE ? THEN 1 ELSE 0 END) AS tech_mentions
        FROM job_posts j
        WHERE j.char_count > 100
        GROUP BY month
        ORDER BY month
    """, (f'%"{tech}"%',)).fetchall()

    return [
        {
            "month": r[0],
            "total_posts": r[1],
            "mentions": r[2],
            "pct": round(100 * r[2] / r[1], 1) if r[1] > 0 else 0,
        }
        for r in rows
    ]


def remote_work_trend(conn: sqlite3.Connection) -> list[dict]:
    """
    Track remote work percentage across months.
    """
    rows = conn.execute("""
        SELECT strftime('%Y-%m', created_at) AS month,
               COUNT(*) AS total,
               SUM(is_remote) AS remote_count,
               SUM(CASE WHEN work_type='onsite' THEN 1 ELSE 0 END) AS onsite_count,
               SUM(CASE WHEN work_type='hybrid' THEN 1 ELSE 0 END) AS hybrid_count
        FROM job_posts
        WHERE char_count > 100
        GROUP BY month
        ORDER BY month
    """).fetchall()

    return [
        {
            "month": r[0], "total": r[1], "remote": r[2],
            "onsite": r[3], "hybrid": r[4],
            "remote_pct": round(100 * r[2] / r[1], 1) if r[1] > 0 else 0,
        }
        for r in rows
    ]


def top_tech_by_month(conn: sqlite3.Connection, month: str, top_n: int = 20) -> list[dict]:
    """
    Get top technologies mentioned in a specific month (YYYY-MM format).
    """
    rows = conn.execute("""
        SELECT all_tech FROM job_posts
        WHERE strftime('%Y-%m', created_at) = ?
          AND char_count > 100
    """, (month,)).fetchall()

    tech_counter = Counter()
    for row in rows:
        try:
            techs = json.loads(row[0] or "[]")
            for t in techs:
                if t:
                    tech_counter[t.lower()] += 1
        except json.JSONDecodeError:
            pass

    total_posts = len(rows)
    return [
        {
            "tech": tech, "mentions": count,
            "pct": round(100 * count / total_posts, 1) if total_posts > 0 else 0,
        }
        for tech, count in tech_counter.most_common(top_n)
    ]


def salary_trends(conn: sqlite3.Connection) -> list[dict]:
    """
    Track median salary midpoints across months.
    """
    rows = conn.execute("""
        SELECT strftime('%Y-%m', created_at) AS month,
               COUNT(*) AS posts_with_salary,
               AVG(salary_mid) AS avg_salary,
               MIN(salary_mid) AS min_salary,
               MAX(salary_mid) AS max_salary
        FROM job_posts
        WHERE salary_mid IS NOT NULL AND salary_mid BETWEEN 30000 AND 1000000
        GROUP BY month
        ORDER BY month
    """).fetchall()

    return [
        {
            "month": r[0], "posts": r[1],
            "avg": int(r[2]) if r[2] else None,
            "min": r[3], "max": r[4],
        }
        for r in rows
    ]


def company_hiring_frequency(conn: sqlite3.Connection, min_appearances: int = 2) -> list[dict]:
    """
    Find companies that have posted in multiple hiring threads.
    These are the consistent hirers.
    """
    rows = conn.execute("""
        SELECT company,
               COUNT(DISTINCT thread_id) AS thread_appearances,
               COUNT(*) AS total_posts,
               GROUP_CONCAT(DISTINCT strftime('%Y-%m', created_at)) AS months
        FROM job_posts
        WHERE char_count > 100 AND company != ''
        GROUP BY LOWER(company)
        HAVING thread_appearances >= ?
        ORDER BY thread_appearances DESC, total_posts DESC
        LIMIT 50
    """, (min_appearances,)).fetchall()

    return [
        {
            "company": r[0], "thread_appearances": r[1],
            "total_posts": r[2], "months": r[3],
        }
        for r in rows
    ]


def print_monthly_report(conn: sqlite3.Connection, month: str):
    """Print a summary report for a specific month."""
    print(f"\n=== HN Who's Hiring: {month} ===")

    # Basic stats
    stats = conn.execute("""
        SELECT COUNT(*), SUM(is_remote), AVG(char_count),
               SUM(CASE WHEN salary_mid IS NOT NULL THEN 1 ELSE 0 END)
        FROM job_posts
        WHERE strftime('%Y-%m', created_at) = ? AND char_count > 100
    """, (month,)).fetchone()

    if not stats or not stats[0]:
        print(f"  No data for {month}")
        return

    total, remote, avg_len, with_salary = stats
    print(f"  Total posts: {total}")
    print(f"  Remote: {remote} ({100*remote//total}%)")
    print(f"  With salary: {with_salary} ({100*with_salary//total}%)")
    print(f"  Avg post length: {int(avg_len)} chars")

    # Top tech
    top_tech = top_tech_by_month(conn, month, top_n=10)
    print(f"\n  Top Technologies:")
    for t in top_tech[:10]:
        bar = "█" * (t["mentions"] // max(1, total // 50))
        print(f"    {t['tech']:20} {t['mentions']:4} ({t['pct']:5.1f}%) {bar}")

Complete Collection Pipeline

def collect_thread(thread: dict, db: sqlite3.Connection, delay: float = 1.0):
    """Collect and parse all job posts from a single hiring thread."""
    print(f"\nCollecting: {thread['title']}")
    save_thread(db, thread)

    posts = get_all_job_posts(thread["id"], delay=delay)
    print(f"  Retrieved {len(posts)} comments")

    parsed_count = 0
    for post in posts:
        job = parse_job_post(post)
        if job and job.get("char_count", 0) > 50:
            save_job_post(db, job, thread["id"])
            parsed_count += 1

    print(f"  Saved {parsed_count} job posts")
    return parsed_count


def run_full_collection(
    years: list[int] = None,
    db_path: str = "hn_jobs.db",
    delay: float = 0.8,
):
    """
    Collect all HN hiring threads for specified years.
    Default: current year only.
    """
    if years is None:
        years = [2026]

    db = init_jobs_db(db_path)
    total_posts = 0

    for year in years:
        print(f"\n{'='*50}")
        print(f"Year: {year}")
        threads = find_hiring_threads(year=year)
        print(f"Found {len(threads)} hiring threads")

        for thread in threads:
            count = collect_thread(thread, db, delay=delay)
            total_posts += count
            time.sleep(3)  # Gap between threads

    print(f"\nCollection complete. Total job posts: {total_posts}")

    # Print summary reports
    conn = db
    for year in years:
        for month_num in range(1, 13):
            month_str = f"{year}-{month_num:02d}"
            row = conn.execute(
                "SELECT COUNT(*) FROM job_posts WHERE strftime('%Y-%m', created_at) = ?",
                (month_str,)
            ).fetchone()
            if row and row[0] > 0:
                print_monthly_report(conn, month_str)

    db.close()


if __name__ == "__main__":
    # Collect 2024-2026 for multi-year trend analysis
    run_full_collection(years=[2024, 2025, 2026])

    # Print remote work trend
    db = init_jobs_db()
    print("\nRemote work by month (last 24 months):")
    for row in remote_work_trend(db)[-24:]:
        print(f"  {row['month']}: {row['remote_pct']:5.1f}% remote "
              f"({row['remote']}/{row['total']} posts)")

    # Python trend
    print("\nPython mentions over time:")
    for row in tech_trend_by_month(db, "Python")[-12:]:
        print(f"  {row['month']}: {row['pct']:5.1f}% of posts ({row['mentions']}/{row['total_posts']})")

    db.close()

Rate Limits and Proxy Notes

The Algolia HN API is generous and doesn't require authentication; the only documented limit is the 10,000 requests per hour mentioned earlier.

For historical collection spanning multiple years (100+ threads, 50,000+ posts), space requests with 0.5-1s delays and you're unlikely to hit limits. The Algolia API is designed to be used this way.
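
A polite-request sketch for long collection runs: a fixed inter-request delay plus exponential backoff on 429 and 5xx responses. The retry counts and delays here are arbitrary choices, not Algolia recommendations:

```python
import time
import requests

def polite_get(url: str, params: dict, delay: float = 0.5, retries: int = 4) -> dict:
    """GET with a fixed delay before each request and backoff on 429/5xx."""
    for attempt in range(retries):
        time.sleep(delay)
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code == 429 or resp.status_code >= 500:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"Gave up on {url} after {retries} attempts")
```

Dropping this in for the bare `requests.get` calls above makes the collection pipeline resilient to transient errors without changing its structure.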

If you're running from a cloud server or a shared IP that other scrapers have hammered, ThorData residential proxies avoid inheriting someone else's rate limit debt:

def fetch_with_proxy(url: str, params: dict, proxy_url: str) -> dict:
    """Fetch Algolia API through a proxy."""
    proxies = {"http": proxy_url, "https": proxy_url}
    resp = requests.get(url, params=params, proxies=proxies, timeout=30)
    resp.raise_for_status()
    return resp.json()

What the Data Actually Shows

If you collect 2021-2026, you'll see some clear patterns in the tech stack data:

Python dominance expanding. Python's mention rate has grown from ~35% of posts in 2021 to ~55%+ in 2025-2026, driven by ML/AI adoption pulling it into companies that previously wouldn't have used it.

Remote work spike and plateau. Remote mentions peaked around 70% in 2021-2022, dropped to 50-55% by 2024, and have stabilized there. The "office is back" narrative overstated the swing.

Rust growing steadily. From <5% in 2021 to 12-15% in 2026. Not a dominant language, but a consistent signal of performance-critical systems work.

LLM/AI roles are a new category. Before 2023, ML roles were maybe 8-12% of posts. By 2026, explicit LLM/AI/genAI roles are 25%+, and that's probably undercounting teams where LLM use is implicit.

Salary ranges increasing, then leveling. The 2021-2022 hiring frenzy pushed median posted salaries up ~30%. 2023 corrections were visible in the data. 2024-2026 shows stabilization at above pre-2021 levels.
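
A quick way to eyeball any of these monthly series without a plotting library is an ASCII sparkline, for example over the remote_pct values returned by remote_work_trend above (the sample numbers below are illustrative):

```python
def sparkline(values: list[float]) -> str:
    """Render a list of numbers as a one-line Unicode sparkline."""
    blocks = "▁▂▃▄▅▆▇█"
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero on flat series
    return "".join(
        blocks[int((v - lo) / span * (len(blocks) - 1))] for v in values
    )

# e.g. remote-share percentages, month by month
print(sparkline([70.1, 68.4, 63.0, 57.5, 54.2, 53.8, 54.1]))
```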

The HN hiring dataset is one of the more honest datasets you can build for tech labor market research — the posts are written by people who are actually trying to hire, not aggregated through a third-party job board with its own biases.

Conclusion

The Algolia HN API makes this dataset unusually accessible: no authentication, free pagination, 15+ years of data. The work is in the text parsing, because job posts are freeform, human-written text with no enforced schema. The regex-based parser here gets you roughly 80% accuracy on most signals; for production use on salary extraction specifically, consider augmenting with an LLM extraction pass on posts where the regex returns null. The analysis queries surface real patterns once you have a few months of data. Collect it consistently and you have something genuinely useful.
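
A sketch of that second pass: pull the posts where the regex found nothing and hand them to a model call. Here llm_extract_salary is a placeholder stub for whatever extraction function you wire up, not a real API:

```python
import sqlite3
from typing import Optional

def posts_missing_salary(conn: sqlite3.Connection, limit: int = 200) -> list[tuple[str, str]]:
    """(hn_id, clean_text) pairs where the regex pass found no salary."""
    return conn.execute("""
        SELECT hn_id, clean_text FROM job_posts
        WHERE salary_mid IS NULL AND char_count > 100
        LIMIT ?
    """, (limit,)).fetchall()

def llm_extract_salary(text: str) -> Optional[dict]:
    """Placeholder: replace with a real model call that returns
    {'low': int, 'high': int} or None when no salary is stated."""
    return None
```

Batching these posts and writing the model's output back into salary_low/salary_high keeps the rest of the trend queries unchanged.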