← Back to blog

How to Scrape Remote Job Boards in 2026: Multi-Source Aggregation Guide

How to Scrape Remote Job Boards in 2026: Multi-Source Aggregation Guide

Remote job listings are fragmented across a dozen different platforms, and no single board has everything. If you're building a job aggregator, doing salary research, or tracking hiring trends by category, you need data from multiple sources pulled into one place.

This guide covers four major remote job boards — RemoteOK, WeWorkRemotely, Remote.co, and Remotive — their individual quirks, and how to normalize everything into a unified SQLite store.

What You Can Extract

Across these platforms you can collect: job titles, company names and logos, salary ranges (where published), locations, categories and tags, posting dates, application URLs, and job descriptions.

Anti-Bot Measures Across Platforms

Each board has a different posture toward scrapers: RemoteOK exposes a public JSON API but rejects requests without a real browser User-Agent, WeWorkRemotely serves plain HTML yet throttles repeated requests from a single IP, and Remote.co sits behind Cloudflare.

Dependencies and Setup

pip install httpx[http2] requests beautifulsoup4 playwright
playwright install chromium

Scraping RemoteOK (JSON API)

RemoteOK publishes a public API that returns all active listings. You must send a real User-Agent or you'll get a 403.

import httpx
import time
import json
import random
import re
from datetime import datetime

# Baseline headers merged into every request; a rotating User-Agent is added
# per request (RemoteOK returns 403 without a browser-like UA).
HEADERS_BASE = {
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.9",
}

# Desktop Chrome user-agents, picked at random per request to vary the fingerprint.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
]

# RemoteOK's public JSON API endpoint (returns all active listings in one call).
REMOTEOK_API = "https://remoteok.com/api"


def fetch_remoteok(proxy: str = None) -> list:
    """Fetch all active listings from RemoteOK's public JSON API.

    Args:
        proxy: Optional proxy URL, mounted for both http and https traffic.

    Returns:
        List of normalized job dicts (source, title, company, salary, ...).

    Raises:
        httpx.HTTPStatusError: if the API responds with a non-2xx status.
    """

    def _coerce_salary(value):
        # The API sometimes returns salary bounds as strings like "$70,000";
        # normalize to int, or None when the string is unparseable.
        # (Previously this logic was duplicated inline for min and max.)
        if value and isinstance(value, str):
            try:
                return int(value.replace(",", "").replace("$", ""))
            except ValueError:
                return None
        return value

    headers = {**HEADERS_BASE, "User-Agent": random.choice(USER_AGENTS)}
    client_kwargs = {"headers": headers, "timeout": 20, "follow_redirects": True}
    if proxy:
        client_kwargs["proxies"] = {"http://": proxy, "https://": proxy}

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(REMOTEOK_API)
        resp.raise_for_status()
        data = resp.json()  # parse while the client is still open

    # First element is API metadata, not a listing — keep only dicts that
    # carry a job "position".
    jobs = [item for item in data if isinstance(item, dict) and "position" in item]

    results = []
    for job in jobs:
        results.append({
            "source": "remoteok",
            "external_id": str(job.get("id", "")),
            "title": job.get("position"),
            "company": job.get("company"),
            "company_logo": job.get("company_logo"),
            "salary_min": _coerce_salary(job.get("salary_min")),
            "salary_max": _coerce_salary(job.get("salary_max")),
            "location": job.get("location", "Worldwide"),
            "job_type": "full-time",  # hard-coded: listing type is not read from the payload
            "category": ", ".join(job.get("tags", [])[:5]),
            "posted_date": job.get("date"),
            "url": job.get("url"),
            # `or ""` also guards an explicit null description; a plain
            # get() default only covers a missing key (matches the Remotive
            # fetcher's handling).
            "description_html": (job.get("description") or "")[:1000],
        })

    print(f"RemoteOK: {len(results)} listings")
    return results

Scraping WeWorkRemotely

WeWorkRemotely organizes listings by category. Each category page is plain HTML — no JavaScript rendering required.

from bs4 import BeautifulSoup

# WeWorkRemotely groups listings by category; each page below is plain HTML
# (no JavaScript rendering required) and is scraped by fetch_wwr().
WWR_CATEGORIES = [
    "https://weworkremotely.com/categories/remote-full-stack-programming-jobs",
    "https://weworkremotely.com/categories/remote-back-end-programming-jobs",
    "https://weworkremotely.com/categories/remote-front-end-programming-jobs",
    "https://weworkremotely.com/categories/remote-design-jobs",
    "https://weworkremotely.com/categories/remote-marketing-jobs",
    "https://weworkremotely.com/categories/remote-product-jobs",
    "https://weworkremotely.com/categories/remote-copywriting-jobs",
    "https://weworkremotely.com/categories/remote-customer-support-jobs",
    "https://weworkremotely.com/categories/remote-sales-and-bizdev-jobs",
    "https://weworkremotely.com/categories/remote-data-science-jobs",
    "https://weworkremotely.com/categories/remote-devops-sysadmin-jobs",
    "https://weworkremotely.com/categories/remote-management-finance-jobs",
]

# Base origin used to absolutize relative listing hrefs.
WWR_BASE = "https://weworkremotely.com"


def fetch_wwr(proxy: str = None) -> list:
    """Scrape WeWorkRemotely listings from all category pages."""

    def text_or_none(node):
        # Collapse an element to stripped text, tolerating missing nodes.
        return node.get_text(strip=True) if node else None

    request_headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://weworkremotely.com/",
    }
    client_kwargs = {"headers": request_headers, "timeout": 25, "follow_redirects": True}
    if proxy:
        client_kwargs["proxies"] = {"http://": proxy, "https://": proxy}

    results = []
    with httpx.Client(**client_kwargs) as client:
        for url in WWR_CATEGORIES:
            # The URL slug doubles as a human-readable category label.
            slug = url.split("/categories/")[-1]
            category = slug.replace("-jobs", "").replace("-", " ").title()
            try:
                page = client.get(url)
                page.raise_for_status()
                soup = BeautifulSoup(page.text, "html.parser")

                for section in soup.select("section.jobs"):
                    for item in section.select("ul li"):
                        # Skip the trailing "view all" navigation entry.
                        if "view-all" in (item.get("class") or []):
                            continue

                        anchor = item.select_one("a")
                        if anchor is None:
                            continue

                        href = anchor.get("href", "")
                        full_url = f"{WWR_BASE}{href}" if href.startswith("/") else href

                        region_node = item.select_one(".region")
                        time_node = item.select_one("time")
                        logo_node = item.select_one("img")

                        results.append({
                            "source": "weworkremotely",
                            "external_id": href.split("/")[-1] if href else "",
                            "title": text_or_none(item.select_one(".title")),
                            "company": text_or_none(item.select_one(".company")),
                            "company_logo": logo_node.get("src") if logo_node else None,
                            "location": region_node.get_text(strip=True) if region_node else "Remote",
                            "category": category,
                            "posted_date": time_node.get("datetime") if time_node else None,
                            "url": full_url,
                        })

            except Exception as e:
                print(f"WWR error on {url}: {e}")

            # Polite pacing between category pages.
            time.sleep(random.uniform(2.5, 4.5))

    print(f"WeWorkRemotely: {len(results)} listings")
    return results


def fetch_wwr_job_detail(job_url: str, proxy: str = None) -> dict:
    """Fetch the full job description from a WWR listing page."""
    request_headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
        "Referer": "https://weworkremotely.com/",
    }
    client_kwargs = {"headers": request_headers, "timeout": 20, "follow_redirects": True}
    if proxy:
        client_kwargs["proxies"] = {"http://": proxy, "https://": proxy}

    with httpx.Client(**client_kwargs) as client:
        page = client.get(job_url)

    # Anything but a clean 200 yields an empty detail dict.
    if page.status_code != 200:
        return {}

    soup = BeautifulSoup(page.text, "html.parser")
    detail = {}

    # Full description text, capped at 3000 characters.
    container = soup.select_one(".listing-container, [class*='job-description']")
    if container is not None:
        detail["description"] = container.get_text(separator=" ", strip=True)[:3000]

    # Up to 20 tag/skill labels.
    detail["tags"] = [
        node.get_text(strip=True) for node in soup.select(".tag, [class*='tag']")
    ][:20]

    # External application link, when present.
    apply_link = soup.select_one("a.apply-link, a[href*='apply']")
    if apply_link is not None:
        detail["apply_url"] = apply_link.get("href")

    return detail

Scraping Remote.co

Remote.co uses a different HTML structure. Their job cards live under .job_listings:

# Remote.co landing page for all remote jobs (not fetched by the code below).
REMOTECO_URL = "https://remote.co/remote-jobs/"
# Category pages scraped by fetch_remoteco(); the trailing path segment
# doubles as the category label.
REMOTECO_CATEGORIES = [
    "https://remote.co/remote-jobs/developer/",
    "https://remote.co/remote-jobs/designer/",
    "https://remote.co/remote-jobs/marketing/",
    "https://remote.co/remote-jobs/customer-service/",
    "https://remote.co/remote-jobs/sales/",
    "https://remote.co/remote-jobs/writing/",
    "https://remote.co/remote-jobs/finance/",
    "https://remote.co/remote-jobs/data-science/",
]


def fetch_remoteco(proxy: str = None) -> list:
    """Scrape Remote.co job listings."""
    request_headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://remote.co/",
    }
    client_kwargs = {"headers": request_headers, "timeout": 25, "follow_redirects": True}
    if proxy:
        client_kwargs["proxies"] = {"http://": proxy, "https://": proxy}

    results = []
    with httpx.Client(**client_kwargs) as client:
        for url in REMOTECO_CATEGORIES:
            # Derive a readable category label from the URL path segment.
            slug = url.split("/remote-jobs/")[-1].rstrip("/")
            category = slug.replace("-", " ").title()
            try:
                page = client.get(url)
                page.raise_for_status()
                soup = BeautifulSoup(page.text, "html.parser")

                # Both historical class spellings of the listings container.
                for card in soup.select(".job_listings .job_listing, .job-listings .job-listing"):
                    title_node = card.select_one(".position h3, h3")
                    if title_node is None:
                        # Not a job card we recognize.
                        continue

                    company_node = card.select_one(".company strong, .company")
                    location_node = card.select_one(".location")
                    date_node = card.select_one(".date")
                    anchor = card.select_one("a.listing_wrapper, a")
                    href = anchor.get("href", "") if anchor else ""

                    results.append({
                        "source": "remoteco",
                        "external_id": href.split("/")[-2] if "/" in href else href,
                        "title": title_node.get_text(strip=True),
                        "company": company_node.get_text(strip=True) if company_node else None,
                        "location": location_node.get_text(strip=True) if location_node else "Remote",
                        "category": category,
                        "posted_date": date_node.get_text(strip=True) if date_node else None,
                        "url": href,
                    })

            except httpx.HTTPError as e:
                print(f"Remote.co error on {url}: {e}")

            # Polite pacing between category pages.
            time.sleep(random.uniform(2, 4))

    print(f"Remote.co: {len(results)} listings")
    return results

Scraping Remotive (Public JSON API)

For a more public-facing board without heavy auth walls:

def fetch_remotehabitat(proxy: str = None) -> list:
    """Fetch listings from Remotive's public JSON API.

    NOTE(review): despite the historical function name, this pulls from
    https://remotive.com/api/remote-jobs and tags rows with source="remotive".
    The name is kept so existing callers (fetch_all_sources) keep working.

    Args:
        proxy: Optional proxy URL, mounted for both http and https traffic.

    Returns:
        List of normalized job dicts; empty list on a non-200 response.
    """
    url = "https://remotive.com/api/remote-jobs"
    headers = {**HEADERS_BASE, "User-Agent": random.choice(USER_AGENTS)}

    client_kwargs = {"headers": headers, "timeout": 20, "follow_redirects": True}
    if proxy:
        client_kwargs["proxies"] = {"http://": proxy, "https://": proxy}

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(url)
        if resp.status_code != 200:
            return []
        data = resp.json()

    jobs = data.get("jobs", [])

    results = []
    for job in jobs:
        salary_text = job.get("salary", "")
        salary_min = salary_max = None
        if salary_text:
            # Capture each number together with an optional trailing "k"
            # multiplier. The previous implementation located the suffix via
            # str.find(), which only finds the FIRST occurrence of a number,
            # so strings like "$50 - $50k" were parsed incorrectly.
            parsed = []
            for num, suffix in re.findall(r'([\d,]+)\s*(k?)', salary_text.lower()):
                try:
                    value = int(num.replace(",", ""))
                except ValueError:
                    # e.g. a stray "," matched by [\d,]+
                    continue
                if suffix:
                    value *= 1000
                parsed.append(value)
            if parsed:
                salary_min = min(parsed)
                salary_max = max(parsed)

        results.append({
            "source": "remotive",
            "external_id": str(job.get("id", "")),
            "title": job.get("title"),
            "company": job.get("company_name"),
            "company_logo": job.get("company_logo"),
            "salary_min": salary_min,
            "salary_max": salary_max,
            "location": job.get("candidate_required_location", "Worldwide"),
            "job_type": job.get("job_type", "full_time").replace("_", "-"),
            "category": job.get("category", ""),
            "posted_date": job.get("publication_date"),
            "url": job.get("url"),
            "description_html": (job.get("description") or "")[:1000],
        })

    print(f"Remotive: {len(results)} listings")
    return results

Handling Anti-Bot Detection at Scale

When you're hitting three or four boards in parallel, your requests start looking like a scraping operation fast. Remote.co uses Cloudflare, and even WeWorkRemotely will throttle you after a few dozen requests from the same IP.

ThorData's residential proxies handle the rotation automatically — you get a fresh residential IP per request or per session depending on how you configure the pool.

# ThorData residential proxy endpoint — replace the credentials before use.
PROXY_URL = "http://YOUR_USER:[email protected]:9000"


def get_proxied_client(proxy_url: str = None) -> httpx.Client:
    """Build an httpx client with optional proxy configuration."""
    # Rotate the User-Agent per client so successive sessions vary.
    options = {
        "headers": {**HEADERS_BASE, "User-Agent": random.choice(USER_AGENTS)},
        "timeout": 25,
        "follow_redirects": True,
    }
    if proxy_url:
        # Mount the same upstream proxy for both plain and TLS traffic.
        options["proxies"] = {"http://": proxy_url, "https://": proxy_url}
    return httpx.Client(**options)


def fetch_all_sources(proxy_url: str = None) -> list:
    """Aggregate jobs from all sources."""
    # (label, fetcher) pairs, hit in order with a pause between sources.
    sources = [
        ("RemoteOK", fetch_remoteok),
        ("WeWorkRemotely", fetch_wwr),
        ("Remote.co", fetch_remoteco),
        ("Remotive", fetch_remotehabitat),
    ]

    all_jobs = []
    for index, (label, fetcher) in enumerate(sources):
        print(f"Fetching from {label}...")
        all_jobs.extend(fetcher(proxy=proxy_url))
        # Pause between sources, but not after the last one.
        if index < len(sources) - 1:
            time.sleep(random.uniform(3, 6))

    print(f"\nTotal: {len(all_jobs)} jobs across all sources")
    return all_jobs

Deduplication Strategy

The same job gets reposted across boards all the time. Normalize the title + company string and use fuzzy matching:

import difflib

def normalize_for_dedup(text: str) -> str:
    """Normalize text for deduplication comparison."""
    # Separator punctuation becomes a space, parens/commas are dropped,
    # then runs of whitespace collapse to a single space.
    table = str.maketrans({"-": " ", "_": " ", "(": "", ")": "", "/": " ", ",": ""})
    cleaned = text.lower().strip().translate(table)
    return re.sub(r'\s+', ' ', cleaned)


def is_duplicate(job: dict, seen: list, threshold: float = 0.88) -> bool:
    """Check if a job is similar to any seen job using fuzzy matching."""
    candidate = normalize_for_dedup(
        f"{job.get('title', '')} {job.get('company', '')}"
    )
    # One fuzzy hit above the threshold is enough to call it a duplicate.
    return any(
        difflib.SequenceMatcher(None, candidate, prior).ratio() > threshold
        for prior in seen
    )


def deduplicate(jobs: list, threshold: float = 0.88) -> list:
    """Remove duplicate jobs using fuzzy title+company matching.

    Args:
        jobs: Normalized job dicts from any source.
        threshold: SequenceMatcher ratio above which two keys are duplicates.

    Returns:
        Jobs with fuzzy title+company duplicates removed, preferring the
        most complete copy of each cluster.

    Note: pairwise comparison is O(n^2) in the number of unique jobs —
    fine for a few thousand listings.
    """
    seen_keys = []
    unique = []

    # Sort so the kept copy of each duplicate cluster is the one with the
    # richest data (salary present, description present, longest title).
    sorted_jobs = sorted(
        jobs,
        key=lambda j: (
            bool(j.get("salary_min")),
            bool(j.get("description_html")),
            len(j.get("title") or ""),
        ),
        reverse=True,
    )

    for job in sorted_jobs:
        # Normalize once per job — the previous version normalized a second
        # time (inside is_duplicate and again before appending).
        key = normalize_for_dedup(f"{job.get('title', '')} {job.get('company', '')}")
        if any(
            difflib.SequenceMatcher(None, key, prior).ratio() > threshold
            for prior in seen_keys
        ):
            continue
        seen_keys.append(key)
        unique.append(job)

    print(f"Dedup: {len(jobs)} -> {len(unique)} unique jobs")
    return unique

Skills Extraction from Job Descriptions

When you have job descriptions, extract required skills for analysis:

# Closed vocabulary of technical skills matched (case-insensitively) against
# job descriptions by extract_skills().
TECH_SKILLS = {
    "python", "javascript", "typescript", "react", "node.js", "nodejs",
    "sql", "postgresql", "mysql", "mongodb", "redis", "elasticsearch",
    "aws", "gcp", "azure", "docker", "kubernetes", "terraform",
    "django", "fastapi", "flask", "rails", "spring", "graphql",
    "machine learning", "deep learning", "pytorch", "tensorflow",
    "figma", "sketch", "photoshop", "illustrator",
    "go", "golang", "rust", "java", "kotlin", "swift", "c++",
    "excel", "tableau", "looker", "dbt", "airflow",
    "selenium", "playwright", "cypress",
}

# Soft/process skills matched the same way.
SOFT_SKILLS = {
    "communication", "leadership", "teamwork", "problem solving",
    "project management", "agile", "scrum", "remote", "async",
    "bilingual", "customer facing", "stakeholder", "cross-functional",
}


def extract_skills(description: str) -> dict:
    """Extract technical and soft skills (and a salary range) from a description.

    Matching is case-insensitive and requires word boundaries, so "java" no
    longer fires on every "javascript" posting and short skills like "go" or
    "excel" no longer match inside words like "good" or "excellent" (the
    previous substring check did both).

    Args:
        description: Raw job description text (may be empty or None).

    Returns:
        {"tech": [...], "soft": [...]}, plus "salary_min_extracted" /
        "salary_max_extracted" keys when a salary range is found in the text.
    """
    if not description:
        return {"tech": [], "soft": []}

    desc_lower = description.lower()

    def _mentions(skill: str) -> bool:
        # Word-boundary match that tolerates skills containing '.', '+' or
        # spaces (node.js, c++, machine learning).
        pattern = r"(?<![a-z0-9])" + re.escape(skill) + r"(?![a-z0-9])"
        return re.search(pattern, desc_lower) is not None

    found_tech = sorted(skill for skill in TECH_SKILLS if _mentions(skill))
    found_soft = sorted(skill for skill in SOFT_SKILLS if _mentions(skill))

    # Salary range like "$70k - $90k" or "$70,000 to $90,000".
    salary_match = re.search(
        r'\$?\s*([\d,]+)\s*(?:k|,000)?\s*(?:to|–|-)\s*\$?\s*([\d,]+)\s*(?:k|,000)?',
        description, re.IGNORECASE
    )
    salary_info = {}
    if salary_match:
        low = int(salary_match.group(1).replace(",", ""))
        high = int(salary_match.group(2).replace(",", ""))
        # Heuristic: values under 1000 are assumed to be "k" figures.
        # NOTE(review): this misreads hourly/daily rates like "$50 - $60" —
        # confirm annual salaries are the only expected format.
        if low < 1000:
            low *= 1000
        if high < 1000:
            high *= 1000
        salary_info = {"salary_min_extracted": low, "salary_max_extracted": high}

    return {"tech": found_tech, "soft": found_soft, **salary_info}

Unified Storage Schema

import sqlite3
from datetime import datetime

def init_db(db_path: str = "remote_jobs.db") -> sqlite3.Connection:
    """Initialize the unified remote jobs database.

    Creates the jobs, job_changes and daily_stats tables plus lookup
    indexes (idempotent via IF NOT EXISTS) and returns the open connection.

    Args:
        db_path: SQLite file path; ":memory:" works for tests.

    Returns:
        An open sqlite3.Connection with the schema committed.
    """
    conn = sqlite3.connect(db_path)
    # UNIQUE(url) is the dedup key relied on by insert_jobs's INSERT OR
    # IGNORE; rows with a NULL url bypass it (SQLite treats NULLs as
    # distinct in UNIQUE constraints).
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source TEXT NOT NULL,
            external_id TEXT,
            title TEXT,
            company TEXT,
            company_logo TEXT,
            salary_min INTEGER,
            salary_max INTEGER,
            location TEXT,
            job_type TEXT,
            category TEXT,
            posted_date TEXT,
            url TEXT UNIQUE,
            apply_url TEXT,
            description_html TEXT,
            tech_skills TEXT,
            scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS job_changes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT,
            change_type TEXT,  -- 'added' or 'removed'
            title TEXT,
            company TEXT,
            detected_at DATETIME DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS daily_stats (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            snapshot_date TEXT,
            total_jobs INTEGER,
            new_jobs INTEGER,
            removed_jobs INTEGER,
            avg_salary_min REAL,
            avg_salary_max REAL,
            top_categories TEXT,
            top_skills TEXT
        );

        CREATE INDEX IF NOT EXISTS idx_jobs_source ON jobs(source);
        CREATE INDEX IF NOT EXISTS idx_jobs_title ON jobs(title);
        CREATE INDEX IF NOT EXISTS idx_jobs_company ON jobs(company);
        CREATE INDEX IF NOT EXISTS idx_jobs_category ON jobs(category);
        CREATE INDEX IF NOT EXISTS idx_jobs_salary ON jobs(salary_min);
    """)
    conn.commit()
    return conn


def insert_jobs(conn: sqlite3.Connection, jobs: list) -> int:
    """Insert jobs, skipping duplicates. Returns count of new rows."""
    # Columns whose values come straight from the job dict.
    column_names = (
        "source", "external_id", "title", "company", "company_logo",
        "salary_min", "salary_max", "location", "job_type", "category",
        "posted_date", "url", "apply_url", "description_html",
    )

    inserted = 0
    for job in jobs:
        # Derive skills from whatever description text was captured.
        skills = extract_skills(job.get("description_html", ""))
        tech_skills = json.dumps(skills.get("tech", []))

        # Backfill salary bounds parsed from the description when the source
        # API did not provide them.
        for column, extracted in (
            ("salary_min", "salary_min_extracted"),
            ("salary_max", "salary_max_extracted"),
        ):
            if not job.get(column) and skills.get(extracted):
                job[column] = skills[extracted]

        params = {name: job.get(name) for name in column_names}
        params["tech_skills"] = tech_skills

        try:
            conn.execute("""
                INSERT OR IGNORE INTO jobs
                    (source, external_id, title, company, company_logo,
                     salary_min, salary_max, location, job_type, category,
                     posted_date, url, apply_url, description_html, tech_skills)
                VALUES
                    (:source, :external_id, :title, :company, :company_logo,
                     :salary_min, :salary_max, :location, :job_type, :category,
                     :posted_date, :url, :apply_url, :description_html, :tech_skills)
            """, params)
            # changes() is 1 when the row was actually inserted, 0 when the
            # UNIQUE(url) constraint caused an IGNORE.
            if conn.execute("SELECT changes()").fetchone()[0]:
                inserted += 1
        except sqlite3.Error as e:
            print(f"Insert error: {e}")

    conn.commit()
    return inserted


def record_daily_snapshot(conn: sqlite3.Connection, new_count: int):
    """Record a daily aggregate snapshot into daily_stats.

    Args:
        conn: Open connection whose schema matches init_db().
        new_count: Number of jobs inserted during this run.

    Note: the daily_stats.removed_jobs column is not populated here.
    """
    from datetime import timezone
    from collections import Counter

    # datetime.utcnow() is deprecated since Python 3.12 — use an aware UTC
    # datetime instead (same date string output).
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")

    total = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    avg_sal = conn.execute(
        "SELECT AVG(salary_min), AVG(salary_max) FROM jobs WHERE salary_min > 0"
    ).fetchone()

    # Top categories by listing count.
    top_cats = conn.execute(
        "SELECT category, COUNT(*) FROM jobs WHERE category != '' GROUP BY category ORDER BY COUNT(*) DESC LIMIT 10"
    ).fetchall()

    # Flatten the per-job JSON skill lists and keep the 15 most common.
    all_skills = []
    for row in conn.execute("SELECT tech_skills FROM jobs WHERE tech_skills != '[]'"):
        try:
            all_skills.extend(json.loads(row[0]))
        except (json.JSONDecodeError, TypeError):
            pass
    top_skills = dict(Counter(all_skills).most_common(15))

    conn.execute("""
        INSERT INTO daily_stats (snapshot_date, total_jobs, new_jobs,
        avg_salary_min, avg_salary_max, top_categories, top_skills)
        VALUES (?, ?, ?, ?, ?, ?, ?)
    """, (
        today, total, new_count,
        round(avg_sal[0] or 0, 0), round(avg_sal[1] or 0, 0),
        json.dumps(dict(top_cats)),
        json.dumps(top_skills),
    ))
    conn.commit()

Analytics and Reporting

def generate_market_report(db_path: str = "remote_jobs.db") -> dict:
    """Generate a market intelligence report from stored jobs."""
    from collections import Counter

    conn = sqlite3.connect(db_path)

    # Average salary band per category, for categories with enough data.
    salary_jobs = conn.execute("""
        SELECT category, AVG(salary_min), AVG(salary_max), COUNT(*) as n
        FROM jobs
        WHERE salary_min > 0
        GROUP BY category
        HAVING n >= 3
        ORDER BY AVG(salary_min) DESC
    """).fetchall()

    # Listing counts per source board.
    by_source = dict(conn.execute(
        "SELECT source, COUNT(*) FROM jobs GROUP BY source"
    ).fetchall())

    # Pull the stored JSON skill lists, then close the connection.
    skill_rows = conn.execute(
        "SELECT tech_skills FROM jobs WHERE tech_skills != '[]'"
    ).fetchall()
    conn.close()

    all_skills = []
    for (raw,) in skill_rows:
        try:
            all_skills.extend(json.loads(raw))
        except Exception:
            pass
    skill_counts = Counter(all_skills).most_common(20)

    print("=== Remote Job Market Report ===")
    print(f"\nListings by source: {by_source}")

    print("\nTop paying categories (avg salary_min):")
    for cat, avg_min, avg_max, n in salary_jobs[:8]:
        print(f"  {cat:<30} ${avg_min:>8,.0f} - ${avg_max:>8,.0f} ({n} jobs with salary data)")

    print("\nMost in-demand skills:")
    for skill, count in skill_counts[:15]:
        print(f"  {skill:<25} {count} jobs")

    salary_by_category = [
        {"category": category, "avg_min": avg_min, "avg_max": avg_max, "count": count}
        for category, avg_min, avg_max, count in salary_jobs
    ]
    return {
        "by_source": by_source,
        "salary_by_category": salary_by_category,
        "top_skills": dict(skill_counts),
    }


# Full pipeline: fetch -> dedupe -> store -> snapshot -> report
if __name__ == "__main__":
    # Residential proxy credentials — replace before running.
    PROXY = "http://YOUR_USER:[email protected]:9000"
    DB_PATH = "remote_jobs.db"

    conn = init_db(DB_PATH)

    # Collect from all sources (RemoteOK, WeWorkRemotely, Remote.co, Remotive)
    all_jobs = fetch_all_sources(proxy_url=PROXY)

    # Drop cross-board reposts via fuzzy title+company matching
    unique_jobs = deduplicate(all_jobs)

    # Store; INSERT OR IGNORE keyed on the jobs.url UNIQUE column
    inserted = insert_jobs(conn, unique_jobs)
    print(f"\nInserted {inserted} new jobs")

    # Record daily snapshot for trend tracking
    record_daily_snapshot(conn, new_count=inserted)
    conn.close()

    # Report
    generate_market_report(DB_PATH)
RemoteOK's public API is explicitly intended for external use. WeWorkRemotely and Remote.co don't prohibit crawling in their robots.txt for most paths, but their terms restrict using the data commercially. Under hiQ v. LinkedIn, scraping publicly available data generally doesn't violate the CFAA, but breach of contract claims under platform ToS remain a real risk — keep your request volumes reasonable and don't republish listings verbatim.

For production job aggregators that serve end users, reach out to the platforms directly. Most remote job boards have partnership or data licensing arrangements, and the cost is often lower than you'd expect given the SEO value they get from the distribution.

The salary intelligence data is the most valuable output — aggregated anonymized salary ranges by category and skill level are genuinely hard to find and extremely useful for both job seekers and companies setting compensation benchmarks.