
How to Scrape the Wayback Machine in 2026: CDX API, Snapshots & Historical Data

The Wayback Machine has archived over 835 billion web pages since 1996. That's not just internet history — it's a goldmine for competitive intelligence. Want to track how a competitor changed their pricing over the past three years? See when a company pivoted their messaging? Find content that's been removed from the live web? The Wayback Machine has it.

Unlike most scraping targets, archive.org actively encourages programmatic access. They provide a proper API (the CDX API), public datasets, and have a mission of open access. This makes the Wayback Machine one of the most scraper-friendly data sources you'll ever work with — but there are still rate limits and techniques worth knowing.
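Before writing any CDX code, the separate Availability API (archive.org's documented quick-lookup endpoint at archive.org/wayback/available) answers the simplest question: does a snapshot of a URL exist at all? A minimal sketch:

```python
import requests

AVAILABILITY_URL = "https://archive.org/wayback/available"

def availability_params(url: str, timestamp: str = None) -> dict:
    """Build query params; timestamp is YYYYMMDD, nearest capture wins."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return params

def closest_snapshot(url: str, timestamp: str = None) -> dict:
    """Return the closest archived capture record for `url` (may be empty)."""
    resp = requests.get(AVAILABILITY_URL,
                        params=availability_params(url, timestamp), timeout=30)
    resp.raise_for_status()
    return resp.json().get("archived_snapshots", {}).get("closest") or {}
```

When a capture exists, the returned record includes `url`, `timestamp`, and `status` fields pointing at the nearest snapshot.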

What Can You Extract?

The Wayback Machine gives you access to:

  - Snapshot metadata: timestamp, HTTP status, MIME type, and content digest for every capture (via the CDX API)
  - Full archived HTML for any snapshot, with or without the Wayback toolbar
  - Domain-wide URL inventories: every URL ever archived under a domain, including pages since removed from the live web
  - Derived history: pricing changes, messaging pivots, and SEO signal evolution over time

Rate Limits and Access Policies

Archive.org is nonprofit and runs on donations. Be respectful:

  1. CDX API — No published hard rate limit, but sustained bursts above roughly 15 requests per minute per IP tend to trigger 503 or 429 responses. Their documentation asks for no more than 1 request per second.
  2. Snapshot retrieval — Fetching archived pages through web.archive.org/web/ has similar soft limits. Burst traffic gets throttled.
  3. Bulk access — For large-scale research, archive.org provides WARC files and the IA Scrape API. Use those instead of hitting the CDX API millions of times.
  4. User-Agent — Archive.org asks that you identify your bot. Set a descriptive User-Agent with contact info.
  5. Robots.txt — The Wayback Machine respects the original site's robots.txt at capture time. Some archived pages may be excluded retroactively if the site owner requests it.
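Those guidelines translate directly into code. Here is a minimal polite-client sketch using requests' built-in retry support (the bot name and contact address are placeholders to replace with your own):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_polite_session(contact: str = "[email protected]") -> requests.Session:
    """One shared session: descriptive User-Agent plus backoff on 429/503."""
    session = requests.Session()
    session.headers["User-Agent"] = f"WaybackResearchBot/1.0 (research; {contact})"
    retry = Retry(
        total=5,
        backoff_factor=2,             # exponential backoff between retries
        status_forcelist=[429, 503],  # the codes archive.org throttles with
        allowed_methods=["GET", "HEAD"],
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

session = make_polite_session()
```

Every request through this session identifies your bot and automatically backs off when archive.org pushes back, so you don't have to hand-roll retry loops in each function below.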

The CDX API: Your Primary Tool

The CDX ("capture index") API returns structured metadata about every snapshot of a URL. This is where you start.

pip install requests beautifulsoup4

Basic CDX Query

import requests
import time
from datetime import datetime

CDX_URL = "https://web.archive.org/cdx/search/cdx"

HEADERS = {
    "User-Agent": "PriceHistoryBot/1.0 (research; [email protected])",
}

def get_snapshots(url: str, from_date: str = None, to_date: str = None, limit: int = 1000) -> list:
    """Query the CDX API for snapshots of a URL.

    Dates in YYYYMMDD format. Returns list of snapshot dicts.
    """
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp,original,statuscode,mimetype,digest",
        "limit": limit,
    }
    if from_date:
        params["from"] = from_date
    if to_date:
        params["to"] = to_date

    resp = requests.get(CDX_URL, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()

    rows = resp.json()
    if len(rows) < 2:
        return []

    # First row is the header
    keys = rows[0]
    return [dict(zip(keys, row)) for row in rows[1:]]


# Example: all snapshots of a pricing page in 2025
snapshots = get_snapshots(
    "https://example.com/pricing",
    from_date="20250101",
    to_date="20251231",
)
print(f"Found {len(snapshots)} snapshots")
for s in snapshots[:5]:
    ts = datetime.strptime(s["timestamp"], "%Y%m%d%H%M%S")
    print(f"  {ts.isoformat()} — HTTP {s['statuscode']}")

Domain-Wide URL Discovery

Find every URL ever archived under a domain:

def discover_urls(domain: str, match_type: str = "domain", limit: int = 10000) -> list:
    """Find all archived URLs for a domain.

    match_type: 'exact', 'prefix', 'host', or 'domain'
    """
    params = {
        "url": domain,
        "output": "json",
        "fl": "original",
        "matchType": match_type,
        "collapse": "urlkey",  # deduplicate by URL
        "limit": limit,
    }

    resp = requests.get(CDX_URL, params=params, headers=HEADERS, timeout=60)
    resp.raise_for_status()
    rows = resp.json()

    if len(rows) < 2:
        return []

    return [row[0] for row in rows[1:]]


# Find all archived pages of a competitor's site
urls = discover_urls("competitor.com")
print(f"Found {len(urls)} unique URLs archived")

# Filter to pricing-related pages
pricing_urls = [u for u in urls if "pric" in u.lower() or "plan" in u.lower()]
print(f"  {len(pricing_urls)} pricing-related pages")

Downloading Snapshots

Once you have timestamps from the CDX API, fetch the actual archived page:

from bs4 import BeautifulSoup

def fetch_snapshot(url: str, timestamp: str) -> str | None:
    """Download an archived snapshot. Returns raw HTML, or None on failure."""
    # The 'id_' modifier returns the original page without the Wayback toolbar
    archive_url = f"https://web.archive.org/web/{timestamp}id_/{url}"

    resp = requests.get(archive_url, headers=HEADERS, timeout=30)
    if resp.status_code == 200:
        return resp.text
    return None


import re

PRICE_PATTERN = re.compile(r'\$[\d,]+(?:\.\d{2})?(?:/\w+)?')

def extract_prices_from_snapshot(html: str) -> list:
    """Extract price-like strings (e.g. '$29/mo', '$1,200.00') from archived HTML."""
    soup = BeautifulSoup(html, "html.parser")

    prices = []
    for el in soup.find_all(string=PRICE_PATTERN):
        prices.extend(PRICE_PATTERN.findall(el))

    return list(set(prices))
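To sanity-check the price pattern without touching the network, run it against an invented HTML fragment:

```python
import re
from bs4 import BeautifulSoup

price_pattern = re.compile(r'\$[\d,]+(?:\.\d{2})?(?:/\w+)?')

# Invented fragment standing in for an archived pricing page
sample_html = """
<div class="plan"><h3>Pro</h3><p>$12.50/mo billed annually</p></div>
<div class="plan"><h3>Team</h3><p>$1,200.00/yr</p></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
prices = sorted({m for el in soup.find_all(string=price_pattern)
                 for m in price_pattern.findall(el)})
print(prices)  # ['$1,200.00/yr', '$12.50/mo']
```

The pattern catches bare prices, thousands separators, cents, and per-period suffixes like /mo and /yr; anything more exotic (localized currencies, "from $X" ranges) would need extra patterns.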

Building a Historical Price Tracker

The real power of the Wayback Machine is tracking changes over time. Here's a complete price history tracker:

import json
import csv

def track_price_history(pricing_url: str, year_start: int = 2020, year_end: int = 2026) -> list:
    """Build a price history from Wayback Machine snapshots."""
    history = []

    snapshots = get_snapshots(
        pricing_url,
        from_date=f"{year_start}0101",
        to_date=f"{year_end}1231",
    )

    # Sample one snapshot per month to avoid hammering the API
    seen_months = set()
    filtered = []
    for s in snapshots:
        month_key = s["timestamp"][:6]  # YYYYMM
        if month_key not in seen_months and s["statuscode"] == "200":
            seen_months.add(month_key)
            filtered.append(s)

    print(f"Checking {len(filtered)} monthly snapshots...")

    for snap in filtered:
        html = fetch_snapshot(pricing_url, snap["timestamp"])
        if not html:
            continue

        prices = extract_prices_from_snapshot(html)
        ts = datetime.strptime(snap["timestamp"], "%Y%m%d%H%M%S")

        history.append({
            "date": ts.strftime("%Y-%m-%d"),
            "timestamp": snap["timestamp"],
            "prices_found": prices,
        })

        # Respect rate limits — archive.org asks for 1 req/sec
        time.sleep(1.5)

    return history


# Track Notion's pricing over the years
history = track_price_history("https://www.notion.so/pricing")
for entry in history:
    print(f"{entry['date']}: {entry['prices_found']}")

Detecting Content Changes Across Snapshots

Beyond prices, you can detect any change on a page over time — messaging pivots, feature additions, removed pages:

import hashlib
from difflib import unified_diff

import re

def get_text_fingerprint(html: str) -> tuple[str, str]:
    """Extract clean text from HTML and hash it for change detection.

    Returns (md5_hash, normalized_text).
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove scripts, styles, and navigation
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    text = soup.get_text(separator=" ", strip=True)
    # Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.md5(text.encode()).hexdigest(), text


def detect_changes(url: str, from_date: str = "20230101", to_date: str = "20261231") -> list:
    """Detect content changes across snapshots of a URL."""
    snapshots = get_snapshots(url, from_date=from_date, to_date=to_date)

    seen_months = set()
    monthly_snaps = []
    for s in snapshots:
        month_key = s["timestamp"][:6]
        if month_key not in seen_months and s["statuscode"] == "200":
            seen_months.add(month_key)
            monthly_snaps.append(s)

    changes = []
    prev_hash = None
    prev_text = None

    for snap in monthly_snaps:
        html = fetch_snapshot(url, snap["timestamp"])
        if not html:
            time.sleep(1)
            continue

        curr_hash, curr_text = get_text_fingerprint(html)
        ts = datetime.strptime(snap["timestamp"], "%Y%m%d%H%M%S")

        if prev_hash and curr_hash != prev_hash:
            # Content changed — compute diff
            prev_lines = prev_text.split(". ")
            curr_lines = curr_text.split(". ")
            diff = list(unified_diff(prev_lines, curr_lines, lineterm="", n=0))
            changes.append({
                "date": ts.strftime("%Y-%m-%d"),
                "timestamp": snap["timestamp"],
                "changed": True,
                "diff_lines": len(diff),
                "sample_diff": diff[:5] if diff else [],
            })
        else:
            changes.append({
                "date": ts.strftime("%Y-%m-%d"),
                "timestamp": snap["timestamp"],
                "changed": False,
                "diff_lines": 0,
            })

        prev_hash = curr_hash
        prev_text = curr_text
        time.sleep(1.5)

    return changes


# Track a competitor's homepage for content changes
changes = detect_changes("https://competitor.com/about")
changed_dates = [c["date"] for c in changes if c["changed"]]
print(f"Content changed on: {changed_dates}")
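A server-side shortcut can cut the download count considerably: the CDX API's `collapse=digest` parameter drops consecutive captures whose content hash is unchanged, so the server pre-filters identical snapshots before you fetch anything. A sketch of the extra params (plug them into the same `requests.get(CDX_URL, ...)` call `get_snapshots` uses):

```python
CDX_URL = "https://web.archive.org/cdx/search/cdx"

def build_changed_only_params(url: str, from_date: str = None) -> dict:
    """CDX params that keep only captures whose content digest changed."""
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp,statuscode,digest",
        # Collapses runs of consecutive captures with identical digests
        "collapse": "digest",
    }
    if from_date:
        params["from"] = from_date
    return params
```

Note the collapse only removes *consecutive* duplicates: a page that flips between two states will still produce a row for each flip, which is exactly what you want for change detection.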

Handling Rate Limits at Scale

Even with respectful delays, large-scale Wayback Machine research can hit rate limits. If you're tracking hundreds of domains or downloading thousands of snapshots, you'll need to distribute requests across IPs.

A residential proxy service like ThorData helps here — not because archive.org blocks aggressively, but because distributing requests across multiple IPs lets you maintain a respectful per-IP rate while increasing your total throughput. Archive.org monitors per-IP request rates, so rotating through residential IPs means each individual IP stays well under the 1 request/second guideline.

def fetch_snapshot_with_proxy(url: str, timestamp: str, proxy: str = None) -> str:
    """Download a snapshot using an optional proxy."""
    archive_url = f"https://web.archive.org/web/{timestamp}id_/{url}"

    proxies = {"http": proxy, "https": proxy} if proxy else None
    resp = requests.get(archive_url, headers=HEADERS, proxies=proxies, timeout=30)

    if resp.status_code == 200:
        return resp.text
    elif resp.status_code == 429:
        print("Rate limited — waiting 30 seconds")
        time.sleep(30)
        return None
    return None


# Batch process through a proxy (reuses `pricing_url` and the monthly
# `filtered` snapshot list from the price-tracker example above)
PROXY = "http://USER:[email protected]:9000"

for snap in filtered:
    html = fetch_snapshot_with_proxy(pricing_url, snap["timestamp"], proxy=PROXY)
    time.sleep(1)  # Still be respectful even with proxies
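To actually distribute load across several IPs rather than funnel everything through one, a simple round-robin pool works; the proxy URLs below are placeholders for whatever endpoints your provider hands you:

```python
from itertools import cycle

# Placeholder endpoints -- substitute your provider's real proxy URLs
PROXIES = [
    "http://USER:PASS@proxy1.example.com:9000",
    "http://USER:PASS@proxy2.example.com:9000",
    "http://USER:PASS@proxy3.example.com:9000",
]
proxy_pool = cycle(PROXIES)

def next_proxy() -> str:
    """Hand out proxies in round-robin order, wrapping around forever."""
    return next(proxy_pool)

# e.g. fetch_snapshot_with_proxy(url, ts, proxy=next_proxy()) per request
```

With N proxies and a 1.5-second delay per request, each individual IP sees a request only every 1.5 × N seconds, keeping every IP comfortably inside the per-IP guideline.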

Querying Multiple Competitors at Scale

For competitive intelligence across a whole market vertical, process many domains in sequence:

import sqlite3

def init_competitive_db(db_path: str = "competitive_intel.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS snapshots_meta (
            url TEXT,
            timestamp TEXT,
            statuscode TEXT,
            digest TEXT,
            PRIMARY KEY (url, timestamp)
        );

        CREATE TABLE IF NOT EXISTS price_extractions (
            url TEXT,
            snapshot_date TEXT,
            prices TEXT,
            raw_text_hash TEXT,
            fetched_at TEXT DEFAULT (datetime('now')),
            PRIMARY KEY (url, snapshot_date)
        );

        CREATE TABLE IF NOT EXISTS content_changes (
            url TEXT,
            change_date TEXT,
            diff_lines INTEGER,
            previous_hash TEXT,
            current_hash TEXT,
            PRIMARY KEY (url, change_date)
        );
    """)
    conn.commit()
    return conn


def research_competitor(
    domain: str,
    pages_of_interest: list,
    db_path: str = "competitive_intel.db",
    proxy: str = None,
    years_back: int = 3,
) -> dict:
    """Pull historical data for a competitor domain."""
    conn = init_competitive_db(db_path)
    results = {"domain": domain, "pages": {}}

    import datetime as dt
    start_year = dt.date.today().year - years_back

    for page_path in pages_of_interest:
        full_url = f"https://{domain}{page_path}"
        print(f"\nResearching: {full_url}")

        snapshots = get_snapshots(
            full_url,
            from_date=f"{start_year}0101",
        )

        # Store metadata
        for snap in snapshots:
            conn.execute(
                "INSERT OR IGNORE INTO snapshots_meta VALUES (?,?,?,?)",
                (full_url, snap["timestamp"], snap.get("statuscode"), snap.get("digest")),
            )
        conn.commit()

        # Sample monthly and extract prices
        seen_months = set()
        price_history = []
        for snap in snapshots:
            month_key = snap["timestamp"][:6]
            if month_key not in seen_months and snap.get("statuscode") == "200":
                seen_months.add(month_key)
                html = fetch_snapshot_with_proxy(full_url, snap["timestamp"], proxy=proxy)
                if html:
                    prices = extract_prices_from_snapshot(html)
                    snap_date = snap["timestamp"][:8]
                    conn.execute(
                        "INSERT OR REPLACE INTO price_extractions (url, snapshot_date, prices) VALUES (?,?,?)",
                        (full_url, snap_date, json.dumps(prices)),
                    )
                    price_history.append({"date": snap_date, "prices": prices})
                time.sleep(1.5)

        results["pages"][page_path] = {
            "total_snapshots": len(snapshots),
            "monthly_samples": len(seen_months),
            "price_history": price_history,
        }

    conn.close()
    return results


# Research top 5 competitors
COMPETITORS = [
    ("notion.so", ["/pricing", "/about"]),
    ("airtable.com", ["/pricing"]),
    ("coda.io", ["/pricing"]),
]

PROXY = "http://USER:[email protected]:9000"
for domain, pages in COMPETITORS:
    result = research_competitor(domain, pages, proxy=PROXY)
    for page, data in result["pages"].items():
        print(f"{domain}{page}: {data['monthly_samples']} snapshots")
    time.sleep(5)

Saving URL History to Archive.org

You can also submit URLs for archiving — useful for preserving evidence or tracking competitors going forward:

def save_to_wayback(url: str) -> str:
    """Submit a URL for archiving. Returns the archived URL."""
    save_url = f"https://web.archive.org/save/{url}"
    resp = requests.get(save_url, headers=HEADERS, timeout=60)

    if resp.status_code == 200:
        # The archived URL is in the Content-Location header
        archived = resp.headers.get("Content-Location", "")
        if archived:
            return f"https://web.archive.org{archived}"
    return None


# Archive a competitor's current pricing page
result = save_to_wayback("https://competitor.com/pricing")
print(f"Archived at: {result}")

Storing Results

def save_history_csv(history: list, filename: str = "price_history.csv"):
    if not history:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "timestamp", "prices_found"])
        writer.writeheader()
        for entry in history:
            entry_copy = entry.copy()
            entry_copy["prices_found"] = json.dumps(entry["prices_found"])
            writer.writerow(entry_copy)
    print(f"Saved {len(history)} entries to {filename}")

Comparing Competitor Landing Pages Visually

Beyond text extraction, you can use the archived HTML to reconstruct what a competitor's page looked like at a specific date. This is useful for understanding design and UX evolution:

def get_page_structure_over_time(url: str, sample_years: list = None) -> list:
    """Extract structural metadata (headings, CTAs, nav links) from snapshots."""
    sample_years = sample_years or [2021, 2022, 2023, 2024, 2025, 2026]
    results = []

    for year in sample_years:
        snapshots = get_snapshots(
            url,
            from_date=f"{year}0601",
            to_date=f"{year}0901",
            limit=5,
        )

        # Take the first good snapshot in mid-year
        target_snap = None
        for snap in snapshots:
            if snap.get("statuscode") == "200":
                target_snap = snap
                break

        if not target_snap:
            continue

        html = fetch_snapshot(url, target_snap["timestamp"])
        if not html:
            time.sleep(1)
            continue

        soup = BeautifulSoup(html, "html.parser")

        # Extract headings hierarchy
        headings = []
        for tag in ["h1", "h2", "h3"]:
            for el in soup.find_all(tag)[:5]:
                text = el.get_text(strip=True)
                if text and len(text) < 200:
                    headings.append({"tag": tag, "text": text})

        # Extract navigation links (top-level)
        nav_links = []
        for nav in soup.find_all("nav")[:2]:
            for a in nav.find_all("a")[:10]:
                text = a.get_text(strip=True)
                if text and len(text) < 50:
                    nav_links.append(text)

        # Find CTA buttons (common patterns)
        cta_patterns = ["get started", "sign up", "try free", "buy now", "learn more", "start free"]
        ctas = []
        for a in soup.find_all(["a", "button"]):
            text = a.get_text(strip=True).lower()
            if any(p in text for p in cta_patterns):
                ctas.append(a.get_text(strip=True))

        results.append({
            "year": year,
            "timestamp": target_snap["timestamp"],
            "headings": headings[:5],
            "nav_links": nav_links[:8],
            "ctas": list(set(ctas))[:5],
        })

        time.sleep(2)

    return results


# Track how a competitor's homepage evolved
structure_history = get_page_structure_over_time("https://notion.so")
for year_data in structure_history:
    print(f"\n{year_data['year']}:")
    for h in year_data["headings"][:2]:
        print(f"  {h['tag'].upper()}: {h['text'][:60]}")
    if year_data["ctas"]:
        print(f"  CTAs: {', '.join(year_data['ctas'][:3])}")

Bulk CDX Queries with Pagination

When a domain has tens of thousands of archived URLs, you need to paginate the CDX API itself:

def get_all_snapshots_paginated(
    url: str,
    from_date: str = None,
    to_date: str = None,
    page_size: int = 10000,
) -> list:
    """Get all snapshots of a URL using CDX API pagination."""
    all_snapshots = []
    page = 0

    while True:
        params = {
            "url": url,
            "output": "json",
            "fl": "timestamp,statuscode,mimetype,digest",
            "limit": page_size,
            "offset": page * page_size,
        }
        if from_date:
            params["from"] = from_date
        if to_date:
            params["to"] = to_date

        resp = requests.get(CDX_URL, params=params, headers=HEADERS, timeout=60)

        if resp.status_code == 404 or not resp.text.strip():
            break
        resp.raise_for_status()

        rows = resp.json()
        if len(rows) <= 1:  # Only header row or empty
            break

        keys = rows[0]
        batch = [dict(zip(keys, row)) for row in rows[1:]]
        all_snapshots.extend(batch)

        print(f"  Page {page}: +{len(batch)} snapshots ({len(all_snapshots)} total)")

        if len(batch) < page_size:
            break  # Last page

        page += 1
        time.sleep(1)

    return all_snapshots


# Get complete snapshot history for a domain's pricing page
all_snaps = get_all_snapshots_paginated(
    "https://slack.com/intl/en-us/pricing",
    from_date="20200101",
)
print(f"Total snapshots: {len(all_snaps)}")

# Analyze capture frequency by year
from collections import Counter
years = [s["timestamp"][:4] for s in all_snaps]
year_counts = Counter(years)
for year, count in sorted(year_counts.items()):
    print(f"  {year}: {count} captures")

Tracking SEO Evolution

The Wayback Machine captures not just content but also on-page SEO signals. Track title tags, meta descriptions, and internal link structures over time:

import re

def extract_seo_signals(html: str) -> dict:
    """Extract on-page SEO signals from archived HTML."""
    soup = BeautifulSoup(html, "html.parser")

    # Title tag
    title = soup.find("title")
    title_text = title.get_text(strip=True) if title else None

    # Meta description
    meta_desc = soup.find("meta", attrs={"name": re.compile("description", re.I)})
    meta_description = meta_desc.get("content", "") if meta_desc else None

    # H1 tags
    h1_tags = [h.get_text(strip=True) for h in soup.find_all("h1")[:3]]

    # Canonical URL
    canonical = soup.find("link", attrs={"rel": "canonical"})
    canonical_url = canonical.get("href") if canonical else None

    # Structured data (JSON-LD)
    import json
    schema_types = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(script.string or "")
            if isinstance(data, dict):
                schema_types.append(data.get("@type", ""))
            elif isinstance(data, list):
                schema_types.extend(d.get("@type", "") for d in data if isinstance(d, dict))
        except (json.JSONDecodeError, TypeError):
            pass

    # Word count (rough)
    for tag in soup(["script", "style"]):
        tag.decompose()
    word_count = len(soup.get_text().split())

    return {
        "title": title_text,
        "title_length": len(title_text) if title_text else 0,
        "meta_description": meta_description,
        "meta_desc_length": len(meta_description) if meta_description else 0,
        "h1_tags": h1_tags,
        "canonical": canonical_url,
        "schema_types": [t for t in schema_types if t],
        "word_count": word_count,
    }


def track_seo_evolution(url: str, years_back: int = 3) -> list:
    """Track SEO signal changes for a URL over time."""
    import datetime as dt
    start_year = dt.date.today().year - years_back

    snapshots = get_snapshots(
        url,
        from_date=f"{start_year}0101",
    )

    # One snapshot per quarter
    seen_quarters = set()
    quarterly_snaps = []
    for snap in snapshots:
        quarter = snap["timestamp"][:4] + "Q" + str((int(snap["timestamp"][4:6]) - 1) // 3 + 1)
        if quarter not in seen_quarters and snap.get("statuscode") == "200":
            seen_quarters.add(quarter)
            quarterly_snaps.append(snap)

    results = []
    for snap in quarterly_snaps:
        html = fetch_snapshot(url, snap["timestamp"])
        if not html:
            time.sleep(1)
            continue

        seo = extract_seo_signals(html)
        ts = datetime.strptime(snap["timestamp"], "%Y%m%d%H%M%S")
        results.append({
            "date": ts.strftime("%Y-%m"),
            "timestamp": snap["timestamp"],
            **seo,
        })
        time.sleep(1.5)

    return results


# Track how a competitor's homepage SEO evolved
seo_history = track_seo_evolution("https://competitor.com")
for entry in seo_history:
    title = (entry["title"] or "")[:40]
    print(f"{entry['date']}: \"{title}\" ({entry['word_count']} words)")

Finding Removed Content and Dead Pages

The CDX API can show you pages that existed at one point but now return 404 — useful for finding content a competitor has quietly deleted:

def find_removed_pages(domain: str, proxy: str = None) -> list:
    """
    Find pages that were once live but now return 404 or have been removed.
    Compares Wayback Machine records with current live responses.
    """
    print(f"Discovering archived URLs for {domain}...")
    archived_urls = discover_urls(domain, limit=500)

    print(f"Found {len(archived_urls)} archived URLs. Checking live status...")
    removed = []
    still_live = []

    proxies = {"http": proxy, "https": proxy} if proxy else None

    for url in archived_urls[:200]:  # Sample first 200
        try:
            resp = requests.head(
                url,
                headers=HEADERS,
                proxies=proxies,
                timeout=10,
                allow_redirects=True,
            )
            status = resp.status_code

            if status in (404, 410):
                # Get recent snapshots for context (a negative limit asks
                # the CDX server for the last N captures instead of the first)
                snaps = get_snapshots(url, limit=-3)
                last_snap = snaps[-1] if snaps else None

                removed.append({
                    "url": url,
                    "current_status": status,
                    "last_archived": last_snap["timestamp"][:8] if last_snap else "unknown",
                })
            elif status < 400:
                still_live.append(url)

        except requests.exceptions.ConnectionError:
            removed.append({"url": url, "current_status": "connection_error", "last_archived": "unknown"})
        except Exception:
            pass

        time.sleep(0.5)

    print(f"\nResults: {len(still_live)} live, {len(removed)} removed/dead")
    return removed


# Find what a competitor quietly deleted
removed = find_removed_pages("competitor.com")
for page in removed[:20]:
    print(f"  [{page['current_status']}] {page['url']}")
    print(f"    Last archived: {page['last_archived']}")

Legal Considerations

The Wayback Machine is explicitly designed for public access. Archive.org operates under a broad interpretation of fair use and the library exemption in copyright law. Using the CDX API for research, competitive intelligence, or content verification is well within their intended use. The main restriction: don't redistribute archived content at scale or use it to rebuild sites that the original owners have taken down.

Key Takeaways