How to Scrape the Wayback Machine in 2026: CDX API, Snapshots & Historical Data
The Wayback Machine has archived over 835 billion web pages since 1996. That's not just internet history — it's a goldmine for competitive intelligence. Want to track how a competitor changed their pricing over the past three years? See when a company pivoted their messaging? Find content that's been removed from the live web? The Wayback Machine has it.
Unlike most scraping targets, archive.org actively encourages programmatic access. They provide a proper API (the CDX API), public datasets, and have a mission of open access. This makes the Wayback Machine one of the most scraper-friendly data sources you'll ever work with — but there are still rate limits and techniques worth knowing.
What Can You Extract?
The Wayback Machine gives you access to:
- URL history — every captured snapshot of a URL with timestamps
- Full page snapshots — the complete HTML as it appeared at capture time
- Capture metadata — HTTP status codes, MIME types, content digests
- Domain-wide captures — all archived URLs under a domain
- Bulk download — WARC files for large-scale archival data
- Changes over time — diff between snapshots to track modifications
Rate Limits and Access Policies
Archive.org is nonprofit and runs on donations. Be respectful:
- CDX API — No hard rate limit is published, but more than ~15 requests per minute per IP tends to trigger 503 or 429 responses. Their documentation asks for 1 request per second.
- Snapshot retrieval — Fetching archived pages through web.archive.org/web/ has similar soft limits. Burst traffic gets throttled.
- Bulk access — For large-scale research, archive.org provides WARC files and the IA Scrape API. Use those instead of hitting the CDX API millions of times.
- User-Agent — Archive.org asks that you identify your bot. Set a descriptive User-Agent with contact info.
- Robots.txt — The Wayback Machine respects the original site's robots.txt at capture time. Some archived pages may be excluded retroactively if the site owner requests it.
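The points above boil down to two habits: identify yourself and pace your requests. One way to bake both into your client is a small session wrapper that enforces a minimum interval between requests; this is a minimal sketch, not an official archive.org client.

```python
import time
import requests

class PoliteSession:
    """requests.Session wrapper that always sends an identifying
    User-Agent and enforces a minimum delay between requests."""

    def __init__(self, user_agent: str, min_interval: float = 1.0):
        self.session = requests.Session()
        self.session.headers["User-Agent"] = user_agent
        self.min_interval = min_interval
        self._last_request = 0.0

    def get(self, url: str, **kwargs) -> requests.Response:
        # Sleep just long enough to keep the per-IP rate at ~1/min_interval
        wait = self.min_interval - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()
        return self.session.get(url, timeout=kwargs.pop("timeout", 30), **kwargs)
```

Every `get()` through this object automatically respects the 1 request/second guideline, so individual scraping functions don't need to remember to sleep.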
The CDX API: Your Primary Tool
The CDX API (named for the CDX index file format used by web archives) returns structured metadata about every snapshot of a URL. This is where you start.
pip install requests beautifulsoup4
Basic CDX Query
import requests
import time
from datetime import datetime
CDX_URL = "https://web.archive.org/cdx/search/cdx"
HEADERS = {
"User-Agent": "PriceHistoryBot/1.0 (research; [email protected])",
}
def get_snapshots(url: str, from_date: str = None, to_date: str = None, limit: int = 1000) -> list:
"""Query the CDX API for snapshots of a URL.
Dates in YYYYMMDD format. Returns list of snapshot dicts.
"""
params = {
"url": url,
"output": "json",
"fl": "timestamp,original,statuscode,mimetype,digest",
"limit": limit,
}
if from_date:
params["from"] = from_date
if to_date:
params["to"] = to_date
resp = requests.get(CDX_URL, params=params, headers=HEADERS, timeout=30)
resp.raise_for_status()
rows = resp.json()
if len(rows) < 2:
return []
# First row is the header
keys = rows[0]
return [dict(zip(keys, row)) for row in rows[1:]]
# Example: all snapshots of a pricing page in 2025
snapshots = get_snapshots(
"https://example.com/pricing",
from_date="20250101",
to_date="20251231",
)
print(f"Found {len(snapshots)} snapshots")
for s in snapshots[:5]:
ts = datetime.strptime(s["timestamp"], "%Y%m%d%H%M%S")
print(f" {ts.isoformat()} — HTTP {s['statuscode']}")
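The basic query above assumes every request succeeds. In practice the CDX endpoint answers bursts with 429 or 503, so a retry wrapper with exponential backoff is worth having. This is a sketch; the endpoint and headers mirror the constants defined earlier, and the retry counts are arbitrary defaults.

```python
import time
import requests

CDX_URL = "https://web.archive.org/cdx/search/cdx"
HEADERS = {"User-Agent": "PriceHistoryBot/1.0 (research; [email protected])"}

def backoff_schedule(retries: int, base: float = 2.0, cap: float = 60.0) -> list:
    """Delays to sleep between successive retries: base, 2*base, ... capped."""
    delays, d = [], base
    for _ in range(retries):
        delays.append(d)
        d = min(d * 2, cap)
    return delays

def cdx_get_with_backoff(params: dict, max_retries: int = 5) -> list:
    """GET the CDX endpoint, sleeping and retrying on 429/503 responses."""
    for delay in backoff_schedule(max_retries):
        resp = requests.get(CDX_URL, params=params, headers=HEADERS, timeout=30)
        if resp.status_code in (429, 503):
            time.sleep(delay)  # throttled: back off, then try again
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("CDX API still throttling after retries")
```

Capping the delay at 60 seconds keeps a long outage from stalling a batch job indefinitely while still giving the server room to recover.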
Domain-Wide URL Discovery
Find every URL ever archived under a domain:
def discover_urls(domain: str, match_type: str = "domain", limit: int = 10000) -> list:
"""Find all archived URLs for a domain.
match_type: 'exact', 'prefix', 'host', or 'domain'
"""
params = {
"url": domain,
"output": "json",
"fl": "original",
"matchType": match_type,
"collapse": "urlkey", # deduplicate by URL
"limit": limit,
}
resp = requests.get(CDX_URL, params=params, headers=HEADERS, timeout=60)
resp.raise_for_status()
rows = resp.json()
if len(rows) < 2:
return []
return [row[0] for row in rows[1:]]
# Find all archived pages of a competitor's site
urls = discover_urls("competitor.com")
print(f"Found {len(urls)} unique URLs archived")
# Filter to pricing-related pages
pricing_urls = [u for u in urls if "pric" in u.lower() or "plan" in u.lower()]
print(f" {len(pricing_urls)} pricing-related pages")
Downloading Snapshots
Once you have timestamps from the CDX API, fetch the actual archived page:
from bs4 import BeautifulSoup
def fetch_snapshot(url: str, timestamp: str) -> str | None:
"""Download an archived snapshot. Returns raw HTML, or None on failure."""
archive_url = f"https://web.archive.org/web/{timestamp}id_/{url}"
# 'id_' modifier returns the original page without Wayback toolbar
resp = requests.get(archive_url, headers=HEADERS, timeout=30)
if resp.status_code == 200:
return resp.text
return None
def extract_prices_from_snapshot(html: str) -> list:
"""Extract price-like strings from archived HTML."""
soup = BeautifulSoup(html, "html.parser")
prices = []
import re
price_pattern = re.compile(r'\$[\d,]+(?:\.\d{2})?(?:/\w+)?')
for el in soup.find_all(string=price_pattern):
matches = price_pattern.findall(el)
prices.extend(matches)
return list(set(prices))
Building a Historical Price Tracker
The real power of the Wayback Machine is tracking changes over time. Here's a complete price history tracker:
import json
import csv
def track_price_history(pricing_url: str, year_start: int = 2020, year_end: int = 2026) -> list:
"""Build a price history from Wayback Machine snapshots."""
history = []
snapshots = get_snapshots(
pricing_url,
from_date=f"{year_start}0101",
to_date=f"{year_end}1231",
)
# Sample one snapshot per month to avoid hammering the API
seen_months = set()
filtered = []
for s in snapshots:
month_key = s["timestamp"][:6] # YYYYMM
if month_key not in seen_months and s["statuscode"] == "200":
seen_months.add(month_key)
filtered.append(s)
print(f"Checking {len(filtered)} monthly snapshots...")
for snap in filtered:
html = fetch_snapshot(pricing_url, snap["timestamp"])
if not html:
continue
prices = extract_prices_from_snapshot(html)
ts = datetime.strptime(snap["timestamp"], "%Y%m%d%H%M%S")
history.append({
"date": ts.strftime("%Y-%m-%d"),
"timestamp": snap["timestamp"],
"prices_found": prices,
})
# Respect rate limits — archive.org asks for 1 req/sec
time.sleep(1.5)
return history
# Track Notion's pricing over the years
history = track_price_history("https://www.notion.so/pricing")
for entry in history:
print(f"{entry['date']}: {entry['prices_found']}")
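Once `track_price_history` returns, the interesting question is when the set of prices actually changed. A pure-Python pass over the history surfaces those months; the sample history below is made-up data for illustration.

```python
def price_change_events(history: list) -> list:
    """Given [{'date': ..., 'prices_found': [...]}, ...] in chronological
    order, return entries where the price set differs from the prior one."""
    events, prev = [], None
    for entry in history:
        current = set(entry["prices_found"])
        if prev is not None and current != prev:
            events.append({
                "date": entry["date"],
                "added": sorted(current - prev),
                "removed": sorted(prev - current),
            })
        prev = current
    return events

# Illustrative (fabricated) history: prices changed in March
history = [
    {"date": "2025-01-15", "prices_found": ["$8/mo", "$15/mo"]},
    {"date": "2025-02-14", "prices_found": ["$8/mo", "$15/mo"]},
    {"date": "2025-03-12", "prices_found": ["$10/mo", "$18/mo"]},
]
events = price_change_events(history)
```

Comparing sets rather than lists makes the check order-insensitive, which matters because the regex extractor returns prices in page order and that order can shift between redesigns even when the prices themselves don't.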
Detecting Content Changes Across Snapshots
Beyond prices, you can detect any change on a page over time — messaging pivots, feature additions, removed pages:
import hashlib
from difflib import unified_diff
def get_text_fingerprint(html: str) -> tuple[str, str]:
"""Extract clean text from HTML; return (md5 hash, text) for change detection."""
soup = BeautifulSoup(html, "html.parser")
# Remove scripts, styles, and navigation
for tag in soup(["script", "style", "nav", "header", "footer"]):
tag.decompose()
text = soup.get_text(separator=" ", strip=True)
# Normalize whitespace
import re
text = re.sub(r"\s+", " ", text).strip()
return hashlib.md5(text.encode()).hexdigest(), text
def detect_changes(url: str, from_date: str = "20230101", to_date: str = "20261231") -> list:
"""Detect content changes across snapshots of a URL."""
snapshots = get_snapshots(url, from_date=from_date, to_date=to_date)
seen_months = set()
monthly_snaps = []
for s in snapshots:
month_key = s["timestamp"][:6]
if month_key not in seen_months and s["statuscode"] == "200":
seen_months.add(month_key)
monthly_snaps.append(s)
changes = []
prev_hash = None
prev_text = None
for snap in monthly_snaps:
html = fetch_snapshot(url, snap["timestamp"])
if not html:
time.sleep(1)
continue
curr_hash, curr_text = get_text_fingerprint(html)
ts = datetime.strptime(snap["timestamp"], "%Y%m%d%H%M%S")
if prev_hash and curr_hash != prev_hash:
# Content changed — compute diff
prev_lines = prev_text.split(". ")
curr_lines = curr_text.split(". ")
diff = list(unified_diff(prev_lines, curr_lines, lineterm="", n=0))
changes.append({
"date": ts.strftime("%Y-%m-%d"),
"timestamp": snap["timestamp"],
"changed": True,
"diff_lines": len(diff),
"sample_diff": diff[:5] if diff else [],
})
else:
changes.append({
"date": ts.strftime("%Y-%m-%d"),
"timestamp": snap["timestamp"],
"changed": False,
"diff_lines": 0,
})
prev_hash = curr_hash
prev_text = curr_text
time.sleep(1.5)
return changes
# Track a competitor's homepage for content changes
changes = detect_changes("https://competitor.com/about")
changed_dates = [c["date"] for c in changes if c["changed"]]
print(f"Content changed on: {changed_dates}")
Handling Rate Limits at Scale
Even with respectful delays, large-scale Wayback Machine research can hit rate limits. If you're tracking hundreds of domains or downloading thousands of snapshots, you'll need to distribute requests across IPs.
A residential proxy service like ThorData helps here — not because archive.org blocks aggressively, but because distributing requests across multiple IPs lets you maintain a respectful per-IP rate while increasing your total throughput. Archive.org monitors per-IP request rates, so rotating through residential IPs means each individual IP stays well under the 1 request/second guideline.
def fetch_snapshot_with_proxy(url: str, timestamp: str, proxy: str = None) -> str | None:
"""Download a snapshot using an optional proxy. Returns HTML, or None on failure."""
archive_url = f"https://web.archive.org/web/{timestamp}id_/{url}"
proxies = {"http": proxy, "https": proxy} if proxy else None
resp = requests.get(archive_url, headers=HEADERS, proxies=proxies, timeout=30)
if resp.status_code == 200:
return resp.text
elif resp.status_code == 429:
print("Rate limited — waiting 30 seconds")
time.sleep(30)
return None
return None
# Batch process with proxy rotation
PROXY = "http://USER:[email protected]:9000"
for snap in filtered:
html = fetch_snapshot_with_proxy(pricing_url, snap["timestamp"], proxy=PROXY)
time.sleep(1) # Still be respectful even with proxies
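Using a single proxy just moves your traffic to a different IP. To actually raise throughput while keeping each IP polite, rotate through a pool. The hostnames below are hypothetical placeholders; substitute your provider's real gateway endpoints.

```python
from itertools import cycle

# Hypothetical proxy pool — replace with your provider's real endpoints
PROXY_POOL = [
    "http://USER:PASS@proxy1.example.com:9000",
    "http://USER:PASS@proxy2.example.com:9000",
    "http://USER:PASS@proxy3.example.com:9000",
]

proxy_cycle = cycle(PROXY_POOL)

def next_proxy() -> str:
    """Round-robin through the pool so each IP carries 1/N of the traffic."""
    return next(proxy_cycle)

# With 3 proxies each held to 1 req/sec, aggregate throughput is ~3 req/sec
# while every individual IP stays at the polite per-IP rate.
assigned = [next_proxy() for _ in range(6)]
```

Each call to `next_proxy()` can then be passed straight into `fetch_snapshot_with_proxy` as its `proxy` argument.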
Querying Multiple Competitors at Scale
For competitive intelligence across a whole market vertical, process many domains in sequence:
import sqlite3
def init_competitive_db(db_path: str = "competitive_intel.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS snapshots_meta (
url TEXT,
timestamp TEXT,
statuscode TEXT,
digest TEXT,
PRIMARY KEY (url, timestamp)
);
CREATE TABLE IF NOT EXISTS price_extractions (
url TEXT,
snapshot_date TEXT,
prices TEXT,
raw_text_hash TEXT,
fetched_at TEXT DEFAULT (datetime('now')),
PRIMARY KEY (url, snapshot_date)
);
CREATE TABLE IF NOT EXISTS content_changes (
url TEXT,
change_date TEXT,
diff_lines INTEGER,
previous_hash TEXT,
current_hash TEXT,
PRIMARY KEY (url, change_date)
);
""")
conn.commit()
return conn
def research_competitor(
domain: str,
pages_of_interest: list,
db_path: str = "competitive_intel.db",
proxy: str = None,
years_back: int = 3,
) -> dict:
"""Pull historical data for a competitor domain."""
conn = init_competitive_db(db_path)
results = {"domain": domain, "pages": {}}
import datetime as dt
start_year = dt.date.today().year - years_back
for page_path in pages_of_interest:
full_url = f"https://{domain}{page_path}"
print(f"\nResearching: {full_url}")
snapshots = get_snapshots(
full_url,
from_date=f"{start_year}0101",
)
# Store metadata
for snap in snapshots:
conn.execute(
"INSERT OR IGNORE INTO snapshots_meta VALUES (?,?,?,?)",
(full_url, snap["timestamp"], snap.get("statuscode"), snap.get("digest")),
)
conn.commit()
# Sample monthly and extract prices
seen_months = set()
price_history = []
for snap in snapshots:
month_key = snap["timestamp"][:6]
if month_key not in seen_months and snap.get("statuscode") == "200":
seen_months.add(month_key)
html = fetch_snapshot_with_proxy(full_url, snap["timestamp"], proxy=proxy)
if html:
prices = extract_prices_from_snapshot(html)
snap_date = snap["timestamp"][:8]
conn.execute(
"INSERT OR REPLACE INTO price_extractions (url, snapshot_date, prices) VALUES (?,?,?)",
(full_url, snap_date, json.dumps(prices)),
)
price_history.append({"date": snap_date, "prices": prices})
time.sleep(1.5)
results["pages"][page_path] = {
"total_snapshots": len(snapshots),
"monthly_samples": len(seen_months),
"price_history": price_history,
}
conn.close()
return results
# Research top 5 competitors
COMPETITORS = [
("notion.so", ["/pricing", "/about"]),
("airtable.com", ["/pricing"]),
("coda.io", ["/pricing"]),
]
PROXY = "http://USER:[email protected]:9000"
for domain, pages in COMPETITORS:
result = research_competitor(domain, pages, proxy=PROXY)
for page, data in result["pages"].items():
print(f"{domain}{page}: {data['monthly_samples']} snapshots")
time.sleep(5)
Saving URL History to Archive.org
You can also submit URLs for archiving — useful for preserving evidence or tracking competitors going forward:
def save_to_wayback(url: str) -> str:
"""Submit a URL for archiving. Returns the archived URL."""
save_url = f"https://web.archive.org/save/{url}"
resp = requests.get(save_url, headers=HEADERS, timeout=60)
if resp.status_code == 200:
# The archived URL is in the Content-Location header
archived = resp.headers.get("Content-Location", "")
if archived:
return f"https://web.archive.org{archived}"
return None
# Archive a competitor's current pricing page
result = save_to_wayback("https://competitor.com/pricing")
print(f"Archived at: {result}")
Storing Results
def save_history_csv(history: list, filename: str = "price_history.csv"):
if not history:
return
with open(filename, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=["date", "timestamp", "prices_found"])
writer.writeheader()
for entry in history:
entry_copy = entry.copy()
entry_copy["prices_found"] = json.dumps(entry["prices_found"])
writer.writerow(entry_copy)
print(f"Saved {len(history)} entries to {filename}")
Comparing Competitor Landing Pages Visually
Beyond text extraction, you can use the archived HTML to reconstruct what a competitor's page looked like at a specific date. This is useful for understanding design and UX evolution:
def get_page_structure_over_time(url: str, sample_years: list = None) -> list:
"""Extract structural metadata (headings, CTAs, nav links) from snapshots."""
sample_years = sample_years or [2021, 2022, 2023, 2024, 2025, 2026]
results = []
for year in sample_years:
snapshots = get_snapshots(
url,
from_date=f"{year}0601",
to_date=f"{year}0901",
limit=5,
)
# Take the first good snapshot in mid-year
target_snap = None
for snap in snapshots:
if snap.get("statuscode") == "200":
target_snap = snap
break
if not target_snap:
continue
html = fetch_snapshot(url, target_snap["timestamp"])
if not html:
time.sleep(1)
continue
soup = BeautifulSoup(html, "html.parser")
# Extract headings hierarchy
headings = []
for tag in ["h1", "h2", "h3"]:
for el in soup.find_all(tag)[:5]:
text = el.get_text(strip=True)
if text and len(text) < 200:
headings.append({"tag": tag, "text": text})
# Extract navigation links (top-level)
nav_links = []
for nav in soup.find_all("nav")[:2]:
for a in nav.find_all("a")[:10]:
text = a.get_text(strip=True)
if text and len(text) < 50:
nav_links.append(text)
# Find CTA buttons (common patterns)
cta_patterns = ["get started", "sign up", "try free", "buy now", "learn more", "start free"]
ctas = []
for a in soup.find_all(["a", "button"]):
text = a.get_text(strip=True).lower()
if any(p in text for p in cta_patterns):
ctas.append(a.get_text(strip=True))
results.append({
"year": year,
"timestamp": target_snap["timestamp"],
"headings": headings[:5],
"nav_links": nav_links[:8],
"ctas": list(set(ctas))[:5],
})
time.sleep(2)
return results
# Track how a competitor's homepage evolved
structure_history = get_page_structure_over_time("https://notion.so")
for year_data in structure_history:
print(f"\n{year_data['year']}:")
for h in year_data["headings"][:2]:
print(f" {h['tag'].upper()}: {h['text'][:60]}")
if year_data["ctas"]:
print(f" CTAs: {', '.join(year_data['ctas'][:3])}")
Bulk CDX Queries with Pagination
When a domain has tens of thousands of archived URLs, you need to paginate the CDX API itself:
def get_all_snapshots_paginated(
url: str,
from_date: str = None,
to_date: str = None,
page_size: int = 10000,
) -> list:
"""Get all snapshots of a URL using CDX API pagination."""
all_snapshots = []
page = 0
while True:
params = {
"url": url,
"output": "json",
"fl": "timestamp,statuscode,mimetype,digest",
"limit": page_size,
"offset": page * page_size,
}
if from_date:
params["from"] = from_date
if to_date:
params["to"] = to_date
resp = requests.get(CDX_URL, params=params, headers=HEADERS, timeout=60)
if resp.status_code == 404 or not resp.text.strip():
break
rows = resp.json()
if len(rows) <= 1: # Only header row or empty
break
keys = rows[0]
batch = [dict(zip(keys, row)) for row in rows[1:]]
all_snapshots.extend(batch)
print(f" Page {page}: +{len(batch)} snapshots ({len(all_snapshots)} total)")
if len(batch) < page_size:
break # Last page
page += 1
time.sleep(1)
return all_snapshots
# Get complete snapshot history for a domain's pricing page
all_snaps = get_all_snapshots_paginated(
"https://slack.com/intl/en-us/pricing",
from_date="20200101",
)
print(f"Total snapshots: {len(all_snaps)}")
# Analyze capture frequency by year
from collections import Counter
years = [s["timestamp"][:4] for s in all_snaps]
year_counts = Counter(years)
for year, count in sorted(year_counts.items()):
print(f" {year}: {count} captures")
Monitoring SEO and Backlink Changes
The Wayback Machine captures not just content but also on-page SEO signals. Track title tags, meta descriptions, and internal link structures over time:
import re
def extract_seo_signals(html: str) -> dict:
"""Extract on-page SEO signals from archived HTML."""
soup = BeautifulSoup(html, "html.parser")
# Title tag
title = soup.find("title")
title_text = title.get_text(strip=True) if title else None
# Meta description
meta_desc = soup.find("meta", attrs={"name": re.compile("description", re.I)})
meta_description = meta_desc.get("content", "") if meta_desc else None
# H1 tags
h1_tags = [h.get_text(strip=True) for h in soup.find_all("h1")[:3]]
# Canonical URL
canonical = soup.find("link", attrs={"rel": "canonical"})
canonical_url = canonical.get("href") if canonical else None
# Structured data (JSON-LD)
schema_types = []
for script in soup.find_all("script", type="application/ld+json"):
try:
import json
data = json.loads(script.string)
if isinstance(data, dict):
schema_types.append(data.get("@type", ""))
elif isinstance(data, list):
schema_types.extend(d.get("@type", "") for d in data if isinstance(d, dict))
except Exception:
pass
# Word count (rough)
for tag in soup(["script", "style"]):
tag.decompose()
word_count = len(soup.get_text().split())
return {
"title": title_text,
"title_length": len(title_text) if title_text else 0,
"meta_description": meta_description,
"meta_desc_length": len(meta_description) if meta_description else 0,
"h1_tags": h1_tags,
"canonical": canonical_url,
"schema_types": [t for t in schema_types if t],
"word_count": word_count,
}
def track_seo_evolution(url: str, years_back: int = 3) -> list:
"""Track SEO signal changes for a URL over time."""
import datetime as dt
start_year = dt.date.today().year - years_back
snapshots = get_snapshots(
url,
from_date=f"{start_year}0101",
)
# One snapshot per quarter
seen_quarters = set()
quarterly_snaps = []
for snap in snapshots:
quarter = snap["timestamp"][:4] + "Q" + str((int(snap["timestamp"][4:6]) - 1) // 3 + 1)
if quarter not in seen_quarters and snap.get("statuscode") == "200":
seen_quarters.add(quarter)
quarterly_snaps.append(snap)
results = []
for snap in quarterly_snaps:
html = fetch_snapshot(url, snap["timestamp"])
if not html:
time.sleep(1)
continue
seo = extract_seo_signals(html)
ts = datetime.strptime(snap["timestamp"], "%Y%m%d%H%M%S")
results.append({
"date": ts.strftime("%Y-%m"),
"timestamp": snap["timestamp"],
**seo,
})
time.sleep(1.5)
return results
# Track how a competitor's homepage SEO evolved
seo_history = track_seo_evolution("https://competitor.com")
for entry in seo_history:
title = (entry["title"] or "")[:40]
print(f"{entry['date']}: \"{title}\" ({entry['word_count']} words)")
Finding Removed Content and Dead Pages
The CDX API can show you pages that existed at one point but now return 404 — useful for finding content a competitor has quietly deleted:
def find_removed_pages(domain: str, proxy: str = None) -> list:
"""
Find pages that were once live but now return 404 or have been removed.
Compares Wayback Machine records with current live responses.
"""
print(f"Discovering archived URLs for {domain}...")
archived_urls = discover_urls(domain, limit=500)
print(f"Found {len(archived_urls)} archived URLs. Checking live status...")
removed = []
still_live = []
proxies = {"http": proxy, "https": proxy} if proxy else None
for url in archived_urls[:200]: # Sample first 200
try:
resp = requests.head(
url,
headers=HEADERS,
proxies=proxies,
timeout=10,
allow_redirects=True,
)
status = resp.status_code
if status in (404, 410):
# Get the most recent snapshots for context (a negative limit returns the last N)
snaps = get_snapshots(url, limit=-3)
last_snap = snaps[-1] if snaps else None
removed.append({
"url": url,
"current_status": status,
"last_archived": last_snap["timestamp"][:8] if last_snap else "unknown",
})
elif status < 400:
still_live.append(url)
except requests.exceptions.ConnectionError:
removed.append({"url": url, "current_status": "connection_error", "last_archived": "unknown"})
except Exception:
pass
time.sleep(0.5)
print(f"\nResults: {len(still_live)} live, {len(removed)} removed/dead")
return removed
# Find what a competitor quietly deleted
removed = find_removed_pages("competitor.com")
for page in removed[:20]:
print(f" [{page['current_status']}] {page['url']}")
print(f" Last archived: {page['last_archived']}")
Legal Considerations
The Wayback Machine is explicitly designed for public access. Archive.org operates under a broad interpretation of fair use and the library exemption of copyright law. Using the CDX API for research, competitive intelligence, or content verification is well within their intended use. The main restriction: don't redistribute archived content at scale or use it to rebuild sites that the original owners have taken down.
Key Takeaways
- The CDX API is your starting point — it returns structured metadata for every snapshot of any URL, no authentication needed.
- Use the id_ modifier in snapshot URLs to get the original page without the Wayback toolbar injected.
- Sample snapshots by month or week — downloading every snapshot wastes bandwidth and strains a nonprofit's infrastructure.
- For large-scale research across hundreds of domains, distribute requests through ThorData's residential proxies to stay under per-IP rate limits while maintaining throughput.
- The collapse=urlkey CDX parameter deduplicates results when discovering all URLs on a domain — without it you get thousands of nearly-identical entries.
- Content change detection (hashing page text across monthly snapshots) is more useful than raw snapshot counts for competitive intelligence.
- Be a good citizen: set a descriptive User-Agent, keep to 1 request per second per IP, and donate to archive.org if you're extracting real business value from their work.