How to Scrape Public Court Records: PACER, CourtListener & State Courts (2026)
Public court records are one of the most underutilized data sources in existence. They are comprehensive, authoritative, and in the United States specifically, they are public by constitutional principle. The concept of open courts — the idea that judicial proceedings must be accessible to citizens — has deep roots in American jurisprudence going back to the founding era. In practice, this means that an enormous quantity of high-value structured data is legally accessible to anyone who knows how to get it.
The tricky part is not legality. It is the fragmented, inconsistent, and often technically antiquated infrastructure you must navigate to actually retrieve the data. The federal court system uses PACER, a system built in the 1990s that charges $0.10 per page and has an interface that would feel at home on Windows 98. State courts are even more varied: some have modern REST APIs, some have basic HTML search forms, some require JavaScript-heavy navigation, and a few have essentially no online presence at all.
What makes court data valuable? Consider the range of questions it can answer: Which companies are involved in the most patent litigation? Which law firms have the highest win rates in appellate courts? What is the geographic distribution of bankruptcy filings over the past decade? Which industries are seeing the most wage theft cases? What is the average time from filing to resolution for civil rights cases? None of this is secret. All of it is in the public record. It simply requires the tools and patience to extract it.
This guide provides everything you need to work with court data in Python in 2026. We cover the CourtListener API (the best starting point for federal data), PACER and the RECAP archive, state court portal scraping, complete code for seven real-world use cases, proxy rotation with ThorData for high-volume access, error handling and retry logic, and output schemas for each data type. Every code example is working Python and handles the actual quirks of these systems.
The legal and ethical framework is straightforward: this data is public. Accessing public court records programmatically is lawful in the United States under the principle that government-held public information belongs to the public. The practical limits are PACER's terms of service (which restrict automated bulk downloading without explicit approval) and common sense — scraping a court portal so aggressively that you degrade service for attorneys trying to meet filing deadlines is antisocial at minimum and potentially a terms violation. CourtListener's API is the right tool for federal data precisely because it is designed for programmatic access.
Setup
pip install requests httpx beautifulsoup4 lxml tenacity fake-useragent
For state court scraping that requires JavaScript rendering:
pip install playwright
playwright install chromium
CourtListener API: The Best Starting Point
CourtListener is operated by the Free Law Project, a 501(c)(3) nonprofit that archives federal court opinions, oral arguments, and docket data. Their API is free for registered users, well-documented, and specifically designed for the kind of access this guide covers.
Get an API token: Register at courtlistener.com, go to your account settings, and generate a token. The free tier provides 5,000 requests per hour for authenticated users. That is generous for most research purposes.
Basic Search
import requests
import time
from typing import Optional, List
from dataclasses import dataclass
API_BASE = "https://www.courtlistener.com/api/rest/v4"
TOKEN = "your_courtlistener_token"
HEADERS = {"Authorization": f"Token {TOKEN}"}
@dataclass
class CourtOpinion:
case_name: str
court: str
date_filed: Optional[str]
date_decided: Optional[str]
docket_number: Optional[str]
judge: Optional[str]
status: str
url: str
download_url: Optional[str]
citation_count: Optional[int]
snippet: str
def search_opinions(query: str, court: Optional[str] = None,
                    date_after: Optional[str] = None, date_before: Optional[str] = None,
                    page_size: int = 20, max_results: int = 100) -> List[CourtOpinion]:
"""
Search CourtListener for court opinions.
Args:
query: Full-text search query
court: Court identifier (e.g., 'scotus', 'ca9', 'dcd')
date_after: ISO date string (YYYY-MM-DD)
date_before: ISO date string
page_size: Results per page (max 100)
max_results: Total results to retrieve
"""
opinions = []
page = 1
params = {
"q": query,
"type": "o",
"order_by": "dateFiled desc",
"page_size": min(page_size, 100),
}
if court:
params["court"] = court
if date_after:
params["filed_after"] = date_after
if date_before:
params["filed_before"] = date_before
while len(opinions) < max_results:
params["page"] = page
resp = requests.get(f"{API_BASE}/search/", headers=HEADERS, params=params, timeout=30)
resp.raise_for_status()
data = resp.json()
results = data.get("results", [])
if not results:
break
for hit in results:
opinions.append(CourtOpinion(
case_name=hit.get("caseName", ""),
court=hit.get("court", ""),
date_filed=hit.get("dateFiled"),
date_decided=hit.get("dateDecided"),
docket_number=hit.get("docketNumber"),
judge=hit.get("judge"),
status=hit.get("status", ""),
url=f"https://www.courtlistener.com{hit.get('absolute_url', '')}",
download_url=hit.get("download_url"),
citation_count=hit.get("citeCount"),
snippet=hit.get("snippet", ""),
))
# Stop if no more pages
if not data.get("next"):
break
page += 1
# Rate limit: 5000/hr = ~1.4/sec. Be conservative.
import time
time.sleep(0.8)
return opinions[:max_results]
# Example: find all federal opinions about web scraping
opinions = search_opinions("web scraping hiQ LinkedIn", court="ca9", date_after="2020-01-01")
for op in opinions:
print(f"{op.case_name} ({op.date_filed})")
print(f" Court: {op.court}, Docket: {op.docket_number}")
print(f" URL: {op.url}")
print()
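The `time.sleep(0.8)` throttle above is deliberately blunt. If you want to use more of the 5,000-requests-per-hour budget without bursting past it, a small token-bucket limiter works better. This is a sketch, not part of the CourtListener client; call `wait()` before each request:

```python
import time


class RateLimiter:
    """Token bucket: allows at most `rate` calls per `per` seconds."""

    def __init__(self, rate: int, per: float):
        self.rate = rate
        self.per = per
        self.allowance = float(rate)
        self.last_check = time.monotonic()

    def wait(self) -> None:
        # Refill the bucket proportionally to elapsed time
        now = time.monotonic()
        self.allowance += (now - self.last_check) * (self.rate / self.per)
        self.last_check = now
        if self.allowance > self.rate:
            self.allowance = self.rate
        if self.allowance < 1.0:
            # Not enough tokens: sleep until one is available
            time.sleep((1.0 - self.allowance) * (self.per / self.rate))
            self.allowance = 0.0
        else:
            self.allowance -= 1.0
```

Usage: `limiter = RateLimiter(5000, 3600.0)` then `limiter.wait()` before each `requests.get`, which spaces calls to roughly 1.4 per second only when you are actually saturating the budget.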
Fetching Full Opinion Text
import time
def get_opinion_text(cluster_id: int) -> dict:
"""
Fetch full opinion text and metadata for a specific case cluster.
Returns the case metadata and the text of each opinion in the cluster.
"""
# Get cluster (case) metadata
cluster_resp = requests.get(
f"{API_BASE}/clusters/{cluster_id}/",
headers=HEADERS,
timeout=30
)
cluster_resp.raise_for_status()
cluster = cluster_resp.json()
# Fetch each opinion in the cluster
opinions = []
for opinion_url in cluster.get("sub_opinions", []):
op_id = opinion_url.rstrip("/").split("/")[-1]
op_resp = requests.get(
f"{API_BASE}/opinions/{op_id}/",
headers=HEADERS,
timeout=30
)
if op_resp.status_code == 200:
op_data = op_resp.json()
opinions.append({
"type": op_data.get("type"),
"author": op_data.get("author_str"),
"text_plain": op_data.get("plain_text", ""),
"text_html": op_data.get("html", ""),
"page_count": op_data.get("page_count"),
})
time.sleep(0.5)
return {
"case_name": cluster.get("case_name"),
"date_decided": cluster.get("date_filed"),
"court": cluster.get("court"),
"docket_number": cluster.get("docket_number"),
"citation_count": cluster.get("citation_count"),
"attorneys": cluster.get("attorneys"),
"opinions": opinions,
}
def search_and_download_opinions(query: str, output_dir: str = "opinions", max_cases: int = 50):
"""Search for opinions and download their full text."""
import os
import json
os.makedirs(output_dir, exist_ok=True)
results = search_opinions(query, max_results=max_cases)
print(f"Found {len(results)} opinions matching '{query}'")
for i, opinion in enumerate(results):
        # Extract the numeric cluster ID from the opinion URL
        # (URLs look like .../opinion/<id>/<slug>/, often with a trailing slash)
        segments = opinion.url.rstrip("/").split("/") if opinion.url else []
        cluster_id_match = next((s for s in segments if s.isdigit()), None)
        if not cluster_id_match:
            continue
filepath = os.path.join(output_dir, f"{cluster_id_match}.json")
if os.path.exists(filepath):
print(f"[{i+1}/{len(results)}] Cached: {opinion.case_name}")
continue
try:
full_data = get_opinion_text(int(cluster_id_match))
with open(filepath, "w", encoding="utf-8") as f:
json.dump(full_data, f, ensure_ascii=False, indent=2)
print(f"[{i+1}/{len(results)}] Saved: {opinion.case_name}")
except Exception as e:
print(f"[{i+1}/{len(results)}] Error for {opinion.case_name}: {e}")
time.sleep(1.0)
PACER and the RECAP Archive
PACER is the official federal court document system. Direct automated access to PACER is restricted by their terms of service — bulk downloading without authorization is explicitly prohibited. However, the RECAP project has created a community-built free mirror of PACER content.
RECAP works as a browser extension: when any user accesses a PACER document, the extension automatically uploads it to CourtListener's free archive. Over years, this has built a substantial free mirror of federal court documents. If a document is in the RECAP archive, you can access it via the CourtListener API at no cost.
import requests
import time
from dataclasses import dataclass, field
from typing import Optional, List
@dataclass
class DocketEntry:
entry_number: Optional[int]
date_filed: Optional[str]
description: str
documents: List[dict] = field(default_factory=list)
@dataclass
class Docket:
case_name: str
court: str
docket_number: str
date_filed: Optional[str]
date_terminated: Optional[str]
assigned_to: Optional[str]
cause: Optional[str]
nature_of_suit: Optional[str]
jury_demand: Optional[str]
entry_count: int
pacer_case_id: Optional[str]
idb_data: dict = field(default_factory=dict)
def search_dockets(case_name: Optional[str] = None, docket_number: Optional[str] = None,
                   court: Optional[str] = None, nature_of_suit: Optional[str] = None,
                   date_filed_after: Optional[str] = None, max_results: int = 50) -> List[Docket]:
"""Search RECAP archive for federal court dockets."""
params = {"page_size": min(max_results, 100)}
if case_name:
params["case_name"] = case_name
if docket_number:
params["docket_number"] = docket_number
if court:
params["court"] = court
if nature_of_suit:
params["nature_of_suit"] = nature_of_suit
if date_filed_after:
params["date_filed__gte"] = date_filed_after
resp = requests.get(
"https://www.courtlistener.com/api/rest/v4/dockets/",
headers=HEADERS,
params=params,
timeout=30
)
resp.raise_for_status()
data = resp.json()
dockets = []
for item in data.get("results", []):
dockets.append(Docket(
case_name=item.get("case_name", ""),
court=item.get("court_id", ""),
docket_number=item.get("docket_number", ""),
date_filed=item.get("date_filed"),
date_terminated=item.get("date_terminated"),
assigned_to=item.get("assigned_to_str"),
cause=item.get("cause"),
nature_of_suit=item.get("nature_of_suit"),
jury_demand=item.get("jury_demand"),
entry_count=item.get("entry_count", 0) or 0,
pacer_case_id=item.get("pacer_case_id"),
))
return dockets
def get_docket_entries(docket_id: int, max_entries: int = 200) -> List[DocketEntry]:
"""Fetch docket entries for a specific case."""
entries = []
page = 1
while len(entries) < max_entries:
resp = requests.get(
"https://www.courtlistener.com/api/rest/v4/docket-entries/",
headers=HEADERS,
params={"docket": docket_id, "page": page, "page_size": 50},
timeout=30
)
resp.raise_for_status()
data = resp.json()
for item in data.get("results", []):
docs = []
for doc in item.get("recap_documents", []):
docs.append({
"document_number": doc.get("document_number"),
"description": doc.get("description"),
"is_available": doc.get("is_available"),
"file_size": doc.get("file_size"),
"filepath_local": doc.get("filepath_local"),
})
entries.append(DocketEntry(
entry_number=item.get("entry_number"),
date_filed=item.get("date_filed"),
description=item.get("description", ""),
documents=docs,
))
if not data.get("next"):
break
page += 1
time.sleep(0.5)
return entries[:max_entries]
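When a docket entry's document has `is_available` set and a `filepath_local`, the PDF itself can typically be fetched directly from CourtListener's storage host. The host name below is an assumption based on how RECAP files are served; verify it against the API documentation before relying on it:

```python
# Assumed public storage host for archived RECAP files
RECAP_STORAGE_BASE = "https://storage.courtlistener.com/"


def recap_download_url(filepath_local: str) -> str:
    """Build a direct download URL for an archived RECAP document.

    Returns an empty string when the document has no local file path
    (i.e., it is not in the free archive).
    """
    if not filepath_local:
        return ""
    return RECAP_STORAGE_BASE + filepath_local.lstrip("/")
```

Pair this with the `is_available` flag from `get_docket_entries`: only documents that a RECAP extension user has already purchased on PACER will have a file path.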
State Court Scraping with Proxy Rotation
State courts are where production scraping gets challenging. Each state has its own portal, authentication model, and anti-bot defenses. Many state court portals use aggressive IP-based rate limiting — they serve local attorneys making individual lookups and are not designed for bulk access.
This makes residential proxy rotation essential for any serious state court data collection. ThorData provides rotating residential proxies that route your traffic through real consumer IPs, making your requests appear identical to a local attorney checking case status from home.
import requests
from bs4 import BeautifulSoup
import time
import random
from typing import Optional, List, Generator
from dataclasses import dataclass
@dataclass
class StateCaseRecord:
case_number: str
case_type: str
parties: dict
date_filed: Optional[str]
status: str
judge: Optional[str]
events: List[dict]
court_name: str
state: str
source_url: str
def create_state_court_session(thordata_user: str, thordata_pass: str,
country: str = "US") -> requests.Session:
"""Create a scraping session for state court portals."""
session_id = random.randint(100000, 999999)
proxy_url = f"http://{thordata_user}-country-{country}-session-sc{session_id}:{thordata_pass}@proxy.thordata.com:9000"
session = requests.Session()
session.headers.update({
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
})
session.proxies = {"http": proxy_url, "https": proxy_url}
return session
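For higher-volume collection you can keep several sticky proxy sessions alive at once and rotate between them, so no single exit IP carries all the traffic. This sketch reuses the same credential scheme assumed in `create_state_court_session`; the username format and endpoint are placeholders to adapt to your provider's documentation:

```python
import itertools
import random

import requests


def make_session_pool(thordata_user: str, thordata_pass: str, size: int = 5):
    """Round-robin pool of Sessions, each pinned to its own sticky proxy session ID."""
    sessions = []
    for _ in range(size):
        sid = random.randint(100000, 999999)
        proxy = (f"http://{thordata_user}-country-US-session-sc{sid}:"
                 f"{thordata_pass}@proxy.thordata.com:9000")
        s = requests.Session()
        s.proxies = {"http": proxy, "https": proxy}
        sessions.append(s)
    # itertools.cycle yields sessions in round-robin order forever
    return itertools.cycle(sessions)
```

Usage: `pool = make_session_pool(user, password, size=5)` then `session = next(pool)` before each request, which spreads consecutive page fetches across five distinct residential IPs.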
def scrape_state_court_portal(
portal_url: str,
search_params: dict,
state: str,
session: requests.Session,
max_pages: int = 10
) -> Generator[StateCaseRecord, None, None]:
"""
Generic state court scraper. Adapts to common portal patterns.
Yields StateCaseRecord objects as they are extracted.
"""
for page in range(1, max_pages + 1):
params = {**search_params, "page": page}
try:
resp = session.get(portal_url, params=params, timeout=30)
if resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 60))
print(f"Rate limited — waiting {retry_after}s")
time.sleep(retry_after)
continue
if resp.status_code in (403, 503):
print(f"Blocked (HTTP {resp.status_code}) — IP may need rotation")
time.sleep(random.uniform(30, 60))
break
resp.raise_for_status()
except requests.RequestException as e:
print(f"Request error on page {page}: {e}")
time.sleep(random.uniform(5, 15))
continue
soup = BeautifulSoup(resp.text, "lxml")
# Detect CAPTCHA
if soup.find("div", id="captcha") or soup.find("div", class_="g-recaptcha"):
print("CAPTCHA detected — stopping (try slower rate or manual lookup)")
break
# Extract case records from table (common pattern for court portals)
records_found = 0
# Try common table structures used by state court portals
table = (soup.select_one("table#search-results") or
soup.select_one("table.case-list") or
soup.select_one("table.results") or
soup.find("table", attrs={"summary": lambda s: s and "case" in s.lower()}) or
soup.find("table"))
if table:
rows = table.find_all("tr")[1:] # Skip header
for row in rows:
cells = row.find_all("td")
if not cells or len(cells) < 2:
continue
case_link = cells[0].find("a")
case_number_raw = cells[0].get_text(strip=True)
case_number = parse_case_number(case_number_raw)
party_text = cells[1].get_text(strip=True) if len(cells) > 1 else ""
parties = parse_party_names(party_text)
date_filed = cells[2].get_text(strip=True) if len(cells) > 2 else None
status = cells[3].get_text(strip=True) if len(cells) > 3 else "Unknown"
judge = cells[4].get_text(strip=True) if len(cells) > 4 else None
detail_url = ""
if case_link:
href = case_link.get("href", "")
if href.startswith("http"):
detail_url = href
elif href:
from urllib.parse import urljoin
detail_url = urljoin(portal_url, href)
yield StateCaseRecord(
case_number=case_number,
case_type=infer_case_type(case_number),
parties=parties,
date_filed=date_filed,
status=status,
judge=judge,
events=[],
court_name="",
state=state,
source_url=detail_url or portal_url,
)
records_found += 1
if records_found == 0:
print(f"No records found on page {page} — stopping pagination")
break
# Check for next page link
        next_link = (soup.select_one("a[aria-label='Next Page']") or
                     soup.select_one("a.pagination-next") or
                     soup.find("a", string=lambda s: s and "next" in s.lower()))
if not next_link:
break
# Human-like delay between pages
time.sleep(random.uniform(2.0, 5.0))
def scrape_case_detail(case_url: str, session: requests.Session, state: str) -> dict:
"""Fetch detailed case information from a case detail page."""
resp = session.get(case_url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
events = []
# Docket/event table — extremely common pattern
event_table = (soup.select_one("table#docket-entries") or
soup.select_one("table.event-list") or
soup.select_one("table#events"))
if event_table:
for row in event_table.find_all("tr")[1:]:
cells = row.find_all("td")
if len(cells) >= 2:
events.append({
"date": cells[0].get_text(strip=True),
"event": cells[1].get_text(strip=True),
"document_url": cells[2].find("a").get("href") if len(cells) > 2 and cells[2].find("a") else None,
})
# Case metadata — often in definition lists or key-value pairs
metadata = {}
for dl in soup.find_all("dl"):
dts = dl.find_all("dt")
dds = dl.find_all("dd")
for dt, dd in zip(dts, dds):
key = dt.get_text(strip=True).lower().replace(" ", "_").replace(":", "")
metadata[key] = dd.get_text(strip=True)
return {"events": events, "metadata": metadata}
Data Normalization Utilities
Court records use inconsistent formats across jurisdictions. These utilities normalize the most common patterns:
import re
from typing import Optional
def parse_case_number(raw: str) -> str:
"""Normalize case numbers across court systems."""
clean = re.sub(r"\s+", " ", raw.strip())
# Federal format: "2:24-cv-01234" or "24-cv-1234"
federal_match = re.search(
r"(\d+:)?(\d{2})[- ](cv|cr|mc|ap|bk|adv)[- ](\d+)",
clean, re.IGNORECASE
)
if federal_match:
year = federal_match.group(2)
case_type = federal_match.group(3).lower()
seq = federal_match.group(4).zfill(5)
prefix = federal_match.group(1) or ""
return f"{prefix}{year}-{case_type}-{seq}"
    # State formats ("CV-2024-001234", "2024-L-001234") vary too much across
    # jurisdictions to normalize reliably; return the whitespace-cleaned string
    return clean
def parse_party_names(raw: str) -> dict:
"""Split case title into plaintiff/defendant."""
    # Matching below is case-insensitive, so uppercase variants are covered
    separators = [" v. ", " vs. ", " vs ", " -v- ", " v "]
for sep in separators:
if sep.lower() in raw.lower():
idx = raw.lower().find(sep.lower())
plaintiff = raw[:idx].strip()
defendant = raw[idx + len(sep):].strip()
return {
"plaintiff": plaintiff,
"defendant": defendant,
"raw": raw,
}
return {"plaintiff": raw.strip(), "defendant": None, "raw": raw}
def infer_case_type(case_number: str) -> str:
"""Infer case type from case number prefix."""
case_lower = case_number.lower()
type_map = {
"cv": "civil",
"cr": "criminal",
"bk": "bankruptcy",
"ap": "adversary_proceeding",
"mc": "miscellaneous",
"l": "civil_law", # Common in some states
"d": "divorce",
"dr": "domestic_relations",
"f": "felony",
"m": "misdemeanor",
"jv": "juvenile",
"p": "probate",
"pc": "probate",
}
for prefix, type_name in type_map.items():
if re.search(rf"\b{prefix}\b", case_lower):
return type_name
return "unknown"
def normalize_date(raw: str) -> Optional[str]:
"""Convert various date formats to ISO 8601."""
if not raw:
return None
    raw = raw.strip()
# ISO format — already correct
if re.match(r"\d{4}-\d{2}-\d{2}", raw):
return raw[:10]
# MM/DD/YYYY
match = re.match(r"(\d{1,2})/(\d{1,2})/(\d{4})", raw)
if match:
month, day, year = match.groups()
return f"{year}-{month.zfill(2)}-{day.zfill(2)}"
# Month DD, YYYY
month_names = {
"january": "01", "february": "02", "march": "03", "april": "04",
"may": "05", "june": "06", "july": "07", "august": "08",
"september": "09", "october": "10", "november": "11", "december": "12",
"jan": "01", "feb": "02", "mar": "03", "apr": "04",
"jun": "06", "jul": "07", "aug": "08", "sep": "09",
"oct": "10", "nov": "11", "dec": "12",
}
match = re.match(r"(\w+)\s+(\d{1,2}),?\s+(\d{4})", raw, re.IGNORECASE)
if match:
month_str, day, year = match.groups()
month_num = month_names.get(month_str.lower())
if month_num:
return f"{year}-{month_num}-{day.zfill(2)}"
return raw # Return as-is if no pattern matches
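The regex patterns above cover the formats most portals emit. As a backstop, `datetime.strptime` can be tried against a list of known formats; the format list here is illustrative and worth extending as you encounter new portals:

```python
from datetime import datetime
from typing import Optional

# Illustrative format list; extend per jurisdiction
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%m-%d-%Y",
                 "%B %d, %Y", "%b %d, %Y", "%d-%b-%Y"]


def normalize_date_strptime(raw: str) -> Optional[str]:
    """Try each known format in turn; return ISO 8601 on the first match."""
    raw = (raw or "").strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

This is slower than the regex approach but fails loudly (returns `None`) instead of passing unparsed strings through, which makes bad dates easy to spot in your output.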
Error Handling and Retry Logic
Court portal scraping requires robust error handling. State portals are often slow, go down for maintenance, and have aggressive rate limiting:
import requests
import time
import random
import logging
from functools import wraps
from typing import Optional, Callable, Any
from bs4 import BeautifulSoup
logger = logging.getLogger(__name__)
def retry_with_backoff(
    max_attempts: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 120.0,
    rotate_proxy_on: tuple = (403, 429, 503),
):
"""
Decorator for retrying court portal requests with exponential backoff.
Optionally rotates proxy on specific HTTP status codes.
"""
def decorator(func: Callable) -> Callable:
@wraps(func)
def wrapper(*args, **kwargs) -> Any:
last_exception = None
for attempt in range(1, max_attempts + 1):
try:
return func(*args, **kwargs)
except requests.HTTPError as e:
                    # Note: `if e.response` is wrong here — Response truthiness
                    # is False for 4xx/5xx, so test against None explicitly
                    status_code = e.response.status_code if e.response is not None else 0
last_exception = e
if status_code in rotate_proxy_on:
logger.warning(f"HTTP {status_code} on attempt {attempt} — needs proxy rotation")
# Signal to caller that proxy rotation is needed
if attempt == max_attempts:
raise
if status_code in (404, 410):
# Don't retry — resource doesn't exist
raise
# Respect Retry-After header
if status_code == 429:
retry_after = int(e.response.headers.get("Retry-After", 60))
logger.info(f"Rate limited — waiting {retry_after}s")
time.sleep(retry_after)
continue
except (requests.ConnectionError, requests.Timeout) as e:
last_exception = e
logger.warning(f"Network error on attempt {attempt}: {e}")
except Exception as e:
last_exception = e
logger.warning(f"Error on attempt {attempt}: {e}")
if attempt < max_attempts:
delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
jitter = random.uniform(0, delay * 0.3)
sleep_time = delay + jitter
logger.info(f"Retrying in {sleep_time:.1f}s (attempt {attempt}/{max_attempts})")
time.sleep(sleep_time)
raise last_exception
return wrapper
return decorator
@retry_with_backoff(max_attempts=4, base_delay=3.0, max_delay=60.0)
def fetch_court_page(url: str, session: requests.Session,
                     params: Optional[dict] = None) -> BeautifulSoup:
"""Fetch a court portal page with retry logic."""
resp = session.get(url, params=params, timeout=45)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
# Detect maintenance pages
title = soup.find("title")
if title:
title_text = title.string.lower() if title.string else ""
if any(kw in title_text for kw in ["maintenance", "unavailable", "down for"]):
raise requests.ConnectionError("Court portal is under maintenance")
# Detect session expiry
if soup.find("div", id="session-expired") or "session has expired" in resp.text.lower():
raise requests.HTTPError("Session expired", response=resp)
return soup
def scrape_with_checkpointing(case_numbers: list, session: requests.Session,
portal_url: str, checkpoint_file: str = "progress.json"):
"""
Scrape case details with checkpointing to resume interrupted jobs.
Saves progress after each successful fetch.
"""
import json
import os
# Load existing progress
completed = {}
if os.path.exists(checkpoint_file):
with open(checkpoint_file, "r") as f:
completed = json.load(f)
print(f"Resuming: {len(completed)} cases already processed")
results = dict(completed)
for i, case_number in enumerate(case_numbers):
if case_number in completed:
continue
try:
soup = fetch_court_page(portal_url, session, params={"case": case_number})
case_data = extract_case_from_soup(soup)
results[case_number] = case_data
# Save checkpoint after each successful fetch
with open(checkpoint_file, "w") as f:
json.dump(results, f)
print(f"[{i+1}/{len(case_numbers)}] ✓ {case_number}")
except Exception as e:
results[case_number] = {"error": str(e)}
print(f"[{i+1}/{len(case_numbers)}] ✗ {case_number}: {e}")
time.sleep(random.uniform(1.5, 4.0))
return results
def extract_case_from_soup(soup: BeautifulSoup) -> dict:
    """Extract case data from a parsed court page.

    Placeholder: extend with the selectors your target portal actually uses.
    """
    return {
        "title": soup.find("h1").get_text(strip=True) if soup.find("h1") else None,
    }
Seven Real-World Use Cases with Complete Code
Use Case 1: Litigation Monitor for Companies
import requests
import json
import time
from dataclasses import dataclass, field
from typing import List, Optional
@dataclass
class LitigationAlert:
company_name: str
case_name: str
court: str
date_filed: str
case_type: str
nature_of_suit: Optional[str]
url: str
docket_number: str
def monitor_company_litigation(company_names: List[str],
since_date: str,
output_file: str = "litigation_alerts.jsonl") -> List[LitigationAlert]:
"""
Monitor for new litigation involving specified companies.
Checks CourtListener for new federal cases.
"""
alerts = []
for company in company_names:
print(f"Checking litigation for: {company}")
        # Docket search surfaces newly filed cases; a search_opinions() pass
        # could additionally flag newly decided ones.
        dockets = search_dockets(
            case_name=company,
            date_filed_after=since_date,
            max_results=50
        )
for docket in dockets:
alert = LitigationAlert(
company_name=company,
case_name=docket.case_name,
court=docket.court,
date_filed=docket.date_filed or "",
case_type=infer_case_type(docket.docket_number or ""),
nature_of_suit=docket.nature_of_suit,
                # NOTE: public docket pages key on the CourtListener docket ID;
                # pacer_case_id is only a stand-in when that ID isn't captured
                url=f"https://www.courtlistener.com/docket/{docket.pacer_case_id}/",
docket_number=docket.docket_number or "",
)
alerts.append(alert)
time.sleep(1.0)
# Save to JSONL
with open(output_file, "a") as f:
for alert in alerts:
f.write(json.dumps({
"company": alert.company_name,
"case_name": alert.case_name,
"court": alert.court,
"date_filed": alert.date_filed,
"type": alert.case_type,
"nature_of_suit": alert.nature_of_suit,
"url": alert.url,
"docket_number": alert.docket_number,
}) + "\n")
return alerts
Output schema:
{
"company": "Acme Corp",
"case_name": "Smith v. Acme Corp",
"court": "nysd",
"date_filed": "2026-03-15",
"type": "civil",
"nature_of_suit": "Employment Discrimination",
"url": "https://www.courtlistener.com/docket/12345678/",
"docket_number": "1:26-cv-01234"
}
Use Case 2: Bankruptcy Filing Tracker
from dataclasses import dataclass
from typing import Optional, List
@dataclass
class BankruptcyFiling:
case_number: str
debtor_name: str
chapter: int # 7, 11, 13, etc.
date_filed: str
date_closed: Optional[str]
court: str
trustee: Optional[str]
assets: Optional[str]
liabilities: Optional[str]
creditor_count: Optional[int]
def fetch_bankruptcy_filings(court: str = "almb", date_after: str = "2026-01-01",
max_results: int = 200) -> List[BankruptcyFiling]:
"""
Fetch recent bankruptcy filings from CourtListener RECAP archive.
court: Bankruptcy court identifier (e.g., 'almb', 'caeb', 'nysb')
"""
params = {
"court": court,
"date_filed__gte": date_after,
"page_size": 100,
"order_by": "-date_filed",
}
filings = []
page = 1
while len(filings) < max_results:
params["page"] = page
resp = requests.get(
"https://www.courtlistener.com/api/rest/v4/dockets/",
headers=HEADERS,
params=params,
timeout=30
)
resp.raise_for_status()
data = resp.json()
for item in data.get("results", []):
docket_num = item.get("docket_number", "")
            # Heuristic chapter detection: docket metadata rarely states the
            # chapter outright, so default to Chapter 7 (the most common filing).
            # Guard against `cause` being null in the API response.
            chapter = 7
            cause = (item.get("cause") or "").lower()
            if "ch11" in docket_num.lower() or "chapter 11" in cause:
                chapter = 11
            elif "ch13" in docket_num.lower() or "chapter 13" in cause:
                chapter = 13
filings.append(BankruptcyFiling(
case_number=docket_num,
debtor_name=item.get("case_name", "").split(" v. ")[0] if " v. " in item.get("case_name", "") else item.get("case_name", ""),
chapter=chapter,
date_filed=item.get("date_filed", ""),
date_closed=item.get("date_terminated"),
court=item.get("court_id", ""),
trustee=item.get("assigned_to_str"),
assets=None,
liabilities=None,
creditor_count=None,
))
if not data.get("next"):
break
page += 1
time.sleep(0.8)
return filings[:max_results]
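One caveat on the debtor-name extraction above: bankruptcy captions rarely contain " v. " at all; they usually read "In re Debtor" or "In the Matter of Debtor". A small caption parser along those lines (a sketch based on that common convention) recovers the debtor more reliably:

```python
import re


def parse_debtor_name(case_name: str) -> str:
    """Strip the 'In re' / 'In the Matter of' prefix from a bankruptcy caption."""
    match = re.match(r"(?i)^\s*(?:in re:?|in the matter of)\s+(.+)$", case_name or "")
    return match.group(1).strip() if match else (case_name or "").strip()
```

Captions that carry neither prefix pass through unchanged, so the function is safe to apply to every `case_name` in the results.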
Use Case 3: Patent Litigation Intelligence
def fetch_patent_cases(assignee: str = None, patent_number: str = None,
date_after: str = "2024-01-01") -> list:
"""
Find patent infringement cases. Useful for competitive intelligence
and patent validity research.
"""
    # Nature-of-suit code 830 covers patent cases in federal civil dockets.
    # (patent_number filtering would require full-text search of the filings,
    # which the docket endpoint does not provide; the parameter is accepted
    # here for future use.)
    dockets = search_dockets(
        nature_of_suit="830",
        date_filed_after=date_after,
        max_results=100
    )
patent_cases = []
for docket in dockets:
# Filter by assignee if specified
if assignee and assignee.lower() not in docket.case_name.lower():
continue
patent_cases.append({
"case_name": docket.case_name,
"court": docket.court,
"docket_number": docket.docket_number,
"date_filed": docket.date_filed,
"status": "terminated" if docket.date_terminated else "active",
"judge": docket.assigned_to,
"entry_count": docket.entry_count,
})
return patent_cases
# Output schema:
# {
# "case_name": "TechCorp v. InnovateCo",
# "court": "cacd",
# "docket_number": "2:26-cv-01234",
# "date_filed": "2026-01-15",
# "status": "active",
# "judge": "Hon. Jane Smith",
# "entry_count": 42
# }
Use Case 4: Employment Discrimination Case Database
def build_employment_discrimination_database(
courts: List[str] = None,
date_after: str = "2023-01-01",
max_per_court: int = 500
) -> list:
"""
Build a database of employment discrimination cases for research.
Nature of suit codes: 442 (Civil Rights - Employment),
446 (Americans with Disabilities), 448 (Education)
"""
if courts is None:
# Major district courts
courts = ["nysd", "cacd", "ilnd", "txsd", "gamd"]
all_cases = []
for court in courts:
print(f"Fetching from {court}...")
for nos_code in ["442", "446"]:
dockets = search_dockets(
court=court,
nature_of_suit=nos_code,
date_filed_after=date_after,
max_results=max_per_court
)
for docket in dockets:
all_cases.append({
"court": court,
"case_name": docket.case_name,
"docket_number": docket.docket_number,
"date_filed": docket.date_filed,
"date_terminated": docket.date_terminated,
"nature_of_suit": nos_code,
"nature_of_suit_desc": "Civil Rights - Employment" if nos_code == "442" else "ADA",
"judge": docket.assigned_to,
"resolved": docket.date_terminated is not None,
"entry_count": docket.entry_count,
})
time.sleep(2.0)
return all_cases
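With `date_filed` and `date_terminated` in hand, summary statistics fall out directly. For example, median days from filing to termination over the resolved cases (field names match the dicts built above):

```python
from datetime import date
from statistics import median
from typing import List, Optional


def median_days_to_resolution(cases: List[dict]) -> Optional[float]:
    """Median filing-to-termination time over resolved cases, in days."""
    durations = [
        (date.fromisoformat(c["date_terminated"]) - date.fromisoformat(c["date_filed"])).days
        for c in cases
        if c.get("date_filed") and c.get("date_terminated")
    ]
    return float(median(durations)) if durations else None
```

Unresolved cases (null `date_terminated`) are excluded rather than treated as zero, so the statistic only reflects closed litigation.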
Use Case 5: Real-Time Docket Monitor
import json
import os
import time
from datetime import datetime
from typing import List
class DocketMonitor:
"""
Monitor specific federal cases for new docket entries.
Useful for tracking active litigation involving your interests.
"""
def __init__(self, watchlist_file: str = "watchlist.json",
state_file: str = "docket_state.json"):
self.watchlist_file = watchlist_file
self.state_file = state_file
self.watchlist = self._load_watchlist()
self.state = self._load_state()
def _load_watchlist(self) -> list:
if os.path.exists(self.watchlist_file):
with open(self.watchlist_file) as f:
return json.load(f)
return []
def _load_state(self) -> dict:
if os.path.exists(self.state_file):
with open(self.state_file) as f:
return json.load(f)
return {}
def _save_state(self):
with open(self.state_file, "w") as f:
json.dump(self.state, f, indent=2)
def add_case(self, docket_id: int, case_name: str, description: str = ""):
"""Add a case to the watchlist."""
self.watchlist.append({
"docket_id": docket_id,
"case_name": case_name,
"description": description,
"added_at": datetime.utcnow().isoformat(),
})
with open(self.watchlist_file, "w") as f:
json.dump(self.watchlist, f, indent=2)
print(f"Added to watchlist: {case_name} (docket {docket_id})")
def check_for_updates(self) -> List[dict]:
"""Check all watched cases for new docket entries."""
new_entries = []
for case in self.watchlist:
docket_id = case["docket_id"]
case_key = str(docket_id)
try:
entries = get_docket_entries(docket_id, max_entries=50)
if not entries:
continue
# Find entries newer than our last check
last_seen = self.state.get(case_key, {}).get("last_entry_date")
for entry in entries:
if entry.date_filed and (not last_seen or entry.date_filed > last_seen):
new_entries.append({
"case_name": case["case_name"],
"docket_id": docket_id,
"entry_number": entry.entry_number,
"date_filed": entry.date_filed,
"description": entry.description[:200],
"has_documents": len(entry.documents) > 0,
})
# Update state
if entries:
dates = [e.date_filed for e in entries if e.date_filed]
if dates:
self.state[case_key] = {
"last_entry_date": max(dates),
"last_checked": datetime.utcnow().isoformat(),
"entry_count": len(entries),
}
time.sleep(1.0)
except Exception as e:
print(f"Error checking {case['case_name']}: {e}")
self._save_state()
return new_entries
# Usage
monitor = DocketMonitor()
monitor.add_case(12345678, "Smith v. TechCorp Inc", "Employment discrimination case")
updates = monitor.check_for_updates()
for update in updates:
print(f"New filing in {update['case_name']}:")
print(f" Entry #{update['entry_number']} filed {update['date_filed']}")
print(f" {update['description']}")
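Because check_for_updates only detects entries that appeared since the last run, the monitor is meant to be invoked on a schedule (cron, or a long-running loop). A minimal polling wrapper, sketched here with a stub callback standing in for monitor.check_for_updates:

```python
import time
from typing import Callable, List

def poll(check: Callable[[], List[dict]], interval_s: float, max_cycles: int) -> List[dict]:
    """Run a check callback repeatedly, collecting any updates it returns."""
    collected: List[dict] = []
    for cycle in range(max_cycles):
        updates = check()
        for u in updates:
            print(f"[cycle {cycle}] new filing in {u.get('case_name', '?')}")
        collected.extend(updates)
        if cycle < max_cycles - 1:
            time.sleep(interval_s)
    return collected

# Stub callback for illustration; in practice pass monitor.check_for_updates
batches = [[{"case_name": "Smith v. TechCorp Inc", "entry_number": 42}], []]
results = poll(lambda: batches.pop(0), interval_s=0.0, max_cycles=2)
```

In production you would pass the real bound method and an interval measured in hours, not seconds; courts do not docket entries frequently enough to justify tighter polling.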
Use Case 6: State Court Eviction Data Collector
def collect_eviction_data(state: str, county: str,
thordata_user: str, thordata_pass: str,
date_range: tuple) -> List[StateCaseRecord]:
"""
Collect eviction filing data from state court portals.
Eviction data is public record in all US states.
Useful for housing researchers, tenant advocates, journalists.
Note: Portal URLs and selectors vary by state — this is a template.
Consult your state court's online access portal for the actual endpoint.
"""
session = create_state_court_session(thordata_user, thordata_pass)
# Common state portal URL patterns (adapt per state)
portal_urls = {
"FL": "https://myeclerk.myorangeclerk.com/Cases/Search",
"TX": "https://www.txcourts.gov/court-search/",
"CA": "https://www.lacourt.org/casesummary/ui/index.aspx",
"NY": "https://iapps.courts.state.ny.us/webcivil/FCASMain",
}
portal_url = portal_urls.get(state.upper())
if not portal_url:
raise ValueError(f"No portal URL configured for state: {state}")
records = list(scrape_state_court_portal(
portal_url=portal_url,
search_params={
"county": county,
"case_type": "eviction",
"date_from": date_range[0],
"date_to": date_range[1],
},
state=state,
session=session,
max_pages=20,
))
print(f"Collected {len(records)} eviction records from {county}, {state}")
return records
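Eviction records are usually consumed as spreadsheets by the researchers and journalists this use case targets. Assuming StateCaseRecord is a dataclass (as in the schema section), records flatten to CSV with dataclasses.asdict; sketched here with a minimal stand-in record type:

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class EvictionRow:  # minimal stand-in for StateCaseRecord
    case_number: str
    county: str
    date_filed: str

def records_to_csv(records) -> str:
    """Serialize a list of dataclass records to CSV text."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(records[0])])
    writer.writeheader()
    for r in records:
        writer.writerow(asdict(r))
    return buf.getvalue()

csv_text = records_to_csv([EvictionRow("2026-CC-001234", "Orange", "2026-01-15")])
```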
Use Case 7: Federal Regulatory Enforcement Tracker
def track_regulatory_enforcement(agency_keywords: List[str] = None,
court: str = None,
date_after: str = "2025-01-01") -> list:
"""
Track federal regulatory enforcement actions.
Covers SEC, FTC, DOJ, EPA, CFPB, and other agency cases.
"""
if agency_keywords is None:
agency_keywords = ["SEC", "FTC", "DOJ", "EPA", "CFPB", "FDA", "CFTC"]
enforcement_cases = []
for agency in agency_keywords:
print(f"Searching for {agency} enforcement actions...")
        # The docket search itself is identical for every agency keyword;
        # the per-agency filtering happens below on the "cause" field.
        dockets = search_dockets(
            case_name="United States v",
            court=court,
            date_filed_after=date_after,
            max_results=100
        )
        # Also search opinions mentioning the agency, and record them
        # alongside dockets so the results are not silently discarded
        opinions = search_opinions(
            query=f"{agency} enforcement penalty",
            date_after=date_after,
            max_results=20
        )
        for opinion in opinions:
            enforcement_cases.append({
                "agency": agency,
                "case_name": getattr(opinion, "case_name", ""),
                "court": getattr(opinion, "court", ""),
                "date_filed": getattr(opinion, "date_filed", None),
                "type": "opinion",
            })
        for docket in dockets:
            if docket.cause and agency.lower() in docket.cause.lower():
enforcement_cases.append({
"agency": agency,
"case_name": docket.case_name,
"court": docket.court,
"docket_number": docket.docket_number,
"date_filed": docket.date_filed,
"cause": docket.cause,
"nature_of_suit": docket.nature_of_suit,
"status": "terminated" if docket.date_terminated else "active",
"type": "docket",
})
time.sleep(1.5)
return enforcement_cases
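Because the docket search above runs once per agency keyword, the same docket can match more than one agency and appear multiple times in the results. A small dedupe pass keyed on court plus docket number (a sketch; the field names follow the dicts built above):

```python
def dedupe_cases(cases: list) -> list:
    """Drop duplicate entries that matched more than one agency keyword."""
    seen = set()
    unique = []
    for c in cases:
        # Fall back to case_name for entries (e.g. opinions) with no docket number
        key = (c.get("court"), c.get("docket_number") or c.get("case_name"))
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique

sample = [
    {"court": "dcd", "docket_number": "1:26-cv-00001", "agency": "SEC"},
    {"court": "dcd", "docket_number": "1:26-cv-00001", "agency": "DOJ"},
]
unique = dedupe_cases(sample)
```

Note the first match wins, so an SEC/DOJ joint action keeps whichever agency keyword was searched first.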
Output Schema Reference
from dataclasses import dataclass, field
from typing import Optional, List
@dataclass
class FullCourtRecord:
"""Unified output schema for court records from any source."""
# Identifiers
record_id: str
source: str # "courtlistener", "pacer_recap", "state_portal"
court_name: str
court_state: str # Two-letter state code
court_level: str # "federal", "state", "local"
# Case identification
case_number: str
case_name: str
case_type: str # "civil", "criminal", "bankruptcy", etc.
nature_of_suit: Optional[str]
cause_of_action: Optional[str]
# Parties
plaintiff: Optional[str]
defendant: Optional[str]
additional_parties: List[str] = field(default_factory=list)
attorneys: List[dict] = field(default_factory=list)
    # Timeline (defaults are required here because earlier fields declare them)
    date_filed: Optional[str] = None
    date_terminated: Optional[str] = None
    is_active: bool = True
    # Court details
    judge: Optional[str] = None
    jury_demand: Optional[str] = None
# Docket
entry_count: int = 0
last_entry_date: Optional[str] = None
# Source metadata
source_url: str = ""
scraped_at: str = ""
# Extended data
opinions: List[dict] = field(default_factory=list)
raw_metadata: dict = field(default_factory=dict)
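One advantage of keeping the schema a plain dataclass is that dataclasses.asdict converts records to JSON-serializable dicts without custom encoders. A sketch, using a trimmed stand-in so the example stays short (the full FullCourtRecord serializes the same way):

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class CourtRecordMini:  # trimmed stand-in for FullCourtRecord
    record_id: str
    source: str
    case_name: str
    date_filed: Optional[str] = None
    opinions: List[dict] = field(default_factory=list)

rec = CourtRecordMini(
    record_id="cl-12345678",
    source="courtlistener",
    case_name="Smith v. TechCorp Inc",
    date_filed="2026-01-15",
)
payload = json.dumps(asdict(rec), indent=2)
```

Keeping dates as ISO-8601 strings rather than datetime objects, as the schema does, is what makes this round-trip work with no extra serialization code.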
Key Considerations
Legal: Court records are public. Programmatic access is lawful under the principle of open courts. PACER's terms restrict automated bulk downloading without approval — use CourtListener/RECAP for federal data.
Ethical: Set reasonable rate limits. State court portals serve real users with filing deadlines. Degrading service to a court system is antisocial and may trigger legal scrutiny beyond simple terms violations.
Privacy: While records are public, they contain sensitive information — personal addresses, financial details, criminal records. Handle responsibly and consider applicable privacy laws in your jurisdiction before publishing or distributing extracted data.
Technical: State court portals change without notice. Selectors break. Budget for ongoing maintenance. Cache everything locally to avoid re-fetching.
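Local caching deserves concrete treatment: a small file-based cache keyed on a hash of the request avoids re-fetching pages you already spent time (or PACER fees) retrieving. A sketch, with a stub fetcher standing in for a real HTTP call:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cache_key(url: str, params: dict) -> str:
    """Stable filename derived from the request URL and parameters."""
    raw = url + json.dumps(params, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest() + ".json"

def cached_fetch(url: str, params: dict, fetch_fn, cache_dir: Path):
    """Return a cached response if present; otherwise fetch and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / cache_key(url, params)
    if path.exists():
        return json.loads(path.read_text())
    data = fetch_fn(url, params)
    path.write_text(json.dumps(data))
    return data

# Stub fetcher for illustration; counts how many real fetches happen
calls = []
def fake_fetch(url, params):
    calls.append(url)
    return {"ok": True}

cache_dir = Path(tempfile.mkdtemp())
first = cached_fetch("https://example.org/api", {"q": "smith"}, fake_fetch, cache_dir)
second = cached_fetch("https://example.org/api", {"q": "smith"}, fake_fetch, cache_dir)
```

Court records rarely change once filed, so a cache like this can be effectively permanent for terminated cases; only active dockets need re-fetching.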
For any volume above casual research, ThorData residential proxies are necessary for state court portals: many of them apply aggressive IP-based blocking that datacenter IPs cannot get through.
CourtListener's API plus the RECAP archive covers the vast majority of federal court needs. For state courts, plan to build and maintain court-specific scrapers. It is not glamorous work, but court data is among the most valuable public data that exists for research, journalism, legal intelligence, and business use cases.