
How to Scrape Public Court Records: PACER, CourtListener & State Courts (2026)

Public court records are one of the most underutilized data sources in existence. They are comprehensive, authoritative, and in the United States specifically, they are public by constitutional principle. The concept of open courts — the idea that judicial proceedings must be accessible to citizens — has deep roots in American jurisprudence going back to the founding era. In practice, this means that an enormous quantity of high-value structured data is legally accessible to anyone who knows how to get it.

The tricky part is not legality. It is the fragmented, inconsistent, and often technically antiquated infrastructure you must navigate to actually retrieve the data. The federal court system uses PACER, a system built in the 1990s that charges $0.10 per page and has an interface that would feel at home on Windows 98. State courts are even more varied: some have modern REST APIs, some have basic HTML search forms, some require JavaScript-heavy navigation, and a few have essentially no online presence at all.
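Those per-page fees add up fast, so it is worth estimating costs before pulling documents. A quick sketch, assuming the published fee schedule (per-page charges are capped at $3.00 per document, and fees are waived entirely if your quarterly total is $30 or less; verify current rates on pacer.uscourts.gov):

```python
from typing import List

def estimate_pacer_cost(page_counts: List[int], per_page: float = 0.10,
                        per_doc_cap: float = 3.00) -> float:
    """Estimate PACER fees: $0.10 per page, capped per document.

    Note: a few document types (e.g. transcripts) are not capped.
    """
    return round(sum(min(pages * per_page, per_doc_cap) for pages in page_counts), 2)

# A 12-page motion, a 48-page opinion (the cap kicks in), and a 3-page order:
print(estimate_pacer_cost([12, 48, 3]))  # 4.5
```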

What makes court data valuable? Consider the range of questions it can answer: Which companies are involved in the most patent litigation? Which law firms have the highest win rates in appellate courts? What is the geographic distribution of bankruptcy filings over the past decade? Which industries are seeing the most wage theft cases? What is the average time from filing to resolution for civil rights cases? None of this is secret. All of it is in the public record. It simply requires the tools and patience to extract it.

This guide provides everything you need to work with court data in Python in 2026. We cover the CourtListener API (the best starting point for federal data), PACER and the RECAP archive, state court portal scraping, complete code for seven real-world use cases, proxy rotation with ThorData for high-volume access, error handling and retry logic, and output schemas for each data type. Every code example is working Python and handles the actual quirks of these systems.

The legal and ethical framework is straightforward: this data is public. Accessing public court records programmatically is lawful in the United States under the principle that government-held public information belongs to the public. The practical limits are PACER's terms of service (which restrict automated bulk downloading without explicit approval) and common sense — scraping a court portal so aggressively that you degrade service for attorneys trying to meet filing deadlines is antisocial at minimum and potentially a terms violation. CourtListener's API is the right tool for federal data precisely because it is designed for programmatic access.

Setup

pip install requests httpx beautifulsoup4 lxml tenacity fake-useragent

For state court scraping that requires JavaScript rendering:

pip install playwright
playwright install chromium

CourtListener API: The Best Starting Point

CourtListener is operated by the Free Law Project, a 501(c)(3) nonprofit that archives federal court opinions, oral arguments, and docket data. Their API is free for registered users, well-documented, and specifically designed for the kind of access this guide covers.

Get an API token: Register at courtlistener.com, go to your account settings, and generate a token. The free tier provides 5,000 requests per hour for authenticated users. That is generous for most research purposes.
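Hard-coding the token, as the snippets below do for brevity, is fine for experiments but risky in shared code. A safer pattern is to read it from an environment variable; the name COURTLISTENER_TOKEN here is our own convention, not an official one:

```python
import os
from typing import Optional

def courtlistener_headers(token: Optional[str] = None) -> dict:
    """Build the Authorization header, falling back to the
    COURTLISTENER_TOKEN environment variable when no token is passed."""
    token = token or os.environ.get("COURTLISTENER_TOKEN")
    if not token:
        raise RuntimeError("No token: pass one or set COURTLISTENER_TOKEN")
    return {"Authorization": f"Token {token}"}
```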

import requests
import time
from typing import Optional, List
from dataclasses import dataclass

API_BASE = "https://www.courtlistener.com/api/rest/v4"
TOKEN = "your_courtlistener_token"

HEADERS = {"Authorization": f"Token {TOKEN}"}

@dataclass
class CourtOpinion:
    case_name: str
    court: str
    date_filed: Optional[str]
    date_decided: Optional[str]
    docket_number: Optional[str]
    judge: Optional[str]
    status: str
    url: str
    download_url: Optional[str]
    citation_count: Optional[int]
    snippet: str

def search_opinions(query: str, court: Optional[str] = None,
                    date_after: Optional[str] = None, date_before: Optional[str] = None,
                    page_size: int = 20, max_results: int = 100) -> List[CourtOpinion]:
    """
    Search CourtListener for court opinions.

    Args:
        query: Full-text search query
        court: Court identifier (e.g., 'scotus', 'ca9', 'dcd')
        date_after: ISO date string (YYYY-MM-DD)
        date_before: ISO date string
        page_size: Results per page (max 100)
        max_results: Total results to retrieve
    """
    opinions = []
    page = 1

    params = {
        "q": query,
        "type": "o",
        "order_by": "dateFiled desc",
        "page_size": min(page_size, 100),
    }
    if court:
        params["court"] = court
    if date_after:
        params["filed_after"] = date_after
    if date_before:
        params["filed_before"] = date_before

    while len(opinions) < max_results:
        params["page"] = page
        resp = requests.get(f"{API_BASE}/search/", headers=HEADERS, params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()

        results = data.get("results", [])
        if not results:
            break

        for hit in results:
            opinions.append(CourtOpinion(
                case_name=hit.get("caseName", ""),
                court=hit.get("court", ""),
                date_filed=hit.get("dateFiled"),
                date_decided=hit.get("dateDecided"),
                docket_number=hit.get("docketNumber"),
                judge=hit.get("judge"),
                status=hit.get("status", ""),
                url=f"https://www.courtlistener.com{hit.get('absolute_url', '')}",
                download_url=hit.get("download_url"),
                citation_count=hit.get("citeCount"),
                snippet=hit.get("snippet", ""),
            ))

        # Stop if no more pages
        if not data.get("next"):
            break

        page += 1

        # Rate limit: 5000/hr = ~1.4/sec. Be conservative.
        import time
        time.sleep(0.8)

    return opinions[:max_results]


# Example: find all federal opinions about web scraping
opinions = search_opinions("web scraping hiQ LinkedIn", court="ca9", date_after="2020-01-01")
for op in opinions:
    print(f"{op.case_name} ({op.date_filed})")
    print(f"  Court: {op.court}, Docket: {op.docket_number}")
    print(f"  URL: {op.url}")
    print()

Fetching Full Opinion Text

import time

def get_opinion_text(cluster_id: int) -> dict:
    """
    Fetch full opinion text and metadata for a specific case cluster.
    Returns the case metadata and the text of each opinion in the cluster.
    """
    # Get cluster (case) metadata
    cluster_resp = requests.get(
        f"{API_BASE}/clusters/{cluster_id}/",
        headers=HEADERS,
        timeout=30
    )
    cluster_resp.raise_for_status()
    cluster = cluster_resp.json()

    # Fetch each opinion in the cluster
    opinions = []
    for opinion_url in cluster.get("sub_opinions", []):
        op_id = opinion_url.rstrip("/").split("/")[-1]
        op_resp = requests.get(
            f"{API_BASE}/opinions/{op_id}/",
            headers=HEADERS,
            timeout=30
        )
        if op_resp.status_code == 200:
            op_data = op_resp.json()
            opinions.append({
                "type": op_data.get("type"),
                "author": op_data.get("author_str"),
                "text_plain": op_data.get("plain_text", ""),
                "text_html": op_data.get("html", ""),
                "page_count": op_data.get("page_count"),
            })
        time.sleep(0.5)

    return {
        "case_name": cluster.get("case_name"),
        "date_decided": cluster.get("date_filed"),
        "court": cluster.get("court"),
        "docket_number": cluster.get("docket_number"),
        "citation_count": cluster.get("citation_count"),
        "attorneys": cluster.get("attorneys"),
        "opinions": opinions,
    }


def search_and_download_opinions(query: str, output_dir: str = "opinions", max_cases: int = 50):
    """Search for opinions and download their full text."""
    import os
    import json

    os.makedirs(output_dir, exist_ok=True)

    results = search_opinions(query, max_results=max_cases)
    print(f"Found {len(results)} opinions matching '{query}'")

    for i, opinion in enumerate(results):
        # Extract the cluster ID from a URL like .../opinion/12345/case-slug/
        import re
        id_match = re.search(r"/opinion/(\d+)/", opinion.url or "")
        if not id_match:
            continue
        cluster_id_match = id_match.group(1)

        filepath = os.path.join(output_dir, f"{cluster_id_match}.json")
        if os.path.exists(filepath):
            print(f"[{i+1}/{len(results)}] Cached: {opinion.case_name}")
            continue

        try:
            full_data = get_opinion_text(int(cluster_id_match))
            with open(filepath, "w", encoding="utf-8") as f:
                json.dump(full_data, f, ensure_ascii=False, indent=2)
            print(f"[{i+1}/{len(results)}] Saved: {opinion.case_name}")
        except Exception as e:
            print(f"[{i+1}/{len(results)}] Error for {opinion.case_name}: {e}")

        time.sleep(1.0)

PACER and the RECAP Archive

PACER is the official federal court document system. Direct automated access to PACER is restricted by their terms of service — bulk downloading without authorization is explicitly prohibited. However, the RECAP project has created a community-built free mirror of PACER content.

RECAP works as a browser extension: whenever a user with the extension purchases a document on PACER, it is automatically uploaded to CourtListener's archive. Over the years this crowdsourcing has accumulated a large share of the most frequently requested federal court documents. If a document is already in the RECAP archive, you can retrieve it via the CourtListener API at no cost.
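In practice, availability works like this: the docket-entry code below surfaces an is_available flag and a filepath_local path for each document, and an available path can be turned into a direct download URL. A sketch, with the caveat that the storage host is an observed implementation detail of CourtListener rather than a documented contract:

```python
from typing import Optional

def recap_pdf_url(filepath_local: Optional[str]) -> Optional[str]:
    """Build a direct download URL for an archived RECAP document from its
    filepath_local value (e.g. 'recap/gov.uscourts.nysd.12345/doc.1.pdf')."""
    if not filepath_local:
        return None
    return "https://storage.courtlistener.com/" + filepath_local.lstrip("/")
```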

import requests
import time
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class DocketEntry:
    entry_number: Optional[int]
    date_filed: Optional[str]
    description: str
    documents: List[dict] = field(default_factory=list)

@dataclass
class Docket:
    case_name: str
    court: str
    docket_number: str
    date_filed: Optional[str]
    date_terminated: Optional[str]
    assigned_to: Optional[str]
    cause: Optional[str]
    nature_of_suit: Optional[str]
    jury_demand: Optional[str]
    entry_count: int
    pacer_case_id: Optional[str]
    idb_data: dict = field(default_factory=dict)

def search_dockets(case_name: Optional[str] = None, docket_number: Optional[str] = None,
                   court: Optional[str] = None, nature_of_suit: Optional[str] = None,
                   date_filed_after: Optional[str] = None, max_results: int = 50) -> List[Docket]:
    """Search RECAP archive for federal court dockets."""
    params = {"page_size": min(max_results, 100)}

    if case_name:
        params["case_name"] = case_name
    if docket_number:
        params["docket_number"] = docket_number
    if court:
        params["court"] = court
    if nature_of_suit:
        params["nature_of_suit"] = nature_of_suit
    if date_filed_after:
        params["date_filed__gte"] = date_filed_after

    resp = requests.get(
        "https://www.courtlistener.com/api/rest/v4/dockets/",
        headers=HEADERS,
        params=params,
        timeout=30
    )
    resp.raise_for_status()
    data = resp.json()

    dockets = []
    for item in data.get("results", []):
        dockets.append(Docket(
            case_name=item.get("case_name", ""),
            court=item.get("court_id", ""),
            docket_number=item.get("docket_number", ""),
            date_filed=item.get("date_filed"),
            date_terminated=item.get("date_terminated"),
            assigned_to=item.get("assigned_to_str"),
            cause=item.get("cause"),
            nature_of_suit=item.get("nature_of_suit"),
            jury_demand=item.get("jury_demand"),
            entry_count=item.get("entry_count", 0) or 0,
            pacer_case_id=item.get("pacer_case_id"),
        ))

    return dockets


def get_docket_entries(docket_id: int, max_entries: int = 200) -> List[DocketEntry]:
    """Fetch docket entries for a specific case."""
    entries = []
    page = 1

    while len(entries) < max_entries:
        resp = requests.get(
            "https://www.courtlistener.com/api/rest/v4/docket-entries/",
            headers=HEADERS,
            params={"docket": docket_id, "page": page, "page_size": 50},
            timeout=30
        )
        resp.raise_for_status()
        data = resp.json()

        for item in data.get("results", []):
            docs = []
            for doc in item.get("recap_documents", []):
                docs.append({
                    "document_number": doc.get("document_number"),
                    "description": doc.get("description"),
                    "is_available": doc.get("is_available"),
                    "file_size": doc.get("file_size"),
                    "filepath_local": doc.get("filepath_local"),
                })

            entries.append(DocketEntry(
                entry_number=item.get("entry_number"),
                date_filed=item.get("date_filed"),
                description=item.get("description", ""),
                documents=docs,
            ))

        if not data.get("next"):
            break
        page += 1
        time.sleep(0.5)

    return entries[:max_entries]
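The nature_of_suit values on these dockets come from the federal civil cover sheet (form JS 44), which pairs a numeric code with a label, e.g. "830 Patent". A small lookup helps when grouping cases; the handful below match the JS 44 form, but consult the current form for the full list:

```python
# A few common nature-of-suit codes from the federal civil cover sheet (JS 44).
NATURE_OF_SUIT_CODES = {
    "442": "Civil Rights: Employment",
    "710": "Fair Labor Standards Act",
    "820": "Copyright",
    "830": "Patent",
    "840": "Trademark",
}

def nos_label(nature_of_suit: str) -> str:
    """Map a raw nature_of_suit string (code, label, or both) to a clean label."""
    text = (nature_of_suit or "").strip()
    for code, label in NATURE_OF_SUIT_CODES.items():
        if text.startswith(code):
            return label
    return text or "unknown"
```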

State Court Scraping with Proxy Rotation

State courts are where production scraping gets challenging. Each state has its own portal, authentication model, and anti-bot defenses. Many state court portals use aggressive IP-based rate limiting — they serve local attorneys making individual lookups and are not designed for bulk access.

This makes residential proxy rotation essential for any serious state court data collection. ThorData provides rotating residential proxies that route your traffic through real consumer IPs, making your requests appear identical to a local attorney checking case status from home.
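The distinction between per-request rotation and a sticky session lives entirely in the proxy username. A minimal sketch of both forms; the username syntax mirrors the session helper below, so confirm the exact format against your provider's dashboard before relying on it:

```python
import random

def thordata_proxy_url(user: str, password: str, country: str = "US",
                       sticky: bool = False) -> str:
    """Build a ThorData gateway URL. With sticky=True, a session ID pins one
    exit IP for the session's lifetime; without it, the IP rotates per request."""
    username = f"{user}-country-{country}"
    if sticky:
        username += f"-session-sc{random.randint(100000, 999999)}"
    return f"http://{username}:{password}@proxy.thordata.com:9000"
```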

import requests
from bs4 import BeautifulSoup
import time
import random
from typing import Optional, List, Generator
from dataclasses import dataclass

@dataclass
class StateCaseRecord:
    case_number: str
    case_type: str
    parties: dict
    date_filed: Optional[str]
    status: str
    judge: Optional[str]
    events: List[dict]
    court_name: str
    state: str
    source_url: str


def create_state_court_session(thordata_user: str, thordata_pass: str,
                                 country: str = "US") -> requests.Session:
    """Create a scraping session for state court portals."""
    session_id = random.randint(100000, 999999)
    proxy_url = f"http://{thordata_user}-country-{country}-session-sc{session_id}:{thordata_pass}@proxy.thordata.com:9000"

    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    })
    session.proxies = {"http": proxy_url, "https": proxy_url}
    return session


def scrape_state_court_portal(
    portal_url: str,
    search_params: dict,
    state: str,
    session: requests.Session,
    max_pages: int = 10
) -> Generator[StateCaseRecord, None, None]:
    """
    Generic state court scraper. Adapts to common portal patterns.
    Yields StateCaseRecord objects as they are extracted.
    """

    for page in range(1, max_pages + 1):
        params = {**search_params, "page": page}

        try:
            resp = session.get(portal_url, params=params, timeout=30)

            if resp.status_code == 429:
                retry_after = int(resp.headers.get("Retry-After", 60))
                print(f"Rate limited — waiting {retry_after}s")
                time.sleep(retry_after)
                continue

            if resp.status_code in (403, 503):
                print(f"Blocked (HTTP {resp.status_code}) — IP may need rotation")
                time.sleep(random.uniform(30, 60))
                break

            resp.raise_for_status()

        except requests.RequestException as e:
            print(f"Request error on page {page}: {e}")
            time.sleep(random.uniform(5, 15))
            continue

        soup = BeautifulSoup(resp.text, "lxml")

        # Detect CAPTCHA
        if soup.find("div", id="captcha") or soup.find("div", class_="g-recaptcha"):
            print("CAPTCHA detected — stopping (try slower rate or manual lookup)")
            break

        # Extract case records from table (common pattern for court portals)
        records_found = 0

        # Try common table structures used by state court portals
        table = (soup.select_one("table#search-results") or 
                 soup.select_one("table.case-list") or
                 soup.select_one("table.results") or
                 soup.find("table", attrs={"summary": lambda s: s and "case" in s.lower()}) or
                 soup.find("table"))

        if table:
            rows = table.find_all("tr")[1:]  # Skip header
            for row in rows:
                cells = row.find_all("td")
                if not cells or len(cells) < 2:
                    continue

                case_link = cells[0].find("a")
                case_number_raw = cells[0].get_text(strip=True)
                case_number = parse_case_number(case_number_raw)

                party_text = cells[1].get_text(strip=True) if len(cells) > 1 else ""
                parties = parse_party_names(party_text)

                date_filed = cells[2].get_text(strip=True) if len(cells) > 2 else None
                status = cells[3].get_text(strip=True) if len(cells) > 3 else "Unknown"
                judge = cells[4].get_text(strip=True) if len(cells) > 4 else None

                detail_url = ""
                if case_link:
                    href = case_link.get("href", "")
                    if href.startswith("http"):
                        detail_url = href
                    elif href:
                        from urllib.parse import urljoin
                        detail_url = urljoin(portal_url, href)

                yield StateCaseRecord(
                    case_number=case_number,
                    case_type=infer_case_type(case_number),
                    parties=parties,
                    date_filed=date_filed,
                    status=status,
                    judge=judge,
                    events=[],
                    court_name="",
                    state=state,
                    source_url=detail_url or portal_url,
                )
                records_found += 1

        if records_found == 0:
            print(f"No records found on page {page} — stopping pagination")
            break

        # Check for next page link
        next_link = (soup.select_one("a[aria-label='Next Page']") or
                     soup.select_one("a.pagination-next") or
                     soup.find("a", string=lambda s: s and "Next" in s))

        if not next_link:
            break

        # Human-like delay between pages
        time.sleep(random.uniform(2.0, 5.0))


def scrape_case_detail(case_url: str, session: requests.Session, state: str) -> dict:
    """Fetch detailed case information from a case detail page."""
    resp = session.get(case_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    events = []

    # Docket/event table — extremely common pattern
    event_table = (soup.select_one("table#docket-entries") or
                   soup.select_one("table.event-list") or
                   soup.select_one("table#events"))

    if event_table:
        for row in event_table.find_all("tr")[1:]:
            cells = row.find_all("td")
            if len(cells) >= 2:
                events.append({
                    "date": cells[0].get_text(strip=True),
                    "event": cells[1].get_text(strip=True),
                    "document_url": cells[2].find("a").get("href") if len(cells) > 2 and cells[2].find("a") else None,
                })

    # Case metadata — often in definition lists or key-value pairs
    metadata = {}
    for dl in soup.find_all("dl"):
        dts = dl.find_all("dt")
        dds = dl.find_all("dd")
        for dt, dd in zip(dts, dds):
            key = dt.get_text(strip=True).lower().replace(" ", "_").replace(":", "")
            metadata[key] = dd.get_text(strip=True)

    return {"events": events, "metadata": metadata}

Data Normalization Utilities

Court records use inconsistent formats across jurisdictions. These utilities normalize the most common patterns:

import re
from typing import Optional

def parse_case_number(raw: str) -> str:
    """Normalize case numbers across court systems."""
    clean = re.sub(r"\s+", " ", raw.strip())

    # Federal format: "2:24-cv-01234" or "24-cv-1234"
    federal_match = re.search(
        r"(\d+:)?(\d{2})[- ](cv|cr|mc|ap|bk|adv)[- ](\d+)",
        clean, re.IGNORECASE
    )
    if federal_match:
        year = federal_match.group(2)
        case_type = federal_match.group(3).lower()
        seq = federal_match.group(4).zfill(5)
        prefix = federal_match.group(1) or ""
        return f"{prefix}{year}-{case_type}-{seq}"

    # State formats (e.g. "CV-2024-001234", "2024-L-001234") vary too widely
    # across jurisdictions to normalize safely, so return the cleaned string.
    return clean


def parse_party_names(raw: str) -> dict:
    """Split case title into plaintiff/defendant."""
    separators = [" v. ", " vs. ", " vs ", " V. ", " V ", " -v- ", " v "]

    for sep in separators:
        if sep.lower() in raw.lower():
            idx = raw.lower().find(sep.lower())
            plaintiff = raw[:idx].strip()
            defendant = raw[idx + len(sep):].strip()
            return {
                "plaintiff": plaintiff,
                "defendant": defendant,
                "raw": raw,
            }

    return {"plaintiff": raw.strip(), "defendant": None, "raw": raw}


def infer_case_type(case_number: str) -> str:
    """Infer case type from case number prefix."""
    case_lower = case_number.lower()

    type_map = {
        "cv": "civil",
        "cr": "criminal",
        "bk": "bankruptcy",
        "ap": "adversary_proceeding",
        "mc": "miscellaneous",
        "l": "civil_law",  # Common in some states
        "d": "divorce",
        "dr": "domestic_relations",
        "f": "felony",
        "m": "misdemeanor",
        "jv": "juvenile",
        "p": "probate",
        "pc": "probate",
    }

    for prefix, type_name in type_map.items():
        if re.search(rf"\b{prefix}\b", case_lower):
            return type_name

    return "unknown"


def normalize_date(raw: str) -> Optional[str]:
    """Convert various date formats to ISO 8601."""
    if not raw:
        return None

    raw = raw.strip()

    # ISO format — already correct
    if re.match(r"\d{4}-\d{2}-\d{2}", raw):
        return raw[:10]

    # MM/DD/YYYY
    match = re.match(r"(\d{1,2})/(\d{1,2})/(\d{4})", raw)
    if match:
        month, day, year = match.groups()
        return f"{year}-{month.zfill(2)}-{day.zfill(2)}"

    # Month DD, YYYY
    month_names = {
        "january": "01", "february": "02", "march": "03", "april": "04",
        "may": "05", "june": "06", "july": "07", "august": "08",
        "september": "09", "october": "10", "november": "11", "december": "12",
        "jan": "01", "feb": "02", "mar": "03", "apr": "04",
        "jun": "06", "jul": "07", "aug": "08", "sep": "09",
        "oct": "10", "nov": "11", "dec": "12",
    }
    match = re.match(r"(\w+)\s+(\d{1,2}),?\s+(\d{4})", raw, re.IGNORECASE)
    if match:
        month_str, day, year = match.groups()
        month_num = month_names.get(month_str.lower())
        if month_num:
            return f"{year}-{month_num}-{day.zfill(2)}"

    return raw  # Return as-is if no pattern matches

Error Handling and Retry Logic

Court portal scraping requires robust error handling. State portals are often slow, go down for maintenance, and have aggressive rate limiting:

import requests
import time
import random
import logging
from functools import wraps
from typing import Optional, Callable, Any
from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


def retry_with_backoff(
    max_attempts: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 120.0,
    rotate_proxy_on: tuple = (403, 429, 503),
):
    """
    Decorator for retrying court portal requests with exponential backoff.
    Optionally rotates proxy on specific HTTP status codes.
    """
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs) -> Any:
            last_exception = None

            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)

                except requests.HTTPError as e:
                    status_code = e.response.status_code if e.response else 0
                    last_exception = e

                    if status_code in rotate_proxy_on:
                        logger.warning(f"HTTP {status_code} on attempt {attempt} — needs proxy rotation")
                        # Signal to caller that proxy rotation is needed
                        if attempt == max_attempts:
                            raise

                    if status_code in (404, 410):
                        # Don't retry — resource doesn't exist
                        raise

                    # Respect Retry-After header
                    if status_code == 429:
                        retry_after = int(e.response.headers.get("Retry-After", 60))
                        logger.info(f"Rate limited — waiting {retry_after}s")
                        time.sleep(retry_after)
                        continue

                except (requests.ConnectionError, requests.Timeout) as e:
                    last_exception = e
                    logger.warning(f"Network error on attempt {attempt}: {e}")

                except Exception as e:
                    last_exception = e
                    logger.warning(f"Error on attempt {attempt}: {e}")

                if attempt < max_attempts:
                    delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
                    jitter = random.uniform(0, delay * 0.3)
                    sleep_time = delay + jitter
                    logger.info(f"Retrying in {sleep_time:.1f}s (attempt {attempt}/{max_attempts})")
                    time.sleep(sleep_time)

            raise last_exception

        return wrapper
    return decorator


@retry_with_backoff(max_attempts=4, base_delay=3.0, max_delay=60.0)
def fetch_court_page(url: str, session: requests.Session,
                     params: Optional[dict] = None) -> BeautifulSoup:
    """Fetch a court portal page with retry logic."""
    resp = session.get(url, params=params, timeout=45)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "lxml")

    # Detect maintenance pages
    title = soup.find("title")
    if title:
        title_text = title.string.lower() if title.string else ""
        if any(kw in title_text for kw in ["maintenance", "unavailable", "down for"]):
            raise requests.ConnectionError("Court portal is under maintenance")

    # Detect session expiry
    if soup.find("div", id="session-expired") or "session has expired" in resp.text.lower():
        raise requests.HTTPError("Session expired", response=resp)

    return soup


def scrape_with_checkpointing(case_numbers: list, session: requests.Session,
                                portal_url: str, checkpoint_file: str = "progress.json"):
    """
    Scrape case details with checkpointing to resume interrupted jobs.
    Saves progress after each successful fetch.
    """
    import json
    import os

    # Load existing progress
    completed = {}
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file, "r") as f:
            completed = json.load(f)
        print(f"Resuming: {len(completed)} cases already processed")

    results = dict(completed)

    for i, case_number in enumerate(case_numbers):
        if case_number in completed:
            continue

        try:
            soup = fetch_court_page(portal_url, session, params={"case": case_number})
            case_data = extract_case_from_soup(soup)
            results[case_number] = case_data

            # Save checkpoint after each successful fetch
            with open(checkpoint_file, "w") as f:
                json.dump(results, f)

            print(f"[{i+1}/{len(case_numbers)}] ✓ {case_number}")

        except Exception as e:
            results[case_number] = {"error": str(e)}
            print(f"[{i+1}/{len(case_numbers)}] ✗ {case_number}: {e}")

        time.sleep(random.uniform(1.5, 4.0))

    return results


def extract_case_from_soup(soup: BeautifulSoup) -> dict:
    """Extract case data from a parsed court page."""
    return {
        "title": soup.find("h1").get_text(strip=True) if soup.find("h1") else None,
    }

Seven Real-World Use Cases with Complete Code

Use Case 1: Litigation Monitor for Companies

import requests
import json
import time
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LitigationAlert:
    company_name: str
    case_name: str
    court: str
    date_filed: str
    case_type: str
    nature_of_suit: Optional[str]
    url: str
    docket_number: str

def monitor_company_litigation(company_names: List[str],
                               since_date: str,
                               output_file: str = "litigation_alerts.jsonl") -> List[LitigationAlert]:
    """
    Monitor for new litigation involving specified companies.
    Checks CourtListener for new federal cases.
    """
    alerts = []

    for company in company_names:
        print(f"Checking litigation for: {company}")

        # Docket search is the right source for monitoring new filings;
        # opinion search would only surface cases after a decision issues.
        dockets = search_dockets(
            case_name=company,
            date_filed_after=since_date,
            max_results=50
        )

        for docket in dockets:
            alert = LitigationAlert(
                company_name=company,
                case_name=docket.case_name,
                court=docket.court,
                date_filed=docket.date_filed or "",
                case_type=infer_case_type(docket.docket_number or ""),
                nature_of_suit=docket.nature_of_suit,
                url=f"https://www.courtlistener.com/docket/{docket.pacer_case_id}/",
                docket_number=docket.docket_number or "",
            )
            alerts.append(alert)

        time.sleep(1.0)

    # Save to JSONL
    with open(output_file, "a") as f:
        for alert in alerts:
            f.write(json.dumps({
                "company": alert.company_name,
                "case_name": alert.case_name,
                "court": alert.court,
                "date_filed": alert.date_filed,
                "type": alert.case_type,
                "nature_of_suit": alert.nature_of_suit,
                "url": alert.url,
                "docket_number": alert.docket_number,
            }) + "\n")

    return alerts

Output schema:

{
  "company": "Acme Corp",
  "case_name": "Smith v. Acme Corp",
  "court": "nysd",
  "date_filed": "2026-03-15",
  "type": "civil",
  "nature_of_suit": "Employment Discrimination",
  "url": "https://www.courtlistener.com/docket/12345678/",
  "docket_number": "1:26-cv-01234"
}
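Because monitor_company_litigation appends to the JSONL file on every run, repeated runs will re-record the same cases. A small helper can load previously recorded docket numbers so only genuinely new alerts are written. This is a sketch; the load_seen_dockets name is ours, not part of any API:

```python
import json
import os

def load_seen_dockets(jsonl_path: str) -> set:
    """Collect docket numbers already written to a litigation_alerts JSONL file."""
    seen = set()
    if not os.path.exists(jsonl_path):
        return seen
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                seen.add(json.loads(line).get("docket_number"))
            except json.JSONDecodeError:
                continue  # tolerate partially written lines
    return seen
```

Before writing, filter with `[a for a in alerts if a.docket_number not in seen]`.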

Use Case 2: Bankruptcy Filing Tracker

from dataclasses import dataclass
from typing import Optional, List

@dataclass
class BankruptcyFiling:
    case_number: str
    debtor_name: str
    chapter: int  # 7, 11, 13, etc.
    date_filed: str
    date_closed: Optional[str]
    court: str
    trustee: Optional[str]
    assets: Optional[str]
    liabilities: Optional[str]
    creditor_count: Optional[int]

def fetch_bankruptcy_filings(court: str = "almb", date_after: str = "2026-01-01",
                             max_results: int = 200) -> List[BankruptcyFiling]:
    """
    Fetch recent bankruptcy filings from CourtListener RECAP archive.
    court: Bankruptcy court identifier (e.g., 'almb', 'caeb', 'nysb')
    """
    params = {
        "court": court,
        "date_filed__gte": date_after,
        "page_size": 100,
        "order_by": "-date_filed",
    }

    filings = []
    url = "https://www.courtlistener.com/api/rest/v4/dockets/"

    # The v4 API uses cursor pagination: follow the `next` URL instead of
    # incrementing a page number.
    while url and len(filings) < max_results:
        resp = requests.get(url, headers=HEADERS, params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()

        for item in data.get("results", []):
            docket_num = item.get("docket_number", "")

            # Heuristic: chapter markers sometimes appear in the caption or
            # cause text; the authoritative chapter lives on the docket itself
            chapter = 7
            caption = ((item.get("case_name") or "") + " " + (item.get("cause") or "")).lower()
            if "chapter 11" in caption or "ch11" in docket_num.lower():
                chapter = 11
            elif "chapter 13" in caption or "ch13" in docket_num.lower():
                chapter = 13

            # Bankruptcy cases are captioned "In re <Debtor>", not "X v. Y"
            case_name = item.get("case_name") or ""
            debtor = case_name[6:].strip() if case_name.lower().startswith("in re ") else case_name

            filings.append(BankruptcyFiling(
                case_number=docket_num,
                debtor_name=debtor,
                chapter=chapter,
                date_filed=item.get("date_filed", ""),
                date_closed=item.get("date_terminated"),
                court=item.get("court_id", ""),
                trustee=item.get("assigned_to_str"),
                assets=None,
                liabilities=None,
                creditor_count=None,
            ))

        url = data.get("next")
        params = None  # the `next` URL already embeds the query string
        time.sleep(0.8)

    return filings[:max_results]
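The chapter detection above is a substring heuristic, and real bankruptcy docket numbers look like `24-12345`; they rarely embed a chapter marker. A regex over the caption and cause text is slightly sturdier, though still a guess (the authoritative chapter lives in the docket record itself, and infer_chapter is our helper name):

```python
import re

def infer_chapter(text: str, default: int = 7) -> int:
    """Heuristically extract a bankruptcy chapter (7, 9, 11, 12, 13, 15)
    from free text such as a case caption or cause string."""
    if not text:
        return default
    match = re.search(r"(?:chapter|ch\.?)\s*(7|9|11|12|13|15)\b", text, re.IGNORECASE)
    return int(match.group(1)) if match else default
```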

Use Case 3: Patent Litigation Intelligence

def fetch_patent_cases(assignee: Optional[str] = None,
                       patent_number: Optional[str] = None,
                       date_after: str = "2024-01-01") -> list:
    """
    Find patent infringement cases. Useful for competitive intelligence
    and patent validity research.
    """

    dockets = search_dockets(
        nature_of_suit="830",  # Patent nature of suit code
        date_filed_after=date_after,
        max_results=100
    )

    patent_cases = []
    for docket in dockets:
        # Post-filter against the caption. Patent numbers rarely appear in
        # case names; full-text matching needs the opinion/document search.
        name = (docket.case_name or "").lower()
        if assignee and assignee.lower() not in name:
            continue
        if patent_number and patent_number.lower() not in name:
            continue

        patent_cases.append({
            "case_name": docket.case_name,
            "court": docket.court,
            "docket_number": docket.docket_number,
            "date_filed": docket.date_filed,
            "status": "terminated" if docket.date_terminated else "active",
            "judge": docket.assigned_to,
            "entry_count": docket.entry_count,
        })

    return patent_cases


# Output schema:
# {
#   "case_name": "TechCorp v. InnovateCo",
#   "court": "cacd",
#   "docket_number": "2:26-cv-01234",
#   "date_filed": "2026-01-15",
#   "status": "active",
#   "judge": "Hon. Jane Smith",
#   "entry_count": 42
# }
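With the flat dicts returned above, venue concentration is a one-liner with collections.Counter. That is a standard first question in patent analytics, since filings historically cluster in a few districts. The summarize_patent_venues name is ours:

```python
from collections import Counter

def summarize_patent_venues(patent_cases: list, top_n: int = 5) -> list:
    """Rank courts by patent-case volume, e.g. [('txed', 12), ('ded', 9)]."""
    counts = Counter(c["court"] for c in patent_cases if c.get("court"))
    return counts.most_common(top_n)
```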

Use Case 4: Employment Discrimination Case Database

def build_employment_discrimination_database(
    courts: Optional[List[str]] = None,
    date_after: str = "2023-01-01",
    max_per_court: int = 500
) -> list:
    """
    Build a database of employment discrimination cases for research.
    Nature of suit codes: 442 (Civil Rights - Employment),
    445 (Americans with Disabilities - Employment)
    """
    if courts is None:
        # Major district courts
        courts = ["nysd", "cacd", "ilnd", "txsd", "gand"]

    all_cases = []

    for court in courts:
        print(f"Fetching from {court}...")

        for nos_code in ["442", "445"]:
            dockets = search_dockets(
                court=court,
                nature_of_suit=nos_code,
                date_filed_after=date_after,
                max_results=max_per_court
            )

            for docket in dockets:
                all_cases.append({
                    "court": court,
                    "case_name": docket.case_name,
                    "docket_number": docket.docket_number,
                    "date_filed": docket.date_filed,
                    "date_terminated": docket.date_terminated,
                    "nature_of_suit": nos_code,
                    "nature_of_suit_desc": "Civil Rights - Employment" if nos_code == "442" else "ADA - Employment",
                    "judge": docket.assigned_to,
                    "resolved": docket.date_terminated is not None,
                    "entry_count": docket.entry_count,
                })

        time.sleep(2.0)

    return all_cases
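The resolved flag and date fields make simple outcome statistics easy to compute from the result list. A sketch, assuming the ISO date strings produced above (resolution_stats is our name):

```python
from datetime import date

def resolution_stats(cases: list) -> dict:
    """Resolution rate and median days from filing to termination."""
    resolved = [c for c in cases if c.get("date_filed") and c.get("date_terminated")]
    durations = sorted(
        (date.fromisoformat(c["date_terminated"]) - date.fromisoformat(c["date_filed"])).days
        for c in resolved
    )
    n = len(durations)
    if n == 0:
        median = None
    elif n % 2:
        median = durations[n // 2]
    else:
        median = (durations[n // 2 - 1] + durations[n // 2]) / 2
    return {
        "total": len(cases),
        "resolved": n,
        "resolution_rate": round(n / len(cases), 3) if cases else 0.0,
        "median_days_to_resolution": median,
    }
```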

Use Case 5: Real-Time Docket Monitor

import json
import os
from datetime import datetime

class DocketMonitor:
    """
    Monitor specific federal cases for new docket entries.
    Useful for tracking active litigation involving your interests.
    """

    def __init__(self, watchlist_file: str = "watchlist.json",
                 state_file: str = "docket_state.json"):
        self.watchlist_file = watchlist_file
        self.state_file = state_file
        self.watchlist = self._load_watchlist()
        self.state = self._load_state()

    def _load_watchlist(self) -> list:
        if os.path.exists(self.watchlist_file):
            with open(self.watchlist_file) as f:
                return json.load(f)
        return []

    def _load_state(self) -> dict:
        if os.path.exists(self.state_file):
            with open(self.state_file) as f:
                return json.load(f)
        return {}

    def _save_state(self):
        with open(self.state_file, "w") as f:
            json.dump(self.state, f, indent=2)

    def add_case(self, docket_id: int, case_name: str, description: str = ""):
        """Add a case to the watchlist."""
        self.watchlist.append({
            "docket_id": docket_id,
            "case_name": case_name,
            "description": description,
            "added_at": datetime.utcnow().isoformat(),
        })
        with open(self.watchlist_file, "w") as f:
            json.dump(self.watchlist, f, indent=2)
        print(f"Added to watchlist: {case_name} (docket {docket_id})")

    def check_for_updates(self) -> List[dict]:
        """Check all watched cases for new docket entries."""
        new_entries = []

        for case in self.watchlist:
            docket_id = case["docket_id"]
            case_key = str(docket_id)

            try:
                entries = get_docket_entries(docket_id, max_entries=50)

                if not entries:
                    continue

                # Find entries newer than our last check
                last_seen = self.state.get(case_key, {}).get("last_entry_date")

                for entry in entries:
                    if entry.date_filed and (not last_seen or entry.date_filed > last_seen):
                        new_entries.append({
                            "case_name": case["case_name"],
                            "docket_id": docket_id,
                            "entry_number": entry.entry_number,
                            "date_filed": entry.date_filed,
                            "description": (entry.description or "")[:200],
                            "has_documents": len(entry.documents) > 0,
                        })

                # Update state
                if entries:
                    dates = [e.date_filed for e in entries if e.date_filed]
                    if dates:
                        self.state[case_key] = {
                            "last_entry_date": max(dates),
                            "last_checked": datetime.utcnow().isoformat(),
                            "entry_count": len(entries),
                        }

                time.sleep(1.0)

            except Exception as e:
                print(f"Error checking {case['case_name']}: {e}")

        self._save_state()
        return new_entries


# Usage
monitor = DocketMonitor()
monitor.add_case(12345678, "Smith v. TechCorp Inc", "Employment discrimination case")

updates = monitor.check_for_updates()
for update in updates:
    print(f"New filing in {update['case_name']}:")
    print(f"  Entry #{update['entry_number']} filed {update['date_filed']}")
    print(f"  {update['description']}")
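If the monitor feeds an email or Slack notifier, a small formatter keeps the message shape in one place. The format_alert name is ours; the dict keys match what check_for_updates returns:

```python
def format_alert(update: dict) -> str:
    """Render one docket update as a single-line notification message."""
    doc_flag = " [docs attached]" if update.get("has_documents") else ""
    return (f"{update['case_name']}: entry #{update['entry_number']} "
            f"filed {update['date_filed']}{doc_flag} - {update['description']}")
```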

Use Case 6: State Court Eviction Data Collector

def collect_eviction_data(state: str, county: str,
                          thordata_user: str, thordata_pass: str,
                          date_range: tuple) -> List[StateCaseRecord]:
    """
    Collect eviction filing data from state court portals.
    Eviction data is public record in all US states.
    Useful for housing researchers, tenant advocates, journalists.

    Note: Portal URLs and selectors vary by state — this is a template.
    Consult your state court's online access portal for the actual endpoint.
    """
    session = create_state_court_session(thordata_user, thordata_pass)

    # Common state portal URL patterns (adapt per state)
    portal_urls = {
        "FL": "https://myeclerk.myorangeclerk.com/Cases/Search",
        "TX": "https://www.txcourts.gov/court-search/",
        "CA": "https://www.lacourt.org/casesummary/ui/index.aspx",
        "NY": "https://iapps.courts.state.ny.us/webcivil/FCASMain",
    }

    portal_url = portal_urls.get(state.upper())
    if not portal_url:
        raise ValueError(f"No portal URL configured for state: {state}")

    records = list(scrape_state_court_portal(
        portal_url=portal_url,
        search_params={
            "county": county,
            "case_type": "eviction",
            "date_from": date_range[0],
            "date_to": date_range[1],
        },
        state=state,
        session=session,
        max_pages=20,
    ))

    print(f"Collected {len(records)} eviction records from {county}, {state}")
    return records
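Eviction research usually starts with filing volume over time. Assuming each record exposes an ISO-format filing date (the field name on StateCaseRecord may differ in your earlier section), monthly bucketing is straightforward; filings_per_month is our helper name:

```python
from collections import Counter

def filings_per_month(filing_dates) -> dict:
    """Bucket ISO dates (YYYY-MM-DD) into sorted per-month counts."""
    counts = Counter(d[:7] for d in filing_dates if d and len(d) >= 7)
    return dict(sorted(counts.items()))
```

Call it with something like `filings_per_month(r.date_filed for r in records)`.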

Use Case 7: Federal Regulatory Enforcement Tracker

def track_regulatory_enforcement(agency_keywords: Optional[List[str]] = None,
                                 court: Optional[str] = None,
                                 date_after: str = "2025-01-01") -> list:
    """
    Track federal regulatory enforcement actions.
    Covers SEC, FTC, DOJ, EPA, CFPB, and other agency cases.
    """
    if agency_keywords is None:
        agency_keywords = ["SEC", "FTC", "DOJ", "EPA", "CFPB", "FDA", "CFTC"]

    enforcement_cases = []

    for agency in agency_keywords:
        print(f"Searching for {agency} enforcement actions...")

        # Agency suits are typically captioned "<Agency> v. <Defendant>";
        # DOJ matters show up as "United States v. ...", so searching the
        # literal abbreviation will miss those.
        dockets = search_dockets(
            case_name=agency,
            court=court,
            date_filed_after=date_after,
            max_results=100
        )

        for docket in dockets:
            cause = (docket.cause or "").lower()
            name = (docket.case_name or "").lower()
            if agency.lower() not in cause and agency.lower() not in name:
                continue
            enforcement_cases.append({
                "agency": agency,
                "case_name": docket.case_name,
                "court": docket.court,
                "docket_number": docket.docket_number,
                "date_filed": docket.date_filed,
                "cause": docket.cause,
                "nature_of_suit": docket.nature_of_suit,
                "status": "terminated" if docket.date_terminated else "active",
                "type": "docket",
            })

        time.sleep(1.5)

    return enforcement_cases
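The flat case dicts summarize naturally by agency and status, e.g. for a dashboard or weekly digest. A sketch (enforcement_summary is our name):

```python
from collections import defaultdict

def enforcement_summary(cases: list) -> dict:
    """Per-agency counts of active vs. terminated enforcement matters."""
    summary = defaultdict(lambda: {"active": 0, "terminated": 0})
    for case in cases:
        bucket = "active" if case.get("status") == "active" else "terminated"
        summary[case["agency"]][bucket] += 1
    return dict(summary)
```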

Output Schema Reference

from dataclasses import dataclass, field
from typing import Optional, List

@dataclass 
class FullCourtRecord:
    """Unified output schema for court records from any source."""

    # Identifiers
    record_id: str
    source: str  # "courtlistener", "pacer_recap", "state_portal"
    court_name: str
    court_state: str  # Two-letter state code
    court_level: str  # "federal", "state", "local"

    # Case identification
    case_number: str
    case_name: str
    case_type: str  # "civil", "criminal", "bankruptcy", etc.
    nature_of_suit: Optional[str]
    cause_of_action: Optional[str]

    # Parties
    plaintiff: Optional[str] = None
    defendant: Optional[str] = None
    additional_parties: List[str] = field(default_factory=list)
    attorneys: List[dict] = field(default_factory=list)

    # Timeline (defaults required: non-default fields cannot follow
    # defaulted ones in a dataclass)
    date_filed: Optional[str] = None
    date_terminated: Optional[str] = None
    is_active: bool = True

    # Court details
    judge: Optional[str] = None
    jury_demand: Optional[str] = None

    # Docket
    entry_count: int = 0
    last_entry_date: Optional[str] = None

    # Source metadata
    source_url: str = ""
    scraped_at: str = ""

    # Extended data
    opinions: List[dict] = field(default_factory=list)
    raw_metadata: dict = field(default_factory=dict)
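Populating the plaintiff and defendant fields from a raw caption is a common chore. A best-effort splitter follows; split_case_name is our name, and since captions are messy, treat the result as approximate:

```python
from typing import Optional, Tuple

def split_case_name(case_name: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a caption into (plaintiff, defendant); 'In re X' yields (X, None)."""
    if not case_name:
        return (None, None)
    lowered = case_name.lower()
    if lowered.startswith("in re "):
        return (case_name[6:].strip(), None)
    # Check longer separators first so " v. " is not shadowed by " v "
    for sep in (" v. ", " vs. ", " v ", " vs "):
        if sep in lowered:
            idx = lowered.index(sep)
            return (case_name[:idx].strip(), case_name[idx + len(sep):].strip())
    return (case_name.strip(), None)
```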

Key Considerations

Legal: Court records are public. Programmatic access is lawful under the principle of open courts. PACER's terms restrict automated bulk downloading without approval — use CourtListener/RECAP for federal data.

Ethical: Set reasonable rate limits. State court portals serve real users with filing deadlines. Degrading service to a court system is antisocial and may trigger legal scrutiny beyond simple terms violations.

Privacy: While records are public, they contain sensitive information — personal addresses, financial details, criminal records. Handle responsibly and consider applicable privacy laws in your jurisdiction before publishing or distributing extracted data.

Technical: State court portals change without notice. Selectors break. Budget for ongoing maintenance. Cache everything locally to avoid re-fetching.
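"Cache everything locally" can be as simple as one content-addressed JSON file per request. A minimal sketch under our own naming (CACHE_DIR and both function names are choices, not an API; swap in the requests_cache library or SQLite if you outgrow flat files):

```python
import hashlib
import json
import os

CACHE_DIR = ".court_cache"

def cache_path(url: str, params: dict = None) -> str:
    """Deterministic cache filename derived from the URL plus sorted params."""
    key = url + json.dumps(params or {}, sort_keys=True)
    return os.path.join(CACHE_DIR, hashlib.sha256(key.encode()).hexdigest()[:24] + ".json")

def cached_get_json(url: str, params: dict = None, fetch=None):
    """Return cached JSON when present; otherwise call fetch(url, params),
    persist the result, and return it."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = cache_path(url, params)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = fetch(url, params)
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```

Wire it up with `fetch=lambda u, p: requests.get(u, headers=HEADERS, params=p, timeout=30).json()`.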

For any volume above casual research, residential proxies such as ThorData's are effectively required for state court portals: many portals apply aggressive IP-based blocking that datacenter ranges cannot get past.

CourtListener's API plus the RECAP archive covers the vast majority of federal court needs. For state courts, plan to build and maintain court-specific scrapers. It is not glamorous work, but court data is among the most valuable public data that exists for research, journalism, legal intelligence, and business use cases.