How to Scrape SEC EDGAR Filings, 10-K Reports & Financial Data in Python (2026)
SEC EDGAR is one of the most valuable public data sources in existence. Every public company in the US files financial disclosures here — 10-K annual reports, 10-Q quarterlies, 8-K event reports, insider trading forms, proxy statements. It's all public. It's all free. And the SEC actually wants you to use it.
Unlike most scraping targets, EDGAR provides official APIs. But those APIs have gaps, and the raw filing documents still require parsing. This guide covers the full stack: the official JSON APIs, XBRL financial data extraction, Form 4 insider trading parsing, full-text search, proxy setup for high-volume work, and a complete SQLite storage pipeline.
Why SEC EDGAR Data Is Valuable
Before diving into the code, it's worth understanding what you can actually do with this data:
- Quantitative investing: Build factor models using standardized XBRL financial data — revenue growth, margins, debt ratios — for thousands of companies at once.
- NLP research: Train language models on 10-K risk disclosures, MD&A sections, and earnings guidance language going back decades.
- Insider signal tracking: Monitor Form 4 filings to detect when executives buy or sell their own company stock.
- Competitive intelligence: Track when competitors file 8-Ks for material events — acquisitions, leadership changes, regulatory actions.
- Academic research: The EDGAR corpus is a standard benchmark dataset in financial NLP research.
The SEC's EDGAR Full-Text Search System (EFTS) and the structured company data APIs are your starting point. They're free, require no API key, and return JSON.
The one hard requirement: include a User-Agent header with your name and email. The SEC uses this for rate limiting and will block requests without it.
Installation
pip install httpx beautifulsoup4 lxml
For heavier document parsing:
pip install sec-parser edgartools
The SEC EDGAR API (EFTS)
import httpx
import time
import sqlite3
import json
from datetime import datetime, timezone
from pathlib import Path
# SEC requires identifying User-Agent — not optional
# The SEC asks for "Name [email protected]"; requests without a proper
# User-Agent are rejected (403) and used for rate-limit accounting.
HEADERS = {
    "User-Agent": "YourName [email protected]",
    "Accept": "application/json",
}
def get_company_info(cik: str) -> dict:
    """Fetch company metadata from the SEC EDGAR submissions endpoint.

    Args:
        cik: Company CIK, with or without leading zeros.

    Returns:
        Dict of selected metadata fields plus the raw ``recent`` filings
        index under ``recent_filings``.
    """
    # The submissions endpoint requires the CIK zero-padded to 10 digits.
    padded = cik.zfill(10)
    endpoint = f"https://data.sec.gov/submissions/CIK{padded}.json"
    response = httpx.get(endpoint, headers=HEADERS, timeout=15)
    response.raise_for_status()
    payload = response.json()
    tickers = payload.get("tickers", [None])
    exchanges = payload.get("exchanges", [None])
    return {
        "cik": payload.get("cik"),
        "name": payload.get("name"),
        "ticker": tickers[0],
        "sic": payload.get("sic"),
        "sic_description": payload.get("sicDescription"),
        "state": payload.get("stateOfIncorporation"),
        "fiscal_year_end": payload.get("fiscalYearEnd"),
        "exchange": exchanges[0],
        "ein": payload.get("ein"),
        "category": payload.get("category"),
        "recent_filings": payload.get("filings", {}).get("recent", {}),
    }
# Example: Apple Inc (CIK 0000320193)
# CIK may be passed unpadded; get_company_info zero-pads it.
company = get_company_info("320193")
print(f"{company['name']} ({company['ticker']})")
print(f"SIC: {company['sic_description']}")
print(f"Exchange: {company['exchange']}")
Fetching Recent Filings
The submissions endpoint returns the most recent filings. Here's how to extract them with full metadata:
def get_recent_filings(cik: str, form_type: str = None, limit: int = 100) -> list[dict]:
    """Get recent filings for a company, optionally filtered by form type.

    Note: ``limit`` caps how many entries of the recent-filings index are
    scanned, not how many matches are returned (kept from the original
    behavior so existing callers see the same results).

    Args:
        cik: Company CIK (unpadded OK).
        form_type: Exact form type to keep (e.g. "10-K"), or None for all.
        limit: Maximum number of index entries to scan.

    Returns:
        List of filing dicts with dates, accession number, document name,
        and Archives URLs.
    """
    company = get_company_info(cik)
    recent = company["recent_filings"]

    def col(name: str, i: int, default=""):
        # BUG FIX: the submissions JSON stores filings as parallel arrays
        # and some columns can be shorter than "form"; the original guarded
        # only reportDate/description/size and would raise IndexError on a
        # short accessionNumber/primaryDocument/filingDate column. Index
        # every column defensively.
        values = recent.get(name, [])
        return values[i] if i < len(values) else default

    forms = recent.get("form", [])
    cik_stripped = str(int(cik))  # Archives paths use the unpadded CIK
    filings = []
    for i in range(min(len(forms), limit)):
        if form_type and forms[i] != form_type:
            continue
        accession = col("accessionNumber", i)
        document = col("primaryDocument", i)
        # Accession numbers appear with dashes in JSON but without in URLs.
        accession_clean = accession.replace("-", "")
        index_url = (
            f"https://www.sec.gov/Archives/edgar/data/"
            f"{cik_stripped}/{accession_clean}/"
        )
        filings.append({
            "form": forms[i],
            "filing_date": col("filingDate", i),
            "report_date": col("reportDate", i),
            "accession": accession,
            "document": document,
            "description": col("primaryDocDescription", i),
            "size_bytes": col("size", i, 0),
            "url": index_url + document,
            "index_url": index_url,
        })
    return filings
# Get Apple's 10-K filings
filings_10k = get_recent_filings("320193", form_type="10-K")
# Show the five most recent annual reports with their document URLs.
for f in filings_10k[:5]:
    print(f" {f['filing_date']} — {f['form']} — {f['report_date']}")
    print(f" {f['url']}")
Full-Text Search Across All Filings
EDGAR's full-text search API lets you search inside filing documents — extremely useful for finding disclosures about specific topics across thousands of filings at once:
def search_filings(
    query: str,
    form_type: str = None,
    date_from: str = None,
    date_to: str = None,
    company: str = None,
) -> list[dict]:
    """Search EDGAR full-text search system (EFTS)."""
    endpoint = "https://efts.sec.gov/LATEST/search-index"
    # Wrap the query in quotes so EFTS treats it as an exact phrase.
    params = {
        "q": f'"{query}"',
        "dateRange": "custom",
        "startdt": date_from if date_from else "2020-01-01",
        "enddt": date_to if date_to else "2026-12-31",
    }
    if form_type:
        params["forms"] = form_type
    if company:
        params["entity"] = company

    response = httpx.get(endpoint, params=params, headers=HEADERS, timeout=20)
    response.raise_for_status()
    payload = response.json()

    hits = payload.get("hits", {})
    matches = []
    for hit in hits.get("hits", []):
        source = hit.get("_source", {})
        matches.append({
            "company": source.get("display_names", [""])[0],
            "cik": source.get("entity_id", ""),
            "form": source.get("form_type", ""),
            "filing_date": source.get("file_date", ""),
            "period": source.get("period_of_report", ""),
            "description": source.get("display_description", ""),
            "accession": source.get("accession_no", ""),
        })
    total = hits.get("total", {}).get("value", 0)
    print(f"Found {total:,} matching filings")
    return matches
# Search for AI risk disclosures in 10-Ks filed since 2025
results = search_filings("artificial intelligence", form_type="10-K", date_from="2025-01-01")
for r in results[:10]:
    print(f" {r['company']} — {r['form']} ({r['filing_date']})")
Extracting Financial Data (XBRL)
The real gold is in the XBRL data. The SEC's company facts API gives you structured financial data — revenue, net income, assets, liabilities — already parsed from filings. No document parsing required.
def get_company_facts(cik: str) -> dict:
    """Get all XBRL facts for a company (structured financial data)."""
    # companyfacts wants the same zero-padded CIK as the submissions API.
    endpoint = (
        "https://data.sec.gov/api/xbrl/companyfacts/"
        f"CIK{cik.zfill(10)}.json"
    )
    response = httpx.get(endpoint, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.json()
def extract_metric_history(cik: str, metric: str, tags: list[str]) -> list[dict]:
    """Extract a financial metric from XBRL data, trying multiple GAAP tags.

    Tags are tried in order; the first tag that has USD data points wins
    (even if none of its points came from a 10-K/10-Q). Only facts filed
    on 10-K or 10-Q forms are returned.
    """
    facts = get_company_facts(cik)
    gaap_facts = facts.get("facts", {}).get("us-gaap", {})
    for tag in tags:
        tag_data = gaap_facts.get(tag)
        if tag_data is None:
            continue
        usd_points = tag_data.get("units", {}).get("USD", [])
        if not usd_points:
            continue
        history = []
        for point in usd_points:
            if point.get("form") not in ("10-K", "10-Q"):
                continue
            history.append({
                "metric": metric,
                "tag": tag,
                # Prefer the XBRL "frame" label; fall back to fiscal period.
                "period": point.get("frame", point.get("fp", "")),
                "start_date": point.get("start"),
                "end_date": point.get("end"),
                "value": point.get("val"),
                "form": point.get("form"),
                "accession": point.get("accn"),
                "filed": point.get("filed"),
            })
        return history
    return []
def extract_revenue_history(cik: str) -> list[dict]:
    """Extract quarterly/annual revenue from XBRL data."""
    # Filers tag revenue under different GAAP concepts depending on
    # taxonomy vintage; try the most common ones first.
    revenue_tags = [
        "RevenueFromContractWithCustomerExcludingAssessedTax",
        "Revenues",
        "SalesRevenueNet",
        "RevenueFromContractWithCustomerIncludingAssessedTax",
        "SalesRevenueGoodsNet",
    ]
    return extract_metric_history(cik, "revenue", revenue_tags)
def extract_net_income(cik: str) -> list[dict]:
    """Extract net income history."""
    # Common GAAP tags for net income, in order of preference.
    income_tags = [
        "NetIncomeLoss",
        "ProfitLoss",
        "NetIncomeLossAvailableToCommonStockholdersBasic",
    ]
    return extract_metric_history(cik, "net_income", income_tags)
def extract_total_assets(cik: str) -> list[dict]:
    """Extract total assets history."""
    # "Assets" is the single standard GAAP tag for total assets.
    asset_tags = ["Assets"]
    return extract_metric_history(cik, "total_assets", asset_tags)
# Apple revenue history
revenue = extract_revenue_history("320193")
# Sort by period end date and show the latest 8 data points.
for r in sorted(revenue, key=lambda x: x["end_date"] or "")[-8:]:
    val_b = (r["value"] or 0) / 1_000_000_000
    print(f" {r['end_date']} ({r['form']}): ${val_b:.1f}B")
Cross-Company Financial Comparison
One of the most powerful use cases: compare financial metrics across every public company in a sector:
def build_sector_dataset(cik_list: list[str]) -> list[dict]:
    """Build a comparable revenue dataset for multiple companies."""
    rows = []
    for cik in cik_list:
        try:
            info = get_company_info(cik)
            history = extract_revenue_history(cik)
            # Keep only annual (10-K) figures and take the most recent one.
            annual_points = [p for p in history if p["form"] == "10-K"]
            newest = None
            if annual_points:
                annual_points.sort(key=lambda p: p["end_date"] or "")
                newest = annual_points[-1]
            if newest:
                rows.append({
                    "cik": cik,
                    "company": info["name"],
                    "ticker": info.get("ticker"),
                    "latest_revenue_b": (newest["value"] or 0) / 1e9,
                    "period_end": newest["end_date"],
                })
            time.sleep(0.15)  # stay under 10 req/sec
        except Exception as e:
            # Best effort across the list: report and move on.
            print(f"Error for CIK {cik}: {e}")
    rows.sort(key=lambda r: r["latest_revenue_b"], reverse=True)
    return rows
# Big Tech revenue comparison
# Unpadded CIK numbers for each company.
big_tech = {
    "Apple": "320193",
    "Microsoft": "789019",
    "Amazon": "1018724",
    "Alphabet": "1652044",
    "Meta": "1326801",
}
results = build_sector_dataset(list(big_tech.values()))
for r in results:
    print(f" {r['company']}: ${r['latest_revenue_b']:.1f}B ({r['period_end']})")
Insider Trading (Form 4)
Form 4 filings show insider trades — when executives buy or sell stock. These are filed within 2 business days:
def get_insider_trades(cik: str, limit: int = 50) -> list[dict]:
    """Get recent insider trading filings for a company."""
    # Form type "4" = statement of changes in beneficial ownership.
    return get_recent_filings(cik, form_type="4", limit=limit)
def parse_form4_xml(filing: dict) -> dict:
    """Parse a Form 4 XML document for transaction details.

    Returns {} when the document cannot be fetched or parsed.
    """
    import xml.etree.ElementTree as ET

    # Prefer the raw XML variant of the primary document; fall back to the
    # original URL when the .xml version does not exist (404).
    xml_url = filing["url"].replace(".htm", ".xml")
    try:
        resp = httpx.get(xml_url, headers=HEADERS, timeout=15)
        if resp.status_code == 404:
            resp = httpx.get(filing["url"], headers=HEADERS, timeout=15)
        root = ET.fromstring(resp.text)
    except Exception:
        # Best effort: unreachable or non-XML filings yield no data.
        return {}

    def text_at(node, path):
        return node.findtext(path, "")

    transactions = []
    for node in root.findall(".//nonDerivativeTransaction"):
        transactions.append({
            "security_title": text_at(node, ".//securityTitle/value"),
            "transaction_date": text_at(node, ".//transactionDate/value"),
            # P = purchase, S = sale, A = grant
            "transaction_code": text_at(node, ".//transactionCoding/transactionCode"),
            "shares": text_at(node, ".//transactionAmounts/transactionShares/value"),
            "price_per_share": text_at(
                node, ".//transactionAmounts/transactionPricePerShare/value"
            ),
            "shares_owned_after": text_at(
                node, ".//postTransactionAmounts/sharesOwnedFollowingTransaction/value"
            ),
        })
    return {
        "issuer_name": text_at(root, ".//issuerName"),
        "issuer_ticker": text_at(root, ".//issuerTradingSymbol"),
        "reporter_name": text_at(root, ".//rptOwnerName"),
        "reporter_title": text_at(root, ".//officerTitle"),
        "filing_date": filing.get("filing_date", ""),
        "transactions": transactions,
    }
# Track insider buying at Apple
trades = get_insider_trades("320193", limit=20)
for filing in trades[:5]:
    parsed = parse_form4_xml(filing)
    for tx in parsed.get("transactions", []):
        code = tx.get("transaction_code", "")
        # Map SEC transaction codes to readable labels (P=purchase, S=sale).
        action = "BUY" if code == "P" else "SELL" if code == "S" else code
        shares = tx.get("shares", "?")
        price = tx.get("price_per_share", "?")
        print(f" [{action}] {parsed['reporter_name']} — {shares} shares @ ${price}")
Rate Limits and Proxy Setup
The SEC is one of the friendliest scraping targets, but they do enforce rules:
10 requests per second limit. The SEC explicitly states this. Exceed it and your IP gets temporarily blocked.
import time
class RateLimiter:
    """Simple rate limiter for SEC's 10 req/sec limit.

    Sleeps between calls to ``wait()`` just long enough to keep the
    request rate at or below ``max_per_second``.
    """

    def __init__(self, max_per_second: float = 8):
        # Minimum spacing between consecutive requests, in seconds.
        self.interval = 1.0 / max_per_second
        self.last_request = 0.0

    def wait(self):
        """Block until enough time has passed since the previous call."""
        since_last = time.monotonic() - self.last_request
        remaining = self.interval - since_last
        if remaining > 0:
            time.sleep(remaining)
        self.last_request = time.monotonic()
# Shared limiter instance used by the request helpers below.
limiter = RateLimiter(max_per_second=8)

def rate_limited_get(url: str, **kwargs) -> httpx.Response:
    # Throttle before every request so bursts stay under the SEC cap.
    limiter.wait()
    return httpx.get(url, headers=HEADERS, **kwargs)
User-Agent requirement. Requests without a proper User-Agent get 403'd. The SEC wants Name [email protected] format.
IP blocking for high-volume work. For large-scale collection — downloading thousands of filings, crawling all 10-Ks for NLP analysis — rotating proxies prevent any single IP from hitting the rate limit. ThorData's proxy service distributes requests across residential IPs, letting you stay within per-IP limits while maintaining aggregate throughput:
# Proxied client for high-volume EDGAR work
# NOTE(review): placeholder credentials/host — replace USER:PASS and the
# proxy endpoint before use. The client is created at import time.
proxy_client = httpx.Client(
    proxy="http://USER:[email protected]:9000",
    headers=HEADERS,
    timeout=25,
)

def edgar_get_proxied(url: str) -> httpx.Response:
    """Rate-limited, proxied request to SEC EDGAR."""
    limiter.wait()
    return proxy_client.get(url)
Parsing 10-K Documents
The raw 10-K HTML documents are complex — tables, footnotes, XBRL inline tags, and wildly varying formats across filers:
from bs4 import BeautifulSoup
import re
def extract_filing_text(filing_url: str) -> str:
    """Download and extract clean text from a filing document."""
    response = rate_limited_get(filing_url, timeout=60)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    # Strip scripts, styles, and inline-XBRL tags before pulling text.
    for element in soup(["script", "style", "ix:header", "ix:nonNumeric"]):
        element.decompose()
    raw_text = soup.get_text(separator="\n", strip=True)
    # Collapse runs of 3+ newlines down to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", raw_text)
def extract_10k_sections(filing_url: str) -> dict:
    """Extract named sections from a 10-K filing.

    Downloads the filing text, scans line-by-line for "Item N" headings,
    and buckets subsequent lines under the matching section until the
    next recognized heading appears.

    Args:
        filing_url: URL of the primary 10-K document.

    Returns:
        Mapping of section name -> text (capped at the first 500 lines
        of each section).
    """
    text = extract_filing_text(filing_url)
    lines = text.split("\n")
    section_patterns = {
        "business": r"item\s*1[^ab\d].*?business",
        "risk_factors": r"item\s*1a.*?risk\s*factors",
        "mda": r"item\s*7[^a].*?management.{1,40}discussion",
        "quantitative_risk": r"item\s*7a.*?quantitative",
        "financial_statements": r"item\s*8.*?financial\s*statements",
    }
    current_section = None
    section_lines = {}
    for line in lines:
        line_lower = line.lower().strip()
        for name, pattern in section_patterns.items():
            # DEAD-CODE FIX: the original also tested "name not in sections",
            # but `sections` stayed empty until after this loop, so the check
            # was always True. Dropping it preserves behavior: a repeated
            # heading (table-of-contents entry, then the real section) resets
            # the bucket, so the LAST occurrence — the actual section body —
            # is what gets kept.
            if re.search(pattern, line_lower):
                current_section = name
                section_lines[current_section] = []
                break
        if current_section:
            section_lines[current_section].append(line)
    # Cap at first 500 lines per section
    return {name: "\n".join(body[:500]) for name, body in section_lines.items()}
# Parse most recent 10-K for Apple
filings = get_recent_filings("320193", form_type="10-K", limit=5)
if filings:
    sections = extract_10k_sections(filings[0]["url"])
    for name, content in sections.items():
        print(f" {name}: {len(content):,} chars")
SQLite Storage Pipeline
For any serious EDGAR data collection, you want a structured local database:
def init_edgar_db(db_path: str = "edgar.db") -> sqlite3.Connection:
    """Initialize the EDGAR SQLite database.

    Creates the schema (companies, filings, financial_facts,
    insider_transactions) and supporting indexes if they do not exist.

    Args:
        db_path: Path to the SQLite file; created on first use.

    Returns:
        An open connection with WAL journaling and foreign keys enabled.
    """
    conn = sqlite3.connect(db_path)
    # WAL allows concurrent readers while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # SQLite only enforces FOREIGN KEY constraints when enabled per-connection.
    conn.execute("PRAGMA foreign_keys=ON")
    # One row per company / filing / XBRL fact / Form 4 transaction.
    # financial_facts has a UNIQUE key so re-ingesting the same fact is a no-op.
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS companies (
        cik TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        ticker TEXT,
        sic TEXT,
        sic_description TEXT,
        state TEXT,
        exchange TEXT,
        fiscal_year_end TEXT,
        last_fetched TEXT
    );
    CREATE TABLE IF NOT EXISTS filings (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        cik TEXT NOT NULL,
        form_type TEXT NOT NULL,
        accession TEXT UNIQUE NOT NULL,
        filing_date TEXT,
        report_date TEXT,
        primary_document TEXT,
        url TEXT,
        size_bytes INTEGER,
        fetched_at TEXT DEFAULT (datetime('now')),
        FOREIGN KEY (cik) REFERENCES companies(cik)
    );
    CREATE TABLE IF NOT EXISTS financial_facts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        cik TEXT NOT NULL,
        metric TEXT NOT NULL,
        tag TEXT NOT NULL,
        period TEXT,
        end_date TEXT,
        value REAL,
        form_type TEXT,
        accession TEXT,
        UNIQUE(cik, tag, end_date, form_type),
        FOREIGN KEY (cik) REFERENCES companies(cik)
    );
    CREATE TABLE IF NOT EXISTS insider_transactions (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        cik TEXT NOT NULL,
        filing_date TEXT,
        reporter_name TEXT,
        reporter_title TEXT,
        transaction_date TEXT,
        transaction_code TEXT,
        shares REAL,
        price_per_share REAL,
        shares_owned_after REAL,
        accession TEXT,
        FOREIGN KEY (cik) REFERENCES companies(cik)
    );
    CREATE INDEX IF NOT EXISTS idx_filings_cik ON filings(cik);
    CREATE INDEX IF NOT EXISTS idx_filings_form ON filings(form_type);
    CREATE INDEX IF NOT EXISTS idx_facts_cik ON financial_facts(cik, metric);
    CREATE INDEX IF NOT EXISTS idx_facts_date ON financial_facts(end_date);
    """)
    conn.commit()
    return conn
def upsert_company(conn: sqlite3.Connection, company: dict) -> None:
    """Insert a company row, or refresh name/ticker/last_fetched if it exists."""
    row = (
        company["cik"],
        company["name"],
        company.get("ticker"),
        company.get("sic"),
        company.get("sic_description"),
        company.get("state"),
        company.get("exchange"),
        company.get("fiscal_year_end"),
        datetime.now(timezone.utc).isoformat(),
    )
    conn.execute(
        """
        INSERT INTO companies (cik, name, ticker, sic, sic_description, state,
                               exchange, fiscal_year_end, last_fetched)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(cik) DO UPDATE SET
            name=excluded.name, ticker=excluded.ticker,
            last_fetched=excluded.last_fetched
        """,
        row,
    )
    conn.commit()
def insert_financial_facts(conn: sqlite3.Connection, cik: str, facts: list[dict]) -> int:
    """Bulk-insert XBRL facts, skipping duplicates; return rows actually added."""
    before = conn.total_changes
    for fact in facts:
        try:
            conn.execute(
                """
                INSERT OR IGNORE INTO financial_facts
                    (cik, metric, tag, period, end_date, value, form_type, accession)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
                """,
                (
                    cik,
                    fact["metric"],
                    fact["tag"],
                    fact.get("period"),
                    fact.get("end_date"),
                    fact.get("value"),
                    fact.get("form"),
                    fact.get("accession"),
                ),
            )
        except sqlite3.Error:
            # Best effort: one bad fact should not abort the whole batch.
            pass
    conn.commit()
    # total_changes counts every row modified on this connection, so the
    # delta equals the number of facts actually inserted (IGNOREd
    # duplicates contribute zero) — same result as summing changes().
    return conn.total_changes - before
Full Collection Pipeline
Putting it all together — a pipeline that collects data for a watchlist of companies:
def collect_company_data(cik: str, conn: sqlite3.Connection) -> dict:
    """Full pipeline: company info + filings + XBRL financial facts.

    Each stage is best-effort: a failure is recorded in ``errors`` and the
    status downgraded to "partial", but later stages still run.
    """
    report = {"cik": cik, "status": "ok", "errors": []}

    # 1. Company metadata
    try:
        company = get_company_info(cik)
        upsert_company(conn, company)
        report["name"] = company.get("name", "")
        limiter.wait()
    except Exception as e:
        report["errors"].append(f"company: {e}")
        report["status"] = "partial"

    # 2. Recent filings (10-K, 10-Q, 8-K, Form 4)
    try:
        for form_type in ["10-K", "10-Q", "8-K", "4"]:
            for filing in get_recent_filings(cik, form_type=form_type, limit=20):
                try:
                    conn.execute(
                        """
                        INSERT OR IGNORE INTO filings
                            (cik, form_type, accession, filing_date, report_date,
                             primary_document, url, size_bytes)
                        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
                        """,
                        (
                            cik,
                            filing["form"],
                            filing["accession"],
                            filing["filing_date"],
                            filing["report_date"],
                            filing["document"],
                            filing["url"],
                            filing.get("size_bytes", 0),
                        ),
                    )
                except sqlite3.IntegrityError:
                    # e.g. foreign-key violation when the company row is missing.
                    pass
            # Commit and throttle once per form type, not per filing.
            conn.commit()
            limiter.wait()
        report["filings_stored"] = True
    except Exception as e:
        report["errors"].append(f"filings: {e}")
        report["status"] = "partial"

    # 3. Financial facts (XBRL)
    try:
        all_facts = []
        for extractor in (extract_revenue_history, extract_net_income, extract_total_assets):
            all_facts.extend(extractor(cik))
        report["facts_inserted"] = insert_financial_facts(conn, cik, all_facts)
        limiter.wait()
    except Exception as e:
        report["errors"].append(f"facts: {e}")
        report["status"] = "partial"

    return report
# Run for a watchlist
# Unpadded CIK numbers for each watched company.
watchlist = {
    "Apple": "320193",
    "Microsoft": "789019",
    "Amazon": "1018724",
    "Alphabet": "1652044",
    "Meta": "1326801",
    "NVIDIA": "1045810",
    "Tesla": "1318605",
}
db = init_edgar_db("watchlist.db")
for name, cik in watchlist.items():
    result = collect_company_data(cik, db)
    status = "OK" if result["status"] == "ok" else "PARTIAL"
    facts = result.get("facts_inserted", 0)
    errors = result.get("errors", [])
    print(f"[{status}] {name}: {facts} facts stored" +
          (f" | Errors: {errors}" if errors else ""))
Bulk Historical Downloads
For large-scale historical work, use the EDGAR quarterly index files:
def download_quarterly_index(year: int, quarter: int) -> list[dict]:
    """Download the EDGAR quarterly filing index (all filings in a quarter).

    Args:
        year: Calendar year, e.g. 2026.
        quarter: Quarter number, 1-4.

    Returns:
        List of dicts with company_name, form_type, cik, date_filed,
        and filename for every filing in the quarter.
    """
    url = f"https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{quarter}/company.idx"
    resp = rate_limited_get(url, timeout=60)
    resp.raise_for_status()
    lines = resp.text.strip().split("\n")
    # ROBUSTNESS FIX: the original hard-coded "skip the first 9 lines",
    # which silently breaks if the SEC adds or removes a header line.
    # The header block ends with a dashed separator — find it, and fall
    # back to the historical offset of 9 if it is missing.
    start = 9
    for i, line in enumerate(lines[:20]):
        if line.startswith("---"):
            start = i + 1
            break
    filings = []
    # Fixed-width columns: name[0:62], form[62:74], CIK[74:86],
    # date[86:96], filename[96:].
    for line in lines[start:]:
        if len(line) < 80:
            # Blank or truncated line — not a filing record.
            continue
        try:
            filings.append({
                "company_name": line[:62].strip(),
                "form_type": line[62:74].strip(),
                "cik": line[74:86].strip(),
                "date_filed": line[86:96].strip(),
                "filename": line[96:].strip(),
            })
        except (IndexError, ValueError):
            continue
    return filings
# All 10-K filings from Q1 2026
index = download_quarterly_index(2026, 1)
ten_ks = [f for f in index if f["form_type"] == "10-K"]
print(f"10-K filings in Q1 2026: {len(ten_ks):,}")
Querying the Data
Once you have data in SQLite, analytical queries are straightforward:
def revenue_trend(conn: sqlite3.Connection, cik: str) -> list[dict]:
    """Get annual revenue trend for a company."""
    query = """
        SELECT end_date, value, form_type
        FROM financial_facts
        WHERE cik = ? AND metric = 'revenue' AND form_type = '10-K'
        ORDER BY end_date
    """
    # Rows come back ordered oldest-first by period end date.
    return [
        {"date": end_date, "revenue": value, "form": form_type}
        for end_date, value, form_type in conn.execute(query, (cik,))
    ]
def latest_insider_buys(conn: sqlite3.Connection, days_back: int = 30) -> list[dict]:
    """Get insider purchase transactions from the last N days."""
    query = """
        SELECT c.name, it.reporter_name, it.reporter_title,
               it.transaction_date, it.shares, it.price_per_share
        FROM insider_transactions it
        JOIN companies c ON c.cik = it.cik
        WHERE it.transaction_code = 'P'
          AND it.transaction_date >= date('now', ?)
        ORDER BY it.transaction_date DESC
    """
    # SQLite's date() modifier string, e.g. "-30 days".
    window = f"-{days_back} days"
    buys = []
    for name, reporter, title, tx_date, shares, price in conn.execute(query, (window,)):
        buys.append({
            "company": name,
            "reporter": reporter,
            "title": title,
            "date": tx_date,
            "shares": shares,
            "price": price,
        })
    return buys
# Revenue trend for Apple
for point in revenue_trend(db, "320193")[-5:]:
    val_b = (point["revenue"] or 0) / 1e9
    print(f" {point['date']}: ${val_b:.1f}B")
Key Takeaways
SEC EDGAR is the rare case where the official APIs are genuinely good. The company facts XBRL endpoint alone gives you structured financial data that would take hours to extract from PDFs. Start with the API, fall back to document parsing only when you need the full narrative text.
The main mistakes to avoid:
1. Scraping HTML pages when JSON APIs exist — data.sec.gov endpoints are faster and cleaner.
2. Exceeding the 10 req/sec limit — easy to prevent, easy to trip.
3. Omitting the User-Agent header — this is mandatory, not optional.
4. Ignoring XBRL structure — the company facts endpoint already gives you parsed financial data.
For high-volume work like crawling all 10-Ks for NLP research, ThorData's proxy service lets you maintain throughput across a rotating residential IP pool without triggering EDGAR's per-IP rate limits.