How to Scrape Patent Data from USPTO and Google Patents with Python (2026)
Patent data is one of the most underused public datasets in existence. Every granted patent and published application includes structured fields — claims, inventors, assignees, classifications, citation chains — that are machine-readable and free to access.
There are two main sources: USPTO's PatentsView API (structured, clean, limited to US patents) and Google Patents (global coverage, richer UI, requires HTML parsing). This guide covers both in depth, including full SQLite schemas, error handling, proxy rotation for Google Patents, and analytical queries to make sense of the data.
Why Patent Data Is Worth Collecting
Patents are the only legally required disclosure of new technology. Unlike academic papers (which can be selectively published) or trade secrets (which are hidden), patents must describe the invention in sufficient detail that someone skilled in the field could reproduce it. This makes the patent corpus uniquely valuable for:
- Technology intelligence: Track which companies are investing in specific technology areas before products launch
- Competitive analysis: Map who your competitors are citing, who is citing them, and what technology gaps exist
- Prior art research: Before building something, understand what's already been patented
- Inventor and researcher tracking: Identify subject matter experts by their patent portfolio
- M&A signals: Heavy patenting activity in a specific area often precedes acquisition attempts
- Academic citation networks: Patent citations create a technology lineage graph
The US patent corpus alone contains more than 12 million granted patents and millions of published applications. It's free, structured, and updated weekly.
Approach 1: USPTO PatentsView API
PatentsView is a free patent data platform maintained for the USPTO. It covers all US patents and published applications with fields for inventors, assignees, claims, citations, CPC classifications, and more. The query language is a JSON-based DSL that handles complex boolean logic. One caveat: the legacy endpoint used below (api.patentsview.org) needed no API key, but PatentsView has been migrating to a newer PatentSearch API at search.patentsview.org that requires a free key — check which deployment you're hitting before relying on keyless access.
Basic Patent Search
# patents_search.py
import httpx
import time
import json

BASE = "https://api.patentsview.org/patents/query"

client = httpx.Client(
    timeout=30,
    headers={"Content-Type": "application/json"},
)

def search_patents(query_text: str, max_results: int = 100) -> list:
    """
    Search patents by text in title or abstract.
    Uses PatentsView full-text search operators.
    """
    results = []
    per_page = min(max_results, 100)
    for page in range(1, (max_results // per_page) + 2):
        payload = {
            "q": {
                "_or": [
                    {"_text_any": {"patent_title": query_text}},
                    {"_text_any": {"patent_abstract": query_text}},
                ]
            },
            "f": [
                "patent_number", "patent_title", "patent_abstract",
                "patent_date", "patent_type",
                "inventor_first_name", "inventor_last_name",
                "inventor_city", "inventor_state", "inventor_country",
                "assignee_organization", "assignee_country",
                "cpc_group_id", "cpc_group_title",
                "uspc_mainclass_id", "uspc_mainclass_title",
            ],
            "o": {
                "page": page,
                "per_page": per_page,
            },
            "s": [{"patent_date": "desc"}],
        }
        resp = client.post(BASE, json=payload)
        if resp.status_code == 429:
            print("Rate limited, waiting 30s...")
            time.sleep(30)
            resp = client.post(BASE, json=payload)
        resp.raise_for_status()
        data = resp.json()
        batch = data.get("patents", [])
        if not batch:
            break
        results.extend(batch)
        total = data.get("total_patent_count", 0)
        print(f"  Page {page}: {len(batch)} patents (total: {total})")
        if len(results) >= total or len(results) >= max_results:
            break
        time.sleep(0.5)
    return results[:max_results]

# Usage
patents = search_patents("machine learning drug discovery", max_results=50)
for p in patents[:5]:
    inventors = ", ".join(
        f"{inv.get('inventor_first_name', '')} {inv.get('inventor_last_name', '')}".strip()
        for inv in p.get("inventors", [])[:3]
    )
    assignee = ", ".join(
        a.get("assignee_organization", "")
        for a in p.get("assignees", [])[:2]
        if a.get("assignee_organization")
    )
    print(f"{p['patent_number']} ({p.get('patent_date', 'N/A')})")
    print(f"  {p['patent_title'][:80]}...")
    print(f"  Inventors: {inventors}")
    print(f"  Assignee: {assignee}")
    print()
Searching by Date Range and Assignee
PatentsView supports precise boolean queries for targeted searches:
def search_by_assignee(
    assignee_name: str,
    start_date: str = "2023-01-01",
    end_date: str = "2026-01-01",
    max_results: int = 200,
) -> list:
    """
    Get all patents from a specific assignee (company) in a date range.
    Date format: YYYY-MM-DD. Note: patent_date is the grant date, not the filing date.
    """
    payload = {
        "q": {
            "_and": [
                {"_text_any": {"assignee_organization": assignee_name}},
                {"_gte": {"patent_date": start_date}},
                {"_lte": {"patent_date": end_date}},
            ]
        },
        "f": [
            "patent_number", "patent_title", "patent_date",
            "assignee_organization", "cpc_group_id", "cpc_group_title",
            "inventor_first_name", "inventor_last_name",
        ],
        "o": {"per_page": 100},
        "s": [{"patent_date": "desc"}],
    }
    all_results = []
    page = 1
    while len(all_results) < max_results:
        payload["o"]["page"] = page
        resp = client.post(BASE, json=payload)
        resp.raise_for_status()
        data = resp.json()
        batch = data.get("patents", [])
        if not batch:
            break
        all_results.extend(batch)
        total = data.get("total_patent_count", 0)
        if len(all_results) >= total:
            break
        page += 1
        time.sleep(0.3)
    return all_results[:max_results]

# Example: get all patents granted to Google in 2024
google_patents = search_by_assignee("Google LLC", "2024-01-01", "2024-12-31")
print(f"Google LLC was granted {len(google_patents)} patents in 2024")
Getting Patent Claims
Claims are the legally binding part of a patent — what the patent actually protects. PatentsView provides them via the same endpoint with different fields:
def get_patent_claims(patent_number: str) -> list:
    """Fetch claims for a specific patent."""
    payload = {
        "q": {"patent_number": patent_number},
        "f": [
            "patent_number", "patent_title",
            "claim_text", "claim_number", "claim_dependent",
        ],
    }
    resp = client.post(BASE, json=payload)
    resp.raise_for_status()
    data = resp.json()
    patents = data.get("patents", [])
    if not patents:
        return []
    claims = patents[0].get("claims", [])
    # Sort by claim number
    claims.sort(key=lambda c: int(c.get("claim_number", 0) or 0))
    return claims

def summarize_claims(claims: list) -> dict:
    """Categorize claims by type and extract independent claims."""
    independent = [c for c in claims if not c.get("claim_dependent")]
    dependent = [c for c in claims if c.get("claim_dependent")]
    return {
        "total": len(claims),
        "independent_count": len(independent),
        "dependent_count": len(dependent),
        "independent_claims": [
            {"number": c["claim_number"], "text": c["claim_text"][:300] + "..."}
            for c in independent[:3]  # first 3 independent claims
        ],
    }

# Example usage
claims = get_patent_claims("11234567")
summary = summarize_claims(claims)
print(f"Total claims: {summary['total']}")
print(f"Independent: {summary['independent_count']}, Dependent: {summary['dependent_count']}")
for c in summary["independent_claims"]:
    print(f"\nClaim {c['number']}:")
    print(f"  {c['text']}")
Citation Network Analysis
Patent citations reveal technology lineages and competitive landscapes:
def get_citations(patent_number: str) -> dict:
    """Get both forward and backward citations for a patent."""
    payload = {
        "q": {"patent_number": patent_number},
        "f": [
            "patent_number",
            "cited_patent_number", "cited_patent_title", "cited_patent_date",
            "citedby_patent_number", "citedby_patent_title", "citedby_patent_date",
        ],
    }
    resp = client.post(BASE, json=payload)
    resp.raise_for_status()
    data = resp.json()
    patents = data.get("patents", [])
    if not patents:
        return {"backward": [], "forward": [], "patent_number": patent_number}
    patent = patents[0]
    backward = patent.get("cited_patents", [])
    forward = patent.get("citedby_patents", [])
    return {
        "patent_number": patent_number,
        "backward": backward,   # what this patent cites
        "forward": forward,     # who cites this patent
        "backward_count": len(backward),
        "forward_count": len(forward),
    }

def build_citation_graph(
    seed_patents: list,
    depth: int = 1,
    max_per_level: int = 20,
) -> dict:
    """
    Build a citation graph from seed patents.
    depth=1 means follow one level of citations.
    Returns dict with nodes and edges for graph visualization.
    """
    nodes = {}
    edges = []
    to_process = list(seed_patents)
    for level in range(depth + 1):
        next_level = []
        for pnum in to_process[:max_per_level]:
            if pnum in nodes:
                continue
            cites = get_citations(pnum)
            nodes[pnum] = {
                "level": level,
                "backward_count": cites["backward_count"],
                "forward_count": cites["forward_count"],
            }
            for cited in cites["backward"]:
                cited_num = cited.get("cited_patent_number")
                if cited_num:
                    edges.append({"source": pnum, "target": cited_num, "type": "cites"})
                    if level < depth:
                        next_level.append(cited_num)
            time.sleep(0.2)
        to_process = next_level
    return {"nodes": nodes, "edges": edges}

# Example
cites = get_citations("11234567")
print("Patent 11234567:")
print(f"  Cites {cites['backward_count']} prior patents")
print(f"  Has been cited by {cites['forward_count']} subsequent patents")
Technology Landscape Analysis
from collections import Counter

def analyze_landscape(query: str, sample_size: int = 500) -> dict:
    """Analyze patent landscape for a technology area."""
    patents = search_patents(query, max_results=sample_size)
    assignees = Counter()
    inventors = Counter()
    years = Counter()
    cpc_codes = Counter()
    countries = Counter()
    for p in patents:
        for a in p.get("assignees", []):
            org = a.get("assignee_organization", "")
            if org:
                assignees[org] += 1
            country = a.get("assignee_country", "")
            if country:
                countries[country] += 1
        for inv in p.get("inventors", []):
            name = (
                f"{inv.get('inventor_first_name', '')} "
                f"{inv.get('inventor_last_name', '')}".strip()
            )
            if name:
                inventors[name] += 1
        date = p.get("patent_date", "")
        if date:
            years[date[:4]] += 1
        for cpc in p.get("cpcs", []):
            code = cpc.get("cpc_group_id", "")
            if code:
                cpc_codes[code] += 1
    return {
        "total_patents": len(patents),
        "top_assignees": assignees.most_common(10),
        "top_inventors": inventors.most_common(10),
        "year_distribution": dict(sorted(years.items())),
        "top_cpc_codes": cpc_codes.most_common(10),
        "top_countries": countries.most_common(10),
    }

landscape = analyze_landscape("solid state battery")
print(f"Total patents: {landscape['total_patents']}")
print("\nTop assignees:")
for org, count in landscape["top_assignees"]:
    print(f"  {org}: {count} patents")
print("\nYear distribution:")
for year, count in sorted(landscape["year_distribution"].items()):
    bar = "#" * (count // 3)
    print(f"  {year}: {bar} ({count})")
Approach 2: Google Patents HTML Scraping
Google Patents covers international patents (WIPO, EPO, JPO, etc.) that PatentsView doesn't include. The trade-off: no API, so you need to parse HTML.
# google_patents_scraper.py
from bs4 import BeautifulSoup
import httpx
import time
import random

GOOGLE_PATENTS = "https://patents.google.com"

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

def make_google_patents_session(proxy_url: str = None) -> httpx.Client:
    """Create a session for Google Patents with optional proxy."""
    client_kwargs = {
        "headers": HEADERS,
        "follow_redirects": True,
        "timeout": 25,
    }
    if proxy_url:
        client_kwargs["proxy"] = proxy_url
    client = httpx.Client(**client_kwargs)
    # Warm up with a homepage visit
    try:
        client.get(GOOGLE_PATENTS)
        time.sleep(random.uniform(1.0, 2.0))
    except httpx.RequestError:
        pass
    return client

def search_google_patents(
    query: str,
    num_results: int = 20,
    session: httpx.Client = None,
) -> list:
    """Search Google Patents and parse results."""
    if session is None:
        session = make_google_patents_session()
    results = []
    resp = session.get(
        GOOGLE_PATENTS,
        params={"q": query, "num": min(num_results, 100)},
    )
    if resp.status_code != 200:
        print(f"Search returned {resp.status_code}")
        return results
    soup = BeautifulSoup(resp.text, "lxml")
    # Google Patents result items use various selector patterns
    for item in soup.select("search-result-item, article.search-result, .result"):
        title_elem = (
            item.select_one("h3")
            or item.select_one(".result-title")
            or item.select_one("span.style-scope.patent-text")
        )
        id_elem = item.select_one("a[href*='/patent/']")
        if title_elem and id_elem:
            href = id_elem.get("href", "")
            patent_id = ""
            if "/patent/" in href:
                patent_id = href.split("/patent/")[-1].split("/")[0]
            results.append({
                "title": title_elem.get_text(strip=True),
                "patent_id": patent_id,
                "url": f"{GOOGLE_PATENTS}/patent/{patent_id}" if patent_id else "",
            })
    return results

def scrape_patent_detail(
    patent_id: str,
    session: httpx.Client = None,
) -> dict:
    """Scrape detailed patent info from Google Patents."""
    if session is None:
        session = make_google_patents_session()
    resp = session.get(f"{GOOGLE_PATENTS}/patent/{patent_id}/en")
    if resp.status_code == 429:
        raise RuntimeError(f"Rate limited fetching {patent_id}")
    if resp.status_code != 200:
        return {}
    soup = BeautifulSoup(resp.text, "lxml")
    # Title
    title = (
        soup.select_one("h1#title")
        or soup.select_one("span.style-scope.patent-text")
    )
    title_text = title.get_text(strip=True) if title else ""
    # Abstract
    abstract = (
        soup.select_one("div.abstract")
        or soup.select_one("section#abstractSection")
    )
    abstract_text = abstract.get_text(strip=True) if abstract else ""
    # Claims
    claims = []
    for claim in soup.select("div.claim, div.claim-text"):
        text = claim.get_text(strip=True)
        if text and len(text) > 20:
            claims.append(text)
    # Description sections
    description_sections = []
    for section in soup.select("div.description-paragraph"):
        text = section.get_text(strip=True)
        if text:
            description_sections.append(text)
    # Classifications
    classifications = []
    for cls in soup.select(".classification-item, span[data-type='cpc']"):
        text = cls.get_text(strip=True)
        if text:
            classifications.append(text)
    # Filing and publication dates from info table
    meta = {}
    for dt in soup.select("dl dt"):
        dd = dt.find_next_sibling("dd")
        if dd:
            key = dt.get_text(strip=True).rstrip(":")
            val = dd.get_text(strip=True)
            if key and val:
                meta[key] = val
    # Inventors and assignees from structured data
    inventors = [el.get_text(strip=True) for el in soup.select("dd[itemprop='inventor']")]
    assignees = [el.get_text(strip=True) for el in soup.select("dd[itemprop='assignee']")]
    return {
        "patent_id": patent_id,
        "title": title_text,
        "abstract": abstract_text,
        "claims": claims,
        "claims_count": len(claims),
        "description_sections": len(description_sections),
        "classifications": classifications,
        "inventors": inventors,
        "assignees": assignees,
        "metadata": meta,
    }
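One practical wrinkle: the search results page is largely rendered client-side, so the HTML parsing in search_google_patents can come back empty even when the query has hits. The search UI is backed by an XHR endpoint that returns JSON; the endpoint URL and response shape below are unofficial observations rather than a documented API, so treat this as a sketch that may break without notice:

```python
import urllib.parse

# Unofficial endpoint observed behind the search UI -- subject to change.
XHR_QUERY = "https://patents.google.com/xhr/query?url="

def parse_xhr_results(data: dict) -> list:
    """Pull publication numbers and titles out of the observed JSON shape."""
    results = []
    for cluster in data.get("results", {}).get("cluster", []):
        for item in cluster.get("result", []):
            pat = item.get("patent", {})
            if pat.get("publication_number"):
                results.append({
                    "patent_id": pat["publication_number"],
                    "title": pat.get("title", "").strip(),
                })
    return results

def search_via_xhr(query: str, session) -> list:
    """Search via the JSON endpoint; `session` is an httpx.Client,
    e.g. from make_google_patents_session above."""
    resp = session.get(XHR_QUERY + urllib.parse.quote("q=" + query))
    resp.raise_for_status()
    return parse_xhr_results(resp.json())
```

If the JSON layout shifts, parse_xhr_results simply returns an empty list rather than raising, so log a raw response or two while debugging.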
Anti-Bot Measures and Proxy Usage
PatentsView is an open government API — no anti-bot measures, no authentication. You can query it freely within reason (they request staying under 45 requests per minute).
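If you'd rather enforce that ceiling client-side than react to 429s after the fact, a minimal limiter that spaces requests out does the job. This is a generic sketch (the 45/min figure is PatentsView's published guidance; the class itself is not part of any API):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between calls, derived from a per-minute cap."""

    def __init__(self, max_per_minute: int = 40):
        self.min_interval = 60.0 / max_per_minute
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to keep calls at least min_interval apart.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Create one limiter (pinned slightly below the cap, e.g. `RateLimiter(40)`) and call `limiter.wait()` immediately before each `client.post(BASE, ...)`.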
Google Patents is a different story. It sits behind Google's standard bot detection:
- reCAPTCHA triggers after moderate request volumes from the same IP
- IP-based rate limiting that blocks entire subnets quickly
- JavaScript rendering requirements for some result pages
For Google Patents scraping beyond a few dozen lookups, rotating residential proxies keep you from hitting blocks. ThorData is well-suited for Google properties — their residential IPs rotate per request, which avoids the pattern detection that triggers CAPTCHAs on repeated requests from the same IP.
def make_proxied_session(proxy_url: str) -> httpx.Client:
    """Create a session for Google Patents with ThorData proxy."""
    return httpx.Client(
        headers=HEADERS,
        proxy=proxy_url,
        timeout=25,
        follow_redirects=True,
    )

# Usage
proxy = "http://USER:[email protected]:9000"
session = make_proxied_session(proxy)

# Fetch patent details through rotating proxies
patent_ids = ["US11234567B1", "US10987654B2", "US9876543B2"]
for pid in patent_ids:
    try:
        detail = scrape_patent_detail(pid, session=session)
        print(f"{pid}: {detail.get('title', 'N/A')} | Claims: {detail.get('claims_count', 0)}")
    except RuntimeError as e:
        print(f"{pid}: {e}")
    time.sleep(random.uniform(2.0, 5.0))
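scrape_patent_detail raises RuntimeError on a 429, and the loop above just gives up on that patent. A retry wrapper with exponential backoff and jitter usually recovers the request on a later attempt; a sketch (the helper name and delay values are my own, not from any library):

```python
import random
import time

def with_backoff(fn, *args, max_retries: int = 4, base_delay: float = 5.0, **kwargs):
    """Call fn; on RuntimeError (raised above for HTTP 429), retry with
    exponentially growing delays plus a little jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except RuntimeError:
            if attempt == max_retries:
                raise  # out of retries -- surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            print(f"Rate limited, retrying in {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)

# In the loop above:
# detail = with_backoff(scrape_patent_detail, pid, session=session)
```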
Practical tips:
- Use PatentsView first — it's free, fast, and structured. Only fall back to Google Patents for non-US patents.
- Cache aggressively — patent data doesn't change after grant. Store results locally and never re-fetch a patent you already have.
- Batch your PatentsView queries — one request with 100 patent numbers is better than 100 individual requests.
- Respect Google's robots.txt — Patents pages are listed in their sitemap and the data is public, but automated access is not officially supported.
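The batching tip deserves a concrete shape. My assumption here is that the legacy PatentsView query language treats a list value as an implicit OR, so a chunk of patent numbers can go out in a single request — verify that against the endpoint you actually use. `client` and `BASE` are the ones defined in patents_search.py:

```python
import time

def chunked(seq: list, size: int) -> list:
    """Split seq into consecutive sublists of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def fetch_patents_batch(patent_numbers: list, chunk_size: int = 100) -> list:
    """Fetch many patents in ceil(n / chunk_size) requests instead of n."""
    results = []
    for chunk in chunked(patent_numbers, chunk_size):
        payload = {
            "q": {"patent_number": chunk},  # list value == implicit OR (assumed)
            "f": ["patent_number", "patent_title", "patent_date"],
            "o": {"per_page": chunk_size},
        }
        resp = client.post(BASE, json=payload)
        resp.raise_for_status()
        results.extend(resp.json().get("patents", []))
        time.sleep(0.3)
    return results
```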
SQLite Storage Schema
import sqlite3
import json

def init_patent_db(db_path: str = "patents.db") -> sqlite3.Connection:
    """Initialize SQLite database for patent data."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS patents (
            patent_number TEXT PRIMARY KEY,
            title TEXT,
            abstract TEXT,
            date_granted TEXT,
            date_filed TEXT,
            patent_type TEXT,
            inventors TEXT,
            assignees TEXT,
            cpc_codes TEXT,
            claims_text TEXT,
            claims_count INTEGER DEFAULT 0,
            source TEXT DEFAULT 'patentsview',
            query_matched TEXT,
            added_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS citations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            citing_patent TEXT NOT NULL,
            cited_patent TEXT NOT NULL,
            citation_type TEXT DEFAULT 'backward',
            cited_title TEXT,
            cited_date TEXT,
            UNIQUE(citing_patent, cited_patent, citation_type)
        );
        CREATE TABLE IF NOT EXISTS search_runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            query TEXT,
            result_count INTEGER,
            run_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE INDEX IF NOT EXISTS idx_patents_date ON patents (date_granted);
        CREATE INDEX IF NOT EXISTS idx_citations_citing ON citations (citing_patent);
        CREATE INDEX IF NOT EXISTS idx_citations_cited ON citations (cited_patent);
    """)
    conn.commit()
    return conn

def save_patent(conn: sqlite3.Connection, patent: dict, query: str = None):
    """Insert or update a patent record."""
    conn.execute(
        """
        INSERT OR REPLACE INTO patents
            (patent_number, title, abstract, date_granted, patent_type,
             inventors, assignees, cpc_codes, query_matched)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        """,
        (
            patent.get("patent_number"),
            patent.get("patent_title"),
            patent.get("patent_abstract"),
            patent.get("patent_date"),
            patent.get("patent_type"),
            json.dumps([
                f"{inv.get('inventor_first_name', '')} {inv.get('inventor_last_name', '')}".strip()
                for inv in patent.get("inventors", [])
            ]),
            json.dumps([
                a.get("assignee_organization", "")
                for a in patent.get("assignees", [])
                if a.get("assignee_organization")
            ]),
            json.dumps([
                cpc.get("cpc_group_id", "")
                for cpc in patent.get("cpcs", [])
                if cpc.get("cpc_group_id")
            ]),
            query,
        ),
    )
    conn.commit()

def save_citations(conn: sqlite3.Connection, citing_patent: str, citations: dict):
    """Save citation data for a patent. INSERT OR IGNORE makes re-saving
    the same citations idempotent."""
    for cited in citations.get("backward", []):
        cited_num = cited.get("cited_patent_number")
        if not cited_num:
            continue
        conn.execute(
            """INSERT OR IGNORE INTO citations
                   (citing_patent, cited_patent, citation_type, cited_title, cited_date)
               VALUES (?, ?, 'backward', ?, ?)""",
            (citing_patent, cited_num,
             cited.get("cited_patent_title"),
             cited.get("cited_patent_date")),
        )
    for citedby in citations.get("forward", []):
        citedby_num = citedby.get("citedby_patent_number")
        if not citedby_num:
            continue
        conn.execute(
            """INSERT OR IGNORE INTO citations
                   (citing_patent, cited_patent, citation_type, cited_title, cited_date)
               VALUES (?, ?, 'forward', ?, ?)""",
            (citedby_num, citing_patent,
             citedby.get("citedby_patent_title"),
             citedby.get("citedby_patent_date")),
        )
    conn.commit()
Building a Patent Monitoring Pipeline
Combine everything into a pipeline that tracks new patents in your technology area on a weekly basis:
def patent_monitor(
    queries: list,
    db_path: str = "patent_watch.db",
    results_per_query: int = 100,
    fetch_citations: bool = False,
):
    """
    Monitor new patents for given technology queries.
    Run weekly to stay current on a technology area.
    """
    conn = init_patent_db(db_path)
    for query in queries:
        print(f"\nProcessing query: '{query}'")
        try:
            patents = search_patents(query, max_results=results_per_query)
        except Exception as e:
            print(f"  Search failed: {e}")
            continue
        new_count = 0
        for p in patents:
            existing = conn.execute(
                "SELECT 1 FROM patents WHERE patent_number=?",
                (p.get("patent_number"),),
            ).fetchone()
            if not existing:
                save_patent(conn, p, query=query)
                new_count += 1
                if fetch_citations:
                    try:
                        cites = get_citations(p["patent_number"])
                        save_citations(conn, p["patent_number"], cites)
                        time.sleep(0.2)
                    except Exception:
                        pass
        conn.execute(
            "INSERT INTO search_runs (query, result_count) VALUES (?, ?)",
            (query, new_count),
        )
        conn.commit()
        print(f"  {new_count} new patents added (of {len(patents)} found)")
    total = conn.execute("SELECT COUNT(*) FROM patents").fetchone()[0]
    print(f"\nTotal patents in database: {total:,}")
    conn.close()

# Useful analytical queries
def find_emerging_assignees(conn: sqlite3.Connection, query_term: str, top_n: int = 10):
    """Rank assignees by patent count in a technology area, with earliest
    and latest grant dates to show how active the portfolio still is."""
    rows = conn.execute(
        """
        SELECT
            json_each.value AS assignee,
            COUNT(*) AS count,
            MAX(date_granted) AS latest,
            MIN(date_granted) AS earliest
        FROM patents, json_each(patents.assignees)
        WHERE query_matched LIKE ?
        GROUP BY assignee
        ORDER BY count DESC
        LIMIT ?
        """,
        (f"%{query_term}%", top_n),
    ).fetchall()
    return [
        {"assignee": r[0], "count": r[1], "latest": r[2], "earliest": r[3]}
        for r in rows
    ]

# Run the monitor
patent_monitor(
    queries=[
        "solid state battery electrolyte",
        "quantum error correction surface code",
        "autonomous vehicle lidar point cloud",
    ],
    db_path="patent_watch.db",
    results_per_query=200,
    fetch_citations=False,  # set True to build citation graph
)
Legal Notes
Patent data from the USPTO is fully public domain — the whole point of the patent system is disclosure in exchange for limited monopoly. You can freely access, store, analyze, and republish patent data from PatentsView with no legal restrictions.
Google Patents is a different matter: Google's Terms of Service prohibit automated scraping of their services, including the Patents search interface. The underlying patent documents are public domain, but Google's search index and UI are their property.
In practice, small-scale access to Google Patents for research is common and not typically enforced against. For production applications, the better path is the bulk USPTO data downloads (available at PatentsView.org/download), which provide complete patent datasets as structured bulk files with no scraping required.
For international patents, WIPO (World Intellectual Property Organization) offers PATENTSCOPE, a free search service with a web-service API covering PCT applications and many national collections.