How to Scrape Patent Data from USPTO and Google Patents with Python (2026)
Patent data is one of the most underused public datasets in existence. Every granted patent and published application includes structured fields — claims, inventors, assignees, classifications, citation chains — that are machine-readable and free to access.
There are two main sources: USPTO's PatentsView API (structured, clean, limited to US patents) and Google Patents (global coverage, richer UI, requires HTML parsing). This guide covers both in depth, including full SQLite schemas, error handling, proxy rotation for Google Patents, and analytical queries to make sense of the data.
Why Patent Data Is Worth Collecting
Patents are the only legally required disclosure of new technology. Unlike academic papers (which can be selectively published) or trade secrets (which are hidden), patents must describe the invention in sufficient detail that someone skilled in the field could reproduce it. This makes the patent corpus uniquely valuable for:
- Technology intelligence: Track which companies are investing in specific technology areas before products launch
- Competitive analysis: Map who your competitors are citing, who is citing them, and what technology gaps exist
- Prior art research: Before building something, understand what's already been patented
- Inventor and researcher tracking: Identify subject matter experts by their patent portfolio
- M&A signals: Heavy patenting activity in a specific area often precedes acquisition attempts
- Academic citation networks: Patent citations create a technology lineage graph
The US patent corpus alone contains more than 12 million granted patents and millions of published applications. It's free, structured, and updated weekly.
Approach 1: USPTO PatentsView API
PatentsView is a free patent data platform maintained for the USPTO. It covers all US patents and published applications with fields for inventors, assignees, claims, citations, CPC classifications, and more. The query language is a JSON-based DSL that handles complex boolean logic. One caveat: the legacy endpoint used below (api.patentsview.org) needed no API key, but PatentsView has been migrating to a newer PatentSearch API at search.patentsview.org that requires a free key — check which deployment you're hitting before relying on keyless access.
Basic Patent Search
# patents_search.py
import httpx
import time
import json

BASE = "https://api.patentsview.org/patents/query"

client = httpx.Client(
    timeout=30,
    headers={"Content-Type": "application/json"},
)

def search_patents(query_text: str, max_results: int = 100) -> list:
    """
    Search patents by text in title or abstract.
    Uses PatentsView full-text search operators.
    """
    results = []
    per_page = min(max_results, 100)
    for page in range(1, (max_results // per_page) + 2):
        payload = {
            "q": {
                "_or": [
                    {"_text_any": {"patent_title": query_text}},
                    {"_text_any": {"patent_abstract": query_text}},
                ]
            },
            "f": [
                "patent_number", "patent_title", "patent_abstract",
                "patent_date", "patent_type",
                "inventor_first_name", "inventor_last_name",
                "inventor_city", "inventor_state", "inventor_country",
                "assignee_organization", "assignee_country",
                "cpc_group_id", "cpc_group_title",
                "uspc_mainclass_id", "uspc_mainclass_title",
            ],
            "o": {
                "page": page,
                "per_page": per_page,
            },
            "s": [{"patent_date": "desc"}],
        }
        resp = client.post(BASE, json=payload)
        if resp.status_code == 429:
            print("Rate limited, waiting 30s...")
            time.sleep(30)
            resp = client.post(BASE, json=payload)
        resp.raise_for_status()
        data = resp.json()
        batch = data.get("patents", [])
        if not batch:
            break
        results.extend(batch)
        total = data.get("total_patent_count", 0)
        print(f"  Page {page}: {len(batch)} patents (total: {total})")
        if len(results) >= total or len(results) >= max_results:
            break
        time.sleep(0.5)
    return results[:max_results]

# Usage
patents = search_patents("machine learning drug discovery", max_results=50)
for p in patents[:5]:
    inventors = ", ".join(
        f"{inv.get('inventor_first_name', '')} {inv.get('inventor_last_name', '')}".strip()
        for inv in p.get("inventors", [])[:3]
    )
    assignee = ", ".join(
        a.get("assignee_organization", "")
        for a in p.get("assignees", [])[:2]
        if a.get("assignee_organization")
    )
    print(f"{p['patent_number']} ({p.get('patent_date', 'N/A')})")
    print(f"  {p['patent_title'][:80]}...")
    print(f"  Inventors: {inventors}")
    print(f"  Assignee: {assignee}")
    print()
Searching by Date Range and Assignee
PatentsView supports precise boolean queries for targeted searches:
def search_by_assignee(
    assignee_name: str,
    start_date: str = "2023-01-01",
    end_date: str = "2026-01-01",
    max_results: int = 200,
) -> list:
    """
    Get all patents from a specific assignee (company) in a date range.
    Date format: YYYY-MM-DD. Note: patent_date is the grant date, not the filing date.
    """
    payload = {
        "q": {
            "_and": [
                {"_text_any": {"assignee_organization": assignee_name}},
                {"_gte": {"patent_date": start_date}},
                {"_lte": {"patent_date": end_date}},
            ]
        },
        "f": [
            "patent_number", "patent_title", "patent_date",
            "assignee_organization", "cpc_group_id", "cpc_group_title",
            "inventor_first_name", "inventor_last_name",
        ],
        "o": {"per_page": 100},
        "s": [{"patent_date": "desc"}],
    }
    all_results = []
    page = 1
    while len(all_results) < max_results:
        payload["o"]["page"] = page
        resp = client.post(BASE, json=payload)
        resp.raise_for_status()
        data = resp.json()
        batch = data.get("patents", [])
        if not batch:
            break
        all_results.extend(batch)
        total = data.get("total_patent_count", 0)
        if len(all_results) >= total:
            break
        page += 1
        time.sleep(0.3)
    return all_results[:max_results]

# Example: get all patents granted to Google in 2024
google_patents = search_by_assignee("Google LLC", "2024-01-01", "2024-12-31")
print(f"Google LLC was granted {len(google_patents)} patents in 2024")
Getting Patent Claims
Claims are the legally binding part of a patent — what the patent actually protects. PatentsView provides them via the same endpoint with different fields:
def get_patent_claims(patent_number: str) -> list:
    """Fetch claims for a specific patent."""
    payload = {
        "q": {"patent_number": patent_number},
        "f": [
            "patent_number", "patent_title",
            "claim_text", "claim_number", "claim_dependent",
        ],
    }
    resp = client.post(BASE, json=payload)
    resp.raise_for_status()
    data = resp.json()
    patents = data.get("patents", [])
    if not patents:
        return []
    claims = patents[0].get("claims", [])
    # Sort by claim number
    claims.sort(key=lambda c: int(c.get("claim_number", 0) or 0))
    return claims

def summarize_claims(claims: list) -> dict:
    """Categorize claims by type and extract independent claims."""
    independent = [c for c in claims if not c.get("claim_dependent")]
    dependent = [c for c in claims if c.get("claim_dependent")]
    return {
        "total": len(claims),
        "independent_count": len(independent),
        "dependent_count": len(dependent),
        "independent_claims": [
            {"number": c["claim_number"], "text": c["claim_text"][:300] + "..."}
            for c in independent[:3]  # first 3 independent claims
        ],
    }

# Example usage
claims = get_patent_claims("11234567")
summary = summarize_claims(claims)
print(f"Total claims: {summary['total']}")
print(f"Independent: {summary['independent_count']}, Dependent: {summary['dependent_count']}")
for c in summary["independent_claims"]:
    print(f"\nClaim {c['number']}:")
    print(f"  {c['text']}")
Citation Network Analysis
Patent citations reveal technology lineages and competitive landscapes:
def get_citations(patent_number: str) -> dict:
    """Get both forward and backward citations for a patent."""
    payload = {
        "q": {"patent_number": patent_number},
        "f": [
            "patent_number",
            "cited_patent_number", "cited_patent_title", "cited_patent_date",
            "citedby_patent_number", "citedby_patent_title", "citedby_patent_date",
        ],
    }
    resp = client.post(BASE, json=payload)
    resp.raise_for_status()
    data = resp.json()
    patents = data.get("patents", [])
    if not patents:
        return {"backward": [], "forward": [], "patent_number": patent_number}
    patent = patents[0]
    backward = patent.get("cited_patents", [])
    forward = patent.get("citedby_patents", [])
    return {
        "patent_number": patent_number,
        "backward": backward,   # what this patent cites
        "forward": forward,     # who cites this patent
        "backward_count": len(backward),
        "forward_count": len(forward),
    }

def build_citation_graph(
    seed_patents: list,
    depth: int = 1,
    max_per_level: int = 20,
) -> dict:
    """
    Build a citation graph from seed patents.
    depth=1 means follow one level of citations.
    Returns dict with nodes and edges for graph visualization.
    """
    nodes = {}
    edges = []
    to_process = list(seed_patents)
    for level in range(depth + 1):
        next_level = []
        for pnum in to_process[:max_per_level]:
            if pnum in nodes:
                continue
            cites = get_citations(pnum)
            nodes[pnum] = {
                "level": level,
                "backward_count": cites["backward_count"],
                "forward_count": cites["forward_count"],
            }
            for cited in cites["backward"]:
                cited_num = cited.get("cited_patent_number")
                if cited_num:
                    edges.append({"source": pnum, "target": cited_num, "type": "cites"})
                    if level < depth:
                        next_level.append(cited_num)
            time.sleep(0.2)
        to_process = next_level
    return {"nodes": nodes, "edges": edges}

# Example
cites = get_citations("11234567")
print("Patent 11234567:")
print(f"  Cites {cites['backward_count']} prior patents")
print(f"  Has been cited by {cites['forward_count']} subsequent patents")
Technology Landscape Analysis
from collections import Counter

def analyze_landscape(query: str, sample_size: int = 500) -> dict:
    """Analyze patent landscape for a technology area."""
    patents = search_patents(query, max_results=sample_size)
    assignees = Counter()
    inventors = Counter()
    years = Counter()
    cpc_codes = Counter()
    countries = Counter()
    for p in patents:
        for a in p.get("assignees", []):
            org = a.get("assignee_organization", "")
            if org:
                assignees[org] += 1
            country = a.get("assignee_country", "")
            if country:
                countries[country] += 1
        for inv in p.get("inventors", []):
            name = (
                f"{inv.get('inventor_first_name', '')} "
                f"{inv.get('inventor_last_name', '')}".strip()
            )
            if name:
                inventors[name] += 1
        date = p.get("patent_date", "")
        if date:
            years[date[:4]] += 1
        for cpc in p.get("cpcs", []):
            code = cpc.get("cpc_group_id", "")
            if code:
                cpc_codes[code] += 1
    return {
        "total_patents": len(patents),
        "top_assignees": assignees.most_common(10),
        "top_inventors": inventors.most_common(10),
        "year_distribution": dict(sorted(years.items())),
        "top_cpc_codes": cpc_codes.most_common(10),
        "top_countries": countries.most_common(10),
    }

landscape = analyze_landscape("solid state battery")
print(f"Total patents: {landscape['total_patents']}")
print("\nTop assignees:")
for org, count in landscape["top_assignees"]:
    print(f"  {org}: {count} patents")
print("\nYear distribution:")
for year, count in sorted(landscape["year_distribution"].items()):
    bar = "#" * (count // 3)
    print(f"  {year}: {bar} ({count})")
Approach 2: Google Patents HTML Scraping
Google Patents covers international patents (WIPO, EPO, JPO, etc.) that PatentsView doesn't include. The trade-off: no API, so you need to parse HTML.
# google_patents_scraper.py
from bs4 import BeautifulSoup
import httpx
import time
import random

GOOGLE_PATENTS = "https://patents.google.com"

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

def make_google_patents_session(proxy_url: str = None) -> httpx.Client:
    """Create a session for Google Patents with optional proxy."""
    client_kwargs = {
        "headers": HEADERS,
        "follow_redirects": True,
        "timeout": 25,
    }
    if proxy_url:
        client_kwargs["proxy"] = proxy_url
    client = httpx.Client(**client_kwargs)
    # Warm up with a homepage visit
    try:
        client.get(GOOGLE_PATENTS)
        time.sleep(random.uniform(1.0, 2.0))
    except httpx.RequestError:
        pass
    return client

def search_google_patents(
    query: str,
    num_results: int = 20,
    session: httpx.Client = None,
) -> list:
    """Search Google Patents and parse results."""
    if session is None:
        session = make_google_patents_session()
    results = []
    resp = session.get(
        GOOGLE_PATENTS,
        params={"q": query, "num": min(num_results, 100)},
    )
    if resp.status_code != 200:
        print(f"Search returned {resp.status_code}")
        return results
    soup = BeautifulSoup(resp.text, "lxml")
    # Google Patents result items use various selector patterns
    for item in soup.select("search-result-item, article.search-result, .result"):
        title_elem = (
            item.select_one("h3")
            or item.select_one(".result-title")
            or item.select_one("span.style-scope.patent-text")
        )
        id_elem = item.select_one("a[href*='/patent/']")
        if title_elem and id_elem:
            href = id_elem.get("href", "")
            patent_id = ""
            if "/patent/" in href:
                patent_id = href.split("/patent/")[-1].split("/")[0]
            results.append({
                "title": title_elem.get_text(strip=True),
                "patent_id": patent_id,
                "url": f"{GOOGLE_PATENTS}/patent/{patent_id}" if patent_id else "",
            })
    return results

def scrape_patent_detail(
    patent_id: str,
    session: httpx.Client = None,
) -> dict:
    """Scrape detailed patent info from Google Patents."""
    if session is None:
        session = make_google_patents_session()
    resp = session.get(f"{GOOGLE_PATENTS}/patent/{patent_id}/en")
    if resp.status_code == 429:
        raise RuntimeError(f"Rate limited fetching {patent_id}")
    if resp.status_code != 200:
        return {}
    soup = BeautifulSoup(resp.text, "lxml")
    # Title
    title = (
        soup.select_one("h1#title")
        or soup.select_one("span.style-scope.patent-text")
    )
    title_text = title.get_text(strip=True) if title else ""
    # Abstract
    abstract = (
        soup.select_one("div.abstract")
        or soup.select_one("section#abstractSection")
    )
    abstract_text = abstract.get_text(strip=True) if abstract else ""
    # Claims
    claims = []
    for claim in soup.select("div.claim, div.claim-text"):
        text = claim.get_text(strip=True)
        if text and len(text) > 20:
            claims.append(text)
    # Description sections
    description_sections = []
    for section in soup.select("div.description-paragraph"):
        text = section.get_text(strip=True)
        if text:
            description_sections.append(text)
    # Classifications
    classifications = []
    for cls in soup.select(".classification-item, span[data-type='cpc']"):
        text = cls.get_text(strip=True)
        if text:
            classifications.append(text)
    # Filing and publication dates from info table
    meta = {}
    for dt in soup.select("dl dt"):
        dd = dt.find_next_sibling("dd")
        if dd:
            key = dt.get_text(strip=True).rstrip(":")
            val = dd.get_text(strip=True)
            if key and val:
                meta[key] = val
    # Inventors and assignees from structured data
    inventors = [el.get_text(strip=True) for el in soup.select("dd[itemprop='inventor']")]
    assignees = [el.get_text(strip=True) for el in soup.select("dd[itemprop='assignee']")]
    return {
        "patent_id": patent_id,
        "title": title_text,
        "abstract": abstract_text,
        "claims": claims,
        "claims_count": len(claims),
        "description_sections": len(description_sections),
        "classifications": classifications,
        "inventors": inventors,
        "assignees": assignees,
        "metadata": meta,
    }
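One practical wrinkle: the search results page is largely rendered client-side, so the HTML parsing in search_google_patents can come back empty even when the query has hits. The search UI is backed by an XHR endpoint that returns JSON; the endpoint URL and response shape below are unofficial observations rather than a documented API, so treat this as a sketch that may break without notice:

```python
import urllib.parse

# Unofficial endpoint observed behind the search UI -- subject to change.
XHR_QUERY = "https://patents.google.com/xhr/query?url="

def parse_xhr_results(data: dict) -> list:
    """Pull publication numbers and titles out of the observed JSON shape."""
    results = []
    for cluster in data.get("results", {}).get("cluster", []):
        for item in cluster.get("result", []):
            pat = item.get("patent", {})
            if pat.get("publication_number"):
                results.append({
                    "patent_id": pat["publication_number"],
                    "title": pat.get("title", "").strip(),
                })
    return results

def search_via_xhr(query: str, session) -> list:
    """Search via the JSON endpoint; `session` is an httpx.Client,
    e.g. from make_google_patents_session above."""
    resp = session.get(XHR_QUERY + urllib.parse.quote("q=" + query))
    resp.raise_for_status()
    return parse_xhr_results(resp.json())
```

If the JSON layout shifts, parse_xhr_results simply returns an empty list rather than raising, so log a raw response or two while debugging.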
Anti-Bot Measures and Proxy Usage
PatentsView is an open government API — no anti-bot measures, no authentication. You can query it freely within reason (they request staying under 45 requests per minute).
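If you'd rather enforce that ceiling client-side than react to 429s after the fact, a minimal limiter that spaces requests out does the job. This is a generic sketch (the 45/min figure is PatentsView's published guidance; the class itself is not part of any API):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between calls, derived from a per-minute cap."""

    def __init__(self, max_per_minute: int = 40):
        self.min_interval = 60.0 / max_per_minute
        self._last = 0.0

    def wait(self):
        # Sleep just long enough to keep calls at least min_interval apart.
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Create one limiter (pinned slightly below the cap, e.g. `RateLimiter(40)`) and call `limiter.wait()` immediately before each `client.post(BASE, ...)`.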
Google Patents is a different story. It sits behind Google's standard bot detection:
- reCAPTCHA triggers after moderate request volumes from the same IP
- IP-based rate limiting that blocks entire subnets quickly
- JavaScript rendering requirements for some result pages
For Google Patents scraping beyond a few dozen lookups, rotating residential proxies keep you from hitting blocks. ThorData is well-suited for Google properties — their residential IPs rotate per request, which avoids the pattern detection that triggers CAPTCHAs on repeated requests from the same IP.
def make_proxied_session(proxy_url: str) -> httpx.Client:
    """Create a session for Google Patents with ThorData proxy."""
    return httpx.Client(
        headers=HEADERS,
        proxy=proxy_url,
        timeout=25,
        follow_redirects=True,
    )

# Usage
proxy = "http://USER:[email protected]:9000"
session = make_proxied_session(proxy)

# Fetch patent details through rotating proxies
patent_ids = ["US11234567B1", "US10987654B2", "US9876543B2"]
for pid in patent_ids:
    try:
        detail = scrape_patent_detail(pid, session=session)
        print(f"{pid}: {detail.get('title', 'N/A')} | Claims: {detail.get('claims_count', 0)}")
    except RuntimeError as e:
        print(f"{pid}: {e}")
    time.sleep(random.uniform(2.0, 5.0))
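scrape_patent_detail raises RuntimeError on a 429, and the loop above just gives up on that patent. A retry wrapper with exponential backoff and jitter usually recovers the request on a later attempt; a sketch (the helper name and delay values are my own, not from any library):

```python
import random
import time

def with_backoff(fn, *args, max_retries: int = 4, base_delay: float = 5.0, **kwargs):
    """Call fn; on RuntimeError (raised above for HTTP 429), retry with
    exponentially growing delays plus a little jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except RuntimeError:
            if attempt == max_retries:
                raise  # out of retries -- surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            print(f"Rate limited, retrying in {delay:.1f}s (attempt {attempt + 1})")
            time.sleep(delay)

# In the loop above:
# detail = with_backoff(scrape_patent_detail, pid, session=session)
```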
Practical tips:
- Use PatentsView first — it's free, fast, and structured. Only fall back to Google Patents for non-US patents.
- Cache aggressively — patent data doesn't change after grant. Store results locally and never re-fetch a patent you already have.
- Batch your PatentsView queries — one request with 100 patent numbers is better than 100 individual requests.
- Respect Google's robots.txt — Patents pages are listed in their sitemap and the data is public, but automated access is not officially supported.
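The batching tip deserves a concrete shape. My assumption here is that the legacy PatentsView query language treats a list value as an implicit OR, so a chunk of patent numbers can go out in a single request — verify that against the endpoint you actually use. `client` and `BASE` are the ones defined in patents_search.py:

```python
import time

def chunked(seq: list, size: int) -> list:
    """Split seq into consecutive sublists of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def fetch_patents_batch(patent_numbers: list, chunk_size: int = 100) -> list:
    """Fetch many patents in ceil(n / chunk_size) requests instead of n."""
    results = []
    for chunk in chunked(patent_numbers, chunk_size):
        payload = {
            "q": {"patent_number": chunk},  # list value == implicit OR (assumed)
            "f": ["patent_number", "patent_title", "patent_date"],
            "o": {"per_page": chunk_size},
        }
        resp = client.post(BASE, json=payload)
        resp.raise_for_status()
        results.extend(resp.json().get("patents", []))
        time.sleep(0.3)
    return results
```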
SQLite Storage Schema
import sqlite3
import json

def init_patent_db(db_path: str = "patents.db") -> sqlite3.Connection:
    """Initialize SQLite database for patent data."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS patents (
            patent_number TEXT PRIMARY KEY,
            title TEXT,
            abstract TEXT,
            date_granted TEXT,
            date_filed TEXT,
            patent_type TEXT,
            inventors TEXT,
            assignees TEXT,
            cpc_codes TEXT,
            claims_text TEXT,
            claims_count INTEGER DEFAULT 0,
            source TEXT DEFAULT 'patentsview',
            query_matched TEXT,
            added_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS citations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            citing_patent TEXT NOT NULL,
            cited_patent TEXT NOT NULL,
            citation_type TEXT DEFAULT 'backward',
            cited_title TEXT,
            cited_date TEXT,
            UNIQUE(citing_patent, cited_patent, citation_type)
        );
        CREATE TABLE IF NOT EXISTS search_runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            query TEXT,
            result_count INTEGER,
            run_at TEXT DEFAULT CURRENT_TIMESTAMP
        );
        CREATE INDEX IF NOT EXISTS idx_patents_date ON patents (date_granted);
        CREATE INDEX IF NOT EXISTS idx_citations_citing ON citations (citing_patent);
        CREATE INDEX IF NOT EXISTS idx_citations_cited ON citations (cited_patent);
    """)
    conn.commit()
    return conn

def save_patent(conn: sqlite3.Connection, patent: dict, query: str = None):
    """Insert or update a patent record."""
    conn.execute(
        """
        INSERT OR REPLACE INTO patents
            (patent_number, title, abstract, date_granted, patent_type,
             inventors, assignees, cpc_codes, query_matched)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
        """,
        (
            patent.get("patent_number"),
            patent.get("patent_title"),
            patent.get("patent_abstract"),
            patent.get("patent_date"),
            patent.get("patent_type"),
            json.dumps([
                f"{inv.get('inventor_first_name', '')} {inv.get('inventor_last_name', '')}".strip()
                for inv in patent.get("inventors", [])
            ]),
            json.dumps([
                a.get("assignee_organization", "")
                for a in patent.get("assignees", [])
                if a.get("assignee_organization")
            ]),
            json.dumps([
                cpc.get("cpc_group_id", "")
                for cpc in patent.get("cpcs", [])
                if cpc.get("cpc_group_id")
            ]),
            query,
        ),
    )
    conn.commit()

def save_citations(conn: sqlite3.Connection, citing_patent: str, citations: dict):
    """Save citation data for a patent. INSERT OR IGNORE makes re-saving
    the same citations idempotent."""
    for cited in citations.get("backward", []):
        cited_num = cited.get("cited_patent_number")
        if not cited_num:
            continue
        conn.execute(
            """INSERT OR IGNORE INTO citations
                   (citing_patent, cited_patent, citation_type, cited_title, cited_date)
               VALUES (?, ?, 'backward', ?, ?)""",
            (citing_patent, cited_num,
             cited.get("cited_patent_title"),
             cited.get("cited_patent_date")),
        )
    for citedby in citations.get("forward", []):
        citedby_num = citedby.get("citedby_patent_number")
        if not citedby_num:
            continue
        conn.execute(
            """INSERT OR IGNORE INTO citations
                   (citing_patent, cited_patent, citation_type, cited_title, cited_date)
               VALUES (?, ?, 'forward', ?, ?)""",
            (citedby_num, citing_patent,
             citedby.get("citedby_patent_title"),
             citedby.get("citedby_patent_date")),
        )
    conn.commit()
Building a Patent Monitoring Pipeline
Combine everything into a pipeline that tracks new patents in your technology area on a weekly basis:
def patent_monitor(
    queries: list,
    db_path: str = "patent_watch.db",
    results_per_query: int = 100,
    fetch_citations: bool = False,
):
    """
    Monitor new patents for given technology queries.
    Run weekly to stay current on a technology area.
    """
    conn = init_patent_db(db_path)
    for query in queries:
        print(f"\nProcessing query: '{query}'")
        try:
            patents = search_patents(query, max_results=results_per_query)
        except Exception as e:
            print(f"  Search failed: {e}")
            continue
        new_count = 0
        for p in patents:
            existing = conn.execute(
                "SELECT 1 FROM patents WHERE patent_number=?",
                (p.get("patent_number"),),
            ).fetchone()
            if not existing:
                save_patent(conn, p, query=query)
                new_count += 1
                if fetch_citations:
                    try:
                        cites = get_citations(p["patent_number"])
                        save_citations(conn, p["patent_number"], cites)
                        time.sleep(0.2)
                    except Exception:
                        pass
        conn.execute(
            "INSERT INTO search_runs (query, result_count) VALUES (?, ?)",
            (query, new_count),
        )
        conn.commit()
        print(f"  {new_count} new patents added (of {len(patents)} found)")
    total = conn.execute("SELECT COUNT(*) FROM patents").fetchone()[0]
    print(f"\nTotal patents in database: {total:,}")
    conn.close()

# Useful analytical queries
def find_emerging_assignees(conn: sqlite3.Connection, query_term: str, top_n: int = 10):
    """Rank assignees by patent count in a technology area, with earliest
    and latest grant dates to show how active the portfolio still is."""
    rows = conn.execute(
        """
        SELECT
            json_each.value AS assignee,
            COUNT(*) AS count,
            MAX(date_granted) AS latest,
            MIN(date_granted) AS earliest
        FROM patents, json_each(patents.assignees)
        WHERE query_matched LIKE ?
        GROUP BY assignee
        ORDER BY count DESC
        LIMIT ?
        """,
        (f"%{query_term}%", top_n),
    ).fetchall()
    return [
        {"assignee": r[0], "count": r[1], "latest": r[2], "earliest": r[3]}
        for r in rows
    ]

# Run the monitor
patent_monitor(
    queries=[
        "solid state battery electrolyte",
        "quantum error correction surface code",
        "autonomous vehicle lidar point cloud",
    ],
    db_path="patent_watch.db",
    results_per_query=200,
    fetch_citations=False,  # set True to build citation graph
)
Legal Notes
Patent data from the USPTO is fully public domain — the whole point of the patent system is disclosure in exchange for limited monopoly. You can freely access, store, analyze, and republish patent data from PatentsView with no legal restrictions.
Google Patents is a different matter: Google's Terms of Service prohibit automated scraping of their services, including the Patents search interface. The underlying patent documents are public domain, but Google's search index and UI are their property.
In practice, small-scale access to Google Patents for research is common and not typically enforced against. For production applications, the better path is the bulk USPTO data downloads (available at PatentsView.org/download), which provide complete patent datasets as structured bulk files with no scraping required.
For international patents, WIPO (World Intellectual Property Organization) offers PATENTSCOPE, a free search service with a web-service API covering PCT applications and many national collections.