Scraping Chrome Web Store Extensions in 2026: Ratings, Installs, and Permissions
The Chrome Web Store has no public API. Google killed the Chrome Web Store API in 2024, and in 2026 there's still no replacement. If you want extension data — install counts, ratings, permissions, version history — you have to scrape it.
The store uses server-rendered HTML with embedded JSON blobs, which makes extraction surprisingly reliable once you know where to look. For individual extensions, a plain HTTP client works. For category-level scraping, Playwright handles the JavaScript-heavy pages.
What Data Is Available
Each Chrome Web Store extension page exposes:
- Install count (approximate: "2,000,000+ users")
- Rating (1-5 stars with total review count)
- Version number and last updated date
- File size
- Required permissions and host permissions
- Category
- Developer name, developer website, and privacy policy URL
- Related extensions
- Languages supported
- Screenshots and description
The tricky part: much of this lives in structured data embedded in the page source, not in clean HTML elements. Google's HTML structure on the Chrome Web Store changes periodically, so selectors that worked last year may need updating.
Understanding the Page Structure
The Chrome Web Store (chromewebstore.google.com) serves two types of data:
- JSON-LD structured data in <script type="application/ld+json"> tags — contains SoftwareApplication data with rating, version, author, and operating system requirements
- Inline data blobs — JavaScript variable assignments containing richer data, including permissions and install counts
- Server-rendered HTML with regular CSS selectors — last resort, most brittle
The JSON-LD approach is the most stable because Google maintains it for SEO purposes.
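To make the JSON-LD path concrete, here is a self-contained sketch. The blob below is a synthetic example of the SoftwareApplication shape (field names follow schema.org; the exact contents of any real listing will differ):

```python
import json

# Synthetic JSON-LD blob mimicking a store listing — invented for
# illustration, not captured from a real extension page.
SAMPLE_JSON_LD = """
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "Example Extension",
  "softwareVersion": "4.2.1",
  "operatingSystem": "Chrome",
  "author": {"@type": "Person", "name": "Example Dev"},
  "aggregateRating": {"@type": "AggregateRating",
                      "ratingValue": "4.7", "ratingCount": "12345"}
}
"""

def parse_ld(raw: str) -> dict:
    """Pull the core fields out of a SoftwareApplication JSON-LD blob."""
    ld = json.loads(raw)
    if ld.get("@type") != "SoftwareApplication":
        return {}
    agg = ld.get("aggregateRating", {})
    return {
        "title": ld.get("name", ""),
        "version": ld.get("softwareVersion", ""),
        "author": ld.get("author", {}).get("name", ""),
        "rating": float(agg.get("ratingValue", 0)),
        "rating_count": int(agg.get("ratingCount", 0)),
    }
```

Ratings and counts arrive as strings in JSON-LD, so the float/int coercion matters if you plan to sort or filter on them later.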
Approach 1: Direct HTTP + HTML Parsing
For individual extensions or small batches, you don't need a full browser:
import httpx
import json
import re
import time
import random
from selectolax.parser import HTMLParser
BASE_HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/127.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-CH-UA": '"Chromium";v="127", "Google Chrome";v="127"',
"Sec-CH-UA-Mobile": "?0",
"Sec-CH-UA-Platform": '"macOS"',
"Cache-Control": "max-age=0",
}
def scrape_extension(
extension_id: str,
proxy: str | None = None,
) -> dict:
"""Scrape metadata for a Chrome Web Store extension."""
url = f"https://chromewebstore.google.com/detail/{extension_id}"
transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
client = httpx.Client(
headers=BASE_HEADERS,
transport=transport,
follow_redirects=True,
timeout=30,
)
try:
resp = client.get(url)
resp.raise_for_status()
except httpx.HTTPStatusError as e:
if e.response.status_code == 404:
return {"id": extension_id, "error": "not_found"}
raise
finally:
client.close()
tree = HTMLParser(resp.text)
result = {
"id": extension_id,
"url": str(resp.url),
"scraped_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}
# --- JSON-LD Structured Data ---
for script in tree.css("script[type='application/ld+json']"):
try:
ld = json.loads(script.text())
if ld.get("@type") == "SoftwareApplication":
agg = ld.get("aggregateRating", {})
result["title"] = ld.get("name", "")
result["rating"] = float(agg.get("ratingValue", 0))
result["rating_count"] = int(agg.get("ratingCount", 0))
result["version"] = ld.get("softwareVersion", "")
result["author"] = ld.get("author", {}).get("name", "")
result["operating_system"] = ld.get("operatingSystem", "")
result["category"] = ld.get("applicationCategory", "")
result["description"] = ld.get("description", "")
except (json.JSONDecodeError, ValueError, KeyError):
continue
# --- Install Count (from page text pattern) ---
page_text = resp.text
user_patterns = [
r"([\d,]+)\+?\s*users",
r"([\d,]+)\+?\s*people use this",
r'"userCount"\s*:\s*"([^"]+)"',
]
for pattern in user_patterns:
match = re.search(pattern, page_text, re.IGNORECASE)
if match:
result["users"] = match.group(1).replace(",", "")
break
# --- Permissions ---
result["permissions"] = extract_permissions_from_html(page_text)
# --- Size and Last Updated ---
size_match = re.search(r'"size"\s*:\s*"([^"]+)"', page_text)
if size_match:
result["size"] = size_match.group(1)
updated_match = re.search(r'"lastUpdated"\s*:\s*"([^"]+)"', page_text)
if updated_match:
result["last_updated"] = updated_match.group(1)
# --- Title fallback ---
if not result.get("title"):
h1 = tree.css_first("h1")
result["title"] = h1.text(strip=True) if h1 else ""
# --- Developer Website ---
dev_link = tree.css_first("a[href*='developer']")
if dev_link:
result["developer_url"] = dev_link.attributes.get("href", "")
return result
def extract_permissions_from_html(html: str) -> list[str]:
"""Extract declared permissions from the page source."""
permissions = set()
# Pattern 1: JSON-style permissions arrays
for pattern in [
r'"permissions"\s*:\s*\[(.*?)\]',
r'"host_permissions"\s*:\s*\[(.*?)\]',
r'"optional_permissions"\s*:\s*\[(.*?)\]',
]:
match = re.search(pattern, html, re.DOTALL)
if match:
raw = match.group(1)
perms = re.findall(r'"([^"]+)"', raw)
permissions.update(perms)
# Pattern 2: Permission text in structured data blob
perm_section = re.search(
r'"permissionsText"\s*:\s*\[(.*?)\]',
html,
re.DOTALL,
)
if perm_section:
perm_items = re.findall(r'"([^"]{3,})"', perm_section.group(1))
permissions.update(perm_items)
# Filter out obvious noise
noise = {"", " ", "null", "true", "false"}
return sorted(p for p in permissions if p not in noise and len(p) > 2)
Approach 2: Playwright for JavaScript-Heavy Pages
Some extension pages load data dynamically. If the HTTP approach returns incomplete data, use Playwright:
from playwright.sync_api import sync_playwright, Page
import re
def scrape_with_browser(
extension_id: str,
proxy_url: str | None = None,
) -> dict:
"""Full browser scrape — handles dynamically loaded content."""
proxy_config = {"server": proxy_url} if proxy_url else None
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
proxy=proxy_config,
)
context = browser.new_context(
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 Chrome/127.0.0.0 Safari/537.36"
),
viewport={"width": 1280, "height": 800},
locale="en-US",
timezone_id="America/Los_Angeles",
)
page = context.new_page()
url = f"https://chromewebstore.google.com/detail/{extension_id}"
page.goto(url, wait_until="networkidle", timeout=30000)
result = extract_from_page(page, extension_id)
browser.close()
return result
def extract_from_page(page: Page, extension_id: str) -> dict:
"""Extract extension data from a loaded Playwright page."""
result = {"id": extension_id}
# Title
try:
result["title"] = page.locator("h1").first.inner_text()
except Exception:
result["title"] = ""
# Rating from aria-label or structured text
content = page.content()
rating_match = re.search(r"(\d+\.?\d*)\s*(?:out of 5|stars?)", content, re.IGNORECASE)
if rating_match:
result["rating"] = float(rating_match.group(1))
# Review count
count_match = re.search(r"([\d,]+)\s*(?:ratings?|reviews?)", content, re.IGNORECASE)
if count_match:
result["rating_count"] = int(count_match.group(1).replace(",", ""))
# User count
user_match = re.search(r"([\d,]+)\+?\s*users", content, re.IGNORECASE)
if user_match:
result["users"] = user_match.group(1).replace(",", "")
# Version and updated date from detail section
    version_match = re.search(r'[Vv]ersion[:\s]+(\d[\d.]*)', content)
if version_match:
result["version"] = version_match.group(1)
updated_match = re.search(r'Updated[:\s]+([A-Za-z]+ \d+, \d{4}|\d+/\d+/\d+)', content)
if updated_match:
result["last_updated"] = updated_match.group(1)
# Permissions
result["permissions"] = extract_permissions_from_html(content)
# JSON-LD
ld_data = page.evaluate("""() => {
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
for (const s of scripts) {
try {
const data = JSON.parse(s.textContent);
if (data['@type'] === 'SoftwareApplication') return data;
} catch(e) {}
}
return null;
}""")
if ld_data:
result["author"] = ld_data.get("author", {}).get("name", "")
result["description"] = ld_data.get("description", "")
if not result.get("version"):
result["version"] = ld_data.get("softwareVersion", "")
return result
Anti-Bot Measures on the Chrome Web Store
Rate-based CAPTCHAs. reCAPTCHA triggers after 20-30 rapid requests from the same IP. You'll get a CAPTCHA challenge page instead of the extension data.
Request fingerprinting. Missing or inconsistent Sec- headers, wrong TLS fingerprints, and non-browser-like header ordering all increase bot scores. The headers in the examples above are matched to what Chrome actually sends.
IP reputation. Datacenter IPs get flagged faster than residential ones. Google's anti-bot systems have built large IP reputation databases over years.
For scraping more than a handful of extensions, use residential proxies. ThorData's residential proxy pool works well here — Google's anti-bot systems trust residential IP ranges, and rotating per request keeps each IP's request count below detection thresholds:
# Configure httpx client with rotating residential proxies
def create_client(proxy_url: str | None = None) -> httpx.Client:
transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
return httpx.Client(
headers=BASE_HEADERS,
transport=transport,
follow_redirects=True,
timeout=30,
)
# Batch scraping through a rotating proxy endpoint with polite delays
def scrape_batch_with_proxies(
extension_ids: list[str],
proxy_url: str,
) -> list[dict]:
results = []
for ext_id in extension_ids:
try:
data = scrape_extension(ext_id, proxy=proxy_url)
results.append(data)
except Exception as e:
print(f" Failed {ext_id}: {e}")
results.append({"id": ext_id, "error": str(e)})
delay = random.uniform(2, 5)
time.sleep(delay)
return results
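When a challenge page does slip through, it's worth detecting it explicitly rather than parsing garbage. The marker strings below are heuristics — the exact challenge-page markup is Google's to change — so treat them as a starting point:

```python
import random

# Heuristic markers for a Google challenge page. These are assumptions
# based on common challenge-page content, not a stable contract.
CAPTCHA_MARKERS = ("recaptcha", "unusual traffic", "/sorry/")

def looks_like_captcha(html: str) -> bool:
    """Rough check: did we get a challenge page instead of extension data?"""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

def backoff_delays(attempts: int, base: float = 5.0) -> list[float]:
    """Exponential backoff schedule with jitter: ~5s, ~10s, ~20s, ..."""
    return [base * (2 ** i) + random.uniform(0, 2) for i in range(attempts)]
```

On a hit, sleep for the next delay, rotate to a fresh proxy, and retry; give up and flag the ID after a few attempts rather than hammering the same endpoint.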
Discovering Extensions: Crawling Categories
The Web Store organizes extensions into categories. Crawl category pages to discover extension IDs, then scrape each one:
def get_category_extensions(
category: str,
max_items: int = 100,
proxy: str | None = None,
) -> list[str]:
"""Get extension IDs from a category page."""
# Category slugs: productivity, developer-tools, shopping,
# social-networking, accessibility, fun, photos, search-tools, news-weather
url = f"https://chromewebstore.google.com/category/extensions/{category}"
transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
client = httpx.Client(headers=BASE_HEADERS, transport=transport,
follow_redirects=True, timeout=30)
try:
resp = client.get(url, follow_redirects=True)
resp.raise_for_status()
finally:
client.close()
    # Extension IDs are 32-character strings using only the letters a-p
    ids = re.findall(r'/detail/[^/]*/([a-p]{32})', resp.text)
    # Also check for IDs in JSON data blobs
    json_ids = re.findall(r'"extensionId"\s*:\s*"([a-p]{32})"', resp.text)
    ids.extend(json_ids)
# Deduplicate while preserving order
seen = set()
unique_ids = []
for ext_id in ids:
if ext_id not in seen:
seen.add(ext_id)
unique_ids.append(ext_id)
return unique_ids[:max_items]
def crawl_all_categories(proxy: str | None = None) -> dict[str, list[str]]:
"""Discover extension IDs across all major categories."""
categories = [
"productivity",
"developer-tools",
"shopping",
"social-networking",
"accessibility",
"fun",
"photos",
"search-tools",
"news-weather",
]
all_ids = {}
for cat in categories:
ids = get_category_extensions(cat, max_items=100, proxy=proxy)
all_ids[cat] = ids
print(f"{cat}: {len(ids)} extensions")
time.sleep(random.uniform(2, 4))
return all_ids
Permission Analysis: Security Auditing
One of the most valuable applications of Chrome Web Store data is security analysis — identifying extensions with dangerous permission profiles:
# Permission risk levels
PERMISSION_RISK = {
# Critical: direct data access
"<all_urls>": ("critical", "Can read/modify data on all websites"),
"webRequest": ("high", "Can intercept network requests"),
"webRequestBlocking": ("critical", "Can block/modify network requests"),
"declarativeNetRequest": ("high", "Can modify network requests via rules"),
"tabs": ("high", "Can access tab URLs, titles, and navigation"),
"cookies": ("high", "Can read/write cookies for any site"),
"clipboardRead": ("high", "Can read clipboard contents"),
"history": ("medium", "Can access browsing history"),
"nativeMessaging": ("high", "Can communicate with native desktop apps"),
"downloads": ("medium", "Can manage file downloads"),
"management": ("high", "Can manage other Chrome extensions"),
"proxy": ("critical", "Can control browser proxy settings"),
"privacy": ("medium", "Can change privacy settings"),
"debugger": ("critical", "Full browser debugging access"),
# Medium: useful but sensitive
"identity": ("medium", "Can access Google account info"),
"bookmarks": ("medium", "Can read/modify bookmarks"),
"notifications": ("low", "Can display notifications"),
"storage": ("low", "Can store data locally"),
"contextMenus": ("low", "Can add right-click menu items"),
# Low: standard functionality
"activeTab": ("low", "Can access currently active tab"),
"scripting": ("medium", "Can inject scripts into pages"),
}
def audit_extension_permissions(ext_data: dict) -> dict:
"""Analyze extension permissions and produce a risk report."""
permissions = ext_data.get("permissions", [])
warnings = []
risk_score = 0
for perm in permissions:
if perm in PERMISSION_RISK:
level, description = PERMISSION_RISK[perm]
warnings.append({
"permission": perm,
"risk": level,
"description": description,
})
risk_score += {"critical": 10, "high": 5, "medium": 2, "low": 1}.get(level, 0)
elif perm.startswith("http") and "*" in perm:
warnings.append({
"permission": perm,
"risk": "high",
"description": f"Broad host access: can read/modify {perm}",
})
risk_score += 5
        elif perm.startswith("*://"):
warnings.append({
"permission": perm,
"risk": "critical",
"description": "Access to all HTTP/HTTPS sites",
})
risk_score += 10
# Normalize to 0-100 scale (capped)
normalized_risk = min(risk_score * 2, 100)
return {
"extension_id": ext_data.get("id"),
"title": ext_data.get("title"),
"risk_score": normalized_risk,
"risk_level": (
"critical" if normalized_risk >= 70
else "high" if normalized_risk >= 40
else "medium" if normalized_risk >= 20
else "low"
),
"warnings": sorted(warnings, key=lambda w: {"critical": 0, "high": 1, "medium": 2, "low": 3}.get(w["risk"], 4)),
"permission_count": len(permissions),
}
def find_high_risk_extensions(extensions: list[dict]) -> list[dict]:
"""Filter extensions to those with high/critical risk scores."""
audits = [audit_extension_permissions(ext) for ext in extensions]
high_risk = [a for a in audits if a["risk_level"] in ("high", "critical")]
high_risk.sort(key=lambda x: -x["risk_score"])
return high_risk
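To make the scoring concrete, here is the arithmetic for a hypothetical extension requesting tabs, cookies, and storage. A subset of the risk table is inlined so the snippet stands alone:

```python
# Inlined subset of the PERMISSION_RISK levels and score weights above.
RISK = {"tabs": "high", "cookies": "high", "storage": "low"}
WEIGHTS = {"critical": 10, "high": 5, "medium": 2, "low": 1}

perms = ["tabs", "cookies", "storage"]
raw = sum(WEIGHTS[RISK[p]] for p in perms)   # 5 + 5 + 1 = 11
normalized = min(raw * 2, 100)               # 22
level = ("critical" if normalized >= 70
         else "high" if normalized >= 40
         else "medium" if normalized >= 20
         else "low")
print(normalized, level)  # 22 medium
```

So an extension with two high-risk permissions already lands in the "medium" band; add one critical permission like debugger and it jumps to 42, which is "high".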
Storing Extension Data in SQLite
import sqlite3
def init_extensions_db(db_path: str) -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS extensions (
id TEXT PRIMARY KEY,
title TEXT,
author TEXT,
category TEXT,
description TEXT,
version TEXT,
last_updated TEXT,
users TEXT,
rating REAL,
rating_count INTEGER,
size TEXT,
developer_url TEXT,
permissions_json TEXT,
risk_score INTEGER,
risk_level TEXT,
url TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS scrape_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
extension_id TEXT,
users TEXT,
rating REAL,
rating_count INTEGER,
version TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_risk_level ON extensions(risk_level);
CREATE INDEX IF NOT EXISTS idx_category ON extensions(category);
    CREATE INDEX IF NOT EXISTS idx_rating_count ON extensions(rating_count DESC);
""")
conn.commit()
return conn
def store_extension(conn: sqlite3.Connection, ext_data: dict):
"""Store extension data and record a history snapshot."""
audit = audit_extension_permissions(ext_data)
conn.execute(
"""INSERT OR REPLACE INTO extensions
(id, title, author, category, description, version, last_updated,
users, rating, rating_count, size, developer_url,
permissions_json, risk_score, risk_level, url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
ext_data["id"], ext_data.get("title"), ext_data.get("author"),
ext_data.get("category"), ext_data.get("description"),
ext_data.get("version"), ext_data.get("last_updated"),
ext_data.get("users"),
float(ext_data.get("rating", 0) or 0),
int(ext_data.get("rating_count", 0) or 0),
ext_data.get("size"), ext_data.get("developer_url"),
json.dumps(ext_data.get("permissions", [])),
audit["risk_score"], audit["risk_level"],
ext_data.get("url"),
)
)
# History snapshot for tracking changes
conn.execute(
"""INSERT INTO scrape_history (extension_id, users, rating, rating_count, version)
VALUES (?, ?, ?, ?, ?)""",
(ext_data["id"], ext_data.get("users"),
float(ext_data.get("rating", 0) or 0),
int(ext_data.get("rating_count", 0) or 0),
ext_data.get("version"))
)
conn.commit()
def get_install_growth(conn: sqlite3.Connection, extension_id: str) -> list[dict]:
"""Track user count changes over time."""
rows = conn.execute("""
SELECT users, rating, rating_count, version, scraped_at
FROM scrape_history
WHERE extension_id = ?
ORDER BY scraped_at
""", (extension_id,)).fetchall()
return [
{"users": r[0], "rating": r[1], "rating_count": r[2],
"version": r[3], "date": r[4]}
for r in rows
]
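The history rows become useful once you diff consecutive snapshots. A small helper — operating on the list-of-dicts shape that get_install_growth returns — can flag version bumps and user-count movement between scrapes:

```python
def snapshot_deltas(history: list[dict]) -> list[dict]:
    """Diff consecutive snapshots of {users, version, date} rows.

    User counts are stored as strings, so coerce where possible and
    fall back to None when a snapshot is missing or malformed.
    """
    deltas = []
    for prev, curr in zip(history, history[1:]):
        try:
            user_change = int(curr["users"]) - int(prev["users"])
        except (TypeError, ValueError):
            user_change = None
        deltas.append({
            "date": curr["date"],
            "user_change": user_change,
            "version_changed": curr["version"] != prev["version"],
        })
    return deltas
```

A version change paired with a sudden jump in permissions (from the audit table) is exactly the "gradual expansion" pattern worth alerting on.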
Practical Considerations
Extension IDs are permanent. The 32-character ID in the URL never changes, even if the extension is renamed or transferred to a new developer. Use it as your primary key.
User counts are approximate. Google rounds to the nearest magnitude (e.g., "2,000,000+ users"). Don't treat these as exact numbers — they're order-of-magnitude indicators. Track relative changes over time rather than absolute values.
Unlisted extensions exist. Some extensions aren't in search results but are accessible by direct URL if you have the ID. They appear in enterprise software deployments and developer testing scenarios.
Version updates trigger re-review. Extensions that receive major permission changes must go through Google's review process. Monitoring version changes alongside permission changes can flag suspicious extensions that gradually expand their access.
Rate yourself at 1 request per 2-3 seconds. Google's tolerance for scraping is lower than most sites. Going slower with residential proxies is more reliable than going fast with datacenter IPs.
Monitor for removals. Extensions get removed from the Web Store regularly — for policy violations, malware, or developer decisions. A 404 response is informative data: track when extensions disappear.
The Chrome Web Store is a mid-difficulty scraping target. No API means parsing HTML, but the JSON-LD structured data makes core fields reliable to extract. Start with the HTTP approach for single extensions, fall back to Playwright for stubborn pages or larger batches, and add residential proxies when you need to scale beyond a few dozen extensions per session.