Scraping Chrome Web Store Extensions in 2026: Ratings, Installs, and Permissions
The Chrome Web Store has no public API. Google killed the Chrome Web Store API in 2024, and in 2026 there's still no replacement. If you want extension data — install counts, ratings, permissions, version history — you have to scrape it.
The store uses server-rendered HTML with embedded JSON blobs, which makes extraction surprisingly reliable once you know where to look. For individual extensions, a plain HTTP client works. For category-level scraping, Playwright handles the JavaScript-heavy pages.
What Data Is Available
Each Chrome Web Store extension page exposes:
- Install count (approximate: "2,000,000+ users")
- Rating (1-5 stars with total review count)
- Version number and last updated date
- File size
- Required permissions and host permissions
- Category
- Developer name, developer website, and privacy policy URL
- Related extensions
- Languages supported
- Screenshots and description
The tricky part: much of this lives in structured data embedded in the page source, not in clean HTML elements. Google's HTML structure on the Chrome Web Store changes periodically, so selectors that worked last year may need updating.
Understanding the Page Structure
The Chrome Web Store (chromewebstore.google.com) serves two types of data:
- JSON-LD structured data in <script type="application/ld+json"> tags — contains SoftwareApplication data with rating, version, author, and operating system requirements
- Inline data blobs — JavaScript variable assignments containing richer data, including permissions and install counts
- Server-rendered HTML with regular CSS selectors — last resort, most brittle
The JSON-LD approach is the most stable because Google maintains it for SEO purposes.
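To make the JSON-LD path concrete, here is a self-contained sketch. The blob below is a synthetic example of the SoftwareApplication shape (field names follow schema.org; the exact contents of any real listing will differ):

```python
import json

# Synthetic JSON-LD blob mimicking a store listing — invented for
# illustration, not captured from a real extension page.
SAMPLE_JSON_LD = """
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "Example Extension",
  "softwareVersion": "4.2.1",
  "operatingSystem": "Chrome",
  "author": {"@type": "Person", "name": "Example Dev"},
  "aggregateRating": {"@type": "AggregateRating",
                      "ratingValue": "4.7", "ratingCount": "12345"}
}
"""

def parse_ld(raw: str) -> dict:
    """Pull the core fields out of a SoftwareApplication JSON-LD blob."""
    ld = json.loads(raw)
    if ld.get("@type") != "SoftwareApplication":
        return {}
    agg = ld.get("aggregateRating", {})
    return {
        "title": ld.get("name", ""),
        "version": ld.get("softwareVersion", ""),
        "author": ld.get("author", {}).get("name", ""),
        "rating": float(agg.get("ratingValue", 0)),
        "rating_count": int(agg.get("ratingCount", 0)),
    }
```

Ratings and counts arrive as strings in JSON-LD, so the float/int coercion matters if you plan to sort or filter on them later.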
Approach 1: Direct HTTP + HTML Parsing
For individual extensions or small batches, you don't need a full browser:
import httpx
import json
import re
import time
import random
from selectolax.parser import HTMLParser
BASE_HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/127.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-CH-UA": '"Chromium";v="127", "Google Chrome";v="127"',
"Sec-CH-UA-Mobile": "?0",
"Sec-CH-UA-Platform": '"macOS"',
"Cache-Control": "max-age=0",
}
def scrape_extension(
extension_id: str,
proxy: str | None = None,
) -> dict:
"""Scrape metadata for a Chrome Web Store extension."""
url = f"https://chromewebstore.google.com/detail/{extension_id}"
transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
client = httpx.Client(
headers=BASE_HEADERS,
transport=transport,
follow_redirects=True,
timeout=30,
)
try:
resp = client.get(url)
resp.raise_for_status()
except httpx.HTTPStatusError as e:
if e.response.status_code == 404:
return {"id": extension_id, "error": "not_found"}
raise
finally:
client.close()
tree = HTMLParser(resp.text)
result = {
"id": extension_id,
"url": str(resp.url),
"scraped_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}
# --- JSON-LD Structured Data ---
for script in tree.css("script[type='application/ld+json']"):
try:
ld = json.loads(script.text())
if ld.get("@type") == "SoftwareApplication":
agg = ld.get("aggregateRating", {})
result["title"] = ld.get("name", "")
result["rating"] = float(agg.get("ratingValue", 0))
result["rating_count"] = int(agg.get("ratingCount", 0))
result["version"] = ld.get("softwareVersion", "")
result["author"] = ld.get("author", {}).get("name", "")
result["operating_system"] = ld.get("operatingSystem", "")
result["category"] = ld.get("applicationCategory", "")
result["description"] = ld.get("description", "")
except (json.JSONDecodeError, ValueError, KeyError):
continue
# --- Install Count (from page text pattern) ---
page_text = resp.text
user_patterns = [
r"([\d,]+)\+?\s*users",
r"([\d,]+)\+?\s*people use this",
r'"userCount"\s*:\s*"([^"]+)"',
]
for pattern in user_patterns:
match = re.search(pattern, page_text, re.IGNORECASE)
if match:
result["users"] = match.group(1).replace(",", "")
break
# --- Permissions ---
result["permissions"] = extract_permissions_from_html(page_text)
# --- Size and Last Updated ---
size_match = re.search(r'"size"\s*:\s*"([^"]+)"', page_text)
if size_match:
result["size"] = size_match.group(1)
updated_match = re.search(r'"lastUpdated"\s*:\s*"([^"]+)"', page_text)
if updated_match:
result["last_updated"] = updated_match.group(1)
# --- Title fallback ---
if not result.get("title"):
h1 = tree.css_first("h1")
result["title"] = h1.text(strip=True) if h1 else ""
# --- Developer Website ---
dev_link = tree.css_first("a[href*='developer']")
if dev_link:
result["developer_url"] = dev_link.attributes.get("href", "")
return result
def extract_permissions_from_html(html: str) -> list[str]:
"""Extract declared permissions from the page source."""
permissions = set()
# Pattern 1: JSON-style permissions arrays
for pattern in [
r'"permissions"\s*:\s*\[(.*?)\]',
r'"host_permissions"\s*:\s*\[(.*?)\]',
r'"optional_permissions"\s*:\s*\[(.*?)\]',
]:
match = re.search(pattern, html, re.DOTALL)
if match:
raw = match.group(1)
perms = re.findall(r'"([^"]+)"', raw)
permissions.update(perms)
# Pattern 2: Permission text in structured data blob
perm_section = re.search(
r'"permissionsText"\s*:\s*\[(.*?)\]',
html,
re.DOTALL,
)
if perm_section:
perm_items = re.findall(r'"([^"]{3,})"', perm_section.group(1))
permissions.update(perm_items)
# Filter out obvious noise
noise = {"", " ", "null", "true", "false"}
return sorted(p for p in permissions if p not in noise and len(p) > 2)
Approach 2: Playwright for JavaScript-Heavy Pages
Some extension pages load data dynamically. If the HTTP approach returns incomplete data, use Playwright:
from playwright.sync_api import sync_playwright, Page
import re
def scrape_with_browser(
extension_id: str,
proxy_url: str | None = None,
) -> dict:
"""Full browser scrape — handles dynamically loaded content."""
proxy_config = {"server": proxy_url} if proxy_url else None
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
proxy=proxy_config,
)
context = browser.new_context(
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 Chrome/127.0.0.0 Safari/537.36"
),
viewport={"width": 1280, "height": 800},
locale="en-US",
timezone_id="America/Los_Angeles",
)
page = context.new_page()
url = f"https://chromewebstore.google.com/detail/{extension_id}"
page.goto(url, wait_until="networkidle", timeout=30000)
result = extract_from_page(page, extension_id)
browser.close()
return result
def extract_from_page(page: Page, extension_id: str) -> dict:
"""Extract extension data from a loaded Playwright page."""
result = {"id": extension_id}
# Title
try:
result["title"] = page.locator("h1").first.inner_text()
except Exception:
result["title"] = ""
# Rating from aria-label or structured text
content = page.content()
rating_match = re.search(r"(\d+\.?\d*)\s*(?:out of 5|stars?)", content, re.IGNORECASE)
if rating_match:
result["rating"] = float(rating_match.group(1))
# Review count
count_match = re.search(r"([\d,]+)\s*(?:ratings?|reviews?)", content, re.IGNORECASE)
if count_match:
result["rating_count"] = int(count_match.group(1).replace(",", ""))
# User count
user_match = re.search(r"([\d,]+)\+?\s*users", content, re.IGNORECASE)
if user_match:
result["users"] = user_match.group(1).replace(",", "")
# Version and updated date from detail section
    version_match = re.search(r'[Vv]ersion[:\s]+(\d[\d.]*)', content)
if version_match:
result["version"] = version_match.group(1)
updated_match = re.search(r'Updated[:\s]+([A-Za-z]+ \d+, \d{4}|\d+/\d+/\d+)', content)
if updated_match:
result["last_updated"] = updated_match.group(1)
# Permissions
result["permissions"] = extract_permissions_from_html(content)
# JSON-LD
ld_data = page.evaluate("""() => {
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
for (const s of scripts) {
try {
const data = JSON.parse(s.textContent);
if (data['@type'] === 'SoftwareApplication') return data;
} catch(e) {}
}
return null;
}""")
if ld_data:
result["author"] = ld_data.get("author", {}).get("name", "")
result["description"] = ld_data.get("description", "")
if not result.get("version"):
result["version"] = ld_data.get("softwareVersion", "")
return result
Anti-Bot Measures on the Chrome Web Store
Rate-based CAPTCHAs. reCAPTCHA triggers after 20-30 rapid requests from the same IP. You'll get a CAPTCHA challenge page instead of the extension data.
Request fingerprinting. Missing or inconsistent Sec- headers, wrong TLS fingerprints, and non-browser-like header ordering all increase bot scores. The headers in the examples above are matched to what Chrome actually sends.
IP reputation. Datacenter IPs get flagged faster than residential ones. Google's anti-bot systems have built large IP reputation databases over years.
For scraping more than a handful of extensions, use residential proxies. ThorData's residential proxy pool works well here — Google's anti-bot systems trust residential IP ranges, and rotating per request keeps each IP's request count below detection thresholds:
# Configure httpx client with rotating residential proxies
def create_client(proxy_url: str | None = None) -> httpx.Client:
transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
return httpx.Client(
headers=BASE_HEADERS,
transport=transport,
follow_redirects=True,
timeout=30,
)
# Batch scraping through a rotating proxy endpoint with polite delays
def scrape_batch_with_proxies(
extension_ids: list[str],
proxy_url: str,
) -> list[dict]:
results = []
for ext_id in extension_ids:
try:
data = scrape_extension(ext_id, proxy=proxy_url)
results.append(data)
except Exception as e:
print(f" Failed {ext_id}: {e}")
results.append({"id": ext_id, "error": str(e)})
delay = random.uniform(2, 5)
time.sleep(delay)
return results
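When a challenge page does slip through, it's worth detecting it explicitly rather than parsing garbage. The marker strings below are heuristics — the exact challenge-page markup is Google's to change — so treat them as a starting point:

```python
import random

# Heuristic markers for a Google challenge page. These are assumptions
# based on common challenge-page content, not a stable contract.
CAPTCHA_MARKERS = ("recaptcha", "unusual traffic", "/sorry/")

def looks_like_captcha(html: str) -> bool:
    """Rough check: did we get a challenge page instead of extension data?"""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

def backoff_delays(attempts: int, base: float = 5.0) -> list[float]:
    """Exponential backoff schedule with jitter: ~5s, ~10s, ~20s, ..."""
    return [base * (2 ** i) + random.uniform(0, 2) for i in range(attempts)]
```

On a hit, sleep for the next delay, rotate to a fresh proxy, and retry; give up and flag the ID after a few attempts rather than hammering the same endpoint.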
Discovering Extensions: Crawling Categories
The Web Store organizes extensions into categories. Crawl category pages to discover extension IDs, then scrape each one:
def get_category_extensions(
category: str,
max_items: int = 100,
proxy: str | None = None,
) -> list[str]:
"""Get extension IDs from a category page."""
# Category slugs: productivity, developer-tools, shopping,
# social-networking, accessibility, fun, photos, search-tools, news-weather
url = f"https://chromewebstore.google.com/category/extensions/{category}"
transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
client = httpx.Client(headers=BASE_HEADERS, transport=transport,
follow_redirects=True, timeout=30)
try:
resp = client.get(url, follow_redirects=True)
resp.raise_for_status()
finally:
client.close()
    # Extension IDs are 32-character strings using only the letters a-p
    ids = re.findall(r'/detail/[^/]*/([a-p]{32})', resp.text)
    # Also check for IDs in JSON data blobs
    json_ids = re.findall(r'"extensionId"\s*:\s*"([a-p]{32})"', resp.text)
    ids.extend(json_ids)
# Deduplicate while preserving order
seen = set()
unique_ids = []
for ext_id in ids:
if ext_id not in seen:
seen.add(ext_id)
unique_ids.append(ext_id)
return unique_ids[:max_items]
def crawl_all_categories(proxy: str | None = None) -> dict[str, list[str]]:
"""Discover extension IDs across all major categories."""
categories = [
"productivity",
"developer-tools",
"shopping",
"social-networking",
"accessibility",
"fun",
"photos",
"search-tools",
"news-weather",
]
all_ids = {}
for cat in categories:
ids = get_category_extensions(cat, max_items=100, proxy=proxy)
all_ids[cat] = ids
print(f"{cat}: {len(ids)} extensions")
time.sleep(random.uniform(2, 4))
return all_ids
Permission Analysis: Security Auditing
One of the most valuable applications of Chrome Web Store data is security analysis — identifying extensions with dangerous permission profiles:
# Permission risk levels
PERMISSION_RISK = {
# Critical: direct data access
"<all_urls>": ("critical", "Can read/modify data on all websites"),
"webRequest": ("high", "Can intercept network requests"),
"webRequestBlocking": ("critical", "Can block/modify network requests"),
"declarativeNetRequest": ("high", "Can modify network requests via rules"),
"tabs": ("high", "Can access tab URLs, titles, and navigation"),
"cookies": ("high", "Can read/write cookies for any site"),
"clipboardRead": ("high", "Can read clipboard contents"),
"history": ("medium", "Can access browsing history"),
"nativeMessaging": ("high", "Can communicate with native desktop apps"),
"downloads": ("medium", "Can manage file downloads"),
"management": ("high", "Can manage other Chrome extensions"),
"proxy": ("critical", "Can control browser proxy settings"),
"privacy": ("medium", "Can change privacy settings"),
"debugger": ("critical", "Full browser debugging access"),
# Medium: useful but sensitive
"identity": ("medium", "Can access Google account info"),
"bookmarks": ("medium", "Can read/modify bookmarks"),
"notifications": ("low", "Can display notifications"),
"storage": ("low", "Can store data locally"),
"contextMenus": ("low", "Can add right-click menu items"),
# Low: standard functionality
"activeTab": ("low", "Can access currently active tab"),
"scripting": ("medium", "Can inject scripts into pages"),
}
def audit_extension_permissions(ext_data: dict) -> dict:
"""Analyze extension permissions and produce a risk report."""
permissions = ext_data.get("permissions", [])
warnings = []
risk_score = 0
for perm in permissions:
if perm in PERMISSION_RISK:
level, description = PERMISSION_RISK[perm]
warnings.append({
"permission": perm,
"risk": level,
"description": description,
})
risk_score += {"critical": 10, "high": 5, "medium": 2, "low": 1}.get(level, 0)
elif perm.startswith("http") and "*" in perm:
warnings.append({
"permission": perm,
"risk": "high",
"description": f"Broad host access: can read/modify {perm}",
})
risk_score += 5
        elif perm.startswith("*://"):
warnings.append({
"permission": perm,
"risk": "critical",
"description": "Access to all HTTP/HTTPS sites",
})
risk_score += 10
# Normalize to 0-100 scale (capped)
normalized_risk = min(risk_score * 2, 100)
return {
"extension_id": ext_data.get("id"),
"title": ext_data.get("title"),
"risk_score": normalized_risk,
"risk_level": (
"critical" if normalized_risk >= 70
else "high" if normalized_risk >= 40
else "medium" if normalized_risk >= 20
else "low"
),
"warnings": sorted(warnings, key=lambda w: {"critical": 0, "high": 1, "medium": 2, "low": 3}.get(w["risk"], 4)),
"permission_count": len(permissions),
}
def find_high_risk_extensions(extensions: list[dict]) -> list[dict]:
"""Filter extensions to those with high/critical risk scores."""
audits = [audit_extension_permissions(ext) for ext in extensions]
high_risk = [a for a in audits if a["risk_level"] in ("high", "critical")]
high_risk.sort(key=lambda x: -x["risk_score"])
return high_risk
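To make the scoring concrete, here is the arithmetic for a hypothetical extension requesting tabs, cookies, and storage. A subset of the risk table is inlined so the snippet stands alone:

```python
# Inlined subset of the PERMISSION_RISK levels and score weights above.
RISK = {"tabs": "high", "cookies": "high", "storage": "low"}
WEIGHTS = {"critical": 10, "high": 5, "medium": 2, "low": 1}

perms = ["tabs", "cookies", "storage"]
raw = sum(WEIGHTS[RISK[p]] for p in perms)   # 5 + 5 + 1 = 11
normalized = min(raw * 2, 100)               # 22
level = ("critical" if normalized >= 70
         else "high" if normalized >= 40
         else "medium" if normalized >= 20
         else "low")
print(normalized, level)  # 22 medium
```

So an extension with two high-risk permissions already lands in the "medium" band; add one critical permission like debugger and it jumps to 42, which is "high".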
Storing Extension Data in SQLite
import sqlite3
def init_extensions_db(db_path: str) -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS extensions (
id TEXT PRIMARY KEY,
title TEXT,
author TEXT,
category TEXT,
description TEXT,
version TEXT,
last_updated TEXT,
users TEXT,
rating REAL,
rating_count INTEGER,
size TEXT,
developer_url TEXT,
permissions_json TEXT,
risk_score INTEGER,
risk_level TEXT,
url TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS scrape_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
extension_id TEXT,
users TEXT,
rating REAL,
rating_count INTEGER,
version TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_risk_level ON extensions(risk_level);
CREATE INDEX IF NOT EXISTS idx_category ON extensions(category);
    CREATE INDEX IF NOT EXISTS idx_rating_count ON extensions(rating_count DESC);
""")
conn.commit()
return conn
def store_extension(conn: sqlite3.Connection, ext_data: dict):
"""Store extension data and record a history snapshot."""
audit = audit_extension_permissions(ext_data)
conn.execute(
"""INSERT OR REPLACE INTO extensions
(id, title, author, category, description, version, last_updated,
users, rating, rating_count, size, developer_url,
permissions_json, risk_score, risk_level, url)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
ext_data["id"], ext_data.get("title"), ext_data.get("author"),
ext_data.get("category"), ext_data.get("description"),
ext_data.get("version"), ext_data.get("last_updated"),
ext_data.get("users"),
float(ext_data.get("rating", 0) or 0),
int(ext_data.get("rating_count", 0) or 0),
ext_data.get("size"), ext_data.get("developer_url"),
json.dumps(ext_data.get("permissions", [])),
audit["risk_score"], audit["risk_level"],
ext_data.get("url"),
)
)
# History snapshot for tracking changes
conn.execute(
"""INSERT INTO scrape_history (extension_id, users, rating, rating_count, version)
VALUES (?, ?, ?, ?, ?)""",
(ext_data["id"], ext_data.get("users"),
float(ext_data.get("rating", 0) or 0),
int(ext_data.get("rating_count", 0) or 0),
ext_data.get("version"))
)
conn.commit()
def get_install_growth(conn: sqlite3.Connection, extension_id: str) -> list[dict]:
"""Track user count changes over time."""
rows = conn.execute("""
SELECT users, rating, rating_count, version, scraped_at
FROM scrape_history
WHERE extension_id = ?
ORDER BY scraped_at
""", (extension_id,)).fetchall()
return [
{"users": r[0], "rating": r[1], "rating_count": r[2],
"version": r[3], "date": r[4]}
for r in rows
]
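The history rows become useful once you diff consecutive snapshots. A small helper — operating on the list-of-dicts shape that get_install_growth returns — can flag version bumps and user-count movement between scrapes:

```python
def snapshot_deltas(history: list[dict]) -> list[dict]:
    """Diff consecutive snapshots of {users, version, date} rows.

    User counts are stored as strings, so coerce where possible and
    fall back to None when a snapshot is missing or malformed.
    """
    deltas = []
    for prev, curr in zip(history, history[1:]):
        try:
            user_change = int(curr["users"]) - int(prev["users"])
        except (TypeError, ValueError):
            user_change = None
        deltas.append({
            "date": curr["date"],
            "user_change": user_change,
            "version_changed": curr["version"] != prev["version"],
        })
    return deltas
```

A version change paired with a sudden jump in permissions (from the audit table) is exactly the "gradual expansion" pattern worth alerting on.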
Practical Considerations
Extension IDs are permanent. The 32-character ID in the URL never changes, even if the extension is renamed or transferred to a new developer. Use it as your primary key.
User counts are approximate. Google rounds to the nearest magnitude (e.g., "2,000,000+ users"). Don't treat these as exact numbers — they're order-of-magnitude indicators. Track relative changes over time rather than absolute values.
Unlisted extensions exist. Some extensions aren't in search results but are accessible by direct URL if you have the ID. They appear in enterprise software deployments and developer testing scenarios.
Version updates trigger re-review. Extensions that receive major permission changes must go through Google's review process. Monitoring version changes alongside permission changes can flag suspicious extensions that gradually expand their access.
Rate yourself at 1 request per 2-3 seconds. Google's tolerance for scraping is lower than most sites. Going slower with residential proxies is more reliable than going fast with datacenter IPs.
Monitor for removals. Extensions get removed from the Web Store regularly — for policy violations, malware, or developer decisions. A 404 response is informative data: track when extensions disappear.
The Chrome Web Store is a mid-difficulty scraping target. No API means parsing HTML, but the JSON-LD structured data makes core fields reliable to extract. Start with the HTTP approach for single extensions, fall back to Playwright for stubborn pages or larger batches, and add residential proxies when you need to scale beyond a few dozen extensions per session.