Scraping Google Reviews and Business Data (2026)

Google Maps is the dominant source of local business intelligence. Every listing carries reviews, ratings, hours, photos, price levels, and structured address data. For lead generation, reputation monitoring, or competitive research, there is no substitute. The challenge is getting the data out at scale in 2026 — Google's anti-bot stack has grown significantly more aggressive.

This guide covers what data is available, two viable extraction approaches, the practical mechanics of pagination, anti-bot evasion, data storage, and advanced analytical patterns.

What Data You Can Extract

A Google Maps business listing carries more structured data than most competing platforms: business name and categories, full street address with latitude/longitude, phone number, website, opening hours, price level, aggregate star rating and review count, photos, and the complete review history with author names, relative dates, and text.

Two Approaches: Places API vs HTML Scraping

Google Maps Places API

Google's official Places API gives you structured JSON with zero parsing effort. The endpoint is well-documented and reliable. The problems are cost and data limits.

Pricing in 2026 sits at approximately $17 per 1,000 Place Details requests. More critically, the API caps review data at the 5 most relevant reviews per place — there is no pagination parameter. If you need full review history — hundreds or thousands of reviews per business — the official API cannot help you.

Use the API when you need addresses, hours, and basic ratings at scale and have the budget for it. Use HTML scraping when you need full review histories.
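
For comparison, a Place Details call is a single authenticated GET. A minimal sketch of building the request URL (the endpoint and field names follow the legacy Place Details JSON API; verify against current documentation before relying on them):

```python
import urllib.parse

PLACE_DETAILS_ENDPOINT = "https://maps.googleapis.com/maps/api/place/details/json"

def place_details_url(place_id: str, api_key: str,
                      fields: str = "name,rating,formatted_address,opening_hours") -> str:
    """Build a Place Details request URL for the legacy JSON endpoint."""
    params = urllib.parse.urlencode({
        "place_id": place_id,
        "fields": fields,  # restricting fields keeps the billing tier down
        "key": api_key,
    })
    return f"{PLACE_DETAILS_ENDPOINT}?{params}"

# Fetch with any HTTP client, e.g.:
#   resp = httpx.get(place_details_url("ChIJ...", "YOUR_API_KEY"))
#   data = resp.json()["result"]
```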

HTML Scraping

Full review history requires scraping the Maps frontend directly. Google renders its Maps UI as a heavily JavaScript-driven single-page app, which means raw HTTP requests to maps.google.com return minimal useful data in the initial HTML. However, the page source does embed structured initialization data, and Google's review loading uses an internal API endpoint that you can call directly once you have the place_id.

Dependencies and Setup

pip install "httpx[http2]" playwright beautifulsoup4 curl-cffi
playwright install chromium

We use httpx for most requests and curl_cffi when we need Chrome-level TLS fingerprinting. Playwright handles cases where JavaScript challenges block both.

Extracting the place_id

Every Google Maps business has a stable place_id identifier. It appears in multiple locations in the page source, most reliably inside the window.APP_INITIALIZATION_STATE JSON blob embedded in a <script> tag.

import httpx
import re
import json
import time
import random

def extract_place_id(maps_url: str, proxy: str = None) -> str:
    """Extract place_id from a Google Maps business URL."""
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": "https://www.google.com/",
    }

    client_kwargs = {"headers": headers, "follow_redirects": True, "timeout": 20}
    if proxy:
        client_kwargs["proxies"] = {"all://": proxy}

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(maps_url)
        resp.raise_for_status()

    # place_id values typically begin with ChIJ; allow some length variance
    match = re.search(r'"(ChIJ[A-Za-z0-9_\-]{20,})"', resp.text)
    if match:
        return match.group(1)

    # Fallback: explicit key lookup
    match = re.search(r'"place_id"\s*:\s*"([^"]+)"', resp.text)
    if match:
        return match.group(1)

    raise ValueError(f"Could not extract place_id from {maps_url}")


def extract_place_ids_batch(urls: list, proxy: str = None) -> dict:
    """Extract place_ids for a list of business URLs."""
    results = {}
    for url in urls:
        try:
            pid = extract_place_id(url, proxy=proxy)
            results[url] = pid
            print(f"Extracted: {pid}")
        except Exception as e:
            print(f"Failed on {url}: {e}")
            results[url] = None
        time.sleep(random.uniform(1.5, 3.5))
    return results

Google Maps place_id values typically begin with ChIJ and run 27 characters. If your regex returns a token with a different shape, you have probably captured the wrong value.
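
A quick sanity check on extracted tokens catches bad captures early. A heuristic sketch — Google does not document the place_id format, so treat the 27-character ChIJ shape as common, not guaranteed:

```python
import re

# ChIJ prefix + 23 base64url-style characters = 27 total
PLACE_ID_RE = re.compile(r"^ChIJ[A-Za-z0-9_-]{23}$")

def looks_like_place_id(token: str) -> bool:
    """Heuristic check that a token matches the common place_id shape."""
    return bool(PLACE_ID_RE.match(token))
```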

Instead of starting from individual URLs, you can construct search URLs programmatically:

def build_maps_search_url(query: str, location: str = None) -> str:
    """Build a Google Maps search URL for a business type in a location."""
    import urllib.parse
    search_term = f"{query} {location}".strip() if location else query
    encoded = urllib.parse.quote_plus(search_term)
    return f"https://www.google.com/maps/search/{encoded}/"


def extract_listing_urls_from_search(search_url: str, proxy: str = None) -> list:
    """
    Extract individual business listing URLs from a Maps search result page.
    Returns a list of place URLs from the sidebar.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }
    client_kwargs = {"headers": headers, "follow_redirects": True, "timeout": 20}
    if proxy:
        client_kwargs["proxies"] = {"all://": proxy}

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(search_url)

    # Place URLs follow the pattern /maps/place/Name/@lat,lng,zoom/data=...
    urls = re.findall(r'https://www\.google\.com/maps/place/[^"\'\\]+', resp.text)
    seen = set()
    unique = []
    for u in urls:
        clean = u.split("\\")[0]
        if clean not in seen:
            seen.add(clean)
            unique.append(clean)
    return unique

Paginating Reviews via the Internal API

Once you have a place_id, Google's internal review endpoint accepts pagination tokens. The endpoint path is /maps/api/js/reviews/listugcposts and it returns JSON with a continuation token for the next page.

REVIEWS_ENDPOINT = "https://www.google.com/maps/api/js/reviews/listugcposts"

def scrape_reviews(place_id: str, proxy: str = None, max_pages: int = 10) -> list:
    """Scrape reviews for a Google Maps business using the internal pagination API."""
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept": "*/*",
        "Referer": f"https://www.google.com/maps/place/?place_id={place_id}",
        "X-Goog-Maps-Experience-Id": "maps:mf",
    }

    client_kwargs = {"headers": headers, "timeout": 20}
    if proxy:
        client_kwargs["proxies"] = {"all://": proxy}

    all_reviews = []
    next_page_token = None

    with httpx.Client(**client_kwargs) as client:
        for page in range(max_pages):
            params = {
                "authuser": "0",
                "hl": "en",
                "gl": "us",
                "pb": _build_pb_param(place_id, next_page_token),
            }

            resp = client.get(REVIEWS_ENDPOINT, params=params)
            if resp.status_code != 200:
                print(f"Request failed on page {page}: HTTP {resp.status_code}")
                break

            # Google wraps JSON responses with )]}' to prevent JSON hijacking
            raw = resp.text
            if raw.startswith(")]}'\n"):
                raw = raw[5:]

            try:
                data = json.loads(raw)
            except json.JSONDecodeError:
                print(f"Failed to parse JSON on page {page}")
                break

            # Reviews are in a nested list structure; position varies by response version
            reviews_block = data[2] if len(data) > 2 else []
            if not reviews_block:
                print(f"No reviews on page {page}, stopping")
                break

            for item in reviews_block:
                try:
                    review = {
                        "author": item[0][1],
                        "author_review_count": item[0][12] if len(item[0]) > 12 else None,
                        "rating": item[4],
                        "text": item[3] if len(item) > 3 and isinstance(item[3], str) else None,
                        "date_relative": item[1],
                        "helpful_count": item[16] if len(item) > 16 else 0,
                    }
                    all_reviews.append(review)
                except (IndexError, TypeError):
                    continue

            # Continuation token for next page
            next_page_token = data[-1] if isinstance(data[-1], str) and len(data[-1]) > 20 else None
            if not next_page_token:
                print(f"No continuation token after page {page}")
                break

            time.sleep(random.uniform(1.5, 3.5))

    return all_reviews


def _build_pb_param(place_id: str, next_page_token: str = None) -> str:
    """Build the protocol buffer parameter for the reviews request."""
    base = f"!1m2!1y{place_id}!4m6!2m5!1i10!2i0!3i0!4b1!5b1"
    if next_page_token:
        base += f"!6s{next_page_token}"
    return base

The pb parameter is a compact protobuf-derived encoding. The structure above handles most businesses; heavily reviewed locations with 1,000+ reviews may require inspecting the actual network requests in DevTools to confirm the token format for your target.

Extracting Business Metadata

Beyond reviews, the Maps page source contains rich business metadata embedded in the initialization data:

from bs4 import BeautifulSoup

def extract_business_metadata(maps_url: str, proxy: str = None) -> dict:
    """
    Extract structured business metadata from a Google Maps page.
    Returns name, address, phone, website, hours, price_level, rating, review_count.
    """
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }
    client_kwargs = {"headers": headers, "follow_redirects": True, "timeout": 20}
    if proxy:
        client_kwargs["proxies"] = {"all://": proxy}

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(maps_url)

    meta = {}
    soup = BeautifulSoup(resp.text, "html.parser")

    # JSON-LD block — Google embeds LocalBusiness schema on some listing pages
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            ld = json.loads(script.string)
            if ld.get("@type") in ("LocalBusiness", "Restaurant", "Store", "Hotel"):
                meta["name"] = ld.get("name", "")
                meta["address"] = ld.get("address", {})
                meta["phone"] = ld.get("telephone", "")
                meta["website"] = ld.get("url", "")
                meta["rating"] = ld.get("aggregateRating", {}).get("ratingValue")
                meta["review_count"] = ld.get("aggregateRating", {}).get("reviewCount")
                meta["price_range"] = ld.get("priceRange", "")
                meta["categories"] = ld.get("servesCuisine", []) or ld.get("category", "")
                break
        except (json.JSONDecodeError, AttributeError):
            continue

    # Title tag fallback for business name
    if not meta.get("name"):
        title = soup.find("title")
        if title and title.string:
            meta["name"] = title.string.replace(" - Google Maps", "").strip()

    # Extract lat/lng from the final (post-redirect) URL
    lat_lng = re.search(r"@(-?\d+\.\d+),(-?\d+\.\d+)", str(resp.url))
    if lat_lng:
        meta["lat"] = float(lat_lng.group(1))
        meta["lng"] = float(lat_lng.group(2))

    return meta

Anti-Bot Measures: DataDome in 2026

Google Maps uses DataDome as its primary bot detection layer. Understanding what it checks helps you avoid triggers:

TLS fingerprinting — DataDome inspects the TLS ClientHello to verify it matches a real browser's cipher suite ordering. httpx uses Python's ssl module, which has a different fingerprint than Chrome. Libraries like curl_cffi can spoof Chrome's TLS fingerprint at the transport layer.

Browser JS challenges — On the first few requests from a fresh IP, DataDome may inject a JavaScript challenge page instead of returning content. httpx cannot execute this. If you see a response body containing datadome and no useful content, you have been challenged.

Rate limiting per IP — Google rate-limits review API calls to roughly 30-50 requests per hour per IP before triggering CAPTCHAs or empty responses. Residential IPs have higher thresholds than datacenter IPs, which are blocked outright.

Behavioral signals — DataDome tracks inter-request timing, mouse movements (via Playwright), scroll events, and navigation patterns. Requests that arrive at perfectly consistent intervals without any human-like variation are flagged faster.
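
To avoid the machine-regular timing described above, every snippet in this guide sleeps a randomized interval between requests. That logic can be centralized in a small pacing helper (a sketch; the class name is my own):

```python
import random
import time

class PacedClient:
    """Enforce a jittered minimum gap between requests so inter-request
    timing never looks machine-regular to behavioral detection."""

    def __init__(self, min_gap: float = 1.5, max_gap: float = 3.5):
        self.min_gap = min_gap
        self.max_gap = max_gap
        self._last_request = 0.0

    def wait(self):
        """Sleep until a freshly randomized gap has elapsed since the last call."""
        gap = random.uniform(self.min_gap, self.max_gap)
        elapsed = time.monotonic() - self._last_request
        if elapsed < gap:
            time.sleep(gap - elapsed)
        self._last_request = time.monotonic()

# Usage: call pacer.wait() immediately before each client.get(...)
```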

Using curl_cffi for Chrome TLS Spoofing

When httpx fails DataDome TLS checks, curl_cffi provides a drop-in replacement that presents Chrome's exact cipher suite:

from curl_cffi import requests as cffi_requests

def fetch_with_chrome_tls(url: str, proxy: str = None) -> str:
    """Fetch a page using Chrome-level TLS fingerprint via curl_cffi."""
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    proxies = {"http": proxy, "https": proxy} if proxy else None

    resp = cffi_requests.get(
        url,
        headers=headers,
        proxies=proxies,
        impersonate="chrome124",  # spoof Chrome 124 TLS fingerprint
        timeout=20,
    )
    return resp.text

Playwright Approach for Blocked Requests

When httpx and curl_cffi both get DataDome challenges, switch to Playwright. You can intercept the network responses and extract JSON directly without parsing rendered HTML:

from playwright.sync_api import sync_playwright
import json
import random

def scrape_reviews_playwright(maps_url: str, proxy: str = None) -> list:
    """Full browser approach with response interception for DataDome bypass."""
    intercepted = []

    def handle_response(response):
        if "listugcposts" in response.url and response.status == 200:
            try:
                body = response.body().decode("utf-8")
                if body.startswith(")]}'\n"):
                    body = body[5:]
                data = json.loads(body)
                intercepted.append(data)
            except Exception:
                pass

    launch_kwargs = {"headless": True}
    if proxy:
        launch_kwargs["proxy"] = {"server": proxy}

    with sync_playwright() as p:
        browser = p.chromium.launch(**launch_kwargs)
        context = browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            viewport={"width": 1280, "height": 900},
        )
        page = context.new_page()
        page.on("response", handle_response)

        page.goto(maps_url, wait_until="networkidle", timeout=30000)

        # Scroll the reviews panel to trigger pagination loads
        for _ in range(5):
            page.keyboard.press("End")
            page.wait_for_timeout(1500)

        browser.close()

    return intercepted


def scrape_reviews_playwright_full(maps_url: str, proxy: str = None, scroll_rounds: int = 8) -> list:
    """
    Extended Playwright scraper that scrolls the review panel multiple times
    to capture more paginated reviews before handing back intercepted data.
    """
    intercepted_batches = []

    def handle_response(response):
        if "listugcposts" in response.url and response.status == 200:
            try:
                body = response.body().decode("utf-8")
                if body.startswith(")]}'\n"):
                    body = body[5:]
                data = json.loads(body)
                intercepted_batches.append(data)
            except Exception:
                pass

    launch_kwargs = {"headless": True, "args": ["--no-sandbox", "--disable-dev-shm-usage"]}
    if proxy:
        launch_kwargs["proxy"] = {"server": proxy}

    all_reviews = []

    with sync_playwright() as p:
        browser = p.chromium.launch(**launch_kwargs)
        context = browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            viewport={"width": 1280, "height": 900},
            extra_http_headers={"Accept-Language": "en-US,en;q=0.9"},
        )
        page = context.new_page()
        page.on("response", handle_response)

        page.goto(maps_url, wait_until="networkidle", timeout=45000)
        page.wait_for_timeout(2000)

        # Click the "Reviews" tab to open the review panel
        try:
            page.click("button[aria-label*='Reviews']", timeout=5000)
            page.wait_for_timeout(1500)
        except Exception:
            pass

        # Scroll the review panel to load paginated content
        for i in range(scroll_rounds):
            page.keyboard.press("End")
            page.wait_for_timeout(random.uniform(1000, 2000))

        browser.close()

    # Parse all intercepted review batches
    for batch in intercepted_batches:
        reviews_block = batch[2] if len(batch) > 2 and isinstance(batch[2], list) else []
        for item in reviews_block:
            try:
                review = {
                    "author": item[0][1],
                    "rating": item[4],
                    "text": item[3] if len(item) > 3 and isinstance(item[3], str) else None,
                    "date_relative": item[1],
                }
                all_reviews.append(review)
            except (IndexError, TypeError):
                continue

    return all_reviews

Playwright with a residential proxy passes DataDome's JS challenge because it runs real Chromium. The response interception pattern captures the raw API JSON without needing to parse the DOM.

Rotating User-Agents

DataDome tracks User-Agent strings. Cycling through a pool of realistic browser strings reduces the fingerprint surface:

import random

USER_AGENTS = [
    # Chrome on Windows 11
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    # Chrome on macOS
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    # Firefox on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    # Chrome on Linux
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    # Edge on Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0",
]

def random_headers() -> dict:
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
    }

Proxy Strategy

Datacenter IPs are effectively useless for Google Maps scraping in 2026. Google's ASN blocklists cover every major hosting provider and datacenter range. Requests from these IPs either return CAPTCHAs immediately or trigger DataDome challenges that cycle faster than you can solve them.

Residential proxies route your traffic through real consumer IP addresses, which have clean reputations and pass ASN checks. For Google specifically, you want proxies with geo-targeting so you can pull reviews in the correct locale and language.

ThorData provides rotating residential proxies with city-level targeting. Their pool covers most major markets and the rotation is automatic — each request or session gets a fresh IP, which keeps you under Google's per-IP rate limits without manual management.

# ThorData rotating residential proxy config
PROXY_HOST = "proxy.thordata.com"
PROXY_PORT = 9000
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

PROXY_URL = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"

# Use with httpx
place_id = extract_place_id(
    "https://www.google.com/maps/place/Joe%27s+Pizza/@40.7305,-74.0021,17z",
    proxy=PROXY_URL,
)
reviews = scrape_reviews(place_id, proxy=PROXY_URL, max_pages=5)
print(f"Collected {len(reviews)} reviews")


def thordata_city_proxy(city: str, state: str) -> str:
    """
    Build a ThorData proxy URL targeting a specific US city.
    Useful for pulling local review data in the correct locale.
    """
    targeted_user = f"{PROXY_USER}-city-{city.lower().replace(' ', '_')}-state-{state.upper()}"
    return f"http://{targeted_user}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"

# Example: pull reviews for NYC businesses using NYC-exit proxy
nyc_proxy = thordata_city_proxy("new_york", "NY")

Request Pacing and Backoff

Consistent inter-request timing is a bot signal. Human browsing has variance. Add jitter and exponential backoff on errors:

import time
import random

def exponential_backoff(attempt: int, base_delay: float = 2.0, max_delay: float = 120.0) -> float:
    """Calculate delay with exponential backoff and full jitter."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0, delay)


def fetch_with_retry(url: str, params: dict, headers: dict, proxy: str = None, max_attempts: int = 5) -> dict | None:
    """Fetch a URL with exponential backoff retry on failures."""
    client_kwargs = {"headers": headers, "timeout": 20}
    if proxy:
        client_kwargs["proxies"] = {"all://": proxy}

    for attempt in range(max_attempts):
        try:
            with httpx.Client(**client_kwargs) as client:
                resp = client.get(url, params=params)

            if resp.status_code == 200:
                raw = resp.text
                if raw.startswith(")]}'\n"):
                    raw = raw[5:]
                return json.loads(raw)

            elif resp.status_code == 429:
                delay = exponential_backoff(attempt, base_delay=30.0)
                print(f"Rate limited (attempt {attempt+1}). Waiting {delay:.1f}s...")
                time.sleep(delay)

            elif resp.status_code in (403, 503):
                delay = exponential_backoff(attempt, base_delay=60.0)
                print(f"Blocked (HTTP {resp.status_code}, attempt {attempt+1}). Waiting {delay:.1f}s...")
                time.sleep(delay)

            else:
                print(f"Unexpected status {resp.status_code} on attempt {attempt+1}")
                break

        except (httpx.TimeoutException, httpx.ConnectError) as e:
            delay = exponential_backoff(attempt)
            print(f"Connection error (attempt {attempt+1}): {e}. Retrying in {delay:.1f}s...")
            time.sleep(delay)

    return None

Storing Data in SQLite

Save reviews incrementally so partial runs are not lost:

import sqlite3
from datetime import datetime

def init_db(db_path: str = "google_reviews.db") -> sqlite3.Connection:
    """Initialize the SQLite database with required tables."""
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS businesses (
            place_id TEXT PRIMARY KEY,
            name TEXT,
            address TEXT,
            phone TEXT,
            website TEXT,
            rating REAL,
            review_count INTEGER,
            price_level TEXT,
            categories TEXT,
            lat REAL,
            lng REAL,
            scraped_at TEXT
        );

        CREATE TABLE IF NOT EXISTS reviews (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            place_id TEXT,
            author TEXT,
            rating INTEGER,
            text TEXT,
            date_relative TEXT,
            helpful_count INTEGER,
            scraped_at TEXT,
            UNIQUE (place_id, author, date_relative),
            FOREIGN KEY (place_id) REFERENCES businesses(place_id)
        );

        CREATE INDEX IF NOT EXISTS idx_reviews_place ON reviews(place_id);
        CREATE INDEX IF NOT EXISTS idx_reviews_rating ON reviews(rating);
    """)
    conn.commit()
    return conn


def save_business(conn: sqlite3.Connection, place_id: str, meta: dict):
    """Save or update a business record."""
    conn.execute(
        """INSERT OR REPLACE INTO businesses
           (place_id, name, address, phone, website, rating, review_count,
            price_level, categories, lat, lng, scraped_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (place_id, meta.get("name"), json.dumps(meta.get("address", {})),
         meta.get("phone"), meta.get("website"), meta.get("rating"),
         meta.get("review_count"), meta.get("price_range"),
         json.dumps(meta.get("categories", [])),
         meta.get("lat"), meta.get("lng"),
         datetime.utcnow().isoformat()),
    )
    conn.commit()


def save_reviews(conn: sqlite3.Connection, place_id: str, reviews: list) -> int:
    """Persist reviews to SQLite, skipping duplicates. Returns count inserted."""
    now = datetime.utcnow().isoformat()
    inserted = 0
    for r in reviews:
        try:
            conn.execute(
                "INSERT OR IGNORE INTO reviews "
                "(place_id, author, rating, text, date_relative, helpful_count, scraped_at) "
                "VALUES (?, ?, ?, ?, ?, ?, ?)",
                (place_id, r.get("author"), r.get("rating"),
                 r.get("text"), r.get("date_relative"),
                 r.get("helpful_count", 0), now),
            )
            if conn.execute("SELECT changes()").fetchone()[0]:
                inserted += 1
        except sqlite3.Error as e:
            print(f"DB error: {e}")
    conn.commit()
    print(f"Inserted {inserted} new reviews for {place_id}")
    return inserted

Sentiment Analysis on Review Text

Once you have a corpus of reviews, basic sentiment analysis surfaces patterns:

import re
from collections import Counter

POSITIVE_WORDS = {
    "excellent", "amazing", "great", "outstanding", "fantastic",
    "wonderful", "perfect", "love", "best", "highly", "recommend",
    "clean", "friendly", "professional", "fast", "fresh", "delicious"
}

NEGATIVE_WORDS = {
    "terrible", "awful", "horrible", "disgusting", "rude", "slow",
    "bad", "worst", "disappointing", "overpriced", "dirty", "cold",
    "never", "avoid", "waste", "poor", "mediocre"
}

def analyze_review_sentiment(reviews: list) -> dict:
    """
    Simple lexical sentiment analysis on review text.
    Returns counts and average rating by sentiment bucket.
    """
    buckets = {"positive": [], "negative": [], "neutral": []}
    keyword_freq = Counter()

    for r in reviews:
        text = (r.get("text") or "").lower()
        words = set(re.findall(r'\b[a-z]+\b', text))

        pos_hits = words & POSITIVE_WORDS
        neg_hits = words & NEGATIVE_WORDS
        keyword_freq.update(pos_hits | neg_hits)

        rating = r.get("rating", 3)
        if rating >= 4 or (len(pos_hits) > len(neg_hits) and rating >= 3):
            buckets["positive"].append(r)
        elif rating <= 2 or len(neg_hits) > len(pos_hits):
            buckets["negative"].append(r)
        else:
            buckets["neutral"].append(r)

    stats = {}
    for sentiment, bucket in buckets.items():
        ratings = [r["rating"] for r in bucket if r.get("rating")]
        stats[sentiment] = {
            "count": len(bucket),
            "avg_rating": round(sum(ratings) / len(ratings), 2) if ratings else None,
        }

    stats["top_keywords"] = dict(keyword_freq.most_common(20))
    return stats


def analyze_rating_trend(conn: sqlite3.Connection, place_id: str) -> dict:
    """
    Compute rolling average rating to detect sentiment trend over time.
    Uses date_relative as ordering proxy (sorted alphabetically, imperfect).
    """
    cursor = conn.execute(
        "SELECT rating, date_relative FROM reviews WHERE place_id = ? ORDER BY rowid",
        (place_id,),
    )
    rows = cursor.fetchall()
    if not rows:
        return {}

    ratings = [r[0] for r in rows if r[0] is not None]
    if not ratings:
        return {}

    window = 20
    rolling_avgs = []
    for i in range(window, len(ratings) + 1):
        window_slice = ratings[i - window:i]
        rolling_avgs.append(round(sum(window_slice) / len(window_slice), 2))

    return {
        "total_reviews": len(ratings),
        "overall_avg": round(sum(ratings) / len(ratings), 2),
        "recent_avg": round(sum(ratings[-50:]) / len(ratings[-50:]), 2) if len(ratings) >= 50 else None,
        "rolling_averages_20review_window": rolling_avgs[-10:],  # last 10 windows
    }

Full Pipeline: Multiple Businesses

A complete pipeline that runs discovery, extraction, and storage:

def run_review_pipeline(
    search_queries: list,
    proxy: str = None,
    max_reviews_per_place: int = 200,
    db_path: str = "google_reviews.db",
):
    """
    Full pipeline: search -> place_id extraction -> review pagination -> storage.
    """
    conn = init_db(db_path)
    total_reviews = 0

    for query in search_queries:
        print(f"\n=== Processing: {query} ===")

        # Step 1: Build search URL and find business pages
        search_url = build_maps_search_url(query)
        try:
            listing_urls = extract_listing_urls_from_search(search_url, proxy=proxy)
            print(f"Found {len(listing_urls)} listings for '{query}'")
        except Exception as e:
            print(f"Search failed for '{query}': {e}")
            continue

        # Step 2: Process each listing
        for url in listing_urls[:10]:  # cap per query to avoid runaway scraping
            try:
                # Extract place_id
                place_id = extract_place_id(url, proxy=proxy)
                time.sleep(random.uniform(1, 2))

                # Extract business metadata
                meta = extract_business_metadata(url, proxy=proxy)
                save_business(conn, place_id, meta)
                print(f"  {meta.get('name', 'Unknown')} ({place_id})")

                # Scrape reviews
                max_pages = max(1, max_reviews_per_place // 10)
                reviews = scrape_reviews(place_id, proxy=proxy, max_pages=max_pages)
                count = save_reviews(conn, place_id, reviews)
                total_reviews += count

                time.sleep(random.uniform(2.5, 5.0))

            except Exception as e:
                print(f"  Failed on {url}: {e}")
                continue

    conn.close()
    print(f"\nPipeline complete. Total new reviews inserted: {total_reviews}")

# Example run
PROXY_URL = "http://YOUR_USER:YOUR_PASS@proxy.thordata.com:9000"

run_review_pipeline(
    search_queries=[
        "pizza restaurants New York",
        "coffee shops Seattle",
        "auto repair shops Austin",
    ],
    proxy=PROXY_URL,
    max_reviews_per_place=100,
)

Competitive Intelligence Use Cases

Reputation Monitoring

Track a business's review velocity (reviews per week) and average rating over time. A sudden drop in weekly review count often signals that the platform is filtering reviews. For hospitality businesses, a decline in average rating tends to precede measurable revenue impact by a few weeks.

def compute_review_velocity(conn: sqlite3.Connection, place_id: str, days: int = 30) -> dict:
    """Estimate recent review posting rate from relative date strings."""
    cursor = conn.execute(
        "SELECT date_relative FROM reviews WHERE place_id = ?",
        (place_id,),
    )
    dates = [r[0] for r in cursor.fetchall() if r[0]]

    recent_count = sum(
        1 for d in dates
        if any(term in d.lower() for term in
               ["day ago", "days ago", "week ago", "weeks ago", "hour", "yesterday"])
    )

    return {
        "total_reviews_stored": len(dates),
        "estimated_recent_reviews": recent_count,
        "recent_window_description": f"reviews with relative timestamps <= ~{days} days",
    }

Competitor Rating Gap Analysis

Compare your target business against 3-5 competitors on star rating, review velocity, and keyword frequency in reviews:

def competitor_gap_analysis(
    conn: sqlite3.Connection,
    target_place_id: str,
    competitor_place_ids: list,
) -> dict:
    """Compare target business review metrics against competitors."""
    def get_metrics(place_id):
        cursor = conn.execute(
            "SELECT rating, text FROM reviews WHERE place_id = ?", (place_id,)
        )
        rows = cursor.fetchall()
        ratings = [r[0] for r in rows if r[0]]
        texts = [r[1] for r in rows if r[1]]
        all_text = " ".join(texts).lower()
        words = re.findall(r'\b[a-z]{4,}\b', all_text)
        top_words = Counter(w for w in words if w not in {
            "this", "that", "they", "with", "have", "from", "were", "their"
        }).most_common(10)
        return {
            "avg_rating": round(sum(ratings) / len(ratings), 2) if ratings else None,
            "review_count": len(ratings),
            "top_keywords": top_words,
        }

    return {
        "target": get_metrics(target_place_id),
        "competitors": {pid: get_metrics(pid) for pid in competitor_place_ids},
    }

Legal Considerations

Google's Terms of Service prohibit automated access to Maps data. The Ninth Circuit's hiQ v. LinkedIn ruling held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act, but ToS breach remains a civil risk: Google can terminate API keys, block IPs, or pursue breach-of-contract claims. Do not scrape while authenticated, do not republish review text at scale in ways that compete with Google's own data products, and keep request volumes reasonable. This guide is intended for research and personal use cases.

Performance Benchmarks

At a safe request pace with residential proxies:

Task                          | Typical Time  | Notes
Extract place_id from URL     | 2-4 seconds   | Including proxy overhead
Scrape 10 reviews (1 page)    | 3-5 seconds   | With 2s delay
Scrape 100 reviews (10 pages) | 45-90 seconds | With jitter delays
Full business metadata        | 2-3 seconds   | JSON-LD extraction
Process 50 businesses         | 2-4 hours     | Conservative pacing

At this pace, a single residential IP can process approximately 200-400 businesses per day before hitting per-IP thresholds.
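
The 200-400 figure falls out of simple arithmetic: at roughly 3-7 minutes per business (place_id lookup, metadata, and ~100 reviews with jitter delays), one IP working around the clock lands in that range. A throwaway estimator makes the sensitivity to pacing explicit:

```python
def estimate_daily_capacity(seconds_per_business: float, active_hours: float = 24.0) -> int:
    """Businesses one IP can process per day at a given average pace."""
    return int(active_hours * 3600 / seconds_per_business)

# ~5 minutes per business, around the clock:
#   estimate_daily_capacity(300.0)  -> 288
```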

Summary

Extracting Google Maps reviews at scale in 2026 requires bypassing DataDome, working around the 5-review API cap, and using residential proxies — datacenter IPs get blocked before they can be useful. The place_id extraction from page source is stable, the internal listugcposts endpoint handles pagination, and Playwright with response interception covers cases where httpx gets challenged.

For proxies, ThorData's rotating residential pool is the practical choice — city-level targeting, automatic rotation, and clean IP reputations that actually pass Google's ASN checks. Combined with exponential backoff, User-Agent rotation, and incremental SQLite storage, you have a production-grade review collection pipeline.

The review data itself powers competitive intelligence, reputation monitoring, sentiment trend detection, and lead generation workflows that would require expensive SaaS subscriptions to replicate from commercial data providers.