Scraping Google Reviews and Business Data (2026)
Google Maps is the dominant source of local business intelligence. Every listing carries reviews, ratings, hours, photos, price levels, and structured address data. For lead generation, reputation monitoring, or competitive research, there is no substitute. The challenge is getting the data out at scale in 2026 — Google's anti-bot stack has grown significantly more aggressive.
This guide covers what data is available, two viable extraction approaches, the practical mechanics of pagination, anti-bot evasion, data storage, and advanced analytical patterns.
What Data You Can Extract
A Google Maps business listing contains more structured data than most competitors:
- Business name, address, phone, website — the NAP trifecta plus direct links
- Star rating and total review count — aggregate sentiment across all time
- Individual reviews — text, star rating, date, author name, author's review count, helpful votes
- Business categories — primary and secondary classifications
- Operating hours — per-day hours including holiday exceptions
- Price level — $ to $$$$ indicator
- Popular times — hourly foot traffic estimates by day of week
- Photos — customer and owner-uploaded images
- Attributes — accessibility, payment methods, amenities
- Q&A section — business owner responses and community questions
- Posts — Google Business Profile posts with offers and events
Two Approaches: Places API vs HTML Scraping
Google Maps Places API
Google's official Places API gives you structured JSON with zero parsing effort. The endpoint is well-documented and reliable. The problems are cost and data limits.
Pricing in 2026 sits at approximately $17 per 1,000 Place Details requests. More critically, the API caps review data at the 5 most relevant reviews per place — there is no pagination parameter. If you need full review history — hundreds or thousands of reviews per business — the official API cannot help you.
Use the API when you need addresses, hours, and basic ratings at scale and have the budget for it. Use HTML scraping when you need full review history.
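If the API's limits fit your use case, the call itself is simple. A minimal sketch against the classic Place Details web service; the `fields` mask shown here is illustrative (check current billing docs before choosing field groups), and `TEST_KEY`-style credentials are placeholders:

```python
PLACE_DETAILS_URL = "https://maps.googleapis.com/maps/api/place/details/json"

def build_place_details_params(place_id: str, api_key: str) -> dict:
    """Assemble query parameters for a Place Details request.

    The fields mask limits billing: you are charged per field group returned.
    """
    return {
        "place_id": place_id,
        "fields": "name,formatted_address,rating,user_ratings_total,reviews",
        "key": api_key,
    }

def fetch_place_details(place_id: str, api_key: str) -> dict:
    """Call the official endpoint. Note the hard cap: at most 5 reviews, no pagination."""
    import httpx  # deferred so the pure helper above has no dependencies

    resp = httpx.get(
        PLACE_DETAILS_URL,
        params=build_place_details_params(place_id, api_key),
        timeout=20,
    )
    resp.raise_for_status()
    data = resp.json()
    if data.get("status") != "OK":
        raise RuntimeError(f"Places API error: {data.get('status')}")
    return data["result"]
```

The pure parameter builder is split out so request construction can be tested without touching the network.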
HTML Scraping
Full review history requires scraping the Maps frontend directly. Google renders its Maps UI as a heavily JavaScript-driven single-page app, which means raw HTTP requests to maps.google.com return minimal useful data in the initial HTML. However, the page source does embed structured initialization data, and Google's review loading uses an internal API endpoint that you can call directly once you have the place_id.
Dependencies and Setup
pip install httpx[http2] playwright beautifulsoup4 curl-cffi
playwright install chromium
We use httpx for most requests and curl_cffi when we need Chrome-level TLS fingerprinting. Playwright handles cases where JavaScript challenges block both.
Extracting the place_id
Every Google Maps business has a stable place_id identifier. It appears in multiple locations in the page source, most reliably inside the window.APP_INITIALIZATION_STATE JSON blob embedded in a <script> tag.
import httpx
import re
import json
import time
import random
def extract_place_id(maps_url: str, proxy: str = None) -> str:
"""Extract place_id from a Google Maps business URL."""
headers = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Referer": "https://www.google.com/",
}
client_kwargs = {"headers": headers, "follow_redirects": True, "timeout": 20}
if proxy:
client_kwargs["proxies"] = {"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(maps_url)
resp.raise_for_status()
        # place_ids are typically 27-character tokens beginning with ChIJ
match = re.search(r'"(ChIJ[A-Za-z0-9_\-]{20,})"', resp.text)
if match:
return match.group(1)
# Fallback: explicit key lookup
match = re.search(r'"place_id"\s*:\s*"([^"]+)"', resp.text)
if match:
return match.group(1)
raise ValueError(f"Could not extract place_id from {maps_url}")
def extract_place_ids_batch(urls: list, proxy: str = None) -> dict:
"""Extract place_ids for a list of business URLs."""
results = {}
for url in urls:
try:
pid = extract_place_id(url, proxy=proxy)
results[url] = pid
print(f"Extracted: {pid}")
except Exception as e:
print(f"Failed on {url}: {e}")
results[url] = None
time.sleep(random.uniform(1.5, 3.5))
return results
Google Maps place_id values typically begin with ChIJ and run about 27 characters, though other prefixes and lengths exist. If your regex matches a token that looks nothing like this, you have the wrong string.
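A quick plausibility filter catches obviously wrong matches before you spend requests on them. This is a heuristic, not a strict validator, since other prefixes and lengths do occur:

```python
import re

# Heuristic shape check for place_id tokens. Most business IDs are
# 27-character base64-style strings starting with "ChIJ", but Google
# uses other prefixes too, so this only rejects clearly wrong matches.
PLACE_ID_RE = re.compile(r"^[A-Za-z0-9_\-]{20,60}$")

def looks_like_place_id(token: str) -> bool:
    """Return True if `token` is plausibly a Maps place_id."""
    return bool(PLACE_ID_RE.match(token)) and token[:2].isalpha()
```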
Finding Business URLs via Search
Instead of starting from individual URLs, you can construct search URLs programmatically:
def build_maps_search_url(query: str, location: str = None) -> str:
"""Build a Google Maps search URL for a business type in a location."""
import urllib.parse
search_term = f"{query} {location}".strip() if location else query
encoded = urllib.parse.quote_plus(search_term)
return f"https://www.google.com/maps/search/{encoded}/"
def extract_listing_urls_from_search(search_url: str, proxy: str = None) -> list:
"""
Extract individual business listing URLs from a Maps search result page.
Returns a list of place URLs from the sidebar.
"""
headers = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
}
client_kwargs = {"headers": headers, "follow_redirects": True, "timeout": 20}
if proxy:
client_kwargs["proxies"] = {"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(search_url)
# Place URLs follow the pattern /maps/place/Name/@lat,lng,zoom/data=...
urls = re.findall(r'https://www\.google\.com/maps/place/[^"\'\\]+', resp.text)
seen = set()
unique = []
for u in urls:
clean = u.split("\\")[0]
if clean not in seen:
seen.add(clean)
unique.append(clean)
return unique
Paginating Reviews via the Internal API
Once you have a place_id, Google's internal review endpoint accepts pagination tokens. At the time of writing the path is /maps/rpc/listugcposts; it returns JSON with a continuation token for the next page. Internal paths can change without notice, so verify against a live request in DevTools before relying on it.
REVIEWS_ENDPOINT = "https://www.google.com/maps/rpc/listugcposts"
def scrape_reviews(place_id: str, proxy: str = None, max_pages: int = 10) -> list:
"""Scrape reviews for a Google Maps business using the internal pagination API."""
headers = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"Accept": "*/*",
"Referer": f"https://www.google.com/maps/place/?place_id={place_id}",
"X-Goog-Maps-Experience-Id": "maps:mf",
}
client_kwargs = {"headers": headers, "timeout": 20}
if proxy:
client_kwargs["proxies"] = {"all://": proxy}
all_reviews = []
next_page_token = None
with httpx.Client(**client_kwargs) as client:
for page in range(max_pages):
params = {
"authuser": "0",
"hl": "en",
"gl": "us",
"pb": _build_pb_param(place_id, next_page_token),
}
resp = client.get(REVIEWS_ENDPOINT, params=params)
if resp.status_code != 200:
print(f"Request failed on page {page}: HTTP {resp.status_code}")
break
# Google wraps JSON responses with )]}' to prevent JSON hijacking
raw = resp.text
if raw.startswith(")]}'\n"):
raw = raw[5:]
try:
data = json.loads(raw)
except json.JSONDecodeError:
print(f"Failed to parse JSON on page {page}")
break
# Reviews are in a nested list structure; position varies by response version
reviews_block = data[2] if len(data) > 2 else []
if not reviews_block:
print(f"No reviews on page {page}, stopping")
break
for item in reviews_block:
try:
review = {
"author": item[0][1],
"author_review_count": item[0][12] if len(item[0]) > 12 else None,
"rating": item[4],
"text": item[3] if len(item) > 3 and isinstance(item[3], str) else None,
"date_relative": item[1],
"helpful_count": item[16] if len(item) > 16 else 0,
}
all_reviews.append(review)
except (IndexError, TypeError):
continue
# Continuation token for next page
next_page_token = data[-1] if isinstance(data[-1], str) and len(data[-1]) > 20 else None
if not next_page_token:
print(f"No continuation token after page {page}")
break
time.sleep(random.uniform(1.5, 3.5))
return all_reviews
def _build_pb_param(place_id: str, next_page_token: str = None) -> str:
"""Build the protocol buffer parameter for the reviews request."""
base = f"!1m2!1y{place_id}!4m6!2m5!1i10!2i0!3i0!4b1!5b1"
if next_page_token:
base += f"!6s{next_page_token}"
return base
The pb parameter is a compact protobuf-derived encoding. The structure above handles most businesses; heavily-reviewed locations with 1,000+ reviews may require inspecting the actual network requests in DevTools to confirm the token format for your target.
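When comparing your generated pb against a working request copied from DevTools, splitting both strings into their !-delimited fields makes the diff readable. A small hypothetical helper:

```python
def split_pb_fields(pb: str) -> list:
    """Split a Maps `pb` parameter into its !-delimited fields.

    Each field looks like `<index><type><value>`, e.g. "1i10" is field 1,
    integer type, value 10. Diffing the field list from a working DevTools
    request against your own output localizes token-format mismatches.
    """
    return [field for field in pb.split("!") if field]
```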
Extracting Business Metadata
Beyond reviews, the Maps page source contains rich business metadata embedded in the initialization data:
from bs4 import BeautifulSoup
def extract_business_metadata(maps_url: str, proxy: str = None) -> dict:
"""
Extract structured business metadata from a Google Maps page.
Returns name, address, phone, website, hours, price_level, rating, review_count.
"""
headers = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
}
client_kwargs = {"headers": headers, "follow_redirects": True, "timeout": 20}
if proxy:
client_kwargs["proxies"] = {"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(maps_url)
meta = {}
soup = BeautifulSoup(resp.text, "html.parser")
# JSON-LD block — Google embeds LocalBusiness schema on some listing pages
for script in soup.find_all("script", type="application/ld+json"):
try:
ld = json.loads(script.string)
if ld.get("@type") in ("LocalBusiness", "Restaurant", "Store", "Hotel"):
meta["name"] = ld.get("name", "")
meta["address"] = ld.get("address", {})
meta["phone"] = ld.get("telephone", "")
meta["website"] = ld.get("url", "")
meta["rating"] = ld.get("aggregateRating", {}).get("ratingValue")
meta["review_count"] = ld.get("aggregateRating", {}).get("reviewCount")
meta["price_range"] = ld.get("priceRange", "")
meta["categories"] = ld.get("servesCuisine", []) or ld.get("category", "")
break
except (json.JSONDecodeError, AttributeError):
continue
        # Title tag fallback for business name
        if not meta.get("name"):
            title = soup.find("title")
            if title and title.string:
                meta["name"] = title.string.replace(" - Google Maps", "").strip()
        # Extract lat/lng from the final (post-redirect) URL
        lat_lng = re.search(r"@(-?\d+\.\d+),(-?\d+\.\d+)", str(resp.url))
if lat_lng:
meta["lat"] = float(lat_lng.group(1))
meta["lng"] = float(lat_lng.group(2))
return meta
Anti-Bot Measures: DataDome in 2026
Google Maps uses DataDome as its primary bot detection layer. Understanding what it checks helps you avoid triggers:
TLS fingerprinting — DataDome inspects the TLS ClientHello to verify it matches a real browser's cipher suite ordering. httpx uses Python's ssl module, which has a different fingerprint than Chrome. Libraries like curl_cffi can spoof Chrome's TLS fingerprint at the transport layer.
Browser JS challenges — On the first few requests from a fresh IP, DataDome may inject a JavaScript challenge page instead of returning content. httpx cannot execute this. If you see a response body containing datadome and no useful content, you have been challenged.
Rate limiting per IP — Google rate-limits review API calls to roughly 30-50 requests per hour per IP before triggering CAPTCHAs or empty responses. Residential IPs have higher thresholds than datacenter IPs, which are blocked outright.
Behavioral signals — DataDome tracks inter-request timing, mouse movements (via Playwright), scroll events, and navigation patterns. Requests that arrive at perfectly consistent intervals without any human-like variation are flagged faster.
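Detecting a challenge early saves wasted pagination attempts. A heuristic check; the markers below commonly appear in DataDome interstitials but are assumptions, not a documented contract:

```python
def looks_like_datadome_challenge(body: str, status_code: int) -> bool:
    """Heuristically detect a DataDome interstitial instead of real content.

    Challenge pages are short, reference DataDome's assets, and often
    arrive as 403. A positive means: rotate the IP or escalate transport,
    do not keep hammering the same session.
    """
    lowered = body.lower()
    markers = ("datadome", "captcha-delivery.com")
    return status_code == 403 or (
        len(body) < 20_000 and any(marker in lowered for marker in markers)
    )
```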
Using curl_cffi for Chrome TLS Spoofing
When httpx fails DataDome TLS checks, curl_cffi provides a drop-in replacement that presents Chrome's exact cipher suite:
from curl_cffi import requests as cffi_requests
def fetch_with_chrome_tls(url: str, proxy: str = None) -> str:
"""Fetch a page using Chrome-level TLS fingerprint via curl_cffi."""
headers = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
"Accept-Language": "en-US,en;q=0.9",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
proxies = {"http": proxy, "https": proxy} if proxy else None
resp = cffi_requests.get(
url,
headers=headers,
proxies=proxies,
impersonate="chrome124", # spoof Chrome 124 TLS fingerprint
timeout=20,
)
return resp.text
Playwright Approach for Blocked Requests
When httpx and curl_cffi both get DataDome challenges, switch to Playwright. You can intercept the network responses and extract JSON directly without parsing rendered HTML:
from playwright.sync_api import sync_playwright
import json
import random
def scrape_reviews_playwright(maps_url: str, proxy: str = None) -> list:
"""Full browser approach with response interception for DataDome bypass."""
intercepted = []
def handle_response(response):
if "listugcposts" in response.url and response.status == 200:
try:
body = response.body().decode("utf-8")
if body.startswith(")]}'\n"):
body = body[5:]
data = json.loads(body)
intercepted.append(data)
except Exception:
pass
launch_kwargs = {"headless": True}
if proxy:
launch_kwargs["proxy"] = {"server": proxy}
with sync_playwright() as p:
browser = p.chromium.launch(**launch_kwargs)
context = browser.new_context(
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
locale="en-US",
viewport={"width": 1280, "height": 900},
)
page = context.new_page()
page.on("response", handle_response)
page.goto(maps_url, wait_until="networkidle", timeout=30000)
# Scroll the reviews panel to trigger pagination loads
for _ in range(5):
page.keyboard.press("End")
page.wait_for_timeout(1500)
browser.close()
return intercepted
def scrape_reviews_playwright_full(maps_url: str, proxy: str = None, scroll_rounds: int = 8) -> list:
"""
Extended Playwright scraper that scrolls the review panel multiple times
to capture more paginated reviews before handing back intercepted data.
"""
intercepted_batches = []
def handle_response(response):
if "listugcposts" in response.url and response.status == 200:
try:
body = response.body().decode("utf-8")
if body.startswith(")]}'\n"):
body = body[5:]
data = json.loads(body)
intercepted_batches.append(data)
except Exception:
pass
launch_kwargs = {"headless": True, "args": ["--no-sandbox", "--disable-dev-shm-usage"]}
if proxy:
launch_kwargs["proxy"] = {"server": proxy}
all_reviews = []
with sync_playwright() as p:
browser = p.chromium.launch(**launch_kwargs)
context = browser.new_context(
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
),
locale="en-US",
viewport={"width": 1280, "height": 900},
extra_http_headers={"Accept-Language": "en-US,en;q=0.9"},
)
page = context.new_page()
page.on("response", handle_response)
page.goto(maps_url, wait_until="networkidle", timeout=45000)
page.wait_for_timeout(2000)
# Click the "Reviews" tab to open the review panel
try:
page.click("button[aria-label*='Reviews']", timeout=5000)
page.wait_for_timeout(1500)
except Exception:
pass
# Scroll the review panel to load paginated content
for i in range(scroll_rounds):
page.keyboard.press("End")
page.wait_for_timeout(random.uniform(1000, 2000))
browser.close()
# Parse all intercepted review batches
for batch in intercepted_batches:
reviews_block = batch[2] if len(batch) > 2 and isinstance(batch[2], list) else []
for item in reviews_block:
try:
review = {
"author": item[0][1],
"rating": item[4],
"text": item[3] if len(item) > 3 and isinstance(item[3], str) else None,
"date_relative": item[1],
}
all_reviews.append(review)
except (IndexError, TypeError):
continue
return all_reviews
Playwright with a residential proxy passes DataDome's JS challenge because it runs real Chromium. The response interception pattern captures the raw API JSON without needing to parse the DOM.
Rotating User-Agents
DataDome tracks User-Agent strings. Cycling through a pool of realistic browser strings reduces the fingerprint surface:
import random
USER_AGENTS = [
# Chrome on Windows 11
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
# Chrome on macOS
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
# Firefox on Windows
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
# Chrome on Linux
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
# Edge on Windows
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0",
]
def random_headers() -> dict:
return {
"User-Agent": random.choice(USER_AGENTS),
"Accept-Language": "en-US,en;q=0.9",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.google.com/",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "same-origin",
}
Proxy Strategy
Datacenter IPs are effectively useless for Google Maps scraping in 2026. Google's ASN blocklists cover every major hosting provider and datacenter range. Requests from these IPs either return CAPTCHAs immediately or trigger DataDome challenges that cycle faster than you can solve them.
Residential proxies route your traffic through real consumer IP addresses, which have clean reputations and pass ASN checks. For Google specifically, you want proxies with geo-targeting so you can pull reviews in the correct locale and language.
ThorData provides rotating residential proxies with city-level targeting. Their pool covers most major markets and the rotation is automatic — each request or session gets a fresh IP, which keeps you under Google's per-IP rate limits without manual management.
# ThorData rotating residential proxy config
PROXY_HOST = "proxy.thordata.com"
PROXY_PORT = 9000
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_URL = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"
# Use with httpx
place_id = extract_place_id(
"https://www.google.com/maps/place/Joe%27s+Pizza/@40.7305,-74.0021,17z",
proxy=PROXY_URL,
)
reviews = scrape_reviews(place_id, proxy=PROXY_URL, max_pages=5)
print(f"Collected {len(reviews)} reviews")
def thordata_city_proxy(city: str, state: str) -> str:
"""
Build a ThorData proxy URL targeting a specific US city.
Useful for pulling local review data in the correct locale.
"""
targeted_user = f"{PROXY_USER}-city-{city.lower().replace(' ', '_')}-state-{state.upper()}"
return f"http://{targeted_user}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"
# Example: pull reviews for NYC businesses using NYC-exit proxy
nyc_proxy = thordata_city_proxy("new_york", "NY")
Request Pacing and Backoff
Consistent inter-request timing is a bot signal. Human browsing has variance. Add jitter and exponential backoff on errors:
import time
import random
def exponential_backoff(attempt: int, base_delay: float = 2.0, max_delay: float = 120.0) -> float:
"""Calculate delay with exponential backoff and full jitter."""
delay = min(base_delay * (2 ** attempt), max_delay)
return random.uniform(0, delay)
def fetch_with_retry(url: str, params: dict, headers: dict, proxy: str = None, max_attempts: int = 5) -> dict | None:
"""Fetch a URL with exponential backoff retry on failures."""
client_kwargs = {"headers": headers, "timeout": 20}
if proxy:
client_kwargs["proxies"] = {"all://": proxy}
for attempt in range(max_attempts):
try:
with httpx.Client(**client_kwargs) as client:
resp = client.get(url, params=params)
if resp.status_code == 200:
raw = resp.text
if raw.startswith(")]}'\n"):
raw = raw[5:]
return json.loads(raw)
elif resp.status_code == 429:
delay = exponential_backoff(attempt, base_delay=30.0)
print(f"Rate limited (attempt {attempt+1}). Waiting {delay:.1f}s...")
time.sleep(delay)
elif resp.status_code in (403, 503):
delay = exponential_backoff(attempt, base_delay=60.0)
print(f"Blocked (HTTP {resp.status_code}, attempt {attempt+1}). Waiting {delay:.1f}s...")
time.sleep(delay)
else:
print(f"Unexpected status {resp.status_code} on attempt {attempt+1}")
break
except (httpx.TimeoutException, httpx.ConnectError) as e:
delay = exponential_backoff(attempt)
print(f"Connection error (attempt {attempt+1}): {e}. Retrying in {delay:.1f}s...")
time.sleep(delay)
return None
Storing Data in SQLite
Save reviews incrementally so partial runs are not lost:
import sqlite3
from datetime import datetime
def init_db(db_path: str = "google_reviews.db") -> sqlite3.Connection:
"""Initialize the SQLite database with required tables."""
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS businesses (
place_id TEXT PRIMARY KEY,
name TEXT,
address TEXT,
phone TEXT,
website TEXT,
rating REAL,
review_count INTEGER,
price_level TEXT,
categories TEXT,
lat REAL,
lng REAL,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
place_id TEXT,
author TEXT,
rating INTEGER,
text TEXT,
date_relative TEXT,
helpful_count INTEGER,
scraped_at TEXT,
UNIQUE (place_id, author, date_relative),
FOREIGN KEY (place_id) REFERENCES businesses(place_id)
);
CREATE INDEX IF NOT EXISTS idx_reviews_place ON reviews(place_id);
CREATE INDEX IF NOT EXISTS idx_reviews_rating ON reviews(rating);
""")
conn.commit()
return conn
def save_business(conn: sqlite3.Connection, place_id: str, meta: dict):
"""Save or update a business record."""
conn.execute(
"""INSERT OR REPLACE INTO businesses
(place_id, name, address, phone, website, rating, review_count,
price_level, categories, lat, lng, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(place_id, meta.get("name"), json.dumps(meta.get("address", {})),
meta.get("phone"), meta.get("website"), meta.get("rating"),
meta.get("review_count"), meta.get("price_range"),
json.dumps(meta.get("categories", [])),
meta.get("lat"), meta.get("lng"),
datetime.utcnow().isoformat()),
)
conn.commit()
def save_reviews(conn: sqlite3.Connection, place_id: str, reviews: list) -> int:
"""Persist reviews to SQLite, skipping duplicates. Returns count inserted."""
now = datetime.utcnow().isoformat()
inserted = 0
for r in reviews:
try:
conn.execute(
"INSERT OR IGNORE INTO reviews "
"(place_id, author, rating, text, date_relative, helpful_count, scraped_at) "
"VALUES (?, ?, ?, ?, ?, ?, ?)",
(place_id, r.get("author"), r.get("rating"),
r.get("text"), r.get("date_relative"),
r.get("helpful_count", 0), now),
)
if conn.execute("SELECT changes()").fetchone()[0]:
inserted += 1
except sqlite3.Error as e:
print(f"DB error: {e}")
conn.commit()
print(f"Inserted {inserted} new reviews for {place_id}")
return inserted
Sentiment Analysis on Review Text
Once you have a corpus of reviews, basic sentiment analysis surfaces patterns:
import re
from collections import Counter
POSITIVE_WORDS = {
"excellent", "amazing", "great", "outstanding", "fantastic",
"wonderful", "perfect", "love", "best", "highly", "recommend",
"clean", "friendly", "professional", "fast", "fresh", "delicious"
}
NEGATIVE_WORDS = {
"terrible", "awful", "horrible", "disgusting", "rude", "slow",
"bad", "worst", "disappointing", "overpriced", "dirty", "cold",
"never", "avoid", "waste", "poor", "mediocre"
}
def analyze_review_sentiment(reviews: list) -> dict:
"""
Simple lexical sentiment analysis on review text.
Returns counts and average rating by sentiment bucket.
"""
buckets = {"positive": [], "negative": [], "neutral": []}
keyword_freq = Counter()
for r in reviews:
text = (r.get("text") or "").lower()
words = set(re.findall(r'\b[a-z]+\b', text))
pos_hits = words & POSITIVE_WORDS
neg_hits = words & NEGATIVE_WORDS
keyword_freq.update(pos_hits | neg_hits)
rating = r.get("rating", 3)
if rating >= 4 or (len(pos_hits) > len(neg_hits) and rating >= 3):
buckets["positive"].append(r)
elif rating <= 2 or len(neg_hits) > len(pos_hits):
buckets["negative"].append(r)
else:
buckets["neutral"].append(r)
stats = {}
for sentiment, bucket in buckets.items():
ratings = [r["rating"] for r in bucket if r.get("rating")]
stats[sentiment] = {
"count": len(bucket),
"avg_rating": round(sum(ratings) / len(ratings), 2) if ratings else None,
}
stats["top_keywords"] = dict(keyword_freq.most_common(20))
return stats
def analyze_rating_trend(conn: sqlite3.Connection, place_id: str) -> dict:
    """
    Compute rolling average rating to detect sentiment trend over time.
    Orders by rowid (insertion order), which only approximates chronology.
    """
cursor = conn.execute(
"SELECT rating, date_relative FROM reviews WHERE place_id = ? ORDER BY rowid",
(place_id,),
)
rows = cursor.fetchall()
if not rows:
return {}
ratings = [r[0] for r in rows if r[0] is not None]
if not ratings:
return {}
window = 20
rolling_avgs = []
for i in range(window, len(ratings) + 1):
window_slice = ratings[i - window:i]
rolling_avgs.append(round(sum(window_slice) / len(window_slice), 2))
return {
"total_reviews": len(ratings),
"overall_avg": round(sum(ratings) / len(ratings), 2),
"recent_avg": round(sum(ratings[-50:]) / len(ratings[-50:]), 2) if len(ratings) >= 50 else None,
"rolling_averages_20review_window": rolling_avgs[-10:], # last 10 windows
}
Full Pipeline: Multiple Businesses
A complete pipeline that runs discovery, extraction, and storage:
def run_review_pipeline(
search_queries: list,
proxy: str = None,
max_reviews_per_place: int = 200,
db_path: str = "google_reviews.db",
):
"""
Full pipeline: search -> place_id extraction -> review pagination -> storage.
"""
conn = init_db(db_path)
total_reviews = 0
for query in search_queries:
print(f"\n=== Processing: {query} ===")
# Step 1: Build search URL and find business pages
search_url = build_maps_search_url(query)
try:
listing_urls = extract_listing_urls_from_search(search_url, proxy=proxy)
print(f"Found {len(listing_urls)} listings for '{query}'")
except Exception as e:
print(f"Search failed for '{query}': {e}")
continue
# Step 2: Process each listing
for url in listing_urls[:10]: # cap per query to avoid runaway scraping
try:
# Extract place_id
place_id = extract_place_id(url, proxy=proxy)
time.sleep(random.uniform(1, 2))
# Extract business metadata
meta = extract_business_metadata(url, proxy=proxy)
save_business(conn, place_id, meta)
print(f" {meta.get('name', 'Unknown')} ({place_id})")
# Scrape reviews
max_pages = max(1, max_reviews_per_place // 10)
reviews = scrape_reviews(place_id, proxy=proxy, max_pages=max_pages)
count = save_reviews(conn, place_id, reviews)
total_reviews += count
time.sleep(random.uniform(2.5, 5.0))
except Exception as e:
print(f" Failed on {url}: {e}")
continue
conn.close()
print(f"\nPipeline complete. Total new reviews inserted: {total_reviews}")
# Example run
PROXY_URL = "http://YOUR_USER:YOUR_PASS@proxy.thordata.com:9000"
run_review_pipeline(
search_queries=[
"pizza restaurants New York",
"coffee shops Seattle",
"auto repair shops Austin",
],
proxy=PROXY_URL,
max_reviews_per_place=100,
)
Competitive Intelligence Use Cases
Reputation Monitoring
Track a business's review velocity (reviews per week) and average rating over time. A sudden drop in weekly review count often signals that the platform is filtering reviews. In hospitality, a decline in average rating often precedes measurable revenue impact by a few weeks.
def compute_review_velocity(conn: sqlite3.Connection, place_id: str, days: int = 30) -> dict:
"""Estimate recent review posting rate from relative date strings."""
cursor = conn.execute(
"SELECT date_relative FROM reviews WHERE place_id = ?",
(place_id,),
)
dates = [r[0] for r in cursor.fetchall() if r[0]]
recent_count = sum(
1 for d in dates
if any(term in d.lower() for term in
["day ago", "days ago", "week ago", "weeks ago", "hour", "yesterday"])
)
return {
"total_reviews_stored": len(dates),
"estimated_recent_reviews": recent_count,
"recent_window_description": f"reviews with relative timestamps <= ~{days} days",
}
Competitor Rating Gap Analysis
Compare your target business against 3-5 competitors on star rating, review velocity, and keyword frequency in reviews:
def competitor_gap_analysis(
conn: sqlite3.Connection,
target_place_id: str,
competitor_place_ids: list,
) -> dict:
"""Compare target business review metrics against competitors."""
def get_metrics(place_id):
cursor = conn.execute(
"SELECT rating, text FROM reviews WHERE place_id = ?", (place_id,)
)
rows = cursor.fetchall()
ratings = [r[0] for r in rows if r[0]]
texts = [r[1] for r in rows if r[1]]
all_text = " ".join(texts).lower()
words = re.findall(r'\b[a-z]{4,}\b', all_text)
top_words = Counter(w for w in words if w not in {
"this", "that", "they", "with", "have", "from", "were", "their"
}).most_common(10)
return {
"avg_rating": round(sum(ratings) / len(ratings), 2) if ratings else None,
"review_count": len(ratings),
"top_keywords": top_words,
}
return {
"target": get_metrics(target_place_id),
"competitors": {pid: get_metrics(pid) for pid in competitor_place_ids},
}
Legal Notes
Google's Terms of Service prohibit automated access to Maps data. The Ninth Circuit's hiQ v. LinkedIn rulings suggested that scraping publicly available data does not violate the Computer Fraud and Abuse Act, but the case ultimately settled, and ToS breach remains a civil risk. Google can terminate API keys, block IPs, or pursue breach of contract claims. Do not scrape while authenticated, do not republish review text at scale in ways that compete with Google's own data products, and keep request volumes reasonable. This guide is for research and personal use cases.
Performance Benchmarks
At a safe request pace with residential proxies:
| Task | Typical Time | Notes |
|---|---|---|
| Extract place_id from URL | 2-4 seconds | Including proxy overhead |
| Scrape 10 reviews (1 page) | 3-5 seconds | With 2s delay |
| Scrape 100 reviews (10 pages) | 45-90 seconds | With jitter delays |
| Full business metadata | 2-3 seconds | JSON-LD extraction |
| Process 50 businesses | 2-4 hours | Conservative pacing |
At this pace, a single residential IP can process approximately 200-400 businesses per day before hitting per-IP thresholds.
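The capacity figures above reduce to simple arithmetic, which is worth scripting when sizing a proxy pool. A planning sketch; both the per-business time and the per-IP daily cap are rough numbers taken from the table and the paragraph above, not guarantees:

```python
def estimate_daily_capacity(
    seconds_per_business: float,
    ip_count: int,
    hours_per_day: float = 24.0,
    per_ip_daily_cap: int = 400,
) -> int:
    """Estimate businesses/day for a pool: min(time-limited, per-IP-limited)."""
    time_limited = int(ip_count * hours_per_day * 3600 / seconds_per_business)
    return min(time_limited, ip_count * per_ip_daily_cap)
```

With one IP at three minutes per business, the per-IP cap (not wall-clock time) is the binding constraint; slower pacing flips that.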
Summary
Extracting Google Maps reviews at scale in 2026 requires bypassing DataDome, working around the 5-review API cap, and using residential proxies — datacenter IPs get blocked before they can be useful. The place_id extraction from page source is stable, the internal listugcposts endpoint handles pagination, and Playwright with response interception covers cases where httpx gets challenged.
For proxies, ThorData's rotating residential pool is the practical choice — city-level targeting, automatic rotation, and clean IP reputations that actually pass Google's ASN checks. Combined with exponential backoff, User-Agent rotation, and incremental SQLite storage, you have a production-grade review collection pipeline.
The review data itself powers competitive intelligence, reputation monitoring, sentiment trend detection, and lead generation workflows that would require expensive SaaS subscriptions to replicate from commercial data providers.