How to Scrape Yelp Business Data in 2026: A Complete Guide
Yelp remains one of the richest sources of local business intelligence. With over 265 million reviews across restaurants, services, and retail, scraping Yelp gives you access to competitive pricing, sentiment analysis, and location-based market research that no API can match at scale.
This guide walks through scraping Yelp business listings with Python — including the anti-bot measures you'll actually encounter in 2026, Yelp's official Fusion API, their internal GraphQL endpoint, and a full batch pipeline that saves to SQLite and resumes from where it left off.
What Data Can You Extract from Yelp?
Each Yelp business page contains structured data worth extracting:
- Business name, full address (street, city, state, zip) — core NAP data
- Phone number and website URL — lead generation essentials
- Star rating and review count — aggregate sentiment at a glance
- Hours of operation — per-day schedule
- Price range — $, $$, $$$, or $$$$
- Categories — taxonomy tags like "Bakeries", "Cafes", "Sandwiches"
- Photos count — proxy for business activity level
- Amenities and highlights — "Outdoor Seating", "Good for Groups", "Takes Reservations"
- Health score — where available from city health inspection data
- Individual reviews — text, date, reviewer, votes, elite status
Yelp's Anti-Bot Measures in 2026
Yelp has significantly hardened its defenses. Here's what you'll face:
- Aggressive rate limiting — More than 20–30 requests per minute from a single IP triggers a CAPTCHA wall or temporary block.
- JavaScript-rendered content — Review text and some business details load dynamically via XHR calls, not in the initial HTML.
- TLS fingerprinting — Yelp checks TLS fingerprints, header ordering, and cipher suites to identify non-browser clients. A standard requests or httpx TLS handshake differs from Chrome's.
- Bot detection cookies — A bse cookie is set on first visit; missing or malformed values trigger blocks.
- Honeypot links — Hidden elements in the DOM (visibility:hidden, display:none) that only bots follow. Clicking them results in an instant ban.
- ASN-level IP blocking — Yelp checks your IP's ASN. Datacenter ranges (AWS, GCP, DigitalOcean, etc.) get flagged immediately regardless of request behavior.
Setting Up Your Scraper
pip install httpx selectolax curl_cffi pandas sqlite-utils
Use curl_cffi instead of plain httpx for its Chrome TLS impersonation. This is the single biggest factor in avoiding Yelp blocks in 2026.
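A quick way to sanity-check the impersonation before building anything bigger; this one-off request should come back 200 rather than a block page:

from curl_cffi import requests as cffi_requests

# Fetch the homepage with a Chrome 124 TLS fingerprint
resp = cffi_requests.get("https://www.yelp.com/", impersonate="chrome124", timeout=15)
print(resp.status_code)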
User Agent Rotation Pool
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0",
"Mozilla/5.0 (iPhone; CPU iPhone OS 17_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Mobile/15E148 Safari/604.1",
"Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.6367.82 Mobile Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
Complete Business Scraper with Full Field Extraction
import random
import time
import re
import json
from curl_cffi import requests as cffi_requests
from selectolax.parser import HTMLParser
def make_session(proxy: str = None) -> cffi_requests.Session:
"""Create a curl_cffi session that impersonates Chrome TLS fingerprint."""
session = cffi_requests.Session(impersonate="chrome124")
if proxy:
session.proxies = {"http": proxy, "https": proxy}
session.headers.update({
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
})
return session
def scrape_yelp_business(url: str, session: cffi_requests.Session) -> dict:
"""Scrape a single Yelp business page for all available fields."""
session.headers["User-Agent"] = random.choice(USER_AGENTS)
session.headers["Referer"] = "https://www.yelp.com/"
resp = session.get(url, timeout=20)
resp.raise_for_status()
tree = HTMLParser(resp.text)
    # Collect honeypot trap hrefs so any link-following logic can skip them
    # (CSS attribute values containing ":" must be quoted for the selector to parse)
    hidden = tree.css('a[style*="display:none"], a[style*="visibility:hidden"]')
    honeypot_hrefs = {el.attributes.get("href") for el in hidden}
# Business name
name_el = tree.css_first("h1.css-1se8maq, h1")
name = name_el.text(strip=True) if name_el else None
# Rating from aria-label
rating = None
    rating_el = tree.css_first('[aria-label*="star rating"], div[aria-label*="rating"]')
if rating_el:
label = rating_el.attributes.get("aria-label", "")
m = re.search(r"(\d+\.?\d*)", label)
rating = float(m.group(1)) if m else None
# Review count
review_count = None
    review_el = tree.css_first('a[href*="#reviews"]')
if review_el:
text = review_el.text(strip=True)
digits = re.sub(r"[^\d]", "", text)
review_count = int(digits) if digits else None
# Structured address — Yelp renders this as a series of spans
street, city, state, zip_code = None, None, None, None
addr_block = tree.css_first("address")
if addr_block:
        # separator="\n" keeps text nodes on separate lines so splitlines() works
        raw = addr_block.text(separator="\n", strip=True)
# Yelp format: "600 Guerrero St\nSan Francisco, CA 94110"
parts = [p.strip() for p in raw.splitlines() if p.strip()]
if parts:
street = parts[0]
if len(parts) > 1:
m = re.match(r"^(.+),\s*([A-Z]{2})\s*(\d{5})", parts[1])
if m:
city, state, zip_code = m.group(1), m.group(2), m.group(3)
# Phone
phone = None
for el in tree.css("p.css-1p9ibgf, p[class*=phone], p"):
t = el.text(strip=True)
if re.match(r"^\(?\d{3}\)?[\s\-]\d{3}[\s\-]\d{4}$", t):
phone = t
break
# Website URL
website = None
biz_url_el = tree.css_first("a[href*=biz_website]")
if biz_url_el:
href = biz_url_el.attributes.get("href", "")
m = re.search(r"url=([^&]+)", href)
if m:
from urllib.parse import unquote
website = unquote(m.group(1))
# Price range ($, $$, $$$, $$$$)
price_el = tree.css_first("span.priceRange, span[class*=price]")
price_range = price_el.text(strip=True) if price_el else None
# Categories
categories = []
for el in tree.css("span.css-1xfc281 a, a[href*=/c/]"):
cat = el.text(strip=True)
if cat and cat not in categories:
categories.append(cat)
# Hours of operation
hours = {}
hours_rows = tree.css("table.hours-table tr, div[class*=hours] tr")
for row in hours_rows:
cells = row.css("td")
if len(cells) >= 2:
day = cells[0].text(strip=True)
time_val = cells[1].text(strip=True)
if day:
hours[day] = time_val
# Photos count
photos_count = None
    photos_el = tree.css_first('a[href*="/photos"] span')
if photos_el:
digits = re.sub(r"[^\d]", "", photos_el.text(strip=True))
photos_count = int(digits) if digits else None
# Highlights / amenities
highlights = []
for el in tree.css("span.css-1p9ibgf, div[class*=amenities] span, section[aria-label*=Amenities] span"):
t = el.text(strip=True)
if t and len(t) < 50:
highlights.append(t)
highlights = list(dict.fromkeys(highlights))[:15] # deduplicate, cap at 15
# Health score (city-specific, not always present)
health_score = None
health_el = tree.css_first("div[class*=health] span, span[class*=health-score]")
if health_el:
health_score = health_el.text(strip=True)
return {
"name": name,
"rating": rating,
"review_count": review_count,
"price_range": price_range,
"address": {
"street": street,
"city": city,
"state": state,
"zip": zip_code,
},
"phone": phone,
"website": website,
"categories": categories,
"hours": hours,
"photos_count": photos_count,
"highlights": highlights,
"health_score": health_score,
"url": url,
}
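A minimal usage sketch; warm_session is defined later in this guide, and the URL comes from the example output below:

session = make_session()
warm_session(session)  # acquires the bse cookie (see the anti-detection section)
business = scrape_yelp_business(
    "https://www.yelp.com/biz/tartine-bakery-san-francisco", session
)
print(json.dumps(business, indent=2))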
JSON Output Example
Here's what a fully populated result looks like:
{
"name": "Tartine Bakery",
"rating": 4.5,
"review_count": 8234,
"price_range": "$$",
"address": {
"street": "600 Guerrero St",
"city": "San Francisco",
"state": "CA",
"zip": "94110"
},
"phone": "(415) 487-2600",
"website": "https://www.tartinebakery.com",
"categories": ["Bakeries", "Cafes", "Sandwiches"],
"hours": {
"Monday": "8:00 AM - 3:00 PM",
"Tuesday": "8:00 AM - 3:00 PM",
"Wednesday": "8:00 AM - 3:00 PM",
"Thursday": "8:00 AM - 3:00 PM",
"Friday": "8:00 AM - 5:00 PM",
"Saturday": "8:00 AM - 5:00 PM",
"Sunday": "9:00 AM - 3:00 PM"
},
"photos_count": 4821,
"highlights": ["Outdoor Seating", "Good for Groups", "Takes Reservations", "Wi-Fi"],
"health_score": "A (98)",
"url": "https://www.yelp.com/biz/tartine-bakery-san-francisco"
}
Yelp Fusion API: The Official Route
Yelp offers a free tier of their Fusion API: 5,000 calls/day, no charge. It's rate-limited and only exposes up to three short review excerpts per business, but it's reliable, structured, and legal.
Get a key at https://www.yelp.com/developers/v3/manage_app.
Business Search
import httpx
FUSION_KEY = "your_api_key_here"
def fusion_search(term: str, location: str, limit: int = 50) -> list:
"""Search Yelp Fusion API for businesses."""
url = "https://api.yelp.com/v3/businesses/search"
headers = {"Authorization": f"Bearer {FUSION_KEY}"}
params = {
"term": term,
"location": location,
"limit": min(limit, 50), # Fusion caps at 50 per request
"sort_by": "best_match",
}
with httpx.Client() as client:
resp = client.get(url, headers=headers, params=params)
resp.raise_for_status()
return resp.json().get("businesses", [])
def fusion_business_detail(business_id: str) -> dict:
"""Get full details for a single business by Yelp ID."""
url = f"https://api.yelp.com/v3/businesses/{business_id}"
headers = {"Authorization": f"Bearer {FUSION_KEY}"}
with httpx.Client() as client:
resp = client.get(url, headers=headers)
resp.raise_for_status()
return resp.json()
# Usage
results = fusion_search("coffee", "Austin, TX")
for biz in results[:5]:
detail = fusion_business_detail(biz["id"])
print(f"{detail[name]} — {detail[rating]}★ — {detail[location][address1]}")
Fusion API Response Fields
The businesses/search endpoint returns:
- id, name, url, image_url
- rating, review_count
- price ($ to $$$$)
- location — address1, city, state, zip_code, country, display_address
- phone, display_phone
- categories — list of {alias, title} objects
- coordinates — latitude/longitude
- distance — meters from search center
- is_closed
The /businesses/{id} endpoint adds:
- hours — array of open periods per day
- photos — up to 3 photo URLs
- transactions — ["pickup", "delivery", "restaurant_reservation"]
- attributes — wheelchair accessible, outdoor seating, etc.
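There is also a /businesses/{id}/reviews endpoint, though it returns at most three short review excerpts rather than the full feed:

def fusion_reviews(business_id: str) -> list:
    """Fetch the (at most three) review excerpts Fusion exposes."""
    url = f"https://api.yelp.com/v3/businesses/{business_id}/reviews"
    headers = {"Authorization": f"Bearer {FUSION_KEY}"}
    with httpx.Client() as client:
        resp = client.get(url, headers=headers)
        resp.raise_for_status()
        return resp.json().get("reviews", [])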
API vs Scraping: When to Use Each
| Factor | Fusion API | Direct Scraping |
|---|---|---|
| Reviews | Three excerpts max | Full text, votes, photos |
| Rate limit | 5,000/day free | No hard limit (proxy-dependent) |
| Reliability | High | Variable |
| Data freshness | Real-time | Real-time |
| Legal risk | None | Moderate (ToS) |
| Setup complexity | Low | Medium-high |
| Cost at scale | $0.001/call paid tier | Proxy cost only |
Use the Fusion API for business search and metadata. Scrape for reviews, Q&A, and photo data.
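In practice the hybrid pattern looks like this sketch: search with Fusion, then pass each business alias (Fusion's alias field matches the /biz/ slug) to the review scraper defined later in this guide:

session = make_session(proxy=PROXY)  # PROXY is defined in the proxy section below
warm_session(session)

for biz in fusion_search("coffee", "Austin, TX", limit=20):
    biz_reviews = scrape_all_reviews(biz["alias"], session, max_reviews=100)
    print(biz["name"], len(biz_reviews))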
Internal GraphQL Endpoint
Yelp's front-end communicates with an internal GraphQL API at /gql/batch. This endpoint powers review feeds, photo galleries, Q&A, and more — and it returns clean JSON without any JavaScript rendering.
To find the exact query format, open Chrome DevTools on any Yelp business page, go to the Network tab, filter by "gql", and inspect the requests. Here's the general structure:
import json
GQL_URL = "https://www.yelp.com/gql/batch"
def gql_fetch_reviews(biz_alias: str, session, after_cursor: str = None) -> dict:
"""Fetch reviews via Yelp's internal GraphQL batch endpoint."""
variables = {
"bizEncId": biz_alias,
"after": after_cursor,
"first": 20,
"sortBy": "DATE_DESC",
"lang": "en",
}
payload = [
{
"operationName": "GetBusinessReviewFeed",
"variables": variables,
"extensions": {
"operationId": "GetBusinessReviewFeed",
"schemaVersion": "20240415",
},
}
]
session.headers.update({
"Content-Type": "application/json",
"x-apollo-operation-name": "GetBusinessReviewFeed",
"Accept": "application/json",
"Referer": f"https://www.yelp.com/biz/{biz_alias}",
})
resp = session.post(GQL_URL, data=json.dumps(payload), timeout=20)
return resp.json()
The response structure nests under data.business.reviewFeed.edges. Each edge has a node with:
- id, text, rating, createdAt
- author.displayName, author.isElite
- photos — array of photo URLs
- feedbackCounts — useful, funny, cool vote counts
- pageInfo.hasNextPage, pageInfo.endCursor — for pagination
Paginate by passing the endCursor from the previous response as after_cursor in the next call.
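A sketch of that loop, assuming the batch endpoint returns a list mirroring the request payload and the field names shown above (these drift between schema versions):

def gql_all_reviews(biz_alias: str, session, max_pages: int = 10) -> list:
    """Follow the GraphQL cursor until hasNextPage is false."""
    nodes, cursor = [], None
    for _ in range(max_pages):
        data = gql_fetch_reviews(biz_alias, session, after_cursor=cursor)
        # Batch responses arrive as a list; unwrap the first operation
        feed = data[0].get("data", {}).get("business", {}).get("reviewFeed", {})
        nodes.extend(edge["node"] for edge in feed.get("edges", []))
        page = feed.get("pageInfo", {})
        if not page.get("hasNextPage"):
            break
        cursor = page.get("endCursor")
        time.sleep(random.uniform(2, 5))
    return nodes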
Review Scraping via the review_feed Endpoint
def scrape_all_reviews(biz_alias: str, session, max_reviews: int = 200) -> list:
"""Scrape reviews using the internal review_feed endpoint."""
reviews = []
start = 0
page_size = 20
while len(reviews) < max_reviews:
url = f"https://www.yelp.com/biz/{biz_alias}/review_feed"
params = {
"rl": "en",
"sort_by": "date_desc",
"start": start,
"q": "",
}
session.headers.update({
"Accept": "application/json",
"X-Requested-With": "XMLHttpRequest",
"Referer": f"https://www.yelp.com/biz/{biz_alias}",
})
resp = session.get(url, params=params, timeout=15)
if resp.status_code != 200:
print(f"Review feed blocked at offset {start}: {resp.status_code}")
break
data = resp.json()
page_reviews = data.get("reviews", [])
if not page_reviews:
break
for r in page_reviews:
reviews.append({
"reviewer": r.get("user", {}).get("markupDisplayName"),
"rating": r.get("rating"),
"date": r.get("localizedDate"),
"text": r.get("comment", {}).get("text"),
"photos": [p.get("src") for p in r.get("photos", [])],
"useful": r.get("feedback", {}).get("useful", 0),
"funny": r.get("feedback", {}).get("funny", 0),
"cool": r.get("feedback", {}).get("cool", 0),
"is_elite": r.get("user", {}).get("isElite", False),
})
start += page_size
if start >= data.get("pagination", {}).get("totalResults", 0):
break
time.sleep(random.uniform(2, 5))
return reviews
Rate Limiting and Anti-Detection
Request Timing
import time
import random
def respectful_delay(min_s: float = 3.0, max_s: float = 7.0):
"""Random delay to mimic human browsing patterns."""
delay = random.uniform(min_s, max_s)
# Occasionally simulate a longer pause (reading the page)
if random.random() < 0.1:
delay += random.uniform(5, 15)
time.sleep(delay)
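On top of randomized delays, it's worth backing off hard when Yelp answers with an explicit rate-limit status. A simple sketch (the 429/503 pair is an assumption about which statuses Yelp uses):

def get_with_backoff(session, url: str, max_retries: int = 4, **kwargs):
    """Retry with exponential backoff on rate-limit responses."""
    resp = None
    for attempt in range(max_retries):
        resp = session.get(url, **kwargs)
        if resp.status_code not in (429, 503):  # assumed block statuses
            return resp
        wait = (2 ** attempt) * 5 + random.uniform(0, 3)
        print(f"Rate limited ({resp.status_code}), sleeping {wait:.0f}s")
        time.sleep(wait)
    return resp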
Session Cookie Persistence
The bse cookie is Yelp's bot-scoring cookie. Get it by making an initial homepage request and persisting the cookie jar across your session:
def warm_session(session) -> None:
"""Hit the homepage to acquire the bse cookie before scraping."""
session.headers["User-Agent"] = random.choice(USER_AGENTS)
resp = session.get("https://www.yelp.com/", timeout=15)
# The session automatically stores Set-Cookie headers
bse = session.cookies.get("bse")
if bse:
print(f"Session warmed, bse cookie acquired: {bse[:12]}...")
else:
print("Warning: bse cookie not set — may see increased blocking")
Honeypot Detection
def is_safe_link(tree: HTMLParser, href: str) -> bool:
"""Check if a link is visible (not a honeypot trap)."""
    for el in tree.css(f'a[href="{href}"]'):
style = el.attributes.get("style", "")
if "display:none" in style or "visibility:hidden" in style:
return False
parent = el.parent
while parent:
pstyle = parent.attributes.get("style", "") if hasattr(parent, "attributes") else ""
if "display:none" in pstyle:
return False
parent = parent.parent if hasattr(parent, "parent") else None
return True
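For example, when harvesting business links from a results page, run candidates through it first (tree here is a parsed results page from earlier in the pipeline):

candidates = [a.attributes.get("href") for a in tree.css('a[href^="/biz/"]')]
safe_links = [h for h in candidates if h and is_safe_link(tree, h)]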
Batch Scraper with SQLite and Resume Support
import sqlite3
import csv
import os
DB_PATH = "yelp_scrape.db"
def init_db(db_path: str = DB_PATH):
"""Initialize SQLite schema."""
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS businesses (
url TEXT PRIMARY KEY,
name TEXT,
rating REAL,
review_count INTEGER,
price_range TEXT,
street TEXT,
city TEXT,
state TEXT,
zip TEXT,
phone TEXT,
website TEXT,
categories TEXT,
hours TEXT,
photos_count INTEGER,
highlights TEXT,
health_score TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
error TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
business_url TEXT,
reviewer TEXT,
rating INTEGER,
date TEXT,
text TEXT,
useful INTEGER,
funny INTEGER,
cool INTEGER,
is_elite INTEGER,
FOREIGN KEY (business_url) REFERENCES businesses(url)
)
""")
conn.commit()
conn.close()
def already_scraped(url: str, db_path: str = DB_PATH) -> bool:
conn = sqlite3.connect(db_path)
row = conn.execute("SELECT 1 FROM businesses WHERE url=? AND error IS NULL", (url,)).fetchone()
conn.close()
return row is not None
def save_business(data: dict, db_path: str = DB_PATH):
conn = sqlite3.connect(db_path)
conn.execute("""
INSERT OR REPLACE INTO businesses
(url, name, rating, review_count, price_range, street, city, state, zip,
phone, website, categories, hours, photos_count, highlights, health_score)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
data["url"], data["name"], data["rating"], data["review_count"],
data["price_range"],
data["address"]["street"], data["address"]["city"],
data["address"]["state"], data["address"]["zip"],
data["phone"], data["website"],
json.dumps(data["categories"]),
json.dumps(data["hours"]),
data["photos_count"],
json.dumps(data["highlights"]),
data["health_score"],
))
conn.commit()
conn.close()
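Nothing above writes to the reviews table that init_db creates; here is a minimal helper matching the dicts scrape_all_reviews returns (the photos list is dropped since the table has no column for it):

def save_reviews(business_url: str, reviews: list, db_path: str = DB_PATH):
    conn = sqlite3.connect(db_path)
    conn.executemany("""
        INSERT INTO reviews
            (business_url, reviewer, rating, date, text, useful, funny, cool, is_elite)
        VALUES (?,?,?,?,?,?,?,?,?)
    """, [
        (business_url, r["reviewer"], r["rating"], r["date"], r["text"],
         r["useful"], r["funny"], r["cool"], int(r["is_elite"]))
        for r in reviews
    ])
    conn.commit()
    conn.close()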
def batch_scrape(csv_path: str, proxy: str = None, db_path: str = DB_PATH):
"""
Process a CSV of business URLs. Resumes from last successful scrape.
CSV format: one URL per line (or column named 'url').
"""
init_db(db_path)
session = make_session(proxy)
warm_session(session)
with open(csv_path) as f:
        has_header = "url" in f.readline()
        f.seek(0)  # rewind so the reader sees the file from the top
        if has_header:
            urls = [row["url"] for row in csv.DictReader(f)]
        else:
            urls = [row[0] for row in csv.reader(f) if row]
total = len(urls)
skipped = 0
success = 0
failed = 0
for i, url in enumerate(urls):
if already_scraped(url, db_path):
skipped += 1
continue
print(f"[{i+1}/{total}] Scraping: {url}")
try:
data = scrape_yelp_business(url, session)
save_business(data, db_path)
success += 1
except Exception as e:
print(f" ERROR: {e}")
# Log the failure so we can retry selectively
conn = sqlite3.connect(db_path)
conn.execute("INSERT OR REPLACE INTO businesses (url, error) VALUES (?,?)", (url, str(e)))
conn.commit()
conn.close()
failed += 1
respectful_delay()
print(f"\nDone. Success: {success}, Skipped: {skipped}, Failed: {failed}")
print(f"Data saved to {db_path}")
Use Cases
1. Local Market Research
Find all restaurants in a zip code and compare ratings across categories. The snippets in this section call scrape_yelp_search, which the guide hasn't defined yet, so a sketch of it comes first.
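A minimal sketch, assuming Yelp's public search URL format (find_desc / find_loc query parameters, 10 results per page) and the module-level session created in the snippet after it. Rating extraction from search result cards is omitted, so downstream code should treat rating as optional:

def scrape_yelp_search(term: str, location: str, max_pages: int = 3) -> list:
    """Collect business names and URLs from Yelp search result pages."""
    results, seen = [], set()
    for page in range(max_pages):
        params = {"find_desc": term, "find_loc": location, "start": page * 10}
        resp = session.get("https://www.yelp.com/search", params=params, timeout=20)
        if resp.status_code != 200:
            break  # blocked or exhausted; stop rather than hammer
        tree = HTMLParser(resp.text)
        for a in tree.css('a[href^="/biz/"]'):
            href = (a.attributes.get("href") or "").split("?")[0]
            if href and href not in seen:
                seen.add(href)
                results.append({"name": a.text(strip=True) or None,
                                "url": f"https://www.yelp.com{href}"})
        respectful_delay()
    return results

With that helper in place: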
session = make_session(proxy=PROXY)
warm_session(session)
# Search for multiple categories
categories = ["restaurants", "coffee", "bars", "pizza"]
location = "10001" # NYC zip code
all_results = []
for cat in categories:
urls = scrape_yelp_search(cat, location, max_pages=10)
for biz in urls:
data = scrape_yelp_business(biz["url"], session)
data["search_category"] = cat
all_results.append(data)
respectful_delay()
2. Lead Generation for Local Services
Extract contact info for outreach campaigns:
def extract_leads(businesses: list) -> list:
"""Filter to businesses with phone AND website — higher quality leads."""
return [
{
"name": b["name"],
"phone": b["phone"],
"website": b["website"],
"city": b["address"]["city"],
"rating": b["rating"],
"review_count": b["review_count"],
}
for b in businesses
if b.get("phone") and b.get("website")
]
3. Competitor Monitoring Dashboard
Track rating changes over time by scraping on a schedule:
import sqlite3
from datetime import date
def track_rating_change(url: str, db_path: str = DB_PATH):
    """Compare the two most recent rating snapshots for a business.

    Note: the businesses table above keeps one row per URL (INSERT OR
    REPLACE overwrites it), so scheduled tracking requires copying each
    scrape into a dated snapshot table for this query to find two rows.
    """
conn = sqlite3.connect(db_path)
rows = conn.execute(
"SELECT rating, scraped_at FROM businesses WHERE url=? ORDER BY scraped_at DESC LIMIT 2",
(url,)
).fetchall()
conn.close()
if len(rows) < 2:
return None
current, previous = rows[0][0], rows[1][0]
delta = round(current - previous, 2)
return {"url": url, "current": current, "previous": previous, "delta": delta}
4. Review Sentiment Analysis
# After scraping reviews into SQLite:
import sqlite3
conn = sqlite3.connect("yelp_scrape.db")
biz_url = "https://www.yelp.com/biz/tartine-bakery-san-francisco"  # example key
rows = conn.execute("SELECT text, rating FROM reviews WHERE business_url=?", (biz_url,)).fetchall()
conn.close()
# Simple keyword sentiment breakdown
positive_keywords = ["great", "amazing", "excellent", "love", "best", "fantastic"]
negative_keywords = ["terrible", "awful", "worst", "never", "disgusting", "rude"]
for text, rating in rows:
if text:
text_lower = text.lower()
pos = sum(1 for k in positive_keywords if k in text_lower)
neg = sum(1 for k in negative_keywords if k in text_lower)
print(f"Rating: {rating} | Pos signals: {pos} | Neg signals: {neg}")
5. Location Intelligence
Find underserved areas for a business category:
# Scrape multiple zip codes, count results per category
# Low review_count + high rating = established niche with growth potential
# Low count + low rating = underserved market with unmet demand
zip_codes = ["94110", "94103", "94117", "94102"]
category = "vegan"
coverage = {}
for zip_code in zip_codes:
results = scrape_yelp_search(category, zip_code, max_pages=3)
coverage[zip_code] = {
"count": len(results),
"avg_rating": sum(r.get("rating", 0) for r in results) / max(len(results), 1)
}
for zip_code, stats in coverage.items():
print(f"{zip_code}: {stats[count]} businesses, avg rating {stats[avg_rating]:.1f}")
Data Analysis with Pandas
Once you have data in SQLite, analysis is straightforward:
import json
import pandas as pd
import sqlite3
conn = sqlite3.connect("yelp_scrape.db")
df = pd.read_sql("SELECT * FROM businesses WHERE error IS NULL", conn)
conn.close()
# Parse JSON columns
df["categories_list"] = df["categories"].apply(lambda x: json.loads(x) if x else [])
# Explode categories so each row = one category
df_cats = df.explode("categories_list").rename(columns={"categories_list": "category"})
# Average rating by category
print(df_cats.groupby("category")["rating"].agg(["mean", "count"]).sort_values("mean", ascending=False).head(20))
# Price range distribution
print(df["price_range"].value_counts())
# Review volume trend (requires date-stamped scrapes)
df["scraped_date"] = pd.to_datetime(df["scraped_at"]).dt.date
print(df.groupby("scraped_date")["review_count"].sum())
# Top cities by business count
print(df["city"].value_counts().head(10))
Why Residential Proxies Are Non-Negotiable for Yelp
Yelp's IP-level defenses are among the most aggressive of any consumer platform:
- ASN checking: Every request's IP is checked against known datacenter ASN ranges. AWS, GCP, Azure, DigitalOcean, Vultr — all flagged instantly.
- Velocity tracking per IP: More than ~25 requests/hour from one residential IP triggers rate limiting.
- Geo-consistency: If your session cookie was set from a US IP and suddenly you're coming from Eastern Europe, expect a block.
ThorData's rotating residential proxy network addresses all three. You get real residential IPs from ISPs like Comcast, AT&T, and Verizon — the same ASNs as actual Yelp users. Their rotation is automatic per-request or per-session depending on your config, and geo-targeting lets you keep your scrape traffic appearing to come from the same city as your target businesses.
# ThorData proxy config
PROXY = "http://USERNAME:[email protected]:9001"
# For city-specific targeting (reduces geo-mismatch blocks)
PROXY_SF = "http://USERNAME:[email protected]:9001?country=US&city=SanFrancisco"
Legal Considerations
hiQ v. LinkedIn (9th Cir. 2022): The court held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (CFAA). Yelp business listings — names, addresses, ratings — are publicly available without authentication. This precedent offers meaningful protection against CFAA-based claims.
Yelp ToS: Yelp's Terms of Service prohibit automated access. ToS violations are a breach of contract matter, not a criminal one. Practical enforcement risk is low for moderate-volume scraping that does not:
- Reproduce large volumes of review text verbatim for commercial redistribution
- Bypass authentication or access non-public data
- Cause measurable server load
Practical advice:
- Never scrape behind login (authentication changes the legal calculus significantly)
- Don't republish raw review text — use it for analysis, not syndication
- Keep request rates reasonable; aggressive scraping is easier to litigate as interference with business
- If you're building a commercial product on Yelp data, the Fusion API's licensing is cleaner
Key Takeaways
- Use curl_cffi with impersonate="chrome124" — it's the single biggest factor for avoiding TLS-based blocks.
- Warm your session by hitting the homepage first to acquire the bse cookie.
- The review_feed JSON endpoint and /gql/batch GraphQL endpoint both return clean JSON — no JS rendering needed for reviews.
- The Fusion API (free, 5k/day) handles business search and metadata reliably; scrape only what the API can't give you.
- Residential proxies are not optional for Yelp. ThorData handles the ASN and geo-consistency requirements automatically.
- Save to SQLite and check before re-scraping — Yelp blocks are frequent enough that resume support is essential.
- Always check for visibility:hidden / display:none links before following them.