
How to Scrape Yelp Business Data in 2026: A Complete Guide

Yelp remains one of the richest sources of local business intelligence. With over 265 million reviews across restaurants, services, and retail, scraping Yelp gives you access to competitive pricing, sentiment analysis, and location-based market research that no API can match at scale.

This guide walks through scraping Yelp business listings with Python — including the anti-bot measures you'll actually encounter in 2026, Yelp's official Fusion API, their internal GraphQL endpoint, and a full batch pipeline that saves to SQLite and resumes from where it left off.


What Data Can You Extract from Yelp?

Each Yelp business page contains structured data worth extracting: business name, star rating, review count, price range ($ to $$$$), street address, phone number, website URL, categories, hours of operation, photo count, amenity highlights, and (in some cities) a health inspection score. The scraper below pulls every one of these fields.


Yelp's Anti-Bot Measures in 2026

Yelp has significantly hardened its defenses. Here's what you'll face:

  1. Aggressive rate limiting — More than 20–30 requests per minute from a single IP triggers a CAPTCHA wall or temporary block.
  2. JavaScript-rendered content — Review text and some business details load dynamically via XHR calls, not in the initial HTML.
  3. TLS fingerprinting — Yelp checks TLS fingerprints, header ordering, and cipher suites to identify non-browser clients. A standard requests or httpx TLS handshake differs from Chrome's.
  4. Bot detection cookies — A bse cookie is set on first visit; missing or malformed values trigger blocks.
  5. Honeypot links — Hidden elements in the DOM (visibility:hidden, display:none) that only bots follow. Clicking them results in an instant ban.
  6. ASN-level IP blocking — Yelp checks your IP's ASN. Datacenter ranges (AWS, GCP, DigitalOcean, etc.) get flagged immediately regardless of request behavior.
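
Of these, rate limiting is the measure you will hit first. A retry wrapper with exponential backoff and jitter keeps a long crawl alive through temporary 429/503 blocks. Here is a minimal stdlib sketch; `fetch` is any zero-argument callable (for example a lambda wrapping `session.get`) that returns an object with a `status_code` attribute:

```python
import random
import time

RETRY_STATUSES = {429, 503}  # rate-limited / temporarily blocked

def fetch_with_backoff(fetch, max_retries: int = 4, base_delay: float = 5.0):
    """Retry fetch() on 429/503 with exponential backoff plus jitter.

    Sleeps base_delay * 2**attempt (plus up to base_delay of jitter)
    between attempts; returns the last response either way.
    """
    for attempt in range(max_retries + 1):
        resp = fetch()
        if resp.status_code not in RETRY_STATUSES:
            return resp
        if attempt == max_retries:
            break
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return resp
```

In practice you'd call it as `fetch_with_backoff(lambda: session.get(url, timeout=20))` so the session and URL stay outside the retry logic.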

Setting Up Your Scraper

pip install httpx selectolax curl_cffi pandas sqlite-utils

Use curl_cffi instead of plain httpx for its Chrome TLS impersonation. This is the single biggest factor in avoiding Yelp blocks in 2026.

User Agent Rotation Pool

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.6367.82 Mobile Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

Complete Business Scraper with Full Field Extraction

import random
import time
import re
import json
from curl_cffi import requests as cffi_requests
from selectolax.parser import HTMLParser

def make_session(proxy: str = None) -> cffi_requests.Session:
    """Create a curl_cffi session that impersonates Chrome TLS fingerprint."""
    session = cffi_requests.Session(impersonate="chrome124")
    if proxy:
        session.proxies = {"http": proxy, "https": proxy}
    session.headers.update({
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
    })
    return session


def scrape_yelp_business(url: str, session: cffi_requests.Session) -> dict:
    """Scrape a single Yelp business page for all available fields."""
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    session.headers["Referer"] = "https://www.yelp.com/"

    resp = session.get(url, timeout=20)
    resp.raise_for_status()
    tree = HTMLParser(resp.text)

    # Check for honeypot traps before proceeding. Attribute values containing
    # ":" must be quoted in CSS selectors.
    hidden = tree.css('a[style*="display:none"], a[style*="visibility:hidden"]')
    # Collect their hrefs so the crawler never follows these trap links
    honeypot_hrefs = {el.attributes.get("href") for el in hidden}

    # Business name
    name_el = tree.css_first("h1.css-1se8maq, h1")
    name = name_el.text(strip=True) if name_el else None

    # Rating from aria-label
    rating = None
    rating_el = tree.css_first('[aria-label*="star rating"], div[aria-label*="rating"]')
    if rating_el:
        label = rating_el.attributes.get("aria-label", "")
        m = re.search(r"(\d+\.?\d*)", label)
        rating = float(m.group(1)) if m else None

    # Review count
    review_count = None
    review_el = tree.css_first('a[href*="#reviews"]')
    if review_el:
        text = review_el.text(strip=True)
        digits = re.sub(r"[^\d]", "", text)
        review_count = int(digits) if digits else None

    # Structured address — Yelp renders this as a series of spans
    street, city, state, zip_code = None, None, None, None
    addr_block = tree.css_first("address")
    if addr_block:
        raw = addr_block.text(strip=True)
        # Yelp format: "600 Guerrero St\nSan Francisco, CA 94110"
        parts = [p.strip() for p in raw.splitlines() if p.strip()]
        if parts:
            street = parts[0]
        if len(parts) > 1:
            m = re.match(r"^(.+),\s*([A-Z]{2})\s*(\d{5})", parts[1])
            if m:
                city, state, zip_code = m.group(1), m.group(2), m.group(3)

    # Phone
    phone = None
    for el in tree.css("p.css-1p9ibgf, p[class*=phone], p"):
        t = el.text(strip=True)
        if re.match(r"^\(?\d{3}\)?[\s\-]\d{3}[\s\-]\d{4}$", t):
            phone = t
            break

    # Website URL
    website = None
    biz_url_el = tree.css_first("a[href*=biz_website]")
    if biz_url_el:
        href = biz_url_el.attributes.get("href", "")
        m = re.search(r"url=([^&]+)", href)
        if m:
            from urllib.parse import unquote
            website = unquote(m.group(1))

    # Price range ($, $$, $$$, $$$$)
    price_el = tree.css_first("span.priceRange, span[class*=price]")
    price_range = price_el.text(strip=True) if price_el else None

    # Categories
    categories = []
    for el in tree.css('span.css-1xfc281 a, a[href*="/c/"]'):
        cat = el.text(strip=True)
        if cat and cat not in categories:
            categories.append(cat)

    # Hours of operation
    hours = {}
    hours_rows = tree.css("table.hours-table tr, div[class*=hours] tr")
    for row in hours_rows:
        cells = row.css("td")
        if len(cells) >= 2:
            day = cells[0].text(strip=True)
            time_val = cells[1].text(strip=True)
            if day:
                hours[day] = time_val

    # Photos count
    photos_count = None
    photos_el = tree.css_first('a[href*="/photos"] span')
    if photos_el:
        digits = re.sub(r"[^\d]", "", photos_el.text(strip=True))
        photos_count = int(digits) if digits else None

    # Highlights / amenities
    highlights = []
    for el in tree.css("span.css-1p9ibgf, div[class*=amenities] span, section[aria-label*=Amenities] span"):
        t = el.text(strip=True)
        if t and len(t) < 50:
            highlights.append(t)
    highlights = list(dict.fromkeys(highlights))[:15]  # deduplicate, cap at 15

    # Health score (city-specific, not always present)
    health_score = None
    health_el = tree.css_first("div[class*=health] span, span[class*=health-score]")
    if health_el:
        health_score = health_el.text(strip=True)

    return {
        "name": name,
        "rating": rating,
        "review_count": review_count,
        "price_range": price_range,
        "address": {
            "street": street,
            "city": city,
            "state": state,
            "zip": zip_code,
        },
        "phone": phone,
        "website": website,
        "categories": categories,
        "hours": hours,
        "photos_count": photos_count,
        "highlights": highlights,
        "health_score": health_score,
        "url": url,
    }

JSON Output Example

Here's what a fully populated result looks like:

{
  "name": "Tartine Bakery",
  "rating": 4.5,
  "review_count": 8234,
  "price_range": "$$",
  "address": {
    "street": "600 Guerrero St",
    "city": "San Francisco",
    "state": "CA",
    "zip": "94110"
  },
  "phone": "(415) 487-2600",
  "website": "https://www.tartinebakery.com",
  "categories": ["Bakeries", "Cafes", "Sandwiches"],
  "hours": {
    "Monday": "8:00 AM - 3:00 PM",
    "Tuesday": "8:00 AM - 3:00 PM",
    "Wednesday": "8:00 AM - 3:00 PM",
    "Thursday": "8:00 AM - 3:00 PM",
    "Friday": "8:00 AM - 5:00 PM",
    "Saturday": "8:00 AM - 5:00 PM",
    "Sunday": "9:00 AM - 3:00 PM"
  },
  "photos_count": 4821,
  "highlights": ["Outdoor Seating", "Good for Groups", "Takes Reservations", "Wi-Fi"],
  "health_score": "A (98)",
  "url": "https://www.yelp.com/biz/tartine-bakery-san-francisco"
}

Yelp Fusion API: The Official Route

Yelp offers a free tier of their Fusion API: 5,000 calls/day, no charge. It's rate-limited and doesn't expose reviews, but it's reliable, structured, and legal.

Get a key at https://www.yelp.com/developers/v3/manage_app.

import httpx

FUSION_KEY = "your_api_key_here"

def fusion_search(term: str, location: str, limit: int = 50) -> list:
    """Search Yelp Fusion API for businesses."""
    url = "https://api.yelp.com/v3/businesses/search"
    headers = {"Authorization": f"Bearer {FUSION_KEY}"}
    params = {
        "term": term,
        "location": location,
        "limit": min(limit, 50),  # Fusion caps at 50 per request
        "sort_by": "best_match",
    }
    with httpx.Client() as client:
        resp = client.get(url, headers=headers, params=params)
    resp.raise_for_status()
    return resp.json().get("businesses", [])


def fusion_business_detail(business_id: str) -> dict:
    """Get full details for a single business by Yelp ID."""
    url = f"https://api.yelp.com/v3/businesses/{business_id}"
    headers = {"Authorization": f"Bearer {FUSION_KEY}"}
    with httpx.Client() as client:
        resp = client.get(url, headers=headers)
    resp.raise_for_status()
    return resp.json()


# Usage
results = fusion_search("coffee", "Austin, TX")
for biz in results[:5]:
    detail = fusion_business_detail(biz["id"])
    print(f"{detail['name']} — {detail['rating']}★ — {detail['location']['address1']}")

Fusion API Response Fields

The businesses/search endpoint returns:

- id, name, url, image_url
- rating, review_count
- price ($ to $$$$)
- location — address1, city, state, zip_code, country, display_address
- phone, display_phone
- categories — list of {alias, title} objects
- coordinates — latitude/longitude
- distance — meters from search center
- is_closed

The /businesses/{id} endpoint adds:

- hours — array of open periods per day
- photos — up to 3 photo URLs
- transactions — ["pickup", "delivery", "restaurant_reservation"]
- attributes — wheelchair accessible, outdoor seating, etc.
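
For downstream storage it helps to flatten a nested search result into one flat record. A sketch using only the fields listed above; any missing key simply becomes None:

```python
def flatten_fusion_business(biz: dict) -> dict:
    """Flatten a Fusion /businesses/search result into a flat record."""
    loc = biz.get("location", {}) or {}
    coords = biz.get("coordinates", {}) or {}
    return {
        "id": biz.get("id"),
        "name": biz.get("name"),
        "rating": biz.get("rating"),
        "review_count": biz.get("review_count"),
        "price": biz.get("price"),
        "address": ", ".join(loc.get("display_address", [])),
        "city": loc.get("city"),
        "state": loc.get("state"),
        "zip": loc.get("zip_code"),
        "phone": biz.get("display_phone"),
        "categories": [c.get("title") for c in biz.get("categories", [])],
        "lat": coords.get("latitude"),
        "lon": coords.get("longitude"),
    }
```

Records in this shape drop straight into a pandas DataFrame or the SQLite schema used later in this guide.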

API vs Scraping: When to Use Each

| Factor | Fusion API | Direct Scraping |
| --- | --- | --- |
| Reviews | Not available | Full text, votes, photos |
| Rate limit | 5,000/day free | No hard limit (proxy-dependent) |
| Reliability | High | Variable |
| Data freshness | Real-time | Real-time |
| Legal risk | None | Moderate (ToS) |
| Setup complexity | Low | Medium-high |
| Cost at scale | $0.001/call paid tier | Proxy cost only |

Use the Fusion API for business search and metadata. Scrape for reviews, Q&A, and photo data.


Internal GraphQL Endpoint

Yelp's front-end communicates with an internal GraphQL API at /gql/batch. This endpoint powers review feeds, photo galleries, Q&A, and more — and it returns clean JSON without any JavaScript rendering.

To find the exact query format, open Chrome DevTools on any Yelp business page, go to the Network tab, filter by "gql", and inspect the requests. Here's the general structure:

import json

GQL_URL = "https://www.yelp.com/gql/batch"

def gql_fetch_reviews(biz_alias: str, session, after_cursor: str = None) -> dict:
    """Fetch reviews via Yelp's internal GraphQL batch endpoint."""
    variables = {
        "bizEncId": biz_alias,
        "after": after_cursor,
        "first": 20,
        "sortBy": "DATE_DESC",
        "lang": "en",
    }
    payload = [
        {
            "operationName": "GetBusinessReviewFeed",
            "variables": variables,
            "extensions": {
                "operationId": "GetBusinessReviewFeed",
                "schemaVersion": "20240415",
            },
        }
    ]
    session.headers.update({
        "Content-Type": "application/json",
        "x-apollo-operation-name": "GetBusinessReviewFeed",
        "Accept": "application/json",
        "Referer": f"https://www.yelp.com/biz/{biz_alias}",
    })
    resp = session.post(GQL_URL, data=json.dumps(payload), timeout=20)
    return resp.json()

The response structure nests under data.business.reviewFeed.edges. Each edge has a node with:

- id, text, rating, createdAt
- author.displayName, author.isElite
- photos — array of photo URLs
- feedbackCounts — useful, funny, cool vote counts
- pageInfo.hasNextPage, pageInfo.endCursor — for pagination

Paginate by passing the endCursor from the previous response as after_cursor in the next call.
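
The cursor loop can be sketched as follows. This assumes the batch endpoint returns a one-element JSON list; the `fetch` callable has the signature of gql_fetch_reviews above and is injected so the loop can be exercised without network access:

```python
def paginate_reviews(biz_alias: str, session, fetch, max_pages: int = 10) -> list:
    """Walk the review feed cursor by cursor.

    fetch(biz_alias, session, after_cursor) must return the parsed batch
    response: a one-element list whose first item nests the feed under
    data.business.reviewFeed, as described above.
    """
    reviews, cursor = [], None
    for _ in range(max_pages):
        data = fetch(biz_alias, session, cursor)
        feed = data[0]["data"]["business"]["reviewFeed"]
        reviews.extend(edge["node"] for edge in feed["edges"])
        page = feed.get("pageInfo", {})
        if not page.get("hasNextPage"):
            break  # no more pages
        cursor = page.get("endCursor")
    return reviews
```

In production you would pass gql_fetch_reviews as `fetch` and add the same respectful delays between pages as elsewhere in this guide.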


Review Scraper: Dedicated Section

def scrape_all_reviews(biz_alias: str, session, max_reviews: int = 200) -> list:
    """Scrape reviews using the internal review_feed endpoint."""
    reviews = []
    start = 0
    page_size = 20

    while len(reviews) < max_reviews:
        url = f"https://www.yelp.com/biz/{biz_alias}/review_feed"
        params = {
            "rl": "en",
            "sort_by": "date_desc",
            "start": start,
            "q": "",
        }
        session.headers.update({
            "Accept": "application/json",
            "X-Requested-With": "XMLHttpRequest",
            "Referer": f"https://www.yelp.com/biz/{biz_alias}",
        })
        resp = session.get(url, params=params, timeout=15)

        if resp.status_code != 200:
            print(f"Review feed blocked at offset {start}: {resp.status_code}")
            break

        data = resp.json()
        page_reviews = data.get("reviews", [])
        if not page_reviews:
            break

        for r in page_reviews:
            reviews.append({
                "reviewer": r.get("user", {}).get("markupDisplayName"),
                "rating": r.get("rating"),
                "date": r.get("localizedDate"),
                "text": r.get("comment", {}).get("text"),
                "photos": [p.get("src") for p in r.get("photos", [])],
                "useful": r.get("feedback", {}).get("useful", 0),
                "funny": r.get("feedback", {}).get("funny", 0),
                "cool": r.get("feedback", {}).get("cool", 0),
                "is_elite": r.get("user", {}).get("isElite", False),
            })

        start += page_size
        if start >= data.get("pagination", {}).get("totalResults", 0):
            break

        time.sleep(random.uniform(2, 5))

    return reviews

Rate Limiting and Anti-Detection

Request Timing

import time
import random

def respectful_delay(min_s: float = 3.0, max_s: float = 7.0):
    """Random delay to mimic human browsing patterns."""
    delay = random.uniform(min_s, max_s)
    # Occasionally simulate a longer pause (reading the page)
    if random.random() < 0.1:
        delay += random.uniform(5, 15)
    time.sleep(delay)

Session Warming: The bse Cookie

The bse cookie is Yelp's bot-scoring cookie. Get it by making an initial homepage request and persisting the cookie jar across your session:

def warm_session(session) -> None:
    """Hit the homepage to acquire the bse cookie before scraping."""
    session.headers["User-Agent"] = random.choice(USER_AGENTS)
    resp = session.get("https://www.yelp.com/", timeout=15)
    # The session automatically stores Set-Cookie headers
    bse = session.cookies.get("bse")
    if bse:
        print(f"Session warmed, bse cookie acquired: {bse[:12]}...")
    else:
        print("Warning: bse cookie not set — may see increased blocking")

Honeypot Detection

def is_safe_link(tree: HTMLParser, href: str) -> bool:
    """Check if a link is visible (not a honeypot trap)."""
    for el in tree.css(f'a[href="{href}"]'):
        style = el.attributes.get("style", "")
        if "display:none" in style or "visibility:hidden" in style:
            return False
        parent = el.parent
        while parent:
            pstyle = parent.attributes.get("style", "") if hasattr(parent, "attributes") else ""
            if "display:none" in pstyle:
                return False
            parent = parent.parent if hasattr(parent, "parent") else None
    return True

Batch Scraper with SQLite and Resume Support

import sqlite3
import csv
import os

DB_PATH = "yelp_scrape.db"

def init_db(db_path: str = DB_PATH):
    """Initialize SQLite schema."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS businesses (
            url TEXT PRIMARY KEY,
            name TEXT,
            rating REAL,
            review_count INTEGER,
            price_range TEXT,
            street TEXT,
            city TEXT,
            state TEXT,
            zip TEXT,
            phone TEXT,
            website TEXT,
            categories TEXT,
            hours TEXT,
            photos_count INTEGER,
            highlights TEXT,
            health_score TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
            error TEXT
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS reviews (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            business_url TEXT,
            reviewer TEXT,
            rating INTEGER,
            date TEXT,
            text TEXT,
            useful INTEGER,
            funny INTEGER,
            cool INTEGER,
            is_elite INTEGER,
            FOREIGN KEY (business_url) REFERENCES businesses(url)
        )
    """)
    conn.commit()
    conn.close()


def already_scraped(url: str, db_path: str = DB_PATH) -> bool:
    conn = sqlite3.connect(db_path)
    row = conn.execute("SELECT 1 FROM businesses WHERE url=? AND error IS NULL", (url,)).fetchone()
    conn.close()
    return row is not None


def save_business(data: dict, db_path: str = DB_PATH):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        INSERT OR REPLACE INTO businesses
        (url, name, rating, review_count, price_range, street, city, state, zip,
         phone, website, categories, hours, photos_count, highlights, health_score)
        VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
    """, (
        data["url"], data["name"], data["rating"], data["review_count"],
        data["price_range"],
        data["address"]["street"], data["address"]["city"],
        data["address"]["state"], data["address"]["zip"],
        data["phone"], data["website"],
        json.dumps(data["categories"]),
        json.dumps(data["hours"]),
        data["photos_count"],
        json.dumps(data["highlights"]),
        data["health_score"],
    ))
    conn.commit()
    conn.close()


def batch_scrape(csv_path: str, proxy: str = None, db_path: str = DB_PATH):
    """
    Process a CSV of business URLs. Resumes from last successful scrape.
    CSV format: one URL per line (or column named 'url').
    """
    init_db(db_path)
    session = make_session(proxy)
    warm_session(session)

    with open(csv_path) as f:
        reader = csv.DictReader(f) if "url" in f.readline() else csv.reader(f)
        f.seek(0)
        urls = [row["url"] if isinstance(row, dict) else row[0] for row in reader]

    total = len(urls)
    skipped = 0
    success = 0
    failed = 0

    for i, url in enumerate(urls):
        if already_scraped(url, db_path):
            skipped += 1
            continue

        print(f"[{i+1}/{total}] Scraping: {url}")
        try:
            data = scrape_yelp_business(url, session)
            save_business(data, db_path)
            success += 1
        except Exception as e:
            print(f"  ERROR: {e}")
            # Log the failure so we can retry selectively
            conn = sqlite3.connect(db_path)
            conn.execute("INSERT OR REPLACE INTO businesses (url, error) VALUES (?,?)", (url, str(e)))
            conn.commit()
            conn.close()
            failed += 1

        respectful_delay()

    print(f"\nDone. Success: {success}, Skipped: {skipped}, Failed: {failed}")
    print(f"Data saved to {db_path}")

Use Cases

1. Local Market Research

Find all restaurants in a zip code and compare ratings across categories:

session = make_session(proxy=PROXY)
warm_session(session)

# Search for multiple categories
categories = ["restaurants", "coffee", "bars", "pizza"]
location = "10001"  # NYC zip code

all_results = []
for cat in categories:
    results = scrape_yelp_search(cat, location, max_pages=10)
    for biz in results:
        data = scrape_yelp_business(biz["url"], session)
        data["search_category"] = cat
        all_results.append(data)
        respectful_delay()
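
The snippets in this section call scrape_yelp_search, which isn't defined earlier in this guide. A minimal sketch: the URL parameters mirror Yelp's public search URL format, the /biz/ href pattern is an assumption about the current markup, and a regex stands in for selectolax so the sketch stays dependency-free:

```python
import re
from urllib.parse import urlencode

# Matches business links like /biz/tartine-bakery-san-francisco
# (an assumption about current markup); query strings are stripped.
BIZ_LINK = re.compile(r'href="(/biz/[^"?#]+)')

def scrape_yelp_search(term: str, location: str, session, max_pages: int = 1) -> list:
    """Collect unique business URLs from Yelp search result pages.

    Yelp paginates search with a `start` offset of 10 results per page.
    """
    seen, results = set(), []
    for page in range(max_pages):
        params = {"find_desc": term, "find_loc": location, "start": page * 10}
        resp = session.get(
            f"https://www.yelp.com/search?{urlencode(params)}", timeout=20
        )
        for m in BIZ_LINK.finditer(resp.text):
            url = "https://www.yelp.com" + m.group(1)
            if url not in seen:
                seen.add(url)
                results.append({"url": url})
    return results
```

Note this sketch returns only URLs and takes the session explicitly; use case 5's avg_rating calculation would additionally need per-result ratings parsed from the search page.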

2. Lead Generation for Local Services

Extract contact info for outreach campaigns:

def extract_leads(businesses: list) -> list:
    """Filter to businesses with phone AND website — higher quality leads."""
    return [
        {
            "name": b["name"],
            "phone": b["phone"],
            "website": b["website"],
            "city": b["address"]["city"],
            "rating": b["rating"],
            "review_count": b["review_count"],
        }
        for b in businesses
        if b.get("phone") and b.get("website")
    ]

3. Competitor Monitoring Dashboard

Track rating changes over time by scraping on a schedule:

import sqlite3
from datetime import date

def track_rating_change(url: str, db_path: str = DB_PATH):
    """Compare today's rating with the most recent historical record.

    Note: the businesses schema above keeps one row per URL (url is the
    primary key), so this comparison needs historical snapshots, e.g. a
    separate DB file per scheduled run or an append-only history table.
    """
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT rating, scraped_at FROM businesses WHERE url=? ORDER BY scraped_at DESC LIMIT 2",
        (url,)
    ).fetchall()
    conn.close()
    if len(rows) < 2:
        return None
    current, previous = rows[0][0], rows[1][0]
    delta = round(current - previous, 2)
    return {"url": url, "current": current, "previous": previous, "delta": delta}

4. Review Sentiment Analysis

# After scraping reviews into SQLite:
import sqlite3

conn = sqlite3.connect("yelp_scrape.db")
rows = conn.execute("SELECT text, rating FROM reviews WHERE business_url=?", (biz_url,)).fetchall()
conn.close()

# Simple keyword sentiment breakdown
positive_keywords = ["great", "amazing", "excellent", "love", "best", "fantastic"]
negative_keywords = ["terrible", "awful", "worst", "never", "disgusting", "rude"]

for text, rating in rows:
    if text:
        text_lower = text.lower()
        pos = sum(1 for k in positive_keywords if k in text_lower)
        neg = sum(1 for k in negative_keywords if k in text_lower)
        print(f"Rating: {rating} | Pos signals: {pos} | Neg signals: {neg}")

5. Location Intelligence

Find underserved areas for a business category:

# Scrape multiple zip codes, count results per category
# Low review_count + high rating = established niche with growth potential
# Low count + low rating = underserved market with unmet demand

zip_codes = ["94110", "94103", "94117", "94102"]
category = "vegan"

coverage = {}
for zip_code in zip_codes:
    results = scrape_yelp_search(category, zip_code, max_pages=3)
    coverage[zip_code] = {
        "count": len(results),
        "avg_rating": sum(r.get("rating", 0) for r in results) / max(len(results), 1)
    }

for zip_code, stats in coverage.items():
    print(f"{zip_code}: {stats['count']} businesses, avg rating {stats['avg_rating']:.1f}")

Data Analysis with Pandas

Once you have data in SQLite, analysis is straightforward:

import json
import sqlite3
import pandas as pd

conn = sqlite3.connect("yelp_scrape.db")
df = pd.read_sql("SELECT * FROM businesses WHERE error IS NULL", conn)
conn.close()

# Parse JSON columns
df["categories_list"] = df["categories"].apply(lambda x: json.loads(x) if x else [])

# Explode categories so each row = one category
df_cats = df.explode("categories_list").rename(columns={"categories_list": "category"})

# Average rating by category
print(df_cats.groupby("category")["rating"].agg(["mean", "count"]).sort_values("mean", ascending=False).head(20))

# Price range distribution
print(df["price_range"].value_counts())

# Review volume trend (requires date-stamped scrapes)
df["scraped_date"] = pd.to_datetime(df["scraped_at"]).dt.date
print(df.groupby("scraped_date")["review_count"].sum())

# Top cities by business count
print(df["city"].value_counts().head(10))

Why Residential Proxies Are Non-Negotiable for Yelp

Yelp's IP-level defenses are among the most aggressive of any consumer platform: datacenter ASN ranges are blocked outright, per-IP rate limits trip at roughly 20–30 requests per minute, and IPs whose geolocation doesn't match the businesses being browsed draw extra scrutiny.

ThorData's rotating residential proxy network addresses all three. You get real residential IPs from ISPs like Comcast, AT&T, and Verizon — the same ASNs as actual Yelp users. Rotation is automatic, per-request or per-session depending on your config, and geo-targeting keeps your scrape traffic appearing to come from the same city as your target businesses.

# ThorData proxy config
PROXY = "http://USERNAME:[email protected]:9001"

# For city-specific targeting (reduces geo-mismatch blocks)
PROXY_SF = "http://USERNAME:[email protected]:9001?country=US&city=SanFrancisco"

Is Scraping Yelp Legal?

hiQ v. LinkedIn (9th Cir. 2022): The court held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (CFAA). Yelp business listings — names, addresses, ratings — are publicly available without authentication. This precedent offers meaningful protection against CFAA-based claims.

Yelp ToS: Yelp's Terms of Service prohibit automated access. ToS violations are a breach of contract matter, not a criminal one. Practical enforcement risk is low for moderate-volume scraping that does not:

- Reproduce large volumes of review text verbatim for commercial redistribution
- Bypass authentication or access non-public data
- Cause measurable server load

Practical advice:

- Never scrape behind login (authentication changes the legal calculus significantly)
- Don't republish raw review text — use it for analysis, not syndication
- Keep request rates reasonable; aggressive scraping is easier to litigate as interference with business
- If you're building a commercial product on Yelp data, the Fusion API's licensing is cleaner


Key Takeaways