Scraping TripAdvisor Reviews with Python: Hotels & Restaurants (2026)
TripAdvisor is one of the richest public sources of hospitality sentiment data on the internet. Hundreds of millions of reviews across hotels, restaurants, and attractions — all with structured ratings, timestamps, and free-text opinions from real customers. For competitive analysis, market research, reputation monitoring, or training review classifiers, this data is genuinely useful.
This guide covers what actually works in 2026: fetching listing pages, parsing review blocks, handling pagination, dealing with Cloudflare, storing results, and running at scale with proxy rotation.
How TripAdvisor Structures Its Review Pages
A typical hotel or restaurant listing URL looks like this:
https://www.tripadvisor.com/Hotel_Review-g60763-d93437-Reviews-The_Plaza-New_York_City_New_York.html
The URL encodes key identifiers:
- g60763 — geographic area ID (New York City)
- d93437 — property/location ID
- The slug after Reviews- — human-readable property name
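These identifiers can be pulled out of any listing URL with a couple of regexes. A minimal sketch (the `parse_listing_url` helper name is ours, not TripAdvisor's):

```python
import re

def parse_listing_url(url: str) -> dict:
    """Extract the geo ID, property ID, and slug from a TripAdvisor listing URL."""
    geo = re.search(r"-g(\d+)-", url)
    prop = re.search(r"-d(\d+)-", url)
    # The slug sits between "Reviews-" (optionally followed by an or{offset}
    # pagination token) and the trailing ".html"
    slug = re.search(r"Reviews-(?:or\d+-)?(.+?)\.html", url)
    return {
        "geo_id": geo.group(1) if geo else None,
        "property_id": prop.group(1) if prop else None,
        "slug": slug.group(1) if slug else None,
    }

ids = parse_listing_url(
    "https://www.tripadvisor.com/Hotel_Review-g60763-d93437-Reviews-"
    "The_Plaza-New_York_City_New_York.html"
)
# ids["geo_id"] == "60763", ids["property_id"] == "93437"
```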
The page loads reviews in a container with data-test attributes that TripAdvisor uses consistently. Each review block is wrapped in a div with data-test="review-container". Inside that you'll find:
- Rating: a span with class ui_bubble_rating and a class suffix like bubble_50 = 5 stars
- Review title: data-test="review-title"
- Review body: data-test="review-body"
- Date posted: span.ratingDate or a div with a date-related class
- Reviewer username: an a element linking to their profile page
The structure has shifted slightly from year to year, but the data-test attributes have remained stable. Most review content is present in the initial HTML — you do not need JavaScript execution for the main review blocks.
Setting Up Your Environment
pip install requests beautifulsoup4 lxml httpx
For anti-Cloudflare work you may also want:
pip install curl-cffi
Fetching a Listing Page
Start with a session and realistic headers. TripAdvisor checks User-Agent and Accept-Language headers aggressively.
import requests
from bs4 import BeautifulSoup
import time
import random
import json
import re
HEADERS_POOL = [
{
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
},
{
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
},
{
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
"Accept-Language": "en-GB,en;q=0.9",
},
]
def make_session(proxy_url=None):
session = requests.Session()
ua_headers = random.choice(HEADERS_POOL)
session.headers.update({
**ua_headers,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.tripadvisor.com/",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
})
if proxy_url:
session.proxies = {"http": proxy_url, "https": proxy_url}
return session
def fetch_page(session, url, retries=3):
for attempt in range(retries):
try:
resp = session.get(url, timeout=25)
if resp.status_code == 429:
wait = 30 * (attempt + 1)
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
if resp.status_code == 403:
print(f"403 Forbidden at {url} — IP may be blocked")
return None
resp.raise_for_status()
return BeautifulSoup(resp.text, "lxml")
except requests.RequestException as e:
if attempt == retries - 1:
print(f"Failed after {retries} attempts: {e}")
return None
time.sleep(10 * (attempt + 1))
return None
Always reuse the session object across requests. This persists cookies, which helps maintain the appearance of a browsing session rather than isolated bot hits.
Extracting Review Data
Once you have the parsed HTML, pulling the review blocks is straightforward:
def parse_reviews(soup):
"""Extract reviews from TripAdvisor HTML using multiple selector strategies."""
if soup is None:
return []
reviews = []
# Strategy 1: data-test attributes (most stable)
containers = soup.find_all("div", attrs={"data-test": "review-container"})
# Strategy 2: JSON-LD structured data (cleanest when available)
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
if data.get("@type") in ["Hotel", "Restaurant", "LodgingBusiness"]:
revs = data.get("review") or []
if isinstance(revs, dict):  # JSON-LD may give a single object instead of a list
revs = [revs]
for rev in revs:
reviews.append({
"rating": rev.get("reviewRating", {}).get("ratingValue"),
"title": rev.get("name"),
"body": rev.get("reviewBody"),
"date": rev.get("datePublished"),
"reviewer": rev.get("author", {}).get("name"),
"source": "json-ld",
})
except (json.JSONDecodeError, TypeError, AttributeError):
continue
if reviews:
return reviews
# Fallback: DOM parsing from data-test containers
for container in containers:
# Rating: class like "ui_bubble_rating bubble_50" — last two digits / 10 = stars
rating = None
rating_span = container.find("span", class_="ui_bubble_rating")
if rating_span:
for cls in rating_span.get("class", []):
if cls.startswith("bubble_"):
rating = int(cls.replace("bubble_", "")) // 10
break
# Title
title_el = container.find(attrs={"data-test": "review-title"})
title = title_el.get_text(strip=True) if title_el else None
# Body — may be truncated with a "Read more" button
body_el = container.find(attrs={"data-test": "review-body"})
body = body_el.get_text(strip=True) if body_el else None
# Date — multiple possible locations
date = None
for date_attrs in [
{"class": "ratingDate"},
{"class": lambda c: bool(c) and "_date" in c.lower()},
]:
date_el = container.find("span", attrs=date_attrs) or container.find("div", attrs=date_attrs)
if date_el:
date = date_el.get_text(strip=True)
break
# Reviewer username
reviewer_el = container.find("a", href=lambda h: h and "/Profile/" in h)
reviewer = reviewer_el.get_text(strip=True) if reviewer_el else None
# Trip type (solo, couple, family, etc.)
trip_el = container.find(attrs={"data-test": "trip-type"})
trip_type = trip_el.get_text(strip=True) if trip_el else None
# Helpful votes
helpful_el = container.find(attrs={"data-test": "helpful-count"})
helpful = helpful_el.get_text(strip=True) if helpful_el else None
reviews.append({
"rating": rating,
"title": title,
"body": body,
"date": date,
"reviewer": reviewer,
"trip_type": trip_type,
"helpful": helpful,
"source": "dom",
})
return reviews
def diagnose_page(soup):
"""Debug helper — check what TripAdvisor actually returned."""
if soup is None:
return "No response"
title = soup.find("title")
review_containers = soup.find_all("div", attrs={"data-test": "review-container"})
ld_scripts = soup.find_all("script", type="application/ld+json")
html = str(soup)
cloudflare_check = "cf-browser-verification" in html or "Checking your browser" in html
captcha = "captcha" in html.lower()
return {
"page_title": title.get_text() if title else "None",
"review_containers_found": len(review_containers),
"ld_json_scripts": len(ld_scripts),
"cloudflare_challenge": cloudflare_check,
"captcha_present": captcha,
}
Handling Pagination
TripAdvisor uses offset-based pagination embedded in the URL. For a hotel with the slug Reviews-The_Plaza, the pages look like:
...Reviews-The_Plaza-...html # page 1, reviews 1-10
...Reviews-or10-The_Plaza-...html # page 2, reviews 11-20
...Reviews-or20-The_Plaza-...html # page 3, reviews 21-30
The pattern is inserting or{offset} before the property slug. Increment by 10 for each page.
def paginate_url(base_url: str, offset: int) -> str:
"""Insert pagination offset into TripAdvisor URL."""
if offset == 0:
return base_url
# Insert or{offset} after "Reviews-" in the URL
return re.sub(r"(Reviews-)", rf"\1or{offset}-", base_url, count=1)
def scrape_all_reviews(base_url: str, max_pages: int = 10,
proxy_url: str = None) -> list:
"""Scrape all reviews from a TripAdvisor listing."""
all_reviews = []
session = make_session(proxy_url)
# First request: visit the homepage to get cookies
try:
session.get("https://www.tripadvisor.com/", timeout=15)
except requests.RequestException:
pass
time.sleep(random.uniform(1.5, 3.0))
empty_pages = 0
for page_num in range(max_pages):
offset = page_num * 10
url = paginate_url(base_url, offset)
print(f"Page {page_num + 1}: {url}")
soup = fetch_page(session, url)
if soup is None:
break
# Diagnose if something looks wrong
diag = diagnose_page(soup)
if diag.get("cloudflare_challenge") or diag.get("captcha_present"):
print(f"Bot detection triggered: {diag}")
break
reviews = parse_reviews(soup)
if not reviews:
empty_pages += 1
if empty_pages >= 2:
print("Two consecutive empty pages — likely end of reviews")
break
else:
empty_pages = 0
all_reviews.extend(reviews)
print(f" Found {len(reviews)} reviews (total: {len(all_reviews)})")
# Randomized delay — essential for avoiding rate limits
time.sleep(random.uniform(3.5, 7.0))
return all_reviews
Anti-Bot Measures and Proxy Rotation
TripAdvisor runs Cloudflare in front of its listing pages. This means beyond header spoofing, your IP reputation matters a lot. Datacenter IPs — any cloud provider, VPS, or hosting range — get blocked almost immediately. You will see Cloudflare challenge pages or 403s within a handful of requests.
The only reliable solution is residential proxies. Routing requests through ThorData's residential proxy network gives access to IPs from real ISPs globally. For TripAdvisor specifically, matching the proxy country to the listing's locale helps — a New York hotel page performs better with a US residential IP than one from Singapore.
# ThorData residential proxy configuration
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.net"
THORDATA_PORT = 9000
def get_proxy_url(country: str = None, session_id: str = None) -> str:
"""
Build ThorData proxy URL.
country: ISO 2-letter code, e.g. "us", "gb", "de"
session_id: for sticky sessions (same IP across multiple requests)
"""
user = THORDATA_USER
if country:
user += f"-country-{country}"
if session_id:
user += f"-session-{session_id}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
# For scraping US hotels — use US residential IP
us_proxy = get_proxy_url(country="us")
session = make_session(proxy_url=us_proxy)
# Sticky session: same IP for an entire listing scrape
import uuid
sticky_id = str(uuid.uuid4())[:8]
sticky_proxy = get_proxy_url(country="us", session_id=sticky_id)
Key practices that reduce detection:
- Sticky sessions per listing: Use the same IP across all pages of a single listing, but rotate between listings
- Geo-match proxies: Use US IPs for US listings, UK IPs for UK listings
- Load the homepage first: Visit tripadvisor.com before your target page to get session cookies
- Randomize delays: time.sleep(random.uniform(3, 7)) is the sweet spot
- Rotate user agents: Maintain a pool of 5+ current browser UA strings
- Handle 503s: If you get a 503, stop scraping for 5-10 minutes before retrying
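The first two practices can be combined by deriving the sticky-session ID deterministically from the property ID: every page of one listing reuses the same IP, while different listings get different sessions. A sketch, with placeholder credentials and the same username format as the proxy configuration above:

```python
import uuid

THORDATA_USER = "your_username"  # placeholder credentials
THORDATA_PASS = "your_password"

def sticky_proxy_for_listing(property_id: str, country: str = "us") -> str:
    """One deterministic session ID per listing: same property -> same IP,
    different properties -> different sessions."""
    session_id = uuid.uuid5(uuid.NAMESPACE_URL, property_id).hex[:8]
    user = f"{THORDATA_USER}-country-{country}-session-{session_id}"
    return f"http://{user}:{THORDATA_PASS}@proxy.thordata.net:9000"

# Calling this twice with the same property ID yields the same proxy URL,
# so all pages of one listing route through one residential IP.
```

Using uuid5 (name-based) rather than uuid4 means a re-run of the scraper picks up the same session ID per listing, which keeps retries on the same IP.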
TripAdvisor also does browser fingerprinting via JavaScript on some flows. If you need to handle challenge pages or click-through CAPTCHAs, you will need a headless browser. For most listing pages and basic review extraction though, plain requests works if your IP is clean.
Finding Location and Property IDs
TripAdvisor's internal location IDs (the g and d numbers in URLs) are stable and reusable.
Approach 1: Manual extraction
Search TripAdvisor normally in a browser and copy the URL — the g number is the geographic area ID, d is the specific property.
Approach 2: Scrape search results
def search_tripadvisor(query: str, location_type: str = "Hotels",
session=None) -> list:
"""
Search TripAdvisor and extract property IDs from results.
location_type: "Hotels", "Restaurants", "Attractions"
"""
if not session:
session = make_session()
from urllib.parse import urlencode  # URL-encode the query safely (spaces, etc.)
url = "https://www.tripadvisor.com/Search"
params = {
"q": query,
"searchSessionId": str(uuid.uuid4()),
"geo": 1, # worldwide
}
soup = fetch_page(session, url + "?" + urlencode(params))
if not soup:
return []
properties = []
for link in soup.find_all("a", href=True):
href = link["href"]
# Extract geo and property IDs from href
geo_match = re.search(r"-g(\d+)-", href)
prop_match = re.search(r"-d(\d+)-", href)
if geo_match and prop_match:
full_url = "https://www.tripadvisor.com" + href if href.startswith("/") else href
properties.append({
"url": full_url,
"geo_id": geo_match.group(1),
"property_id": prop_match.group(1),
"name": link.get_text(strip=True),
})
# Deduplicate by property_id
seen = set()
unique = []
for p in properties:
if p["property_id"] not in seen:
seen.add(p["property_id"])
unique.append(p)
return unique
def extract_location_ids_from_page(soup) -> list:
"""Extract all property IDs from an already-fetched page."""
ids = []
for link in soup.find_all("a", href=True):
href = link["href"]
match = re.search(r"-d(\d+)-", href)
if match:
ids.append(match.group(1))
return list(set(ids))
Saving Data to CSV and SQLite
import csv
import sqlite3
from datetime import datetime
def save_to_csv(reviews: list, filename: str):
if not reviews:
return
fieldnames = ["rating", "title", "body", "date", "reviewer",
"trip_type", "helpful", "source"]
with open(filename, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerows(reviews)
print(f"Saved {len(reviews)} reviews to {filename}")
def setup_db(db_path="tripadvisor.db"):
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
property_id TEXT,
property_name TEXT,
rating INTEGER,
title TEXT,
body TEXT,
date TEXT,
reviewer TEXT,
trip_type TEXT,
helpful TEXT,
scraped_at TEXT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_property ON reviews(property_id)")
conn.commit()
return conn
def save_to_db(conn, reviews: list, property_id: str, property_name: str):
now = datetime.utcnow().isoformat()
rows = [
(property_id, property_name,
r.get("rating"), r.get("title"), r.get("body"),
r.get("date"), r.get("reviewer"), r.get("trip_type"),
r.get("helpful"), now)
for r in reviews
]
conn.executemany(
"""INSERT INTO reviews
(property_id, property_name, rating, title, body, date, reviewer,
trip_type, helpful, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?)""",
rows
)
conn.commit()
return len(rows)
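With reviews in SQLite, per-property aggregates are a single GROUP BY away. A self-contained sketch against an in-memory copy of the relevant columns (the rows here are invented sample data, not real reviews):

```python
import sqlite3

# In-memory subset of the reviews schema with sample data
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reviews (
        property_id TEXT, property_name TEXT, rating INTEGER
    )
""")
conn.executemany(
    "INSERT INTO reviews VALUES (?,?,?)",
    [("d93437", "Sample Hotel", 5), ("d93437", "Sample Hotel", 3),
     ("d11111", "Sample Bistro", 4)],
)

# Review count and average rating per property
rows = conn.execute("""
    SELECT property_id, COUNT(*), ROUND(AVG(rating), 2)
    FROM reviews
    GROUP BY property_id
    ORDER BY property_id
""").fetchall()
# rows == [("d11111", 1, 4.0), ("d93437", 2, 4.0)]
```

The same query runs unchanged against the full table created by setup_db, since the extra columns are ignored.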
Rating Distribution Analysis
Once you have reviews, here's a quick analysis you can run:
from collections import Counter
def analyze_reviews(reviews: list) -> dict:
"""Basic sentiment analysis on scraped review data."""
ratings = [r["rating"] for r in reviews if r.get("rating")]
distribution = Counter(ratings)
avg = sum(r * c for r, c in distribution.items()) / sum(distribution.values()) if distribution else 0
# Simple keyword frequency from review bodies
all_text = " ".join(r.get("body", "") for r in reviews if r.get("body")).lower()
positive_words = ["excellent", "amazing", "wonderful", "fantastic", "perfect",
"great", "love", "loved", "best", "outstanding"]
negative_words = ["terrible", "awful", "horrible", "worst", "disgusting",
"dirty", "rude", "slow", "disappointing", "overpriced"]
pos_count = sum(all_text.count(w) for w in positive_words)
neg_count = sum(all_text.count(w) for w in negative_words)
return {
"total_reviews": len(reviews),
"average_rating": round(avg, 2),
"distribution": dict(sorted(distribution.items())),
"positive_keyword_hits": pos_count,
"negative_keyword_hits": neg_count,
"sentiment_ratio": round(pos_count / max(neg_count, 1), 2),
}
# Example
reviews = scrape_all_reviews(
"https://www.tripadvisor.com/Hotel_Review-g60763-d93437-Reviews-The_Plaza-New_York_City_New_York.html",
max_pages=5
)
stats = analyze_reviews(reviews)
for key, val in stats.items():
print(f" {key}: {val}")
Handling Rate Limits Systematically
For sustained collection across hundreds of listings:
import time
from dataclasses import dataclass, field
from typing import List
@dataclass
class ScraperConfig:
min_delay: float = 3.0
max_delay: float = 7.0
max_retries: int = 3
requests_per_session: int = 30 # rotate session after this many requests
backoff_multiplier: float = 2.0
max_backoff: float = 300.0 # 5 minutes max wait
config = ScraperConfig()
request_counter = 0
session = None
def get_or_refresh_session(proxy_url=None):
global session, request_counter
if session is None or request_counter >= config.requests_per_session:
session = make_session(proxy_url)
request_counter = 0
# Always visit homepage first for fresh cookies
try:
session.get("https://www.tripadvisor.com/", timeout=15)
time.sleep(random.uniform(1, 2))
except Exception:
pass
return session
def scrape_with_backoff(url: str, proxy_url: str = None):
global request_counter
sess = get_or_refresh_session(proxy_url)
backoff = 5.0
for attempt in range(config.max_retries):
soup = fetch_page(sess, url)
request_counter += 1
if soup is None:
time.sleep(min(backoff, config.max_backoff))
backoff *= config.backoff_multiplier
sess = get_or_refresh_session(proxy_url) # force refresh on error
continue
diag = diagnose_page(soup)
if diag.get("cloudflare_challenge"):
print("Cloudflare challenge — rotating session and waiting...")
time.sleep(min(backoff * 3, config.max_backoff))
backoff *= config.backoff_multiplier
sess = get_or_refresh_session(proxy_url)
continue
return soup
return None
Summary
TripAdvisor review extraction with Python is doable — the HTML structure is consistent, pagination is URL-based, and BeautifulSoup handles the parsing cleanly. The real constraint is IP quality.
Key takeaways:
1. Datacenter IPs don't work reliably — Cloudflare blocks them within a handful of requests
2. Residential proxies are a practical requirement — ThorData is a reliable option with country-targeting
3. Match proxy country to listing locale — US IPs for US hotels, UK IPs for UK restaurants
4. Use sticky sessions per listing — same IP for all pages of one property, rotate between properties
5. JSON-LD structured data is more stable than DOM selectors — prefer it when available
6. Randomize everything — delays, user agents, request timing
Get the basics working locally first, then layer in proxy rotation once you're confident the parsing logic is solid.