Scraping TripAdvisor Reviews with Python: Hotels & Restaurants (2026)
TripAdvisor is one of the richest public sources of hospitality sentiment data on the internet. Hundreds of millions of reviews across hotels, restaurants, and attractions — all with structured ratings, timestamps, and free-text opinions from real customers. For competitive analysis, market research, reputation monitoring, or training review classifiers, this data is genuinely useful.
This guide covers what actually works in 2026: fetching listing pages, parsing review blocks, handling pagination, dealing with Cloudflare, storing results, and running at scale with proxy rotation.
How TripAdvisor Structures Its Review Pages
A typical hotel or restaurant listing URL looks like this:
https://www.tripadvisor.com/Hotel_Review-g60763-d93437-Reviews-The_Plaza-New_York_City_New_York.html
The URL encodes key identifiers:
- g60763 — geographic area ID (New York City)
- d93437 — property/location ID
- The slug after Reviews- — human-readable property name
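These identifiers can be pulled out of any listing URL with a couple of regexes. A minimal sketch (the `parse_listing_url` helper name is ours, not TripAdvisor's):

```python
import re

def parse_listing_url(url: str) -> dict:
    """Extract the geo ID, property ID, and slug from a TripAdvisor listing URL."""
    geo = re.search(r"-g(\d+)-", url)
    prop = re.search(r"-d(\d+)-", url)
    # The slug sits between "Reviews-" (optionally followed by an or{offset}
    # pagination token) and the trailing ".html"
    slug = re.search(r"Reviews-(?:or\d+-)?(.+?)\.html", url)
    return {
        "geo_id": geo.group(1) if geo else None,
        "property_id": prop.group(1) if prop else None,
        "slug": slug.group(1) if slug else None,
    }

ids = parse_listing_url(
    "https://www.tripadvisor.com/Hotel_Review-g60763-d93437-Reviews-"
    "The_Plaza-New_York_City_New_York.html"
)
# ids["geo_id"] == "60763", ids["property_id"] == "93437"
```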
The page loads reviews in a container with data-test attributes that TripAdvisor uses consistently. Each review block is wrapped in a div with data-test="review-container". Inside that you'll find:
- Rating: a span with class ui_bubble_rating and a class suffix like bubble_50 = 5 stars
- Review title: data-test="review-title"
- Review body: data-test="review-body"
- Date posted: span.ratingDate or a div with a date-related class
- Reviewer username: an a element linking to their profile page
The structure has shifted slightly from year to year, but the data-test attributes have remained stable. Most review content is present in the initial HTML — you do not need JavaScript execution for the main review blocks.
Setting Up Your Environment
pip install requests beautifulsoup4 lxml httpx
For anti-Cloudflare work you may also want:
pip install curl-cffi
Fetching a Listing Page
Start with a session and realistic headers. TripAdvisor checks User-Agent and Accept-Language headers aggressively.
import requests
from bs4 import BeautifulSoup
import time
import random
import json
import re
HEADERS_POOL = [
{
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
},
{
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
},
{
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36",
"Accept-Language": "en-GB,en;q=0.9",
},
]
def make_session(proxy_url=None):
session = requests.Session()
ua_headers = random.choice(HEADERS_POOL)
session.headers.update({
**ua_headers,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.tripadvisor.com/",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
})
if proxy_url:
session.proxies = {"http": proxy_url, "https": proxy_url}
return session
def fetch_page(session, url, retries=3):
for attempt in range(retries):
try:
resp = session.get(url, timeout=25)
if resp.status_code == 429:
wait = 30 * (attempt + 1)
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
if resp.status_code == 403:
print(f"403 Forbidden at {url} — IP may be blocked")
return None
resp.raise_for_status()
return BeautifulSoup(resp.text, "lxml")
except requests.RequestException as e:
if attempt == retries - 1:
print(f"Failed after {retries} attempts: {e}")
return None
time.sleep(10 * (attempt + 1))
return None
Always reuse the session object across requests. This persists cookies, which helps maintain the appearance of a browsing session rather than isolated bot hits.
Extracting Review Data
Once you have the parsed HTML, pulling the review blocks is straightforward:
def parse_reviews(soup):
"""Extract reviews from TripAdvisor HTML using multiple selector strategies."""
if soup is None:
return []
reviews = []
# Strategy 1: data-test attributes (most stable)
containers = soup.find_all("div", attrs={"data-test": "review-container"})
# Strategy 2: JSON-LD structured data (cleanest when available)
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string)
if data.get("@type") in ["Hotel", "Restaurant", "LodgingBusiness"]:
revs = data.get("review") or []
if isinstance(revs, dict):  # JSON-LD may give a single object instead of a list
revs = [revs]
for rev in revs:
reviews.append({
"rating": rev.get("reviewRating", {}).get("ratingValue"),
"title": rev.get("name"),
"body": rev.get("reviewBody"),
"date": rev.get("datePublished"),
"reviewer": rev.get("author", {}).get("name"),
"source": "json-ld",
})
except (json.JSONDecodeError, TypeError, AttributeError):
continue
if reviews:
return reviews
# Fallback: DOM parsing from data-test containers
for container in containers:
# Rating: class like "ui_bubble_rating bubble_50" — last two digits / 10 = stars
rating = None
rating_span = container.find("span", class_="ui_bubble_rating")
if rating_span:
for cls in rating_span.get("class", []):
if cls.startswith("bubble_"):
rating = int(cls.replace("bubble_", "")) // 10
break
# Title
title_el = container.find(attrs={"data-test": "review-title"})
title = title_el.get_text(strip=True) if title_el else None
# Body — may be truncated with a "Read more" button
body_el = container.find(attrs={"data-test": "review-body"})
body = body_el.get_text(strip=True) if body_el else None
# Date — multiple possible locations
date = None
for date_attrs in [
{"class": "ratingDate"},
{"class": lambda c: bool(c) and "_date" in c.lower()},
]:
date_el = container.find("span", attrs=date_attrs) or container.find("div", attrs=date_attrs)
if date_el:
date = date_el.get_text(strip=True)
break
# Reviewer username
reviewer_el = container.find("a", href=lambda h: h and "/Profile/" in h)
reviewer = reviewer_el.get_text(strip=True) if reviewer_el else None
# Trip type (solo, couple, family, etc.)
trip_el = container.find(attrs={"data-test": "trip-type"})
trip_type = trip_el.get_text(strip=True) if trip_el else None
# Helpful votes
helpful_el = container.find(attrs={"data-test": "helpful-count"})
helpful = helpful_el.get_text(strip=True) if helpful_el else None
reviews.append({
"rating": rating,
"title": title,
"body": body,
"date": date,
"reviewer": reviewer,
"trip_type": trip_type,
"helpful": helpful,
"source": "dom",
})
return reviews
def diagnose_page(soup):
"""Debug helper — check what TripAdvisor actually returned."""
if soup is None:
return "No response"
title = soup.find("title")
review_containers = soup.find_all("div", attrs={"data-test": "review-container"})
ld_scripts = soup.find_all("script", type="application/ld+json")
html = str(soup)
cloudflare_check = "cf-browser-verification" in html or "Checking your browser" in html
captcha = "captcha" in html.lower()
return {
"page_title": title.get_text() if title else "None",
"review_containers_found": len(review_containers),
"ld_json_scripts": len(ld_scripts),
"cloudflare_challenge": cloudflare_check,
"captcha_present": captcha,
}
Handling Pagination
TripAdvisor uses offset-based pagination embedded in the URL. For a hotel with the slug Reviews-The_Plaza, the pages look like:
...Reviews-The_Plaza-...html # page 1, reviews 1-10
...Reviews-or10-The_Plaza-...html # page 2, reviews 11-20
...Reviews-or20-The_Plaza-...html # page 3, reviews 21-30
The pattern is inserting or{offset} before the property slug. Increment by 10 for each page.
def paginate_url(base_url: str, offset: int) -> str:
"""Insert pagination offset into TripAdvisor URL."""
if offset == 0:
return base_url
# Insert or{offset} after "Reviews-" in the URL
return re.sub(r"(Reviews-)", rf"\1or{offset}-", base_url, count=1)
def scrape_all_reviews(base_url: str, max_pages: int = 10,
proxy_url: str = None) -> list:
"""Scrape all reviews from a TripAdvisor listing."""
all_reviews = []
session = make_session(proxy_url)
# First request: visit the homepage to get cookies
try:
session.get("https://www.tripadvisor.com/", timeout=15)
except requests.RequestException:
pass
time.sleep(random.uniform(1.5, 3.0))
empty_pages = 0
for page_num in range(max_pages):
offset = page_num * 10
url = paginate_url(base_url, offset)
print(f"Page {page_num + 1}: {url}")
soup = fetch_page(session, url)
if soup is None:
break
# Diagnose if something looks wrong
diag = diagnose_page(soup)
if diag.get("cloudflare_challenge") or diag.get("captcha_present"):
print(f"Bot detection triggered: {diag}")
break
reviews = parse_reviews(soup)
if not reviews:
empty_pages += 1
if empty_pages >= 2:
print("Two consecutive empty pages — likely end of reviews")
break
else:
empty_pages = 0
all_reviews.extend(reviews)
print(f" Found {len(reviews)} reviews (total: {len(all_reviews)})")
# Randomized delay — essential for avoiding rate limits
time.sleep(random.uniform(3.5, 7.0))
return all_reviews
Anti-Bot Measures and Proxy Rotation
TripAdvisor runs Cloudflare in front of its listing pages. This means beyond header spoofing, your IP reputation matters a lot. Datacenter IPs — any cloud provider, VPS, or hosting range — get blocked almost immediately. You will see Cloudflare challenge pages or 403s within a handful of requests.
The only reliable solution is residential proxies. Routing requests through ThorData's residential proxy network gives access to IPs from real ISPs globally. For TripAdvisor specifically, matching the proxy country to the listing's locale helps — a New York hotel page performs better with a US residential IP than one from Singapore.
# ThorData residential proxy configuration
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.net"
THORDATA_PORT = 9000
def get_proxy_url(country: str = None, session_id: str = None) -> str:
"""
Build ThorData proxy URL.
country: ISO 2-letter code, e.g. "us", "gb", "de"
session_id: for sticky sessions (same IP across multiple requests)
"""
user = THORDATA_USER
if country:
user += f"-country-{country}"
if session_id:
user += f"-session-{session_id}"
return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
# For scraping US hotels — use US residential IP
us_proxy = get_proxy_url(country="us")
session = make_session(proxy_url=us_proxy)
# Sticky session: same IP for an entire listing scrape
import uuid
sticky_id = str(uuid.uuid4())[:8]
sticky_proxy = get_proxy_url(country="us", session_id=sticky_id)
Key practices that reduce detection:
- Sticky sessions per listing: Use the same IP across all pages of a single listing, but rotate between listings
- Geo-match proxies: Use US IPs for US listings, UK IPs for UK listings
- Load the homepage first: Visit tripadvisor.com before your target page to get session cookies
- Randomize delays: time.sleep(random.uniform(3, 7)) is the sweet spot
- Rotate user agents: Maintain a pool of 5+ current browser UA strings
- Handle 503s: If you get a 503, stop scraping for 5-10 minutes before retrying
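The first two practices can be combined by deriving the sticky-session ID deterministically from the property ID: every page of one listing reuses the same IP, while different listings get different sessions. A sketch, with placeholder credentials and the same username format as the proxy configuration above:

```python
import uuid

THORDATA_USER = "your_username"  # placeholder credentials
THORDATA_PASS = "your_password"

def sticky_proxy_for_listing(property_id: str, country: str = "us") -> str:
    """One deterministic session ID per listing: same property -> same IP,
    different properties -> different sessions."""
    session_id = uuid.uuid5(uuid.NAMESPACE_URL, property_id).hex[:8]
    user = f"{THORDATA_USER}-country-{country}-session-{session_id}"
    return f"http://{user}:{THORDATA_PASS}@proxy.thordata.net:9000"

# Calling this twice with the same property ID yields the same proxy URL,
# so all pages of one listing route through one residential IP.
```

Using uuid5 (name-based) rather than uuid4 means a re-run of the scraper picks up the same session ID per listing, which keeps retries on the same IP.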
TripAdvisor also does browser fingerprinting via JavaScript on some flows. If you need to handle challenge pages or click-through CAPTCHAs, you will need a headless browser. For most listing pages and basic review extraction though, plain requests works if your IP is clean.
Finding Location and Property IDs
TripAdvisor's internal location IDs (the g and d numbers in URLs) are stable and reusable.
Approach 1: Manual extraction
Search TripAdvisor normally in a browser and copy the URL — the g number is the geographic area ID, d is the specific property.
Approach 2: Scrape search results
def search_tripadvisor(query: str, location_type: str = "Hotels",
session=None) -> list:
"""
Search TripAdvisor and extract property IDs from results.
location_type: "Hotels", "Restaurants", "Attractions"
"""
if not session:
session = make_session()
from urllib.parse import urlencode  # URL-encode the query safely (spaces, etc.)
url = "https://www.tripadvisor.com/Search"
params = {
"q": query,
"searchSessionId": str(uuid.uuid4()),
"geo": 1, # worldwide
}
soup = fetch_page(session, url + "?" + urlencode(params))
if not soup:
return []
properties = []
for link in soup.find_all("a", href=True):
href = link["href"]
# Extract geo and property IDs from href
geo_match = re.search(r"-g(\d+)-", href)
prop_match = re.search(r"-d(\d+)-", href)
if geo_match and prop_match:
full_url = "https://www.tripadvisor.com" + href if href.startswith("/") else href
properties.append({
"url": full_url,
"geo_id": geo_match.group(1),
"property_id": prop_match.group(1),
"name": link.get_text(strip=True),
})
# Deduplicate by property_id
seen = set()
unique = []
for p in properties:
if p["property_id"] not in seen:
seen.add(p["property_id"])
unique.append(p)
return unique
def extract_location_ids_from_page(soup) -> list:
"""Extract all property IDs from an already-fetched page."""
ids = []
for link in soup.find_all("a", href=True):
href = link["href"]
match = re.search(r"-d(\d+)-", href)
if match:
ids.append(match.group(1))
return list(set(ids))
Saving Data to CSV and SQLite
import csv
import sqlite3
from datetime import datetime
def save_to_csv(reviews: list, filename: str):
if not reviews:
return
fieldnames = ["rating", "title", "body", "date", "reviewer",
"trip_type", "helpful", "source"]
with open(filename, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerows(reviews)
print(f"Saved {len(reviews)} reviews to {filename}")
def setup_db(db_path="tripadvisor.db"):
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
property_id TEXT,
property_name TEXT,
rating INTEGER,
title TEXT,
body TEXT,
date TEXT,
reviewer TEXT,
trip_type TEXT,
helpful TEXT,
scraped_at TEXT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_property ON reviews(property_id)")
conn.commit()
return conn
def save_to_db(conn, reviews: list, property_id: str, property_name: str):
now = datetime.utcnow().isoformat()
rows = [
(property_id, property_name,
r.get("rating"), r.get("title"), r.get("body"),
r.get("date"), r.get("reviewer"), r.get("trip_type"),
r.get("helpful"), now)
for r in reviews
]
conn.executemany(
"""INSERT INTO reviews
(property_id, property_name, rating, title, body, date, reviewer,
trip_type, helpful, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?)""",
rows
)
conn.commit()
return len(rows)
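With reviews in SQLite, per-property aggregates are a single GROUP BY away. A self-contained sketch against an in-memory copy of the relevant columns (the rows here are invented sample data, not real reviews):

```python
import sqlite3

# In-memory subset of the reviews schema with sample data
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reviews (
        property_id TEXT, property_name TEXT, rating INTEGER
    )
""")
conn.executemany(
    "INSERT INTO reviews VALUES (?,?,?)",
    [("d93437", "Sample Hotel", 5), ("d93437", "Sample Hotel", 3),
     ("d11111", "Sample Bistro", 4)],
)

# Review count and average rating per property
rows = conn.execute("""
    SELECT property_id, COUNT(*), ROUND(AVG(rating), 2)
    FROM reviews
    GROUP BY property_id
    ORDER BY property_id
""").fetchall()
# rows == [("d11111", 1, 4.0), ("d93437", 2, 4.0)]
```

The same query runs unchanged against the full table created by setup_db, since the extra columns are ignored.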
Rating Distribution Analysis
Once you have reviews, here's a quick analysis you can run:
from collections import Counter
def analyze_reviews(reviews: list) -> dict:
"""Basic sentiment analysis on scraped review data."""
ratings = [r["rating"] for r in reviews if r.get("rating")]
distribution = Counter(ratings)
avg = sum(r * c for r, c in distribution.items()) / sum(distribution.values()) if distribution else 0
# Simple keyword frequency from review bodies
all_text = " ".join(r.get("body", "") for r in reviews if r.get("body")).lower()
positive_words = ["excellent", "amazing", "wonderful", "fantastic", "perfect",
"great", "love", "loved", "best", "outstanding"]
negative_words = ["terrible", "awful", "horrible", "worst", "disgusting",
"dirty", "rude", "slow", "disappointing", "overpriced"]
pos_count = sum(all_text.count(w) for w in positive_words)
neg_count = sum(all_text.count(w) for w in negative_words)
return {
"total_reviews": len(reviews),
"average_rating": round(avg, 2),
"distribution": dict(sorted(distribution.items())),
"positive_keyword_hits": pos_count,
"negative_keyword_hits": neg_count,
"sentiment_ratio": round(pos_count / max(neg_count, 1), 2),
}
# Example
reviews = scrape_all_reviews(
"https://www.tripadvisor.com/Hotel_Review-g60763-d93437-Reviews-The_Plaza-New_York_City_New_York.html",
max_pages=5
)
stats = analyze_reviews(reviews)
for key, val in stats.items():
print(f" {key}: {val}")
Handling Rate Limits Systematically
For sustained collection across hundreds of listings:
import time
from dataclasses import dataclass, field
from typing import List
@dataclass
class ScraperConfig:
min_delay: float = 3.0
max_delay: float = 7.0
max_retries: int = 3
requests_per_session: int = 30 # rotate session after this many requests
backoff_multiplier: float = 2.0
max_backoff: float = 300.0 # 5 minutes max wait
config = ScraperConfig()
request_counter = 0
session = None
def get_or_refresh_session(proxy_url=None):
global session, request_counter
if session is None or request_counter >= config.requests_per_session:
session = make_session(proxy_url)
request_counter = 0
# Always visit homepage first for fresh cookies
try:
session.get("https://www.tripadvisor.com/", timeout=15)
time.sleep(random.uniform(1, 2))
except Exception:
pass
return session
def scrape_with_backoff(url: str, proxy_url: str = None):
global request_counter
sess = get_or_refresh_session(proxy_url)
backoff = 5.0
for attempt in range(config.max_retries):
soup = fetch_page(sess, url)
request_counter += 1
if soup is None:
time.sleep(min(backoff, config.max_backoff))
backoff *= config.backoff_multiplier
sess = get_or_refresh_session(proxy_url) # force refresh on error
continue
diag = diagnose_page(soup)
if diag.get("cloudflare_challenge"):
print("Cloudflare challenge — rotating session and waiting...")
time.sleep(min(backoff * 3, config.max_backoff))
backoff *= config.backoff_multiplier
sess = get_or_refresh_session(proxy_url)
continue
return soup
return None
Summary
TripAdvisor review extraction with Python is doable — the HTML structure is consistent, pagination is URL-based, and BeautifulSoup handles the parsing cleanly. The real constraint is IP quality.
Key takeaways:
1. Datacenter IPs don't work reliably — Cloudflare blocks them within a handful of requests
2. Residential proxies are a practical requirement — ThorData is a reliable option with country-targeting
3. Match proxy country to listing locale — US IPs for US hotels, UK IPs for UK restaurants
4. Use sticky sessions per listing — same IP for all pages of one property, rotate between properties
5. JSON-LD structured data is more stable than DOM selectors — prefer it when available
6. Randomize everything — delays, user agents, request timing
Get the basics working locally first, then layer in proxy rotation once you're confident the parsing logic is solid.