Scraping Indeed Company Reviews with Python (2026)
Indeed has one of the largest databases of company reviews online. CEO approval ratings, work-life balance scores, compensation satisfaction, culture ratings - all left by actual employees. If you are building an employer comparison tool, doing HR market research, or just want structured data about company reputation, this is a solid source.
The catch? Indeed does not have a public API for reviews. You will need to scrape it, and they are pretty aggressive about blocking scrapers. This guide covers the full stack: Playwright-based extraction with real selectors, an httpx fallback for JSON-LD data, anti-detection configuration, SQLite storage, batch comparison across companies, and sentiment analysis on the review text.
Page Structure
Indeed company review pages follow this pattern:
https://www.indeed.com/cmp/{company-slug}/reviews
Each page shows around 20 reviews. Pagination appends ?start=20, ?start=40, and so on. The review data is partially in the HTML and partially loaded via JSON embedded in <script> tags. The company slug is usually the lowercase company name with hyphens - google, amazon, stripe, palantir-technologies.
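That offset scheme is easy to precompute. Here is a small helper; the function name and the 20-per-page size are assumptions based on what Indeed currently renders:

```python
def review_page_urls(company_slug: str, max_pages: int, page_size: int = 20) -> list[str]:
    """Build the paginated review URLs for a company. Assumes the
    ?start= offset scheme described above and 20 reviews per page."""
    base = f"https://www.indeed.com/cmp/{company_slug}/reviews"
    return [f"{base}?start={start}" for start in range(0, max_pages * page_size, page_size)]
```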
What You Can Extract
Here is the full data structure for a single review. This is what the scraper below produces:
{
"title": "Great work-life balance but limited growth",
"rating": 4.0,
"pros": "Flexible hours, good benefits package, nice coworkers",
"cons": "Slow promotion track, outdated technology stack",
"date": "March 15, 2026",
"job_title": "Software Engineer",
"employment_status": "Current Employee",
"location": "Austin, TX",
"helpful_count": 12,
"sub_ratings": {
"work_life_balance": 5.0,
"compensation": 3.0,
"management": 3.0,
"job_security": 4.0,
"culture": 4.0
}
}
The sub-ratings and employment status are only available once you expand individual review cards. The Playwright script handles that expansion automatically.
Complete Playwright Scraper
This is the full scraper. It extracts every field from the structure above, handles pagination, expands "Show More" buttons, and includes stealth configuration to avoid the most common fingerprint checks.
import json
import re
import time
import random
from playwright.sync_api import sync_playwright
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
def stealth_context(browser, proxy=None):
"""Create a browser context with anti-detection settings."""
kwargs = {
"user_agent": random.choice(USER_AGENTS),
"viewport": {"width": random.choice([1366, 1440, 1920]), "height": random.choice([768, 900, 1080])},
"locale": "en-US",
"timezone_id": "America/New_York",
"accept_downloads": False,
}
if proxy:
kwargs["proxy"] = proxy
context = browser.new_context(**kwargs)
# Disable the webdriver flag - this is what most fingerprint checks look for first
context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
window.chrome = { runtime: {} };
""")
return context
def extract_review(element) -> dict | None:
"""Extract all fields from a single review card element."""
try:
def text(selector):
el = element.query_selector(selector)
return el.inner_text().strip() if el else ""
def attr(selector, attribute):
el = element.query_selector(selector)
return el.get_attribute(attribute) if el else ""
# Title
title = (
text('[data-testid="review-title"]') or
text('[itemprop="name"]') or
text(".cmp-Review-title")
)
# Overall star rating
rating_raw = (
attr('[itemprop="ratingValue"]', "content") or
attr('[data-testid="rating-star"]', "aria-label") or
text(".cmp-ReviewStars")
)
rating = None
if rating_raw:
m = re.search(r"[\d.]+", rating_raw)
if m:
rating = float(m.group())
# Pros and cons - Indeed splits these into two separate fields
pros = (
text('[data-testid="review-text-pros"]') or
text(".cmp-Review-pros")
)
cons = (
text('[data-testid="review-text-cons"]') or
text(".cmp-Review-cons")
)
# Full review text for older style reviews
full_text = text('[itemprop="reviewBody"]') or text('[data-testid="review-text"]')
# Date - try multiple formats Indeed uses
date_el = element.query_selector("time, [data-testid='review-date'], .cmp-ReviewDate")
date = ""
if date_el:
date = date_el.get_attribute("datetime") or date_el.inner_text().strip()
# Job title
job_title = (
text('[data-testid="review-job-title"]') or
text('[itemprop="jobTitle"]') or
text(".cmp-ReviewJobTitle")
)
# Employment status
status_el = element.query_selector('[data-testid="review-author"], .cmp-ReviewEmploymentStatus')
employment_status = ""
if status_el:
status_text = status_el.inner_text()
if "Current" in status_text:
employment_status = "Current Employee"
elif "Former" in status_text:
employment_status = "Former Employee"
# Location
location = (
text('[data-testid="review-location"]') or
text(".cmp-Review-location") or
text('[itemprop="jobLocation"]')
)
# Helpful count
helpful_raw = text('[data-testid="review-helpful-count"]') or text(".cmp-ReviewHelpful")
helpful_count = 0
if helpful_raw:
m = re.search(r"\d+", helpful_raw)
if m:
helpful_count = int(m.group())
# Sub-ratings
sub_ratings = {}
# Try data-testid approach first
for key in ["work-life-balance", "compensation", "management", "job-security", "culture"]:
el = element.query_selector(f'[data-testid="rating-{key}"]')
if el:
val_raw = el.get_attribute("aria-label") or el.inner_text()
m = re.search(r"[\d.]+", val_raw)
if m:
clean_key = key.replace("-", "_")
sub_ratings[clean_key] = float(m.group())
# Fallback: scan all rating cells in the review
if not sub_ratings:
cells = element.query_selector_all(".cmp-ReviewRatings tr, .cmp-SubRating")
for cell in cells:
cell_text = cell.inner_text().strip()
m = re.match(r"(.+?)\s+([\d.]+)\s*(?:out of \d+)?$", cell_text, re.MULTILINE)
if m:
label = m.group(1).strip().lower().replace("/", "_").replace(" ", "_")
sub_ratings[label] = float(m.group(2))
return {
"title": title,
"rating": rating,
"pros": pros,
"cons": cons,
"text": full_text,
"date": date,
"job_title": job_title,
"employment_status": employment_status,
"location": location,
"helpful_count": helpful_count,
"sub_ratings": sub_ratings,
}
except Exception as e:
print(f"Error extracting review: {e}")
return None
def scrape_company_reviews(
company_slug: str,
max_pages: int = 10,
proxy: dict = None,
) -> list[dict]:
"""
Scrape all reviews for a company from Indeed.
Args:
company_slug: The company identifier, e.g. "google" or "palantir-technologies"
max_pages: Maximum pages to scrape (20 reviews each)
proxy: Optional dict with keys server, username, password
Returns:
List of review dicts
"""
reviews = []
with sync_playwright() as p:
browser = p.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--disable-features=IsolateOrigins,site-per-process",
"--no-sandbox",
],
)
context = stealth_context(browser, proxy)
page = context.new_page()
# Block images and fonts to speed things up
page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2,ttf}", lambda r: r.abort())
for page_num in range(max_pages):
start = page_num * 20
url = f"https://www.indeed.com/cmp/{company_slug}/reviews?start={start}"
try:
page.goto(url, wait_until="domcontentloaded", timeout=30000)
except Exception as e:
print(f"Page load failed on page {page_num + 1}: {e}")
break
# Handle cookie consent if it appears
consent_btn = page.query_selector('[id="onetrust-accept-btn-handler"]')
if consent_btn:
consent_btn.click()
page.wait_for_timeout(500)
# Expand Show More buttons before extracting
show_more_buttons = page.query_selector_all('[data-testid="review-show-more"], .cmp-ShowMore')
for btn in show_more_buttons:
try:
btn.click()
page.wait_for_timeout(200)
except Exception:
pass
# Find review cards - try multiple selectors
review_elements = (
page.query_selector_all('[data-testid="review-card"]') or
page.query_selector_all('[itemtype="http://schema.org/Review"]') or
page.query_selector_all(".cmp-Review")
)
if not review_elements:
print(f"No reviews found on page {page_num + 1}, stopping.")
break
page_reviews = []
for el in review_elements:
review = extract_review(el)
if review:
review["company_slug"] = company_slug
review["page"] = page_num + 1
page_reviews.append(review)
reviews.extend(page_reviews)
print(f"Page {page_num + 1}: scraped {len(page_reviews)} reviews (total: {len(reviews)})")
# Check for next page
next_btn = page.query_selector('[data-testid="pagination-next"], a[aria-label="Next"]')
if not next_btn:
print("No next page button found, reached end.")
break
# Random delay between pages: 3-8 seconds
delay = random.uniform(3.0, 8.0)
time.sleep(delay)
browser.close()
return reviews
HTTP Approach: JSON-LD First, DOM Fallback
Before spinning up a full browser, it is worth trying a lightweight HTTP request. Indeed embeds Organization schema (JSON-LD) in its company pages, which sometimes includes aggregate ratings. This will not get you individual reviews, but it is fast for company-level data and avoids browser overhead entirely.
import httpx
import json
import re
from bs4 import BeautifulSoup
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
}
def get_company_overview_http(company_slug: str, proxy: str | None = None) -> dict:
    """
    Attempt to get company overview via HTTP request, using JSON-LD first.
    Falls back to BeautifulSoup DOM parsing.
    proxy format: "http://user:pass@host:port"
    (recent httpx versions removed the proxies dict argument; pass a single
    proxy URL, or use mounts= if you need per-scheme routing)
    """
    url = f"https://www.indeed.com/cmp/{company_slug}"
    with httpx.Client(
        headers=HEADERS,
        follow_redirects=True,
        timeout=20.0,
        proxy=proxy,
    ) as client:
# Hit the homepage first to get session cookies
try:
client.get("https://www.indeed.com", timeout=10.0)
except Exception:
pass
response = client.get(url)
if response.status_code != 200:
print(f"HTTP {response.status_code} for {url}")
return {}
html = response.text
soup = BeautifulSoup(html, "html.parser")
# Strategy 1: JSON-LD structured data (Organization schema)
for script in soup.find_all("script", type="application/ld+json"):
try:
data = json.loads(script.string or "")
if isinstance(data, list):
data = next((d for d in data if d.get("@type") == "Organization"), None)
if data and data.get("@type") == "Organization":
agg = data.get("aggregateRating", {})
return {
"name": data.get("name", ""),
"description": data.get("description", ""),
"overall_rating": agg.get("ratingValue"),
"review_count": agg.get("reviewCount"),
"best_rating": agg.get("bestRating"),
"source": "json-ld",
}
except (json.JSONDecodeError, AttributeError):
continue
# Strategy 2: Next.js __NEXT_DATA__ payload
next_data_tag = soup.find("script", id="__NEXT_DATA__")
if next_data_tag:
try:
next_data = json.loads(next_data_tag.string or "")
props = next_data.get("props", {}).get("pageProps", {})
company = props.get("company", props.get("companyData", {}))
if company:
return {
"name": company.get("name", ""),
"overall_rating": company.get("overallRating") or company.get("rating"),
"review_count": company.get("reviewCount") or company.get("numReviews"),
"ceo_approval": company.get("ceoApproval") or company.get("ceoApprovalRate"),
"source": "next-data",
}
except (json.JSONDecodeError, KeyError):
pass
# Strategy 3: DOM fallback
result = {"source": "dom"}
selectors = {
"overall_rating": ['[data-testid="comp-overall-rating"]', ".cmp-OverallRating", '[itemprop="ratingValue"]'],
"review_count": ['[data-testid="review-count"]', ".cmp-ReviewCount"],
"ceo_approval": ['[data-testid="ceo-approval-rate"]', ".cmp-CeoApproval"],
}
for field, sels in selectors.items():
for sel in sels:
el = soup.select_one(sel)
if el:
result[field] = el.get_text(strip=True)
break
return result
Rate Limiting and Anti-Detection
Indeed runs a fairly sophisticated anti-bot stack. Here is what actually matters and what you can do about it.
The webdriver flag - The first thing most fingerprint checks look at. The add_init_script in the scraper above patches navigator.webdriver to return undefined instead of true. Without this patch, expect to be blocked on the first request.
Headless detection via the Chrome object - In real Chrome, window.chrome.runtime exists. In headless Chrome it does not. The init script fakes it.
Canvas and WebGL fingerprinting - Harder to bypass without dedicated tools. If you are hitting persistent blocks, look at playwright-stealth or undetected-playwright. For most use cases the basic patches are enough.
Random delays are not just about being polite - they make traffic look organic. The 3-8 second range is intentional. Even intervals like exactly 3 seconds every time are a signal.
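A minimal sketch of that jitter, with the sleep function injectable so the logic can be exercised without actually waiting (the helper name is illustrative):

```python
import random
import time

def human_delay(low: float = 3.0, high: float = 8.0, sleep=time.sleep) -> float:
    """Sleep for a uniformly random interval in [low, high] seconds
    and return the delay that was used."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay
```

Calling it between page loads with the defaults reproduces the 3-8 second range used in the scraper above.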
Session and cookie persistence - Load the Indeed homepage before jumping to a company page. This establishes session cookies that make subsequent requests look more legitimate.
import json
from pathlib import Path
COOKIES_FILE = Path("indeed_cookies.json")
def save_cookies(context):
cookies = context.cookies()
COOKIES_FILE.write_text(json.dumps(cookies))
def load_cookies(context):
if COOKIES_FILE.exists():
cookies = json.loads(COOKIES_FILE.read_text())
context.add_cookies(cookies)
# Usage
context = stealth_context(browser)
load_cookies(context)
page = context.new_page()
page.goto("https://www.indeed.com") # Warm up session
# ... scrape ...
save_cookies(context)
User agent rotation - Rotate across the list defined at the top of the scraper. The key is consistency within a session: pick one and stick with it for the whole browser context.
Residential proxies at scale - If you need more than a few hundred reviews, you will hit IP-level rate limits regardless of delays. Indeed uses IP reputation scoring and datacenter ranges get flagged automatically. Residential proxies rotate you through real ISP IPs.
I have been using ThorData for this kind of scraping. Their residential proxy pool works well for Indeed specifically - per-request IP rotation keeps you under the radar even at higher volumes. Datacenter IPs simply do not survive long enough to be useful here.
proxy_config = {
"server": "http://proxy.thordata.com:9000",
"username": "YOUR_USERNAME",
"password": "YOUR_PASSWORD",
}
reviews = scrape_company_reviews("google", max_pages=20, proxy=proxy_config)
SQLite Storage
CSV works for quick one-off analysis but falls apart when you are collecting across dozens of companies over time. SQLite handles it cleanly and gives you proper querying.
import sqlite3
from datetime import datetime
DB_FILE = "indeed_reviews.db"
CREATE_SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
id INTEGER PRIMARY KEY AUTOINCREMENT,
slug TEXT UNIQUE NOT NULL,
name TEXT,
overall_rating REAL,
review_count INTEGER,
ceo_approval REAL,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS reviews (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    company_slug TEXT NOT NULL,
    title TEXT,
    rating REAL,
    pros TEXT,
    cons TEXT,
    text TEXT,
    date TEXT,
    job_title TEXT,
    employment_status TEXT,
    location TEXT,
    helpful_count INTEGER DEFAULT 0,
    page INTEGER,
    scraped_at TEXT,
    UNIQUE (company_slug, title, date, job_title),
    FOREIGN KEY (company_slug) REFERENCES companies(slug)
);
CREATE TABLE IF NOT EXISTS sub_ratings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
review_id INTEGER NOT NULL,
category TEXT NOT NULL,
rating REAL,
FOREIGN KEY (review_id) REFERENCES reviews(id)
);
CREATE INDEX IF NOT EXISTS idx_reviews_company ON reviews(company_slug);
CREATE INDEX IF NOT EXISTS idx_reviews_date ON reviews(date);
"""
def init_db(db_file: str = DB_FILE) -> sqlite3.Connection:
    conn = sqlite3.connect(db_file)
    conn.row_factory = sqlite3.Row
    conn.executescript(CREATE_SCHEMA)
    conn.commit()
    return conn
def store_reviews(reviews: list[dict], conn: sqlite3.Connection):
now = datetime.utcnow().isoformat()
for review in reviews:
cur = conn.execute(
"""INSERT OR IGNORE INTO reviews
(company_slug, title, rating, pros, cons, text, date,
job_title, employment_status, location, helpful_count, page, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
review.get("company_slug", ""),
review.get("title", ""),
review.get("rating"),
review.get("pros", ""),
review.get("cons", ""),
review.get("text", ""),
review.get("date", ""),
review.get("job_title", ""),
review.get("employment_status", ""),
review.get("location", ""),
review.get("helpful_count", 0),
review.get("page"),
now,
),
)
        if cur.rowcount == 0:
            continue  # duplicate row was ignored; skip its sub-ratings
        review_id = cur.lastrowid
for category, rating in review.get("sub_ratings", {}).items():
conn.execute(
"INSERT INTO sub_ratings (review_id, category, rating) VALUES (?,?,?)",
(review_id, category, rating),
)
conn.commit()
def export_to_csv(company_slug: str, conn: sqlite3.Connection, output_file: str = None):
import csv
output_file = output_file or f"{company_slug}_reviews.csv"
rows = conn.execute(
"SELECT * FROM reviews WHERE company_slug = ? ORDER BY date DESC",
(company_slug,),
).fetchall()
if not rows:
print(f"No reviews found for {company_slug}")
return
with open(output_file, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows([dict(r) for r in rows])
print(f"Exported {len(rows)} reviews to {output_file}")
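Once reviews and sub-ratings are stored, per-category averages are a single join away. A sketch against the schema above (the function name is my own):

```python
import sqlite3

def category_averages(conn: sqlite3.Connection, company_slug: str) -> dict[str, float]:
    """Average each sub-rating category for one company, joining
    sub_ratings back to the reviews table."""
    rows = conn.execute(
        """
        SELECT s.category, AVG(s.rating) AS avg_rating
        FROM sub_ratings s
        JOIN reviews r ON r.id = s.review_id
        WHERE r.company_slug = ?
        GROUP BY s.category
        ORDER BY avg_rating DESC
        """,
        (company_slug,),
    ).fetchall()
    return {category: round(avg, 2) for category, avg in rows}
```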
Batch Company Comparison
This is where the data gets interesting. Scrape multiple companies and compare them side by side in a pandas DataFrame.
import pandas as pd
import time
import random
COMPANIES = [
"google",
"amazon",
"microsoft",
"apple",
"meta",
"stripe",
"palantir-technologies",
"openai",
]
def batch_compare(
company_slugs: list[str],
reviews_per_company: int = 100,
proxy: dict = None,
) -> pd.DataFrame:
"""
Scrape multiple companies and return a comparison DataFrame.
"""
conn = init_db()
results = []
for slug in company_slugs:
print(f"\nScraping {slug}...")
        pages_needed = max(1, -(-reviews_per_company // 20))  # ceiling of reviews / 20
try:
reviews = scrape_company_reviews(slug, max_pages=pages_needed, proxy=proxy)
store_reviews(reviews, conn)
except Exception as e:
print(f"Failed to scrape {slug}: {e}")
reviews = []
if reviews:
df_co = pd.DataFrame(reviews)
row = {
"company": slug,
"review_count": len(reviews),
"avg_rating": round(df_co["rating"].dropna().mean(), 2),
"pct_current": round(
(df_co["employment_status"] == "Current Employee").mean() * 100, 1
),
}
sub_df = pd.json_normalize(df_co["sub_ratings"].dropna().tolist())
for col in ["work_life_balance", "compensation", "management", "job_security", "culture"]:
if col in sub_df.columns:
row[col] = round(sub_df[col].dropna().mean(), 2)
else:
row[col] = None
results.append(row)
if slug != company_slugs[-1]:
time.sleep(random.uniform(10.0, 20.0))
conn.close()
df = pd.DataFrame(results)
df = df.sort_values("avg_rating", ascending=False).reset_index(drop=True)
return df
# Run it
df = batch_compare(["google", "amazon", "microsoft", "stripe"], reviews_per_company=60)
print(df.to_string(index=False))
df.to_csv("company_comparison.csv", index=False)
Sample output:
company review_count avg_rating pct_current work_life_balance compensation management job_security culture
stripe 60 4.2 61.7 4.3 4.1 3.9 4.0 4.3
google 60 4.1 58.3 4.4 4.3 3.7 4.2 4.2
microsoft 60 3.9 54.2 4.1 4.0 3.6 4.1 3.9
amazon 60 3.4 42.1 3.2 3.5 2.9 3.1 3.2
Sentiment Analysis on Review Text
VADER (from NLTK) is better than TextBlob for short, opinionated text like reviews. It handles informal language, ALL CAPS emphasis, and punctuation like exclamation marks as sentiment intensifiers.
import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()
def score_review(review: dict) -> dict:
"""Add VADER sentiment scores to a review dict."""
pros_text = review.get("pros", "") or ""
cons_text = review.get("cons", "") or ""
full_text = review.get("text", "") or pros_text + " " + cons_text
if not full_text.strip():
return {**review, "sentiment_compound": None, "sentiment_label": None}
# VADER compound: -1 (most negative) to +1 (most positive)
scores = sia.polarity_scores(full_text)
compound = scores["compound"]
# TextBlob subjectivity: 0 (objective) to 1 (subjective)
blob = TextBlob(full_text)
label = "positive" if compound >= 0.05 else "negative" if compound <= -0.05 else "neutral"
return {
**review,
"sentiment_compound": round(compound, 3),
"sentiment_subjectivity": round(blob.sentiment.subjectivity, 3),
"sentiment_label": label,
}
def analyze_company_sentiment(reviews: list[dict]) -> dict:
"""Aggregate sentiment analysis for a company."""
scored = [score_review(r) for r in reviews]
df = pd.DataFrame(scored)
cons_scores = [
sia.polarity_scores(r)["compound"]
for r in df["cons"].dropna()
if len(r) > 10
]
return {
"total_reviews": len(df),
"avg_compound": round(df["sentiment_compound"].dropna().mean(), 3),
"pct_positive": round((df["sentiment_label"] == "positive").mean() * 100, 1),
"pct_negative": round((df["sentiment_label"] == "negative").mean() * 100, 1),
"pct_neutral": round((df["sentiment_label"] == "neutral").mean() * 100, 1),
"cons_avg_sentiment": round(sum(cons_scores) / len(cons_scores), 3) if cons_scores else None,
}
Use Cases
Employer branding comparison - Scrape your company and your top 3 competitors. Compare sub-ratings across categories to find where you score above or below average. If your compensation score is 3.2 and competitors average 3.9, that is a concrete data point for HR leadership.
Salary and compensation sentiment - Filter reviews that mention compensation in pros/cons, run VADER on just those sentences. You will see whether employees feel positively or negatively about pay even when the numeric rating looks neutral.
Culture score tracking over time - Scrape monthly and store with timestamps. Plot culture and management scores over time. Acquisitions, layoffs, and leadership changes show up in review data within 2-3 months.
Red flag detection for job seekers - Build a classifier that looks for keywords like "no work-life balance", "micromanagement", "toxic", "high turnover" in cons text. Weight matches by helpful_count - a review with 47 helpful votes matters more than one with 0.
HR and recruiting competitive intelligence - Before entering a new market or competing for talent in a specific city, pull reviews filtered by location. See what employees in that city care about and what makes them leave.
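The red-flag idea above can be sketched as keyword matching over cons text, weighted by helpful votes. The keyword list and the weighting scheme here are illustrative, not tuned:

```python
RED_FLAGS = ["micromanagement", "toxic", "high turnover", "no work-life balance"]

def red_flag_score(reviews: list[dict]) -> float:
    """Weighted red-flag rate: each matching review counts 1 plus its
    helpful_count, normalized by total weight. Returns a 0.0-1.0 score."""
    total_weight = 0.0
    flagged_weight = 0.0
    for r in reviews:
        weight = 1 + r.get("helpful_count", 0)
        total_weight += weight
        cons = (r.get("cons") or "").lower()
        if any(flag in cons for flag in RED_FLAGS):
            flagged_weight += weight
    return flagged_weight / total_weight if total_weight else 0.0
```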
Practical Tips
Use data-testid attributes - Indeed rotates CSS class names regularly (CSS modules with hashed names). Class selectors break constantly. data-testid attributes are tied to component functionality and change far less often. Always prefer [data-testid="..."] over .cmp-Something.
Pagination edge cases - The last page sometimes still renders a Next button but returns zero reviews. Always check both conditions: no next button AND empty review list. The scraper handles this with the if not review_elements: break check.
Show More buttons - Indeed truncates long reviews. Click them before extracting text or you get partial pros/cons. The scraper clicks all of them with a short delay between each click.
DOM structure changes - When selectors break, open DevTools on a real Chrome browser (not headless) and inspect the review card. Look for data-testid and itemprop attributes. Update both the primary and fallback selectors in extract_review().
Handle CAPTCHAs gracefully - If the page returns a challenge, the review selector returns nothing and you break out of the loop. Log the failure, wait 10-15 minutes, and retry from that page number.
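That retry loop might look like the following - scrape_fn stands in for any of the scraper functions above, the wait follows the 10-15 minute suggestion, and doubling per attempt is my own choice:

```python
import time

def scrape_with_retry(scrape_fn, company_slug: str, max_attempts: int = 3,
                      base_wait: float = 600.0, sleep=time.sleep) -> list:
    """Call scrape_fn(company_slug); on an empty result (likely a
    CAPTCHA or block), wait and retry, doubling the wait each time."""
    for attempt in range(max_attempts):
        reviews = scrape_fn(company_slug)
        if reviews:
            return reviews
        wait = base_wait * (2 ** attempt)
        print(f"Attempt {attempt + 1} returned nothing; waiting {wait:.0f}s")
        sleep(wait)
    return []
```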
Company slug discovery - If you do not know the slug, search Indeed for the company and look at the URL on their company page. It is not always intuitive: JPMorgan Chase is jpmorgan-chase, 3M is 3m-company.
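A normalizer gives you a first guess, but as the examples above show, the guess must still be verified by loading the /cmp/ page - Indeed slugs are not always name-derived:

```python
import re

def guess_slug(company_name: str) -> str:
    """Lowercase, strip punctuation, hyphenate. This is only a first
    guess - e.g. 3M actually lives at 3m-company - so always confirm
    the result against the live /cmp/{slug} URL."""
    slug = company_name.lower().strip()
    slug = re.sub(r"[^a-z0-9\s-]", "", slug)  # drop punctuation
    slug = re.sub(r"\s+", "-", slug)          # spaces -> hyphens
    return slug
```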
Putting It Together
if __name__ == "__main__":
conn = init_db()
# Single company, 5 pages (up to 100 reviews)
reviews = scrape_company_reviews("stripe", max_pages=5)
store_reviews(reviews, conn)
export_to_csv("stripe", conn)
# Sentiment analysis
sentiment = analyze_company_sentiment(reviews)
print(f"Stripe sentiment: {sentiment}")
# Batch comparison
df = batch_compare(["stripe", "google", "amazon", "microsoft"], reviews_per_company=60)
df.to_csv("comparison_2026.csv", index=False)
print(df.to_string(index=False))
conn.close()
Install everything:
pip install playwright httpx beautifulsoup4 pandas textblob nltk
playwright install chromium
python -c "import nltk; nltk.download('vader_lexicon')"
The review data is genuinely useful - compensation benchmarks, culture comparisons, management quality signals. Indeed has been collecting this for over a decade and for many mid-size companies it is the richest public signal on employee sentiment you can get. For production use at any real volume, residential proxies via ThorData are not optional. Datacenter IPs do not survive long enough to matter.