
Scraping Indeed Company Reviews with Python (2026)

Indeed has one of the largest databases of company reviews online. CEO approval ratings, work-life balance scores, compensation satisfaction, culture ratings - all left by actual employees. If you are building an employer comparison tool, doing HR market research, or simply looking for structured data about company reputation, this is a solid source.

The catch? Indeed does not have a public API for reviews. You will need to scrape it, and they are pretty aggressive about blocking scrapers. This guide covers the full stack: Playwright-based extraction with real selectors, an httpx fallback for JSON-LD data, anti-detection configuration, SQLite storage, batch comparison across companies, and sentiment analysis on the review text.

Page Structure

Indeed company review pages follow this pattern:

https://www.indeed.com/cmp/{company-slug}/reviews

Each page shows around 20 reviews. Pagination appends ?start=20, ?start=40, and so on. The review data is partially in the HTML and partially loaded via JSON embedded in <script> tags. The company slug is usually the lowercase company name with hyphens - google, amazon, stripe, palantir-technologies.
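That pagination scheme is easy to capture in a small helper. review_page_url below is a hypothetical convenience function, not part of the scraper that follows:

```python
def review_page_url(company_slug: str, page_num: int, per_page: int = 20) -> str:
    """Build the review-page URL for a 0-indexed page number."""
    base = f"https://www.indeed.com/cmp/{company_slug}/reviews"
    start = page_num * per_page
    # Page 0 is the bare /reviews URL; later pages append ?start=20, ?start=40, ...
    return base if start == 0 else f"{base}?start={start}"

print(review_page_url("google", 0))
# https://www.indeed.com/cmp/google/reviews
print(review_page_url("palantir-technologies", 2))
# https://www.indeed.com/cmp/palantir-technologies/reviews?start=40
```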

What You Can Extract

Here is the full data structure for a single review. This is what the scraper below produces:

{
  "title": "Great work-life balance but limited growth",
  "rating": 4.0,
  "pros": "Flexible hours, good benefits package, nice coworkers",
  "cons": "Slow promotion track, outdated technology stack",
  "date": "March 15, 2026",
  "job_title": "Software Engineer",
  "employment_status": "Current Employee",
  "location": "Austin, TX",
  "helpful_count": 12,
  "sub_ratings": {
    "work_life_balance": 5.0,
    "compensation": 3.0,
    "management": 3.0,
    "job_security": 4.0,
    "culture": 4.0
  }
}

The sub-ratings and employment status are only available once you expand individual review cards. The Playwright script handles that expansion automatically.

Complete Playwright Scraper

This is the full scraper. It extracts every field from the structure above, handles pagination, expands "Show More" buttons, and includes stealth configuration to avoid the most common fingerprint checks.

import json
import re
import time
import random
from playwright.sync_api import sync_playwright

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

def stealth_context(browser, proxy=None):
    """Create a browser context with anti-detection settings."""
    kwargs = {
        "user_agent": random.choice(USER_AGENTS),
        "viewport": {"width": random.choice([1366, 1440, 1920]), "height": random.choice([768, 900, 1080])},
        "locale": "en-US",
        "timezone_id": "America/New_York",
        "accept_downloads": False,
    }
    if proxy:
        kwargs["proxy"] = proxy

    context = browser.new_context(**kwargs)

    # Disable the webdriver flag - this is what most fingerprint checks look for first
    context.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
        Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
        Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
        window.chrome = { runtime: {} };
    """)
    return context


def extract_review(element) -> dict | None:
    """Extract all fields from a single review card element."""
    try:
        def text(selector):
            el = element.query_selector(selector)
            return el.inner_text().strip() if el else ""

        def attr(selector, attribute):
            el = element.query_selector(selector)
            return el.get_attribute(attribute) if el else ""

        # Title
        title = (
            text('[data-testid="review-title"]') or
            text('[itemprop="name"]') or
            text(".cmp-Review-title")
        )

        # Overall star rating
        rating_raw = (
            attr('[itemprop="ratingValue"]', "content") or
            attr('[data-testid="rating-star"]', "aria-label") or
            text(".cmp-ReviewStars")
        )
        rating = None
        if rating_raw:
            m = re.search(r"[\d.]+", rating_raw)
            if m:
                rating = float(m.group())

        # Pros and cons - Indeed splits these into two separate fields
        pros = (
            text('[data-testid="review-text-pros"]') or
            text(".cmp-Review-pros")
        )
        cons = (
            text('[data-testid="review-text-cons"]') or
            text(".cmp-Review-cons")
        )

        # Full review text for older style reviews
        full_text = text('[itemprop="reviewBody"]') or text('[data-testid="review-text"]')

        # Date - try multiple formats Indeed uses
        date_el = element.query_selector("time, [data-testid='review-date'], .cmp-ReviewDate")
        date = ""
        if date_el:
            date = date_el.get_attribute("datetime") or date_el.inner_text().strip()

        # Job title
        job_title = (
            text('[data-testid="review-job-title"]') or
            text('[itemprop="jobTitle"]') or
            text(".cmp-ReviewJobTitle")
        )

        # Employment status
        status_el = element.query_selector('[data-testid="review-author"], .cmp-ReviewEmploymentStatus')
        employment_status = ""
        if status_el:
            status_text = status_el.inner_text()
            if "Current" in status_text:
                employment_status = "Current Employee"
            elif "Former" in status_text:
                employment_status = "Former Employee"

        # Location
        location = (
            text('[data-testid="review-location"]') or
            text(".cmp-Review-location") or
            text('[itemprop="jobLocation"]')
        )

        # Helpful count
        helpful_raw = text('[data-testid="review-helpful-count"]') or text(".cmp-ReviewHelpful")
        helpful_count = 0
        if helpful_raw:
            m = re.search(r"\d+", helpful_raw)
            if m:
                helpful_count = int(m.group())

        # Sub-ratings
        sub_ratings = {}

        # Try data-testid approach first
        for key in ["work-life-balance", "compensation", "management", "job-security", "culture"]:
            el = element.query_selector(f'[data-testid="rating-{key}"]')
            if el:
                val_raw = el.get_attribute("aria-label") or el.inner_text()
                m = re.search(r"[\d.]+", val_raw)
                if m:
                    clean_key = key.replace("-", "_")
                    sub_ratings[clean_key] = float(m.group())

        # Fallback: scan all rating cells in the review
        if not sub_ratings:
            cells = element.query_selector_all(".cmp-ReviewRatings tr, .cmp-SubRating")
            for cell in cells:
                cell_text = cell.inner_text().strip()
                m = re.match(r"(.+?)\s+([\d.]+)\s*(?:out of \d+)?$", cell_text, re.MULTILINE)
                if m:
                    label = m.group(1).strip().lower().replace("/", "_").replace(" ", "_")
                    sub_ratings[label] = float(m.group(2))

        return {
            "title": title,
            "rating": rating,
            "pros": pros,
            "cons": cons,
            "text": full_text,
            "date": date,
            "job_title": job_title,
            "employment_status": employment_status,
            "location": location,
            "helpful_count": helpful_count,
            "sub_ratings": sub_ratings,
        }
    except Exception as e:
        print(f"Error extracting review: {e}")
        return None


def scrape_company_reviews(
    company_slug: str,
    max_pages: int = 10,
    proxy: dict = None,
) -> list[dict]:
    """
    Scrape all reviews for a company from Indeed.

    Args:
        company_slug: The company identifier, e.g. "google" or "palantir-technologies"
        max_pages: Maximum pages to scrape (20 reviews each)
        proxy: Optional dict with keys server, username, password

    Returns:
        List of review dicts
    """
    reviews = []

    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-features=IsolateOrigins,site-per-process",
                "--no-sandbox",
            ],
        )
        context = stealth_context(browser, proxy)
        page = context.new_page()

        # Block images and fonts to speed things up
        page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2,ttf}", lambda r: r.abort())

        for page_num in range(max_pages):
            start = page_num * 20
            url = f"https://www.indeed.com/cmp/{company_slug}/reviews?start={start}"

            try:
                page.goto(url, wait_until="domcontentloaded", timeout=30000)
            except Exception as e:
                print(f"Page load failed on page {page_num + 1}: {e}")
                break

            # Handle cookie consent if it appears
            consent_btn = page.query_selector('[id="onetrust-accept-btn-handler"]')
            if consent_btn:
                consent_btn.click()
                page.wait_for_timeout(500)

            # Expand Show More buttons before extracting
            show_more_buttons = page.query_selector_all('[data-testid="review-show-more"], .cmp-ShowMore')
            for btn in show_more_buttons:
                try:
                    btn.click()
                    page.wait_for_timeout(200)
                except Exception:
                    pass

            # Find review cards - try multiple selectors
            review_elements = (
                page.query_selector_all('[data-testid="review-card"]') or
                page.query_selector_all('[itemtype="http://schema.org/Review"]') or
                page.query_selector_all(".cmp-Review")
            )

            if not review_elements:
                print(f"No reviews found on page {page_num + 1}, stopping.")
                break

            page_reviews = []
            for el in review_elements:
                review = extract_review(el)
                if review:
                    review["company_slug"] = company_slug
                    review["page"] = page_num + 1
                    page_reviews.append(review)

            reviews.extend(page_reviews)
            print(f"Page {page_num + 1}: scraped {len(page_reviews)} reviews (total: {len(reviews)})")

            # Check for next page
            next_btn = page.query_selector('[data-testid="pagination-next"], a[aria-label="Next"]')
            if not next_btn:
                print("No next page button found, reached end.")
                break

            # Random delay between pages: 3-8 seconds
            delay = random.uniform(3.0, 8.0)
            time.sleep(delay)

        browser.close()

    return reviews

HTTP Approach: JSON-LD First, DOM Fallback

Before spinning up a full browser, it is worth trying a lightweight HTTP request. Indeed embeds Organization schema (JSON-LD) in its company pages, which sometimes includes aggregate ratings. This will not get you individual reviews, but it is fast for company-level data and avoids browser overhead entirely.

import httpx
import json
import re
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}


def get_company_overview_http(company_slug: str, proxy: str = None) -> dict:
    """
    Attempt to get company overview via HTTP request, using JSON-LD first.
    Falls back to BeautifulSoup DOM parsing.

    proxy format: "http://user:pass@host:port" (httpx 0.26+ accepts a single
    proxy URL via proxy=; the older proxies= dict was removed in httpx 0.28)
    """
    url = f"https://www.indeed.com/cmp/{company_slug}"

    with httpx.Client(
        headers=HEADERS,
        follow_redirects=True,
        timeout=20.0,
        proxy=proxy,
    ) as client:
        # Hit the homepage first to get session cookies
        try:
            client.get("https://www.indeed.com", timeout=10.0)
        except Exception:
            pass

        response = client.get(url)

        if response.status_code != 200:
            print(f"HTTP {response.status_code} for {url}")
            return {}

        html = response.text
        soup = BeautifulSoup(html, "html.parser")

        # Strategy 1: JSON-LD structured data (Organization schema)
        for script in soup.find_all("script", type="application/ld+json"):
            try:
                data = json.loads(script.string or "")
                if isinstance(data, list):
                    data = next((d for d in data if d.get("@type") == "Organization"), None)
                if data and data.get("@type") == "Organization":
                    agg = data.get("aggregateRating", {})
                    return {
                        "name": data.get("name", ""),
                        "description": data.get("description", ""),
                        "overall_rating": agg.get("ratingValue"),
                        "review_count": agg.get("reviewCount"),
                        "best_rating": agg.get("bestRating"),
                        "source": "json-ld",
                    }
            except (json.JSONDecodeError, AttributeError):
                continue

        # Strategy 2: Next.js __NEXT_DATA__ payload
        next_data_tag = soup.find("script", id="__NEXT_DATA__")
        if next_data_tag:
            try:
                next_data = json.loads(next_data_tag.string or "")
                props = next_data.get("props", {}).get("pageProps", {})
                company = props.get("company", props.get("companyData", {}))
                if company:
                    return {
                        "name": company.get("name", ""),
                        "overall_rating": company.get("overallRating") or company.get("rating"),
                        "review_count": company.get("reviewCount") or company.get("numReviews"),
                        "ceo_approval": company.get("ceoApproval") or company.get("ceoApprovalRate"),
                        "source": "next-data",
                    }
            except (json.JSONDecodeError, KeyError):
                pass

        # Strategy 3: DOM fallback
        result = {"source": "dom"}
        selectors = {
            "overall_rating": ['[data-testid="comp-overall-rating"]', ".cmp-OverallRating", '[itemprop="ratingValue"]'],
            "review_count": ['[data-testid="review-count"]', ".cmp-ReviewCount"],
            "ceo_approval": ['[data-testid="ceo-approval-rate"]', ".cmp-CeoApproval"],
        }
        for field, sels in selectors.items():
            for sel in sels:
                el = soup.select_one(sel)
                if el:
                    result[field] = el.get_text(strip=True)
                    break

        return result

Rate Limiting and Anti-Detection

Indeed runs a fairly sophisticated anti-bot stack. Here is what actually matters and what you can do about it.

The webdriver flag is the first thing checked. The add_init_script in the scraper above patches navigator.webdriver to return undefined instead of true. Without this, you get blocked on the first request.

Headless detection via the Chrome object - In real Chrome, window.chrome.runtime exists. In headless Chrome it does not. The init script fakes it.

Canvas and WebGL fingerprinting - Harder to bypass without dedicated tools. If you are hitting persistent blocks, look at playwright-stealth or undetected-playwright. For most use cases the basic patches are enough.

Random delays are not just about being polite - they make traffic look organic. The 3-8 second range is intentional. Even intervals like exactly 3 seconds every time are a signal.

Session and cookie persistence - Load the Indeed homepage before jumping to a company page. This establishes session cookies that make subsequent requests look more legitimate.

import json
from pathlib import Path

COOKIES_FILE = Path("indeed_cookies.json")

def save_cookies(context):
    cookies = context.cookies()
    COOKIES_FILE.write_text(json.dumps(cookies))

def load_cookies(context):
    if COOKIES_FILE.exists():
        cookies = json.loads(COOKIES_FILE.read_text())
        context.add_cookies(cookies)

# Usage
context = stealth_context(browser)
load_cookies(context)
page = context.new_page()
page.goto("https://www.indeed.com")  # Warm up session
# ... scrape ...
save_cookies(context)

User agent rotation - Rotate across the list defined at the top of the scraper. The key is consistency within a session: pick one and stick with it for the whole browser context.

Residential proxies at scale - If you need more than a few hundred reviews, you will hit IP-level rate limits regardless of delays. Indeed uses IP reputation scoring and datacenter ranges get flagged automatically. Residential proxies rotate you through real ISP IPs.

I have been using ThorData for this kind of scraping. Their residential proxy pool works well for Indeed specifically - per-request IP rotation keeps you under the radar even at higher volumes. Datacenter IPs simply do not survive long enough to be useful here.

proxy_config = {
    "server": "http://proxy.thordata.com:9000",
    "username": "YOUR_USERNAME",
    "password": "YOUR_PASSWORD",
}

reviews = scrape_company_reviews("google", max_pages=20, proxy=proxy_config)

SQLite Storage

CSV works for quick one-off analysis but falls apart when you are collecting across dozens of companies over time. SQLite handles it cleanly and gives you proper querying.

import sqlite3
from datetime import datetime

DB_FILE = "indeed_reviews.db"

CREATE_SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    slug TEXT UNIQUE NOT NULL,
    name TEXT,
    overall_rating REAL,
    review_count INTEGER,
    ceo_approval REAL,
    scraped_at TEXT
);

CREATE TABLE IF NOT EXISTS reviews (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    company_slug TEXT NOT NULL,
    title TEXT,
    rating REAL,
    pros TEXT,
    cons TEXT,
    text TEXT,
    date TEXT,
    job_title TEXT,
    employment_status TEXT,
    location TEXT,
    helpful_count INTEGER DEFAULT 0,
    page INTEGER,
    scraped_at TEXT,
    FOREIGN KEY (company_slug) REFERENCES companies(slug)
);

CREATE TABLE IF NOT EXISTS sub_ratings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    review_id INTEGER NOT NULL,
    category TEXT NOT NULL,
    rating REAL,
    FOREIGN KEY (review_id) REFERENCES reviews(id)
);

CREATE INDEX IF NOT EXISTS idx_reviews_company ON reviews(company_slug);
CREATE INDEX IF NOT EXISTS idx_reviews_date ON reviews(date);
"""


def init_db(db_file: str = DB_FILE) -> sqlite3.Connection:
    conn = sqlite3.connect(db_file)
    conn.row_factory = sqlite3.Row
    # executescript runs the whole multi-statement schema in one call
    conn.executescript(CREATE_SCHEMA)
    conn.commit()
    return conn


def store_reviews(reviews: list[dict], conn: sqlite3.Connection):
    now = datetime.utcnow().isoformat()
    for review in reviews:
        cur = conn.execute(
            """INSERT INTO reviews
               (company_slug, title, rating, pros, cons, text, date,
                job_title, employment_status, location, helpful_count, page, scraped_at)
               VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)""",
            (
                review.get("company_slug", ""),
                review.get("title", ""),
                review.get("rating"),
                review.get("pros", ""),
                review.get("cons", ""),
                review.get("text", ""),
                review.get("date", ""),
                review.get("job_title", ""),
                review.get("employment_status", ""),
                review.get("location", ""),
                review.get("helpful_count", 0),
                review.get("page"),
                now,
            ),
        )
        review_id = cur.lastrowid
        for category, rating in review.get("sub_ratings", {}).items():
            conn.execute(
                "INSERT INTO sub_ratings (review_id, category, rating) VALUES (?,?,?)",
                (review_id, category, rating),
            )
    conn.commit()


def export_to_csv(company_slug: str, conn: sqlite3.Connection, output_file: str = None):
    import csv
    output_file = output_file or f"{company_slug}_reviews.csv"
    rows = conn.execute(
        "SELECT * FROM reviews WHERE company_slug = ? ORDER BY date DESC",
        (company_slug,),
    ).fetchall()
    if not rows:
        print(f"No reviews found for {company_slug}")
        return
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows([dict(r) for r in rows])
    print(f"Exported {len(rows)} reviews to {output_file}")

Batch Company Comparison

This is where the data gets interesting. Scrape multiple companies and compare them side by side in a pandas DataFrame.

import pandas as pd
import time
import random

COMPANIES = [
    "google",
    "amazon",
    "microsoft",
    "apple",
    "meta",
    "stripe",
    "palantir-technologies",
    "openai",
]


def batch_compare(
    company_slugs: list[str],
    reviews_per_company: int = 100,
    proxy: dict = None,
) -> pd.DataFrame:
    """
    Scrape multiple companies and return a comparison DataFrame.
    """
    conn = init_db()
    results = []

    for slug in company_slugs:
        print(f"\nScraping {slug}...")
        pages_needed = max(1, -(-reviews_per_company // 20))  # ceil: 30 reviews -> 2 pages

        try:
            reviews = scrape_company_reviews(slug, max_pages=pages_needed, proxy=proxy)
            store_reviews(reviews, conn)
        except Exception as e:
            print(f"Failed to scrape {slug}: {e}")
            reviews = []

        if reviews:
            df_co = pd.DataFrame(reviews)
            row = {
                "company": slug,
                "review_count": len(reviews),
                "avg_rating": round(df_co["rating"].dropna().mean(), 2),
                "pct_current": round(
                    (df_co["employment_status"] == "Current Employee").mean() * 100, 1
                ),
            }
            sub_df = pd.json_normalize(df_co["sub_ratings"].dropna().tolist())
            for col in ["work_life_balance", "compensation", "management", "job_security", "culture"]:
                if col in sub_df.columns:
                    row[col] = round(sub_df[col].dropna().mean(), 2)
                else:
                    row[col] = None

            results.append(row)

        if slug != company_slugs[-1]:
            time.sleep(random.uniform(10.0, 20.0))

    conn.close()
    df = pd.DataFrame(results)
    df = df.sort_values("avg_rating", ascending=False).reset_index(drop=True)
    return df


# Run it
df = batch_compare(["google", "amazon", "microsoft", "stripe"], reviews_per_company=60)
print(df.to_string(index=False))
df.to_csv("company_comparison.csv", index=False)

Sample output:

   company  review_count  avg_rating  pct_current  work_life_balance  compensation  management  job_security  culture
    stripe            60        4.2         61.7                4.3           4.1         3.9           4.0      4.3
    google            60        4.1         58.3                4.4           4.3         3.7           4.2      4.2
 microsoft            60        3.9         54.2                4.1           4.0         3.6           4.1      3.9
    amazon            60        3.4         42.1                3.2           3.5         2.9           3.1      3.2

Sentiment Analysis on Review Text

VADER (from NLTK) is better than TextBlob for short, opinionated text like reviews. It handles informal language, ALL CAPS emphasis, and punctuation like exclamation marks as sentiment intensifiers.

import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()


def score_review(review: dict) -> dict:
    """Add VADER sentiment scores to a review dict."""
    pros_text = review.get("pros", "") or ""
    cons_text = review.get("cons", "") or ""
    full_text = review.get("text", "") or pros_text + " " + cons_text

    if not full_text.strip():
        return {**review, "sentiment_compound": None,
                "sentiment_subjectivity": None, "sentiment_label": None}

    # VADER compound: -1 (most negative) to +1 (most positive)
    scores = sia.polarity_scores(full_text)
    compound = scores["compound"]

    # TextBlob subjectivity: 0 (objective) to 1 (subjective)
    blob = TextBlob(full_text)

    label = "positive" if compound >= 0.05 else "negative" if compound <= -0.05 else "neutral"

    return {
        **review,
        "sentiment_compound": round(compound, 3),
        "sentiment_subjectivity": round(blob.sentiment.subjectivity, 3),
        "sentiment_label": label,
    }


def analyze_company_sentiment(reviews: list[dict]) -> dict:
    """Aggregate sentiment analysis for a company."""
    scored = [score_review(r) for r in reviews]
    df = pd.DataFrame(scored)

    cons_scores = [
        sia.polarity_scores(r)["compound"]
        for r in df["cons"].dropna()
        if len(r) > 10
    ]

    return {
        "total_reviews": len(df),
        "avg_compound": round(df["sentiment_compound"].dropna().mean(), 3),
        "pct_positive": round((df["sentiment_label"] == "positive").mean() * 100, 1),
        "pct_negative": round((df["sentiment_label"] == "negative").mean() * 100, 1),
        "pct_neutral": round((df["sentiment_label"] == "neutral").mean() * 100, 1),
        "cons_avg_sentiment": round(sum(cons_scores) / len(cons_scores), 3) if cons_scores else None,
    }

Use Cases

Employer branding comparison - Scrape your company and your top 3 competitors. Compare sub-ratings across categories to find where you score above or below average. If your compensation score is 3.2 and competitors average 3.9, that is a concrete data point for HR leadership.
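As a sketch of that gap analysis, assuming a DataFrame shaped like the batch_compare() output (the company names here are placeholders):

```python
import pandas as pd

# Hypothetical comparison frame - real data would come from batch_compare()
df = pd.DataFrame([
    {"company": "yourco",  "compensation": 3.2, "culture": 4.0},
    {"company": "rival-a", "compensation": 3.8, "culture": 3.9},
    {"company": "rival-b", "compensation": 4.0, "culture": 4.1},
])

you = df[df["company"] == "yourco"].iloc[0]
competitors = df[df["company"] != "yourco"]

# Gap between your score and the competitor average, per category
for col in ["compensation", "culture"]:
    gap = round(you[col] - competitors[col].mean(), 2)
    print(f"{col}: {gap:+.2f} vs competitor average")
```

A negative gap on a category is the concrete data point mentioned above.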

Salary and compensation sentiment - Filter reviews that mention compensation in pros/cons, run VADER on just those sentences. You will see whether employees feel positively or negatively about pay even when the numeric rating looks neutral.
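The filtering step might look like this - pay_sentences and PAY_TERMS are hypothetical helpers, and the keyword list is deliberately rough. Each returned sentence would then go to sia.polarity_scores() from the sentiment section:

```python
import re

# Rough keyword net for compensation mentions - extend as needed
PAY_TERMS = re.compile(
    r"\b(pay|salary|salaries|compensation|bonus|bonuses|raises?|stock)\b", re.I
)

def pay_sentences(review: dict) -> list[str]:
    """Pull out just the sentences that mention compensation."""
    text = " ".join(filter(None, [review.get("pros"), review.get("cons"), review.get("text")]))
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if PAY_TERMS.search(s)]

review = {"pros": "Great coworkers. Salary is above market.",
          "cons": "Raises are rare!", "text": ""}
print(pay_sentences(review))
# ['Salary is above market.', 'Raises are rare!']
```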

Culture score tracking over time - Scrape monthly and store with timestamps. Plot culture and management scores over time. Acquisitions, layoffs, and leadership changes show up in review data within 2-3 months.
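If the reviews land in the SQLite schema above with ISO-format dates (the scraper prefers the <time datetime> attribute, which is ISO when present), the monthly rollup is a single query. monthly_culture_trend is a hypothetical helper under that assumption:

```python
import sqlite3

def monthly_culture_trend(conn: sqlite3.Connection, company_slug: str) -> list[tuple]:
    """Average culture sub-rating per month, assuming ISO dates like 2026-01-15."""
    return conn.execute(
        """SELECT substr(r.date, 1, 7) AS month,
                  round(avg(s.rating), 2) AS avg_culture,
                  count(*) AS n
           FROM reviews r
           JOIN sub_ratings s ON s.review_id = r.id
           WHERE r.company_slug = ? AND s.category = 'culture'
           GROUP BY month
           ORDER BY month""",
        (company_slug,),
    ).fetchall()
```

Prose-style dates ("March 15, 2026") will not group correctly here; normalize them on ingest if you rely on this query.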

Red flag detection for job seekers - Build a classifier that looks for keywords like "no work-life balance", "micromanagement", "toxic", "high turnover" in cons text. Weight matches by helpful_count - a review with 47 helpful votes matters more than one with 0.
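A naive version of that weighting, with a made-up keyword list - treat red_flag_score as a starting point, not a trained classifier:

```python
# Hypothetical keyword list - tune for your industry
RED_FLAGS = [
    "no work-life balance", "micromanagement", "toxic",
    "high turnover", "burnout", "unpaid overtime",
]

def red_flag_score(reviews: list[dict]) -> float:
    """Sum red-flag keyword hits in cons text, each weighted by 1 + helpful votes."""
    score = 0.0
    for r in reviews:
        cons = (r.get("cons") or "").lower()
        hits = sum(flag in cons for flag in RED_FLAGS)
        if hits:
            score += hits * (1 + r.get("helpful_count", 0))
    return score

reviews = [
    {"cons": "Toxic culture and high turnover everywhere", "helpful_count": 4},
    {"cons": "Parking is hard to find", "helpful_count": 47},
]
print(red_flag_score(reviews))
# 10.0 - two flags weighted by 5 votes; the 47-vote review has no flags
```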

HR and recruiting competitive intelligence - Before entering a new market or competing for talent in a specific city, pull reviews filtered by location. See what employees in that city care about and what makes them leave.
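Against the SQLite schema above, that location filter is one query. reviews_for_location is a hypothetical helper using a substring match, since Indeed locations come in as free text like "Austin, TX":

```python
import sqlite3

def reviews_for_location(conn: sqlite3.Connection, company_slug: str, city: str) -> list:
    """Reviews for one company filtered to a city substring, most helpful first."""
    return conn.execute(
        """SELECT title, rating, pros, cons, helpful_count
           FROM reviews
           WHERE company_slug = ? AND location LIKE ?
           ORDER BY helpful_count DESC""",
        (company_slug, f"%{city}%"),
    ).fetchall()
```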

Practical Tips

Use data-testid attributes - Indeed rotates CSS class names regularly (CSS modules with hashed names). Class selectors break constantly. data-testid attributes are tied to component functionality and change far less often. Always prefer [data-testid="..."] over .cmp-Something.

Pagination edge cases - The last page sometimes still renders a Next button but returns zero reviews. Always check both conditions: no next button AND empty review list. The scraper handles this with the if not review_elements: break check.

Show More buttons - Indeed truncates long reviews. Click them before extracting text or you get partial pros/cons. The scraper clicks all of them with a short delay between each click.

DOM structure changes - When selectors break, open DevTools on a real Chrome browser (not headless) and inspect the review card. Look for data-testid and itemprop attributes. Update both the primary and fallback selectors in extract_review().

Handle CAPTCHAs gracefully - If the page returns a challenge, the review selector returns nothing and you break out of the loop. Log the failure, wait 10-15 minutes, and retry from that page number.
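One way to structure that retry, sketched as a hypothetical wrapper. scrape_with_retry is not part of the scraper above; scrape_fn is any callable like scrape_company_reviews, and sleep is injectable so tests do not actually wait 15 minutes:

```python
import random
import time

def scrape_with_retry(slug: str, scrape_fn, retries: int = 2,
                      wait_range: tuple = (600, 900), sleep=time.sleep) -> list:
    """Retry a scrape that comes back empty - the usual CAPTCHA symptom."""
    for attempt in range(retries + 1):
        reviews = scrape_fn(slug)
        if reviews:
            return reviews
        if attempt < retries:
            # Wait 10-15 minutes before retrying the same company
            wait = random.uniform(*wait_range)
            print(f"Empty result for {slug} (attempt {attempt + 1}), "
                  f"waiting {wait / 60:.0f} min before retry")
            sleep(wait)
    return []
```

Usage would be scrape_with_retry("google", scrape_company_reviews). Resuming from a specific page number would need scrape_company_reviews to accept a start page, which the version above does not.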

Company slug discovery - If you do not know the slug, search Indeed for the company and look at the URL on their company page. It is not always intuitive: JPMorgan Chase is jpmorgan-chase, 3M is 3m-company.

Putting It Together

if __name__ == "__main__":
    conn = init_db()

    # Single company, 5 pages (up to 100 reviews)
    reviews = scrape_company_reviews("stripe", max_pages=5)
    store_reviews(reviews, conn)
    export_to_csv("stripe", conn)

    # Sentiment analysis
    sentiment = analyze_company_sentiment(reviews)
    print(f"Stripe sentiment: {sentiment}")

    # Batch comparison
    df = batch_compare(["stripe", "google", "amazon", "microsoft"], reviews_per_company=60)
    df.to_csv("comparison_2026.csv", index=False)
    print(df.to_string(index=False))

    conn.close()

Install everything:

pip install playwright httpx beautifulsoup4 pandas textblob nltk
playwright install chromium
python -c "import nltk; nltk.download('vader_lexicon')"

The review data is genuinely useful - compensation benchmarks, culture comparisons, management quality signals. Indeed has been collecting this for over a decade and for many mid-size companies it is the richest public signal on employee sentiment you can get. For production use at any real volume, residential proxies via ThorData are not optional. Datacenter IPs do not survive long enough to matter.