How to Scrape Glassdoor Reviews with Python (2026)
Glassdoor has some of the most valuable employer data on the internet — real employee reviews, salary ranges, interview questions, and company ratings. Their official API was shut down years ago, but the data is still accessible through the same GraphQL endpoints their frontend uses.
This guide walks through extracting company reviews, ratings, salary data, and interview experiences from Glassdoor using Python — including how to handle the anti-bot protections that make this challenging.
How Glassdoor Serves Its Data
Open any Glassdoor company page in DevTools and watch the network tab. You will see POST requests to www.glassdoor.com/graph — that is their internal GraphQL API. Every piece of data on the page comes through this single endpoint.
The requests carry a session cookie and a few custom headers, but the key insight is that Glassdoor does not require authentication for reading public review data. You just need to look like a real browser.
Key endpoint: POST https://www.glassdoor.com/graph
Headers you need to mimic:
- Content-Type: application/json
- gd-csrf-token: <value> — empty string works for reads
- User-Agent: <realistic browser UA>
- Referer: https://www.glassdoor.com/
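Every call to /graph carries those headers plus the same three-field JSON envelope: an operation name, a variables dict, and the GraphQL query string. A minimal sketch of assembling one (the operation and query below are placeholders to show the shape, not real Glassdoor operations):

```python
import json

def build_graph_payload(operation: str, query: str, **variables) -> str:
    """Assemble the JSON body that the /graph endpoint expects:
    operationName, a variables dict, and the GraphQL query string."""
    return json.dumps({
        "operationName": operation,
        "variables": variables,
        "query": query,
    })

# Placeholder operation, just to illustrate the envelope
body = build_graph_payload(
    "ExampleOp",
    "query ExampleOp($keyword: String!) { __typename }",
    keyword="Spotify",
)
print(body)
```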
Setup
# requirements: httpx (with the HTTP/2 extra), selectolax (optional)
# pip install "httpx[http2]" selectolax
The examples below use httpx because it can speak HTTP/2, which Glassdoor sometimes requires (create a client with httpx.Client(http2=True); the extra pulls in the h2 dependency), and its connection pooling is cleaner than requests.
Finding a Company by Name
Before fetching reviews, you need the employer ID. The typeahead search endpoint works without authentication.
import httpx
import time
import random
BASE_URL = "https://www.glassdoor.com/graph"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Content-Type": "application/json",
"Accept": "*/*",
"Referer": "https://www.glassdoor.com/",
"gd-csrf-token": "",
"Origin": "https://www.glassdoor.com",
}
def search_company(name: str) -> list:
"""Search Glassdoor for a company by name."""
payload = {
"operationName": "SuggestionsTypeahead",
"variables": {
"keyword": name,
"numSuggestions": 5,
},
"query": """
query SuggestionsTypeahead($keyword: String!, $numSuggestions: Int!) {
typeaheadSuggestions(keyword: $keyword, numSuggestions: $numSuggestions) {
suggestions {
suggestion
employerId
employerName
employerShortName
sectorName
}
}
}
""",
}
response = httpx.post(BASE_URL, headers=HEADERS, json=payload, timeout=30)
response.raise_for_status()
data = response.json()
return data.get("data", {}).get("typeaheadSuggestions", {}).get("suggestions", [])
results = search_company("Spotify")
for r in results:
print(f"{r['employerName']} (ID: {r['employerId']}) — {r['sectorName']}")
Extracting Employee Reviews
With the employer ID, you can pull paginated reviews. The review object contains overall rating plus categorical ratings for work-life balance, culture, leadership, career growth, and compensation.
def get_reviews(employer_id: int, page: int = 1, per_page: int = 10) -> dict:
"""Fetch employee reviews for a company."""
payload = {
"operationName": "EmployerReviewsPage",
"variables": {
"employerId": employer_id,
"reviewsInput": {
"sort": "DATE",
"page": {"num": page, "size": per_page},
"dynamicProfileId": employer_id,
},
},
"query": """
query EmployerReviewsPage($employerId: Int!, $reviewsInput: ReviewsInput!) {
employer(id: $employerId) {
name
overallRating
reviewCount
reviews(input: $reviewsInput) {
reviews {
reviewId
reviewDateTime
ratingOverall
ratingWorkLifeBalance
ratingCultureAndValues
ratingDiversityAndInclusion
ratingSeniorLeadership
ratingCareerOpportunities
ratingCompensationAndBenefits
summary
pros
cons
advice
jobTitle {
text
}
location {
name
}
employmentStatus
isCurrentJob
}
totalNumberOfPages
currentPage
}
}
}
""",
}
response = httpx.post(BASE_URL, headers=HEADERS, json=payload, timeout=30)
response.raise_for_status()
return response.json()
data = get_reviews(employer_id=25953, page=1)
employer = data["data"]["employer"]
print(f"{employer['name']} — {employer['overallRating']}/5 ({employer['reviewCount']} reviews)")
for review in employer["reviews"]["reviews"]:
print(f"\n[{review['ratingOverall']}/5] {review['summary']}")
    # jobTitle can be null on anonymous reviews, so guard the lookup
    job = (review.get("jobTitle") or {}).get("text", "n/a")
    print(f"  Role: {job}")
print(f" Employment: {review['employmentStatus']} | Current: {review['isCurrentJob']}")
print(f" Pros: {(review['pros'] or '')[:120]}")
print(f" Cons: {(review['cons'] or '')[:120]}")
Paginating All Reviews
def get_all_reviews(employer_id: int, max_pages: int = 50) -> list:
"""Fetch all available reviews for a company with pagination."""
all_reviews = []
first_page = get_reviews(employer_id, page=1, per_page=10)
employer_data = first_page["data"]["employer"]
total_pages = employer_data["reviews"]["totalNumberOfPages"]
total_pages = min(total_pages, max_pages)
print(f"Fetching {total_pages} pages of reviews for {employer_data['name']}")
for page in range(1, total_pages + 1):
try:
data = get_reviews(employer_id, page=page, per_page=10)
reviews = data["data"]["employer"]["reviews"]["reviews"]
all_reviews.extend(reviews)
print(f" Page {page}/{total_pages}: {len(reviews)} reviews")
except Exception as e:
print(f" Page {page} failed: {e}")
time.sleep(5)
continue
time.sleep(random.uniform(2.0, 4.0))
return all_reviews
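The fixed five-second sleep on failure works, but exponential backoff with full jitter recovers from transient errors more gracefully and spreads retries out under sustained pressure. A minimal stdlib sketch:

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: draw a random delay
    between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

for attempt in range(5):
    ceiling = min(60.0, 2.0 * 2 ** attempt)
    print(f"attempt {attempt}: sleep up to {ceiling:.0f}s "
          f"(drew {backoff_delay(attempt):.1f}s)")
```

Replace the `time.sleep(5)` in the except branch with `time.sleep(backoff_delay(attempt))` and track the attempt count per page.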
Salary Data
Glassdoor salary data comes through a separate query and includes percentile breakdowns.
def get_salaries(employer_id: int, page: int = 1) -> dict:
"""Fetch salary data for a company."""
payload = {
"operationName": "SalariesByEmployer",
"variables": {
"employerId": employer_id,
"input": {
"page": {"num": page, "size": 20},
},
},
"query": """
query SalariesByEmployer($employerId: Int!, $input: SalariesInput!) {
employer(id: $employerId) {
salaries(input: $input) {
results {
jobTitle
salaryCount
salaryPercentileMap {
payPercentile10
payPercentile25
payPercentile50
payPercentile75
payPercentile90
}
currency
payPeriod
}
totalCount
numberOfPages
}
}
}
""",
}
response = httpx.post(BASE_URL, headers=HEADERS, json=payload, timeout=30)
response.raise_for_status()
return response.json()
salary_data = get_salaries(employer_id=25953)
salaries = salary_data["data"]["employer"]["salaries"]["results"]
for salary in salaries[:10]:
p = salary["salaryPercentileMap"]
median = p["payPercentile50"]
p10 = p["payPercentile10"]
p90 = p["payPercentile90"]
    # payPeriod is not always annual, so label with the reported period
    period = salary["payPeriod"].lower()
    print(f"{salary['jobTitle']:40} ${median:>9,.0f}/{period} "
          f"(p10: ${p10:>8,.0f} - p90: ${p90:>8,.0f}) "
          f"n={salary['salaryCount']}")
Interview Data
Interview experiences are another valuable Glassdoor data point.
def get_interviews(employer_id: int, page: int = 1) -> dict:
"""Fetch interview reviews for a company."""
payload = {
"operationName": "InterviewsByEmployer",
"variables": {
"employerId": employer_id,
"interviewInput": {
"page": {"num": page, "size": 10},
"sort": "DATE",
},
},
"query": """
query InterviewsByEmployer($employerId: Int!, $interviewInput: InterviewsInput!) {
employer(id: $employerId) {
interviews(input: $interviewInput) {
interviews {
interviewId
interviewDateTime
jobTitle { text }
interviewExperience
interviewDifficulty
offer
pct
description
questions { questionText }
}
totalNumberOfPages
}
}
}
""",
}
response = httpx.post(BASE_URL, headers=HEADERS, json=payload, timeout=30)
response.raise_for_status()
return response.json()
data = get_interviews(25953, page=1)
interviews = data["data"]["employer"]["interviews"]["interviews"]
for iv in interviews[:5]:
print(f"\n{iv['jobTitle']['text']} — Difficulty: {iv['interviewDifficulty']}/5")
print(f" Experience: {iv['interviewExperience']}")
print(f" Got offer: {iv.get('offer', 'unknown')}")
for q in iv.get("questions", [])[:2]:
print(f" Q: {q['questionText'][:100]}")
Anti-Bot Measures and How to Handle Them
Glassdoor is one of the more aggressively protected sites. Here is what you encounter:
Mandatory login walls. After a few page views, Glassdoor forces you to log in. This is cookie-based — the site tracks your session and triggers a modal after 3-5 pages. The GraphQL API is less strict but still monitors request volume per session.
Cloudflare Bot Management. Glassdoor uses Cloudflare enterprise tier with JavaScript challenges. Your requests need to pass browser fingerprint checks.
Rate limiting. The GraphQL endpoint starts returning 429 errors after roughly 10-15 requests per minute from the same IP; push past that and your IP gets temporarily blocked.
Session fingerprinting. Glassdoor generates a unique session token (gdId cookie) and tracks request patterns per session. Reusing the same session for hundreds of requests triggers blocks.
Geographic restrictions. Some review data is blocked depending on your exit IP country.
The most reliable approach combines proxy rotation with session management.
import random
import time
def make_client(proxy_url: str | None = None) -> httpx.Client:
    """Create a fresh httpx client with an optional proxy."""
    # httpx 0.26+ takes a single `proxy` argument; the old `proxies`
    # dict was deprecated and later removed
    return httpx.Client(
        headers=HEADERS,
        proxy=proxy_url,
        timeout=30,
        follow_redirects=True,
    )
def scrape_reviews_safe(
employer_id: int,
max_pages: int = 10,
    proxy_url: str | None = None,
) -> list:
"""
Scrape reviews with rate limiting and error handling.
Creates a fresh session every 10 requests to avoid fingerprinting.
"""
all_reviews = []
client = make_client(proxy_url)
requests_this_session = 0
for page in range(1, max_pages + 1):
# Rotate session every 10 requests
if requests_this_session >= 10:
client.close()
time.sleep(random.uniform(3, 6))
client = make_client(proxy_url)
requests_this_session = 0
try:
data = get_reviews(employer_id, page=page)
reviews = data["data"]["employer"]["reviews"]["reviews"]
all_reviews.extend(reviews)
total_pages = data["data"]["employer"]["reviews"]["totalNumberOfPages"]
requests_this_session += 1
if page >= total_pages:
break
time.sleep(random.uniform(2.0, 5.0))
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
wait = 30 + random.uniform(0, 10)
print(f"Rate limited on page {page}. Waiting {wait:.0f}s...")
time.sleep(wait)
continue
else:
print(f"HTTP error on page {page}: {e.response.status_code}")
break
except Exception as e:
print(f"Page {page} failed: {e}")
time.sleep(10)
continue
client.close()
return all_reviews
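To stay under the rough 10-15 requests-per-minute ceiling mentioned above, an interval-based rate limiter can replace ad-hoc sleeps. A stdlib-only sketch:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests; per_minute=12
    means one request every 5 seconds at most."""
    def __init__(self, per_minute: float = 12):
        self.interval = 60.0 / per_minute
        self._last = 0.0

    def wait(self) -> float:
        """Sleep until the interval has elapsed since the last call;
        return how long we actually slept."""
        now = time.monotonic()
        delay = max(0.0, self._last + self.interval - now)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay

limiter = RateLimiter(per_minute=600)  # fast setting just for the demo
for i in range(3):
    print(f"request {i}: waited {limiter.wait():.2f}s")
```

Call `limiter.wait()` immediately before each `get_reviews` call instead of sleeping a fixed random interval.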
Using Residential Proxies
For scraping Glassdoor at any meaningful scale, residential proxies are necessary. Glassdoor specifically blocks datacenter IP ranges — any request from AWS, GCP, Azure, DigitalOcean, or similar providers gets filtered by Cloudflare.
ThorData's residential proxies route requests through real residential IP addresses that pass Cloudflare reputation checks. The auto-rotation feature gives you a fresh IP per request, which prevents the per-IP rate limits from triggering.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
def get_proxy(country=None):
"""Build a ThorData proxy URL with optional geo-targeting."""
user = THORDATA_USER
if country:
user = f"{user}-country-{country.upper()}"
return f"http://{user}:{THORDATA_PASS}@proxy.thordata.com:9000"
# US-targeted proxies for US salary data
proxy_url = get_proxy(country="US")
reviews = scrape_reviews_safe(employer_id=25953, max_pages=10, proxy_url=proxy_url)
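If your proxy password contains URL-reserved characters (@, :, /), embed the credentials with percent-encoding so the URL still parses; the host and port below are taken from the snippet above:

```python
from urllib.parse import quote

def build_proxy_url(user: str, password: str,
                    host: str = "proxy.thordata.com", port: int = 9000) -> str:
    """Percent-encode credentials so characters like @ or : in the
    password do not break the proxy URL."""
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"

print(build_proxy_url("user-country-US", "p@ss:word"))
```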
Sentiment Analysis on Reviews
Once you have the reviews, basic sentiment analysis is straightforward using the structured ratings.
from collections import Counter
def analyze_review_sentiment(reviews: list) -> dict:
"""Compute sentiment breakdown from review ratings."""
if not reviews:
return {}
ratings = [r["ratingOverall"] for r in reviews if r.get("ratingOverall")]
wb_ratings = [r["ratingWorkLifeBalance"] for r in reviews if r.get("ratingWorkLifeBalance")]
culture_ratings = [r["ratingCultureAndValues"] for r in reviews if r.get("ratingCultureAndValues")]
leadership_ratings = [r["ratingSeniorLeadership"] for r in reviews if r.get("ratingSeniorLeadership")]
career_ratings = [r["ratingCareerOpportunities"] for r in reviews if r.get("ratingCareerOpportunities")]
comp_ratings = [r["ratingCompensationAndBenefits"] for r in reviews if r.get("ratingCompensationAndBenefits")]
avg = lambda lst: round(sum(lst) / len(lst), 2) if lst else None
current_count = sum(1 for r in reviews if r.get("isCurrentJob"))
status_count = Counter(r.get("employmentStatus") for r in reviews)
return {
"total_reviews": len(reviews),
"avg_overall": avg(ratings),
"avg_work_life_balance": avg(wb_ratings),
"avg_culture": avg(culture_ratings),
"avg_leadership": avg(leadership_ratings),
"avg_career_opportunities": avg(career_ratings),
"avg_compensation": avg(comp_ratings),
"rating_distribution": dict(Counter(ratings)),
"current_employees_pct": round(current_count / len(reviews) * 100, 1),
"employment_status_breakdown": dict(status_count),
        # proxy metric: count 4+ overall ratings as "would recommend";
        # the `or 0` guards against null ratings, which .get's default misses
        "recommend_rate": round(
            sum(1 for r in reviews if (r.get("ratingOverall") or 0) >= 4) / len(reviews) * 100, 1
        ),
}
def extract_key_themes(reviews: list, field: str = "pros", top_n: int = 20) -> list:
"""Extract most-mentioned words from pros or cons sections."""
import re
stopwords = {
"the", "a", "an", "and", "or", "but", "is", "are", "was", "were",
"be", "been", "being", "have", "has", "had", "do", "does", "did",
"will", "would", "could", "should", "may", "might", "of", "in",
"on", "at", "to", "for", "with", "by", "from", "as", "this",
"that", "it", "its", "they", "them", "their", "you", "your",
"very", "really", "quite", "much", "many", "good", "great",
"nice", "lot", "lots", "can", "not", "no", "so",
}
word_counter = Counter()
for review in reviews:
text = review.get(field, "") or ""
words = re.findall(r"\b[a-z]{4,}\b", text.lower())
for word in words:
if word not in stopwords:
word_counter[word] += 1
return word_counter.most_common(top_n)
# Full analysis
reviews = scrape_reviews_safe(employer_id=25953, max_pages=5)
sentiment = analyze_review_sentiment(reviews)
print(f"\nCompany Sentiment Analysis ({sentiment['total_reviews']} reviews)")
print(f"Overall: {sentiment['avg_overall']}/5")
print(f"Work-Life Balance: {sentiment['avg_work_life_balance']}/5")
print(f"Culture & Values: {sentiment['avg_culture']}/5")
print(f"Senior Leadership: {sentiment['avg_leadership']}/5")
print(f"Career Opportunities: {sentiment['avg_career_opportunities']}/5")
print(f"Compensation: {sentiment['avg_compensation']}/5")
print(f"Current employees: {sentiment['current_employees_pct']}%")
print(f"Would recommend: {sentiment['recommend_rate']}%")
pros_themes = extract_key_themes(reviews, "pros")
cons_themes = extract_key_themes(reviews, "cons")
print(f"\nTop pros themes: {[w for w, _ in pros_themes[:10]]}")
print(f"Top cons themes: {[w for w, _ in cons_themes[:10]]}")
Storing Results in SQLite
import sqlite3
from datetime import datetime
def init_glassdoor_db(db_path="glassdoor.db"):
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("""
CREATE TABLE IF NOT EXISTS companies (
employer_id INTEGER PRIMARY KEY,
name TEXT,
overall_rating REAL,
review_count INTEGER,
fetched_at TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
review_id INTEGER PRIMARY KEY,
employer_id INTEGER,
review_date TEXT,
rating_overall INTEGER,
rating_work_life_balance INTEGER,
rating_culture INTEGER,
rating_diversity INTEGER,
rating_leadership INTEGER,
rating_career INTEGER,
rating_compensation INTEGER,
summary TEXT,
pros TEXT,
cons TEXT,
advice TEXT,
job_title TEXT,
location TEXT,
employment_status TEXT,
is_current_job INTEGER,
fetched_at TEXT,
FOREIGN KEY (employer_id) REFERENCES companies(employer_id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS salaries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
employer_id INTEGER,
job_title TEXT,
salary_count INTEGER,
p10 REAL,
p25 REAL,
p50 REAL,
p75 REAL,
p90 REAL,
currency TEXT,
pay_period TEXT,
fetched_at TEXT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_reviews_employer ON reviews(employer_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_reviews_date ON reviews(review_date)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_reviews_rating ON reviews(rating_overall)")
conn.commit()
return conn
def save_reviews(conn, employer_id, reviews):
    # datetime.utcnow() is deprecated in Python 3.12+; use an aware UTC timestamp
    from datetime import timezone
    now = datetime.now(timezone.utc).isoformat()
conn.executemany("""
INSERT OR REPLACE INTO reviews
(review_id, employer_id, review_date, rating_overall,
rating_work_life_balance, rating_culture, rating_diversity,
rating_leadership, rating_career, rating_compensation,
summary, pros, cons, advice, job_title, location,
employment_status, is_current_job, fetched_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", [
(
r["reviewId"],
employer_id,
r.get("reviewDateTime"),
r.get("ratingOverall"),
r.get("ratingWorkLifeBalance"),
r.get("ratingCultureAndValues"),
r.get("ratingDiversityAndInclusion"),
r.get("ratingSeniorLeadership"),
r.get("ratingCareerOpportunities"),
r.get("ratingCompensationAndBenefits"),
r.get("summary"),
r.get("pros"),
r.get("cons"),
r.get("advice"),
r.get("jobTitle", {}).get("text") if r.get("jobTitle") else None,
r.get("location", {}).get("name") if r.get("location") else None,
r.get("employmentStatus"),
1 if r.get("isCurrentJob") else 0,
now,
)
for r in reviews
])
conn.commit()
print(f"Saved {len(reviews)} reviews for employer {employer_id}")
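With reviews in SQLite, trend analysis is plain SQL. A sketch that computes average rating per year against the reviews schema above, shown here with an in-memory database and synthetic rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down version of the reviews table from init_glassdoor_db
conn.execute("""
    CREATE TABLE reviews (
        review_id INTEGER PRIMARY KEY,
        review_date TEXT,
        rating_overall INTEGER
    )
""")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [(1, "2024-03-01T00:00:00", 4), (2, "2024-09-15T00:00:00", 2),
     (3, "2025-01-10T00:00:00", 5)],
)
# ISO dates sort and slice cleanly: the first 4 chars are the year
rows = conn.execute("""
    SELECT substr(review_date, 1, 4) AS year,
           ROUND(AVG(rating_overall), 2) AS avg_rating,
           COUNT(*) AS n
    FROM reviews
    GROUP BY year
    ORDER BY year
""").fetchall()
for year, avg_rating, n in rows:
    print(f"{year}: {avg_rating}/5 over {n} reviews")
```

The same query runs unchanged against the full table produced by save_reviews.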
Tips for Reliable Glassdoor Scraping
Rotate sessions, not just IPs. Create a fresh httpx.Client every 10-15 requests with new cookies. The gdId cookie is the session identifier that triggers escalating challenges.
Start with the GraphQL API. It is more stable than parsing HTML and less likely to trigger the login wall. The same queries the frontend uses work fine from Python.
Handle schema changes. Glassdoor updates their GraphQL schema periodically. If a query breaks, open the browser developer tools, find a request to /graph, and inspect the current query structure in the payload.
Cache company IDs. The employer search endpoint is the most heavily rate-limited call. Cache the mapping between company names and IDs locally and avoid re-fetching.
Watch for 403 vs 429. Glassdoor returns 403 for auth-related blocks (Cloudflare) and 429 for pure rate limits. The retry strategy differs: 403 needs a new IP and fresh session, 429 just needs a delay.
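That branching logic is easy to encode explicitly. A small sketch of a retry-decision helper (the strategy labels are illustrative):

```python
def retry_action(status_code: int) -> str:
    """Map an HTTP status to a recovery strategy: 429 means back off,
    403 means rotate IP and session, 5xx is worth one retry."""
    if status_code == 429:
        return "sleep-and-retry"
    if status_code == 403:
        return "rotate-proxy-and-session"
    if 500 <= status_code < 600:
        return "retry-once"
    return "abort"

print(retry_action(429))  # sleep-and-retry
print(retry_action(403))  # rotate-proxy-and-session
```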
Geo-target your proxies. Glassdoor shows different salary data and sometimes different reviews based on IP location. Use country-targeted proxies when you need location-specific data.
Review data changes. Glassdoor sometimes removes reviews that violate their policies. If you are building a historical dataset, note your scrape timestamps — reviews that disappear from the live site are not necessarily fraudulent, just removed.
Glassdoor data is invaluable for competitive intelligence, recruiting analysis, and understanding company culture at scale. With careful session management and proper proxy rotation via ThorData, Python makes it accessible without paying for their enterprise API.