How to Scrape Glassdoor Reviews with Python (2026)
Glassdoor has some of the most valuable employer data on the internet — real employee reviews, salary ranges, interview questions, and company ratings. Their official API was shut down years ago, but the data is still accessible through the same GraphQL endpoints their frontend uses.
This guide walks through extracting company reviews, ratings, salary data, and interview experiences from Glassdoor using Python — including how to handle the anti-bot protections that make this challenging.
How Glassdoor Serves Its Data
Open any Glassdoor company page in DevTools and watch the network tab. You will see POST requests to www.glassdoor.com/graph — that is their internal GraphQL API. Every piece of data on the page comes through this single endpoint.
The requests carry a session cookie and a few custom headers, but the key insight is that Glassdoor does not require authentication for reading public review data. You just need to look like a real browser.
Key endpoint: POST https://www.glassdoor.com/graph
Headers you need to mimic:
- Content-Type: application/json
- gd-csrf-token: <value> — empty string works for reads
- User-Agent: <realistic browser UA>
- Referer: https://www.glassdoor.com/
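Every call to /graph carries those headers plus the same three-field JSON envelope: an operation name, a variables dict, and the GraphQL query string. A minimal sketch of assembling one (the operation and query below are placeholders to show the shape, not real Glassdoor operations):

```python
import json

def build_graph_payload(operation: str, query: str, **variables) -> str:
    """Assemble the JSON body that the /graph endpoint expects:
    operationName, a variables dict, and the GraphQL query string."""
    return json.dumps({
        "operationName": operation,
        "variables": variables,
        "query": query,
    })

# Placeholder operation, just to illustrate the envelope
body = build_graph_payload(
    "ExampleOp",
    "query ExampleOp($keyword: String!) { __typename }",
    keyword="Spotify",
)
print(body)
```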
Setup
# requirements: httpx (with the HTTP/2 extra), selectolax (optional)
# pip install "httpx[http2]" selectolax
The examples below use httpx because it can speak HTTP/2, which Glassdoor sometimes requires (create a client with httpx.Client(http2=True); the extra pulls in the h2 dependency), and its connection pooling is cleaner than requests.
Finding a Company by Name
Before fetching reviews, you need the employer ID. The typeahead search endpoint works without authentication.
import httpx
import time
import random
BASE_URL = "https://www.glassdoor.com/graph"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Content-Type": "application/json",
"Accept": "*/*",
"Referer": "https://www.glassdoor.com/",
"gd-csrf-token": "",
"Origin": "https://www.glassdoor.com",
}
def search_company(name: str) -> list:
"""Search Glassdoor for a company by name."""
payload = {
"operationName": "SuggestionsTypeahead",
"variables": {
"keyword": name,
"numSuggestions": 5,
},
"query": """
query SuggestionsTypeahead($keyword: String!, $numSuggestions: Int!) {
typeaheadSuggestions(keyword: $keyword, numSuggestions: $numSuggestions) {
suggestions {
suggestion
employerId
employerName
employerShortName
sectorName
}
}
}
""",
}
response = httpx.post(BASE_URL, headers=HEADERS, json=payload, timeout=30)
response.raise_for_status()
data = response.json()
return data.get("data", {}).get("typeaheadSuggestions", {}).get("suggestions", [])
results = search_company("Spotify")
for r in results:
print(f"{r['employerName']} (ID: {r['employerId']}) — {r['sectorName']}")
Extracting Employee Reviews
With the employer ID, you can pull paginated reviews. The review object contains overall rating plus categorical ratings for work-life balance, culture, leadership, career growth, and compensation.
def get_reviews(employer_id: int, page: int = 1, per_page: int = 10) -> dict:
"""Fetch employee reviews for a company."""
payload = {
"operationName": "EmployerReviewsPage",
"variables": {
"employerId": employer_id,
"reviewsInput": {
"sort": "DATE",
"page": {"num": page, "size": per_page},
"dynamicProfileId": employer_id,
},
},
"query": """
query EmployerReviewsPage($employerId: Int!, $reviewsInput: ReviewsInput!) {
employer(id: $employerId) {
name
overallRating
reviewCount
reviews(input: $reviewsInput) {
reviews {
reviewId
reviewDateTime
ratingOverall
ratingWorkLifeBalance
ratingCultureAndValues
ratingDiversityAndInclusion
ratingSeniorLeadership
ratingCareerOpportunities
ratingCompensationAndBenefits
summary
pros
cons
advice
jobTitle {
text
}
location {
name
}
employmentStatus
isCurrentJob
}
totalNumberOfPages
currentPage
}
}
}
""",
}
response = httpx.post(BASE_URL, headers=HEADERS, json=payload, timeout=30)
response.raise_for_status()
return response.json()
data = get_reviews(employer_id=25953, page=1)
employer = data["data"]["employer"]
print(f"{employer['name']} — {employer['overallRating']}/5 ({employer['reviewCount']} reviews)")
for review in employer["reviews"]["reviews"]:
print(f"\n[{review['ratingOverall']}/5] {review['summary']}")
    # jobTitle can be null on anonymous reviews, so guard the lookup
    job = (review.get("jobTitle") or {}).get("text", "n/a")
    print(f"  Role: {job}")
print(f" Employment: {review['employmentStatus']} | Current: {review['isCurrentJob']}")
print(f" Pros: {(review['pros'] or '')[:120]}")
print(f" Cons: {(review['cons'] or '')[:120]}")
Paginating All Reviews
def get_all_reviews(employer_id: int, max_pages: int = 50) -> list:
"""Fetch all available reviews for a company with pagination."""
all_reviews = []
first_page = get_reviews(employer_id, page=1, per_page=10)
employer_data = first_page["data"]["employer"]
total_pages = employer_data["reviews"]["totalNumberOfPages"]
total_pages = min(total_pages, max_pages)
print(f"Fetching {total_pages} pages of reviews for {employer_data['name']}")
for page in range(1, total_pages + 1):
try:
data = get_reviews(employer_id, page=page, per_page=10)
reviews = data["data"]["employer"]["reviews"]["reviews"]
all_reviews.extend(reviews)
print(f" Page {page}/{total_pages}: {len(reviews)} reviews")
except Exception as e:
print(f" Page {page} failed: {e}")
time.sleep(5)
continue
time.sleep(random.uniform(2.0, 4.0))
return all_reviews
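The fixed five-second sleep on failure works, but exponential backoff with full jitter recovers from transient errors more gracefully and spreads retries out under sustained pressure. A minimal stdlib sketch:

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: draw a random delay
    between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

for attempt in range(5):
    ceiling = min(60.0, 2.0 * 2 ** attempt)
    print(f"attempt {attempt}: sleep up to {ceiling:.0f}s "
          f"(drew {backoff_delay(attempt):.1f}s)")
```

Replace the `time.sleep(5)` in the except branch with `time.sleep(backoff_delay(attempt))` and track the attempt count per page.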
Salary Data
Glassdoor salary data comes through a separate query and includes percentile breakdowns.
def get_salaries(employer_id: int, page: int = 1) -> dict:
"""Fetch salary data for a company."""
payload = {
"operationName": "SalariesByEmployer",
"variables": {
"employerId": employer_id,
"input": {
"page": {"num": page, "size": 20},
},
},
"query": """
query SalariesByEmployer($employerId: Int!, $input: SalariesInput!) {
employer(id: $employerId) {
salaries(input: $input) {
results {
jobTitle
salaryCount
salaryPercentileMap {
payPercentile10
payPercentile25
payPercentile50
payPercentile75
payPercentile90
}
currency
payPeriod
}
totalCount
numberOfPages
}
}
}
""",
}
response = httpx.post(BASE_URL, headers=HEADERS, json=payload, timeout=30)
response.raise_for_status()
return response.json()
salary_data = get_salaries(employer_id=25953)
salaries = salary_data["data"]["employer"]["salaries"]["results"]
for salary in salaries[:10]:
p = salary["salaryPercentileMap"]
median = p["payPercentile50"]
p10 = p["payPercentile10"]
p90 = p["payPercentile90"]
    # payPeriod is not always annual, so label with the reported period
    period = salary["payPeriod"].lower()
    print(f"{salary['jobTitle']:40} ${median:>9,.0f}/{period} "
          f"(p10: ${p10:>8,.0f} - p90: ${p90:>8,.0f}) "
          f"n={salary['salaryCount']}")
Interview Data
Interview experiences are another valuable Glassdoor data point.
def get_interviews(employer_id: int, page: int = 1) -> dict:
"""Fetch interview reviews for a company."""
payload = {
"operationName": "InterviewsByEmployer",
"variables": {
"employerId": employer_id,
"interviewInput": {
"page": {"num": page, "size": 10},
"sort": "DATE",
},
},
"query": """
query InterviewsByEmployer($employerId: Int!, $interviewInput: InterviewsInput!) {
employer(id: $employerId) {
interviews(input: $interviewInput) {
interviews {
interviewId
interviewDateTime
jobTitle { text }
interviewExperience
interviewDifficulty
offer
pct
description
questions { questionText }
}
totalNumberOfPages
}
}
}
""",
}
response = httpx.post(BASE_URL, headers=HEADERS, json=payload, timeout=30)
response.raise_for_status()
return response.json()
data = get_interviews(25953, page=1)
interviews = data["data"]["employer"]["interviews"]["interviews"]
for iv in interviews[:5]:
print(f"\n{iv['jobTitle']['text']} — Difficulty: {iv['interviewDifficulty']}/5")
print(f" Experience: {iv['interviewExperience']}")
print(f" Got offer: {iv.get('offer', 'unknown')}")
for q in iv.get("questions", [])[:2]:
print(f" Q: {q['questionText'][:100]}")
Anti-Bot Measures and How to Handle Them
Glassdoor is one of the more aggressively protected sites. Here is what you encounter:
Mandatory login walls. After a few page views, Glassdoor forces you to log in. This is cookie-based — the site tracks your session and triggers a modal after 3-5 pages. The GraphQL API is less strict but still monitors request volume per session.
Cloudflare Bot Management. Glassdoor uses Cloudflare enterprise tier with JavaScript challenges. Your requests need to pass browser fingerprint checks.
Rate limiting. The GraphQL endpoint starts returning 429 errors after roughly 10-15 requests per minute from the same IP; push past that and your IP gets temporarily blocked.
Session fingerprinting. Glassdoor generates a unique session token (gdId cookie) and tracks request patterns per session. Reusing the same session for hundreds of requests triggers blocks.
Geographic restrictions. Some review data is blocked depending on your exit IP country.
The most reliable approach combines proxy rotation with session management.
import random
import time
def make_client(proxy_url: str | None = None) -> httpx.Client:
    """Create a fresh httpx client with an optional proxy."""
    # httpx 0.26+ takes a single `proxy` argument; the old `proxies`
    # dict was deprecated and later removed
    return httpx.Client(
        headers=HEADERS,
        proxy=proxy_url,
        timeout=30,
        follow_redirects=True,
    )
def scrape_reviews_safe(
employer_id: int,
max_pages: int = 10,
    proxy_url: str | None = None,
) -> list:
"""
Scrape reviews with rate limiting and error handling.
Creates a fresh session every 10 requests to avoid fingerprinting.
"""
all_reviews = []
client = make_client(proxy_url)
requests_this_session = 0
for page in range(1, max_pages + 1):
# Rotate session every 10 requests
if requests_this_session >= 10:
client.close()
time.sleep(random.uniform(3, 6))
client = make_client(proxy_url)
requests_this_session = 0
try:
data = get_reviews(employer_id, page=page)
reviews = data["data"]["employer"]["reviews"]["reviews"]
all_reviews.extend(reviews)
total_pages = data["data"]["employer"]["reviews"]["totalNumberOfPages"]
requests_this_session += 1
if page >= total_pages:
break
time.sleep(random.uniform(2.0, 5.0))
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
wait = 30 + random.uniform(0, 10)
print(f"Rate limited on page {page}. Waiting {wait:.0f}s...")
time.sleep(wait)
continue
else:
print(f"HTTP error on page {page}: {e.response.status_code}")
break
except Exception as e:
print(f"Page {page} failed: {e}")
time.sleep(10)
continue
client.close()
return all_reviews
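To stay under the rough 10-15 requests-per-minute ceiling mentioned above, an interval-based rate limiter can replace ad-hoc sleeps. A stdlib-only sketch:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between requests; per_minute=12
    means one request every 5 seconds at most."""
    def __init__(self, per_minute: float = 12):
        self.interval = 60.0 / per_minute
        self._last = 0.0

    def wait(self) -> float:
        """Sleep until the interval has elapsed since the last call;
        return how long we actually slept."""
        now = time.monotonic()
        delay = max(0.0, self._last + self.interval - now)
        if delay:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay

limiter = RateLimiter(per_minute=600)  # fast setting just for the demo
for i in range(3):
    print(f"request {i}: waited {limiter.wait():.2f}s")
```

Call `limiter.wait()` immediately before each `get_reviews` call instead of sleeping a fixed random interval.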
Using Residential Proxies
For scraping Glassdoor at any meaningful scale, residential proxies are necessary. Glassdoor specifically blocks datacenter IP ranges — any request from AWS, GCP, Azure, DigitalOcean, or similar providers gets filtered by Cloudflare.
ThorData's residential proxies route requests through real residential IP addresses that pass Cloudflare reputation checks. The auto-rotation feature gives you a fresh IP per request, which prevents the per-IP rate limits from triggering.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
def get_proxy(country=None):
"""Build a ThorData proxy URL with optional geo-targeting."""
user = THORDATA_USER
if country:
user = f"{user}-country-{country.upper()}"
return f"http://{user}:{THORDATA_PASS}@proxy.thordata.com:9000"
# US-targeted proxies for US salary data
proxy_url = get_proxy(country="US")
reviews = scrape_reviews_safe(employer_id=25953, max_pages=10, proxy_url=proxy_url)
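If your proxy password contains URL-reserved characters (@, :, /), embed the credentials with percent-encoding so the URL still parses; the host and port below are taken from the snippet above:

```python
from urllib.parse import quote

def build_proxy_url(user: str, password: str,
                    host: str = "proxy.thordata.com", port: int = 9000) -> str:
    """Percent-encode credentials so characters like @ or : in the
    password do not break the proxy URL."""
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"

print(build_proxy_url("user-country-US", "p@ss:word"))
```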
Sentiment Analysis on Reviews
Once you have the reviews, basic sentiment analysis is straightforward using the structured ratings.
from collections import Counter
def analyze_review_sentiment(reviews: list) -> dict:
"""Compute sentiment breakdown from review ratings."""
if not reviews:
return {}
ratings = [r["ratingOverall"] for r in reviews if r.get("ratingOverall")]
wb_ratings = [r["ratingWorkLifeBalance"] for r in reviews if r.get("ratingWorkLifeBalance")]
culture_ratings = [r["ratingCultureAndValues"] for r in reviews if r.get("ratingCultureAndValues")]
leadership_ratings = [r["ratingSeniorLeadership"] for r in reviews if r.get("ratingSeniorLeadership")]
career_ratings = [r["ratingCareerOpportunities"] for r in reviews if r.get("ratingCareerOpportunities")]
comp_ratings = [r["ratingCompensationAndBenefits"] for r in reviews if r.get("ratingCompensationAndBenefits")]
avg = lambda lst: round(sum(lst) / len(lst), 2) if lst else None
current_count = sum(1 for r in reviews if r.get("isCurrentJob"))
status_count = Counter(r.get("employmentStatus") for r in reviews)
return {
"total_reviews": len(reviews),
"avg_overall": avg(ratings),
"avg_work_life_balance": avg(wb_ratings),
"avg_culture": avg(culture_ratings),
"avg_leadership": avg(leadership_ratings),
"avg_career_opportunities": avg(career_ratings),
"avg_compensation": avg(comp_ratings),
"rating_distribution": dict(Counter(ratings)),
"current_employees_pct": round(current_count / len(reviews) * 100, 1),
"employment_status_breakdown": dict(status_count),
        # proxy metric: count 4+ overall ratings as "would recommend";
        # the `or 0` guards against null ratings, which .get's default misses
        "recommend_rate": round(
            sum(1 for r in reviews if (r.get("ratingOverall") or 0) >= 4) / len(reviews) * 100, 1
        ),
}
def extract_key_themes(reviews: list, field: str = "pros", top_n: int = 20) -> list:
"""Extract most-mentioned words from pros or cons sections."""
import re
stopwords = {
"the", "a", "an", "and", "or", "but", "is", "are", "was", "were",
"be", "been", "being", "have", "has", "had", "do", "does", "did",
"will", "would", "could", "should", "may", "might", "of", "in",
"on", "at", "to", "for", "with", "by", "from", "as", "this",
"that", "it", "its", "they", "them", "their", "you", "your",
"very", "really", "quite", "much", "many", "good", "great",
"nice", "lot", "lots", "can", "not", "no", "so",
}
word_counter = Counter()
for review in reviews:
text = review.get(field, "") or ""
words = re.findall(r"\b[a-z]{4,}\b", text.lower())
for word in words:
if word not in stopwords:
word_counter[word] += 1
return word_counter.most_common(top_n)
# Full analysis
reviews = scrape_reviews_safe(employer_id=25953, max_pages=5)
sentiment = analyze_review_sentiment(reviews)
print(f"\nCompany Sentiment Analysis ({sentiment['total_reviews']} reviews)")
print(f"Overall: {sentiment['avg_overall']}/5")
print(f"Work-Life Balance: {sentiment['avg_work_life_balance']}/5")
print(f"Culture & Values: {sentiment['avg_culture']}/5")
print(f"Senior Leadership: {sentiment['avg_leadership']}/5")
print(f"Career Opportunities: {sentiment['avg_career_opportunities']}/5")
print(f"Compensation: {sentiment['avg_compensation']}/5")
print(f"Current employees: {sentiment['current_employees_pct']}%")
print(f"Would recommend: {sentiment['recommend_rate']}%")
pros_themes = extract_key_themes(reviews, "pros")
cons_themes = extract_key_themes(reviews, "cons")
print(f"\nTop pros themes: {[w for w, _ in pros_themes[:10]]}")
print(f"Top cons themes: {[w for w, _ in cons_themes[:10]]}")
Storing Results in SQLite
import sqlite3
from datetime import datetime
def init_glassdoor_db(db_path="glassdoor.db"):
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("""
CREATE TABLE IF NOT EXISTS companies (
employer_id INTEGER PRIMARY KEY,
name TEXT,
overall_rating REAL,
review_count INTEGER,
fetched_at TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
review_id INTEGER PRIMARY KEY,
employer_id INTEGER,
review_date TEXT,
rating_overall INTEGER,
rating_work_life_balance INTEGER,
rating_culture INTEGER,
rating_diversity INTEGER,
rating_leadership INTEGER,
rating_career INTEGER,
rating_compensation INTEGER,
summary TEXT,
pros TEXT,
cons TEXT,
advice TEXT,
job_title TEXT,
location TEXT,
employment_status TEXT,
is_current_job INTEGER,
fetched_at TEXT,
FOREIGN KEY (employer_id) REFERENCES companies(employer_id)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS salaries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
employer_id INTEGER,
job_title TEXT,
salary_count INTEGER,
p10 REAL,
p25 REAL,
p50 REAL,
p75 REAL,
p90 REAL,
currency TEXT,
pay_period TEXT,
fetched_at TEXT
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_reviews_employer ON reviews(employer_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_reviews_date ON reviews(review_date)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_reviews_rating ON reviews(rating_overall)")
conn.commit()
return conn
def save_reviews(conn, employer_id, reviews):
    # datetime.utcnow() is deprecated in Python 3.12+; use an aware UTC timestamp
    from datetime import timezone
    now = datetime.now(timezone.utc).isoformat()
conn.executemany("""
INSERT OR REPLACE INTO reviews
(review_id, employer_id, review_date, rating_overall,
rating_work_life_balance, rating_culture, rating_diversity,
rating_leadership, rating_career, rating_compensation,
summary, pros, cons, advice, job_title, location,
employment_status, is_current_job, fetched_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", [
(
r["reviewId"],
employer_id,
r.get("reviewDateTime"),
r.get("ratingOverall"),
r.get("ratingWorkLifeBalance"),
r.get("ratingCultureAndValues"),
r.get("ratingDiversityAndInclusion"),
r.get("ratingSeniorLeadership"),
r.get("ratingCareerOpportunities"),
r.get("ratingCompensationAndBenefits"),
r.get("summary"),
r.get("pros"),
r.get("cons"),
r.get("advice"),
r.get("jobTitle", {}).get("text") if r.get("jobTitle") else None,
r.get("location", {}).get("name") if r.get("location") else None,
r.get("employmentStatus"),
1 if r.get("isCurrentJob") else 0,
now,
)
for r in reviews
])
conn.commit()
print(f"Saved {len(reviews)} reviews for employer {employer_id}")
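With reviews in SQLite, trend analysis is plain SQL. A sketch that computes average rating per year against the reviews schema above, shown here with an in-memory database and synthetic rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down version of the reviews table from init_glassdoor_db
conn.execute("""
    CREATE TABLE reviews (
        review_id INTEGER PRIMARY KEY,
        review_date TEXT,
        rating_overall INTEGER
    )
""")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [(1, "2024-03-01T00:00:00", 4), (2, "2024-09-15T00:00:00", 2),
     (3, "2025-01-10T00:00:00", 5)],
)
# ISO dates sort and slice cleanly: the first 4 chars are the year
rows = conn.execute("""
    SELECT substr(review_date, 1, 4) AS year,
           ROUND(AVG(rating_overall), 2) AS avg_rating,
           COUNT(*) AS n
    FROM reviews
    GROUP BY year
    ORDER BY year
""").fetchall()
for year, avg_rating, n in rows:
    print(f"{year}: {avg_rating}/5 over {n} reviews")
```

The same query runs unchanged against the full table produced by save_reviews.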
Tips for Reliable Glassdoor Scraping
Rotate sessions, not just IPs. Create a fresh httpx.Client every 10-15 requests with new cookies. The gdId cookie is the session identifier that triggers escalating challenges.
Start with the GraphQL API. It is more stable than parsing HTML and less likely to trigger the login wall. The same queries the frontend uses work fine from Python.
Handle schema changes. Glassdoor updates their GraphQL schema periodically. If a query breaks, open the browser developer tools, find a request to /graph, and inspect the current query structure in the payload.
Cache company IDs. The employer search endpoint is the most heavily rate-limited call. Cache the mapping between company names and IDs locally and avoid re-fetching.
Watch for 403 vs 429. Glassdoor returns 403 for auth-related blocks (Cloudflare) and 429 for pure rate limits. The retry strategy differs: 403 needs a new IP and fresh session, 429 just needs a delay.
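That branching logic is easy to encode explicitly. A small sketch of a retry-decision helper (the strategy labels are illustrative):

```python
def retry_action(status_code: int) -> str:
    """Map an HTTP status to a recovery strategy: 429 means back off,
    403 means rotate IP and session, 5xx is worth one retry."""
    if status_code == 429:
        return "sleep-and-retry"
    if status_code == 403:
        return "rotate-proxy-and-session"
    if 500 <= status_code < 600:
        return "retry-once"
    return "abort"

print(retry_action(429))  # sleep-and-retry
print(retry_action(403))  # rotate-proxy-and-session
```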
Geo-target your proxies. Glassdoor shows different salary data and sometimes different reviews based on IP location. Use country-targeted proxies when you need location-specific data.
Review data changes. Glassdoor sometimes removes reviews that violate their policies. If you are building a historical dataset, note your scrape timestamps — reviews that disappear from the live site are not necessarily fraudulent, just removed.
Glassdoor data is invaluable for competitive intelligence, recruiting analysis, and understanding company culture at scale. With careful session management and proper proxy rotation via ThorData, Python makes it accessible without paying for their enterprise API.