Scraping Glassdoor: Salaries, Reviews, and Interview Questions (2026)
Glassdoor is one of the hardest sites to scrape in 2026. It gates almost everything useful behind a login wall, uses aggressive bot detection via Akamai Bot Manager, and renders most content dynamically. But the salary and review data is genuinely valuable — it's the largest public dataset of self-reported compensation and workplace feedback. This guide covers exactly what you can get, how to get it, and what it takes to do so reliably.
What's Public vs. What's Gated
Glassdoor shows a surprising amount on company overview pages without login:
No login required:
- Company name, logo, overall rating (1-5 stars)
- Number of reviews, number of salary reports
- Company size, headquarters, industry, revenue range
- "Featured" review snippets (1-2 per page)
- Job listings (redirects to Glassdoor's job board)
- High-level rating breakdowns (culture, management, etc.)
Login required (the useful stuff):
- Full salary ranges by job title, base/bonus/equity splits
- Complete review text with pros/cons/advice to management
- Interview questions, difficulty ratings, offer outcomes
- Benefits ratings and detailed breakdowns
- CEO approval ratings over time
- Individual salaries with location and experience data
The login wall is the core challenge. Glassdoor wants you to contribute a review or salary report before showing you the full dataset. They enforce this even for logged-in users who haven't contributed (the "give to get" model).
Setting Up Your Scraping Environment
pip install curl-cffi beautifulsoup4 playwright
playwright install chromium
You'll also need:
- A Glassdoor account (free to create)
- Valid session cookies from that account
- Residential proxies for anything beyond very light usage
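If you'd rather not paste cookie values by hand, most cookie-export browser extensions write a JSON list of `{"name": ..., "value": ...}` objects. A small loader (a sketch, assuming that common export format) turns the file into the dict the later examples use:

```python
import json

def load_cookies(path: str) -> dict:
    """Load cookies exported as a JSON list of {"name": ..., "value": ...}
    objects -- the format most cookie-export extensions produce."""
    with open(path) as f:
        raw = json.load(f)
    return {c["name"]: c["value"] for c in raw}
```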
Scraping Public Company Data
The company overview page has structured data you can get without authentication. The key is using curl-cffi to bypass TLS fingerprinting — the standard requests library gets blocked at the Akamai layer before it even reaches Glassdoor's application servers:
from curl_cffi import requests as cffi_requests
from bs4 import BeautifulSoup
import json
import re
import time
import random
# ThorData residential proxy for Glassdoor
PROXY = "http://USERNAME:[email protected]:7777"
def get_company_overview(company_slug: str, proxy: str = None) -> dict:
"""
Get public company data from Glassdoor overview page.
company_slug examples: 'Google', 'Apple', 'meta-platforms'
"""
url = f"https://www.glassdoor.com/Overview/Working-at-{company_slug}.htm"
session = cffi_requests.Session(impersonate="chrome124")
if proxy:
session.proxies = {"http": proxy, "https": proxy}
resp = session.get(url, headers={
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
})
if resp.status_code != 200:
return {"error": f"Status {resp.status_code}", "url": url}
soup = BeautifulSoup(resp.text, "html.parser")
# Glassdoor embeds Apollo GraphQL cache as JSON in a script tag
script = soup.find("script", string=re.compile("window\\.__APOLLO_STATE__"))
if not script:
# Try __NEXT_DATA__ as fallback
next_script = soup.find("script", {"id": "__NEXT_DATA__"})
if next_script:
try:
next_data = json.loads(next_script.string)
return _extract_from_next_data(next_data)
except (json.JSONDecodeError, TypeError):
pass
return {"error": "No Apollo state found — likely blocked or page structure changed"}
match = re.search(r"window\.__APOLLO_STATE__\s*=\s*({.+?});", script.string, re.DOTALL)
if not match:
return {"error": "Could not parse Apollo state"}
try:
state = json.loads(match.group(1))
except json.JSONDecodeError:
return {"error": "Apollo state JSON parse failed"}
# Extract employer data from Apollo cache
employer_keys = [k for k in state if k.startswith("Employer:")]
if not employer_keys:
return {"error": "No employer data in Apollo cache"}
emp = state[employer_keys[0]]
return {
"name": emp.get("shortName") or emp.get("name"),
"rating": emp.get("overallRating"),
"review_count": emp.get("numberOfRatings"),
"ceo_approval": emp.get("ceo", {}).get("pctApprove") if isinstance(emp.get("ceo"), dict) else None,
"size": emp.get("size"),
"industry": emp.get("primaryIndustry", {}).get("industryName") if isinstance(emp.get("primaryIndustry"), dict) else None,
"revenue": emp.get("revenue"),
"headquarters": emp.get("headquarters"),
"website": emp.get("website"),
"founded": emp.get("yearFounded"),
"description": emp.get("squareLogoUrl"),
"culture_rating": emp.get("ratingCulture"),
"work_life_rating": emp.get("ratingWorkLife"),
"career_rating": emp.get("ratingCareerOpportunities"),
"comp_benefits_rating": emp.get("ratingCompensationAndBenefits"),
"management_rating": emp.get("ratingSeniorLeadership"),
}
def _extract_from_next_data(data: dict) -> dict:
"""Fallback: extract employer info from Next.js page data."""
emp = (data.get("props", {})
.get("pageProps", {})
.get("employerReviews", {})
.get("employer", {}))
return {
"name": emp.get("shortName"),
"rating": emp.get("ratings", {}).get("overallRating"),
"review_count": emp.get("numberOfRatings"),
"size": emp.get("size"),
}
Finding Glassdoor Company IDs
Before using the GraphQL API for salaries and reviews, you need the numeric employer ID. You can extract it from the overview page URL or the Apollo state:
def get_company_id(company_slug: str, proxy: str = None) -> int | None:
"""Extract the numeric Glassdoor employer ID from a company page."""
url = f"https://www.glassdoor.com/Overview/Working-at-{company_slug}.htm"
session = cffi_requests.Session(impersonate="chrome124")
if proxy:
session.proxies = {"http": proxy, "https": proxy}
resp = session.get(url)
# Method 1: Extract from URL redirect (Glassdoor adds EI_IE{id} to URLs)
match = re.search(r"EI_IE(\d+)\.htm", resp.url)
if match:
return int(match.group(1))
# Method 2: Extract from Apollo state
match = re.search(r'"Employer:(\d+)"', resp.text)
if match:
return int(match.group(1))
# Method 3: Extract from page HTML
match = re.search(r'"employerId"\s*:\s*(\d+)', resp.text)
if match:
return int(match.group(1))
return None
# Common company IDs for reference:
# Google: 9079, Apple: 1138, Amazon: 6036, Microsoft: 1651
# Meta: 40772, Netflix: 11891, Airbnb: 391850
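If you're starting from plain company names rather than known slugs, a best-effort slug builder saves manual URL guessing. This is a sketch: the slug rules are inferred from the URL patterns above, not documented by Glassdoor, so verify the resulting URL before scraping at volume.

```python
import re

def company_name_to_slug(name: str) -> str:
    """Best-effort conversion of a company name to a Glassdoor
    overview-page slug, e.g. 'Meta Platforms' -> 'Meta-Platforms'.
    The exact rules are an assumption; check the real URL first."""
    # Drop punctuation that doesn't appear in Glassdoor slugs
    cleaned = re.sub(r"[^A-Za-z0-9 \-]", "", name).strip()
    # Collapse whitespace runs into single hyphens
    return re.sub(r"\s+", "-", cleaned)

print(company_name_to_slug("Meta Platforms, Inc."))  # Meta-Platforms-Inc
```

Feed the result to get_company_overview or get_company_id above.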
The Unofficial GraphQL API: Authenticated Salary Search
The real salary data lives behind Glassdoor's GraphQL API at https://www.glassdoor.com/graph. To access it, you need session cookies from a legitimate login. Export them from your browser's DevTools (Application > Cookies) after logging in:
from curl_cffi import requests as cffi_requests
import json
import time
import random
class GlassdoorSalaryClient:
GRAPH_URL = "https://www.glassdoor.com/graph"
def __init__(self, cookies: dict, proxy: str = None):
"""
cookies: dict exported from browser DevTools after login.
Required: GSESSIONID, gdId, gdsid — plus any others Glassdoor sets.
Export process:
1. Log in to glassdoor.com in Chrome
2. Open DevTools > Application > Cookies > glassdoor.com
3. Copy all cookie name/value pairs
"""
self.session = cffi_requests.Session(impersonate="chrome124")
if proxy:
self.session.proxies = {"http": proxy, "https": proxy}
for name, value in cookies.items():
self.session.cookies.set(name, value, domain=".glassdoor.com")
self.headers = {
"Content-Type": "application/json",
"Accept": "application/json",
"gd-csrf-token": cookies.get("gdId", ""),
"Referer": "https://www.glassdoor.com/Salaries/",
"Origin": "https://www.glassdoor.com",
}
def _make_request(self, payload: list) -> dict:
"""Send a GraphQL request with retry logic."""
for attempt in range(3):
try:
resp = self.session.post(
self.GRAPH_URL,
headers=self.headers,
json=payload,
timeout=20,
)
if resp.status_code == 403:
raise Exception("Session expired or CAPTCHA triggered — re-export cookies")
if resp.status_code == 429:
wait = 30 * (2 ** attempt)
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
return resp.json()
except Exception as e:
if "Session expired" in str(e):
raise
if attempt == 2:
raise
time.sleep(10 * (attempt + 1))
        return [{}]  # preserve the list shape that callers index into
def search_salaries(
self,
company_id: int,
job_title: str,
location: str = None,
page: int = 1
) -> dict:
"""
Search salary data for a specific company and job title.
Returns paginated results with base/total/additional pay.
"""
payload = [{
"operationName": "SalariesByEmployer",
"variables": {
"employerId": company_id,
"jobTitle": job_title,
"location": location or "",
"page": page,
"pageSize": 20,
"currencyCode": "USD",
},
"query": """
query SalariesByEmployer(
$employerId: Int!,
$jobTitle: String,
$location: String,
$page: Int,
$pageSize: Int,
$currencyCode: String
) {
salariesByEmployer(
employer: { id: $employerId }
jobTitle: $jobTitle
location: $location
pagination: { page: $page, pageSize: $pageSize }
currencyCode: $currencyCode
) {
results {
jobTitle
basePay {
avg
min
max
currency
}
totalPay {
avg
min
max
currency
}
additionalPay {
avg
min
max
}
count
lastUpdated
}
totalCount
hasNextPage
}
}
"""
}]
result = self._make_request(payload)
return result[0].get("data", {}).get("salariesByEmployer", {})
def get_all_salaries_for_role(
self,
company_id: int,
job_title: str,
location: str = None,
max_pages: int = 10
) -> list[dict]:
"""Paginate through all salary data for a role."""
all_results = []
for page in range(1, max_pages + 1):
data = self.search_salaries(company_id, job_title, location, page)
results = data.get("results", [])
if not results:
break
all_results.extend(results)
if not data.get("hasNextPage"):
break
# Respectful delay between pages
time.sleep(random.uniform(3, 8))
return all_results
def get_salary_overview(self, company_id: int) -> list[dict]:
"""Get top-level salary data for all roles at a company."""
payload = [{
"operationName": "EmployerSalaryTrends",
"variables": {
"employerId": company_id,
"numJobTitles": 25,
},
"query": """
query EmployerSalaryTrends($employerId: Int!, $numJobTitles: Int) {
employerSalaryTrends(
employer: { id: $employerId }
numJobTitles: $numJobTitles
) {
jobTitle
count
basePay { avg min max currency }
totalPay { avg min max currency }
}
}
"""
}]
result = self._make_request(payload)
return result[0].get("data", {}).get("employerSalaryTrends", [])
# Usage
cookies = {
"GSESSIONID": "your_session_id",
"gdId": "your_gd_id",
"gdsid": "your_gdsid",
"uc": "your_uc_value",
# Export all Glassdoor cookies from your browser
}
client = GlassdoorSalaryClient(cookies, proxy=PROXY)
# Get salaries for Software Engineers at Google (company_id=9079)
salaries = client.get_all_salaries_for_role(
company_id=9079,
job_title="Software Engineer",
location="San Francisco, CA"
)
print(f"Found {len(salaries)} salary data points:")
for s in salaries:
base = s.get("basePay", {})
total = s.get("totalPay", {})
print(f" {s['jobTitle']}: ${base.get('avg', 0):,.0f} base "
f"(${base.get('min', 0):,.0f}–${base.get('max', 0):,.0f}), "
f"${total.get('avg', 0):,.0f} total comp")
Scraping Reviews via GraphQL
The same GraphQL approach works for reviews. Reviews are the most actively moderated content on Glassdoor, so the API can return censored versions with asterisks for flagged words:
def get_company_reviews(
client: GlassdoorSalaryClient,
company_id: int,
sort: str = "RELEVANCE",
page: int = 1,
job_title: str = None,
) -> dict:
"""
Fetch company reviews via GraphQL.
sort options: RELEVANCE, DATE, RATING_HIGH, RATING_LOW, HELPFUL
"""
payload = [{
"operationName": "EmployerReviews",
"variables": {
"employerId": company_id,
"sort": sort,
"page": page,
"pageSize": 10,
"jobTitle": job_title,
"languageCode": "eng",
},
"query": """
query EmployerReviews(
$employerId: Int!,
$sort: String,
$page: Int,
$pageSize: Int,
$jobTitle: String,
$languageCode: String
) {
employerReviews(
employer: { id: $employerId }
sort: $sort
pagination: { page: $page, pageSize: $pageSize }
jobTitle: $jobTitle
languageCode: $languageCode
) {
reviews {
reviewId
dateTime
jobTitle
location
ratingOverall
ratingCeo
ratingBusinessOutlook
ratingWorkLifeBalance
ratingCultureAndValues
pros
cons
advice
isCurrentEmployee
lengthOfEmployment
employmentStatus
reviewCount
}
totalCount
hasNextPage
}
}
"""
}]
result = client._make_request(payload)
return result[0].get("data", {}).get("employerReviews", {})
def get_interview_questions(
client: GlassdoorSalaryClient,
company_id: int,
job_title: str = None,
page: int = 1,
) -> dict:
"""Fetch interview questions and outcomes for a company."""
payload = [{
"operationName": "EmployerInterviews",
"variables": {
"employerId": company_id,
"jobTitle": job_title,
"page": page,
"pageSize": 10,
},
"query": """
query EmployerInterviews(
$employerId: Int!,
$jobTitle: String,
$page: Int,
$pageSize: Int
) {
employerInterviews(
employer: { id: $employerId }
jobTitle: $jobTitle
pagination: { page: $page, pageSize: $pageSize }
) {
interviews {
interviewId
dateTime
jobTitle
difficulty
experience
outcome
questions {
question
answer
}
duration
interviewProcess
howGotInterview
}
totalCount
hasNextPage
}
}
"""
}]
result = client._make_request(payload)
return result[0].get("data", {}).get("employerInterviews", {})
Building a Salary Dataset Across Companies
Here's a complete pipeline to build a comparative salary dataset for multiple companies:
import sqlite3
import datetime
import time
import random
def setup_salary_db(db_path: str):
"""Create the SQLite schema for salary data."""
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS salary_data (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_id INTEGER,
company_name TEXT,
job_title TEXT,
location TEXT,
base_avg INTEGER,
base_min INTEGER,
base_max INTEGER,
total_avg INTEGER,
total_min INTEGER,
total_max INTEGER,
additional_avg INTEGER,
count INTEGER,
last_updated TEXT,
scraped_at TEXT,
UNIQUE(company_id, job_title, location)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
review_id TEXT PRIMARY KEY,
company_id INTEGER,
company_name TEXT,
date_time TEXT,
job_title TEXT,
location TEXT,
rating_overall INTEGER,
rating_culture INTEGER,
rating_work_life INTEGER,
pros TEXT,
cons TEXT,
advice TEXT,
is_current_employee INTEGER,
scraped_at TEXT
)
""")
conn.commit()
return conn
def scrape_company_salaries(
client: GlassdoorSalaryClient,
company_id: int,
company_name: str,
job_titles: list[str],
db_conn: sqlite3.Connection,
):
"""Scrape salaries for multiple job titles at one company."""
now = datetime.datetime.now().isoformat()
saved = 0
for job_title in job_titles:
print(f" Scraping {job_title} at {company_name}...")
all_salaries = client.get_all_salaries_for_role(
company_id, job_title, max_pages=5
)
for s in all_salaries:
base = s.get("basePay", {})
total = s.get("totalPay", {})
additional = s.get("additionalPay", {})
try:
db_conn.execute("""
INSERT OR REPLACE INTO salary_data
(company_id, company_name, job_title, location,
base_avg, base_min, base_max,
total_avg, total_min, total_max,
additional_avg, count, last_updated, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
company_id, company_name, s.get("jobTitle"), "",
base.get("avg"), base.get("min"), base.get("max"),
total.get("avg"), total.get("min"), total.get("max"),
additional.get("avg"),
s.get("count"), s.get("lastUpdated"), now,
))
saved += 1
except Exception as e:
print(f" DB error: {e}")
db_conn.commit()
time.sleep(random.uniform(5, 12))
return saved
# Build a tech company salary comparison dataset
companies = [
(9079, "Google"),
(1651, "Microsoft"),
(6036, "Amazon"),
(40772, "Meta"),
(11891, "Netflix"),
]
target_roles = [
"Software Engineer",
"Senior Software Engineer",
"Product Manager",
"Data Scientist",
"Engineering Manager",
]
conn = setup_salary_db("glassdoor_salaries.db")
for company_id, company_name in companies:
print(f"\nScraping {company_name} (id={company_id})...")
count = scrape_company_salaries(
client, company_id, company_name, target_roles, conn
)
print(f" Saved {count} salary records")
# Longer pause between companies
time.sleep(random.uniform(30, 60))
conn.close()
Handling CAPTCHA and Session Expiry
Glassdoor's Akamai Bot Manager integration triggers challenges based on these patterns:
- More than ~20-30 requests per minute from the same session
- Missing or expired _abck cookie (Akamai's bot detection cookie)
- Session cookies older than 4-6 hours
- Requests from datacenter IPs (AWS, GCP, Azure are aggressively flagged)
- Requests with missing or unusual fingerprint signals in headers
import time
import random
def resilient_request(client: GlassdoorSalaryClient, func, *args, max_retries=3, **kwargs):
"""Wrapper that handles CAPTCHA and rate limiting gracefully."""
for attempt in range(max_retries):
try:
result = func(*args, **kwargs)
# Check for empty/error response indicating a block
if isinstance(result, dict) and result.get("error"):
raise Exception(f"API error: {result['error']}")
# Respectful delay between requests
time.sleep(random.uniform(3, 9))
return result
except Exception as e:
error_str = str(e).lower()
if "captcha" in error_str or "403" in error_str or "session expired" in error_str:
if attempt < max_retries - 1:
wait = 60 * (2 ** attempt) # 60s, 120s, 240s
print(f"CAPTCHA/block detected (attempt {attempt+1}). Waiting {wait}s...")
print("Consider re-exporting fresh cookies from your browser.")
time.sleep(wait)
else:
print("Max retries reached. Session likely dead — re-export cookies.")
raise
elif "429" in error_str or "rate" in error_str:
wait = 30 * (2 ** attempt)
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
else:
if attempt == max_retries - 1:
raise
time.sleep(10 * (attempt + 1))
return None
# Usage
salaries = resilient_request(
client,
client.search_salaries,
company_id=9079,
job_title="Software Engineer",
location="New York, NY"
)
Proxy Configuration for Glassdoor
Glassdoor blocks all major datacenter IP ranges at the Akamai layer. This is applied before any session or cookie check — a bare AWS IP returns a CAPTCHA page immediately.
Residential proxies from ThorData are one of the few reliable options. Akamai's scoring considers the ASN (autonomous system number) of the connecting IP, and residential IPs score much lower on the bot-probability scale than datacenter ones.
# Configure proxy with session stickiness
# Sticky sessions maintain the same IP for multiple requests
PROXY_BASE = "http://USERNAME:PASSWORD"
PROXY_HOST = "gate.thordata.com:7777"
def get_sticky_proxy(session_label: str) -> str:
"""
Get a sticky proxy URL that maintains the same exit IP.
Use the same session_label for all requests in one scraping session.
"""
return f"{PROXY_BASE}-session-{session_label}@{PROXY_HOST}"
# Important: reuse the same session label across all requests for one Glassdoor session
# This prevents the "teleporting IP" signal that triggers blocks
proxy = get_sticky_proxy("glassdoor-session-001")
client = GlassdoorSalaryClient(cookies, proxy=proxy)
Rate Limits and Practical Throughput
Being realistic about throughput with session cookies and residential proxies:
- ~200-400 salary lookups per hour before sessions start getting CAPTCHA'd
- ~100-200 review fetches per hour (reviews seem more heavily monitored)
- 4-6 hour session lifetime before cookies need refreshing
This is enough to build a dataset for a specific industry or metro area, but scraping all of Glassdoor isn't feasible without significant infrastructure.
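To stay under those ceilings by design rather than by luck, pace requests against an explicit per-hour budget. A minimal sketch (the 300/hour default is an assumption picked from the middle of the salary-lookup range above):

```python
import time
import random

class Pacer:
    """Spaces out calls so a session stays under a per-hour budget,
    with jitter so the interval doesn't look machine-regular."""
    def __init__(self, per_hour: int = 300, jitter: float = 0.3):
        self.base_interval = 3600.0 / per_hour  # seconds between calls
        self.jitter = jitter
        self.last_call = 0.0

    def wait(self):
        # Randomize each interval within +/- jitter of the base
        spread = self.base_interval * self.jitter
        interval = self.base_interval + random.uniform(-spread, spread)
        elapsed = time.monotonic() - self.last_call
        if elapsed < interval:
            time.sleep(interval - elapsed)
        self.last_call = time.monotonic()
```

Call pacer.wait() immediately before each GraphQL request; the jitter keeps intervals from looking machine-regular.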
Practical scaling tips:
- Rotate sessions: Keep 3-5 active Glassdoor accounts with fresh cookies. When one gets CAPTCHA'd, switch to another.
- Stagger requests: Don't run multiple scrapers simultaneously on the same session.
- Cache aggressively: Salary data doesn't change daily — cache by company_id + job_title + location with a 7-day TTL.
- Monitor response quality: An empty results array with totalCount > 0 means you're being rate-limited but not blocked.
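The caching tip is straightforward to implement with SQLite. Below is a sketch of a 7-day TTL cache keyed on (company_id, job_title, location); the table and column names are illustrative and separate from the salary_data schema earlier.

```python
import json
import sqlite3
import time

TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL, per the guidance above

def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS salary_cache (
        company_id INTEGER, job_title TEXT, location TEXT,
        payload TEXT, fetched_at REAL,
        PRIMARY KEY (company_id, job_title, location))""")
    return conn

def cache_get(conn, company_id: int, job_title: str, location: str):
    """Return the cached payload, or None if missing or older than the TTL."""
    row = conn.execute(
        "SELECT payload, fetched_at FROM salary_cache "
        "WHERE company_id=? AND job_title=? AND location=?",
        (company_id, job_title, location)).fetchone()
    if row and time.time() - row[1] < TTL_SECONDS:
        return json.loads(row[0])
    return None

def cache_put(conn, company_id: int, job_title: str, location: str, payload):
    conn.execute(
        "INSERT OR REPLACE INTO salary_cache VALUES (?,?,?,?,?)",
        (company_id, job_title, location, json.dumps(payload), time.time()))
    conn.commit()
```

Check cache_get before calling search_salaries, and cache_put after a successful fetch.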
Analyzing the Data
Once you have salary data in SQLite, you can run useful analyses:
import sqlite3
def compare_salaries(db_path: str, job_title: str):
"""Compare pay across companies for a specific role."""
conn = sqlite3.connect(db_path)
results = conn.execute("""
SELECT
company_name,
ROUND(base_avg) as avg_base,
ROUND(total_avg) as avg_total,
ROUND(total_avg - base_avg) as avg_equity_bonus,
count as data_points
FROM salary_data
WHERE job_title LIKE ?
AND base_avg IS NOT NULL
AND base_avg > 50000
ORDER BY total_avg DESC
""", (f"%{job_title}%",)).fetchall()
print(f"\n{job_title} Compensation Comparison:")
print(f"{'Company':<20} {'Base':>10} {'Total':>10} {'Equity+Bonus':>12} {'n':>5}")
print("-" * 60)
for row in results:
print(f"{row[0]:<20} ${row[1]:>9,.0f} ${row[2]:>9,.0f} ${row[3]:>11,.0f} {row[4]:>5}")
conn.close()
compare_salaries("glassdoor_salaries.db", "Software Engineer")
Legal and Ethical Considerations
Glassdoor's Terms of Service explicitly prohibit scraping. The hiQ Labs v. LinkedIn case (2022) established that scraping publicly accessible data isn't automatically a CFAA violation, but Glassdoor's salary data is behind a login wall — making it less clearly "publicly accessible."
From a practical standpoint: this is self-reported compensation data that employees voluntarily share to increase pay transparency. The ethical case for accessing it is strong. The legal risk depends on scale and commercial use.
Practical guidelines:
- Don't redistribute raw scraped data at scale
- Don't sell Glassdoor data directly
- Cache to avoid unnecessary repeated requests
- Use for research and analysis, not as a competing product
- Don't use the data to identify or contact individuals
Key Takeaways
Glassdoor scraping in 2026:
- Public overview pages: Easy with curl-cffi impersonating Chrome — no authentication needed
- Salary and review data: Requires authenticated sessions via the GraphQL API at https://www.glassdoor.com/graph
- Session management: 4-6 hour lifetime for cookies; maintain 3-5 rotating accounts
- Anti-bot: Akamai Bot Manager blocks datacenter IPs — ThorData residential proxies are required for any serious volume
- Throughput: ~200-400 salary lookups/hour maximum before throttling
- Caching: Salary data changes weekly at most — 7-day TTL cache is appropriate
- Legal risk: Elevated due to login wall — keep volumes reasonable and don't republish raw data