Scraping Glassdoor Interview Questions: Python Guide (2026)
Glassdoor interview reviews are one of the most valuable datasets for job seekers and recruiters. Each review includes the interview questions asked, difficulty rating, experience sentiment, and whether the candidate got an offer. The data is right there in the page source — embedded as Apollo state JSON — if you know where to look.
This guide covers extracting interview data from Glassdoor's embedded state, using their GraphQL API, dealing with their anti-bot protections in 2026, and storing the results in SQLite for analysis.
What Data Is Available
Each Glassdoor interview review contains:
- Job title — the role the candidate interviewed for
- Review date — when the interview took place
- Experience sentiment — POSITIVE, NEGATIVE, or NEUTRAL overall experience
- Difficulty rating — 1 (very easy) to 5 (very difficult) scale
- Offer received — ACCEPTED, DECLINED, NO_OFFER
- How they applied — online application, recruiter contact, employee referral, etc.
- Interview process description — free-text walkthrough of the interview stages
- Interview questions — the actual questions asked, as text
- Reviewer job level — sometimes includes seniority information
Across thousands of companies, this dataset lets you answer questions like: which companies have the hardest interviews? Which job titles get the most offers? What questions come up repeatedly at FAANG companies? What's the conversion rate at different stages of a company's process?
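For reference, a single parsed review ends up shaped like this. The field names match the parser later in this guide; the values are purely illustrative:

```python
# An illustrative parsed review record. The keys match what
# parse_interview_reviews() below produces; the values are made up.
example_review = {
    "job_title": "Software Engineer",
    "date": "2026-01-14T09:30:00",
    "experience": "POSITIVE",          # POSITIVE / NEGATIVE / NEUTRAL
    "difficulty": 3,                   # 1 (very easy) to 5 (very difficult)
    "offer": "ACCEPTED",               # ACCEPTED / DECLINED / NO_OFFER
    "application_method": "Applied online",
    "process_description": "Phone screen, then a virtual onsite loop.",
    "questions": [
        "Design a URL shortener.",
        "Tell me about a conflict with a coworker.",
    ],
    "interview_stages": [],
    "source": "apollo_state",
}
```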
Apollo State Extraction
Glassdoor uses Apollo Client for their GraphQL state management. The entire page's data is serialized in a window.__apolloState__ or __NEXT_DATA__ script tag embedded in the HTML. You don't need to render JavaScript — just parse the HTML and extract the JSON blob.
# glassdoor_scraper.py
import httpx
import json
import re
import time
import random
from bs4 import BeautifulSoup
def fetch_glassdoor_page(url: str, proxy: str = None) -> str:
"""Fetch a Glassdoor page with browser-like headers."""
headers = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/125.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Ch-Ua": '"Chromium";v="125", "Not=A?Brand";v="8"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"macOS"',
"Cache-Control": "max-age=0",
"Upgrade-Insecure-Requests": "1",
}
client_kwargs = {
"headers": headers,
"follow_redirects": True,
"timeout": 30,
}
if proxy:
client_kwargs["proxy"] = proxy
with httpx.Client(**client_kwargs) as client:
# Warm up with homepage visit to collect cookies
try:
client.get("https://www.glassdoor.com/")
time.sleep(random.uniform(1.5, 3.0))
except httpx.RequestError:
pass
resp = client.get(url, headers={**headers, "Referer": "https://www.glassdoor.com/"})
if resp.status_code == 403:
raise RuntimeError(f"Blocked (403) on {url} — need residential proxy")
if resp.status_code == 429:
raise RuntimeError(f"Rate limited (429) on {url}")
if resp.status_code != 200:
raise RuntimeError(f"Unexpected status {resp.status_code}")
return resp.text
def extract_apollo_state(html: str) -> dict:
"""Extract Apollo state JSON from Glassdoor page source."""
soup = BeautifulSoup(html, "lxml")
# Method 1: __NEXT_DATA__ (most common in 2026)
next_data = soup.find("script", id="__NEXT_DATA__")
if next_data and next_data.string:
try:
data = json.loads(next_data.string)
return data.get("props", {}).get("pageProps", {})
except json.JSONDecodeError:
pass
# Method 2: Apollo state in inline script
for script in soup.find_all("script"):
content = script.string or ""
if "apolloState" in content:
match = re.search(
r'window\.__apolloState__\s*=\s*({.*?});\s*(?:window|var|let|const)',
content,
re.DOTALL
)
if match:
try:
return json.loads(match.group(1))
except json.JSONDecodeError:
pass
# Method 3: Serialized props with interview data
for script in soup.find_all("script"):
content = script.string or ""
if "interviewReviews" in content or "interviewExperience" in content:
# Try to find JSON blob
match = re.search(r'\{.*"interviewReviews".*\}', content, re.DOTALL)
if match:
try:
return json.loads(match.group(0))
except json.JSONDecodeError:
pass
return {}
Parsing Interview Data
Once you have the Apollo state, interview reviews are nested under predictable keys:
def parse_interview_reviews(apollo_state: dict) -> list:
"""Extract interview reviews from Apollo state."""
reviews = []
if not apollo_state:
return reviews
# Walk the nested structure to find interviewReviews
interview_data = find_nested_key(apollo_state, "interviewReviews")
if not interview_data:
# Also try alternate key names
interview_data = find_nested_key(apollo_state, "reviews")
if not interview_data or not isinstance(interview_data, list):
return reviews
for review in interview_data:
if not isinstance(review, dict):
continue
parsed = {
"job_title": _safe_text(review.get("jobTitle")),
"date": review.get("reviewDateTime") or review.get("reviewDate"),
"experience": review.get("interviewExperience"), # POSITIVE/NEGATIVE/NEUTRAL
"difficulty": review.get("interviewDifficulty"), # 1-5
"offer": review.get("interviewOffer"), # ACCEPTED/DECLINED/NO_OFFER
"application_method": review.get("interviewApplication"),
"process_description": _safe_text(review.get("interviewProcess")),
"questions": [],
"interview_stages": [],
"source": "apollo_state",
}
# Extract interview questions
        for q in review.get("interviewQuestions") or []:  # "or []" guards an explicit null
if isinstance(q, dict):
text = q.get("text") or q.get("question") or q.get("body")
if text:
parsed["questions"].append(text.strip())
elif isinstance(q, str) and q.strip():
parsed["questions"].append(q.strip())
# Extract interview stages/rounds if present
        for stage in review.get("interviewStages") or []:  # guard an explicit null
            if isinstance(stage, dict):
                name = stage.get("name") or stage.get("type")
                if name:
                    parsed["interview_stages"].append(name)
if parsed.get("job_title") or parsed.get("process_description"):
reviews.append(parsed)
return reviews
def _safe_text(value) -> str:
"""Safely extract text from a string or dict with a text key."""
if isinstance(value, str):
return value.strip()
if isinstance(value, dict):
return (value.get("text") or value.get("value") or "").strip()
return ""
def find_nested_key(data, target_key: str):
"""Recursively search for a key in nested dict/list."""
if isinstance(data, dict):
if target_key in data:
return data[target_key]
for value in data.values():
result = find_nested_key(value, target_key)
if result is not None:
return result
elif isinstance(data, list):
for item in data:
result = find_nested_key(item, target_key)
if result is not None:
return result
return None
Scraping Company Interview Pages
def scrape_company_interviews(
company_slug: str,
company_id: str,
num_pages: int = 5,
proxy: str = None,
) -> list:
"""
Scrape interview reviews for a company.
company_slug: URL slug, e.g. "Google"
company_id: numeric Glassdoor employer ID, e.g. "9079"
"""
all_reviews = []
for page in range(1, num_pages + 1):
if page == 1:
url = f"https://www.glassdoor.com/Interview/{company_slug}-Interview-Questions-E{company_id}.htm"
else:
url = f"https://www.glassdoor.com/Interview/{company_slug}-Interview-Questions-E{company_id}_P{page}.htm"
print(f"Fetching page {page}/{num_pages}: {url}")
try:
html = fetch_glassdoor_page(url, proxy=proxy)
state = extract_apollo_state(html)
reviews = parse_interview_reviews(state)
all_reviews.extend(reviews)
print(f" Extracted {len(reviews)} reviews (total: {len(all_reviews)})")
except RuntimeError as e:
print(f" Error on page {page}: {e}")
break
except Exception as e:
print(f" Unexpected error on page {page}: {e}")
break
if page < num_pages:
delay = random.uniform(8.0, 15.0)
print(f" Waiting {delay:.1f}s...")
time.sleep(delay)
return all_reviews
GraphQL API Approach
Glassdoor also has a GraphQL API endpoint used by their frontend. You can query it directly if you have the right cookies and headers from a valid browser session:
def glassdoor_graphql(
query: str,
variables: dict,
cookies: dict,
proxy: str = None,
) -> dict:
"""
Execute a Glassdoor GraphQL query.
Requires cookies from a valid Glassdoor browser session.
cookies: dict with at minimum gdId, GSESSIONID, at
"""
# Extract CSRF token from cookies
csrf_token = cookies.get("gdId", "")
headers = {
"User-Agent": (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
),
"Content-Type": "application/json",
"gd-csrf-token": csrf_token,
"apollographql-client-name": "job-interview",
"apollographql-client-version": "2.0.0",
"Referer": "https://www.glassdoor.com/",
"Origin": "https://www.glassdoor.com",
}
client_kwargs = {"timeout": 30}
if proxy:
client_kwargs["proxy"] = proxy
with httpx.Client(**client_kwargs) as client:
resp = client.post(
"https://www.glassdoor.com/graph",
json={"query": query, "variables": variables},
headers=headers,
cookies=cookies,
)
resp.raise_for_status()
return resp.json()
# GraphQL query for interview reviews
INTERVIEW_QUERY = """
query InterviewReviews($employerId: Int!, $page: Int, $pageSize: Int) {
employer(id: $employerId) {
id
name
interviewReviews(page: $page, pageSize: $pageSize) {
reviews {
id
jobTitle {
text
}
reviewDateTime
interviewExperience
interviewDifficulty
interviewOffer
interviewApplication
interviewProcess {
text
}
interviewQuestions {
text
}
}
totalCount
currentPage
totalPages
}
}
}
"""
def fetch_all_gql_reviews(
employer_id: int,
cookies: dict,
max_pages: int = 10,
proxy: str = None,
) -> list:
"""Fetch all interview reviews via Glassdoor GraphQL API."""
all_reviews = []
for page in range(1, max_pages + 1):
try:
result = glassdoor_graphql(
INTERVIEW_QUERY,
variables={"employerId": employer_id, "page": page, "pageSize": 20},
cookies=cookies,
proxy=proxy,
)
employer_data = result.get("data", {}).get("employer", {})
reviews_data = employer_data.get("interviewReviews", {})
reviews = reviews_data.get("reviews", [])
if not reviews:
break
all_reviews.extend(reviews)
total_pages = reviews_data.get("totalPages", 1)
print(f"GQL page {page}/{total_pages}: {len(reviews)} reviews")
if page >= total_pages:
break
time.sleep(random.uniform(3.0, 7.0))
except Exception as e:
print(f"GQL error on page {page}: {e}")
break
return all_reviews
Anti-Bot Measures
Glassdoor uses multiple layers of bot detection that require different countermeasures.
Cloudflare Protection
Glassdoor sits behind Cloudflare with JavaScript challenges enabled. Standard HTTP requests without JavaScript execution get a challenge page instead of content. The Apollo state extraction works only when you get past Cloudflare — which requires a browser-like TLS fingerprint.
Cookie Wall
Glassdoor requires accepting cookies before showing content. Without the right cookie consent state, you get redirected to a modal/wall that blocks the actual data.
Rate Limiting
Glassdoor rate-limits by IP and session. More than 20-30 page loads in quick succession triggers CAPTCHAs. After several CAPTCHA triggers, the IP gets temporarily blocked.
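The thresholds above are observations and can change. A simple countermeasure is to pace requests and back off exponentially when a 429 does slip through. This sketch wraps any fetcher that follows the error convention used by fetch_glassdoor_page above (a RuntimeError whose message contains the status code):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 10.0, cap: float = 300.0) -> float:
    """Exponential backoff with jitter: ~10s, ~20s, ~40s, ... capped at 5 min."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.8, 1.2)

def fetch_with_backoff(fetch, url: str, max_attempts: int = 4) -> str:
    """Call fetch(url), retrying with backoff when it raises a 429 error.

    'fetch' is any callable like fetch_glassdoor_page that raises
    RuntimeError with '429' in the message when rate limited.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except RuntimeError as exc:
            if "429" not in str(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
    raise RuntimeError("unreachable")  # loop always returns or raises
```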
Session Fingerprinting
Glassdoor tracks session consistency. If your cookies, User-Agent, and IP don't match across requests, the session gets invalidated. This means you can't easily rotate IPs mid-session.
Handling It All
For reliable Glassdoor scraping, you need three things: residential IPs, consistent sessions, and browser-level TLS fingerprints.
ThorData's residential proxies handle the IP side — clean residential IPs that pass Cloudflare's reputation checks. Pair that with sticky sessions (same IP for the duration of a scraping session) and curl_cffi for TLS fingerprinting:
from curl_cffi import requests as curl_requests
def fetch_with_fingerprint(url: str, proxy_url: str) -> str:
"""
Fetch Glassdoor page with browser-like TLS fingerprint.
curl_cffi impersonates Chrome at the TLS layer, bypassing
fingerprint-based bot detection.
"""
resp = curl_requests.get(
url,
impersonate="chrome124",
proxies={"https": proxy_url},
timeout=30,
headers={
"Accept-Language": "en-US,en;q=0.9",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
)
if resp.status_code != 200:
raise RuntimeError(f"Status {resp.status_code}")
return resp.text
Install with: pip install curl_cffi
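Even with a good fingerprint, a challenge page can come back with a 200 status, so check the body before parsing. The marker strings below are common Cloudflare challenge-page artifacts, not a guaranteed contract:

```python
# Heuristic markers that commonly appear on Cloudflare challenge pages.
CHALLENGE_MARKERS = (
    "cf-challenge",
    "checking your browser",
    "just a moment...",
    "turnstile",
)

def looks_like_challenge(html: str) -> bool:
    """Return True if the response body looks like a Cloudflare challenge
    page rather than real Glassdoor content."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

Call this on the result of fetch_with_fingerprint and retry with a fresh session when it returns True.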
SQLite Storage Schema
import sqlite3
import json
def init_glassdoor_db(db_path: str = "glassdoor_interviews.db") -> sqlite3.Connection:
"""Initialize SQLite database for Glassdoor interview data."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS companies (
id INTEGER PRIMARY KEY AUTOINCREMENT,
slug TEXT UNIQUE,
glassdoor_id TEXT,
name TEXT,
industry TEXT,
added_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS interview_reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_slug TEXT,
job_title TEXT,
review_date TEXT,
experience TEXT,
difficulty INTEGER,
offer TEXT,
application_method TEXT,
process_description TEXT,
questions TEXT,
interview_stages TEXT,
source TEXT DEFAULT 'html',
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (company_slug) REFERENCES companies(slug)
);
CREATE TABLE IF NOT EXISTS questions_flat (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_slug TEXT,
job_title TEXT,
question_text TEXT,
review_date TEXT,
difficulty INTEGER
);
CREATE INDEX IF NOT EXISTS idx_reviews_company ON interview_reviews (company_slug);
CREATE INDEX IF NOT EXISTS idx_questions_company ON questions_flat (company_slug);
CREATE INDEX IF NOT EXISTS idx_questions_title ON questions_flat (job_title);
""")
conn.commit()
return conn
def save_reviews(conn: sqlite3.Connection, company_slug: str, reviews: list) -> int:
"""Save interview reviews and flatten questions into separate table."""
saved = 0
for r in reviews:
try:
cursor = conn.execute(
"""
INSERT INTO interview_reviews
(company_slug, job_title, review_date, experience, difficulty,
offer, application_method, process_description, questions,
interview_stages, source)
VALUES (?,?,?,?,?,?,?,?,?,?,?)
""",
(
company_slug,
r.get("job_title"),
r.get("date"),
r.get("experience"),
r.get("difficulty"),
r.get("offer"),
r.get("application_method"),
r.get("process_description"),
json.dumps(r.get("questions", [])),
json.dumps(r.get("interview_stages", [])),
r.get("source", "html"),
),
)
review_id = cursor.lastrowid
saved += 1
# Flatten questions into questions_flat for easy querying
for question in r.get("questions", []):
if question and len(question) > 10:
conn.execute(
"""INSERT INTO questions_flat
(company_slug, job_title, question_text, review_date, difficulty)
VALUES (?,?,?,?,?)""",
(
company_slug,
r.get("job_title"),
question,
r.get("date"),
r.get("difficulty"),
),
)
except Exception as e:
print(f" Save error: {e}")
conn.commit()
return saved
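One gap in the schema above: re-running the pipeline inserts every review again. A cheap fix is to fingerprint each review and enforce uniqueness. This is a sketch layered on top of the existing schema; the dedup_hash column and review_hash helper are additions of mine, not part of the original design:

```python
import hashlib
import sqlite3

def review_hash(company_slug: str, r: dict) -> str:
    """Stable fingerprint for a review based on its identifying fields."""
    key = "|".join([
        company_slug,
        str(r.get("job_title") or ""),
        str(r.get("date") or ""),
        str(r.get("process_description") or "")[:200],
    ])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def add_dedup_column(conn: sqlite3.Connection) -> None:
    """Add a unique dedup_hash column to interview_reviews if missing."""
    cols = [row[1] for row in conn.execute("PRAGMA table_info(interview_reviews)")]
    if "dedup_hash" not in cols:
        conn.execute("ALTER TABLE interview_reviews ADD COLUMN dedup_hash TEXT")
        conn.execute(
            "CREATE UNIQUE INDEX IF NOT EXISTS idx_reviews_hash "
            "ON interview_reviews (dedup_hash)"
        )
```

Then write rows with INSERT OR IGNORE, passing review_hash(company_slug, r) as dedup_hash; re-runs skip duplicates silently.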
Analyzing Interview Difficulty
Once you have the data, you can build useful analyses:
from collections import Counter
def analyze_company_interviews(conn: sqlite3.Connection, company_slug: str) -> dict:
"""Generate interview statistics for a company."""
rows = conn.execute(
"""
SELECT difficulty, experience, offer, job_title
FROM interview_reviews
WHERE company_slug = ?
""",
(company_slug,)
).fetchall()
if not rows:
return {"error": "No data found"}
difficulties = [r[0] for r in rows if r[0] is not None]
experiences = [r[1] for r in rows if r[1]]
offers = [r[2] for r in rows if r[2]]
titles = [r[3] for r in rows if r[3]]
avg_difficulty = sum(difficulties) / len(difficulties) if difficulties else 0
exp_counts = Counter(experiences)
offer_counts = Counter(offers)
title_counts = Counter(titles)
# Question frequency analysis
q_rows = conn.execute(
"""
SELECT question_text, COUNT(*) as freq
FROM questions_flat
WHERE company_slug = ?
GROUP BY question_text
ORDER BY freq DESC
LIMIT 20
""",
(company_slug,)
).fetchall()
return {
"total_reviews": len(rows),
"avg_difficulty": round(avg_difficulty, 2),
"experience_breakdown": dict(exp_counts.most_common()),
"offer_breakdown": dict(offer_counts.most_common()),
"top_job_titles": title_counts.most_common(10),
"most_frequent_questions": [
{"question": r[0][:200], "frequency": r[1]}
for r in q_rows
],
}
def compare_companies_difficulty(conn: sqlite3.Connection, slugs: list) -> list:
"""Compare interview difficulty across companies."""
results = []
for slug in slugs:
row = conn.execute(
"""
SELECT
AVG(CASE WHEN difficulty IS NOT NULL THEN difficulty END) as avg_diff,
COUNT(*) as total,
SUM(CASE WHEN experience='POSITIVE' THEN 1 ELSE 0 END) as positive,
SUM(CASE WHEN offer='ACCEPTED' THEN 1 ELSE 0 END) as offers
FROM interview_reviews WHERE company_slug=?
""",
(slug,)
).fetchone()
if row and row[1] > 0:
results.append({
"company": slug,
"avg_difficulty": round(row[0] or 0, 2),
"total_reviews": row[1],
"positive_pct": round((row[2] or 0) / row[1] * 100, 1),
"offer_pct": round((row[3] or 0) / row[1] * 100, 1),
})
return sorted(results, key=lambda x: x["avg_difficulty"], reverse=True)
Complete Pipeline
def run_glassdoor_pipeline(
companies: list,
db_path: str = "glassdoor_interviews.db",
proxy: str = None,
pages_per_company: int = 5,
):
"""
companies: list of dicts with 'slug', 'id', 'name' keys
e.g. [{"slug": "Google", "id": "9079", "name": "Google"}]
"""
conn = init_glassdoor_db(db_path)
for company in companies:
slug = company["slug"]
company_id = company["id"]
print(f"\nScraping {company['name']} (ID: {company_id})...")
# Ensure company record exists
conn.execute(
"INSERT OR IGNORE INTO companies (slug, glassdoor_id, name) VALUES (?,?,?)",
(slug, company_id, company["name"])
)
conn.commit()
try:
reviews = scrape_company_interviews(
company_slug=slug,
company_id=company_id,
num_pages=pages_per_company,
proxy=proxy,
)
saved = save_reviews(conn, slug, reviews)
print(f"Saved {saved} reviews for {company['name']}")
            # Print quick stats (guard against the no-data case, where
            # analyze_company_interviews returns an "error" dict)
            stats = analyze_company_interviews(conn, slug)
            if "error" not in stats:
                print(f"  Avg difficulty: {stats['avg_difficulty']}/5")
                print(f"  Experience: {stats['experience_breakdown']}")
except Exception as e:
print(f"Error processing {company['name']}: {e}")
# Longer delay between companies
time.sleep(random.uniform(30.0, 60.0))
conn.close()
# Usage
PROXY = "http://user:pass@your-proxy-gateway:9000"  # your provider's sticky-session endpoint
COMPANIES = [
{"slug": "Google", "id": "9079", "name": "Google"},
{"slug": "Meta", "id": "40772", "name": "Meta"},
{"slug": "Amazon-com", "id": "6036", "name": "Amazon"},
{"slug": "Microsoft", "id": "1651", "name": "Microsoft"},
{"slug": "Apple", "id": "1138", "name": "Apple"},
]
run_glassdoor_pipeline(COMPANIES, proxy=PROXY, pages_per_company=3)
Practical Tips
Extract Apollo state first. Don't render JavaScript if you don't have to. The data you need is usually serialized in the page source. Check __NEXT_DATA__ and __apolloState__ script tags before reaching for a headless browser.
Use sticky sessions. Glassdoor invalidates sessions when the IP changes. Configure your proxy for session persistence — same IP for at least 10-15 minutes per company.
Limit pages per session. Don't scrape more than 5-10 pages per session. Start a new session with a new IP for the next batch. This keeps you well under the rate limit threshold.
Store raw HTML. Save the full page source before parsing. Glassdoor's internal data structure changes periodically — having the raw HTML lets you re-parse without re-scraping.
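A minimal sketch of that: gzip each page to disk, keyed by a hash of its URL. The directory layout and naming scheme here are arbitrary choices:

```python
import gzip
import hashlib
from pathlib import Path

def archive_html(url: str, html: str, root: str = "raw_html") -> Path:
    """Save raw page source as gzip, named by a hash of the URL."""
    out_dir = Path(root)
    out_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html.gz"
    path = out_dir / name
    path.write_bytes(gzip.compress(html.encode("utf-8")))
    return path

def load_archived_html(path: Path) -> str:
    """Read an archived page back for re-parsing without re-scraping."""
    return gzip.decompress(path.read_bytes()).decode("utf-8")
```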
Check the JSON structure. Glassdoor's Apollo state format changes between page types and over time. Always inspect the actual JSON before writing parsing code. A missing nested key is the most common failure mode.
The GraphQL endpoint requires fresh cookies. The gdId cookie used as a CSRF token rotates. If your GraphQL requests return 401 or empty data, your cookies have expired. Re-authenticate by visiting Glassdoor in a browser and extracting fresh cookies.
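One low-friction way to manage that is to export cookies from a logged-in browser session (any cookie-editor extension can dump JSON) and load only the names the GraphQL code above needs. The export format here is an assumption about a typical extension dump, a list of objects with name and value keys:

```python
import json

# Cookie names the GraphQL approach above expects.
REQUIRED_COOKIES = ("gdId", "GSESSIONID", "at")

def load_cookies(path: str) -> dict:
    """Load a browser-extension JSON cookie export into a name->value dict,
    keeping only the cookies the GraphQL endpoint needs."""
    with open(path, encoding="utf-8") as f:
        exported = json.load(f)  # assumed: list of {"name": ..., "value": ...}
    cookies = {
        c["name"]: c["value"]
        for c in exported
        if c.get("name") in REQUIRED_COOKIES
    }
    missing = [name for name in REQUIRED_COOKIES if name not in cookies]
    if missing:
        raise ValueError(f"Cookie export missing {missing}; re-export from a fresh session")
    return cookies
```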
Residential proxies are required. Glassdoor's Cloudflare configuration blocks virtually all datacenter IPs on reputation alone, before TLS fingerprinting even comes into play. ThorData provides the residential IP rotation needed to get past that initial check.
Legal Notes
Glassdoor's Terms of Service prohibit automated data collection. Section 2.3 explicitly disallows scraping, crawling, or using automated means to access the site, and Glassdoor has actively pursued enforcement against commercial scrapers.
For non-commercial research and personal use, small-scale scraping is low risk. For any commercial application, consider whether a licensed data source, such as LinkedIn's official Talent Insights product or another provider that licenses employer and hiring data, could meet your needs without the legal exposure.
The interview questions themselves are user-generated content posted publicly — there's a reasonable argument that aggregating them doesn't infringe copyright, but Glassdoor's ToS still applies to the method of access.