Scraping Glassdoor: Salaries, Reviews, and Interview Questions (2026)
Glassdoor is one of the hardest sites to scrape in 2026. It gates almost everything useful behind a login wall, uses aggressive bot detection via Akamai Bot Manager, and renders most content dynamically. But the salary and review data is genuinely valuable — it's the largest public dataset of self-reported compensation and workplace feedback. This guide covers exactly what you can get, how to get it, and what it takes to do so reliably.
What's Public vs. What's Gated
Glassdoor shows a surprising amount on company overview pages without login:
No login required:
- Company name, logo, overall rating (1-5 stars)
- Number of reviews, number of salary reports
- Company size, headquarters, industry, revenue range
- "Featured" review snippets (1-2 per page)
- Job listings (redirects to Glassdoor's job board)
- High-level rating breakdowns (culture, management, etc.)
Login required (the useful stuff):
- Full salary ranges by job title, base/bonus/equity splits
- Complete review text with pros/cons/advice to management
- Interview questions, difficulty ratings, offer outcomes
- Benefits ratings and detailed breakdowns
- CEO approval ratings over time
- Individual salaries with location and experience data
The login wall is the core challenge. Glassdoor wants you to contribute a review or salary report before showing you the full dataset. They enforce this even for logged-in users who haven't contributed (the "give to get" model).
Setting Up Your Scraping Environment
pip install curl-cffi beautifulsoup4 playwright
playwright install chromium
You'll also need:
- A Glassdoor account (free to create)
- Valid session cookies from that account
- Residential proxies for anything beyond very light usage
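If you'd rather not paste cookie values by hand, most cookie-export browser extensions write a JSON list of `{"name": ..., "value": ...}` objects. A small loader (a sketch, assuming that common export format) turns the file into the dict the later examples use:

```python
import json

def load_cookies(path: str) -> dict:
    """Load cookies exported as a JSON list of {"name": ..., "value": ...}
    objects -- the format most cookie-export extensions produce."""
    with open(path) as f:
        raw = json.load(f)
    return {c["name"]: c["value"] for c in raw}
```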
Scraping Public Company Data
The company overview page has structured data you can get without authentication. The key is using curl-cffi to bypass TLS fingerprinting — the standard requests library gets blocked at the Akamai layer before it even reaches Glassdoor's application servers:
from curl_cffi import requests as cffi_requests
from bs4 import BeautifulSoup
import json
import re
import time
import random
# ThorData residential proxy for Glassdoor
PROXY = "http://USERNAME:[email protected]:7777"
def get_company_overview(company_slug: str, proxy: str = None) -> dict:
"""
Get public company data from Glassdoor overview page.
company_slug examples: 'Google', 'Apple', 'meta-platforms'
"""
url = f"https://www.glassdoor.com/Overview/Working-at-{company_slug}.htm"
session = cffi_requests.Session(impersonate="chrome124")
if proxy:
session.proxies = {"http": proxy, "https": proxy}
resp = session.get(url, headers={
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
})
if resp.status_code != 200:
return {"error": f"Status {resp.status_code}", "url": url}
soup = BeautifulSoup(resp.text, "html.parser")
# Glassdoor embeds Apollo GraphQL cache as JSON in a script tag
script = soup.find("script", string=re.compile("window\\.__APOLLO_STATE__"))
if not script:
# Try __NEXT_DATA__ as fallback
next_script = soup.find("script", {"id": "__NEXT_DATA__"})
if next_script:
try:
next_data = json.loads(next_script.string)
return _extract_from_next_data(next_data)
except (json.JSONDecodeError, TypeError):
pass
return {"error": "No Apollo state found — likely blocked or page structure changed"}
match = re.search(r"window\.__APOLLO_STATE__\s*=\s*({.+?});", script.string, re.DOTALL)
if not match:
return {"error": "Could not parse Apollo state"}
try:
state = json.loads(match.group(1))
except json.JSONDecodeError:
return {"error": "Apollo state JSON parse failed"}
# Extract employer data from Apollo cache
employer_keys = [k for k in state if k.startswith("Employer:")]
if not employer_keys:
return {"error": "No employer data in Apollo cache"}
emp = state[employer_keys[0]]
return {
"name": emp.get("shortName") or emp.get("name"),
"rating": emp.get("overallRating"),
"review_count": emp.get("numberOfRatings"),
"ceo_approval": emp.get("ceo", {}).get("pctApprove") if isinstance(emp.get("ceo"), dict) else None,
"size": emp.get("size"),
"industry": emp.get("primaryIndustry", {}).get("industryName") if isinstance(emp.get("primaryIndustry"), dict) else None,
"revenue": emp.get("revenue"),
"headquarters": emp.get("headquarters"),
"website": emp.get("website"),
"founded": emp.get("yearFounded"),
"description": emp.get("squareLogoUrl"),
"culture_rating": emp.get("ratingCulture"),
"work_life_rating": emp.get("ratingWorkLife"),
"career_rating": emp.get("ratingCareerOpportunities"),
"comp_benefits_rating": emp.get("ratingCompensationAndBenefits"),
"management_rating": emp.get("ratingSeniorLeadership"),
}
def _extract_from_next_data(data: dict) -> dict:
"""Fallback: extract employer info from Next.js page data."""
emp = (data.get("props", {})
.get("pageProps", {})
.get("employerReviews", {})
.get("employer", {}))
return {
"name": emp.get("shortName"),
"rating": emp.get("ratings", {}).get("overallRating"),
"review_count": emp.get("numberOfRatings"),
"size": emp.get("size"),
}
Finding Glassdoor Company IDs
Before using the GraphQL API for salaries and reviews, you need the numeric employer ID. You can extract it from the overview page URL or the Apollo state:
def get_company_id(company_slug: str, proxy: str = None) -> int | None:
"""Extract the numeric Glassdoor employer ID from a company page."""
url = f"https://www.glassdoor.com/Overview/Working-at-{company_slug}.htm"
session = cffi_requests.Session(impersonate="chrome124")
if proxy:
session.proxies = {"http": proxy, "https": proxy}
resp = session.get(url)
# Method 1: Extract from URL redirect (Glassdoor adds EI_IE{id} to URLs)
match = re.search(r"EI_IE(\d+)\.htm", resp.url)
if match:
return int(match.group(1))
# Method 2: Extract from Apollo state
match = re.search(r'"Employer:(\d+)"', resp.text)
if match:
return int(match.group(1))
# Method 3: Extract from page HTML
match = re.search(r'"employerId"\s*:\s*(\d+)', resp.text)
if match:
return int(match.group(1))
return None
# Common company IDs for reference:
# Google: 9079, Apple: 1138, Amazon: 6036, Microsoft: 1651
# Meta: 40772, Netflix: 11891, Airbnb: 391850
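If you're starting from plain company names rather than known slugs, a best-effort slug builder saves manual URL guessing. This is a sketch: the slug rules are inferred from the URL patterns above, not documented by Glassdoor, so verify the resulting URL before scraping at volume.

```python
import re

def company_name_to_slug(name: str) -> str:
    """Best-effort conversion of a company name to a Glassdoor
    overview-page slug, e.g. 'Meta Platforms' -> 'Meta-Platforms'.
    The exact rules are an assumption; check the real URL first."""
    # Drop punctuation that doesn't appear in Glassdoor slugs
    cleaned = re.sub(r"[^A-Za-z0-9 \-]", "", name).strip()
    # Collapse whitespace runs into single hyphens
    return re.sub(r"\s+", "-", cleaned)

print(company_name_to_slug("Meta Platforms, Inc."))  # Meta-Platforms-Inc
```

Feed the result to get_company_overview or get_company_id above.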
The Unofficial GraphQL API: Authenticated Salary Search
The real salary data lives behind Glassdoor's GraphQL API at https://www.glassdoor.com/graph. To access it, you need session cookies from a legitimate login. Export them from your browser's DevTools (Application > Cookies) after logging in:
from curl_cffi import requests as cffi_requests
import json
import time
import random
class GlassdoorSalaryClient:
GRAPH_URL = "https://www.glassdoor.com/graph"
def __init__(self, cookies: dict, proxy: str = None):
"""
cookies: dict exported from browser DevTools after login.
Required: GSESSIONID, gdId, gdsid — plus any others Glassdoor sets.
Export process:
1. Log in to glassdoor.com in Chrome
2. Open DevTools > Application > Cookies > glassdoor.com
3. Copy all cookie name/value pairs
"""
self.session = cffi_requests.Session(impersonate="chrome124")
if proxy:
self.session.proxies = {"http": proxy, "https": proxy}
for name, value in cookies.items():
self.session.cookies.set(name, value, domain=".glassdoor.com")
self.headers = {
"Content-Type": "application/json",
"Accept": "application/json",
"gd-csrf-token": cookies.get("gdId", ""),
"Referer": "https://www.glassdoor.com/Salaries/",
"Origin": "https://www.glassdoor.com",
}
def _make_request(self, payload: list) -> dict:
"""Send a GraphQL request with retry logic."""
for attempt in range(3):
try:
resp = self.session.post(
self.GRAPH_URL,
headers=self.headers,
json=payload,
timeout=20,
)
if resp.status_code == 403:
raise Exception("Session expired or CAPTCHA triggered — re-export cookies")
if resp.status_code == 429:
wait = 30 * (2 ** attempt)
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
return resp.json()
except Exception as e:
if "Session expired" in str(e):
raise
if attempt == 2:
raise
time.sleep(10 * (attempt + 1))
        return [{}]  # preserve the list shape that callers index into
def search_salaries(
self,
company_id: int,
job_title: str,
location: str = None,
page: int = 1
) -> dict:
"""
Search salary data for a specific company and job title.
Returns paginated results with base/total/additional pay.
"""
payload = [{
"operationName": "SalariesByEmployer",
"variables": {
"employerId": company_id,
"jobTitle": job_title,
"location": location or "",
"page": page,
"pageSize": 20,
"currencyCode": "USD",
},
"query": """
query SalariesByEmployer(
$employerId: Int!,
$jobTitle: String,
$location: String,
$page: Int,
$pageSize: Int,
$currencyCode: String
) {
salariesByEmployer(
employer: { id: $employerId }
jobTitle: $jobTitle
location: $location
pagination: { page: $page, pageSize: $pageSize }
currencyCode: $currencyCode
) {
results {
jobTitle
basePay {
avg
min
max
currency
}
totalPay {
avg
min
max
currency
}
additionalPay {
avg
min
max
}
count
lastUpdated
}
totalCount
hasNextPage
}
}
"""
}]
result = self._make_request(payload)
return result[0].get("data", {}).get("salariesByEmployer", {})
def get_all_salaries_for_role(
self,
company_id: int,
job_title: str,
location: str = None,
max_pages: int = 10
) -> list[dict]:
"""Paginate through all salary data for a role."""
all_results = []
for page in range(1, max_pages + 1):
data = self.search_salaries(company_id, job_title, location, page)
results = data.get("results", [])
if not results:
break
all_results.extend(results)
if not data.get("hasNextPage"):
break
# Respectful delay between pages
time.sleep(random.uniform(3, 8))
return all_results
def get_salary_overview(self, company_id: int) -> list[dict]:
"""Get top-level salary data for all roles at a company."""
payload = [{
"operationName": "EmployerSalaryTrends",
"variables": {
"employerId": company_id,
"numJobTitles": 25,
},
"query": """
query EmployerSalaryTrends($employerId: Int!, $numJobTitles: Int) {
employerSalaryTrends(
employer: { id: $employerId }
numJobTitles: $numJobTitles
) {
jobTitle
count
basePay { avg min max currency }
totalPay { avg min max currency }
}
}
"""
}]
result = self._make_request(payload)
return result[0].get("data", {}).get("employerSalaryTrends", [])
# Usage
cookies = {
"GSESSIONID": "your_session_id",
"gdId": "your_gd_id",
"gdsid": "your_gdsid",
"uc": "your_uc_value",
# Export all Glassdoor cookies from your browser
}
client = GlassdoorSalaryClient(cookies, proxy=PROXY)
# Get salaries for Software Engineers at Google (company_id=9079)
salaries = client.get_all_salaries_for_role(
company_id=9079,
job_title="Software Engineer",
location="San Francisco, CA"
)
print(f"Found {len(salaries)} salary data points:")
for s in salaries:
base = s.get("basePay", {})
total = s.get("totalPay", {})
print(f" {s['jobTitle']}: ${base.get('avg', 0):,.0f} base "
f"(${base.get('min', 0):,.0f}–${base.get('max', 0):,.0f}), "
f"${total.get('avg', 0):,.0f} total comp")
Scraping Reviews via GraphQL
The same GraphQL approach works for reviews. Reviews are the most actively moderated content on Glassdoor, so the API can return censored versions with asterisks for flagged words:
def get_company_reviews(
client: GlassdoorSalaryClient,
company_id: int,
sort: str = "RELEVANCE",
page: int = 1,
job_title: str = None,
) -> dict:
"""
Fetch company reviews via GraphQL.
sort options: RELEVANCE, DATE, RATING_HIGH, RATING_LOW, HELPFUL
"""
payload = [{
"operationName": "EmployerReviews",
"variables": {
"employerId": company_id,
"sort": sort,
"page": page,
"pageSize": 10,
"jobTitle": job_title,
"languageCode": "eng",
},
"query": """
query EmployerReviews(
$employerId: Int!,
$sort: String,
$page: Int,
$pageSize: Int,
$jobTitle: String,
$languageCode: String
) {
employerReviews(
employer: { id: $employerId }
sort: $sort
pagination: { page: $page, pageSize: $pageSize }
jobTitle: $jobTitle
languageCode: $languageCode
) {
reviews {
reviewId
dateTime
jobTitle
location
ratingOverall
ratingCeo
ratingBusinessOutlook
ratingWorkLifeBalance
ratingCultureAndValues
pros
cons
advice
isCurrentEmployee
lengthOfEmployment
employmentStatus
reviewCount
}
totalCount
hasNextPage
}
}
"""
}]
result = client._make_request(payload)
return result[0].get("data", {}).get("employerReviews", {})
def get_interview_questions(
client: GlassdoorSalaryClient,
company_id: int,
job_title: str = None,
page: int = 1,
) -> dict:
"""Fetch interview questions and outcomes for a company."""
payload = [{
"operationName": "EmployerInterviews",
"variables": {
"employerId": company_id,
"jobTitle": job_title,
"page": page,
"pageSize": 10,
},
"query": """
query EmployerInterviews(
$employerId: Int!,
$jobTitle: String,
$page: Int,
$pageSize: Int
) {
employerInterviews(
employer: { id: $employerId }
jobTitle: $jobTitle
pagination: { page: $page, pageSize: $pageSize }
) {
interviews {
interviewId
dateTime
jobTitle
difficulty
experience
outcome
questions {
question
answer
}
duration
interviewProcess
howGotInterview
}
totalCount
hasNextPage
}
}
"""
}]
result = client._make_request(payload)
return result[0].get("data", {}).get("employerInterviews", {})
Building a Salary Dataset Across Companies
Here's a complete pipeline to build a comparative salary dataset for multiple companies:
import sqlite3
import datetime
import time
import random
def setup_salary_db(db_path: str):
"""Create the SQLite schema for salary data."""
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS salary_data (
id INTEGER PRIMARY KEY AUTOINCREMENT,
company_id INTEGER,
company_name TEXT,
job_title TEXT,
location TEXT,
base_avg INTEGER,
base_min INTEGER,
base_max INTEGER,
total_avg INTEGER,
total_min INTEGER,
total_max INTEGER,
additional_avg INTEGER,
count INTEGER,
last_updated TEXT,
scraped_at TEXT,
UNIQUE(company_id, job_title, location)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS reviews (
review_id TEXT PRIMARY KEY,
company_id INTEGER,
company_name TEXT,
date_time TEXT,
job_title TEXT,
location TEXT,
rating_overall INTEGER,
rating_culture INTEGER,
rating_work_life INTEGER,
pros TEXT,
cons TEXT,
advice TEXT,
is_current_employee INTEGER,
scraped_at TEXT
)
""")
conn.commit()
return conn
def scrape_company_salaries(
client: GlassdoorSalaryClient,
company_id: int,
company_name: str,
job_titles: list[str],
db_conn: sqlite3.Connection,
):
"""Scrape salaries for multiple job titles at one company."""
now = datetime.datetime.now().isoformat()
saved = 0
for job_title in job_titles:
print(f" Scraping {job_title} at {company_name}...")
all_salaries = client.get_all_salaries_for_role(
company_id, job_title, max_pages=5
)
for s in all_salaries:
base = s.get("basePay", {})
total = s.get("totalPay", {})
additional = s.get("additionalPay", {})
try:
db_conn.execute("""
INSERT OR REPLACE INTO salary_data
(company_id, company_name, job_title, location,
base_avg, base_min, base_max,
total_avg, total_min, total_max,
additional_avg, count, last_updated, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
company_id, company_name, s.get("jobTitle"), "",
base.get("avg"), base.get("min"), base.get("max"),
total.get("avg"), total.get("min"), total.get("max"),
additional.get("avg"),
s.get("count"), s.get("lastUpdated"), now,
))
saved += 1
except Exception as e:
print(f" DB error: {e}")
db_conn.commit()
time.sleep(random.uniform(5, 12))
return saved
# Build a tech company salary comparison dataset
companies = [
(9079, "Google"),
(1651, "Microsoft"),
(6036, "Amazon"),
(40772, "Meta"),
(11891, "Netflix"),
]
target_roles = [
"Software Engineer",
"Senior Software Engineer",
"Product Manager",
"Data Scientist",
"Engineering Manager",
]
conn = setup_salary_db("glassdoor_salaries.db")
for company_id, company_name in companies:
print(f"\nScraping {company_name} (id={company_id})...")
count = scrape_company_salaries(
client, company_id, company_name, target_roles, conn
)
print(f" Saved {count} salary records")
# Longer pause between companies
time.sleep(random.uniform(30, 60))
conn.close()
Handling CAPTCHA and Session Expiry
Glassdoor's Akamai Bot Manager integration triggers challenges based on these patterns:
- More than ~20-30 requests per minute from the same session
- Missing or expired _abck cookie (Akamai's bot detection cookie)
- Session cookies older than 4-6 hours
- Requests from datacenter IPs (AWS, GCP, Azure are aggressively flagged)
- Requests with missing or unusual fingerprint signals in headers
import time
import random
def resilient_request(client: GlassdoorSalaryClient, func, *args, max_retries=3, **kwargs):
"""Wrapper that handles CAPTCHA and rate limiting gracefully."""
for attempt in range(max_retries):
try:
result = func(*args, **kwargs)
# Check for empty/error response indicating a block
if isinstance(result, dict) and result.get("error"):
raise Exception(f"API error: {result['error']}")
# Respectful delay between requests
time.sleep(random.uniform(3, 9))
return result
except Exception as e:
error_str = str(e).lower()
if "captcha" in error_str or "403" in error_str or "session expired" in error_str:
if attempt < max_retries - 1:
wait = 60 * (2 ** attempt) # 60s, 120s, 240s
print(f"CAPTCHA/block detected (attempt {attempt+1}). Waiting {wait}s...")
print("Consider re-exporting fresh cookies from your browser.")
time.sleep(wait)
else:
print("Max retries reached. Session likely dead — re-export cookies.")
raise
elif "429" in error_str or "rate" in error_str:
wait = 30 * (2 ** attempt)
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
else:
if attempt == max_retries - 1:
raise
time.sleep(10 * (attempt + 1))
return None
# Usage
salaries = resilient_request(
client,
client.search_salaries,
company_id=9079,
job_title="Software Engineer",
location="New York, NY"
)
Proxy Configuration for Glassdoor
Glassdoor blocks all major datacenter IP ranges at the Akamai layer. This is applied before any session or cookie check — a bare AWS IP returns a CAPTCHA page immediately.
Residential proxies from ThorData are one of the few reliable options. Akamai's scoring considers the ASN (autonomous system number) of the connecting IP, and residential IPs score much lower on the bot-probability scale than datacenter ones.
# Configure proxy with session stickiness
# Sticky sessions maintain the same IP for multiple requests
PROXY_BASE = "http://USERNAME:PASSWORD"
PROXY_HOST = "gate.thordata.com:7777"
def get_sticky_proxy(session_label: str) -> str:
"""
Get a sticky proxy URL that maintains the same exit IP.
Use the same session_label for all requests in one scraping session.
"""
return f"{PROXY_BASE}-session-{session_label}@{PROXY_HOST}"
# Important: reuse the same session label across all requests for one Glassdoor session
# This prevents the "teleporting IP" signal that triggers blocks
proxy = get_sticky_proxy("glassdoor-session-001")
client = GlassdoorSalaryClient(cookies, proxy=proxy)
Rate Limits and Practical Throughput
Being realistic about throughput with session cookies and residential proxies:
- ~200-400 salary lookups per hour before sessions start getting CAPTCHA'd
- ~100-200 review fetches per hour (reviews seem more heavily monitored)
- 4-6 hour session lifetime before cookies need refreshing
This is enough to build a dataset for a specific industry or metro area, but scraping all of Glassdoor isn't feasible without significant infrastructure.
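To stay under those ceilings by design rather than by luck, pace requests against an explicit per-hour budget. A minimal sketch (the 300/hour default is an assumption picked from the middle of the salary-lookup range above):

```python
import time
import random

class Pacer:
    """Spaces out calls so a session stays under a per-hour budget,
    with jitter so the interval doesn't look machine-regular."""
    def __init__(self, per_hour: int = 300, jitter: float = 0.3):
        self.base_interval = 3600.0 / per_hour  # seconds between calls
        self.jitter = jitter
        self.last_call = 0.0

    def wait(self):
        # Randomize each interval within +/- jitter of the base
        spread = self.base_interval * self.jitter
        interval = self.base_interval + random.uniform(-spread, spread)
        elapsed = time.monotonic() - self.last_call
        if elapsed < interval:
            time.sleep(interval - elapsed)
        self.last_call = time.monotonic()
```

Call pacer.wait() immediately before each GraphQL request; the jitter keeps intervals from looking machine-regular.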
Practical scaling tips:
- Rotate sessions: Keep 3-5 active Glassdoor accounts with fresh cookies. When one gets CAPTCHA'd, switch to another.
- Stagger requests: Don't run multiple scrapers simultaneously on the same session.
- Cache aggressively: Salary data doesn't change daily — cache by company_id + job_title + location with a 7-day TTL.
- Monitor response quality: An empty results array with totalCount > 0 means you're being rate-limited but not blocked.
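The caching tip is straightforward to implement with SQLite. Below is a sketch of a 7-day TTL cache keyed on (company_id, job_title, location); the table and column names are illustrative and separate from the salary_data schema earlier.

```python
import json
import sqlite3
import time

TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL, per the guidance above

def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS salary_cache (
        company_id INTEGER, job_title TEXT, location TEXT,
        payload TEXT, fetched_at REAL,
        PRIMARY KEY (company_id, job_title, location))""")
    return conn

def cache_get(conn, company_id: int, job_title: str, location: str):
    """Return the cached payload, or None if missing or older than the TTL."""
    row = conn.execute(
        "SELECT payload, fetched_at FROM salary_cache "
        "WHERE company_id=? AND job_title=? AND location=?",
        (company_id, job_title, location)).fetchone()
    if row and time.time() - row[1] < TTL_SECONDS:
        return json.loads(row[0])
    return None

def cache_put(conn, company_id: int, job_title: str, location: str, payload):
    conn.execute(
        "INSERT OR REPLACE INTO salary_cache VALUES (?,?,?,?,?)",
        (company_id, job_title, location, json.dumps(payload), time.time()))
    conn.commit()
```

Check cache_get before calling search_salaries, and cache_put after a successful fetch.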
Analyzing the Data
Once you have salary data in SQLite, you can run useful analyses:
import sqlite3
def compare_salaries(db_path: str, job_title: str):
"""Compare pay across companies for a specific role."""
conn = sqlite3.connect(db_path)
results = conn.execute("""
SELECT
company_name,
ROUND(base_avg) as avg_base,
ROUND(total_avg) as avg_total,
ROUND(total_avg - base_avg) as avg_equity_bonus,
count as data_points
FROM salary_data
WHERE job_title LIKE ?
AND base_avg IS NOT NULL
AND base_avg > 50000
ORDER BY total_avg DESC
""", (f"%{job_title}%",)).fetchall()
print(f"\n{job_title} Compensation Comparison:")
print(f"{'Company':<20} {'Base':>10} {'Total':>10} {'Equity+Bonus':>12} {'n':>5}")
print("-" * 60)
for row in results:
print(f"{row[0]:<20} ${row[1]:>9,.0f} ${row[2]:>9,.0f} ${row[3]:>11,.0f} {row[4]:>5}")
conn.close()
compare_salaries("glassdoor_salaries.db", "Software Engineer")
Legal and Ethical Considerations
Glassdoor's Terms of Service explicitly prohibit scraping. The hiQ Labs v. LinkedIn case (2022) established that scraping publicly accessible data isn't automatically a CFAA violation, but Glassdoor's salary data is behind a login wall — making it less clearly "publicly accessible."
From a practical standpoint: this is self-reported compensation data that employees voluntarily share to increase pay transparency. The ethical case for accessing it is strong. The legal risk depends on scale and commercial use.
Practical guidelines:
- Don't redistribute raw scraped data at scale
- Don't sell Glassdoor data directly
- Cache to avoid unnecessary repeated requests
- Use for research and analysis, not as a competing product
- Don't use the data to identify or contact individuals
Key Takeaways
Glassdoor scraping in 2026:
- Public overview pages: Easy with curl-cffi impersonating Chrome — no authentication needed
- Salary and review data: Requires authenticated sessions via the GraphQL API at https://www.glassdoor.com/graph
- Session management: 4-6 hour lifetime for cookies; maintain 3-5 rotating accounts
- Anti-bot: Akamai Bot Manager blocks datacenter IPs — ThorData residential proxies are required for any serious volume
- Throughput: ~200-400 salary lookups/hour maximum before throttling
- Caching: Salary data changes weekly at most — 7-day TTL cache is appropriate
- Legal risk: Elevated due to login wall — keep volumes reasonable and don't republish raw data