
Scraping Glassdoor Interview Questions: Python Guide (2026)

Glassdoor interview reviews are one of the most valuable datasets for job seekers and recruiters. Each review includes the interview questions asked, difficulty rating, experience sentiment, and whether the candidate got an offer. The data is right there in the page source — embedded as Apollo state JSON — if you know where to look.

This guide covers extracting interview data from Glassdoor's embedded state, using their GraphQL API, dealing with their anti-bot protections in 2026, and storing the results in SQLite for analysis.

What Data Is Available

Each Glassdoor interview review contains:

- The job title the candidate interviewed for
- The review date
- The overall experience (positive, negative, or neutral)
- A difficulty rating from 1 to 5
- The offer outcome (accepted, declined, or no offer)
- How the candidate applied
- A free-text description of the interview process
- The interview questions asked
- The interview stages or rounds, when reported

Across thousands of companies, this dataset lets you answer questions like: which companies have the hardest interviews? Which job titles get the most offers? What questions come up repeatedly at FAANG companies? What's the conversion rate at different stages of a company's process?
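To make the shape concrete, here is a hypothetical parsed review — every value is invented for illustration; the keys match what the parser built later in this guide produces:

```python
# A hypothetical parsed review. All values are invented for illustration;
# the keys mirror the output of parse_interview_reviews() below.
EXAMPLE_REVIEW = {
    "job_title": "Software Engineer",
    "date": "2026-01-15",
    "experience": "POSITIVE",       # POSITIVE / NEGATIVE / NEUTRAL
    "difficulty": 4,                # 1-5 scale
    "offer": "ACCEPTED",            # ACCEPTED / DECLINED / NO_OFFER
    "application_method": "Applied online",
    "process_description": "Recruiter screen, coding phone screen, then onsite.",
    "questions": [
        "Design a rate limiter.",
        "Reverse a linked list in place.",
    ],
    "interview_stages": ["Phone Screen", "Onsite"],
    "source": "apollo_state",
}
```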

Apollo State Extraction

Glassdoor uses Apollo Client for their GraphQL state management. The entire page's data is serialized in a window.__apolloState__ or __NEXT_DATA__ script tag embedded in the HTML. You don't need to render JavaScript — just parse the HTML and extract the JSON blob.

# glassdoor_scraper.py
import httpx
import json
import re
import time
import random
from bs4 import BeautifulSoup

def fetch_glassdoor_page(url: str, proxy: str = None) -> str:
    """Fetch a Glassdoor page with browser-like headers."""
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/125.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Ch-Ua": '"Chromium";v="125", "Not=A?Brand";v="8"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"macOS"',
        "Cache-Control": "max-age=0",
        "Upgrade-Insecure-Requests": "1",
    }

    client_kwargs = {
        "headers": headers,
        "follow_redirects": True,
        "timeout": 30,
    }
    if proxy:
        client_kwargs["proxy"] = proxy

    with httpx.Client(**client_kwargs) as client:
        # Warm up with homepage visit to collect cookies
        try:
            client.get("https://www.glassdoor.com/")
            time.sleep(random.uniform(1.5, 3.0))
        except httpx.RequestError:
            pass

        resp = client.get(url, headers={**headers, "Referer": "https://www.glassdoor.com/"})

        if resp.status_code == 403:
            raise RuntimeError(f"Blocked (403) on {url} — need residential proxy")
        if resp.status_code == 429:
            raise RuntimeError(f"Rate limited (429) on {url}")
        if resp.status_code != 200:
            raise RuntimeError(f"Unexpected status {resp.status_code}")

        return resp.text


def extract_apollo_state(html: str) -> dict:
    """Extract Apollo state JSON from Glassdoor page source."""
    soup = BeautifulSoup(html, "lxml")

    # Method 1: __NEXT_DATA__ (most common in 2026)
    next_data = soup.find("script", id="__NEXT_DATA__")
    if next_data and next_data.string:
        try:
            data = json.loads(next_data.string)
            return data.get("props", {}).get("pageProps", {})
        except json.JSONDecodeError:
            pass

    # Method 2: Apollo state in inline script
    for script in soup.find_all("script"):
        content = script.string or ""
        if "apolloState" in content:
            match = re.search(
                r'window\.__apolloState__\s*=\s*({.*?});\s*(?:window|var|let|const)',
                content,
                re.DOTALL
            )
            if match:
                try:
                    return json.loads(match.group(1))
                except json.JSONDecodeError:
                    pass

    # Method 3: Serialized props with interview data
    for script in soup.find_all("script"):
        content = script.string or ""
        if "interviewReviews" in content or "interviewExperience" in content:
            # Try to find JSON blob
            match = re.search(r'\{.*"interviewReviews".*\}', content, re.DOTALL)
            if match:
                try:
                    return json.loads(match.group(0))
                except json.JSONDecodeError:
                    pass

    return {}

Parsing Interview Data

Once you have the Apollo state, interview reviews are nested under predictable keys:

def parse_interview_reviews(apollo_state: dict) -> list:
    """Extract interview reviews from Apollo state."""
    reviews = []

    if not apollo_state:
        return reviews

    # Walk the nested structure to find interviewReviews
    interview_data = find_nested_key(apollo_state, "interviewReviews")
    if not interview_data:
        # Also try alternate key names
        interview_data = find_nested_key(apollo_state, "reviews")

    if not interview_data or not isinstance(interview_data, list):
        return reviews

    for review in interview_data:
        if not isinstance(review, dict):
            continue

        parsed = {
            "job_title": _safe_text(review.get("jobTitle")),
            "date": review.get("reviewDateTime") or review.get("reviewDate"),
            "experience": review.get("interviewExperience"),   # POSITIVE/NEGATIVE/NEUTRAL
            "difficulty": review.get("interviewDifficulty"),   # 1-5
            "offer": review.get("interviewOffer"),             # ACCEPTED/DECLINED/NO_OFFER
            "application_method": review.get("interviewApplication"),
            "process_description": _safe_text(review.get("interviewProcess")),
            "questions": [],
            "interview_stages": [],
            "source": "apollo_state",
        }

        # Extract interview questions
        for q in review.get("interviewQuestions", []):
            if isinstance(q, dict):
                text = q.get("text") or q.get("question") or q.get("body")
                if text:
                    parsed["questions"].append(text.strip())
            elif isinstance(q, str) and q.strip():
                parsed["questions"].append(q.strip())

        # Extract interview stages/rounds if present
        for stage in review.get("interviewStages", []):
            if isinstance(stage, dict):
                # Skip stages that have neither a name nor a type
                name = stage.get("name") or stage.get("type")
                if name:
                    parsed["interview_stages"].append(name)

        if parsed.get("job_title") or parsed.get("process_description"):
            reviews.append(parsed)

    return reviews


def _safe_text(value) -> str:
    """Safely extract text from a string or dict with a text key."""
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, dict):
        return (value.get("text") or value.get("value") or "").strip()
    return ""


def find_nested_key(data, target_key: str):
    """Recursively search for a key in nested dict/list."""
    if isinstance(data, dict):
        if target_key in data:
            return data[target_key]
        for value in data.values():
            result = find_nested_key(value, target_key)
            if result is not None:
                return result
    elif isinstance(data, list):
        for item in data:
            result = find_nested_key(item, target_key)
            if result is not None:
                return result
    return None

Scraping Company Interview Pages

def scrape_company_interviews(
    company_slug: str,
    company_id: str,
    num_pages: int = 5,
    proxy: str = None,
) -> list:
    """
    Scrape interview reviews for a company.
    company_slug: URL slug, e.g. "Google"
    company_id: numeric Glassdoor employer ID, e.g. "9079"
    """
    all_reviews = []

    for page in range(1, num_pages + 1):
        if page == 1:
            url = f"https://www.glassdoor.com/Interview/{company_slug}-Interview-Questions-E{company_id}.htm"
        else:
            url = f"https://www.glassdoor.com/Interview/{company_slug}-Interview-Questions-E{company_id}_P{page}.htm"

        print(f"Fetching page {page}/{num_pages}: {url}")
        try:
            html = fetch_glassdoor_page(url, proxy=proxy)
            state = extract_apollo_state(html)
            reviews = parse_interview_reviews(state)
            all_reviews.extend(reviews)
            print(f"  Extracted {len(reviews)} reviews (total: {len(all_reviews)})")
        except RuntimeError as e:
            print(f"  Error on page {page}: {e}")
            break
        except Exception as e:
            print(f"  Unexpected error on page {page}: {e}")
            break

        if page < num_pages:
            delay = random.uniform(8.0, 15.0)
            print(f"  Waiting {delay:.1f}s...")
            time.sleep(delay)

    return all_reviews

GraphQL API Approach

Glassdoor also has a GraphQL API endpoint used by their frontend. You can query it directly if you have the right cookies and headers from a valid browser session:

def glassdoor_graphql(
    query: str,
    variables: dict,
    cookies: dict,
    proxy: str = None,
) -> dict:
    """
    Execute a Glassdoor GraphQL query.
    Requires cookies from a valid Glassdoor browser session.
    cookies: dict with at minimum gdId, GSESSIONID, at
    """
    # Extract CSRF token from cookies
    csrf_token = cookies.get("gdId", "")

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
        ),
        "Content-Type": "application/json",
        "gd-csrf-token": csrf_token,
        "apollographql-client-name": "job-interview",
        "apollographql-client-version": "2.0.0",
        "Referer": "https://www.glassdoor.com/",
        "Origin": "https://www.glassdoor.com",
    }

    client_kwargs = {"timeout": 30}
    if proxy:
        client_kwargs["proxy"] = proxy

    with httpx.Client(**client_kwargs) as client:
        resp = client.post(
            "https://www.glassdoor.com/graph",
            json={"query": query, "variables": variables},
            headers=headers,
            cookies=cookies,
        )
        resp.raise_for_status()
        return resp.json()


# GraphQL query for interview reviews
INTERVIEW_QUERY = """
query InterviewReviews($employerId: Int!, $page: Int, $pageSize: Int) {
    employer(id: $employerId) {
        id
        name
        interviewReviews(page: $page, pageSize: $pageSize) {
            reviews {
                id
                jobTitle {
                    text
                }
                reviewDateTime
                interviewExperience
                interviewDifficulty
                interviewOffer
                interviewApplication
                interviewProcess {
                    text
                }
                interviewQuestions {
                    text
                }
            }
            totalCount
            currentPage
            totalPages
        }
    }
}
"""


def fetch_all_gql_reviews(
    employer_id: int,
    cookies: dict,
    max_pages: int = 10,
    proxy: str = None,
) -> list:
    """Fetch all interview reviews via Glassdoor GraphQL API."""
    all_reviews = []

    for page in range(1, max_pages + 1):
        try:
            result = glassdoor_graphql(
                INTERVIEW_QUERY,
                variables={"employerId": employer_id, "page": page, "pageSize": 20},
                cookies=cookies,
                proxy=proxy,
            )

            employer_data = result.get("data", {}).get("employer", {})
            reviews_data = employer_data.get("interviewReviews", {})
            reviews = reviews_data.get("reviews", [])

            if not reviews:
                break

            all_reviews.extend(reviews)
            total_pages = reviews_data.get("totalPages", 1)
            print(f"GQL page {page}/{total_pages}: {len(reviews)} reviews")

            if page >= total_pages:
                break

            time.sleep(random.uniform(3.0, 7.0))

        except Exception as e:
            print(f"GQL error on page {page}: {e}")
            break

    return all_reviews

Anti-Bot Measures

Glassdoor uses multiple layers of bot detection that require different countermeasures.

Cloudflare Protection

Glassdoor sits behind Cloudflare with JavaScript challenges enabled. Standard HTTP requests without JavaScript execution get a challenge page instead of content. The Apollo state extraction works only when you get past Cloudflare — which requires a browser-like TLS fingerprint.

Cookie Consent

Glassdoor requires accepting cookies before showing content. Without the right cookie consent state, you get redirected to a modal/wall that blocks the actual data.

Rate Limiting

Glassdoor rate-limits by IP and session. Loading more than 20-30 pages in quick succession triggers CAPTCHAs. After several CAPTCHA triggers, the IP gets temporarily blocked.
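Those limits can be enforced mechanically with a small per-session budget. A sketch — the thresholds mirror the rough numbers above, not documented limits:

```python
import random
import time

class PageBudget:
    """Caps page loads per session and enforces a jittered delay between them."""

    def __init__(self, max_pages: int = 20, min_delay: float = 8.0, max_delay: float = 15.0):
        self.max_pages = max_pages
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.used = 0

    def acquire(self) -> bool:
        """Return False once the budget is spent; otherwise sleep, count, and allow."""
        if self.used >= self.max_pages:
            return False
        if self.used > 0:
            # Jittered delay between consecutive page loads
            time.sleep(random.uniform(self.min_delay, self.max_delay))
        self.used += 1
        return True
```

When `acquire()` returns False, start a fresh session (new IP, new cookies) rather than pushing past the budget.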

Session Fingerprinting

Glassdoor tracks session consistency. If your cookies, User-Agent, and IP don't match across requests, the session gets invalidated. This means you can't easily rotate IPs mid-session.
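One way to keep those identifiers in lockstep is to pin them in a single object and derive every request's settings from it. A sketch, assuming an httpx-style client as used elsewhere in this guide:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class StickySession:
    """Pins the identifiers Glassdoor cross-checks (IP via proxy, User-Agent,
    cookies) so they cannot drift between requests in the same session."""
    proxy: str
    user_agent: str
    cookies: dict = field(default_factory=dict)

    def client_kwargs(self) -> dict:
        """Keyword arguments for httpx.Client — always the same identity."""
        return {
            "proxy": self.proxy,
            "headers": {"User-Agent": self.user_agent},
            "cookies": dict(self.cookies),
            "follow_redirects": True,
            "timeout": 30,
        }
```

Create one `StickySession` per scraping session and build every client from `client_kwargs()`; rotating the IP means creating a new session object with fresh cookies, never mutating the old one.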

Handling It All

For reliable Glassdoor scraping, you need three things: residential IPs, consistent sessions, and browser-level TLS fingerprints.

ThorData's residential proxies handle the IP side — clean residential IPs that pass Cloudflare's reputation checks. Pair that with sticky sessions (same IP for the duration of a scraping session) and curl_cffi for TLS fingerprinting:

from curl_cffi import requests as curl_requests

def fetch_with_fingerprint(url: str, proxy_url: str) -> str:
    """
    Fetch Glassdoor page with browser-like TLS fingerprint.
    curl_cffi impersonates Chrome at the TLS layer, bypassing
    fingerprint-based bot detection.
    """
    resp = curl_requests.get(
        url,
        impersonate="chrome124",
        proxies={"http": proxy_url, "https": proxy_url},
        timeout=30,
        headers={
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        }
    )
    if resp.status_code != 200:
        raise RuntimeError(f"Status {resp.status_code}")
    return resp.text

Install with: pip install curl_cffi

SQLite Storage Schema

import sqlite3
import json

def init_glassdoor_db(db_path: str = "glassdoor_interviews.db") -> sqlite3.Connection:
    """Initialize SQLite database for Glassdoor interview data."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS companies (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            slug            TEXT UNIQUE,
            glassdoor_id    TEXT,
            name            TEXT,
            industry        TEXT,
            added_at        TEXT DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS interview_reviews (
            id                  INTEGER PRIMARY KEY AUTOINCREMENT,
            company_slug        TEXT,
            job_title           TEXT,
            review_date         TEXT,
            experience          TEXT,
            difficulty          INTEGER,
            offer               TEXT,
            application_method  TEXT,
            process_description TEXT,
            questions           TEXT,
            interview_stages    TEXT,
            source              TEXT DEFAULT 'html',
            scraped_at          TEXT DEFAULT CURRENT_TIMESTAMP,
            FOREIGN KEY (company_slug) REFERENCES companies(slug)
        );

        CREATE TABLE IF NOT EXISTS questions_flat (
            id                  INTEGER PRIMARY KEY AUTOINCREMENT,
            company_slug        TEXT,
            job_title           TEXT,
            question_text       TEXT,
            review_date         TEXT,
            difficulty          INTEGER
        );

        CREATE INDEX IF NOT EXISTS idx_reviews_company ON interview_reviews (company_slug);
        CREATE INDEX IF NOT EXISTS idx_questions_company ON questions_flat (company_slug);
        CREATE INDEX IF NOT EXISTS idx_questions_title ON questions_flat (job_title);
    """)
    conn.commit()
    return conn


def save_reviews(conn: sqlite3.Connection, company_slug: str, reviews: list) -> int:
    """Save interview reviews and flatten questions into separate table."""
    saved = 0
    for r in reviews:
        try:
            cursor = conn.execute(
                """
                INSERT INTO interview_reviews
                    (company_slug, job_title, review_date, experience, difficulty,
                     offer, application_method, process_description, questions,
                     interview_stages, source)
                VALUES (?,?,?,?,?,?,?,?,?,?,?)
                """,
                (
                    company_slug,
                    r.get("job_title"),
                    r.get("date"),
                    r.get("experience"),
                    r.get("difficulty"),
                    r.get("offer"),
                    r.get("application_method"),
                    r.get("process_description"),
                    json.dumps(r.get("questions", [])),
                    json.dumps(r.get("interview_stages", [])),
                    r.get("source", "html"),
                ),
            )
            review_id = cursor.lastrowid
            saved += 1

            # Flatten questions into questions_flat for easy querying
            for question in r.get("questions", []):
                if question and len(question) > 10:
                    conn.execute(
                        """INSERT INTO questions_flat
                           (company_slug, job_title, question_text, review_date, difficulty)
                           VALUES (?,?,?,?,?)""",
                        (
                            company_slug,
                            r.get("job_title"),
                            question,
                            r.get("date"),
                            r.get("difficulty"),
                        ),
                    )
        except Exception as e:
            print(f"  Save error: {e}")

    conn.commit()
    return saved

Analyzing Interview Difficulty

Once you have the data, you can build useful analyses:

from collections import Counter

def analyze_company_interviews(conn: sqlite3.Connection, company_slug: str) -> dict:
    """Generate interview statistics for a company."""
    rows = conn.execute(
        """
        SELECT difficulty, experience, offer, job_title
        FROM interview_reviews
        WHERE company_slug = ?
        """,
        (company_slug,)
    ).fetchall()

    if not rows:
        return {"error": "No data found"}

    difficulties = [r[0] for r in rows if r[0] is not None]
    experiences = [r[1] for r in rows if r[1]]
    offers = [r[2] for r in rows if r[2]]
    titles = [r[3] for r in rows if r[3]]

    avg_difficulty = sum(difficulties) / len(difficulties) if difficulties else 0
    exp_counts = Counter(experiences)
    offer_counts = Counter(offers)
    title_counts = Counter(titles)

    # Question frequency analysis
    q_rows = conn.execute(
        """
        SELECT question_text, COUNT(*) as freq
        FROM questions_flat
        WHERE company_slug = ?
        GROUP BY question_text
        ORDER BY freq DESC
        LIMIT 20
        """,
        (company_slug,)
    ).fetchall()

    return {
        "total_reviews": len(rows),
        "avg_difficulty": round(avg_difficulty, 2),
        "experience_breakdown": dict(exp_counts.most_common()),
        "offer_breakdown": dict(offer_counts.most_common()),
        "top_job_titles": title_counts.most_common(10),
        "most_frequent_questions": [
            {"question": r[0][:200], "frequency": r[1]}
            for r in q_rows
        ],
    }


def compare_companies_difficulty(conn: sqlite3.Connection, slugs: list) -> list:
    """Compare interview difficulty across companies."""
    results = []
    for slug in slugs:
        row = conn.execute(
            """
            SELECT
                AVG(CASE WHEN difficulty IS NOT NULL THEN difficulty END) as avg_diff,
                COUNT(*) as total,
                SUM(CASE WHEN experience='POSITIVE' THEN 1 ELSE 0 END) as positive,
                SUM(CASE WHEN offer='ACCEPTED' THEN 1 ELSE 0 END) as offers
            FROM interview_reviews WHERE company_slug=?
            """,
            (slug,)
        ).fetchone()
        if row and row[1] > 0:
            results.append({
                "company": slug,
                "avg_difficulty": round(row[0] or 0, 2),
                "total_reviews": row[1],
                "positive_pct": round((row[2] or 0) / row[1] * 100, 1),
                "offer_pct": round((row[3] or 0) / row[1] * 100, 1),
            })
    return sorted(results, key=lambda x: x["avg_difficulty"], reverse=True)

Complete Pipeline

def run_glassdoor_pipeline(
    companies: list,
    db_path: str = "glassdoor_interviews.db",
    proxy: str = None,
    pages_per_company: int = 5,
):
    """
    companies: list of dicts with 'slug', 'id', 'name' keys
    e.g. [{"slug": "Google", "id": "9079", "name": "Google"}]
    """
    conn = init_glassdoor_db(db_path)

    for company in companies:
        slug = company["slug"]
        company_id = company["id"]
        print(f"\nScraping {company['name']} (ID: {company_id})...")

        # Ensure company record exists
        conn.execute(
            "INSERT OR IGNORE INTO companies (slug, glassdoor_id, name) VALUES (?,?,?)",
            (slug, company_id, company["name"])
        )
        conn.commit()

        try:
            reviews = scrape_company_interviews(
                company_slug=slug,
                company_id=company_id,
                num_pages=pages_per_company,
                proxy=proxy,
            )
            saved = save_reviews(conn, slug, reviews)
            print(f"Saved {saved} reviews for {company['name']}")

            # Print quick stats (analyze returns an error dict when no rows exist)
            stats = analyze_company_interviews(conn, slug)
            if "error" not in stats:
                print(f"  Avg difficulty: {stats['avg_difficulty']}/5")
                print(f"  Experience: {stats['experience_breakdown']}")

        except Exception as e:
            print(f"Error processing {company['name']}: {e}")

        # Longer delay between companies
        time.sleep(random.uniform(30.0, 60.0))

    conn.close()


# Usage
PROXY = "http://user:[email protected]:9000"

COMPANIES = [
    {"slug": "Google", "id": "9079", "name": "Google"},
    {"slug": "Meta", "id": "40772", "name": "Meta"},
    {"slug": "Amazon-com", "id": "6036", "name": "Amazon"},
    {"slug": "Microsoft", "id": "1651", "name": "Microsoft"},
    {"slug": "Apple", "id": "1138", "name": "Apple"},
]

run_glassdoor_pipeline(COMPANIES, proxy=PROXY, pages_per_company=3)

Practical Tips

Extract Apollo state first. Don't render JavaScript if you don't have to. The data you need is usually serialized in the page source. Check __NEXT_DATA__ and __apolloState__ script tags before reaching for a headless browser.

Use sticky sessions. Glassdoor invalidates sessions when the IP changes. Configure your proxy for session persistence — same IP for at least 10-15 minutes per company.

Limit pages per session. Don't scrape more than 5-10 pages per session. Start a new session with a new IP for the next batch. This keeps you well under the rate limit threshold.

Store raw HTML. Save the full page source before parsing. Glassdoor's internal data structure changes periodically — having the raw HTML lets you re-parse without re-scraping.

Check the JSON structure. Glassdoor's Apollo state format changes between page types and over time. Always inspect the actual JSON before writing parsing code. A missing nested key is the most common failure mode.

The GraphQL endpoint requires fresh cookies. The gdId cookie used as a CSRF token rotates. If your GraphQL requests return 401 or empty data, your cookies have expired. Re-authenticate by visiting Glassdoor in a browser and extracting fresh cookies.

Residential proxies are required. Glassdoor's Cloudflare configuration blocks most datacenter IPs on reputation alone, and TLS fingerprint checks catch non-browser clients even on clean IPs. ThorData provides the residential IP rotation needed to get past the initial check.

Glassdoor's Terms of Service prohibit automated data collection. Section 2.3 explicitly disallows scraping, crawling, or using automated means to access the site. Unlike public-domain sources such as patent filings or government data, Glassdoor has actively pursued enforcement against commercial scrapers.

For non-commercial research and personal use, small-scale scraping is low risk. For any commercial application, consider whether the HireVue API, LinkedIn's official Talent Insights API, or other licensed data providers could meet your needs without the legal exposure.

The interview questions themselves are user-generated content posted publicly — there's a reasonable argument that aggregating them doesn't infringe copyright, but Glassdoor's ToS still applies to the method of access.