
How to Scrape Indeed Job Listings in 2026: Playwright + Anti-Bot Evasion


Indeed is the largest job aggregator globally — over 350 million unique visitors per month and job listings from nearly every industry. If you're building salary comparison tools, tracking hiring trends, or doing labor market research, Indeed has the data. But getting it out requires overcoming some of the most aggressive anti-bot systems in the job board space.

Simple HTTP requests won't work. This guide uses Playwright — a headless browser automation library — because Indeed's 2026 defenses specifically target non-browser clients.

What Data Can You Extract?

Indeed job listings contain: job title and URL, company name, location, salary (when disclosed), posting date, and the full description text — plus benefits, job type, and company ratings and reviews surfaced alongside each posting. Every one of these fields appears in the extraction code below.

Indeed's Anti-Bot Measures in 2026

Indeed runs some of the most sophisticated bot detection in the job board space:

  1. Cloudflare Turnstile — Indeed uses Cloudflare's challenge platform. Requests without valid cf_clearance cookies get blocked.
  2. Browser fingerprinting — Canvas hashing, WebGL renderer strings, font enumeration, and audio context fingerprinting are all checked via inline JavaScript.
  3. Behavioral analysis — Pages track mouse movements, scroll patterns, and time-on-page. No interaction triggers a soft block after a few pages.
  4. TLS fingerprinting (JA3) — The TLS handshake signature is checked against known bot fingerprints. Python's requests library has a recognizable JA3 hash.
  5. IP reputation scoring — Datacenter IPs, VPN exit nodes, and previously flagged IPs get immediate challenges.
  6. Dynamic CSS selectors — Class names on job cards are randomized per session, breaking static CSS selectors between runs.
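Before investing in full evasion, it helps to detect when you have actually hit a challenge page rather than a results page. A minimal sketch that checks for markers commonly seen in Cloudflare interstitials — the exact marker strings are assumptions and can change, so verify them against HTML you actually receive:

```python
def looks_like_cf_challenge(html: str) -> bool:
    """Heuristic check for a Cloudflare challenge/interstitial page.

    The markers below are assumptions based on commonly observed
    challenge pages; confirm against live responses before relying on them.
    """
    markers = (
        "cf-chl",                  # challenge script/form identifiers
        "challenge-platform",      # Cloudflare challenge platform assets
        "just a moment",           # interstitial page title
        "verifying you are human", # Turnstile prompt text
    )
    lowered = html.lower()
    return any(m in lowered for m in markers)
```

Call this on `page.content()` after navigation; if it returns True, back off and rotate the session instead of retrying on the same IP.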

Why Playwright, Not Requests

Indeed's Cloudflare integration means the page must execute JavaScript to obtain the cf_clearance cookie. You cannot fake this with HTTP requests — you need an actual browser engine. Playwright provides this while being scriptable in Python.

pip install playwright
playwright install chromium

Basic Job Search Scraper

import asyncio
import json
import random
from playwright.async_api import async_playwright

async def scrape_indeed_jobs(
    query: str,
    location: str,
    max_pages: int = 3,
    proxy: dict = None,
) -> list:
    jobs = []

    async with async_playwright() as p:
        launch_args = {
            "headless": True,
            "args": [
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--no-sandbox",
                "--disable-setuid-sandbox",
            ],
        }
        if proxy:
            launch_args["proxy"] = proxy

        browser = await p.chromium.launch(**launch_args)
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
            locale="en-US",
        )

        # Remove webdriver detection flag
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            delete window.cdc_adoQpoasnfa76pfcZLmcfl_Array;
        """)

        page = await context.new_page()

        for page_num in range(max_pages):
            start = page_num * 10
            url = f"https://www.indeed.com/jobs?q={query}&l={location}&start={start}"

            await page.goto(url, wait_until="networkidle", timeout=30000)

            try:
                await page.wait_for_selector("[data-testid='jobsearch-resultsList']", timeout=10000)
            except Exception:
                print(f"No results on page {page_num + 1}, possibly blocked")
                break

            # Simulate human scrolling
            for _ in range(3):
                await page.mouse.wheel(0, random.randint(300, 600))
                await asyncio.sleep(random.uniform(0.5, 1.5))

            # Move mouse to simulate human presence
            await page.mouse.move(random.randint(100, 800), random.randint(100, 500))

            cards = await page.query_selector_all("[data-testid='slider_item']")

            for card in cards:
                job = {}

                title_el = await card.query_selector("h2 a")
                if title_el:
                    job["title"] = (await title_el.inner_text()).strip()
                    job["url"] = await title_el.get_attribute("href")
                    if job["url"] and job["url"].startswith("/"):
                        job["url"] = f"https://www.indeed.com{job['url']}"

                company_el = await card.query_selector("[data-testid='company-name']")
                job["company"] = (await company_el.inner_text()).strip() if company_el else None

                location_el = await card.query_selector("[data-testid='text-location']")
                job["location"] = (await location_el.inner_text()).strip() if location_el else None

                salary_el = await card.query_selector("[data-testid='attribute_snippet_testid']")
                job["salary"] = (await salary_el.inner_text()).strip() if salary_el else None

                date_el = await card.query_selector("[data-testid='myJobsStateDate']")
                job["posted"] = (await date_el.inner_text()).strip() if date_el else None

                jobs.append(job)

            print(f"  Page {page_num + 1}: found {len(cards)} job cards")
            await asyncio.sleep(random.uniform(4, 8))

        await browser.close()
    return jobs

Scraping Full Job Descriptions

Job cards only show previews. Full descriptions require opening each posting:

async def scrape_job_detail(url: str, context) -> dict:
    page = await context.new_page()
    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await page.wait_for_selector("#jobDescriptionText", timeout=10000)
        description = await page.inner_text("#jobDescriptionText")

        salary = None
        salary_el = await page.query_selector("#salaryInfoAndJobType")
        if salary_el:
            salary = (await salary_el.inner_text()).strip()

        benefits = []
        benefit_els = await page.query_selector_all("[data-testid='benefits-entry']")
        for b in benefit_els:
            benefits.append((await b.inner_text()).strip())

        job_type = None
        type_el = await page.query_selector("[data-testid='jobsearch-JobInfoHeader-jobType']")
        if type_el:
            job_type = (await type_el.inner_text()).strip()

        return {
            "description": description.strip(),
            "salary_detail": salary,
            "benefits": benefits,
            "job_type": job_type,
        }
    except Exception as e:
        return {"error": str(e)}
    finally:
        await page.close()

Handling Dynamic CSS Selectors

Indeed randomizes CSS class names between sessions, but data-testid attributes are stable. Always prefer [data-testid='...'] selectors over class-based ones. If Indeed removes a testid, fall back to structural selectors:

# Stable: data-testid attributes
title_el = await card.query_selector("[data-testid='jobTitle']")
company_el = await card.query_selector("[data-testid='company-name']")

# Fallback: select by structure
if not title_el:
    title_el = await card.query_selector("h2 a")

if not company_el:
    company_el = await card.query_selector("h2 + div span")

# Extract job ID from data attribute as stable identifier
jk_attr = await card.get_attribute("data-jk")
if jk_attr:
    job_id = jk_attr

Proxy Strategy for Indeed

Indeed's IP reputation system is the toughest obstacle. Datacenter proxies last maybe 5-10 requests before hitting Turnstile challenges. Free proxy lists are almost entirely pre-flagged.

Residential proxies are the only reliable option for sustained Indeed scraping. ThorData works well here because their residential IPs have clean reputation scores — they have not been abused by other scraping operations. This matters specifically for Cloudflare Turnstile, which maintains a shared IP reputation database across all sites using it.

# ThorData residential proxy — https://thordata.partnerstack.com/partner/0a0x4nzb (or [Oxylabs](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=2066&url_id=174))
PROXY_CONFIG = {
    "server": "http://proxy.thordata.net:9000",
    "username": "YOUR_THORDATA_USER",
    "password": "YOUR_THORDATA_PASS",
}

async def main():
    jobs = await scrape_indeed_jobs(
        query="python+developer",
        location="Remote",
        max_pages=5,
        proxy=PROXY_CONFIG,
    )
    print(f"Found {len(jobs)} listings")
    for j in jobs[:5]:
        salary_str = j.get('salary') or 'No salary listed'  # salary key exists but may be None
        print(f"  {j.get('title')} @ {j.get('company')} | {salary_str}")

asyncio.run(main())

Handling Cloudflare Challenges

If you encounter a Cloudflare challenge page, a few strategies help navigate it:

import asyncio
from playwright.async_api import async_playwright

async def bypass_cloudflare(url: str, proxy: dict = None) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
            ],
        )
        context = await browser.new_context(
            viewport={"width": 1440, "height": 900},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
            Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
            Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5]});
        """)
        page = await context.new_page()
        await page.goto(url, wait_until="domcontentloaded", timeout=60000)

        # Wait for challenge to clear (up to 15s)
        for sel in ["#challenge-running", "#cf-spinner-please-wait", ".cf-browser-verification"]:
            try:
                await page.wait_for_selector(sel, state="hidden", timeout=15000)
            except Exception:
                pass

        await asyncio.sleep(3)
        content = await page.content()
        await browser.close()
        return content

Salary Data Extraction and Normalization

Indeed's salary display is inconsistent. Some listings use annual ranges, others hourly. Normalizing to comparable annual figures:

import re

def normalize_salary(raw_salary: str) -> dict | None:
    if not raw_salary:
        return None
    raw = raw_salary.lower().strip()
    # Allow decimals — hourly rates like "$25.50 an hour" are common
    nums = re.findall(r'\d+(?:\.\d+)?', raw.replace(',', ''))
    if not nums:
        return None
    amounts = [float(n) for n in nums]
    if not amounts:
        return None

    is_hourly = bool(re.search(r'/hr|per hour|/hour|an hour', raw))
    is_monthly = bool(re.search(r'/month|per month|/mo', raw))

    def annualize(amt):
        if is_hourly:
            return amt * 2080
        if is_monthly:
            return amt * 12
        return amt

    annual = [annualize(a) for a in amounts]
    return {
        "min": min(annual),
        "max": max(annual),
        "mid": sum(annual) / len(annual),
        "raw": raw_salary,
        "period": "hourly" if is_hourly else "monthly" if is_monthly else "annual",
    }

def print_salary_summary(jobs: list) -> None:
    salaries = [normalize_salary(j.get('salary')) for j in jobs]
    salaries = [s for s in salaries if s]
    if not salaries:
        print('No salary data found')
        return
    mids = [s['mid'] for s in salaries]
    print(f'Salary stats across {len(salaries)} listings:')
    print(f"  Min: ${min(s['min'] for s in salaries):>9,.0f}")
    print(f"  Max: ${max(s['max'] for s in salaries):>9,.0f}")
    print(f"  Median: ${sorted(mids)[len(mids)//2]:>9,.0f}")

Remote Job Filtering

Use Indeed's built-in remote filter for clean remote-only datasets:

import asyncio
import random
from playwright.async_api import async_playwright

async def scrape_remote_jobs(query: str, max_pages: int = 5, proxy: dict = None) -> list:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy,
            args=['--disable-blink-features=AutomationControlled'],
        )
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        await context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
        )
        page = await context.new_page()
        jobs = []

        for page_num in range(max_pages):
            start = page_num * 10
            # remotejobs=1 filters to remote-only listings
            url = f"https://www.indeed.com/jobs?q={query}&remotejobs=1&start={start}"
            await page.goto(url, wait_until="networkidle", timeout=30000)
            try:
                await page.wait_for_selector("[data-testid='jobsearch-resultsList']", timeout=10000)
            except Exception:
                break
            cards = await page.query_selector_all("[data-testid='slider_item']")
            for card in cards:
                title_el = await card.query_selector('h2 a')
                company_el = await card.query_selector("[data-testid='company-name']")
                salary_el = await card.query_selector("[data-testid='attribute_snippet_testid']")
                if title_el:
                    href = await title_el.get_attribute('href')
                    jobs.append({
                        'title': (await title_el.inner_text()).strip(),
                        'company': (await company_el.inner_text()).strip() if company_el else None,
                        'salary': (await salary_el.inner_text()).strip() if salary_el else None,
                        'remote': True,
                        'url': f"https://www.indeed.com{href}" if href and href.startswith('/') else href,
                    })
            await asyncio.sleep(random.uniform(4, 7))
        await browser.close()
        return jobs

Incremental Scraping: Only New Listings

For ongoing job tracking, skip already-seen listings:

import json
from pathlib import Path

class IncrementalJobScraper:
    def __init__(self, state_file: str = 'seen_jobs.json'):
        self.state_file = Path(state_file)
        self.seen_ids = self._load_seen()

    def _load_seen(self) -> set:
        if self.state_file.exists():
            data = json.loads(self.state_file.read_text())
            return set(data.get('seen_ids', []))
        return set()

    def _save_seen(self) -> None:
        self.state_file.write_text(json.dumps({'seen_ids': list(self.seen_ids)}, indent=2))

    def filter_new(self, jobs: list) -> list:
        new_jobs = []
        for job in jobs:
            job_id = job.get('url', '')
            if job_id and job_id not in self.seen_ids:
                new_jobs.append(job)
                self.seen_ids.add(job_id)
        self._save_seen()
        return new_jobs

Saving Job Data

import csv
import json
from pathlib import Path
from datetime import datetime

def save_jobs(jobs: list, prefix: str = 'indeed_jobs', output_dir: str = '.') -> None:
    if not jobs:
        print('No jobs to save')
        return
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    timestamp = datetime.now().strftime('%Y%m%d_%H%M')
    json_file = out / f'{prefix}_{timestamp}.json'
    json_file.write_text(json.dumps(jobs, indent=2, ensure_ascii=False))
    csv_file = out / f'{prefix}_{timestamp}.csv'
    keys = ['title', 'company', 'location', 'salary', 'posted', 'job_type', 'url']
    with open(csv_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=keys, extrasaction='ignore')
        writer.writeheader()
        writer.writerows(jobs)
    print(f'Saved {len(jobs)} jobs: {json_file}, {csv_file}')

Legal and Ethical Considerations

Indeed's Terms of Service prohibit scraping. In hiQ Labs v. LinkedIn, courts ruled that scraping public data is not a CFAA violation, but Indeed has pursued legal action against scrapers under state computer fraud laws. Keep your volumes moderate, do not scrape behind login walls, and use the data for analysis — not rebuilding Indeed's listings database.

Key Takeaways

  1. Use a real browser engine. Indeed's Cloudflare integration requires JavaScript execution to obtain cf_clearance, so plain HTTP requests are dead on arrival.
  2. Prefer [data-testid='...'] selectors. Class names are randomized per session; testids and the data-jk attribute are the stable hooks.
  3. Use residential proxies. Datacenter IPs hit Turnstile challenges within a handful of requests.
  4. Behave like a human. Scroll, move the mouse, and randomize delays between pages — no interaction triggers a soft block.
  5. Scrape incrementally. Track seen job IDs in a state file so repeat runs only process new listings.

Building a Job Alert System

Combine Indeed scraping with email notifications to build a personal job alert that catches new listings:

import asyncio
import json
import random
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from pathlib import Path
from datetime import datetime
from playwright.async_api import async_playwright

STATE_FILE = Path("indeed_alert_state.json")


def load_seen_jobs() -> set:
    if STATE_FILE.exists():
        return set(json.loads(STATE_FILE.read_text()).get("seen_urls", []))
    return set()


def save_seen_jobs(seen: set) -> None:
    STATE_FILE.write_text(json.dumps({"seen_urls": list(seen)}, indent=2))


def send_alert_email(new_jobs: list, smtp_config: dict) -> None:
    if not new_jobs:
        return

    subject = f"Indeed Alert: {len(new_jobs)} new job(s) found"
    body_lines = [f"Found {len(new_jobs)} new listings:\n"]
    for job in new_jobs[:20]:
        body_lines.append(f"- {job.get('title')} @ {job.get('company')}")
        if job.get("salary"):
            body_lines.append(f"  Salary: {job['salary']}")
        body_lines.append(f"  Location: {job.get('location')}")
        body_lines.append(f"  URL: {job.get('url', 'N/A')}")
        body_lines.append("")

    msg = MIMEMultipart()
    msg["Subject"] = subject
    msg["From"] = smtp_config["from"]
    msg["To"] = smtp_config["to"]
    msg.attach(MIMEText("\n".join(body_lines), "plain"))

    with smtplib.SMTP_SSL(smtp_config["host"], smtp_config["port"]) as server:
        server.login(smtp_config["user"], smtp_config["password"])
        server.sendmail(smtp_config["from"], smtp_config["to"], msg.as_string())

    print(f"Alert sent: {len(new_jobs)} new jobs")


async def run_job_alert(
    queries: list[dict],
    proxy: dict = None,
    smtp_config: dict = None,
) -> list:
    seen = load_seen_jobs()
    all_new_jobs = []

    for query_config in queries:
        keywords = query_config["keywords"]
        location = query_config.get("location", "Remote")
        print(f"Checking: {keywords} in {location}")

        jobs = await scrape_indeed_jobs(
            query=keywords,
            location=location,
            max_pages=2,
            proxy=proxy,
        )

        new_jobs = [j for j in jobs if j.get("url") and j["url"] not in seen]
        if new_jobs:
            print(f"  {len(new_jobs)} new listings found!")
            for job in new_jobs:
                seen.add(job["url"])
            all_new_jobs.extend(new_jobs)
        else:
            print(f"  No new listings")

        await asyncio.sleep(random.uniform(8, 15))

    save_seen_jobs(seen)

    if all_new_jobs and smtp_config:
        send_alert_email(all_new_jobs, smtp_config)

    return all_new_jobs

Scraping Company Reviews and Ratings

Indeed surfaces company ratings alongside job listings. You can scrape employer review data for company intelligence:

import asyncio
import random
from playwright.async_api import async_playwright


async def scrape_company_reviews(
    company_name: str,
    max_pages: int = 5,
    proxy: dict = None,
) -> list[dict]:
    reviews = []
    company_slug = company_name.lower().replace(" ", "-")

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        await context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
        )
        page = await context.new_page()

        for page_num in range(max_pages):
            start = page_num * 20
            url = f"https://www.indeed.com/cmp/{company_slug}/reviews?start={start}"
            await page.goto(url, wait_until="networkidle", timeout=30000)

            try:
                await page.wait_for_selector("[data-testid='review-card']", timeout=8000)
            except Exception:
                break

            review_cards = await page.query_selector_all("[data-testid='review-card']")
            for card in review_cards:
                title_el = await card.query_selector("[data-testid='review-title']")
                rating_el = await card.query_selector("[data-testid='review-rating']")
                pros_el = await card.query_selector("[data-testid='review-pros']")
                cons_el = await card.query_selector("[data-testid='review-cons']")
                date_el = await card.query_selector("[data-testid='review-date']")
                job_title_el = await card.query_selector("[data-testid='review-job-title']")

                reviews.append({
                    "title": (await title_el.inner_text()).strip() if title_el else None,
                    "rating": (await rating_el.get_attribute("aria-label")) if rating_el else None,
                    "pros": (await pros_el.inner_text()).strip() if pros_el else None,
                    "cons": (await cons_el.inner_text()).strip() if cons_el else None,
                    "date": (await date_el.inner_text()).strip() if date_el else None,
                    "job_title": (await job_title_el.inner_text()).strip() if job_title_el else None,
                })

            await asyncio.sleep(random.uniform(3, 6))

        await browser.close()

    return reviews

Scraping Salary Estimates by Role

Indeed's salary estimation tool provides aggregated salary data by job title and location. You can query this directly:

import asyncio
import random
from playwright.async_api import async_playwright


async def scrape_salary_estimates(
    job_title: str,
    location: str = "United States",
    proxy: dict = None,
) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        await context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
        )
        page = await context.new_page()

        # Indeed salary page structure
        url_title = job_title.replace(" ", "-").lower()
        url = f"https://www.indeed.com/career/{url_title}/salaries"

        await page.goto(url, wait_until="networkidle", timeout=30000)
        await asyncio.sleep(2)

        # Extract salary display
        result = {}
        salary_el = await page.query_selector("[data-testid='salary-hero-amount']")
        if salary_el:
            result["average_salary"] = (await salary_el.inner_text()).strip()

        range_els = await page.query_selector_all("[data-testid='salary-percentile']")
        percentiles = []
        for el in range_els:
            percentiles.append((await el.inner_text()).strip())
        if percentiles:
            result["percentiles"] = percentiles

        await browser.close()

    return result


# Async runner
async def batch_salary_lookup(roles: list[str]) -> dict:
    results = {}
    for role in roles:
        print(f"Looking up salary for: {role}")
        data = await scrape_salary_estimates(role)
        results[role] = data
        print(f"  {data.get('average_salary', 'N/A')}")
        await asyncio.sleep(random.uniform(5, 10))
    return results

Indeed Job Market Analytics Dashboard

Combine multiple scraped datasets to build a job market analytics view:

import json
import statistics
from pathlib import Path
from collections import Counter, defaultdict
from datetime import datetime


def generate_market_report(jobs_file: str) -> str:
    jobs = json.loads(Path(jobs_file).read_text())
    if not jobs:
        return "No data"

    # Company hiring volume
    companies = Counter(j["company"] for j in jobs if j.get("company"))

    # Location distribution
    locations = Counter(j["location"] for j in jobs if j.get("location"))

    # Salary analysis
    salary_jobs = []
    for job in jobs:
        s = normalize_salary(job.get("salary"))
        if s:
            salary_jobs.append(s)

    # Remote vs. on-site
    remote_count = sum(
        1 for j in jobs
        if j.get("location") and "remote" in j["location"].lower()
    )

    # Posted date distribution
    today = datetime.now().strftime("%Y-%m-%d")
    posted_today = sum(1 for j in jobs if j.get("posted") and today in str(j["posted"]))

    lines = []
    lines.append("# Indeed Job Market Report")
    lines.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
    lines.append("")
    lines.append(f"**Total listings analyzed:** {len(jobs)}")
    lines.append(f"**Remote positions:** {remote_count} ({remote_count/len(jobs)*100:.1f}%)")
    lines.append(f"**Posted today:** {posted_today}")
    lines.append(f"**Listings with salary data:** {len(salary_jobs)} ({len(salary_jobs)/len(jobs)*100:.1f}%)")
    lines.append("")

    if salary_jobs:
        mids = [s["mid"] for s in salary_jobs]
        lines.append("## Salary Statistics")
        lines.append(f"- Median: ${statistics.median(mids):,.0f}")
        lines.append(f"- Mean: ${statistics.mean(mids):,.0f}")
        lines.append(f"- Min range: ${min(s['min'] for s in salary_jobs):,.0f}")
        lines.append(f"- Max range: ${max(s['max'] for s in salary_jobs):,.0f}")
        lines.append("")

    lines.append("## Top Hiring Companies")
    for company, count in companies.most_common(15):
        lines.append(f"- {company}: {count} postings")
    lines.append("")

    lines.append("## Top Locations")
    for location, count in locations.most_common(10):
        lines.append(f"- {location}: {count} postings")

    return "\n".join(lines)

Production Deployment Considerations

When running Indeed scraping in production (e.g., scheduled daily collection), consider these operational patterns:

State management — Always maintain a state file tracking seen job IDs. This prevents re-processing listings and enables efficient incremental runs.

Error recovery — Indeed blocks can happen mid-run. Structure your scraper to save progress after each page so a block on page 5 of 10 does not lose pages 1-4:

import json
from pathlib import Path

class ProgressiveScraper:
    def __init__(self, job_name: str, output_dir: str = "jobs"):
        self.job_name = job_name
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.checkpoint_file = self.output_dir / f"{job_name}_checkpoint.json"
        self.results = self._load_checkpoint()

    def _load_checkpoint(self) -> list:
        if self.checkpoint_file.exists():
            data = json.loads(self.checkpoint_file.read_text())
            print(f"Resuming from checkpoint: {len(data)} jobs already collected")
            return data
        return []

    def save_checkpoint(self) -> None:
        self.checkpoint_file.write_text(json.dumps(self.results, indent=2))

    def add_jobs(self, new_jobs: list) -> None:
        existing_urls = {j.get("url") for j in self.results}
        truly_new = [j for j in new_jobs if j.get("url") not in existing_urls]
        self.results.extend(truly_new)
        self.save_checkpoint()
        print(f"  Added {len(truly_new)} new jobs (total: {len(self.results)})")

    def finalize(self, filename: str = None) -> Path:
        if not filename:
            from datetime import datetime
            filename = f"{self.job_name}_{datetime.now().strftime('%Y%m%d_%H%M')}.json"
        final_file = self.output_dir / filename
        final_file.write_text(json.dumps(self.results, indent=2))
        self.checkpoint_file.unlink(missing_ok=True)  # clean up checkpoint
        print(f"Finalized: {len(self.results)} jobs saved to {final_file}")
        return final_file

Proxy health monitoring — Track which proxy sessions are getting blocked and rotate away from degraded IPs. ThorData provides automatic rotation, but monitoring your block rate helps tune request parameters.
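One way to monitor block rate is a small rolling counter per proxy session: once the recent failure ratio crosses a threshold, retire that session and mint a new one. A sketch — the window size and threshold here are arbitrary starting points, not tuned values:

```python
from collections import deque

class BlockRateMonitor:
    """Track recent request outcomes and flag degraded proxy sessions."""

    def __init__(self, window: int = 20, max_block_ratio: float = 0.3):
        self.window = window
        self.max_block_ratio = max_block_ratio
        self.outcomes: dict[str, deque] = {}

    def record(self, session_id: str, blocked: bool) -> None:
        # Rolling window: old outcomes fall off automatically via maxlen
        q = self.outcomes.setdefault(session_id, deque(maxlen=self.window))
        q.append(blocked)

    def should_retire(self, session_id: str) -> bool:
        q = self.outcomes.get(session_id)
        if not q or len(q) < 5:  # not enough data to judge yet
            return False
        return sum(q) / len(q) > self.max_block_ratio
```

Call `record(session_id, looks_blocked)` after each page load and check `should_retire` before reusing a session.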

Request timing — Run scrapers during off-peak hours (early morning US time) when Indeed's servers are less loaded and rate limiting is more relaxed.

Job Quality Filtering

Not all Indeed listings are worth tracking. Apply quality filters to surface the most relevant opportunities:

def is_high_quality_job(job: dict) -> bool:
    title = (job.get("title") or "").lower()
    company = (job.get("company") or "").lower()
    description = (job.get("description") or "").lower()

    # Filter out staffing agencies (optional)
    agency_signals = ["staffing", "consulting", "recruiter", "placement", "talent acquisition"]
    if any(signal in company for signal in agency_signals):
        return False

    # Filter out obviously low-quality listings
    spam_signals = ["work from home", "no experience required", "make money fast", "unlimited earning"]
    if any(signal in title for signal in spam_signals):
        return False

    # Require meaningful description length
    if description and len(description) < 200:
        return False

    return True


def filter_by_keywords(
    jobs: list[dict],
    required: list[str] = None,
    excluded: list[str] = None,
) -> list[dict]:
    filtered = []
    for job in jobs:
        desc = (job.get("description") or job.get("title") or "").lower()
        if required and not all(kw.lower() in desc for kw in required):
            continue
        if excluded and any(kw.lower() in desc for kw in excluded):
            continue
        filtered.append(job)
    return filtered


def filter_by_salary(
    jobs: list[dict],
    min_salary: int = 100_000,
    max_salary: int = None,
) -> list[dict]:
    filtered = []
    for job in jobs:
        salary = normalize_salary(job.get("salary"))
        if not salary:
            continue  # skip jobs without salary if filtering by salary
        if salary["max"] < min_salary:
            continue
        if max_salary and salary["min"] > max_salary:
            continue
        filtered.append(job)
    return filtered


# Example: find senior Python roles over $150k with specific tech requirements
def find_senior_python_roles(jobs: list[dict]) -> list[dict]:
    qualified = [j for j in jobs if is_high_quality_job(j)]
    salary_filtered = filter_by_salary(qualified, min_salary=150_000)
    tech_filtered = filter_by_keywords(
        salary_filtered,
        required=["python"],
        excluded=["junior", "entry level", "0-1 year", "fresh graduate"],
    )
    return tech_filtered

Competitor Job Posting Intelligence

For companies and recruiters, monitoring competitor job postings reveals growth signals:

import asyncio
import json
import random
from pathlib import Path
from datetime import datetime
from playwright.async_api import async_playwright


async def monitor_company_jobs(
    company_names: list[str],
    max_pages_per_company: int = 3,
    proxy: dict = None,
) -> dict:
    company_data = {}

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        await context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
        )

        for company in company_names:
            page = await context.new_page()
            jobs = []

            for page_num in range(max_pages_per_company):
                start = page_num * 10
                # Indeed allows searching by company name; replace spaces so
                # multi-word names survive in the query string
                url = f"https://www.indeed.com/jobs?q={company.replace(' ', '+')}&start={start}"
                await page.goto(url, wait_until="networkidle", timeout=30000)

                try:
                    await page.wait_for_selector("[data-testid='jobsearch-resultsList']", timeout=8000)
                except Exception:
                    break

                cards = await page.query_selector_all("[data-testid='slider_item']")
                for card in cards:
                    company_el = await card.query_selector("[data-testid='company-name']")
                    if not company_el:
                        continue
                    card_company = (await company_el.inner_text()).strip().lower()
                    if company.lower() not in card_company:
                        continue  # skip if different company

                    title_el = await card.query_selector("h2 a")
                    location_el = await card.query_selector("[data-testid='text-location']")
                    salary_el = await card.query_selector("[data-testid='attribute_snippet_testid']")
                    date_el = await card.query_selector("[data-testid='myJobsStateDate']")

                    if title_el:
                        href = await title_el.get_attribute("href")
                        jobs.append({
                            "title": (await title_el.inner_text()).strip(),
                            "company": company,
                            "location": (await location_el.inner_text()).strip() if location_el else None,
                            "salary": (await salary_el.inner_text()).strip() if salary_el else None,
                            "posted": (await date_el.inner_text()).strip() if date_el else None,
                            "url": f"https://www.indeed.com{href}" if href and href.startswith("/") else href,
                        })

                await asyncio.sleep(random.uniform(3, 6))

            company_data[company] = jobs
            print(f"  {company}: {len(jobs)} open positions")
            await page.close()
            await asyncio.sleep(random.uniform(5, 10))

        await browser.close()

    return company_data


def analyze_company_hiring_trends(company_data: dict) -> None:
    print("\nCompany hiring overview:")
    print(f"{'Company':<30} {'Open Roles':>10} {'Has Salary Data':>15}")
    print("-" * 58)
    for company, jobs in sorted(company_data.items(), key=lambda x: len(x[1]), reverse=True):
        salary_count = sum(1 for j in jobs if j.get("salary"))
        print(f"{company:<30} {len(jobs):>10} {salary_count:>15}")

    # All titles across all companies
    from collections import Counter
    all_titles = [j["title"].lower() for jobs in company_data.values() for j in jobs]

    # Find common role types
    role_types = Counter()
    role_keywords = ["engineer", "manager", "analyst", "designer", "scientist",
                     "developer", "architect", "lead", "director", "intern"]
    for title in all_titles:
        for kw in role_keywords:
            if kw in title:
                role_types[kw] += 1

    print("\nRole type distribution:")
    for role, count in role_types.most_common(10):
        print(f"  {role}: {count}")

Rate Limit Reference and Production Checklist

Before deploying Indeed scraping in production, verify these checklist items:

Playwright setup:
- Install Playwright: pip install playwright
- Install the browser binary: playwright install chromium
- Verify headless mode works on your server (may need --no-sandbox on Linux)

Proxy configuration:
- Test proxy connectivity before the production run
- Verify the IP is residential, not datacenter, with https://ipinfo.io
- ThorData proxy signup for clean residential IPs

Rate limiting parameters:
- Page delay: 4-8 seconds (randomized)
- Detail fetch delay: 5-10 seconds
- Between searches: 10-20 seconds
- On a 429 response: wait 60-120 seconds minimum

Data validation:
- Check the job card count per page (0 = blocked, <5 = suspected block)
- Monitor for redirects to https://www.indeed.com/viewjob?jk=... vs. expected URL patterns
- Log HTTP status codes to detect soft blocks

Storage:
- Save raw HTML alongside parsed data for debugging
- Use incremental state files so restarts don't lose progress
- Checkpoint after every 25-50 jobs scraped
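The rate-limiting bands and the card-count validation rule above can be sketched as two small helpers. This is a minimal illustration; the names (`DELAY_BANDS`, `next_delay`, `classify_page`) and the structure are our own, not part of any library:

```python
import random

# Delay bands in seconds, taken from the checklist above
DELAY_BANDS = {
    "page": (4, 8),             # between result pages
    "detail": (5, 10),          # between job-detail fetches
    "search": (10, 20),         # between distinct searches
    "rate_limited": (60, 120),  # after an HTTP 429 response
}


def next_delay(kind: str) -> float:
    """Pick a randomized sleep length (seconds) for the given action type."""
    lo, hi = DELAY_BANDS[kind]
    return random.uniform(lo, hi)


def classify_page(card_count: int) -> str:
    """Interpret the number of job cards found on a results page."""
    if card_count == 0:
        return "blocked"
    if card_count < 5:
        return "suspected_block"
    return "ok"
```

Calling `next_delay("page")` before each `page.goto` and checking `classify_page(len(cards))` after each parse keeps the two rules in one place instead of scattered magic numbers.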

Advanced Indeed Scraping: ATS Detection and Application Tracking

One underutilized pattern is detecting the Applicant Tracking System (ATS) behind each job posting. Indeed frequently shows "Apply on company site" — and the redirect destination reveals which ATS the employer uses.

import httpx
import asyncio
from urllib.parse import urlparse

ATS_PATTERNS = {
    "greenhouse.io": "Greenhouse",
    "lever.co": "Lever",
    "workday.com": "Workday",
    "icims.com": "iCIMS",
    "taleo.net": "Taleo",
    "smartrecruiters.com": "SmartRecruiters",
    "jobvite.com": "Jobvite",
    "breezy.hr": "Breezy",
    "ashbyhq.com": "Ashby",
    "rippling.com": "Rippling",
}

async def detect_ats(apply_url: str, client: httpx.AsyncClient) -> str:
    """Follow apply URL redirects to identify the ATS."""
    try:
        response = await client.get(apply_url, follow_redirects=True, timeout=10)
        final_url = str(response.url)
        parsed = urlparse(final_url)
        domain = parsed.netloc.lower()
        for pattern, name in ATS_PATTERNS.items():
            if pattern in domain:
                return name
        return "Unknown"
    except Exception:
        return "Error"

async def enrich_jobs_with_ats(jobs: list[dict]) -> list[dict]:
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    }

    async def easy_apply() -> str:
        # Jobs without an external apply URL use Indeed's own application flow
        return "Indeed Easy Apply"

    async with httpx.AsyncClient(headers=headers) as client:
        tasks = []
        for job in jobs:
            apply_url = job.get("apply_url", "")
            if apply_url and "indeed.com" not in apply_url:
                tasks.append(detect_ats(apply_url, client))
            else:
                tasks.append(easy_apply())
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for job, ats in zip(jobs, results):
            job["ats"] = ats if isinstance(ats, str) else "Error"
    return jobs

ATS data is commercially valuable. Greenhouse and Lever employers tend to be tech-forward startups. Workday and Taleo dominate enterprise/Fortune 500. Filtering by ATS lets you build targeted job boards for specific audiences.
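A minimal filter over the `ats` field that `enrich_jobs_with_ats` adds could look like this (the `filter_by_ats` helper and the sample records are our own, for illustration):

```python
def filter_by_ats(jobs: list[dict], wanted: set[str]) -> list[dict]:
    """Keep only jobs whose detected ATS is in the wanted set (case-insensitive)."""
    wanted_lower = {name.lower() for name in wanted}
    return [job for job in jobs if (job.get("ats") or "").lower() in wanted_lower]


# Target tech-forward startups by their ATS choice
startup_jobs = filter_by_ats(
    [
        {"title": "Backend Engineer", "ats": "Greenhouse"},
        {"title": "HR Generalist", "ats": "Workday"},
        {"title": "ML Engineer", "ats": "Ashby"},
    ],
    wanted={"Greenhouse", "Lever", "Ashby"},
)
# startup_jobs keeps the Greenhouse and Ashby listings only
```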

Scaling with Async Playwright and Multiple Browser Contexts

Running a single browser context sequentially is the bottleneck for large-scale Indeed scraping. Use multiple isolated browser contexts sharing a single Playwright browser instance:

import asyncio
from playwright.async_api import async_playwright

async def scrape_with_context(browser, query: str, location: str, proxy: dict) -> list[dict]:
    context = await browser.new_context(
        proxy=proxy,
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        viewport={"width": 1366, "height": 768},
        locale="en-US",
        timezone_id="America/New_York",
    )
    page = await context.new_page()
    # Remove automation fingerprints
    await page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
        Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
    """)
    jobs = []
    try:
        # Replace spaces so multi-word queries survive in the URL
        url = f"https://www.indeed.com/jobs?q={query.replace(' ', '+')}&l={location.replace(' ', '+')}&start=0"
        await page.goto(url, wait_until="domcontentloaded")
        # Jitter the wait so request timing isn't perfectly uniform
        await page.wait_for_timeout(2000 + asyncio.get_running_loop().time() % 1500)

        cards = await page.query_selector_all("[data-testid='slider_item']")
        for card in cards:
            try:
                title = await card.query_selector("[data-testid='jobTitle']")
                company = await card.query_selector("[data-testid='company-name']")
                location_el = await card.query_selector("[data-testid='text-location']")
                jobs.append({
                    "title": await title.inner_text() if title else "",
                    "company": await company.inner_text() if company else "",
                    "location": await location_el.inner_text() if location_el else "",
                    "query": query,
                })
            except Exception:
                continue
    finally:
        await context.close()
    return jobs

async def parallel_scrape(queries: list[str], location: str, proxies: list[dict]) -> list[dict]:
    """Run multiple browser contexts in parallel, one per query."""
    all_jobs = []
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        sem = asyncio.Semaphore(3)  # Max 3 concurrent contexts

        async def bounded_scrape(query, proxy):
            async with sem:
                return await scrape_with_context(browser, query, location, proxy)

        tasks = [
            bounded_scrape(q, proxies[i % len(proxies)])
            for i, q in enumerate(queries)
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for r in results:
            if isinstance(r, list):
                all_jobs.extend(r)
        await browser.close()
    return all_jobs

# ThorData proxy list — get residential IPs per context
# Sign up at https://thordata.partnerstack.com/partner/0a0x4nzb (or [Oxylabs](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=2066&url_id=174))
PROXIES = [
    {"server": "http://proxy.thordata.net:9000", "username": "USER", "password": "PASS"},
    {"server": "http://proxy.thordata.net:9001", "username": "USER2", "password": "PASS"},
]

Monitoring Indeed for Salary Band Changes

Companies sometimes adjust salary ranges on job postings over time — revealing budget shifts, hiring urgency, or internal compensation restructuring. Track this with a diff-based monitor:

import sqlite3
from datetime import datetime

def init_salary_tracker(db_path: str = "salary_tracker.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS salary_snapshots (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            job_key TEXT NOT NULL,
            company TEXT,
            title TEXT,
            salary_min INTEGER,
            salary_max INTEGER,
            salary_text TEXT,
            scraped_at TEXT,
            UNIQUE(job_key, scraped_at)
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS salary_changes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            job_key TEXT,
            prev_min INTEGER,
            prev_max INTEGER,
            new_min INTEGER,
            new_max INTEGER,
            change_pct REAL,
            detected_at TEXT
        )
    """)
    conn.commit()
    return conn

def record_salary_snapshot(conn, job: dict):
    today = datetime.utcnow().date().isoformat()
    try:
        conn.execute(
            "INSERT OR IGNORE INTO salary_snapshots VALUES (NULL,?,?,?,?,?,?,?)",
            (job["key"], job.get("company"), job.get("title"),
             job.get("salary_min"), job.get("salary_max"),
             job.get("salary_text"), today)
        )
        conn.commit()
    except Exception as e:
        print(f"Insert error: {e}")

def detect_salary_changes(conn) -> list[dict]:
    """Find jobs where salary changed between last two snapshots."""
    query = """
        WITH ranked AS (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY job_key ORDER BY scraped_at DESC) AS rn
            FROM salary_snapshots
            WHERE salary_min IS NOT NULL
        )
        SELECT a.job_key, a.company, a.title,
               b.salary_min AS prev_min, b.salary_max AS prev_max,
               a.salary_min AS new_min, a.salary_max AS new_max
        FROM ranked a
        JOIN ranked b ON a.job_key = b.job_key AND a.rn = 1 AND b.rn = 2
        WHERE a.salary_min != b.salary_min OR a.salary_max != b.salary_max
    """
    rows = conn.execute(query).fetchall()
    changes = []
    for row in rows:
        prev_avg = (row[3] + row[4]) / 2 if row[3] and row[4] else 0
        new_avg = (row[5] + row[6]) / 2 if row[5] and row[6] else 0
        pct = ((new_avg - prev_avg) / prev_avg * 100) if prev_avg else 0
        changes.append({
            "job_key": row[0], "company": row[1], "title": row[2],
            "prev_min": row[3], "prev_max": row[4],
            "new_min": row[5], "new_max": row[6],
            "change_pct": round(pct, 1),
        })
    return changes

Run this daily against your job database to surface employers actively adjusting compensation — a strong signal for negotiation leverage or market shift detection.
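As a self-contained sanity check of the diff query above, here is the same window-function logic run against an in-memory database with two made-up snapshots of one posting (all values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE salary_snapshots (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        job_key TEXT NOT NULL, company TEXT, title TEXT,
        salary_min INTEGER, salary_max INTEGER,
        salary_text TEXT, scraped_at TEXT,
        UNIQUE(job_key, scraped_at)
    )
""")
conn.executemany(
    "INSERT INTO salary_snapshots VALUES (NULL,?,?,?,?,?,?,?)",
    [
        ("abc123", "Acme", "Senior Python Engineer",
         140_000, 170_000, "$140,000 - $170,000", "2026-01-01"),
        ("abc123", "Acme", "Senior Python Engineer",
         150_000, 185_000, "$150,000 - $185,000", "2026-02-01"),
    ],
)
conn.commit()

# rn = 1 is the newest snapshot per job_key, rn = 2 the one before it
row = conn.execute("""
    WITH ranked AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY job_key ORDER BY scraped_at DESC) AS rn
        FROM salary_snapshots
        WHERE salary_min IS NOT NULL
    )
    SELECT b.salary_min, b.salary_max, a.salary_min, a.salary_max
    FROM ranked a
    JOIN ranked b ON a.job_key = b.job_key AND a.rn = 1 AND b.rn = 2
    WHERE a.salary_min != b.salary_min OR a.salary_max != b.salary_max
""").fetchone()

prev_avg = (row[0] + row[1]) / 2   # midpoint of the older band
new_avg = (row[2] + row[3]) / 2    # midpoint of the newer band
change_pct = round((new_avg - prev_avg) / prev_avg * 100, 1)
```

Here the band moves from $140k-$170k to $150k-$185k, a change of roughly +8.1% on the midpoint. Note that window functions require SQLite 3.25 or newer.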

Production Deployment Checklist

Before running an Indeed scraper in production, verify:

- Proxy rotation: ThorData residential, rotate every 50 requests
- Browser fingerprint: randomize viewport, timezone, and locale per session
- Request rate: max 1 request per 3 seconds per IP
- Captcha handling: use Playwright with a real browser (avoid headless detection)
- Data deduplication: hash (title, company, location) as the job key
- Storage: SQLite for <100K jobs; PostgreSQL for larger datasets
- Monitoring: alert when >20% of result pages come back empty (bot detection)
- Respect robots.txt: indeed.com/robots.txt disallows /jobs — use for research only
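The deduplication item above can be implemented as a stable hash over the normalized fields. A minimal sketch (the `job_key` helper name is our own):

```python
import hashlib


def job_key(title: str, company: str, location: str) -> str:
    """Stable dedup key: SHA-1 of the normalized (title, company, location)."""
    raw = "|".join(part.strip().lower() for part in (title, company, location))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


# Normalization makes the key insensitive to case and stray whitespace,
# so the same posting scraped twice maps to the same key
key = job_key("Senior Python Engineer", "Acme Corp", "New York, NY")
```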

Always review the Terms of Service for your specific use case before deploying at scale.