How to Scrape Indeed Job Listings in 2026: Playwright + Anti-Bot Evasion
Indeed is the largest job aggregator globally — over 350 million unique visitors per month and job listings from nearly every industry. If you're building salary comparison tools, tracking hiring trends, or doing labor market research, Indeed has the data. But getting it out requires overcoming some of the most aggressive anti-bot systems in the job board space.
Simple HTTP requests won't work. This guide uses Playwright — a headless browser automation library — because Indeed's 2026 defenses specifically target non-browser clients.
What Data Can You Extract?
Indeed job listings contain:
- Job title, company, location — core listing data
- Salary estimate — Indeed's estimated range or employer-posted salary
- Job description — full text of the posting
- Company rating — Indeed's aggregate employer rating
- Job type — full-time, part-time, contract, remote
- Date posted — relative or absolute posting date
- Benefits — health insurance, 401k, PTO when listed
- Application count — 'X people have applied' indicator
- Hiring urgency signals — 'Urgently hiring', 'Actively reviewing' flags
Indeed's Anti-Bot Measures in 2026
Indeed runs some of the most sophisticated bot detection in the job board space:
- Cloudflare Turnstile — Indeed uses Cloudflare's challenge platform. Requests without a valid cf_clearance cookie get blocked.
- Browser fingerprinting — Canvas hashing, WebGL renderer strings, font enumeration, and audio context fingerprinting are all checked via inline JavaScript.
- Behavioral analysis — Pages track mouse movements, scroll patterns, and time-on-page. No interaction triggers a soft block after a few pages.
- TLS fingerprinting (JA3) — The TLS handshake signature is checked against known bot fingerprints. Python's requests library has a recognizable JA3 hash.
- IP reputation scoring — Datacenter IPs, VPN exit nodes, and previously flagged IPs get immediate challenges.
- Dynamic CSS selectors — Class names on job cards are randomized per session, breaking static CSS selectors between runs.
Why Playwright, Not Requests
Indeed's Cloudflare integration means the page must execute JavaScript to obtain the cf_clearance cookie. You cannot fake this with HTTP requests — you need an actual browser engine. Playwright provides this while being scriptable in Python.
pip install playwright
playwright install chromium
Basic Job Search Scraper
import asyncio
import json
import random
from playwright.async_api import async_playwright
async def scrape_indeed_jobs(
query: str,
location: str,
max_pages: int = 3,
proxy: dict = None,
) -> list:
jobs = []
async with async_playwright() as p:
launch_args = {
"headless": True,
"args": [
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--no-sandbox",
"--disable-setuid-sandbox",
],
}
if proxy:
launch_args["proxy"] = proxy
browser = await p.chromium.launch(**launch_args)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
locale="en-US",
)
# Remove webdriver detection flag
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
delete window.cdc_adoQpoasnfa76pfcZLmcfl_Array;
""")
page = await context.new_page()
for page_num in range(max_pages):
start = page_num * 10
url = f"https://www.indeed.com/jobs?q={query}&l={location}&start={start}"
await page.goto(url, wait_until="networkidle", timeout=30000)
try:
await page.wait_for_selector("[data-testid='jobsearch-resultsList']", timeout=10000)
except Exception:
print(f"No results on page {page_num}, possibly blocked")
break
# Simulate human scrolling
for _ in range(3):
await page.mouse.wheel(0, random.randint(300, 600))
await asyncio.sleep(random.uniform(0.5, 1.5))
# Move mouse to simulate human presence
await page.mouse.move(random.randint(100, 800), random.randint(100, 500))
cards = await page.query_selector_all("[data-testid='slider_item']")
for card in cards:
job = {}
title_el = await card.query_selector("h2 a")
if title_el:
job["title"] = (await title_el.inner_text()).strip()
job["url"] = await title_el.get_attribute("href")
if job["url"] and job["url"].startswith("/"):
job["url"] = f"https://www.indeed.com{job['url']}"
company_el = await card.query_selector("[data-testid='company-name']")
job["company"] = (await company_el.inner_text()).strip() if company_el else None
location_el = await card.query_selector("[data-testid='text-location']")
job["location"] = (await location_el.inner_text()).strip() if location_el else None
salary_el = await card.query_selector("[data-testid='attribute_snippet_testid']")
job["salary"] = (await salary_el.inner_text()).strip() if salary_el else None
date_el = await card.query_selector("[data-testid='myJobsStateDate']")
job["posted"] = (await date_el.inner_text()).strip() if date_el else None
jobs.append(job)
print(f" Page {page_num + 1}: found {len(cards)} job cards")
await asyncio.sleep(random.uniform(4, 8))
await browser.close()
return jobs
Scraping Full Job Descriptions
Job cards only show previews. Full descriptions require opening each posting:
async def scrape_job_detail(url: str, context) -> dict:
page = await context.new_page()
try:
await page.goto(url, wait_until="networkidle", timeout=30000)
await page.wait_for_selector("#jobDescriptionText", timeout=10000)
description = await page.inner_text("#jobDescriptionText")
salary = None
salary_el = await page.query_selector("#salaryInfoAndJobType")
if salary_el:
salary = (await salary_el.inner_text()).strip()
benefits = []
benefit_els = await page.query_selector_all("[data-testid='benefits-entry']")
for b in benefit_els:
benefits.append((await b.inner_text()).strip())
job_type = None
type_el = await page.query_selector("[data-testid='jobsearch-JobInfoHeader-jobType']")
if type_el:
job_type = (await type_el.inner_text()).strip()
return {
"description": description.strip(),
"salary_detail": salary,
"benefits": benefits,
"job_type": job_type,
}
except Exception as e:
return {"error": str(e)}
finally:
await page.close()
Handling Dynamic CSS Selectors
Indeed randomizes CSS class names between sessions, but data-testid attributes are stable. Always prefer [data-testid='...'] selectors over class-based ones. If Indeed removes a testid, fall back to structural selectors:
# Stable: data-testid attributes
title_el = await card.query_selector("[data-testid='jobTitle']")
company_el = await card.query_selector("[data-testid='company-name']")
# Fallback: select by structure
if not title_el:
title_el = await card.query_selector("h2 a")
if not company_el:
company_el = await card.query_selector("h2 + div span")
# Extract job ID from data attribute as stable identifier
jk_attr = await card.get_attribute("data-jk")
if jk_attr:
job_id = jk_attr
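That fallback chain can be generalized into a small helper that walks a selector priority list, so each field needs only one call site. The name query_first is my own, not part of Playwright's API:

```python
async def query_first(node, selectors: list[str]):
    """Return the first element matched by any selector, in priority order."""
    for sel in selectors:
        el = await node.query_selector(sel)
        if el is not None:
            return el
    return None

# Inside the card loop:
#   title_el = await query_first(card, ["[data-testid='jobTitle']", "h2 a"])
#   company_el = await query_first(card, ["[data-testid='company-name']", "h2 + div span"])
```

The helper works with any Playwright element handle because it only relies on query_selector.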
Proxy Strategy for Indeed
Indeed's IP reputation system is the toughest obstacle. Datacenter proxies last maybe 5-10 requests before hitting Turnstile challenges. Free proxy lists are almost entirely pre-flagged.
Residential proxies are the only reliable option for sustained Indeed scraping. ThorData works well here because their residential IPs have clean reputation scores — they have not been abused by other scraping operations. This matters specifically for Cloudflare Turnstile, which maintains a shared IP reputation database across all sites using it.
# ThorData residential proxy — https://thordata.partnerstack.com/partner/0a0x4nzb (or [Oxylabs](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=2066&url_id=174))
PROXY_CONFIG = {
"server": "http://proxy.thordata.net:9000",
"username": "YOUR_THORDATA_USER",
"password": "YOUR_THORDATA_PASS",
}
async def main():
jobs = await scrape_indeed_jobs(
query="python+developer",
location="Remote",
max_pages=5,
proxy=PROXY_CONFIG,
)
print(f"Found {len(jobs)} listings")
for j in jobs[:5]:
        salary_str = j.get('salary') or 'No salary listed'  # handles missing key and None value
print(f" {j['title']} @ {j['company']} | {salary_str}")
asyncio.run(main())
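One caveat: query="python+developer" above is URL-encoded by hand. A small urllib.parse helper handles spaces and punctuation safely; build_search_url is my own name, not an Indeed endpoint:

```python
from urllib.parse import quote_plus

def build_search_url(query: str, location: str, start: int = 0) -> str:
    """Build an Indeed search URL with properly encoded q/l parameters."""
    return (
        "https://www.indeed.com/jobs"
        f"?q={quote_plus(query)}&l={quote_plus(location)}&start={start}"
    )

print(build_search_url("python developer", "New York, NY", 10))
# https://www.indeed.com/jobs?q=python+developer&l=New+York%2C+NY&start=10
```

quote_plus encodes spaces as + (the form-encoding Indeed's search uses) and escapes commas and other reserved characters.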
Handling Cloudflare Challenges
If you encounter a Cloudflare challenge page, a few strategies help navigate it:
import asyncio
from playwright.async_api import async_playwright
async def bypass_cloudflare(url: str, proxy: dict = None) -> str:
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy=proxy,
args=[
"--disable-blink-features=AutomationControlled",
"--no-sandbox",
],
)
context = await browser.new_context(
viewport={"width": 1440, "height": 900},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
)
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5]});
""")
page = await context.new_page()
await page.goto(url, wait_until="domcontentloaded", timeout=60000)
# Wait for challenge to clear (up to 15s)
for sel in ["#challenge-running", "#cf-spinner-please-wait", ".cf-browser-verification"]:
try:
await page.wait_for_selector(sel, state="hidden", timeout=15000)
except Exception:
pass
await asyncio.sleep(3)
content = await page.content()
await browser.close()
return content
Salary Data Extraction and Normalization
Indeed's salary display is inconsistent. Some listings use annual ranges, others hourly. Normalizing to comparable annual figures:
import re
def normalize_salary(raw_salary: str) -> dict | None:
if not raw_salary:
return None
raw = raw_salary.lower().strip()
nums = re.findall(r'[\d,]+', raw.replace(',', ''))
if not nums:
return None
amounts = [int(n) for n in nums if n]
if not amounts:
return None
is_hourly = bool(re.search(r'/hr|per hour|/hour|an hour', raw))
is_monthly = bool(re.search(r'/month|per month|/mo', raw))
def annualize(amt):
if is_hourly:
return amt * 2080
if is_monthly:
return amt * 12
return amt
annual = [annualize(a) for a in amounts]
return {
"min": min(annual),
"max": max(annual),
"mid": sum(annual) / len(annual),
"raw": raw_salary,
"period": "hourly" if is_hourly else "monthly" if is_monthly else "annual",
}
def print_salary_summary(jobs: list) -> None:
salaries = [normalize_salary(j.get('salary')) for j in jobs]
salaries = [s for s in salaries if s]
if not salaries:
print('No salary data found')
return
mids = [s['mid'] for s in salaries]
print(f'Salary stats across {len(salaries)} listings:')
print(f" Min: ${min(s['min'] for s in salaries):>9,.0f}")
print(f" Max: ${max(s['max'] for s in salaries):>9,.0f}")
print(f" Median: ${sorted(mids)[len(mids)//2]:>9,.0f}")
Remote Job Filtering
Use Indeed's built-in remote filter for clean remote-only datasets:
async def scrape_remote_jobs(query: str, max_pages: int = 5, proxy: dict = None) -> list:
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy=proxy,
args=['--disable-blink-features=AutomationControlled'],
)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
)
await context.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)
page = await context.new_page()
jobs = []
for page_num in range(max_pages):
start = page_num * 10
# remotejobs=1 filters to remote-only listings
url = f"https://www.indeed.com/jobs?q={query}&remotejobs=1&start={start}"
await page.goto(url, wait_until="networkidle", timeout=30000)
try:
await page.wait_for_selector("[data-testid='jobsearch-resultsList']", timeout=10000)
except Exception:
break
cards = await page.query_selector_all("[data-testid='slider_item']")
for card in cards:
title_el = await card.query_selector('h2 a')
company_el = await card.query_selector("[data-testid='company-name']")
salary_el = await card.query_selector("[data-testid='attribute_snippet_testid']")
if title_el:
href = await title_el.get_attribute('href')
jobs.append({
'title': (await title_el.inner_text()).strip(),
'company': (await company_el.inner_text()).strip() if company_el else None,
'salary': (await salary_el.inner_text()).strip() if salary_el else None,
'remote': True,
'url': f"https://www.indeed.com{href}" if href and href.startswith('/') else href,
})
await asyncio.sleep(random.uniform(4, 7))
await browser.close()
return jobs
Incremental Scraping: Only New Listings
For ongoing job tracking, skip already-seen listings:
import json
from pathlib import Path
class IncrementalJobScraper:
def __init__(self, state_file: str = 'seen_jobs.json'):
self.state_file = Path(state_file)
self.seen_ids = self._load_seen()
def _load_seen(self) -> set:
if self.state_file.exists():
data = json.loads(self.state_file.read_text())
return set(data.get('seen_ids', []))
return set()
def _save_seen(self) -> None:
self.state_file.write_text(json.dumps({'seen_ids': list(self.seen_ids)}, indent=2))
def filter_new(self, jobs: list) -> list:
new_jobs = []
for job in jobs:
job_id = job.get('url', '')
if job_id and job_id not in self.seen_ids:
new_jobs.append(job)
self.seen_ids.add(job_id)
self._save_seen()
return new_jobs
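To see the dedup behavior end to end, here is the class again with a throwaway state file; the second run filters out everything the first run recorded:

```python
import json
import tempfile
from pathlib import Path

class IncrementalJobScraper:
    def __init__(self, state_file: str = 'seen_jobs.json'):
        self.state_file = Path(state_file)
        self.seen_ids = self._load_seen()

    def _load_seen(self) -> set:
        if self.state_file.exists():
            data = json.loads(self.state_file.read_text())
            return set(data.get('seen_ids', []))
        return set()

    def _save_seen(self) -> None:
        self.state_file.write_text(json.dumps({'seen_ids': list(self.seen_ids)}, indent=2))

    def filter_new(self, jobs: list) -> list:
        new_jobs = []
        for job in jobs:
            job_id = job.get('url', '')
            if job_id and job_id not in self.seen_ids:
                new_jobs.append(job)
                self.seen_ids.add(job_id)
        self._save_seen()
        return new_jobs

# Two runs against the same state file: the second run sees nothing new
state = str(Path(tempfile.mkdtemp()) / 'seen_jobs.json')
listing = [{'url': 'https://www.indeed.com/viewjob?jk=abc123'}]
print(len(IncrementalJobScraper(state).filter_new(listing)))  # 1
print(len(IncrementalJobScraper(state).filter_new(listing)))  # 0
```

Because state is persisted to disk on every call, the dedup survives process restarts, which matters for cron-scheduled runs.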
Saving Job Data
import csv
import json
from pathlib import Path
from datetime import datetime
def save_jobs(jobs: list, prefix: str = 'indeed_jobs', output_dir: str = '.') -> None:
if not jobs:
print('No jobs to save')
return
out = Path(output_dir)
out.mkdir(exist_ok=True)
timestamp = datetime.now().strftime('%Y%m%d_%H%M')
json_file = out / f'{prefix}_{timestamp}.json'
json_file.write_text(json.dumps(jobs, indent=2, ensure_ascii=False))
csv_file = out / f'{prefix}_{timestamp}.csv'
keys = ['title', 'company', 'location', 'salary', 'posted', 'job_type', 'url']
with open(csv_file, 'w', newline='', encoding='utf-8') as f:
writer = csv.DictWriter(f, fieldnames=keys, extrasaction='ignore')
writer.writeheader()
writer.writerows(jobs)
print(f'Saved {len(jobs)} jobs: {json_file}, {csv_file}')
Legal Considerations
Indeed's Terms of Service prohibit scraping. In hiQ Labs v. LinkedIn, the Ninth Circuit held that scraping publicly accessible data likely does not violate the CFAA, though hiQ ultimately lost on breach-of-contract grounds, and Indeed has pursued legal action against scrapers under state computer fraud laws. Keep your volumes moderate, do not scrape behind login walls, and use the data for analysis, not for rebuilding Indeed's listings database.
Key Takeaways
- Indeed requires a real browser engine — Playwright is the right tool. HTTP requests alone cannot pass Cloudflare Turnstile.
- Remove the webdriver navigator flag and simulate mouse/scroll behavior to avoid behavioral detection.
- Use data-testid selectors — they are stable across sessions, unlike randomized class names.
- Clean residential proxies are critical for Indeed. ThorData's residential pool maintains the IP reputation needed to pass Cloudflare challenges consistently.
- Add 4-8 second delays between pages and randomize all timing. Indeed actively monitors request cadence.
- Store results incrementally — Indeed blocks are sudden, so save after each successful page.
- Normalize salary data at collection time — hourly, monthly, and annual figures need to be made comparable for analysis.
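The delay guidance above can live in a single helper so every wait in the scraper is jittered the same way. The name polite_pause is my own; the defaults mirror the 4-8 second recommendation:

```python
import asyncio
import random

async def polite_pause(low: float = 4.0, high: float = 8.0) -> float:
    """Sleep for a random interval to avoid a detectable request cadence."""
    delay = random.uniform(low, high)
    await asyncio.sleep(delay)
    return delay

# await polite_pause()          # between result pages
# await polite_pause(10, 20)    # between distinct searches
```

Returning the chosen delay makes it easy to log actual timing when tuning block rates.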
Building a Job Alert System
Combine Indeed scraping with email notifications to build a personal job alert that catches new listings:
import asyncio
import json
import random
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from pathlib import Path
from datetime import datetime
from playwright.async_api import async_playwright
STATE_FILE = Path("indeed_alert_state.json")
def load_seen_jobs() -> set:
if STATE_FILE.exists():
return set(json.loads(STATE_FILE.read_text()).get("seen_urls", []))
return set()
def save_seen_jobs(seen: set) -> None:
STATE_FILE.write_text(json.dumps({"seen_urls": list(seen)}, indent=2))
def send_alert_email(new_jobs: list, smtp_config: dict) -> None:
if not new_jobs:
return
subject = f"Indeed Alert: {len(new_jobs)} new job(s) found"
body_lines = [f"Found {len(new_jobs)} new listings:\n"]
for job in new_jobs[:20]:
body_lines.append(f"- {job.get('title')} @ {job.get('company')}")
if job.get("salary"):
body_lines.append(f" Salary: {job['salary']}")
body_lines.append(f" Location: {job.get('location')}")
body_lines.append(f" URL: {job.get('url', 'N/A')}")
body_lines.append("")
msg = MIMEMultipart()
msg["Subject"] = subject
msg["From"] = smtp_config["from"]
msg["To"] = smtp_config["to"]
msg.attach(MIMEText("\n".join(body_lines), "plain"))
with smtplib.SMTP_SSL(smtp_config["host"], smtp_config["port"]) as server:
server.login(smtp_config["user"], smtp_config["password"])
server.sendmail(smtp_config["from"], smtp_config["to"], msg.as_string())
print(f"Alert sent: {len(new_jobs)} new jobs")
async def run_job_alert(
queries: list[dict],
proxy: dict = None,
smtp_config: dict = None,
) -> list:
seen = load_seen_jobs()
all_new_jobs = []
for query_config in queries:
keywords = query_config["keywords"]
location = query_config.get("location", "Remote")
print(f"Checking: {keywords} in {location}")
jobs = await scrape_indeed_jobs(
query=keywords,
location=location,
max_pages=2,
proxy=proxy,
)
new_jobs = [j for j in jobs if j.get("url") and j["url"] not in seen]
if new_jobs:
print(f" {len(new_jobs)} new listings found!")
for job in new_jobs:
seen.add(job["url"])
all_new_jobs.extend(new_jobs)
else:
print(f" No new listings")
await asyncio.sleep(random.uniform(8, 15))
save_seen_jobs(seen)
if all_new_jobs and smtp_config:
send_alert_email(all_new_jobs, smtp_config)
return all_new_jobs
Scraping Company Reviews and Ratings
Indeed surfaces company ratings alongside job listings. You can scrape employer review data for company intelligence:
import asyncio
import random
from playwright.async_api import async_playwright
async def scrape_company_reviews(
company_name: str,
max_pages: int = 5,
proxy: dict = None,
) -> list[dict]:
reviews = []
company_slug = company_name.lower().replace(" ", "-")
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy=proxy,
args=["--disable-blink-features=AutomationControlled"],
)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
)
await context.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)
page = await context.new_page()
for page_num in range(max_pages):
start = page_num * 20
url = f"https://www.indeed.com/cmp/{company_slug}/reviews?start={start}"
await page.goto(url, wait_until="networkidle", timeout=30000)
try:
await page.wait_for_selector("[data-testid='review-card']", timeout=8000)
except Exception:
break
review_cards = await page.query_selector_all("[data-testid='review-card']")
for card in review_cards:
title_el = await card.query_selector("[data-testid='review-title']")
rating_el = await card.query_selector("[data-testid='review-rating']")
pros_el = await card.query_selector("[data-testid='review-pros']")
cons_el = await card.query_selector("[data-testid='review-cons']")
date_el = await card.query_selector("[data-testid='review-date']")
job_title_el = await card.query_selector("[data-testid='review-job-title']")
reviews.append({
"title": (await title_el.inner_text()).strip() if title_el else None,
"rating": (await rating_el.get_attribute("aria-label")) if rating_el else None,
"pros": (await pros_el.inner_text()).strip() if pros_el else None,
"cons": (await cons_el.inner_text()).strip() if cons_el else None,
"date": (await date_el.inner_text()).strip() if date_el else None,
"job_title": (await job_title_el.inner_text()).strip() if job_title_el else None,
})
await asyncio.sleep(random.uniform(3, 6))
await browser.close()
return reviews
Scraping Salary Estimates by Role
Indeed's salary estimation tool provides aggregated salary data by job title and location. You can query this directly:
import asyncio
import random
from playwright.async_api import async_playwright
async def scrape_salary_estimates(
job_title: str,
location: str = "United States",
proxy: dict = None,
) -> dict:
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy=proxy,
args=["--disable-blink-features=AutomationControlled"],
)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
)
await context.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)
page = await context.new_page()
# Indeed salary page structure
url_title = job_title.replace(" ", "-").lower()
url = f"https://www.indeed.com/career/{url_title}/salaries"
await page.goto(url, wait_until="networkidle", timeout=30000)
await asyncio.sleep(2)
# Extract salary display
result = {}
salary_el = await page.query_selector("[data-testid='salary-hero-amount']")
if salary_el:
result["average_salary"] = (await salary_el.inner_text()).strip()
range_els = await page.query_selector_all("[data-testid='salary-percentile']")
percentiles = []
for el in range_els:
percentiles.append((await el.inner_text()).strip())
if percentiles:
result["percentiles"] = percentiles
await browser.close()
return result
# Async runner
async def batch_salary_lookup(roles: list[str]) -> dict:
results = {}
for role in roles:
print(f"Looking up salary for: {role}")
data = await scrape_salary_estimates(role)
results[role] = data
print(f" {data.get('average_salary', 'N/A')}")
await asyncio.sleep(random.uniform(5, 10))
return results
Indeed Job Market Analytics Dashboard
Combine multiple scraped datasets to build a job market analytics view:
import json
import statistics
from pathlib import Path
from collections import Counter, defaultdict
from datetime import datetime
def generate_market_report(jobs_file: str) -> str:
jobs = json.loads(Path(jobs_file).read_text())
if not jobs:
return "No data"
# Company hiring volume
companies = Counter(j["company"] for j in jobs if j.get("company"))
# Location distribution
locations = Counter(j["location"] for j in jobs if j.get("location"))
# Salary analysis
salary_jobs = []
for job in jobs:
s = normalize_salary(job.get("salary"))
if s:
salary_jobs.append(s)
# Remote vs. on-site
remote_count = sum(
1 for j in jobs
if j.get("location") and "remote" in j["location"].lower()
)
# Posted date distribution
today = datetime.now().strftime("%Y-%m-%d")
posted_today = sum(1 for j in jobs if j.get("posted") and today in str(j["posted"]))
lines = []
lines.append("# Indeed Job Market Report")
lines.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
lines.append("")
lines.append(f"**Total listings analyzed:** {len(jobs)}")
lines.append(f"**Remote positions:** {remote_count} ({remote_count/len(jobs)*100:.1f}%)")
lines.append(f"**Posted today:** {posted_today}")
lines.append(f"**Listings with salary data:** {len(salary_jobs)} ({len(salary_jobs)/len(jobs)*100:.1f}%)")
lines.append("")
if salary_jobs:
mids = [s["mid"] for s in salary_jobs]
lines.append("## Salary Statistics")
lines.append(f"- Median: ${statistics.median(mids):,.0f}")
lines.append(f"- Mean: ${statistics.mean(mids):,.0f}")
lines.append(f"- Min range: ${min(s['min'] for s in salary_jobs):,.0f}")
lines.append(f"- Max range: ${max(s['max'] for s in salary_jobs):,.0f}")
lines.append("")
lines.append("## Top Hiring Companies")
for company, count in companies.most_common(15):
lines.append(f"- {company}: {count} postings")
lines.append("")
lines.append("## Top Locations")
for location, count in locations.most_common(10):
lines.append(f"- {location}: {count} postings")
return "\n".join(lines)
Production Deployment Considerations
When running Indeed scraping in production (e.g., scheduled daily collection), consider these operational patterns:
State management — Always maintain a state file tracking seen job IDs. This prevents re-processing listings and enables efficient incremental runs.
Error recovery — Indeed blocks can happen mid-run. Structure your scraper to save progress after each page so a block on page 5 of 10 does not lose pages 1-4:
import json
from pathlib import Path
class ProgressiveScraper:
def __init__(self, job_name: str, output_dir: str = "jobs"):
self.job_name = job_name
self.output_dir = Path(output_dir)
self.output_dir.mkdir(exist_ok=True)
self.checkpoint_file = self.output_dir / f"{job_name}_checkpoint.json"
self.results = self._load_checkpoint()
def _load_checkpoint(self) -> list:
if self.checkpoint_file.exists():
data = json.loads(self.checkpoint_file.read_text())
print(f"Resuming from checkpoint: {len(data)} jobs already collected")
return data
return []
def save_checkpoint(self) -> None:
self.checkpoint_file.write_text(json.dumps(self.results, indent=2))
def add_jobs(self, new_jobs: list) -> None:
existing_urls = {j.get("url") for j in self.results}
truly_new = [j for j in new_jobs if j.get("url") not in existing_urls]
self.results.extend(truly_new)
self.save_checkpoint()
print(f" Added {len(truly_new)} new jobs (total: {len(self.results)})")
def finalize(self, filename: str = None) -> Path:
if not filename:
from datetime import datetime
filename = f"{self.job_name}_{datetime.now().strftime('%Y%m%d_%H%M')}.json"
final_file = self.output_dir / filename
final_file.write_text(json.dumps(self.results, indent=2))
self.checkpoint_file.unlink(missing_ok=True) # clean up checkpoint
print(f"Finalized: {len(self.results)} jobs saved to {final_file}")
return final_file
Proxy health monitoring — Track which proxy sessions are getting blocked and rotate away from degraded IPs. ThorData provides automatic rotation, but monitoring your block rate helps tune request parameters.
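A minimal sketch of that monitoring, assuming the scraper reports each page-fetch outcome. The class name and the 30% rotation threshold are my own choices, not a ThorData feature:

```python
class ProxyHealthMonitor:
    """Track the recent block rate for a proxy session and flag degraded ones."""

    def __init__(self, window: int = 20, max_block_rate: float = 0.3):
        self.window = window
        self.max_block_rate = max_block_rate
        self.outcomes: list[bool] = []  # True = page fetch was blocked

    def record(self, blocked: bool) -> None:
        self.outcomes.append(blocked)
        self.outcomes = self.outcomes[-self.window:]  # keep recent history only

    @property
    def block_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def should_rotate(self) -> bool:
        # Require a few samples before judging the session
        return len(self.outcomes) >= 5 and self.block_rate > self.max_block_rate
```

Call record() after each page fetch and start a fresh proxy session whenever should_rotate() flips to True.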
Request timing — Run scrapers during off-peak hours (early morning US time) when Indeed's servers are less loaded and rate limiting is more relaxed.
Job Quality Filtering
Not all Indeed listings are worth tracking. Apply quality filters to surface the most relevant opportunities:
import re
from typing import Optional
def is_high_quality_job(job: dict) -> bool:
title = (job.get("title") or "").lower()
company = (job.get("company") or "").lower()
description = (job.get("description") or "").lower()
# Filter out staffing agencies (optional)
agency_signals = ["staffing", "consulting", "recruiter", "placement", "talent acquisition"]
if any(signal in company for signal in agency_signals):
return False
# Filter out obviously low-quality listings
spam_signals = ["work from home", "no experience required", "make money fast", "unlimited earning"]
    if any(signal in title for signal in spam_signals):  # title is already lowercased above
return False
# Require meaningful description length
if description and len(description) < 200:
return False
return True
def filter_by_keywords(
jobs: list[dict],
required: list[str] = None,
excluded: list[str] = None,
) -> list[dict]:
filtered = []
for job in jobs:
desc = (job.get("description") or job.get("title") or "").lower()
if required and not all(kw.lower() in desc for kw in required):
continue
if excluded and any(kw.lower() in desc for kw in excluded):
continue
filtered.append(job)
return filtered
def filter_by_salary(
jobs: list[dict],
min_salary: int = 100_000,
max_salary: int = None,
) -> list[dict]:
filtered = []
for job in jobs:
salary = normalize_salary(job.get("salary"))
if not salary:
continue # skip jobs without salary if filtering by salary
if salary["max"] < min_salary:
continue
if max_salary and salary["min"] > max_salary:
continue
filtered.append(job)
return filtered
# Example: find senior Python roles over $150k with specific tech requirements
def find_senior_python_roles(jobs: list[dict]) -> list[dict]:
qualified = [j for j in jobs if is_high_quality_job(j)]
salary_filtered = filter_by_salary(qualified, min_salary=150_000)
tech_filtered = filter_by_keywords(
salary_filtered,
required=["python"],
excluded=["junior", "entry level", "0-1 year", "fresh graduate"],
)
return tech_filtered
Competitor Job Posting Intelligence
For companies and recruiters, monitoring competitor job postings reveals growth signals:
import asyncio
import json
import random
from pathlib import Path
from datetime import datetime
from playwright.async_api import async_playwright
async def monitor_company_jobs(
company_names: list[str],
max_pages_per_company: int = 3,
proxy: dict = None,
) -> dict:
company_data = {}
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy=proxy,
args=["--disable-blink-features=AutomationControlled"],
)
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
)
await context.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)
for company in company_names:
page = await context.new_page()
jobs = []
for page_num in range(max_pages_per_company):
start = page_num * 10
# Indeed allows searching by company name
url = f"https://www.indeed.com/jobs?q={company}&start={start}"
await page.goto(url, wait_until="networkidle", timeout=30000)
try:
await page.wait_for_selector("[data-testid='jobsearch-resultsList']", timeout=8000)
except Exception:
break
cards = await page.query_selector_all("[data-testid='slider_item']")
for card in cards:
company_el = await card.query_selector("[data-testid='company-name']")
if not company_el:
continue
card_company = (await company_el.inner_text()).strip().lower()
if company.lower() not in card_company:
continue # skip if different company
title_el = await card.query_selector("h2 a")
location_el = await card.query_selector("[data-testid='text-location']")
salary_el = await card.query_selector("[data-testid='attribute_snippet_testid']")
date_el = await card.query_selector("[data-testid='myJobsStateDate']")
if title_el:
href = await title_el.get_attribute("href")
jobs.append({
"title": (await title_el.inner_text()).strip(),
"company": company,
"location": (await location_el.inner_text()).strip() if location_el else None,
"salary": (await salary_el.inner_text()).strip() if salary_el else None,
"posted": (await date_el.inner_text()).strip() if date_el else None,
"url": f"https://www.indeed.com{href}" if href and href.startswith("/") else href,
})
await asyncio.sleep(random.uniform(3, 6))
company_data[company] = jobs
print(f" {company}: {len(jobs)} open positions")
await page.close()
await asyncio.sleep(random.uniform(5, 10))
await browser.close()
return company_data
def analyze_company_hiring_trends(company_data: dict) -> None:
print("\nCompany hiring overview:")
print(f"{'Company':<30} {'Open Roles':>10} {'Has Salary Data':>15}")
print("-" * 58)
for company, jobs in sorted(company_data.items(), key=lambda x: len(x[1]), reverse=True):
salary_count = sum(1 for j in jobs if j.get("salary"))
print(f"{company:<30} {len(jobs):>10} {salary_count:>15}")
# All titles across all companies
from collections import Counter
all_titles = [j["title"].lower() for jobs in company_data.values() for j in jobs]
# Find common role types
role_types = Counter()
role_keywords = ["engineer", "manager", "analyst", "designer", "scientist",
"developer", "architect", "lead", "director", "intern"]
for title in all_titles:
for kw in role_keywords:
if kw in title:
role_types[kw] += 1
print("\nRole type distribution:")
for role, count in role_types.most_common(10):
print(f" {role}: {count}")
Rate Limit Reference and Production Checklist
Before deploying Indeed scraping in production, verify these checklist items:
Playwright setup:
- Install Playwright: pip install playwright
- Install browser: playwright install chromium
- Verify headless mode works on your server (may need --no-sandbox on Linux)
Proxy configuration:
- Test proxy connectivity before production run
- Verify residential IP type (not datacenter) with https://ipinfo.io
- ThorData proxy signup for clean residential IPs
Rate limiting parameters:
- Page delay: 4-8 seconds (randomized)
- Detail fetch delay: 5-10 seconds
- Between searches: 10-20 seconds
- On 429 response: wait 60-120 seconds minimum
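These bands can be centralized in one small helper so every pause in the scraper draws from the same randomized ranges. A sketch; the band names and the helper itself are illustrative, not part of any library:

```python
import asyncio
import random

# Delay bands from the checklist above, in seconds
DELAY_BANDS = {
    "page": (4, 8),
    "detail": (5, 10),
    "search": (10, 20),
    "rate_limited": (60, 120),
}

async def polite_sleep(band: str) -> float:
    """Sleep for a random duration drawn from the named band; return the delay used."""
    low, high = DELAY_BANDS[band]
    delay = random.uniform(low, high)
    await asyncio.sleep(delay)
    return delay
```

Calling `await polite_sleep("page")` between result pages keeps every delay randomized from a single source of truth, so tuning the bands later means editing one dict.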
Data validation:
- Check job card count per page (0 = blocked, <5 = suspected block)
- Monitor for redirect to https://www.indeed.com/viewjob?jk=... vs expected URL patterns
- Log HTTP status codes to detect soft blocks
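The three validation bullets combine naturally into one heuristic check run after each page load. The thresholds come from the checklist above; the function itself is an illustrative sketch, not a library API:

```python
def is_soft_block(status_code: int, card_count: int) -> bool:
    """Heuristic from the validation checklist: flag pages that look blocked."""
    # Hard signals: explicit forbidden / rate-limited responses
    if status_code in (403, 429):
        return True
    # Soft signal: 0 cards means blocked, fewer than 5 is a suspected block
    return card_count < 5
```

Log the result per page and alert when the soft-block rate climbs, rather than aborting on a single suspicious page.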
Storage:
- Save raw HTML alongside parsed data for debugging
- Use incremental state files so restarts don't lose progress
- Checkpoint after every 25-50 jobs scraped
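The incremental-state bullet can be as simple as a JSON file of completed job keys, loaded on startup and flushed periodically. A minimal sketch, with an arbitrary filename and flush interval:

```python
import json
import os

class ScrapeCheckpoint:
    """Persist scraped job keys so a restart skips already-completed work."""

    def __init__(self, path: str = "checkpoint.json", flush_every: int = 25):
        self.path = path
        self.flush_every = flush_every
        self.seen: set[str] = set()
        if os.path.exists(path):
            with open(path) as f:
                self.seen = set(json.load(f))

    def add(self, job_key: str) -> None:
        """Record a completed job; flush to disk every `flush_every` keys."""
        self.seen.add(job_key)
        if len(self.seen) % self.flush_every == 0:
            self.flush()

    def flush(self) -> None:
        with open(self.path, "w") as f:
            json.dump(sorted(self.seen), f)
```

On restart, check `if key in checkpoint.seen: continue` before fetching a detail page, so a crash at job 4,000 costs at most one flush interval of rework.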
Advanced Indeed Scraping: ATS Detection and Application Tracking
One underutilized pattern is detecting the Applicant Tracking System (ATS) behind each job posting. Indeed frequently shows "Apply on company site" — and the redirect destination reveals which ATS the employer uses.
import httpx
import asyncio
from urllib.parse import urlparse
ATS_PATTERNS = {
"greenhouse.io": "Greenhouse",
"lever.co": "Lever",
"workday.com": "Workday",
"icims.com": "iCIMS",
"taleo.net": "Taleo",
"smartrecruiters.com": "SmartRecruiters",
"jobvite.com": "Jobvite",
"breezy.hr": "Breezy",
"ashbyhq.com": "Ashby",
"rippling.com": "Rippling",
}
async def detect_ats(apply_url: str, client: httpx.AsyncClient) -> str:
"""Follow apply URL redirects to identify the ATS."""
try:
response = await client.get(apply_url, follow_redirects=True, timeout=10)
final_url = str(response.url)
parsed = urlparse(final_url)
domain = parsed.netloc.lower()
for pattern, name in ATS_PATTERNS.items():
if pattern in domain:
return name
return "Unknown"
except Exception:
return "Error"
async def enrich_jobs_with_ats(jobs: list[dict]) -> list[dict]:
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
}
async with httpx.AsyncClient(headers=headers) as client:
tasks = []
for job in jobs:
apply_url = job.get("apply_url", "")
if apply_url and "indeed.com" not in apply_url:
tasks.append(detect_ats(apply_url, client))
else:
                # asyncio.coroutine was removed in Python 3.11; use an async helper instead
                async def _easy_apply() -> str:
                    return "Indeed Easy Apply"
                tasks.append(_easy_apply())
results = await asyncio.gather(*tasks, return_exceptions=True)
for job, ats in zip(jobs, results):
job["ats"] = ats if isinstance(ats, str) else "Error"
return jobs
ATS data is commercially valuable. Greenhouse and Lever employers tend to be tech-forward startups. Workday and Taleo dominate enterprise/Fortune 500. Filtering by ATS lets you build targeted job boards for specific audiences.
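Once each job carries the `ats` field added above, segmenting is straightforward with `collections.Counter`. A small sketch; the function names are illustrative:

```python
from collections import Counter

def ats_distribution(jobs: list[dict]) -> list[tuple[str, int]]:
    """Count jobs per ATS, most common first."""
    return Counter(j.get("ats", "Unknown") for j in jobs).most_common()

def filter_by_ats(jobs: list[dict], ats_names: set[str]) -> list[dict]:
    """Keep only jobs whose ATS is in the given set, e.g. startup-leaning systems."""
    return [j for j in jobs if j.get("ats") in ats_names]
```

For example, `filter_by_ats(jobs, {"Greenhouse", "Lever", "Ashby"})` would narrow a dataset to employers on startup-favored systems, which is exactly the segmentation a niche job board needs.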
Scaling with Async Playwright and Multiple Browser Contexts
Running a single browser context sequentially is the bottleneck for large-scale Indeed scraping. Use multiple isolated browser contexts sharing a single Playwright browser instance:
import asyncio
from playwright.async_api import async_playwright
async def scrape_with_context(browser, query: str, location: str, proxy: dict) -> list[dict]:
context = await browser.new_context(
proxy=proxy,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
viewport={"width": 1366, "height": 768},
locale="en-US",
timezone_id="America/New_York",
)
page = await context.new_page()
# Remove automation fingerprints
await page.add_init_script("""
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
""")
jobs = []
try:
url = f"https://www.indeed.com/jobs?q={query}&l={location}&start=0"
await page.goto(url, wait_until="domcontentloaded")
        import random  # randomize the think-time so page timing isn't uniform
        await page.wait_for_timeout(random.uniform(2000, 3500))
cards = await page.query_selector_all("[data-testid='slider_item']")
for card in cards:
try:
title = await card.query_selector("[data-testid='jobTitle']")
company = await card.query_selector("[data-testid='company-name']")
location_el = await card.query_selector("[data-testid='text-location']")
jobs.append({
"title": await title.inner_text() if title else "",
"company": await company.inner_text() if company else "",
"location": await location_el.inner_text() if location_el else "",
"query": query,
})
except Exception:
continue
finally:
await context.close()
return jobs
async def parallel_scrape(queries: list[str], location: str, proxies: list[dict]) -> list[dict]:
"""Run multiple browser contexts in parallel, one per query."""
all_jobs = []
async with async_playwright() as pw:
browser = await pw.chromium.launch(headless=True)
sem = asyncio.Semaphore(3) # Max 3 concurrent contexts
async def bounded_scrape(query, proxy):
async with sem:
return await scrape_with_context(browser, query, location, proxy)
tasks = [
bounded_scrape(q, proxies[i % len(proxies)])
for i, q in enumerate(queries)
]
results = await asyncio.gather(*tasks, return_exceptions=True)
for r in results:
if isinstance(r, list):
all_jobs.extend(r)
await browser.close()
return all_jobs
# ThorData proxy list — get residential IPs per context
# Sign up at https://thordata.partnerstack.com/partner/0a0x4nzb (or [Oxylabs](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=2066&url_id=174))
PROXIES = [
{"server": "http://proxy.thordata.net:9000", "username": "USER", "password": "PASS"},
{"server": "http://proxy.thordata.net:9001", "username": "USER2", "password": "PASS"},
]
Monitoring Indeed for Salary Band Changes
Companies sometimes adjust salary ranges on job postings over time — revealing budget shifts, hiring urgency, or internal compensation restructuring. Track this with a diff-based monitor:
import sqlite3
import json
from datetime import datetime
def init_salary_tracker(db_path: str = "salary_tracker.db"):
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS salary_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
job_key TEXT NOT NULL,
company TEXT,
title TEXT,
salary_min INTEGER,
salary_max INTEGER,
salary_text TEXT,
scraped_at TEXT,
UNIQUE(job_key, scraped_at)
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS salary_changes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
job_key TEXT,
prev_min INTEGER,
prev_max INTEGER,
new_min INTEGER,
new_max INTEGER,
change_pct REAL,
detected_at TEXT
)
""")
conn.commit()
return conn
def record_salary_snapshot(conn, job: dict):
    from datetime import timezone  # datetime.utcnow() is deprecated; use an aware UTC date
    today = datetime.now(timezone.utc).date().isoformat()
try:
conn.execute(
"INSERT OR IGNORE INTO salary_snapshots VALUES (NULL,?,?,?,?,?,?,?)",
(job["key"], job.get("company"), job.get("title"),
job.get("salary_min"), job.get("salary_max"),
job.get("salary_text"), today)
)
conn.commit()
except Exception as e:
print(f"Insert error: {e}")
def detect_salary_changes(conn) -> list[dict]:
"""Find jobs where salary changed between last two snapshots."""
query = """
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY job_key ORDER BY scraped_at DESC) AS rn
FROM salary_snapshots
WHERE salary_min IS NOT NULL
)
SELECT a.job_key, a.company, a.title,
b.salary_min AS prev_min, b.salary_max AS prev_max,
a.salary_min AS new_min, a.salary_max AS new_max
FROM ranked a
JOIN ranked b ON a.job_key = b.job_key AND a.rn = 1 AND b.rn = 2
WHERE a.salary_min != b.salary_min OR a.salary_max != b.salary_max
"""
rows = conn.execute(query).fetchall()
changes = []
for row in rows:
prev_avg = (row[3] + row[4]) / 2 if row[3] and row[4] else 0
new_avg = (row[5] + row[6]) / 2 if row[5] and row[6] else 0
pct = ((new_avg - prev_avg) / prev_avg * 100) if prev_avg else 0
changes.append({
"job_key": row[0], "company": row[1], "title": row[2],
"prev_min": row[3], "prev_max": row[4],
"new_min": row[5], "new_max": row[6],
"change_pct": round(pct, 1),
})
return changes
Run this daily against your job database to surface employers actively adjusting compensation — a strong signal for negotiation leverage or market shift detection.
Production Deployment Checklist
Before running an Indeed scraper in production, verify:
| Item | Recommendation |
|---|---|
| Proxy rotation | ThorData residential, rotate every 50 requests |
| Browser fingerprint | Randomize viewport, timezone, locale per session |
| Request rate | Max 1 request/3 seconds per IP |
| Captcha handling | Use Playwright + real browser (no headless detection) |
| Data deduplication | Hash (title, company, location) as job key |
| Storage | SQLite for <100K jobs; PostgreSQL for larger |
| Monitoring | Alert on >20% empty-result pages (bot detection) |
| Respect robots.txt | indeed.com/robots.txt disallows /jobs — use for research only |
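The deduplication row above can be implemented as a stable hash over normalized fields, so trivial reposts with different casing or spacing collapse to one key. A sketch:

```python
import hashlib

def job_key(title: str, company: str, location: str) -> str:
    """Stable dedup key: SHA-1 of normalized (title, company, location)."""
    # Lowercase and collapse whitespace so cosmetic repost differences don't matter
    normalized = "|".join(" ".join(part.lower().split())
                          for part in (title, company, location))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()
```

The same key also works as the `job_key` column in the salary tracker and as the checkpoint identifier, keeping all three systems consistent.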
Always review the Terms of Service for your specific use case before deploying at scale.