Proxy Types for Web Scraping: Residential, Datacenter, and ISP Explained (2026)
If you have ever scaled a scraper past a few hundred requests, you know the drill: 403s start rolling in, CAPTCHAs multiply, and your clean data pipeline turns into a wall of errors. The IP layer is almost always the first thing that breaks. You can write perfect Python, respect robots.txt, randomize your user agents, add human-like delays, and still get blocked because your IP address gives you away immediately.
Proxies are the standard solution. But "just use proxies" is advice that fails people constantly, because the proxy type you choose determines whether you succeed or waste money. Residential proxies cost 10-20x more than datacenter proxies. Picking the wrong type means either burning bandwidth on expensive IPs you did not need, or getting blocked anyway because you went too cheap on a target that required better coverage.
This guide covers every proxy type used in production web scraping in 2026: datacenter, residential, ISP/static residential, and mobile. For each type you will get the technical characteristics, a clear picture of when it works and when it fails, and Python code examples you can drop into your own scraper. We also cover rotating versus sticky sessions, CAPTCHA handling, anti-detection techniques, and how to build retry logic that does not burn through your proxy quota on transient errors.
The goal is to give you enough context to make the right proxy decision on your next project without spending three hours testing configurations that were never going to work.
Why Proxies Matter: What Sites Actually Check
Before picking a proxy type, it helps to understand what anti-bot systems are actually measuring. Modern bot detection is not just "is this IP in a data center." Systems like Cloudflare, Akamai Bot Manager, PerimeterX, DataDome, and Kasada look at a stack of signals simultaneously.
IP-level signals:
- ASN (Autonomous System Number) - which organization owns this IP block
- Whether the ASN is a known cloud provider, VPN service, or data center
- IP reputation score based on past abuse history
- Geographic consistency (IP in Germany, Accept-Language header says Chinese)
- Whether the IP has been seen on abuse databases

TLS fingerprint:
- The exact sequence of cipher suites offered during the TLS handshake
- TLS version, extension order, supported groups
- Tools like curl and Python requests have distinctive TLS fingerprints that differ from Chrome

HTTP/2 fingerprint:
- HTTP/2 SETTINGS frames, WINDOW_UPDATE values, header order
- These fingerprints identify the underlying HTTP client library, not just the browser string

Behavioral signals:
- Request rate and timing patterns
- Navigation paths (does this "user" ever visit non-product pages)
- Mouse movement and interaction data (for sites that inject JS tracking)
- Session consistency (same IP for 1000 requests in 3 minutes)
Proxies solve the IP-level signals. They do not help with TLS fingerprinting or behavioral analysis unless you also handle those layers.
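The geographic-consistency signal is easy to reason about concretely. Here is a rough sketch of the kind of comparison a detection system might run - the country codes and language mapping below are purely illustrative, not any vendor's actual rules:

```python
# Illustrative geo-consistency check: does the Accept-Language header
# plausibly match the country the IP geolocates to? Real anti-bot
# systems use far richer data; this mapping is a toy example.
PLAUSIBLE_LANGUAGES = {
    "DE": {"de", "en"},
    "FR": {"fr", "en"},
    "US": {"en", "es"},
    "CN": {"zh", "en"},
}

def is_geo_consistent(ip_country: str, accept_language: str) -> bool:
    # Primary language is the first tag, e.g. "zh" from "zh-CN,zh;q=0.9"
    primary = accept_language.split(",")[0].split("-")[0].lower()
    allowed = PLAUSIBLE_LANGUAGES.get(ip_country.upper())
    if allowed is None:
        return True  # Unknown country: no basis to flag
    return primary in allowed

# A German IP with a Chinese Accept-Language is a mismatch signal
print(is_geo_consistent("DE", "zh-CN,zh;q=0.9"))  # False
print(is_geo_consistent("US", "en-US,en;q=0.9"))  # True
```

The practical takeaway: when you geo-target proxies, set Accept-Language to match the exit country, or you hand the detector a free mismatch.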
Datacenter Proxies
Datacenter proxies are IPs assigned to physical or virtual servers in commercial data centers. They come from providers like AWS, GCP, Hetzner, OVH, Leaseweb, and thousands of smaller hosts. They are the cheapest proxy type and the easiest to detect.
Technical characteristics:
- Latency: typically 20-80ms, sometimes faster than residential because of direct routing
- Speed: no bandwidth constraints from consumer ISPs
- Pricing: $0.50-3/GB or flat monthly rates for dedicated IPs
- Pool size: unlimited in theory, constrained by what you can afford
- ASN: registered to known hosting companies
Why they get blocked: Anti-bot systems maintain databases of data center IP ranges. Amazon, Google, Azure, DigitalOcean, Linode, Hetzner - every major cloud provider's ASN ranges are widely published. When your request comes from a hosting AS number, Cloudflare knows before it even reads your headers that this is not a human browsing from home.
The tell is not just the ASN. Data center IPs also lack reverse DNS entries that look like consumer ISP records, they do not appear in residential IP geolocation databases, and they tend to have very clean traffic histories.
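You can see the reverse DNS tell for yourself. Consumer ISP PTR records tend to contain tokens like "pool", "dyn", or "cable" plus the ISP brand, while hosting PTR records name the provider or are missing entirely. A heuristic sketch - the token lists here are illustrative, nowhere near a complete database:

```python
import socket

# Tokens commonly seen in consumer ISP PTR records vs hosting PTR records.
# Illustrative heuristics only - real detection systems use curated databases.
RESIDENTIAL_TOKENS = ("dyn", "pool", "cable", "dsl", "cust", "broadband", "comcast", "telstra")
HOSTING_TOKENS = ("aws", "amazonaws", "googleusercontent", "hetzner", "ovh", "linode", "server")

def classify_ptr(hostname: str) -> str:
    host = hostname.lower()
    if any(t in host for t in HOSTING_TOKENS):
        return "hosting"
    if any(t in host for t in RESIDENTIAL_TOKENS):
        return "residential"
    return "unknown"

def classify_ip(ip: str) -> str:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return "no-ptr"  # A missing PTR record is itself a weak hosting signal
    return classify_ptr(hostname)

print(classify_ptr("c-73-111-5-9.hsd1.il.comcast.net"))    # residential
print(classify_ptr("ec2-52-1-2-3.compute-1.amazonaws.com"))  # hosting
```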
When datacenter proxies work:
- Public APIs with no bot detection layer (government open data, academic datasets)
- Sites running minimal protection - basic rate limiting by IP, no fingerprinting
- Your own infrastructure and staging environments
- Development and testing before you burn residential proxy bandwidth
- Scraping sites where the data is meant to be accessed programmatically
When they fail: Any site running Cloudflare Business/Enterprise, Akamai Bot Manager, PerimeterX, DataDome, or Kasada will challenge or block datacenter IPs on the first request.
Python example with datacenter proxy rotation:
import httpx
import random
import time
from typing import Optional
DATACENTER_PROXIES = [
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
]
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
}
def scrape_with_datacenter(url: str, max_retries: int = 3) -> Optional[str]:
for attempt in range(max_retries):
proxy = random.choice(DATACENTER_PROXIES)
try:
with httpx.Client(proxy=proxy, timeout=15.0, headers=HEADERS) as client:
resp = client.get(url)
if resp.status_code == 200:
return resp.text
elif resp.status_code in (429, 503):
wait = 2 ** attempt
time.sleep(wait)
elif resp.status_code == 403:
print(f"403 on {url} - consider upgrading to residential proxies")
return None
except (httpx.TimeoutException, httpx.ProxyError) as e:
print(f"Proxy error attempt {attempt + 1}: {e}")
time.sleep(1)
return None
Residential Proxies
Residential proxies use IP addresses assigned by internet service providers to real households. When a website checks the ASN for a residential proxy request, it sees Comcast, Vodafone, Telstra, or a regional ISP - exactly what it would see for any normal human browsing from home.
This is the fundamental advantage of residential proxies: they are not distinguishable from real user traffic at the IP layer. A Comcast IP in Chicago could belong to a data scientist scraping competitor pricing or someone watching Netflix. The site cannot know, and that uncertainty is what you are paying for.
Technical characteristics:
- Latency: 200-800ms typical, varies by location and ISP routing
- Speed: limited by consumer broadband upstream bandwidth
- Pricing: $3-10/GB depending on provider, geo-targeting, and contract volume
- Pool size: major providers claim 10-100M IPs, though the active pool is smaller
- ASN: registered to residential ISPs worldwide
How residential proxy networks work: Residential proxy providers build their networks by running software on real users' devices - typically through SDKs bundled into mobile apps, browser extensions, or VPN clients. The device owner consents (buried in the terms of service) to having their bandwidth used when their device is idle. Your traffic exits through that device's IP address.
This creates some quirks: IPs go offline when devices sleep or lose connectivity, bandwidth is shared with other customers, and you have no control over which specific IP you get within a geo-targeting filter.
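Because exit devices can vanish mid-request, it is worth wrapping residential calls in a retry layer that treats proxy connect failures as routine rather than fatal. A minimal sketch, with arbitrary retry counts - with httpx you would pass `exceptions=(httpx.ProxyError, httpx.ConnectError)` and wrap the `client.get` call in a lambda:

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")

def retry_on_flake(
    fn: Callable[[], T],
    exceptions: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    attempts: int = 4,
    base_wait: float = 1.0,
) -> T:
    """Retry a callable when the residential exit device drops mid-request.

    Connect failures are expected noise on residential pools: the device
    behind the IP went to sleep or lost connectivity. On a rotating
    gateway, each retry naturally lands on a different exit device.
    """
    last_exc: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions as exc:
            last_exc = exc
            time.sleep(base_wait * (attempt + 1))  # Linear backoff is enough; rotation gives a fresh exit
    assert last_exc is not None
    raise last_exc
```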
When residential proxies are necessary:
- Amazon product and pricing data
- Google Shopping, Google Maps, Google SERP results
- LinkedIn profile and company data
- Real estate portals (Zillow, Realtor, Redfin)
- Social media platforms (Instagram, Twitter/X, TikTok)
- Ticketing platforms (StubHub, Ticketmaster)
- Any Cloudflare-protected site running Bot Fight Mode or Super Bot Fight Mode
- Price monitoring across major e-commerce retailers
Python example with ThorData residential proxies:
import httpx
import time
import random
from dataclasses import dataclass
from typing import Optional
@dataclass
class ScrapeResult:
url: str
status: int
content: Optional[str]
proxy_used: str
attempt: int
latency_ms: float
# ThorData gateway - rotation is handled server-side
THORDATA_PROXY = "http://username:[email protected]:7000"
def build_realistic_headers(accept_language: str = "en-US,en;q=0.9") -> dict:
return {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": accept_language,
"Accept-Encoding": "gzip, deflate, br",
"sec-ch-ua": '"Chromium";v="131", "Not_A Brand";v="24", "Google Chrome";v="131"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
}
def scrape_residential(
url: str,
max_retries: int = 5,
min_delay: float = 1.0,
max_delay: float = 4.0,
) -> ScrapeResult:
headers = build_realistic_headers()
for attempt in range(1, max_retries + 1):
start = time.monotonic()
try:
with httpx.Client(proxy=THORDATA_PROXY, timeout=30.0, headers=headers) as client:
resp = client.get(url)
latency = (time.monotonic() - start) * 1000
if resp.status_code == 200:
return ScrapeResult(url, 200, resp.text, THORDATA_PROXY, attempt, latency)
elif resp.status_code == 429:
wait = random.uniform(min_delay * 2 ** attempt, max_delay * 2 ** attempt)
time.sleep(min(wait, 60))
else:
return ScrapeResult(url, resp.status_code, None, THORDATA_PROXY, attempt, latency)
except httpx.TimeoutException:
time.sleep(random.uniform(min_delay, max_delay))
return ScrapeResult(url, -1, None, THORDATA_PROXY, max_retries, 0.0)
ThorData provides rotating residential proxies with geo-targeting down to city level. Their gateway handles rotation automatically - you do not need to manage a proxy list, just point your client at the gateway endpoint and it distributes requests across their pool.
ISP Proxies (Static Residential)
ISP proxies are the hybrid nobody talks about enough. They are hosted in data centers - so they have datacenter-level speed and uptime - but registered under residential ASNs. When a website checks the ASN for an ISP proxy, it sees a residential internet service provider, not AWS or Hetzner.
The technical trick is that proxy providers purchase IP blocks from ISPs and colocate the actual servers in their own data centers. The IP routing goes through the ISP's network, so the ASN lookup returns the ISP. The traffic itself travels over fast data center infrastructure.
Technical characteristics:
- Latency: 30-100ms - much closer to datacenter than residential
- Speed: not constrained by consumer upload bandwidth
- Pricing: $2-5/GB, more than datacenter but cheaper per-GB than rotating residential
- Assignment: dedicated - you get specific IPs rather than random pool rotation
- Persistence: same IP for days, weeks, or months depending on contract
When to use ISP proxies:
- Long-running monitoring jobs that need consistent identity (price trackers, stock monitors)
- Account management scenarios where IP changes trigger security alerts
- High-volume scrapes where residential bandwidth costs would be prohibitive
- Sites that use IP-session binding (same IP required across a multi-page workflow)
- Performance-sensitive scraping where residential latency is a bottleneck
Python example for session-consistent ISP proxy scraping:
import httpx
import asyncio
from typing import List, Dict, Any
ISP_PROXIES = [
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
"http://user:[email protected]:8080",
]
async def scrape_paginated_session(
base_url: str,
proxy: str,
max_pages: int = 50
) -> List[Dict[str, Any]]:
all_items = []
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
}
async with httpx.AsyncClient(proxy=proxy, timeout=20.0, headers=headers) as client:
for page in range(1, max_pages + 1):
url = f"{base_url}?page={page}"
try:
resp = await client.get(url)
if resp.status_code == 200:
pass # Parse page content here
elif resp.status_code == 404:
break
await asyncio.sleep(0.8 + (page % 3) * 0.4)
except httpx.TimeoutException:
await asyncio.sleep(5)
continue
return all_items
Mobile Proxies
Mobile proxies route traffic through mobile devices on carrier networks (4G, 5G, LTE). They are the most expensive proxy type and also the hardest to block, because mobile carrier IPs are used by millions of users simultaneously - blocking a single mobile IP risks blocking thousands of legitimate users.
Technical characteristics:
- Latency: highly variable, 100-2000ms depending on carrier and location
- Cost: $15-50/GB or per-device monthly pricing
- Detection resistance: extremely high - carriers rotate IPs via CGNAT
- Use cases: mobile-specific content, carrier-gated sites, maximum stealth
Mobile proxies are rarely necessary for standard scraping. They become relevant for carrier-specific content access, mobile app API scraping, or situations where you need to appear as mobile traffic specifically.
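If you do need one, the only changes from the earlier examples are the gateway endpoint and a consistent mobile client identity - a mobile exit IP paired with a desktop user agent is a mismatch signal. A sketch, assuming a hypothetical mobile gateway address (substitute your provider's):

```python
# Hypothetical mobile gateway endpoint - substitute your provider's address
MOBILE_PROXY = "http://user:[email protected]:7000"

def build_mobile_headers() -> dict:
    # Android Chrome identity; sec-ch-ua-mobile must be "?1" to match the UA
    return {
        "User-Agent": (
            "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/131.0.0.0 Mobile Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "sec-ch-ua-mobile": "?1",
        "sec-ch-ua-platform": '"Android"',
    }

def fetch_mobile(url: str) -> str:
    import httpx  # Deferred import so the header helper works standalone
    with httpx.Client(proxy=MOBILE_PROXY, timeout=30.0, headers=build_mobile_headers()) as client:
        return client.get(url).text
```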
Rotating vs Sticky Sessions
Understanding when to rotate IPs versus when to maintain a session is as important as picking the right proxy type.
Rotating sessions assign a new IP for each request or after a short interval. Use rotation for:
- Search result pages (each query is an independent, stateless request)
- Product catalog scraping where each URL is self-contained
- News article collection across multiple outlets
- Any data collection where there is no session state to maintain
Sticky sessions maintain the same IP for a configurable duration (typically 1-30 minutes). Use sticky sessions for:
- Paginating through multi-page results (the site tracks your session)
- Login and authenticated scraping (cookie-to-IP binding is common)
- Shopping cart and checkout observation
- Any workflow where the site uses IP as part of session validation
Python example showing both patterns:
import httpx
import time
import random
from typing import Generator, List
def rotating_scraper(
urls: List[str],
proxy_gateway: str,
delay_range: tuple = (0.5, 2.0)
) -> Generator[tuple, None, None]:
for url in urls:
with httpx.Client(proxy=proxy_gateway, timeout=15.0) as client:
try:
resp = client.get(url)
yield (url, resp.status_code, resp.text if resp.status_code == 200 else None)
except Exception:
yield (url, -1, None)
time.sleep(random.uniform(*delay_range))
def sticky_session_scraper(
start_url: str,
proxy_with_session_id: str,
) -> List[str]:
"""
Maintain same IP across a paginated sequence.
ThorData sticky session format: user-sessid12345:[email protected]:7000
"""
pages = []
with httpx.Client(proxy=proxy_with_session_id, timeout=20.0) as client:
url = start_url
while url:
resp = client.get(url)
if resp.status_code != 200:
break
pages.append(resp.text)
next_url = None # Extract from response
url = next_url
time.sleep(1.5)
return pages
Anti-Detection: Headers, Delays, and Fingerprint Spoofing
Proxies handle the IP layer. Anti-detection covers everything else that bot detection systems analyze.
Request Headers
A bare Python requests call sends headers that no browser actually sends. Building a realistic header set is table stakes:
import random
CHROME_USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
]
def get_chrome_headers(referer: str = None) -> dict:
headers = {
"User-Agent": random.choice(CHROME_USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"sec-ch-ua": '"Chromium";v="131", "Not_A Brand";v="24", "Google Chrome";v="131"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"Windows"',
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none" if not referer else "same-origin",
"Sec-Fetch-User": "?1",
"Upgrade-Insecure-Requests": "1",
"DNT": "1",
}
if referer:
headers["Referer"] = referer
return headers
TLS Fingerprinting with curl_cffi
Standard Python HTTP libraries send a TLS fingerprint that does not match Chrome. The curl_cffi library impersonates real browser TLS fingerprints:
from curl_cffi import requests as cffi_requests
def scrape_with_tls_impersonation(url: str, proxy: str) -> str:
resp = cffi_requests.get(
url,
impersonate="chrome131",
proxy=proxy,
timeout=15,
)
return resp.text
Request Timing and Rate Control
import time
import random
def human_delay(base_seconds: float = 1.0, variance: float = 0.5) -> None:
delay = max(0.1, random.gauss(base_seconds, variance))
time.sleep(delay)
def rate_limited_batch(urls: list, scrape_fn, requests_per_minute: int = 30) -> list:
results = []
min_interval = 60.0 / requests_per_minute
last_request_time = 0.0
for i, url in enumerate(urls):
elapsed = time.monotonic() - last_request_time
if elapsed < min_interval and i >= 5:
time.sleep(min_interval - elapsed + random.uniform(0, 0.3))
result = scrape_fn(url)
results.append(result)
last_request_time = time.monotonic()
if i > 0 and i % random.randint(15, 25) == 0:
time.sleep(random.uniform(5, 15))
return results
CAPTCHA Handling
CAPTCHAs appear when bot detection has flagged your traffic but not outright blocked it.
from bs4 import BeautifulSoup
import httpx
import re
def detect_captcha(response: httpx.Response) -> str:
if response.status_code == 403:
body = response.text.lower()
if "cf-challenge" in body or "challenge-platform" in body:
return "cloudflare"
if "px-captcha" in body or "perimeterx" in body:
return "perimeterx"
if response.status_code == 200:
soup = BeautifulSoup(response.text, "lxml")
if soup.find(attrs={"class": re.compile(r"g-recaptcha")}):
return "recaptcha_v2"
if soup.find(attrs={"data-sitekey": True}):
return "recaptcha_v3"
return "none"
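Detection is only half the story; the scraper also has to decide what to do next. A simple policy mapping, reflecting the usual playbook - rotate the exit IP for IP-level challenges, slow down when the session rather than the IP is flagged. The exact actions are a judgment call, not a standard:

```python
from enum import Enum

class CaptchaAction(Enum):
    ROTATE_PROXY = "rotate_proxy"    # Get a fresh exit IP and retry
    SOLVE_OR_SKIP = "solve_or_skip"  # Hand off to a solver service or drop the URL
    SLOW_DOWN = "slow_down"          # Back off; the session is flagged, not the IP
    PROCEED = "proceed"

# Policy sketch keyed on the detect_captcha() labels from above
CAPTCHA_POLICY = {
    "cloudflare": CaptchaAction.ROTATE_PROXY,
    "perimeterx": CaptchaAction.ROTATE_PROXY,
    "recaptcha_v2": CaptchaAction.SOLVE_OR_SKIP,
    "recaptcha_v3": CaptchaAction.SLOW_DOWN,
    "none": CaptchaAction.PROCEED,
}

def captcha_action(captcha_type: str) -> CaptchaAction:
    # Unknown vendors default to rotating the proxy - the cheapest first response
    return CAPTCHA_POLICY.get(captcha_type, CaptchaAction.ROTATE_PROXY)
```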
Rate Limiting and Retry Logic
Production scrapers need retry logic that distinguishes between transient failures and permanent blocks:
import httpx
import time
import random
import logging
from enum import Enum
from typing import Optional, Callable
logger = logging.getLogger(__name__)
class RetryOutcome(Enum):
SUCCESS = "success"
RATE_LIMITED = "rate_limited"
BLOCKED = "blocked"
SERVER_ERROR = "server_error"
EXHAUSTED = "exhausted"
def classify_response(status: int) -> RetryOutcome:
if status == 200:
return RetryOutcome.SUCCESS
elif status == 429:
return RetryOutcome.RATE_LIMITED
elif status in (403, 401):
return RetryOutcome.BLOCKED
elif status >= 500:
return RetryOutcome.SERVER_ERROR
return RetryOutcome.BLOCKED
def scrape_with_retry(
url: str,
proxy_fn: Callable[[], str],
max_attempts: int = 5,
base_backoff: float = 2.0,
) -> tuple:
for attempt in range(1, max_attempts + 1):
proxy = proxy_fn()
try:
with httpx.Client(proxy=proxy, timeout=20.0) as client:
resp = client.get(url, headers=get_chrome_headers())
outcome = classify_response(resp.status_code)
if outcome == RetryOutcome.SUCCESS:
return resp.text, outcome
elif outcome == RetryOutcome.RATE_LIMITED:
retry_after = int(resp.headers.get("Retry-After", base_backoff * 2 ** attempt))
wait = min(retry_after + random.uniform(0, 2), 120)
logger.warning(f"Rate limited attempt {attempt}, waiting {wait:.1f}s")
time.sleep(wait)
elif outcome == RetryOutcome.BLOCKED:
logger.warning(f"Blocked HTTP {resp.status_code} attempt {attempt}")
time.sleep(random.uniform(2, 5))
elif outcome == RetryOutcome.SERVER_ERROR:
wait = base_backoff * (2 ** (attempt - 1)) + random.uniform(0, 1)
time.sleep(min(wait, 60))
except httpx.TimeoutException:
logger.warning(f"Timeout on attempt {attempt}")
time.sleep(random.uniform(2, 6))
except httpx.ProxyError as e:
logger.error(f"Proxy error attempt {attempt}: {e}")
time.sleep(2)
return None, RetryOutcome.EXHAUSTED
Real-World Use Cases
1. E-commerce Price Monitoring
Price monitoring is one of the most common scraping use cases. The challenge is that major retailers have aggressive bot detection, and residential proxies are required for Amazon, Walmart, and Target.
import httpx
import json
import re
import time
import datetime
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class ProductPrice:
url: str
title: str
price: Optional[float]
currency: str
in_stock: bool
scraped_at: str
def extract_price_from_html(html: str, url: str) -> ProductPrice:
soup = BeautifulSoup(html, "lxml")
schema = soup.find("script", type="application/ld+json")
if schema:
try:
            data = json.loads(schema.string or "{}")  # schema.string can be None for empty tags
if isinstance(data, list):
data = data[0]
if data.get("@type") == "Product":
offer = data.get("offers", {})
if isinstance(offer, list):
offer = offer[0]
return ProductPrice(
url=url,
title=data.get("name", ""),
price=float(offer.get("price", 0)),
currency=offer.get("priceCurrency", "USD"),
in_stock=offer.get("availability", "").endswith("InStock"),
scraped_at=datetime.datetime.utcnow().isoformat(),
)
except (json.JSONDecodeError, KeyError, ValueError):
pass
price_tag = soup.select_one('[itemprop="price"], .price, #price')
title_tag = soup.select_one("h1")
price_text = price_tag.get_text(strip=True) if price_tag else ""
price_match = re.search(r"[\d,]+\.?\d*", price_text.replace(",", ""))
return ProductPrice(
url=url,
title=title_tag.get_text(strip=True) if title_tag else "",
price=float(price_match.group()) if price_match else None,
currency="USD",
in_stock=bool(soup.find(string=re.compile(r"in stock", re.I))),
scraped_at=datetime.datetime.utcnow().isoformat(),
)
2. Real Estate Listing Scraper
Real estate portals like Zillow, Realtor.com, and Redfin are among the most aggressively protected scraping targets. They use device fingerprinting in addition to IP-based blocking, so you need both residential proxies and realistic browser headers.
import httpx
import time
import random
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class RealEstateListing:
address: str
city: str
state: str
zip_code: str
price: Optional[int]
bedrooms: Optional[int]
bathrooms: Optional[float]
sqft: Optional[int]
listing_url: str
days_on_market: Optional[int]
def scrape_real_estate_search(
search_url: str,
proxy: str,
max_pages: int = 10,
) -> List[RealEstateListing]:
listings = []
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Referer": "https://www.google.com/",
}
with httpx.Client(proxy=proxy, timeout=30.0, headers=headers, follow_redirects=True) as client:
for page in range(1, max_pages + 1):
url = f"{search_url}&page={page}" if "?" in search_url else f"{search_url}?page={page}"
resp = client.get(url)
if resp.status_code != 200:
break
if page % 3 == 0:
time.sleep(5 + random.uniform(0, 3))
else:
time.sleep(1.5 + random.uniform(0, 1))
return listings
3. Job Board Aggregator
Job boards are generally less aggressive with bot detection than e-commerce, but the most popular ones (Indeed, LinkedIn) have significant protection. Many job boards expose JSON APIs that are much cleaner to scrape than HTML.
import httpx
import json
import datetime
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class JobListing:
title: str
company: str
location: str
salary_min: Optional[int]
salary_max: Optional[int]
remote: bool
posted_date: str
listing_url: str
source: str
def scrape_job_api(board_url: str, proxy: str, keyword: str = "python developer") -> List[JobListing]:
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "application/json, text/html, */*",
"Accept-Language": "en-US,en;q=0.9",
}
jobs = []
with httpx.Client(proxy=proxy, timeout=20.0, headers=headers) as client:
encoded = keyword.replace(" ", "+")
resp = client.get(f"{board_url}/search?q={encoded}&format=json")
if resp.status_code == 200:
try:
data = resp.json()
raw_jobs = data.get("jobs", data.get("results", []))
for job in raw_jobs:
company = job.get("company", {})
company_name = company.get("name", "") if isinstance(company, dict) else str(company)
jobs.append(JobListing(
title=job.get("title", ""),
company=company_name,
location=job.get("location", ""),
salary_min=job.get("salary_min"),
salary_max=job.get("salary_max"),
remote="remote" in job.get("location", "").lower(),
posted_date=job.get("created_at", datetime.datetime.utcnow().isoformat()),
listing_url=job.get("url", ""),
source=board_url,
))
except json.JSONDecodeError:
pass
return jobs
4. SERP Rank Tracker
Google is one of the most difficult scraping targets. Residential proxies are mandatory - datacenter IPs get CAPTCHA challenges on nearly every request. Use geo-targeted proxies matching the country you want results for.
import httpx
import time
import random
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List
@dataclass
class SearchResult:
position: int
title: str
url: str
description: str
is_ad: bool
keyword: str
def track_serp_positions(
keywords: List[str],
target_domain: str,
proxy: str,
country_code: str = "us",
) -> List[SearchResult]:
results = []
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept-Language": f"{country_code}-{country_code.upper()},{country_code};q=0.9,en;q=0.8",
"Referer": "https://www.google.com/",
}
for keyword in keywords:
with httpx.Client(proxy=proxy, timeout=15.0, headers=headers) as client:
encoded = keyword.replace(" ", "+")
resp = client.get(f"https://www.google.com/search?q={encoded}&gl={country_code}&num=100")
if resp.status_code == 200:
soup = BeautifulSoup(resp.text, "lxml")
organic = soup.select("div.g")
for pos, div in enumerate(organic[:20], 1):
link = div.select_one("a[href]")
title_el = div.select_one("h3")
snippet = div.select_one(".VwiC3b")
if link and title_el:
url = link.get("href", "")
if target_domain in url:
results.append(SearchResult(
position=pos,
title=title_el.get_text(),
url=url,
description=snippet.get_text() if snippet else "",
is_ad=False,
keyword=keyword,
))
time.sleep(random.uniform(3, 8))
return results
5. Social Media Profile Data Collector
Social platforms have the most sophisticated bot detection systems. Instagram and TikTok run device fingerprinting in JavaScript - bypassing this requires a real browser context via Playwright.
from playwright.async_api import async_playwright
import asyncio
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class SocialProfile:
username: str
display_name: str
bio: str
follower_count: Optional[int]
following_count: Optional[int]
post_count: Optional[int]
verified: bool
profile_url: str
async def scrape_public_profiles(
usernames: List[str],
proxy_config: dict,
platform: str = "twitter",
) -> List[SocialProfile]:
profiles = []
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
proxy=proxy_config,
args=["--disable-blink-features=AutomationControlled"],
)
for username in usernames:
context = await browser.new_context(
viewport={"width": 1366, "height": 768},
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
locale="en-US",
)
page = await context.new_page()
await page.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)
try:
base_url = f"https://twitter.com/{username}" if platform == "twitter" else f"https://www.instagram.com/{username}/"
await page.goto(base_url, wait_until="networkidle", timeout=30000)
await page.wait_for_timeout(2000)
finally:
await context.close()
await asyncio.sleep(2)
await browser.close()
return profiles
6. News and Media Monitoring
News monitoring benefits from a two-phase approach: parse RSS feeds without proxies to get article URLs cheaply, then use proxies only for fetching full article text. This cuts proxy bandwidth costs significantly.
import httpx
import feedparser
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List, Optional
import datetime
import time
import random
@dataclass
class NewsArticle:
headline: str
url: str
source: str
published_at: str
summary: str
author: Optional[str]
full_text: Optional[str]
def scrape_news_sources(
rss_feeds: List[str],
article_proxy: str,
max_articles_per_feed: int = 20,
) -> List[NewsArticle]:
feed_items = []
for feed_url in rss_feeds:
feed = feedparser.parse(feed_url)
for entry in feed.entries[:max_articles_per_feed]:
feed_items.append({
"title": entry.get("title", ""),
"url": entry.get("link", ""),
"published": entry.get("published", datetime.datetime.utcnow().isoformat()),
"summary": entry.get("summary", ""),
"source": feed.feed.get("title", feed_url),
})
articles = []
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
with httpx.Client(proxy=article_proxy, timeout=15.0, headers=headers) as client:
for item in feed_items:
try:
resp = client.get(item["url"])
if resp.status_code == 200:
soup = BeautifulSoup(resp.text, "lxml")
for tag in soup.select("nav, header, footer, .ad, aside"):
tag.decompose()
paragraphs = soup.select("article p, .article-body p, .post-content p")
full_text = " ".join(p.get_text(strip=True) for p in paragraphs)
articles.append(NewsArticle(
headline=item["title"],
url=item["url"],
source=item["source"],
published_at=item["published"],
summary=item["summary"],
author=None,
full_text=full_text if full_text else None,
))
time.sleep(random.uniform(0.5, 1.5))
except Exception:
pass
return articles
7. Academic and Research Data Harvesting
Academic databases are generally more tolerant of scraping but have strict rate limits. Datacenter proxies work fine for most academic sources - there is no need to pay for residential proxies here.
import httpx
import time
from dataclasses import dataclass, field
from typing import List, Optional
@dataclass
class ResearchPaper:
title: str
authors: List[str]
abstract: str
doi: Optional[str]
publication_year: Optional[int]
journal: Optional[str]
citations: Optional[int]
download_url: Optional[str]
keywords: List[str] = field(default_factory=list)
def scrape_arxiv_papers(
search_query: str,
max_results: int = 100,
proxy: Optional[str] = None,
) -> List[ResearchPaper]:
"""
Scrape arXiv preprints.
arXiv allows programmatic access but rate-limits to roughly 1 req/3s.
Datacenter proxies are fine here.
"""
papers = []
headers = {
"User-Agent": "ResearchScraper/1.0 (academic research; [email protected])",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
client_kwargs: dict = {"timeout": 20.0, "headers": headers}
if proxy:
client_kwargs["proxy"] = proxy
base_url = "https://export.arxiv.org/find/cs/1/all:+{query}/0/{start}/0/all/0/1"
with httpx.Client(**client_kwargs) as client:
start = 0
while start < max_results:
url = base_url.format(query=search_query.replace(" ", "+"), start=start)
resp = client.get(url)
if resp.status_code != 200:
break
start += 25
time.sleep(3)
return papers
Output Schema and Storage
Always define output schemas before scraping at scale. It forces you to think about what you actually need and makes downstream processing trivial.
from dataclasses import dataclass, field
from typing import Optional
import json
import datetime
import sqlite3
import hashlib
@dataclass
class ScrapedPage:
url: str
status_code: int
scraped_at: str = field(default_factory=lambda: datetime.datetime.utcnow().isoformat())
proxy_type: str = "residential"
content_hash: Optional[str] = None
data: dict = field(default_factory=dict)
error: Optional[str] = None
class ScrapingStorage:
def __init__(self, db_path: str = "scraping_results.db"):
self.conn = sqlite3.connect(db_path)
self.conn.execute("""
CREATE TABLE IF NOT EXISTS scraped_pages (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE,
status_code INTEGER,
scraped_at TEXT,
proxy_type TEXT,
content_hash TEXT,
data TEXT,
error TEXT
)
""")
self.conn.commit()
def save(self, page: ScrapedPage) -> None:
if page.data:
content = json.dumps(page.data, sort_keys=True)
page.content_hash = hashlib.sha256(content.encode()).hexdigest()[:16]
self.conn.execute(
"INSERT OR REPLACE INTO scraped_pages VALUES (NULL, ?, ?, ?, ?, ?, ?, ?)",
(page.url, page.status_code, page.scraped_at, page.proxy_type,
page.content_hash, json.dumps(page.data), page.error)
)
self.conn.commit()
def is_scraped(self, url: str, max_age_hours: int = 24) -> bool:
cutoff = (datetime.datetime.utcnow() - datetime.timedelta(hours=max_age_hours)).isoformat()
row = self.conn.execute(
"SELECT 1 FROM scraped_pages WHERE url = ? AND scraped_at > ? AND status_code = 200",
(url, cutoff)
).fetchone()
return row is not None
def export_to_jsonl(self, output_file: str) -> int:
rows = self.conn.execute("SELECT url, data FROM scraped_pages WHERE status_code = 200").fetchall()
with open(output_file, "w") as f:
for url, data_str in rows:
if data_str:
f.write(data_str + "\n")
return len(rows)
Choosing the Right Proxy Provider
What matters in practice: pool size, geo coverage, uptime consistency, and how the provider handles failures. A provider with 100M residential IPs sounds impressive until you realize 70% are offline at any given time and the active pool has heavy overlap with other customers.
For rotating residential proxy coverage with solid geo-targeting, ThorData is worth evaluating. They offer rotating residential proxies with city-level targeting, sticky session support, and pricing that does not penalize high-bandwidth months.
The Decision Matrix
| Scenario | Proxy Type | Session | Expected Success Rate |
|---|---|---|---|
| Public API, no bot detection | Datacenter | Rotating | Very High |
| Cloudflare-protected site | Residential | Rotating | Medium-High |
| Login / session scraping | ISP or Residential | Sticky | Medium |
| Amazon / Google pricing | Residential | Rotating | Medium |
| High-volume catalog scrape | ISP | Mixed | High |
| Mobile-specific content | Mobile | Rotating | High |
| JS-heavy SPA | Residential + Playwright | Per-context | Medium |
Start with the cheapest proxy type that works for your target. Escalate when you hit consistent blocks. Monitor your success rate per proxy type and per target domain - the data will tell you where to invest more.
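That monitoring does not need to be elaborate. A minimal in-memory tracker keyed by proxy type and target domain is enough to show where escalation is warranted - the 70% threshold and 20-sample minimum below are starting points, not magic numbers:

```python
from collections import defaultdict
from urllib.parse import urlparse

class SuccessTracker:
    """Track scraping success rate per (proxy_type, target_domain) pair."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # [successes, total]

    def record(self, proxy_type: str, url: str, success: bool) -> None:
        key = (proxy_type, urlparse(url).netloc)
        self.counts[key][1] += 1
        if success:
            self.counts[key][0] += 1

    def rate(self, proxy_type: str, domain: str) -> float:
        ok, total = self.counts[(proxy_type, domain)]
        return ok / total if total else 0.0

    def should_escalate(self, proxy_type: str, domain: str,
                        threshold: float = 0.7, min_samples: int = 20) -> bool:
        # Below-threshold success after enough samples: try a better proxy type
        ok, total = self.counts[(proxy_type, domain)]
        return total >= min_samples and ok / total < threshold

tracker = SuccessTracker()
for i in range(30):
    tracker.record("datacenter", "https://example.com/p", success=i % 3 == 0)
print(tracker.should_escalate("datacenter", "example.com"))  # True - roughly 33% success
```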