
How to Scrape Google Search Results Without Getting Blocked (2025)

Google is the largest search engine on the planet, processing over 8.5 billion queries per day. Whether you are building an SEO monitoring tool, tracking competitor rankings, aggregating market research data, or feeding a machine learning pipeline with search intelligence, the ability to programmatically extract Google search results is an incredibly valuable skill. But Google does not make it easy. Their infrastructure is designed to serve human users, and they invest heavily in detecting and blocking automated access.

I have spent over a year building SERP data pipelines, testing every publicly available scraping method against Google's ever-evolving bot detection stack. In that time, I have burned through dozens of IP addresses, hit every flavor of CAPTCHA Google throws, and watched perfectly working scrapers break overnight after a fingerprinting update. This guide distills everything I have learned into a single, practical resource.

This is not a "copy-paste this snippet and you are done" tutorial. Google scraping is fundamentally an adversarial problem — you are working against a multi-billion dollar infrastructure designed to stop exactly what you are trying to do. I will be upfront about what works, what does not, and the real costs involved. By the end of this guide, you will understand the detection landscape, have working code for multiple approaches, and know how to choose the right method for your specific use case.

Legal disclaimer: Google's Terms of Service prohibit automated scraping of search results. This guide is for educational and research purposes. For production use, consider official APIs or licensed SERP data providers. Always respect robots.txt and applicable laws in your jurisdiction.

Why Scraping Google Is Hard (The Full Picture)

Most tutorials gloss over why Google scraping is difficult, leading developers to underestimate the challenge and waste time on approaches that fail at the first hurdle. Let me break down every detection layer Google employs.

Layer 1: IP-Based Rate Limiting

This is the most basic defense. Google tracks request volume per IP address. From a clean residential IP, you can typically make 5-10 queries per minute before triggering a soft block (CAPTCHA). From a known datacenter IP range (AWS, GCP, Azure, DigitalOcean), you might get blocked on the very first request.

Google maintains an internal reputation score for IP addresses. If an IP has a history of automated traffic, it gets flagged more aggressively. This is why a fresh residential IP works better than a datacenter IP that has been used for scraping before — even if the datacenter IP is technically "clean" in terms of recent activity.
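To stay under those per-IP budgets deliberately rather than hoping for the best, a client-side pacer helps. A minimal token-bucket sketch (the 5 queries/minute default mirrors the figure above; the class itself is my own illustration):

```python
import time

class QueryPacer:
    """Token-bucket pacer: allow at most `rate` queries per `per` seconds."""

    def __init__(self, rate: int = 5, per: float = 60.0):
        self.rate = rate
        self.per = per
        self.tokens = float(rate)       # start with a full bucket
        self.last = time.monotonic()

    def _refill(self, now: float):
        # Replenish tokens proportionally to elapsed time, capped at bucket size.
        self.tokens = min(self.rate,
                          self.tokens + (now - self.last) * (self.rate / self.per))
        self.last = now

    def wait(self):
        """Block until a query token is available, then consume it."""
        while True:
            now = time.monotonic()
            self._refill(now)
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return
            # Sleep just long enough for the next token to accrue.
            time.sleep((1.0 - self.tokens) * (self.per / self.rate))
```

Calling `pacer.wait()` before each request keeps sustained throughput at or below the configured rate while still allowing short bursts from a full bucket.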

Layer 2: TLS Fingerprinting (JA3/JA4+)

Every TLS connection starts with a ClientHello message that reveals your client's cipher suite preferences, extension order, and supported curves. Google uses JA3 and the newer JA4+ fingerprinting to identify your HTTP library before your request even reaches their application layer.

Python's requests library, httpx, and aiohttp all produce OpenSSL-default TLS fingerprints that are trivially distinguishable from real browsers. This means that no amount of header spoofing will help if your TLS handshake says "I am a Python script."

# This is what Google sees BEFORE your headers arrive:
# TLS ClientHello -> JA3 string -> "771,4866-4867-4865-49196-49200..."
# The MD5 of that string is the JA3 hash, which maps to "Python/OpenSSL" -> immediate block

# Meanwhile, Chrome 131 produces something like:
# JA3 string -> "771,4865-4866-4867-49195-49199..."
# Same TLS version field, but different cipher order and extensions -> "real browser"

Layer 3: JavaScript Challenges

Google increasingly serves JavaScript challenges that must be executed before search results are rendered. The challenge code runs in the client and has to complete before the SERP markup appears, so it doubles as proof that a real JavaScript engine is present.

A raw HTTP client like httpx cannot execute JavaScript, making these challenges impossible to pass without a browser engine.
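Since httpx cannot run the challenge, the best a raw client can do is recognize one and fail fast instead of parsing an empty page. A heuristic sketch (the marker strings are my own guesses at common interstitial wording, not a documented contract):

```python
def looks_like_js_challenge(html: str) -> bool:
    """Heuristic: detect a Google interstitial that requires JavaScript.

    The markers below are guesses at common challenge-page wording;
    treat any hit as "do not parse this response as a SERP".
    """
    lowered = html.lower()
    markers = (
        "enable javascript",  # classic <noscript> fallback text
        "please click here if you are not redirected",
        "our systems have detected unusual traffic",
    )
    return any(marker in lowered for marker in markers)
```

Call this before handing HTML to your parser; a `True` result means retry with a browser-based approach or a different IP.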

Layer 4: Behavioral Analysis

Even with a perfect browser fingerprint, Google analyzes behavioral signals: typing cadence, mouse movement, scroll activity, and the timing and regularity of successive queries. A client that fires metronomically regular requests with no surrounding interaction stands out even when every fingerprint check passes.

Layer 5: Browser Fingerprint Consistency

If you use a headless browser, Google checks for internal consistency:

// Google's detection scripts check things like:
navigator.webdriver  // true in automated browsers
navigator.plugins.length  // 0 in headless Chrome
navigator.languages  // often just ["en"] in headless
window.chrome  // sometimes undefined in headless
window.outerHeight  // 0 in headless mode

These checks form a composite fingerprint. Failing any single check might not trigger a block, but multiple inconsistencies will.

Approach 1: Raw HTTP Requests with httpx

The simplest approach. Best for low-volume, quick data grabs where you need a few dozen results and don't want to spin up browser infrastructure.

Basic Implementation

import httpx
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import random
import time
from dataclasses import dataclass, field, asdict
import json

@dataclass
class SearchResult:
    """Structured output for a single Google search result."""
    position: int
    title: str
    url: str
    snippet: str
    displayed_url: str = ""
    featured: bool = False

@dataclass
class SERPResponse:
    """Complete SERP response with metadata."""
    query: str
    results: list[SearchResult] = field(default_factory=list)
    total_results: str = ""
    search_time: float = 0.0
    people_also_ask: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:132.0) Gecko/20100101 Firefox/132.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
]

ACCEPT_LANGUAGES = [
    "en-US,en;q=0.9",
    "en-GB,en;q=0.9,en-US;q=0.8",
    "en-US,en;q=0.9,de;q=0.8",
]


def build_google_headers() -> dict:
    """Generate realistic browser headers for Google requests."""
    ua = random.choice(USER_AGENTS)
    is_chrome = "Chrome" in ua and "Safari" in ua and "Firefox" not in ua

    headers = {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": random.choice(ACCEPT_LANGUAGES),
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }

    if is_chrome:
        headers.update({
            "Sec-Ch-Ua": '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
            "Sec-Ch-Ua-Mobile": "?0",
            "Sec-Ch-Ua-Platform": '"Windows"' if "Windows" in ua else '"macOS"',
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
        })

    return headers


def parse_serp(html: str, query: str) -> SERPResponse:
    """Parse Google SERP HTML into structured data."""
    soup = BeautifulSoup(html, "lxml")
    response = SERPResponse(query=query)

    # Extract total results count
    stats = soup.select_one("#result-stats")
    if stats:
        response.total_results = stats.get_text(strip=True)

    # Extract organic results
    position = 1
    for div in soup.select("div.g"):
        # Skip nested results (like sub-links)
        if div.find_parent("div", class_="g"):
            continue

        link_el = div.select_one("a[href]")
        title_el = div.select_one("h3")
        snippet_el = div.select_one("div[data-sncf], div.VwiC3b, span.aCOpRe")

        if not link_el or not title_el:
            continue

        href = link_el.get("href", "")
        if not href.startswith("http"):
            continue

        result = SearchResult(
            position=position,
            title=title_el.get_text(strip=True),
            url=href,
            snippet=snippet_el.get_text(strip=True) if snippet_el else "",
            displayed_url=div.select_one("cite").get_text(strip=True) if div.select_one("cite") else "",
        )
        response.results.append(result)
        position += 1

    # Extract "People Also Ask"
    for paa in soup.select("div.related-question-pair span"):
        text = paa.get_text(strip=True)
        if text and len(text) > 10:
            response.people_also_ask.append(text)

    return response


def scrape_google(
    query: str,
    num_results: int = 10,
    country: str = "us",
    language: str = "en",
) -> SERPResponse:
    """
    Scrape Google search results for a given query.

    Args:
        query: Search query string
        num_results: Number of results to request (10, 20, 50, or 100)
        country: Country code for localized results
        language: Language code

    Returns:
        SERPResponse with parsed results
    """
    params = {
        "q": query,
        "num": num_results,
        "hl": language,
        "gl": country,
        "pws": "0",  # Disable personalized results
    }

    url = "https://www.google.com/search?" + "&".join(
        f"{k}={quote_plus(str(v))}" for k, v in params.items()
    )

    with httpx.Client(
        headers=build_google_headers(),
        follow_redirects=True,
        timeout=15.0,
    ) as client:
        resp = client.get(url)

        if resp.status_code == 429:
            raise Exception("Rate limited by Google. Wait or switch IP.")
        if resp.status_code == 503:
            raise Exception("Google served a CAPTCHA page. IP is flagged.")
        resp.raise_for_status()

        return parse_serp(resp.text, query)


# Usage
if __name__ == "__main__":
    results = scrape_google("best python web scraping libraries 2025")
    print(results.to_json())

Output Schema

The above code produces structured JSON output like this:

{
  "query": "best python web scraping libraries 2025",
  "results": [
    {
      "position": 1,
      "title": "Top 10 Python Web Scraping Libraries in 2025",
      "url": "https://example.com/python-scraping-libs",
      "snippet": "Comprehensive comparison of Python scraping libraries including BeautifulSoup, Scrapy, Playwright, and newer options...",
      "displayed_url": "example.com > python-scraping-libs",
      "featured": false
    },
    {
      "position": 2,
      "title": "The Best Web Scraping Tools for Python Developers",
      "url": "https://another-example.com/scraping-tools",
      "snippet": "Updated guide covering httpx, selectolax, and headless browser automation...",
      "displayed_url": "another-example.com > scraping-tools",
      "featured": false
    }
  ],
  "total_results": "About 4,230,000 results (0.52 seconds)",
  "search_time": 0.0,
  "people_also_ask": [
    "What is the best Python library for web scraping?",
    "Is web scraping legal in 2025?",
    "How to scrape Google without getting blocked?"
  ]
}

Limitations of Raw HTTP

Even with careful headers, this approach fails Layers 2 and 3 outright: the OpenSSL TLS fingerprint identifies the client before your request is parsed, and there is no JavaScript engine to satisfy rendering challenges. Expect it to work only at low volume from clean residential IPs, and to break whenever Google changes its markup or selectors.

Approach 2: curl_cffi for TLS Fingerprint Spoofing

The biggest weakness of raw HTTP libraries is TLS fingerprinting. curl_cffi solves this by using BoringSSL (the same TLS library Chrome uses) and impersonating specific browser versions at the TLS layer.

from curl_cffi import requests as cffi_requests
from bs4 import BeautifulSoup
import random
import time

class StealthGoogleScraper:
    """Google scraper using curl_cffi for TLS fingerprint impersonation."""

    BROWSER_VERSIONS = ["chrome131", "chrome130", "chrome124"]

    def __init__(self, proxy: str | None = None):
        self.proxy = proxy
        self.session = cffi_requests.Session(
            impersonate=random.choice(self.BROWSER_VERSIONS),
            proxies={"https": proxy, "http": proxy} if proxy else None,
        )
        self.request_count = 0

    def search(self, query: str, num: int = 10) -> dict:
        """Execute a Google search with TLS impersonation."""
        # Random delay to mimic human behavior
        if self.request_count > 0:
            delay = random.uniform(3.0, 8.0)
            time.sleep(delay)

        params = {
            "q": query,
            "num": str(num),
            "hl": "en",
            "gl": "us",
        }

        try:
            resp = self.session.get(
                "https://www.google.com/search",
                params=params,
                timeout=15,
            )
            self.request_count += 1

            if resp.status_code == 429:
                return {"error": "rate_limited", "query": query}
            if "unusual traffic" in resp.text.lower():
                return {"error": "captcha_triggered", "query": query}

            return self._parse(resp.text, query)

        except Exception as e:
            return {"error": str(e), "query": query}

    def _parse(self, html: str, query: str) -> dict:
        soup = BeautifulSoup(html, "lxml")
        results = []
        for i, div in enumerate(soup.select("div.g"), 1):
            link = div.select_one("a[href]")
            title = div.select_one("h3")
            snippet = div.select_one("div[data-sncf], div.VwiC3b")
            if link and title:
                href = link.get("href", "")
                if href.startswith("http"):
                    results.append({
                        "position": i,
                        "title": title.get_text(strip=True),
                        "url": href,
                        "snippet": snippet.get_text(strip=True) if snippet else "",
                    })
        return {"query": query, "results": results, "count": len(results)}

    def close(self):
        self.session.close()


# Usage with proxy rotation
scraper = StealthGoogleScraper(
    proxy="http://user:[email protected]:9000"
)
results = scraper.search("web scraping python tutorial")
print(results)
scraper.close()

This approach gives you a dramatically better success rate than raw httpx because your TLS fingerprint now matches Chrome instead of Python/OpenSSL. Combined with rotating residential proxies, this is the most cost-effective method for moderate-volume SERP scraping.

Approach 3: Headless Browser with Playwright

When a site requires JavaScript execution or you need to handle complex interactions (consent dialogs, infinite scroll, login flows), Playwright is the right tool.

import asyncio
import random
from playwright.async_api import async_playwright
from dataclasses import dataclass
import json

@dataclass
class PlaywrightSERPResult:
    position: int
    title: str
    url: str
    snippet: str

async def scrape_google_playwright(
    query: str,
    num_results: int = 10,
    proxy: dict | None = None,
    headless: bool = True,
) -> list[PlaywrightSERPResult]:
    """
    Scrape Google search results using Playwright.

    Args:
        query: Search query
        num_results: Number of results to fetch
        proxy: Optional proxy config {"server": "...", "username": "...", "password": "..."}
        headless: Run browser in headless mode

    Returns:
        List of search results
    """
    async with async_playwright() as p:
        launch_opts = {
            "headless": headless,
            "args": [
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--no-first-run",
                "--no-default-browser-check",
            ],
        }
        if proxy:
            launch_opts["proxy"] = proxy

        browser = await p.chromium.launch(**launch_opts)

        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
            locale="en-US",
            timezone_id="America/New_York",
            geolocation={"latitude": 40.7128, "longitude": -74.0060},
            permissions=["geolocation"],
        )

        # Inject stealth scripts to bypass detection
        await context.add_init_script("""
            // Override webdriver detection
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined,
            });

            // Override plugins to look real
            Object.defineProperty(navigator, 'plugins', {
                get: () => [1, 2, 3, 4, 5],
            });

            // Override languages
            Object.defineProperty(navigator, 'languages', {
                get: () => ['en-US', 'en'],
            });

            // Override Chrome detection
            window.chrome = {
                runtime: {},
                loadTimes: function() {},
                csi: function() {},
                app: {},
            };
        """)

        page = await context.new_page()

        # Navigate to Google first (build cookies)
        await page.goto("https://www.google.com", wait_until="domcontentloaded")
        await asyncio.sleep(random.uniform(1.0, 2.5))

        # Handle consent dialog (EU regions)
        try:
            consent_btn = page.locator("button:has-text('Accept all')")
            # wait_for actually waits up to the timeout; is_visible(timeout=...)
            # is deprecated and returns immediately in current Playwright.
            await consent_btn.wait_for(state="visible", timeout=3000)
            await consent_btn.click()
            await asyncio.sleep(random.uniform(0.5, 1.5))
        except Exception:
            pass

        # Type the search query with human-like timing
        search_box = page.locator("textarea[name='q'], input[name='q']")
        await search_box.click()
        for char in query:
            await search_box.type(char)
            # type(delay=...) only spaces keys within one call, so for single
            # characters we add the randomized inter-key delay ourselves.
            await asyncio.sleep(random.randint(50, 150) / 1000)
        await asyncio.sleep(random.uniform(0.3, 0.8))
        await page.keyboard.press("Enter")

        # Wait for results
        await page.wait_for_selector("div.g", timeout=15000)
        await asyncio.sleep(random.uniform(1.0, 2.0))

        # Extract results
        results = []
        divs = await page.query_selector_all("div.g")

        for i, div in enumerate(divs[:num_results], 1):
            try:
                title_el = await div.query_selector("h3")
                link_el = await div.query_selector("a")
                snippet_el = await div.query_selector("div[data-sncf], div.VwiC3b, span.aCOpRe")

                title = await title_el.inner_text() if title_el else ""
                url = await link_el.get_attribute("href") if link_el else ""
                snippet = await snippet_el.inner_text() if snippet_el else ""

                if title and url and url.startswith("http"):
                    results.append(PlaywrightSERPResult(
                        position=i,
                        title=title,
                        url=url,
                        snippet=snippet,
                    ))
            except Exception:
                continue

        await browser.close()
        return results


# Run it
async def main():
    results = await scrape_google_playwright(
        query="python web scraping best practices 2025",
        proxy={
            "server": "http://rotating.thordata.com:9000",
            "username": "your_username",
            "password": "your_password",
        },
    )
    for r in results:
        print(f"{r.position}. {r.title}")
        print(f"   {r.url}")
        print(f"   {r.snippet[:100]}...")
        print()

asyncio.run(main())

Why Playwright Still Gets Caught

Even with stealth scripts, headless Chrome is increasingly detectable:

  1. navigator.webdriver: While we override it, some detection scripts check the prototype chain or use Object.getOwnPropertyDescriptor to detect the override itself.
  2. Chrome DevTools Protocol artifacts: The CDP connection leaves traces that sophisticated detection can identify.
  3. Missing browser features: Headless Chrome lacks WebGL renderer info, audio codec support, and other features present in headed Chrome.
  4. Resource loading patterns: Automated browsers often load resources in a different order than real browsers.

For maximum stealth, consider playwright-stealth or undetected-playwright libraries, but understand that this is an arms race you will eventually lose against Google-level detection.

Approach 4: Google Custom Search JSON API (Official)

If you need reliable, legal, production-grade SERP data, the official API is the safest choice.

import httpx
from dataclasses import dataclass
import json

@dataclass
class CSEResult:
    title: str
    link: str
    snippet: str
    display_link: str

class GoogleCSEClient:
    """Client for Google Custom Search Engine API."""

    BASE_URL = "https://www.googleapis.com/customsearch/v1"

    def __init__(self, api_key: str, cx: str):
        """
        Args:
            api_key: Google API key
            cx: Custom Search Engine ID
        """
        self.api_key = api_key
        self.cx = cx
        self.client = httpx.Client(timeout=15.0)

    def search(
        self,
        query: str,
        num: int = 10,
        start: int = 1,
        date_restrict: str | None = None,
        site_search: str | None = None,
        file_type: str | None = None,
    ) -> dict:
        """
        Execute a search query.

        Args:
            query: Search terms
            num: Results per page (1-10)
            start: Starting result index (1-based)
            date_restrict: Restrict by date (e.g., "d7" for last 7 days)
            site_search: Limit to specific site
            file_type: Filter by file extension (pdf, doc, etc.)

        Returns:
            Parsed search results with metadata
        """
        params = {
            "key": self.api_key,
            "cx": self.cx,
            "q": query,
            "num": min(num, 10),
            "start": start,
        }
        if date_restrict:
            params["dateRestrict"] = date_restrict
        if site_search:
            params["siteSearch"] = site_search
        if file_type:
            params["fileType"] = file_type

        resp = self.client.get(self.BASE_URL, params=params)
        resp.raise_for_status()
        data = resp.json()

        results = []
        for item in data.get("items", []):
            results.append(CSEResult(
                title=item.get("title", ""),
                link=item.get("link", ""),
                snippet=item.get("snippet", ""),
                display_link=item.get("displayLink", ""),
            ))

        return {
            "query": query,
            "total_results": data.get("searchInformation", {}).get("totalResults", "0"),
            "search_time": data.get("searchInformation", {}).get("searchTime", 0),
            "results": [vars(r) for r in results],
            "next_start": start + num if len(results) == num else None,
        }

    def search_all_pages(self, query: str, max_results: int = 100) -> list[CSEResult]:
        """Paginate through multiple pages of results."""
        all_results = []
        start = 1

        while len(all_results) < max_results and start <= 91:
            page = self.search(query, num=10, start=start)
            results = page["results"]
            if not results:
                break
            all_results.extend(results)
            start += 10

        return all_results[:max_results]

    def close(self):
        self.client.close()


# Usage
cse = GoogleCSEClient(api_key="YOUR_API_KEY", cx="YOUR_CX_ID")

# Basic search
results = cse.search("python web scraping", num=10)
print(json.dumps(results, indent=2))

# Search with filters
pdf_results = cse.search(
    "web scraping guide",
    file_type="pdf",
    date_restrict="m6",  # Last 6 months
)

# Paginate through results
all_results = cse.search_all_pages("site:github.com python scraper", max_results=50)
cse.close()

Pricing: 100 queries/day free, then $5 per 1,000 queries. The results come from a Custom Search Engine, not the main Google index — they are close but not identical.
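Those numbers translate directly into a monthly budget. A quick helper using the rates from this section (100 free queries/day, $5 per 1,000 after that):

```python
def cse_monthly_cost(queries_per_day: int, days: int = 30,
                     free_per_day: int = 100, price_per_1k: float = 5.0) -> float:
    """Estimate monthly Google Custom Search API cost in USD."""
    billable_per_day = max(0, queries_per_day - free_per_day)
    return billable_per_day * days * price_per_1k / 1000

# 1,000 queries/day -> 900 billable/day -> 27,000 billable/month -> $135.00
```

At anything past a few hundred queries per day, compare this against per-request SERP API pricing before committing.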

Approach 5: Third-Party SERP API Services

For production workloads where you need actual Google results at scale without maintaining infrastructure:

import httpx
import json
from typing import Generator

class SERPAPIClient:
    """Generic SERP API client pattern (adapt to your provider)."""

    def __init__(self, api_key: str, base_url: str):
        self.client = httpx.Client(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30.0,
        )

    def search(self, query: str, **kwargs) -> dict:
        resp = self.client.get("/search", params={"q": query, **kwargs})
        resp.raise_for_status()
        return resp.json()

    def batch_search(self, queries: list[str]) -> Generator[dict, None, None]:
        """Execute multiple searches with built-in rate limiting."""
        for query in queries:
            try:
                yield self.search(query)
            except httpx.HTTPStatusError as e:
                if e.response.status_code == 429:
                    retry_after = int(e.response.headers.get("retry-after", 60))
                    import time
                    time.sleep(retry_after)
                    yield self.search(query)
                else:
                    yield {"error": str(e), "query": query}

Popular SERP API providers include SerpAPI ($50-250/mo), ScraperAPI ($29-249/mo), and Apify actors (pay-per-result). These handle proxy rotation, CAPTCHA solving, and browser fingerprinting for you.

Proxy Rotation: The Missing Piece

Regardless of which scraping approach you use, proxy rotation is essential for anything beyond toy volumes. Here is a production-ready proxy rotation setup:

import httpx
import random
import time
from dataclasses import dataclass
from enum import Enum

class ProxyType(Enum):
    DATACENTER = "datacenter"
    RESIDENTIAL = "residential"
    ISP = "isp"
    MOBILE = "mobile"

@dataclass
class ProxyConfig:
    host: str
    port: int
    username: str
    password: str
    proxy_type: ProxyType = ProxyType.RESIDENTIAL

    @property
    def url(self) -> str:
        return f"http://{self.username}:{self.password}@{self.host}:{self.port}"

class ProxyRotator:
    """Rotate through proxy endpoints with health tracking."""

    def __init__(self, proxies: list[ProxyConfig]):
        self.proxies = proxies
        self.failures: dict[str, int] = {}
        self.max_failures = 3

    def get_proxy(self) -> ProxyConfig:
        """Get a healthy proxy, avoiding recently failed ones."""
        healthy = [
            p for p in self.proxies
            if self.failures.get(p.url, 0) < self.max_failures
        ]
        if not healthy:
            self.failures.clear()
            healthy = self.proxies
        return random.choice(healthy)

    def report_failure(self, proxy: ProxyConfig):
        self.failures[proxy.url] = self.failures.get(proxy.url, 0) + 1

    def report_success(self, proxy: ProxyConfig):
        self.failures.pop(proxy.url, None)


def scrape_with_proxy_rotation(
    queries: list[str],
    proxy_rotator: ProxyRotator,
    delay_range: tuple[float, float] = (3.0, 8.0),
) -> list[dict]:
    """Scrape multiple Google queries with proxy rotation and retry logic."""
    results = []

    for query in queries:
        max_retries = 3
        for attempt in range(max_retries):
            proxy = proxy_rotator.get_proxy()

            try:
                with httpx.Client(
                    proxies={"all://": proxy.url},
                    headers=build_google_headers(),
                    timeout=15.0,
                    follow_redirects=True,
                ) as client:
                    resp = client.get(
                        "https://www.google.com/search",
                        params={"q": query, "num": "10", "hl": "en"},
                    )

                    if resp.status_code == 429 or "unusual traffic" in resp.text.lower():
                        proxy_rotator.report_failure(proxy)
                        time.sleep(random.uniform(10, 30))
                        continue

                    resp.raise_for_status()
                    proxy_rotator.report_success(proxy)

                    serp = parse_serp(resp.text, query)
                    results.append({"query": query, "results": serp})
                    break

            except (httpx.ConnectError, httpx.ProxyError):
                proxy_rotator.report_failure(proxy)
                continue

        # Human-like delay between queries
        time.sleep(random.uniform(*delay_range))

    return results


# Setup with ThorData residential proxies
thordata_proxies = [
    ProxyConfig(
        host="rotating.thordata.com",
        port=9000,
        username="your_username",
        password="your_password",
        proxy_type=ProxyType.RESIDENTIAL,
    ),
]

rotator = ProxyRotator(thordata_proxies)
queries = ["python web scraping 2025", "best proxy providers", "httpx tutorial"]
results = scrape_with_proxy_rotation(queries, rotator)

For proxy providers, I have had consistently good results with ThorData for residential proxy rotation. Their rotating residential pool is specifically effective for search engine scraping because their IPs come from real ISPs and have high trust scores with Google. The per-GB pricing model means you only pay for bandwidth used, which keeps costs predictable compared to per-request pricing.

Proxy selection tip: For Google scraping specifically, residential proxies are non-negotiable. Datacenter proxies ($1-5/GB) are detected and blocked almost immediately. Residential proxies ($5-15/GB) from providers like ThorData route through real ISP connections that Google cannot easily distinguish from normal users. ISP proxies (static residential) have the best success rates but cost more.
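With per-GB pricing, cost becomes a function of page weight. A rough worked example (the ~400 KB average SERP size and the $10/GB rate are illustrative assumptions; measure your own responses):

```python
def serp_bandwidth_cost(num_queries: int, avg_page_kb: float = 400.0,
                        price_per_gb: float = 10.0) -> float:
    """Estimate proxy bandwidth cost in USD for a batch of SERP fetches."""
    gb = num_queries * avg_page_kb / (1024 * 1024)  # KB -> GB
    return gb * price_per_gb

# 10,000 queries at ~400 KB each is roughly 3.8 GB, about $38 at $10/GB
```

Blocking images and fonts at the client (easy in Playwright via request interception) can cut the per-page figure substantially, which matters more under per-GB billing than under per-request billing.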

Anti-Detection Best Practices

1. Request Timing

import random
import time

def human_delay(min_sec: float = 2.0, max_sec: float = 8.0):
    """Generate a random delay mimicking human browsing patterns.
    Uses a slight bias toward shorter delays (humans are impatient)."""
    delay = random.triangular(min_sec, max_sec, min_sec + (max_sec - min_sec) * 0.3)
    time.sleep(delay)

# Between searches: 3-10 seconds
# Between pagination: 1-4 seconds
# After CAPTCHA or soft block: 30-120 seconds
# Between sessions: 5-15 minutes

2. Session Management

import time

class ScrapingSession:
    """Manage a scraping session with limits and rotation."""

    def __init__(self, max_requests_per_session: int = 15):
        self.max_requests = max_requests_per_session
        self.request_count = 0
        self.session_start = time.time()

    def should_rotate(self) -> bool:
        """Check if it is time to rotate IP/session."""
        if self.request_count >= self.max_requests:
            return True
        if time.time() - self.session_start > 300:  # 5 minutes
            return True
        return False

    def record_request(self):
        self.request_count += 1

3. Geographic Consistency

# Match your proxy location to your search parameters
search_configs = {
    "us": {
        "gl": "us",
        "hl": "en",
        "timezone": "America/New_York",
        "accept_language": "en-US,en;q=0.9",
    },
    "uk": {
        "gl": "uk",
        "hl": "en",
        "domain": "google.co.uk",
        "timezone": "Europe/London",
        "accept_language": "en-GB,en;q=0.9",
    },
    "de": {
        "gl": "de",
        "hl": "de",
        "domain": "google.de",
        "timezone": "Europe/Berlin",
        "accept_language": "de-DE,de;q=0.9,en;q=0.8",
    },
}
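Tying the table above together: a small helper that derives the request URL and headers from one of these configs, so locale, domain, and Accept-Language can never drift apart. The configs are copied from above (with `google.com` assumed for the US entry); the helper itself is a sketch:

```python
from urllib.parse import urlencode

SEARCH_CONFIGS = {
    "us": {"gl": "us", "hl": "en", "domain": "google.com",
           "accept_language": "en-US,en;q=0.9"},
    "uk": {"gl": "uk", "hl": "en", "domain": "google.co.uk",
           "accept_language": "en-GB,en;q=0.9"},
    "de": {"gl": "de", "hl": "de", "domain": "google.de",
           "accept_language": "de-DE,de;q=0.9,en;q=0.8"},
}

def build_localized_request(query: str, region: str) -> tuple[str, dict]:
    """Return (url, headers) whose locale signals are mutually consistent."""
    cfg = SEARCH_CONFIGS[region]
    url = f"https://www.{cfg['domain']}/search?" + urlencode(
        {"q": query, "gl": cfg["gl"], "hl": cfg["hl"], "num": "10"}
    )
    headers = {"Accept-Language": cfg["accept_language"]}
    return url, headers
```

Pair the chosen region with a proxy exit in the same country; a German Accept-Language arriving from a US residential IP is exactly the kind of inconsistency Layer 4 looks for.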

Error Handling: CAPTCHAs, Rate Limits, and Blocks

A production scraper needs robust error handling. Here is a comprehensive error handler:

from enum import Enum
from dataclasses import dataclass
import logging
import httpx

logger = logging.getLogger(__name__)

class BlockType(Enum):
    CAPTCHA = "captcha"
    RATE_LIMIT = "rate_limit"
    IP_BAN = "ip_ban"
    SOFT_BLOCK = "soft_block"
    UNKNOWN = "unknown"

@dataclass
class BlockDetection:
    blocked: bool
    block_type: BlockType
    message: str
    retry_after: int  # seconds

def detect_block(response: httpx.Response) -> BlockDetection:
    """Analyze a Google response for various types of blocking."""

    # HTTP 429 - explicit rate limit
    if response.status_code == 429:
        try:
            retry_after = int(response.headers.get("retry-after", 60))
        except ValueError:  # Retry-After may be an HTTP-date, not seconds
            retry_after = 60
        return BlockDetection(
            blocked=True,
            block_type=BlockType.RATE_LIMIT,
            message="HTTP 429 rate limit hit",
            retry_after=retry_after,
        )

    # HTTP 503 - usually CAPTCHA
    if response.status_code == 503:
        return BlockDetection(
            blocked=True,
            block_type=BlockType.CAPTCHA,
            message="CAPTCHA challenge served",
            retry_after=120,
        )

    text = response.text.lower()

    # Check for CAPTCHA indicators
    if "unusual traffic" in text or "captcha" in text or "recaptcha" in text:
        return BlockDetection(
            blocked=True,
            block_type=BlockType.CAPTCHA,
            message="CAPTCHA detected in response body",
            retry_after=120,
        )

    # Check for empty results (soft block)
    if response.status_code == 200 and 'div class="g"' not in response.text:
        if "did not match any documents" not in text:
            return BlockDetection(
                blocked=True,
                block_type=BlockType.SOFT_BLOCK,
                message="200 OK but no results rendered - possible soft block",
                retry_after=60,
            )

    return BlockDetection(blocked=False, block_type=BlockType.UNKNOWN, message="", retry_after=0)


def handle_block(detection: BlockDetection, proxy_rotator: ProxyRotator, current_proxy: ProxyConfig):
    """Take appropriate action based on block type."""

    if detection.block_type == BlockType.RATE_LIMIT:
        logger.warning(f"Rate limited. Waiting {detection.retry_after}s and rotating proxy.")
        proxy_rotator.report_failure(current_proxy)
        time.sleep(detection.retry_after)

    elif detection.block_type == BlockType.CAPTCHA:
        logger.warning("CAPTCHA triggered. Rotating proxy and backing off.")
        proxy_rotator.report_failure(current_proxy)
        time.sleep(detection.retry_after + random.uniform(0, 60))

    elif detection.block_type == BlockType.IP_BAN:
        logger.error("IP appears banned. Removing from rotation.")
        proxy_rotator.report_failure(current_proxy)
        proxy_rotator.report_failure(current_proxy)  # Double-fail to remove faster
        time.sleep(300)

    elif detection.block_type == BlockType.SOFT_BLOCK:
        logger.info("Soft block detected. Brief pause and retry.")
        time.sleep(detection.retry_after)
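These two functions slot into a simple retry wrapper. A minimal sketch with the callables injected so it is testable without a live client — in practice you would pass a closure over your proxy-aware fetch, detect_block, and a handler that closes over your rotator:

```python
def fetch_with_block_handling(fetch, detect, handle, max_attempts: int = 3):
    """Fetch, check the response for blocks, delegate recovery, retry.
    `fetch` returns a response; `detect` maps it to an object with a
    `.blocked` flag (like BlockDetection); `handle` does the backoff."""
    for _ in range(max_attempts):
        response = fetch()
        detection = detect(response)
        if not detection.blocked:
            return response
        handle(detection)
    raise RuntimeError(f"still blocked after {max_attempts} attempts")
```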

Real-World Use Cases

SEO Rank Tracking

def track_rankings(
    domain: str,
    keywords: list[str],
    location: str = "us",
) -> list[dict]:
    """Track search rankings for a domain across keywords."""
    from datetime import date

    rankings = []

    for keyword in keywords:
        serp = scrape_google(keyword, num_results=100, country=location)

        rank = None
        for result in serp.results:
            if domain in result.url:
                rank = result.position
                break

        rankings.append({
            "keyword": keyword,
            "rank": rank,
            "top_competitor": serp.results[0].url if serp.results else None,
            "date": date.today().isoformat(),  # record when the rank was observed
        })

        human_delay(5, 12)

    return rankings

Competitor Monitoring

def monitor_competitors(
    competitors: list[str],
    industry_keywords: list[str],
) -> dict:
    """Identify which competitors rank for target keywords."""
    visibility = {
        comp: {"ranked_keywords": 0, "avg_position": 0, "positions": []}
        for comp in competitors
    }

    for keyword in industry_keywords:
        serp = scrape_google(keyword, num_results=20)

        for result in serp.results:
            for comp in competitors:
                if comp in result.url:
                    visibility[comp]["ranked_keywords"] += 1
                    visibility[comp]["positions"].append(result.position)

        human_delay()

    for comp in competitors:
        positions = visibility[comp]["positions"]
        if positions:
            visibility[comp]["avg_position"] = sum(positions) / len(positions)

    return visibility

Market Research and Content Gap Analysis

def find_content_gaps(
    your_domain: str,
    competitor_domain: str,
    seed_keywords: list[str],
) -> list[dict]:
    """Find keywords where competitors rank but you do not."""
    gaps = []

    for keyword in seed_keywords:
        serp = scrape_google(keyword, num_results=50)
        urls = [r.url for r in serp.results]

        competitor_ranks = any(competitor_domain in u for u in urls)
        your_ranks = any(your_domain in u for u in urls)

        if competitor_ranks and not your_ranks:
            gaps.append({
                "keyword": keyword,
                "competitor_position": next(
                    r.position for r in serp.results if competitor_domain in r.url
                ),
                "opportunity": "high" if any(
                    competitor_domain in u for u in urls[:10]
                ) else "medium",
            })

        human_delay(4, 10)

    return gaps

Lead Generation from SERPs

from urllib.parse import urlparse

def extract_business_leads(
    search_queries: list[str],
    exclude_domains: list[str] | None = None,
) -> list[dict]:
    """Extract business websites from search results for lead generation."""
    exclude = set(exclude_domains or [
        "wikipedia.org", "youtube.com", "reddit.com",
        "facebook.com", "twitter.com", "linkedin.com",
    ])
    leads = []

    for query in search_queries:
        serp = scrape_google(query, num_results=20)

        for result in serp.results:
            domain = urlparse(result.url).netloc.lower()
            # Naive base-domain extraction; breaks on multi-part TLDs
            # like example.co.uk -- use tldextract for anything serious.
            base_domain = ".".join(domain.split(".")[-2:])

            if base_domain not in exclude:
                leads.append({
                    "domain": base_domain,
                    "title": result.title,
                    "url": result.url,
                    "snippet": result.snippet,
                    "source_query": query,
                })

        human_delay(5, 10)

    # Deduplicate by domain
    seen = set()
    unique_leads = []
    for lead in leads:
        if lead["domain"] not in seen:
            seen.add(lead["domain"])
            unique_leads.append(lead)

    return unique_leads

Method Comparison Table

| Method | Cost | Volume/Day | Success Rate | Maintenance | Best For |
| --- | --- | --- | --- | --- | --- |
| Raw httpx | Free | 20-50 | 30-50% | High | Quick one-off lookups |
| curl_cffi | Free + proxy cost | 200-500 | 60-80% | Medium | Moderate volume, budget-conscious |
| curl_cffi + ThorData residential proxies | $5-15/GB | 1,000-5,000 | 85-95% | Medium | Best value for scale |
| Playwright | Free + proxy cost | 50-200 | 50-70% | Very high | JS-required pages |
| Google CSE API | $5/1K queries | 10,000 (API cap) | 99%+ | Low | Official data, simple needs |
| SERP API services | $50-250/mo | Plan quota | 95-99% | Very low | Production pipelines |

Advanced: Distributed Scraping Architecture

For large-scale SERP monitoring (10,000+ queries/day), you need a distributed architecture:

import asyncio
import json
import random
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class ScrapeJob:
    query: str
    priority: int = 0
    country: str = "us"
    created_at: str = ""

    def __post_init__(self):
        if not self.created_at:
            self.created_at = datetime.utcnow().isoformat()

class ScrapeOrchestrator:
    """Coordinate distributed scraping across multiple workers."""

    def __init__(self, worker_count: int = 5, proxies_per_worker: int = 3):
        self.worker_count = worker_count
        self.job_queue: asyncio.Queue[ScrapeJob] = asyncio.Queue()
        self.results: list[dict] = []
        self.proxies_per_worker = proxies_per_worker

    async def add_jobs(self, queries: list[str], priority: int = 0):
        """Add scraping jobs to the queue."""
        for query in queries:
            await self.job_queue.put(ScrapeJob(query=query, priority=priority))

    async def worker(self, worker_id: int, proxy_rotator: ProxyRotator):
        """Individual worker that processes jobs from the queue."""
        while True:
            try:
                job = await asyncio.wait_for(self.job_queue.get(), timeout=30)
            except asyncio.TimeoutError:
                break

            proxy = proxy_rotator.get_proxy()
            try:
                # scrape_google should be wired to route through `proxy`;
                # it is called positionally here to keep the example compact.
                result = await asyncio.to_thread(
                    scrape_google, job.query, 10, job.country
                )
                self.results.append({
                    "job": asdict(job),
                    "results": asdict(result),
                    "worker_id": worker_id,
                    "timestamp": datetime.utcnow().isoformat(),
                })
                proxy_rotator.report_success(proxy)
            except Exception as e:
                logger.warning(f"Worker {worker_id} failed on {job.query!r}: {e}")
                proxy_rotator.report_failure(proxy)
                # Re-queue failed jobs, using `priority` as a retry counter
                # (a plain asyncio.Queue is FIFO; use PriorityQueue for real ordering)
                if job.priority < 3:
                    job.priority += 1
                    await self.job_queue.put(job)

            self.job_queue.task_done()  # before the sleep, so join() is not delayed

            # Stagger requests per worker
            await asyncio.sleep(random.uniform(5, 15))

    async def run(self, proxy_rotator: ProxyRotator):
        """Launch all workers and wait for completion."""
        workers = [
            asyncio.create_task(self.worker(i, proxy_rotator))
            for i in range(self.worker_count)
        ]
        await self.job_queue.join()
        for w in workers:
            w.cancel()
        await asyncio.gather(*workers, return_exceptions=True)  # reap cancellations
        return self.results
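To see the lifecycle in isolation, here is the same queue/worker/join/cancel pattern stripped of scraping, proxies, and delays — a runnable toy, not production code:

```python
import asyncio

async def demo_pool(queries: list[str], worker_count: int = 3) -> list[str]:
    """N workers drain one shared queue; join() returns once every
    task_done() has fired, then the idle workers are cancelled."""
    queue: asyncio.Queue = asyncio.Queue()
    results: list[str] = []
    for q in queries:
        queue.put_nowait(q)

    async def worker(wid: int):
        while True:
            q = await queue.get()
            await asyncio.sleep(0)      # real code: scrape + jittered delay
            results.append(f"{wid}:{q}")
            queue.task_done()

    workers = [asyncio.create_task(worker(i)) for i in range(worker_count)]
    await queue.join()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)  # reap cancellations
    return results

results = asyncio.run(demo_pool(["a", "b", "c", "d"]))
```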

What I Actually Recommend

For learning and experimentation: Start with the raw httpx approach. Understanding how Google's detection works is valuable even if the approach does not scale.

For moderate-volume data collection (100-2,000 queries/day): Use curl_cffi with browser impersonation plus ThorData rotating residential proxies. This combination gives you a genuine browser TLS fingerprint with real residential IPs — the two biggest factors in avoiding detection. At $5-15 per GB, a thousand searches costs only a few dollars in bandwidth.

For production pipelines (2,000+ queries/day): Use a managed SERP API service. I spent months maintaining my own Playwright-based SERP scraper with proxy rotation and stealth patches, and I was spending more time fixing breakage than building features. The managed services cost money, but they cost less than your engineering time.

For official/compliant use: Google's Custom Search API. The results are not identical to organic search, but they are legal, reliable, and require zero maintenance.

The general rule: if you are scraping Google fewer than 50 times a day, raw requests with a good user agent might work. Beyond that, you need residential proxies. Beyond a few hundred queries, the economics of build-vs-buy favor a managed API — your time is worth more than the subscription cost.
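As a sanity check on the "few dollars per thousand searches" claim, here is the back-of-envelope arithmetic. The 400 KB page size and $10/GB rate are rough midpoints I am assuming, not provider quotes — plug in your own measurements:

```python
def proxy_cost_usd(searches: int, kb_per_page: float = 400.0,
                   usd_per_gb: float = 10.0) -> float:
    """Estimate residential-proxy bandwidth cost for a batch of searches."""
    gb = searches * kb_per_page / (1024 * 1024)
    return round(gb * usd_per_gb, 2)
```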

Further Reading