Scraping Google Scholar Citations and Author Profiles with Python (2026)

2026-04-09 [python google-scholar scraping citations research proxy-rotation anti-bot playwright]

Scraping Google Scholar Citations and Author Profiles with Python (2026)

Google Scholar is the world's most comprehensive academic search engine, indexing over 200 million scholarly documents across virtually every field of human knowledge. Yet for all its scope and value, Google Scholar provides no public API. The Academic Knowledge API that Microsoft offered died years ago. Google's own Scholar API was quietly shuttered long before that. The result is that every researcher, bibliometrician, academic institution, or competitive intelligence analyst who needs machine-readable citation data faces the same fundamental problem: the only way to get it is to scrape it.

And scraping Google Scholar is genuinely hard. Not because the HTML is complex — Scholar pages are remarkably clean markup. The difficulty is Google's layered bot detection infrastructure, which sits in a different category from most websites. Google operates one of the most sophisticated anti-bot systems ever built, and they apply it to Scholar just as aggressively as to Search. After 5-10 requests from a datacenter IP, you'll see a CAPTCHA. Push harder and you'll see an IP ban lasting hours to days. The combination of request fingerprinting, behavioral analysis, cookie tracking, and IP reputation scoring means that naive approaches simply don't work.

This guide covers the full spectrum of what actually works in 2026: the scholarly library for simple cases, raw httpx requests for controlled scraping, Playwright for full browser automation, and the complete proxy rotation strategy that makes all of it viable at scale. Every code example is tested against real Scholar endpoints.

The use cases for this data are substantial: tracking citation counts for grant applications, computing h-indices and bibliometric scores for promotion decisions, building academic recommendation systems, monitoring when your papers get cited, competitive analysis of research groups, and building literature maps that visualize how ideas propagate through citation networks.

Rate of change: Google regularly updates its bot detection. Techniques that worked 6 months ago sometimes fail today. Always test your approach with small volumes before running large jobs. The fundamentals in this guide — proper proxy rotation, realistic request timing, session management — remain stable even as surface-level selectors change.

Why Residential Proxies Are Non-Negotiable

Before getting into code, let's be explicit about infrastructure requirements. Google Scholar cannot be scraped at any meaningful volume from:

Raw datacenter IPs (AWS, GCP, Azure, DigitalOcean, etc.) — blocked within minutes
VPN exit nodes — most have burned IP reputation from prior abuse
Tor exit nodes — outright blocked

What actually works is residential proxy networks — IP addresses assigned to real home and mobile internet connections, belonging to real ISPs. Google's reputation scoring trusts these because they look like actual users. ThorData provides access to a residential proxy network covering 195+ countries with real ISP-assigned IPs. For Scholar specifically, US residential IPs perform best since Scholar's primary interface assumes US browsing behavior.

The economics: you pay per GB of proxy traffic. A single Scholar author profile page is roughly 50-80KB. At 10-15 requests per complete author profile (filling all publications), you're looking at roughly 1MB per author. Plan your proxy budget accordingly.

Setup

pip install scholarly httpx beautifulsoup4 lxml playwright tenacity aiohttp pandas
playwright install chromium

Environment setup:

export THORDATA_USER="your_username"
export THORDATA_PASS="your_password"
export SCHOLAR_PROXY="http://${THORDATA_USER}:${THORDATA_PASS}@proxy.thordata.com:9000"

Understanding Google Scholar's URL Structure

The API surface is consistent and predictable:

# Author profile
https://scholar.google.com/citations?user={AUTHOR_ID}&hl=en

# Author profile with sort by citations
https://scholar.google.com/citations?user={AUTHOR_ID}&hl=en&sortby=citedby

# Search for author
https://scholar.google.com/citations?view_op=search_authors&mauthors={name}&hl=en

# Article search
https://scholar.google.com/scholar?q={query}&hl=en

# Papers citing a specific paper (by cluster ID)
https://scholar.google.com/scholar?cites={CLUSTER_ID}&hl=en

# All versions of a paper
https://scholar.google.com/scholar?cluster={CLUSTER_ID}&hl=en

# Publication detail (from author profile)
https://scholar.google.com/citations?view_op=view_citation&user={AUTHOR_ID}&citation_for_view={AUTHOR_ID}:{PUB_ID}

Author IDs are 12-character alphanumeric strings visible in profile URLs. Cluster IDs appear in citation links. Both are stable identifiers that persist across Scholar's indexing updates.

The scholarly Library for Small-Scale Work

For up to a few hundred requests, scholarly provides the most convenient interface:

import os
import time
import random
import logging
from scholarly import scholarly, ProxyGenerator
from typing import Optional, List, Dict, Any

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

def configure_scholarly_proxy(
    proxy_url: Optional[str] = None,
    use_tor: bool = False,
    use_luminati: bool = False,
) -> None:
    """Configure scholarly's proxy backend."""
    pg = ProxyGenerator()

    if proxy_url:
        # Single proxy (use with sticky session URL for per-author consistency)
        pg.SingleProxy(http=proxy_url, https=proxy_url)
        scholarly.use_proxy(pg)
        logger.info(f"Configured scholarly with proxy: {proxy_url[:40]}...")
    elif use_tor:
        # Tor browser must be running locally
        pg.Tor_Internal(tor_cmd="tor")
        scholarly.use_proxy(pg)
        logger.info("Configured scholarly with Tor")
    else:
        logger.warning("No proxy configured - will hit CAPTCHA quickly on Scholar")

def get_author_by_id(author_id: str, fill: bool = True) -> Optional[Dict]:
    """Fetch a Scholar author by their ID."""
    try:
        author = scholarly.search_author_id(author_id)
        if fill:
            author = scholarly.fill(author, sections=["basics", "indices", "counts", "publications"])
        return dict(author)
    except Exception as e:
        logger.error(f"Failed to fetch author {author_id}: {e}")
        return None

def get_author_by_name(name: str, affiliation_filter: str = "") -> Optional[Dict]:
    """Search for an author by name with optional affiliation filter."""
    try:
        search_results = scholarly.search_author(name)
        for candidate in search_results:
            if affiliation_filter:
                aff = candidate.get("affiliation", "").lower()
                if affiliation_filter.lower() not in aff:
                    continue
            # Fill the first matching result
            return dict(scholarly.fill(candidate))
    except StopIteration:
        logger.info(f"No results found for author: {name}")
    except Exception as e:
        logger.error(f"Author search failed for {name}: {e}")
    return None

def get_publication_details(author: Dict, pub_index: int) -> Optional[Dict]:
    """Fill complete details for a specific publication."""
    try:
        pubs = author.get("publications", [])
        if pub_index >= len(pubs):
            return None
        pub = scholarly.fill(pubs[pub_index])
        return dict(pub)
    except Exception as e:
        logger.error(f"Failed to fill publication {pub_index}: {e}")
        return None

# Complete author scraping example
proxy_url = os.environ.get("SCHOLAR_PROXY", "")
configure_scholarly_proxy(proxy_url=proxy_url if proxy_url else None)

# Geoffrey Hinton's Scholar ID
hinton = get_author_by_id("JicYPdAAAAAJ")
if hinton:
    print(f"Name: {hinton['name']}")
    print(f"Affiliation: {hinton.get('affiliation', 'N/A')}")
    print(f"Citations (all time): {hinton.get('citedby', 0)}")
    print(f"Citations (since 2019): {hinton.get('citedby5y', 0)}")
    print(f"h-index: {hinton.get('hindex', 0)}")
    print(f"h-index (5yr): {hinton.get('hindex5y', 0)}")
    print(f"i10-index: {hinton.get('i10index', 0)}")
    print(f"Publications: {len(hinton.get('publications', []))}")

Raw httpx Scraper for Fine-Grained Control

When you need precise control over headers, timing, and proxy rotation, bypass scholarly and hit Scholar directly:

import httpx
from bs4 import BeautifulSoup
import re
from dataclasses import dataclass, asdict, field
from typing import Iterator, Optional, List, Dict, Any
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

@dataclass
class ScholarPublication:
    title: str
    authors: str
    venue: str
    year: Optional[int]
    citations: int
    citation_url: Optional[str]
    pub_url: Optional[str]
    cluster_id: Optional[str]

@dataclass
class ScholarAuthor:
    author_id: str
    name: str
    affiliation: str
    email_domain: str
    interests: List[str]
    total_citations: int
    citations_since_2019: int
    h_index: int
    h_index_5yr: int
    i10_index: int
    i10_index_5yr: int
    publications: List[ScholarPublication] = field(default_factory=list)

def build_scholar_headers(referer: Optional[str] = None) -> Dict[str, str]:
    ua = random.choice(USER_AGENTS)
    headers = {
        "User-Agent": ua,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin" if referer else "none",
        "Sec-Fetch-User": "?1",
    }
    if referer:
        headers["Referer"] = referer
    return headers

class ScholarScraper:
    """Direct HTTP scraper for Google Scholar with proxy rotation."""

    BASE_URL = "https://scholar.google.com"

    def __init__(self, proxy_url: Optional[str] = None, request_delay: float = 5.0):
        self.proxy_url = proxy_url
        self.delay = request_delay
        self._cookies: Dict[str, str] = {}

    def _make_client(self) -> httpx.Client:
        kwargs = {
            "headers": build_scholar_headers(),
            "timeout": httpx.Timeout(30.0),
            "follow_redirects": True,
        }
        if self.proxy_url:
            kwargs["proxy"] = self.proxy_url
        return httpx.Client(**kwargs)

    def _is_captcha(self, html: str) -> bool:
        signals = ["captcha", "unusual traffic", "i'm not a robot", "recaptcha"]
        lower = html.lower()
        return any(s in lower for s in signals)

    @retry(
        stop=stop_after_attempt(4),
        wait=wait_exponential(multiplier=2, min=5, max=90),
        retry=retry_if_exception_type((httpx.RequestError, httpx.HTTPStatusError)),
    )
    def get(self, path: str, params: Optional[Dict] = None) -> httpx.Response:
        """Make a GET request with retry logic."""
        time.sleep(self.delay + random.uniform(-1, 2))

        url = f"{self.BASE_URL}{path}"
        referer = self.BASE_URL if path != "/" else None

        with self._make_client() as client:
            # Carry cookies from previous requests
            for name, value in self._cookies.items():
                client.cookies.set(name, value)

            resp = client.get(url, params=params, headers=build_scholar_headers(referer))

            # Store any new cookies
            for name, value in resp.cookies.items():
                self._cookies[name] = value

            if resp.status_code == 429:
                logger.warning("Rate limited by Scholar")
                raise httpx.HTTPStatusError("Rate limited", request=resp.request, response=resp)

            if self._is_captcha(resp.text):
                logger.warning(f"CAPTCHA detected at {url}")
                raise httpx.RequestError("CAPTCHA encountered")

            resp.raise_for_status()
            return resp

    def get_author_profile(self, author_id: str) -> Optional[ScholarAuthor]:
        """Fetch and parse a complete author profile."""
        resp = self.get("/citations", params={"user": author_id, "hl": "en"})
        return self._parse_author_page(resp.text, author_id)

    def _parse_author_page(self, html: str, author_id: str) -> Optional[ScholarAuthor]:
        soup = BeautifulSoup(html, "lxml")

        # Author name
        name_el = soup.find("div", id="gsc_prf_in")
        name = name_el.get_text(strip=True) if name_el else "Unknown"

        # Affiliation
        aff_el = soup.find("div", class_="gsc_prf_il")
        affiliation = aff_el.get_text(strip=True) if aff_el else ""

        # Email domain
        email_el = soup.find("div", id="gsc_prf_ivh")
        email_domain = ""
        if email_el:
            match = re.search(r"Verified email at (\S+)", email_el.get_text())
            if match:
                email_domain = match.group(1).rstrip("·").strip()

        # Research interests
        interests = []
        for interest_link in soup.select("#gsc_prf_int a"):
            interests.append(interest_link.get_text(strip=True))

        # Citation statistics from the stats table
        stats = {}
        stats_table = soup.find("table", id="gsc_rsb_st")
        if stats_table:
            rows = stats_table.find_all("tr")
            for row in rows[1:]:  # Skip header
                cells = row.find_all("td")
                if len(cells) >= 3:
                    metric = cells[0].get_text(strip=True)
                    all_time = cells[1].get_text(strip=True)
                    since_2019 = cells[2].get_text(strip=True)
                    stats[metric] = {"all": self._safe_int(all_time), "since_2019": self._safe_int(since_2019)}

        # Parse publications from the table
        publications = []
        for pub_row in soup.select("#gsc_a_b tr.gsc_a_tr"):
            pub = self._parse_publication_row(pub_row)
            if pub:
                publications.append(pub)

        return ScholarAuthor(
            author_id=author_id,
            name=name,
            affiliation=affiliation,
            email_domain=email_domain,
            interests=interests,
            total_citations=stats.get("Citations", {}).get("all", 0),
            citations_since_2019=stats.get("Citations", {}).get("since_2019", 0),
            h_index=stats.get("h-index", {}).get("all", 0),
            h_index_5yr=stats.get("h-index", {}).get("since_2019", 0),
            i10_index=stats.get("i10-index", {}).get("all", 0),
            i10_index_5yr=stats.get("i10-index", {}).get("since_2019", 0),
            publications=publications,
        )

    def _parse_publication_row(self, row: BeautifulSoup) -> Optional[ScholarPublication]:
        """Parse a single publication row from author profile."""
        try:
            title_el = row.select_one(".gsc_a_at")
            title = title_el.get_text(strip=True) if title_el else ""
            pub_url = title_el.get("href", "") if title_el else ""
            if pub_url and not pub_url.startswith("http"):
                pub_url = "https://scholar.google.com" + pub_url

            meta_el = row.select_one(".gsc_a_t .gs_gray")
            authors = ""
            venue = ""
            if meta_el:
                metas = [m.get_text(strip=True) for m in row.select(".gsc_a_t .gs_gray")]
                if metas:
                    authors = metas[0] if metas else ""
                    venue = metas[1] if len(metas) > 1 else ""

            cite_el = row.select_one(".gsc_a_c a")
            citations = self._safe_int(cite_el.get_text(strip=True) if cite_el else "0")
            citation_url = cite_el.get("href", "") if cite_el else ""
            if citation_url and not citation_url.startswith("http"):
                citation_url = "https://scholar.google.com" + citation_url

            year_el = row.select_one(".gsc_a_y span")
            year = self._safe_int(year_el.get_text(strip=True) if year_el else "")

            return ScholarPublication(
                title=title,
                authors=authors,
                venue=venue,
                year=year,
                citations=citations,
                citation_url=citation_url,
                pub_url=pub_url,
                cluster_id=self._extract_cluster_id(citation_url),
            )
        except Exception as e:
            logger.warning(f"Failed parsing publication row: {e}")
            return None

    def _extract_cluster_id(self, url: str) -> Optional[str]:
        if not url:
            return None
        match = re.search(r"cites=(\d+)", url)
        return match.group(1) if match else None

    def _safe_int(self, text: str) -> int:
        try:
            return int(re.sub(r"[^\d]", "", text))
        except (ValueError, TypeError):
            return 0

    def search_papers(self, query: str, num_pages: int = 3) -> Iterator[Dict]:
        """Search Scholar for papers matching a query."""
        for page in range(num_pages):
            start = page * 10
            resp = self.get("/scholar", params={"q": query, "hl": "en", "start": start})
            soup = BeautifulSoup(resp.text, "lxml")

            results = soup.select(".gs_ri")
            if not results:
                break

            for result in results:
                yield self._parse_search_result(result)

            # Check if there's a next page
            next_btn = soup.find("button", {"aria-label": re.compile(r"Next", re.I)})
            if not next_btn:
                break

    def _parse_search_result(self, element: BeautifulSoup) -> Dict:
        title_el = element.select_one(".gs_rt")
        snippet_el = element.select_one(".gs_rs")
        meta_el = element.select_one(".gs_fl")
        authors_venue_el = element.select_one(".gs_a")

        cite_count = 0
        cite_url = ""
        if meta_el:
            cite_link = meta_el.find("a", string=re.compile(r"Cited by"))
            if cite_link:
                match = re.search(r"(\d+)", cite_link.get_text())
                if match:
                    cite_count = int(match.group(1))
                cite_url = "https://scholar.google.com" + cite_link.get("href", "")

        return {
            "title": title_el.get_text(strip=True) if title_el else "",
            "title_url": (title_el.find("a") or {}).get("href", "") if title_el else "",
            "snippet": snippet_el.get_text(strip=True) if snippet_el else "",
            "authors_venue": authors_venue_el.get_text(strip=True) if authors_venue_el else "",
            "citation_count": cite_count,
            "citation_url": cite_url,
        }

# Usage
scraper = ScholarScraper(
    proxy_url=os.environ.get("SCHOLAR_PROXY", ""),
    request_delay=5.0,
)
author = scraper.get_author_profile("JicYPdAAAAAJ")
if author:
    print(f"{author.name}: {author.total_citations} citations, h-index {author.h_index}")
    print(f"Top paper: {author.publications[0].title if author.publications else 'N/A'}")

Playwright for JavaScript-Rendered Pages

Scholar's author profiles sometimes trigger JS challenges or require interaction (like "Show more" for full publication lists). Playwright handles both:

import asyncio
from playwright.async_api import async_playwright, Page, Browser, BrowserContext
from typing import List, AsyncIterator

async def launch_stealth_browser(proxy_url: Optional[str] = None) -> Browser:
    """Launch Chromium with anti-detection measures."""
    playwright = await async_playwright().start()

    args = [
        "--disable-blink-features=AutomationControlled",
        "--no-sandbox",
        "--disable-setuid-sandbox",
        "--disable-dev-shm-usage",
        "--disable-accelerated-2d-canvas",
        "--no-first-run",
        "--no-zygote",
        "--disable-gpu",
        "--window-size=1920,1080",
        "--lang=en-US,en",
    ]

    kwargs: Dict[str, Any] = {"headless": True, "args": args}

    if proxy_url:
        from urllib.parse import urlparse
        parsed = urlparse(proxy_url)
        kwargs["proxy"] = {
            "server": f"http://{parsed.hostname}:{parsed.port}",
            "username": parsed.username or "",
            "password": parsed.password or "",
        }

    return await playwright.chromium.launch(**kwargs)

async def make_stealth_context(browser: Browser) -> BrowserContext:
    """Create browser context with realistic fingerprint."""
    context = await browser.new_context(
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
        user_agent=random.choice(USER_AGENTS),
        extra_http_headers={
            "Accept-Language": "en-US,en;q=0.9",
        },
    )

    # Patch automation markers
    await context.add_init_script("""
        // Hide webdriver
        Object.defineProperty(navigator, 'webdriver', { get: () => undefined });

        // Mock plugins (real browsers have them)
        Object.defineProperty(navigator, 'plugins', {
            get: () => {
                return {
                    length: 3,
                    0: { name: 'Chrome PDF Plugin' },
                    1: { name: 'Chrome PDF Viewer' },
                    2: { name: 'Native Client' },
                };
            }
        });

        // Realistic language list
        Object.defineProperty(navigator, 'languages', {
            get: () => ['en-US', 'en']
        });

        // Chrome object (missing in some headless configs)
        window.chrome = {
            runtime: {},
            loadTimes: function() {},
            csi: function() {},
            app: {},
        };

        // Remove headless hardware concurrency hint
        Object.defineProperty(navigator, 'hardwareConcurrency', { get: () => 8 });
    """)

    return context

async def scrape_full_author_profile(
    author_id: str,
    proxy_url: Optional[str] = None,
) -> Optional[Dict]:
    """
    Scrape a complete author profile including all publications
    using full browser automation.
    """
    browser = await launch_stealth_browser(proxy_url)

    try:
        context = await make_stealth_context(browser)
        page = await context.new_page()
        page.set_default_timeout(30000)

        # Navigate to author profile
        url = f"https://scholar.google.com/citations?user={author_id}&hl=en"
        await page.goto(url, wait_until="networkidle")
        await asyncio.sleep(random.uniform(2, 4))

        # Handle CAPTCHA if present
        captcha = await page.query_selector("form#captcha-form, .captcha-container")
        if captcha:
            logger.warning(f"CAPTCHA on author profile {author_id}")
            return None

        # Extract basic stats
        author_data = await page.evaluate("""
            () => {
                const stats = {};

                // Name and affiliation
                const nameEl = document.querySelector('#gsc_prf_in');
                stats.name = nameEl ? nameEl.textContent.trim() : '';

                const affEl = document.querySelector('.gsc_prf_il');
                stats.affiliation = affEl ? affEl.textContent.trim() : '';

                // Citation stats table
                const statTable = document.querySelector('#gsc_rsb_st');
                if (statTable) {
                    const rows = statTable.querySelectorAll('tr');
                    rows.forEach(row => {
                        const cells = row.querySelectorAll('td');
                        if (cells.length >= 3) {
                            const metric = cells[0].textContent.trim();
                            stats[metric + '_all'] = cells[1].textContent.trim();
                            stats[metric + '_5yr'] = cells[2].textContent.trim();
                        }
                    });
                }

                // Interests
                stats.interests = Array.from(
                    document.querySelectorAll('#gsc_prf_int a')
                ).map(a => a.textContent.trim());

                return stats;
            }
        """)

        # Load ALL publications by clicking "Show more" button
        all_pubs = []
        while True:
            show_more = await page.query_selector("#gsc_bpf_more:not([disabled])")
            if not show_more:
                break
            await show_more.click()
            await asyncio.sleep(random.uniform(1.5, 3.0))

        # Extract all publication rows
        pub_rows = await page.query_selector_all("#gsc_a_b tr.gsc_a_tr")
        for row in pub_rows:
            title_el = await row.query_selector(".gsc_a_at")
            cite_el = await row.query_selector(".gsc_a_c a")
            year_el = await row.query_selector(".gsc_a_y span")

            title = await title_el.inner_text() if title_el else ""
            pub_href = await title_el.get_attribute("href") if title_el else ""
            cite_text = await cite_el.inner_text() if cite_el else "0"
            cite_href = await cite_el.get_attribute("href") if cite_el else ""
            year_text = await year_el.inner_text() if year_el else ""

            all_pubs.append({
                "title": title.strip(),
                "pub_url": f"https://scholar.google.com{pub_href}" if pub_href else "",
                "citations": int(re.sub(r"\D", "", cite_text) or "0"),
                "citation_url": f"https://scholar.google.com{cite_href}" if cite_href else "",
                "year": int(year_text) if year_text.isdigit() else None,
            })

        author_data["publications"] = all_pubs
        author_data["author_id"] = author_id
        return author_data

    finally:
        await browser.close()

async def scrape_citing_papers(
    cluster_id: str,
    proxy_url: Optional[str] = None,
    max_pages: int = 5,
) -> List[Dict]:
    """Scrape all papers that cite a specific paper (by cluster ID)."""
    browser = await launch_stealth_browser(proxy_url)
    all_papers = []

    try:
        context = await make_stealth_context(browser)
        page = await context.new_page()

        for page_num in range(max_pages):
            start = page_num * 10
            url = f"https://scholar.google.com/scholar?cites={cluster_id}&hl=en&start={start}"

            await page.goto(url, wait_until="networkidle")
            await asyncio.sleep(random.uniform(3, 6))

            # Check for CAPTCHA
            if await page.query_selector("form#captcha-form"):
                logger.warning(f"CAPTCHA on page {page_num}")
                break

            papers = await page.evaluate("""
                () => {
                    return Array.from(document.querySelectorAll('.gs_ri')).map(el => ({
                        title: el.querySelector('.gs_rt')?.textContent?.trim() || '',
                        authors: el.querySelector('.gs_a')?.textContent?.trim() || '',
                        snippet: el.querySelector('.gs_rs')?.textContent?.trim() || '',
                        cite_count: (() => {
                            const citeLink = el.querySelector('.gs_fl a');
                            if (!citeLink) return 0;
                            const match = citeLink.textContent.match(/\d+/);
                            return match ? parseInt(match[0]) : 0;
                        })(),
                    }));
                }
            """)

            if not papers:
                break

            all_papers.extend(papers)
            logger.info(f"Citing papers page {page_num + 1}: {len(papers)} results")

    finally:
        await browser.close()

    return all_papers

# Run async scrapers
async def main():
    proxy = os.environ.get("SCHOLAR_PROXY", "")
    profile = await scrape_full_author_profile("JicYPdAAAAAJ", proxy_url=proxy or None)
    if profile:
        print(f"Scraped {len(profile.get('publications', []))} publications")
        for pub in sorted(profile["publications"], key=lambda x: x["citations"], reverse=True)[:5]:
            print(f"  {pub['citations']:6d} cites: {pub['title'][:60]}")

asyncio.run(main())

Proxy Rotation with ThorData

Effective proxy rotation for Scholar requires both rotating and sticky session support. Rotating for search queries, sticky for multi-page profile loads:

import threading
from urllib.parse import urlparse

class ThorDataScholarProxy:
    """
    Manages ThorData proxy sessions optimized for Google Scholar scraping.

    Scholar's rate limiting tracks by IP + cookie combination.
    Sticky sessions let you complete a full author profile on one IP,
    then rotate to a fresh IP for the next author.
    """

    def __init__(
        self,
        username: str,
        password: str,
        host: str = "proxy.thordata.com",
        port: int = 9000,
        country: str = "US",
    ):
        self.username = username
        self.password = password
        self.host = host
        self.port = port
        self.country = country
        self._sticky_id: Optional[str] = None
        self._sticky_created: float = 0
        self._lock = threading.Lock()
        self._request_count = 0
        self._error_count = 0

    def rotating_url(self) -> str:
        """New IP on every request."""
        return f"http://{self.username}-country-{self.country}:{self.password}@{self.host}:{self.port}"

    def sticky_url(self, session_minutes: int = 10) -> str:
        """Same IP for up to session_minutes."""
        with self._lock:
            now = time.time()
            if not self._sticky_id or (now - self._sticky_created) > session_minutes * 60:
                self._sticky_id = f"scholar{random.randint(10000, 99999)}"
                self._sticky_created = now
            return (
                f"http://{self.username}-country-{self.country}-"
                f"session-{self._sticky_id}:{self.password}@{self.host}:{self.port}"
            )

    def rotate(self):
        """Force new sticky session on next call."""
        with self._lock:
            self._sticky_id = None
        logger.info("Proxy session rotated")

    def record_request(self, success: bool):
        with self._lock:
            self._request_count += 1
            if not success:
                self._error_count += 1
                # Auto-rotate after 3 consecutive-ish errors
                if self._error_count >= 3:
                    self._sticky_id = None
                    self._error_count = 0
                    logger.info("Auto-rotated proxy after errors")
            else:
                self._error_count = max(0, self._error_count - 1)

# Usage
proxy = ThorDataScholarProxy(
    username=os.environ.get("THORDATA_USER", ""),
    password=os.environ.get("THORDATA_PASS", ""),
    country="US",
)

def scrape_author_with_rotation(author_id: str) -> Optional[Dict]:
    """Scrape one author, rotate proxy after completion."""
    scraper = ScholarScraper(
        proxy_url=proxy.sticky_url(session_minutes=8),
        request_delay=5.0,
    )
    try:
        result = scraper.get_author_profile(author_id)
        proxy.record_request(True)
        proxy.rotate()  # Fresh IP for next author
        return asdict(result) if result else None
    except Exception as e:
        proxy.record_request(False)
        logger.error(f"Author {author_id} failed: {e}")
        return None

Rate Limiting, Backoff, and CAPTCHA Handling

import hashlib
from datetime import datetime

class ScholarRateLimiter:
    """
    Adaptive rate limiter that responds to Scholar's throttling signals.
    Tracks per-IP success rates and adjusts timing accordingly.
    """

    def __init__(self, base_delay: float = 5.0):
        self.base_delay = base_delay
        self.current_delay = base_delay
        self._success_streak = 0
        self._failure_streak = 0

    def wait(self):
        jitter = random.gauss(0, 0.5)
        sleep_for = max(2.0, self.current_delay + jitter)
        logger.debug(f"Rate limiter sleeping {sleep_for:.1f}s")
        time.sleep(sleep_for)

    def on_success(self):
        self._success_streak += 1
        self._failure_streak = 0
        # Slowly decrease delay after sustained success
        if self._success_streak >= 10:
            self.current_delay = max(3.0, self.current_delay * 0.85)
            self._success_streak = 0

    def on_captcha(self):
        self._failure_streak += 1
        self._success_streak = 0
        self.current_delay = min(120.0, self.current_delay * 3.0)
        logger.warning(f"CAPTCHA hit - delay increased to {self.current_delay:.0f}s")

    def on_rate_limit(self, retry_after: int = 60):
        self._failure_streak += 1
        self.current_delay = min(120.0, self.current_delay * 2.5)
        logger.warning(f"Rate limited - sleeping {retry_after}s then resuming at {self.current_delay:.0f}s delay")
        time.sleep(retry_after)

class ScholarCaptchaDetector:
    """Detect and categorize Scholar anti-bot responses."""

    @staticmethod
    def classify(html: str, status_code: int) -> str:
        """Returns: 'ok', 'captcha', 'rate_limit', 'ip_ban', 'error'"""
        if status_code == 429:
            return "rate_limit"
        if status_code == 503:
            return "rate_limit"
        if status_code >= 500:
            return "error"

        lower = html.lower()
        if "our systems have detected unusual traffic" in lower:
            return "captcha"
        if "recaptcha" in lower or "g-recaptcha" in lower:
            return "captcha"
        if "sorry, we can't verify that you're not a robot" in lower:
            return "captcha"
        if "access to this page has been denied" in lower:
            return "ip_ban"

        # Valid Scholar page markers
        if "gsc_prf" in html or "gs_ri" in html or "scholar.google" in html:
            return "ok"

        return "unknown"

    @staticmethod
    def save_captcha_url(url: str, context: str = "") -> None:
        """Log CAPTCHA hits for retry queue."""
        with open("scholar_captcha_log.txt", "a") as f:
            f.write(f"{datetime.now().isoformat()}\t{url}\t{context}\n")

Complete Output Schemas with Examples

import json
from dataclasses import asdict

# Author profile output schema
author_example = {
    "author_id": "JicYPdAAAAAJ",
    "name": "Geoffrey Hinton",
    "affiliation": "Professor Emeritus, University of Toronto",
    "email_domain": "cs.toronto.edu",
    "interests": ["machine learning", "neural networks", "deep learning", "AI"],
    "total_citations": 752483,
    "citations_since_2019": 421892,
    "h_index": 185,
    "h_index_5yr": 122,
    "i10_index": 272,
    "i10_index_5yr": 200,
    "publications": [
        {
            "title": "ImageNet classification with deep convolutional neural networks",
            "authors": "A Krizhevsky, I Sutskever, GE Hinton",
            "venue": "Advances in neural information processing systems 25",
            "year": 2012,
            "citations": 128623,
            "citation_url": "https://scholar.google.com/scholar?cites=17322548362154064355",
            "pub_url": "https://scholar.google.com/citations?view_op=view_citation&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:u5HHmVD_uO8C",
            "cluster_id": "17322548362154064355",
        }
    ],
}

# Search result output schema
search_example = {
    "title": "Deep learning",
    "title_url": "https://www.nature.com/articles/nature14539",
    "snippet": "Deep learning allows computational models that are composed of multiple processing layers...",
    "authors_venue": "Y LeCun, Y Bengio, G Hinton - Nature, 2015",
    "citation_count": 78542,
    "citation_url": "https://scholar.google.com/scholar?cites=4816722523314893612",
}

# Citing papers output schema
citing_paper_example = {
    "title": "Attention Is All You Need",
    "authors": "A Vaswani, N Shazeer, N Parmar… - Advances in neural…, 2017",
    "snippet": "The dominant sequence transduction models are based on complex recurrent or convolutional neural...",
    "cite_count": 112840,
}

print(json.dumps(author_example, indent=2))

Real-World Use Cases with Code

Use Case 1: Citation Monitoring Dashboard

Track when your papers get cited:

import sqlite3
from datetime import datetime, timedelta

def monitor_paper_citations(
    cluster_ids: List[str],
    db_path: str = "citation_monitor.db",
    check_interval_hours: int = 24,
) -> List[Dict]:
    """Monitor citation counts for a list of papers."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS citation_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            cluster_id TEXT,
            citation_count INTEGER,
            checked_at TEXT,
            delta INTEGER DEFAULT 0
        )
    """)
    conn.commit()

    new_citations = []
    scraper = ScholarScraper(proxy_url=proxy.rotating_url())

    for cluster_id in cluster_ids:
        # Get current citation count
        resp = scraper.get("/scholar", params={"cites": cluster_id, "hl": "en"})
        soup = BeautifulSoup(resp.text, "lxml")

        # Extract "About N results" from the top
        result_stats = soup.find("div", id="gs_ab_md")
        count = 0
        if result_stats:
            match = re.search(r"About ([\d,]+) results", result_stats.get_text())
            if match:
                count = int(match.group(1).replace(",", ""))

        # Compare with last reading
        last = conn.execute(
            "SELECT citation_count FROM citation_history WHERE cluster_id=? ORDER BY checked_at DESC LIMIT 1",
            (cluster_id,)
        ).fetchone()

        delta = count - (last[0] if last else 0)
        conn.execute(
            "INSERT INTO citation_history (cluster_id, citation_count, checked_at, delta) VALUES (?,?,?,?)",
            (cluster_id, count, datetime.now().isoformat(), delta),
        )
        conn.commit()

        if delta > 0:
            new_citations.append({"cluster_id": cluster_id, "total": count, "new": delta})
            logger.info(f"Paper {cluster_id}: +{delta} new citations (total: {count})")

    conn.close()
    return new_citations

Use Case 2: Academic Collaboration Network

import json
from collections import defaultdict

def build_collaboration_network(
    author_ids: List[str],
    proxy_pool: ThorDataScholarProxy,
) -> Dict:
    """
    Build a co-authorship network from a list of Scholar author IDs.
    Returns a graph structure for visualization with D3.js or Gephi.
    """
    nodes = {}
    edges = defaultdict(int)

    for author_id in author_ids:
        scraper = ScholarScraper(proxy_url=proxy_pool.sticky_url())
        author = scraper.get_author_profile(author_id)
        if not author:
            continue

        nodes[author_id] = {
            "id": author_id,
            "name": author.name,
            "citations": author.total_citations,
            "h_index": author.h_index,
        }

        # Extract co-authors from publication metadata
        for pub in author.publications[:20]:
            coauthors = [a.strip() for a in pub.authors.split(",")]
            for coauthor in coauthors:
                if coauthor and coauthor != author.name:
                    edge_key = tuple(sorted([author.name, coauthor]))
                    edges[edge_key] += 1

        proxy_pool.rotate()
        time.sleep(random.uniform(8, 15))

    # Format as graph JSON
    graph = {
        "nodes": list(nodes.values()),
        "links": [
            {"source": src, "target": tgt, "weight": weight}
            for (src, tgt), weight in edges.items()
        ],
    }

    with open("collaboration_network.json", "w") as f:
        json.dump(graph, f, indent=2)

    return graph

Use Case 3: H-Index Benchmarking by Field

import pandas as pd
import statistics

def benchmark_hindex_by_field(
    field_author_map: Dict[str, List[str]],
    proxy_pool: ThorDataScholarProxy,
) -> pd.DataFrame:
    """
    Compute h-index distribution statistics by research field.
    Useful for understanding career benchmarks in different disciplines.
    """
    results = []

    for field, author_ids in field_author_map.items():
        hindices = []

        for author_id in author_ids:
            scraper = ScholarScraper(proxy_url=proxy_pool.sticky_url())
            author = scraper.get_author_profile(author_id)
            if author:
                hindices.append(author.h_index)
            proxy_pool.rotate()
            time.sleep(random.uniform(6, 12))

        if hindices:
            results.append({
                "field": field,
                "n_authors": len(hindices),
                "mean_hindex": statistics.mean(hindices),
                "median_hindex": statistics.median(hindices),
                "max_hindex": max(hindices),
                "min_hindex": min(hindices),
                "stdev": statistics.stdev(hindices) if len(hindices) > 1 else 0,
            })

    return pd.DataFrame(results).sort_values("median_hindex", ascending=False)

Use Case 4: Paper Recommendation Engine Input

def build_paper_similarity_dataset(
    seed_paper_cluster_ids: List[str],
    depth: int = 2,
    proxy_pool: ThorDataScholarProxy = None,
) -> List[Dict]:
    """
    Build a dataset of papers and their citations for a recommendation engine.
    BFS expansion from seed papers following citation links.
    """
    visited = set()
    queue = list(seed_paper_cluster_ids)
    papers = []

    for _ in range(depth):
        next_queue = []
        for cluster_id in queue:
            if cluster_id in visited:
                continue
            visited.add(cluster_id)

            proxy_url = proxy_pool.rotating_url() if proxy_pool else None
            scraper = ScholarScraper(proxy_url=proxy_url)

            citing = asyncio.run(scrape_citing_papers(cluster_id, proxy_url=proxy_url, max_pages=2))
            for paper in citing:
                papers.append({**paper, "cited_cluster": cluster_id})
                # Extract cluster IDs from citation URLs for next depth
                if paper.get("citation_url"):
                    match = re.search(r"cites=(\d+)", paper["citation_url"])
                    if match:
                        next_queue.append(match.group(1))

            time.sleep(random.uniform(5, 10))

        queue = next_queue

    return papers

Use Case 5: Institutional Research Output Tracker

def track_institution_output(
    institution_name: str,
    department: str,
    author_ids: List[str],
    year_range: tuple = (2020, 2026),
    proxy_pool: ThorDataScholarProxy = None,
) -> pd.DataFrame:
    """Aggregate research output metrics for an institution or department."""
    records = []

    for author_id in author_ids:
        proxy_url = proxy_pool.sticky_url() if proxy_pool else None
        scraper = ScholarScraper(proxy_url=proxy_url)
        author = scraper.get_author_profile(author_id)
        if not author:
            continue

        recent_pubs = [
            p for p in author.publications
            if p.year and year_range[0] <= p.year <= year_range[1]
        ]

        records.append({
            "author_id": author_id,
            "name": author.name,
            "institution": institution_name,
            "department": department,
            "total_citations": author.total_citations,
            "h_index": author.h_index,
            "publications_in_range": len(recent_pubs),
            "citations_in_range": sum(p.citations for p in recent_pubs),
            "top_paper": recent_pubs[0].title if recent_pubs else "",
        })

        if proxy_pool:
            proxy_pool.rotate()
        time.sleep(random.uniform(8, 15))

    return pd.DataFrame(records)

Use Case 6: Citation Velocity Analysis

def compute_citation_velocity(
    publications: List[ScholarPublication],
) -> pd.DataFrame:
    """
    Compute citation velocity (citations per year since publication)
    to identify papers with growing vs declining influence.
    """
    current_year = 2026
    records = []

    for pub in publications:
        if not pub.year or not pub.citations:
            continue

        years_since = max(1, current_year - pub.year)
        velocity = pub.citations / years_since

        records.append({
            "title": pub.title[:80],
            "year": pub.year,
            "total_citations": pub.citations,
            "years_old": years_since,
            "citations_per_year": round(velocity, 1),
        })

    df = pd.DataFrame(records).sort_values("citations_per_year", ascending=False)
    return df

Use Case 7: Systematic Literature Review Helper

def systematic_review_collector(
    search_queries: List[str],
    min_citations: int = 10,
    year_from: int = 2018,
    max_pages_per_query: int = 5,
    output_csv: str = "literature_review.csv",
    proxy_pool: ThorDataScholarProxy = None,
) -> pd.DataFrame:
    """
    Collect papers for a systematic literature review across multiple search queries.
    Deduplicates by title similarity and filters by citation threshold.
    """
    all_papers = []
    seen_titles = set()

    for query in search_queries:
        proxy_url = proxy_pool.rotating_url() if proxy_pool else None
        scraper = ScholarScraper(proxy_url=proxy_url, request_delay=6.0)

        for paper in scraper.search_papers(query, num_pages=max_pages_per_query):
            title = paper.get("title", "").lower().strip()
            if not title or title in seen_titles:
                continue

            # Filter by citation count
            if paper.get("citation_count", 0) < min_citations:
                continue

            # Extract year from authors/venue string
            av = paper.get("authors_venue", "")
            year_match = re.search(r"\b(20\d\d)\b", av)
            year = int(year_match.group(1)) if year_match else None

            if year and year < year_from:
                continue

            seen_titles.add(title)
            all_papers.append({
                **paper,
                "year": year,
                "query": query,
            })

        time.sleep(random.uniform(8, 15))

    df = pd.DataFrame(all_papers).sort_values("citation_count", ascending=False)
    df.to_csv(output_csv, index=False)
    logger.info(f"Saved {len(df)} papers to {output_csv}")
    return df

Practical Guidance for Scale

Volume expectations: With residential proxies and 5-6 second delays, expect 8-10 complete author profiles per hour. Plan accordingly for large research projects.

Caching is essential: Citation counts change slowly. Cache author profiles for 24 hours minimum. Use SQLite with a fetched_at timestamp and skip re-fetching recent data.

def get_cached_or_fetch(author_id: str, scraper: ScholarScraper, db: sqlite3.Connection, ttl_hours: int = 24) -> Optional[Dict]:
    cutoff = (datetime.now() - timedelta(hours=ttl_hours)).isoformat()
    row = db.execute(
        "SELECT data FROM author_cache WHERE author_id=? AND fetched_at > ?",
        (author_id, cutoff)
    ).fetchone()

    if row:
        return json.loads(row[0])

    author = scraper.get_author_profile(author_id)
    if author:
        db.execute(
            "INSERT OR REPLACE INTO author_cache (author_id, data, fetched_at) VALUES (?, ?, ?)",
            (author_id, json.dumps(asdict(author), default=str), datetime.now().isoformat())
        )
        db.commit()
    return asdict(author) if author else None

When Scholar fails completely: SerpAPI offers a Google Scholar endpoint that handles all the anti-bot complexity but charges per request. For high-volume production systems, the cost may be justified. For research projects with budget constraints, the proxy approach in this guide is substantially more economical.

Use BibTeX links for structured data: Scholar exposes BibTeX export for individual papers — cleaner than scraping HTML, and less likely to trigger detection since it's a lower-traffic endpoint. Worth using when you need bibliographic metadata rather than citation counts.

The fundamental reality hasn't changed: Google Scholar is one of the most bot-hostile targets on the internet, and that's unlikely to change. But with proper residential proxy rotation via ThorData, realistic browser behavior via Playwright, adaptive rate limiting, and aggressive caching, reliable automated access is achievable for legitimate research purposes.