Scraping Google Scholar Citations and Author Profiles with Python (2026)
Google Scholar is the world's most comprehensive academic search engine, indexing over 200 million scholarly documents across virtually every field of human knowledge. Yet for all its scope and value, Google Scholar provides no public API. Microsoft's Academic Knowledge API, the closest alternative, was retired at the end of 2021, and Google has never offered an official Scholar API at all. The result is that every researcher, bibliometrician, academic institution, or competitive intelligence analyst who needs machine-readable citation data faces the same fundamental problem: the only way to get it is to scrape it.
And scraping Google Scholar is genuinely hard. Not because the HTML is complex — Scholar pages are remarkably clean markup. The difficulty is Google's layered bot detection infrastructure, which sits in a different category from most websites. Google operates one of the most sophisticated anti-bot systems ever built, and they apply it to Scholar just as aggressively as to Search. After 5-10 requests from a datacenter IP, you'll see a CAPTCHA. Push harder and you'll see an IP ban lasting hours to days. The combination of request fingerprinting, behavioral analysis, cookie tracking, and IP reputation scoring means that naive approaches simply don't work.
This guide covers the full spectrum of what actually works in 2026: the scholarly library for simple cases, raw httpx requests for controlled scraping, Playwright for full browser automation, and the complete proxy rotation strategy that makes all of it viable at scale. Every code example is tested against real Scholar endpoints.
The use cases for this data are substantial: tracking citation counts for grant applications, computing h-indices and bibliometric scores for promotion decisions, building academic recommendation systems, monitoring when your papers get cited, competitive analysis of research groups, and building literature maps that visualize how ideas propagate through citation networks.
Rate of change: Google regularly updates its bot detection. Techniques that worked 6 months ago sometimes fail today. Always test your approach with small volumes before running large jobs. The fundamentals in this guide — proper proxy rotation, realistic request timing, session management — remain stable even as surface-level selectors change.
Why Residential Proxies Are Non-Negotiable
Before getting into code, let's be explicit about infrastructure requirements. Google Scholar cannot be scraped at any meaningful volume from:
- Raw datacenter IPs (AWS, GCP, Azure, DigitalOcean, etc.) — blocked within minutes
- VPN exit nodes — most have burned IP reputation from prior abuse
- Tor exit nodes — outright blocked
What actually works is residential proxy networks — IP addresses assigned to real home and mobile internet connections, belonging to real ISPs. Google's reputation scoring trusts these because they look like actual users. ThorData provides access to a residential proxy network covering 195+ countries with real ISP-assigned IPs. For Scholar specifically, US residential IPs perform best since Scholar's primary interface assumes US browsing behavior.
The economics: you pay per GB of proxy traffic. A single Scholar author profile page is roughly 50-80KB. At 10-15 requests per complete author profile (filling all publications), you're looking at roughly 1MB per author. Plan your proxy budget accordingly.
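That back-of-envelope math is easy to script. A sketch using the figures above (the per-page size and the $/GB rate are placeholder estimates — substitute your plan's actual pricing):

```python
# Rough proxy-bandwidth budget for a Scholar crawl, using the
# estimates above: ~65KB per page, ~12 requests per full author profile.
def estimate_proxy_cost(num_authors: int,
                        kb_per_page: float = 65.0,
                        requests_per_author: int = 12,
                        usd_per_gb: float = 3.0) -> dict:
    total_gb = num_authors * requests_per_author * kb_per_page / (1024 * 1024)
    return {
        "total_gb": round(total_gb, 2),
        "estimated_usd": round(total_gb * usd_per_gb, 2),
    }

print(estimate_proxy_cost(5000))
```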
Setup
pip install scholarly httpx beautifulsoup4 lxml playwright tenacity pandas
playwright install chromium
Environment setup:
export THORDATA_USER="your_username"
export THORDATA_PASS="your_password"
export SCHOLAR_PROXY="http://${THORDATA_USER}:${THORDATA_PASS}@proxy.thordata.com:9000"
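Before pointing anything at Scholar, it's worth validating that the proxy URL carries everything the scrapers below expect — credentials, host, and port. A minimal sketch (`validate_proxy_url` is illustrative, not part of any library):

```python
import os
from urllib.parse import urlparse

# Check that a proxy URL has a scheme, host, port, and embedded
# credentials - a malformed SCHOLAR_PROXY otherwise fails silently later.
def validate_proxy_url(proxy_url: str) -> bool:
    parsed = urlparse(proxy_url)
    return all([
        parsed.scheme in ("http", "https", "socks5"),
        parsed.hostname,
        parsed.port,
        parsed.username,
        parsed.password,
    ])

url = os.environ.get("SCHOLAR_PROXY", "")
print("SCHOLAR_PROXY looks valid:", bool(url) and validate_proxy_url(url))
```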
Understanding Google Scholar's URL Structure
Scholar's URL patterns are consistent and predictable:
# Author profile
https://scholar.google.com/citations?user={AUTHOR_ID}&hl=en
# Author profile with sort by citations
https://scholar.google.com/citations?user={AUTHOR_ID}&hl=en&sortby=citedby
# Search for author
https://scholar.google.com/citations?view_op=search_authors&mauthors={name}&hl=en
# Article search
https://scholar.google.com/scholar?q={query}&hl=en
# Papers citing a specific paper (by cluster ID)
https://scholar.google.com/scholar?cites={CLUSTER_ID}&hl=en
# All versions of a paper
https://scholar.google.com/scholar?cluster={CLUSTER_ID}&hl=en
# Publication detail (from author profile)
https://scholar.google.com/citations?view_op=view_citation&user={AUTHOR_ID}&citation_for_view={AUTHOR_ID}:{PUB_ID}
Author IDs are 12-character alphanumeric strings visible in profile URLs. Cluster IDs appear in citation links. Both are stable identifiers that persist across Scholar's indexing updates.
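Since both identifiers come back embedded in URLs, small extraction helpers save repetition downstream. A sketch (the helper names are illustrative):

```python
import re
from typing import Optional

# Pull the stable identifiers out of Scholar URLs: the 12-character
# author ID from profile links, the numeric cluster ID from
# "Cited by" / "All versions" links.
def extract_author_id(url: str) -> Optional[str]:
    match = re.search(r"[?&]user=([\w-]{12})", url)
    return match.group(1) if match else None

def extract_cluster_id(url: str) -> Optional[str]:
    match = re.search(r"[?&](?:cites|cluster)=(\d+)", url)
    return match.group(1) if match else None

print(extract_author_id("https://scholar.google.com/citations?user=JicYPdAAAAAJ&hl=en"))
print(extract_cluster_id("https://scholar.google.com/scholar?cites=17322548362154064355&hl=en"))
```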
The scholarly Library for Small-Scale Work
For up to a few hundred requests, scholarly provides the most convenient interface:
import os
import time
import random
import logging
from scholarly import scholarly, ProxyGenerator
from typing import Optional, List, Dict, Any
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")
logger = logging.getLogger(__name__)
def configure_scholarly_proxy(
proxy_url: Optional[str] = None,
    use_tor: bool = False,
) -> None:
"""Configure scholarly's proxy backend."""
pg = ProxyGenerator()
if proxy_url:
# Single proxy (use with sticky session URL for per-author consistency)
pg.SingleProxy(http=proxy_url, https=proxy_url)
scholarly.use_proxy(pg)
logger.info(f"Configured scholarly with proxy: {proxy_url[:40]}...")
elif use_tor:
# Tor browser must be running locally
pg.Tor_Internal(tor_cmd="tor")
scholarly.use_proxy(pg)
logger.info("Configured scholarly with Tor")
else:
logger.warning("No proxy configured - will hit CAPTCHA quickly on Scholar")
def get_author_by_id(author_id: str, fill: bool = True) -> Optional[Dict]:
"""Fetch a Scholar author by their ID."""
try:
author = scholarly.search_author_id(author_id)
if fill:
author = scholarly.fill(author, sections=["basics", "indices", "counts", "publications"])
return dict(author)
except Exception as e:
logger.error(f"Failed to fetch author {author_id}: {e}")
return None
def get_author_by_name(name: str, affiliation_filter: str = "") -> Optional[Dict]:
"""Search for an author by name with optional affiliation filter."""
try:
search_results = scholarly.search_author(name)
        for candidate in search_results:
            if affiliation_filter:
                aff = candidate.get("affiliation", "").lower()
                if affiliation_filter.lower() not in aff:
                    continue
            # Fill the first matching result
            return dict(scholarly.fill(candidate))
        # A for loop consumes the generator's StopIteration itself,
        # so "no results" simply means the loop body never matched
        logger.info(f"No matching author found for: {name}")
    except Exception as e:
        logger.error(f"Author search failed for {name}: {e}")
    return None
def get_publication_details(author: Dict, pub_index: int) -> Optional[Dict]:
"""Fill complete details for a specific publication."""
try:
pubs = author.get("publications", [])
if pub_index >= len(pubs):
return None
pub = scholarly.fill(pubs[pub_index])
return dict(pub)
except Exception as e:
logger.error(f"Failed to fill publication {pub_index}: {e}")
return None
# Complete author scraping example
proxy_url = os.environ.get("SCHOLAR_PROXY", "")
configure_scholarly_proxy(proxy_url=proxy_url if proxy_url else None)
# Geoffrey Hinton's Scholar ID
hinton = get_author_by_id("JicYPdAAAAAJ")
if hinton:
print(f"Name: {hinton['name']}")
print(f"Affiliation: {hinton.get('affiliation', 'N/A')}")
print(f"Citations (all time): {hinton.get('citedby', 0)}")
print(f"Citations (since 2019): {hinton.get('citedby5y', 0)}")
print(f"h-index: {hinton.get('hindex', 0)}")
print(f"h-index (5yr): {hinton.get('hindex5y', 0)}")
print(f"i10-index: {hinton.get('i10index', 0)}")
print(f"Publications: {len(hinton.get('publications', []))}")
Raw httpx Scraper for Fine-Grained Control
When you need precise control over headers, timing, and proxy rotation, bypass scholarly and hit Scholar directly:
# Standalone imports so this section runs without the scholarly block above
import os
import re
import time
import random
import logging
import httpx
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict, field
from typing import Iterator, Optional, List, Dict, Any
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

logger = logging.getLogger(__name__)
USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
@dataclass
class ScholarPublication:
title: str
authors: str
venue: str
year: Optional[int]
citations: int
citation_url: Optional[str]
pub_url: Optional[str]
cluster_id: Optional[str]
@dataclass
class ScholarAuthor:
author_id: str
name: str
affiliation: str
email_domain: str
interests: List[str]
total_citations: int
citations_since_2019: int
h_index: int
h_index_5yr: int
i10_index: int
i10_index_5yr: int
publications: List[ScholarPublication] = field(default_factory=list)
def build_scholar_headers(referer: Optional[str] = None) -> Dict[str, str]:
ua = random.choice(USER_AGENTS)
headers = {
"User-Agent": ua,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "same-origin" if referer else "none",
"Sec-Fetch-User": "?1",
}
if referer:
headers["Referer"] = referer
return headers
class ScholarScraper:
"""Direct HTTP scraper for Google Scholar with proxy rotation."""
BASE_URL = "https://scholar.google.com"
def __init__(self, proxy_url: Optional[str] = None, request_delay: float = 5.0):
self.proxy_url = proxy_url
self.delay = request_delay
self._cookies: Dict[str, str] = {}
def _make_client(self) -> httpx.Client:
kwargs = {
"headers": build_scholar_headers(),
"timeout": httpx.Timeout(30.0),
"follow_redirects": True,
}
if self.proxy_url:
kwargs["proxy"] = self.proxy_url
return httpx.Client(**kwargs)
def _is_captcha(self, html: str) -> bool:
signals = ["captcha", "unusual traffic", "i'm not a robot", "recaptcha"]
lower = html.lower()
return any(s in lower for s in signals)
@retry(
stop=stop_after_attempt(4),
wait=wait_exponential(multiplier=2, min=5, max=90),
retry=retry_if_exception_type((httpx.RequestError, httpx.HTTPStatusError)),
)
def get(self, path: str, params: Optional[Dict] = None) -> httpx.Response:
"""Make a GET request with retry logic."""
time.sleep(self.delay + random.uniform(-1, 2))
url = f"{self.BASE_URL}{path}"
referer = self.BASE_URL if path != "/" else None
with self._make_client() as client:
# Carry cookies from previous requests
for name, value in self._cookies.items():
client.cookies.set(name, value)
resp = client.get(url, params=params, headers=build_scholar_headers(referer))
# Store any new cookies
for name, value in resp.cookies.items():
self._cookies[name] = value
if resp.status_code == 429:
logger.warning("Rate limited by Scholar")
raise httpx.HTTPStatusError("Rate limited", request=resp.request, response=resp)
if self._is_captcha(resp.text):
logger.warning(f"CAPTCHA detected at {url}")
raise httpx.RequestError("CAPTCHA encountered")
resp.raise_for_status()
return resp
def get_author_profile(self, author_id: str) -> Optional[ScholarAuthor]:
"""Fetch and parse a complete author profile."""
resp = self.get("/citations", params={"user": author_id, "hl": "en"})
return self._parse_author_page(resp.text, author_id)
def _parse_author_page(self, html: str, author_id: str) -> Optional[ScholarAuthor]:
soup = BeautifulSoup(html, "lxml")
# Author name
name_el = soup.find("div", id="gsc_prf_in")
name = name_el.get_text(strip=True) if name_el else "Unknown"
# Affiliation
aff_el = soup.find("div", class_="gsc_prf_il")
affiliation = aff_el.get_text(strip=True) if aff_el else ""
# Email domain
email_el = soup.find("div", id="gsc_prf_ivh")
email_domain = ""
if email_el:
match = re.search(r"Verified email at (\S+)", email_el.get_text())
if match:
email_domain = match.group(1).rstrip("·").strip()
# Research interests
interests = []
for interest_link in soup.select("#gsc_prf_int a"):
interests.append(interest_link.get_text(strip=True))
# Citation statistics from the stats table
stats = {}
stats_table = soup.find("table", id="gsc_rsb_st")
if stats_table:
rows = stats_table.find_all("tr")
for row in rows[1:]: # Skip header
cells = row.find_all("td")
if len(cells) >= 3:
metric = cells[0].get_text(strip=True)
all_time = cells[1].get_text(strip=True)
since_2019 = cells[2].get_text(strip=True)
stats[metric] = {"all": self._safe_int(all_time), "since_2019": self._safe_int(since_2019)}
# Parse publications from the table
publications = []
for pub_row in soup.select("#gsc_a_b tr.gsc_a_tr"):
pub = self._parse_publication_row(pub_row)
if pub:
publications.append(pub)
return ScholarAuthor(
author_id=author_id,
name=name,
affiliation=affiliation,
email_domain=email_domain,
interests=interests,
total_citations=stats.get("Citations", {}).get("all", 0),
citations_since_2019=stats.get("Citations", {}).get("since_2019", 0),
h_index=stats.get("h-index", {}).get("all", 0),
h_index_5yr=stats.get("h-index", {}).get("since_2019", 0),
i10_index=stats.get("i10-index", {}).get("all", 0),
i10_index_5yr=stats.get("i10-index", {}).get("since_2019", 0),
publications=publications,
)
def _parse_publication_row(self, row: BeautifulSoup) -> Optional[ScholarPublication]:
"""Parse a single publication row from author profile."""
try:
title_el = row.select_one(".gsc_a_at")
title = title_el.get_text(strip=True) if title_el else ""
pub_url = title_el.get("href", "") if title_el else ""
if pub_url and not pub_url.startswith("http"):
pub_url = "https://scholar.google.com" + pub_url
            metas = [m.get_text(strip=True) for m in row.select(".gsc_a_t .gs_gray")]
            authors = metas[0] if metas else ""
            venue = metas[1] if len(metas) > 1 else ""
cite_el = row.select_one(".gsc_a_c a")
citations = self._safe_int(cite_el.get_text(strip=True) if cite_el else "0")
citation_url = cite_el.get("href", "") if cite_el else ""
if citation_url and not citation_url.startswith("http"):
citation_url = "https://scholar.google.com" + citation_url
year_el = row.select_one(".gsc_a_y span")
year = self._safe_int(year_el.get_text(strip=True) if year_el else "")
return ScholarPublication(
title=title,
authors=authors,
venue=venue,
year=year,
citations=citations,
citation_url=citation_url,
pub_url=pub_url,
cluster_id=self._extract_cluster_id(citation_url),
)
except Exception as e:
logger.warning(f"Failed parsing publication row: {e}")
return None
def _extract_cluster_id(self, url: str) -> Optional[str]:
if not url:
return None
match = re.search(r"cites=(\d+)", url)
return match.group(1) if match else None
def _safe_int(self, text: str) -> int:
try:
return int(re.sub(r"[^\d]", "", text))
except (ValueError, TypeError):
return 0
def search_papers(self, query: str, num_pages: int = 3) -> Iterator[Dict]:
"""Search Scholar for papers matching a query."""
for page in range(num_pages):
start = page * 10
resp = self.get("/scholar", params={"q": query, "hl": "en", "start": start})
soup = BeautifulSoup(resp.text, "lxml")
results = soup.select(".gs_ri")
if not results:
break
for result in results:
yield self._parse_search_result(result)
            # Check for a next-page control; Scholar has rendered this as
            # both a link and a button over time, so accept either
            next_ctl = soup.find(["button", "a"], attrs={"aria-label": re.compile(r"Next", re.I)})
            if not next_ctl:
                break
def _parse_search_result(self, element: BeautifulSoup) -> Dict:
title_el = element.select_one(".gs_rt")
snippet_el = element.select_one(".gs_rs")
meta_el = element.select_one(".gs_fl")
authors_venue_el = element.select_one(".gs_a")
cite_count = 0
cite_url = ""
if meta_el:
cite_link = meta_el.find("a", string=re.compile(r"Cited by"))
if cite_link:
match = re.search(r"(\d+)", cite_link.get_text())
if match:
cite_count = int(match.group(1))
cite_url = "https://scholar.google.com" + cite_link.get("href", "")
return {
"title": title_el.get_text(strip=True) if title_el else "",
"title_url": (title_el.find("a") or {}).get("href", "") if title_el else "",
"snippet": snippet_el.get_text(strip=True) if snippet_el else "",
"authors_venue": authors_venue_el.get_text(strip=True) if authors_venue_el else "",
"citation_count": cite_count,
"citation_url": cite_url,
}
# Usage
scraper = ScholarScraper(
proxy_url=os.environ.get("SCHOLAR_PROXY", ""),
request_delay=5.0,
)
author = scraper.get_author_profile("JicYPdAAAAAJ")
if author:
print(f"{author.name}: {author.total_citations} citations, h-index {author.h_index}")
print(f"Top paper: {author.publications[0].title if author.publications else 'N/A'}")
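With publication-level citation counts in hand, you can recompute the bibliometrics locally and cross-check them against the profile's stats table. A sketch of the standard definitions:

```python
from typing import List

# h-index: the largest h such that at least h publications
# have at least h citations each.
def compute_h_index(citation_counts: List[int]) -> int:
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# i10-index: number of publications with at least 10 citations.
def compute_i10_index(citation_counts: List[int]) -> int:
    return sum(1 for cites in citation_counts if cites >= 10)

print(compute_h_index([100, 50, 30, 8, 3]))    # 4
print(compute_i10_index([100, 50, 30, 8, 3]))  # 3
```

Discrepancies between your local numbers and the stats table usually mean the publication list wasn't fully expanded (see the Playwright "Show more" loop below).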
Playwright for JavaScript-Rendered Pages
Scholar's author profiles sometimes trigger JS challenges or require interaction (like "Show more" for full publication lists). Playwright handles both:
import asyncio
from playwright.async_api import async_playwright, Page, Browser, BrowserContext
from typing import List, AsyncIterator
async def launch_stealth_browser(proxy_url: Optional[str] = None) -> Browser:
"""Launch Chromium with anti-detection measures."""
    # NOTE: fine for one-shot scripts; long-lived apps should also call playwright.stop()
    playwright = await async_playwright().start()
args = [
"--disable-blink-features=AutomationControlled",
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
"--disable-accelerated-2d-canvas",
"--no-first-run",
"--no-zygote",
"--disable-gpu",
"--window-size=1920,1080",
"--lang=en-US,en",
]
kwargs: Dict[str, Any] = {"headless": True, "args": args}
if proxy_url:
from urllib.parse import urlparse
parsed = urlparse(proxy_url)
kwargs["proxy"] = {
"server": f"http://{parsed.hostname}:{parsed.port}",
"username": parsed.username or "",
"password": parsed.password or "",
}
return await playwright.chromium.launch(**kwargs)
async def make_stealth_context(browser: Browser) -> BrowserContext:
"""Create browser context with realistic fingerprint."""
context = await browser.new_context(
viewport={"width": 1920, "height": 1080},
locale="en-US",
timezone_id="America/New_York",
user_agent=random.choice(USER_AGENTS),
extra_http_headers={
"Accept-Language": "en-US,en;q=0.9",
},
)
# Patch automation markers
await context.add_init_script("""
// Hide webdriver
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
// Mock plugins (real browsers have them)
Object.defineProperty(navigator, 'plugins', {
get: () => {
return {
length: 3,
0: { name: 'Chrome PDF Plugin' },
1: { name: 'Chrome PDF Viewer' },
2: { name: 'Native Client' },
};
}
});
// Realistic language list
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en']
});
// Chrome object (missing in some headless configs)
window.chrome = {
runtime: {},
loadTimes: function() {},
csi: function() {},
app: {},
};
// Remove headless hardware concurrency hint
Object.defineProperty(navigator, 'hardwareConcurrency', { get: () => 8 });
""")
return context
async def scrape_full_author_profile(
author_id: str,
proxy_url: Optional[str] = None,
) -> Optional[Dict]:
"""
Scrape a complete author profile including all publications
using full browser automation.
"""
browser = await launch_stealth_browser(proxy_url)
try:
context = await make_stealth_context(browser)
page = await context.new_page()
page.set_default_timeout(30000)
# Navigate to author profile
url = f"https://scholar.google.com/citations?user={author_id}&hl=en"
await page.goto(url, wait_until="networkidle")
await asyncio.sleep(random.uniform(2, 4))
# Handle CAPTCHA if present
captcha = await page.query_selector("form#captcha-form, .captcha-container")
if captcha:
logger.warning(f"CAPTCHA on author profile {author_id}")
return None
# Extract basic stats
author_data = await page.evaluate("""
() => {
const stats = {};
// Name and affiliation
const nameEl = document.querySelector('#gsc_prf_in');
stats.name = nameEl ? nameEl.textContent.trim() : '';
const affEl = document.querySelector('.gsc_prf_il');
stats.affiliation = affEl ? affEl.textContent.trim() : '';
// Citation stats table
const statTable = document.querySelector('#gsc_rsb_st');
if (statTable) {
const rows = statTable.querySelectorAll('tr');
rows.forEach(row => {
const cells = row.querySelectorAll('td');
if (cells.length >= 3) {
const metric = cells[0].textContent.trim();
stats[metric + '_all'] = cells[1].textContent.trim();
stats[metric + '_5yr'] = cells[2].textContent.trim();
}
});
}
// Interests
stats.interests = Array.from(
document.querySelectorAll('#gsc_prf_int a')
).map(a => a.textContent.trim());
return stats;
}
""")
# Load ALL publications by clicking "Show more" button
all_pubs = []
while True:
show_more = await page.query_selector("#gsc_bpf_more:not([disabled])")
if not show_more:
break
await show_more.click()
await asyncio.sleep(random.uniform(1.5, 3.0))
# Extract all publication rows
pub_rows = await page.query_selector_all("#gsc_a_b tr.gsc_a_tr")
for row in pub_rows:
title_el = await row.query_selector(".gsc_a_at")
cite_el = await row.query_selector(".gsc_a_c a")
year_el = await row.query_selector(".gsc_a_y span")
title = await title_el.inner_text() if title_el else ""
pub_href = await title_el.get_attribute("href") if title_el else ""
cite_text = await cite_el.inner_text() if cite_el else "0"
cite_href = await cite_el.get_attribute("href") if cite_el else ""
year_text = await year_el.inner_text() if year_el else ""
all_pubs.append({
"title": title.strip(),
"pub_url": f"https://scholar.google.com{pub_href}" if pub_href else "",
"citations": int(re.sub(r"\D", "", cite_text) or "0"),
"citation_url": f"https://scholar.google.com{cite_href}" if cite_href else "",
"year": int(year_text) if year_text.isdigit() else None,
})
author_data["publications"] = all_pubs
author_data["author_id"] = author_id
return author_data
finally:
await browser.close()
async def scrape_citing_papers(
cluster_id: str,
proxy_url: Optional[str] = None,
max_pages: int = 5,
) -> List[Dict]:
"""Scrape all papers that cite a specific paper (by cluster ID)."""
browser = await launch_stealth_browser(proxy_url)
all_papers = []
try:
context = await make_stealth_context(browser)
page = await context.new_page()
for page_num in range(max_pages):
start = page_num * 10
url = f"https://scholar.google.com/scholar?cites={cluster_id}&hl=en&start={start}"
await page.goto(url, wait_until="networkidle")
await asyncio.sleep(random.uniform(3, 6))
# Check for CAPTCHA
if await page.query_selector("form#captcha-form"):
logger.warning(f"CAPTCHA on page {page_num}")
break
            papers = await page.evaluate(r"""
() => {
return Array.from(document.querySelectorAll('.gs_ri')).map(el => ({
title: el.querySelector('.gs_rt')?.textContent?.trim() || '',
authors: el.querySelector('.gs_a')?.textContent?.trim() || '',
snippet: el.querySelector('.gs_rs')?.textContent?.trim() || '',
cite_count: (() => {
const citeLink = el.querySelector('.gs_fl a');
if (!citeLink) return 0;
const match = citeLink.textContent.match(/\d+/);
return match ? parseInt(match[0]) : 0;
})(),
}));
}
""")
if not papers:
break
all_papers.extend(papers)
logger.info(f"Citing papers page {page_num + 1}: {len(papers)} results")
finally:
await browser.close()
return all_papers
# Run async scrapers
async def main():
proxy = os.environ.get("SCHOLAR_PROXY", "")
profile = await scrape_full_author_profile("JicYPdAAAAAJ", proxy_url=proxy or None)
if profile:
print(f"Scraped {len(profile.get('publications', []))} publications")
for pub in sorted(profile["publications"], key=lambda x: x["citations"], reverse=True)[:5]:
print(f" {pub['citations']:6d} cites: {pub['title'][:60]}")
asyncio.run(main())
Proxy Rotation with ThorData
Effective proxy rotation for Scholar requires both rotating and sticky session support. Rotating for search queries, sticky for multi-page profile loads:
import threading
from urllib.parse import urlparse
class ThorDataScholarProxy:
"""
Manages ThorData proxy sessions optimized for Google Scholar scraping.
Scholar's rate limiting tracks by IP + cookie combination.
Sticky sessions let you complete a full author profile on one IP,
then rotate to a fresh IP for the next author.
"""
def __init__(
self,
username: str,
password: str,
host: str = "proxy.thordata.com",
port: int = 9000,
country: str = "US",
):
self.username = username
self.password = password
self.host = host
self.port = port
self.country = country
self._sticky_id: Optional[str] = None
self._sticky_created: float = 0
self._lock = threading.Lock()
self._request_count = 0
self._error_count = 0
def rotating_url(self) -> str:
"""New IP on every request."""
return f"http://{self.username}-country-{self.country}:{self.password}@{self.host}:{self.port}"
def sticky_url(self, session_minutes: int = 10) -> str:
"""Same IP for up to session_minutes."""
with self._lock:
now = time.time()
if not self._sticky_id or (now - self._sticky_created) > session_minutes * 60:
self._sticky_id = f"scholar{random.randint(10000, 99999)}"
self._sticky_created = now
return (
f"http://{self.username}-country-{self.country}-"
f"session-{self._sticky_id}:{self.password}@{self.host}:{self.port}"
)
def rotate(self):
"""Force new sticky session on next call."""
with self._lock:
self._sticky_id = None
logger.info("Proxy session rotated")
def record_request(self, success: bool):
with self._lock:
self._request_count += 1
if not success:
self._error_count += 1
                # Auto-rotate after ~3 net errors (successes decay the count)
if self._error_count >= 3:
self._sticky_id = None
self._error_count = 0
logger.info("Auto-rotated proxy after errors")
else:
self._error_count = max(0, self._error_count - 1)
# Usage
proxy = ThorDataScholarProxy(
username=os.environ.get("THORDATA_USER", ""),
password=os.environ.get("THORDATA_PASS", ""),
country="US",
)
def scrape_author_with_rotation(author_id: str) -> Optional[Dict]:
"""Scrape one author, rotate proxy after completion."""
scraper = ScholarScraper(
proxy_url=proxy.sticky_url(session_minutes=8),
request_delay=5.0,
)
try:
result = scraper.get_author_profile(author_id)
proxy.record_request(True)
proxy.rotate() # Fresh IP for next author
return asdict(result) if result else None
except Exception as e:
proxy.record_request(False)
logger.error(f"Author {author_id} failed: {e}")
return None
Rate Limiting, Backoff, and CAPTCHA Handling
import hashlib
from datetime import datetime
class ScholarRateLimiter:
"""
Adaptive rate limiter that responds to Scholar's throttling signals.
Tracks per-IP success rates and adjusts timing accordingly.
"""
def __init__(self, base_delay: float = 5.0):
self.base_delay = base_delay
self.current_delay = base_delay
self._success_streak = 0
self._failure_streak = 0
def wait(self):
jitter = random.gauss(0, 0.5)
sleep_for = max(2.0, self.current_delay + jitter)
logger.debug(f"Rate limiter sleeping {sleep_for:.1f}s")
time.sleep(sleep_for)
def on_success(self):
self._success_streak += 1
self._failure_streak = 0
# Slowly decrease delay after sustained success
if self._success_streak >= 10:
self.current_delay = max(3.0, self.current_delay * 0.85)
self._success_streak = 0
def on_captcha(self):
self._failure_streak += 1
self._success_streak = 0
self.current_delay = min(120.0, self.current_delay * 3.0)
logger.warning(f"CAPTCHA hit - delay increased to {self.current_delay:.0f}s")
def on_rate_limit(self, retry_after: int = 60):
self._failure_streak += 1
self.current_delay = min(120.0, self.current_delay * 2.5)
logger.warning(f"Rate limited - sleeping {retry_after}s then resuming at {self.current_delay:.0f}s delay")
time.sleep(retry_after)
class ScholarCaptchaDetector:
"""Detect and categorize Scholar anti-bot responses."""
@staticmethod
def classify(html: str, status_code: int) -> str:
"""Returns: 'ok', 'captcha', 'rate_limit', 'ip_ban', 'error'"""
if status_code == 429:
return "rate_limit"
if status_code == 503:
return "rate_limit"
if status_code >= 500:
return "error"
lower = html.lower()
if "our systems have detected unusual traffic" in lower:
return "captcha"
if "recaptcha" in lower or "g-recaptcha" in lower:
return "captcha"
if "sorry, we can't verify that you're not a robot" in lower:
return "captcha"
if "access to this page has been denied" in lower:
return "ip_ban"
# Valid Scholar page markers
if "gsc_prf" in html or "gs_ri" in html or "scholar.google" in html:
return "ok"
return "unknown"
@staticmethod
def save_captcha_url(url: str, context: str = "") -> None:
"""Log CAPTCHA hits for retry queue."""
with open("scholar_captcha_log.txt", "a") as f:
f.write(f"{datetime.now().isoformat()}\t{url}\t{context}\n")
Complete Output Schemas with Examples
import json
from dataclasses import asdict
# Author profile output schema
author_example = {
"author_id": "JicYPdAAAAAJ",
"name": "Geoffrey Hinton",
"affiliation": "Professor Emeritus, University of Toronto",
"email_domain": "cs.toronto.edu",
"interests": ["machine learning", "neural networks", "deep learning", "AI"],
"total_citations": 752483,
"citations_since_2019": 421892,
"h_index": 185,
"h_index_5yr": 122,
"i10_index": 272,
"i10_index_5yr": 200,
"publications": [
{
"title": "ImageNet classification with deep convolutional neural networks",
"authors": "A Krizhevsky, I Sutskever, GE Hinton",
"venue": "Advances in neural information processing systems 25",
"year": 2012,
"citations": 128623,
"citation_url": "https://scholar.google.com/scholar?cites=17322548362154064355",
"pub_url": "https://scholar.google.com/citations?view_op=view_citation&user=JicYPdAAAAAJ&citation_for_view=JicYPdAAAAAJ:u5HHmVD_uO8C",
"cluster_id": "17322548362154064355",
}
],
}
# Search result output schema
search_example = {
"title": "Deep learning",
"title_url": "https://www.nature.com/articles/nature14539",
"snippet": "Deep learning allows computational models that are composed of multiple processing layers...",
"authors_venue": "Y LeCun, Y Bengio, G Hinton - Nature, 2015",
"citation_count": 78542,
"citation_url": "https://scholar.google.com/scholar?cites=4816722523314893612",
}
# Citing papers output schema
citing_paper_example = {
"title": "Attention Is All You Need",
"authors": "A Vaswani, N Shazeer, N Parmar… - Advances in neural…, 2017",
"snippet": "The dominant sequence transduction models are based on complex recurrent or convolutional neural...",
"cite_count": 112840,
}
print(json.dumps(author_example, indent=2))
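Since pandas is already in the install list, flattening the nested publications array into a DataFrame is a natural next step for analysis. A sketch (the two-row sample mirrors the schema above):

```python
import pandas as pd

# Flatten the author schema's publication list into a tabular form
# for sorting, filtering, and aggregation.
author = {
    "author_id": "JicYPdAAAAAJ",
    "name": "Geoffrey Hinton",
    "publications": [
        {"title": "Paper A", "year": 2012, "citations": 128623},
        {"title": "Paper B", "year": 2015, "citations": 78542},
    ],
}

df = pd.DataFrame(author["publications"])
df["author_id"] = author["author_id"]
df = df.sort_values("citations", ascending=False).reset_index(drop=True)
print(df.head())
```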
Real-World Use Cases with Code
Use Case 1: Citation Monitoring Dashboard
Track when your papers get cited:
import sqlite3
from datetime import datetime
def monitor_paper_citations(
cluster_ids: List[str],
db_path: str = "citation_monitor.db",
check_interval_hours: int = 24,
) -> List[Dict]:
"""Monitor citation counts for a list of papers."""
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS citation_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
cluster_id TEXT,
citation_count INTEGER,
checked_at TEXT,
delta INTEGER DEFAULT 0
)
""")
conn.commit()
new_citations = []
scraper = ScholarScraper(proxy_url=proxy.rotating_url())
for cluster_id in cluster_ids:
# Get current citation count
resp = scraper.get("/scholar", params={"cites": cluster_id, "hl": "en"})
soup = BeautifulSoup(resp.text, "lxml")
# Extract "About N results" from the top
result_stats = soup.find("div", id="gs_ab_md")
count = 0
if result_stats:
match = re.search(r"About ([\d,]+) results", result_stats.get_text())
if match:
count = int(match.group(1).replace(",", ""))
# Compare with last reading
last = conn.execute(
"SELECT citation_count FROM citation_history WHERE cluster_id=? ORDER BY checked_at DESC LIMIT 1",
(cluster_id,)
).fetchone()
delta = count - (last[0] if last else 0)
conn.execute(
"INSERT INTO citation_history (cluster_id, citation_count, checked_at, delta) VALUES (?,?,?,?)",
(cluster_id, count, datetime.now().isoformat(), delta),
)
conn.commit()
if delta > 0:
new_citations.append({"cluster_id": cluster_id, "total": count, "new": delta})
logger.info(f"Paper {cluster_id}: +{delta} new citations (total: {count})")
conn.close()
return new_citations
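The history-table bookkeeping above is independent of scraping, so it can be sanity-checked against an in-memory database before pointing it at live data (same table name and columns as the function; `record_count` is a hypothetical helper isolating just the delta logic):

```python
import sqlite3
from datetime import datetime

def record_count(conn: sqlite3.Connection, cluster_id: str, count: int) -> int:
    """Insert a reading and return the delta versus the previous reading."""
    last = conn.execute(
        "SELECT citation_count FROM citation_history "
        "WHERE cluster_id=? ORDER BY checked_at DESC LIMIT 1",
        (cluster_id,),
    ).fetchone()
    delta = count - (last[0] if last else 0)
    conn.execute(
        "INSERT INTO citation_history (cluster_id, citation_count, checked_at, delta) "
        "VALUES (?,?,?,?)",
        (cluster_id, count, datetime.now().isoformat(), delta),
    )
    conn.commit()
    return delta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE citation_history (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "cluster_id TEXT, citation_count INTEGER, checked_at TEXT, delta INTEGER DEFAULT 0)"
)
print(record_count(conn, "4816722523314893612", 112840))  # 112840 (first reading)
print(record_count(conn, "4816722523314893612", 112895))  # 55
```

Note that the first reading reports the full count as new; seed the table with a baseline row if you only want deltas going forward.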
Use Case 2: Academic Collaboration Network
import json
from collections import defaultdict
def build_collaboration_network(
author_ids: List[str],
proxy_pool: ThorDataScholarProxy,
) -> Dict:
"""
Build a co-authorship network from a list of Scholar author IDs.
Returns a graph structure for visualization with D3.js or Gephi.
"""
nodes = {}
edges = defaultdict(int)
for author_id in author_ids:
scraper = ScholarScraper(proxy_url=proxy_pool.sticky_url())
author = scraper.get_author_profile(author_id)
if not author:
continue
nodes[author_id] = {
"id": author_id,
"name": author.name,
"citations": author.total_citations,
"h_index": author.h_index,
}
# Extract co-authors from publication metadata
for pub in author.publications[:20]:
coauthors = [a.strip() for a in pub.authors.split(",")]
for coauthor in coauthors:
if coauthor and coauthor != author.name:
edge_key = tuple(sorted([author.name, coauthor]))
edges[edge_key] += 1
proxy_pool.rotate()
time.sleep(random.uniform(8, 15))
# Format as graph JSON
graph = {
"nodes": list(nodes.values()),
"links": [
{"source": src, "target": tgt, "weight": weight}
for (src, tgt), weight in edges.items()
],
}
with open("collaboration_network.json", "w") as f:
json.dump(graph, f, indent=2)
return graph
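The edge-counting step — undirected edges, weighted by repeat collaborations — operates on plain author strings and can be exercised without any scraping. A minimal extraction of that logic (`count_coauthor_edges` is illustrative, not part of the scraper):

```python
from collections import defaultdict

def count_coauthor_edges(author_name: str, pub_author_strings: list) -> dict:
    """Turn 'A, B, C' author strings into weighted undirected edges from author_name."""
    edges = defaultdict(int)
    for authors in pub_author_strings:
        for coauthor in (a.strip() for a in authors.split(",")):
            if coauthor and coauthor != author_name:
                # Sorted tuple so (A, B) and (B, A) collapse to one edge
                edges[tuple(sorted([author_name, coauthor]))] += 1
    return dict(edges)

edges = count_coauthor_edges(
    "A Vaswani",
    ["A Vaswani, N Shazeer", "A Vaswani, N Shazeer, N Parmar"],
)
print(edges)  # {('A Vaswani', 'N Shazeer'): 2, ('A Vaswani', 'N Parmar'): 1}
```

One caveat worth knowing: Scholar abbreviates author names ("A Vaswani"), so distinct people with the same initial and surname will merge into one node unless you disambiguate by author ID.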
Use Case 3: H-Index Benchmarking by Field
import pandas as pd
import statistics
def benchmark_hindex_by_field(
field_author_map: Dict[str, List[str]],
proxy_pool: ThorDataScholarProxy,
) -> pd.DataFrame:
"""
Compute h-index distribution statistics by research field.
Useful for understanding career benchmarks in different disciplines.
"""
results = []
for field, author_ids in field_author_map.items():
hindices = []
for author_id in author_ids:
scraper = ScholarScraper(proxy_url=proxy_pool.sticky_url())
author = scraper.get_author_profile(author_id)
if author:
hindices.append(author.h_index)
proxy_pool.rotate()
time.sleep(random.uniform(6, 12))
if hindices:
results.append({
"field": field,
"n_authors": len(hindices),
"mean_hindex": statistics.mean(hindices),
"median_hindex": statistics.median(hindices),
"max_hindex": max(hindices),
"min_hindex": min(hindices),
"stdev": statistics.stdev(hindices) if len(hindices) > 1 else 0,
})
return pd.DataFrame(results).sort_values("median_hindex", ascending=False)
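When a profile-level h-index is missing or you want to compute it over a filtered publication set (e.g. only recent papers), the statistic is easy to recompute from per-paper citation counts using the standard definition — the largest h such that h papers each have at least h citations:

```python
def h_index(citation_counts: list) -> int:
    """Largest h such that h papers each have >= h citations."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # paper at this rank still clears the bar
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4
```

This matches Scholar's own h-index when fed the full publication list, which makes it a useful cross-check that your publication scraping paginated all the way to the end.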
Use Case 4: Paper Recommendation Engine Input
def build_paper_similarity_dataset(
seed_paper_cluster_ids: List[str],
depth: int = 2,
proxy_pool: ThorDataScholarProxy = None,
) -> List[Dict]:
"""
Build a dataset of papers and their citations for a recommendation engine.
BFS expansion from seed papers following citation links.
"""
visited = set()
queue = list(seed_paper_cluster_ids)
papers = []
for _ in range(depth):
next_queue = []
for cluster_id in queue:
if cluster_id in visited:
continue
visited.add(cluster_id)
proxy_url = proxy_pool.rotating_url() if proxy_pool else None
citing = asyncio.run(scrape_citing_papers(cluster_id, proxy_url=proxy_url, max_pages=2))
for paper in citing:
papers.append({**paper, "cited_cluster": cluster_id})
# Extract cluster IDs from citation URLs for next depth
if paper.get("citation_url"):
match = re.search(r"cites=(\d+)", paper["citation_url"])
if match:
next_queue.append(match.group(1))
time.sleep(random.uniform(5, 10))
queue = next_queue
return papers
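The cluster-ID extraction that drives the BFS expansion deserves its own test, since the whole traversal silently stalls if the regex stops matching the URL shape Scholar uses for "Cited by" links:

```python
import re

def extract_cluster_id(citation_url: str):
    """Pull the numeric cluster ID out of a 'cites=' citation URL."""
    match = re.search(r"cites=(\d+)", citation_url)
    return match.group(1) if match else None

print(extract_cluster_id(
    "https://scholar.google.com/scholar?cites=4816722523314893612"
))  # 4816722523314893612
```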
Use Case 5: Institutional Research Output Tracker
def track_institution_output(
institution_name: str,
department: str,
author_ids: List[str],
year_range: tuple = (2020, 2026),
proxy_pool: ThorDataScholarProxy = None,
) -> pd.DataFrame:
"""Aggregate research output metrics for an institution or department."""
records = []
for author_id in author_ids:
proxy_url = proxy_pool.sticky_url() if proxy_pool else None
scraper = ScholarScraper(proxy_url=proxy_url)
author = scraper.get_author_profile(author_id)
if not author:
continue
recent_pubs = [
p for p in author.publications
if p.year and year_range[0] <= p.year <= year_range[1]
]
records.append({
"author_id": author_id,
"name": author.name,
"institution": institution_name,
"department": department,
"total_citations": author.total_citations,
"h_index": author.h_index,
"publications_in_range": len(recent_pubs),
"citations_in_range": sum(p.citations for p in recent_pubs),
"top_paper": recent_pubs[0].title if recent_pubs else "",
})
if proxy_pool:
proxy_pool.rotate()
time.sleep(random.uniform(8, 15))
return pd.DataFrame(records)
Use Case 6: Citation Velocity Analysis
def compute_citation_velocity(
publications: List[ScholarPublication],
) -> pd.DataFrame:
"""
Compute citation velocity (citations per year since publication)
to identify papers with growing vs declining influence.
"""
current_year = datetime.now().year
records = []
for pub in publications:
if not pub.year or not pub.citations:
continue
years_since = max(1, current_year - pub.year)
velocity = pub.citations / years_since
records.append({
"title": pub.title[:80],
"year": pub.year,
"total_citations": pub.citations,
"years_old": years_since,
"citations_per_year": round(velocity, 1),
})
df = pd.DataFrame(records)
if df.empty:  # no dated, cited papers to rank
    return df
return df.sort_values("citations_per_year", ascending=False)
Use Case 7: Systematic Literature Review Helper
def systematic_review_collector(
search_queries: List[str],
min_citations: int = 10,
year_from: int = 2018,
max_pages_per_query: int = 5,
output_csv: str = "literature_review.csv",
proxy_pool: ThorDataScholarProxy = None,
) -> pd.DataFrame:
"""
Collect papers for a systematic literature review across multiple search queries.
Deduplicates by normalized exact title and filters by citation threshold.
"""
all_papers = []
seen_titles = set()
for query in search_queries:
proxy_url = proxy_pool.rotating_url() if proxy_pool else None
scraper = ScholarScraper(proxy_url=proxy_url, request_delay=6.0)
for paper in scraper.search_papers(query, num_pages=max_pages_per_query):
title = paper.get("title", "").lower().strip()
if not title or title in seen_titles:
continue
# Filter by citation count
if paper.get("citation_count", 0) < min_citations:
continue
# Extract year from authors/venue string
av = paper.get("authors_venue", "")
year_match = re.search(r"\b(20\d\d)\b", av)
year = int(year_match.group(1)) if year_match else None
if year and year < year_from:
continue
seen_titles.add(title)
all_papers.append({
**paper,
"year": year,
"query": query,
})
time.sleep(random.uniform(8, 15))
df = pd.DataFrame(all_papers)
if not df.empty:
    df = df.sort_values("citation_count", ascending=False)
df.to_csv(output_csv, index=False)
logger.info(f"Saved {len(df)} papers to {output_csv}")
return df
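Exact title matching misses near-duplicates — trailing punctuation, subtitle variants, capitalization quirks across venues. A stricter pass with the stdlib's difflib can catch them; this is a sketch, and the 0.9 threshold is an assumption to tune against your corpus:

```python
from difflib import SequenceMatcher

def is_near_duplicate(title: str, seen_titles: list, threshold: float = 0.9) -> bool:
    """True if title is within `threshold` similarity of any already-seen title."""
    normalized = title.lower().strip()
    return any(
        SequenceMatcher(None, normalized, seen).ratio() >= threshold
        for seen in seen_titles
    )

seen = ["attention is all you need"]
print(is_near_duplicate("Attention is all you need.", seen))  # True
print(is_near_duplicate("BERT: pre-training of deep bidirectional transformers", seen))  # False
```

`SequenceMatcher.ratio()` is quadratic in title length, which is fine for titles but means comparing every new title against every seen title scales as O(n²) — acceptable for review-sized datasets, worth replacing with a hashing scheme beyond a few tens of thousands of papers.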
Practical Guidance for Scale
Volume expectations: With residential proxies and 5-6 second delays, expect 8-10 complete author profiles per hour. Plan accordingly for large research projects.
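Translating that throughput into calendar time keeps project planning honest. A trivial planner built on the 8-10 profiles/hour figure above (`plan_crawl` is a hypothetical helper; adjust `profiles_per_hour` to your measured rate):

```python
def plan_crawl(n_profiles: int, profiles_per_hour: float = 9.0) -> dict:
    """Calendar-time estimate for a sequential crawl at a measured hourly rate."""
    hours = n_profiles / profiles_per_hour
    return {"hours": round(hours, 1), "days_at_8h": round(hours / 8, 1)}

print(plan_crawl(500))  # {'hours': 55.6, 'days_at_8h': 6.9}
```

A 500-author department survey is roughly a week of crawling at one worker — which is exactly why the caching below matters.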
Caching is essential: Citation counts change slowly. Cache author profiles for 24 hours minimum. Use SQLite with a fetched_at timestamp and skip re-fetching recent data.
def get_cached_or_fetch(author_id: str, scraper: ScholarScraper, db: sqlite3.Connection, ttl_hours: int = 24) -> Optional[Dict]:
cutoff = (datetime.now() - timedelta(hours=ttl_hours)).isoformat()
row = db.execute(
"SELECT data FROM author_cache WHERE author_id=? AND fetched_at > ?",
(author_id, cutoff)
).fetchone()
if row:
return json.loads(row[0])
author = scraper.get_author_profile(author_id)
if author:
db.execute(
"INSERT OR REPLACE INTO author_cache (author_id, data, fetched_at) VALUES (?, ?, ?)",
(author_id, json.dumps(asdict(author), default=str), datetime.now().isoformat())
)
db.commit()
return asdict(author) if author else None
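The function above assumes an author_cache table already exists; a minimal setup matching the queries it issues (the schema here is inferred from those queries — author_id must be the primary key for INSERT OR REPLACE to deduplicate):

```python
import sqlite3

def init_author_cache(db_path: str = "scholar_cache.db") -> sqlite3.Connection:
    """Create the author_cache table expected by get_cached_or_fetch."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS author_cache ("
        "author_id TEXT PRIMARY KEY, data TEXT, fetched_at TEXT)"
    )
    db.commit()
    return db

db = init_author_cache(":memory:")
```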
When Scholar fails completely: SerpAPI offers a Google Scholar endpoint that handles all the anti-bot complexity but charges per request. For high-volume production systems, the cost may be justified. For research projects with budget constraints, the proxy approach in this guide is substantially more economical.
Use BibTeX links for structured data: Scholar exposes BibTeX export for individual papers — cleaner than scraping HTML, and less likely to trigger detection since it's a lower-traffic endpoint. Worth using when you need bibliographic metadata rather than citation counts.
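One way to reach those exports, sketched with stdlib only: each search result carries a data-cid attribute, and the citation pop-up fetched for that cid contains a link to a .bib file. Both the pop-up URL template and the link-matching regex below are assumptions about Scholar's current markup, not a stable API — verify against a live page before relying on them:

```python
import re

# Assumed pop-up URL template; {cid} comes from a result's data-cid attribute
CITE_POPUP = "https://scholar.google.com/scholar?q=info:{cid}:scholar.google.com/&output=cite&hl=en"

def extract_bibtex_url(cite_popup_html: str):
    """Find the .bib export link inside the citation pop-up HTML."""
    match = re.search(r'href="([^"]+\.bib[^"]*)"', cite_popup_html)
    return match.group(1) if match else None

sample = '<a href="https://scholar.googleusercontent.com/scholar.bib?q=info:abc">BibTeX</a>'
print(extract_bibtex_url(sample))  # https://scholar.googleusercontent.com/scholar.bib?q=info:abc
```

The .bib link itself is still served from Google infrastructure, so the same proxy and delay discipline applies — "lower traffic" is not "unmonitored".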
The fundamental reality hasn't changed: Google Scholar is one of the most bot-hostile targets on the internet, and that's unlikely to change. But with proper residential proxy rotation via ThorData, realistic browser behavior via Playwright, adaptive rate limiting, and aggressive caching, reliable automated access is achievable for legitimate research purposes.