Scraping ResearchGate Researcher Profiles and Publications with Python (2026)
ResearchGate doesn't offer a public API, and there is no official endpoint that lets you pull its researcher stats programmatically (Google Scholar has the same gap). If you need citation counts, h-index data, publication lists, or co-author networks at scale, you're scraping the HTML.
This is a realistic guide to doing that with Python in 2026. It covers what data is available, what defenses you'll hit, and working code for the full pipeline from session warmup through SQLite storage.
Why ResearchGate Data Is Valuable
ResearchGate has accumulated over 25 million researcher profiles and 135 million publication pages. Unlike Google Scholar (which has researcher profiles but no API for them) or PubMed (which covers biomedical literature but not researcher impact metrics), ResearchGate provides:
- RG Score — a composite engagement metric that correlates with researcher visibility within the platform
- h-index — the standard academic impact metric, updated from their own citation database
- Per-paper citation counts that can differ from Scopus or Web of Science because ResearchGate tracks citations from papers uploaded directly to the platform
- Reads — a platform-specific engagement metric counting how many times papers have been opened on ResearchGate
- Research Interest Score — a derivative metric capturing follower engagement with a researcher's work
Use cases: academic hiring pipelines, grant landscape analysis, co-author network mapping, competitive research intelligence, citation tracking for specific technology areas.
What Data Is Available on a Profile Page
A ResearchGate researcher profile page (https://www.researchgate.net/profile/Firstname-Lastname) exposes a significant amount of structured data in the HTML and embedded JSON:
Profile metadata: - Full name and display name - Current institution and department/faculty - Country and research location - RG Score - h-index - Total citation count - Research Interest score - Total research items count - Reads count
Publications list (accessible via the publications tab): - Paper titles - Publication dates - Journal or conference name - DOI links - Per-publication citation counts - Per-publication read counts - Co-authors listed per paper
Co-author network: - Linked co-author profiles - Institution affiliation per co-author
The profile stats live in <div> elements with class patterns like nova-legacy-e-text and inside <span> tags within stat cards. The publications list is rendered server-side and paginated.
Anti-Bot Measures on ResearchGate
ResearchGate is significantly more aggressive than most academic platforms. Expect all of the following:
Cloudflare protection. Every request to researchgate.net passes through Cloudflare's bot management layer. Datacenter IP ranges (AWS, GCP, DigitalOcean, Hetzner, etc.) are blocked outright before any HTML is served — you'll get a 403 or a JS challenge page. This isn't a rate limit issue; it's IP reputation filtering. You need residential IPs from the start. ThorData's residential proxies work here because the exit nodes are genuine ISP-assigned addresses that pass Cloudflare's ASN reputation checks.
JavaScript rendering for some content. The core profile stats and most of the publications list are server-side rendered, which means plain httpx requests return usable HTML. However, some elements (follower counts, certain sidebar widgets) only appear after JS execution. For the data points listed above, a headless browser is not required.
Rate limiting and IP blocking. After 15-20 requests from the same IP in a short window, ResearchGate starts returning 429s or redirect loops to a bot challenge page. The threshold is lower than most sites.
Session cookie validation. ResearchGate sets _ga, rgUserId, and session cookies on first visit. Requests without plausible cookie state get flagged. You need to initialize a session before scraping profile data.
User-agent validation. Requests with Python's default python-httpx/x.x or python-requests/x.x user-agent return 403 immediately.
Login walls. Some profile sections (full author statistics, full publication metadata) are gated behind a ResearchGate account. Public profile pages show enough for most use cases.
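The pipeline below paces requests with ad-hoc random sleeps. If you prefer the pacing logic in one place, a minimal throttle along these lines (a generic sketch, nothing ResearchGate-specific) enforces a randomized minimum gap between consecutive requests so bursts never reach the server:

```python
import random
import time


class Throttle:
    """Enforce a randomized minimum gap between requests."""

    def __init__(self, min_gap: float = 8.0, max_gap: float = 15.0):
        self.min_gap = min_gap
        self.max_gap = max_gap
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough to respect the randomized gap, then record the time."""
        gap = random.uniform(self.min_gap, self.max_gap)
        elapsed = time.monotonic() - self._last
        if elapsed < gap:
            time.sleep(gap - elapsed)
        self._last = time.monotonic()
```

Call `throttle.wait()` immediately before every `client.get(...)`; the first call returns instantly, and later calls block only as long as needed.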
Dependencies
pip install httpx beautifulsoup4 fake-useragent lxml
Session Initialization
# researchgate_scraper.py
import httpx
import time
import random
import json
import re
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
ua = UserAgent()
def make_session(proxy: str | None = None) -> httpx.Client:
"""
Initialize an httpx session that looks like a browser.
Fetches the RG homepage first to collect session cookies.
proxy: full proxy URL, e.g. "http://user:pass@host:port"
"""
headers = {
"User-Agent": ua.random,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Cache-Control": "max-age=0",
}
client_kwargs = {
"headers": headers,
"follow_redirects": True,
"timeout": 20,
}
if proxy:
client_kwargs["proxy"] = proxy
client = httpx.Client(**client_kwargs)
# Warm up: visit homepage to collect cookies and establish IP reputation
try:
resp = client.get("https://www.researchgate.net/")
if resp.status_code in (200, 301, 302):
print(f"Session warmed. Cookies collected: {len(client.cookies)}")
time.sleep(random.uniform(2.0, 4.0))
except httpx.RequestError as e:
print(f"Warning: homepage warmup failed: {e}")
return client
def fetch_profile(client: httpx.Client, researcher_slug: str) -> dict:
"""
Fetch a ResearchGate researcher profile page.
researcher_slug: the URL slug, e.g. 'Jane-Smith-42' from
https://www.researchgate.net/profile/Jane-Smith-42
"""
url = f"https://www.researchgate.net/profile/{researcher_slug}"
# Rotate user agent between researcher fetches
client.headers.update({"User-Agent": ua.random})
resp = client.get(
url,
headers={
"Referer": "https://www.researchgate.net/",
"Sec-Fetch-Site": "same-origin",
"Sec-Fetch-Mode": "navigate",
},
)
if resp.status_code == 429:
raise RuntimeError("Rate limited (429) — back off and rotate IP")
if resp.status_code == 403:
raise RuntimeError("Blocked (403) — IP likely flagged by Cloudflare")
if resp.status_code != 200:
raise RuntimeError(f"Unexpected status {resp.status_code} for {url}")
# Check if we got a Cloudflare challenge instead of real content
if "Checking if the site connection is secure" in resp.text:
raise RuntimeError("Cloudflare JS challenge page — need residential IP")
return parse_profile(resp.text, researcher_slug)
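When a 429 or 403 does occur, retrying immediately from the same IP rarely helps. One way to structure recovery, sketched here as a hypothetical fetch_with_retry helper that takes the fetch function and session factory as arguments, is exponential backoff plus a fresh session (and the next proxy in your list) per attempt:

```python
import random
import time


def fetch_with_retry(fetch, make_client, slug, proxies=None,
                     max_attempts=3, base_delay=30.0):
    """
    Retry a fetch callable (e.g. fetch_profile) with exponential backoff,
    using a fresh session -- and the next proxy in the list -- per attempt.
    fetch(client, slug) is expected to raise RuntimeError on a block or
    rate limit, matching the conventions used in this article.
    """
    last_error = None
    for attempt in range(max_attempts):
        proxy = proxies[attempt % len(proxies)] if proxies else None
        client = make_client(proxy=proxy)
        try:
            return fetch(client, slug)
        except RuntimeError as e:
            last_error = e
            # Exponential backoff with jitter: base, 2x base, 4x base, ...
            delay = (2 ** attempt) * base_delay + random.uniform(0, base_delay / 3)
            time.sleep(delay)
        finally:
            client.close()
    raise RuntimeError(f"All {max_attempts} attempts failed for {slug}: {last_error}")
```

Usage would be `fetch_with_retry(fetch_profile, make_session, "Jane-Smith-42", proxies=[...])`; the 30-second base delay is a conservative starting point, not a measured threshold.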
Parsing Profile Metadata
ResearchGate embeds structured data in <meta> tags and in JSON-LD inside a <script> tag. The HTML stats use nova-legacy-e-text class variants for display values:
def parse_profile(html: str, slug: str) -> dict:
"""Parse researcher profile HTML into a structured dict."""
soup = BeautifulSoup(html, "lxml")
profile = {
"slug": slug,
"url": f"https://www.researchgate.net/profile/{slug}",
}
# Name from og:title meta tag
og_title = soup.find("meta", property="og:title")
if og_title:
profile["name"] = og_title.get("content", "").strip()
# Description from og:description
og_desc = soup.find("meta", property="og:description")
if og_desc:
profile["og_description"] = og_desc.get("content", "").strip()
# JSON-LD for structured name/institution/description
ld_tag = soup.find("script", type="application/ld+json")
if ld_tag and ld_tag.string:
try:
ld = json.loads(ld_tag.string)
profile.setdefault("name", ld.get("name"))
profile["description"] = ld.get("description")
profile["url_canonical"] = ld.get("url")
if "worksFor" in ld:
works = ld["worksFor"]
if isinstance(works, list) and works:
profile.setdefault("institution", works[0].get("name"))
elif isinstance(works, dict):
profile.setdefault("institution", works.get("name"))
if "alumniOf" in ld:
alumni = ld["alumniOf"]
if isinstance(alumni, list) and alumni:
profile["alumni_of"] = alumni[0].get("name")
except (json.JSONDecodeError, AttributeError):
pass
# Institution from nova-legacy-e-text size-m
institution_candidates = soup.find_all(
"div", class_=re.compile(r"nova-legacy-e-text.*size-m")
)
for el in institution_candidates:
text = el.get_text(strip=True)
if text and len(text) > 3:
profile.setdefault("institution", text)
break
# Department from nova-legacy-e-text size-s
dept_tags = soup.find_all("div", class_=re.compile(r"nova-legacy-e-text.*size-s"))
for el in dept_tags:
text = el.get_text(strip=True)
if text and len(text) > 3:
profile.setdefault("department", text)
break
# Stats: RG Score, citations, h-index, reads, research interest
stats_section = soup.find(
"div", class_=re.compile(r"research-detail-header-section__stats")
)
if stats_section:
stat_items = stats_section.find_all(
"div", class_=re.compile(r"nova-legacy-c-card__body")
)
for item in stat_items:
label_tag = item.find(
"div", class_=re.compile(r"nova-legacy-e-text.*color-grey")
)
value_tag = item.find(
"div", class_=re.compile(r"nova-legacy-e-text.*size-xxl")
)
if label_tag and value_tag:
label = label_tag.get_text(strip=True).lower()
value = value_tag.get_text(strip=True)
if "rg score" in label:
profile["rg_score"] = value
elif "citation" in label:
profile["citations_total"] = value
elif "h-index" in label:
profile["h_index"] = value
elif "read" in label:
profile["reads"] = value
elif "research interest" in label:
profile["research_interest_score"] = value
return profile
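The stat values above are stored as display strings. If you want integers for analysis, a small normalizer helps; note that the "k"/"m" suffix handling below is an assumption about how large values might be abbreviated, so verify it against real pages before relying on it:

```python
import re


def parse_count(value):
    """
    Normalize a displayed stat string to an int.
    Plain numbers ("1,234") are handled; the "k"/"m" suffixes are an
    assumption about abbreviated large values. Returns None when
    nothing numeric is found (e.g. "n/a").
    """
    if not value:
        return None
    text = value.strip().lower().replace(",", "")
    match = re.match(r"(\d+(?:\.\d+)?)\s*([km]?)", text)
    if not match:
        return None
    number = float(match.group(1))
    multiplier = {"k": 1_000, "m": 1_000_000}.get(match.group(2), 1)
    return int(number * multiplier)
```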
Extracting the Publications List
Publications appear at /profile/Firstname-Lastname/publications. The page uses a research-detail-list container with individual nova-legacy-o-stack__item entries per paper:
def fetch_publications(
client: httpx.Client,
researcher_slug: str,
max_pages: int = 5,
) -> list:
"""
Fetch paginated publication list for a researcher.
ResearchGate loads 10 publications per page.
"""
publications = []
base_url = f"https://www.researchgate.net/profile/{researcher_slug}/publications"
for page in range(1, max_pages + 1):
params = {"page": page} if page > 1 else {}
try:
resp = client.get(
base_url,
params=params,
headers={
"Referer": f"https://www.researchgate.net/profile/{researcher_slug}",
"Sec-Fetch-Site": "same-origin",
},
)
except httpx.RequestError as e:
print(f"Network error on page {page}: {e}")
break
if resp.status_code == 429:
print(f"Rate limited at publications page {page}, stopping")
break
if resp.status_code != 200:
break
soup = BeautifulSoup(resp.text, "lxml")
items = soup.find_all("div", class_=re.compile(r"nova-legacy-o-stack__item"))
if not items:
items = soup.find_all("li", class_=re.compile(r"nova-legacy-e-list__item"))
if not items:
break
for item in items:
pub = parse_publication_card(item)
if pub:
publications.append(pub)
print(f" Page {page}: {len(items)} items, {len(publications)} total")
next_btn = soup.find("a", class_=re.compile(r"nova-legacy.*next"))
if not next_btn:
break
time.sleep(random.uniform(6.0, 12.0))
return publications
def parse_publication_card(item) -> dict | None:
"""Parse a single publication card from the publications list page."""
pub = {}
# Title
title_tag = item.find("a", class_=re.compile(r"nova-legacy-e-link.*size-l"))
if title_tag:
pub["title"] = title_tag.get_text(strip=True)
href = title_tag.get("href", "")
if href.startswith("/publication/"):
pub["rg_url"] = f"https://www.researchgate.net{href}"
match = re.search(r"/publication/(\d+)", href)
if match:
pub["rg_publication_id"] = match.group(1)
if not pub.get("title"):
return None
# Date
date_tag = item.find("span", class_=re.compile(r"nova-legacy-e-text.*color-grey-600"))
if date_tag:
pub["date"] = date_tag.get_text(strip=True)
# Journal/conference name
journal_tag = item.find("span", class_=re.compile(r"nova-legacy-e-badge"))
if journal_tag:
pub["venue"] = journal_tag.get_text(strip=True)
# Citation and read counts
stat_tags = item.find_all("li", class_=re.compile(r"nova-legacy-e-list__item"))
for stat in stat_tags:
text = stat.get_text(strip=True).lower()
digits = re.search(r"[\d,]+", text)
if digits:
count = int(digits.group().replace(",", ""))
if "citation" in text:
pub["citations"] = count
elif "read" in text:
pub["reads"] = count
elif "recommendation" in text:
pub["recommendations"] = count
# DOI
doi_tag = item.find("a", href=re.compile(r"doi\.org"))
if doi_tag:
pub["doi"] = doi_tag.get("href")
# Co-authors
author_tags = item.find_all("a", class_=re.compile(r"nova-legacy-e-link.*color-inherit"))
pub["co_authors"] = [
a.get_text(strip=True)
for a in author_tags
if a.get_text(strip=True) and "/profile/" in a.get("href", "")
]
# Publication type
type_tag = item.find("span", class_=re.compile(r"nova-legacy-e-badge.*type"))
if type_tag:
pub["pub_type"] = type_tag.get_text(strip=True)
return pub
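Publication dates also come back as display strings. Assuming the common "Jan 2024" / "January 2024" / bare-year shapes (verify against real cards), a hypothetical normalize_pub_date helper can map them to sortable ISO prefixes, leaving anything unrecognized untouched:

```python
from datetime import datetime


def normalize_pub_date(raw):
    """
    Convert a displayed publication date to "YYYY-MM" or "YYYY".
    Assumes "Jan 2024", "January 2024", or bare-year formats; extend
    the format list if real cards show other shapes. Unrecognized
    strings are returned as-is rather than discarded.
    """
    if not raw:
        return None
    raw = raw.strip()
    for fmt, out in (("%b %Y", "%Y-%m"), ("%B %Y", "%Y-%m"), ("%Y", "%Y")):
        try:
            return datetime.strptime(raw, fmt).strftime(out)
        except ValueError:
            continue
    return raw  # unknown format: keep the original string
```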
SQLite Storage Schema
import sqlite3
def init_db(db_path: str = "researchgate.db") -> sqlite3.Connection:
"""Initialize database with tables for researchers and publications."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS researchers (
slug TEXT PRIMARY KEY,
name TEXT,
institution TEXT,
department TEXT,
alumni_of TEXT,
rg_score TEXT,
h_index TEXT,
citations_total TEXT,
reads TEXT,
research_interest_score TEXT,
description TEXT,
og_description TEXT,
url_canonical TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS publications (
id INTEGER PRIMARY KEY AUTOINCREMENT,
researcher_slug TEXT NOT NULL,
rg_publication_id TEXT,
title TEXT,
rg_url TEXT,
date TEXT,
venue TEXT,
doi TEXT,
pub_type TEXT,
citations INTEGER DEFAULT 0,
reads INTEGER DEFAULT 0,
recommendations INTEGER DEFAULT 0,
co_authors TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP,
UNIQUE (researcher_slug, rg_publication_id),  -- lets re-scrapes skip duplicate rows
FOREIGN KEY (researcher_slug) REFERENCES researchers(slug)
);
CREATE TABLE IF NOT EXISTS scrape_errors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
slug TEXT,
error_stage TEXT,
error_msg TEXT,
proxy_used TEXT,
occurred_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_pubs_slug ON publications (researcher_slug);
CREATE INDEX IF NOT EXISTS idx_pubs_doi ON publications (doi);
""")
conn.commit()
return conn
def save_researcher(conn: sqlite3.Connection, profile: dict):
"""Upsert a researcher profile record."""
conn.execute(
"""INSERT OR REPLACE INTO researchers
(slug, name, institution, department, alumni_of, rg_score, h_index,
citations_total, reads, research_interest_score, description,
og_description, url_canonical, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,CURRENT_TIMESTAMP)""",
(
profile.get("slug"),
profile.get("name"),
profile.get("institution"),
profile.get("department"),
profile.get("alumni_of"),
profile.get("rg_score"),
profile.get("h_index"),
profile.get("citations_total"),
profile.get("reads"),
profile.get("research_interest_score"),
profile.get("description"),
profile.get("og_description"),
profile.get("url_canonical"),
),
)
conn.commit()
def save_publications(conn: sqlite3.Connection, slug: str, pubs: list) -> int:
"""Insert publications for a researcher. Returns count of inserted rows."""
inserted = 0
for p in pubs:
try:
conn.execute(
"""INSERT INTO publications
(researcher_slug, rg_publication_id, title, rg_url, date,
venue, doi, pub_type, citations, reads, recommendations, co_authors)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?)""",
(
slug,
p.get("rg_publication_id"),
p.get("title"),
p.get("rg_url"),
p.get("date"),
p.get("venue"),
p.get("doi"),
p.get("pub_type"),
p.get("citations", 0),
p.get("reads", 0),
p.get("recommendations", 0),
json.dumps(p.get("co_authors", [])),
),
)
inserted += 1
except sqlite3.IntegrityError:
pass
conn.commit()
return inserted
Complete Pipeline
def scrape_researchers(
slugs: list,
db_path: str = "researchgate.db",
proxy: str | None = None,
max_pub_pages: int = 3,
delay_between_profiles: tuple = (15.0, 30.0),
):
"""
Full pipeline: session init -> profile fetch -> publications -> SQLite storage.
Uses one fresh session per researcher profile.
"""
conn = init_db(db_path)
results = {"success": 0, "errors": 0}
for i, slug in enumerate(slugs):
print(f"\n[{i+1}/{len(slugs)}] Processing {slug}...")
try:
client = make_session(proxy=proxy)
except Exception as e:
print(f" Session init failed: {e}")
results["errors"] += 1
continue
try:
profile = fetch_profile(client, slug)
save_researcher(conn, profile)
print(
f" {profile.get('name', slug)} | "
f"Citations: {profile.get('citations_total', 'N/A')} | "
f"RG Score: {profile.get('rg_score', 'N/A')} | "
f"h-index: {profile.get('h_index', 'N/A')}"
)
time.sleep(random.uniform(5.0, 10.0))
pubs = fetch_publications(client, slug, max_pages=max_pub_pages)
inserted = save_publications(conn, slug, pubs)
print(f" Saved {inserted} publications ({len(pubs)} fetched)")
results["success"] += 1
except (RuntimeError, httpx.RequestError) as e:
print(f" Error for {slug}: {e}")
conn.execute(
"INSERT INTO scrape_errors (slug, error_stage, error_msg, proxy_used) "
"VALUES (?, 'fetch', ?, ?)",
(slug, str(e), proxy)
)
conn.commit()
results["errors"] += 1
finally:
client.close()
if i < len(slugs) - 1:
delay = random.uniform(*delay_between_profiles)
print(f" Waiting {delay:.1f}s before next researcher...")
time.sleep(delay)
conn.close()
print(f"\nCompleted: {results['success']} ok, {results['errors']} errors")
return results
# Usage
PROXY = "http://user:[email protected]:9000"
RESEARCHERS = [
"Geoffrey-Hinton",
"Yoshua-Bengio",
"Yann-LeCun-2",
"Fei-Fei-Li",
"Andrew-Ng",
]
scrape_researchers(RESEARCHERS, proxy=PROXY, max_pub_pages=3)
Citation Analysis Queries
Once you have data in SQLite, useful analytical queries:
def top_cited_publications(conn: sqlite3.Connection, slug: str, n: int = 10) -> list:
"""Return the n most cited publications for a researcher."""
rows = conn.execute(
"""
SELECT title, venue, date, citations, doi
FROM publications
WHERE researcher_slug = ?
ORDER BY citations DESC
LIMIT ?
""",
(slug, n)
).fetchall()
return [
{"title": r[0], "venue": r[1], "date": r[2], "citations": r[3], "doi": r[4]}
for r in rows
]
def most_frequent_co_authors(conn: sqlite3.Connection, slug: str) -> list:
"""Find the most frequent co-authors for a researcher."""
from collections import Counter
rows = conn.execute(
"SELECT co_authors FROM publications WHERE researcher_slug = ?",
(slug,)
).fetchall()
counter = Counter()
for row in rows:
authors = json.loads(row[0] or "[]")
counter.update(authors)
return counter.most_common(20)
def compare_researchers(conn: sqlite3.Connection, slugs: list) -> list:
"""Compare multiple researchers by their headline stats."""
results = []
for slug in slugs:
row = conn.execute(
"SELECT name, h_index, citations_total, rg_score FROM researchers WHERE slug=?",
(slug,)
).fetchone()
pub_count = conn.execute(
"SELECT COUNT(*) FROM publications WHERE researcher_slug=?",
(slug,)
).fetchone()[0]
if row:
results.append({
"slug": slug, "name": row[0], "h_index": row[1],
"citations": row[2], "rg_score": row[3], "pub_count": pub_count
})
return results
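The profile h-index reflects ResearchGate's full citation database, but you can compute one from whatever you actually scraped as a sanity check. If you only fetched the first few publication pages, this is a lower bound on the profile value, which makes large discrepancies a useful signal that parsing went wrong:

```python
def h_index_from_scraped(conn, slug):
    """
    Compute an h-index from per-paper citation counts stored in SQLite.
    This is a lower bound on the profile h-index when only some
    publication pages were scraped.
    """
    counts = [
        row[0] for row in conn.execute(
            "SELECT citations FROM publications "
            "WHERE researcher_slug = ? ORDER BY citations DESC",
            (slug,),
        )
    ]
    # h = largest h such that at least h papers have >= h citations
    h = 0
    for rank, citations in enumerate(counts, start=1):
        if (citations or 0) >= rank:
            h = rank
        else:
            break
    return h
```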
Practical Tips
Delays matter more than anything else. ResearchGate tracks inter-request timing. 15-30 seconds between researcher profiles is the safe range. Anything under 8 seconds per request will trigger rate limits within a few pages.
Rotate proxies per researcher, not per request. Switching IPs mid-session looks more suspicious than using one IP per researcher profile. Initialize a new session with a new proxy for each slug in your list.
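Wiring that into the pipeline can be as simple as pairing slugs with proxies round-robin before the main loop (a sketch; proxy_per_slug is a hypothetical helper and assumes a non-empty proxy list):

```python
from itertools import cycle


def proxy_per_slug(slugs, proxies):
    """
    Pair each researcher slug with the next proxy in a round-robin
    cycle, so every profile is scraped through exactly one exit IP.
    """
    pool = cycle(proxies)
    return [(slug, next(pool)) for slug in slugs]
```

Each `(slug, proxy)` pair then gets its own `make_session(proxy=proxy)` call.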
The RG Score is volatile. It updates frequently and the displayed value can differ between page loads depending on server caching. Scrape it multiple times and average if precision matters.
Cloudflare challenges increase after 10 PM UTC. ResearchGate's bot detection appears to run stricter rules during off-peak hours. Schedule heavy scraping runs during European or US business hours when real user traffic is highest.
Avoid recursive co-author graph scraping without rate controls. It is tempting to follow every co-author link and build a network graph. Each researcher profile is another full page fetch. A 3-hop network from a prolific researcher can mean thousands of requests.
Residential proxies are non-negotiable for this target. Datacenter IPs, even premium ones, get blocked by Cloudflare before the first response. ThorData routes traffic through genuine residential ISP addresses that ResearchGate's defenses don't flag. If you're hitting consistent 403s or empty Cloudflare challenge pages, the proxy type is almost always the cause.
Parse defensively. The HTML class names in ResearchGate's nova-legacy component library change occasionally. Write regex-based class selectors rather than exact matches so that minor CSS class renames don't break your parser.
Store raw HTML during development. While you're writing your parser, save the complete page source alongside your parsed output. This lets you debug selector failures without re-fetching and burning through your proxy budget.
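A minimal version of that raw-HTML cache might look like this (cache_html is a hypothetical helper; including a content hash in the filename is a design choice so repeated scrapes of the same profile sit side by side instead of overwriting):

```python
import hashlib
from pathlib import Path


def cache_html(html: str, slug: str, cache_dir: str = "raw_html") -> str:
    """
    Save raw page source so selector failures can be debugged offline
    without re-fetching. Returns the path of the written file.
    """
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha1(html.encode("utf-8")).hexdigest()[:10]
    path = directory / f"{slug}-{digest}.html"
    path.write_text(html, encoding="utf-8")
    return str(path)
```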
Legal Notes
ResearchGate's Terms of Service (Section 7.1) prohibit automated access and data extraction, and its robots.txt disallows most scraping paths. The robots.txt file is advisory and has no direct legal force, but the ToS creates a contract-based restriction on anyone who has accepted it.
Practically: individual researchers scraping their own citation data or doing small-scale academic research operate in a gray zone that ResearchGate has not historically enforced against. Building a commercial product that resells ResearchGate profile data is a different matter entirely.
For large-scale academic data needs, established datasets like OpenAlex (the successor to the discontinued Microsoft Academic Graph), the Semantic Scholar API, or Dimensions are legitimate alternatives with proper APIs. These are worth evaluating before investing in ResearchGate scraping infrastructure.
Always scrape only what you need, cache aggressively to avoid repeat fetches, and avoid placing load on the platform beyond what a determined human researcher would generate.