Scraping ResearchGate Researcher Profiles and Publications with Python (2026)
ResearchGate doesn't offer a public API, and there is no official endpoint that lets you pull its researcher stats programmatically (Google Scholar has the same gap). If you need citation counts, h-index data, publication lists, or co-author networks at scale, you're scraping the HTML.
This is a realistic guide to doing that with Python in 2026. It covers what data is available, what defenses you'll hit, and working code for the full pipeline from session warmup through SQLite storage.
Why ResearchGate Data Is Valuable
ResearchGate has accumulated over 25 million researcher profiles and 135 million publication pages. Unlike Google Scholar (which has researcher profiles but no API for them) or PubMed (which covers biomedical literature but not researcher impact metrics), ResearchGate provides:
- RG Score — a composite engagement metric that correlates with researcher visibility within the platform
- h-index — the standard academic impact metric, updated from their own citation database
- Per-paper citation counts that can differ from Scopus or Web of Science because ResearchGate tracks citations from papers uploaded directly to the platform
- Reads — a platform-specific engagement metric counting how many times papers have been opened on ResearchGate
- Research Interest Score — a derivative metric capturing follower engagement with a researcher's work
Use cases: academic hiring pipelines, grant landscape analysis, co-author network mapping, competitive research intelligence, citation tracking for specific technology areas.
What Data Is Available on a Profile Page
A ResearchGate researcher profile page (https://www.researchgate.net/profile/Firstname-Lastname) exposes a significant amount of structured data in the HTML and embedded JSON:
Profile metadata: - Full name and display name - Current institution and department/faculty - Country and research location - RG Score - h-index - Total citation count - Research Interest score - Total research items count - Reads count
Publications list (accessible via the publications tab): - Paper titles - Publication dates - Journal or conference name - DOI links - Per-publication citation counts - Per-publication read counts - Co-authors listed per paper
Co-author network: - Linked co-author profiles - Institution affiliation per co-author
The profile stats live in <div> elements with class patterns like nova-legacy-e-text and inside <span> tags within stat cards. The publications list is rendered server-side and paginated.
Anti-Bot Measures on ResearchGate
ResearchGate is significantly more aggressive than most academic platforms. Expect all of the following:
Cloudflare protection. Every request to researchgate.net passes through Cloudflare's bot management layer. Datacenter IP ranges (AWS, GCP, DigitalOcean, Hetzner, etc.) are blocked outright before any HTML is served — you'll get a 403 or a JS challenge page. This isn't a rate limit issue; it's IP reputation filtering. You need residential IPs from the start. ThorData's residential proxies work here because the exit nodes are genuine ISP-assigned addresses that pass Cloudflare's ASN reputation checks.
JavaScript rendering for some content. The core profile stats and most of the publications list are server-side rendered, which means plain httpx requests return usable HTML. However, some elements (follower counts, certain sidebar widgets) only appear after JS execution. For the data points listed above, a headless browser is not required.
Rate limiting and IP blocking. After 15-20 requests from the same IP in a short window, ResearchGate starts returning 429s or redirect loops to a bot challenge page. The threshold is lower than most sites.
Session cookie validation. ResearchGate sets _ga, rgUserId, and session cookies on first visit. Requests without plausible cookie state get flagged. You need to initialize a session before scraping profile data.
User-agent validation. Requests with Python's default python-httpx/x.x or python-requests/x.x user-agent return 403 immediately.
Login walls. Some profile sections (full author statistics, full publication metadata) are gated behind a ResearchGate account. Public profile pages show enough for most use cases.
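The pipeline below paces requests with ad-hoc random sleeps. If you prefer the pacing logic in one place, a minimal throttle along these lines (a generic sketch, nothing ResearchGate-specific) enforces a randomized minimum gap between consecutive requests so bursts never reach the server:

```python
import random
import time


class Throttle:
    """Enforce a randomized minimum gap between requests."""

    def __init__(self, min_gap: float = 8.0, max_gap: float = 15.0):
        self.min_gap = min_gap
        self.max_gap = max_gap
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough to respect the randomized gap, then record the time."""
        gap = random.uniform(self.min_gap, self.max_gap)
        elapsed = time.monotonic() - self._last
        if elapsed < gap:
            time.sleep(gap - elapsed)
        self._last = time.monotonic()
```

Call `throttle.wait()` immediately before every `client.get(...)`; the first call returns instantly, and later calls block only as long as needed.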
Dependencies
pip install httpx beautifulsoup4 fake-useragent lxml
Session Initialization
# researchgate_scraper.py
import httpx
import time
import random
import json
import re
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
ua = UserAgent()
def make_session(proxy: str | None = None) -> httpx.Client:
"""
Initialize an httpx session that looks like a browser.
Fetches the RG homepage first to collect session cookies.
proxy: full proxy URL, e.g. "http://user:pass@host:port"
"""
headers = {
"User-Agent": ua.random,
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Cache-Control": "max-age=0",
}
client_kwargs = {
"headers": headers,
"follow_redirects": True,
"timeout": 20,
}
if proxy:
client_kwargs["proxy"] = proxy
client = httpx.Client(**client_kwargs)
# Warm up: visit homepage to collect cookies and establish IP reputation
try:
resp = client.get("https://www.researchgate.net/")
if resp.status_code in (200, 301, 302):
print(f"Session warmed. Cookies collected: {len(client.cookies)}")
time.sleep(random.uniform(2.0, 4.0))
except httpx.RequestError as e:
print(f"Warning: homepage warmup failed: {e}")
return client
def fetch_profile(client: httpx.Client, researcher_slug: str) -> dict:
"""
Fetch a ResearchGate researcher profile page.
researcher_slug: the URL slug, e.g. 'Jane-Smith-42' from
https://www.researchgate.net/profile/Jane-Smith-42
"""
url = f"https://www.researchgate.net/profile/{researcher_slug}"
# Rotate user agent between researcher fetches
client.headers.update({"User-Agent": ua.random})
resp = client.get(
url,
headers={
"Referer": "https://www.researchgate.net/",
"Sec-Fetch-Site": "same-origin",
"Sec-Fetch-Mode": "navigate",
},
)
if resp.status_code == 429:
raise RuntimeError("Rate limited (429) — back off and rotate IP")
if resp.status_code == 403:
raise RuntimeError("Blocked (403) — IP likely flagged by Cloudflare")
if resp.status_code != 200:
raise RuntimeError(f"Unexpected status {resp.status_code} for {url}")
# Check if we got a Cloudflare challenge instead of real content
if "Checking if the site connection is secure" in resp.text:
raise RuntimeError("Cloudflare JS challenge page — need residential IP")
return parse_profile(resp.text, researcher_slug)
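When a 429 or 403 does occur, retrying immediately from the same IP rarely helps. One way to structure recovery, sketched here as a hypothetical fetch_with_retry helper that takes the fetch function and session factory as arguments, is exponential backoff plus a fresh session (and the next proxy in your list) per attempt:

```python
import random
import time


def fetch_with_retry(fetch, make_client, slug, proxies=None,
                     max_attempts=3, base_delay=30.0):
    """
    Retry a fetch callable (e.g. fetch_profile) with exponential backoff,
    using a fresh session -- and the next proxy in the list -- per attempt.
    fetch(client, slug) is expected to raise RuntimeError on a block or
    rate limit, matching the conventions used in this article.
    """
    last_error = None
    for attempt in range(max_attempts):
        proxy = proxies[attempt % len(proxies)] if proxies else None
        client = make_client(proxy=proxy)
        try:
            return fetch(client, slug)
        except RuntimeError as e:
            last_error = e
            # Exponential backoff with jitter: base, 2x base, 4x base, ...
            delay = (2 ** attempt) * base_delay + random.uniform(0, base_delay / 3)
            time.sleep(delay)
        finally:
            client.close()
    raise RuntimeError(f"All {max_attempts} attempts failed for {slug}: {last_error}")
```

Usage would be `fetch_with_retry(fetch_profile, make_session, "Jane-Smith-42", proxies=[...])`; the 30-second base delay is a conservative starting point, not a measured threshold.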
Parsing Profile Metadata
ResearchGate embeds structured data in <meta> tags and in JSON-LD inside a <script> tag. The HTML stats use nova-legacy-e-text class variants for display values:
def parse_profile(html: str, slug: str) -> dict:
"""Parse researcher profile HTML into a structured dict."""
soup = BeautifulSoup(html, "lxml")
profile = {
"slug": slug,
"url": f"https://www.researchgate.net/profile/{slug}",
}
# Name from og:title meta tag
og_title = soup.find("meta", property="og:title")
if og_title:
profile["name"] = og_title.get("content", "").strip()
# Description from og:description
og_desc = soup.find("meta", property="og:description")
if og_desc:
profile["og_description"] = og_desc.get("content", "").strip()
# JSON-LD for structured name/institution/description
ld_tag = soup.find("script", type="application/ld+json")
if ld_tag and ld_tag.string:
try:
ld = json.loads(ld_tag.string)
profile.setdefault("name", ld.get("name"))
profile["description"] = ld.get("description")
profile["url_canonical"] = ld.get("url")
if "worksFor" in ld:
works = ld["worksFor"]
if isinstance(works, list) and works:
profile.setdefault("institution", works[0].get("name"))
elif isinstance(works, dict):
profile.setdefault("institution", works.get("name"))
if "alumniOf" in ld:
alumni = ld["alumniOf"]
if isinstance(alumni, list) and alumni:
profile["alumni_of"] = alumni[0].get("name")
except (json.JSONDecodeError, AttributeError):
pass
# Institution from nova-legacy-e-text size-m
institution_candidates = soup.find_all(
"div", class_=re.compile(r"nova-legacy-e-text.*size-m")
)
for el in institution_candidates:
text = el.get_text(strip=True)
if text and len(text) > 3:
profile.setdefault("institution", text)
break
# Department from nova-legacy-e-text size-s
dept_tags = soup.find_all("div", class_=re.compile(r"nova-legacy-e-text.*size-s"))
for el in dept_tags:
text = el.get_text(strip=True)
if text and len(text) > 3:
profile.setdefault("department", text)
break
# Stats: RG Score, citations, h-index, reads, research interest
stats_section = soup.find(
"div", class_=re.compile(r"research-detail-header-section__stats")
)
if stats_section:
stat_items = stats_section.find_all(
"div", class_=re.compile(r"nova-legacy-c-card__body")
)
for item in stat_items:
label_tag = item.find(
"div", class_=re.compile(r"nova-legacy-e-text.*color-grey")
)
value_tag = item.find(
"div", class_=re.compile(r"nova-legacy-e-text.*size-xxl")
)
if label_tag and value_tag:
label = label_tag.get_text(strip=True).lower()
value = value_tag.get_text(strip=True)
if "rg score" in label:
profile["rg_score"] = value
elif "citation" in label:
profile["citations_total"] = value
elif "h-index" in label:
profile["h_index"] = value
elif "read" in label:
profile["reads"] = value
elif "research interest" in label:
profile["research_interest_score"] = value
return profile
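The stat values above are stored as display strings. If you want integers for analysis, a small normalizer helps; note that the "k"/"m" suffix handling below is an assumption about how large values might be abbreviated, so verify it against real pages before relying on it:

```python
import re


def parse_count(value):
    """
    Normalize a displayed stat string to an int.
    Plain numbers ("1,234") are handled; the "k"/"m" suffixes are an
    assumption about abbreviated large values. Returns None when
    nothing numeric is found (e.g. "n/a").
    """
    if not value:
        return None
    text = value.strip().lower().replace(",", "")
    match = re.match(r"(\d+(?:\.\d+)?)\s*([km]?)", text)
    if not match:
        return None
    number = float(match.group(1))
    multiplier = {"k": 1_000, "m": 1_000_000}.get(match.group(2), 1)
    return int(number * multiplier)
```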
Extracting the Publications List
Publications appear at /profile/Firstname-Lastname/publications. The page uses a research-detail-list container with individual nova-legacy-o-stack__item entries per paper:
def fetch_publications(
client: httpx.Client,
researcher_slug: str,
max_pages: int = 5,
) -> list:
"""
Fetch paginated publication list for a researcher.
ResearchGate loads 10 publications per page.
"""
publications = []
base_url = f"https://www.researchgate.net/profile/{researcher_slug}/publications"
for page in range(1, max_pages + 1):
params = {"page": page} if page > 1 else {}
try:
resp = client.get(
base_url,
params=params,
headers={
"Referer": f"https://www.researchgate.net/profile/{researcher_slug}",
"Sec-Fetch-Site": "same-origin",
},
)
except httpx.RequestError as e:
print(f"Network error on page {page}: {e}")
break
if resp.status_code == 429:
print(f"Rate limited at publications page {page}, stopping")
break
if resp.status_code != 200:
break
soup = BeautifulSoup(resp.text, "lxml")
items = soup.find_all("div", class_=re.compile(r"nova-legacy-o-stack__item"))
if not items:
items = soup.find_all("li", class_=re.compile(r"nova-legacy-e-list__item"))
if not items:
break
for item in items:
pub = parse_publication_card(item)
if pub:
publications.append(pub)
print(f" Page {page}: {len(items)} items, {len(publications)} total")
next_btn = soup.find("a", class_=re.compile(r"nova-legacy.*next"))
if not next_btn:
break
time.sleep(random.uniform(6.0, 12.0))
return publications
def parse_publication_card(item) -> dict | None:
"""Parse a single publication card from the publications list page."""
pub = {}
# Title
title_tag = item.find("a", class_=re.compile(r"nova-legacy-e-link.*size-l"))
if title_tag:
pub["title"] = title_tag.get_text(strip=True)
href = title_tag.get("href", "")
if href.startswith("/publication/"):
pub["rg_url"] = f"https://www.researchgate.net{href}"
match = re.search(r"/publication/(\d+)", href)
if match:
pub["rg_publication_id"] = match.group(1)
if not pub.get("title"):
return None
# Date
date_tag = item.find("span", class_=re.compile(r"nova-legacy-e-text.*color-grey-600"))
if date_tag:
pub["date"] = date_tag.get_text(strip=True)
# Journal/conference name
journal_tag = item.find("span", class_=re.compile(r"nova-legacy-e-badge"))
if journal_tag:
pub["venue"] = journal_tag.get_text(strip=True)
# Citation and read counts
stat_tags = item.find_all("li", class_=re.compile(r"nova-legacy-e-list__item"))
for stat in stat_tags:
text = stat.get_text(strip=True).lower()
digits = re.search(r"[\d,]+", text)
if digits:
count = int(digits.group().replace(",", ""))
if "citation" in text:
pub["citations"] = count
elif "read" in text:
pub["reads"] = count
elif "recommendation" in text:
pub["recommendations"] = count
# DOI
doi_tag = item.find("a", href=re.compile(r"doi\.org"))
if doi_tag:
pub["doi"] = doi_tag.get("href")
# Co-authors
author_tags = item.find_all("a", class_=re.compile(r"nova-legacy-e-link.*color-inherit"))
pub["co_authors"] = [
a.get_text(strip=True)
for a in author_tags
if a.get_text(strip=True) and "/profile/" in a.get("href", "")
]
# Publication type
type_tag = item.find("span", class_=re.compile(r"nova-legacy-e-badge.*type"))
if type_tag:
pub["pub_type"] = type_tag.get_text(strip=True)
return pub
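Publication dates also come back as display strings. Assuming the common "Jan 2024" / "January 2024" / bare-year shapes (verify against real cards), a hypothetical normalize_pub_date helper can map them to sortable ISO prefixes, leaving anything unrecognized untouched:

```python
from datetime import datetime


def normalize_pub_date(raw):
    """
    Convert a displayed publication date to "YYYY-MM" or "YYYY".
    Assumes "Jan 2024", "January 2024", or bare-year formats; extend
    the format list if real cards show other shapes. Unrecognized
    strings are returned as-is rather than discarded.
    """
    if not raw:
        return None
    raw = raw.strip()
    for fmt, out in (("%b %Y", "%Y-%m"), ("%B %Y", "%Y-%m"), ("%Y", "%Y")):
        try:
            return datetime.strptime(raw, fmt).strftime(out)
        except ValueError:
            continue
    return raw  # unknown format: keep the original string
```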
SQLite Storage Schema
import sqlite3
def init_db(db_path: str = "researchgate.db") -> sqlite3.Connection:
"""Initialize database with tables for researchers and publications."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS researchers (
slug TEXT PRIMARY KEY,
name TEXT,
institution TEXT,
department TEXT,
alumni_of TEXT,
rg_score TEXT,
h_index TEXT,
citations_total TEXT,
reads TEXT,
research_interest_score TEXT,
description TEXT,
og_description TEXT,
url_canonical TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS publications (
id INTEGER PRIMARY KEY AUTOINCREMENT,
researcher_slug TEXT NOT NULL,
rg_publication_id TEXT,
title TEXT,
rg_url TEXT,
date TEXT,
venue TEXT,
doi TEXT,
pub_type TEXT,
citations INTEGER DEFAULT 0,
reads INTEGER DEFAULT 0,
recommendations INTEGER DEFAULT 0,
co_authors TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP,
UNIQUE (researcher_slug, rg_publication_id),  -- lets re-scrapes skip duplicate rows
FOREIGN KEY (researcher_slug) REFERENCES researchers(slug)
);
CREATE TABLE IF NOT EXISTS scrape_errors (
id INTEGER PRIMARY KEY AUTOINCREMENT,
slug TEXT,
error_stage TEXT,
error_msg TEXT,
proxy_used TEXT,
occurred_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_pubs_slug ON publications (researcher_slug);
CREATE INDEX IF NOT EXISTS idx_pubs_doi ON publications (doi);
""")
conn.commit()
return conn
def save_researcher(conn: sqlite3.Connection, profile: dict):
"""Upsert a researcher profile record."""
conn.execute(
"""INSERT OR REPLACE INTO researchers
(slug, name, institution, department, alumni_of, rg_score, h_index,
citations_total, reads, research_interest_score, description,
og_description, url_canonical, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,CURRENT_TIMESTAMP)""",
(
profile.get("slug"),
profile.get("name"),
profile.get("institution"),
profile.get("department"),
profile.get("alumni_of"),
profile.get("rg_score"),
profile.get("h_index"),
profile.get("citations_total"),
profile.get("reads"),
profile.get("research_interest_score"),
profile.get("description"),
profile.get("og_description"),
profile.get("url_canonical"),
),
)
conn.commit()
def save_publications(conn: sqlite3.Connection, slug: str, pubs: list) -> int:
"""Insert publications for a researcher. Returns count of inserted rows."""
inserted = 0
for p in pubs:
try:
conn.execute(
"""INSERT INTO publications
(researcher_slug, rg_publication_id, title, rg_url, date,
venue, doi, pub_type, citations, reads, recommendations, co_authors)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?)""",
(
slug,
p.get("rg_publication_id"),
p.get("title"),
p.get("rg_url"),
p.get("date"),
p.get("venue"),
p.get("doi"),
p.get("pub_type"),
p.get("citations", 0),
p.get("reads", 0),
p.get("recommendations", 0),
json.dumps(p.get("co_authors", [])),
),
)
inserted += 1
except sqlite3.IntegrityError:
pass
conn.commit()
return inserted
Complete Pipeline
def scrape_researchers(
slugs: list,
db_path: str = "researchgate.db",
proxy: str | None = None,
max_pub_pages: int = 3,
delay_between_profiles: tuple = (15.0, 30.0),
):
"""
Full pipeline: session init -> profile fetch -> publications -> SQLite storage.
Uses one fresh session per researcher profile.
"""
conn = init_db(db_path)
results = {"success": 0, "errors": 0}
for i, slug in enumerate(slugs):
print(f"\n[{i+1}/{len(slugs)}] Processing {slug}...")
try:
client = make_session(proxy=proxy)
except Exception as e:
print(f" Session init failed: {e}")
results["errors"] += 1
continue
try:
profile = fetch_profile(client, slug)
save_researcher(conn, profile)
print(
f" {profile.get('name', slug)} | "
f"Citations: {profile.get('citations_total', 'N/A')} | "
f"RG Score: {profile.get('rg_score', 'N/A')} | "
f"h-index: {profile.get('h_index', 'N/A')}"
)
time.sleep(random.uniform(5.0, 10.0))
pubs = fetch_publications(client, slug, max_pages=max_pub_pages)
inserted = save_publications(conn, slug, pubs)
print(f" Saved {inserted} publications ({len(pubs)} fetched)")
results["success"] += 1
except (RuntimeError, httpx.RequestError) as e:
print(f" Error for {slug}: {e}")
conn.execute(
"INSERT INTO scrape_errors (slug, error_stage, error_msg, proxy_used) "
"VALUES (?, 'fetch', ?, ?)",
(slug, str(e), proxy)
)
conn.commit()
results["errors"] += 1
finally:
client.close()
if i < len(slugs) - 1:
delay = random.uniform(*delay_between_profiles)
print(f" Waiting {delay:.1f}s before next researcher...")
time.sleep(delay)
conn.close()
print(f"\nCompleted: {results['success']} ok, {results['errors']} errors")
return results
# Usage
PROXY = "http://user:[email protected]:9000"
RESEARCHERS = [
"Geoffrey-Hinton",
"Yoshua-Bengio",
"Yann-LeCun-2",
"Fei-Fei-Li",
"Andrew-Ng",
]
scrape_researchers(RESEARCHERS, proxy=PROXY, max_pub_pages=3)
Citation Analysis Queries
Once you have data in SQLite, useful analytical queries:
def top_cited_publications(conn: sqlite3.Connection, slug: str, n: int = 10) -> list:
"""Return the n most cited publications for a researcher."""
rows = conn.execute(
"""
SELECT title, venue, date, citations, doi
FROM publications
WHERE researcher_slug = ?
ORDER BY citations DESC
LIMIT ?
""",
(slug, n)
).fetchall()
return [
{"title": r[0], "venue": r[1], "date": r[2], "citations": r[3], "doi": r[4]}
for r in rows
]
def most_frequent_co_authors(conn: sqlite3.Connection, slug: str) -> list:
"""Find the most frequent co-authors for a researcher."""
from collections import Counter
rows = conn.execute(
"SELECT co_authors FROM publications WHERE researcher_slug = ?",
(slug,)
).fetchall()
counter = Counter()
for row in rows:
authors = json.loads(row[0] or "[]")
counter.update(authors)
return counter.most_common(20)
def compare_researchers(conn: sqlite3.Connection, slugs: list) -> list:
"""Compare multiple researchers by their headline stats."""
results = []
for slug in slugs:
row = conn.execute(
"SELECT name, h_index, citations_total, rg_score FROM researchers WHERE slug=?",
(slug,)
).fetchone()
pub_count = conn.execute(
"SELECT COUNT(*) FROM publications WHERE researcher_slug=?",
(slug,)
).fetchone()[0]
if row:
results.append({
"slug": slug, "name": row[0], "h_index": row[1],
"citations": row[2], "rg_score": row[3], "pub_count": pub_count
})
return results
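The profile h-index reflects ResearchGate's full citation database, but you can compute one from whatever you actually scraped as a sanity check. If you only fetched the first few publication pages, this is a lower bound on the profile value, which makes large discrepancies a useful signal that parsing went wrong:

```python
def h_index_from_scraped(conn, slug):
    """
    Compute an h-index from per-paper citation counts stored in SQLite.
    This is a lower bound on the profile h-index when only some
    publication pages were scraped.
    """
    counts = [
        row[0] for row in conn.execute(
            "SELECT citations FROM publications "
            "WHERE researcher_slug = ? ORDER BY citations DESC",
            (slug,),
        )
    ]
    # h = largest h such that at least h papers have >= h citations
    h = 0
    for rank, citations in enumerate(counts, start=1):
        if (citations or 0) >= rank:
            h = rank
        else:
            break
    return h
```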
Practical Tips
Delays matter more than anything else. ResearchGate tracks inter-request timing. 15-30 seconds between researcher profiles is the safe range. Anything under 8 seconds per request will trigger rate limits within a few pages.
Rotate proxies per researcher, not per request. Switching IPs mid-session looks more suspicious than using one IP per researcher profile. Initialize a new session with a new proxy for each slug in your list.
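Wiring that into the pipeline can be as simple as pairing slugs with proxies round-robin before the main loop (a sketch; proxy_per_slug is a hypothetical helper and assumes a non-empty proxy list):

```python
from itertools import cycle


def proxy_per_slug(slugs, proxies):
    """
    Pair each researcher slug with the next proxy in a round-robin
    cycle, so every profile is scraped through exactly one exit IP.
    """
    pool = cycle(proxies)
    return [(slug, next(pool)) for slug in slugs]
```

Each `(slug, proxy)` pair then gets its own `make_session(proxy=proxy)` call.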
The RG Score is volatile. It updates frequently and the displayed value can differ between page loads depending on server caching. Scrape it multiple times and average if precision matters.
Cloudflare challenges increase after 10 PM UTC. ResearchGate's bot detection appears to run stricter rules during off-peak hours. Schedule heavy scraping runs during European or US business hours when real user traffic is highest.
Avoid recursive co-author graph scraping without rate controls. It is tempting to follow every co-author link and build a network graph. Each researcher profile is another full page fetch. A 3-hop network from a prolific researcher can mean thousands of requests.
Residential proxies are non-negotiable for this target. Datacenter IPs, even premium ones, get blocked by Cloudflare before the first response. ThorData routes traffic through genuine residential ISP addresses that ResearchGate's defenses don't flag. If you're hitting consistent 403s or empty Cloudflare challenge pages, the proxy type is almost always the cause.
Parse defensively. The HTML class names in ResearchGate's nova-legacy component library change occasionally. Write regex-based class selectors rather than exact matches so that minor CSS class renames don't break your parser.
Store raw HTML during development. While you're writing your parser, save the complete page source alongside your parsed output. This lets you debug selector failures without re-fetching and burning through your proxy budget.
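A minimal version of that raw-HTML cache might look like this (cache_html is a hypothetical helper; including a content hash in the filename is a design choice so repeated scrapes of the same profile sit side by side instead of overwriting):

```python
import hashlib
from pathlib import Path


def cache_html(html: str, slug: str, cache_dir: str = "raw_html") -> str:
    """
    Save raw page source so selector failures can be debugged offline
    without re-fetching. Returns the path of the written file.
    """
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha1(html.encode("utf-8")).hexdigest()[:10]
    path = directory / f"{slug}-{digest}.html"
    path.write_text(html, encoding="utf-8")
    return str(path)
```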
Legal Notes
ResearchGate's Terms of Service (Section 7.1) prohibit automated access and data extraction, and its robots.txt disallows most scraping paths. The robots.txt file is advisory and has no direct legal force, but the ToS creates a contract-based restriction on anyone who has accepted it.
Practically: individual researchers scraping their own citation data or doing small-scale academic research operate in a gray zone that ResearchGate has not historically enforced against. Building a commercial product that resells ResearchGate profile data is a different matter entirely.
For large-scale academic data needs, established datasets like OpenAlex (the successor to the discontinued Microsoft Academic Graph), the Semantic Scholar API, or Dimensions are legitimate alternatives with proper APIs. These are worth evaluating before investing in ResearchGate scraping infrastructure.
Always scrape only what you need, cache aggressively to avoid repeat fetches, and avoid placing load on the platform beyond what a determined human researcher would generate.