Scrape G2 Software Reviews: Ratings, Comparisons & GraphQL API (2026)
G2 is the largest B2B software review platform — 2 million+ reviews across 150,000+ products. The data is valuable for competitive intelligence, market research, and building comparison tools.
But G2 doesn't offer a public API. And their frontend is a React SPA that loads data via GraphQL. Traditional HTTP scraping won't get you far here. You need a browser — specifically one that can intercept network requests and extract structured GraphQL responses.
What You Can Extract
Each G2 product listing contains:
- Overall rating (0-5 stars) and sub-scores (ease of use, quality of support, ease of setup, meets requirements)
- Individual reviews with pros, cons, recommendations, and verified status
- Reviewer metadata — role, company size, industry, how long they've used the product
- Pricing tier information
- Category rankings — where the product sits in G2's grid (Leader, High Performer, Niche, Challenger)
- Comparison data — head-to-head feature comparisons between products
- Review count by star rating — distribution of 1-5 star ratings
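The fields above can be sketched as a flat record. A minimal Python model, where the field names are illustrative rather than G2's internal schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class G2Review:
    # Core review content
    rating: Optional[float] = None
    title: str = ""
    pros: str = ""
    cons: str = ""
    # Sub-scores (0-5, matching G2's review form)
    ease_of_use: Optional[float] = None
    quality_support: Optional[float] = None
    ease_of_setup: Optional[float] = None
    meets_requirements: Optional[float] = None
    # Reviewer metadata
    reviewer_role: str = ""
    company_size: str = ""
    industry: str = ""
    verified: bool = False
```

Having one record type for both DOM-extracted and GraphQL-intercepted reviews makes downstream storage and queries simpler.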
Why You Need a Browser
G2's frontend makes GraphQL requests to load review data. The HTML served on initial page load is a shell — reviews get injected by JavaScript. If you fetch the page with requests or httpx, you get an empty container with no reviews.
The good news: those GraphQL requests contain cleanly structured JSON. If you intercept them with Playwright, you skip HTML parsing entirely and get the data in a well-structured format that's less brittle than CSS selectors.
Anti-Bot Protections
G2 uses multiple detection layers:
- Imperva/Incapsula — JavaScript challenge on first visit, sets `__utmvc` and `__utmvb` cookies
- Browser fingerprinting — canvas fingerprint, WebGL renderer, navigator properties checked against bot signatures
- Behavioral analysis — mouse movements, scroll patterns, time on page monitored
- IP reputation scoring — datacenter IPs blocked, residential IPs throttled after about 50 requests per hour
Standard headless Chrome gets detected within 2-3 page loads. You need stealth patches and residential proxies. For proxy rotation, ThorData's residential pool handles G2 well — their IPs pass Imperva's reputation checks and the automatic rotation means you're not hammering from one address.
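Given the roughly-50-requests-per-hour throttle on residential IPs, it is worth pacing requests client-side as well. A minimal async limiter sketch; the budget number is this article's estimate, not a documented limit:

```python
import asyncio
import time

class HourlyRateLimiter:
    """Spaces requests evenly to stay under a per-hour budget."""

    def __init__(self, max_per_hour: int = 45):
        # Seconds that must elapse between consecutive requests
        self.min_interval = 3600.0 / max_per_hour
        self.last_request = 0.0

    async def wait(self):
        """Sleep just long enough to respect the budget, then record the time."""
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()
```

Call `await limiter.wait()` before each `page.goto()`; a default of 45/hour leaves headroom under the observed 50/hour ceiling.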
Setup
pip install playwright
playwright install chromium
The Scraper: GraphQL Interception
# g2_scraper.py
import json
import asyncio
import random
from playwright.async_api import async_playwright
PROXY_CONFIG = {
"server": "http://proxy.thordata.com:9000",
"username": "USER",
"password": "PASS",
}
class G2Scraper:
def __init__(self):
self.graphql_responses = []
async def intercept_response(self, response):
"""Capture GraphQL responses containing review data."""
url = response.url
if "graphql" in url or "gateway" in url or "g2.com/api" in url:
try:
if response.status == 200:
content_type = response.headers.get("content-type", "")
if "json" in content_type:
body = await response.json()
self.graphql_responses.append({
"url": url,
"data": body,
})
except Exception:
pass
async def scrape_product(
self,
slug: str,
max_pages: int = 5,
use_proxy: bool = True,
) -> dict:
"""
Scrape G2 product reviews via Playwright with GraphQL interception.
slug: the product URL slug, e.g. "hubspot-crm" or "salesforce-crm"
"""
async with async_playwright() as p:
launch_args = [
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-infobars",
"--window-size=1366,768",
]
browser = await p.chromium.launch(
headless=True,
args=launch_args,
)
context_kwargs = {
"viewport": {"width": 1366, "height": 768},
"user_agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
"locale": "en-US",
"timezone_id": "America/New_York",
"extra_http_headers": {
"Accept-Language": "en-US,en;q=0.9",
},
}
if use_proxy:
context_kwargs["proxy"] = PROXY_CONFIG
context = await browser.new_context(**context_kwargs)
# Remove automation indicators
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {
get: () => [
{name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer'},
{name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai'},
{name: 'Native Client', filename: 'internal-nacl-plugin'},
]
});
window.chrome = {
runtime: {},
loadTimes: function() {},
csi: function() {},
app: {},
};
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en']
});
""")
page = await context.new_page()
page.on("response", self.intercept_response)
product_data = {"slug": slug, "reviews": [], "metadata": {}}
for pg in range(1, max_pages + 1):
url = f"https://www.g2.com/products/{slug}/reviews"
if pg > 1:
url = f"https://www.g2.com/products/{slug}/reviews?page={pg}"
print(f" Page {pg}/{max_pages}: {url}")
try:
await page.goto(url, wait_until="networkidle", timeout=35000)
# Wait for review content
try:
await page.wait_for_selector(
'[itemprop="review"], [class*="review"], [data-test*="review"]',
timeout=15000
)
except Exception:
print(f" No review selector found on page {pg}")
# Human-like scroll to trigger lazy loading
await self._human_scroll(page)
# Extract from page DOM
page_data = await self.extract_page_data(page)
if pg == 1:
product_data["metadata"] = {
"rating": page_data.get("rating"),
"review_count": page_data.get("review_count"),
"product_name": page_data.get("product_name"),
}
reviews = page_data.get("reviews", [])
product_data["reviews"].extend(reviews)
print(f" DOM reviews: {len(reviews)}")
except Exception as e:
print(f" Page {pg} error: {e}")
if pg == 1:
break
await self.human_delay()
await browser.close()
return product_data
async def extract_page_data(self, page) -> dict:
"""Extract review data from current page DOM and JSON-LD."""
data = await page.evaluate("""
() => {
const reviews = [];
// Primary method: JSON-LD structured data
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
for (const s of scripts) {
try {
const d = JSON.parse(s.textContent);
if (d['@type'] === 'Product' && d.review) {
const reviewList = Array.isArray(d.review) ? d.review : [d.review];
for (const r of reviewList) {
reviews.push({
rating: r.reviewRating?.ratingValue,
best_rating: r.reviewRating?.bestRating,
title: r.name,
body: r.reviewBody,
author: r.author?.name,
date: r.datePublished,
source: 'json-ld',
});
}
}
} catch (e) {}
}
// Fallback: itemprop microdata
if (reviews.length === 0) {
document.querySelectorAll('[itemprop="review"]').forEach(el => {
const ratingEl = el.querySelector('[itemprop="ratingValue"]');
const titleEl = el.querySelector('[itemprop="name"]');
const bodyEl = el.querySelector('[itemprop="reviewBody"]');
const authorEl = el.querySelector('[itemprop="author"]');
reviews.push({
rating: ratingEl?.getAttribute('content') || ratingEl?.textContent,
title: titleEl?.textContent?.trim(),
body: bodyEl?.textContent?.trim(),
author: authorEl?.textContent?.trim(),
source: 'microdata',
});
});
}
// Product metadata
const ratingEl = document.querySelector('[itemprop="ratingValue"]');
const countEl = document.querySelector('[itemprop="reviewCount"]');
const nameEl = document.querySelector('[itemprop="name"] h1, h1.product-name');
return {
rating: ratingEl?.getAttribute('content') || ratingEl?.textContent?.trim(),
review_count: countEl?.getAttribute('content') || countEl?.textContent?.trim(),
product_name: nameEl?.textContent?.trim() || '',
reviews: reviews,
};
}
""")
return data or {}
async def _human_scroll(self, page):
"""Scroll through the page in a human-like pattern."""
total_height = await page.evaluate("() => document.body.scrollHeight")
current_pos = 0
while current_pos < total_height:
scroll_amount = random.randint(300, 700)
current_pos += scroll_amount
await page.evaluate(f"() => window.scrollBy(0, {scroll_amount})")
await asyncio.sleep(random.uniform(0.3, 0.8))
# Scroll back up
await page.evaluate("() => window.scrollTo(0, 0)")
async def human_delay(self):
"""Randomized delay mimicking human browsing patterns."""
await asyncio.sleep(random.uniform(4.0, 9.0))
def get_graphql_reviews(self) -> list:
"""Parse intercepted GraphQL data for review objects."""
reviews = []
for resp_obj in self.graphql_responses:
data = resp_obj.get("data", {})
self._find_reviews(data, reviews)
return reviews
def _find_reviews(self, obj, results: list):
"""Recursively search GraphQL response for review data."""
if isinstance(obj, dict):
# Check if this object looks like a review
has_review_fields = any(
k in obj for k in ["reviewBody", "starRating", "reviewRating", "pros", "cons"]
)
if has_review_fields:
results.append(obj)
else:
for v in obj.values():
self._find_reviews(v, results)
elif isinstance(obj, list):
for item in obj:
self._find_reviews(item, results)
Running the Scraper
async def scrape_multiple_products(
products: list,
db_path: str = "g2_reviews.db",
max_pages: int = 5,
):
"""
Scrape multiple G2 products.
products: list of dicts with 'slug' and 'name' keys
"""
conn = init_g2_db(db_path)
for product in products:
slug = product["slug"]
print(f"\nScraping {product['name']} ({slug})...")
scraper = G2Scraper()
try:
data = await scraper.scrape_product(slug, max_pages=max_pages)
# Save HTML-extracted reviews
dom_count = save_reviews(conn, slug, product["name"], data["reviews"])
# Save GraphQL-intercepted reviews
gql_reviews = scraper.get_graphql_reviews()
gql_count = save_graphql_reviews(conn, slug, product["name"], gql_reviews)
print(f" DOM reviews saved: {dom_count}")
print(f" GraphQL reviews saved: {gql_count}")
print(f" Rating: {data['metadata'].get('rating')}")
except Exception as e:
print(f" Error: {e}")
# Long pause between products
await asyncio.sleep(random.uniform(30.0, 60.0))
conn.close()
# Run it
async def main():
products = [
{"slug": "hubspot-crm", "name": "HubSpot CRM"},
{"slug": "salesforce-crm", "name": "Salesforce CRM"},
{"slug": "monday-com", "name": "Monday.com"},
{"slug": "notion", "name": "Notion"},
]
await scrape_multiple_products(products, max_pages=3)
if __name__ == "__main__":
    asyncio.run(main())
How GraphQL Interception Works
Instead of fighting with CSS selectors that G2 changes monthly, we intercept the network layer. When Playwright loads the page, G2's React frontend makes GraphQL requests to fetch review data. The intercept_response callback captures every GraphQL response as JSON.
The _find_reviews method walks the nested JSON looking for objects that contain review-like fields (reviewBody, starRating, pros, cons). This is resilient to schema changes — as long as those field names exist somewhere in the response, we find them without writing brittle CSS selectors.
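As a concrete illustration of that recursive walk, here is the same heuristic applied standalone to a mock payload shaped like a GraphQL response. The wrapper keys (`reviewConnection`, `edges`, `node`) are invented for the example; only the leaf field names matter:

```python
def find_reviews(obj, results):
    """Recursively collect dicts that carry review-like field names."""
    if isinstance(obj, dict):
        if any(k in obj for k in ("reviewBody", "starRating", "pros", "cons")):
            results.append(obj)
        else:
            for v in obj.values():
                find_reviews(v, results)
    elif isinstance(obj, list):
        for item in obj:
            find_reviews(item, results)

# Mock payload: nesting is invented, leaf shape is what the heuristic keys on
payload = {
    "data": {
        "product": {
            "reviewConnection": {
                "edges": [
                    {"node": {"starRating": 4.5, "pros": "Fast setup"}},
                    {"node": {"starRating": 3.0, "cons": "Pricey"}},
                ]
            }
        }
    }
}
found = []
find_reviews(payload, found)
print(len(found))  # → 2
```

No matter how deeply G2 nests the connection wrappers, the walk bottoms out at the two `node` objects because they are the first dicts on each path that contain a review-like key.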
You get data in two complementary ways:
1. JSON-LD structured data in the HTML (reliable, standards-compliant, sometimes limited to 10 reviews per page)
2. GraphQL responses (complete data, all fields G2 tracks internally including sub-ratings)
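Since the two sources overlap, you will usually want to deduplicate when combining them. One sketch, keying on a normalized title plus body prefix; this keying choice is an assumption, and a G2 review ID from the GraphQL payload would be more robust if one is present:

```python
def merge_reviews(dom_reviews: list, gql_reviews: list) -> list:
    """Merge DOM and GraphQL reviews, preferring GraphQL (richer fields)."""
    def key(r):
        # Normalize: lowercase title + first 80 chars of body text
        title = (r.get("title") or "").strip().lower()
        body = (r.get("body") or r.get("reviewBody") or "").strip().lower()
        return (title, body[:80])

    merged = {key(r): r for r in dom_reviews}
    merged.update({key(r): r for r in gql_reviews})  # GraphQL wins on collision
    # Drop records with neither title nor body
    return [r for k, r in merged.items() if k != ("", "")]
```

When both sources saw the same review, the GraphQL copy (with sub-ratings and reviewer metadata) replaces the thinner DOM copy.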
SQLite Storage
import sqlite3
import json
def init_g2_db(db_path: str = "g2_reviews.db") -> sqlite3.Connection:
"""Initialize database for G2 review data."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS products (
slug TEXT PRIMARY KEY,
name TEXT,
category TEXT,
overall_rating REAL,
review_count INTEGER,
g2_rank TEXT,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
product_slug TEXT NOT NULL,
reviewer_name TEXT,
reviewer_title TEXT,
reviewer_company_size TEXT,
reviewer_industry TEXT,
rating REAL,
ease_of_use REAL,
quality_support REAL,
ease_of_setup REAL,
meets_requirements REAL,
title TEXT,
pros TEXT,
cons TEXT,
review_body TEXT,
review_date TEXT,
verified INTEGER DEFAULT 0,
source TEXT DEFAULT 'dom',
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (product_slug) REFERENCES products(slug)
);
CREATE INDEX IF NOT EXISTS idx_reviews_product ON reviews (product_slug);
CREATE INDEX IF NOT EXISTS idx_reviews_rating ON reviews (rating);
CREATE INDEX IF NOT EXISTS idx_reviews_date ON reviews (review_date);
""")
conn.commit()
return conn
def save_reviews(conn: sqlite3.Connection, slug: str, name: str, reviews: list) -> int:
"""Save DOM-extracted reviews."""
# Ensure product record
conn.execute(
"INSERT OR IGNORE INTO products (slug, name) VALUES (?, ?)",
(slug, name)
)
saved = 0
for r in reviews:
if not r.get("body") and not r.get("title"):
continue
try:
conn.execute(
"""INSERT INTO reviews
(product_slug, reviewer_name, rating, title, review_body, review_date, source)
VALUES (?,?,?,?,?,?,?)""",
(slug, r.get("author"), r.get("rating"),
r.get("title"), r.get("body"), r.get("date"), r.get("source", "dom"))
)
saved += 1
except Exception:
pass
conn.commit()
return saved
def save_graphql_reviews(conn: sqlite3.Connection, slug: str, name: str, gql_reviews: list) -> int:
"""Save GraphQL-intercepted reviews with richer field set."""
conn.execute("INSERT OR IGNORE INTO products (slug, name) VALUES (?, ?)", (slug, name))
saved = 0
for r in gql_reviews:
try:
conn.execute(
"""INSERT INTO reviews
(product_slug, reviewer_name, reviewer_title, reviewer_company_size,
reviewer_industry, rating, ease_of_use, quality_support, ease_of_setup,
meets_requirements, title, pros, cons, review_body, review_date,
verified, source)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
slug,
r.get("reviewerName") or r.get("author", {}).get("name"),
r.get("reviewerTitle") or r.get("jobTitle"),
r.get("companySize") or r.get("companySizeEnum"),
r.get("industry"),
r.get("starRating") or r.get("rating") or r.get("ratingValue"),
r.get("easeOfUse") or r.get("easeOfUseRating"),
r.get("qualityOfSupport") or r.get("supportRating"),
r.get("easeOfSetup") or r.get("setupRating"),
r.get("meetsRequirements") or r.get("requirementsRating"),
r.get("title") or r.get("reviewTitle"),
r.get("pros") or r.get("likedBest"),
r.get("cons") or r.get("dislikedMost"),
r.get("reviewBody") or r.get("body"),
r.get("reviewedAt") or r.get("publishedAt") or r.get("datePublished"),
int(r.get("isVerified", False) or r.get("verified", False)),
"graphql",
),
)
saved += 1
except Exception:
pass
conn.commit()
return saved
Competitive Analysis Queries
def compare_products(conn: sqlite3.Connection, slugs: list) -> list:
"""Compare products by aggregate review metrics."""
results = []
for slug in slugs:
row = conn.execute(
"""
SELECT
p.name,
COUNT(r.id) as review_count,
AVG(r.rating) as avg_rating,
AVG(r.ease_of_use) as avg_ease,
AVG(r.quality_support) as avg_support,
AVG(r.meets_requirements) as avg_meets,
SUM(CASE WHEN r.rating >= 4 THEN 1 ELSE 0 END) * 100.0 / COUNT(r.id) as pct_positive
FROM products p
LEFT JOIN reviews r ON r.product_slug = p.slug
WHERE p.slug = ?
GROUP BY p.slug
""",
(slug,)
).fetchone()
if row:
results.append({
"slug": slug,
"name": row[0],
"reviews": row[1],
"avg_rating": round(row[2] or 0, 2),
"ease_of_use": round(row[3] or 0, 2),
"support": round(row[4] or 0, 2),
"meets_requirements": round(row[5] or 0, 2),
"pct_positive": round(row[6] or 0, 1),
})
return sorted(results, key=lambda x: x["avg_rating"], reverse=True)
def sentiment_by_company_size(conn: sqlite3.Connection, product_slug: str) -> list:
"""Break down ratings by reviewer company size."""
rows = conn.execute(
"""
SELECT
reviewer_company_size,
COUNT(*) as count,
AVG(rating) as avg_rating
FROM reviews
WHERE product_slug = ?
AND reviewer_company_size IS NOT NULL
GROUP BY reviewer_company_size
ORDER BY avg_rating DESC
""",
(product_slug,)
).fetchall()
    return [{"size": r[0], "count": r[1], "avg_rating": round(r[2] or 0, 2)} for r in rows]
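The star-rating distribution mentioned at the top of this article is one GROUP BY away. A sketch against the same `reviews` schema:

```python
import sqlite3

def rating_distribution(conn: sqlite3.Connection, product_slug: str) -> dict:
    """Count reviews per rounded star bucket (1-5) for one product."""
    rows = conn.execute(
        """
        SELECT CAST(ROUND(rating) AS INTEGER) AS stars, COUNT(*)
        FROM reviews
        WHERE product_slug = ? AND rating IS NOT NULL
        GROUP BY stars
        """,
        (product_slug,),
    ).fetchall()
    # Start from an all-zero 1-5 histogram so empty buckets still appear
    dist = {s: 0 for s in range(1, 6)}
    dist.update({stars: n for stars, n in rows if stars in dist})
    return dist
```

A skewed distribution (lots of 5s and 1s, few in between) often signals incentivized reviews, which is itself useful competitive intelligence.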
Extracting Comparison Data
G2 has head-to-head comparison pages at /compare/slug-a-vs-slug-b:
async def scrape_comparison(slug_a: str, slug_b: str) -> dict:
"""Scrape G2 product comparison page."""
url = f"https://www.g2.com/compare/{slug_a}-vs-{slug_b}"
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True, args=[
"--disable-blink-features=AutomationControlled",
])
context = await browser.new_context(
proxy=PROXY_CONFIG,
viewport={"width": 1366, "height": 768},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
)
await context.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)
page = await context.new_page()
await page.goto(url, wait_until="networkidle", timeout=30000)
comparison_data = await page.evaluate("""
() => {
const rows = [];
document.querySelectorAll('tr').forEach(row => {
const cells = Array.from(row.querySelectorAll('td, th'));
if (cells.length >= 2) {
rows.push({
feature: cells[0]?.textContent?.trim(),
product_a: cells[1]?.textContent?.trim(),
product_b: cells[2]?.textContent?.trim() || null,
});
}
});
return rows.filter(r => r.feature && r.feature.length > 0);
}
""")
await browser.close()
return {
"slug_a": slug_a,
"slug_b": slug_b,
"url": url,
"comparison_rows": comparison_data,
}
Practical Tips
Start with JSON-LD. Always check structured data before writing DOM selectors. G2 embeds enough for basic product analysis, and it's far more stable than class-based selectors.
Intercept GraphQL for production. The network interception approach gives you cleaner data than HTML parsing. G2's component structure changes frequently; GraphQL field names change less often.
Respect rate limits. 50 products per hour is a safe ceiling with residential proxies. If you need more throughput, run across multiple sessions spread over hours with different IP ranges.
Cache aggressively. G2 reviews don't change hourly. Scrape once, update weekly. Store raw GraphQL responses alongside parsed data so you can re-parse when the field mapping changes.
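Storing the raw payloads takes one extra table. A sketch in which the table name and columns are my own, not part of the scraper above:

```python
import json
import sqlite3

def init_raw_cache(conn: sqlite3.Connection):
    """Table for raw GraphQL payloads, so parsing can be redone offline."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS raw_graphql (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            product_slug TEXT NOT NULL,
            url TEXT,
            payload TEXT NOT NULL,
            captured_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()

def save_raw(conn: sqlite3.Connection, slug: str, responses: list):
    """Persist each intercepted {'url': ..., 'data': ...} object as JSON."""
    for resp in responses:
        conn.execute(
            "INSERT INTO raw_graphql (product_slug, url, payload) VALUES (?, ?, ?)",
            (slug, resp.get("url"), json.dumps(resp.get("data"))),
        )
    conn.commit()
```

When G2 renames a field, you re-run the parser over `raw_graphql` instead of re-scraping, which burns no proxy budget.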
Monitor for challenge pages. If your review count drops to zero on a page that should have reviews, check if the HTML contains Imperva challenge content. Rotate your proxy and add a longer cooldown before retrying.
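A cheap heuristic for that check, looking for strings that commonly appear in Imperva/Incapsula challenge pages. The marker list is an assumption drawn from typical challenge HTML, not a guarantee:

```python
# Markers commonly seen in Imperva/Incapsula challenge pages (assumed, not exhaustive)
CHALLENGE_MARKERS = (
    "_Incapsula_Resource",
    "Incapsula incident ID",
    "Request unsuccessful",
)

def looks_like_challenge(html: str) -> bool:
    """True if the HTML resembles an Imperva/Incapsula challenge page."""
    lowered = html.lower()
    return any(m.lower() in lowered for m in CHALLENGE_MARKERS)
```

Run it on `await page.content()` whenever a page yields zero reviews; a positive hit means rotate the proxy and back off rather than retrying immediately.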
Check for login requirements. Some G2 data (detailed reviewer profiles, company information beyond a certain depth) requires a G2 account. Public review content doesn't — that's accessible without authentication.
Legal Notes
G2's Terms of Use (Section 3) prohibit automated access and data extraction. Their data is not public domain — G2 owns the compilation rights to the review database they've assembled.
For commercial applications requiring G2 data at scale, G2 offers a paid API through their partner program. This is the legally appropriate route for market intelligence products or any application that resells or publishes G2 data.
For research and competitive analysis for internal business use, the scraping approach described here is widely used but operates outside G2's terms. Understand the legal exposure before using this in a commercial context.