Scrape G2 Software Reviews: Ratings, Comparisons & GraphQL API (2026)
G2 is the largest B2B software review platform — 2 million+ reviews across 150,000+ products. The data is valuable for competitive intelligence, market research, and building comparison tools.
But G2 doesn't offer a public API. And their frontend is a React SPA that loads data via GraphQL. Traditional HTTP scraping won't get you far here. You need a browser — specifically one that can intercept network requests and extract structured GraphQL responses.
What You Can Extract
Each G2 product listing contains:
- Overall rating (0-5 stars) and sub-scores (ease of use, quality of support, ease of setup, meets requirements)
- Individual reviews with pros, cons, recommendations, and verified status
- Reviewer metadata — role, company size, industry, how long they've used the product
- Pricing tier information
- Category rankings — where the product sits in G2's grid (Leader, High Performer, Niche, Challenger)
- Comparison data — head-to-head feature comparisons between products
- Review count by star rating — distribution of 1-5 star ratings
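The fields above can be sketched as a flat record. A minimal Python model, where the field names are illustrative rather than G2's internal schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class G2Review:
    # Core review content
    rating: Optional[float] = None
    title: str = ""
    pros: str = ""
    cons: str = ""
    # Sub-scores (0-5, matching G2's review form)
    ease_of_use: Optional[float] = None
    quality_support: Optional[float] = None
    ease_of_setup: Optional[float] = None
    meets_requirements: Optional[float] = None
    # Reviewer metadata
    reviewer_role: str = ""
    company_size: str = ""
    industry: str = ""
    verified: bool = False
```

Having one record type for both DOM-extracted and GraphQL-intercepted reviews makes downstream storage and queries simpler.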
Why You Need a Browser
G2's frontend makes GraphQL requests to load review data. The HTML served on initial page load is a shell — reviews get injected by JavaScript. If you fetch the page with requests or httpx, you get an empty container with no reviews.
The good news: those GraphQL requests contain cleanly structured JSON. If you intercept them with Playwright, you skip HTML parsing entirely and get the data in a well-structured format that's less brittle than CSS selectors.
Anti-Bot Protections
G2 uses multiple detection layers:
- Imperva/Incapsula — JavaScript challenge on first visit, sets `__utmvc` and `__utmvb` cookies
- Browser fingerprinting — canvas fingerprint, WebGL renderer, navigator properties checked against bot signatures
- Behavioral analysis — mouse movements, scroll patterns, time on page monitored
- IP reputation scoring — datacenter IPs blocked, residential IPs throttled after about 50 requests per hour
Standard headless Chrome gets detected within 2-3 page loads. You need stealth patches and residential proxies. For proxy rotation, ThorData's residential pool handles G2 well — their IPs pass Imperva's reputation checks and the automatic rotation means you're not hammering from one address.
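Given the roughly-50-requests-per-hour throttle on residential IPs, it is worth pacing requests client-side as well. A minimal async limiter sketch; the budget number is this article's estimate, not a documented limit:

```python
import asyncio
import time

class HourlyRateLimiter:
    """Spaces requests evenly to stay under a per-hour budget."""

    def __init__(self, max_per_hour: int = 45):
        # Seconds that must elapse between consecutive requests
        self.min_interval = 3600.0 / max_per_hour
        self.last_request = 0.0

    async def wait(self):
        """Sleep just long enough to respect the budget, then record the time."""
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            await asyncio.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()
```

Call `await limiter.wait()` before each `page.goto()`; a default of 45/hour leaves headroom under the observed 50/hour ceiling.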
Setup
pip install playwright
playwright install chromium
The Scraper: GraphQL Interception
# g2_scraper.py
import json
import asyncio
import random
from playwright.async_api import async_playwright
PROXY_CONFIG = {
"server": "http://proxy.thordata.com:9000",
"username": "USER",
"password": "PASS",
}
class G2Scraper:
def __init__(self):
self.graphql_responses = []
async def intercept_response(self, response):
"""Capture GraphQL responses containing review data."""
url = response.url
if "graphql" in url or "gateway" in url or "g2.com/api" in url:
try:
if response.status == 200:
content_type = response.headers.get("content-type", "")
if "json" in content_type:
body = await response.json()
self.graphql_responses.append({
"url": url,
"data": body,
})
except Exception:
pass
async def scrape_product(
self,
slug: str,
max_pages: int = 5,
use_proxy: bool = True,
) -> dict:
"""
Scrape G2 product reviews via Playwright with GraphQL interception.
slug: the product URL slug, e.g. "hubspot-crm" or "salesforce-crm"
"""
async with async_playwright() as p:
launch_args = [
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-infobars",
"--window-size=1366,768",
]
browser = await p.chromium.launch(
headless=True,
args=launch_args,
)
context_kwargs = {
"viewport": {"width": 1366, "height": 768},
"user_agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
"locale": "en-US",
"timezone_id": "America/New_York",
"extra_http_headers": {
"Accept-Language": "en-US,en;q=0.9",
},
}
if use_proxy:
context_kwargs["proxy"] = PROXY_CONFIG
context = await browser.new_context(**context_kwargs)
# Remove automation indicators
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {
get: () => [
{name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer'},
{name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai'},
{name: 'Native Client', filename: 'internal-nacl-plugin'},
]
});
window.chrome = {
runtime: {},
loadTimes: function() {},
csi: function() {},
app: {},
};
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en']
});
""")
page = await context.new_page()
page.on("response", self.intercept_response)
product_data = {"slug": slug, "reviews": [], "metadata": {}}
for pg in range(1, max_pages + 1):
url = f"https://www.g2.com/products/{slug}/reviews"
if pg > 1:
url = f"https://www.g2.com/products/{slug}/reviews?page={pg}"
print(f" Page {pg}/{max_pages}: {url}")
try:
await page.goto(url, wait_until="networkidle", timeout=35000)
# Wait for review content
try:
await page.wait_for_selector(
'[itemprop="review"], [class*="review"], [data-test*="review"]',
timeout=15000
)
except Exception:
print(f" No review selector found on page {pg}")
# Human-like scroll to trigger lazy loading
await self._human_scroll(page)
# Extract from page DOM
page_data = await self.extract_page_data(page)
if pg == 1:
product_data["metadata"] = {
"rating": page_data.get("rating"),
"review_count": page_data.get("review_count"),
"product_name": page_data.get("product_name"),
}
reviews = page_data.get("reviews", [])
product_data["reviews"].extend(reviews)
print(f" DOM reviews: {len(reviews)}")
except Exception as e:
print(f" Page {pg} error: {e}")
if pg == 1:
break
await self.human_delay()
await browser.close()
return product_data
async def extract_page_data(self, page) -> dict:
"""Extract review data from current page DOM and JSON-LD."""
data = await page.evaluate("""
() => {
const reviews = [];
// Primary method: JSON-LD structured data
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
for (const s of scripts) {
try {
const d = JSON.parse(s.textContent);
if (d['@type'] === 'Product' && d.review) {
const reviewList = Array.isArray(d.review) ? d.review : [d.review];
for (const r of reviewList) {
reviews.push({
rating: r.reviewRating?.ratingValue,
best_rating: r.reviewRating?.bestRating,
title: r.name,
body: r.reviewBody,
author: r.author?.name,
date: r.datePublished,
source: 'json-ld',
});
}
}
} catch (e) {}
}
// Fallback: itemprop microdata
if (reviews.length === 0) {
document.querySelectorAll('[itemprop="review"]').forEach(el => {
const ratingEl = el.querySelector('[itemprop="ratingValue"]');
const titleEl = el.querySelector('[itemprop="name"]');
const bodyEl = el.querySelector('[itemprop="reviewBody"]');
const authorEl = el.querySelector('[itemprop="author"]');
reviews.push({
rating: ratingEl?.getAttribute('content') || ratingEl?.textContent,
title: titleEl?.textContent?.trim(),
body: bodyEl?.textContent?.trim(),
author: authorEl?.textContent?.trim(),
source: 'microdata',
});
});
}
// Product metadata
const ratingEl = document.querySelector('[itemprop="ratingValue"]');
const countEl = document.querySelector('[itemprop="reviewCount"]');
const nameEl = document.querySelector('[itemprop="name"] h1, h1.product-name');
return {
rating: ratingEl?.getAttribute('content') || ratingEl?.textContent?.trim(),
review_count: countEl?.getAttribute('content') || countEl?.textContent?.trim(),
product_name: nameEl?.textContent?.trim() || '',
reviews: reviews,
};
}
""")
return data or {}
async def _human_scroll(self, page):
"""Scroll through the page in a human-like pattern."""
total_height = await page.evaluate("() => document.body.scrollHeight")
current_pos = 0
while current_pos < total_height:
scroll_amount = random.randint(300, 700)
current_pos += scroll_amount
await page.evaluate(f"() => window.scrollBy(0, {scroll_amount})")
await asyncio.sleep(random.uniform(0.3, 0.8))
# Scroll back up
await page.evaluate("() => window.scrollTo(0, 0)")
async def human_delay(self):
"""Randomized delay mimicking human browsing patterns."""
await asyncio.sleep(random.uniform(4.0, 9.0))
def get_graphql_reviews(self) -> list:
"""Parse intercepted GraphQL data for review objects."""
reviews = []
for resp_obj in self.graphql_responses:
data = resp_obj.get("data", {})
self._find_reviews(data, reviews)
return reviews
def _find_reviews(self, obj, results: list):
"""Recursively search GraphQL response for review data."""
if isinstance(obj, dict):
# Check if this object looks like a review
has_review_fields = any(
k in obj for k in ["reviewBody", "starRating", "reviewRating", "pros", "cons"]
)
if has_review_fields:
results.append(obj)
else:
for v in obj.values():
self._find_reviews(v, results)
elif isinstance(obj, list):
for item in obj:
self._find_reviews(item, results)
Running the Scraper
async def scrape_multiple_products(
products: list,
db_path: str = "g2_reviews.db",
max_pages: int = 5,
):
"""
Scrape multiple G2 products.
products: list of dicts with 'slug' and 'name' keys
"""
conn = init_g2_db(db_path)
for product in products:
slug = product["slug"]
print(f"\nScraping {product['name']} ({slug})...")
scraper = G2Scraper()
try:
data = await scraper.scrape_product(slug, max_pages=max_pages)
# Save HTML-extracted reviews
dom_count = save_reviews(conn, slug, product["name"], data["reviews"])
# Save GraphQL-intercepted reviews
gql_reviews = scraper.get_graphql_reviews()
gql_count = save_graphql_reviews(conn, slug, product["name"], gql_reviews)
print(f" DOM reviews saved: {dom_count}")
print(f" GraphQL reviews saved: {gql_count}")
print(f" Rating: {data['metadata'].get('rating')}")
except Exception as e:
print(f" Error: {e}")
# Long pause between products
await asyncio.sleep(random.uniform(30.0, 60.0))
conn.close()
# Run it
async def main():
products = [
{"slug": "hubspot-crm", "name": "HubSpot CRM"},
{"slug": "salesforce-crm", "name": "Salesforce CRM"},
{"slug": "monday-com", "name": "Monday.com"},
{"slug": "notion", "name": "Notion"},
]
await scrape_multiple_products(products, max_pages=3)
if __name__ == "__main__":
    asyncio.run(main())
How GraphQL Interception Works
Instead of fighting with CSS selectors that G2 changes monthly, we intercept the network layer. When Playwright loads the page, G2's React frontend makes GraphQL requests to fetch review data. The intercept_response callback captures every GraphQL response as JSON.
The _find_reviews method walks the nested JSON looking for objects that contain review-like fields (reviewBody, starRating, pros, cons). This is resilient to schema changes — as long as those field names exist somewhere in the response, we find them without writing brittle CSS selectors.
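As a concrete illustration of that recursive walk, here is the same heuristic applied standalone to a mock payload shaped like a GraphQL response. The wrapper keys (`reviewConnection`, `edges`, `node`) are invented for the example; only the leaf field names matter:

```python
def find_reviews(obj, results):
    """Recursively collect dicts that carry review-like field names."""
    if isinstance(obj, dict):
        if any(k in obj for k in ("reviewBody", "starRating", "pros", "cons")):
            results.append(obj)
        else:
            for v in obj.values():
                find_reviews(v, results)
    elif isinstance(obj, list):
        for item in obj:
            find_reviews(item, results)

# Mock payload: nesting is invented, leaf shape is what the heuristic keys on
payload = {
    "data": {
        "product": {
            "reviewConnection": {
                "edges": [
                    {"node": {"starRating": 4.5, "pros": "Fast setup"}},
                    {"node": {"starRating": 3.0, "cons": "Pricey"}},
                ]
            }
        }
    }
}
found = []
find_reviews(payload, found)
print(len(found))  # → 2
```

No matter how deeply G2 nests the connection wrappers, the walk bottoms out at the two `node` objects because they are the first dicts on each path that contain a review-like key.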
You get data in two complementary ways:
1. JSON-LD structured data in the HTML (reliable, standards-compliant, sometimes limited to 10 reviews per page)
2. GraphQL responses (complete data, all fields G2 tracks internally including sub-ratings)
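Since the two sources overlap, you will usually want to deduplicate when combining them. One sketch, keying on a normalized title plus body prefix; this keying choice is an assumption, and a G2 review ID from the GraphQL payload would be more robust if one is present:

```python
def merge_reviews(dom_reviews: list, gql_reviews: list) -> list:
    """Merge DOM and GraphQL reviews, preferring GraphQL (richer fields)."""
    def key(r):
        # Normalize: lowercase title + first 80 chars of body text
        title = (r.get("title") or "").strip().lower()
        body = (r.get("body") or r.get("reviewBody") or "").strip().lower()
        return (title, body[:80])

    merged = {key(r): r for r in dom_reviews}
    merged.update({key(r): r for r in gql_reviews})  # GraphQL wins on collision
    # Drop records with neither title nor body
    return [r for k, r in merged.items() if k != ("", "")]
```

When both sources saw the same review, the GraphQL copy (with sub-ratings and reviewer metadata) replaces the thinner DOM copy.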
SQLite Storage
import sqlite3
import json
def init_g2_db(db_path: str = "g2_reviews.db") -> sqlite3.Connection:
"""Initialize database for G2 review data."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS products (
slug TEXT PRIMARY KEY,
name TEXT,
category TEXT,
overall_rating REAL,
review_count INTEGER,
g2_rank TEXT,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
product_slug TEXT NOT NULL,
reviewer_name TEXT,
reviewer_title TEXT,
reviewer_company_size TEXT,
reviewer_industry TEXT,
rating REAL,
ease_of_use REAL,
quality_support REAL,
ease_of_setup REAL,
meets_requirements REAL,
title TEXT,
pros TEXT,
cons TEXT,
review_body TEXT,
review_date TEXT,
verified INTEGER DEFAULT 0,
source TEXT DEFAULT 'dom',
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (product_slug) REFERENCES products(slug)
);
CREATE INDEX IF NOT EXISTS idx_reviews_product ON reviews (product_slug);
CREATE INDEX IF NOT EXISTS idx_reviews_rating ON reviews (rating);
CREATE INDEX IF NOT EXISTS idx_reviews_date ON reviews (review_date);
""")
conn.commit()
return conn
def save_reviews(conn: sqlite3.Connection, slug: str, name: str, reviews: list) -> int:
"""Save DOM-extracted reviews."""
# Ensure product record
conn.execute(
"INSERT OR IGNORE INTO products (slug, name) VALUES (?, ?)",
(slug, name)
)
saved = 0
for r in reviews:
if not r.get("body") and not r.get("title"):
continue
try:
conn.execute(
"""INSERT INTO reviews
(product_slug, reviewer_name, rating, title, review_body, review_date, source)
VALUES (?,?,?,?,?,?,?)""",
(slug, r.get("author"), r.get("rating"),
r.get("title"), r.get("body"), r.get("date"), r.get("source", "dom"))
)
saved += 1
except Exception:
pass
conn.commit()
return saved
def save_graphql_reviews(conn: sqlite3.Connection, slug: str, name: str, gql_reviews: list) -> int:
"""Save GraphQL-intercepted reviews with richer field set."""
conn.execute("INSERT OR IGNORE INTO products (slug, name) VALUES (?, ?)", (slug, name))
saved = 0
for r in gql_reviews:
try:
conn.execute(
"""INSERT INTO reviews
(product_slug, reviewer_name, reviewer_title, reviewer_company_size,
reviewer_industry, rating, ease_of_use, quality_support, ease_of_setup,
meets_requirements, title, pros, cons, review_body, review_date,
verified, source)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)""",
(
slug,
r.get("reviewerName") or r.get("author", {}).get("name"),
r.get("reviewerTitle") or r.get("jobTitle"),
r.get("companySize") or r.get("companySizeEnum"),
r.get("industry"),
r.get("starRating") or r.get("rating") or r.get("ratingValue"),
r.get("easeOfUse") or r.get("easeOfUseRating"),
r.get("qualityOfSupport") or r.get("supportRating"),
r.get("easeOfSetup") or r.get("setupRating"),
r.get("meetsRequirements") or r.get("requirementsRating"),
r.get("title") or r.get("reviewTitle"),
r.get("pros") or r.get("likedBest"),
r.get("cons") or r.get("dislikedMost"),
r.get("reviewBody") or r.get("body"),
r.get("reviewedAt") or r.get("publishedAt") or r.get("datePublished"),
int(r.get("isVerified", False) or r.get("verified", False)),
"graphql",
),
)
saved += 1
except Exception:
pass
conn.commit()
return saved
Competitive Analysis Queries
def compare_products(conn: sqlite3.Connection, slugs: list) -> list:
"""Compare products by aggregate review metrics."""
results = []
for slug in slugs:
row = conn.execute(
"""
SELECT
p.name,
COUNT(r.id) as review_count,
AVG(r.rating) as avg_rating,
AVG(r.ease_of_use) as avg_ease,
AVG(r.quality_support) as avg_support,
AVG(r.meets_requirements) as avg_meets,
SUM(CASE WHEN r.rating >= 4 THEN 1 ELSE 0 END) * 100.0 / COUNT(r.id) as pct_positive
FROM products p
LEFT JOIN reviews r ON r.product_slug = p.slug
WHERE p.slug = ?
GROUP BY p.slug
""",
(slug,)
).fetchone()
if row:
results.append({
"slug": slug,
"name": row[0],
"reviews": row[1],
"avg_rating": round(row[2] or 0, 2),
"ease_of_use": round(row[3] or 0, 2),
"support": round(row[4] or 0, 2),
"meets_requirements": round(row[5] or 0, 2),
"pct_positive": round(row[6] or 0, 1),
})
return sorted(results, key=lambda x: x["avg_rating"], reverse=True)
def sentiment_by_company_size(conn: sqlite3.Connection, product_slug: str) -> list:
"""Break down ratings by reviewer company size."""
rows = conn.execute(
"""
SELECT
reviewer_company_size,
COUNT(*) as count,
AVG(rating) as avg_rating
FROM reviews
WHERE product_slug = ?
AND reviewer_company_size IS NOT NULL
GROUP BY reviewer_company_size
ORDER BY avg_rating DESC
""",
(product_slug,)
).fetchall()
    return [{"size": r[0], "count": r[1], "avg_rating": round(r[2] or 0, 2)} for r in rows]
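The star-rating distribution mentioned at the top of this article is one GROUP BY away. A sketch against the same `reviews` schema:

```python
import sqlite3

def rating_distribution(conn: sqlite3.Connection, product_slug: str) -> dict:
    """Count reviews per rounded star bucket (1-5) for one product."""
    rows = conn.execute(
        """
        SELECT CAST(ROUND(rating) AS INTEGER) AS stars, COUNT(*)
        FROM reviews
        WHERE product_slug = ? AND rating IS NOT NULL
        GROUP BY stars
        """,
        (product_slug,),
    ).fetchall()
    # Start from an all-zero 1-5 histogram so empty buckets still appear
    dist = {s: 0 for s in range(1, 6)}
    dist.update({stars: n for stars, n in rows if stars in dist})
    return dist
```

A skewed distribution (lots of 5s and 1s, few in between) often signals incentivized reviews, which is itself useful competitive intelligence.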
Extracting Comparison Data
G2 has head-to-head comparison pages at /compare/slug-a-vs-slug-b:
async def scrape_comparison(slug_a: str, slug_b: str) -> dict:
"""Scrape G2 product comparison page."""
url = f"https://www.g2.com/compare/{slug_a}-vs-{slug_b}"
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True, args=[
"--disable-blink-features=AutomationControlled",
])
context = await browser.new_context(
proxy=PROXY_CONFIG,
viewport={"width": 1366, "height": 768},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
)
await context.add_init_script(
"Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)
page = await context.new_page()
await page.goto(url, wait_until="networkidle", timeout=30000)
comparison_data = await page.evaluate("""
() => {
const rows = [];
document.querySelectorAll('tr').forEach(row => {
const cells = Array.from(row.querySelectorAll('td, th'));
if (cells.length >= 2) {
rows.push({
feature: cells[0]?.textContent?.trim(),
product_a: cells[1]?.textContent?.trim(),
product_b: cells[2]?.textContent?.trim() || null,
});
}
});
return rows.filter(r => r.feature && r.feature.length > 0);
}
""")
await browser.close()
return {
"slug_a": slug_a,
"slug_b": slug_b,
"url": url,
"comparison_rows": comparison_data,
}
Practical Tips
Start with JSON-LD. Always check structured data before writing DOM selectors. G2 embeds enough for basic product analysis, and it's far more stable than class-based selectors.
Intercept GraphQL for production. The network interception approach gives you cleaner data than HTML parsing. G2's component structure changes frequently; GraphQL field names change less often.
Respect rate limits. 50 products per hour is a safe ceiling with residential proxies. If you need more throughput, run across multiple sessions spread over hours with different IP ranges.
Cache aggressively. G2 reviews don't change hourly. Scrape once, update weekly. Store raw GraphQL responses alongside parsed data so you can re-parse when the field mapping changes.
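Storing the raw payloads takes one extra table. A sketch in which the table name and columns are my own, not part of the scraper above:

```python
import json
import sqlite3

def init_raw_cache(conn: sqlite3.Connection):
    """Table for raw GraphQL payloads, so parsing can be redone offline."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS raw_graphql (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            product_slug TEXT NOT NULL,
            url TEXT,
            payload TEXT NOT NULL,
            captured_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()

def save_raw(conn: sqlite3.Connection, slug: str, responses: list):
    """Persist each intercepted {'url': ..., 'data': ...} object as JSON."""
    for resp in responses:
        conn.execute(
            "INSERT INTO raw_graphql (product_slug, url, payload) VALUES (?, ?, ?)",
            (slug, resp.get("url"), json.dumps(resp.get("data"))),
        )
    conn.commit()
```

When G2 renames a field, you re-run the parser over `raw_graphql` instead of re-scraping, which burns no proxy budget.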
Monitor for challenge pages. If your review count drops to zero on a page that should have reviews, check if the HTML contains Imperva challenge content. Rotate your proxy and add a longer cooldown before retrying.
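A cheap heuristic for that check, looking for strings that commonly appear in Imperva/Incapsula challenge pages. The marker list is an assumption drawn from typical challenge HTML, not a guarantee:

```python
# Markers commonly seen in Imperva/Incapsula challenge pages (assumed, not exhaustive)
CHALLENGE_MARKERS = (
    "_Incapsula_Resource",
    "Incapsula incident ID",
    "Request unsuccessful",
)

def looks_like_challenge(html: str) -> bool:
    """True if the HTML resembles an Imperva/Incapsula challenge page."""
    lowered = html.lower()
    return any(m.lower() in lowered for m in CHALLENGE_MARKERS)
```

Run it on `await page.content()` whenever a page yields zero reviews; a positive hit means rotate the proxy and back off rather than retrying immediately.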
Check for login requirements. Some G2 data (detailed reviewer profiles, company information beyond a certain depth) requires a G2 account. Public review content doesn't — that's accessible without authentication.
Legal Notes
G2's Terms of Use (Section 3) prohibit automated access and data extraction. Their data is not public domain — G2 owns the compilation rights to the review database they've assembled.
For commercial applications requiring G2 data at scale, G2 offers a paid API through their partner program. This is the legally appropriate route for market intelligence products or any application that resells or publishes G2 data.
For research and competitive analysis for internal business use, the scraping approach described here is widely used but operates outside G2's terms. Understand the legal exposure before using this in a commercial context.