Scrape Capterra Reviews: Software Ratings, Pricing & Comparisons (2026)
Capterra lists 100,000+ software products with user reviews, pricing tiers, and feature comparisons. If you're building a SaaS comparison tool, tracking competitor sentiment, or doing market research — that data is gold.
The problem: Capterra actively blocks scrapers. Cloudflare protection, rate limiting, JavaScript-rendered content. This guide shows you how to extract what you need without getting blocked, covering everything from structured data extraction to bulk review collection and ongoing monitoring.
What Data Is Available
Each Capterra product page contains:
- Overall rating (1–5 stars) plus sub-ratings (ease of use, customer service, features, value for money)
- Individual reviews with full text, pros/cons, reviewer role, company size, and usage frequency
- Pricing — free tier availability, starting price, pricing model (per user/month, flat, etc.)
- Product details — categories, deployment options (cloud/on-premise/mobile), supported platforms
- Feature list — capabilities and integrations
- Alternatives — competitor products Capterra recommends
- Awards — Capterra Shortlist rankings, GetApp awards
Reviews paginate at 25 per page. A popular product like Salesforce has 20,000+ reviews — that's 800+ pages' worth of data.
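At 25 reviews per page, the request budget is simple arithmetic; a quick sketch (the per-request timing is an illustrative assumption):

```python
import math

REVIEWS_PER_PAGE = 25  # Capterra's review pagination size

def pages_needed(total_reviews: int) -> int:
    """How many paginated requests a full review scrape requires."""
    return math.ceil(total_reviews / REVIEWS_PER_PAGE)

print(pages_needed(20_000))             # 800 pages for a Salesforce-sized product
print(pages_needed(20_000) * 3.5 / 60)  # ≈46.7 minutes at an assumed ~3.5 s per request
```

Knowing this up front tells you whether one polite session is enough or you need to split work across sessions.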
Anti-Bot Measures on Capterra
Capterra uses several layers of protection:
- Cloudflare Bot Management — fingerprints your TLS handshake, checks browser headers, and serves JavaScript challenges. The standard requests library fails here because it speaks HTTP/1.1 with a non-browser TLS fingerprint.
- Rate limiting — more than ~30 requests/minute from one IP triggers blocks (429s or silent redirects to a CAPTCHA)
- Session validation — cookies must persist across requests; fresh cookieless requests get challenged
- Dynamic class names — CSS class names like ReviewCard__reviewTitle--3kJ2A change on every deployment
You need rotating residential proxies to maintain access. Datacenter IPs (AWS, GCP, VPS providers) get flagged almost immediately. ThorData's residential proxy pool works well here — their rotating IPs cover 195+ countries and handle the Cloudflare challenge layer without getting burned after a few requests.
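One common pattern, sketched here with placeholder gateway URLs (check your provider's documentation for the real session-rotation syntax), is to cycle through session-keyed proxy endpoints and advance to the next one whenever a session gets blocked:

```python
import itertools

# Placeholder gateways; rotating residential providers typically hand out
# a fresh exit IP per session username or per new connection.
PROXY_POOL = [
    "http://USER-session-1:[email protected]:9000",
    "http://USER-session-2:[email protected]:9000",
    "http://USER-session-3:[email protected]:9000",
]
_rotation = itertools.cycle(PROXY_POOL)

def next_proxy() -> str:
    """Round-robin over the pool; call this when a session gets challenged."""
    return next(_rotation)

print(next_proxy())  # first gateway
print(next_proxy())  # second gateway
```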
Setup
pip install httpx selectolax fake-useragent
We're using httpx instead of requests for its HTTP/2 support (HTTP/1.1-only clients stand out to Cloudflare's fingerprinting) and selectolax for fast HTML parsing; fake-useragent supplies current real-browser User-Agent strings.
Core Scraper Architecture
import httpx
import json
import time
import random
import sqlite3
from datetime import datetime, timezone
from selectolax.parser import HTMLParser
from fake_useragent import UserAgent
ua = UserAgent()
PROXY_URL = "http://USER:[email protected]:9000"
def get_client(proxy_url: str | None = None, rotate_ua: bool = True) -> httpx.Client:
"""Create HTTP client with browser-like settings."""
return httpx.Client(
proxy=proxy_url,
headers={
"User-Agent": ua.chrome if rotate_ua else (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
},
follow_redirects=True,
timeout=30.0,
http2=True,
)
Extracting Product Metadata
The most reliable data source is the JSON-LD structured data that Capterra embeds for SEO. Check for it first before falling back to HTML selectors:
def scrape_product_metadata(
client: httpx.Client,
product_path: str,
) -> dict:
"""
Extract product metadata from a Capterra product page.
product_path format: "158900/Slack" (ID/slug)
"""
url = f"https://www.capterra.com/p/{product_path}/"
resp = client.get(url)
# Detect Cloudflare block
if "cf-browser-verification" in resp.text or resp.status_code == 403:
raise RuntimeError("Cloudflare challenge — rotate IP and retry")
resp.raise_for_status()
tree = HTMLParser(resp.text)
# Try JSON-LD structured data first (most stable)
for script in tree.css('script[type="application/ld+json"]'):
try:
data = json.loads(script.text())
if data.get("@type") == "SoftwareApplication":
rating = data.get("aggregateRating", {})
offers = data.get("offers", {})
return {
"name": data.get("name"),
"description": data.get("description"),
"url": url,
"rating": rating.get("ratingValue"),
"review_count": rating.get("reviewCount"),
"best_rating": rating.get("bestRating"),
"price": offers.get("price"),
"price_currency": offers.get("priceCurrency"),
"category": data.get("applicationCategory"),
"operating_system": data.get("operatingSystem"),
"source": "json_ld",
}
except (json.JSONDecodeError, TypeError):
continue
# Fallback to HTML selectors (less stable, use as backup)
def get_text(*selectors):
for sel in selectors:
el = tree.css_first(sel)
if el:
return el.text(strip=True)
return None
return {
"name": get_text("h1", '[data-testid="product-title"]'),
"description": get_text(
'[data-testid="product-description"]',
'[class*="ProductDescription"]',
'.product-description',
),
"url": url,
"rating": get_text(
'[data-testid="overall-rating"]',
'[class*="OverallRating"] [class*="value"]',
'[itemprop="ratingValue"]',
),
"review_count": get_text(
'[data-testid="review-count"]',
'[class*="ReviewCount"]',
),
"category": get_text('[data-testid="category"]', '[class*="Category"]'),
"source": "html_fallback",
}
Scraping Reviews with Pagination
def extract_star_rating(element) -> float | None:
"""Parse star rating from various possible element formats."""
if not element:
return None
# Try aria-label: "4.5 out of 5 stars"
label = element.attributes.get("aria-label", "")
if "out of" in label:
try:
return float(label.split("out of")[0].strip())
except ValueError:
pass
# Try title attribute
title = element.attributes.get("title", "")
if title and title.replace(".", "").isdigit():
return float(title)
# Try CSS width-based stars (width: 80% = 4/5 stars)
style = element.attributes.get("style", "")
if "width" in style:
try:
pct = float(style.split("width:")[1].split("%")[0].strip())
return round(pct / 20, 1)
except (ValueError, IndexError):
pass
# Try data-rating attribute
rating = element.attributes.get("data-rating") or element.attributes.get("data-score")
if rating:
try:
return float(rating)
except ValueError:
pass
return None
def scrape_review_page(
client: httpx.Client,
product_id: str,
page: int = 1,
sort: str = "recent", # recent, helpful, highest, lowest
) -> dict:
"""Scrape a single page of reviews for a product."""
url = f"https://www.capterra.com/reviews/{product_id}/"
params = {"page": page, "sort": sort}
resp = client.get(url, params=params)
if "cf-browser-verification" in resp.text:
raise RuntimeError("Cloudflare block detected")
if resp.status_code == 404:
return {"reviews": [], "total": 0, "is_last_page": True}
resp.raise_for_status()
tree = HTMLParser(resp.text)
# Find review cards using multiple selector strategies (Capterra changes class names)
cards = (
tree.css('[data-testid="review-card"]')
or tree.css('[class*="ReviewCard__"]')
or tree.css('[class*="review-card"]')
or tree.css('article[class*="Review"]')
)
reviews = []
for card in cards:
def card_text(*selectors):
for sel in selectors:
el = card.css_first(sel)
if el:
return el.text(strip=True)
return None
# Overall star rating
rating_el = (
card.css_first('[class*="StarRating"]')
or card.css_first('[class*="star-rating"]')
or card.css_first('[aria-label*="stars"]')
)
overall_rating = extract_star_rating(rating_el)
# Sub-ratings (ease of use, features, value, customer service)
sub_ratings = {}
for sub_el in card.css('[class*="SubRating"], [class*="sub-rating"]'):
label_el = sub_el.css_first('[class*="label"], span:first-child')
value_el = sub_el.css_first('[class*="StarRating"], [class*="value"]')
if label_el and value_el:
label = label_el.text(strip=True).lower().replace(" ", "_")
sub_ratings[label] = extract_star_rating(value_el)
# Review text
title = card_text(
'[class*="ReviewTitle"]',
'[class*="review-title"]',
'[data-testid="review-title"]',
'h3',
)
body = card_text(
'[class*="ReviewBody"]',
'[class*="review-body"]',
'[data-testid="review-body"]',
)
pros = card_text('[class*="Pros"]', '[class*="pros"]', '[data-testid="pros"]')
cons = card_text('[class*="Cons"]', '[class*="cons"]', '[data-testid="cons"]')
# Reviewer info
reviewer_name = card_text(
'[class*="ReviewerName"]',
'[class*="reviewer-name"]',
'[data-testid="reviewer-name"]',
)
reviewer_role = card_text(
'[class*="ReviewerJob"]',
'[class*="reviewer-role"]',
'[data-testid="reviewer-title"]',
)
company_size = card_text(
'[class*="CompanySize"]',
'[class*="company-size"]',
)
review_date = card_text(
'time',
'[class*="ReviewDate"]',
'[data-testid="review-date"]',
)
usage_duration = card_text('[class*="UsedFor"]', '[class*="used-for"]')
reviews.append({
"rating_overall": overall_rating,
"sub_ratings": sub_ratings,
"title": title,
"body": body,
"pros": pros,
"cons": cons,
"reviewer_name": reviewer_name,
"reviewer_role": reviewer_role,
"company_size": company_size,
"review_date": review_date,
"usage_duration": usage_duration,
})
# Find total review count from pagination info
total_el = tree.css_first('[class*="TotalCount"], [class*="total-count"], [data-testid="total-reviews"]')
total_text = total_el.text(strip=True) if total_el else ""
    import re  # strips separators from strings like "1,234 reviews"
    try:
        total = int(re.sub(r"[^\d]", "", total_text)) if total_text else None
    except ValueError:
        total = None
# Detect last page
next_btn = tree.css_first('a[rel="next"], [aria-label="Next page"], [class*="NextPage"]:not([disabled])')
is_last_page = next_btn is None
return {
"reviews": reviews,
"total": total,
"page": page,
"is_last_page": is_last_page,
}
def scrape_all_reviews(
client: httpx.Client,
product_id: str,
max_pages: int = 50,
sort: str = "recent",
delay_range: tuple[float, float] = (2.0, 5.0),
) -> list[dict]:
"""Scrape all reviews for a product across all pages."""
all_reviews = []
for page in range(1, max_pages + 1):
try:
result = scrape_review_page(client, product_id, page=page, sort=sort)
except RuntimeError as e:
print(f"Page {page}: {e}")
break
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
wait = int(e.response.headers.get("Retry-After", 60))
print(f"Rate limited. Waiting {wait}s...")
time.sleep(wait)
continue
print(f"Page {page}: HTTP {e.response.status_code}")
break
reviews = result["reviews"]
if not reviews:
print(f"Page {page}: no reviews — stopping")
break
all_reviews.extend(reviews)
total = result.get("total")
print(f"Page {page}: {len(reviews)} reviews "
f"(total so far: {len(all_reviews)}"
+ (f" / {total}" if total else "") + ")")
if result["is_last_page"]:
print("Reached last page")
break
# Randomized delay — consistent timing looks automated
time.sleep(random.uniform(*delay_range))
return all_reviews
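When a 429 arrives without a Retry-After header, a common fallback (general retry practice, not Capterra-specific) is exponential backoff with jitter:

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 120.0) -> float:
    """Exponential backoff with full jitter: window doubles each attempt
    (2 s, 4 s, 8 s, ...) and is capped at two minutes."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

schedule = [round(backoff_delay(a), 1) for a in range(5)]
print(schedule)  # five jittered delays; values differ every run
```

The jitter matters as much as the doubling: if several workers back off by identical amounts, they all retry in lockstep and trip the rate limit again together.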
Scraping Alternatives and Comparisons
def scrape_alternatives(
client: httpx.Client,
product_path: str,
) -> list[dict]:
"""Get competitor products listed on a Capterra alternatives page."""
url = f"https://www.capterra.com/p/{product_path}/alternatives/"
resp = client.get(url)
resp.raise_for_status()
tree = HTMLParser(resp.text)
alternatives = []
# Product cards on alternatives page
for card in tree.css('[data-testid="product-card"], [class*="ProductCard"]'):
name_el = card.css_first("h3 a, h2 a, [class*='ProductName'] a")
rating_el = card.css_first('[class*="StarRating"], [class*="RatingValue"]')
reviews_el = card.css_first('[class*="ReviewCount"], [class*="reviews"]')
price_el = card.css_first('[class*="Price"], [class*="Starting"]')
if not name_el:
continue
        href = name_el.attributes.get("href", "")
        # Pull the numeric ID from hrefs like /p/158900/Slack/; a fixed
        # positional index breaks when a trailing slash shifts the parts
        parts = [p for p in href.split("/") if p]
        product_id = None
        if "p" in parts and parts.index("p") + 1 < len(parts):
            product_id = parts[parts.index("p") + 1]
alternatives.append({
"name": name_el.text(strip=True),
"product_id": product_id,
"url": href,
"rating": extract_star_rating(rating_el),
"review_count": reviews_el.text(strip=True) if reviews_el else None,
"starting_price": price_el.text(strip=True) if price_el else None,
})
return alternatives
def scrape_category_products(
client: httpx.Client,
category_slug: str,
max_pages: int = 20,
) -> list[dict]:
"""Scrape all products in a Capterra category."""
products = []
for page in range(1, max_pages + 1):
url = f"https://www.capterra.com/{category_slug}-software/"
params = {"page": page} if page > 1 else {}
resp = client.get(url, params=params)
if resp.status_code == 404:
break
resp.raise_for_status()
tree = HTMLParser(resp.text)
cards = tree.css('[data-testid="product-listing"], [class*="ProductListing"]')
if not cards:
break
for card in cards:
name_el = card.css_first("h3 a, h2 a")
rating_el = card.css_first('[class*="StarRating"]')
if not name_el:
continue
href = name_el.attributes.get("href", "")
# Extract product ID from URL like /p/158900/Slack/
parts = href.strip("/").split("/")
pid_idx = parts.index("p") + 1 if "p" in parts else None
product_id = parts[pid_idx] if pid_idx and pid_idx < len(parts) else None
products.append({
"name": name_el.text(strip=True),
"product_id": product_id,
"url": f"https://www.capterra.com{href}",
"rating": extract_star_rating(rating_el),
})
print(f"Category page {page}: {len(cards)} products")
time.sleep(random.uniform(2, 4))
return products
Handling Cloudflare Blocks
If you get 403 responses or challenge pages, the issue is usually IP reputation. Fixes in order of effectiveness:
- Use residential rotating proxies — datacenter IPs get fingerprinted fast on Capterra. Each request through ThorData's residential pool appears to come from a different home internet connection.
- Enable HTTP/2 — the http2=True flag in httpx lets your client negotiate HTTP/2 like a real browser; HTTP/1.1-only clients are a red flag to Cloudflare.
- Maintain session cookies — httpx.Client handles this automatically via its cookie jar. Don't create a new client per request.
- Rotate User-Agents — fake_useragent pulls from actual browser UA strings. Don't use made-up or outdated UA strings.
- Add realistic headers — the Sec-Fetch-* headers signal navigation context to Cloudflare. Missing them raises suspicion.
def handle_cloudflare_response(resp: httpx.Response) -> bool:
"""Returns True if the response is a Cloudflare challenge."""
if resp.status_code in (403, 503):
return True
if "cf-browser-verification" in resp.text:
return True
if "challenge-platform" in resp.text:
return True
if resp.headers.get("cf-mitigated"):
return True
return False
Storing the Data
For ongoing monitoring, dump reviews into SQLite:
def init_capterra_db(path: str = "capterra.db") -> sqlite3.Connection:
conn = sqlite3.connect(path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS products (
product_id TEXT PRIMARY KEY,
name TEXT,
category TEXT,
description TEXT,
url TEXT,
rating REAL,
review_count INTEGER,
price TEXT,
price_currency TEXT,
fetched_at TEXT
);
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
product_id TEXT NOT NULL,
rating_overall REAL,
ease_of_use REAL,
customer_service REAL,
features REAL,
value_for_money REAL,
title TEXT,
body TEXT,
pros TEXT,
cons TEXT,
reviewer_name TEXT,
reviewer_role TEXT,
company_size TEXT,
review_date TEXT,
usage_duration TEXT,
scraped_at TEXT,
FOREIGN KEY(product_id) REFERENCES products(product_id)
);
CREATE INDEX IF NOT EXISTS idx_reviews_product ON reviews(product_id);
CREATE INDEX IF NOT EXISTS idx_reviews_rating ON reviews(rating_overall);
CREATE INDEX IF NOT EXISTS idx_reviews_date ON reviews(review_date);
""")
conn.commit()
return conn
def save_product(conn: sqlite3.Connection, product: dict) -> None:
conn.execute("""
INSERT OR REPLACE INTO products
(product_id, name, category, description, url, rating,
review_count, price, price_currency, fetched_at)
VALUES (?,?,?,?,?,?,?,?,?,?)
""", (
product.get("product_id"),
product.get("name"),
product.get("category"),
product.get("description"),
product.get("url"),
product.get("rating"),
product.get("review_count"),
product.get("price"),
product.get("price_currency"),
datetime.now(timezone.utc).isoformat(),
))
conn.commit()
def save_reviews(
conn: sqlite3.Connection,
product_id: str,
reviews: list[dict],
) -> int:
"""Save a batch of reviews. Returns number saved."""
saved = 0
now = datetime.now(timezone.utc).isoformat()
for r in reviews:
sub = r.get("sub_ratings", {})
conn.execute("""
INSERT INTO reviews
(product_id, rating_overall, ease_of_use, customer_service,
features, value_for_money, title, body, pros, cons,
reviewer_name, reviewer_role, company_size, review_date,
usage_duration, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
""", (
product_id,
r.get("rating_overall"),
sub.get("ease_of_use"),
sub.get("customer_service"),
sub.get("features"),
sub.get("value_for_money"),
r.get("title"),
r.get("body"),
r.get("pros"),
r.get("cons"),
r.get("reviewer_name"),
r.get("reviewer_role"),
r.get("company_size"),
r.get("review_date"),
r.get("usage_duration"),
now,
))
saved += 1
conn.commit()
return saved
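Note that save_reviews inserts blindly, so re-scraping a product duplicates rows. One way to dedupe, sketched against an in-memory database with an assumed content_hash column added to the schema, is a UNIQUE hash over stable review fields:

```python
import hashlib
import sqlite3

# Sketch: dedupe re-scraped reviews via a UNIQUE content hash.
# Assumes you add a content_hash column to the reviews schema above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reviews (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_id TEXT, title TEXT,
        content_hash TEXT UNIQUE
    )
""")

def review_hash(r: dict) -> str:
    """Hash fields unlikely to change between scrapes of the same review."""
    key = "|".join(str(r.get(k)) for k in ("reviewer_name", "review_date", "title"))
    return hashlib.sha256(key.encode()).hexdigest()

review = {"reviewer_name": "A. User", "review_date": "2026-01-05", "title": "Solid"}
for _ in range(2):  # the second insert is silently skipped
    conn.execute(
        "INSERT OR IGNORE INTO reviews (product_id, title, content_hash) VALUES (?,?,?)",
        ("158900", review["title"], review_hash(review)),
    )

print(conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0])  # 1
```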
def get_sentiment_summary(
conn: sqlite3.Connection,
product_id: str,
) -> dict:
"""Compute average ratings for a product."""
row = conn.execute("""
SELECT
COUNT(*) as total,
AVG(rating_overall) as avg_overall,
AVG(ease_of_use) as avg_ease,
AVG(customer_service) as avg_service,
AVG(features) as avg_features,
AVG(value_for_money) as avg_value
FROM reviews
WHERE product_id = ?
""", (product_id,)).fetchone()
return {
"total_reviews": row[0],
"avg_overall": round(row[1], 2) if row[1] else None,
"avg_ease_of_use": round(row[2], 2) if row[2] else None,
"avg_customer_service": round(row[3], 2) if row[3] else None,
"avg_features": round(row[4], 2) if row[4] else None,
"avg_value": round(row[5], 2) if row[5] else None,
}
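The averaging query is easy to verify against a throwaway in-memory table with known ratings (a trimmed-down schema for illustration):

```python
import sqlite3

# Minimal in-memory check of the AVG-per-product query
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (product_id TEXT, rating_overall REAL, ease_of_use REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?,?,?)",
    [("158900", 5.0, 4.0), ("158900", 4.0, 5.0), ("999", 1.0, 1.0)],
)

# Only product 158900's two rows should contribute to the averages
row = conn.execute(
    "SELECT COUNT(*), AVG(rating_overall), AVG(ease_of_use) "
    "FROM reviews WHERE product_id = ?",
    ("158900",),
).fetchone()
print(row)  # (2, 4.5, 4.5)
```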
Full Pipeline Script
def scrape_product_full(
product_path: str,
proxy_url: str | None = None,
max_review_pages: int = 20,
db_path: str = "capterra.db",
) -> None:
"""
Complete pipeline: metadata + all reviews for a single product.
product_path: e.g. "158900/Slack"
"""
product_id = product_path.split("/")[0]
conn = init_capterra_db(db_path)
with get_client(proxy_url) as client:
print(f"Fetching metadata for {product_path}...")
metadata = scrape_product_metadata(client, product_path)
metadata["product_id"] = product_id
save_product(conn, metadata)
print(f" Name: {metadata.get('name')}, Rating: {metadata.get('rating')}")
print(f"Scraping reviews (max {max_review_pages} pages)...")
reviews = scrape_all_reviews(client, product_id, max_pages=max_review_pages)
count = save_reviews(conn, product_id, reviews)
print(f" Saved {count} reviews")
summary = get_sentiment_summary(conn, product_id)
print(f" Summary: {summary}")
conn.close()
if __name__ == "__main__":
PROXY_URL = "http://USER:[email protected]:9000"
# Scrape a specific product
scrape_product_full(
"158900/Slack",
proxy_url=PROXY_URL,
max_review_pages=10,
)
# Or discover and scrape an entire category
conn = init_capterra_db("capterra.db")
with get_client(PROXY_URL) as client:
crm_products = scrape_category_products(client, "crm", max_pages=5)
print(f"Found {len(crm_products)} CRM products")
for product in crm_products[:10]:
print(f" {product['name']} (ID: {product['product_id']})")
Key Techniques Recap
JSON-LD first: Always check <script type="application/ld+json"> before touching CSS selectors. It's stable across redesigns and parses with a single json.loads call, no selectors needed.
Selector fallback waterfall: Try data-testid → partial class match [class*=...] → semantic HTML. Never hardcode full generated class names.
Randomized delays: time.sleep(random.uniform(2, 5)) between pages. Consistent 2-second intervals look automated; human variance looks human.
Session persistence: Use httpx.Client as a context manager so cookies persist across requests in the same session. Recreate the client (and get a new proxy IP) only when you hit a block.
Legal Notes
Capterra reviews are user-generated public content. Scraping for analysis, competitive research, or building comparison tools is generally fine. Don't republish reviews verbatim as your own content, don't overload their servers, and respect robots.txt crawl delays. The ethical line is using the data for insight vs. reproducing it wholesale as a competing directory.