Scrape Capterra Reviews: Software Ratings, Pricing & Comparisons (2026)

Capterra lists 100,000+ software products with user reviews, pricing tiers, and feature comparisons. If you're building a SaaS comparison tool, tracking competitor sentiment, or doing market research — that data is gold.

The problem: Capterra actively blocks scrapers. Cloudflare protection, rate limiting, JavaScript-rendered content. This guide shows you how to extract what you need without getting blocked, covering everything from structured data extraction to bulk review collection and ongoing monitoring.

What Data Is Available

Each Capterra product page contains:

  - Overall rating (1–5 stars) plus sub-ratings (ease of use, customer service, features, value for money)
  - Individual reviews with full text, pros/cons, reviewer role, company size, and usage frequency
  - Pricing — free tier availability, starting price, pricing model (per user/month, flat, etc.)
  - Product details — categories, deployment options (cloud/on-premise/mobile), supported platforms
  - Feature list — capabilities and integrations
  - Alternatives — competitor products Capterra recommends
  - Awards — Capterra Shortlist rankings, GetApp awards

Reviews paginate at 25 per page. A popular product like Salesforce has 20,000+ reviews — that's 800+ pages worth of data.
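
That scale matters for planning: at 25 reviews per page, the request budget follows directly from the review count. A quick back-of-envelope helper:

```python
import math

def pages_needed(review_count: int, per_page: int = 25) -> int:
    """How many paginated requests it takes to cover every review."""
    return math.ceil(review_count / per_page)

# A Salesforce-scale product with 20,000 reviews:
print(pages_needed(20_000))  # 800
```

At the 2–5 second delays used later in this guide, 800 pages works out to roughly half an hour to an hour for a single large product, so budget accordingly.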

Anti-Bot Measures on Capterra

Capterra uses several layers of protection:

  1. Cloudflare Bot Management — fingerprints your TLS handshake, checks browser headers, serves JavaScript challenges. Standard requests library fails on this because it speaks HTTP/1.1 with a non-browser TLS fingerprint.
  2. Rate limiting — more than ~30 requests/minute from one IP triggers blocks (429 or silent redirects to a CAPTCHA)
  3. Session validation — cookies must persist across requests; fresh cookieless requests get challenged
  4. Dynamic class names — CSS class names like ReviewCard__reviewTitle--3kJ2A change on deployment

You need rotating residential proxies to maintain access. Datacenter IPs (AWS, GCP, VPS providers) get flagged almost immediately. ThorData's residential proxy pool works well here — their rotating IPs cover 195+ countries and handle the Cloudflare challenge layer without getting burned after a few requests.
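
For the rate-limiting layer specifically, retries should back off exponentially with jitter rather than hammering at a fixed interval. A minimal sketch; the base and cap values here are tuning assumptions, not Capterra-documented thresholds:

```python
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with +/-50% jitter: ~2s, ~4s, ~8s... capped near 60s."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.5)

# attempt 0 -> 1-3s, attempt 1 -> 2-6s, attempt 2 -> 4-12s, attempt 5+ -> 30-90s
```

When a 429 response carries no Retry-After header, sleeping backoff_delay(attempt) is a gentler fallback than a fixed wait.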

Setup

pip install httpx selectolax fake-useragent

We're using httpx over requests for HTTP/2 support (real browsers negotiate HTTP/2; HTTP/1.1-only clients stand out to Cloudflare) and selectolax for fast HTML parsing. fake-useragent pulls real, current browser User-Agent strings.

Core Scraper Architecture

import httpx
import json
import time
import random
import sqlite3
from datetime import datetime, timezone
from selectolax.parser import HTMLParser
from fake_useragent import UserAgent

ua = UserAgent()

PROXY_URL = "http://USER:[email protected]:9000"


def get_client(proxy_url: str | None = None, rotate_ua: bool = True) -> httpx.Client:
    """Create HTTP client with browser-like settings."""
    return httpx.Client(
        proxy=proxy_url,
        headers={
            "User-Agent": ua.chrome if rotate_ua else (
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/126.0.0.0 Safari/537.36"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
        },
        follow_redirects=True,
        timeout=30.0,
        http2=True,
    )

Extracting Product Metadata

The most reliable data source is the JSON-LD structured data that Capterra embeds for SEO. Check for it first before falling back to HTML selectors:

def scrape_product_metadata(
    client: httpx.Client,
    product_path: str,
) -> dict:
    """
    Extract product metadata from a Capterra product page.
    product_path format: "158900/Slack" (ID/slug)
    """
    url = f"https://www.capterra.com/p/{product_path}/"
    resp = client.get(url)

    # Detect Cloudflare block
    if "cf-browser-verification" in resp.text or resp.status_code == 403:
        raise RuntimeError("Cloudflare challenge — rotate IP and retry")

    resp.raise_for_status()
    tree = HTMLParser(resp.text)

    # Try JSON-LD structured data first (most stable)
    for script in tree.css('script[type="application/ld+json"]'):
        try:
            data = json.loads(script.text())
            if data.get("@type") == "SoftwareApplication":
                rating = data.get("aggregateRating", {})
                offers = data.get("offers", {})
                return {
                    "name": data.get("name"),
                    "description": data.get("description"),
                    "url": url,
                    "rating": rating.get("ratingValue"),
                    "review_count": rating.get("reviewCount"),
                    "best_rating": rating.get("bestRating"),
                    "price": offers.get("price"),
                    "price_currency": offers.get("priceCurrency"),
                    "category": data.get("applicationCategory"),
                    "operating_system": data.get("operatingSystem"),
                    "source": "json_ld",
                }
        except (json.JSONDecodeError, TypeError):
            continue

    # Fallback to HTML selectors (less stable, use as backup)
    def get_text(*selectors):
        for sel in selectors:
            el = tree.css_first(sel)
            if el:
                return el.text(strip=True)
        return None

    return {
        "name": get_text("h1", '[data-testid="product-title"]'),
        "description": get_text(
            '[data-testid="product-description"]',
            '[class*="ProductDescription"]',
            '.product-description',
        ),
        "url": url,
        "rating": get_text(
            '[data-testid="overall-rating"]',
            '[class*="OverallRating"] [class*="value"]',
            '[itemprop="ratingValue"]',
        ),
        "review_count": get_text(
            '[data-testid="review-count"]',
            '[class*="ReviewCount"]',
        ),
        "category": get_text('[data-testid="category"]', '[class*="Category"]'),
        "source": "html_fallback",
    }

Scraping Reviews with Pagination

def extract_star_rating(element) -> float | None:
    """Parse star rating from various possible element formats."""
    if not element:
        return None

    # Try aria-label: "4.5 out of 5 stars"
    label = element.attributes.get("aria-label", "")
    if "out of" in label:
        try:
            return float(label.split("out of")[0].strip())
        except ValueError:
            pass

    # Try title attribute
    title = element.attributes.get("title", "")
    if title and title.replace(".", "").isdigit():
        return float(title)

    # Try CSS width-based stars (width: 80% = 4/5 stars)
    style = element.attributes.get("style", "")
    if "width" in style:
        try:
            pct = float(style.split("width:")[1].split("%")[0].strip())
            return round(pct / 20, 1)
        except (ValueError, IndexError):
            pass

    # Try data-rating attribute
    rating = element.attributes.get("data-rating") or element.attributes.get("data-score")
    if rating:
        try:
            return float(rating)
        except ValueError:
            pass

    return None


def scrape_review_page(
    client: httpx.Client,
    product_id: str,
    page: int = 1,
    sort: str = "recent",  # recent, helpful, highest, lowest
) -> dict:
    """Scrape a single page of reviews for a product."""
    url = f"https://www.capterra.com/reviews/{product_id}/"
    params = {"page": page, "sort": sort}
    resp = client.get(url, params=params)

    if "cf-browser-verification" in resp.text:
        raise RuntimeError("Cloudflare block detected")
    if resp.status_code == 404:
        return {"reviews": [], "total": 0, "is_last_page": True}
    resp.raise_for_status()

    tree = HTMLParser(resp.text)

    # Find review cards using multiple selector strategies (Capterra changes class names)
    cards = (
        tree.css('[data-testid="review-card"]')
        or tree.css('[class*="ReviewCard__"]')
        or tree.css('[class*="review-card"]')
        or tree.css('article[class*="Review"]')
    )

    reviews = []
    for card in cards:
        def card_text(*selectors):
            for sel in selectors:
                el = card.css_first(sel)
                if el:
                    return el.text(strip=True)
            return None

        # Overall star rating
        rating_el = (
            card.css_first('[class*="StarRating"]')
            or card.css_first('[class*="star-rating"]')
            or card.css_first('[aria-label*="stars"]')
        )
        overall_rating = extract_star_rating(rating_el)

        # Sub-ratings (ease of use, features, value, customer service)
        sub_ratings = {}
        for sub_el in card.css('[class*="SubRating"], [class*="sub-rating"]'):
            label_el = sub_el.css_first('[class*="label"], span:first-child')
            value_el = sub_el.css_first('[class*="StarRating"], [class*="value"]')
            if label_el and value_el:
                label = label_el.text(strip=True).lower().replace(" ", "_")
                sub_ratings[label] = extract_star_rating(value_el)

        # Review text
        title = card_text(
            '[class*="ReviewTitle"]',
            '[class*="review-title"]',
            '[data-testid="review-title"]',
            'h3',
        )
        body = card_text(
            '[class*="ReviewBody"]',
            '[class*="review-body"]',
            '[data-testid="review-body"]',
        )
        pros = card_text('[class*="Pros"]', '[class*="pros"]', '[data-testid="pros"]')
        cons = card_text('[class*="Cons"]', '[class*="cons"]', '[data-testid="cons"]')

        # Reviewer info
        reviewer_name = card_text(
            '[class*="ReviewerName"]',
            '[class*="reviewer-name"]',
            '[data-testid="reviewer-name"]',
        )
        reviewer_role = card_text(
            '[class*="ReviewerJob"]',
            '[class*="reviewer-role"]',
            '[data-testid="reviewer-title"]',
        )
        company_size = card_text(
            '[class*="CompanySize"]',
            '[class*="company-size"]',
        )
        review_date = card_text(
            'time',
            '[class*="ReviewDate"]',
            '[data-testid="review-date"]',
        )
        usage_duration = card_text('[class*="UsedFor"]', '[class*="used-for"]')

        reviews.append({
            "rating_overall": overall_rating,
            "sub_ratings": sub_ratings,
            "title": title,
            "body": body,
            "pros": pros,
            "cons": cons,
            "reviewer_name": reviewer_name,
            "reviewer_role": reviewer_role,
            "company_size": company_size,
            "review_date": review_date,
            "usage_duration": usage_duration,
        })

    # Find total review count from pagination info
    total_el = tree.css_first('[class*="TotalCount"], [class*="total-count"], [data-testid="total-reviews"]')
    total_text = total_el.text(strip=True) if total_el else ""
    import re  # regex for stripping non-digits; in a real module, hoist to the top-level imports
    try:
        total = int(re.sub(r'[^\d]', '', total_text)) if total_text else None
    except ValueError:
        total = None

    # Detect last page
    next_btn = tree.css_first('a[rel="next"], [aria-label="Next page"], [class*="NextPage"]:not([disabled])')
    is_last_page = next_btn is None

    return {
        "reviews": reviews,
        "total": total,
        "page": page,
        "is_last_page": is_last_page,
    }


def scrape_all_reviews(
    client: httpx.Client,
    product_id: str,
    max_pages: int = 50,
    sort: str = "recent",
    delay_range: tuple[float, float] = (2.0, 5.0),
) -> list[dict]:
    """Scrape all reviews for a product across all pages."""
    all_reviews = []

    for page in range(1, max_pages + 1):
        try:
            result = scrape_review_page(client, product_id, page=page, sort=sort)
        except RuntimeError as e:
            print(f"Page {page}: {e}")
            break
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                wait = int(e.response.headers.get("Retry-After", 60))
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            print(f"Page {page}: HTTP {e.response.status_code}")
            break

        reviews = result["reviews"]
        if not reviews:
            print(f"Page {page}: no reviews — stopping")
            break

        all_reviews.extend(reviews)
        total = result.get("total")
        print(f"Page {page}: {len(reviews)} reviews "
              f"(total so far: {len(all_reviews)}"
              + (f" / {total}" if total else "") + ")")

        if result["is_last_page"]:
            print("Reached last page")
            break

        # Randomized delay — consistent timing looks automated
        time.sleep(random.uniform(*delay_range))

    return all_reviews

Scraping Alternatives and Comparisons

def scrape_alternatives(
    client: httpx.Client,
    product_path: str,
) -> list[dict]:
    """Get competitor products listed on a Capterra alternatives page."""
    url = f"https://www.capterra.com/p/{product_path}/alternatives/"
    resp = client.get(url)
    resp.raise_for_status()

    tree = HTMLParser(resp.text)
    alternatives = []

    # Product cards on alternatives page
    for card in tree.css('[data-testid="product-card"], [class*="ProductCard"]'):
        name_el = card.css_first("h3 a, h2 a, [class*='ProductName'] a")
        rating_el = card.css_first('[class*="StarRating"], [class*="RatingValue"]')
        reviews_el = card.css_first('[class*="ReviewCount"], [class*="reviews"]')
        price_el = card.css_first('[class*="Price"], [class*="Starting"]')

        if not name_el:
            continue

        href = name_el.attributes.get("href", "")
        # split("/")[-2] returns the slug, not the ID, for hrefs like
        # /p/158900/Slack/ (trailing slash), so take the segment after "p" instead
        parts = href.strip("/").split("/")
        pid_idx = parts.index("p") + 1 if "p" in parts else None
        product_id = parts[pid_idx] if pid_idx and pid_idx < len(parts) else None

        alternatives.append({
            "name": name_el.text(strip=True),
            "product_id": product_id,
            "url": href,
            "rating": extract_star_rating(rating_el),
            "review_count": reviews_el.text(strip=True) if reviews_el else None,
            "starting_price": price_el.text(strip=True) if price_el else None,
        })

    return alternatives


def scrape_category_products(
    client: httpx.Client,
    category_slug: str,
    max_pages: int = 20,
) -> list[dict]:
    """Scrape all products in a Capterra category."""
    products = []

    for page in range(1, max_pages + 1):
        url = f"https://www.capterra.com/{category_slug}-software/"
        params = {"page": page} if page > 1 else {}
        resp = client.get(url, params=params)

        if resp.status_code == 404:
            break
        resp.raise_for_status()

        tree = HTMLParser(resp.text)
        cards = tree.css('[data-testid="product-listing"], [class*="ProductListing"]')

        if not cards:
            break

        for card in cards:
            name_el = card.css_first("h3 a, h2 a")
            rating_el = card.css_first('[class*="StarRating"]')

            if not name_el:
                continue

            href = name_el.attributes.get("href", "")
            # Extract product ID from URL like /p/158900/Slack/
            parts = href.strip("/").split("/")
            pid_idx = parts.index("p") + 1 if "p" in parts else None
            product_id = parts[pid_idx] if pid_idx and pid_idx < len(parts) else None

            products.append({
                "name": name_el.text(strip=True),
                "product_id": product_id,
                "url": f"https://www.capterra.com{href}",
                "rating": extract_star_rating(rating_el),
            })

        print(f"Category page {page}: {len(cards)} products")
        time.sleep(random.uniform(2, 4))

    return products

Handling Cloudflare Blocks

If you get 403 responses or challenge pages, the issue is usually IP reputation. Fixes in order of effectiveness:

  1. Use residential rotating proxies — datacenter IPs get fingerprinted fast on Capterra. Each request through ThorData's residential pool appears to come from a different home internet connection.

  2. Enable HTTP/2 — the http2=True flag in httpx makes your client negotiate HTTP/2 the way real Chrome does. HTTP/1.1-only clients are a red flag to Cloudflare.

  3. Maintain session cookies — httpx.Client handles this automatically via its cookie jar. Don't create a new client per request.

  4. Rotate User-Agents — fake_useragent pulls real, current browser UA strings; don't hardcode an outdated one.

  5. Add realistic headers — the Sec-Fetch-* headers signal navigation context to Cloudflare. Missing them raises suspicion.

def handle_cloudflare_response(resp: httpx.Response) -> bool:
    """Returns True if the response is a Cloudflare challenge."""
    if resp.status_code in (403, 503):
        return True
    if "cf-browser-verification" in resp.text:
        return True
    if "challenge-platform" in resp.text:
        return True
    if resp.headers.get("cf-mitigated"):
        return True
    return False
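
Putting fixes 1 and 3 together: when a challenge is detected, throw away the client (cookies and all) and retry through a different proxy exit. A sketch, with fetch_with_rotation as a hypothetical helper written against injected callables so it composes with the get_client() and handle_cloudflare_response() defined in this guide:

```python
def fetch_with_rotation(url, proxy_urls, make_client, is_blocked, max_attempts=4):
    """Retry a URL, rotating to a fresh client (new cookie jar, new proxy IP)
    whenever the response looks like a Cloudflare challenge."""
    for attempt in range(max_attempts):
        proxy = proxy_urls[attempt % len(proxy_urls)]
        with make_client(proxy) as client:
            resp = client.get(url)
            if not is_blocked(resp):
                return resp  # clean response
        # challenged: loop around with the next proxy and a brand-new session
    raise RuntimeError(f"Blocked on all {max_attempts} attempts for {url}")
```

Usage with the functions above: resp = fetch_with_rotation(url, proxies, get_client, handle_cloudflare_response).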

Storing the Data

For ongoing monitoring, dump reviews into SQLite:

def init_capterra_db(path: str = "capterra.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS products (
            product_id TEXT PRIMARY KEY,
            name TEXT,
            category TEXT,
            description TEXT,
            url TEXT,
            rating REAL,
            review_count INTEGER,
            price TEXT,
            price_currency TEXT,
            fetched_at TEXT
        );

        CREATE TABLE IF NOT EXISTS reviews (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            product_id TEXT NOT NULL,
            rating_overall REAL,
            ease_of_use REAL,
            customer_service REAL,
            features REAL,
            value_for_money REAL,
            title TEXT,
            body TEXT,
            pros TEXT,
            cons TEXT,
            reviewer_name TEXT,
            reviewer_role TEXT,
            company_size TEXT,
            review_date TEXT,
            usage_duration TEXT,
            scraped_at TEXT,
            FOREIGN KEY(product_id) REFERENCES products(product_id)
        );

        CREATE INDEX IF NOT EXISTS idx_reviews_product ON reviews(product_id);
        CREATE INDEX IF NOT EXISTS idx_reviews_rating ON reviews(rating_overall);
        CREATE INDEX IF NOT EXISTS idx_reviews_date ON reviews(review_date);
    """)
    conn.commit()
    return conn


def save_product(conn: sqlite3.Connection, product: dict) -> None:
    conn.execute("""
        INSERT OR REPLACE INTO products
        (product_id, name, category, description, url, rating,
         review_count, price, price_currency, fetched_at)
        VALUES (?,?,?,?,?,?,?,?,?,?)
    """, (
        product.get("product_id"),
        product.get("name"),
        product.get("category"),
        product.get("description"),
        product.get("url"),
        product.get("rating"),
        product.get("review_count"),
        product.get("price"),
        product.get("price_currency"),
        datetime.now(timezone.utc).isoformat(),
    ))
    conn.commit()


def save_reviews(
    conn: sqlite3.Connection,
    product_id: str,
    reviews: list[dict],
) -> int:
    """Save a batch of reviews. Returns number saved."""
    saved = 0
    now = datetime.now(timezone.utc).isoformat()
    for r in reviews:
        sub = r.get("sub_ratings", {})
        conn.execute("""
            INSERT INTO reviews
            (product_id, rating_overall, ease_of_use, customer_service,
             features, value_for_money, title, body, pros, cons,
             reviewer_name, reviewer_role, company_size, review_date,
             usage_duration, scraped_at)
            VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
        """, (
            product_id,
            r.get("rating_overall"),
            sub.get("ease_of_use"),
            sub.get("customer_service"),
            sub.get("features"),
            sub.get("value_for_money"),
            r.get("title"),
            r.get("body"),
            r.get("pros"),
            r.get("cons"),
            r.get("reviewer_name"),
            r.get("reviewer_role"),
            r.get("company_size"),
            r.get("review_date"),
            r.get("usage_duration"),
            now,
        ))
        saved += 1
    conn.commit()
    return saved


def get_sentiment_summary(
    conn: sqlite3.Connection,
    product_id: str,
) -> dict:
    """Compute average ratings for a product."""
    row = conn.execute("""
        SELECT
            COUNT(*) as total,
            AVG(rating_overall) as avg_overall,
            AVG(ease_of_use) as avg_ease,
            AVG(customer_service) as avg_service,
            AVG(features) as avg_features,
            AVG(value_for_money) as avg_value
        FROM reviews
        WHERE product_id = ?
    """, (product_id,)).fetchone()

    return {
        "total_reviews": row[0],
        "avg_overall": round(row[1], 2) if row[1] else None,
        "avg_ease_of_use": round(row[2], 2) if row[2] else None,
        "avg_customer_service": round(row[3], 2) if row[3] else None,
        "avg_features": round(row[4], 2) if row[4] else None,
        "avg_value": round(row[5], 2) if row[5] else None,
    }

Full Pipeline Script

def scrape_product_full(
    product_path: str,
    proxy_url: str | None = None,
    max_review_pages: int = 20,
    db_path: str = "capterra.db",
) -> None:
    """
    Complete pipeline: metadata + all reviews for a single product.
    product_path: e.g. "158900/Slack"
    """
    product_id = product_path.split("/")[0]
    conn = init_capterra_db(db_path)

    with get_client(proxy_url) as client:
        print(f"Fetching metadata for {product_path}...")
        metadata = scrape_product_metadata(client, product_path)
        metadata["product_id"] = product_id
        save_product(conn, metadata)
        print(f"  Name: {metadata.get('name')}, Rating: {metadata.get('rating')}")

        print(f"Scraping reviews (max {max_review_pages} pages)...")
        reviews = scrape_all_reviews(client, product_id, max_pages=max_review_pages)
        count = save_reviews(conn, product_id, reviews)
        print(f"  Saved {count} reviews")

        summary = get_sentiment_summary(conn, product_id)
        print(f"  Summary: {summary}")

    conn.close()


if __name__ == "__main__":
    PROXY_URL = "http://USER:[email protected]:9000"

    # Scrape a specific product
    scrape_product_full(
        "158900/Slack",
        proxy_url=PROXY_URL,
        max_review_pages=10,
    )

    # Or discover and scrape an entire category
    conn = init_capterra_db("capterra.db")
    with get_client(PROXY_URL) as client:
        crm_products = scrape_category_products(client, "crm", max_pages=5)
        print(f"Found {len(crm_products)} CRM products")
        for product in crm_products[:10]:
            print(f"  {product['name']} (ID: {product['product_id']})")

Key Techniques Recap

JSON-LD first: Always check <script type="application/ld+json"> before touching CSS selectors. It's stable across redesigns and parses with a single json.loads — no selectors to maintain.

Selector fallback waterfall: Try data-testid → partial class match [class*=...] → semantic HTML. Never hardcode full generated class names.
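
The get_text/card_text closures used earlier can be generalized into one reusable helper (first_text is a hypothetical name; it duck-types against anything with a css_first method, so it works on both a selectolax tree and an individual element):

```python
def first_text(node, *selectors):
    """Return stripped text from the first selector that matches, else None."""
    for sel in selectors:
        el = node.css_first(sel)
        if el is not None:
            return el.text(strip=True)
    return None
```

Order the arguments from most stable (data-testid) to least stable (semantic HTML), and the waterfall degrades gracefully as Capterra rotates class names.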

Randomized delays: time.sleep(random.uniform(2, 5)) between pages. Consistent 2-second intervals look automated; human variance looks human.

Session persistence: Use httpx.Client as a context manager so cookies persist across requests in the same session. Recreate the client (and get a new proxy IP) only when you hit a block.

Capterra reviews are user-generated public content. Scraping for analysis, competitive research, or building comparison tools is generally fine. Don't republish reviews verbatim as your own content, don't overload their servers, and respect robots.txt crawl delays. The ethical line is using the data for insight vs. reproducing it wholesale as a competing directory.