How to Scrape Goodreads Book Data, Ratings & Reviews in Python (2026)
Goodreads killed their public API in December 2020. No replacement was offered. If you want book ratings, review counts, shelf data, or author information programmatically, scraping is now the only option.
The good news: Goodreads is relatively straightforward to scrape. Most content renders server-side, the HTML is reasonably structured, and the anti-bot measures are lighter than most major sites. The bad news: Amazon owns Goodreads and could tighten things up at any time. Build your scraper to be polite.
This guide covers the complete workflow: fetching book details, scraping reviews, searching by title, author pages, building datasets, and handling the anti-bot measures you will encounter at scale.
Setup
pip install beautifulsoup4 lxml httpx
Using httpx instead of requests because it supports HTTP/2 (install the optional h2 extra and pass http2=True when creating the client), which reduces fingerprinting surface. Goodreads does not require it, but it is good practice.
Structure of a Goodreads Book Page
Every Goodreads book has a page at goodreads.com/book/show/{id} where the ID is numeric. The page contains:
- Title, author, series information
- Aggregate rating (star average) and total rating count
- Review count separately from rating count
- Book description/synopsis
- Genres (from "shelves" community picks)
- Edition details: page count, format, ISBN, publisher, publication date
- Cover image URL
Key insight: Goodreads embeds structured data as JSON-LD in every book page. The application/ld+json script tag gives you most of the core metadata without parsing HTML.
{
"@type": "Book",
"name": "The Great Gatsby",
"author": {"@type": "Person", "name": "F. Scott Fitzgerald"},
"aggregateRating": {
"ratingValue": "3.93",
"ratingCount": 5432198,
"reviewCount": 89043
}
}
Scraping Book Details
import httpx
from bs4 import BeautifulSoup
import json
import time
import random
import re
# Desktop-Chrome request headers sent with every request so the scraper
# presents itself like a regular browser session.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/126.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
def get_book_details(book_id: str, client: httpx.Client = None) -> dict:
    """Scrape core metadata for a single Goodreads book page.

    Args:
        book_id: Numeric Goodreads book ID (e.g. "11127").
        client: Optional shared httpx.Client. When omitted, a temporary
            client is created and closed before parsing begins.

    Returns:
        Dict of book fields (title, authors, rating, counts, description,
        genres, edition info, cover URL), or {} when the page could not
        be fetched.
    """
    url = f"https://www.goodreads.com/book/show/{book_id}"
    own_client = client is None
    if own_client:
        client = httpx.Client(headers=HEADERS, timeout=20, follow_redirects=True)
    try:
        resp = client.get(url)
        resp.raise_for_status()
    except httpx.HTTPStatusError as e:
        print(f"HTTP {e.response.status_code} for book {book_id}")
        return {}
    except httpx.RequestError as e:
        # Network-level failures (timeouts, DNS errors, resets) should not
        # crash a long scraping run — report and return an empty result.
        print(f"Request failed for book {book_id}: {e}")
        return {}
    finally:
        if own_client:
            client.close()
    soup = BeautifulSoup(resp.text, "lxml")
    # Primary source: JSON-LD structured data embedded in the page.
    ld = {}
    script = soup.select_one('script[type="application/ld+json"]')
    if script:
        try:
            ld = json.loads(script.string)
        except (json.JSONDecodeError, AttributeError, TypeError):
            ld = {}
    if isinstance(ld, list):
        # Some pages wrap the JSON-LD payload in a list; take the first dict.
        ld = next((item for item in ld if isinstance(item, dict)), {})
    if not isinstance(ld, dict):
        ld = {}
    # Title: JSON-LD first, then the visible <h1> variants as fallbacks.
    title = ld.get("name", "")
    if not title:
        title_el = soup.select_one("h1[data-testid='bookTitle']")
        if not title_el:
            title_el = soup.select_one("h1.Text__title1")
        title = title_el.get_text(strip=True) if title_el else ""
    # Author(s): JSON-LD "author" may be a single object or a list.
    authors = []
    if "author" in ld:
        author_data = ld["author"]
        if isinstance(author_data, list):
            authors = [a.get("name", "") for a in author_data]
        elif isinstance(author_data, dict):
            authors = [author_data.get("name", "")]
    if not authors:
        for a_el in soup.select("a.ContributorLink"):
            authors.append(a_el.get_text(strip=True))
    # Rating and counts from JSON-LD aggregateRating.
    agg_rating = ld.get("aggregateRating", {})
    rating = agg_rating.get("ratingValue", "")
    rating_count = agg_rating.get("ratingCount", "")
    review_count = agg_rating.get("reviewCount", "")
    # Fallback: parse the visible star-average element.
    if not rating:
        rating_el = soup.select_one("div.RatingStatistics__rating")
        rating = rating_el.get_text(strip=True) if rating_el else ""
    # Description: visible element first, then JSON-LD.
    desc = ""
    desc_el = soup.select_one("div[data-testid='description'] span.Formatted")
    if desc_el:
        desc = desc_el.get_text(strip=True)
    elif "description" in ld:
        desc = ld["description"]
    # Genres from community shelves; generic tag buttons as fallback.
    genres = []
    for genre_el in soup.select("span.BookPageMetadataSection__genreButton a"):
        genres.append(genre_el.get_text(strip=True))
    if not genres:
        for shelf_el in soup.select("a.Button--tag"):
            text = shelf_el.get_text(strip=True)
            # Length cap filters out long non-genre button labels.
            if text and len(text) < 50:
                genres.append(text)
    # Edition info.
    pages = ""
    pages_el = soup.select_one("p[data-testid='pagesFormat']")
    if pages_el:
        pages = pages_el.get_text(strip=True)
    publish_date = ""
    for detail_el in soup.select("p[data-testid='publicationInfo']"):
        text = detail_el.get_text(strip=True)
        if "Published" in text:
            publish_date = text.replace("Published", "").strip()
    isbn = ld.get("isbn", "")
    cover_img = ""
    img_el = soup.select_one("img.ResponsiveImage")
    if img_el:
        cover_img = img_el.get("src", "")
    return {
        "book_id": book_id,
        "title": title,
        "authors": authors,
        "rating": rating,
        "rating_count": rating_count,
        "review_count": review_count,
        "description": desc[:1000],
        "genres": genres[:10],
        "pages": pages,
        "publish_date": publish_date,
        "isbn": isbn,
        "cover_img": cover_img,
        "url": url,
    }
# Smoke test: fetch one well-known book and print its key fields.
book = get_book_details("11127")  # Anna Karenina
print(f"{book['title']} by {', '.join(book['authors'])}")
print(f"Rating: {book['rating']} ({book['rating_count']} ratings)")
print(f"Genres: {', '.join(book['genres'][:5])}")
Scraping Individual Reviews
The reviews section loads dynamically on newer Goodreads pages. The best approach depends on whether you need all reviews (requires scrolling automation) or just the visible ones.
For accessible review extraction from the initial HTML:
def get_visible_reviews(book_id: str, client: httpx.Client = None) -> list:
    """Extract reviews from the initial page load (typically 8-15 reviews).

    Args:
        book_id: Numeric Goodreads book ID.
        client: Optional shared httpx.Client; a temporary one is created
            (and closed) when omitted.

    Returns:
        List of review dicts (reviewer, stars, text, date, likes), or []
        when the page could not be fetched.
    """
    url = f"https://www.goodreads.com/book/show/{book_id}"
    own_client = client is None
    if own_client:
        client = httpx.Client(headers=HEADERS, timeout=20, follow_redirects=True)
    try:
        resp = client.get(url)
        resp.raise_for_status()
    except httpx.HTTPStatusError as e:
        # Mirror get_book_details(): report and return an empty result
        # instead of letting the exception kill the caller's loop.
        print(f"HTTP {e.response.status_code} for book {book_id}")
        return []
    except httpx.RequestError as e:
        print(f"Request failed for book {book_id}: {e}")
        return []
    finally:
        if own_client:
            client.close()
    soup = BeautifulSoup(resp.text, "lxml")
    reviews = []
    # Each review lives in a "ReviewCard" article/div.
    for card in soup.select("article.ReviewCard, div.ReviewCard"):
        review = {}
        # Reviewer name comes from the profile link.
        reviewer_el = card.select_one("a[href*='/user/show/']")
        if reviewer_el:
            review["reviewer"] = reviewer_el.get_text(strip=True)
        # Star rating — encoded in the aria-label on the star container.
        stars_el = card.select_one("[aria-label*='out of 5']")
        if not stars_el:
            stars_el = card.select_one("[aria-label*='Stars']")
        if stars_el:
            aria = stars_el.get("aria-label", "")
            match = re.search(r"(\d+)\s*(?:out of|Stars)", aria, re.IGNORECASE)
            if match:
                review["stars"] = int(match.group(1))
        # Review body text (two layout variants observed).
        text_el = card.select_one("section.ReviewSection span.Formatted")
        if not text_el:
            text_el = card.select_one("div.ReviewText span.Formatted")
        if text_el:
            review["text"] = text_el.get_text(strip=True)
        # Posted date: prefer the title attribute when present.
        date_el = card.select_one("time, span[class*='review__date']")
        if date_el:
            review["date"] = date_el.get("title") or date_el.get_text(strip=True)
        # Likes count, also parsed from an aria-label.
        likes_el = card.select_one("button[aria-label*='like']")
        if likes_el:
            aria = likes_el.get("aria-label", "")
            match = re.search(r"(\d+)\s+like", aria, re.IGNORECASE)
            if match:
                review["likes"] = int(match.group(1))
        # Keep only cards that yielded at least a name or some text.
        if review.get("reviewer") or review.get("text"):
            reviews.append(review)
    return reviews
Scraping Search Results
Searching Goodreads lets you discover books by query.
def search_books(query: str, page: int = 1) -> list:
    """Search Goodreads and return book results.

    Args:
        query: Search terms (title, author, or ISBN).
        page: 1-based results page number.

    Returns:
        List of dicts with title, author, rating_text, book_id, and url.

    Raises:
        httpx.HTTPStatusError: On a non-2xx response.
    """
    url = "https://www.goodreads.com/search"
    params = {"q": query, "page": page}
    # Context manager guarantees the client is closed even when the GET
    # raises — the original leaked the client on any request error.
    with httpx.Client(headers=HEADERS, timeout=15, follow_redirects=True) as client:
        resp = client.get(url, params=params)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")
    results = []
    # Search results render as schema.org Book table rows.
    for row in soup.select("tr[itemtype='http://schema.org/Book']"):
        title_link = row.select_one("a.bookTitle")
        author_link = row.select_one("a.authorName")
        rating_el = row.select_one("span.minirating")
        if not title_link:
            continue
        href = title_link.get("href", "")
        # Extract book ID from URL like /book/show/123-book-name
        id_match = re.search(r"/book/show/(\d+)", href)
        book_id = id_match.group(1) if id_match else ""
        results.append({
            "title": title_link.get_text(strip=True),
            "author": author_link.get_text(strip=True) if author_link else "",
            "rating_text": rating_el.get_text(strip=True) if rating_el else "",
            "book_id": book_id,
            "url": f"https://www.goodreads.com{href}" if href.startswith("/") else href,
        })
    return results
# Example: search and print the top five matches.
results = search_books("data science")
for r in results[:5]:
    print(f" {r['title']} — {r['author']} ({r['rating_text']})")
Scraping Author Pages
Author pages include biography, follower count, and complete book list.
def get_author_info(author_id: str) -> dict:
    """Fetch a Goodreads author page and return name, bio, followers, and books."""
    url = f"https://www.goodreads.com/author/show/{author_id}"
    http = httpx.Client(headers=HEADERS, timeout=15, follow_redirects=True)
    try:
        response = http.get(url)
        response.raise_for_status()
    finally:
        http.close()
    page = BeautifulSoup(response.text, "lxml")

    def _text(el):
        # Stripped text of an element, or "" when the element is missing.
        return el.get_text(strip=True) if el else ""

    name = _text(page.select_one("h1.authorName, h1[itemprop='name']"))
    bio = _text(page.select_one("div.aboutAuthorInfo span, div[class*='about'] span.Formatted"))

    # Follower count: the stat total whose sibling label mentions "follower".
    followers = ""
    for total in page.select("div.userStatTotal"):
        sibling = total.find_next_sibling()
        if sibling and "follower" in sibling.get_text(strip=True).lower():
            followers = total.get_text(strip=True)
            break

    # Bibliography rows (one <tr itemscope> per book).
    books = []
    for row in page.select("tr[itemscope]"):
        link = row.select_one("a.bookTitle")
        if not link:
            continue
        href = link.get("href", "")
        id_match = re.search(r"/book/show/(\d+)", href)
        books.append({
            "title": link.get_text(strip=True),
            "book_id": id_match.group(1) if id_match else "",
            "rating": _text(row.select_one("span.minirating")),
            "year": _text(row.select_one("td.field.date_pub")),
        })

    return {
        "author_id": author_id,
        "name": name,
        "bio": bio[:800],
        "followers": followers,
        "books": books[:30],
        "url": url,
    }
Scraping Listopia (Curated Book Lists)
Goodreads Listopia has thousands of community-curated book lists. Good source for discovering books in a genre.
def get_listopia_books(list_id: str, max_pages: int = 5) -> list:
    """Scrape books from a Goodreads Listopia list.

    Args:
        list_id: List identifier, e.g. "25111.Best_Books_of_the_Decade_2010_s".
        max_pages: Maximum number of list pages to fetch.

    Returns:
        List of dicts with title, author, rating text, and book_id.
    """
    books = []
    # One client for the whole crawl: the original opened and closed a new
    # client (and TCP connection/cookie jar) for every single page.
    client = httpx.Client(headers=HEADERS, timeout=20, follow_redirects=True)
    try:
        for page in range(1, max_pages + 1):
            url = f"https://www.goodreads.com/list/show/{list_id}?page={page}"
            resp = client.get(url)
            if resp.status_code == 404:
                break
            resp.raise_for_status()
            soup = BeautifulSoup(resp.text, "lxml")
            rows = soup.select("tr[itemtype='http://schema.org/Book']")
            if not rows:
                # No more results — ran off the end of the list.
                break
            for row in rows:
                title_link = row.select_one("a.bookTitle")
                author_link = row.select_one("a.authorName")
                rating_el = row.select_one("span.minirating")
                if not title_link:
                    continue
                href = title_link.get("href", "")
                id_match = re.search(r"/book/show/(\d+)", href)
                books.append({
                    "title": title_link.get_text(strip=True),
                    "author": author_link.get_text(strip=True) if author_link else "",
                    "rating": rating_el.get_text(strip=True) if rating_el else "",
                    "book_id": id_match.group(1) if id_match else "",
                })
            print(f" Page {page}: {len(rows)} books")
            if page < max_pages:
                # Polite delay between pages; no need to sleep after the last.
                time.sleep(random.uniform(2, 4))
    finally:
        client.close()
    return books
# Example: Best Books of the Decade 2010-2019
# The list ID is the numeric prefix plus the URL slug from the list's URL.
books = get_listopia_books("25111.Best_Books_of_the_Decade_2010_s")
print(f"Found {len(books)} books in list")
Anti-Bot Measures and How to Handle Them
Goodreads is lighter on anti-bot than most Amazon properties, but they do have protections:
Rate limiting is the main defense. Hit Goodreads too fast and you will start getting 429 responses or CAPTCHA pages. Keep requests under 10 per minute for sustained scraping.
def polite_delay(min_sec=3, max_sec=8):
    """Pause for a random number of seconds in [min_sec, max_sec] to avoid rate limiting."""
    pause = random.uniform(min_sec, max_sec)
    time.sleep(pause)
CAPTCHA challenges. After sustained scraping from the same IP, Goodreads serves CAPTCHA pages. The response still returns 200, but the body is a challenge page instead of book data. Always check:
def is_captcha_page(soup: BeautifulSoup) -> bool:
    """Detect whether Goodreads returned a CAPTCHA/challenge page instead of content."""
    # The challenge form is the strongest signal.
    if soup.select_one("form#challenge-form") is not None:
        return True
    # Otherwise look for the usual challenge-page phrases in the body text.
    body = soup.get_text().lower()
    markers = ("captcha", "prove you're human", "unusual traffic")
    return any(marker in body for marker in markers)
Session tracking. Goodreads tracks sessions via cookies. Rotating your session (clearing cookies) every 50-100 requests helps avoid triggering escalating challenges.
def make_fresh_client(proxy_url=None):
    """Create a new httpx client with fresh cookies (and an optional proxy).

    Args:
        proxy_url: Proxy URL like "http://user:pass@host:port", or None
            for a direct connection.

    Returns:
        A configured httpx.Client with an empty cookie jar.
    """
    # httpx deprecated the `proxies` dict argument in 0.26 and removed it
    # in 0.28; a single `proxy` URL now routes both http and https traffic.
    return httpx.Client(
        headers=HEADERS,
        proxy=proxy_url,
        timeout=20,
        follow_redirects=True,
    )
Login walls. Some review content is hidden behind a login requirement for non-authenticated visitors. The HTML book page usually shows enough data, but some deeper review content requires a logged-in session.
For any project needing more than a few hundred book pages — building a dataset, monitoring ratings over time, cataloging extensive bibliographies — residential proxies distribute load across IPs so Goodreads never sees enough traffic from one address to trigger rate limits or CAPTCHAs.
ThorData provides rotating residential proxies that work well for Goodreads. Each request exits from a different residential IP, so Goodreads rate-limits by aggregate load across all IPs rather than throttling your specific scraper.
PROXY_URL = "http://USER:[email protected]:9000"
# Using a proxy with httpx. Modern httpx (0.28+) takes a single `proxy`
# URL; the old `proxies` mount-dict argument was removed.
client = httpx.Client(
    headers=HEADERS,
    proxy=PROXY_URL,
    timeout=20,
    follow_redirects=True,
)
resp = client.get("https://www.goodreads.com/book/show/11127")
soup = BeautifulSoup(resp.text, "lxml")
if is_captcha_page(soup):
    print("Got CAPTCHA — rotating proxy...")
    # Create new client with fresh proxy session
Even with proxies, keep the polite delays. You are distributing load, not eliminating it.
Building a Book Dataset
Here is a complete pipeline for scraping multiple books into a structured CSV dataset.
import csv
import sqlite3
from datetime import datetime, timezone
from pathlib import Path
def init_books_db(db_path="goodreads_books.db"):
    """Open (creating if needed) the SQLite books database and return the connection."""
    connection = sqlite3.connect(db_path)
    # WAL mode lets readers query while the scraper is writing.
    connection.execute("PRAGMA journal_mode=WAL")
    schema = """
    CREATE TABLE IF NOT EXISTS books (
        book_id TEXT PRIMARY KEY,
        title TEXT,
        authors TEXT,
        rating REAL,
        rating_count INTEGER,
        review_count INTEGER,
        description TEXT,
        genres TEXT,
        pages TEXT,
        publish_date TEXT,
        isbn TEXT,
        cover_img TEXT,
        url TEXT,
        fetched_at TEXT
    )
    """
    connection.execute(schema)
    for index_sql in (
        "CREATE INDEX IF NOT EXISTS idx_books_rating ON books(rating)",
        "CREATE INDEX IF NOT EXISTS idx_books_title ON books(title)",
    ):
        connection.execute(index_sql)
    connection.commit()
    return connection
def scrape_book_list(
    book_ids: list,
    db_path: str = "goodreads_books.db",
    proxy_url: str = None,
    delay_range: tuple = (4, 10),
):
    """Scrape a list of book IDs and save to SQLite database.

    Resumable: IDs already present in the database are skipped, so an
    interrupted run can simply be restarted.

    Args:
        book_ids: Numeric Goodreads book IDs to fetch.
        db_path: SQLite database file (created if missing).
        proxy_url: Optional proxy URL forwarded to make_fresh_client().
        delay_range: (min_sec, max_sec) random delay between requests.
    """
    conn = init_books_db(db_path)
    # Check which books are already in the DB so we only fetch new ones.
    existing = set(
        row[0] for row in conn.execute("SELECT book_id FROM books").fetchall()
    )
    to_scrape = [bid for bid in book_ids if bid not in existing]
    print(f"Total: {len(book_ids)} | Already scraped: {len(existing)} | To fetch: {len(to_scrape)}")
    client = make_fresh_client(proxy_url=proxy_url)
    requests_this_session = 0
    try:
        for i, book_id in enumerate(to_scrape):
            # Rotate session every 50 requests to avoid escalating challenges.
            if requests_this_session >= 50:
                client.close()
                time.sleep(random.uniform(5, 10))
                client = make_fresh_client(proxy_url=proxy_url)
                requests_this_session = 0
            try:
                book = get_book_details(book_id, client=client)
                if not book.get("title"):
                    print(f"[{i+1}/{len(to_scrape)}] No data for {book_id}, skipping")
                    continue
                # Timestamp each row at its actual fetch time, timezone-aware.
                # (The original stamped the whole run with one pre-loop
                # datetime.utcnow(), which is also deprecated since 3.12.)
                fetched_at = datetime.now(timezone.utc).isoformat()
                conn.execute("""
                    INSERT OR REPLACE INTO books
                    (book_id, title, authors, rating, rating_count, review_count,
                     description, genres, pages, publish_date, isbn, cover_img, url, fetched_at)
                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                """, (
                    book["book_id"],
                    book["title"],
                    "|".join(book.get("authors", [])),
                    float(book["rating"]) if book.get("rating") else None,
                    int(str(book.get("rating_count", "")).replace(",", "")) if book.get("rating_count") else None,
                    int(str(book.get("review_count", "")).replace(",", "")) if book.get("review_count") else None,
                    book.get("description"),
                    "|".join(book.get("genres", [])),
                    book.get("pages"),
                    book.get("publish_date"),
                    book.get("isbn"),
                    book.get("cover_img"),
                    book.get("url"),
                    fetched_at,
                ))
                conn.commit()
                requests_this_session += 1
                print(f"[{i+1}/{len(to_scrape)}] {book['title']} — {book['rating']}/5 ({book['rating_count']} ratings)")
            except Exception as e:
                # Log and keep going — one bad page must not kill the run.
                print(f"[{i+1}/{len(to_scrape)}] Error on {book_id}: {e}")
            time.sleep(random.uniform(*delay_range))
    finally:
        # Release resources even if the loop dies part-way through.
        client.close()
        conn.close()
    print(f"\nDone. Database: {db_path}")
# Example: scrape classics
# Each ID is the numeric part of a book's /book/show/ URL.
book_ids = ["11127", "1885", "4671", "15823480", "49552", "1490", "11588", "89724", "3836"]
scrape_book_list(book_ids, db_path="classics.db", delay_range=(4, 8))
Export to CSV
def export_to_csv(db_path: str, csv_path: str):
    """Export SQLite book data to CSV (header row plus one row per book).

    Args:
        db_path: SQLite database created by init_books_db()/scrape_book_list().
        csv_path: Destination CSV file (UTF-8, overwritten if present).
    """
    conn = sqlite3.connect(db_path)
    try:
        # One query serves both the column names (cursor.description) and
        # the rows — the original ran a redundant second LIMIT 0 query,
        # and leaked the connection if the query raised.
        cursor = conn.execute("SELECT * FROM books ORDER BY rating_count DESC")
        cols = [d[0] for d in cursor.description]
        books = cursor.fetchall()
    finally:
        conn.close()
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(cols)
        writer.writerows(books)
    print(f"Exported {len(books)} books to {csv_path}")
export_to_csv("classics.db", "goodreads_classics.csv")
Alternatives to Scraping
Before scraping at scale, check these alternatives:
Open Library API (openlibrary.org): Free, no auth needed, covers most books with ISBNs, ratings, and metadata. Not as complete as Goodreads but a solid first choice for ISBN-based lookup. The bulk dataset is also freely downloadable.
Google Books API: Gives you metadata, descriptions, and preview links. With an API key the default quota is 1,000 requests/day per project (you can request more); unauthenticated use is throttled far more aggressively.
ISBNdb: Paid API with comprehensive book data covering millions of titles.
WorldCat: Library catalog data via OCLC APIs — comprehensive metadata but focuses on library editions rather than reader ratings.
If Goodreads-specific data is what you need — the social shelf data, Goodreads ratings specifically, or review text — scraping remains the only path since the API sunset in 2020.
Final Notes
Goodreads scraping in 2026 is straightforward if you are patient. The HTML is well-structured, JSON-LD gives you clean metadata, and the anti-bot measures are manageable with proper delays and session rotation.
Key mistakes to avoid: going too fast, not checking for CAPTCHA pages, not using the JSON-LD data already embedded in the page, and not caching your results between runs.
Start with get_book_details(), verify it works on a handful of books, then scale up. Use ThorData residential proxies when you need to collect thousands of records without triggering blocks.