
title: "Scraping Open Library & Internet Archive: Book Metadata, Covers & Downloads (2026)"
date: 2026-04-09
description: Complete guide to extracting book metadata, cover images, author data, and reading availability from Open Library and Internet Archive APIs with Python. Includes bulk data dumps, SQLite storage, anti-rate-limit patterns, and real use cases.
tags: [python, scraping, open-library, internet-archive, books]


Scraping Open Library & Internet Archive: Book Metadata, Covers & Downloads (2026)

Open Library is one of the best free sources for book data. It's run by the Internet Archive, contains records for over 40 million editions, and has a solid API that doesn't require authentication for most operations.

If you need ISBNs, author info, cover images, or reading availability — this is where you go. No API key needed for basic lookups. The data is openly licensed under CC BY and CC0. And unlike proprietary book databases, there are no per-query fees or restrictive licensing terms.

This guide covers the full picture: the Books API, Search API, Author API, cover image downloads, Internet Archive availability checking, bulk data dumps, and strategies for scaling up without getting throttled.

Understanding the Open Library Data Model

Open Library distinguishes between three types of records:

Works represent the underlying creative work, independent of edition. The work for "The Hitchhiker's Guide to the Galaxy" captures the shared identity across all printings. Work keys look like /works/OL45804W.

Editions represent specific published versions: a particular printing, translation, or format. Each edition has its own ISBN, publisher, and physical details. Edition keys look like /books/OL7353617M.

Authors have their own records with biographical data and links to all their works. Author keys look like /authors/OL26320A.

Understanding this hierarchy matters for scraping. Some data lives at the work level (subjects, description), some at the edition level (publisher, ISBN, page count). You'll often need to fetch both.
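One detail worth internalizing before writing any scraper: every key, whatever its type, resolves to JSON by appending `.json` to its path on openlibrary.org. A tiny helper (my own, not part of any client library) makes that explicit:

```python
def key_to_url(key: str) -> str:
    """Map an Open Library key ("/works/...", "/books/...", "/authors/...")
    to the JSON endpoint that serves its record."""
    return f"https://openlibrary.org{key}.json"

# The example keys from above:
print(key_to_url("/works/OL45804W"))    # work-level record
print(key_to_url("/books/OL7353617M"))  # edition-level record
print(key_to_url("/authors/OL26320A"))  # author record
```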

The Books API — Quick Lookups

The simplest entry point. Give it an ISBN, OLID, or other identifier and get structured book data back.

import requests
import time
import json
import os

session = requests.Session()
session.headers.update({
    "User-Agent": "BookDataBot/1.0 (https://yourproject.example; [email protected])"
})

def get_edition_by_isbn(isbn: str) -> dict | None:
    """Fetch edition data from Open Library by ISBN (10 or 13 digit)."""
    isbn_clean = isbn.replace("-", "").replace(" ", "")
    resp = session.get(
        f"https://openlibrary.org/isbn/{isbn_clean}.json",
        timeout=15,
    )
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    return resp.json()


def get_work(work_key: str) -> dict | None:
    """Fetch work-level data (shared across editions)."""
    resp = session.get(
        f"https://openlibrary.org{work_key}.json",
        timeout=15,
    )
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    return resp.json()


def get_book_details(isbn: str) -> dict | None:
    """Get enriched book info combining edition + work data."""
    edition = get_edition_by_isbn(isbn)
    if not edition:
        return None

    # Fetch work-level data (shared subjects, description, etc.)
    work_key = (edition.get("works") or [{}])[0].get("key", "")
    work = {}
    if work_key:
        work = get_work(work_key) or {}

    # Description can be a string or dict with 'value' key
    desc = work.get("description", "")
    if isinstance(desc, dict):
        desc = desc.get("value", "")

    return {
        "isbn": isbn,
        "title": edition.get("title"),
        "subtitle": edition.get("subtitle"),
        "publishers": edition.get("publishers", []),
        "publish_date": edition.get("publish_date"),
        "publish_places": edition.get("publish_places", []),
        "pages": edition.get("number_of_pages"),
        "languages": [l.get("key", "").replace("/languages/", "")
                      for l in edition.get("languages", [])],
        "subjects": work.get("subjects", [])[:15],
        "description": desc,
        "cover_url_small": f"https://covers.openlibrary.org/b/isbn/{isbn}-S.jpg",
        "cover_url_medium": f"https://covers.openlibrary.org/b/isbn/{isbn}-M.jpg",
        "cover_url_large": f"https://covers.openlibrary.org/b/isbn/{isbn}-L.jpg",
        "ol_edition_key": edition.get("key"),
        "ol_work_key": work_key,
        "dewey_decimal": edition.get("dewey_decimal_class", []),
        "lc_classifications": edition.get("lc_classifications", []),
    }


# Example
book = get_book_details("9780132350884")
if book:
    print(f"{book['title']}")
    print(f"  Published: {book['publish_date']} by {', '.join(book['publishers'])}")
    print(f"  Pages: {book['pages']}")
    print(f"  Subjects: {book['subjects'][:3]}")
    print(f"  Cover: {book['cover_url_large']}")

Books API via Identifier Lookup

The books endpoint supports multiple identifier types simultaneously, returning rich structured data including external links:

def get_book_by_identifiers(**identifiers) -> dict:
    """
    Query the Books API with various identifiers.
    Supported: ISBN_10, ISBN_13, LCCN, OCLC, OLID, etc.
    """
    bibkeys = [f"{k}:{v}" for k, v in identifiers.items()]
    params = {
        "bibkeys": ",".join(bibkeys),
        "format": "json",
        "jscmd": "data",  # use 'details' for even more fields
    }
    resp = session.get(
        "https://openlibrary.org/api/books",
        params=params,
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()


# Multi-identifier lookup
data = get_book_by_identifiers(ISBN_13="9780132350884", LCCN="2006052558")
for bibkey, book in data.items():
    print(f"{bibkey}:")
    print(f"  Title: {book.get('title')}")
    print(f"  URL: {book.get('url')}")
    # This endpoint also returns external links (WorldCat, LibraryThing, etc.)
    for link in book.get("links", []):
        print(f"  Link: {link.get('title')} -> {link.get('url')}")

Search API — Finding Books by Query

When you don't have an ISBN, the Search API lets you query by title, author, subject, publisher, or full text.

def search_books(query: str, limit: int = 10, page: int = 1,
                 fields: list = None) -> dict:
    """Search Open Library for books."""
    default_fields = [
        "key", "title", "author_name", "author_key",
        "first_publish_year", "isbn", "subject",
        "cover_i", "edition_count", "language",
        "publisher", "publish_year", "number_of_pages_median",
    ]
    params = {
        "q": query,
        "limit": min(limit, 100),
        "page": page,
        "fields": ",".join(fields or default_fields),
    }
    resp = session.get(
        "https://openlibrary.org/search.json",
        params=params,
        timeout=20,
    )
    resp.raise_for_status()
    return resp.json()


def search_by_author(author_name: str, limit: int = 50) -> list[dict]:
    """Search for all books by a specific author."""
    return search_books(f'author:"{author_name}"', limit=limit).get("docs", [])


def search_by_subject(subject: str, limit: int = 100) -> list[dict]:
    """Search for books on a specific subject."""
    return search_books(f'subject:"{subject}"', limit=limit).get("docs", [])


# Paginate through all results for a query
def search_all(query: str, max_results: int = 1000) -> list[dict]:
    all_docs = []
    page = 1
    while len(all_docs) < max_results:
        data = search_books(query, limit=100, page=page)
        docs = data.get("docs", [])
        if not docs:
            break
        all_docs.extend(docs)
        if len(docs) < 100:  # last page
            break
        page += 1
        time.sleep(1)
    return all_docs[:max_results]


# Example searches
python_books = search_books("python programming", limit=5)
for doc in python_books.get("docs", []):
    authors = ", ".join(doc.get("author_name", [])[:2])
    print(f"{doc.get('title')} by {authors} ({doc.get('first_publish_year')})")

Author API

Author records contain biographical info, photo IDs, and links to all works:

def get_author(author_key: str) -> dict | None:
    """Fetch author data by Open Library author key."""
    resp = session.get(
        f"https://openlibrary.org{author_key}.json",
        timeout=15,
    )
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    return resp.json()


def get_author_works(author_key: str, limit: int = 50) -> list:
    """Get all works by an author."""
    params = {"limit": limit, "offset": 0}
    resp = session.get(
        f"https://openlibrary.org{author_key}/works.json",
        params=params,
        timeout=20,
    )
    resp.raise_for_status()
    return resp.json().get("entries", [])


def get_author_photo_url(author_key: str, size: str = "M") -> str | None:
    """Get author photo URL by author key."""
    author = get_author(author_key)
    if not author:
        return None
    photos = author.get("photos", [])
    if not photos or photos[0] == -1:
        return None
    photo_id = photos[0]
    return f"https://covers.openlibrary.org/a/id/{photo_id}-{size}.jpg"


# Get author data from a search result
results = search_books("Cormac McCarthy Blood Meridian", limit=1)
if results.get("docs"):
    doc = results["docs"][0]
    author_keys = doc.get("author_key", [])
    if author_keys:
        author = get_author(f"/authors/{author_keys[0]}")
        if author:
            print(f"Author: {author.get('name')}")
            birth = author.get("birth_date", "unknown")
            bio = author.get("bio", "")
            if isinstance(bio, dict):
                bio = bio.get("value", "")
            print(f"  Born: {birth}")
            print(f"  Bio: {bio[:200]}")

Downloading Cover Images in Bulk

Cover images follow a simple URL pattern. Valid sizes are S (small), M (medium), L (large). By default Open Library returns a tiny 1×1 placeholder when no cover exists; check the content length to detect it, or append ?default=false to the URL to get a 404 instead.

def download_covers(isbns: list[str], output_dir: str = "covers",
                    size: str = "M", delay: float = 0.5) -> dict:
    """Download cover images for a list of ISBNs."""
    os.makedirs(output_dir, exist_ok=True)
    stats = {"downloaded": 0, "no_cover": 0, "failed": 0, "skipped": 0}

    for isbn in isbns:
        isbn_clean = isbn.replace("-", "")
        filepath = os.path.join(output_dir, f"{isbn_clean}.jpg")

        if os.path.exists(filepath):
            stats["skipped"] += 1
            continue

        url = f"https://covers.openlibrary.org/b/isbn/{isbn_clean}-{size}.jpg"
        try:
            resp = session.get(url, timeout=15)
            # OL returns a tiny placeholder if no cover exists
            if resp.ok and len(resp.content) > 1000:
                with open(filepath, "wb") as f:
                    f.write(resp.content)
                stats["downloaded"] += 1
            else:
                stats["no_cover"] += 1
        except Exception as e:
            print(f"Failed {isbn}: {e}")
            stats["failed"] += 1

        time.sleep(delay)

    return stats


# Download covers for a set of ISBNs
sample_isbns = ["9780132350884", "9780596007126", "9781449355739", "9780743273565"]
result = download_covers(sample_isbns, delay=0.5)
print(f"Downloaded: {result['downloaded']}, No cover: {result['no_cover']}, Failed: {result['failed']}")

Internet Archive — Full Text Access and Lending

Open Library connects to the Internet Archive for book lending and full-text access. Many books can be borrowed for 1 hour or 14 days. The IA API gives you availability data and reading links.

def check_ia_availability(isbn: str) -> dict:
    """Check if a book is available to borrow on Internet Archive."""
    edition = get_edition_by_isbn(isbn)
    if not edition:
        return {"available": False, "reason": "edition_not_found"}

    ocaid = edition.get("ocaid")  # Internet Archive identifier
    if not ocaid:
        return {"available": False, "reason": "not_in_archive"}

    resp = session.get(
        "https://archive.org/services/availability",
        params={"identifier": ocaid},
        timeout=15,
    )
    if not resp.ok:
        return {"available": False, "reason": "availability_api_error"}

    data = resp.json()
    status = data.get("responses", {}).get(ocaid, {}).get("status", "unknown")

    return {
        "available": status == "available",
        "ia_id": ocaid,
        "read_url": f"https://archive.org/details/{ocaid}",
        "borrow_url": f"https://openlibrary.org/borrow/ia/{ocaid}",
        "status": status,
        "status_label": {
            "available": "Available to borrow",
            "borrowed": "All copies borrowed",
            "restricted": "Restricted access",
            "error": "Error checking availability",
        }.get(status, status),
    }


def get_ia_metadata(ocaid: str) -> dict | None:
    """Fetch Internet Archive item metadata."""
    resp = session.get(
        f"https://archive.org/metadata/{ocaid}",
        timeout=20,
    )
    if not resp.ok:
        return None
    return resp.json()


result = check_ia_availability("9780132350884")
print(f"Available: {result['available']}")
print(f"Status: {result['status_label']}")
if result.get("read_url"):
    print(f"Read at: {result['read_url']}")

Handling Rate Limits and Anti-Bot Measures

Open Library doesn't publish official rate limits, but sustained crawling above roughly 100 requests per minute can trigger throttling. Their servers are community-funded infrastructure, so aggressive scraping has real impact.

Practical guidelines:

- Keep requests under 60/minute for sustained crawling
- Add a User-Agent header identifying your project
- Cache responses locally — most book metadata doesn't change
- Use the bulk dumps for large-scale collection instead of API calls

For projects that need higher throughput — catalog building, dataset enrichment at scale, real-time book data for an application — routing through proxies distributes load and prevents single-IP throttling.

ThorData's residential proxy network works well for Open Library and the Internet Archive. Their pay-as-you-go model suits the bursty nature of book data collection.

import random

PROXY_CONFIGS = [
    {"user": "user1", "pass": "pass1", "host": "proxy.thordata.net", "port": 10000},
    {"user": "user1", "pass": "pass1", "host": "proxy.thordata.net", "port": 10001},
]

def make_proxy_url(config: dict = None) -> str:
    """Get a proxy URL, optionally with session-based sticky routing."""
    if config is None:
        config = random.choice(PROXY_CONFIGS)
    session_id = random.randint(10000, 99999)
    return (f"http://{config['user']}-session-{session_id}:{config['pass']}"
            f"@{config['host']}:{config['port']}")


def fetch_with_proxy(url: str, params: dict = None) -> requests.Response:
    """Fetch with proxy rotation for bulk operations."""
    proxy_url = make_proxy_url()
    s = requests.Session()
    s.proxies = {"http": proxy_url, "https": proxy_url}
    s.headers.update({
        "User-Agent": "BookDataBot/1.0 ([email protected])"
    })
    resp = s.get(url, params=params, timeout=30)
    resp.raise_for_status()
    time.sleep(random.uniform(1, 3))
    return resp

Bulk Data Dumps

For large-scale collection, downloading and processing Open Library's bulk data dumps is far more efficient than API calls. Dumps are updated monthly and cover all works, editions, authors, and subjects.

Download from: https://openlibrary.org/developers/dumps

Each line is tab-separated with five fields: type, key, revision, last_modified, and the record JSON.

import gzip
import sqlite3

def stream_ol_dump(dump_path: str, record_type: str = "/type/work"):
    """
    Stream Open Library data dump.
    record_type: '/type/work', '/type/edition', '/type/author'
    """
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) < 5:
                continue
            rtype, key, revision, last_modified, json_data = parts[:5]
            if rtype != record_type:
                continue
            try:
                data = json.loads(json_data)
                data["_key"] = key
                data["_last_modified"] = last_modified
                yield data
            except json.JSONDecodeError:
                continue


def load_dump_to_db(dump_path: str, record_type: str, db_path: str):
    """Load a dump file into SQLite."""
    conn = sqlite3.connect(db_path)

    if record_type == "/type/work":
        conn.execute("""CREATE TABLE IF NOT EXISTS works (
            key TEXT PRIMARY KEY, title TEXT, subjects TEXT,
            description TEXT, last_modified TEXT)""")
        for work in stream_ol_dump(dump_path, "/type/work"):
            desc = work.get("description", "")
            if isinstance(desc, dict):
                desc = desc.get("value", "")
            conn.execute(
                "INSERT OR REPLACE INTO works VALUES (?,?,?,?,?)",
                (work["_key"], work.get("title"),
                 json.dumps(work.get("subjects", [])[:20]),
                 desc[:2000], work.get("_last_modified")),
            )

    elif record_type == "/type/edition":
        conn.execute("""CREATE TABLE IF NOT EXISTS editions (
            key TEXT PRIMARY KEY, title TEXT, isbn_13 TEXT,
            isbn_10 TEXT, publish_date TEXT, pages INTEGER,
            work_key TEXT, last_modified TEXT)""")
        for edition in stream_ol_dump(dump_path, "/type/edition"):
            isbns_13 = edition.get("isbn_13", [])
            isbns_10 = edition.get("isbn_10", [])
            work_keys = edition.get("works", [{}])
            conn.execute(
                "INSERT OR REPLACE INTO editions VALUES (?,?,?,?,?,?,?,?)",
                (edition["_key"], edition.get("title"),
                 isbns_13[0] if isbns_13 else None,
                 isbns_10[0] if isbns_10 else None,
                 edition.get("publish_date"),
                 edition.get("number_of_pages"),
                 work_keys[0].get("key") if work_keys else None,
                 edition.get("_last_modified")),
            )

    conn.commit()
    print(f"Loaded dump into {db_path}")
    conn.close()

Building a Local Book Database

Putting it all together — a script that builds a SQLite database from Open Library search results:

import sqlite3

def init_book_db(db_path: str = "books.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS books (
            isbn TEXT PRIMARY KEY,
            title TEXT,
            subtitle TEXT,
            authors TEXT,
            publishers TEXT,
            publish_date TEXT,
            pages INTEGER,
            subjects TEXT,
            description TEXT,
            cover_url TEXT,
            ol_edition_key TEXT,
            ol_work_key TEXT,
            ia_available INTEGER DEFAULT 0,
            ia_id TEXT,
            fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS authors (
            ol_key TEXT PRIMARY KEY,
            name TEXT,
            birth_date TEXT,
            bio TEXT,
            photo_url TEXT,
            fetched_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_books_title ON books(title)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_books_publish ON books(publish_date)")
    conn.commit()
    return conn


def store_book(conn: sqlite3.Connection, book: dict):
    conn.execute("""
        INSERT OR REPLACE INTO books
        (isbn, title, subtitle, authors, publishers, publish_date,
         pages, subjects, description, cover_url, ol_edition_key, ol_work_key)
        VALUES (?,?,?,?,?,?,?,?,?,?,?,?)
    """, (
        book.get("isbn"),
        book.get("title"),
        book.get("subtitle"),
        json.dumps(book.get("authors", [])),
        json.dumps(book.get("publishers", [])),
        book.get("publish_date"),
        book.get("pages"),
        json.dumps(book.get("subjects", [])[:10]),
        book.get("description", "")[:5000],
        book.get("cover_url_large"),
        book.get("ol_edition_key"),
        book.get("ol_work_key"),
    ))
    conn.commit()


# Build a collection from search queries
conn = init_book_db()
search_queries = [
    "python programming",
    "machine learning deep learning",
    "web development javascript",
    "data science statistics",
]

for query in search_queries:
    print(f"Searching: {query}")
    results = search_books(query, limit=50)
    for doc in results.get("docs", []):
        isbns = doc.get("isbn", [])
        if not isbns:
            continue
        isbn = isbns[0]
        book = get_book_details(isbn)
        if book:
            # Add author names from search result
            book["authors"] = doc.get("author_name", [])
            store_book(conn, book)
    time.sleep(2)

cursor = conn.execute("SELECT COUNT(*) FROM books")
print(f"Total books in database: {cursor.fetchone()[0]}")
conn.close()

Subject Browse API

Open Library has a subject browse API that returns books tagged with a specific subject:

def browse_by_subject(subject: str, limit: int = 50, offset: int = 0) -> dict:
    """Browse books by subject tag."""
    subject_slug = subject.lower().replace(" ", "_")
    resp = session.get(
        f"https://openlibrary.org/subjects/{subject_slug}.json",
        params={"limit": limit, "offset": offset},
        timeout=20,
    )
    resp.raise_for_status()
    return resp.json()


def get_work_bookshelves(work_key: str) -> dict:
    """Get community bookshelf counts for a work (a rough popularity signal)."""
    resp = session.get(
        f"https://openlibrary.org{work_key}/bookshelves.json",
        timeout=15,
    )
    if not resp.ok:
        return {}
    return resp.json().get("counts", {})


# Browse science fiction books
sf_data = browse_by_subject("science_fiction", limit=20)
print(f"Subject: {sf_data.get('name')}")
print(f"Total works: {sf_data.get('work_count', 0)}")
for work in sf_data.get("works", [])[:5]:
    authors = ", ".join(a.get("name", "") for a in work.get("authors", []))
    print(f"  {work.get('title')} by {authors}")

Practical Tips

Check content-length for cover images — A response of ~807 bytes is the 1×1 placeholder. Only save if len(response.content) > 1000.

Work vs. Edition — Subjects and descriptions live on the Work. Publisher, ISBN, page count, and physical details live on the Edition. Always fetch both for complete data.

ISBNs aren't unique per edition — Some editions list multiple ISBNs (e.g., both the 10-digit and 13-digit forms). Some ISBNs appear on multiple editions due to data quality issues. Deduplicate by work key if you need unique titles.
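One way to do that dedup, assuming the docs are the dicts returned by search_books above (preferring the doc with the highest edition_count is my own heuristic, not an Open Library convention):

```python
def dedupe_by_work(docs: list[dict]) -> list[dict]:
    """Keep one search result per work key, preferring the doc with the
    most editions as a rough proxy for the canonical record."""
    best: dict[str, dict] = {}
    for doc in docs:
        key = doc.get("key")  # e.g. "/works/OL45804W"
        if not key:
            continue
        current = best.get(key)
        if current is None or doc.get("edition_count", 0) > current.get("edition_count", 0):
            best[key] = doc
    return list(best.values())
```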

Use bulk dumps for research datasets — The API is best for real-time lookups. For building a complete catalog or training data, the monthly dumps are orders of magnitude more efficient.

Respect the infrastructure — Open Library runs on community funding and volunteer effort. Add delays, cache aggressively, and prefer dumps over repeated API calls for bulk work.
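Caching can be as light as one JSON file per URL. A minimal sketch (the hashing scheme and default ol_cache directory are illustrative choices, not anything Open Library prescribes):

```python
import hashlib
import json
import os

def cache_path(url: str, cache_dir: str) -> str:
    """Stable on-disk path for a URL; hashing keeps filenames filesystem-safe."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:24]
    return os.path.join(cache_dir, f"{digest}.json")

def cached_get_json(url: str, fetch, cache_dir: str = "ol_cache") -> dict:
    """Return cached JSON for a URL, calling fetch(url) -> dict on a miss.
    Book metadata changes rarely, so no expiry is applied here."""
    os.makedirs(cache_dir, exist_ok=True)
    path = cache_path(url, cache_dir)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data

# Usage with the session defined earlier:
# book = cached_get_json(
#     "https://openlibrary.org/isbn/9780132350884.json",
#     lambda u: session.get(u, timeout=15).json(),
# )
```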

Handling OCLC, LCCN, and Other Identifiers

Open Library stores many identifier types beyond ISBN. These are useful for deduplication and cross-referencing with library catalogs:

IDENTIFIER_TYPES = ['isbn_10', 'isbn_13', 'lccn', 'oclc_numbers',
                    'dewey_decimal_class', 'lc_classifications']

def get_all_identifiers(edition):
    ids = {}
    for id_type in IDENTIFIER_TYPES:
        values = edition.get(id_type, [])
        if values:
            ids[id_type] = values if isinstance(values, list) else [values]
    return ids

def find_by_lccn(lccn):
    resp = session.get(
        f'https://openlibrary.org/books/lccn/{lccn}.json',
        timeout=15
    )
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    return resp.json()

Reading Lists and Bookshelves

Open Library has a community 'bookshelves' feature where users can mark books as 'want to read', 'currently reading', or 'already read'. This reading count data is a useful proxy for popularity:

def get_bookshelf_counts(work_key):
    resp = session.get(
        f'https://openlibrary.org{work_key}/bookshelves.json',
        timeout=15
    )
    if not resp.ok:
        return {}
    data = resp.json()
    return {
        'want_to_read': data.get('counts', {}).get('want_to_read', 0),
        'currently_reading': data.get('counts', {}).get('currently_reading', 0),
        'already_read': data.get('counts', {}).get('already_read', 0),
    }

Ratings and Community Data

Open Library collects user ratings through a 5-star system. Ratings are aggregated at the work level:

def get_ratings(work_key):
    resp = session.get(
        f'https://openlibrary.org{work_key}/ratings.json',
        timeout=15
    )
    if not resp.ok:
        return None
    data = resp.json()
    summary = data.get('summary', {})
    return {
        'average': summary.get('average'),
        'count': summary.get('count', 0),
    }

work_key = '/works/OL45804W'
ratings = get_ratings(work_key)
shelves = get_bookshelf_counts(work_key)
if ratings and ratings.get('average') is not None:
    print(f'Rating: {ratings["average"]:.2f} from {ratings["count"]} users')
print(f'Want to read: {shelves.get("want_to_read", 0)}')

Building a Genre-Based Catalog

Combining the search API with subject browse, you can build genre-specific catalogs efficiently. Here is a pipeline for building a curated science fiction catalog with covers and availability:

def build_genre_catalog(genre, output_db='books.db', max_books=500):
    conn = init_book_db(output_db)
    offset = 0
    total_saved = 0

    while total_saved < max_books:
        data = browse_by_subject(genre, limit=50, offset=offset)
        works = data.get('works', [])
        if not works:
            break

        for work in works:
            # Each subject result is a work; list its editions to find an ISBN
            work_key = work.get('key', '')
            if not work_key:
                continue
            eresp = session.get(
                f'https://openlibrary.org{work_key}/editions.json',
                params={'limit': 5},
                timeout=15
            )
            if not eresp.ok:
                continue
            editions = eresp.json().get('entries', [])
            for ed in editions:
                isbns = ed.get('isbn_13', []) or ed.get('isbn_10', [])
                if isbns:
                    book = get_book_details(isbns[0])
                    if book:
                        book['authors'] = [a.get('name', '') for a in work.get('authors', [])]
                        store_book(conn, book)
                        total_saved += 1
                    break
            if total_saved >= max_books:
                break
            time.sleep(0.5)

        offset += 50
        time.sleep(2)

    conn.close()
    return total_saved

Summary

Open Library and the Internet Archive together provide a comprehensive, freely licensed book data platform:

Use the API for real-time lookups; use the dumps for research datasets. Cache aggressively, respect the infrastructure, and these sources will serve most book data needs without the costs of commercial alternatives.