
How to Scrape Coursera Course Data with Python (2026)

Coursera is one of the largest online learning platforms with over 7,000 courses from 300+ universities and companies, plus thousands more professional certificates and specializations. If you're analyzing the online education market, tracking which skills are in demand, building a course comparison tool, or researching how top institutions structure their curricula, you need Coursera data at scale.

The good news: Coursera runs a semi-public API that returns JSON for course metadata, partner information, and instructor data without authentication. The less good news: the interesting data — enrollment counts, detailed ratings, syllabus content — requires web scraping their actual pages, which involves some anti-bot handling.

This guide covers the complete stack: the catalog API for bulk metadata collection, HTML scraping for enrollment and ratings, specialization and certificate program extraction, pagination patterns, SQLite storage, and proxy configuration for production-scale collection.

The Coursera Catalog API

Coursera's API doesn't require authentication for catalog browsing. The base URL is https://api.coursera.org/api/, and the endpoints this guide relies on are /courses.v1, /partners.v1, /instructors.v1, and /onDemandSpecializations.v1. Install the dependencies, then set up a session with browser-like headers:

pip install requests httpx beautifulsoup4 lxml
import requests
import json
import time
import random
from typing import Optional

API_BASE = "https://api.coursera.org/api"

API_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/127.0.0.0 Safari/537.36",
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.coursera.org/",
    "Origin": "https://www.coursera.org",
}

session = requests.Session()
session.headers.update(API_HEADERS)


def api_get(endpoint: str, params: Optional[dict] = None, max_retries: int = 3) -> Optional[dict]:
    """
    Make a Coursera API request with retry logic.
    endpoint: path like '/courses.v1' or '/partners.v1'
    """
    url = f"{API_BASE}{endpoint}"

    for attempt in range(max_retries):
        try:
            resp = session.get(url, params=params, timeout=20)

            if resp.status_code == 429:
                retry_after = int(resp.headers.get("Retry-After", 60))
                print(f"  Rate limited. Waiting {retry_after}s...")
                time.sleep(retry_after)
                continue

            if resp.status_code == 401:
                print(f"  401 Unauthorized for {endpoint}")
                return None

            if resp.status_code == 404:
                return None

            resp.raise_for_status()
            return resp.json()

        except requests.Timeout:
            print(f"  Timeout on attempt {attempt + 1}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
        except requests.RequestException as e:
            print(f"  Request error: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)

    return None
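
The retry loop above backs off exponentially: `2 ** attempt` means the first failure sleeps 1 second, the second 2 seconds. If you run several workers, adding jitter keeps their retries from synchronizing. A sketch (`backoff_delay` is a hypothetical helper, not part of the code above):

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Swapping `time.sleep(2 ** attempt)` for `time.sleep(backoff_delay(attempt))` is a drop-in change.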

Searching and Paginating Courses

The courses endpoint supports search with a fields[] parameter to specify which data you want:

# Useful fields for courses.v1 (not exhaustive; some may be restricted)
COURSE_FIELDS = [
    # Core identity
    "id", "slug", "name",
    # Description
    "description", "shortDescription",
    # Metadata
    "workload", "courseType", "certificateType",
    "primaryLanguages", "subtitleLanguages",
    "partnerIds", "instructorIds", "domainTypes",
    # Stats (some may be restricted)
    "enrolledCount",
    # Content
    "modules",
]


def search_courses(
    query: str,
    limit: int = 100,
    start: int = 0,
    fields: Optional[list[str]] = None,
) -> dict:
    """
    Search Coursera courses.
    Returns: {courses, total, next_start}
    """
    if fields is None:
        fields = [
            "description", "slug", "workload", "primaryLanguages",
            "subtitleLanguages", "partnerIds", "instructorIds",
            "domainTypes", "certificateType", "courseType"
        ]

    params = {
        "q": "search",
        "query": query,
        "limit": min(limit, 100),  # API cap
        "start": start,
        "fields": ",".join(fields),
        "includes": "partnerIds,instructorIds",
        "showHidden": "false",
    }

    data = api_get("/courses.v1", params)
    if not data:
        return {"courses": [], "total": 0, "next_start": None}

    courses = []
    for element in data.get("elements", []):
        course = {
            "id": element.get("id", ""),
            "slug": element.get("slug", ""),
            "name": element.get("name", ""),
            "description": (element.get("description") or "")[:500],
            "short_description": (element.get("shortDescription") or "")[:200],
            "workload": element.get("workload", ""),
            "certificate_type": element.get("certificateType", ""),
            "course_type": element.get("courseType", ""),
            "languages": element.get("primaryLanguages") or [],
            "subtitle_languages": element.get("subtitleLanguages") or [],
            "partner_ids": element.get("partnerIds") or [],
            "instructor_ids": element.get("instructorIds") or [],
            "domain_types": element.get("domainTypes") or [],
        }
        courses.append(course)

    paging = data.get("paging") or {}
    total = paging.get("total", len(courses))
    next_start = start + len(courses) if start + len(courses) < total else None

    return {"courses": courses, "total": total, "next_start": next_start}


def collect_all_courses(
    query: str,
    max_courses: int = 1000,
    delay: float = 0.8,
) -> list[dict]:
    """
    Paginate through all search results for a query.
    Returns up to max_courses courses.
    """
    all_courses = []
    start = 0

    while len(all_courses) < max_courses:
        batch_size = min(100, max_courses - len(all_courses))
        result = search_courses(query, limit=batch_size, start=start)

        if not result["courses"]:
            break

        all_courses.extend(result["courses"])
        print(f"  Collected {len(all_courses)} / {result['total']} courses")

        if result["next_start"] is None:
            break

        start = result["next_start"]
        time.sleep(delay + random.uniform(0, 0.3))

    return all_courses[:max_courses]


def browse_all_courses(
    max_per_subject: int = 200,
    delay: float = 0.8,
) -> list[dict]:
    """
    Browse all Coursera courses by category using common search terms.
    """
    subjects = [
        "data science",
        "machine learning",
        "programming",
        "business",
        "finance",
        "marketing",
        "health",
        "language",
        "design",
        "personal development",
        "social sciences",
        "mathematics",
        "computer science",
        "arts",
        "information technology",
    ]

    all_courses = {}

    for subject in subjects:
        print(f"\nSubject: {subject}")
        courses = collect_all_courses(subject, max_courses=max_per_subject, delay=delay)
        for c in courses:
            if c["id"] not in all_courses:
                all_courses[c["id"]] = c
        print(f"  Unique courses so far: {len(all_courses)}")
        time.sleep(2)

    return list(all_courses.values())

Fetching Partner and Instructor Data

The API returns partner IDs (not names) with courses. Fetch partner details separately:

def get_partners(partner_ids: list[str]) -> dict[str, dict]:
    """
    Fetch university/organization data for a list of partner IDs.
    Returns dict keyed by partner ID.
    """
    if not partner_ids:
        return {}

    # Batch by 50 IDs max per request
    result = {}
    for i in range(0, len(partner_ids), 50):
        batch = partner_ids[i:i+50]
        ids_param = ",".join(batch)

        data = api_get(
            f"/partners.v1",
            params={
                "ids": ids_param,
                "fields": "name,shortName,description,homeLink,location,"
                          "logo,websiteTwitter,websiteFacebook,classLogo",
            }
        )

        if data:
            for p in data.get("elements", []):
                result[p["id"]] = {
                    "id": p["id"],
                    "name": p.get("name", ""),
                    "short_name": p.get("shortName", ""),
                    "description": (p.get("description") or "")[:300],
                    "location": p.get("location", ""),
                    "home_url": p.get("homeLink", ""),
                    "logo_url": p.get("logo", ""),
                }

        time.sleep(0.5)

    return result


def get_instructors(instructor_ids: list[str]) -> dict[str, dict]:
    """
    Fetch instructor profile data.
    Returns dict keyed by instructor ID.
    """
    if not instructor_ids:
        return {}

    result = {}
    for i in range(0, len(instructor_ids), 50):
        batch = instructor_ids[i:i+50]
        ids_param = ",".join(batch)

        data = api_get(
            "/instructors.v1",
            params={
                "ids": ids_param,
                "fields": "firstName,lastName,title,department,bio,photo,"
                          "websites,partnerShortName",
            }
        )

        if data:
            for inst in data.get("elements", []):
                full_name = f"{inst.get('firstName', '')} {inst.get('lastName', '')}".strip()
                result[inst["id"]] = {
                    "id": inst["id"],
                    "name": full_name,
                    "title": inst.get("title", ""),
                    "department": inst.get("department", ""),
                    "bio": (inst.get("bio") or "")[:400],
                    "photo_url": inst.get("photo", ""),
                    "institution": inst.get("partnerShortName", ""),
                }

        time.sleep(0.5)

    return result
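
Both fetchers above slice their ID lists into batches of 50 by hand. The pattern generalizes to a small helper (hypothetical, shown for illustration; the functions above inline the same logic):

```python
from typing import Iterator, Sequence


def chunked(items: Sequence, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield list(items[i:i + size])
```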


def enrich_courses_with_partner_data(courses: list[dict]) -> list[dict]:
    """
    Fetch partner and instructor details and add them to course records.
    """
    # Collect all unique IDs
    all_partner_ids = list({
        pid for c in courses
        for pid in c.get("partner_ids", [])
        if pid
    })
    all_instructor_ids = list({
        iid for c in courses
        for iid in c.get("instructor_ids", [])
        if iid
    })

    print(f"Fetching {len(all_partner_ids)} partners...")
    partners = get_partners(all_partner_ids)

    print(f"Fetching {len(all_instructor_ids)} instructors...")
    instructors = get_instructors(all_instructor_ids)

    # Enrich courses
    for course in courses:
        partner_data = [
            partners.get(pid, {})
            for pid in course.get("partner_ids", [])
            if pid in partners
        ]
        instructor_data = [
            instructors.get(iid, {})
            for iid in course.get("instructor_ids", [])
            if iid in instructors
        ]

        course["partners"] = partner_data
        course["partner_names"] = [p.get("name", "") for p in partner_data]
        course["partner_locations"] = [p.get("location", "") for p in partner_data]
        course["instructors"] = instructor_data
        course["instructor_names"] = [i.get("name", "") for i in instructor_data]

    return courses

Scraping Enrollment and Ratings from Course Pages

The API doesn't reliably expose enrollment counts or detailed ratings (fields like enrolledCount are typically restricted). These numbers are visible on the course pages themselves:

from bs4 import BeautifulSoup
import re


WEB_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/127.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}


def scrape_course_page(
    slug: str,
    proxies: dict = None,
) -> dict:
    """
    Scrape a Coursera course page for enrollment, ratings, and syllabus.
    slug: the course's URL slug (e.g. 'machine-learning', 'python-for-everybody')
    """
    url = f"https://www.coursera.org/learn/{slug}"
    result = {"slug": slug, "url": url}

    try:
        resp = requests.get(
            url,
            headers=WEB_HEADERS,
            proxies=proxies,
            timeout=20,
            allow_redirects=True,
        )

        if resp.status_code == 403:
            return {**result, "error": "bot_detection"}
        if resp.status_code == 404:
            return {**result, "error": "not_found"}
        resp.raise_for_status()

    except requests.Timeout:
        return {**result, "error": "timeout"}
    except requests.RequestException as e:
        return {**result, "error": str(e)}

    soup = BeautifulSoup(resp.text, "lxml")

    # === JSON-LD structured data (most reliable) ===
    ld_data = {}
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(script.string or "{}")
            if data.get("@type") in ("Course", "EducationalOccupationalProgram"):
                ld_data = data
                break
        except (json.JSONDecodeError, AttributeError):
            continue

    # === Enrollment count ===
    enrollment = None

    # Try JSON-LD first
    if ld_data.get("numberOfEnrollments"):
        try:
            enrollment = int(str(ld_data["numberOfEnrollments"]).replace(",", ""))
        except ValueError:
            pass

    # Try page text patterns
    if enrollment is None:
        enroll_patterns = [
            r'([\d,]+(?:\.\d+)?[KkMm]?)\s+(?:already\s+)?enrolled',
            r'([\d,]+(?:\.\d+)?[KkMm]?)\s+students?',
            r'Enrolled:\s*([\d,]+)',
        ]
        for pattern in enroll_patterns:
            m = re.search(pattern, resp.text, re.IGNORECASE)
            if m:
                raw = m.group(1).replace(",", "")
                try:
                    if raw.lower().endswith("k"):
                        enrollment = int(float(raw[:-1]) * 1000)
                    elif raw.lower().endswith("m"):
                        enrollment = int(float(raw[:-1]) * 1_000_000)
                    else:
                        enrollment = int(raw)
                    break
                except ValueError:
                    pass

    result["enrollment"] = enrollment

    # === Rating ===
    rating = None
    rating_count = None

    # From JSON-LD
    agg_rating = ld_data.get("aggregateRating") or {}
    if agg_rating.get("ratingValue"):
        try:
            rating = float(agg_rating["ratingValue"])
        except (ValueError, TypeError):
            pass
    if agg_rating.get("reviewCount") or agg_rating.get("ratingCount"):
        try:
            raw_count = agg_rating.get("reviewCount") or agg_rating.get("ratingCount")
            rating_count = int(str(raw_count).replace(",", ""))
        except (ValueError, TypeError):
            pass

    # From HTML selectors
    if rating is None:
        rating_selectors = [
            "[data-testid*='rating'] .css-6ecy9b",
            "[class*='ratings__value']",
            "[itemprop='ratingValue']",
            ".star-rating-text",
        ]
        for sel in rating_selectors:
            el = soup.select_one(sel)
            if el:
                m = re.search(r'(\d+\.?\d*)', el.get_text() or el.get("content", ""))
                if m:
                    val = float(m.group(1))
                    if 1.0 <= val <= 5.0:
                        rating = val
                        break

    result["rating"] = rating
    result["rating_count"] = rating_count

    # === Course level ===
    level = ""
    level_patterns = [
        r'\b(Beginner|Intermediate|Advanced|Mixed)\b',
    ]
    for pattern in level_patterns:
        m = re.search(pattern, resp.text, re.IGNORECASE)
        if m:
            level = m.group(1).capitalize()
            break
    result["level"] = level

    # === Duration/pace ===
    duration = ld_data.get("timeRequired", "")
    if not duration:
        duration_el = soup.select_one("[class*='duration'], [data-testid*='duration']")
        if duration_el:
            duration = duration_el.get_text(strip=True)
    result["duration"] = duration

    # === Syllabus / Modules ===
    modules = []
    module_selectors = [
        "[class*='SyllabusModule']",
        "[class*='syllabus-item']",
        "[data-testid='accordion-item']",
    ]
    for sel in module_selectors:
        for module in soup.select(sel)[:20]:
            title_el = module.select_one("h3, h4, [class*='title'], strong")
            desc_el = module.select_one("[class*='description'], p")
            duration_el = module.select_one("[class*='duration'], [class*='hours']")

            if title_el:
                modules.append({
                    "title": title_el.get_text(strip=True)[:200],
                    "description": (desc_el.get_text(strip=True) if desc_el else "")[:300],
                    "duration": duration_el.get_text(strip=True) if duration_el else "",
                })
        if modules:
            break
    result["modules"] = modules

    # === Skills gained ===
    skills = []
    skill_selectors = [
        "[data-testid*='skill']",
        "[class*='SkillChip']",
        "[class*='Chip'][class*='skill']",
        "a[href*='/learn/skills']",
    ]
    for sel in skill_selectors:
        for el in soup.select(sel)[:20]:
            text = el.get_text(strip=True)
            if text and 2 < len(text) < 60:
                skills.append(text)
        if skills:
            break
    result["skills"] = list(set(skills))[:15]

    # === Provider name from JSON-LD ===
    provider = ld_data.get("provider", {})
    if isinstance(provider, dict):
        result["provider"] = provider.get("name", "")
    elif isinstance(provider, list) and provider:
        result["provider"] = provider[0].get("name", "")
    else:
        result["provider"] = ""

    # === Course image ===
    result["image_url"] = ld_data.get("image", "")

    # === Course name from page if not already known ===
    if not result.get("name"):
        h1 = soup.select_one("h1")
        result["name"] = h1.get_text(strip=True) if h1 else ""

    return result
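
The K/M suffix handling above is fiddly enough to deserve its own unit-testable function. A version factored out under a hypothetical name, `parse_count` (the scraper above inlines the same logic):

```python
from typing import Optional


def parse_count(raw: str) -> Optional[int]:
    """Parse display counts like '1.2M', '380K', or '45,000' into integers."""
    raw = raw.replace(",", "").strip()
    try:
        if raw.lower().endswith("k"):
            return int(float(raw[:-1]) * 1_000)
        if raw.lower().endswith("m"):
            return int(float(raw[:-1]) * 1_000_000)
        return int(raw)
    except ValueError:
        return None
```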


def scrape_courses_batch(
    slugs: list[str],
    delay_min: float = 1.5,
    delay_max: float = 4.0,
    proxies: dict = None,
    max_failures: int = 10,
) -> list[dict]:
    """
    Scrape multiple course pages with rate limiting.
    Stops early if too many consecutive failures.
    """
    results = []
    consecutive_failures = 0

    for i, slug in enumerate(slugs):
        print(f"  [{i+1}/{len(slugs)}] {slug}")

        result = scrape_course_page(slug, proxies=proxies)
        results.append(result)

        if result.get("error"):
            consecutive_failures += 1
            print(f"    Error: {result['error']}")
            if consecutive_failures >= max_failures:
                print(f"  Too many failures ({max_failures}), stopping")
                break
        else:
            consecutive_failures = 0
            print(f"    Enrollment: {result.get('enrollment') or 'N/A'} | "
                  f"Rating: {result.get('rating') or 'N/A'}")

        wait = random.uniform(delay_min, delay_max)
        time.sleep(wait)

    return results

Specializations and Certificate Programs

Coursera bundles courses into specializations. These have their own API endpoint:

def search_specializations(
    query: str,
    limit: int = 50,
    start: int = 0,
) -> dict:
    """Search Coursera specializations."""
    params = {
        "q": "search",
        "query": query,
        "limit": min(limit, 100),
        "start": start,
        "fields": "name,slug,description,shortDescription,partnerIds,"
                  "courseIds,logo,primaryLanguages,workload",
    }

    data = api_get("/onDemandSpecializations.v1", params)
    if not data:
        return {"specializations": [], "total": 0, "next_start": None}

    specs = []
    for s in data.get("elements", []):
        specs.append({
            "id": s.get("id", ""),
            "slug": s.get("slug", ""),
            "name": s.get("name", ""),
            "description": (s.get("description") or "")[:400],
            "partner_ids": s.get("partnerIds") or [],
            "course_ids": s.get("courseIds") or [],
            "num_courses": len(s.get("courseIds") or []),
            "logo_url": s.get("logo", ""),
            "workload": s.get("workload", ""),
        })

    paging = data.get("paging") or {}
    total = paging.get("total", len(specs))
    next_start = start + len(specs) if start + len(specs) < total else None

    return {"specializations": specs, "total": total, "next_start": next_start}


def get_specialization_courses(spec: dict, course_lookup: dict = None) -> dict:
    """
    Fetch full course details for a specialization.
    course_lookup: optional dict of {course_id: course_data} to avoid re-fetching
    """
    course_ids = spec.get("course_ids", [])
    if not course_ids:
        return {**spec, "courses": []}

    if course_lookup:
        courses = [course_lookup[cid] for cid in course_ids if cid in course_lookup]
        missing = [cid for cid in course_ids if cid not in (course_lookup or {})]
    else:
        courses = []
        missing = course_ids

    # Fetch missing courses from API
    if missing:
        for i in range(0, len(missing), 20):
            batch = missing[i:i+20]
            data = api_get(
                "/courses.v1",
                params={
                    "ids": ",".join(batch),
                    "fields": "name,slug,workload,primaryLanguages,partnerIds,instructorIds",
                }
            )
            if data:
                for c in data.get("elements", []):
                    courses.append({
                        "id": c["id"],
                        "slug": c.get("slug", ""),
                        "name": c.get("name", ""),
                        "workload": c.get("workload", ""),
                    })
            time.sleep(0.5)

    return {**spec, "courses": courses}


def scrape_specialization_page(slug: str, proxies: dict = None) -> dict:
    """
    Scrape a Coursera specialization page for enrollment and ratings.
    Specializations are at coursera.org/specializations/{slug}
    """
    url = f"https://www.coursera.org/specializations/{slug}"
    result = {"slug": slug, "url": url, "type": "specialization"}

    try:
        resp = requests.get(url, headers=WEB_HEADERS, proxies=proxies, timeout=20)
        if resp.status_code != 200:
            return {**result, "error": f"HTTP {resp.status_code}"}
    except Exception as e:
        return {**result, "error": str(e)}

    soup = BeautifulSoup(resp.text, "lxml")

    # JSON-LD
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(script.string or "{}")
            if data.get("@type") in ("Course", "EducationalOccupationalProgram"):
                agg = data.get("aggregateRating") or {}
                result["rating"] = agg.get("ratingValue")
                result["rating_count"] = agg.get("reviewCount") or agg.get("ratingCount")
                result["name"] = data.get("name", "")
                result["description"] = (data.get("description") or "")[:400]
                result["image_url"] = data.get("image", "")
                provider = data.get("provider")
                if isinstance(provider, dict):
                    result["provider"] = provider.get("name", "")
                break
        except Exception:
            continue

    # Enrollment
    enrollment = None
    m = re.search(r'([\d,]+(?:\.\d+)?[KkMm]?)\s+(?:already\s+)?enrolled', resp.text, re.I)
    if m:
        raw = m.group(1).replace(",", "")
        try:
            if raw.lower().endswith("k"):
                enrollment = int(float(raw[:-1]) * 1000)
            elif raw.lower().endswith("m"):
                enrollment = int(float(raw[:-1]) * 1_000_000)
            else:
                enrollment = int(raw)
        except ValueError:
            pass
    result["enrollment"] = enrollment

    # Number of courses in specialization
    course_count_m = re.search(r'(\d+)\s+course[s]?\s+series', resp.text, re.I)
    result["course_count"] = int(course_count_m.group(1)) if course_count_m else None

    return result
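
The text patterns above are brittle against copy changes on the page, so it pays to check them against sample snippets before a long run. A quick sanity check (the sample strings are made up):

```python
import re

ENROLLED_RE = re.compile(
    r'([\d,]+(?:\.\d+)?[KkMm]?)\s+(?:already\s+)?enrolled', re.IGNORECASE
)
SERIES_RE = re.compile(r'(\d+)\s+course[s]?\s+series', re.IGNORECASE)

print(ENROLLED_RE.search("Join 1.4M already enrolled").group(1))   # 1.4M
print(ENROLLED_RE.search("92,431 already enrolled").group(1))      # 92,431
print(SERIES_RE.search("Specialization - 5 course series").group(1))  # 5
```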

Database Schema and Storage

import sqlite3
from datetime import date, datetime


def init_coursera_db(db_path: str = "coursera.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS courses (
            id TEXT PRIMARY KEY,
            slug TEXT UNIQUE,
            name TEXT,
            description TEXT,
            short_description TEXT,
            workload TEXT,
            certificate_type TEXT,
            course_type TEXT,
            languages TEXT,
            subtitle_languages TEXT,
            domain_types TEXT,
            partner_ids TEXT,
            partner_names TEXT,
            instructor_ids TEXT,
            instructor_names TEXT,
            first_seen DATE
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS course_stats (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            course_slug TEXT NOT NULL,
            enrollment INTEGER,
            rating REAL,
            rating_count INTEGER,
            level TEXT,
            duration TEXT,
            provider TEXT,
            image_url TEXT,
            skills TEXT,
            module_count INTEGER,
            scraped_date DATE,
            scraped_at TEXT,
            UNIQUE(course_slug, scraped_date)
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS modules (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            course_slug TEXT NOT NULL,
            module_index INTEGER,
            title TEXT,
            description TEXT,
            duration TEXT,
            scraped_date DATE
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS specializations (
            id TEXT PRIMARY KEY,
            slug TEXT UNIQUE,
            name TEXT,
            description TEXT,
            partner_ids TEXT,
            partner_names TEXT,
            course_ids TEXT,
            num_courses INTEGER,
            workload TEXT,
            enrollment INTEGER,
            rating REAL,
            rating_count INTEGER,
            course_count INTEGER,
            image_url TEXT,
            first_seen DATE
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS partners (
            id TEXT PRIMARY KEY,
            name TEXT,
            short_name TEXT,
            description TEXT,
            location TEXT,
            home_url TEXT,
            logo_url TEXT
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS instructors (
            id TEXT PRIMARY KEY,
            name TEXT,
            title TEXT,
            department TEXT,
            bio TEXT,
            photo_url TEXT,
            institution TEXT
        )
    """)

    conn.execute("CREATE INDEX IF NOT EXISTS idx_course_slug ON courses(slug)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_stats_slug ON course_stats(course_slug)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_stats_date ON course_stats(scraped_date)")

    conn.commit()
    return conn
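
The UNIQUE(course_slug, scraped_date) constraint is what makes course_stats a daily-snapshot table: re-scraping a course on the same day replaces that day's row, while a new day appends. A minimal demonstration of that behavior against an in-memory database (trimmed schema, made-up numbers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE course_stats (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        course_slug TEXT NOT NULL,
        enrollment INTEGER,
        scraped_date DATE,
        UNIQUE(course_slug, scraped_date)
    )
""")

ins = ("INSERT OR REPLACE INTO course_stats "
       "(course_slug, enrollment, scraped_date) VALUES (?, ?, ?)")
conn.execute(ins, ("machine-learning", 100, "2026-01-01"))
conn.execute(ins, ("machine-learning", 120, "2026-01-01"))  # same day: replaces
conn.execute(ins, ("machine-learning", 150, "2026-01-02"))  # new day: appends

rows = conn.execute(
    "SELECT enrollment FROM course_stats ORDER BY scraped_date"
).fetchall()
print(rows)  # [(120,), (150,)]
```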


def save_courses(conn: sqlite3.Connection, courses: list[dict]):
    """Batch save course API data."""
    today = date.today().isoformat()
    for c in courses:
        if not c.get("id"):
            continue
        conn.execute("""
            INSERT OR REPLACE INTO courses
            (id, slug, name, description, short_description, workload,
             certificate_type, course_type, languages, subtitle_languages,
             domain_types, partner_ids, partner_names, instructor_ids,
             instructor_names, first_seen)
            VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
        """, (
            c["id"], c.get("slug"), c.get("name"),
            c.get("description"), c.get("short_description"),
            c.get("workload"), c.get("certificate_type"), c.get("course_type"),
            json.dumps(c.get("languages", [])),
            json.dumps(c.get("subtitle_languages", [])),
            json.dumps(c.get("domain_types", [])),
            json.dumps(c.get("partner_ids", [])),
            json.dumps(c.get("partner_names", [])),
            json.dumps(c.get("instructor_ids", [])),
            json.dumps(c.get("instructor_names", [])),
            today
        ))
    conn.commit()


def save_course_stats(conn: sqlite3.Connection, scraped_data: list[dict]):
    """Save scraped page stats (enrollment, ratings, etc.)"""
    today = date.today().isoformat()
    now = datetime.utcnow().isoformat()

    for s in scraped_data:
        if s.get("error"):
            continue
        slug = s.get("slug")
        if not slug:
            continue

        conn.execute("""
            INSERT OR REPLACE INTO course_stats
            (course_slug, enrollment, rating, rating_count, level, duration,
             provider, image_url, skills, module_count, scraped_date, scraped_at)
            VALUES (?,?,?,?,?,?,?,?,?,?,?,?)
        """, (
            slug, s.get("enrollment"), s.get("rating"), s.get("rating_count"),
            s.get("level"), s.get("duration"), s.get("provider"),
            s.get("image_url"),
            json.dumps(s.get("skills", [])),
            len(s.get("modules", [])),
            today, now,
        ))

        # Save modules
        for idx, module in enumerate(s.get("modules", [])):
            conn.execute("""
                INSERT INTO modules (course_slug, module_index, title, description, duration, scraped_date)
                VALUES (?, ?, ?, ?, ?, ?)
            """, (slug, idx, module["title"], module["description"], module["duration"], today))

    conn.commit()


def save_partners_and_instructors(
    conn: sqlite3.Connection,
    partners: dict,
    instructors: dict,
):
    """Save partner and instructor data."""
    for p in partners.values():
        conn.execute("""
            INSERT OR REPLACE INTO partners (id, name, short_name, description, location, home_url, logo_url)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """, (p["id"], p["name"], p["short_name"], p["description"],
              p["location"], p["home_url"], p["logo_url"]))

    for inst in instructors.values():
        conn.execute("""
            INSERT OR REPLACE INTO instructors (id, name, title, department, bio, photo_url, institution)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """, (inst["id"], inst["name"], inst["title"], inst["department"],
              inst["bio"], inst["photo_url"], inst["institution"]))

    conn.commit()
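
Once a few days of snapshots accumulate, enrollment growth per course is a self-join away. A sketch against the same trimmed schema (dates and counts are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE course_stats (
        course_slug TEXT, enrollment INTEGER, scraped_date DATE,
        UNIQUE(course_slug, scraped_date)
    )
""")
data = [
    ("machine-learning", 4_800_000, "2026-01-01"),
    ("machine-learning", 4_815_000, "2026-01-08"),
    ("python-for-everybody", 2_900_000, "2026-01-01"),
    ("python-for-everybody", 2_921_000, "2026-01-08"),
]
conn.executemany("INSERT INTO course_stats VALUES (?, ?, ?)", data)

growth = conn.execute("""
    SELECT a.course_slug, b.enrollment - a.enrollment AS delta
    FROM course_stats a
    JOIN course_stats b ON a.course_slug = b.course_slug
    WHERE a.scraped_date = '2026-01-01' AND b.scraped_date = '2026-01-08'
    ORDER BY delta DESC
""").fetchall()
print(growth)  # [('python-for-everybody', 21000), ('machine-learning', 15000)]
```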

Proxy Configuration for Scale

Coursera uses Cloudflare and rate limiting on their web pages. For small-scale scraping (a few hundred pages), a realistic User-Agent and moderate delays are sufficient. For production-scale collection, use rotating proxies.

ThorData's residential proxy network provides IPs that are much less likely to be flagged by Cloudflare's IP reputation checks:

THORDATA_PROXIES = {
    # Credentials and the gateway host/port come from your provider dashboard;
    # the -country-us suffix pins exit IPs to the US
    "http": "http://YOUR_USER-country-us:YOUR_PASS@GATEWAY_HOST:9000",
    "https": "http://YOUR_USER-country-us:YOUR_PASS@GATEWAY_HOST:9000",
}

def scrape_with_proxy_rotation(
    slugs: list[str],
    proxy_pool: list[dict] = None,
    requests_per_proxy: int = 50,
) -> list[dict]:
    """
    Scrape course pages with proxy rotation.
    proxy_pool: list of proxy config dicts, or None to use no proxy
    """
    results = []
    current_proxy_idx = 0
    requests_with_proxy = 0

    for i, slug in enumerate(slugs):
        # Select current proxy
        current_proxies = None
        if proxy_pool:
            current_proxies = proxy_pool[current_proxy_idx % len(proxy_pool)]
            requests_with_proxy += 1

            # Rotate proxy
            if requests_with_proxy >= requests_per_proxy:
                current_proxy_idx += 1
                requests_with_proxy = 0
                print(f"  Rotating to proxy {current_proxy_idx % len(proxy_pool)}")

        result = scrape_course_page(slug, proxies=current_proxies)
        results.append(result)

        if not result.get("error"):
            print(f"  [{i+1}/{len(slugs)}] {slug}: "
                  f"enrolled={result.get('enrollment') or 'N/A'} "
                  f"rating={result.get('rating') or 'N/A'}")
        else:
            print(f"  [{i+1}/{len(slugs)}] {slug}: {result['error']}")

        time.sleep(random.uniform(1.5, 3.5))

    return results

Rate limit guidelines for Coursera:

- API endpoints: 100 requests/minute is comfortable; treat 200/minute as the ceiling
- Web pages: 1 request per 1.5-3.5 seconds per IP (be conservative)
- Avoid scraping more than 500 pages/day per IP for sustained operations
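Those guidelines are easier to keep consistent with a small helper than with `time.sleep` calls scattered through the code. A sketch, where the `RateLimiter` class and instance names are mine, not part of the code above:

```python
import time
import random


class RateLimiter:
    """Enforce a minimum interval between requests, with optional jitter."""

    def __init__(self, min_interval: float, jitter: float = 0.0):
        self.min_interval = min_interval
        self.jitter = jitter
        self._last_request = 0.0

    def wait(self):
        """Sleep until at least min_interval (+ jitter) has passed since the last call."""
        target = self.min_interval + random.uniform(0, self.jitter)
        elapsed = time.monotonic() - self._last_request
        if elapsed < target:
            time.sleep(target - elapsed)
        self._last_request = time.monotonic()


# API: ~100 requests/minute -> at least 0.6s between calls
api_limiter = RateLimiter(min_interval=0.6)
# Web pages: 1.5-3.5s between requests
page_limiter = RateLimiter(min_interval=1.5, jitter=2.0)
```

Call `page_limiter.wait()` immediately before each page fetch; the limiter then absorbs however long parsing and storage took, rather than adding a fixed delay on top.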

Complete Pipeline

import csv

def build_coursera_dataset(
    queries: list[str],
    max_per_query: int = 200,
    scrape_pages: bool = True,
    db_path: str = "coursera.db",
    csv_path: str = "coursera_courses.csv",
    proxies: dict = None,
) -> list[dict]:
    """
    Full pipeline: API collection -> page scraping -> database -> CSV export.
    """
    db = init_coursera_db(db_path)

    # Phase 1: Collect from API
    all_courses = {}
    for query in queries:
        print(f"\nQuery: '{query}'")
        courses = collect_all_courses(query, max_courses=max_per_query, delay=0.8)
        for c in courses:
            if c["id"] not in all_courses:
                all_courses[c["id"]] = c
        print(f"  Unique courses so far: {len(all_courses)}")
        time.sleep(2)

    unique = list(all_courses.values())
    print(f"\nTotal unique courses from API: {len(unique)}")

    # Phase 2: Fetch partner and instructor data
    print("\nFetching partner and instructor data...")
    unique = enrich_courses_with_partner_data(unique)
    save_courses(db, unique)

    # Collect all partners and instructors
    all_partner_ids = list({p for c in unique for p in c.get("partner_ids", [])})
    all_instructor_ids = list({i for c in unique for i in c.get("instructor_ids", [])})
    partners = get_partners(all_partner_ids)
    instructors = get_instructors(all_instructor_ids)
    save_partners_and_instructors(db, partners, instructors)
    print(f"Saved {len(partners)} partners, {len(instructors)} instructors")

    # Phase 3: Scrape page stats
    if scrape_pages:
        print(f"\nScraping page stats for {len(unique)} courses...")
        slugs = [c["slug"] for c in unique if c.get("slug")]
        page_data = scrape_courses_batch(slugs, proxies=proxies)
        save_course_stats(db, page_data)

        # Merge page data back into courses
        page_lookup = {d["slug"]: d for d in page_data}
        for c in unique:
            slug = c.get("slug")
            if slug and slug in page_lookup:
                pd = page_lookup[slug]
                c.update({
                    "enrollment": pd.get("enrollment"),
                    "rating": pd.get("rating"),
                    "rating_count": pd.get("rating_count"),
                    "level": pd.get("level"),
                    "duration": pd.get("duration"),
                    "skills": pd.get("skills", []),
                    "module_count": len(pd.get("modules", [])),
                })

    # Phase 4: Export to CSV
    flat_rows = []
    for c in unique:
        flat_rows.append({
            "id": c["id"],
            "slug": c.get("slug", ""),
            "name": c.get("name", ""),
            "partner_names": "; ".join(c.get("partner_names", [])),
            "instructor_names": "; ".join(c.get("instructor_names", [])),
            "enrollment": c.get("enrollment"),
            "rating": c.get("rating"),
            "rating_count": c.get("rating_count"),
            "level": c.get("level", ""),
            "certificate_type": c.get("certificate_type", ""),
            "workload": c.get("workload", ""),
            "duration": c.get("duration", ""),
            "languages": "; ".join(c.get("languages", [])),
            "domains": "; ".join(
                f"{d.get('domainType', '')}/{d.get('subdomainType', '')}"
                for d in c.get("domain_types", [])
            ),
            "skills": "; ".join(c.get("skills", [])),
            "module_count": c.get("module_count"),
        })

    if flat_rows:
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=flat_rows[0].keys())
            writer.writeheader()
            writer.writerows(flat_rows)
        print(f"\nExported {len(flat_rows)} courses to {csv_path}")

    db.close()
    return unique


if __name__ == "__main__":
    dataset = build_coursera_dataset(
        queries=[
            "machine learning",
            "data science",
            "python",
            "cloud computing",
            "artificial intelligence",
        ],
        max_per_query=150,
        scrape_pages=True,
        proxies=THORDATA_PROXIES,
    )
    print(f"\nDataset: {len(dataset)} courses")

    # Quick summary
    with_enrollment = [c for c in dataset if c.get("enrollment")]
    with_rating = [c for c in dataset if c.get("rating")]
    print(f"With enrollment data: {len(with_enrollment)}")
    print(f"With rating data: {len(with_rating)}")

    if with_enrollment:
        sorted_by_enrollment = sorted(with_enrollment, key=lambda c: c["enrollment"], reverse=True)
        print("\nTop 5 by enrollment:")
        for c in sorted_by_enrollment[:5]:
            print(f"  {c['name'][:50]:50} | {c['enrollment']:>10,} enrolled | "
                  f"{c.get('rating') or 'N/A'} stars")

Useful Analysis Queries

def analyze_coursera_db(db_path: str = "coursera.db"):
    conn = sqlite3.connect(db_path)

    # Top enrolled courses
    print("\n=== Top 10 by Enrollment ===")
    rows = conn.execute("""
        SELECT c.name, cs.enrollment, cs.rating, c.partner_names
        FROM course_stats cs
        JOIN courses c ON c.slug = cs.course_slug
        WHERE cs.enrollment IS NOT NULL
        ORDER BY cs.enrollment DESC
        LIMIT 10
    """).fetchall()
    for r in rows:
        partners = json.loads(r[3] or "[]")
        print(f"  {r[0][:45]:45} | {r[1]:>8,} | {r[2] or 'N/A':>5} | {', '.join(partners[:1])}")

    # Top rated (minimum 100 ratings)
    print("\n=== Top Rated (100+ reviews) ===")
    rows = conn.execute("""
        SELECT c.name, cs.rating, cs.rating_count, c.partner_names
        FROM course_stats cs
        JOIN courses c ON c.slug = cs.course_slug
        WHERE cs.rating IS NOT NULL AND cs.rating_count >= 100
        ORDER BY cs.rating DESC
        LIMIT 10
    """).fetchall()
    for r in rows:
        partners = json.loads(r[3] or "[]")
        print(f"  {r[0][:45]:45} | {r[1]:.2f} stars | {r[2]:>6,} reviews | {', '.join(partners[:2])}")

    # Certificate type breakdown
    print("\n=== Certificate Types ===")
    rows = conn.execute("""
        SELECT certificate_type, COUNT(*) AS count
        FROM courses WHERE certificate_type != ''
        GROUP BY certificate_type ORDER BY count DESC
    """).fetchall()
    for r in rows:
        print(f"  {r[0]:30} {r[1]}")

    conn.close()

Common Gotchas

Enrollment numbers don't appear on all pages. Some courses hide enrollment (usually lower-enrollment courses). The enrollment = None case should be handled gracefully downstream.
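For instance, sorting or averaging over a mix of numbers and `None` raises a `TypeError` in Python 3, so filter or default the missing values explicitly (the sample data here is made up):

```python
courses = [
    {"name": "A", "enrollment": 120000},
    {"name": "B", "enrollment": None},   # page hides enrollment
    {"name": "C", "enrollment": 45000},
]

# None-safe sort: treat missing enrollment as 0 so those courses sort last
ranked = sorted(courses, key=lambda c: c.get("enrollment") or 0, reverse=True)

# None-safe aggregation: only average over courses that report a number
known = [c["enrollment"] for c in courses if c.get("enrollment") is not None]
avg = sum(known) / len(known) if known else None
```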

JSON-LD data is more reliable than DOM selectors. Coursera's CSS class names change with deployments. The JSON-LD structured data is meant to be machine-readable and changes less frequently — always try it first.
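A minimal JSON-LD extractor looks like this (the helper name is mine, and the exact fields Coursera exposes in these blocks may vary between page types):

```python
import json
from bs4 import BeautifulSoup


def extract_json_ld(html: str) -> list[dict]:
    """Pull every JSON-LD block out of a page; skip malformed ones."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except (json.JSONDecodeError, TypeError):
            continue
        # A block may hold a single object or a list of objects
        blocks.extend(data if isinstance(data, list) else [data])
    return blocks
```

Filter the returned blocks by their `@type` (e.g. `Course`) before trusting any field, since pages often carry several unrelated JSON-LD objects.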

The API's enrolledCount field is often omitted. Despite being listed as a field option, Coursera frequently doesn't include enrollment counts in API responses. That's why we scrape the page.

Specialization page slugs differ from course slugs. Courses are at /learn/{slug}, specializations at /specializations/{slug}, and professional certificates at /professional-certificates/{slug}. Keep these separate in your data model.
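One way to keep the three URL spaces straight is a single lookup used everywhere a page URL is built (the function and mapping names here are illustrative):

```python
# Path prefix on coursera.org for each content type
CONTENT_PATHS = {
    "course": "learn",
    "specialization": "specializations",
    "professional-certificate": "professional-certificates",
}


def coursera_url(content_type: str, slug: str) -> str:
    """Build the public page URL for a given content type and slug."""
    try:
        path = CONTENT_PATHS[content_type]
    except KeyError:
        raise ValueError(f"Unknown content type: {content_type!r}")
    return f"https://www.coursera.org/{path}/{slug}"
```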

Rate limits on the API are soft. You won't get hard 429s until you're making several hundred requests per minute. The more concerning limit is getting your IP flagged on the web scraping side — use the delay guidelines above.

Partner location data is sometimes an ISO country code, sometimes a city string. Normalize this downstream if geographic analysis matters to your use case.
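A best-effort normalizer can split the two cases apart at ingestion time. This sketch uses a tiny illustrative country table; for real coverage, swap in a library such as pycountry:

```python
# Illustrative subset only -- extend or replace with a full ISO 3166 table
ISO_COUNTRIES = {
    "US": "United States",
    "GB": "United Kingdom",
    "IN": "India",
    "CA": "Canada",
}


def normalize_location(raw: str) -> dict:
    """Split a partner location string into country vs. city, best effort."""
    value = (raw or "").strip()
    if value.upper() in ISO_COUNTRIES:
        return {"country": ISO_COUNTRIES[value.upper()], "city": None}
    # Anything else is treated as a free-form city/region string
    return {"country": None, "city": value or None}
```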

Coursera's API is semi-public, and unauthenticated catalog browsing appears to be intended behavior: they don't require API keys for these read endpoints. However, their Terms of Service restrict automated access. For academic research and personal analysis, small-scale API usage and scraping public course pages carries low practical risk. For commercial products built on Coursera data, review their terms and consider whether an official data partnership makes more sense. The enrollment and rating data you're scraping is publicly visible to every visitor; you're not accessing anything non-public.

Conclusion

Coursera's catalog API is one of the cleaner education data APIs available without authentication. The API gets you bulk metadata quickly; page scraping fills in the enrollment and ratings data that the API withholds. For production-scale collection, ThorData residential proxies keep you under Cloudflare's IP reputation radar. The SQLite schema here handles courses, specializations, partners, instructors, and time-series stats in a clean normalized structure that supports the most common analytical queries.