Scraping Udemy Course Data in 2026: Prices, Ratings, Curriculum, and Student Counts
Udemy hosts over 250,000 courses across every conceivable topic. If you're building a course comparison tool, tracking pricing patterns, analyzing the online education market, or gathering competitive intelligence on online learning, you need structured course data at scale. Udemy shut down their public affiliate API in 2020, but they still run a rich internal API that powers their website and mobile apps. In 2026, it remains the most reliable way to get course data programmatically.
This guide is comprehensive. We'll cover: the internal API structure, searching and paginating courses, extracting full curriculum trees, scraping student reviews, tracking prices over time, handling Udemy's anti-bot protections, building a SQLite-backed pipeline, and exporting to common formats. By the end you'll have a working scraper capable of collecting tens of thousands of courses with full metadata.
Why the Internal API (Not HTML Scraping)
You have two options for getting Udemy data: scraping the rendered HTML or calling the internal JSON API directly.
HTML scraping is fragile. Udemy's frontend is a React SPA with dynamically generated class names that rotate on deploys. What works today breaks in a week.
The internal API is different. When you browse Udemy, your browser calls www.udemy.com/api-2.0/. This same API powers the iOS app, the Android app, and the website. It returns clean, structured JSON with consistent field names. It's been stable for years. Yes, it's undocumented and officially unsanctioned — but it's also the same API that everyone scraping Udemy uses, and it works reliably.
The key endpoints:
- /api-2.0/courses/ — Search and browse courses
- /api-2.0/courses/{id}/ — Full metadata for a single course
- /api-2.0/courses/{id}/public-curriculum-items/ — Full curriculum tree
- /api-2.0/courses/{id}/reviews/ — Student reviews with ratings
- /api-2.0/courses/{id}/instructor-performance/ — Instructor stats
- /api-2.0/users/{id}/ — Instructor profile data
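All of these endpoints wrap their results in the same paginated envelope. The field values below are invented for illustration, but the count / next / previous / results structure is the contract the pagination helpers later in this guide rely on:

```python
# Illustrative shape of a paginated /api-2.0/ response. Values here are
# made up; only the envelope structure is the real contract.
sample_response = {
    "count": 10000,    # total number of matching items
    "next": "https://www.udemy.com/api-2.0/courses/?page=2&search=python",
    "previous": None,  # null on the first page
    "results": [
        {
            "_class": "course",  # each item carries its type in _class
            "id": 12345,         # numeric ID: the stable primary key
            "title": "Example Python Course",
            "url": "/course/example-python-course/",
            # ...plus whatever fields you requested via fields[course]
        },
    ],
}

def has_more_pages(response: dict) -> bool:
    """A non-null 'next' URL means another page exists."""
    return response.get("next") is not None

print(has_more_pages(sample_response))  # True
```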
Setting Up Your HTTP Client
Use httpx instead of requests — it has better connection pooling, native async support, and optional HTTP/2 support, which matters because Udemy's CDN speaks HTTP/2. Install it with the http2 extra:
pip install "httpx[http2]"
import httpx
import time
import json
import random
from typing import Optional
# Realistic browser headers — critical for avoiding immediate blocks
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/127.0.0.0 Safari/537.36",
"Accept": "application/json, text/plain, */*",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.udemy.com/",
"Origin": "https://www.udemy.com",
"sec-ch-ua": '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
}
client = httpx.Client(
    timeout=30,
    headers=HEADERS,
    http2=True,  # requires the httpx[http2] extra
    follow_redirects=True,
)
Note the sec-ch-ua headers — these Chrome client hints are checked by Cloudflare and missing them is a signal that the request is automated.
Searching and Browsing Courses
The search endpoint accepts a fields[] parameter to control which fields are returned. Only request fields you need — wider field sets increase response size and may trigger stricter rate limiting:
def search_courses(
query: str,
page: int = 1,
page_size: int = 20,
ordering: str = "relevance",
language: str = "en",
price: str = "price-paid", # price-paid | price-free
) -> dict:
"""
Search Udemy courses.
ordering: relevance | most-reviewed | highest-rated | newest | enrollment
price: price-paid | price-free (omit to get both)
"""
url = "https://www.udemy.com/api-2.0/courses/"
params = {
"search": query,
"page": page,
"page_size": page_size,
"ordering": ordering,
"language": language,
"fields[course]": ",".join([
"title", "url", "price", "price_detail",
"num_subscribers", "avg_rating", "avg_rating_recent",
"num_reviews", "num_lectures", "content_length_video",
"created", "last_update_date", "published_title",
"locale", "visible_instructors", "primary_category",
"primary_subcategory", "is_paid", "is_bestseller",
"headline", "image_480x270",
]),
"fields[user]": "title,display_name,job_title,url",
"fields[locale]": "simple_english_title",
}
if price:
params["price"] = price
resp = client.get(url, params=params)
resp.raise_for_status()
return resp.json()
def parse_course(c: dict) -> dict:
"""Normalize a raw course dict from the API."""
price_detail = c.get("price_detail") or {}
    instructors = c.get("visible_instructors") or []
    primary_instructor = instructors[0] if instructors else {}
return {
"id": c["id"],
"title": c["title"],
"url": f"https://www.udemy.com{c['url']}",
"slug": c.get("published_title", ""),
"headline": c.get("headline", ""),
"price": c.get("price", "Free"),
"price_amount": price_detail.get("amount", 0),
"currency": price_detail.get("currency", "USD"),
"is_paid": c.get("is_paid", True),
"is_bestseller": c.get("is_bestseller", False),
"students": c.get("num_subscribers", 0),
"rating": round(c.get("avg_rating", 0), 2),
"rating_recent": round(c.get("avg_rating_recent", 0), 2),
"reviews": c.get("num_reviews", 0),
"lectures": c.get("num_lectures", 0),
"video_hours": round(c.get("content_length_video", 0) / 3600, 1),
"created": c.get("created", ""),
"last_updated": c.get("last_update_date", ""),
"instructor_name": primary_instructor.get("display_name", ""),
"instructor_title": primary_instructor.get("job_title", ""),
"instructor_url": primary_instructor.get("url", ""),
"category": (c.get("primary_category") or {}).get("title", ""),
"subcategory": (c.get("primary_subcategory") or {}).get("title", ""),
"language": (c.get("locale") or {}).get("simple_english_title", ""),
"thumbnail": c.get("image_480x270", ""),
}
def search_all(query: str, max_results: int = 200, delay: float = 1.0) -> list[dict]:
"""Paginate through search results, up to max_results courses."""
all_courses = []
page = 1
while len(all_courses) < max_results:
data = search_courses(query, page=page, page_size=20)
results = data.get("results", [])
if not results:
break
all_courses.extend([parse_course(c) for c in results])
print(f" Page {page}: got {len(results)} courses (total {len(all_courses)})")
if not data.get("next"):
break
page += 1
time.sleep(delay + random.uniform(0, 0.5))
return all_courses[:max_results]
# Example: collect Python courses
python_courses = search_all("python programming", max_results=100)
for c in python_courses[:5]:
print(f"{c['title'][:60]}")
print(f" {c['students']:,} students | {c['rating']}★ ({c['reviews']:,} reviews)")
print(f" {c['lectures']} lectures, {c['video_hours']}h | {c['price']}")
print(f" Instructor: {c['instructor_name']}")
print()
Fetching Full Course Detail
The search endpoint returns a subset of fields. For complete metadata on a specific course, hit the course detail endpoint:
def get_course_detail(course_id: int) -> dict:
"""Get complete metadata for a single course by ID."""
url = f"https://www.udemy.com/api-2.0/courses/{course_id}/"
params = {
"fields[course]": ",".join([
"title", "headline", "url", "price", "price_detail",
"num_subscribers", "avg_rating", "avg_rating_recent",
"num_reviews", "num_lectures", "content_length_video",
"visible_instructors", "primary_category",
"primary_subcategory", "requirements_data",
"what_you_will_learn_data", "target_audiences",
"locale", "created", "last_update_date", "caption_languages",
"has_certificate", "description", "is_paid", "is_bestseller",
]),
"fields[user]": "title,display_name,job_title,url,num_followers,"
"num_reviews,avg_rating,num_published_courses",
}
resp = client.get(url, params=params)
resp.raise_for_status()
data = resp.json()
# Parse learning objectives
objectives = [
item.get("text", "")
for item in (data.get("what_you_will_learn_data") or {}).get("items", [])
]
# Parse requirements
requirements = [
item.get("text", "")
for item in (data.get("requirements_data") or {}).get("items", [])
]
# Parse target audiences
audiences = [
item.get("text", "")
for item in (data.get("target_audiences") or {}).get("items", [])
]
# Multi-instructor support
instructors = []
for inst in (data.get("visible_instructors") or []):
instructors.append({
"name": inst.get("display_name", ""),
"title": inst.get("job_title", ""),
"url": inst.get("url", ""),
"followers": inst.get("num_followers", 0),
"avg_rating": inst.get("avg_rating", 0),
"courses": inst.get("num_published_courses", 0),
})
course = parse_course(data)
course.update({
"description": data.get("description", ""),
"objectives": objectives,
"requirements": requirements,
"target_audiences": audiences,
"instructors": instructors,
"caption_languages": data.get("caption_languages", []),
"has_certificate": data.get("has_certificate", False),
})
return course
# Fetch full detail for a specific course
detail = get_course_detail(python_courses[0]["id"])
print(f"Title: {detail['title']}")
print(f"Description: {detail['description'][:200]}...")
print(f"Objectives ({len(detail['objectives'])}):")
for obj in detail['objectives'][:3]:
print(f" - {obj}")
Extracting the Full Curriculum Tree
The curriculum endpoint returns every section, lecture, quiz, and practice assignment. This is invaluable for competitive analysis — you can see exactly what a competitor covers without paying for their course:
def get_curriculum(course_id: int) -> list[dict]:
"""
Fetch the complete curriculum for a course.
Returns a list of sections, each with nested lectures.
"""
url = f"https://www.udemy.com/api-2.0/courses/{course_id}/public-curriculum-items/"
params = {
"page_size": 200,
"fields[chapter]": "title,sort_order,object_index,description",
"fields[lecture]": "title,content_summary,is_free,sort_order,asset",
"fields[practice]": "title,sort_order,estimated_duration",
"fields[quiz]": "title,sort_order",
"fields[asset]": "asset_type,time_estimation",
}
all_items = []
page = 1
while True:
params["page"] = page
resp = client.get(url, params=params)
if resp.status_code == 403:
# Some courses restrict curriculum to enrolled students
print(f" Curriculum access restricted for course {course_id}")
break
if resp.status_code != 200:
break
data = resp.json()
batch = data.get("results", [])
all_items.extend(batch)
if not data.get("next"):
break
page += 1
time.sleep(0.3)
# Organize into sections with nested items
sections = []
current_section = None
for item in all_items:
cls = item.get("_class", "")
if cls == "chapter":
current_section = {
"title": item["title"],
"index": item.get("object_index", 0),
"description": item.get("description", ""),
"lectures": [],
"quizzes": [],
"practices": [],
}
sections.append(current_section)
elif cls == "lecture" and current_section is not None:
asset = item.get("asset") or {}
duration_secs = asset.get("time_estimation", 0)
current_section["lectures"].append({
"title": item["title"],
"duration_seconds": duration_secs,
"duration_display": item.get("content_summary", ""),
"is_free": item.get("is_free", False),
"asset_type": asset.get("asset_type", ""),
})
elif cls == "quiz" and current_section is not None:
current_section["quizzes"].append({"title": item["title"]})
elif cls == "practice" and current_section is not None:
current_section["practices"].append({
"title": item["title"],
"estimated_duration": item.get("estimated_duration", ""),
})
return sections
def curriculum_stats(sections: list[dict]) -> dict:
"""Summarize a curriculum."""
total_lectures = sum(len(s["lectures"]) for s in sections)
total_quizzes = sum(len(s["quizzes"]) for s in sections)
total_practices = sum(len(s["practices"]) for s in sections)
free_lectures = sum(
1 for s in sections for lec in s["lectures"] if lec["is_free"]
)
total_seconds = sum(
lec["duration_seconds"]
for s in sections for lec in s["lectures"]
)
return {
"sections": len(sections),
"lectures": total_lectures,
"quizzes": total_quizzes,
"practices": total_practices,
"free_preview_lectures": free_lectures,
"total_hours": round(total_seconds / 3600, 1),
}
# Example
course_id = python_courses[0]["id"]
curriculum = get_curriculum(course_id)
stats = curriculum_stats(curriculum)
print(f"\nCurriculum stats: {stats}")
print(f"\nFirst 5 sections:")
for section in curriculum[:5]:
print(f" [{section['index']}] {section['title']} ({len(section['lectures'])} lectures)")
for lec in section["lectures"][:3]:
free_tag = " [FREE]" if lec["is_free"] else ""
print(f" - {lec['title']} ({lec['duration_display']}){free_tag}")
Scraping Student Reviews
Reviews contain rating values, text, creation dates, and reviewer names. They're paginated and sorted by recency by default:
def get_reviews(
course_id: int,
max_pages: int = 10,
ordering: str = "-created",
) -> list[dict]:
"""
Fetch reviews for a course.
ordering: -created (newest) | -helpful (most helpful) | -rating
"""
url = f"https://www.udemy.com/api-2.0/courses/{course_id}/reviews/"
params = {
"page_size": 50,
"ordering": ordering,
"fields[course_review]": "title,content,rating,created,user_modified,user",
"fields[user]": "display_name,title,name",
}
reviews = []
for page in range(1, max_pages + 1):
params["page"] = page
resp = client.get(url, params=params)
if resp.status_code != 200:
break
data = resp.json()
for r in data.get("results", []):
user = r.get("user") or {}
reviews.append({
"rating": r.get("rating", 0),
"title": r.get("title", ""),
"content": r.get("content", ""),
"created": r.get("created", ""),
"modified": r.get("user_modified", ""),
"user_name": user.get("display_name", user.get("name", "")),
"user_title": user.get("title", ""),
})
if not data.get("next"):
break
time.sleep(0.5)
return reviews
def analyze_reviews(reviews: list[dict]) -> dict:
"""Extract sentiment signals from reviews."""
if not reviews:
return {}
ratings = [r["rating"] for r in reviews]
avg = sum(ratings) / len(ratings)
dist = {i: ratings.count(i) for i in range(1, 6)}
# Extract common praise/complaint phrases
all_text = " ".join(r["content"].lower() for r in reviews)
positive_signals = [
"excellent", "great", "amazing", "highly recommend", "best course",
"very clear", "easy to follow", "well explained", "perfect",
]
negative_signals = [
"outdated", "too fast", "confusing", "poor quality", "not worth",
"too slow", "boring", "repetitive", "waste", "disappointing",
]
praise = [p for p in positive_signals if p in all_text]
complaints = [n for n in negative_signals if n in all_text]
return {
"total_reviews": len(reviews),
"avg_rating": round(avg, 2),
"rating_distribution": dist,
"five_star_pct": round(100 * dist[5] / len(ratings), 1),
"one_star_pct": round(100 * dist[1] / len(ratings), 1),
"common_praise": praise,
"common_complaints": complaints,
}
# Get and analyze reviews
reviews = get_reviews(course_id, max_pages=3)
analysis = analyze_reviews(reviews)
print(f"\nReview analysis: {json.dumps(analysis, indent=2)}")
Price Tracking and Sale Detection
Udemy courses swing from $199.99 to $9.99 constantly. The platform practically trains users to wait for sales. If you're building a price comparison tool or just want to buy at the right time, track prices daily:
import sqlite3
from datetime import datetime, date
def init_price_db(db_path: str = "udemy_prices.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS courses (
id INTEGER PRIMARY KEY,
title TEXT,
url TEXT,
category TEXT,
subcategory TEXT,
instructor TEXT,
first_seen TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS price_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
course_id INTEGER NOT NULL,
price_amount REAL,
currency TEXT,
is_paid INTEGER,
students INTEGER,
rating REAL,
reviews INTEGER,
captured_date TEXT,
captured_at TEXT,
UNIQUE(course_id, captured_date)
)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_price_course_date
ON price_history(course_id, captured_date)
""")
conn.commit()
return conn
def upsert_course(conn: sqlite3.Connection, course: dict):
conn.execute("""
INSERT OR IGNORE INTO courses (id, title, url, category, subcategory, instructor, first_seen)
VALUES (?, ?, ?, ?, ?, ?, ?)
""", (
course["id"], course["title"], course["url"],
course.get("category", ""), course.get("subcategory", ""),
course.get("instructor_name", ""), date.today().isoformat()
))
conn.commit()
def record_price(conn: sqlite3.Connection, course: dict):
today = date.today().isoformat()
conn.execute("""
INSERT OR REPLACE INTO price_history
(course_id, price_amount, currency, is_paid, students, rating, reviews, captured_date, captured_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
course["id"], course["price_amount"], course["currency"],
int(course["is_paid"]), course["students"],
course["rating"], course["reviews"],
today, datetime.utcnow().isoformat()
))
conn.commit()
def detect_price_drops(conn: sqlite3.Connection, min_drop_pct: float = 50.0) -> list[dict]:
"""Find courses that dropped by min_drop_pct% since yesterday."""
rows = conn.execute("""
SELECT a.course_id, c.title,
b.price_amount AS old_price, a.price_amount AS new_price,
b.captured_date AS prev_date, a.captured_date AS curr_date
FROM price_history a
JOIN price_history b ON a.course_id = b.course_id
JOIN courses c ON a.course_id = c.id
WHERE a.captured_date = date('now')
AND b.captured_date = date('now', '-1 day')
AND b.price_amount > 0
AND a.price_amount < b.price_amount
AND ((b.price_amount - a.price_amount) / b.price_amount * 100) >= ?
""", (min_drop_pct,)).fetchall()
drops = []
for row in rows:
course_id, title, old_price, new_price, prev_date, curr_date = row
drop_pct = (old_price - new_price) / old_price * 100
drops.append({
"course_id": course_id,
"title": title,
"old_price": old_price,
"new_price": new_price,
"drop_pct": round(drop_pct, 1),
"curr_date": curr_date,
})
return drops
def get_price_history(conn: sqlite3.Connection, course_id: int, days: int = 30) -> list[dict]:
"""Get price history for a single course."""
rows = conn.execute("""
SELECT captured_date, price_amount, students, rating
FROM price_history
WHERE course_id = ?
ORDER BY captured_date DESC
LIMIT ?
""", (course_id, days)).fetchall()
return [
{"date": r[0], "price": r[1], "students": r[2], "rating": r[3]}
for r in rows
]
# Daily price snapshot pipeline
def daily_price_snapshot(queries: list[str], db_path: str = "udemy_prices.db"):
db = init_price_db(db_path)
all_courses = []
for query in queries:
print(f"\nSearching: '{query}'")
courses = search_all(query, max_results=50)
all_courses.extend(courses)
time.sleep(2)
# Deduplicate by course ID
seen = set()
unique_courses = []
for c in all_courses:
if c["id"] not in seen:
seen.add(c["id"])
unique_courses.append(c)
print(f"\nRecording prices for {len(unique_courses)} unique courses...")
for c in unique_courses:
upsert_course(db, c)
record_price(db, c)
drops = detect_price_drops(db, min_drop_pct=70.0)
if drops:
print(f"\nPrice drops detected ({len(drops)}):")
for d in drops:
print(f" {d['title'][:50]} | ${d['old_price']} -> ${d['new_price']} (-{d['drop_pct']}%)")
else:
print("No major price drops today.")
db.close()
Using ThorData Proxies for Scale and Geo-Pricing
Udemy implements geo-targeted pricing. A Python course priced at $19.99 in the US might be $3.99 in India or $7.99 in Brazil. If you're building a price comparison tool, you need to capture prices from multiple countries — and your proxy location determines which prices you see.
For scale, Udemy's rate limiter tracks by IP. From a single IP you can make roughly 50-60 API calls per minute before getting 429s. At that rate, collecting 10,000 courses takes several hours.
ThorData's residential proxy network solves both problems — you get country-targeted residential IPs to capture geo-specific prices, and IP rotation to stay under rate limits:
import httpx
# ThorData proxy configuration — country-targeted
PROXY_CONFIGS = {
"us": "http://USER-country-us:[email protected]:9000",
"uk": "http://USER-country-gb:[email protected]:9000",
"in": "http://USER-country-in:[email protected]:9000",
"br": "http://USER-country-br:[email protected]:9000",
"de": "http://USER-country-de:[email protected]:9000",
}
def make_geo_client(country_code: str) -> httpx.Client:
"""Create an httpx client routed through a specific country's proxy."""
proxy_url = PROXY_CONFIGS.get(country_code)
    # httpx >= 0.26 takes a single `proxy=` argument; the old
    # `proxies=` mapping was removed in httpx 0.28
    return httpx.Client(
        timeout=30,
        headers=HEADERS,
        proxy=proxy_url,  # None falls back to a direct connection
    )
def compare_geo_prices(course_id: int, countries: list[str] | None = None) -> dict:
"""Compare course pricing across multiple countries."""
if countries is None:
countries = ["us", "uk", "in", "br"]
prices = {}
for country in countries:
geo_client = make_geo_client(country)
try:
url = f"https://www.udemy.com/api-2.0/courses/{course_id}/"
params = {"fields[course]": "title,price,price_detail"}
resp = geo_client.get(url, params=params)
resp.raise_for_status()
data = resp.json()
price_detail = data.get("price_detail") or {}
prices[country] = {
"price": price_detail.get("amount", 0),
"currency": price_detail.get("currency", ""),
"formatted": data.get("price", ""),
}
except Exception as e:
prices[country] = {"error": str(e)}
finally:
geo_client.close()
time.sleep(1.0)
return prices
# Compare pricing across regions
if python_courses:
geo_prices = compare_geo_prices(python_courses[0]["id"])
print("\nGeo-pricing comparison:")
for country, data in geo_prices.items():
if "price" in data:
print(f" {country.upper()}: {data['formatted']} ({data['currency']})")
Anti-Detection and Rate Limiting Strategy
Udemy's anti-bot stack in 2026 has several layers:
Cloudflare is the first line. They check TLS fingerprints (your SSL handshake pattern), HTTP/2 fingerprints, and request headers. The sec-ch-ua headers we set in the initial client configuration matter here.
Rate limiting is IP-based. The API returns 429 Too Many Requests after roughly 50-60 requests per minute. The limit varies by endpoint — search is stricter than course detail.
Session tracking is behavioral. Rapid sequential requests to different API endpoints trigger soft blocks where responses still succeed but are gradually throttled. Mixing in random delays prevents this.
Bot detection cookies are set on first visit. Cookies like ud_cache_nav and tracking IDs are validated on subsequent requests.
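One practical mitigation is to let the client browse a page or two before touching the API, so the session acquires those first-visit cookies naturally. This warm_up_session helper is a hypothetical sketch, not anything Udemy documents; adjust the paths to whatever your crawl would realistically visit first:

```python
import random
import time

def warm_up_session(client) -> None:
    """Visit a couple of regular pages (via an httpx.Client or similar)
    so the session collects Udemy's first-visit cookies."""
    for path in ("/", "/courses/development/"):
        try:
            # the client stores Set-Cookie values automatically, so later
            # /api-2.0/ calls send them back
            client.get(f"https://www.udemy.com{path}")
        except Exception:
            pass  # warm-up is best-effort; don't fail the whole run
        time.sleep(random.uniform(1.0, 2.5))  # human-ish pause between pages
```

Call it once right after constructing the client, before the first API request.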
Here's a robust request wrapper:
import time
import random
from typing import Optional
def api_get_with_retry(
url: str,
params: dict,
max_retries: int = 5,
base_delay: float = 1.0,
) -> Optional[dict]:
"""
Make an API request with exponential backoff on rate limits.
Returns None if all retries exhausted.
"""
for attempt in range(max_retries):
try:
# Add jitter to avoid thundering herd
jitter = random.uniform(0, 0.3 * base_delay)
if attempt > 0:
wait_time = base_delay * (2 ** attempt) + jitter
print(f" Retry {attempt}/{max_retries}, waiting {wait_time:.1f}s")
time.sleep(wait_time)
resp = client.get(url, params=params)
if resp.status_code == 200:
return resp.json()
elif resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 60))
print(f" Rate limited. Waiting {retry_after}s")
time.sleep(retry_after)
continue
elif resp.status_code == 403:
print(f" 403 Forbidden — possible bot detection. Backing off.")
time.sleep(base_delay * (2 ** (attempt + 2)))
continue
elif resp.status_code == 404:
return None # Course not found, don't retry
else:
print(f" HTTP {resp.status_code}, retrying...")
continue
except httpx.TimeoutException:
print(f" Timeout on attempt {attempt + 1}")
except httpx.NetworkError as e:
print(f" Network error: {e}")
return None
For large-scale collection (10,000+ courses), use an adaptive delay that backs off when you see rising error rates:
class AdaptiveRateLimiter:
"""Automatically adjust request rate based on error patterns."""
def __init__(self, initial_delay: float = 1.0):
self.delay = initial_delay
self.min_delay = 0.5
self.max_delay = 30.0
self.success_streak = 0
self.error_streak = 0
def on_success(self):
self.success_streak += 1
self.error_streak = 0
# Gradually speed up after sustained success
if self.success_streak >= 10 and self.delay > self.min_delay:
self.delay = max(self.min_delay, self.delay * 0.9)
def on_error(self):
self.error_streak += 1
self.success_streak = 0
# Aggressively slow down on errors
self.delay = min(self.max_delay, self.delay * 2.0)
def wait(self):
jitter = random.uniform(0, self.delay * 0.2)
time.sleep(self.delay + jitter)
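To see the backoff arithmetic concretely, here is a standalone simulation (the class is repeated from above so the snippet runs on its own):

```python
import random
import time

class AdaptiveRateLimiter:
    """Automatically adjust request rate based on error patterns."""
    def __init__(self, initial_delay: float = 1.0):
        self.delay = initial_delay
        self.min_delay = 0.5
        self.max_delay = 30.0
        self.success_streak = 0
        self.error_streak = 0

    def on_success(self):
        self.success_streak += 1
        self.error_streak = 0
        # Gradually speed up after sustained success
        if self.success_streak >= 10 and self.delay > self.min_delay:
            self.delay = max(self.min_delay, self.delay * 0.9)

    def on_error(self):
        self.error_streak += 1
        self.success_streak = 0
        # Aggressively slow down on errors
        self.delay = min(self.max_delay, self.delay * 2.0)

    def wait(self):
        jitter = random.uniform(0, self.delay * 0.2)
        time.sleep(self.delay + jitter)

limiter = AdaptiveRateLimiter(initial_delay=1.0)
for _ in range(3):       # three straight errors: the delay doubles each time
    limiter.on_error()
print(limiter.delay)     # 8.0

for _ in range(10):      # easing kicks in on the 10th consecutive success
    limiter.on_success()
print(limiter.delay)     # 7.2
```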
Building a Full Dataset Pipeline
Combining all the above into a complete, production-ready pipeline:
import json
import csv
import sqlite3
from pathlib import Path
from datetime import date
def build_udemy_dataset(
queries: list[str],
max_per_query: int = 100,
include_curriculum: bool = False,
include_reviews: bool = False,
db_path: str = "udemy_dataset.db",
csv_path: str = "udemy_courses.csv",
) -> list[dict]:
"""
Full pipeline to collect Udemy courses.
- Searches multiple queries
- Deduplicates by course ID
- Optionally fetches curriculum and reviews
- Saves to SQLite and CSV
"""
db = init_price_db(db_path)
limiter = AdaptiveRateLimiter(initial_delay=1.2)
all_courses = {}
# Phase 1: Search and collect course metadata
for query in queries:
print(f"\nQuery: '{query}'")
try:
courses = search_all(query, max_results=max_per_query)
for c in courses:
if c["id"] not in all_courses:
all_courses[c["id"]] = c
print(f" Collected {len(courses)} courses ({len(all_courses)} total unique)")
except Exception as e:
print(f" Search failed: {e}")
limiter.wait()
unique = list(all_courses.values())
print(f"\nTotal unique courses: {len(unique)}")
# Phase 2: Optionally enrich with curriculum and reviews
if include_curriculum or include_reviews:
for i, course in enumerate(unique):
cid = course["id"]
print(f" Enriching [{i+1}/{len(unique)}] {course['title'][:40]}...")
if include_curriculum:
try:
curriculum = get_curriculum(cid)
course["curriculum"] = curriculum
course["curriculum_stats"] = curriculum_stats(curriculum)
limiter.on_success()
except Exception as e:
print(f" Curriculum error: {e}")
limiter.on_error()
limiter.wait()
if include_reviews:
try:
reviews = get_reviews(cid, max_pages=2)
course["reviews_sample"] = reviews[:20]
course["review_analysis"] = analyze_reviews(reviews)
limiter.on_success()
except Exception as e:
print(f" Reviews error: {e}")
limiter.on_error()
limiter.wait()
# Phase 3: Store in SQLite
for course in unique:
upsert_course(db, course)
record_price(db, course)
db.close()
# Phase 4: Export to CSV
flat_rows = []
for c in unique:
row = {
"id": c["id"],
"title": c["title"],
"url": c["url"],
"price_usd": c["price_amount"],
"is_paid": c["is_paid"],
"is_bestseller": c["is_bestseller"],
"students": c["students"],
"rating": c["rating"],
"reviews": c["reviews"],
"lectures": c["lectures"],
"video_hours": c["video_hours"],
"instructor": c["instructor_name"],
"category": c["category"],
"subcategory": c["subcategory"],
"language": c["language"],
"created": c["created"],
"last_updated": c["last_updated"],
}
if include_curriculum and "curriculum_stats" in c:
row.update(c["curriculum_stats"])
if include_reviews and "review_analysis" in c:
row["avg_review_rating"] = c["review_analysis"].get("avg_rating")
row["five_star_pct"] = c["review_analysis"].get("five_star_pct")
flat_rows.append(row)
if flat_rows:
with open(csv_path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=flat_rows[0].keys())
writer.writeheader()
writer.writerows(flat_rows)
print(f"\nExported {len(flat_rows)} courses to {csv_path}")
return unique
# Run the full pipeline
if __name__ == "__main__":
dataset = build_udemy_dataset(
queries=[
"python programming",
"machine learning",
"web development",
"data science",
"javascript",
],
max_per_query=100,
include_curriculum=False, # Set True for curriculum extraction
include_reviews=False, # Set True for review data
)
print(f"\nDataset complete: {len(dataset)} courses")
Common Gotchas and Edge Cases
Course IDs are permanent, but URLs aren't. Udemy changes the slug portion of course URLs when instructors update the title. Always use the numeric course ID as your primary key, not the URL. Store course["id"] — it never changes.
Price null vs "Free". Some free courses return null for price_detail rather than an amount of 0. The price_amount in our parser handles this with or {}, but don't assume a truthy price_detail means the course is paid.
Subscriber count includes free enrollments. A course with 150,000 "students" and only 800 reviews has a 0.5% review rate — common for free courses. Paid courses average 3-5% review rates. Factor this in if you're using student counts as a quality signal.
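That arithmetic is easy to encode as a sanity check. The 1% threshold below is a rough heuristic derived from the rates quoted above, not a Udemy constant:

```python
def review_rate_pct(students: int, reviews: int) -> float:
    """Reviews per 100 enrolled students."""
    return round(100 * reviews / students, 2) if students else 0.0

def enrollment_looks_inflated(students: int, reviews: int, is_paid: bool) -> bool:
    """Flag paid courses whose review rate sits far below the 3-5% norm,
    which often indicates free-coupon enrollment inflation."""
    return is_paid and review_rate_pct(students, reviews) < 1.0

print(review_rate_pct(150_000, 800))                  # 0.53
print(enrollment_looks_inflated(150_000, 800, True))  # True
```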
The avg_rating_recent field is more useful than avg_rating for assessing current quality. Courses can maintain high overall ratings while declining recently due to outdated content.
Curriculum access restrictions. Some instructors restrict curriculum preview. You'll get a 403 from the curriculum endpoint. Our implementation handles this gracefully.
Multiple instructors. The visible_instructors field returns a list (multi-instructor courses are common). Always handle the multi-instructor case instead of assuming a single entry.
Localized content. Use the language parameter in search to filter by language. language=en returns English courses only. Without it, you'll get courses in 60+ languages mixed together.
Real Use Cases
Price tracker. Run the daily price snapshot every night via cron. Track when courses go on sale — Udemy sale patterns are predictable (major sales every 2-4 weeks). Alert when a watchlisted course drops below your target price.
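Assuming the snapshot pipeline lives in a file called scraper.py (the filename and paths here are placeholders for your own layout), a nightly crontab entry looks like:

```shell
# m h dom mon dow  command: run the price snapshot at 02:30 every night
30 2 * * * cd /path/to/project && /usr/bin/python3 scraper.py >> snapshot.log 2>&1
```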
Competitor curriculum analysis. Working on an online course? Search for courses in your topic, pull all their curricula, and map what topics they cover. Find gaps in the market where competitors are thin.
Market research. Use the category and subcategory fields to map the online education landscape. Which subcategories have the most courses? Which have the highest average ratings? Which are growing fastest (compare snapshots over time)?
Instructor analytics. Map instructor careers across the platform — how many courses does the average successful instructor publish? What's the correlation between course age and subscriber count?
Course quality scoring. Build a composite score from: recent rating, review count, student-to-review ratio, update recency, and video hours. More useful than raw star ratings for course discovery.
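A minimal sketch of such a score, using fields the parser above already extracts. The weights and caps are arbitrary starting points, not a validated model, and update recency is left out for brevity:

```python
def quality_score(course: dict) -> float:
    """Composite 0-100 quality score from a parsed course dict.
    Weights are illustrative guesses; tune them against courses you
    already know to be good or bad."""
    rating = course.get("rating_recent") or course.get("rating") or 0
    reviews = course.get("reviews", 0)
    students = course.get("students", 0)
    hours = course.get("video_hours", 0)

    rating_part = (rating / 5.0) * 50             # recent rating: up to 50 pts
    volume_part = min(reviews / 1000, 1.0) * 20   # review count: up to 20 pts
    rate = (reviews / students * 100) if students else 0
    engagement_part = min(rate / 4.0, 1.0) * 15   # 4%+ review rate: full 15 pts
    depth_part = min(hours / 20, 1.0) * 15        # 20h+ of video: full 15 pts
    return round(rating_part + volume_part + engagement_part + depth_part, 1)

example = {"rating_recent": 4.6, "reviews": 2400, "students": 60_000, "video_hours": 22.0}
print(quality_score(example))  # 96.0
```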
Conclusion
Udemy's internal API is one of the richest undocumented APIs you'll encounter. Clean JSON responses, granular field selection, full curriculum access, and historical price data make it an excellent foundation for online education analytics.
The core rules: use realistic browser headers, respect rate limits with adaptive delays, use residential proxies like ThorData for geo-pricing and scale, store course IDs as your primary key, and handle the handful of edge cases around free courses and multi-instructor setups. With those in place, you can build a production-grade Udemy dataset in under a day.