Scraping Udemy Course Data in 2026: Prices, Ratings, Curriculum, and Student Counts
Udemy hosts over 250,000 courses across every conceivable topic. If you're building a course comparison tool, tracking pricing patterns, analyzing the online education market, or gathering competitive intelligence on online learning, you need structured course data at scale. Udemy shut down their public affiliate API in 2020, but they still run a rich internal API that powers their website and mobile apps. In 2026, it remains the most reliable way to get course data programmatically.
This guide is comprehensive. We'll cover: the internal API structure, searching and paginating courses, extracting full curriculum trees, scraping student reviews, tracking prices over time, handling Udemy's anti-bot protections, building a SQLite-backed pipeline, and exporting to common formats. By the end you'll have a working scraper capable of collecting tens of thousands of courses with full metadata.
Why the Internal API (Not HTML Scraping)
You have two options for getting Udemy data: scraping the rendered HTML or calling the internal JSON API directly.
HTML scraping is fragile. Udemy's frontend is a React SPA with dynamically generated class names that rotate on deploys. What works today breaks in a week.
The internal API is different. When you browse Udemy, your browser calls www.udemy.com/api-2.0/. This same API powers the iOS app, the Android app, and the website. It returns clean, structured JSON with consistent field names. It's been stable for years. Yes, it's undocumented and officially unsanctioned — but it's also the same API that everyone scraping Udemy uses, and it works reliably.
The key endpoints:
- /api-2.0/courses/ — Search and browse courses
- /api-2.0/courses/{id}/ — Full metadata for a single course
- /api-2.0/courses/{id}/public-curriculum-items/ — Full curriculum tree
- /api-2.0/courses/{id}/reviews/ — Student reviews with ratings
- /api-2.0/courses/{id}/instructor-performance/ — Instructor stats
- /api-2.0/users/{id}/ — Instructor profile data
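All of these endpoints wrap their results in the same paginated envelope. The field values below are invented for illustration, but the count / next / previous / results structure is the contract the pagination helpers later in this guide rely on:

```python
# Illustrative shape of a paginated /api-2.0/ response. Values here are
# made up; only the envelope structure is the real contract.
sample_response = {
    "count": 10000,    # total number of matching items
    "next": "https://www.udemy.com/api-2.0/courses/?page=2&search=python",
    "previous": None,  # null on the first page
    "results": [
        {
            "_class": "course",  # each item carries its type in _class
            "id": 12345,         # numeric ID: the stable primary key
            "title": "Example Python Course",
            "url": "/course/example-python-course/",
            # ...plus whatever fields you requested via fields[course]
        },
    ],
}

def has_more_pages(response: dict) -> bool:
    """A non-null 'next' URL means another page exists."""
    return response.get("next") is not None

print(has_more_pages(sample_response))  # True
```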
Setting Up Your HTTP Client
Use httpx instead of requests — it has better connection pooling, native async support, and optional HTTP/2 support, which matters because Udemy's CDN speaks HTTP/2. Install it with the http2 extra:
pip install "httpx[http2]"
import httpx
import time
import json
import random
from typing import Optional
# Realistic browser headers — critical for avoiding immediate blocks
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/127.0.0.0 Safari/537.36",
"Accept": "application/json, text/plain, */*",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Referer": "https://www.udemy.com/",
"Origin": "https://www.udemy.com",
"sec-ch-ua": '"Not)A;Brand";v="99", "Google Chrome";v="127", "Chromium";v="127"',
"sec-ch-ua-mobile": "?0",
"sec-ch-ua-platform": '"macOS"',
"sec-fetch-dest": "empty",
"sec-fetch-mode": "cors",
"sec-fetch-site": "same-origin",
}
client = httpx.Client(
    timeout=30,
    headers=HEADERS,
    http2=True,  # requires the httpx[http2] extra
    follow_redirects=True,
)
Note the sec-ch-ua headers — these Chrome client hints are checked by Cloudflare and missing them is a signal that the request is automated.
Searching and Browsing Courses
The search endpoint accepts a fields[] parameter to control which fields are returned. Only request fields you need — wider field sets increase response size and may trigger stricter rate limiting:
def search_courses(
query: str,
page: int = 1,
page_size: int = 20,
ordering: str = "relevance",
language: str = "en",
price: str = "price-paid", # price-paid | price-free
) -> dict:
"""
Search Udemy courses.
ordering: relevance | most-reviewed | highest-rated | newest | enrollment
price: price-paid | price-free (omit to get both)
"""
url = "https://www.udemy.com/api-2.0/courses/"
params = {
"search": query,
"page": page,
"page_size": page_size,
"ordering": ordering,
"language": language,
"fields[course]": ",".join([
"title", "url", "price", "price_detail",
"num_subscribers", "avg_rating", "avg_rating_recent",
"num_reviews", "num_lectures", "content_length_video",
"created", "last_update_date", "published_title",
"locale", "visible_instructors", "primary_category",
"primary_subcategory", "is_paid", "is_bestseller",
"headline", "image_480x270",
]),
"fields[user]": "title,display_name,job_title,url",
"fields[locale]": "simple_english_title",
}
if price:
params["price"] = price
resp = client.get(url, params=params)
resp.raise_for_status()
return resp.json()
def parse_course(c: dict) -> dict:
"""Normalize a raw course dict from the API."""
price_detail = c.get("price_detail") or {}
    instructors = c.get("visible_instructors") or []
    primary_instructor = instructors[0] if instructors else {}
return {
"id": c["id"],
"title": c["title"],
"url": f"https://www.udemy.com{c['url']}",
"slug": c.get("published_title", ""),
"headline": c.get("headline", ""),
"price": c.get("price", "Free"),
"price_amount": price_detail.get("amount", 0),
"currency": price_detail.get("currency", "USD"),
"is_paid": c.get("is_paid", True),
"is_bestseller": c.get("is_bestseller", False),
"students": c.get("num_subscribers", 0),
"rating": round(c.get("avg_rating", 0), 2),
"rating_recent": round(c.get("avg_rating_recent", 0), 2),
"reviews": c.get("num_reviews", 0),
"lectures": c.get("num_lectures", 0),
"video_hours": round(c.get("content_length_video", 0) / 3600, 1),
"created": c.get("created", ""),
"last_updated": c.get("last_update_date", ""),
"instructor_name": primary_instructor.get("display_name", ""),
"instructor_title": primary_instructor.get("job_title", ""),
"instructor_url": primary_instructor.get("url", ""),
"category": (c.get("primary_category") or {}).get("title", ""),
"subcategory": (c.get("primary_subcategory") or {}).get("title", ""),
"language": (c.get("locale") or {}).get("simple_english_title", ""),
"thumbnail": c.get("image_480x270", ""),
}
def search_all(query: str, max_results: int = 200, delay: float = 1.0) -> list[dict]:
"""Paginate through search results, up to max_results courses."""
all_courses = []
page = 1
while len(all_courses) < max_results:
data = search_courses(query, page=page, page_size=20)
results = data.get("results", [])
if not results:
break
all_courses.extend([parse_course(c) for c in results])
print(f" Page {page}: got {len(results)} courses (total {len(all_courses)})")
if not data.get("next"):
break
page += 1
time.sleep(delay + random.uniform(0, 0.5))
return all_courses[:max_results]
# Example: collect Python courses
python_courses = search_all("python programming", max_results=100)
for c in python_courses[:5]:
print(f"{c['title'][:60]}")
print(f" {c['students']:,} students | {c['rating']}★ ({c['reviews']:,} reviews)")
print(f" {c['lectures']} lectures, {c['video_hours']}h | {c['price']}")
print(f" Instructor: {c['instructor_name']}")
print()
Fetching Full Course Detail
The search endpoint returns a subset of fields. For complete metadata on a specific course, hit the course detail endpoint:
def get_course_detail(course_id: int) -> dict:
"""Get complete metadata for a single course by ID."""
url = f"https://www.udemy.com/api-2.0/courses/{course_id}/"
params = {
"fields[course]": ",".join([
"title", "headline", "url", "price", "price_detail",
"num_subscribers", "avg_rating", "avg_rating_recent",
"num_reviews", "num_lectures", "content_length_video",
"visible_instructors", "primary_category",
"primary_subcategory", "requirements_data",
"what_you_will_learn_data", "target_audiences",
"locale", "created", "last_update_date", "caption_languages",
"has_certificate", "description", "is_paid", "is_bestseller",
]),
"fields[user]": "title,display_name,job_title,url,num_followers,"
"num_reviews,avg_rating,num_published_courses",
}
resp = client.get(url, params=params)
resp.raise_for_status()
data = resp.json()
# Parse learning objectives
objectives = [
item.get("text", "")
for item in (data.get("what_you_will_learn_data") or {}).get("items", [])
]
# Parse requirements
requirements = [
item.get("text", "")
for item in (data.get("requirements_data") or {}).get("items", [])
]
# Parse target audiences
audiences = [
item.get("text", "")
for item in (data.get("target_audiences") or {}).get("items", [])
]
# Multi-instructor support
instructors = []
for inst in (data.get("visible_instructors") or []):
instructors.append({
"name": inst.get("display_name", ""),
"title": inst.get("job_title", ""),
"url": inst.get("url", ""),
"followers": inst.get("num_followers", 0),
"avg_rating": inst.get("avg_rating", 0),
"courses": inst.get("num_published_courses", 0),
})
course = parse_course(data)
course.update({
"description": data.get("description", ""),
"objectives": objectives,
"requirements": requirements,
"target_audiences": audiences,
"instructors": instructors,
"caption_languages": data.get("caption_languages", []),
"has_certificate": data.get("has_certificate", False),
})
return course
# Fetch full detail for a specific course
detail = get_course_detail(python_courses[0]["id"])
print(f"Title: {detail['title']}")
print(f"Description: {detail['description'][:200]}...")
print(f"Objectives ({len(detail['objectives'])}):")
for obj in detail['objectives'][:3]:
print(f" - {obj}")
Extracting the Full Curriculum Tree
The curriculum endpoint returns every section, lecture, quiz, and practice assignment. This is invaluable for competitive analysis — you can see exactly what a competitor covers without paying for their course:
def get_curriculum(course_id: int) -> list[dict]:
"""
Fetch the complete curriculum for a course.
Returns a list of sections, each with nested lectures.
"""
url = f"https://www.udemy.com/api-2.0/courses/{course_id}/public-curriculum-items/"
params = {
"page_size": 200,
"fields[chapter]": "title,sort_order,object_index,description",
"fields[lecture]": "title,content_summary,is_free,sort_order,asset",
"fields[practice]": "title,sort_order,estimated_duration",
"fields[quiz]": "title,sort_order",
"fields[asset]": "asset_type,time_estimation",
}
all_items = []
page = 1
while True:
params["page"] = page
resp = client.get(url, params=params)
if resp.status_code == 403:
# Some courses restrict curriculum to enrolled students
print(f" Curriculum access restricted for course {course_id}")
break
if resp.status_code != 200:
break
data = resp.json()
batch = data.get("results", [])
all_items.extend(batch)
if not data.get("next"):
break
page += 1
time.sleep(0.3)
# Organize into sections with nested items
sections = []
current_section = None
for item in all_items:
cls = item.get("_class", "")
if cls == "chapter":
current_section = {
"title": item["title"],
"index": item.get("object_index", 0),
"description": item.get("description", ""),
"lectures": [],
"quizzes": [],
"practices": [],
}
sections.append(current_section)
elif cls == "lecture" and current_section is not None:
asset = item.get("asset") or {}
duration_secs = asset.get("time_estimation", 0)
current_section["lectures"].append({
"title": item["title"],
"duration_seconds": duration_secs,
"duration_display": item.get("content_summary", ""),
"is_free": item.get("is_free", False),
"asset_type": asset.get("asset_type", ""),
})
elif cls == "quiz" and current_section is not None:
current_section["quizzes"].append({"title": item["title"]})
elif cls == "practice" and current_section is not None:
current_section["practices"].append({
"title": item["title"],
"estimated_duration": item.get("estimated_duration", ""),
})
return sections
def curriculum_stats(sections: list[dict]) -> dict:
"""Summarize a curriculum."""
total_lectures = sum(len(s["lectures"]) for s in sections)
total_quizzes = sum(len(s["quizzes"]) for s in sections)
total_practices = sum(len(s["practices"]) for s in sections)
free_lectures = sum(
1 for s in sections for lec in s["lectures"] if lec["is_free"]
)
total_seconds = sum(
lec["duration_seconds"]
for s in sections for lec in s["lectures"]
)
return {
"sections": len(sections),
"lectures": total_lectures,
"quizzes": total_quizzes,
"practices": total_practices,
"free_preview_lectures": free_lectures,
"total_hours": round(total_seconds / 3600, 1),
}
# Example
course_id = python_courses[0]["id"]
curriculum = get_curriculum(course_id)
stats = curriculum_stats(curriculum)
print(f"\nCurriculum stats: {stats}")
print(f"\nFirst 5 sections:")
for section in curriculum[:5]:
print(f" [{section['index']}] {section['title']} ({len(section['lectures'])} lectures)")
for lec in section["lectures"][:3]:
free_tag = " [FREE]" if lec["is_free"] else ""
print(f" - {lec['title']} ({lec['duration_display']}){free_tag}")
Scraping Student Reviews
Reviews contain rating values, text, creation dates, and reviewer names. They're paginated and sorted by recency by default:
def get_reviews(
course_id: int,
max_pages: int = 10,
ordering: str = "-created",
) -> list[dict]:
"""
Fetch reviews for a course.
ordering: -created (newest) | -helpful (most helpful) | -rating
"""
url = f"https://www.udemy.com/api-2.0/courses/{course_id}/reviews/"
params = {
"page_size": 50,
"ordering": ordering,
"fields[course_review]": "title,content,rating,created,user_modified,user",
"fields[user]": "display_name,title,name",
}
reviews = []
for page in range(1, max_pages + 1):
params["page"] = page
resp = client.get(url, params=params)
if resp.status_code != 200:
break
data = resp.json()
for r in data.get("results", []):
user = r.get("user") or {}
reviews.append({
"rating": r.get("rating", 0),
"title": r.get("title", ""),
"content": r.get("content", ""),
"created": r.get("created", ""),
"modified": r.get("user_modified", ""),
"user_name": user.get("display_name", user.get("name", "")),
"user_title": user.get("title", ""),
})
if not data.get("next"):
break
time.sleep(0.5)
return reviews
def analyze_reviews(reviews: list[dict]) -> dict:
"""Extract sentiment signals from reviews."""
if not reviews:
return {}
ratings = [r["rating"] for r in reviews]
avg = sum(ratings) / len(ratings)
dist = {i: ratings.count(i) for i in range(1, 6)}
# Extract common praise/complaint phrases
all_text = " ".join(r["content"].lower() for r in reviews)
positive_signals = [
"excellent", "great", "amazing", "highly recommend", "best course",
"very clear", "easy to follow", "well explained", "perfect",
]
negative_signals = [
"outdated", "too fast", "confusing", "poor quality", "not worth",
"too slow", "boring", "repetitive", "waste", "disappointing",
]
praise = [p for p in positive_signals if p in all_text]
complaints = [n for n in negative_signals if n in all_text]
return {
"total_reviews": len(reviews),
"avg_rating": round(avg, 2),
"rating_distribution": dist,
"five_star_pct": round(100 * dist[5] / len(ratings), 1),
"one_star_pct": round(100 * dist[1] / len(ratings), 1),
"common_praise": praise,
"common_complaints": complaints,
}
# Get and analyze reviews
reviews = get_reviews(course_id, max_pages=3)
analysis = analyze_reviews(reviews)
print(f"\nReview analysis: {json.dumps(analysis, indent=2)}")
Price Tracking and Sale Detection
Udemy courses swing from $199.99 to $9.99 constantly. The platform practically trains users to wait for sales. If you're building a price comparison tool or just want to buy at the right time, track prices daily:
import sqlite3
from datetime import datetime, date
def init_price_db(db_path: str = "udemy_prices.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS courses (
id INTEGER PRIMARY KEY,
title TEXT,
url TEXT,
category TEXT,
subcategory TEXT,
instructor TEXT,
first_seen TEXT
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS price_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
course_id INTEGER NOT NULL,
price_amount REAL,
currency TEXT,
is_paid INTEGER,
students INTEGER,
rating REAL,
reviews INTEGER,
captured_date TEXT,
captured_at TEXT,
UNIQUE(course_id, captured_date)
)
""")
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_price_course_date
ON price_history(course_id, captured_date)
""")
conn.commit()
return conn
def upsert_course(conn: sqlite3.Connection, course: dict):
conn.execute("""
INSERT OR IGNORE INTO courses (id, title, url, category, subcategory, instructor, first_seen)
VALUES (?, ?, ?, ?, ?, ?, ?)
""", (
course["id"], course["title"], course["url"],
course.get("category", ""), course.get("subcategory", ""),
course.get("instructor_name", ""), date.today().isoformat()
))
conn.commit()
def record_price(conn: sqlite3.Connection, course: dict):
today = date.today().isoformat()
conn.execute("""
INSERT OR REPLACE INTO price_history
(course_id, price_amount, currency, is_paid, students, rating, reviews, captured_date, captured_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
course["id"], course["price_amount"], course["currency"],
int(course["is_paid"]), course["students"],
course["rating"], course["reviews"],
today, datetime.utcnow().isoformat()
))
conn.commit()
def detect_price_drops(conn: sqlite3.Connection, min_drop_pct: float = 50.0) -> list[dict]:
"""Find courses that dropped by min_drop_pct% since yesterday."""
rows = conn.execute("""
SELECT a.course_id, c.title,
b.price_amount AS old_price, a.price_amount AS new_price,
b.captured_date AS prev_date, a.captured_date AS curr_date
FROM price_history a
JOIN price_history b ON a.course_id = b.course_id
JOIN courses c ON a.course_id = c.id
WHERE a.captured_date = date('now')
AND b.captured_date = date('now', '-1 day')
AND b.price_amount > 0
AND a.price_amount < b.price_amount
AND ((b.price_amount - a.price_amount) / b.price_amount * 100) >= ?
""", (min_drop_pct,)).fetchall()
drops = []
for row in rows:
course_id, title, old_price, new_price, prev_date, curr_date = row
drop_pct = (old_price - new_price) / old_price * 100
drops.append({
"course_id": course_id,
"title": title,
"old_price": old_price,
"new_price": new_price,
"drop_pct": round(drop_pct, 1),
"curr_date": curr_date,
})
return drops
def get_price_history(conn: sqlite3.Connection, course_id: int, days: int = 30) -> list[dict]:
"""Get price history for a single course."""
rows = conn.execute("""
SELECT captured_date, price_amount, students, rating
FROM price_history
WHERE course_id = ?
ORDER BY captured_date DESC
LIMIT ?
""", (course_id, days)).fetchall()
return [
{"date": r[0], "price": r[1], "students": r[2], "rating": r[3]}
for r in rows
]
# Daily price snapshot pipeline
def daily_price_snapshot(queries: list[str], db_path: str = "udemy_prices.db"):
db = init_price_db(db_path)
all_courses = []
for query in queries:
print(f"\nSearching: '{query}'")
courses = search_all(query, max_results=50)
all_courses.extend(courses)
time.sleep(2)
# Deduplicate by course ID
seen = set()
unique_courses = []
for c in all_courses:
if c["id"] not in seen:
seen.add(c["id"])
unique_courses.append(c)
print(f"\nRecording prices for {len(unique_courses)} unique courses...")
for c in unique_courses:
upsert_course(db, c)
record_price(db, c)
drops = detect_price_drops(db, min_drop_pct=70.0)
if drops:
print(f"\nPrice drops detected ({len(drops)}):")
for d in drops:
print(f" {d['title'][:50]} | ${d['old_price']} -> ${d['new_price']} (-{d['drop_pct']}%)")
else:
print("No major price drops today.")
db.close()
Using ThorData Proxies for Scale and Geo-Pricing
Udemy implements geo-targeted pricing. A Python course priced at $19.99 in the US might be $3.99 in India or $7.99 in Brazil. If you're building a price comparison tool, you need to capture prices from multiple countries — and your proxy location determines which prices you see.
For scale, Udemy's rate limiter tracks by IP. From a single IP you can make roughly 50-60 API calls per minute before getting 429s. At that rate, collecting 10,000 courses takes several hours.
ThorData's residential proxy network solves both problems — you get country-targeted residential IPs to capture geo-specific prices, and IP rotation to stay under rate limits:
import httpx
# ThorData proxy configuration — country-targeted
PROXY_CONFIGS = {
"us": "http://USER-country-us:[email protected]:9000",
"uk": "http://USER-country-gb:[email protected]:9000",
"in": "http://USER-country-in:[email protected]:9000",
"br": "http://USER-country-br:[email protected]:9000",
"de": "http://USER-country-de:[email protected]:9000",
}
def make_geo_client(country_code: str) -> httpx.Client:
"""Create an httpx client routed through a specific country's proxy."""
proxy_url = PROXY_CONFIGS.get(country_code)
    # httpx >= 0.26 takes a single `proxy=` argument; the old
    # `proxies=` mapping was removed in httpx 0.28
    return httpx.Client(
        timeout=30,
        headers=HEADERS,
        proxy=proxy_url,  # None falls back to a direct connection
    )
def compare_geo_prices(course_id: int, countries: list[str] | None = None) -> dict:
"""Compare course pricing across multiple countries."""
if countries is None:
countries = ["us", "uk", "in", "br"]
prices = {}
for country in countries:
geo_client = make_geo_client(country)
try:
url = f"https://www.udemy.com/api-2.0/courses/{course_id}/"
params = {"fields[course]": "title,price,price_detail"}
resp = geo_client.get(url, params=params)
resp.raise_for_status()
data = resp.json()
price_detail = data.get("price_detail") or {}
prices[country] = {
"price": price_detail.get("amount", 0),
"currency": price_detail.get("currency", ""),
"formatted": data.get("price", ""),
}
except Exception as e:
prices[country] = {"error": str(e)}
finally:
geo_client.close()
time.sleep(1.0)
return prices
# Compare pricing across regions
if python_courses:
geo_prices = compare_geo_prices(python_courses[0]["id"])
print("\nGeo-pricing comparison:")
for country, data in geo_prices.items():
if "price" in data:
print(f" {country.upper()}: {data['formatted']} ({data['currency']})")
Anti-Detection and Rate Limiting Strategy
Udemy's anti-bot stack in 2026 has several layers:
Cloudflare is the first line. They check TLS fingerprints (your SSL handshake pattern), HTTP/2 fingerprints, and request headers. The sec-ch-ua headers we set in the initial client configuration matter here.
Rate limiting is IP-based. The API returns 429 Too Many Requests after roughly 50-60 requests per minute. The limit varies by endpoint — search is stricter than course detail.
Session tracking is behavioral. Rapid sequential requests to different API endpoints trigger soft blocks where responses still succeed but are gradually throttled. Mixing in random delays prevents this.
Bot detection cookies are set on first visit. Cookies like ud_cache_nav and tracking IDs are validated on subsequent requests.
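One practical mitigation is to let the client browse a page or two before touching the API, so the session acquires those first-visit cookies naturally. This warm_up_session helper is a hypothetical sketch, not anything Udemy documents; adjust the paths to whatever your crawl would realistically visit first:

```python
import random
import time

def warm_up_session(client) -> None:
    """Visit a couple of regular pages (via an httpx.Client or similar)
    so the session collects Udemy's first-visit cookies."""
    for path in ("/", "/courses/development/"):
        try:
            # the client stores Set-Cookie values automatically, so later
            # /api-2.0/ calls send them back
            client.get(f"https://www.udemy.com{path}")
        except Exception:
            pass  # warm-up is best-effort; don't fail the whole run
        time.sleep(random.uniform(1.0, 2.5))  # human-ish pause between pages
```

Call it once right after constructing the client, before the first API request.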
Here's a robust request wrapper:
import time
import random
from typing import Optional
def api_get_with_retry(
url: str,
params: dict,
max_retries: int = 5,
base_delay: float = 1.0,
) -> Optional[dict]:
"""
Make an API request with exponential backoff on rate limits.
Returns None if all retries exhausted.
"""
for attempt in range(max_retries):
try:
# Add jitter to avoid thundering herd
jitter = random.uniform(0, 0.3 * base_delay)
if attempt > 0:
wait_time = base_delay * (2 ** attempt) + jitter
print(f" Retry {attempt}/{max_retries}, waiting {wait_time:.1f}s")
time.sleep(wait_time)
resp = client.get(url, params=params)
if resp.status_code == 200:
return resp.json()
elif resp.status_code == 429:
retry_after = int(resp.headers.get("Retry-After", 60))
print(f" Rate limited. Waiting {retry_after}s")
time.sleep(retry_after)
continue
elif resp.status_code == 403:
print(f" 403 Forbidden — possible bot detection. Backing off.")
time.sleep(base_delay * (2 ** (attempt + 2)))
continue
elif resp.status_code == 404:
return None # Course not found, don't retry
else:
print(f" HTTP {resp.status_code}, retrying...")
continue
except httpx.TimeoutException:
print(f" Timeout on attempt {attempt + 1}")
except httpx.NetworkError as e:
print(f" Network error: {e}")
return None
For large-scale collection (10,000+ courses), use an adaptive delay that backs off when you see rising error rates:
class AdaptiveRateLimiter:
"""Automatically adjust request rate based on error patterns."""
def __init__(self, initial_delay: float = 1.0):
self.delay = initial_delay
self.min_delay = 0.5
self.max_delay = 30.0
self.success_streak = 0
self.error_streak = 0
def on_success(self):
self.success_streak += 1
self.error_streak = 0
# Gradually speed up after sustained success
if self.success_streak >= 10 and self.delay > self.min_delay:
self.delay = max(self.min_delay, self.delay * 0.9)
def on_error(self):
self.error_streak += 1
self.success_streak = 0
# Aggressively slow down on errors
self.delay = min(self.max_delay, self.delay * 2.0)
def wait(self):
jitter = random.uniform(0, self.delay * 0.2)
time.sleep(self.delay + jitter)
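To see the backoff arithmetic concretely, here is a standalone simulation (the class is repeated from above so the snippet runs on its own):

```python
import random
import time

class AdaptiveRateLimiter:
    """Automatically adjust request rate based on error patterns."""
    def __init__(self, initial_delay: float = 1.0):
        self.delay = initial_delay
        self.min_delay = 0.5
        self.max_delay = 30.0
        self.success_streak = 0
        self.error_streak = 0

    def on_success(self):
        self.success_streak += 1
        self.error_streak = 0
        # Gradually speed up after sustained success
        if self.success_streak >= 10 and self.delay > self.min_delay:
            self.delay = max(self.min_delay, self.delay * 0.9)

    def on_error(self):
        self.error_streak += 1
        self.success_streak = 0
        # Aggressively slow down on errors
        self.delay = min(self.max_delay, self.delay * 2.0)

    def wait(self):
        jitter = random.uniform(0, self.delay * 0.2)
        time.sleep(self.delay + jitter)

limiter = AdaptiveRateLimiter(initial_delay=1.0)
for _ in range(3):       # three straight errors: the delay doubles each time
    limiter.on_error()
print(limiter.delay)     # 8.0

for _ in range(10):      # easing kicks in on the 10th consecutive success
    limiter.on_success()
print(limiter.delay)     # 7.2
```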
Building a Full Dataset Pipeline
Combining all the above into a complete, production-ready pipeline:
import json
import csv
import sqlite3
from pathlib import Path
from datetime import date
def build_udemy_dataset(
queries: list[str],
max_per_query: int = 100,
include_curriculum: bool = False,
include_reviews: bool = False,
db_path: str = "udemy_dataset.db",
csv_path: str = "udemy_courses.csv",
) -> list[dict]:
"""
Full pipeline to collect Udemy courses.
- Searches multiple queries
- Deduplicates by course ID
- Optionally fetches curriculum and reviews
- Saves to SQLite and CSV
"""
db = init_price_db(db_path)
limiter = AdaptiveRateLimiter(initial_delay=1.2)
all_courses = {}
# Phase 1: Search and collect course metadata
for query in queries:
print(f"\nQuery: '{query}'")
try:
courses = search_all(query, max_results=max_per_query)
for c in courses:
if c["id"] not in all_courses:
all_courses[c["id"]] = c
print(f" Collected {len(courses)} courses ({len(all_courses)} total unique)")
except Exception as e:
print(f" Search failed: {e}")
limiter.wait()
unique = list(all_courses.values())
print(f"\nTotal unique courses: {len(unique)}")
# Phase 2: Optionally enrich with curriculum and reviews
if include_curriculum or include_reviews:
for i, course in enumerate(unique):
cid = course["id"]
print(f" Enriching [{i+1}/{len(unique)}] {course['title'][:40]}...")
if include_curriculum:
try:
curriculum = get_curriculum(cid)
course["curriculum"] = curriculum
course["curriculum_stats"] = curriculum_stats(curriculum)
limiter.on_success()
except Exception as e:
print(f" Curriculum error: {e}")
limiter.on_error()
limiter.wait()
if include_reviews:
try:
reviews = get_reviews(cid, max_pages=2)
course["reviews_sample"] = reviews[:20]
course["review_analysis"] = analyze_reviews(reviews)
limiter.on_success()
except Exception as e:
print(f" Reviews error: {e}")
limiter.on_error()
limiter.wait()
# Phase 3: Store in SQLite
for course in unique:
upsert_course(db, course)
record_price(db, course)
db.close()
# Phase 4: Export to CSV
flat_rows = []
for c in unique:
row = {
"id": c["id"],
"title": c["title"],
"url": c["url"],
"price_usd": c["price_amount"],
"is_paid": c["is_paid"],
"is_bestseller": c["is_bestseller"],
"students": c["students"],
"rating": c["rating"],
"reviews": c["reviews"],
"lectures": c["lectures"],
"video_hours": c["video_hours"],
"instructor": c["instructor_name"],
"category": c["category"],
"subcategory": c["subcategory"],
"language": c["language"],
"created": c["created"],
"last_updated": c["last_updated"],
}
if include_curriculum and "curriculum_stats" in c:
row.update(c["curriculum_stats"])
if include_reviews and "review_analysis" in c:
row["avg_review_rating"] = c["review_analysis"].get("avg_rating")
row["five_star_pct"] = c["review_analysis"].get("five_star_pct")
flat_rows.append(row)
if flat_rows:
with open(csv_path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=flat_rows[0].keys())
writer.writeheader()
writer.writerows(flat_rows)
print(f"\nExported {len(flat_rows)} courses to {csv_path}")
return unique
# Run the full pipeline
if __name__ == "__main__":
dataset = build_udemy_dataset(
queries=[
"python programming",
"machine learning",
"web development",
"data science",
"javascript",
],
max_per_query=100,
include_curriculum=False, # Set True for curriculum extraction
include_reviews=False, # Set True for review data
)
print(f"\nDataset complete: {len(dataset)} courses")
Common Gotchas and Edge Cases
Course IDs are permanent, but URLs aren't. Udemy changes the slug portion of course URLs when instructors update the title. Always use the numeric course ID as your primary key, not the URL. Store course["id"] — it never changes.
Price null vs "Free". Some free courses return null for price_detail rather than an amount of 0. The price_amount in our parser handles this with or {}, but don't assume a truthy price_detail means the course is paid.
Subscriber count includes free enrollments. A course with 150,000 "students" and only 800 reviews has a 0.5% review rate — common for free courses. Paid courses average 3-5% review rates. Factor this in if you're using student counts as a quality signal.
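That arithmetic is easy to encode as a sanity check. The 1% threshold below is a rough heuristic derived from the rates quoted above, not a Udemy constant:

```python
def review_rate_pct(students: int, reviews: int) -> float:
    """Reviews per 100 enrolled students."""
    return round(100 * reviews / students, 2) if students else 0.0

def enrollment_looks_inflated(students: int, reviews: int, is_paid: bool) -> bool:
    """Flag paid courses whose review rate sits far below the 3-5% norm,
    which often indicates free-coupon enrollment inflation."""
    return is_paid and review_rate_pct(students, reviews) < 1.0

print(review_rate_pct(150_000, 800))                  # 0.53
print(enrollment_looks_inflated(150_000, 800, True))  # True
```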
The avg_rating_recent field is more useful than avg_rating for assessing current quality. Courses can maintain high overall ratings while declining recently due to outdated content.
Curriculum access restrictions. Some instructors restrict curriculum preview. You'll get a 403 from the curriculum endpoint. Our implementation handles this gracefully.
Multiple instructors. The visible_instructors field returns a list (multi-instructor courses are common). Always handle the multi-instructor case instead of assuming a single entry.
Localized content. Use the language parameter in search to filter by language. language=en returns English courses only. Without it, you'll get courses in 60+ languages mixed together.
Real Use Cases
Price tracker. Run the daily price snapshot every night via cron. Track when courses go on sale — Udemy sale patterns are predictable (major sales every 2-4 weeks). Alert when a watchlisted course drops below your target price.
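Assuming the snapshot pipeline lives in a file called scraper.py (the filename and paths here are placeholders for your own layout), a nightly crontab entry looks like:

```shell
# m h dom mon dow  command: run the price snapshot at 02:30 every night
30 2 * * * cd /path/to/project && /usr/bin/python3 scraper.py >> snapshot.log 2>&1
```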
Competitor curriculum analysis. Working on an online course? Search for courses in your topic, pull all their curricula, and map what topics they cover. Find gaps in the market where competitors are thin.
Market research. Use the category and subcategory fields to map the online education landscape. Which subcategories have the most courses? Which have the highest average ratings? Which are growing fastest (compare snapshots over time)?
Instructor analytics. Map instructor careers across the platform — how many courses does the average successful instructor publish? What's the correlation between course age and subscriber count?
Course quality scoring. Build a composite score from: recent rating, review count, student-to-review ratio, update recency, and video hours. More useful than raw star ratings for course discovery.
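A minimal sketch of such a score, using fields the parser above already extracts. The weights and caps are arbitrary starting points, not a validated model, and update recency is left out for brevity:

```python
def quality_score(course: dict) -> float:
    """Composite 0-100 quality score from a parsed course dict.
    Weights are illustrative guesses; tune them against courses you
    already know to be good or bad."""
    rating = course.get("rating_recent") or course.get("rating") or 0
    reviews = course.get("reviews", 0)
    students = course.get("students", 0)
    hours = course.get("video_hours", 0)

    rating_part = (rating / 5.0) * 50             # recent rating: up to 50 pts
    volume_part = min(reviews / 1000, 1.0) * 20   # review count: up to 20 pts
    rate = (reviews / students * 100) if students else 0
    engagement_part = min(rate / 4.0, 1.0) * 15   # 4%+ review rate: full 15 pts
    depth_part = min(hours / 20, 1.0) * 15        # 20h+ of video: full 15 pts
    return round(rating_part + volume_part + engagement_part + depth_part, 1)

example = {"rating_recent": 4.6, "reviews": 2400, "students": 60_000, "video_hours": 22.0}
print(quality_score(example))  # 96.0
```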
Conclusion
Udemy's internal API is one of the richest undocumented APIs you'll encounter. Clean JSON responses, granular field selection, full curriculum access, and historical price data make it an excellent foundation for online education analytics.
The core rules: use realistic browser headers, respect rate limits with adaptive delays, use residential proxies like ThorData for geo-pricing and scale, store course IDs as your primary key, and handle the handful of edge cases around free courses and multi-instructor setups. With those in place, you can build a production-grade Udemy dataset in under a day.