Scraping MIT OpenCourseWare: Course Catalog, Lectures & Problem Sets with Python (2026)
MIT OpenCourseWare (OCW) has published materials from over 2,500 MIT courses — for free. Lecture notes, problem sets, exams, video lectures, and reading lists covering everything from introductory calculus to graduate-level quantum field theory.
If you're building an educational search engine, studying how university curricula evolve, constructing AI training datasets from high-quality problem/solution pairs, or simply want offline access to MIT course materials, scraping OCW is a worthwhile project. The site is relatively scraper-friendly compared to commercial targets — the content is explicitly meant to be open — but there are still practical nuances to cover.
What's Available
Each course on OCW can include:
- Syllabus — course description, prerequisites, grading structure, course goals
- Lecture notes — usually PDFs, sometimes HTML pages with embedded LaTeX
- Problem sets — assignments, often with solution sets in separate PDFs
- Exams — midterms and finals, often with complete answer keys
- Video lectures — hosted on YouTube with auto-generated and human-edited transcripts
- Reading lists — required and recommended textbooks, with ISBNs
- Course calendar — week-by-week topic schedule with readings mapped to each session
- Projects — final project descriptions, rubrics, and sometimes example submissions
- Recitation notes — supplementary materials from TA sessions
Not every course has all of these. Older courses (pre-2005) may have only a syllabus and reading list. Well-resourced courses — 6.006 (Algorithms), 18.06 (Linear Algebra), 8.04 (Quantum Physics) — have the full package.
License
OCW content is published under Creative Commons BY-NC-SA 4.0. You can share and adapt the materials for non-commercial purposes with attribution. This is one of the rare scraping targets where the content creators explicitly want you to use the data — OpenCourseWare exists specifically to make this material accessible.
For building ML training datasets, research papers, or educational tools, this license gives you substantial freedom. The key restrictions are attribution (cite MIT OCW) and non-commercial (don't sell the raw materials without transformation).
Site Structure
OCW's URL structure follows a predictable pattern:
https://ocw.mit.edu/courses/{department}/{course-slug}/
https://ocw.mit.edu/courses/{department}/{course-slug}/lecture-notes/
https://ocw.mit.edu/courses/{department}/{course-slug}/assignments/
https://ocw.mit.edu/courses/{department}/{course-slug}/exams/
https://ocw.mit.edu/courses/{department}/{course-slug}/video-lectures/
Course slugs follow the pattern {course-number}-{course-name}-{semester}, e.g., 6-006-introduction-to-algorithms-fall-2011.
The site has a search/browse interface at https://ocw.mit.edu/search that supports filtering by department, level, and features.
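Note that OCW has been redesigned several times and URL details (department segments, section paths) have shifted between versions, so verify these patterns against live pages before hardcoding them. As a small illustration of working with slugs, the following sketch splits one back into its parts; the slug grammar it assumes is inferred from the pattern above rather than documented:
import re
def parse_course_slug(slug: str) -> dict:
    """Best-effort split of an OCW course slug into number, name, and semester."""
    # Assumed grammar: {number}-{name-words}-{season}-{year}
    m = re.match(
        r"^(\d+-\d+[a-z]?)-(.+)-(spring|summer|fall|january-iap)-(\d{4})$",
        slug,
    )
    if not m:
        return {"slug": slug}
    return {
        "course_number": m.group(1).replace("-", ".", 1),
        "name": m.group(2).replace("-", " "),
        "semester": f"{m.group(3).title()} {m.group(4)}",
    }
print(parse_course_slug("6-006-introduction-to-algorithms-fall-2011"))
# {'course_number': '6.006', 'name': 'introduction to algorithms', 'semester': 'Fall 2011'}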
Scraping the Course Catalog
import httpx
from bs4 import BeautifulSoup
import json
import time
import re
import os
import hashlib  # stable fallback filenames in download_material
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/128.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
}
BASE_URL = "https://ocw.mit.edu"
# Common OCW department slugs for filtering
DEPARTMENTS = [
"electrical-engineering-and-computer-science",
"mathematics",
"physics",
"chemistry",
"biology",
"economics",
"management",
"mechanical-engineering",
"architecture",
"linguistics-and-philosophy",
]
def scrape_search_page(
department: str = "",
features: list = None,
level: str = "",
page: int = 1,
) -> list[dict]:
"""
Scrape one page of OCW search results.
features: list of values like "lecture-videos", "problem-sets", "exams"
level: "undergraduate" or "graduate"
"""
url = f"{BASE_URL}/search"
params = {"page": page}
if department:
params["d"] = department
if level:
params["l"] = level
if features:
params["f"] = ",".join(features)
resp = httpx.get(url, params=params, headers=HEADERS, follow_redirects=True, timeout=20)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
courses = []
# OCW search results use a card-based layout
for card in soup.select("[class*='course-card'], [class*='search-result'], article.course"):
course = {}
# Title
title_el = card.select_one("h3, h2, [class*='title']")
course["title"] = title_el.get_text(strip=True) if title_el else ""
# URL
link_el = card.select_one("a[href]")
if link_el:
href = link_el.get("href", "")
course["url"] = href if href.startswith("http") else BASE_URL + href
else:
course["url"] = ""
# Course number / department
dept_el = card.select_one("[class*='department'], [class*='course-number'], .subject")
course["department"] = dept_el.get_text(strip=True) if dept_el else ""
# Description snippet
desc_el = card.select_one("p, [class*='description'], [class*='teaser']")
course["description"] = desc_el.get_text(strip=True)[:300] if desc_el else ""
# Feature badges (lecture videos, problem sets, etc.)
badges = card.select("[class*='badge'], [class*='feature'], [class*='tag']")
course["features"] = [b.get_text(strip=True) for b in badges if b.get_text(strip=True)]
if course["title"] and course["url"]:
courses.append(course)
return courses
def scrape_course_listing(
department: str = "",
features: list = None,
max_pages: int = 100,
) -> list[dict]:
"""Paginate through OCW course listings and collect all results."""
all_courses = []
for page in range(1, max_pages + 1):
batch = scrape_search_page(department=department, features=features, page=page)
if not batch:
print(f" No results on page {page}, stopping.")
break
all_courses.extend(batch)
print(f" Page {page}: {len(batch)} courses (total: {len(all_courses)})")
time.sleep(2.0)
return all_courses
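For example, to collect every mathematics course that advertises problem sets and dump the list to JSON (the "problem-sets" feature value here is an assumption; check the live search filters for the exact slugs):
math_courses = scrape_course_listing(
    department="mathematics",
    features=["problem-sets"],  # assumed filter value; verify against the site
)
with open("math_courses.json", "w") as f:
    json.dump(math_courses, f, indent=2)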
Extracting Individual Course Pages
Once you have a list of course URLs, scrape each course's detail page:
def scrape_course_page(course_url: str) -> dict:
"""Scrape all metadata and material links from a single OCW course page."""
resp = httpx.get(course_url, headers=HEADERS, follow_redirects=True, timeout=20)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
course = {
"url": course_url,
"title": "",
"course_number": "",
"instructor": [],
"semester": "",
"level": "",
"topics": [],
"description": "",
"features": [],
"materials": [],
}
# Title
h1 = soup.select_one("h1, [class*='course-title']")
course["title"] = h1.get_text(strip=True) if h1 else ""
# Course number (parse from URL or page)
number_el = soup.select_one("[class*='course-number'], [class*='subject-id']")
if number_el:
course["course_number"] = number_el.get_text(strip=True)
    else:
        # Extract from the URL slug, e.g. .../6-006-introduction-... -> 6.006
        # (the number may not directly follow /courses/ when a department
        # segment is present, so search anywhere in the path)
        match = re.search(r"/(\d+-\d+)-", course_url)
        if match:
            course["course_number"] = match.group(1).replace("-", ".", 1)
# Instructors
for inst_el in soup.select("[class*='instructor'], [itemprop='instructor']"):
name = inst_el.get_text(strip=True)
if name and len(name) > 2:
course["instructor"].append(name)
# Metadata table (semester, level, etc.)
for row in soup.select("dl dt, [class*='metadata'] dt, [class*='course-info'] dt"):
label = row.get_text(strip=True).lower().rstrip(":")
value_el = row.find_next_sibling("dd")
if not value_el:
continue
value = value_el.get_text(strip=True)
if "semester" in label or "term" in label or "as taught" in label:
course["semester"] = value
elif "level" in label:
course["level"] = value
# Course description
desc_el = soup.select_one("[class*='description'], [itemprop='description'], .lead")
if desc_el:
course["description"] = desc_el.get_text(strip=True)[:1000]
# Topics / tags
for tag in soup.select("[class*='topic'], [class*='tag'], [rel*='tag']"):
text = tag.get_text(strip=True)
if text and len(text) < 80:
course["topics"].append(text)
course["topics"] = list(set(course["topics"]))
# Material links
for a in soup.select("a[href]"):
href = a.get("href", "")
text = a.get_text(strip=True)
full_url = href if href.startswith("http") else BASE_URL + href
material_type = classify_material_link(href, text)
if material_type and text and len(text) > 3:
course["materials"].append({
"title": text[:200],
"url": full_url,
"type": material_type,
})
# Deduplicate materials by URL
seen_urls = set()
unique_materials = []
for m in course["materials"]:
if m["url"] not in seen_urls:
seen_urls.add(m["url"])
unique_materials.append(m)
course["materials"] = unique_materials
return course
def classify_material_link(href: str, text: str) -> str | None:
"""Classify a link as a specific material type or return None to ignore it."""
href_lower = href.lower()
text_lower = text.lower()
if href_lower.endswith(".pdf"):
if any(k in href_lower or k in text_lower for k in ["lecture", "notes", "slides"]):
return "lecture_notes"
elif any(k in href_lower or k in text_lower for k in ["problem", "pset", "assignment", "hw"]):
return "problem_set"
elif any(k in href_lower or k in text_lower for k in ["exam", "midterm", "final", "quiz"]):
return "exam"
elif any(k in href_lower or k in text_lower for k in ["solution", "answer", "key"]):
return "solution"
return "pdf"
elif "/lecture-notes/" in href_lower or "/lecture/" in href_lower:
return "lecture_notes"
elif "/assignments/" in href_lower or "/problem-sets/" in href_lower:
return "problem_set"
elif "/exams/" in href_lower:
return "exam"
elif "/video-lectures/" in href_lower:
return "video"
elif "/readings/" in href_lower:
return "reading"
elif "/recitations/" in href_lower:
return "recitation"
elif "/projects/" in href_lower:
return "project"
return None
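A quick smoke test against a single course, with the URL assembled from the patterns in the Site Structure section (adjust the path if the live structure differs):
course = scrape_course_page(
    BASE_URL
    + "/courses/electrical-engineering-and-computer-science"
    + "/6-006-introduction-to-algorithms-fall-2011/"
)
print(course["title"], course["course_number"], course["semester"])
print(f"{len(course['materials'])} material links found")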
Scraping Sub-Section Pages
OCW organizes materials within each course into dedicated sub-sections. Scraping these pages gives you the actual file listings:
SECTION_SLUGS = [
"lecture-notes",
"assignments",
"exams",
"video-lectures",
"readings",
"recitations",
"projects",
"tools",
]
# Map section slugs to the type vocabulary used by classify_material_link,
# so downstream filters (download types, has_* flags) stay consistent
SECTION_TYPES = {
    "lecture-notes": "lecture_notes",
    "assignments": "problem_set",
    "exams": "exam",
    "video-lectures": "video",
    "readings": "reading",
    "recitations": "recitation",
    "projects": "project",
    "tools": "tool",
}
def scrape_course_section(
course_url: str,
section: str,
) -> list[dict]:
"""Scrape a specific section page of a course for material links."""
section_url = course_url.rstrip("/") + f"/{section}/"
resp = httpx.get(section_url, headers=HEADERS, follow_redirects=True, timeout=15)
if resp.status_code == 404:
return []
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
materials = []
for a in soup.select("a[href]"):
href = a.get("href", "")
text = a.get_text(strip=True)
if not text or len(text) < 3:
continue
full_url = href if href.startswith("http") else BASE_URL + href
        # Only collect actual file links (PDFs) or sub-pages under this
        # course's own URL prefix (a substring test against the department
        # slug also matched unrelated links)
        if href.endswith(".pdf") or full_url.startswith(course_url.rstrip("/")):
materials.append({
"title": text[:200],
"url": full_url,
"type": section.rstrip("s"), # "lecture-notes" -> "lecture-note"
"section": section,
})
return materials
def get_all_course_materials(course_url: str) -> dict[str, list]:
"""
Systematically collect material links from all sections of a course.
Returns dict keyed by section name.
"""
materials_by_section = {}
for section in SECTION_SLUGS:
materials = scrape_course_section(course_url, section)
if materials:
materials_by_section[section] = materials
print(f" {section}: {len(materials)} item(s)")
time.sleep(1.0)
return materials_by_section
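Reusing the same example course, the per-section sweep looks like this:
course_url = (
    BASE_URL
    + "/courses/electrical-engineering-and-computer-science"
    + "/6-006-introduction-to-algorithms-fall-2011"
)
by_section = get_all_course_materials(course_url)
total = sum(len(items) for items in by_section.values())
print(f"{total} materials across {len(by_section)} section(s)")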
Extracting Video Lecture Metadata
OCW video lectures are hosted on YouTube. The video pages on OCW link out to YouTube, and you can extract the YouTube IDs to pull transcripts:
import re
def extract_youtube_ids_from_page(page_url: str) -> list[str]:
"""Extract YouTube video IDs from an OCW video lectures page."""
resp = httpx.get(page_url, headers=HEADERS, follow_redirects=True, timeout=15)
if resp.status_code != 200:
return []
soup = BeautifulSoup(resp.text, "lxml")
yt_ids = set()
# Look for YouTube embed URLs and links
for el in soup.select("[src*='youtube'], [href*='youtube'], [href*='youtu.be']"):
src = el.get("src", "") + el.get("href", "")
match = re.search(
r"(?:youtube\.com/(?:embed/|watch\?v=)|youtu\.be/)([A-Za-z0-9_-]{11})",
src,
)
if match:
yt_ids.add(match.group(1))
return list(yt_ids)
def get_youtube_transcript(video_id: str) -> str | None:
"""
Fetch transcript for a YouTube video using the unofficial caption API.
Returns plain text transcript or None if unavailable.
"""
    # YouTube's public timedtext endpoint (unofficial and increasingly
    # restricted; expect empty responses for many videos)
    list_url = f"https://www.youtube.com/api/timedtext?v={video_id}&type=list"
try:
resp = httpx.get(list_url, headers=HEADERS, timeout=10)
if resp.status_code != 200:
return None
# Parse available tracks
from xml.etree import ElementTree as ET
root = ET.fromstring(resp.text)
        # Prefer an English caption track when one is available
        tracks = root.findall("track")
if not tracks:
return None
en_tracks = [t for t in tracks if t.get("lang_code", "").startswith("en")]
track = en_tracks[0] if en_tracks else tracks[0]
lang_code = track.get("lang_code", "en")
# Fetch the actual transcript
transcript_url = (
f"https://www.youtube.com/api/timedtext"
f"?v={video_id}&lang={lang_code}&fmt=vtt"
)
resp = httpx.get(transcript_url, headers=HEADERS, timeout=15)
if resp.status_code != 200:
return None
# Strip VTT formatting to get plain text
lines = resp.text.split("\n")
text_lines = []
for line in lines:
line = line.strip()
# Skip VTT headers, timestamps, and empty lines
if (
line
and not line.startswith("WEBVTT")
and not re.match(r"^\d{2}:\d{2}", line)
and not re.match(r"^NOTE", line)
and not re.match(r"^\d+$", line)
):
# Remove HTML tags
clean = re.sub(r"<[^>]+>", "", line)
if clean:
text_lines.append(clean)
# Deduplicate adjacent identical lines (common in auto-captions)
deduped = []
for line in text_lines:
if not deduped or line != deduped[-1]:
deduped.append(line)
return " ".join(deduped) if deduped else None
except Exception as e:
print(f" Transcript error for {video_id}: {e}")
return None
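Tying the two together builds a small transcript corpus for one course (course_url as in the earlier example). This is a sketch: since the timedtext endpoint is unofficial and frequently returns nothing, a maintained library such as youtube-transcript-api is a sensible fallback.
video_page = course_url.rstrip("/") + "/video-lectures/"
transcripts = {}
for vid in extract_youtube_ids_from_page(video_page):
    text = get_youtube_transcript(vid)
    if text:
        transcripts[vid] = text
    time.sleep(1.0)  # be gentle with YouTube's endpoints
print(f"Transcripts fetched for {len(transcripts)} video(s)")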
Anti-Bot Measures on OCW
MIT OCW is substantially more permissive than commercial scraping targets — the site exists to be open. But practical protections are still present:
Cloudflare. OCW sits behind Cloudflare's CDN/DDoS protection. Aggressive scraping from a single IP triggers Cloudflare challenges. For catalog browsing at reasonable rates (one request every 2-3 seconds), you typically don't hit challenges. For bulk downloads at high rates, you will.
Rate limiting at the server level. If you send more than ~30-40 requests per minute from a single IP, OCW's servers will start throttling. You'll see response times increase before outright 429s.
Large-file download detection. Downloading hundreds of PDFs from one IP in a short window is flagged by Cloudflare's abuse detection. This is especially relevant for courses with large video files.
YouTube CDN throttling. YouTube's CDN applies per-IP bandwidth limits for video content. This doesn't affect transcript API calls, but does affect direct video downloads.
For catalog scraping and individual course page scraping, you usually don't need proxies — the rates required are low enough to stay under Cloudflare's thresholds. For bulk material downloads at scale, rotating IPs help. ThorData's residential proxies are a good fit because residential IPs are treated like regular users by Cloudflare's edge rules:
def polite_get(
url: str,
max_retries: int = 4,
base_delay: float = 2.0,
proxy_url: str = None,
) -> httpx.Response | None:
"""
GET with exponential backoff and optional proxy support.
Handles Cloudflare rate limiting and server errors gracefully.
"""
client_kwargs = {"headers": HEADERS, "follow_redirects": True, "timeout": 25}
    if proxy_url:
        # httpx >= 0.26 takes a single `proxy` URL; the old `proxies` dict
        # argument was removed in later releases
        client_kwargs["proxy"] = proxy_url
for attempt in range(max_retries):
try:
with httpx.Client(**client_kwargs) as client:
resp = client.get(url)
if resp.status_code == 200:
return resp
elif resp.status_code == 429:
wait = base_delay * (2 ** attempt)
print(f" Rate limited (attempt {attempt + 1}). Waiting {wait:.0f}s...")
time.sleep(wait)
elif resp.status_code == 403:
# Cloudflare challenge
wait = 30 * (attempt + 1)
print(f" 403 Cloudflare challenge. Waiting {wait}s...")
time.sleep(wait)
elif resp.status_code in (500, 502, 503, 504):
wait = base_delay * (attempt + 1)
print(f" {resp.status_code} server error. Waiting {wait:.0f}s...")
time.sleep(wait)
else:
return resp
except httpx.TimeoutException:
wait = base_delay * (attempt + 1)
print(f" Timeout. Waiting {wait:.0f}s...")
time.sleep(wait)
except Exception as e:
print(f" Error: {e}")
if attempt < max_retries - 1:
time.sleep(base_delay)
return None
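Using it through a rotating residential endpoint just means passing the proxy URL; the host and credentials below are placeholders:
PROXY_URL = "http://username:password@proxy.example.com:8080"  # placeholder
resp = polite_get(BASE_URL + "/search", proxy_url=PROXY_URL)
if resp:
    print(resp.status_code, len(resp.content), "bytes")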
Downloading Course Materials
For offline archives, download the actual PDFs and documents:
def download_material(
url: str,
save_dir: str,
filename: str = None,
proxy_url: str = None,
) -> str | None:
"""
Download a single material file. Returns local path or None on failure.
"""
os.makedirs(save_dir, exist_ok=True)
if not filename:
filename = url.split("/")[-1].split("?")[0]
    if not filename or "." not in filename:
        # hash() is salted per process; use a stable digest for repeatable names
        filename = f"material_{hashlib.md5(url.encode()).hexdigest()[:12]}.pdf"
# Sanitize filename
filename = re.sub(r'[^\w\s\-.]', '_', filename)[:100]
filepath = os.path.join(save_dir, filename)
if os.path.exists(filepath) and os.path.getsize(filepath) > 0:
return filepath # Already downloaded
resp = polite_get(url, proxy_url=proxy_url)
if resp and resp.status_code == 200:
with open(filepath, "wb") as f:
f.write(resp.content)
return filepath
return None
def download_course_materials(
course: dict,
base_dir: str = "ocw_downloads",
material_types: list = None,
proxy_url: str = None,
) -> dict:
"""
Download all materials for a course.
material_types: list of types to download, e.g. ["lecture_notes", "problem_set", "exam"]
Pass None to download all types.
"""
# Create a safe directory name from the course title
safe_title = re.sub(r"[^\w\s-]", "", course.get("title", "unknown"))[:60].strip()
course_dir = os.path.join(base_dir, safe_title)
os.makedirs(course_dir, exist_ok=True)
# Save metadata
with open(os.path.join(course_dir, "metadata.json"), "w") as f:
meta = {k: v for k, v in course.items() if k != "materials"}
json.dump(meta, f, indent=2)
download_results = {"success": [], "failed": [], "skipped": []}
for material in course.get("materials", []):
m_type = material.get("type", "other")
if material_types and m_type not in material_types:
download_results["skipped"].append(material["url"])
continue
if not material["url"].endswith(".pdf") and "ocw.mit.edu" not in material["url"]:
download_results["skipped"].append(material["url"])
continue
type_dir = os.path.join(course_dir, m_type)
result = download_material(
material["url"],
save_dir=type_dir,
proxy_url=proxy_url,
)
if result:
download_results["success"].append(result)
else:
download_results["failed"].append(material["url"])
time.sleep(1.5) # Polite delay between downloads
print(
f" Downloaded {len(download_results['success'])} files "
f"({len(download_results['failed'])} failed, "
f"{len(download_results['skipped'])} skipped)"
)
return download_results
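Wiring the download helpers to a scraped course (course_url as in the earlier examples):
course = scrape_course_page(course_url)
results = download_course_materials(
    course,
    material_types=["problem_set", "solution"],  # skip videos and misc PDFs
)
print(len(results["success"]), "files on disk")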
Building a Course Material Index (SQLite)
For searchable metadata without downloading everything:
import sqlite3
def init_ocw_db(path: str = "ocw_catalog.db") -> sqlite3.Connection:
conn = sqlite3.connect(path)
conn.execute("""
CREATE TABLE IF NOT EXISTS courses (
url TEXT PRIMARY KEY,
title TEXT,
course_number TEXT,
instructors TEXT,
semester TEXT,
level TEXT,
department TEXT,
description TEXT,
topics TEXT,
material_count INTEGER DEFAULT 0,
has_video BOOLEAN DEFAULT 0,
has_problems BOOLEAN DEFAULT 0,
has_exams BOOLEAN DEFAULT 0,
scraped_at TEXT DEFAULT (datetime('now'))
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS materials (
id INTEGER PRIMARY KEY AUTOINCREMENT,
course_url TEXT NOT NULL,
title TEXT,
url TEXT UNIQUE,
type TEXT,
local_path TEXT,
downloaded INTEGER DEFAULT 0,
FOREIGN KEY (course_url) REFERENCES courses(url)
)
""")
conn.execute("""
CREATE VIRTUAL TABLE IF NOT EXISTS course_search
USING fts5(title, description, topics, course_number, content=courses)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_materials_type ON materials(type)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_materials_course ON materials(course_url)")
conn.commit()
return conn
def save_course(conn: sqlite3.Connection, course: dict, department: str = ""):
"""Save a scraped course and its materials to the database."""
materials = course.get("materials", [])
has_video = any(m["type"] == "video" for m in materials)
has_problems = any(m["type"] in ("problem_set", "assignment") for m in materials)
has_exams = any(m["type"] == "exam" for m in materials)
conn.execute("""
INSERT OR REPLACE INTO courses
(url, title, course_number, instructors, semester, level, department,
description, topics, material_count, has_video, has_problems, has_exams)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
course["url"],
course.get("title"),
course.get("course_number"),
json.dumps(course.get("instructor", [])),
course.get("semester"),
course.get("level"),
department,
course.get("description"),
json.dumps(course.get("topics", [])),
len(materials),
has_video, has_problems, has_exams,
))
for mat in materials:
conn.execute("""
INSERT OR IGNORE INTO materials (course_url, title, url, type)
VALUES (?, ?, ?, ?)
""", (course["url"], mat.get("title"), mat.get("url"), mat.get("type")))
conn.commit()
def search_courses(conn: sqlite3.Connection, query: str, limit: int = 20) -> list[dict]:
"""Full-text search across course titles and descriptions."""
rows = conn.execute("""
SELECT c.url, c.title, c.course_number, c.level, c.material_count
FROM courses c
JOIN course_search cs ON cs.rowid = c.rowid
WHERE course_search MATCH ?
ORDER BY rank
LIMIT ?
""", (query, limit)).fetchall()
return [
{
"url": r[0],
"title": r[1],
"course_number": r[2],
"level": r[3],
"materials": r[4],
}
for r in rows
]
def get_courses_with_problem_sets(conn: sqlite3.Connection) -> list[dict]:
"""Find all courses that have downloadable problem sets."""
rows = conn.execute("""
SELECT c.title, c.course_number, c.department, c.semester,
COUNT(m.id) as pset_count
FROM courses c
JOIN materials m ON m.course_url = c.url
WHERE m.type IN ('problem_set', 'solution')
GROUP BY c.url
HAVING pset_count > 0
ORDER BY pset_count DESC
""").fetchall()
return [
{
"title": r[0],
"course_number": r[1],
"department": r[2],
"semester": r[3],
"pset_count": r[4],
}
for r in rows
]
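Once courses are saved and the FTS index rebuilt (the pipeline below runs the rebuild before closing), queries are one-liners:
conn = sqlite3.connect("ocw_catalog.db")
for hit in search_courses(conn, "dynamic programming"):
    print(hit["course_number"], hit["title"])
for c in get_courses_with_problem_sets(conn)[:10]:
    print(c["course_number"], c["pset_count"], "problem sets / solutions")
conn.close()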
Full Pipeline
Putting it all together for a complete department scrape:
def scrape_department(
department: str,
db_path: str = "ocw_catalog.db",
download_materials: bool = False,
download_dir: str = "ocw_downloads",
material_types: list = None,
proxy_url: str = None,
max_courses: int = 200,
) -> dict:
"""
Full pipeline: catalog scrape, course detail extraction, optional download.
material_types: types to download if download_materials=True.
e.g. ["lecture_notes", "problem_set", "exam", "solution"]
Pass None to download all.
"""
conn = init_ocw_db(db_path)
print(f"Scraping course list for department: {department}")
course_list = scrape_course_listing(department=department, max_pages=20)
course_list = course_list[:max_courses]
print(f"Found {len(course_list)} courses")
results = {"scraped": 0, "with_materials": 0, "errors": 0}
for i, course_meta in enumerate(course_list):
print(f"\n[{i + 1}/{len(course_list)}] {course_meta['title'][:60]}")
resp = polite_get(course_meta["url"], proxy_url=proxy_url)
if not resp:
results["errors"] += 1
continue
        # scrape_course_page re-fetches the URL itself; polite_get above
        # serves as a retry-aware availability check first
        course = scrape_course_page(course_meta["url"])
course["department"] = department
# Get materials from dedicated section pages
section_materials = get_all_course_materials(course["url"])
for section, mats in section_materials.items():
course["materials"].extend(mats)
# Deduplicate materials again after adding section materials
seen = set()
unique = []
for m in course["materials"]:
if m["url"] not in seen:
seen.add(m["url"])
unique.append(m)
course["materials"] = unique
save_course(conn, course, department=department)
results["scraped"] += 1
if course["materials"]:
results["with_materials"] += 1
print(f" Materials: {len(course['materials'])} items")
if download_materials and course["materials"]:
download_course_materials(
course,
base_dir=download_dir,
material_types=material_types,
proxy_url=proxy_url,
)
# Polite delay between courses
time.sleep(3.0)
    # Populate the external-content FTS index so search_courses returns rows
    conn.execute("INSERT INTO course_search(course_search) VALUES('rebuild')")
    conn.commit()
    conn.close()
print(f"\nDone. Scraped: {results['scraped']}, With materials: {results['with_materials']}, Errors: {results['errors']}")
return results
# Example: scrape EECS department, download only problem sets and solutions
if __name__ == "__main__":
scrape_department(
department="electrical-engineering-and-computer-science",
download_materials=True,
material_types=["problem_set", "solution", "exam"],
max_courses=50,
)
What You Can Build
OCW data enables interesting downstream projects:
Problem set / solution datasets for AI tutoring. Problem sets and their solution keys are paired automatically — the solution PDF appears in the same course as the assignment PDF. This makes OCW one of the best open sources for training math and CS tutoring models. The data spans introductory to graduate difficulty across dozens of subjects.
Curriculum evolution tracking. OCW has courses archived across multiple semesters — "Introduction to Algorithms" from 2001, 2005, 2011, and 2020. Comparing reading lists, problem set topics, and lecture content across years reveals how curricula change in response to research progress and industry shifts.
Prerequisites and dependency graphs. The syllabus section of each course mentions prerequisites. Parsing these creates a dependency graph of the MIT curriculum, useful for building learning-path recommendations; a minimal extraction sketch follows.
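A minimal sketch, assuming prerequisites are cited by course number in the syllabus text (prose-only prerequisites like "multivariable calculus" would need extra handling):
def extract_prereqs(syllabus_text: str) -> list[str]:
    """Pull MIT-style course numbers (e.g. 6.006, 18.06) from prerequisite text."""
    return sorted(set(re.findall(r"\b(\d{1,2}\.[0-9S]\d{1,3})\b", syllabus_text)))
# Hypothetical syllabus snippet; edges point from prerequisite to course
syllabus = "Prerequisites: 6.042 and 6.0001, or permission of instructor."
edges = [(prereq, "6.006") for prereq in extract_prereqs(syllabus)]
print(edges)  # [('6.0001', '6.006'), ('6.042', '6.006')]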
Cross-university comparison. Combining OCW with similar data from Stanford Engineering Everywhere, Yale Open Courses, and edX/Coursera datasets lets you compare how different institutions teach the same subjects.
Full-text search index. OCW's own search is limited to course-level metadata. An index built from downloaded PDFs, lecture transcripts, and problem sets enables much richer semantic search — "find OCW problems involving dynamic programming on trees" becomes possible.
OCW is one of the friendliest scraping targets you'll encounter: CC-licensed content, relatively simple structure, and a mission explicitly aligned with open access. The main obligation is being polite with rate limits and not treating MIT's servers like a commercial CDN.