Scraping MIT OpenCourseWare: Course Catalog, Lectures & Problem Sets with Python (2026)
MIT OpenCourseWare (OCW) has published materials from over 2,500 MIT courses — for free. Lecture notes, problem sets, exams, video lectures, and reading lists covering everything from introductory calculus to graduate-level quantum field theory.
If you're building an educational search engine, studying how university curricula evolve, constructing AI training datasets from high-quality problem/solution pairs, or simply want offline access to MIT course materials, scraping OCW is a worthwhile project. The site is relatively scraper-friendly compared to commercial targets — the content is explicitly meant to be open — but there are still practical nuances to cover.
What's Available
Each course on OCW can include:
- Syllabus — course description, prerequisites, grading structure, course goals
- Lecture notes — usually PDFs, sometimes HTML pages with embedded LaTeX
- Problem sets — assignments, often with solution sets in separate PDFs
- Exams — midterms and finals, often with complete answer keys
- Video lectures — hosted on YouTube with auto-generated and human-edited transcripts
- Reading lists — required and recommended textbooks, with ISBNs
- Course calendar — week-by-week topic schedule with readings mapped to each session
- Projects — final project descriptions, rubrics, and sometimes example submissions
- Recitation notes — supplementary materials from TA sessions
Not every course has all of these. Older courses (pre-2005) may have only a syllabus and reading list. Well-resourced courses — 6.006 (Algorithms), 18.06 (Linear Algebra), 8.04 (Quantum Physics) — have the full package.
License
OCW content is published under Creative Commons BY-NC-SA 4.0. You can share and adapt the materials for non-commercial purposes with attribution. This is one of the rare scraping targets where the content creators explicitly want you to use the data — OpenCourseWare exists specifically to make this material accessible.
For building ML training datasets, research papers, or educational tools, this license gives you substantial freedom. The key restrictions are attribution (cite MIT OCW) and non-commercial (don't sell the raw materials without transformation).
Site Structure
OCW's URL structure follows a predictable pattern:
https://ocw.mit.edu/courses/{department}/{course-slug}/
https://ocw.mit.edu/courses/{department}/{course-slug}/lecture-notes/
https://ocw.mit.edu/courses/{department}/{course-slug}/assignments/
https://ocw.mit.edu/courses/{department}/{course-slug}/exams/
https://ocw.mit.edu/courses/{department}/{course-slug}/video-lectures/
Course slugs follow the pattern {course-number}-{course-name}-{semester}, e.g., 6-006-introduction-to-algorithms-fall-2011.
The site has a search/browse interface at https://ocw.mit.edu/search that supports filtering by department, level, and features.
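Note that OCW has been redesigned several times and URL details (department segments, section paths) have shifted between versions, so verify these patterns against live pages before hardcoding them. As a small illustration of working with slugs, the following sketch splits one back into its parts; the slug grammar it assumes is inferred from the pattern above rather than documented:
import re
def parse_course_slug(slug: str) -> dict:
    """Best-effort split of an OCW course slug into number, name, and semester."""
    # Assumed grammar: {number}-{name-words}-{season}-{year}
    m = re.match(
        r"^(\d+-\d+[a-z]?)-(.+)-(spring|summer|fall|january-iap)-(\d{4})$",
        slug,
    )
    if not m:
        return {"slug": slug}
    return {
        "course_number": m.group(1).replace("-", ".", 1),
        "name": m.group(2).replace("-", " "),
        "semester": f"{m.group(3).title()} {m.group(4)}",
    }
print(parse_course_slug("6-006-introduction-to-algorithms-fall-2011"))
# {'course_number': '6.006', 'name': 'introduction to algorithms', 'semester': 'Fall 2011'}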
Scraping the Course Catalog
import httpx
from bs4 import BeautifulSoup
import json
import time
import re
import os
import hashlib  # stable fallback filenames in download_material
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/128.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
}
BASE_URL = "https://ocw.mit.edu"
# Common OCW department slugs for filtering
DEPARTMENTS = [
"electrical-engineering-and-computer-science",
"mathematics",
"physics",
"chemistry",
"biology",
"economics",
"management",
"mechanical-engineering",
"architecture",
"linguistics-and-philosophy",
]
def scrape_search_page(
department: str = "",
features: list = None,
level: str = "",
page: int = 1,
) -> list[dict]:
"""
Scrape one page of OCW search results.
features: list of values like "lecture-videos", "problem-sets", "exams"
level: "undergraduate" or "graduate"
"""
url = f"{BASE_URL}/search"
params = {"page": page}
if department:
params["d"] = department
if level:
params["l"] = level
if features:
params["f"] = ",".join(features)
resp = httpx.get(url, params=params, headers=HEADERS, follow_redirects=True, timeout=20)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
courses = []
# OCW search results use a card-based layout
for card in soup.select("[class*='course-card'], [class*='search-result'], article.course"):
course = {}
# Title
title_el = card.select_one("h3, h2, [class*='title']")
course["title"] = title_el.get_text(strip=True) if title_el else ""
# URL
link_el = card.select_one("a[href]")
if link_el:
href = link_el.get("href", "")
course["url"] = href if href.startswith("http") else BASE_URL + href
else:
course["url"] = ""
# Course number / department
dept_el = card.select_one("[class*='department'], [class*='course-number'], .subject")
course["department"] = dept_el.get_text(strip=True) if dept_el else ""
# Description snippet
desc_el = card.select_one("p, [class*='description'], [class*='teaser']")
course["description"] = desc_el.get_text(strip=True)[:300] if desc_el else ""
# Feature badges (lecture videos, problem sets, etc.)
badges = card.select("[class*='badge'], [class*='feature'], [class*='tag']")
course["features"] = [b.get_text(strip=True) for b in badges if b.get_text(strip=True)]
if course["title"] and course["url"]:
courses.append(course)
return courses
def scrape_course_listing(
department: str = "",
features: list = None,
max_pages: int = 100,
) -> list[dict]:
"""Paginate through OCW course listings and collect all results."""
all_courses = []
for page in range(1, max_pages + 1):
batch = scrape_search_page(department=department, features=features, page=page)
if not batch:
print(f" No results on page {page}, stopping.")
break
all_courses.extend(batch)
print(f" Page {page}: {len(batch)} courses (total: {len(all_courses)})")
time.sleep(2.0)
return all_courses
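For example, to collect every mathematics course that advertises problem sets and dump the list to JSON (the "problem-sets" feature value here is an assumption; check the live search filters for the exact slugs):
math_courses = scrape_course_listing(
    department="mathematics",
    features=["problem-sets"],  # assumed filter value; verify against the site
)
with open("math_courses.json", "w") as f:
    json.dump(math_courses, f, indent=2)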
Extracting Individual Course Pages
Once you have a list of course URLs, scrape each course's detail page:
def scrape_course_page(course_url: str) -> dict:
"""Scrape all metadata and material links from a single OCW course page."""
resp = httpx.get(course_url, headers=HEADERS, follow_redirects=True, timeout=20)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
course = {
"url": course_url,
"title": "",
"course_number": "",
"instructor": [],
"semester": "",
"level": "",
"topics": [],
"description": "",
"features": [],
"materials": [],
}
# Title
h1 = soup.select_one("h1, [class*='course-title']")
course["title"] = h1.get_text(strip=True) if h1 else ""
# Course number (parse from URL or page)
number_el = soup.select_one("[class*='course-number'], [class*='subject-id']")
if number_el:
course["course_number"] = number_el.get_text(strip=True)
    else:
        # Extract from the URL slug, e.g. .../6-006-introduction-... -> 6.006
        # (the number may not directly follow /courses/ when a department
        # segment is present, so search anywhere in the path)
        match = re.search(r"/(\d+-\d+)-", course_url)
        if match:
            course["course_number"] = match.group(1).replace("-", ".", 1)
# Instructors
for inst_el in soup.select("[class*='instructor'], [itemprop='instructor']"):
name = inst_el.get_text(strip=True)
if name and len(name) > 2:
course["instructor"].append(name)
# Metadata table (semester, level, etc.)
for row in soup.select("dl dt, [class*='metadata'] dt, [class*='course-info'] dt"):
label = row.get_text(strip=True).lower().rstrip(":")
value_el = row.find_next_sibling("dd")
if not value_el:
continue
value = value_el.get_text(strip=True)
if "semester" in label or "term" in label or "as taught" in label:
course["semester"] = value
elif "level" in label:
course["level"] = value
# Course description
desc_el = soup.select_one("[class*='description'], [itemprop='description'], .lead")
if desc_el:
course["description"] = desc_el.get_text(strip=True)[:1000]
# Topics / tags
for tag in soup.select("[class*='topic'], [class*='tag'], [rel*='tag']"):
text = tag.get_text(strip=True)
if text and len(text) < 80:
course["topics"].append(text)
course["topics"] = list(set(course["topics"]))
# Material links
for a in soup.select("a[href]"):
href = a.get("href", "")
text = a.get_text(strip=True)
full_url = href if href.startswith("http") else BASE_URL + href
material_type = classify_material_link(href, text)
if material_type and text and len(text) > 3:
course["materials"].append({
"title": text[:200],
"url": full_url,
"type": material_type,
})
# Deduplicate materials by URL
seen_urls = set()
unique_materials = []
for m in course["materials"]:
if m["url"] not in seen_urls:
seen_urls.add(m["url"])
unique_materials.append(m)
course["materials"] = unique_materials
return course
def classify_material_link(href: str, text: str) -> str | None:
"""Classify a link as a specific material type or return None to ignore it."""
href_lower = href.lower()
text_lower = text.lower()
if href_lower.endswith(".pdf"):
if any(k in href_lower or k in text_lower for k in ["lecture", "notes", "slides"]):
return "lecture_notes"
elif any(k in href_lower or k in text_lower for k in ["problem", "pset", "assignment", "hw"]):
return "problem_set"
elif any(k in href_lower or k in text_lower for k in ["exam", "midterm", "final", "quiz"]):
return "exam"
elif any(k in href_lower or k in text_lower for k in ["solution", "answer", "key"]):
return "solution"
return "pdf"
elif "/lecture-notes/" in href_lower or "/lecture/" in href_lower:
return "lecture_notes"
elif "/assignments/" in href_lower or "/problem-sets/" in href_lower:
return "problem_set"
elif "/exams/" in href_lower:
return "exam"
elif "/video-lectures/" in href_lower:
return "video"
elif "/readings/" in href_lower:
return "reading"
elif "/recitations/" in href_lower:
return "recitation"
elif "/projects/" in href_lower:
return "project"
return None
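A quick smoke test against a single course, with the URL assembled from the patterns in the Site Structure section (adjust the path if the live structure differs):
course = scrape_course_page(
    BASE_URL
    + "/courses/electrical-engineering-and-computer-science"
    + "/6-006-introduction-to-algorithms-fall-2011/"
)
print(course["title"], course["course_number"], course["semester"])
print(f"{len(course['materials'])} material links found")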
Scraping Sub-Section Pages
OCW organizes materials within each course into dedicated sub-sections. Scraping these pages gives you the actual file listings:
SECTION_SLUGS = [
"lecture-notes",
"assignments",
"exams",
"video-lectures",
"readings",
"recitations",
"projects",
"tools",
]
# Map section slugs to the type vocabulary used by classify_material_link,
# so downstream filters (download types, has_* flags) stay consistent
SECTION_TYPES = {
    "lecture-notes": "lecture_notes",
    "assignments": "problem_set",
    "exams": "exam",
    "video-lectures": "video",
    "readings": "reading",
    "recitations": "recitation",
    "projects": "project",
    "tools": "tool",
}
def scrape_course_section(
course_url: str,
section: str,
) -> list[dict]:
"""Scrape a specific section page of a course for material links."""
section_url = course_url.rstrip("/") + f"/{section}/"
resp = httpx.get(section_url, headers=HEADERS, follow_redirects=True, timeout=15)
if resp.status_code == 404:
return []
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
materials = []
for a in soup.select("a[href]"):
href = a.get("href", "")
text = a.get_text(strip=True)
if not text or len(text) < 3:
continue
full_url = href if href.startswith("http") else BASE_URL + href
        # Only collect actual file links (PDFs) or sub-pages under this
        # course's own URL prefix (a substring test against the department
        # slug also matched unrelated links)
        if href.endswith(".pdf") or full_url.startswith(course_url.rstrip("/")):
materials.append({
"title": text[:200],
"url": full_url,
"type": section.rstrip("s"), # "lecture-notes" -> "lecture-note"
"section": section,
})
return materials
def get_all_course_materials(course_url: str) -> dict[str, list]:
"""
Systematically collect material links from all sections of a course.
Returns dict keyed by section name.
"""
materials_by_section = {}
for section in SECTION_SLUGS:
materials = scrape_course_section(course_url, section)
if materials:
materials_by_section[section] = materials
print(f" {section}: {len(materials)} item(s)")
time.sleep(1.0)
return materials_by_section
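Reusing the same example course, the per-section sweep looks like this:
course_url = (
    BASE_URL
    + "/courses/electrical-engineering-and-computer-science"
    + "/6-006-introduction-to-algorithms-fall-2011"
)
by_section = get_all_course_materials(course_url)
total = sum(len(items) for items in by_section.values())
print(f"{total} materials across {len(by_section)} section(s)")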
Extracting Video Lecture Metadata
OCW video lectures are hosted on YouTube. The video pages on OCW link out to YouTube, and you can extract the YouTube IDs to pull transcripts:
import re
def extract_youtube_ids_from_page(page_url: str) -> list[str]:
"""Extract YouTube video IDs from an OCW video lectures page."""
resp = httpx.get(page_url, headers=HEADERS, follow_redirects=True, timeout=15)
if resp.status_code != 200:
return []
soup = BeautifulSoup(resp.text, "lxml")
yt_ids = set()
# Look for YouTube embed URLs and links
for el in soup.select("[src*='youtube'], [href*='youtube'], [href*='youtu.be']"):
src = el.get("src", "") + el.get("href", "")
match = re.search(
r"(?:youtube\.com/(?:embed/|watch\?v=)|youtu\.be/)([A-Za-z0-9_-]{11})",
src,
)
if match:
yt_ids.add(match.group(1))
return list(yt_ids)
def get_youtube_transcript(video_id: str) -> str | None:
"""
Fetch transcript for a YouTube video using the unofficial caption API.
Returns plain text transcript or None if unavailable.
"""
    # YouTube's public timedtext endpoint (unofficial and increasingly
    # restricted; expect empty responses for many videos)
    list_url = f"https://www.youtube.com/api/timedtext?v={video_id}&type=list"
try:
resp = httpx.get(list_url, headers=HEADERS, timeout=10)
if resp.status_code != 200:
return None
# Parse available tracks
from xml.etree import ElementTree as ET
root = ET.fromstring(resp.text)
        # Prefer an English caption track when one is available
        tracks = root.findall("track")
if not tracks:
return None
en_tracks = [t for t in tracks if t.get("lang_code", "").startswith("en")]
track = en_tracks[0] if en_tracks else tracks[0]
lang_code = track.get("lang_code", "en")
# Fetch the actual transcript
transcript_url = (
f"https://www.youtube.com/api/timedtext"
f"?v={video_id}&lang={lang_code}&fmt=vtt"
)
resp = httpx.get(transcript_url, headers=HEADERS, timeout=15)
if resp.status_code != 200:
return None
# Strip VTT formatting to get plain text
lines = resp.text.split("\n")
text_lines = []
for line in lines:
line = line.strip()
# Skip VTT headers, timestamps, and empty lines
if (
line
and not line.startswith("WEBVTT")
and not re.match(r"^\d{2}:\d{2}", line)
and not re.match(r"^NOTE", line)
and not re.match(r"^\d+$", line)
):
# Remove HTML tags
clean = re.sub(r"<[^>]+>", "", line)
if clean:
text_lines.append(clean)
# Deduplicate adjacent identical lines (common in auto-captions)
deduped = []
for line in text_lines:
if not deduped or line != deduped[-1]:
deduped.append(line)
return " ".join(deduped) if deduped else None
except Exception as e:
print(f" Transcript error for {video_id}: {e}")
return None
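Tying the two together builds a small transcript corpus for one course (course_url as in the earlier example). This is a sketch: since the timedtext endpoint is unofficial and frequently returns nothing, a maintained library such as youtube-transcript-api is a sensible fallback.
video_page = course_url.rstrip("/") + "/video-lectures/"
transcripts = {}
for vid in extract_youtube_ids_from_page(video_page):
    text = get_youtube_transcript(vid)
    if text:
        transcripts[vid] = text
    time.sleep(1.0)  # be gentle with YouTube's endpoints
print(f"Transcripts fetched for {len(transcripts)} video(s)")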
Anti-Bot Measures on OCW
MIT OCW is substantially more permissive than commercial scraping targets — the site exists to be open. But practical protections are still present:
Cloudflare. OCW sits behind Cloudflare's CDN/DDoS protection. Aggressive scraping from a single IP triggers Cloudflare challenges. For catalog browsing at reasonable rates (one request every 2-3 seconds), you typically don't hit challenges. For bulk downloads at high rates, you will.
Rate limiting at the server level. If you send more than ~30-40 requests per minute from a single IP, OCW's servers will start throttling. You'll see response times increase before outright 429s.
Large-file download detection. Downloading hundreds of PDFs from one IP in a short window is flagged by Cloudflare's abuse detection. This is especially relevant for courses with large video files.
YouTube CDN throttling. YouTube's CDN applies per-IP bandwidth limits for video content. This doesn't affect transcript API calls, but does affect direct video downloads.
For catalog scraping and individual course page scraping, you usually don't need proxies — the rates required are low enough to stay under Cloudflare's thresholds. For bulk material downloads at scale, rotating IPs help. ThorData's residential proxies are a good fit because residential IPs are treated like regular users by Cloudflare's edge rules:
def polite_get(
url: str,
max_retries: int = 4,
base_delay: float = 2.0,
proxy_url: str = None,
) -> httpx.Response | None:
"""
GET with exponential backoff and optional proxy support.
Handles Cloudflare rate limiting and server errors gracefully.
"""
client_kwargs = {"headers": HEADERS, "follow_redirects": True, "timeout": 25}
    if proxy_url:
        # httpx >= 0.26 takes a single `proxy` URL; the old `proxies` dict
        # argument was removed in later releases
        client_kwargs["proxy"] = proxy_url
for attempt in range(max_retries):
try:
with httpx.Client(**client_kwargs) as client:
resp = client.get(url)
if resp.status_code == 200:
return resp
elif resp.status_code == 429:
wait = base_delay * (2 ** attempt)
print(f" Rate limited (attempt {attempt + 1}). Waiting {wait:.0f}s...")
time.sleep(wait)
elif resp.status_code == 403:
# Cloudflare challenge
wait = 30 * (attempt + 1)
print(f" 403 Cloudflare challenge. Waiting {wait}s...")
time.sleep(wait)
elif resp.status_code in (500, 502, 503, 504):
wait = base_delay * (attempt + 1)
print(f" {resp.status_code} server error. Waiting {wait:.0f}s...")
time.sleep(wait)
else:
return resp
except httpx.TimeoutException:
wait = base_delay * (attempt + 1)
print(f" Timeout. Waiting {wait:.0f}s...")
time.sleep(wait)
except Exception as e:
print(f" Error: {e}")
if attempt < max_retries - 1:
time.sleep(base_delay)
return None
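Using it through a rotating residential endpoint just means passing the proxy URL; the host and credentials below are placeholders:
PROXY_URL = "http://username:password@proxy.example.com:8080"  # placeholder
resp = polite_get(BASE_URL + "/search", proxy_url=PROXY_URL)
if resp:
    print(resp.status_code, len(resp.content), "bytes")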
Downloading Course Materials
For offline archives, download the actual PDFs and documents:
def download_material(
url: str,
save_dir: str,
filename: str = None,
proxy_url: str = None,
) -> str | None:
"""
Download a single material file. Returns local path or None on failure.
"""
os.makedirs(save_dir, exist_ok=True)
if not filename:
filename = url.split("/")[-1].split("?")[0]
    if not filename or "." not in filename:
        # hash() is salted per process; use a stable digest for repeatable names
        filename = f"material_{hashlib.md5(url.encode()).hexdigest()[:12]}.pdf"
# Sanitize filename
filename = re.sub(r'[^\w\s\-.]', '_', filename)[:100]
filepath = os.path.join(save_dir, filename)
if os.path.exists(filepath) and os.path.getsize(filepath) > 0:
return filepath # Already downloaded
resp = polite_get(url, proxy_url=proxy_url)
if resp and resp.status_code == 200:
with open(filepath, "wb") as f:
f.write(resp.content)
return filepath
return None
def download_course_materials(
course: dict,
base_dir: str = "ocw_downloads",
material_types: list = None,
proxy_url: str = None,
) -> dict:
"""
Download all materials for a course.
material_types: list of types to download, e.g. ["lecture_notes", "problem_set", "exam"]
Pass None to download all types.
"""
# Create a safe directory name from the course title
safe_title = re.sub(r"[^\w\s-]", "", course.get("title", "unknown"))[:60].strip()
course_dir = os.path.join(base_dir, safe_title)
os.makedirs(course_dir, exist_ok=True)
# Save metadata
with open(os.path.join(course_dir, "metadata.json"), "w") as f:
meta = {k: v for k, v in course.items() if k != "materials"}
json.dump(meta, f, indent=2)
download_results = {"success": [], "failed": [], "skipped": []}
for material in course.get("materials", []):
m_type = material.get("type", "other")
if material_types and m_type not in material_types:
download_results["skipped"].append(material["url"])
continue
if not material["url"].endswith(".pdf") and "ocw.mit.edu" not in material["url"]:
download_results["skipped"].append(material["url"])
continue
type_dir = os.path.join(course_dir, m_type)
result = download_material(
material["url"],
save_dir=type_dir,
proxy_url=proxy_url,
)
if result:
download_results["success"].append(result)
else:
download_results["failed"].append(material["url"])
time.sleep(1.5) # Polite delay between downloads
print(
f" Downloaded {len(download_results['success'])} files "
f"({len(download_results['failed'])} failed, "
f"{len(download_results['skipped'])} skipped)"
)
return download_results
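Wiring the download helpers to a scraped course (course_url as in the earlier examples):
course = scrape_course_page(course_url)
results = download_course_materials(
    course,
    material_types=["problem_set", "solution"],  # skip videos and misc PDFs
)
print(len(results["success"]), "files on disk")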
Building a Course Material Index (SQLite)
For searchable metadata without downloading everything:
import sqlite3
def init_ocw_db(path: str = "ocw_catalog.db") -> sqlite3.Connection:
conn = sqlite3.connect(path)
conn.execute("""
CREATE TABLE IF NOT EXISTS courses (
url TEXT PRIMARY KEY,
title TEXT,
course_number TEXT,
instructors TEXT,
semester TEXT,
level TEXT,
department TEXT,
description TEXT,
topics TEXT,
material_count INTEGER DEFAULT 0,
has_video BOOLEAN DEFAULT 0,
has_problems BOOLEAN DEFAULT 0,
has_exams BOOLEAN DEFAULT 0,
scraped_at TEXT DEFAULT (datetime('now'))
)
""")
conn.execute("""
CREATE TABLE IF NOT EXISTS materials (
id INTEGER PRIMARY KEY AUTOINCREMENT,
course_url TEXT NOT NULL,
title TEXT,
url TEXT UNIQUE,
type TEXT,
local_path TEXT,
downloaded INTEGER DEFAULT 0,
FOREIGN KEY (course_url) REFERENCES courses(url)
)
""")
conn.execute("""
CREATE VIRTUAL TABLE IF NOT EXISTS course_search
USING fts5(title, description, topics, course_number, content=courses)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_materials_type ON materials(type)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_materials_course ON materials(course_url)")
conn.commit()
return conn
def save_course(conn: sqlite3.Connection, course: dict, department: str = ""):
"""Save a scraped course and its materials to the database."""
materials = course.get("materials", [])
has_video = any(m["type"] == "video" for m in materials)
has_problems = any(m["type"] in ("problem_set", "assignment") for m in materials)
has_exams = any(m["type"] == "exam" for m in materials)
conn.execute("""
INSERT OR REPLACE INTO courses
(url, title, course_number, instructors, semester, level, department,
description, topics, material_count, has_video, has_problems, has_exams)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
course["url"],
course.get("title"),
course.get("course_number"),
json.dumps(course.get("instructor", [])),
course.get("semester"),
course.get("level"),
department,
course.get("description"),
json.dumps(course.get("topics", [])),
len(materials),
has_video, has_problems, has_exams,
))
for mat in materials:
conn.execute("""
INSERT OR IGNORE INTO materials (course_url, title, url, type)
VALUES (?, ?, ?, ?)
""", (course["url"], mat.get("title"), mat.get("url"), mat.get("type")))
conn.commit()
def search_courses(conn: sqlite3.Connection, query: str, limit: int = 20) -> list[dict]:
"""Full-text search across course titles and descriptions."""
rows = conn.execute("""
SELECT c.url, c.title, c.course_number, c.level, c.material_count
FROM courses c
JOIN course_search cs ON cs.rowid = c.rowid
WHERE course_search MATCH ?
ORDER BY rank
LIMIT ?
""", (query, limit)).fetchall()
return [
{
"url": r[0],
"title": r[1],
"course_number": r[2],
"level": r[3],
"materials": r[4],
}
for r in rows
]
def get_courses_with_problem_sets(conn: sqlite3.Connection) -> list[dict]:
"""Find all courses that have downloadable problem sets."""
rows = conn.execute("""
SELECT c.title, c.course_number, c.department, c.semester,
COUNT(m.id) as pset_count
FROM courses c
JOIN materials m ON m.course_url = c.url
WHERE m.type IN ('problem_set', 'solution')
GROUP BY c.url
HAVING pset_count > 0
ORDER BY pset_count DESC
""").fetchall()
return [
{
"title": r[0],
"course_number": r[1],
"department": r[2],
"semester": r[3],
"pset_count": r[4],
}
for r in rows
]
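Once courses are saved and the FTS index rebuilt (the pipeline below runs the rebuild before closing), queries are one-liners:
conn = sqlite3.connect("ocw_catalog.db")
for hit in search_courses(conn, "dynamic programming"):
    print(hit["course_number"], hit["title"])
for c in get_courses_with_problem_sets(conn)[:10]:
    print(c["course_number"], c["pset_count"], "problem sets / solutions")
conn.close()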
Full Pipeline
Putting it all together for a complete department scrape:
def scrape_department(
department: str,
db_path: str = "ocw_catalog.db",
download_materials: bool = False,
download_dir: str = "ocw_downloads",
material_types: list = None,
proxy_url: str = None,
max_courses: int = 200,
) -> dict:
"""
Full pipeline: catalog scrape, course detail extraction, optional download.
material_types: types to download if download_materials=True.
e.g. ["lecture_notes", "problem_set", "exam", "solution"]
Pass None to download all.
"""
conn = init_ocw_db(db_path)
print(f"Scraping course list for department: {department}")
course_list = scrape_course_listing(department=department, max_pages=20)
course_list = course_list[:max_courses]
print(f"Found {len(course_list)} courses")
results = {"scraped": 0, "with_materials": 0, "errors": 0}
for i, course_meta in enumerate(course_list):
print(f"\n[{i + 1}/{len(course_list)}] {course_meta['title'][:60]}")
resp = polite_get(course_meta["url"], proxy_url=proxy_url)
if not resp:
results["errors"] += 1
continue
        # scrape_course_page re-fetches the URL itself; polite_get above
        # serves as a retry-aware availability check first
        course = scrape_course_page(course_meta["url"])
course["department"] = department
# Get materials from dedicated section pages
section_materials = get_all_course_materials(course["url"])
for section, mats in section_materials.items():
course["materials"].extend(mats)
# Deduplicate materials again after adding section materials
seen = set()
unique = []
for m in course["materials"]:
if m["url"] not in seen:
seen.add(m["url"])
unique.append(m)
course["materials"] = unique
save_course(conn, course, department=department)
results["scraped"] += 1
if course["materials"]:
results["with_materials"] += 1
print(f" Materials: {len(course['materials'])} items")
if download_materials and course["materials"]:
download_course_materials(
course,
base_dir=download_dir,
material_types=material_types,
proxy_url=proxy_url,
)
# Polite delay between courses
time.sleep(3.0)
    # Populate the external-content FTS index so search_courses returns rows
    conn.execute("INSERT INTO course_search(course_search) VALUES('rebuild')")
    conn.commit()
    conn.close()
print(f"\nDone. Scraped: {results['scraped']}, With materials: {results['with_materials']}, Errors: {results['errors']}")
return results
# Example: scrape EECS department, download only problem sets and solutions
if __name__ == "__main__":
scrape_department(
department="electrical-engineering-and-computer-science",
download_materials=True,
material_types=["problem_set", "solution", "exam"],
max_courses=50,
)
What You Can Build
OCW data enables interesting downstream projects:
Problem set / solution datasets for AI tutoring. Problem sets and their solution keys are paired automatically — the solution PDF appears in the same course as the assignment PDF. This makes OCW one of the best open sources for training math and CS tutoring models. The data spans introductory to graduate difficulty across dozens of subjects.
Curriculum evolution tracking. OCW has courses archived across multiple semesters — "Introduction to Algorithms" from 2001, 2005, 2011, and 2020. Comparing reading lists, problem set topics, and lecture content across years reveals how curricula change in response to research progress and industry shifts.
Prerequisites and dependency graphs. The syllabus section of each course mentions prerequisites. Parsing these creates a dependency graph of the MIT curriculum, useful for building learning-path recommendations; a minimal extraction sketch follows.
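A minimal sketch, assuming prerequisites are cited by course number in the syllabus text (prose-only prerequisites like "multivariable calculus" would need extra handling):
def extract_prereqs(syllabus_text: str) -> list[str]:
    """Pull MIT-style course numbers (e.g. 6.006, 18.06) from prerequisite text."""
    return sorted(set(re.findall(r"\b(\d{1,2}\.[0-9S]\d{1,3})\b", syllabus_text)))
# Hypothetical syllabus snippet; edges point from prerequisite to course
syllabus = "Prerequisites: 6.042 and 6.0001, or permission of instructor."
edges = [(prereq, "6.006") for prereq in extract_prereqs(syllabus)]
print(edges)  # [('6.0001', '6.006'), ('6.042', '6.006')]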
Cross-university comparison. Combining OCW with similar data from Stanford Engineering Everywhere, Yale Open Courses, and edX/Coursera datasets lets you compare how different institutions teach the same subjects.
Full-text search index. OCW's own search is limited to course-level metadata. An index built from downloaded PDFs, lecture transcripts, and problem sets enables much richer semantic search — "find OCW problems involving dynamic programming on trees" becomes possible.
OCW is one of the friendliest scraping targets you'll encounter: CC-licensed content, relatively simple structure, and a mission explicitly aligned with open access. The main obligation is being polite with rate limits and not treating MIT's servers like a commercial CDN.