Scraping Skillshare Course Data and Instructor Stats with Python (2026)

2026-04-09 ["skillshare" "web scraping" "python" "playwright" "education"]

Scraping Skillshare Course Data and Instructor Stats with Python (2026)

Skillshare is an interesting scraping target because there's no public API. The platform shut down third-party API access years ago, which means if you want course metadata — enrollment numbers, instructor ratings, curriculum structure, class descriptions — you're building a scraper from scratch. This matters for market research in the online education space, competitive analysis if you're an instructor, or aggregating course catalogs across platforms.

The platform is a JavaScript-heavy single-page application, which rules out simple HTTP + BeautifulSoup approaches for most data points. But Skillshare also loads structured data from internal API endpoints that you can intercept, and certain pages still embed useful metadata in the initial HTML response. This guide covers both angles.

What Data Is Available

Through web scraping and API interception:

Class metadata — title, description, category, subcategory, duration, number of lessons, skill level, publication date, last updated date
Enrollment data — total student count per class, number of student projects submitted
Instructor profiles — name, bio, follower count, total students across all classes, number of classes published, average rating
Reviews — individual review text, star ratings, reviewer name, review date
Curriculum structure — lesson titles, lesson durations, section groupings
Trending/popular — classes ranked by popularity within categories, staff picks, featured collections

Skillshare doesn't expose view counts publicly (unlike YouTube or Dailymotion), but student enrollment counts serve as a reasonable proxy for class popularity.

Why Skillshare Data Is Valuable

The online education market exceeded $350 billion globally in 2026. Skillshare sits in the creative and professional development niche, competing with Udemy, LinkedIn Learning, and Coursera. The data it holds is commercially useful in multiple ways:

Creator economy research — Understand which niches attract the most students and what course formats work
Competitive benchmarking — Instructors can measure their enrollment and ratings against comparable classes
Market validation — Before creating a course, verify demand by looking at enrollment numbers in similar topics
Content aggregation — Build cross-platform learning catalogs that surface the best material regardless of platform
Academic research — Study self-directed learning patterns, topic popularity cycles, and the impact of video length on completion

Anti-Bot Measures

Skillshare's protections are moderate but layered:

Cloudflare. The entire site sits behind Cloudflare with JavaScript challenge pages. Direct HTTP requests to skillshare.com from known datacenter IPs get intercepted with a challenge page before any content loads. This is the first wall you'll hit.

Server-side rendering detection. Skillshare checks for browser capabilities during page load. Requests that don't execute JavaScript get a minimal HTML shell with no useful content. The actual course data loads via XHR calls after the JavaScript framework initializes.

Session-based rate limiting. Rapid sequential requests from the same session trigger soft blocks — pages start returning 429 or redirect to a "please wait" interstitial. This kicks in around 40-60 requests per minute from a single IP.

Login walls. Some course data (full reviews, detailed curriculum) is only accessible to logged-in users. Class pages show limited information to anonymous visitors — typically the first few lessons and a truncated description.

Bot fingerprinting. Skillshare uses a JavaScript-based fingerprinting library that checks for automation tells: navigator.webdriver, headless browser artifacts, inconsistent viewport/screen ratios, and missing browser APIs that real browsers have.

Setting Up Playwright

Playwright handles the Cloudflare challenge and JavaScript rendering:

pip install playwright playwright-stealth httpx
playwright install chromium

import asyncio
import json
import random
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

async def create_browser(proxy: str = None):
    """Create a stealth Playwright browser instance."""
    pw = await async_playwright().start()
    launch_args = {
        "headless": True,
        "args": [
            "--no-sandbox",
            "--disable-blink-features=AutomationControlled",
            "--disable-dev-shm-usage",
            "--disable-extensions",
            "--no-first-run",
        ],
    }
    if proxy:
        launch_args["proxy"] = {"server": proxy}

    browser = await pw.chromium.launch(**launch_args)
    context = await browser.new_context(
        viewport={"width": 1440, "height": 900},
        user_agent=(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/125.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="America/New_York",
    )
    page = await context.new_page()
    await stealth_async(page)
    return pw, browser, page


async def warm_session(page) -> None:
    """Visit the homepage to establish a valid session before scraping."""
    await page.goto("https://www.skillshare.com/en/", wait_until="domcontentloaded")
    await page.wait_for_timeout(random.randint(2000, 4000))
    # Simulate some browsing behavior
    await page.mouse.move(random.randint(100, 800), random.randint(100, 600))
    await page.wait_for_timeout(500)

Scraping Class Metadata

Skillshare class pages load structured data via internal API calls. You can intercept these, but the simpler approach for individual classes is to parse the page content after JavaScript rendering:

async def scrape_class(class_url: str, proxy: str = None) -> dict:
    """Scrape a single Skillshare class page for all available metadata."""
    pw, browser, page = await create_browser(proxy)
    api_data = {}

    # Intercept GraphQL responses
    async def capture_api(response):
        if "/api/graphql" in response.url and response.status == 200:
            try:
                body = await response.json()
                if "data" in body:
                    api_data.update(body["data"])
            except Exception:
                pass

    page.on("response", capture_api)

    await warm_session(page)
    await page.goto(class_url, wait_until="networkidle", timeout=60000)
    await page.wait_for_timeout(3000)

    result = {}

    # Title
    title_el = await page.query_selector("h1")
    if title_el:
        result["title"] = (await title_el.inner_text()).strip()

    # Description
    desc_el = await page.query_selector("[class*='description'], [class*='about-class']")
    if desc_el:
        result["description"] = (await desc_el.inner_text()).strip()[:2000]

    # Student count
    stats_els = await page.query_selector_all("[class*='classMeta'] span, [class*='stats'] span")
    for el in stats_els:
        text = await el.inner_text()
        if "student" in text.lower():
            result["students"] = text.strip()
        elif "project" in text.lower():
            result["projects"] = text.strip()

    # Category/skill level
    skill_el = await page.query_selector("[class*='skill-level'], [class*='level-pill']")
    if skill_el:
        result["skill_level"] = (await skill_el.inner_text()).strip()

    # Duration
    duration_el = await page.query_selector("[class*='duration'], [class*='runtime']")
    if duration_el:
        result["duration"] = (await duration_el.inner_text()).strip()

    # Lesson list
    lessons = []
    lesson_els = await page.query_selector_all(
        "[class*='lesson-item'], [class*='lessonItem'], [class*='unit-item']"
    )
    for el in lesson_els:
        title_inner = await el.query_selector("[class*='title'], span")
        if title_inner:
            lesson_title = (await title_inner.inner_text()).strip()
            if lesson_title:
                lessons.append(lesson_title)

    result["lessons"] = lessons
    result["lesson_count"] = len(lessons)
    result["url"] = class_url

    # Merge intercepted API data
    if api_data:
        result["api_data"] = api_data

    await browser.close()
    await pw.stop()
    return result


# Example usage
async def main():
    url = "https://www.skillshare.com/en/classes/Python-for-Data-Science/12345"
    data = await scrape_class(url)
    print(f"Title: {data.get('title')}")
    print(f"Students: {data.get('students')}")
    print(f"Lessons: {data.get('lesson_count')}")

asyncio.run(main())

Bulk Class Discovery

To find classes at scale, scrape Skillshare's browse and search pages:

async def search_classes(query: str, max_results: int = 100,
                          proxy: str = None) -> list[dict]:
    """Search Skillshare for classes matching a query."""
    pw, browser, page = await create_browser(proxy)
    classes = []
    page_num = 1

    await warm_session(page)

    while len(classes) < max_results:
        url = (f"https://www.skillshare.com/en/search"
               f"?query={query}&page={page_num}")
        await page.goto(url, wait_until="networkidle", timeout=60000)
        await page.wait_for_timeout(random.randint(2000, 4000))

        # Try multiple selector patterns
        cards = await page.query_selector_all(
            "[class*='card-inner'], [class*='classCard'], "
            "[class*='search-result-item']"
        )
        if not cards:
            break

        page_classes = []
        for card in cards:
            link_el = await card.query_selector("a[href*='/classes/']")
            title_el = await card.query_selector(
                "[class*='title'], h3, h4"
            )
            teacher_el = await card.query_selector(
                "[class*='teacher'], [class*='instructor'], "
                "[class*='author']"
            )
            student_el = await card.query_selector(
                "[class*='student'], [class*='enrollment']"
            )
            rating_el = await card.query_selector(
                "[class*='rating'], [aria-label*='star']"
            )

            href = await link_el.get_attribute("href") if link_el else None
            title = await title_el.inner_text() if title_el else None
            teacher = await teacher_el.inner_text() if teacher_el else None
            students = await student_el.inner_text() if student_el else None
            rating = await rating_el.get_attribute("aria-label") if rating_el else None

            if href and title:
                page_classes.append({
                    "url": (
                        f"https://www.skillshare.com{href}"
                        if href.startswith("/") else href
                    ),
                    "title": title.strip(),
                    "instructor": teacher.strip() if teacher else None,
                    "students_text": students.strip() if students else None,
                    "rating_text": rating,
                })

        if not page_classes:
            break

        classes.extend(page_classes)
        print(f"Page {page_num}: found {len(page_classes)} classes "
              f"(total: {len(classes)})")

        page_num += 1
        await asyncio.sleep(random.uniform(4.0, 7.0))

    await browser.close()
    await pw.stop()
    return classes[:max_results]

Instructor Profile Scraping

Instructor pages surface aggregate statistics across all their published classes:

async def scrape_instructor(profile_url: str, proxy: str = None) -> dict:
    """Scrape an instructor's Skillshare profile for stats and class list."""
    pw, browser, page = await create_browser(proxy)

    await warm_session(page)
    await page.goto(profile_url, wait_until="networkidle", timeout=60000)
    await page.wait_for_timeout(2000)

    result = {"url": profile_url}

    # Name
    name_el = await page.query_selector("h1, [class*='profileName'], [class*='profile-name']")
    if name_el:
        result["name"] = (await name_el.inner_text()).strip()

    # Stats counters (students, followers, classes)
    stat_els = await page.query_selector_all(
        "[class*='stat'], [class*='counter'], [class*='profile-stat']"
    )
    for el in stat_els:
        text = (await el.inner_text()).strip().lower()
        if "student" in text:
            result["total_students"] = text
        elif "class" in text:
            result["total_classes"] = text
        elif "follower" in text:
            result["followers"] = text

    # Bio
    bio_el = await page.query_selector(
        "[class*='bio'], [class*='description'], [class*='about']"
    )
    if bio_el:
        result["bio"] = (await bio_el.inner_text()).strip()[:1000]

    # List of classes taught
    class_links = await page.query_selector_all("a[href*='/classes/']")
    classes = []
    seen_urls = set()
    for link in class_links:
        href = await link.get_attribute("href")
        title_text = (await link.inner_text()).strip()
        if href and href not in seen_urls and title_text:
            seen_urls.add(href)
            classes.append({
                "url": (
                    f"https://www.skillshare.com{href}"
                    if href.startswith("/") else href
                ),
                "title": title_text,
            })

    result["classes"] = classes[:20]  # First 20 classes

    await browser.close()
    await pw.stop()
    return result

Intercepting GraphQL API Responses

Skillshare's frontend communicates with a GraphQL API. Intercepting these calls gives you clean, structured JSON rather than scraped HTML:

async def intercept_course_api(class_url: str, proxy: str = None) -> dict:
    """
    Navigate to a class page and capture all API responses.
    Returns structured data from the GraphQL responses.
    """
    pw, browser, page = await create_browser(proxy)

    captured = {
        "course": None,
        "instructor": None,
        "curriculum": None,
        "reviews": [],
    }

    async def on_response(response):
        if "/api/graphql" not in response.url:
            return
        if response.status != 200:
            return

        try:
            body = await response.json()
            data = body.get("data", {})

            # Identify response by its data shape
            if "class" in data or "course" in data:
                captured["course"] = data.get("class") or data.get("course")
            if "teacher" in data or "instructor" in data:
                captured["instructor"] = (
                    data.get("teacher") or data.get("instructor")
                )
            if "units" in data or "lessons" in data or "curriculum" in data:
                captured["curriculum"] = data
            if "reviews" in data or "classReviews" in data:
                reviews_data = data.get("reviews") or data.get("classReviews", {})
                if isinstance(reviews_data, list):
                    captured["reviews"].extend(reviews_data)
                elif isinstance(reviews_data, dict):
                    items = reviews_data.get("edges", [])
                    captured["reviews"].extend(
                        item.get("node", item) for item in items
                    )
        except Exception:
            pass

    page.on("response", on_response)

    await warm_session(page)
    await page.goto(class_url, wait_until="networkidle", timeout=60000)
    # Scroll to trigger lazy-loaded content
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight / 2)")
    await page.wait_for_timeout(2000)
    await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
    await page.wait_for_timeout(2000)

    await browser.close()
    await pw.stop()
    return captured

Proxy Configuration

Skillshare's Cloudflare setup is the primary reason proxies are essential here. Datacenter IPs from AWS, GCP, or DigitalOcean get challenged on nearly every page load, and many challenges fail even with proper browser automation because Cloudflare scores the IP reputation independently.

ThorData's residential proxies bypass this cleanly — residential IPs have high trust scores with Cloudflare by default. The per-request rotation means each page load comes from a different IP, which prevents session-based rate limiting from accumulating.

# ThorData proxy configuration
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_HOST = "proxy.thordata.com"
PROXY_PORT = 9000

def build_proxy_url(country: str = "us") -> str:
    """Build a ThorData proxy URL with country targeting."""
    auth = f"{PROXY_USER}:{PROXY_PASS}"
    return f"http://{auth}@{PROXY_HOST}:{PROXY_PORT}?country={country}"

PROXY = build_proxy_url("us")

# Pass to browser creation
async def run_with_proxy():
    pw, browser, page = await create_browser(proxy=PROXY)
    # ... scraping code
    await browser.close()
    await pw.stop()

For Skillshare specifically, you want US residential IPs since the platform's content catalog and pricing structure are US-centric. Adding random delays of 4-8 seconds between page navigations keeps the request pattern looking natural.

Storing Results in SQLite

import sqlite3
import json
from datetime import datetime

def init_db(path: str = "skillshare.db") -> sqlite3.Connection:
    """Create database schema for Skillshare data."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS classes (
            url TEXT PRIMARY KEY,
            title TEXT,
            instructor TEXT,
            instructor_url TEXT,
            students TEXT,
            lesson_count INTEGER,
            duration TEXT,
            skill_level TEXT,
            category TEXT,
            description TEXT,
            scraped_at TEXT DEFAULT (datetime('now'))
        );

        CREATE TABLE IF NOT EXISTS lessons (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            class_url TEXT,
            position INTEGER,
            title TEXT,
            duration_seconds INTEGER,
            FOREIGN KEY (class_url) REFERENCES classes(url)
        );

        CREATE TABLE IF NOT EXISTS instructors (
            url TEXT PRIMARY KEY,
            name TEXT,
            total_students TEXT,
            total_classes TEXT,
            followers TEXT,
            bio TEXT,
            scraped_at TEXT DEFAULT (datetime('now'))
        );

        CREATE TABLE IF NOT EXISTS reviews (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            class_url TEXT,
            reviewer_name TEXT,
            rating REAL,
            review_text TEXT,
            review_date TEXT,
            scraped_at TEXT DEFAULT (datetime('now')),
            FOREIGN KEY (class_url) REFERENCES classes(url)
        );

        CREATE INDEX IF NOT EXISTS idx_classes_instructor
            ON classes(instructor);
        CREATE INDEX IF NOT EXISTS idx_reviews_class
            ON reviews(class_url);
    """)
    conn.commit()
    return conn


def save_class(conn: sqlite3.Connection, class_data: dict) -> None:
    """Save class metadata to the database."""
    conn.execute("""
        INSERT OR REPLACE INTO classes
        (url, title, instructor, instructor_url, students,
         lesson_count, duration, skill_level, description)
        VALUES (?,?,?,?,?,?,?,?,?)
    """, (
        class_data.get("url"),
        class_data.get("title"),
        class_data.get("instructor"),
        class_data.get("instructor_url"),
        class_data.get("students_text"),
        class_data.get("lesson_count", 0),
        class_data.get("duration"),
        class_data.get("skill_level"),
        class_data.get("description", "")[:2000],
    ))

    # Save lessons
    for i, lesson_title in enumerate(class_data.get("lessons", []), 1):
        conn.execute("""
            INSERT OR IGNORE INTO lessons (class_url, position, title)
            VALUES (?,?,?)
        """, (class_data.get("url"), i, lesson_title))

    conn.commit()


def save_instructor(conn: sqlite3.Connection, data: dict) -> None:
    """Save instructor profile data."""
    conn.execute("""
        INSERT OR REPLACE INTO instructors
        (url, name, total_students, total_classes, followers, bio)
        VALUES (?,?,?,?,?,?)
    """, (
        data.get("url"),
        data.get("name"),
        data.get("total_students"),
        data.get("total_classes"),
        data.get("followers"),
        data.get("bio", "")[:1000],
    ))
    conn.commit()

Running a Full Catalog Crawl

Here's a complete pipeline to discover and scrape classes in a specific category:

async def crawl_category(category_query: str, max_classes: int = 50,
                           proxy: str = None) -> None:
    """Crawl a Skillshare category and collect class details."""
    conn = init_db()

    print(f"Searching for classes: {category_query}")
    class_list = await search_classes(category_query, max_results=max_classes,
                                       proxy=proxy)
    print(f"Found {len(class_list)} classes")

    for i, cls in enumerate(class_list, 1):
        print(f"\n[{i}/{len(class_list)}] {cls['title']}")

        # Check if already scraped
        existing = conn.execute(
            "SELECT url FROM classes WHERE url = ?",
            (cls["url"],)
        ).fetchone()

        if existing:
            print("  Already in DB, skipping")
            continue

        try:
            detail = await scrape_class(cls["url"], proxy=proxy)
            detail.update(cls)  # merge listing data
            save_class(conn, detail)
            print(f"  Saved: {detail.get('lesson_count')} lessons, "
                  f"{detail.get('students', 'N/A')} students")
        except Exception as e:
            print(f"  Error: {e}")

        await asyncio.sleep(random.uniform(5.0, 10.0))

    conn.close()
    print(f"\nCrawl complete. Saved {max_classes} classes to skillshare.db")


# Example run
asyncio.run(crawl_category("python programming", max_classes=20))

Legal Note

Skillshare's Terms of Service explicitly prohibit scraping. There is no public API or data partnership program. Any automated data collection from Skillshare operates in a legal gray area — the data is visible to anyone who visits the site, but accessing it programmatically violates the ToS. Consider whether your use case genuinely requires Skillshare-specific data, or whether similar information is available from platforms with more permissive terms. If you proceed, limit your collection scope and rate to avoid disrupting the platform.

Key Takeaways

Skillshare has no public API — all data extraction requires browser-based scraping with JavaScript rendering.
Playwright with stealth patches is the most reliable approach because Skillshare's Cloudflare setup blocks plain HTTP requests from datacenter IPs.
API interception via page.on("response") can capture structured GraphQL data that's cleaner than DOM parsing, but class selectors are needed as a fallback since not all data loads through interceptable endpoints.
Rate limiting kicks in around 40-60 requests per minute — pace your scraper with randomized delays of 5-10 seconds between page loads.
ThorData residential proxies are effectively required for any sustained scraping — Cloudflare challenges on datacenter IPs make raw automation unreliable.
CSS selectors on Skillshare break frequently as the frontend is updated — build your scraper to fail gracefully and log pages where extraction returns empty results so you can update selectors.
Always warm up the session with a homepage visit before scraping class or instructor pages — cold sessions trigger more Cloudflare challenges.

Analyzing Course Market Data

With a collected dataset, you can extract real market intelligence:

def analyze_skillshare_market(conn: sqlite3.Connection) -> None:
    """Generate market analysis from collected Skillshare data."""
    print("=== Skillshare Market Analysis ===\n")

    # Total collection stats
    row = conn.execute("""
        SELECT COUNT(*) as courses,
               COUNT(DISTINCT instructor) as instructors
        FROM classes
    """).fetchone()
    print(f"Dataset: {row[0]} courses from {row[1]} instructors\n")

    # Skill level distribution
    print("Skill level distribution:")
    for row in conn.execute("""
        SELECT skill_level, COUNT(*) as count
        FROM classes
        WHERE skill_level IS NOT NULL
        GROUP BY skill_level
        ORDER BY count DESC
    """):
        print(f"  {(row[0] or 'Unspecified'):20}: {row[1]:4} courses")

    # Most prolific instructors
    print("\nTop 10 instructors by course count:")
    for row in conn.execute("""
        SELECT instructor, COUNT(*) as courses
        FROM classes
        WHERE instructor IS NOT NULL
        GROUP BY instructor
        ORDER BY courses DESC LIMIT 10
    """):
        print(f"  {row[0]:30}: {row[1]} courses")

    # Courses with most lessons (comprehensive courses)
    print("\nMost comprehensive courses (by lesson count):")
    for row in conn.execute("""
        SELECT title, instructor, lesson_count, duration
        FROM classes
        ORDER BY lesson_count DESC LIMIT 10
    """):
        print(f"  {row[0][:45]:45} ({row[2]} lessons) by {row[1]}")

    # Lesson count distribution
    print("\nLesson count distribution:")
    for row in conn.execute("""
        SELECT
            CASE
                WHEN lesson_count <= 5 THEN 'Short (1-5)'
                WHEN lesson_count <= 15 THEN 'Medium (6-15)'
                WHEN lesson_count <= 30 THEN 'Long (16-30)'
                ELSE 'Comprehensive (30+)'
            END as tier,
            COUNT(*) as count
        FROM classes
        WHERE lesson_count > 0
        GROUP BY tier
        ORDER BY MIN(lesson_count)
    """):
        print(f"  {row[0]:25}: {row[1]:4} courses")


def find_market_gaps(conn: sqlite3.Connection,
                      topic_keywords: list[str]) -> None:
    """Find topics with high student demand but few quality courses."""
    print("=== Market Gap Analysis ===\n")
    print("Topics by course count (low count = potential opportunity):\n")

    for keyword in topic_keywords:
        count = conn.execute("""
            SELECT COUNT(*) FROM classes
            WHERE LOWER(title) LIKE ?
               OR LOWER(description) LIKE ?
        """, (f"%{keyword}%", f"%{keyword}%")).fetchone()[0]

        print(f"  '{keyword}': {count} courses")

Tips for Long-Running Skillshare Crawls

Running a Skillshare scraper for hours requires careful management:

Checkpoint frequently. Save to the database after every page, not just at the end of a crawl run. If the scraper dies at page 50 of 100, you want to resume from page 50.
Rotate proxy sessions. After 20-30 pages, create a new browser instance with a fresh proxy IP. Don't let a single proxy session accumulate too many requests.
Monitor for empty results. If you're consistently getting 0 results from search pages, your session has likely been blocked. Add a check:

async def check_session_health(page) -> bool:
    """Verify the current session is not blocked."""
    resp = await page.goto(
        "https://www.skillshare.com/en/",
        wait_until="domcontentloaded"
    )
    if resp.status in (403, 429, 503):
        return False
    # Check for Cloudflare challenge page
    content = await page.content()
    if "Checking your browser" in content:
        return False
    return True

Log selector failures. When CSS selectors return empty results, log the page URL and timestamp. Review these logs to identify when Skillshare has updated their frontend and which selectors need updating.
Use Playwright's page.pause() during development. This pauses execution and opens Chrome DevTools so you can inspect the live DOM and verify your selectors before committing them to production code.