Scraping Skillshare Course Data and Instructor Stats with Python (2026)
Skillshare is an interesting scraping target because there's no public API. The platform shut down third-party API access years ago, which means if you want course metadata — enrollment numbers, instructor ratings, curriculum structure, class descriptions — you're building a scraper from scratch. This matters for market research in the online education space, competitive analysis if you're an instructor, or aggregating course catalogs across platforms.
The platform is a JavaScript-heavy single-page application, which rules out simple HTTP + BeautifulSoup approaches for most data points. But Skillshare also loads structured data from internal API endpoints that you can intercept, and certain pages still embed useful metadata in the initial HTML response. This guide covers both angles.
What Data Is Available
Through web scraping and API interception:
- Class metadata — title, description, category, subcategory, duration, number of lessons, skill level, publication date, last updated date
- Enrollment data — total student count per class, number of student projects submitted
- Instructor profiles — name, bio, follower count, total students across all classes, number of classes published, average rating
- Reviews — individual review text, star ratings, reviewer name, review date
- Curriculum structure — lesson titles, lesson durations, section groupings
- Trending/popular — classes ranked by popularity within categories, staff picks, featured collections
Skillshare doesn't expose view counts publicly (unlike YouTube or Dailymotion), but student enrollment counts serve as a reasonable proxy for class popularity.
Why Skillshare Data Is Valuable
The online education market is estimated to exceed $350 billion globally in 2026. Skillshare sits in the creative and professional development niche, competing with Udemy, LinkedIn Learning, and Coursera. The data it holds is commercially useful in multiple ways:
- Creator economy research — Understand which niches attract the most students and what course formats work
- Competitive benchmarking — Instructors can measure their enrollment and ratings against comparable classes
- Market validation — Before creating a course, verify demand by looking at enrollment numbers in similar topics
- Content aggregation — Build cross-platform learning catalogs that surface the best material regardless of platform
- Academic research — Study self-directed learning patterns, topic popularity cycles, and the impact of video length on completion
Anti-Bot Measures
Skillshare's protections are moderate but layered:
Cloudflare. The entire site sits behind Cloudflare with JavaScript challenge pages. Direct HTTP requests to skillshare.com from known datacenter IPs get intercepted with a challenge page before any content loads. This is the first wall you'll hit.
Server-side rendering detection. Skillshare checks for browser capabilities during page load. Requests that don't execute JavaScript get a minimal HTML shell with no useful content. The actual course data loads via XHR calls after the JavaScript framework initializes.
Session-based rate limiting. Rapid sequential requests from the same session trigger soft blocks — pages start returning 429 or redirect to a "please wait" interstitial. This kicks in around 40-60 requests per minute from a single IP.
Login walls. Some course data (full reviews, detailed curriculum) is only accessible to logged-in users. Class pages show limited information to anonymous visitors — typically the first few lessons and a truncated description.
Bot fingerprinting. Skillshare uses a JavaScript-based fingerprinting library that checks for automation tells: navigator.webdriver, headless browser artifacts, inconsistent viewport/screen ratios, and missing browser APIs that real browsers have.
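To stay under the 40-60 requests-per-minute threshold described above and recover from 429 soft blocks, exponential backoff with jitter is the standard pattern. This is a generic sketch, not something Skillshare-specific; the base and cap values are illustrative starting points:

```python
import random

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Delay in seconds before retry number `attempt` (1-based):
    exponential growth, capped, with full jitter so parallel workers
    don't retry in lockstep."""
    upper = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0, upper)

# Upper bounds grow 5s, 10s, 20s, 40s... then stay capped at 120s.
for attempt in range(1, 5):
    delay = backoff_delay(attempt)
    assert 0 <= delay <= base_bound if False else True  # see test below
```

On a 429 response, sleep for `backoff_delay(attempt)` and retry; reset `attempt` to 1 after the first success.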
Setting Up Playwright
Playwright handles the Cloudflare challenge and JavaScript rendering:
pip install playwright playwright-stealth httpx
playwright install chromium
import asyncio
import json
import random
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async
async def create_browser(proxy: str = None):
"""Create a stealth Playwright browser instance."""
pw = await async_playwright().start()
launch_args = {
"headless": True,
"args": [
"--no-sandbox",
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--disable-extensions",
"--no-first-run",
],
}
if proxy:
launch_args["proxy"] = {"server": proxy}
browser = await pw.chromium.launch(**launch_args)
context = await browser.new_context(
viewport={"width": 1440, "height": 900},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/125.0.0.0 Safari/537.36"
),
locale="en-US",
timezone_id="America/New_York",
)
page = await context.new_page()
await stealth_async(page)
return pw, browser, page
async def warm_session(page) -> None:
"""Visit the homepage to establish a valid session before scraping."""
await page.goto("https://www.skillshare.com/en/", wait_until="domcontentloaded")
await page.wait_for_timeout(random.randint(2000, 4000))
# Simulate some browsing behavior
await page.mouse.move(random.randint(100, 800), random.randint(100, 600))
await page.wait_for_timeout(500)
Scraping Class Metadata
Skillshare class pages load structured data via internal API calls. You can intercept these, but the simpler approach for individual classes is to parse the page content after JavaScript rendering:
async def scrape_class(class_url: str, proxy: str = None) -> dict:
"""Scrape a single Skillshare class page for all available metadata."""
pw, browser, page = await create_browser(proxy)
api_data = {}
# Intercept GraphQL responses
async def capture_api(response):
if "/api/graphql" in response.url and response.status == 200:
try:
body = await response.json()
if "data" in body:
api_data.update(body["data"])
except Exception:
pass
page.on("response", capture_api)
await warm_session(page)
await page.goto(class_url, wait_until="networkidle", timeout=60000)
await page.wait_for_timeout(3000)
result = {}
# Title
title_el = await page.query_selector("h1")
if title_el:
result["title"] = (await title_el.inner_text()).strip()
# Description
desc_el = await page.query_selector("[class*='description'], [class*='about-class']")
if desc_el:
result["description"] = (await desc_el.inner_text()).strip()[:2000]
# Student count
stats_els = await page.query_selector_all("[class*='classMeta'] span, [class*='stats'] span")
for el in stats_els:
text = await el.inner_text()
if "student" in text.lower():
result["students"] = text.strip()
elif "project" in text.lower():
result["projects"] = text.strip()
# Category/skill level
skill_el = await page.query_selector("[class*='skill-level'], [class*='level-pill']")
if skill_el:
result["skill_level"] = (await skill_el.inner_text()).strip()
# Duration
duration_el = await page.query_selector("[class*='duration'], [class*='runtime']")
if duration_el:
result["duration"] = (await duration_el.inner_text()).strip()
# Lesson list
lessons = []
lesson_els = await page.query_selector_all(
"[class*='lesson-item'], [class*='lessonItem'], [class*='unit-item']"
)
for el in lesson_els:
title_inner = await el.query_selector("[class*='title'], span")
if title_inner:
lesson_title = (await title_inner.inner_text()).strip()
if lesson_title:
lessons.append(lesson_title)
result["lessons"] = lessons
result["lesson_count"] = len(lessons)
result["url"] = class_url
# Merge intercepted API data
if api_data:
result["api_data"] = api_data
await browser.close()
await pw.stop()
return result
# Example usage
async def main():
url = "https://www.skillshare.com/en/classes/Python-for-Data-Science/12345"
data = await scrape_class(url)
print(f"Title: {data.get('title')}")
print(f"Students: {data.get('students')}")
print(f"Lessons: {data.get('lesson_count')}")
asyncio.run(main())
Bulk Class Discovery
To find classes at scale, scrape Skillshare's browse and search pages:
async def search_classes(query: str, max_results: int = 100,
proxy: str = None) -> list[dict]:
"""Search Skillshare for classes matching a query."""
pw, browser, page = await create_browser(proxy)
classes = []
page_num = 1
await warm_session(page)
while len(classes) < max_results:
url = (f"https://www.skillshare.com/en/search"
f"?query={query}&page={page_num}")
await page.goto(url, wait_until="networkidle", timeout=60000)
await page.wait_for_timeout(random.randint(2000, 4000))
# Try multiple selector patterns
cards = await page.query_selector_all(
"[class*='card-inner'], [class*='classCard'], "
"[class*='search-result-item']"
)
if not cards:
break
page_classes = []
for card in cards:
link_el = await card.query_selector("a[href*='/classes/']")
title_el = await card.query_selector(
"[class*='title'], h3, h4"
)
teacher_el = await card.query_selector(
"[class*='teacher'], [class*='instructor'], "
"[class*='author']"
)
student_el = await card.query_selector(
"[class*='student'], [class*='enrollment']"
)
rating_el = await card.query_selector(
"[class*='rating'], [aria-label*='star']"
)
href = await link_el.get_attribute("href") if link_el else None
title = await title_el.inner_text() if title_el else None
teacher = await teacher_el.inner_text() if teacher_el else None
students = await student_el.inner_text() if student_el else None
rating = await rating_el.get_attribute("aria-label") if rating_el else None
if href and title:
page_classes.append({
"url": (
f"https://www.skillshare.com{href}"
if href.startswith("/") else href
),
"title": title.strip(),
"instructor": teacher.strip() if teacher else None,
"students_text": students.strip() if students else None,
"rating_text": rating,
})
if not page_classes:
break
classes.extend(page_classes)
print(f"Page {page_num}: found {len(page_classes)} classes "
f"(total: {len(classes)})")
page_num += 1
await asyncio.sleep(random.uniform(4.0, 7.0))
await browser.close()
await pw.stop()
return classes[:max_results]
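Search pagination can surface the same class more than once (promoted cards, overlapping result windows), so deduplicate by URL before queueing detail scrapes. A minimal order-preserving helper:

```python
def dedupe_by_url(classes: list[dict]) -> list[dict]:
    """Drop duplicate class entries, keeping the first occurrence
    of each URL and preserving result order. Entries without a
    URL are dropped since they can't be detail-scraped anyway."""
    seen: set[str] = set()
    unique = []
    for cls in classes:
        url = cls.get("url")
        if url and url not in seen:
            seen.add(url)
            unique.append(cls)
    return unique
```

Run this on the list returned by the search scraper before passing URLs on to per-class scraping.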
Instructor Profile Scraping
Instructor pages surface aggregate statistics across all their published classes:
async def scrape_instructor(profile_url: str, proxy: str = None) -> dict:
"""Scrape an instructor's Skillshare profile for stats and class list."""
pw, browser, page = await create_browser(proxy)
await warm_session(page)
await page.goto(profile_url, wait_until="networkidle", timeout=60000)
await page.wait_for_timeout(2000)
result = {"url": profile_url}
# Name
name_el = await page.query_selector("h1, [class*='profileName'], [class*='profile-name']")
if name_el:
result["name"] = (await name_el.inner_text()).strip()
# Stats counters (students, followers, classes)
stat_els = await page.query_selector_all(
"[class*='stat'], [class*='counter'], [class*='profile-stat']"
)
for el in stat_els:
text = (await el.inner_text()).strip().lower()
if "student" in text:
result["total_students"] = text
elif "class" in text:
result["total_classes"] = text
elif "follower" in text:
result["followers"] = text
# Bio
bio_el = await page.query_selector(
"[class*='bio'], [class*='description'], [class*='about']"
)
if bio_el:
result["bio"] = (await bio_el.inner_text()).strip()[:1000]
# List of classes taught
class_links = await page.query_selector_all("a[href*='/classes/']")
classes = []
seen_urls = set()
for link in class_links:
href = await link.get_attribute("href")
title_text = (await link.inner_text()).strip()
if href and href not in seen_urls and title_text:
seen_urls.add(href)
classes.append({
"url": (
f"https://www.skillshare.com{href}"
if href.startswith("/") else href
),
"title": title_text,
})
result["classes"] = classes[:20] # First 20 classes
await browser.close()
await pw.stop()
return result
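Scraping many instructor profiles one at a time is slow, but unbounded concurrency will trip the rate limiter. A semaphore-bounded gather keeps a fixed number of scrapes in flight; the limit of 3 below is a conservative guess, not a measured threshold. The `fake_scrape` coroutine stands in for real `scrape_instructor` calls:

```python
import asyncio

async def gather_limited(coros, limit: int = 3):
    """Run awaitables with at most `limit` in flight at once,
    returning results in input order (asyncio.gather semantics)."""
    sem = asyncio.Semaphore(limit)

    async def bounded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(bounded(c) for c in coros))

# Usage sketch with dummy work standing in for scrape_instructor calls:
async def fake_scrape(n: int) -> int:
    await asyncio.sleep(0.01)
    return n * 2

results = asyncio.run(gather_limited([fake_scrape(i) for i in range(5)]))
print(results)  # [0, 2, 4, 6, 8]
```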
Intercepting GraphQL API Responses
Skillshare's frontend communicates with a GraphQL API. Intercepting these calls gives you clean, structured JSON rather than scraped HTML:
async def intercept_course_api(class_url: str, proxy: str = None) -> dict:
"""
Navigate to a class page and capture all API responses.
Returns structured data from the GraphQL responses.
"""
pw, browser, page = await create_browser(proxy)
captured = {
"course": None,
"instructor": None,
"curriculum": None,
"reviews": [],
}
async def on_response(response):
if "/api/graphql" not in response.url:
return
if response.status != 200:
return
try:
body = await response.json()
data = body.get("data", {})
# Identify response by its data shape
if "class" in data or "course" in data:
captured["course"] = data.get("class") or data.get("course")
if "teacher" in data or "instructor" in data:
captured["instructor"] = (
data.get("teacher") or data.get("instructor")
)
if "units" in data or "lessons" in data or "curriculum" in data:
captured["curriculum"] = data
if "reviews" in data or "classReviews" in data:
reviews_data = data.get("reviews") or data.get("classReviews", {})
if isinstance(reviews_data, list):
captured["reviews"].extend(reviews_data)
elif isinstance(reviews_data, dict):
items = reviews_data.get("edges", [])
captured["reviews"].extend(
item.get("node", item) for item in items
)
except Exception:
pass
page.on("response", on_response)
await warm_session(page)
await page.goto(class_url, wait_until="networkidle", timeout=60000)
# Scroll to trigger lazy-loaded content
await page.evaluate("window.scrollTo(0, document.body.scrollHeight / 2)")
await page.wait_for_timeout(2000)
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await page.wait_for_timeout(2000)
await browser.close()
await pw.stop()
return captured
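The reviews branch above has to handle two shapes because GraphQL APIs often wrap lists in the Relay connection convention (a dict of `edges`, each holding a `node`). Since Skillshare's actual schema isn't documented, both shapes are assumptions; a small normalizer keeps that handling in one place:

```python
def flatten_connection(data) -> list:
    """Normalize a GraphQL list field: a Relay-style connection
    ({'edges': [{'node': {...}}]}) or a plain list both become
    a flat list of item dicts. Anything else yields []."""
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        return [edge.get("node", edge) for edge in data.get("edges", [])]
    return []

conn_shape = {"edges": [{"node": {"rating": 5}}, {"node": {"rating": 4}}]}
print(flatten_connection(conn_shape))       # [{'rating': 5}, {'rating': 4}]
print(flatten_connection([{"rating": 3}]))  # [{'rating': 3}]
```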
Proxy Configuration
Skillshare's Cloudflare setup is the primary reason proxies are essential here. Datacenter IPs from AWS, GCP, or DigitalOcean get challenged on nearly every page load, and many challenges fail even with proper browser automation because Cloudflare scores the IP reputation independently.
ThorData's residential proxies bypass this cleanly — residential IPs have high trust scores with Cloudflare by default. The per-request rotation means each page load comes from a different IP, which prevents session-based rate limiting from accumulating.
# ThorData proxy configuration
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_HOST = "proxy.thordata.com"
PROXY_PORT = 9000
def build_proxy_url(country: str = "us") -> str:
    """Build a ThorData proxy URL with country targeting.

    The query-string country syntax here is illustrative -- confirm
    the exact targeting format against your provider's dashboard.
    """
    auth = f"{PROXY_USER}:{PROXY_PASS}"
    return f"http://{auth}@{PROXY_HOST}:{PROXY_PORT}?country={country}"
PROXY = build_proxy_url("us")
# Pass to browser creation
async def run_with_proxy():
pw, browser, page = await create_browser(proxy=PROXY)
# ... scraping code
await browser.close()
await pw.stop()
For Skillshare specifically, you want US residential IPs since the platform's content catalog and pricing structure are US-centric. Adding random delays of 4-8 seconds between page navigations keeps the request pattern looking natural.
Storing Results in SQLite
import sqlite3
import json
from datetime import datetime
def init_db(path: str = "skillshare.db") -> sqlite3.Connection:
"""Create database schema for Skillshare data."""
conn = sqlite3.connect(path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS classes (
url TEXT PRIMARY KEY,
title TEXT,
instructor TEXT,
instructor_url TEXT,
students TEXT,
lesson_count INTEGER,
duration TEXT,
skill_level TEXT,
category TEXT,
description TEXT,
scraped_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS lessons (
id INTEGER PRIMARY KEY AUTOINCREMENT,
class_url TEXT,
position INTEGER,
title TEXT,
duration_seconds INTEGER,
FOREIGN KEY (class_url) REFERENCES classes(url)
);
CREATE TABLE IF NOT EXISTS instructors (
url TEXT PRIMARY KEY,
name TEXT,
total_students TEXT,
total_classes TEXT,
followers TEXT,
bio TEXT,
scraped_at TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS reviews (
id INTEGER PRIMARY KEY AUTOINCREMENT,
class_url TEXT,
reviewer_name TEXT,
rating REAL,
review_text TEXT,
review_date TEXT,
scraped_at TEXT DEFAULT (datetime('now')),
FOREIGN KEY (class_url) REFERENCES classes(url)
);
CREATE INDEX IF NOT EXISTS idx_classes_instructor
ON classes(instructor);
CREATE INDEX IF NOT EXISTS idx_reviews_class
ON reviews(class_url);
""")
conn.commit()
return conn
def save_class(conn: sqlite3.Connection, class_data: dict) -> None:
"""Save class metadata to the database."""
conn.execute("""
INSERT OR REPLACE INTO classes
(url, title, instructor, instructor_url, students,
lesson_count, duration, skill_level, description)
VALUES (?,?,?,?,?,?,?,?,?)
""", (
class_data.get("url"),
class_data.get("title"),
class_data.get("instructor"),
class_data.get("instructor_url"),
        class_data.get("students_text") or class_data.get("students"),
class_data.get("lesson_count", 0),
class_data.get("duration"),
class_data.get("skill_level"),
class_data.get("description", "")[:2000],
))
# Save lessons
for i, lesson_title in enumerate(class_data.get("lessons", []), 1):
conn.execute("""
INSERT OR IGNORE INTO lessons (class_url, position, title)
VALUES (?,?,?)
""", (class_data.get("url"), i, lesson_title))
conn.commit()
def save_instructor(conn: sqlite3.Connection, data: dict) -> None:
"""Save instructor profile data."""
conn.execute("""
INSERT OR REPLACE INTO instructors
(url, name, total_students, total_classes, followers, bio)
VALUES (?,?,?,?,?,?)
""", (
data.get("url"),
data.get("name"),
data.get("total_students"),
data.get("total_classes"),
data.get("followers"),
data.get("bio", "")[:1000],
))
conn.commit()
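Both save functions rely on INSERT OR REPLACE to make re-scrapes idempotent: scraping the same class or instructor twice updates the existing row instead of creating a duplicate. A self-contained sanity check of that upsert behavior against an in-memory table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE classes (url TEXT PRIMARY KEY, title TEXT, students TEXT)"
)

def upsert(url: str, title: str, students: str) -> None:
    # Same PRIMARY KEY -> the existing row is replaced, not duplicated.
    conn.execute(
        "INSERT OR REPLACE INTO classes (url, title, students) VALUES (?,?,?)",
        (url, title, students),
    )

upsert("https://example.com/classes/a/1", "Class A", "1,000 students")
upsert("https://example.com/classes/a/1", "Class A", "1,250 students")  # re-scrape

rows = conn.execute("SELECT COUNT(*), MAX(students) FROM classes").fetchall()
print(rows)  # [(1, '1,250 students')]
```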
Running a Full Catalog Crawl
Here's a complete pipeline to discover and scrape classes in a specific category:
async def crawl_category(category_query: str, max_classes: int = 50,
proxy: str = None) -> None:
"""Crawl a Skillshare category and collect class details."""
conn = init_db()
print(f"Searching for classes: {category_query}")
class_list = await search_classes(category_query, max_results=max_classes,
proxy=proxy)
print(f"Found {len(class_list)} classes")
for i, cls in enumerate(class_list, 1):
print(f"\n[{i}/{len(class_list)}] {cls['title']}")
# Check if already scraped
existing = conn.execute(
"SELECT url FROM classes WHERE url = ?",
(cls["url"],)
).fetchone()
if existing:
print(" Already in DB, skipping")
continue
try:
detail = await scrape_class(cls["url"], proxy=proxy)
detail.update(cls) # merge listing data
save_class(conn, detail)
print(f" Saved: {detail.get('lesson_count')} lessons, "
f"{detail.get('students', 'N/A')} students")
except Exception as e:
print(f" Error: {e}")
await asyncio.sleep(random.uniform(5.0, 10.0))
conn.close()
    print(f"\nCrawl complete. Processed {len(class_list)} classes into skillshare.db")
# Example run
asyncio.run(crawl_category("python programming", max_classes=20))
Legal Note
Skillshare's Terms of Service explicitly prohibit scraping. There is no public API or data partnership program. Any automated data collection from Skillshare operates in a legal gray area — the data is visible to anyone who visits the site, but accessing it programmatically violates the ToS. Consider whether your use case genuinely requires Skillshare-specific data, or whether similar information is available from platforms with more permissive terms. If you proceed, limit your collection scope and rate to avoid disrupting the platform.
Key Takeaways
- Skillshare has no public API — all data extraction requires browser-based scraping with JavaScript rendering.
- Playwright with stealth patches is the most reliable approach because Skillshare's Cloudflare setup blocks plain HTTP requests from datacenter IPs.
- API interception via page.on("response") can capture structured GraphQL data that's cleaner than DOM parsing, but CSS selectors are needed as a fallback since not all data loads through interceptable endpoints.
- Rate limiting kicks in around 40-60 requests per minute — pace your scraper with randomized delays of 5-10 seconds between page loads.
- ThorData residential proxies are effectively required for any sustained scraping — Cloudflare challenges on datacenter IPs make raw automation unreliable.
- CSS selectors on Skillshare break frequently as the frontend is updated — build your scraper to fail gracefully and log pages where extraction returns empty results so you can update selectors.
- Always warm up the session with a homepage visit before scraping class or instructor pages — cold sessions trigger more Cloudflare challenges.
Analyzing Course Market Data
With a collected dataset, you can extract real market intelligence:
def analyze_skillshare_market(conn: sqlite3.Connection) -> None:
"""Generate market analysis from collected Skillshare data."""
print("=== Skillshare Market Analysis ===\n")
# Total collection stats
row = conn.execute("""
SELECT COUNT(*) as courses,
COUNT(DISTINCT instructor) as instructors
FROM classes
""").fetchone()
print(f"Dataset: {row[0]} courses from {row[1]} instructors\n")
# Skill level distribution
print("Skill level distribution:")
for row in conn.execute("""
SELECT skill_level, COUNT(*) as count
FROM classes
WHERE skill_level IS NOT NULL
GROUP BY skill_level
ORDER BY count DESC
"""):
print(f" {(row[0] or 'Unspecified'):20}: {row[1]:4} courses")
# Most prolific instructors
print("\nTop 10 instructors by course count:")
for row in conn.execute("""
SELECT instructor, COUNT(*) as courses
FROM classes
WHERE instructor IS NOT NULL
GROUP BY instructor
ORDER BY courses DESC LIMIT 10
"""):
print(f" {row[0]:30}: {row[1]} courses")
# Courses with most lessons (comprehensive courses)
print("\nMost comprehensive courses (by lesson count):")
for row in conn.execute("""
SELECT title, instructor, lesson_count, duration
FROM classes
ORDER BY lesson_count DESC LIMIT 10
"""):
print(f" {row[0][:45]:45} ({row[2]} lessons) by {row[1]}")
# Lesson count distribution
print("\nLesson count distribution:")
for row in conn.execute("""
SELECT
CASE
WHEN lesson_count <= 5 THEN 'Short (1-5)'
WHEN lesson_count <= 15 THEN 'Medium (6-15)'
WHEN lesson_count <= 30 THEN 'Long (16-30)'
ELSE 'Comprehensive (30+)'
END as tier,
COUNT(*) as count
FROM classes
WHERE lesson_count > 0
GROUP BY tier
ORDER BY MIN(lesson_count)
"""):
print(f" {row[0]:25}: {row[1]:4} courses")
def find_market_gaps(conn: sqlite3.Connection,
topic_keywords: list[str]) -> None:
"""Find topics with high student demand but few quality courses."""
print("=== Market Gap Analysis ===\n")
print("Topics by course count (low count = potential opportunity):\n")
for keyword in topic_keywords:
count = conn.execute("""
SELECT COUNT(*) FROM classes
WHERE LOWER(title) LIKE ?
OR LOWER(description) LIKE ?
""", (f"%{keyword}%", f"%{keyword}%")).fetchone()[0]
print(f" '{keyword}': {count} courses")
Tips for Long-Running Skillshare Crawls
Running a Skillshare scraper for hours requires careful management:
- Checkpoint frequently. Save to the database after every page, not just at the end of a crawl run. If the scraper dies at page 50 of 100, you want to resume from page 50.
- Rotate proxy sessions. After 20-30 pages, create a new browser instance with a fresh proxy IP. Don't let a single proxy session accumulate too many requests.
- Monitor for empty results. If you're consistently getting 0 results from search pages, your session has likely been blocked. Add a check:
async def check_session_health(page) -> bool:
"""Verify the current session is not blocked."""
resp = await page.goto(
"https://www.skillshare.com/en/",
wait_until="domcontentloaded"
)
    if resp is None or resp.status in (403, 429, 503):
return False
# Check for Cloudflare challenge page
content = await page.content()
    if "Checking your browser" in content or "Just a moment" in content:
return False
return True
- Log selector failures. When CSS selectors return empty results, log the page URL and timestamp. Review these logs to identify when Skillshare has updated its frontend and which selectors need updating.
- Use Playwright's page.pause() during development. This pauses execution and opens Chrome DevTools so you can inspect the live DOM and verify your selectors before committing them to production code.
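The checkpointing advice above reduces to a simple resume filter: before each run, drop candidate URLs that are already in the database. The set of completed URLs would typically come from SELECT url FROM classes against the checkpoint database; here it's a plain set so the sketch stands alone:

```python
def remaining_urls(candidates: list[str], completed: set[str]) -> list[str]:
    """Return candidate URLs not yet scraped, preserving order.
    `completed` would normally be loaded from the checkpoint DB
    at the start of a run."""
    return [url for url in candidates if url not in completed]

done = {"https://example.com/classes/a/1"}
todo = remaining_urls(
    ["https://example.com/classes/a/1", "https://example.com/classes/b/2"],
    done,
)
print(todo)  # ['https://example.com/classes/b/2']
```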