Scraping Instructables for DIY Project Data with Python (2026)
Instructables is the biggest repository of DIY and maker project guides on the internet — electronics, woodworking, 3D printing, cooking, robotics, home improvement. Each project includes step-by-step instructions, component lists, photos, and community feedback.
For researchers studying maker culture, builders tracking component trends, or anyone creating a project recommendation engine, it's excellent data.
This guide covers everything: the site structure, anti-bot handling, extracting steps and materials, collecting at scale, storing the results, and real-world analysis ideas.
Table of Contents
- Site Architecture and Strategy
- Environment Setup
- Scraping a Project Page
- Extracting Step-by-Step Instructions
- Component and Material Lists
- Navigating Category Listings
- Anti-Bot Protection and How to Handle It
- Collecting Data at Scale
- Database Design for Maker Data
- Async Scraping for Speed
- Proxy Setup with ThorData
- Playwright Fallback for JS-Heavy Pages
- Data Analysis: Component Trends
- Real-World Use Cases
- Tips for Clean Data
Site Architecture and Strategy {#architecture}
Instructables serves most content as static HTML. The core project data — title, steps, descriptions, materials — is baked into the initial page response. This is the best case for scraping: no JavaScript rendering required for the main content.
However, the site does use JavaScript for some interactive elements:
- Comment loading (paginated via AJAX)
- "Pro member" content (gated behind login)
- Some embedded tool lists on newer projects
For 95% of use cases, requests + BeautifulSoup is all you need.
The URL structure:
# Project page
https://www.instructables.com/PROJECT-SLUG/
# Author profile
https://www.instructables.com/member/USERNAME/
# Category listing
https://www.instructables.com/CATEGORY/
# Tag search
https://www.instructables.com/tag/type-id/TOPIC/
Category slugs include: circuits, workshop, craft, cooking, living, outside, teachers. Each has subcategories: circuits/arduino, circuits/raspberry-pi, workshop/3d-printing, etc.
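These patterns are simple enough to capture in small helpers. A minimal sketch for building URLs from slugs (`BASE_URL` is repeated here so the snippet stands alone; the `?page=` parameter mirrors the pagination used on category listings):

```python
BASE_URL = "https://www.instructables.com"

def project_url(slug: str) -> str:
    """URL for a single project page."""
    return f"{BASE_URL}/{slug}/"

def member_url(username: str) -> str:
    """URL for an author profile."""
    return f"{BASE_URL}/member/{username}/"

def category_url(category_slug: str, page: int = 1) -> str:
    """URL for a category listing, optionally paginated."""
    url = f"{BASE_URL}/{category_slug}/"
    return url if page == 1 else f"{url}?page={page}"
```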
Environment Setup {#setup}
pip install requests beautifulsoup4 pandas lxml httpx aiohttp
(sqlite3 and asyncio ship with Python's standard library, so they cannot and need not be installed with pip.)
For anti-detection and proxy support:
pip install requests beautifulsoup4 lxml pandas playwright curl-cffi
playwright install chromium
Basic configuration shared across all examples:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import sqlite3
import json
import re
from datetime import datetime
from urllib.parse import urljoin, urlparse
BASE_URL = "https://www.instructables.com"
# Rotate through multiple user agents
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]
def get_headers(referer: str = None) -> dict:
"""Generate randomized browser headers."""
headers = {
"User-Agent": random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "same-origin" if referer else "none",
"Sec-Fetch-User": "?1",
}
if referer:
headers["Referer"] = referer
return headers
def polite_get(url: str, session: requests.Session = None,
               proxy: str = None, min_delay: float = 3.0,
               max_delay: float = 7.0, max_retries: int = 3) -> requests.Response:
    """
    Polite HTTP GET with delay, bounded retry on 429, and optional proxy.
    """
    time.sleep(random.uniform(min_delay, max_delay))
    headers = get_headers(referer=BASE_URL)
    proxies = {"http": proxy, "https": proxy} if proxy else None
    requester = session if session else requests
    resp = requester.get(url, headers=headers, proxies=proxies, timeout=30)
    if resp.status_code == 429:
        if max_retries <= 0:
            resp.raise_for_status()
        retry_after = int(resp.headers.get("Retry-After", 60))
        print(f"Rate limited. Waiting {retry_after}s")
        time.sleep(retry_after + random.uniform(5, 15))
        return polite_get(url, session, proxy, min_delay, max_delay,
                          max_retries - 1)
    resp.raise_for_status()
    return resp
Scraping a Project Page {#project-page}
Each Instructables project page contains the core metadata in the HTML. The structure has stayed relatively stable over the years, though Autodesk occasionally updates class names.
def get_project(project_url: str, session: requests.Session = None,
proxy: str = None) -> dict:
"""
Scrape a full project page.
Returns metadata: title, author, category, stats, description.
"""
resp = polite_get(project_url, session=session, proxy=proxy)
soup = BeautifulSoup(resp.text, "lxml")
project = {"url": project_url, "scraped_at": datetime.utcnow().isoformat()}
# Title — try multiple selectors (site structure varies by age of post)
for selector in [("h1", "header-title"), ("h1", "title"), ("h1", None)]:
tag, cls = selector
el = soup.find(tag, class_=cls) if cls else soup.find(tag)
if el:
project["title"] = el.get_text(strip=True)
break
# Author
author_el = (soup.find("a", class_="member-header-display") or
soup.find("a", attrs={"itemprop": "author"}))
project["author"] = author_el.get_text(strip=True) if author_el else ""
if author_el:
project["author_url"] = urljoin(BASE_URL, author_el.get("href", ""))
# Category breadcrumb
breadcrumbs = soup.find_all("a", class_="breadcrumb")
if not breadcrumbs:
breadcrumbs = soup.find_all("a", attrs={"itemprop": "item"})
project["category"] = " > ".join(b.get_text(strip=True) for b in breadcrumbs)
# Stats
for stat_class, field in [
("views-count", "views"),
("favorites-count", "favorites"),
("comments-count", "comments"),
]:
el = soup.find("span", class_=stat_class)
if el:
text = el.get_text(strip=True).replace(",", "")
try:
project[field] = int(re.sub(r"[^\d]", "", text))
except ValueError:
project[field] = 0
# Date published
date_el = (soup.find("meta", attrs={"itemprop": "datePublished"}) or
soup.find("time", class_="publish-date"))
if date_el:
project["published"] = (date_el.get("content") or
date_el.get("datetime") or
date_el.get_text(strip=True))
# Introduction / description (first step body)
intro = (soup.find("div", class_="step-body") or
soup.find("div", class_="intro"))
if intro:
project["description"] = intro.get_text(separator=" ", strip=True)[:1000]
# Difficulty level
difficulty_el = soup.find("span", class_="difficulty")
project["difficulty"] = difficulty_el.get_text(strip=True) if difficulty_el else ""
# Contest wins
contests = soup.find_all("a", class_="contest-winner")
project["contest_wins"] = len(contests)
# License
license_el = soup.find("a", class_="license")
project["license"] = license_el.get_text(strip=True) if license_el else ""
return project
# Example usage
# session = requests.Session()
# project = get_project("https://www.instructables.com/Wooden-LED-Cube/", session)
# print(project["title"], "—", project["views"], "views")
Extracting Step-by-Step Instructions {#steps}
The step structure is what makes Instructables data unique. Each project breaks down a build into numbered steps with text and images.
def get_project_steps(project_url: str, session: requests.Session = None,
proxy: str = None) -> list[dict]:
"""
Extract all steps from a project.
Returns list of dicts with step number, title, text, image URLs.
"""
resp = polite_get(project_url, session=session, proxy=proxy)
soup = BeautifulSoup(resp.text, "lxml")
steps = []
# Steps are in <section class="step"> containers
step_containers = soup.find_all("section", class_="step")
    # Fallback: older projects use <div id="step..."> containers
    if not step_containers:
        step_containers = soup.find_all("div", id=re.compile(r"^step"))
for i, container in enumerate(step_containers):
step = {
"step_number": i,
"is_intro": "intro" in container.get("class", []),
}
# Step title
title_el = container.find(["h2", "h3"])
step["title"] = title_el.get_text(strip=True) if title_el else f"Step {i}"
# Step body text
body_el = container.find("div", class_="step-body")
if body_el:
# Get clean text, preserving paragraph breaks
paragraphs = body_el.find_all(["p", "li"])
if paragraphs:
step["text"] = "\n".join(
p.get_text(separator=" ", strip=True)
for p in paragraphs if p.get_text(strip=True)
)
else:
step["text"] = body_el.get_text(separator="\n", strip=True)
else:
step["text"] = ""
# Image URLs
images = container.find_all("img")
step["image_urls"] = [
img.get("src") or img.get("data-src", "")
for img in images
if img.get("src") or img.get("data-src")
]
step["image_count"] = len(step["image_urls"])
# Embedded files (some steps include downloadable files)
file_links = container.find_all("a", class_="pdf-link")
step["files"] = [a.get("href", "") for a in file_links]
# Code blocks (common in electronics projects)
code_blocks = container.find_all(["code", "pre"])
step["code_snippets"] = [
cb.get_text() for cb in code_blocks if cb.get_text(strip=True)
]
step["has_code"] = len(step["code_snippets"]) > 0
if step["title"] or step["text"]:
steps.append(step)
return steps
def get_full_project(project_url: str, session: requests.Session = None,
proxy: str = None) -> dict:
"""Get project metadata AND all steps in one call."""
resp = polite_get(project_url, session=session, proxy=proxy)
soup = BeautifulSoup(resp.text, "lxml")
# Metadata
project = {"url": project_url, "scraped_at": datetime.utcnow().isoformat()}
title_el = soup.find("h1")
project["title"] = title_el.get_text(strip=True) if title_el else ""
# Steps (reuse the same soup object — no extra request needed)
steps_data = []
for i, container in enumerate(soup.find_all("section", class_="step")):
title_el = container.find(["h2", "h3"])
body_el = container.find("div", class_="step-body")
images = container.find_all("img")
steps_data.append({
"step_number": i,
"title": title_el.get_text(strip=True) if title_el else f"Step {i}",
"text": body_el.get_text(separator="\n", strip=True) if body_el else "",
"image_count": len(images),
})
project["steps"] = steps_data
project["step_count"] = len(steps_data)
project["total_images"] = sum(s["image_count"] for s in steps_data)
project["total_chars"] = sum(len(s["text"]) for s in steps_data)
return project
Component and Material Lists {#materials}
The materials list is the gold mine for maker data analysis. It reveals what components are commonly needed across different project types.
def get_supplies(project_url: str, session: requests.Session = None,
proxy: str = None) -> dict:
"""
Extract the supplies/materials list from a project.
Returns structured data with items and links.
"""
resp = polite_get(project_url, session=session, proxy=proxy)
soup = BeautifulSoup(resp.text, "lxml")
# Method 1: Look for dedicated supplies section
supply_section = soup.find("section", class_="step-supplies")
# Method 2: Look for a step titled "Supplies", "Materials", "Components"
if not supply_section:
supply_keywords = ["supplie", "material", "component",
"you will need", "what you need", "tools",
"parts list", "shopping list", "bill of materials"]
for section in soup.find_all("section", class_="step"):
title_el = section.find(["h2", "h3"])
if title_el:
title_text = title_el.get_text().lower()
if any(kw in title_text for kw in supply_keywords):
supply_section = section
break
result = {
"items": [],
"tool_items": [],
"amazon_links": [],
"other_links": [],
}
if not supply_section:
return result
# Extract list items
for li in supply_section.find_all("li"):
text = li.get_text(strip=True)
if not text:
continue
item_data = {"text": text}
        # Extract quantity when present ("2x Arduino Nano", "3 LEDs").
        # Require an explicit x/× or whitespace after the digits so that
        # values like "10K ohm resistor" are not misread as quantity 10.
        qty_match = re.match(r"^(\d+)\s*(?:[xX×]\s*|\s+)(.+)", text)
if qty_match:
item_data["quantity"] = int(qty_match.group(1))
item_data["name"] = qty_match.group(2).strip()
else:
item_data["quantity"] = 1
item_data["name"] = text
# Extract links (often Amazon affiliate links)
links = li.find_all("a")
item_links = []
for link in links:
href = link.get("href", "")
if href:
if "amazon" in href:
result["amazon_links"].append(href)
item_data["amazon_url"] = href
else:
result["other_links"].append(href)
item_links.append(href)
item_data["links"] = item_links
result["items"].append(item_data)
# Also capture any paragraph text in the supplies section
body_el = supply_section.find("div", class_="step-body")
if body_el:
# Sometimes materials are listed as paragraphs, not <li>
paragraphs = body_el.find_all("p")
for p in paragraphs:
text = p.get_text(strip=True)
if text and len(text) < 200: # short = likely a component name
# Check if it's not already captured as a list item
if not any(item["text"] == text for item in result["items"]):
result["items"].append({
"text": text,
"quantity": 1,
"name": text,
"links": [],
})
return result
def normalize_component_name(raw_name: str) -> str:
    """
    Normalize component names for analysis, e.g.
    "2x Arduino Nano v3" → "arduino nano"
    "Resistor 10K (SMD)" → "resistor 10k"
    """
    name = raw_name.lower()
    # Remove quantity prefixes like "2x " or "3 "; require an x/× or
    # whitespace after the digits so "10k resistor" keeps its value
    name = re.sub(r"^\d+\s*(?:[x×]\s*|\s+)", "", name)
    # Remove parenthetical notes
    name = re.sub(r"\([^)]+\)", "", name)
    # Remove version numbers
    name = re.sub(r"\bv\d+(\.\d+)?\b", "", name)
    # Normalize spacing
    name = " ".join(name.split())
    return name.strip()
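Normalized names can then be grouped into coarse buckets for trend analysis. A minimal keyword-based classifier sketch; the bucket names and keyword lists are illustrative assumptions, not an official taxonomy:

```python
# Illustrative keyword → category map for grouping normalized component
# names. Extend it to fit your dataset; the first matching bucket wins.
COMPONENT_KEYWORDS = {
    "microcontroller": ["arduino", "esp32", "esp8266", "raspberry pi", "attiny"],
    "passive": ["resistor", "capacitor", "inductor", "potentiometer"],
    "output": ["led", "speaker", "servo", "motor", "display"],
    "sensor": ["sensor", "thermistor", "photoresistor", "accelerometer"],
    "power": ["battery", "power supply", "regulator", "solar"],
    "material": ["plywood", "acrylic", "pla", "filament", "screw", "glue"],
}

def classify_component(normalized_name: str) -> str:
    """Return the first bucket whose keyword appears in the name."""
    for category, keywords in COMPONENT_KEYWORDS.items():
        if any(kw in normalized_name for kw in keywords):
            return category
    return "other"
```

Because matching is first-hit, order the map from most to least specific keywords.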
Navigating Category Listings {#categories}
To collect projects at scale, you start from category listing pages and paginate through them.
def get_category_projects(category_slug: str, page: int = 1,
                          session: requests.Session = None,
                          proxy: str = None) -> tuple[list[dict], bool]:
"""
Scrape a category page to get project URLs and metadata.
Args:
category_slug: e.g., "circuits/arduino", "workshop/3d-printing"
page: page number (1-indexed)
"""
url = f"{BASE_URL}/{category_slug}/?page={page}"
resp = polite_get(url, session=session, proxy=proxy)
soup = BeautifulSoup(resp.text, "lxml")
projects = []
# Project cards — multiple possible selectors
cards = (soup.find_all("a", class_="ible-card") or
soup.find_all("div", class_="instructable-link") or
soup.find_all("article", class_="instructable"))
for card in cards:
project = {}
# URL
href = card.get("href") if card.name == "a" else None
if not href:
link = card.find("a")
href = link.get("href") if link else None
if not href:
continue
project["url"] = urljoin(BASE_URL, href)
# Extract slug from URL
slug_match = re.search(r"/([^/]+)/?$", project["url"])
project["slug"] = slug_match.group(1) if slug_match else ""
# Title
title_el = (card.find("strong", class_="title") or
card.find(["h2", "h3"]) or
card.find("span", class_="title"))
project["title"] = title_el.get_text(strip=True) if title_el else card.get("title", "")
# Author
author_el = card.find("span", class_="author")
project["author"] = author_el.get_text(strip=True) if author_el else ""
# Views
views_el = card.find("span", class_="views")
if views_el:
views_text = views_el.get_text(strip=True).replace(",", "")
try:
project["views"] = int(re.sub(r"[^\d]", "", views_text))
except ValueError:
project["views"] = 0
# Favorites
favs_el = card.find("span", class_="favorites")
if favs_el:
favs_text = favs_el.get_text(strip=True).replace(",", "")
try:
project["favorites"] = int(re.sub(r"[^\d]", "", favs_text))
except ValueError:
project["favorites"] = 0
# Thumbnail
img = card.find("img")
        project["thumbnail"] = (img.get("src") or img.get("data-src", "")) if img else ""
project["category"] = category_slug
projects.append(project)
# Check for next page
next_btn = soup.find("a", rel="next")
has_next = bool(next_btn)
return projects, has_next
def collect_category_data(category_slug: str, max_pages: int = 5,
session: requests.Session = None,
proxy: str = None) -> pd.DataFrame:
"""
Collect all projects from a category up to max_pages.
"""
all_projects = []
for page in range(1, max_pages + 1):
try:
projects, has_next = get_category_projects(
category_slug, page=page,
session=session, proxy=proxy
)
if not projects:
print(f" No projects on page {page}, stopping")
break
all_projects.extend(projects)
print(f" Page {page}: {len(projects)} projects "
f"(total: {len(all_projects)})")
if not has_next:
print(f" Reached last page ({page})")
break
except Exception as e:
print(f" Error on page {page}: {e}")
break
return pd.DataFrame(all_projects)
# Example: collect Arduino projects
# session = requests.Session()
# df = collect_category_data("circuits/arduino", max_pages=10, session=session)
# print(df.sort_values("views", ascending=False).head(10))
Browsing All Top-Level Categories
INSTRUCTABLES_CATEGORIES = {
"circuits": [
"circuits/arduino",
"circuits/raspberry-pi",
"circuits/microcontrollers",
"circuits/sensors",
"circuits/power",
"circuits/computers",
"circuits/leds",
],
"workshop": [
"workshop/3d-printing",
"workshop/woodworking",
"workshop/metalworking",
"workshop/cnc",
"workshop/laser-cutting",
],
"craft": [
"craft/sewing",
"craft/knitting-and-crochet",
"craft/paper",
"craft/costumes",
],
"cooking": [
"cooking/main-course",
"cooking/snacks-and-appetizers",
"cooking/baking",
"cooking/canning-and-preserving",
],
}
def crawl_all_categories(max_pages_per_category: int = 3) -> pd.DataFrame:
"""Crawl all major categories and compile a master dataset."""
session = requests.Session()
all_data = []
for category, subcategories in INSTRUCTABLES_CATEGORIES.items():
for subcat in subcategories:
print(f"\nScraping {subcat}...")
try:
df = collect_category_data(
subcat,
max_pages=max_pages_per_category,
session=session
)
df["main_category"] = category
all_data.append(df)
except Exception as e:
print(f" Failed {subcat}: {e}")
# Rest between subcategories
time.sleep(random.uniform(10, 20))
if all_data:
combined = pd.concat(all_data, ignore_index=True)
# Deduplicate by URL
combined = combined.drop_duplicates(subset=["url"])
return combined
return pd.DataFrame()
Anti-Bot Protection and How to Handle It {#anti-bot}
Instructables is owned by Autodesk and uses standard web protections — not as aggressive as DoorDash or LinkedIn, but consistent crawling will get you blocked.
What You're Up Against
Cloudflare: Instructables uses Cloudflare. Standard datacenter IPs (AWS, GCP, DigitalOcean) receive Cloudflare challenges. Residential IPs bypass most of these checks automatically.
Rate Limiting: Sustained fast crawling triggers 429 responses. Keep 3-7 seconds between requests from the same IP.
Session-Based Detection: Autodesk tracks browsing patterns. A scraper that goes directly to project pages without any referrer, cookies, or navigation history looks automated. Mitigate this by:
- Using a requests.Session() to persist cookies
- Visiting the homepage first to establish a session
- Occasionally visiting category pages between project pages
User-Agent Rotation: The same User-Agent making thousands of requests is a clear signal. Rotate from a pool of realistic browser strings.
Implementing Human-Like Browsing
class HumanLikeScraper:
"""
Simulates realistic browsing patterns:
- Session persistence with cookies
- Occasional "navigation" visits to non-target pages
- Randomized delays with occasional longer pauses
- User-agent rotation
"""
def __init__(self, proxy_url: str = None):
self.session = requests.Session()
self.proxy_url = proxy_url
self.proxies = ({"http": proxy_url, "https": proxy_url}
if proxy_url else None)
self.request_count = 0
self._initialize_session()
def _initialize_session(self):
"""Start with a homepage visit to get cookies."""
try:
self.session.get(
BASE_URL,
headers=get_headers(),
proxies=self.proxies,
timeout=15,
)
time.sleep(random.uniform(2, 5))
except Exception:
pass
def _maybe_browse_randomly(self):
"""Occasionally visit a random category page to look more human."""
if random.random() < 0.1: # 10% chance
cat = random.choice(list(INSTRUCTABLES_CATEGORIES.keys()))
url = f"{BASE_URL}/{cat}/"
try:
self.session.get(
url,
headers=get_headers(referer=BASE_URL),
proxies=self.proxies,
timeout=15,
)
time.sleep(random.uniform(1, 3))
except Exception:
pass
def get(self, url: str) -> requests.Response:
"""Make a request with human-like behavior."""
self.request_count += 1
self._maybe_browse_randomly()
# Longer pause every ~20 requests
if self.request_count % 20 == 0:
pause = random.uniform(30, 90)
print(f"Taking a break ({pause:.0f}s) after {self.request_count} requests")
time.sleep(pause)
else:
time.sleep(random.uniform(3, 8))
        referer = BASE_URL  # a consistent referer looks more like real navigation
resp = self.session.get(
url,
headers=get_headers(referer=referer),
proxies=self.proxies,
timeout=30,
)
if resp.status_code == 429:
wait = int(resp.headers.get("Retry-After", 120))
print(f"Rate limited. Waiting {wait}s")
time.sleep(wait + random.uniform(10, 30))
return self.get(url)
resp.raise_for_status()
return resp
Residential Proxies for Scale
For building a large dataset — tens of thousands of projects for component analysis or ML training — you need residential proxies. Cloudflare challenges datacenter IPs constantly; residential IPs look like real Autodesk users browsing for project inspiration.
ThorData provides rotating residential proxies that work well with Instructables:
def get_thordata_proxy(username: str, password: str,
country: str = "US",
session_id: str = None) -> str:
"""
Generate a ThorData proxy URL.
- Without session_id: rotating (new IP per request)
- With session_id: sticky (same IP for the session)
"""
if session_id:
user = f"{username}-country-{country}-session-{session_id}"
else:
user = f"{username}-country-{country}"
return f"http://{user}:{password}@proxy.thordata.com:9000"
class ThorDataScraper(HumanLikeScraper):
"""Extends HumanLikeScraper with ThorData proxy rotation."""
def __init__(self, td_username: str, td_password: str,
rotate_every: int = 20):
self.td_username = td_username
self.td_password = td_password
self.rotate_every = rotate_every
self._session_counter = 0
proxy = get_thordata_proxy(td_username, td_password,
session_id=f"init-{int(time.time())}")
super().__init__(proxy_url=proxy)
def get(self, url: str) -> requests.Response:
# Rotate proxy every N requests
if self.request_count > 0 and self.request_count % self.rotate_every == 0:
self._session_counter += 1
new_proxy = get_thordata_proxy(
self.td_username, self.td_password,
session_id=f"session-{self._session_counter}"
)
self.proxy_url = new_proxy
self.proxies = {"http": new_proxy, "https": new_proxy}
print(f"Rotated to proxy session {self._session_counter}")
return super().get(url)
Collecting Data at Scale {#scale}
def scrape_projects_batch(project_urls: list[str],
scraper: HumanLikeScraper,
include_steps: bool = True,
include_supplies: bool = True) -> list[dict]:
"""
Scrape a batch of project URLs with full data extraction.
"""
results = []
for i, url in enumerate(project_urls):
print(f"[{i+1}/{len(project_urls)}] {url}")
try:
# Get basic project data using the scraper's session
resp = scraper.get(url)
soup = BeautifulSoup(resp.text, "lxml")
# Basic metadata
project = {
"url": url,
"scraped_at": datetime.utcnow().isoformat(),
}
title_el = soup.find("h1")
project["title"] = title_el.get_text(strip=True) if title_el else ""
# Get steps from same soup
steps = []
for j, container in enumerate(soup.find_all("section", class_="step")):
title = container.find(["h2", "h3"])
body = container.find("div", class_="step-body")
images = container.find_all("img")
steps.append({
"step_number": j,
"title": title.get_text(strip=True) if title else f"Step {j}",
"text": body.get_text(separator="\n", strip=True) if body else "",
"image_count": len(images),
})
project["steps"] = steps if include_steps else []
project["step_count"] = len(steps)
# Supplies from same soup
if include_supplies:
supply_section = soup.find("section", class_="step-supplies")
if not supply_section:
for section in soup.find_all("section", class_="step"):
t = section.find(["h2", "h3"])
if t and any(kw in t.get_text().lower()
for kw in ["supplie", "material", "component"]):
supply_section = section
break
if supply_section:
items = []
for li in supply_section.find_all("li"):
text = li.get_text(strip=True)
if text:
items.append(text)
project["supplies"] = items
else:
project["supplies"] = []
results.append(project)
print(f" OK — {project['title'][:60]} | "
f"{project['step_count']} steps, "
f"{len(project.get('supplies', []))} supplies")
except Exception as e:
print(f" FAILED: {e}")
results.append({"url": url, "error": str(e)})
return results
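When a batch runs for hours, persisting each result as it arrives means a crash loses at most one record. A minimal JSON Lines checkpoint sketch (the file name and helper names are illustrative):

```python
import json
from pathlib import Path

def append_checkpoint(project: dict, path: str = "scraped_projects.jsonl") -> None:
    """Append one scraped project as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(project, ensure_ascii=False) + "\n")

def load_checkpoint(path: str = "scraped_projects.jsonl") -> dict:
    """Load previously scraped projects keyed by URL, for skip-on-resume."""
    done = {}
    p = Path(path)
    if p.exists():
        for line in p.read_text(encoding="utf-8").splitlines():
            if line.strip():
                record = json.loads(line)
                done[record["url"]] = record
    return done
```

Before calling `scrape_projects_batch`, filter the URL list against `load_checkpoint()` so a restart picks up where it left off.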
Database Design for Maker Data {#database}
def create_instructables_db(db_path: str = "instructables.db") -> sqlite3.Connection:
"""Create the SQLite schema for Instructables data."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL") # better concurrent access
c = conn.cursor()
c.executescript("""
CREATE TABLE IF NOT EXISTS projects (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE NOT NULL,
slug TEXT,
title TEXT,
author TEXT,
author_url TEXT,
category TEXT,
main_category TEXT,
description TEXT,
difficulty TEXT,
views INTEGER,
favorites INTEGER,
comments INTEGER,
contest_wins INTEGER DEFAULT 0,
step_count INTEGER DEFAULT 0,
total_images INTEGER DEFAULT 0,
published TEXT,
license TEXT,
scraped_at TEXT,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS project_steps (
id INTEGER PRIMARY KEY AUTOINCREMENT,
project_url TEXT REFERENCES projects(url),
step_number INTEGER,
title TEXT,
text TEXT,
image_count INTEGER DEFAULT 0,
has_code BOOLEAN DEFAULT 0
);
CREATE TABLE IF NOT EXISTS project_supplies (
id INTEGER PRIMARY KEY AUTOINCREMENT,
project_url TEXT REFERENCES projects(url),
raw_text TEXT,
normalized_name TEXT,
quantity INTEGER DEFAULT 1,
amazon_url TEXT,
category TEXT -- classified component type
);
CREATE TABLE IF NOT EXISTS supply_categories (
normalized_name TEXT PRIMARY KEY,
category TEXT,
subcategory TEXT
);
CREATE INDEX IF NOT EXISTS idx_projects_category ON projects(category);
CREATE INDEX IF NOT EXISTS idx_projects_views ON projects(views DESC);
CREATE INDEX IF NOT EXISTS idx_projects_favorites ON projects(favorites DESC);
CREATE INDEX IF NOT EXISTS idx_supplies_name ON project_supplies(normalized_name);
CREATE INDEX IF NOT EXISTS idx_supplies_project ON project_supplies(project_url);
""")
conn.commit()
return conn
def save_project(conn: sqlite3.Connection, project: dict):
"""Save a scraped project to the database."""
c = conn.cursor()
# Upsert project
c.execute("""
INSERT INTO projects (url, slug, title, author, author_url, category,
description, difficulty, views, favorites, comments,
contest_wins, step_count, published, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
views=excluded.views,
favorites=excluded.favorites,
step_count=excluded.step_count,
updated_at=CURRENT_TIMESTAMP
""", (
project.get("url"),
project.get("slug"),
project.get("title"),
project.get("author"),
project.get("author_url"),
project.get("category"),
project.get("description"),
project.get("difficulty"),
project.get("views"),
project.get("favorites"),
project.get("comments"),
project.get("contest_wins", 0),
project.get("step_count", 0),
project.get("published"),
project.get("scraped_at"),
))
url = project.get("url")
# Clear and re-insert steps
c.execute("DELETE FROM project_steps WHERE project_url=?", (url,))
for step in project.get("steps", []):
c.execute("""
INSERT INTO project_steps (project_url, step_number, title,
text, image_count, has_code)
VALUES (?, ?, ?, ?, ?, ?)
""", (
url, step["step_number"], step["title"],
step["text"], step["image_count"],
step.get("has_code", False)
))
# Clear and re-insert supplies
c.execute("DELETE FROM project_supplies WHERE project_url=?", (url,))
    for supply in project.get("supplies", []):
        if isinstance(supply, str):
            raw = supply
            normalized = normalize_component_name(supply)
            qty = 1
        else:
            raw = supply.get("text", "")
            normalized = normalize_component_name(supply.get("name", raw))
            qty = supply.get("quantity", 1)
        c.execute("""
            INSERT INTO project_supplies (project_url, raw_text,
                                          normalized_name, quantity)
            VALUES (?, ?, ?, ?)
        """, (url, raw, normalized, qty))
conn.commit()
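Once supplies are in SQLite, component frequency is a single aggregate query. A quick offline check of the query shape, using an in-memory database with only the `project_supplies` columns it touches (the sample rows are made up):

```python
import sqlite3

# In-memory stand-in: just the project_supplies columns this query uses.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE project_supplies (
        project_url TEXT, raw_text TEXT,
        normalized_name TEXT, quantity INTEGER DEFAULT 1
    )
""")
rows = [
    ("p1", "2x Arduino Nano", "arduino nano", 2),
    ("p2", "Arduino Nano", "arduino nano", 1),
    ("p2", "10k resistor", "resistor", 1),
]
conn.executemany("INSERT INTO project_supplies VALUES (?, ?, ?, ?)", rows)

# Most common components, counted by distinct projects that use them
top = conn.execute("""
    SELECT normalized_name, COUNT(DISTINCT project_url) AS n_projects
    FROM project_supplies
    GROUP BY normalized_name
    ORDER BY n_projects DESC
""").fetchall()
print(top)  # [('arduino nano', 2), ('resistor', 1)]
```

The same query runs unchanged against the full `instructables.db` schema.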
Async Scraping for Speed {#async}
When collecting thousands of projects, async scraping dramatically reduces total time. Use asyncio + httpx for concurrent requests with per-domain rate limiting.
import asyncio
import httpx
from datetime import datetime
async def async_get_project(client: httpx.AsyncClient,
url: str,
semaphore: asyncio.Semaphore) -> dict:
"""Async version of project fetcher."""
async with semaphore:
await asyncio.sleep(random.uniform(2, 5))
try:
resp = await client.get(
url,
headers=get_headers(referer=BASE_URL),
follow_redirects=True,
timeout=30,
)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
title_el = soup.find("h1")
steps = soup.find_all("section", class_="step")
return {
"url": url,
"title": title_el.get_text(strip=True) if title_el else "",
"step_count": len(steps),
"scraped_at": datetime.utcnow().isoformat(),
"status": "ok",
}
except Exception as e:
return {"url": url, "error": str(e), "status": "failed"}
async def async_scrape_batch(urls: list[str],
max_concurrent: int = 5,
proxy_url: str = None) -> list[dict]:
"""
Scrape multiple Instructables project URLs concurrently.
max_concurrent=5 is a good balance of speed vs. politeness.
"""
semaphore = asyncio.Semaphore(max_concurrent)
client_kwargs = {
"headers": get_headers(),
"timeout": 30,
"follow_redirects": True,
}
    if proxy_url:
        # httpx >= 0.26 takes proxy=; older releases used proxies=
        client_kwargs["proxy"] = proxy_url
async with httpx.AsyncClient(**client_kwargs) as client:
tasks = [async_get_project(client, url, semaphore) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
return [r for r in results if isinstance(r, dict)]
# Example: collect a category and then detail-scrape all projects
async def full_pipeline(category_slug: str, max_category_pages: int = 5):
"""Full pipeline: list → detail scrape → save."""
print(f"Phase 1: Collecting URLs from {category_slug}")
session = requests.Session()
df = collect_category_data(
category_slug, max_pages=max_category_pages, session=session
)
urls = df["url"].tolist()
print(f"Phase 2: Detail-scraping {len(urls)} projects")
results = await async_scrape_batch(urls, max_concurrent=5)
conn = create_instructables_db()
saved = 0
for result in results:
if result.get("status") == "ok":
save_project(conn, result)
saved += 1
print(f"Phase 3: Saved {saved}/{len(urls)} projects")
conn.close()
return results
# asyncio.run(full_pipeline("circuits/arduino", max_category_pages=3))
Proxy Setup with ThorData {#proxies}
For serious Instructables scraping — building training datasets, component trend analysis, or recommendation systems — residential proxies are essential.
Cloudflare's challenge rate correlates with IP reputation. Residential IPs from ThorData have much better reputation scores than shared datacenter IPs.
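When a request does get challenged, the response body usually is not project HTML at all, so it pays to detect that before parsing. A heuristic sketch; the marker strings are assumptions and drift as Cloudflare updates its challenge pages:

```python
# Common fragments of Cloudflare challenge pages. These are heuristics,
# not a stable API — treat a positive as "probably blocked" and back off.
CHALLENGE_MARKERS = [
    "just a moment",
    "checking your browser",
    "cf-challenge",
    "turnstile",
]

def looks_like_challenge(status_code: int, html: str) -> bool:
    """Rough check for a Cloudflare challenge instead of real content."""
    if status_code in (403, 503):
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

On a positive, rotate to a fresh proxy session and retry after a pause instead of saving the response.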
Full Proxied Async Scraper
async def async_scrape_with_thordata(
urls: list[str],
td_username: str,
td_password: str,
max_concurrent: int = 5,
rotate_every: int = 15,
) -> list[dict]:
"""
Async Instructables scraper with ThorData proxy rotation.
Rotates proxy every `rotate_every` requests.
"""
    semaphore = asyncio.Semaphore(max_concurrent)
async def fetch_one(url: str, request_num: int) -> dict:
async with semaphore:
# Calculate session ID for rotation
session_id = f"session-{request_num // rotate_every}"
proxy_url = (
f"http://{td_username}-country-US-session-{session_id}:"
f"{td_password}@proxy.thordata.com:9000"
)
await asyncio.sleep(random.uniform(2, 6))
try:
                async with httpx.AsyncClient(
                    proxy=proxy_url,  # httpx >= 0.26; older releases used proxies=
headers=get_headers(),
timeout=30,
follow_redirects=True,
) as client:
resp = await client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
title_el = soup.find("h1")
steps = soup.find_all("section", class_="step")
supply_section = soup.find("section", class_="step-supplies")
supplies = []
if supply_section:
supplies = [
li.get_text(strip=True)
for li in supply_section.find_all("li")
if li.get_text(strip=True)
]
return {
"url": url,
"title": title_el.get_text(strip=True) if title_el else "",
"step_count": len(steps),
"supplies": supplies,
"status": "ok",
}
except Exception as e:
return {"url": url, "error": str(e), "status": "failed"}
tasks = [fetch_one(url, i) for i, url in enumerate(urls)]
return await asyncio.gather(*tasks, return_exceptions=True)
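The `request_num // rotate_every` arithmetic is what produces sticky sessions: each batch of requests shares one session ID (and therefore one residential IP) before rotating. A standalone sketch of the grouping:

```python
def session_id_for(request_num: int, rotate_every: int = 15) -> str:
    """Map a request number to a sticky proxy session ID.

    Requests 0-14 share session-0, requests 15-29 share session-1,
    and so on, so each IP handles a human-plausible burst of traffic
    before the rotation kicks in.
    """
    return f"session-{request_num // rotate_every}"


ids = [session_id_for(n) for n in range(45)]
# 45 requests with rotate_every=15 use exactly 3 distinct sessions
print(sorted(set(ids)))  # → ['session-0', 'session-1', 'session-2']
```

Tune `rotate_every` to taste: lower values rotate IPs faster (more anonymity, more Cloudflare "first visit" checks), higher values look more like one returning visitor.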
Playwright Fallback for JS-Heavy Pages {#playwright}
Some newer Instructables pages load content dynamically. If your BeautifulSoup scraper consistently returns empty steps, use Playwright:
from playwright.async_api import async_playwright
import asyncio
async def playwright_scrape_project(url: str,
                                    proxy_url: str | None = None) -> dict:
"""Playwright-based scraper for JS-heavy Instructables pages."""
async with async_playwright() as p:
launch_opts = {"headless": True}
if proxy_url:
launch_opts["proxy"] = {"server": proxy_url}
browser = await p.chromium.launch(**launch_opts)
context = await browser.new_context(
user_agent=random.choice(USER_AGENTS),
viewport={"width": 1366, "height": 768},
)
page = await context.new_page()
try:
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
# Wait for steps to appear
await page.wait_for_selector("section.step", timeout=10000)
# Extract data via JavaScript
data = await page.evaluate("""() => {
const title = document.querySelector('h1')?.textContent?.trim() || '';
const steps = Array.from(document.querySelectorAll('section.step')).map((s, i) => ({
                        step_number: i + 1,  // 1-based, matching on-site step numbering
                        title: s.querySelector('h2, h3')?.textContent?.trim() || `Step ${i + 1}`,
text: s.querySelector('.step-body')?.textContent?.trim() || '',
image_count: s.querySelectorAll('img').length,
}));
const supplySection = document.querySelector('section.step-supplies');
const supplies = supplySection
? Array.from(supplySection.querySelectorAll('li')).map(li => li.textContent.trim())
: [];
return { title, steps, supplies };
}""")
return {
"url": url,
**data,
"step_count": len(data.get("steps", [])),
"scraped_at": datetime.utcnow().isoformat(),
"status": "ok",
}
except Exception as e:
return {"url": url, "error": str(e), "status": "failed"}
finally:
await browser.close()
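When to fall back to the browser is a judgment call. One simple rule (a sketch, assuming the result dicts produced by the scrapers above) is to re-queue only pages where the static fetch succeeded but found no steps:

```python
def needs_js_fallback(result: dict) -> bool:
    """Return True when a static-HTML scrape looks incomplete.

    A page that fetched fine ("ok") but yielded zero steps is the
    classic symptom of JS-rendered content. Failed fetches are a
    network or anti-bot problem and belong in a different retry path.
    """
    return result.get("status") == "ok" and result.get("step_count", 0) == 0
```

A batch runner can filter its results through this predicate and send only the flagged URLs down the (much slower) Playwright path, keeping browser usage to the handful of pages that actually need it.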
Data Analysis: Component Trends {#analysis}
Once you have thousands of projects in your database, here's how to analyze component trends:
from collections import Counter
import re
# Component classification patterns
COMPONENT_PATTERNS = {
"microcontroller": ["arduino", "esp32", "esp8266", "raspberry pi", "teensy",
"attiny", "stm32", "atmega", "pic", "nodemcu", "nano",
"uno", "mega", "leonardo"],
"sensor": ["sensor", "detector", "pir", "thermistor", "thermocouple",
"accelerometer", "gyroscope", "magnetometer", "barometer",
"humidity", "temperature", "ultrasonic", "infrared", "ir "],
"display": ["lcd", "oled", "led matrix", "7-segment", "e-ink", "e-paper",
"tft", "display", "screen"],
"motor": ["servo", "stepper", "motor", "dc motor", "brushless"],
"communication": ["bluetooth", "wifi", "nrf24", "lora", "zigbee", "433mhz",
"rf module", "can bus"],
"power": ["battery", "lipo", "18650", "boost converter", "buck converter",
"usb", "solar panel", "capacitor"],
"passive": ["resistor", "capacitor", "inductor", "diode", "transistor",
"mosfet", "relay", "led ", "leds"],
"3d_printing": ["pla", "abs", "petg", "filament", "3d print", "resin"],
"tool": ["soldering", "multimeter", "oscilloscope", "drill", "hot glue",
"wire stripper"],
}
def classify_component(name: str) -> str:
    """Classify a component name into a category.

    First match wins, so dict order matters: e.g. "capacitor" appears
    in both "power" and "passive" and will be classified as "power".
    """
    name_lower = name.lower()
    for category, keywords in COMPONENT_PATTERNS.items():
        if any(kw in name_lower for kw in keywords):
            return category
    return "other"
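A quick sanity check of the first-match classifier. This is a self-contained sketch with a trimmed copy of the pattern table, so the names `PATTERNS` and `classify` here are stand-ins for the full versions above:

```python
# Trimmed stand-in for COMPONENT_PATTERNS, just to exercise the logic
PATTERNS = {
    "microcontroller": ["arduino", "esp32", "raspberry pi"],
    "sensor": ["sensor", "pir", "ultrasonic"],
    "display": ["lcd", "oled", "e-ink"],
}

def classify(name: str) -> str:
    """First keyword match wins; unmatched names fall through to 'other'."""
    name_lower = name.lower()
    for category, keywords in PATTERNS.items():
        if any(kw in name_lower for kw in keywords):
            return category
    return "other"

print(classify("Arduino Uno R3"))             # → microcontroller
print(classify("HC-SR04 ultrasonic sensor"))  # → sensor
print(classify("0.96in OLED display"))        # → display
print(classify("M3 screws"))                  # → other
```

Keep in mind this is crude substring matching: short keywords like `"ir "` or `"nano"` can produce false positives, which is why the full table pads some keywords with spaces.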
def analyze_component_trends(conn: sqlite3.Connection,
                             category_filter: str | None = None) -> dict:
"""
Analyze which components appear most frequently.
Optionally filter by project category.
"""
c = conn.cursor()
if category_filter:
c.execute("""
SELECT ps.normalized_name, COUNT(*) as frequency
FROM project_supplies ps
JOIN projects p ON ps.project_url = p.url
WHERE p.category LIKE ?
GROUP BY ps.normalized_name
ORDER BY frequency DESC
LIMIT 100
""", (f"%{category_filter}%",))
else:
c.execute("""
SELECT normalized_name, COUNT(*) as frequency
FROM project_supplies
GROUP BY normalized_name
ORDER BY frequency DESC
LIMIT 100
""")
rows = c.fetchall()
# Aggregate by category
by_category = Counter()
by_item = {}
for name, freq in rows:
cat = classify_component(name)
by_category[cat] += freq
by_item[name] = freq
return {
"by_category": dict(by_category.most_common()),
"top_items": dict(list(by_item.items())[:30]),
"total_supply_mentions": sum(by_item.values()),
}
def trending_components_over_time(conn: sqlite3.Connection,
component: str) -> list[dict]:
"""
Track how often a specific component appears in projects over time.
Useful for spotting adoption curves (e.g., when did ESP32 take off?).
"""
c = conn.cursor()
c.execute("""
SELECT
strftime('%Y-%m', p.published) as month,
COUNT(DISTINCT ps.project_url) as project_count
FROM project_supplies ps
JOIN projects p ON ps.project_url = p.url
WHERE ps.normalized_name LIKE ?
AND p.published IS NOT NULL
GROUP BY month
ORDER BY month
""", (f"%{component.lower()}%",))
return [{"month": row[0], "count": row[1]} for row in c.fetchall()]
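The `strftime('%Y-%m', …)` month bucketing is easy to sanity-check against a throwaway in-memory database. The two tables below are minimal stand-ins for the `projects` and `project_supplies` schema used throughout this guide, and the rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE projects (url TEXT PRIMARY KEY, published TEXT)")
c.execute("CREATE TABLE project_supplies (project_url TEXT, normalized_name TEXT)")

rows = [
    ("u1", "2024-01-05", "esp32 dev board"),
    ("u2", "2024-01-20", "esp32-cam"),
    ("u3", "2024-02-02", "esp32"),
]
for url, published, supply in rows:
    c.execute("INSERT INTO projects VALUES (?, ?)", (url, published))
    c.execute("INSERT INTO project_supplies VALUES (?, ?)", (url, supply))

# Same shape as the query in trending_components_over_time
c.execute("""
    SELECT strftime('%Y-%m', p.published) AS month,
           COUNT(DISTINCT ps.project_url) AS project_count
    FROM project_supplies ps
    JOIN projects p ON ps.project_url = p.url
    WHERE ps.normalized_name LIKE '%esp32%'
    GROUP BY month ORDER BY month
""")
print(c.fetchall())  # → [('2024-01', 2), ('2024-02', 1)]
```

`COUNT(DISTINCT ps.project_url)` matters here: a project listing "esp32" twice in its supplies should still count once per month.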
def top_makers_by_views(conn: sqlite3.Connection, limit: int = 20) -> list:
"""Find the most-viewed authors in the database."""
c = conn.cursor()
c.execute("""
SELECT author, COUNT(*) as project_count,
SUM(views) as total_views,
AVG(views) as avg_views,
MAX(views) as best_project_views
FROM projects
WHERE author != ''
GROUP BY author
ORDER BY total_views DESC
LIMIT ?
""", (limit,))
return c.fetchall()
def find_high_value_projects(conn: sqlite3.Connection,
min_views: int = 10000,
min_favorites: int = 500) -> pd.DataFrame:
"""Find projects with high engagement for deeper analysis."""
c = conn.cursor()
c.execute("""
SELECT url, title, author, category, views, favorites,
step_count, published
FROM projects
WHERE views >= ? AND favorites >= ?
ORDER BY favorites DESC
""", (min_views, min_favorites))
cols = ["url", "title", "author", "category", "views",
"favorites", "step_count", "published"]
return pd.DataFrame(c.fetchall(), columns=cols)
Real-World Use Cases {#use-cases}
1. Component Shortage Early Warning
During chip shortages, tracking which components appear in tutorials can signal future supply pressure:
def detect_surge(conn: sqlite3.Connection, component: str,
weeks: int = 4) -> dict:
"""Detect if a component is appearing more often in recent projects."""
c = conn.cursor()
c.execute("""
SELECT
CASE WHEN date(p.published) >= date('now', ?) THEN 'recent' ELSE 'older' END as period,
COUNT(DISTINCT ps.project_url) as count
FROM project_supplies ps
JOIN projects p ON ps.project_url = p.url
WHERE ps.normalized_name LIKE ?
GROUP BY period
    """, (f"-{weeks * 7} days", f"%{component.lower()}%"))  # SQLite date() has no "weeks" modifier
results = dict(c.fetchall())
recent = results.get("recent", 0)
older = results.get("older", 0)
# Normalize to per-week rate
recent_rate = recent / weeks
# Assume "older" data is from ~52 weeks
older_rate = older / 52 if older else 0
return {
"component": component,
"recent_rate_per_week": recent_rate,
"historical_rate_per_week": older_rate,
"surge_factor": recent_rate / older_rate if older_rate else float("inf"),
}
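One SQLite footgun worth knowing for this kind of windowed query: the date-modifier vocabulary covers days, months, and years (plus a few specials like `start of month`), but not weeks, and an unrecognized modifier makes `date()` return NULL silently rather than raise. Week-based windows therefore have to be expressed in days:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# 'weeks' is not a valid SQLite date modifier: the result is NULL, not an error
c.execute("SELECT date('2024-06-01', '-4 weeks')")
print(c.fetchone())  # → (None,)

# The same four-week window expressed in days works as expected
c.execute("SELECT date('2024-06-01', '-28 days')")
print(c.fetchone())  # → ('2024-05-04',)
```

Because the failure mode is a silent NULL, a date-filtered `WHERE` clause with a bad modifier simply matches nothing, which is easy to misread as "no recent data."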
2. Project Recommendation Engine Input
Build training data for a "similar projects" system:
def build_project_feature_vectors(conn: sqlite3.Connection) -> pd.DataFrame:
"""
Create feature vectors for ML-based project similarity.
Features: category, component presence, step count, difficulty.
"""
c = conn.cursor()
# Get all projects with their supply lists
c.execute("""
SELECT p.url, p.category, p.step_count, p.views, p.favorites,
GROUP_CONCAT(ps.normalized_name, '|') as components
FROM projects p
LEFT JOIN project_supplies ps ON p.url = ps.project_url
GROUP BY p.url
""")
rows = c.fetchall()
    # Count component frequency across all projects, so "top 50" actually
    # means the 50 most common components (a presence set would give every
    # component a count of 1 and make most_common() arbitrary)
    comp_counter = Counter()
    for row in rows:
        if row[5]:  # components column
            comp_counter.update(row[5].split("|"))
    top_components = [c for c, _ in comp_counter.most_common(50)]
features = []
for url, category, step_count, views, favorites, components_str in rows:
project_components = set(components_str.split("|")) if components_str else set()
feature = {
"url": url,
"category": category or "unknown",
"step_count": step_count or 0,
"views": views or 0,
"favorites": favorites or 0,
}
for comp in top_components:
feature[f"has_{comp.replace(' ', '_')}"] = int(comp in project_components)
features.append(feature)
return pd.DataFrame(features)
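Once the feature vectors exist, even a plain Jaccard similarity over each project's component set makes a reasonable "similar projects" baseline before reaching for ML. A self-contained sketch (the example component sets are invented; in practice they would come from the `has_*` columns above):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |intersection| / |union|, defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


proj_a = {"arduino", "servo", "ultrasonic sensor"}
proj_b = {"arduino", "servo", "lcd"}
proj_c = {"raspberry pi", "camera"}

print(round(jaccard(proj_a, proj_b), 2))  # → 0.5  (2 shared components out of 4 total)
print(jaccard(proj_a, proj_c))            # → 0.0  (nothing in common)
```

Ranking every other project by Jaccard score against a query project gives an interpretable recommender with no training step, which also serves as a sanity baseline for anything fancier.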
3. Tracking Arduino vs Raspberry Pi Popularity
def platform_popularity_trend(conn: sqlite3.Connection) -> pd.DataFrame:
"""Compare Arduino vs RPi vs ESP32 project counts over time."""
platforms = {
"Arduino": "arduino",
"Raspberry Pi": "raspberry pi",
"ESP32": "esp32",
"ESP8266": "esp8266",
}
results = {}
for platform_name, search_term in platforms.items():
trend = trending_components_over_time(conn, search_term)
for point in trend:
month = point["month"]
if month not in results:
results[month] = {"month": month}
results[month][platform_name] = point["count"]
df = pd.DataFrame(list(results.values())).sort_values("month")
df = df.fillna(0)
return df
Tips for Clean Data {#tips}
Inconsistent HTML structure. Projects from 2010 have different markup than 2026 projects. Always use multiple selector fallbacks:
# Prefer specific selectors, fall back to generic
title = (soup.find("h1", class_="header-title") or
soup.find("h1", attrs={"itemprop": "name"}) or
soup.find("h1"))
Supply list formatting varies wildly. Some authors use bullet lists, some paragraphs, some embed product links in sentences. Always normalize:
def clean_supply_item(text: str) -> str:
"""Normalize a supply item text."""
# Remove URLs
text = re.sub(r"https?://\S+", "", text)
# Remove Amazon ASIN references
text = re.sub(r"\(ASIN[:\s]+\w+\)", "", text)
# Normalize whitespace
text = " ".join(text.split())
return text.strip()
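A quick check of the normalizer on a typically messy supply line. The helper is repeated (lightly condensed) so the snippet runs standalone, and the input string is invented:

```python
import re

def clean_supply_item(text: str) -> str:
    """Normalize a supply item: strip URLs, ASIN references, and extra whitespace."""
    text = re.sub(r"https?://\S+", "", text)       # remove bare URLs
    text = re.sub(r"\(ASIN[:\s]+\w+\)", "", text)  # remove "(ASIN: B0...)" references
    return " ".join(text.split())                  # collapse runs of whitespace


raw = "1x Arduino Nano https://amzn.to/3xYzAbC (ASIN: B0ABC123) or clone"
print(clean_supply_item(raw))  # → '1x Arduino Nano or clone'
```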
Unicode in titles. The maker community is international, so expect accents, emoji, and CJK characters in titles. Python 3's sqlite3 stores TEXT as UTF-8 and returns str by default, but it doesn't hurt to be explicit:
conn = sqlite3.connect(db_path)
conn.text_factory = str  # already the default on Python 3; guards against bytes leaking in
"Pro member" gated content. Some projects require an Instructables Pro account to view all steps. You can detect these by checking for a paywall element:
def is_pro_gated(soup: BeautifulSoup) -> bool:
"""Check if full content requires Pro membership."""
return bool(
soup.find("div", class_="pro-member-signup") or
soup.find("button", class_="pro-cta")
)
Duplicate detection. Category listings can overlap. Deduplicate by URL before detail-scraping:
urls = df["url"].drop_duplicates().tolist()  # dedupe while preserving listing order
Instructables sits at the intersection of hardware, creativity, and community knowledge. Whether you're analyzing component trends, building a recommendation engine, or creating a training dataset for project-matching AI, the data is there — just scrape it politely, store it well, and analyze it thoughtfully.
Summary
The Instructables scraping stack:
1. requests + BeautifulSoup for most content (no JS rendering needed)
2. requests.Session() for cookie persistence and human-like behavior
3. Randomized delays (3-7s) and occasional category navigation
4. Residential proxies via ThorData for large-scale collection without Cloudflare blocks
5. asyncio + httpx when you need throughput
6. SQLite for storage with proper normalization
The data quality challenge is in the supply lists — every maker formats them differently. Build robust normalization from day one and your analysis queries will be much cleaner.