Scraping Instructables for DIY Project Data with Python (2026)
Instructables is the biggest repository of DIY and maker project guides on the internet — electronics, woodworking, 3D printing, cooking, robotics, home improvement. Each project includes step-by-step instructions, component lists, photos, and community feedback.
For researchers studying maker culture, builders tracking component trends, or anyone creating a project recommendation engine, it's excellent data.
This guide covers everything: the site structure, anti-bot handling, extracting steps and materials, collecting at scale, storing the results, and real-world analysis ideas.
Table of Contents
- Site Architecture and Strategy
- Environment Setup
- Scraping a Project Page
- Extracting Step-by-Step Instructions
- Component and Material Lists
- Navigating Category Listings
- Anti-Bot Protection and How to Handle It
- Collecting Data at Scale
- Database Design for Maker Data
- Async Scraping for Speed
- Proxy Setup with ThorData
- Playwright Fallback for JS-Heavy Pages
- Data Analysis: Component Trends
- Real-World Use Cases
- Tips for Clean Data
Site Architecture and Strategy {#architecture}
Instructables serves most content as static HTML. The core project data — title, steps, descriptions, materials — is baked into the initial page response. This is the best case for scraping: no JavaScript rendering required for the main content.
However, the site does use JavaScript for some interactive elements:
- Comment loading (paginated via AJAX)
- "Pro member" content (gated behind login)
- Some embedded tool lists on newer projects
For 95% of use cases, requests + BeautifulSoup is all you need.
The URL structure:
# Project page
https://www.instructables.com/PROJECT-SLUG/
# Author profile
https://www.instructables.com/member/USERNAME/
# Category listing
https://www.instructables.com/CATEGORY/
# Tag search
https://www.instructables.com/tag/type-id/TOPIC/
Category slugs include: circuits, workshop, craft, cooking, living, outside, teachers. Each has subcategories: circuits/arduino, circuits/raspberry-pi, workshop/3d-printing, etc.
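These patterns are simple enough to capture in small helpers. A minimal sketch for building URLs from slugs (`BASE_URL` is repeated here so the snippet stands alone; the `?page=` parameter mirrors the pagination used on category listings):

```python
BASE_URL = "https://www.instructables.com"

def project_url(slug: str) -> str:
    """URL for a single project page."""
    return f"{BASE_URL}/{slug}/"

def member_url(username: str) -> str:
    """URL for an author profile."""
    return f"{BASE_URL}/member/{username}/"

def category_url(category_slug: str, page: int = 1) -> str:
    """URL for a category listing, optionally paginated."""
    url = f"{BASE_URL}/{category_slug}/"
    return url if page == 1 else f"{url}?page={page}"
```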
Environment Setup {#setup}
pip install requests beautifulsoup4 pandas lxml httpx aiohttp
(sqlite3 and asyncio ship with Python's standard library, so they cannot and need not be installed with pip.)
For anti-detection and proxy support:
pip install requests beautifulsoup4 lxml pandas playwright curl-cffi
playwright install chromium
Basic configuration shared across all examples:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import sqlite3
import json
import re
from datetime import datetime
from urllib.parse import urljoin, urlparse
BASE_URL = "https://www.instructables.com"
# Rotate through multiple user agents
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]
def get_headers(referer: str = None) -> dict:
"""Generate randomized browser headers."""
headers = {
"User-Agent": random.choice(USER_AGENTS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "same-origin" if referer else "none",
"Sec-Fetch-User": "?1",
}
if referer:
headers["Referer"] = referer
return headers
def polite_get(url: str, session: requests.Session = None,
               proxy: str = None, min_delay: float = 3.0,
               max_delay: float = 7.0, max_retries: int = 3) -> requests.Response:
    """
    Polite HTTP GET with delay, bounded retry on 429, and optional proxy.
    """
    time.sleep(random.uniform(min_delay, max_delay))
    headers = get_headers(referer=BASE_URL)
    proxies = {"http": proxy, "https": proxy} if proxy else None
    requester = session if session else requests
    resp = requester.get(url, headers=headers, proxies=proxies, timeout=30)
    if resp.status_code == 429:
        if max_retries <= 0:
            resp.raise_for_status()
        retry_after = int(resp.headers.get("Retry-After", 60))
        print(f"Rate limited. Waiting {retry_after}s")
        time.sleep(retry_after + random.uniform(5, 15))
        return polite_get(url, session, proxy, min_delay, max_delay,
                          max_retries - 1)
    resp.raise_for_status()
    return resp
Scraping a Project Page {#project-page}
Each Instructables project page contains the core metadata in the HTML. The structure has stayed relatively stable over the years, though Autodesk occasionally updates class names.
def get_project(project_url: str, session: requests.Session = None,
proxy: str = None) -> dict:
"""
Scrape a full project page.
Returns metadata: title, author, category, stats, description.
"""
resp = polite_get(project_url, session=session, proxy=proxy)
soup = BeautifulSoup(resp.text, "lxml")
project = {"url": project_url, "scraped_at": datetime.utcnow().isoformat()}
# Title — try multiple selectors (site structure varies by age of post)
for selector in [("h1", "header-title"), ("h1", "title"), ("h1", None)]:
tag, cls = selector
el = soup.find(tag, class_=cls) if cls else soup.find(tag)
if el:
project["title"] = el.get_text(strip=True)
break
# Author
author_el = (soup.find("a", class_="member-header-display") or
soup.find("a", attrs={"itemprop": "author"}))
project["author"] = author_el.get_text(strip=True) if author_el else ""
if author_el:
project["author_url"] = urljoin(BASE_URL, author_el.get("href", ""))
# Category breadcrumb
breadcrumbs = soup.find_all("a", class_="breadcrumb")
if not breadcrumbs:
breadcrumbs = soup.find_all("a", attrs={"itemprop": "item"})
project["category"] = " > ".join(b.get_text(strip=True) for b in breadcrumbs)
# Stats
for stat_class, field in [
("views-count", "views"),
("favorites-count", "favorites"),
("comments-count", "comments"),
]:
el = soup.find("span", class_=stat_class)
if el:
text = el.get_text(strip=True).replace(",", "")
try:
project[field] = int(re.sub(r"[^\d]", "", text))
except ValueError:
project[field] = 0
# Date published
date_el = (soup.find("meta", attrs={"itemprop": "datePublished"}) or
soup.find("time", class_="publish-date"))
if date_el:
project["published"] = (date_el.get("content") or
date_el.get("datetime") or
date_el.get_text(strip=True))
# Introduction / description (first step body)
intro = (soup.find("div", class_="step-body") or
soup.find("div", class_="intro"))
if intro:
project["description"] = intro.get_text(separator=" ", strip=True)[:1000]
# Difficulty level
difficulty_el = soup.find("span", class_="difficulty")
project["difficulty"] = difficulty_el.get_text(strip=True) if difficulty_el else ""
# Contest wins
contests = soup.find_all("a", class_="contest-winner")
project["contest_wins"] = len(contests)
# License
license_el = soup.find("a", class_="license")
project["license"] = license_el.get_text(strip=True) if license_el else ""
return project
# Example usage
# session = requests.Session()
# project = get_project("https://www.instructables.com/Wooden-LED-Cube/", session)
# print(project["title"], "—", project["views"], "views")
Extracting Step-by-Step Instructions {#steps}
The step structure is what makes Instructables data unique. Each project breaks down a build into numbered steps with text and images.
def get_project_steps(project_url: str, session: requests.Session = None,
proxy: str = None) -> list[dict]:
"""
Extract all steps from a project.
Returns list of dicts with step number, title, text, image URLs.
"""
resp = polite_get(project_url, session=session, proxy=proxy)
soup = BeautifulSoup(resp.text, "lxml")
steps = []
# Steps are in <section class="step"> containers
step_containers = soup.find_all("section", class_="step")
    # Fallback: older projects use <div id="step..."> containers
    if not step_containers:
        step_containers = soup.find_all("div", id=re.compile(r"^step"))
for i, container in enumerate(step_containers):
step = {
"step_number": i,
"is_intro": "intro" in container.get("class", []),
}
# Step title
title_el = container.find(["h2", "h3"])
step["title"] = title_el.get_text(strip=True) if title_el else f"Step {i}"
# Step body text
body_el = container.find("div", class_="step-body")
if body_el:
# Get clean text, preserving paragraph breaks
paragraphs = body_el.find_all(["p", "li"])
if paragraphs:
step["text"] = "\n".join(
p.get_text(separator=" ", strip=True)
for p in paragraphs if p.get_text(strip=True)
)
else:
step["text"] = body_el.get_text(separator="\n", strip=True)
else:
step["text"] = ""
# Image URLs
images = container.find_all("img")
step["image_urls"] = [
img.get("src") or img.get("data-src", "")
for img in images
if img.get("src") or img.get("data-src")
]
step["image_count"] = len(step["image_urls"])
# Embedded files (some steps include downloadable files)
file_links = container.find_all("a", class_="pdf-link")
step["files"] = [a.get("href", "") for a in file_links]
# Code blocks (common in electronics projects)
code_blocks = container.find_all(["code", "pre"])
step["code_snippets"] = [
cb.get_text() for cb in code_blocks if cb.get_text(strip=True)
]
step["has_code"] = len(step["code_snippets"]) > 0
if step["title"] or step["text"]:
steps.append(step)
return steps
def get_full_project(project_url: str, session: requests.Session = None,
proxy: str = None) -> dict:
"""Get project metadata AND all steps in one call."""
resp = polite_get(project_url, session=session, proxy=proxy)
soup = BeautifulSoup(resp.text, "lxml")
# Metadata
project = {"url": project_url, "scraped_at": datetime.utcnow().isoformat()}
title_el = soup.find("h1")
project["title"] = title_el.get_text(strip=True) if title_el else ""
# Steps (reuse the same soup object — no extra request needed)
steps_data = []
for i, container in enumerate(soup.find_all("section", class_="step")):
title_el = container.find(["h2", "h3"])
body_el = container.find("div", class_="step-body")
images = container.find_all("img")
steps_data.append({
"step_number": i,
"title": title_el.get_text(strip=True) if title_el else f"Step {i}",
"text": body_el.get_text(separator="\n", strip=True) if body_el else "",
"image_count": len(images),
})
project["steps"] = steps_data
project["step_count"] = len(steps_data)
project["total_images"] = sum(s["image_count"] for s in steps_data)
project["total_chars"] = sum(len(s["text"]) for s in steps_data)
return project
Component and Material Lists {#materials}
The materials list is the gold mine for maker data analysis. It reveals what components are commonly needed across different project types.
def get_supplies(project_url: str, session: requests.Session = None,
proxy: str = None) -> dict:
"""
Extract the supplies/materials list from a project.
Returns structured data with items and links.
"""
resp = polite_get(project_url, session=session, proxy=proxy)
soup = BeautifulSoup(resp.text, "lxml")
# Method 1: Look for dedicated supplies section
supply_section = soup.find("section", class_="step-supplies")
# Method 2: Look for a step titled "Supplies", "Materials", "Components"
if not supply_section:
supply_keywords = ["supplie", "material", "component",
"you will need", "what you need", "tools",
"parts list", "shopping list", "bill of materials"]
for section in soup.find_all("section", class_="step"):
title_el = section.find(["h2", "h3"])
if title_el:
title_text = title_el.get_text().lower()
if any(kw in title_text for kw in supply_keywords):
supply_section = section
break
result = {
"items": [],
"tool_items": [],
"amazon_links": [],
"other_links": [],
}
if not supply_section:
return result
# Extract list items
for li in supply_section.find_all("li"):
text = li.get_text(strip=True)
if not text:
continue
item_data = {"text": text}
        # Extract quantity when present ("2x Arduino Nano", "3 LEDs").
        # Require an explicit x/× or whitespace after the digits so that
        # values like "10K ohm resistor" are not misread as quantity 10.
        qty_match = re.match(r"^(\d+)\s*(?:[xX×]\s*|\s+)(.+)", text)
if qty_match:
item_data["quantity"] = int(qty_match.group(1))
item_data["name"] = qty_match.group(2).strip()
else:
item_data["quantity"] = 1
item_data["name"] = text
# Extract links (often Amazon affiliate links)
links = li.find_all("a")
item_links = []
for link in links:
href = link.get("href", "")
if href:
if "amazon" in href:
result["amazon_links"].append(href)
item_data["amazon_url"] = href
else:
result["other_links"].append(href)
item_links.append(href)
item_data["links"] = item_links
result["items"].append(item_data)
# Also capture any paragraph text in the supplies section
body_el = supply_section.find("div", class_="step-body")
if body_el:
# Sometimes materials are listed as paragraphs, not <li>
paragraphs = body_el.find_all("p")
for p in paragraphs:
text = p.get_text(strip=True)
if text and len(text) < 200: # short = likely a component name
# Check if it's not already captured as a list item
if not any(item["text"] == text for item in result["items"]):
result["items"].append({
"text": text,
"quantity": 1,
"name": text,
"links": [],
})
return result
def normalize_component_name(raw_name: str) -> str:
    """
    Normalize component names for analysis, e.g.
    "2x Arduino Nano v3" → "arduino nano"
    "Resistor 10K (SMD)" → "resistor 10k"
    """
    name = raw_name.lower()
    # Remove quantity prefixes like "2x " or "3 "; require an x/× or
    # whitespace after the digits so "10k resistor" keeps its value
    name = re.sub(r"^\d+\s*(?:[x×]\s*|\s+)", "", name)
    # Remove parenthetical notes
    name = re.sub(r"\([^)]+\)", "", name)
    # Remove version numbers
    name = re.sub(r"\bv\d+(\.\d+)?\b", "", name)
    # Normalize spacing
    name = " ".join(name.split())
    return name.strip()
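Normalized names can then be grouped into coarse buckets for trend analysis. A minimal keyword-based classifier sketch; the bucket names and keyword lists are illustrative assumptions, not an official taxonomy:

```python
# Illustrative keyword → category map for grouping normalized component
# names. Extend it to fit your dataset; the first matching bucket wins.
COMPONENT_KEYWORDS = {
    "microcontroller": ["arduino", "esp32", "esp8266", "raspberry pi", "attiny"],
    "passive": ["resistor", "capacitor", "inductor", "potentiometer"],
    "output": ["led", "speaker", "servo", "motor", "display"],
    "sensor": ["sensor", "thermistor", "photoresistor", "accelerometer"],
    "power": ["battery", "power supply", "regulator", "solar"],
    "material": ["plywood", "acrylic", "pla", "filament", "screw", "glue"],
}

def classify_component(normalized_name: str) -> str:
    """Return the first bucket whose keyword appears in the name."""
    for category, keywords in COMPONENT_KEYWORDS.items():
        if any(kw in normalized_name for kw in keywords):
            return category
    return "other"
```

Because matching is first-hit, order the map from most to least specific keywords.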
Navigating Category Listings {#categories}
To collect projects at scale, you start from category listing pages and paginate through them.
def get_category_projects(category_slug: str, page: int = 1,
                          session: requests.Session = None,
                          proxy: str = None) -> tuple[list[dict], bool]:
"""
Scrape a category page to get project URLs and metadata.
Args:
category_slug: e.g., "circuits/arduino", "workshop/3d-printing"
page: page number (1-indexed)
"""
url = f"{BASE_URL}/{category_slug}/?page={page}"
resp = polite_get(url, session=session, proxy=proxy)
soup = BeautifulSoup(resp.text, "lxml")
projects = []
# Project cards — multiple possible selectors
cards = (soup.find_all("a", class_="ible-card") or
soup.find_all("div", class_="instructable-link") or
soup.find_all("article", class_="instructable"))
for card in cards:
project = {}
# URL
href = card.get("href") if card.name == "a" else None
if not href:
link = card.find("a")
href = link.get("href") if link else None
if not href:
continue
project["url"] = urljoin(BASE_URL, href)
# Extract slug from URL
slug_match = re.search(r"/([^/]+)/?$", project["url"])
project["slug"] = slug_match.group(1) if slug_match else ""
# Title
title_el = (card.find("strong", class_="title") or
card.find(["h2", "h3"]) or
card.find("span", class_="title"))
project["title"] = title_el.get_text(strip=True) if title_el else card.get("title", "")
# Author
author_el = card.find("span", class_="author")
project["author"] = author_el.get_text(strip=True) if author_el else ""
# Views
views_el = card.find("span", class_="views")
if views_el:
views_text = views_el.get_text(strip=True).replace(",", "")
try:
project["views"] = int(re.sub(r"[^\d]", "", views_text))
except ValueError:
project["views"] = 0
# Favorites
favs_el = card.find("span", class_="favorites")
if favs_el:
favs_text = favs_el.get_text(strip=True).replace(",", "")
try:
project["favorites"] = int(re.sub(r"[^\d]", "", favs_text))
except ValueError:
project["favorites"] = 0
# Thumbnail
img = card.find("img")
        project["thumbnail"] = (img.get("src") or img.get("data-src", "")) if img else ""
project["category"] = category_slug
projects.append(project)
# Check for next page
next_btn = soup.find("a", rel="next")
has_next = bool(next_btn)
return projects, has_next
def collect_category_data(category_slug: str, max_pages: int = 5,
session: requests.Session = None,
proxy: str = None) -> pd.DataFrame:
"""
Collect all projects from a category up to max_pages.
"""
all_projects = []
for page in range(1, max_pages + 1):
try:
projects, has_next = get_category_projects(
category_slug, page=page,
session=session, proxy=proxy
)
if not projects:
print(f" No projects on page {page}, stopping")
break
all_projects.extend(projects)
print(f" Page {page}: {len(projects)} projects "
f"(total: {len(all_projects)})")
if not has_next:
print(f" Reached last page ({page})")
break
except Exception as e:
print(f" Error on page {page}: {e}")
break
return pd.DataFrame(all_projects)
# Example: collect Arduino projects
# session = requests.Session()
# df = collect_category_data("circuits/arduino", max_pages=10, session=session)
# print(df.sort_values("views", ascending=False).head(10))
Browsing All Top-Level Categories
INSTRUCTABLES_CATEGORIES = {
"circuits": [
"circuits/arduino",
"circuits/raspberry-pi",
"circuits/microcontrollers",
"circuits/sensors",
"circuits/power",
"circuits/computers",
"circuits/leds",
],
"workshop": [
"workshop/3d-printing",
"workshop/woodworking",
"workshop/metalworking",
"workshop/cnc",
"workshop/laser-cutting",
],
"craft": [
"craft/sewing",
"craft/knitting-and-crochet",
"craft/paper",
"craft/costumes",
],
"cooking": [
"cooking/main-course",
"cooking/snacks-and-appetizers",
"cooking/baking",
"cooking/canning-and-preserving",
],
}
def crawl_all_categories(max_pages_per_category: int = 3) -> pd.DataFrame:
"""Crawl all major categories and compile a master dataset."""
session = requests.Session()
all_data = []
for category, subcategories in INSTRUCTABLES_CATEGORIES.items():
for subcat in subcategories:
print(f"\nScraping {subcat}...")
try:
df = collect_category_data(
subcat,
max_pages=max_pages_per_category,
session=session
)
df["main_category"] = category
all_data.append(df)
except Exception as e:
print(f" Failed {subcat}: {e}")
# Rest between subcategories
time.sleep(random.uniform(10, 20))
if all_data:
combined = pd.concat(all_data, ignore_index=True)
# Deduplicate by URL
combined = combined.drop_duplicates(subset=["url"])
return combined
return pd.DataFrame()
Anti-Bot Protection and How to Handle It {#anti-bot}
Instructables is owned by Autodesk and uses standard web protections — not as aggressive as DoorDash or LinkedIn, but consistent crawling will get you blocked.
What You're Up Against
Cloudflare: Instructables uses Cloudflare. Standard datacenter IPs (AWS, GCP, DigitalOcean) receive Cloudflare challenges. Residential IPs bypass most of these checks automatically.
Rate Limiting: Sustained fast crawling triggers 429 responses. Keep 3-7 seconds between requests from the same IP.
Session-Based Detection: Autodesk tracks browsing patterns. A scraper that goes directly to project pages without any referrer, cookies, or navigation history looks automated. Mitigate this by:
- Using a requests.Session() to persist cookies
- Visiting the homepage first to establish a session
- Occasionally visiting category pages between project pages
User-Agent Rotation: The same User-Agent making thousands of requests is a clear signal. Rotate from a pool of realistic browser strings.
Implementing Human-Like Browsing
class HumanLikeScraper:
"""
Simulates realistic browsing patterns:
- Session persistence with cookies
- Occasional "navigation" visits to non-target pages
- Randomized delays with occasional longer pauses
- User-agent rotation
"""
def __init__(self, proxy_url: str = None):
self.session = requests.Session()
self.proxy_url = proxy_url
self.proxies = ({"http": proxy_url, "https": proxy_url}
if proxy_url else None)
self.request_count = 0
self._initialize_session()
def _initialize_session(self):
"""Start with a homepage visit to get cookies."""
try:
self.session.get(
BASE_URL,
headers=get_headers(),
proxies=self.proxies,
timeout=15,
)
time.sleep(random.uniform(2, 5))
except Exception:
pass
def _maybe_browse_randomly(self):
"""Occasionally visit a random category page to look more human."""
if random.random() < 0.1: # 10% chance
cat = random.choice(list(INSTRUCTABLES_CATEGORIES.keys()))
url = f"{BASE_URL}/{cat}/"
try:
self.session.get(
url,
headers=get_headers(referer=BASE_URL),
proxies=self.proxies,
timeout=15,
)
time.sleep(random.uniform(1, 3))
except Exception:
pass
def get(self, url: str) -> requests.Response:
"""Make a request with human-like behavior."""
self.request_count += 1
self._maybe_browse_randomly()
# Longer pause every ~20 requests
if self.request_count % 20 == 0:
pause = random.uniform(30, 90)
print(f"Taking a break ({pause:.0f}s) after {self.request_count} requests")
time.sleep(pause)
else:
time.sleep(random.uniform(3, 8))
        referer = BASE_URL  # a consistent referer looks more like real navigation
resp = self.session.get(
url,
headers=get_headers(referer=referer),
proxies=self.proxies,
timeout=30,
)
if resp.status_code == 429:
wait = int(resp.headers.get("Retry-After", 120))
print(f"Rate limited. Waiting {wait}s")
time.sleep(wait + random.uniform(10, 30))
return self.get(url)
resp.raise_for_status()
return resp
Residential Proxies for Scale
For building a large dataset — tens of thousands of projects for component analysis or ML training — you need residential proxies. Cloudflare challenges datacenter IPs constantly; residential IPs look like real Autodesk users browsing for project inspiration.
ThorData provides rotating residential proxies that work well with Instructables:
def get_thordata_proxy(username: str, password: str,
country: str = "US",
session_id: str = None) -> str:
"""
Generate a ThorData proxy URL.
- Without session_id: rotating (new IP per request)
- With session_id: sticky (same IP for the session)
"""
if session_id:
user = f"{username}-country-{country}-session-{session_id}"
else:
user = f"{username}-country-{country}"
return f"http://{user}:{password}@proxy.thordata.com:9000"
class ThorDataScraper(HumanLikeScraper):
"""Extends HumanLikeScraper with ThorData proxy rotation."""
def __init__(self, td_username: str, td_password: str,
rotate_every: int = 20):
self.td_username = td_username
self.td_password = td_password
self.rotate_every = rotate_every
self._session_counter = 0
proxy = get_thordata_proxy(td_username, td_password,
session_id=f"init-{int(time.time())}")
super().__init__(proxy_url=proxy)
def get(self, url: str) -> requests.Response:
# Rotate proxy every N requests
if self.request_count > 0 and self.request_count % self.rotate_every == 0:
self._session_counter += 1
new_proxy = get_thordata_proxy(
self.td_username, self.td_password,
session_id=f"session-{self._session_counter}"
)
self.proxy_url = new_proxy
self.proxies = {"http": new_proxy, "https": new_proxy}
print(f"Rotated to proxy session {self._session_counter}")
return super().get(url)
Collecting Data at Scale {#scale}
def scrape_projects_batch(project_urls: list[str],
scraper: HumanLikeScraper,
include_steps: bool = True,
include_supplies: bool = True) -> list[dict]:
"""
Scrape a batch of project URLs with full data extraction.
"""
results = []
for i, url in enumerate(project_urls):
print(f"[{i+1}/{len(project_urls)}] {url}")
try:
# Get basic project data using the scraper's session
resp = scraper.get(url)
soup = BeautifulSoup(resp.text, "lxml")
# Basic metadata
project = {
"url": url,
"scraped_at": datetime.utcnow().isoformat(),
}
title_el = soup.find("h1")
project["title"] = title_el.get_text(strip=True) if title_el else ""
# Get steps from same soup
steps = []
for j, container in enumerate(soup.find_all("section", class_="step")):
title = container.find(["h2", "h3"])
body = container.find("div", class_="step-body")
images = container.find_all("img")
steps.append({
"step_number": j,
"title": title.get_text(strip=True) if title else f"Step {j}",
"text": body.get_text(separator="\n", strip=True) if body else "",
"image_count": len(images),
})
project["steps"] = steps if include_steps else []
project["step_count"] = len(steps)
# Supplies from same soup
if include_supplies:
supply_section = soup.find("section", class_="step-supplies")
if not supply_section:
for section in soup.find_all("section", class_="step"):
t = section.find(["h2", "h3"])
if t and any(kw in t.get_text().lower()
for kw in ["supplie", "material", "component"]):
supply_section = section
break
if supply_section:
items = []
for li in supply_section.find_all("li"):
text = li.get_text(strip=True)
if text:
items.append(text)
project["supplies"] = items
else:
project["supplies"] = []
results.append(project)
print(f" OK — {project['title'][:60]} | "
f"{project['step_count']} steps, "
f"{len(project.get('supplies', []))} supplies")
except Exception as e:
print(f" FAILED: {e}")
results.append({"url": url, "error": str(e)})
return results
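When a batch runs for hours, persisting each result as it arrives means a crash loses at most one record. A minimal JSON Lines checkpoint sketch (the file name and helper names are illustrative):

```python
import json
from pathlib import Path

def append_checkpoint(project: dict, path: str = "scraped_projects.jsonl") -> None:
    """Append one scraped project as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(project, ensure_ascii=False) + "\n")

def load_checkpoint(path: str = "scraped_projects.jsonl") -> dict:
    """Load previously scraped projects keyed by URL, for skip-on-resume."""
    done = {}
    p = Path(path)
    if p.exists():
        for line in p.read_text(encoding="utf-8").splitlines():
            if line.strip():
                record = json.loads(line)
                done[record["url"]] = record
    return done
```

Before calling `scrape_projects_batch`, filter the URL list against `load_checkpoint()` so a restart picks up where it left off.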
Database Design for Maker Data {#database}
def create_instructables_db(db_path: str = "instructables.db") -> sqlite3.Connection:
"""Create the SQLite schema for Instructables data."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL") # better concurrent access
c = conn.cursor()
c.executescript("""
CREATE TABLE IF NOT EXISTS projects (
id INTEGER PRIMARY KEY AUTOINCREMENT,
url TEXT UNIQUE NOT NULL,
slug TEXT,
title TEXT,
author TEXT,
author_url TEXT,
category TEXT,
main_category TEXT,
description TEXT,
difficulty TEXT,
views INTEGER,
favorites INTEGER,
comments INTEGER,
contest_wins INTEGER DEFAULT 0,
step_count INTEGER DEFAULT 0,
total_images INTEGER DEFAULT 0,
published TEXT,
license TEXT,
scraped_at TEXT,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS project_steps (
id INTEGER PRIMARY KEY AUTOINCREMENT,
project_url TEXT REFERENCES projects(url),
step_number INTEGER,
title TEXT,
text TEXT,
image_count INTEGER DEFAULT 0,
has_code BOOLEAN DEFAULT 0
);
CREATE TABLE IF NOT EXISTS project_supplies (
id INTEGER PRIMARY KEY AUTOINCREMENT,
project_url TEXT REFERENCES projects(url),
raw_text TEXT,
normalized_name TEXT,
quantity INTEGER DEFAULT 1,
amazon_url TEXT,
category TEXT -- classified component type
);
CREATE TABLE IF NOT EXISTS supply_categories (
normalized_name TEXT PRIMARY KEY,
category TEXT,
subcategory TEXT
);
CREATE INDEX IF NOT EXISTS idx_projects_category ON projects(category);
CREATE INDEX IF NOT EXISTS idx_projects_views ON projects(views DESC);
CREATE INDEX IF NOT EXISTS idx_projects_favorites ON projects(favorites DESC);
CREATE INDEX IF NOT EXISTS idx_supplies_name ON project_supplies(normalized_name);
CREATE INDEX IF NOT EXISTS idx_supplies_project ON project_supplies(project_url);
""")
conn.commit()
return conn
def save_project(conn: sqlite3.Connection, project: dict):
"""Save a scraped project to the database."""
c = conn.cursor()
# Upsert project
c.execute("""
INSERT INTO projects (url, slug, title, author, author_url, category,
description, difficulty, views, favorites, comments,
contest_wins, step_count, published, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(url) DO UPDATE SET
views=excluded.views,
favorites=excluded.favorites,
step_count=excluded.step_count,
updated_at=CURRENT_TIMESTAMP
""", (
project.get("url"),
project.get("slug"),
project.get("title"),
project.get("author"),
project.get("author_url"),
project.get("category"),
project.get("description"),
project.get("difficulty"),
project.get("views"),
project.get("favorites"),
project.get("comments"),
project.get("contest_wins", 0),
project.get("step_count", 0),
project.get("published"),
project.get("scraped_at"),
))
url = project.get("url")
# Clear and re-insert steps
c.execute("DELETE FROM project_steps WHERE project_url=?", (url,))
for step in project.get("steps", []):
c.execute("""
INSERT INTO project_steps (project_url, step_number, title,
text, image_count, has_code)
VALUES (?, ?, ?, ?, ?, ?)
""", (
url, step["step_number"], step["title"],
step["text"], step["image_count"],
step.get("has_code", False)
))
# Clear and re-insert supplies
c.execute("DELETE FROM project_supplies WHERE project_url=?", (url,))
    for supply in project.get("supplies", []):
        if isinstance(supply, str):
            raw = supply
            normalized = normalize_component_name(supply)
            qty = 1
        else:
            raw = supply.get("text", "")
            normalized = normalize_component_name(supply.get("name", raw))
            qty = supply.get("quantity", 1)
        c.execute("""
            INSERT INTO project_supplies (project_url, raw_text,
                                          normalized_name, quantity)
            VALUES (?, ?, ?, ?)
        """, (url, raw, normalized, qty))
conn.commit()
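Once supplies are in SQLite, component frequency is a single aggregate query. A quick offline check of the query shape, using an in-memory database with only the `project_supplies` columns it touches (the sample rows are made up):

```python
import sqlite3

# In-memory stand-in: just the project_supplies columns this query uses.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE project_supplies (
        project_url TEXT, raw_text TEXT,
        normalized_name TEXT, quantity INTEGER DEFAULT 1
    )
""")
rows = [
    ("p1", "2x Arduino Nano", "arduino nano", 2),
    ("p2", "Arduino Nano", "arduino nano", 1),
    ("p2", "10k resistor", "resistor", 1),
]
conn.executemany("INSERT INTO project_supplies VALUES (?, ?, ?, ?)", rows)

# Most common components, counted by distinct projects that use them
top = conn.execute("""
    SELECT normalized_name, COUNT(DISTINCT project_url) AS n_projects
    FROM project_supplies
    GROUP BY normalized_name
    ORDER BY n_projects DESC
""").fetchall()
print(top)  # [('arduino nano', 2), ('resistor', 1)]
```

The same query runs unchanged against the full `instructables.db` schema.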
Async Scraping for Speed {#async}
When collecting thousands of projects, async scraping dramatically reduces total time. Use asyncio + httpx for concurrent requests with per-domain rate limiting.
import asyncio
import httpx
from datetime import datetime
async def async_get_project(client: httpx.AsyncClient,
url: str,
semaphore: asyncio.Semaphore) -> dict:
"""Async version of project fetcher."""
async with semaphore:
await asyncio.sleep(random.uniform(2, 5))
try:
resp = await client.get(
url,
headers=get_headers(referer=BASE_URL),
follow_redirects=True,
timeout=30,
)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
title_el = soup.find("h1")
steps = soup.find_all("section", class_="step")
return {
"url": url,
"title": title_el.get_text(strip=True) if title_el else "",
"step_count": len(steps),
"scraped_at": datetime.utcnow().isoformat(),
"status": "ok",
}
except Exception as e:
return {"url": url, "error": str(e), "status": "failed"}
async def async_scrape_batch(urls: list[str],
max_concurrent: int = 5,
proxy_url: str = None) -> list[dict]:
"""
Scrape multiple Instructables project URLs concurrently.
max_concurrent=5 is a good balance of speed vs. politeness.
"""
semaphore = asyncio.Semaphore(max_concurrent)
client_kwargs = {
"headers": get_headers(),
"timeout": 30,
"follow_redirects": True,
}
    if proxy_url:
        # httpx >= 0.26 takes proxy=; older releases used proxies=
        client_kwargs["proxy"] = proxy_url
async with httpx.AsyncClient(**client_kwargs) as client:
tasks = [async_get_project(client, url, semaphore) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
return [r for r in results if isinstance(r, dict)]
# Example: collect a category and then detail-scrape all projects
async def full_pipeline(category_slug: str, max_category_pages: int = 5):
"""Full pipeline: list → detail scrape → save."""
print(f"Phase 1: Collecting URLs from {category_slug}")
session = requests.Session()
df = collect_category_data(
category_slug, max_pages=max_category_pages, session=session
)
urls = df["url"].tolist()
print(f"Phase 2: Detail-scraping {len(urls)} projects")
results = await async_scrape_batch(urls, max_concurrent=5)
conn = create_instructables_db()
saved = 0
for result in results:
if result.get("status") == "ok":
save_project(conn, result)
saved += 1
print(f"Phase 3: Saved {saved}/{len(urls)} projects")
conn.close()
return results
# asyncio.run(full_pipeline("circuits/arduino", max_category_pages=3))
Proxy Setup with ThorData {#proxies}
For serious Instructables scraping — building training datasets, component trend analysis, or recommendation systems — residential proxies are essential.
Cloudflare's challenge rate correlates with IP reputation. Residential IPs from ThorData have much better reputation scores than shared datacenter IPs.
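When a request does get challenged, the response body usually is not project HTML at all, so it pays to detect that before parsing. A heuristic sketch; the marker strings are assumptions and drift as Cloudflare updates its challenge pages:

```python
# Common fragments of Cloudflare challenge pages. These are heuristics,
# not a stable API — treat a positive as "probably blocked" and back off.
CHALLENGE_MARKERS = [
    "just a moment",
    "checking your browser",
    "cf-challenge",
    "turnstile",
]

def looks_like_challenge(status_code: int, html: str) -> bool:
    """Rough check for a Cloudflare challenge instead of real content."""
    if status_code in (403, 503):
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

On a positive, rotate to a fresh proxy session and retry after a pause instead of saving the response.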
Full Proxied Async Scraper
async def async_scrape_with_thordata(
urls: list[str],
td_username: str,
td_password: str,
max_concurrent: int = 5,
rotate_every: int = 15,
) -> list[dict]:
"""
Async Instructables scraper with ThorData proxy rotation.
Rotates proxy every `rotate_every` requests.
"""
    semaphore = asyncio.Semaphore(max_concurrent)
async def fetch_one(url: str, request_num: int) -> dict:
async with semaphore:
# Calculate session ID for rotation
session_id = f"session-{request_num // rotate_every}"
proxy_url = (
f"http://{td_username}-country-US-session-{session_id}:"
f"{td_password}@proxy.thordata.com:9000"
)
await asyncio.sleep(random.uniform(2, 6))
try:
                async with httpx.AsyncClient(
                    proxy=proxy_url,  # httpx >= 0.26; older releases used proxies=
headers=get_headers(),
timeout=30,
follow_redirects=True,
) as client:
resp = await client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
title_el = soup.find("h1")
steps = soup.find_all("section", class_="step")
supply_section = soup.find("section", class_="step-supplies")
supplies = []
if supply_section:
supplies = [
li.get_text(strip=True)
for li in supply_section.find_all("li")
if li.get_text(strip=True)
]
return {
"url": url,
"title": title_el.get_text(strip=True) if title_el else "",
"step_count": len(steps),
"supplies": supplies,
"status": "ok",
}
except Exception as e:
return {"url": url, "error": str(e), "status": "failed"}
tasks = [fetch_one(url, i) for i, url in enumerate(urls)]
return await asyncio.gather(*tasks, return_exceptions=True)
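The `request_num // rotate_every` arithmetic is what produces sticky sessions: each batch of requests shares one session ID (and therefore one residential IP) before rotating. A standalone sketch of the grouping:

```python
def session_id_for(request_num: int, rotate_every: int = 15) -> str:
    """Map a request number to a sticky proxy session ID.

    Requests 0-14 share session-0, requests 15-29 share session-1,
    and so on, so each IP handles a human-plausible burst of traffic
    before the rotation kicks in.
    """
    return f"session-{request_num // rotate_every}"


ids = [session_id_for(n) for n in range(45)]
# 45 requests with rotate_every=15 use exactly 3 distinct sessions
print(sorted(set(ids)))  # → ['session-0', 'session-1', 'session-2']
```

Tune `rotate_every` to taste: lower values rotate IPs faster (more anonymity, more Cloudflare "first visit" checks), higher values look more like one returning visitor.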
Playwright Fallback for JS-Heavy Pages {#playwright}
Some newer Instructables pages load content dynamically. If your BeautifulSoup scraper consistently returns empty steps, use Playwright:
from playwright.async_api import async_playwright
import asyncio
async def playwright_scrape_project(url: str,
                                    proxy_url: str | None = None) -> dict:
"""Playwright-based scraper for JS-heavy Instructables pages."""
async with async_playwright() as p:
launch_opts = {"headless": True}
if proxy_url:
launch_opts["proxy"] = {"server": proxy_url}
browser = await p.chromium.launch(**launch_opts)
context = await browser.new_context(
user_agent=random.choice(USER_AGENTS),
viewport={"width": 1366, "height": 768},
)
page = await context.new_page()
try:
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
# Wait for steps to appear
await page.wait_for_selector("section.step", timeout=10000)
# Extract data via JavaScript
data = await page.evaluate("""() => {
const title = document.querySelector('h1')?.textContent?.trim() || '';
const steps = Array.from(document.querySelectorAll('section.step')).map((s, i) => ({
                        step_number: i + 1,  // 1-based, matching on-site step numbering
                        title: s.querySelector('h2, h3')?.textContent?.trim() || `Step ${i + 1}`,
text: s.querySelector('.step-body')?.textContent?.trim() || '',
image_count: s.querySelectorAll('img').length,
}));
const supplySection = document.querySelector('section.step-supplies');
const supplies = supplySection
? Array.from(supplySection.querySelectorAll('li')).map(li => li.textContent.trim())
: [];
return { title, steps, supplies };
}""")
return {
"url": url,
**data,
"step_count": len(data.get("steps", [])),
"scraped_at": datetime.utcnow().isoformat(),
"status": "ok",
}
except Exception as e:
return {"url": url, "error": str(e), "status": "failed"}
finally:
await browser.close()
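When to fall back to the browser is a judgment call. One simple rule (a sketch, assuming the result dicts produced by the scrapers above) is to re-queue only pages where the static fetch succeeded but found no steps:

```python
def needs_js_fallback(result: dict) -> bool:
    """Return True when a static-HTML scrape looks incomplete.

    A page that fetched fine ("ok") but yielded zero steps is the
    classic symptom of JS-rendered content. Failed fetches are a
    network or anti-bot problem and belong in a different retry path.
    """
    return result.get("status") == "ok" and result.get("step_count", 0) == 0
```

A batch runner can filter its results through this predicate and send only the flagged URLs down the (much slower) Playwright path, keeping browser usage to the handful of pages that actually need it.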
Data Analysis: Component Trends {#analysis}
Once you have thousands of projects in your database, here's how to analyze component trends:
from collections import Counter
import re
# Component classification patterns
COMPONENT_PATTERNS = {
"microcontroller": ["arduino", "esp32", "esp8266", "raspberry pi", "teensy",
"attiny", "stm32", "atmega", "pic", "nodemcu", "nano",
"uno", "mega", "leonardo"],
"sensor": ["sensor", "detector", "pir", "thermistor", "thermocouple",
"accelerometer", "gyroscope", "magnetometer", "barometer",
"humidity", "temperature", "ultrasonic", "infrared", "ir "],
"display": ["lcd", "oled", "led matrix", "7-segment", "e-ink", "e-paper",
"tft", "display", "screen"],
"motor": ["servo", "stepper", "motor", "dc motor", "brushless"],
"communication": ["bluetooth", "wifi", "nrf24", "lora", "zigbee", "433mhz",
"rf module", "can bus"],
"power": ["battery", "lipo", "18650", "boost converter", "buck converter",
"usb", "solar panel", "capacitor"],
"passive": ["resistor", "capacitor", "inductor", "diode", "transistor",
"mosfet", "relay", "led ", "leds"],
"3d_printing": ["pla", "abs", "petg", "filament", "3d print", "resin"],
"tool": ["soldering", "multimeter", "oscilloscope", "drill", "hot glue",
"wire stripper"],
}
def classify_component(name: str) -> str:
    """Classify a component name into a category.

    First match wins, so dict order matters: e.g. "capacitor" appears
    in both "power" and "passive" and will be classified as "power".
    """
    name_lower = name.lower()
    for category, keywords in COMPONENT_PATTERNS.items():
        if any(kw in name_lower for kw in keywords):
            return category
    return "other"
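A quick sanity check of the first-match classifier. This is a self-contained sketch with a trimmed copy of the pattern table, so the names `PATTERNS` and `classify` here are stand-ins for the full versions above:

```python
# Trimmed stand-in for COMPONENT_PATTERNS, just to exercise the logic
PATTERNS = {
    "microcontroller": ["arduino", "esp32", "raspberry pi"],
    "sensor": ["sensor", "pir", "ultrasonic"],
    "display": ["lcd", "oled", "e-ink"],
}

def classify(name: str) -> str:
    """First keyword match wins; unmatched names fall through to 'other'."""
    name_lower = name.lower()
    for category, keywords in PATTERNS.items():
        if any(kw in name_lower for kw in keywords):
            return category
    return "other"

print(classify("Arduino Uno R3"))             # → microcontroller
print(classify("HC-SR04 ultrasonic sensor"))  # → sensor
print(classify("0.96in OLED display"))        # → display
print(classify("M3 screws"))                  # → other
```

Keep in mind this is crude substring matching: short keywords like `"ir "` or `"nano"` can produce false positives, which is why the full table pads some keywords with spaces.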
def analyze_component_trends(conn: sqlite3.Connection,
                             category_filter: str | None = None) -> dict:
"""
Analyze which components appear most frequently.
Optionally filter by project category.
"""
c = conn.cursor()
if category_filter:
c.execute("""
SELECT ps.normalized_name, COUNT(*) as frequency
FROM project_supplies ps
JOIN projects p ON ps.project_url = p.url
WHERE p.category LIKE ?
GROUP BY ps.normalized_name
ORDER BY frequency DESC
LIMIT 100
""", (f"%{category_filter}%",))
else:
c.execute("""
SELECT normalized_name, COUNT(*) as frequency
FROM project_supplies
GROUP BY normalized_name
ORDER BY frequency DESC
LIMIT 100
""")
rows = c.fetchall()
# Aggregate by category
by_category = Counter()
by_item = {}
for name, freq in rows:
cat = classify_component(name)
by_category[cat] += freq
by_item[name] = freq
return {
"by_category": dict(by_category.most_common()),
"top_items": dict(list(by_item.items())[:30]),
"total_supply_mentions": sum(by_item.values()),
}
def trending_components_over_time(conn: sqlite3.Connection,
component: str) -> list[dict]:
"""
Track how often a specific component appears in projects over time.
Useful for spotting adoption curves (e.g., when did ESP32 take off?).
"""
c = conn.cursor()
c.execute("""
SELECT
strftime('%Y-%m', p.published) as month,
COUNT(DISTINCT ps.project_url) as project_count
FROM project_supplies ps
JOIN projects p ON ps.project_url = p.url
WHERE ps.normalized_name LIKE ?
AND p.published IS NOT NULL
GROUP BY month
ORDER BY month
""", (f"%{component.lower()}%",))
return [{"month": row[0], "count": row[1]} for row in c.fetchall()]
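The `strftime('%Y-%m', …)` month bucketing is easy to sanity-check against a throwaway in-memory database. The two tables below are minimal stand-ins for the `projects` and `project_supplies` schema used throughout this guide, and the rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE projects (url TEXT PRIMARY KEY, published TEXT)")
c.execute("CREATE TABLE project_supplies (project_url TEXT, normalized_name TEXT)")

rows = [
    ("u1", "2024-01-05", "esp32 dev board"),
    ("u2", "2024-01-20", "esp32-cam"),
    ("u3", "2024-02-02", "esp32"),
]
for url, published, supply in rows:
    c.execute("INSERT INTO projects VALUES (?, ?)", (url, published))
    c.execute("INSERT INTO project_supplies VALUES (?, ?)", (url, supply))

# Same shape as the query in trending_components_over_time
c.execute("""
    SELECT strftime('%Y-%m', p.published) AS month,
           COUNT(DISTINCT ps.project_url) AS project_count
    FROM project_supplies ps
    JOIN projects p ON ps.project_url = p.url
    WHERE ps.normalized_name LIKE '%esp32%'
    GROUP BY month ORDER BY month
""")
print(c.fetchall())  # → [('2024-01', 2), ('2024-02', 1)]
```

`COUNT(DISTINCT ps.project_url)` matters here: a project listing "esp32" twice in its supplies should still count once per month.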
def top_makers_by_views(conn: sqlite3.Connection, limit: int = 20) -> list:
"""Find the most-viewed authors in the database."""
c = conn.cursor()
c.execute("""
SELECT author, COUNT(*) as project_count,
SUM(views) as total_views,
AVG(views) as avg_views,
MAX(views) as best_project_views
FROM projects
WHERE author != ''
GROUP BY author
ORDER BY total_views DESC
LIMIT ?
""", (limit,))
return c.fetchall()
def find_high_value_projects(conn: sqlite3.Connection,
min_views: int = 10000,
min_favorites: int = 500) -> pd.DataFrame:
"""Find projects with high engagement for deeper analysis."""
c = conn.cursor()
c.execute("""
SELECT url, title, author, category, views, favorites,
step_count, published
FROM projects
WHERE views >= ? AND favorites >= ?
ORDER BY favorites DESC
""", (min_views, min_favorites))
cols = ["url", "title", "author", "category", "views",
"favorites", "step_count", "published"]
return pd.DataFrame(c.fetchall(), columns=cols)
Real-World Use Cases {#use-cases}
1. Component Shortage Early Warning
During chip shortages, tracking which components appear in tutorials can signal future supply pressure:
def detect_surge(conn: sqlite3.Connection, component: str,
weeks: int = 4) -> dict:
"""Detect if a component is appearing more often in recent projects."""
c = conn.cursor()
c.execute("""
SELECT
CASE WHEN date(p.published) >= date('now', ?) THEN 'recent' ELSE 'older' END as period,
COUNT(DISTINCT ps.project_url) as count
FROM project_supplies ps
JOIN projects p ON ps.project_url = p.url
WHERE ps.normalized_name LIKE ?
GROUP BY period
    """, (f"-{weeks * 7} days", f"%{component.lower()}%"))  # SQLite date() has no "weeks" modifier
results = dict(c.fetchall())
recent = results.get("recent", 0)
older = results.get("older", 0)
# Normalize to per-week rate
recent_rate = recent / weeks
# Assume "older" data is from ~52 weeks
older_rate = older / 52 if older else 0
return {
"component": component,
"recent_rate_per_week": recent_rate,
"historical_rate_per_week": older_rate,
"surge_factor": recent_rate / older_rate if older_rate else float("inf"),
}
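One SQLite footgun worth knowing for this kind of windowed query: the date-modifier vocabulary covers days, months, and years (plus a few specials like `start of month`), but not weeks, and an unrecognized modifier makes `date()` return NULL silently rather than raise. Week-based windows therefore have to be expressed in days:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# 'weeks' is not a valid SQLite date modifier: the result is NULL, not an error
c.execute("SELECT date('2024-06-01', '-4 weeks')")
print(c.fetchone())  # → (None,)

# The same four-week window expressed in days works as expected
c.execute("SELECT date('2024-06-01', '-28 days')")
print(c.fetchone())  # → ('2024-05-04',)
```

Because the failure mode is a silent NULL, a date-filtered `WHERE` clause with a bad modifier simply matches nothing, which is easy to misread as "no recent data."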
2. Project Recommendation Engine Input
Build training data for a "similar projects" system:
def build_project_feature_vectors(conn: sqlite3.Connection) -> pd.DataFrame:
"""
Create feature vectors for ML-based project similarity.
Features: category, component presence, step count, difficulty.
"""
c = conn.cursor()
# Get all projects with their supply lists
c.execute("""
SELECT p.url, p.category, p.step_count, p.views, p.favorites,
GROUP_CONCAT(ps.normalized_name, '|') as components
FROM projects p
LEFT JOIN project_supplies ps ON p.url = ps.project_url
GROUP BY p.url
""")
rows = c.fetchall()
    # Count component frequency across all projects, so "top 50" actually
    # means the 50 most common components (a presence set would give every
    # component a count of 1 and make most_common() arbitrary)
    comp_counter = Counter()
    for row in rows:
        if row[5]:  # components column
            comp_counter.update(row[5].split("|"))
    top_components = [c for c, _ in comp_counter.most_common(50)]
features = []
for url, category, step_count, views, favorites, components_str in rows:
project_components = set(components_str.split("|")) if components_str else set()
feature = {
"url": url,
"category": category or "unknown",
"step_count": step_count or 0,
"views": views or 0,
"favorites": favorites or 0,
}
for comp in top_components:
feature[f"has_{comp.replace(' ', '_')}"] = int(comp in project_components)
features.append(feature)
return pd.DataFrame(features)
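Once the feature vectors exist, even a plain Jaccard similarity over each project's component set makes a reasonable "similar projects" baseline before reaching for ML. A self-contained sketch (the example component sets are invented; in practice they would come from the `has_*` columns above):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |intersection| / |union|, defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


proj_a = {"arduino", "servo", "ultrasonic sensor"}
proj_b = {"arduino", "servo", "lcd"}
proj_c = {"raspberry pi", "camera"}

print(round(jaccard(proj_a, proj_b), 2))  # → 0.5  (2 shared components out of 4 total)
print(jaccard(proj_a, proj_c))            # → 0.0  (nothing in common)
```

Ranking every other project by Jaccard score against a query project gives an interpretable recommender with no training step, which also serves as a sanity baseline for anything fancier.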
3. Tracking Arduino vs Raspberry Pi Popularity
def platform_popularity_trend(conn: sqlite3.Connection) -> pd.DataFrame:
"""Compare Arduino vs RPi vs ESP32 project counts over time."""
platforms = {
"Arduino": "arduino",
"Raspberry Pi": "raspberry pi",
"ESP32": "esp32",
"ESP8266": "esp8266",
}
results = {}
for platform_name, search_term in platforms.items():
trend = trending_components_over_time(conn, search_term)
for point in trend:
month = point["month"]
if month not in results:
results[month] = {"month": month}
results[month][platform_name] = point["count"]
df = pd.DataFrame(list(results.values())).sort_values("month")
df = df.fillna(0)
return df
Tips for Clean Data {#tips}
Inconsistent HTML structure. Projects from 2010 have different markup than 2026 projects. Always use multiple selector fallbacks:
# Prefer specific selectors, fall back to generic
title = (soup.find("h1", class_="header-title") or
soup.find("h1", attrs={"itemprop": "name"}) or
soup.find("h1"))
Supply list formatting varies wildly. Some authors use bullet lists, some paragraphs, some embed product links in sentences. Always normalize:
def clean_supply_item(text: str) -> str:
"""Normalize a supply item text."""
# Remove URLs
text = re.sub(r"https?://\S+", "", text)
# Remove Amazon ASIN references
text = re.sub(r"\(ASIN[:\s]+\w+\)", "", text)
# Normalize whitespace
text = " ".join(text.split())
return text.strip()
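A quick check of the normalizer on a typically messy supply line. The helper is repeated (lightly condensed) so the snippet runs standalone, and the input string is invented:

```python
import re

def clean_supply_item(text: str) -> str:
    """Normalize a supply item: strip URLs, ASIN references, and extra whitespace."""
    text = re.sub(r"https?://\S+", "", text)       # remove bare URLs
    text = re.sub(r"\(ASIN[:\s]+\w+\)", "", text)  # remove "(ASIN: B0...)" references
    return " ".join(text.split())                  # collapse runs of whitespace


raw = "1x Arduino Nano https://amzn.to/3xYzAbC (ASIN: B0ABC123) or clone"
print(clean_supply_item(raw))  # → '1x Arduino Nano or clone'
```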
Unicode in titles. The maker community is international, so expect accents, emoji, and CJK characters in titles. Python 3's sqlite3 stores TEXT as UTF-8 and returns str by default, but it doesn't hurt to be explicit:
conn = sqlite3.connect(db_path)
conn.text_factory = str  # already the default on Python 3; guards against bytes leaking in
"Pro member" gated content. Some projects require an Instructables Pro account to view all steps. You can detect these by checking for a paywall element:
def is_pro_gated(soup: BeautifulSoup) -> bool:
"""Check if full content requires Pro membership."""
return bool(
soup.find("div", class_="pro-member-signup") or
soup.find("button", class_="pro-cta")
)
Duplicate detection. Category listings can overlap. Deduplicate by URL before detail-scraping:
urls = df["url"].drop_duplicates().tolist()  # dedupe while preserving listing order
Instructables sits at the intersection of hardware, creativity, and community knowledge. Whether you're analyzing component trends, building a recommendation engine, or creating a training dataset for project-matching AI, the data is there — just scrape it politely, store it well, and analyze it thoughtfully.
Summary
The Instructables scraping stack:
1. requests + BeautifulSoup for most content (no JS rendering needed)
2. requests.Session() for cookie persistence and human-like behavior
3. Randomized delays (3-7s) and occasional category navigation
4. Residential proxies via ThorData for large-scale collection without Cloudflare blocks
5. asyncio + httpx when you need throughput
6. SQLite for storage with proper normalization
The data quality challenge is in the supply lists — every maker formats them differently. Build robust normalization from day one and your analysis queries will be much cleaner.