Scraping Home Depot: Product Data, Pricing, and Availability (2026)
Home Depot is interesting to scrape because it sits between two worlds. Their website is a modern React app with aggressive bot protection, but their backend APIs are surprisingly well-structured once you have a valid session. The product data is rich — not just prices and SKUs, but installation guides, project calculators, and real-time store inventory by location. That makes it useful for price monitoring, competitor analysis, inventory tracking, or building product comparison tools.
Here's what works for getting data out of homedepot.com in 2026.
What Data Is Available
Home Depot product pages pack a lot of information:
- Product details — title, brand, model number, Home Depot SKU (item ID), UPC, GTIN
- Pricing — regular price, sale price, bulk pricing tiers, unit pricing (price per sq ft, per gallon, etc.)
- Availability — online stock status, store-level inventory (by zip code or store ID), delivery estimates
- Specifications — dimensions, weight, material, color, power requirements, detailed spec tables
- Reviews — star rating, review count, individual review text, verified purchase flag, review helpful votes
- Images — multiple product photos at various resolutions, lifestyle images, dimension diagrams, instruction images
- Related products — frequently bought together, similar items, accessories, upgrade products
- Project guides — how-to content tied to product categories, linked from product pages
- Fulfillment options — ship to home, buy online pick up in store (BOPUS), curbside, direct delivery
Understanding Home Depot's Architecture
Home Depot's frontend is a React/GraphQL application. All product data flows through a single GraphQL endpoint at https://www.homedepot.com/federation-gateway/graphql. This endpoint handles search, product detail, store inventory, and several other queries — all differentiated by the operationName field in the request body.
The federation gateway aggregates multiple backend services, which is why the schema is large and some fields return nested GraphQL sub-objects. The key to using this endpoint without authentication is having valid cookies from a browser session that has already passed Home Depot's bot detection layer.
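To make the single-endpoint design concrete, here's a minimal sketch of the request envelope. Every query in this article — search, product detail, inventory — is built this same way; the gateway routes purely on operationName. The example query string below is truncated for illustration:

```python
import json

def build_gql_payload(operation_name: str, variables: dict, query: str) -> dict:
    """Assemble the request body the federation gateway expects.

    Search, product detail, and inventory all share this envelope;
    only operationName, variables, and query differ per call.
    """
    return {
        "operationName": operation_name,
        "variables": variables,
        "query": query,
    }

# The gateway routes on operationName, so one URL serves every query
payload = build_gql_payload(
    "searchModel",
    {"keyword": "cordless drill", "storeId": "121"},
    "query searchModel($keyword: String!, $storeId: String) { ... }",
)
body = json.dumps(payload)  # this JSON string is what gets POSTed
```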
Anti-Bot Measures: HUMAN Security (PerimeterX)
Home Depot runs HUMAN Security (formerly PerimeterX) as their primary bot detection. This is one of the more sophisticated anti-bot solutions on the market.
JavaScript Sensor Data
HUMAN's protection works by injecting a JavaScript agent into every page. This agent collects behavioral signals — mouse movement velocity, click patterns, scroll behavior, typing cadence, touch event characteristics — and computes a "sensor data" payload. This encoded payload is sent back to HUMAN's scoring service asynchronously. Without this payload and the cookies it sets, subsequent requests to the API endpoints return empty results or 403 responses.
Cookie Chain
The cookie sequence matters. A real browser session on homedepot.com establishes:
- _px3 — the PerimeterX validation cookie, short TTL (~30-60 min)
- _pxvid — the PerimeterX visitor ID, persists across sessions
- THD_SESSION — Home Depot's session tracking cookie
- Various A/B test and analytics cookies that PerimeterX correlates with the visitor profile
Without a valid _px3 cookie on requests to the federation gateway, you get 403s or responses where data is null.
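A cheap guard before each batch of API calls is to check that the harvested jar still contains the critical names. A sketch — the required set below reflects the cookies described above, observed on real sessions, and may drift as Home Depot changes its stack:

```python
def has_required_cookies(cookies: dict) -> bool:
    """Return True if the jar contains the cookies the gateway checks.

    _px3 is the short-lived validation cookie; _pxvid is the persistent
    visitor ID. Treat this list as an assumption, not a guarantee.
    """
    return all(cookies.get(name) for name in ("_px3", "_pxvid"))

# _px3 has expired and been dropped from this jar
stale = {"_pxvid": "9f2c...", "THD_SESSION": "abc"}
fresh = {"_px3": "e41a...", "_pxvid": "9f2c...", "THD_SESSION": "abc"}
```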
IP Reputation Scoring
PerimeterX evaluates IP reputation at the edge. Datacenter IP ranges (AWS, GCP, Azure, DigitalOcean, Vultr, OVH) start with near-zero trust scores and typically fail the challenge immediately. Home Depot's tier of PerimeterX protection blocks these requests before any JavaScript challenge even loads.
The practical solution is ThorData residential proxies. Residential IPs from real ISP customers pass the IP reputation check that kills datacenter requests outright. Their US residential pool is particularly relevant for Home Depot — US geo matters because Home Depot's pricing and availability are US-centric and some API responses include geo-dependent data. ThorData supports sticky sessions, letting you hold the same exit IP across the browser session (cookie harvest) and the subsequent API calls.
Setting Up the Session with Playwright
The cleanest approach is Playwright for cookie harvesting, then httpx for the actual API calls:
import asyncio
from playwright.async_api import async_playwright, BrowserContext

async def harvest_homedepot_cookies(proxy_host: str | None = None, proxy_port: int | None = None,
                                    proxy_user: str | None = None, proxy_pass: str | None = None) -> dict:
    """
    Launch a real browser session on homedepot.com and extract valid cookies.
    Run this every 30-45 minutes during sustained scraping.
    """
    async with async_playwright() as pw:
        launch_kwargs = {
            "headless": True,
            "args": [
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--no-sandbox",
                "--disable-setuid-sandbox",
            ],
        }
        if proxy_host:
            launch_kwargs["proxy"] = {
                "server": f"http://{proxy_host}:{proxy_port}",
                "username": proxy_user or "",
                "password": proxy_pass or "",
            }
        browser = await pw.chromium.launch(**launch_kwargs)
        context: BrowserContext = await browser.new_context(
            viewport={"width": 1440, "height": 900},
            locale="en-US",
            timezone_id="America/New_York",
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/125.0.0.0 Safari/537.36"
            ),
            extra_http_headers={
                "Accept-Language": "en-US,en;q=0.9",
            },
        )
        page = await context.new_page()

        # Load the homepage and let the PerimeterX challenge complete
        await page.goto("https://www.homedepot.com/", wait_until="networkidle", timeout=30000)
        await page.wait_for_timeout(3000)

        # Simulate realistic human interactions
        await page.mouse.move(350, 250)
        await page.wait_for_timeout(500)
        await page.mouse.move(600, 400)
        await page.wait_for_timeout(800)
        await page.evaluate("window.scrollBy(0, 200)")
        await page.wait_for_timeout(1000)
        await page.evaluate("window.scrollBy(0, 150)")
        await page.wait_for_timeout(500)

        # Optionally navigate to a category page to establish a richer session
        await page.goto(
            "https://www.homedepot.com/b/Tools-Power-Tools-Drills/N-5yc1vZc2bk",
            wait_until="networkidle",
            timeout=30000,
        )
        await page.wait_for_timeout(2000)

        cookies = await context.cookies()
        await browser.close()
        return {c["name"]: c["value"] for c in cookies}
API-First Product Scraping
Once you have valid cookies, call the GraphQL endpoint directly from Python — much faster than rendering full pages:
import httpx
import json
import time
import random

GRAPHQL_URL = "https://www.homedepot.com/federation-gateway/graphql"

BASE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.9",
    "Content-Type": "application/json",
    "Referer": "https://www.homedepot.com/",
    "Origin": "https://www.homedepot.com",
    "x-experience-name": "general-merchandise",
    "x-current-url": "/",
}
class HomeDepotClient:
    def __init__(self, cookies: dict, proxy: str | None = None):
        transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
        self.client = httpx.Client(
            headers=BASE_HEADERS,
            cookies=cookies,
            timeout=30,
            transport=transport,
            follow_redirects=True,
        )

    def _post(self, payload: dict) -> dict:
        resp = self.client.post(GRAPHQL_URL, json=payload)
        resp.raise_for_status()
        return resp.json()

    def search_products(
        self,
        keyword: str,
        store_id: str = "121",
        zip_code: str = "10001",
        offset: int = 0,
        page_size: int = 24,
    ) -> dict:
        """Search for products via the GraphQL search endpoint."""
        payload = {
            "operationName": "searchModel",
            "variables": {
                "storeId": store_id,
                "zipCode": zip_code,
                "skipInstallServices": True,
                "startIndex": offset,
                "pageSize": page_size,
                "keyword": keyword,
            },
            "query": """
                query searchModel(
                    $keyword: String!, $storeId: String,
                    $zipCode: String, $startIndex: Int, $pageSize: Int
                ) {
                    searchModel(
                        keyword: $keyword, storeId: $storeId,
                        zipCode: $zipCode, startIndex: $startIndex,
                        pageSize: $pageSize
                    ) {
                        products {
                            itemId
                            identifiers {
                                productLabel brandName modelNumber storeSkuNumber
                            }
                            pricing {
                                value original unitOfMeasure
                                promotion { description }
                            }
                            media { images { url sizes } }
                            reviews { ratingsReviews {
                                averageRating totalReviews
                            }}
                            availabilityType { type }
                        }
                        searchReport { totalProducts keyword }
                    }
                }
            """,
        }
        data = self._post(payload)
        return data.get("data", {}).get("searchModel", {})

    def get_product(
        self,
        item_id: str,
        store_id: str = "121",
        zip_code: str = "10001",
    ) -> dict:
        """Fetch full product details by item ID."""
        payload = {
            "operationName": "productClientOnlyProduct",
            "variables": {
                "itemId": item_id,
                "storeId": store_id,
                "zipCode": zip_code,
                "skipSpecificationGroup": False,
                "skipSubscribeAndSave": True,
            },
            "query": """
                query productClientOnlyProduct(
                    $itemId: String!, $storeId: String, $zipCode: String,
                    $skipSpecificationGroup: Boolean!
                ) {
                    product(itemId: $itemId, storeId: $storeId, zipCode: $zipCode) {
                        itemId
                        dataSources
                        identifiers {
                            productLabel brandName modelNumber storeSkuNumber
                            upcGtin13 canonicalUrl
                        }
                        pricing {
                            value original percentageOff unitOfMeasure
                            promotion {
                                description
                                dates { start end }
                                type
                            }
                            specialBuy { value description }
                        }
                        details {
                            description collection installation highlights
                        }
                        specificationGroup @skip(if: $skipSpecificationGroup) {
                            specTitle
                            specifications { specName specValue }
                        }
                        media {
                            images { url sizes }
                            video { url thumbnail }
                        }
                        reviews { ratingsReviews {
                            averageRating totalReviews
                            recommendedCount notRecommendedCount
                        }}
                        taxonomy {
                            breadCrumbs { label url }
                        }
                        availabilityType { type discontinued }
                        fulfillment {
                            fulfillmentOptions {
                                type
                                services {
                                    type
                                    deliveryDateRange
                                    freeDeliveryThreshold
                                }
                            }
                        }
                        seoDescription
                    }
                }
            """,
        }
        data = self._post(payload)
        return data.get("data", {}).get("product", {})
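The specificationGroup structure comes back nested one level deeper than you usually want. A small pure helper (assuming the field shape in the query above) flattens it into a plain dict:

```python
def flatten_specs(product: dict) -> dict:
    """Collapse specificationGroup into a flat {specName: specValue} dict."""
    flat = {}
    for group in product.get("specificationGroup") or []:
        for spec in group.get("specifications") or []:
            flat[spec["specName"]] = spec["specValue"]
    return flat

# Minimal fixture mirroring the GraphQL response shape
sample = {
    "specificationGroup": [
        {"specTitle": "Dimensions", "specifications": [
            {"specName": "Product Depth (in.)", "specValue": "10.5"},
        ]},
        {"specTitle": "Details", "specifications": [
            {"specName": "Color Family", "specValue": "Yellow"},
        ]},
    ]
}
```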
Store Inventory Checking
One of the more valuable data points is real-time store inventory — Home Depot shows "X in stock at Store Y" on product pages, and the API exposes this at the zip-code level:
def get_store_inventory(
    self,
    item_id: str,
    zip_code: str,
    radius_miles: int = 25,
) -> list[dict]:
    """Check product stock at stores near a zip code."""
    payload = {
        "operationName": "storeSearch",
        "variables": {
            "itemId": item_id,
            "zipCode": zip_code,
            "radius": radius_miles,
        },
        "query": """
            query storeSearch($itemId: String!, $zipCode: String!, $radius: Int) {
                storeSearch(itemId: $itemId, zipCode: $zipCode, radius: $radius) {
                    stores {
                        storeId
                        storeName
                        phone
                        address {
                            street city state zip
                        }
                        inventory {
                            quantity isInStock isLimitedQuantity
                            isUnavailable
                        }
                        distance
                        isPickupEligible
                    }
                }
            }
        """,
    }
    data = self._post(payload)
    stores = (
        data.get("data", {})
        .get("storeSearch", {})
        .get("stores", [])
    )
    return [
        {
            "store_id": s["storeId"],
            "name": s["storeName"],
            "address": f"{s['address']['city']}, {s['address']['state']}",
            "phone": s.get("phone"),
            "quantity": s["inventory"]["quantity"],
            "in_stock": s["inventory"]["isInStock"],
            "limited": s["inventory"]["isLimitedQuantity"],
            "unavailable": s["inventory"].get("isUnavailable", False),
            "distance_miles": s.get("distance"),
            "pickup_eligible": s.get("isPickupEligible", False),
        }
        for s in stores
    ]
Scraping Product Reviews
Reviews require pagination since individual products can have hundreds of reviews:
def get_reviews(
    self,
    item_id: str,
    page: int = 1,
    page_size: int = 30,
    sort_by: str = "Most Recent",
) -> dict:
    """Fetch paginated reviews for a product."""
    payload = {
        "operationName": "productReviews",
        "variables": {
            "itemId": item_id,
            "startIndex": (page - 1) * page_size,
            "endIndex": page * page_size,
            "sortBy": sort_by,
            "filterBy": "",
        },
        "query": """
            query productReviews(
                $itemId: String!, $startIndex: Int, $endIndex: Int,
                $sortBy: String, $filterBy: String
            ) {
                reviews(
                    itemId: $itemId, startIndex: $startIndex,
                    endIndex: $endIndex, sortBy: $sortBy, filterBy: $filterBy
                ) {
                    totalResults
                    results {
                        reviewId
                        rating
                        headline
                        body
                        submissionTime
                        reviewerName
                        isVerifiedPurchase
                        positiveFeedbackCount
                        negativeFeedbackCount
                        photos { Sizes { Normal { Url } } }
                        pros cons
                    }
                }
            }
        """,
    }
    data = self._post(payload)
    return data.get("data", {}).get("reviews", {})

def get_all_reviews(self, item_id: str, max_pages: int = 10) -> list[dict]:
    """Collect all reviews for a product across multiple pages."""
    all_reviews = []
    page = 1
    while page <= max_pages:
        batch = self.get_reviews(item_id, page=page)
        results = batch.get("results", [])
        if not results:
            break
        all_reviews.extend([
            {
                "review_id": r["reviewId"],
                "rating": r["rating"],
                "headline": r.get("headline"),
                "body": r.get("body"),
                "submitted": r.get("submissionTime"),
                "reviewer": r.get("reviewerName"),
                "verified": r.get("isVerifiedPurchase", False),
                "helpful": r.get("positiveFeedbackCount", 0),
                "unhelpful": r.get("negativeFeedbackCount", 0),
                "pros": r.get("pros"),
                "cons": r.get("cons"),
            }
            for r in results
        ])
        # Guard against a missing totalResults, which would end the loop early
        total = batch.get("totalResults", 0)
        if total and len(all_reviews) >= total:
            break
        page += 1
        time.sleep(random.uniform(1.0, 2.5))
    return all_reviews
Category Browsing
To discover products systematically rather than through search, use the category browsing endpoint:
def browse_category(
    self,
    nav_param: str,
    store_id: str = "121",
    zip_code: str = "10001",
    page: int = 1,
    page_size: int = 24,
    sort_by: str = "TOP_SELLERS",
) -> dict:
    """
    Browse products in a category.
    nav_param: Home Depot's internal category nav parameter
    (e.g., "N-5yc1vZc2bk" for power drills)
    """
    payload = {
        "operationName": "browseModel",
        "variables": {
            "storeId": store_id,
            "zipCode": zip_code,
            "navParam": nav_param,
            "startIndex": (page - 1) * page_size,
            "pageSize": page_size,
            "sortBy": sort_by,
        },
        "query": """
            query browseModel(
                $navParam: String!, $storeId: String, $zipCode: String,
                $startIndex: Int, $pageSize: Int, $sortBy: String
            ) {
                browseModel(
                    navParam: $navParam, storeId: $storeId, zipCode: $zipCode,
                    startIndex: $startIndex, pageSize: $pageSize, sortBy: $sortBy
                ) {
                    products {
                        itemId
                        identifiers { productLabel brandName modelNumber }
                        pricing { value original }
                        reviews { ratingsReviews { averageRating totalReviews }}
                    }
                    searchReport { totalProducts }
                }
            }
        """,
    }
    data = self._post(payload)
    return data.get("data", {}).get("browseModel", {})
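Because browseModel reports searchReport.totalProducts, you can plan a full category crawl after the first page instead of paging until an empty response. The offset math is simple enough to isolate:

```python
def page_offsets(total_products: int, page_size: int = 24) -> list[int]:
    """All startIndex values needed to cover a category of known size."""
    return list(range(0, total_products, page_size))

# A 60-product category at the default page size needs three requests
offsets = page_offsets(60)  # [0, 24, 48]
```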
Price Monitoring Pipeline
For ongoing price tracking across a watchlist of SKUs:
import sqlite3
from datetime import datetime

def init_price_db(path: str = "homedepot_prices.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS products (
            item_id TEXT PRIMARY KEY,
            brand TEXT,
            name TEXT,
            model_number TEXT,
            upc TEXT,
            department TEXT,
            url TEXT,
            first_seen TEXT
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS price_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            item_id TEXT,
            price REAL,
            original_price REAL,
            unit_measure TEXT,
            on_sale BOOLEAN,
            promotion_desc TEXT,
            avg_rating REAL,
            review_count INTEGER,
            checked_at TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS inventory_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            item_id TEXT,
            store_id TEXT,
            store_name TEXT,
            quantity INTEGER,
            in_stock BOOLEAN,
            checked_at TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.commit()
    return conn
def track_price(
    conn: sqlite3.Connection,
    client: HomeDepotClient,
    item_id: str,
    store_id: str = "121",
):
    """Fetch and record current price and inventory for one product."""
    product = client.get_product(item_id, store_id=store_id)
    if not product:
        return

    # Upsert product record
    identifiers = product.get("identifiers", {})
    pricing = product.get("pricing", {})
    reviews_data = product.get("reviews", {}).get("ratingsReviews", {})

    conn.execute("""
        INSERT OR REPLACE INTO products
        (item_id, brand, name, model_number, upc, first_seen)
        VALUES (?, ?, ?, ?, ?, COALESCE((SELECT first_seen FROM products WHERE item_id=?), datetime('now')))
    """, (
        item_id,
        identifiers.get("brandName"),
        identifiers.get("productLabel"),
        identifiers.get("modelNumber"),
        identifiers.get("upcGtin13"),
        item_id,
    ))

    promo = pricing.get("promotion", {}) or {}
    conn.execute("""
        INSERT INTO price_history
        (item_id, price, original_price, unit_measure, on_sale, promotion_desc, avg_rating, review_count)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        item_id,
        pricing.get("value"),
        pricing.get("original"),
        pricing.get("unitOfMeasure"),
        (pricing.get("percentageOff") or 0) > 0,  # on_sale
        promo.get("description"),
        reviews_data.get("averageRating"),
        reviews_data.get("totalReviews"),
    ))
    conn.commit()
def get_price_drops(
    conn: sqlite3.Connection,
    min_drop_pct: float = 10.0,
) -> list[dict]:
    """Find items where today's price is significantly below historical average."""
    rows = conn.execute("""
        WITH latest AS (
            SELECT item_id, price, checked_at
            FROM price_history
            WHERE checked_at = (SELECT MAX(checked_at) FROM price_history ph2 WHERE ph2.item_id = price_history.item_id)
        ),
        historical AS (
            SELECT item_id, AVG(price) as avg_price
            FROM price_history
            WHERE checked_at < date('now', '-1 day')
            GROUP BY item_id
            HAVING COUNT(*) >= 3
        )
        SELECT l.item_id, p.name, l.price, h.avg_price,
               ROUND((h.avg_price - l.price) / h.avg_price * 100, 1) as drop_pct
        FROM latest l
        JOIN historical h ON l.item_id = h.item_id
        JOIN products p ON l.item_id = p.item_id
        WHERE (h.avg_price - l.price) / h.avg_price * 100 >= ?
        ORDER BY drop_pct DESC
    """, (min_drop_pct,)).fetchall()
    return [
        {"item_id": r[0], "name": r[1], "current": r[2], "avg": r[3], "drop_pct": r[4]}
        for r in rows
    ]
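The same drop calculation the SQL performs, mirrored in Python for one-off checks or unit tests (a convenience helper, not part of the pipeline itself):

```python
def drop_percentage(current_price: float, historical_avg: float) -> float:
    """Percent below historical average, rounded like the SQL expression."""
    if not historical_avg:
        return 0.0
    return round((historical_avg - current_price) / historical_avg * 100, 1)

# A drill now at $89 that averaged $119 is down about 25%
pct = drop_percentage(89.0, 119.0)
```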
Error Handling and Session Refresh
import asyncio

class HomeDepotScraper:
    """Full scraper with automatic session refresh."""

    def __init__(
        self,
        proxy_host: str | None = None,
        proxy_port: int | None = None,
        proxy_user: str | None = None,
        proxy_pass: str | None = None,
    ):
        self.proxy_host = proxy_host
        self.proxy_port = proxy_port
        self.proxy_user = proxy_user
        self.proxy_pass = proxy_pass
        self.proxy_url = None
        if proxy_host:
            self.proxy_url = f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}"
        self.client: HomeDepotClient | None = None
        self.cookies_harvested_at: float = 0
        self.cookie_ttl: float = 1800  # 30 minutes

    async def ensure_fresh_session(self):
        """Re-harvest cookies if the session is expired."""
        now = time.time()
        if self.client is None or (now - self.cookies_harvested_at) > self.cookie_ttl:
            print("  Harvesting fresh session cookies...")
            cookies = await harvest_homedepot_cookies(
                proxy_host=self.proxy_host,
                proxy_port=self.proxy_port,
                proxy_user=self.proxy_user,
                proxy_pass=self.proxy_pass,
            )
            self.client = HomeDepotClient(cookies=cookies, proxy=self.proxy_url)
            self.cookies_harvested_at = now

    def get_product_safe(self, item_id: str, max_retries: int = 3) -> dict:
        """Fetch a product with retry on session expiry errors."""
        for attempt in range(max_retries):
            try:
                # asyncio.run is fine here because this method is called from sync code
                asyncio.run(self.ensure_fresh_session())
                return self.client.get_product(item_id)
            except httpx.HTTPStatusError as e:
                if e.response.status_code == 403:
                    # Session expired — force refresh
                    print(f"  403 on {item_id}, refreshing session (attempt {attempt + 1})")
                    self.cookies_harvested_at = 0
                    time.sleep(5 * (attempt + 1))
                elif e.response.status_code == 429:
                    wait = int(e.response.headers.get("Retry-After", 30))
                    print(f"  Rate limited. Waiting {wait}s")
                    time.sleep(wait)
                else:
                    print(f"  HTTP {e.response.status_code} for {item_id}")
                    return {}
            except Exception as e:
                print(f"  Error fetching {item_id}: {e}")
                if attempt < max_retries - 1:
                    time.sleep(3)
        return {}
Things to Watch Out For
Store ID matters significantly. Pricing and availability can differ by store. The same SKU might be on clearance at one location and full price at another. Always pass a consistent store ID or zip code and track which one you're using in your database.
GraphQL schema drift. Home Depot updates their API periodically. The queries above work as of late 2026, but field names and nesting structures do shift. If you start getting null where you expect data, inspect the network tab in a real browser to see the current GraphQL schema for that operation.
PerimeterX session TTL is short. The _px3 cookie expires in roughly 30 minutes, so refreshing your browser session periodically during long scraping runs is necessary. If you start getting data.product = null responses or 403s, expired cookies are usually the cause.
Rate-limit yourself voluntarily. One request every 3-5 seconds is a reasonable pace for single-threaded use. Home Depot's API is clearly designed for their frontend, not bulk extraction. Running at 1 req/sec for hours will trigger adaptive throttling even with rotating proxies.
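The 3-5 second pacing above is easy to encode as a jittered sleep; randomizing the interval also avoids the perfectly regular request timing that throttling heuristics key on:

```python
import random
import time

def polite_sleep(base: float = 3.0, jitter: float = 2.0) -> float:
    """Sleep between base and base + jitter seconds; return the wait used."""
    wait = base + random.uniform(0, jitter)
    time.sleep(wait)
    return wait
```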
Respect their robots.txt scope. Home Depot's robots.txt restricts scraping of checkout, account, and order management paths. Product and catalog pages fall outside these restricted areas, which is also where the commercially useful, publicly accessible information lives.
Legal and Ethical Considerations
Home Depot publishes product data publicly on their website, and courts have generally treated publicly accessible product prices and specifications as factual information not protected by copyright. That said, their Terms of Use prohibit automated access. The practical risk for legitimate research purposes — price monitoring for your own purchasing decisions, market research, building comparison tools for consumers — is lower than for commercial data reselling at scale.
If you're building a product that competes directly with Home Depot, or one that repurposes their proprietary catalog at scale, you should evaluate their data licensing options. For personal use and research, the main obligation is being a good citizen: reasonable request rates, no credential stuffing, and no interference with their checkout or transaction systems.