Scraping Home Depot: Product Data, Pricing, and Availability (2026)
Home Depot is interesting to scrape because it sits between two worlds. Their website is a modern React app with aggressive bot protection, but their backend APIs are surprisingly well-structured once you have a valid session. The product data is rich — not just prices and SKUs, but installation guides, project calculators, and real-time store inventory by location. That makes it useful for price monitoring, competitor analysis, inventory tracking, or building product comparison tools.
Here's what works for getting data out of homedepot.com in 2026.
What Data Is Available
Home Depot product pages pack a lot of information:
- Product details — title, brand, model number, Home Depot SKU (item ID), UPC, GTIN
- Pricing — regular price, sale price, bulk pricing tiers, unit pricing (price per sq ft, per gallon, etc.)
- Availability — online stock status, store-level inventory (by zip code or store ID), delivery estimates
- Specifications — dimensions, weight, material, color, power requirements, detailed spec tables
- Reviews — star rating, review count, individual review text, verified purchase flag, review helpful votes
- Images — multiple product photos at various resolutions, lifestyle images, dimension diagrams, instruction images
- Related products — frequently bought together, similar items, accessories, upgrade products
- Project guides — how-to content tied to product categories, linked from product pages
- Fulfillment options — ship to home, buy online pick up in store (BOPUS), curbside, direct delivery
Understanding Home Depot's Architecture
Home Depot's frontend is a React/GraphQL application. All product data flows through a single GraphQL endpoint at https://www.homedepot.com/federation-gateway/graphql. This endpoint handles search, product detail, store inventory, and several other queries — all differentiated by the operationName field in the request body.
The federation gateway aggregates multiple backend services, which is why the schema is large and some fields return nested GraphQL sub-objects. The key to using this endpoint without authentication is having valid cookies from a browser session that has already passed Home Depot's bot detection layer.
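To make the single-endpoint design concrete, here's a minimal sketch of the request envelope. Every query in this article — search, product detail, inventory — is built this same way; the gateway routes purely on operationName. The example query string below is truncated for illustration:

```python
import json

def build_gql_payload(operation_name: str, variables: dict, query: str) -> dict:
    """Assemble the request body the federation gateway expects.

    Search, product detail, and inventory all share this envelope;
    only operationName, variables, and query differ per call.
    """
    return {
        "operationName": operation_name,
        "variables": variables,
        "query": query,
    }

# The gateway routes on operationName, so one URL serves every query
payload = build_gql_payload(
    "searchModel",
    {"keyword": "cordless drill", "storeId": "121"},
    "query searchModel($keyword: String!, $storeId: String) { ... }",
)
body = json.dumps(payload)  # this JSON string is what gets POSTed
```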
Anti-Bot Measures: HUMAN Security (PerimeterX)
Home Depot runs HUMAN Security (formerly PerimeterX) as their primary bot detection. This is one of the more sophisticated anti-bot solutions on the market.
JavaScript Sensor Data
HUMAN's protection works by injecting a JavaScript agent into every page. This agent collects behavioral signals — mouse movement velocity, click patterns, scroll behavior, typing cadence, touch event characteristics — and computes a "sensor data" payload. This encoded payload is sent back to HUMAN's scoring service asynchronously. Without this payload and the cookies it sets, subsequent requests to the API endpoints return empty results or 403 responses.
Cookie Chain
The cookie sequence matters. A real browser session on homedepot.com establishes:
- _px3 — the PerimeterX validation cookie, short TTL (~30-60 min)
- _pxvid — the PerimeterX visitor ID, persists across sessions
- THD_SESSION — Home Depot's session tracking cookie
- Various A/B test and analytics cookies that PerimeterX correlates with the visitor profile
Without a valid _px3 cookie on requests to the federation gateway, you get 403s or responses where data is null.
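A cheap guard before each batch of API calls is to check that the harvested jar still contains the critical names. A sketch — the required set below reflects the cookies described above, observed on real sessions, and may drift as Home Depot changes its stack:

```python
def has_required_cookies(cookies: dict) -> bool:
    """Return True if the jar contains the cookies the gateway checks.

    _px3 is the short-lived validation cookie; _pxvid is the persistent
    visitor ID. Treat this list as an assumption, not a guarantee.
    """
    return all(cookies.get(name) for name in ("_px3", "_pxvid"))

# _px3 has expired and been dropped from this jar
stale = {"_pxvid": "9f2c...", "THD_SESSION": "abc"}
fresh = {"_px3": "e41a...", "_pxvid": "9f2c...", "THD_SESSION": "abc"}
```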
IP Reputation Scoring
PerimeterX evaluates IP reputation at the edge. Datacenter IP ranges (AWS, GCP, Azure, DigitalOcean, Vultr, OVH) start with near-zero trust scores and typically fail the challenge immediately. Home Depot's tier of PerimeterX protection blocks these requests before any JavaScript challenge even loads.
The practical solution is ThorData residential proxies. Residential IPs from real ISP customers pass the IP reputation check that kills datacenter requests outright. Their US residential pool is particularly relevant for Home Depot — US geo matters because Home Depot's pricing and availability are US-centric and some API responses include geo-dependent data. ThorData supports sticky sessions, letting you hold the same exit IP across the browser session (cookie harvest) and the subsequent API calls.
Setting Up the Session with Playwright
The cleanest approach is Playwright for cookie harvesting, then httpx for the actual API calls:
import asyncio
from playwright.async_api import async_playwright, BrowserContext

async def harvest_homedepot_cookies(proxy_host: str | None = None, proxy_port: int | None = None,
                                    proxy_user: str | None = None, proxy_pass: str | None = None) -> dict:
    """
    Launch a real browser session on homedepot.com and extract valid cookies.
    Run this every 30-45 minutes during sustained scraping.
    """
    async with async_playwright() as pw:
        launch_kwargs = {
            "headless": True,
            "args": [
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--no-sandbox",
                "--disable-setuid-sandbox",
            ],
        }
        if proxy_host:
            launch_kwargs["proxy"] = {
                "server": f"http://{proxy_host}:{proxy_port}",
                "username": proxy_user or "",
                "password": proxy_pass or "",
            }
        browser = await pw.chromium.launch(**launch_kwargs)
        context: BrowserContext = await browser.new_context(
            viewport={"width": 1440, "height": 900},
            locale="en-US",
            timezone_id="America/New_York",
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/125.0.0.0 Safari/537.36"
            ),
            extra_http_headers={
                "Accept-Language": "en-US,en;q=0.9",
            },
        )
        page = await context.new_page()

        # Load the homepage and let the PerimeterX challenge complete
        await page.goto("https://www.homedepot.com/", wait_until="networkidle", timeout=30000)
        await page.wait_for_timeout(3000)

        # Simulate realistic human interactions
        await page.mouse.move(350, 250)
        await page.wait_for_timeout(500)
        await page.mouse.move(600, 400)
        await page.wait_for_timeout(800)
        await page.evaluate("window.scrollBy(0, 200)")
        await page.wait_for_timeout(1000)
        await page.evaluate("window.scrollBy(0, 150)")
        await page.wait_for_timeout(500)

        # Optionally navigate to a category page to establish a richer session
        await page.goto(
            "https://www.homedepot.com/b/Tools-Power-Tools-Drills/N-5yc1vZc2bk",
            wait_until="networkidle",
            timeout=30000,
        )
        await page.wait_for_timeout(2000)

        cookies = await context.cookies()
        await browser.close()
        return {c["name"]: c["value"] for c in cookies}
API-First Product Scraping
Once you have valid cookies, call the GraphQL endpoint directly from Python — much faster than rendering full pages:
import httpx
import json
import time
import random

GRAPHQL_URL = "https://www.homedepot.com/federation-gateway/graphql"

BASE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/125.0.0.0 Safari/537.36"
    ),
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.9",
    "Content-Type": "application/json",
    "Referer": "https://www.homedepot.com/",
    "Origin": "https://www.homedepot.com",
    "x-experience-name": "general-merchandise",
    "x-current-url": "/",
}
class HomeDepotClient:
    def __init__(self, cookies: dict, proxy: str | None = None):
        transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
        self.client = httpx.Client(
            headers=BASE_HEADERS,
            cookies=cookies,
            timeout=30,
            transport=transport,
            follow_redirects=True,
        )

    def _post(self, payload: dict) -> dict:
        resp = self.client.post(GRAPHQL_URL, json=payload)
        resp.raise_for_status()
        return resp.json()

    def search_products(
        self,
        keyword: str,
        store_id: str = "121",
        zip_code: str = "10001",
        offset: int = 0,
        page_size: int = 24,
    ) -> dict:
        """Search for products via the GraphQL search endpoint."""
        payload = {
            "operationName": "searchModel",
            "variables": {
                "storeId": store_id,
                "zipCode": zip_code,
                "skipInstallServices": True,
                "startIndex": offset,
                "pageSize": page_size,
                "keyword": keyword,
            },
            "query": """
                query searchModel(
                    $keyword: String!, $storeId: String,
                    $zipCode: String, $startIndex: Int, $pageSize: Int
                ) {
                    searchModel(
                        keyword: $keyword, storeId: $storeId,
                        zipCode: $zipCode, startIndex: $startIndex,
                        pageSize: $pageSize
                    ) {
                        products {
                            itemId
                            identifiers {
                                productLabel brandName modelNumber storeSkuNumber
                            }
                            pricing {
                                value original unitOfMeasure
                                promotion { description }
                            }
                            media { images { url sizes } }
                            reviews { ratingsReviews {
                                averageRating totalReviews
                            }}
                            availabilityType { type }
                        }
                        searchReport { totalProducts keyword }
                    }
                }
            """,
        }
        data = self._post(payload)
        return data.get("data", {}).get("searchModel", {})

    def get_product(
        self,
        item_id: str,
        store_id: str = "121",
        zip_code: str = "10001",
    ) -> dict:
        """Fetch full product details by item ID."""
        payload = {
            "operationName": "productClientOnlyProduct",
            "variables": {
                "itemId": item_id,
                "storeId": store_id,
                "zipCode": zip_code,
                "skipSpecificationGroup": False,
                "skipSubscribeAndSave": True,
            },
            "query": """
                query productClientOnlyProduct(
                    $itemId: String!, $storeId: String, $zipCode: String,
                    $skipSpecificationGroup: Boolean!
                ) {
                    product(itemId: $itemId, storeId: $storeId, zipCode: $zipCode) {
                        itemId
                        dataSources
                        identifiers {
                            productLabel brandName modelNumber storeSkuNumber
                            upcGtin13 canonicalUrl
                        }
                        pricing {
                            value original percentageOff unitOfMeasure
                            promotion {
                                description
                                dates { start end }
                                type
                            }
                            specialBuy { value description }
                        }
                        details {
                            description collection installation highlights
                        }
                        specificationGroup @skip(if: $skipSpecificationGroup) {
                            specTitle
                            specifications { specName specValue }
                        }
                        media {
                            images { url sizes }
                            video { url thumbnail }
                        }
                        reviews { ratingsReviews {
                            averageRating totalReviews
                            recommendedCount notRecommendedCount
                        }}
                        taxonomy {
                            breadCrumbs { label url }
                        }
                        availabilityType { type discontinued }
                        fulfillment {
                            fulfillmentOptions {
                                type
                                services {
                                    type
                                    deliveryDateRange
                                    freeDeliveryThreshold
                                }
                            }
                        }
                        seoDescription
                    }
                }
            """,
        }
        data = self._post(payload)
        return data.get("data", {}).get("product", {})
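The specificationGroup structure comes back nested one level deeper than you usually want. A small pure helper (assuming the field shape in the query above) flattens it into a plain dict:

```python
def flatten_specs(product: dict) -> dict:
    """Collapse specificationGroup into a flat {specName: specValue} dict."""
    flat = {}
    for group in product.get("specificationGroup") or []:
        for spec in group.get("specifications") or []:
            flat[spec["specName"]] = spec["specValue"]
    return flat

# Minimal fixture mirroring the GraphQL response shape
sample = {
    "specificationGroup": [
        {"specTitle": "Dimensions", "specifications": [
            {"specName": "Product Depth (in.)", "specValue": "10.5"},
        ]},
        {"specTitle": "Details", "specifications": [
            {"specName": "Color Family", "specValue": "Yellow"},
        ]},
    ]
}
```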
Store Inventory Checking
One of the more valuable data points is real-time store inventory — Home Depot shows "X in stock at Store Y" on product pages, and the API exposes this at the zip-code level:
def get_store_inventory(
    self,
    item_id: str,
    zip_code: str,
    radius_miles: int = 25,
) -> list[dict]:
    """Check product stock at stores near a zip code."""
    payload = {
        "operationName": "storeSearch",
        "variables": {
            "itemId": item_id,
            "zipCode": zip_code,
            "radius": radius_miles,
        },
        "query": """
            query storeSearch($itemId: String!, $zipCode: String!, $radius: Int) {
                storeSearch(itemId: $itemId, zipCode: $zipCode, radius: $radius) {
                    stores {
                        storeId
                        storeName
                        phone
                        address {
                            street city state zip
                        }
                        inventory {
                            quantity isInStock isLimitedQuantity
                            isUnavailable
                        }
                        distance
                        isPickupEligible
                    }
                }
            }
        """,
    }
    data = self._post(payload)
    stores = (
        data.get("data", {})
        .get("storeSearch", {})
        .get("stores", [])
    )
    return [
        {
            "store_id": s["storeId"],
            "name": s["storeName"],
            "address": f"{s['address']['city']}, {s['address']['state']}",
            "phone": s.get("phone"),
            "quantity": s["inventory"]["quantity"],
            "in_stock": s["inventory"]["isInStock"],
            "limited": s["inventory"]["isLimitedQuantity"],
            "unavailable": s["inventory"].get("isUnavailable", False),
            "distance_miles": s.get("distance"),
            "pickup_eligible": s.get("isPickupEligible", False),
        }
        for s in stores
    ]
Scraping Product Reviews
Reviews require pagination since individual products can have hundreds of reviews:
def get_reviews(
    self,
    item_id: str,
    page: int = 1,
    page_size: int = 30,
    sort_by: str = "Most Recent",
) -> dict:
    """Fetch paginated reviews for a product."""
    payload = {
        "operationName": "productReviews",
        "variables": {
            "itemId": item_id,
            "startIndex": (page - 1) * page_size,
            "endIndex": page * page_size,
            "sortBy": sort_by,
            "filterBy": "",
        },
        "query": """
            query productReviews(
                $itemId: String!, $startIndex: Int, $endIndex: Int,
                $sortBy: String, $filterBy: String
            ) {
                reviews(
                    itemId: $itemId, startIndex: $startIndex,
                    endIndex: $endIndex, sortBy: $sortBy, filterBy: $filterBy
                ) {
                    totalResults
                    results {
                        reviewId
                        rating
                        headline
                        body
                        submissionTime
                        reviewerName
                        isVerifiedPurchase
                        positiveFeedbackCount
                        negativeFeedbackCount
                        photos { Sizes { Normal { Url } } }
                        pros cons
                    }
                }
            }
        """,
    }
    data = self._post(payload)
    return data.get("data", {}).get("reviews", {})

def get_all_reviews(self, item_id: str, max_pages: int = 10) -> list[dict]:
    """Collect all reviews for a product across multiple pages."""
    all_reviews = []
    page = 1
    while page <= max_pages:
        batch = self.get_reviews(item_id, page=page)
        results = batch.get("results", [])
        if not results:
            break
        all_reviews.extend([
            {
                "review_id": r["reviewId"],
                "rating": r["rating"],
                "headline": r.get("headline"),
                "body": r.get("body"),
                "submitted": r.get("submissionTime"),
                "reviewer": r.get("reviewerName"),
                "verified": r.get("isVerifiedPurchase", False),
                "helpful": r.get("positiveFeedbackCount", 0),
                "unhelpful": r.get("negativeFeedbackCount", 0),
                "pros": r.get("pros"),
                "cons": r.get("cons"),
            }
            for r in results
        ])
        # Guard against a missing totalResults, which would end the loop early
        total = batch.get("totalResults", 0)
        if total and len(all_reviews) >= total:
            break
        page += 1
        time.sleep(random.uniform(1.0, 2.5))
    return all_reviews
Category Browsing
To discover products systematically rather than through search, use the category browsing endpoint:
def browse_category(
    self,
    nav_param: str,
    store_id: str = "121",
    zip_code: str = "10001",
    page: int = 1,
    page_size: int = 24,
    sort_by: str = "TOP_SELLERS",
) -> dict:
    """
    Browse products in a category.
    nav_param: Home Depot's internal category nav parameter
    (e.g., "N-5yc1vZc2bk" for power drills)
    """
    payload = {
        "operationName": "browseModel",
        "variables": {
            "storeId": store_id,
            "zipCode": zip_code,
            "navParam": nav_param,
            "startIndex": (page - 1) * page_size,
            "pageSize": page_size,
            "sortBy": sort_by,
        },
        "query": """
            query browseModel(
                $navParam: String!, $storeId: String, $zipCode: String,
                $startIndex: Int, $pageSize: Int, $sortBy: String
            ) {
                browseModel(
                    navParam: $navParam, storeId: $storeId, zipCode: $zipCode,
                    startIndex: $startIndex, pageSize: $pageSize, sortBy: $sortBy
                ) {
                    products {
                        itemId
                        identifiers { productLabel brandName modelNumber }
                        pricing { value original }
                        reviews { ratingsReviews { averageRating totalReviews }}
                    }
                    searchReport { totalProducts }
                }
            }
        """,
    }
    data = self._post(payload)
    return data.get("data", {}).get("browseModel", {})
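Because browseModel reports searchReport.totalProducts, you can plan a full category crawl after the first page instead of paging until an empty response. The offset math is simple enough to isolate:

```python
def page_offsets(total_products: int, page_size: int = 24) -> list[int]:
    """All startIndex values needed to cover a category of known size."""
    return list(range(0, total_products, page_size))

# A 60-product category at the default page size needs three requests
offsets = page_offsets(60)  # [0, 24, 48]
```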
Price Monitoring Pipeline
For ongoing price tracking across a watchlist of SKUs:
import sqlite3
from datetime import datetime

def init_price_db(path: str = "homedepot_prices.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS products (
            item_id TEXT PRIMARY KEY,
            brand TEXT,
            name TEXT,
            model_number TEXT,
            upc TEXT,
            department TEXT,
            url TEXT,
            first_seen TEXT
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS price_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            item_id TEXT,
            price REAL,
            original_price REAL,
            unit_measure TEXT,
            on_sale BOOLEAN,
            promotion_desc TEXT,
            avg_rating REAL,
            review_count INTEGER,
            checked_at TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS inventory_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            item_id TEXT,
            store_id TEXT,
            store_name TEXT,
            quantity INTEGER,
            in_stock BOOLEAN,
            checked_at TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.commit()
    return conn
def track_price(
    conn: sqlite3.Connection,
    client: HomeDepotClient,
    item_id: str,
    store_id: str = "121",
):
    """Fetch and record current price and inventory for one product."""
    product = client.get_product(item_id, store_id=store_id)
    if not product:
        return

    # Upsert product record
    identifiers = product.get("identifiers", {})
    pricing = product.get("pricing", {})
    reviews_data = product.get("reviews", {}).get("ratingsReviews", {})

    conn.execute("""
        INSERT OR REPLACE INTO products
        (item_id, brand, name, model_number, upc, first_seen)
        VALUES (?, ?, ?, ?, ?, COALESCE((SELECT first_seen FROM products WHERE item_id=?), datetime('now')))
    """, (
        item_id,
        identifiers.get("brandName"),
        identifiers.get("productLabel"),
        identifiers.get("modelNumber"),
        identifiers.get("upcGtin13"),
        item_id,
    ))

    promo = pricing.get("promotion", {}) or {}
    conn.execute("""
        INSERT INTO price_history
        (item_id, price, original_price, unit_measure, on_sale, promotion_desc, avg_rating, review_count)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        item_id,
        pricing.get("value"),
        pricing.get("original"),
        pricing.get("unitOfMeasure"),
        (pricing.get("percentageOff") or 0) > 0,  # on_sale
        promo.get("description"),
        reviews_data.get("averageRating"),
        reviews_data.get("totalReviews"),
    ))
    conn.commit()
def get_price_drops(
    conn: sqlite3.Connection,
    min_drop_pct: float = 10.0,
) -> list[dict]:
    """Find items where today's price is significantly below historical average."""
    rows = conn.execute("""
        WITH latest AS (
            SELECT item_id, price, checked_at
            FROM price_history
            WHERE checked_at = (SELECT MAX(checked_at) FROM price_history ph2 WHERE ph2.item_id = price_history.item_id)
        ),
        historical AS (
            SELECT item_id, AVG(price) as avg_price
            FROM price_history
            WHERE checked_at < date('now', '-1 day')
            GROUP BY item_id
            HAVING COUNT(*) >= 3
        )
        SELECT l.item_id, p.name, l.price, h.avg_price,
               ROUND((h.avg_price - l.price) / h.avg_price * 100, 1) as drop_pct
        FROM latest l
        JOIN historical h ON l.item_id = h.item_id
        JOIN products p ON l.item_id = p.item_id
        WHERE (h.avg_price - l.price) / h.avg_price * 100 >= ?
        ORDER BY drop_pct DESC
    """, (min_drop_pct,)).fetchall()
    return [
        {"item_id": r[0], "name": r[1], "current": r[2], "avg": r[3], "drop_pct": r[4]}
        for r in rows
    ]
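The same drop calculation the SQL performs, mirrored in Python for one-off checks or unit tests (a convenience helper, not part of the pipeline itself):

```python
def drop_percentage(current_price: float, historical_avg: float) -> float:
    """Percent below historical average, rounded like the SQL expression."""
    if not historical_avg:
        return 0.0
    return round((historical_avg - current_price) / historical_avg * 100, 1)

# A drill now at $89 that averaged $119 is down about 25%
pct = drop_percentage(89.0, 119.0)
```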
Error Handling and Session Refresh
import asyncio

class HomeDepotScraper:
    """Full scraper with automatic session refresh."""

    def __init__(
        self,
        proxy_host: str | None = None,
        proxy_port: int | None = None,
        proxy_user: str | None = None,
        proxy_pass: str | None = None,
    ):
        self.proxy_host = proxy_host
        self.proxy_port = proxy_port
        self.proxy_user = proxy_user
        self.proxy_pass = proxy_pass
        self.proxy_url = None
        if proxy_host:
            self.proxy_url = f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}"
        self.client: HomeDepotClient | None = None
        self.cookies_harvested_at: float = 0
        self.cookie_ttl: float = 1800  # 30 minutes

    async def ensure_fresh_session(self):
        """Re-harvest cookies if the session is expired."""
        now = time.time()
        if self.client is None or (now - self.cookies_harvested_at) > self.cookie_ttl:
            print("  Harvesting fresh session cookies...")
            cookies = await harvest_homedepot_cookies(
                proxy_host=self.proxy_host,
                proxy_port=self.proxy_port,
                proxy_user=self.proxy_user,
                proxy_pass=self.proxy_pass,
            )
            self.client = HomeDepotClient(cookies=cookies, proxy=self.proxy_url)
            self.cookies_harvested_at = now

    def get_product_safe(self, item_id: str, max_retries: int = 3) -> dict:
        """Fetch a product with retry on session expiry errors."""
        for attempt in range(max_retries):
            try:
                # asyncio.run is fine here because this method is called from sync code
                asyncio.run(self.ensure_fresh_session())
                return self.client.get_product(item_id)
            except httpx.HTTPStatusError as e:
                if e.response.status_code == 403:
                    # Session expired — force refresh
                    print(f"  403 on {item_id}, refreshing session (attempt {attempt + 1})")
                    self.cookies_harvested_at = 0
                    time.sleep(5 * (attempt + 1))
                elif e.response.status_code == 429:
                    wait = int(e.response.headers.get("Retry-After", 30))
                    print(f"  Rate limited. Waiting {wait}s")
                    time.sleep(wait)
                else:
                    print(f"  HTTP {e.response.status_code} for {item_id}")
                    return {}
            except Exception as e:
                print(f"  Error fetching {item_id}: {e}")
                if attempt < max_retries - 1:
                    time.sleep(3)
        return {}
Things to Watch Out For
Store ID matters significantly. Pricing and availability can differ by store. The same SKU might be on clearance at one location and full price at another. Always pass a consistent store ID or zip code and track which one you're using in your database.
GraphQL schema drift. Home Depot updates their API periodically. The queries above work as of late 2026, but field names and nesting structures do shift. If you start getting null where you expect data, inspect the network tab in a real browser to see the current GraphQL schema for that operation.
PerimeterX session TTL is short. The _px3 cookie expires in roughly 30 minutes, so refreshing your browser session periodically during long scraping runs is necessary. If you start getting data.product = null responses or 403s, expired cookies are usually the cause.
Rate-limit yourself voluntarily. One request every 3-5 seconds is a reasonable pace for single-threaded use. Home Depot's API is clearly designed for their frontend, not bulk extraction. Running at 1 req/sec for hours will trigger adaptive throttling even with rotating proxies.
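The 3-5 second pacing above is easy to encode as a jittered sleep; randomizing the interval also avoids the perfectly regular request timing that throttling heuristics key on:

```python
import random
import time

def polite_sleep(base: float = 3.0, jitter: float = 2.0) -> float:
    """Sleep between base and base + jitter seconds; return the wait used."""
    wait = base + random.uniform(0, jitter)
    time.sleep(wait)
    return wait
```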
Respect their robots.txt scope. Home Depot's robots.txt restricts scraping of checkout, account, and order management paths. Product and catalog pages fall outside these restricted areas, which is also where the commercially useful, publicly accessible information lives.
Legal and Ethical Considerations
Home Depot publishes product data publicly on their website, and courts have generally treated publicly accessible product prices and specifications as factual information not protected by copyright. That said, their Terms of Use prohibit automated access. The practical risk for legitimate research purposes — price monitoring for your own purchasing decisions, market research, building comparison tools for consumers — is lower than for commercial data reselling at scale.
If you're building a product that competes directly with Home Depot, or one that repurposes their proprietary catalog at scale, you should evaluate their data licensing options. For personal use and research, the main obligation is being a good citizen: reasonable request rates, no credential stuffing, and no interference with their checkout or transaction systems.