How to Scrape Walmart Product Prices in 2026 (Python Guide)

Walmart.com is the second-largest e-commerce platform in the US, and its product data is genuinely useful for a wide range of work: price monitoring against Amazon, competitor analysis for private label sellers, dropshipping product research, and wholesale market studies. If you're doing any kind of e-commerce intelligence work, you need Walmart data.

The bad news: Walmart has one of the more aggressive bot protection setups of any major retailer. The good news: it's not unbeatable, and their pages still embed clean JSON you can parse once you get through the door.

This guide covers the full stack — search results, product pages, reviews, seller data — with working Python code and honest notes on what makes Walmart annoying to scrape.

What Data Is Available on Walmart

Before writing a single line of code, it's worth knowing what Walmart exposes and what's locked away:

Publicly accessible:

  - Product names, descriptions, images
  - Current prices and "was" prices (markdown pricing)
  - Customer reviews and ratings
  - Seller information for marketplace items
  - In-stock status
  - Category hierarchy and breadcrumb paths
  - Product specifications and attributes
  - Shipping options and estimated delivery

Not accessible without login or third-party tools:

  - Sales velocity / bestseller ranks
  - Historical price data (Walmart doesn't expose this; you build it by collecting snapshots)
  - Inventory counts (you get in-stock/out-of-stock, not quantities)
  - Seller performance metrics (that's internal seller data)

The JSON embedded in Walmart's Next.js pages is the richest source. It includes fields that don't even render visually on the page.

Understanding Walmart's Technical Stack

Walmart.com runs on Next.js with server-side rendering. The page HTML includes a __NEXT_DATA__ script tag containing the complete data for the page as a JSON blob. This is the primary extraction target — no XHR interception or JavaScript execution required.

The challenge is getting a valid HTML response in the first place. Walmart uses HUMAN Security (formerly PerimeterX) for bot management. This involves:

  1. TLS fingerprinting — checking that your TLS handshake matches a real browser's cipher suite order
  2. HTTP header fingerprinting — verifying header order and values match expected browser patterns
  3. JavaScript execution — running fingerprinting scripts that check navigator properties, canvas, WebGL, etc.
  4. IP reputation — comparing your IP against known proxy/bot IP ranges

The curl_cffi library handles points 1 and 2 by replaying authentic Chrome TLS handshakes. Point 4 calls for residential proxies at any serious volume. Point 3 you mostly sidestep rather than solve: with good IPs and correct fingerprints, your bot score stays low enough that the JavaScript challenge rarely fires, and because pages are server-rendered you never need to execute it to get the data.

Setting Up the Environment

Install the dependencies. Only curl-cffi is strictly required by the code below; beautifulsoup4 and lxml are useful if you prefer DOM parsing over regex:

pip install curl-cffi beautifulsoup4 lxml

from curl_cffi import requests as cffi_requests
import json
import re
import time
import random
import sqlite3
from typing import Optional
from dataclasses import dataclass

# Realistic Chrome headers — order matters for fingerprinting
CHROME_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8,"
              "application/signed-exchange;v=b3;q=0.7",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
    "Cache-Control": "max-age=0",
}

PROXIES = {
    "http": "http://USER:[email protected]:9001",
    "https": "http://USER:[email protected]:9001",
}

def make_session(use_proxy: bool = True) -> cffi_requests.Session:
    """Create a curl_cffi session impersonating Chrome."""
    session = cffi_requests.Session()
    session.headers.update(CHROME_HEADERS)
    if use_proxy:
        session.proxies = PROXIES
    return session

def extract_next_data(html: str) -> Optional[dict]:
    """Extract and parse the __NEXT_DATA__ JSON blob from Walmart HTML."""
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
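
A quick offline sanity check of the regex on a toy page (this is illustrative markup, not real Walmart HTML):

```python
import json
import re

# Toy page mimicking the __NEXT_DATA__ pattern — illustrative only
html = (
    '<html><body>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"initialData": {"searchResult": {}}}}}'
    '</script></body></html>'
)

match = re.search(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    html,
    re.DOTALL,
)
data = json.loads(match.group(1)) if match else None
print(data["props"]["pageProps"]["initialData"])  # {'searchResult': {}}
```

On a real page the blob runs to hundreds of kilobytes, but the extraction step is identical.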

Searching for Products

Walmart search lives at walmart.com/search?q=QUERY. The data you want is in __NEXT_DATA__ inside the HTML:

@dataclass
class WalmartProduct:
    item_id: str
    name: str
    price: Optional[float]
    was_price: Optional[float]
    price_string: str
    rating: Optional[float]
    review_count: int
    url: str
    thumbnail_url: str
    brand: str
    category_path: str
    item_type: str  # 'REGULAR', 'VARIANT', 'GROUP'
    sponsored: bool
    in_stock: bool

def search_walmart(
    query: str,
    page: int = 1,
    sort: str = "best_match",
    session: Optional[cffi_requests.Session] = None,
) -> list[WalmartProduct]:
    """
    Search Walmart for products.
    sort options: 'best_match', 'price_low', 'price_high', 'rating_high', 'new'
    """
    if session is None:
        session = make_session()

    # The site accepts these sort values directly; anything else falls back
    valid_sorts = {"best_match", "price_low", "price_high", "rating_high", "new"}
    if sort not in valid_sorts:
        sort = "best_match"

    # URL-encode the query so spaces and special characters survive
    from urllib.parse import quote_plus
    url = f"https://www.walmart.com/search?q={quote_plus(query)}&page={page}&sort={sort}"

    resp = session.get(url, impersonate="chrome124", timeout=25)

    if resp.status_code != 200:
        raise RuntimeError(f"Got {resp.status_code} for search: {query}")

    data = extract_next_data(resp.text)
    if not data:
        raise ValueError("__NEXT_DATA__ not found — likely blocked or page structure changed")

    # Navigate the nested structure
    item_stacks = (
        data.get("props", {})
            .get("pageProps", {})
            .get("initialData", {})
            .get("searchResult", {})
            .get("itemStacks", [])
    )

    products = []
    for stack in item_stacks:
        for item in stack.get("items", []):
            if not item:
                continue
            products.append(_parse_search_item(item))

    return products

def _parse_search_item(item: dict) -> WalmartProduct:
    price_info = item.get("priceInfo", {})
    current_price = price_info.get("currentPrice", {})
    was_price_info = price_info.get("wasPrice", {})

    return WalmartProduct(
        item_id=item.get("usItemId", ""),
        name=item.get("name", ""),
        price=current_price.get("price"),
        was_price=was_price_info.get("price") if was_price_info else None,
        price_string=current_price.get("priceString", ""),
        rating=item.get("averageRating"),
        review_count=item.get("numberOfReviews", 0) or 0,
        url="https://www.walmart.com" + item.get("canonicalUrl", ""),
        thumbnail_url=item.get("imageInfo", {}).get("thumbnailUrl", ""),
        brand=item.get("brand", ""),
        category_path=item.get("category", {}).get("displayValue", ""),
        item_type=item.get("itemType", "REGULAR"),
        sponsored=item.get("isSponsoredItem", False),
        in_stock=item.get("availabilityStatus", "") == "IN_STOCK",
    )
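
Downstream analysis of parsed items is plain Python. A sketch that flags markdowns, using hypothetical item payloads in the same shape _parse_search_item consumes:

```python
# Hypothetical search items — same priceInfo shape the parser reads
items = [
    {"name": "Widget A", "priceInfo": {"currentPrice": {"price": 8.00},
                                       "wasPrice": {"price": 10.00}}},
    {"name": "Widget B", "priceInfo": {"currentPrice": {"price": 5.00}}},
]

def discount_pct(item: dict) -> float:
    """Percent off the 'was' price; 0 when there is no markdown."""
    info = item.get("priceInfo", {})
    now = info.get("currentPrice", {}).get("price")
    was = info.get("wasPrice", {}).get("price")
    if not now or not was or was <= now:
        return 0.0
    return round((was - now) / was * 100, 1)

marked_down = [(i["name"], discount_pct(i)) for i in items if discount_pct(i) > 0]
print(marked_down)  # [('Widget A', 20.0)]
```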

def search_all_pages(
    query: str,
    max_pages: int = 5,
    session: Optional[cffi_requests.Session] = None,
    delay_range: tuple = (1.5, 3.0),
) -> list[WalmartProduct]:
    """Paginate through search results."""
    if session is None:
        session = make_session()

    all_products = []
    for page in range(1, max_pages + 1):
        try:
            products = search_walmart(query, page=page, session=session)
            if not products:
                break
            all_products.extend(products)
            print(f"  Page {page}: {len(products)} products (total: {len(all_products)})")
            time.sleep(random.uniform(*delay_range))
        except Exception as e:
            print(f"  Page {page} failed: {e}")
            break

    return all_products

Extracting Product Details

Individual product pages follow the pattern walmart.com/ip/PRODUCT-NAME/ITEM-ID. Same __NEXT_DATA__ trick applies, with much richer data:

@dataclass
class WalmartProductDetail:
    item_id: str
    name: str
    brand: str
    description: str
    price: Optional[float]
    was_price: Optional[float]
    currency: str
    in_stock: bool
    availability_status: str
    category: list
    specs: list
    images: list
    rating: Optional[float]
    review_count: int
    fulfillment_options: list
    seller_id: Optional[str]
    seller_name: Optional[str]
    model_number: str
    upc: str
    item_type_code: str
    short_description: str
    long_description: str
    color: str
    size: str
    unit_price: Optional[str]  # price per oz, per count, etc.

def get_product_details(
    item_id: str,
    slug: str = "product",
    session: Optional[cffi_requests.Session] = None,
) -> WalmartProductDetail:
    """
    Scrape full product details from Walmart product page.
    slug: the URL-friendly product name (can use 'product' as fallback)
    """
    if session is None:
        session = make_session()

    url = f"https://www.walmart.com/ip/{slug}/{item_id}"
    resp = session.get(url, impersonate="chrome124", timeout=25)
    resp.raise_for_status()

    data = extract_next_data(resp.text)
    if not data:
        raise ValueError(f"Could not extract data from product page {item_id}")

    product = (
        data.get("props", {})
            .get("pageProps", {})
            .get("initialData", {})
            .get("data", {})
            .get("product", {})
    ) or {}

    price_info = product.get("priceInfo", {})
    current_price = price_info.get("currentPrice", {})
    was_price_info = price_info.get("wasPrice", {})

    # Category breadcrumb
    category_path = []
    for crumb in product.get("category", {}).get("categoryPathV2", []):
        category_path.append(crumb.get("name", ""))

    # Seller info (first/default seller)
    seller_info = product.get("sellerInfo", {})

    # Specifications
    specs = []
    for spec_group in product.get("specifications", []):
        for spec in spec_group.get("specifications", []):
            specs.append({
                "name": spec.get("name", ""),
                "value": spec.get("value", ""),
            })

    # Images
    images = []
    for img in product.get("imageInfo", {}).get("allImages", []):
        images.append({
            "url": img.get("url", ""),
            "alt": img.get("alt", ""),
            "asset_size_map": img.get("assetSizeMap", {}),
        })

    # Fulfillment options
    fulfillment = []
    for opt in product.get("fulfillmentOptions", []):
        fulfillment.append({
            "type": opt.get("type", ""),
            "displayValue": opt.get("displayValue", ""),
            "price": opt.get("price", {}).get("value"),
        })

    return WalmartProductDetail(
        item_id=product.get("usItemId", item_id),
        name=product.get("name", ""),
        brand=product.get("brand", ""),
        description=product.get("longDescription", "")[:1000],
        price=current_price.get("price"),
        was_price=was_price_info.get("price") if was_price_info else None,
        currency=current_price.get("currencyUnit", "USD"),
        in_stock=product.get("availabilityStatus", "") == "IN_STOCK",
        availability_status=product.get("availabilityStatus", ""),
        category=category_path,
        specs=specs,
        images=images,
        rating=product.get("averageRating"),
        review_count=product.get("numberOfReviews", 0) or 0,
        fulfillment_options=fulfillment,
        seller_id=seller_info.get("sellerId"),
        seller_name=seller_info.get("sellerDisplayName"),
        model_number=product.get("model", ""),
        upc=product.get("upc", ""),
        item_type_code=product.get("itemTypeCode", ""),
        short_description=product.get("shortDescription", "")[:500],
        long_description=product.get("longDescription", "")[:1000],
        color=product.get("variantList", {}).get("currentVariant", {}).get("color", ""),
        size=product.get("variantList", {}).get("currentVariant", {}).get("size", ""),
        unit_price=price_info.get("unitPrice", {}).get("displayValue"),
    )

Fetching Reviews

Walmart exposes a dedicated reviews API endpoint that returns JSON directly. In practice it's less heavily defended than the HTML pages, though it sits behind the same HUMAN Security front door, so keep the browser-impersonating session and realistic headers:

def get_reviews(
    item_id: str,
    page: int = 1,
    limit: int = 20,
    sort: str = "relevancy",
    session: Optional[cffi_requests.Session] = None,
) -> dict:
    """
    Fetch product reviews.
    sort: 'relevancy', 'submission-desc' (newest), 'rating-desc', 'rating-asc',
          'helpfulness', 'submission-asc'
    """
    if session is None:
        session = make_session()

    url = (
        f"https://www.walmart.com/reviews/api/contracts/reviews?"
        f"entityId={item_id}&entityType=item&page={page}&limit={limit}"
        f"&sort={sort}&returnMeta=true"
    )

    headers = {
        **CHROME_HEADERS,
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": f"https://www.walmart.com/ip/product/{item_id}",
    }

    resp = session.get(url, headers=headers, impersonate="chrome124", timeout=20)
    resp.raise_for_status()
    payload = resp.json()

    reviews = []
    for r in payload.get("reviews", []):
        reviews.append({
            "review_id": r.get("reviewId"),
            "rating": r.get("rating"),
            "title": r.get("reviewTitle", ""),
            "text": r.get("reviewText", "")[:2000],
            "author": r.get("userNickname", ""),
            "date": r.get("submissionTime", ""),
            "verified_purchase": r.get("badges", {}).get("isPurchaseVerified", False),
            "helpful_votes": r.get("positiveFeedback", 0),
            "not_helpful_votes": r.get("negativeFeedback", 0),
            "response": r.get("clientResponse", {}).get("response", ""),  # seller response
        })

    return {
        "total_reviews": payload.get("totalResults", 0),
        "page": page,
        "limit": limit,
        "average_rating": payload.get("averageOverallRating"),
        "rating_distribution": payload.get("ratingDistribution", {}),
        "reviews": reviews,
    }

def get_all_reviews(
    item_id: str,
    max_pages: int = 10,
    session: Optional[cffi_requests.Session] = None,
    delay: float = 1.5,
) -> list[dict]:
    """Collect all available reviews for a product."""
    all_reviews = []

    for page in range(1, max_pages + 1):
        data = get_reviews(item_id, page=page, session=session)
        all_reviews.extend(data["reviews"])

        if not data["reviews"]:
            break
        if len(all_reviews) >= data["total_reviews"]:
            break

        time.sleep(delay)

    return all_reviews

def get_review_summary(item_id: str, session=None) -> dict:
    """Get rating distribution and summary stats without fetching all reviews."""
    data = get_reviews(item_id, page=1, limit=1, session=session)
    return {
        "total_reviews": data["total_reviews"],
        "average_rating": data["average_rating"],
        "rating_distribution": data["rating_distribution"],
    }
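
The distribution can be turned into percentages for reporting. A sketch assuming ratingDistribution maps star values to counts (verify the exact shape against a live response):

```python
# Hypothetical summary in the shape get_review_summary returns
summary = {
    "total_reviews": 200,
    "average_rating": 4.2,
    "rating_distribution": {"5": 120, "4": 40, "3": 20, "2": 10, "1": 10},
}

total = summary["total_reviews"]
pct = {
    stars: round(count / total * 100, 1)
    for stars, count in summary["rating_distribution"].items()
}
print(pct["5"])  # 60.0
```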

Seller and Marketplace Data

Walmart's marketplace listings embed seller info directly in the product JSON. Multiple third-party sellers may offer the same item at different prices:

def get_all_sellers(item_id: str, session=None) -> list[dict]:
    """
    Get all seller offers for a product.
    Walmart marketplace items often have multiple sellers at different prices.
    """
    if session is None:
        session = make_session()

    url = f"https://www.walmart.com/ip/product/{item_id}"
    resp = session.get(url, impersonate="chrome124", timeout=25)

    data = extract_next_data(resp.text)
    if not data:
        return []

    product = (
        data.get("props", {})
            .get("pageProps", {})
            .get("initialData", {})
            .get("data", {})
            .get("product", {})
    ) or {}

    offers_data = product.get("offers", {})
    sellers = []

    for offer in offers_data.get("items", []):
        price_info = offer.get("priceInfo", {})
        current = price_info.get("currentPrice", {})

        sellers.append({
            "seller_id": offer.get("sellerId"),
            "seller_name": offer.get("sellerDisplayName", ""),
            "price": current.get("price"),
            "price_string": current.get("priceString", ""),
            "condition": offer.get("condition", "New"),
            "ships_from": offer.get("shipsFrom", ""),
            "fulfillment_types": offer.get("shipFulfillmentTypes", []),
            "arrives_by": offer.get("shippingInfo", {}).get("arrivalDate", ""),
            "is_walmart_fulfilled": offer.get("isWalmartFulfilled", False),
            "is_pro_seller": offer.get("isProSeller", False),
            "seller_rating": offer.get("sellerRating"),
            "seller_review_count": offer.get("sellerReviewCount"),
        })

    # Sort by price
    sellers.sort(key=lambda x: x["price"] or float("inf"))
    return sellers

def find_buybox_winner(sellers: list[dict]) -> Optional[dict]:
    """Identify which seller won the Buy Box (shown by default)."""
    # Walmart typically shows Walmart-fulfilled first, then sorted by price
    for seller in sellers:
        if seller.get("is_walmart_fulfilled"):
            return seller
    return sellers[0] if sellers else None
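
With all offers in hand, a useful derived metric is the spread between the buy-box price and the cheapest offer. A sketch over hypothetical sellers, using the same Walmart-fulfilled-first heuristic as find_buybox_winner:

```python
# Hypothetical offers for one item
sellers = [
    {"seller_name": "Walmart.com", "price": 24.99, "is_walmart_fulfilled": True},
    {"seller_name": "AcmeDeals", "price": 21.49, "is_walmart_fulfilled": False},
    {"seller_name": "BestBuys4U", "price": 26.00, "is_walmart_fulfilled": False},
]

# Walmart-fulfilled wins the buy box; otherwise take the first (cheapest) seller
buybox = next((s for s in sellers if s["is_walmart_fulfilled"]), sellers[0])
cheapest = min(sellers, key=lambda s: s["price"])
spread = round(buybox["price"] - cheapest["price"], 2)
print(buybox["seller_name"], cheapest["seller_name"], spread)
# Walmart.com AcmeDeals 3.5
```

A persistent positive spread is exactly the signal an arbitrage or repricing tool watches for.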

Anti-Bot Measures (The Full Picture)

Walmart uses HUMAN Security (formerly PerimeterX) for bot management. Here's what you're actually fighting:

Layer 1: TLS fingerprinting. This is where plain requests or httpx fails immediately. Your TLS handshake — cipher suites, extension order, elliptic curves — must match Chrome exactly. curl_cffi with impersonate="chrome124" handles this.

Layer 2: HTTP/2 fingerprinting. Chrome uses HTTP/2 with specific settings frames and priority signals. curl_cffi handles this too, since it actually uses libcurl with the right h2 settings.

Layer 3: Header fingerprinting. Header names, values, and order must match Chrome's behavior. Sec-Fetch-* headers, Sec-Ch-Ua-* headers, and header ordering all get checked. The headers in the examples above are ordered to match Chrome's actual behavior.

Layer 4: IP reputation. HUMAN Security maintains real-time IP reputation databases. Datacenter IP ranges (AWS, GCP, Azure, Vultr, DigitalOcean) are pre-flagged as suspicious. Even if your TLS and headers are perfect, a datacenter IP will often get challenged. Residential proxies are necessary for sustained scraping.

Layer 5: Behavioral signals. Request timing patterns, navigation flows (do you have a referrer?), cookie handling, and session consistency all contribute to the bot score.
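
Fixed sleeps produce a metronomic request signature. Jittered delays are a cheap improvement; a minimal sketch (the base/jitter values are illustrative, not tuned against HUMAN Security):

```python
import random

def human_delay(base: float = 2.0, jitter: float = 1.5) -> float:
    """Return a randomized inter-request delay so timing isn't metronomic."""
    return base + random.uniform(0, jitter)

delays = [human_delay() for _ in range(5)]
print(all(2.0 <= d <= 3.5 for d in delays))  # True
```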

def fetch_with_retry(
    url: str,
    session: cffi_requests.Session,
    max_attempts: int = 4,
    initial_delay: float = 2.0,
) -> str:
    """
    Fetch a Walmart page with exponential backoff and bot-detection handling.
    """
    for attempt in range(max_attempts):
        try:
            # Add referrer for internal pages
            if "/ip/" in url:
                session.headers["Referer"] = "https://www.walmart.com/"
                session.headers["Sec-Fetch-Site"] = "same-origin"
            else:
                session.headers.pop("Referer", None)
                session.headers["Sec-Fetch-Site"] = "none"

            resp = session.get(url, impersonate="chrome124", timeout=25)

            if resp.status_code == 200:
                if "__NEXT_DATA__" in resp.text:
                    return resp.text
                elif "px-captcha" in resp.text or "perimeterx" in resp.text.lower():
                    print(f"PerimeterX challenge on attempt {attempt + 1}")
                    time.sleep(initial_delay * (2 ** attempt) + random.uniform(0, 2))
                    continue
                else:
                    # Page returned 200 but no data — likely challenge
                    time.sleep(initial_delay * (2 ** attempt))
                    continue

            elif resp.status_code == 429:
                wait = initial_delay * (2 ** attempt) + random.uniform(1, 3)
                print(f"Rate limited (attempt {attempt + 1}). Waiting {wait:.1f}s...")
                time.sleep(wait)

            elif resp.status_code == 403:
                print(f"403 Forbidden on attempt {attempt + 1}. Rotating session...")
                time.sleep(5 + random.uniform(0, 3))

            elif resp.status_code == 503:
                print(f"503 Service unavailable on attempt {attempt + 1}.")
                time.sleep(10)

            else:
                print(f"Got {resp.status_code} on attempt {attempt + 1}")
                time.sleep(initial_delay * (attempt + 1))

        except cffi_requests.exceptions.Timeout:
            print(f"Timeout on attempt {attempt + 1}")
            time.sleep(initial_delay * (attempt + 1))

        except Exception as e:
            print(f"Error on attempt {attempt + 1}: {e}")
            time.sleep(initial_delay)

    raise RuntimeError(f"Failed to fetch {url} after {max_attempts} attempts")

ThorData Proxy Setup for Walmart

Walmart is one of the sites where datacenter proxies consistently fail — HUMAN Security's IP reputation database is thorough enough that most datacenter ranges are pre-flagged. Residential proxies are the practical solution.

ThorData's residential proxy network offers rotating US residential IPs that work reliably for Walmart. A few configuration tips for Walmart specifically:

THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = 9001

def make_sticky_session(session_id: int) -> cffi_requests.Session:
    """
    Create a session with a sticky proxy (same IP across requests).
    Use a different session_id to get a different IP.
    """
    # ThorData sticky session format: username-session-ID
    proxy_user = f"{THORDATA_USER}-session-{session_id}"
    proxy_url = f"http://{proxy_user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"

    session = cffi_requests.Session()
    session.headers.update(CHROME_HEADERS)
    session.proxies = {"http": proxy_url, "https": proxy_url}
    return session

def scrape_product_batch(
    item_ids: list[str],
    requests_per_session: int = 5,
    delay_range: tuple = (1.5, 3.5),
) -> list[dict]:
    """
    Scrape a batch of products with session rotation.
    Creates a new proxy session every N requests.
    """
    results = []
    session_id = random.randint(1000, 99999)
    session = make_sticky_session(session_id)
    requests_this_session = 0

    for i, item_id in enumerate(item_ids):
        # Rotate session
        if requests_this_session >= requests_per_session:
            session.close()
            session_id = random.randint(1000, 99999)
            session = make_sticky_session(session_id)
            requests_this_session = 0
            time.sleep(random.uniform(1, 2))

        try:
            html = fetch_with_retry(
                f"https://www.walmart.com/ip/product/{item_id}",
                session=session,
            )
            data = extract_next_data(html)
            if data:
                product = (
                    data.get("props", {})
                        .get("pageProps", {})
                        .get("initialData", {})
                        .get("data", {})
                        .get("product", {})
                ) or {}
                results.append({
                    "item_id": item_id,
                    "name": product.get("name", ""),
                    "price": product.get("priceInfo", {}).get("currentPrice", {}).get("price"),
                    "in_stock": product.get("availabilityStatus") == "IN_STOCK",
                })
            requests_this_session += 1
        except Exception as e:
            print(f"Failed on item {item_id}: {e}")

        time.sleep(random.uniform(*delay_range))

    session.close()
    return results

Price Monitoring Pipeline

The most common real-world use case: track prices on a watchlist over time.

def setup_price_monitor_db(db_path: str = "walmart_prices.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS products (
            item_id TEXT PRIMARY KEY,
            name TEXT,
            brand TEXT,
            url TEXT,
            category TEXT,
            upc TEXT,
            model TEXT,
            created_at TEXT DEFAULT (datetime('now')),
            updated_at TEXT
        );

        CREATE TABLE IF NOT EXISTS price_snapshots (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            item_id TEXT NOT NULL,
            current_price REAL,
            was_price REAL,
            in_stock INTEGER,
            seller_id TEXT,
            seller_name TEXT,
            rating REAL,
            review_count INTEGER,
            availability_status TEXT,
            captured_at TEXT DEFAULT (datetime('now')),
            FOREIGN KEY (item_id) REFERENCES products(item_id)
        );

        CREATE INDEX IF NOT EXISTS idx_snapshots_item ON price_snapshots(item_id);
        CREATE INDEX IF NOT EXISTS idx_snapshots_captured ON price_snapshots(captured_at);
        CREATE INDEX IF NOT EXISTS idx_snapshots_price ON price_snapshots(current_price);
    """)
    conn.commit()
    return conn
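
The snapshot table supports simple time-series queries. A self-contained demo against an in-memory database, using a pared-down version of the same table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE price_snapshots (
        item_id TEXT, current_price REAL, captured_at TEXT
    )
""")
rows = [
    ("123", 19.99, "2026-01-01"),
    ("123", 17.49, "2026-01-02"),
    ("123", 18.99, "2026-01-03"),
]
conn.executemany("INSERT INTO price_snapshots VALUES (?, ?, ?)", rows)

# Lowest observed price and the day it was captured
low = conn.execute("""
    SELECT current_price, captured_at FROM price_snapshots
    WHERE item_id = ? ORDER BY current_price ASC LIMIT 1
""", ("123",)).fetchone()
print(low)  # (17.49, '2026-01-02')
```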

def record_price_snapshot(
    conn: sqlite3.Connection,
    detail: WalmartProductDetail,
):
    """Record current price state to database."""
    # Upsert with ON CONFLICT rather than INSERT OR REPLACE, which would
    # delete and re-insert the row and silently reset created_at
    conn.execute("""
        INSERT INTO products (item_id, name, brand, url, category, upc, model, updated_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, datetime('now'))
        ON CONFLICT(item_id) DO UPDATE SET
            name=excluded.name, brand=excluded.brand, url=excluded.url,
            category=excluded.category, upc=excluded.upc, model=excluded.model,
            updated_at=excluded.updated_at
    """, (
        detail.item_id, detail.name, detail.brand,
        f"https://www.walmart.com/ip/product/{detail.item_id}",
        " > ".join(detail.category),
        detail.upc, detail.model_number,
    ))

    conn.execute("""
        INSERT INTO price_snapshots
        (item_id, current_price, was_price, in_stock, seller_id, seller_name,
         rating, review_count, availability_status)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        detail.item_id, detail.price, detail.was_price,
        int(detail.in_stock), detail.seller_id, detail.seller_name,
        detail.rating, detail.review_count, detail.availability_status,
    ))
    conn.commit()

def check_price_alerts(conn: sqlite3.Connection, item_id: str, alert_drop_pct: float = 10.0) -> dict:
    """Check if a product's price has dropped by the threshold percentage."""
    rows = conn.execute("""
        SELECT current_price, captured_at FROM price_snapshots
        WHERE item_id = ? ORDER BY captured_at DESC LIMIT 30
    """, (item_id,)).fetchall()

    if len(rows) < 2:
        return {"alert": False}

    latest_price = rows[0][0]
    prices = [r[0] for r in rows if r[0] is not None]

    # max() on an empty sequence raises, so guard against all-None prices
    if prices and latest_price:
        max_recent_price = max(prices)
        drop_pct = ((max_recent_price - latest_price) / max_recent_price) * 100
        if drop_pct >= alert_drop_pct:
            return {
                "alert": True,
                "item_id": item_id,
                "current_price": latest_price,
                "peak_price": max_recent_price,
                "drop_percentage": round(drop_pct, 1),
            }

    return {"alert": False}
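
As a quick check on the arithmetic: with a 10% threshold, a peak of $30.00 fires at $27.00 or below, so $26.99 triggers the alert:

```python
# Worked example of the drop-percentage formula used in check_price_alerts
peak, current, threshold = 30.00, 26.99, 10.0
drop_pct = (peak - current) / peak * 100
print(round(drop_pct, 1), drop_pct >= threshold)
```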

def run_price_monitor_cycle(
    watchlist: list[str],
    db_path: str = "walmart_prices.db",
) -> dict:
    """Run one monitoring cycle: scrape all watchlist items and record prices."""
    conn = setup_price_monitor_db(db_path)
    session = make_sticky_session(random.randint(1000, 99999))
    stats = {"scraped": 0, "errors": 0, "alerts": []}

    for item_id in watchlist:
        try:
            html = fetch_with_retry(
                f"https://www.walmart.com/ip/product/{item_id}",
                session=session,
            )
            data = extract_next_data(html)
            if not data:
                stats["errors"] += 1
                continue

            product_raw = (
                data.get("props", {})
                    .get("pageProps", {})
                    .get("initialData", {})
                    .get("data", {})
                    .get("product", {})
            ) or {}

            # Build minimal detail for recording
            price_info = product_raw.get("priceInfo", {})
            detail = WalmartProductDetail(
                item_id=item_id,
                name=product_raw.get("name", ""),
                brand=product_raw.get("brand", ""),
                description="",
                price=price_info.get("currentPrice", {}).get("price"),
                was_price=price_info.get("wasPrice", {}).get("price") if price_info.get("wasPrice") else None,
                currency="USD",
                in_stock=product_raw.get("availabilityStatus") == "IN_STOCK",
                availability_status=product_raw.get("availabilityStatus", ""),
                category=[],
                specs=[],
                images=[],
                rating=product_raw.get("averageRating"),
                review_count=product_raw.get("numberOfReviews", 0) or 0,
                fulfillment_options=[],
                seller_id=product_raw.get("sellerInfo", {}).get("sellerId"),
                seller_name=product_raw.get("sellerInfo", {}).get("sellerDisplayName"),
                model_number=product_raw.get("model", ""),
                upc=product_raw.get("upc", ""),
                item_type_code=product_raw.get("itemTypeCode", ""),
                short_description="",
                long_description="",
                color="",
                size="",
                unit_price=None,
            )

            record_price_snapshot(conn, detail)
            stats["scraped"] += 1

            alert = check_price_alerts(conn, item_id)
            if alert.get("alert"):
                stats["alerts"].append(alert)

        except Exception as e:
            print(f"Error on {item_id}: {e}")
            stats["errors"] += 1

        time.sleep(random.uniform(2.0, 4.0))

    session.close()
    conn.close()
    return stats

Legal and Ethical Considerations

Walmart's robots.txt disallows crawling of most product pages (/ip/, /search/), which means you're operating outside their explicitly permitted scope. The practical reality for scraping publicly visible price data comes down to a few rules of thumb:

Don't hammer their servers. Keep request rates human-shaped. Don't collect personal data. Use the data for legitimate business intelligence purposes, and you'll be operating in the same space as dozens of established companies doing the same thing.

The technical stack here — curl_cffi for TLS fingerprinting, ThorData residential proxies for IP reputation, realistic headers and session management — is what makes the difference between getting blocked on the first page and running a reliable price monitoring pipeline.