
Scraping Walmart Marketplace in 2026: Seller Data, Pricing, and Product Reviews

Walmart Marketplace has quietly built one of the more aggressive anti-bot stacks in ecommerce. The data — seller ratings, fulfillment type, third-party pricing — is valuable for competitive intelligence, and less picked-over than Amazon. Here is what works in 2026.

Walmart's Anti-Bot Stack

Walmart runs a layered defense that catches most naive scrapers within the first ten requests.

The primary layer is PerimeterX (now HUMAN), a JavaScript challenge that fingerprints the browser environment — headless Chrome indicators, missing APIs, inconsistent screen dimensions, canvas artifacts. Default Playwright or Puppeteer will get a challenge page or silent 403 before any product data loads.

On top of that, Akamai's WAF scores IP reputation. Datacenter ranges from AWS, GCP, and Azure are flagged aggressively, and residential IPs already flagged on other Akamai-protected properties get throttled. TLS fingerprinting is another common tripwire: Walmart's edge checks your JA3 fingerprint, and Python's requests library produces a recognizable one. httpx with HTTP/2 gets you closer, but you still need Chrome's cipher suite order.

Rate limiting is adaptive. Sustained traffic above roughly 30 requests per minute triggers throttling, and the threshold drops further if your pattern is too regular: identical intervals, identical header order.
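
If your pattern is the giveaway, randomize it. A minimal pacing sketch (the helper names are mine, not any library's API): jitter every inter-request delay so intervals never repeat while throughput stays under the threshold.

```python
import random
import time

def humanized_delay(base: float = 2.0, jitter: float = 1.5) -> float:
    """Return a randomized inter-request delay in seconds.

    Uniform jitter around a base interval breaks the fixed-cadence,
    identical-spacing pattern that adaptive rate limiters key on.
    """
    return base + random.uniform(0.0, jitter)

def paced_fetch(fetch, urls, base: float = 2.0, jitter: float = 1.5):
    """Fetch a list of URLs with randomized spacing. At the defaults
    (2.0-3.5s per request) this stays well under ~30 requests/minute."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(humanized_delay(base, jitter))
    return results
```

Pass any callable as `fetch` (for example, a bound `client.get`); the pacing logic stays independent of the HTTP stack.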

Walmart Open API: What It Covers (and What It Does Not)

The official path is through the Walmart Developer Portal. You register an app, get a consumer ID and private key, and sign requests with a timestamp-based HMAC-SHA256 signature.

import httpx, time, base64, hashlib, hmac

CONSUMER_ID = "your-consumer-id"
PRIVATE_KEY_VERSION = "1"

def generate_auth_headers(url: str, private_key_bytes: bytes) -> dict:
    timestamp = str(int(time.time() * 1000))
    message = f"{CONSUMER_ID}\n{url}\n{timestamp}\n"
    sig = hmac.new(private_key_bytes, message.encode(), hashlib.sha256).digest()
    return {
        "WM_CONSUMER.ID": CONSUMER_ID,
        "WM_CONSUMER.INTIMESTAMP": timestamp,
        "WM_SEC.AUTH_SIGNATURE": base64.b64encode(sig).decode(),
        "WM_SEC.KEY_VERSION": PRIVATE_KEY_VERSION,
        "WM_SVC.NAME": "Walmart Marketplace",
        "Accept": "application/json",
    }

def api_get(path: str) -> dict:
    url = f"https://developer.api.walmart.com/api-proxy/service/affil/product/v2/{path}"
    with open("private_key.pem", "rb") as f:
        headers = generate_auth_headers(url, f.read())
    with httpx.Client() as client:
        r = client.get(url, headers=headers, timeout=15)
        r.raise_for_status()
        return r.json()

# Catalog search — returns up to 25 items per page
results = api_get("search?query=coffee+maker&numItems=25&start=1")

# Single item lookup by Walmart item ID
item = api_get("items/123456789")

# Full taxonomy — get the entire category tree
taxonomy = api_get("taxonomy")

What the API gives you: catalog metadata (name, brand, UPC, category, Walmart's price), basic availability, and product images. What it does not give you: third-party seller listings, per-seller pricing, seller ratings, review content, or fulfillment status breakdowns. For all of that, you scrape.

The API is rate-limited to 5 requests per second and requires a qualifying website or app. It is useful for initializing your item ID database at scale — use it to bulk-collect item IDs by category, then switch to scraping for seller intelligence.
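
A sketch of that bootstrap flow. The response field names here ("items", "itemId") are assumptions about the Affiliate API payload shape, so verify them against a real response before trusting the extraction; the fetcher is passed in as a parameter so you can plug in the api_get() function above.

```python
from typing import Callable

def extract_item_ids(search_response: dict) -> list[str]:
    """Pull item IDs out of one API search response.

    The 'items' / 'itemId' field names are assumptions -- check them
    against your own payloads.
    """
    return [str(it["itemId"]) for it in search_response.get("items", [])
            if "itemId" in it]

def collect_item_ids(api_fetch: Callable[[str], dict], query: str,
                     pages: int = 4, page_size: int = 25) -> list[str]:
    """Page through API search results, accumulating item IDs to feed
    the scraper later. Stops early when a page comes back empty."""
    ids: list[str] = []
    for page in range(pages):
        start = 1 + page * page_size
        resp = api_fetch(f"search?query={query}&numItems={page_size}&start={start}")
        batch = extract_item_ids(resp)
        if not batch:
            break
        ids.extend(batch)
    return ids
```

Usage would be `collect_item_ids(api_get, "coffee maker")`, building the item ID database the rest of the pipeline consumes.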

Complete Product Scraper: Extracting from NEXT_DATA

Walmart's product pages embed their full data payload in a <script id="__NEXT_DATA__"> JSON block. This is the same data the React frontend uses, so it is comprehensive and structured. The path through that JSON is stable but deep.

import httpx
import json
import re
import time
from typing import Optional

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"macOS"',
    "Upgrade-Insecure-Requests": "1",
}

def extract_next_data(html: str) -> dict:
    match = re.search(
        r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
        html, re.DOTALL
    )
    if not match:
        raise ValueError("__NEXT_DATA__ not found in page HTML")
    return json.loads(match.group(1))

def scrape_product(item_id: str, proxies: Optional[dict] = None) -> dict:
    url = f"https://www.walmart.com/ip/{item_id}"
    with httpx.Client(headers=HEADERS, proxies=proxies, http2=True, follow_redirects=True) as client:
        r = client.get(url, timeout=20)
        r.raise_for_status()

    data = extract_next_data(r.text)

    # Root path: props -> pageProps -> initialData -> data -> product
    product = (data
        .get("props", {})
        .get("pageProps", {})
        .get("initialData", {})
        .get("data", {})
        .get("product", {}))

    if not product:
        raise ValueError(f"No product data found for item {item_id}")

    # Price info lives under product -> priceInfo
    price_info = product.get("priceInfo", {})
    current_price = price_info.get("currentPrice", {}).get("price")
    was_price = price_info.get("wasPrice", {}).get("price")
    savings = price_info.get("savings", {}).get("amount")

    # Images: product -> imageInfo -> allImages (list of {url, type})
    images = [
        img.get("url") for img in
        product.get("imageInfo", {}).get("allImages", [])
    ]

    # Specifications: product -> specifications (list of {name, value})
    specs = {
        s.get("name"): s.get("value")
        for s in product.get("specifications", [])
    }

    # Category breadcrumb: product -> category -> path (list of {name, url})
    category_path = [
        c.get("name") for c in
        product.get("category", {}).get("path", [])
    ]

    # Shipping: product -> fulfillmentOptions (list with type, estimatedDeliveryDate)
    fulfillment_options = product.get("fulfillmentOptions", [])
    delivery_date = next(
        (f.get("estimatedDeliveryDate") for f in fulfillment_options
         if f.get("type") == "DELIVERY"), None
    )
    pickup_available = any(
        f.get("type") == "PICKUP" for f in fulfillment_options
    )

    # Offers: product -> offers -> items (list of seller offer objects)
    offers = product.get("offers", {}).get("items", [])
    sellers = []
    for o in offers:
        sellers.append({
            "name": o.get("sellerName"),
            "seller_id": o.get("sellerId"),
            "price": o.get("priceInfo", {}).get("currentPrice", {}).get("price"),
            "fulfillment": o.get("fulfillmentType"),  # "FC" or "MP"
            "wfs": o.get("wfsEnabled", False),
            "rating": o.get("sellerInfo", {}).get("rating"),
            "ships_from": o.get("shipFromCountry"),
        })

    return {
        "item_id": item_id,
        "name": product.get("name"),
        "brand": product.get("brand"),
        "upc": product.get("upc"),
        "gtin": product.get("gtin13"),
        "current_price": current_price,
        "was_price": was_price,
        "savings": savings,
        "rating": product.get("averageRating"),
        "review_count": product.get("numberOfReviews"),
        "images": images,
        "description": product.get("shortDescription"),
        "specifications": specs,
        "category_path": category_path,
        "availability": product.get("availabilityStatus"),
        "pickup_available": pickup_available,
        "delivery_date": delivery_date,
        "sellers": sellers,
    }

Realistic output from that scraper:

{
  "name": "Apple AirPods Pro (2nd Generation)",
  "brand": "Apple",
  "item_id": "1752657021",
  "upc": "194253397175",
  "current_price": 189.99,
  "was_price": 249.00,
  "savings": 59.01,
  "rating": 4.7,
  "review_count": 15234,
  "category_path": ["Electronics", "Headphones", "True Wireless"],
  "images": [
    "https://i5.walmartimages.com/asr/abc123_main.jpg",
    "https://i5.walmartimages.com/asr/abc123_angle.jpg"
  ],
  "specifications": {
    "Color": "White",
    "Connectivity": "Bluetooth",
    "Battery Life": "Up to 30 hours total",
    "Water Resistance": "IPX4"
  },
  "sellers": [
    {"name": "Walmart.com", "price": 189.99, "fulfillment": "FC", "wfs": true},
    {"name": "TechDeals LLC", "price": 194.50, "fulfillment": "MP", "wfs": false, "rating": 4.2}
  ],
  "availability": "In Stock",
  "pickup_available": true,
  "delivery_date": "Apr 2"
}

Search Results Scraper

Walmart search at walmart.com/search?q=keyword uses the same __NEXT_DATA__ pattern. The search-specific JSON path differs from product pages.

def scrape_search(query: str, page_num: int = 1, proxies: Optional[dict] = None) -> dict:
    url = f"https://www.walmart.com/search?q={query.replace(' ', '+')}&sort=best_match&page={page_num}"
    with httpx.Client(headers=HEADERS, proxies=proxies, http2=True, follow_redirects=True) as client:
        r = client.get(url, timeout=20)
        r.raise_for_status()

    data = extract_next_data(r.text)

    # Search path: props -> pageProps -> initialData -> searchResult
    search_result = (data
        .get("props", {})
        .get("pageProps", {})
        .get("initialData", {})
        .get("searchResult", {}))

    item_stacks = (search_result
        .get("itemStacks", [{}])[0]
        .get("items", []))

    products = []
    for item in item_stacks:
        products.append({
            "item_id": item.get("usItemId"),
            "name": item.get("name"),
            "price": item.get("priceInfo", {}).get("currentPrice", {}).get("price"),
            "rating": item.get("averageRating"),
            "review_count": item.get("numberOfReviews"),
            "thumbnail": item.get("imageInfo", {}).get("thumbnailUrl"),
            "sponsored": item.get("isSponsoredFlag", False),
            "pickup_available": item.get("fulfillmentBadge") == "Pickup available",
            "delivery_badge": item.get("shippingOptions", {}).get("deliveryBadge"),
        })

    total_pages = search_result.get("paginationV2", {}).get("maxPage", 1)
    return {"products": products, "total_pages": total_pages, "page": page_num}

def scrape_all_search_pages(query: str, max_pages: int = 5, proxies: Optional[dict] = None) -> list[dict]:
    """Paginate through search results using Walmart's page parameter."""
    all_products = []
    for page in range(1, max_pages + 1):
        result = scrape_search(query, page_num=page, proxies=proxies)
        all_products.extend(result["products"])
        if page >= result["total_pages"]:
            break
        time.sleep(2.5 + (0.5 * page))  # progressive backoff
    return all_products

Walmart paginates with page=N in the query string, 25 items per page. The isSponsoredFlag field filters paid placements from organic results — relevant for competitive ranking analysis. deliveryBadge surfaces strings like "Free shipping, arrives tomorrow" which you can parse for delivery speed intelligence.
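
Parsing those badge strings is simple string work. This parser is a best-effort sketch covering the common phrasings, not an exhaustive grammar of Walmart's badge copy:

```python
import re
from typing import Optional

def parse_delivery_badge(badge: Optional[str]) -> dict:
    """Turn a delivery badge string like 'Free shipping, arrives tomorrow'
    into structured fields. Patterns are best-effort; badge wording varies."""
    if not badge:
        return {"free_shipping": False, "speed": None}
    text = badge.lower()
    free = "free shipping" in text or "free delivery" in text
    if "today" in text:
        speed = "same_day"
    elif "tomorrow" in text:
        speed = "next_day"
    else:
        # e.g. "arrives in 3 days" / "in 3+ days"
        m = re.search(r"in (\d+)\+? days?", text)
        speed = f"{m.group(1)}_days" if m else None
    return {"free_shipping": free, "speed": speed}
```

Run it over each search result's delivery_badge field to build a delivery-speed distribution per query or category.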

Category Browser

Walmart browse pages at walmart.com/browse/{department} expose category trees in __NEXT_DATA__. Crawl the tree to map department structure and product counts.

def scrape_category(path: str, proxies: Optional[dict] = None) -> dict:
    """
    path examples: "electronics", "electronics/computers",
                   "home-improvement/lighting"
    """
    url = f"https://www.walmart.com/browse/{path}"
    with httpx.Client(headers=HEADERS, proxies=proxies, http2=True, follow_redirects=True) as client:
        r = client.get(url, timeout=20)
        r.raise_for_status()

    data = extract_next_data(r.text)

    # Browse path: props -> pageProps -> initialData -> browseResult
    browse = (data
        .get("props", {})
        .get("pageProps", {})
        .get("initialData", {})
        .get("browseResult", {}))

    # Facets contain subcategory options with product counts
    facets = browse.get("facets", [])
    category_facet = next(
        (f for f in facets if f.get("name") == "Category"), {}
    )

    subcategories = []
    for option in category_facet.get("values", []):
        subcategories.append({
            "name": option.get("value"),
            "count": option.get("count"),
            "url": option.get("url"),
        })

    return {
        "path": path,
        "total_items": browse.get("paginationV2", {}).get("total"),
        "subcategories": subcategories,
    }

def build_category_tree(root_path: str = "electronics", depth: int = 2,
                         proxies: Optional[dict] = None) -> dict:
    """Recursively build category tree up to specified depth."""
    node = scrape_category(root_path, proxies=proxies)
    if depth > 0:
        for sub in node["subcategories"]:
            sub_path = sub["url"].replace("/browse/", "").split("?")[0]
            try:
                sub["children"] = build_category_tree(sub_path, depth - 1, proxies)
                time.sleep(2)
            except Exception as e:
                sub["children"] = None
                sub["error"] = str(e)
    return node

The category tree is the foundation for market share analysis. You can compare total item counts per subcategory against seller density to surface underpopulated niches.
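
As a sketch of that analysis, here is a walker over the build_category_tree() output that flags subcategories below an item-count threshold. The threshold and the returned record shape are my own choices, not anything Walmart exposes:

```python
def find_thin_categories(node: dict, min_items: int = 500) -> list[dict]:
    """Recursively walk a category tree (the shape build_category_tree()
    returns) and collect subcategories with few listings -- candidate
    underpopulated niches. Tune min_items per department."""
    thin = []
    for sub in node.get("subcategories", []):
        count = sub.get("count") or 0
        if 0 < count < min_items:
            thin.append({"path": node.get("path"),
                         "name": sub.get("name"),
                         "count": count})
        child = sub.get("children")
        if child:
            thin.extend(find_thin_categories(child, min_items))
    return sorted(thin, key=lambda t: t["count"])
```

Cross-reference the output against seller counts from the product scraper to separate genuinely thin niches from low-demand dead ends.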

Seller Profile Scraper

Seller detail pages at walmart.com/seller/{sellerId} expose aggregate performance metrics. Seller IDs appear in the offer objects on any product page where that seller appears — extract them from scrape_product() output.

def scrape_seller(seller_id: str, proxies: Optional[dict] = None) -> dict:
    url = f"https://www.walmart.com/seller/{seller_id}"
    with httpx.Client(headers=HEADERS, proxies=proxies, http2=True, follow_redirects=True) as client:
        r = client.get(url, timeout=20)
        r.raise_for_status()

    data = extract_next_data(r.text)

    seller = (data
        .get("props", {})
        .get("pageProps", {})
        .get("initialData", {})
        .get("data", {})
        .get("seller", {}))

    metrics = seller.get("sellerScorecardSummary", {})

    return {
        "seller_id": seller_id,
        "name": seller.get("displayName"),
        "rating": seller.get("sellerRating"),
        "review_count": seller.get("numSellerReviews"),
        "on_time_shipping_pct": metrics.get("onTimeShipping"),
        "cancellation_rate_pct": metrics.get("cancellationRate"),
        "return_policy": seller.get("returnPolicy", {}).get("description"),
        "total_items": seller.get("catalogCount"),
        "joined_date": seller.get("joinedDate"),
    }

def collect_seller_ids_from_item(item_id: str, proxies: Optional[dict] = None) -> list[str]:
    """Extract all seller IDs from a product's offer set."""
    product = scrape_product(item_id, proxies=proxies)
    return [s["seller_id"] for s in product.get("sellers", []) if s.get("seller_id")]

on_time_shipping_pct and cancellation_rate_pct are the same metrics Walmart uses internally to determine buy box eligibility. A seller with on-time shipping below 95% or cancellation above 2% is vulnerable to losing the buy box — useful signal for competitive analysis. Seller profiles change slowly; cache them for 24-48 hours.
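
Those thresholds translate directly into a flagging helper over scrape_seller() output. The 95% / 2% cutoffs come from the paragraph above; treating missing metrics as not-vulnerable is my own conservative choice:

```python
def buybox_vulnerable(seller: dict, min_on_time: float = 95.0,
                      max_cancel: float = 2.0) -> bool:
    """Flag sellers at risk of losing buy box eligibility: on-time
    shipping below 95% or cancellation rate above 2%. Metrics that are
    missing from the profile are treated as unknown, not as failures."""
    checks = []
    on_time = seller.get("on_time_shipping_pct")
    if on_time is not None:
        checks.append(on_time < min_on_time)
    cancel = seller.get("cancellation_rate_pct")
    if cancel is not None:
        checks.append(cancel > max_cancel)
    return any(checks)
```

Run this across cached seller profiles to build a watchlist of competitors whose buy box position is fragile.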

Anti-Detection Deep Dive

PerimeterX / HUMAN Defense

PerimeterX checks for headless browser tells: missing window.chrome object, navigator.webdriver = true, inconsistent plugin counts, canvas fingerprint anomalies, and missing browser APIs. With Playwright, use proper launch flags to suppress these signals:

from playwright.async_api import async_playwright
import asyncio

async def fetch_with_playwright(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-dev-shm-usage",
                "--no-sandbox",
                "--disable-setuid-sandbox",
                "--disable-infobars",
                "--window-size=1920,1080",
                "--start-maximized",
            ]
        )
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            locale="en-US",
            timezone_id="America/Chicago",
        )
        # Patch navigator.webdriver before page loads
        await context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content

Playwright naturally generates realistic mouse movement data, scroll events, and interaction timing through its normal navigation — this is what Akamai's sensor data validation watches for. You do not need to manually simulate mouse movements for page loads, only for form interactions.

TLS Fingerprint Matching

Python's standard httpx and requests produce a JA3 fingerprint that differs from Chrome. The fix is curl_cffi, which impersonates Chrome's TLS stack at the C level:

from curl_cffi import requests as cffi_requests

def fetch_with_chrome_tls(url: str, proxies: Optional[dict] = None) -> str:
    resp = cffi_requests.get(
        url,
        headers=HEADERS,
        proxies=proxies,
        impersonate="chrome124",  # matches Chrome 124 JA3 + ALPN exactly
        timeout=20,
    )
    resp.raise_for_status()
    return resp.text

curl_cffi handles cipher suite order, TLS extensions list, and ALPN negotiation to match the specified Chrome version. It is the most reliable pure-HTTP approach when you do not want the overhead of a full browser.

Akamai Sensor Data

Akamai's sensor script (/akam/11/...) collects mouse trajectories, scroll patterns, keyboard timing, and page interaction sequences. It sends an encrypted blob in the sensor_data field. Playwright handles this naturally through normal navigation behavior.

For pure HTTP scraping, Akamai challenges are occasional. The practical workaround: run a small Playwright pool to initialize sessions (collect _abck and bm_sz cookies), then replay those cookies in your fast httpx requests until expiry (typically 30-60 minutes per session).
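
The replay step reduces to filtering the browser's cookie dump down to the cookies that matter. A minimal sketch, assuming Playwright's context.cookies() list-of-dicts shape:

```python
def session_cookies(browser_cookies: list[dict]) -> dict[str, str]:
    """Filter a browser cookie dump (e.g. Playwright's context.cookies())
    down to the Akamai session cookies worth replaying over plain HTTP.

    Everything else (preferences, analytics) is dropped to keep the
    replayed jar minimal.
    """
    keep = {"_abck", "bm_sz"}
    return {c["name"]: c["value"] for c in browser_cookies if c.get("name") in keep}

# Usage sketch: harvest once with the Playwright helper above, then
#   jar = session_cookies(await context.cookies())
#   client = httpx.Client(headers=HEADERS, cookies=jar, http2=True)
# and reuse that client until the cookies expire (~30-60 minutes).
```

Refresh the pool on a timer rather than on failure: a 403 mid-crawl usually means the IP is already burned along with the cookies.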

Key session cookies for Walmart are _abck (Akamai's bot score, validated on every request), bm_sz (the Akamai session identifier paired with _abck), and the location / wm_latlong pair that carries ZIP context for geo-specific pricing.

Maintain these across requests in one session. A common mistake is reinitializing cookies per request: that reads as a new client connecting repeatedly from the same IP, which is itself a bot signal regardless of IP quality.

# Re-use the same client instance across requests to maintain cookie jar
client = httpx.Client(headers=HEADERS, http2=True, follow_redirects=True)

# First request establishes session cookies
client.get("https://www.walmart.com/")
time.sleep(1.5)

# Subsequent requests carry the full cookie chain automatically
product_page = client.get("https://www.walmart.com/ip/1752657021")
search_page = client.get("https://www.walmart.com/search?q=airpods")

Geo-Targeting: Location-Specific Pricing

Walmart shows different prices by ZIP code, particularly for grocery, pharmacy, and store-pickup items. Location is set via the location and wm_latlong cookies.

def set_walmart_location(client: httpx.Client, zip_code: str) -> None:
    """Set ZIP code in session to see location-specific pricing."""
    client.get(
        f"https://www.walmart.com/store/finder?location={zip_code}&distance=50",
        headers={**HEADERS, "Accept": "application/json"}
    )
    # Subsequent product requests in this session reflect that ZIP's pricing

For systematic geo-price comparison, pair location cookies with proxy routing — set the proxy to match the ZIP's geographic region. Residential proxies from ThorData let you target specific US states, so your IP origin aligns with the ZIP code you are setting. A Texas IP querying a New York ZIP is a mismatched signal that raises detection risk.

PROXY_USER = "your-thordata-username"
PROXY_PASS = "your-thordata-password"

def build_proxy(state: str = "TX") -> dict:
    auth_str = f"{PROXY_USER}-country-US-state-{state}:{PROXY_PASS}"
    proxy_url = f"http://{auth_str}@rotating.thordata.com:7777"
    return {"http://": proxy_url, "https://": proxy_url}

# Compare AirPods price across three states
for state, zip_code in [("TX", "75201"), ("NY", "10001"), ("CA", "90210")]:
    proxies = build_proxy(state=state)
    with httpx.Client(headers=HEADERS, proxies=proxies, http2=True,
                      follow_redirects=True) as client:
        set_walmart_location(client, zip_code)
        time.sleep(1)
        # Fetch with the same client: scrape_product() would open a fresh
        # client and drop the location cookies we just set
        r = client.get("https://www.walmart.com/ip/1752657021", timeout=20)
        data = extract_next_data(r.text)
        product = (data.get("props", {}).get("pageProps", {})
                   .get("initialData", {}).get("data", {}).get("product", {}))
        price = product.get("priceInfo", {}).get("currentPrice", {}).get("price")
        print(f"{state} ({zip_code}): ${price}")

Grocery and perishable categories show the most price variation by ZIP. Electronics less so, but fulfillment options and availability differ meaningfully.

Complete Price Tracker with SQLite

Full working tracker: creates schema, polls on schedule, records per-seller price history, and surfaces the biggest drops over the last 24 hours.

import sqlite3
import random
from datetime import datetime, timedelta

DB_PATH = "walmart_prices.db"

SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    item_id TEXT PRIMARY KEY,
    name TEXT,
    brand TEXT,
    upc TEXT,
    category TEXT,
    added_at TEXT
);

CREATE TABLE IF NOT EXISTS price_history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    item_id TEXT NOT NULL,
    seller_id TEXT,
    seller_name TEXT,
    price REAL,
    was_price REAL,
    fulfillment TEXT,
    wfs INTEGER,
    seller_rating REAL,
    recorded_at TEXT NOT NULL,
    FOREIGN KEY (item_id) REFERENCES items(item_id)
);

CREATE INDEX IF NOT EXISTS idx_item_time ON price_history(item_id, recorded_at);
CREATE INDEX IF NOT EXISTS idx_seller_time ON price_history(seller_id, recorded_at);
"""

def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(DB_PATH)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn

def record_prices(conn: sqlite3.Connection, product: dict) -> None:
    ts = datetime.utcnow().isoformat()
    conn.execute(
        "INSERT OR IGNORE INTO items VALUES (?,?,?,?,?,?)",
        (product["item_id"], product["name"], product["brand"],
         product.get("upc"), "|".join(product.get("category_path", [])), ts)
    )
    for seller in product.get("sellers", []):
        conn.execute(
            "INSERT INTO price_history VALUES (NULL,?,?,?,?,?,?,?,?,?)",
            (product["item_id"], seller.get("seller_id"), seller.get("name"),
             seller.get("price"), product.get("was_price"),
             seller.get("fulfillment"), int(seller.get("wfs", False)),
             seller.get("rating"), ts)
        )
    conn.commit()

def get_price_drops(conn: sqlite3.Connection, hours: int = 24) -> list[dict]:
    """Find items with the biggest price drops in the last N hours."""
    cutoff = (datetime.utcnow() - timedelta(hours=hours)).isoformat()
    rows = conn.execute("""
        WITH recent AS (
            SELECT
                item_id, seller_id, seller_name, price, recorded_at,
                MIN(price) OVER (
                    PARTITION BY item_id, seller_id
                    ORDER BY recorded_at
                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
                ) AS prev_min_price
            FROM price_history
            WHERE recorded_at >= ?
        )
        SELECT
            r.item_id,
            i.name,
            r.seller_name,
            r.prev_min_price  AS old_price,
            r.price           AS new_price,
            ROUND((r.prev_min_price - r.price) / r.prev_min_price * 100, 1) AS drop_pct,
            r.recorded_at
        FROM recent r
        JOIN items i ON r.item_id = i.item_id
        WHERE r.prev_min_price IS NOT NULL
          AND r.price < r.prev_min_price
        ORDER BY drop_pct DESC
        LIMIT 20
    """, (cutoff,)).fetchall()

    return [
        {"item_id": r[0], "name": r[1], "seller": r[2],
         "old_price": r[3], "new_price": r[4], "drop_pct": r[5], "at": r[6]}
        for r in rows
    ]

ITEMS_TO_TRACK = ["1752657021", "123456789", "987654321"]

def run_tracker(interval_minutes: int = 60) -> None:
    conn = init_db()
    while True:
        for item_id in ITEMS_TO_TRACK:
            state = random.choice(["IL", "TX", "CA", "FL", "NY"])
            proxies = build_proxy(state=state)
            try:
                product = scrape_product(item_id, proxies=proxies)
                record_prices(conn, product)
                prices = [s["price"] for s in product["sellers"] if s.get("price")]
                best = f"${min(prices):.2f}" if prices else "n/a"
                print(f"[{datetime.utcnow().isoformat()}] {item_id}: "
                      f"{len(product['sellers'])} sellers, best {best}")
            except Exception as e:
                print(f"Error on {item_id}: {e}")
            time.sleep(random.uniform(3.5, 7.0))

        drops = get_price_drops(conn, hours=24)
        if drops:
            print("\n--- Top price drops (24h) ---")
            for d in drops[:5]:
                print(f"  {d['name'][:45]}: ${d['old_price']} -> ${d['new_price']} (-{d['drop_pct']}%)")

        print(f"Cycle complete. Sleeping {interval_minutes}m.")
        time.sleep(interval_minutes * 60)

if __name__ == "__main__":
    run_tracker()

The window function in get_price_drops compares each recorded price against the minimum price seen earlier in the same polling window for that item/seller pair. This avoids flagging normal price oscillations as "drops" — it only surfaces cases where the price genuinely moved below the previous floor.

Review Analysis Pipeline

Walmart's reviews endpoint paginates cleanly. Collect reviews and run the analysis pipeline below for product development and competitive intelligence.

import pandas as pd
from collections import Counter

def get_reviews(item_id: str, page: int = 1, limit: int = 20,
                proxies: Optional[dict] = None) -> dict:
    url = (f"https://www.walmart.com/reviews/product/{item_id}"
           f"?page={page}&limit={limit}&sort=relevancy")
    review_headers = {
        **HEADERS,
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": f"https://www.walmart.com/ip/{item_id}",
    }
    with httpx.Client(headers=review_headers, proxies=proxies, http2=True) as client:
        r = client.get(url, timeout=15)
        r.raise_for_status()
        return r.json()

def fetch_all_reviews(item_id: str, max_pages: int = 10,
                      proxies: Optional[dict] = None) -> list[dict]:
    all_reviews = []
    for page in range(1, max_pages + 1):
        data = get_reviews(item_id, page=page, proxies=proxies)
        reviews = data.get("reviews", [])
        if not reviews:
            break
        all_reviews.extend(reviews)
        time.sleep(2)
    return all_reviews

STOP_WORDS = {
    "the", "a", "and", "is", "in", "it", "of", "to", "was", "i",
    "this", "for", "my", "but", "with", "very", "are", "be", "on",
    "have", "not", "great", "good", "they", "that", "just", "so"
}

def analyze_reviews(reviews: list[dict]) -> dict:
    df = pd.DataFrame(reviews)

    star_dist = df["rating"].value_counts().sort_index().to_dict()
    verified_ratio = df["verifiedPurchase"].mean() if "verifiedPurchase" in df.columns else None

    all_text = " ".join(df["reviewText"].fillna("").str.lower())
    words = [w for w in all_text.split() if len(w) > 3 and w not in STOP_WORDS]
    top_words = Counter(words).most_common(20)

    # Isolate complaints in 1-2 star reviews
    negative = df[df["rating"] <= 2]["reviewText"].dropna().tolist()
    negative_words = Counter(
        w for text in negative
        for w in text.lower().split()
        if len(w) > 3 and w not in STOP_WORDS
    ).most_common(10)

    return {
        "total_reviews": len(df),
        "avg_rating": round(df["rating"].mean(), 2),
        "star_distribution": star_dist,
        "verified_purchase_pct": round(verified_ratio * 100, 1) if verified_ratio else None,
        "top_keywords": top_words,
        "negative_keywords": negative_words,
    }

Running this across the top 50 products in a category surfaces recurring complaint patterns — issues that span multiple brands rather than isolated product failures. Those cross-brand complaints are the product development signal: what the whole category does poorly, a better product could solve.
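
One way to operationalize "complaints that span multiple brands": count, for each negative keyword from analyze_reviews(), how many distinct products it shows up in. A sketch with an arbitrary recurrence threshold:

```python
from collections import Counter

def cross_product_complaints(analyses: list[dict],
                             min_products: int = 3) -> list[tuple[str, int]]:
    """Given analyze_reviews() output for many products, count how many
    *distinct* products each negative keyword appears in. Keywords that
    recur across several products point at a category-wide gap rather
    than one bad listing."""
    appears_in: Counter = Counter()
    for analysis in analyses:
        # Deduplicate within a product so one ranty review can't inflate it
        keywords = {word for word, _count in analysis.get("negative_keywords", [])}
        appears_in.update(keywords)
    return [(w, n) for w, n in appears_in.most_common() if n >= min_products]
```

Feeding it the top 50 products in a niche turns a pile of review dumps into a short list of shared failure modes.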

Use Cases Worth Building

Price monitoring across sellers — Track buy box winner changes hour by hour. The buy box shifts as sellers adjust prices, run out of stock, or fall below Walmart's seller performance thresholds. Understanding who holds it and when is competitive intelligence sellers pay for.

WFS vs merchant-fulfilled analysis — Compare buy box win rates for the same item between WFS and non-WFS sellers. The WFS fee premium is roughly 8-15% of item price; quantifying the win rate lift tells you whether it is worth it for specific categories.

Category market share — Combine search scraping with seller profiling to estimate what fraction of a category is owned by the top sellers. Categories where the top three sellers hold over 60% of listings are harder to penetrate. Fragmented categories — many sellers, no dominant player — are opportunities.

Private label opportunity finder — Look for categories with high search volume (many total results, high review counts on top items) but few established brand names among the actual sellers. High demand plus no dominant brand is the private label entry signal.

Repricing intelligence — Track competitor price changes over time. Some sellers use automated repricers (mechanical, predictable adjustments at fixed intervals); others are manual and slower to react. Knowing their repricing cadence helps you time your own adjustments.
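
A sketch of cadence detection: given the (price, timestamp) rows you can pull from the price_history table for one item/seller pair, measure the hours between actual price changes. Tight, near-identical gaps suggest an automated repricer; irregular gaps suggest a human.

```python
from datetime import datetime

def repricing_intervals(history: list[tuple[float, str]]) -> list[float]:
    """Hours between actual price *changes*, given (price, ISO-8601
    timestamp) rows ordered by time. Rows where the price holds steady
    are skipped -- only genuine adjustments count as events."""
    change_times = []
    prev_price = None
    for price, ts in history:
        if prev_price is not None and price != prev_price:
            change_times.append(datetime.fromisoformat(ts))
        prev_price = price
    return [round((b - a).total_seconds() / 3600, 2)
            for a, b in zip(change_times, change_times[1:])]
```

A low standard deviation across these intervals is the automation tell; you can compute it with statistics.pstdev over the returned list.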

Review mining for product development — Run the negative keyword analysis at category scale. Pull 1-2 star reviews across the top 50 products in a niche. Recurring complaints that span multiple brands point to a category-wide gap a better product can fill.

Wrapping Up

Walmart Marketplace scraping is doable with the right stack. The official Affiliate API is worth using for bulk item ID collection and catalog bootstrapping, but it is entirely blind to third-party seller detail. For seller intelligence — offer sets, WFS status, per-seller ratings, geo-specific pricing — you are scraping the __NEXT_DATA__ JSON from product, search, category, and seller pages.

The main obstacles are PerimeterX challenge pages, TLS fingerprinting, Akamai session management, and geo-pricing inconsistency. Residential proxies with geo-targeting (ThorData has US state-level routing that aligns proxy IP with ZIP code) paired with proper cookie session maintenance gets you past most of it. Start with curl_cffi for TLS fingerprint bypass, maintain a persistent httpx client session for cookies, and add Playwright only where you actually need a real browser interaction.

The data is genuinely useful and less competitive than Amazon scraping. Seller performance metrics, WFS adoption rates, and buy box shift patterns are all visible in public pages — it just takes systematic extraction to turn them into actionable intelligence.