
Scraping Shopify Stores: Product Data, Inventory, and Pricing (2026)

Shopify powers over 4.8 million online stores. Whether you're doing competitive research, building a price comparison tool, tracking pricing trends across a niche, or feeding product data into an analytics pipeline, you'll eventually need to pull structured data from Shopify stores. The good news: Shopify makes this surprisingly easy compared to most ecommerce platforms.

This guide covers the practical methods that work right now in 2026 — from undocumented public endpoints to the official Storefront API — plus anti-detection strategies, pagination at scale, proxy integration with ThorData, and storing everything in SQLite.

Why Shopify Is Unusually Scraper-Friendly

Most ecommerce platforms guard their data carefully. Amazon serves dynamic JavaScript, Walmart employs sophisticated fingerprinting, and eBay wraps almost everything behind authentication. Shopify is different by design.

Shopify's architecture was built around an open data philosophy for third-party app developers. As a result, every store exposes structured product data as plain JSON through endpoints that Shopify itself uses for theme rendering. These endpoints were never meant to be secret — they're documented in Shopify's theme development guides — but most people outside the Shopify developer community don't know they exist.

This makes Shopify one of the best targets for legitimate competitive research: for most stores there's no JavaScript rendering, no CAPTCHA solving, and no authentication, just plain HTTP requests that return JSON.

The Hidden JSON Endpoints

Every Shopify store exposes product data as JSON. Append /products.json to any Shopify store URL and you get structured product data back:

https://example-store.myshopify.com/products.json
https://example-store.com/products.json

The response includes product titles, descriptions, images, variants (with prices, SKUs, inventory), tags, vendor info, and timestamps. It's the same data structure Shopify uses internally.

There are also several related endpoints:

  - /collections.json — all collections with handles and titles
  - /collections/{handle}/products.json — products within a specific collection
  - /products/{handle}.json — a single product by handle
  - /blogs.json — blog listings
  - /pages.json — static page listings
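The payload is easy to work with directly. The sketch below parses an abridged, made-up /products.json response (real responses carry many more fields, and the IDs and values here are illustrative only):

```python
# Abridged, illustrative /products.json payload -- real responses also include
# descriptions, images, timestamps, and more variant fields.
sample = {
    "products": [
        {
            "id": 123456789,
            "title": "Example Tee",
            "handle": "example-tee",
            "vendor": "Example Co",
            "variants": [
                {"id": 987654321, "title": "Small", "sku": "TEE-S",
                 "price": "29.99", "compare_at_price": None, "available": True},
            ],
        }
    ]
}

for product in sample["products"]:
    for variant in product["variants"]:
        # Note: prices arrive as strings, not numbers
        print(f'{product["title"]} / {variant["title"]}: ${variant["price"]}')
```

Note the nesting: one product, many variants, and every price is a string.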

Detecting Shopify Stores

Before scraping, you probably want to confirm a site actually runs on Shopify. The easiest check is the X-ShopId or X-Powered-By response header:

import httpx

def is_shopify(url: str) -> bool:
    """Check if a URL is a Shopify store."""
    try:
        r = httpx.head(url, follow_redirects=True, timeout=10)
        headers = {k.lower(): v for k, v in r.headers.items()}
        if "x-shopid" in headers:
            return True
        if headers.get("x-powered-by", "").lower() == "shopify":
            return True
        # Fallback: check for Shopify CDN in HTML
        r2 = httpx.get(url, follow_redirects=True, timeout=10)
        return "cdn.shopify.com" in r2.text
    except httpx.HTTPError:
        return False

print(is_shopify("https://allbirds.com"))  # True
print(is_shopify("https://gymshark.com"))  # True
print(is_shopify("https://amazon.com"))    # False

The X-ShopId header is the most reliable signal. If that's missing, check for cdn.shopify.com references in the page source — Shopify stores always load assets from their CDN.

Discovering Shopify Stores at Scale

If you want to build a list of Shopify stores to monitor, several approaches work:

  1. BuiltWith and SimilarTech — technology profiling tools that maintain databases of sites using Shopify. Both have APIs (paid).
  2. Google dorking — a site:myshopify.com query returns stores still on their native subdomain.
  3. CommonCrawl — the public web crawl dataset indexes hundreds of millions of pages. Filter for X-ShopId headers or cdn.shopify.com references.
  4. Product-specific searches — if you want stores in a niche, search Google for niche keywords plus site:myshopify.com.
# Quick list builder using myshopify.com pattern
import httpx

def find_shopify_competitors(niche_keywords: list[str]) -> list[str]:
    """
    Use a search API to find Shopify stores in a niche.
    Replace with your preferred search API.
    """
    stores = []
    for keyword in niche_keywords:
        query = f"{keyword} site:myshopify.com"
        # Use SerpAPI, ScrapingBee, or similar
        # resp = httpx.get("https://serpapi.com/search", params={"q": query, "api_key": KEY})
        # parse results...
        pass
    return stores

Fetching Products with Full Pagination

The /products.json endpoint accepts limit (max 250) and page parameters. Here's a complete scraper that handles pagination and rate limits:

import httpx
import time
import json
from typing import Optional

def scrape_shopify_products(
    store_url: str,
    max_pages: int = 50,
    proxy: Optional[str] = None,
    delay: float = 0.5,
) -> list[dict]:
    """
    Fetch all products from a Shopify store via /products.json.
    Handles pagination, 429 rate limits, and optional proxy.

    Args:
        store_url: Base URL of the Shopify store
        max_pages: Safety cap on number of pages to fetch
        proxy: Optional proxy URL (e.g., "http://user:pass@host:port")
        delay: Seconds to wait between requests

    Returns:
        List of raw product dicts
    """
    store_url = store_url.rstrip("/")
    all_products = []
    page = 1

    client_kwargs = {
        "timeout": 30,
        "headers": {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            "Accept": "application/json",
            "Accept-Language": "en-US,en;q=0.9",
        },
        "follow_redirects": True,
    }
    if proxy:
        # httpx >= 0.26 takes proxy=...; older releases used proxies={"all://": proxy}
        client_kwargs["proxy"] = proxy

    with httpx.Client(**client_kwargs) as client:
        while page <= max_pages:
            url = f"{store_url}/products.json?limit=250&page={page}"
            try:
                resp = client.get(url)
            except httpx.HTTPError as e:
                print(f"Request failed on page {page}: {e}")
                break

            if resp.status_code == 429:
                retry_after = int(resp.headers.get("Retry-After", 2))
                print(f"Rate limited. Waiting {retry_after}s...")
                time.sleep(retry_after)
                continue

            if resp.status_code == 404:
                print(f"404 on {url} — store may not be Shopify or products endpoint disabled")
                break

            if resp.status_code != 200:
                print(f"Got {resp.status_code} on page {page}")
                break

            try:
                data = resp.json()
            except Exception:
                print(f"Non-JSON response on page {page}")
                break

            products = data.get("products", [])

            if not products:
                break  # No more pages

            all_products.extend(products)
            print(f"Page {page}: {len(products)} products (total: {len(all_products)})")
            page += 1
            time.sleep(delay)

    return all_products

Key notes on this approach: the endpoint caps limit at 250 per request, an empty products array signals the last page, and the 429 branch respects Shopify's Retry-After header before retrying.

Cursor-Based Pagination with since_id

For large catalogs, cursor-based pagination using since_id is more reliable:

def scrape_shopify_since_id(
    store_url: str,
    proxy: Optional[str] = None,
) -> list[dict]:
    """
    Fetch all products using since_id cursor pagination.
    More reliable than page-based for large catalogs (10k+ products).
    """
    store_url = store_url.rstrip("/")
    all_products = []
    since_id = 0

    client_kwargs = {
        "timeout": 30,
        "headers": {"User-Agent": "Mozilla/5.0 (compatible; ProductBot/1.0)"},
        "follow_redirects": True,
    }
    if proxy:
        # httpx >= 0.26 takes proxy=...; older releases used proxies={"all://": proxy}
        client_kwargs["proxy"] = proxy

    with httpx.Client(**client_kwargs) as client:
        while True:
            url = f"{store_url}/products.json?limit=250&since_id={since_id}"
            resp = client.get(url)

            if resp.status_code == 429:
                time.sleep(int(resp.headers.get("Retry-After", 5)))
                continue
            if resp.status_code != 200:
                break

            products = resp.json().get("products", [])
            if not products:
                break

            all_products.extend(products)
            since_id = products[-1]["id"]  # Last product ID becomes cursor
            print(f"Fetched up to ID {since_id}, total: {len(all_products)}")
            time.sleep(0.5)

    return all_products

The since_id approach cursors on stable product IDs rather than page offsets, so products that shift position while you're paginating won't be skipped or double-counted the way they can be with page-based pagination.

Extracting Variant Data

Each product contains a variants array with the actual pricing, inventory, and SKU data. A single product can have dozens of variants (size × color combinations, for example).

def extract_pricing(products: list[dict]) -> list[dict]:
    """
    Extract a flat list of variant records from Shopify products.
    Each variant is one buyable SKU.
    """
    items = []
    for product in products:
        base = {
            "product_id": product["id"],
            "product_title": product["title"],
            "vendor": product.get("vendor", ""),
            "product_type": product.get("product_type", ""),
            "tags": product.get("tags", ""),
            "handle": product.get("handle", ""),
            "created_at": product.get("created_at"),
            "updated_at": product.get("updated_at"),
            "published_at": product.get("published_at"),
        }
        for variant in product.get("variants", []):
            items.append({
                **base,
                "variant_id": variant["id"],
                "variant_title": variant.get("title", "Default Title"),
                "sku": variant.get("sku", ""),
                "price": float(variant["price"]),
                "compare_at_price": float(variant["compare_at_price"]) if variant.get("compare_at_price") else None,
                "available": variant.get("available", True),
                "inventory_quantity": variant.get("inventory_quantity"),
                "weight": variant.get("grams"),
                "option1": variant.get("option1"),
                "option2": variant.get("option2"),
                "option3": variant.get("option3"),
                "barcode": variant.get("barcode"),
                "requires_shipping": variant.get("requires_shipping"),
            })
    return items


# Usage
products = scrape_shopify_products("https://example-store.com")
pricing = extract_pricing(products)

print(f"Extracted {len(pricing)} variants from {len(products)} products")

# Find discounted items
discounted = [p for p in pricing if p["compare_at_price"] and p["compare_at_price"] > p["price"]]
print(f"Currently on sale: {len(discounted)} variants")

# Average price by product type
from collections import defaultdict
by_type = defaultdict(list)
for p in pricing:
    if p["product_type"]:
        by_type[p["product_type"]].append(p["price"])

for ptype, prices in sorted(by_type.items()):
    avg = sum(prices) / len(prices)
    print(f"{ptype}: avg ${avg:.2f} across {len(prices)} variants")

Note that inventory_quantity may be None on stores where the owner has hidden stock levels. compare_at_price shows the original price when an item is on sale — useful for tracking discounts and sale frequency.

Collection-Based Scraping

For stores organized into collections, scraping by collection gives you better context about how the store categorizes its products:

def get_collections(store_url: str, proxy: Optional[str] = None) -> list[dict]:
    """Fetch all collections from a Shopify store."""
    store_url = store_url.rstrip("/")
    client_kwargs = {"timeout": 20, "follow_redirects": True}
    if proxy:
        # httpx >= 0.26 takes proxy=...; older releases used proxies={"all://": proxy}
        client_kwargs["proxy"] = proxy

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(f"{store_url}/collections.json?limit=250")
        resp.raise_for_status()
        return resp.json().get("collections", [])


def get_collection_products(
    store_url: str,
    collection_handle: str,
    proxy: Optional[str] = None,
) -> list[dict]:
    """Fetch all products in a specific collection."""
    store_url = store_url.rstrip("/")
    client_kwargs = {"timeout": 20, "follow_redirects": True}
    if proxy:
        # httpx >= 0.26 takes proxy=...; older releases used proxies={"all://": proxy}
        client_kwargs["proxy"] = proxy

    all_products = []
    page = 1

    with httpx.Client(**client_kwargs) as client:
        while True:
            url = f"{store_url}/collections/{collection_handle}/products.json?limit=250&page={page}"
            resp = client.get(url)
            if resp.status_code != 200:
                break
            products = resp.json().get("products", [])
            if not products:
                break
            all_products.extend(products)
            page += 1
            time.sleep(0.3)

    return all_products


# Example: scrape every collection
store = "https://example-store.com"
collections = get_collections(store)
print(f"Found {len(collections)} collections")

all_data = {}
for coll in collections:
    handle = coll["handle"]
    products = get_collection_products(store, handle)
    all_data[handle] = products
    print(f"Collection '{coll['title']}': {len(products)} products")
    time.sleep(0.5)

The Storefront API (The Official Way)

If you need more reliable access or richer queries, Shopify's Storefront API is the proper route. It's a GraphQL API that gives you products, collections, pages, blog posts, and cart functionality.

The catch: you need a Storefront Access Token, which can only be created by the store owner through their Shopify admin. This makes it impractical for scraping stores you don't own, but it's the right choice when you operate the store yourself, have the owner's permission, or need richer queries than the public endpoints provide:

import httpx
import time

STORE = "example-store.myshopify.com"
TOKEN = "your-storefront-access-token"

def storefront_query(query: str, variables: dict = None) -> dict:
    """Execute a GraphQL query against the Shopify Storefront API."""
    resp = httpx.post(
        f"https://{STORE}/api/2025-01/graphql.json",
        json={"query": query, "variables": variables or {}},
        headers={
            "X-Shopify-Storefront-Access-Token": TOKEN,
            "Content-Type": "application/json",
        },
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()


# Paginated product query
PRODUCTS_QUERY = """
query GetProducts($cursor: String) {
  products(first: 50, after: $cursor) {
    edges {
      node {
        id
        title
        handle
        vendor
        productType
        tags
        createdAt
        updatedAt
        variants(first: 10) {
          edges {
            node {
              id
              title
              sku
              price { amount currencyCode }
              compareAtPrice { amount currencyCode }
              availableForSale
              quantityAvailable # requires a token with the unauthenticated_read_product_inventory scope
            }
          }
        }
        images(first: 1) {
          edges {
            node { url altText }
          }
        }
      }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}
"""

def fetch_all_storefront_products() -> list[dict]:
    """Paginate through all products using Storefront API cursor pagination."""
    all_products = []
    cursor = None

    while True:
        data = storefront_query(PRODUCTS_QUERY, {"cursor": cursor})
        products_data = data["data"]["products"]

        for edge in products_data["edges"]:
            node = edge["node"]
            all_products.append(node)

        page_info = products_data["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        cursor = page_info["endCursor"]
        time.sleep(0.2)

    return all_products

The Storefront API uses cursor-based pagination (endCursor + after parameter), which is stable and reliable for large catalogs. Rate limits are based on a calculated query cost rather than simple request counts — requesting nested variants in each query costs more than a simple product list.
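Throttling handling can be factored into a generic retry wrapper. This is a sketch: ThrottledError is a name invented here, and you'd adapt storefront_query to raise it when a response carries a 429/430 status or a THROTTLED GraphQL error.

```python
import time


class ThrottledError(Exception):
    """Signal that the API reported throttling (raised by your query function)."""


def with_throttle_retry(query_fn, *args, max_retries: int = 5,
                        base_delay: float = 1.0):
    """Call query_fn(*args), retrying with exponential backoff on ThrottledError."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return query_fn(*args)
        except ThrottledError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the throttle to the caller
            time.sleep(delay)  # 1s, 2s, 4s, ... by default
            delay *= 2
    raise RuntimeError("unreachable")
```

With that in place, the pagination loop could call with_throttle_retry(storefront_query, PRODUCTS_QUERY, {"cursor": cursor}) instead of calling storefront_query directly.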

Bot Detection: What to Expect and How to Handle It

Most Shopify stores have minimal bot protection out of the box. The /products.json endpoint works with basic HTTP clients — no JavaScript rendering, no CAPTCHAs, no fingerprinting for 95% of stores.

However, some stores add third-party protection:

Cloudflare

The most common addition. Stores on Shopify Plus often route traffic through Cloudflare, which adds JavaScript challenges and TLS fingerprinting. You'll know because you get a 403 with a Cloudflare challenge page instead of JSON.

Cloudflare operates at multiple levels:

  - Bot Score — Cloudflare assigns each request a bot score based on IP reputation, TLS fingerprint, HTTP/2 fingerprint, and behavioral signals.
  - JavaScript Challenge — when confidence is low, Cloudflare serves a JS challenge that evaluates browser capabilities. Standard HTTP clients fail this; browsers pass it.
  - Turnstile — Cloudflare's CAPTCHA replacement. It appears on checkout and some product pages, rarely on JSON endpoints.

For the JSON endpoints specifically, Cloudflare usually applies passive checks only (no active JS challenge) because they're clearly machine-readable paths. The IP reputation and TLS fingerprint checks still apply, though.

DataDome

Less common but increasingly popular with larger brands. Uses behavioral analysis and device fingerprinting. If you hit a Shopify store and get a 403 with a datadome.co reference in the response, this is the culprit.
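When a request fails, it helps to know which layer rejected it. The heuristic below is an assumption-based sketch keyed on signals these vendors commonly emit (the cf-ray and x-datadome response headers, vendor names in the block page), not an official detection API:

```python
def classify_block(status_code: int, headers: dict, body: str) -> str:
    """Best-effort guess at which protection layer rejected a request."""
    h = {k.lower(): v for k, v in headers.items()}
    body_l = body.lower()
    if status_code in (403, 503):
        # Cloudflare stamps responses with a cf-ray header and branded block pages
        if "cf-ray" in h or "cf-mitigated" in h or "cloudflare" in body_l:
            return "cloudflare"
        # DataDome block pages reference datadome.co and x-datadome headers
        if "x-datadome" in h or "datadome" in body_l:
            return "datadome"
        return "blocked"
    if status_code == 429:
        return "rate_limited"
    return "unknown"


print(classify_block(403, {"CF-RAY": "8a1b2c3d"}, ""))          # cloudflare
print(classify_block(403, {}, "...js.datadome.co/tags.js..."))  # datadome
print(classify_block(429, {}, ""))                              # rate_limited
```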

Shopify's Built-In Rate Limiting

Shopify has gradually tightened its own bot detection, especially on checkout and cart endpoints. Product data endpoints are still relatively open, but aggressive request patterns will trigger temporary IP blocks, typically lasting 1-24 hours.

Handling Detection with ThorData Proxies

When you hit Cloudflare or similar protection, a residential proxy service becomes necessary. ThorData's residential proxy network provides IPs from real ISP ranges that pass Cloudflare's IP reputation checks. Unlike datacenter proxies (AWS, DigitalOcean, Hetzner ranges), residential IPs originate from household internet connections and are scored much more favorably by Cloudflare's bot detection.

import httpx
import time

# ThorData proxy configuration
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "gate.thordata.net"
THORDATA_PORT = 9000

def make_proxy(country: str = "us", session_id: str = None) -> str:
    """
    Build a ThorData residential proxy URL.
    Use country targeting to match the store's primary market.
    Use sticky sessions (session_id) for paginated requests.
    """
    user = f"{THORDATA_USER}-country-{country}"
    if session_id:
        user += f"-session-{session_id}"
    return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"


def scrape_cloudflare_protected_store(store_url: str) -> list[dict]:
    """
    Scrape a Shopify store that's behind Cloudflare using residential proxies.
    Uses sticky sessions so all pages of a scrape come from the same IP.
    """
    import random
    import string

    # Generate a session ID to get sticky IPs per store (maintains session context)
    session_id = "".join(random.choices(string.ascii_lowercase, k=8))
    proxy = make_proxy(country="us", session_id=session_id)

    return scrape_shopify_products(store_url, proxy=proxy, delay=1.0)


# Use it for any Cloudflare-protected store
products = scrape_cloudflare_protected_store("https://protected-shopify-store.com")

For the vast majority of Shopify stores, you won't need proxies at all. A polite request rate with a reasonable User-Agent string is enough. Use ThorData when:

  1. You're seeing consistent 403s that aren't explained by the store being private
  2. You're monitoring many stores simultaneously from the same IP
  3. You're scraping stores that are explicitly Shopify Plus (enterprise tier — more likely to have Cloudflare)

Storing Data in SQLite

For any ongoing monitoring or competitive analysis, you need persistent storage. SQLite is perfect for this — no server required, file-based, and handles the data volumes involved easily.

import sqlite3
import json
from datetime import datetime, timezone

def init_db(db_path: str = "shopify_products.db") -> sqlite3.Connection:
    """Initialize the SQLite database with product and variant tables."""
    conn = sqlite3.connect(db_path)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS stores (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE NOT NULL,
            name TEXT,
            first_scraped TEXT,
            last_scraped TEXT
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY,
            store_url TEXT NOT NULL,
            title TEXT,
            vendor TEXT,
            product_type TEXT,
            tags TEXT,
            handle TEXT,
            created_at TEXT,
            updated_at TEXT,
            published_at TEXT,
            scraped_at TEXT NOT NULL
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS variants (
            id INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL,
            store_url TEXT NOT NULL,
            title TEXT,
            sku TEXT,
            price REAL,
            compare_at_price REAL,
            available INTEGER,
            inventory_quantity INTEGER,
            option1 TEXT,
            option2 TEXT,
            option3 TEXT,
            weight_grams INTEGER,
            barcode TEXT,
            scraped_at TEXT NOT NULL,
            FOREIGN KEY (product_id) REFERENCES products(id)
        )
    """)

    conn.execute("""
        CREATE TABLE IF NOT EXISTS price_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            variant_id INTEGER NOT NULL,
            price REAL NOT NULL,
            compare_at_price REAL,
            available INTEGER,
            recorded_at TEXT NOT NULL
        )
    """)

    conn.execute("CREATE INDEX IF NOT EXISTS idx_products_store ON products(store_url)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_variants_product ON variants(product_id)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_price_history_variant ON price_history(variant_id)")

    conn.commit()
    return conn


def save_products(
    conn: sqlite3.Connection,
    store_url: str,
    products: list[dict],
):
    """Save products and variants to SQLite, recording price changes."""
    now = datetime.now(timezone.utc).isoformat()

    # Upsert store record
    conn.execute("""
        INSERT INTO stores (url, last_scraped, first_scraped)
        VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET last_scraped=excluded.last_scraped
    """, (store_url, now, now))

    for product in products:
        # Upsert product
        conn.execute("""
            INSERT OR REPLACE INTO products
            (id, store_url, title, vendor, product_type, tags, handle,
             created_at, updated_at, published_at, scraped_at)
            VALUES (?,?,?,?,?,?,?,?,?,?,?)
        """, (
            product["id"],
            store_url,
            product.get("title"),
            product.get("vendor"),
            product.get("product_type"),
            product.get("tags"),
            product.get("handle"),
            product.get("created_at"),
            product.get("updated_at"),
            product.get("published_at"),
            now,
        ))

        for variant in product.get("variants", []):
            price = float(variant["price"])
            compare_at = float(variant["compare_at_price"]) if variant.get("compare_at_price") else None
            available = 1 if variant.get("available", True) else 0

            # Check if price changed since last scrape
            prev = conn.execute(
                "SELECT price, compare_at_price, available FROM variants WHERE id=?",
                (variant["id"],)
            ).fetchone()

            # Always log price history on change
            if prev is None or prev[0] != price or prev[1] != compare_at or prev[2] != available:
                conn.execute("""
                    INSERT INTO price_history (variant_id, price, compare_at_price, available, recorded_at)
                    VALUES (?,?,?,?,?)
                """, (variant["id"], price, compare_at, available, now))

            # Upsert variant
            conn.execute("""
                INSERT OR REPLACE INTO variants
                (id, product_id, store_url, title, sku, price, compare_at_price,
                 available, inventory_quantity, option1, option2, option3,
                 weight_grams, barcode, scraped_at)
                VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
            """, (
                variant["id"],
                product["id"],
                store_url,
                variant.get("title"),
                variant.get("sku"),
                price,
                compare_at,
                available,
                variant.get("inventory_quantity"),
                variant.get("option1"),
                variant.get("option2"),
                variant.get("option3"),
                variant.get("grams"),
                variant.get("barcode"),
                now,
            ))

    conn.commit()
    print(f"Saved {len(products)} products for {store_url}")


def get_price_drops(conn: sqlite3.Connection, min_drop_pct: float = 10.0) -> list[dict]:
    """Find variants that recently went on sale."""
    rows = conn.execute("""
        SELECT v.id, p.title, v.title, v.price, v.compare_at_price,
               ((v.compare_at_price - v.price) / v.compare_at_price * 100) as drop_pct,
               v.store_url
        FROM variants v
        JOIN products p ON p.id = v.product_id
        WHERE v.compare_at_price IS NOT NULL
          AND v.compare_at_price > v.price
          AND ((v.compare_at_price - v.price) / v.compare_at_price * 100) >= ?
        ORDER BY drop_pct DESC
    """, (min_drop_pct,)).fetchall()

    return [
        {
            "variant_id": r[0],
            "product": r[1],
            "variant": r[2],
            "sale_price": r[3],
            "original_price": r[4],
            "discount_pct": round(r[5], 1),
            "store": r[6],
        }
        for r in rows
    ]

Building a Multi-Store Price Monitor

Put it all together into a monitoring pipeline:

import httpx
import time
import sqlite3

STORES_TO_MONITOR = [
    "https://allbirds.com",
    "https://gymshark.com",
    "https://fashionnova.com",
    # Add more stores here
]

def run_monitoring_job(
    stores: list[str],
    db_path: str = "shopify_monitor.db",
    use_proxy: bool = False,
):
    """Run one monitoring pass across all stores."""
    conn = init_db(db_path)

    for store_url in stores:
        print(f"\nScraping: {store_url}")

        # Verify it's Shopify first
        if not is_shopify(store_url):
            print(f"  Not a Shopify store, skipping")
            continue

        proxy = None
        if use_proxy:
            import random, string
            session = "".join(random.choices(string.ascii_lowercase, k=8))
            proxy = make_proxy(country="us", session_id=session)

        try:
            products = scrape_shopify_products(store_url, proxy=proxy, delay=0.8)
            save_products(conn, store_url, products)
            print(f"  Done: {len(products)} products")
        except Exception as e:
            print(f"  Error: {e}")

        time.sleep(2)  # Wait between stores

    # Report any significant price drops
    drops = get_price_drops(conn, min_drop_pct=20.0)
    if drops:
        print(f"\nFound {len(drops)} items with 20%+ discounts:")
        for d in drops[:10]:
            print(f"  {d['product']} ({d['variant']}): ${d['sale_price']} (was ${d['original_price']}, -{d['discount_pct']}%)")

    conn.close()


if __name__ == "__main__":
    run_monitoring_job(STORES_TO_MONITOR, use_proxy=False)

Schedule this with cron to run daily or hourly. Each run records price changes into the price_history table, giving you a time series of pricing decisions across your competitors.

Advanced Use Cases

Competitive Pricing Analysis

With data across multiple Shopify stores in the same niche, you can build competitive pricing dashboards:

def competitive_analysis(conn: sqlite3.Connection) -> dict:
    """Compare pricing across all tracked stores."""
    # Average price by store
    rows = conn.execute("""
        SELECT store_url, COUNT(*) as variant_count,
               AVG(price) as avg_price,
               MIN(price) as min_price,
               MAX(price) as max_price,
               SUM(CASE WHEN compare_at_price > price THEN 1 ELSE 0 END) as on_sale_count
        FROM variants
        GROUP BY store_url
        ORDER BY avg_price DESC
    """).fetchall()

    return {
        row[0]: {
            "variants": row[1],
            "avg_price": round(row[2], 2),
            "price_range": (row[3], row[4]),
            "on_sale": row[5],
        }
        for row in rows
    }

Product Availability Tracking

Track when popular items go out of stock:

def track_availability_changes(conn: sqlite3.Connection, hours: int = 24) -> list[dict]:
    """Find variants that changed availability status in the last N hours."""
    from datetime import datetime, timezone, timedelta
    cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()

    rows = conn.execute("""
        SELECT DISTINCT ph.variant_id, p.title, v.title,
               v.available, ph.available as prev_available,
               ph.recorded_at, v.store_url
        FROM price_history ph
        JOIN variants v ON v.id = ph.variant_id
        JOIN products p ON p.id = v.product_id
        WHERE ph.recorded_at >= ?
          AND ph.available != v.available
        ORDER BY ph.recorded_at DESC
    """, (cutoff,)).fetchall()

    return [
        {
            "variant_id": r[0],
            "product": r[1],
            "variant": r[2],
            "now_available": bool(r[3]),
            "was_available": bool(r[4]),
            "changed_at": r[5],
            "store": r[6],
        }
        for r in rows
    ]

New Product Detection

def find_new_products(conn: sqlite3.Connection, store_url: str, days: int = 7) -> list[dict]:
    """Find products added to a store in the last N days."""
    from datetime import datetime, timezone, timedelta
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()

    rows = conn.execute("""
        SELECT id, title, vendor, product_type, published_at, scraped_at
        FROM products
        WHERE store_url = ?
          AND published_at >= ?
        ORDER BY published_at DESC
    """, (store_url, cutoff)).fetchall()

    return [
        {
            "id": r[0],
            "title": r[1],
            "vendor": r[2],
            "product_type": r[3],
            "published_at": r[4],
        }
        for r in rows
    ]

Practical Tips

Start with /products.json — It's the fastest path to data. No authentication, no API setup, works on every Shopify store. Only move to the Storefront API if you have legitimate access.

Respect robots.txt — Most Shopify stores have a default robots.txt that doesn't block /products.json, but check anyway. It's both ethical and practical — stores that explicitly block scrapers are more likely to have additional defenses.

Monitor the updated_at field — If you're tracking prices over time, the updated_at field on products and variants tells you when data last changed. Use it to avoid re-processing unchanged products:

def needs_update(conn: sqlite3.Connection, product_id: int, updated_at: str) -> bool:
    """Check if a product has been updated since last scrape."""
    row = conn.execute(
        "SELECT updated_at FROM products WHERE id=?", (product_id,)
    ).fetchone()
    if row is None:
        return True
    return row[0] != updated_at

Handle custom domains — Many Shopify stores use custom domains. The JSON endpoints work on both store.myshopify.com and store.com. If one fails, try the other.

Watch for theme customizations — Some Shopify themes modify the JSON output or add middleware. If /products.json returns HTML instead of JSON, try adding an Accept: application/json header.

Handle currency — Prices are strings in the JSON (e.g., "29.99"). Always convert explicitly rather than assuming they're numbers.
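If you need exact money arithmetic (totals, margins, discount percentages), the standard library's Decimal avoids binary-float rounding. A sketch of the idea, using made-up prices:

```python
from decimal import Decimal

# Parse the string prices directly into exact decimal values
price = Decimal("29.99")
compare_at = Decimal("39.99")

discount = (compare_at - price) / compare_at * 100
print(round(discount, 1))  # 25.0
```

float() is fine for rough analytics; reach for Decimal when cents must add up exactly.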

Shopify's platform terms and individual store terms of service vary. The public JSON endpoints expose data that's accessible to anyone browsing the store; the same data is visible in the page source during normal shopping. US courts have generally held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (hiQ v. LinkedIn), though contract and other claims can still apply.

Practical guidelines: keep request rates polite, respect robots.txt, and stop scraping any store that blocks you or asks you to.

Wrapping Up

Shopify's architecture makes product data extraction straightforward compared to platforms like Amazon or Walmart. The public JSON endpoints give you everything you need for most use cases: pricing, inventory, product catalog, variant structure. The Storefront API covers authorized access when you have it.

The main thing to be mindful of is scale. Scraping one store occasionally is trivial. Scraping thousands of stores continuously requires proper rate limiting, error handling, and occasionally proxy infrastructure, such as ThorData residential proxies when Cloudflare-protected stores are on your target list. Build incrementally: start with the simple approach and add complexity only when you actually need it.

The SQLite schema above handles price history, availability tracking, and multi-store comparison out of the box. Run it on a schedule and you'll build a competitive intelligence dataset that compounds over time.