← Back to blog

Scraping Zillow and Real Estate Data (2026)

Zillow killed its public Zestimate API in 2018 and has never replaced it with anything useful for developers. In 2026, your options are: scrape the site (hard, requires residential proxies), use Zillow's official data licensing program (expensive), or piece together data from alternative sources. This guide covers what's actually achievable — with working Python code for every approach.

What You're Up Against

Zillow runs some of the most aggressive bot detection in the real estate sector: Cloudflare at the edge, TLS fingerprinting, and JavaScript-based browser challenges.

A plain requests.get("https://www.zillow.com/homes/...") returns a 403 or a Cloudflare challenge page. You need a TLS-impersonating HTTP client or a full browser, plus a residential IP.

Setting Up Your Environment

Before writing a single scraping line, install the required libraries (sqlite3 ships with Python's standard library, so it needs no separate install):

pip install curl-cffi beautifulsoup4 playwright lxml
playwright install chromium

For proxy support you'll also want:

pip install "httpx[socks]"  # if you need SOCKS5 proxies (quotes keep zsh from globbing the brackets)

The Lightweight Approach: curl-cffi

curl-cffi impersonates real browser TLS fingerprints (Chrome, Firefox, Safari). It bypasses Cloudflare's TLS-based detection without the overhead of Playwright. This works for basic listing pages but will eventually fail on pages with heavy JavaScript challenges:

from curl_cffi import requests as cffi_requests
import json
import time
import random

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xhtml+xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-CH-UA": '"Chromium";v="124", "Google Chrome";v="124"',
    "Sec-CH-UA-Mobile": "?0",
    "Sec-CH-UA-Platform": '"macOS"',
}

# Residential proxy (e.g. ThorData) - replace with your credentials
PROXY = "http://USERNAME:[email protected]:7777"

def create_zillow_session(proxy: str = None) -> cffi_requests.Session:
    """Create a curl-cffi session that impersonates Chrome 124."""
    session = cffi_requests.Session(impersonate="chrome124")
    if proxy:
        session.proxies = {"http": proxy, "https": proxy}
    session.headers.update(HEADERS)
    return session

def fetch_zillow_page(url: str, session: cffi_requests.Session) -> tuple[str, int]:
    """Fetch a Zillow page with retry logic."""
    for attempt in range(3):
        try:
            response = session.get(url, timeout=30)
            if response.status_code == 200:
                return response.text, 200
            elif response.status_code == 429:
                wait = 30 * (2 ** attempt)
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)
            else:
                return "", response.status_code
        except Exception as e:
            print(f"Request error (attempt {attempt+1}): {e}")
            time.sleep(5 * (attempt + 1))
    return "", -1

# Example usage
session = create_zillow_session(proxy=PROXY)
html, status = fetch_zillow_page(
    "https://www.zillow.com/homes/for_sale/Seattle-WA/",
    session
)
print(f"Status: {status}, Got {len(html)} bytes")

Extracting ZPID and Listing Data

Zillow embeds most listing data in the HTML as a JSON blob inside a <script> tag or an HTML comment. The key identifier for every property is the ZPID (Zillow Property ID) — a numeric ID that uniquely identifies each listing across Zillow's entire database:

import re
from bs4 import BeautifulSoup

def extract_zillow_listings(html: str) -> list[dict]:
    """Extract property listings from Zillow search results HTML."""
    soup = BeautifulSoup(html, "lxml")

    # Method 1: Look for __NEXT_DATA__ (newer Zillow pages)
    next_data_script = soup.find("script", {"id": "__NEXT_DATA__"})
    if next_data_script:
        try:
            next_data = json.loads(next_data_script.string)
            listings = _extract_from_next_data(next_data)
            if listings:
                return listings
        except (json.JSONDecodeError, TypeError):
            pass

    # Method 2: JSON blobs in script tags
    scripts = soup.find_all("script", {"type": "application/json"})
    for script in scripts:
        try:
            parsed = json.loads(script.string or "")
            if "cat1" in parsed or "searchResults" in str(parsed)[:300]:
                listings = _extract_from_search_results(parsed)
                if listings:
                    return listings
        except (json.JSONDecodeError, TypeError):
            continue

    # Method 3: Regex fallback for the queryState comment
    match = re.search(r'<!--(\{"queryState".*?})-->', html, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(1))
            return _extract_from_search_results(data)
        except json.JSONDecodeError:
            pass

    return []

def _extract_from_next_data(data: dict) -> list[dict]:
    """Extract listings from Next.js page data structure."""
    results = []
    # Navigate to the search results in Next.js data
    page_props = data.get("props", {}).get("pageProps", {})
    search_data = page_props.get("searchPageState", {})
    listing_map = (
        search_data.get("cat1", {})
        .get("searchResults", {})
        .get("listResults", [])
    )
    for listing in listing_map:
        results.append(_parse_listing(listing))
    return results

def _extract_from_search_results(data: dict) -> list[dict]:
    """Extract listings from search results data structure."""
    listing_map = (
        data.get("cat1", {})
        .get("searchResults", {})
        .get("listResults", [])
    )
    return [_parse_listing(l) for l in listing_map]

def _parse_listing(listing: dict) -> dict:
    """Normalize a single listing dict."""
    zpid = listing.get("zpid")
    home_info = listing.get("hdpData", {}).get("homeInfo", {})
    lat_long = listing.get("latLong", {})
    return {
        "zpid": zpid,
        "address": listing.get("address"),
        "price": listing.get("price"),
        "beds": listing.get("beds"),
        "baths": listing.get("baths"),
        "sqft": listing.get("area"),
        "zestimate": listing.get("zestimate"),
        "days_on_market": home_info.get("daysOnZillow"),
        "listing_url": f"https://www.zillow.com{listing.get('detailUrl', '')}",
        "latitude": lat_long.get("latitude"),
        "longitude": lat_long.get("longitude"),
        "home_type": home_info.get("homeType"),
        "status": listing.get("statusType"),
        "price_change": listing.get("priceChange"),
        "lot_id": listing.get("lotId"),
    }
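To sanity-check the traversal logic without hitting Zillow, you can feed a synthetic payload shaped like the structure the parser above walks. The field values here are made up:

```python
# Synthetic __NEXT_DATA__ payload exercising the same path the parser walks:
# props -> pageProps -> searchPageState -> cat1 -> searchResults -> listResults
next_data = {
    "props": {"pageProps": {"searchPageState": {"cat1": {"searchResults": {
        "listResults": [
            {"zpid": "48749425", "price": "$825,000", "beds": 3, "area": 1540},
            {"zpid": "48749426", "price": "$1,100,000", "beds": 4, "area": 2210},
        ]
    }}}}}
}

listings = (
    next_data["props"]["pageProps"]["searchPageState"]
    ["cat1"]["searchResults"]["listResults"]
)
zpids = [l["zpid"] for l in listings]
print(zpids)  # ['48749425', '48749426']
```

If this path ever returns an empty list against live HTML, Zillow has likely moved the data; log the raw keys at each level to find the new location.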

Getting Individual Property Detail

Once you have a ZPID, you can construct the property URL directly and pull deep detail including tax history and price history:

def get_property_url(zpid: int) -> str:
    return f"https://www.zillow.com/homedetails/{zpid}_zpid/"

def scrape_property_detail(zpid: int, session: cffi_requests.Session) -> dict:
    """Fetch full property detail for a single ZPID."""
    url = get_property_url(zpid)
    html, status = fetch_zillow_page(url, session)

    if status != 200:
        return {"error": status, "zpid": zpid}

    soup = BeautifulSoup(html, "lxml")

    # Zillow embeds full property detail in __NEXT_DATA__
    script = soup.find("script", {"id": "__NEXT_DATA__"})
    if not script:
        return {"error": "no_next_data", "zpid": zpid}

    try:
        next_data = json.loads(script.string)
    except json.JSONDecodeError:
        return {"error": "json_parse_failed", "zpid": zpid}

    props = next_data.get("props", {}).get("pageProps", {})
    gdp_client_cache = props.get("gdpClientCache", "{}")

    try:
        cache = json.loads(gdp_client_cache)
        # Key is usually "ForSaleShopperPlatformFullRenderQuery..." — grab first value
        property_data = next(iter(cache.values()), {})
        home_info = property_data.get("property", {})
    except (json.JSONDecodeError, StopIteration, AttributeError):
        return {"error": "cache_parse_failed", "zpid": zpid}

    return {
        "zpid": zpid,
        "address": home_info.get("streetAddress"),
        "city": home_info.get("city"),
        "state": home_info.get("state"),
        "zip": home_info.get("zipcode"),
        "price": home_info.get("price"),
        "zestimate": home_info.get("zestimate"),
        "rent_zestimate": home_info.get("rentZestimate"),
        "beds": home_info.get("bedrooms"),
        "baths": home_info.get("bathrooms"),
        "sqft": home_info.get("livingArea"),
        "lot_size": home_info.get("lotSize"),
        "year_built": home_info.get("yearBuilt"),
        "home_type": home_info.get("homeType"),
        "hoa_fee": home_info.get("hoaFee"),
        "parking_type": home_info.get("parkingType"),
        "heating_system": home_info.get("heatingSystem"),
        "cooling_system": home_info.get("coolingSystem"),
        "elementary_school": home_info.get("elementary"),
        "middle_school": home_info.get("middle"),
        "high_school": home_info.get("high"),
        "tax_history": home_info.get("taxHistory", []),
        "price_history": home_info.get("priceHistory", []),
        "description": home_info.get("description"),
        "photos": [p.get("url") for p in home_info.get("photos", []) if p.get("url")],
        "nearby_homes": [
            {
                "zpid": n.get("zpid"),
                "price": n.get("price"),
                "beds": n.get("bedrooms"),
                "sqft": n.get("livingArea"),
            }
            for n in home_info.get("nearbyHomes", [])
        ],
    }
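The tax_history and price_history lists are worth normalizing before analysis. Here's a small sketch that collapses a price history into ordered events with deltas; the "date", "price", and "event" keys are assumptions about Zillow's payload shape, so verify against a real response first:

```python
# Sketch: collapse a priceHistory list into chronologically ordered events
# with price deltas. Key names are assumptions about Zillow's payload.
def summarize_price_history(history: list[dict]) -> list[dict]:
    events = sorted(
        (h for h in history if h.get("price")),
        key=lambda h: h.get("date", ""),
    )
    out, prev = [], None
    for h in events:
        out.append({
            "date": h.get("date"),
            "event": h.get("event"),
            "price": h["price"],
            "change": h["price"] - prev if prev is not None else None,
        })
        prev = h["price"]
    return out

history = [
    {"date": "2024-05-15", "event": "Price change", "price": 775000},
    {"date": "2024-03-01", "event": "Listed for sale", "price": 800000},
]
print(summarize_price_history(history)[-1]["change"])  # -25000
```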

Why Playwright Is Required for Production

curl-cffi gets you through the initial TLS check, but Zillow's full bot detection stack includes JavaScript challenges that probe for headless-browser signals: the navigator.webdriver flag, an empty plugins array, implausible hardware properties, and missing browser APIs.

For sustained scraping at any volume, you need Playwright with stealth patches:

from playwright.async_api import async_playwright
import asyncio
import random

async def create_stealth_browser(proxy_server: str = None):
    """Launch a stealth Playwright browser."""
    playwright = await async_playwright().start()

    launch_options = {
        "headless": True,
        "args": [
            "--disable-blink-features=AutomationControlled",
            "--disable-features=IsolateOrigins,site-per-process",
            "--no-sandbox",
            "--disable-dev-shm-usage",
        ],
    }
    if proxy_server:
        launch_options["proxy"] = {"server": proxy_server}

    browser = await playwright.chromium.launch(**launch_options)

    context = await browser.new_context(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        viewport={"width": 1440, "height": 900},
        locale="en-US",
        timezone_id="America/New_York",
        device_scale_factor=2,
        has_touch=False,
        java_script_enabled=True,
        color_scheme="light",
    )

    # Patch common bot detection signals
    await context.add_init_script("""
        // Remove webdriver flag
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});

        // Fake plugins array (real Chrome has plugins)
        Object.defineProperty(navigator, 'plugins', {
            get: () => [
                {name: 'Chrome PDF Plugin', filename: 'internal-pdf-viewer'},
                {name: 'Chrome PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai'},
                {name: 'Native Client', filename: 'internal-nacl-plugin'},
            ]
        });

        // Fake languages
        Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});

        // Fake deviceMemory
        Object.defineProperty(navigator, 'deviceMemory', {get: () => 8});

        // Fake hardware concurrency
        Object.defineProperty(navigator, 'hardwareConcurrency', {get: () => 8});
    """)

    return playwright, browser, context

async def scrape_zillow_with_playwright(url: str, proxy_server: str = None) -> tuple[str, list]:
    """Scrape Zillow using Playwright with API interception."""
    playwright, browser, context = await create_stealth_browser(proxy_server)
    api_responses = []

    try:
        page = await context.new_page()

        # Intercept Zillow's internal API calls — these return clean JSON
        async def handle_response(response):
            if "zillow.com/async-create-search-page-state" in response.url:
                try:
                    body = await response.json()
                    api_responses.append(body)
                except Exception:
                    pass
            elif "zillow.com/graphql" in response.url:
                try:
                    body = await response.json()
                    api_responses.append({"source": "graphql", "data": body})
                except Exception:
                    pass

        page.on("response", handle_response)

        # Random delay before navigating
        await asyncio.sleep(random.uniform(1, 3))
        await page.goto(url, wait_until="networkidle", timeout=60000)

        # Simulate human scrolling
        await page.evaluate("window.scrollBy(0, 300)")
        await asyncio.sleep(random.uniform(0.5, 1.5))
        await page.evaluate("window.scrollBy(0, 500)")
        await asyncio.sleep(random.uniform(0.5, 1.0))

        html = await page.content()

    finally:
        await context.close()
        await browser.close()
        await playwright.stop()

    return html, api_responses

async def main():
    proxy = "http://user:[email protected]:7777"
    html, api_data = await scrape_zillow_with_playwright(
        "https://www.zillow.com/homes/for_sale/Austin-TX/",
        proxy_server=proxy,
    )
    listings = extract_zillow_listings(html)

    # Also parse API data if intercepted
    for api_resp in api_data:
        if "searchResults" in str(api_resp)[:500]:
            extra_listings = _extract_from_search_results(api_resp)
            print(f"Got {len(extra_listings)} listings from API interception")

    print(f"Found {len(listings)} listings from HTML")

asyncio.run(main())

The async-create-search-page-state API call Zillow makes internally is the cleanest data source — intercept it and you get fully structured JSON without parsing the HTML at all.
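To turn those intercepted payloads into listings, a small helper can walk the response. The cat1 path below mirrors the HTML parser earlier in this guide, but it's an observed convention, not a documented API, so expect it to shift:

```python
def listings_from_search_state(payload: dict) -> list[dict]:
    """Pull listResults out of an intercepted search-page-state response.
    The cat1 -> searchResults -> listResults path is an observed
    convention, not a documented API."""
    return (
        payload.get("cat1", {})
        .get("searchResults", {})
        .get("listResults", [])
    )

# Illustrative payload shaped like an intercepted response
payload = {"cat1": {"searchResults": {"listResults": [{"zpid": "111"}]}}}
print(len(listings_from_search_state(payload)))  # 1
```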

Pagination: Walking Through Search Results

Zillow search results are paginated. Here's how to iterate through pages:

import sqlite3

def scrape_market_data(
    location: str,
    max_pages: int = 10,
    proxy: str = None,
    db_path: str = "zillow.db"
) -> int:
    """Scrape and store all listings for a given location."""

    # Set up SQLite storage
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS listings (
            zpid TEXT PRIMARY KEY,
            address TEXT,
            price INTEGER,
            beds REAL,
            baths REAL,
            sqft INTEGER,
            zestimate INTEGER,
            days_on_market INTEGER,
            home_type TEXT,
            status TEXT,
            latitude REAL,
            longitude REAL,
            listing_url TEXT,
            scraped_at TEXT
        )
    """)

    session = create_zillow_session(proxy=proxy)
    total_saved = 0

    for page in range(1, max_pages + 1):
        # Zillow URL pagination: append /N_p/ for page N (page 1 has no suffix)
        if page == 1:
            url = f"https://www.zillow.com/homes/for_sale/{location.replace(' ', '-')}/"
        else:
            url = f"https://www.zillow.com/homes/for_sale/{location.replace(' ', '-')}/{page}_p/"

        print(f"Scraping page {page}/{max_pages}: {url}")
        html, status = fetch_zillow_page(url, session)

        if status != 200:
            print(f"Failed on page {page} with status {status}")
            break

        listings = extract_zillow_listings(html)

        if not listings:
            print(f"No listings found on page {page} — stopping")
            break

        # Save to database
        import datetime
        now = datetime.datetime.now().isoformat()
        for listing in listings:
            try:
                conn.execute(
                    "INSERT OR REPLACE INTO listings VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
                    (
                        str(listing.get("zpid")),
                        listing.get("address"),
                        listing.get("price"),
                        listing.get("beds"),
                        listing.get("baths"),
                        listing.get("sqft"),
                        listing.get("zestimate"),
                        listing.get("days_on_market"),
                        listing.get("home_type"),
                        listing.get("status"),
                        listing.get("latitude"),
                        listing.get("longitude"),
                        listing.get("listing_url"),
                        now,
                    )
                )
                total_saved += 1
            except Exception as e:
                print(f"DB error for {listing.get('zpid')}: {e}")

        conn.commit()
        print(f"  Saved {len(listings)} listings (total: {total_saved})")

        # Human-like delay between pages: 5-15 seconds
        delay = random.uniform(5, 15)
        print(f"  Waiting {delay:.1f}s before next page...")
        time.sleep(delay)

    conn.close()
    return total_saved

# Scrape Seattle listings
count = scrape_market_data(
    location="Seattle WA",
    max_pages=5,
    proxy=PROXY,
    db_path="seattle_listings.db"
)
print(f"Total listings scraped: {count}")

Residential Proxies Are Non-Negotiable

Zillow blocks all known datacenter CIDR ranges at the Cloudflare layer. From an AWS, GCP, or DigitalOcean IP, you get a 403 before any content loads. The block happens before your request even reaches Zillow's application servers.

Even residential IPs get flagged if they:

- Make more than ~30 requests per hour
- Access more than ~100 pages without natural pauses
- Show perfectly uniform timing (bots are too precise; humans aren't)
- Use the same session cookies for too long

ThorData provides rotating residential proxies with geo-targeting — you can target specific US metro areas, which is useful when scraping Zillow market data for a specific region and want the IP to appear local. This also helps with localized Zestimate values that can vary by IP location.

Key rotation strategies:

- Sticky sessions: Use the same residential IP for 5-10 pages, then rotate. Avoids the "teleporting user" pattern where the same session jumps between different IPs.
- Geo-targeting: If scraping Boston listings, route through New England IPs. Zillow's session tracking may behave differently for local vs. remote IPs.
- Rotation on block: If you get a 403, immediately rotate to a new IP — don't retry the same one.

import random

# ThorData sticky session format: append session ID to force same IP for multiple requests
def get_sticky_proxy(session_id: str = None) -> str:
    if session_id is None:
        session_id = str(random.randint(10000, 99999))
    return f"http://user-{session_id}:[email protected]:7777"

# Use same IP for all pages in one city search
city_session = str(random.randint(10000, 99999))
proxy = get_sticky_proxy(city_session)
# After scraping one city, get a fresh session for the next
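The rotate-on-block strategy can be wrapped around any fetch function. This is a sketch: the proxy URL format is hypothetical, and the stub fetch exists only to demonstrate the control flow:

```python
import random

def fresh_proxy() -> str:
    # Hypothetical sticky-session proxy URL; substitute your provider's format
    sid = random.randint(10000, 99999)
    return f"http://user-{sid}:[email protected]:7777"

def fetch_with_rotation(fetch, url: str, max_rotations: int = 3):
    """Retry a fetch on 403, rotating to a fresh proxy each attempt.
    `fetch(url, proxy)` is any callable returning (body, status)."""
    for _ in range(max_rotations):
        proxy = fresh_proxy()
        body, status = fetch(url, proxy)
        if status != 403:
            return body, status
        # Blocked: the loop rotates to a new IP rather than retrying this one
    return "", 403

# Demo with a stub fetch that blocks the first attempt, then succeeds
calls = []
def stub_fetch(url, proxy):
    calls.append(proxy)
    return ("<html>", 200) if len(calls) > 1 else ("", 403)

body, status = fetch_with_rotation(stub_fetch, "https://example.com")
print(status, len(calls))  # 200 2
```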

Anti-Detection Checklist

Before running any Zillow scraper at scale, verify:

  1. TLS fingerprint: Use curl-cffi with impersonate="chrome124" or newer — never vanilla requests
  2. Headers: Include all Sec-Fetch-* headers; missing ones are a bot signal
  3. Delays: 3–8 seconds between page loads minimum; add random jitter
  4. Session length: Don't use the same session for more than 20-30 pages
  5. IP source: Only residential IPs; rotate every 5-10 requests
  6. Referrer: Set appropriate Referer headers (search → listing, not direct-to-listing)
  7. Cookie handling: Let the session accumulate cookies naturally; don't strip them
  8. User agent rotation: Rotate through 3-5 realistic Chrome/Safari user agents

Parsing and Storing Data

Once you have raw listing data, normalize it for analysis:

import csv
from datetime import datetime

def export_listings_to_csv(db_path: str, output_path: str):
    """Export SQLite listings to CSV for analysis."""
    conn = sqlite3.connect(db_path)
    cursor = conn.execute("""
        SELECT
            zpid, address, price, beds, baths, sqft,
            zestimate, days_on_market, home_type, status,
            latitude, longitude,
            CASE WHEN price > 0 AND sqft > 0
                 THEN ROUND(price * 1.0 / sqft, 2)
                 ELSE NULL END as price_per_sqft,
            CASE WHEN zestimate > 0 AND price > 0
                 THEN ROUND((price - zestimate) * 100.0 / zestimate, 1)
                 ELSE NULL END as pct_above_zestimate
        FROM listings
        WHERE status = 'FOR_SALE'
        ORDER BY price ASC
    """)

    fieldnames = [
        "zpid", "address", "price", "beds", "baths", "sqft",
        "zestimate", "days_on_market", "home_type", "status",
        "latitude", "longitude", "price_per_sqft", "pct_above_zestimate"
    ]

    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in cursor:
            writer.writerow(dict(zip(fieldnames, row)))

    conn.close()
    print(f"Exported to {output_path}")

# Market analysis queries
def analyze_market(db_path: str, city: str = None):
    """Print basic market statistics from the database."""
    conn = sqlite3.connect(db_path)

    stats = conn.execute("""
        SELECT
            COUNT(*) as total,
            ROUND(AVG(price)) as avg_price,
            ROUND(AVG(price * 1.0 / sqft), 2) as avg_ppsf,
            ROUND(AVG(days_on_market)) as avg_dom,
            MIN(price) as min_price,
            MAX(price) as max_price
        FROM listings
        WHERE price > 0 AND sqft > 0 AND status = 'FOR_SALE'
    """).fetchone()

    print(f"Market Statistics:")
    print(f"  Total active listings: {stats[0]:,}")
    print(f"  Average price: ${stats[1]:,.0f}")
    print(f"  Average price/sqft: ${stats[2]:.2f}")
    print(f"  Average days on market: {stats[3]:.0f}")
    print(f"  Price range: ${stats[4]:,.0f} — ${stats[5]:,.0f}")

    conn.close()
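One gap in the SQL above: SQLite has no built-in MEDIAN(), and averages skew high in markets with luxury outliers. Computing the median in Python is straightforward, shown here against an in-memory table with toy data:

```python
import sqlite3
import statistics

# SQLite has no MEDIAN(); pull prices out and compute it in Python.
# Toy in-memory data stands in for a scraped database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (zpid TEXT, price INTEGER, status TEXT)")
conn.executemany("INSERT INTO listings VALUES (?,?,?)", [
    ("1", 500000, "FOR_SALE"),
    ("2", 750000, "FOR_SALE"),
    ("3", 900000, "FOR_SALE"),
    ("4", 0, "FOR_SALE"),  # bad row: excluded by the price > 0 filter
])
prices = [row[0] for row in conn.execute(
    "SELECT price FROM listings WHERE price > 0 AND status = 'FOR_SALE'"
)]
median_price = statistics.median(prices)
print(median_price)  # 750000
```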

The Enterprise Path: Zillow Data Licensing

If you're building a product that needs Zillow data at scale legally, the real path is Zillow's Data Licensing Program (formerly Bridge Interactive). Zillow licenses its listing data to MLSs, brokerages, and portals through formal data agreements.

It's not cheap — pricing is negotiated based on usage volume and use case — but it's the only way to get Zestimates, rental estimates, and sold/off-market data with permission. Contact [email protected] for enterprise inquiries.

For most developers, this isn't viable. But if you're building a funded real estate product, it's worth knowing it exists before investing heavily in a scraper Zillow could break at any time.

Alternative Data Sources

When Zillow is too hard to scrape, these sources are easier and sometimes richer:

Redfin (Semi-Public API)

Redfin has a semi-documented API endpoint that returns JSON directly for map queries — no authentication required:

import httpx

def search_redfin(region_id: str = "12898", market: str = "seattle") -> list[dict]:
    """
    Search Redfin listings for a region via the stingray GIS endpoint.
    Defaults target Seattle; region IDs appear in Redfin search-page URLs.
    """
    params = {
        "al": 1,
        "num_homes": 350,
        "mpt": "99",  # property type (99 = all)
        "market": market,
        "region_id": region_id,
        "region_type": "6",
        "sf": "1,2,3,5,6,7",  # include all sale types
        "sold_within_days": "30",
        "v": "8",
    }

    # Redfin's GIS endpoint
    url = "https://www.redfin.com/stingray/api/gis"

    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json",
        "Referer": "https://www.redfin.com/",
    }

    with httpx.Client(headers=headers, timeout=30) as client:
        resp = client.get(url, params=params)

    if resp.status_code == 200:
        # Redfin also prefixes with {}&&
        text = resp.text.replace("{}&&", "")
        data = json.loads(text)
        homes = data.get("payload", {}).get("homes", [])
        return [
            {
                "price": h.get("priceInfo", {}).get("amount"),
                "beds": h.get("beds"),
                "baths": h.get("baths"),
                "sqft": h.get("sqFt", {}).get("value"),
                "address": h.get("streetLine", {}).get("value"),
                "city": h.get("cityStateZip", {}).get("value"),
                "lat": h.get("latLong", {}).get("latitude"),
                "lng": h.get("latLong", {}).get("longitude"),
                "days_on_market": h.get("dom", {}).get("value"),
                "url": f"https://www.redfin.com{h.get('url', '')}",
            }
            for h in homes
        ]
    return []
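Note the {}&& prefix handling in the function above: Redfin prepends it to JSON responses as an anti-hijacking guard, so it must be stripped before parsing. In isolation (payload shape illustrative):

```python
import json

# Redfin prepends {}&& to JSON responses as an anti-hijacking guard;
# strip it before parsing. The payload shape below is illustrative.
raw = '{}&&{"payload": {"homes": [{"beds": 3, "baths": 2}]}}'
data = json.loads(raw.removeprefix("{}&&"))
homes = data["payload"]["homes"]
print(homes[0]["beds"])  # 3
```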

County Assessor Records

County assessor data is public, often downloadable as bulk CSV from county websites. It's slower to update (quarterly or annual releases), but there's no scraping needed and no ToS to worry about.

For many use cases — especially anything involving historical sales, tax records, or ownership history — county assessor data is more complete than Zillow and completely free.
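A minimal loading sketch; the column names (ParcelNumber, SalePrice, SaleDate) are hypothetical since every county exports a different schema:

```python
import csv
import io

# Hypothetical assessor export; real column names vary county to county
sample = """ParcelNumber,SalePrice,SaleDate
123-456-789,550000,2024-06-01
987-654-321,0,
"""

sales = []
for row in csv.DictReader(io.StringIO(sample)):
    price = int(row["SalePrice"] or 0)
    if price > 0:  # $0 rows are usually non-sale transfers (gifts, liens)
        sales.append({
            "parcel": row["ParcelNumber"],
            "price": price,
            "date": row["SaleDate"],
        })
print(len(sales))  # 1
```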

Commercial Options

If you need licensed data without maintaining a scraper, commercial property-data providers (ATTOM, CoreLogic, and similar) sell listing, assessor, and transaction data under contract; pricing varies widely by coverage and use case.

Common Gotchas

JSON structure changes: Zillow updates their frontend regularly. The path to listing data in __NEXT_DATA__ changes. Build your parser to try multiple paths and log when none work.

Sold listings have different structure: The cat1.searchResults.listResults path works for active listings. Sold listings go into a different key (soldResults).

Price formatting: Prices come as strings like "$1,250,000" in some fields and as integers in others. Normalize explicitly:

def parse_price(price_str: str | int | float | None) -> int | None:
    if isinstance(price_str, (int, float)):
        return int(price_str)
    if isinstance(price_str, str):
        cleaned = re.sub(r'[^\d]', '', price_str)
        return int(cleaned) if cleaned else None
    return None

Zestimate availability: Zestimate is only present for ~70% of listings. Handle None gracefully.

Rate limiting is per-IP not per-session: Clearing cookies doesn't reset the rate limit. Only rotating to a new IP helps once you're throttled.


Zillow scraping in 2026 means committing to Playwright or curl-cffi, residential proxies via ThorData or similar, and accepting that your scraper will break periodically when Zillow updates its detection. Keep your code modular, log failures aggressively, and plan for a bi-monthly fix cycle. For anything requiring scale or legal certainty, the data licensing path or alternatives like Redfin's API and county records are worth the tradeoff.