How to Scrape Craigslist Listings with Python (2026)

Craigslist is one of the most useful datasets on the web for local market research. Rental prices, used car inventory, furniture markups, freelance gig rates — the data is there, updated constantly, spanning hundreds of cities. The catch is that Craigslist actively resists automated access, and its city-specific URL structure requires a thoughtful multi-city collection strategy.

This guide covers how to pull listing data across cities efficiently: starting with RSS feeds (which are fast and tolerated), falling back to HTML scraping when you need more fields, and handling the anti-bot measures you'll inevitably hit at scale.

What Data Is Available

Each Craigslist listing exposes:

- Title (which usually contains the price)
- Price
- Posting date and time
- Neighborhood or area label
- Description text
- Photos
- Latitude/longitude coordinates (on some listings)

You won't get seller contact info directly (Craigslist anonymizes emails), but for pricing and geographic analysis, you don't need it.

URL Structure Across Cities

Craigslist uses city-based subdomains. Each city gets its own subdomain under craigslist.org:

sfbay.craigslist.org        # San Francisco Bay Area
newyork.craigslist.org      # New York City
chicago.craigslist.org      # Chicago
losangeles.craigslist.org   # Los Angeles
seattle.craigslist.org      # Seattle
austin.craigslist.org       # Austin
miami.craigslist.org        # Miami
denver.craigslist.org       # Denver

Search within a city follows this pattern:

https://{city}.craigslist.org/search/{category}?query={term}&min_price={min}&max_price={max}

Category codes include apa (apartments), fua (furniture), cto (cars by owner), ggg (gigs), and sof (software jobs). You can find any category's code in the URL path when browsing that section of Craigslist manually.

CITIES = [
    "sfbay", "newyork", "chicago", "losangeles",
    "seattle", "austin", "miami", "denver",
]

CATEGORY = "apa"  # apartments

from urllib.parse import urlencode

def search_url(
    city: str,
    query: str = "",
    min_price: int | None = None,
    max_price: int | None = None,
) -> str:
    base = f"https://{city}.craigslist.org/search/{CATEGORY}"
    params = {}
    if query:
        params["query"] = query
    if min_price is not None:
        params["min_price"] = min_price
    if max_price is not None:
        params["max_price"] = max_price
    return f"{base}?{urlencode(params)}" if params else base

RSS Feeds: The Easy Path

Craigslist exposes RSS feeds for every category and search query. These are the most reliable way to pull listings without triggering anti-bot measures — Craigslist tolerates RSS polling at reasonable intervals because RSS is a standard protocol with clear semantics.

The RSS URL pattern:

https://{city}.craigslist.org/search/{category}.rss
https://{city}.craigslist.org/search/{category}.rss?query={term}

Use feedparser to parse them:

import feedparser
import time
from urllib.parse import quote

def fetch_rss(city: str, category: str = "apa", query: str = "") -> list[dict]:
    url = f"https://{city}.craigslist.org/search/{category}.rss"
    if query:
        url += f"?query={quote(query)}"  # URL-encode spaces and special chars

    feed = feedparser.parse(url)
    listings = []

    for entry in feed.entries:
        listings.append({
            "city": city,
            "title": entry.get("title", ""),
            "url": entry.get("link", ""),
            "published": entry.get("published", ""),
            "description": entry.get("summary", ""),
        })

    return listings

# Collect across cities
all_listings = []
for city in CITIES:
    results = fetch_rss(city, category="apa")
    all_listings.extend(results)
    time.sleep(2)  # Be polite between city requests

print(f"Collected {len(all_listings)} listings")

RSS feeds return the 25 most recent posts per query. They don't support pagination, so they're best for freshness monitoring rather than bulk historical collection.
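If freshness monitoring is what you need, a small polling loop on top of fetch_rss is enough. A minimal sketch, assuming an in-memory seen set (persist it to disk or SQLite for anything long-running) and a 10-minute interval:

import time

def monitor_new_listings(city: str, category: str = "apa", interval: int = 600):
    """Poll the RSS feed and print listings we haven't seen before."""
    seen: set[str] = set()
    while True:
        for item in fetch_rss(city, category):
            url = item.get("url", "")
            if url and url not in seen:
                seen.add(url)
                print(f"NEW [{city}] {item['title']}  {url}")
        time.sleep(interval)  # 10-minute polls keep RSS usage polite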

HTML Scraping Fallback

When you need more than 25 results or fields not in the RSS (coordinates, neighborhood, number of bedrooms), you'll need to scrape the HTML search results and individual listing pages.

Use httpx for requests and selectolax for fast HTML parsing:

import httpx
from selectolax.parser import HTMLParser
import re
import time

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def parse_search_page(html: str, city: str) -> list[dict]:
    tree = HTMLParser(html)
    listings = []

    for item in tree.css("li.cl-search-result"):
        title_node = item.css_first("a.cl-app-anchor span.label")
        price_node = item.css_first(".priceinfo")
        meta_node = item.css_first(".meta")
        link_node = item.css_first("a.cl-app-anchor")

        title = title_node.text(strip=True) if title_node else ""
        price_text = price_node.text(strip=True) if price_node else ""
        meta = meta_node.text(strip=True) if meta_node else ""
        url = link_node.attributes.get("href", "") if link_node else ""

        # Extract numeric price ("$2,850" -> 2850)
        digits = re.sub(r"[^\d]", "", price_text)
        price = int(digits) if digits else None

        listings.append({
            "city": city,
            "title": title,
            "price": price,
            "meta": meta,
            "url": url,
        })

    return listings

def scrape_city(city: str, category: str = "apa", pages: int = 3) -> list[dict]:
    client = httpx.Client(headers=HEADERS, timeout=20, follow_redirects=True)
    all_results = []

    for page in range(pages):
        offset = page * 120  # Craigslist returns 120 results per page
        url = f"https://{city}.craigslist.org/search/{category}?s={offset}"
        response = client.get(url)

        if response.status_code != 200:
            print(f"Got {response.status_code} for {city} page {page}")
            break

        results = parse_search_page(response.text, city)
        all_results.extend(results)
        time.sleep(3)

    client.close()
    return all_results

For individual listing pages, pull coordinates and neighborhood from the listing detail HTML:

def parse_listing_detail(html: str) -> dict:
    tree = HTMLParser(html)

    # Coordinates are in a map link
    map_link = tree.css_first("a[href*='maps.google.com']")
    lat, lon = None, None
    if map_link:
        href = map_link.attributes.get("href", "")
        coord_match = re.search(r"ll=([-\d.]+),([-\d.]+)", href)
        if coord_match:
            lat = float(coord_match.group(1))
            lon = float(coord_match.group(2))

    # Neighborhood label
    neighborhood_node = tree.css_first(".postingtitletext small")
    neighborhood = neighborhood_node.text(strip=True).strip("()") if neighborhood_node else ""

    # Attributes (bedrooms, sqft, etc.)
    attrs = {}
    for span in tree.css(".attrgroup span"):
        text = span.text(strip=True)
        if "BR" in text or "Ba" in text:
            attrs["bedrooms_baths"] = text

    return {"lat": lat, "lon": lon, "neighborhood": neighborhood, "attrs": attrs}

Anti-Bot Measures

Craigslist is not as sophisticated as Cloudflare-protected sites, but it does push back:

IP blocking. Hit the same subdomain too fast and your IP gets a temporary block — you'll see 403 responses or connection resets. This typically lifts after a few hours, but repeated violations can lead to longer bans. The threshold varies by city; large metro subdomains like sfbay and newyork are more aggressively monitored.

CAPTCHA after many requests. Past a certain volume from one IP, Craigslist serves a CAPTCHA interstitial instead of search results. The threshold is roughly 50-100 requests per hour per IP for HTML endpoints. RSS feeds have a higher tolerance.

Rate limiting per city subdomain. Each subdomain appears to have its own rate limit counter. Spreading requests across multiple city subdomains helps — you're hitting different servers, not just the same rate limit bucket.

User-Agent filtering. Requests with Python's default python-httpx/x.x.x user agent get blocked quickly. Always set a realistic browser user agent.

The practical mitigations: slow down your request rate (3-5 seconds between HTML requests), rotate IPs, and prefer RSS when freshness is sufficient.
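One way to encode those mitigations in a request helper. This is a sketch, not a guarantee of unblocking: the backoff base and retry cap are assumptions, and it wraps the httpx client used throughout this guide:

import random
import time
import httpx

def polite_get(client: httpx.Client, url: str, max_retries: int = 3) -> httpx.Response | None:
    """GET with exponential backoff on 403/429 responses."""
    for attempt in range(max_retries):
        response = client.get(url)
        if response.status_code not in (403, 429):
            return response
        # Blocked or rate-limited: wait 30s, 60s, 120s (plus jitter)
        wait = (2 ** attempt) * 30 + random.uniform(0, 10)
        print(f"Got {response.status_code} for {url}, backing off {wait:.0f}s")
        time.sleep(wait)
    return None  # Caller decides whether to rotate IPs or skip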

Multi-City Collection

A coordinator that collects across cities with per-city delays:

import random
from datetime import datetime

def collect_all_cities(cities: list[str], category: str = "apa") -> list[dict]:
    all_data = []

    for city in cities:
        print(f"[{datetime.now().isoformat()}] Fetching {city}...")
        try:
            # Try RSS first (faster, lower risk)
            listings = fetch_rss(city, category)
            for item in listings:
                item["source"] = "rss"
            all_data.extend(listings)
        except Exception as e:
            print(f"RSS failed for {city}: {e}")

        # Random delay between cities: 2-5 seconds
        time.sleep(random.uniform(2, 5))

    return all_data
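The comment above says "try RSS first", so here is a sketch of the per-city fallback that implies: if the feed errors out or comes back empty, fall through to scrape_city from earlier. The single-page limit is an assumption to keep block risk low:

def collect_city_with_fallback(city: str, category: str = "apa") -> list[dict]:
    """RSS first; HTML search results only if RSS fails or is empty."""
    try:
        listings = fetch_rss(city, category)
        if listings:
            for item in listings:
                item["source"] = "rss"
            return listings
    except Exception as e:
        print(f"RSS failed for {city}: {e}")

    listings = scrape_city(city, category, pages=1)
    for item in listings:
        item["source"] = "html"
    return listings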

Proxy Configuration

At scale — more than a handful of cities, running multiple times per day — you'll hit IP bans. Residential proxies are the cleanest solution. Datacenter IPs are blocked quickly by Craigslist; residential IPs that look like regular household traffic get through consistently.

ThorData's residential proxy network works well here. Their rotating residential pool cycles IPs per request, which keeps each city subdomain from seeing repeated traffic from the same source. Setup is straightforward:

PROXY = "http://USER:[email protected]:9000"

client = httpx.Client(
    headers=HEADERS,
    proxy=PROXY,  # httpx >= 0.26; older versions use proxies= instead
    timeout=20,
    follow_redirects=True,
)

# Each request routes through a different residential IP
response = client.get("https://sfbay.craigslist.org/search/apa")

For feedparser, set the proxy via environment variable before parsing:

import os
os.environ["http_proxy"] = PROXY
os.environ["https_proxy"] = PROXY

feed = feedparser.parse("https://sfbay.craigslist.org/search/apa.rss")

Pricing Trend Analysis

Once you have data across cities, cross-city pricing analysis is straightforward. Using a dict-based approach (no pandas required):

from collections import defaultdict
import statistics

def analyze_prices(listings: list[dict]) -> dict:
    by_city = defaultdict(list)

    for item in listings:
        price = item.get("price")
        city = item.get("city")
        if price and city and 100 < price < 20000:  # Filter outliers
            by_city[city].append(price)

    summary = {}
    for city, prices in by_city.items():
        if len(prices) < 5:
            continue
        summary[city] = {
            "count": len(prices),
            "median": statistics.median(prices),
            "mean": round(statistics.mean(prices), 2),
            "min": min(prices),
            "max": max(prices),
            "stdev": round(statistics.stdev(prices), 2) if len(prices) > 1 else 0,
        }

    # Sort by median price descending
    return dict(sorted(summary.items(), key=lambda x: x[1]["median"], reverse=True))

results = analyze_prices(all_listings)
for city, stats in results.items():
    print(f"{city:15s}  median=${stats['median']:,}  n={stats['count']}")

Example output:

sfbay            median=$2,850  n=312
newyork          median=$2,600  n=489
losangeles       median=$2,200  n=271
seattle          median=$1,950  n=198
miami            median=$1,800  n=167
chicago          median=$1,500  n=241
denver           median=$1,450  n=133
austin           median=$1,400  n=155

Storing Data

SQLite is sufficient for multi-city Craigslist data — even 100k listings is fast with the right index:

import sqlite3
from datetime import datetime, timezone

def store_listings(listings: list[dict], db_path: str = "craigslist.db"):
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS listings (
            url TEXT PRIMARY KEY,
            city TEXT,
            title TEXT,
            price INTEGER,
            neighborhood TEXT,
            published TEXT,
            scraped_at TEXT
        )
    """)
    con.execute("CREATE INDEX IF NOT EXISTS idx_city_price ON listings (city, price)")

    now = datetime.now(timezone.utc).isoformat()
    rows = [
        (
            item.get("url"),
            item.get("city"),
            item.get("title"),
            item.get("price"),
            item.get("neighborhood", ""),
            item.get("published", ""),
            now,
        )
        for item in listings
        if item.get("url")
    ]

    con.executemany(
        "INSERT OR IGNORE INTO listings VALUES (?, ?, ?, ?, ?, ?, ?)",
        rows,
    )
    con.commit()
    con.close()
    print(f"Stored {len(rows)} listings")

Using INSERT OR IGNORE with the URL as primary key means re-running the scraper won't create duplicates.
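Reading the table back out is plain SQL. SQLite has no built-in median, so this sketch pulls counts and averages per city; compute medians in Python with analyze_prices above if you need them:

import sqlite3

def city_price_summary(db_path: str = "craigslist.db") -> list[tuple]:
    """Per-city count, average, min, and max price from the listings table."""
    con = sqlite3.connect(db_path)
    rows = con.execute("""
        SELECT city, COUNT(*), ROUND(AVG(price), 2), MIN(price), MAX(price)
        FROM listings
        WHERE price IS NOT NULL
        GROUP BY city
        ORDER BY AVG(price) DESC
    """).fetchall()
    con.close()
    return rows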


Key Takeaways

- Start with RSS feeds: they are the fastest, lowest-risk way to pull the 25 most recent listings per query.
- Fall back to HTML scraping only when you need pagination or fields the RSS lacks (coordinates, bedrooms, square footage).
- Pace yourself: 3-5 seconds between HTML requests, a realistic browser User-Agent, and rotating residential IPs at scale.
- Store listings in SQLite with the URL as primary key so repeated runs stay idempotent.

Advanced: Location Coordinates and Geographic Analysis

Individual listing pages sometimes embed latitude/longitude coordinates. Extracting these enables radius-based analysis:

import httpx
import re
import time
import random
from selectolax.parser import HTMLParser

def get_listing_coordinates(
    listing_url: str,
    headers: dict | None = None,
    proxy: str | None = None,
) -> dict:
    """Fetch an individual listing page and extract geo coordinates."""
    if headers is None:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        }

    client_kwargs = {"headers": headers, "timeout": 15, "follow_redirects": True}
    if proxy:
        client_kwargs["proxy"] = proxy  # httpx >= 0.26; older versions take proxies=

    with httpx.Client(**client_kwargs) as client:
        resp = client.get(listing_url)

    if resp.status_code != 200:
        return {}

    tree = HTMLParser(resp.text)
    result = {}

    # Coordinates appear in the map link href
    map_link = tree.css_first("a[href*='maps.google.com']")
    if map_link:
        href = map_link.attributes.get("href", "")
        coord_match = re.search(r'll=([-\d.]+),([-\d.]+)', href)
        if coord_match:
            result["lat"] = float(coord_match.group(1))
            result["lon"] = float(coord_match.group(2))

    # Also check data attributes on the map div
    map_div = tree.css_first("#map")
    if map_div and not result.get("lat"):
        lat = map_div.attributes.get("data-latitude")
        lon = map_div.attributes.get("data-longitude")
        if lat and lon:
            result["lat"] = float(lat)
            result["lon"] = float(lon)

    # Neighborhood label
    neighborhood = tree.css_first(".postingtitletext small")
    if neighborhood:
        result["neighborhood"] = neighborhood.text(strip=True).strip("() ")

    # Bedrooms/bathrooms from attributes
    attr_group = tree.css_first(".attrgroup")
    if attr_group:
        for span in attr_group.css("span"):
            text = span.text(strip=True)
            if "BR" in text or "Ba" in text:
                result["bedrooms_baths"] = text
            elif "ft" in text and text.replace(",", "").replace("ft2", "").strip().isdigit():
                result["sqft"] = text

    # Full description
    body = tree.css_first("#postingbody")
    if body:
        result["description"] = body.text(strip=True)[:2000]

    return result


def enrich_listings_with_coords(
    listings: list[dict],
    proxy: str | None = None,
    max_per_city: int = 50,
) -> list[dict]:
    """Fetch coordinates for a sample of listings from each city."""
    enriched = []
    city_counts = {}

    for listing in listings:
        city = listing.get("city", "")
        url = listing.get("url", "")

        if not url:
            enriched.append(listing)
            continue

        # Limit per-city enrichment to control volume
        if city_counts.get(city, 0) >= max_per_city:
            enriched.append(listing)
            continue

        geo = get_listing_coordinates(url, proxy=proxy)
        listing.update(geo)
        city_counts[city] = city_counts.get(city, 0) + 1
        enriched.append(listing)

        time.sleep(random.uniform(2, 4))

    return enriched
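With coordinates attached, the radius-based analysis mentioned above reduces to a haversine filter. A sketch; the Seattle center point and 5 km radius in the usage line are illustrative values:

import math

def listings_within_radius(listings: list[dict], center_lat: float,
                           center_lon: float, radius_km: float) -> list[dict]:
    """Keep only listings whose coordinates fall inside the given radius."""
    def haversine_km(lat1, lon1, lat2, lon2):
        r = 6371.0  # mean Earth radius in km
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    return [
        item for item in listings
        if item.get("lat") is not None and item.get("lon") is not None
        and haversine_km(center_lat, center_lon, item["lat"], item["lon"]) <= radius_km
    ]

# e.g. enriched listings within 5 km of downtown Seattle
# nearby = listings_within_radius(enriched, 47.6062, -122.3321, 5)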

Advanced Multi-City Price Analysis

With geographic data, you can do meaningful regional comparisons:

import statistics
from collections import defaultdict

def geographic_price_analysis(listings: list) -> dict:
    """Analyze rental prices by city with statistical breakdown."""
    city_data = defaultdict(list)

    for item in listings:
        price = item.get("price")
        city = item.get("city")
        if price and city and 300 < price < 15000:
            city_data[city].append({
                "price": price,
                "neighborhood": item.get("neighborhood"),
                "bedrooms_baths": item.get("bedrooms_baths"),
            })

    analysis = {}
    for city, data in city_data.items():
        prices = [d["price"] for d in data]
        if len(prices) < 5:
            continue

        # Neighborhood breakdown
        neighborhoods = defaultdict(list)
        for d in data:
            if d.get("neighborhood"):
                neighborhoods[d["neighborhood"]].append(d["price"])

        top_neighborhoods = {
            nb: round(statistics.median(nb_prices), 0)
            for nb, nb_prices in sorted(
                neighborhoods.items(),
                key=lambda x: statistics.median(x[1]),
                reverse=True,
            )[:5]
            if len(nb_prices) >= 3
        }

        analysis[city] = {
            "count": len(prices),
            "median": statistics.median(prices),
            "mean": round(statistics.mean(prices), 2),
            "p25": sorted(prices)[len(prices)//4],
            "p75": sorted(prices)[3*len(prices)//4],
            "min": min(prices),
            "max": max(prices),
            "std": round(statistics.stdev(prices), 2) if len(prices) > 1 else 0,
            "top_neighborhoods": top_neighborhoods,
        }

    return dict(sorted(analysis.items(), key=lambda x: x[1]["median"], reverse=True))


def compute_city_affordability_index(analysis: dict) -> list:
    """
    Rank cities by affordability relative to each other.
    Normalized index where 100 = median of all cities.
    """
    medians = [v["median"] for v in analysis.values()]
    overall_median = statistics.median(medians)

    indexed = []
    for city, stats in analysis.items():
        index = round(stats["median"] / overall_median * 100, 1)
        indexed.append({
            "city": city,
            "median_price": stats["median"],
            "affordability_index": index,  # < 100 = cheaper than average
            "sample_size": stats["count"],
        })

    return sorted(indexed, key=lambda x: x["affordability_index"])
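Chaining the two functions together; the print formatting here is my own choice, not part of the analysis:

analysis = geographic_price_analysis(all_listings)
for row in compute_city_affordability_index(analysis):
    print(f"{row['city']:12s}  index={row['affordability_index']:6.1f}  "
          f"median=${row['median_price']:,.0f}  n={row['sample_size']}")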

Categories Beyond Apartments

Craigslist's value extends well beyond housing. Here are the most useful categories for different use cases:

# Complete category reference
CRAIGSLIST_CATEGORIES = {
    # Housing
    "aap": "apartments",
    "apa": "apartments (all)",
    "roo": "rooms & shares",
    "sub": "sublets",
    "vac": "vacation rentals",

    # For Sale
    "cto": "cars by owner",
    "cta": "cars by dealer",
    "bik": "bicycles",
    "ele": "electronics",
    "fua": "furniture",
    "app": "appliances",
    "spo": "sporting goods",
    "tls": "tools",
    "zip": "free stuff",

    # Jobs
    "sof": "software/QA/dba",
    "web": "web/html/info design",
    "eng": "engineering",
    "mdi": "medical/health",
    "mar": "marketing/PR/ad",

    # Services
    "bts": "beauty services",
    "cps": "computer services",
    "lgs": "legal services",
    "lss": "lessons/tutoring",

    # Gigs
    "cpg": "computer gigs",
    "crg": "creative gigs",
    "lbg": "labor gigs",
    "smg": "skilled trades gigs",
    "wag": "writing/editing gigs",
}


import re
import statistics
import time
from datetime import datetime

def build_market_report(cities: list[str], categories: list[str]) -> dict:
    """
    Build a cross-city, cross-category market report.
    Returns structured data suitable for analysis or export.
    """
    report = {
        "metadata": {
            "cities": cities,
            "categories": categories,
            "generated": datetime.now().isoformat(),
        },
        "data": {},
    }

    for city in cities:
        report["data"][city] = {}
        for category in categories:
            try:
                listings = fetch_rss(city, category=category)
                prices = []
                for item in listings:
                    # Extract a dollar amount from the title/description text
                    text = item.get("title", "") + " " + item.get("description", "")
                    price_match = re.search(r"\$([\d,]+)", text)
                    if price_match:
                        price = int(price_match.group(1).replace(",", ""))
                        if 10 < price < 100000:
                            prices.append(price)

                report["data"][city][category] = {
                    "count": len(listings),
                    "median_price": statistics.median(prices) if prices else None,
                    "listings_with_price": len(prices),
                }

                time.sleep(2)
            except Exception as e:
                report["data"][city][category] = {"error": str(e)}

    return report
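A small invocation keeps request volume (and block risk) low; the city/category pairing here is just an example:

report = build_market_report(["sfbay", "austin"], ["apa", "fua"])
print(report["data"]["sfbay"].get("apa"))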

Deduplication Strategy

Craigslist listings sometimes appear in multiple cities or get reposted. Here is a deduplication approach:

import hashlib
import re

def deduplicate_listings(listings: list) -> list:
    """
    Remove duplicate listings based on title similarity and price.
    Handles both exact duplicates and slightly modified reposts.
    """
    seen_urls = set()
    seen_hashes = set()
    unique = []

    for listing in listings:
        # Exact URL dedup (skip empty URLs so they don't collide)
        url = listing.get("url", "")
        if url:
            if url in seen_urls:
                continue
            seen_urls.add(url)

        # Content-based dedup (for reposts with different URLs):
        # normalize the title, then hash title + price together
        title = listing.get("title", "").lower().strip()
        price = listing.get("price", 0)
        normalized = re.sub(r"[^a-z0-9 ]", "", title)
        normalized = re.sub(r"\s+", " ", normalized)

        content_hash = hashlib.md5(f"{normalized}_{price}".encode()).hexdigest()
        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            unique.append(listing)

    return unique


def flag_suspicious_listings(listings: list) -> list:
    """Flag listings that may be fraudulent based on common patterns."""
    import re

    suspicious_patterns = [
        r'\$\d+.*per.*month.*utilities included.*furnished',
        r'owner.*overseas',
        r'send.*money.*order',
        r'western union',
        r'email.*only.*no.*call',
        r'price.*too.*good',
    ]

    for listing in listings:
        text = (listing.get("title", "") + " " + listing.get("description", "")).lower()
        flags = []
        for pattern in suspicious_patterns:
            if re.search(pattern, text):
                flags.append(pattern)
        if flags:
            listing["suspicious_flags"] = flags
        else:
            listing["suspicious_flags"] = []

    return listings
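A typical cleanup pipeline runs deduplication first, then flagging, since flagging re-scans every listing's text:

clean = flag_suspicious_listings(deduplicate_listings(all_listings))
flagged = [l for l in clean if l["suspicious_flags"]]
print(f"{len(flagged)} of {len(clean)} listings flagged as suspicious")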

Scheduling and Automation

Set up automated collection that runs continuously without manual intervention:

import os
import schedule
import time
from datetime import datetime

def scheduled_craigslist_run():
    """Run the multi-city scraper on a schedule."""
    print(f"[{datetime.now().isoformat()}] Starting scheduled run")

    # Route feedparser's requests through the proxy via environment
    # variables (see Proxy Configuration above)
    PROXY = "http://USER:[email protected]:9000"
    os.environ["http_proxy"] = PROXY
    os.environ["https_proxy"] = PROXY

    cities = ["sfbay", "newyork", "chicago", "losangeles", "seattle"]
    categories = ["apa", "cto", "sof"]

    all_listings = []
    for city in cities:
        for category in categories:
            try:
                listings = fetch_rss(city, category=category)
                all_listings.extend(listings)
            except Exception as e:
                print(f"  Error {city}/{category}: {e}")
            time.sleep(2)

    # Deduplicate and store
    unique = deduplicate_listings(all_listings)
    store_listings(unique)

    print(f"[{datetime.now().isoformat()}] Done: {len(unique)} unique listings")


# Schedule to run every 30 minutes
schedule.every(30).minutes.do(scheduled_craigslist_run)

if __name__ == "__main__":
    scheduled_craigslist_run()  # Run immediately on start
    while True:
        schedule.run_pending()
        time.sleep(60)

Complete Field Reference

Here is every field available from Craigslist listings by method:

RSS feed (fast, low risk):

- title - Listing title (often contains price)
- url - Direct link to listing
- published - Posted date/time
- description - HTML summary (may contain price, photos)
- city - Which city subdomain

HTML search results (more fields):

- title - Listing title
- url - Direct link
- price - Numeric price extracted from title
- meta - Area/neighborhood note from search result
- city - City subdomain

Individual listing page (most complete), all of the above plus:

- lat/lon - GPS coordinates (when available)
- neighborhood - Neighborhood name from the title's small tag
- bedrooms_baths - "1BR / 1Ba" format
- sqft - Square footage (apartments)
- description - Full text of listing body
- images - Array of photo URLs

Craigslist's terms of service prohibit automated scraping, and its robots.txt blocks most crawlers. Craigslist has also litigated against scrapers (hiQ Labs v. LinkedIn is the frequently cited precedent for public data access, but Craigslist's own case history is more aggressive). For personal research or academic use, the risk is low. For commercial applications at scale, consult a lawyer and consider whether purchasing a data license or using a compliant data broker is a better path.

The safest approach: use RSS feeds (which Craigslist explicitly provides as a syndication format), keep request rates low (one request per 2-5 seconds), and scrape individual listing pages only when you need the extra field coverage that HTML provides.

ThorData's residential proxy network is recommended for any sustained multi-city collection: the rotating residential IPs keep per-IP rate limits from accumulating across city subdomains.