How to Scrape Craigslist Listings with Python (2026)
Craigslist is one of the most useful datasets on the web for local market research. Rental prices, used car inventory, furniture markups, freelance gig rates — the data is there, updated constantly, spanning hundreds of cities. The catch is that Craigslist actively resists automated access, and its city-specific URL structure requires a thoughtful multi-city collection strategy.
This guide covers how to pull listing data across cities efficiently: starting with RSS feeds (which are fast and tolerated), falling back to HTML scraping when you need more fields, and handling the anti-bot measures you'll inevitably hit at scale.
What Data Is Available
Each Craigslist listing exposes:
- Title and description — free text, often contains specs, condition, model
- Price — listed in the title and the posting body
- Location — city, neighborhood, and sometimes latitude/longitude in the listing HTML
- Category — housing, for sale, jobs, gigs, services, etc.
- Posted date — useful for freshness filtering and velocity analysis
- Listing URL — stable per-post identifier
You won't get seller contact info directly (Craigslist anonymizes emails), but for pricing and geographic analysis, you don't need it.
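Before writing any collection code, it helps to pin down a normalized record shape for these fields. A sketch, with field names of my own choosing (not an official Craigslist schema):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Listing:
    """Normalized record for one Craigslist post (field names are illustrative)."""
    url: str                       # stable per-post identifier
    city: str                      # subdomain, e.g. "sfbay"
    title: str
    price: Optional[int] = None    # not every post lists a price
    neighborhood: str = ""
    published: str = ""            # date string from the feed or page

row = Listing(url="https://sfbay.craigslist.org/apa/123.html",
              city="sfbay", title="Sunny 1BR near park", price=2400)
record = asdict(row)  # plain dict, ready for SQLite or JSON
```

Keeping every collection path (RSS, search HTML, detail pages) funneling into one record shape makes the later analysis and storage steps much simpler.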
URL Structure Across Cities
Craigslist uses city-based subdomains. Each city gets its own subdomain under craigslist.org:
sfbay.craigslist.org # San Francisco Bay Area
newyork.craigslist.org # New York City
chicago.craigslist.org # Chicago
losangeles.craigslist.org # Los Angeles
seattle.craigslist.org # Seattle
austin.craigslist.org # Austin
miami.craigslist.org # Miami
denver.craigslist.org # Denver
Search within a city follows this pattern:
https://{city}.craigslist.org/search/{category}?query={term}&min_price={min}&max_price={max}
Category codes include apa (apartments), fua (furniture), cto (cars & trucks by owner), ggg (gigs), sof (software jobs). You can read any category's code from the URL path when browsing Craigslist manually.
CITIES = [
"sfbay", "newyork", "chicago", "losangeles",
"seattle", "austin", "miami", "denver",
]
CATEGORY = "apa" # apartments
from urllib.parse import urlencode

def search_url(city: str, query: str = "", min_price: int | None = None, max_price: int | None = None) -> str:
    base = f"https://{city}.craigslist.org/search/{CATEGORY}"
    params = {}
    if query:
        params["query"] = query
    if min_price is not None:
        params["min_price"] = min_price
    if max_price is not None:
        params["max_price"] = max_price
    return f"{base}?{urlencode(params)}" if params else base
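As a sanity check, here is the URL a bounded apartment search should produce. The snippet builds it directly with urlencode so it stands alone:

```python
from urllib.parse import urlencode

params = {"query": "studio", "min_price": 1000, "max_price": 3000}
url = f"https://sfbay.craigslist.org/search/apa?{urlencode(params)}"
# → https://sfbay.craigslist.org/search/apa?query=studio&min_price=1000&max_price=3000
```

urlencode also takes care of escaping spaces and special characters in multi-word queries, which naive string interpolation would not.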
RSS Feeds: The Easy Path
Craigslist exposes RSS feeds for every category and search query. These are the most reliable way to pull listings without triggering anti-bot measures — Craigslist tolerates RSS polling at reasonable intervals because RSS is a standard syndication format with clear semantics.
The RSS URL pattern:
https://{city}.craigslist.org/search/{category}.rss
https://{city}.craigslist.org/search/{category}.rss?query={term}
Use feedparser to parse them:
import feedparser
import time
from urllib.parse import urlencode

def fetch_rss(city: str, category: str = "apa", query: str = "") -> list[dict]:
    url = f"https://{city}.craigslist.org/search/{category}.rss"
    if query:
        url += f"?{urlencode({'query': query})}"  # URL-encode multi-word queries
feed = feedparser.parse(url)
listings = []
for entry in feed.entries:
listings.append({
"city": city,
"title": entry.get("title", ""),
"url": entry.get("link", ""),
"published": entry.get("published", ""),
"description": entry.get("summary", ""),
})
return listings
# Collect across cities
all_listings = []
for city in CITIES:
results = fetch_rss(city, category="apa")
all_listings.extend(results)
time.sleep(2) # Be polite between city requests
print(f"Collected {len(all_listings)} listings")
RSS feeds return the 25 most recent posts per query. They don't support pagination, so they're best for freshness monitoring rather than bulk historical collection.
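Since each poll returns only the newest posts, freshness monitoring reduces to diffing successive polls against the URLs you have already seen. A minimal sketch of that bookkeeping, independent of feedparser:

```python
def new_listings(entries: list[dict], seen_urls: set[str]) -> list[dict]:
    """Return entries not seen in earlier polls; record them as seen."""
    fresh = [e for e in entries if e.get("url") and e["url"] not in seen_urls]
    seen_urls.update(e["url"] for e in fresh)
    return fresh

seen: set[str] = set()
first_poll = [{"url": "https://sfbay.craigslist.org/apa/1"},
              {"url": "https://sfbay.craigslist.org/apa/2"}]
second_poll = [{"url": "https://sfbay.craigslist.org/apa/2"},
               {"url": "https://sfbay.craigslist.org/apa/3"}]
batch1 = new_listings(first_poll, seen)
batch2 = new_listings(second_poll, seen)  # only /apa/3 is new
```

Persist the seen set (e.g. to SQLite) between runs so a restart does not re-report old posts.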
HTML Scraping Fallback
When you need more than 25 results or fields not in the RSS (coordinates, neighborhood, number of bedrooms), you'll need to scrape the HTML search results and individual listing pages.
Use httpx for requests and selectolax for fast HTML parsing:
import httpx
import re
import time
from selectolax.parser import HTMLParser
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
"Accept-Language": "en-US,en;q=0.9",
}
def parse_search_page(html: str, city: str) -> list[dict]:
tree = HTMLParser(html)
listings = []
for item in tree.css("li.cl-search-result"):
title_node = item.css_first("a.cl-app-anchor span.label")
price_node = item.css_first(".priceinfo")
meta_node = item.css_first(".meta")
link_node = item.css_first("a.cl-app-anchor")
title = title_node.text(strip=True) if title_node else ""
price_text = price_node.text(strip=True) if price_node else ""
meta = meta_node.text(strip=True) if meta_node else ""
url = link_node.attributes.get("href", "") if link_node else ""
        # Extract numeric price (strip thousands separators, then take the digits)
        price_match = re.search(r"\d+", price_text.replace(",", ""))
        price = int(price_match.group()) if price_match else None
listings.append({
"city": city,
"title": title,
"price": price,
"meta": meta,
"url": url,
})
return listings
def scrape_city(city: str, category: str = "apa", pages: int = 3) -> list[dict]:
    all_results = []
    with httpx.Client(headers=HEADERS, timeout=20, follow_redirects=True) as client:
        for page in range(pages):
            offset = page * 120  # Craigslist returns 120 results per page
            url = f"https://{city}.craigslist.org/search/{category}?s={offset}"
            response = client.get(url)
            if response.status_code != 200:
                print(f"Got {response.status_code} for {city} page {page}")
                break
            all_results.extend(parse_search_page(response.text, city))
            time.sleep(3)
    return all_results
For individual listing pages, pull coordinates and neighborhood from the listing detail HTML:
def parse_listing_detail(html: str) -> dict:
tree = HTMLParser(html)
# Coordinates are in a map link
map_link = tree.css_first("a[href*='maps.google.com']")
lat, lon = None, None
if map_link:
href = map_link.attributes.get("href", "")
coord_match = re.search(r"ll=([-\d.]+),([-\d.]+)", href)
if coord_match:
lat = float(coord_match.group(1))
lon = float(coord_match.group(2))
# Neighborhood label
neighborhood_node = tree.css_first(".postingtitletext small")
neighborhood = neighborhood_node.text(strip=True).strip("()") if neighborhood_node else ""
# Attributes (bedrooms, sqft, etc.)
attrs = {}
for span in tree.css(".attrgroup span"):
text = span.text(strip=True)
if "BR" in text or "Ba" in text:
attrs["bedrooms_baths"] = text
return {"lat": lat, "lon": lon, "neighborhood": neighborhood, "attrs": attrs}
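The coordinate extraction above hinges on a single regex against the map link's href. Here it is in isolation; the exact href shape is an assumption about Craigslist's current markup:

```python
import re

href = "https://maps.google.com/?ll=37.7749,-122.4194&z=16"  # illustrative href shape
match = re.search(r"ll=([-\d.]+),([-\d.]+)", href)
lat, lon = (float(match.group(1)), float(match.group(2))) if match else (None, None)
# lat == 37.7749, lon == -122.4194
```

If Craigslist changes the map provider or link format, this regex is the single point to update.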
Anti-Bot Measures
Craigslist is not as sophisticated as Cloudflare-protected sites, but it does push back:
IP blocking. Hit the same subdomain too fast and your IP gets a temporary block — you'll see 403 responses or connection resets. This typically lifts after a few hours, but repeated violations can lead to longer bans. The threshold varies by city; large metro subdomains like sfbay and newyork are more aggressively monitored.
CAPTCHA after many requests. Past a certain volume from one IP, Craigslist serves a CAPTCHA interstitial instead of search results. The threshold is roughly 50-100 requests per hour per IP for HTML endpoints. RSS feeds have a higher tolerance.
Rate limiting per city subdomain. Each subdomain appears to have its own rate limit counter. Spreading requests across multiple city subdomains helps — you're hitting different servers, not just the same rate limit bucket.
User-Agent filtering. Requests with Python's default python-httpx/x.x.x user agent get blocked quickly. Always set a realistic browser user agent.
The practical mitigations: slow down your request rate (3-5 seconds between HTML requests), rotate IPs, and prefer RSS when freshness is sufficient.
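The "slow down and retry" advice can be wrapped into a small helper. A sketch that backs off exponentially on block signals (403/429/5xx) and adds jitter; the fetch callable and thresholds are illustrative, not Craigslist-specific:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_retries=4, base_delay=3.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body); back off exponentially on block signals."""
    delay = base_delay
    for _ in range(max_retries):
        status, body = fetch(url)
        if status == 200:
            return body
        if status in (403, 429) or status >= 500:
            sleep(delay + random.uniform(0, 1))  # jitter avoids lockstep retries
            delay *= 2
        else:
            return None  # 404 etc.: not a rate-limit problem, don't retry
    return None
```

Injecting the sleep function makes the helper trivially testable without waiting out real delays.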
Multi-City Collection
A coordinator that collects across cities with per-city delays:
import random
from datetime import datetime
def collect_all_cities(cities: list[str], category: str = "apa") -> list[dict]:
all_data = []
for city in cities:
print(f"[{datetime.now().isoformat()}] Fetching {city}...")
try:
# Try RSS first (faster, lower risk)
listings = fetch_rss(city, category)
for item in listings:
item["source"] = "rss"
all_data.extend(listings)
except Exception as e:
print(f"RSS failed for {city}: {e}")
# Random delay between cities: 2-5 seconds
time.sleep(random.uniform(2, 5))
return all_data
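If you later switch this coordinator to HTML pagination, interleaving page fetches across cities spreads load over the per-subdomain rate-limit buckets instead of draining one city at a time. A sketch of the schedule:

```python
def interleaved_schedule(cities: list[str], pages: int) -> list[tuple[str, int]]:
    """Round-robin (city, page) pairs: every city's page 0, then every page 1, ..."""
    return [(city, page) for page in range(pages) for city in cities]

plan = interleaved_schedule(["sfbay", "newyork", "chicago"], 2)
# with more than one city, consecutive requests never hit the same subdomain
```

The same wall-clock delay between requests then translates into a much lower per-subdomain request rate.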
Proxy Configuration
At scale — more than a handful of cities, running multiple times per day — you'll hit IP bans. Residential proxies are the cleanest solution. Datacenter IPs are blocked quickly by Craigslist; residential IPs that look like regular household traffic get through consistently.
ThorData's residential proxy network works well here. Their rotating residential pool cycles IPs per request, which keeps each city subdomain from seeing repeated traffic from the same source. Setup is straightforward:
PROXY = "http://USER:[email protected]:9000"
client = httpx.Client(
headers=HEADERS,
proxy=PROXY,
timeout=20,
follow_redirects=True,
)
# Each request routes through a different residential IP
response = client.get("https://sfbay.craigslist.org/search/apa")
For feedparser, set the proxy via environment variable before parsing:
import os
os.environ["http_proxy"] = PROXY
os.environ["https_proxy"] = PROXY
feed = feedparser.parse("https://sfbay.craigslist.org/search/apa.rss")
Pricing Trend Analysis
Once you have data across cities, cross-city pricing analysis is straightforward. Using a dict-based approach (no pandas required):
from collections import defaultdict
import statistics
def analyze_prices(listings: list[dict]) -> dict:
by_city = defaultdict(list)
for item in listings:
price = item.get("price")
city = item.get("city")
if price and city and 100 < price < 20000: # Filter outliers
by_city[city].append(price)
summary = {}
for city, prices in by_city.items():
if len(prices) < 5:
continue
summary[city] = {
"count": len(prices),
"median": statistics.median(prices),
"mean": round(statistics.mean(prices), 2),
"min": min(prices),
"max": max(prices),
"stdev": round(statistics.stdev(prices), 2) if len(prices) > 1 else 0,
}
# Sort by median price descending
return dict(sorted(summary.items(), key=lambda x: x[1]["median"], reverse=True))
results = analyze_prices(all_listings)
for city, stats in results.items():
print(f"{city:15s} median=${stats['median']:,} n={stats['count']}")
Example output:
sfbay           median=$2,850 n=312
newyork         median=$2,600 n=489
losangeles      median=$2,200 n=271
seattle         median=$1,950 n=198
miami           median=$1,800 n=167
chicago         median=$1,500 n=241
denver          median=$1,450 n=133
austin          median=$1,400 n=155
Storing Data
SQLite is sufficient for multi-city Craigslist data — even 100k listings is fast with the right index:
import sqlite3
from datetime import datetime, timezone

def store_listings(listings: list[dict], db_path: str = "craigslist.db"):
    con = sqlite3.connect(db_path)
    con.execute("""
        CREATE TABLE IF NOT EXISTS listings (
            url TEXT PRIMARY KEY,
            city TEXT,
            title TEXT,
            price INTEGER,
            neighborhood TEXT,
            published TEXT,
            scraped_at TEXT
        )
    """)
    con.execute("CREATE INDEX IF NOT EXISTS idx_city_price ON listings (city, price)")
    now = datetime.now(timezone.utc).isoformat()  # timezone-aware; utcnow() is deprecated
rows = [
(
item.get("url"),
item.get("city"),
item.get("title"),
item.get("price"),
item.get("neighborhood", ""),
item.get("published", ""),
now,
)
for item in listings
if item.get("url")
]
con.executemany(
"INSERT OR IGNORE INTO listings VALUES (?, ?, ?, ?, ?, ?, ?)",
rows,
)
con.commit()
con.close()
print(f"Stored {len(rows)} listings")
Using INSERT OR IGNORE with the URL as primary key means re-running the scraper won't create duplicates.
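If you would rather refresh a listing's price on reposts than ignore them, SQLite's upsert clause (available since SQLite 3.24) is a drop-in change. A minimal in-memory sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE listings (url TEXT PRIMARY KEY, price INTEGER)")

def upsert(url: str, price: int) -> None:
    con.execute(
        """INSERT INTO listings (url, price) VALUES (?, ?)
           ON CONFLICT(url) DO UPDATE SET price = excluded.price""",
        (url, price),
    )

upsert("https://sfbay.craigslist.org/apa/1", 2400)
upsert("https://sfbay.craigslist.org/apa/1", 2300)  # repost with a price drop
price = con.execute("SELECT price FROM listings").fetchone()[0]  # 2300
```

This keeps the idempotency of the URL primary key while letting re-runs capture price changes over time.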
Legal Note
Craigslist's terms of service prohibit automated scraping. Their robots.txt blocks most crawlers. Craigslist has also litigated against scrapers (hiQ Labs v. LinkedIn is the relevant precedent for public data access, but Craigslist has its own case history). For personal research or academic use, the risk is low. For commercial applications at scale, consult a lawyer and consider whether purchasing a data license or using a compliant data broker is a better path.
Key Takeaways
- Craigslist uses city subdomains (sfbay.craigslist.org, newyork.craigslist.org) — your collection logic needs to iterate across a city list, not a single domain.
- RSS feeds (/search/{category}.rss) are the easiest entry point: no HTML parsing, higher rate-limit tolerance, and feedparser handles the rest.
- HTML scraping with httpx and selectolax gets you more data (coordinates, neighborhood, full pagination) at the cost of higher ban risk.
- Craigslist's anti-bot measures include IP blocking, CAPTCHAs past roughly 50-100 requests per hour per IP, and user-agent filtering — distribute load across cities and add random delays.
- For sustained multi-city collection, residential proxies are necessary. ThorData's residential proxy network handles IP rotation automatically, keeping each city subdomain from flagging your traffic.
- Store with SQLite and INSERT OR IGNORE on the URL as primary key — idempotent runs, no duplicate handling required.
Advanced: Location Coordinates and Geographic Analysis
Individual listing pages sometimes embed latitude/longitude coordinates. Extracting these enables radius-based analysis:
import httpx
import re
import time
import random
from selectolax.parser import HTMLParser
def get_listing_coordinates(listing_url: str, headers: dict = None, proxy: str = None) -> dict:
"""Fetch individual listing page and extract geo coordinates."""
if headers is None:
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
}
    client_kwargs = {"headers": headers, "timeout": 15, "follow_redirects": True}
    if proxy:
        client_kwargs["proxy"] = proxy  # httpx >= 0.26; older versions used proxies={"all://": proxy}
with httpx.Client(**client_kwargs) as client:
resp = client.get(listing_url)
if resp.status_code != 200:
return {}
tree = HTMLParser(resp.text)
result = {}
# Coordinates appear in the map link href
map_link = tree.css_first("a[href*='maps.google.com']")
if map_link:
href = map_link.attributes.get("href", "")
coord_match = re.search(r'll=([-\d.]+),([-\d.]+)', href)
if coord_match:
result["lat"] = float(coord_match.group(1))
result["lon"] = float(coord_match.group(2))
# Also check data attributes on the map div
map_div = tree.css_first("#map")
if map_div and not result.get("lat"):
lat = map_div.attributes.get("data-latitude")
lon = map_div.attributes.get("data-longitude")
if lat and lon:
result["lat"] = float(lat)
result["lon"] = float(lon)
# Neighborhood label
neighborhood = tree.css_first(".postingtitletext small")
if neighborhood:
result["neighborhood"] = neighborhood.text(strip=True).strip("() ")
# Bedrooms/bathrooms from attributes
attr_group = tree.css_first(".attrgroup")
if attr_group:
for span in attr_group.css("span"):
text = span.text(strip=True)
if "BR" in text or "Ba" in text:
result["bedrooms_baths"] = text
elif "ft" in text and text.replace(",", "").replace("ft2", "").strip().isdigit():
result["sqft"] = text
# Full description
body = tree.css_first("#postingbody")
if body:
result["description"] = body.text(strip=True)[:2000]
return result
def enrich_listings_with_coords(
listings: list,
proxy: str = None,
max_per_city: int = 50,
) -> list:
"""Fetch coordinates for a sample of listings from each city."""
enriched = []
city_counts = {}
for listing in listings:
city = listing.get("city", "")
url = listing.get("url", "")
if not url:
enriched.append(listing)
continue
# Limit per-city enrichment to control volume
if city_counts.get(city, 0) >= max_per_city:
enriched.append(listing)
continue
geo = get_listing_coordinates(url, proxy=proxy)
listing.update(geo)
city_counts[city] = city_counts.get(city, 0) + 1
enriched.append(listing)
time.sleep(random.uniform(2, 4))
return enriched
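With coordinates in hand, the radius-based analysis mentioned above is a haversine computation. A self-contained sketch:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # Earth mean radius ~6371 km

def within_radius(listings: list, center_lat: float, center_lon: float, km: float) -> list:
    """Keep only listings with coordinates inside the given radius."""
    return [
        l for l in listings
        if l.get("lat") is not None
        and haversine_km(l["lat"], l["lon"], center_lat, center_lon) <= km
    ]
```

For example, "apartments within 5 km of downtown" becomes a single filter over the enriched listings.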
Advanced Multi-City Price Analysis
With geographic data, you can do meaningful regional comparisons:
import statistics
from collections import defaultdict
def geographic_price_analysis(listings: list) -> dict:
"""Analyze rental prices by city with statistical breakdown."""
city_data = defaultdict(list)
for item in listings:
price = item.get("price")
city = item.get("city")
if price and city and 300 < price < 15000:
city_data[city].append({
"price": price,
"neighborhood": item.get("neighborhood"),
"bedrooms_baths": item.get("bedrooms_baths"),
})
analysis = {}
for city, data in city_data.items():
prices = [d["price"] for d in data]
if len(prices) < 5:
continue
# Neighborhood breakdown
neighborhoods = defaultdict(list)
for d in data:
if d.get("neighborhood"):
neighborhoods[d["neighborhood"]].append(d["price"])
top_neighborhoods = {
nb: round(statistics.median(prices), 0)
for nb, prices in sorted(
neighborhoods.items(),
key=lambda x: statistics.median(x[1]),
reverse=True
)[:5]
if len(prices) >= 3
}
analysis[city] = {
"count": len(prices),
"median": statistics.median(prices),
"mean": round(statistics.mean(prices), 2),
"p25": sorted(prices)[len(prices)//4],
"p75": sorted(prices)[3*len(prices)//4],
"min": min(prices),
"max": max(prices),
"std": round(statistics.stdev(prices), 2) if len(prices) > 1 else 0,
"top_neighborhoods": top_neighborhoods,
}
return dict(sorted(analysis.items(), key=lambda x: x[1]["median"], reverse=True))
def compute_city_affordability_index(analysis: dict) -> list:
"""
Rank cities by affordability relative to each other.
Normalized index where 100 = median of all cities.
"""
medians = [v["median"] for v in analysis.values()]
overall_median = statistics.median(medians)
indexed = []
for city, stats in analysis.items():
index = round(stats["median"] / overall_median * 100, 1)
indexed.append({
"city": city,
"median_price": stats["median"],
"affordability_index": index, # < 100 = cheaper than average
"sample_size": stats["count"],
})
return sorted(indexed, key=lambda x: x["affordability_index"])
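To make the index arithmetic concrete, here it is applied to the sample medians from the earlier example output (numbers are illustrative):

```python
import statistics

medians = {"sfbay": 2850, "newyork": 2600, "losangeles": 2200, "seattle": 1950,
           "miami": 1800, "chicago": 1500, "denver": 1450, "austin": 1400}
overall = statistics.median(medians.values())                     # 1875.0
index = {c: round(m / overall * 100, 1) for c, m in medians.items()}
# austin ≈ 74.7 (cheapest relative to the group), sfbay ≈ 152.0
```

An index below 100 means the city is cheaper than the group's midpoint, which is easier to communicate than raw dollar medians.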
Categories Beyond Apartments
Craigslist's value extends well beyond housing. Here are the most useful categories for different use cases:
# Complete category reference
CRAIGSLIST_CATEGORIES = {
# Housing
"aap": "apartments",
"apa": "apartments (all)",
"roo": "rooms & shares",
"sub": "sublets",
"vac": "vacation rentals",
# For Sale
"cto": "cars by owner",
"cta": "cars by dealer",
"bik": "bicycles",
"ele": "electronics",
"fua": "furniture",
"app": "appliances",
"spo": "sporting goods",
"tls": "tools",
"zip": "free stuff",
# Jobs
"sof": "software/QA/dba",
"web": "web/html/info design",
"eng": "engineering",
"mdi": "medical/health",
"mar": "marketing/PR/ad",
# Services
"bts": "beauty services",
"cps": "computer services",
"lgs": "legal services",
"lss": "lessons/tutoring",
# Gigs
"cpg": "computer gigs",
"crg": "creative gigs",
"lbg": "labor gigs",
"smg": "skilled trades gigs",
"wag": "writing/editing gigs",
}
import re
import statistics
import time
from datetime import datetime

def build_market_report(cities: list, categories: list) -> dict:
    """
    Build a cross-city, cross-category market report.
    Returns structured data suitable for analysis or export.
    """
    report = {
        "metadata": {
            "cities": cities,
            "categories": categories,
            "generated": datetime.now().isoformat(),
        },
        "data": {},
    }
    for city in cities:
        report["data"][city] = {}
        for category in categories:
            try:
                listings = fetch_rss(city, category=category)
                prices = []
                for item in listings:
                    # Extract price from title/description
                    text = item.get("title", "") + " " + item.get("description", "")
                    price_match = re.search(r"\$([\d,]+)", text)
                    if price_match:
                        price = int(price_match.group(1).replace(",", ""))
                        if 10 < price < 100000:
                            prices.append(price)
                report["data"][city][category] = {
                    "count": len(listings),
                    "median_price": statistics.median(prices) if prices else None,
                    "listings_with_price": len(prices),
                }
                time.sleep(2)
            except Exception as e:
                report["data"][city][category] = {"error": str(e)}
    return report
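The price-from-text step buried inside build_market_report is worth isolating so it can be unit-tested on its own. A standalone version of the same regex logic:

```python
import re
from typing import Optional

def extract_price(text: str) -> Optional[int]:
    """First $-prefixed number in a title or description, commas stripped."""
    match = re.search(r"\$([\d,]+)", text)
    return int(match.group(1).replace(",", "")) if match else None

extract_price("Spacious 2BR $2,350/mo utilities incl")  # → 2350
extract_price("free couch, pickup only")                # → None
```

Titles with multiple dollar amounts (deposits, move-in specials) will match the first one, which is usually but not always the asking price.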
Deduplication Strategy
Craigslist listings sometimes appear in multiple cities or get reposted. Here is a deduplication approach:
import hashlib
import re

def deduplicate_listings(listings: list) -> list:
    """
    Remove duplicate listings based on title similarity and price.
    Handles both exact duplicates and slightly modified reposts.
    """
    seen = set()  # holds both raw URLs and content hashes
    unique = []
    for listing in listings:
        # Exact URL dedup
        url = listing.get("url", "")
        if url in seen:
            continue
        seen.add(url)
        # Content-based dedup (catches reposts with different URLs)
        title = listing.get("title", "").lower().strip()
        price = listing.get("price", 0)
        # Normalize title: remove punctuation, collapse whitespace
        normalized = re.sub(r"[^a-z0-9 ]", "", title)
        normalized = re.sub(r"\s+", " ", normalized)
        content_hash = hashlib.md5(f"{normalized}_{price}".encode()).hexdigest()
        if content_hash not in seen:
            seen.add(content_hash)
            unique.append(listing)
    return unique
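The content-hash step deserves a standalone check: two reposts whose titles differ only in punctuation, casing, or extra whitespace should collapse to one key. The helper below mirrors that normalization:

```python
import hashlib
import re

def content_key(title: str, price) -> str:
    """Hash of normalized title + price, matching the dedup logic above."""
    norm = re.sub(r"[^a-z0-9 ]", "", title.lower().strip())
    norm = re.sub(r"\s+", " ", norm)
    return hashlib.md5(f"{norm}_{price}".encode()).hexdigest()

content_key("Cozy 1BR!!  Great light", 1500) == content_key("cozy 1br great light", 1500)  # True
```

Note that price is part of the key, so a repost at a different price is deliberately kept as a distinct listing.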
def flag_suspicious_listings(listings: list) -> list:
"""Flag listings that may be fraudulent based on common patterns."""
import re
suspicious_patterns = [
r'\$\d+.*per.*month.*utilities included.*furnished',
r'owner.*overseas',
r'send.*money.*order',
r'western union',
r'email.*only.*no.*call',
r'price.*too.*good',
]
    for listing in listings:
        text = (listing.get("title", "") + " " + listing.get("description", "")).lower()
        # Every matching pattern is recorded; an empty list means nothing suspicious
        listing["suspicious_flags"] = [p for p in suspicious_patterns if re.search(p, text)]
return listings
Scheduling and Automation
Set up automated collection that runs continuously without manual intervention:
import os
import schedule
import time
from datetime import datetime

def scheduled_craigslist_run():
    """Run the multi-city scraper on a schedule."""
    print(f"[{datetime.now().isoformat()}] Starting scheduled run")
    # Route feedparser requests through the proxy (see Proxy Configuration)
    PROXY = "http://USER:[email protected]:9000"
    os.environ["http_proxy"] = os.environ["https_proxy"] = PROXY
cities = ["sfbay", "newyork", "chicago", "losangeles", "seattle"]
categories = ["apa", "cto", "sof"]
all_listings = []
for city in cities:
for category in categories:
try:
listings = fetch_rss(city, category=category)
all_listings.extend(listings)
except Exception as e:
print(f" Error {city}/{category}: {e}")
time.sleep(2)
# Deduplicate and store
unique = deduplicate_listings(all_listings)
store_listings(unique)
print(f"[{datetime.now().isoformat()}] Done: {len(unique)} unique listings")
# Schedule to run every 30 minutes
schedule.every(30).minutes.do(scheduled_craigslist_run)
if __name__ == "__main__":
scheduled_craigslist_run() # Run immediately on start
while True:
schedule.run_pending()
time.sleep(60)
Complete Field Reference
Here is every field available from Craigslist listings by method:
RSS feed (fast, low risk):
- title - Listing title (often contains price)
- url - Direct link to listing
- published - Posted date/time
- description - HTML summary (may contain price, photos)
- city - Which city subdomain
HTML search results (more fields):
- title - Listing title
- url - Direct link
- price - Numeric price extracted from title
- meta - Area/neighborhood note from search result
- city - City subdomain
Individual listing page (most complete):
- All of above, plus:
- lat/lon - GPS coordinates (when available)
- neighborhood - Neighborhood name from title small tag
- bedrooms_baths - "1BR / 1Ba" format
- sqft - Square footage (apartments)
- description - Full text of listing body
- images - Array of photo URLs
Legal Considerations
As the Legal Note above covers, Craigslist's terms of service prohibit automated scraping, its robots.txt blocks most crawlers, and the company has an aggressive litigation history against scrapers. For personal research or academic use the risk is low; for commercial applications at scale, consult a lawyer and consider whether purchasing a data license is a better path.
The safest approach: use RSS feeds (which Craigslist explicitly provides as a syndication format), keep request rates low (one request per 2-5 seconds), and scrape individual listing pages only when you need the extra field coverage that HTML provides.
ThorData's residential proxy network is recommended for any sustained multi-city collection -- the rotating residential IPs prevent per-IP rate limiting from accumulating across city subdomains.