How to Scrape DoorDash Restaurant Data in 2026 (Menus, Delivery Zones, ETAs)
DoorDash has no public API for third-party developers. If you need restaurant data for market research, price comparison, or delivery analytics, you have to extract it yourself.
The good news: DoorDash's frontend relies on a GraphQL API that returns structured JSON. Once you understand the endpoint, you can pull menus, delivery fees, ETAs, and ratings without parsing HTML.
What Data You Can Extract
DoorDash exposes a lot through its internal API:
- Restaurant name, address, coordinates
- Menu categories and individual items with prices
- Delivery fee, service fee, minimum order
- Estimated delivery time (ETA)
- Restaurant rating and review count
- Operating hours and delivery radius
- Promotions and DashPass eligibility
What you won't get without an account: order history, driver data, or real-time driver locations.
Table of Contents
- Understanding DoorDash's Architecture
- Finding the GraphQL Endpoint
- Anti-Bot Defenses and How to Handle Them
- Setting Up Your Scraping Environment
- Core Scraping Code
- Finding Store IDs at Scale
- Handling Rate Limits and Blocks
- Storing and Analyzing the Data
- Building a Multi-City Scraper
- Proxy Strategy with ThorData
- Playwright-Based Fallback
- Real-World Use Cases
- Legal Considerations
- Performance Optimization
Understanding DoorDash's Architecture {#architecture}
Before writing any code, it helps to understand how DoorDash actually works under the hood. The website is a React single-page application. When you visit a restaurant page, the browser loads a mostly empty HTML shell and then fires off API calls to populate the content.
These API calls go to DoorDash's internal GraphQL endpoint. GraphQL is a query language where the client specifies exactly what fields it wants in the response. This is actually great for scraping: the API returns clean, structured JSON without any HTML parsing.
The key insight is that these API calls use the same endpoint your browser uses. If you replicate the same HTTP requests with the right headers, you get the same data.
DoorDash runs on:
- CloudFront CDN for static assets and some API responses
- Amazon WAF (Web Application Firewall) for bot detection
- Internal rate limiting per IP and per session
- TLS fingerprinting via their CDN layer
Understanding this stack tells you what defenses you need to bypass.
Finding the GraphQL Endpoint {#graphql}
Open DoorDash in Chrome, navigate to any restaurant page, and open DevTools (F12). Go to the Network tab and filter by "Fetch/XHR".
Reload the page. You'll see a flood of POST requests to:
https://www.doordash.com/graphql
Each request carries:
1. An operationName field identifying what it's fetching
2. A variables object with query parameters
3. A query string with the GraphQL query definition
Click on any of these requests to inspect the payload and response. The operations most useful for restaurant data:
- getStoreDetails: restaurant metadata (name, address, ratings, delivery info, hours)
- getStoreMenu: full menu structure with categories, items, prices, descriptions
- searchStoresV3: discover stores by geographic coordinates
- getStoreAvailability: operating status, open/closed, next available time
You can also find the full schema by looking at the __schema introspection queries that DoorDash's own frontend makes. This reveals every available field.
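If you want to probe the schema yourself, the request body is just a standard GraphQL introspection query. A sketch of the payload shape (DoorDash may disable introspection in production, in which case the endpoint returns an error instead of the schema):

```python
import json

# Standard GraphQL introspection query, trimmed to top-level type names.
INTROSPECTION_PAYLOAD = {
    "operationName": "IntrospectionQuery",
    "variables": {},
    "query": "query IntrospectionQuery { __schema { types { name kind } } }",
}

# Serialized, this is the JSON body you'd POST to the /graphql endpoint.
body = json.dumps(INTROSPECTION_PAYLOAD)
```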
Anti-Bot Defenses and How to Handle Them {#anti-bot}
DoorDash runs aggressive bot detection. A naive requests.get() call returns a 403 or redirects to a CAPTCHA page. Here's what you're actually up against:
TLS Fingerprinting
DoorDash checks your TLS handshake characteristics. Standard Python requests and even httpx use a recognizable TLS fingerprint that gets flagged. Real browsers (Chrome, Firefox) negotiate TLS differently — different cipher suites, different extensions, different ALPN protocols.
The fix: use curl-cffi, a Python library that wraps libcurl and lets you impersonate specific browser TLS fingerprints.
from curl_cffi import requests as cffi_requests
session = cffi_requests.Session(impersonate="chrome120")
resp = session.get("https://www.doordash.com/graphql")
This makes your TLS handshake look identical to Chrome 120's.
HTTP/2 Fingerprinting
Beyond TLS, HTTP/2 frames have a fingerprint too — window sizes, header ordering, SETTINGS frames. curl-cffi handles this correctly when you set the impersonation target.
CloudFront Bot Detection
Amazon's WAF analyzes request patterns: request timing, header ordering, missing browser-specific headers. You need to send headers in the right order with the right values.
Rate Limiting
More than ~30-40 requests per minute from a single IP triggers soft blocks (increasing delays) and eventually hard blocks (403s that don't resolve).
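That budget matters for planning: at a safe request rate, a city-wide scrape takes hours, not minutes. A back-of-the-envelope helper (hypothetical, just arithmetic on the limits above):

```python
def scrape_duration_minutes(num_stores: int, requests_per_store: int = 2,
                            safe_rpm: int = 25) -> float:
    """Minutes needed to scrape num_stores at a safe requests-per-minute rate."""
    return (num_stores * requests_per_store) / safe_rpm
```

At 25 requests per minute, 1,000 restaurants (one details call plus one menu call each) take 80 minutes from a single IP, which is the core argument for rotating proxies later in this guide.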
Session Tracking
DoorDash sets cookies when you first visit. Subsequent requests without those cookies look bot-like. Always carry cookies across your session.
Setting Up Your Scraping Environment {#setup}
Install the required packages. Note that sqlite3 and asyncio ship with the Python standard library and are not pip-installable:
pip install curl-cffi httpx requests beautifulsoup4 pandas aiohttp
Basic session setup with TLS impersonation:
from curl_cffi import requests as cffi_requests
import json
import time
import random
# Proxy configuration (residential proxy recommended)
PROXY_URL = "http://USERNAME:[email protected]:9000"
# Browser-like headers - these must be in the right order
HEADERS = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Accept": "application/json",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Content-Type": "application/json",
"Referer": "https://www.doordash.com/",
"Origin": "https://www.doordash.com",
"x-channel-id": "marketplace",
"x-client-version": "24.0.0",
"x-experience-id": "doordash",
"Sec-Ch-Ua": '"Not/A)Brand";v="8", "Chromium";v="126", "Google Chrome";v="126"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"macOS"',
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
}
GRAPHQL_URL = "https://www.doordash.com/graphql"
def create_session():
"""Create a session with Chrome TLS fingerprint."""
session = cffi_requests.Session(impersonate="chrome120")
session.headers.update(HEADERS)
if PROXY_URL:
session.proxies = {"https": PROXY_URL, "http": PROXY_URL}
return session
def warm_up_session(session):
"""Visit homepage first to get cookies, like a real browser would."""
try:
session.get("https://www.doordash.com/", timeout=15)
time.sleep(random.uniform(1.5, 3.0))
except Exception as e:
print(f"Warmup failed (continuing anyway): {e}")
Core Scraping Code {#core-code}
Fetching Restaurant Details
def get_store_details(session, store_id: int) -> dict:
"""Fetch restaurant metadata: name, rating, delivery info, hours."""
payload = {
"operationName": "getStoreDetails",
"variables": {
"storeId": store_id,
"consumerAddressId": None,
"fetchMenuCategories": False,
},
"query": """
query getStoreDetails($storeId: Int!, $consumerAddressId: BigInt) {
storeDetails: storeV2(storeId: $storeId) {
id
name
phoneNumber
description
coverImgUrl
averageRating
numRatings
priceRange
address {
street
city
state
zipCode
lat
lng
countryCode
}
businessHours {
dayOfWeek
openTime
closeTime
isCurrentlyOpen
}
deliveryFee
serviceFee
minOrderAmount
estimatedDeliveryTime
deliveryRadius
isDashPassEligible
headerImgUrl
cuisineType
tags
isOpen
nextOpenTime
businessType
}
}"""
}
resp = session.post(GRAPHQL_URL, json=payload, timeout=20)
resp.raise_for_status()
data = resp.json()
if "errors" in data:
raise ValueError(f"GraphQL errors: {data['errors']}")
return data.get("data", {}).get("storeDetails", {})
def get_store_menu(session, store_id: int) -> list[dict]:
"""Fetch full menu with categories, items, prices, and modifiers."""
payload = {
"operationName": "getStoreMenu",
"variables": {
"storeId": store_id,
},
"query": """
query getStoreMenu($storeId: Int!) {
storeMenu(storeId: $storeId) {
categories {
id
name
description
isPopular
items {
id
name
description
price
originalPrice
imageUrl
isAvailable
isPopular
alcoholic
portionSizeInfo
extras {
id
name
minNumOptions
maxNumOptions
options {
id
name
price
default
}
}
}
}
}
}"""
}
resp = session.post(GRAPHQL_URL, json=payload, timeout=20)
resp.raise_for_status()
data = resp.json()
if "errors" in data:
raise ValueError(f"GraphQL errors: {data['errors']}")
menu_data = data.get("data", {}).get("storeMenu", {})
return menu_data.get("categories", [])
def scrape_restaurant(store_id: int) -> dict:
"""Full restaurant scrape: details + menu."""
session = create_session()
warm_up_session(session)
print(f"Scraping store {store_id}...")
# Get details
time.sleep(random.uniform(1, 2))
details = get_store_details(session, store_id)
# Get menu
time.sleep(random.uniform(2, 4))
menu = get_store_menu(session, store_id)
return {
"details": details,
"menu": menu,
"scraped_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
}
Usage Example
if __name__ == "__main__":
store_id = 61092 # Pizza Hut NYC
result = scrape_restaurant(store_id)
details = result["details"]
menu = result["menu"]
print(f"\nRestaurant: {details.get('name')}")
print(f"Rating: {details.get('averageRating')} ({details.get('numRatings')} reviews)")
print(f"Delivery fee: ${(details.get('deliveryFee') or 0) / 100:.2f}")
print(f"ETA: {details.get('estimatedDeliveryTime')} min")
print(f"DashPass eligible: {details.get('isDashPassEligible')}")
print(f"\nMenu ({len(menu)} categories):")
for category in menu[:3]: # show first 3 categories
items = category.get("items", [])
print(f"\n {category['name']} ({len(items)} items):")
for item in items[:5]: # show first 5 items
price = (item.get("price") or 0) / 100
popular = " [Popular]" if item.get("isPopular") else ""
print(f" ${price:.2f} — {item['name']}{popular}")
Finding Store IDs at Scale {#store-ids}
Store IDs appear in DoorDash URLs. doordash.com/store/pizza-hut-new-york-61092/ has store ID 61092. But to build a comprehensive database, you need to discover stores systematically.
Search by Location
def search_stores_by_location(session, lat: float, lng: float,
query: str = "", limit: int = 50) -> list[dict]:
"""Discover stores near a geographic coordinate."""
payload = {
"operationName": "searchStoresV3",
"variables": {
"latitude": lat,
"longitude": lng,
"query": query,
"limit": limit,
"offset": 0,
"filters": {
"sortOrder": "RELEVANCE",
}
},
"query": """
query searchStoresV3($latitude: Float!, $longitude: Float!,
$query: String, $limit: Int, $offset: Int) {
searchStoresV3(
latitude: $latitude
longitude: $longitude
query: $query
limit: $limit
offset: $offset
) {
stores {
id
name
averageRating
numRatings
estimatedDeliveryTime
deliveryFee
cuisineType
address {
city
state
}
isDashPassEligible
isOpen
}
hasMore
totalCount
}
}"""
}
resp = session.post(GRAPHQL_URL, json=payload, timeout=20)
resp.raise_for_status()
data = resp.json()
return data.get("data", {}).get("searchStoresV3", {})
def discover_stores_in_city(city_lat: float, city_lng: float,
radius_km: float = 5.0) -> list[int]:
"""
Grid-search a city to discover all store IDs.
Uses a grid of GPS coordinates spaced ~2km apart.
"""
import math
session = create_session()
warm_up_session(session)
# Calculate grid steps (approximately 0.018 degrees = 2km)
step = 0.018
lat_steps = int(radius_km / 2) + 1
lng_steps = int(radius_km / 2) + 1
all_store_ids = set()
requests_made = 0
for lat_offset in range(-lat_steps, lat_steps + 1):
for lng_offset in range(-lng_steps, lng_steps + 1):
lat = city_lat + (lat_offset * step)
lng = city_lng + (lng_offset * step)
try:
result = search_stores_by_location(session, lat, lng, limit=50)
stores = result.get("stores", [])
for store in stores:
store_id = store.get("id")
if store_id:
all_store_ids.add(store_id)
requests_made += 1
print(f"Grid point ({lat:.3f}, {lng:.3f}): "
f"{len(stores)} stores, total unique: {len(all_store_ids)}")
# Polite delay between grid points
time.sleep(random.uniform(2, 5))
except Exception as e:
print(f"Failed at ({lat:.3f}, {lng:.3f}): {e}")
time.sleep(10) # longer backoff on error
print(f"\nDiscovered {len(all_store_ids)} unique stores "
f"in {requests_made} requests")
return list(all_store_ids)
# Example: discover restaurants in Manhattan
# manhattan_stores = discover_stores_in_city(40.7580, -73.9855, radius_km=5)
Pagination Through Search Results
def get_all_stores_for_query(session, lat: float, lng: float,
query: str, max_results: int = 500) -> list[dict]:
"""Paginate through search results to get more than 50 stores."""
all_stores = []
offset = 0
limit = 50
while offset < max_results:
payload = {
"operationName": "searchStoresV3",
"variables": {
"latitude": lat,
"longitude": lng,
"query": query,
"limit": limit,
"offset": offset,
},
"query": """
query searchStoresV3($latitude: Float!, $longitude: Float!,
$query: String, $limit: Int, $offset: Int) {
searchStoresV3(latitude: $latitude, longitude: $longitude,
query: $query, limit: $limit, offset: $offset) {
stores { id name cuisineType averageRating }
hasMore
}
}"""
}
resp = session.post(GRAPHQL_URL, json=payload, timeout=20)
data = resp.json()
result = data.get("data", {}).get("searchStoresV3", {})
stores = result.get("stores", [])
all_stores.extend(stores)
if not result.get("hasMore") or not stores:
break
offset += limit
time.sleep(random.uniform(1, 3))
return all_stores
Handling Rate Limits and Blocks {#rate-limits}
Detecting Different Block Types
import re
def check_response(resp) -> str:
"""Classify the response to detect blocks."""
if resp.status_code == 200:
data = resp.json()
if "errors" in data:
errors = data["errors"]
error_msg = str(errors)
if "UNAUTHENTICATED" in error_msg:
return "auth_required"
if "RATE_LIMITED" in error_msg:
return "rate_limited"
return "graphql_error"
return "success"
elif resp.status_code == 403:
# Check if it's CloudFront or application-level
server = resp.headers.get("server", "")
if "CloudFront" in server:
return "cloudfront_block"
return "forbidden"
elif resp.status_code == 429:
return "rate_limited"
elif resp.status_code == 503:
return "service_unavailable"
return f"http_error_{resp.status_code}"
def robust_graphql_request(session, payload: dict,
max_retries: int = 5) -> dict:
"""Make a GraphQL request with exponential backoff and block detection."""
for attempt in range(max_retries):
try:
resp = session.post(GRAPHQL_URL, json=payload, timeout=20)
status = check_response(resp)
if status == "success":
return resp.json()
elif status == "rate_limited":
retry_after = int(resp.headers.get("Retry-After", 60))
wait = max(retry_after, 30 * (2 ** attempt))
print(f"Rate limited. Waiting {wait}s (attempt {attempt+1})")
time.sleep(wait)
elif status == "cloudfront_block":
# CloudFront block - need fresh IP and session
wait = 120 * (2 ** attempt)
print(f"CloudFront block detected. Waiting {wait}s")
time.sleep(wait)
session = create_session() # fresh session with new proxy
warm_up_session(session)
elif status == "auth_required":
# Some endpoints need authentication
raise PermissionError("Endpoint requires authentication")
else:
wait = 10 * (2 ** attempt)
print(f"Status: {status}. Waiting {wait}s")
time.sleep(wait)
except Exception as e:  # curl-cffi raises its own exception types, not the builtins
wait = 15 * (2 ** attempt)
print(f"Network error: {e}. Waiting {wait}s")
time.sleep(wait)
raise RuntimeError(f"All {max_retries} attempts failed")
Request Pacing
import threading
from collections import deque
class RequestPacer:
"""Enforces a minimum delay between requests with jitter."""
def __init__(self, min_delay: float = 2.0, max_delay: float = 5.0,
burst_limit: int = 10):
self.min_delay = min_delay
self.max_delay = max_delay
self.burst_limit = burst_limit
self.request_times = deque(maxlen=burst_limit)
self.lock = threading.Lock()
def wait(self):
"""Block until it's safe to make another request."""
with self.lock:
now = time.time()
# Enforce burst limit: if we've made `burst_limit` requests
# in the last 60 seconds, wait
if len(self.request_times) == self.burst_limit:
oldest = self.request_times[0]
elapsed = now - oldest
if elapsed < 60:
sleep_time = 60 - elapsed + random.uniform(0, 5)
print(f"Burst limit reached. Waiting {sleep_time:.1f}s")
time.sleep(sleep_time)
# Enforce minimum delay from last request
if self.request_times:
last = self.request_times[-1]
min_next = last + self.min_delay
if time.time() < min_next:
delay = min_next - time.time() + random.uniform(0, self.max_delay - self.min_delay)
time.sleep(delay)
self.request_times.append(time.time())
pacer = RequestPacer(min_delay=2, max_delay=5, burst_limit=20)
def paced_scrape_restaurant(store_id: int) -> dict:
pacer.wait()
return scrape_restaurant(store_id)
Storing and Analyzing the Data {#storage}
SQLite Schema
import sqlite3
def create_database(db_path: str = "doordash.db"):
"""Create database schema for DoorDash data."""
conn = sqlite3.connect(db_path)
c = conn.cursor()
c.executescript("""
CREATE TABLE IF NOT EXISTS restaurants (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
phone TEXT,
description TEXT,
cuisine_type TEXT,
price_range INTEGER,
average_rating REAL,
num_ratings INTEGER,
delivery_fee INTEGER,
service_fee INTEGER,
min_order_amount INTEGER,
estimated_delivery_time INTEGER,
delivery_radius REAL,
is_dashpass_eligible BOOLEAN,
is_open BOOLEAN,
street TEXT,
city TEXT,
state TEXT,
zip_code TEXT,
lat REAL,
lng REAL,
scraped_at TEXT,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS menu_categories (
id INTEGER PRIMARY KEY,
restaurant_id INTEGER REFERENCES restaurants(id),
name TEXT NOT NULL,
description TEXT,
is_popular BOOLEAN,
sort_order INTEGER
);
CREATE TABLE IF NOT EXISTS menu_items (
id INTEGER PRIMARY KEY,
category_id INTEGER REFERENCES menu_categories(id),
restaurant_id INTEGER REFERENCES restaurants(id),
name TEXT NOT NULL,
description TEXT,
price INTEGER,
original_price INTEGER,
image_url TEXT,
is_available BOOLEAN,
is_popular BOOLEAN,
is_alcoholic BOOLEAN,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS scrape_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
store_id INTEGER,
status TEXT,
error_message TEXT,
scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_restaurants_city ON restaurants(city, state);
CREATE INDEX IF NOT EXISTS idx_restaurants_cuisine ON restaurants(cuisine_type);
CREATE INDEX IF NOT EXISTS idx_items_restaurant ON menu_items(restaurant_id);
""")
conn.commit()
return conn
def save_restaurant(conn: sqlite3.Connection, data: dict):
"""Save restaurant and menu to database (upsert)."""
c = conn.cursor()
details = data.get("details", {})
menu = data.get("menu", [])
scraped_at = data.get("scraped_at")
address = details.get("address") or {}
# Upsert restaurant
c.execute("""
INSERT INTO restaurants (
id, name, phone, description, cuisine_type, price_range,
average_rating, num_ratings, delivery_fee, service_fee,
min_order_amount, estimated_delivery_time, delivery_radius,
is_dashpass_eligible, is_open, street, city, state, zip_code,
lat, lng, scraped_at
) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(id) DO UPDATE SET
average_rating=excluded.average_rating,
num_ratings=excluded.num_ratings,
delivery_fee=excluded.delivery_fee,
estimated_delivery_time=excluded.estimated_delivery_time,
is_open=excluded.is_open,
scraped_at=excluded.scraped_at,
updated_at=CURRENT_TIMESTAMP
""", (
details.get("id"),
details.get("name"),
details.get("phoneNumber"),
details.get("description"),
details.get("cuisineType"),
details.get("priceRange"),
details.get("averageRating"),
details.get("numRatings"),
details.get("deliveryFee"),
details.get("serviceFee"),
details.get("minOrderAmount"),
details.get("estimatedDeliveryTime"),
details.get("deliveryRadius"),
details.get("isDashPassEligible"),
details.get("isOpen"),
address.get("street"),
address.get("city"),
address.get("state"),
address.get("zipCode"),
address.get("lat"),
address.get("lng"),
scraped_at,
))
restaurant_id = details.get("id")
# Clear existing menu (fresh scrape)
c.execute("DELETE FROM menu_categories WHERE restaurant_id=?", (restaurant_id,))
c.execute("DELETE FROM menu_items WHERE restaurant_id=?", (restaurant_id,))
# Insert menu
for sort_order, category in enumerate(menu):
cat_id = category.get("id")
c.execute("""
INSERT OR REPLACE INTO menu_categories
(id, restaurant_id, name, description, is_popular, sort_order)
VALUES (?, ?, ?, ?, ?, ?)
""", (
cat_id, restaurant_id,
category.get("name"), category.get("description"),
category.get("isPopular"), sort_order
))
for item in category.get("items", []):
c.execute("""
INSERT OR REPLACE INTO menu_items
(id, category_id, restaurant_id, name, description,
price, original_price, image_url, is_available,
is_popular, is_alcoholic, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
item.get("id"), cat_id, restaurant_id,
item.get("name"), item.get("description"),
item.get("price"), item.get("originalPrice"),
item.get("imageUrl"), item.get("isAvailable"),
item.get("isPopular"), item.get("alcoholic"),
scraped_at
))
conn.commit()
Analysis Queries
def analyze_market(conn: sqlite3.Connection, city: str):
"""Run market analysis queries on the collected data."""
c = conn.cursor()
print(f"\n=== DoorDash Market Analysis: {city} ===\n")
# Top cuisines
print("Top 10 cuisines by restaurant count:")
c.execute("""
SELECT cuisine_type, COUNT(*) as count,
AVG(average_rating) as avg_rating,
AVG(delivery_fee / 100.0) as avg_fee
FROM restaurants
WHERE city = ? AND cuisine_type IS NOT NULL
GROUP BY cuisine_type
ORDER BY count DESC
LIMIT 10
""", (city,))
for row in c.fetchall():
print(f" {row[0]}: {row[1]} restaurants, "
f"avg rating {row[2]:.2f}, avg fee ${row[3]:.2f}")
# Price distribution
print("\nDelivery fee distribution:")
c.execute("""
SELECT
CASE
WHEN delivery_fee = 0 THEN 'Free'
WHEN delivery_fee < 200 THEN '$0.01-$1.99'
WHEN delivery_fee < 400 THEN '$2.00-$3.99'
WHEN delivery_fee < 600 THEN '$4.00-$5.99'
ELSE '$6.00+'
END as fee_range,
COUNT(*) as count
FROM restaurants
WHERE city = ?
GROUP BY fee_range
ORDER BY MIN(delivery_fee)
""", (city,))
for row in c.fetchall():
print(f" {row[0]}: {row[1]} restaurants")
# Most popular menu items across restaurants
print("\nMost common popular items across all restaurants:")
c.execute("""
SELECT name, COUNT(*) as frequency
FROM menu_items
WHERE is_popular = 1
AND restaurant_id IN (SELECT id FROM restaurants WHERE city = ?)
GROUP BY LOWER(name)
ORDER BY frequency DESC
LIMIT 15
""", (city,))
for row in c.fetchall():
print(f" '{row[0]}': appears in {row[1]} restaurants")
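The same database also feeds spreadsheet or pandas workflows. A standard-library sketch that dumps any table to CSV (assumes the schema defined earlier):

```python
import csv
import sqlite3

def export_table_to_csv(db_path: str, table: str, out_path: str) -> int:
    """Dump a table to CSV with a header row; returns the row count."""
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute(f"SELECT * FROM {table}")  # table name comes from our own code, not user input
    columns = [desc[0] for desc in c.description]
    rows = c.fetchall()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)
    conn.close()
    return len(rows)
```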
Building a Multi-City Scraper {#multi-city}
# Major US cities with coordinates
CITIES = {
"New York": (40.7580, -73.9855),
"Los Angeles": (34.0522, -118.2437),
"Chicago": (41.8781, -87.6298),
"Houston": (29.7604, -95.3698),
"Phoenix": (33.4484, -112.0740),
"Philadelphia": (39.9526, -75.1652),
"San Antonio": (29.4241, -98.4936),
"San Diego": (32.7157, -117.1611),
"Dallas": (32.7767, -96.7970),
"Austin": (30.2672, -97.7431),
}
def scrape_city(city_name: str, lat: float, lng: float,
db_path: str = "doordash_multi.db") -> dict:
"""Scrape all restaurants in a city."""
print(f"\nStarting scrape for {city_name}...")
session = create_session()
warm_up_session(session)
# Discover store IDs
result = search_stores_by_location(session, lat, lng, limit=50)
stores = result.get("stores", [])
store_ids = [s["id"] for s in stores if s.get("id")]
conn = create_database(db_path)
scraped = 0
failed = 0
pacer = RequestPacer(min_delay=3, max_delay=7)
for store_id in store_ids:
try:
pacer.wait()
data = scrape_restaurant(store_id)
save_restaurant(conn, data)
scraped += 1
print(f" [{city_name}] Scraped {scraped}/{len(store_ids)}: "
f"{data['details'].get('name', store_id)}")
except Exception as e:
failed += 1
c = conn.cursor()
c.execute("""
INSERT INTO scrape_log (store_id, status, error_message)
VALUES (?, 'error', ?)
""", (store_id, str(e)))
conn.commit()
print(f" [{city_name}] Failed {store_id}: {e}")
time.sleep(random.uniform(5, 15))
conn.close()
return {
"city": city_name,
"total": len(store_ids),
"scraped": scraped,
"failed": failed,
}
def run_multi_city_scraper(cities: dict, db_path: str = "doordash_multi.db"):
"""Run the scraper across multiple cities sequentially (be polite)."""
results = []
for city_name, (lat, lng) in cities.items():
result = scrape_city(city_name, lat, lng, db_path)
results.append(result)
# Rest between cities
rest_time = random.uniform(30, 60)
print(f"Resting {rest_time:.0f}s before next city...")
time.sleep(rest_time)
print("\n=== Multi-City Scrape Complete ===")
for r in results:
print(f" {r['city']}: {r['scraped']}/{r['total']} scraped, "
f"{r['failed']} failed")
return results
if __name__ == "__main__":
run_multi_city_scraper(
{k: v for k, v in list(CITIES.items())[:3]}, # start with 3 cities
db_path="doordash_multi.db"
)
Proxy Strategy with ThorData {#proxies}
Residential proxies are essential for DoorDash scraping at scale. Datacenter IPs get flagged almost immediately because DoorDash's WAF recognizes the IP ranges belonging to cloud providers like AWS, GCP, and Azure.
ThorData provides rotating residential proxies sourced from real consumer internet connections. Each request routes through a different household IP, making your traffic indistinguishable from thousands of real DoorDash users in different locations.
Setting Up ThorData Rotation
import itertools
import threading
class ThorDataProxyPool:
"""
Manages a rotating pool of ThorData residential proxies.
ThorData supports sticky sessions (same IP per session) and
rotating sessions (new IP per request).
"""
def __init__(self, username: str, password: str,
host: str = "proxy.thordata.com", port: int = 9000,
rotate_per_request: bool = True):
self.username = username
self.password = password
self.host = host
self.port = port
self.rotate_per_request = rotate_per_request
self._lock = threading.Lock()
self._request_count = 0
def get_proxy_url(self, country: str = "US",
session_id: str = None) -> str:
"""
Generate a proxy URL.
For rotating proxies (new IP per request), omit session_id.
For sticky proxies (same IP per session), provide a session_id.
"""
with self._lock:
self._request_count += 1
if self.rotate_per_request:
# Each call gets a new IP from ThorData's pool
user = f"{self.username}-country-{country}"
else:
# Sticky session: same IP for all requests with this session_id
sid = session_id or f"session-{self._request_count // 10}"
user = f"{self.username}-country-{country}-session-{sid}"
return f"http://{user}:{self.password}@{self.host}:{self.port}"
def get_country_specific_proxy(self, country: str) -> str:
"""Get a proxy with a specific country's IP."""
return self.get_proxy_url(country=country)
def create_session(self, country: str = "US") -> cffi_requests.Session:
"""Create a curl-cffi session with a ThorData proxy."""
proxy_url = self.get_proxy_url(country=country)
session = cffi_requests.Session(impersonate="chrome120")
session.headers.update(HEADERS)
session.proxies = {"https": proxy_url, "http": proxy_url}
return session
# Initialize the proxy pool
# Get your credentials at https://thordata.partnerstack.com/partner/0a0x4nzh
proxy_pool = ThorDataProxyPool(
username="YOUR_THORDATA_USERNAME",
password="YOUR_THORDATA_PASSWORD",
rotate_per_request=True,
)
def scrape_with_rotating_proxies(store_ids: list[int]) -> list[dict]:
"""Scrape multiple stores, rotating proxies between each."""
results = []
for store_id in store_ids:
# New session = new proxy IP from ThorData's pool
session = proxy_pool.create_session(country="US")
warm_up_session(session)
try:
pacer = RequestPacer(min_delay=2, max_delay=5)
pacer.wait()
details = get_store_details(session, store_id)
menu = get_store_menu(session, store_id)
results.append({
"details": details,
"menu": menu,
"scraped_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
})
print(f"Scraped: {details.get('name', store_id)}")
except Exception as e:
print(f"Failed {store_id}: {e}")
results.append({"store_id": store_id, "error": str(e)})
return results
Geographic Targeting
When scraping DoorDash for specific cities, use ThorData to send requests from IPs in those cities. DoorDash personalizes delivery times and fees based on IP location, so city-matched proxies give you more accurate local data:
CITY_COUNTRY_MAP = {
"New York": "US",
"London": "GB",
"Toronto": "CA",
"Sydney": "AU",
"Berlin": "DE",
}
def scrape_city_with_local_ip(city: str, lat: float, lng: float) -> list[dict]:
country = CITY_COUNTRY_MAP.get(city, "US")
session = proxy_pool.create_session(country=country)
warm_up_session(session)
result = search_stores_by_location(session, lat, lng, limit=50)
return result.get("stores", [])
Playwright-Based Fallback {#playwright}
Sometimes DoorDash updates their bot detection and the API approach stops working temporarily. In that case, Playwright driving a real browser is the fallback:
from playwright.async_api import async_playwright
import asyncio
import json
async def scrape_with_playwright(store_id: int,
proxy_url: str = None) -> dict:
"""
Full browser scrape using Playwright.
Intercepts network requests to capture GraphQL responses.
"""
async with async_playwright() as p:
# Launch with optional proxy
launch_options = {
"headless": True, # set False for debugging
}
if proxy_url:
launch_options["proxy"] = {"server": proxy_url}
browser = await p.chromium.launch(**launch_options)
context = await browser.new_context(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36",
viewport={"width": 1280, "height": 800},
locale="en-US",
)
page = await context.new_page()
captured_data = {}
# Intercept GraphQL responses
async def handle_response(response):
if "graphql" in response.url and response.status == 200:
try:
body = await response.json()
if "data" in body:
op = "unknown"
# Try to identify the operation
req_post_data = response.request.post_data
if req_post_data:
req_json = json.loads(req_post_data)
op = req_json.get("operationName", "unknown")
captured_data[op] = body["data"]
except Exception:
pass
page.on("response", handle_response)
# Navigate to restaurant page
url = f"https://www.doordash.com/store/{store_id}/"
await page.goto(url, wait_until="networkidle", timeout=30000)
# Wait for menu to load
try:
await page.wait_for_selector("[data-testid='menu-category']",
timeout=10000)
except Exception:
pass # Menu selector might have changed
await browser.close()
return captured_data
async def run_playwright_scraper(store_ids: list[int]) -> list[dict]:
results = []
for store_id in store_ids:
data = await scrape_with_playwright(store_id)
results.append({"store_id": store_id, "data": data})
await asyncio.sleep(random.uniform(3, 7))
return results
# Run it
# results = asyncio.run(run_playwright_scraper([61092, 12345, 67890]))
Real-World Use Cases {#use-cases}
1. Competitive Price Intelligence
Track how delivery fees and minimum orders change across neighborhoods:
def track_price_changes(db_path: str, store_id: int, interval_hours: int = 24):
"""Monitor delivery fee changes over time."""
conn = sqlite3.connect(db_path)
c = conn.cursor()
c.execute("""
SELECT scraped_at, delivery_fee, estimated_delivery_time
FROM restaurants
WHERE id = ?
ORDER BY scraped_at DESC
LIMIT 30
""", (store_id,))
history = c.fetchall()
if len(history) > 1:
current_fee = history[0][1]
prev_fee = history[1][1]
if current_fee != prev_fee:
change = (current_fee - prev_fee) / 100
print(f"Store {store_id}: delivery fee changed by ${change:+.2f}")
conn.close()
2. Restaurant Opening Detection
Find restaurants that just appeared on DoorDash in your city:
def find_new_restaurants(db_path: str, city: str, days_ago: int = 7) -> list:
    """Find restaurants whose first-ever scrape is within the last N days.

    Because the restaurants table is append-only (one row per scrape),
    MIN(scraped_at) is the date a store first appeared in your data.
    """
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute("""
        SELECT id, name, cuisine_type, average_rating, delivery_fee
        FROM restaurants
        WHERE city = ?
        GROUP BY id
        HAVING date(MIN(scraped_at)) >= date('now', ?)
        ORDER BY MIN(scraped_at) DESC
    """, (city, f"-{days_ago} days"))
    results = c.fetchall()
    conn.close()
    return results
3. Cuisine Gap Analysis
Find underserved cuisine types in a neighborhood:
def find_cuisine_gaps(db_path: str, city: str) -> list:
    """Identify cuisines with few options relative to demand (based on ratings)."""
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute("""
        SELECT
            cuisine_type,
            COUNT(*) AS count,
            AVG(average_rating) AS avg_rating,
            AVG(num_ratings) AS avg_reviews,
            -- high avg_reviews relative to count = high demand, low supply
            AVG(num_ratings) / COUNT(*) AS demand_supply_ratio
        FROM restaurants
        WHERE city = ?
          AND cuisine_type IS NOT NULL
          AND average_rating >= 4.0
        GROUP BY cuisine_type
        HAVING count < 5  -- few options
        ORDER BY demand_supply_ratio DESC
    """, (city,))
    results = c.fetchall()
    conn.close()
    return results
4. Menu Item Price Benchmarking
Compare prices for the same item across multiple restaurants:
def benchmark_item(db_path: str, item_name: str, city: str) -> list:
    """Find the same menu item across restaurants and compare prices."""
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute("""
        SELECT
            r.name AS restaurant,
            mi.name AS item,
            mi.price / 100.0 AS price,
            r.average_rating,
            r.delivery_fee / 100.0 AS delivery_fee
        FROM menu_items mi
        JOIN restaurants r ON mi.restaurant_id = r.id
        WHERE r.city = ?
          AND LOWER(mi.name) LIKE LOWER(?)
          AND mi.is_available = 1
        ORDER BY mi.price
    """, (city, f"%{item_name}%"))
    results = c.fetchall()
    conn.close()
    if results:
        prices = [r[2] for r in results]
        print(f"\n'{item_name}' prices in {city}:")
        print(f"  Min: ${min(prices):.2f}")
        print(f"  Max: ${max(prices):.2f}")
        print(f"  Avg: ${sum(prices) / len(prices):.2f}")
        print(f"\n  Cheapest option: {results[0][0]} at ${results[0][2]:.2f}")
    return results
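One weakness of `LIKE` matching is spelling variants: "Pad Thai" and "Pad-Thai Noodles" won't group together. A rough fix is to cluster near-identical names with `difflib` from the standard library; the 0.8 threshold here is a guess to tune against your own data, not a value from the original code.

```python
from difflib import SequenceMatcher

def group_similar_items(names: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedily group item names whose similarity to a group's first
    member exceeds threshold (case-insensitive)."""
    groups: list[list[str]] = []
    for name in names:
        key = name.lower().strip()
        for group in groups:
            ref = group[0].lower().strip()
            if SequenceMatcher(None, key, ref).ratio() >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no close match: start a new group
    return groups
```

Running the grouping over the `item` column before benchmarking gives each cluster a single canonical name to compare prices under.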
Legal Considerations {#legal}
DoorDash's Terms of Service prohibit automated access. Scraping publicly visible data (restaurant names, menus, prices) for internal research and analysis is generally lower-risk, but selling or republishing the data commercially could create legal exposure under both contract and copyright theories.
Key principles to stay on the right side:
- Only scrape publicly visible data — nothing behind authentication, no user data, no payment information
- Don't overwhelm the server — keep request rates low and respect the spirit of robots.txt
- Don't republish raw data commercially — aggregated analysis is safer than raw data dumps
- Cache aggressively — scrape each page as infrequently as you need to, not as often as technically possible
- Identify yourself — some organizations set a custom User-Agent with contact info for good-faith scrapers
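The last point can be as simple as a header dict. All values below are placeholders; note the tension with the rest of this guide, since an honest bot User-Agent will likely be rejected by DoorDash's WAF, so this pattern mainly suits sites that tolerate good-faith scrapers.

```python
# Good-faith identification headers -- every value here is a placeholder.
IDENTIFYING_HEADERS = {
    "User-Agent": "MarketResearchBot/1.0 (+https://example.com/bot)",
    "From": "contact@example.com",  # RFC 9110 'From' header for bot operators
}
```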
In the US, the Ninth Circuit's hiQ v. LinkedIn rulings read the Computer Fraud and Abuse Act (CFAA) narrowly: accessing publicly available data is unlikely to count as unauthorized access. But hiQ ultimately settled, breach-of-contract claims were never resolved in scrapers' favor, and this area of law is still evolving.
Performance Optimization {#performance}
Async Concurrent Scraping
For maximum throughput while staying polite:
import asyncio
import random

from curl_cffi.requests import AsyncSession

async def async_scrape_store(session: AsyncSession, store_id: int,
                             semaphore: asyncio.Semaphore) -> dict:
    """Async version for concurrent scraping."""
    async with semaphore:
        await asyncio.sleep(random.uniform(1, 3))  # polite delay
        payload = {
            "operationName": "getStoreDetails",
            "variables": {"storeId": store_id},
            "query": "query getStoreDetails($storeId: Int!) { ... }",
        }
        resp = await session.post(GRAPHQL_URL, json=payload, timeout=20)
        return resp.json()

async def async_batch_scrape(store_ids: list[int],
                             max_concurrent: int = 5) -> list[dict]:
    """Scrape multiple stores concurrently."""
    semaphore = asyncio.Semaphore(max_concurrent)
    proxy_url = proxy_pool.get_proxy_url()
    async with AsyncSession(impersonate="chrome120",
                            proxies={"https": proxy_url}) as session:
        session.headers.update(HEADERS)
        tasks = [async_scrape_store(session, sid, semaphore)
                 for sid in store_ids]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]

# Scrape 100 stores with max 5 concurrent requests
# results = asyncio.run(async_batch_scrape(store_ids[:100], max_concurrent=5))
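One side effect of `return_exceptions=True` is that failures silently vanish from the final list. Pairing results back with their store IDs keeps a retry queue; this helper is an addition, not part of the original code.

```python
def split_results(store_ids: list[int], results: list) -> tuple[list, list[int]]:
    """Separate successful payloads from failed store IDs after asyncio.gather.

    results is position-aligned with store_ids, as gather preserves order.
    """
    ok, failed = [], []
    for sid, res in zip(store_ids, results):
        if isinstance(res, Exception):
            failed.append(sid)  # candidate for a retry pass
        else:
            ok.append((sid, res))
    return ok, failed
```

A second `async_batch_scrape(failed)` pass over the failed IDs, ideally through a fresh proxy, usually recovers most of them.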
Estimated Performance
With proper configuration:
| Setup | Stores/hour | Notes |
|---|---|---|
| Single thread, no proxy | 15-20 | Gets blocked quickly |
| Single thread + residential proxy | 60-80 | Reliable, slow |
| 5 concurrent + rotating proxies | 200-300 | Good balance |
| 10 concurrent + dedicated proxy pool | 400-500 | For large operations |
The limiting factor is almost always proxy cost and rate limit avoidance, not your hardware.
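Since rate limits are the bottleneck, it pays to back off exponentially on an HTTP 429 instead of burning through the proxy pool. A generic sketch, where `fetch` is a stand-in for your actual request call and `RateLimited` is a hypothetical exception your request layer would raise on a 429:

```python
import random
import time

class RateLimited(Exception):
    """Raised when the server returns HTTP 429 (assumed convention)."""

def fetch_with_backoff(fetch, max_retries: int = 5, base_delay: float = 2.0):
    """Call fetch(); after each RateLimited, sleep base_delay * 2**attempt
    plus random jitter so concurrent workers don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimited:
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RateLimited(f"still throttled after {max_retries} retries")
```

With `base_delay=2.0` the waits grow roughly 2s, 4s, 8s, 16s, 32s, which in practice outlasts most short-lived throttling windows.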
Summary
Scraping DoorDash requires handling several layers of defense: TLS fingerprinting, CloudFront WAF, and rate limiting. The core approach is:
- Use curl-cffi with Chrome impersonation to pass TLS fingerprinting
- Route through residential proxies (ThorData) for IP diversity
- Warm up each session by visiting the homepage first
- Keep request rates low with randomized delays
- Store results in SQLite for easy analysis
The GraphQL API returns clean JSON, so there's no HTML parsing complexity — the hard part is the bot detection layer, not the data extraction.
For a working proxy setup that handles DoorDash specifically, ThorData's residential network is the most reliable option — their rotating pool makes each request look like it comes from a different household connection.