Scrape Expedia & Kayak Flight Prices: Routes, Calendars & Price Tracking (2026)
Flight price data is some of the most valuable — and most heavily protected — information on the web. Airlines and OTAs like Expedia and Kayak spend significant resources on anti-scraping infrastructure because price intelligence is a core competitive weapon in the travel industry.
This guide shows how to extract flight prices, route availability, fare calendar data, and price predictions using Playwright with session management designed to survive their detection systems. We also cover SQLite storage for tracking price changes over time, which is where flight scraping becomes genuinely useful.
What Data Is Available
Expedia exposes:
- Flight prices in economy, premium economy, business, and first class
- Price calendars — cheapest fare for each day in a selectable month
- Airline name, IATA code, and flight number
- Departure and arrival times, airports, and terminals
- Layover count, connection airports, and connection duration
- Total journey duration
- Fare class details (Basic Economy, Main Cabin, etc.)
- Baggage policy per fare (checked, carry-on, personal item)
- Refundability and change fee information
Kayak additionally shows:
- Price predictions (trending up, down, or stable with confidence %)
- Price history mini-charts for specific routes
- Cross-airline comparisons on a single search
- "Explore" mode — cheapest destinations from an origin airport within a date range
- Hacker Fares (mix of airlines for outbound and return legs)
- Price alert thresholds you can monitor
Anti-Bot Defenses on Travel Sites
Travel sites run some of the most aggressive anti-bot systems of any industry.
Expedia uses PerimeterX (HUMAN Security):
- Advanced browser fingerprinting — canvas, WebGL renderer, audio context, font enumeration
- Behavioral biometrics — mouse movement patterns, scroll velocity, keystroke timing
- Cookie chain validation — expects a specific sequence of cookies set during normal page navigation before search results will load
- TLS fingerprint analysis — JA3/JA4 hash matching against known bot patterns
- Session consistency — geolocation, timezone, and locale must be internally consistent or sessions are flagged
Kayak uses a custom solution plus Cloudflare:
- CAPTCHA challenges after 10-15 searches from one session, sometimes immediately
- Price rendering obfuscation — prices are sometimes displayed as SVG images or canvas elements rather than text, specifically to defeat DOM text extraction
- IP-based session isolation — two sessions from the same IP in different browsers trigger anomaly detection
- Referrer chain validation — direct deep-link navigation without prior homepage browsing history raises flags
Both sites block nearly all known datacenter IP ranges outright. Residential proxies are required for any non-trivial scraping.
ThorData residential proxies work well for travel sites for two reasons: their residential IPs have genuine ASN assignments that pass reputation checks, and geo-targeting lets you choose the country a request appears to originate from, which matters because flight prices vary with the user's location.
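Geo-targeting is usually configured through the proxy username. The suffix format below is a hypothetical illustration (check ThorData's dashboard for the real syntax); the point is that country and session stickiness are encoded per credential, so one scraping session can be pinned to one exit country:

```python
def build_proxy(user: str, password: str, country: str = "us",
                session_id: str = None) -> dict:
    # NOTE: the "-country-XX" / "-session-YY" username suffixes are an
    # assumed format for illustration -- consult ThorData's docs for the
    # actual credential syntax.
    username = f"{user}-country-{country}"
    if session_id:
        username += f"-session-{session_id}"
    return {
        "server": "http://proxy.thordata.com:9000",  # endpoint from the config below
        "username": username,
        "password": password,
    }
```

The returned dict plugs directly into Playwright's `proxy=` launch argument.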
Setup
pip install playwright httpx beautifulsoup4
playwright install chromium
Expedia Flight Scraper with Session Warming
The session warming step — visiting the homepage before attempting search — is not optional. PerimeterX validates cookie sequences and flags direct deep-link access to search pages as bot behavior.
import asyncio
import json
import random
import re
from datetime import datetime
from playwright.async_api import async_playwright, TimeoutError as PWTimeout
THORDATA_PROXY = {
"server": "http://proxy.thordata.com:9000",
"username": "USER",
"password": "PASS",
}
async def build_expedia_context(playwright, proxy: dict = None):
"""
Create a Playwright browser context with stealth settings for Expedia.
Configured to appear as a genuine US-based Chrome user.
"""
browser = await playwright.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--no-sandbox",
"--disable-gpu",
],
proxy=proxy,
)
context = await browser.new_context(
viewport={"width": 1440, "height": 900},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
locale="en-US",
timezone_id="America/New_York",
geolocation={"latitude": 40.7128, "longitude": -74.0060},
permissions=["geolocation"],
extra_http_headers={
"Accept-Language": "en-US,en;q=0.9",
"Sec-Ch-Ua": '"Chromium";v="126", "Google Chrome";v="126"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"macOS"',
},
)
# Stealth patches applied to all pages in this context
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {
get: () => [{name: 'Chrome PDF Plugin'}, {name: 'Native Client'}]
});
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en']
});
window.chrome = { runtime: {}, csi: () => {}, loadTimes: () => {} };
const origQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (params) =>
params.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(params);
""")
return browser, context
async def warm_expedia_session(context) -> bool:
"""
Visit Expedia homepage to establish the required session cookie chain.
PerimeterX validates this chain before serving search results.
Returns True on success.
"""
page = await context.new_page()
try:
await page.goto(
"https://www.expedia.com",
wait_until="domcontentloaded",
timeout=30000,
)
await asyncio.sleep(random.uniform(2.5, 4.5))
await page.mouse.move(random.randint(300, 900), random.randint(200, 600))
await asyncio.sleep(random.uniform(1.0, 2.0))
# Check for immediate block
if await page.query_selector("[class*='captcha'], iframe[title*='challenge']"):
print("CAPTCHA on homepage — proxy IP may be flagged")
return False
return True
except PWTimeout:
return False
finally:
await page.close()
async def scrape_expedia_flights(
origin: str,
destination: str,
depart_date: str,
return_date: str = None,
seat_class: str = "economy",
adults: int = 1,
proxy: dict = None,
) -> list[dict]:
"""
Scrape flight listings from Expedia for a given route and date.
origin/destination: IATA airport codes (SFO, JFK, LHR, etc.)
depart_date: YYYY-MM-DD format
return_date: YYYY-MM-DD for round trips, None for one-way
seat_class: economy, premiumeconomy, business, first
"""
async with async_playwright() as p:
browser, context = await build_expedia_context(p, proxy)
try:
warmed = await warm_expedia_session(context)
if not warmed:
print("Session warm failed — results may be empty")
# Intercept pricing API responses
api_data = []
async def capture_price_apis(response):
if any(kw in response.url for kw in ["/api/flight", "offerList", "flightSearch"]):
try:
body = await response.json()
api_data.append(body)
except Exception:
pass
page = await context.new_page()
page.on("response", capture_price_apis)
# Build search URL
trip_type = "roundtrip" if return_date else "oneway"
url = (
f"https://www.expedia.com/Flights-search/"
f"?flight-type={trip_type}"
            f"&startDate={depart_date}"
f"&leg1=from%3A{origin}%2Cto%3A{destination}%2Cdeparture%3A{depart_date}TANYT"
)
if return_date:
url += (
f"&leg2=from%3A{destination}%2Cto%3A{origin}"
f"%2Cdeparture%3A{return_date}TANYT"
)
url += f"&passengers=adults%3A{adults}&options=cabinclass%3A{seat_class}"
await page.goto(url, wait_until="networkidle", timeout=45000)
# CAPTCHA check
if await page.query_selector(
"[class*='captcha'], iframe[src*='captcha'], [class*='challenge']"
):
print("CAPTCHA on search page — rotate proxy and retry")
return []
# Wait for results
try:
await page.wait_for_selector(
'[data-testid="offer-listing"], [class*="FlightCard"], '
'[class*="flight-result"]',
timeout=25000,
)
except PWTimeout:
print("Results container not found")
return []
# Scroll to trigger lazy loading of additional results
for pos in [800, 1600, 2400, 3200, 4000]:
await page.evaluate(f"window.scrollTo(0, {pos})")
await asyncio.sleep(random.uniform(0.6, 1.4))
# Extract flight data from DOM
results = await page.evaluate("""
() => {
const flights = [];
const selectors = [
'[data-testid="offer-listing"]',
'li[class*="FlightCard"]',
'[class*="flight-result-item"]',
'[class*="uitk-card"][class*="flight"]',
];
let cards = [];
for (const sel of selectors) {
const found = document.querySelectorAll(sel);
if (found.length > 0) { cards = found; break; }
}
cards.forEach(card => {
const price = (
card.querySelector('[data-testid="price"]') ||
card.querySelector('[class*="price-text"]') ||
card.querySelector('[class*="totalPrice"]')
)?.textContent?.trim();
const airline = (
card.querySelector('[data-testid="airline-name"]') ||
card.querySelector('[class*="airline-info"]')
)?.textContent?.trim();
const timeEls = card.querySelectorAll(
'[data-testid*="time"], [class*="flight-time"]'
);
const duration = (
card.querySelector('[data-testid="duration"]') ||
card.querySelector('[class*="duration"]')
)?.textContent?.trim();
const stops = card.querySelector(
'[data-testid="stops"], [class*="stop"]'
)?.textContent?.trim();
if (price || airline) {
flights.push({
price,
airline,
departure: timeEls[0]?.textContent?.trim(),
arrival: timeEls[1]?.textContent?.trim(),
duration,
stops,
});
}
});
return flights;
}
""")
await page.close()
if api_data:
print(f"Also captured {len(api_data)} API responses")
return [
{"origin": origin, "destination": destination,
"depart_date": depart_date, "return_date": return_date, **f}
for f in results if f.get("price") or f.get("airline")
]
finally:
await browser.close()
Kayak Fare Calendar Scraping
async def scrape_kayak_calendar(
origin: str,
destination: str,
year: int,
month: int,
proxy: dict = None,
) -> list[dict]:
"""
Scrape the Kayak price calendar for cheapest fares per day on a route.
Returns list of dicts with 'date' and 'price' keys.
"""
async with async_playwright() as p:
browser, context = await build_expedia_context(p, proxy)
try:
page = await context.new_page()
url = (
f"https://www.kayak.com/flights/"
f"{origin}-{destination}/{year}-{month:02d}"
)
await page.goto(url, wait_until="networkidle", timeout=40000)
await asyncio.sleep(random.uniform(4, 7))
# Dismiss any overlay
try:
dismiss = await page.query_selector(
"[class*='close'], [aria-label*='close'], button[class*='dismiss']"
)
if dismiss:
await dismiss.click()
await asyncio.sleep(1.0)
except Exception:
pass
calendar_data = await page.evaluate("""
() => {
const days = [];
const selectors = [
'[class*="Calendar"] [class*="Day"]',
'[class*="calendar-day"]',
'[data-date]',
];
let cells = [];
for (const sel of selectors) {
cells = document.querySelectorAll(sel);
if (cells.length > 0) break;
}
cells.forEach(cell => {
const date = cell.getAttribute('data-date')
|| cell.getAttribute('aria-label');
const priceEl = cell.querySelector(
'[class*="price"], [class*="fare"], [class*="amount"]'
);
if (date && priceEl?.textContent?.trim()) {
days.push({ date, price: priceEl.textContent.trim() });
}
});
return days;
}
""")
await page.close()
return calendar_data
finally:
await browser.close()
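The `date` field captured above may arrive either as an ISO `data-date` attribute or as free text from `aria-label`. The exact label wording varies, so the normalizer below is a best-effort sketch; the free-text shapes it handles are assumptions, not guaranteed Kayak output:

```python
import re

def normalize_calendar_date(raw: str, year: int, month: int):
    """Best-effort conversion of a calendar cell's date string to YYYY-MM-DD."""
    if not raw:
        return None
    # Case 1: already an ISO date from a data-date attribute
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", raw):
        return raw
    # Case 2: free text such as "March 14" (assumed aria-label shape) --
    # take the first standalone 1-2 digit number as the day of month
    m = re.search(r"\b(\d{1,2})\b", raw)
    if m and 1 <= int(m.group(1)) <= 31:
        return f"{year}-{month:02d}-{int(m.group(1)):02d}"
    return None
```

Pair this with the year and month you passed to the calendar scraper so every stored row has a sortable ISO date.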
Price and Duration Parsing
Raw strings from DOM extraction need normalization:
import re
from typing import Optional
def parse_price(price_str: str) -> Optional[float]:
"""
Normalize price strings to floats.
Handles: '$1,234', '1234', '£899', '1.234,00' (EU), 'USD 1,099'.
"""
if not price_str:
return None
cleaned = re.sub(r'[^\d.,]', '', price_str.strip())
if not cleaned:
return None
# EU format (1.234,00) vs US format (1,234.00)
if ',' in cleaned and '.' in cleaned:
if cleaned.rindex(',') > cleaned.rindex('.'):
cleaned = cleaned.replace('.', '').replace(',', '.')
else:
cleaned = cleaned.replace(',', '')
elif ',' in cleaned:
        if len(cleaned.split(',')[-1]) == 2:
            # Two digits after the comma: decimal separator (e.g. '899,50')
            cleaned = cleaned.replace(',', '.')
else:
cleaned = cleaned.replace(',', '')
try:
return float(cleaned)
except ValueError:
return None
def parse_duration_minutes(duration_str: str) -> Optional[int]:
"""
Convert '5h 30m', '2h 15m', '45m', '1h' to total minutes.
"""
if not duration_str:
return None
hours = 0
minutes = 0
h = re.search(r'(\d+)\s*h', duration_str)
m = re.search(r'(\d+)\s*m', duration_str)
if h:
hours = int(h.group(1))
if m:
minutes = int(m.group(1))
return hours * 60 + minutes if (hours or minutes) else None
def normalize_flights(raw: list[dict]) -> list[dict]:
"""Clean and normalize raw flight results from DOM extraction."""
normalized = []
for f in raw:
price = parse_price(f.get("price", ""))
if not price:
continue
normalized.append({
"origin": f.get("origin"),
"destination": f.get("destination"),
"depart_date": f.get("depart_date"),
"return_date": f.get("return_date"),
"airline": (f.get("airline") or "").strip(),
"price_usd": price,
"departure_time": (f.get("departure") or "").strip(),
"arrival_time": (f.get("arrival") or "").strip(),
"duration_str": (f.get("duration") or "").strip(),
"duration_minutes": parse_duration_minutes(f.get("duration")),
"stops": (f.get("stops") or "").strip(),
})
return sorted(normalized, key=lambda x: x["price_usd"])
SQLite Storage for Price Tracking
import sqlite3
from datetime import datetime, timezone
def init_flight_db(db_path: str = "flights.db") -> sqlite3.Connection:
"""Initialize the flight price tracking database."""
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS flight_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT NOT NULL,
destination TEXT NOT NULL,
depart_date TEXT NOT NULL,
return_date TEXT,
airline TEXT,
price_usd REAL NOT NULL,
departure_time TEXT,
arrival_time TEXT,
duration_min INTEGER,
stops TEXT,
source TEXT DEFAULT 'expedia',
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS fare_calendar (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT NOT NULL,
destination TEXT NOT NULL,
price_date TEXT NOT NULL,
price_usd REAL,
source TEXT DEFAULT 'kayak',
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE (origin, destination, price_date, source)
);
CREATE INDEX IF NOT EXISTS idx_snapshots_route
ON flight_snapshots(origin, destination, depart_date);
CREATE INDEX IF NOT EXISTS idx_snapshots_time
ON flight_snapshots(scraped_at);
CREATE INDEX IF NOT EXISTS idx_calendar_route
ON fare_calendar(origin, destination, price_date);
""")
conn.commit()
return conn
def store_flights(
conn: sqlite3.Connection,
flights: list[dict],
source: str = "expedia",
) -> int:
    """Bulk insert normalized flight results; returns the number of rows stored."""
    rows = [
        (
            f["origin"], f["destination"], f["depart_date"], f.get("return_date"),
            f.get("airline"), f["price_usd"], f.get("departure_time"),
            f.get("arrival_time"), f.get("duration_minutes"), f.get("stops"),
            source, datetime.now(timezone.utc).isoformat(),
        )
        for f in flights
        if f.get("price_usd")
    ]
    conn.executemany("""
        INSERT INTO flight_snapshots
            (origin, destination, depart_date, return_date, airline,
             price_usd, departure_time, arrival_time, duration_min,
             stops, source, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, rows)
    conn.commit()
    return len(rows)
def store_calendar_data(
conn: sqlite3.Connection,
origin: str,
destination: str,
calendar: list[dict],
source: str = "kayak",
) -> int:
"""Store fare calendar entries."""
rows = []
for entry in calendar:
price = parse_price(entry.get("price", ""))
if price:
rows.append((
origin, destination, entry["date"], price, source,
datetime.now(timezone.utc).isoformat(),
))
conn.executemany("""
INSERT OR REPLACE INTO fare_calendar
(origin, destination, price_date, price_usd, source, scraped_at)
VALUES (?, ?, ?, ?, ?, ?)
""", rows)
conn.commit()
return len(rows)
def get_price_history(
conn: sqlite3.Connection,
origin: str,
destination: str,
depart_date: str,
days_back: int = 30,
) -> list:
"""Show daily cheapest price trend for a route and departure date."""
return conn.execute("""
SELECT
date(scraped_at) as check_date,
MIN(price_usd) as cheapest,
ROUND(AVG(price_usd), 2) as avg_price,
COUNT(*) as flight_count
FROM flight_snapshots
WHERE origin = ?
AND destination = ?
AND depart_date = ?
AND scraped_at >= datetime('now', ? || ' days')
GROUP BY check_date
ORDER BY check_date
""", (origin, destination, depart_date, f"-{days_back}")).fetchall()
def find_price_drops(
conn: sqlite3.Connection,
threshold_pct: float = 15.0,
) -> list:
"""Find routes with current prices significantly below 7-day average."""
return conn.execute("""
WITH weekly_avg AS (
SELECT origin, destination, depart_date, AVG(price_usd) as avg_7d
FROM flight_snapshots
WHERE scraped_at >= datetime('now', '-7 days')
GROUP BY origin, destination, depart_date
),
today_min AS (
SELECT origin, destination, depart_date, MIN(price_usd) as current_min
FROM flight_snapshots
WHERE scraped_at >= datetime('now', '-1 days')
GROUP BY origin, destination, depart_date
)
SELECT
t.origin, t.destination, t.depart_date,
t.current_min, w.avg_7d,
ROUND((w.avg_7d - t.current_min) / w.avg_7d * 100, 1) as drop_pct
FROM today_min t
JOIN weekly_avg w ON t.origin = w.origin
AND t.destination = w.destination
AND t.depart_date = w.depart_date
WHERE w.avg_7d > 0
AND (w.avg_7d - t.current_min) / w.avg_7d * 100 >= ?
ORDER BY drop_pct DESC
""", (threshold_pct,)).fetchall()
Session Management Strategy
Key techniques for keeping sessions alive:
Homepage-first navigation. Visit expedia.com before searching. This establishes the cookie sequence PerimeterX validates. Direct deep-link access almost always triggers the bot detection layer.
Geo-consistent configuration. Match proxy IP country, browser locale, timezone, and geolocation. A German IP with en-US locale and New York timezone is a contradiction that triggers detection. Everything must align with a single geography.
Natural scroll simulation. Expedia lazy-loads results. Without programmatic scrolling, only 5-8 flights are visible. The scraper scrolls in increments with randomized delays.
SVG-rendered price fallback. Kayak sometimes renders prices as SVG to defeat text extraction. If textContent returns empty for price elements:
1. Check for <svg> inside price containers and extract path data
2. Take a screenshot of the price element and run OCR
3. Intercept the background API response where prices appear in plain JSON — the most reliable fallback
Request spacing. Between searches on the same site: 15-30 seconds minimum to avoid session flagging. Between full sessions: restart the browser and rotate the proxy IP.
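The spacing rule above can be centralized in a small pacing helper so every caller gets the same jittered gap between searches. This is a sketch; the 15-30 second window mirrors the guidance above and is configurable:

```python
import asyncio
import random

async def paced_search(search_factories, min_gap: float = 15.0, max_gap: float = 30.0):
    """Run search coroutine factories sequentially with a randomized gap
    between each, so request timing never looks machine-regular."""
    results = []
    for i, make_coro in enumerate(search_factories):
        results.append(await make_coro())
        if i < len(search_factories) - 1:  # no pointless sleep after the last search
            await asyncio.sleep(random.uniform(min_gap, max_gap))
    return results
```

Callers pass factories rather than coroutines (e.g. `lambda: scrape_expedia_flights(...)`) so each search is created only when its turn comes.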
Error Handling and Retry Logic
import asyncio
import random
async def scrape_with_retry(
scrape_fn,
args: tuple,
max_attempts: int = 3,
base_delay: float = 30.0,
) -> list[dict]:
"""Retry flight scraping with escalating delays on failure or empty results."""
for attempt in range(max_attempts):
try:
results = await scrape_fn(*args)
if results:
return results
# Empty results may mean CAPTCHA block — wait longer before retry
if attempt < max_attempts - 1:
delay = base_delay * (2 ** attempt) + random.uniform(10, 30)
print(f"Empty results (attempt {attempt+1}). Waiting {delay:.0f}s...")
await asyncio.sleep(delay)
except Exception as e:
error_msg = str(e).lower()
if "captcha" in error_msg or "403" in error_msg:
delay = base_delay * 3 + random.uniform(30, 60)
elif "timeout" in error_msg:
delay = base_delay + random.uniform(5, 15)
else:
delay = base_delay * (attempt + 1)
print(f"Error (attempt {attempt+1}): {e}. Waiting {delay:.0f}s...")
if attempt < max_attempts - 1:
await asyncio.sleep(delay)
return []
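The delay schedule inside scrape_with_retry can be factored into a standalone function, which makes the escalation policy easier to tune and test on its own (same thresholds as above):

```python
import random

def backoff_delay(error_msg: str, attempt: int, base_delay: float = 30.0) -> float:
    """Seconds to wait before the next attempt, by failure class --
    mirrors the delay logic in scrape_with_retry."""
    msg = (error_msg or "").lower()
    if "captcha" in msg or "403" in msg:
        return base_delay * 3 + random.uniform(30, 60)   # hard block: back way off
    if "timeout" in msg:
        return base_delay + random.uniform(5, 15)        # transient: short pause
    return base_delay * (attempt + 1)                    # unknown: linear escalation
```

Keeping the policy in one place also means the monitoring pipeline and any ad-hoc scripts escalate identically after a CAPTCHA.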
Complete Price Monitoring Pipeline
import asyncio
import random
from datetime import datetime, timedelta
async def run_price_check(db_path: str = "flights.db"):
"""
Run one price check cycle for all tracked routes.
Designed to be called via cron every few hours.
"""
conn = init_flight_db(db_path)
proxy = THORDATA_PROXY
routes = [
("SFO", "JFK", "roundtrip"),
("LAX", "LHR", "oneway"),
("ORD", "MIA", "roundtrip"),
]
# Find next 3 Saturdays
today = datetime.today()
saturdays = []
d = today
while len(saturdays) < 3:
d += timedelta(days=1)
if d.weekday() == 5:
saturdays.append(d.strftime("%Y-%m-%d"))
for origin, destination, trip_type in routes:
for depart_date in saturdays:
return_date = None
if trip_type == "roundtrip":
return_date = (
datetime.strptime(depart_date, "%Y-%m-%d") + timedelta(days=7)
).strftime("%Y-%m-%d")
print(f"\n{origin} -> {destination} on {depart_date}")
raw = await scrape_with_retry(
scrape_expedia_flights,
(origin, destination, depart_date, return_date, "economy", 1, proxy),
max_attempts=3,
base_delay=30.0,
)
flights = normalize_flights(raw)
if flights:
stored = store_flights(conn, flights)
cheapest = flights[0]["price_usd"]
print(f" Stored {stored} flights. Cheapest: ${cheapest:.0f}")
else:
print(" No results obtained")
await asyncio.sleep(random.uniform(20, 40))
# Report price drops
drops = find_price_drops(conn, threshold_pct=10.0)
if drops:
print("\nPrice drops vs 7-day average:")
for row in drops:
print(f" {row[0]}->{row[1]} on {row[2]}: "
f"${row[3]:.0f} (was ${row[4]:.0f} avg, -{row[5]}%)")
conn.close()
# asyncio.run(run_price_check())
# Or schedule via cron: 0 */4 * * * python3 /path/to/check_flights.py
Handling Price Alerts
Track a specific route and notify when price drops below a threshold:
import sqlite3
def check_and_trigger_alerts(conn: sqlite3.Connection) -> list[dict]:
"""
Check all active price alerts against the latest snapshots.
Returns list of triggered alerts.
"""
conn.execute("""
CREATE TABLE IF NOT EXISTS price_alerts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT NOT NULL,
destination TEXT NOT NULL,
depart_date TEXT,
threshold_usd REAL NOT NULL,
triggered_at TIMESTAMP,
triggered_price REAL,
active INTEGER DEFAULT 1,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.commit()
triggered = []
alerts = conn.execute(
"SELECT id, origin, destination, depart_date, threshold_usd "
"FROM price_alerts WHERE active = 1 AND triggered_at IS NULL"
).fetchall()
for alert_id, origin, dest, depart_date, threshold in alerts:
# Find current cheapest price
query_params = [origin, dest]
where_clause = "origin = ? AND destination = ?"
if depart_date:
where_clause += " AND depart_date = ?"
query_params.append(depart_date)
result = conn.execute(
f"SELECT MIN(price_usd) FROM flight_snapshots "
f"WHERE {where_clause} AND scraped_at >= datetime('now', '-1 days')",
query_params,
).fetchone()
current_price = result[0] if result and result[0] else None
if current_price and current_price <= threshold:
conn.execute(
"UPDATE price_alerts SET triggered_at = CURRENT_TIMESTAMP, "
"triggered_price = ? WHERE id = ?",
(current_price, alert_id),
)
conn.commit()
triggered.append({
"origin": origin,
"destination": dest,
"depart_date": depart_date,
"threshold": threshold,
"current_price": current_price,
})
return triggered
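The per-alert loop above can also be collapsed into a single join that finds every triggered alert in one query. A self-contained sketch against an in-memory database, with the schema abbreviated from the tables above (the recency filter on scraped_at is omitted for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE price_alerts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        origin TEXT, destination TEXT, depart_date TEXT,
        threshold_usd REAL, triggered_at TIMESTAMP, triggered_price REAL,
        active INTEGER DEFAULT 1
    );
    CREATE TABLE flight_snapshots (
        origin TEXT, destination TEXT, depart_date TEXT, price_usd REAL
    );
""")
conn.execute(
    "INSERT INTO price_alerts (origin, destination, depart_date, threshold_usd) "
    "VALUES ('SFO', 'JFK', '2026-03-14', 250)"
)
conn.executemany(
    "INSERT INTO flight_snapshots (origin, destination, depart_date, price_usd) "
    "VALUES (?, ?, ?, ?)",
    [("SFO", "JFK", "2026-03-14", 310.0), ("SFO", "JFK", "2026-03-14", 239.0)],
)
# One query: cheapest snapshot per alert, kept only where it beats the threshold
row = conn.execute("""
    SELECT a.id, MIN(s.price_usd)
    FROM price_alerts a
    JOIN flight_snapshots s
      ON s.origin = a.origin
     AND s.destination = a.destination
     AND s.depart_date = a.depart_date
    WHERE a.active = 1 AND a.triggered_at IS NULL
    GROUP BY a.id
    HAVING MIN(s.price_usd) <= a.threshold_usd
""").fetchone()
print(row)  # (1, 239.0)
```

The set-based form scales better once you track dozens of alerts, since it avoids one round-trip query per alert row.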
Legal Considerations
Flight prices are factual data — a specific fare is not copyrightable. However, Expedia's and Kayak's Terms of Service prohibit automated access to their platforms. Practical risk tiers:
- Personal price monitoring for your own travel plans: low risk, minimal volume
- Consumer price alert tool for others: review ToS carefully before building commercially
- Competing OTA or commercial data resale: high legal risk, evaluate data licensing options
Keep request volumes at a level consistent with human browsing — a few dozen searches per day across monitored routes, not thousands per hour. The key distinction is whether your scraping impairs the service versus simply reading publicly visible prices for your own use.
Building a Flight Price Alert System
The real value of flight price data is tracking changes over time. Here is a complete price alert architecture:
import sqlite3
import json
from datetime import datetime, date, timedelta
def init_flight_db(db_path: str = "flight_tracker.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS routes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT,
destination TEXT,
active INTEGER DEFAULT 1,
UNIQUE(origin, destination)
);
CREATE TABLE IF NOT EXISTS price_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT,
destination TEXT,
depart_date TEXT,
return_date TEXT,
airline TEXT,
price_usd REAL,
stops TEXT,
duration TEXT,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS price_alerts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT,
destination TEXT,
depart_date TEXT,
target_price REAL,
active INTEGER DEFAULT 1,
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_snapshots_route
ON price_snapshots(origin, destination, depart_date, scraped_at DESC);
CREATE INDEX IF NOT EXISTS idx_snapshots_airline
ON price_snapshots(airline, price_usd);
""")
conn.commit()
return conn
def save_flights(
conn: sqlite3.Connection,
flights: list,
origin: str,
destination: str,
depart_date: str,
return_date: str = None,
):
"""Save a batch of flight results to the database."""
now = datetime.now().isoformat()
for f in flights:
price_str = (f.get("price") or "0").replace("$", "").replace(",", "")
try:
price = float(price_str.split()[0]) if price_str else None
except (ValueError, IndexError):
price = None
if price and price > 50: # Filter obviously bad data
conn.execute(
"""INSERT INTO price_snapshots
(origin, destination, depart_date, return_date, airline,
price_usd, stops, duration, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?)""",
(origin, destination, depart_date, return_date,
f.get("airline"), price, f.get("stops"),
f.get("duration"), now),
)
conn.commit()
def detect_price_drops(
conn: sqlite3.Connection,
min_drop_pct: float = 10.0,
lookback_days: int = 7,
) -> list:
"""Find routes where prices have dropped significantly recently."""
cutoff = (datetime.now() - timedelta(days=lookback_days)).isoformat()
drops = conn.execute("""
WITH recent AS (
SELECT origin, destination, depart_date, airline,
MIN(price_usd) as current_min,
COUNT(*) as snapshot_count
FROM price_snapshots
WHERE scraped_at > ?
GROUP BY origin, destination, depart_date, airline
),
historical AS (
SELECT origin, destination, depart_date, airline,
AVG(price_usd) as avg_price
FROM price_snapshots
WHERE scraped_at <= ?
GROUP BY origin, destination, depart_date, airline
)
SELECT r.origin, r.destination, r.depart_date, r.airline,
r.current_min, h.avg_price,
ROUND((h.avg_price - r.current_min) / h.avg_price * 100, 1) as drop_pct
FROM recent r
JOIN historical h ON r.origin = h.origin
AND r.destination = h.destination
AND r.depart_date = h.depart_date
AND r.airline = h.airline
WHERE h.avg_price > 0
AND (h.avg_price - r.current_min) / h.avg_price * 100 >= ?
ORDER BY drop_pct DESC
LIMIT 20
""", (cutoff, cutoff, min_drop_pct)).fetchall()
return [
{
"route": f"{row[0]} → {row[1]}",
"date": row[2],
"airline": row[3],
"current_price": row[4],
"baseline_price": round(row[5], 0),
"drop_pct": row[6],
}
for row in drops
]
Comparing Expedia vs. Kayak Prices
Sometimes the same flight is priced differently across OTAs. The scaffold below gathers Expedia prices and structures them for comparison; a Kayak flight scraper can be slotted in alongside it:
import asyncio
async def compare_ota_prices(
origin: str,
destination: str,
depart_date: str,
return_date: str = None,
) -> dict:
    """
    Price comparison scaffold for a route. Currently fetches Expedia only;
    a Kayak flight scraper can be added as a second concurrent task.
    """
    # Launch the Expedia scraper as a task so a second OTA scraper can run
    # alongside it later
    expedia_task = asyncio.create_task(
        scrape_expedia_flights(origin, destination, depart_date, return_date)
    )
    # If a second scraper is added, stagger launches so both sites are not
    # hit at the same instant from the same machine
    await asyncio.sleep(3)
    expedia_results = await expedia_task
comparison = {
"origin": origin,
"destination": destination,
"depart_date": depart_date,
"return_date": return_date,
"expedia": {
"count": len(expedia_results),
"min_price": None,
"avg_price": None,
"best_deal": None,
},
}
# Parse Expedia prices
expedia_prices = []
for f in expedia_results:
price_str = (f.get("price") or "").replace("$", "").replace(",", "")
try:
price = float(price_str.split()[0])
expedia_prices.append((price, f))
except (ValueError, IndexError):
pass
if expedia_prices:
expedia_prices.sort(key=lambda x: x[0])
comparison["expedia"]["min_price"] = expedia_prices[0][0]
comparison["expedia"]["avg_price"] = round(
sum(p for p, _ in expedia_prices) / len(expedia_prices), 0
)
best = expedia_prices[0][1]
comparison["expedia"]["best_deal"] = {
"price": expedia_prices[0][0],
"airline": best.get("airline"),
"duration": best.get("duration"),
"stops": best.get("stops"),
}
return comparison
# Find best price across multiple travel dates
async def find_cheapest_travel_window(
origin: str,
destination: str,
earliest_departure: date,
latest_departure: date,
trip_length_days: int = 7,
) -> list:
"""
Find the cheapest travel window in a date range.
Returns dates sorted by price.
"""
results = []
current = earliest_departure
while current <= latest_departure:
depart_str = current.isoformat()
return_str = (current + timedelta(days=trip_length_days)).isoformat()
flights = await scrape_expedia_flights(origin, destination, depart_str, return_str)
prices = []
for f in flights:
price_str = (f.get("price") or "").replace("$", "").replace(",", "")
try:
prices.append(float(price_str.split()[0]))
except (ValueError, IndexError):
pass
if prices:
results.append({
"depart_date": depart_str,
"return_date": return_str,
"min_price": min(prices),
"avg_price": round(sum(prices) / len(prices), 0),
"num_options": len(prices),
})
current += timedelta(days=1)
await asyncio.sleep(15) # Rate limit between date checks
return sorted(results, key=lambda x: x["min_price"])
Handling Session State for Multi-Day Scraping
For sustained price monitoring that runs daily, session management is critical:
from pathlib import Path
import json as json_lib
SESSION_FILE = "expedia_session.json"
async def save_session_state(context):
"""Save browser cookies to disk for reuse."""
cookies = await context.cookies()
with open(SESSION_FILE, "w") as f:
json_lib.dump(cookies, f)
async def load_session_state(context):
"""Load previously saved cookies into browser context."""
if Path(SESSION_FILE).exists():
with open(SESSION_FILE) as f:
cookies = json_lib.load(f)
await context.add_cookies(cookies)
return True
return False
async def scrape_with_session_persistence(
origin: str,
destination: str,
depart_date: str,
return_date: str = None,
) -> list:
"""
Scrape with session persistence to maintain cookie history.
Expired sessions are refreshed automatically.
"""
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
args=["--disable-blink-features=AutomationControlled"],
)
context = await browser.new_context(
            proxy=THORDATA_PROXY,
viewport={"width": 1440, "height": 900},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
)
# Load existing session
session_loaded = await load_session_state(context)
if session_loaded:
print(" Using cached session cookies")
# Inject stealth scripts
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
window.chrome = { runtime: {} };
""")
page = await context.new_page()
# Visit homepage if no session (builds initial cookies)
if not session_loaded:
await page.goto("https://www.expedia.com", wait_until="domcontentloaded")
await asyncio.sleep(random.uniform(3, 6))
# ... rest of scraping logic ...
# Save session for next run
await save_session_state(context)
await browser.close()
return []
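An alternative to hand-rolling cookie files: Playwright's built-in storage_state API (`await context.storage_state(path=...)` to save, `browser.new_context(storage_state=path)` to restore) captures cookies and localStorage together, which is usually more robust than saving cookies alone. A small helper for choosing the context kwargs; the viewport value simply echoes the configuration used earlier in this guide:

```python
from pathlib import Path

STATE_FILE = "expedia_state.json"

def context_kwargs(state_file: str = STATE_FILE) -> dict:
    """Build kwargs for browser.new_context(), restoring a saved
    storage state file when one exists from a previous run."""
    kwargs = {"viewport": {"width": 1440, "height": 900}}
    if Path(state_file).exists():
        kwargs["storage_state"] = state_file  # restores cookies + localStorage
    return kwargs

# After scraping, persist the session for the next run:
#     await context.storage_state(path=STATE_FILE)
```

On the first run no state file exists, so the context starts clean and the homepage warm-up builds the cookie chain; subsequent runs resume with it intact.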
Complete Monitoring Pipeline
async def run_daily_flight_monitor(
routes: list,
db_path: str = "flight_tracker.db",
):
"""
Daily flight price monitoring pipeline.
Checks prices for upcoming travel windows and stores results.
Args:
routes: List of (origin, destination) tuples
db_path: SQLite database path
"""
conn = init_flight_db(db_path)
    # Check prices at several horizons, 30 to 90 days out
check_dates = [
(date.today() + timedelta(days=d)).isoformat()
for d in [30, 45, 60, 75, 90]
]
for origin, destination in routes:
for depart_date in check_dates:
return_date = (
date.fromisoformat(depart_date) + timedelta(days=7)
).isoformat()
print(f" {origin}→{destination} on {depart_date}")
flights = await scrape_expedia_flights(
origin, destination, depart_date, return_date
)
if flights:
save_flights(conn, flights, origin, destination, depart_date, return_date)
prices = [
float(f.get("price", "$0").replace("$", "").replace(",", "").split()[0])
for f in flights
if f.get("price")
]
valid_prices = [p for p in prices if p > 50]
if valid_prices:
print(f" Min: ${min(valid_prices):.0f}, Avg: ${sum(valid_prices)/len(valid_prices):.0f}")
await asyncio.sleep(random.uniform(20, 40))
# Check for price drops
drops = detect_price_drops(conn, min_drop_pct=15.0)
if drops:
print(f"\n{len(drops)} price drops detected:")
for d in drops:
print(f" {d['route']} {d['date']}: ${d['current_price']} (was ~${d['baseline_price']}, -{d['drop_pct']}%)")
conn.close()
ROUTES = [
("SFO", "JFK"),
("LAX", "LHR"),
("ORD", "CDG"),
("BOS", "BCN"),
]
if __name__ == "__main__":
    asyncio.run(run_daily_flight_monitor(ROUTES))
Legal Considerations
Flight prices are factual data — prices themselves cannot be copyrighted. But the terms of service on Expedia and Kayak prohibit automated access. Use the data for personal research, price alerts, or analysis. Do not build a competing OTA that resells their data.
Keep request volumes reasonable — a few hundred searches per day, not thousands per hour. Travel sites' anti-bot systems are tuned to detect sustained automation, and getting permanently banned from a site you depend on for research defeats the purpose.
ThorData's residential proxies with US geo-targeting give you USD prices and residential IP reputation. Rotate IPs per session (not per request) to maintain session cookie state while varying your IP fingerprint across sessions.