Scrape Expedia & Kayak Flight Prices: Routes, Calendars & Price Tracking (2026)
Flight price data is some of the most valuable — and most heavily protected — information on the web. Airlines and OTAs like Expedia and Kayak spend significant resources on anti-scraping infrastructure because price intelligence is a core competitive weapon in the travel industry.
This guide shows how to extract flight prices, route availability, fare calendar data, and price predictions using Playwright with session management designed to survive their detection systems. We also cover SQLite storage for tracking price changes over time, which is where flight scraping becomes genuinely useful.
What Data Is Available
Expedia exposes:
- Flight prices in economy, premium economy, business, and first class
- Price calendars — cheapest fare for each day in a selectable month
- Airline name, IATA code, and flight number
- Departure and arrival times, airports, and terminals
- Layover count, connection airports, and connection duration
- Total journey duration
- Fare class details (Basic Economy, Main Cabin, etc.)
- Baggage policy per fare (checked, carry-on, personal item)
- Refundability and change fee information
Kayak additionally shows:
- Price predictions (trending up, down, or stable with confidence %)
- Price history mini-charts for specific routes
- Cross-airline comparisons on a single search
- "Explore" mode — cheapest destinations from an origin airport within a date range
- Hacker Fares (mix of airlines for outbound and return legs)
- Price alert thresholds you can monitor
Anti-Bot Defenses on Travel Sites
Travel sites run some of the most aggressive anti-bot systems of any industry.
Expedia uses PerimeterX (HUMAN Security):
- Advanced browser fingerprinting — canvas, WebGL renderer, audio context, font enumeration
- Behavioral biometrics — mouse movement patterns, scroll velocity, keystroke timing
- Cookie chain validation — expects a specific sequence of cookies set during normal page navigation before search results will load
- TLS fingerprint analysis — JA3/JA4 hash matching against known bot patterns
- Session consistency — geolocation, timezone, and locale must be internally consistent or sessions are flagged
Kayak uses a custom solution plus Cloudflare:
- CAPTCHA challenges after 10-15 searches from one session, sometimes immediately
- Price rendering obfuscation — prices are sometimes displayed as SVG images or canvas elements rather than text, specifically to defeat DOM text extraction
- IP-based session isolation — two sessions from the same IP in different browsers trigger anomaly detection
- Referrer chain validation — direct deep-link navigation without prior homepage browsing history raises flags
Both sites block nearly all known datacenter IP ranges outright. Residential proxies are required for any non-trivial scraping.
ThorData residential proxies work well for travel sites for two reasons: their residential IPs have genuine ASN assignments that pass reputation checks, and geo-targeting lets you choose the country a request appears to originate from, which matters because flight prices vary with the user's location.
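Geo-targeting is usually configured through the proxy username. The suffix format below is a hypothetical illustration (check ThorData's dashboard for the real syntax); the point is that country and session stickiness are encoded per credential, so one scraping session can be pinned to one exit country:

```python
def build_proxy(user: str, password: str, country: str = "us",
                session_id: str = None) -> dict:
    # NOTE: the "-country-XX" / "-session-YY" username suffixes are an
    # assumed format for illustration -- consult ThorData's docs for the
    # actual credential syntax.
    username = f"{user}-country-{country}"
    if session_id:
        username += f"-session-{session_id}"
    return {
        "server": "http://proxy.thordata.com:9000",  # endpoint from the config below
        "username": username,
        "password": password,
    }
```

The returned dict plugs directly into Playwright's `proxy=` launch argument.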
Setup
pip install playwright httpx beautifulsoup4
playwright install chromium
Expedia Flight Scraper with Session Warming
The session warming step — visiting the homepage before attempting search — is not optional. PerimeterX validates cookie sequences and flags direct deep-link access to search pages as bot behavior.
import asyncio
import json
import random
import re
from datetime import datetime
from playwright.async_api import async_playwright, TimeoutError as PWTimeout
THORDATA_PROXY = {
"server": "http://proxy.thordata.com:9000",
"username": "USER",
"password": "PASS",
}
async def build_expedia_context(playwright, proxy: dict = None):
"""
Create a Playwright browser context with stealth settings for Expedia.
Configured to appear as a genuine US-based Chrome user.
"""
browser = await playwright.chromium.launch(
headless=True,
args=[
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--no-sandbox",
"--disable-gpu",
],
proxy=proxy,
)
context = await browser.new_context(
viewport={"width": 1440, "height": 900},
user_agent=(
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
locale="en-US",
timezone_id="America/New_York",
geolocation={"latitude": 40.7128, "longitude": -74.0060},
permissions=["geolocation"],
extra_http_headers={
"Accept-Language": "en-US,en;q=0.9",
"Sec-Ch-Ua": '"Chromium";v="126", "Google Chrome";v="126"',
"Sec-Ch-Ua-Mobile": "?0",
"Sec-Ch-Ua-Platform": '"macOS"',
},
)
# Stealth patches applied to all pages in this context
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {
get: () => [{name: 'Chrome PDF Plugin'}, {name: 'Native Client'}]
});
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en']
});
window.chrome = { runtime: {}, csi: () => {}, loadTimes: () => {} };
const origQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (params) =>
params.name === 'notifications'
? Promise.resolve({ state: Notification.permission })
: origQuery(params);
""")
return browser, context
async def warm_expedia_session(context) -> bool:
"""
Visit Expedia homepage to establish the required session cookie chain.
PerimeterX validates this chain before serving search results.
Returns True on success.
"""
page = await context.new_page()
try:
await page.goto(
"https://www.expedia.com",
wait_until="domcontentloaded",
timeout=30000,
)
await asyncio.sleep(random.uniform(2.5, 4.5))
await page.mouse.move(random.randint(300, 900), random.randint(200, 600))
await asyncio.sleep(random.uniform(1.0, 2.0))
# Check for immediate block
if await page.query_selector("[class*='captcha'], iframe[title*='challenge']"):
print("CAPTCHA on homepage — proxy IP may be flagged")
return False
return True
except PWTimeout:
return False
finally:
await page.close()
async def scrape_expedia_flights(
origin: str,
destination: str,
depart_date: str,
return_date: str = None,
seat_class: str = "economy",
adults: int = 1,
proxy: dict = None,
) -> list[dict]:
"""
Scrape flight listings from Expedia for a given route and date.
origin/destination: IATA airport codes (SFO, JFK, LHR, etc.)
depart_date: YYYY-MM-DD format
return_date: YYYY-MM-DD for round trips, None for one-way
seat_class: economy, premiumeconomy, business, first
"""
async with async_playwright() as p:
browser, context = await build_expedia_context(p, proxy)
try:
warmed = await warm_expedia_session(context)
if not warmed:
print("Session warm failed — results may be empty")
# Intercept pricing API responses
api_data = []
async def capture_price_apis(response):
if any(kw in response.url for kw in ["/api/flight", "offerList", "flightSearch"]):
try:
body = await response.json()
api_data.append(body)
except Exception:
pass
page = await context.new_page()
page.on("response", capture_price_apis)
# Build search URL
trip_type = "roundtrip" if return_date else "oneway"
url = (
f"https://www.expedia.com/Flights-search/"
f"?flight-type={trip_type}"
            f"&startDate={depart_date}"
f"&leg1=from%3A{origin}%2Cto%3A{destination}%2Cdeparture%3A{depart_date}TANYT"
)
if return_date:
url += (
f"&leg2=from%3A{destination}%2Cto%3A{origin}"
f"%2Cdeparture%3A{return_date}TANYT"
)
url += f"&passengers=adults%3A{adults}&options=cabinclass%3A{seat_class}"
await page.goto(url, wait_until="networkidle", timeout=45000)
# CAPTCHA check
if await page.query_selector(
"[class*='captcha'], iframe[src*='captcha'], [class*='challenge']"
):
print("CAPTCHA on search page — rotate proxy and retry")
return []
# Wait for results
try:
await page.wait_for_selector(
'[data-testid="offer-listing"], [class*="FlightCard"], '
'[class*="flight-result"]',
timeout=25000,
)
except PWTimeout:
print("Results container not found")
return []
# Scroll to trigger lazy loading of additional results
for pos in [800, 1600, 2400, 3200, 4000]:
await page.evaluate(f"window.scrollTo(0, {pos})")
await asyncio.sleep(random.uniform(0.6, 1.4))
# Extract flight data from DOM
results = await page.evaluate("""
() => {
const flights = [];
const selectors = [
'[data-testid="offer-listing"]',
'li[class*="FlightCard"]',
'[class*="flight-result-item"]',
'[class*="uitk-card"][class*="flight"]',
];
let cards = [];
for (const sel of selectors) {
const found = document.querySelectorAll(sel);
if (found.length > 0) { cards = found; break; }
}
cards.forEach(card => {
const price = (
card.querySelector('[data-testid="price"]') ||
card.querySelector('[class*="price-text"]') ||
card.querySelector('[class*="totalPrice"]')
)?.textContent?.trim();
const airline = (
card.querySelector('[data-testid="airline-name"]') ||
card.querySelector('[class*="airline-info"]')
)?.textContent?.trim();
const timeEls = card.querySelectorAll(
'[data-testid*="time"], [class*="flight-time"]'
);
const duration = (
card.querySelector('[data-testid="duration"]') ||
card.querySelector('[class*="duration"]')
)?.textContent?.trim();
const stops = card.querySelector(
'[data-testid="stops"], [class*="stop"]'
)?.textContent?.trim();
if (price || airline) {
flights.push({
price,
airline,
departure: timeEls[0]?.textContent?.trim(),
arrival: timeEls[1]?.textContent?.trim(),
duration,
stops,
});
}
});
return flights;
}
""")
await page.close()
if api_data:
print(f"Also captured {len(api_data)} API responses")
return [
{"origin": origin, "destination": destination,
"depart_date": depart_date, "return_date": return_date, **f}
for f in results if f.get("price") or f.get("airline")
]
finally:
await browser.close()
Kayak Fare Calendar Scraping
async def scrape_kayak_calendar(
origin: str,
destination: str,
year: int,
month: int,
proxy: dict = None,
) -> list[dict]:
"""
Scrape the Kayak price calendar for cheapest fares per day on a route.
Returns list of dicts with 'date' and 'price' keys.
"""
async with async_playwright() as p:
browser, context = await build_expedia_context(p, proxy)
try:
page = await context.new_page()
url = (
f"https://www.kayak.com/flights/"
f"{origin}-{destination}/{year}-{month:02d}"
)
await page.goto(url, wait_until="networkidle", timeout=40000)
await asyncio.sleep(random.uniform(4, 7))
# Dismiss any overlay
try:
dismiss = await page.query_selector(
"[class*='close'], [aria-label*='close'], button[class*='dismiss']"
)
if dismiss:
await dismiss.click()
await asyncio.sleep(1.0)
except Exception:
pass
calendar_data = await page.evaluate("""
() => {
const days = [];
const selectors = [
'[class*="Calendar"] [class*="Day"]',
'[class*="calendar-day"]',
'[data-date]',
];
let cells = [];
for (const sel of selectors) {
cells = document.querySelectorAll(sel);
if (cells.length > 0) break;
}
cells.forEach(cell => {
const date = cell.getAttribute('data-date')
|| cell.getAttribute('aria-label');
const priceEl = cell.querySelector(
'[class*="price"], [class*="fare"], [class*="amount"]'
);
if (date && priceEl?.textContent?.trim()) {
days.push({ date, price: priceEl.textContent.trim() });
}
});
return days;
}
""")
await page.close()
return calendar_data
finally:
await browser.close()
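The `date` field captured above may arrive either as an ISO `data-date` attribute or as free text from `aria-label`. The exact label wording varies, so the normalizer below is a best-effort sketch; the free-text shapes it handles are assumptions, not guaranteed Kayak output:

```python
import re

def normalize_calendar_date(raw: str, year: int, month: int):
    """Best-effort conversion of a calendar cell's date string to YYYY-MM-DD."""
    if not raw:
        return None
    # Case 1: already an ISO date from a data-date attribute
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", raw):
        return raw
    # Case 2: free text such as "March 14" (assumed aria-label shape) --
    # take the first standalone 1-2 digit number as the day of month
    m = re.search(r"\b(\d{1,2})\b", raw)
    if m and 1 <= int(m.group(1)) <= 31:
        return f"{year}-{month:02d}-{int(m.group(1)):02d}"
    return None
```

Pair this with the year and month you passed to the calendar scraper so every stored row has a sortable ISO date.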
Price and Duration Parsing
Raw strings from DOM extraction need normalization:
import re
from typing import Optional
def parse_price(price_str: str) -> Optional[float]:
"""
Normalize price strings to floats.
Handles: '$1,234', '1234', '£899', '1.234,00' (EU), 'USD 1,099'.
"""
if not price_str:
return None
cleaned = re.sub(r'[^\d.,]', '', price_str.strip())
if not cleaned:
return None
# EU format (1.234,00) vs US format (1,234.00)
if ',' in cleaned and '.' in cleaned:
if cleaned.rindex(',') > cleaned.rindex('.'):
cleaned = cleaned.replace('.', '').replace(',', '.')
else:
cleaned = cleaned.replace(',', '')
elif ',' in cleaned:
        if len(cleaned.split(',')[-1]) == 2:
            # Two digits after the comma: decimal separator (e.g. '899,50')
            cleaned = cleaned.replace(',', '.')
else:
cleaned = cleaned.replace(',', '')
try:
return float(cleaned)
except ValueError:
return None
def parse_duration_minutes(duration_str: str) -> Optional[int]:
"""
Convert '5h 30m', '2h 15m', '45m', '1h' to total minutes.
"""
if not duration_str:
return None
hours = 0
minutes = 0
h = re.search(r'(\d+)\s*h', duration_str)
m = re.search(r'(\d+)\s*m', duration_str)
if h:
hours = int(h.group(1))
if m:
minutes = int(m.group(1))
return hours * 60 + minutes if (hours or minutes) else None
def normalize_flights(raw: list[dict]) -> list[dict]:
"""Clean and normalize raw flight results from DOM extraction."""
normalized = []
for f in raw:
price = parse_price(f.get("price", ""))
if not price:
continue
normalized.append({
"origin": f.get("origin"),
"destination": f.get("destination"),
"depart_date": f.get("depart_date"),
"return_date": f.get("return_date"),
"airline": (f.get("airline") or "").strip(),
"price_usd": price,
"departure_time": (f.get("departure") or "").strip(),
"arrival_time": (f.get("arrival") or "").strip(),
"duration_str": (f.get("duration") or "").strip(),
"duration_minutes": parse_duration_minutes(f.get("duration")),
"stops": (f.get("stops") or "").strip(),
})
return sorted(normalized, key=lambda x: x["price_usd"])
SQLite Storage for Price Tracking
import sqlite3
from datetime import datetime, timezone
def init_flight_db(db_path: str = "flights.db") -> sqlite3.Connection:
"""Initialize the flight price tracking database."""
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS flight_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT NOT NULL,
destination TEXT NOT NULL,
depart_date TEXT NOT NULL,
return_date TEXT,
airline TEXT,
price_usd REAL NOT NULL,
departure_time TEXT,
arrival_time TEXT,
duration_min INTEGER,
stops TEXT,
source TEXT DEFAULT 'expedia',
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS fare_calendar (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT NOT NULL,
destination TEXT NOT NULL,
price_date TEXT NOT NULL,
price_usd REAL,
source TEXT DEFAULT 'kayak',
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE (origin, destination, price_date, source)
);
CREATE INDEX IF NOT EXISTS idx_snapshots_route
ON flight_snapshots(origin, destination, depart_date);
CREATE INDEX IF NOT EXISTS idx_snapshots_time
ON flight_snapshots(scraped_at);
CREATE INDEX IF NOT EXISTS idx_calendar_route
ON fare_calendar(origin, destination, price_date);
""")
conn.commit()
return conn
def store_flights(
conn: sqlite3.Connection,
flights: list[dict],
source: str = "expedia",
) -> int:
    """Bulk insert normalized flight results; returns the number of rows stored."""
    rows = [
        (
            f["origin"], f["destination"], f["depart_date"], f.get("return_date"),
            f.get("airline"), f["price_usd"], f.get("departure_time"),
            f.get("arrival_time"), f.get("duration_minutes"), f.get("stops"),
            source, datetime.now(timezone.utc).isoformat(),
        )
        for f in flights
        if f.get("price_usd")
    ]
    conn.executemany("""
        INSERT INTO flight_snapshots
            (origin, destination, depart_date, return_date, airline,
             price_usd, departure_time, arrival_time, duration_min,
             stops, source, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, rows)
    conn.commit()
    return len(rows)
def store_calendar_data(
conn: sqlite3.Connection,
origin: str,
destination: str,
calendar: list[dict],
source: str = "kayak",
) -> int:
"""Store fare calendar entries."""
rows = []
for entry in calendar:
price = parse_price(entry.get("price", ""))
if price:
rows.append((
origin, destination, entry["date"], price, source,
datetime.now(timezone.utc).isoformat(),
))
conn.executemany("""
INSERT OR REPLACE INTO fare_calendar
(origin, destination, price_date, price_usd, source, scraped_at)
VALUES (?, ?, ?, ?, ?, ?)
""", rows)
conn.commit()
return len(rows)
def get_price_history(
conn: sqlite3.Connection,
origin: str,
destination: str,
depart_date: str,
days_back: int = 30,
) -> list:
"""Show daily cheapest price trend for a route and departure date."""
return conn.execute("""
SELECT
date(scraped_at) as check_date,
MIN(price_usd) as cheapest,
ROUND(AVG(price_usd), 2) as avg_price,
COUNT(*) as flight_count
FROM flight_snapshots
WHERE origin = ?
AND destination = ?
AND depart_date = ?
AND scraped_at >= datetime('now', ? || ' days')
GROUP BY check_date
ORDER BY check_date
""", (origin, destination, depart_date, f"-{days_back}")).fetchall()
def find_price_drops(
conn: sqlite3.Connection,
threshold_pct: float = 15.0,
) -> list:
"""Find routes with current prices significantly below 7-day average."""
return conn.execute("""
WITH weekly_avg AS (
SELECT origin, destination, depart_date, AVG(price_usd) as avg_7d
FROM flight_snapshots
WHERE scraped_at >= datetime('now', '-7 days')
GROUP BY origin, destination, depart_date
),
today_min AS (
SELECT origin, destination, depart_date, MIN(price_usd) as current_min
FROM flight_snapshots
WHERE scraped_at >= datetime('now', '-1 days')
GROUP BY origin, destination, depart_date
)
SELECT
t.origin, t.destination, t.depart_date,
t.current_min, w.avg_7d,
ROUND((w.avg_7d - t.current_min) / w.avg_7d * 100, 1) as drop_pct
FROM today_min t
JOIN weekly_avg w ON t.origin = w.origin
AND t.destination = w.destination
AND t.depart_date = w.depart_date
WHERE w.avg_7d > 0
AND (w.avg_7d - t.current_min) / w.avg_7d * 100 >= ?
ORDER BY drop_pct DESC
""", (threshold_pct,)).fetchall()
Session Management Strategy
Key techniques for keeping sessions alive:
Homepage-first navigation. Visit expedia.com before searching. This establishes the cookie sequence PerimeterX validates. Direct deep-link access almost always triggers the bot detection layer.
Geo-consistent configuration. Match proxy IP country, browser locale, timezone, and geolocation. A German IP with en-US locale and New York timezone is a contradiction that triggers detection. Everything must align with a single geography.
Natural scroll simulation. Expedia lazy-loads results. Without programmatic scrolling, only 5-8 flights are visible. The scraper scrolls in increments with randomized delays.
SVG-rendered price fallback. Kayak sometimes renders prices as SVG to defeat text extraction. If textContent returns empty for price elements:
1. Check for <svg> inside price containers and extract path data
2. Take a screenshot of the price element and run OCR
3. Intercept the background API response where prices appear in plain JSON — the most reliable fallback
Request spacing. Between searches on the same site: 15-30 seconds minimum to avoid session flagging. Between full sessions: restart the browser and rotate the proxy IP.
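The spacing rule above can be centralized in a small pacing helper so every caller gets the same jittered gap between searches. This is a sketch; the 15-30 second window mirrors the guidance above and is configurable:

```python
import asyncio
import random

async def paced_search(search_factories, min_gap: float = 15.0, max_gap: float = 30.0):
    """Run search coroutine factories sequentially with a randomized gap
    between each, so request timing never looks machine-regular."""
    results = []
    for i, make_coro in enumerate(search_factories):
        results.append(await make_coro())
        if i < len(search_factories) - 1:  # no pointless sleep after the last search
            await asyncio.sleep(random.uniform(min_gap, max_gap))
    return results
```

Callers pass factories rather than coroutines (e.g. `lambda: scrape_expedia_flights(...)`) so each search is created only when its turn comes.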
Error Handling and Retry Logic
import asyncio
import random
async def scrape_with_retry(
scrape_fn,
args: tuple,
max_attempts: int = 3,
base_delay: float = 30.0,
) -> list[dict]:
"""Retry flight scraping with escalating delays on failure or empty results."""
for attempt in range(max_attempts):
try:
results = await scrape_fn(*args)
if results:
return results
# Empty results may mean CAPTCHA block — wait longer before retry
if attempt < max_attempts - 1:
delay = base_delay * (2 ** attempt) + random.uniform(10, 30)
print(f"Empty results (attempt {attempt+1}). Waiting {delay:.0f}s...")
await asyncio.sleep(delay)
except Exception as e:
error_msg = str(e).lower()
if "captcha" in error_msg or "403" in error_msg:
delay = base_delay * 3 + random.uniform(30, 60)
elif "timeout" in error_msg:
delay = base_delay + random.uniform(5, 15)
else:
delay = base_delay * (attempt + 1)
print(f"Error (attempt {attempt+1}): {e}. Waiting {delay:.0f}s...")
if attempt < max_attempts - 1:
await asyncio.sleep(delay)
return []
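The delay schedule inside scrape_with_retry can be factored into a standalone function, which makes the escalation policy easier to tune and test on its own (same thresholds as above):

```python
import random

def backoff_delay(error_msg: str, attempt: int, base_delay: float = 30.0) -> float:
    """Seconds to wait before the next attempt, by failure class --
    mirrors the delay logic in scrape_with_retry."""
    msg = (error_msg or "").lower()
    if "captcha" in msg or "403" in msg:
        return base_delay * 3 + random.uniform(30, 60)   # hard block: back way off
    if "timeout" in msg:
        return base_delay + random.uniform(5, 15)        # transient: short pause
    return base_delay * (attempt + 1)                    # unknown: linear escalation
```

Keeping the policy in one place also means the monitoring pipeline and any ad-hoc scripts escalate identically after a CAPTCHA.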
Complete Price Monitoring Pipeline
import asyncio
import random
from datetime import datetime, timedelta
async def run_price_check(db_path: str = "flights.db"):
"""
Run one price check cycle for all tracked routes.
Designed to be called via cron every few hours.
"""
conn = init_flight_db(db_path)
proxy = THORDATA_PROXY
routes = [
("SFO", "JFK", "roundtrip"),
("LAX", "LHR", "oneway"),
("ORD", "MIA", "roundtrip"),
]
# Find next 3 Saturdays
today = datetime.today()
saturdays = []
d = today
while len(saturdays) < 3:
d += timedelta(days=1)
if d.weekday() == 5:
saturdays.append(d.strftime("%Y-%m-%d"))
for origin, destination, trip_type in routes:
for depart_date in saturdays:
return_date = None
if trip_type == "roundtrip":
return_date = (
datetime.strptime(depart_date, "%Y-%m-%d") + timedelta(days=7)
).strftime("%Y-%m-%d")
print(f"\n{origin} -> {destination} on {depart_date}")
raw = await scrape_with_retry(
scrape_expedia_flights,
(origin, destination, depart_date, return_date, "economy", 1, proxy),
max_attempts=3,
base_delay=30.0,
)
flights = normalize_flights(raw)
if flights:
stored = store_flights(conn, flights)
cheapest = flights[0]["price_usd"]
print(f" Stored {stored} flights. Cheapest: ${cheapest:.0f}")
else:
print(" No results obtained")
await asyncio.sleep(random.uniform(20, 40))
# Report price drops
drops = find_price_drops(conn, threshold_pct=10.0)
if drops:
print("\nPrice drops vs 7-day average:")
for row in drops:
print(f" {row[0]}->{row[1]} on {row[2]}: "
f"${row[3]:.0f} (was ${row[4]:.0f} avg, -{row[5]}%)")
conn.close()
# asyncio.run(run_price_check())
# Or schedule via cron: 0 */4 * * * python3 /path/to/check_flights.py
Handling Price Alerts
Track a specific route and notify when price drops below a threshold:
import sqlite3
def check_and_trigger_alerts(conn: sqlite3.Connection) -> list[dict]:
"""
Check all active price alerts against the latest snapshots.
Returns list of triggered alerts.
"""
conn.execute("""
CREATE TABLE IF NOT EXISTS price_alerts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT NOT NULL,
destination TEXT NOT NULL,
depart_date TEXT,
threshold_usd REAL NOT NULL,
triggered_at TIMESTAMP,
triggered_price REAL,
active INTEGER DEFAULT 1,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
conn.commit()
triggered = []
alerts = conn.execute(
"SELECT id, origin, destination, depart_date, threshold_usd "
"FROM price_alerts WHERE active = 1 AND triggered_at IS NULL"
).fetchall()
for alert_id, origin, dest, depart_date, threshold in alerts:
# Find current cheapest price
query_params = [origin, dest]
where_clause = "origin = ? AND destination = ?"
if depart_date:
where_clause += " AND depart_date = ?"
query_params.append(depart_date)
result = conn.execute(
f"SELECT MIN(price_usd) FROM flight_snapshots "
f"WHERE {where_clause} AND scraped_at >= datetime('now', '-1 days')",
query_params,
).fetchone()
current_price = result[0] if result and result[0] else None
if current_price and current_price <= threshold:
conn.execute(
"UPDATE price_alerts SET triggered_at = CURRENT_TIMESTAMP, "
"triggered_price = ? WHERE id = ?",
(current_price, alert_id),
)
conn.commit()
triggered.append({
"origin": origin,
"destination": dest,
"depart_date": depart_date,
"threshold": threshold,
"current_price": current_price,
})
return triggered
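The per-alert loop above can also be collapsed into a single join that finds every triggered alert in one query. A self-contained sketch against an in-memory database, with the schema abbreviated from the tables above (the recency filter on scraped_at is omitted for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE price_alerts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        origin TEXT, destination TEXT, depart_date TEXT,
        threshold_usd REAL, triggered_at TIMESTAMP, triggered_price REAL,
        active INTEGER DEFAULT 1
    );
    CREATE TABLE flight_snapshots (
        origin TEXT, destination TEXT, depart_date TEXT, price_usd REAL
    );
""")
conn.execute(
    "INSERT INTO price_alerts (origin, destination, depart_date, threshold_usd) "
    "VALUES ('SFO', 'JFK', '2026-03-14', 250)"
)
conn.executemany(
    "INSERT INTO flight_snapshots (origin, destination, depart_date, price_usd) "
    "VALUES (?, ?, ?, ?)",
    [("SFO", "JFK", "2026-03-14", 310.0), ("SFO", "JFK", "2026-03-14", 239.0)],
)
# One query: cheapest snapshot per alert, kept only where it beats the threshold
row = conn.execute("""
    SELECT a.id, MIN(s.price_usd)
    FROM price_alerts a
    JOIN flight_snapshots s
      ON s.origin = a.origin
     AND s.destination = a.destination
     AND s.depart_date = a.depart_date
    WHERE a.active = 1 AND a.triggered_at IS NULL
    GROUP BY a.id
    HAVING MIN(s.price_usd) <= a.threshold_usd
""").fetchone()
print(row)  # (1, 239.0)
```

The set-based form scales better once you track dozens of alerts, since it avoids one round-trip query per alert row.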
Legal Considerations
Flight prices are factual data — a specific fare is not copyrightable. However, Expedia's and Kayak's Terms of Service prohibit automated access to their platforms. Practical risk tiers:
- Personal price monitoring for your own travel plans: low risk, minimal volume
- Consumer price alert tool for others: review ToS carefully before building commercially
- Competing OTA or commercial data resale: high legal risk, evaluate data licensing options
Keep request volumes at a level consistent with human browsing — a few dozen searches per day across monitored routes, not thousands per hour. The key distinction is whether your scraping impairs the service versus simply reading publicly visible prices for your own use.
Building a Flight Price Alert System
The real value of flight price data is tracking changes over time. Here is a complete price alert architecture:
import sqlite3
import json
from datetime import datetime, date, timedelta
def init_flight_db(db_path: str = "flight_tracker.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS routes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT,
destination TEXT,
active INTEGER DEFAULT 1,
UNIQUE(origin, destination)
);
CREATE TABLE IF NOT EXISTS price_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT,
destination TEXT,
depart_date TEXT,
return_date TEXT,
airline TEXT,
price_usd REAL,
stops TEXT,
duration TEXT,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS price_alerts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
origin TEXT,
destination TEXT,
depart_date TEXT,
target_price REAL,
active INTEGER DEFAULT 1,
created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_snapshots_route
ON price_snapshots(origin, destination, depart_date, scraped_at DESC);
CREATE INDEX IF NOT EXISTS idx_snapshots_airline
ON price_snapshots(airline, price_usd);
""")
conn.commit()
return conn
def save_flights(
conn: sqlite3.Connection,
flights: list,
origin: str,
destination: str,
depart_date: str,
return_date: str = None,
):
"""Save a batch of flight results to the database."""
now = datetime.now().isoformat()
for f in flights:
price_str = (f.get("price") or "0").replace("$", "").replace(",", "")
try:
price = float(price_str.split()[0]) if price_str else None
except (ValueError, IndexError):
price = None
if price and price > 50: # Filter obviously bad data
conn.execute(
"""INSERT INTO price_snapshots
(origin, destination, depart_date, return_date, airline,
price_usd, stops, duration, scraped_at)
VALUES (?,?,?,?,?,?,?,?,?)""",
(origin, destination, depart_date, return_date,
f.get("airline"), price, f.get("stops"),
f.get("duration"), now),
)
conn.commit()
def detect_price_drops(
conn: sqlite3.Connection,
min_drop_pct: float = 10.0,
lookback_days: int = 7,
) -> list:
"""Find routes where prices have dropped significantly recently."""
cutoff = (datetime.now() - timedelta(days=lookback_days)).isoformat()
drops = conn.execute("""
WITH recent AS (
SELECT origin, destination, depart_date, airline,
MIN(price_usd) as current_min,
COUNT(*) as snapshot_count
FROM price_snapshots
WHERE scraped_at > ?
GROUP BY origin, destination, depart_date, airline
),
historical AS (
SELECT origin, destination, depart_date, airline,
AVG(price_usd) as avg_price
FROM price_snapshots
WHERE scraped_at <= ?
GROUP BY origin, destination, depart_date, airline
)
SELECT r.origin, r.destination, r.depart_date, r.airline,
r.current_min, h.avg_price,
ROUND((h.avg_price - r.current_min) / h.avg_price * 100, 1) as drop_pct
FROM recent r
JOIN historical h ON r.origin = h.origin
AND r.destination = h.destination
AND r.depart_date = h.depart_date
AND r.airline = h.airline
WHERE h.avg_price > 0
AND (h.avg_price - r.current_min) / h.avg_price * 100 >= ?
ORDER BY drop_pct DESC
LIMIT 20
""", (cutoff, cutoff, min_drop_pct)).fetchall()
return [
{
"route": f"{row[0]} → {row[1]}",
"date": row[2],
"airline": row[3],
"current_price": row[4],
"baseline_price": round(row[5], 0),
"drop_pct": row[6],
}
for row in drops
]
Comparing Expedia vs. Kayak Prices
Sometimes the same flight is priced differently across OTAs. The scaffold below gathers Expedia prices and structures them for comparison; a Kayak flight scraper can be slotted in alongside it:
import asyncio
async def compare_ota_prices(
origin: str,
destination: str,
depart_date: str,
return_date: str = None,
) -> dict:
    """
    Price comparison scaffold for a route. Currently fetches Expedia only;
    a Kayak flight scraper can be added as a second concurrent task.
    """
    # Launch the Expedia scraper as a task so a second OTA scraper can run
    # alongside it later
    expedia_task = asyncio.create_task(
        scrape_expedia_flights(origin, destination, depart_date, return_date)
    )
    # If a second scraper is added, stagger launches so both sites are not
    # hit at the same instant from the same machine
    await asyncio.sleep(3)
    expedia_results = await expedia_task
comparison = {
"origin": origin,
"destination": destination,
"depart_date": depart_date,
"return_date": return_date,
"expedia": {
"count": len(expedia_results),
"min_price": None,
"avg_price": None,
"best_deal": None,
},
}
# Parse Expedia prices
expedia_prices = []
for f in expedia_results:
price_str = (f.get("price") or "").replace("$", "").replace(",", "")
try:
price = float(price_str.split()[0])
expedia_prices.append((price, f))
except (ValueError, IndexError):
pass
if expedia_prices:
expedia_prices.sort(key=lambda x: x[0])
comparison["expedia"]["min_price"] = expedia_prices[0][0]
comparison["expedia"]["avg_price"] = round(
sum(p for p, _ in expedia_prices) / len(expedia_prices), 0
)
best = expedia_prices[0][1]
comparison["expedia"]["best_deal"] = {
"price": expedia_prices[0][0],
"airline": best.get("airline"),
"duration": best.get("duration"),
"stops": best.get("stops"),
}
return comparison
# Find best price across multiple travel dates
async def find_cheapest_travel_window(
origin: str,
destination: str,
earliest_departure: date,
latest_departure: date,
trip_length_days: int = 7,
) -> list:
"""
Find the cheapest travel window in a date range.
Returns dates sorted by price.
"""
results = []
current = earliest_departure
while current <= latest_departure:
depart_str = current.isoformat()
return_str = (current + timedelta(days=trip_length_days)).isoformat()
flights = await scrape_expedia_flights(origin, destination, depart_str, return_str)
prices = []
for f in flights:
price_str = (f.get("price") or "").replace("$", "").replace(",", "")
try:
prices.append(float(price_str.split()[0]))
except (ValueError, IndexError):
pass
if prices:
results.append({
"depart_date": depart_str,
"return_date": return_str,
"min_price": min(prices),
"avg_price": round(sum(prices) / len(prices), 0),
"num_options": len(prices),
})
current += timedelta(days=1)
await asyncio.sleep(15) # Rate limit between date checks
return sorted(results, key=lambda x: x["min_price"])
Handling Session State for Multi-Day Scraping
For sustained price monitoring that runs daily, session management is critical:
from pathlib import Path
import json as json_lib
SESSION_FILE = "expedia_session.json"
async def save_session_state(context):
"""Save browser cookies to disk for reuse."""
cookies = await context.cookies()
with open(SESSION_FILE, "w") as f:
json_lib.dump(cookies, f)
async def load_session_state(context):
"""Load previously saved cookies into browser context."""
if Path(SESSION_FILE).exists():
with open(SESSION_FILE) as f:
cookies = json_lib.load(f)
await context.add_cookies(cookies)
return True
return False
async def scrape_with_session_persistence(
origin: str,
destination: str,
depart_date: str,
return_date: str = None,
) -> list:
"""
Scrape with session persistence to maintain cookie history.
Expired sessions are refreshed automatically.
"""
async with async_playwright() as p:
browser = await p.chromium.launch(
headless=True,
args=["--disable-blink-features=AutomationControlled"],
)
context = await browser.new_context(
            proxy=THORDATA_PROXY,
viewport={"width": 1440, "height": 900},
user_agent=(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/126.0.0.0 Safari/537.36"
),
)
# Load existing session
session_loaded = await load_session_state(context)
if session_loaded:
print(" Using cached session cookies")
# Inject stealth scripts
await context.add_init_script("""
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
window.chrome = { runtime: {} };
""")
page = await context.new_page()
# Visit homepage if no session (builds initial cookies)
if not session_loaded:
await page.goto("https://www.expedia.com", wait_until="domcontentloaded")
await asyncio.sleep(random.uniform(3, 6))
# ... rest of scraping logic ...
# Save session for next run
await save_session_state(context)
await browser.close()
return []
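An alternative to hand-rolling cookie files: Playwright's built-in storage_state API (`await context.storage_state(path=...)` to save, `browser.new_context(storage_state=path)` to restore) captures cookies and localStorage together, which is usually more robust than saving cookies alone. A small helper for choosing the context kwargs; the viewport value simply echoes the configuration used earlier in this guide:

```python
from pathlib import Path

STATE_FILE = "expedia_state.json"

def context_kwargs(state_file: str = STATE_FILE) -> dict:
    """Build kwargs for browser.new_context(), restoring a saved
    storage state file when one exists from a previous run."""
    kwargs = {"viewport": {"width": 1440, "height": 900}}
    if Path(state_file).exists():
        kwargs["storage_state"] = state_file  # restores cookies + localStorage
    return kwargs

# After scraping, persist the session for the next run:
#     await context.storage_state(path=STATE_FILE)
```

On the first run no state file exists, so the context starts clean and the homepage warm-up builds the cookie chain; subsequent runs resume with it intact.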
Complete Monitoring Pipeline
async def run_daily_flight_monitor(
routes: list,
db_path: str = "flight_tracker.db",
):
"""
Daily flight price monitoring pipeline.
Checks prices for upcoming travel windows and stores results.
Args:
routes: List of (origin, destination) tuples
db_path: SQLite database path
"""
conn = init_flight_db(db_path)
    # Check prices at several horizons, 30 to 90 days out
check_dates = [
(date.today() + timedelta(days=d)).isoformat()
for d in [30, 45, 60, 75, 90]
]
for origin, destination in routes:
for depart_date in check_dates:
return_date = (
date.fromisoformat(depart_date) + timedelta(days=7)
).isoformat()
print(f" {origin}→{destination} on {depart_date}")
flights = await scrape_expedia_flights(
origin, destination, depart_date, return_date
)
if flights:
save_flights(conn, flights, origin, destination, depart_date, return_date)
prices = [
float(f.get("price", "$0").replace("$", "").replace(",", "").split()[0])
for f in flights
if f.get("price")
]
valid_prices = [p for p in prices if p > 50]
if valid_prices:
print(f" Min: ${min(valid_prices):.0f}, Avg: ${sum(valid_prices)/len(valid_prices):.0f}")
await asyncio.sleep(random.uniform(20, 40))
# Check for price drops
drops = detect_price_drops(conn, min_drop_pct=15.0)
if drops:
print(f"\n{len(drops)} price drops detected:")
for d in drops:
print(f" {d['route']} {d['date']}: ${d['current_price']} (was ~${d['baseline_price']}, -{d['drop_pct']}%)")
conn.close()
ROUTES = [
("SFO", "JFK"),
("LAX", "LHR"),
("ORD", "CDG"),
("BOS", "BCN"),
]
if __name__ == "__main__":
    asyncio.run(run_daily_flight_monitor(ROUTES))
Legal Considerations
Flight prices are factual data — prices themselves cannot be copyrighted. But the terms of service on Expedia and Kayak prohibit automated access. Use the data for personal research, price alerts, or analysis. Do not build a competing OTA that resells their data.
Keep request volumes reasonable — a few hundred searches per day, not thousands per hour. Travel sites' anti-bot systems are tuned to detect sustained automation, and getting permanently banned from a site you depend on for research defeats the purpose.
ThorData's residential proxies with US geo-targeting give you USD prices and residential IP reputation. Rotate IPs per session (not per request) to maintain session cookie state while varying your IP fingerprint across sessions.