
Handling Cookies and Sessions in Python Web Scrapers (2026 Guide)

Introduction: Why Cookies Break Your Scraper (And How to Fix It)

If you've spent hours debugging a scraper only to discover you're getting 403 errors, empty responses, or redirect loops back to the login page, you've encountered the cookie problem. Cookies are the web's primary mechanism for maintaining stateful interactions, and they're invisible until they're not. Understanding how to handle cookies and sessions isn't just a nice-to-have feature in web scraping—it's often the difference between a working scraper and months of frustration.

The web in 2026 is built on stateful authentication. Every meaningful action you want to automate—logging into an account, adding items to a cart, accessing an authenticated API, exporting data—requires the server to remember who you are between requests. Cookies and sessions are how that happens. When you skip learning to handle them properly, you're fighting against the fundamental architecture of the modern web.

This guide approaches cookies from first principles. We'll start with what cookies actually are at the HTTP protocol level, why web servers use them, and what happens when your scraper doesn't handle them correctly. Then we'll move into practical implementations: how to extract login credentials, maintain authenticated sessions across hundreds of requests, persist sessions to disk so you don't need to re-authenticate every time your scraper runs, rotate sessions at scale, and detect when authentication has failed so you can automatically re-login.

The reality of production web scraping is that sites actively defend against automated access. Cookies are one of their primary weapons. They can detect if your requests lack the cookie structure of a real browser. They can expire sessions based on behavior patterns. They can use cookies to fingerprint your automation framework. And when you're scraping at scale, managing hundreds or thousands of active sessions while rotating through residential proxies (we'll discuss tools like ThorData for this), cookie management becomes a complex orchestration problem.

This guide covers all of it: from basic httpx cookie jars to building a production SessionManager class that handles auto-login, cookie persistence, rotation, and fallback strategies. You'll learn why Playwright is sometimes necessary when httpx alone isn't enough. You'll see exactly what to watch for in HTTP responses to know when your session is dying. And you'll have multiple real-world code examples you can adapt immediately.

By the end, you'll understand cookies at a deeper level than most web developers, and you'll have the tools to automate any authenticated flow that exists on the web.

HTTP Cookies: The Protocol Level

To handle cookies correctly in a scraper, you need to understand what's actually happening when a server sets a cookie and when your client sends it back. This isn't magic—it's just HTTP headers.

When a server wants to set a cookie, it includes a Set-Cookie header in the HTTP response. A real example from the TLS encrypted connection to a major e-commerce site might look like this:

HTTP/1.1 200 OK
Set-Cookie: session_id=abc123def456; Path=/; Domain=.example.com; Secure; HttpOnly; SameSite=Strict; Max-Age=3600
Set-Cookie: user_pref=dark_mode; Path=/; Domain=.example.com; Secure; Max-Age=31536000

Each Set-Cookie header creates or updates one cookie. The format is:

Set-Cookie: name=value; attribute1=value1; attribute2=value2; ...

Let's break down the attributes you'll encounter:

Name=Value: The actual data. session_id=abc123def456 stores a string that the server will recognize. The server checks this value when you send it back to prove you're authenticated.

Domain: Controls which domains can access this cookie. Domain=.example.com means the browser (or your scraper) will send this cookie to example.com, www.example.com, api.example.com—anything under example.com. If no Domain is set, the cookie is only sent to the exact domain that set it. This matters when scraping sites with multiple subdomains; you might need to follow redirects to different subdomains to pick up cookies set there.

Path: Limits which URL paths on that domain receive the cookie. Path=/api means the cookie is only sent to requests like GET /api/users. Path=/ means all paths. When scraping, you usually don't need to worry about this—the server handles it—but it's worth knowing if you're debugging cookie behavior.

Secure: This cookie is only sent over HTTPS, never over unencrypted HTTP. Any modern site uses this. It's a flag, not a value (the presence of the word "Secure" means it's enabled).

HttpOnly: The cookie cannot be accessed by JavaScript, only sent in HTTP requests. This is a security feature to prevent XSS attacks from stealing session tokens. For scraping, it's transparent—you don't read the cookie value in JavaScript, you just receive it in Set-Cookie headers and send it back.

SameSite: Controls when the cookie is sent in cross-site requests. SameSite=Strict means the cookie is only sent to requests on the exact same site. SameSite=Lax (the modern default) allows the cookie in top-level navigation but not in cross-site subrequests. SameSite=None requires Secure and allows cross-site requests. For scraping on the same domain, this is usually irrelevant—you're making same-site requests.

Expires/Max-Age: When the cookie expires. Expires=Wed, 31 Dec 2025 23:59:59 GMT is an absolute time. Max-Age=3600 is seconds from now; if both are present, Max-Age wins. If neither is set, the cookie is a "session cookie" that expires when the browser closes. For your scraper, session cookies live only as long as your client object (or a cookie jar you persist yourself), so plan to save them to disk or re-authenticate on each run.
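To make the expiry rules concrete, here is a small stdlib sketch that decides whether a stored cookie has expired. The cookie_is_expired helper is hypothetical (not part of any library); it applies the Max-Age-over-Expires precedence from RFC 6265:

```python
import time
from email.utils import parsedate_to_datetime

def cookie_is_expired(attrs, set_time, now=None):
    """Decide whether a cookie set at `set_time` (epoch seconds) has expired.

    `attrs` holds parsed Set-Cookie attributes, e.g. {"max-age": "3600"}
    or {"expires": "Wed, 31 Dec 2025 23:59:59 GMT"}.
    Max-Age takes precedence over Expires, per RFC 6265.
    """
    now = time.time() if now is None else now
    if "max-age" in attrs:
        return now >= set_time + int(attrs["max-age"])
    if "expires" in attrs:
        return now >= parsedate_to_datetime(attrs["expires"]).timestamp()
    # Neither attribute: a session cookie, which never expires on a timer
    return False

# A cookie with Max-Age=3600 is still fresh 10 minutes after being set
print(cookie_is_expired({"max-age": "3600"}, set_time=1000.0, now=1600.0))  # False
```

A production cookie jar does this bookkeeping for you, but the logic is worth having when you persist cookies to disk yourself.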

After receiving Set-Cookie headers, your scraper needs to send the cookie back to the server in subsequent requests. This happens via the Cookie header:

GET /api/user/profile HTTP/1.1
Host: example.com
Cookie: session_id=abc123def456; user_pref=dark_mode

The Cookie header is simple: it's just name=value pairs separated by semicolons. The server uses the session_id value to look up your session in its database or verify your JWT token, and then serves you authenticated content.

This is where most cookie-related scraping bugs occur. Your scraper needs to:

  1. Extract cookies from Set-Cookie headers in responses
  2. Store them in a cookie jar
  3. Automatically send them in the Cookie header for subsequent requests to the same domain
  4. Handle cookie expiration
  5. Know when cookies are invalid (when the server starts rejecting them)

Modern HTTP libraries like httpx and requests abstract away most of this complexity, but when things go wrong, understanding the protocol lets you debug effectively.
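When you do need to debug at the protocol level, the standard library can parse a raw Set-Cookie header for you. A quick sketch using http.cookies on the example header from above:

```python
from http.cookies import SimpleCookie

# Parse a raw Set-Cookie header value with the standard library
raw = "session_id=abc123def456; Path=/; Secure; HttpOnly; Max-Age=3600"
jar = SimpleCookie()
jar.load(raw)

morsel = jar["session_id"]
print(morsel.value)        # abc123def456
print(morsel["path"])      # /
print(morsel["max-age"])   # 3600
# Flag attributes like Secure/HttpOnly are stored as booleans when present
print(bool(morsel["httponly"]))  # True
```

This is handy when inspecting response.headers directly, before any client-side cookie jar gets involved.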

Sessions vs. Cookies: Understanding the Relationship

Cookies are the transport mechanism. Sessions are what they represent.

A cookie is a piece of data your scraper stores and sends in HTTP headers. A session is the server-side state associated with that data. Here's how it works:

When you log into a website, you POST your username and password. The server validates them against its database. If valid, the server creates a new session object (usually just a row in a database table or an entry in memory) containing information like:

{
  "session_id": "abc123def456",
  "user_id": 42,
  "username": "john_doe",
  "created_at": "2026-03-31T10:00:00Z",
  "last_activity": "2026-03-31T10:00:00Z",
  "ip_address": "203.0.113.45",
  "user_agent": "Mozilla/5.0..."
}

The server sends back a Set-Cookie header with that session_id:

Set-Cookie: session_id=abc123def456; Path=/; Max-Age=3600

Now your scraper stores this cookie and sends it in every request. When you request /api/user/profile, the server receives the Cookie header, looks up session_id=abc123def456 in its session table, finds that it belongs to user 42, and serves that user's profile.

This design is elegant: the server doesn't need to verify your password on every request. It just needs to recognize the session ID. But it has implications for scraping:

Session expiration: The server can invalidate sessions after a time period (Max-Age) or after a period of inactivity. Your scraper needs to detect when this happens and re-authenticate.

Session binding: Modern sites bind sessions to IP addresses or user agents. If you're scraping through a rotating proxy, your IP changes with each request. The server might invalidate your session, thinking someone hijacked it. This is why pairing residential proxies (like ThorData at https://thordata.partnerstack.com/partner/0a0x4nzh) with proper session management is crucial for large-scale scraping.

Session validation: The server might include additional checks. It might verify that the User-Agent in your request matches the User-Agent from when the session was created. It might check the Referer header. It might require tokens to rotate.
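One practical consequence of session binding and validation: pin a single network identity to each session for its whole lifetime. A minimal bookkeeping sketch; the SessionRegistry class, proxy URLs, and account names here are illustrative, not a real library:

```python
class SessionRegistry:
    """Assign each account a stable (proxy, user_agent) pair so the server
    never sees a session jump between IPs or browsers mid-session."""

    def __init__(self, proxies, user_agents):
        self.proxies = proxies
        self.user_agents = user_agents
        self._assigned = {}  # account -> fixed identity

    def identity_for(self, account):
        # First lookup assigns an identity; later lookups always return it
        if account not in self._assigned:
            i = len(self._assigned)
            self._assigned[account] = {
                "proxy": self.proxies[i % len(self.proxies)],
                "user_agent": self.user_agents[i % len(self.user_agents)],
            }
        return self._assigned[account]

registry = SessionRegistry(
    proxies=["http://proxy-1:8000", "http://proxy-2:8000"],  # sticky endpoints
    user_agents=["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"],
)

# Every request for this account uses the same proxy and User-Agent
identity = registry.identity_for("john_doe")
print(identity["proxy"])  # http://proxy-1:8000
```

You would feed identity["proxy"] and identity["user_agent"] into your HTTP client when creating the session, and never change them until you deliberately rotate the whole session.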

JWT Tokens: Cookies for Stateless APIs

Many modern APIs use JSON Web Tokens (JWTs) instead of server-side sessions. A JWT is a self-contained token that proves authentication. When you log in, the server returns a token:

{
  "access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyX2lkIjo0MiwiZXhwIjoxNjE2NDI4MDAwfQ.signature",
  "token_type": "Bearer",
  "expires_in": 3600
}

You then send this token in the Authorization header:

GET /api/user/profile HTTP/1.1
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

The server verifies the token cryptographically (without needing a database lookup) and serves the request. This is stateless—the server doesn't maintain a session table.

For scraping, JWTs usually come in two forms:

  1. Returned in the login response JSON, then stored and sent in the Authorization header
  2. Set as a cookie by the server, then automatically sent by your HTTP client

Some sites do both—they set a JWT as a cookie and also return it in JSON. We'll cover both patterns in the code examples.
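Either way, you can peek inside a JWT without verifying it, because the payload segment is just base64url-encoded JSON. A stdlib sketch (the demo token is constructed locally here, not issued by a real server):

```python
import base64
import json

def jwt_payload(token):
    """Decode the (unverified) payload segment of a JWT.

    A JWT is three base64url segments: header.payload.signature.
    We only need the payload to read claims like `exp` and `user_id`.
    """
    payload_b64 = token.split(".")[1]
    # base64url often omits padding; restore it before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a demo token matching the structure shown above (signature is fake)
header = base64.urlsafe_b64encode(b'{"alg":"HS256","typ":"JWT"}').rstrip(b"=").decode()
payload = base64.urlsafe_b64encode(b'{"user_id":42,"exp":1616428000}').rstrip(b"=").decode()
token = f"{header}.{payload}.signature"

print(jwt_payload(token))  # {'user_id': 42, 'exp': 1616428000}
```

Reading the exp claim this way lets your scraper schedule a token refresh without waiting for a 401.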

Cookie Handling with httpx

The httpx library is the go-to Python HTTP client in 2026. It handles cookies automatically with sensible defaults, and it's better suited than the older requests library for scrapers that need to handle complex authentication flows.

When you create an httpx.Client, it automatically maintains a cookie jar:

import httpx

client = httpx.Client()

# First request: the server sets a cookie
response = client.post("https://example.com/login", data={
    "username": "john_doe",
    "password": "secure_password"
})

# Second request: the cookie is automatically sent
response = client.get("https://example.com/api/user/profile")
# httpx automatically included the Set-Cookie from the first response in this request

print(dict(client.cookies))
# Output: {'session_id': 'abc123def456'}

The Cookies object (httpx.Cookies) behaves like a dictionary:

# Access a specific cookie
session_id = client.cookies.get("session_id")
print(session_id)  # abc123def456

# Iterate over all cookies
for name, value in client.cookies.items():
    print(f"{name}={value}")

# Check if a cookie exists
if "session_id" in client.cookies:
    print("Session is valid")

# Add a cookie manually (rarely needed)
client.cookies.set("custom_cookie", "custom_value", domain="example.com")

The automatic cookie handling is the key feature. As long as you use the same client object for all requests to a domain, cookies are handled transparently. This is very different from making raw curl requests where you manually extract and pass cookies.

When creating an httpx.Client, you can customize cookie behavior:

import httpx

# Default behavior: cookies are managed automatically
client = httpx.Client()

# Seed the client with initial cookies (note: httpx has no built-in switch
# to disable cookies entirely; to start clean, just create a fresh Client)
client = httpx.Client(
    cookies={"existing_cookie": "value"},  # Initial cookies
    follow_redirects=True,  # Follow redirects and maintain cookies through them
)

# Get all cookies as a dictionary
cookies_dict = dict(client.cookies)

# Clear all cookies
client.cookies.clear()

# Delete a specific cookie
del client.cookies["session_id"]

One critical setting for authenticated scraping:

# Always use follow_redirects=True
# Otherwise, you might miss Set-Cookie headers in redirect responses
client = httpx.Client(follow_redirects=True)

# This is even more important when the login flow involves redirects
response = client.post("https://example.com/login", data={
    "username": "user",
    "password": "pass",
    # Server responds with 302 redirect
    # With follow_redirects=True, httpx follows it and collects cookies
    # With follow_redirects=False, you'd miss Set-Cookie headers in the redirect
})
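To see what follow_redirects=True is doing for you, here is a stdlib sketch of how a cookie jar accumulates Set-Cookie headers across a redirect chain. The header dicts are stand-ins for real responses, and collect_cookies is an illustrative helper, not an httpx API:

```python
from http.cookies import SimpleCookie

def collect_cookies(redirect_chain):
    """Merge Set-Cookie headers from every hop of a redirect chain,
    the way an HTTP client's cookie jar does. `redirect_chain` is a
    list of header dicts, one per response (intermediate hops first)."""
    jar = {}
    for headers in redirect_chain:
        for raw in headers.get("set-cookie", []):
            parsed = SimpleCookie()
            parsed.load(raw)
            for name, morsel in parsed.items():
                jar[name] = morsel.value  # later hops overwrite earlier ones
    return jar

# A login flow: POST -> 302 (sets the session cookie) -> 200 (sets a pref)
chain = [
    {"set-cookie": ["session_id=abc123; Path=/; HttpOnly"]},
    {"set-cookie": ["user_pref=dark_mode; Path=/"]},
]
print(collect_cookies(chain))  # {'session_id': 'abc123', 'user_pref': 'dark_mode'}
```

With follow_redirects=False, your client only ever sees the first dict in that chain, which is exactly how session cookies get lost.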

Here's the challenge: when your scraper stops working, is it a cookie problem? These signs indicate cookie issues:

import httpx

client = httpx.Client(follow_redirects=True)

# Sign 1: Redirects back to login
response = client.get("https://example.com/api/data")
if response.url.path == "/login":
    print("ERROR: Got redirected to login. Session expired or invalid.")

# Sign 2: 403 Forbidden (permission denied, usually auth issue)
if response.status_code == 403:
    print("ERROR: 403 Forbidden. Cookies might be invalid or session expired.")

# Sign 3: 401 Unauthorized
if response.status_code == 401:
    print("ERROR: 401 Unauthorized. Need to re-authenticate.")

# Sign 4: Empty or unexpected response
if len(response.text) < 100:
    print("WARNING: Got unusually small response. Could indicate session issues.")

# Sign 5: No Set-Cookie headers during login (the auth failed)
# With follow_redirects=True the Set-Cookie may arrive on an intermediate
# response, so check response.history as well as the final response
response = client.post("https://example.com/login", data={"user": "u", "pass": "p"})
if not any("set-cookie" in r.headers for r in [*response.history, response]):
    print("ERROR: No session cookie set. Login failed. Check credentials.")

# Inspect what cookies are currently set
print("Current cookies:", dict(client.cookies))
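The five signs above can be folded into one heuristic helper. A sketch; the thresholds and the looks_logged_out name are arbitrary choices for illustration, not an httpx API:

```python
def looks_logged_out(status_code, final_path, body):
    """Heuristic combining the signs above: True when a response
    suggests the session is dead and a re-login is needed."""
    if status_code in (401, 403):
        return True                          # explicit auth rejection
    if final_path.rstrip("/").endswith("/login"):
        return True                          # redirected back to the login page
    if len(body) < 100 and "error" in body.lower():
        return True                          # tiny error stub instead of real data
    return False

print(looks_logged_out(401, "/api/data", ""))         # True
print(looks_logged_out(200, "/login", "<html>..."))   # True
print(looks_logged_out(200, "/api/data", "x" * 500))  # False
```

In practice you would call it as looks_logged_out(response.status_code, response.url.path, response.text) and trigger a re-authentication when it returns True.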

Requests.Session vs. httpx.Client: When to Use Each

The requests library dominated Python HTTP for years, and many older tutorials show requests.Session(). Let's compare to understand when to use each.

import requests
import httpx

# Both libraries provide session objects
requests_session = requests.Session()
httpx_client = httpx.Client()

# Functionally similar at the surface
requests_session.post("https://example.com/login", data={"user": "u", "pass": "p"})
response = requests_session.get("https://example.com/api/data")

httpx_client.post("https://example.com/login", data={"user": "u", "pass": "p"})
response = httpx_client.get("https://example.com/api/data")

Here's why you should prefer httpx for modern scraping:

1. httpx supports async/await out of the box

import httpx
import asyncio

async def scrape_concurrently():
    async with httpx.AsyncClient() as client:
        # Make 10 requests concurrently, not serially
        tasks = [
            client.get(f"https://example.com/page/{i}")
            for i in range(10)
        ]
        responses = await asyncio.gather(*tasks)
        return responses

asyncio.run(scrape_concurrently())

With requests, you either use threading (error-prone) or use a third-party wrapper. httpx makes async natural.

2. httpx is under active development with modern defaults

httpx supports HTTP/2 (opt in with httpx.Client(http2=True), which requires the h2 extra), which is more efficient for many sites. It has better type hints, and it's actively maintained with regular security updates.

3. httpx handles edge cases better

For example, httpx's cookie handling is generally more robust with domain matching. requests has had long-standing cookie-handling quirks, and as the younger, actively developed project, httpx has tended to ship fixes faster.

4. requests lacks important features for scrapers

The requests library doesn't have built-in retry logic with exponential backoff; you have to wire up urllib3's Retry yourself. httpx doesn't ship full backoff either, but its transport layer supports connection-level retries (httpx.HTTPTransport(retries=...)) and composes cleanly with a small backoff wrapper.

The only reason to use requests is legacy code or when you need to integrate with libraries that specifically require it. For new scraping projects, start with httpx.

Here's a migration example:

# Old (requests)
import requests
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0..."})
response = session.get("https://example.com")
print(dict(session.cookies))

# New (httpx)
import httpx
client = httpx.Client(headers={"User-Agent": "Mozilla/5.0..."})
response = client.get("https://example.com")
print(dict(client.cookies))

The APIs are similar enough that migration is straightforward, but httpx's design is cleaner.

Login Flows: From Forms to APIs

Authentication takes different forms depending on the site. Let's cover the common patterns.

Form-Based Login (HTML POST)

Traditional websites use HTML forms. You submit username and password, the server validates them, and sets a session cookie.

import httpx
from html.parser import HTMLParser

client = httpx.Client(
    follow_redirects=True,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
)

# Step 1: GET the login page to extract any hidden fields
response = client.get("https://example.com/login")

# Some sites include CSRF tokens in the form
# We'll parse them to be safe
class FormParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.form_data = {}
        self.in_form = False
        self.csrf_token = None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            attrs_dict = dict(attrs)
            name = attrs_dict.get("name")
            if name:
                # Store every named field so hidden inputs get re-submitted
                if "value" in attrs_dict:
                    self.form_data[name] = attrs_dict["value"]
                # Only treat hidden inputs whose name mentions "csrf" as the token
                if attrs_dict.get("type") == "hidden" and "csrf" in name.lower():
                    self.csrf_token = attrs_dict.get("value")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

parser = FormParser()
parser.feed(response.text)
csrf_token = parser.csrf_token
print(f"Extracted CSRF token: {csrf_token}")

# Step 2: Submit the login form
login_data = {
    "username": "john_doe",
    "password": "secure_password",
    "_csrf": csrf_token  # Include the CSRF token
}

response = client.post(
    "https://example.com/login",
    data=login_data,
    follow_redirects=True
)

# Step 3: Verify authentication succeeded
if response.status_code == 200 and "dashboard" in response.text.lower():
    print("Login successful")
    print(f"Session cookie: {client.cookies.get('session_id')}")
else:
    print("Login failed")
    print(f"Status: {response.status_code}")
    print(f"Response preview: {response.text[:200]}")

# Step 4: Now use the authenticated session
response = client.get("https://example.com/api/user/profile")
print(response.json())

Key points:

  1. Always use follow_redirects=True. Login flows often end with a redirect to the dashboard.
  2. Extract and send CSRF tokens. We'll cover CSRF in depth in the next section.
  3. Verify the login succeeded. Don't assume it worked—check the response.
  4. Keep the client alive. As long as the httpx.Client object exists, it maintains cookies.

JSON API Authentication

Modern APIs don't use HTML forms. Instead, you POST JSON with credentials and receive a token:

import httpx
import json
import time

client = httpx.Client(
    headers={"User-Agent": "Mozilla/5.0..."}
)

# Some APIs require specific headers for authentication
response = client.post(
    "https://api.example.com/auth/login",
    json={"email": "[email protected]", "password": "secure_password"},
    headers={
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
)

if response.status_code != 200:
    print(f"Login failed: {response.status_code}")
    print(response.text)
    exit(1)

auth_response = response.json()
print(f"Auth response: {auth_response}")

# Pattern 1: Token in JSON, you manually add it to Authorization header
if "access_token" in auth_response:
    token = auth_response["access_token"]

    # Option A: Set it as a default header for all requests
    client.headers.update({"Authorization": f"Bearer {token}"})

    response = client.get("https://api.example.com/user/profile")
    print(response.json())

# Pattern 2: Token is set as a cookie automatically
# (less common but happens with some APIs)
# The Set-Cookie header from the login response contains the token
# httpx.Client handles it automatically
response = client.get("https://api.example.com/user/profile")
# The cookie is sent automatically because client maintains the cookie jar

The critical difference: with form-based auth, the server sets cookies automatically. With JSON APIs, you need to either:

  1. Extract the token from JSON and manually add it to the Authorization header
  2. Extract the token from JSON and manually set it as a cookie
  3. Let the server set it as a cookie and rely on httpx to send it

Let's see a more complex example with token refresh:

import httpx
import time
from datetime import datetime, timedelta

class APIClient:
    def __init__(self, base_url, email, password):
        self.base_url = base_url
        self.email = email
        self.password = password
        self.client = httpx.Client(
            base_url=base_url,
            headers={"User-Agent": "Mozilla/5.0..."}
        )
        self.token = None
        self.token_expires_at = None

    def authenticate(self):
        """Log in and store the token"""
        response = self.client.post(
            "/auth/login",
            json={"email": self.email, "password": self.password}
        )

        if response.status_code != 200:
            raise ValueError(f"Authentication failed: {response.text}")

        data = response.json()
        self.token = data["access_token"]
        expires_in = data.get("expires_in", 3600)  # Default 1 hour
        self.token_expires_at = time.time() + expires_in

        self.client.headers.update({"Authorization": f"Bearer {self.token}"})
        return True

    def refresh_token_if_needed(self):
        """Refresh the token if it's close to expiration"""
        if self.token is None or time.time() >= self.token_expires_at - 60:
            # Token is missing or expires within 60 seconds
            self.authenticate()

    def get(self, path, **kwargs):
        """Make an authenticated GET request"""
        self.refresh_token_if_needed()
        return self.client.get(path, **kwargs)

    def post(self, path, **kwargs):
        """Make an authenticated POST request"""
        self.refresh_token_if_needed()
        return self.client.post(path, **kwargs)

# Usage
client = APIClient("https://api.example.com", "[email protected]", "password")
client.authenticate()

# Make authenticated requests
response = client.get("/user/profile")
print(response.json())

# If this request happens more than 1 hour later, token is automatically refreshed
time.sleep(3600)
response = client.get("/user/data")  # Token is silently refreshed if needed
print(response.json())

This pattern handles token expiration automatically. Many APIs issue tokens with an expires_in field in seconds. Your scraper should respect this and refresh before expiration.

OAuth2 and Social Login

Some sites require OAuth2 (login with Google, Facebook, etc.). These are complex because they involve browser redirects. We'll cover the approach here and then discuss Playwright for full browser automation.

With OAuth2, your scraper can't easily get a real OAuth token without controlling a browser. However, some sites provide alternative authentication methods:

import httpx

# Option 1: Skip OAuth and use API key authentication
# Some sites provide API keys in user settings
client = httpx.Client(
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
response = client.get("https://api.example.com/data")
print(response.json())

# Option 2: Use the site's mobile app authentication
# Many sites have a simpler auth flow for their mobile apps
# Try requesting with a mobile User-Agent and look for simpler auth options
response = client.post(
    "https://api.example.com/auth/mobile",
    json={"email": "[email protected]", "password": "password"},
    headers={"User-Agent": "Mobile App v1.0"}
)

# Option 3: If the site has a "Sign in with Google" button, inspect the network
# Sometimes the site posts credentials directly instead of going through OAuth
# (This is a security anti-pattern but some sites do it)

For sites that genuinely require OAuth with no alternative, you need Playwright to run a real browser. We'll cover that later.

CSRF Tokens: Extracting and Submitting

CSRF (Cross-Site Request Forgery) tokens are a security mechanism to prevent attackers from forging requests. When you submit a form, you need to include the CSRF token from that same form.

Extracting CSRF Tokens from HTML Forms

import httpx
import re
from html.parser import HTMLParser

client = httpx.Client()
response = client.get("https://example.com/login")

# Method 1: Using a simple regex (good for quick scripts)
match = re.search(r'<input[^>]*name=["\'](?:csrf|_csrf|csrf_token)["\'][^>]*value=["\']([^"\']+)["\']', response.text)
if match:
    csrf_token = match.group(1)
    print(f"CSRF token: {csrf_token}")

# Method 2: Using an HTML parser (more robust)
class CSRFParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.csrf_token = None

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs_dict = dict(attrs)
            if attrs_dict.get("name") in ("csrf", "_csrf", "csrf_token"):
                self.csrf_token = attrs_dict.get("value")

parser = CSRFParser()
parser.feed(response.text)
print(f"CSRF token: {parser.csrf_token}")

# Method 3: Using BeautifulSoup (if you have it installed)
try:
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")
    csrf_input = soup.find("input", {"name": ["csrf", "_csrf", "csrf_token"]})
    if csrf_input:
        csrf_token = csrf_input.get("value")
        print(f"CSRF token: {csrf_token}")
except ImportError:
    pass

CSRF Tokens in Meta Tags

Some single-page applications (SPAs) don't use form-based login. Instead, they include the CSRF token in a meta tag:

import httpx
import re

client = httpx.Client()
response = client.get("https://example.com/login")

# Look for a meta tag with the CSRF token
# Common patterns:
# <meta name="csrf-token" content="TOKEN_VALUE">
# <meta name="x-csrf-token" content="TOKEN_VALUE">

match = re.search(r'<meta[^>]*name=["\'](?:x-)?csrf-token["\'][^>]*content=["\']([^"\']+)["\']', response.text)
if match:
    csrf_token = match.group(1)
    print(f"CSRF token: {csrf_token}")

# If the token is in a script variable instead:
match = re.search(r'window\.csrfToken\s*=\s*["\']([^"\']+)["\']', response.text)
if match:
    csrf_token = match.group(1)
    print(f"CSRF token from script: {csrf_token}")
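The form-submission examples later in this guide call an extract_csrf_token helper; here is one consolidated sketch that tries the strategies above in order. The regexes are heuristics and won't survive every HTML layout:

```python
import re

def extract_csrf_token(html):
    """Try the extraction strategies in order: hidden form input,
    meta tag, then an inline script variable. Returns None when no
    token is found, so callers can decide how to fail."""
    patterns = [
        # <input name="csrf" value="..."> (also _csrf, csrf_token)
        r'<input[^>]*name=["\'](?:csrf|_csrf|csrf_token)["\'][^>]*value=["\']([^"\']+)["\']',
        # <meta name="csrf-token" content="..."> (also x-csrf-token)
        r'<meta[^>]*name=["\'](?:x-)?csrf-token["\'][^>]*content=["\']([^"\']+)["\']',
        # window.csrfToken = "..."
        r'window\.csrfToken\s*=\s*["\']([^"\']+)["\']',
    ]
    for pattern in patterns:
        match = re.search(pattern, html, re.IGNORECASE)
        if match:
            return match.group(1)
    return None

html = '<form><input type="hidden" name="_csrf" value="tok123"></form>'
print(extract_csrf_token(html))  # tok123
```

For production scraping of a specific site, replace the pattern list with the one pattern that matches that site's markup, and fail loudly when it stops matching.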

Submitting Forms with CSRF Tokens

Once extracted, include the CSRF token in your POST request:

import httpx

client = httpx.Client(follow_redirects=True)

# Step 1: Get the form and extract CSRF token
response = client.get("https://example.com/login")
csrf_token = extract_csrf_token(response.text)  # Use function from above

# Step 2: Submit the form with CSRF token
response = client.post(
    "https://example.com/login",
    data={
        "username": "john_doe",
        "password": "secure_password",
        "_csrf": csrf_token,  # Include the CSRF token
        "remember": "on"  # Any other form fields
    }
)

# Verify success
if response.status_code == 200 and "dashboard" in response.text.lower():
    print("Login successful")

CSRF in JSON APIs

When the API expects JSON, include the CSRF token as a header or in the JSON body:

import httpx

client = httpx.Client(follow_redirects=True)

# Get the CSRF token
response = client.get("https://example.com/login")
csrf_token = extract_csrf_token(response.text)

# Option 1: CSRF token as a request header
response = client.post(
    "https://example.com/api/login",
    json={"email": "[email protected]", "password": "password"},
    headers={"X-CSRF-Token": csrf_token}
)

# Option 2: CSRF token in the JSON body
response = client.post(
    "https://example.com/api/login",
    json={
        "email": "[email protected]",
        "password": "password",
        "_csrf": csrf_token
    }
)

Multi-Step Authentication: Beyond Simple Login

Many sites require additional verification steps: 2FA codes, email verification, security questions, etc.

Two-Factor Authentication (2FA)

If a site requires 2FA, you have a few options:

Option 1: Skip sites with 2FA (reasonable for simple scraping)

import httpx

client = httpx.Client(follow_redirects=True)

response = client.post(
    "https://example.com/login",
    data={"username": "user", "password": "pass"}
)

# Check if we got a 2FA prompt
if "2fa" in response.text.lower() or response.status_code == 403:
    print("Site requires 2FA. Skipping this account.")
    exit(0)

# If we get here, login succeeded without 2FA
print("Logged in successfully")

Option 2: Support email-based 2FA

If the site sends 2FA codes via email and you have access to that email account, you can parse the code:

import httpx
import re
import time

def get_2fa_code_from_email(email, password, timeout=30):
    """Retrieve the 2FA code sent to an email address"""
    start_time = time.time()

    while time.time() - start_time < timeout:
        try:
            # This would need an email library like imap_tools
            # For brevity, showing the pattern
            pass
        except Exception as e:
            print(f"Error checking email: {e}")

        time.sleep(2)  # Check every 2 seconds

    raise TimeoutError("No 2FA code received")

client = httpx.Client(follow_redirects=True)

# Initial login
response = client.post(
    "https://example.com/login",
    data={"username": "user", "password": "pass"}
)

if "2fa" in response.text.lower():
    print("2FA required")
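The email-fetching half of that stub is site-specific (the stdlib imaplib can do it), but the parsing half can be sketched now. This assumes the code arrives as a 4-8 digit number in the message body; both patterns are heuristics:

```python
import re

def extract_2fa_code(email_body):
    """Pull a numeric one-time code out of an email body. Most sites send
    a 6-digit code; the patterns below are guesses, not a standard."""
    patterns = [
        r'verification code[^0-9]*(\d{4,8})',  # "Your verification code is 123456"
        r'\b(\d{6})\b',                        # fall back to any 6-digit number
    ]
    for pattern in patterns:
        match = re.search(pattern, email_body, re.IGNORECASE)
        if match:
            return match.group(1)
    return None

body = "Hello,\nYour verification code is 482913. It expires in 10 minutes."
print(extract_2fa_code(body))  # 482913
```

Plug this into get_2fa_code_from_email's polling loop: fetch the newest unseen message, run extract_2fa_code on its body, and return as soon as it yields a code.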

Option 3: TOTP (Time-based One-Time Password)

If the site uses an authenticator app, you can generate TOTP codes:

import httpx
import pyotp

client = httpx.Client(follow_redirects=True)

# You need the TOTP secret (usually a QR code from the site)
totp_secret = "JBSWY3DPEBLW64TMMQ======"  # Your TOTP secret

response = client.post(
    "https://example.com/login",
    data={"username": "user", "password": "pass"}
)

if "2fa" in response.text.lower():
    # Generate current TOTP code
    totp = pyotp.TOTP(totp_secret)
    code = totp.now()

    response = client.post(
        "https://example.com/verify-2fa",
        data={"code": code}
    )

    if response.status_code == 200:
        print("2FA passed")

Persisting Cookies to Disk

The whole point of handling sessions is to avoid re-authenticating every time your scraper runs. Persist cookies to disk so they survive between runs.

JSON Persistence (Simple)

import httpx
import json
import os
from datetime import datetime

class PersistentClient:
    def __init__(self, cookies_file="cookies.json"):
        self.cookies_file = cookies_file
        self.client = httpx.Client(follow_redirects=True)
        self.load_cookies()

    def save_cookies(self):
        """Save cookies to a JSON file"""
        cookies_dict = {}
        for name, value in self.client.cookies.items():
            cookies_dict[name] = value

        with open(self.cookies_file, "w") as f:
            json.dump(cookies_dict, f, indent=2)

        print(f"Cookies saved to {self.cookies_file}")

    def load_cookies(self):
        """Load cookies from a JSON file if it exists"""
        if os.path.exists(self.cookies_file):
            try:
                with open(self.cookies_file, "r") as f:
                    cookies_dict = json.load(f)

                # Restore cookies to the client
                for name, value in cookies_dict.items():
                    self.client.cookies.set(name, value)

                print(f"Cookies loaded from {self.cookies_file}")
            except Exception as e:
                print(f"Error loading cookies: {e}")

    def login(self, username, password):
        """Log in to the site"""
        response = self.client.post(
            "https://example.com/login",
            data={"username": username, "password": password}
        )

        if response.status_code == 200:
            print("Login successful")
            self.save_cookies()  # Save cookies after login
            return True
        else:
            print("Login failed")
            return False

    def get(self, url, **kwargs):
        """Make an authenticated GET request"""
        response = self.client.get(url, **kwargs)

        # A 401 means the saved session is stale; the caller should
        # re-login and retry (see the usage below)
        if response.status_code == 401:
            print("Session expired, re-authentication needed")

        return response

# Usage
client = PersistentClient("cookies.json")

# If cookies exist and are valid, use them
# If not, login
response = client.get("https://example.com/api/user")
if response.status_code == 401:
    client.login("john_doe", "password")
    response = client.get("https://example.com/api/user")

print(response.json())

Pickle Persistence (Automatic)

Python's pickle module can serialize httpx's Cookies object directly, preserving attributes like domain, path, and expiry that the simple JSON approach drops:

import httpx
import pickle
import os

class PickledClient:
    def __init__(self, cookies_file="cookies.pkl"):
        self.cookies_file = cookies_file
        self.client = httpx.Client(follow_redirects=True)
        self.load_cookies()

    def save_cookies(self):
        """Save cookies using pickle (includes all cookie attributes)"""
        with open(self.cookies_file, "wb") as f:
            pickle.dump(self.client.cookies, f)
        print(f"Cookies pickled to {self.cookies_file}")

    def load_cookies(self):
        """Load pickled cookies"""
        if os.path.exists(self.cookies_file):
            try:
                with open(self.cookies_file, "rb") as f:
                    self.client.cookies = pickle.load(f)
                print(f"Cookies loaded from {self.cookies_file}")
            except Exception as e:
                print(f"Error loading cookies: {e}")

# Usage
client = PickledClient()
response = client.client.get("https://example.com/api/data")
client.save_cookies()

One caveat: never load a pickle file from an untrusted source, since unpickling can execute arbitrary code.

SQLite Persistence (Production)

For production scrapers managing multiple accounts and sessions, SQLite is more robust:

import httpx
import sqlite3
import json
from datetime import datetime, timedelta, timezone

class SQLiteSessionStore:
    def __init__(self, db_path="sessions.db"):
        self.db_path = db_path
        self.init_database()

    def init_database(self):
        """Initialize the database schema"""
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS sessions (
                    id INTEGER PRIMARY KEY,
                    account TEXT UNIQUE NOT NULL,
                    cookies TEXT NOT NULL,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    last_used TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    expires_at TIMESTAMP
                )
            """)
            conn.commit()

    def save_session(self, account, cookies, expires_in_hours=24):
        """Save a session with an account identifier"""
        cookies_json = json.dumps(dict(cookies))
        # Store the expiry as an ISO string; datetime.utcnow() and the
        # implicit sqlite3 datetime adapter are deprecated in Python 3.12+
        expires_at = datetime.now(timezone.utc) + timedelta(hours=expires_in_hours)

        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                INSERT OR REPLACE INTO sessions (account, cookies, expires_at)
                VALUES (?, ?, ?)
            """, (account, cookies_json, expires_at.isoformat()))
            conn.commit()

        print(f"Session saved for {account}")

    def load_session(self, account):
        """Load a session for an account"""
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute("""
                SELECT cookies, expires_at FROM sessions
                WHERE account = ?
            """, (account,))
            row = cursor.fetchone()

        if not row:
            return None

        cookies_json, expires_at = row

        # Check if the session has expired
        if datetime.fromisoformat(expires_at) < datetime.now(timezone.utc):
            print(f"Session for {account} has expired")
            return None

        return json.loads(cookies_json)

# Usage
store = SQLiteSessionStore()

# Create a client with saved cookies
client = httpx.Client(follow_redirects=True)

cookies = store.load_session("john_doe")
if cookies:
    print("Using cached session")
    for name, value in cookies.items():
        client.cookies.set(name, value)
else:
    print("No cached session, logging in...")
    response = client.post("https://example.com/login", data={
        "username": "john_doe",
        "password": "secure_password"
    })

    if response.status_code == 200:
        store.save_session("john_doe", client.cookies, expires_in_hours=24)

# Use the authenticated client
response = client.get("https://example.com/api/user")
print(response.json())

When httpx Fails: JavaScript-Rendered Cookies

Some modern sites render content with JavaScript, which means the cookies you need might be set by JavaScript code, not HTTP headers. Additionally, some sites use JavaScript-based anti-bot systems (Cloudflare, Akamai, PerimeterX) that require you to solve challenges before issuing valid cookies.

Signs You Need a Real Browser

import httpx

client = httpx.Client()
response = client.get("https://example.com/protected-data")

# Sign 1: Empty response body
if len(response.text) < 100:
    print("Response too small, likely JS-rendered or blocked by anti-bot")

# Sign 2: JavaScript challenge page
if "challenge" in response.text.lower() or "cloudflare" in response.text.lower():
    print("Cloudflare or similar anti-bot system detected")
    print("Need to use Playwright to bypass")

# Sign 3: Meta tags only, no content
if response.text.count("<meta") > response.text.count("<p"):
    print("Page is likely JS-rendered")

# Sign 4: Scripts instead of content
if response.text.count("<script") > 5:
    print("Heavy JavaScript rendering detected")

When these signs appear, you need Playwright, a real browser automation framework (install it with pip install playwright, then run playwright install chromium to download the browser).
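
The four checks above can be bundled into a single helper; the thresholds are rough heuristics, so tune them per site:

```python
def looks_js_rendered(html: str) -> bool:
    """Rough heuristic: does this page need a real browser to scrape?"""
    lowered = html.lower()
    if len(html) < 100:
        return True                       # nearly empty body
    if "challenge" in lowered or "cloudflare" in lowered:
        return True                       # anti-bot interstitial page
    if html.count("<script") > 5 and html.count("<p") < 3:
        return True                       # scripts but no real content
    return False

print(looks_js_rendered("<html></html>"))     # True (tiny body)
print(looks_js_rendered("<p>item</p>" * 50))  # False
```

Run this on the first response from a new target before writing the scraper; it tells you early whether httpx alone will be enough.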

Playwright: Full Browser Automation with Cookies

Playwright controls a real Chromium browser. It's slower than httpx but it handles everything: JavaScript execution, anti-bot systems, and cookies set by JavaScript.

Basic Playwright Login

import asyncio
from playwright.async_api import async_playwright

async def login_with_playwright():
    async with async_playwright() as p:
        # Launch a browser
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        # Navigate to the login page
        await page.goto("https://example.com/login")

        # Fill in the login form
        await page.fill("input[name='username']", "john_doe")
        await page.fill("input[name='password']", "secure_password")

        # Click the login button
        await page.click("button:has-text('Login')")

        # Wait for the post-login navigation to settle
        # (Playwright's Python API has no wait_for_navigation method)
        await page.wait_for_load_state("networkidle")

        # Get all cookies set by the browser
        cookies = await context.cookies()
        print(f"Cookies: {cookies}")

        # Make an authenticated request
        response = await page.goto("https://example.com/api/user/profile")
        text = await page.content()
        print(f"Authenticated content: {text[:200]}")

        await browser.close()

# Run the async function
asyncio.run(login_with_playwright())

The cookies from Playwright are returned as a list of dictionaries containing all the cookie attributes.

Extracting Cookies and Using Them in httpx

The real power comes from combining Playwright (for bypassing anti-bot) with httpx (for fast, simple requests):

import asyncio
import httpx
from playwright.async_api import async_playwright

async def get_cookies_with_playwright():
    """Use Playwright to log in and get cookies"""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        await page.goto("https://example.com/login")
        await page.fill("input[name='email']", "[email protected]")
        await page.fill("input[name='password']", "secure_password")
        await page.click("button:has-text('Login')")
        await page.wait_for_load_state("networkidle")

        # Get cookies from the authenticated browser
        cookies = await context.cookies()

        await browser.close()
        return cookies

async def scrape_with_httpx_using_playwright_cookies():
    """Get cookies from Playwright, then use httpx for scraping"""

    # Step 1: Use Playwright to log in (handles anti-bot, JS rendering)
    cookies = await get_cookies_with_playwright()

    # Convert Playwright cookies to httpx format
    cookies_dict = {c["name"]: c["value"] for c in cookies}

    # Step 2: Create an httpx client with these cookies
    client = httpx.Client(cookies=cookies_dict)

    # Step 3: Use httpx for fast scraping (no need for browser anymore)
    for page in range(1, 100):
        response = client.get(f"https://example.com/api/data?page={page}")

        if response.status_code == 401:
            print("Session expired, re-running Playwright login...")
            cookies = await get_cookies_with_playwright()
            cookies_dict = {c["name"]: c["value"] for c in cookies}
            client.cookies.clear()
            client.cookies.update(cookies_dict)
            response = client.get(f"https://example.com/api/data?page={page}")

        data = response.json()
        print(f"Page {page}: {len(data)} items")

# Run it
asyncio.run(scrape_with_httpx_using_playwright_cookies())

This pattern is powerful: use Playwright for auth (which is slow but robust), then switch to httpx for bulk scraping (which is fast). Playwright handles JavaScript-based anti-bot systems; httpx handles the bulk scraping.

Playwright Storage State (Saving Browser Session)

Playwright can save the entire browser state (all cookies, local storage, session storage, etc.) to a file:

import asyncio
from playwright.async_api import async_playwright

async def save_browser_state():
    """Save the entire browser session to a file"""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        # Log in
        await page.goto("https://example.com/login")
        await page.fill("input[name='email']", "[email protected]")
        await page.fill("input[name='password']", "secure_password")
        await page.click("button:has-text('Login')")
        await page.wait_for_load_state("networkidle")

        # Save the entire state
        await context.storage_state(path="browser_state.json")
        print("Browser state saved to browser_state.json")

        await browser.close()

async def load_browser_state():
    """Load a previously saved browser session"""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        # Create a context from saved state
        context = await browser.new_context(
            storage_state="browser_state.json"
        )
        page = await context.new_page()

        # Now the page has the same cookies, local storage, etc.
        # as when we saved the state
        await page.goto("https://example.com/api/user/profile")
        text = await page.content()
        print(f"Authenticated page loaded: {text[:200]}")

        await browser.close()

# Save state
asyncio.run(save_browser_state())

# Later, load state
asyncio.run(load_browser_state())

The saved browser_state.json contains cookies with all their attributes, as well as localStorage and sessionStorage data.
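
Because browser_state.json is plain JSON (top-level "cookies" and "origins" keys), you can also mine it directly, for example to feed the cookies into httpx without relaunching a browser. A sketch, where the inline state stands in for a real saved file:

```python
import json

def cookies_from_storage_state(state_json: str) -> dict:
    """Extract a name -> value cookie dict from Playwright storage state JSON."""
    state = json.loads(state_json)
    return {c["name"]: c["value"] for c in state.get("cookies", [])}

# Stand-in for open("browser_state.json").read()
state_json = json.dumps({
    "cookies": [
        {"name": "session_id", "value": "abc123", "domain": "example.com",
         "path": "/", "expires": 1893456000, "httpOnly": True,
         "secure": True, "sameSite": "Lax"},
    ],
    "origins": [
        {"origin": "https://example.com",
         "localStorage": [{"name": "jwt", "value": "eyJ..."}]},
    ],
})

print(cookies_from_storage_state(state_json))  # {'session_id': 'abc123'}
```

The "origins" section is worth inspecting too: sites that keep auth tokens in localStorage rather than cookies will have them there.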

Session Rotation at Scale

When scraping at scale (thousands of requests), using a single cookie/session becomes a bottleneck. Websites can detect and block scrapers that make too many requests with the same session. The solution is to rotate through multiple authenticated sessions while pairing them with residential proxies like ThorData (https://thordata.partnerstack.com/partner/0a0x4nzh).

Session Pool Implementation

import httpx
import sqlite3
import json
from typing import Optional

class SessionPool:
    def __init__(self, db_path="session_pool.db"):
        self.db_path = db_path
        self.init_database()

    def init_database(self):
        """Initialize the session pool database"""
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS sessions (
                    id INTEGER PRIMARY KEY,
                    cookies TEXT NOT NULL,
                    request_count INTEGER DEFAULT 0,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    last_used TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    valid BOOLEAN DEFAULT 1
                )
            """)
            conn.commit()

    def add_session(self, cookies_dict):
        """Add a new session to the pool"""
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                INSERT INTO sessions (cookies)
                VALUES (?)
            """, (json.dumps(cookies_dict),))
            conn.commit()

    def get_session(self) -> Optional[dict]:
        """Get the least-used valid session"""
        with sqlite3.connect(self.db_path) as conn:
            # Pick the session with the fewest requests so load stays even
            cursor = conn.execute("""
                SELECT id, cookies FROM sessions
                WHERE valid = 1
                ORDER BY request_count ASC
                LIMIT 1
            """)
            row = cursor.fetchone()

        if not row:
            return None

        session_id, cookies_json = row
        return {
            "id": session_id,
            "cookies": json.loads(cookies_json)
        }

    def mark_used(self, session_id):
        """Increment request count and update last_used"""
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                UPDATE sessions
                SET request_count = request_count + 1,
                    last_used = CURRENT_TIMESTAMP
                WHERE id = ?
            """, (session_id,))
            conn.commit()

    def mark_invalid(self, session_id):
        """Mark a session as invalid (likely expired)"""
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                UPDATE sessions
                SET valid = 0
                WHERE id = ?
            """, (session_id,))
            conn.commit()

    def get_pool_stats(self):
        """Get statistics about the session pool"""
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute("""
                SELECT COUNT(*) as total,
                       SUM(CASE WHEN valid=1 THEN 1 ELSE 0 END) as active,
                       SUM(request_count) as total_requests
                FROM sessions
            """)
            total, active, total_requests = cursor.fetchone()

        return {
            "total_sessions": total or 0,
            "active_sessions": active or 0,
            "total_requests": total_requests or 0
        }

# Usage with multiple authenticated sessions
pool = SessionPool()

# Create 10 authenticated sessions
for i in range(10):
    # Each session represents a different account or IP
    client = httpx.Client()
    client.post("https://example.com/login", data={
        "username": f"account_{i}",
        "password": "secure_password"
    })
    pool.add_session(dict(client.cookies))

# Now use the pool to scrape with automatic session rotation
for page in range(1000):
    session = pool.get_session()

    if not session:
        print("No valid sessions available")
        break

    client = httpx.Client(cookies=session["cookies"])
    response = client.get(f"https://example.com/api/data?page={page}")

    if response.status_code == 401:
        # Session expired
        pool.mark_invalid(session["id"])
        print(f"Session {session['id']} marked invalid")
    else:
        pool.mark_used(session["id"])
        data = response.json()
        print(f"Page {page}: {len(data)} items")

print("Pool stats:", pool.get_pool_stats())

Pairing Sessions with Residential Proxies

ThorData provides residential proxies that rotate IP addresses. When combined with session rotation, you create a realistic traffic pattern that's harder to detect:

import httpx
import random
from typing import Optional

class ResidentialProxyPool:
    """Manage a pool of residential proxies from ThorData"""
    def __init__(self, api_key: str):
        self.api_key = api_key
        # ThorData residential proxy endpoint
        self.proxy_endpoint = "https://proxy.thordata.com"
        self.available_proxies = self._fetch_proxies()

    def _fetch_proxies(self) -> list:
        """Fetch available proxies from ThorData"""
        # This would integrate with ThorData's API
        # For now, return placeholder
        return [
            f"http://user:{self.api_key}@proxy{i}.thordata.com:8080"
            for i in range(1, 101)  # 100 proxies
        ]

    def get_proxy(self) -> str:
        """Get a random proxy from the pool"""
        return random.choice(self.available_proxies)

class ScraperWithProxyAndSessionRotation:
    def __init__(self, sessions: list, thordata_api_key: str):
        self.sessions = sessions
        self.proxy_pool = ResidentialProxyPool(thordata_api_key)
        self.current_session_idx = 0

    def scrape(self, url: str) -> Optional[dict]:
        """Scrape a URL using rotated sessions and proxies"""
        # Get the next session (round-robin)
        session = self.sessions[self.current_session_idx]
        self.current_session_idx = (self.current_session_idx + 1) % len(self.sessions)

        # Get a random proxy from ThorData
        proxy = self.proxy_pool.get_proxy()

        # Create a client with session cookies and proxy
        # Create a client with session cookies and proxy
        # (recent httpx versions take proxy=; the old proxies= argument was removed)
        client = httpx.Client(
            cookies=session["cookies"],
            proxy=proxy,
            headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
            }
        )

        try:
            response = client.get(url, timeout=30)

            if response.status_code == 200:
                return response.json()
            elif response.status_code == 401:
                # Session expired, this session needs re-authentication
                print("Session expired, needs re-authentication")
                return None
            else:
                print(f"Error: {response.status_code}")
                return None

        except Exception as e:
            print(f"Request error: {e}")
            return None
        finally:
            client.close()

# Usage
sessions = [
    {"cookies": {"session_id": "abc123"}},
    {"cookies": {"session_id": "def456"}},
    {"cookies": {"session_id": "ghi789"}},
]

scraper = ScraperWithProxyAndSessionRotation(
    sessions,
    thordata_api_key="YOUR_THORDATA_API_KEY"
)

for page in range(1000):
    data = scraper.scrape(f"https://example.com/api/data?page={page}")
    if data:
        print(f"Page {page}: {len(data)} items")

By rotating both sessions and residential proxies from ThorData (https://thordata.partnerstack.com/partner/0a0x4nzh), you create traffic that looks like multiple real users from different locations, making detection much harder.

Anti-Detection: Making Your Cookies Look Human

Websites don't just check whether a session is valid; they also check whether your behavior looks human. Request timing, header completeness, and the way your cookie jar accumulates over time can all fingerprint automation:

import httpx
import time
import random

# Anti-pattern: Requesting the same URL with the same cookies too quickly
client = httpx.Client()

for i in range(100):
    response = client.get("https://example.com/api/data")
    # No delay between requests - obviously not human

# Pro-pattern: Add realistic delays
for i in range(100):
    response = client.get("https://example.com/api/data")
    time.sleep(random.uniform(1, 3))  # Random delay like a human browsing

# Anti-pattern: Missing referer and other realistic headers
response = client.get("https://example.com/api/data")

# Pro-pattern: Include realistic headers
client = httpx.Client(
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate",
        "Referer": "https://example.com",
        "Connection": "keep-alive",
    }
)

response = client.get("https://example.com/api/data", headers={"Referer": "https://example.com/page"})

Real browsers don't get all cookies at once. They accumulate cookies through navigation:

import httpx
import time
import random

client = httpx.Client(follow_redirects=True)

# Anti-pattern: Jump straight to the API
response = client.get("https://example.com/api/data")

# Pro-pattern: Navigate through the site like a human would
time.sleep(random.uniform(1, 3))
response = client.get("https://example.com")  # Home page

time.sleep(random.uniform(1, 3))
response = client.get("https://example.com/products")  # Browse products

time.sleep(random.uniform(1, 3))
response = client.get("https://example.com/products/category/electronics")

time.sleep(random.uniform(2, 5))
response = client.get("https://example.com/api/data")  # Now make the API request

# By this point, the site has set multiple cookies across different pages
# Your cookie jar looks like a real browser, not a scraper

Putting the pieces together, a small wrapper class can apply these delays and Referer headers automatically:

import httpx
import random
import time

class HumanLikeClient:
    def __init__(self):
        self.client = httpx.Client(follow_redirects=True)
        self.last_request_time = 0
        self.last_url = None  # previous page, used as the Referer

    def _add_realistic_delay(self):
        """Add a human-like delay between requests"""
        elapsed = time.time() - self.last_request_time
        # Humans typically wait a few seconds between actions
        delay = random.uniform(2, 5) - elapsed
        if delay > 0:
            time.sleep(delay)

    def get(self, url: str, **kwargs) -> httpx.Response:
        """Make a GET request with human-like behavior"""
        self._add_realistic_delay()

        # Send the previously visited URL as the Referer, like a browser
        # (httpx.Client has no history attribute, so we track it ourselves)
        headers = kwargs.setdefault("headers", {})
        if "Referer" not in headers and self.last_url:
            headers["Referer"] = self.last_url

        response = self.client.get(url, **kwargs)
        self.last_request_time = time.time()
        self.last_url = url
        return response

    def post(self, url: str, **kwargs) -> httpx.Response:
        """Make a POST request with human-like behavior"""
        self._add_realistic_delay()
        response = self.client.post(url, **kwargs)
        self.last_request_time = time.time()
        return response

# Usage
client = HumanLikeClient()

# This will look much more like a human user
response = client.get("https://example.com")
response = client.get("https://example.com/products")
response = client.post("https://example.com/api/add-to-cart", json={"product_id": 123})

Debugging Cookie Problems

When things go wrong, you need visibility into what the server is setting and what your client is sending back. Here's how to debug cookie problems systematically.

import httpx
import json

client = httpx.Client()
response = client.get("https://example.com/login")

# View all cookies
print("All cookies:")
for name, value in client.cookies.items():
    print(f"  {name} = {value}")

# Export cookies as JSON for debugging
cookies_json = json.dumps({
    name: str(value)
    for name, value in client.cookies.items()
}, indent=2)
print("Cookies as JSON:")
print(cookies_json)

To see exactly which Set-Cookie headers each response carries, subclass httpx.Client and log them:

import httpx

class DebugClient(httpx.Client):
    def request(self, method, url, **kwargs):
        response = super().request(method, url, **kwargs)

        # Log Set-Cookie headers
        if "set-cookie" in response.headers:
            print(f"\nSet-Cookie headers from {url}:")
            for value in response.headers.get_list("set-cookie"):
                print(f"  {value}")

        # Log current cookies
        print(f"Cookies after {method} {url}:")
        for name, value in self.cookies.items():
            print(f"  {name} = {value[:50]}...")

        return response

# Usage
client = DebugClient()
response = client.post("https://example.com/login", data={
    "username": "user",
    "password": "pass"
})

Problem: 401 Unauthorized after login

import httpx

client = httpx.Client(follow_redirects=True)
response = client.post("https://example.com/login", data={
    "username": "user",
    "password": "pass"
})

print(f"Login status: {response.status_code}")
print(f"Cookies after login: {dict(client.cookies)}")

if response.status_code == 200 and not client.cookies:
    print("ERROR: Login succeeded (200) but no cookies were set")
    print("Possible causes:")
    print("  1. The site sets its session cookies via JavaScript")
    print("  2. Auth uses a token in the response body, not cookies")
    print("  3. The login credentials were silently rejected")
    print("\nSolution: inspect the response body, or use Playwright")

Problem: Redirects to login page

import httpx

client = httpx.Client(follow_redirects=True)
response = client.get("https://example.com/api/data")

if response.url.path == "/login":
    print("ERROR: Got redirected to login")
    print(f"Final URL: {response.url}")
    print(f"Cookies: {dict(client.cookies)}")

    if not client.cookies:
        print("Cookies are empty - need to authenticate first")
    else:
        print("Cookies exist but session is invalid")
        print("Possible causes:")
        print("  1. Session has expired")
        print("  2. Server detected automated access")
        print("  3. IP address changed (if using rotating proxies)")

Problem: 403 Forbidden

import httpx

client = httpx.Client()
client.post("https://example.com/login", data={
    "username": "user",
    "password": "pass"
})

response = client.get("https://example.com/protected-resource")

if response.status_code == 403:
    print("ERROR: 403 Forbidden")
    print("Possible causes:")
    print("  1. User doesn't have permission")
    print("  2. API key/token is invalid")
    print("  3. Request is missing required headers")
    print("  4. CSRF token is missing or expired")
    print("\nDebug info:")
    print(f"  Cookies: {dict(client.cookies)}")
    print(f"  Response preview: {response.text[:500]}")

Secure Cookie Storage

Never hardcode credentials or store unencrypted cookies on disk. Here's how to handle them securely:

import json
import httpx
from cryptography.fernet import Fernet
from pathlib import Path

class SecureCookieStore:
    def __init__(self, storage_path=".secure_cookies"):
        self.storage_path = Path(storage_path)
        self.key_path = self.storage_path / ".key"
        self.cipher = self._setup_cipher()

    def _setup_cipher(self):
        """Set up encryption using a stored key"""
        if not self.key_path.exists():
            # Generate and store a new key
            self.storage_path.mkdir(exist_ok=True)
            key = Fernet.generate_key()
            self.key_path.write_bytes(key)
            # Set restrictive permissions
            self.key_path.chmod(0o600)

        key = self.key_path.read_bytes()
        return Fernet(key)

    def save_cookies(self, account: str, cookies: dict):
        """Save cookies encrypted to disk"""
        cookies_json = json.dumps(cookies)
        encrypted = self.cipher.encrypt(cookies_json.encode())

        cookie_path = self.storage_path / f"{account}.enc"
        cookie_path.write_bytes(encrypted)
        print(f"Cookies saved securely for {account}")

    def load_cookies(self, account: str) -> dict:
        """Load and decrypt cookies"""
        cookie_path = self.storage_path / f"{account}.enc"

        if not cookie_path.exists():
            return {}

        encrypted = cookie_path.read_bytes()
        decrypted = self.cipher.decrypt(encrypted).decode()
        return json.loads(decrypted)

# Usage
store = SecureCookieStore()

client = httpx.Client()
response = client.post("https://example.com/login", data={
    "username": "user",
    "password": "pass"
})

# Save cookies encrypted
store.save_cookies("[email protected]", dict(client.cookies))

# Later, load and use cookies
cookies = store.load_cookies("[email protected]")
client = httpx.Client(cookies=cookies)
response = client.get("https://example.com/api/user")
print(response.json())

Cookie Handling in Scrapy

If you're using Scrapy for large-scale scraping, cookies are managed automatically by its built-in cookie middleware. Understanding how it works still helps when debugging:

# In your Scrapy settings.py
COOKIES_ENABLED = True   # the default; CookiesMiddleware is on out of the box
COOKIES_DEBUG = False    # set to True to log every cookie sent and received

# No DOWNLOADER_MIDDLEWARES entry is needed: CookiesMiddleware is already
# registered by default (at priority 700), and re-adding it at a different
# priority can change the middleware ordering

# In your spider
import scrapy

class MySpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        """Default callback for start_urls: handle the login page"""
        # Extract CSRF token
        csrf_token = response.css('input[name="_csrf"]::attr(value)').get()

        # Submit login form
        # Cookies are automatically managed by Scrapy's CookiesMiddleware
        yield scrapy.FormRequest(
            url="https://example.com/login",
            formdata={
                "username": "user",
                "password": "pass",
                "_csrf": csrf_token
            },
            callback=self.parse_authenticated
        )

    def parse_authenticated(self, response):
        """Now we're authenticated"""
        # Scrapy automatically maintains cookies
        # Make authenticated requests
        yield scrapy.Request(
            url="https://example.com/api/user",
            callback=self.parse_user
        )

    def parse_user(self, response):
        """Parse authenticated response"""
        data = response.json()
        yield data

Scrapy's cookie middleware is automatic, but you can customize it:

# Custom spider attribute to control cookies
class MySpider(scrapy.Spider):
    name = "example"

    # Disable cookies for this spider
    custom_settings = {
        'COOKIES_ENABLED': False,
    }

Building a Production SessionManager Class

Let's build a complete SessionManager that handles authentication, persistence, expiration, and rotation:

import httpx
import json
import sqlite3
import time
import random
from datetime import datetime, timedelta
from typing import Optional, Dict, List
import hashlib

class SessionManager:
    """
    Production-grade session management with:
    - Automatic login and cookie persistence
    - Session expiration and automatic re-login
    - Session rotation for distributed scraping
    - Secure storage
    """

    def __init__(
        self,
        db_path: str = "sessions.db",
        base_url: str = "https://example.com"
    ):
        self.db_path = db_path
        self.base_url = base_url
        self.init_database()
        self.active_sessions = {}

    def init_database(self):
        """Initialize the session database"""
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS sessions (
                    id INTEGER PRIMARY KEY,
                    account_hash TEXT UNIQUE NOT NULL,
                    credentials TEXT NOT NULL,
                    cookies TEXT NOT NULL,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    last_used TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    expires_at TIMESTAMP,
                    is_valid BOOLEAN DEFAULT 1,
                    request_count INTEGER DEFAULT 0
                )
            """)
            conn.commit()

    def _hash_account(self, username: str, email: str = None) -> str:
        """Create a hash of account credentials for privacy"""
        identifier = f"{username}:{email or ''}"
        return hashlib.sha256(identifier.encode()).hexdigest()[:16]

    def create_session(
        self,
        username: str,
        password: str,
        email: str = None,
        login_url: str = None
    ) -> bool:
        """
        Authenticate a new account and store the session
        """
        account_hash = self._hash_account(username, email)

        # Check if we already have a valid session for this account
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute(
                "SELECT id, expires_at FROM sessions WHERE account_hash = ? AND is_valid = 1",
                (account_hash,)
            )
            existing = cursor.fetchone()

            if existing:
                session_id, expires_at = existing
                # Check if session is still valid
                if expires_at and datetime.fromisoformat(expires_at) > datetime.utcnow():
                    print(f"Using existing valid session for account {username}")
                    return True

        # Need to authenticate
        print(f"Authenticating account {username}...")

        try:
            client = httpx.Client(follow_redirects=True)

            # Perform login
            login_endpoint = login_url or f"{self.base_url}/login"
            response = client.post(
                login_endpoint,
                data={"username": username, "password": password},
                timeout=30
            )

            # Note: many sites return 200 even on failed logins (re-rendering
            # the form), so in production also verify the response body or a
            # known post-login URL
            if response.status_code != 200 or not client.cookies:
                print(f"Login failed for {username}: {response.status_code}")
                return False

            # Store the session
            cookies_json = json.dumps(dict(client.cookies))
            credentials_json = json.dumps({
                "username": username,
                "password": password,
                "email": email
            })

            with sqlite3.connect(self.db_path) as conn:
                conn.execute("""
                    INSERT OR REPLACE INTO sessions
                    (account_hash, credentials, cookies, expires_at)
                    VALUES (?, ?, ?, ?)
                """, (
                    account_hash,
                    credentials_json,
                    cookies_json,
                    (datetime.utcnow() + timedelta(hours=24)).isoformat()
                ))
                conn.commit()

            print(f"Session created for {username}")
            return True

        except Exception as e:
            print(f"Error creating session for {username}: {e}")
            return False

    def get_session_client(self, username: str, email: str = None) -> Optional[httpx.Client]:
        """Get an authenticated HTTP client for an account"""
        # Pass the same email used in create_session so the account hashes match
        account_hash = self._hash_account(username, email)

        # Load from database
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute(
                """SELECT cookies, expires_at FROM sessions
                   WHERE account_hash = ? AND is_valid = 1
                   ORDER BY last_used DESC LIMIT 1""",
                (account_hash,)
            )
            row = cursor.fetchone()

        if not row:
            return None

        cookies_json, expires_at = row

        # Check expiration
        if datetime.fromisoformat(expires_at) < datetime.utcnow():
            print(f"Session for {username} has expired")
            return None

        # Create client with cookies
        cookies = json.loads(cookies_json)
        client = httpx.Client(
            cookies=cookies,
            base_url=self.base_url,
            timeout=30,
            follow_redirects=True
        )

        return client

    def mark_session_invalid(self, username: str, email: str = None):
        """Mark a session as invalid (usually because it failed)"""
        # Same email as in create_session, so the hashes match
        account_hash = self._hash_account(username, email)

        with sqlite3.connect(self.db_path) as conn:
            conn.execute(
                "UPDATE sessions SET is_valid = 0 WHERE account_hash = ?",
                (account_hash,)
            )
            conn.commit()

        print(f"Session for {username} marked invalid")

    def get_session_pool(self, limit: int = None) -> List[httpx.Client]:
        """Get multiple authenticated clients for distributed scraping"""
        with sqlite3.connect(self.db_path) as conn:
            # Compare against a Python-side ISO timestamp: expires_at is
            # stored via isoformat() (with a 'T' separator), so comparing it
            # to SQLite's datetime('now') (space separator) would misorder
            cursor = conn.execute(
                """SELECT cookies FROM sessions
                   WHERE is_valid = 1 AND expires_at > ?
                   ORDER BY request_count ASC
                   LIMIT ?""",
                (datetime.utcnow().isoformat(), limit or 1000)
            )
            rows = cursor.fetchall()

        clients = []
        for (cookies_json,) in rows:
            cookies = json.loads(cookies_json)
            client = httpx.Client(
                cookies=cookies,
                base_url=self.base_url,
                timeout=30
            )
            clients.append(client)

        return clients

# Complete example
def main():
    # Create manager
    manager = SessionManager(
        db_path="production_sessions.db",
        base_url="https://example.com"
    )

    # Create sessions for multiple accounts
    accounts = [
        ("user1", "password1", "[email protected]"),
        ("user2", "password2", "[email protected]"),
        ("user3", "password3", "[email protected]"),
    ]

    for username, password, email in accounts:
        manager.create_session(username, password, email)

    # Get a rotation pool
    pool = manager.get_session_pool(limit=3)
    print(f"Pool size: {len(pool)}")

    # Scrape with automatic rotation
    for page in range(100):
        if not pool:
            print("No sessions available")
            break

        client = random.choice(pool)
        try:
            response = client.get(f"/api/data?page={page}")
            if response.status_code == 200:
                print(f"Page {page}: Success")
            else:
                print(f"Page {page}: Failed ({response.status_code})")
        except Exception as e:
            print(f"Page {page}: Error ({e})")

        time.sleep(random.uniform(1, 3))

if __name__ == "__main__":
    main()

Complete Production-Ready Authenticated Scraper

Here's a complete example that ties everything together:

#!/usr/bin/env python3
"""
Production authenticated scraper with:
- Session management and persistence
- Automatic retry with exponential backoff
- Proxy rotation with ThorData
- Session rotation
- Error recovery
"""

import httpx
import time
import random
import logging
from typing import Optional, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProductionAuthenticatedScraper:
    def __init__(
        self,
        base_url: str,
        session_manager,
        proxy_pool=None,
        max_retries: int = 3,
        backoff_factor: float = 1.5
    ):
        self.base_url = base_url
        self.session_manager = session_manager
        self.proxy_pool = proxy_pool
        self.max_retries = max_retries
        self.backoff_factor = backoff_factor
        self.request_count = 0
        self.error_count = 0

    def _get_proxy(self) -> Optional[str]:
        """Get a proxy from the pool (ThorData or similar)"""
        if self.proxy_pool:
            return random.choice(self.proxy_pool)
        return None

    def _build_headers(self) -> dict:
        """Build realistic request headers"""
        user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        ]

        return {
            "User-Agent": random.choice(user_agents),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        }

    def scrape_with_retry(self, url: str, username: str = None) -> Optional[dict]:
        """
        Scrape a URL with automatic retry and session management
        """
        for attempt in range(self.max_retries):
            try:
                # httpx only honors a proxy passed at construction time;
                # assigning client.proxies after the fact has no effect
                proxy = self._get_proxy()

                # Get authenticated client
                if username:
                    # Routing session clients through the proxy would require
                    # SessionManager.get_session_client to accept a proxy arg
                    client = self.session_manager.get_session_client(username)
                    if not client:
                        logger.warning(f"No valid session for {username}")
                        return None
                else:
                    client = httpx.Client(base_url=self.base_url, proxy=proxy)

                # Add realistic headers
                client.headers.update(self._build_headers())

                # Add human-like delay
                time.sleep(random.uniform(1, 3))

                # Make request
                response = client.get(url, timeout=30)
                self.request_count += 1

                logger.info(f"Request {self.request_count}: {response.status_code} {url}")

                # Handle different status codes
                if response.status_code == 200:
                    try:
                        return response.json()
                    except ValueError:
                        # Not JSON; fall back to a truncated HTML preview
                        return {"html": response.text[:1000]}

                elif response.status_code == 401:
                    if username:
                        logger.warning(f"Session expired for {username}, invalidating...")
                        self.session_manager.mark_session_invalid(username)
                    else:
                        logger.warning("401 Unauthorized on unauthenticated request")
                    return None

                elif response.status_code == 429:
                    # Rate limited
                    wait_time = 2 ** attempt * self.backoff_factor
                    logger.warning(f"Rate limited, waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue

                elif response.status_code == 403:
                    logger.error("403 Forbidden, possible IP block or auth issue")
                    return None

                else:
                    logger.warning(f"Unexpected status {response.status_code}")
                    return None

            except httpx.ConnectError as e:
                logger.warning(f"Connection error (attempt {attempt+1}/{self.max_retries}): {e}")
                self.error_count += 1
                wait_time = 2 ** attempt * self.backoff_factor
                time.sleep(wait_time)

            except httpx.TimeoutException as e:
                logger.warning(f"Timeout (attempt {attempt+1}/{self.max_retries}): {e}")
                self.error_count += 1
                time.sleep(2 ** attempt * self.backoff_factor)

            except Exception as e:
                logger.error(f"Unexpected error: {e}")
                return None

        logger.error(f"Failed after {self.max_retries} attempts")
        return None

    def scrape_paginated(
        self,
        endpoint: str,
        username: str = None,
        max_pages: int = 100
    ) -> List[dict]:
        """
        Scrape paginated endpoint
        """
        results = []

        for page in range(1, max_pages + 1):
            url = f"{endpoint}?page={page}"

            data = self.scrape_with_retry(url, username)

            if data is None:
                logger.warning(f"Failed to scrape page {page}, stopping")
                break

            if isinstance(data, dict) and "items" in data:
                results.extend(data["items"])

                if len(data["items"]) == 0:
                    logger.info(f"No more items after page {page}")
                    break

            logger.info(f"Page {page}: {len(data.get('items', []))} items")

        return results

    def get_statistics(self) -> dict:
        """Get scraping statistics"""
        return {
            "total_requests": self.request_count,
            "errors": self.error_count,
            "error_rate": self.error_count / max(self.request_count, 1)
        }

# Usage example
if __name__ == "__main__":
    # SessionManager from the previous section, saved as session_manager.py
    from session_manager import SessionManager

    # Initialize session manager
    manager = SessionManager(
        db_path="production_sessions.db",
        base_url="https://api.example.com"
    )

    # Create authenticated sessions
    manager.create_session("user1", "password1")
    manager.create_session("user2", "password2")

    # Initialize scraper
    scraper = ProductionAuthenticatedScraper(
        base_url="https://api.example.com",
        session_manager=manager,
        # Disabled in dev; in production, supply a list of ThorData
        # residential proxy URLs:
        # https://thordata.partnerstack.com/partner/0a0x4nzh
        proxy_pool=None
    )

    # Scrape paginated endpoint
    results = scraper.scrape_paginated(
        "/api/items",
        username="user1",
        max_pages=100
    )

    print(f"Scraped {len(results)} items")
    print(f"Statistics: {scraper.get_statistics()}")

Troubleshooting Guide

Problem: Getting logged out after a few requests

Symptoms: First request works, then subsequent requests get 401 or redirect to login.

Common causes:

  1. Session timeout (server-side)
  2. Changing IP address (if using proxies without persistent sessions)
  3. User-Agent mismatch
  4. Cookie attributes not being respected

Solutions:

# Solution 1: Keep User-Agent consistent
client = httpx.Client(
    headers={"User-Agent": "Mozilla/5.0... (fixed)"}
)

# Solution 2: Check session expiration in Set-Cookie headers
response = client.get(...)
print(response.headers.get_list("set-cookie"))

# Solution 3: Use proxy persistence (pair with ThorData sessions)
# Solution 4: Check if server binds sessions to IP
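
Solutions 3 and 4 deserve a quick sketch. Residential providers such as ThorData typically pin a session to a single exit IP by encoding a session ID in the proxy username, so the server sees a stable address for the whole login session. The username format below is an assumption for illustration; check your provider's docs for the real syntax:

```python
import uuid

def sticky_proxy_url(user: str, password: str, host: str, port: int) -> str:
    """Build a sticky-session proxy URL (hypothetical username format)."""
    # Reuse the same session_id for every request in an authenticated
    # session; a new ID means a new exit IP and likely a logged-out session
    session_id = uuid.uuid4().hex[:8]
    return f"http://{user}-session-{session_id}:{password}@{host}:{port}"

proxy = sticky_proxy_url("myuser", "mypass", "proxy.example.com", 8080)
print(proxy)
```

Pass the resulting URL to httpx.Client(proxy=...) at construction time and keep using that one client for the whole authenticated session.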

Problem: Empty responses from authenticated endpoints

Symptoms: Login works, but API responses are empty or contain errors.

Common causes:

  1. JavaScript is rendering the content
  2. An anti-bot system is blocking you
  3. The API requires specific headers

Solutions:

# Crude heuristic: if the data you need is missing from the raw HTML of a
# script-heavy page, it is probably rendered client-side
if "<script" in response.text:
    print("Site uses JavaScript, may need Playwright")

# Check for anti-bot challenge pages (case-insensitive match)
if "cloudflare" in response.text.lower():
    print("Cloudflare detected, need Playwright")

# Add required API headers
client.headers.update({
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json",
})

Problem: CSRF token errors (403 Forbidden)

Symptoms: Form submission gets 403, API requests get invalid CSRF errors.

Common causes:

  1. CSRF token not extracted correctly
  2. CSRF token has expired
  3. CSRF token belongs to a different form

Solutions:

# Extract CSRF token fresh for each form
response = client.get("https://example.com/form")
csrf_token = extract_csrf_token(response.text)

# Include in POST
response = client.post(
    "https://example.com/submit",
    data={
        "data": "value",
        "_csrf": csrf_token
    }
)

# Some sites require CSRF in headers too
client.headers["X-CSRF-Token"] = csrf_token

Problem: Too many failed requests, then IP gets blocked

Symptoms: Requests work fine for a while, then all requests start failing with 403 or timeout.

Common cause: Rate limiting, behavioral detection.

Solutions:

  1. Add delays between requests
  2. Vary User-Agent and headers
  3. Use residential proxies that rotate (ThorData at https://thordata.partnerstack.com/partner/0a0x4nzh)
  4. Respect robots.txt and crawl-delay
  5. Use fewer concurrent connections

# Implement backoff with jitter; sleep on both errors and non-200 responses
for attempt in range(3):
    try:
        response = client.get(url, timeout=30)
        if response.status_code == 200:
            break
    except httpx.HTTPError:
        pass
    wait = 2 ** attempt + random.uniform(0, 1)
    time.sleep(wait)

Conclusion

Cookies and sessions are fundamental to web scraping. Understanding them at the protocol level, handling them correctly in your code, and implementing proper persistence and rotation strategies separates working scrapers from ones that fail constantly.

The key takeaways:

  1. Start with httpx: It handles cookies automatically and is modern, fast, and async-friendly.
  2. Extract cookies from Set-Cookie headers: They contain domain, path, and expiration information that matters.
  3. Persist sessions to disk: Use JSON for simple cases, SQLite for production.
  4. Detect and handle auth failures: Check for 401/403 status codes and redirects to login pages.
  5. Use Playwright when httpx fails: For JavaScript rendering and anti-bot bypass.
  6. Rotate sessions at scale: Use a session pool paired with residential proxies (like ThorData) for distributed scraping.
  7. Make your behavior look human: Add delays, vary headers, navigate like a real user.
  8. Debug with logging: Always log Set-Cookie headers and current cookies to understand failures.
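
Takeaway #2 is easy to verify yourself: the standard library's SimpleCookie parses a raw Set-Cookie header value into the cookie plus its attributes (the header below is made up for illustration):

```python
from http.cookies import SimpleCookie

raw = "sessionid=abc123; Path=/; Domain=example.com; Max-Age=3600; HttpOnly"
jar = SimpleCookie()
jar.load(raw)

# Each cookie becomes a Morsel carrying its value and attributes
morsel = jar["sessionid"]
print(morsel.value)        # abc123
print(morsel["domain"])    # example.com
print(morsel["path"])      # /
print(morsel["max-age"])   # 3600
```

Logging these attributes when a session mysteriously dies usually reveals the cause: a short Max-Age, a Domain that doesn't match your request URL, or a Path that excludes the endpoint you're hitting.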

The code examples in this guide are production-tested patterns used in real scraping infrastructure. Adapt them to your specific needs, and you'll handle any authentication challenge the web throws at you.

For large-scale operations requiring residential IPs and advanced session rotation, consider tools like ThorData (https://thordata.partnerstack.com/partner/0a0x4nzh) which integrate seamlessly with the session management patterns shown here.