Handling Cookies and Sessions in Python Web Scrapers (2026 Guide)
Introduction: Why Cookies Break Your Scraper (And How to Fix It)
If you've spent hours debugging a scraper only to discover you're getting 403 errors, empty responses, or redirect loops back to the login page, you've encountered the cookie problem. Cookies are the web's primary mechanism for maintaining stateful interactions, and they're invisible until they're not. Understanding how to handle cookies and sessions isn't just a nice-to-have feature in web scraping—it's often the difference between a working scraper and months of frustration.
The web in 2026 is built on stateful authentication. Every meaningful action you want to automate—logging into an account, adding items to a cart, accessing an authenticated API, exporting data—requires the server to remember who you are between requests. Cookies and sessions are how that happens. When you skip learning to handle them properly, you're fighting against the fundamental architecture of the modern web.
This guide approaches cookies from first principles. We'll start with what cookies actually are at the HTTP protocol level, why web servers use them, and what happens when your scraper doesn't handle them correctly. Then we'll move into practical implementations: how to extract login credentials, maintain authenticated sessions across hundreds of requests, persist sessions to disk so you don't need to re-authenticate every time your scraper runs, rotate sessions at scale, and detect when authentication has failed so you can automatically re-login.
The reality of production web scraping is that sites actively defend against automated access. Cookies are one of their primary weapons. They can detect if your requests lack the cookie structure of a real browser. They can expire sessions based on behavior patterns. They can use cookies to fingerprint your automation framework. And when you're scraping at scale, managing hundreds or thousands of active sessions while rotating through residential proxies (we'll discuss tools like ThorData for this), cookie management becomes a complex orchestration problem.
This guide covers all of it: from basic httpx cookie jars to building a production SessionManager class that handles auto-login, cookie persistence, rotation, and fallback strategies. You'll learn why Playwright is sometimes necessary when httpx alone isn't enough. You'll see exactly what to watch for in HTTP responses to know when your session is dying. And you'll have multiple real-world code examples you can adapt immediately.
By the end, you'll understand cookies at a deeper level than most web developers, and you'll have the tools to automate any authenticated flow that exists on the web.
HTTP Cookies: The Protocol Level
To handle cookies correctly in a scraper, you need to understand what's actually happening when a server sets a cookie and when your client sends it back. This isn't magic—it's just HTTP headers.
The Set-Cookie Response Header
When a server wants to set a cookie, it includes a Set-Cookie header in the HTTP response. A typical example from an HTTPS response on a large e-commerce site might look like this:
HTTP/1.1 200 OK
Set-Cookie: session_id=abc123def456; Path=/; Domain=.example.com; Secure; HttpOnly; SameSite=Strict; Max-Age=3600
Set-Cookie: user_pref=dark_mode; Path=/; Domain=.example.com; Secure; Max-Age=31536000
Each Set-Cookie header creates or updates one cookie. The format is:
Set-Cookie: name=value; attribute1=value1; attribute2=value2; ...
Let's break down the attributes you'll encounter:
Name=Value: The actual data. session_id=abc123def456 stores a string that the server will recognize. The server checks this value when you send it back to prove you're authenticated.
Domain: Controls which domains can access this cookie. Domain=.example.com means the browser (or your scraper) will send this cookie to example.com, www.example.com, api.example.com—anything under example.com. If no Domain is set, the cookie is only sent to the exact domain that set it. This matters when scraping sites with multiple subdomains; you might need to follow redirects to different subdomains to pick up cookies set there.
Path: Limits which URL paths on that domain receive the cookie. Path=/api means the cookie is only sent to requests like GET /api/users. Path=/ means all paths. When scraping, you usually don't need to worry about this—the server handles it—but it's worth knowing if you're debugging cookie behavior.
Secure: This cookie is only sent over HTTPS, never over unencrypted HTTP. Any modern site uses this. It's a flag, not a value (the presence of the word "Secure" means it's enabled).
HttpOnly: The cookie cannot be accessed by JavaScript, only sent in HTTP requests. This is a security feature to prevent XSS attacks from stealing session tokens. For scraping, it's transparent—you don't read the cookie value in JavaScript, you just receive it in Set-Cookie headers and send it back.
SameSite: Controls when the cookie is sent in cross-site requests. SameSite=Strict means the cookie is only sent to requests on the exact same site. SameSite=Lax (the modern default) allows the cookie in top-level navigation but not in cross-site subrequests. SameSite=None requires Secure and allows cross-site requests. For scraping on the same domain, this is usually irrelevant—you're making same-site requests.
Expires/Max-Age: When the cookie expires. Expires=Wed, 31-Dec-2025 23:59:59 GMT is an absolute time. Max-Age=3600 is seconds from now. If neither is set, the cookie is a "session cookie" that expires when the browser closes. For your scraper, a session cookie lives as long as your cookie jar does: keep the same client object alive for the whole run, or persist the cookies yourself (covered later) if you need them across runs.
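You can inspect these attributes from Python's standard library, which is handy when debugging raw responses. A quick sketch using http.cookies, fed the example header from above:

```python
from http.cookies import SimpleCookie

# Parse a raw Set-Cookie header value (the example from above)
raw = "session_id=abc123def456; Path=/; Secure; HttpOnly; SameSite=Strict; Max-Age=3600"
jar = SimpleCookie()
jar.load(raw)

morsel = jar["session_id"]
print(morsel.value)       # abc123def456
print(morsel["path"])     # /
print(morsel["max-age"])  # 3600
```

Each parsed cookie is a Morsel whose attributes (path, max-age, samesite, and so on) are accessible by key.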
The Cookie Request Header
After receiving Set-Cookie headers, your scraper needs to send the cookie back to the server in subsequent requests. This happens via the Cookie header:
GET /api/user/profile HTTP/1.1
Host: example.com
Cookie: session_id=abc123def456; user_pref=dark_mode
The Cookie header is simple: it's just name=value pairs separated by semicolons. The server uses the session_id value to look up your session in its database or verify your JWT token, and then serves you authenticated content.
This is where most cookie-related scraping bugs occur. Your scraper needs to: 1. Extract cookies from Set-Cookie headers in responses 2. Store them in a cookie jar 3. Automatically send them in the Cookie header for subsequent requests to the same domain 4. Handle cookie expiration 5. Know when cookies are invalid (when the server starts rejecting them)
Modern HTTP libraries like httpx and requests abstract away most of this complexity, but when things go wrong, understanding the protocol lets you debug effectively.
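To make those five steps concrete, here is roughly what an HTTP library does for you under the hood, sketched with plain dicts and the standard library (no real network involved; the headers are the examples from above):

```python
from http.cookies import SimpleCookie

def store_cookies(jar, set_cookie_headers):
    """Steps 1-2: extract cookies from Set-Cookie headers into a jar."""
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        for name, morsel in parsed.items():
            jar[name] = morsel.value

def build_cookie_header(jar):
    """Step 3: serialize the jar into a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())

jar = {}
store_cookies(jar, [
    "session_id=abc123def456; Path=/; Secure; HttpOnly",
    "user_pref=dark_mode; Path=/",
])
print(build_cookie_header(jar))
# session_id=abc123def456; user_pref=dark_mode
```

A real jar also tracks domain, path, and expiry per cookie (steps 4-5); this sketch keeps only name/value to show the round trip.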
Sessions vs. Cookies: Understanding the Relationship
Cookies are the transport mechanism. Sessions are what they represent.
A cookie is a piece of data your scraper stores and sends in HTTP headers. A session is the server-side state associated with that data. Here's how it works:
When you log into a website, you POST your username and password. The server validates them against its database. If valid, the server creates a new session object (usually just a row in a database table or an entry in memory) containing information like:
{
    "session_id": "abc123def456",
    "user_id": 42,
    "username": "john_doe",
    "created_at": "2026-03-31T10:00:00Z",
    "last_activity": "2026-03-31T10:00:00Z",
    "ip_address": "203.0.113.45",
    "user_agent": "Mozilla/5.0..."
}
The server sends back a Set-Cookie header with that session_id:
Set-Cookie: session_id=abc123def456; Path=/; Max-Age=3600
Now your scraper stores this cookie and sends it in every request. When you request /api/user/profile, the server receives the Cookie header, looks up session_id=abc123def456 in its session table, finds that it belongs to user 42, and serves that user's profile.
This design is elegant: the server doesn't need to verify your password on every request. It just needs to recognize the session ID. But it has implications for scraping:
Session expiration: The server can invalidate sessions after a time period (Max-Age) or after a period of inactivity. Your scraper needs to detect when this happens and re-authenticate.
Session binding: Modern sites bind sessions to IP addresses or user agents. If you're scraping through a rotating proxy, your IP changes with each request. The server might invalidate your session, thinking someone hijacked it. This is why pairing residential proxies (like ThorData at https://thordata.partnerstack.com/partner/0a0x4nzh) with proper session management is crucial for large-scale scraping.
Session validation: The server might include additional checks. It might verify that the User-Agent in your request matches the User-Agent from when the session was created. It might check the Referer header. It might require tokens to rotate.
JWT Tokens: Cookies for Stateless APIs
Modern APIs use JSON Web Tokens (JWTs) instead of server-side sessions. A JWT is a self-contained token that proves authentication. When you log in, the server returns a token:
{
    "access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyX2lkIjo0MiwiZXhwIjoxNjE2NDI4MDAwfQ.signature",
    "token_type": "Bearer",
    "expires_in": 3600
}
You then send this token in the Authorization header:
GET /api/user/profile HTTP/1.1
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...
The server verifies the token cryptographically (without needing a database lookup) and serves the request. This is stateless—the server doesn't maintain a session table.
For scraping, JWTs usually come in two forms:
- Returned in the login response JSON, then stored and sent in the Authorization header
- Set as a cookie by the server, then automatically sent by your HTTP client
Some sites do both—they set a JWT as a cookie and also return it in JSON. We'll cover both patterns in the code examples.
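When the token comes back in JSON, you can read its expiry without verifying the signature, since the payload is just base64url-encoded JSON. A small sketch using the example token from above (the helper name is ours, not a library API):

```python
import base64
import json

def jwt_expiry(token):
    """Decode the (unverified) JWT payload and return its 'exp' claim."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")

token = ("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
         "eyJ1c2VyX2lkIjo0MiwiZXhwIjoxNjE2NDI4MDAwfQ."
         "signature")
print(jwt_expiry(token))  # 1616428000
```

Comparing that Unix timestamp against time.time() tells you whether to refresh before making the next request. Only the server can actually validate the token; this is purely for scheduling refreshes.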
httpx: Modern Cookie Handling
The httpx library is the Python HTTP client for 2026. It handles cookies automatically with intelligent defaults, and it's significantly better than the older requests library for scrapers that need to handle complex authentication flows.
Basic Cookie Jar Functionality
When you create an httpx.Client, it automatically maintains a cookie jar:
import httpx
client = httpx.Client()
# First request: the server sets a cookie
response = client.post("https://example.com/login", data={
    "username": "john_doe",
    "password": "secure_password"
})
# Second request: the cookie is automatically sent
response = client.get("https://example.com/api/user/profile")
# httpx automatically included the Set-Cookie from the first response in this request
print(client.cookies)
# Output: <Cookies([('session_id', 'abc123def456')])>
The client.cookies object (an httpx.Cookies instance) behaves like a dictionary:
# Access a specific cookie
session_id = client.cookies.get("session_id")
print(session_id) # abc123def456
# Iterate over all cookies
for name, value in client.cookies.items():
    print(f"{name}={value}")
# Check if a cookie exists
if "session_id" in client.cookies:
    print("Session is valid")
# Add a cookie manually (rarely needed)
client.cookies.set("custom_cookie", "custom_value", domain="example.com")
The automatic cookie handling is the key feature. As long as you use the same client object for all requests to a domain, cookies are handled transparently. This is very different from making raw curl requests where you manually extract and pass cookies.
Cookie Parameters and Control
When creating an httpx.Client, you can customize cookie behavior:
import httpx
# Default behavior: cookies are managed automatically
client = httpx.Client()
# Seed the client with initial cookies
# (httpx has no switch to disable cookie handling entirely;
# pass in cookies you already have instead)
client = httpx.Client(
    cookies={"existing_cookie": "value"},  # Seed with initial cookies
    follow_redirects=True,  # Follow redirects and maintain cookies through them
)
# Get all cookies as a dictionary
cookies_dict = dict(client.cookies)
# Clear all cookies
client.cookies.clear()
# Delete a specific cookie
del client.cookies["session_id"]
One critical setting for authenticated scraping:
# Always use follow_redirects=True
# Otherwise, you might miss Set-Cookie headers in redirect responses
client = httpx.Client(follow_redirects=True)
# This is even more important when the login flow involves redirects
response = client.post("https://example.com/login", data={
    "username": "user",
    "password": "pass",
})
# The server responds with a 302 redirect.
# With follow_redirects=True, httpx follows it and collects cookies from
# every hop. With follow_redirects=False, you stop at the 302 and never
# see cookies set by the pages it redirects to.
Detecting Cookie Issues
Here's the challenge: when your scraper stops working, is it a cookie problem? These signs indicate cookie issues:
import httpx
client = httpx.Client(follow_redirects=True)
# Sign 1: Redirects back to login
response = client.get("https://example.com/api/data")
if response.url.path == "/login":
    print("ERROR: Got redirected to login. Session expired or invalid.")
# Sign 2: 403 Forbidden (permission denied, usually an auth issue)
if response.status_code == 403:
    print("ERROR: 403 Forbidden. Cookies might be invalid or session expired.")
# Sign 3: 401 Unauthorized
if response.status_code == 401:
    print("ERROR: 401 Unauthorized. Need to re-authenticate.")
# Sign 4: Empty or unexpected response
if len(response.text) < 100:
    print("WARNING: Got unusually small response. Could indicate session issues.")
# Sign 5: Login leaves the cookie jar empty (the auth failed)
# With follow_redirects=True the Set-Cookie header may arrive on an
# intermediate redirect rather than the final response, so check the jar
# instead of the final response's headers.
response = client.post("https://example.com/login", data={"user": "u", "pass": "p"})
if not client.cookies:
    print("ERROR: No session cookie set. Login failed. Check credentials.")
# Inspect what cookies are currently set
print("Current cookies:", dict(client.cookies))
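Those checks can be folded into one heuristic helper you call after every response. A sketch; the function name and thresholds are illustrative, not an httpx API:

```python
def session_looks_dead(status_code, final_path, body_length):
    """Combine the signs above into a single heuristic."""
    if status_code in (401, 403):
        return True  # explicit auth failure
    if final_path.rstrip("/").endswith("/login"):
        return True  # redirected back to the login page
    if body_length < 100:
        return True  # suspiciously small body, often an error stub
    return False

print(session_looks_dead(200, "/api/data", 5000))  # False
print(session_looks_dead(200, "/login", 5000))     # True
```

In a real scraper you would call this with response.status_code, response.url.path, and len(response.text), and trigger a re-login when it returns True.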
Requests.Session vs. httpx.Client: When to Use Each
The requests library dominated Python HTTP for years, and many older tutorials show requests.Session(). Let's compare to understand when to use each.
import requests
import httpx
# Both libraries provide session objects
requests_session = requests.Session()
httpx_client = httpx.Client()
# Functionally similar at the surface
requests_session.post("https://example.com/login", data={"user": "u", "pass": "p"})
response = requests_session.get("https://example.com/api/data")
httpx_client.post("https://example.com/login", data={"user": "u", "pass": "p"})
response = httpx_client.get("https://example.com/api/data")
Here's why you should prefer httpx for modern scraping:
1. httpx supports async/await out of the box
import httpx
import asyncio
async def scrape_concurrently():
    async with httpx.AsyncClient() as client:
        # Make 10 requests concurrently, not serially
        tasks = [
            client.get(f"https://example.com/page/{i}")
            for i in range(10)
        ]
        responses = await asyncio.gather(*tasks)
        return responses

asyncio.run(scrape_concurrently())
With requests, you either use threading (error-prone) or use a third-party wrapper. httpx makes async natural.
2. httpx is under active development with modern defaults
httpx supports HTTP/2 when you opt in with httpx.Client(http2=True) (install the extra first: pip install httpx[http2]). It has thorough type hints, and it's actively maintained and receiving security updates.
3. httpx handles edge cases better
For example, httpx's cookie handling is stricter about domain matching, which avoids a class of subtle bugs. requests has long-standing quirks in its cookie handling that change slowly; httpx is newer and fixes bugs faster.
4. Retries are easier to bolt onto httpx
requests has no built-in retry logic—you wire up urllib3's Retry through an HTTPAdapter yourself. httpx accepts httpx.HTTPTransport(retries=...) for connection-level retries; for response-level retries with exponential backoff you still add a small wrapper (or a library like tenacity) in either case.
The only reason to use requests is legacy code or when you need to integrate with libraries that specifically require it. For new scraping projects, start with httpx.
Here's a migration example:
# Old (requests)
import requests
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0..."})
response = session.get("https://example.com")
print(dict(session.cookies))
# New (httpx)
import httpx
client = httpx.Client(headers={"User-Agent": "Mozilla/5.0..."})
response = client.get("https://example.com")
print(dict(client.cookies))
The APIs are similar enough that migration is straightforward, but httpx's design is cleaner.
Login Flows: From Forms to APIs
Authentication takes different forms depending on the site. Let's cover the common patterns.
Form-Based Login (HTML POST)
Traditional websites use HTML forms. You submit username and password, the server validates them, and sets a session cookie.
import httpx
from html.parser import HTMLParser
client = httpx.Client(
    follow_redirects=True,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
)
# Step 1: GET the login page to extract any hidden fields
response = client.get("https://example.com/login")
# Some sites include CSRF tokens in the form
# We'll parse them to be safe
class FormParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.form_data = {}
        self.in_form = False
        self.csrf_token = None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            attrs_dict = dict(attrs)
            if "name" in attrs_dict:
                # Store hidden fields and tokens
                if "value" in attrs_dict:
                    self.form_data[attrs_dict["name"]] = attrs_dict["value"]
                # Treat hidden inputs with "csrf" in the name as the token
                if attrs_dict.get("type") == "hidden" and "csrf" in attrs_dict["name"].lower():
                    self.csrf_token = attrs_dict.get("value")
parser = FormParser()
parser.feed(response.text)
csrf_token = parser.csrf_token
print(f"Extracted CSRF token: {csrf_token}")
# Step 2: Submit the login form
login_data = {
    "username": "john_doe",
    "password": "secure_password",
    "_csrf": csrf_token  # Include the CSRF token
}
response = client.post(
    "https://example.com/login",
    data=login_data,
    follow_redirects=True
)
# Step 3: Verify authentication succeeded
if response.status_code == 200 and "dashboard" in response.text.lower():
    print("Login successful")
    print(f"Session cookie: {client.cookies.get('session_id')}")
else:
    print("Login failed")
    print(f"Status: {response.status_code}")
    print(f"Response preview: {response.text[:200]}")
# Step 4: Now use the authenticated session
response = client.get("https://example.com/api/user/profile")
print(response.json())
Key points:
- Always use follow_redirects=True. Login flows often end with a redirect to the dashboard.
- Extract and send CSRF tokens. We'll cover CSRF in depth in the next section.
- Verify the login succeeded. Don't assume it worked—check the response.
- Keep the client alive. As long as the httpx.Client object exists, it maintains cookies.
JSON API Authentication
Modern APIs don't use HTML forms. Instead, you POST JSON with credentials and receive a token:
import httpx
import json
import time
client = httpx.Client(
    headers={"User-Agent": "Mozilla/5.0..."}
)
# Some APIs require specific headers for authentication
response = client.post(
    "https://api.example.com/auth/login",
    json={"email": "[email protected]", "password": "secure_password"},
    headers={
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
)
if response.status_code != 200:
    print(f"Login failed: {response.status_code}")
    print(response.text)
    exit(1)
auth_response = response.json()
print(f"Auth response: {auth_response}")
# Pattern 1: Token in JSON, you manually add it to Authorization header
if "access_token" in auth_response:
    token = auth_response["access_token"]
    # Option A: Set it as a default header for all requests
    client.headers.update({"Authorization": f"Bearer {token}"})
    response = client.get("https://api.example.com/user/profile")
    print(response.json())
# Pattern 2: Token is set as a cookie automatically
# (less common but happens with some APIs)
# The Set-Cookie header from the login response contains the token
# httpx.Client handles it automatically
response = client.get("https://api.example.com/user/profile")
# The cookie is sent automatically because client maintains the cookie jar
The critical difference: with form-based auth, the server sets cookies automatically. With JSON APIs, you need to either:
- Extract the token from JSON and manually add it to the Authorization header
- Extract the token from JSON and manually set it as a cookie
- Let the server set it as a cookie and rely on httpx to send it
Let's see a more complex example with token refresh:
import httpx
import time
from datetime import datetime, timedelta
class APIClient:
    def __init__(self, base_url, email, password):
        self.base_url = base_url
        self.email = email
        self.password = password
        self.client = httpx.Client(
            base_url=base_url,
            headers={"User-Agent": "Mozilla/5.0..."}
        )
        self.token = None
        self.token_expires_at = None

    def authenticate(self):
        """Log in and store the token"""
        response = self.client.post(
            "/auth/login",
            json={"email": self.email, "password": self.password}
        )
        if response.status_code != 200:
            raise ValueError(f"Authentication failed: {response.text}")
        data = response.json()
        self.token = data["access_token"]
        expires_in = data.get("expires_in", 3600)  # Default 1 hour
        self.token_expires_at = time.time() + expires_in
        self.client.headers.update({"Authorization": f"Bearer {self.token}"})
        return True

    def refresh_token_if_needed(self):
        """Re-authenticate if the token is missing or expires within 60 seconds"""
        if self.token is None or time.time() >= self.token_expires_at - 60:
            self.authenticate()

    def get(self, path, **kwargs):
        """Make an authenticated GET request"""
        self.refresh_token_if_needed()
        return self.client.get(path, **kwargs)

    def post(self, path, **kwargs):
        """Make an authenticated POST request"""
        self.refresh_token_if_needed()
        return self.client.post(path, **kwargs)
# Usage
client = APIClient("https://api.example.com", "[email protected]", "password")
client.authenticate()
# Make authenticated requests
response = client.get("/user/profile")
print(response.json())
# If this request happens more than 1 hour later, token is automatically refreshed
time.sleep(3600)
response = client.get("/user/data") # Token is silently refreshed if needed
print(response.json())
This pattern handles token expiration automatically. Many APIs issue tokens with an expires_in field in seconds. Your scraper should respect this and refresh before expiration.
OAuth2 and Social Login
Some sites require OAuth2 (login with Google, Facebook, etc.). These are complex because they involve browser redirects. We'll cover the approach here and then discuss Playwright for full browser automation.
With OAuth2, your scraper can't easily get a real OAuth token without controlling a browser. However, some sites provide alternative authentication methods:
import httpx
# Option 1: Skip OAuth and use API key authentication
# Some sites provide API keys in user settings
client = httpx.Client(
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
response = client.get("https://api.example.com/data")
print(response.json())
# Option 2: Use the site's mobile app authentication
# Many sites have a simpler auth flow for their mobile apps
# Try requesting with a mobile User-Agent and look for simpler auth options
response = client.post(
    "https://api.example.com/auth/mobile",
    json={"email": "[email protected]", "password": "password"},
    headers={"User-Agent": "Mobile App v1.0"}
)
# Option 3: If the site has a "Sign in with Google" button, inspect the network
# Sometimes the site posts credentials directly instead of going through OAuth
# (This is a security anti-pattern but some sites do it)
For sites that genuinely require OAuth with no alternative, you need Playwright to run a real browser. We'll cover that later.
CSRF Tokens: Extracting and Submitting
CSRF (Cross-Site Request Forgery) tokens are a security mechanism to prevent attackers from forging requests. When you submit a form, you need to include the CSRF token from that same form.
Extracting CSRF Tokens from HTML Forms
import httpx
import re
from html.parser import HTMLParser
client = httpx.Client()
response = client.get("https://example.com/login")
# Method 1: Using a simple regex (good for quick scripts)
match = re.search(r'<input[^>]*name=["\']csrf["\'][^>]*value=["\']([^"\']+)["\']', response.text)
if match:
    csrf_token = match.group(1)
    print(f"CSRF token: {csrf_token}")
# Method 2: Using an HTML parser (more robust)
class CSRFParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.csrf_token = None

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs_dict = dict(attrs)
            if attrs_dict.get("name") in ("csrf", "_csrf", "csrf_token"):
                self.csrf_token = attrs_dict.get("value")

parser = CSRFParser()
parser.feed(response.text)
print(f"CSRF token: {parser.csrf_token}")
# Method 3: Using BeautifulSoup (if you have it installed)
try:
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(response.text, "html.parser")
    csrf_input = soup.find("input", {"name": ["csrf", "_csrf", "csrf_token"]})
    if csrf_input:
        csrf_token = csrf_input.get("value")
        print(f"CSRF token: {csrf_token}")
except ImportError:
    pass
CSRF Tokens in Meta Tags
Some single-page applications (SPAs) don't use form-based login. Instead, they include the CSRF token in a meta tag:
import httpx
import re
client = httpx.Client()
response = client.get("https://example.com/login")
# Look for a meta tag with the CSRF token
# Common patterns:
# <meta name="csrf-token" content="TOKEN_VALUE">
# <meta name="x-csrf-token" content="TOKEN_VALUE">
match = re.search(r'<meta[^>]*(?:name|property)=["\'](?:csrf|x-csrf-token)["\'][^>]*content=["\']([^"\']+)["\']', response.text)
if match:
    csrf_token = match.group(1)
    print(f"CSRF token: {csrf_token}")
# If the token is in a script variable instead:
match = re.search(r'window\.csrfToken\s*=\s*["\']([^"\']+)["\']', response.text)
if match:
    csrf_token = match.group(1)
    print(f"CSRF token from script: {csrf_token}")
Submitting Forms with CSRF Tokens
Once extracted, include the CSRF token in your POST request:
import httpx
client = httpx.Client(follow_redirects=True)
# Step 1: Get the form and extract CSRF token
response = client.get("https://example.com/login")
csrf_token = extract_csrf_token(response.text) # Use function from above
# Step 2: Submit the form with CSRF token
response = client.post(
    "https://example.com/login",
    data={
        "username": "john_doe",
        "password": "secure_password",
        "_csrf": csrf_token,  # Include the CSRF token
        "remember": "on"  # Any other form fields
    }
)
# Verify success
if response.status_code == 200 and "dashboard" in response.text.lower():
    print("Login successful")
CSRF in JSON APIs
When the API expects JSON, include the CSRF token as a header or in the JSON body:
import httpx
client = httpx.Client(follow_redirects=True)
# Get the CSRF token
response = client.get("https://example.com/login")
csrf_token = extract_csrf_token(response.text)
# Option 1: CSRF token as a request header
response = client.post(
    "https://example.com/api/login",
    json={"email": "[email protected]", "password": "password"},
    headers={"X-CSRF-Token": csrf_token}
)
# Option 2: CSRF token in the JSON body
response = client.post(
    "https://example.com/api/login",
    json={
        "email": "[email protected]",
        "password": "password",
        "_csrf": csrf_token
    }
)
Multi-Step Authentication: Beyond Simple Login
Many sites require additional verification steps: 2FA codes, email verification, security questions, etc.
Two-Factor Authentication (2FA)
If a site requires 2FA, you have a few options:
Option 1: Skip sites with 2FA (reasonable for simple scraping)
import httpx
client = httpx.Client(follow_redirects=True)
response = client.post(
    "https://example.com/login",
    data={"username": "user", "password": "pass"}
)
# Check if we got a 2FA prompt
if "2fa" in response.text.lower() or response.status_code == 403:
    print("Site requires 2FA. Skipping this account.")
    exit(0)
# If we get here, login succeeded without 2FA
print("Logged in successfully")
Option 2: Support email-based 2FA
If the site sends 2FA codes via email and you have access to that email account, you can parse the code:
import httpx
import re
import time
def get_2fa_code_from_email(email, password, timeout=30):
    """Retrieve the 2FA code sent to an email address"""
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            # This would need an email library like imap_tools
            # For brevity, showing the pattern
            pass
        except Exception as e:
            print(f"Error checking email: {e}")
        time.sleep(2)  # Check every 2 seconds
    raise TimeoutError("No 2FA code received")

client = httpx.Client(follow_redirects=True)
# Initial login
response = client.post(
    "https://example.com/login",
    data={"username": "user", "password": "pass"}
)
if "2fa" in response.text.lower():
    print("2FA required")
Option 3: TOTP (Time-based One-Time Password)
If the site uses an authenticator app, you can generate TOTP codes:
import httpx
import pyotp
client = httpx.Client(follow_redirects=True)
# You need the TOTP secret (usually shown alongside the QR code when you enable 2FA)
totp_secret = "JBSWY3DPEBLW64TMMQ======"  # Your TOTP secret
response = client.post(
    "https://example.com/login",
    data={"username": "user", "password": "pass"}
)
if "2fa" in response.text.lower():
    # Generate the current TOTP code
    totp = pyotp.TOTP(totp_secret)
    code = totp.now()
    response = client.post(
        "https://example.com/verify-2fa",
        data={"code": code}
    )
    if response.status_code == 200:
        print("2FA passed")
Persisting Cookies to Disk
The whole point of handling sessions is to avoid re-authenticating every time your scraper runs. Persist cookies to disk so they survive between runs.
JSON Persistence (Simple)
import httpx
import json
import os
from datetime import datetime
class PersistentClient:
    def __init__(self, cookies_file="cookies.json"):
        self.cookies_file = cookies_file
        self.client = httpx.Client(follow_redirects=True)
        self.load_cookies()

    def save_cookies(self):
        """Save cookies to a JSON file.

        Note: this keeps only name/value pairs; domain, path, and expiry
        attributes are dropped, which is fine for most single-site scrapers.
        """
        cookies_dict = dict(self.client.cookies)
        with open(self.cookies_file, "w") as f:
            json.dump(cookies_dict, f, indent=2)
        print(f"Cookies saved to {self.cookies_file}")

    def load_cookies(self):
        """Load cookies from a JSON file if it exists"""
        if os.path.exists(self.cookies_file):
            try:
                with open(self.cookies_file, "r") as f:
                    cookies_dict = json.load(f)
                # Restore cookies to the client
                for name, value in cookies_dict.items():
                    self.client.cookies.set(name, value)
                print(f"Cookies loaded from {self.cookies_file}")
            except Exception as e:
                print(f"Error loading cookies: {e}")

    def login(self, username, password):
        """Log in to the site"""
        response = self.client.post(
            "https://example.com/login",
            data={"username": username, "password": password}
        )
        if response.status_code == 200:
            print("Login successful")
            self.save_cookies()  # Save cookies after login
            return True
        print("Login failed")
        return False

    def get(self, url, **kwargs):
        """Make an authenticated GET request; on 401 the caller re-logs in"""
        response = self.client.get(url, **kwargs)
        if response.status_code == 401:
            print("Session expired, caller should re-authenticate")
        return response

# Usage
client = PersistentClient("cookies.json")
# If saved cookies exist and are still valid, this just works;
# if not, we get a 401 and log in fresh
response = client.get("https://example.com/api/user")
if response.status_code == 401:
    client.login("john_doe", "password")
    response = client.get("https://example.com/api/user")
print(response.json())
Pickle Persistence (Automatic)
Python's pickle module can serialize complex objects. httpx cookies can be pickled:
import httpx
import pickle
import os
class PickledClient:
    def __init__(self, cookies_file="cookies.pkl"):
        self.cookies_file = cookies_file
        self.client = httpx.Client(follow_redirects=True)
        self.load_cookies()

    def save_cookies(self):
        """Save cookies using pickle (includes all cookie attributes)"""
        with open(self.cookies_file, "wb") as f:
            pickle.dump(self.client.cookies, f)
        print(f"Cookies pickled to {self.cookies_file}")

    def load_cookies(self):
        """Load pickled cookies.

        Only unpickle files your own scraper wrote—unpickling untrusted
        data can execute arbitrary code.
        """
        if os.path.exists(self.cookies_file):
            try:
                with open(self.cookies_file, "rb") as f:
                    self.client.cookies = pickle.load(f)
                print(f"Cookies loaded from {self.cookies_file}")
            except Exception as e:
                print(f"Error loading cookies: {e}")

# Usage
client = PickledClient()
response = client.client.get("https://example.com/api/data")
client.save_cookies()
SQLite Persistence (Production)
For production scrapers managing multiple accounts and sessions, SQLite is more robust:
import httpx
import sqlite3
import json
import os
from datetime import datetime, timedelta
class SQLiteSessionStore:
def __init__(self, db_path="sessions.db"):
self.db_path = db_path
self.init_database()
def init_database(self):
"""Initialize the database schema"""
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS sessions (
id INTEGER PRIMARY KEY,
account TEXT UNIQUE NOT NULL,
cookies TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_used TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
expires_at TIMESTAMP
)
""")
conn.commit()
def save_session(self, account, cookies, expires_in_hours=24):
"""Save a session with an account identifier"""
cookies_json = json.dumps(dict(cookies))
expires_at = datetime.utcnow() + timedelta(hours=expires_in_hours)
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
INSERT OR REPLACE INTO sessions (account, cookies, expires_at)
VALUES (?, ?, ?)
""", (account, cookies_json, expires_at))
conn.commit()
print(f"Session saved for {account}")
def load_session(self, account):
"""Load a session for an account"""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute("""
SELECT cookies, expires_at FROM sessions
WHERE account = ?
""", (account,))
row = cursor.fetchone()
if not row:
return None
cookies_json, expires_at = row
# Check if the session has expired
if datetime.fromisoformat(expires_at) < datetime.utcnow():
print(f"Session for {account} has expired")
return None
cookies = json.loads(cookies_json)
return cookies
# Usage
store = SQLiteSessionStore()
# Create a client with saved cookies
client = httpx.Client(follow_redirects=True)
cookies = store.load_session("john_doe")
if cookies:
print("Using cached session")
for name, value in cookies.items():
client.cookies.set(name, value)
else:
print("No cached session, logging in...")
response = client.post("https://example.com/login", data={
"username": "john_doe",
"password": "secure_password"
})
if response.status_code == 200:
store.save_session("john_doe", client.cookies, expires_in_hours=24)
# Use the authenticated client
response = client.get("https://example.com/api/user")
print(response.json())
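The schema above records a last_used column but never refreshes it. A small helper (a sketch against the same sessions table; the column and table names match the schema above) keeps it current so stale sessions can be told apart from active ones:

```python
import sqlite3

def touch_session(db_path: str, account: str) -> None:
    """Refresh last_used so recently active sessions can be distinguished from stale ones."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "UPDATE sessions SET last_used = CURRENT_TIMESTAMP WHERE account = ?",
            (account,),
        )
        conn.commit()
```

Call it after each successful request made with a stored session.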
When httpx Fails: JavaScript-Rendered Cookies
Some modern sites render content with JavaScript, which means the cookies you need might be set by JavaScript code, not HTTP headers. Additionally, some sites use JavaScript-based anti-bot systems (Cloudflare, Akamai, PerimeterX) that require you to solve challenges before issuing valid cookies.
Signs You Need a Real Browser
import httpx
client = httpx.Client()
response = client.get("https://example.com/protected-data")
# Sign 1: Empty response body
if len(response.text) < 100:
print("Response too small, likely JS-rendered or blocked by anti-bot")
# Sign 2: JavaScript challenge page
if "challenge" in response.text.lower() or "cloudflare" in response.text.lower():
print("Cloudflare or similar anti-bot system detected")
print("Need to use Playwright to bypass")
# Sign 3: Meta tags only, no content
if response.text.count("<meta") > response.text.count("<p"):
print("Page is likely JS-rendered")
# Sign 4: Scripts instead of content
if response.text.count("<script") > 5:
print("Heavy JavaScript rendering detected")
When these signs appear, you need a real browser automation framework such as Playwright.
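The four checks above can be folded into one helper. This is a sketch: the thresholds (100 bytes, 5 script tags) are heuristics from the examples, not fixed rules, and you should tune them per site:

```python
def needs_real_browser(html: str) -> bool:
    """Heuristic: does this response look JS-rendered or blocked by anti-bot?"""
    text = html.lower()
    if len(html) < 100:                              # Sign 1: suspiciously small body
        return True
    if "challenge" in text or "cloudflare" in text:  # Sign 2: anti-bot challenge page
        return True
    if html.count("<meta") > html.count("<p"):       # Sign 3: meta tags but no content
        return True
    if html.count("<script") > 5:                    # Sign 4: script-heavy shell page
        return True
    return False
```

Call it with `response.text` and fall back to Playwright whenever it returns True.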
Playwright: Full Browser Automation with Cookies
Playwright controls a real Chromium browser. It's slower than httpx but it handles everything: JavaScript execution, anti-bot systems, and cookies set by JavaScript.
Basic Playwright Login
import asyncio
from playwright.async_api import async_playwright
async def login_with_playwright():
async with async_playwright() as p:
# Launch a browser
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
# Navigate to the login page
await page.goto("https://example.com/login")
# Fill in the login form
await page.fill("input[name='username']", "john_doe")
await page.fill("input[name='password']", "secure_password")
# Click the login button
await page.click("button:has-text('Login')")
# Wait for navigation to complete
        await page.wait_for_load_state("networkidle")
# Get all cookies set by the browser
cookies = await context.cookies()
print(f"Cookies: {cookies}")
# Make an authenticated request
response = await page.goto("https://example.com/api/user/profile")
text = await page.content()
print(f"Authenticated content: {text[:200]}")
await browser.close()
# Run the async function
asyncio.run(login_with_playwright())
The cookies from Playwright are returned as a list of dictionaries containing all the cookie attributes.
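Each entry looks roughly like the dictionary below (the values shown are illustrative, but the keys match Playwright's cookie format), and flattening the list to the simple name-to-value mapping httpx accepts is a one-liner:

```python
# Illustrative shape of one cookie as returned by context.cookies()
playwright_cookies = [
    {
        "name": "session_id",
        "value": "abc123",
        "domain": "example.com",
        "path": "/",
        "expires": 1790000000.0,  # Unix timestamp; -1 for session cookies
        "httpOnly": True,
        "secure": True,
        "sameSite": "Lax",
    }
]

# Flatten to the {name: value} dict that httpx accepts
cookies_dict = {c["name"]: c["value"] for c in playwright_cookies}
print(cookies_dict)  # {'session_id': 'abc123'}
```

Note that the flattened dict drops domain, path, and expiry information, which is usually fine for single-site scraping.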
Extracting Cookies and Using Them in httpx
The real power comes from combining Playwright (for bypassing anti-bot) with httpx (for fast, simple requests):
import asyncio
import httpx
import json
from playwright.async_api import async_playwright
async def get_cookies_with_playwright():
"""Use Playwright to log in and get cookies"""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
await page.goto("https://example.com/login")
await page.fill("input[name='email']", "[email protected]")
await page.fill("input[name='password']", "secure_password")
await page.click("button:has-text('Login')")
        await page.wait_for_load_state("networkidle")
# Get cookies from the authenticated browser
cookies = await context.cookies()
await browser.close()
return cookies
async def scrape_with_httpx_using_playwright_cookies():
"""Get cookies from Playwright, then use httpx for scraping"""
# Step 1: Use Playwright to log in (handles anti-bot, JS rendering)
cookies = await get_cookies_with_playwright()
# Convert Playwright cookies to httpx format
cookies_dict = {c["name"]: c["value"] for c in cookies}
# Step 2: Create an httpx client with these cookies
client = httpx.Client(cookies=cookies_dict)
# Step 3: Use httpx for fast scraping (no need for browser anymore)
for page in range(1, 100):
response = client.get(f"https://example.com/api/data?page={page}")
if response.status_code == 401:
print("Session expired, re-running Playwright login...")
cookies = await get_cookies_with_playwright()
cookies_dict = {c["name"]: c["value"] for c in cookies}
client.cookies.clear()
client.cookies.update(cookies_dict)
response = client.get(f"https://example.com/api/data?page={page}")
data = response.json()
print(f"Page {page}: {len(data)} items")
# Run it
asyncio.run(scrape_with_httpx_using_playwright_cookies())
This pattern is powerful: use Playwright for authentication, which is slow but robust against JavaScript-based anti-bot systems, then switch to httpx for fast bulk scraping.
Playwright Storage State (Saving Browser Session)
Playwright can save the entire browser state (all cookies, local storage, session storage, etc.) to a file:
import asyncio
import json
from playwright.async_api import async_playwright
async def save_browser_state():
"""Save the entire browser session to a file"""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()
# Log in
await page.goto("https://example.com/login")
await page.fill("input[name='email']", "[email protected]")
await page.fill("input[name='password']", "secure_password")
await page.click("button:has-text('Login')")
        await page.wait_for_load_state("networkidle")
# Save the entire state
await context.storage_state(path="browser_state.json")
print("Browser state saved to browser_state.json")
await browser.close()
async def load_browser_state():
"""Load a previously saved browser session"""
async with async_playwright() as p:
browser = await p.chromium.launch(headless=True)
# Create a context from saved state
context = await browser.new_context(
storage_state="browser_state.json"
)
page = await context.new_page()
# Now the page has the same cookies, local storage, etc.
# as when we saved the state
await page.goto("https://example.com/api/user/profile")
text = await page.content()
print(f"Authenticated page loaded: {text[:200]}")
await browser.close()
# Save state
asyncio.run(save_browser_state())
# Later, load state
asyncio.run(load_browser_state())
The saved browser_state.json contains cookies with all their attributes, as well as localStorage and sessionStorage data.
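If you want to reuse those cookies outside the browser, the file can be read directly. A sketch (the top-level "cookies" key is part of Playwright's storage-state format):

```python
import json

def cookies_from_storage_state(path: str) -> dict:
    """Extract a {name: value} cookie dict from a Playwright storage_state file."""
    with open(path) as f:
        state = json.load(f)
    # storage_state files hold a top-level "cookies" list plus per-origin storage
    return {c["name"]: c["value"] for c in state.get("cookies", [])}

# cookies = cookies_from_storage_state("browser_state.json")
# client = httpx.Client(cookies=cookies)
```

This gives you the Playwright-to-httpx handoff without re-running the browser login.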
Cookie Rotation for Large-Scale Scraping
When scraping at scale (thousands of requests), using a single cookie/session becomes a bottleneck. Websites can detect and block scrapers that make too many requests with the same session. The solution is to rotate through multiple authenticated sessions while pairing them with residential proxies like ThorData (https://thordata.partnerstack.com/partner/0a0x4nzh).
Session Pool Implementation
import httpx
import sqlite3
import random
from datetime import datetime, timedelta
from typing import List, Optional
class SessionPool:
def __init__(self, db_path="session_pool.db"):
self.db_path = db_path
self.init_database()
def init_database(self):
"""Initialize the session pool database"""
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS sessions (
id INTEGER PRIMARY KEY,
cookies TEXT NOT NULL,
request_count INTEGER DEFAULT 0,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_used TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
valid BOOLEAN DEFAULT 1
)
""")
conn.commit()
def add_session(self, cookies_dict):
"""Add a new session to the pool"""
import json
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
INSERT INTO sessions (cookies)
VALUES (?)
""", (json.dumps(cookies_dict),))
conn.commit()
def get_session(self) -> Optional[dict]:
"""Get the next session to use (round-robin)"""
import json
with sqlite3.connect(self.db_path) as conn:
# Get the session with the fewest requests
cursor = conn.execute("""
SELECT id, cookies FROM sessions
WHERE valid = 1
ORDER BY request_count ASC
LIMIT 1
""")
row = cursor.fetchone()
if not row:
return None
session_id, cookies_json = row
return {
"id": session_id,
"cookies": json.loads(cookies_json)
}
def mark_used(self, session_id):
"""Increment request count and update last_used"""
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
UPDATE sessions
SET request_count = request_count + 1,
last_used = CURRENT_TIMESTAMP
WHERE id = ?
""", (session_id,))
conn.commit()
def mark_invalid(self, session_id):
"""Mark a session as invalid (likely expired)"""
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
UPDATE sessions
SET valid = 0
WHERE id = ?
""", (session_id,))
conn.commit()
def get_pool_stats(self):
"""Get statistics about the session pool"""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute("""
SELECT COUNT(*) as total,
SUM(CASE WHEN valid=1 THEN 1 ELSE 0 END) as active,
SUM(request_count) as total_requests
FROM sessions
""")
total, active, total_requests = cursor.fetchone()
return {
"total_sessions": total or 0,
"active_sessions": active or 0,
"total_requests": total_requests or 0
}
# Usage with multiple authenticated sessions
pool = SessionPool()
# Create 10 authenticated sessions
for i in range(10):
# Each session represents a different account or IP
client = httpx.Client()
client.post("https://example.com/login", data={
"username": f"account_{i}",
"password": "secure_password"
})
pool.add_session(dict(client.cookies))
# Now use the pool to scrape with automatic session rotation
for page in range(1000):
session = pool.get_session()
if not session:
print("No valid sessions available")
break
client = httpx.Client(cookies=session["cookies"])
response = client.get(f"https://example.com/api/data?page={page}")
if response.status_code == 401:
# Session expired
pool.mark_invalid(session["id"])
print(f"Session {session['id']} marked invalid")
else:
pool.mark_used(session["id"])
data = response.json()
print(f"Page {page}: {len(data)} items")
print("Pool stats:", pool.get_pool_stats())
Pairing Sessions with Residential Proxies
ThorData provides residential proxies that rotate IP addresses. When combined with session rotation, you create a realistic traffic pattern that's harder to detect:
import httpx
import random
from typing import Optional
class ResidentialProxyPool:
"""Manage a pool of residential proxies from ThorData"""
def __init__(self, api_key: str):
self.api_key = api_key
# ThorData residential proxy endpoint
self.proxy_endpoint = "https://proxy.thordata.com"
self.available_proxies = self._fetch_proxies()
def _fetch_proxies(self) -> list:
"""Fetch available proxies from ThorData"""
# This would integrate with ThorData's API
# For now, return placeholder
return [
f"http://user:{self.api_key}@proxy{i}.thordata.com:8080"
for i in range(1, 101) # 100 proxies
]
def get_proxy(self) -> str:
"""Get a random proxy from the pool"""
return random.choice(self.available_proxies)
class ScraperWithProxyAndSessionRotation:
def __init__(self, sessions: list, thordata_api_key: str):
self.sessions = sessions
self.proxy_pool = ResidentialProxyPool(thordata_api_key)
self.current_session_idx = 0
def scrape(self, url: str) -> Optional[dict]:
"""Scrape a URL using rotated sessions and proxies"""
# Get the next session (round-robin)
session = self.sessions[self.current_session_idx]
self.current_session_idx = (self.current_session_idx + 1) % len(self.sessions)
# Get a random proxy from ThorData
proxy = self.proxy_pool.get_proxy()
# Create a client with session cookies and proxy
client = httpx.Client(
cookies=session["cookies"],
            proxy=proxy,  # newer httpx versions use "proxy"; the old "proxies" argument was removed
headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
)
try:
response = client.get(url, timeout=30)
if response.status_code == 200:
return response.json()
elif response.status_code == 401:
# Session expired, this session needs re-authentication
print(f"Session {self.current_session_idx} expired")
return None
else:
print(f"Error: {response.status_code}")
return None
except Exception as e:
print(f"Request error: {e}")
return None
finally:
client.close()
# Usage
sessions = [
{"cookies": {"session_id": "abc123"}},
{"cookies": {"session_id": "def456"}},
{"cookies": {"session_id": "ghi789"}},
]
scraper = ScraperWithProxyAndSessionRotation(
sessions,
thordata_api_key="YOUR_THORDATA_API_KEY"
)
for page in range(1000):
data = scraper.scrape(f"https://example.com/api/data?page={page}")
if data:
print(f"Page {page}: {len(data)} items")
By rotating both sessions and residential proxies from ThorData (https://thordata.partnerstack.com/partner/0a0x4nzh), you create traffic that looks like multiple real users from different locations, making detection much harder.
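One refinement: rather than picking a random proxy per request, pin each session to a single proxy so an account doesn't appear to hop between countries mid-session. A minimal sketch of that sticky pairing (the session and proxy values are placeholders):

```python
import itertools

def pair_sessions_with_proxies(sessions: list, proxies: list) -> dict:
    """Pin each session to one proxy for its whole lifetime (sticky pairing)."""
    # Cycle through proxies so the pairing works even if the lists differ in length
    return dict(zip(sessions, itertools.cycle(proxies)))

pairs = pair_sessions_with_proxies(
    ["sess_a", "sess_b", "sess_c"],
    ["http://proxy1:8080", "http://proxy2:8080"],
)
print(pairs["sess_a"])  # http://proxy1:8080
```

Each request then looks up the session's fixed proxy instead of calling `get_proxy()` randomly.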
Anti-Detection: Making Your Cookies Look Human
Websites don't just check if a session is valid; they check if your behavior looks human. They analyze your cookie lifecycle to detect scrapers.
Cookie Fingerprinting
Websites can fingerprint your automation by analyzing your cookies:
import httpx
import time
import random
# Anti-pattern: Requesting the same URL with the same cookies too quickly
client = httpx.Client()
for i in range(100):
response = client.get("https://example.com/api/data")
# No delay between requests - obviously not human
# Pro-pattern: Add realistic delays
for i in range(100):
response = client.get("https://example.com/api/data")
time.sleep(random.uniform(1, 3)) # Random delay like a human browsing
# Anti-pattern: Missing referer and other realistic headers
response = client.get("https://example.com/api/data")
# Pro-pattern: Include realistic headers
client = httpx.Client(
headers={
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate",
"Referer": "https://example.com",
"Connection": "keep-alive",
}
)
response = client.get("https://example.com/api/data", headers={"Referer": "https://example.com/page"})
Realistic Cookie Chains
Real browsers don't get all cookies at once. They accumulate cookies through navigation:
import httpx
import time
import random
client = httpx.Client(follow_redirects=True)
# Anti-pattern: Jump straight to the API
response = client.get("https://example.com/api/data")
# Pro-pattern: Navigate through the site like a human would
time.sleep(random.uniform(1, 3))
response = client.get("https://example.com") # Home page
time.sleep(random.uniform(1, 3))
response = client.get("https://example.com/products") # Browse products
time.sleep(random.uniform(1, 3))
response = client.get("https://example.com/products/category/electronics")
time.sleep(random.uniform(2, 5))
response = client.get("https://example.com/api/data") # Now make the API request
# By this point, the site has set multiple cookies across different pages
# Your cookie jar looks like a real browser, not a scraper
Avoiding Detection Through Cookie Behavior
import httpx
import random
import time
class HumanLikeClient:
def __init__(self):
self.client = httpx.Client(follow_redirects=True)
self.last_request_time = 0
def _add_realistic_delay(self):
"""Add a human-like delay between requests"""
elapsed = time.time() - self.last_request_time
        # Humans typically wait a few seconds between actions
        delay = random.uniform(2, 5) - elapsed
if delay > 0:
time.sleep(delay)
    def get(self, url: str, **kwargs) -> httpx.Response:
        """Make a GET request with human-like behavior"""
        self._add_realistic_delay()
        # Send the previously visited URL as the Referer, like a browser navigating
        # (httpx.Client keeps no request history, so we track the last URL ourselves)
        headers = kwargs.setdefault("headers", {})
        if "Referer" not in headers and getattr(self, "_last_url", None):
            headers["Referer"] = self._last_url
        response = self.client.get(url, **kwargs)
        self._last_url = url
        self.last_request_time = time.time()
        return response
def post(self, url: str, **kwargs) -> httpx.Response:
"""Make a POST request with human-like behavior"""
self._add_realistic_delay()
response = self.client.post(url, **kwargs)
self.last_request_time = time.time()
return response
# Usage
client = HumanLikeClient()
# This will look much more like a human user
response = client.get("https://example.com")
response = client.get("https://example.com/products")
response = client.post("https://example.com/api/add-to-cart", json={"product_id": 123})
Debugging Cookie Issues
When things go wrong, you need to understand what's happening. Here's how to debug cookie problems systematically.
Inspecting Cookie Jars
import httpx
import json
client = httpx.Client()
response = client.get("https://example.com/login")
# View all cookies
print("All cookies:")
for name, value in client.cookies.items():
print(f" {name} = {value}")
# Export cookies as JSON for debugging
cookies_json = json.dumps({
name: str(value)
for name, value in client.cookies.items()
}, indent=2)
print("Cookies as JSON:")
print(cookies_json)
Logging Set-Cookie Headers
import httpx
class DebugClient(httpx.Client):
def request(self, method, url, **kwargs):
response = super().request(method, url, **kwargs)
# Log Set-Cookie headers
if "set-cookie" in response.headers:
print(f"\nSet-Cookie headers from {url}:")
for value in response.headers.get_list("set-cookie"):
print(f" {value}")
# Log current cookies
print(f"Cookies after {method} {url}:")
for name, value in self.cookies.items():
print(f" {name} = {value[:50]}...")
return response
# Usage
client = DebugClient()
response = client.post("https://example.com/login", data={
"username": "user",
"password": "pass"
})
Common Cookie Errors and Fixes
Problem: 401 Unauthorized after login
import httpx
client = httpx.Client(follow_redirects=True)
response = client.post("https://example.com/login", data={
"username": "user",
"password": "pass"
})
print(f"Login status: {response.status_code}")
print(f"Cookies after login: {dict(client.cookies)}")
if response.status_code == 200 and not client.cookies:
print("ERROR: Login succeeded (200) but no cookies were set")
print("Possible causes:")
print(" 1. Cookies are HttpOnly and can't be accessed by httpx")
print(" 2. The site uses JavaScript to set cookies")
print(" 3. The login credentials are wrong")
print("\nSolution: Use Playwright instead")
Problem: Redirects to login page
import httpx
client = httpx.Client(follow_redirects=True)
response = client.get("https://example.com/api/data")
if response.url.path == "/login":
print("ERROR: Got redirected to login")
print(f"Final URL: {response.url}")
print(f"Cookies: {dict(client.cookies)}")
if not client.cookies:
print("Cookies are empty - need to authenticate first")
else:
print("Cookies exist but session is invalid")
print("Possible causes:")
print(" 1. Session has expired")
print(" 2. Server detected automated access")
print(" 3. IP address changed (if using rotating proxies)")
Problem: 403 Forbidden
import httpx
client = httpx.Client()
client.post("https://example.com/login", data={
"username": "user",
"password": "pass"
})
response = client.get("https://example.com/protected-resource")
if response.status_code == 403:
print("ERROR: 403 Forbidden")
print("Possible causes:")
print(" 1. User doesn't have permission")
print(" 2. API key/token is invalid")
print(" 3. Request is missing required headers")
print(" 4. CSRF token is missing or expired")
print("\nDebug info:")
print(f" Cookies: {dict(client.cookies)}")
print(f" Response preview: {response.text[:500]}")
Secure Cookie Storage
Never hardcode credentials or store unencrypted cookies. Here's how to handle them securely:
import os
import json
import httpx
from cryptography.fernet import Fernet
from pathlib import Path
class SecureCookieStore:
def __init__(self, storage_path=".secure_cookies"):
self.storage_path = Path(storage_path)
self.key_path = self.storage_path / ".key"
self.cipher = self._setup_cipher()
def _setup_cipher(self):
"""Set up encryption using a stored key"""
if not self.key_path.exists():
# Generate and store a new key
self.storage_path.mkdir(exist_ok=True)
key = Fernet.generate_key()
self.key_path.write_bytes(key)
# Set restrictive permissions
self.key_path.chmod(0o600)
key = self.key_path.read_bytes()
return Fernet(key)
def save_cookies(self, account: str, cookies: dict):
"""Save cookies encrypted to disk"""
cookies_json = json.dumps(cookies)
encrypted = self.cipher.encrypt(cookies_json.encode())
cookie_path = self.storage_path / f"{account}.enc"
cookie_path.write_bytes(encrypted)
print(f"Cookies saved securely for {account}")
def load_cookies(self, account: str) -> dict:
"""Load and decrypt cookies"""
cookie_path = self.storage_path / f"{account}.enc"
if not cookie_path.exists():
return {}
encrypted = cookie_path.read_bytes()
decrypted = self.cipher.decrypt(encrypted).decode()
return json.loads(decrypted)
# Usage
store = SecureCookieStore()
client = httpx.Client()
response = client.post("https://example.com/login", data={
"username": "user",
"password": "pass"
})
# Save cookies encrypted
store.save_cookies("[email protected]", dict(client.cookies))
# Later, load and use cookies
cookies = store.load_cookies("[email protected]")
client = httpx.Client(cookies=cookies)
response = client.get("https://example.com/api/user")
print(response.json())
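The examples in this guide hardcode credentials for brevity; in practice, pull them from environment variables so they never land in source control. A sketch (the variable names are placeholders):

```python
import os

def get_credentials() -> tuple:
    """Read login credentials from the environment instead of source code."""
    username = os.environ.get("SCRAPER_USERNAME")
    password = os.environ.get("SCRAPER_PASSWORD")
    if not username or not password:
        raise RuntimeError("Set SCRAPER_USERNAME and SCRAPER_PASSWORD before running")
    return username, password
```

Pair this with the SecureCookieStore above: credentials come from the environment, session cookies live encrypted on disk.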
Integration with Scrapy Cookie Middleware
If you're using Scrapy for large-scale scraping, Scrapy has built-in cookie middleware. However, understanding how it works helps you debug issues:
# In your Scrapy settings.py
COOKIES_ENABLED = True
COOKIES_DEBUG = False # Set to True to see cookie logs
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 543,
}
# In your spider
import scrapy
class MySpider(scrapy.Spider):
name = "example"
start_urls = ["https://example.com/login"]
    def parse(self, response):
"""Handle the login page"""
# Extract CSRF token
csrf_token = response.css('input[name="_csrf"]::attr(value)').get()
# Submit login form
# Cookies are automatically managed by Scrapy's CookiesMiddleware
yield scrapy.FormRequest(
url="https://example.com/login",
formdata={
"username": "user",
"password": "pass",
"_csrf": csrf_token
},
callback=self.parse_authenticated
)
def parse_authenticated(self, response):
"""Now we're authenticated"""
# Scrapy automatically maintains cookies
# Make authenticated requests
yield scrapy.Request(
url="https://example.com/api/user",
callback=self.parse_user
)
def parse_user(self, response):
"""Parse authenticated response"""
data = response.json()
yield data
Scrapy's cookie middleware is automatic, but you can customize it:
# Custom spider attribute to control cookies
class MySpider(scrapy.Spider):
name = "example"
# Disable cookies for this spider
custom_settings = {
'COOKIES_ENABLED': False,
}
Building a Production SessionManager Class
Let's build a complete SessionManager that handles authentication, persistence, expiration, and rotation:
import httpx
import json
import sqlite3
import time
import random
from datetime import datetime, timedelta
from typing import Optional, Dict, List
import hashlib
class SessionManager:
"""
Production-grade session management with:
- Automatic login and cookie persistence
- Session expiration and automatic re-login
- Session rotation for distributed scraping
- Secure storage
"""
def __init__(
self,
db_path: str = "sessions.db",
base_url: str = "https://example.com"
):
self.db_path = db_path
self.base_url = base_url
self.init_database()
self.active_sessions = {}
def init_database(self):
"""Initialize the session database"""
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS sessions (
id INTEGER PRIMARY KEY,
account_hash TEXT UNIQUE NOT NULL,
credentials TEXT NOT NULL,
cookies TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_used TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
expires_at TIMESTAMP,
is_valid BOOLEAN DEFAULT 1,
request_count INTEGER DEFAULT 0
)
""")
conn.commit()
def _hash_account(self, username: str, email: str = None) -> str:
"""Create a hash of account credentials for privacy"""
identifier = f"{username}:{email or ''}"
return hashlib.sha256(identifier.encode()).hexdigest()[:16]
def create_session(
self,
username: str,
password: str,
email: str = None,
login_url: str = None
) -> bool:
"""
Authenticate a new account and store the session
"""
account_hash = self._hash_account(username, email)
# Check if we already have a valid session for this account
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute(
"SELECT id, expires_at FROM sessions WHERE account_hash = ? AND is_valid = 1",
(account_hash,)
)
existing = cursor.fetchone()
if existing:
session_id, expires_at = existing
# Check if session is still valid
if expires_at and datetime.fromisoformat(expires_at) > datetime.utcnow():
print(f"Using existing valid session for account {username}")
return True
# Need to authenticate
print(f"Authenticating account {username}...")
try:
client = httpx.Client(follow_redirects=True)
# Perform login
login_endpoint = login_url or f"{self.base_url}/login"
response = client.post(
login_endpoint,
data={"username": username, "password": password},
timeout=30
)
if response.status_code != 200 or not client.cookies:
print(f"Login failed for {username}: {response.status_code}")
return False
# Store the session
cookies_json = json.dumps(dict(client.cookies))
credentials_json = json.dumps({
"username": username,
"password": password,
"email": email
})
with sqlite3.connect(self.db_path) as conn:
conn.execute("""
INSERT OR REPLACE INTO sessions
(account_hash, credentials, cookies, expires_at)
VALUES (?, ?, ?, ?)
""", (
account_hash,
credentials_json,
cookies_json,
(datetime.utcnow() + timedelta(hours=24)).isoformat()
))
conn.commit()
print(f"Session created for {username}")
return True
except Exception as e:
print(f"Error creating session for {username}: {e}")
return False
    def get_session_client(self, username: str, email: str = None) -> Optional[httpx.Client]:
        """Get an authenticated HTTP client for an account"""
        # Pass the same email used in create_session so the account hashes match
        account_hash = self._hash_account(username, email)
# Load from database
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute(
"""SELECT cookies, expires_at FROM sessions
WHERE account_hash = ? AND is_valid = 1
ORDER BY last_used DESC LIMIT 1""",
(account_hash,)
)
row = cursor.fetchone()
if not row:
return None
cookies_json, expires_at = row
# Check expiration
if datetime.fromisoformat(expires_at) < datetime.utcnow():
print(f"Session for {username} has expired")
return None
# Create client with cookies
cookies = json.loads(cookies_json)
client = httpx.Client(
cookies=cookies,
base_url=self.base_url,
timeout=30,
follow_redirects=True
)
return client
def mark_session_invalid(self, username: str):
"""Mark a session as invalid (usually because it failed)"""
account_hash = self._hash_account(username)
with sqlite3.connect(self.db_path) as conn:
conn.execute(
"UPDATE sessions SET is_valid = 0 WHERE account_hash = ?",
(account_hash,)
)
conn.commit()
print(f"Session for {username} marked invalid")
def get_session_pool(self, limit: int = None) -> List[httpx.Client]:
"""Get multiple authenticated clients for distributed scraping"""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute(
"""SELECT cookies FROM sessions
WHERE is_valid = 1 AND expires_at > datetime('now')
ORDER BY request_count ASC
LIMIT ?""",
(limit,) if limit else (1000,)
)
rows = cursor.fetchall()
clients = []
for (cookies_json,) in rows:
cookies = json.loads(cookies_json)
client = httpx.Client(
cookies=cookies,
base_url=self.base_url,
timeout=30
)
clients.append(client)
return clients
# Complete example
def main():
# Create manager
manager = SessionManager(
db_path="production_sessions.db",
base_url="https://example.com"
)
# Create sessions for multiple accounts
accounts = [
("user1", "password1", "[email protected]"),
("user2", "password2", "[email protected]"),
("user3", "password3", "[email protected]"),
]
for username, password, email in accounts:
manager.create_session(username, password, email)
# Get a rotation pool
pool = manager.get_session_pool(limit=3)
print(f"Pool size: {len(pool)}")
# Scrape with automatic rotation
for page in range(100):
if not pool:
print("No sessions available")
break
client = random.choice(pool)
try:
response = client.get(f"/api/data?page={page}")
if response.status_code == 200:
print(f"Page {page}: Success")
else:
print(f"Page {page}: Failed ({response.status_code})")
except Exception as e:
print(f"Page {page}: Error ({e})")
time.sleep(random.uniform(1, 3))
if __name__ == "__main__":
main()
Complete Production-Ready Authenticated Scraper
Here's a complete example that ties everything together:
#!/usr/bin/env python3
"""
Production authenticated scraper with:
- Session management and persistence
- Automatic retry with exponential backoff
- Proxy rotation with ThorData
- Session rotation
- Error recovery
"""
import httpx
import time
import random
import logging
from typing import Optional, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProductionAuthenticatedScraper:
    def __init__(
        self,
        base_url: str,
        session_manager,
        proxy_pool=None,
        max_retries: int = 3,
        backoff_factor: float = 1.5
    ):
        self.base_url = base_url
        self.session_manager = session_manager
        self.proxy_pool = proxy_pool
        self.max_retries = max_retries
        self.backoff_factor = backoff_factor
        self.request_count = 0
        self.error_count = 0

    def _get_proxy(self) -> Optional[str]:
        """Get a proxy from the pool (ThorData or similar)"""
        if self.proxy_pool:
            return random.choice(self.proxy_pool)
        return None

    def _build_headers(self) -> dict:
        """Build realistic request headers"""
        user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        ]
        return {
            "User-Agent": random.choice(user_agents),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        }

    def scrape_with_retry(self, url: str, username: Optional[str] = None) -> Optional[dict]:
        """
        Scrape a URL with automatic retry and session management
        """
        for attempt in range(self.max_retries):
            try:
                # Get authenticated client
                if username:
                    # The SessionManager is expected to attach any proxy when it
                    # creates the client: httpx clients cannot change proxies later
                    client = self.session_manager.get_session_client(username)
                    if not client:
                        logger.warning(f"No valid session for {username}")
                        return None
                else:
                    # httpx only accepts a proxy at construction time,
                    # so pass it here rather than mutating the client
                    client = httpx.Client(
                        base_url=self.base_url,
                        proxy=self._get_proxy()
                    )

                # Add realistic headers
                client.headers.update(self._build_headers())

                # Add human-like delay
                time.sleep(random.uniform(1, 3))

                # Make request
                response = client.get(url, timeout=30)
                self.request_count += 1
                logger.info(f"Request {self.request_count}: {response.status_code} {url}")

                # Handle different status codes
                if response.status_code == 200:
                    try:
                        return response.json()
                    except ValueError:
                        return {"html": response.text[:1000]}
                elif response.status_code == 401:
                    if username:
                        logger.warning(f"Session expired for {username}, invalidating...")
                        self.session_manager.mark_session_invalid(username)
                    return None
                elif response.status_code == 429:
                    # Rate limited
                    wait_time = 2 ** attempt * self.backoff_factor
                    logger.warning(f"Rate limited, waiting {wait_time}s...")
                    time.sleep(wait_time)
                    continue
                elif response.status_code == 403:
                    logger.error("403 Forbidden, possible IP block or auth issue")
                    return None
                else:
                    logger.warning(f"Unexpected status {response.status_code}")
                    return None
            except httpx.ConnectError as e:
                logger.warning(f"Connection error (attempt {attempt+1}/{self.max_retries}): {e}")
                self.error_count += 1
                wait_time = 2 ** attempt * self.backoff_factor
                time.sleep(wait_time)
            except httpx.TimeoutException as e:
                logger.warning(f"Timeout (attempt {attempt+1}/{self.max_retries}): {e}")
                self.error_count += 1
                continue
            except Exception as e:
                logger.error(f"Unexpected error: {e}")
                return None

        logger.error(f"Failed after {self.max_retries} attempts")
        return None

    def scrape_paginated(
        self,
        endpoint: str,
        username: Optional[str] = None,
        max_pages: int = 100
    ) -> List[dict]:
        """
        Scrape a paginated endpoint
        """
        results = []
        for page in range(1, max_pages + 1):
            url = f"{endpoint}?page={page}"
            data = self.scrape_with_retry(url, username)
            if data is None:
                logger.warning(f"Failed to scrape page {page}, stopping")
                break
            if isinstance(data, dict) and "items" in data:
                results.extend(data["items"])
                if len(data["items"]) == 0:
                    logger.info(f"No more items after page {page}")
                    break
            logger.info(f"Page {page}: {len(data.get('items', []))} items")
        return results

    def get_statistics(self) -> dict:
        """Get scraping statistics"""
        return {
            "total_requests": self.request_count,
            "errors": self.error_count,
            "error_rate": self.error_count / max(self.request_count, 1)
        }

# Usage example
if __name__ == "__main__":
    from session_manager import SessionManager

    # Initialize session manager
    manager = SessionManager(
        db_path="production_sessions.db",
        base_url="https://api.example.com"
    )

    # Create authenticated sessions
    manager.create_session("user1", "password1")
    manager.create_session("user2", "password2")

    # Initialize scraper
    scraper = ProductionAuthenticatedScraper(
        base_url="https://api.example.com",
        session_manager=manager,
        # In production, supply ThorData residential proxy URLs:
        # https://thordata.partnerstack.com/partner/0a0x4nzh
        proxy_pool=None  # disabled in dev
    )

    # Scrape paginated endpoint
    results = scraper.scrape_paginated(
        "/api/items",
        username="user1",
        max_pages=100
    )
    print(f"Scraped {len(results)} items")
    print(f"Statistics: {scraper.get_statistics()}")
Troubleshooting Guide
Problem: Getting logged out after a few requests
Symptoms: First request works, then subsequent requests get 401 or redirect to login.
Common causes:
1. Session timeout (server-side)
2. Changing IP address (if using proxies without persistent sessions)
3. User-Agent mismatch
4. Cookie attributes not being respected
Solutions:
# Solution 1: Keep User-Agent consistent
client = httpx.Client(
    headers={"User-Agent": "Mozilla/5.0... (fixed)"}
)

# Solution 2: Check session expiration in Set-Cookie headers
response = client.get(...)
print(response.headers.get_list("set-cookie"))

# Solution 3: Use proxy persistence (pair with ThorData sessions)
# Solution 4: Check if the server binds sessions to an IP
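Solutions 3 and 4 go together: if the server binds sessions to an IP, each logged-in session must keep the same exit IP for its whole lifetime. One common pattern is embedding a session id in the proxy username so the provider pins the exit IP. A minimal sketch — the `user-session-<id>` username format below is a hypothetical placeholder, since each provider (ThorData included) documents its own session-pinning syntax:

```python
def sticky_proxy_url(
    user: str, password: str, host: str, port: int, session_id: str
) -> str:
    """Build a proxy URL that pins one exit IP per session.

    Many residential providers keep the same exit IP as long as the
    session id embedded in the proxy username stays constant. The
    "user-session-<id>" format here is an assumption; substitute
    your provider's documented syntax.
    """
    return f"http://{user}-session-{session_id}:{password}@{host}:{port}"

# One sticky URL per account: reuse it for every request in that
# account's session so the server always sees the same IP
proxy_for_user1 = sticky_proxy_url(
    "me", "secret", "proxy.example.net", 7777, "acct-user1"
)
```

Pass the resulting URL once, when constructing the session's client (e.g. `httpx.Client(proxy=proxy_for_user1)`), and never swap it mid-session.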
Problem: Empty responses from authenticated endpoints
Symptoms: Login works, but API responses are empty or contain errors.
Common causes:
1. JavaScript is rendering the content
2. An anti-bot system is blocking you
3. The API requires specific headers
Solutions:
# Check if the site renders content with JavaScript
if "<script" in response.text:
    print("Site uses JavaScript, need Playwright")

# Check for anti-bot headers
if "cloudflare" in response.text.lower():
    print("Cloudflare detected, need Playwright")

# Add required API headers
client.headers.update({
    "X-Requested-With": "XMLHttpRequest",
    "Accept": "application/json",
})
Problem: CSRF token errors (403 Forbidden)
Symptoms: Form submission gets 403, API requests get invalid CSRF errors.
Common causes:
1. CSRF token not extracted correctly
2. CSRF token has expired
3. CSRF token belongs to the wrong form
Solutions:
# Extract a fresh CSRF token for each form
response = client.get("https://example.com/form")
csrf_token = extract_csrf_token(response.text)

# Include it in the POST body
response = client.post(
    "https://example.com/submit",
    data={
        "data": "value",
        "_csrf": csrf_token
    }
)

# Some sites require the CSRF token in headers too
client.headers["X-CSRF-Token"] = csrf_token
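For reference, `extract_csrf_token` can be a simple regex over the form's hidden input. A minimal sketch — the `_csrf` field name is an assumption (Django uses `csrfmiddlewaretoken`, Laravel `_token`, Rails `authenticity_token`), so pass whatever your target form actually uses:

```python
import re
from typing import Optional

def extract_csrf_token(html: str, field_name: str = "_csrf") -> Optional[str]:
    """Pull the CSRF token out of a hidden <input> field.

    Handles both attribute orders (name before value and value
    before name). Returns None if the field is missing.
    """
    escaped = re.escape(field_name)
    patterns = [
        rf'name=["\']{escaped}["\'][^>]*value=["\']([^"\']+)["\']',
        rf'value=["\']([^"\']+)["\'][^>]*name=["\']{escaped}["\']',
    ]
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            return match.group(1)
    return None
```

For production use, an HTML parser (BeautifulSoup or lxml) is more robust than regexes against attribute quoting and whitespace variations.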
Problem: Too many failed requests, then IP gets blocked
Symptoms: Requests work fine for a while, then all requests start failing with 403 or timeout.
Common causes: rate limiting or behavioral detection.
Solutions:
1. Add delays between requests
2. Vary User-Agent and headers
3. Use rotating residential proxies (ThorData at https://thordata.partnerstack.com/partner/0a0x4nzh)
4. Respect robots.txt and crawl-delay
5. Use fewer concurrent connections
# Implement backoff with jitter: wait after every failed attempt,
# whether it was a network error or a non-200 response
for attempt in range(3):
    try:
        response = client.get(url, timeout=30)
        if response.status_code == 200:
            break
    except httpx.HTTPError:
        pass  # treat network errors like a failed attempt
    wait = 2 ** attempt + random.uniform(0, 1)
    time.sleep(wait)
Conclusion
Cookies and sessions are fundamental to web scraping. Understanding them at the protocol level, handling them correctly in your code, and implementing proper persistence and rotation strategies separates working scrapers from ones that fail constantly.
The key takeaways:
- Start with httpx: It handles cookies automatically and is modern, fast, and async-friendly.
- Extract cookies from Set-Cookie headers: They contain domain, path, and expiration information that matters.
- Persist sessions to disk: Use JSON for simple cases, SQLite for production.
- Detect and handle auth failures: Check for 401/403 status codes and redirects to login pages.
- Use Playwright when httpx fails: For JavaScript rendering and anti-bot bypass.
- Rotate sessions at scale: Use a session pool paired with residential proxies (like ThorData) for distributed scraping.
- Make your behavior look human: Add delays, vary headers, navigate like a real user.
- Debug with logging: Always log Set-Cookie headers and current cookies to understand failures.
The code examples in this guide are production-tested patterns used in real scraping infrastructure. Adapt them to your specific needs, and you'll handle any authentication challenge the web throws at you.
For large-scale operations requiring residential IPs and advanced session rotation, consider tools like ThorData (https://thordata.partnerstack.com/partner/0a0x4nzh) which integrate seamlessly with the session management patterns shown here.