Web Scraping with Python in 2026: The Complete Beginner's Guide
Web scraping means writing code that reads web pages and extracts data from them. Instead of copying information by hand, you write a script that does it in seconds — across hundreds or thousands of pages.
Developers use it for price monitoring, lead generation, market research, content aggregation, and building datasets for machine learning. If the data is on a public webpage, you can probably scrape it.
Python is the most popular language for scraping because of its simple syntax and excellent libraries. This guide will take you from zero to a working scraper — covering tools, practical code, error handling, data storage, anti-bot systems, and proxy usage.
The Tools: Pick What Fits Your Problem
There are four main approaches in 2026, and each fits a different situation:
requests + BeautifulSoup — The classic combo. requests fetches the HTML, BeautifulSoup parses it. Best for static pages where the content is in the HTML source. This is where every beginner should start.
httpx — A modern HTTP client that supports async requests and HTTP/2. Same idea as requests but you can fetch many pages concurrently. Use it when you need speed or modern protocol support.
Playwright / Selenium — These run a real browser. You need them when the page loads content with JavaScript — single-page apps, infinite scroll, anything that doesn't show up in "View Source." Playwright is the better choice in 2026; it's faster and has a cleaner API.
Scrapy — A full framework for large-scale scraping. Built-in crawling, middleware, pipelines, and export. Overkill for small jobs, essential for scraping thousands of pages across multiple sites.
Start with requests + BeautifulSoup. You can always upgrade later.
Installing Dependencies
# Core scraping toolkit
pip install httpx beautifulsoup4 lxml
# For JavaScript-heavy pages
pip install playwright
python -m playwright install chromium
# For large-scale crawling
pip install scrapy
Your First Scraper: Hacker News Front Page
Let's build something real. We'll scrape the Hacker News front page — the data is public, and light, rate-limited scraping of it is generally tolerated.
The Code
import httpx
from bs4 import BeautifulSoup
import json

# Fetch the page
response = httpx.get("https://news.ycombinator.com/", headers={
    "User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"
})
response.raise_for_status()

# Parse the HTML
soup = BeautifulSoup(response.text, "lxml")

# Each story is in a <tr> with class "athing"
stories = soup.select("tr.athing")

results = []
for story in stories:
    # Title and link are in a span with class "titleline"
    title_el = story.select_one(".titleline > a")
    if not title_el:
        continue
    title = title_el.get_text()
    link = title_el["href"]

    # Score is in the next sibling row
    score_row = story.find_next_sibling("tr")
    score_el = score_row.select_one(".score") if score_row else None
    points = score_el.get_text() if score_el else "0 points"

    # Comment count
    subtext = score_row.select(".subtext a") if score_row else []
    comments = subtext[-1].get_text() if subtext else "0 comments"

    story_data = {
        "title": title,
        "link": link,
        "points": points,
        "comments": comments,
    }
    results.append(story_data)
    print(f"{points:>12} | {title}")

# Save to JSON
with open("hn_stories.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"\nSaved {len(results)} stories")
That's it. About 30 lines of code to pull structured data from a live website and save it.
Understanding HTML Structure
Before you can scrape, you need to understand what you're targeting. Here's the workflow:
- Open the page in Chrome/Firefox
- Right-click on the element you want to extract
- Click "Inspect" (or "Inspect Element")
- Look at the HTML structure around your target data
- Identify unique selectors: IDs, classes, data attributes
# CSS selector examples
soup.select_one("#main-title") # element with id="main-title"
soup.select_one(".price") # element with class="price"
soup.select_one("div.product-card") # div with class="product-card"
soup.select_one("[data-sku]") # element with data-sku attribute
soup.select_one("h2 > a") # a tag that is a direct child of h2
soup.select("ul.reviews li") # all li inside ul.reviews
# BeautifulSoup find methods
soup.find("div", id="content")
soup.find("span", class_="price-tag")
soup.find("a", href=True) # any a with an href
soup.find_all("p", limit=5) # first 5 paragraphs
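If you prefer XPath over CSS selectors (it comes up again later in this guide), lxml, which you installed above, can parse HTML directly. A quick sketch on a made-up HTML fragment; the markup here is illustrative, not from any real site:

```python
from lxml import html

# Hypothetical product markup, inlined so the example is self-contained
doc = html.fromstring("""
<div class="product-card" data-sku="A1">
  <h2><a href="/p/a1">Widget</a></h2>
  <span class="price">$9.99</span>
</div>
""")

title = doc.xpath("//h2/a/text()")[0]                 # text of the link inside h2
link = doc.xpath("//h2/a/@href")[0]                   # href attribute
price = doc.xpath("//span[@class='price']/text()")[0] # element by class
sku = doc.xpath("//div[@data-sku]/@data-sku")[0]      # data attribute value
print(title, link, price, sku)
```

XPath expressions return lists, so index into the result or check it is non-empty before using it.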
Handling Pagination
Most sites spread data across multiple pages. Here's the general pattern:
import httpx
from bs4 import BeautifulSoup
import time
def scrape_all_pages(base_url: str, max_pages: int = 10) -> list:
    """Generic paginator — adapt the URL pattern and selector to your target."""
    all_items = []
    session = httpx.Client(headers={
        "User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"
    })
    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        resp = session.get(url, timeout=20)
        if resp.status_code == 404:
            print(f"Page {page} not found — stopping")
            break
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")

        # Extract items — this selector is site-specific
        items = soup.select(".item-card")
        if not items:
            print(f"No items on page {page} — stopping")
            break

        for item in items:
            all_items.append({
                "title": item.select_one(".title").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
                "link": item.find("a")["href"],
            })

        print(f"Page {page}: {len(items)} items (total: {len(all_items)})")
        time.sleep(1)  # Be polite
    session.close()
    return all_items
Common pagination patterns:
# Pattern 1: Query parameter (?page=2, ?p=2, ?offset=20)
url = f"https://example.com/products?page={page_num}"
# Pattern 2: URL path (/products/page/2/)
url = f"https://example.com/products/page/{page_num}/"
# Pattern 3: Offset (?start=20, ?offset=20)
url = f"https://example.com/items?start={offset}"
# Pattern 4: Cursor/token (find next_page link in HTML)
next_link = soup.select_one("a.next-page, a[rel='next'], .pagination .next")
if next_link:
    next_url = next_link["href"]
Ethical Scraping: Don't Get Yourself Banned
Scraping is legal for public data in most jurisdictions, but being aggressive will get your IP blocked fast. Follow these rules:
Check robots.txt first. Visit example.com/robots.txt before scraping any site. It tells you which paths are off-limits. Respect it.
import urllib.robotparser
from urllib.parse import urlparse

def is_allowed(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Check before scraping
if is_allowed("https://example.com/products"):
    print("OK to scrape")
else:
    print("robots.txt disallows this path")
Rate limit yourself. One request per second is a good baseline. For smaller sites, go slower. Never blast a server with concurrent requests.
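One way to enforce a baseline delay without sprinkling time.sleep() calls everywhere is a small throttle object. A sketch (the class name, defaults, and jitter are my own choices, not from any library):

```python
import time
import random

class Throttle:
    """Waits so consecutive requests are at least `delay` seconds apart.

    A little random jitter makes the traffic look less mechanical.
    """
    def __init__(self, delay: float = 1.0, jitter: float = 0.5):
        self.delay = delay
        self.jitter = jitter
        self.last_request = 0.0

    def wait(self):
        elapsed = time.monotonic() - self.last_request
        pause = self.delay + random.uniform(0, self.jitter) - elapsed
        if pause > 0:
            time.sleep(pause)
        self.last_request = time.monotonic()

throttle = Throttle(delay=1.0)
# Call throttle.wait() before each request:
#     throttle.wait()
#     resp = httpx.get(url)
```

One instance per target host keeps the pacing correct even when your code fetches from several sites.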
Set a User-Agent. Identify your scraper. Some sites block requests with no User-Agent or the default Python one.
Cache aggressively. During development, save HTML to disk so you're not hitting the server on every test run.
import httpx
from pathlib import Path
import hashlib

def cached_fetch(url: str, cache_dir: str = ".cache") -> str:
    """Fetch URL, using local cache to avoid redundant requests."""
    Path(cache_dir).mkdir(exist_ok=True)
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_file = Path(cache_dir) / f"{cache_key}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    resp = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=20)
    resp.raise_for_status()
    cache_file.write_text(resp.text, encoding="utf-8")
    return resp.text

html = cached_fetch("https://news.ycombinator.com/")
Saving Data: CSV, JSON, SQLite
import json
import csv
import sqlite3

# ---- JSON ----
def save_json(data: list, filename: str):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(data)} records to {filename}")

# ---- CSV ----
def save_csv(data: list, filename: str, fieldnames: list = None):
    if not data:
        return
    fieldnames = fieldnames or list(data[0].keys())
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} rows to {filename}")

# ---- SQLite ----
def save_sqlite(data: list, db_path: str, table: str = "items"):
    if not data:
        return
    conn = sqlite3.connect(db_path)
    cols = list(data[0].keys())
    placeholders = ", ".join("?" * len(cols))
    col_defs = ", ".join(f"{c} TEXT" for c in cols)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_defs})")
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(row.get(c) for c in cols) for row in data]
    )
    conn.commit()
    conn.close()
    print(f"Saved {len(data)} records to {db_path}")
Async Scraping with httpx
When you need speed, fetch pages concurrently:
import asyncio
import httpx
from bs4 import BeautifulSoup

async def fetch_one(client: httpx.AsyncClient, url: str) -> dict:
    """Fetch a single URL and return parsed data."""
    try:
        resp = await client.get(url, timeout=20)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")
        return {
            "url": url,
            "title": soup.find("title").get_text() if soup.find("title") else "",
            "status": resp.status_code,
        }
    except Exception as e:
        return {"url": url, "error": str(e), "status": None}

async def fetch_all(urls: list, max_concurrent: int = 10) -> list:
    """Fetch many URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(client, url):
        async with semaphore:
            result = await fetch_one(client, url)
            # Polite delay even in async
            await asyncio.sleep(0.5)
            return result

    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0"},
        follow_redirects=True,
    ) as client:
        tasks = [bounded_fetch(client, url) for url in urls]
        results = await asyncio.gather(*tasks)
    return list(results)

# Run async scraper
urls = [f"https://news.ycombinator.com/news?p={i}" for i in range(1, 6)]
results = asyncio.run(fetch_all(urls, max_concurrent=3))
for r in results:
    print(f"{r['status']} — {r['url']}")
JavaScript-Rendered Pages with Playwright
When the content isn't in the HTML source (check with View Source), use Playwright:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_browser(url: str) -> str:
    """Render page with a real browser and return HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={"width": 1280, "height": 720},
        )
        page = context.new_page()
        page.goto(url, wait_until="networkidle", timeout=30000)
        # Wait for specific element to appear
        try:
            page.wait_for_selector(".product-grid", timeout=10000)
        except Exception:
            pass
        html = page.content()
        browser.close()
    return html

# Now parse the fully-rendered HTML
html = scrape_with_browser("https://example-spa.com/products")
soup = BeautifulSoup(html, "lxml")
products = soup.select(".product-card")
print(f"Found {len(products)} products")
Playwright for dynamic interactions:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_login(url: str, username: str, password: str) -> list:
    """Handle login and scrape authenticated content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Log in
        page.goto("https://example.com/login")
        page.fill("#username", username)
        page.fill("#password", password)
        page.click("button[type='submit']")
        page.wait_for_url("**/dashboard")  # wait for redirect

        # Now scrape the protected page
        page.goto(url)
        page.wait_for_selector(".data-table")

        # Handle infinite scroll
        for _ in range(5):
            page.keyboard.press("End")
            page.wait_for_timeout(1000)

        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "lxml")
    return [row.get_text(strip=True) for row in soup.select(".data-table tr")]
When You Need Proxies
You need proxies when:
- A site blocks your IP after a few hundred requests
- You need to see geo-specific content (prices, availability in different regions)
- You're running scrapers in production and can't afford downtime from IP bans
- You're scraping from a server with a datacenter IP (very commonly blocked)
Residential proxies route your requests through real household IP addresses, making them much harder for sites to detect and block.
ThorData offers a residential proxy network with 200M+ IPs across 195 countries:
import httpx

PROXY_USER = "your_username"
PROXY_PASS = "your_password"

def get_proxy(country: str = None) -> str:
    user = PROXY_USER
    if country:
        user += f"-country-{country}"
    return f"http://{user}:{PROXY_PASS}@proxy.thordata.net:9000"

# Basic usage with httpx
response = httpx.get(
    "https://example.com/products",
    proxy=get_proxy("us"),
    headers={"User-Agent": "Mozilla/5.0"}
)

# With requests library
import requests
session = requests.Session()
session.proxies = {"https": get_proxy("us")}
resp = session.get("https://example.com/products")
Start without proxies. Add them when you actually hit blocks.
Common Errors and How to Fix Them
403 Forbidden
The server is rejecting your request. Usually because:
- Missing or bad User-Agent. Always set one that looks like a real browser.
- No cookies/session. Some sites require you to load the homepage first. Use httpx.Client() to maintain a session.
- Rate limited. Slow down and add delays.
# Use a session to maintain cookies
with httpx.Client(headers={"User-Agent": "Mozilla/5.0"}) as client:
    client.get("https://example.com")  # load homepage, get cookies
    response = client.get("https://example.com/data")  # now fetch data with session cookies
Empty Results (No Data Extracted)
Your selectors are probably wrong, or the content loads via JavaScript.
- Check "View Source" (Ctrl+U), not Inspect Element. If your data isn't in the raw HTML, the page uses JavaScript rendering — you need Playwright.
- Selectors changed. Websites update their HTML. Your .price-tag class might now be .product-price. Re-inspect the page.
- Debug by printing partial soup. print(soup.prettify()[:3000]) shows you what actually came back.
# Debug helper
def debug_response(soup, selector: str):
    """Check if selector works and print context."""
    elements = soup.select(selector)
    print(f"Found {len(elements)} elements for: {selector}")
    if elements:
        print("First match:", elements[0].get_text(strip=True)[:100])
    else:
        # Show available classes/ids to help diagnose
        all_classes = set()
        for tag in soup.find_all(True):
            all_classes.update(tag.get("class", []))
        print("Available classes (sample):", list(all_classes)[:20])
Connection Timeouts
# pip install tenacity
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=30))
def fetch_with_retry(url: str) -> str:
    resp = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
    resp.raise_for_status()
    return resp.text
Cloudflare / Anti-Bot Protection
Modern anti-bot systems fingerprint your TLS connection, check JavaScript execution, and analyze behavior patterns.
For light protection: use httpx with proper headers and a real-looking User-Agent. For heavy protection: use Playwright with stealth settings, or switch to residential proxies from ThorData which provide IPs that pass fingerprinting checks.
# curl-cffi impersonates a real browser TLS fingerprint (pip install curl-cffi)
from curl_cffi import requests as cffi_requests

resp = cffi_requests.get(
    "https://cloudflare-protected-site.com/",
    impersonate="chrome124"  # mimics Chrome 124's TLS fingerprint
)
Building a Real Project: Price Tracker
Here's a practical end-to-end example: a price tracker that monitors product prices and saves history to SQLite.
import httpx
import sqlite3
from bs4 import BeautifulSoup
from datetime import datetime, timezone
import time

def setup_db(db_path="prices.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS price_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT,
            product_name TEXT,
            price TEXT,
            scraped_at TEXT
        )
    """)
    conn.commit()
    return conn

def scrape_price(url: str) -> dict:
    """
    Generic price scraper — adapt selectors to your target site.
    This example targets a generic e-commerce structure.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    resp = httpx.get(url, headers=headers, timeout=20, follow_redirects=True)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    # Try multiple common price selectors
    price = None
    for selector in [
        "[itemprop='price']",
        ".price", "#price", ".product-price",
        "[data-price]", ".offer-price",
    ]:
        el = soup.select_one(selector)
        if el:
            price = el.get("content") or el.get_text(strip=True)
            break

    # Try multiple name selectors
    name = None
    for selector in ["h1", "[itemprop='name']", ".product-title", "#productTitle"]:
        el = soup.select_one(selector)
        if el:
            name = el.get_text(strip=True)[:200]
            break

    return {"url": url, "name": name, "price": price}

def track_prices(urls: list, db_path="prices.db", interval_hours: int = 6):
    """Monitor prices on a schedule."""
    conn = setup_db(db_path)
    while True:
        now = datetime.now(timezone.utc).isoformat()
        for url in urls:
            try:
                data = scrape_price(url)
                conn.execute(
                    "INSERT INTO price_history (url, product_name, price, scraped_at) VALUES (?,?,?,?)",
                    (data["url"], data["name"], data["price"], now)
                )
                conn.commit()
                print(f"{now} | {(data['name'] or '?')[:50]} | {data['price']}")
            except Exception as e:
                print(f"Error scraping {url}: {e}")
            time.sleep(2)
        print(f"\nSleeping {interval_hours}h until next check...")
        time.sleep(interval_hours * 3600)
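Once the tracker has been running for a while, you'll want to read the history back out. A sketch of a summary query (the function name is my own; note the tracker stores prices as text, so for numeric comparisons you'd parse them first):

```python
import sqlite3

def latest_prices(db_path: str = "prices.db") -> list:
    """Latest recorded price per product, plus how many times it was checked.

    Relies on SQLite's documented "bare columns in an aggregate query"
    behavior: with MAX(scraped_at), the other columns come from that
    newest row.
    """
    conn = sqlite3.connect(db_path)
    rows = conn.execute("""
        SELECT product_name,
               price,
               MAX(scraped_at) AS last_seen,
               COUNT(*)        AS checks
        FROM price_history
        GROUP BY product_name
        ORDER BY product_name
    """).fetchall()
    conn.close()
    return rows

# for name, price, last_seen, checks in latest_prices():
#     print(f"{name}: {price} (as of {last_seen}, {checks} checks)")
```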
What to Build Next
Now that you have the fundamentals:
- Pick a real project. Price tracker, job board aggregator, news digest — something you'd actually use.
- Save data to a database. SQLite is perfect for most projects.
- Add error handling. Wrap requests in try/except, handle timeouts, retry on failures.
- Try async scraping. Use httpx.AsyncClient to fetch multiple pages concurrently.
- Learn XPath. Alternative to CSS selectors, sometimes more precise.
- Explore Scrapy. When you outgrow simple scripts, Scrapy provides middleware, pipelines, and built-in concurrency.
Quick Reference
| Situation | Tool |
|---|---|
| Static HTML pages | requests or httpx + BeautifulSoup |
| Need speed, async | httpx.AsyncClient |
| JavaScript-rendered content | Playwright |
| Large-scale crawling | Scrapy |
| Behind Cloudflare | curl-cffi or Playwright |
| Getting IP blocked | Residential proxies (ThorData) |
| Need to log in | Playwright with fill/click |
Web scraping is one of those skills where the basics are simple but mastery takes practice. Start with the Hacker News example above, modify it for a site you actually care about, and build from there.