
Web Scraping with Python in 2026: The Complete Beginner's Guide

Web scraping means writing code that reads web pages and extracts data from them. Instead of copying information by hand, you write a script that does it in seconds — across hundreds or thousands of pages.

Developers use it for price monitoring, lead generation, market research, content aggregation, and building datasets for machine learning. If the data is on a public webpage, you can probably scrape it.

Python is the most popular language for scraping because of its simple syntax and excellent libraries. This guide will take you from zero to a working scraper — covering tools, practical code, error handling, data storage, anti-bot systems, and proxy usage.

The Tools: Pick What Fits Your Problem

There are four main approaches in 2026, and each fits a different situation:

requests + BeautifulSoup — The classic combo. requests fetches the HTML, BeautifulSoup parses it. Best for static pages where the content is in the HTML source. This is where every beginner should start.

httpx — A modern HTTP client that supports async requests and HTTP/2. Same idea as requests but you can fetch many pages concurrently. Use it when you need speed or modern protocol support.

Playwright / Selenium — These run a real browser. You need them when the page loads content with JavaScript — single-page apps, infinite scroll, anything that doesn't show up in "View Source." Playwright is the better choice in 2026; it's faster and has a cleaner API.

Scrapy — A full framework for large-scale scraping. Built-in crawling, middleware, pipelines, and export. Overkill for small jobs, essential for scraping thousands of pages across multiple sites.

Start with requests + BeautifulSoup. You can always upgrade later.

Installing Dependencies

# Core scraping toolkit
pip install httpx beautifulsoup4 lxml

# For JavaScript-heavy pages
pip install playwright
python -m playwright install chromium

# For large-scale crawling
pip install scrapy

Your First Scraper: Hacker News Front Page

Let's build something real. We'll scrape the Hacker News front page — it's public data and their site allows scraping.

The Code

import httpx
from bs4 import BeautifulSoup
import json

# Fetch the page
response = httpx.get("https://news.ycombinator.com/", headers={
    "User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"
})
response.raise_for_status()

# Parse the HTML
soup = BeautifulSoup(response.text, "lxml")

# Each story is in a <tr> with class "athing"
stories = soup.select("tr.athing")
results = []

for story in stories:
    # Title and link are in a span with class "titleline"
    title_el = story.select_one(".titleline > a")
    if not title_el:
        continue

    title = title_el.get_text()
    link = title_el["href"]

    # Score is in the next sibling row
    score_row = story.find_next_sibling("tr")
    score_el = score_row.select_one(".score") if score_row else None
    points = score_el.get_text() if score_el else "0 points"

    # Comment count
    subtext = score_row.select(".subtext a") if score_row else []
    comments = subtext[-1].get_text() if subtext else "0 comments"

    story_data = {
        "title": title,
        "link": link,
        "points": points,
        "comments": comments,
    }
    results.append(story_data)
    print(f"{points:>12} | {title}")

# Save to JSON
with open("hn_stories.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)

print(f"\nSaved {len(results)} stories")

That's it. About 40 lines of code to pull structured data from a live website and save it.

Understanding HTML Structure

Before you can scrape, you need to understand what you're targeting. Here's the workflow:

  1. Open the page in Chrome/Firefox
  2. Right-click on the element you want to extract
  3. Click "Inspect" (or "Inspect Element")
  4. Look at the HTML structure around your target data
  5. Identify unique selectors: IDs, classes, data attributes

# CSS selector examples
soup.select_one("#main-title")          # element with id="main-title"
soup.select_one(".price")               # element with class="price"
soup.select_one("div.product-card")    # div with class="product-card"
soup.select_one("[data-sku]")           # element with data-sku attribute
soup.select_one("h2 > a")              # a tag that is a direct child of h2
soup.select("ul.reviews li")           # all li inside ul.reviews

# BeautifulSoup find methods
soup.find("div", id="content")
soup.find("span", class_="price-tag")
soup.find("a", href=True)              # any a with an href
soup.find_all("p", limit=5)            # first 5 paragraphs
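
You can try these selectors without touching the network by parsing an inline HTML string. The fragment below is made up for illustration, and the stdlib html.parser is used so the snippet runs even without lxml installed:

```python
from bs4 import BeautifulSoup

html = """
<div class="product-card" data-sku="A100">
  <h2><a href="/p/a100">Blue Widget</a></h2>
  <span class="price">$19.99</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")  # swap in "lxml" if installed

card = soup.select_one("div.product-card")
print(card["data-sku"])                      # attribute access
print(soup.select_one("h2 > a").get_text())  # direct-child selector
print(soup.select_one(".price").get_text())  # class selector
print(soup.select_one("h2 > a")["href"])     # link target
```

Paste in a chunk of real HTML from your target site and iterate on selectors this way before wiring up any requests.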

Handling Pagination

Most sites spread data across multiple pages. Here's the general pattern:

import httpx
from bs4 import BeautifulSoup
import time

def scrape_all_pages(base_url: str, max_pages: int = 10) -> list:
    """Generic paginator — adapt the URL pattern and selector to your target."""
    all_items = []
    session = httpx.Client(headers={
        "User-Agent": "Mozilla/5.0 (compatible; my-scraper/1.0)"
    })

    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        resp = session.get(url, timeout=20)

        if resp.status_code == 404:
            print(f"Page {page} not found — stopping")
            break

        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")

        # Extract items — this selector is site-specific
        items = soup.select(".item-card")
        if not items:
            print(f"No items on page {page} — stopping")
            break

        for item in items:
            all_items.append({
                "title": item.select_one(".title").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True),
                "link": item.find("a")["href"],
            })

        print(f"Page {page}: {len(items)} items (total: {len(all_items)})")
        time.sleep(1)  # Be polite

    session.close()
    return all_items

Common pagination patterns:

# Pattern 1: Query parameter (?page=2, ?p=2, ?offset=20)
url = f"https://example.com/products?page={page_num}"

# Pattern 2: URL path (/products/page/2/)
url = f"https://example.com/products/page/{page_num}/"

# Pattern 3: Offset (?start=20, ?offset=20)
url = f"https://example.com/items?start={offset}"

# Pattern 4: Cursor/token (find next_page link in HTML)
next_link = soup.select_one("a.next-page, a[rel='next'], .pagination .next")
if next_link:
    next_url = next_link["href"]
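
Pattern 4 generalizes into a loop: keep resolving and following the next link until it disappears. Here is a small sketch using the same selector list as above; the fetch callable is injected, so you can pass something like lambda u: httpx.get(u).text or a cached fetcher:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_next_url(html: str, current_url: str):
    """Find the 'next page' link and resolve it against the current URL."""
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.select_one("a.next-page, a[rel='next'], .pagination .next")
    # Relative hrefs (e.g. "/items?page=3") must be made absolute
    return urljoin(current_url, next_link["href"]) if next_link else None

def crawl_all_pages(fetch, start_url: str, max_pages: int = 50) -> list:
    """fetch(url) -> HTML string. Returns the HTML of every page visited."""
    htmls, url = [], start_url
    while url and len(htmls) < max_pages:
        html = fetch(url)
        htmls.append(html)
        url = extract_next_url(html, url)
    return htmls
```

The max_pages cap is a safety net for sites whose "next" link loops back to page one.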

Ethical Scraping: Don't Get Yourself Banned

Scraping is legal for public data in most jurisdictions, but being aggressive will get your IP blocked fast. Follow these rules:

Check robots.txt first. Visit example.com/robots.txt before scraping any site. It tells you which paths are off-limits. Respect it.

import urllib.robotparser
from urllib.parse import urlparse

def is_allowed(url: str, user_agent: str = "*") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

# Check before scraping
if is_allowed("https://example.com/products"):
    print("OK to scrape")
else:
    print("robots.txt disallows this path")

Rate limit yourself. One request per second is a good baseline. For smaller sites, go slower. Never blast a server with concurrent requests.
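
A minimal way to enforce that baseline is a helper that sleeps only for whatever remains of the interval since the last request. A sketch; tune min_interval per site:

```python
import time

class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only for the remainder of the interval, if any
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

limiter = RateLimiter(min_interval=1.0)
# for url in urls:
#     limiter.wait()
#     resp = httpx.get(url)
```

Unlike a bare time.sleep(1) in the loop, this doesn't penalize you when parsing and saving already took most of the second.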

Set a User-Agent. Identify your scraper. Some sites block requests with no User-Agent or the default Python one.

Cache aggressively. During development, save HTML to disk so you're not hitting the server on every test run.

from pathlib import Path
import hashlib

def cached_fetch(url: str, cache_dir: str = ".cache") -> str:
    """Fetch URL, using local cache to avoid redundant requests."""
    Path(cache_dir).mkdir(exist_ok=True)
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_file = Path(cache_dir) / f"{cache_key}.html"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    resp = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=20)
    resp.raise_for_status()
    cache_file.write_text(resp.text, encoding="utf-8")
    return resp.text

html = cached_fetch("https://news.ycombinator.com/")

Saving Data: CSV, JSON, SQLite

import json
import csv
import sqlite3

# ---- JSON ----
def save_json(data: list, filename: str):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(data)} records to {filename}")


# ---- CSV ----
def save_csv(data: list, filename: str, fieldnames: list = None):
    if not data:
        return
    fieldnames = fieldnames or list(data[0].keys())
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} rows to {filename}")


# ---- SQLite ----
def save_sqlite(data: list, db_path: str, table: str = "items"):
    if not data:
        return
    conn = sqlite3.connect(db_path)
    cols = list(data[0].keys())
    placeholders = ", ".join("?" * len(cols))
    col_defs = ", ".join(f"{c} TEXT" for c in cols)

    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_defs})")
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(row.get(c) for c in cols) for row in data]
    )
    conn.commit()
    conn.close()
    print(f"Saved {len(data)} records to {db_path}")

Async Scraping with httpx

When you need speed, fetch pages concurrently:

import asyncio
import httpx
from bs4 import BeautifulSoup

async def fetch_one(client: httpx.AsyncClient, url: str) -> dict:
    """Fetch a single URL and return parsed data."""
    try:
        resp = await client.get(url, timeout=20)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")
        return {
            "url": url,
            "title": soup.find("title").get_text() if soup.find("title") else "",
            "status": resp.status_code,
        }
    except Exception as e:
        return {"url": url, "error": str(e), "status": None}


async def fetch_all(urls: list, max_concurrent: int = 10) -> list:
    """Fetch many URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results = []

    async def bounded_fetch(client, url):
        async with semaphore:
            result = await fetch_one(client, url)
            # Polite delay even in async
            await asyncio.sleep(0.5)
            return result

    async with httpx.AsyncClient(
        headers={"User-Agent": "Mozilla/5.0"},
        follow_redirects=True,
    ) as client:
        tasks = [bounded_fetch(client, url) for url in urls]
        results = await asyncio.gather(*tasks)

    return list(results)


# Run async scraper
urls = [f"https://news.ycombinator.com/news?p={i}" for i in range(1, 6)]
results = asyncio.run(fetch_all(urls, max_concurrent=3))
for r in results:
    print(f"{r['status']} — {r['url']}")

JavaScript-Rendered Pages with Playwright

When the content isn't in the HTML source (check with View Source), use Playwright:

from playwright.sync_api import sync_playwright

def scrape_with_browser(url: str) -> str:
    """Render page with a real browser and return HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            viewport={"width": 1280, "height": 720},
        )
        page = context.new_page()
        page.goto(url, wait_until="networkidle", timeout=30000)

        # Wait for specific element to appear
        try:
            page.wait_for_selector(".product-grid", timeout=10000)
        except Exception:
            pass

        html = page.content()
        browser.close()
        return html


# Now parse the fully-rendered HTML
html = scrape_with_browser("https://example-spa.com/products")
soup = BeautifulSoup(html, "lxml")
products = soup.select(".product-card")
print(f"Found {len(products)} products")

Playwright for dynamic interactions:

from playwright.sync_api import sync_playwright

def scrape_with_login(url: str, username: str, password: str) -> list:
    """Handle login and scrape authenticated content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Log in
        page.goto("https://example.com/login")
        page.fill("#username", username)
        page.fill("#password", password)
        page.click("button[type='submit']")
        page.wait_for_url("**/dashboard")  # wait for redirect

        # Now scrape the protected page
        page.goto(url)
        page.wait_for_selector(".data-table")

        # Handle infinite scroll
        for _ in range(5):
            page.keyboard.press("End")
            page.wait_for_timeout(1000)

        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "lxml")
    return [row.get_text(strip=True) for row in soup.select(".data-table tr")]

When You Need Proxies

You need proxies when:

  - A site blocks your IP after a few hundred requests
  - You need to see geo-specific content (prices, availability in different regions)
  - You're running scrapers in production and can't afford downtime from IP bans
  - You're scraping from a server with a datacenter IP (very commonly blocked)

Residential proxies route your requests through real household IP addresses, making them much harder for sites to detect and block.

ThorData offers a residential proxy network with 200M+ IPs across 195 countries:

import httpx

PROXY_USER = "your_username"
PROXY_PASS = "your_password"

def get_proxy(country: str = None) -> str:
    user = PROXY_USER
    if country:
        user += f"-country-{country}"
    return f"http://{user}:{PROXY_PASS}@proxy.thordata.net:9000"

# Basic usage with httpx
response = httpx.get(
    "https://example.com/products",
    proxy=get_proxy("us"),
    headers={"User-Agent": "Mozilla/5.0"}
)

# With requests library
import requests
session = requests.Session()
session.proxies = {"http": get_proxy("us"), "https": get_proxy("us")}
resp = session.get("https://example.com/products")

Start without proxies. Add them when you actually hit blocks.

Common Errors and How to Fix Them

403 Forbidden

The server is rejecting your request. Usually because:

  1. Missing or bad User-Agent. Always set one that looks like a real browser.
  2. No cookies/session. Some sites require you to load the homepage first. Use httpx.Client() to maintain a session.
  3. Rate limited. Slow down and add delays.

# Use a session to maintain cookies
with httpx.Client(headers={"User-Agent": "Mozilla/5.0"}) as client:
    client.get("https://example.com")  # load homepage, get cookies
    response = client.get("https://example.com/data")  # now fetch data with session cookies

Empty Results (No Data Extracted)

Your selectors are probably wrong, or the content loads via JavaScript.

  1. Check "View Source" (Ctrl+U), not Inspect Element. If your data isn't in the raw HTML, the page uses JavaScript rendering — you need Playwright.
  2. Selectors changed. Websites update their HTML. Your .price-tag class might now be .product-price. Re-inspect the page.
  3. Debug by printing partial soup. print(soup.prettify()[:3000]) shows you what actually came back.

# Debug helper
def debug_response(soup, selector: str):
    """Check if selector works and print context."""
    elements = soup.select(selector)
    print(f"Found {len(elements)} elements for: {selector}")
    if elements:
        print("First match:", elements[0].get_text(strip=True)[:100])
    else:
        # Show available classes/ids to help diagnose
        all_classes = set()
        for tag in soup.find_all(True):
            all_classes.update(tag.get("class", []))
        print("Available classes (sample):", list(all_classes)[:20])

Connection Timeouts

# Retries with exponential backoff (pip install tenacity)
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=30))
def fetch_with_retry(url: str) -> str:
    resp = httpx.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
    resp.raise_for_status()
    return resp.text

Cloudflare / Anti-Bot Protection

Modern anti-bot systems fingerprint your TLS connection, check JavaScript execution, and analyze behavior patterns.

For light protection: use httpx with proper headers and a real-looking User-Agent. For heavy protection: use Playwright with stealth settings, or switch to residential proxies from ThorData which provide IPs that pass fingerprinting checks.

# curl-cffi (pip install curl-cffi): impersonates a real browser TLS fingerprint
from curl_cffi import requests as cffi_requests

resp = cffi_requests.get(
    "https://cloudflare-protected-site.com/",
    impersonate="chrome124"  # mimics Chrome 124 TLS fingerprint
)

Building a Real Project: Price Tracker

Here's a practical end-to-end example: a price tracker that monitors product prices and saves history to SQLite.

import httpx
import sqlite3
from bs4 import BeautifulSoup
from datetime import datetime
import time

def setup_db(db_path="prices.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS price_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT,
            product_name TEXT,
            price TEXT,
            scraped_at TEXT
        )
    """)
    conn.commit()
    return conn


def scrape_price(url: str) -> dict:
    """
    Generic price scraper — adapt selectors to your target site.
    This example targets a generic e-commerce structure.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    }
    resp = httpx.get(url, headers=headers, timeout=20, follow_redirects=True)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    # Try multiple common price selectors
    price = None
    for selector in [
        "[itemprop='price']",
        ".price", "#price", ".product-price",
        "[data-price]", ".offer-price",
    ]:
        el = soup.select_one(selector)
        if el:
            price = el.get("content") or el.get_text(strip=True)
            break

    # Try multiple name selectors
    name = None
    for selector in ["h1", "[itemprop='name']", ".product-title", "#productTitle"]:
        el = soup.select_one(selector)
        if el:
            name = el.get_text(strip=True)[:200]
            break

    return {"url": url, "name": name, "price": price}


def track_prices(urls: list, db_path="prices.db", interval_hours: int = 6):
    """Monitor prices on a schedule."""
    conn = setup_db(db_path)

    while True:
        now = datetime.utcnow().isoformat()
        for url in urls:
            try:
                data = scrape_price(url)
                conn.execute(
                    "INSERT INTO price_history (url, product_name, price, scraped_at) VALUES (?,?,?,?)",
                    (data["url"], data["name"], data["price"], now)
                )
                conn.commit()
                print(f"{now} | {(data['name'] or '?')[:50]} | {data['price']}")
            except Exception as e:
                print(f"Error scraping {url}: {e}")

            time.sleep(2)

        print(f"\nSleeping {interval_hours}h until next check...")
        time.sleep(interval_hours * 3600)
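
Once some history accumulates, the latest price per product is one query away. SQLite returns the bare columns from the row holding MAX(scraped_at), which keeps this short. The demo below runs against an in-memory copy of the price_history schema with fake rows:

```python
import sqlite3

def latest_prices(conn):
    """Most recent (url, name, price, timestamp) per tracked URL."""
    return conn.execute("""
        SELECT url, product_name, price, MAX(scraped_at)
        FROM price_history
        GROUP BY url
    """).fetchall()

# Demo: in-memory database with the same schema as setup_db()
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE price_history (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT, product_name TEXT, price TEXT, scraped_at TEXT
    )
""")
conn.executemany(
    "INSERT INTO price_history (url, product_name, price, scraped_at) VALUES (?,?,?,?)",
    [("u1", "Widget", "$10", "2026-01-01T00:00:00"),
     ("u1", "Widget", "$12", "2026-01-02T00:00:00")],
)
for row in latest_prices(conn):
    print(row)
```

Because scraped_at is stored as ISO-8601 text, MAX() compares it correctly as a string.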

What to Build Next

Now that you have the fundamentals:

  1. Pick a real project. Price tracker, job board aggregator, news digest — something you'd actually use.
  2. Save data to a database. SQLite is perfect for most projects.
  3. Add error handling. Wrap requests in try/except, handle timeouts, retry on failures.
  4. Try async scraping. Use httpx.AsyncClient to fetch multiple pages concurrently.
  5. Learn XPath. Alternative to CSS selectors, sometimes more precise.
  6. Explore Scrapy. When you outgrow simple scripts, Scrapy provides middleware, pipelines, and built-in concurrency.
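
As a taste of item 5, here is extraction done with XPath via lxml directly; the HTML fragment is made up for illustration:

```python
from lxml import html as lxml_html

doc = lxml_html.fromstring("""
<ul class="reviews">
  <li><span class="author">Ana</span> Great product</li>
  <li><span class="author">Bo</span> Works fine</li>
</ul>
""")

# XPath: text of every author span inside ul.reviews
authors = doc.xpath("//ul[@class='reviews']//span[@class='author']/text()")
print(authors)

# XPath can express things CSS cannot, e.g. "li whose text contains 'fine'"
hits = doc.xpath("//li[contains(., 'fine')]")
print(len(hits))
```

Text-content predicates like contains() are the usual reason to reach for XPath over CSS selectors.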

Quick Reference

Situation                   | Tool
----------------------------|-----------------------------------
Static HTML pages           | requests or httpx + BeautifulSoup
Need speed, async           | httpx.AsyncClient
JavaScript-rendered content | Playwright
Large-scale crawling        | Scrapy
Behind Cloudflare           | curl-cffi or Playwright
Getting IP blocked          | Residential proxies (ThorData)
Need to log in              | Playwright with fill/click

Web scraping is one of those skills where the basics are simple but mastery takes practice. Start with the Hacker News example above, modify it for a site you actually care about, and build from there.