20+ Free Web Scrapers for Developers in 2026 (No API Key Required)
In 2026, getting data from the web's biggest platforms has never been more expensive — or more restricted. LinkedIn charges anywhere from nothing to $99,000/year for API access depending on your use case. Twitter/X now demands $100/month minimum just for basic read access. Reddit's API crackdown in 2023 killed dozens of third-party apps and left developers scrambling. Amazon's Product Advertising API requires an active affiliate partnership with proven sales volume. And those are just the biggest offenders.

For indie developers, researchers, and small teams, paying enterprise API prices is simply not an option. The good news: a new generation of open, free scrapers has emerged that bypass these paywalls entirely — no API key, no partnership approval, no waitlist.

Here's a curated roundup of 20+ production-ready scrapers you can run today, all free to start. But first — let's understand how these scrapers actually work, and how you can build or extend them yourself.


Understanding Modern Web Scraping Architecture

Before diving into the tool list, it helps to understand the three-layer architecture most modern scrapers use:

Request Layer  -->  Extraction Layer  -->  Storage Layer
     |                    |                     |
 HTTP client        HTML/JSON parse         SQLite/JSON
 Proxy pool         XPath/CSS/regex         Database sink
 Rate limiter       Browser render          Queue/stream

The "request layer" is where most scraping projects fail. Platforms have invested heavily in bot detection: IP reputation scoring, TLS fingerprinting, behavioral analytics, and CAPTCHA systems. Getting the extraction logic right is only half the battle.
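A minimal sketch of the three layers wired together, with the request layer injected as a plain function so sessions, proxies, and rate limiting can be swapped without touching extraction or storage (all names here are illustrative, and the regex-based extractor is a stand-in for a real parser):

```python
import json
import os
import re
import tempfile
from typing import Callable

def extract_titles(html: str) -> list:
    """Extraction layer: pull <h2> headings (a real scraper would use lxml or BS4)."""
    return [{"title": m.strip()} for m in re.findall(r"<h2>(.*?)</h2>", html)]

def store_jsonl(records: list, path: str) -> int:
    """Storage layer: append records as JSON lines; returns how many were written."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return len(records)

def run_pipeline(url: str, fetch: Callable, path: str) -> int:
    """Wire the three layers together; each one can be swapped independently."""
    html = fetch(url)                 # request layer: session, proxies, rate limiter live here
    records = extract_titles(html)    # extraction layer
    return store_jsonl(records, path) # storage layer

# Demo with a stubbed request layer (no network needed):
out_path = os.path.join(tempfile.gettempdir(), "pipeline_demo.jsonl")
fake_fetch = lambda url: "<h2>First result</h2><h2>Second result</h2>"
written = run_pipeline("https://example.com", fake_fetch, out_path)
```

Because `fetch` is just a callable, the same pipeline runs against a requests Session today and a headless browser tomorrow.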


Setting Up Your Python Scraping Environment

All examples below use Python 3.10+. Install the core dependencies once:

pip install requests httpx beautifulsoup4 lxml playwright selectolax parsel fake-useragent
playwright install chromium

A reusable session with anti-bot headers:

import requests
from fake_useragent import UserAgent
import time
import random

ua = UserAgent()

def create_session(proxy=None):
    """Create a requests Session with realistic browser headers."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": ua.random,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "DNT": "1",
    })
    if proxy:
        session.proxies = {"http": proxy, "https": proxy}
    return session

def polite_get(session, url, min_delay=1.0, max_delay=3.0):
    """Fetch a URL with a random delay to avoid rate limiting."""
    time.sleep(random.uniform(min_delay, max_delay))
    response = session.get(url, timeout=20)
    response.raise_for_status()
    return response

The Scrapers

Search & Discovery

Google Search Scraper — Pulls organic search results for any query without needing a Google API key or paying for the Custom Search JSON API. Ideal for rank tracking, keyword research, or building search-driven apps.

Here is a minimal Python implementation using requests + BeautifulSoup:

from bs4 import BeautifulSoup
import requests

def scrape_google_search(query, num_results=10, proxy=None):
    """
    Scrape Google search results for a given query.
    Use residential proxies for production to avoid blocks.
    """
    session = create_session(proxy)
    url = "https://www.google.com/search"
    params = {"q": query, "num": num_results, "hl": "en"}

    resp = session.get(url, params=params, timeout=15)
    soup = BeautifulSoup(resp.text, "lxml")

    results = []
    for g in soup.select("div.g"):
        title_el = g.select_one("h3")
        link_el = g.select_one("a[href]")
        snippet_el = g.select_one("div[data-sncf]") or g.select_one(".VwiC3b")

        if title_el and link_el:
            results.append({
                "title": title_el.get_text(strip=True),
                "url": link_el["href"],
                "snippet": snippet_el.get_text(strip=True) if snippet_el else "",
            })
    return results

results = scrape_google_search("python web scraping tutorial 2026", num_results=10)
for r in results:
    print(f"  {r['title']}")
    print(f"  {r['url']}")
    print()

Hacker News Scraper — Collects stories, comments, scores, and metadata from Hacker News. HN has a public Firebase API with no auth required:

import requests

def get_hn_top_stories(limit=30):
    """Fetch top Hacker News stories using the official Firebase API."""
    ids = requests.get("https://hacker-news.firebaseio.com/v0/topstories.json").json()

    stories = []
    for story_id in ids[:limit]:
        item = requests.get(
            f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
        ).json()
        if item and item.get("type") == "story":
            stories.append({
                "id": item["id"],
                "title": item.get("title"),
                "url": item.get("url"),
                "score": item.get("score", 0),
                "by": item.get("by"),
                "descendants": item.get("descendants", 0),
                "time": item.get("time"),
            })
    return stories

def get_hn_comments(story_id):
    """Scrape comment tree from an HN thread."""
    from bs4 import BeautifulSoup

    session = create_session()
    resp = session.get(f"https://news.ycombinator.com/item?id={story_id}")
    soup = BeautifulSoup(resp.text, "lxml")

    comments = []
    for tr in soup.select("tr.athing.comtr"):
        comment_text = tr.select_one(".commtext")
        author = tr.select_one(".hnuser")
        age = tr.select_one(".age a")
        indent = tr.select_one(".ind img")

        if comment_text and author:
            comments.append({
                "author": author.get_text(),
                "text": comment_text.get_text(separator=" ", strip=True),
                "age": age.get_text() if age else "",
                "depth": int(indent.get("width", 0)) // 40 if indent else 0,
            })
    return comments

Product Hunt Scraper — Grabs daily top products, maker profiles, upvote counts, and taglines. Useful for competitive intelligence or tracking what is trending in the startup space.


Professional Networks

LinkedIn Jobs Scraper — Extracts job listings from LinkedIn search results without touching the LinkedIn API (which costs thousands per month for recruiting tiers). Returns titles, companies, locations, descriptions, and posting dates.

LinkedIn job listings are partially accessible without login:

import requests
from bs4 import BeautifulSoup

def scrape_linkedin_jobs(query, location="United States", limit=25):
    """
    Scrape LinkedIn job postings without authentication.
    Works on public job listing pages.
    """
    session = create_session()

    url = "https://www.linkedin.com/jobs/search/"
    params = {
        "keywords": query,
        "location": location,
        "start": 0,
        "count": limit,
    }

    resp = session.get(url, params=params)
    soup = BeautifulSoup(resp.text, "lxml")

    jobs = []
    for card in soup.select(".job-search-card, .base-card"):
        title = card.select_one(".base-search-card__title, h3")
        company = card.select_one(".base-search-card__subtitle, h4")
        location_el = card.select_one(".job-search-card__location")
        link = card.select_one("a[href*='/jobs/view/']")

        if title and company:
            jobs.append({
                "title": title.get_text(strip=True),
                "company": company.get_text(strip=True),
                "location": location_el.get_text(strip=True) if location_el else "",
                "url": link["href"] if link else "",
            })
    return jobs

LinkedIn Profile Scraper — Collects public LinkedIn profile data including work history, skills, education, and connections count. Works on publicly visible profiles without authentication.
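Public profile pages have historically embedded schema.org Person data as JSON-LD, which is far more stable than CSS selectors. Whether a given profile exposes it depends on LinkedIn's login-wall behavior, so treat this stdlib-only parser as a sketch; the demo runs on a hand-written snippet because real markup varies:

```python
import json
import re

# Matches <script type="application/ld+json"> blocks anywhere in the page
SCRIPT_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)

def parse_profile_jsonld(html):
    """Return the first schema.org Person node found in embedded JSON-LD, else {}."""
    for raw in SCRIPT_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        nodes = data.get("@graph", [data]) if isinstance(data, dict) else data
        for node in nodes:
            if isinstance(node, dict) and node.get("@type") == "Person":
                return {
                    "name": node.get("name"),
                    "job_title": node.get("jobTitle"),
                    "description": node.get("description"),
                }
    return {}

# Offline demo (real pages differ and change often):
sample = ('<html><head><script type="application/ld+json">'
          '{"@graph": [{"@type": "Person", "name": "Jane Doe", "jobTitle": "Data Engineer"}]}'
          '</script></head></html>')
profile = parse_profile_jsonld(sample)
```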


Social Media

Twitter/X Scraper — Retrieves tweets, user profiles, follower counts, and engagement metrics without the $100/month Basic API tier.

The legacy guest-token flow gives unauthenticated access, though X has been steadily locking these endpoints down, so expect breakage:

import requests

# Bearer token hard-coded into Twitter's own web client (widely known and public, not a secret)
TWITTER_BEARER = "AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA"

def get_twitter_guest_token():
    """Obtain a guest bearer token for unauthenticated Twitter API access."""
    headers = {
        "Authorization": f"Bearer {TWITTER_BEARER}",
        "User-Agent": "Mozilla/5.0",
    }
    resp = requests.post("https://api.twitter.com/1.1/guest/activate.json", headers=headers)
    return resp.json().get("guest_token")

def search_tweets(query, count=20):
    """Search recent tweets using the Twitter internal API endpoint."""
    token = get_twitter_guest_token()

    headers = {
        "Authorization": f"Bearer {TWITTER_BEARER}",
        "x-guest-token": token,
        "User-Agent": "Mozilla/5.0",
    }

    params = {"q": query, "count": count, "tweet_mode": "extended"}
    resp = requests.get(
        "https://api.twitter.com/1.1/search/tweets.json",
        headers=headers,
        params=params,
    )
    data = resp.json()

    tweets = []
    for tweet in data.get("statuses", []):
        tweets.append({
            "id": tweet["id_str"],
            "text": tweet.get("full_text", tweet.get("text")),
            "author": tweet["user"]["screen_name"],
            "author_followers": tweet["user"]["followers_count"],
            "likes": tweet["favorite_count"],
            "retweets": tweet["retweet_count"],
            "created_at": tweet["created_at"],
        })
    return tweets

Reddit Scraper — Reddit's .json endpoint trick still works perfectly in 2026:

import requests

def scrape_subreddit(subreddit, sort="hot", limit=25):
    """
    Scrape Reddit posts using the undocumented .json endpoint.
    Append .json to any Reddit URL to get structured data.
    """
    session = create_session()
    url = f"https://www.reddit.com/r/{subreddit}/{sort}.json"

    resp = session.get(url, params={"limit": limit})
    data = resp.json()

    posts = []
    for child in data["data"]["children"]:
        post = child["data"]
        posts.append({
            "title": post["title"],
            "author": post["author"],
            "subreddit": post["subreddit"],
            "score": post["score"],
            "num_comments": post["num_comments"],
            "url": post["url"],
            "permalink": f"https://reddit.com{post['permalink']}",
            "created_utc": post["created_utc"],
            "selftext": post.get("selftext", "")[:500],
            "upvote_ratio": post.get("upvote_ratio", 0),
            "is_self": post.get("is_self", False),
        })
    return posts

def get_post_comments(subreddit, post_id):
    """Get all comments for a Reddit post using the .json trick."""
    session = create_session()
    url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"

    resp = session.get(url, params={"limit": 500})
    data = resp.json()

    def parse_comments(listing):
        comments = []
        for child in listing.get("data", {}).get("children", []):
            if child.get("kind") == "t1":
                c = child["data"]
                comments.append({
                    "id": c["id"],
                    "author": c["author"],
                    "body": c.get("body", ""),
                    "score": c.get("score", 0),
                    "depth": c.get("depth", 0),
                    "replies": parse_comments(c.get("replies", {})) if isinstance(c.get("replies"), dict) else [],
                })
        return comments

    return parse_comments(data[1])

TikTok Scraper — Collects TikTok profiles, video metadata, view/like/comment counts, and comment threads. TikTok has no public API for content access.
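Like many SPAs, TikTok ships its page state as a JSON blob inside a script tag; the tag's id has changed across versions (SIGI_STATE is one historically used id, shown here purely as an example). A stdlib sketch of the extraction step, demoed on a hand-written snippet since the real payload structure is an assumption that shifts frequently:

```python
import json
import re

def extract_embedded_state(html: str, script_id: str) -> dict:
    """Pull the JSON blob out of <script id="..."> ... </script>, if present."""
    pattern = re.compile(
        rf'<script id="{re.escape(script_id)}"[^>]*>(.*?)</script>', re.DOTALL
    )
    m = pattern.search(html)
    return json.loads(m.group(1)) if m else {}

# Offline demo with an invented, simplified payload:
sample = ('<script id="SIGI_STATE" type="application/json">'
          '{"ItemModule": {"123": {"desc": "demo clip", "stats": {"playCount": 42}}}}'
          '</script>')
state = extract_embedded_state(sample, "SIGI_STATE")
videos = [
    {"id": vid, "caption": v.get("desc"), "plays": v.get("stats", {}).get("playCount")}
    for vid, v in state.get("ItemModule", {}).items()
]
```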

Instagram Scraper — Uses the mobile API endpoint to fetch public profile data, post metadata, and follower counts. Instagram Graph API requires Meta app review — this skips all of that.
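One widely shared route is the web_profile_info endpoint with the app id that Instagram's own web client sends. Both the URL and the x-ig-app-id value below are assumptions based on publicly circulated values at the time of writing and can stop working at any point; the request builder is split out as a pure function so it can be inspected without hitting the network:

```python
import requests

# App id shipped with the Instagram web client (publicly visible, may change)
IG_APP_ID = "936619743392459"

def build_profile_request(username):
    """Build URL, params, and headers for the web_profile_info endpoint."""
    return (
        "https://i.instagram.com/api/v1/users/web_profile_info/",
        {"username": username},
        {"x-ig-app-id": IG_APP_ID, "User-Agent": "Mozilla/5.0"},
    )

def fetch_instagram_profile(username):
    """Fetch a public profile; response field names are assumptions and may shift."""
    url, params, headers = build_profile_request(username)
    resp = requests.get(url, params=params, headers=headers, timeout=15)
    resp.raise_for_status()
    user = resp.json().get("data", {}).get("user", {})
    return {
        "username": user.get("username"),
        "full_name": user.get("full_name"),
        "followers": user.get("edge_followed_by", {}).get("count"),
        "posts": user.get("edge_owner_to_timeline_media", {}).get("count"),
    }
```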

Bluesky Scraper — Pulls posts, replies, and profile data from Bluesky using the AT Protocol (which is fully open and documented):

import requests

def fetch_bluesky_feed(handle, limit=20):
    """
    Fetch posts from a Bluesky user using the AT Protocol public API.
    No authentication required for public content.
    """
    # Resolve handle to DID
    resp = requests.get(
        "https://bsky.social/xrpc/com.atproto.identity.resolveHandle",
        params={"handle": handle}
    )
    did = resp.json()["did"]

    # Get author feed
    resp = requests.get(
        "https://bsky.social/xrpc/app.bsky.feed.getAuthorFeed",
        params={"actor": did, "limit": limit}
    )
    data = resp.json()

    posts = []
    for item in data.get("feed", []):
        post = item.get("post", {})
        record = post.get("record", {})
        posts.append({
            "uri": post.get("uri"),
            "text": record.get("text", ""),
            "created_at": record.get("createdAt"),
            "likes": post.get("likeCount", 0),
            "reposts": post.get("repostCount", 0),
            "replies": post.get("replyCount", 0),
            "author": post.get("author", {}).get("handle"),
        })
    return posts

Pinterest Scraper and Telegram Scraper round out the social media stack.


Content & Publishing

YouTube Scraper — Fetches video metadata, view counts, channel stats, and upload history using the YouTube InnerTube API (the same internal API YouTube.com uses):

import requests

# API key embedded in the YouTube web client itself (public by design, not a secret)
INNERTUBE_API_KEY = "AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8"

def search_youtube(query, limit=20):
    """
    Search YouTube using the InnerTube API (no official API key required).
    This is the same endpoint the YouTube web app uses internally.
    """
    headers = {
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "X-YouTube-Client-Name": "1",
        "X-YouTube-Client-Version": "2.20240101",
    }

    payload = {
        "context": {
            "client": {
                "clientName": "WEB",
                "clientVersion": "2.20240101",
                "hl": "en",
                "gl": "US",
            }
        },
        "query": query,
    }

    resp = requests.post(
        f"https://www.youtube.com/youtubei/v1/search?key={INNERTUBE_API_KEY}",
        headers=headers,
        json=payload,
    )
    data = resp.json()

    videos = []
    contents = data.get("contents", {}).get("twoColumnSearchResultsRenderer", {})
    items = contents.get("primaryContents", {}).get("sectionListRenderer", {}).get("contents", [])

    for section in items:
        for item in section.get("itemSectionRenderer", {}).get("contents", []):
            if "videoRenderer" in item:
                v = item["videoRenderer"]
                title_runs = v.get("title", {}).get("runs", [])
                channel_runs = v.get("ownerText", {}).get("runs", [])

                videos.append({
                    "video_id": v.get("videoId"),
                    "title": "".join(r["text"] for r in title_runs),
                    "channel": "".join(r["text"] for r in channel_runs),
                    "view_count": v.get("viewCountText", {}).get("simpleText", ""),
                    "published": v.get("publishedTimeText", {}).get("simpleText", ""),
                    "duration": v.get("lengthText", {}).get("simpleText", ""),
                    "url": f"https://youtube.com/watch?v={v.get('videoId')}",
                })
                if len(videos) >= limit:
                    return videos
    return videos

YouTube Comments Scraper uses the same InnerTube approach to pull comment threads from any public video. Invaluable for sentiment analysis and audience research.
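InnerTube responses are deeply nested and their shape shifts between client versions, so hard-coded key paths like the ones in the search example break regularly. A small recursive key-finder keeps extraction code working when intermediate wrappers are renamed or reordered; the sample payload below is invented for the demo:

```python
def find_all_keys(obj, key):
    """Recursively collect every value stored under `key` in a nested dict/list."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            found.extend(find_all_keys(v, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_all_keys(item, key))
    return found

# Pull every videoRenderer out of a response, wherever it nests:
sample = {"contents": [{"itemSectionRenderer": {"contents": [
    {"videoRenderer": {"videoId": "abc123"}}
]}}]}
renderers = find_all_keys(sample, "videoRenderer")
```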

Substack Scraper — Substack exposes a public REST API at https://PUBLICATION.substack.com/api/v1/posts. No authentication needed for published posts.
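A sketch against that endpoint, with the response-parsing split out so it can be tested offline. The field names in parse_post are assumptions based on observed responses and may change:

```python
import requests

def fetch_substack_posts(publication, limit=12, offset=0):
    """Fetch published posts from a Substack publication's public posts endpoint."""
    url = f"https://{publication}.substack.com/api/v1/posts"
    resp = requests.get(url, params={"limit": limit, "offset": offset}, timeout=15)
    resp.raise_for_status()
    return [parse_post(p) for p in resp.json()]

def parse_post(p):
    """Keep the fields most analyses need (names as observed; treat as assumptions)."""
    return {
        "title": p.get("title"),
        "subtitle": p.get("subtitle"),
        "slug": p.get("slug"),
        "post_date": p.get("post_date"),
        "audience": p.get("audience"),  # e.g. "everyone" vs paid-only
        "url": p.get("canonical_url"),
    }
```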


E-Commerce

Amazon Product Scraper — Amazon is the most heavily defended scraping target. A Playwright-based approach handles the JavaScript-rendered product pages:

from playwright.sync_api import sync_playwright

def scrape_amazon_product(asin, proxy=None):
    """
    Scrape an Amazon product page using Playwright (headless browser).
    Amazon fingerprints requests heavily. Use residential proxies for production.
    """
    with sync_playwright() as p:
        launch_args = {"headless": True}
        if proxy:
            launch_args["proxy"] = {"server": proxy}

        browser = p.chromium.launch(**launch_args)
        context = browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            locale="en-US",
            timezone_id="America/New_York",
        )

        # Block images/fonts for speed
        context.route("**/*.{png,jpg,jpeg,gif,webp,svg,woff,woff2}", lambda route: route.abort())

        page = context.new_page()
        page.goto(f"https://www.amazon.com/dp/{asin}", wait_until="domcontentloaded", timeout=30000)

        def sel_text(selector):
            el = page.query_selector(selector)
            return el.inner_text().strip() if el else None

        rating_el = page.query_selector("#acrPopover")
        product = {
            "asin": asin,
            "title": sel_text("#productTitle"),
            "price": sel_text(".a-price-whole"),
            "rating": rating_el.get_attribute("title") if rating_el else None,
            "review_count": sel_text("#acrCustomerReviewText"),
            "availability": sel_text("#availability"),
        }

        browser.close()
    return product

Etsy Scraper and Shopify Scraper — together these cover independent-seller storefronts. Shopify stores expose /products.json publicly by design, making them trivially accessible:

import requests
import time

def scrape_shopify_products(store_domain, limit=250):
    """
    Scrape products from any Shopify store using the public products.json endpoint.
    Shopify deliberately exposes this — it is a feature, not a vulnerability.
    """
    session = create_session()
    all_products = []
    page = 1

    while True:
        url = f"https://{store_domain}/products.json"
        resp = session.get(url, params={"limit": 250, "page": page})

        if resp.status_code != 200:
            break

        products = resp.json().get("products", [])
        if not products:
            break

        for p in products:
            all_products.append({
                "id": p["id"],
                "title": p["title"],
                "vendor": p.get("vendor"),
                "product_type": p.get("product_type"),
                "tags": p.get("tags", ""),
                "price": p["variants"][0]["price"] if p.get("variants") else None,
                "available": any(v.get("available") for v in p.get("variants", [])),
                "image_url": p["images"][0]["src"] if p.get("images") else None,
                "handle": p.get("handle"),
            })

        if len(products) < 250 or len(all_products) >= limit:
            break
        page += 1
        time.sleep(0.5)

    return all_products[:limit]

Developer & Business Data

GitHub Scraper — GitHub's REST API allows unauthenticated requests but rate-limits them to 60/hour. With a free personal access token, that jumps to 5,000/hour:

import requests
import base64

def search_github_repos(query, sort="stars", limit=30, token=None):
    """Search GitHub repositories using the REST API."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"

    resp = requests.get(
        "https://api.github.com/search/repositories",
        headers=headers,
        params={"q": query, "sort": sort, "per_page": min(limit, 100)},
    )
    data = resp.json()

    repos = []
    for repo in data.get("items", []):
        repos.append({
            "name": repo["full_name"],
            "description": repo.get("description"),
            "stars": repo["stargazers_count"],
            "forks": repo["forks_count"],
            "language": repo.get("language"),
            "topics": repo.get("topics", []),
            "url": repo["html_url"],
            "homepage": repo.get("homepage"),
            "open_issues": repo["open_issues_count"],
            "created_at": repo["created_at"],
            "updated_at": repo["updated_at"],
            "license": repo.get("license", {}).get("spdx_id") if repo.get("license") else None,
        })
    return repos

def get_repo_readme(owner, repo, token=None):
    """Fetch decoded README content for a GitHub repository."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"

    resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}/readme", headers=headers)
    if resp.status_code == 200:
        data = resp.json()
        return base64.b64decode(data["content"]).decode("utf-8")
    return ""

Google Maps Scraper — Extracts business listings, addresses, phone numbers, ratings, and hours. The Places API charges per request; this scraper uses Playwright to render the JavaScript-heavy page:

from playwright.sync_api import sync_playwright

def scrape_google_maps_businesses(query, location, max_results=20):
    """
    Scrape business listings from Google Maps using Playwright.
    JavaScript rendering is required — Google Maps is a single-page app.
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(
            viewport={"width": 1280, "height": 900},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        )

        search_url = f"https://www.google.com/maps/search/{query.replace(' ', '+')}+{location.replace(' ', '+')}"
        page.goto(search_url, wait_until="networkidle", timeout=30000)
        page.wait_for_selector('[role="feed"]', timeout=10000)

        businesses = []
        seen_names = set()

        feed = page.query_selector('[role="feed"]')
        for _ in range(5):
            page.evaluate("(el) => el.scrollBy(0, 800)", feed)
            page.wait_for_timeout(1500)

        for listing in page.query_selector_all('[role="article"]')[:max_results]:
            name_el = listing.query_selector("h3, .qBF1Pd")
            rating_el = listing.query_selector(".MW4etd")
            reviews_el = listing.query_selector(".UY7F9")
            address_el = listing.query_selector(".W4Efsd:last-child .W4Efsd span:last-child")

            name = name_el.inner_text().strip() if name_el else ""
            if name and name not in seen_names:
                seen_names.add(name)
                businesses.append({
                    "name": name,
                    "rating": rating_el.inner_text().strip() if rating_el else None,
                    "reviews": reviews_el.inner_text().strip() if reviews_el else None,
                    "address": address_el.inner_text().strip() if address_el else None,
                })

        browser.close()
    return businesses

Anti-Detection Techniques

Running scrapers without proxy rotation or fingerprint masking will get you blocked, especially on LinkedIn, Google, and Amazon. Here is the complete anti-detection toolkit:

1. Rotating Residential Proxies

Datacenter IPs are trivially identified and blocked. Residential proxies route your requests through real home IP addresses, which platforms treat as organic users.

ThorData offers 90M+ residential IPs across 190+ countries with per-request rotation — ideal for high-volume scraping where you need fresh IPs on every request.

import random
import requests

THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = "9000"

def get_rotating_proxy(country=None):
    """
    Get a ThorData rotating residential proxy URL.
    Optional: target a specific country with country code.
    """
    if country:
        # Country-targeted proxy: appends country code to username
        user = f"{THORDATA_USER}-country-{country}"
    else:
        user = THORDATA_USER

    proxy_url = f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"
    return {"http": proxy_url, "https": proxy_url}

# Example: scrape LinkedIn with a US residential IP
session = requests.Session()
session.proxies = get_rotating_proxy(country="US")
session.headers.update({"User-Agent": ua.random})
resp = session.get("https://www.linkedin.com/jobs/search/?keywords=python+developer")
print(resp.status_code)

2. Request Header Rotation

Always vary your User-Agent string. Browser fingerprinting checks Accept-Language, Accept-Encoding, and Sec-Fetch-* headers too:

import random

CHROME_VERSIONS = ["124.0.0.0", "123.0.0.0", "122.0.6261.112"]
WINDOWS_VERSIONS = ["10.0", "11.0"]

def random_chrome_headers():
    """Generate realistic Chrome browser headers."""
    chrome_ver = random.choice(CHROME_VERSIONS)
    win_ver = random.choice(WINDOWS_VERSIONS)

    return {
        "User-Agent": f"Mozilla/5.0 (Windows NT {win_ver}; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{chrome_ver} Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.9", "en-US,en;q=0.8,es;q=0.7"]),
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Sec-Ch-Ua": f'"Chromium";v="{chrome_ver.split(".")[0]}", "Google Chrome";v="{chrome_ver.split(".")[0]}"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
    }

3. Rate Limiting with Jitter

Consistent request intervals look robotic. Add random delays:

import time
import random
from functools import wraps

def with_delay(min_seconds=1.0, max_seconds=4.0):
    """Decorator that adds random delays between function calls."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            delay = random.uniform(min_seconds, max_seconds)
            time.sleep(delay)
            return result
        return wrapper
    return decorator

@with_delay(min_seconds=2.0, max_seconds=5.0)
def scrape_page(url, session):
    resp = session.get(url, timeout=20)
    resp.raise_for_status()
    return resp.text

4. TLS Fingerprint Evasion with curl_cffi

Standard Python requests has a different TLS handshake pattern than real browsers. Platforms like Cloudflare detect this at the network layer before even looking at your headers. curl_cffi impersonates real browser TLS fingerprints:

pip install curl-cffi

from curl_cffi import requests as cffi_requests

# Impersonate Chrome's TLS fingerprint at the socket level. curl_cffi also sends
# matching Chrome headers, so no hand-written User-Agent is needed here.
session = cffi_requests.Session(impersonate="chrome120")

resp = session.get(
    "https://www.linkedin.com/",
    proxies={"https": "http://user:[email protected]:9000"},
)
print(resp.status_code)

5. Playwright Stealth Mode

For JavaScript-heavy sites that run browser fingerprinting scripts:

from playwright.sync_api import sync_playwright

def scrape_with_stealth(url, proxy=None):
    """
    Use Playwright with stealth patches to bypass bot detection.
    Patches navigator.webdriver, chrome object, plugins, and more.
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=[
                "--disable-blink-features=AutomationControlled",
                "--disable-features=IsolateOrigins,site-per-process",
                "--no-sandbox",
                "--disable-setuid-sandbox",
            ]
        )

        context_kwargs = {
            "viewport": {"width": 1366, "height": 768},
            "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "locale": "en-US",
            "timezone_id": "America/New_York",
            "extra_http_headers": {
                "Accept-Language": "en-US,en;q=0.9",
            }
        }
        if proxy:
            context_kwargs["proxy"] = {"server": proxy}

        context = browser.new_context(**context_kwargs)

        # Patch navigator.webdriver to hide automation
        context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
            Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
            Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
            window.chrome = { runtime: {} };
        """)

        page = context.new_page()
        page.goto(url, wait_until="networkidle", timeout=30000)
        content = page.content()
        browser.close()
    return content

6. Exponential Backoff on Rate Limit Responses

import time
import random
import requests

def fetch_with_retry(url, session, max_retries=5):
    """
    Fetch a URL with exponential backoff on rate-limit errors.
    Caps at max_retries attempts and a three-minute total retry budget.
    """
    start_time = time.time()

    for attempt in range(max_retries):
        if time.time() - start_time > 180:  # 3 minute timeout
            raise TimeoutError(f"Exceeded 3 minute retry budget for {url}")

        try:
            resp = session.get(url, timeout=20)

            if resp.status_code == 200:
                return resp
            elif resp.status_code == 429:
                wait = (2 ** attempt) + random.uniform(0, 1)
                print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
                time.sleep(wait)
            elif resp.status_code in (403, 407):
                print(f"Blocked (status {resp.status_code}). Rotating proxy...")
                session.proxies = get_rotating_proxy()
                time.sleep(2)
            else:
                resp.raise_for_status()

        except requests.exceptions.ConnectionError as e:
            wait = 2 ** attempt
            print(f"Connection error: {e}. Retrying in {wait}s...")
            time.sleep(wait)

    raise Exception(f"Failed after {max_retries} retries: {url}")

The Catch: IP Blocks and Rate Limits

Free scrapers are powerful, but they hit one universal wall: IP reputation. Datacenter IPs — the kind used by cloud servers and most VPNs — are heavily flagged by platforms like LinkedIn, Google, and Amazon. When you send too many requests from a single datacenter IP, you will start seeing CAPTCHAs, soft blocks, or outright bans within minutes.

The standard solution is residential proxies — IPs that belong to real home internet connections, which platforms treat as organic users. For production-grade scraping at scale, ThorData's residential proxy network offers one of the best price-to-coverage ratios available, with 90M+ IPs across 190+ countries. Paired with any of the scrapers above, it dramatically reduces block rates on the hardest targets.

For lower-volume use cases (a few hundred requests/day), many scrapers handle rotation automatically and work fine without additional proxy configuration.


How to Run These on Apify

All scrapers listed above run on Apify's platform. The free tier defaults to returning 5 results per run — enough for testing and prototyping. You can increase limits by adjusting the maxResults input parameter.

Every scraper exposes a REST API for programmatic triggering:

curl -X POST "https://api.apify.com/v2/acts/cryptosignals~google-search-scraper/runs" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "best python scraping libraries 2026", "maxResults": 10}'

Or via the Python client:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("cryptosignals/google-search-scraper").call(run_input={
    "query": "python web scraping 2026",
    "maxResults": 50,
})
results = list(client.dataset(run["defaultDatasetId"]).iterate_items())

for result in results:
    print(result)

Real-World Use Cases

Here are concrete use cases where this scraper stack delivers immediate value:

Market Research: Combine Amazon + Etsy scrapers to track pricing trends across categories. Build automated alerts when competitor prices drop or new listings appear in your niche.

Job Market Analysis: LinkedIn Jobs Scraper + GitHub Scraper — correlate tech stack trends with job posting volumes to identify upskilling opportunities.

Content Strategy: Reddit + Hacker News + Product Hunt — aggregate what developers are discussing, upvoting, and building to find underserved content niches.

Lead Generation: Google Maps Scraper — build targeted lists of local businesses in specific industries and locations for outreach campaigns.

Competitive Intelligence: Shopify Scraper + Product Hunt — monitor competitor stores and newly launched products in your market, automated daily.

Social Listening: Twitter + Bluesky + Reddit — track mentions of your brand, product, or keywords across platforms without paying $500+/month for enterprise social listening tools.
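The market-research and competitive-intelligence patterns above reduce to a diff between daily snapshots. A minimal sketch of the alerting logic, independent of which scraper produced the {product_id: price} snapshots (all data below is invented):

```python
def detect_price_drops(previous, current, threshold_pct=5.0):
    """Compare two {product_id: price} snapshots; return drops of at least threshold_pct."""
    alerts = []
    for pid, old_price in previous.items():
        new_price = current.get(pid)
        if new_price is None or old_price <= 0:
            continue  # delisted product or bad data: nothing to compare
        drop_pct = (old_price - new_price) / old_price * 100
        if drop_pct >= threshold_pct:
            alerts.append({"product_id": pid, "old": old_price,
                           "new": new_price, "drop_pct": round(drop_pct, 1)})
    return alerts

# Example: only the 25% drop clears the default 5% threshold
yesterday = {"A1": 19.99, "B2": 49.00, "C3": 5.00}
today = {"A1": 14.99, "B2": 48.50, "C3": 5.00}
alerts = detect_price_drops(yesterday, today)
```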


If you are building something data-driven in 2026 and need web data without signing enterprise contracts or waiting months for API approval, this stack covers most major platforms. Bookmark this page — it will be updated as new scrapers launch throughout the year.