
Scraping Product Hunt Launches: Python Guide (2026)

Product Hunt runs on a GraphQL API. Every product page, upvote count, maker profile, and daily ranking you see on the site comes from it. If you want to track new launches, monitor competitor products, or build a dataset of trending tools, this API is your entry point.

The API requires an Authorization header but doesn't need a registered app for basic queries. The tricky parts are pagination, rate limits, and the fact that Product Hunt aggressively blocks automated requests that don't look like real browser traffic.

This guide covers the full picture: getting an API token, executing GraphQL queries, paginating through large datasets, scraping maker profiles, handling rate limits, and scaling with proxies.

Why Product Hunt Data Matters

Product Hunt is one of the few places on the internet where you can reliably find what new software products are launching, who built them, and what real users think of them (via upvotes and comments). That makes the data valuable for tracking new launches, monitoring competitor products, spotting trends by category, and building datasets of emerging tools and their makers.

Getting an Access Token

Product Hunt uses OAuth2. You can get a developer token from their API dashboard, or use a client credentials flow:

import httpx

def get_ph_token(client_id, client_secret):
    """Get a Product Hunt API access token via client credentials."""
    resp = httpx.post(
        "https://api.producthunt.com/v2/oauth/token",
        json={
            "client_id": client_id,
            "client_secret": client_secret,
            "grant_type": "client_credentials"
        }
    )
    resp.raise_for_status()
    data = resp.json()
    return data["access_token"]

# Alternatively, get a developer token directly from:
# https://www.producthunt.com/v2/oauth/applications
# Create an application > copy the "Developer Token" (the API key and
# secret are the OAuth client credentials used by get_ph_token above)
token = "YOUR_API_KEY"

Once you have a token, all GraphQL queries go to a single endpoint with an Authorization header. Keep your token safe — Product Hunt will revoke tokens that violate their rate limits.
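To keep the token out of source control, a common pattern is loading it from an environment variable. A minimal sketch (the variable name PH_TOKEN is just a convention for this guide, not anything Product Hunt requires):

```python
import os

def load_token(env_var="PH_TOKEN"):
    """Read the API token from an environment variable instead of hard-coding it."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token
```
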

The GraphQL API Setup

Product Hunt's API is fully GraphQL. Every query hits the same endpoint:

import httpx
import json

API_URL = "https://api.producthunt.com/v2/api/graphql"

def ph_query(query, variables=None, token=None):
    """Execute a Product Hunt GraphQL query."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json",
    }
    resp = httpx.post(
        API_URL,
        json={"query": query, "variables": variables or {}},
        headers=headers,
        timeout=30
    )
    resp.raise_for_status()
    data = resp.json()
    if "errors" in data:
        raise Exception(f"GraphQL errors: {json.dumps(data['errors'], indent=2)}")
    return data["data"]

Fetching Daily Rankings

The posts query returns products ordered by votes for a given day. This is the core query for daily launch tracking:

import time

POSTS_QUERY = """
query GetPosts($postedAfter: DateTime!, $postedBefore: DateTime!, $after: String) {
    posts(
        order: VOTES
        postedAfter: $postedAfter
        postedBefore: $postedBefore
        after: $after
        first: 20
    ) {
        edges {
            node {
                id
                name
                tagline
                url
                votesCount
                commentsCount
                website
                createdAt
                featuredAt
                topics {
                    edges {
                        node {
                            name
                            slug
                        }
                    }
                }
                makers {
                    id
                    name
                    username
                    headline
                    profileImage
                }
                thumbnail {
                    url
                }
                reviewsCount
                reviewsRating
            }
        }
        pageInfo {
            hasNextPage
            endCursor
        }
    }
}
"""

def get_daily_launches(date, token):
    """Get all launches for a specific date."""
    all_posts = []
    cursor = None

    while True:
        variables = {
            "postedAfter": f"{date}T00:00:00Z",
            "postedBefore": f"{date}T23:59:59Z",
            "after": cursor
        }
        data = ph_query(POSTS_QUERY, variables, token)
        posts = data["posts"]

        for edge in posts["edges"]:
            node = edge["node"]
            all_posts.append({
                "id": node["id"],
                "name": node["name"],
                "tagline": node["tagline"],
                "votes": node["votesCount"],
                "comments": node["commentsCount"],
                "url": node["url"],
                "website": node["website"],
                "created_at": node["createdAt"],
                "featured_at": node.get("featuredAt"),
                "makers": [{"name": m["name"], "username": m["username"]} for m in node["makers"]],
                "topics": [e["node"]["name"] for e in node["topics"]["edges"]],
                "thumbnail": node["thumbnail"]["url"] if node.get("thumbnail") else None,
                "reviews_count": node.get("reviewsCount", 0),
                "reviews_rating": node.get("reviewsRating"),
            })

        if not posts["pageInfo"]["hasNextPage"]:
            break
        cursor = posts["pageInfo"]["endCursor"]
        time.sleep(1)  # respect rate limits

    return sorted(all_posts, key=lambda x: x["votes"], reverse=True)

# Get yesterday's launches
launches = get_daily_launches("2026-04-23", token="YOUR_TOKEN")
print(f"Found {len(launches)} products launched")
for i, p in enumerate(launches[:10], 1):
    print(f"#{i} {p['name']} -- {p['votes']} votes -- {p['tagline']}")

Cursor-Based Pagination

Product Hunt uses cursor pagination -- not page numbers. Each response includes pageInfo.endCursor, which you pass as the after variable in the next request. This is standard for Relay-style GraphQL APIs.

The pattern is always the same:

  1. Make initial request without after
  2. Check pageInfo.hasNextPage
  3. Pass pageInfo.endCursor as after in the next request
  4. Repeat until hasNextPage is false

Don't skip the time.sleep(1) between paginated requests. Product Hunt rate-limits aggressively and will revoke tokens that hammer the API.

def paginate_query(query, variables_fn, data_path, token, delay=1.0):
    """Generic cursor-based paginator for Product Hunt queries."""
    all_items = []
    cursor = None

    while True:
        variables = variables_fn(cursor)
        data = ph_query(query, variables, token)

        # Navigate to the page data using the path
        page_data = data
        for key in data_path.split("."):
            page_data = page_data[key]

        for edge in page_data["edges"]:
            all_items.append(edge["node"])

        if not page_data["pageInfo"]["hasNextPage"]:
            break

        cursor = page_data["pageInfo"]["endCursor"]
        time.sleep(delay)

    return all_items
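Before pointing the helper at the live API, the cursor loop can be exercised against a stubbed page source. Everything below is an in-memory toy (the same loop reduced to its essentials), not real Product Hunt data:

```python
import time

def paginate(query_fn, delay=0.0):
    """Collect all nodes from a Relay-style paged source."""
    items, cursor = [], None
    while True:
        page = query_fn(cursor)
        items.extend(edge["node"] for edge in page["edges"])
        if not page["pageInfo"]["hasNextPage"]:
            break
        cursor = page["pageInfo"]["endCursor"]
        time.sleep(delay)
    return items

# A fake three-page source: cursors are just stringified offsets
DATA = [{"id": i} for i in range(6)]

def fake_query(cursor):
    start = int(cursor or 0)
    chunk = DATA[start:start + 2]
    return {
        "edges": [{"node": n} for n in chunk],
        "pageInfo": {
            "hasNextPage": start + 2 < len(DATA),
            "endCursor": str(start + 2),
        },
    }
```

Running paginate(fake_query) walks all three pages and returns the six nodes in order, which is exactly what the real paginator does against the API.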

Scraping Maker Profiles

To build a dataset of makers and their launch history:

MAKER_QUERY = """
query GetMaker($username: String!) {
    user(username: $username) {
        id
        name
        username
        headline
        profileImage
        websiteUrl
        twitterUsername
        followersCount
        followingCount
        votedPostsCount
        madePosts(first: 20) {
            edges {
                node {
                    id
                    name
                    tagline
                    votesCount
                    commentsCount
                    url
                    createdAt
                    topics {
                        edges {
                            node { name }
                        }
                    }
                }
            }
        }
    }
}
"""

def get_maker(username, token):
    """Get a maker's profile and their launches."""
    data = ph_query(MAKER_QUERY, {"username": username}, token)
    user = data["user"]

    if not user:
        return None

    return {
        "id": user["id"],
        "name": user["name"],
        "username": user["username"],
        "headline": user.get("headline"),
        "website": user.get("websiteUrl"),
        "twitter": user.get("twitterUsername"),
        "followers": user["followersCount"],
        "following": user["followingCount"],
        "voted_posts": user.get("votedPostsCount", 0),
        "products": [
            {
                "id": e["node"]["id"],
                "name": e["node"]["name"],
                "votes": e["node"]["votesCount"],
                "comments": e["node"]["commentsCount"],
                "launched": e["node"]["createdAt"],
                "url": e["node"]["url"],
                "topics": [t["node"]["name"] for t in e["node"]["topics"]["edges"]],
            }
            for e in user["madePosts"]["edges"]
        ]
    }

maker = get_maker("rrhoover", token="YOUR_TOKEN")
if maker:
    print(f"{maker['name']} (@{maker['username']})")
    print(f"Followers: {maker['followers']}")
    print(f"Products launched: {len(maker['products'])}")
    for p in maker["products"][:5]:
        print(f"  {p['name']} -- {p['votes']} votes ({p['launched'][:10]})")

Searching for Products by Topic

Product Hunt supports filtering by topic. To find all AI tools or all dev tools:

TOPIC_POSTS_QUERY = """
query GetTopicPosts($topic: String!, $after: String) {
    posts(
        order: VOTES
        topic: $topic
        after: $after
        first: 20
    ) {
        edges {
            node {
                id
                name
                tagline
                votesCount
                url
                createdAt
            }
        }
        pageInfo {
            hasNextPage
            endCursor
        }
    }
}
"""

def get_posts_by_topic(topic_slug, token, max_items=100):
    """Get top products in a specific topic/category."""
    all_posts = []
    cursor = None

    while len(all_posts) < max_items:
        data = ph_query(TOPIC_POSTS_QUERY, {"topic": topic_slug, "after": cursor}, token)
        page = data["posts"]

        for edge in page["edges"]:
            all_posts.append(edge["node"])

        if not page["pageInfo"]["hasNextPage"] or len(all_posts) >= max_items:
            break

        cursor = page["pageInfo"]["endCursor"]
        time.sleep(1)

    return all_posts[:max_items]

# Common topic slugs: artificial-intelligence, developer-tools, productivity,
# marketing, design-tools, finance, education, health-fitness
ai_tools = get_posts_by_topic("artificial-intelligence", token="YOUR_TOKEN", max_items=200)
print(f"Found {len(ai_tools)} AI products")

Anti-Bot Measures

Product Hunt's bot detection is more aggressive than most sites:

  1. Token-based rate limiting — each API token has a request quota. Exceeding it returns 429 errors and can lead to token revocation. Stay under 100 requests per hour for sustained crawling.
  2. Browser fingerprinting on the website — if you scrape the HTML directly instead of using the API, you'll hit Cloudflare challenges, JavaScript rendering requirements, and behavioral analysis.
  3. GraphQL query complexity limits — requesting too many nested fields or too many items per page will fail with complexity errors. Keep first at 20 or below, and don't nest more than 3-4 levels deep.
  4. IP reputation scoring — datacenter IPs get scrutinized more than residential ones.

For the API route, the main risk is token revocation. Keep requests under 100/hour and you'll be fine. For the website route (which you need for data not in the API), you need residential proxies.

ThorData's residential proxy network rotates IPs automatically and handles Cloudflare challenges, which is essential for Product Hunt's website -- their bot detection flags datacenter IPs within a few requests.

# For direct website scraping (not API)
import httpx

proxied_client = httpx.Client(
    proxy="http://user:[email protected]:9000",
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
    timeout=30
)

# For the GraphQL API with proxy rotation:
def ph_query_proxied(query, variables=None, token=None, proxy_url=None):
    """Execute a Product Hunt GraphQL query via proxy."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    with httpx.Client(proxy=proxy_url, timeout=30) as client:
        resp = client.post(
            API_URL,
            json={"query": query, "variables": variables or {}},
            headers=headers
        )
        resp.raise_for_status()
        data = resp.json()
        if "errors" in data:
            raise Exception(f"GraphQL errors: {data['errors']}")
        return data["data"]

Tracking Launches Over Time

To build a historical dataset, run the daily scraper on a schedule:

from datetime import datetime, timedelta
import json

def scrape_date_range(start_date, days, token, output_file="launches.jsonl"):
    """Scrape launches over a range of dates. Appends to JSONL file."""
    current = datetime.strptime(start_date, "%Y-%m-%d")

    with open(output_file, "a") as out:
        for day in range(days):
            date_str = current.strftime("%Y-%m-%d")
            print(f"Scraping {date_str} ({day+1}/{days})...")

            try:
                launches = get_daily_launches(date_str, token)
                record = {
                    "date": date_str,
                    "count": len(launches),
                    "products": launches
                }
                out.write(json.dumps(record) + "\n")
                print(f"  Got {len(launches)} products")
            except Exception as e:
                print(f"  Failed: {e}")

            current += timedelta(days=1)
            time.sleep(5)  # be polite between days

# Scrape last 30 days
end_date = datetime.now()
start_date = end_date - timedelta(days=30)
scrape_date_range(start_date.strftime("%Y-%m-%d"), 30, token="YOUR_TOKEN")

Storing in SQLite

For a proper data pipeline, persist everything to SQLite:

import sqlite3

def init_db(db_path="producthunt.db"):
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS launches (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            tagline TEXT,
            votes INTEGER DEFAULT 0,
            comments INTEGER DEFAULT 0,
            url TEXT,
            website TEXT,
            created_at TEXT,
            featured_at TEXT,
            topics TEXT,  -- JSON array
            thumbnail_url TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS makers (
            id TEXT,
            launch_id TEXT,
            name TEXT,
            username TEXT,
            PRIMARY KEY (id, launch_id),
            FOREIGN KEY (launch_id) REFERENCES launches(id)
        );

        CREATE INDEX IF NOT EXISTS idx_launches_votes ON launches(votes DESC);
        CREATE INDEX IF NOT EXISTS idx_launches_created ON launches(created_at);
    """)
    conn.commit()
    return conn

def save_launches(conn, launches):
    """Save a list of launches to the database."""
    import json as _json

    for launch in launches:
        conn.execute(
            """INSERT OR REPLACE INTO launches
               (id, name, tagline, votes, comments, url, website, created_at, featured_at, topics, thumbnail_url)
               VALUES (?,?,?,?,?,?,?,?,?,?,?)""",
            (
                launch["id"],
                launch["name"],
                launch["tagline"],
                launch["votes"],
                launch["comments"],
                launch["url"],
                launch.get("website"),
                launch.get("created_at"),
                launch.get("featured_at"),
                _json.dumps(launch.get("topics", [])),
                launch.get("thumbnail"),
            )
        )

        for maker in launch.get("makers", []):
            conn.execute(
                "INSERT OR REPLACE INTO makers (id, launch_id, name, username) VALUES (?,?,?,?)",
                (maker.get("id", maker["username"]), launch["id"], maker["name"], maker["username"])
            )

    conn.commit()

conn = init_db()
launches = get_daily_launches("2026-04-23", token="YOUR_TOKEN")
save_launches(conn, launches)
conn.close()

Analyzing the Data

Once you have data stored, some useful queries:

import sqlite3
import json

conn = sqlite3.connect("producthunt.db")

# Top products by votes
print("Top 10 all-time by votes:")
for row in conn.execute("SELECT name, votes, tagline FROM launches ORDER BY votes DESC LIMIT 10"):
    print(f"  {row[1]:5d} votes -- {row[0]}: {row[2][:50]}")

# Products per topic
print("\nMost common topics:")
from collections import Counter
topic_counts = Counter()
for row in conn.execute("SELECT topics FROM launches"):
    topic_counts.update(json.loads(row[0] or "[]"))
for topic, count in topic_counts.most_common(10):
    print(f"  {topic}: {count} products")

# Votes distribution
print("\nVote distribution:")
for row in conn.execute("""
    SELECT
        CASE
            WHEN votes >= 500 THEN '500+'
            WHEN votes >= 100 THEN '100-499'
            WHEN votes >= 50 THEN '50-99'
            WHEN votes >= 10 THEN '10-49'
            ELSE '0-9'
        END as bucket,
        COUNT(*) as count
    FROM launches
    GROUP BY bucket
    ORDER BY MIN(votes) DESC
"""):
    print(f"  {row[0]:10s}: {row[1]} products")

Rate Limiting Best Practices

Product Hunt will revoke your token if you abuse it. Here's a conservative request pattern that should keep you well within limits:

import time
import random
from datetime import datetime, timedelta

class RateLimitedPHClient:
    """Product Hunt client with built-in rate limiting."""

    def __init__(self, token, requests_per_hour=80):
        self.token = token
        self.requests_per_hour = requests_per_hour
        self.request_times = []

    def _wait_if_needed(self):
        now = time.time()
        # Remove requests older than 1 hour
        self.request_times = [t for t in self.request_times if now - t < 3600]

        if len(self.request_times) >= self.requests_per_hour:
            # Wait until the oldest request falls off the window
            oldest = self.request_times[0]
            wait_time = 3600 - (now - oldest) + 1
            print(f"Rate limit approaching, waiting {wait_time:.0f}s...")
            time.sleep(wait_time)

    def query(self, query, variables=None):
        self._wait_if_needed()
        result = ph_query(query, variables, self.token)
        self.request_times.append(time.time())
        # Small random delay to avoid machine-gun request patterns
        time.sleep(random.uniform(0.5, 1.5))
        return result

client = RateLimitedPHClient(token="YOUR_TOKEN", requests_per_hour=80)

Practical Tips

Use the API, not the website. The GraphQL API gives you structured data without fighting Cloudflare. Only scrape the HTML for data the API doesn't expose (like full post descriptions or gallery images).

Cache responses. Product Hunt data for past dates doesn't change much after the first 48 hours. Store daily snapshots and only re-fetch the current day. Daily archive data is essentially immutable.
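A minimal cache check, assuming the JSONL format written by scrape_date_range above, reads the dates already on disk so the scraper can skip them:

```python
import json
import os

def already_scraped(output_file="launches.jsonl"):
    """Return the set of dates already present in the JSONL archive."""
    if not os.path.exists(output_file):
        return set()
    with open(output_file) as f:
        return {json.loads(line)["date"] for line in f if line.strip()}

# Usage: check a date against already_scraped() before calling get_daily_launches
```
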

Watch your token. Don't share API tokens across multiple scrapers. One revoked token means all your scrapers go down. Create separate tokens for separate projects.

Monitor for schema changes. GraphQL schemas evolve. Product Hunt occasionally deprecates fields or changes types. Pin your queries and test them weekly with a small validation scrape.

Use featuredAt, not createdAt. Products are featured on specific days but created slightly earlier. For "launched on date X" logic, filter on featuredAt not createdAt.
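In code terms, filtering the dicts produced by get_daily_launches down to posts actually featured on a given date might look like this (featured_at is the key that function emits; it can be None for non-featured posts):

```python
def featured_on(posts, date):
    """Keep only post dicts whose featured_at falls on the given YYYY-MM-DD date."""
    return [p for p in posts if (p.get("featured_at") or "").startswith(date)]
```
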

Handle the 403 gracefully. If you get a 403, don't immediately retry. Wait at least 60 seconds and check if the token is still valid. A 403 on the GraphQL endpoint usually means your token has been temporarily or permanently blocked.

Conclusion

Product Hunt's GraphQL API is one of the cleaner startup data sources to work with. The structured query format means you get exactly the fields you need, and cursor pagination handles large result sets reliably. The main constraints are the 900 request/day limit on free tokens and their aggressive IP-based blocking of direct website scraping.

For data that's only accessible via the website, ThorData's residential proxies provide the IP diversity needed to stay under Product Hunt's radar. For API-based scraping, stay under the rate limits and you'll have a solid, reliable pipeline for tracking the startup ecosystem.