Scraping Product Hunt Launches: Python Guide (2026)
Product Hunt runs on a GraphQL API. Every product page, upvote count, maker profile, and daily ranking you see on the site comes from it. If you want to track new launches, monitor competitor products, or build a dataset of trending tools, this API is your entry point.
The API requires an Authorization header, and getting a token only takes a few minutes of setup in the developer dashboard. The tricky parts are pagination, rate limits, and the fact that Product Hunt aggressively blocks automated requests that don't look like real browser traffic.
This guide covers the full picture: getting an API token, executing GraphQL queries, paginating through large datasets, scraping maker profiles, handling rate limits, and scaling with proxies.
Why Product Hunt Data Matters
Product Hunt is one of the few places on the internet where you can reliably find what new software products are launching, who built them, and what real users think of them (via upvotes and comments). The data is valuable for:
- Competitive intelligence — Monitor when competitors launch new products or features
- Lead generation — Find makers (founders/developers) who recently launched tools in your niche
- Trend analysis — Track which categories are gaining traction over time
- SEO research — Products with high upvote counts often have strong domain authority
- Building directories — Aggregate Product Hunt data to create niche tool directories
Getting an Access Token
Product Hunt uses OAuth2. You can get a developer token from their API dashboard, or use a client credentials flow:
import httpx

def get_ph_token(client_id, client_secret):
    """Get a Product Hunt API access token via client credentials."""
    resp = httpx.post(
        "https://api.producthunt.com/v2/oauth/token",
        json={
            "client_id": client_id,
            "client_secret": client_secret,
            "grant_type": "client_credentials"
        }
    )
    resp.raise_for_status()
    data = resp.json()
    return data["access_token"]
# Alternatively, get a developer token directly from:
# https://www.producthunt.com/v2/oauth/applications
# Create an application > copy the "API Key" (not the secret)
token = "YOUR_API_KEY"
Once you have a token, all GraphQL queries go to a single endpoint with an Authorization header. Keep your token safe — Product Hunt will revoke tokens that violate their rate limits.
The GraphQL API Setup
Product Hunt's API is fully GraphQL. Every query hits the same endpoint:
import httpx
import json

API_URL = "https://api.producthunt.com/v2/api/graphql"

def ph_query(query, variables=None, token=None):
    """Execute a Product Hunt GraphQL query."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Accept": "application/json",
    }
    resp = httpx.post(
        API_URL,
        json={"query": query, "variables": variables or {}},
        headers=headers,
        timeout=30
    )
    resp.raise_for_status()
    data = resp.json()
    if "errors" in data:
        raise Exception(f"GraphQL errors: {json.dumps(data['errors'], indent=2)}")
    return data["data"]
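Before wiring in real credentials, it can help to sanity-check the request body shape: the GraphQL endpoint expects a JSON object with a query string and a variables object. A minimal sketch of that payload (the one-post smoke query below is an illustrative assumption, not taken from official docs):

```python
import json

def build_payload(query, variables=None):
    """Mirror the JSON body that ph_query posts to the GraphQL endpoint."""
    return {"query": query, "variables": variables or {}}

# A tiny query for verifying a fresh token works at all
SMOKE_QUERY = "query { posts(first: 1) { edges { node { name votesCount } } } }"

payload = build_payload(SMOKE_QUERY)
print(json.dumps(payload))
```

Posting this payload with a valid Bearer token should return a single post; if it returns 401, the token is bad before any scraping logic is involved.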
Fetching Daily Rankings
The posts query returns products ordered by votes for a given day. This is the core query for daily launch tracking:
import time

POSTS_QUERY = """
query GetPosts($postedAfter: DateTime!, $postedBefore: DateTime!, $after: String) {
  posts(
    order: VOTES
    postedAfter: $postedAfter
    postedBefore: $postedBefore
    after: $after
    first: 20
  ) {
    edges {
      node {
        id
        name
        tagline
        url
        votesCount
        commentsCount
        website
        createdAt
        featuredAt
        topics {
          edges {
            node {
              name
              slug
            }
          }
        }
        makers {
          id
          name
          username
          headline
          profileImage
        }
        thumbnail {
          url
        }
        reviewsCount
        reviewsRating
      }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}
"""

def get_daily_launches(date, token):
    """Get all launches for a specific date."""
    all_posts = []
    cursor = None
    while True:
        variables = {
            "postedAfter": f"{date}T00:00:00Z",
            "postedBefore": f"{date}T23:59:59Z",
            "after": cursor
        }
        data = ph_query(POSTS_QUERY, variables, token)
        posts = data["posts"]
        for edge in posts["edges"]:
            node = edge["node"]
            all_posts.append({
                "id": node["id"],
                "name": node["name"],
                "tagline": node["tagline"],
                "votes": node["votesCount"],
                "comments": node["commentsCount"],
                "url": node["url"],
                "website": node["website"],
                "created_at": node["createdAt"],
                "featured_at": node.get("featuredAt"),
                "makers": [{"name": m["name"], "username": m["username"]} for m in node["makers"]],
                "topics": [e["node"]["name"] for e in node["topics"]["edges"]],
                "thumbnail": node["thumbnail"]["url"] if node.get("thumbnail") else None,
                "reviews_count": node.get("reviewsCount", 0),
                "reviews_rating": node.get("reviewsRating"),
            })
        if not posts["pageInfo"]["hasNextPage"]:
            break
        cursor = posts["pageInfo"]["endCursor"]
        time.sleep(1)  # respect rate limits
    return sorted(all_posts, key=lambda x: x["votes"], reverse=True)

# Get yesterday's launches
launches = get_daily_launches("2026-04-23", token="YOUR_TOKEN")
print(f"Found {len(launches)} products launched")
for i, p in enumerate(launches[:10], 1):
    print(f"#{i} {p['name']} -- {p['votes']} votes -- {p['tagline']}")
Cursor-Based Pagination
Product Hunt uses cursor pagination -- not page numbers. Each response includes pageInfo.endCursor which you pass as the after variable in the next request. This is standard for Relay-style GraphQL APIs.
The pattern is always the same:
- Make the initial request without after
- Check pageInfo.hasNextPage
- Pass pageInfo.endCursor as after in the next request
- Repeat until hasNextPage is false
Don't skip the time.sleep(1) between paginated requests. Product Hunt rate-limits aggressively and will revoke tokens that hammer the API.
def paginate_query(query, variables_fn, data_path, token, delay=1.0):
    """Generic cursor-based paginator for Product Hunt queries."""
    all_items = []
    cursor = None
    while True:
        variables = variables_fn(cursor)
        data = ph_query(query, variables, token)
        # Navigate to the page data using the path
        page_data = data
        for key in data_path.split("."):
            page_data = page_data[key]
        for edge in page_data["edges"]:
            all_items.append(edge["node"])
        if not page_data["pageInfo"]["hasNextPage"]:
            break
        cursor = page_data["pageInfo"]["endCursor"]
        time.sleep(delay)
    return all_items
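To reuse this paginator with POSTS_QUERY, all you need is a variables_fn that threads the cursor into the query variables. A sketch of that closure pattern (make_posts_variables is a hypothetical helper name, not part of the API):

```python
def make_posts_variables(date):
    """Build a variables_fn for one launch day, threading the cursor through."""
    def variables_fn(cursor):
        return {
            "postedAfter": f"{date}T00:00:00Z",
            "postedBefore": f"{date}T23:59:59Z",
            "after": cursor,  # None on the first page, then each endCursor
        }
    return variables_fn

variables_fn = make_posts_variables("2026-04-23")
print(variables_fn(None))
print(variables_fn("Y3Vyc29yOjIw"))

# Then: paginate_query(POSTS_QUERY, variables_fn, "posts", token)
```

The same closure trick works for any query whose only changing variable is the cursor.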
Scraping Maker Profiles
To build a dataset of makers and their launch history:
MAKER_QUERY = """
query GetMaker($username: String!) {
  user(username: $username) {
    id
    name
    username
    headline
    profileImage
    websiteUrl
    twitterUsername
    followersCount
    followingCount
    votedPostsCount
    madePosts(first: 20) {
      edges {
        node {
          id
          name
          tagline
          votesCount
          commentsCount
          url
          createdAt
          topics {
            edges {
              node { name }
            }
          }
        }
      }
    }
  }
}
"""

def get_maker(username, token):
    """Get a maker's profile and their launches."""
    data = ph_query(MAKER_QUERY, {"username": username}, token)
    user = data["user"]
    if not user:
        return None
    return {
        "id": user["id"],
        "name": user["name"],
        "username": user["username"],
        "headline": user.get("headline"),
        "website": user.get("websiteUrl"),
        "twitter": user.get("twitterUsername"),
        "followers": user["followersCount"],
        "following": user["followingCount"],
        "voted_posts": user.get("votedPostsCount", 0),
        "products": [
            {
                "id": e["node"]["id"],
                "name": e["node"]["name"],
                "votes": e["node"]["votesCount"],
                "comments": e["node"]["commentsCount"],
                "launched": e["node"]["createdAt"],
                "url": e["node"]["url"],
                "topics": [t["node"]["name"] for t in e["node"]["topics"]["edges"]],
            }
            for e in user["madePosts"]["edges"]
        ]
    }

maker = get_maker("rrhoover", token="YOUR_TOKEN")
if maker:
    print(f"{maker['name']} (@{maker['username']})")
    print(f"Followers: {maker['followers']}")
    print(f"Products launched: {len(maker['products'])}")
    for p in maker["products"][:5]:
        print(f"  {p['name']} -- {p['votes']} votes ({p['launched'][:10]})")
Searching for Products by Topic
Product Hunt supports filtering by topic. To find all AI tools or all dev tools:
TOPIC_POSTS_QUERY = """
query GetTopicPosts($topic: String!, $after: String) {
  posts(
    order: VOTES
    topic: $topic
    after: $after
    first: 20
  ) {
    edges {
      node {
        id
        name
        tagline
        votesCount
        url
        createdAt
      }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}
"""

def get_posts_by_topic(topic_slug, token, max_items=100):
    """Get top products in a specific topic/category."""
    all_posts = []
    cursor = None
    while len(all_posts) < max_items:
        data = ph_query(TOPIC_POSTS_QUERY, {"topic": topic_slug, "after": cursor}, token)
        page = data["posts"]
        for edge in page["edges"]:
            all_posts.append(edge["node"])
        if not page["pageInfo"]["hasNextPage"] or len(all_posts) >= max_items:
            break
        cursor = page["pageInfo"]["endCursor"]
        time.sleep(1)
    return all_posts[:max_items]

# Common topic slugs: artificial-intelligence, developer-tools, productivity,
# marketing, design-tools, finance, education, health-fitness
ai_tools = get_posts_by_topic("artificial-intelligence", token="YOUR_TOKEN", max_items=200)
print(f"Found {len(ai_tools)} AI products")
Anti-Bot Measures
Product Hunt's bot detection is more aggressive than most sites:
- Token-based rate limiting — each API token has a request quota. Exceeding it returns 429 errors and can lead to token revocation. Stay under 100 requests per hour for sustained crawling.
- Browser fingerprinting on the website — if you scrape the HTML directly instead of using the API, you'll hit Cloudflare challenges, JavaScript rendering requirements, and behavioral analysis.
- GraphQL query complexity limits — requesting too many nested fields or too many items per page will fail with complexity errors. Keep first at 20 or below, and don't nest more than 3-4 levels deep.
- IP reputation scoring — datacenter IPs get scrutinized more than residential ones.
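When you do hit a 429, back off exponentially instead of retrying immediately. A minimal sketch, assuming you wrap your ph_query calls so they raise a custom RateLimited exception on a 429 response (both names are illustrative, not part of any library):

```python
import random
import time

class RateLimited(Exception):
    """Raised by your request wrapper when the API answers 429."""

def backoff_delay(attempt, base=2.0, cap=300.0):
    """Exponential delay with +/-25% jitter, capped at five minutes."""
    return min(cap, base * (2 ** attempt)) * random.uniform(0.75, 1.25)

def with_backoff(fn, max_retries=5, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with increasing delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            sleep(backoff_delay(attempt))
```

The injectable sleep parameter is just a convenience for testing the retry logic without real delays.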
For the API route, the main risk is token revocation. Keep requests under 100/hour and you'll be fine. For the website route (needed for data the API doesn't expose), you'll want residential proxies.
ThorData's residential proxy network rotates IPs automatically and handles Cloudflare challenges, which is essential for Product Hunt's website -- their bot detection flags datacenter IPs within a few requests.
# For direct website scraping (not API)
import httpx

proxied_client = httpx.Client(
    proxy="http://user:[email protected]:9000",
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
    timeout=30
)

# For the GraphQL API with proxy rotation:
def ph_query_proxied(query, variables=None, token=None, proxy_url=None):
    """Execute a Product Hunt GraphQL query via proxy."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    with httpx.Client(proxy=proxy_url, timeout=30) as client:
        resp = client.post(
            API_URL,
            json={"query": query, "variables": variables or {}},
            headers=headers
        )
        resp.raise_for_status()
        data = resp.json()
        if "errors" in data:
            raise Exception(f"GraphQL errors: {data['errors']}")
        return data["data"]
Tracking Launches Over Time
To build a historical dataset, run the daily scraper on a schedule:
from datetime import datetime, timedelta
import json

def scrape_date_range(start_date, days, token, output_file="launches.jsonl"):
    """Scrape launches over a range of dates. Appends to a JSONL file."""
    current = datetime.strptime(start_date, "%Y-%m-%d")
    with open(output_file, "a") as out:
        for day in range(days):
            date_str = current.strftime("%Y-%m-%d")
            print(f"Scraping {date_str} ({day+1}/{days})...")
            try:
                launches = get_daily_launches(date_str, token)
                record = {
                    "date": date_str,
                    "count": len(launches),
                    "products": launches
                }
                out.write(json.dumps(record) + "\n")
                print(f"  Got {len(launches)} products")
            except Exception as e:
                print(f"  Failed: {e}")
            current += timedelta(days=1)
            time.sleep(5)  # be polite between days

# Scrape the last 30 days
end_date = datetime.now()
start_date = end_date - timedelta(days=30)
scrape_date_range(start_date.strftime("%Y-%m-%d"), 30, token="YOUR_TOKEN")
Storing in SQLite
For a proper data pipeline, persist everything to SQLite:
import sqlite3

def init_db(db_path="producthunt.db"):
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS launches (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            tagline TEXT,
            votes INTEGER DEFAULT 0,
            comments INTEGER DEFAULT 0,
            url TEXT,
            website TEXT,
            created_at TEXT,
            featured_at TEXT,
            topics TEXT,  -- JSON array
            thumbnail_url TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS makers (
            id TEXT,
            launch_id TEXT,
            name TEXT,
            username TEXT,
            PRIMARY KEY (id, launch_id),
            FOREIGN KEY (launch_id) REFERENCES launches(id)
        );
        CREATE INDEX IF NOT EXISTS idx_launches_votes ON launches(votes DESC);
        CREATE INDEX IF NOT EXISTS idx_launches_created ON launches(created_at);
    """)
    conn.commit()
    return conn

def save_launches(conn, launches):
    """Save a list of launches to the database."""
    import json as _json
    for launch in launches:
        conn.execute(
            """INSERT OR REPLACE INTO launches
               (id, name, tagline, votes, comments, url, website, created_at, featured_at, topics, thumbnail_url)
               VALUES (?,?,?,?,?,?,?,?,?,?,?)""",
            (
                launch["id"],
                launch["name"],
                launch["tagline"],
                launch["votes"],
                launch["comments"],
                launch["url"],
                launch.get("website"),
                launch.get("created_at"),
                launch.get("featured_at"),
                _json.dumps(launch.get("topics", [])),
                launch.get("thumbnail"),
            )
        )
        for maker in launch.get("makers", []):
            conn.execute(
                "INSERT OR REPLACE INTO makers (id, launch_id, name, username) VALUES (?,?,?,?)",
                (maker.get("id", maker["username"]), launch["id"], maker["name"], maker["username"])
            )
    conn.commit()

conn = init_db()
launches = get_daily_launches("2026-04-23", token="YOUR_TOKEN")
save_launches(conn, launches)
conn.close()
Analyzing the Data
Once you have data stored, some useful queries:
import sqlite3
import json
from collections import Counter

conn = sqlite3.connect("producthunt.db")

# Top products by votes
print("Top 10 all-time by votes:")
for row in conn.execute("SELECT name, votes, tagline FROM launches ORDER BY votes DESC LIMIT 10"):
    print(f"  {row[1]:5d} votes -- {row[0]}: {(row[2] or '')[:50]}")

# Products per topic
print("\nMost common topics:")
topic_counts = Counter()
for row in conn.execute("SELECT topics FROM launches"):
    for topic in json.loads(row[0] or "[]"):
        topic_counts[topic] += 1
for topic, count in topic_counts.most_common(10):
    print(f"  {count:4d}  {topic}")

# Votes distribution
print("\nVote distribution:")
for row in conn.execute("""
    SELECT
        CASE
            WHEN votes >= 500 THEN '500+'
            WHEN votes >= 100 THEN '100-499'
            WHEN votes >= 50 THEN '50-99'
            WHEN votes >= 10 THEN '10-49'
            ELSE '0-9'
        END AS bucket,
        COUNT(*) AS count
    FROM launches
    GROUP BY bucket
    ORDER BY MIN(votes) DESC
"""):
    print(f"  {row[0]:10s}: {row[1]} products")
Rate Limiting Best Practices
Product Hunt will revoke your token if you abuse it. Here's a conservative request pattern that should keep you well within limits:
import time
import random

class RateLimitedPHClient:
    """Product Hunt client with built-in rate limiting."""

    def __init__(self, token, requests_per_hour=80):
        self.token = token
        self.requests_per_hour = requests_per_hour
        self.request_times = []

    def _wait_if_needed(self):
        now = time.time()
        # Drop requests older than 1 hour from the sliding window
        self.request_times = [t for t in self.request_times if now - t < 3600]
        if len(self.request_times) >= self.requests_per_hour:
            # Wait until the oldest request falls off the window
            oldest = self.request_times[0]
            wait_time = 3600 - (now - oldest) + 1
            print(f"Rate limit approaching, waiting {wait_time:.0f}s...")
            time.sleep(wait_time)

    def query(self, query, variables=None):
        self._wait_if_needed()
        result = ph_query(query, variables, self.token)
        self.request_times.append(time.time())
        # Small random delay to avoid machine-gun request patterns
        time.sleep(random.uniform(0.5, 1.5))
        return result

client = RateLimitedPHClient(token="YOUR_TOKEN", requests_per_hour=80)
Practical Tips
Use the API, not the website. The GraphQL API gives you structured data without fighting Cloudflare. Only scrape the HTML for data the API doesn't expose (like full post descriptions or gallery images).
Cache responses. Product Hunt data for past dates doesn't change much after the first 48 hours. Store daily snapshots and only re-fetch the current day. Daily archive data is essentially immutable.
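One way to implement that cache is a snapshot file per day, since a past day's data never needs re-fetching. A sketch, assuming a local ph_cache directory and a fetch callable shaped like get_daily_launches (both are assumptions of this example):

```python
import json
from pathlib import Path

def cached_daily_launches(date, fetch, cache_dir=Path("ph_cache")):
    """Return launches for a date, calling fetch(date) only on a cache miss."""
    cache_dir.mkdir(exist_ok=True)
    path = cache_dir / f"{date}.json"
    if path.exists():
        # Past days are effectively immutable, so a hit never re-fetches
        return json.loads(path.read_text())
    launches = fetch(date)
    path.write_text(json.dumps(launches))
    return launches
```

You might keep the current day out of the cache, or overwrite its snapshot on each run, since votes are still moving.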
Watch your token. Don't share API tokens across multiple scrapers. One revoked token means all your scrapers go down. Create separate tokens for separate projects.
Monitor for schema changes. GraphQL schemas evolve. Product Hunt occasionally deprecates fields or changes types. Pin your queries and test them weekly with a small validation scrape.
Use featuredAt, not createdAt. Products are featured on specific days but created slightly earlier. For "launched on date X" logic, filter on featuredAt not createdAt.
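With the dicts returned by get_daily_launches, that filter is a one-liner once you guard against a missing featured_at (launched_on is a hypothetical helper name):

```python
def launched_on(products, date):
    """Keep products whose featured_at timestamp starts with a YYYY-MM-DD date."""
    return [p for p in products if (p.get("featured_at") or "").startswith(date)]

sample = [
    {"name": "A", "featured_at": "2026-04-23T08:01:00Z"},
    {"name": "B", "featured_at": None},       # never featured
    {"name": "C", "featured_at": "2026-04-22T23:50:00Z"},
]
print([p["name"] for p in launched_on(sample, "2026-04-23")])  # ['A']
```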
Handle the 403 gracefully. If you get a 403, don't immediately retry. Wait at least 60 seconds and check if the token is still valid. A 403 on the GraphQL endpoint usually means your token has been temporarily or permanently blocked.
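A small helper makes that retry policy explicit. The status-to-wait mapping below is this guide's suggestion, not documented Product Hunt behavior:

```python
def retry_wait(status, retry_after=None):
    """Seconds to wait before retrying, or None when you should stop.

    429: honor a Retry-After header when present, else back off a minute.
    403: wait at least 60s, then re-validate the token before retrying.
    Anything else: don't retry blindly.
    """
    if status == 429:
        return float(retry_after) if retry_after else 60.0
    if status == 403:
        return 60.0
    return None

print(retry_wait(429, retry_after="120"))  # 120.0
print(retry_wait(500))                     # None
```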
Conclusion
Product Hunt's GraphQL API is one of the cleaner startup data sources to work with. The structured query format means you get exactly the fields you need, and cursor pagination handles large result sets reliably. The main constraints are the token rate limits (stay under roughly 100 requests per hour for sustained crawling) and Product Hunt's aggressive IP-based blocking of direct website scraping.
For data that's only accessible via the website, ThorData's residential proxies provide the IP diversity needed to stay under Product Hunt's radar. For API-based scraping, stay under the rate limits and you'll have a solid, reliable pipeline for tracking the startup ecosystem.