How to Scrape Bandcamp Music Data with Python (2026)
Bandcamp doesn't have a public API. They had one years ago, shut it down, and never brought it back. If you want structured data about artists, albums, pricing, or fan activity on Bandcamp, scraping is the only option.
The good news: Bandcamp's HTML is relatively clean. Most data lives in structured JSON-LD embedded in the page source. You don't need a headless browser for most tasks — plain HTTP requests work fine for public pages.
This guide covers scraping artist profiles, album listings, individual tracks, fan data, estimating sales from publicly visible signals, and running a full production pipeline with proxy rotation via ThorData.
Why Scrape Bandcamp?
Bandcamp is one of the last major music platforms where artists retain control and pricing is transparent. This makes it uniquely valuable for:
- Music industry research: Track which genres and artists are gaining traction, monitor pricing strategies, and analyze fan engagement patterns
- A&R discovery: Identify emerging artists before they get mainstream attention by monitoring sales velocity and fan growth
- Price intelligence: Understand what fans actually pay for digital music (the "name your price" model provides rich data)
- Fan community analysis: Bandcamp fan profiles reveal music taste clusters that marketing platforms can't show you
- Trend analysis: See which tags and sounds are emerging across independent music scenes
Unlike Spotify or Apple Music, Bandcamp shows real transactional signals — who bought what, at what price. That data doesn't exist anywhere else.
Setup
pip install requests beautifulsoup4 lxml
(sqlite3 ships with Python's standard library, so it needs no install.)
You'll also need a proxy service for any serious data collection. We'll use ThorData residential proxies throughout this guide — the key reason is that residential IPs from real ISPs don't trigger Bandcamp's rate limiting the way datacenter IPs do.
Understanding Bandcamp's Page Structure
Before scraping, understand what you're working with. Bandcamp has several page types:
- Artist pages (artist.bandcamp.com) — discography, about info, links
- Album pages (artist.bandcamp.com/album/name) — track listings, pricing, credits
- Track pages (artist.bandcamp.com/track/name) — individual track with streaming
- Fan pages (bandcamp.com/username) — collection of purchased albums
- Discovery pages (bandcamp.com/discover) — curated tag-based browsing
The most valuable insight: Bandcamp embeds JSON-LD structured data in <script type="application/ld+json"> tags on every artist and album page. This gives you machine-readable data without parsing messy HTML. Always try this first before falling back to CSS selectors.
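As a minimal sketch of that pattern (the helper name is mine, not Bandcamp's), pulling the JSON-LD blob out of any fetched page looks like this. It uses html.parser so the snippet runs without lxml installed; in the full scrapers below, lxml is used for speed:

```python
import json
from typing import Any, Dict, Optional

from bs4 import BeautifulSoup


def extract_json_ld(html: str) -> Optional[Dict[str, Any]]:
    """Return the first parsed JSON-LD blob from a page, or None if absent/invalid."""
    # "html.parser" keeps this snippet dependency-free; swap in "lxml" for speed.
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", {"type": "application/ld+json"})
    if script is None or not script.string:
        return None
    try:
        return json.loads(script.string)
    except json.JSONDecodeError:
        return None
```

Every scraper in this guide follows the same shape: try this blob first, fall back to CSS selectors only when it's missing.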
Scraping Artist Pages
Every Bandcamp artist has a page at artistname.bandcamp.com. The page lists their discography and basic info.
import requests
from bs4 import BeautifulSoup
import json
import time
import random
import re
import sqlite3
from pathlib import Path
from datetime import datetime
from typing import Optional, Dict, List, Any
def make_headers(rotate_ua: bool = True) -> Dict[str, str]:
"""Generate realistic browser headers."""
user_agents = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:127.0) Gecko/20100101 Firefox/127.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 14.5; rv:127.0) Gecko/20100101 Firefox/127.0",
]
return {
"User-Agent": random.choice(user_agents) if rotate_ua else user_agents[0],
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
}
def scrape_artist(artist_slug: str, proxy: Optional[str] = None) -> Optional[Dict]:
"""Scrape artist page and return structured data."""
url = f"https://{artist_slug}.bandcamp.com"
headers = make_headers()
proxies = {"http": proxy, "https": proxy} if proxy else None
try:
response = requests.get(url, headers=headers, proxies=proxies, timeout=20)
response.raise_for_status()
except requests.RequestException as e:
print(f"[ERROR] Failed to fetch {url}: {e}")
return None
soup = BeautifulSoup(response.text, "lxml")
# Try JSON-LD first — cleanest data source
ld_script = soup.find("script", {"type": "application/ld+json"})
if ld_script:
try:
data = json.loads(ld_script.string)
# Normalize JSON-LD structure
result = {
"slug": artist_slug,
"url": url,
"name": data.get("name") or data.get("foundingLocation", {}).get("name"),
"description": data.get("description", ""),
"location": None,
"albums": [],
"source": "json-ld",
"scraped_at": datetime.utcnow().isoformat(),
}
# Albums are listed in "album" key as list or single object
albums_raw = data.get("album", [])
if isinstance(albums_raw, dict):
albums_raw = [albums_raw]
for alb in albums_raw:
result["albums"].append({
"title": alb.get("name"),
"url": alb.get("@id") or alb.get("url"),
"release_date": alb.get("datePublished"),
})
return result
except json.JSONDecodeError:
pass # Fall through to HTML parsing
# Fallback: parse HTML directly
artist_name = soup.select_one("#band-name-location .title")
location = soup.select_one("#band-name-location .location")
bio_el = soup.select_one("#bio-text")
albums = []
for item in soup.select("#music-grid .music-grid-item"):
link = item.select_one("a")
title = item.select_one(".title")
if link and title:
href = link.get("href", "")
if href and not href.startswith("http"):
href = f"https://{artist_slug}.bandcamp.com{href}"
albums.append({
"title": title.text.strip(),
"url": href,
"release_date": None,
})
return {
"slug": artist_slug,
"url": url,
"name": artist_name.text.strip() if artist_name else None,
"location": location.text.strip() if location else None,
"description": bio_el.get_text(" ", strip=True)[:500] if bio_el else "",
"albums": albums,
"source": "html",
"scraped_at": datetime.utcnow().isoformat(),
}
Extracting Album and Track Data
Album pages contain track listings, pricing, release dates, and credits. The JSON-LD data here is especially rich.
def scrape_album(album_url: str, proxy: Optional[str] = None) -> Optional[Dict]:
"""Scrape album page for full track and pricing data."""
headers = make_headers()
proxies = {"http": proxy, "https": proxy} if proxy else None
try:
response = requests.get(album_url, headers=headers, proxies=proxies, timeout=20)
response.raise_for_status()
except requests.RequestException as e:
print(f"[ERROR] Failed to fetch {album_url}: {e}")
return None
soup = BeautifulSoup(response.text, "lxml")
# JSON-LD contains track listing, price, release date
ld_script = soup.find("script", {"type": "application/ld+json"})
if ld_script:
try:
data = json.loads(ld_script.string)
except json.JSONDecodeError:
data = {}
else:
data = {}
album = {
"url": album_url,
"title": data.get("name"),
"artist": data.get("byArtist", {}).get("name"),
"release_date": data.get("datePublished"),
"num_tracks": data.get("numTracks"),
"description": data.get("description", "")[:500],
"tracks": [],
"price": None,
"tags": [],
"scraped_at": datetime.utcnow().isoformat(),
}
# Extract minimum price from offers
offers = data.get("offers", {})
if isinstance(offers, dict):
album["price"] = {
"amount": offers.get("price"),
"currency": offers.get("priceCurrency"),
"availability": offers.get("availability", "").split("/")[-1],
}
# Track listing
track_list = data.get("track", {})
if isinstance(track_list, dict):
for track in track_list.get("itemListElement", []):
item = track.get("item", {})
album["tracks"].append({
"position": track.get("position"),
"title": item.get("name"),
"duration": item.get("duration"), # ISO 8601 duration
"url": item.get("@id"),
})
# Tags from HTML (not in JSON-LD)
for tag_el in soup.select(".tralbum-tags .tag"):
album["tags"].append(tag_el.text.strip())
# Supporter count from HTML
supporter_text = soup.select_one(".collected-by-header")
if supporter_text:
match = re.search(r"([\d,]+)\s+supporter", supporter_text.text)
if match:
album["supporter_count"] = int(match.group(1).replace(",", ""))
# Credits from HTML
credits_el = soup.select_one("#trackInfo .tralbum-credits")
if credits_el:
album["credits"] = credits_el.get_text("\n", strip=True)[:500]
return album
def scrape_track(track_url: str, proxy: Optional[str] = None) -> Optional[Dict]:
"""Scrape individual track page."""
headers = make_headers()
proxies = {"http": proxy, "https": proxy} if proxy else None
try:
response = requests.get(track_url, headers=headers, proxies=proxies, timeout=20)
response.raise_for_status()
except requests.RequestException as e:
print(f"[ERROR] {e}")
return None
soup = BeautifulSoup(response.text, "lxml")
ld_script = soup.find("script", {"type": "application/ld+json"})
if not ld_script:
return None
try:
data = json.loads(ld_script.string)
except json.JSONDecodeError:
return None
return {
"url": track_url,
"title": data.get("name"),
"artist": data.get("byArtist", {}).get("name"),
"album": data.get("inAlbum", {}).get("name"),
"duration": data.get("duration"),
"release_date": data.get("datePublished"),
"tags": [t.text.strip() for t in soup.select(".tralbum-tags .tag")],
"price": data.get("offers", {}).get("price"),
"scraped_at": datetime.utcnow().isoformat(),
}
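The duration field comes back as an ISO 8601 string (Bandcamp pages typically use a form like P00H03M25S rather than the strict PT3M25S). A small stdlib-only converter (helper name is mine) turns either form into seconds for analysis:

```python
import re


def iso_duration_to_seconds(duration: str) -> int:
    """Convert an ISO 8601 duration like 'PT3M25S' or 'P00H03M25S' to seconds."""
    if not duration:
        return 0
    # The leading T is optional here because Bandcamp sometimes omits it.
    match = re.match(r"P(?:T)?(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?", duration)
    if not match:
        return 0
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    seconds = float(match.group(3) or 0)
    return int(hours * 3600 + minutes * 60 + seconds)
```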
Scraping Fan Collections
Bandcamp fan profiles are public by default. Each fan page shows purchased albums and wishlisted items — useful for understanding what music a given community actually buys.
Fan collection pages use an internal API endpoint for pagination. The initial HTML embeds the first batch of items in a data-blob attribute.
def scrape_fan_collection(username: str, proxy: Optional[str] = None) -> Optional[Dict]:
"""Scrape a Bandcamp fan's complete purchased collection."""
url = f"https://bandcamp.com/{username}"
headers = make_headers()
proxies = {"http": proxy, "https": proxy} if proxy else None
try:
response = requests.get(url, headers=headers, proxies=proxies, timeout=20)
response.raise_for_status()
except requests.RequestException as e:
print(f"[ERROR] {e}")
return None
soup = BeautifulSoup(response.text, "lxml")
# Fan data is embedded in a data-blob attribute on #pagedata div
fan_data_div = soup.find("div", {"id": "pagedata"})
if not fan_data_div or "data-blob" not in fan_data_div.attrs:
print(f"[WARN] No pagedata blob found for {username}")
return None
try:
blob = json.loads(fan_data_div["data-blob"])
except json.JSONDecodeError:
return None
fan_data = blob.get("fan_data", {})
fan_id = fan_data.get("fan_id")
collection_data = blob.get("collection_data", {})
items = []
for item_id, item in collection_data.get("items", {}).items():
items.append({
"title": item.get("album_title") or item.get("item_title"),
"artist": item.get("band_name"),
"url": item.get("item_url"),
"item_type": item.get("item_type"),
"purchased": item.get("purchased"),
"band_url": item.get("band_url"),
})
# Paginate through full collection via internal API
last_token = collection_data.get("last_token")
page_num = 1
while last_token:
api_url = "https://bandcamp.com/api/fancollection/1/collection_items"
payload = {
"fan_id": fan_id,
"older_than_token": last_token,
"count": 20,
}
try:
api_resp = requests.post(
api_url,
json=payload,
headers={**headers, "Content-Type": "application/json"},
proxies=proxies,
timeout=20,
)
api_resp.raise_for_status()
page = api_resp.json()
except requests.RequestException as e:
print(f"[ERROR] Pagination request failed on page {page_num}: {e}")
break
page_items = page.get("items", [])
if not page_items:
break
for item in page_items:
items.append({
"title": item.get("album_title") or item.get("item_title"),
"artist": item.get("band_name"),
"url": item.get("item_url"),
"item_type": item.get("item_type"),
"purchased": item.get("purchased"),
})
last_token = page.get("last_token")
page_num += 1
time.sleep(random.uniform(1.0, 2.0))
return {
"username": username,
"fan_id": fan_id,
"display_name": fan_data.get("name"),
"location": fan_data.get("location"),
"collection_count": len(items),
"items": items,
"scraped_at": datetime.utcnow().isoformat(),
}
Scraping Bandcamp Discovery
The Bandcamp discovery page lets you browse by tag — a goldmine for finding music by genre.
def scrape_discovery(tag: str, pages: int = 5, proxy: Optional[str] = None) -> List[Dict]:
"""Scrape Bandcamp discover page for a given tag."""
all_items = []
headers = make_headers()
proxies = {"http": proxy, "https": proxy} if proxy else None
for page in range(1, pages + 1):
url = f"https://bandcamp.com/discover/{tag}"
params = {
"g": "all",
"s": "top",
"p": page,
"gn": 0,
"f": "digital",
"lo": "false",
"hi_ab": "false",
}
try:
resp = requests.get(url, params=params, headers=headers, proxies=proxies, timeout=20)
resp.raise_for_status()
except requests.RequestException as e:
print(f"[ERROR] Discovery page {page} failed: {e}")
break
soup = BeautifulSoup(resp.text, "lxml")
# Discovery items are in .discover-result elements
for item in soup.select(".discover-result"):
title_el = item.select_one(".result-title")
artist_el = item.select_one(".result-info-inner .subhead a")
link_el = item.select_one("a.art")
price_el = item.select_one(".result-info .price")
if title_el:
all_items.append({
"title": title_el.text.strip(),
"artist": artist_el.text.strip() if artist_el else None,
"url": link_el.get("href") if link_el else None,
"price": price_el.text.strip() if price_el else None,
"tag": tag,
"page": page,
})
time.sleep(random.uniform(2.0, 4.0))
return all_items
Estimating Sales from Public Data
Bandcamp doesn't publish sales numbers, but you can estimate them from publicly visible signals — supporter counts, pricing, and how long the album has been available.
def estimate_album_popularity(album_url: str, proxy: Optional[str] = None) -> Dict:
"""Estimate sales and popularity metrics for an album."""
headers = make_headers()
proxies = {"http": proxy, "https": proxy} if proxy else None
try:
response = requests.get(album_url, headers=headers, proxies=proxies, timeout=20)
response.raise_for_status()
except requests.RequestException as e:
return {"error": str(e), "url": album_url}
soup = BeautifulSoup(response.text, "lxml")
# Visible supporter/buyer count
supporter_text = soup.select_one(".collected-by-header")
supporter_count = 0
if supporter_text:
match = re.search(r"([\d,]+)\s+supporter", supporter_text.text)
if match:
supporter_count = int(match.group(1).replace(",", ""))
else:
# Count visible thumb images
thumbs = soup.select(".buy-thumbs .thumb-row a")
supporter_count = len(thumbs)
# Recent purchases feed (visible on the page)
recent_purchases = soup.select(".purchaseinfo .buyer")
recent_count = len(recent_purchases)
# Tags for genre context
tags = [t.text.strip() for t in soup.select(".tralbum-tags .tag")]
# Release date for velocity calculation
ld_script = soup.find("script", {"type": "application/ld+json"})
release_date = None
min_price = None
if ld_script:
try:
data = json.loads(ld_script.string)
release_date = data.get("datePublished")
offers = data.get("offers", {})
if isinstance(offers, dict):
min_price = offers.get("price")
except json.JSONDecodeError:
pass
# Estimate: not every buyer shows as supporter
# Typical ratio is 1 visible : 1.5-3 actual purchasers
low_estimate = supporter_count
high_estimate = int(supporter_count * 2.5)
# If we have a price, estimate revenue range
revenue_low = round(float(min_price) * low_estimate, 2) if min_price and float(min_price or 0) > 0 else None
revenue_high = round(float(min_price) * high_estimate, 2) if min_price and float(min_price or 0) > 0 else None
return {
"url": album_url,
"visible_supporters": supporter_count,
"recent_purchases_visible": recent_count,
"tags": tags,
"release_date": release_date,
"min_price": min_price,
"estimated_sales_low": low_estimate,
"estimated_sales_high": high_estimate,
"estimated_revenue_low": revenue_low,
"estimated_revenue_high": revenue_high,
}
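The release date pulled above also enables a simple velocity metric: supporters gained per day since release, which is what makes an "emerging artist" signal comparable across old and new albums. A sketch (helper name is mine; the date formats tried are an assumption about how datePublished is serialized):

```python
from datetime import datetime
from typing import Optional


def supporter_velocity(supporter_count: int, release_date: str,
                       as_of: Optional[datetime] = None) -> float:
    """Average supporters per day since release; 0.0 if the date can't be parsed."""
    if not release_date:
        return 0.0
    as_of = as_of or datetime.utcnow()
    # datePublished often looks like "01 Mar 2024 00:00:00 GMT"; also try plain ISO dates.
    for fmt in ("%d %b %Y %H:%M:%S %Z", "%Y-%m-%d"):
        try:
            released = datetime.strptime(release_date, fmt)
            break
        except ValueError:
            continue
    else:
        return 0.0
    days = max((as_of - released).days, 1)  # avoid divide-by-zero on release day
    return round(supporter_count / days, 3)
```

Comparing this number week over week for the same album is a cheaper trend signal than re-estimating absolute sales each time.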
Anti-Detection Techniques
Bandcamp's anti-bot protection is moderate. Public pages don't serve CAPTCHAs the way Google or Cloudflare-protected sites do, but aggressive scraping will get your IP temporarily blocked. Here's the full playbook:
1. Realistic Request Timing
Random delays between requests are essential. Humans don't make requests at perfectly even intervals.
def polite_sleep(min_s: float = 1.5, max_s: float = 3.5):
"""Sleep for a human-like random duration."""
duration = random.uniform(min_s, max_s)
time.sleep(duration)
def scrape_with_backoff(url: str, proxy: Optional[str] = None, max_retries: int = 5) -> Optional[requests.Response]:
"""Fetch URL with exponential backoff on rate limit responses."""
headers = make_headers()
proxies = {"http": proxy, "https": proxy} if proxy else None
for attempt in range(max_retries):
try:
resp = requests.get(url, headers=headers, proxies=proxies, timeout=20)
if resp.status_code == 200:
return resp
elif resp.status_code == 429:
wait = (2 ** attempt) * 5 + random.uniform(0, 5)
print(f"[RATE LIMIT] Waiting {wait:.1f}s before retry {attempt + 1}/{max_retries}")
time.sleep(wait)
elif resp.status_code in (403, 503):
wait = (2 ** attempt) * 10
print(f"[BLOCKED] Status {resp.status_code}, waiting {wait:.0f}s")
time.sleep(wait)
else:
print(f"[ERROR] Status {resp.status_code} for {url}")
return None
except requests.Timeout:
print(f"[TIMEOUT] Attempt {attempt + 1} timed out")
time.sleep(5 * (attempt + 1))
except requests.RequestException as e:
print(f"[ERROR] Request failed: {e}")
return None
print(f"[FAIL] All {max_retries} attempts exhausted for {url}")
return None
2. User-Agent Rotation
Bandcamp validates User-Agent headers. Rotating among 3-5 current browser strings is sufficient for most collection tasks.
3. Session Reuse with Cooldowns
Using a requests.Session for related requests (artist page + all albums) is more realistic than creating new sessions for each request. But reset sessions after 15-20 requests.
def create_session(proxy: Optional[str] = None) -> requests.Session:
"""Create a configured session with realistic headers."""
session = requests.Session()
session.headers.update(make_headers(rotate_ua=False)) # Consistent UA within a session
if proxy:
session.proxies = {"http": proxy, "https": proxy}
return session
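The 15-20 request budget can be enforced with a thin wrapper around create_session's logic (the class name is mine, and the single static User-Agent keeps the sketch self-contained; in the real pipeline you'd call make_headers(rotate_ua=False) instead):

```python
from typing import Optional

import requests


class RecyclingSession:
    """A requests.Session wrapper that rebuilds the session after a request budget."""

    def __init__(self, max_requests: int = 15, proxy: Optional[str] = None):
        self.max_requests = max_requests
        self.proxy = proxy
        self.count = 0
        self.session = self._build()

    def _build(self) -> requests.Session:
        session = requests.Session()
        # Static UA for the sketch; use make_headers(rotate_ua=False) in practice.
        session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/126.0.0.0 Safari/537.36",
        })
        if self.proxy:
            session.proxies = {"http": self.proxy, "https": self.proxy}
        return session

    def _maybe_recycle(self) -> None:
        """Drop cookies and connections once the budget is spent."""
        if self.count >= self.max_requests:
            self.session.close()
            self.session = self._build()
            self.count = 0

    def get(self, url: str, **kwargs) -> requests.Response:
        self._maybe_recycle()
        self.count += 1
        return self.session.get(url, timeout=20, **kwargs)
```

Resetting the session discards accumulated cookies, so a long crawl doesn't build up a trackable identity on one IP.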
ThorData Proxy Integration
For any serious Bandcamp data collection — more than a few hundred pages — you need rotating residential proxies. Bandcamp blocks datacenter IP ranges (AWS, GCP, DigitalOcean) fairly aggressively.
ThorData provides residential proxies that route through real ISP-assigned IP addresses. Here's how to integrate:
import os
from typing import Iterator
class ThorDataProxyPool:
"""Rotating proxy pool using ThorData residential proxies."""
def __init__(self, username: str, password: str, host: str = "gate.thordata.com", port: int = 9000):
self.username = username
self.password = password
self.host = host
self.port = port
def get_proxy(self, country: Optional[str] = None, session_id: Optional[str] = None) -> str:
"""Get a proxy URL, optionally geo-targeted."""
user = self.username
if country:
user = f"{self.username}-country-{country.upper()}"
if session_id:
user = f"{user}-session-{session_id}"
return f"http://{user}:{self.password}@{self.host}:{self.port}"
def get_rotating_proxy(self, country: Optional[str] = None) -> str:
"""Get a fresh rotating proxy (new IP each request)."""
return self.get_proxy(country=country)
def get_sticky_proxy(self, session_id: str, country: Optional[str] = None) -> str:
"""Get a sticky proxy (same IP for the session duration)."""
return self.get_proxy(country=country, session_id=session_id)
def scrape_with_thordata(artist_slugs: List[str], thordata_user: str, thordata_pass: str) -> List[Dict]:
"""Scrape multiple artists using ThorData rotating proxies."""
pool = ThorDataProxyPool(thordata_user, thordata_pass)
results = []
for slug in artist_slugs:
# Fresh proxy for each artist
proxy = pool.get_rotating_proxy()
artist = scrape_artist(slug, proxy=proxy)
if not artist:
continue
results.append(artist)
print(f"Scraped {slug}: {len(artist.get('albums', []))} albums")
# Now get each album with the same session proxy for consistency
session_id = f"artist-{slug[:10]}"
session_proxy = pool.get_sticky_proxy(session_id)
for album_info in artist.get("albums", []):
album_url = album_info.get("url")
if not album_url:
continue
album = scrape_album(album_url, proxy=session_proxy)
if album:
artist.setdefault("album_details", []).append(album)
print(f" Album: {album.get('title')} — {len(album.get('tracks', []))} tracks")
polite_sleep(1.5, 3.0)
polite_sleep(3.0, 6.0) # Longer pause between artists
return results
Pagination Handling
Bandcamp artist pages don't paginate (all albums shown on one page), but the fan collection API and the discovery page do.
def paginate_fan_wishlist(fan_id: int, username: str, proxy: Optional[str] = None) -> List[Dict]:
"""Paginate through a fan's full wishlist."""
headers = make_headers()
proxies = {"http": proxy, "https": proxy} if proxy else None
items = []
cursor = None
page = 1
while True:
api_url = "https://bandcamp.com/api/fancollection/1/wishlist_items"
payload = {"fan_id": fan_id, "count": 20}
if cursor:
payload["older_than_token"] = cursor
try:
resp = requests.post(
api_url,
json=payload,
headers={**headers, "Content-Type": "application/json"},
proxies=proxies,
timeout=20,
)
resp.raise_for_status()
data = resp.json()
except requests.RequestException as e:
print(f"[ERROR] Wishlist page {page} failed: {e}")
break
page_items = data.get("items", [])
if not page_items:
break
items.extend(page_items)
cursor = data.get("last_token")
print(f" Wishlist page {page}: {len(page_items)} items (total: {len(items)})")
if not cursor:
break
page += 1
time.sleep(random.uniform(1.0, 2.0))
return items
Data Storage
SQLite is the right choice for Bandcamp data — structured enough to query, simple enough to set up.
def init_database(db_path: str = "bandcamp.db") -> sqlite3.Connection:
"""Initialize the Bandcamp data database."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL") # Better concurrent access
conn.executescript("""
CREATE TABLE IF NOT EXISTS artists (
slug TEXT PRIMARY KEY,
name TEXT,
location TEXT,
description TEXT,
url TEXT,
album_count INTEGER DEFAULT 0,
scraped_at TEXT,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS albums (
url TEXT PRIMARY KEY,
artist_slug TEXT,
title TEXT,
release_date TEXT,
num_tracks INTEGER,
min_price REAL,
currency TEXT,
supporter_count INTEGER,
tags TEXT, -- JSON array stored as text
scraped_at TEXT,
FOREIGN KEY (artist_slug) REFERENCES artists(slug)
);
CREATE TABLE IF NOT EXISTS tracks (
url TEXT PRIMARY KEY,
album_url TEXT,
title TEXT,
position INTEGER,
duration TEXT,
scraped_at TEXT,
FOREIGN KEY (album_url) REFERENCES albums(url)
);
CREATE TABLE IF NOT EXISTS fan_collections (
username TEXT PRIMARY KEY,
fan_id INTEGER,
display_name TEXT,
location TEXT,
collection_count INTEGER,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS collection_items (
id INTEGER PRIMARY KEY AUTOINCREMENT,
username TEXT,
title TEXT,
artist TEXT,
item_url TEXT,
item_type TEXT,
purchased TEXT,
FOREIGN KEY (username) REFERENCES fan_collections(username)
);
CREATE INDEX IF NOT EXISTS idx_albums_artist ON albums(artist_slug);
CREATE INDEX IF NOT EXISTS idx_tracks_album ON tracks(album_url);
CREATE INDEX IF NOT EXISTS idx_items_username ON collection_items(username);
""")
conn.commit()
return conn
def save_artist(conn: sqlite3.Connection, artist: Dict):
"""Save artist data to database."""
conn.execute(
"""INSERT OR REPLACE INTO artists (slug, name, location, description, url, album_count, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?)""",
(
artist.get("slug"),
artist.get("name"),
artist.get("location"),
artist.get("description"),
artist.get("url"),
len(artist.get("albums", [])),
artist.get("scraped_at"),
)
)
conn.commit()
def save_album(conn: sqlite3.Connection, album: Dict, artist_slug: str):
"""Save album data including tracks."""
price_info = album.get("price", {}) or {}
conn.execute(
"""INSERT OR REPLACE INTO albums
(url, artist_slug, title, release_date, num_tracks, min_price, currency, supporter_count, tags, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
album.get("url"),
artist_slug,
album.get("title"),
album.get("release_date"),
album.get("num_tracks"),
price_info.get("amount"),
price_info.get("currency"),
album.get("supporter_count"),
json.dumps(album.get("tags", [])),
album.get("scraped_at"),
)
)
for track in album.get("tracks", []):
conn.execute(
"""INSERT OR REPLACE INTO tracks (url, album_url, title, position, duration, scraped_at)
VALUES (?, ?, ?, ?, ?, ?)""",
(
track.get("url"),
album.get("url"),
track.get("title"),
track.get("position"),
track.get("duration"),
album.get("scraped_at"),
)
)
conn.commit()
def query_top_albums_by_supporters(conn: sqlite3.Connection, limit: int = 20) -> List[Dict]:
"""Get albums ranked by supporter count."""
rows = conn.execute(
"""SELECT a.title, ar.name, a.supporter_count, a.min_price, a.tags, a.url
FROM albums a
JOIN artists ar ON a.artist_slug = ar.slug
WHERE a.supporter_count IS NOT NULL
ORDER BY a.supporter_count DESC
LIMIT ?""",
(limit,)
).fetchall()
return [
{
"title": r[0],
"artist": r[1],
"supporters": r[2],
"min_price": r[3],
"tags": json.loads(r[4]) if r[4] else [],
"url": r[5],
}
for r in rows
]
Full Production Pipeline
Putting it all together — a complete pipeline that scrapes artist discographies, stores results, and handles failures gracefully:
def run_discography_pipeline(
artist_slugs: List[str],
db_path: str = "bandcamp.db",
proxy: Optional[str] = None,
max_albums_per_artist: int = 50,
):
"""Complete pipeline: artists → albums → tracks → database."""
conn = init_database(db_path)
stats = {"artists": 0, "albums": 0, "tracks": 0, "errors": 0}
for slug in artist_slugs:
print(f"\n[ARTIST] {slug}")
# Check if already scraped recently
existing = conn.execute(
"SELECT scraped_at FROM artists WHERE slug = ?", (slug,)
).fetchone()
if existing:
# Skip if scraped within 7 days
from datetime import datetime, timedelta
scraped = datetime.fromisoformat(existing[0])
if datetime.utcnow() - scraped < timedelta(days=7):
print(f" Skipping — scraped {existing[0][:10]}")
continue
artist = scrape_artist(slug, proxy=proxy)
if not artist:
stats["errors"] += 1
continue
save_artist(conn, artist)
stats["artists"] += 1
print(f" {artist.get('name')} — {len(artist.get('albums', []))} albums")
albums_scraped = 0
for album_info in artist.get("albums", [])[:max_albums_per_artist]:
album_url = album_info.get("url")
if not album_url:
continue
# Ensure absolute URL
if not album_url.startswith("http"):
album_url = f"https://{slug}.bandcamp.com{album_url}"
# Check if already have this album
existing_album = conn.execute(
"SELECT url FROM albums WHERE url = ?", (album_url,)
).fetchone()
if existing_album:
print(f" [SKIP] {album_info.get('title')}")
continue
album = scrape_album(album_url, proxy=proxy)
if album:
save_album(conn, album, slug)
stats["albums"] += 1
stats["tracks"] += len(album.get("tracks", []))
print(f" {album.get('title')} — {len(album.get('tracks', []))} tracks")
albums_scraped += 1
polite_sleep(2.0, 4.0)
polite_sleep(5.0, 10.0) # Longer pause between artists
conn.close()
print(f"\nDone: {stats['artists']} artists, {stats['albums']} albums, {stats['tracks']} tracks, {stats['errors']} errors")
return stats
# Real-world use: scrape a genre's top artists
if __name__ == "__main__":
# Example artists in lo-fi/indie scene
artists = [
"bedepartment",
"godhatesnoone",
"yseultofficial",
"lapfrancesita",
]
# With ThorData proxy (replace with your credentials)
# proxy = "http://YOUR_USER:[email protected]:9000"
proxy = None  # Replace with your ThorData proxy URL (above) for real runs
run_discography_pipeline(artists, proxy=proxy)
Real-World Use Cases
Music Trend Tracker: Monitor which tags are gaining supporters fastest over time. Pull discovery page data weekly, track supporter velocity per album, and surface emerging sounds before they hit mainstream media coverage.
A&R Intelligence Tool: Scrape new releases in target genres, filter by fan engagement (supporters > N, recent purchases visible), and surface artists worth signing before they get too expensive.
Pricing Strategy Research: Collect album prices across genres to understand what fans pay by style. Bandcamp's "name your price" model means you see both minimum prices and what fans actually choose to pay (partially visible from supporter data).
Fan Network Mapping: Scrape fan collections in a niche community to build recommendation graphs. Fans who bought Album A also bought Album B reveals taste clusters that traditional collaborative filtering misses.
Label Portfolio Monitoring: Track all artists on a given label (identifiable by shared domain patterns), monitor their release activity, and alert on sales spikes.
Legal Notes
Bandcamp's Terms of Service prohibit automated access. This code is for educational purposes. Don't use it to replicate Bandcamp's catalog, violate copyright, or harm artists. If you're building something that surfaces Bandcamp data, consider whether there's a way to do it that supports the artists rather than just extracting value from them.
Public pages contain publicly visible data, but scale and intent matter for legal risk. Scrape responsibly, cache aggressively, and don't re-fetch data you already have.
Use ThorData's residential proxy service to distribute requests naturally across different IP addresses and avoid putting load on Bandcamp's servers from a single source — this is both more respectful and more reliable.