Scraping SoundCloud Track Data: Play Counts, Comments, and Followers (2026)
SoundCloud sits in an unusual position for data collection. The platform killed public API registrations years ago — you cannot get new API keys through official channels. But the internal API still powers the website itself, the widget embed API remains open, and structured JSON loads on every page. For anyone building music analytics tools, tracking emerging artists, or researching audio content at scale, SoundCloud exposes rich data behind a relatively thin layer of protection.
Why Scrape SoundCloud?
The use cases are more varied than you might expect:
A&R discovery. Labels and talent scouts track play count velocity, not absolute numbers. An artist going from 500 to 50,000 plays in a week on an unsigned track is a signal. Automated monitoring across thousands of artists catches these moments before they hit music blogs.
Podcast and DJ set analytics. DJ mixes posted to SoundCloud accumulate timestamped comments that pinpoint crowd-favorite moments. Podcast producers use comment density to identify which segments resonated. Neither YouTube nor Spotify exposes this kind of granular engagement data.
Emerging artist monitoring. Distributor A&R teams watch follower growth rates and engagement ratios (comments per play, likes per play) as leading indicators of breakout potential. SoundCloud's listener base skews earlier in the discovery curve than Spotify.
Genre trend analysis. Tracking which tags and genres accumulate plays fastest gives music publishers and sync licensing teams early signals on what's trending before mainstream charts catch up.
Academic research. Musicologists, platform studies researchers, and media sociologists use SoundCloud data for peer-reviewed work on music sharing behavior, fan community formation, and geographic diffusion of musical styles.
Competitive intelligence. Record labels monitor competitor rosters, distributor performance, and unsigned talent in specific cities or genres — all publicly visible on SoundCloud.
What You Can Extract
SoundCloud's internal API exposes more than you'd expect once you examine network traffic:
- Track metadata — title, artist, genre, tags, description, duration, waveform URL, artwork URLs, creation date, license type, download settings
- Engagement stats — play count, like count, repost count, comment count, download count
- Comments — timestamped comments pinned to exact track positions, commenter profiles, reply threads
- User profiles — follower count, following count, track count, playlist count, bio, location, verified status, social links
- Playlists and sets — track ordering, set metadata, cumulative stats across all tracks
- Search results — tracks, users, playlists filtered by genre, duration, license, upload date
- Waveform data — amplitude arrays used for visualization (useful for audio analysis without downloading audio)
- Trending charts — top tracks by genre and region
- Related tracks — SoundCloud's recommendation graph for a given track
Architecture Overview
Every approach in this guide relies on the same underlying mechanism: SoundCloud's internal API at api-v2.soundcloud.com. This is the same API the website uses. It requires a client_id parameter that you extract from the JavaScript bundles on every page load.
The flow for any scraping task is:
1. Extract a valid client_id from SoundCloud's JS bundles
2. Use /resolve to convert a public URL to full structured data
3. Use resource-specific endpoints (/tracks/{id}/comments, /users/{id}/followers, etc.) for deeper data
4. Handle 401s by re-extracting the client ID; handle 429s with exponential backoff
All scripts below use httpx for HTTP requests. Install it with uv pip install httpx.
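The resolve step in the flow above is just a GET request with two query parameters. A minimal sketch of how that request URL is assembled (the client_id here is a placeholder; real values come from the extraction in Part 1):

```python
from urllib.parse import urlencode

API_BASE = "https://api-v2.soundcloud.com"

def build_resolve_url(page_url: str, client_id: str) -> str:
    """Build the /resolve request that turns a public URL into JSON."""
    query = urlencode({"url": page_url, "client_id": client_id})
    return f"{API_BASE}/resolve?{query}"

print(build_resolve_url(
    "https://soundcloud.com/disclosure/latch-feat-sam-smith",
    "EXAMPLE_CLIENT_ID",  # placeholder, not a real ID
))
```

Every endpoint in this guide follows the same shape: a plain GET with client_id appended as a query parameter.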
Part 1: Client ID Extraction
The client ID is a 32-character alphanumeric string embedded in SoundCloud's JavaScript bundles. It rotates periodically — roughly every few days — so your scraper needs to extract it dynamically rather than hardcode it.
# client_id.py
import httpx
import re
import json
import time
from pathlib import Path
from typing import Optional

CACHE_FILE = Path(".soundcloud_client_id.json")
CACHE_TTL = 3600  # 1 hour

def _load_cached_client_id() -> Optional[str]:
    """Load client ID from cache if still fresh."""
    if not CACHE_FILE.exists():
        return None
    data = json.loads(CACHE_FILE.read_text())
    if time.time() - data.get("timestamp", 0) > CACHE_TTL:
        return None
    return data.get("client_id")

def _save_client_id(client_id: str) -> None:
    """Persist client ID with timestamp."""
    CACHE_FILE.write_text(json.dumps({
        "client_id": client_id,
        "timestamp": time.time(),
    }))

def extract_client_id(force_refresh: bool = False) -> str:
    """
    Extract a valid client_id from SoundCloud's JavaScript bundles.

    Tries cached value first. On cache miss or force_refresh, fetches
    the SoundCloud homepage, finds JS bundle URLs, and scans each for
    the client_id pattern. Falls back through multiple bundles.
    """
    if not force_refresh:
        cached = _load_cached_client_id()
        if cached:
            return cached

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }
    with httpx.Client(headers=headers, follow_redirects=True, timeout=20) as client:
        # Fetch homepage to find JS bundle URLs
        resp = client.get("https://soundcloud.com")
        resp.raise_for_status()

        # Multiple patterns for bundle URLs
        script_patterns = [
            r'src="(https://a-v2\.sndcdn\.com/assets/[^"]+\.js)"',
            r'src="(https://[^"]*sndcdn\.com[^"]*\.js)"',
        ]
        script_urls: list[str] = []
        for pattern in script_patterns:
            script_urls.extend(re.findall(pattern, resp.text))

        # Remove duplicates, keep last few bundles (client_id is in app bundles)
        seen: set[str] = set()
        unique_urls = []
        for url in reversed(script_urls):
            if url not in seen:
                seen.add(url)
                unique_urls.append(url)

        # Scan each bundle for client_id
        client_id_patterns = [
            r'client_id:"([a-zA-Z0-9]{32})"',
            r'"client_id","([a-zA-Z0-9]{32})"',
            r'client_id=([a-zA-Z0-9]{32})',
        ]
        for script_url in unique_urls[:8]:
            try:
                js_resp = client.get(script_url, timeout=15)
                js_resp.raise_for_status()
                for pattern in client_id_patterns:
                    match = re.search(pattern, js_resp.text)
                    if match:
                        client_id = match.group(1)
                        _save_client_id(client_id)
                        return client_id
            except httpx.HTTPError:
                continue

    raise RuntimeError(
        "Could not extract client_id from any SoundCloud JS bundle. "
        "The extraction patterns may need updating."
    )

def validate_client_id(client_id: str) -> bool:
    """Test if a client_id is still valid by making a lightweight API call."""
    try:
        resp = httpx.get(
            "https://api-v2.soundcloud.com/resolve",
            params={"url": "https://soundcloud.com", "client_id": client_id},
            timeout=10,
        )
        return resp.status_code != 401
    except httpx.HTTPError:
        return False

if __name__ == "__main__":
    cid = extract_client_id(force_refresh=True)
    print(f"Extracted client_id: {cid[:8]}...{cid[-4:]}")
    print(f"Valid: {validate_client_id(cid)}")
Output:
Extracted client_id: iZIs9mch...AbCd
Valid: True
Part 2: Track Metadata Scraper
# track_scraper.py
import httpx
import time
import json
from dataclasses import dataclass, asdict, field
from typing import Optional

from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://soundcloud.com/",
    "Origin": "https://soundcloud.com",
}

@dataclass
class ArtworkUrls:
    small: str = ""      # 32x32
    medium: str = ""     # 100x100
    large: str = ""      # 300x300
    t500x500: str = ""   # 500x500

@dataclass
class TrackData:
    track_id: int
    title: str
    artist: str
    artist_id: int
    permalink_url: str
    play_count: int
    like_count: int
    comment_count: int
    repost_count: int
    download_count: int
    duration_ms: int
    genre: str
    tag_list: str
    description: str
    created_at: str
    last_modified: str
    license: str
    waveform_url: str
    artwork: ArtworkUrls = field(default_factory=ArtworkUrls)
    downloadable: bool = False
    streamable: bool = True
    embeddable_by: str = "all"
    purchase_url: Optional[str] = None
    label_name: Optional[str] = None
    bpm: Optional[int] = None
    key_signature: Optional[str] = None
    isrc: Optional[str] = None

def _parse_artwork(raw_url: Optional[str]) -> ArtworkUrls:
    """Generate artwork URLs at multiple sizes from the base artwork URL."""
    if not raw_url:
        return ArtworkUrls()
    base = raw_url.replace("large", "{size}")
    return ArtworkUrls(
        small=base.replace("{size}", "small"),
        medium=base.replace("{size}", "t100x100"),
        large=base.replace("{size}", "t300x300"),
        t500x500=base.replace("{size}", "t500x500"),
    )

def _parse_track(data: dict) -> TrackData:
    """Parse raw API response dict into a TrackData dataclass."""
    user = data.get("user", {})
    return TrackData(
        track_id=data["id"],
        title=data.get("title", ""),
        artist=user.get("username", ""),
        artist_id=user.get("id", 0),
        permalink_url=data.get("permalink_url", ""),
        play_count=data.get("playback_count", 0),
        like_count=data.get("likes_count", 0),
        comment_count=data.get("comment_count", 0),
        repost_count=data.get("reposts_count", 0),
        download_count=data.get("download_count", 0),
        duration_ms=data.get("duration", 0),
        genre=data.get("genre", ""),
        tag_list=data.get("tag_list", ""),
        description=data.get("description", ""),
        created_at=data.get("created_at", ""),
        last_modified=data.get("last_modified", ""),
        license=data.get("license", ""),
        waveform_url=data.get("waveform_url", ""),
        artwork=_parse_artwork(data.get("artwork_url")),
        downloadable=data.get("downloadable", False),
        streamable=data.get("streamable", True),
        embeddable_by=data.get("embeddable_by", "all"),
        purchase_url=data.get("purchase_url"),
        label_name=data.get("label_name"),
        bpm=data.get("bpm"),
        key_signature=data.get("key_signature"),
        isrc=data.get("isrc"),
    )

def resolve_url(url: str, client_id: str, client: httpx.Client) -> dict:
    """Resolve any SoundCloud URL to its full API data object."""
    resp = client.get(
        f"{BASE_URL}/resolve",
        params={"url": url, "client_id": client_id},
    )
    if resp.status_code == 401:
        raise PermissionError("client_id expired — refresh required")
    if resp.status_code == 429:
        raise ConnectionError("Rate limited (429)")
    resp.raise_for_status()
    return resp.json()

def get_track(url: str, client_id: str, client: httpx.Client) -> TrackData:
    """Fetch full track metadata for a SoundCloud track URL."""
    data = resolve_url(url, client_id, client)
    if data.get("kind") != "track":
        raise ValueError(f"URL resolved to {data.get('kind')}, not a track")
    return _parse_track(data)

def scrape_tracks(
    urls: list[str],
    delay: float = 1.5,
    proxy_url: Optional[str] = None,
) -> list[dict]:
    """
    Scrape metadata for a list of SoundCloud track URLs.

    Automatically re-extracts client_id on 401 errors.
    Applies exponential backoff on 429 rate-limit responses.
    """
    client_id = extract_client_id()
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
    results = []
    with httpx.Client(
        headers=HEADERS,
        transport=transport,
        timeout=20,
        follow_redirects=True,
    ) as client:
        for i, url in enumerate(urls):
            retries = 0
            while retries < 5:
                try:
                    track = get_track(url, client_id, client)
                    results.append(asdict(track))
                    print(
                        f"[{i+1}/{len(urls)}] {track.title} — "
                        f"{track.play_count:,} plays, "
                        f"{track.like_count:,} likes"
                    )
                    break
                except PermissionError:
                    print("  client_id expired, refreshing...")
                    client_id = extract_client_id(force_refresh=True)
                    retries += 1
                except ConnectionError:
                    wait = 2 ** retries * 5
                    print(f"  Rate limited. Waiting {wait}s...")
                    time.sleep(wait)
                    retries += 1
                except Exception as e:
                    print(f"  Error scraping {url}: {e}")
                    break
            time.sleep(delay)
    return results

if __name__ == "__main__":
    urls = [
        "https://soundcloud.com/disclosure/latch-feat-sam-smith",
        "https://soundcloud.com/flume/never-be-like-you-feat-kai",
    ]
    tracks = scrape_tracks(urls)
    print(json.dumps(tracks[0], indent=2, default=str))
Example output:
{
  "track_id": 114688288,
  "title": "Latch feat. Sam Smith",
  "artist": "Disclosure",
  "artist_id": 23489344,
  "permalink_url": "https://soundcloud.com/disclosure/latch-feat-sam-smith",
  "play_count": 18432901,
  "like_count": 142300,
  "comment_count": 3847,
  "repost_count": 28900,
  "download_count": 0,
  "duration_ms": 237800,
  "genre": "Electronic",
  "tag_list": "\"UK House\" deep house electronic",
  "description": "",
  "created_at": "2013-08-29T11:22:14Z",
  "license": "all-rights-reserved",
  "waveform_url": "https://wave.sndcdn.com/aBcDeFgH_m.json",
  "artwork": {
    "small": "https://i1.sndcdn.com/artworks-xxx-small.jpg",
    "medium": "https://i1.sndcdn.com/artworks-xxx-t100x100.jpg",
    "large": "https://i1.sndcdn.com/artworks-xxx-t300x300.jpg",
    "t500x500": "https://i1.sndcdn.com/artworks-xxx-t500x500.jpg"
  },
  "downloadable": false,
  "bpm": null,
  "isrc": null
}
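With TrackData in hand, the engagement ratios mentioned in the use cases (likes per play, comments per play) are one division away. A small helper, run on the figures from the example output above:

```python
def engagement_ratios(play_count: int, like_count: int, comment_count: int) -> dict[str, float]:
    """Compute likes-per-play and comments-per-play, guarding against zero plays."""
    if play_count <= 0:
        return {"likes_per_play": 0.0, "comments_per_play": 0.0}
    return {
        "likes_per_play": like_count / play_count,
        "comments_per_play": comment_count / play_count,
    }

# Numbers from the example output above
ratios = engagement_ratios(18_432_901, 142_300, 3_847)
print(f"{ratios['likes_per_play']:.4%} likes/play")  # roughly 0.77%
```

Absolute ratios vary wildly by genre and account age, so these are most useful compared across tracks by the same artist or within the same genre.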
Part 3: User Profile Scraper
# user_scraper.py
import httpx
import time
from dataclasses import dataclass, asdict
from typing import Optional

from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
    "Accept-Language": "en-US,en;q=0.9",
}

@dataclass
class SocialLinks:
    youtube_channel_url: Optional[str] = None
    facebook_page: Optional[str] = None
    instagram_username: Optional[str] = None
    twitter_handle: Optional[str] = None
    website_url: Optional[str] = None
    website_title: Optional[str] = None

@dataclass
class UserProfile:
    user_id: int
    username: str
    permalink_url: str
    display_name: str
    followers_count: int
    followings_count: int
    track_count: int
    playlist_count: int
    likes_count: int
    reposts_count: int
    description: str
    city: str
    country_code: str
    verified: bool
    avatar_url: str
    created_at: str
    last_modified: str
    social: SocialLinks

def _parse_user(data: dict) -> UserProfile:
    social = SocialLinks(
        youtube_channel_url=data.get("youtube_channel_url"),
        facebook_page=data.get("facebook_page"),
        instagram_username=data.get("instagram_username"),
        twitter_handle=None,  # removed from API but kept for schema
        website_url=data.get("website_url"),
        website_title=data.get("website_title"),
    )
    return UserProfile(
        user_id=data["id"],
        username=data.get("username", ""),
        permalink_url=data.get("permalink_url", ""),
        display_name=data.get("full_name") or data.get("username", ""),
        followers_count=data.get("followers_count", 0),
        followings_count=data.get("followings_count", 0),
        track_count=data.get("track_count", 0),
        playlist_count=data.get("playlist_count", 0),
        likes_count=data.get("public_favorites_count", 0),
        reposts_count=data.get("reposts_count", 0),
        description=data.get("description", ""),
        city=data.get("city", ""),
        country_code=data.get("country_code", ""),
        verified=data.get("verified", False),
        avatar_url=data.get("avatar_url", ""),
        created_at=data.get("created_at", ""),
        last_modified=data.get("last_modified", ""),
        social=social,
    )

def get_user_profile(profile_url: str, client_id: str) -> UserProfile:
    """Resolve a SoundCloud profile URL to a full UserProfile."""
    with httpx.Client(headers=HEADERS, timeout=20, follow_redirects=True) as client:
        resp = client.get(
            f"{BASE_URL}/resolve",
            params={"url": profile_url, "client_id": client_id},
        )
        if resp.status_code == 401:
            raise PermissionError("client_id expired")
        resp.raise_for_status()
        data = resp.json()
    if data.get("kind") != "user":
        raise ValueError(f"URL is a {data.get('kind')}, not a user")
    return _parse_user(data)

def get_user_by_id(user_id: int, client_id: str) -> UserProfile:
    """Fetch user profile by numeric user ID."""
    with httpx.Client(headers=HEADERS, timeout=20) as client:
        resp = client.get(
            f"{BASE_URL}/users/{user_id}",
            params={"client_id": client_id},
        )
        resp.raise_for_status()
        return _parse_user(resp.json())

def get_user_tracks(
    user_id: int,
    client_id: str,
    limit: int = 50,
) -> list[dict]:
    """Fetch a user's uploaded tracks (up to limit)."""
    tracks = []
    next_href: Optional[str] = None
    fetched = 0
    with httpx.Client(headers=HEADERS, timeout=20) as client:
        url = f"{BASE_URL}/users/{user_id}/tracks"
        params = {
            "client_id": client_id,
            "limit": min(limit, 50),
            "representation": "full",
        }
        while fetched < limit:
            # next_href does not carry the client_id, so re-append it
            resp = client.get(
                next_href or url,
                params={"client_id": client_id} if next_href else params,
            )
            if resp.status_code == 429:
                time.sleep(30)
                continue
            resp.raise_for_status()
            data = resp.json()
            collection = data.get("collection", [])
            if not collection:
                break
            tracks.extend(collection)
            fetched += len(collection)
            next_href = data.get("next_href")
            if not next_href or fetched >= limit:
                break
            time.sleep(1.0)
    return tracks[:limit]

if __name__ == "__main__":
    import json

    client_id = extract_client_id()
    profile = get_user_profile("https://soundcloud.com/disclosure", client_id)
    print(json.dumps(asdict(profile), indent=2))
Example output:
{
  "user_id": 23489344,
  "username": "disclosure",
  "permalink_url": "https://soundcloud.com/disclosure",
  "display_name": "Disclosure",
  "followers_count": 1284900,
  "followings_count": 312,
  "track_count": 87,
  "playlist_count": 14,
  "likes_count": 2340,
  "reposts_count": 892,
  "description": "Disclosure is Guy and Howard Lawrence...",
  "city": "London",
  "country_code": "GB",
  "verified": true,
  "avatar_url": "https://i1.sndcdn.com/avatars-xxx-large.jpg",
  "created_at": "2011-10-14T09:32:11Z",
  "social": {
    "youtube_channel_url": "https://youtube.com/disclosure",
    "facebook_page": "https://facebook.com/disclosuremusic",
    "instagram_username": "disclosure",
    "website_url": "https://disclosuremusic.com"
  }
}
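For the A&R-style monitoring described earlier, a single profile snapshot matters less than the trend between snapshots. A sketch of a growth-rate calculation over periodic followers_count captures (the dates and counts below are invented for illustration):

```python
from datetime import date

def follower_growth_per_day(snapshots: list[tuple[date, int]]) -> float:
    """Average daily follower gain between the first and last snapshot."""
    (d0, f0), (d1, f1) = snapshots[0], snapshots[-1]
    days = max((d1 - d0).days, 1)  # avoid division by zero on same-day snapshots
    return (f1 - f0) / days

# Hypothetical weekly captures of one profile's followers_count
snaps = [(date(2026, 3, 1), 1_280_000), (date(2026, 3, 8), 1_284_900)]
print(f"{follower_growth_per_day(snaps):.0f} followers/day")  # 700 followers/day
```

Store each UserProfile with a capture timestamp and this becomes a one-line query over your own database.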
Part 4: Comment Extraction
SoundCloud's timestamped comments are uniquely valuable — each comment is anchored to a specific millisecond position in the track. Density spikes reveal crowd-favorite moments.
# comment_scraper.py
import httpx
import time
from dataclasses import dataclass, asdict
from typing import Optional, Iterator

from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
}

@dataclass
class CommentAuthor:
    user_id: int
    username: str
    avatar_url: str
    verified: bool
    followers_count: int

@dataclass
class Comment:
    comment_id: int
    track_id: int
    body: str
    timestamp_ms: Optional[int]  # position in track (None = general comment)
    created_at: str
    author: CommentAuthor
    self_deletable: bool = False

def _parse_comment(raw: dict, track_id: int) -> Comment:
    user = raw.get("user", {})
    return Comment(
        comment_id=raw["id"],
        track_id=track_id,
        body=raw.get("body", ""),
        timestamp_ms=raw.get("timestamp"),
        created_at=raw.get("created_at", ""),
        author=CommentAuthor(
            user_id=user.get("id", 0),
            username=user.get("username", ""),
            avatar_url=user.get("avatar_url", ""),
            verified=user.get("verified", False),
            followers_count=user.get("followers_count", 0),
        ),
    )

def iter_comments(
    track_id: int,
    client_id: str,
    batch_size: int = 50,
    max_comments: int = 1000,
) -> Iterator[Comment]:
    """
    Yield comments for a track, paginated.

    Comments are returned newest-first by default. The threaded=1
    parameter includes reply threads.
    """
    fetched = 0
    next_href: Optional[str] = None
    with httpx.Client(headers=HEADERS, timeout=20) as client:
        base_url = f"{BASE_URL}/tracks/{track_id}/comments"
        params = {
            "client_id": client_id,
            "limit": batch_size,
            "threaded": 1,
            "filter_replies": 0,
        }
        while fetched < max_comments:
            url = next_href or base_url
            # next_href omits the client_id, so re-append it on follow-up pages
            request_params = {"client_id": client_id} if next_href else params
            resp = client.get(url, params=request_params)
            if resp.status_code == 429:
                wait = 30
                print(f"  Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            if resp.status_code == 401:
                raise PermissionError("client_id expired")
            resp.raise_for_status()
            data = resp.json()
            collection = data.get("collection", [])
            if not collection:
                break
            for raw in collection:
                yield _parse_comment(raw, track_id)
                fetched += 1
                if fetched >= max_comments:
                    return
            next_href = data.get("next_href")
            if not next_href:
                break
            time.sleep(0.8)

def get_all_comments(
    track_id: int,
    client_id: str,
    max_comments: int = 500,
) -> list[dict]:
    """Collect all comments for a track into a list."""
    return [asdict(c) for c in iter_comments(track_id, client_id, max_comments=max_comments)]

def analyze_comment_density(
    comments: list[dict],
    track_duration_ms: int,
    buckets: int = 20,
) -> list[dict]:
    """
    Divide track into N time buckets and count comments per bucket.

    Returns a heatmap useful for identifying crowd-favorite moments.
    """
    bucket_size = track_duration_ms / buckets
    counts = [0] * buckets
    for comment in comments:
        ts = comment.get("timestamp_ms")
        if ts is not None and ts < track_duration_ms:
            bucket = min(int(ts / bucket_size), buckets - 1)
            counts[bucket] += 1
    result = []
    for i, count in enumerate(counts):
        start_ms = int(i * bucket_size)
        end_ms = int((i + 1) * bucket_size)
        result.append({
            "bucket": i,
            "start_ms": start_ms,
            "end_ms": end_ms,
            "start_formatted": _ms_to_mmss(start_ms),
            "end_formatted": _ms_to_mmss(end_ms),
            "comment_count": count,
        })
    return result

def _ms_to_mmss(ms: int) -> str:
    seconds = ms // 1000
    return f"{seconds // 60}:{seconds % 60:02d}"

if __name__ == "__main__":
    import json

    client_id = extract_client_id()
    track_id = 114688288  # replace with real track ID
    comments = get_all_comments(track_id, client_id, max_comments=200)
    print(f"Collected {len(comments)} comments")
    if comments:
        print(json.dumps(comments[0], indent=2))
Example comment output:
{
  "comment_id": 1847293847,
  "track_id": 114688288,
  "body": "this drop never gets old 🔥",
  "timestamp_ms": 142300,
  "created_at": "2024-11-15T18:42:33Z",
  "author": {
    "user_id": 9823741,
    "username": "beatmaker_uk",
    "avatar_url": "https://i1.sndcdn.com/avatars-xxx-large.jpg",
    "verified": false,
    "followers_count": 234
  }
}
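To see the bucketing logic behind analyze_comment_density in action without hitting the API, here is the same computation run standalone on synthetic comments clustered around a hypothetical drop (timestamps invented for illustration):

```python
# Six fake comments on a 237.8-second track; three cluster near 2:20
comments = [{"timestamp_ms": ms} for ms in
            (140_000, 141_000, 142_000, 30_000, 200_000, None)]
duration_ms = 237_800
buckets = 20
bucket_size = duration_ms / buckets  # 11,890 ms per bucket

counts = [0] * buckets
for c in comments:
    ts = c["timestamp_ms"]
    if ts is not None and ts < duration_ms:  # skip general (untimed) comments
        counts[min(int(ts / bucket_size), buckets - 1)] += 1

peak = max(range(buckets), key=counts.__getitem__)
start_s = int(peak * bucket_size) // 1000
print(f"Hottest bucket: #{peak}, starts at {start_s // 60}:{start_s % 60:02d}, "
      f"{counts[peak]} comments")  # bucket #11, starting around 2:10
```

The three clustered timestamps land in the same bucket, which is exactly the spike a density heatmap surfaces on real DJ sets.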
Part 5: Playlist and Set Scraper
# playlist_scraper.py
import httpx
import time
from dataclasses import dataclass, asdict, field
from typing import Optional

from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
}

@dataclass
class PlaylistTrack:
    track_id: int
    title: str
    artist: str
    duration_ms: int
    play_count: int
    like_count: int
    permalink_url: str
    position: int

@dataclass
class Playlist:
    playlist_id: int
    title: str
    creator: str
    creator_id: int
    permalink_url: str
    track_count: int
    duration_ms: int
    likes_count: int
    reposts_count: int
    description: str
    genre: str
    tag_list: str
    created_at: str
    is_album: bool
    tracks: list[PlaylistTrack] = field(default_factory=list)

    @property
    def total_plays(self) -> int:
        return sum(t.play_count for t in self.tracks)

    @property
    def total_likes(self) -> int:
        return sum(t.like_count for t in self.tracks)

def _parse_playlist_track(data: dict, position: int) -> PlaylistTrack:
    user = data.get("user", {})
    return PlaylistTrack(
        track_id=data["id"],
        title=data.get("title", ""),
        artist=user.get("username", ""),
        duration_ms=data.get("duration", 0),
        play_count=data.get("playback_count", 0),
        like_count=data.get("likes_count", 0),
        permalink_url=data.get("permalink_url", ""),
        position=position,
    )

def get_playlist(playlist_url: str, client_id: str) -> Playlist:
    """
    Fetch a SoundCloud playlist or album with all track metadata.

    For large playlists, the API may return track stubs (ID only).
    We detect these and fetch the missing tracks in batches.
    """
    with httpx.Client(headers=HEADERS, timeout=30, follow_redirects=True) as client:
        # Resolve playlist URL
        resp = client.get(
            f"{BASE_URL}/resolve",
            params={"url": playlist_url, "client_id": client_id},
        )
        resp.raise_for_status()
        data = resp.json()
        if data.get("kind") not in ("playlist",):
            raise ValueError(f"URL resolved to {data.get('kind')}, expected playlist")

        user = data.get("user", {})
        playlist = Playlist(
            playlist_id=data["id"],
            title=data.get("title", ""),
            creator=user.get("username", ""),
            creator_id=user.get("id", 0),
            permalink_url=data.get("permalink_url", ""),
            track_count=data.get("track_count", 0),
            duration_ms=data.get("duration", 0),
            likes_count=data.get("likes_count", 0),
            reposts_count=data.get("reposts_count", 0),
            description=data.get("description", ""),
            genre=data.get("genre", ""),
            tag_list=data.get("tag_list", ""),
            created_at=data.get("created_at", ""),
            is_album=data.get("is_album", False),
        )

        # Parse tracks — some may be stubs with only id
        raw_tracks = data.get("tracks", [])
        stub_ids = []
        for i, raw_track in enumerate(raw_tracks):
            if raw_track.get("title"):  # full track data
                playlist.tracks.append(_parse_playlist_track(raw_track, i + 1))
            else:
                stub_ids.append((i + 1, raw_track["id"]))

        # Fetch stubs in batches of 50 (the /tracks endpoint caps ids per request)
        if stub_ids:
            print(f"  Fetching {len(stub_ids)} stub tracks...")
            stub_id_map = {sid: pos for pos, sid in stub_ids}
            for batch_start in range(0, len(stub_ids), 50):
                batch = stub_ids[batch_start:batch_start + 50]
                ids_str = ",".join(str(sid) for _, sid in batch)
                resp2 = client.get(
                    f"{BASE_URL}/tracks",
                    params={"ids": ids_str, "client_id": client_id},
                )
                resp2.raise_for_status()
                for raw_track in resp2.json():
                    pos = stub_id_map.get(raw_track["id"], 0)
                    playlist.tracks.append(_parse_playlist_track(raw_track, pos))
                time.sleep(0.5)

    # Sort by position
    playlist.tracks.sort(key=lambda t: t.position)
    return playlist

if __name__ == "__main__":
    import json

    client_id = extract_client_id()
    playlist = get_playlist(
        "https://soundcloud.com/disclosure/sets/the-face-2023",
        client_id,
    )
    print(f"Playlist: {playlist.title}")
    print(f"Tracks: {playlist.track_count}")
    print(f"Total plays across all tracks: {playlist.total_plays:,}")
    print(f"Total likes across all tracks: {playlist.total_likes:,}")
    for track in playlist.tracks:
        print(f"  {track.position}. {track.title} — {track.play_count:,} plays")
Part 6: Search Functionality
# search.py
import httpx
import time
from dataclasses import dataclass, asdict
from typing import Optional, Literal

from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
}

SearchKind = Literal["tracks", "users", "playlists", "albums"]

@dataclass
class SearchFilter:
    genre: Optional[str] = None          # e.g. "electronic", "hip-hop"
    duration_from: Optional[int] = None  # seconds
    duration_to: Optional[int] = None    # seconds
    license: Optional[str] = None        # "cc-by", "cc-by-nc", etc.
    created_at: Optional[str] = None     # e.g. "last_week", "last_month", "last_year"
    bpm_from: Optional[int] = None
    bpm_to: Optional[int] = None

def search(
    query: str,
    kind: SearchKind = "tracks",
    limit: int = 50,
    filters: Optional[SearchFilter] = None,
    client_id: Optional[str] = None,
) -> list[dict]:
    """
    Search SoundCloud for tracks, users, or playlists.

    Returns raw API dicts for flexibility. Pass to parse functions
    from track_scraper.py or user_scraper.py for typed results.
    """
    if client_id is None:
        client_id = extract_client_id()
    params: dict = {
        "q": query,
        "client_id": client_id,
        "limit": min(limit, 50),
    }
    if filters:
        if filters.genre:
            params["filter.genre_or_genre_other"] = filters.genre
        if filters.duration_from:
            params["filter.duration.from"] = filters.duration_from * 1000  # API uses ms
        if filters.duration_to:
            params["filter.duration.to"] = filters.duration_to * 1000
        if filters.license:
            params["filter.license"] = filters.license
        if filters.created_at:
            params["filter.created_at"] = filters.created_at
        if filters.bpm_from:
            params["filter.bpm.from"] = filters.bpm_from
        if filters.bpm_to:
            params["filter.bpm.to"] = filters.bpm_to

    results = []
    next_href: Optional[str] = None
    fetched = 0
    with httpx.Client(headers=HEADERS, timeout=20) as client:
        url = f"{BASE_URL}/search/{kind}"
        while fetched < limit:
            # next_href omits the client_id, so re-append it
            resp = client.get(
                next_href or url,
                params={"client_id": client_id} if next_href else params,
            )
            if resp.status_code == 429:
                time.sleep(30)
                continue
            if resp.status_code == 401:
                raise PermissionError("client_id expired")
            resp.raise_for_status()
            data = resp.json()
            collection = data.get("collection", [])
            if not collection:
                break
            results.extend(collection)
            fetched += len(collection)
            next_href = data.get("next_href")
            if not next_href or fetched >= limit:
                break
            time.sleep(1.0)
    return results[:limit]

def search_tracks_by_genre(
    genre: str,
    limit: int = 100,
    client_id: Optional[str] = None,
) -> list[dict]:
    """Convenience wrapper for genre-based track search."""
    return search(
        query=genre,
        kind="tracks",
        limit=limit,
        filters=SearchFilter(genre=genre),
        client_id=client_id,
    )

if __name__ == "__main__":
    import json

    client_id = extract_client_id()
    # Search for lo-fi hip hop tracks uploaded in the last month
    results = search(
        query="lo-fi hip hop",
        kind="tracks",
        limit=20,
        filters=SearchFilter(genre="hip-hop", created_at="last_month"),
        client_id=client_id,
    )
    print(f"Found {len(results)} tracks")
    for r in results[:3]:
        print(f"  {r['title']} by {r['user']['username']} — {r.get('playback_count', 0):,} plays")
Example search output:
[
  {
    "title": "Late Night Study Session",
    "user": {"username": "chillhop_beats"},
    "playback_count": 48200,
    "likes_count": 1840,
    "genre": "Hip-hop & Rap",
    "tag_list": "lofi chill study beats",
    "created_at": "2026-03-15T12:00:00Z",
    "permalink_url": "https://soundcloud.com/chillhop_beats/late-night-study"
  }
]
Part 7: Waveform Data Analysis
Each SoundCloud track includes a waveform_url pointing to a JSON file with amplitude data. This lets you analyze audio characteristics without downloading the actual audio file.
# waveform.py
import httpx
import json
import math
import statistics
from dataclasses import dataclass
from typing import Optional
@dataclass
class WaveformAnalysis:
track_id: int
sample_count: int
peak_amplitude: float
rms_amplitude: float
dynamic_range: float # peak / RMS ratio
quiet_sections: list[tuple[float, float]] # (start_pct, end_pct)
loud_sections: list[tuple[float, float]]
average_amplitude: float
silence_threshold: float
def fetch_waveform(waveform_url: str) -> dict:
"""Download waveform JSON from SoundCloud's CDN."""
resp = httpx.get(waveform_url, timeout=15)
resp.raise_for_status()
return resp.json()
def analyze_waveform(
track_id: int,
waveform_url: str,
quiet_threshold_pct: float = 0.2,
loud_threshold_pct: float = 0.8,
) -> WaveformAnalysis:
"""
Fetch and analyze a track's waveform data.
Returns amplitude statistics and identifies quiet/loud sections.
Samples are normalized 0-1 relative to the waveform's max value.
"""
raw = fetch_waveform(waveform_url)
samples: list[int] = raw.get("samples", [])
if not samples:
raise ValueError("No waveform samples in response")
max_val = raw.get("height", 255)
normalized = [s / max_val for s in samples]
peak = max(normalized)
avg = statistics.mean(normalized)
rms = math.sqrt(statistics.mean(s ** 2 for s in normalized))
dynamic_range = peak / rms if rms > 0 else 0
quiet_thresh = quiet_threshold_pct
loud_thresh = loud_threshold_pct
# Find contiguous quiet and loud sections
def find_sections(values: list[float], condition) -> list[tuple[float, float]]:
sections = []
in_section = False
start_idx = 0
n = len(values)
for i, v in enumerate(values):
if condition(v) and not in_section:
in_section = True
start_idx = i
elif not condition(v) and in_section:
in_section = False
sections.append((start_idx / n, i / n))
if in_section:
sections.append((start_idx / n, 1.0))
return sections
quiet_sections = find_sections(normalized, lambda v: v < quiet_thresh)
loud_sections = find_sections(normalized, lambda v: v > loud_thresh)
return WaveformAnalysis(
track_id=track_id,
sample_count=len(samples),
peak_amplitude=round(peak, 4),
rms_amplitude=round(rms, 4),
dynamic_range=round(dynamic_range, 4),
quiet_sections=quiet_sections,
loud_sections=loud_sections,
average_amplitude=round(avg, 4),
silence_threshold=quiet_thresh,
)
if __name__ == "__main__":
from dataclasses import asdict
from client_id import extract_client_id
import httpx
client_id = extract_client_id()
# Get waveform URL from a track
resp = httpx.get(
"https://api-v2.soundcloud.com/resolve",
params={
"url": "https://soundcloud.com/disclosure/latch-feat-sam-smith",
"client_id": client_id,
},
timeout=15,
)
track_data = resp.json()
waveform_url = track_data["waveform_url"]
track_id = track_data["id"]
analysis = analyze_waveform(track_id, waveform_url)
print(f"Samples: {analysis.sample_count}")
print(f"Peak amplitude: {analysis.peak_amplitude}")
print(f"RMS amplitude: {analysis.rms_amplitude}")
print(f"Dynamic range: {analysis.dynamic_range:.2f}x")
print(f"Loud sections: {len(analysis.loud_sections)} drops/peaks detected")
for start, end in analysis.loud_sections[:5]:
print(f" {start*100:.1f}% — {end*100:.1f}%")
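The (start, end) pairs are fractions of track length, which aren't very readable on their own. A small helper (not part of the script above, but assuming the duration field from track metadata, which SoundCloud reports in milliseconds) converts them into mm:ss timestamps:

```python
def sections_to_timestamps(
    sections: list[tuple[float, float]],
    duration_ms: int,
) -> list[tuple[str, str]]:
    """Convert normalized (start, end) fractions from analyze_waveform
    into mm:ss timestamps for a track of the given duration."""
    def fmt(frac: float) -> str:
        total_s = int(frac * duration_ms / 1000)
        return f"{total_s // 60}:{total_s % 60:02d}"
    return [(fmt(start), fmt(end)) for start, end in sections]
```

For a 3-minute track, a loud section spanning (0.5, 0.75) maps to 1:30 through 2:15.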
Part 8: Trending Charts Scraper
# charts.py
import httpx
import time
from dataclasses import dataclass, asdict
from typing import Optional, Literal
from client_id import extract_client_id
BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/123.0.0.0 Safari/537.36"
),
"Referer": "https://soundcloud.com/",
}
ChartKind = Literal["trending", "top"]
# Genre slugs SoundCloud accepts
GENRES = [
"all-music", "alternativerock", "ambient", "classical", "country",
"danceedm", "dancehall", "deephouse", "disco", "drumbass",
"dubstep", "electronic", "folksingersongwriter", "hiphoprap",
"house", "indie", "jazzblues", "latin", "metal", "piano",
"pop", "rbsoul", "reggae", "reggaeton", "rock", "soundtrack",
"techno", "trance", "trap", "triphop", "world",
]
@dataclass
class ChartEntry:
rank: int
track_id: int
title: str
artist: str
artist_id: int
play_count: int
like_count: int
comment_count: int
genre: str
permalink_url: str
created_at: str
score: Optional[float] = None # trend score when available
def get_charts(
kind: ChartKind = "trending",
genre: str = "all-music",
limit: int = 50,
client_id: Optional[str] = None,
) -> list[ChartEntry]:
"""
Fetch SoundCloud trending or top tracks chart.
kind="trending" — rising tracks (velocity-based)
kind="top" — all-time most played in genre
"""
if client_id is None:
client_id = extract_client_id()
params = {
"client_id": client_id,
"kind": kind,
"genre": f"soundcloud:genres:{genre}",
"limit": min(limit, 50),
"offset": 0,
"linked_partitioning": 1,
}
results = []
with httpx.Client(headers=HEADERS, timeout=20) as client:
resp = client.get(f"{BASE_URL}/charts", params=params)
if resp.status_code == 429:
raise ConnectionError("Rate limited")
if resp.status_code == 401:
raise PermissionError("client_id expired")
resp.raise_for_status()
data = resp.json()
collection = data.get("collection", [])
for i, entry in enumerate(collection):
track = entry.get("track", {})
if not track:
continue
user = track.get("user", {})
results.append(ChartEntry(
rank=i + 1,
track_id=track.get("id", 0),
title=track.get("title", ""),
artist=user.get("username", ""),
artist_id=user.get("id", 0),
play_count=track.get("playback_count", 0),
like_count=track.get("likes_count", 0),
comment_count=track.get("comment_count", 0),
genre=track.get("genre", ""),
permalink_url=track.get("permalink_url", ""),
created_at=track.get("created_at", ""),
score=entry.get("score"),
))
return results[:limit]
if __name__ == "__main__":
client_id = extract_client_id()
chart = get_charts(kind="trending", genre="hiphoprap", limit=20, client_id=client_id)
print(f"Top {len(chart)} trending hip-hop tracks:")
for entry in chart:
print(f" #{entry.rank} {entry.title} by {entry.artist} — {entry.play_count:,} plays")
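Raw chart rank follows plays, not engagement. For the A&R use case described earlier, a hypothetical helper like this (operating on plain dicts, e.g. from asdict(entry), with a min_plays floor so tiny tracks don't produce noisy ratios) re-ranks entries by likes per play:

```python
def rank_by_engagement(entries: list[dict], min_plays: int = 1000) -> list[dict]:
    """Re-rank chart entries by like-to-play ratio, skipping tracks
    with too few plays for the ratio to mean anything."""
    eligible = [e for e in entries if e.get("play_count", 0) >= min_plays]
    return sorted(
        eligible,
        key=lambda e: e.get("like_count", 0) / e["play_count"],
        reverse=True,
    )
```

A track with 2,000 plays and 200 likes outranks one with 10,000 plays and 100 likes, which is often exactly the signal scouts want.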
Part 9: Related Tracks
# related_tracks.py
import httpx
from dataclasses import dataclass, asdict
from typing import Optional
from client_id import extract_client_id
BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/123.0.0.0 Safari/537.36"
),
"Referer": "https://soundcloud.com/",
}
@dataclass
class RelatedTrack:
track_id: int
title: str
artist: str
artist_id: int
play_count: int
like_count: int
genre: str
permalink_url: str
duration_ms: int
def get_related_tracks(
track_id: int,
client_id: Optional[str] = None,
limit: int = 20,
) -> list[RelatedTrack]:
"""
Fetch tracks related/recommended to a given track.
SoundCloud uses this for the "Up Next" queue in the player.
"""
if client_id is None:
client_id = extract_client_id()
with httpx.Client(headers=HEADERS, timeout=20) as client:
resp = client.get(
f"{BASE_URL}/tracks/{track_id}/related",
params={
"client_id": client_id,
"limit": min(limit, 50),
},
)
if resp.status_code == 401:
raise PermissionError("client_id expired")
resp.raise_for_status()
data = resp.json()
collection = data.get("collection", [])
results = []
for track in collection:
user = track.get("user", {})
results.append(RelatedTrack(
track_id=track.get("id", 0),
title=track.get("title", ""),
artist=user.get("username", ""),
artist_id=user.get("id", 0),
play_count=track.get("playback_count", 0),
like_count=track.get("likes_count", 0),
genre=track.get("genre", ""),
permalink_url=track.get("permalink_url", ""),
duration_ms=track.get("duration", 0),
))
return results[:limit]
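A natural extension is to walk the related-tracks graph outward from a seed track to map a scene. This sketch injects the fetch function (so it can wrap get_related_tracks with a rate-limit delay, or a stub in tests) and dedupes with a visited set:

```python
from collections import deque
from typing import Callable

def crawl_related(
    seed_track_id: int,
    fetch_related: Callable[[int], list[int]],
    max_tracks: int = 100,
) -> list[int]:
    """Breadth-first crawl of the related-tracks graph, returning track
    IDs in discovery order. fetch_related(track_id) -> list of related IDs."""
    seen = {seed_track_id}
    queue = deque([seed_track_id])
    order: list[int] = []
    while queue and len(order) < max_tracks:
        track_id = queue.popleft()
        order.append(track_id)
        for related_id in fetch_related(track_id):
            if related_id not in seen:
                seen.add(related_id)
                queue.append(related_id)
    return order
```

In real use, something like `lambda tid: [t.track_id for t in get_related_tracks(tid)]` (with a sleep between calls) would be the fetch function.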
Part 10: Bulk Artist Monitoring
This is the most practical tool for ongoing intelligence work — track new uploads and follower growth for a roster of artists over time.
# monitor.py
import httpx
import sqlite3
import json
import time
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional
from pathlib import Path
from client_id import extract_client_id
from user_scraper import get_user_profile, get_user_tracks, UserProfile
DB_PATH = Path("soundcloud_monitor.db")
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/123.0.0.0 Safari/537.36"
),
"Referer": "https://soundcloud.com/",
}
def init_db(db_path: Path = DB_PATH) -> sqlite3.Connection:
"""Initialize monitoring database with required schema."""
conn = sqlite3.connect(db_path)
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE IF NOT EXISTS artists (
user_id INTEGER PRIMARY KEY,
username TEXT NOT NULL,
profile_url TEXT NOT NULL,
added_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS artist_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
user_id INTEGER NOT NULL,
snapshot_at TEXT NOT NULL,
followers_count INTEGER,
followings_count INTEGER,
track_count INTEGER,
playlist_count INTEGER,
FOREIGN KEY (user_id) REFERENCES artists(user_id)
);
CREATE TABLE IF NOT EXISTS tracks (
track_id INTEGER PRIMARY KEY,
user_id INTEGER NOT NULL,
title TEXT,
permalink_url TEXT,
created_at TEXT,
first_seen_at TEXT NOT NULL,
FOREIGN KEY (user_id) REFERENCES artists(user_id)
);
CREATE TABLE IF NOT EXISTS track_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
track_id INTEGER NOT NULL,
snapshot_at TEXT NOT NULL,
play_count INTEGER,
like_count INTEGER,
comment_count INTEGER,
repost_count INTEGER,
FOREIGN KEY (track_id) REFERENCES tracks(track_id)
);
CREATE INDEX IF NOT EXISTS idx_artist_snapshots_user_id
ON artist_snapshots(user_id);
CREATE INDEX IF NOT EXISTS idx_track_snapshots_track_id
ON track_snapshots(track_id);
CREATE INDEX IF NOT EXISTS idx_tracks_user_id
ON tracks(user_id);
""")
conn.commit()
return conn
def add_artist(conn: sqlite3.Connection, profile_url: str, client_id: str) -> int:
"""Add an artist to the monitoring roster. Returns user_id."""
profile = get_user_profile(profile_url, client_id)
now = datetime.now(timezone.utc).isoformat()
conn.execute(
"""INSERT OR REPLACE INTO artists (user_id, username, profile_url, added_at)
VALUES (?, ?, ?, ?)""",
(profile.user_id, profile.username, profile_url, now),
)
conn.commit()
print(f"Added artist: {profile.username} ({profile.followers_count:,} followers)")
return profile.user_id
def snapshot_artist(
conn: sqlite3.Connection,
user_id: int,
profile_url: str,
client_id: str,
) -> dict:
"""Take a point-in-time snapshot of an artist's stats and recent tracks."""
profile = get_user_profile(profile_url, client_id)
now = datetime.now(timezone.utc).isoformat()
# Save follower snapshot
conn.execute(
"""INSERT INTO artist_snapshots
(user_id, snapshot_at, followers_count, followings_count, track_count, playlist_count)
VALUES (?, ?, ?, ?, ?, ?)""",
(
user_id, now,
profile.followers_count, profile.followings_count,
profile.track_count, profile.playlist_count,
),
)
# Fetch recent tracks
time.sleep(0.5)
tracks = get_user_tracks(user_id, client_id, limit=20)
new_tracks = []
for track in tracks:
track_id = track["id"]
existing = conn.execute(
"SELECT track_id FROM tracks WHERE track_id = ?", (track_id,)
).fetchone()
if not existing:
# New track discovered
conn.execute(
"""INSERT INTO tracks (track_id, user_id, title, permalink_url, created_at, first_seen_at)
VALUES (?, ?, ?, ?, ?, ?)""",
(
track_id, user_id,
track.get("title", ""),
track.get("permalink_url", ""),
track.get("created_at", ""),
now,
),
)
new_tracks.append(track.get("title", ""))
# Save track stat snapshot
conn.execute(
"""INSERT INTO track_snapshots
(track_id, snapshot_at, play_count, like_count, comment_count, repost_count)
VALUES (?, ?, ?, ?, ?, ?)""",
(
track_id, now,
track.get("playback_count", 0),
track.get("likes_count", 0),
track.get("comment_count", 0),
track.get("reposts_count", 0),
),
)
conn.commit()
return {
"username": profile.username,
"followers": profile.followers_count,
"new_tracks": new_tracks,
"snapshot_at": now,
}
def run_monitoring_cycle(
profile_urls: list[str],
db_path: Path = DB_PATH,
delay_between_artists: float = 3.0,
) -> list[dict]:
"""
Run one monitoring cycle over all tracked artists.
Call this on a schedule (e.g., daily via cron).
"""
client_id = extract_client_id()
conn = init_db(db_path)
results = []
for url in profile_urls:
try:
# Ensure artist is registered
existing = conn.execute(
"SELECT user_id FROM artists WHERE profile_url = ?", (url,)
).fetchone()
if not existing:
user_id = add_artist(conn, url, client_id)
else:
user_id = existing["user_id"]
snapshot = snapshot_artist(conn, user_id, url, client_id)
results.append(snapshot)
if snapshot["new_tracks"]:
print(f" NEW TRACKS from {snapshot['username']}: {snapshot['new_tracks']}")
else:
print(f" {snapshot['username']}: {snapshot['followers']:,} followers, no new tracks")
except PermissionError:
print(" client_id expired, refreshing...")
client_id = extract_client_id(force_refresh=True)
# Note: this artist is skipped for the current cycle; the fresh
# client_id covers the remaining URLs, and the next run catches up.
except Exception as e:
print(f" Error monitoring {url}: {e}")
time.sleep(delay_between_artists)
conn.close()
return results
def get_follower_growth(
conn: sqlite3.Connection,
user_id: int,
days: int = 30,
) -> list[dict]:
"""Get follower growth history for an artist over the past N days."""
rows = conn.execute(
"""SELECT snapshot_at, followers_count
FROM artist_snapshots
WHERE user_id = ?
AND snapshot_at >= strftime('%Y-%m-%dT%H:%M:%S', 'now', ?)
ORDER BY snapshot_at""",
(user_id, f"-{days} days"),
).fetchall()
return [dict(row) for row in rows]
if __name__ == "__main__":
roster = [
"https://soundcloud.com/disclosure",
"https://soundcloud.com/flume",
"https://soundcloud.com/kaytranada",
]
results = run_monitoring_cycle(roster)
print(json.dumps(results, indent=2))
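Once snapshots accumulate, the simplest derived metric is average followers gained per day. A sketch over the rows get_follower_growth() returns (ISO 8601 timestamps, as written by snapshot_artist):

```python
from datetime import datetime

def followers_per_day(snapshots: list[dict]) -> float:
    """Average follower gain per day between the first and last snapshot.
    Expects rows with 'snapshot_at' (ISO 8601) and 'followers_count'."""
    if len(snapshots) < 2:
        return 0.0
    t0 = datetime.fromisoformat(snapshots[0]["snapshot_at"])
    t1 = datetime.fromisoformat(snapshots[-1]["snapshot_at"])
    days = (t1 - t0).total_seconds() / 86_400
    if days <= 0:
        return 0.0
    gained = snapshots[-1]["followers_count"] - snapshots[0]["followers_count"]
    return gained / days
```

This is the velocity number the A&R use case at the top of this guide cares about: 150 followers/day on a 5,000-follower account is a louder signal than 150/day on 500,000.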
Anti-Detection Deep Dive
Client ID Rotation
The client_id is the most common failure point. SoundCloud rotates these every few days. Your scraper must handle this gracefully:
import time

def with_client_id_retry(func, *args, max_retries: int = 3, **kwargs):
"""Call func with a client_id kwarg, refreshing the ID and retrying on 401 (raised as PermissionError)."""
client_id = extract_client_id()
for attempt in range(max_retries):
try:
return func(*args, client_id=client_id, **kwargs)
except PermissionError:
if attempt == max_retries - 1:
raise
print(f"client_id expired, refreshing (attempt {attempt + 1})...")
client_id = extract_client_id(force_refresh=True)
time.sleep(1)
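The same idea also works as a true decorator, which keeps call sites clean. In this sketch the extractor is injected as an argument so it can be stubbed in tests; in real use you would pass extract_client_id from client_id.py:

```python
import functools
import time

def auto_refresh_client_id(extract, max_retries: int = 3, delay: float = 1.0):
    """Decorator factory: retry the wrapped call with a fresh client_id
    whenever it raises PermissionError (our marker for HTTP 401)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            client_id = extract()
            for attempt in range(max_retries):
                try:
                    return func(*args, client_id=client_id, **kwargs)
                except PermissionError:
                    if attempt == max_retries - 1:
                        raise
                    client_id = extract(force_refresh=True)
                    time.sleep(delay)
        return wrapper
    return decorator
```

Any function in this guide that accepts a client_id keyword can then be decorated once instead of wrapped at every call site.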
Rate Limiting with Exponential Backoff
SoundCloud returns 429 after roughly 15-20 requests per minute from a single IP. The backoff strategy:
import time
import random
def request_with_backoff(
client: httpx.Client,
url: str,
params: dict,
max_retries: int = 5,
base_delay: float = 5.0,
) -> httpx.Response:
"""Make a request with exponential backoff on 429 responses."""
for attempt in range(max_retries):
resp = client.get(url, params=params)
if resp.status_code == 429:
# Honor Retry-After when it's a plain seconds value;
# a missing header or HTTP-date form falls back to exponential backoff
retry_after = resp.headers.get("Retry-After")
try:
wait = float(retry_after)
except (TypeError, ValueError):
wait = base_delay * (2 ** attempt)
# Add jitter to avoid thundering herd
jitter = random.uniform(0, wait * 0.2)
total_wait = wait + jitter
print(f"Rate limited (attempt {attempt + 1}/{max_retries}). Waiting {total_wait:.1f}s...")
time.sleep(total_wait)
continue
return resp
raise ConnectionError(f"Still rate-limited after {max_retries} retries")
Session Management and Cookie Persistence
Maintaining a consistent session reduces 429 frequency:
import random
import time

import httpx

def create_session(proxy_url: str | None = None) -> httpx.Client:
"""
Create a persistent HTTP session that looks like a real browser.
Cookies are maintained across requests. The session visits the
SoundCloud homepage first to establish a realistic session state.
"""
transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
client = httpx.Client(
transport=transport,
headers={
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/123.0.0.0 Safari/537.36"
),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
},
follow_redirects=True,
timeout=25,
)
# Warm up with a homepage visit to get session cookies
try:
client.get("https://soundcloud.com")
time.sleep(random.uniform(1.0, 2.0))
except httpx.HTTPError:
pass # Continue even if warmup fails
# Switch to API headers
client.headers.update({
"Accept": "application/json, text/javascript, */*; q=0.01",
"Referer": "https://soundcloud.com/",
"X-Requested-With": "XMLHttpRequest",
})
return client
Request Timing with Jitter
Uniform timing is a detection signal. Add jitter:
import random
import time
def sleep_with_jitter(base_seconds: float, jitter_pct: float = 0.3) -> None:
"""Sleep for base_seconds ± jitter_pct of base_seconds."""
jitter = base_seconds * jitter_pct * random.uniform(-1, 1)
actual = max(0.1, base_seconds + jitter)
time.sleep(actual)
IP Reputation and Proxy Strategy
SoundCloud rate-limits datacenter IP ranges (AWS, GCP, DigitalOcean) first and most aggressively; its fingerprinting checks the requesting IP's ASN against known datacenter ranges.
For serious collection work, residential proxies are the practical solution. ThorData's residential proxy network is worth evaluating here — they offer rotating residential IPs that cycle between requests, which eliminates the per-IP rate limit problem. Their geo-targeting also lets you verify whether SoundCloud serves different chart data or search results by region (it does — trending charts differ between US, UK, and Germany).
# Proxy rotation example with ThorData
import random
PROXY_ENDPOINTS = [
"http://user:[email protected]:10000",
# ThorData rotates the exit IP automatically on each connection
]
def get_proxy() -> str:
return random.choice(PROXY_ENDPOINTS)
# Use with create_session():
client = create_session(proxy_url=get_proxy())
Data Storage
SQLite Schema for Full Pipeline
# storage.py
import sqlite3
import json
from pathlib import Path
from datetime import datetime, timezone
SCHEMA = """
CREATE TABLE IF NOT EXISTS tracks (
track_id INTEGER PRIMARY KEY,
title TEXT,
artist TEXT,
artist_id INTEGER,
permalink_url TEXT,
play_count INTEGER,
like_count INTEGER,
comment_count INTEGER,
repost_count INTEGER,
download_count INTEGER,
duration_ms INTEGER,
genre TEXT,
tag_list TEXT,
description TEXT,
created_at TEXT,
license TEXT,
waveform_url TEXT,
downloadable INTEGER,
bpm REAL,
isrc TEXT,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS users (
user_id INTEGER PRIMARY KEY,
username TEXT,
display_name TEXT,
permalink_url TEXT,
followers_count INTEGER,
followings_count INTEGER,
track_count INTEGER,
playlist_count INTEGER,
verified INTEGER,
description TEXT,
city TEXT,
country_code TEXT,
website_url TEXT,
created_at TEXT,
scraped_at TEXT
);
CREATE TABLE IF NOT EXISTS comments (
comment_id INTEGER PRIMARY KEY,
track_id INTEGER,
user_id INTEGER,
username TEXT,
body TEXT,
timestamp_ms INTEGER,
created_at TEXT,
scraped_at TEXT,
FOREIGN KEY (track_id) REFERENCES tracks(track_id)
);
CREATE INDEX IF NOT EXISTS idx_comments_track_id ON comments(track_id);
CREATE INDEX IF NOT EXISTS idx_tracks_artist_id ON tracks(artist_id);
CREATE INDEX IF NOT EXISTS idx_tracks_genre ON tracks(genre);
CREATE INDEX IF NOT EXISTS idx_tracks_play_count ON tracks(play_count DESC);
"""
def init_storage(db_path: str = "soundcloud.db") -> sqlite3.Connection:
conn = sqlite3.connect(db_path)
conn.executescript(SCHEMA)
conn.commit()
return conn
def save_track(conn: sqlite3.Connection, track: dict) -> None:
now = datetime.now(timezone.utc).isoformat()
conn.execute(
"""INSERT OR REPLACE INTO tracks
(track_id, title, artist, artist_id, permalink_url,
play_count, like_count, comment_count, repost_count, download_count,
duration_ms, genre, tag_list, description, created_at, license,
waveform_url, downloadable, bpm, isrc, scraped_at)
VALUES
(:track_id, :title, :artist, :artist_id, :permalink_url,
:play_count, :like_count, :comment_count, :repost_count, :download_count,
:duration_ms, :genre, :tag_list, :description, :created_at, :license,
:waveform_url, :downloadable, :bpm, :isrc, :scraped_at)""",
{**track, "scraped_at": now},
)
conn.commit()
def save_comments(conn: sqlite3.Connection, comments: list[dict]) -> None:
now = datetime.now(timezone.utc).isoformat()
conn.executemany(
"""INSERT OR REPLACE INTO comments
(comment_id, track_id, user_id, username, body, timestamp_ms, created_at, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
[
(
c["comment_id"],
c["track_id"],
c["author"]["user_id"],
c["author"]["username"],
c["body"],
c.get("timestamp_ms"),
c["created_at"],
now,
)
for c in comments
],
)
conn.commit()
def export_csv(conn: sqlite3.Connection, table: str, output_path: str) -> None:
"""Export a table to CSV."""
import csv
rows = conn.execute(f"SELECT * FROM {table}").fetchall()
if not rows:
print(f"No data in {table}")
return
headers = [d[0] for d in conn.execute(f"SELECT * FROM {table} LIMIT 0").description]
with open(output_path, "w", newline="", encoding="utf-8") as f:
writer = csv.writer(f)
writer.writerow(headers)
writer.writerows(rows)
print(f"Exported {len(rows)} rows to {output_path}")
Complete End-to-End Pipeline
This script ties everything together: extract client ID, search a genre, get full metadata, collect comments, store in SQLite.
# pipeline.py
"""
SoundCloud data collection pipeline.
Usage:
python3 pipeline.py --genre "lo-fi hip hop" --limit 50 --comments 100
"""
import argparse
import json
import time
import sqlite3
from dataclasses import asdict
from pathlib import Path
from client_id import extract_client_id
from search import search, SearchFilter
from track_scraper import _parse_track, resolve_url, HEADERS
from comment_scraper import get_all_comments
from storage import init_storage, save_track, save_comments, export_csv
import httpx
def run_pipeline(
genre: str,
limit: int = 50,
comments_per_track: int = 100,
db_path: str = "soundcloud.db",
proxy_url: str | None = None,
delay: float = 1.5,
) -> None:
print(f"=== SoundCloud Pipeline: {genre} ===")
print(f"Target: {limit} tracks, up to {comments_per_track} comments each")
# Step 1: Extract client ID
print("\n[1/4] Extracting client_id...")
client_id = extract_client_id()
print(f" Got client_id: {client_id[:8]}...")
# Step 2: Initialize database
print("\n[2/4] Initializing database...")
conn = init_storage(db_path)
print(f" Database: {db_path}")
# Step 3: Search for tracks
print(f"\n[3/4] Searching for '{genre}' tracks...")
search_results = search(
query=genre,
kind="tracks",
limit=limit,
filters=SearchFilter(genre=genre.lower().replace(" ", "")),
client_id=client_id,
)
print(f" Found {len(search_results)} tracks")
# Step 4: Enrich each track with full metadata + comments
print(f"\n[4/4] Fetching full metadata and comments...")
transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
client = httpx.Client(
headers=HEADERS,
transport=transport,
timeout=20,
follow_redirects=True,
)
summary = []
try:
for i, raw in enumerate(search_results):
track_url = raw.get("permalink_url", "")
if not track_url:
continue
print(f"\n [{i+1}/{len(search_results)}] {raw.get('title', '?')}")
# Get full track data (search results may have incomplete fields)
try:
full_data = resolve_url(track_url, client_id, client)
track = _parse_track(full_data)
track_dict = asdict(track)
save_track(conn, track_dict)
print(f" Plays: {track.play_count:,} Likes: {track.like_count:,} Comments: {track.comment_count:,}")
except PermissionError:
print(" client_id expired, refreshing...")
client_id = extract_client_id(force_refresh=True)
continue
except Exception as e:
print(f" Error fetching track: {e}")
time.sleep(delay)
continue
time.sleep(delay)
# Collect comments if track has any
if track.comment_count > 0 and comments_per_track > 0:
try:
comments = get_all_comments(
track.track_id,
client_id,
max_comments=comments_per_track,
)
save_comments(conn, comments)
print(f" Saved {len(comments)} comments")
except Exception as e:
print(f" Error fetching comments: {e}")
time.sleep(delay)
summary.append({
"title": track.title,
"artist": track.artist,
"play_count": track.play_count,
"like_count": track.like_count,
"comment_count": track.comment_count,
"permalink_url": track.permalink_url,
})
finally:
client.close()
# Export results
print("\n=== Export ===")
export_csv(conn, "tracks", f"soundcloud_tracks_{genre.replace(' ', '_')}.csv")
export_csv(conn, "comments", f"soundcloud_comments_{genre.replace(' ', '_')}.csv")
conn.close()
# Summary
print(f"\n=== Done ===")
print(f"Collected {len(summary)} tracks")
top = sorted(summary, key=lambda x: x["play_count"], reverse=True)[:5]
print("Top 5 by plays:")
for t in top:
print(f" {t['title']} by {t['artist']} — {t['play_count']:,} plays")
# Save summary JSON
with open(f"pipeline_summary_{genre.replace(' ', '_')}.json", "w") as f:
json.dump(summary, f, indent=2)
print(f"Summary saved to pipeline_summary_{genre.replace(' ', '_')}.json")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="SoundCloud data pipeline")
parser.add_argument("--genre", default="electronic", help="Genre to search")
parser.add_argument("--limit", type=int, default=50, help="Number of tracks")
parser.add_argument("--comments", type=int, default=100, help="Comments per track")
parser.add_argument("--db", default="soundcloud.db", help="SQLite database path")
parser.add_argument("--proxy", default=None, help="Proxy URL")
parser.add_argument("--delay", type=float, default=1.5, help="Delay between requests")
args = parser.parse_args()
run_pipeline(
genre=args.genre,
limit=args.limit,
comments_per_track=args.comments,
db_path=args.db,
proxy_url=args.proxy,
delay=args.delay,
)
Run it:
uv pip install httpx
python3 pipeline.py --genre "lo-fi hip hop" --limit 100 --comments 200
Ethics and Legal Considerations
Terms of Service. SoundCloud's ToS prohibits automated access. Using these techniques for commercial data products, reselling data, or high-volume collection without permission puts you in violation. Personal research, academic study, and building tools for your own tracks occupy a greyer area.
Copyright. The metadata and comments are data about music, not the music itself. Do not use these tools to download or redistribute audio files. Stream URLs extracted from the API are time-limited and tied to the requesting IP.
Rate limits. Stay within rates that don't degrade service for other users. If you're collecting millions of tracks, you're no longer in research territory — contact SoundCloud about data partnerships.
Data privacy. Comments and user profiles are public, but collecting and storing them at scale may implicate GDPR if you're in the EU or collecting data on EU residents. Anonymize where possible and don't build personal profiles on individual users.
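One way to honor that in practice (a sketch, not legal advice): store a salted one-way hash instead of the username, so per-user aggregation and joins still work but the handle can't be read back out of the database:

```python
import hashlib

def pseudonymize(username: str, salt: str) -> str:
    """Deterministic one-way pseudonym for a username. The same
    salt + username always maps to the same token, so per-user
    counts still work; keep the salt secret and constant."""
    digest = hashlib.sha256((salt + ":" + username).encode("utf-8")).hexdigest()
    return digest[:16]
```

Apply it to the username and user_id fields before insertion in save_comments if you don't need the real identities downstream.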
robots.txt. SoundCloud's robots.txt disallows crawlers on most paths. The internal API is not explicitly addressed, but the spirit of the rules applies.
Troubleshooting
401 Unauthorized — client_id Expired
The most common error. The client ID has rotated.
httpx.HTTPStatusError: Client error '401 Unauthorized'
Fix: Call extract_client_id(force_refresh=True). The cache TTL is 1 hour by default; reduce it if you're seeing frequent 401s. If extraction fails, the JS bundle URL patterns may have changed — inspect the https://soundcloud.com page source and update the regex in client_id.py.
429 Too Many Requests — Rate Limited
httpx.HTTPStatusError: Client error '429 Too Many Requests'
Fix: You're hitting roughly 15-20 requests per minute from a single IP. Options in order of preference:
1. Increase the delay parameter to 3-5 seconds
2. Add request jitter
3. Use residential proxies to distribute load
403 Forbidden — Geo-Blocked Content
Some tracks are region-restricted. The API returns 403 when you try to resolve them from a blocked region.
Fix: Use a proxy in an allowed region. Most SoundCloud content is globally accessible; restricted tracks are the exception, typically due to label agreements.
Empty collection Arrays
Search or chart endpoints return an empty collection with a valid 200 response.
Causes:
- Genre slug doesn't match SoundCloud's accepted values (see GENRES list in charts.py)
- The filter combination returns no results
- The search query is too narrow
Fix: Try without filters first. Use "all-music" as the genre. Check GENRES for valid slugs.
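A defensive helper (hypothetical, not part of charts.py) can catch the slug problem before the request is made, by collapsing free-form genre input and checking it against the accepted list:

```python
def normalize_genre(genre: str, valid_slugs: list[str]) -> str:
    """Collapse a free-form genre name ('Hip Hop Rap', 'drum & bass')
    to lowercase alphanumerics and check it against the accepted slugs
    (the GENRES list in charts.py); fall back to 'all-music'."""
    slug = "".join(ch for ch in genre.lower() if ch.isalnum())
    return slug if slug in valid_slugs else "all-music"
```

For example, "drum & bass" collapses to the accepted slug drumbass, while an unknown genre quietly falls back to the all-music chart instead of returning an empty collection.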
waveform_url Returns 404
Waveform CDN URLs are stable but occasionally expire or get regenerated.
Fix: Re-fetch the track metadata to get a fresh waveform_url. Waveform URLs are served from wave.sndcdn.com; if that domain is unreachable, it's a CDN issue, not a scraping issue.
Stale Play Counts
Play counts in API responses can lag by a few minutes during high-traffic periods.
Fix: For accuracy-sensitive applications, query the same track twice with a 60-second gap and use the higher value (counts only go up).
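A sketch of that double-read; the fetch function is injected (hypothetically wrapping resolve_url and reading playback_count) so the gap can be zeroed out in tests:

```python
import time

def stable_play_count(track_id: int, fetch_count, gap_seconds: float = 60.0) -> int:
    """Read a track's play count twice, gap_seconds apart, and keep
    the higher value — safe because the counter only ever increases."""
    first = fetch_count(track_id)
    time.sleep(gap_seconds)
    second = fetch_count(track_id)
    return max(first, second)
```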
ImportError on httpx
ModuleNotFoundError: No module named 'httpx'
Fix:
uv pip install httpx
All scripts in this guide require httpx. No other third-party dependencies are needed.
Summary
SoundCloud's internal API is well-structured and stable enough for serious data collection. The main operational concern is the rotating client_id, which your scraper must handle automatically. Everything else — pagination, rate limiting, data parsing — follows straightforward patterns.
The most productive workflow for ongoing music intelligence:
- Run pipeline.py with your target genre to bootstrap a local database
- Use monitor.py on a daily schedule to track follower growth and new uploads
- Pull waveform data selectively for tracks that cross engagement thresholds
- Export to CSV for analysis in any tool you prefer
For large-scale collection, residential proxies remove the IP-rate-limit ceiling and let you collect at whatever pace the data requires.