Scraping SoundCloud Track Data: Play Counts, Comments, and Followers (2026)

SoundCloud sits in an unusual position for data collection. The platform killed public API registrations years ago — you cannot get new API keys through official channels. But the internal API still powers the website itself, the widget embed API remains open, and structured JSON loads on every page. For anyone building music analytics tools, tracking emerging artists, or researching audio content at scale, SoundCloud exposes rich data behind a relatively thin layer of protection.

Why Scrape SoundCloud?

The use cases are more varied than you might expect:

A&R discovery. Labels and talent scouts track play count velocity, not absolute numbers. An artist going from 500 to 50,000 plays in a week on an unsigned track is a signal. Automated monitoring across thousands of artists catches these moments before they hit music blogs.

Podcast and DJ set analytics. DJ mixes posted to SoundCloud accumulate timestamped comments that pinpoint crowd-favorite moments. Podcast producers use comment density to identify which segments resonated. Neither YouTube nor Spotify exposes this kind of granular engagement data.

Emerging artist monitoring. Distributor A&R teams watch follower growth rates and engagement ratios (comments per play, likes per play) as leading indicators of breakout potential. SoundCloud's listener base skews earlier in the discovery curve than Spotify.

Genre trend analysis. Tracking which tags and genres accumulate plays fastest gives music publishers and sync licensing teams early signals on what's trending before mainstream charts catch up.

Academic research. Musicologists, platform studies researchers, and media sociologists use SoundCloud data for peer-reviewed work on music sharing behavior, fan community formation, and geographic diffusion of musical styles.

Competitive intelligence. Record labels monitor competitor rosters, distributor performance, and unsigned talent in specific cities or genres — all publicly visible on SoundCloud.

What You Can Extract

SoundCloud's internal API exposes more than you'd expect once you examine network traffic: full track metadata (play, like, repost, and comment counts, plus genre, tags, license, and waveform data), timestamped comments anchored to positions in the audio, user profiles with follower counts and social links, complete playlist contents, and a search endpoint with genre, duration, license, and date filters. The sections below cover each in turn.

Architecture Overview

Every approach in this guide relies on the same underlying mechanism: SoundCloud's internal API at api-v2.soundcloud.com. This is the same API the website uses. It requires a client_id parameter that you extract from the JavaScript bundles on every page load.

The flow for any scraping task is:

1. Extract a valid client_id from SoundCloud's JS bundles
2. Use /resolve to convert a public URL to full structured data
3. Use resource-specific endpoints (/tracks/{id}/comments, /users/{id}/followers, etc.) for deeper data
4. Handle 401s by re-extracting the client ID; handle 429s with exponential backoff
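Step 2 can be sketched without a network call by looking at the request shape. The endpoint and parameter names are the ones used throughout this guide; the track URL and client_id below are placeholders:

```python
from urllib.parse import urlencode


def build_resolve_url(public_url: str, client_id: str) -> str:
    # Step 2 of the flow: /resolve turns a public permalink into API JSON.
    query = urlencode({"url": public_url, "client_id": client_id})
    return f"https://api-v2.soundcloud.com/resolve?{query}"


print(build_resolve_url("https://soundcloud.com/artist/track", "x" * 32))
```

The same two query parameters (`url` and `client_id`) appear on every /resolve call in the scrapers below.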

All scripts below use httpx for HTTP requests. Install it with uv pip install httpx.


Part 1: Client ID Extraction

The client ID is a 32-character alphanumeric string embedded in SoundCloud's JavaScript bundles. It rotates periodically — roughly every few days — so your scraper needs to extract it dynamically rather than hardcode it.
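Before trusting an extracted value, a cheap offline sanity check helps. Note the 32-character alphanumeric shape is an observed convention, not a documented contract, so treat this as a heuristic:

```python
import re


def looks_like_client_id(candidate: str) -> bool:
    # Matches the 32-character alphanumeric shape described above.
    return re.fullmatch(r"[A-Za-z0-9]{32}", candidate) is not None
```

This only checks the shape; the validate_client_id function below confirms a candidate actually works against the API.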

# client_id.py
import httpx
import re
import json
import time
from pathlib import Path
from typing import Optional

CACHE_FILE = Path(".soundcloud_client_id.json")
CACHE_TTL = 3600  # 1 hour


def _load_cached_client_id() -> Optional[str]:
    """Load client ID from cache if still fresh."""
    if not CACHE_FILE.exists():
        return None
    data = json.loads(CACHE_FILE.read_text())
    if time.time() - data.get("timestamp", 0) > CACHE_TTL:
        return None
    return data.get("client_id")


def _save_client_id(client_id: str) -> None:
    """Persist client ID with timestamp."""
    CACHE_FILE.write_text(json.dumps({
        "client_id": client_id,
        "timestamp": time.time(),
    }))


def extract_client_id(force_refresh: bool = False) -> str:
    """
    Extract a valid client_id from SoundCloud's JavaScript bundles.

    Tries cached value first. On cache miss or force_refresh, fetches
    the SoundCloud homepage, finds JS bundle URLs, and scans each for
    the client_id pattern. Falls back through multiple bundles.
    """
    if not force_refresh:
        cached = _load_cached_client_id()
        if cached:
            return cached

    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/123.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    }

    with httpx.Client(headers=headers, follow_redirects=True, timeout=20) as client:
        # Fetch homepage to find JS bundle URLs
        resp = client.get("https://soundcloud.com")
        resp.raise_for_status()

        # Multiple patterns for bundle URLs
        script_patterns = [
            r'src="(https://a-v2\.sndcdn\.com/assets/[^"]+\.js)"',
            r'src="(https://[^"]*sndcdn\.com[^"]*\.js)"',
        ]

        script_urls: list[str] = []
        for pattern in script_patterns:
            script_urls.extend(re.findall(pattern, resp.text))

        # Deduplicate, scanning later bundles first (client_id lives in the app bundles)
        seen: set[str] = set()
        unique_urls = []
        for url in reversed(script_urls):
            if url not in seen:
                seen.add(url)
                unique_urls.append(url)

        # Scan each bundle for client_id
        client_id_patterns = [
            r'client_id:"([a-zA-Z0-9]{32})"',
            r'"client_id","([a-zA-Z0-9]{32})"',
            r'client_id=([a-zA-Z0-9]{32})',
        ]

        for script_url in unique_urls[:8]:
            try:
                js_resp = client.get(script_url, timeout=15)
                js_resp.raise_for_status()
                for pattern in client_id_patterns:
                    match = re.search(pattern, js_resp.text)
                    if match:
                        client_id = match.group(1)
                        _save_client_id(client_id)
                        return client_id
            except httpx.HTTPError:
                continue

    raise RuntimeError(
        "Could not extract client_id from any SoundCloud JS bundle. "
        "The extraction patterns may need updating."
    )


def validate_client_id(client_id: str) -> bool:
    """Test if a client_id is still valid by making a lightweight API call."""
    try:
        resp = httpx.get(
            "https://api-v2.soundcloud.com/resolve",
            params={"url": "https://soundcloud.com", "client_id": client_id},
            timeout=10,
        )
        return resp.status_code != 401
    except httpx.HTTPError:
        return False


if __name__ == "__main__":
    cid = extract_client_id(force_refresh=True)
    print(f"Extracted client_id: {cid[:8]}...{cid[-4:]}")
    print(f"Valid: {validate_client_id(cid)}")

Output:

Extracted client_id: iZIs9mch...AbCd
Valid: True
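Step 4 of the flow calls for exponential backoff on 429s. The schedule the scrapers below use (wait = 2^retries × 5 seconds) looks like this when written out as a helper:

```python
def backoff_schedule(base: float = 5.0, max_retries: int = 5) -> list[float]:
    # Doubling delays across five retries: 5s, 10s, 20s, 40s, 80s.
    return [base * (2 ** attempt) for attempt in range(max_retries)]


print(backoff_schedule())  # [5.0, 10.0, 20.0, 40.0, 80.0]
```

Adding random jitter on top of these delays is a common refinement when running many scrapers in parallel, though the scripts here keep it simple.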

Part 2: Track Metadata Scraper

# track_scraper.py
import httpx
import time
import json
from dataclasses import dataclass, asdict, field
from typing import Optional
from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://soundcloud.com/",
    "Origin": "https://soundcloud.com",
}


@dataclass
class ArtworkUrls:
    small: str = ""       # 32x32
    medium: str = ""      # 100x100
    large: str = ""       # 300x300
    t500x500: str = ""    # 500x500


@dataclass
class TrackData:
    track_id: int
    title: str
    artist: str
    artist_id: int
    permalink_url: str
    play_count: int
    like_count: int
    comment_count: int
    repost_count: int
    download_count: int
    duration_ms: int
    genre: str
    tag_list: str
    description: str
    created_at: str
    last_modified: str
    license: str
    waveform_url: str
    artwork: ArtworkUrls = field(default_factory=ArtworkUrls)
    downloadable: bool = False
    streamable: bool = True
    embeddable_by: str = "all"
    purchase_url: Optional[str] = None
    label_name: Optional[str] = None
    bpm: Optional[int] = None
    key_signature: Optional[str] = None
    isrc: Optional[str] = None


def _parse_artwork(raw_url: Optional[str]) -> ArtworkUrls:
    """Generate artwork URLs at multiple sizes from the base artwork URL."""
    if not raw_url:
        return ArtworkUrls()
    base = raw_url.replace("large", "{size}")
    return ArtworkUrls(
        small=base.replace("{size}", "small"),
        medium=base.replace("{size}", "t100x100"),
        large=base.replace("{size}", "t300x300"),
        t500x500=base.replace("{size}", "t500x500"),
    )


def _parse_track(data: dict) -> TrackData:
    """Parse raw API response dict into a TrackData dataclass."""
    user = data.get("user", {})
    return TrackData(
        track_id=data["id"],
        title=data.get("title", ""),
        artist=user.get("username", ""),
        artist_id=user.get("id", 0),
        permalink_url=data.get("permalink_url", ""),
        play_count=data.get("playback_count", 0),
        like_count=data.get("likes_count", 0),
        comment_count=data.get("comment_count", 0),
        repost_count=data.get("reposts_count", 0),
        download_count=data.get("download_count", 0),
        duration_ms=data.get("duration", 0),
        genre=data.get("genre", ""),
        tag_list=data.get("tag_list", ""),
        description=data.get("description", ""),
        created_at=data.get("created_at", ""),
        last_modified=data.get("last_modified", ""),
        license=data.get("license", ""),
        waveform_url=data.get("waveform_url", ""),
        artwork=_parse_artwork(data.get("artwork_url")),
        downloadable=data.get("downloadable", False),
        streamable=data.get("streamable", True),
        embeddable_by=data.get("embeddable_by", "all"),
        purchase_url=data.get("purchase_url"),
        label_name=data.get("label_name"),
        bpm=data.get("bpm"),
        key_signature=data.get("key_signature"),
        isrc=data.get("isrc"),
    )


def resolve_url(url: str, client_id: str, client: httpx.Client) -> dict:
    """Resolve any SoundCloud URL to its full API data object."""
    resp = client.get(
        f"{BASE_URL}/resolve",
        params={"url": url, "client_id": client_id},
    )
    if resp.status_code == 401:
        raise PermissionError("client_id expired — refresh required")
    if resp.status_code == 429:
        raise ConnectionError("Rate limited (429)")
    resp.raise_for_status()
    return resp.json()


def get_track(url: str, client_id: str, client: httpx.Client) -> TrackData:
    """Fetch full track metadata for a SoundCloud track URL."""
    data = resolve_url(url, client_id, client)
    if data.get("kind") != "track":
        raise ValueError(f"URL resolved to {data.get('kind')}, not a track")
    return _parse_track(data)


def scrape_tracks(
    urls: list[str],
    delay: float = 1.5,
    proxy_url: Optional[str] = None,
) -> list[dict]:
    """
    Scrape metadata for a list of SoundCloud track URLs.

    Automatically re-extracts client_id on 401 errors.
    Applies exponential backoff on 429 rate-limit responses.
    """
    client_id = extract_client_id()
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None

    results = []
    with httpx.Client(
        headers=HEADERS,
        transport=transport,
        timeout=20,
        follow_redirects=True,
    ) as client:
        for i, url in enumerate(urls):
            retries = 0
            while retries < 5:
                try:
                    track = get_track(url, client_id, client)
                    results.append(asdict(track))
                    print(
                        f"[{i+1}/{len(urls)}] {track.title} — "
                        f"{track.play_count:,} plays, "
                        f"{track.like_count:,} likes"
                    )
                    break
                except PermissionError:
                    print("  client_id expired, refreshing...")
                    client_id = extract_client_id(force_refresh=True)
                    retries += 1
                except ConnectionError:
                    wait = 2 ** retries * 5
                    print(f"  Rate limited. Waiting {wait}s...")
                    time.sleep(wait)
                    retries += 1
                except Exception as e:
                    print(f"  Error scraping {url}: {e}")
                    break

            time.sleep(delay)

    return results


if __name__ == "__main__":
    urls = [
        "https://soundcloud.com/disclosure/latch-feat-sam-smith",
        "https://soundcloud.com/flume/never-be-like-you-feat-kai",
    ]
    tracks = scrape_tracks(urls)
    print(json.dumps(tracks[0], indent=2, default=str))

Example output:

{
  "track_id": 114688288,
  "title": "Latch feat. Sam Smith",
  "artist": "Disclosure",
  "artist_id": 23489344,
  "permalink_url": "https://soundcloud.com/disclosure/latch-feat-sam-smith",
  "play_count": 18432901,
  "like_count": 142300,
  "comment_count": 3847,
  "repost_count": 28900,
  "download_count": 0,
  "duration_ms": 237800,
  "genre": "Electronic",
  "tag_list": "\"UK House\" deep house electronic",
  "description": "",
  "created_at": "2013-08-29T11:22:14Z",
  "license": "all-rights-reserved",
  "waveform_url": "https://wave.sndcdn.com/aBcDeFgH_m.json",
  "artwork": {
    "small": "https://i1.sndcdn.com/artworks-xxx-small.jpg",
    "medium": "https://i1.sndcdn.com/artworks-xxx-t100x100.jpg",
    "large": "https://i1.sndcdn.com/artworks-xxx-t300x300.jpg",
    "t500x500": "https://i1.sndcdn.com/artworks-xxx-t500x500.jpg"
  },
  "downloadable": false,
  "bpm": null,
  "isrc": null
}
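Notice the tag_list field in the output above: SoundCloud packs all tags into one string, with multi-word tags wrapped in quotes. A small parser, sketched here with shlex (which raises ValueError on unbalanced quotes, hence the fallback):

```python
import shlex


def parse_tag_list(tag_list: str) -> list[str]:
    # '"UK House" deep house electronic' -> ['UK House', 'deep', 'house', 'electronic']
    if not tag_list:
        return []
    try:
        return shlex.split(tag_list)
    except ValueError:
        # Stray apostrophes or unbalanced quotes: degrade to whitespace splitting.
        return tag_list.split()
```

Useful for the genre trend analysis use case, where tag frequencies across many tracks are the raw signal.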

Part 3: User Profile Scraper

# user_scraper.py
import httpx
import time
from dataclasses import dataclass, asdict
from typing import Optional
from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
    "Accept-Language": "en-US,en;q=0.9",
}


@dataclass
class SocialLinks:
    youtube_channel_url: Optional[str] = None
    facebook_page: Optional[str] = None
    instagram_username: Optional[str] = None
    twitter_handle: Optional[str] = None
    website_url: Optional[str] = None
    website_title: Optional[str] = None


@dataclass
class UserProfile:
    user_id: int
    username: str
    permalink_url: str
    display_name: str
    followers_count: int
    followings_count: int
    track_count: int
    playlist_count: int
    likes_count: int
    reposts_count: int
    description: str
    city: str
    country_code: str
    verified: bool
    avatar_url: str
    created_at: str
    last_modified: str
    social: SocialLinks


def _parse_user(data: dict) -> UserProfile:
    social = SocialLinks(
        youtube_channel_url=data.get("youtube_channel_url"),
        facebook_page=data.get("facebook_page"),
        instagram_username=data.get("instagram_username"),
        twitter_handle=None,  # removed from API but kept for schema
        website_url=data.get("website_url"),
        website_title=data.get("website_title"),
    )
    return UserProfile(
        user_id=data["id"],
        username=data.get("username", ""),
        permalink_url=data.get("permalink_url", ""),
        display_name=data.get("full_name") or data.get("username", ""),
        followers_count=data.get("followers_count", 0),
        followings_count=data.get("followings_count", 0),
        track_count=data.get("track_count", 0),
        playlist_count=data.get("playlist_count", 0),
        likes_count=data.get("public_favorites_count", 0),
        reposts_count=data.get("reposts_count", 0),
        description=data.get("description", ""),
        city=data.get("city", ""),
        country_code=data.get("country_code", ""),
        verified=data.get("verified", False),
        avatar_url=data.get("avatar_url", ""),
        created_at=data.get("created_at", ""),
        last_modified=data.get("last_modified", ""),
        social=social,
    )


def get_user_profile(profile_url: str, client_id: str) -> UserProfile:
    """Resolve a SoundCloud profile URL to a full UserProfile."""
    with httpx.Client(headers=HEADERS, timeout=20, follow_redirects=True) as client:
        resp = client.get(
            f"{BASE_URL}/resolve",
            params={"url": profile_url, "client_id": client_id},
        )
        if resp.status_code == 401:
            raise PermissionError("client_id expired")
        resp.raise_for_status()
        data = resp.json()
        if data.get("kind") != "user":
            raise ValueError(f"URL is a {data.get('kind')}, not a user")
        return _parse_user(data)


def get_user_by_id(user_id: int, client_id: str) -> UserProfile:
    """Fetch user profile by numeric user ID."""
    with httpx.Client(headers=HEADERS, timeout=20) as client:
        resp = client.get(
            f"{BASE_URL}/users/{user_id}",
            params={"client_id": client_id},
        )
        resp.raise_for_status()
        return _parse_user(resp.json())


def get_user_tracks(
    user_id: int,
    client_id: str,
    limit: int = 50,
) -> list[dict]:
    """Fetch a user's uploaded tracks (up to limit)."""
    tracks = []
    next_href: Optional[str] = None
    fetched = 0

    with httpx.Client(headers=HEADERS, timeout=20) as client:
        url = f"{BASE_URL}/users/{user_id}/tracks"
        params = {
            "client_id": client_id,
            "limit": min(limit, 50),
            "representation": "full",
        }

        while fetched < limit:
            # next_href omits client_id, so re-attach it on paginated requests
            resp = client.get(
                next_href or url,
                params={"client_id": client_id} if next_href else params,
            )
            if resp.status_code == 429:
                time.sleep(30)
                continue
            resp.raise_for_status()
            data = resp.json()

            collection = data.get("collection", [])
            if not collection:
                break

            tracks.extend(collection)
            fetched += len(collection)
            next_href = data.get("next_href")

            if not next_href or fetched >= limit:
                break

            time.sleep(1.0)

    return tracks[:limit]


if __name__ == "__main__":
    import json
    client_id = extract_client_id()
    profile = get_user_profile("https://soundcloud.com/disclosure", client_id)
    print(json.dumps(asdict(profile), indent=2))

Example output:

{
  "user_id": 23489344,
  "username": "disclosure",
  "permalink_url": "https://soundcloud.com/disclosure",
  "display_name": "Disclosure",
  "followers_count": 1284900,
  "followings_count": 312,
  "track_count": 87,
  "playlist_count": 14,
  "likes_count": 2340,
  "reposts_count": 892,
  "description": "Disclosure is Guy and Howard Lawrence...",
  "city": "London",
  "country_code": "GB",
  "verified": true,
  "avatar_url": "https://i1.sndcdn.com/avatars-xxx-large.jpg",
  "created_at": "2011-10-14T09:32:11Z",
  "social": {
    "youtube_channel_url": "https://youtube.com/disclosure",
    "facebook_page": "https://facebook.com/disclosuremusic",
    "instagram_username": "disclosure",
    "website_url": "https://disclosuremusic.com"
  }
}
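The A&R use case from the introduction watches follower growth velocity, not absolute counts. Given two profile snapshots, a per-week rate normalizes uneven polling intervals. This is an illustrative helper, not part of the API:

```python
def weekly_growth_rate(old_count: int, new_count: int, days_between: float) -> float:
    # Fractional growth per week: 0.10 means +10% followers per week.
    if old_count <= 0 or days_between <= 0:
        return 0.0
    return (new_count - old_count) / old_count * (7.0 / days_between)


print(f"{weekly_growth_rate(500, 50_000, 7):.0%}")  # the 500 -> 50,000 example: 9900%
```

Run get_user_profile on a schedule, store followers_count with a timestamp, and feed consecutive snapshots to this function to rank artists by velocity.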

Part 4: Comment Extraction

SoundCloud's timestamped comments are uniquely valuable — each comment is anchored to a specific millisecond position in the track. Density spikes reveal crowd-favorite moments.

# comment_scraper.py
import httpx
import time
from dataclasses import dataclass, asdict
from typing import Optional, Iterator
from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
}


@dataclass
class CommentAuthor:
    user_id: int
    username: str
    avatar_url: str
    verified: bool
    followers_count: int


@dataclass
class Comment:
    comment_id: int
    track_id: int
    body: str
    timestamp_ms: Optional[int]  # position in track (None = general comment)
    created_at: str
    author: CommentAuthor
    self_deletable: bool = False


def _parse_comment(raw: dict, track_id: int) -> Comment:
    user = raw.get("user", {})
    return Comment(
        comment_id=raw["id"],
        track_id=track_id,
        body=raw.get("body", ""),
        timestamp_ms=raw.get("timestamp"),
        created_at=raw.get("created_at", ""),
        author=CommentAuthor(
            user_id=user.get("id", 0),
            username=user.get("username", ""),
            avatar_url=user.get("avatar_url", ""),
            verified=user.get("verified", False),
            followers_count=user.get("followers_count", 0),
        ),
    )


def iter_comments(
    track_id: int,
    client_id: str,
    batch_size: int = 50,
    max_comments: int = 1000,
) -> Iterator[Comment]:
    """
    Yield comments for a track, paginated.

    Comments are returned newest-first by default.
    Set threaded=1 to include reply threads.
    """
    fetched = 0
    next_href: Optional[str] = None

    with httpx.Client(headers=HEADERS, timeout=20) as client:
        base_url = f"{BASE_URL}/tracks/{track_id}/comments"
        params = {
            "client_id": client_id,
            "limit": batch_size,
            "threaded": 1,
            "filter_replies": 0,
        }

        while fetched < max_comments:
            url = next_href or base_url
            # next_href omits client_id, so re-attach it on paginated requests
            request_params = {"client_id": client_id} if next_href else params

            resp = client.get(url, params=request_params)

            if resp.status_code == 429:
                wait = 30
                print(f"  Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            if resp.status_code == 401:
                raise PermissionError("client_id expired")
            resp.raise_for_status()

            data = resp.json()
            collection = data.get("collection", [])
            if not collection:
                break

            for raw in collection:
                yield _parse_comment(raw, track_id)
                fetched += 1
                if fetched >= max_comments:
                    return

            next_href = data.get("next_href")
            if not next_href:
                break

            time.sleep(0.8)


def get_all_comments(
    track_id: int,
    client_id: str,
    max_comments: int = 500,
) -> list[dict]:
    """Collect all comments for a track into a list."""
    return [asdict(c) for c in iter_comments(track_id, client_id, max_comments=max_comments)]


def analyze_comment_density(comments: list[dict], track_duration_ms: int, buckets: int = 20) -> list[dict]:
    """
    Divide track into N time buckets and count comments per bucket.
    Returns a heatmap useful for identifying crowd-favorite moments.
    """
    bucket_size = track_duration_ms / buckets
    counts = [0] * buckets

    for comment in comments:
        ts = comment.get("timestamp_ms")
        if ts is not None and ts < track_duration_ms:
            bucket = min(int(ts / bucket_size), buckets - 1)
            counts[bucket] += 1

    result = []
    for i, count in enumerate(counts):
        start_ms = int(i * bucket_size)
        end_ms = int((i + 1) * bucket_size)
        result.append({
            "bucket": i,
            "start_ms": start_ms,
            "end_ms": end_ms,
            "start_formatted": _ms_to_mmss(start_ms),
            "end_formatted": _ms_to_mmss(end_ms),
            "comment_count": count,
        })
    return result


def _ms_to_mmss(ms: int) -> str:
    seconds = ms // 1000
    return f"{seconds // 60}:{seconds % 60:02d}"


if __name__ == "__main__":
    import json
    client_id = extract_client_id()
    track_id = 114688288  # replace with real track ID
    comments = get_all_comments(track_id, client_id, max_comments=200)
    print(f"Collected {len(comments)} comments")
    if comments:
        print(json.dumps(comments[0], indent=2))

Example comment output:

{
  "comment_id": 1847293847,
  "track_id": 114688288,
  "body": "this drop never gets old 🔥",
  "timestamp_ms": 142300,
  "created_at": "2024-11-15T18:42:33Z",
  "author": {
    "user_id": 9823741,
    "username": "beatmaker_uk",
    "avatar_url": "https://i1.sndcdn.com/avatars-xxx-large.jpg",
    "verified": false,
    "followers_count": 234
  }
}
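Once analyze_comment_density has produced its buckets, finding the crowd-favorite moment is a one-liner, shown here against a hand-built density list:

```python
def hottest_moment(density: list[dict]) -> dict:
    # The bucket where timestamped comments cluster most heavily.
    return max(density, key=lambda bucket: bucket["comment_count"])


density = [
    {"bucket": 0, "start_formatted": "0:00", "comment_count": 4},
    {"bucket": 1, "start_formatted": "0:30", "comment_count": 19},
    {"bucket": 2, "start_formatted": "1:00", "comment_count": 7},
]
print(hottest_moment(density)["start_formatted"])  # 0:30
```

For DJ sets, running this over real density output typically surfaces drops and transitions; for podcasts, the segments listeners reacted to.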

Part 5: Playlist and Set Scraper

# playlist_scraper.py
import httpx
import time
from dataclasses import dataclass, asdict, field
from typing import Optional
from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
}


@dataclass
class PlaylistTrack:
    track_id: int
    title: str
    artist: str
    duration_ms: int
    play_count: int
    like_count: int
    permalink_url: str
    position: int


@dataclass
class Playlist:
    playlist_id: int
    title: str
    creator: str
    creator_id: int
    permalink_url: str
    track_count: int
    duration_ms: int
    likes_count: int
    reposts_count: int
    description: str
    genre: str
    tag_list: str
    created_at: str
    is_album: bool
    tracks: list[PlaylistTrack] = field(default_factory=list)

    @property
    def total_plays(self) -> int:
        return sum(t.play_count for t in self.tracks)

    @property
    def total_likes(self) -> int:
        return sum(t.like_count for t in self.tracks)


def _parse_playlist_track(data: dict, position: int) -> PlaylistTrack:
    user = data.get("user", {})
    return PlaylistTrack(
        track_id=data["id"],
        title=data.get("title", ""),
        artist=user.get("username", ""),
        duration_ms=data.get("duration", 0),
        play_count=data.get("playback_count", 0),
        like_count=data.get("likes_count", 0),
        permalink_url=data.get("permalink_url", ""),
        position=position,
    )


def get_playlist(playlist_url: str, client_id: str) -> Playlist:
    """
    Fetch a SoundCloud playlist or album with all track metadata.

    For large playlists, the API may return track stubs (ID only).
    We detect these and fetch missing tracks individually.
    """
    with httpx.Client(headers=HEADERS, timeout=30, follow_redirects=True) as client:
        # Resolve playlist URL
        resp = client.get(
            f"{BASE_URL}/resolve",
            params={"url": playlist_url, "client_id": client_id},
        )
        resp.raise_for_status()
        data = resp.json()

        if data.get("kind") != "playlist":
            raise ValueError(f"URL resolved to {data.get('kind')}, expected playlist")

        user = data.get("user", {})
        playlist = Playlist(
            playlist_id=data["id"],
            title=data.get("title", ""),
            creator=user.get("username", ""),
            creator_id=user.get("id", 0),
            permalink_url=data.get("permalink_url", ""),
            track_count=data.get("track_count", 0),
            duration_ms=data.get("duration", 0),
            likes_count=data.get("likes_count", 0),
            reposts_count=data.get("reposts_count", 0),
            description=data.get("description", ""),
            genre=data.get("genre", ""),
            tag_list=data.get("tag_list", ""),
            created_at=data.get("created_at", ""),
            is_album=data.get("is_album", False),
        )

        # Parse tracks — some may be stubs with only id
        raw_tracks = data.get("tracks", [])
        stub_ids = []

        for i, raw_track in enumerate(raw_tracks):
            if raw_track.get("title"):  # full track data
                playlist.tracks.append(_parse_playlist_track(raw_track, i + 1))
            else:
                stub_ids.append((i + 1, raw_track["id"]))

        # Fetch stubs in batches of 50 (the /tracks endpoint caps ids per request)
        if stub_ids:
            print(f"  Fetching {len(stub_ids)} stub tracks...")
            stub_id_map = {sid: pos for pos, sid in stub_ids}
            for start in range(0, len(stub_ids), 50):
                batch = stub_ids[start:start + 50]
                ids_str = ",".join(str(sid) for _, sid in batch)
                resp2 = client.get(
                    f"{BASE_URL}/tracks",
                    params={"ids": ids_str, "client_id": client_id},
                )
                resp2.raise_for_status()
                for raw_track in resp2.json():
                    pos = stub_id_map.get(raw_track["id"], 0)
                    playlist.tracks.append(_parse_playlist_track(raw_track, pos))
                time.sleep(0.5)

        # Sort by position
        playlist.tracks.sort(key=lambda t: t.position)
        return playlist


if __name__ == "__main__":
    import json

    client_id = extract_client_id()
    playlist = get_playlist(
        "https://soundcloud.com/disclosure/sets/the-face-2023",
        client_id,
    )
    print(f"Playlist: {playlist.title}")
    print(f"Tracks: {playlist.track_count}")
    print(f"Total plays across all tracks: {playlist.total_plays:,}")
    print(f"Total likes across all tracks: {playlist.total_likes:,}")
    for track in playlist.tracks:
        print(f"  {track.position}. {track.title} — {track.play_count:,} plays")
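Playlist and DJ set durations routinely exceed an hour, so the mm:ss formatter from the comment scraper needs an hours-aware variant for display. A small helper along the same lines as _ms_to_mmss:

```python
def format_duration(ms: int) -> str:
    # 237800 -> "3:57"; 5_025_000 -> "1:23:45"
    total_seconds = ms // 1000
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes}:{seconds:02d}"
```

Handy for printing Playlist.duration_ms or summing PlaylistTrack.duration_ms values.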

Part 6: Search Functionality

# search.py
import httpx
import time
from dataclasses import dataclass, asdict
from typing import Optional, Literal
from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
}

SearchKind = Literal["tracks", "users", "playlists", "albums"]


@dataclass
class SearchFilter:
    genre: Optional[str] = None           # e.g. "electronic", "hip-hop"
    duration_from: Optional[int] = None   # seconds
    duration_to: Optional[int] = None     # seconds
    license: Optional[str] = None         # "cc-by", "cc-by-nc", etc.
    created_at: Optional[str] = None      # e.g. "last_week", "last_month", "last_year"
    bpm_from: Optional[int] = None
    bpm_to: Optional[int] = None


def search(
    query: str,
    kind: SearchKind = "tracks",
    limit: int = 50,
    filters: Optional[SearchFilter] = None,
    client_id: Optional[str] = None,
) -> list[dict]:
    """
    Search SoundCloud for tracks, users, or playlists.

    Returns raw API dicts for flexibility. Pass to parse functions
    from track_scraper.py or user_scraper.py for typed results.
    """
    if client_id is None:
        client_id = extract_client_id()

    params: dict = {
        "q": query,
        "client_id": client_id,
        "limit": min(limit, 50),
    }

    if filters:
        if filters.genre:
            params["filter.genre_or_genre_other"] = filters.genre
        if filters.duration_from:
            params["filter.duration.from"] = filters.duration_from * 1000  # API uses ms
        if filters.duration_to:
            params["filter.duration.to"] = filters.duration_to * 1000
        if filters.license:
            params["filter.license"] = filters.license
        if filters.created_at:
            params["filter.created_at"] = filters.created_at
        if filters.bpm_from:
            params["filter.bpm.from"] = filters.bpm_from
        if filters.bpm_to:
            params["filter.bpm.to"] = filters.bpm_to

    results = []
    next_href: Optional[str] = None
    fetched = 0

    with httpx.Client(headers=HEADERS, timeout=20) as client:
        url = f"{BASE_URL}/search/{kind}"

        while fetched < limit:
            if next_href:
                # next_href drops the client_id, so re-attach it when paginating
                resp = client.get(next_href, params={"client_id": client_id})
            else:
                resp = client.get(url, params=params)

            if resp.status_code == 429:
                time.sleep(30)
                continue
            if resp.status_code == 401:
                raise PermissionError("client_id expired")
            resp.raise_for_status()

            data = resp.json()
            collection = data.get("collection", [])
            if not collection:
                break

            results.extend(collection)
            fetched += len(collection)
            next_href = data.get("next_href")

            if not next_href or fetched >= limit:
                break

            time.sleep(1.0)

    return results[:limit]


def search_tracks_by_genre(
    genre: str,
    limit: int = 100,
    client_id: Optional[str] = None,
) -> list[dict]:
    """Convenience wrapper for genre-based track search."""
    return search(
        query=genre,
        kind="tracks",
        limit=limit,
        filters=SearchFilter(genre=genre),
        client_id=client_id,
    )


if __name__ == "__main__":
    import json

    client_id = extract_client_id()

    # Search for lo-fi hip hop tracks uploaded in the last month
    results = search(
        query="lo-fi hip hop",
        kind="tracks",
        limit=20,
        filters=SearchFilter(genre="hip-hop", created_at="last_month"),
        client_id=client_id,
    )

    print(f"Found {len(results)} tracks")
    for r in results[:3]:
        print(f"  {r['title']} by {r['user']['username']} — {r.get('playback_count', 0):,} plays")

Example search output:

[
  {
    "title": "Late Night Study Session",
    "user": {"username": "chillhop_beats"},
    "playback_count": 48200,
    "likes_count": 1840,
    "genre": "Hip-hop & Rap",
    "tag_list": "lofi chill study beats",
    "created_at": "2026-03-15T12:00:00Z",
    "permalink_url": "https://soundcloud.com/chillhop_beats/late-night-study"
  }
]
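The raw dicts returned by search() feed directly into the engagement ratios mentioned in the intro, with no further requests. A minimal sketch, assuming the field names shown in the example output above (min_plays is an arbitrary floor to keep tiny tracks from producing noisy ratios):

```python
def engagement_ratio(track: dict, min_plays: int = 1000) -> float:
    """Likes per play, or 0.0 for tracks below the play-count floor."""
    plays = track.get("playback_count", 0)
    if plays < min_plays:
        return 0.0
    return track.get("likes_count", 0) / plays


def rank_by_engagement(tracks: list[dict], top_n: int = 10) -> list[dict]:
    """Sort search-result dicts by likes per play, highest first."""
    return sorted(tracks, key=engagement_ratio, reverse=True)[:top_n]
```

A high ratio on a low-play track is exactly the kind of early signal A&R monitoring looks for.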

Part 7: Waveform Data Analysis

Each SoundCloud track includes a waveform_url pointing to amplitude data on the CDN. The URL usually ends in .png (a pre-rendered image); swapping the extension for .json yields the raw sample array. This lets you analyze a track's loudness structure without downloading the actual audio file.

# waveform.py
import httpx
import json
import math
import statistics
from dataclasses import dataclass
from typing import Optional


@dataclass
class WaveformAnalysis:
    track_id: int
    sample_count: int
    peak_amplitude: float
    rms_amplitude: float
    dynamic_range: float         # peak / RMS ratio
    quiet_sections: list[tuple[float, float]]  # (start_pct, end_pct)
    loud_sections: list[tuple[float, float]]
    average_amplitude: float
    silence_threshold: float


def fetch_waveform(waveform_url: str) -> dict:
    """Download waveform JSON from SoundCloud's CDN.

    The API typically hands back a .png URL; the same path with a
    .json extension serves the raw sample data.
    """
    resp = httpx.get(waveform_url.replace(".png", ".json"), timeout=15)
    resp.raise_for_status()
    return resp.json()


def analyze_waveform(
    track_id: int,
    waveform_url: str,
    quiet_threshold_pct: float = 0.2,
    loud_threshold_pct: float = 0.8,
) -> WaveformAnalysis:
    """
    Fetch and analyze a track's waveform data.

    Returns amplitude statistics and identifies quiet/loud sections.
    Samples are normalized 0-1 relative to the waveform's max value.
    """
    raw = fetch_waveform(waveform_url)
    samples: list[int] = raw.get("samples", [])
    if not samples:
        raise ValueError("No waveform samples in response")

    max_val = raw.get("height", 255)
    normalized = [s / max_val for s in samples]

    peak = max(normalized)
    avg = statistics.mean(normalized)
    rms = math.sqrt(statistics.mean(s ** 2 for s in normalized))
    dynamic_range = peak / rms if rms > 0 else 0

    quiet_thresh = quiet_threshold_pct
    loud_thresh = loud_threshold_pct

    # Find contiguous quiet and loud sections
    def find_sections(values: list[float], condition) -> list[tuple[float, float]]:
        sections = []
        in_section = False
        start_idx = 0
        n = len(values)
        for i, v in enumerate(values):
            if condition(v) and not in_section:
                in_section = True
                start_idx = i
            elif not condition(v) and in_section:
                in_section = False
                sections.append((start_idx / n, i / n))
        if in_section:
            sections.append((start_idx / n, 1.0))
        return sections

    quiet_sections = find_sections(normalized, lambda v: v < quiet_thresh)
    loud_sections = find_sections(normalized, lambda v: v > loud_thresh)

    return WaveformAnalysis(
        track_id=track_id,
        sample_count=len(samples),
        peak_amplitude=round(peak, 4),
        rms_amplitude=round(rms, 4),
        dynamic_range=round(dynamic_range, 4),
        quiet_sections=quiet_sections,
        loud_sections=loud_sections,
        average_amplitude=round(avg, 4),
        silence_threshold=quiet_thresh,
    )


if __name__ == "__main__":
    from dataclasses import asdict
    from client_id import extract_client_id
    import httpx

    client_id = extract_client_id()

    # Get waveform URL from a track
    resp = httpx.get(
        "https://api-v2.soundcloud.com/resolve",
        params={
            "url": "https://soundcloud.com/disclosure/latch-feat-sam-smith",
            "client_id": client_id,
        },
        timeout=15,
    )
    resp.raise_for_status()
    track_data = resp.json()
    waveform_url = track_data["waveform_url"]
    track_id = track_data["id"]

    analysis = analyze_waveform(track_id, waveform_url)
    print(f"Samples: {analysis.sample_count}")
    print(f"Peak amplitude: {analysis.peak_amplitude}")
    print(f"RMS amplitude: {analysis.rms_amplitude}")
    print(f"Dynamic range: {analysis.dynamic_range:.2f}x")
    print(f"Loud sections: {len(analysis.loud_sections)} drops/peaks detected")
    for start, end in analysis.loud_sections[:5]:
        print(f"  {start*100:.1f}% — {end*100:.1f}%")
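A natural follow-up question is where the main drop or chorus sits. Under the rough assumption that the longest contiguous loud section marks it, a small helper (hypothetical, not part of waveform.py above) picks it out of the analysis result:

```python
from typing import Optional


def longest_section(
    sections: list[tuple[float, float]],
) -> Optional[tuple[float, float]]:
    """Pick the (start_pct, end_pct) pair covering the largest share of the track."""
    if not sections:
        return None
    return max(sections, key=lambda s: s[1] - s[0])
```

Calling longest_section(analysis.loud_sections) then gives a crude drop locator as a percentage range of the track.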

Part 8: Genre Charts

SoundCloud's internal API also exposes the trending and top charts for each genre, the same data behind the Charts pages on the site.

# charts.py
import httpx
import time
from dataclasses import dataclass, asdict
from typing import Optional, Literal
from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
}

ChartKind = Literal["trending", "top"]

# Genre slugs SoundCloud accepts
GENRES = [
    "all-music", "alternativerock", "ambient", "classical", "country",
    "danceedm", "dancehall", "deephouse", "disco", "drumbass",
    "dubstep", "electronic", "folksingersongwriter", "hiphoprap",
    "house", "indie", "jazzblues", "latin", "metal", "piano",
    "pop", "rbsoul", "reggae", "reggaeton", "rock", "soundtrack",
    "techno", "trance", "trap", "triphop", "world",
]


@dataclass
class ChartEntry:
    rank: int
    track_id: int
    title: str
    artist: str
    artist_id: int
    play_count: int
    like_count: int
    comment_count: int
    genre: str
    permalink_url: str
    created_at: str
    score: Optional[float] = None  # trend score when available


def get_charts(
    kind: ChartKind = "trending",
    genre: str = "all-music",
    limit: int = 50,
    client_id: Optional[str] = None,
) -> list[ChartEntry]:
    """
    Fetch SoundCloud trending or top tracks chart.

    kind="trending" — rising tracks (velocity-based)
    kind="top"      — all-time most played in genre
    """
    if client_id is None:
        client_id = extract_client_id()

    params = {
        "client_id": client_id,
        "kind": kind,
        "genre": f"soundcloud:genres:{genre}",
        "limit": min(limit, 50),
        "offset": 0,
        "linked_partitioning": 1,
    }

    results = []
    with httpx.Client(headers=HEADERS, timeout=20) as client:
        resp = client.get(f"{BASE_URL}/charts", params=params)

        if resp.status_code == 429:
            raise ConnectionError("Rate limited")
        if resp.status_code == 401:
            raise PermissionError("client_id expired")
        resp.raise_for_status()

        data = resp.json()
        collection = data.get("collection", [])

        for i, entry in enumerate(collection):
            track = entry.get("track", {})
            if not track:
                continue
            user = track.get("user", {})
            results.append(ChartEntry(
                rank=i + 1,
                track_id=track.get("id", 0),
                title=track.get("title", ""),
                artist=user.get("username", ""),
                artist_id=user.get("id", 0),
                play_count=track.get("playback_count", 0),
                like_count=track.get("likes_count", 0),
                comment_count=track.get("comment_count", 0),
                genre=track.get("genre", ""),
                permalink_url=track.get("permalink_url", ""),
                created_at=track.get("created_at", ""),
                score=entry.get("score"),
            ))

    return results[:limit]


if __name__ == "__main__":
    import json
    from dataclasses import asdict

    client_id = extract_client_id()
    chart = get_charts(kind="trending", genre="hiphoprap", limit=20, client_id=client_id)

    print(f"Top {len(chart)} trending hip-hop tracks:")
    for entry in chart:
        print(f"  #{entry.rank} {entry.title} by {entry.artist} — {entry.play_count:,} plays")
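Chart snapshots become far more useful when compared over time. A hedged sketch that diffs two runs of get_charts(), assuming entries have been converted to dicts with dataclasses.asdict(), surfaces new entrants and rank movers:

```python
def diff_charts(previous: list[dict], current: list[dict]) -> dict:
    """Compare two chart snapshots (lists of asdict(ChartEntry) dicts).

    Returns tracks new to the chart and tracks whose rank changed
    (positive rank_change means the track climbed).
    """
    prev_ranks = {e["track_id"]: e["rank"] for e in previous}
    new_entries, movers = [], []
    for entry in current:
        old_rank = prev_ranks.get(entry["track_id"])
        if old_rank is None:
            new_entries.append(entry)
        elif old_rank != entry["rank"]:
            movers.append({**entry, "rank_change": old_rank - entry["rank"]})
    return {"new": new_entries, "movers": movers}
```

Run daily, this turns the trending chart into an alert feed rather than a static list.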

Part 9: Related Track Recommendations

The related-tracks endpoint returns what SoundCloud queues as "Up Next" in the player, which makes it useful for mapping how listeners move between artists.

# related_tracks.py
import httpx
from dataclasses import dataclass, asdict
from typing import Optional
from client_id import extract_client_id

BASE_URL = "https://api-v2.soundcloud.com"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
}


@dataclass
class RelatedTrack:
    track_id: int
    title: str
    artist: str
    artist_id: int
    play_count: int
    like_count: int
    genre: str
    permalink_url: str
    duration_ms: int


def get_related_tracks(
    track_id: int,
    client_id: Optional[str] = None,
    limit: int = 20,
) -> list[RelatedTrack]:
    """
    Fetch tracks related/recommended to a given track.

    SoundCloud uses this for the "Up Next" queue in the player.
    """
    if client_id is None:
        client_id = extract_client_id()

    with httpx.Client(headers=HEADERS, timeout=20) as client:
        resp = client.get(
            f"{BASE_URL}/tracks/{track_id}/related",
            params={
                "client_id": client_id,
                "limit": min(limit, 50),
            },
        )
        if resp.status_code == 401:
            raise PermissionError("client_id expired")
        resp.raise_for_status()

        data = resp.json()
        collection = data.get("collection", [])

        results = []
        for track in collection:
            user = track.get("user", {})
            results.append(RelatedTrack(
                track_id=track.get("id", 0),
                title=track.get("title", ""),
                artist=user.get("username", ""),
                artist_id=user.get("id", 0),
                play_count=track.get("playback_count", 0),
                like_count=track.get("likes_count", 0),
                genre=track.get("genre", ""),
                permalink_url=track.get("permalink_url", ""),
                duration_ms=track.get("duration", 0),
            ))

        return results[:limit]
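Chained calls to this endpoint map out SoundCloud's recommendation graph around a seed track. A sketch of a bounded breadth-first crawl; the fetch parameter is injected so you can pass a wrapper around get_related_tracks (or a stub when testing):

```python
from collections import deque
from typing import Callable


def crawl_related(
    seed_id: int,
    fetch: Callable[[int], list[int]],  # track_id -> related track_ids
    max_nodes: int = 100,
) -> dict[int, list[int]]:
    """Breadth-first crawl of the related-tracks graph, bounded by max_nodes."""
    graph: dict[int, list[int]] = {}
    queue = deque([seed_id])
    while queue and len(graph) < max_nodes:
        tid = queue.popleft()
        if tid in graph:
            continue  # already expanded this node
        neighbors = fetch(tid)
        graph[tid] = neighbors
        queue.extend(n for n in neighbors if n not in graph)
    return graph
```

In production, fetch would be something like lambda tid: [t.track_id for t in get_related_tracks(tid)], with a polite sleep between calls.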

Part 10: Bulk Artist Monitoring

This is the most practical tool for ongoing intelligence work — track new uploads and follower growth for a roster of artists over time.

# monitor.py
import httpx
import sqlite3
import json
import time
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional
from pathlib import Path
from client_id import extract_client_id
from user_scraper import get_user_profile, get_user_tracks, UserProfile

DB_PATH = Path("soundcloud_monitor.db")
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/123.0.0.0 Safari/537.36"
    ),
    "Referer": "https://soundcloud.com/",
}


def init_db(db_path: Path = DB_PATH) -> sqlite3.Connection:
    """Initialize monitoring database with required schema."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS artists (
            user_id     INTEGER PRIMARY KEY,
            username    TEXT NOT NULL,
            profile_url TEXT NOT NULL,
            added_at    TEXT NOT NULL
        );

        CREATE TABLE IF NOT EXISTS artist_snapshots (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id         INTEGER NOT NULL,
            snapshot_at     TEXT NOT NULL,
            followers_count INTEGER,
            followings_count INTEGER,
            track_count     INTEGER,
            playlist_count  INTEGER,
            FOREIGN KEY (user_id) REFERENCES artists(user_id)
        );

        CREATE TABLE IF NOT EXISTS tracks (
            track_id        INTEGER PRIMARY KEY,
            user_id         INTEGER NOT NULL,
            title           TEXT,
            permalink_url   TEXT,
            created_at      TEXT,
            first_seen_at   TEXT NOT NULL,
            FOREIGN KEY (user_id) REFERENCES artists(user_id)
        );

        CREATE TABLE IF NOT EXISTS track_snapshots (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            track_id        INTEGER NOT NULL,
            snapshot_at     TEXT NOT NULL,
            play_count      INTEGER,
            like_count      INTEGER,
            comment_count   INTEGER,
            repost_count    INTEGER,
            FOREIGN KEY (track_id) REFERENCES tracks(track_id)
        );

        CREATE INDEX IF NOT EXISTS idx_artist_snapshots_user_id
            ON artist_snapshots(user_id);
        CREATE INDEX IF NOT EXISTS idx_track_snapshots_track_id
            ON track_snapshots(track_id);
        CREATE INDEX IF NOT EXISTS idx_tracks_user_id
            ON tracks(user_id);
    """)
    conn.commit()
    return conn


def add_artist(conn: sqlite3.Connection, profile_url: str, client_id: str) -> int:
    """Add an artist to the monitoring roster. Returns user_id."""
    profile = get_user_profile(profile_url, client_id)
    now = datetime.now(timezone.utc).isoformat()

    conn.execute(
        """INSERT OR REPLACE INTO artists (user_id, username, profile_url, added_at)
           VALUES (?, ?, ?, ?)""",
        (profile.user_id, profile.username, profile_url, now),
    )
    conn.commit()
    print(f"Added artist: {profile.username} ({profile.followers_count:,} followers)")
    return profile.user_id


def snapshot_artist(
    conn: sqlite3.Connection,
    user_id: int,
    profile_url: str,
    client_id: str,
) -> dict:
    """Take a point-in-time snapshot of an artist's stats and recent tracks."""
    profile = get_user_profile(profile_url, client_id)
    now = datetime.now(timezone.utc).isoformat()

    # Save follower snapshot
    conn.execute(
        """INSERT INTO artist_snapshots
           (user_id, snapshot_at, followers_count, followings_count, track_count, playlist_count)
           VALUES (?, ?, ?, ?, ?, ?)""",
        (
            user_id, now,
            profile.followers_count, profile.followings_count,
            profile.track_count, profile.playlist_count,
        ),
    )

    # Fetch recent tracks
    time.sleep(0.5)
    tracks = get_user_tracks(user_id, client_id, limit=20)
    new_tracks = []

    for track in tracks:
        track_id = track["id"]
        existing = conn.execute(
            "SELECT track_id FROM tracks WHERE track_id = ?", (track_id,)
        ).fetchone()

        if not existing:
            # New track discovered
            conn.execute(
                """INSERT INTO tracks (track_id, user_id, title, permalink_url, created_at, first_seen_at)
                   VALUES (?, ?, ?, ?, ?, ?)""",
                (
                    track_id, user_id,
                    track.get("title", ""),
                    track.get("permalink_url", ""),
                    track.get("created_at", ""),
                    now,
                ),
            )
            new_tracks.append(track.get("title", ""))

        # Save track stat snapshot
        conn.execute(
            """INSERT INTO track_snapshots
               (track_id, snapshot_at, play_count, like_count, comment_count, repost_count)
               VALUES (?, ?, ?, ?, ?, ?)""",
            (
                track_id, now,
                track.get("playback_count", 0),
                track.get("likes_count", 0),
                track.get("comment_count", 0),
                track.get("reposts_count", 0),
            ),
        )

    conn.commit()
    return {
        "username": profile.username,
        "followers": profile.followers_count,
        "new_tracks": new_tracks,
        "snapshot_at": now,
    }


def run_monitoring_cycle(
    profile_urls: list[str],
    db_path: Path = DB_PATH,
    delay_between_artists: float = 3.0,
) -> list[dict]:
    """
    Run one monitoring cycle over all tracked artists.
    Call this on a schedule (e.g., daily via cron).
    """
    client_id = extract_client_id()
    conn = init_db(db_path)
    results = []

    for url in profile_urls:
        try:
            # Ensure artist is registered
            existing = conn.execute(
                "SELECT user_id FROM artists WHERE profile_url = ?", (url,)
            ).fetchone()

            if not existing:
                user_id = add_artist(conn, url, client_id)
            else:
                user_id = existing["user_id"]

            snapshot = snapshot_artist(conn, user_id, url, client_id)
            results.append(snapshot)

            if snapshot["new_tracks"]:
                print(f"  NEW TRACKS from {snapshot['username']}: {snapshot['new_tracks']}")
            else:
                print(f"  {snapshot['username']}: {snapshot['followers']:,} followers, no new tracks")

        except PermissionError:
            # Skip this artist; it will be picked up on the next cycle
            print("  client_id expired, refreshing (artist retried next cycle)...")
            client_id = extract_client_id(force_refresh=True)
        except Exception as e:
            print(f"  Error monitoring {url}: {e}")

        time.sleep(delay_between_artists)

    conn.close()
    return results


def get_follower_growth(
    conn: sqlite3.Connection,
    user_id: int,
    days: int = 30,
) -> list[dict]:
    """Get follower growth history for an artist over the past N days."""
    # snapshot_at is stored as ISO-8601 with a 'T' separator, so the cutoff
    # must use the same format (datetime('now') would emit a space instead)
    rows = conn.execute(
        """SELECT snapshot_at, followers_count
           FROM artist_snapshots
           WHERE user_id = ?
             AND snapshot_at >= strftime('%Y-%m-%dT%H:%M:%S', 'now', ?)
           ORDER BY snapshot_at""",
        (user_id, f"-{days} days"),
    ).fetchall()
    return [dict(row) for row in rows]


if __name__ == "__main__":
    roster = [
        "https://soundcloud.com/disclosure",
        "https://soundcloud.com/flume",
        "https://soundcloud.com/kaytranada",
    ]
    results = run_monitoring_cycle(roster)
    print(json.dumps(results, indent=2))
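The snapshot history from get_follower_growth() reduces to a single daily growth rate, the leading indicator discussed in the intro. A sketch over the returned rows, assuming the ISO-8601 snapshot_at strings written by snapshot_artist:

```python
from datetime import datetime


def followers_per_day(rows: list[dict]) -> float:
    """Average daily follower change between the first and last snapshot."""
    if len(rows) < 2:
        return 0.0
    first, last = rows[0], rows[-1]
    t0 = datetime.fromisoformat(first["snapshot_at"])
    t1 = datetime.fromisoformat(last["snapshot_at"])
    days = (t1 - t0).total_seconds() / 86400
    if days <= 0:
        return 0.0
    return (last["followers_count"] - first["followers_count"]) / days
```

Sorting a roster by this number is a quick way to spot which artists are accelerating.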

Anti-Detection Deep Dive

Client ID Rotation

The client_id is the most common failure point. SoundCloud rotates these every few days. Your scraper must handle this gracefully:

def with_client_id_retry(func, *args, max_retries: int = 3, **kwargs):
    """Decorator-style wrapper that auto-refreshes client_id on 401."""
    client_id = extract_client_id()
    for attempt in range(max_retries):
        try:
            return func(*args, client_id=client_id, **kwargs)
        except PermissionError:
            if attempt == max_retries - 1:
                raise
            print(f"client_id expired, refreshing (attempt {attempt + 1})...")
            client_id = extract_client_id(force_refresh=True)
            time.sleep(1)

Rate Limiting with Exponential Backoff

SoundCloud returns 429 after roughly 15-20 requests per minute from a single IP. The backoff strategy:

import time
import random

def request_with_backoff(
    client: httpx.Client,
    url: str,
    params: dict,
    max_retries: int = 5,
    base_delay: float = 5.0,
) -> httpx.Response:
    """Make a request with exponential backoff on 429 responses."""
    for attempt in range(max_retries):
        resp = client.get(url, params=params)

        if resp.status_code == 429:
            # Check Retry-After header
            retry_after = resp.headers.get("Retry-After")
            if retry_after:
                wait = float(retry_after)
            else:
                wait = base_delay * (2 ** attempt)

            # Add jitter to avoid thundering herd
            jitter = random.uniform(0, wait * 0.2)
            total_wait = wait + jitter

            print(f"Rate limited (attempt {attempt + 1}/{max_retries}). Waiting {total_wait:.1f}s...")
            time.sleep(total_wait)
            continue

        return resp

    raise ConnectionError(f"Still rate-limited after {max_retries} retries")

Maintaining a consistent session reduces 429 frequency:

import httpx
import random
import time
from typing import Optional


def create_session(proxy_url: Optional[str] = None) -> httpx.Client:
    """
    Create a persistent HTTP session that looks like a real browser.

    Cookies are maintained across requests. The session visits the
    SoundCloud homepage first to establish a realistic session state.
    """
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None

    client = httpx.Client(
        transport=transport,
        headers={
            "User-Agent": (
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/123.0.0.0 Safari/537.36"
            ),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
        },
        follow_redirects=True,
        timeout=25,
    )

    # Warm up with a homepage visit to get session cookies
    try:
        client.get("https://soundcloud.com")
        time.sleep(random.uniform(1.0, 2.0))
    except httpx.HTTPError:
        pass  # Continue even if warmup fails

    # Switch to API headers
    client.headers.update({
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Referer": "https://soundcloud.com/",
        "X-Requested-With": "XMLHttpRequest",
    })

    return client

Request Timing with Jitter

Uniform timing is a detection signal. Add jitter:

import random
import time

def sleep_with_jitter(base_seconds: float, jitter_pct: float = 0.3) -> None:
    """Sleep for base_seconds ± jitter_pct of base_seconds."""
    jitter = base_seconds * jitter_pct * random.uniform(-1, 1)
    actual = max(0.1, base_seconds + jitter)
    time.sleep(actual)

IP Reputation and Proxy Strategy

Datacenter IP ranges (AWS, GCP, DigitalOcean) are the first thing SoundCloud rate-limits aggressively: its fingerprinting checks the requesting IP's ASN against known datacenter ranges.

For serious collection work, residential proxies are the practical solution. ThorData's residential proxy network is worth evaluating here — they offer rotating residential IPs that cycle between requests, which eliminates the per-IP rate limit problem. Their geo-targeting also lets you verify whether SoundCloud serves different chart data or search results by region (it does — trending charts differ between US, UK, and Germany).

# Proxy rotation example with ThorData
import random

PROXY_ENDPOINTS = [
    # Placeholder credentials and host — substitute your ThorData gateway details.
    # ThorData rotates the exit IP automatically on each connection.
    "http://USERNAME:PASSWORD@GATEWAY_HOST:PORT",
]

def get_proxy() -> str:
    return random.choice(PROXY_ENDPOINTS)

# Use with create_session():
client = create_session(proxy_url=get_proxy())

Data Storage

SQLite Schema for Full Pipeline

# storage.py
import sqlite3
import json
from pathlib import Path
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS tracks (
    track_id        INTEGER PRIMARY KEY,
    title           TEXT,
    artist          TEXT,
    artist_id       INTEGER,
    permalink_url   TEXT,
    play_count      INTEGER,
    like_count      INTEGER,
    comment_count   INTEGER,
    repost_count    INTEGER,
    download_count  INTEGER,
    duration_ms     INTEGER,
    genre           TEXT,
    tag_list        TEXT,
    description     TEXT,
    created_at      TEXT,
    license         TEXT,
    waveform_url    TEXT,
    downloadable    INTEGER,
    bpm             REAL,
    isrc            TEXT,
    scraped_at      TEXT
);

CREATE TABLE IF NOT EXISTS users (
    user_id             INTEGER PRIMARY KEY,
    username            TEXT,
    display_name        TEXT,
    permalink_url       TEXT,
    followers_count     INTEGER,
    followings_count    INTEGER,
    track_count         INTEGER,
    playlist_count      INTEGER,
    verified            INTEGER,
    description         TEXT,
    city                TEXT,
    country_code        TEXT,
    website_url         TEXT,
    created_at          TEXT,
    scraped_at          TEXT
);

CREATE TABLE IF NOT EXISTS comments (
    comment_id      INTEGER PRIMARY KEY,
    track_id        INTEGER,
    user_id         INTEGER,
    username        TEXT,
    body            TEXT,
    timestamp_ms    INTEGER,
    created_at      TEXT,
    scraped_at      TEXT,
    FOREIGN KEY (track_id) REFERENCES tracks(track_id)
);

CREATE INDEX IF NOT EXISTS idx_comments_track_id ON comments(track_id);
CREATE INDEX IF NOT EXISTS idx_tracks_artist_id ON tracks(artist_id);
CREATE INDEX IF NOT EXISTS idx_tracks_genre ON tracks(genre);
CREATE INDEX IF NOT EXISTS idx_tracks_play_count ON tracks(play_count DESC);
"""


def init_storage(db_path: str = "soundcloud.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn


def save_track(conn: sqlite3.Connection, track: dict) -> None:
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """INSERT OR REPLACE INTO tracks
           (track_id, title, artist, artist_id, permalink_url,
            play_count, like_count, comment_count, repost_count, download_count,
            duration_ms, genre, tag_list, description, created_at, license,
            waveform_url, downloadable, bpm, isrc, scraped_at)
           VALUES
           (:track_id, :title, :artist, :artist_id, :permalink_url,
            :play_count, :like_count, :comment_count, :repost_count, :download_count,
            :duration_ms, :genre, :tag_list, :description, :created_at, :license,
            :waveform_url, :downloadable, :bpm, :isrc, :scraped_at)""",
        {**track, "scraped_at": now},
    )
    conn.commit()


def save_comments(conn: sqlite3.Connection, comments: list[dict]) -> None:
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        """INSERT OR REPLACE INTO comments
           (comment_id, track_id, user_id, username, body, timestamp_ms, created_at, scraped_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        [
            (
                c["comment_id"],
                c["track_id"],
                c["author"]["user_id"],
                c["author"]["username"],
                c["body"],
                c.get("timestamp_ms"),
                c["created_at"],
                now,
            )
            for c in comments
        ],
    )
    conn.commit()


def export_csv(conn: sqlite3.Connection, table: str, output_path: str) -> None:
    """Export a table to CSV."""
    import csv
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    if not rows:
        print(f"No data in {table}")
        return
    headers = [d[0] for d in conn.execute(f"SELECT * FROM {table} LIMIT 0").description]
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
    print(f"Exported {len(rows)} rows to {output_path}")

Complete End-to-End Pipeline

This script ties everything together: extract client ID, search a genre, get full metadata, collect comments, store in SQLite.

# pipeline.py
"""
SoundCloud data collection pipeline.

Usage:
    python3 pipeline.py --genre "lo-fi hip hop" --limit 50 --comments 100
"""
import argparse
import json
import time
import sqlite3
from dataclasses import asdict
from pathlib import Path

from client_id import extract_client_id
from search import search, SearchFilter
from track_scraper import _parse_track, resolve_url, HEADERS
from comment_scraper import get_all_comments
from storage import init_storage, save_track, save_comments, export_csv

import httpx


def run_pipeline(
    genre: str,
    limit: int = 50,
    comments_per_track: int = 100,
    db_path: str = "soundcloud.db",
    proxy_url: str = None,
    delay: float = 1.5,
) -> None:
    print(f"=== SoundCloud Pipeline: {genre} ===")
    print(f"Target: {limit} tracks, up to {comments_per_track} comments each")

    # Step 1: Extract client ID
    print("\n[1/4] Extracting client_id...")
    client_id = extract_client_id()
    print(f"  Got client_id: {client_id[:8]}...")

    # Step 2: Initialize database
    print("\n[2/4] Initializing database...")
    conn = init_storage(db_path)
    print(f"  Database: {db_path}")

    # Step 3: Search for tracks
    print(f"\n[3/4] Searching for '{genre}' tracks...")
    search_results = search(
        query=genre,
        kind="tracks",
        limit=limit,
        filters=SearchFilter(genre=genre.lower().replace(" ", "")),  # crude slug; adjust for SoundCloud's genre naming
        client_id=client_id,
    )
    print(f"  Found {len(search_results)} tracks")

    # Step 4: Enrich each track with full metadata + comments
    print(f"\n[4/4] Fetching full metadata and comments...")

    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
    client = httpx.Client(
        headers=HEADERS,
        transport=transport,
        timeout=20,
        follow_redirects=True,
    )

    summary = []

    try:
        for i, raw in enumerate(search_results):
            track_url = raw.get("permalink_url", "")
            if not track_url:
                continue

            print(f"\n  [{i+1}/{len(search_results)}] {raw.get('title', '?')}")

            # Get full track data (search results may have incomplete fields)
            try:
                full_data = resolve_url(track_url, client_id, client)
                track = _parse_track(full_data)
                track_dict = asdict(track)
                save_track(conn, track_dict)

                print(f"    Plays: {track.play_count:,}  Likes: {track.like_count:,}  Comments: {track.comment_count:,}")
            except PermissionError:
                print("    client_id expired, refreshing...")
                client_id = extract_client_id(force_refresh=True)
                continue
            except Exception as e:
                print(f"    Error fetching track: {e}")
                time.sleep(delay)
                continue

            time.sleep(delay)

            # Collect comments if track has any
            if track.comment_count > 0 and comments_per_track > 0:
                try:
                    comments = get_all_comments(
                        track.track_id,
                        client_id,
                        max_comments=comments_per_track,
                    )
                    save_comments(conn, comments)
                    print(f"    Saved {len(comments)} comments")
                except Exception as e:
                    print(f"    Error fetching comments: {e}")

                time.sleep(delay)

            summary.append({
                "title": track.title,
                "artist": track.artist,
                "play_count": track.play_count,
                "like_count": track.like_count,
                "comment_count": track.comment_count,
                "permalink_url": track.permalink_url,
            })

    finally:
        client.close()

    # Export results
    print("\n=== Export ===")
    export_csv(conn, "tracks", f"soundcloud_tracks_{genre.replace(' ', '_')}.csv")
    export_csv(conn, "comments", f"soundcloud_comments_{genre.replace(' ', '_')}.csv")

    conn.close()

    # Summary
    print(f"\n=== Done ===")
    print(f"Collected {len(summary)} tracks")
    top = sorted(summary, key=lambda x: x["play_count"], reverse=True)[:5]
    print("Top 5 by plays:")
    for t in top:
        print(f"  {t['title']} by {t['artist']} — {t['play_count']:,} plays")

    # Save summary JSON
    with open(f"pipeline_summary_{genre.replace(' ', '_')}.json", "w") as f:
        json.dump(summary, f, indent=2)
    print(f"Summary saved to pipeline_summary_{genre.replace(' ', '_')}.json")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="SoundCloud data pipeline")
    parser.add_argument("--genre", default="electronic", help="Genre to search")
    parser.add_argument("--limit", type=int, default=50, help="Number of tracks")
    parser.add_argument("--comments", type=int, default=100, help="Comments per track")
    parser.add_argument("--db", default="soundcloud.db", help="SQLite database path")
    parser.add_argument("--proxy", default=None, help="Proxy URL")
    parser.add_argument("--delay", type=float, default=1.5, help="Delay between requests")
    args = parser.parse_args()

    run_pipeline(
        genre=args.genre,
        limit=args.limit,
        comments_per_track=args.comments,
        db_path=args.db,
        proxy_url=args.proxy,
        delay=args.delay,
    )

Run it:

uv pip install httpx
python3 pipeline.py --genre "lo-fi hip hop" --limit 100 --comments 200

Terms of Service. SoundCloud's ToS prohibits automated access. Using these techniques for commercial data products, reselling data, or high-volume collection without permission puts you in violation. Personal research, academic study, and building tools for your own tracks occupy a greyer area.

Copyright. The metadata and comments are data about music, not the music itself. Do not use these tools to download or redistribute audio files. Stream URLs extracted from the API are time-limited and tied to the requesting IP.

Rate limits. Stay within rates that don't degrade service for other users. If you're collecting millions of tracks, you're no longer in research territory — contact SoundCloud about data partnerships.

Data privacy. Comments and user profiles are public, but collecting and storing them at scale may implicate GDPR if you're in the EU or collecting data on EU residents. Anonymize where possible and don't build personal profiles on individual users.
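One minimal anonymization approach is to hash usernames with a salt that you keep out of the exported data: per-user engagement counts still aggregate correctly, but identities can't be recovered from the CSV. A sketch (function name and salt handling are my own, not part of the pipeline above):

```python
import hashlib

def anonymize_username(username: str, salt: str) -> str:
    """Replace a username with a salted SHA-256 digest. Comments by
    the same user still group together, but the stored value cannot
    be mapped back to a SoundCloud account without the salt."""
    return hashlib.sha256(f"{salt}:{username}".encode()).hexdigest()[:16]
```

Store the salt separately from the database, and rotate it per project so pseudonyms can't be cross-referenced between datasets.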

robots.txt. SoundCloud's robots.txt disallows crawlers on most paths. The internal API is not explicitly addressed, but the spirit of the rules applies.


Troubleshooting

401 Unauthorized — client_id Expired

The most common error. The client ID has rotated.

httpx.HTTPStatusError: Client error '401 Unauthorized'

Fix: Call extract_client_id(force_refresh=True). The cache TTL is 1 hour by default; reduce it if you're seeing frequent 401s. If extraction fails, the JS bundle URL patterns may have changed — inspect https://soundcloud.com source and update the regex in client_id.py.
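The refresh-and-retry pattern can be factored into a small helper. A sketch, assuming `extract_client_id` from earlier in the guide and any `fetch` callable that raises `PermissionError` on a 401 (as the pipeline code does):

```python
def with_client_id_refresh(fetch, extract_client_id, max_retries: int = 1):
    """Call fetch(client_id); on a 401 (surfaced as PermissionError),
    force-refresh the cached client_id and retry up to max_retries
    times before giving up."""
    client_id = extract_client_id(force_refresh=False)
    for attempt in range(max_retries + 1):
        try:
            return fetch(client_id)
        except PermissionError:
            if attempt == max_retries:
                raise
            client_id = extract_client_id(force_refresh=True)
```

Wrapping every API call this way keeps the rotation handling in one place instead of scattering try/except blocks through the pipeline.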

429 Too Many Requests — Rate Limited

httpx.HTTPStatusError: Client error '429 Too Many Requests'

Fix: SoundCloud tolerates roughly 15-20 requests per minute from a single IP before throttling. Options, in order of preference:

  1. Increase the delay parameter to 3-5 seconds
  2. Add random jitter to each delay
  3. Use residential proxies to distribute the load
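Jitter is a one-liner. This sketch (the helper name is my own) wraps the fixed delay with a random component so requests don't land on an easily fingerprinted fixed cadence:

```python
import random
import time

def polite_sleep(base: float, jitter: float = 1.0) -> float:
    """Sleep for base seconds plus up to `jitter` extra seconds,
    chosen uniformly at random; return the actual wait for logging."""
    wait = base + random.uniform(0.0, jitter)
    time.sleep(wait)
    return wait
```

Replace the pipeline's `time.sleep(delay)` calls with `polite_sleep(delay)` to get both the base throttle and the jitter.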

403 Forbidden — Geo-Blocked Content

Some tracks are region-restricted. The API returns 403 when you try to resolve them from a blocked region.

Fix: Use a proxy in an allowed region. Most SoundCloud content is globally accessible; restricted tracks are the exception, typically due to label agreements.

Empty collection Arrays

Search or chart endpoints return an empty collection with a valid 200 response.

Causes:

  - The genre slug doesn't match SoundCloud's accepted values (see the GENRES list in charts.py)
  - The filter combination returns no results
  - The search query is too narrow

Fix: Try without filters first. Use "all-music" as the genre. Check GENRES for valid slugs.
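That fallback order can be automated. A sketch, where `search_fn` is a placeholder for whatever wraps the `search_tracks` call from earlier (taking a genre slug, or None for no filter):

```python
def search_with_fallback(search_fn, genre: str):
    """Progressively relax the query: the requested genre slug first,
    then the catch-all "all-music" slug, then no genre filter at all.
    Returns the first non-empty result set plus the slug that worked."""
    for slug in (genre, "all-music", None):
        results = search_fn(slug)
        if results:
            return results, slug
    return [], None
```

Logging which slug succeeded also tells you when a genre name in your config has drifted out of SoundCloud's accepted list.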

waveform_url Returns 404

Waveform CDN URLs are stable but occasionally expire or get regenerated.

Fix: Re-fetch the track metadata to get a fresh waveform_url. Waveform images are served from wave.sndcdn.com; if that domain is unreachable, the problem is on the CDN side, not with your scraper.

Stale Play Counts

Play counts in API responses can lag by a few minutes during high-traffic periods.

Fix: For accuracy-sensitive applications, query the same track twice with a 60-second gap and use the higher value (counts only go up).
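The double-read trick is small enough to inline. A sketch, where `fetch_count` is a placeholder for your API call returning the track's current play count:

```python
import time

def stable_play_count(fetch_count, gap: float = 60.0) -> int:
    """Read the play count twice with a gap and keep the larger value.
    Counts are monotonically increasing, so the higher reading is by
    definition the fresher of the two."""
    first = fetch_count()
    time.sleep(gap)
    return max(first, fetch_count())
```

This only corrects for lag in either direction; it cannot recover counts the API hasn't aggregated yet.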

ImportError on httpx

ModuleNotFoundError: No module named 'httpx'

Fix:

uv pip install httpx

All scripts in this guide require httpx. No other third-party dependencies are needed.


Summary

SoundCloud's internal API is well-structured and stable enough for serious data collection. The main operational concern is the rotating client_id, which your scraper must handle automatically. Everything else — pagination, rate limiting, data parsing — follows straightforward patterns.

The most productive workflow for ongoing music intelligence:

  1. Run pipeline.py with your target genre to bootstrap a local database
  2. Use monitor.py on a daily schedule to track follower growth and new uploads
  3. Pull waveform data selectively for tracks that cross engagement thresholds
  4. Export to CSV for analysis in any tool you prefer
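The bootstrap-then-monitor loop in steps 1 and 2 can be wired up with cron. The schedule, paths, and genre below are placeholders; adjust them to your setup:

```shell
# One-time bootstrap of the local database
python3 pipeline.py --genre "electronic" --limit 100

# Daily follower/upload monitoring at 06:00 -- add via `crontab -e`:
# 0 6 * * * cd /path/to/project && python3 monitor.py >> monitor.log 2>&1
```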

For large-scale collection, residential proxies remove the IP-rate-limit ceiling and let you collect at whatever pace the data requires.