Scraping Genius for Song Lyrics, Annotations, and Album Data
Genius started as a lyrics annotation site and grew into one of the most comprehensive music metadata platforms on the web. Beyond lyrics, it has structured data on albums, track lists, producers, samples, and user-contributed annotations explaining references in the text.
Genius has an official API, but it does not return lyrics directly — that requires scraping the web pages. This post covers both approaches: using the API for structured metadata and scraping for the actual lyric text.
What You Can Get from Genius
Before diving into code, here's the full data landscape:
Via the official API:

- Song search and metadata (title, artist, album, release date, view count)
- Artist profiles (bio, followers, verified status)
- Album info and track listings
- Annotations/referents (with annotated lyric fragments and explanation text)
- Song relationships (samples, interpolations, remixes, covers)
- Song stats (page views, annotation count)
Via web scraping:

- Full lyrics text (not in the API)
- Song credits in detail (producers, writers, engineers, mastering, mixing)
- Featured artists per track
- Q&A sections
- Fan IQ scores and top contributors
- Embeddable lyric previews
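To make the response shapes concrete before writing any client code, here is a small sketch of plucking fields out of a search response. The envelope (response, hits, result) follows the API's documented structure; the sample values and the `extract_hits` helper are made up for illustration.

```python
# Hypothetical search hit, shaped like the API's response envelope.
sample_response = {
    "response": {
        "hits": [
            {
                "type": "song",
                "result": {
                    "id": 2778038,
                    "title": "HUMBLE.",
                    "url": "https://genius.com/Kendrick-lamar-humble-lyrics",
                    "primary_artist": {"id": 1421, "name": "Kendrick Lamar"},
                    "stats": {"pageviews": 12000000},
                },
            }
        ]
    }
}


def extract_hits(payload: dict) -> list[dict]:
    """Flatten the search envelope into small (id, title, artist, url) records."""
    out = []
    for hit in payload.get("response", {}).get("hits", []):
        r = hit.get("result", {})
        out.append({
            "id": r.get("id"),
            "title": r.get("title"),
            "artist": r.get("primary_artist", {}).get("name"),
            "url": r.get("url"),
        })
    return out


records = extract_hits(sample_response)
print(records[0]["title"])  # HUMBLE.
```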
Setting Up the Genius API
Register at genius.com/api-clients to get an access token. You only need a Client Access Token for public data — no OAuth flow required for read-only access.
```python
# genius_client.py
import time

import httpx


class GeniusClient:
    BASE_URL = "https://api.genius.com"

    def __init__(self, token: str, proxy_url: str | None = None):
        transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
        self.client = httpx.Client(
            base_url=self.BASE_URL,
            headers={
                "Authorization": f"Bearer {token}",
                "User-Agent": "GeniusScraper/1.0",
                "Accept": "application/json",
            },
            transport=transport,
            timeout=15,
        )

    def search(self, query: str, page: int = 1) -> list[dict]:
        """Search for songs by title, artist, or lyrics keywords."""
        resp = self.client.get("/search", params={"q": query, "page": page})
        resp.raise_for_status()
        return [hit["result"] for hit in resp.json()["response"]["hits"]]

    def get_song(self, song_id: int, text_format: str = "plain") -> dict:
        """Get full song metadata including credits and stats."""
        resp = self.client.get(
            f"/songs/{song_id}",
            params={"text_format": text_format},
        )
        resp.raise_for_status()
        return resp.json()["response"]["song"]

    def get_artist(self, artist_id: int) -> dict:
        """Get artist info including bio and social links."""
        resp = self.client.get(f"/artists/{artist_id}")
        resp.raise_for_status()
        return resp.json()["response"]["artist"]

    def get_artist_songs(
        self,
        artist_id: int,
        page: int = 1,
        per_page: int = 50,
        sort: str = "popularity",
    ) -> dict:
        """Get paginated list of an artist's songs.

        sort: 'popularity', 'title', or 'release_date'
        """
        resp = self.client.get(
            f"/artists/{artist_id}/songs",
            params={"page": page, "per_page": per_page, "sort": sort},
        )
        resp.raise_for_status()
        return resp.json()["response"]

    def get_album(self, album_id: int) -> dict:
        """Get album metadata."""
        resp = self.client.get(f"/albums/{album_id}")
        resp.raise_for_status()
        return resp.json()["response"]["album"]

    def get_album_tracks(self, album_id: int) -> list[dict]:
        """Get all tracks in an album with their details."""
        resp = self.client.get(f"/albums/{album_id}/tracks")
        resp.raise_for_status()
        return resp.json()["response"]["tracks"]

    def get_referents(
        self,
        song_id: int,
        text_format: str = "plain",
        per_page: int = 50,
    ) -> list[dict]:
        """Get annotations/referents for a song."""
        all_referents = []
        page = 1
        while True:
            resp = self.client.get("/referents", params={
                "song_id": song_id,
                "text_format": text_format,
                "per_page": per_page,
                "page": page,
            })
            resp.raise_for_status()
            data = resp.json()["response"]
            batch = data.get("referents", [])
            if not batch:
                break
            all_referents.extend(batch)
            if not data.get("next_page"):
                break
            page += 1
            time.sleep(0.3)
        return all_referents

    def search_songs_by_artist(self, artist_name: str, max_results: int = 100) -> list[dict]:
        """Find songs by an artist via search, handling multiple pages."""
        songs = []
        page = 1
        seen_ids = set()
        while len(songs) < max_results:
            results = self.search(artist_name, page=page)
            if not results:
                break
            for result in results:
                artist = result.get("primary_artist", {})
                if artist.get("name", "").lower() == artist_name.lower():
                    if result["id"] not in seen_ids:
                        songs.append(result)
                        seen_ids.add(result["id"])
            page += 1
            time.sleep(0.5)
            if page > 5:  # Don't paginate search too deep
                break
        return songs[:max_results]
```
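The client above sprinkles `time.sleep` calls between requests. One way to centralize that politeness is a tiny rate limiter that enforces a minimum interval between calls; this is a sketch, and the 0.25-second default is an assumption, not a documented Genius limit.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls."""

    def __init__(self, min_interval: float = 0.25):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep just long enough to honor the interval; return the sleep time."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        slept = 0.0
        if remaining > 0:
            slept = remaining
            time.sleep(remaining)
        self._last = time.monotonic()
        return slept


# Call limiter.wait() before each API request.
limiter = RateLimiter(min_interval=0.05)
limiter.wait()  # first call: no sleep needed
limiter.wait()  # second call: sleeps out the remainder of the interval
```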
Scraping Lyrics from Web Pages
The API returns a url field for each song pointing to the lyrics page. The lyrics are in the HTML but not in the API response — Genius deliberately omits them to protect licensing deals.
```python
# genius_lyrics.py
import random
import re
import time

import httpx
from selectolax.parser import HTMLParser


def scrape_lyrics(song_url: str, proxy_url: str | None = None) -> str | None:
    """
    Scrape lyrics text from a Genius song page.

    Genius changed their HTML structure in 2024 — the lyrics now live in
    div[data-lyrics-container='true'] rather than the old .lyrics class.
    This function handles both the new and old formats.
    """
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
    client = httpx.Client(
        transport=transport,
        timeout=15,
        follow_redirects=True,
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/125.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
            "Referer": "https://genius.com/",
        },
    )
    try:
        resp = client.get(song_url)
        resp.raise_for_status()
    except httpx.HTTPError as e:
        print(f"  HTTP error fetching {song_url}: {e}")
        return None
    finally:
        # Safe to close here: httpx reads the body eagerly for non-streaming
        # requests, so resp.text remains available after the client is closed.
        client.close()

    tree = HTMLParser(resp.text)

    # New format (2024+): multiple containers with data-lyrics-container attribute
    containers = tree.css("div[data-lyrics-container='true']")
    if containers:
        lyrics_parts = []
        for container in containers:
            # Replace <br> with newlines before extracting text
            html = container.html or ""
            html = re.sub(r"<br\s*/?>", "\n", html)
            html = re.sub(r"<[^>]+>", "", html)  # strip remaining tags
            text = html.strip()
            if text:
                lyrics_parts.append(text)
        if lyrics_parts:
            return "\n\n".join(lyrics_parts)

    # Legacy format: div.lyrics
    old_format = tree.css_first(".lyrics")
    if old_format:
        return old_format.text(separator="\n").strip()

    # Fallback: look for the lyrics in embedded JSON
    for script in tree.css("script"):
        text = script.text() or ""
        if '"lyrics"' in text and '"body"' in text:
            # Extract the plain-text body from the JSON blob
            match = re.search(
                r'"lyrics"\s*:\s*\{.*?"body"\s*:\s*\{"plain"\s*:\s*"([^"]+)"',
                text,
                re.DOTALL,
            )
            if match:
                return match.group(1).encode().decode("unicode_escape")
    return None


def scrape_lyrics_with_retry(
    song_url: str,
    proxies: list[str] | None = None,
    max_attempts: int = 3,
) -> str | None:
    """Scrape with proxy rotation and retry logic."""
    for attempt in range(max_attempts):
        proxy = random.choice(proxies) if proxies else None
        lyrics = scrape_lyrics(song_url, proxy_url=proxy)
        if lyrics and len(lyrics) > 50:  # Avoid empty/error pages
            return lyrics
        if attempt < max_attempts - 1:
            wait = random.uniform(3, 8)
            print(f"  Attempt {attempt + 1} failed, waiting {wait:.1f}s")
            time.sleep(wait)
    return None
```
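The `<br>`-to-newline step matters because Genius wraps annotated fragments in `<a>` links inside the lyric containers, so a naive tag-strip alone would lose the line structure. A minimal sketch of the same two-regex cleanup applied to made-up container HTML:

```python
import re

# Sample markup shaped like the contents of a data-lyrics-container div.
sample = (
    "[Verse 1]<br>First line here<br/>"
    '<a href="/123">annotated words</a><br>Second line'
)


def html_to_lyric_text(html: str) -> str:
    """Same two-step cleanup as scrape_lyrics: <br> -> newline, then strip tags."""
    html = re.sub(r"<br\s*/?>", "\n", html)   # preserve line breaks first
    html = re.sub(r"<[^>]+>", "", html)       # then drop remaining tags
    return html.strip()


print(html_to_lyric_text(sample))
```

Note that the order matters: stripping tags first would delete the `<br>` elements and merge every line into one.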
Building a Complete Song Dataset
Combining API metadata with scraped lyrics:
```python
def collect_song_data(
    genius: GeniusClient,
    song_result: dict,
    proxy_url: str | None = None,
    include_annotations: bool = True,
) -> dict:
    """Collect full song data: metadata + lyrics + annotations."""
    song_id = song_result["id"]

    # Get detailed metadata from API
    song = genius.get_song(song_id)

    # Extract structured credits
    credits = {
        "producers": [p["name"] for p in song.get("producer_artists", [])],
        "writers": [w["name"] for w in song.get("writer_artists", [])],
        "featured": [f["name"] for f in song.get("featured_artists", [])],
    }

    # Extract samples and interpolations
    relationships = {}
    for rel in song.get("song_relationships", []):
        rel_type = rel.get("relationship_type")
        rel_songs = [s.get("title") for s in rel.get("songs", [])]
        if rel_songs:
            relationships[rel_type] = rel_songs

    # Stats
    stats = song.get("stats", {})
    album = song.get("album") or {}

    data = {
        "id": song_id,
        "title": song.get("title"),
        "title_with_featured": song.get("title_with_featured"),
        "url": song.get("url"),
        "primary_artist": song.get("primary_artist", {}).get("name"),
        "artist_id": song.get("primary_artist", {}).get("id"),
        "album": album.get("name"),
        "album_id": album.get("id"),
        "release_date": song.get("release_date"),
        "page_views": stats.get("pageviews"),
        "annotation_count": stats.get("unreviewed_annotations", 0) + stats.get("verified_annotations", 0),
        "credits": credits,
        "relationships": relationships,
        "lyrics_state": song.get("lyrics_state"),
    }

    # Scrape lyrics if available
    data["lyrics"] = None
    if song.get("lyrics_state") == "complete" and song.get("url"):
        lyrics = scrape_lyrics(song["url"], proxy_url=proxy_url)
        data["lyrics"] = lyrics
        if lyrics:
            data["lyrics_line_count"] = len([l for l in lyrics.split("\n") if l.strip()])

    # Get annotations (keys match what store_song reads later)
    if include_annotations:
        data["annotations"] = []
        try:
            for ref in genius.get_referents(song_id):
                anns = ref.get("annotations") or []
                if not anns:
                    continue
                top = anns[0]  # keep only the top annotation per referent
                data["annotations"].append({
                    "lyrics_fragment": ref.get("fragment"),
                    "annotation": top.get("body", {}).get("plain", ""),
                    "votes_total": top.get("votes_total", 0),
                    "verified": top.get("verified", False),
                })
        except Exception as e:
            print(f"  Failed to get annotations for {song_id}: {e}")
    return data
```
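If you want the collected records in a spreadsheet rather than JSON, the nested credits and relationships dicts need flattening first. A sketch, using a hypothetical `flatten_for_csv` helper and made-up sample data keyed the way `collect_song_data` keys its output:

```python
import csv
import io


def flatten_for_csv(song: dict) -> dict:
    """Flatten nested credits/relationships into CSV-safe string columns."""
    return {
        "id": song["id"],
        "title": song.get("title"),
        "artist": song.get("primary_artist"),
        "producers": "; ".join(song.get("credits", {}).get("producers", [])),
        "samples": "; ".join(song.get("relationships", {}).get("samples", [])),
        "lyrics_lines": song.get("lyrics_line_count", 0),
    }


record = flatten_for_csv({
    "id": 1, "title": "Example", "primary_artist": "Someone",
    "credits": {"producers": ["A", "B"]},
    "relationships": {"samples": ["Old Song"]},
    "lyrics_line_count": 42,
})

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```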
Building an Album Scraper
To get all lyrics for an album, you need the album's tracks and then scrape each song page:
```python
import json
import random
import time


def scrape_album(
    genius: GeniusClient,
    album_query: str,
    proxy_url: str | None = None,
    delay: float = 2.0,
) -> list[dict]:
    """Search for an album and scrape all its tracks."""
    results = genius.search(album_query)
    if not results:
        print(f"No results for: {album_query}")
        return []

    # Find the album via the first result's song
    first_song = genius.get_song(results[0]["id"])
    album = first_song.get("album")
    if not album:
        print("No album associated with first result")
        return []

    album_id = album["id"]
    print(f"Found album: {album['name']} (ID: {album_id})")

    # Get all tracks from the album API endpoint
    tracks_data = genius.get_album_tracks(album_id)
    tracks = []
    for track_info in tracks_data:
        song = track_info.get("song", {})
        if not song:
            continue
        print(f"  Track {track_info.get('number', '?')}: {song.get('title')}")
        song_data = collect_song_data(
            genius,
            song,
            proxy_url=proxy_url,
            include_annotations=False,  # Skip annotations for album scrape
        )
        song_data["track_number"] = track_info.get("number")
        tracks.append(song_data)
        time.sleep(delay + random.uniform(0, 1))

    tracks.sort(key=lambda t: t.get("track_number") or 999)
    return tracks


# Scrape an album
genius = GeniusClient("YOUR_ACCESS_TOKEN")
album_tracks = scrape_album(genius, "OK Computer Radiohead", delay=2.5)

with open("ok_computer.json", "w") as f:
    json.dump(album_tracks, f, indent=2, ensure_ascii=False)

print(f"Saved {len(album_tracks)} tracks")
for track in album_tracks:
    lyric_lines = track.get("lyrics_line_count", 0)
    print(f"  {track['track_number']}. {track['title']} — {lyric_lines} lines")
```
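Once an album is scraped, the per-track dicts aggregate naturally into album-level stats. A sketch with made-up sample tracks; the `album_summary` helper is illustrative, not part of any library:

```python
def album_summary(tracks: list[dict]) -> dict:
    """Aggregate per-track fields (as produced by scrape_album) into album stats."""
    with_lyrics = [t for t in tracks if t.get("lyrics")]
    return {
        "track_count": len(tracks),
        "tracks_with_lyrics": len(with_lyrics),
        "total_lines": sum(t.get("lyrics_line_count", 0) for t in with_lyrics),
        # page_views can be None for obscure tracks; treat missing as 0
        "total_page_views": sum(t.get("page_views") or 0 for t in tracks),
    }


sample_tracks = [
    {"title": "A", "lyrics": "line\nline", "lyrics_line_count": 2, "page_views": 100},
    {"title": "B", "lyrics": None, "page_views": None},
]
print(album_summary(sample_tracks))
# {'track_count': 2, 'tracks_with_lyrics': 1, 'total_lines': 2, 'total_page_views': 100}
```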
Extracting Annotations
Genius annotations are the site's core feature — user-contributed explanations of lyric references:
```python
def get_song_annotations(genius: GeniusClient, song_id: int) -> list[dict]:
    """Get all annotations for a song, sorted by quality."""
    referents = genius.get_referents(song_id)
    annotations = []
    for ref in referents:
        fragment = ref.get("fragment", "")
        # A referent may have multiple annotations (different explanations)
        for annotation in ref.get("annotations", []):
            body_plain = annotation.get("body", {}).get("plain", "")
            if not body_plain.strip():
                continue
            annotations.append({
                "lyrics_fragment": fragment,
                "annotation": body_plain,
                "annotation_id": annotation.get("id"),
                "votes_total": annotation.get("votes_total", 0),
                "verified": annotation.get("verified", False),
                "authors": [
                    a.get("user", {}).get("login")
                    for a in annotation.get("authors", [])
                ],
                "created_at": annotation.get("created_at"),
                "has_media": bool(annotation.get("media", [])),
            })
    # Sort: verified first, then by vote count
    annotations.sort(
        key=lambda a: (a["verified"], a["votes_total"]),
        reverse=True,
    )
    return annotations


# Get annotations with a quality filter
def get_quality_annotations(
    genius: GeniusClient,
    song_id: int,
    min_votes: int = 5,
) -> list[dict]:
    """Get only well-voted or verified annotations."""
    all_annotations = get_song_annotations(genius, song_id)
    return [a for a in all_annotations if a["votes_total"] >= min_votes or a["verified"]]


# Example: annotations for Kendrick Lamar - HUMBLE.
genius = GeniusClient("YOUR_ACCESS_TOKEN")
song_id = 2778038  # Adjust to actual song ID
annotations = get_quality_annotations(genius, song_id, min_votes=10)

print(f"Found {len(annotations)} quality annotations\n")
for ann in annotations[:5]:
    print(f"Lyric: '{ann['lyrics_fragment']}'")
    print(f"  Explanation: {ann['annotation'][:200]}...")
    print(f"  Votes: {ann['votes_total']}, Verified: {ann['verified']}")
    print()
```
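A useful post-processing step is mapping each annotated fragment back to the lyric line it explains. This sketch assumes fragments appear verbatim on a single line, which is not always true (fragments can span lines); `locate_fragments` is a hypothetical helper:

```python
def locate_fragments(lyrics: str, annotations: list[dict]) -> list[dict]:
    """Attach the 1-based lyric line number where each annotated fragment
    first appears (None if it is not found verbatim on a single line)."""
    lines = lyrics.split("\n")
    out = []
    for ann in annotations:
        frag = ann.get("lyrics_fragment", "")
        line_no = next(
            (i + 1 for i, line in enumerate(lines) if frag and frag in line),
            None,
        )
        out.append({**ann, "line_number": line_no})
    return out


lyrics = "First line\nSecond line with a reference\nThird line"
anns = [{"lyrics_fragment": "a reference", "annotation": "explains it"}]
print(locate_fragments(lyrics, anns)[0]["line_number"])  # 2
```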
Storing Data in SQLite
For building a music database that persists across scraping sessions:
```python
import json
import sqlite3
from datetime import datetime, timezone


def init_genius_db(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS songs (
            id INTEGER PRIMARY KEY,
            title TEXT,
            url TEXT,
            primary_artist TEXT,
            artist_id INTEGER,
            album TEXT,
            album_id INTEGER,
            release_date TEXT,
            page_views INTEGER,
            annotation_count INTEGER,
            lyrics TEXT,
            lyrics_line_count INTEGER,
            credits_json TEXT,
            relationships_json TEXT,
            scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS annotations (
            id INTEGER PRIMARY KEY,
            song_id INTEGER,
            lyrics_fragment TEXT,
            annotation TEXT,
            votes_total INTEGER,
            verified INTEGER,
            authors_json TEXT,
            created_at TEXT,
            UNIQUE (song_id, lyrics_fragment, annotation),
            FOREIGN KEY (song_id) REFERENCES songs(id)
        );

        CREATE TABLE IF NOT EXISTS artists (
            id INTEGER PRIMARY KEY,
            name TEXT,
            url TEXT,
            followers INTEGER,
            verified INTEGER,
            description TEXT,
            scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
        );

        -- External-content FTS index over the songs table. Column names must
        -- exist in the content table, and songs.id doubles as the rowid
        -- because it is an INTEGER PRIMARY KEY.
        CREATE VIRTUAL TABLE IF NOT EXISTS lyrics_fts USING fts5(
            title,
            lyrics,
            content='songs',
            content_rowid='id'
        );
    """)
    conn.commit()
    return conn


def store_song(conn: sqlite3.Connection, song_data: dict):
    """Store a song and its annotations in the database."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """INSERT OR REPLACE INTO songs
           (id, title, url, primary_artist, artist_id, album, album_id,
            release_date, page_views, annotation_count, lyrics,
            lyrics_line_count, credits_json, relationships_json, scraped_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            song_data["id"], song_data.get("title"), song_data.get("url"),
            song_data.get("primary_artist"), song_data.get("artist_id"),
            song_data.get("album"), song_data.get("album_id"),
            song_data.get("release_date"), song_data.get("page_views"),
            song_data.get("annotation_count"), song_data.get("lyrics"),
            song_data.get("lyrics_line_count"),
            json.dumps(song_data.get("credits", {})),
            json.dumps(song_data.get("relationships", {})),
            now,
        ),
    )
    for ann in song_data.get("annotations", []):
        # INSERT OR IGNORE dedupes via the UNIQUE constraint on re-runs
        conn.execute(
            """INSERT OR IGNORE INTO annotations
               (song_id, lyrics_fragment, annotation, votes_total, verified,
                authors_json, created_at)
               VALUES (?, ?, ?, ?, ?, ?, ?)""",
            (
                song_data["id"], ann.get("lyrics_fragment"), ann.get("annotation"),
                ann.get("votes_total", 0), int(ann.get("verified", False)),
                json.dumps(ann.get("authors", [])), ann.get("created_at"),
            ),
        )
    conn.commit()


def search_lyrics_locally(conn: sqlite3.Connection, query: str) -> list[dict]:
    """Full-text search across cached lyrics."""
    # Sync the external-content index with the songs table. Doing this on
    # every search is simple but rescans the whole table; for a large
    # database, rebuild once after each batch of inserts instead.
    conn.execute("INSERT INTO lyrics_fts(lyrics_fts) VALUES('rebuild')")
    rows = conn.execute("""
        SELECT s.id, s.title, s.primary_artist,
               snippet(lyrics_fts, 1, '[', ']', '...', 15) AS snippet
        FROM lyrics_fts
        JOIN songs s ON lyrics_fts.rowid = s.id
        WHERE lyrics_fts MATCH ?
        ORDER BY rank
        LIMIT 20
    """, (query,)).fetchall()
    return [{"id": r[0], "title": r[1], "artist": r[2], "match": r[3]} for r in rows]
```
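The external-content FTS5 wiring is easy to get subtly wrong, so here is a self-contained, in-memory demo with a simplified two-column songs table. The `'rebuild'` command syncs the index from the content table; nothing here touches Genius.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE songs (id INTEGER PRIMARY KEY, title TEXT, lyrics TEXT);
    -- External-content index: FTS stores only the index, not the text
    CREATE VIRTUAL TABLE lyrics_fts USING fts5(
        title, lyrics, content='songs', content_rowid='id'
    );
""")
conn.execute("INSERT INTO songs VALUES (1, 'Example', 'rain falls on the window')")
conn.execute("INSERT INTO songs VALUES (2, 'Other', 'sun shines all day')")

# External-content tables are not populated automatically; sync once.
conn.execute("INSERT INTO lyrics_fts(lyrics_fts) VALUES('rebuild')")

rows = conn.execute(
    "SELECT rowid, title FROM lyrics_fts WHERE lyrics_fts MATCH ?", ("rain",)
).fetchall()
print(rows)  # [(1, 'Example')]
```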
Anti-Bot Measures on Genius
API rate limits: Genius doesn't publish exact limits, but in practice staying under a few requests per second avoids trouble. Add 0.2-0.5 second delays between calls and you'll stay comfortably clear; the API returns 429 responses when you exceed the limit.
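When a 429 does arrive, back off exponentially rather than retrying immediately. A minimal full-jitter backoff sketch; the base and cap values are assumptions, not documented Genius guidance, and a `Retry-After` header should take precedence when the server sends one:

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter.

    attempt is 0-based; returns a delay in seconds drawn uniformly
    from [0, min(cap, base * 2**attempt)].
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))


# On a 429: sleep backoff_delay(attempt), then retry.
delays = [round(backoff_delay(a), 2) for a in range(5)]
print(delays)  # e.g. five increasing-bound random delays
```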
Web scraping challenges:
- Genius serves lyrics via a combination of server-rendered HTML and client-side React components. The data-lyrics-container selector is the current stable way to find lyrics.
- Cloudflare protection activates after repeated requests from the same IP, typically serving a 403 or JS challenge page.
- Some lyrics pages have region restrictions (rare but happens with certain licensing agreements).
- Lyrics for some songs are hidden behind a "verified" paywall on the app but accessible on the web.
For large-scale scraping — collecting lyrics for thousands of songs — rotating proxies are essential. ThorData provides residential IP rotation that helps avoid Cloudflare challenges when scraping lyrics pages in bulk:
```python
import random
import time

# Pool of proxies for rotation
PROXIES = [
    "http://USER:[email protected]:9000",
    # ThorData provides a single endpoint with session-based rotation.
    # Use session IDs to control stickiness per song.
]


def build_thordata_proxy(session_id: str | None = None) -> str:
    """Build a ThorData proxy URL. Use session_id for IP stickiness."""
    user = "YOUR_USER"
    password = "YOUR_PASS"
    if session_id:
        return f"http://{user}-session-{session_id}:{password}@proxy.thordata.com:9000"
    return f"http://{user}:{password}@proxy.thordata.com:9000"


def scrape_album_with_rotation(
    genius: GeniusClient,
    album_tracks: list[dict],
) -> list[dict]:
    """Scrape lyrics for album tracks using proxy rotation."""
    for track in album_tracks:
        if not track.get("url") or track.get("lyrics"):
            continue
        # Use a unique session ID per track for IP stickiness during the request
        session_id = str(track["id"])
        proxy = build_thordata_proxy(session_id)
        lyrics = scrape_lyrics_with_retry(track["url"], proxies=[proxy])
        track["lyrics"] = lyrics
        time.sleep(random.uniform(2, 5))
    return album_tracks
```
Practical Tips
Search is fuzzy: The search endpoint matches on title, artist, and lyrics content. Include both song title and artist name for best results: "Alright Kendrick Lamar" beats just "Alright".
Song IDs are stable: Once you find a song's ID, cache it. IDs never change, so you can skip the search step on repeat runs and go straight to get_song(song_id).
Not all songs have lyrics: Instrumental tracks, some older entries, and recently added songs may have empty lyrics pages. Check song.get("lyrics_state") — values include "complete", "unreleased", "instrumental". Skip non-complete songs.
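A one-line filter based on this tip, shown here with made-up sample records keyed like API song objects:

```python
def scrapeable(songs: list[dict]) -> list[dict]:
    """Keep only songs whose lyrics page is worth fetching."""
    return [s for s in songs if s.get("lyrics_state") == "complete" and s.get("url")]


sample = [
    {"title": "A", "lyrics_state": "complete", "url": "https://genius.com/a"},
    {"title": "B", "lyrics_state": "instrumental", "url": "https://genius.com/b"},
    {"title": "C", "lyrics_state": "unreleased", "url": None},
]
print([s["title"] for s in scrapeable(sample)])  # ['A']
```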
Annotations vary wildly in quality: Filter by votes_total >= 5 to get community-vetted explanations. Unverified annotations with zero votes are often low quality or spam. The verified field marks Genius staff-approved annotations.
Use text_format=plain: The API supports dom, plain, and html formats for text fields. Plain text is easiest for analysis. HTML is useful if you want formatting. The DOM format is a parsed JSON tree — overkill for most uses.
Sample relationships are gold: The song_relationships field maps out which songs sample which. This is invaluable data for music genealogy projects. A song marked with relationship_type: "samples" links to the original track.
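As a sketch of what a genealogy project's first step might look like, here is a hypothetical `build_sample_graph` helper that turns the relationships dicts (as produced by `collect_song_data`) into an adjacency list, with made-up data:

```python
from collections import defaultdict


def build_sample_graph(songs: list[dict]) -> dict[str, list[str]]:
    """Map each song title to the titles it samples."""
    graph = defaultdict(list)
    for song in songs:
        for sampled in song.get("relationships", {}).get("samples", []):
            graph[song["title"]].append(sampled)
    return dict(graph)


songs = [
    {"title": "New Track", "relationships": {"samples": ["Old Funk Song"]}},
    {"title": "Another", "relationships": {}},
]
print(build_sample_graph(songs))  # {'New Track': ['Old Funk Song']}
```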
Page views = popularity signal: The stats.pageviews field is a decent proxy for song popularity within the Genius ecosystem. Use it to prioritize which songs to scrape first when you're resource-constrained.
The combination of API metadata and scraped lyrics gives you a rich dataset for NLP projects (sentiment analysis, vocabulary complexity, topic modeling), music analysis (rhyme scheme detection, verse/chorus segmentation), or building fan tools. Start with the API for structure and only scrape web pages for lyrics themselves — that's where the efficiency/risk balance is best.