Scraping Genius for Song Lyrics, Annotations, and Album Data
Genius started as a lyrics annotation site and grew into one of the most comprehensive music metadata platforms on the web. Beyond lyrics, it has structured data on albums, track lists, producers, samples, and user-contributed annotations explaining references in the text.
Genius has an official API, but it does not return lyrics directly — that requires scraping the web pages. This post covers both approaches: using the API for structured metadata and scraping for the actual lyric text.
What You Can Get from Genius
Before diving into code, here's the full data landscape:
Via the official API:

- Song search and metadata (title, artist, album, release date, view count)
- Artist profiles (bio, followers, verified status)
- Album info and track listings
- Annotations/referents (with annotated lyric fragments and explanation text)
- Song relationships (samples, interpolations, remixes, covers)
- Song stats (page views, annotation count)
Via web scraping:

- Full lyrics text (not in the API)
- Song credits in detail (producers, writers, engineers, mastering, mixing)
- Featured artists per track
- Q&A sections
- Fan IQ scores and top contributors
- Embeddable lyric previews
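To make the response shapes concrete before writing any client code, here is a small sketch of plucking fields out of a search response. The envelope (response, hits, result) follows the API's documented structure; the sample values and the `extract_hits` helper are made up for illustration.

```python
# Hypothetical search hit, shaped like the API's response envelope.
sample_response = {
    "response": {
        "hits": [
            {
                "type": "song",
                "result": {
                    "id": 2778038,
                    "title": "HUMBLE.",
                    "url": "https://genius.com/Kendrick-lamar-humble-lyrics",
                    "primary_artist": {"id": 1421, "name": "Kendrick Lamar"},
                    "stats": {"pageviews": 12000000},
                },
            }
        ]
    }
}


def extract_hits(payload: dict) -> list[dict]:
    """Flatten the search envelope into small (id, title, artist, url) records."""
    out = []
    for hit in payload.get("response", {}).get("hits", []):
        r = hit.get("result", {})
        out.append({
            "id": r.get("id"),
            "title": r.get("title"),
            "artist": r.get("primary_artist", {}).get("name"),
            "url": r.get("url"),
        })
    return out


records = extract_hits(sample_response)
print(records[0]["title"])  # HUMBLE.
```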
Setting Up the Genius API
Register at genius.com/api-clients to get an access token. You only need a Client Access Token for public data — no OAuth flow required for read-only access.
```python
# genius_client.py
import time

import httpx


class GeniusClient:
    BASE_URL = "https://api.genius.com"

    def __init__(self, token: str, proxy_url: str | None = None):
        transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
        self.client = httpx.Client(
            base_url=self.BASE_URL,
            headers={
                "Authorization": f"Bearer {token}",
                "User-Agent": "GeniusScraper/1.0",
                "Accept": "application/json",
            },
            transport=transport,
            timeout=15,
        )

    def search(self, query: str, page: int = 1) -> list[dict]:
        """Search for songs by title, artist, or lyrics keywords."""
        resp = self.client.get("/search", params={"q": query, "page": page})
        resp.raise_for_status()
        return [hit["result"] for hit in resp.json()["response"]["hits"]]

    def get_song(self, song_id: int, text_format: str = "plain") -> dict:
        """Get full song metadata including credits and stats."""
        resp = self.client.get(
            f"/songs/{song_id}",
            params={"text_format": text_format},
        )
        resp.raise_for_status()
        return resp.json()["response"]["song"]

    def get_artist(self, artist_id: int) -> dict:
        """Get artist info including bio and social links."""
        resp = self.client.get(f"/artists/{artist_id}")
        resp.raise_for_status()
        return resp.json()["response"]["artist"]

    def get_artist_songs(
        self,
        artist_id: int,
        page: int = 1,
        per_page: int = 50,
        sort: str = "popularity",
    ) -> dict:
        """Get paginated list of an artist's songs.

        sort: 'popularity', 'title', or 'release_date'
        """
        resp = self.client.get(
            f"/artists/{artist_id}/songs",
            params={"page": page, "per_page": per_page, "sort": sort},
        )
        resp.raise_for_status()
        return resp.json()["response"]

    def get_album(self, album_id: int) -> dict:
        """Get album metadata."""
        resp = self.client.get(f"/albums/{album_id}")
        resp.raise_for_status()
        return resp.json()["response"]["album"]

    def get_album_tracks(self, album_id: int) -> list[dict]:
        """Get all tracks in an album with their details."""
        resp = self.client.get(f"/albums/{album_id}/tracks")
        resp.raise_for_status()
        return resp.json()["response"]["tracks"]

    def get_referents(
        self,
        song_id: int,
        text_format: str = "plain",
        per_page: int = 50,
    ) -> list[dict]:
        """Get annotations/referents for a song."""
        all_referents = []
        page = 1
        while True:
            resp = self.client.get("/referents", params={
                "song_id": song_id,
                "text_format": text_format,
                "per_page": per_page,
                "page": page,
            })
            resp.raise_for_status()
            data = resp.json()["response"]
            batch = data.get("referents", [])
            if not batch:
                break
            all_referents.extend(batch)
            if not data.get("next_page"):
                break
            page += 1
            time.sleep(0.3)
        return all_referents

    def search_songs_by_artist(self, artist_name: str, max_results: int = 100) -> list[dict]:
        """Find songs by an artist via search, handling multiple pages."""
        songs = []
        page = 1
        seen_ids = set()
        while len(songs) < max_results:
            results = self.search(artist_name, page=page)
            if not results:
                break
            for result in results:
                artist = result.get("primary_artist", {})
                if artist.get("name", "").lower() == artist_name.lower():
                    if result["id"] not in seen_ids:
                        songs.append(result)
                        seen_ids.add(result["id"])
            page += 1
            time.sleep(0.5)
            if page > 5:  # Don't paginate search too deep
                break
        return songs[:max_results]
```
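The client above sprinkles `time.sleep` calls between requests. One way to centralize that politeness is a tiny rate limiter that enforces a minimum interval between calls; this is a sketch, and the 0.25-second default is an assumption, not a documented Genius limit.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls."""

    def __init__(self, min_interval: float = 0.25):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep just long enough to honor the interval; return the sleep time."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        slept = 0.0
        if remaining > 0:
            slept = remaining
            time.sleep(remaining)
        self._last = time.monotonic()
        return slept


# Call limiter.wait() before each API request.
limiter = RateLimiter(min_interval=0.05)
limiter.wait()  # first call: no sleep needed
limiter.wait()  # second call: sleeps out the remainder of the interval
```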
Scraping Lyrics from Web Pages
The API returns a url field for each song pointing to the lyrics page. The lyrics are in the HTML but not in the API response — Genius deliberately omits them to protect licensing deals.
```python
# genius_lyrics.py
import random
import re
import time

import httpx
from selectolax.parser import HTMLParser


def scrape_lyrics(song_url: str, proxy_url: str | None = None) -> str | None:
    """
    Scrape lyrics text from a Genius song page.

    Genius changed their HTML structure in 2024 — the lyrics now live in
    div[data-lyrics-container='true'] rather than the old .lyrics class.
    This function handles both the new and old formats.
    """
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
    client = httpx.Client(
        transport=transport,
        timeout=15,
        follow_redirects=True,
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/125.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
            "Referer": "https://genius.com/",
        },
    )
    try:
        resp = client.get(song_url)
        resp.raise_for_status()
    except httpx.HTTPError as e:
        print(f"  HTTP error fetching {song_url}: {e}")
        return None
    finally:
        # Safe to close here: httpx reads the body eagerly for non-streaming
        # requests, so resp.text remains available after the client is closed.
        client.close()

    tree = HTMLParser(resp.text)

    # New format (2024+): multiple containers with data-lyrics-container attribute
    containers = tree.css("div[data-lyrics-container='true']")
    if containers:
        lyrics_parts = []
        for container in containers:
            # Replace <br> with newlines before extracting text
            html = container.html or ""
            html = re.sub(r"<br\s*/?>", "\n", html)
            html = re.sub(r"<[^>]+>", "", html)  # strip remaining tags
            text = html.strip()
            if text:
                lyrics_parts.append(text)
        if lyrics_parts:
            return "\n\n".join(lyrics_parts)

    # Legacy format: div.lyrics
    old_format = tree.css_first(".lyrics")
    if old_format:
        return old_format.text(separator="\n").strip()

    # Fallback: look for the lyrics in embedded JSON
    for script in tree.css("script"):
        text = script.text() or ""
        if '"lyrics"' in text and '"body"' in text:
            # Extract the plain-text body from the JSON blob
            match = re.search(
                r'"lyrics"\s*:\s*\{.*?"body"\s*:\s*\{"plain"\s*:\s*"([^"]+)"',
                text,
                re.DOTALL,
            )
            if match:
                return match.group(1).encode().decode("unicode_escape")
    return None


def scrape_lyrics_with_retry(
    song_url: str,
    proxies: list[str] | None = None,
    max_attempts: int = 3,
) -> str | None:
    """Scrape with proxy rotation and retry logic."""
    for attempt in range(max_attempts):
        proxy = random.choice(proxies) if proxies else None
        lyrics = scrape_lyrics(song_url, proxy_url=proxy)
        if lyrics and len(lyrics) > 50:  # Avoid empty/error pages
            return lyrics
        if attempt < max_attempts - 1:
            wait = random.uniform(3, 8)
            print(f"  Attempt {attempt + 1} failed, waiting {wait:.1f}s")
            time.sleep(wait)
    return None
```
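The `<br>`-to-newline step matters because Genius wraps annotated fragments in `<a>` links inside the lyric containers, so a naive tag-strip alone would lose the line structure. A minimal sketch of the same two-regex cleanup applied to made-up container HTML:

```python
import re

# Sample markup shaped like the contents of a data-lyrics-container div.
sample = (
    "[Verse 1]<br>First line here<br/>"
    '<a href="/123">annotated words</a><br>Second line'
)


def html_to_lyric_text(html: str) -> str:
    """Same two-step cleanup as scrape_lyrics: <br> -> newline, then strip tags."""
    html = re.sub(r"<br\s*/?>", "\n", html)   # preserve line breaks first
    html = re.sub(r"<[^>]+>", "", html)       # then drop remaining tags
    return html.strip()


print(html_to_lyric_text(sample))
```

Note that the order matters: stripping tags first would delete the `<br>` elements and merge every line into one.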
Building a Complete Song Dataset
Combining API metadata with scraped lyrics:
```python
def collect_song_data(
    genius: GeniusClient,
    song_result: dict,
    proxy_url: str | None = None,
    include_annotations: bool = True,
) -> dict:
    """Collect full song data: metadata + lyrics + annotations."""
    song_id = song_result["id"]

    # Get detailed metadata from API
    song = genius.get_song(song_id)

    # Extract structured credits
    credits = {
        "producers": [p["name"] for p in song.get("producer_artists", [])],
        "writers": [w["name"] for w in song.get("writer_artists", [])],
        "featured": [f["name"] for f in song.get("featured_artists", [])],
    }

    # Extract samples and interpolations
    relationships = {}
    for rel in song.get("song_relationships", []):
        rel_type = rel.get("relationship_type")
        rel_songs = [s.get("title") for s in rel.get("songs", [])]
        if rel_songs:
            relationships[rel_type] = rel_songs

    # Stats
    stats = song.get("stats", {})
    album = song.get("album") or {}

    data = {
        "id": song_id,
        "title": song.get("title"),
        "title_with_featured": song.get("title_with_featured"),
        "url": song.get("url"),
        "primary_artist": song.get("primary_artist", {}).get("name"),
        "artist_id": song.get("primary_artist", {}).get("id"),
        "album": album.get("name"),
        "album_id": album.get("id"),
        "release_date": song.get("release_date"),
        "page_views": stats.get("pageviews"),
        "annotation_count": stats.get("unreviewed_annotations", 0) + stats.get("verified_annotations", 0),
        "credits": credits,
        "relationships": relationships,
        "lyrics_state": song.get("lyrics_state"),
    }

    # Scrape lyrics if available
    data["lyrics"] = None
    if song.get("lyrics_state") == "complete" and song.get("url"):
        lyrics = scrape_lyrics(song["url"], proxy_url=proxy_url)
        data["lyrics"] = lyrics
        if lyrics:
            data["lyrics_line_count"] = len([l for l in lyrics.split("\n") if l.strip()])

    # Get annotations (keys match what store_song reads later)
    if include_annotations:
        data["annotations"] = []
        try:
            for ref in genius.get_referents(song_id):
                anns = ref.get("annotations") or []
                if not anns:
                    continue
                top = anns[0]  # keep only the top annotation per referent
                data["annotations"].append({
                    "lyrics_fragment": ref.get("fragment"),
                    "annotation": top.get("body", {}).get("plain", ""),
                    "votes_total": top.get("votes_total", 0),
                    "verified": top.get("verified", False),
                })
        except Exception as e:
            print(f"  Failed to get annotations for {song_id}: {e}")
    return data
```
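If you want the collected records in a spreadsheet rather than JSON, the nested credits and relationships dicts need flattening first. A sketch, using a hypothetical `flatten_for_csv` helper and made-up sample data keyed the way `collect_song_data` keys its output:

```python
import csv
import io


def flatten_for_csv(song: dict) -> dict:
    """Flatten nested credits/relationships into CSV-safe string columns."""
    return {
        "id": song["id"],
        "title": song.get("title"),
        "artist": song.get("primary_artist"),
        "producers": "; ".join(song.get("credits", {}).get("producers", [])),
        "samples": "; ".join(song.get("relationships", {}).get("samples", [])),
        "lyrics_lines": song.get("lyrics_line_count", 0),
    }


record = flatten_for_csv({
    "id": 1, "title": "Example", "primary_artist": "Someone",
    "credits": {"producers": ["A", "B"]},
    "relationships": {"samples": ["Old Song"]},
    "lyrics_line_count": 42,
})

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```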
Building an Album Scraper
To get all lyrics for an album, you need the album's tracks and then scrape each song page:
```python
import json
import random
import time


def scrape_album(
    genius: GeniusClient,
    album_query: str,
    proxy_url: str | None = None,
    delay: float = 2.0,
) -> list[dict]:
    """Search for an album and scrape all its tracks."""
    results = genius.search(album_query)
    if not results:
        print(f"No results for: {album_query}")
        return []

    # Find the album via the first result's song
    first_song = genius.get_song(results[0]["id"])
    album = first_song.get("album")
    if not album:
        print("No album associated with first result")
        return []

    album_id = album["id"]
    print(f"Found album: {album['name']} (ID: {album_id})")

    # Get all tracks from the album API endpoint
    tracks_data = genius.get_album_tracks(album_id)
    tracks = []
    for track_info in tracks_data:
        song = track_info.get("song", {})
        if not song:
            continue
        print(f"  Track {track_info.get('number', '?')}: {song.get('title')}")
        song_data = collect_song_data(
            genius,
            song,
            proxy_url=proxy_url,
            include_annotations=False,  # Skip annotations for album scrape
        )
        song_data["track_number"] = track_info.get("number")
        tracks.append(song_data)
        time.sleep(delay + random.uniform(0, 1))

    tracks.sort(key=lambda t: t.get("track_number") or 999)
    return tracks


# Scrape an album
genius = GeniusClient("YOUR_ACCESS_TOKEN")
album_tracks = scrape_album(genius, "OK Computer Radiohead", delay=2.5)

with open("ok_computer.json", "w") as f:
    json.dump(album_tracks, f, indent=2, ensure_ascii=False)

print(f"Saved {len(album_tracks)} tracks")
for track in album_tracks:
    lyric_lines = track.get("lyrics_line_count", 0)
    print(f"  {track['track_number']}. {track['title']} — {lyric_lines} lines")
```
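Once an album is scraped, the per-track dicts aggregate naturally into album-level stats. A sketch with made-up sample tracks; the `album_summary` helper is illustrative, not part of any library:

```python
def album_summary(tracks: list[dict]) -> dict:
    """Aggregate per-track fields (as produced by scrape_album) into album stats."""
    with_lyrics = [t for t in tracks if t.get("lyrics")]
    return {
        "track_count": len(tracks),
        "tracks_with_lyrics": len(with_lyrics),
        "total_lines": sum(t.get("lyrics_line_count", 0) for t in with_lyrics),
        # page_views can be None for obscure tracks; treat missing as 0
        "total_page_views": sum(t.get("page_views") or 0 for t in tracks),
    }


sample_tracks = [
    {"title": "A", "lyrics": "line\nline", "lyrics_line_count": 2, "page_views": 100},
    {"title": "B", "lyrics": None, "page_views": None},
]
print(album_summary(sample_tracks))
# {'track_count': 2, 'tracks_with_lyrics': 1, 'total_lines': 2, 'total_page_views': 100}
```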
Extracting Annotations
Genius annotations are the site's core feature — user-contributed explanations of lyric references:
```python
def get_song_annotations(genius: GeniusClient, song_id: int) -> list[dict]:
    """Get all annotations for a song, sorted by quality."""
    referents = genius.get_referents(song_id)
    annotations = []
    for ref in referents:
        fragment = ref.get("fragment", "")
        # A referent may have multiple annotations (different explanations)
        for annotation in ref.get("annotations", []):
            body_plain = annotation.get("body", {}).get("plain", "")
            if not body_plain.strip():
                continue
            annotations.append({
                "lyrics_fragment": fragment,
                "annotation": body_plain,
                "annotation_id": annotation.get("id"),
                "votes_total": annotation.get("votes_total", 0),
                "verified": annotation.get("verified", False),
                "authors": [
                    a.get("user", {}).get("login")
                    for a in annotation.get("authors", [])
                ],
                "created_at": annotation.get("created_at"),
                "has_media": bool(annotation.get("media", [])),
            })
    # Sort: verified first, then by vote count
    annotations.sort(
        key=lambda a: (a["verified"], a["votes_total"]),
        reverse=True,
    )
    return annotations


# Get annotations with a quality filter
def get_quality_annotations(
    genius: GeniusClient,
    song_id: int,
    min_votes: int = 5,
) -> list[dict]:
    """Get only well-voted or verified annotations."""
    all_annotations = get_song_annotations(genius, song_id)
    return [a for a in all_annotations if a["votes_total"] >= min_votes or a["verified"]]


# Example: annotations for Kendrick Lamar - HUMBLE.
genius = GeniusClient("YOUR_ACCESS_TOKEN")
song_id = 2778038  # Adjust to actual song ID
annotations = get_quality_annotations(genius, song_id, min_votes=10)

print(f"Found {len(annotations)} quality annotations\n")
for ann in annotations[:5]:
    print(f"Lyric: '{ann['lyrics_fragment']}'")
    print(f"  Explanation: {ann['annotation'][:200]}...")
    print(f"  Votes: {ann['votes_total']}, Verified: {ann['verified']}")
    print()
```
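A useful post-processing step is mapping each annotated fragment back to the lyric line it explains. This sketch assumes fragments appear verbatim on a single line, which is not always true (fragments can span lines); `locate_fragments` is a hypothetical helper:

```python
def locate_fragments(lyrics: str, annotations: list[dict]) -> list[dict]:
    """Attach the 1-based lyric line number where each annotated fragment
    first appears (None if it is not found verbatim on a single line)."""
    lines = lyrics.split("\n")
    out = []
    for ann in annotations:
        frag = ann.get("lyrics_fragment", "")
        line_no = next(
            (i + 1 for i, line in enumerate(lines) if frag and frag in line),
            None,
        )
        out.append({**ann, "line_number": line_no})
    return out


lyrics = "First line\nSecond line with a reference\nThird line"
anns = [{"lyrics_fragment": "a reference", "annotation": "explains it"}]
print(locate_fragments(lyrics, anns)[0]["line_number"])  # 2
```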
Storing Data in SQLite
For building a music database that persists across scraping sessions:
```python
import json
import sqlite3
from datetime import datetime, timezone


def init_genius_db(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS songs (
            id INTEGER PRIMARY KEY,
            title TEXT,
            url TEXT,
            primary_artist TEXT,
            artist_id INTEGER,
            album TEXT,
            album_id INTEGER,
            release_date TEXT,
            page_views INTEGER,
            annotation_count INTEGER,
            lyrics TEXT,
            lyrics_line_count INTEGER,
            credits_json TEXT,
            relationships_json TEXT,
            scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS annotations (
            id INTEGER PRIMARY KEY,
            song_id INTEGER,
            lyrics_fragment TEXT,
            annotation TEXT,
            votes_total INTEGER,
            verified INTEGER,
            authors_json TEXT,
            created_at TEXT,
            UNIQUE (song_id, lyrics_fragment, annotation),
            FOREIGN KEY (song_id) REFERENCES songs(id)
        );

        CREATE TABLE IF NOT EXISTS artists (
            id INTEGER PRIMARY KEY,
            name TEXT,
            url TEXT,
            followers INTEGER,
            verified INTEGER,
            description TEXT,
            scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
        );

        -- External-content FTS index over the songs table. Column names must
        -- exist in the content table, and songs.id doubles as the rowid
        -- because it is an INTEGER PRIMARY KEY.
        CREATE VIRTUAL TABLE IF NOT EXISTS lyrics_fts USING fts5(
            title,
            lyrics,
            content='songs',
            content_rowid='id'
        );
    """)
    conn.commit()
    return conn


def store_song(conn: sqlite3.Connection, song_data: dict):
    """Store a song and its annotations in the database."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        """INSERT OR REPLACE INTO songs
           (id, title, url, primary_artist, artist_id, album, album_id,
            release_date, page_views, annotation_count, lyrics,
            lyrics_line_count, credits_json, relationships_json, scraped_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            song_data["id"], song_data.get("title"), song_data.get("url"),
            song_data.get("primary_artist"), song_data.get("artist_id"),
            song_data.get("album"), song_data.get("album_id"),
            song_data.get("release_date"), song_data.get("page_views"),
            song_data.get("annotation_count"), song_data.get("lyrics"),
            song_data.get("lyrics_line_count"),
            json.dumps(song_data.get("credits", {})),
            json.dumps(song_data.get("relationships", {})),
            now,
        ),
    )
    for ann in song_data.get("annotations", []):
        # INSERT OR IGNORE dedupes via the UNIQUE constraint on re-runs
        conn.execute(
            """INSERT OR IGNORE INTO annotations
               (song_id, lyrics_fragment, annotation, votes_total, verified,
                authors_json, created_at)
               VALUES (?, ?, ?, ?, ?, ?, ?)""",
            (
                song_data["id"], ann.get("lyrics_fragment"), ann.get("annotation"),
                ann.get("votes_total", 0), int(ann.get("verified", False)),
                json.dumps(ann.get("authors", [])), ann.get("created_at"),
            ),
        )
    conn.commit()


def search_lyrics_locally(conn: sqlite3.Connection, query: str) -> list[dict]:
    """Full-text search across cached lyrics."""
    # Sync the external-content index with the songs table. Doing this on
    # every search is simple but rescans the whole table; for a large
    # database, rebuild once after each batch of inserts instead.
    conn.execute("INSERT INTO lyrics_fts(lyrics_fts) VALUES('rebuild')")
    rows = conn.execute("""
        SELECT s.id, s.title, s.primary_artist,
               snippet(lyrics_fts, 1, '[', ']', '...', 15) AS snippet
        FROM lyrics_fts
        JOIN songs s ON lyrics_fts.rowid = s.id
        WHERE lyrics_fts MATCH ?
        ORDER BY rank
        LIMIT 20
    """, (query,)).fetchall()
    return [{"id": r[0], "title": r[1], "artist": r[2], "match": r[3]} for r in rows]
```
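The external-content FTS5 wiring is easy to get subtly wrong, so here is a self-contained, in-memory demo with a simplified two-column songs table. The `'rebuild'` command syncs the index from the content table; nothing here touches Genius.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE songs (id INTEGER PRIMARY KEY, title TEXT, lyrics TEXT);
    -- External-content index: FTS stores only the index, not the text
    CREATE VIRTUAL TABLE lyrics_fts USING fts5(
        title, lyrics, content='songs', content_rowid='id'
    );
""")
conn.execute("INSERT INTO songs VALUES (1, 'Example', 'rain falls on the window')")
conn.execute("INSERT INTO songs VALUES (2, 'Other', 'sun shines all day')")

# External-content tables are not populated automatically; sync once.
conn.execute("INSERT INTO lyrics_fts(lyrics_fts) VALUES('rebuild')")

rows = conn.execute(
    "SELECT rowid, title FROM lyrics_fts WHERE lyrics_fts MATCH ?", ("rain",)
).fetchall()
print(rows)  # [(1, 'Example')]
```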
Anti-Bot Measures on Genius
API rate limits: Genius doesn't publish exact limits, but in practice staying under a few requests per second avoids trouble. Add 0.2-0.5 second delays between calls and you'll stay comfortably clear; the API returns 429 responses when you exceed the limit.
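When a 429 does arrive, back off exponentially rather than retrying immediately. A minimal full-jitter backoff sketch; the base and cap values are assumptions, not documented Genius guidance, and a `Retry-After` header should take precedence when the server sends one:

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter.

    attempt is 0-based; returns a delay in seconds drawn uniformly
    from [0, min(cap, base * 2**attempt)].
    """
    return random.uniform(0, min(cap, base * (2 ** attempt)))


# On a 429: sleep backoff_delay(attempt), then retry.
delays = [round(backoff_delay(a), 2) for a in range(5)]
print(delays)  # e.g. five increasing-bound random delays
```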
Web scraping challenges:
- Genius serves lyrics via a combination of server-rendered HTML and client-side React components. The data-lyrics-container selector is the current stable way to find lyrics.
- Cloudflare protection activates after repeated requests from the same IP, typically serving a 403 or JS challenge page.
- Some lyrics pages have region restrictions (rare but happens with certain licensing agreements).
- Lyrics for some songs are hidden behind a "verified" paywall on the app but accessible on the web.
For large-scale scraping — collecting lyrics for thousands of songs — rotating proxies are essential. ThorData provides residential IP rotation that helps avoid Cloudflare challenges when scraping lyrics pages in bulk:
```python
import random
import time

# Pool of proxies for rotation
PROXIES = [
    "http://USER:[email protected]:9000",
    # ThorData provides a single endpoint with session-based rotation.
    # Use session IDs to control stickiness per song.
]


def build_thordata_proxy(session_id: str | None = None) -> str:
    """Build a ThorData proxy URL. Use session_id for IP stickiness."""
    user = "YOUR_USER"
    password = "YOUR_PASS"
    if session_id:
        return f"http://{user}-session-{session_id}:{password}@proxy.thordata.com:9000"
    return f"http://{user}:{password}@proxy.thordata.com:9000"


def scrape_album_with_rotation(
    genius: GeniusClient,
    album_tracks: list[dict],
) -> list[dict]:
    """Scrape lyrics for album tracks using proxy rotation."""
    for track in album_tracks:
        if not track.get("url") or track.get("lyrics"):
            continue
        # Use a unique session ID per track for IP stickiness during the request
        session_id = str(track["id"])
        proxy = build_thordata_proxy(session_id)
        lyrics = scrape_lyrics_with_retry(track["url"], proxies=[proxy])
        track["lyrics"] = lyrics
        time.sleep(random.uniform(2, 5))
    return album_tracks
```
Practical Tips
Search is fuzzy: The search endpoint matches on title, artist, and lyrics content. Include both song title and artist name for best results: "Alright Kendrick Lamar" beats just "Alright".
Song IDs are stable: Once you find a song's ID, cache it. IDs never change, so you can skip the search step on repeat runs and go straight to get_song(song_id).
Not all songs have lyrics: Instrumental tracks, some older entries, and recently added songs may have empty lyrics pages. Check song.get("lyrics_state") — values include "complete", "unreleased", "instrumental". Skip non-complete songs.
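A one-line filter based on this tip, shown here with made-up sample records keyed like API song objects:

```python
def scrapeable(songs: list[dict]) -> list[dict]:
    """Keep only songs whose lyrics page is worth fetching."""
    return [s for s in songs if s.get("lyrics_state") == "complete" and s.get("url")]


sample = [
    {"title": "A", "lyrics_state": "complete", "url": "https://genius.com/a"},
    {"title": "B", "lyrics_state": "instrumental", "url": "https://genius.com/b"},
    {"title": "C", "lyrics_state": "unreleased", "url": None},
]
print([s["title"] for s in scrapeable(sample)])  # ['A']
```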
Annotations vary wildly in quality: Filter by votes_total >= 5 to get community-vetted explanations. Unverified annotations with zero votes are often low quality or spam. The verified field marks Genius staff-approved annotations.
Use text_format=plain: The API supports dom, plain, and html formats for text fields. Plain text is easiest for analysis. HTML is useful if you want formatting. The DOM format is a parsed JSON tree — overkill for most uses.
Sample relationships are gold: The song_relationships field maps out which songs sample which. This is invaluable data for music genealogy projects. A song marked with relationship_type: "samples" links to the original track.
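As a sketch of what a genealogy project's first step might look like, here is a hypothetical `build_sample_graph` helper that turns the relationships dicts (as produced by `collect_song_data`) into an adjacency list, with made-up data:

```python
from collections import defaultdict


def build_sample_graph(songs: list[dict]) -> dict[str, list[str]]:
    """Map each song title to the titles it samples."""
    graph = defaultdict(list)
    for song in songs:
        for sampled in song.get("relationships", {}).get("samples", []):
            graph[song["title"]].append(sampled)
    return dict(graph)


songs = [
    {"title": "New Track", "relationships": {"samples": ["Old Funk Song"]}},
    {"title": "Another", "relationships": {}},
]
print(build_sample_graph(songs))  # {'New Track': ['Old Funk Song']}
```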
Page views = popularity signal: The stats.pageviews field is a decent proxy for song popularity within the Genius ecosystem. Use it to prioritize which songs to scrape first when you're resource-constrained.
The combination of API metadata and scraped lyrics gives you a rich dataset for NLP projects (sentiment analysis, vocabulary complexity, topic modeling), music analysis (rhyme scheme detection, verse/chorus segmentation), or building fan tools. Start with the API for structure and only scrape web pages for lyrics themselves — that's where the efficiency/risk balance is best.