How to Scrape Yahoo Finance with Python (2026)
Yahoo Finance is one of the most accessible sources for financial data. While their official API was discontinued years ago, the internal endpoints that power the website are well-known and return clean JSON. You can pull real-time quotes, historical OHLCV data, financial statements, options chains, earnings estimates, and analyst recommendations — all without authentication for public market data.
This guide covers the key Yahoo Finance endpoints, practical Python code for each, SQLite storage for building a financial data pipeline, error handling for production use, and proxy rotation for high-volume fetching.
Why Yahoo Finance Over Paid Data Providers?
Yahoo Finance provides data that typically costs $30-100/month from paid providers:
- Real-time quotes (15-minute delay for free users, which is sufficient for most research)
- 5+ years of daily OHLCV history for any listed security
- Quarterly and annual financial statements (income statement, balance sheet, cash flow)
- Analyst consensus estimates and EPS forecasts
- Options chains with Greeks
- Insider transactions and institutional ownership
- ESG scores
- Earnings calendar data
The endpoints are the same ones used by the Yahoo Finance website itself. No API key, no billing, no rate limit that prevents reasonable use.
The Key Endpoints
Before writing any code, here's what Yahoo Finance exposes:
| Endpoint | URL Pattern | Purpose |
|---|---|---|
| v8 chart | /v8/finance/chart/SYMBOL | Quotes, OHLCV history, splits, dividends |
| v10 quoteSummary | /v10/finance/quoteSummary/SYMBOL | Everything else (financials, profile, earnings) |
| v7 quote | /v7/finance/quote | Batch quotes for multiple symbols |
| v7 options | /v7/finance/options/SYMBOL | Options chain |
Real-Time Stock Quotes
# yahoo_finance.py
import httpx
import time
import random
from typing import Optional
YAHOO_BASE = "https://query1.finance.yahoo.com"
HEADERS = {
"User-Agent": (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/121.0.0.0 Safari/537.36"
),
"Accept": "application/json, text/plain, */*",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Origin": "https://finance.yahoo.com",
"Referer": "https://finance.yahoo.com/",
}
def make_client(proxy_url: Optional[str] = None) -> httpx.Client:
"""Create an httpx client with optional proxy support."""
client_kwargs = {
"headers": HEADERS,
"follow_redirects": True,
"timeout": 30,
}
if proxy_url:
client_kwargs["proxy"] = proxy_url
return httpx.Client(**client_kwargs)
def get_quote(symbol: str, client: Optional[httpx.Client] = None) -> dict:
"""
Get real-time quote data for a stock symbol.
Returns price, change, volume, market cap, and trading session info.
"""
if client is None:
client = make_client()
url = f"{YAHOO_BASE}/v8/finance/chart/{symbol}"
params = {
"range": "1d",
"interval": "1m",
"includePrePost": "true",
"useYfid": "true",
}
resp = client.get(url, params=params)
if resp.status_code == 429:
raise RuntimeError(f"Rate limited fetching {symbol}")
if resp.status_code == 404:
raise ValueError(f"Symbol not found: {symbol}")
resp.raise_for_status()
data = resp.json()
if data.get("chart", {}).get("error"):
err = data["chart"]["error"]
raise ValueError(f"Yahoo Finance error for {symbol}: {err}")
result = data["chart"]["result"][0]
meta = result["meta"]
# Determine current price based on market state
market_state = meta.get("marketState", "REGULAR")
    if market_state == "PRE":
        # pre-market price when available; the chart meta may omit it, so fall back
        current_price = meta.get("preMarketPrice", meta.get("regularMarketPrice"))
elif market_state == "POST":
current_price = meta.get("postMarketPrice", meta.get("regularMarketPrice"))
else:
current_price = meta.get("regularMarketPrice")
prev_close = meta.get("previousClose") or meta.get("chartPreviousClose")
    change = (current_price - prev_close) if current_price is not None and prev_close else None
    change_pct = (change / prev_close * 100) if change is not None and prev_close else None
return {
"symbol": meta.get("symbol"),
"exchange": meta.get("exchangeName"),
"currency": meta.get("currency"),
"market_state": market_state,
"price": current_price,
"previous_close": prev_close,
        "change": round(change, 4) if change is not None else None,
        "change_pct": round(change_pct, 4) if change_pct is not None else None,
"open": meta.get("regularMarketOpen"),
"day_high": meta.get("regularMarketDayHigh"),
"day_low": meta.get("regularMarketDayLow"),
"volume": meta.get("regularMarketVolume"),
"avg_volume": meta.get("averageDailyVolume10Day"),
"market_cap": meta.get("marketCap"),
"52w_high": meta.get("fiftyTwoWeekHigh"),
"52w_low": meta.get("fiftyTwoWeekLow"),
"timezone": meta.get("timezone"),
}
quote = get_quote("AAPL")
direction = "+" if (quote["change"] or 0) >= 0 else ""
print(f"{quote['symbol']}: ${quote['price']:.2f} ({direction}{quote['change_pct']:.2f}%)")
print(f"Volume: {quote['volume']:,}")
print(f"Market Cap: ${quote['market_cap']:,.0f}")
print(f"52-Week Range: ${quote['52w_low']:.2f} - ${quote['52w_high']:.2f}")
Batch Quotes for Multiple Tickers
For efficiency, query multiple symbols in one request using the v7 quote endpoint:
def get_batch_quotes(symbols: list, client: Optional[httpx.Client] = None) -> list:
"""
Get quotes for multiple symbols in a single request.
More efficient than individual calls when monitoring a portfolio.
"""
if client is None:
client = make_client()
url = f"{YAHOO_BASE}/v7/finance/quote"
params = {
"symbols": ",".join(symbols),
"fields": ",".join([
"regularMarketPrice",
"regularMarketChange",
"regularMarketChangePercent",
"regularMarketVolume",
"regularMarketDayHigh",
"regularMarketDayLow",
"marketCap",
"trailingPE",
"forwardPE",
"dividendYield",
"fiftyTwoWeekHigh",
"fiftyTwoWeekLow",
"shortName",
"longName",
]),
}
resp = client.get(url, params=params)
resp.raise_for_status()
data = resp.json()
results = []
for q in data.get("quoteResponse", {}).get("result", []):
change = q.get("regularMarketChange", 0)
results.append({
"symbol": q.get("symbol"),
"name": q.get("shortName") or q.get("longName"),
"price": q.get("regularMarketPrice"),
"change": round(change, 4),
"change_pct": round(q.get("regularMarketChangePercent", 0), 4),
"volume": q.get("regularMarketVolume"),
"market_cap": q.get("marketCap"),
"pe_trailing": q.get("trailingPE"),
"pe_forward": q.get("forwardPE"),
"dividend_yield": q.get("dividendYield"),
"52w_high": q.get("fiftyTwoWeekHigh"),
"52w_low": q.get("fiftyTwoWeekLow"),
"day_high": q.get("regularMarketDayHigh"),
"day_low": q.get("regularMarketDayLow"),
})
return results
# Get the S&P 500 megacaps in one request
megacaps = get_batch_quotes(["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "META", "TSLA", "BRK-B"])
print(f"{'Symbol':<8} {'Price':>10} {'Change':>10} {'Mkt Cap':>15}")
for q in sorted(megacaps, key=lambda x: x["change_pct"] or 0, reverse=True):
mc = f"${q['market_cap']/1e12:.2f}T" if q['market_cap'] and q['market_cap'] > 1e12 else ""
chg = f"{q['change_pct']:+.2f}%" if q['change_pct'] else "N/A"
print(f"{q['symbol']:<8} ${q['price']:>9.2f} {chg:>10} {mc:>15}")
Historical OHLCV Data
The v8 chart endpoint handles historical data. Change the range and interval parameters:
import datetime
def get_historical(
symbol: str,
period: str = "1y",
interval: str = "1d",
client: Optional[httpx.Client] = None,
) -> list:
"""
Fetch historical OHLCV data.
period: 1d, 5d, 1mo, 3mo, 6mo, 1y, 2y, 5y, 10y, ytd, max
interval: 1m, 2m, 5m, 15m, 30m, 60m, 90m, 1h, 1d, 5d, 1wk, 1mo, 3mo
Note: Intraday intervals (1m-90m) only available for recent periods.
1m: max 7 days, 5m: max 60 days, 1h: max 730 days
"""
if client is None:
client = make_client()
url = f"{YAHOO_BASE}/v8/finance/chart/{symbol}"
params = {
"range": period,
"interval": interval,
"includeAdjustedClose": "true",
"events": "div,split", # include dividend and split events
}
resp = client.get(url, params=params)
resp.raise_for_status()
data = resp.json()
if data.get("chart", {}).get("error"):
raise ValueError(f"Error fetching {symbol}: {data['chart']['error']}")
result = data["chart"]["result"][0]
timestamps = result.get("timestamp", [])
ohlcv = result.get("indicators", {}).get("quote", [{}])[0]
adj_close_data = result.get("indicators", {}).get("adjclose", [{}])[0]
adj_close = adj_close_data.get("adjclose", []) if adj_close_data else []
# Extract dividend and split events
events = result.get("events", {})
dividends = events.get("dividends", {})
splits = events.get("splits", {})
    def safe_round(val, digits=4):
        return round(val, digits) if val is not None else None

    def col(series, i):
        # guard against missing or short series instead of indexing blindly
        return series[i] if series and i < len(series) else None

    rows = []
    for i, ts in enumerate(timestamps):
        if ts is None:
            continue
        date_str = datetime.datetime.fromtimestamp(ts).strftime("%Y-%m-%d")
        row = {
            "date": date_str,
            "timestamp": ts,
            "open": safe_round(col(ohlcv.get("open"), i)),
            "high": safe_round(col(ohlcv.get("high"), i)),
            "low": safe_round(col(ohlcv.get("low"), i)),
            "close": safe_round(col(ohlcv.get("close"), i)),
            "adj_close": safe_round(col(adj_close, i)),
            "volume": col(ohlcv.get("volume"), i),
            "dividend": None,
            "split_ratio": None,
        }
# Annotate dividend and split events by timestamp
for div_ts, div_data in dividends.items():
if abs(int(div_ts) - ts) < 86400:
row["dividend"] = div_data.get("amount")
for split_ts, split_data in splits.items():
if abs(int(split_ts) - ts) < 86400:
numerator = split_data.get("numerator", 1)
denominator = split_data.get("denominator", 1)
row["split_ratio"] = f"{numerator}:{denominator}"
rows.append(row)
return rows
history = get_historical("MSFT", period="3mo", interval="1d")
print(f"Got {len(history)} data points for MSFT")
print(f"\nLast 5 days:")
for row in history[-5:]:
print(f" {row['date']}: O={row['open']} H={row['high']} L={row['low']} C={row['close']} Vol={row['volume']:,}")
Financial Statements
The v10 quoteSummary endpoint returns detailed financial data — income statements, balance sheets, cash flow, and more. The modules parameter controls what sections to fetch:
def get_financials(
symbol: str,
modules: list = None,
client: Optional[httpx.Client] = None,
) -> dict:
"""
Fetch financial statements and company summary via quoteSummary.
Available modules (request only what you need):
- incomeStatementHistory, incomeStatementHistoryQuarterly
- balanceSheetHistory, balanceSheetHistoryQuarterly
- cashflowStatementHistory, cashflowStatementHistoryQuarterly
- earnings, earningsTrend
- financialData (key financial ratios)
- defaultKeyStatistics (valuation metrics)
- assetProfile (company description, sector, officers)
- recommendationTrend (analyst buy/sell/hold breakdown)
- earningsHistory (historical EPS vs estimates)
- institutionOwnership
- insiderHolders, insiderTransactions
- esgScores
"""
if modules is None:
modules = [
"incomeStatementHistory",
"balanceSheetHistory",
"cashflowStatementHistory",
"financialData",
"defaultKeyStatistics",
"assetProfile",
"earningsTrend",
]
if client is None:
client = make_client()
url = f"{YAHOO_BASE}/v10/finance/quoteSummary/{symbol}"
params = {"modules": ",".join(modules)}
resp = client.get(url, params=params)
resp.raise_for_status()
data = resp.json()
if data.get("quoteSummary", {}).get("error"):
raise ValueError(f"quoteSummary error for {symbol}: {data['quoteSummary']['error']}")
result = data["quoteSummary"]["result"]
if not result:
return {}
return result[0]
def extract_key_metrics(fin_data: dict) -> dict:
"""Extract and flatten key financial metrics from quoteSummary response."""
def fmt(section: str, key: str, field: str = "fmt") -> str:
return fin_data.get(section, {}).get(key, {}).get(field)
def raw(section: str, key: str) -> float:
return fin_data.get(section, {}).get(key, {}).get("raw")
profile = fin_data.get("assetProfile", {})
return {
# Company info
"company": profile.get("longBusinessSummary", "")[:200],
"sector": profile.get("sector"),
"industry": profile.get("industry"),
"employees": profile.get("fullTimeEmployees"),
"country": profile.get("country"),
# Valuation
"revenue": fmt("financialData", "totalRevenue"),
"revenue_raw": raw("financialData", "totalRevenue"),
"gross_profit": fmt("financialData", "grossProfits"),
"ebitda": fmt("financialData", "ebitda"),
"net_income": fmt("financialData", "netIncomeToCommon"),
"profit_margin": fmt("financialData", "profitMargins"),
"operating_margin": fmt("financialData", "operatingMargins"),
"return_on_equity": fmt("financialData", "returnOnEquity"),
"return_on_assets": fmt("financialData", "returnOnAssets"),
"debt_to_equity": fmt("financialData", "debtToEquity"),
"current_ratio": fmt("financialData", "currentRatio"),
"free_cash_flow": fmt("financialData", "freeCashflow"),
# Per-share and valuation metrics
"pe_ratio": fmt("defaultKeyStatistics", "forwardPE"),
"peg_ratio": fmt("defaultKeyStatistics", "pegRatio"),
"price_to_book": fmt("defaultKeyStatistics", "priceToBook"),
"eps_trailing": fmt("defaultKeyStatistics", "trailingEps"),
"eps_forward": fmt("defaultKeyStatistics", "forwardEps"),
"beta": fmt("defaultKeyStatistics", "beta"),
"shares_outstanding": fmt("defaultKeyStatistics", "sharesOutstanding"),
"float_shares": fmt("defaultKeyStatistics", "floatShares"),
"short_ratio": fmt("defaultKeyStatistics", "shortRatio"),
"short_pct_float": fmt("defaultKeyStatistics", "shortPercentOfFloat"),
"dividend_yield": fmt("defaultKeyStatistics", "dividendYield"),
"payout_ratio": fmt("defaultKeyStatistics", "payoutRatio"),
"52w_high": fmt("defaultKeyStatistics", "fiftyTwoWeekHigh"),
"52w_low": fmt("defaultKeyStatistics", "fiftyTwoWeekLow"),
"50d_avg": fmt("defaultKeyStatistics", "fiftyDayAverage"),
"200d_avg": fmt("defaultKeyStatistics", "twoHundredDayAverage"),
# Income statement history (last 4 annual)
"income_statements": [
{
"date": stmt.get("endDate", {}).get("fmt"),
"revenue": stmt.get("totalRevenue", {}).get("fmt"),
"gross_profit": stmt.get("grossProfit", {}).get("fmt"),
"net_income": stmt.get("netIncome", {}).get("fmt"),
"ebitda": stmt.get("ebitda", {}).get("fmt"),
"eps_basic": stmt.get("basicEPS", {}).get("fmt"),
"eps_diluted": stmt.get("dilutedEPS", {}).get("fmt"),
}
for stmt in fin_data.get("incomeStatementHistory", {})
.get("incomeStatementHistory", [])
],
# Analyst recommendations
"analyst_target_price": fmt("financialData", "targetMeanPrice"),
"analyst_recommendation": fin_data.get("financialData", {}).get("recommendationKey"),
"analyst_count": raw("financialData", "numberOfAnalystOpinions"),
}
# Usage
fin = get_financials("NVDA")
metrics = extract_key_metrics(fin)
print(f"Sector: {metrics['sector']} | Industry: {metrics['industry']}")
print(f"Revenue: {metrics['revenue']} | EBITDA: {metrics['ebitda']}")
print(f"P/E (fwd): {metrics['pe_ratio']} | EPS (fwd): {metrics['eps_forward']}")
print(f"Profit Margin: {metrics['profit_margin']} | ROE: {metrics['return_on_equity']}")
print(f"Analyst Target: {metrics['analyst_target_price']} ({metrics['analyst_recommendation']})")
Options Chains
def get_options(
symbol: str,
expiration_date: int = None,
client: Optional[httpx.Client] = None,
) -> dict:
"""
Fetch options chain for a symbol.
expiration_date: Unix timestamp (get available dates from the response first).
"""
if client is None:
client = make_client()
url = f"{YAHOO_BASE}/v7/finance/options/{symbol}"
params = {}
if expiration_date:
params["date"] = expiration_date
resp = client.get(url, params=params)
resp.raise_for_status()
data = resp.json()
result = data["optionChain"]["result"][0]
expirations = result.get("expirationDates", [])
options = result.get("options", [{}])[0]
def parse_contract(c: dict) -> dict:
return {
"contract_symbol": c.get("contractSymbol"),
"strike": c.get("strike"),
"expiration": datetime.datetime.fromtimestamp(
c.get("expiration", 0)
).strftime("%Y-%m-%d") if c.get("expiration") else None,
"last_price": c.get("lastPrice"),
"bid": c.get("bid"),
"ask": c.get("ask"),
"change": c.get("change"),
"pct_change": c.get("percentChange"),
"volume": c.get("volume"),
"open_interest": c.get("openInterest"),
"implied_volatility": round(c.get("impliedVolatility", 0), 4) if c.get("impliedVolatility") else None,
"in_the_money": c.get("inTheMoney", False),
}
quote = result.get("quote", {})
underlying_price = quote.get("regularMarketPrice")
return {
"symbol": symbol,
"underlying_price": underlying_price,
"expiration_dates": [
datetime.datetime.fromtimestamp(ts).strftime("%Y-%m-%d")
for ts in expirations
],
"calls": sorted(
[parse_contract(c) for c in options.get("calls", [])],
key=lambda x: x["strike"] or 0
),
"puts": sorted(
[parse_contract(c) for c in options.get("puts", [])],
key=lambda x: x["strike"] or 0
),
}
# Get options for AAPL
opts = get_options("AAPL")
underlying = opts["underlying_price"]
print(f"AAPL @ ${underlying:.2f}")
print(f"Available expirations: {', '.join(opts['expiration_dates'][:5])}")
print(f"\nATM calls (strikes near ${underlying:.0f}):")
atm_calls = [c for c in opts["calls"] if abs((c["strike"] or 0) - underlying) < 10]
for c in atm_calls[:5]:
print(f" ${c['strike']:.0f}: last=${c['last_price']}, IV={c['implied_volatility']}, OI={c['open_interest']:,}")
Anti-Bot Measures and Proxy Rotation
Yahoo Finance is more lenient than most sites, but there are still real limits.
Rate limiting. The v8 and v10 endpoints allow roughly 100 requests per minute per IP before throttling. You'll get 429 responses or temporary blocks. The v7 batch endpoint is more efficient — one request for 500 symbols beats 500 individual requests.
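When a 429 does come back, the standard remedy is exponential backoff with jitter before retrying. A minimal sketch (the `backoff_delay` and `retry_on_429` helpers are illustrative, not part of any Yahoo API; the wrapper works with any client whose responses expose a `.status_code`, such as httpx or requests):

```python
import random
import time

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

def retry_on_429(do_request, max_retries: int = 5, base: float = 2.0):
    """Call do_request() until the response status is not 429/5xx,
    sleeping a jittered, exponentially growing delay between attempts.
    Returns the last response so the caller can still inspect failures."""
    resp = None
    for attempt in range(max_retries):
        resp = do_request()
        if resp.status_code not in (429, 500, 502, 503):
            break
        time.sleep(backoff_delay(attempt, base=base))
    return resp
```

Wrap any existing call site, e.g. `resp = retry_on_429(lambda: client.get(url, params=params))`.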
IP-based blocking. Datacenter IPs get lower rate limits than residential IPs. If you're scraping from a cloud server, you'll hit limits much faster than from a home connection.
Consent cookies in EU regions. Yahoo requires cookie consent for users in EU-regulated regions and may redirect non-US IPs to localized versions that behave differently.
Crumb authentication. Some endpoints occasionally require a "crumb" token tied to your session cookie. If you see unexpected 401 responses, fetch the crumb first and pass it along as the crumb query parameter.
def get_yahoo_crumb(client: httpx.Client) -> str:
"""
Fetch a Yahoo Finance crumb token.
Required by some endpoints when cookie consent is enforced.
"""
# Visit the main Yahoo Finance page to get the cookie
client.get("https://finance.yahoo.com/")
# Fetch the crumb from the dedicated endpoint
resp = client.get(
"https://query1.finance.yahoo.com/v1/test/getcrumb",
headers={**HEADERS, "Referer": "https://finance.yahoo.com/"},
)
if resp.status_code == 200 and resp.text.strip():
return resp.text.strip()
return ""
For high-volume financial data collection — tracking 500+ tickers with intraday data — proxy rotation distributes requests across IPs. ThorData's residential proxy pool handles this well: residential IPs carry cleaner reputation scores, so Yahoo throttles them far less aggressively than datacenter ranges:
def get_tickers_with_proxy(
tickers: list,
proxy_url: str,
delay_range: tuple = (0.5, 1.5),
) -> dict:
"""Fetch quotes for many tickers using a rotating proxy."""
client = make_client(proxy_url=proxy_url)
results = {}
# Use batch endpoint for efficiency (up to 100 symbols per request)
batch_size = 50
for i in range(0, len(tickers), batch_size):
batch = tickers[i:i+batch_size]
try:
quotes = get_batch_quotes(batch, client=client)
for q in quotes:
results[q["symbol"]] = q
print(f"Batch {i//batch_size + 1}: {len(quotes)} quotes fetched")
except Exception as e:
print(f"Batch {i//batch_size + 1} error: {e}")
if i + batch_size < len(tickers):
time.sleep(random.uniform(*delay_range))
return results
# Fetch all S&P 500 components
sp500_tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "META", "TSLA"] # add full list
proxy = "http://user:[email protected]:9000"
all_quotes = get_tickers_with_proxy(sp500_tickers, proxy_url=proxy)
print(f"Fetched {len(all_quotes)} quotes")
SQLite Storage Schema
import sqlite3
import json
def init_finance_db(db_path: str = "yahoo_finance.db") -> sqlite3.Connection:
"""Initialize SQLite database for financial data."""
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.executescript("""
CREATE TABLE IF NOT EXISTS tickers (
symbol TEXT PRIMARY KEY,
name TEXT,
exchange TEXT,
sector TEXT,
industry TEXT,
country TEXT,
currency TEXT,
employees INTEGER,
description TEXT,
added_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS price_snapshots (
id INTEGER PRIMARY KEY AUTOINCREMENT,
symbol TEXT NOT NULL,
price REAL,
change REAL,
change_pct REAL,
volume INTEGER,
market_cap REAL,
market_state TEXT,
snapshot_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (symbol) REFERENCES tickers(symbol)
);
CREATE TABLE IF NOT EXISTS ohlcv (
id INTEGER PRIMARY KEY AUTOINCREMENT,
symbol TEXT NOT NULL,
date TEXT,
interval_type TEXT DEFAULT '1d',
open REAL,
high REAL,
low REAL,
close REAL,
adj_close REAL,
volume INTEGER,
dividend REAL,
split_ratio TEXT,
UNIQUE(symbol, date, interval_type),
FOREIGN KEY (symbol) REFERENCES tickers(symbol)
);
CREATE TABLE IF NOT EXISTS financials (
id INTEGER PRIMARY KEY AUTOINCREMENT,
symbol TEXT NOT NULL,
period_type TEXT DEFAULT 'annual',
period_end TEXT,
revenue REAL,
gross_profit REAL,
net_income REAL,
ebitda REAL,
eps_basic TEXT,
eps_diluted TEXT,
fetched_at TEXT DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (symbol) REFERENCES tickers(symbol)
);
CREATE TABLE IF NOT EXISTS key_metrics (
symbol TEXT PRIMARY KEY,
pe_ratio TEXT,
peg_ratio TEXT,
price_to_book TEXT,
profit_margin TEXT,
operating_margin TEXT,
roe TEXT,
roa TEXT,
debt_to_equity TEXT,
current_ratio TEXT,
beta TEXT,
dividend_yield TEXT,
short_pct_float TEXT,
analyst_target TEXT,
analyst_reco TEXT,
updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX IF NOT EXISTS idx_price_symbol ON price_snapshots (symbol);
CREATE INDEX IF NOT EXISTS idx_price_at ON price_snapshots (snapshot_at);
CREATE INDEX IF NOT EXISTS idx_ohlcv_symbol_date ON ohlcv (symbol, date);
""")
conn.commit()
return conn
def save_ticker(conn: sqlite3.Connection, symbol: str, profile: dict = None):
"""Upsert a ticker record with optional profile data."""
conn.execute(
"""INSERT OR IGNORE INTO tickers (symbol) VALUES (?)""",
(symbol,)
)
if profile:
conn.execute(
"""UPDATE tickers SET name=?, exchange=?, sector=?, industry=?,
country=?, currency=?, employees=?, description=?
WHERE symbol=?""",
(
profile.get("shortName") or profile.get("longName"),
profile.get("exchange"),
profile.get("sector"),
profile.get("industry"),
profile.get("country"),
profile.get("currency"),
profile.get("fullTimeEmployees"),
profile.get("longBusinessSummary", "")[:500],
symbol,
),
)
conn.commit()
def save_quote_snapshot(conn: sqlite3.Connection, quote: dict):
"""Record a price snapshot."""
conn.execute(
"""INSERT INTO price_snapshots
(symbol, price, change, change_pct, volume, market_cap, market_state)
VALUES (?, ?, ?, ?, ?, ?, ?)""",
(
quote.get("symbol"),
quote.get("price"),
quote.get("change"),
quote.get("change_pct"),
quote.get("volume"),
quote.get("market_cap"),
quote.get("market_state"),
),
)
conn.commit()
def save_ohlcv_history(conn: sqlite3.Connection, symbol: str, history: list, interval: str = "1d"):
    """Bulk insert OHLCV history. INSERT OR IGNORE skips duplicate (symbol, date, interval) rows."""
    conn.executemany(
        """INSERT OR IGNORE INTO ohlcv
           (symbol, date, interval_type, open, high, low, close, adj_close, volume, dividend, split_ratio)
           VALUES (?,?,?,?,?,?,?,?,?,?,?)""",
        [
            (
                symbol, row.get("date"), interval,
                row.get("open"), row.get("high"), row.get("low"),
                row.get("close"), row.get("adj_close"), row.get("volume"),
                row.get("dividend"), row.get("split_ratio"),
            )
            for row in history
        ],
    )
    conn.commit()
def save_key_metrics(conn: sqlite3.Connection, symbol: str, metrics: dict):
"""Upsert key financial metrics for a symbol."""
conn.execute(
"""INSERT OR REPLACE INTO key_metrics
(symbol, pe_ratio, peg_ratio, price_to_book, profit_margin, operating_margin,
roe, roa, debt_to_equity, current_ratio, beta, dividend_yield,
short_pct_float, analyst_target, analyst_reco, updated_at)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,CURRENT_TIMESTAMP)""",
(
symbol,
metrics.get("pe_ratio"), metrics.get("peg_ratio"), metrics.get("price_to_book"),
metrics.get("profit_margin"), metrics.get("operating_margin"),
metrics.get("return_on_equity"), metrics.get("return_on_assets"),
metrics.get("debt_to_equity"), metrics.get("current_ratio"),
metrics.get("beta"), metrics.get("dividend_yield"),
metrics.get("short_pct_float"),
metrics.get("analyst_target_price"), metrics.get("analyst_recommendation"),
),
)
conn.commit()
Complete Monitoring Pipeline
def build_financial_database(
tickers: list,
db_path: str = "yahoo_finance.db",
proxy_url: str = None,
fetch_history: bool = True,
history_period: str = "2y",
fetch_financials: bool = True,
delay_range: tuple = (1.0, 2.5),
):
"""
Full pipeline: quotes + history + financials -> SQLite storage.
Designed for daily or weekly runs to build a financial research database.
"""
conn = init_finance_db(db_path)
client = make_client(proxy_url=proxy_url)
print(f"Processing {len(tickers)} tickers...")
# Phase 1: Batch quotes (efficient)
print("\nFetching batch quotes...")
batch_size = 50
for i in range(0, len(tickers), batch_size):
batch = tickers[i:i+batch_size]
try:
quotes = get_batch_quotes(batch, client=client)
for q in quotes:
save_ticker(conn, q["symbol"])
save_quote_snapshot(conn, q)
print(f" Batch {i//batch_size + 1}: {len(quotes)} quotes")
time.sleep(random.uniform(0.5, 1.0))
except Exception as e:
print(f" Batch error: {e}")
# Phase 2: History and financials (per-ticker)
for i, symbol in enumerate(tickers):
print(f"\n[{i+1}/{len(tickers)}] {symbol}")
if fetch_history:
try:
history = get_historical(symbol, period=history_period, client=client)
save_ohlcv_history(conn, symbol, history)
print(f" History: {len(history)} candles")
except Exception as e:
print(f" History error: {e}")
time.sleep(random.uniform(*delay_range))
if fetch_financials:
try:
fin = get_financials(symbol, client=client)
metrics = extract_key_metrics(fin)
save_key_metrics(conn, symbol, metrics)
print(f" Financials: P/E={metrics.get('pe_ratio')} Margin={metrics.get('profit_margin')}")
# Save income statement history
for stmt in metrics.get("income_statements", []):
if stmt.get("date"):
conn.execute(
"""INSERT OR IGNORE INTO financials
(symbol, period_type, period_end, revenue, gross_profit, net_income)
VALUES (?,?,?,?,?,?)""",
(symbol, "annual", stmt["date"],
stmt.get("revenue"), stmt.get("gross_profit"), stmt.get("net_income"))
)
conn.commit()
except Exception as e:
print(f" Financials error: {e}")
time.sleep(random.uniform(*delay_range))
# Report
snap_count = conn.execute("SELECT COUNT(*) FROM price_snapshots").fetchone()[0]
ohlcv_count = conn.execute("SELECT COUNT(*) FROM ohlcv").fetchone()[0]
print(f"\nDatabase summary: {snap_count:,} snapshots, {ohlcv_count:,} OHLCV rows")
conn.close()
# Run it
WATCHLIST = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA", "META", "TSLA", "BRK-B", "JPM", "V"]
PROXY = "http://user:[email protected]:9000"
build_financial_database(
WATCHLIST,
proxy_url=PROXY,
fetch_history=True,
history_period="1y",
fetch_financials=True,
)
Practical Tips
Use query1 vs query2. Both query1.finance.yahoo.com and query2.finance.yahoo.com serve the same data. If one is slow or returning errors, switch to the other.
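That failover can be a short loop. A sketch (`fetch_first_ok` is a hypothetical helper; `get` is any callable mapping a URL to a response with a `.status_code`, e.g. `client.get`):

```python
YAHOO_HOSTS = [
    "https://query1.finance.yahoo.com",
    "https://query2.finance.yahoo.com",
]

def fetch_first_ok(get, path: str, hosts=YAHOO_HOSTS):
    """Try the same API path on each host in order; return the first
    response with status 200, or the last response/None if all fail."""
    last = None
    for host in hosts:
        try:
            last = get(f"{host}{path}")
        except Exception:
            continue  # connection error: fall through to the next host
        if last.status_code == 200:
            return last
    return last
```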
Handle None in OHLCV data. Yahoo sometimes returns null for individual OHLCV values for weekends, market holidays, and trading suspensions. Always check for None before doing arithmetic.
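A small guard helps here. A sketch (`clean_rows` is a hypothetical helper over the row dicts produced by `get_historical` above):

```python
def clean_rows(rows: list) -> list:
    """Drop candles where any OHLC field is None (holidays, halts)
    so downstream arithmetic never hits a None."""
    return [
        r for r in rows
        if None not in (r.get("open"), r.get("high"), r.get("low"), r.get("close"))
    ]
```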
Intraday limits. 1-minute interval data only goes back about 7 days. For longer intraday history, you need a paid data provider. For daily OHLCV, Yahoo provides data going back to the IPO date.
Crumb token errors. If you get 401 or unexpected authentication errors, visit finance.yahoo.com first in your HTTP session and extract the crumb from the cookie/page source. Some server-side configurations require it.
The v7 batch endpoint for portfolios. If you monitor a fixed set of tickers (a portfolio watchlist), the batch v7 endpoint in 50-100 symbol chunks is dramatically more efficient than per-symbol calls.
Prefer adjusted close for return calculations. The adj_close field accounts for splits and dividends. Always use it for return calculations — unadjusted close will give wrong results for stocks with splits.
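For example, period-over-period returns from the adj_close series (a sketch; `simple_returns` is a hypothetical helper that skips pairs with missing data):

```python
def simple_returns(adj_closes: list) -> list:
    """Simple period-over-period returns from adjusted closes;
    pairs containing None (or a zero previous close) are skipped."""
    return [
        (curr - prev) / prev
        for prev, curr in zip(adj_closes, adj_closes[1:])
        if prev and curr is not None
    ]
```

Feed it the history rows from earlier, e.g. `simple_returns([r["adj_close"] for r in history])`.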
Legal Notes
Yahoo Finance's Terms of Service prohibit scraping and automated access. Their API was officially shut down in 2017, and the unofficial endpoints described here are not sanctioned by Yahoo.
In practice, the financial data returned is widely accessed through the yfinance Python library (which uses the same endpoints under the hood) and considered generally acceptable for personal, research, and non-commercial use. No enforcement actions against individual researchers using these endpoints have been reported.
For commercial applications, licensed data providers like Alpha Vantage (free tier available), Quandl/Nasdaq Data Link, or Polygon.io provide the same data with proper usage rights and SLAs. These are worth the cost if you're building a product rather than doing research.