Scraping Morningstar: Mutual Fund Ratings, Performance & Expense Ratios with Python (2026)
Morningstar rates and tracks over 600,000 investment offerings worldwide. Their star ratings, expense ratio data, and performance metrics are the standard reference for fund comparison. Financial advisors, researchers, and individual investors all rely on this data.
The catch: Morningstar doesn't offer a free public API. Their data services cost thousands per year. But the website is publicly accessible, and the data is right there in the HTML and embedded JSON.
This guide covers scraping fund ratings, performance history, expense ratios, and holdings from Morningstar's public pages — along with the anti-bot measures you'll need to navigate.
Legal and Ethical Note
Morningstar's terms of service restrict automated data collection. This guide is for educational purposes — learning how web scraping works against a complex, real-world financial target. If you need Morningstar data for commercial use, look into their official data feeds or licensed APIs. Use respectful request rates and do not republish scraped data commercially.
Understanding Morningstar's Page Structure
Morningstar fund pages follow this pattern:
https://www.morningstar.com/funds/xnas/[TICKER]/quote
Key sub-pages per fund:
| Page | URL Pattern | Data |
|---|---|---|
| Quote | /quote | Star rating, category, current price |
| Performance | /performance | Returns over periods, vs benchmark |
| Portfolio | /portfolio | Holdings, sector weights, top positions |
| Price/Fees | /price | Expense ratio, loads, minimums |
| Risk | /risk | Standard deviation, Sharpe ratio, alpha/beta |
A lot of the data is rendered server-side in the initial HTML, but some comes from internal API calls that the page makes on load. The key insight: Morningstar embeds fund data as inline JSON in <script> tags, which is often easier to extract than scraping table HTML.
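Those embedded JSON blobs tend to be deeply nested, and the exact path changes between deploys. A recursive search for a known key is usually more robust than hard-coding the path. A minimal sketch (the key names in the usage are examples, not guaranteed Morningstar field names):

```python
def find_key(data, target: str):
    """Depth-first search a nested dict/list for the first value stored under `target`."""
    if isinstance(data, dict):
        if target in data:
            return data[target]
        for value in data.values():
            found = find_key(value, target)
            if found is not None:
                return found
    elif isinstance(data, list):
        for item in data:
            found = find_key(item, target)
            if found is not None:
                return found
    return None
```

With a parsed script blob in hand, `find_key(blob, "starRating")` keeps working even when the surrounding structure shifts.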
Dependencies and Setup
pip install httpx[http2] beautifulsoup4 lxml playwright
playwright install chromium
Base Request Setup
import httpx
from bs4 import BeautifulSoup
import json
import time
import re
import random
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:127.0) Gecko/20100101 Firefox/127.0",
]
BASE_HEADERS = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
}
def get_headers() -> dict:
return {**BASE_HEADERS, "User-Agent": random.choice(USER_AGENTS)}
def build_morningstar_client(proxy_url: str = None) -> httpx.Client:
"""Build an httpx client configured for Morningstar scraping."""
client_kwargs = {
"headers": get_headers(),
"follow_redirects": True,
"timeout": 25,
}
if proxy_url:
client_kwargs["proxies"] = {"http://": proxy_url, "https://": proxy_url}
return httpx.Client(**client_kwargs)
Basic Fund Scraper
def scrape_fund_overview(ticker: str, client: httpx.Client = None) -> dict:
"""Scrape basic fund data from Morningstar quote page."""
url = f"https://www.morningstar.com/funds/xnas/{ticker.lower()}/quote"
if client is None:
client = build_morningstar_client()
# Visit homepage first to establish a session/cookies (helps with Akamai)
try:
client.get("https://www.morningstar.com", timeout=10)
time.sleep(1)
except Exception:
pass
resp = client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
fund = {"ticker": ticker.upper(), "url": url}
# Fund name — try multiple patterns
for selector in ["h1", "[data-testid='security-name']", ".mdc-fund-header__name"]:
name_el = soup.select_one(selector)
if name_el:
fund["name"] = name_el.get_text(strip=True)
break
# Star rating — Morningstar uses aria-label or specific classes
for selector in ["[class*='star-rating']", "[aria-label*='star']", ".mdc-rating"]:
star_el = soup.select_one(selector)
if star_el:
label = star_el.get("aria-label", "")
match = re.search(r"(\d)\s+star", label, re.IGNORECASE)
if match:
fund["star_rating"] = int(match.group(1))
break
# Some pages just have a number
text = star_el.get_text(strip=True)
if text.isdigit() and 1 <= int(text) <= 5:
fund["star_rating"] = int(text)
break
# Category
for selector in ["[data-testid='category']", ".mdc-category", "[class*='category']"]:
cat_el = soup.select_one(selector)
if cat_el and len(cat_el.get_text(strip=True)) > 2:
fund["category"] = cat_el.get_text(strip=True)
break
# Try to extract embedded JSON data
for script in soup.select("script[type='application/json'], script[type='application/ld+json']"):
try:
script_data = json.loads(script.string)
if isinstance(script_data, dict):
if "name" in script_data or "starRating" in script_data:
fund["embedded_data"] = script_data
# Extract common fields
if "starRating" in script_data:
fund["star_rating"] = script_data["starRating"]
if "category" in script_data:
fund["category"] = script_data["category"]
break
except (json.JSONDecodeError, TypeError):
continue
# Extract any inline JS data blocks (Morningstar sometimes puts fund data in window.__INITIAL_DATA__)
init_data_match = re.search(
r'window\.__INITIAL_DATA__\s*=\s*({.*?})(?:;|</script>)',
resp.text, re.DOTALL
)
if init_data_match:
try:
init_data = json.loads(init_data_match.group(1))
fund["initial_data"] = init_data
except json.JSONDecodeError:
pass
return fund
Extracting Performance Data
Performance data is typically rendered in tables on the performance page:
def scrape_fund_performance(ticker: str, client: httpx.Client = None) -> dict:
"""Extract historical return data from Morningstar performance page."""
url = f"https://www.morningstar.com/funds/xnas/{ticker.lower()}/performance"
if client is None:
client = build_morningstar_client()
resp = client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
perf = {"ticker": ticker.upper(), "returns": {}, "trailing_returns": {}}
# Performance tables
for table in soup.select("table"):
header_row = table.select_one("thead tr")
if not header_row:
continue
headers_text = [th.get_text(strip=True) for th in header_row.select("th, td")]
if not any(period in str(headers_text) for period in ["YTD", "1 Year", "3 Year", "5 Year", "10 Year"]):
continue
for row in table.select("tbody tr"):
cells = [td.get_text(strip=True) for td in row.select("td, th")]
if len(cells) < 2:
continue
label = cells[0]
values = cells[1:]
row_data = {}
for i, val in enumerate(values):
if i < len(headers_text) - 1:
row_data[headers_text[i + 1]] = val
if label:
perf["returns"][label] = row_data
# Also try to extract from JSON embedded in page
for script in soup.select("script"):
if script.string and "trailingReturn" in script.string:
try:
# Look for the JSON object containing trailing returns
match = re.search(r'"trailingReturn":\s*(\{[^}]+\})', script.string)
if match:
perf["trailing_returns"] = json.loads(match.group(1))
except (json.JSONDecodeError, ValueError):
pass
return perf
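The return cells scraped above come back as strings like "12.34%", "1,234.5%", or a dash for missing data. A small normalizer keeps the downstream code clean; `parse_pct` is a hypothetical helper of my own, not part of the scraper above:

```python
import re

def parse_pct(raw: str):
    """Convert a scraped percentage string to a float, or None if unparseable."""
    if not raw:
        return None
    # strip thousands separators and normalize the Unicode minus sign
    cleaned = raw.replace(",", "").replace("\u2212", "-")
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*%?", cleaned)
    return float(match.group(1)) if match else None
```

This lets the storage layer later call one function instead of repeating `val.replace("%", "")` with ad-hoc error handling.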
Getting Expense Ratios and Fees
Expense ratios are on the price/fees page. This is the data people search for most:
def scrape_fund_fees(ticker: str, client: httpx.Client = None) -> dict:
"""Extract fee and expense data from Morningstar."""
url = f"https://www.morningstar.com/funds/xnas/{ticker.lower()}/price"
if client is None:
client = build_morningstar_client()
resp = client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
fees = {"ticker": ticker.upper()}
text = soup.get_text()
# Expense ratio patterns — try multiple common formats
patterns = {
"expense_ratio": [
r"Expense Ratio[:\s]*(\d+\.\d+)\s*%",
r"Total Expense Ratio[:\s]*(\d+\.\d+)\s*%",
r'"expenseRatio":\s*"?(\d+\.\d+)"?',
],
"net_expense_ratio": [
r"Net Expense Ratio[:\s]*(\d+\.\d+)\s*%",
r'"netExpenseRatio":\s*"?(\d+\.\d+)"?',
],
"management_fee": [
r"Management Fee[:\s]*(\d+\.\d+)\s*%",
],
}
for field, field_patterns in patterns.items():
for pattern in field_patterns:
match = re.search(pattern, text, re.IGNORECASE)
if match:
fees[field] = float(match.group(1))
break
# Minimum investment
for pattern in [
r"Minimum (?:Initial )?Investment[:\s]*\$?([\d,]+)",
r'"minimumInvestment":\s*"?(\d+)"?',
]:
min_match = re.search(pattern, text, re.IGNORECASE)
if min_match:
fees["min_investment"] = int(min_match.group(1).replace(",", ""))
break
# Fee-related DL items
for dt in soup.select("dt, [class*='label'], [class*='key']"):
dd = dt.find_next_sibling("dd") or dt.find_next_sibling()
if dd:
key = dt.get_text(strip=True).lower()
val = dd.get_text(strip=True)
if any(term in key for term in ["load", "fee", "turnover", "yield", "12b", "redemption"]):
safe_key = re.sub(r'[^a-z0-9_]', '_', key)
fees[safe_key] = val
# Portfolio turnover rate
turnover_match = re.search(
r"(?:Portfolio )?Turnover[:\s]*([\d.]+)\s*%",
text, re.IGNORECASE
)
if turnover_match:
fees["portfolio_turnover_pct"] = float(turnover_match.group(1))
return fees
def scrape_fund_holdings(ticker: str, client: httpx.Client = None) -> dict:
"""Extract top holdings and sector allocations from portfolio page."""
url = f"https://www.morningstar.com/funds/xnas/{ticker.lower()}/portfolio"
if client is None:
client = build_morningstar_client()
resp = client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
holdings = {"ticker": ticker.upper(), "top_holdings": [], "sector_weights": {}}
# Try to find top holdings table
for table in soup.select("table"):
headers = [th.get_text(strip=True) for th in table.select("thead th")]
if any(h in str(headers) for h in ["% Net Assets", "Portfolio", "Holding"]):
for row in table.select("tbody tr")[:15]:
cells = [td.get_text(strip=True) for td in row.select("td")]
if len(cells) >= 2:
holdings["top_holdings"].append({
"name": cells[0],
"weight_pct": cells[1] if len(cells) > 1 else None,
"sector": cells[2] if len(cells) > 2 else None,
})
break
# Sector weights
for row in soup.select("[class*='sector'] tr, [data-testid*='sector'] tr"):
cells = [td.get_text(strip=True) for td in row.select("td")]
if len(cells) >= 2:
sector = cells[0]
weight_match = re.search(r"(\d+\.?\d*)", cells[1])
if sector and weight_match:
holdings["sector_weights"][sector] = float(weight_match.group(1))
return holdings
Dealing with Morningstar's Anti-Bot Stack
Morningstar is one of the tougher scraping targets in finance. They use multiple layers:
- Akamai Bot Manager — fingerprints your browser, checks TLS signatures, and analyzes behavioral patterns.
- Rate limiting — aggressive per-IP throttling, even for normal browsing speeds.
- Dynamic selectors — CSS class names change between deploys.
- Cookie walls — some pages require session cookies set by JavaScript.
What works:
- Slow requests — 5-10 seconds between page loads minimum. Morningstar flags anything faster.
- Session persistence — use httpx.Client() to maintain cookies across requests.
- Residential proxies — datacenter IPs are blocked almost immediately. ThorData's rotating residential proxies are what you need for financial sites — the residential IPs mimic real user traffic patterns, which is critical for getting past Akamai's fingerprinting. Their session-sticky option helps when you need cookies to persist across multiple page loads for the same fund.
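The "slow down when blocked" rule can be encoded as an exponential backoff wrapper. This is a sketch with my own names and defaults (`fetch_with_backoff`, the 15-second base delay), not part of the scrapers above; the injectable `sleep` makes it testable:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries: int = 4, base_delay: float = 15.0,
                       blocked_statuses=(403, 429, 503), sleep=time.sleep):
    """Call fetch() until it returns a non-blocked status, backing off
    exponentially (with jitter) each time the site pushes back."""
    resp = None
    for attempt in range(max_retries):
        resp = fetch()
        if resp.status_code not in blocked_statuses:
            return resp
        # 15s, 30s, 60s, ... plus jitter so retries don't look mechanical
        sleep(base_delay * (2 ** attempt) + random.uniform(0, 5))
    return resp  # still blocked after max_retries; let the caller decide
```

You would pass in a closure like `lambda: client.get(url)`; on a clean response it returns immediately, so the normal path costs nothing.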
def build_proxied_session(proxy_url: str) -> httpx.Client:
"""Build a session-based client with proxy for sustained Morningstar scraping."""
client = httpx.Client(
proxies={"http://": proxy_url, "https://": proxy_url},
headers=get_headers(),
follow_redirects=True,
timeout=30,
)
# Establish session by visiting homepage first
try:
client.get("https://www.morningstar.com")
time.sleep(random.uniform(2, 4))
except Exception:
pass
return client
def scrape_fund_complete(ticker: str, client: httpx.Client) -> dict:
"""Scrape all available data for a fund using a shared session."""
result = {"ticker": ticker.upper()}
try:
overview = scrape_fund_overview(ticker, client=client)
result.update(overview)
time.sleep(random.uniform(5, 10))
except Exception as e:
print(f" Overview failed for {ticker}: {e}")
try:
fees = scrape_fund_fees(ticker, client=client)
result.update(fees)
time.sleep(random.uniform(5, 10))
except Exception as e:
print(f" Fees failed for {ticker}: {e}")
try:
perf = scrape_fund_performance(ticker, client=client)
result["performance"] = perf
time.sleep(random.uniform(5, 10))
except Exception as e:
print(f" Performance failed for {ticker}: {e}")
return result
Playwright Fallback for Akamai-Blocked Pages
When httpx gets Akamai challenges, use Playwright:
from playwright.sync_api import sync_playwright
def scrape_fund_with_playwright(ticker: str, proxy: str = None) -> dict:
"""Use Playwright to scrape a Morningstar fund page when httpx is blocked."""
url = f"https://www.morningstar.com/funds/xnas/{ticker.lower()}/quote"
launch_kwargs = {"headless": True, "args": ["--no-sandbox", "--disable-dev-shm-usage"]}
if proxy:
launch_kwargs["proxy"] = {"server": proxy}
with sync_playwright() as p:
browser = p.chromium.launch(**launch_kwargs)
context = browser.new_context(
user_agent=random.choice(USER_AGENTS),
viewport={"width": 1280, "height": 900},
locale="en-US",
timezone_id="America/New_York",
)
page = context.new_page()
# Visit homepage first
page.goto("https://www.morningstar.com", wait_until="domcontentloaded", timeout=20000)
page.wait_for_timeout(2000)
# Navigate to fund page
page.goto(url, wait_until="networkidle", timeout=30000)
page.wait_for_timeout(3000)
html = page.content()
browser.close()
# Parse the HTML
soup = BeautifulSoup(html, "lxml")
fund = {"ticker": ticker.upper(), "url": url}
# Extract using same logic as httpx approach
name_el = soup.select_one("h1")
if name_el:
fund["name"] = name_el.get_text(strip=True)
star_el = soup.select_one("[aria-label*='star']")
if star_el:
label = star_el.get("aria-label", "")
match = re.search(r"(\d)\s+star", label, re.IGNORECASE)
if match:
fund["star_rating"] = int(match.group(1))
return fund
Batch Scraping Multiple Funds
When collecting data across many funds, structure it as a pipeline:
import sqlite3
from datetime import datetime
def init_fund_db(db_path: str = "morningstar_funds.db") -> sqlite3.Connection:
"""Initialize the Morningstar fund database."""
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS funds (
ticker TEXT PRIMARY KEY,
name TEXT,
star_rating INTEGER,
category TEXT,
expense_ratio REAL,
net_expense_ratio REAL,
portfolio_turnover_pct REAL,
min_investment INTEGER,
performance_1yr REAL,
performance_3yr REAL,
performance_5yr REAL,
performance_10yr REAL,
raw_data TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS fund_holdings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ticker TEXT,
holding_name TEXT,
weight_pct TEXT,
sector TEXT,
scraped_at TEXT,
FOREIGN KEY (ticker) REFERENCES funds(ticker)
);
CREATE TABLE IF NOT EXISTS expense_history (
ticker TEXT,
expense_ratio REAL,
net_expense_ratio REAL,
snapshot_date TEXT,
PRIMARY KEY (ticker, snapshot_date)
);
CREATE INDEX IF NOT EXISTS idx_funds_star ON funds(star_rating);
CREATE INDEX IF NOT EXISTS idx_funds_expense ON funds(expense_ratio);
CREATE INDEX IF NOT EXISTS idx_funds_category ON funds(category);
""")
conn.commit()
return conn
def save_fund(conn: sqlite3.Connection, fund_data: dict):
"""Save fund data to SQLite."""
now = datetime.utcnow().isoformat()
ticker = fund_data.get("ticker", "")
# Extract performance values from nested data
perf = fund_data.get("performance", {}).get("returns", {})
perf_1yr = perf_3yr = perf_5yr = perf_10yr = None
    for label, vals in perf.items():
        # rows are keyed by label; keep the fund's own row, skip category/index rows
        if "fund" in label.lower() or ticker.upper() in label.upper():
            for period_key, val in vals.items():
try:
pct = float(val.replace("%", ""))
if "1 Year" in period_key or "1-Year" in period_key:
perf_1yr = pct
elif "3 Year" in period_key or "3-Year" in period_key:
perf_3yr = pct
elif "5 Year" in period_key or "5-Year" in period_key:
perf_5yr = pct
elif "10 Year" in period_key or "10-Year" in period_key:
perf_10yr = pct
except (ValueError, AttributeError):
pass
conn.execute(
"""INSERT OR REPLACE INTO funds
(ticker, name, star_rating, category, expense_ratio, net_expense_ratio,
portfolio_turnover_pct, min_investment, performance_1yr, performance_3yr,
performance_5yr, performance_10yr, raw_data, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
ticker, fund_data.get("name"), fund_data.get("star_rating"),
fund_data.get("category"), fund_data.get("expense_ratio"),
fund_data.get("net_expense_ratio"), fund_data.get("portfolio_turnover_pct"),
fund_data.get("min_investment"),
perf_1yr, perf_3yr, perf_5yr, perf_10yr,
json.dumps({k: v for k, v in fund_data.items() if k not in ("raw_data", "performance")}),
now,
)
)
# Log expense ratio history
if fund_data.get("expense_ratio"):
conn.execute(
"INSERT OR REPLACE INTO expense_history (ticker, expense_ratio, net_expense_ratio, snapshot_date) VALUES (?, ?, ?, ?)",
(ticker, fund_data.get("expense_ratio"), fund_data.get("net_expense_ratio"), now[:10])
)
conn.commit()
def scrape_fund_batch(
tickers: list,
proxy_url: str = None,
db_path: str = "morningstar_funds.db",
):
"""Scrape a batch of funds with storage and error handling."""
conn = init_fund_db(db_path)
client = build_proxied_session(proxy_url) if proxy_url else build_morningstar_client()
for i, ticker in enumerate(tickers):
print(f"[{i+1}/{len(tickers)}] Scraping {ticker}...")
try:
fund_data = scrape_fund_complete(ticker, client=client)
save_fund(conn, fund_data)
print(f" {fund_data.get('name', 'N/A')} — {fund_data.get('star_rating', '?')} stars — {fund_data.get('expense_ratio', '?')}% ER")
except httpx.HTTPStatusError as e:
print(f" HTTP error: {e}")
if e.response.status_code in (403, 429, 503):
print(" Backing off for 60 seconds...")
time.sleep(60)
except Exception as e:
print(f" Error on {ticker}: {e}")
# Random delay — Morningstar needs long delays
delay = random.uniform(8, 15)
print(f" Waiting {delay:.1f}s...")
time.sleep(delay)
conn.close()
print(f"\nDone. Scraped {len(tickers)} funds.")
Analysis: Fund Comparison Queries
def compare_index_funds(db_path: str = "morningstar_funds.db") -> list:
"""Compare expense ratios and performance across index funds."""
conn = sqlite3.connect(db_path)
cursor = conn.execute("""
SELECT ticker, name, star_rating, category,
expense_ratio, net_expense_ratio,
performance_1yr, performance_3yr, performance_5yr, performance_10yr
FROM funds
WHERE expense_ratio IS NOT NULL
ORDER BY expense_ratio ASC
""")
results = cursor.fetchall()
conn.close()
return results
def find_best_by_category(db_path: str = "morningstar_funds.db") -> dict:
"""Find the highest-rated funds in each category."""
conn = sqlite3.connect(db_path)
cursor = conn.execute("""
SELECT category, ticker, name, star_rating, expense_ratio, performance_5yr
FROM funds f1
WHERE star_rating = (
SELECT MAX(star_rating) FROM funds f2 WHERE f2.category = f1.category
)
AND category IS NOT NULL
ORDER BY category, expense_ratio ASC
""")
rows = cursor.fetchall()
conn.close()
by_category = {}
for row in rows:
cat = row[0]
if cat not in by_category:
by_category[cat] = []
by_category[cat].append({
"ticker": row[1], "name": row[2], "stars": row[3],
"expense_ratio": row[4], "perf_5yr": row[5],
})
return by_category
def fee_impact_analysis(
initial_investment: float,
annual_return_pct: float,
years: int,
expense_ratios: list,
) -> dict:
"""
Model how different expense ratios compound over time.
Shows the real cost of fee differences.
"""
results = {}
for er in expense_ratios:
net_return = (annual_return_pct - er) / 100
final_value = initial_investment * ((1 + net_return) ** years)
results[er] = {
"net_annual_return_pct": round((annual_return_pct - er), 2),
"final_value": round(final_value, 2),
"total_fees_paid": round(
initial_investment * ((1 + annual_return_pct / 100) ** years) - final_value, 2
),
}
return results
# Example analysis
print("Fee impact over 30 years ($100,000 investment, 8% gross return):")
impact = fee_impact_analysis(
initial_investment=100_000,
annual_return_pct=8.0,
years=30,
expense_ratios=[0.03, 0.10, 0.50, 1.00, 1.50],
)
for er, data in impact.items():
print(f" {er:.2f}% ER: ${data['final_value']:,.0f} final value, ${data['total_fees_paid']:,.0f} in fees")
Fund Comparison Tool
def build_comparison_report(tickers: list, db_path: str = "morningstar_funds.db") -> list:
"""Build a side-by-side comparison for a list of fund tickers."""
conn = sqlite3.connect(db_path)
placeholders = ",".join("?" * len(tickers))
cursor = conn.execute(f"""
SELECT ticker, name, star_rating, category,
expense_ratio, net_expense_ratio,
performance_1yr, performance_3yr, performance_5yr, performance_10yr,
min_investment, portfolio_turnover_pct
FROM funds
WHERE ticker IN ({placeholders})
ORDER BY expense_ratio ASC
""", tickers)
results = []
for row in cursor.fetchall():
results.append({
"ticker": row[0],
"name": row[1],
"star_rating": row[2],
"category": row[3],
"expense_ratio": row[4],
"net_expense_ratio": row[5],
"return_1yr": row[6],
"return_3yr": row[7],
"return_5yr": row[8],
"return_10yr": row[9],
"min_investment": row[10],
"turnover_pct": row[11],
})
conn.close()
# Print formatted comparison
print(f"\n{'Ticker':<8} {'Stars':>5} {'ER%':>6} {'1-Yr':>7} {'3-Yr':>7} {'5-Yr':>7} {'10-Yr':>8}")
print("-" * 55)
for f in results:
stars = "★" * (f["star_rating"] or 0)
er = f"{f['expense_ratio']:.2f}%" if f["expense_ratio"] else "N/A"
r1 = f"{f['return_1yr']:.1f}%" if f["return_1yr"] else "N/A"
r3 = f"{f['return_3yr']:.1f}%" if f["return_3yr"] else "N/A"
r5 = f"{f['return_5yr']:.1f}%" if f["return_5yr"] else "N/A"
r10 = f"{f['return_10yr']:.1f}%" if f["return_10yr"] else "N/A"
print(f"{f['ticker']:<8} {stars:>5} {er:>6} {r1:>7} {r3:>7} {r5:>7} {r10:>8}")
return results
# Compare popular index funds
if __name__ == "__main__":
PROXY_URL = "http://YOUR_USER:[email protected]:9000"
# Popular index funds to compare
TICKERS = ["VFIAX", "FXAIX", "SWPPX", "VTSAX", "FSKAX", "SWTSX", "VBTLX", "FXNAX"]
scrape_fund_batch(TICKERS, proxy_url=PROXY_URL)
build_comparison_report(TICKERS)
What You Can Build
Morningstar data enables some useful analyses:
- Fund comparison tools — side-by-side expense ratios, returns, and ratings across fund families (Vanguard vs Fidelity vs Schwab)
- Fee impact calculators — model how expense ratio differences compound over 10-30 year horizons
- Category performance trackers — monitor which fund categories are outperforming over rolling periods
- Holdings overlap analysis — compare portfolio holdings across similar funds to find true diversification
- Star rating predictor — analyze what distinguishes 4-5 star funds from 2-3 star funds in the same category
- Alert system — trigger notifications when a fund's expense ratio changes or star rating drops
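The alert idea maps directly onto the expense_history table defined earlier: compare each fund's two most recent snapshots and flag any change. A sketch against that schema (the function name is my own):

```python
import sqlite3

def detect_expense_changes(conn: sqlite3.Connection) -> list:
    """Return (ticker, old_er, new_er) for funds whose expense ratio changed
    between their two most recent snapshots in expense_history."""
    return conn.execute("""
        SELECT cur.ticker, prev.expense_ratio, cur.expense_ratio
        FROM expense_history cur
        JOIN expense_history prev
          ON prev.ticker = cur.ticker
         AND prev.snapshot_date = (
             SELECT MAX(snapshot_date) FROM expense_history p2
             WHERE p2.ticker = cur.ticker AND p2.snapshot_date < cur.snapshot_date
         )
        WHERE cur.snapshot_date = (
            SELECT MAX(snapshot_date) FROM expense_history c2
            WHERE c2.ticker = cur.ticker
        )
        AND cur.expense_ratio != prev.expense_ratio
    """).fetchall()
```

Run it after each batch scrape; a non-empty result is your notification trigger.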
Financial data scraping is slower and more heavily defended than most other verticals. The payoff is that the data is extremely valuable and changes slowly enough that you don't need to scrape every day. Weekly or monthly collection is usually enough for fund analysis. At 5-10 second delays between requests with residential proxies from ThorData, expect on the order of 100-150 fund profiles per hour, since each fund takes three page loads plus a between-fund pause. That is still more than enough to keep a comprehensive comparison database fresh.