Scraping Morningstar: Mutual Fund Ratings, Performance & Expense Ratios with Python (2026)
Morningstar rates and tracks over 600,000 investment offerings worldwide. Their star ratings, expense ratio data, and performance metrics are the standard reference for fund comparison. Financial advisors, researchers, and individual investors all rely on this data.
The catch: Morningstar doesn't offer a free public API. Their data services cost thousands per year. But the website is publicly accessible, and the data is right there in the HTML and embedded JSON.
This guide covers scraping fund ratings, performance history, expense ratios, and holdings from Morningstar's public pages — along with the anti-bot measures you'll need to navigate.
Legal and Ethical Note
Morningstar's terms of service restrict automated data collection. This guide is for educational purposes — learning how web scraping works against a complex, real-world financial target. If you need Morningstar data for commercial use, look into their official data feeds or licensed APIs. Use respectful request rates and do not republish scraped data commercially.
Understanding Morningstar's Page Structure
Morningstar fund pages follow this pattern:
https://www.morningstar.com/funds/xnas/[TICKER]/quote
Key sub-pages per fund:
| Page | URL Pattern | Data |
|---|---|---|
| Quote | /quote | Star rating, category, current price |
| Performance | /performance | Returns over periods, vs benchmark |
| Portfolio | /portfolio | Holdings, sector weights, top positions |
| Price/Fees | /price | Expense ratio, loads, minimums |
| Risk | /risk | Standard deviation, Sharpe ratio, alpha/beta |
A lot of the data is rendered server-side in the initial HTML, but some comes from internal API calls that the page makes on load. The key insight: Morningstar embeds fund data as inline JSON in <script> tags, which is often easier to extract than scraping table HTML.
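Those embedded JSON blobs tend to be deeply nested, and the exact path changes between deploys. A recursive search for a known key is usually more robust than hard-coding the path. A minimal sketch (the key names in the usage are examples, not guaranteed Morningstar field names):

```python
def find_key(data, target: str):
    """Depth-first search a nested dict/list for the first value stored under `target`."""
    if isinstance(data, dict):
        if target in data:
            return data[target]
        for value in data.values():
            found = find_key(value, target)
            if found is not None:
                return found
    elif isinstance(data, list):
        for item in data:
            found = find_key(item, target)
            if found is not None:
                return found
    return None
```

With a parsed script blob in hand, `find_key(blob, "starRating")` keeps working even when the surrounding structure shifts.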
Dependencies and Setup
pip install httpx[http2] beautifulsoup4 lxml playwright
playwright install chromium
Base Request Setup
import httpx
from bs4 import BeautifulSoup
import json
import time
import re
import random
USER_AGENTS = [
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:127.0) Gecko/20100101 Firefox/127.0",
]
BASE_HEADERS = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
}
def get_headers() -> dict:
return {**BASE_HEADERS, "User-Agent": random.choice(USER_AGENTS)}
def build_morningstar_client(proxy_url: str = None) -> httpx.Client:
"""Build an httpx client configured for Morningstar scraping."""
client_kwargs = {
"headers": get_headers(),
"follow_redirects": True,
"timeout": 25,
}
if proxy_url:
client_kwargs["proxies"] = {"http://": proxy_url, "https://": proxy_url}
return httpx.Client(**client_kwargs)
Basic Fund Scraper
def scrape_fund_overview(ticker: str, client: httpx.Client = None) -> dict:
"""Scrape basic fund data from Morningstar quote page."""
url = f"https://www.morningstar.com/funds/xnas/{ticker.lower()}/quote"
if client is None:
client = build_morningstar_client()
# Visit homepage first to establish a session/cookies (helps with Akamai)
try:
client.get("https://www.morningstar.com", timeout=10)
time.sleep(1)
except Exception:
pass
resp = client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
fund = {"ticker": ticker.upper(), "url": url}
# Fund name — try multiple patterns
for selector in ["h1", "[data-testid='security-name']", ".mdc-fund-header__name"]:
name_el = soup.select_one(selector)
if name_el:
fund["name"] = name_el.get_text(strip=True)
break
# Star rating — Morningstar uses aria-label or specific classes
for selector in ["[class*='star-rating']", "[aria-label*='star']", ".mdc-rating"]:
star_el = soup.select_one(selector)
if star_el:
label = star_el.get("aria-label", "")
match = re.search(r"(\d)\s+star", label, re.IGNORECASE)
if match:
fund["star_rating"] = int(match.group(1))
break
# Some pages just have a number
text = star_el.get_text(strip=True)
if text.isdigit() and 1 <= int(text) <= 5:
fund["star_rating"] = int(text)
break
# Category
for selector in ["[data-testid='category']", ".mdc-category", "[class*='category']"]:
cat_el = soup.select_one(selector)
if cat_el and len(cat_el.get_text(strip=True)) > 2:
fund["category"] = cat_el.get_text(strip=True)
break
# Try to extract embedded JSON data
for script in soup.select("script[type='application/json'], script[type='application/ld+json']"):
try:
script_data = json.loads(script.string)
if isinstance(script_data, dict):
if "name" in script_data or "starRating" in script_data:
fund["embedded_data"] = script_data
# Extract common fields
if "starRating" in script_data:
fund["star_rating"] = script_data["starRating"]
if "category" in script_data:
fund["category"] = script_data["category"]
break
except (json.JSONDecodeError, TypeError):
continue
# Extract any inline JS data blocks (Morningstar sometimes puts fund data in window.__INITIAL_DATA__)
init_data_match = re.search(
r'window\.__INITIAL_DATA__\s*=\s*({.*?})(?:;|</script>)',
resp.text, re.DOTALL
)
if init_data_match:
try:
init_data = json.loads(init_data_match.group(1))
fund["initial_data"] = init_data
except json.JSONDecodeError:
pass
return fund
Extracting Performance Data
Performance data is typically rendered in tables on the performance page:
def scrape_fund_performance(ticker: str, client: httpx.Client = None) -> dict:
"""Extract historical return data from Morningstar performance page."""
url = f"https://www.morningstar.com/funds/xnas/{ticker.lower()}/performance"
if client is None:
client = build_morningstar_client()
resp = client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
perf = {"ticker": ticker.upper(), "returns": {}, "trailing_returns": {}}
# Performance tables
for table in soup.select("table"):
header_row = table.select_one("thead tr")
if not header_row:
continue
headers_text = [th.get_text(strip=True) for th in header_row.select("th, td")]
if not any(period in str(headers_text) for period in ["YTD", "1 Year", "3 Year", "5 Year", "10 Year"]):
continue
for row in table.select("tbody tr"):
cells = [td.get_text(strip=True) for td in row.select("td, th")]
if len(cells) < 2:
continue
label = cells[0]
values = cells[1:]
row_data = {}
for i, val in enumerate(values):
if i < len(headers_text) - 1:
row_data[headers_text[i + 1]] = val
if label:
perf["returns"][label] = row_data
# Also try to extract from JSON embedded in page
for script in soup.select("script"):
if script.string and "trailingReturn" in script.string:
try:
# Look for the JSON object containing trailing returns
match = re.search(r'"trailingReturn":\s*(\{[^}]+\})', script.string)
if match:
perf["trailing_returns"] = json.loads(match.group(1))
except (json.JSONDecodeError, ValueError):
pass
return perf
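The return cells scraped above come back as strings like "12.34%", "1,234.5%", or a dash for missing data. A small normalizer keeps the downstream code clean; `parse_pct` is a hypothetical helper of my own, not part of the scraper above:

```python
import re

def parse_pct(raw: str):
    """Convert a scraped percentage string to a float, or None if unparseable."""
    if not raw:
        return None
    # strip thousands separators and normalize the Unicode minus sign
    cleaned = raw.replace(",", "").replace("\u2212", "-")
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*%?", cleaned)
    return float(match.group(1)) if match else None
```

This lets the storage layer later call one function instead of repeating `val.replace("%", "")` with ad-hoc error handling.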
Getting Expense Ratios and Fees
Expense ratios are on the price/fees page. This is the data people search for most:
def scrape_fund_fees(ticker: str, client: httpx.Client = None) -> dict:
"""Extract fee and expense data from Morningstar."""
url = f"https://www.morningstar.com/funds/xnas/{ticker.lower()}/price"
if client is None:
client = build_morningstar_client()
resp = client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
fees = {"ticker": ticker.upper()}
text = soup.get_text()
# Expense ratio patterns — try multiple common formats
patterns = {
"expense_ratio": [
r"Expense Ratio[:\s]*(\d+\.\d+)\s*%",
r"Total Expense Ratio[:\s]*(\d+\.\d+)\s*%",
r'"expenseRatio":\s*"?(\d+\.\d+)"?',
],
"net_expense_ratio": [
r"Net Expense Ratio[:\s]*(\d+\.\d+)\s*%",
r'"netExpenseRatio":\s*"?(\d+\.\d+)"?',
],
"management_fee": [
r"Management Fee[:\s]*(\d+\.\d+)\s*%",
],
}
for field, field_patterns in patterns.items():
for pattern in field_patterns:
match = re.search(pattern, text, re.IGNORECASE)
if match:
fees[field] = float(match.group(1))
break
# Minimum investment
for pattern in [
r"Minimum (?:Initial )?Investment[:\s]*\$?([\d,]+)",
r'"minimumInvestment":\s*"?(\d+)"?',
]:
min_match = re.search(pattern, text, re.IGNORECASE)
if min_match:
fees["min_investment"] = int(min_match.group(1).replace(",", ""))
break
# Fee-related DL items
for dt in soup.select("dt, [class*='label'], [class*='key']"):
dd = dt.find_next_sibling("dd") or dt.find_next_sibling()
if dd:
key = dt.get_text(strip=True).lower()
val = dd.get_text(strip=True)
if any(term in key for term in ["load", "fee", "turnover", "yield", "12b", "redemption"]):
safe_key = re.sub(r'[^a-z0-9_]', '_', key)
fees[safe_key] = val
# Portfolio turnover rate
turnover_match = re.search(
r"(?:Portfolio )?Turnover[:\s]*([\d.]+)\s*%",
text, re.IGNORECASE
)
if turnover_match:
fees["portfolio_turnover_pct"] = float(turnover_match.group(1))
return fees
def scrape_fund_holdings(ticker: str, client: httpx.Client = None) -> dict:
"""Extract top holdings and sector allocations from portfolio page."""
url = f"https://www.morningstar.com/funds/xnas/{ticker.lower()}/portfolio"
if client is None:
client = build_morningstar_client()
resp = client.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "lxml")
holdings = {"ticker": ticker.upper(), "top_holdings": [], "sector_weights": {}}
# Try to find top holdings table
for table in soup.select("table"):
headers = [th.get_text(strip=True) for th in table.select("thead th")]
if any(h in str(headers) for h in ["% Net Assets", "Portfolio", "Holding"]):
for row in table.select("tbody tr")[:15]:
cells = [td.get_text(strip=True) for td in row.select("td")]
if len(cells) >= 2:
holdings["top_holdings"].append({
"name": cells[0],
"weight_pct": cells[1] if len(cells) > 1 else None,
"sector": cells[2] if len(cells) > 2 else None,
})
break
# Sector weights
for row in soup.select("[class*='sector'] tr, [data-testid*='sector'] tr"):
cells = [td.get_text(strip=True) for td in row.select("td")]
if len(cells) >= 2:
sector = cells[0]
weight_match = re.search(r"(\d+\.?\d*)", cells[1])
if sector and weight_match:
holdings["sector_weights"][sector] = float(weight_match.group(1))
return holdings
Dealing with Morningstar's Anti-Bot Stack
Morningstar is one of the tougher scraping targets in finance. They use multiple layers:
- Akamai Bot Manager — fingerprints your browser, checks TLS signatures, and analyzes behavioral patterns.
- Rate limiting — aggressive per-IP throttling, even for normal browsing speeds.
- Dynamic selectors — CSS class names change between deploys.
- Cookie walls — some pages require session cookies set by JavaScript.
What works:
- Slow requests — 5-10 seconds between page loads minimum. Morningstar flags anything faster.
- Session persistence — use httpx.Client() to maintain cookies across requests.
- Residential proxies — datacenter IPs are blocked almost immediately. ThorData's rotating residential proxies are what you need for financial sites — the residential IPs mimic real user traffic patterns, which is critical for getting past Akamai's fingerprinting. Their session-sticky option helps when you need cookies to persist across multiple page loads for the same fund.
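The "slow down when blocked" rule can be encoded as an exponential backoff wrapper. This is a sketch with my own names and defaults (`fetch_with_backoff`, the 15-second base delay), not part of the scrapers above; the injectable `sleep` makes it testable:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries: int = 4, base_delay: float = 15.0,
                       blocked_statuses=(403, 429, 503), sleep=time.sleep):
    """Call fetch() until it returns a non-blocked status, backing off
    exponentially (with jitter) each time the site pushes back."""
    resp = None
    for attempt in range(max_retries):
        resp = fetch()
        if resp.status_code not in blocked_statuses:
            return resp
        # 15s, 30s, 60s, ... plus jitter so retries don't look mechanical
        sleep(base_delay * (2 ** attempt) + random.uniform(0, 5))
    return resp  # still blocked after max_retries; let the caller decide
```

You would pass in a closure like `lambda: client.get(url)`; on a clean response it returns immediately, so the normal path costs nothing.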
def build_proxied_session(proxy_url: str) -> httpx.Client:
"""Build a session-based client with proxy for sustained Morningstar scraping."""
client = httpx.Client(
proxies={"http://": proxy_url, "https://": proxy_url},
headers=get_headers(),
follow_redirects=True,
timeout=30,
)
# Establish session by visiting homepage first
try:
client.get("https://www.morningstar.com")
time.sleep(random.uniform(2, 4))
except Exception:
pass
return client
def scrape_fund_complete(ticker: str, client: httpx.Client) -> dict:
"""Scrape all available data for a fund using a shared session."""
result = {"ticker": ticker.upper()}
try:
overview = scrape_fund_overview(ticker, client=client)
result.update(overview)
time.sleep(random.uniform(5, 10))
except Exception as e:
print(f" Overview failed for {ticker}: {e}")
try:
fees = scrape_fund_fees(ticker, client=client)
result.update(fees)
time.sleep(random.uniform(5, 10))
except Exception as e:
print(f" Fees failed for {ticker}: {e}")
try:
perf = scrape_fund_performance(ticker, client=client)
result["performance"] = perf
time.sleep(random.uniform(5, 10))
except Exception as e:
print(f" Performance failed for {ticker}: {e}")
return result
Playwright Fallback for Akamai-Blocked Pages
When httpx gets Akamai challenges, use Playwright:
from playwright.sync_api import sync_playwright
def scrape_fund_with_playwright(ticker: str, proxy: str = None) -> dict:
"""Use Playwright to scrape a Morningstar fund page when httpx is blocked."""
url = f"https://www.morningstar.com/funds/xnas/{ticker.lower()}/quote"
launch_kwargs = {"headless": True, "args": ["--no-sandbox", "--disable-dev-shm-usage"]}
if proxy:
launch_kwargs["proxy"] = {"server": proxy}
with sync_playwright() as p:
browser = p.chromium.launch(**launch_kwargs)
context = browser.new_context(
user_agent=random.choice(USER_AGENTS),
viewport={"width": 1280, "height": 900},
locale="en-US",
timezone_id="America/New_York",
)
page = context.new_page()
# Visit homepage first
page.goto("https://www.morningstar.com", wait_until="domcontentloaded", timeout=20000)
page.wait_for_timeout(2000)
# Navigate to fund page
page.goto(url, wait_until="networkidle", timeout=30000)
page.wait_for_timeout(3000)
html = page.content()
browser.close()
# Parse the HTML
soup = BeautifulSoup(html, "lxml")
fund = {"ticker": ticker.upper(), "url": url}
# Extract using same logic as httpx approach
name_el = soup.select_one("h1")
if name_el:
fund["name"] = name_el.get_text(strip=True)
star_el = soup.select_one("[aria-label*='star']")
if star_el:
label = star_el.get("aria-label", "")
match = re.search(r"(\d)\s+star", label, re.IGNORECASE)
if match:
fund["star_rating"] = int(match.group(1))
return fund
Batch Scraping Multiple Funds
When collecting data across many funds, structure it as a pipeline:
import sqlite3
from datetime import datetime
def init_fund_db(db_path: str = "morningstar_funds.db") -> sqlite3.Connection:
"""Initialize the Morningstar fund database."""
conn = sqlite3.connect(db_path)
conn.executescript("""
CREATE TABLE IF NOT EXISTS funds (
ticker TEXT PRIMARY KEY,
name TEXT,
star_rating INTEGER,
category TEXT,
expense_ratio REAL,
net_expense_ratio REAL,
portfolio_turnover_pct REAL,
min_investment INTEGER,
performance_1yr REAL,
performance_3yr REAL,
performance_5yr REAL,
performance_10yr REAL,
raw_data TEXT,
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS fund_holdings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ticker TEXT,
holding_name TEXT,
weight_pct TEXT,
sector TEXT,
scraped_at TEXT,
FOREIGN KEY (ticker) REFERENCES funds(ticker)
);
CREATE TABLE IF NOT EXISTS expense_history (
ticker TEXT,
expense_ratio REAL,
net_expense_ratio REAL,
snapshot_date TEXT,
PRIMARY KEY (ticker, snapshot_date)
);
CREATE INDEX IF NOT EXISTS idx_funds_star ON funds(star_rating);
CREATE INDEX IF NOT EXISTS idx_funds_expense ON funds(expense_ratio);
CREATE INDEX IF NOT EXISTS idx_funds_category ON funds(category);
""")
conn.commit()
return conn
def save_fund(conn: sqlite3.Connection, fund_data: dict):
"""Save fund data to SQLite."""
now = datetime.utcnow().isoformat()
ticker = fund_data.get("ticker", "")
# Extract performance values from nested data
perf = fund_data.get("performance", {}).get("returns", {})
perf_1yr = perf_3yr = perf_5yr = perf_10yr = None
    for label, vals in perf.items():
        # rows are keyed by label; keep the fund's own row, skip category/index rows
        if "fund" in label.lower() or ticker.upper() in label.upper():
            for period_key, val in vals.items():
try:
pct = float(val.replace("%", ""))
if "1 Year" in period_key or "1-Year" in period_key:
perf_1yr = pct
elif "3 Year" in period_key or "3-Year" in period_key:
perf_3yr = pct
elif "5 Year" in period_key or "5-Year" in period_key:
perf_5yr = pct
elif "10 Year" in period_key or "10-Year" in period_key:
perf_10yr = pct
except (ValueError, AttributeError):
pass
conn.execute(
"""INSERT OR REPLACE INTO funds
(ticker, name, star_rating, category, expense_ratio, net_expense_ratio,
portfolio_turnover_pct, min_investment, performance_1yr, performance_3yr,
performance_5yr, performance_10yr, raw_data, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
ticker, fund_data.get("name"), fund_data.get("star_rating"),
fund_data.get("category"), fund_data.get("expense_ratio"),
fund_data.get("net_expense_ratio"), fund_data.get("portfolio_turnover_pct"),
fund_data.get("min_investment"),
perf_1yr, perf_3yr, perf_5yr, perf_10yr,
json.dumps({k: v for k, v in fund_data.items() if k not in ("raw_data", "performance")}),
now,
)
)
# Log expense ratio history
if fund_data.get("expense_ratio"):
conn.execute(
"INSERT OR REPLACE INTO expense_history (ticker, expense_ratio, net_expense_ratio, snapshot_date) VALUES (?, ?, ?, ?)",
(ticker, fund_data.get("expense_ratio"), fund_data.get("net_expense_ratio"), now[:10])
)
conn.commit()
def scrape_fund_batch(
tickers: list,
proxy_url: str = None,
db_path: str = "morningstar_funds.db",
):
"""Scrape a batch of funds with storage and error handling."""
conn = init_fund_db(db_path)
client = build_proxied_session(proxy_url) if proxy_url else build_morningstar_client()
for i, ticker in enumerate(tickers):
print(f"[{i+1}/{len(tickers)}] Scraping {ticker}...")
try:
fund_data = scrape_fund_complete(ticker, client=client)
save_fund(conn, fund_data)
print(f" {fund_data.get('name', 'N/A')} — {fund_data.get('star_rating', '?')} stars — {fund_data.get('expense_ratio', '?')}% ER")
except httpx.HTTPStatusError as e:
print(f" HTTP error: {e}")
if e.response.status_code in (403, 429, 503):
print(" Backing off for 60 seconds...")
time.sleep(60)
except Exception as e:
print(f" Error on {ticker}: {e}")
# Random delay — Morningstar needs long delays
delay = random.uniform(8, 15)
print(f" Waiting {delay:.1f}s...")
time.sleep(delay)
conn.close()
print(f"\nDone. Scraped {len(tickers)} funds.")
Analysis: Fund Comparison Queries
def compare_index_funds(db_path: str = "morningstar_funds.db") -> list:
"""Compare expense ratios and performance across index funds."""
conn = sqlite3.connect(db_path)
cursor = conn.execute("""
SELECT ticker, name, star_rating, category,
expense_ratio, net_expense_ratio,
performance_1yr, performance_3yr, performance_5yr, performance_10yr
FROM funds
WHERE expense_ratio IS NOT NULL
ORDER BY expense_ratio ASC
""")
results = cursor.fetchall()
conn.close()
return results
def find_best_by_category(db_path: str = "morningstar_funds.db") -> dict:
"""Find the highest-rated funds in each category."""
conn = sqlite3.connect(db_path)
cursor = conn.execute("""
SELECT category, ticker, name, star_rating, expense_ratio, performance_5yr
FROM funds f1
WHERE star_rating = (
SELECT MAX(star_rating) FROM funds f2 WHERE f2.category = f1.category
)
AND category IS NOT NULL
ORDER BY category, expense_ratio ASC
""")
rows = cursor.fetchall()
conn.close()
by_category = {}
for row in rows:
cat = row[0]
if cat not in by_category:
by_category[cat] = []
by_category[cat].append({
"ticker": row[1], "name": row[2], "stars": row[3],
"expense_ratio": row[4], "perf_5yr": row[5],
})
return by_category
def fee_impact_analysis(
initial_investment: float,
annual_return_pct: float,
years: int,
expense_ratios: list,
) -> dict:
"""
Model how different expense ratios compound over time.
Shows the real cost of fee differences.
"""
results = {}
for er in expense_ratios:
net_return = (annual_return_pct - er) / 100
final_value = initial_investment * ((1 + net_return) ** years)
results[er] = {
"net_annual_return_pct": round((annual_return_pct - er), 2),
"final_value": round(final_value, 2),
"total_fees_paid": round(
initial_investment * ((1 + annual_return_pct / 100) ** years) - final_value, 2
),
}
return results
# Example analysis
print("Fee impact over 30 years ($100,000 investment, 8% gross return):")
impact = fee_impact_analysis(
initial_investment=100_000,
annual_return_pct=8.0,
years=30,
expense_ratios=[0.03, 0.10, 0.50, 1.00, 1.50],
)
for er, data in impact.items():
print(f" {er:.2f}% ER: ${data['final_value']:,.0f} final value, ${data['total_fees_paid']:,.0f} in fees")
Fund Comparison Tool
def build_comparison_report(tickers: list, db_path: str = "morningstar_funds.db") -> list:
"""Build a side-by-side comparison for a list of fund tickers."""
conn = sqlite3.connect(db_path)
placeholders = ",".join("?" * len(tickers))
cursor = conn.execute(f"""
SELECT ticker, name, star_rating, category,
expense_ratio, net_expense_ratio,
performance_1yr, performance_3yr, performance_5yr, performance_10yr,
min_investment, portfolio_turnover_pct
FROM funds
WHERE ticker IN ({placeholders})
ORDER BY expense_ratio ASC
""", tickers)
results = []
for row in cursor.fetchall():
results.append({
"ticker": row[0],
"name": row[1],
"star_rating": row[2],
"category": row[3],
"expense_ratio": row[4],
"net_expense_ratio": row[5],
"return_1yr": row[6],
"return_3yr": row[7],
"return_5yr": row[8],
"return_10yr": row[9],
"min_investment": row[10],
"turnover_pct": row[11],
})
conn.close()
# Print formatted comparison
print(f"\n{'Ticker':<8} {'Stars':>5} {'ER%':>6} {'1-Yr':>7} {'3-Yr':>7} {'5-Yr':>7} {'10-Yr':>8}")
print("-" * 55)
for f in results:
stars = "★" * (f["star_rating"] or 0)
er = f"{f['expense_ratio']:.2f}%" if f["expense_ratio"] else "N/A"
r1 = f"{f['return_1yr']:.1f}%" if f["return_1yr"] else "N/A"
r3 = f"{f['return_3yr']:.1f}%" if f["return_3yr"] else "N/A"
r5 = f"{f['return_5yr']:.1f}%" if f["return_5yr"] else "N/A"
r10 = f"{f['return_10yr']:.1f}%" if f["return_10yr"] else "N/A"
print(f"{f['ticker']:<8} {stars:>5} {er:>6} {r1:>7} {r3:>7} {r5:>7} {r10:>8}")
return results
# Compare popular index funds
if __name__ == "__main__":
PROXY_URL = "http://YOUR_USER:[email protected]:9000"
# Popular index funds to compare
TICKERS = ["VFIAX", "FXAIX", "SWPPX", "VTSAX", "FSKAX", "SWTSX", "VBTLX", "FXNAX"]
scrape_fund_batch(TICKERS, proxy_url=PROXY_URL)
build_comparison_report(TICKERS)
What You Can Build
Morningstar data enables some useful analyses:
- Fund comparison tools — side-by-side expense ratios, returns, and ratings across fund families (Vanguard vs Fidelity vs Schwab)
- Fee impact calculators — model how expense ratio differences compound over 10-30 year horizons
- Category performance trackers — monitor which fund categories are outperforming over rolling periods
- Holdings overlap analysis — compare portfolio holdings across similar funds to find true diversification
- Star rating predictor — analyze what distinguishes 4-5 star funds from 2-3 star funds in the same category
- Alert system — trigger notifications when a fund's expense ratio changes or star rating drops
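The alert idea maps directly onto the expense_history table defined earlier: compare each fund's two most recent snapshots and flag any change. A sketch against that schema (the function name is my own):

```python
import sqlite3

def detect_expense_changes(conn: sqlite3.Connection) -> list:
    """Return (ticker, old_er, new_er) for funds whose expense ratio changed
    between their two most recent snapshots in expense_history."""
    return conn.execute("""
        SELECT cur.ticker, prev.expense_ratio, cur.expense_ratio
        FROM expense_history cur
        JOIN expense_history prev
          ON prev.ticker = cur.ticker
         AND prev.snapshot_date = (
             SELECT MAX(snapshot_date) FROM expense_history p2
             WHERE p2.ticker = cur.ticker AND p2.snapshot_date < cur.snapshot_date
         )
        WHERE cur.snapshot_date = (
            SELECT MAX(snapshot_date) FROM expense_history c2
            WHERE c2.ticker = cur.ticker
        )
        AND cur.expense_ratio != prev.expense_ratio
    """).fetchall()
```

Run it after each batch scrape; a non-empty result is your notification trigger.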
Financial data scraping is slower and more heavily defended than most other verticals. The payoff is that the data is extremely valuable and changes slowly enough that you don't need to scrape every day. Weekly or monthly collection is usually enough for fund analysis. At 5-10 second delays between requests with residential proxies from ThorData, expect on the order of 100-150 fund profiles per hour, since each fund takes three page loads plus a between-fund pause. That is still more than enough to keep a comprehensive comparison database fresh.