Scraping Google News Articles in 2026 (RSS + Topic APIs)
Google News aggregates articles from thousands of publishers into topic-based feeds. If you need structured news data — for market research, media monitoring, or building a custom aggregator — there are several ways to pull it programmatically.
This guide covers four approaches in order of complexity: the public RSS and topic feeds (simplest, most reliable), resolving the real publisher URLs behind Google's redirect links, full article content extraction with deduplication, and an async pipeline for high-volume monitoring — including how to handle proxy rotation when Google's rate limiting kicks in.
Environment Setup
pip install httpx feedparser trafilatura beautifulsoup4 lxml aiohttp aiofiles
The async pipeline uses asyncio, which ships with the Python standard library and needs no separate install; aiohttp and aiofiles are already included above.
Approach 1: Google News RSS Feeds
Google News still serves RSS feeds, though they are not prominently linked anywhere on the site. The base URL pattern is:
https://news.google.com/rss/search?q=QUERY&hl=en-US&gl=US&ceid=US:en
You can also get topic-specific feeds using topic IDs:
https://news.google.com/rss/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx1YlY4U0FtVnVHZ0pWVXlnQVAB?hl=en-US
The RSS response includes the article title, publisher, publication date, and a Google News redirect URL. The actual article URL is embedded in the redirect link.
# google_news_rss.py
import httpx
import feedparser

def fetch_google_news(query, lang="en", country="US", max_results=20, proxy=None):
    """
    Fetch Google News articles via the public RSS feed endpoint.
    Returns a structured article list with title, source, date, and link.
    """
    transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
    client = httpx.Client(
        transport=transport,
        timeout=15,
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Accept-Language": f"{lang}-{country},{lang};q=0.9",
        },
        follow_redirects=True,
    )
    url = "https://news.google.com/rss/search"
    params = {
        "q": query,
        "hl": f"{lang}-{country}",
        "gl": country,
        "ceid": f"{country}:{lang}",
    }
    try:
        resp = client.get(url, params=params)
        resp.raise_for_status()
    finally:
        client.close()
    feed = feedparser.parse(resp.text)
    articles = []
    for entry in feed.entries[:max_results]:
        source = ""
        if hasattr(entry, "source"):
            source = entry.source.get("title", "")
        elif hasattr(entry, "tags") and entry.tags:
            source = entry.tags[0].get("label", "")
        articles.append({
            "title": entry.title,
            "source": source,
            "published": entry.get("published", ""),
            "published_parsed": entry.get("published_parsed"),
            "link": entry.link,
            "description": entry.get("summary", ""),
            "id": entry.get("id", ""),
        })
    return articles

# Usage
results = fetch_google_news("artificial intelligence regulation 2026")
for article in results:
    print(f"[{article['source']}] {article['title']}")
    print(f"  Published: {article['published']}")
    print()
Topic-Based RSS Feeds
Google News organizes content into topic clusters. You can access specific topic feeds using their encoded topic IDs:
GOOGLE_NEWS_TOPICS = {
    "top_stories": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx1YlY4U0FtVnVHZ0pWVXlnQVAB",
    "world": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx1YlY4U0FtVnVHZ0pWVXlnQVAB",
    "business": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRFp1WlY4U0FtVnVHZ0pWVXlnQVAB",
    "technology": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB",
    "science": "CAAqJggKIiBDQkFTRWdvSUwyMHZNR1p0Y1hRU0FtVnVHZ0pWVXlnQVAB",
    "health": "CAAqIQgKIhtDQkFTRGdvSUwyMHZNR3QwTlRFU0FtVnVLQUFQAQ",
    "sports": "CAAqJggKIiBDQkFTRWdvSUwyMHZNR1oxY1hRU0FtVnVHZ0pWVXlnQVAB",
    "entertainment": "CAAqJggKIiBDQkFTRWdvSUwyMHZNREpxYVhRU0FtVnVHZ0pWVXlnQVAB",
}

def fetch_topic_feed(topic_key, lang="en", country="US", max_results=20):
    """Fetch articles from a specific Google News topic category."""
    topic_id = GOOGLE_NEWS_TOPICS.get(topic_key)
    if not topic_id:
        raise ValueError(f"Unknown topic: {topic_key}")
    url = f"https://news.google.com/rss/topics/{topic_id}"
    params = {"hl": f"{lang}-{country}", "gl": country, "ceid": f"{country}:{lang}"}
    with httpx.Client(
        timeout=15,
        headers={"User-Agent": "Mozilla/5.0 (compatible; NewsBot/1.0)"},
        follow_redirects=True,
    ) as client:
        resp = client.get(url, params=params)
    feed = feedparser.parse(resp.text)
    return [
        {
            "title": e.title,
            "source": e.source.get("title", "") if hasattr(e, "source") else "",
            "published": e.get("published", ""),
            "link": e.link,
        }
        for e in feed.entries[:max_results]
    ]

# Monitor multiple topics
for topic in ["technology", "business", "science"]:
    articles = fetch_topic_feed(topic, max_results=5)
    print(f"\n--- {topic.upper()} ---")
    for a in articles:
        print(f"  [{a['source']}] {a['title']}")
Approach 2: Resolving Real Article URLs
Google News wraps every link through a redirect (news.google.com/rss/articles/...). To get the actual publisher URL, follow the HTTP redirect chain:
import httpx
import re
import time
import random

def resolve_google_news_url(google_url, proxy=None):
    """
    Follow the Google News redirect to get the real publisher article URL.
    Returns the final URL after all redirects.
    """
    transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
    client = httpx.Client(transport=transport, timeout=10, follow_redirects=True)
    try:
        resp = client.head(google_url)
        real_url = str(resp.url)
        if "news.google.com" in real_url:
            # HEAD did not leave Google; fetch the page and look for the canonical link
            resp = client.get(google_url)
            match = re.search(r'<link rel="canonical" href="([^"]+)"', resp.text)
            if match:
                real_url = match.group(1)
        return real_url
    finally:
        client.close()

def resolve_urls_batch(articles, proxy=None, delay_range=(0.5, 1.5)):
    """Resolve Google News redirect URLs for a batch of articles."""
    resolved = []
    for article in articles:
        real_url = resolve_google_news_url(article["link"], proxy=proxy)
        resolved.append({**article, "real_url": real_url})
        time.sleep(random.uniform(*delay_range))
    return resolved
Approach 3: Full Article Content Extraction
Once you have the real publisher URLs, extract article text using trafilatura (better than newspaper3k for most modern news sites):
# article_extractor.py
import httpx
import trafilatura
import hashlib
import time
import random
from datetime import datetime

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
]

def extract_article(url, proxy_url=None, min_words=100):
    """
    Extract clean article text from a news URL.
    Uses trafilatura, which strips navigation and boilerplate more reliably
    than newspaper3k. Returns None if extraction fails or content is too short.
    """
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
    client = httpx.Client(
        transport=transport,
        timeout=20,
        headers={
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "https://news.google.com/",
        },
        follow_redirects=True,
    )
    try:
        resp = client.get(url)
        resp.raise_for_status()
    except httpx.HTTPError:  # covers connection, timeout, and status errors
        return None
    finally:
        client.close()
    text = trafilatura.extract(
        resp.text,
        include_comments=False,
        include_tables=False,
        no_fallback=False,
        favor_precision=True,
    )
    if not text or len(text.split()) < min_words:
        return None
    metadata = trafilatura.extract_metadata(resp.text)
    return {
        "url": url,
        "text": text,
        "word_count": len(text.split()),
        "content_hash": hashlib.md5(text.encode()).hexdigest(),
        "title": metadata.title if metadata else None,
        "author": metadata.author if metadata else None,
        "date": metadata.date if metadata else None,
        "description": metadata.description if metadata else None,
        "extracted_at": datetime.utcnow().isoformat(),
    }

def extract_articles_batch(urls, proxy_url=None, delay_range=(1.5, 4.0)):
    """Extract content from multiple URLs with polite delays."""
    results = []
    failed = []
    for i, url in enumerate(urls):
        article = extract_article(url, proxy_url=proxy_url)
        if article:
            results.append(article)
            print(f"  [{i+1}/{len(urls)}] OK: {url[:60]}... ({article['word_count']} words)")
        else:
            failed.append(url)
            print(f"  [{i+1}/{len(urls)}] FAILED: {url[:60]}")
        time.sleep(random.uniform(*delay_range))
    print(f"\nExtracted: {len(results)}/{len(urls)} | Failed: {len(failed)}")
    return results, failed
Deduplication Strategies
News stories get republished across dozens of outlets with minor rewrites. Here are four strategies, from simple to sophisticated.
1. Exact Hash Deduplication
import hashlib

def dedup_by_hash(articles):
    """Remove exact duplicate articles using the content hash."""
    seen_hashes = set()
    unique = []
    for article in articles:
        h = article.get("content_hash") or hashlib.md5(
            article.get("text", "").encode()
        ).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            unique.append(article)
    return unique
2. Title Similarity Deduplication
import re
from difflib import SequenceMatcher

STOPWORDS = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "of", "with", "by"}

def normalize_title(title):
    t = title.lower()
    t = re.sub(r"[^a-z0-9 ]", " ", t)
    t = re.sub(r"\s+", " ", t).strip()
    return " ".join(w for w in t.split() if w not in STOPWORDS)

def dedup_by_title(articles, similarity_threshold=0.8):
    """Remove near-duplicate articles based on title similarity."""
    unique = []
    seen_titles = []
    for article in articles:
        norm_title = normalize_title(article.get("title", ""))
        is_duplicate = any(
            SequenceMatcher(None, norm_title, existing).ratio() > similarity_threshold
            for existing in seen_titles
        )
        if not is_duplicate:
            unique.append(article)
            seen_titles.append(norm_title)
    return unique
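To see how the 0.8 threshold behaves on two outlets' versions of the same story, here is a self-contained sketch (the normalization is restated inline so the snippet runs alone; the headlines are invented examples):

```python
import re
from difflib import SequenceMatcher

def _norm(t):
    # same normalization as normalize_title above, restated so this snippet runs alone
    t = re.sub(r"[^a-z0-9 ]", " ", t.lower())
    t = re.sub(r"\s+", " ", t).strip()
    stop = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "of", "with", "by"}
    return " ".join(w for w in t.split() if w not in stop)

a = _norm("EU Approves Landmark AI Act After Final Vote")
b = _norm("The EU approves landmark AI act following final vote")
ratio = SequenceMatcher(None, a, b).ratio()
print(round(ratio, 2))  # above the 0.8 threshold, so these would be collapsed
```

One swapped word ("after" vs "following") still leaves a long shared character run, which is exactly the near-duplicate case exact hashing misses.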
3. Trigram Similarity Deduplication (Full Text)
def trigram_similarity(text_a, text_b):
    """
    Calculate Jaccard similarity using word trigrams.
    Values above 0.6 indicate near-duplicate articles.
    """
    def trigrams(text):
        words = text.lower().split()
        return set(tuple(words[i:i+3]) for i in range(len(words) - 2))
    set_a = trigrams(text_a)
    set_b = trigrams(text_b)
    if not set_a or not set_b:
        return 0.0
    intersection = set_a & set_b
    union = set_a | set_b
    return len(intersection) / len(union)

def deduplicate_articles(articles, threshold=0.6):
    """
    Remove near-duplicate articles based on full-text trigram similarity.
    O(n^2) — acceptable for up to a few hundred articles per batch.
    For larger sets, use MinHash/LSH (datasketch library).
    """
    unique = []
    for article in articles:
        text = article.get("text", "")
        is_duplicate = False
        for existing in unique:
            if trigram_similarity(text, existing.get("text", "")) > threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            unique.append(article)
    return unique
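A single changed word invalidates up to three trigrams, so similarity drops fast under small edits. A worked check with toy sentences (the function is restated so this snippet is self-contained):

```python
def trigram_similarity(text_a, text_b):
    # restated from above for a self-contained check
    def trigrams(text):
        words = text.lower().split()
        return set(tuple(words[i:i+3]) for i in range(len(words) - 2))
    set_a, set_b = trigrams(text_a), trigrams(text_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
# 9 words -> 7 trigrams each; the swapped word breaks 3 of them,
# leaving 4 shared trigrams out of 10 total
print(trigram_similarity(a, b))  # 0.4
```

This is why 0.6 is a reasonable near-duplicate threshold for full article bodies: wire-copy rewrites change a few words per paragraph, not most of them.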
4. MinHash for Large-Scale Deduplication
For production pipelines processing thousands of articles per day:
pip install datasketch
from datasketch import MinHash, MinHashLSH
import re

def text_to_shingles(text, k=3):
    words = re.findall(r"\b\w+\b", text.lower())
    return set(" ".join(words[i:i+k]) for i in range(len(words) - k + 1))

def build_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for shingle in text_to_shingles(text):
        m.update(shingle.encode("utf8"))
    return m

def build_dedup_index(articles, threshold=0.6, num_perm=128):
    """
    Build an LSH index for fast approximate near-duplicate detection.
    Scales to millions of articles efficiently.
    """
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    unique = []
    for i, article in enumerate(articles):
        text = article.get("text", "")
        if not text:
            continue
        mh = build_minhash(text, num_perm=num_perm)
        key = f"article_{i}"
        if not lsh.query(mh):  # no near-duplicate already indexed
            lsh.insert(key, mh)
            unique.append(article)
    return unique
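The idea datasketch implements can be sketched in pure stdlib Python: hash every shingle under many "hash functions" (here simulated by XOR masks, a toy stand-in for proper permutations), keep only the minimum per function, and compare signatures slot by slot. This is an illustration of the principle, not a replacement for datasketch:

```python
import random
import re

def shingles(text, k=3):
    words = re.findall(r"\b\w+\b", text.lower())
    return set(" ".join(words[i:i+k]) for i in range(len(words) - k + 1))

def minhash_signature(shingle_set, num_perm=64, seed=42):
    # XOR with random 64-bit masks simulates num_perm independent hash functions.
    # Note: Python's hash() is salted per process, so signatures are only
    # comparable within a single run.
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_perm)]
    return [min(hash(s) ^ mask for s in shingle_set) for mask in masks]

def estimate_jaccard(sig_a, sig_b):
    # fraction of matching signature slots approximates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Identical texts produce identical signatures (estimate 1.0); unrelated texts match almost no slots. The LSH layer on top of this is what lets datasketch answer "any near-duplicate already indexed?" without comparing against every stored article.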
Anti-Bot Measures and Proxy Rotation
Google is aggressive about blocking automated requests. RSS feeds are more lenient than the web interface, but at scale you will still hit CAPTCHAs and 429 responses.
Header Rotation
import random

CHROME_VERSIONS = ["124.0.0.0", "123.0.0.0", "122.0.6261.112"]

def random_headers():
    ver = random.choice(CHROME_VERSIONS)
    major = ver.split(".")[0]
    return {
        "User-Agent": f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/{ver} Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.9"]),
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Ch-Ua": f'"Chromium";v="{major}", "Google Chrome";v="{major}"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    }
Proxy Rotation with ThorData
Rotating residential proxies are the most effective defense against IP-based rate limiting on Google properties. ThorData provides 90M+ residential IPs across 190+ countries — critical for Google News because results are region-specific, so you may want IPs from specific countries to get localized news.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = "9000"

def get_proxy(country=None, session_id=None):
    """
    Build a ThorData proxy URL.
    country: ISO 2-letter code (US, GB, DE, JP, etc.) for geo-targeted results
    session_id: pass a string to get sticky IPs across multiple requests;
                omit for per-request rotation
    """
    user = THORDATA_USER
    if country:
        user = f"{user}-country-{country.upper()}"
    if session_id:
        user = f"{user}-session-{session_id}"
    return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"

# Example: scrape US Google News with a US residential IP
proxy_us = get_proxy(country="US")
articles = fetch_google_news("renewable energy policy", proxy=proxy_us)

# Example: get German news with a German IP
proxy_de = get_proxy(country="DE")
articles_de = fetch_google_news("Bundestag Klimapolitik", lang="de", country="DE", proxy=proxy_de)

# Example: sticky session for following up on articles (same IP)
proxy_sticky = get_proxy(session_id="news_session_001")
Backoff on 429 Responses
import httpx
import time
import random

def fetch_rss_with_retry(url, params=None, proxy=None, max_retries=5):
    """
    Fetch an RSS feed with exponential backoff on rate-limit errors.
    Enforces a 3-minute retry budget: give up rather than loop indefinitely.
    """
    transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
    start_time = time.time()
    for attempt in range(max_retries):
        if time.time() - start_time > 180:
            raise TimeoutError("Exceeded 3-minute retry budget")
        try:
            with httpx.Client(
                transport=transport,
                timeout=15,
                headers=random_headers(),
                follow_redirects=True,
            ) as client:
                resp = client.get(url, params=params)
            if resp.status_code == 200:
                return resp
            elif resp.status_code == 429:
                wait = (2 ** attempt) + random.uniform(0, 2)
                print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
                time.sleep(wait)
            elif resp.status_code == 403:
                print("Blocked. Rotating proxy and waiting...")
                proxy = get_proxy()  # get_proxy() from the ThorData section above
                transport = httpx.HTTPTransport(proxy=proxy)
                time.sleep(5)
            else:
                resp.raise_for_status()
        except httpx.ConnectError as e:
            wait = 2 ** attempt
            print(f"Connection error: {e}. Waiting {wait}s...")
            time.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} retries: {url}")
Approach 4: Async Pipeline for High-Volume Monitoring
For monitoring dozens of queries and extracting full articles in near-real-time, synchronous requests are too slow. Here is a production-ready async pipeline:
import asyncio
import json
import random
from pathlib import Path

import aiohttp
import feedparser
import trafilatura

async def fetch_rss_async(session, query, country="US", lang="en"):
    """Fetch an RSS feed asynchronously."""
    url = "https://news.google.com/rss/search"
    params = {
        "q": query,
        "hl": f"{lang}-{country}",
        "gl": country,
        "ceid": f"{country}:{lang}",
    }
    async with session.get(url, params=params) as resp:
        text = await resp.text()
    feed = feedparser.parse(text)
    return [
        {
            "title": e.title,
            "source": e.source.get("title", "") if hasattr(e, "source") else "",
            "published": e.get("published", ""),
            "link": e.link,
            "query": query,
        }
        for e in feed.entries[:20]
    ]

async def resolve_url_async(session, google_url):
    """Follow the Google News redirect asynchronously."""
    try:
        async with session.head(google_url, allow_redirects=True) as resp:
            return str(resp.url)
    except Exception:
        return google_url

async def extract_article_async(session, url):
    """Extract article text asynchronously."""
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Referer": "https://news.google.com/",
        }
        async with session.get(url, headers=headers, timeout=aiohttp.ClientTimeout(total=20)) as resp:
            html = await resp.text()
        text = trafilatura.extract(html, favor_precision=True)
        return {"url": url, "text": text, "word_count": len(text.split()) if text else 0}
    except Exception as e:
        return {"url": url, "text": None, "error": str(e)}

async def run_news_pipeline(queries, output_file="news_pipeline_output.json", proxy=None):
    """
    Full async pipeline: fetch RSS, resolve URLs, extract content.
    Processes all queries concurrently for maximum throughput.
    """
    connector = aiohttp.TCPConnector(limit=10)
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(
        connector=connector,
        timeout=timeout,
        headers={"User-Agent": "Mozilla/5.0"},
    ) as session:
        print(f"Fetching RSS for {len(queries)} queries...")
        rss_tasks = [fetch_rss_async(session, q) for q in queries]
        rss_results = await asyncio.gather(*rss_tasks, return_exceptions=True)
        all_articles = []
        for result in rss_results:
            if isinstance(result, list):
                all_articles.extend(result)
        print(f"Found {len(all_articles)} article references")

        print("Resolving redirect URLs...")
        resolve_tasks = [resolve_url_async(session, a["link"]) for a in all_articles]
        real_urls = await asyncio.gather(*resolve_tasks, return_exceptions=True)
        for article, real_url in zip(all_articles, real_urls):
            if isinstance(real_url, str):
                article["real_url"] = real_url

        print("Extracting article content...")
        semaphore = asyncio.Semaphore(5)

        async def extract_with_sem(url):
            async with semaphore:
                await asyncio.sleep(random.uniform(0.5, 2.0))
                return await extract_article_async(session, url)

        extract_tasks = [
            extract_with_sem(a.get("real_url", a["link"]))
            for a in all_articles
        ]
        extractions = await asyncio.gather(*extract_tasks, return_exceptions=True)
        for article, extraction in zip(all_articles, extractions):
            if isinstance(extraction, dict):
                article.update(extraction)

    complete = [a for a in all_articles if a.get("text") and a.get("word_count", 0) > 100]
    print(f"Complete articles with content: {len(complete)}/{len(all_articles)}")
    Path(output_file).write_text(json.dumps(complete, indent=2))
    print(f"Saved to {output_file}")
    return complete

queries = [
    "artificial intelligence regulation 2026",
    "renewable energy investment",
    "semiconductor supply chain",
    "remote work enterprise policy",
    "quantum computing commercial applications",
]
articles = asyncio.run(run_news_pipeline(queries, output_file="news_monitor.json"))
Complete Aggregator: Production News Monitor
# news_aggregator.py
import time
import random
import hashlib
import sqlite3
from datetime import datetime

# fetch_google_news, resolve_google_news_url, extract_article, and get_proxy
# come from the earlier sections.

class GoogleNewsAggregator:
    """
    Production news aggregator using Google News RSS.
    Features: multi-query fetch, deduplication, proxy rotation, incremental storage.
    """

    def __init__(self, proxy_url=None, db_path="news_monitor.db"):
        self.proxy_url = proxy_url
        self.db_path = db_path
        self._init_db()
        self.seen_hashes = self._load_seen_hashes()
        print(f"Loaded {len(self.seen_hashes)} known article hashes")

    def _init_db(self):
        conn = sqlite3.connect(self.db_path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS articles (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title_hash TEXT UNIQUE,
                title TEXT,
                source TEXT,
                published TEXT,
                link TEXT,
                real_url TEXT,
                query TEXT,
                text TEXT,
                word_count INTEGER,
                fetch_date TEXT
            )
        """)
        conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_date ON articles(fetch_date)")
        conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_source ON articles(source)")
        conn.commit()
        conn.close()

    def _load_seen_hashes(self):
        conn = sqlite3.connect(self.db_path)
        hashes = set(
            row[0] for row in conn.execute("SELECT title_hash FROM articles").fetchall()
        )
        conn.close()
        return hashes

    def _article_hash(self, title):
        return hashlib.md5(title.lower().strip().encode()).hexdigest()

    def fetch_articles(self, queries, articles_per_query=15):
        all_articles = []
        for query in queries:
            print(f"Fetching: {query}")
            try:
                articles = fetch_google_news(
                    query,
                    max_results=articles_per_query,
                    proxy=self.proxy_url,
                )
                for article in articles:
                    h = self._article_hash(article["title"])
                    if h not in self.seen_hashes:
                        self.seen_hashes.add(h)
                        article["query"] = query
                        article["fetch_date"] = datetime.utcnow().isoformat()
                        article["title_hash"] = h
                        all_articles.append(article)
            except Exception as e:
                print(f"  Error fetching '{query}': {e}")
            time.sleep(random.uniform(2, 4))
        return all_articles

    def resolve_and_extract(self, articles, extract_content=True):
        enriched = []
        for i, article in enumerate(articles):
            print(f"  [{i+1}/{len(articles)}] {article['title'][:60]}...")
            real_url = resolve_google_news_url(article["link"], proxy=self.proxy_url)
            article["real_url"] = real_url
            if extract_content and "news.google.com" not in real_url:
                content = extract_article(real_url, proxy_url=self.proxy_url)
                if content:
                    article["text"] = content["text"]
                    article["word_count"] = content["word_count"]
            enriched.append(article)
            time.sleep(random.uniform(1.5, 3.5))
        return enriched

    def save(self, articles):
        conn = sqlite3.connect(self.db_path)
        saved = 0
        for a in articles:
            try:
                conn.execute("""
                    INSERT OR IGNORE INTO articles
                    (title_hash, title, source, published, link, real_url,
                     query, text, word_count, fetch_date)
                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                """, (
                    a.get("title_hash"),
                    a.get("title"),
                    a.get("source"),
                    a.get("published"),
                    a.get("link"),
                    a.get("real_url"),
                    a.get("query"),
                    a.get("text"),
                    a.get("word_count"),
                    a.get("fetch_date"),
                ))
                saved += 1
            except Exception as e:
                print(f"  Save error: {e}")
        conn.commit()
        conn.close()
        print(f"Saved {saved} new articles")
        return saved

    def run(self, queries, extract_content=True):
        print(f"Starting news aggregation for {len(queries)} queries")
        print(f"Proxy: {'enabled' if self.proxy_url else 'disabled'}")
        print()
        articles = self.fetch_articles(queries)
        print(f"\nNew articles found: {len(articles)}")
        if articles and extract_content:
            print("\nExtracting content...")
            articles = self.resolve_and_extract(articles, extract_content=True)
        saved = self.save(articles)
        return articles, saved

MONITORING_QUERIES = [
    "python web scraping 2026",
    "AI regulation legislation",
    "data privacy GDPR enforcement",
    "web automation tools",
    "residential proxy services",
]

aggregator = GoogleNewsAggregator(
    proxy_url=None,  # or get_proxy(country="US") to route through residential IPs
    db_path="news_monitor.db",
)
new_articles, count = aggregator.run(
    queries=MONITORING_QUERIES,
    extract_content=True,
)
print(f"\nDone. Saved {count} new articles to news_monitor.db")
Storing Results in SQLite
See the aggregator class above for the full schema. For a simpler export to JSON:
import json
import sqlite3
from datetime import datetime, timedelta
from pathlib import Path

def export_articles_json(db_path, output_path, days_back=7):
    """Export recent articles from SQLite to JSON."""
    cutoff = (datetime.utcnow() - timedelta(days=days_back)).isoformat()
    conn = sqlite3.connect(db_path)
    cols = ["id", "title", "source", "published", "real_url", "query", "text", "word_count", "fetch_date"]
    rows = conn.execute(
        f"SELECT {','.join(cols)} FROM articles WHERE fetch_date > ? ORDER BY fetch_date DESC",
        (cutoff,)
    ).fetchall()
    conn.close()
    articles = [dict(zip(cols, row)) for row in rows]
    Path(output_path).write_text(json.dumps(articles, indent=2))
    print(f"Exported {len(articles)} articles to {output_path}")
    return articles
Rate Limits and Practical Considerations
Google News RSS feeds are public and do not require authentication, but they are not an official API. Practical limits:
- Keep requests under 1 per second for RSS feeds
- Do not redistribute full article text — extract what you need for analysis
- Check each publisher's robots.txt before scraping their articles directly
- For production use at scale, consider the Google News API via SerpAPI or similar services that handle compliance
The RSS approach works well for monitoring up to a few hundred queries per day. Beyond that, you need proxy infrastructure like ThorData and more sophisticated request distribution to stay under Google's radar without hitting 429 or CAPTCHA responses.
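The 1-request-per-second guideline above is easier to honor with a small limiter than with ad-hoc sleep calls scattered through the code. A minimal sketch (the class name is ours, not part of any library):

```python
import time

class MinIntervalLimiter:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = float("-inf")  # first call never waits

    def wait(self):
        now = time.monotonic()
        remaining = self._last + self.min_interval - now
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

limiter = MinIntervalLimiter(min_interval=1.0)
# Call limiter.wait() immediately before each RSS request, e.g.:
#   limiter.wait()
#   resp = client.get(url, params=params)
```

Using `time.monotonic()` rather than `time.time()` keeps the limiter correct even if the system clock is adjusted mid-run.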
Practical Applications
Media Monitoring: Track brand, competitor, or keyword mentions across thousands of news sources automatically. Set up daily runs that email you digests of new articles matching your topics.
Market Intelligence: Monitor news around publicly-traded companies, industries, or regulatory developments. Combine with sentiment analysis to build a news-driven signal for investment research.
Research Corpora: Build labeled datasets of news articles for NLP research — classification, summarization, named entity recognition. The RSS metadata gives you clean labels (source, topic, date) for free.
Content Curation: Power an automated newsletter or social media account with curated news summaries. The deduplication pipeline ensures you do not post the same story twice.
SEO Research: Monitor news coverage of specific keywords to find content gap opportunities — topics that are in the news but not well-covered by evergreen SEO content.
The Google News RSS approach is genuinely one of the most reliable and low-friction scraping targets: the feeds are public, require no authentication, and the rate limits are generous for moderate use. For scale, pair it with ThorData residential proxies for geo-targeted news collection across multiple countries.