Scraping Google News Articles in 2026 (RSS + Topic APIs)
Google News aggregates articles from thousands of publishers into topic-based feeds. If you need structured news data — for market research, media monitoring, or building a custom aggregator — there are several ways to pull it programmatically.
This guide covers four approaches in order of complexity: the public RSS and topic feeds (simplest, most reliable), resolving the real publisher URLs behind Google's redirect links, full article content extraction with deduplication, and an async pipeline for high-volume monitoring — including how to handle proxy rotation when Google's rate limiting kicks in.
Environment Setup
pip install httpx feedparser trafilatura beautifulsoup4 lxml aiohttp aiofiles
The async pipeline uses asyncio, which ships with the Python standard library and needs no separate install; aiohttp and aiofiles are already included above.
Approach 1: Google News RSS Feeds
Google News still serves RSS feeds, though they are not prominently linked anywhere on the site. The base URL pattern is:
https://news.google.com/rss/search?q=QUERY&hl=en-US&gl=US&ceid=US:en
You can also get topic-specific feeds using topic IDs:
https://news.google.com/rss/topics/CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx1YlY4U0FtVnVHZ0pWVXlnQVAB?hl=en-US
The RSS response includes the article title, publisher, publication date, and a Google News redirect URL. The actual article URL is embedded in the redirect link.
# google_news_rss.py
import httpx
import feedparser

def fetch_google_news(query, lang="en", country="US", max_results=20, proxy=None):
    """
    Fetch Google News articles via the public RSS feed endpoint.
    Returns a structured article list with title, source, date, and link.
    """
    transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
    client = httpx.Client(
        transport=transport,
        timeout=15,
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Accept-Language": f"{lang}-{country},{lang};q=0.9",
        },
        follow_redirects=True,
    )
    url = "https://news.google.com/rss/search"
    params = {
        "q": query,
        "hl": f"{lang}-{country}",
        "gl": country,
        "ceid": f"{country}:{lang}",
    }
    try:
        resp = client.get(url, params=params)
        resp.raise_for_status()
    finally:
        client.close()
    feed = feedparser.parse(resp.text)
    articles = []
    for entry in feed.entries[:max_results]:
        source = ""
        if hasattr(entry, "source"):
            source = entry.source.get("title", "")
        elif hasattr(entry, "tags") and entry.tags:
            source = entry.tags[0].get("label", "")
        articles.append({
            "title": entry.title,
            "source": source,
            "published": entry.get("published", ""),
            "published_parsed": entry.get("published_parsed"),
            "link": entry.link,
            "description": entry.get("summary", ""),
            "id": entry.get("id", ""),
        })
    return articles

# Usage
results = fetch_google_news("artificial intelligence regulation 2026")
for article in results:
    print(f"[{article['source']}] {article['title']}")
    print(f"  Published: {article['published']}")
    print()
Topic-Based RSS Feeds
Google News organizes content into topic clusters. You can access specific topic feeds using their encoded topic IDs:
GOOGLE_NEWS_TOPICS = {
    "top_stories": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx1YlY4U0FtVnVHZ0pWVXlnQVAB",
    "world": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRGx1YlY4U0FtVnVHZ0pWVXlnQVAB",
    "business": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRFp1WlY4U0FtVnVHZ0pWVXlnQVAB",
    "technology": "CAAqJggKIiBDQkFTRWdvSUwyMHZNRGRqTVhZU0FtVnVHZ0pWVXlnQVAB",
    "science": "CAAqJggKIiBDQkFTRWdvSUwyMHZNR1p0Y1hRU0FtVnVHZ0pWVXlnQVAB",
    "health": "CAAqIQgKIhtDQkFTRGdvSUwyMHZNR3QwTlRFU0FtVnVLQUFQAQ",
    "sports": "CAAqJggKIiBDQkFTRWdvSUwyMHZNR1oxY1hRU0FtVnVHZ0pWVXlnQVAB",
    "entertainment": "CAAqJggKIiBDQkFTRWdvSUwyMHZNREpxYVhRU0FtVnVHZ0pWVXlnQVAB",
}

def fetch_topic_feed(topic_key, lang="en", country="US", max_results=20):
    """Fetch articles from a specific Google News topic category."""
    topic_id = GOOGLE_NEWS_TOPICS.get(topic_key)
    if not topic_id:
        raise ValueError(f"Unknown topic: {topic_key}")
    url = f"https://news.google.com/rss/topics/{topic_id}"
    params = {"hl": f"{lang}-{country}", "gl": country, "ceid": f"{country}:{lang}"}
    with httpx.Client(
        timeout=15,
        headers={"User-Agent": "Mozilla/5.0 (compatible; NewsBot/1.0)"},
        follow_redirects=True,
    ) as client:
        resp = client.get(url, params=params)
    feed = feedparser.parse(resp.text)
    return [
        {
            "title": e.title,
            "source": e.source.get("title", "") if hasattr(e, "source") else "",
            "published": e.get("published", ""),
            "link": e.link,
        }
        for e in feed.entries[:max_results]
    ]

# Monitor multiple topics
for topic in ["technology", "business", "science"]:
    articles = fetch_topic_feed(topic, max_results=5)
    print(f"\n--- {topic.upper()} ---")
    for a in articles:
        print(f"  [{a['source']}] {a['title']}")
Approach 2: Resolving Real Article URLs
Google News wraps every link through a redirect (news.google.com/rss/articles/...). To get the actual publisher URL, follow the HTTP redirect chain:
import httpx
import re
import time
import random

def resolve_google_news_url(google_url, proxy=None):
    """
    Follow the Google News redirect to get the real publisher article URL.
    Returns the final URL after all redirects.
    """
    transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
    client = httpx.Client(transport=transport, timeout=10, follow_redirects=True)
    try:
        resp = client.head(google_url)
        real_url = str(resp.url)
        if "news.google.com" in real_url:
            # HEAD did not leave Google; fetch the page and look for the canonical link
            resp = client.get(google_url)
            match = re.search(r'<link rel="canonical" href="([^"]+)"', resp.text)
            if match:
                real_url = match.group(1)
        return real_url
    finally:
        client.close()

def resolve_urls_batch(articles, proxy=None, delay_range=(0.5, 1.5)):
    """Resolve Google News redirect URLs for a batch of articles."""
    resolved = []
    for article in articles:
        real_url = resolve_google_news_url(article["link"], proxy=proxy)
        resolved.append({**article, "real_url": real_url})
        time.sleep(random.uniform(*delay_range))
    return resolved
Approach 3: Full Article Content Extraction
Once you have the real publisher URLs, extract article text using trafilatura (better than newspaper3k for most modern news sites):
# article_extractor.py
import httpx
import trafilatura
import hashlib
import time
import random
from datetime import datetime

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/124.0.0.0 Safari/537.36",
]

def extract_article(url, proxy_url=None, min_words=100):
    """
    Extract clean article text from a news URL.
    Uses trafilatura, which strips navigation and boilerplate more reliably
    than newspaper3k. Returns None if extraction fails or content is too short.
    """
    transport = httpx.HTTPTransport(proxy=proxy_url) if proxy_url else None
    client = httpx.Client(
        transport=transport,
        timeout=20,
        headers={
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "https://news.google.com/",
        },
        follow_redirects=True,
    )
    try:
        resp = client.get(url)
        resp.raise_for_status()
    except httpx.HTTPError:  # covers connection, timeout, and status errors
        return None
    finally:
        client.close()
    text = trafilatura.extract(
        resp.text,
        include_comments=False,
        include_tables=False,
        no_fallback=False,
        favor_precision=True,
    )
    if not text or len(text.split()) < min_words:
        return None
    metadata = trafilatura.extract_metadata(resp.text)
    return {
        "url": url,
        "text": text,
        "word_count": len(text.split()),
        "content_hash": hashlib.md5(text.encode()).hexdigest(),
        "title": metadata.title if metadata else None,
        "author": metadata.author if metadata else None,
        "date": metadata.date if metadata else None,
        "description": metadata.description if metadata else None,
        "extracted_at": datetime.utcnow().isoformat(),
    }

def extract_articles_batch(urls, proxy_url=None, delay_range=(1.5, 4.0)):
    """Extract content from multiple URLs with polite delays."""
    results = []
    failed = []
    for i, url in enumerate(urls):
        article = extract_article(url, proxy_url=proxy_url)
        if article:
            results.append(article)
            print(f"  [{i+1}/{len(urls)}] OK: {url[:60]}... ({article['word_count']} words)")
        else:
            failed.append(url)
            print(f"  [{i+1}/{len(urls)}] FAILED: {url[:60]}")
        time.sleep(random.uniform(*delay_range))
    print(f"\nExtracted: {len(results)}/{len(urls)} | Failed: {len(failed)}")
    return results, failed
Deduplication Strategies
News stories get republished across dozens of outlets with minor rewrites. Here are four strategies, from simple to sophisticated.
1. Exact Hash Deduplication
import hashlib

def dedup_by_hash(articles):
    """Remove exact duplicate articles using the content hash."""
    seen_hashes = set()
    unique = []
    for article in articles:
        h = article.get("content_hash") or hashlib.md5(
            article.get("text", "").encode()
        ).hexdigest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            unique.append(article)
    return unique
2. Title Similarity Deduplication
import re
from difflib import SequenceMatcher

STOPWORDS = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "of", "with", "by"}

def normalize_title(title):
    t = title.lower()
    t = re.sub(r"[^a-z0-9 ]", " ", t)
    t = re.sub(r"\s+", " ", t).strip()
    return " ".join(w for w in t.split() if w not in STOPWORDS)

def dedup_by_title(articles, similarity_threshold=0.8):
    """Remove near-duplicate articles based on title similarity."""
    unique = []
    seen_titles = []
    for article in articles:
        norm_title = normalize_title(article.get("title", ""))
        is_duplicate = any(
            SequenceMatcher(None, norm_title, existing).ratio() > similarity_threshold
            for existing in seen_titles
        )
        if not is_duplicate:
            unique.append(article)
            seen_titles.append(norm_title)
    return unique
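To see how the 0.8 threshold behaves on two outlets' versions of the same story, here is a self-contained sketch (the normalization is restated inline so the snippet runs alone; the headlines are invented examples):

```python
import re
from difflib import SequenceMatcher

def _norm(t):
    # same normalization as normalize_title above, restated so this snippet runs alone
    t = re.sub(r"[^a-z0-9 ]", " ", t.lower())
    t = re.sub(r"\s+", " ", t).strip()
    stop = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "of", "with", "by"}
    return " ".join(w for w in t.split() if w not in stop)

a = _norm("EU Approves Landmark AI Act After Final Vote")
b = _norm("The EU approves landmark AI act following final vote")
ratio = SequenceMatcher(None, a, b).ratio()
print(round(ratio, 2))  # above the 0.8 threshold, so these would be collapsed
```

One swapped word ("after" vs "following") still leaves a long shared character run, which is exactly the near-duplicate case exact hashing misses.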
3. Trigram Similarity Deduplication (Full Text)
def trigram_similarity(text_a, text_b):
    """
    Calculate Jaccard similarity using word trigrams.
    Values above 0.6 indicate near-duplicate articles.
    """
    def trigrams(text):
        words = text.lower().split()
        return set(tuple(words[i:i+3]) for i in range(len(words) - 2))
    set_a = trigrams(text_a)
    set_b = trigrams(text_b)
    if not set_a or not set_b:
        return 0.0
    intersection = set_a & set_b
    union = set_a | set_b
    return len(intersection) / len(union)

def deduplicate_articles(articles, threshold=0.6):
    """
    Remove near-duplicate articles based on full-text trigram similarity.
    O(n^2) — acceptable for up to a few hundred articles per batch.
    For larger sets, use MinHash/LSH (datasketch library).
    """
    unique = []
    for article in articles:
        text = article.get("text", "")
        is_duplicate = False
        for existing in unique:
            if trigram_similarity(text, existing.get("text", "")) > threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            unique.append(article)
    return unique
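A single changed word invalidates up to three trigrams, so similarity drops fast under small edits. A worked check with toy sentences (the function is restated so this snippet is self-contained):

```python
def trigram_similarity(text_a, text_b):
    # restated from above for a self-contained check
    def trigrams(text):
        words = text.lower().split()
        return set(tuple(words[i:i+3]) for i in range(len(words) - 2))
    set_a, set_b = trigrams(text_a), trigrams(text_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
# 9 words -> 7 trigrams each; the swapped word breaks 3 of them,
# leaving 4 shared trigrams out of 10 total
print(trigram_similarity(a, b))  # 0.4
```

This is why 0.6 is a reasonable near-duplicate threshold for full article bodies: wire-copy rewrites change a few words per paragraph, not most of them.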
4. MinHash for Large-Scale Deduplication
For production pipelines processing thousands of articles per day:
pip install datasketch
from datasketch import MinHash, MinHashLSH
import re

def text_to_shingles(text, k=3):
    words = re.findall(r"\b\w+\b", text.lower())
    return set(" ".join(words[i:i+k]) for i in range(len(words) - k + 1))

def build_minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for shingle in text_to_shingles(text):
        m.update(shingle.encode("utf8"))
    return m

def build_dedup_index(articles, threshold=0.6, num_perm=128):
    """
    Build an LSH index for fast approximate near-duplicate detection.
    Scales to millions of articles efficiently.
    """
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    unique = []
    for i, article in enumerate(articles):
        text = article.get("text", "")
        if not text:
            continue
        mh = build_minhash(text, num_perm=num_perm)
        key = f"article_{i}"
        if not lsh.query(mh):  # no near-duplicate already indexed
            lsh.insert(key, mh)
            unique.append(article)
    return unique
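The idea datasketch implements can be sketched in pure stdlib Python: hash every shingle under many "hash functions" (here simulated by XOR masks, a toy stand-in for proper permutations), keep only the minimum per function, and compare signatures slot by slot. This is an illustration of the principle, not a replacement for datasketch:

```python
import random
import re

def shingles(text, k=3):
    words = re.findall(r"\b\w+\b", text.lower())
    return set(" ".join(words[i:i+k]) for i in range(len(words) - k + 1))

def minhash_signature(shingle_set, num_perm=64, seed=42):
    # XOR with random 64-bit masks simulates num_perm independent hash functions.
    # Note: Python's hash() is salted per process, so signatures are only
    # comparable within a single run.
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_perm)]
    return [min(hash(s) ^ mask for s in shingle_set) for mask in masks]

def estimate_jaccard(sig_a, sig_b):
    # fraction of matching signature slots approximates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Identical texts produce identical signatures (estimate 1.0); unrelated texts match almost no slots. The LSH layer on top of this is what lets datasketch answer "any near-duplicate already indexed?" without comparing against every stored article.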
Anti-Bot Measures and Proxy Rotation
Google is aggressive about blocking automated requests. RSS feeds are more lenient than the web interface, but at scale you will still hit CAPTCHAs and 429 responses.
Header Rotation
import random

CHROME_VERSIONS = ["124.0.0.0", "123.0.0.0", "122.0.6261.112"]

def random_headers():
    ver = random.choice(CHROME_VERSIONS)
    major = ver.split(".")[0]
    return {
        "User-Agent": f"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/{ver} Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": random.choice(["en-US,en;q=0.9", "en-GB,en;q=0.9"]),
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Ch-Ua": f'"Chromium";v="{major}", "Google Chrome";v="{major}"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    }
Proxy Rotation with ThorData
Rotating residential proxies are the most effective defense against IP-based rate limiting on Google properties. ThorData provides 90M+ residential IPs across 190+ countries — critical for Google News because results are region-specific, so you may want IPs from specific countries to get localized news.
THORDATA_USER = "your_username"
THORDATA_PASS = "your_password"
THORDATA_HOST = "proxy.thordata.com"
THORDATA_PORT = "9000"

def get_proxy(country=None, session_id=None):
    """
    Build a ThorData proxy URL.
    country: ISO 2-letter code (US, GB, DE, JP, etc.) for geo-targeted results
    session_id: pass a string to get sticky IPs across multiple requests;
                omit for per-request rotation
    """
    user = THORDATA_USER
    if country:
        user = f"{user}-country-{country.upper()}"
    if session_id:
        user = f"{user}-session-{session_id}"
    return f"http://{user}:{THORDATA_PASS}@{THORDATA_HOST}:{THORDATA_PORT}"

# Example: scrape US Google News with a US residential IP
proxy_us = get_proxy(country="US")
articles = fetch_google_news("renewable energy policy", proxy=proxy_us)

# Example: get German news with a German IP
proxy_de = get_proxy(country="DE")
articles_de = fetch_google_news("Bundestag Klimapolitik", lang="de", country="DE", proxy=proxy_de)

# Example: sticky session for following up on articles (same IP)
proxy_sticky = get_proxy(session_id="news_session_001")
Backoff on 429 Responses
import httpx
import time
import random

def fetch_rss_with_retry(url, params=None, proxy=None, max_retries=5):
    """
    Fetch an RSS feed with exponential backoff on rate-limit errors.
    Enforces a 3-minute retry budget: give up rather than loop indefinitely.
    """
    transport = httpx.HTTPTransport(proxy=proxy) if proxy else None
    start_time = time.time()
    for attempt in range(max_retries):
        if time.time() - start_time > 180:
            raise TimeoutError("Exceeded 3-minute retry budget")
        try:
            with httpx.Client(
                transport=transport,
                timeout=15,
                headers=random_headers(),
                follow_redirects=True,
            ) as client:
                resp = client.get(url, params=params)
            if resp.status_code == 200:
                return resp
            elif resp.status_code == 429:
                wait = (2 ** attempt) + random.uniform(0, 2)
                print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/{max_retries})")
                time.sleep(wait)
            elif resp.status_code == 403:
                print("Blocked. Rotating proxy and waiting...")
                proxy = get_proxy()  # get_proxy() from the ThorData section above
                transport = httpx.HTTPTransport(proxy=proxy)
                time.sleep(5)
            else:
                resp.raise_for_status()
        except httpx.ConnectError as e:
            wait = 2 ** attempt
            print(f"Connection error: {e}. Waiting {wait}s...")
            time.sleep(wait)
    raise RuntimeError(f"Failed after {max_retries} retries: {url}")
Approach 4: Async Pipeline for High-Volume Monitoring
For monitoring dozens of queries and extracting full articles in near-real-time, synchronous requests are too slow. Here is a production-ready async pipeline:
import asyncio
import json
import random
from pathlib import Path

import aiohttp
import feedparser
import trafilatura

async def fetch_rss_async(session, query, country="US", lang="en"):
    """Fetch an RSS feed asynchronously."""
    url = "https://news.google.com/rss/search"
    params = {
        "q": query,
        "hl": f"{lang}-{country}",
        "gl": country,
        "ceid": f"{country}:{lang}",
    }
    async with session.get(url, params=params) as resp:
        text = await resp.text()
    feed = feedparser.parse(text)
    return [
        {
            "title": e.title,
            "source": e.source.get("title", "") if hasattr(e, "source") else "",
            "published": e.get("published", ""),
            "link": e.link,
            "query": query,
        }
        for e in feed.entries[:20]
    ]

async def resolve_url_async(session, google_url):
    """Follow the Google News redirect asynchronously."""
    try:
        async with session.head(google_url, allow_redirects=True) as resp:
            return str(resp.url)
    except Exception:
        return google_url

async def extract_article_async(session, url):
    """Extract article text asynchronously."""
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Referer": "https://news.google.com/",
        }
        async with session.get(url, headers=headers, timeout=aiohttp.ClientTimeout(total=20)) as resp:
            html = await resp.text()
        text = trafilatura.extract(html, favor_precision=True)
        return {"url": url, "text": text, "word_count": len(text.split()) if text else 0}
    except Exception as e:
        return {"url": url, "text": None, "error": str(e)}

async def run_news_pipeline(queries, output_file="news_pipeline_output.json", proxy=None):
    """
    Full async pipeline: fetch RSS, resolve URLs, extract content.
    Processes all queries concurrently for maximum throughput.
    """
    connector = aiohttp.TCPConnector(limit=10)
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(
        connector=connector,
        timeout=timeout,
        headers={"User-Agent": "Mozilla/5.0"},
    ) as session:
        print(f"Fetching RSS for {len(queries)} queries...")
        rss_tasks = [fetch_rss_async(session, q) for q in queries]
        rss_results = await asyncio.gather(*rss_tasks, return_exceptions=True)
        all_articles = []
        for result in rss_results:
            if isinstance(result, list):
                all_articles.extend(result)
        print(f"Found {len(all_articles)} article references")

        print("Resolving redirect URLs...")
        resolve_tasks = [resolve_url_async(session, a["link"]) for a in all_articles]
        real_urls = await asyncio.gather(*resolve_tasks, return_exceptions=True)
        for article, real_url in zip(all_articles, real_urls):
            if isinstance(real_url, str):
                article["real_url"] = real_url

        print("Extracting article content...")
        semaphore = asyncio.Semaphore(5)

        async def extract_with_sem(url):
            async with semaphore:
                await asyncio.sleep(random.uniform(0.5, 2.0))
                return await extract_article_async(session, url)

        extract_tasks = [
            extract_with_sem(a.get("real_url", a["link"]))
            for a in all_articles
        ]
        extractions = await asyncio.gather(*extract_tasks, return_exceptions=True)
        for article, extraction in zip(all_articles, extractions):
            if isinstance(extraction, dict):
                article.update(extraction)

    complete = [a for a in all_articles if a.get("text") and a.get("word_count", 0) > 100]
    print(f"Complete articles with content: {len(complete)}/{len(all_articles)}")
    Path(output_file).write_text(json.dumps(complete, indent=2))
    print(f"Saved to {output_file}")
    return complete

queries = [
    "artificial intelligence regulation 2026",
    "renewable energy investment",
    "semiconductor supply chain",
    "remote work enterprise policy",
    "quantum computing commercial applications",
]
articles = asyncio.run(run_news_pipeline(queries, output_file="news_monitor.json"))
Complete Aggregator: Production News Monitor
# news_aggregator.py
import time
import random
import hashlib
import sqlite3
from datetime import datetime

# fetch_google_news, resolve_google_news_url, extract_article, and get_proxy
# come from the earlier sections.

class GoogleNewsAggregator:
    """
    Production news aggregator using Google News RSS.
    Features: multi-query fetch, deduplication, proxy rotation, incremental storage.
    """

    def __init__(self, proxy_url=None, db_path="news_monitor.db"):
        self.proxy_url = proxy_url
        self.db_path = db_path
        self._init_db()
        self.seen_hashes = self._load_seen_hashes()
        print(f"Loaded {len(self.seen_hashes)} known article hashes")

    def _init_db(self):
        conn = sqlite3.connect(self.db_path)
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS articles (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title_hash TEXT UNIQUE,
                title TEXT,
                source TEXT,
                published TEXT,
                link TEXT,
                real_url TEXT,
                query TEXT,
                text TEXT,
                word_count INTEGER,
                fetch_date TEXT
            )
        """)
        conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_date ON articles(fetch_date)")
        conn.execute("CREATE INDEX IF NOT EXISTS idx_articles_source ON articles(source)")
        conn.commit()
        conn.close()

    def _load_seen_hashes(self):
        conn = sqlite3.connect(self.db_path)
        hashes = set(
            row[0] for row in conn.execute("SELECT title_hash FROM articles").fetchall()
        )
        conn.close()
        return hashes

    def _article_hash(self, title):
        return hashlib.md5(title.lower().strip().encode()).hexdigest()

    def fetch_articles(self, queries, articles_per_query=15):
        all_articles = []
        for query in queries:
            print(f"Fetching: {query}")
            try:
                articles = fetch_google_news(
                    query,
                    max_results=articles_per_query,
                    proxy=self.proxy_url,
                )
                for article in articles:
                    h = self._article_hash(article["title"])
                    if h not in self.seen_hashes:
                        self.seen_hashes.add(h)
                        article["query"] = query
                        article["fetch_date"] = datetime.utcnow().isoformat()
                        article["title_hash"] = h
                        all_articles.append(article)
            except Exception as e:
                print(f"  Error fetching '{query}': {e}")
            time.sleep(random.uniform(2, 4))
        return all_articles

    def resolve_and_extract(self, articles, extract_content=True):
        enriched = []
        for i, article in enumerate(articles):
            print(f"  [{i+1}/{len(articles)}] {article['title'][:60]}...")
            real_url = resolve_google_news_url(article["link"], proxy=self.proxy_url)
            article["real_url"] = real_url
            if extract_content and "news.google.com" not in real_url:
                content = extract_article(real_url, proxy_url=self.proxy_url)
                if content:
                    article["text"] = content["text"]
                    article["word_count"] = content["word_count"]
            enriched.append(article)
            time.sleep(random.uniform(1.5, 3.5))
        return enriched

    def save(self, articles):
        conn = sqlite3.connect(self.db_path)
        saved = 0
        for a in articles:
            try:
                conn.execute("""
                    INSERT OR IGNORE INTO articles
                    (title_hash, title, source, published, link, real_url,
                     query, text, word_count, fetch_date)
                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                """, (
                    a.get("title_hash"),
                    a.get("title"),
                    a.get("source"),
                    a.get("published"),
                    a.get("link"),
                    a.get("real_url"),
                    a.get("query"),
                    a.get("text"),
                    a.get("word_count"),
                    a.get("fetch_date"),
                ))
                saved += 1
            except Exception as e:
                print(f"  Save error: {e}")
        conn.commit()
        conn.close()
        print(f"Saved {saved} new articles")
        return saved

    def run(self, queries, extract_content=True):
        print(f"Starting news aggregation for {len(queries)} queries")
        print(f"Proxy: {'enabled' if self.proxy_url else 'disabled'}")
        print()
        articles = self.fetch_articles(queries)
        print(f"\nNew articles found: {len(articles)}")
        if articles and extract_content:
            print("\nExtracting content...")
            articles = self.resolve_and_extract(articles, extract_content=True)
        saved = self.save(articles)
        return articles, saved

MONITORING_QUERIES = [
    "python web scraping 2026",
    "AI regulation legislation",
    "data privacy GDPR enforcement",
    "web automation tools",
    "residential proxy services",
]

aggregator = GoogleNewsAggregator(
    proxy_url=None,  # or get_proxy(country="US") to route through residential IPs
    db_path="news_monitor.db",
)
new_articles, count = aggregator.run(
    queries=MONITORING_QUERIES,
    extract_content=True,
)
print(f"\nDone. Saved {count} new articles to news_monitor.db")
Storing Results in SQLite
See the aggregator class above for the full schema. For a simpler export to JSON:
import json
import sqlite3
from datetime import datetime, timedelta
from pathlib import Path

def export_articles_json(db_path, output_path, days_back=7):
    """Export recent articles from SQLite to JSON."""
    cutoff = (datetime.utcnow() - timedelta(days=days_back)).isoformat()
    conn = sqlite3.connect(db_path)
    cols = ["id", "title", "source", "published", "real_url", "query", "text", "word_count", "fetch_date"]
    rows = conn.execute(
        f"SELECT {','.join(cols)} FROM articles WHERE fetch_date > ? ORDER BY fetch_date DESC",
        (cutoff,)
    ).fetchall()
    conn.close()
    articles = [dict(zip(cols, row)) for row in rows]
    Path(output_path).write_text(json.dumps(articles, indent=2))
    print(f"Exported {len(articles)} articles to {output_path}")
    return articles
Rate Limits and Practical Considerations
Google News RSS feeds are public and do not require authentication, but they are not an official API. Practical limits:
- Keep requests under 1 per second for RSS feeds
- Do not redistribute full article text — extract what you need for analysis
- Check each publisher's robots.txt before scraping their articles directly
- For production use at scale, consider the Google News API via SerpAPI or similar services that handle compliance
The RSS approach works well for monitoring up to a few hundred queries per day. Beyond that, you need proxy infrastructure like ThorData and more sophisticated request distribution to stay under Google's radar without hitting 429 or CAPTCHA responses.
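The 1-request-per-second guideline above is easier to honor with a small limiter than with ad-hoc sleep calls scattered through the code. A minimal sketch (the class name is ours, not part of any library):

```python
import time

class MinIntervalLimiter:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = float("-inf")  # first call never waits

    def wait(self):
        now = time.monotonic()
        remaining = self._last + self.min_interval - now
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()

limiter = MinIntervalLimiter(min_interval=1.0)
# Call limiter.wait() immediately before each RSS request, e.g.:
#   limiter.wait()
#   resp = client.get(url, params=params)
```

Using `time.monotonic()` rather than `time.time()` keeps the limiter correct even if the system clock is adjusted mid-run.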
Practical Applications
Media Monitoring: Track brand, competitor, or keyword mentions across thousands of news sources automatically. Set up daily runs that email you digests of new articles matching your topics.
Market Intelligence: Monitor news around publicly-traded companies, industries, or regulatory developments. Combine with sentiment analysis to build a news-driven signal for investment research.
Research Corpora: Build labeled datasets of news articles for NLP research — classification, summarization, named entity recognition. The RSS metadata gives you clean labels (source, topic, date) for free.
Content Curation: Power an automated newsletter or social media account with curated news summaries. The deduplication pipeline ensures you do not post the same story twice.
SEO Research: Monitor news coverage of specific keywords to find content gap opportunities — topics that are in the news but not well-covered by evergreen SEO content.
The Google News RSS approach is genuinely one of the most reliable and low-friction scraping targets: the feeds are public, require no authentication, and the rate limits are generous for moderate use. For scale, pair it with ThorData residential proxies for geo-targeted news collection across multiple countries.