How to Scrape Quora Questions and Answers in 2026 (Complete Python Guide)
Quora is a goldmine for researchers, content marketers, and NLP practitioners. The upvote system means community-validated signal is baked in — popular answers aren't just opinions, they're opinions that thousands of real humans endorsed with a click. The question volume across topics is enormous: millions of questions spanning every industry, profession, and human curiosity. For competitive research, NLP training data, content gap analysis, or understanding what your market is confused about, Quora is one of the richest publicly available Q&A datasets.
The problem is that scraping it is legitimately non-trivial. Quora in 2026 runs Cloudflare, aggressive bot fingerprinting, login walls that appear after a few page views, and a fully React-rendered frontend that naive HTTP scrapers can't parse at all. This guide covers the complete technical stack for pulling Quora data: questions, answers, upvote counts, user profiles, topic feeds, and related questions — with working code and real anti-detection strategies.
Why Playwright Is Non-Negotiable
The fundamental challenge with Quora is that it's a React SPA. When your browser first requests a Quora URL, the server returns an HTML shell that contains almost no content — just the app skeleton. The actual questions, answers, and vote counts are fetched asynchronously via GraphQL API calls that happen after the JavaScript executes.
If you try requests, httpx, urllib3, or even aiohttp, you get either an empty page or an immediate 403 before you see a single answer. Tools like mechanize or scrapy have the same problem — they don't execute JavaScript.
You need a real browser. Playwright is the right choice in 2026:
- Faster startup and more scriptable than Selenium
- Native async support with asyncio
- Direct integration with Chromium, Firefox, and WebKit
- Built-in network interception for capturing API responses
- Better maintained than Puppeteer for Python use cases
pip install playwright playwright-stealth
playwright install chromium
Use the async API. You'll almost certainly want concurrent workers to scrape at any useful rate, and async lets you run multiple browser contexts without threads.
Installation and Project Setup
# requirements.txt
# playwright>=1.44.0
# playwright-stealth>=1.0.6
import asyncio
import json
import random
import time
import re
from typing import Optional
from playwright.async_api import async_playwright, Page, BrowserContext
Verify your setup:
async def verify_setup():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://httpbin.org/user-agent")
        ua = await page.evaluate("() => navigator.userAgent")
        print(f"User agent: {ua}")
        await browser.close()

asyncio.run(verify_setup())
Stealth Configuration
Quora's bot detection checks for browser automation signals before serving content. Apply stealth patches to every new page before navigation:
from playwright_stealth import stealth_async
REALISTIC_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)
async def create_stealth_context(
    browser,
    proxy_config: Optional[dict] = None,
    locale: str = "en-US",
) -> BrowserContext:
    """Create a browser context with stealth configuration."""
    context_opts = {
        "user_agent": REALISTIC_UA,
        "viewport": {"width": 1366, "height": 768},
        "locale": locale,
        "timezone_id": "America/New_York",
        "color_scheme": "light",
        "accept_downloads": False,
    }
    if proxy_config:
        context_opts["proxy"] = proxy_config
    context = await browser.new_context(**context_opts)
    # Override automation fingerprints
    await context.add_init_script("""
        // Webdriver flag
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        // Plugin array (headless has none)
        const pluginData = [
            {name: 'PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai'},
            {name: 'Chrome PDF Viewer', filename: 'internal-pdf-viewer'},
            {name: 'Chromium PDF Viewer', filename: 'mhjfbmdgcfjbbpaeojofohoefgiehjai'},
            {name: 'Microsoft Edge PDF Viewer', filename: 'msedgepdf'},
            {name: 'WebKit built-in PDF', filename: 'webkit-fake-pdf-plugin'},
        ];
        Object.defineProperty(navigator, 'plugins', {
            get: () => {
                const arr = Array.from(pluginData);
                arr.length = pluginData.length;
                return arr;
            }
        });
        // Languages
        Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
        // Chrome runtime
        window.chrome = window.chrome || {
            runtime: {}, app: {isInstalled: false}
        };
        // Permissions API (headless returns different results)
        const originalQuery = window.navigator.permissions.query;
        window.navigator.permissions.query = (parameters) => (
            parameters.name === 'notifications' ?
                Promise.resolve({state: Notification.permission}) :
                originalQuery(parameters)
        );
    """)
    return context

async def new_stealth_page(context: BrowserContext) -> Page:
    """Create a new page with stealth patches applied."""
    page = await context.new_page()
    await stealth_async(page)
    return page
Scraping Q&A Pages
A Quora question URL looks like https://www.quora.com/What-is-the-best-way-to-learn-Python. Answers load dynamically as the page hydrates, so you must wait for content to appear before extracting:
async def scrape_question(
    url: str,
    max_answers: int = 20,
    proxy_config: Optional[dict] = None,
) -> dict:
    """
    Scrape a single Quora question page.
    Returns question text and a list of answers with upvotes.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await create_stealth_context(browser, proxy_config)
        page = await new_stealth_page(context)

        # Capture GraphQL responses for vote data
        graphql_data = []

        async def capture_graphql(response):
            if "graphql" in response.url.lower() or "api" in response.url.lower():
                try:
                    data = await response.json()
                    graphql_data.append(data)
                except Exception:
                    pass

        page.on("response", capture_graphql)

        # Navigate to the question
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await asyncio.sleep(random.uniform(2.5, 4.5))

        # Dismiss login modal (appears after ~3 page views in a session)
        await dismiss_login_modal(page)

        # Wait for answer container to appear
        try:
            await page.wait_for_selector(".q-box", timeout=8000)
        except Exception:
            pass  # Continue even if selector times out

        # Scroll to load more answers
        for _ in range(3):
            await page.evaluate("window.scrollBy(0, window.innerHeight * 2)")
            await asyncio.sleep(random.uniform(1.5, 2.5))

        # Extract question title
        question_text = ""
        for selector in ["h1.q-text", "h1", ".q-title", "[data-testid='question-title']"]:
            el = await page.query_selector(selector)
            if el:
                question_text = (await el.text_content() or "").strip()
                if question_text:
                    break

        # Extract answers
        answers = await extract_answers(page, max_answers)

        # Extract related questions
        related = await extract_related_questions(page)

        # Extract topic tags
        topics = await page.evaluate("""
            () => {
                const els = document.querySelectorAll('a[href*="/topic/"]');
                return [...new Set(Array.from(els).map(e => e.textContent.trim()))].filter(Boolean);
            }
        """)

        await browser.close()

        return {
            "url": url,
            "question": question_text,
            "answers": answers[:max_answers],
            "topics": topics[:10],
            "related_questions": related[:5],
            "graphql_responses_captured": len(graphql_data),
        }
async def dismiss_login_modal(page: Page):
    """Try to dismiss Quora's login modal if present."""
    dismiss_selectors = [
        "[aria-label='Close']",
        "button[data-functional-selector='close-button']",
        ".q-modal__close",
        "[class*='modal'] button[aria-label*='lose']",
    ]
    for selector in dismiss_selectors:
        try:
            btn = await page.query_selector(selector)
            if btn:
                await btn.click()
                await asyncio.sleep(random.uniform(0.5, 1.2))
                return True
        except Exception:
            continue
    return False
async def extract_answers(page: Page, max_answers: int = 20) -> list[dict]:
    """Extract answer cards from a loaded Quora page."""
    # Quora's class names partially rotate, so try multiple selector strategies
    answer_selectors = [
        ".Answer",
        "[class*='Answer_answer']",
        ".dom_annotate_question_answer_item",
        "[data-aid]",  # Quora uses data-aid on answer containers
    ]
    answer_elements = []
    for selector in answer_selectors:
        elements = await page.query_selector_all(selector)
        if elements:
            answer_elements = elements
            break
    if not answer_elements:
        # Fallback: extract all substantial text blocks
        return await extract_answers_fallback(page)
    answers = []
    for el in answer_elements[:max_answers + 5]:  # Grab extra, filter below
        try:
            answer = await extract_single_answer(el)
            if answer and len(answer.get("content", "")) >= 50:
                answers.append(answer)
        except Exception:
            continue
    return answers[:max_answers]
async def extract_single_answer(el) -> Optional[dict]:
    """Extract data from a single answer element."""
    # Author name
    author = ""
    author_selectors = [
        ".q-box .ui_profile_header",
        "[class*='author']",
        "a[href*='/profile/']",
        ".UserCredential",
    ]
    for sel in author_selectors:
        author_el = await el.query_selector(sel)
        if author_el:
            author = (await author_el.text_content() or "").strip()
            if author:
                break
    # Author credential/bio line
    credential = ""
    cred_el = await el.query_selector(".CredentialListItem, [class*='credential']")
    if cred_el:
        credential = (await cred_el.text_content() or "").strip()
    # Answer content
    content = ""
    content_selectors = [
        ".q-relative .q-text",
        "[class*='answer_content']",
        ".q-box.spacing_log_answer_content",
    ]
    for sel in content_selectors:
        content_el = await el.query_selector(sel)
        if content_el:
            content = (await content_el.inner_text() or "").strip()
            if len(content) >= 50:
                break
    if not content:
        # Get all text from the element, filter out nav/meta text
        raw_text = (await el.inner_text() or "").strip()
        # Remove short lines that are likely UI chrome
        lines = [line.strip() for line in raw_text.split("\n") if len(line.strip()) > 30]
        content = "\n".join(lines)
    # Upvote count
    upvotes = "0"
    vote_selectors = [
        "[class*='VoterCount']",
        "[class*='upvote'] span",
        "button[aria-label*='pvote'] span",
        ".q-text[class*='upvote']",
    ]
    for sel in vote_selectors:
        vote_el = await el.query_selector(sel)
        if vote_el:
            vote_text = (await vote_el.text_content() or "0").strip()
            if re.search(r'\d', vote_text):
                upvotes = vote_text
                break
    # Share count / views (sometimes available)
    views = ""
    views_el = await el.query_selector("[class*='views'], [class*='Views']")
    if views_el:
        views = (await views_el.text_content() or "").strip()
    # Timestamp
    timestamp = ""
    time_el = await el.query_selector("time, [class*='timestamp'], [datetime]")
    if time_el:
        timestamp = (
            await time_el.get_attribute("datetime") or
            await time_el.text_content() or ""
        ).strip()
    return {
        "author": author[:100] if author else "Anonymous",
        "credential": credential[:200],
        "content": content,
        "upvotes": upvotes,
        "views": views,
        "timestamp": timestamp,
    }
async def extract_answers_fallback(page: Page) -> list[dict]:
    """Fallback extraction when main selectors fail."""
    return await page.evaluate("""
        () => {
            // Find all substantial paragraph blocks
            const paras = document.querySelectorAll('p, .q-relative');
            const answers = [];
            let current = {};
            for (const el of paras) {
                const text = el.textContent.trim();
                if (text.length > 100) {
                    current.content = (current.content || '') + ' ' + text;
                }
                if (Object.keys(current).length > 0 && text.length > 80) {
                    if (!current.upvotes) current.upvotes = '0';
                    if (!current.author) current.author = 'Anonymous';
                    if (current.content && current.content.length > 150) {
                        answers.push({...current});
                        current = {};
                    }
                }
            }
            return answers.slice(0, 20);
        }
    """)
async def extract_related_questions(page: Page) -> list[dict]:
    """Extract related/similar questions from the sidebar."""
    return await page.evaluate("""
        () => {
            const links = document.querySelectorAll('a[href*="/"]');
            const related = [];
            for (const link of links) {
                const href = link.getAttribute('href') || '';
                const text = link.textContent.trim();
                // Quora question titles end with a question mark (the URL slug does not)
                if (href.startsWith('/') && text.endsWith('?') && text.length > 15) {
                    related.push({
                        title: text,
                        url: 'https://www.quora.com' + href,
                    });
                }
            }
            return related.slice(0, 10);
        }
    """)
Scraping User Profiles
Profile pages at quora.com/profile/Username expose bio text, credential lines, follower/following counts, answer counts, and sometimes educational/professional history. Useful for building author authority signals:
async def scrape_profile(
    username: str,
    include_recent_answers: bool = False,
    proxy_config: Optional[dict] = None,
) -> dict:
    """Scrape a Quora user profile."""
    url = f"https://www.quora.com/profile/{username}"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await create_stealth_context(browser, proxy_config)
        page = await new_stealth_page(context)
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await asyncio.sleep(random.uniform(2, 3.5))
        await dismiss_login_modal(page)

        profile = {"username": username, "url": url}

        # Display name
        name_el = await page.query_selector("h1, .ProfileNameAndSig, [class*='profile_name']")
        profile["display_name"] = (await name_el.text_content() or username).strip() if name_el else username

        # Bio/description
        bio_el = await page.query_selector(".ProfileAboutMe, .q-text.qu-dynamicFontSize--regular")
        profile["bio"] = (await bio_el.text_content() or "").strip() if bio_el else ""

        # Credentials (job title, education, etc.)
        cred_els = await page.query_selector_all(".CredentialListItem, [class*='credential']")
        profile["credentials"] = [
            (await el.text_content() or "").strip()
            for el in cred_els
            if (await el.text_content() or "").strip()
        ][:5]

        # Stats: followers, following, answers, questions
        stats = {}
        stat_links = await page.query_selector_all(
            "a[href*='followers'], a[href*='following'], a[href*='answers']"
        )
        for link in stat_links:
            href = await link.get_attribute("href") or ""
            text = (await link.text_content() or "").strip()
            if "followers" in href:
                stats["followers"] = text
            elif "following" in href:
                stats["following"] = text
            elif "answers" in href:
                stats["answers"] = text
        profile.update(stats)

        # Knows-about topics
        topic_els = await page.query_selector_all("a[href*='/topic/']")
        profile["known_for_topics"] = list({
            (await el.text_content() or "").strip()
            for el in topic_els
            if (await el.text_content() or "").strip()
        })[:10]

        # Recent answers (optional)
        if include_recent_answers:
            answer_links = await page.query_selector_all("a[href*='answer']")
            recent = []
            for link in answer_links[:5]:
                href = await link.get_attribute("href") or ""
                text = (await link.text_content() or "").strip()
                if href and text and len(text) > 20:
                    recent.append({"text_preview": text[:100], "url": href})
            profile["recent_answers"] = recent

        await browser.close()
        return profile
Topic Feeds and Infinite Scroll
Topic pages at quora.com/topic/Machine-Learning list questions tagged with that topic. The feed uses infinite scroll — scroll to the bottom and new questions appear:
async def scrape_topic_feed(
    topic: str,
    max_questions: int = 50,
    proxy_config: Optional[dict] = None,
) -> list[dict]:
    """
    Scrape questions from a Quora topic feed.
    Handles infinite scroll pagination.
    """
    url = f"https://www.quora.com/topic/{topic}"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await create_stealth_context(browser, proxy_config)
        page = await new_stealth_page(context)
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await asyncio.sleep(random.uniform(2, 3))
        await dismiss_login_modal(page)

        questions = set()
        last_count = 0

        # Scroll until we have enough or stop getting new content
        for scroll_attempt in range(20):
            # Extract current visible questions
            new_questions = await page.evaluate("""
                () => {
                    const links = document.querySelectorAll('a[href^="/"]');
                    const questions = [];
                    for (const link of links) {
                        const text = link.textContent.trim();
                        const href = link.getAttribute('href');
                        // Quora question slugs are long and descriptive
                        if (text.endsWith('?') && text.length > 20 && href && href.length > 10) {
                            questions.push({
                                title: text,
                                url: 'https://www.quora.com' + href,
                            });
                        }
                    }
                    return questions;
                }
            """)
            for q in new_questions:
                # Serialize with stable key order so set deduplication is reliable
                questions.add(json.dumps(q, sort_keys=True))
            if len(questions) >= max_questions:
                break
            if len(questions) == last_count and scroll_attempt > 3:
                # No new content loading
                break
            last_count = len(questions)
            # Scroll down
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(random.uniform(2.5, 4.5))

        await browser.close()

    # Parse back from JSON strings
    result = [json.loads(q) for q in questions]
    return result[:max_questions]
async def scrape_topic_with_answers(
    topic: str,
    max_questions: int = 10,
    proxy_config: Optional[dict] = None,
) -> list[dict]:
    """Scrape a topic feed, then scrape each question's top answers."""
    questions = await scrape_topic_feed(
        topic, max_questions=max_questions * 2, proxy_config=proxy_config
    )
    enriched = []
    for i, q in enumerate(questions[:max_questions]):
        print(f"  [{i+1}/{max_questions}] {q['title'][:60]}...")
        try:
            full = await scrape_question(q["url"], max_answers=3, proxy_config=proxy_config)
            enriched.append({
                **q,
                "top_answer": full["answers"][0] if full["answers"] else None,
                "answer_count": len(full["answers"]),
                "topics": full.get("topics", []),
            })
        except Exception as e:
            print(f"    Error: {e}")
            enriched.append({**q, "error": str(e)})
        # Delay between question scrapes
        await asyncio.sleep(random.uniform(4, 8))
    return enriched
Searching Quora (Search Page)
Quora's search at quora.com/search?q=... works similarly to topic pages:
from urllib.parse import quote_plus

async def search_quora(
    query: str,
    max_results: int = 30,
    content_type: str = "question",  # question | answer | profile | post
    proxy_config: Optional[dict] = None,
) -> list[dict]:
    """Search Quora and return matching questions."""
    # quote_plus handles spaces and special characters safely
    url = f"https://www.quora.com/search?q={quote_plus(query)}&type={content_type}"
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await create_stealth_context(browser, proxy_config)
        page = await new_stealth_page(context)
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await asyncio.sleep(random.uniform(2, 3.5))
        await dismiss_login_modal(page)

        # Scroll to load more results
        for _ in range(4):
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(random.uniform(2, 3))

        results = await page.evaluate("""
            () => {
                const links = document.querySelectorAll('a[href^="/"]');
                const seen = new Set();
                const items = [];
                for (const link of links) {
                    const href = link.getAttribute('href');
                    const text = link.textContent.trim();
                    if (!seen.has(href) && text.length > 20 && href.length > 5) {
                        seen.add(href);
                        items.push({
                            title: text,
                            url: 'https://www.quora.com' + href,
                        });
                    }
                }
                return items;
            }
        """)
        await browser.close()
    return [r for r in results if "?" in r["title"] or len(r["title"]) > 30][:max_results]
Handling Login Walls
Quora shows login/signup modals aggressively, particularly after 3-5 page views in a session. Three effective strategies:
Strategy 1: Dismiss the modal. The modal has a close button, but the aria-label and class names vary. Try multiple selectors:
async def aggressive_dismiss(page: Page, max_attempts: int = 3):
    """Try multiple approaches to dismiss the login modal."""
    for _ in range(max_attempts):
        dismissed = await dismiss_login_modal(page)
        if dismissed:
            return True
        # Press Escape key as fallback
        await page.keyboard.press("Escape")
        await asyncio.sleep(0.8)
        # Look for any overlay and click outside it
        try:
            overlay = await page.query_selector("[class*='overlay'], [class*='modal']")
            if overlay:
                bbox = await overlay.bounding_box()
                if bbox:
                    # Click outside the modal bounds
                    await page.mouse.click(10, 10)
        except Exception:
            pass
        await asyncio.sleep(1)
    return False
Strategy 2: Inject cookies from a logged-in session. Export cookies from a real browser session and load them into Playwright:
async def load_quora_session(context: BrowserContext, cookies_path: str):
    """Load saved Quora session cookies into a browser context."""
    with open(cookies_path) as f:
        cookies = json.load(f)
    # Ensure cookies have the fields Playwright requires
    cleaned = []
    for c in cookies:
        if c.get("name") and c.get("value") and "quora.com" in c.get("domain", ""):
            cleaned.append({
                "name": c["name"],
                "value": c["value"],
                "domain": c.get("domain", ".quora.com"),
                "path": c.get("path", "/"),
                "httpOnly": c.get("httpOnly", False),
                "secure": c.get("secure", True),
            })
    await context.add_cookies(cleaned)
    print(f"Loaded {len(cleaned)} Quora cookies")
Strategy 3: Use fresh contexts with short sessions. Keep each browser context under 3 question views before creating a new one. This prevents the modal trigger that fires after multiple page views.
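The rotation bookkeeping is simple enough to isolate. Below is a minimal sketch of that logic — `SessionBudget` is an illustrative name of mine, not a Playwright or Quora API; the threshold of 3 matches the modal trigger described above:

```python
class SessionBudget:
    """Track page views per browser context and signal when to rotate."""

    def __init__(self, max_views: int = 3):
        self.max_views = max_views
        self.views = 0

    def record_view(self) -> bool:
        """Record one page view; return True when the context should be replaced."""
        self.views += 1
        return self.views >= self.max_views

    def reset(self):
        """Call after swapping in a fresh context."""
        self.views = 0
```

In a scraping loop, call `record_view()` after each `page.goto()`; when it returns True, close the context, create a new stealth context (optionally with a new proxy), and call `reset()`.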
Proxy Configuration and Anti-Detection
The technical layers Quora deploys in 2026:
Cloudflare handles initial IP reputation checks. Datacenter IPs (AWS, GCP, Azure, DigitalOcean ranges) fail this check immediately and get the JS challenge or a soft block. Residential IPs pass.
Browser fingerprinting — covered by our create_stealth_context setup above. The key signals checked: navigator.webdriver, plugin array, Chrome runtime presence, Canvas/WebGL rendering characteristics.
Behavioral analysis — request timing, navigation patterns, mouse movement. Our random delays and session length limits address this.
TLS fingerprinting — Playwright using real Chromium passes this automatically since it has a real browser TLS stack.
For IP rotation, ThorData's residential proxy network works well with Playwright's built-in proxy configuration:
THORDATA_CONFIG = {
    "server": "http://proxy.thordata.com:9000",
    "username": "YOUR_THORDATA_USER",
    "password": "YOUR_THORDATA_PASS",
}

# For country-targeted access (e.g., US IP)
THORDATA_US = {
    "server": "http://proxy.thordata.com:9000",
    "username": "YOUR_THORDATA_USER-country-us",
    "password": "YOUR_THORDATA_PASS",
}

async def scrape_with_proxy(question_url: str) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=THORDATA_US,
        )
        context = await create_stealth_context(browser, THORDATA_US)
        page = await new_stealth_page(context)
        await page.goto(question_url, wait_until="networkidle")
        await asyncio.sleep(random.uniform(2.5, 4))
        await dismiss_login_modal(page)
        answers = await extract_answers(page)
        await browser.close()
        return answers
Rotate the proxy on each new browser context rather than per-request. Creating a new context with a new proxy IP gives you a fresh IP address, fresh cookies, and a clean behavioral fingerprint — far more convincing than switching IPs mid-session.
Rate limit: one page load per 3-8 seconds per IP. If running parallel workers, your proxy pool must be large enough to distribute the load. Ten workers at 1 request each per 3 seconds = ~200 req/min — you need at least 10 separate IPs cycling.
Session length: under 20 page views per context. Quora's modal trigger scales with session depth. Shorter sessions mean the modal appears less often and behavioral scoring has less data to work with.
Concurrent Scraping with Worker Pool
import asyncio
from asyncio import Queue

async def question_worker(
    worker_id: int,
    queue: Queue,
    results: list,
    proxy_config: Optional[dict] = None,
    max_answers: int = 10,
):
    """Worker coroutine that processes questions from a queue."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await create_stealth_context(browser, proxy_config)
        page_views = 0
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                break
            if page_views > 0 and page_views % 15 == 0:
                # Rotate context to reset session state
                await context.close()
                context = await create_stealth_context(browser, proxy_config)
                print(f"  Worker {worker_id}: rotated context at {page_views} views")
            try:
                page = await new_stealth_page(context)
                await page.goto(url, wait_until="networkidle", timeout=30000)
                await asyncio.sleep(random.uniform(2, 4))
                await dismiss_login_modal(page)
                question_text = ""
                h1 = await page.query_selector("h1")
                if h1:
                    question_text = (await h1.text_content() or "").strip()
                answers = await extract_answers(page, max_answers)
                await page.close()
                page_views += 1
                results.append({
                    "url": url,
                    "question": question_text,
                    "answers": answers,
                    "worker": worker_id,
                })
                queue.task_done()
                await asyncio.sleep(random.uniform(3, 6))
            except Exception as e:
                print(f"  Worker {worker_id} error on {url}: {e}")
                queue.task_done()
                await asyncio.sleep(5)
        await browser.close()

async def scrape_questions_parallel(
    urls: list[str],
    num_workers: int = 3,
    proxy_config: Optional[dict] = None,
) -> list[dict]:
    """Scrape multiple Quora questions concurrently."""
    queue = Queue()
    for url in urls:
        await queue.put(url)
    results = []
    workers = [
        question_worker(i, queue, results, proxy_config)
        for i in range(num_workers)
    ]
    await asyncio.gather(*workers)
    return results
Storing and Indexing Scraped Data
import sqlite3
from datetime import datetime, timezone

def init_quora_db(db_path: str = "quora_data.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS questions (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE NOT NULL,
            question_text TEXT,
            topic TEXT,
            scraped_at TEXT,
            answer_count INTEGER DEFAULT 0
        )
    """)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS answers (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            question_url TEXT NOT NULL,
            author TEXT,
            credential TEXT,
            content TEXT,
            upvotes TEXT,
            timestamp TEXT,
            scraped_at TEXT,
            FOREIGN KEY (question_url) REFERENCES questions(url)
        )
    """)
    conn.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS answers_fts
        USING fts5(content, question_url, author, tokenize='porter unicode61')
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_q_url ON questions(url)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_a_question ON answers(question_url)")
    conn.commit()
    return conn

def save_question_with_answers(
    conn: sqlite3.Connection,
    data: dict,
    topic: str = "",
):
    """Save a scraped Q&A to the database."""
    now = datetime.now(timezone.utc).isoformat()
    url = data["url"]
    conn.execute("""
        INSERT OR REPLACE INTO questions (url, question_text, topic, scraped_at, answer_count)
        VALUES (?, ?, ?, ?, ?)
    """, (url, data.get("question", ""), topic, now, len(data.get("answers", []))))
    for answer in data.get("answers", []):
        content = answer.get("content", "")
        if len(content) < 50:
            continue
        conn.execute("""
            INSERT INTO answers (question_url, author, credential, content, upvotes, timestamp, scraped_at)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """, (
            url, answer.get("author", ""),
            answer.get("credential", ""), content,
            answer.get("upvotes", "0"), answer.get("timestamp", ""), now,
        ))
        # Mirror into the FTS index
        conn.execute(
            "INSERT INTO answers_fts (content, question_url, author) VALUES (?, ?, ?)",
            (content, url, answer.get("author", ""))
        )
    conn.commit()

def search_answers(conn: sqlite3.Connection, query: str, limit: int = 20) -> list[dict]:
    """Full-text search across all scraped answers."""
    rows = conn.execute("""
        SELECT a.question_url, a.author, a.content, a.upvotes,
               q.question_text
        FROM answers_fts f
        JOIN answers a ON a.question_url = f.question_url AND a.content = f.content
        JOIN questions q ON q.url = a.question_url
        WHERE answers_fts MATCH ?
        ORDER BY rank
        LIMIT ?
    """, (query, limit)).fetchall()
    return [
        {
            "question": r[4], "url": r[0], "author": r[1],
            "content": r[2][:300], "upvotes": r[3],
        }
        for r in rows
    ]
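If you want to sanity-check the FTS5 query shape without scraping anything, here is a self-contained toy — in-memory database, made-up rows — exercising the same `MATCH` / `ORDER BY rank` pattern used above:

```python
import sqlite3

# In-memory DB with the same virtual-table shape as answers_fts above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE answers_fts USING fts5(content, question_url, author)")
conn.execute("INSERT INTO answers_fts VALUES (?, ?, ?)",
             ("Python decorators wrap functions cleanly", "/q1", "Alice"))
conn.execute("INSERT INTO answers_fts VALUES (?, ?, ?)",
             ("Use virtual environments for isolation", "/q2", "Bob"))

# MATCH restricts to rows containing the term; rank orders by relevance
rows = conn.execute(
    "SELECT question_url FROM answers_fts WHERE answers_fts MATCH ? ORDER BY rank",
    ("decorators",)
).fetchall()
# rows now holds only the matching answer's URL
```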
Complete Working Pipeline
import asyncio
import json
import random

async def run_quora_pipeline(
    topics: list[str],
    questions_per_topic: int = 20,
    answers_per_question: int = 5,
    db_path: str = "quora_data.db",
    proxy_config: Optional[dict] = None,
):
    """
    Full pipeline: topic feeds -> question scraping -> database storage.
    """
    db = init_quora_db(db_path)
    total_saved = 0
    for topic in topics:
        print(f"\nTopic: {topic}")
        print("  Fetching question list...")
        try:
            questions = await scrape_topic_feed(
                topic,
                max_questions=questions_per_topic * 2,
                proxy_config=proxy_config,
            )
        except Exception as e:
            print(f"  Feed failed: {e}")
            continue
        print(f"  Found {len(questions)} questions, scraping top {questions_per_topic}...")
        for i, q in enumerate(questions[:questions_per_topic]):
            print(f"  [{i+1}/{questions_per_topic}] {q['title'][:55]}...")
            try:
                data = await scrape_question(
                    q["url"],
                    max_answers=answers_per_question,
                    proxy_config=proxy_config,
                )
                save_question_with_answers(db, data, topic=topic)
                total_saved += 1
                print(f"    Saved {len(data['answers'])} answers")
            except Exception as e:
                print(f"    Error: {e}")
            # Rate limit: 4-10 seconds between questions
            await asyncio.sleep(random.uniform(4, 10))
        print(f"  Topic '{topic}' complete. Total saved: {total_saved}")

    # Summary
    row = db.execute("SELECT COUNT(*) FROM questions").fetchone()
    ans_row = db.execute("SELECT COUNT(*) FROM answers").fetchone()
    print(f"\nDatabase: {row[0]} questions, {ans_row[0]} answers")
    db.close()

# Run it
if __name__ == "__main__":
    asyncio.run(run_quora_pipeline(
        topics=["Machine-Learning", "Python-programming-language", "Startups"],
        questions_per_topic=15,
        answers_per_question=5,
        proxy_config=THORDATA_US,
    ))
Common Gotchas
Quora changes class names frequently. They don't use human-readable class names — they're compiled and rotated on each deploy. The selectors in this guide use multiple fallbacks for exactly this reason. When things break, inspect the live DOM and look for structural patterns (data-* attributes, element hierarchy) rather than specific class names.
Login walls are session-scoped. A fresh browser context resets the session counter. If you're seeing modals constantly, your context is too old. Rotate more aggressively.
Upvote numbers are display strings, not integers. Quora displays "2.3K upvotes" not "2300". Parse these with: int(float(s.replace('K','')) * 1000) if 'K' in s else int(s.replace(',', '')).
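That one-liner assumes a clean string like "2.3K"; raw scrapes often carry trailing text ("2.3K upvotes"), commas, or an "M" suffix. A more forgiving parser (a sketch — the function name is mine):

```python
import re

def parse_count(s: str) -> int:
    """Convert display counts like '2.3K upvotes', '1,204', or '1.1M' to integers.

    Ignores surrounding text; returns 0 when no number is found.
    """
    m = re.search(r'([\d.,]+)\s*([KkMm]?)', s)
    if not m:
        return 0
    num = float(m.group(1).replace(',', ''))
    suffix = m.group(2).upper()
    if suffix == 'K':
        num *= 1_000
    elif suffix == 'M':
        num *= 1_000_000
    return int(num)

parse_count("2.3K upvotes")  # → 2300
parse_count("1,204")         # → 1204
```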
Some questions are behind a paywall ("Quora+"). These render a blurred preview, so your content extraction will return very short or empty strings. Filter by minimum content length (>100 chars).
Answers with collapsed "more" sections. Long answers have a "Continue Reading" button. If you need full answer text, click it before extracting content:
async def expand_answers(page: Page):
    """Click all 'Continue Reading' / 'more' buttons to expand answers."""
    expand_selectors = [
        "button[class*='expand']",
        "span[class*='more']",
        "a.continue_reading",
        "[data-functional-selector='expand-answer']",
    ]
    for sel in expand_selectors:
        buttons = await page.query_selector_all(sel)
        for btn in buttons:
            try:
                await btn.click()
                await asyncio.sleep(0.3)
            except Exception:
                pass
Ethics and Legal Considerations
Quora's Terms of Service prohibit automated scraping. The 2022 hiQ v. LinkedIn ruling established that scraping publicly accessible data doesn't automatically violate the Computer Fraud and Abuse Act — but that's a US legal standard, and it applies to the criminal statute, not to Quora's right to ban your IP.
Practical guidelines for responsible use:
- Rate limit aggressively. 1 request per 3-8 seconds per IP is respectful.
- Don't scrape behind login walls — stick to public questions and answers.
- For commercial products built on Quora data, the risk-to-benefit math probably doesn't favor scraping. Contact Quora about data licensing.
- For research, NLP training data, and personal analysis, small-scale scraping is standard practice. Keep session lengths short, don't hammer their servers, and you'll be fine operationally.
- Don't republish scraped answers verbatim in public-facing content. Attribution and transformation are your friends legally and ethically.
The techniques here work reliably in 2026. Use them responsibly.