Scraping AliExpress Products Without Getting Blocked (2026)
AliExpress is one of the most challenging e-commerce platforms to scrape at scale. Owned by Alibaba Group, the platform hosts hundreds of millions of product listings from sellers worldwide, making it a critical data source for competitive price intelligence, dropshipping research, supplier discovery, product trend analysis, and marketplace analytics. Yet getting that data reliably requires understanding — and systematically defeating — a multi-layered anti-bot defense that Alibaba has been refining for years.
The challenge with AliExpress is not just one problem; it is three interlocking problems. First, most product data is served through JavaScript execution rather than static HTML. A standard HTTP request for a product page returns mostly empty placeholder elements — the actual price, variant information, seller rating, and shipping cost all get populated after JavaScript runs and makes additional API calls to Alibaba's backend. This means any scraper that cannot execute JavaScript will receive incomplete or useless data.
Second, AliExpress uses device fingerprinting that goes far beyond IP-based blocking. It analyzes the TLS handshake, HTTP/2 frame ordering, browser API behavior, canvas fingerprint, WebGL renderer, audio context characteristics, and dozens of other signals to determine whether a visitor is a real Chrome browser or an automated tool. Even well-configured headless browsers fail this check unless they have been explicitly hardened against fingerprint detection.
Third, the site rate-limits aggressively by multiple independent signals simultaneously. Your IP address is one dimension. Your cookie session is another. The frequency of requests from a given device fingerprint is a third. You can rotate IPs and still get blocked because your cookie session has been flagged. You can clear cookies and rotate IPs and still get blocked because your TLS fingerprint matches known scraper tooling. A production AliExpress scraper has to address all three dimensions simultaneously to maintain high success rates over time.
This guide covers the complete technical approach: Playwright-based scraping with stealth configuration, extraction from the window.__INIT_DATA__ object for structured data, residential proxy rotation with ThorData, CAPTCHA handling strategies, retry logic, and five complete use cases with working Python code.
Why Standard Requests-Based Scraping Fails
Before covering what works, it is worth understanding exactly why the naive approach fails so you can anticipate the same failure modes in your own code.
A bare requests.get("https://www.aliexpress.com/item/...") call returns the page skeleton — the HTML document loaded before JavaScript runs. The product title might be there as an H1 tag, but the price, variant options, stock count, and seller information are missing. They live inside <script> tags as JavaScript initialization data, or they are fetched from Alibaba's API after page load. Without executing that JavaScript, you get maybe 20% of the data you need.
Even if you switch to a headless browser and execute the JavaScript, you immediately hit fingerprinting. Playwright and Puppeteer in their default configurations have known fingerprints that AliExpress detects reliably. The navigator.webdriver property is set to true, the Chrome automation extension is visible in navigator.plugins, and the headless browser's canvas rendering differs from a real GPU-accelerated browser in ways that anti-bot systems have learned to detect.
With the fingerprinting problem solved, you then hit rate limiting. AliExpress will serve you product pages successfully for the first 15-30 requests in a session, then silently degrade the response — returning old cached data, or redirecting you to a simplified "lite" version of the page that lacks the structured data objects. After another 20-30 requests it escalates to CAPTCHA challenges, and eventually to outright 403 blocks on the IP.
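One practical consequence: a scraper should validate each response before trusting it, because a degraded "lite" page still comes back as HTTP 200. A minimal heuristic sketch, assuming the structured-data variable names covered later in this guide (window.__INIT_DATA__ and window.runParams):

```python
def looks_degraded(html: str) -> bool:
    """Heuristic check: a full AliExpress product page carries one of the
    structured-data objects; the simplified "lite" variant does not.
    The marker names are assumptions and may shift with page layouts."""
    markers = ("window.__INIT_DATA__", "window.runParams")
    return not any(marker in html for marker in markers)
```

If this returns True, parsing is pointless: retire the session and retry with a fresh IP rather than store stale data.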
The solution requires stacking multiple mitigations: browser stealth patching (via playwright-stealth or rebrowser-patches), residential proxy rotation, realistic request timing, session cookie management, and graceful degradation when blocks do occur.
Setting Up Playwright with Stealth
Install the required packages:
pip install playwright playwright-stealth httpx beautifulsoup4 tenacity
playwright install chromium
The playwright-stealth library patches the most common fingerprinting signals that get headless browsers detected:
import asyncio
import json
import random
import time
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout
from playwright_stealth import stealth_async
STEALTH_USER_AGENTS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36",
]
async def create_stealthy_browser(proxy_url: str = None):
"""Create a Playwright browser with stealth patches applied."""
playwright = await async_playwright().start()
launch_args = {
"headless": True,
"args": [
"--disable-blink-features=AutomationControlled",
"--disable-dev-shm-usage",
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-infobars",
"--window-position=0,0",
            "--ignore-certificate-errors",
            "--ignore-certificate-errors-spki-list",
"--disable-extensions",
],
}
if proxy_url:
launch_args["proxy"] = {"server": proxy_url}
browser = await playwright.chromium.launch(**launch_args)
ua = random.choice(STEALTH_USER_AGENTS)
context = await browser.new_context(
user_agent=ua,
viewport={"width": 1366, "height": 768},
locale="en-US",
timezone_id="America/New_York",
geolocation={"longitude": -73.935242, "latitude": 40.730610},
permissions=["geolocation"],
extra_http_headers={
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
},
)
page = await context.new_page()
# Apply stealth patches — removes webdriver signals, patches navigator, etc.
await stealth_async(page)
# Additional manual patches for AliExpress specifically
await page.add_init_script("""
// Override the plugins length to appear non-headless
Object.defineProperty(navigator, 'plugins', {
get: () => [1, 2, 3, 4, 5],
});
// Override languages
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en'],
});
// Prevent WebGL fingerprinting
const getParameter = WebGLRenderingContext.prototype.getParameter;
WebGLRenderingContext.prototype.getParameter = function(parameter) {
if (parameter === 37445) {
return 'Intel Open Source Technology Center';
}
if (parameter === 37446) {
return 'Mesa DRI Intel(R) Ivybridge Mobile';
}
return getParameter.call(this, parameter);
};
""")
return playwright, browser, context, page
Extracting Product Data from window.__INIT_DATA__
AliExpress embeds a large JSON object called window.__INIT_DATA__ in every product page. This is the canonical data source — it contains the full product record including all SKU variants, pricing tiers, seller information, shipping options, review aggregates, and promotion data. Extracting this object is far more reliable than parsing DOM elements, which change layout frequently.
async def scrape_aliexpress_product(url: str, proxy_url: str = None) -> dict:
"""
Scrape a single AliExpress product page and return structured data.
Returns a dict with keys: title, price, original_price, discount,
sold_count, rating, review_count, shipping, seller, sku_data, images
"""
playwright, browser, context, page = await create_stealthy_browser(proxy_url)
result = {}
try:
# Warm up with a visit to the homepage first — reduces detection
await page.goto("https://www.aliexpress.com/", wait_until="domcontentloaded", timeout=20000)
await asyncio.sleep(random.uniform(1.5, 3.0))
# Navigate to the product page
await page.goto(url, wait_until="networkidle", timeout=45000)
# Check if we got a CAPTCHA or block page
page_title = await page.title()
if any(word in page_title.lower() for word in ["captcha", "verify", "blocked", "access denied"]):
raise Exception(f"Bot detection triggered: {page_title}")
# Wait for product title to confirm page loaded
try:
await page.wait_for_selector("h1.product-title-text", timeout=15000)
except PlaywrightTimeout:
# Try alternative selectors used on different page layouts
await page.wait_for_selector("[data-pl='product-title']", timeout=10000)
# Simulate human-like behavior: scroll down slightly
await page.evaluate("window.scrollBy(0, 300)")
await asyncio.sleep(random.uniform(0.5, 1.5))
# Extract window.__INIT_DATA__
init_data_raw = await page.evaluate("""
() => {
try {
if (window.__INIT_DATA__) {
return JSON.stringify(window.__INIT_DATA__);
}
// Some pages use a different variable name
if (window.runParams) {
return JSON.stringify(window.runParams);
}
return null;
} catch(e) {
return null;
}
}
""")
if init_data_raw:
init_data = json.loads(init_data_raw)
result = parse_init_data(init_data)
# Fall back to DOM extraction if __INIT_DATA__ parsing failed
        if not result.get("title"):
            # query_selector returns None instead of raising, unlike text_content
            for selector in ("h1.product-title-text", "[data-pl='product-title']"):
                el = await page.query_selector(selector)
                if el:
                    result["title"] = ((await el.text_content()) or "").strip()
                    break
if not result.get("price"):
price_el = await page.query_selector(".product-price-value")
if price_el:
result["price"] = await price_el.text_content()
# Get product images
images = await page.evaluate("""
() => {
const imgs = document.querySelectorAll('.slider-image img, .product-image img');
return Array.from(imgs).map(img => img.src || img.dataset.src).filter(Boolean);
}
""")
        result["images"] = list(dict.fromkeys(images))[:10]  # deduplicate (order-preserving), cap at 10
result["url"] = url
result["scraped_at"] = time.time()
finally:
await browser.close()
await playwright.stop()
return result
def parse_init_data(data: dict) -> dict:
"""Parse the window.__INIT_DATA__ structure into a clean product dict."""
result = {}
# The data key contains the main product info
product_data = data.get("data", data) # some pages omit the outer "data" wrapper
# Product info component
info = product_data.get("productInfoComponent", {})
result["title"] = info.get("subject", "")
result["item_id"] = info.get("productId", "")
# Price component
price = product_data.get("priceComponent", {})
result["price"] = price.get("formatedActivityPrice", "") or price.get("formatedPrice", "")
result["original_price"] = price.get("formatedPrice", "")
result["discount"] = price.get("discount", "")
# Seller / trade component
trade = product_data.get("tradeComponent", {})
result["sold_count"] = trade.get("formatTradeCount", "")
# Review component
review = product_data.get("reviewComponent", {})
result["rating"] = review.get("averageStar", "")
result["review_count"] = review.get("totalValidNum", 0)
# Shipping component
shipping = product_data.get("shippingComponent", {})
dynamic = shipping.get("shippingInfo", {}).get("shippingList", [])
if dynamic:
first = dynamic[0]
result["shipping"] = {
"method": first.get("serviceName", ""),
"price": first.get("freightAmount", {}).get("formatedAmount", ""),
"delivery_days": first.get("deliveryDayMax", ""),
}
# Seller component
seller = product_data.get("sellerComponent", {})
result["seller"] = {
"name": seller.get("storeName", ""),
"id": seller.get("storeNum", ""),
"positive_feedback": seller.get("positiveRate", ""),
"followers": seller.get("followingNumber", 0),
}
# SKU data — all variant combinations and their prices
sku = product_data.get("skuComponent", {})
result["sku_data"] = {
"props": sku.get("productSKUPropertyList", []),
"price_list": sku.get("skuPriceList", []),
}
return result
Proxy Rotation with ThorData
Residential proxies are non-negotiable for AliExpress scraping at any volume beyond a few dozen requests. AliExpress blocks datacenter IP ranges (AWS, GCP, DigitalOcean, Linode, etc.) aggressively. A residential proxy routes your requests through real ISP-assigned IP addresses, making them indistinguishable from organic traffic at the network layer.
ThorData provides rotating residential proxies with geo-targeting support. For AliExpress, geo-targeting matters: a US-based IP scraping AliExpress sees US pricing and USD displays, which simplifies price normalization in your data pipeline.
import asyncio
import random
import time
from dataclasses import dataclass
from typing import Optional
@dataclass
class ProxyConfig:
host: str
port: int
username: str
password: str
country: str = "US"
def to_url(self) -> str:
return f"http://{self.username}:{self.password}@{self.host}:{self.port}"
def to_playwright_proxy(self) -> dict:
return {
"server": f"http://{self.host}:{self.port}",
"username": self.username,
"password": self.password,
}
class ProxyRotator:
"""Manages a pool of proxies with failure tracking and rotation."""
def __init__(self, proxies: list[ProxyConfig]):
self.proxies = proxies
self.failure_counts: dict[str, int] = {}
self.cooldown_until: dict[str, float] = {}
self.MAX_FAILURES = 3
self.COOLDOWN_SECONDS = 300 # 5 minutes
def get_proxy(self) -> Optional[ProxyConfig]:
"""Get a healthy proxy, avoiding recently failed ones."""
now = time.time()
available = [
p for p in self.proxies
if self.failure_counts.get(p.host, 0) < self.MAX_FAILURES
and self.cooldown_until.get(p.host, 0) < now
]
if not available:
            # Pool exhausted: reset tracking rather than stall the scraper
self.failure_counts.clear()
self.cooldown_until.clear()
available = self.proxies
return random.choice(available) if available else None
def mark_failure(self, proxy: ProxyConfig):
"""Record a failure for a proxy; put it in cooldown after threshold."""
count = self.failure_counts.get(proxy.host, 0) + 1
self.failure_counts[proxy.host] = count
if count >= self.MAX_FAILURES:
self.cooldown_until[proxy.host] = time.time() + self.COOLDOWN_SECONDS
print(f"Proxy {proxy.host} in cooldown for {self.COOLDOWN_SECONDS}s")
def mark_success(self, proxy: ProxyConfig):
"""Reset failure count on success."""
self.failure_counts[proxy.host] = 0
# ThorData rotating residential proxy setup
# Replace with your ThorData credentials from https://thordata.partnerstack.com/partner/0a0x4nzh
THORDATA_PROXIES = [
ProxyConfig(
host="rotating.thordata.net",
port=9080,
username="your_username-country-US",
password="your_password",
country="US",
),
ProxyConfig(
host="rotating.thordata.net",
port=9080,
username="your_username-country-GB",
password="your_password",
country="GB",
),
ProxyConfig(
host="rotating.thordata.net",
port=9080,
username="your_username-country-DE",
password="your_password",
country="DE",
),
]
rotator = ProxyRotator(THORDATA_PROXIES)
async def scrape_with_rotation(urls: list[str], max_concurrent: int = 3) -> list[dict]:
"""Scrape multiple AliExpress URLs with proxy rotation."""
semaphore = asyncio.Semaphore(max_concurrent)
async def scrape_one(url: str) -> dict:
async with semaphore:
proxy = rotator.get_proxy()
proxy_url = proxy.to_url() if proxy else None
# Add jitter to avoid synchronized request patterns
await asyncio.sleep(random.uniform(1.0, 4.0))
try:
result = await scrape_aliexpress_product(url, proxy_url=proxy_url)
if proxy:
rotator.mark_success(proxy)
return result
except Exception as e:
print(f"Failed {url}: {e}")
if proxy:
rotator.mark_failure(proxy)
return {"url": url, "error": str(e)}
tasks = [scrape_one(url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
return [r for r in results if isinstance(r, dict)]
CAPTCHA Handling
AliExpress uses two types of CAPTCHA challenges: a slider CAPTCHA (drag the piece to complete the puzzle) and an image selection CAPTCHA. Both are served as interstitial pages before or instead of the product content.
The most reliable strategy is avoidance — slowing down requests and using residential proxies reduces the CAPTCHA encounter rate dramatically. But when you do hit one, you have two options: route through a CAPTCHA solving service, or skip and retry with a fresh session and IP.
import httpx
async def solve_captcha_2captcha(page, api_key: str) -> bool:
"""
Attempt to solve an AliExpress slider CAPTCHA using 2captcha.
Returns True if solved, False if failed or no CAPTCHA found.
"""
# Check if we're on a CAPTCHA page
captcha_frame = await page.query_selector("iframe[src*='captcha']")
if not captcha_frame:
return True # No CAPTCHA present
# Get the CAPTCHA challenge image URL
challenge_url = await page.evaluate("""
() => {
const img = document.querySelector('.captcha-img, [class*="captcha"] img');
return img ? img.src : null;
}
""")
if not challenge_url:
return False
    # Download the challenge image, then submit it to 2captcha as base64
    # (method=base64 requires the image in the "body" field)
    import base64  # stdlib; kept local so the snippet stands alone
    async with httpx.AsyncClient() as client:
        img_resp = await client.get(challenge_url)
        img_b64 = base64.b64encode(img_resp.content).decode()
        submit_resp = await client.post(
            "http://2captcha.com/in.php",
            data={
                "key": api_key,
                "method": "base64",
                "body": img_b64,
                "json": 1,
                "type": "rotate",  # or "slider" depending on CAPTCHA type
            },
        )
task_data = submit_resp.json()
if task_data.get("status") != 1:
return False
task_id = task_data["request"]
# Poll for result (2captcha typically takes 5-20 seconds)
for _ in range(20):
await asyncio.sleep(3)
result_resp = await client.get(
f"http://2captcha.com/res.php?key={api_key}&action=get&id={task_id}&json=1"
)
result_data = result_resp.json()
if result_data.get("status") == 1:
# Got the solution
angle = float(result_data["request"])
# Apply the rotation to the slider
slider = await page.query_selector(".captcha-slider, [class*='slider']")
if slider:
box = await slider.bounding_box()
await page.mouse.move(box["x"] + box["width"] / 2, box["y"] + box["height"] / 2)
await page.mouse.down()
# Move proportionally to the angle value
await page.mouse.move(
box["x"] + (angle / 360) * 280,
box["y"] + box["height"] / 2,
steps=15,
)
await page.mouse.up()
await asyncio.sleep(1.5)
return True
elif result_data.get("request") == "ERROR_CAPTCHA_UNSOLVABLE":
return False
return False
async def handle_captcha_or_skip(page, proxy_rotator: ProxyRotator, current_proxy: ProxyConfig) -> bool:
"""
Check for CAPTCHA and either solve it or mark proxy as failed and signal retry.
Returns True to continue scraping, False to trigger a retry with fresh session.
"""
is_captcha = await page.evaluate("""
() => {
const indicators = ['captcha', 'verify', 'robot', 'challenge'];
const text = document.body.innerText.toLowerCase();
return indicators.some(ind => text.includes(ind));
}
""")
if is_captcha:
print("CAPTCHA detected — retiring current session")
if current_proxy:
proxy_rotator.mark_failure(current_proxy)
return False
return True
Rate Limiting and Request Timing
Beyond proxies, request timing is one of the most impactful levers for avoiding detection. Real users do not make requests at perfectly regular intervals. They read the page for several seconds, maybe scroll, click around, then navigate to the next product. Mimicking this behavior pattern dramatically reduces bot detection scores.
import asyncio
import random
import time
class RateLimiter:
"""Token bucket rate limiter with jitter for human-like request pacing."""
def __init__(self, requests_per_minute: float = 10.0):
self.min_interval = 60.0 / requests_per_minute
self.last_request_time = 0.0
self.jitter_factor = 0.3 # ±30% randomness
async def wait(self):
"""Wait the appropriate time before making the next request."""
now = time.time()
elapsed = now - self.last_request_time
base_wait = max(0, self.min_interval - elapsed)
# Add random jitter: ±30% of the base interval
jitter = base_wait * self.jitter_factor * (random.random() * 2 - 1)
total_wait = max(0, base_wait + jitter)
if total_wait > 0:
await asyncio.sleep(total_wait)
self.last_request_time = time.time()
async def simulate_human_page_interaction(page):
"""Simulate human-like browsing behavior on a product page."""
# Random scroll pattern
scroll_positions = [300, 600, 400, 800, 500]
for pos in scroll_positions:
await page.evaluate(f"window.scrollTo(0, {pos + random.randint(-50, 50)})")
await asyncio.sleep(random.uniform(0.3, 0.8))
# Sometimes move the mouse around
if random.random() > 0.5:
for _ in range(random.randint(2, 5)):
x = random.randint(200, 1100)
y = random.randint(100, 600)
await page.mouse.move(x, y)
await asyncio.sleep(random.uniform(0.1, 0.4))
# Occasionally "read" the page for a few seconds
read_time = random.uniform(2.0, 6.0)
await asyncio.sleep(read_time)
Use Case 1: Price Intelligence for Dropshipping
Dropshippers need to monitor supplier prices and update their storefronts when AliExpress prices change. This scraper polls a list of tracked products and writes changes to a SQLite database.
import sqlite3
import json
import asyncio
from datetime import datetime
def setup_price_db(db_path: str = "aliexpress_prices.db"):
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS price_history (
id INTEGER PRIMARY KEY AUTOINCREMENT,
item_id TEXT NOT NULL,
url TEXT NOT NULL,
title TEXT,
price TEXT,
original_price TEXT,
discount TEXT,
sold_count TEXT,
rating TEXT,
scraped_at REAL,
created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_item_id ON price_history(item_id)")
conn.commit()
return conn
async def monitor_prices(tracked_urls: list[str], db_path: str = "aliexpress_prices.db"):
"""Scrape tracked products and store price history. Run on a schedule (e.g., every 6 hours)."""
conn = setup_price_db(db_path)
rate_limiter = RateLimiter(requests_per_minute=8)
print(f"Monitoring {len(tracked_urls)} products...")
for url in tracked_urls:
await rate_limiter.wait()
proxy = rotator.get_proxy()
try:
data = await scrape_aliexpress_product(url, proxy_url=proxy.to_url() if proxy else None)
if data.get("price"):
conn.execute("""
INSERT INTO price_history
(item_id, url, title, price, original_price, discount, sold_count, rating, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
""", (
data.get("item_id", ""),
url,
data.get("title", ""),
data.get("price", ""),
data.get("original_price", ""),
data.get("discount", ""),
data.get("sold_count", ""),
data.get("rating", ""),
data.get("scraped_at", 0),
))
conn.commit()
print(f"Saved: {data.get('title', url)[:60]} @ {data.get('price', 'N/A')}")
if proxy:
rotator.mark_success(proxy)
except Exception as e:
print(f"Error on {url}: {e}")
if proxy:
rotator.mark_failure(proxy)
conn.close()
print("Price monitoring run complete")
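Once a few runs have accumulated, a companion helper can surface the actual price movements. This is a hypothetical addition built on the price_history table above, not part of the monitoring loop itself:

```python
import sqlite3

def detect_price_changes(conn: sqlite3.Connection) -> list[dict]:
    """Compare the two most recent price rows per item in price_history
    and return the items whose recorded price string changed."""
    rows = conn.execute("""
        SELECT item_id, price FROM price_history
        ORDER BY item_id, scraped_at DESC
    """).fetchall()
    history: dict[str, list[str]] = {}
    for item_id, price in rows:
        history.setdefault(item_id, []).append(price)
    return [
        {"item_id": item_id, "new_price": prices[0], "old_price": prices[1]}
        for item_id, prices in history.items()
        if len(prices) >= 2 and prices[0] != prices[1]
    ]
```

The output feeds naturally into a storefront repricing job or an alerting hook.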
Use Case 2: Category-Level Product Discovery
Scraping AliExpress search results to discover products in a category — useful for market research, finding bestsellers, or building comparison databases.
async def scrape_category(
category_url: str,
max_pages: int = 5,
proxy_url: str = None,
) -> list[dict]:
"""
Scrape product listings from an AliExpress category or search results page.
Returns list of product summaries (without full detail data).
"""
all_products = []
playwright, browser, context, page = await create_stealthy_browser(proxy_url)
try:
for page_num in range(1, max_pages + 1):
# Add page number to URL
if "?" in category_url:
url = f"{category_url}&page={page_num}"
else:
url = f"{category_url}?page={page_num}"
await page.goto(url, wait_until="networkidle", timeout=45000)
await asyncio.sleep(random.uniform(2.0, 4.0))
            # Try structured listing data first (search pages expose window.__DATA__)
listing_data = await page.evaluate("""
() => {
// AliExpress search pages sometimes expose window.__DATA__
if (window.__DATA__) return JSON.stringify(window.__DATA__);
return null;
}
""")
products_on_page = []
if listing_data:
data = json.loads(listing_data)
items = (data.get("data", {})
.get("itemList", {})
.get("content", []))
for item in items:
products_on_page.append({
"item_id": item.get("itemId", ""),
"title": item.get("title", {}).get("displayTitle", ""),
"price": item.get("prices", {}).get("salePrice", {}).get("formattedPrice", ""),
"rating": item.get("evaluation", {}).get("starRating", ""),
"sold_count": item.get("trade", {}).get("tradeDesc", ""),
"url": f"https://www.aliexpress.com/item/{item.get('itemId', '')}.html",
"image": item.get("image", {}).get("imgUrl", ""),
})
# Fall back to DOM parsing if structured data not available
if not products_on_page:
cards = await page.query_selector_all("[class*='product-snippet'], [class*='list--item']")
for card in cards:
try:
title_el = await card.query_selector("[class*='title']")
price_el = await card.query_selector("[class*='price']")
link_el = await card.query_selector("a[href*='/item/']")
                    title = ((await title_el.text_content()) or "").strip() if title_el else ""
                    price = ((await price_el.text_content()) or "").strip() if price_el else ""
link = await link_el.get_attribute("href") if link_el else ""
if title:
products_on_page.append({
"title": title,
"price": price,
"url": link,
})
except Exception:
continue
print(f"Page {page_num}: found {len(products_on_page)} products")
all_products.extend(products_on_page)
# Simulate reading the page before going to next
await simulate_human_page_interaction(page)
finally:
await browser.close()
await playwright.stop()
return all_products
Use Case 3: Seller Store Scraping
Research a specific AliExpress seller's complete product catalog — useful for supplier vetting or competitive intelligence on a specific seller.
async def scrape_seller_store(store_url: str, proxy_url: str = None) -> dict:
"""
Scrape all products from an AliExpress seller's store.
store_url example: https://www.aliexpress.com/store/12345678
"""
playwright, browser, context, page = await create_stealthy_browser(proxy_url)
store_data = {"url": store_url, "products": [], "seller_info": {}}
try:
await page.goto(store_url, wait_until="networkidle", timeout=45000)
await asyncio.sleep(random.uniform(2.0, 3.0))
# Extract seller info from the store header
store_name_el = await page.query_selector("[class*='store-name'], .shop-name")
if store_name_el:
store_data["seller_info"]["name"] = await store_name_el.text_content()
feedback_el = await page.query_selector("[class*='feedback'], [class*='score']")
if feedback_el:
store_data["seller_info"]["feedback"] = await feedback_el.text_content()
# Scroll through and collect product listings
last_count = 0
scroll_attempts = 0
max_scroll_attempts = 20
while scroll_attempts < max_scroll_attempts:
# Scroll to bottom to trigger infinite scroll loading
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
await asyncio.sleep(random.uniform(1.5, 3.0))
# Count current products
products = await page.query_selector_all("[class*='product-item'], [class*='item-card']")
current_count = len(products)
if current_count == last_count:
break # No new products loaded
last_count = current_count
scroll_attempts += 1
print(f"Store scroll {scroll_attempts}: {current_count} products visible")
# Extract all product data from the final DOM state
products_data = await page.evaluate("""
() => {
const cards = document.querySelectorAll('[class*="product-item"], [class*="item-card"]');
return Array.from(cards).map(card => {
const titleEl = card.querySelector('[class*="title"]');
const priceEl = card.querySelector('[class*="price"]');
const linkEl = card.querySelector('a[href*="/item/"]');
const imgEl = card.querySelector('img');
return {
title: titleEl ? titleEl.textContent.trim() : '',
price: priceEl ? priceEl.textContent.trim() : '',
url: linkEl ? linkEl.href : '',
image: imgEl ? (imgEl.src || imgEl.dataset.src) : '',
};
}).filter(p => p.title);
}
""")
store_data["products"] = products_data
store_data["total_products"] = len(products_data)
finally:
await browser.close()
await playwright.stop()
return store_data
Use Case 4: Bulk Product Enrichment Pipeline
When you have a list of AliExpress item IDs (e.g., from a CSV export or API response), enrich them with full product detail data including SKU matrix, shipping options, and seller ratings.
import csv
import json
async def enrich_product_ids(
item_ids: list[str],
output_file: str = "enriched_products.jsonl",
workers: int = 3,
) -> int:
"""
Take a list of AliExpress item IDs and fetch full product data for each.
Writes results as JSON lines to output_file. Returns count of successes.
Skips IDs that already appear in the output file (resumable).
"""
# Load already-processed IDs to support resuming
processed_ids = set()
try:
with open(output_file, "r") as f:
for line in f:
try:
record = json.loads(line)
if record.get("item_id"):
processed_ids.add(record["item_id"])
except json.JSONDecodeError:
pass
except FileNotFoundError:
pass
remaining = [id_ for id_ in item_ids if id_ not in processed_ids]
print(f"{len(remaining)} items to process ({len(processed_ids)} already done)")
semaphore = asyncio.Semaphore(workers)
rate_limiter = RateLimiter(requests_per_minute=10)
success_count = 0
outfile = open(output_file, "a")
async def process_one(item_id: str):
nonlocal success_count
async with semaphore:
await rate_limiter.wait()
url = f"https://www.aliexpress.com/item/{item_id}.html"
proxy = rotator.get_proxy()
try:
data = await scrape_aliexpress_product(url, proxy_url=proxy.to_url() if proxy else None)
if data.get("title"):
outfile.write(json.dumps(data) + "\n")
outfile.flush()
success_count += 1
if proxy:
rotator.mark_success(proxy)
except Exception as e:
print(f"Failed {item_id}: {e}")
if proxy:
rotator.mark_failure(proxy)
await asyncio.gather(*[process_one(id_) for id_ in remaining])
outfile.close()
return success_count
Use Case 5: Trend Monitoring and Keyword Research
Track which product types are trending on AliExpress by monitoring the "Hot Products" and "New Arrivals" sections and tracking sold counts over time.
async def track_trending_products(
keywords: list[str],
db_path: str = "trends.db",
) -> dict:
"""
Search AliExpress for each keyword, record the top results and their metrics.
Run this daily to build a trend timeline.
"""
conn = sqlite3.connect(db_path)
conn.execute("""
CREATE TABLE IF NOT EXISTS trend_data (
id INTEGER PRIMARY KEY AUTOINCREMENT,
keyword TEXT,
item_id TEXT,
title TEXT,
price TEXT,
sold_count TEXT,
rating TEXT,
position INTEGER,
run_date TEXT,
created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
""")
conn.commit()
run_date = datetime.now().strftime("%Y-%m-%d")
results_summary = {}
for keyword in keywords:
search_url = f"https://www.aliexpress.com/w/wholesale-{keyword.replace(' ', '-')}.html?SortType=total_tranpro_desc"
proxy = rotator.get_proxy()
products = await scrape_category(
search_url,
max_pages=2,
proxy_url=proxy.to_url() if proxy else None,
)
for position, product in enumerate(products[:20], 1):
conn.execute("""
INSERT INTO trend_data (keyword, item_id, title, price, sold_count, rating, position, run_date)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
""", (
keyword,
product.get("item_id", ""),
product.get("title", ""),
product.get("price", ""),
product.get("sold_count", ""),
product.get("rating", ""),
position,
run_date,
))
conn.commit()
results_summary[keyword] = len(products)
print(f"'{keyword}': tracked {len(products)} products")
await asyncio.sleep(random.uniform(5.0, 10.0))
conn.close()
return results_summary
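With daily runs recorded, a small query helper can surface items climbing the rankings for a keyword. A hypothetical addition over the trend_data table above:

```python
import sqlite3

def position_movers(conn: sqlite3.Connection, keyword: str) -> list[dict]:
    """Compare the two most recent run dates in trend_data for a keyword
    and return items whose search position improved (lower = better)."""
    dates = [r[0] for r in conn.execute(
        "SELECT DISTINCT run_date FROM trend_data WHERE keyword = ? "
        "ORDER BY run_date DESC LIMIT 2", (keyword,)
    )]
    if len(dates) < 2:
        return []  # need at least two runs to compare
    newest, previous = dates

    def positions(run_date: str) -> dict:
        return dict(conn.execute(
            "SELECT item_id, position FROM trend_data "
            "WHERE keyword = ? AND run_date = ?", (keyword, run_date)
        ))

    new_pos, old_pos = positions(newest), positions(previous)
    return [
        {"item_id": item_id, "from": old_pos[item_id], "to": pos}
        for item_id, pos in new_pos.items()
        if item_id in old_pos and pos < old_pos[item_id]
    ]
```

Items that appear in the newest run but not the previous one are worth a separate "new entrant" query.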
Output Schema
A complete product record from the scraper looks like this:
{
"item_id": "1005006789012345",
"url": "https://www.aliexpress.com/item/1005006789012345.html",
"title": "2024 New LED Desk Lamp Wireless Charging USB Reading Light Study Lamp for Bedroom Office",
"price": "US $12.99",
"original_price": "US $18.56",
"discount": "30%",
"sold_count": "1,234 sold",
"rating": "4.8",
"review_count": 892,
"shipping": {
"method": "AliExpress Standard Shipping",
"price": "Free",
"delivery_days": 20
},
"seller": {
"name": "Electronics World Store",
"id": "12345678",
"positive_feedback": "97.8%",
"followers": 4521
},
"sku_data": {
"props": [
{
"skuPropertyName": "Color",
"skuPropertyValues": [
{"propertyValueName": "Black", "skuPropertyImagePath": "//ae01.alicdn.com/..."},
{"propertyValueName": "White", "skuPropertyImagePath": "//ae01.alicdn.com/..."}
]
}
],
"price_list": [
{"skuPropIds": "200000182:201441035", "skuVal": {"skuAmount": {"value": "12.99"}}},
{"skuPropIds": "200000182:201441036", "skuVal": {"skuAmount": {"value": "13.49"}}}
]
},
"images": [
"//ae01.alicdn.com/kf/S123abc.jpg",
"//ae01.alicdn.com/kf/S456def.jpg"
],
"scraped_at": 1743436800.0
}
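Note that `price`, `sold_count`, and `rating` arrive as display strings, which is awkward for analysis. A small normalizer can derive numeric companions before the record hits storage; this is a sketch whose regexes assume the display formats shown in the example record above, so verify them against your own output:

```python
import re


def normalize_record(record: dict) -> dict:
    """Add numeric companions for the display-string fields in a product record."""
    out = dict(record)
    # "US $12.99" -> 12.99 (first number, commas allowed)
    price = re.search(r"[\d,]+(?:\.\d+)?", record.get("price") or "")
    out["price_value"] = float(price.group().replace(",", "")) if price else None
    # "1,234 sold" -> 1234
    sold = re.search(r"[\d,]+", record.get("sold_count") or "")
    out["sold_value"] = int(sold.group().replace(",", "")) if sold else None
    # "4.8" -> 4.8; missing or non-numeric ratings become None
    try:
        out["rating_value"] = float(record.get("rating") or "")
    except ValueError:
        out["rating_value"] = None
    return out
```

Keeping the original strings alongside the parsed values means a format change on AliExpress's side degrades to `None` fields rather than losing data.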
Error Handling and Retry Logic
Production scrapers need to distinguish transient errors (network timeouts, temporary blocks) from permanent failures (removed products, invalid URLs): the former deserve retries with backoff, while the latter should be logged and skipped without wasting attempts.
from tenacity import (
retry, stop_after_attempt, wait_exponential,
retry_if_exception_type, before_sleep_log
)
import logging
logger = logging.getLogger(__name__)
class ProductRemovedError(Exception):
"""Product no longer exists on AliExpress."""
pass
class BotDetectedError(Exception):
"""Anti-bot triggered — needs fresh session/proxy."""
pass
class ScrapingError(Exception):
"""General scraping failure."""
pass
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=2, min=5, max=60),
    retry=retry_if_exception_type(ScrapingError),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    reraise=True,
)
async def scrape_with_retry(url: str, proxy_rotator: ProxyRotator) -> dict:
    """
    Scrape with automatic retry for transient errors.
    Wait time doubles per attempt, clamped between 5s and 60s.
    """
proxy = proxy_rotator.get_proxy()
try:
result = await scrape_aliexpress_product(
url,
proxy_url=proxy.to_url() if proxy else None,
)
if not result:
raise ScrapingError("Empty result returned")
if result.get("error"):
error_msg = result["error"]
if "404" in error_msg or "item not found" in error_msg.lower():
raise ProductRemovedError(f"Product removed: {url}")
if any(word in error_msg.lower() for word in ["captcha", "blocked", "bot"]):
proxy_rotator.mark_failure(proxy)
raise ScrapingError(f"Bot detection: {error_msg}")
raise ScrapingError(f"Scrape error: {error_msg}")
if proxy:
proxy_rotator.mark_success(proxy)
return result
    except ProductRemovedError:
        raise  # Don't retry — product is gone
    except ScrapingError:
        raise  # Already classified as transient; let tenacity retry without re-wrapping
except PlaywrightTimeout as e:
if proxy:
proxy_rotator.mark_failure(proxy)
raise ScrapingError(f"Timeout: {e}")
except Exception as e:
if proxy:
proxy_rotator.mark_failure(proxy)
raise ScrapingError(f"Unexpected error: {e}")
async def bulk_scrape_safe(
    urls: list[str],
    rotator: ProxyRotator,
    output_path: str = "results.jsonl",
    error_path: str = "errors.jsonl",
) -> tuple[int, int]:
"""
Scrape a list of URLs with full error handling.
Writes successes to output_path and failures to error_path.
Returns (success_count, error_count).
"""
success_count = 0
error_count = 0
with open(output_path, "a") as out, open(error_path, "a") as err:
for url in urls:
try:
result = await scrape_with_retry(url, rotator)
out.write(json.dumps(result) + "\n")
out.flush()
success_count += 1
except ProductRemovedError as e:
err.write(json.dumps({"url": url, "error": "removed", "detail": str(e)}) + "\n")
err.flush()
error_count += 1
except Exception as e:
err.write(json.dumps({"url": url, "error": "failed", "detail": str(e)}) + "\n")
err.flush()
error_count += 1
logger.error(f"Failed after all retries: {url} — {e}")
# Delay between each URL regardless of success/failure
await asyncio.sleep(random.uniform(2.0, 5.0))
return success_count, error_count
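Because `bulk_scrape_safe` tags every failure with a reason, the error log doubles as a retry queue: `"failed"` entries are worth a second pass, `"removed"` entries are not. A minimal sketch, assuming the JSONL layout written by the `err.write` calls above (the function name is illustrative):

```python
import json


def requeue_failures(error_path: str = "errors.jsonl") -> list[str]:
    """Collect URLs worth retrying: transient failures, not removed products."""
    retry_urls = []
    with open(error_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry.get("error") == "failed":  # skip "removed" — permanent
                retry_urls.append(entry["url"])
    # Deduplicate while preserving first-seen order
    return list(dict.fromkeys(retry_urls))
```

Feeding the result back into `bulk_scrape_safe` with a fresh error file gives a cheap two-pass strategy that recovers most transient failures without re-scraping successes.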
Common Errors and What They Mean
TimeoutError waiting for product selector — The page did not load the product content within the timeout. Either the page is genuinely slow, you received a redirect to a block/CAPTCHA page, or the site layout changed. Check await page.screenshot(path="debug.png") to see what the page actually looks like.
Empty price field despite page loading — JavaScript did not finish executing before you extracted content. Increase the timeout on wait_until="networkidle" or add an explicit wait for the price element: await page.wait_for_selector(".product-price-value", timeout=15000).
window.__INIT_DATA__ is null — This happens on some AliExpress page variants (mobile-routed pages, lite pages for blocked IPs, or new A/B test layouts). Fall back to DOM parsing and check window.runParams as an alternative data source.
403 on search pages, 200 on product pages — Search pages have stricter anti-bot thresholds. Use a different User-Agent for search, add more delay between paginated requests, and ensure your cookie session was established by visiting the homepage first.
Prices in wrong currency — Your proxy IP is routing through a country with different pricing. Specify country-targeting in your ThorData proxy username (e.g., username-country-US) to ensure consistent USD pricing.
Silent degradation (data looks old or incomplete) — AliExpress sometimes returns cached/stale data to suspected bots rather than blocking outright. Check if scraped_at timestamps in your data correspond to fresh scrapes, and compare product data against the browser manually to verify freshness.
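The silent-degradation failure mode is worth an automated check rather than a manual spot check: compare each record's `scraped_at` against the wall clock and alert when a batch skews stale. A sketch, assuming the `scraped_at` epoch-seconds field from the output schema above (the 6-hour threshold is an arbitrary choice):

```python
import time


def stale_fraction(records: list[dict], max_age_seconds: float = 6 * 3600) -> float:
    """Return the fraction of records whose scraped_at is older than the threshold.

    Records missing scraped_at entirely are counted as stale.
    """
    if not records:
        return 0.0
    now = time.time()
    stale = sum(1 for r in records if now - r.get("scraped_at", 0) > max_age_seconds)
    return stale / len(records)
```

Wiring this into the end of a batch run (e.g., warn when the fraction exceeds a few percent) catches cached-response poisoning long before it corrupts a price-history dataset.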
AliExpress scraping is a moving target. The selectors change, window.__INIT_DATA__ key paths shift occasionally, and Alibaba's anti-bot systems receive regular updates. The window.__INIT_DATA__ approach is significantly more stable than DOM parsing — prioritize it and fall back to DOM extraction only when the structured data is missing. Residential proxies and stealth browser patching are the two non-negotiables for maintaining reliable access at scale.