Blog.
Technical deep dives on web extraction, content parsing for LLMs, anti-bot bypass, and building open-source infrastructure in Rust. Written by the team behind webclaw.
webclaw turns any website into clean, structured content for AI applications. These posts cover the engineering decisions, trade-offs, and lessons learned building a web extraction toolkit from scratch.
Anti-bot scraping API: browser fallback beats browser-first
Choose an anti-bot scraping API that detects blocks, avoids browser-first costs, and returns clean markdown or JSON for AI agents and RAG.
How to evaluate web scraping APIs for AI agents
A practical checklist for testing web scraping APIs on real agent and RAG workflows, not toy URLs like example.com.
Migrating from Firecrawl: compatible API for AI agents
Already using Firecrawl? Learn how Firecrawl-compatible endpoints work, what to test before switching, and how to evaluate webclaw with your existing scrape and crawl calls.
Cloudflare scraping checklist: diagnose the block before you retry
A practical checklist for Cloudflare scraping failures. What to log, what each signal means, and when to change fingerprints, sessions, rate limits, or browser rendering.
TLS fingerprinting in 2026: why curl gets 403 and Chrome does not
The reason curl gets blocked and Chrome gets through is not JavaScript. It is the TLS handshake. Deep dive on JA3, JA4, HTTP/2 fingerprints, and how to match a real browser without launching one.
Cloudflare error codes for scrapers: 403 vs 503 vs 1020 (and the rest)
A 403, a 503, a 1020, and a 1015 are not the same problem. A decision tree for which Cloudflare block you hit, what each code really means, and what to change in the scraper.
Puppeteer stealth vs Cloudflare: why it breaks
Puppeteer stealth still patches browser leaks, but Cloudflare scores more than JavaScript. See what breaks in 2026 and what to do instead.
Cloudflare Turnstile scraping: fixes for 2026
Cloudflare Turnstile scraping fails as 403s, empty shells, or loops. Learn how to detect it, log the right signals, and choose the right fallback.
LlamaIndex web scraping: fix SimpleWebPageReader
LlamaIndex web scraping breaks on blocks, empty shells, and noisy HTML. Feed cleaner markdown into SimpleWebPageReader, RAG, and agents.
LangChain web scraping in 2026: what loaders can't do
LangChain's built-in loaders break on bot-protected sites and return raw HTML your LLM can't use. Here's how to get clean, reliable web data into any LangChain pipeline.
5 ways to scrape Google search results in 2026
Google killed plain HTTP access to search results. Here's what works now, from TLS fingerprinting libraries to headless browsers to APIs, with code examples for each approach.
The 6 best web scraping APIs for LLMs in 2026
If you're building with LLMs, you need web data. Here's how the main scraping APIs compare on the things that actually matter for AI use cases.
Cloudflare web scraping: what works in 2026
A practical guide to Cloudflare scraping blocks in 2026. Learn what causes 403s, what signals matter, and which approaches still work.
Extract structured data from any URL in one call
You don't always need the full page. Sometimes you need three fields from a product listing. Here's how to pull exactly the data you want from any URL.
Build a RAG pipeline with live web data (4 steps)
Most RAG tutorials stop at "upload a PDF." Real apps need live web data. Here's how to build a pipeline that fetches, extracts, and indexes pages.
MCP web scraping for Claude Code and Cursor
MCP web scraping gives Claude Code, Cursor, and AI agents live web access. Scrape, crawl, search, extract, and summarize from one server.
HTML to Markdown for LLMs: cleaner RAG input
Convert HTML to Markdown for LLMs with boilerplate removed, links preserved, and fewer wasted tokens for RAG, agents, and summarization.
Web scraping for AI agents: 3 hidden problems
Most scraping tools were built for data pipelines, not AI agents. Here are three things that quietly break your pipeline, and how to fix them.
Why I built webclaw (Rust scraper for LLMs)
I was tired of scrapers that return 403 or need headless Chrome for basic HTML. So I built one in Rust that actually works.