Blog.
Technical deep dives on web extraction, content parsing for LLMs, anti-bot bypass, and building open-source infrastructure in Rust. Written by the team behind webclaw.
webclaw turns any website into clean, structured content for AI applications. These posts cover the engineering decisions, trade-offs, and lessons learned building a web extraction toolkit from scratch.
TLS fingerprinting in 2026: why curl gets 403 and Chrome does not
The reason curl gets blocked and Chrome gets through is not JavaScript. It is the TLS handshake. Deep dive on JA3, JA4, HTTP/2 fingerprints, and how to match a real browser without launching one.
Cloudflare error codes for scrapers: 403 vs 503 vs 1020 (and the rest)
A 403, a 503, a 1020, and a 1015 are not the same problem. A decision tree for working out which Cloudflare block you hit, what each code really means, and what to change in the scraper.
Why Puppeteer stealth stopped working on Cloudflare
Puppeteer stealth still patches useful browser leaks. Cloudflare changed the game around it. Here is what breaks, what is real, and what to do instead.
Cloudflare Turnstile in 2026: what stealth plugins miss
Turnstile looks invisible. It's not. What it actually does when your scraper hits a protected page, why Puppeteer stealth stopped solving it in late 2025, and what works now.
LlamaIndex web scraping in 2026: what the readers miss
LlamaIndex readers like SimpleWebPageReader and TrafilaturaWebReader break on bot-protected sites and dump raw HTML into your index. Here's how to feed clean, LLM-ready web data into any LlamaIndex pipeline.
LangChain web scraping in 2026: what loaders can't do
LangChain's built-in loaders break on bot-protected sites and return raw HTML your LLM can't use. Here's how to get clean, reliable web data into any LangChain pipeline.
5 ways to scrape Google search results in 2026
Google killed plain HTTP access to search results. Here's what works now, from TLS fingerprinting libraries to headless browsers to APIs, with code examples for each approach.
The 6 best web scraping APIs for LLMs in 2026
If you're building with LLMs, you need web data. Here's how the main scraping APIs compare on the things that actually matter for AI use cases.
How to bypass Cloudflare bot protection (2026)
Cloudflare protects over 20% of the web. If you're scraping, you've hit a 403. Here's what actually works, what doesn't, and why most tools get it wrong.
Extract structured data from any URL in one call
You don't always need the full page. Sometimes you need three fields from a product listing. Here's how to pull exactly the data you want from any URL.
Build a RAG pipeline with live web data (4 steps)
Most RAG tutorials stop at "upload a PDF." Real apps need live web data. Here's how to build a pipeline that fetches, extracts, and indexes pages.
MCP web scraping in 2026: 12 tools for Claude, Cursor, Windsurf
Open-source MCP server with 12 web extraction tools. Plug into Claude Desktop, Claude Code, Cursor, or Windsurf in one config line. Cloudflare and DataDome bypass included.
HTML to markdown for LLMs in 2026: cut tokens 97% in one API call
A raw HTML page can run to 50,000 tokens. The content you need is closer to 800. Strip boilerplate, save 97% on LLM tokens, and get clean markdown from any URL in one API call.
Web scraping for AI agents: 3 hidden problems
Most scraping tools were built for data pipelines, not AI agents. Three things that quietly break your pipeline, and how to fix them.
Why I built webclaw (Rust scraper for LLMs)
I was tired of scrapers that return 403 or need headless Chrome for basic HTML. So I built one in Rust that actually works.