Web extraction, LLMs, and building in public.
Technical deep dives on web extraction, content parsing for LLMs, anti-bot bypass, and building open-source infrastructure in Rust. Written by the team behind webclaw.
webclaw turns any website into clean, structured content for AI applications. These posts cover the engineering decisions, trade-offs, and lessons learned building a web extraction toolkit from scratch.

How to evaluate web scraping APIs for AI agents
A practical checklist for testing web scraping APIs on real agent and RAG workflows, not toy URLs like example.com.

Migrating from Firecrawl: compatible API for AI agents
Already using Firecrawl? Learn how Firecrawl-compatible endpoints work, what to test before switching, and how to evaluate webclaw with your existing scrape and crawl calls.

Cloudflare Scraping Checklist: Diagnose the Block in 2026
A checklist for Cloudflare scraping failures. What to log, what each signal means, and when to change the fingerprint, session, or rate limit, or render in a browser.

Why curl Gets 403 on Cloudflare (TLS and JA4 Fingerprint)
curl gets 403, Chrome gets 200, same request. The difference is the client's TLS fingerprint (JA3, JA4) and its HTTP/2 fingerprint. How browser-grade clients flip the result.

Cloudflare 403, 503, 1020, 1015: What Each Block Means
Cloudflare 403, 503, 1020, and 1015 each mean a different block. A decision tree to read the code, find the failing layer, and fix it.

Why Puppeteer Stealth Still Fails on Cloudflare (2026)
puppeteer-extra-plugin-stealth still gets caught by Cloudflare in 2026. The network, request, and session signals that give it away, and what to run instead.

Cloudflare Turnstile in 2026: What Actually Bypasses It
What works against Cloudflare Turnstile in 2026 and what does not. The four signals that decide pass or block: TLS, HTTP/2, token, session. No solver hype.

LlamaIndex Web Scraping: Fix SimpleWebPageReader
LlamaIndex web scraping fails on blocks, empty shells, and noisy HTML. Feed cleaner markdown into RAG pipelines and agents.

LangChain web scraping in 2026: what loaders can't do
LangChain's built-in loaders break on bot-protected sites and return raw HTML your LLM can't use. Here's how to get clean, reliable web data into any LangChain pipeline.