Web extraction, LLMs, and building in public.
Technical deep dives on web extraction, content parsing for LLMs, anti-bot bypass, and building open-source infrastructure in Rust. Written by the team behind webclaw.
webclaw turns any website into clean, structured content for AI applications. These posts cover the engineering decisions, trade-offs, and lessons learned building a web extraction toolkit from scratch.
Anti-Bot Scraping API 2026: signals that force browser fallback
The exact block markers, JA4 fingerprints, empty shells, anti-bot cookies, JavaScript heuristics, and content-quality signals that decide when a scraping API should escalate to a browser.

Anti-Bot Scraping API in 2026: Skip Browser-First, Stay Fast
An anti-bot scraping API that detects the block first, then escalates to a browser only when needed. Faster and cheaper, with clean markdown or JSON out.

How to evaluate web scraping APIs for AI agents
A practical checklist for testing web scraping APIs on real agent and RAG workflows, not toy URLs like example.com.

Migrating from Firecrawl: compatible API for AI agents
Already using Firecrawl? Learn how Firecrawl-compatible endpoints work, what to test before switching, and how to evaluate webclaw with your existing scrape and crawl calls.

Cloudflare Scraping Checklist: Diagnose the Block in 2026
A checklist for Cloudflare scraping failures: what to log, what each signal means, and when to change the fingerprint, session, or rate limit, or render in a browser.

Why curl Gets 403 on Cloudflare (TLS and JA4 Fingerprint)
curl gets a 403 and Chrome gets a 200 for the same request. The difference is the TLS and HTTP/2 fingerprint, measured as JA3 and JA4. How browser-grade clients flip the result.

Cloudflare 403, 503, 1020, 1015: What Each Block Means
Cloudflare 403, 503, 1020, and 1015 each signal a different block. A decision tree to read the code, find the failing layer, and fix it.

Why Puppeteer Stealth Still Fails on Cloudflare (2026)
puppeteer-extra-plugin-stealth still gets caught by Cloudflare in 2026. The network, request, and session signals that give it away, and what to run instead.

Cloudflare Turnstile in 2026: What Actually Bypasses It
What works against Cloudflare Turnstile in 2026 and what does not. The four signals that decide pass or block: TLS, HTTP/2, token, session. No solver hype.