Blog

Web extraction, LLMs, and building in public.

Technical deep dives on web extraction, content parsing for LLMs, anti-bot bypass, and building open-source infrastructure in Rust. Written by the team behind webclaw.

webclaw turns any website into clean, structured content for AI applications. These posts cover the engineering decisions, trade-offs, and lessons learned building a web extraction toolkit from scratch.

29 posts · Page 2 / 4
Anti-Bot Scraping API 2026: signals that force browser fallback
May 19, 2026 · Massi

The exact block markers, JA4 fingerprints, empty shells, anti-bot cookies, JavaScript heuristics, and content-quality signals that decide when a scraping API should escalate to a browser.

Anti-Bot Scraping API in 2026: Skip Browser-First, Stay Fast
May 14, 2026 · Massi

An anti-bot scraping API that detects the block first, then escalates to a browser only when needed. Faster and cheaper, with clean markdown or JSON out.

How to evaluate web scraping APIs for AI agents
May 12, 2026 · Massi

A practical checklist for testing web scraping APIs on real agent and RAG workflows, not toy URLs like example.com.

Migrating from Firecrawl: compatible API for AI agents
May 8, 2026 · Massi

Already using Firecrawl? Learn how Firecrawl-compatible endpoints work, what to test before switching, and how to evaluate webclaw with your existing scrape and crawl calls.

Cloudflare Scraping Checklist: Diagnose the Block in 2026
May 5, 2026 · Massi

A checklist for Cloudflare scraping failures. What to log, what each signal means, and when to change fingerprint, session, rate limit, or render in a browser.

Why curl Gets 403 on Cloudflare (TLS and JA4 Fingerprint)
Apr 30, 2026 · Massi

curl gets 403 and Chrome gets 200 on the same request. The reason is the TLS and HTTP/2 fingerprint (JA3 and JA4), and browser-grade clients flip the result.

Cloudflare 403, 503, 1020, 1015: What Each Block Means
Apr 28, 2026 · Massi

Cloudflare 403, 503, 1020, and 1015 each mean a different block. A decision tree to read the code, find the failing layer, and fix it.

Why Puppeteer Stealth Still Fails on Cloudflare (2026)
Apr 24, 2026 · Massi

puppeteer-extra-plugin-stealth still gets caught by Cloudflare in 2026. The network, request, and session signals that give it away, and what to run instead.

Cloudflare Turnstile in 2026: What Actually Bypasses It
Apr 21, 2026 · Massi

What works against Cloudflare Turnstile in 2026 and what does not. The four signals that decide pass or block: TLS, HTTP/2, token, session. No solver hype.
