Blog.

Technical deep dives on web extraction, content parsing for LLMs, anti-bot bypass, and building open-source infrastructure in Rust. Written by the team behind webclaw.

webclaw turns any website into clean, structured content for AI applications. These posts cover the engineering decisions, trade-offs, and lessons learned building a web extraction toolkit from scratch.

May 14, 2026 · Massi

Anti-bot scraping API: browser fallback beats browser-first

Choose an anti-bot scraping API that detects blocks, avoids browser-first costs, and returns clean markdown or JSON for AI agents and RAG.

May 12, 2026 · Massi

How to evaluate web scraping APIs for AI agents

A practical checklist for testing web scraping APIs on real agent and RAG workflows, not toy URLs like example.com.

May 8, 2026 · Massi

Migrating from Firecrawl: compatible API for AI agents

Already using Firecrawl? Learn how Firecrawl-compatible endpoints work, what to test before switching, and how to evaluate webclaw with your existing scrape and crawl calls.

May 5, 2026 · Massi

Cloudflare scraping checklist: diagnose the block before you retry

A practical checklist for Cloudflare scraping failures. What to log, what each signal means, and when to change fingerprints, sessions, rate limits, or browser rendering.

Apr 30, 2026 · Massi

TLS fingerprinting in 2026: why curl gets 403 and Chrome does not

The reason curl gets blocked and Chrome gets through is not JavaScript. It is the TLS handshake. Deep dive on JA3, JA4, HTTP/2 fingerprints, and how to match a real browser without launching one.

Apr 28, 2026 · Massi

Cloudflare error codes for scrapers: 403 vs 503 vs 1020 (and the rest)

A 403, a 503, a 1020 and a 1015 are not the same problem. Decision tree for which Cloudflare block you hit, what each code really means, and what to change in the scraper.

Apr 24, 2026 · Massi

Puppeteer stealth vs Cloudflare: why it breaks

Puppeteer stealth still patches browser leaks, but Cloudflare scores more than JavaScript signals. See what breaks in 2026 and what to do instead.

Apr 21, 2026 · Massi

Cloudflare Turnstile scraping: fixes for 2026

Cloudflare Turnstile scraping fails as 403s, empty shells, or loops. Learn how to detect it, log the right signals, and choose the right fallback.

Apr 17, 2026 · Massi

LlamaIndex web scraping: fix SimpleWebPageReader

LlamaIndex web scraping breaks on blocks, empty shells, and noisy HTML. Feed cleaner markdown into SimpleWebPageReader, RAG, and agents.

Apr 14, 2026 · Massi

LangChain web scraping in 2026: what loaders can't do

LangChain's built-in loaders break on bot-protected sites and return raw HTML your LLM can't use. Here's how to get clean, reliable web data into any LangChain pipeline.

Apr 10, 2026 · Massi

5 ways to scrape Google search results in 2026

Google killed plain HTTP access to search results. Here's what works now, from TLS fingerprinting libraries to headless browsers to APIs, with code examples for each approach.

Apr 7, 2026 · Massi

The 6 best web scraping APIs for LLMs in 2026

If you're building with LLMs, you need web data. Here's how the main scraping APIs compare on the things that actually matter for AI use cases.

Apr 2, 2026 · Massi

Cloudflare Web Scraping: What Works in 2026

A practical guide to Cloudflare scraping blocks in 2026. Learn what causes 403s, what signals matter, and which approaches still work.

Mar 31, 2026 · Massi

Extract structured data from any URL in one call

You don't always need the full page. Sometimes you need three fields from a product listing. Here's how to pull exactly the data you want from any URL.

Mar 27, 2026 · Massi

Build a RAG pipeline with live web data (4 steps)

Most RAG tutorials stop at "upload a PDF." Real apps need live web data. Here's how to build a pipeline that fetches, extracts, and indexes pages.

Mar 24, 2026 · Massi

MCP web scraping for Claude Code and Cursor

MCP web scraping gives Claude Code, Cursor, and AI agents live web access. Scrape, crawl, search, extract, and summarize from one server.

Mar 20, 2026 · Massi

HTML to Markdown for LLMs: cleaner RAG input

Convert HTML to Markdown for LLMs with boilerplate removed, links preserved, and fewer wasted tokens for RAG, agents, and summarization.

Mar 17, 2026 · Massi

Web scraping for AI agents: 3 hidden problems

Most scraping tools were built for data pipelines, not AI agents. Here are three things that quietly break your pipeline, and how to fix them.

Mar 12, 2026 · Massi

Why I built webclaw (Rust scraper for LLMs)

I was tired of scrapers that return 403 or need headless Chrome for basic HTML. So I built one in Rust that actually works.