Massi

Founder & engineer, webclaw

I'm Massi, also known online as 0xMassi. I build web extraction infrastructure in Rust, focused on the problem of getting clean, reliable web data into language models and AI agents.

My work lives at the intersection of three hard problems: bot protection bypass (TLS fingerprinting, HTTP/2 impersonation), high-throughput content extraction (Rust, async, zero-copy), and LLM tooling (MCP, structured extraction, RAG pipelines). webclaw is where I ship that work as open source.

Before webclaw, I spent years writing iOS apps, backend services, and developer tooling. I've shipped native apps to the App Store, run production APIs, and maintained Rust crates used by other developers.

Areas of expertise

  • Rust systems programming
  • Web content extraction
  • TLS fingerprinting and browser impersonation
  • HTTP/2 protocol internals
  • Bot protection bypass (Cloudflare, DataDome, AWS WAF)
  • Model Context Protocol (MCP) server design
  • Retrieval-augmented generation (RAG) pipelines
  • LLM tooling and agent infrastructure

Articles

Anti-bot scraping API: browser fallback beats browser-first

Choose an anti-bot scraping API that detects blocks, avoids browser-first costs, and returns clean markdown or JSON for AI agents and RAG.

How to evaluate web scraping APIs for AI agents

A practical checklist for testing web scraping APIs on real agent and RAG workflows, not toy URLs like example.com.

Migrating from Firecrawl: compatible API for AI agents

Already using Firecrawl? Learn how Firecrawl-compatible endpoints work, what to test before switching, and how to evaluate webclaw with your existing scrape and crawl calls.

Cloudflare scraping checklist: diagnose the block before you retry

A practical checklist for Cloudflare scraping failures. What to log, what each signal means, and when to change fingerprints, sessions, rate limits, or browser rendering.

TLS fingerprinting in 2026: why curl gets 403 and Chrome does not

The reason curl gets blocked and Chrome gets through is not JavaScript. It is the TLS handshake. Deep dive on JA3, JA4, HTTP/2 fingerprints, and how to match a real browser without launching one.

Cloudflare error codes for scrapers: 403 vs 503 vs 1020 (and the rest)

A 403, a 503, a 1020 and a 1015 are not the same problem. Decision tree for which Cloudflare block you hit, what each code really means, and what to change in the scraper.

Puppeteer stealth vs Cloudflare: why it breaks

Puppeteer stealth still patches browser leaks, but Cloudflare scores more than JavaScript. See what breaks in 2026 and what to do instead.

Cloudflare Turnstile scraping: fixes for 2026

Cloudflare Turnstile scraping fails as 403s, empty shells, or loops. Learn how to detect it, log the right signals, and choose the right fallback.

LlamaIndex web scraping: fix SimpleWebPageReader

LlamaIndex web scraping breaks on blocks, empty shells, and noisy HTML. Feed cleaner markdown into SimpleWebPageReader, RAG, and agents.

LangChain web scraping in 2026: what loaders can't do

LangChain's built-in loaders break on bot-protected sites and return raw HTML your LLM can't use. Here's how to get clean, reliable web data into any LangChain pipeline.

5 ways to scrape Google search results in 2026

Google killed plain HTTP access to search results. Here's what works now, from TLS fingerprinting libraries to headless browsers to APIs, with code examples for each approach.

The 6 best web scraping APIs for LLMs in 2026

If you're building with LLMs, you need web data. Here's how the main scraping APIs compare on the things that actually matter for AI use cases.

Cloudflare Web Scraping: What Works in 2026

A practical guide to Cloudflare scraping blocks in 2026. Learn what causes 403s, what signals matter, and which approaches still work.

Extract structured data from any URL in one call

You don't always need the full page. Sometimes you need three fields from a product listing. Here's how to pull exactly the data you want from any URL.

Build a RAG pipeline with live web data (4 steps)

Most RAG tutorials stop at "upload a PDF." Real apps need live web data. Here's how to build a pipeline that fetches, extracts, and indexes pages.

MCP web scraping for Claude Code and Cursor

MCP web scraping gives Claude Code, Cursor, and AI agents live web access. Scrape, crawl, search, extract, and summarize from one server.

HTML to Markdown for LLMs: cleaner RAG input

Convert HTML to Markdown for LLMs with boilerplate removed, links preserved, and fewer wasted tokens for RAG, agents, and summarization.

Web scraping for AI agents: 3 hidden problems

Most scraping tools were built for data pipelines, not AI agents. Here are three things that quietly break your pipeline, and how to fix them.

Why I built webclaw (Rust scraper for LLMs)

I was tired of scrapers that return 403 or need headless Chrome for basic HTML. So I built one in Rust that actually works.
