Massi

Founder & engineer, webclaw

I'm Massi, also known online as 0xMassi. I build web extraction infrastructure in Rust, focused on the problem of getting clean, reliable web data into language models and AI agents.

My work lives at the intersection of three hard problems: bot protection bypass (TLS fingerprinting, HTTP/2 impersonation), high-throughput content extraction (Rust, async, zero-copy), and LLM tooling (MCP, structured extraction, RAG pipelines). webclaw is where I ship that work as open source.

Before webclaw, I spent years writing iOS apps, backend services, and developer tooling. I've shipped native apps to the App Store, run production APIs, and maintained Rust crates used by other developers.

Areas of expertise

  • Rust systems programming
  • Web content extraction
  • TLS fingerprinting and browser impersonation
  • HTTP/2 protocol internals
  • Bot protection bypass (Cloudflare, DataDome, AWS WAF)
  • Model Context Protocol (MCP) server design
  • Retrieval augmented generation (RAG) pipelines
  • LLM tooling and agent infrastructure

Articles

Jina Reader Alternative for LLM Web Scraping

Compare Jina Reader, r.jina.ai, and webclaw for URL to markdown, RAG input, crawling, batching, JavaScript rendering, anti-bot pages, and production extraction.

Crawl4AI vs Playwright for LLM Web Scraping

Compare Crawl4AI and Playwright for scraping dynamic sites, RAG input, markdown output, browser control, and production reliability.

JavaScript Rendering API for Web Scraping: when browser fallback is actually needed

Learn when a JavaScript rendering API is necessary for scraping dynamic websites, how to detect empty app shells, and why browser fallback should run only after response classification.

Anti-Bot Scraping API 2026: signals that force browser fallback

The exact block markers, JA4 fingerprints, empty shells, anti-bot cookies, JavaScript heuristics, and content-quality signals that decide when a scraping API should escalate to a browser.

Anti-bot scraping API: browser fallback beats browser-first

Choose an anti-bot scraping API that detects blocks, avoids browser-first costs, and returns clean markdown or JSON for AI agents and RAG.

How to evaluate web scraping APIs for AI agents

A practical checklist for testing web scraping APIs on real agent and RAG workflows, not toy URLs like example.com.

Migrating from Firecrawl: compatible API for AI agents

Already using Firecrawl? Learn how Firecrawl-compatible endpoints work, what to test before switching, and how to evaluate webclaw with your existing scrape and crawl calls.

Cloudflare scraping checklist: diagnose the block before you retry

A practical checklist for Cloudflare scraping failures. What to log, what each signal means, and when to change fingerprints, sessions, rate limits, or browser rendering.

TLS Fingerprint vs Cloudflare: Why curl Gets 403

curl gets blocked while Chrome works because the TLS and HTTP/2 fingerprints differ. Learn how JA3, JA4, and browser-grade clients change the result.

Cloudflare Error Codes for Scrapers: 403, 503, 1020

Cloudflare 403, 503, 1020, and 1015 mean different scraper failures. Use this decision tree to find the block and fix the right layer.

Puppeteer Stealth Not Working on Cloudflare?

Puppeteer stealth breaks on Cloudflare when network, request, and session signals disagree. See why it fails and what to use instead.

Cloudflare Turnstile Scraping: What Works in 2026

Cloudflare Turnstile scraping works when TLS, HTTP/2, token, and session signals agree. Learn how to detect failures and choose the right fallback.

LlamaIndex Web Scraping: Fix SimpleWebPageReader

LlamaIndex web scraping fails on blocks, empty shells, and noisy HTML. Feed cleaner markdown into RAG pipelines and agents.

LangChain web scraping in 2026: what loaders can't do

LangChain's built-in loaders break on bot-protected sites and return raw HTML your LLM can't use. Here's how to get clean, reliable web data into any LangChain pipeline.

5 ways to scrape Google search results in 2026

Google killed plain HTTP access to search results. Here's what works now, from TLS fingerprinting libraries to headless browsers to APIs, with code examples for each approach.

The 6 best web scraping APIs for LLMs in 2026

If you're building with LLMs, you need web data. Here's how the main scraping APIs compare on the things that actually matter for AI use cases.

Bypass Cloudflare Bot Protection in 2026

Bypass Cloudflare bot protection by fixing TLS, HTTP/2, challenge, and session signals instead of only rotating proxies or user agents.

Extract structured data from any URL in one call

You don't always need the full page. Sometimes you need three fields from a product listing. Here's how to pull exactly the data you want from any URL.

Build a RAG pipeline with live web data (4 steps)

Most RAG tutorials stop at "upload a PDF." Real apps need live web data. Here's how to build a pipeline that fetches, extracts, and indexes pages.

MCP web scraping for Claude Code and Cursor

MCP web scraping gives Claude Code, Cursor, and AI agents live web access. Scrape, crawl, search, extract, and summarize from one server.

HTML to Markdown for LLMs and RAG

Convert HTML to Markdown for LLMs with boilerplate removed, links preserved, and cleaner RAG input for agents and summarization.

Web scraping for AI agents: 3 hidden problems

Most scraping tools were built for data pipelines, not AI agents. Here are three things that quietly break your pipeline, and how to fix them.

Why I built webclaw (Rust scraper for LLMs)

I was tired of scrapers that return 403 or need headless Chrome for basic HTML. So I built one in Rust that actually works.

Studio partners

Backing open web extraction

Quantum Proxies, Proxy-Seller