Why I built webclaw
I wanted to scrape a website. That's it. That's the origin story.
No grand vision, no pitch deck, no "what if we reimagined web extraction for the AI era." I had a URL, I needed the content, and every tool I tried either didn't work or made it way harder than it should be.
The 403 problem
Here's what happens when you try to scrape anything in 2026.
You install one of those popular scraping tools. You pass it a URL. You wait. 403 Forbidden. Or worse, you get back HTML that's just a Cloudflare challenge page, and the library says "here's your content!" like it did something useful.
So you look at the docs. "For pages behind anti-bot protection, enable our premium proxy network." Ah. Cool. So the free tier is basically a fetch() wrapper that breaks on any real website. Got it.
Or you try one of those "ethical" crawlers that respects robots.txt. Which, fine in principle. But when you're building an AI agent that needs to read a pricing page to compare options for a user, you're not a search engine. You're not indexing the web. You just need to read a page. The same page any human can read by clicking a link.
These tools treat every URL like you're about to DDoS it. Meanwhile your browser opens the same page in 200ms, no questions asked.
Why everything is a headless browser
The other approach is headless Chrome. Puppeteer, Playwright, Selenium. Spin up a real browser, navigate to the page, wait for JavaScript, then extract the DOM.
It works. It works really well actually. But it's like driving a truck to the grocery store to buy a banana.
Most pages don't need JavaScript rendering. The content is right there in the HTML. An article page, a docs site, a blog post. The HTML response has everything. You don't need 200MB of Chromium to read it.
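To make that concrete, here's a toy extractor using nothing but Python's standard library. It pulls the text out of an article page from the raw HTML response, no JavaScript engine involved. This is a sketch of the idea, not webclaw's actual (Rust) implementation:

```python
from html.parser import HTMLParser

# Toy extractor: grab the text inside <article> from static HTML.
# No browser, no JS engine. Just parse the response body directly.
class ArticleText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_article = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False

    def handle_data(self, data):
        if self.in_article and data.strip():
            self.chunks.append(data.strip())

def extract_article(html: str) -> str:
    parser = ArticleText()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ("<html><nav>Home | Docs</nav>"
        "<article><h1>Hello</h1><p>The content is right here.</p></article>"
        "<footer>Footer links</footer></html>")
print(extract_article(page))  # -> Hello The content is right here.
```

The nav and footer never make it into the output, and the whole thing runs in microseconds instead of seconds.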
And the performance cost is insane. Spinning up a Chrome instance, loading all the assets, executing JavaScript, waiting for network idle. You're looking at 2-5 seconds per page. Multiply that by a thousand pages in a crawl and you're waiting hours for something that should take minutes.
So I wrote it in Rust
Not because Rust is trendy (ok, a little bit because of that). But because the problem is fundamentally about parsing HTML fast and making HTTP requests that don't get blocked.
The core insight is simple: you don't need a browser, you need to *look like* a browser. TLS fingerprinting, proper headers, realistic request patterns. That's what gets you through anti-bot on most sites. Not a 200MB Chrome binary.
webclaw uses Impit for TLS impersonation. It can look like Chrome 142, Firefox 144, whatever you need. The actual request goes out over raw HTTP. No rendering engine, no JavaScript VM, no GPU process eating your RAM.
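Here's the header half of "look like a browser," sketched in Python for illustration. TLS fingerprinting, the other half, happens below the HTTP layer and needs a library like Impit; plain urllib can't do it. The header values below are representative examples, not Chrome 142's exact fingerprint:

```python
import urllib.request

# Browser-like request headers. Illustrative values only; a real
# impersonation stack (like Impit) also matches the TLS fingerprint,
# which stdlib HTTP clients cannot do.
CHROME_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/142.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
}

def browser_like_request(url: str) -> urllib.request.Request:
    # Attach the headers without sending anything yet.
    return urllib.request.Request(url, headers=CHROME_LIKE_HEADERS)

req = browser_like_request("https://example.com")
# urllib normalizes header keys to capitalized form internally.
print(req.get_header("User-agent"))
```

A default `python-requests/2.x` or `Go-http-client/1.1` user agent gets flagged instantly; headers like these get you past the first layer of filtering, and TLS impersonation handles the rest.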
The result is about 20x faster than the Chrome-based tools. And it works on sites where those tools either get blocked or need you to pay for proxy rotation.
Making it useful for LLMs
Speed was just step one. The real problem is that raw HTML is garbage for LLMs.
You give an LLM a full HTML page and half your tokens are navigation bars, footer links, cookie banners, and CSS class names. The actual content (the article, the docs, the product info) is maybe 10% of the payload.
So I built a 9-step optimization pipeline. Strip images that aren't content-relevant. Remove emphasis that doesn't add meaning. Deduplicate links. Collapse whitespace. Merge stat blocks. The output is clean markdown that an LLM can actually reason over without burning your context window on <div class="flex items-center justify-between px-4 py-2">.
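A few of those steps can be sketched in a handful of regexes. This is a Python toy, not webclaw's real 9-step Rust pipeline, but it shows the shape of the transformation:

```python
import re

# Toy versions of four pipeline steps: strip images, remove emphasis,
# dedupe links, collapse whitespace. Illustrative only.
def strip_images(md: str) -> str:
    # Drop markdown images entirely: ![alt](url)
    return re.sub(r"!\[[^\]]*\]\([^)]*\)", "", md)

def strip_emphasis(md: str) -> str:
    # Remove *emphasis* and **bold** markers, keep the text.
    return re.sub(r"\*{1,2}([^*]+)\*{1,2}", r"\1", md)

def dedupe_links(md: str) -> str:
    # Keep only the first occurrence of each [text](url) pair.
    seen = set()
    def repl(m):
        if m.group(0) in seen:
            return m.group(1)  # later copies keep the text, lose the link
        seen.add(m.group(0))
        return m.group(0)
    return re.sub(r"\[([^\]]+)\]\([^)]*\)", repl, md)

def collapse_whitespace(md: str) -> str:
    md = re.sub(r"[ \t]+", " ", md)
    return re.sub(r"\n{3,}", "\n\n", md).strip()

def optimize(md: str) -> str:
    for step in (strip_images, strip_emphasis, dedupe_links, collapse_whitespace):
        md = step(md)
    return md

noisy = "![logo](a.png)\n\n\n**Pricing**   is [here](/p) and again [here](/p)."
print(optimize(noisy))  # -> Pricing is [here](/p) and again here.
```

Each step is cheap on its own; chained together they cut the payload dramatically without losing information an LLM actually uses.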
The difference is real. On a typical docs page, you go from 15,000 tokens of HTML to about 800 tokens of LLM-optimized markdown. Same information, 95% less noise.
The MCP thing
Then Anthropic dropped MCP (Model Context Protocol). Basically a standard way for AI agents to call tools. And web scraping is like the most obvious tool an AI agent would need.
So I built webclaw-mcp. You plug it into Claude Desktop or Claude Code and your AI can scrape, crawl, extract structured data, track content changes. All through a clean tool interface.
It's the thing I wish existed when I was trying to build AI agents that needed to read the web. Instead of writing custom scraping code for every project, the agent just calls scrape("https://example.com") and gets back clean markdown.
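For a sense of what that tool interface looks like on the wire, here's a minimal MCP-style `tools/call` round trip. The request/response shape follows the MCP spec; the `scrape` handler is a stand-in that returns canned markdown, not webclaw-mcp's actual implementation:

```python
import json

# Stand-in scraper: a real MCP server would fetch and optimize the page.
def fake_scrape(url: str) -> str:
    return f"# Example Domain\n\nClean markdown scraped from {url}."

TOOLS = {"scrape": fake_scrape}

def handle_request(raw: str) -> str:
    # Dispatch a JSON-RPC tools/call request to the named tool.
    req = json.loads(raw)
    tool = TOOLS[req["params"]["name"]]
    text = tool(**req["params"]["arguments"])
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req["id"],
        "result": {"content": [{"type": "text", "text": text}]},
    })

request = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "scrape", "arguments": {"url": "https://example.com"}},
})
response = json.loads(handle_request(request))
print(response["result"]["content"][0]["text"])
```

The agent never sees any of this plumbing. From its side, there's just a `scrape` tool that takes a URL and returns markdown.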
Open source, obviously
The whole thing is MIT licensed. You can self-host it, run it as a Docker container, or use the cloud API if you don't want to deal with infrastructure.
I built it because I needed it. Turns out a lot of other people needed it too. If you're tired of scrapers that return 403 or charge you per-page for basic HTML extraction, give it a try.
The repo is at github.com/0xMassi/webclaw. Star it if it saves you time.