Introduction
webclaw is a web extraction toolkit built in Rust. It turns any website into LLM-ready markdown, JSON, plain text, or token-optimized output -- without a headless browser. All extraction happens over raw HTTP using Impit TLS impersonation, making it fast, lightweight, and deployable anywhere.
Three binaries, one engine
webclaw ships as three standalone binaries, all powered by the same extraction core:
webclaw
The CLI. Extract, crawl, summarize, and track changes from the terminal. Pipe output to files, chain with other tools, or use interactively.
webclaw-server
The REST API. An axum-based HTTP server with authentication, CORS, gzip compression, and async job management. Every extraction feature is available as a JSON endpoint.
webclaw-mcp
The MCP server. Exposes 8 tools over the Model Context Protocol (stdio transport) for use with Claude Desktop, Claude Code, and any MCP-compatible AI client.
Key features
No headless browser. Pure HTTP extraction via Impit TLS impersonation. No Playwright, no Puppeteer, no Chrome. Fast and lightweight.
4 output formats. Markdown, plain text, JSON, and LLM-optimized (9-step pipeline: image stripping, emphasis removal, link dedup, stat merging, whitespace collapse).
CSS selector filtering. Include or exclude content by CSS selector. Extract only article bodies, skip navbars and footers.
Crawling and sitemap discovery. BFS same-origin crawler with configurable depth, concurrency, and delay. Sitemap.xml and robots.txt discovery built in.
Content change tracking. Snapshot pages as JSON and diff against future extractions to detect what changed.
Brand extraction. Extract brand identity -- colors, fonts, logo URL, favicon -- from DOM structure and CSS analysis.
LLM integration. Provider chain with automatic fallback: Ollama (local-first), then OpenAI, then Anthropic. JSON schema extraction, prompt-based extraction, and summarization.
PDF extraction. Auto-detected via Content-Type header. Text extraction from PDF documents without external dependencies.
Proxy rotation. Per-request proxy rotation from a pool file. Auto-loads proxies.txt from the working directory.
Browser impersonation. Chrome (v142, v136, v133, v131) and Firefox (v144, v135, v133, v128) TLS fingerprint profiles. Random mode available.
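Several of the features above are simple mechanisms under the hood. The per-request proxy rotation, for instance, amounts to a round-robin walk over a pool. A minimal sketch, assuming a thread-safe counter; the types and names here are illustrative, not webclaw's actual API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Illustrative round-robin proxy pool (not webclaw's real implementation).
struct ProxyPool {
    proxies: Vec<String>,
    next: AtomicUsize,
}

impl ProxyPool {
    fn new(proxies: Vec<String>) -> Self {
        Self { proxies, next: AtomicUsize::new(0) }
    }

    /// Returns the next proxy in rotation, or None if the pool is empty.
    fn next_proxy(&self) -> Option<&str> {
        if self.proxies.is_empty() {
            return None;
        }
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.proxies.len();
        Some(&self.proxies[i])
    }
}

fn main() {
    let pool = ProxyPool::new(vec![
        "http://p1:8080".into(),
        "http://p2:8080".into(),
    ]);
    for _ in 0..3 {
        // Cycles p1, p2, p1 -- each request gets the next proxy in the pool.
        println!("{}", pool.next_proxy().unwrap());
    }
}
```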
Open source
webclaw is MIT licensed and fully open source. The repository is at github.com/0xMassi/webclaw.
Architecture
The project is a Rust workspace split into focused crates. The core extraction engine has zero network dependencies and is WASM-compatible.
webclaw-core
The pure extraction engine. Takes raw HTML as a string, returns structured output. No network calls, no I/O -- just parsing and scoring. This is what makes the core WASM-compatible.
Key modules: readability-style content scoring with text density and link density penalties, shared noise filtering (tags, ARIA roles, class/ID patterns, Tailwind-safe), JSON data island extraction for React SPAs and Next.js, HTML to markdown conversion with URL resolution, and a 9-step LLM optimization pipeline.
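The text-density and link-density idea behind readability-style scoring can be shown with a minimal sketch. The formula and inputs below are assumptions for illustration, not webclaw-core's actual scoring:

```rust
/// Illustrative readability-style score: reward total text length, penalize
/// nodes whose text is mostly link text (navbars, footers, link farms).
/// Not webclaw-core's real formula -- a sketch of the principle.
fn content_score(text_len: usize, link_text_len: usize) -> f64 {
    if text_len == 0 {
        return 0.0;
    }
    let link_density = link_text_len as f64 / text_len as f64;
    // A node that is almost entirely links scores near zero regardless of size.
    text_len as f64 * (1.0 - link_density)
}

fn main() {
    // Article paragraph: long text, few links -> high score.
    println!("{:.0}", content_score(1200, 60));
    // Navbar: short text, almost all of it link text -> near zero.
    println!("{:.0}", content_score(80, 76));
}
```

The same shape of heuristic is what lets the engine keep article bodies while discarding navigation, even before CSS selector filtering is applied.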
webclaw-fetch
The HTTP layer. Uses Impit for TLS impersonation with Chrome and Firefox browser profiles. Handles BFS crawling with configurable depth and concurrency, sitemap.xml and robots.txt discovery, multi-URL batch operations, and per-request proxy rotation.
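The BFS same-origin crawl with a depth limit can be sketched against an in-memory link graph standing in for fetched pages. All names are illustrative, and the same-origin test is simplified to a prefix check:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Illustrative breadth-first crawl over a pre-built link graph.
/// Real crawling fetches pages over HTTP; the traversal logic is the same.
fn bfs_crawl(
    links: &HashMap<&str, Vec<&str>>,
    start: &str,
    origin: &str,
    max_depth: usize,
) -> Vec<String> {
    let mut visited = HashSet::new();
    let mut queue = VecDeque::new();
    let mut order = Vec::new();
    visited.insert(start.to_string());
    queue.push_back((start.to_string(), 0));
    while let Some((url, depth)) = queue.pop_front() {
        order.push(url.clone());
        if depth == max_depth {
            continue; // depth limit reached: visit this page, follow no links
        }
        for &next in links.get(url.as_str()).into_iter().flatten() {
            // Same-origin filter (simplified to a URL prefix check here).
            if next.starts_with(origin) && visited.insert(next.to_string()) {
                queue.push_back((next.to_string(), depth + 1));
            }
        }
    }
    order
}

fn main() {
    let mut links: HashMap<&str, Vec<&str>> = HashMap::new();
    links.insert("https://a.dev/", vec!["https://a.dev/x", "https://b.dev/"]);
    links.insert("https://a.dev/x", vec!["https://a.dev/y"]);
    // Depth 1: the start page plus its same-origin neighbors; the external
    // https://b.dev/ link is skipped, and depth-2 pages are never queued.
    println!("{:?}", bfs_crawl(&links, "https://a.dev/", "https://a.dev", 1));
}
```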
webclaw-llm
LLM provider chain with automatic fallback: tries Ollama first (local, no API key needed), then OpenAI, then Anthropic. Uses plain reqwest (not Impit) since LLM APIs do not need TLS fingerprinting. Supports JSON schema extraction, prompt-based extraction, and summarization.
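The fallback behavior can be sketched with stubbed providers. The real chain makes HTTP calls to Ollama, OpenAI, and Anthropic; everything below is an illustrative stand-in:

```rust
/// Illustrative provider chain: try each provider in order, return the first
/// success, and surface the last error if every provider fails.
type Provider = Box<dyn Fn(&str) -> Result<String, String>>;

fn run_with_fallback(chain: &[(&str, Provider)], prompt: &str) -> Result<String, String> {
    let mut last_err = String::from("empty provider chain");
    for (name, provider) in chain {
        match provider(prompt) {
            Ok(out) => return Ok(out),
            Err(e) => last_err = format!("{name}: {e}"),
        }
    }
    Err(last_err)
}

fn main() {
    // Stubbed chain: the local provider is down, the next one succeeds.
    let chain: Vec<(&str, Provider)> = vec![
        ("ollama", Box::new(|_: &str| Err("connection refused".to_string()))),
        ("openai", Box::new(|p: &str| Ok(format!("summary of: {p}")))),
    ];
    println!("{:?}", run_with_fallback(&chain, "page text"));
}
```

Trying the local provider first means no API key is needed until the chain actually has to fall back to a hosted model.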
The core takes &str HTML and returns structured data; all HTTP, LLM calls, and PDF parsing happen in the other crates.
Next steps
Getting started -- install webclaw and run your first extraction in under a minute.
CLI reference -- every flag and option for the command-line tool.
REST API -- programmatic access to the full extraction engine.