webclaw

Introduction

webclaw is a web extraction toolkit built in Rust. It turns any website into LLM-ready markdown, JSON, plain text, or token-optimized output -- without a headless browser. All extraction happens over raw HTTP using Impit TLS impersonation, making it fast, lightweight, and deployable anywhere.

Three binaries, one engine

webclaw ships as three standalone binaries, all powered by the same extraction core:

webclaw

The CLI. Extract, crawl, summarize, and track changes from the terminal. Pipe output to files, chain with other tools, or use interactively.

webclaw-server

The REST API. An axum-based HTTP server with authentication, CORS, gzip compression, and async job management. Every extraction feature is available as a JSON endpoint.

webclaw-mcp

The MCP server. Exposes 8 tools over the Model Context Protocol (stdio transport) for use with Claude Desktop, Claude Code, and any MCP-compatible AI client.

Key features

No headless browser. Pure HTTP extraction via Impit TLS impersonation. No Playwright, no Puppeteer, no Chrome. Fast and lightweight.

4 output formats. Markdown, plain text, JSON, and LLM-optimized (a 9-step pipeline including image stripping, emphasis removal, link deduplication, stat merging, and whitespace collapsing).

CSS selector filtering. Include or exclude content by CSS selector. Extract only article bodies, skip navbars and footers.

Crawling and sitemap discovery. BFS same-origin crawler with configurable depth, concurrency, and delay. Sitemap.xml and robots.txt discovery built in.

Content change tracking. Snapshot pages as JSON and diff against future extractions to detect what changed.

Brand extraction. Extract brand identity -- colors, fonts, logo URL, favicon -- from DOM structure and CSS analysis.

LLM integration. Provider chain: Ollama (local-first), then OpenAI, then Anthropic. JSON schema extraction, prompt-based extraction, and summarization.

PDF extraction. Auto-detected via Content-Type header. Text extraction from PDF documents without external dependencies.

Proxy rotation. Per-request proxy rotation from a pool file. Auto-loads proxies.txt from the working directory.

Browser impersonation. Chrome (v142, v136, v133, v131) and Firefox (v144, v135, v133, v128) TLS fingerprint profiles. Random mode available.
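Two of the LLM-optimization steps listed above, image stripping and whitespace collapse, can be sketched as plain string transforms. This is an illustrative sketch only; the function names are hypothetical and webclaw's actual 9-step pipeline is more involved.

```rust
/// Strip markdown image syntax: ![alt](url). Hypothetical helper,
/// not webclaw's real API.
fn strip_images(md: &str) -> String {
    let mut out = String::new();
    let mut rest = md;
    while let Some(start) = rest.find("![") {
        out.push_str(&rest[..start]);
        // Skip past the closing ")" of the image, if present.
        match rest[start..].find(')') {
            Some(end) => rest = &rest[start + end + 1..],
            None => { rest = &rest[start..]; break; }
        }
    }
    out.push_str(rest);
    out
}

/// Collapse runs of blank lines into a single blank line.
fn collapse_whitespace(text: &str) -> String {
    let mut out = Vec::new();
    let mut blank = false;
    for line in text.lines() {
        if line.trim().is_empty() {
            if !blank { out.push(""); }
            blank = true;
        } else {
            out.push(line);
            blank = false;
        }
    }
    out.join("\n")
}

fn main() {
    let md = "Intro ![logo](a.png) text\n\n\n\nNext paragraph";
    let cleaned = collapse_whitespace(&strip_images(md));
    println!("{cleaned}");
}
```

Each step is a pure `&str -> String` transform, which is what keeps the pipeline composable and free of I/O.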

Open source

webclaw is MIT licensed and fully open source. The repository is at github.com/0xMassi/webclaw.

Architecture

The project is a Rust workspace split into focused crates. The core extraction engine has zero network dependencies and is WASM-compatible.

workspace
webclaw/
  crates/
    webclaw-core/     # Extraction engine. WASM-safe. Zero network deps.
                      # Readability scoring, noise filtering, markdown
                      # conversion, LLM optimization, CSS selector
                      # filtering, diff engine, brand extraction.

    webclaw-fetch/    # HTTP client via Impit. Crawler. Sitemap discovery.
                      # Batch operations. Proxy pool rotation.

    webclaw-llm/      # LLM provider chain (Ollama -> OpenAI -> Anthropic).
                      # JSON schema extraction, prompt extraction,
                      # summarization.

    webclaw-pdf/      # PDF text extraction via pdf-extract.

    webclaw-server/   # axum REST API. Auth, CORS, gzip, job management.

    webclaw-mcp/      # MCP server over stdio transport. 8 tools for
                      # AI agents.

    webclaw-cli/      # CLI binary.

webclaw-core

The pure extraction engine. Takes raw HTML as a string, returns structured output. No network calls, no I/O -- just parsing and scoring. This is what makes the core WASM-compatible.

Key modules: readability-style content scoring with text density and link density penalties, shared noise filtering (tags, ARIA roles, class/ID patterns, Tailwind-safe), JSON data island extraction for React SPAs and Next.js, HTML to markdown conversion with URL resolution, and a 9-step LLM optimization pipeline.
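The readability-style scoring described above can be sketched in a few lines: reward text length, penalize link density. The function name and weighting are hypothetical; webclaw-core's real scorer considers more signals.

```rust
/// Sketch of readability-style block scoring: long, link-light blocks
/// (article bodies) score high; short, link-heavy blocks (navbars,
/// footers) score near zero. Hypothetical, simplified weighting.
fn score_block(text_len: usize, link_text_len: usize) -> f64 {
    if text_len == 0 {
        return 0.0;
    }
    // Link density: the fraction of the text that is anchor text.
    let link_density = link_text_len as f64 / text_len as f64;
    text_len as f64 * (1.0 - link_density)
}

fn main() {
    let article = score_block(2000, 50); // body text, few links
    let navbar = score_block(120, 110);  // almost all anchor text
    assert!(article > navbar);
    println!("article: {article:.0}, navbar: {navbar:.0}");
}
```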

webclaw-fetch

The HTTP layer. Uses Impit for TLS impersonation with Chrome and Firefox browser profiles. Handles BFS crawling with configurable depth and concurrency, sitemap.xml and robots.txt discovery, multi-URL batch operations, and per-request proxy rotation.
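The BFS crawl with a depth limit can be sketched with a queue and a visited set. This sketch substitutes an in-memory link graph for real HTTP fetches and omits the concurrency, delay, and proxy handling that webclaw-fetch provides; the `crawl` function is hypothetical.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first crawl over a link graph, visiting each page once,
/// up to a maximum depth from the start page.
fn crawl(links: &HashMap<&str, Vec<&str>>, start: &str, max_depth: usize) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut order = Vec::new();
    let mut queue = VecDeque::from([(start.to_string(), 0usize)]);
    seen.insert(start.to_string());
    while let Some((url, depth)) = queue.pop_front() {
        order.push(url.clone());
        if depth == max_depth {
            continue; // depth limit reached; don't enqueue children
        }
        for &next in links.get(url.as_str()).into_iter().flatten() {
            if seen.insert(next.to_string()) {
                queue.push_back((next.to_string(), depth + 1));
            }
        }
    }
    order
}

fn main() {
    // Stand-in for "fetch page, extract same-origin links".
    let links = HashMap::from([
        ("/", vec!["/docs", "/blog"]),
        ("/docs", vec!["/docs/cli", "/"]),
    ]);
    let pages = crawl(&links, "/", 1); // depth 1: start page + direct links
    println!("{pages:?}"); // → ["/", "/docs", "/blog"]
}
```

In a real crawler the link graph is discovered as pages are fetched, but the queue-and-visited-set structure is the same.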

webclaw-llm

LLM provider chain with automatic fallback: tries Ollama first (local, no API key needed), then OpenAI, then Anthropic. Uses plain reqwest (not Impit) since LLM APIs do not need TLS fingerprinting. Supports JSON schema extraction, prompt-based extraction, and summarization.
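The fallback behavior described above amounts to trying providers in order and returning the first success. The trait and struct names below are hypothetical stand-ins, not webclaw-llm's API.

```rust
/// Minimal sketch of a local-first provider chain with fallback.
trait Provider {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Stand-in for a provider that is down (e.g. Ollama not running).
struct Unavailable(&'static str);

/// Stand-in for a working provider that echoes its input.
struct Echo(&'static str);

impl Provider for Unavailable {
    fn complete(&self, _: &str) -> Result<String, String> {
        Err(format!("{} unreachable", self.0))
    }
}

impl Provider for Echo {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[{}] {}", self.0, prompt))
    }
}

/// Try each provider in order; return the first success,
/// or the last error if every provider fails.
fn complete_with_fallback(chain: &[&dyn Provider], prompt: &str) -> Result<String, String> {
    let mut last_err = String::from("empty provider chain");
    for p in chain {
        match p.complete(prompt) {
            Ok(out) => return Ok(out),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}

fn main() {
    // Ollama is down in this sketch, so the chain falls through to OpenAI.
    let ollama = Unavailable("ollama");
    let openai = Echo("openai");
    let out = complete_with_fallback(&[&ollama, &openai], "summarize").unwrap();
    println!("{out}"); // → [openai] summarize
}
```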

Note
The core crate never makes network requests. It takes &str HTML and returns structured data. All HTTP, LLM calls, and PDF parsing happen in the other crates.

Next steps

Getting started -- install webclaw and run your first extraction in under a minute.

CLI reference -- every flag and option for the command-line tool.

REST API -- programmatic access to the full extraction engine.