Self-Hosting
The OSS release ships a single binary, webclaw-server, that exposes the extraction engine over HTTP. For the endpoints that exist in OSS, the JSON request and response shapes match api.webclaw.io. The server is stateless: no database, no job queue. Run it in Docker or build it from source.
OSS vs hosted
| Capability | Self-hosted | api.webclaw.io |
|---|---|---|
| Scrape, map, batch, diff, brand | Yes | Yes |
| Crawl | Synchronous, capped at 500 pages per call | Async jobs, unbounded |
| Extract, summarize | Bring your own LLM (Ollama local, or OpenAI / Anthropic key) | Managed, Haiku + Sonnet chain |
| Anti-bot bypass (Cloudflare, DataDome, WAFs) | No | Yes |
| JS rendering (SPAs) | No | Yes |
| Search, research, agent scrape | No | Yes |
| Watch (scheduled monitors) | No | Yes |
| Auth | Single bearer token | OAuth, API keys, multi-tenant billing |
| Infrastructure | Single binary, stateless | Managed, autoscaled |
Rule of thumb: self-host when you want to keep the extraction pipeline on your own box and point it at sites that don't fight you. Use the hosted API when you hit Cloudflare, JS-rendered SPAs, or need async crawl jobs. You can also mix both: self-host for the easy 80% and point the hard 20% at api.webclaw.io.
Docker
Fastest way to run the server. The image ships all three binaries (webclaw CLI, webclaw-mcp, webclaw-server) and reads WEBCLAW_PORT, WEBCLAW_HOST, and WEBCLAW_API_KEY from the environment.
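A minimal invocation might look like the following. The image name webclaw/webclaw is a placeholder; use whatever name the release actually publishes under:

```shell
# Run the HTTP server on port 3000. The trailing `webclaw-server`
# selects the server binary instead of the default CLI entrypoint.
docker run --rm -p 3000:3000 \
  -e WEBCLAW_PORT=3000 \
  webclaw/webclaw:latest webclaw-server

# Liveness check from the host:
curl http://localhost:3000/health
```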
The trailing webclaw-server argument is important. Without it, the default CMD runs the CLI and prints its --help. That default suits one-shot use (docker run IMAGE https://example.com) but not a long-running server.
With authentication
When WEBCLAW_API_KEY is set, every request to /v1/* must present Authorization: Bearer mysecret. The comparison is constant-time. /health stays public.
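As a sketch (the token mysecret is an example value, and the scrape request body is an assumption based on the endpoint table below):

```shell
# Start with auth enabled:
docker run --rm -p 3000:3000 \
  -e WEBCLAW_API_KEY=mysecret \
  webclaw/webclaw:latest webclaw-server

# Requests to /v1/* need the bearer token...
curl -X POST http://localhost:3000/v1/scrape \
  -H "Authorization: Bearer mysecret" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# ...but /health stays open:
curl http://localhost:3000/health
```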
Docker Compose with Ollama
To use /v1/extract and /v1/summarize, you need an LLM. Ollama runs locally for free.
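A sketch of a compose file (the webclaw image name, volume path, and port are assumptions; the container_name matches the docker exec command below):

```yaml
services:
  webclaw:
    image: webclaw/webclaw:latest        # placeholder image name
    command: webclaw-server
    ports:
      - "3000:3000"
    environment:
      OLLAMA_HOST: http://ollama:11434   # reach Ollama over the compose network
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama
    container_name: ollama
    volumes:
      - ollama-data:/root/.ollama        # persist pulled models

volumes:
  ollama-data:
```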
Pull a model with docker exec -it ollama ollama pull qwen3:8b. The extract/summarize endpoints fall back to OpenAI or Anthropic if Ollama isn't reachable and the corresponding API key is set.
From source
Requires Rust 1.85+ (edition 2024).
A release build (cargo build --release) produces three binaries in target/release/:
| Binary | Role |
|---|---|
| webclaw | CLI (single-shot extraction, crawl; no server). |
| webclaw-server | Stateless REST API (this page). |
| webclaw-mcp | MCP server over stdio for AI agents. |
Default bind is 127.0.0.1 so the CLI case stays safe on a laptop. Pass --host 0.0.0.0 (or WEBCLAW_HOST=0.0.0.0) when you want to expose it on a network. The Docker image already flips this to 0.0.0.0.
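For example (flag names assumed to mirror the env vars in the table below):

```shell
# Build from a checkout of the source tree
cargo build --release

# Local-only (default bind 127.0.0.1):
./target/release/webclaw-server

# Expose on the network:
./target/release/webclaw-server --host 0.0.0.0 --port 3000
```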
Environment variables
Every CLI flag has a matching env var. All are optional — the server starts with sensible defaults.
Server
| Variable | Default | Description |
|---|---|---|
| WEBCLAW_PORT | 3000 | HTTP port. |
| WEBCLAW_HOST | 127.0.0.1 (binary), 0.0.0.0 (Docker) | Bind address. |
| WEBCLAW_API_KEY | -- | Bearer token required on /v1/*. Unset = open mode. |
| RUST_LOG | info,webclaw_server=info | Tracing filter (e.g. debug, webclaw_fetch=trace). |
LLM providers (for /v1/extract and /v1/summarize)
Provider chain tries Ollama first, then OpenAI, then Anthropic. Set at least one. If none are reachable the endpoints return 422 with a readable error.
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | http://localhost:11434 | Ollama API endpoint. |
| OLLAMA_MODEL | qwen3:8b | Default model. |
| OPENAI_API_KEY | -- | Fallback provider. |
| OPENAI_BASE_URL | -- | OpenAI-compatible endpoint (for proxies or self-hosted inference). |
| ANTHROPIC_API_KEY | -- | Fallback provider. |
Endpoints
Response shapes match the hosted API, so swapping between self-hosted and cloud is a base-URL change, not a code change.
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Liveness check. Always open, even with auth enabled. |
| POST | /v1/scrape | Extract one URL. Formats: markdown, text, llm, json, html. |
| POST | /v1/crawl | Synchronous BFS crawl. Capped at 500 pages. |
| POST | /v1/map | Discover URLs via robots.txt + sitemap.xml. |
| POST | /v1/batch | Fetch + extract many URLs. Capped at 100 URLs / 20 concurrent. |
| POST | /v1/extract | LLM-backed structured extraction. Pass a JSON Schema or a prompt. Needs an LLM provider. |
| POST | /v1/summarize | N-sentence summary. Needs an LLM provider. |
| POST | /v1/diff | Compare current content against a prior snapshot. |
| POST | /v1/brand | Extract colors, fonts, logo, favicon from a page. |
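To illustrate the base-URL swap, here is a sketch of the same scrape against either deployment (the request body fields are assumptions based on the table above):

```shell
BASE=http://localhost:3000   # or https://api.webclaw.io, plus your API key
curl -X POST "$BASE/v1/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'
```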
Why your self-hosted server still sees 403s on some sites
The OSS server uses the same TLS-fingerprinting HTTP client as the CLI, which is enough for most of the open web. It does not do: Cloudflare / DataDome / AWS WAF bypass, JS rendering, proxy rotation across residential pools, or session cookie warming across browser instances. Those run on dedicated infrastructure behind api.webclaw.io and are intentionally closed source. If you need them, point the stubborn targets at the hosted API while keeping the easy ones on your box.
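A sketch of that split, assuming the self-hosted server surfaces a 403 when the target blocks it (adjust the condition to the actual error shape, and set WEBCLAW_CLOUD_KEY to your hosted API key):

```shell
#!/usr/bin/env sh
# Try the self-hosted server first; on a 403, retry against the hosted API.
URL="https://example.com"
BODY="{\"url\": \"$URL\"}"

status=$(curl -s -o /tmp/out.json -w '%{http_code}' \
  -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" -d "$BODY")

if [ "$status" = "403" ]; then
  curl -s -X POST https://api.webclaw.io/v1/scrape \
    -H "Authorization: Bearer $WEBCLAW_CLOUD_KEY" \
    -H "Content-Type: application/json" -d "$BODY" > /tmp/out.json
fi
cat /tmp/out.json
```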