# webclaw

## Self-Hosting

The OSS release ships a single binary, `webclaw-server`, that exposes the extraction engine over HTTP. It returns the same JSON shapes as api.webclaw.io for the endpoints that exist in OSS. Stateless: no database, no job queue. Run it in Docker or build from source.

> **Warning**
> Self-hosting gives you the extraction pipeline, not the full cloud platform. See the capability matrix below before deciding where to point your traffic.

## OSS vs hosted

| Capability | Self-hosted | api.webclaw.io |
|---|---|---|
| Scrape, map, batch, diff, brand | Yes | Yes |
| Crawl | Synchronous, capped at 500 pages per call | Async jobs, unbounded |
| Extract, summarize | Bring your own LLM (Ollama local, or OpenAI / Anthropic key) | Managed, Haiku + Sonnet chain |
| Anti-bot bypass (Cloudflare, DataDome, WAFs) | No | Yes |
| JS rendering (SPAs) | No | Yes |
| Search, research, agent scrape | No | Yes |
| Watch (scheduled monitors) | No | Yes |
| Auth | Single bearer token | OAuth, API keys, multi-tenant billing |
| Infrastructure | Single binary, stateless | Managed, autoscaled |

Rule of thumb: self-host when you want to keep the extraction pipeline on your own box and point it at sites that don't fight you. Use the hosted API when you hit Cloudflare, JS-rendered SPAs, or need async crawl jobs. You can also mix both: self-host for the easy 80% and point the hard 20% at api.webclaw.io.
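The mixed setup can be as small as a routing function that picks a base URL per target domain. A Python sketch; the local port and the domain list are assumptions you would maintain yourself, not webclaw configuration:

```python
from urllib.parse import urlparse

SELF_HOSTED = "http://localhost:3000"   # your webclaw-server (assumed port)
HOSTED = "https://api.webclaw.io"

# Domains you already know need anti-bot bypass or JS rendering
# (illustrative list, curated by you).
HARD_DOMAINS = {"protected.example", "spa.example"}

def base_url_for(target_url: str) -> str:
    """Pick which webclaw instance should handle this URL."""
    host = urlparse(target_url).hostname or ""
    if any(host == d or host.endswith("." + d) for d in HARD_DOMAINS):
        return HOSTED
    return SELF_HOSTED
```

Because the response shapes match, the request body stays identical either way; only the base URL changes.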

## Docker

Fastest way to run the server. The image ships three binaries (the `webclaw` CLI, `webclaw-mcp`, and `webclaw-server`) and reads `WEBCLAW_PORT`, `WEBCLAW_HOST`, and `WEBCLAW_API_KEY` from the environment.

### Run the server

```bash
docker run -p 3000:3000 ghcr.io/0xmassi/webclaw:latest webclaw-server
```

The trailing `webclaw-server` is important. Without it, the default CMD runs the CLI and prints its `--help`, which is correct for `docker run IMAGE https://example.com` but not for running a server.

### With authentication

```bash
docker run -p 3000:3000 \
  -e WEBCLAW_API_KEY=mysecret \
  ghcr.io/0xmassi/webclaw:latest webclaw-server
```

When `WEBCLAW_API_KEY` is set, every request to `/v1/*` must present `Authorization: Bearer mysecret`. The comparison is constant-time. `/health` stays public.

### Docker Compose with Ollama

To use `/v1/extract` and `/v1/summarize`, you need an LLM. Ollama runs locally for free.

`docker-compose.yml`:

```yaml
services:
  webclaw:
    image: ghcr.io/0xmassi/webclaw:latest
    command: webclaw-server
    ports:
      - "3000:3000"
    environment:
      - WEBCLAW_API_KEY=mysecret
      - OLLAMA_HOST=http://ollama:11434
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
```
> **Tip**
> After the stack is up, pull a model: `docker exec -it ollama ollama pull qwen3:8b`. The extract/summarize endpoints will fall back to OpenAI or Anthropic if Ollama isn't reachable and the corresponding API key is set.

## From source

Requires Rust 1.85+ (edition 2024).

```bash
git clone https://github.com/0xMassi/webclaw.git
cd webclaw
cargo build --release
```

Produces three binaries in `target/release/`:

| Binary | Role |
|---|---|
| `webclaw` | CLI (single-shot extraction, crawl, no server). |
| `webclaw-server` | Stateless REST API (this page). |
| `webclaw-mcp` | MCP server over stdio for AI agents. |
### Start the server

```bash
./target/release/webclaw-server --port 3000 --api-key mysecret
```

The default bind address is `127.0.0.1`, so a laptop install isn't exposed to the network by accident. Pass `--host 0.0.0.0` (or `WEBCLAW_HOST=0.0.0.0`) when you want to expose it on a network. The Docker image already flips this to `0.0.0.0`.

## Environment variables

Every CLI flag has a matching env var. All are optional — the server starts with sensible defaults.

### Server

| Variable | Default | Description |
|---|---|---|
| `WEBCLAW_PORT` | `3000` | HTTP port. |
| `WEBCLAW_HOST` | `127.0.0.1` (binary), `0.0.0.0` (Docker) | Bind address. |
| `WEBCLAW_API_KEY` | -- | Bearer token required on `/v1/*`. Unset = open mode. |
| `RUST_LOG` | `info,webclaw_server=info` | Tracing filter (e.g. `debug`, `webclaw_fetch=trace`). |

### LLM providers (for `/v1/extract` and `/v1/summarize`)

The provider chain tries Ollama first, then OpenAI, then Anthropic. Set at least one. If none are reachable, the endpoints return 422 with a readable error.

| Variable | Default | Description |
|---|---|---|
| `OLLAMA_HOST` | `http://localhost:11434` | Ollama API endpoint. |
| `OLLAMA_MODEL` | `qwen3:8b` | Default model. |
| `OPENAI_API_KEY` | -- | Fallback provider. |
| `OPENAI_BASE_URL` | -- | OpenAI-compatible endpoint (for proxies or self-hosted inference). |
| `ANTHROPIC_API_KEY` | -- | Fallback provider. |
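The fallback behavior can be pictured as a loop over providers in the documented order. A sketch with the provider calls stubbed out as plain callables; the real server speaks each provider's own API, and the function name here is illustrative:

```python
from typing import Callable

def complete_with_fallback(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],
) -> str:
    """Try each (name, call) pair in order; raise if every one fails."""
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # unreachable host, bad key, etc.
            errors.append(f"{name}: {exc}")
    # The server's analogue of reaching this point is the 422 response
    # with a readable error.
    raise RuntimeError("no LLM provider reachable: " + "; ".join(errors))
```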

## Endpoints

Response shapes match the hosted API, so swapping between self-hosted and cloud is a base-URL change, not a code change.

| Method | Path | Purpose |
|---|---|---|
| GET | `/health` | Liveness check. Always open, even with auth enabled. |
| POST | `/v1/scrape` | Extract one URL. Formats: `markdown`, `text`, `llm`, `json`, `html`. |
| POST | `/v1/crawl` | Synchronous BFS crawl. Capped at 500 pages. |
| POST | `/v1/map` | Discover URLs via robots.txt + sitemap.xml. |
| POST | `/v1/batch` | Fetch + extract many URLs. Capped at 100 URLs / 20 concurrent. |
| POST | `/v1/extract` | LLM-backed structured extraction. Pass a JSON Schema or a prompt. Needs an LLM provider. |
| POST | `/v1/summarize` | N-sentence summary. Needs an LLM provider. |
| POST | `/v1/diff` | Compare current content against a prior snapshot. |
| POST | `/v1/brand` | Extract colors, fonts, logo, favicon from a page. |
### Smoke test

```bash
# Liveness
curl http://localhost:3000/health

# Scrape a page
curl -X POST http://localhost:3000/v1/scrape \
  -H "content-type: application/json" \
  -H "authorization: Bearer mysecret" \
  -d '{"url":"https://example.com","formats":["markdown"]}'
```

## Why your self-hosted server still sees 403s on some sites

The OSS server uses the same TLS-fingerprinting HTTP client as the CLI, which is enough for most of the open web. It does not do:

- Cloudflare / DataDome / AWS WAF bypass
- JS rendering
- proxy rotation across residential pools
- session cookie warming across browser instances

Those run on dedicated infrastructure behind api.webclaw.io and are intentionally closed source. If you need them, point the stubborn targets at the hosted API while keeping the easy ones on your box.
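One way to wire up that handoff is to try the self-hosted server first and resend the identical body to the hosted API when the response isn't a 200. A sketch with the HTTP call injected as a callable so the flow is visible without a network; the `post` function and its `{"status": ...}` return shape are assumptions for illustration, and the exact status a blocked fetch surfaces as may differ:

```python
def scrape_with_fallback(
    url: str,
    post,  # callable(endpoint, json_body) -> dict with a "status" key (assumed)
    self_hosted: str = "http://localhost:3000",
    hosted: str = "https://api.webclaw.io",
) -> dict:
    """Scrape via the self-hosted server, falling back to the hosted API."""
    body = {"url": url, "formats": ["markdown"]}
    resp = post(self_hosted + "/v1/scrape", body)
    if resp.get("status") == 200:
        return resp
    # Anti-bot blocks land here; hand the target to the hosted infrastructure.
    return post(hosted + "/v1/scrape", body)
```

The request body is unchanged between the two calls, which is what the matching response shapes buy you.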