# webclaw Documentation

> webclaw is a Rust-based web extraction toolkit that turns any website into LLM-ready markdown, JSON, plain text, or token-optimized output -- without a headless browser.

- Source: [github.com/0xMassi/webclaw](https://github.com/0xMassi/webclaw)
- Cloud API: [api.webclaw.io](https://api.webclaw.io)
- License: MIT

## Table of Contents

- [Introduction](#introduction)
- [Getting Started](#getting-started)
- [CLI Reference](#cli-reference)
- [REST API](#rest-api)
  - [Overview](#rest-api-overview)
  - [POST /v1/scrape](#post-v1scrape)
  - [POST /v1/crawl](#post-v1crawl)
  - [GET /v1/crawl/{id}](#get-v1crawlid)
  - [POST /v1/batch](#post-v1batch)
  - [POST /v1/map](#post-v1map)
  - [POST /v1/extract](#post-v1extract)
  - [POST /v1/summarize](#post-v1summarize)
  - [POST /v1/diff](#post-v1diff)
  - [POST /v1/brand](#post-v1brand)
- [MCP Server](#mcp-server)
- [Self-Hosting](#self-hosting)
- [Cloud API](#cloud-api)
- [SDKs](#sdks)

---

## Introduction

webclaw is a web extraction toolkit built in Rust. It turns any website into LLM-ready markdown, JSON, plain text, or token-optimized output -- without a headless browser. All extraction happens over raw HTTP using Impit TLS impersonation, making it fast, lightweight, and deployable anywhere.

### Three binaries, one engine

webclaw ships as three standalone binaries, all powered by the same extraction core:

- **webclaw** -- The CLI. Extract, crawl, summarize, and track changes from the terminal. Pipe output to files, chain with other tools, or use interactively.
- **webclaw-server** -- The REST API. An axum-based HTTP server with authentication, CORS, gzip compression, and async job management. Every extraction feature is available as a JSON endpoint.
- **webclaw-mcp** -- The MCP server. Exposes 8 tools over the Model Context Protocol (stdio transport) for use with Claude Desktop, Claude Code, and any MCP-compatible AI client.

### Key features

- **No headless browser.** Pure HTTP extraction via Impit TLS impersonation. No Playwright, no Puppeteer, no Chrome. Fast and lightweight.
- **4 output formats.** Markdown, plain text, JSON, and LLM-optimized (a 9-step pipeline that includes image stripping, emphasis removal, link dedup, stat merging, and whitespace collapse).
- **CSS selector filtering.** Include or exclude content by CSS selector. Extract only article bodies, skip navbars and footers.
- **Crawling and sitemap discovery.** BFS same-origin crawler with configurable depth, concurrency, and delay. Sitemap.xml and robots.txt discovery built in.
- **Content change tracking.** Snapshot pages as JSON and diff against future extractions to detect what changed.
- **Brand extraction.** Extract brand identity -- colors, fonts, logo URL, favicon -- from DOM structure and CSS analysis.
- **LLM integration.** Provider chain: Ollama (local-first) then OpenAI then Anthropic. JSON schema extraction, prompt-based extraction, and summarization.
- **PDF extraction.** Auto-detected via Content-Type header. Text extraction from PDF documents without external dependencies.
- **Proxy rotation.** Per-request proxy rotation from a pool file. Auto-loads proxies.txt from the working directory.
- **Browser impersonation.** Chrome (v142, v136, v133, v131) and Firefox (v144, v135, v133, v128) TLS fingerprint profiles. Random mode available.

### Open source

webclaw is MIT licensed and fully open source. The repository is at [github.com/0xMassi/webclaw](https://github.com/0xMassi/webclaw).

### Architecture

The project is a Rust workspace split into focused crates. The core extraction engine has zero network dependencies and is WASM-compatible.

```text
webclaw/
  crates/
    webclaw-core/     # Extraction engine. WASM-safe. Zero network deps.
                      # Readability scoring, noise filtering, markdown
                      # conversion, LLM optimization, CSS selector
                      # filtering, diff engine, brand extraction.

    webclaw-fetch/    # HTTP client via Impit. Crawler. Sitemap discovery.
                      # Batch operations. Proxy pool rotation.

    webclaw-llm/      # LLM provider chain (Ollama -> OpenAI -> Anthropic).
                      # JSON schema extraction, prompt extraction,
                      # summarization.

    webclaw-pdf/      # PDF text extraction via pdf-extract.

    webclaw-server/   # axum REST API. Auth, CORS, gzip, job management.

    webclaw-mcp/      # MCP server over stdio transport. 8 tools for
                      # AI agents.

    webclaw-cli/      # CLI binary.
```

**webclaw-core** -- The pure extraction engine. Takes raw HTML as a string, returns structured output. No network calls, no I/O -- just parsing and scoring. This is what makes the core WASM-compatible. Key modules: readability-style content scoring with text density and link density penalties, shared noise filtering (tags, ARIA roles, class/ID patterns, Tailwind-safe), JSON data island extraction for React SPAs and Next.js, HTML to markdown conversion with URL resolution, and a 9-step LLM optimization pipeline.

**webclaw-fetch** -- The HTTP layer. Uses Impit for TLS impersonation with Chrome and Firefox browser profiles. Handles BFS crawling with configurable depth and concurrency, sitemap.xml and robots.txt discovery, multi-URL batch operations, and per-request proxy rotation.

**webclaw-llm** -- LLM provider chain with automatic fallback: tries Ollama first (local, no API key needed), then OpenAI, then Anthropic. Uses plain reqwest (not Impit) since LLM APIs do not need TLS fingerprinting. Supports JSON schema extraction, prompt-based extraction, and summarization.

> **Note:** The core crate never makes network requests. It takes `&str` HTML and returns structured data. All HTTP, LLM calls, and PDF parsing happen in the other crates.

---

## Getting Started

Get webclaw installed and extract your first page in under a minute. Choose from cargo install, building from source, or Docker.

### Installation

#### From crates.io

The fastest way to install. Requires a working Rust toolchain.

```bash
cargo install webclaw
```

#### From source

Clone the repository and build all three binaries in release mode.

```bash
git clone https://github.com/0xMassi/webclaw
cd webclaw
cargo build --release
```

The binaries will be at `target/release/webclaw`, `target/release/webclaw-server`, and `target/release/webclaw-mcp`.

> **Tip:** The workspace uses patched `rustls` and `h2` forks for Impit TLS impersonation. These are configured via `[patch.crates-io]` in the workspace Cargo.toml -- no manual setup needed.

### Docker

Pull the official image and run the API server in a container.

```bash
docker pull ghcr.io/0xmassi/webclaw:latest
```

```bash
# run the API server
docker run -p 3000:3000 ghcr.io/0xmassi/webclaw:latest
```

The server will be available at `http://localhost:3000`. Add `-e WEBCLAW_API_KEY=your_key` to enable authentication.

### Your first extraction

The simplest usage: pass a URL. webclaw extracts the main content and outputs clean markdown by default.

```bash
# markdown output (default)
webclaw https://example.com
```

Switch to LLM-optimized output for the most token-efficient representation. This runs a 9-step pipeline whose steps include stripping images, removing emphasis, deduplicating links, merging stat blocks, and collapsing whitespace.

```bash
# LLM-optimized output
webclaw https://example.com -f llm
```

Use JSON format to get the full ExtractionResult with metadata, content, word count, and extracted URLs.

```bash
# JSON output
webclaw https://example.com -f json
```

### Start the API server

The REST API exposes every extraction feature as an HTTP endpoint. Start the server on any port.

```bash
webclaw-server --port 3000
```

Test it with a scrape request:

```bash
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

> **Note:** By default the server runs without authentication. Pass `--api-key your_secret` to require a Bearer token on all requests.

### MCP server

The MCP server lets AI agents use webclaw as a tool. It communicates over stdio transport and works with Claude Desktop, Claude Code, and any MCP-compatible client.

Add webclaw to your Claude Desktop configuration. On macOS, the file is at `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "/path/to/webclaw-mcp"
    }
  }
}
```

Replace `/path/to/webclaw-mcp` with the absolute path to the binary (e.g. the full path to `target/release/webclaw-mcp` if built from source).

The MCP server exposes 8 tools:

| Tool | Description |
|------|-------------|
| scrape | Extract content from a single URL |
| crawl | BFS crawl a website with depth control |
| map | Discover URLs from sitemap.xml and robots.txt |
| batch | Extract content from multiple URLs |
| extract | LLM-powered JSON schema or prompt extraction |
| summarize | LLM-powered content summarization |
| diff | Track content changes between snapshots |
| brand | Extract brand identity (colors, fonts, logo) |

### Cloud API

For managed infrastructure, sign up at `webclaw.io` and create an API key from the dashboard. Keys are prefixed with `wc_`.

```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown", "llm"]}'
```

The cloud API uses the same endpoints and request format as the self-hosted server. Every example in this documentation works with both -- just swap the base URL and add the Authorization header.
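
Because the two deployments differ only in base URL and the Authorization header, a client can treat both as parameters. The following is a minimal sketch using only the Python standard library; `build_request` is a hypothetical helper name, not part of any webclaw SDK, and it builds the request without sending it.

```python
import json
import urllib.request


def build_request(base_url, path, payload, api_key=None):
    """Build (but do not send) a JSON POST request for a webclaw endpoint."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Needed for the cloud API and for self-hosted servers started with an API key.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


payload = {"url": "https://example.com", "formats": ["markdown", "llm"]}

# Same endpoint and body; only the base URL and the key differ.
cloud = build_request("https://api.webclaw.io", "/v1/scrape", payload, api_key="wc_your_key")
local = build_request("http://localhost:3000", "/v1/scrape", payload)
```

Passing either object to `urllib.request.urlopen` sends the request.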

> **Tip:** The cloud API includes a free tier. No credit card required to start building.

---

## CLI Reference

Complete reference for the `webclaw` command-line tool. Every flag, every option, with practical examples.

### Basic extraction

Pass one or more URLs as positional arguments. webclaw fetches each page, extracts the main content, and outputs clean markdown.

| Usage | Description |
|------|-------------|
| `webclaw <url>` | Extract a single URL. |
| `webclaw url1 url2 url3` | Batch extract multiple URLs in one command. |
| `--urls-file <file>` | Read URLs from a file, one per line. |
| `--file <path>` | Read HTML from a local file instead of fetching. |
| `--stdin` | Read HTML from stdin. |

```bash
# Single URL
webclaw https://example.com

# Multiple URLs
webclaw https://example.com https://news.ycombinator.com

# From a file list
webclaw --urls-file urls.txt

# Local HTML file
webclaw --file page.html

# Pipe from another command
curl -s https://example.com | webclaw --stdin
```

### Output formats

Control the output format with the `-f` flag. The default is markdown.

| Flag | Description |
|------|-------------|
| `-f markdown` | Clean markdown with resolved URLs and collected assets. This is the default. |
| `-f text` | Plain text with no formatting. |
| `-f json` | Full ExtractionResult as JSON. Includes metadata, content, word count, and extracted URLs. |
| `-f llm` | LLM-optimized output. 9-step pipeline that includes image stripping, emphasis removal, link deduplication, stat merging, and whitespace collapse. Includes a metadata header. |
| `--metadata` | Include page metadata (title, description, OG tags) in the output. |
| `--raw-html` | Output the raw HTML response without any extraction processing. |

```bash
# Default markdown
webclaw https://example.com

# LLM-optimized for minimal token usage
webclaw https://example.com -f llm

# Full JSON with metadata
webclaw https://example.com -f json

# Plain text
webclaw https://example.com -f text

# Markdown with metadata header
webclaw https://example.com --metadata

# Save JSON snapshot for later diffing
webclaw https://example.com -f json > snapshot.json
```

> **Tip:** The `llm` format typically uses about 67% fewer tokens than raw HTML while preserving the meaningful content. Use it when feeding content to language models.
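
To give a feel for what such a pipeline does, here is a rough Python sketch of three of the named steps (image stripping, emphasis removal, whitespace collapse) applied to markdown. It is an illustration only: the function name and regular expressions are assumptions, it does not reproduce webclaw's actual 9-step pipeline, and the naive emphasis pattern would also rewrite asterisks used for other purposes.

```python
import re


def rough_llm_optimize(markdown):
    """Approximate a few of the named steps; not webclaw's actual pipeline."""
    # Image stripping: drop ![alt](src) entirely.
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", markdown)
    # Emphasis removal: unwrap **bold** first, then *italic*.
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    text = re.sub(r"\*(.+?)\*", r"\1", text)
    # Whitespace collapse: squeeze runs of spaces/tabs, then runs of blank lines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()


sample = "# Title\n\n![logo](logo.png)\n\n\nSome **bold** and *italic*   text.\n"
optimized = rough_llm_optimize(sample)
```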

### Content filtering

Use CSS selectors to control what content is extracted. Include mode is exclusive -- only matched elements are returned. Exclude mode removes matched elements from the normal extraction.

| Flag | Description |
|------|-------------|
| `--include <selectors>` | CSS selectors to extract. Comma-separated. Exclusive mode: only these elements are returned. |
| `--exclude <selectors>` | CSS selectors to remove from extraction. Comma-separated. |
| `--only-main-content` | Extract only `article`, `main`, or `[role="main"]` elements. |

```bash
# Extract only the article body
webclaw https://example.com --include "article"

# Extract specific sections
webclaw https://example.com --include ".post-content, .comments"

# Remove navigation and footer noise
webclaw https://example.com --exclude "nav, footer, .sidebar"

# Combine both
webclaw https://example.com --include "main" --exclude ".ads, .related-posts"

# Quick mode: just the main content area
webclaw https://example.com --only-main-content
```

### Browser impersonation

webclaw uses Impit to impersonate real browser TLS fingerprints, so requests present the same TLS handshake profile as the impersonated browser. Each browser option includes multiple version profiles.

| Flag | Description |
|------|-------------|
| `-b chrome` | Chrome profiles: v142, v136, v133, v131. This is the default. |
| `-b firefox` | Firefox profiles: v144, v135, v133, v128. |
| `-b random` | Random browser profile per request. Useful for bulk extraction. |

```bash
# Default Chrome impersonation
webclaw https://example.com

# Firefox fingerprint
webclaw https://example.com -b firefox

# Random profile per request (good for batch)
webclaw url1 url2 url3 -b random
```

> **Note:** This is TLS fingerprint impersonation, not a headless browser. No browser engine is launched. Requests complete in milliseconds, not seconds.

### Proxy

Route requests through HTTP proxies. Supports single proxy or pool rotation.

| Flag | Description |
|------|-------------|
| `-p <url>` | Single proxy. Format: `http://user:pass@host:port` |
| `--proxy-file <file>` | Proxy pool file. One proxy per line in `host:port:user:pass` format. Rotates per request. |

```bash
# Single proxy
webclaw https://example.com -p http://user:pass@proxy.example.com:8080

# Proxy pool with rotation
webclaw https://example.com --proxy-file proxies.txt

# Batch extraction with proxy rotation
webclaw --urls-file urls.txt --proxy-file proxies.txt
```

> **Tip:** webclaw auto-loads a file named `proxies.txt` from the working directory if present. No flag needed. Proxy rotation is per-request, not per-client, so each request in a batch or crawl uses a different proxy from the pool.
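
The two flags use different notations: `-p` takes a proxy URL, while the pool file uses colon-separated fields. If you keep proxies in one notation and need the other, a conversion might look like the sketch below. The helper names are hypothetical, and skipping blank lines is this sketch's choice rather than documented webclaw behavior.

```python
from urllib.parse import quote


def pool_line_to_proxy_url(line):
    """Convert 'host:port:user:pass' into 'http://user:pass@host:port'."""
    # Split on the first three colons only, so a password may itself contain ':'.
    host, port, user, password = line.strip().split(":", 3)
    # Percent-encode credentials so characters like '@' do not break the URL.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def load_pool(text):
    """Parse pool-file contents into proxy URLs, skipping blank lines."""
    return [pool_line_to_proxy_url(line) for line in text.splitlines() if line.strip()]


proxies = load_pool("proxy.example.com:8080:user:pass\n\n10.0.0.1:3128:bob:p@ss:word\n")
```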

### Crawling

BFS same-origin crawler. Discovers and extracts pages by following links that stay on the starting URL's origin.

| Flag | Description |
|------|-------------|
| `--crawl` | Enable BFS crawling from the given URL. |
| `--depth <n>` | Maximum crawl depth. Default: 1. |
| `--max-pages <n>` | Maximum number of pages to crawl. Default: 20. |
| `--concurrency <n>` | Number of parallel requests. Default: 5. |
| `--delay <ms>` | Delay between requests in milliseconds. Default: 100. |
| `--path-prefix <path>` | Only crawl URLs whose path starts with this prefix. |
| `--sitemap` | Seed the crawl queue from sitemap discovery before starting BFS. |

```bash
# Basic crawl, 1 level deep
webclaw https://docs.example.com --crawl

# Deep crawl with limits
webclaw https://docs.example.com --crawl --depth 3 --max-pages 100

# Faster crawl with more concurrency
webclaw https://docs.example.com --crawl --concurrency 10 --delay 50

# Only crawl the /api/ section
webclaw https://docs.example.com --crawl --path-prefix /api/

# Seed from sitemap first, then crawl
webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap
```

> **Warning:** Crawling is same-origin only. webclaw will not follow links to external domains. Respect the target site by keeping concurrency and depth reasonable.
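
The interaction between `--depth`, `--max-pages`, and `--path-prefix` is easiest to see in a small model. The sketch below runs a breadth-first traversal over an in-memory link graph. It illustrates the general technique and is not webclaw's implementation: `bfs_crawl` and `get_links` are hypothetical names, and concurrency, delay, and sitemap seeding are left out.

```python
from collections import deque
from urllib.parse import urlsplit


def bfs_crawl(start, get_links, max_depth=1, max_pages=20, path_prefix=None):
    """Return URLs in BFS visit order, staying on the start URL's origin."""
    origin = urlsplit(start)[:2]  # (scheme, host[:port])
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            parts = urlsplit(link)
            if parts[:2] != origin or link in seen:
                continue  # external origin or already queued
            if path_prefix and not parts.path.startswith(path_prefix):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Toy site: each URL maps to the links found on that page.
site = {
    "https://docs.example.com/": [
        "https://docs.example.com/api/a",
        "https://other.example.org/",
        "https://docs.example.com/blog",
    ],
    "https://docs.example.com/api/a": ["https://docs.example.com/api/b"],
}
order = bfs_crawl("https://docs.example.com/", lambda u: site.get(u, []), max_depth=2)
```

In this toy run the external link is never queued, and `/api/b` is visited but not expanded because it sits at the depth limit.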

### Sitemap discovery

Discover all URLs from a site's sitemap.xml and robots.txt without crawling.

| Flag | Description |
|------|-------------|
| `--map` | Discover URLs from sitemap.xml and robots.txt. Outputs the URL list. |

```bash
# Discover all URLs from sitemap
webclaw https://docs.example.com --map

# Save discovered URLs to a file, then extract them
webclaw https://docs.example.com --map > urls.txt
webclaw --urls-file urls.txt -f llm
```

### Change tracking

Snapshot a page as JSON and compare against a future extraction to see what changed.

| Flag | Description |
|------|-------------|
| `--diff-with <file>` | Compare the current extraction against a previous JSON snapshot file. |

```bash
# Take a snapshot
webclaw https://example.com -f json > snapshot.json

# Later, check what changed
webclaw https://example.com --diff-with snapshot.json
```
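
Conceptually, change tracking compares the content stored in the snapshot with a fresh extraction. The sketch below shows that idea with Python's `difflib` on two plain strings. It makes no assumption about the snapshot's field names or about the output format of `--diff-with`; `content_diff` is a hypothetical helper.

```python
import difflib


def content_diff(old_text, new_text):
    """Return unified-diff lines between two extracted-content strings."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="snapshot",
            tofile="current",
            lineterm="",
        )
    )


old = "# Status\n\nAll systems operational."
new = "# Status\n\nPartial outage in EU region."
changes = content_diff(old, new)  # an empty list would mean "no change"
```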

### Brand extraction

Extract brand identity from a website: colors, fonts, logo URL, and favicon. Analyzes both DOM structure and CSS.

| Flag | Description |
|------|-------------|
| `--brand` | Extract brand identity (colors, fonts, logo, favicon) from the page. |

```bash
webclaw https://example.com --brand
```

### LLM features

webclaw can use LLMs to extract structured data, answer questions about page content, or summarize. The provider chain tries Ollama first (local, free), then OpenAI, then Anthropic.

| Flag | Description |
|------|-------------|
| `--extract-json <schema>` | Extract data matching a JSON schema. Pass the schema as a string, or use `@file` to read it from a file. |
| `--extract-prompt <text>` | Natural language extraction. Describe what you want and the LLM extracts it. |
| `--summarize [sentences]` | Summarize the page content. Default: 3 sentences. |
| `--llm-provider <name>` | Force a specific LLM provider: ollama, openai, or anthropic. |
| `--llm-model <name>` | Override the default model for the chosen provider. |
| `--llm-base-url <url>` | Override the provider's base URL (useful for proxies or custom deployments). |

```bash
# Summarize a page
webclaw https://example.com --summarize
webclaw https://example.com --summarize 5

# Natural language extraction
webclaw https://example.com --extract-prompt "Get all pricing tiers"
webclaw https://example.com --extract-prompt "List every author name and their role"

# JSON schema extraction
webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}'

# Schema from a file
webclaw https://example.com --extract-json @schema.json

# Force OpenAI instead of Ollama
webclaw https://example.com --summarize --llm-provider openai

# Use a specific model
webclaw https://example.com --summarize --llm-provider anthropic --llm-model claude-sonnet-4-20250514
```

> **Note:** Ollama runs locally and requires no API key. Install it from `ollama.ai` and webclaw will use it automatically. For OpenAI and Anthropic, set the standard environment variables: `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`.
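
The provider chain described above is an ordered fallback: try the first provider, and move to the next only if it fails. A minimal Python sketch of that pattern follows. The provider functions are stand-ins that simulate behavior, not real client calls, and webclaw's own failure-detection rules may differ.

```python
def first_available(providers, prompt):
    """Try each (name, call) pair in order; return the first successful result."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # e.g. unreachable host or missing API key
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def fake_ollama(prompt):
    # Simulates the case where no local Ollama server is running.
    raise ConnectionError("Ollama is not running")


def fake_openai(prompt):
    return f"summary of {prompt!r}"


chain = [("ollama", fake_ollama), ("openai", fake_openai)]
used, result = first_available(chain, "page text")
```

Here `used` ends up as `"openai"` because the simulated Ollama call raised first.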

### PDF extraction

webclaw auto-detects PDF documents via the Content-Type header and extracts text content. No special flags needed for basic PDF extraction.

| Flag | Description |
|------|-------------|
| `--pdf-mode auto` | Default mode. Extracts text from PDFs. Returns an error if text extraction fails. |
| `--pdf-mode fast` | Returns empty content on extraction failure instead of erroring. |

```bash
# Auto-detected from Content-Type
webclaw https://example.com/report.pdf

# Fast mode (skip failures silently)
webclaw https://example.com/report.pdf --pdf-mode fast
```

### Other options

| Flag | Description |
|------|-------------|
| `-t <seconds>` | Request timeout in seconds. Default: 30. |
| `-v` | Enable verbose logging. Shows request details, timing, and extraction stats. |

```bash
# Longer timeout for slow sites
webclaw https://slow-site.example.com -t 60

# Verbose output for debugging
webclaw https://example.com -v
```

### Complete examples

Common workflows combining multiple flags.

#### Extract a blog post for an LLM

```bash
webclaw https://blog.example.com/post \
  -f llm \
  --include "article" \
  --exclude ".comments, .related-posts"
```

#### Crawl documentation with proxy rotation

```bash
webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 50 \
  --sitemap \
  --proxy-file proxies.txt \
  -b random \
  -f llm
```

#### Extract structured pricing data

```bash
webclaw https://example.com/pricing \
  --extract-json '{
    "type": "object",
    "properties": {
      "plans": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "price": {"type": "string"},
            "features": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    }
  }'
```

#### Monitor a page for changes

```bash
# Initial snapshot
webclaw https://example.com/status -f json > baseline.json

# Check for changes (run periodically)
webclaw https://example.com/status --diff-with baseline.json
```

#### Batch extract with Firefox impersonation

```bash
webclaw \
  https://site-a.com \
  https://site-b.com \
  https://site-c.com \
  -b firefox \
  -f llm \
  --metadata
```

---

## REST API

### REST API Overview

The webclaw REST API gives you programmatic access to the full extraction engine. Every endpoint accepts JSON and returns JSON.

#### Base URL

```text
# Cloud (managed)
https://api.webclaw.io

# Self-hosted
http://localhost:3000
```

#### Authentication

Cloud requests require an API key sent via the Authorization header. Self-hosted servers require one only when an API key is configured (see below).

```http
Authorization: Bearer <api_key>
```

**Cloud:** Create API keys from your dashboard at webclaw.io. Keys are prefixed with `wc_`.

**Self-hosted:** Pass `--api-key` when starting the server, or set the `WEBCLAW_API_KEY` environment variable. If neither is set, the server runs without authentication.

> **Note:** Self-hosted instances with no API key configured accept all requests. Set one before exposing the server to the internet.

#### Request format

All POST endpoints accept a JSON body. Set the Content-Type header accordingly.

```http
Content-Type: application/json
```

#### Response format

All responses are JSON. Successful responses return the data directly. Errors use a consistent shape:

```json
{
  "error": "Human-readable error message"
}
```
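
A client can lean on this consistent shape to turn error responses into exceptions. The Python sketch below keys off the HTTP status code rather than the mere presence of an `error` field, since batch results also carry a per-item `error` key inside successful responses. `WebclawError` and `parse_body` are hypothetical names.

```python
import json


class WebclawError(Exception):
    """Raised when a response carries the documented error shape."""


def parse_body(status, body):
    """Decode a JSON response body; raise if the status marks an error."""
    data = json.loads(body)
    if status >= 400:
        message = data.get("error", "unknown error") if isinstance(data, dict) else "unknown error"
        raise WebclawError(f"HTTP {status}: {message}")
    return data


ok = parse_body(200, '{"url": "https://example.com"}')
```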

#### Rate limiting

Cloud API rate limits are based on your plan tier. Self-hosted instances have no rate limits by default. See the Cloud API section for plan details.

#### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | /v1/scrape | Single URL extraction |
| POST | /v1/crawl | Start async crawl |
| GET | /v1/crawl/{id} | Poll crawl status |
| POST | /v1/batch | Multi-URL extraction |
| POST | /v1/map | Sitemap discovery |
| POST | /v1/extract | LLM JSON extraction |
| POST | /v1/summarize | LLM summarization |
| POST | /v1/diff | Content change tracking |
| POST | /v1/brand | Brand identity extraction |
| GET | /health | Health check + Ollama status |

#### Quick example

```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'
```

---

### POST /v1/scrape

Extract content from a single URL. This is the core endpoint -- one URL in, clean structured content out.

**POST** `/v1/scrape` -- Extract content from a single URL in one or more output formats.

#### Request body

```json
{
  "url": "https://example.com",
  "formats": ["markdown", "llm", "text", "json"],
  "include_selectors": [".article-content"],
  "exclude_selectors": ["nav", ".sidebar"],
  "only_main_content": true
}
```

#### Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | The URL to scrape. |
| `formats` | string[] | No | Output formats to include. Options: `markdown`, `llm`, `text`, `json`. Defaults to `["markdown"]`. |
| `include_selectors` | string[] | No | CSS selectors to extract exclusively. Only content matching these selectors will be included. |
| `exclude_selectors` | string[] | No | CSS selectors to remove from the page before extraction. |
| `only_main_content` | boolean | No | When true, extracts only the main article or content element, ignoring sidebars, headers, and footers. |

> **Tip:** The `llm` format runs a 9-step optimization pipeline that strips images, collapses whitespace, deduplicates links, and reduces token count by ~67% compared to raw HTML. Use it when feeding content to language models.

#### Response

The response includes the requested formats alongside extracted metadata. Only the formats you request are populated.

```json
{
  "url": "https://example.com",
  "metadata": {
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples.",
    "author": null,
    "published_date": null,
    "language": "en",
    "site_name": "Example",
    "image": null,
    "favicon": "https://example.com/favicon.ico",
    "word_count": 1234
  },
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "text": "Example Domain\n\nThis domain is for use in illustrative examples...",
  "llm": "> URL: https://example.com\n> Title: Example Domain\n\nThis domain is for use in illustrative examples...\n\n## Links\n- ...",
  "extraction": { ... }
}
```

#### Metadata fields

| Field | Type | Description |
|-------|------|-------------|
| `title` | string | Page title from OG, meta, or title tag. |
| `description` | string | Page description from meta or OG tags. |
| `author` | string? | Author name if detected. |
| `published_date` | string? | Publication date if found in metadata. |
| `language` | string? | Page language code (e.g. "en"). |
| `site_name` | string? | Site name from OG metadata. |
| `image` | string? | Primary image URL (OG or Twitter Card). |
| `favicon` | string? | Favicon URL. |
| `word_count` | number | Total word count of extracted content. |
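
Since only requested formats are populated and several metadata fields are nullable, client code should read the response defensively. The Python sketch below is one way to do that. The key names come from the sample response above; treating an absent key and a `null` value the same way is an assumption, and `populated_formats` is a hypothetical helper.

```python
CONTENT_KEYS = ("markdown", "text", "llm", "extraction")


def populated_formats(response):
    """List the content keys that are present and non-null in a scrape response."""
    return [key for key in CONTENT_KEYS if response.get(key) is not None]


# Trimmed-down response for a request with "formats": ["markdown", "llm"].
response = {
    "url": "https://example.com",
    "metadata": {"title": "Example Domain", "author": None, "word_count": 1234},
    "markdown": "# Example Domain",
    "llm": "> URL: https://example.com",
}

formats = populated_formats(response)
# Nullable fields such as author need a fallback.
author = response["metadata"].get("author") or "unknown"
```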

#### Examples

**Basic extraction:**

```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://openai.com/blog/gpt-4",
    "formats": ["markdown"]
  }'
```

**LLM-optimized with selector filtering:**

```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.stripe.com/payments/checkout",
    "formats": ["llm"],
    "include_selectors": [".content-container"],
    "exclude_selectors": ["nav", "footer", ".sidebar"],
    "only_main_content": true
  }'
```

**Multiple output formats:**

```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "formats": ["markdown", "llm", "text"]
  }'
```

> **Note:** PDFs are auto-detected via Content-Type. If the URL serves a PDF, webclaw extracts text using its PDF engine instead of HTML parsing.

#### Error responses

```json
// 400 Bad Request
{
  "error": "Missing required field: url"
}

// 401 Unauthorized
{
  "error": "Invalid or missing API key"
}

// 422 Unprocessable Entity
{
  "error": "Failed to fetch URL: connection timeout"
}
```

---

### POST /v1/crawl

Crawl an entire site with BFS traversal. Crawls run asynchronously -- start one, then poll until it completes.

**POST** `/v1/crawl` -- Start an async same-origin crawl from the given URL.

#### Request body

```json
{
  "url": "https://docs.example.com",
  "max_depth": 2,
  "max_pages": 50,
  "use_sitemap": true
}
```

#### Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | Starting URL. The crawl stays within this origin. |
| `max_depth` | number | No | Maximum link depth to follow. Default: 2. |
| `max_pages` | number | No | Maximum number of pages to extract. Default: 50. |
| `use_sitemap` | boolean | No | Seed the crawl queue with URLs from the site's sitemap. Default: false. |

#### Response

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running"
}
```

> **Note:** The crawl ID is a UUID. Store it -- you need it to poll for results.

---

### GET /v1/crawl/{id}

**GET** `/v1/crawl/{id}` -- Get the current status and results of a running or completed crawl.

#### Path parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `id` | string | The crawl job UUID returned by POST /v1/crawl. |

#### Response

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "pages": [
    {
      "url": "https://docs.example.com",
      "markdown": "# Getting Started\n\nWelcome to the documentation...",
      "metadata": {
        "title": "Getting Started",
        "word_count": 842
      }
    },
    {
      "url": "https://docs.example.com/guides/setup",
      "markdown": "# Setup Guide\n\nFollow these steps...",
      "metadata": {
        "title": "Setup Guide",
        "word_count": 1203
      }
    }
  ],
  "total": 50,
  "completed": 48,
  "errors": 2,
  "created_at": "2026-03-12T10:30:00Z"
}
```

#### Status values

| Status | Description |
|--------|-------------|
| `running` | Crawl is in progress. Poll again for updates. |
| `completed` | Crawl finished. All results are available. |
| `failed` | Crawl encountered a fatal error and stopped. |

#### Example

```bash
# Start a crawl
curl -X POST https://api.webclaw.io/v1/crawl \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.stripe.com",
    "max_depth": 2,
    "max_pages": 100,
    "use_sitemap": true
  }'
```

```bash
# Poll for results
curl https://api.webclaw.io/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer wc_your_api_key"
```

> **Tip:** Enable `use_sitemap` to seed the crawl queue with sitemap URLs. This helps discover pages that aren't reachable through link traversal alone.
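
The start-then-poll flow can be wrapped in a small loop that stops once the status leaves `running`. The Python sketch below takes the status fetch as a callable so it stays independent of any HTTP client. The function name, polling interval, and poll limit are assumptions, not documented values.

```python
import time


def wait_for_crawl(fetch_status, interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll until the crawl leaves the 'running' state, then return the final body."""
    for _ in range(max_polls):
        job = fetch_status()  # should return the parsed GET /v1/crawl/{id} body
        if job["status"] == "failed":
            raise RuntimeError(f"crawl {job.get('id')} failed")
        if job["status"] != "running":
            return job
        sleep(interval)
    raise TimeoutError("crawl still running after max_polls attempts")


# Simulated server: one 'running' response, then 'completed'.
responses = iter([
    {"id": "550e8400", "status": "running"},
    {"id": "550e8400", "status": "completed", "pages": []},
])
job = wait_for_crawl(lambda: next(responses), sleep=lambda _seconds: None)
```

In real use, `fetch_status` would issue the GET request shown above and decode its JSON body.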

---

### POST /v1/batch

Extract content from multiple URLs in a single request. Requests are processed concurrently on the server.

**POST** `/v1/batch` -- Extract content from multiple URLs concurrently.

#### Request body

```json
{
  "urls": [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3"
  ],
  "formats": ["markdown", "llm"],
  "concurrency": 5
}
```

#### Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `urls` | string[] | Yes | Array of URLs to extract. |
| `formats` | string[] | No | Output formats. Options: `markdown`, `llm`, `text`, `json`. Defaults to `["markdown"]`. |
| `concurrency` | number | No | Max concurrent requests. Default: 5. |

#### Response

```json
{
  "results": [
    {
      "url": "https://example.com/page-1",
      "markdown": "# Page One\n\nContent of the first page...",
      "metadata": {
        "title": "Page One",
        "word_count": 654
      },
      "error": null
    },
    {
      "url": "https://example.com/page-2",
      "markdown": "# Page Two\n\nContent of the second page...",
      "metadata": {
        "title": "Page Two",
        "word_count": 1102
      },
      "error": null
    },
    {
      "url": "https://example.com/page-3",
      "markdown": null,
      "metadata": null,
      "error": "Failed to fetch: 404 Not Found"
    }
  ]
}
```

> **Note:** Individual URL failures do not fail the entire batch. Check the `error` field on each result to detect per-URL failures.
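
In practice this means separating successes from failures after every batch call. A minimal Python sketch, assuming the response shape shown above (`split_results` is a hypothetical helper):

```python
def split_results(batch_response):
    """Separate successful results from per-URL failures."""
    succeeded, failed = [], {}
    for item in batch_response["results"]:
        if item.get("error"):
            failed[item["url"]] = item["error"]
        else:
            succeeded.append(item)
    return succeeded, failed


batch_response = {
    "results": [
        {"url": "https://example.com/page-1", "markdown": "# Page One", "error": None},
        {"url": "https://example.com/page-3", "markdown": None,
         "error": "Failed to fetch: 404 Not Found"},
    ]
}
succeeded, failed = split_results(batch_response)
```

The `failed` mapping can then drive a retry or a log entry per URL.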

#### Example

```bash
curl -X POST https://api.webclaw.io/v1/batch \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://openai.com/blog/gpt-4",
      "https://anthropic.com/research/claude-3",
      "https://deepmind.google/technologies/gemini"
    ],
    "formats": ["markdown", "llm"],
    "concurrency": 3
  }'
```

> **Tip:** For large URL lists, keep concurrency reasonable (5-10) to avoid overwhelming target servers. The server-side default of 5 is a good starting point.

---

### POST /v1/map

Discover all URLs on a site by parsing robots.txt and sitemap.xml. Recursively resolves sitemap indexes to find every listed page.

**POST** `/v1/map` -- Discover all URLs on a site via sitemap parsing.

#### Request body

```json
{
  "url": "https://docs.example.com"
}
```

#### Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | Base URL of the site to map. |

#### Response

```json
{
  "urls": [
    "https://docs.example.com",
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api/reference",
    "https://docs.example.com/guides/authentication",
    "https://docs.example.com/guides/deployment"
  ],
  "count": 156
}
```

| Field | Type | Description |
|-------|------|-------------|
| `urls` | string[] | All discovered URLs from sitemap parsing. |
| `count` | number | Total number of URLs found. The sample response above lists only the first five of its 156 URLs. |

> **Note:** The map endpoint checks `robots.txt` for sitemap references first, then falls back to `/sitemap.xml`. Sitemap indexes are resolved recursively, so a single request can discover thousands of URLs.
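
The discovery order described in the note can be illustrated with a short standard-library sketch. This is not webclaw's implementation: the function names are hypothetical, and `fetch` is an injected callable (here backed by an in-memory dictionary) so the example runs without network access.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt):
    """Return sitemap URLs declared via `Sitemap:` lines in robots.txt."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def resolve_sitemap(url, fetch, seen):
    """Collect page URLs from a sitemap, following sitemap indexes recursively."""
    if url in seen:  # guard against indexes that reference each other
        return []
    seen.add(url)
    root = ET.fromstring(fetch(url))
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        pages = []
        for child in locs:
            pages.extend(resolve_sitemap(child, fetch, seen))
        return pages
    return locs


def discover(base_url, fetch):
    """Check robots.txt for sitemaps first, then fall back to /sitemap.xml."""
    sitemaps = sitemaps_from_robots(fetch(f"{base_url}/robots.txt"))
    if not sitemaps:
        sitemaps = [f"{base_url}/sitemap.xml"]
    seen, urls = set(), []
    for sitemap in sitemaps:
        urls.extend(resolve_sitemap(sitemap, fetch, seen))
    return urls


# In-memory stand-in for a real site.
FAKE_SITE = {
    "https://docs.example.com/robots.txt":
        "User-agent: *\nSitemap: https://docs.example.com/sitemap-index.xml\n",
    "https://docs.example.com/sitemap-index.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://docs.example.com/sitemap-guides.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://docs.example.com/sitemap-guides.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://docs.example.com/guides/authentication</loc></url>"
        "<url><loc>https://docs.example.com/guides/deployment</loc></url>"
        "</urlset>"
    ),
}

found = discover("https://docs.example.com", FAKE_SITE.__getitem__)
print({"urls": found, "count": len(found)})
```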

#### Example

```bash
curl -X POST https://api.webclaw.io/v1/map \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.stripe.com"}'
```

> **Tip:** Use `/v1/map` to build a URL list, then feed it to `/v1/batch` for bulk extraction. This is faster than crawling when the site has a comprehensive sitemap.
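
One way to wire the two endpoints together is to split the `urls` array from a map response into several batch request bodies. The sketch below only builds the payloads; the helper name and the chunk size of 100 are assumptions, since this documentation does not state a maximum batch size.

```python
def build_batch_payloads(urls, chunk_size=100, formats=("markdown",), concurrency=5):
    """Split a URL list into /v1/batch request bodies of at most `chunk_size` URLs."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    unique = list(dict.fromkeys(urls))  # drop duplicates while keeping order
    return [
        {
            "urls": unique[start:start + chunk_size],
            "formats": list(formats),
            "concurrency": concurrency,
        }
        for start in range(0, len(unique), chunk_size)
    ]


# Stand-in for a /v1/map response with 250 discovered URLs.
map_response = {
    "urls": [f"https://docs.example.com/page-{n}" for n in range(250)],
    "count": 250,
}

payloads = build_batch_payloads(map_response["urls"], chunk_size=100)
print([len(p["urls"]) for p in payloads])  # [100, 100, 50]
```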

---

### POST /v1/extract

Extract structured JSON data from any URL. Provide a JSON schema for typed output, or a natural language prompt for flexible extraction. Both modes use an LLM to parse the page content.

**POST** `/v1/extract` -- Extract structured data from a URL using a JSON schema or natural language prompt.

> **Note:** This endpoint requires an LLM provider. The provider chain tries Ollama (local) first, then falls back to OpenAI, then Anthropic. At least one must be configured.

#### Schema mode

Provide a JSON Schema and the LLM will return data conforming to it. This gives you predictable, typed output.

**Request body:**

```json
{
  "url": "https://example.com/pricing",
  "schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "price": { "type": "number" },
      "currency": { "type": "string" },
      "features": {
        "type": "array",
        "items": { "type": "string" }
      }
    }
  }
}
```

**Response:**

```json
{
  "data": {
    "title": "Pro Plan",
    "price": 49,
    "currency": "USD",
    "features": [
      "Unlimited extractions",
      "Priority support",
      "Custom browser profiles"
    ]
  }
}
```

#### Prompt mode

Describe what you want in plain English. The LLM will determine the structure based on your prompt and the page content.

**Request body:**

```json
{
  "url": "https://example.com/pricing",
  "prompt": "Extract all pricing tiers with name, price, and features"
}
```

**Response:**

```json
{
  "data": {
    "tiers": [
      {
        "name": "Free",
        "price": 0,
        "features": ["500 pages/month", "Community support"]
      },
      {
        "name": "Pro",
        "price": 49,
        "features": ["100k pages/month", "Priority support", "Custom profiles"]
      },
      {
        "name": "Scale",
        "price": 199,
        "features": ["500k pages/month", "Dedicated support", "SLA"]
      }
    ]
  }
}
```

#### Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to extract data from. |
| `schema` | object | No* | JSON Schema defining the desired output structure. |
| `prompt` | string | No* | Natural language description of what to extract. |

> **Warning:** You must provide either `schema` or `prompt`. If both are provided, `schema` takes precedence.
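
A client can enforce this rule before sending anything. The helper below is a hypothetical sketch (not an SDK function): it rejects a request with neither field and, when both are given, sends only `schema` so the precedence is explicit on the client side.

```python
def build_extract_payload(url, schema=None, prompt=None):
    """Build a /v1/extract request body containing exactly one of schema or prompt."""
    if schema is None and not prompt:
        raise ValueError("provide either `schema` or `prompt`")
    payload = {"url": url}
    if schema is not None:
        # Schema takes precedence, so the prompt is left out when both are given.
        payload["schema"] = schema
    else:
        payload["prompt"] = prompt
    return payload


price_schema = {"type": "object", "properties": {"price": {"type": "number"}}}

schema_request = build_extract_payload("https://example.com/pricing", schema=price_schema)
prompt_request = build_extract_payload(
    "https://example.com/pricing",
    prompt="Extract all pricing tiers with name, price, and features",
)
print(sorted(schema_request), sorted(prompt_request))
```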

#### LLM provider chain

The extract endpoint tries LLM providers in this order:

1. **Ollama** (local) -- free, no API key needed. Set `OLLAMA_HOST` if not running on localhost.
2. **OpenAI** -- requires `OPENAI_API_KEY`.
3. **Anthropic** -- requires `ANTHROPIC_API_KEY`.

#### Examples

```bash
# Schema mode
curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/widget",
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" },
        "in_stock": { "type": "boolean" }
      }
    }
  }'
```

```bash
# Prompt mode
curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/team",
    "prompt": "Extract all team members with name, role, and LinkedIn URL"
  }'
```

---

### POST /v1/summarize

Generate a concise summary of any web page. The page is scraped, cleaned, and passed to an LLM, which produces a summary of at most `max_sentences` sentences.

**POST** `/v1/summarize` -- Generate an LLM-powered summary of a web page.

#### Request body

```json
{
  "url": "https://example.com/blog/long-article",
  "max_sentences": 3
}
```

#### Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL of the page to summarize. |
| `max_sentences` | number | No | Maximum number of sentences in the summary. Default: 3. |

#### Response

```json
{
  "summary": "The article discusses the latest advances in web extraction technology, focusing on HTTP-based approaches that avoid headless browsers. Key findings show a 20x speed improvement over Chrome-based solutions. The author recommends Rust-based extractors for production workloads."
}
```

> **Note:** Like the extract endpoint, summarize uses the LLM provider chain: Ollama (local) first, then OpenAI, then Anthropic. At least one must be configured.

#### Example

```bash
curl -X POST https://api.webclaw.io/v1/summarize \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Rust_(programming_language)",
    "max_sentences": 3
  }'
```

---

### POST /v1/diff

Track changes between two snapshots of a web page. Scrape a URL now, compare it against a previous extraction result, and get a unified diff of what changed.

**POST** `/v1/diff` -- Compare a URL's current content against a previous extraction snapshot.

#### Request body

```json
{
  "url": "https://example.com/pricing",
  "previous": {
    "url": "https://example.com/pricing",
    "metadata": {
      "title": "Pricing",
      "word_count": 450
    },
    "markdown": "# Pricing\n\nStarter: $9/mo\nPro: $29/mo",
    "text": "Pricing\n\nStarter: $9/mo\nPro: $29/mo"
  }
}
```

#### Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to scrape for the current version. |
| `previous` | object | Yes | A previous ExtractionResult (the output from a prior `/v1/scrape` call). |

> **Tip:** Save the full JSON response from `/v1/scrape` as your baseline snapshot. Pass it as the `previous` field in subsequent diff requests to track changes over time.
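
A baseline workflow along those lines might look like the sketch below. It assumes the scrape response has already been obtained; the file name and helper names are hypothetical, and a temporary directory is used so the example cleans up after itself.

```python
import json
import tempfile
from pathlib import Path


def save_snapshot(scrape_response, path):
    """Persist a full /v1/scrape response as the baseline snapshot."""
    Path(path).write_text(json.dumps(scrape_response, indent=2), encoding="utf-8")


def build_diff_payload(url, snapshot_path):
    """Load a saved snapshot and wrap it in a /v1/diff request body."""
    previous = json.loads(Path(snapshot_path).read_text(encoding="utf-8"))
    return {"url": url, "previous": previous}


# Stand-in for an earlier /v1/scrape response.
baseline = {
    "url": "https://example.com/pricing",
    "metadata": {"title": "Pricing", "word_count": 450},
    "markdown": "# Pricing\n\nStarter: $9/mo\nPro: $29/mo",
    "text": "Pricing\n\nStarter: $9/mo\nPro: $29/mo",
}

with tempfile.TemporaryDirectory() as tmp:
    snapshot_file = Path(tmp) / "pricing-baseline.json"
    save_snapshot(baseline, snapshot_file)
    payload = build_diff_payload("https://example.com/pricing", snapshot_file)

print(sorted(payload), payload["previous"]["metadata"]["title"])
```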

#### Response

```json
{
  "status": "Changed",
  "text_diff": "--- previous\n+++ current\n@@ -3,2 +3,3 @@\n-Starter: $9/mo\n-Pro: $29/mo\n+Starter: $12/mo\n+Pro: $39/mo\n+Enterprise: $99/mo",
  "metadata_changes": [
    {
      "field": "word_count",
      "old": 450,
      "new": 520
    }
  ],
  "links_added": [
    "https://example.com/enterprise"
  ],
  "links_removed": [],
  "word_count_delta": 70
}
```

#### Response fields

| Field | Type | Description |
|-------|------|-------------|
| `status` | string | "Changed" or "Unchanged". |
| `text_diff` | string | Unified diff of the text content. |
| `metadata_changes` | array | List of metadata fields that changed, with old and new values. |
| `links_added` | string[] | URLs present in the current version but not the previous. |
| `links_removed` | string[] | URLs present in the previous version but removed. |
| `word_count_delta` | number | Difference in word count (positive = content added). |
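
To get a feel for the `text_diff` format, a comparable unified diff can be produced locally with Python's `difflib`. This is only an illustration of the format, not webclaw's diff engine; the zero-context setting and the way `status` is derived here are assumptions.

```python
import difflib

previous_text = "Pricing\n\nStarter: $9/mo\nPro: $29/mo"
current_text = "Pricing\n\nStarter: $12/mo\nPro: $39/mo\nEnterprise: $99/mo"


def unified_text_diff(old, new, context=0):
    """Return a unified diff between two text snapshots as a single string."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="previous",
        tofile="current",
        n=context,      # number of unchanged context lines around each hunk
        lineterm="",    # lines are joined manually below
    )
    return "\n".join(lines)


text_diff = unified_text_diff(previous_text, current_text)
# An empty diff means the two snapshots have identical text.
status = "Changed" if text_diff else "Unchanged"

print(status)
print(text_diff)
```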

#### Example

```bash
curl -X POST https://api.webclaw.io/v1/diff \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "previous": {
      "url": "https://example.com/pricing",
      "metadata": { "title": "Pricing", "word_count": 450 },
      "markdown": "# Pricing\n\nStarter: $9/mo\nPro: $29/mo",
      "text": "Pricing\n\nStarter: $9/mo\nPro: $29/mo"
    }
  }'
```

---

### POST /v1/brand

Extract brand identity from any website. Analyzes DOM structure and CSS to find colors, fonts, logos, and favicons.

**POST** `/v1/brand` -- Extract brand identity (colors, fonts, logos) from a URL.

#### Request body

```json
{
  "url": "https://example.com"
}
```

#### Parameters

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL of the site to analyze. |

#### Response

```json
{
  "colors": [
    { "hex": "#FF5733", "usage": "Primary", "count": 15 },
    { "hex": "#2C3E50", "usage": "Text", "count": 42 },
    { "hex": "#ECF0F1", "usage": "Background", "count": 8 }
  ],
  "fonts": [
    "Inter",
    "Roboto Mono"
  ],
  "logo_url": "https://example.com/images/logo.svg",
  "favicon_url": "https://example.com/favicon.ico"
}
```

#### Response fields

| Field | Type | Description |
|-------|------|-------------|
| `colors` | array | Detected colors with hex value, inferred usage, and occurrence count. |
| `fonts` | string[] | Font families found in stylesheets and inline styles. |
| `logo_url` | string? | URL of the detected logo image, if found. |
| `favicon_url` | string? | Favicon URL from link tags or the default /favicon.ico path. |

> **Note:** Brand extraction works entirely from the HTML and inline CSS -- no headless browser is used. Colors are detected from inline styles, style tags, and common CSS patterns. Results are best on marketing pages and homepages.
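
The general idea of collecting colors from style tags and inline styles can be sketched with regular expressions. This is a simplified illustration, not the actual extraction logic: it only handles hex colors, does not infer `usage`, and would mistake an ID selector such as `#fab` for a color.

```python
import re
from collections import Counter

HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")
STYLE_BLOCK = re.compile(r"<style[^>]*>(.*?)</style>", re.IGNORECASE | re.DOTALL)
INLINE_STYLE = re.compile(r"""style\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)


def normalize(hex_color):
    """Expand #abc to #AABBCC and upper-case the result."""
    digits = hex_color[1:]
    if len(digits) == 3:
        digits = "".join(ch * 2 for ch in digits)
    return "#" + digits.upper()


def count_colors(html):
    """Count hex colors found in <style> blocks and inline style attributes."""
    css_chunks = STYLE_BLOCK.findall(html)
    css_chunks += [content for _quote, content in INLINE_STYLE.findall(html)]
    counts = Counter()
    for chunk in css_chunks:
        counts.update(normalize(color) for color in HEX_COLOR.findall(chunk))
    return [{"hex": hex_value, "count": n} for hex_value, n in counts.most_common()]


SAMPLE_HTML = """
<html><head><style>
  body { color: #2c3e50; background: #ecf0f1; }
  a { color: #FF5733; }
</style></head>
<body>
  <h1 style="color: #ff5733">Acme</h1>
  <p style='border-color:#2C3E50'>Hello</p>
  <button style="background:#FF5733">Buy</button>
</body></html>
"""

colors = count_colors(SAMPLE_HTML)
print(colors)
```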

#### Example

```bash
curl -X POST https://api.webclaw.io/v1/brand \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://stripe.com"}'
```

---

## MCP Server

The webclaw MCP (Model Context Protocol) server exposes the full extraction engine as tools that AI agents can call directly. Works with Claude Desktop, Claude Code, and any MCP-compatible client.

### What is MCP

Model Context Protocol is an open standard for connecting AI models to external tools and data sources. Instead of making HTTP calls manually, an AI agent discovers available tools through the MCP server and calls them natively. The webclaw MCP server communicates over stdio transport and exposes 8 tools covering scraping, crawling, extraction, and more.

### Setup

#### Claude Desktop

Add webclaw to your Claude Desktop config file (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp",
      "env": {
        "WEBCLAW_API_KEY": "<YOUR_API_KEY>"
      }
    }
  }
}
```

Replace `webclaw-mcp` with the full path if not in PATH. The `WEBCLAW_API_KEY` enables automatic cloud API fallback for bot-protected sites (Cloudflare, DataDome, AWS WAF) and JS-rendered SPAs. Without it, extraction works for ~80% of sites via local HTTP. Get a key at https://webclaw.io.
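
If the config file already lists other MCP servers, the webclaw entry has to be merged in rather than pasted over them. The sketch below shows that merge on plain dictionaries; the helper name, the `filesystem` entry, and the `/usr/local/bin` path are hypothetical, and reading or writing the real file is left out.

```python
import copy
import json


def add_webclaw_server(config, command="webclaw-mcp", api_key=None):
    """Return a copy of a Claude Desktop config with a `webclaw` MCP entry added."""
    updated = copy.deepcopy(config)  # leave the caller's dictionary untouched
    entry = {"command": command}
    if api_key:
        entry["env"] = {"WEBCLAW_API_KEY": api_key}
    updated.setdefault("mcpServers", {})["webclaw"] = entry
    return updated


# Stand-in for an existing config that already has another server.
existing = {"mcpServers": {"filesystem": {"command": "some-other-server"}}}

merged = add_webclaw_server(
    existing,
    command="/usr/local/bin/webclaw-mcp",  # full path, for when the binary is not on PATH
    api_key="<YOUR_API_KEY>",
)
print(json.dumps(merged, indent=2))
```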

#### Claude Code

```bash
claude mcp add webclaw webclaw-mcp
```

Or add the JSON config above to your Claude Desktop config file. Claude Code auto-discovers MCP servers from the same config.

### Smart Fetch Architecture

The MCP server uses a local-first approach:

1. **Local fetch** -- Fast, free, no API credits used (~80% of sites work).
2. **Cloud API fallback** -- Automatic when bot protection (Cloudflare, DataDome, AWS WAF) or JS rendering (React, Next.js, Vue SPAs) is detected.

The cloud fallback requires `WEBCLAW_API_KEY`. Without it, bot-protected sites return challenge pages.
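
The local-first flow can be sketched as a small function with injected fetchers. Everything here is illustrative: the marker strings, `looks_blocked`, and the stub fetchers are hypothetical stand-ins, and real bot-protection or SPA detection is considerably more involved.

```python
# Hypothetical markers for a challenge page; not webclaw's detection logic.
CHALLENGE_MARKERS = ("just a moment", "verify you are human")


def looks_blocked(html):
    """Rough stand-in for detecting a bot-protection challenge page."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def smart_fetch(url, local_fetch, cloud_fetch, api_key=None):
    """Try the local fetch first; call the cloud fetch only when it looks blocked."""
    html = local_fetch(url)
    if not looks_blocked(html):
        return {"source": "local", "html": html}
    if not api_key:
        # No key: the caller gets the challenge page the local fetch returned.
        return {"source": "local", "html": html, "blocked": True}
    return {"source": "cloud", "html": cloud_fetch(url, api_key)}


# Stub fetchers so the sketch runs offline.
def fake_local(url):
    if "protected" in url:
        return "<html><title>Just a moment...</title></html>"
    return "<html><body>Plain article text.</body></html>"


def fake_cloud(url, api_key):
    return "<html><body>Content fetched through the cloud API.</body></html>"


open_page = smart_fetch("https://example.com/blog", fake_local, fake_cloud)
blocked_page = smart_fetch("https://example.com/protected", fake_local, fake_cloud)
rescued_page = smart_fetch(
    "https://example.com/protected", fake_local, fake_cloud, api_key="wc_your_api_key"
)
print(open_page["source"], blocked_page.get("blocked"), rescued_page["source"])
```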

### Environment Variables

| Variable | Description |
|----------|-------------|
| `WEBCLAW_API_KEY` | Enables cloud API fallback for bot-protected and JS-rendered sites |
| `OPENAI_API_KEY` | Enables extract and summarize tools (OpenAI provider) |
| `ANTHROPIC_API_KEY` | Enables extract and summarize tools (Anthropic provider) |
| `OLLAMA_HOST` | Custom Ollama URL (default: http://localhost:11434) |

#### Other MCP clients

Any MCP client that supports stdio transport can connect to webclaw-mcp. Point the client at the binary and it will discover all available tools through the standard MCP handshake.

> **Note:** The MCP server uses the `rmcp` crate (the official Rust MCP SDK) and communicates over stdio. No network ports are opened.

### Tools

The MCP server exposes 8 tools. Each tool maps to a corresponding REST API endpoint.

#### 1. scrape

Extract content from a single URL.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to scrape. |
| `format` | string | No | Output format: markdown, llm, text, or json. |
| `include_selectors` | string[] | No | CSS selectors to include exclusively. |
| `exclude_selectors` | string[] | No | CSS selectors to remove. |
| `only_main_content` | boolean | No | Extract only the main content element. |
| `browser` | string | No | Browser profile: chrome, firefox, or random. |

#### 2. crawl

Crawl a website with BFS traversal.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | Starting URL. |
| `depth` | number | No | Max crawl depth. Default: 2. |
| `max_pages` | number | No | Max pages to extract. Default: 50. |
| `concurrency` | number | No | Concurrent requests. Default: 5. |
| `use_sitemap` | boolean | No | Seed queue with sitemap URLs. |
| `format` | string | No | Output format for each page. |

#### 3. map

Discover all URLs on a site via sitemap parsing.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | Base URL of the site to map. |

#### 4. batch

Extract content from multiple URLs concurrently.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `urls` | string[] | Yes | Array of URLs to extract. |
| `format` | string | No | Output format for each URL. |
| `concurrency` | number | No | Max concurrent requests. Default: 5. |

#### 5. extract

Extract structured JSON data using an LLM.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to extract data from. |
| `prompt` | string | No* | Natural language extraction prompt. Required when `schema` is omitted. |
| `schema` | string | No* | JSON schema string defining the output structure. Required when `prompt` is omitted. |

#### 6. summarize

Generate a concise summary of a web page.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to summarize. |
| `max_sentences` | number | No | Max sentences in summary. Default: 3. |

#### 7. diff

Track content changes between snapshots.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL to scrape for current version. |
| `previous_snapshot` | string | Yes | JSON string of a previous extraction result. |
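
Note that the `extract` and `diff` tools take `schema` and `previous_snapshot` as JSON strings, whereas the REST endpoints take JSON objects. A minimal sketch of preparing those arguments, assuming the tool arguments are assembled as plain dictionaries:

```python
import json

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
}

previous = {
    "url": "https://example.com/pricing",
    "metadata": {"title": "Pricing", "word_count": 450},
    "markdown": "# Pricing\n\nStarter: $9/mo\nPro: $29/mo",
    "text": "Pricing\n\nStarter: $9/mo\nPro: $29/mo",
}

# Serialize the nested objects so each argument is a single string.
extract_args = {
    "url": "https://example.com/pricing",
    "schema": json.dumps(schema),
}
diff_args = {
    "url": "https://example.com/pricing",
    "previous_snapshot": json.dumps(previous),
}

print(type(extract_args["schema"]).__name__, type(diff_args["previous_snapshot"]).__name__)
```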

#### 8. brand

Extract brand identity (colors, fonts, logos) from a site.

| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | URL of the site to analyze. |

### Example conversations

Here is how an AI agent might use the webclaw MCP tools in practice.

**User:** Scrape the Stripe pricing page and pull out all the plan names and prices.

**Claude (using webclaw MCP):** I will use the extract tool to pull structured pricing data from the page.

```json
// Tool call: extract
{
  "url": "https://stripe.com/pricing",
  "prompt": "Extract all plan names, monthly prices, and included features"
}
```

**User:** Crawl the Next.js docs and summarize the top 5 pages.

**Claude (using webclaw MCP):** I will first map the site to discover pages, then batch-extract the most important ones so I can summarize them.

```json
// Tool call: map
{
  "url": "https://nextjs.org/docs"
}
```

```json
// Tool call: batch
{
  "urls": [
    "https://nextjs.org/docs",
    "https://nextjs.org/docs/getting-started",
    "https://nextjs.org/docs/routing",
    "https://nextjs.org/docs/rendering",
    "https://nextjs.org/docs/data-fetching"
  ],
  "format": "llm"
}
```

> **Tip:** The MCP server runs the same extraction engine as the REST API and CLI. Every tool produces identical output to its REST API counterpart.

---

## Self-Hosting

Run the webclaw server on your own infrastructure. Choose Docker for the fastest setup, build from source for maximum control, or deploy to Fly.io for managed hosting with your own binary.

### Docker

The quickest way to run webclaw. The image includes the server binary and all dependencies.

```bash
docker run -p 3000:3000 ghcr.io/0xmassi/webclaw:latest
```

#### With authentication

```bash
docker run -p 3000:3000 \
  -e WEBCLAW_API_KEY=mysecret \
  ghcr.io/0xmassi/webclaw:latest
```

#### Docker Compose with Ollama

For LLM features (extract, summarize), run Ollama alongside webclaw.

```yaml
# docker-compose.yml
version: "3.8"

services:
  webclaw:
    image: ghcr.io/0xmassi/webclaw:latest
    ports:
      - "3000:3000"
    environment:
      - WEBCLAW_API_KEY=mysecret
      - OLLAMA_HOST=http://ollama:11434
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

volumes:
  ollama_data:
```

> **Tip:** After starting the compose stack, pull a model into Ollama: `docker compose exec ollama ollama pull qwen3:8b`

### From source

Build the binaries directly from the Rust source. Requires Rust 1.75+.

```bash
git clone https://github.com/0xMassi/webclaw.git
cd webclaw
cargo build --release
```

The build produces three binaries in `target/release/`:

| Binary | Description |
|--------|-------------|
| `webclaw` | CLI tool for extraction, crawling, and more. |
| `webclaw-server` | REST API server (axum). |
| `webclaw-mcp` | MCP server for AI agents. |

```bash
# Start the server
./target/release/webclaw-server --port 3000 --api-key mysecret
```

### Fly.io

Deploy to Fly.io for managed infrastructure with global edge distribution.

```toml
# fly.toml
app = "webclaw"
primary_region = "iad"

[build]
  image = "ghcr.io/0xmassi/webclaw:latest"

[env]
  WEBCLAW_PORT = "8080"
  WEBCLAW_HOST = "0.0.0.0"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true

[[vm]]
  size = "shared-cpu-1x"
  memory = "512mb"
```

```bash
fly launch
fly secrets set WEBCLAW_API_KEY=mysecret
```

### Environment variables

All configuration is done through environment variables. None are required -- the server runs with sensible defaults.

#### Server

| Variable | Default | Description |
|----------|---------|-------------|
| `WEBCLAW_PORT` | 3000 | HTTP port to listen on. |
| `WEBCLAW_HOST` | 0.0.0.0 | Bind address. |
| `WEBCLAW_API_KEY` | -- | API key for authentication. If unset, no auth is required. |
| `WEBCLAW_MAX_CONCURRENCY` | 50 | Max concurrent extraction tasks. |
| `WEBCLAW_JOB_TTL_SECS` | 3600 | How long to keep completed crawl jobs (seconds). |
| `WEBCLAW_MAX_JOBS` | 100 | Maximum number of concurrent crawl jobs. |
| `WEBCLAW_LOG` | info | Tracing filter (e.g. debug, webclaw=trace). |
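
To make the defaults concrete, here is a sketch of how a deployment script might resolve these variables. The `ServerConfig` class and `load_server_config` function are hypothetical and only mirror the table above; the server itself is written in Rust and does not use this code.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerConfig:
    port: int
    host: str
    api_key: Optional[str]
    max_concurrency: int
    job_ttl_secs: int
    max_jobs: int
    log: str


def load_server_config(env: Mapping[str, str]) -> ServerConfig:
    """Resolve server settings from an environment mapping, applying documented defaults."""
    return ServerConfig(
        port=int(env.get("WEBCLAW_PORT", 3000)),
        host=env.get("WEBCLAW_HOST", "0.0.0.0"),
        api_key=env.get("WEBCLAW_API_KEY") or None,  # unset or empty means no auth
        max_concurrency=int(env.get("WEBCLAW_MAX_CONCURRENCY", 50)),
        job_ttl_secs=int(env.get("WEBCLAW_JOB_TTL_SECS", 3600)),
        max_jobs=int(env.get("WEBCLAW_MAX_JOBS", 100)),
        log=env.get("WEBCLAW_LOG", "info"),
    )


defaults = load_server_config({})
custom = load_server_config({"WEBCLAW_PORT": "8080", "WEBCLAW_API_KEY": "mysecret"})
print(defaults.port, defaults.api_key, custom.port, custom.api_key)
```

Passing `os.environ` instead of a dictionary reads the real process environment.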

#### Proxy

| Variable | Default | Description |
|----------|---------|-------------|
| `WEBCLAW_PROXY` | -- | Single proxy URL (http, https, or socks5). |
| `WEBCLAW_PROXY_FILE` | -- | Path to a file with one proxy URL per line. Rotated per-request. |
| `WEBCLAW_ANTIBOT_URL` | -- | Anti-bot service endpoint. |
| `WEBCLAW_ANTIBOT_KEY` | -- | API key for the anti-bot service. |
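
Per-request rotation over a proxy file can be pictured as cycling through the listed URLs. The sketch below assumes simple round-robin order, which this documentation does not specify; the proxy hostnames and function names are placeholders.

```python
import itertools


def parse_proxy_file(text):
    """Read one proxy URL per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def proxy_rotation(proxies):
    """Return an endless iterator that yields proxies in round-robin order."""
    if not proxies:
        raise ValueError("proxy list is empty")
    return itertools.cycle(proxies)


# Stand-in for the contents of the file named by WEBCLAW_PROXY_FILE.
PROXY_FILE = "http://proxy-a.example:8080\nsocks5://proxy-b.example:1080\n\n"

rotation = proxy_rotation(parse_proxy_file(PROXY_FILE))
picked = [next(rotation) for _ in range(3)]  # one proxy per outgoing request
print(picked)
```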

#### Auth and OAuth (cloud features)

| Variable | Default | Description |
|----------|---------|-------------|
| `DATABASE_URL` | -- | PostgreSQL connection string. Enables OAuth and billing. |
| `GOOGLE_CLIENT_ID` | -- | Google OAuth client ID. |
| `GOOGLE_CLIENT_SECRET` | -- | Google OAuth client secret. |
| `GITHUB_CLIENT_ID` | -- | GitHub OAuth client ID. |
| `GITHUB_CLIENT_SECRET` | -- | GitHub OAuth client secret. |
| `WEBCLAW_JWT_SECRET` | -- | JWT signing secret for session tokens. |
| `WEBCLAW_BASE_URL` | -- | Public URL of the server (for OAuth callbacks). |
| `WEBCLAW_FRONTEND_URL` | -- | Frontend URL (for CORS and redirect). |

> **Warning:** The OAuth and billing variables (DATABASE_URL, GOOGLE_CLIENT_*, etc.) are only needed if you are building a multi-tenant deployment with user accounts. For standard self-hosted usage, only WEBCLAW_API_KEY and the LLM provider keys matter.

#### LLM providers

| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_HOST` | http://localhost:11434 | Ollama API endpoint. |
| `OLLAMA_MODEL` | qwen3:8b | Default Ollama model for extraction and summarization. |
| `OPENAI_API_KEY` | -- | OpenAI API key. Enables OpenAI as a fallback provider. |
| `OPENAI_BASE_URL` | -- | Custom OpenAI-compatible endpoint (for proxies or local models). |
| `ANTHROPIC_API_KEY` | -- | Anthropic API key. Enables Anthropic as a fallback provider. |

---

## Cloud API

The webclaw Cloud API is a managed service -- no servers to run, no infrastructure to maintain. Sign up, create an API key, and start extracting.

### Getting started

1. Sign up at [webclaw.io](https://webclaw.io)
2. Create an API key in the dashboard
3. Send requests to `https://api.webclaw.io`

### Authentication

All cloud requests require an API key in the Authorization header. Keys are prefixed with `wc_` for easy identification.

```bash
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_live_abc123def456" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```
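
For environments without curl or an SDK, the same request can be built with Python's standard library. This sketch only constructs the request object; the helper name `build_request` is hypothetical, and sending is left as a commented-out line because it needs a valid key and network access.

```python
import json
import urllib.request


def build_request(path, payload, api_key, base_url="https://api.webclaw.io"):
    """Build an authenticated JSON POST request for a webclaw endpoint."""
    return urllib.request.Request(
        url=f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request("/v1/scrape", {"url": "https://example.com"}, "wc_live_abc123def456")
print(request.get_method(), request.full_url)

# To send it:
# with urllib.request.urlopen(request, timeout=30) as response:
#     result = json.load(response)
```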

### Base URL

```text
https://api.webclaw.io
```

All REST API endpoints described in this documentation are available under this base URL.

### Rate limits

Rate limits are based on your plan tier and measured in pages per billing period.

| Plan | Pages / month | Price |
|------|---------------|-------|
| Free | 500 | $0 |
| Starter | 10,000 | $19/mo |
| Pro | 100,000 | $49/mo |
| Scale | 500,000 | $199/mo |

> **Note:** A "page" is one URL processed by any endpoint. A batch request with 10 URLs counts as 10 pages. Crawl pages are counted as they complete. Map requests count as 1 page regardless of how many URLs are discovered.
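
The counting rules lend themselves to a quick estimate of monthly usage. The sketch below encodes the rules from the note and the plan table above; the function names and the example workload are hypothetical.

```python
# (plan name, pages per month, price in USD per month), from the table above.
PLANS = [
    ("Free", 500, 0),
    ("Starter", 10_000, 19),
    ("Pro", 100_000, 49),
    ("Scale", 500_000, 199),
]


def pages_used(endpoint, url_count=1, crawl_pages_completed=0):
    """Pages consumed by one request, following the counting rules above."""
    if endpoint == "batch":
        return url_count              # one page per URL in the batch
    if endpoint == "crawl":
        return crawl_pages_completed  # counted as pages complete
    return 1                          # map and all single-URL endpoints


def smallest_plan(pages_per_month):
    """Return the first listed plan whose quota covers the volume, or None."""
    for name, quota, _price in PLANS:
        if pages_per_month <= quota:
            return name
    return None  # beyond the listed tiers


# Example workload: each day, one 10-URL batch, one map call, one 50-page crawl.
daily = (
    pages_used("batch", url_count=10)
    + pages_used("map")
    + pages_used("crawl", crawl_pages_completed=50)
)
monthly = daily * 30
print(daily, monthly, smallest_plan(monthly))  # 61 1830 Starter
```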

### Usage tracking

Monitor your usage in real time from the dashboard at [webclaw.io/dashboard/usage](https://webclaw.io/dashboard/usage). The dashboard shows:

- Pages consumed in the current billing period
- Remaining quota
- Historical usage by day
- Breakdown by endpoint

### Billing

Billing periods are calendar months. Your page quota resets on the 1st of each month at 00:00 UTC. Unused pages do not roll over. Upgrades take effect immediately and prorate the cost for the remainder of the current period.

> **Tip:** Need more than 500k pages/month? Contact us for custom enterprise pricing with dedicated infrastructure and SLA guarantees.

---

## SDKs

Official SDKs are available for Python, TypeScript, and Go. All three cover every endpoint with typed clients and idiomatic APIs.

### Python SDK

```bash
pip install webclaw
```

Requires Python 3.9+. The only dependency is httpx. Provides both sync (`Webclaw`) and async (`AsyncWebclaw`) clients.

```python
from webclaw import Webclaw

client = Webclaw("wc_your_api_key")

# Scrape
result = client.scrape("https://example.com", formats=["markdown", "llm"])
print(result.markdown)

# Crawl
job = client.crawl("https://example.com", max_depth=3, max_pages=100)
status = job.wait(interval=2.0, timeout=300.0)
for page in status.pages:
    print(page.url, len(page.markdown or ""))

# Batch
result = client.batch(["https://a.com", "https://b.com"], formats=["markdown"])

# Map
result = client.map("https://example.com")

# Extract
result = client.extract("https://example.com/pricing", prompt="Extract all pricing tiers")

# Summarize
result = client.summarize("https://example.com", max_sentences=3)

# Brand
result = client.brand("https://example.com")
```

Async:

```python
import asyncio
from webclaw import AsyncWebclaw

async def main():
    async with AsyncWebclaw("wc_your_api_key") as client:
        result = await client.scrape("https://example.com")

asyncio.run(main())
```

Options: `base_url` (default: https://api.webclaw.io), `timeout` (default: 30.0s).

Source: https://github.com/0xMassi/webclaw-python

### TypeScript SDK

```bash
npm install webclaw
```

Zero dependencies, native fetch. Requires Node.js 18+. Also works in Bun and Deno.

```typescript
import { Webclaw } from "webclaw";

const client = new Webclaw({ apiKey: "wc_your_api_key" });

// Scrape
const result = await client.scrape({
  url: "https://example.com",
  formats: ["markdown", "llm"],
  only_main_content: true,
});

// Crawl
const job = await client.crawl({ url: "https://example.com", max_depth: 3 });
const status = await job.waitForCompletion({ interval: 2_000, maxWait: 300_000 });

// Batch
const batch = await client.batch({ urls: ["https://a.com", "https://b.com"], formats: ["markdown"] });

// Map
const map = await client.map({ url: "https://example.com" });

// Extract
const extracted = await client.extract({ url: "https://example.com/pricing", prompt: "Extract pricing tiers" });

// Summarize
const summary = await client.summarize({ url: "https://example.com", max_sentences: 3 });

// Brand
const brand = await client.brand({ url: "https://example.com" });
```

Options: `baseUrl` (default: https://api.webclaw.io), `timeout` (default: 30000ms).

Source: https://github.com/0xMassi/webclaw-js

### Go SDK

```bash
go get github.com/0xMassi/webclaw-go
```

Zero dependencies beyond stdlib. Requires Go 1.21+. Every method takes a `context.Context`, and configuration uses functional options.

```go
import (
    "context"
    "time"

    webclaw "github.com/0xMassi/webclaw-go"
)

// Error handling is omitted for brevity; check err after each call.
ctx := context.Background()
client := webclaw.NewClient("wc_your_api_key")

// Scrape
page, err := client.Scrape(ctx, webclaw.ScrapeRequest{
    URL:     "https://example.com",
    Formats: []webclaw.Format{webclaw.FormatMarkdown, webclaw.FormatLLM},
})

// Crawl
job, err := client.Crawl(ctx, webclaw.CrawlRequest{URL: "https://example.com", MaxDepth: 3})
status, err := client.WaitForCrawl(ctx, job.ID, 2*time.Second, 5*time.Minute)

// Batch
batch, err := client.Batch(ctx, webclaw.BatchRequest{
    URLs:    []string{"https://a.com", "https://b.com"},
    Formats: []webclaw.Format{webclaw.FormatMarkdown},
})

// Map
sitemap, err := client.Map(ctx, webclaw.MapRequest{URL: "https://example.com"})

// Extract
extracted, err := client.Extract(ctx, webclaw.ExtractRequest{
    URL:    "https://example.com/pricing",
    Prompt: "Extract all pricing tiers",
})

// Summarize
summary, err := client.Summarize(ctx, webclaw.SummarizeRequest{URL: "https://example.com", MaxSentences: 3})

// Brand
brand, err := client.Brand(ctx, webclaw.BrandRequest{URL: "https://example.com"})
```

Options: `WithBaseURL(url)`, `WithTimeout(d)`, `WithHTTPClient(hc)`.

Source: https://github.com/0xMassi/webclaw-go