# webclaw Documentation > webclaw is a Rust-based web extraction toolkit that turns any website into LLM-ready markdown, JSON, plain text, or token-optimized output -- without a headless browser. - Source: [github.com/0xMassi/webclaw](https://github.com/0xMassi/webclaw) - Cloud API: [api.webclaw.io](https://api.webclaw.io) - License: AGPL-3.0 ## Table of Contents - [Introduction](#introduction) - [Getting Started](#getting-started) - [CLI Reference](#cli-reference) - [REST API](#rest-api) - [Overview](#rest-api-overview) - [POST /v1/scrape](#post-v1scrape) - [POST /v1/crawl](#post-v1crawl) - [GET /v1/crawl/{id}](#get-v1crawlid) - [GET /v1/crawl/{id}/stream](#get-v1crawlidstream) - [GET /v1/crawl/history](#get-v1crawlhistory) - [POST /v1/crawl/{id}/retry](#post-v1crawlidretry) - [POST /v1/batch](#post-v1batch) - [POST /v1/map](#post-v1map) - [POST /v1/search](#post-v1search) - [POST /v1/extract](#post-v1extract) - [POST /v1/summarize](#post-v1summarize) - [POST /v1/diff](#post-v1diff) - [POST /v1/brand](#post-v1brand) - [Webhooks](#webhooks) - [Firecrawl v2 Compatibility](#firecrawl-v2-compatibility) - [MCP Server](#mcp-server) - [Self-Hosting](#self-hosting) - [Cloud API](#cloud-api) - [SDKs](#sdks) --- ## Introduction webclaw is a web extraction toolkit built in Rust. It turns any website into LLM-ready markdown, JSON, plain text, or token-optimized output -- without a headless browser. All extraction happens over raw HTTP using browser-grade TLS impersonation, making it fast, lightweight, and deployable anywhere. ### Three binaries, one engine webclaw ships as three standalone binaries, all powered by the same extraction core: - **webclaw** -- The CLI. Extract, crawl, summarize, and track changes from the terminal. Pipe output to files, chain with other tools, or use interactively. - **webclaw-server** -- The REST API. An axum-based HTTP server with authentication, CORS, gzip compression, and async job management. Every extraction feature is available as a JSON endpoint. - **webclaw-mcp** -- The MCP server. Exposes 8 tools over the Model Context Protocol (stdio transport) for use with Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Codex, Antigravity, and any MCP-compatible AI client. ### Key features - **No headless browser.** Pure HTTP extraction with browser-grade TLS impersonation. No Playwright, no Puppeteer, no Chrome. Fast and lightweight. - **9 output formats.** Markdown, plain text, JSON, LLM-optimized, screenshot, links, rawHtml, attributes, and query. Request any combination per scrape. - **CSS selector filtering.** Include or exclude content by CSS selector. Extract only article bodies, skip navbars and footers. - **Crawling and sitemap discovery.** BFS same-origin crawler with configurable depth, concurrency, and delay. Sitemap.xml and robots.txt discovery built in. - **Content change tracking.** Snapshot pages as JSON and diff against future extractions to detect what changed. - **Brand extraction.** Extract brand identity -- colors, fonts, logo URL, favicon -- from DOM structure and CSS analysis. - **LLM integration.** Provider chain: Ollama (local-first) then OpenAI then Anthropic. JSON schema extraction, prompt-based extraction, prompt-to-schema generation, page-level Q&A, and summarization. - **YouTube transcript extraction.** Auto-detected for `youtube.com/watch`, `youtube.com/shorts/`, and `youtu.be/` URLs. Response carries a structured `youtube` block (channel, channel_url, duration_seconds, view_count, like_count, tags, categories, thumbnail) plus a top-level `transcript` string with the full auto-caption text. Same shape on the SDK clients (`result.youtube`, `result.transcript` / `resp.YouTube`, `resp.Transcript`). - **PDF extraction.** Auto-detected via Content-Type header. Text extraction from PDF documents without external dependencies. - **Document parsing.** Auto-detected from URL extension or Content-Type. DOCX to markdown, XLSX to markdown tables, CSV to markdown tables. No special parameters needed. - **Web search.** Search the web and optionally scrape all result URLs in parallel. One request to go from query to structured content. - **Browser actions and screenshots.** Perform click, type, scroll, wait, press (keyboard events), and executeJavascript actions before extraction. Take screenshots as base64 PNG. Mobile User-Agent support. - **Antibot bypass.** Automatic bot protection detection and bypass for AWS WAF, Cloudflare, DataDome, and hCaptcha. Two-layer JS rendering: lightweight renderer (fast) then Chrome headless (fallback). - **Firecrawl compatibility.** Drop-in replacement for Firecrawl v2 API. Point existing Firecrawl SDK code at webclaw by changing the base URL. - **Proxy rotation.** Per-request proxy rotation from a pool file. Auto-loads proxies.txt from the working directory. - **Browser impersonation.** Chrome (v142, v136, v133, v131) and Firefox (v144, v135, v133, v128) TLS fingerprint profiles. Random mode available. - **Webhooks.** Get notified when crawl jobs complete. Automatic Discord/Slack formatting. HMAC-SHA256 signed payloads. ### Open source webclaw is AGPL-3.0 licensed and fully open source. The repository is at [github.com/0xMassi/webclaw](https://github.com/0xMassi/webclaw). ### Architecture The project is a Rust workspace split into focused crates. The core extraction engine has zero network dependencies and is WASM-compatible. ```text webclaw/ crates/ webclaw-core/ # Extraction engine. WASM-safe. Zero network deps. # Readability scoring, noise filtering, markdown # conversion, LLM optimization, CSS selector # filtering, diff engine, brand extraction. webclaw-fetch/ # HTTP client with browser TLS impersonation. Crawler. Sitemap discovery. # Batch operations. Proxy pool rotation. webclaw-llm/ # LLM provider chain (Ollama -> OpenAI -> Anthropic). # JSON schema extraction, prompt extraction, # summarization. webclaw-pdf/ # PDF text extraction via pdf-extract. webclaw-server/ # axum REST API. Auth, CORS, gzip, job management. webclaw-mcp/ # MCP server over stdio transport. 8 tools for # AI agents. webclaw-cli/ # CLI binary. ``` **webclaw-core** -- The pure extraction engine. Takes raw HTML as a string, returns structured output. No network calls, no I/O -- just parsing and scoring. This is what makes the core WASM-compatible. Key modules: readability-style content scoring with text density and link density penalties, shared noise filtering (tags, ARIA roles, class/ID patterns, Tailwind-safe), JSON data island extraction for React SPAs and Next.js, HTML to markdown conversion with URL resolution, and a 9-step LLM optimization pipeline. **webclaw-fetch** -- The HTTP layer. Uses browser-grade TLS impersonation with Chrome and Firefox browser profiles. Handles BFS crawling with configurable depth and concurrency, sitemap.xml and robots.txt discovery, multi-URL batch operations, and per-request proxy rotation. **webclaw-llm** -- LLM provider chain with automatic fallback: tries Ollama first (local, free), then OpenAI, then Anthropic. Uses plain HTTP since LLM APIs do not need TLS fingerprinting. Supports JSON schema extraction, prompt-based extraction, and summarization. > **Note:** The core crate never makes network requests. It takes `&str` HTML and returns structured data. All HTTP, LLM calls, and PDF parsing happen in the other crates. --- ## Getting Started Get webclaw installed and extract your first page in under a minute. Choose from cargo install, building from source, or Docker. ### Installation #### From crates.io The fastest way to install. Requires a working Rust toolchain. ```bash cargo install webclaw ``` #### From source Clone the repository and build all three binaries in release mode. ```bash git clone https://github.com/0xMassi/webclaw cd webclaw cargo build --release ``` The binaries will be at `target/release/webclaw`, `target/release/webclaw-server`, and `target/release/webclaw-mcp`. > **Tip:** The workspace uses patched `rustls` and `h2` forks for browser-grade TLS impersonation. These are configured via `[patch.crates-io]` in the workspace Cargo.toml -- no manual setup needed. ### Docker Pull the official image and run the API server in a container. ```bash docker pull ghcr.io/0xmassi/webclaw:latest ``` ```bash # run the API server docker run -p 3000:3000 ghcr.io/0xmassi/webclaw:latest ``` The server will be available at `http://localhost:3000`. Add `-e WEBCLAW_API_KEY=your_key` to enable authentication. ### Your first extraction The simplest usage: pass a URL. webclaw extracts the main content and outputs clean markdown by default. ```bash # markdown output (default) webclaw https://example.com ``` Switch to LLM-optimized output for the most token-efficient representation. This runs a 9-step pipeline that strips images, removes emphasis, deduplicates links, merges stat blocks, and collapses whitespace. ```bash # LLM-optimized output webclaw https://example.com -f llm ``` Use JSON format to get the full ExtractionResult with metadata, content, word count, and extracted URLs. ```bash # JSON output webclaw https://example.com -f json ``` ### Start the API server The REST API exposes every extraction feature as an HTTP endpoint. Start the server on any port. ```bash webclaw-server --port 3000 ``` Test it with a scrape request: ```bash curl -X POST http://localhost:3000/v1/scrape \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com"}' ``` > **Note:** By default the server runs without authentication. Pass `--api-key your_secret` to require a Bearer token on all requests. ### MCP server The MCP server lets AI agents use webclaw as a tool. It communicates over stdio transport and works with Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Codex, Antigravity, and any MCP-compatible client. Add webclaw to your Claude Desktop configuration at `~/Library/Application Support/Claude/claude_desktop_config.json`: ```json { "mcpServers": { "webclaw": { "command": "/path/to/webclaw-mcp" } } } ``` Replace `/path/to/webclaw-mcp` with the actual binary path (e.g. `target/release/webclaw-mcp` if built from source). The MCP server exposes 8 tools: | Tool | Description | |------|-------------| | scrape | Extract content from a single URL | | crawl | BFS crawl a website with depth control | | map | Discover URLs from sitemap.xml and robots.txt | | batch | Extract content from multiple URLs | | extract | LLM-powered JSON schema or prompt extraction | | summarize | LLM-powered content summarization | | diff | Track content changes between snapshots | | brand | Extract brand identity (colors, fonts, logo) | ### Cloud API For managed infrastructure, sign up at `webclaw.io` and create an API key from the dashboard. Keys are prefixed with `wc_`. ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_key" \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com", "formats": ["markdown", "llm"]}' ``` The cloud API uses the same endpoints and request format as the self-hosted server. Every example in this documentation works with both -- just swap the base URL and add the Authorization header. > **Tip:** The cloud API offers a 7-day Starter trial. Cancel anytime. The CLI and self-hosted server are AGPL-3.0 and free forever on your own hardware. --- ## CLI Reference Complete reference for the `webclaw` command-line tool. Every flag, every option, with practical examples. ### Basic extraction Pass one or more URLs as positional arguments. webclaw fetches each page, extracts the main content, and outputs clean markdown. | Flag | Description | |------|-------------| | `webclaw ` | Extract a single URL. | | `webclaw url1 url2 url3` | Batch extract multiple URLs in one command. | | `--urls-file ` | Read URLs from a file, one per line. | | `--file ` | Read HTML from a local file instead of fetching. | | `--stdin` | Read HTML from stdin. | ```bash # Single URL webclaw https://example.com # Multiple URLs webclaw https://example.com https://news.ycombinator.com # From a file list webclaw --urls-file urls.txt # Local HTML file webclaw --file page.html # Pipe from another command curl -s https://example.com | webclaw --stdin ``` ### Output formats Control the output format with the `-f` flag. The default is markdown. | Flag | Description | |------|-------------| | `-f markdown` | Clean markdown with resolved URLs and collected assets. This is the default. | | `-f text` | Plain text with no formatting. | | `-f json` | Full ExtractionResult as JSON. Includes metadata, content, word count, and extracted URLs. | | `-f llm` | LLM-optimized output. 9-step pipeline: image stripping, emphasis removal, link deduplication, stat merging, whitespace collapse. Includes a metadata header. | | `--metadata` | Include page metadata (title, description, OG tags) in the output. | | `--raw-html` | Output the raw HTML response without any extraction processing. | ```bash # Default markdown webclaw https://example.com # LLM-optimized for minimal token usage webclaw https://example.com -f llm # Full JSON with metadata webclaw https://example.com -f json # Plain text webclaw https://example.com -f text # Markdown with metadata header webclaw https://example.com --metadata # Save JSON snapshot for later diffing webclaw https://example.com -f json > snapshot.json ``` > **Tip:** The `llm` format typically achieves ~90% fewer tokens than raw HTML (measured on 18 production sites) while preserving all meaningful content. Use it when feeding content to language models. ### Content filtering Use CSS selectors to control what content is extracted. Include mode is exclusive -- only matched elements are returned. Exclude mode removes matched elements from the normal extraction. | Flag | Description | |------|-------------| | `--include ` | CSS selectors to extract. Comma-separated. Exclusive mode: only these elements are returned. | | `--exclude ` | CSS selectors to remove from extraction. Comma-separated. | | `--only-main-content` | Extract only article, main, or [role="main"] elements. | ```bash # Extract only the article body webclaw https://example.com --include "article" # Extract specific sections webclaw https://example.com --include ".post-content, .comments" # Remove navigation and footer noise webclaw https://example.com --exclude "nav, footer, .sidebar" # Combine both webclaw https://example.com --include "main" --exclude ".ads, .related-posts" # Quick mode: just the main content area webclaw https://example.com --only-main-content ``` ### Browser impersonation webclaw impersonates real browser TLS fingerprints. This makes requests indistinguishable from actual browser traffic at the TLS layer. Each browser option includes multiple version profiles. | Flag | Description | |------|-------------| | `-b chrome` | Chrome profiles: v142, v136, v133, v131. This is the default. | | `-b firefox` | Firefox profiles: v144, v135, v133, v128. | | `-b random` | Random browser profile per request. Useful for bulk extraction. | ```bash # Default Chrome impersonation webclaw https://example.com # Firefox fingerprint webclaw https://example.com -b firefox # Random profile per request (good for batch) webclaw url1 url2 url3 -b random ``` > **Note:** This is TLS fingerprint impersonation, not a headless browser. No browser engine is launched. Requests complete in milliseconds, not seconds. ### Proxy Route requests through HTTP proxies. Supports single proxy or pool rotation. | Flag | Description | |------|-------------| | `-p ` | Single proxy. Format: http://user:pass@host:port | | `--proxy-file ` | Proxy pool file. One proxy per line in host:port:user:pass format. Rotates per request. | ```bash # Single proxy webclaw https://example.com -p http://user:pass@proxy.example.com:8080 # Proxy pool with rotation webclaw https://example.com --proxy-file proxies.txt # Batch extraction with proxy rotation webclaw --urls-file urls.txt --proxy-file proxies.txt ``` > **Tip:** webclaw auto-loads a file named `proxies.txt` from the working directory if present. No flag needed. Proxy rotation is per-request, not per-client, so each request in a batch or crawl uses a different proxy from the pool. ### Crawling BFS same-origin crawler. Discovers and extracts pages by following links within the same domain. | Flag | Description | |------|-------------| | `--crawl` | Enable BFS crawling from the given URL. | | `--depth ` | Maximum crawl depth. Default: 1. | | `--max-pages ` | Maximum number of pages to crawl. Default: 20. | | `--concurrency ` | Number of parallel requests. Default: 5. | | `--delay ` | Delay between requests in milliseconds. Default: 100. | | `--path-prefix ` | Only crawl URLs whose path starts with this prefix. | | `--sitemap` | Seed the crawl queue from sitemap discovery before starting BFS. | ```bash # Basic crawl, 1 level deep webclaw https://docs.example.com --crawl # Deep crawl with limits webclaw https://docs.example.com --crawl --depth 3 --max-pages 100 # Faster crawl with more concurrency webclaw https://docs.example.com --crawl --concurrency 10 --delay 50 # Only crawl the /api/ section webclaw https://docs.example.com --crawl --path-prefix /api/ # Seed from sitemap first, then crawl webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap ``` > **Warning:** Crawling is same-origin only. webclaw will not follow links to external domains. Respect the target site by keeping concurrency and depth reasonable. ### Sitemap discovery Discover all URLs from a site's sitemap.xml and robots.txt without crawling. | Flag | Description | |------|-------------| | `--map` | Discover URLs from sitemap.xml and robots.txt. Outputs the URL list. | ```bash # Discover all URLs from sitemap webclaw https://docs.example.com --map # Save discovered URLs to a file, then extract them webclaw https://docs.example.com --map > urls.txt webclaw --urls-file urls.txt -f llm ``` ### Change tracking Snapshot a page as JSON and compare against a future extraction to see what changed. | Flag | Description | |------|-------------| | `--diff-with ` | Compare the current extraction against a previous JSON snapshot file. | ```bash # Take a snapshot webclaw https://example.com -f json > snapshot.json # Later, check what changed webclaw https://example.com --diff-with snapshot.json ``` ### Brand extraction Extract brand identity from a website: colors, fonts, logo URL, and favicon. Analyzes both DOM structure and CSS. | Flag | Description | |------|-------------| | `--brand` | Extract brand identity (colors, fonts, logo, favicon) from the page. | ```bash webclaw https://example.com --brand ``` ### LLM features webclaw can use LLMs to extract structured data, answer questions about page content, or summarize. The provider chain tries Ollama first (local, free), then OpenAI, then Anthropic. | Flag | Description | |------|-------------| | `--extract-json ` | Extract data matching a JSON schema. Pass the schema as a string or use @file to read from a file. | | `--extract-prompt ` | Natural language extraction. Describe what you want and the LLM extracts it. | | `--summarize [sentences]` | Summarize the page content. Default: 3 sentences. | | `--llm-provider ` | Force a specific LLM provider: ollama, openai, or anthropic. | | `--llm-model ` | Override the default model for the chosen provider. | | `--llm-base-url ` | Override the provider's base URL (useful for proxies or custom deployments). | ```bash # Summarize a page webclaw https://example.com --summarize webclaw https://example.com --summarize 5 # Natural language extraction webclaw https://example.com --extract-prompt "Get all pricing tiers" webclaw https://example.com --extract-prompt "List every author name and their role" # JSON schema extraction webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}' # Schema from a file webclaw https://example.com --extract-json @schema.json # Force OpenAI instead of Ollama webclaw https://example.com --summarize --llm-provider openai # Use a specific model webclaw https://example.com --summarize --llm-provider anthropic --llm-model claude-sonnet-4-20250514 ``` > **Note:** Ollama runs locally and requires no API key. Install it from `ollama.ai` and webclaw will use it automatically. For OpenAI and Anthropic, set the standard environment variables: `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`. ### PDF extraction webclaw auto-detects PDF documents via the Content-Type header and extracts text content. No special flags needed for basic PDF extraction. | Flag | Description | |------|-------------| | `--pdf-mode auto` | Default mode. Extracts text from PDFs. Returns an error if text extraction fails. | | `--pdf-mode fast` | Returns empty content on extraction failure instead of erroring. | ```bash # Auto-detected from Content-Type webclaw https://example.com/report.pdf # Fast mode (skip failures silently) webclaw https://example.com/report.pdf --pdf-mode fast ``` ### Other options | Flag | Description | |------|-------------| | `-t ` | Request timeout in seconds. Default: 30. | | `-v` | Enable verbose logging. Shows request details, timing, and extraction stats. | ```bash # Longer timeout for slow sites webclaw https://slow-site.example.com -t 60 # Verbose output for debugging webclaw https://example.com -v ``` ### Complete examples Common workflows combining multiple flags. #### Extract a blog post for an LLM ```bash webclaw https://blog.example.com/post \ -f llm \ --include "article" \ --exclude ".comments, .related-posts" ``` #### Crawl documentation with proxy rotation ```bash webclaw https://docs.example.com \ --crawl \ --depth 2 \ --max-pages 50 \ --sitemap \ --proxy-file proxies.txt \ -b random \ -f llm ``` #### Extract structured pricing data ```bash webclaw https://example.com/pricing \ --extract-json '{ "type": "object", "properties": { "plans": { "type": "array", "items": { "type": "object", "properties": { "name": {"type": "string"}, "price": {"type": "string"}, "features": {"type": "array", "items": {"type": "string"}} } } } } }' ``` #### Monitor a page for changes ```bash # Initial snapshot webclaw https://example.com/status -f json > baseline.json # Check for changes (run periodically) webclaw https://example.com/status --diff-with baseline.json ``` #### Batch extract with Firefox impersonation ```bash webclaw \ https://site-a.com \ https://site-b.com \ https://site-c.com \ -b firefox \ -f llm \ --metadata ``` --- ## REST API ### REST API Overview The webclaw REST API gives you programmatic access to the full extraction engine. Every endpoint accepts JSON and returns JSON. #### Base URL ```text # Cloud (managed) https://api.webclaw.io # Self-hosted http://localhost:3000 ``` #### Authentication All requests require an API key sent via the Authorization header. ```http Authorization: Bearer ``` **Cloud:** Create API keys from your dashboard at webclaw.io. Keys are prefixed with `wc_`. **Self-hosted:** Pass `--api-key` when starting the server, or set the `WEBCLAW_API_KEY` environment variable. If neither is set, the server runs without authentication. > **Note:** Self-hosted instances with no API key configured accept all requests. Set one before exposing the server to the internet. #### Request format All POST endpoints accept a JSON body. Set the Content-Type header accordingly. ```http Content-Type: application/json ``` #### Response format All responses are JSON. Successful responses return the data directly. Errors use a consistent shape: ```json { "error": "Human-readable error message" } ``` #### Rate limiting Cloud API rate limits are based on your plan tier. Self-hosted instances have no rate limits by default. See the Cloud API section for plan details. #### Endpoints | Method | Path | Description | |--------|------|-------------| | POST | /v1/scrape | Single URL extraction with optional browser actions and screenshots | | POST | /v1/crawl | Start async crawl | | GET | /v1/crawl/{id} | Poll crawl status | | GET | /v1/crawl/{id}/stream | SSE stream of crawl progress | | GET | /v1/crawl/history | Paginated crawl job history | | POST | /v1/crawl/{id}/retry | Retry a failed crawl | | POST | /v1/batch | Multi-URL extraction | | POST | /v1/map | Sitemap discovery | | POST | /v1/search | Web search with parallel scraping | | POST | /v1/extract | LLM JSON extraction with prompt-to-schema generation | | POST | /v1/summarize | LLM summarization | | POST | /v1/diff | Content change tracking | | POST | /v1/brand | Brand identity extraction | | POST | /v1/webhooks | Create webhook | | GET | /v1/webhooks | List webhooks | | PATCH | /v1/webhooks/{id} | Update webhook | | DELETE | /v1/webhooks/{id} | Delete webhook | | POST | /v1/webhooks/{id}/test | Test webhook delivery | | POST | /v2/scrape | Firecrawl v2 compatible scrape | | POST | /v2/crawl | Firecrawl v2 compatible crawl | | GET | /v2/crawl/{id} | Firecrawl v2 compatible crawl status | | POST | /v2/search | Firecrawl v2 compatible search | | GET | /health | Health check + Ollama status | #### Quick example ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com", "formats": ["markdown"]}' ``` --- ### POST /v1/scrape Extract content from a single URL. This is the core endpoint -- one URL in, clean structured content out. Supports browser actions, screenshots, mobile User-Agent, page-level Q&A, and attribute extraction. **POST** `/v1/scrape` -- Extract content from a single URL in one or more output formats. #### Request body ```json { "url": "https://example.com", "formats": ["markdown", "llm", "text", "json", "links", "rawHtml"], "include_selectors": [".article-content"], "exclude_selectors": ["nav", ".sidebar"], "only_main_content": true, "screenshot": false, "mobile": false, "actions": [], "query": "What is the pricing?", "attribute_selectors": [{"selector": "img", "attribute": "alt"}] } ``` #### Parameters | Field | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | The URL to scrape. | | `formats` | string[] | No | Output formats to include. Options: `markdown`, `llm`, `text`, `json`, `screenshot`, `links`, `rawHtml`, `attributes`, `query`. Defaults to `["markdown"]`. | | `include_selectors` | string[] | No | CSS selectors to extract exclusively. Only content matching these selectors will be included. | | `exclude_selectors` | string[] | No | CSS selectors to remove from the page before extraction. | | `only_main_content` | boolean | No | When true, extracts only the main article or content element, ignoring sidebars, headers, and footers. | | `screenshot` | boolean | No | Take a screenshot after extraction. Returns a base64-encoded PNG in the response. | | `mobile` | boolean | No | Use a mobile User-Agent for the request. Useful for sites that serve different content to mobile devices. | | `actions` | object[] | No | Array of browser actions to perform before content extraction. See Browser Actions below. | | `query` | string | No | Natural language question about the page. Used with the `query` format. The answer is returned in `query_answer`. | | `attribute_selectors` | object[] | No | Array of `{selector, attribute}` pairs. Used with the `attributes` format to extract specific DOM attributes by CSS selector. | #### Output formats | Format | Description | |--------|-------------| | `markdown` | Clean markdown with resolved URLs and collected assets. Default. | | `text` | Plain text with no formatting. | | `json` | Full ExtractionResult as JSON with metadata, content, word count, and URLs. | | `llm` | LLM-optimized. 9-step pipeline: image stripping, emphasis removal, link dedup, stat merging, whitespace collapse. ~90% fewer tokens than raw HTML. | | `screenshot` | Base64-encoded PNG screenshot of the page. | | `links` | Array of all extracted links. Each entry has `text` and `href` fields. Returns `[{text, href}]`. | | `rawHtml` | The raw HTML string from the pipeline, before any extraction processing. | | `attributes` | Extract DOM attributes by CSS selector. Requires the `attribute_selectors` parameter. | | `query` | Page-level Q&A with LLM. Requires the `query` parameter. Returns `query_answer` in the response. | #### Browser Actions Perform actions on the page before extracting content. Actions execute in order. Useful for clicking cookie banners, expanding sections, filling forms, scrolling to load lazy content, pressing keyboard keys, or running custom JavaScript. | Action | Fields | Description | |--------|--------|-------------| | `click` | `selector` | Click an element matching the CSS selector. | | `type` | `selector`, `value` | Type text into an input element matching the CSS selector. | | `scroll` | `direction` | Scroll the page. Direction: `up` or `down`. | | `wait` | `ms` | Wait for the specified number of milliseconds. | | `screenshot` | -- | Take a screenshot at this point in the action sequence. | | `press` | `key` | Send a keyboard event. Supported keys: Enter, Tab, Escape, ArrowDown, ArrowUp, Space, Backspace. | | `executeJavascript` | `code` | Run custom JavaScript on the page. Results are returned in the `js_results` array in the response. | ```json { "url": "https://example.com", "formats": ["markdown"], "actions": [ {"type": "click", "selector": "#accept-cookies"}, {"type": "wait", "ms": 1000}, {"type": "type", "selector": "#search-input", "value": "webclaw"}, {"type": "press", "key": "Enter"}, {"type": "wait", "ms": 2000}, {"type": "scroll", "direction": "down"}, {"type": "executeJavascript", "code": "return document.title"}, {"type": "screenshot"} ], "screenshot": true } ``` > **Tip:** The `llm` format runs a 9-step optimization pipeline that strips images, collapses whitespace, deduplicates links, and reduces token count by ~90% compared to raw HTML. Use it when feeding content to language models. #### YouTube transcript extraction YouTube URLs are auto-detected and routed through a `yt-dlp`-backed short-circuit that bypasses the standard HTML pipeline. Triggers on `youtube.com/watch`, `youtube.com/shorts/`, and `youtu.be/`. No special parameters needed. The response carries two YouTube-specific fields alongside the standard `markdown` / `text` / `llm` formats: - `transcript` (string) -- full auto-caption text. `null` when the video has no captions. - `youtube` (object) -- structured metadata: `video_id`, `title`, `description`, `channel`, `channel_url`, `uploader`, `upload_date` (YYYYMMDD), `duration_seconds`, `view_count`, `like_count`, `thumbnail`, `tags`, `categories`, `language`. ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}' ``` **Response (excerpt):** ```json { "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "metadata": { "title": "...", "description": "...", "image": "..." }, "youtube": { "video_id": "dQw4w9WgXcQ", "channel": "Rick Astley", "duration_seconds": 213, "view_count": 1490000000, "tags": ["rick astley", "..."], "thumbnail": "https://i.ytimg.com/vi/dQw4w9WgXcQ/maxresdefault.jpg" }, "transcript": "We're no strangers to love...", "markdown": "# Never Gonna Give You Up\n\n**Channel:** Rick Astley..." } ``` The same shape is available across SDKs: Python `result.youtube`, JS/TS `result.youtube`, Go `resp.YouTube`. Cold p50 latency 2-4s; cache hits ~150ms. #### Response The response includes the requested formats alongside extracted metadata. Only the formats you request are populated. ```json { "url": "https://example.com", "metadata": { "title": "Example Domain", "description": "This domain is for use in illustrative examples.", "author": null, "published_date": null, "language": "en", "site_name": "Example", "image": null, "favicon": "https://example.com/favicon.ico", "word_count": 1234 }, "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...", "text": "Example Domain\n\nThis domain is for use in illustrative examples...", "llm": "> URL: https://example.com\n> Title: Example Domain\n\nThis domain is for use in illustrative examples...\n\n## Links\n- ...", "links": [{"text": "More information...", "href": "https://www.iana.org/domains/example"}], "rawHtml": "...", "query_answer": "The Pro plan costs $49/month.", "js_results": ["Example Domain"], "extraction": { ... }, "screenshot": "iVBORw0KGgoAAAANSUhEUg..." } ``` #### Metadata fields | Field | Type | Description | |-------|------|-------------| | `title` | string | Page title from OG, meta, or title tag. | | `description` | string | Page description from meta or OG tags. | | `author` | string? | Author name if detected. | | `published_date` | string? | Publication date if found in metadata. | | `language` | string? | Page language code (e.g. "en"). | | `site_name` | string? | Site name from OG metadata. | | `image` | string? | Primary image URL (OG or Twitter Card). | | `favicon` | string? | Favicon URL. | | `word_count` | number | Total word count of extracted content. | | `screenshot` | string? | Base64-encoded PNG screenshot. Only present when `screenshot: true` or a screenshot action was used. | #### Examples **Basic extraction:** ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://openai.com/blog/gpt-4", "formats": ["markdown"] }' ``` **LLM-optimized with selector filtering:** ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://docs.stripe.com/payments/checkout", "formats": ["llm"], "include_selectors": [".content-container"], "exclude_selectors": ["nav", "footer", ".sidebar"], "only_main_content": true }' ``` **Extract all links from a page:** ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com", "formats": ["links"] }' ``` **Extract DOM attributes by CSS selector:** ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com", "formats": ["attributes"], "attribute_selectors": [ {"selector": "img", "attribute": "alt"}, {"selector": "a", "attribute": "href"} ] }' ``` **Page-level Q&A with the query format:** ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com/pricing", "formats": ["query"], "query": "What is the price of the Pro plan?" }' ``` **With browser actions, keyboard press, and JavaScript execution:** ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com/dashboard", "formats": ["markdown"], "actions": [ {"type": "click", "selector": ".show-more"}, {"type": "wait", "ms": 1500}, {"type": "press", "key": "Enter"}, {"type": "executeJavascript", "code": "return document.title"}, {"type": "scroll", "direction": "down"} ], "screenshot": true }' ``` **Mobile extraction:** ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com", "formats": ["markdown"], "mobile": true }' ``` > **Note:** PDFs and documents (DOCX, XLSX, CSV) are auto-detected via Content-Type or URL extension. If the URL serves a PDF, webclaw extracts text using its PDF engine. DOCX files are converted to markdown. XLSX and CSV files are converted to markdown tables. No special parameters needed. #### Error responses ```json // 400 Bad Request { "error": "Missing required field: url" } // 401 Unauthorized { "error": "Invalid or missing API key" } // 422 Unprocessable { "error": "Failed to fetch URL: connection timeout" } ``` --- ### POST /v1/crawl Crawl an entire site with BFS traversal. Crawls run asynchronously -- start one, then poll until it completes. Supports subdomain crawling, external link following, path filtering, and webhook notifications. **POST** `/v1/crawl` -- Start an async same-origin crawl from the given URL. #### Request body ```json { "url": "https://docs.example.com", "max_depth": 2, "max_pages": 50, "use_sitemap": true, "allow_subdomains": false, "allow_external_links": false, "include_paths": ["/docs/**", "/guides/**"], "exclude_paths": ["/blog/**", "/changelog/**"], "webhook_url": "https://your-server.com/webhook" } ``` #### Parameters | Field | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | Starting URL. The crawl stays within this origin by default. | | `max_depth` | number | No | Maximum link depth to follow. Default: 2. | | `max_pages` | number | No | Maximum number of pages to extract. Default: 50. | | `use_sitemap` | boolean | No | Seed the crawl queue with URLs from the site's sitemap. Default: false. | | `allow_subdomains` | boolean | No | Allow the crawler to follow links to subdomains of the starting URL's domain. Default: false. | | `allow_external_links` | boolean | No | Allow the crawler to follow links to external domains. Default: false. | | `include_paths` | string[] | No | Glob patterns for paths to include. Only URLs matching at least one pattern will be crawled. | | `exclude_paths` | string[] | No | Glob patterns for paths to exclude. URLs matching any pattern will be skipped. | | `webhook_url` | string | No | URL to notify when the crawl completes. Receives a POST with the crawl result. | #### Response ```json { "id": "550e8400-e29b-41d4-a716-446655440000", "status": "running" } ``` > **Note:** The crawl ID is a UUID. Store it -- you need it to poll for results. --- ### GET /v1/crawl/{id} **GET** `/v1/crawl/{id}` -- Get the current status and results of a running or completed crawl. #### Path parameters | Parameter | Type | Description | |-----------|------|-------------| | `id` | string | The crawl job UUID returned by POST /v1/crawl. | #### Response ```json { "id": "550e8400-e29b-41d4-a716-446655440000", "status": "completed", "pages": [ { "url": "https://docs.example.com", "markdown": "# Getting Started\n\nWelcome to the documentation...", "metadata": { "title": "Getting Started", "word_count": 842 } }, { "url": "https://docs.example.com/guides/setup", "markdown": "# Setup Guide\n\nFollow these steps...", "metadata": { "title": "Setup Guide", "word_count": 1203 } } ], "total": 50, "completed": 48, "errors": 2, "created_at": "2026-03-12T10:30:00Z" } ``` #### Status values | Status | Description | |--------|-------------| | `running` | Crawl is in progress. Poll again for updates. | | `completed` | Crawl finished. All results are available. | | `failed` | Crawl encountered a fatal error and stopped. | #### Example ```bash # Start a crawl curl -X POST https://api.webclaw.io/v1/crawl \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://docs.stripe.com", "max_depth": 2, "max_pages": 100, "use_sitemap": true }' ``` ```bash # Poll for results curl https://api.webclaw.io/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \ -H "Authorization: Bearer wc_your_api_key" ``` > **Tip:** Enable `use_sitemap` to seed the crawl queue with sitemap URLs. This helps discover pages that aren't reachable through link traversal alone. --- ### GET /v1/crawl/{id}/stream Stream crawl progress in real time using Server-Sent Events (SSE). Connect once and receive events as pages are extracted, without polling. **GET** `/v1/crawl/{id}/stream` -- SSE stream of crawl progress. #### Path parameters | Parameter | Type | Description | |-----------|------|-------------| | `id` | string | The crawl job UUID returned by POST /v1/crawl. | #### Event types | Event | Description | |-------|-------------| | `page` | A page was extracted. Data contains the page result (url, markdown, metadata). | | `progress` | Progress update. Data contains `completed`, `total`, and `errors` counts. | | `done` | Crawl completed. Stream will close. | | `error` | Crawl encountered a fatal error. Stream will close. | #### Example ```bash curl -N https://api.webclaw.io/v1/crawl/550e8400-e29b-41d4-a716-446655440000/stream \ -H "Authorization: Bearer wc_your_api_key" ``` ```text event: page data: {"url":"https://docs.example.com","markdown":"# Getting Started...","metadata":{"title":"Getting Started","word_count":842}} event: progress data: {"completed":1,"total":50,"errors":0} event: page data: {"url":"https://docs.example.com/setup","markdown":"# Setup...","metadata":{"title":"Setup","word_count":1203}} event: progress data: {"completed":2,"total":50,"errors":0} event: done data: {"id":"550e8400-e29b-41d4-a716-446655440000","status":"completed","completed":48,"errors":2} ``` > **Tip:** SSE is more efficient than polling for real-time crawl monitoring. Use it in dashboards or when you want to process pages as they arrive. --- ### GET /v1/crawl/history Get a paginated list of all crawl jobs. Useful for dashboards and auditing. **GET** `/v1/crawl/history` -- Paginated crawl job history. #### Query parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `page` | number | 1 | Page number (1-indexed). | | `per_page` | number | 20 | Items per page. Max: 100. | #### Response ```json { "jobs": [ { "id": "550e8400-e29b-41d4-a716-446655440000", "url": "https://docs.example.com", "status": "completed", "total": 50, "completed": 48, "errors": 2, "created_at": "2026-03-12T10:30:00Z" } ], "page": 1, "per_page": 20, "total": 42 } ``` #### Example ```bash curl "https://api.webclaw.io/v1/crawl/history?page=1&per_page=10" \ -H "Authorization: Bearer wc_your_api_key" ``` --- ### POST /v1/crawl/{id}/retry Retry a failed crawl job. Creates a new crawl with the same parameters as the original. **POST** `/v1/crawl/{id}/retry` -- Retry a failed crawl. #### Path parameters | Parameter | Type | Description | |-----------|------|-------------| | `id` | string | The UUID of the failed crawl job to retry. | #### Response Returns a new crawl job with a new UUID. ```json { "id": "661f9511-f3ac-52e5-b827-557766551111", "status": "running" } ``` #### Example ```bash curl -X POST https://api.webclaw.io/v1/crawl/550e8400-e29b-41d4-a716-446655440000/retry \ -H "Authorization: Bearer wc_your_api_key" ``` > **Note:** Only failed crawls can be retried. Attempting to retry a running or completed crawl returns a 400 error. --- ### POST /v1/batch Extract content from multiple URLs in a single request. Requests are processed concurrently on the server. **POST** `/v1/batch` -- Extract content from multiple URLs concurrently. #### Request body ```json { "urls": [ "https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3" ], "formats": ["markdown", "llm"], "concurrency": 5 } ``` #### Parameters | Field | Type | Required | Description | |-------|------|----------|-------------| | `urls` | string[] | Yes | Array of URLs to extract. | | `formats` | string[] | No | Output formats. Options: `markdown`, `llm`, `text`, `json`, `links`, `rawHtml`. Defaults to `["markdown"]`. | | `concurrency` | number | No | Max concurrent requests. Default: 5. | #### Response ```json { "results": [ { "url": "https://example.com/page-1", "markdown": "# Page One\n\nContent of the first page...", "metadata": { "title": "Page One", "word_count": 654 }, "error": null }, { "url": "https://example.com/page-2", "markdown": "# Page Two\n\nContent of the second page...", "metadata": { "title": "Page Two", "word_count": 1102 }, "error": null }, { "url": "https://example.com/page-3", "markdown": null, "metadata": null, "error": "Failed to fetch: 404 Not Found" } ] } ``` > **Note:** Individual URL failures do not fail the entire batch. Check the `error` field on each result to detect per-URL failures. #### Example ```bash curl -X POST https://api.webclaw.io/v1/batch \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "urls": [ "https://openai.com/blog/gpt-4", "https://anthropic.com/research/claude-3", "https://deepmind.google/technologies/gemini" ], "formats": ["markdown", "llm"], "concurrency": 3 }' ``` > **Tip:** For large URL lists, keep concurrency reasonable (5-10) to avoid overwhelming target servers. The server-side default of 5 is a good starting point. --- ### POST /v1/map Discover all URLs on a site by parsing robots.txt and sitemap.xml. Recursively resolves sitemap indexes to find every listed page. **POST** `/v1/map` -- Discover all URLs on a site via sitemap parsing. #### Request body ```json { "url": "https://docs.example.com" } ``` #### Parameters | Field | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | Base URL of the site to map. | #### Response ```json { "urls": [ "https://docs.example.com", "https://docs.example.com/getting-started", "https://docs.example.com/api/reference", "https://docs.example.com/guides/authentication", "https://docs.example.com/guides/deployment" ], "count": 156 } ``` | Field | Type | Description | |-------|------|-------------| | `urls` | string[] | All discovered URLs from sitemap parsing. | | `count` | number | Total number of URLs found. | > **Note:** The map endpoint checks `robots.txt` for sitemap references first, then falls back to `/sitemap.xml`. Sitemap indexes are resolved recursively, so a single request can discover thousands of URLs. #### Example ```bash curl -X POST https://api.webclaw.io/v1/map \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{"url": "https://docs.stripe.com"}' ``` > **Tip:** Use `/v1/map` to build a URL list, then feed it to `/v1/batch` for bulk extraction. This is faster than crawling when the site has a comprehensive sitemap. --- ### POST /v1/search Search the web and optionally scrape all result URLs in parallel. One request to go from a search query to structured, LLM-ready content. **POST** `/v1/search` -- Web search with parallel scraping of top results. #### Request body ```json { "query": "best rust web frameworks", "num_results": 5, "scrape": true, "formats": ["markdown"], "country": "us", "lang": "en" } ``` #### Parameters | Field | Type | Required | Description | |-------|------|----------|-------------| | `query` | string | Yes | The search query. | | `num_results` | number | No | Number of search results to return. Default: 5. Max: 10. | | `scrape` | boolean | No | When true, scrape the content of each result URL in parallel. Default: true. | | `formats` | string[] | No | Output formats for scraped content. Options: `markdown`, `llm`, `text`, `json`. Defaults to `["markdown"]`. | | `country` | string | No | Country code for localized results (e.g. `us`, `gb`, `de`). | | `lang` | string | No | Language code for results (e.g. `en`, `fr`, `ja`). | #### Response ```json { "query": "best rust web frameworks", "results": [ { "title": "Top 10 Rust Web Frameworks in 2026", "url": "https://blog.example.com/rust-frameworks", "snippet": "A comprehensive comparison of Rust web frameworks including Axum, Actix, and Rocket...", "position": 1, "markdown": "# Top 10 Rust Web Frameworks in 2026\n\nRust continues to gain traction...", "metadata": { "title": "Top 10 Rust Web Frameworks in 2026", "word_count": 2340 } }, { "title": "Choosing a Rust Web Framework", "url": "https://docs.example.com/choosing-framework", "snippet": "How to pick the right Rust web framework for your project...", "position": 2, "markdown": "# Choosing a Rust Web Framework\n\nThe Rust ecosystem offers several...", "metadata": { "title": "Choosing a Rust Web Framework", "word_count": 1856 } } ], "scrape": true } ``` #### Response fields | Field | Type | Description | |-------|------|-------------| | `query` | string | The original search query. | | `results` | array | Array of search results with optional scraped content. | | `results[].title` | string | Search result title. | | `results[].url` | string | URL of the search result. | | `results[].snippet` | string | Search engine snippet for the result. | | `results[].position` | number | Position in search results (1-indexed). | | `results[].markdown` | string? | Scraped markdown content. Only present when `scrape: true`. | | `results[].metadata` | object? | Page metadata from scraping. Only present when `scrape: true`. | | `scrape` | boolean | Whether scraping was performed. | > **Note:** When `scrape: false`, results contain only `title`, `url`, `snippet`, and `position` -- no page content. This is faster when you only need search result metadata. #### Examples ```bash # Search and scrape top 5 results curl -X POST https://api.webclaw.io/v1/search \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "query": "rust async runtime comparison", "num_results": 5, "formats": ["markdown", "llm"] }' ``` ```bash # Search only (no scraping) for quick results curl -X POST https://api.webclaw.io/v1/search \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "query": "webclaw documentation", "scrape": false, "num_results": 10 }' ``` ```bash # Localized search curl -X POST https://api.webclaw.io/v1/search \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "query": "meilleur framework web rust", "country": "fr", "lang": "fr", "num_results": 5 }' ``` > **Tip:** Combine `/v1/search` with the `llm` format to build RAG pipelines. One request gives you search results with LLM-optimized content ready for context injection. --- ### POST /v1/extract Extract structured JSON data from any URL. Provide a JSON schema for typed output, or a natural language prompt for flexible extraction. Both modes use an LLM to parse the page content. **POST** `/v1/extract` -- Extract structured data from a URL using a JSON schema or natural language prompt. > **Note:** This endpoint requires an LLM provider. The provider chain tries Ollama (local) first, then falls back to OpenAI, then Anthropic. At least one must be configured. #### Schema mode Provide a JSON Schema and the LLM will return data conforming to it. This gives you predictable, typed output. **Request body:** ```json { "url": "https://example.com/pricing", "schema": { "type": "object", "properties": { "title": { "type": "string" }, "price": { "type": "number" }, "currency": { "type": "string" }, "features": { "type": "array", "items": { "type": "string" } } } } } ``` **Response:** ```json { "data": { "title": "Pro Plan", "price": 49, "currency": "USD", "features": [ "Unlimited extractions", "Priority support", "Custom browser profiles" ] } } ``` #### Prompt mode Describe what you want in plain English. The LLM will determine the structure based on your prompt and the page content. **Request body:** ```json { "url": "https://example.com/pricing", "prompt": "Extract all pricing tiers with name, price, and features" } ``` **Response:** ```json { "data": { "tiers": [ { "name": "Hobby", "price": 9, "features": ["1 seat", "Community support"] }, { "name": "Pro", "price": 49, "features": ["5 seats", "Priority support", "Custom profiles"] }, { "name": "Scale", "price": 199, "features": ["500k pages/month", "Dedicated support", "SLA"] } ] } } ``` #### Prompt-to-schema generation When only `prompt` is provided (no `schema`), the LLM generates a JSON schema first, then extracts structured data matching it. The response includes a `generated_schema` field showing the auto-generated schema. **Request body:** ```json { "url": "https://example.com/pricing", "prompt": "Extract all pricing tiers with name, price, and features" } ``` **Response with generated_schema:** ```json { "data": { "tiers": [ {"name": "Hobby", "price": 9, "features": ["1 seat", "Community support"]}, {"name": "Pro", "price": 49, "features": ["5 seats", "Priority support"]} ] }, "generated_schema": { "type": "object", "properties": { "tiers": { "type": "array", "items": { "type": "object", "properties": { "name": {"type": "string"}, "price": {"type": "number"}, "features": {"type": "array", "items": {"type": "string"}} } } } } } } ``` #### Parameters | Field | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | URL to extract data from. | | `schema` | object | No* | JSON Schema defining the desired output structure. | | `prompt` | string | No* | Natural language description of what to extract. When provided without a schema, the LLM auto-generates a schema first. | > **Warning:** You must provide either `schema` or `prompt`. If both are provided, `schema` takes precedence. #### LLM provider chain The extract endpoint tries LLM providers in this order: 1. **Ollama** (local) -- free, no API key needed. Set `OLLAMA_HOST` if not running on localhost. 2. **OpenAI** -- requires `OPENAI_API_KEY`. 3. **Anthropic** -- requires `ANTHROPIC_API_KEY`. #### Examples ```bash # Schema mode curl -X POST https://api.webclaw.io/v1/extract \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com/product/widget", "schema": { "type": "object", "properties": { "name": { "type": "string" }, "price": { "type": "number" }, "in_stock": { "type": "boolean" } } } }' ``` ```bash # Prompt mode (auto-generates schema) curl -X POST https://api.webclaw.io/v1/extract \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com/team", "prompt": "Extract all team members with name, role, and LinkedIn URL" }' ``` --- ### POST /v1/summarize Generate a concise summary of any web page. The page is scraped, cleaned, and passed to an LLM which produces a summary of the specified length. **POST** `/v1/summarize` -- Generate an LLM-powered summary of a web page. #### Request body ```json { "url": "https://example.com/blog/long-article", "max_sentences": 3 } ``` #### Parameters | Field | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | URL of the page to summarize. | | `max_sentences` | number | No | Maximum number of sentences in the summary. Default: 3. | #### Response ```json { "summary": "The article discusses the latest advances in web extraction technology, focusing on HTTP-based approaches that avoid headless browsers. Key findings show a 20x speed improvement over Chrome-based solutions. The author recommends Rust-based extractors for production workloads." } ``` > **Note:** Like the extract endpoint, summarize uses the LLM provider chain: Ollama (local) first, then OpenAI, then Anthropic. At least one must be configured. #### Example ```bash curl -X POST https://api.webclaw.io/v1/summarize \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://en.wikipedia.org/wiki/Rust_(programming_language)", "max_sentences": 3 }' ``` --- ### POST /v1/diff Track changes between two snapshots of a web page. Scrape a URL now, compare it against a previous extraction result, and get a unified diff of what changed. **POST** `/v1/diff` -- Compare a URL's current content against a previous extraction snapshot. #### Request body ```json { "url": "https://example.com/pricing", "previous": { "url": "https://example.com/pricing", "metadata": { "title": "Pricing", "word_count": 450 }, "markdown": "# Pricing\n\nStarter: $9/mo\nPro: $29/mo", "text": "Pricing\n\nStarter: $9/mo\nPro: $29/mo" } } ``` #### Parameters | Field | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | URL to scrape for the current version. | | `previous` | object | Yes | A previous ExtractionResult (the output from a prior `/v1/scrape` call). | > **Tip:** Save the full JSON response from `/v1/scrape` as your baseline snapshot. Pass it as the `previous` field in subsequent diff requests to track changes over time. #### Response ```json { "status": "Changed", "text_diff": "--- previous\n+++ current\n@@ -2,3 +2,3 @@\n-Starter: $9/mo\n-Pro: $29/mo\n+Starter: $12/mo\n+Pro: $39/mo\n+Enterprise: $99/mo", "metadata_changes": [ { "field": "word_count", "old": 450, "new": 520 } ], "links_added": [ "https://example.com/enterprise" ], "links_removed": [], "word_count_delta": 70 } ``` #### Response fields | Field | Type | Description | |-------|------|-------------| | `status` | string | "Changed" or "Unchanged". | | `text_diff` | string | Unified diff of the text content. | | `metadata_changes` | array | List of metadata fields that changed, with old and new values. | | `links_added` | string[] | URLs present in the current version but not the previous. | | `links_removed` | string[] | URLs present in the previous version but removed. | | `word_count_delta` | number | Difference in word count (positive = content added). | #### Example ```bash curl -X POST https://api.webclaw.io/v1/diff \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://example.com/pricing", "previous": { "url": "https://example.com/pricing", "metadata": { "title": "Pricing", "word_count": 450 }, "markdown": "# Pricing\n\nStarter: $9/mo\nPro: $29/mo", "text": "Pricing\n\nStarter: $9/mo\nPro: $29/mo" } }' ``` --- ### POST /v1/brand Extract brand identity from any website. Analyzes DOM structure and CSS to find colors, fonts, logos, and favicons. **POST** `/v1/brand` -- Extract brand identity (colors, fonts, logos) from a URL. #### Request body ```json { "url": "https://example.com" } ``` #### Parameters | Field | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | URL of the site to analyze. | #### Response ```json { "colors": [ { "hex": "#FF5733", "usage": "Primary", "count": 15 }, { "hex": "#2C3E50", "usage": "Text", "count": 42 }, { "hex": "#ECF0F1", "usage": "Background", "count": 8 } ], "fonts": [ "Inter", "Roboto Mono" ], "logo_url": "https://example.com/images/logo.svg", "favicon_url": "https://example.com/favicon.ico" } ``` #### Response fields | Field | Type | Description | |-------|------|-------------| | `colors` | array | Detected colors with hex value, inferred usage, and occurrence count. | | `fonts` | string[] | Font families found in stylesheets and inline styles. | | `logo_url` | string? | URL of the detected logo image, if found. | | `favicon_url` | string? | Favicon URL from link tags or the default /favicon.ico path. | > **Note:** Brand extraction works entirely from the HTML and inline CSS -- no headless browser is used. Colors are detected from inline styles, style tags, and common CSS patterns. Results are best on marketing pages and homepages. #### Example ```bash curl -X POST https://api.webclaw.io/v1/brand \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{"url": "https://stripe.com"}' ``` --- ### Webhooks Configure webhooks to receive notifications when crawl jobs complete. Webhooks send a POST request to your URL with the crawl result. Payloads are signed with HMAC-SHA256 for verification. #### POST /v1/webhooks Create a new webhook. **Request body:** ```json { "url": "https://your-server.com/webhook", "events": ["crawl.completed", "crawl.failed"], "secret": "your_webhook_secret" } ``` | Field | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | The URL to send webhook notifications to. | | `events` | string[] | No | Event types to subscribe to. Default: all events. Options: `crawl.completed`, `crawl.failed`. | | `secret` | string | No | Secret used for HMAC-SHA256 payload signing. If not provided, one is generated and returned. | **Response:** ```json { "id": "wh_abc123", "url": "https://your-server.com/webhook", "events": ["crawl.completed", "crawl.failed"], "secret": "whsec_...", "created_at": "2026-03-16T10:00:00Z" } ``` #### GET /v1/webhooks List all configured webhooks. **Response:** ```json { "webhooks": [ { "id": "wh_abc123", "url": "https://your-server.com/webhook", "events": ["crawl.completed", "crawl.failed"], "created_at": "2026-03-16T10:00:00Z" } ] } ``` #### PATCH /v1/webhooks/{id} Update an existing webhook. **Request body:** ```json { "url": "https://your-server.com/new-webhook", "events": ["crawl.completed"] } ``` #### DELETE /v1/webhooks/{id} Delete a webhook. Returns 204 No Content on success. #### POST /v1/webhooks/{id}/test Send a test webhook delivery to verify your endpoint is configured correctly. **Response:** ```json { "success": true, "status_code": 200, "response_time_ms": 142 } ``` #### Webhook payload When a webhook fires, it sends a POST request to your URL with a JSON body: ```json { "event": "crawl.completed", "timestamp": "2026-03-16T10:30:00Z", "data": { "id": "550e8400-e29b-41d4-a716-446655440000", "status": "completed", "url": "https://docs.example.com", "total": 50, "completed": 48, "errors": 2 } } ``` #### Payload verification Webhook payloads include an `X-Webclaw-Signature` header containing an HMAC-SHA256 signature of the request body, computed with your webhook secret. Verify it to ensure the payload is authentic. ```text X-Webclaw-Signature: sha256= ``` #### Discord and Slack formatting Webhooks automatically detect Discord and Slack webhook URLs and format the payload accordingly -- rich embeds for Discord, Block Kit for Slack. No extra configuration needed. Just use a Discord or Slack webhook URL. #### Examples ```bash # Create a webhook curl -X POST https://api.webclaw.io/v1/webhooks \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://your-server.com/webhook", "events": ["crawl.completed"] }' ``` ```bash # Test a webhook curl -X POST https://api.webclaw.io/v1/webhooks/wh_abc123/test \ -H "Authorization: Bearer wc_your_api_key" ``` ```bash # Send crawl results to Discord curl -X POST https://api.webclaw.io/v1/webhooks \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{ "url": "https://discord.com/api/webhooks/1234567890/abcdefg" }' ``` --- ### Firecrawl v2 Compatibility webclaw includes a drop-in compatibility layer for the Firecrawl v2 API. If you have existing code using a Firecrawl SDK or making direct API calls, you can point it at webclaw by changing the base URL. No other code changes needed. #### Endpoints | Firecrawl Endpoint | webclaw Mapping | Description | |--------------------|-----------------|-------------| | POST /v2/scrape | Maps to /v1/scrape | camelCase field translation (e.g. `onlyMainContent` to `only_main_content`) | | POST /v2/crawl | Maps to /v1/crawl | Same parameter mapping | | GET /v2/crawl/{id} | Maps to /v1/crawl/{id} | Same response format | | POST /v2/search | Maps to /v1/search | Same parameter mapping | #### How it works The v2 endpoints accept Firecrawl-formatted requests with camelCase field names and translate them internally to webclaw's snake_case format. Responses are translated back to camelCase. The behavior is identical to the v1 endpoints -- only the field naming convention differs. #### Migration example ```python # Before (Firecrawl) from firecrawl import FirecrawlApp app = FirecrawlApp(api_key="fc_...", api_url="https://api.firecrawl.dev") # After (webclaw -- just change the URL and key) from firecrawl import FirecrawlApp app = FirecrawlApp(api_key="wc_...", api_url="https://api.webclaw.io") # All existing code works unchanged result = app.scrape_url("https://example.com") ``` ```typescript // Before (Firecrawl) import FirecrawlApp from "@mendable/firecrawl-js"; const app = new FirecrawlApp({ apiKey: "fc_...", apiUrl: "https://api.firecrawl.dev" }); // After (webclaw) const app = new FirecrawlApp({ apiKey: "wc_...", apiUrl: "https://api.webclaw.io" }); ``` > **Note:** The compatibility layer covers the most commonly used Firecrawl v2 endpoints. If you encounter a missing feature, open an issue on GitHub. --- ### Antibot Bypass and JS Rendering The webclaw Cloud API includes automatic bot protection bypass. This is transparent -- no extra parameters needed. When bot protection is detected, webclaw automatically handles it. #### Supported protections | Protection | Detection | Bypass Method | |------------|-----------|---------------| | AWS WAF | Challenge page detection | Automatic challenge solving | | Cloudflare | Turnstile and Under Attack mode | Automatic challenge solving | | DataDome | Captcha page detection | Automatic challenge solving | | hCaptcha | Challenge detection | Automatic challenge solving | #### Two-layer JS rendering When a page requires JavaScript to render (React SPAs, Next.js, Vue apps), webclaw uses a two-layer approach: 1. **Fast renderer** -- A lightweight headless browser (~1-3s render time). Handles most JS-rendered pages. 2. **Chrome headless** (fallback) -- Full Chrome DevTools Protocol. Used when the fast renderer cannot fully render the page. The layer selection is automatic. Most pages render via the fast renderer. Chrome is only invoked when necessary. #### Response metadata When antibot bypass is triggered, the response includes an `antibot` field in the metadata: ```json { "url": "https://protected-site.com", "metadata": { "title": "Protected Page", "antibot": { "detected": "cloudflare", "bypassed": true, "render_engine": "lightpanda", "render_time_ms": 1240 } }, "markdown": "..." } ``` > **Note:** Antibot bypass and JS rendering are Cloud API features. Self-hosted instances use pure HTTP extraction only. For self-hosted JS rendering, configure `WEBCLAW_ANTIBOT_URL` and `WEBCLAW_ANTIBOT_KEY` to point to your own rendering service. #### Document Parsing webclaw auto-detects documents by URL extension or Content-Type header and extracts their content as markdown. No special parameters or configuration needed -- just pass the document URL to `/v1/scrape`. | Format | Extension | Output | |--------|-----------|--------| | PDF | .pdf | Plain text extraction | | DOCX | .docx | Markdown with headings, lists, tables | | XLSX | .xlsx | Markdown tables (one per sheet) | | CSV | .csv | Markdown table | ```bash # PDF -- auto-detected curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com/report.pdf"}' # DOCX -- auto-detected curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com/proposal.docx"}' # XLSX -- auto-detected curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com/data.xlsx"}' # CSV -- auto-detected curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_your_api_key" \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com/export.csv"}' ``` > **Tip:** Document parsing works with all output formats. Request `llm` format on a DOCX file to get a token-optimized version of the document content. --- ## MCP Server The webclaw MCP (Model Context Protocol) server exposes the full extraction engine as tools that AI agents can call directly. Works with Claude Desktop, Claude Code, Cursor, Windsurf, OpenCode, Codex, Antigravity, and any MCP-compatible client. ### What is MCP Model Context Protocol is an open standard for connecting AI models to external tools and data sources. Instead of making HTTP calls manually, an AI agent discovers available tools through the MCP server and calls them natively. The webclaw MCP server communicates over stdio transport and exposes 8 tools covering scraping, crawling, extraction, and more. ### Setup #### Claude Desktop Add webclaw to your Claude Desktop config file: ```json // ~/Library/Application Support/Claude/claude_desktop_config.json { "mcpServers": { "webclaw": { "command": "webclaw-mcp", "env": { "WEBCLAW_API_KEY": "" } } } } ``` Replace `webclaw-mcp` with the full path if not in PATH. The `WEBCLAW_API_KEY` enables automatic cloud API fallback for bot-protected sites (Cloudflare, DataDome, AWS WAF) and JS-rendered SPAs. Without it, extraction works for ~80% of sites via local HTTP. Get a key at https://webclaw.io. #### Claude Code ```bash claude mcp add webclaw webclaw-mcp ``` Or add the JSON config above to your Claude Desktop config file. Claude Code auto-discovers MCP servers from the same config. ### Smart Fetch Architecture The MCP server uses a local-first approach: 1. **Local fetch** -- Fast, free, no API credits used (~80% of sites work) 2. **Cloud API fallback** -- Automatic when bot protection (Cloudflare, DataDome, AWS WAF) or JS rendering (React, Next.js, Vue SPAs) is detected 3. Requires `WEBCLAW_API_KEY` for the cloud fallback. Without it, bot-protected sites return challenge pages. ### Environment Variables | Variable | Description | |----------|-------------| | `WEBCLAW_API_KEY` | Enables cloud API fallback for bot-protected and JS-rendered sites | | `OPENAI_API_KEY` | Enables extract and summarize tools (OpenAI provider) | | `ANTHROPIC_API_KEY` | Enables extract and summarize tools (Anthropic provider) | | `OLLAMA_HOST` | Custom Ollama URL (default: http://localhost:11434) | #### Cursor Add webclaw to your Cursor MCP config at `~/.cursor/mcp.json`: ```json { "mcpServers": { "webclaw": { "command": "webclaw-mcp", "env": { "WEBCLAW_API_KEY": "" } } } } ``` #### Windsurf Add webclaw to your Windsurf MCP config at `~/.windsurf/mcp.json`: ```json { "mcpServers": { "webclaw": { "command": "webclaw-mcp", "env": { "WEBCLAW_API_KEY": "" } } } } ``` #### OpenCode Add webclaw to your OpenCode config at `~/.config/opencode/opencode.json`: ```json { "mcp": { "webclaw": { "type": "local", "command": ["~/.webclaw/webclaw-mcp"], "enabled": true } } } ``` #### Codex Add webclaw to your Codex config at `~/.codex/config.toml`: ```toml [mcp_servers.webclaw] command = "~/.webclaw/webclaw-mcp" args = [] enabled = true ``` #### Antigravity Antigravity uses the same mcpServers JSON format as Claude Desktop: ```json { "mcpServers": { "webclaw": { "command": "webclaw-mcp", "env": { "WEBCLAW_API_KEY": "" } } } } ``` #### Other MCP clients Any MCP client that supports stdio transport can connect to webclaw-mcp. Point the client at the binary and it will discover all available tools through the standard MCP handshake. > **Note:** The MCP server uses the `rmcp` crate (the official Rust MCP SDK) and communicates over stdio. No network ports are opened. ### Tools The MCP server exposes 8 tools. Each tool maps to a corresponding REST API endpoint. #### 1. scrape Extract content from a single URL. | Param | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | URL to scrape. | | `format` | string | No | Output format: markdown, llm, text, json, links, rawHtml, attributes, or query. | | `include_selectors` | string[] | No | CSS selectors to include exclusively. | | `exclude_selectors` | string[] | No | CSS selectors to remove. | | `only_main_content` | boolean | No | Extract only the main content element. | | `browser` | string | No | Browser profile: chrome, firefox, or random. | #### 2. crawl Crawl a website with BFS traversal. | Param | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | Starting URL. | | `depth` | number | No | Max crawl depth. Default: 2. | | `max_pages` | number | No | Max pages to extract. Default: 50. | | `concurrency` | number | No | Concurrent requests. Default: 5. | | `use_sitemap` | boolean | No | Seed queue with sitemap URLs. | | `format` | string | No | Output format for each page. | #### 3. map Discover all URLs on a site via sitemap parsing. | Param | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | Base URL of the site to map. | #### 4. batch Extract content from multiple URLs concurrently. | Param | Type | Required | Description | |-------|------|----------|-------------| | `urls` | string[] | Yes | Array of URLs to extract. | | `format` | string | No | Output format for each URL. | | `concurrency` | number | No | Max concurrent requests. Default: 5. | #### 5. extract Extract structured JSON data using an LLM. Supports prompt-to-schema generation -- when only a prompt is provided (no schema), the LLM generates a JSON schema first, then extracts data matching it. | Param | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | URL to extract data from. | | `prompt` | string | No* | Natural language extraction prompt. When provided without a schema, the LLM auto-generates a schema first. | | `schema` | string | No* | JSON schema string defining the output structure. | #### 6. summarize Generate a concise summary of a web page. | Param | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | URL to summarize. | | `max_sentences` | number | No | Max sentences in summary. Default: 3. | #### 7. diff Track content changes between snapshots. | Param | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | URL to scrape for current version. | | `previous_snapshot` | string | Yes | JSON string of a previous extraction result. | #### 8. brand Extract brand identity (colors, fonts, logos) from a site. | Param | Type | Required | Description | |-------|------|----------|-------------| | `url` | string | Yes | URL of the site to analyze. | ### Example conversations Here is how an AI agent might use the webclaw MCP tools in practice. **User:** Scrape the Stripe pricing page and pull out all the plan names and prices. **Claude (using webclaw MCP):** I will use the extract tool to pull structured pricing data from the page. ```json // Tool call: extract { "url": "https://stripe.com/pricing", "prompt": "Extract all plan names, monthly prices, and included features" } ``` **User:** Crawl the Next.js docs and summarize the top 5 pages. **Claude (using webclaw MCP):** I will first map the site to discover pages, then crawl and summarize the most important ones. ```json // Tool call: map { "url": "https://nextjs.org/docs" } ``` ```json // Tool call: batch { "urls": [ "https://nextjs.org/docs", "https://nextjs.org/docs/getting-started", "https://nextjs.org/docs/routing", "https://nextjs.org/docs/rendering", "https://nextjs.org/docs/data-fetching" ], "format": "llm" } ``` > **Tip:** The MCP server runs the same extraction engine as the REST API and CLI. Every tool produces identical output to its REST API counterpart. --- ## Self-Hosting Run the webclaw server on your own infrastructure. Choose Docker for the fastest setup, build from source for maximum control, or deploy to Fly.io for managed hosting with your own binary. ### Docker The quickest way to run webclaw. The image includes the server binary and all dependencies. ```bash docker run -p 3000:3000 ghcr.io/0xmassi/webclaw:latest ``` #### With authentication ```bash docker run -p 3000:3000 \ -e WEBCLAW_API_KEY=mysecret \ ghcr.io/0xmassi/webclaw:latest ``` #### Docker Compose with Ollama For LLM features (extract, summarize), run Ollama alongside webclaw. ```yaml # docker-compose.yml version: "3.8" services: webclaw: image: ghcr.io/0xmassi/webclaw:latest ports: - "3000:3000" environment: - WEBCLAW_API_KEY=mysecret - OLLAMA_HOST=http://ollama:11434 depends_on: - ollama ollama: image: ollama/ollama:latest ports: - "11434:11434" volumes: - ollama_data:/root/.ollama volumes: ollama_data: ``` > **Tip:** After starting the compose stack, pull a model into Ollama: `docker exec -it ollama ollama pull qwen3:8b` ### From source Build the binaries directly from the Rust source. Requires Rust 1.75+. ```bash git clone https://github.com/0xMassi/webclaw.git cd webclaw cargo build --release ``` The build produces three binaries in `target/release/`: | Binary | Description | |--------|-------------| | `webclaw` | CLI tool for extraction, crawling, and more. | | `webclaw-server` | REST API server (axum). | | `webclaw-mcp` | MCP server for AI agents. | ```bash # Start the server ./target/release/webclaw-server --port 3000 --api-key mysecret ``` ### Fly.io Deploy to Fly.io for managed infrastructure with global edge distribution. ```toml # fly.toml app = "webclaw" primary_region = "iad" [build] image = "ghcr.io/0xmassi/webclaw:latest" [env] WEBCLAW_PORT = "8080" WEBCLAW_HOST = "0.0.0.0" [http_service] internal_port = 8080 force_https = true auto_stop_machines = true auto_start_machines = true [[vm]] size = "shared-cpu-1x" memory = "512mb" ``` ```bash fly launch fly secrets set WEBCLAW_API_KEY=mysecret ``` ### Environment variables All configuration is done through environment variables. None are required -- the server runs with sensible defaults. #### Server | Variable | Default | Description | |----------|---------|-------------| | `WEBCLAW_PORT` | 3000 | HTTP port to listen on. | | `WEBCLAW_HOST` | 0.0.0.0 | Bind address. | | `WEBCLAW_API_KEY` | -- | API key for authentication. If unset, no auth is required. | | `WEBCLAW_MAX_CONCURRENCY` | 50 | Max concurrent extraction tasks. | | `WEBCLAW_JOB_TTL_SECS` | 3600 | How long to keep completed crawl jobs (seconds). | | `WEBCLAW_MAX_JOBS` | 100 | Maximum number of concurrent crawl jobs. | | `WEBCLAW_LOG` | info | Tracing filter (e.g. debug, webclaw=trace). | #### Proxy | Variable | Default | Description | |----------|---------|-------------| | `WEBCLAW_PROXY` | -- | Single proxy URL (http, https, or socks5). | | `WEBCLAW_PROXY_FILE` | -- | Path to a file with one proxy URL per line. Rotated per-request. | | `WEBCLAW_ANTIBOT_URL` | -- | Anti-bot service endpoint. | | `WEBCLAW_ANTIBOT_KEY` | -- | API key for the anti-bot service. | #### Auth and OAuth (cloud features) | Variable | Default | Description | |----------|---------|-------------| | `DATABASE_URL` | -- | PostgreSQL connection string. Enables OAuth and billing. | | `GOOGLE_CLIENT_ID` | -- | Google OAuth client ID. | | `GOOGLE_CLIENT_SECRET` | -- | Google OAuth client secret. | | `GITHUB_CLIENT_ID` | -- | GitHub OAuth client ID. | | `GITHUB_CLIENT_SECRET` | -- | GitHub OAuth client secret. | | `WEBCLAW_JWT_SECRET` | -- | JWT signing secret for session tokens. | | `WEBCLAW_BASE_URL` | -- | Public URL of the server (for OAuth callbacks). | | `WEBCLAW_FRONTEND_URL` | -- | Frontend URL (for CORS and redirect). | > **Warning:** The OAuth and billing variables (DATABASE_URL, GOOGLE_CLIENT_*, etc.) are only needed if you are building a multi-tenant deployment with user accounts. For standard self-hosted usage, only WEBCLAW_API_KEY and the LLM provider keys matter. #### LLM providers | Variable | Default | Description | |----------|---------|-------------| | `OLLAMA_HOST` | http://localhost:11434 | Ollama API endpoint. | | `OLLAMA_MODEL` | qwen3:8b | Default Ollama model for extraction and summarization. | | `OPENAI_API_KEY` | -- | OpenAI API key. Enables OpenAI as a fallback provider. | | `OPENAI_BASE_URL` | -- | Custom OpenAI-compatible endpoint (for proxies or local models). | | `ANTHROPIC_API_KEY` | -- | Anthropic API key. Enables Anthropic as a fallback provider. | --- ## Cloud API The webclaw Cloud API is a managed service -- no servers to run, no infrastructure to maintain. Sign up, create an API key, and start extracting. ### Getting started 1. Sign up at [webclaw.io](https://webclaw.io) 2. Create an API key in the dashboard 3. Send requests to `https://api.webclaw.io` ### Authentication All cloud requests require an API key in the Authorization header. Keys are prefixed with `wc_` for easy identification. ```bash curl -X POST https://api.webclaw.io/v1/scrape \ -H "Authorization: Bearer wc_live_abc123def456" \ -H "Content-Type: application/json" \ -d '{"url": "https://example.com"}' ``` ### Base URL ```text https://api.webclaw.io ``` All REST API endpoints documented in this site are available under this base URL. ### Rate limits Rate limits are based on your plan tier and measured in credits per billing period. | Plan | Credits / month | Research / month | Max sources | Price | |------|-----------------|------------------|-------------|-------| | Free | 500 | 0 | 0 | $0 | | Starter | 10,000 | 3 | 10 | $19/mo | | Pro | 250,000 | 20 | 30 | $99/mo | | Scale | 1,000,000 | 60 | 100 | $399/mo | > **Note:** A plain page costs 1 credit. JS render adds 2 credits, antibot adds 9 credits, search costs 2 credits per 10 results, summarize costs 10 credits, brand costs 5 credits, diff costs 2 credits, and LLM extract costs 25 credits. ### Usage tracking Monitor your usage in real time from the dashboard at [webclaw.io/dashboard/usage](https://webclaw.io/dashboard/usage). The dashboard shows: - Pages consumed in the current billing period - Remaining quota - Historical usage by day - Breakdown by endpoint ### Billing Billing periods are calendar months. Your page quota resets on the 1st of each month at 00:00 UTC. Unused pages do not roll over. Upgrades take effect immediately and prorate the cost for the remainder of the current period. > **Tip:** Need more than 500k pages/month? Contact us for custom enterprise pricing with dedicated infrastructure and SLA guarantees. --- ## SDKs Official SDKs are available for Python, TypeScript, and Go. All three cover every endpoint with typed clients and idiomatic APIs. ### Python SDK ```bash pip install webclaw ``` Requires Python 3.9+. Only dependency is httpx. Provides both sync (`Webclaw`) and async (`AsyncWebclaw`) clients. ```python from webclaw import Webclaw client = Webclaw("wc_your_api_key") # Scrape result = client.scrape("https://example.com", formats=["markdown", "llm"]) print(result.markdown) # Crawl job = client.crawl("https://example.com", max_depth=3, max_pages=100) status = job.wait(interval=2.0, timeout=300.0) for page in status.pages: print(page.url, len(page.markdown or "")) # Batch result = client.batch(["https://a.com", "https://b.com"], formats=["markdown"]) # Map result = client.map("https://example.com") # Search result = client.search("best rust web frameworks", num_results=5) for r in result.results: print(r.title, r.url) # Extract result = client.extract("https://example.com/pricing", prompt="Extract all pricing tiers") # Summarize result = client.summarize("https://example.com", max_sentences=3) # Brand result = client.brand("https://example.com") ``` Async: ```python from webclaw import AsyncWebclaw async with AsyncWebclaw("wc_your_api_key") as client: result = await client.scrape("https://example.com") ``` Options: `base_url` (default: https://api.webclaw.io), `timeout` (default: 30.0s). Source: https://github.com/0xMassi/webclaw-python ### TypeScript SDK ```bash npm install webclaw ``` Zero dependencies, native fetch. Requires Node.js 18+. Also works in Bun and Deno. ```typescript import { Webclaw } from "webclaw"; const client = new Webclaw({ apiKey: "wc_your_api_key" }); // Scrape const result = await client.scrape({ url: "https://example.com", formats: ["markdown", "llm"], only_main_content: true, }); // Crawl const job = await client.crawl({ url: "https://example.com", max_depth: 3 }); const status = await job.waitForCompletion({ interval: 2_000, maxWait: 300_000 }); // Batch const batch = await client.batch({ urls: ["https://a.com", "https://b.com"], formats: ["markdown"] }); // Map const map = await client.map({ url: "https://example.com" }); // Search const search = await client.search({ query: "best rust web frameworks", num_results: 5 }); // Extract const extracted = await client.extract({ url: "https://example.com/pricing", prompt: "Extract pricing tiers" }); // Summarize const summary = await client.summarize({ url: "https://example.com", max_sentences: 3 }); // Brand const brand = await client.brand({ url: "https://example.com" }); ``` Options: `baseUrl` (default: https://api.webclaw.io), `timeout` (default: 30000ms). Source: https://github.com/0xMassi/webclaw-js ### Go SDK ```bash go get github.com/0xMassi/webclaw-go ``` Zero dependencies beyond stdlib. Requires Go 1.21+. context.Context on every method. Functional options for configuration. ```go import webclaw "github.com/0xMassi/webclaw-go" client := webclaw.NewClient("wc_your_api_key") // Scrape result, err := client.Scrape(ctx, webclaw.ScrapeRequest{ URL: "https://example.com", Formats: []webclaw.Format{webclaw.FormatMarkdown, webclaw.FormatLLM}, }) // Crawl job, err := client.Crawl(ctx, webclaw.CrawlRequest{URL: "https://example.com", MaxDepth: 3}) status, err := client.WaitForCrawl(ctx, job.ID, 2*time.Second, 5*time.Minute) // Batch result, err := client.Batch(ctx, webclaw.BatchRequest{ URLs: []string{"https://a.com", "https://b.com"}, Formats: []webclaw.Format{webclaw.FormatMarkdown}, }) // Map result, err := client.Map(ctx, webclaw.MapRequest{URL: "https://example.com"}) // Search result, err := client.Search(ctx, webclaw.SearchRequest{ Query: "best rust web frameworks", NumResults: 5, }) // Agent scrape result, err := client.AgentScrape(ctx, webclaw.AgentScrapeRequest{ URL: "https://example.com/products", Goal: "Extract all product names and prices", }) // Extract result, err := client.Extract(ctx, webclaw.ExtractRequest{ URL: "https://example.com/pricing", Prompt: "Extract all pricing tiers", }) // Summarize result, err := client.Summarize(ctx, webclaw.SummarizeRequest{URL: "https://example.com", MaxSentences: 3}) // Brand result, err := client.Brand(ctx, webclaw.BrandRequest{URL: "https://example.com"}) ``` Options: `WithBaseURL(url)`, `WithTimeout(d)`, `WithHTTPClient(hc)`. Source: https://github.com/0xMassi/webclaw-go