REST API
The webclaw REST API gives you programmatic access to the full extraction engine. Every endpoint accepts JSON and returns JSON.
Base URL
Use the cloud endpoint for managed infrastructure, or point at your own instance when self-hosting.
# Cloud (managed)
https://api.webclaw.io
# Self-hosted
http://localhost:3000Authentication
All requests require an API key sent via the Authorization header.
Authorization: Bearer <api_key>Cloud: Create API keys from your dashboard at webclaw.io. Keys are prefixed with wc_.
Self-hosted: Pass --api-key when starting the server, or set the WEBCLAW_API_KEY environment variable. If neither is set, the server runs without authentication.
Request format
All POST endpoints accept a JSON body. Set the Content-Type header accordingly.
Content-Type: application/jsonResponse format
All responses are JSON. Successful responses return the data directly. Errors use a consistent shape:
{
"error": "Human-readable error message"
}Rate limiting
Cloud API rate limits are based on your plan tier. Self-hosted instances have no rate limits by default. See the Cloud API page for plan details.
Output formats
The /v1/scrape endpoint supports 9 output formats. Pass one or more in the formats array.
| Format | Description |
|---|---|
markdown | Clean markdown with resolved URLs and collected assets. Default. |
text | Plain text with no formatting. |
json | Full ExtractionResult as JSON. Includes metadata, content, word count, and extracted URLs. |
llm | LLM-optimized. 9-step pipeline: image stripping, emphasis removal, link dedup, stat merging, whitespace collapse. ~90% fewer tokens than raw HTML. |
screenshot | Base64-encoded PNG screenshot of the page. |
links | Array of all extracted links. Each entry has text and href fields. |
rawHtml | The raw HTML string from the pipeline, before any extraction processing. |
attributes | Extract DOM attributes by CSS selector. Requires the attribute_selectors parameter. |
query | Page-level Q&A with LLM. Requires the query parameter. Returns query_answer in the response. |
Document parsing
webclaw auto-detects document types from Content-Type headers and file extensions. In addition to PDF, the following formats are supported natively and converted to your requested output format:
- DOCX -- Microsoft Word documents
- XLSX -- Excel spreadsheets (each sheet becomes a markdown table)
- CSV -- Comma-separated values
YouTube transcript extraction
YouTube URLs (youtube.com/watch) are auto-detected and processed differently. webclaw extracts structured markdown containing the video title, channel name, view count, publish date, duration, description, and the full transcript text.
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ", "formats": ["markdown"]}'Endpoints
The full list of available endpoints.
Core endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/scrape | Single URL extraction. Supports browser actions, screenshots, mobile emulation, and document parsing (PDF, DOCX, XLSX, CSV). |
| POST | /v1/crawl | Start async crawl with path filtering, subdomain/external link control, and webhook support. |
| GET | /v1/crawl/{id} | Poll crawl status |
| GET | /v1/crawl/{id}/stream | Stream crawl results via Server-Sent Events (SSE) |
| GET | /v1/crawl/{id}/pages | Paginated page results for large crawls |
| GET | /v1/crawl/history | List all crawl jobs for your account |
| POST | /v1/crawl/{id}/retry | Retry a failed or completed crawl with the same configuration |
| POST | /v1/batch | Multi-URL extraction |
| POST | /v1/map | Sitemap discovery |
| POST | /v1/search | Web search with parallel scraping of result pages |
| POST | /v1/extract | LLM JSON extraction with prompt-to-schema generation |
| POST | /v1/summarize | LLM summarization |
| POST | /v1/diff | Content change tracking |
| POST | /v1/brand | Brand identity extraction |
Webhooks
| Method | Path | Description |
|---|---|---|
| POST | /v1/webhooks | Register a new webhook |
| GET | /v1/webhooks | List all registered webhooks |
| PATCH | /v1/webhooks/{id} | Update a webhook configuration |
| DELETE | /v1/webhooks/{id} | Delete a webhook |
Firecrawl v2 compatibility
Drop-in replacements for Firecrawl v2 endpoints. Point existing Firecrawl SDKs at webclaw by changing the base URL.
| Method | Path | Description |
|---|---|---|
| POST | /v2/scrape | Firecrawl-compatible single URL scrape |
| POST | /v2/crawl | Firecrawl-compatible async crawl |
| GET | /v2/crawl/{id} | Firecrawl-compatible crawl status polling |
| DELETE | /v2/crawl/{id} | Firecrawl-compatible crawl cancellation |
| POST | /v2/map | Firecrawl-compatible sitemap URL discovery |
| POST | /v2/search | Firecrawl-compatible web search |
Results and history
Authenticated cloud support endpoints for dashboard history, request inspection, replay output lookup, and cached result lookup.
| Method | Path | Description |
|---|---|---|
| GET | /v1/cache/lookup | Look up the latest cached output for an endpoint and URL. |
| GET | /v1/results | Paginated cached results. Supports endpoint, page, and per_page. |
| GET | /v1/results/{id} | Fetch one cached result by result ID. |
| GET | /v1/results/run/{usage_log_id} | Resolve cached output for a specific dashboard run. |
curl https://api.webclaw.io/v1/results/run/87ac93f1-1ff8-4622-a1f4-1ac2a8117fa7 \
-H "Authorization: Bearer wc_your_api_key"Utility
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check + Ollama status |
POST /v1/search
Perform a web search and optionally scrape all result pages in parallel. Combines discovery and extraction into a single call.
/v1/searchSearch the web and scrape result pages in parallel.
Request body
{
"query": "best rust web frameworks 2026",
"num_results": 5,
"scrape": true,
"formats": ["markdown"],
"country": "us",
"lang": "en"
}| Field | Type | Required | Description |
|---|---|---|---|
query | string | Yes | The search query string. |
num_results | number | No | Number of search results to return. Default: 5. |
scrape | boolean | No | When true, scrapes each result page in parallel. Default: true. |
formats | string[] | No | Output formats for scraped pages. Same options as /v1/scrape. Defaults to ["markdown"]. |
country | string | No | Two-letter country code to localize results (e.g. "us", "gb", "de"). |
lang | string | No | Two-letter language code for results (e.g. "en", "fr"). |
Example
curl -X POST https://api.webclaw.io/v1/search \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"query": "best rust web frameworks 2026",
"num_results": 3,
"formats": ["llm"],
"country": "us"
}'POST /v1/extract
Extract structured JSON data from any URL. Provide a JSON schema for typed output, or a natural language prompt for flexible extraction. When only a prompt is provided (no schema), the LLM generates a JSON schema first, then extracts data matching it. The response includes a generated_schema field showing the auto-generated schema.
/v1/extractLLM-powered structured data extraction with prompt-to-schema generation.
Prompt-to-schema example
curl -X POST https://api.webclaw.io/v1/extract \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"prompt": "Extract all pricing tiers with name, price, and features"
}'When only prompt is provided, the response includes both the extracted data and the generated schema:
{
"data": {
"tiers": [
{"name": "Hobby", "price": 9, "features": ["1 seat", "Community support"]},
{"name": "Pro", "price": 49, "features": ["5 seats", "Priority support"]}
]
},
"generated_schema": {
"type": "object",
"properties": {
"tiers": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"features": {"type": "array", "items": {"type": "string"}}
}
}
}
}
}
}Enhanced scrape parameters
The /v1/scrape endpoint supports additional parameters for browser automation, screenshots, mobile emulation, cache control, page-level Q&A, and attribute extraction.
| Field | Type | Required | Description |
|---|---|---|---|
actions | object[] | No | Array of browser actions to execute before extraction (click, type, scroll, wait, press, executeJavascript, etc.). |
screenshot | boolean | No | Capture a full-page screenshot. Returned as a base64-encoded PNG in the response. |
mobile | boolean | No | Emulate a mobile device viewport and user agent. |
no_cache | boolean | No | Bypass the response cache and force a fresh fetch. |
query | string | No | Natural language question about the page. Used with the query format. The answer is returned in query_answer. |
attribute_selectors | object[] | No | Array of {selector, attribute} pairs. Used with the attributes format to extract specific DOM attributes by CSS selector. |
Browser actions
Actions execute in order before content extraction. Useful for clicking cookie banners, expanding sections, filling forms, scrolling to load lazy content, pressing keyboard keys, or running custom JavaScript.
| Action | Fields | Description |
|---|---|---|
click | selector | Click an element matching the CSS selector. |
type | selector, value | Type text into an input element. |
scroll | direction | Scroll the page. Direction: up or down. |
wait | ms | Wait for the specified number of milliseconds. |
screenshot | -- | Take a screenshot at this point in the action sequence. |
press | key | Send a keyboard event. Supported keys: Enter, Tab, Escape, ArrowDown, ArrowUp, Space, Backspace. |
executeJavascript | code | Run custom JavaScript on the page. Results are returned in the js_results array in the response. |
Browser actions example
{
"url": "https://example.com/dashboard",
"actions": [
{ "type": "wait", "selector": ".login-form" },
{ "type": "type", "selector": "#email", "value": "user@example.com" },
{ "type": "press", "key": "Tab" },
{ "type": "type", "selector": "#password", "value": "secret" },
{ "type": "press", "key": "Enter" },
{ "type": "wait", "milliseconds": 2000 },
{ "type": "executeJavascript", "code": "return document.title" },
{ "type": "scroll", "direction": "down" },
{ "type": "screenshot" }
],
"screenshot": true,
"formats": ["markdown"]
}Query format example
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/pricing",
"formats": ["query"],
"query": "What is the price of the Pro plan?"
}'Attributes format example
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"formats": ["attributes"],
"attribute_selectors": [
{"selector": "img", "attribute": "alt"},
{"selector": "a", "attribute": "href"}
]
}'Enhanced crawl parameters
The /v1/crawl endpoint now supports path filtering, subdomain and external link control, and webhook notifications.
| Field | Type | Required | Description |
|---|---|---|---|
allow_subdomains | boolean | No | Allow the crawler to follow links to subdomains of the starting URL. Default: false. |
allow_external_links | boolean | No | Allow the crawler to follow links to external domains. Default: false. |
include_paths | string[] | No | Glob patterns for paths to include (e.g. /docs/**). Only matching URLs will be crawled. |
exclude_paths | string[] | No | Glob patterns for paths to exclude (e.g. /blog/**). Matching URLs will be skipped. |
webhook_url | string | No | URL to receive a POST request when the crawl completes or fails. |
Crawl sub-endpoints
/v1/crawl/{id}/streamStream crawl results in real-time via Server-Sent Events. Each page result is sent as an SSE event as it completes.
/v1/crawl/historyList all crawl jobs for your account, ordered by creation date.
/v1/crawl/{id}/retryRetry a failed or completed crawl using the same configuration. Returns a new crawl ID.
Enhanced crawl example
curl -X POST https://api.webclaw.io/v1/crawl \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.stripe.com",
"max_depth": 3,
"max_pages": 200,
"include_paths": ["/docs/**"],
"exclude_paths": ["/docs/changelog/**"],
"allow_subdomains": false,
"webhook_url": "https://your-app.com/webhooks/crawl-done"
}'Webhooks
Register webhook URLs to receive POST notifications when async operations (crawls, batch jobs) complete. Manage webhooks with full CRUD.
/v1/webhooksRegister a new webhook endpoint.
Create webhook request
{
"url": "https://your-app.com/webhooks/webclaw",
"events": ["crawl.completed", "crawl.failed", "batch.completed"]
}/v1/webhooksList all registered webhooks for your account.
/v1/webhooks/{id}Update an existing webhook's URL or event subscriptions.
/v1/webhooks/{id}Delete a webhook. It will no longer receive notifications.
Example
curl -X POST https://api.webclaw.io/v1/webhooks \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://your-app.com/webhooks/webclaw",
"events": ["crawl.completed"]
}'Firecrawl v2 compatibility
These endpoints accept the same request/response shapes as Firecrawl v2. Migrate by changing your base URL to webclaw -- no code changes needed.
/v2/scrapeFirecrawl-compatible scrape. Accepts Firecrawl v2 request format and returns Firecrawl v2 response shape.
/v2/crawlFirecrawl-compatible async crawl. Returns a job ID in Firecrawl v2 format.
/v2/crawl/{id}Poll a Firecrawl-compatible crawl job for status and results.
/v2/searchFirecrawl-compatible web search with scraping.
https://api.firecrawl.dev with https://api.webclaw.io in your SDK configuration. Authentication works the same way -- just use your webclaw API key instead.Quick example
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "formats": ["markdown"]}'