CLI Reference
Complete reference for the webclaw command-line tool. Every flag, every option, with practical examples.
Basic extraction
Pass one or more URLs as positional arguments. webclaw fetches each page, extracts the main content, and outputs clean markdown.
| Flag | Description |
|---|---|
| webclaw <url> | Extract a single URL. |
| webclaw url1 url2 url3 | Batch extract multiple URLs in one command. |
| --urls-file <file> | Read URLs from a file, one per line. |
| --file <path> | Read HTML from a local file instead of fetching. |
| --stdin | Read HTML from stdin. |
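Hand-assembled URL lists often carry blank lines, stray whitespace, or duplicates. The sketch below normalizes such a list into the one-URL-per-line shape that --urls-file expects. The function name clean_urls and the treatment of lines starting with # as comments are illustrative assumptions, not webclaw behavior.

```shell
# clean_urls: read a raw URL list on stdin, print one unique URL per line.
# Assumption: '#' comment lines and blank lines are a local convention;
# the webclaw docs only state "one per line".
clean_urls() {
  tr -d '\r' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -v -e '^$' -e '^#' \
    | awk '!seen[$0]++'   # keep the first occurrence of each URL
}

# Usage (not executed here):
#   clean_urls < raw-urls.txt > urls.txt
#   webclaw --urls-file urls.txt
```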
# Single URL
webclaw https://example.com
# Multiple URLs
webclaw https://example.com https://news.ycombinator.com
# From a file list
webclaw --urls-file urls.txt
# Local HTML file
webclaw --file page.html
# Pipe from another command
curl -s https://example.com | webclaw --stdin
Output formats
Control the output format with the -f flag. The default is markdown.
| Flag | Description |
|---|---|
| -f markdown | Clean markdown with resolved URLs and collected assets. This is the default. |
| -f text | Plain text with no formatting. |
| -f json | Full ExtractionResult as JSON. Includes metadata, content, word count, and extracted URLs. |
| -f llm | LLM-optimized output. A 9-step pipeline that includes image stripping, emphasis removal, link deduplication, stat merging, and whitespace collapse. Includes a metadata header. |
| --metadata | Include page metadata (title, description, OG tags) in the output. |
| --raw-html | Output the raw HTML response without any extraction processing. |
# Default markdown
webclaw https://example.com
# LLM-optimized for minimal token usage
webclaw https://example.com -f llm
# Full JSON with metadata
webclaw https://example.com -f json
# Plain text
webclaw https://example.com -f text
# Markdown with metadata header
webclaw https://example.com --metadata
# Save JSON snapshot for later diffing
webclaw https://example.com -f json > snapshot.json
The llm format typically uses about 90% fewer tokens than raw HTML while preserving all meaningful content. Use it when feeding content to language models.
Content filtering
Use CSS selectors to control what content is extracted. Include mode is exclusive -- only matched elements are returned. Exclude mode removes matched elements from the normal extraction.
| Flag | Description |
|---|---|
| --include <selectors> | CSS selectors to extract. Comma-separated. Exclusive mode: only these elements are returned. |
| --exclude <selectors> | CSS selectors to remove from extraction. Comma-separated. |
| --only-main-content | Extract only article, main, or [role="main"] elements. |
# Extract only the article body
webclaw https://example.com --include "article"
# Extract specific sections
webclaw https://example.com --include ".post-content, .comments"
# Remove navigation and footer noise
webclaw https://example.com --exclude "nav, footer, .sidebar"
# Combine both
webclaw https://example.com --include "main" --exclude ".ads, .related-posts"
# Quick mode: just the main content area
webclaw https://example.com --only-main-content
Browser impersonation
webclaw impersonates real browser TLS fingerprints. This makes requests indistinguishable from actual browser traffic at the TLS layer. Each browser option includes multiple version profiles.
| Flag | Description |
|---|---|
| -b chrome | Chrome profiles: v142, v136, v133, v131. This is the default. |
| -b firefox | Firefox profiles: v144, v135, v133, v128. |
| -b random | Random browser profile per request. Useful for bulk extraction. |
# Default Chrome impersonation
webclaw https://example.com
# Firefox fingerprint
webclaw https://example.com -b firefox
# Random profile per request (good for batch)
webclaw url1 url2 url3 -b random
Proxy
Route requests through HTTP proxies. Supports single proxy or pool rotation.
| Flag | Description |
|---|---|
| -p <url> | Single proxy. Format: http://user:pass@host:port |
| --proxy-file <file> | Proxy pool file. One proxy per line in host:port:user:pass format. Rotates per request. |
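The two flags above take different proxy notations: -p expects a URL, while the pool file uses host:port:user:pass. As a rough sketch, a pool-file line can be rewritten into the URL form with awk. The helper name pool_line_to_url is hypothetical, and the sketch assumes the http scheme and that no field contains a colon.

```shell
# pool_line_to_url: convert "host:port:user:pass" lines on stdin
# into "http://user:pass@host:port".
# Assumptions: http scheme, exactly four colon-separated fields;
# lines with any other field count are skipped silently.
pool_line_to_url() {
  awk -F: 'NF == 4 { printf "http://%s:%s@%s:%s\n", $3, $4, $1, $2 }'
}

# Usage (not executed here): take the first pool entry as a single proxy
#   webclaw https://example.com -p "$(head -n 1 proxies.txt | pool_line_to_url)"
```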
# Single proxy
webclaw https://example.com -p http://user:pass@proxy.example.com:8080
# Proxy pool with rotation
webclaw https://example.com --proxy-file proxies.txt
# Batch extraction with proxy rotation
webclaw --urls-file urls.txt --proxy-file proxies.txt
webclaw loads proxies.txt from the working directory if present. No flag needed. Proxy rotation is per-request, not per-client, so each request in a batch or crawl uses a different proxy from the pool.
Vertical extractors
28 site-specific extractors that return typed JSON instead of generic markdown. Reddit, GitHub, PyPI, Amazon, YouTube, and more. Full catalog in the API reference.
| Flag | Description |
|---|---|
| webclaw extractors | List every vertical extractor with name, label, and a sample URL pattern. |
| webclaw extractors --json | Same catalog as JSON. Same shape as GET /v1/extractors on the cloud API. |
| webclaw vertical <name> <url> | Run a specific extractor by name. Prints typed JSON to stdout (pretty-printed). |
| webclaw vertical <name> <url> --raw | Single-line JSON output for piping into jq or another tool. |
# Discover what's available
webclaw extractors
# GitHub PR as typed JSON
webclaw vertical github_pr https://github.com/rust-lang/rust/pull/12345
# Reddit thread with comments
webclaw vertical reddit https://www.reddit.com/r/rust/comments/abc/title/
# Amazon product (name + price + rating)
webclaw vertical amazon_product https://www.amazon.com/dp/B0C6KKQ7ND
# Pipe into jq for the fields you need
webclaw vertical github_repo https://github.com/rust-lang/rust --raw \
| jq '{stars, forks, primary_language}'
Run webclaw extractors in CI to catch renames early.
Crawling
BFS same-origin crawler. Discovers and extracts pages by following links within the same domain.
| Flag | Description |
|---|---|
| --crawl | Enable BFS crawling from the given URL. |
| --depth <n> | Maximum crawl depth. Default: 1. |
| --max-pages <n> | Maximum number of pages to crawl. Default: 20. |
| --concurrency <n> | Number of parallel requests. Default: 5. |
| --delay <ms> | Delay between requests in milliseconds. Default: 100. |
| --path-prefix <path> | Only crawl URLs whose path starts with this prefix. |
| --sitemap | Seed the crawl queue from sitemap discovery before starting BFS. |
# Basic crawl, 1 level deep
webclaw https://docs.example.com --crawl
# Deep crawl with limits
webclaw https://docs.example.com --crawl --depth 3 --max-pages 100
# Faster crawl with more concurrency
webclaw https://docs.example.com --crawl --concurrency 10 --delay 50
# Only crawl the /api/ section
webclaw https://docs.example.com --crawl --path-prefix /api/
# Seed from sitemap first, then crawl
webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap
Sitemap discovery
Discover all URLs from a site's sitemap.xml and robots.txt without crawling.
| Flag | Description |
|---|---|
| --map | Discover URLs from sitemap.xml and robots.txt. Outputs the URL list. |
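A sitemap can list far more URLs than you want to extract. Assuming the --map output is one URL per line, the following sketch narrows it to a path prefix before the list is handed to --urls-file. The function name filter_by_path_prefix is hypothetical, and the matching rule (a plain string prefix on the path) is an assumption that may differ from how --path-prefix matches during a crawl.

```shell
# filter_by_path_prefix PREFIX: keep URLs on stdin whose path starts with PREFIX.
# Assumption: one absolute URL per line, e.g. https://host/path.
filter_by_path_prefix() {
  awk -v prefix="$1" '
    {
      path = $0
      # Drop "scheme://host[:port]" so only the path (and query) remains.
      sub("^[a-zA-Z][a-zA-Z0-9+.-]*://[^/]*", "", path)
      if (path == "") path = "/"
      if (index(path, prefix) == 1) print $0
    }'
}

# Usage (not executed here):
#   webclaw https://docs.example.com --map | filter_by_path_prefix /api/ > urls.txt
#   webclaw --urls-file urls.txt -f llm
```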
# Discover all URLs from sitemap
webclaw https://docs.example.com --map
# Save discovered URLs to a file, then extract them
webclaw https://docs.example.com --map > urls.txt
webclaw --urls-file urls.txt -f llm
Change tracking
Snapshot a page as JSON and compare against a future extraction to see what changed.
| Flag | Description |
|---|---|
| --diff-with <file> | Compare the current extraction against a previous JSON snapshot file. |
# Take a snapshot
webclaw https://example.com -f json > snapshot.json
# Later, check what changed
webclaw https://example.com --diff-with snapshot.json
Brand extraction
Extract brand identity from a website: colors, fonts, logo URL, and favicon. Analyzes both DOM structure and CSS.
| Flag | Description |
|---|---|
| --brand | Extract brand identity (colors, fonts, logo, favicon) from the page. |
webclaw https://example.com --brand
LLM features
webclaw can use LLMs to extract structured data, answer questions about page content, or summarize. The provider chain tries Ollama first (local, free), then OpenAI, then Anthropic.
| Flag | Description |
|---|---|
| --extract-json <schema> | Extract data matching a JSON schema. Pass the schema as a string or use @file to read from a file. |
| --extract-prompt <text> | Natural language extraction. Describe what you want and the LLM extracts it. |
| --summarize [sentences] | Summarize the page content. Default: 3 sentences. |
| --llm-provider <name> | Force a specific LLM provider: ollama, openai, or anthropic. |
| --llm-model <name> | Override the default model for the chosen provider. |
| --llm-base-url <url> | Override the selected provider's base URL. Works with Ollama, OpenAI-compatible providers, and Anthropic-compatible providers. |
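Inline schemas get hard to quote once they grow past a line or two, which is where the @file form of --extract-json helps. One way to produce such a file is a quoted heredoc, sketched below with a small title and price schema. The helper name write_schema is hypothetical.

```shell
# write_schema PATH: write a small example JSON schema to PATH.
# The quoted 'EOF' delimiter keeps the shell from expanding anything
# inside the schema text.
write_schema() {
  cat > "$1" <<'EOF'
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "price": {"type": "number"}
  }
}
EOF
}

# Usage (not executed here):
#   write_schema schema.json
#   webclaw https://example.com --extract-json @schema.json
```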
# Summarize a page
webclaw https://example.com --summarize
webclaw https://example.com --summarize 5
# Natural language extraction
webclaw https://example.com --extract-prompt "Get all pricing tiers"
webclaw https://example.com --extract-prompt "List every author name and their role"
# JSON schema extraction
webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}'
# Schema from a file
webclaw https://example.com --extract-json @schema.json
# Force OpenAI instead of Ollama
webclaw https://example.com --summarize --llm-provider openai
# Use a specific model
webclaw https://example.com --summarize --llm-provider anthropic --llm-model claude-sonnet-4-20250514
# Use an Anthropic-compatible endpoint
webclaw https://example.com --summarize \
--llm-provider anthropic \
--llm-base-url https://anthropic-proxy.example.com/v1
Install Ollama from ollama.ai and webclaw will use it automatically. For OpenAI and Anthropic, set the standard environment variables: OPENAI_API_KEY or ANTHROPIC_API_KEY.
Provider compatibility
webclaw 0.5.9 adds two compatibility controls for teams running local or OpenAI-compatible LLM backends.
| Environment variable | Description |
|---|---|
| ANTHROPIC_BASE_URL | Default base URL for Anthropic-compatible providers. Falls back to the official Anthropic API when unset. |
| OPENAI_RESPONSE_FORMAT_TYPE | OpenAI-compatible JSON response mode: json_object, json_schema, or text. Defaults to json_object for official OpenAI. |
# LM Studio may reject OpenAI json_object mode.
# Use text or json_schema for OpenAI-compatible local backends.
export OPENAI_BASE_URL=http://localhost:1234/v1
export OPENAI_API_KEY=lm-studio
export OPENAI_RESPONSE_FORMAT_TYPE=text
webclaw https://example.com \
--extract-prompt "Extract the product name and price" \
--llm-provider openai
Leave OPENAI_RESPONSE_FORMAT_TYPE unset when using official OpenAI. The default remains json_object.
PDF extraction
webclaw auto-detects PDF documents via the Content-Type header and extracts text content. No special flags needed for basic PDF extraction.
| Flag | Description |
|---|---|
| --pdf-mode auto | Default mode. Extracts text from PDFs. Returns an error if text extraction fails. |
| --pdf-mode fast | Returns empty content on extraction failure instead of erroring. |
# Auto-detected from Content-Type
webclaw https://example.com/report.pdf
# Fast mode (skip failures silently)
webclaw https://example.com/report.pdf --pdf-mode fast
Other options
| Flag | Description |
|---|---|
| -t <seconds> | Request timeout in seconds. Default: 30. |
| -v | Enable verbose logging. Shows request details, timing, and extraction stats. |
# Longer timeout for slow sites
webclaw https://slow-site.example.com -t 60
# Verbose output for debugging
webclaw https://example.com -v
Cloud API
By default, webclaw does everything locally: direct HTTP fetch with TLS impersonation plus local content extraction. When you provide an API key, the CLI gains access to the webclaw cloud infrastructure for handling sites that require bot protection bypass, JavaScript rendering, or proxy rotation.
| Flag | Description |
|---|---|
| --api-key <key> | Set your webclaw API key. Can also be set via the WEBCLAW_API_KEY environment variable. |
| --cloud | Force all requests through the cloud API. Skips local extraction entirely. |
There are three modes of operation depending on how you configure the API key and cloud flag:
Local only (default)
No API key set. All fetching and extraction happens on your machine. Fast and free, but cannot bypass bot protection or render JavaScript.
Automatic fallback
API key set, no --cloud flag. The CLI tries locally first. If bot protection or JS rendering is detected, it automatically retries through api.webclaw.io. Best of both worlds: fast local extraction for simple sites, cloud power when needed.
Cloud forced
The --cloud flag sends every request directly through the cloud API. Use this for sites that always need antibot bypass or JS rendering, avoiding the wasted local attempt.
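The three modes above reduce to a small decision rule, sketched here as a shell function. This only restates the documented rules and is not how webclaw is implemented. The name webclaw_mode is hypothetical, the sketch ignores a key passed through --api-key, and the behavior of --cloud without any key is not covered by the text above.

```shell
# webclaw_mode [--cloud]: print the mode the documented rules would select.
# Reads WEBCLAW_API_KEY from the environment only.
webclaw_mode() {
  if [ "${1:-}" = "--cloud" ]; then
    echo "cloud"      # every request goes through the cloud API
  elif [ -n "${WEBCLAW_API_KEY:-}" ]; then
    echo "fallback"   # local first, cloud retry when needed
  else
    echo "local"      # no key: everything stays on this machine
  fi
}
```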
# Set API key via environment
export WEBCLAW_API_KEY=wc_your_key_here
# Automatic fallback (tries local first, cloud on failure)
webclaw https://protected-site.com
# Force cloud mode
webclaw https://spa-site.com --cloud
# Pass key directly
webclaw https://example.com --cloud --api-key wc_your_key_here
| Feature | Local | Cloud |
|---|---|---|
| Bot protection bypass | No | Yes |
| JS rendering | No | Yes |
| Proxy rotation | Manual (--proxy-file) | Automatic |
| Speed | Fast | Depends on server |
| Cost | Free | Uses API credits |
| Privacy | Data stays local | Data processed on server |
Set WEBCLAW_API_KEY in your shell profile and the CLI will handle the rest automatically.
Complete examples
Common workflows combining multiple flags.
Extract a blog post for an LLM
webclaw https://blog.example.com/post \
-f llm \
--include "article" \
--exclude ".comments, .related-posts"
Crawl documentation with proxy rotation
webclaw https://docs.example.com \
--crawl \
--depth 2 \
--max-pages 50 \
--sitemap \
--proxy-file proxies.txt \
-b random \
-f llm
Extract structured pricing data
webclaw https://example.com/pricing \
--extract-json '{
  "type": "object",
  "properties": {
    "plans": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "string"},
          "features": {"type": "array", "items": {"type": "string"}}
        }
      }
    }
  }
}'
Monitor a page for changes
# Initial snapshot
webclaw https://example.com/status -f json > baseline.json
# Check for changes (run periodically)
webclaw https://example.com/status --diff-with baseline.json
Batch extract with Firefox impersonation
webclaw \
https://site-a.com \
https://site-b.com \
https://site-c.com \
-b firefox \
-f llm \
--metadata
Cloud crawl of a protected SPA
webclaw https://app.example.com \
--cloud \
--crawl \
--depth 2 \
--max-pages 30 \
-f llm