CLI Reference
Complete reference for the webclaw command-line tool. Every flag, every option, with practical examples.
Basic extraction
Pass one or more URLs as positional arguments. webclaw fetches each page, extracts the main content, and outputs clean markdown.
| Usage | Description |
|---|---|
| webclaw <url> | Extract a single URL. |
| webclaw url1 url2 url3 | Batch extract multiple URLs in one command. |
| --urls-file <file> | Read URLs from a file, one per line. |
| --file <path> | Read HTML from a local file instead of fetching. |
| --stdin | Read HTML from stdin. |
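A few illustrative invocations (the URLs and filenames are placeholders):

```shell
# Extract a single page to markdown (the default format)
webclaw https://example.com/article

# Batch-extract several pages in one run
webclaw https://example.com/a https://example.com/b https://example.com/c

# Read URLs from a file, one per line
webclaw --urls-file urls.txt

# Read HTML from a local file, or pipe it in over stdin
webclaw --file page.html
curl -s https://example.com | webclaw --stdin
```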
Output formats
Control the output format with the -f flag. The default is markdown.
| Flag | Description |
|---|---|
| -f markdown | Clean markdown with resolved URLs and collected assets. This is the default. |
| -f text | Plain text with no formatting. |
| -f json | Full ExtractionResult as JSON. Includes metadata, content, word count, and extracted URLs. |
| -f llm | LLM-optimized output. 9-step pipeline including image stripping, emphasis removal, link deduplication, stat merging, and whitespace collapse. Includes a metadata header. |
| --metadata | Include page metadata (title, description, OG tags) in the output. |
| --raw-html | Output the raw HTML response without any extraction processing. |
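For example (placeholder URL):

```shell
# Full extraction result as JSON, with page metadata included
webclaw -f json --metadata https://example.com/article

# LLM-optimized output, saved for downstream prompting
webclaw -f llm https://example.com/article > article.txt
```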
llm format typically achieves 67% fewer tokens than raw HTML while preserving all meaningful content. Use it when feeding content to language models.

Content filtering
Use CSS selectors to control what content is extracted. Include mode is exclusive: only matched elements are returned. Exclude mode removes matched elements from the normal extraction.
| Flag | Description |
|---|---|
| --include <selectors> | CSS selectors to extract. Comma-separated. Exclusive mode: only these elements are returned. |
| --exclude <selectors> | CSS selectors to remove from extraction. Comma-separated. |
| --only-main-content | Extract only article, main, or [role="main"] elements. |
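A sketch of both modes (selectors and URL are illustrative):

```shell
# Include mode: return only the article body
webclaw --include "article, .post-body" https://example.com/post

# Exclude mode: run normal extraction, but drop navigation and ads
webclaw --exclude "nav, footer, .ads" https://example.com/post

# Shortcut for the common case
webclaw --only-main-content https://example.com/post
```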
Browser impersonation
webclaw uses Impit to impersonate real browser TLS fingerprints. This makes requests indistinguishable from actual browser traffic at the TLS layer. Each browser option includes multiple version profiles.
| Flag | Description |
|---|---|
| -b chrome | Chrome profiles: v142, v136, v133, v131. This is the default. |
| -b firefox | Firefox profiles: v144, v135, v133, v128. |
| -b random | Random browser profile per request. Useful for bulk extraction. |
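For example:

```shell
# Impersonate Firefox for a single fetch
webclaw -b firefox https://example.com

# Randomize the browser profile across a bulk run
webclaw -b random --urls-file urls.txt
```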
Proxy
Route requests through HTTP proxies. Supports single proxy or pool rotation.
| Flag | Description |
|---|---|
| -p <url> | Single proxy. Format: http://user:pass@host:port |
| --proxy-file <file> | Proxy pool file. One proxy per line in host:port:user:pass format. Rotates per request. |
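For example (proxy host and credentials are placeholders):

```shell
# Route everything through a single proxy
webclaw -p http://user:pass@proxy.example.com:8080 https://example.com

# Rotate through a proxy pool, one proxy per request
webclaw --proxy-file proxies.txt --urls-file urls.txt
```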
webclaw automatically loads proxies.txt from the working directory if present. No flag needed. Proxy rotation is per-request, not per-client, so each request in a batch or crawl uses a different proxy from the pool.

Crawling
BFS same-origin crawler. Discovers and extracts pages by following links within the same domain.
| Flag | Description |
|---|---|
| --crawl | Enable BFS crawling from the given URL. |
| --depth <n> | Maximum crawl depth. Default: 1. |
| --max-pages <n> | Maximum number of pages to crawl. Default: 20. |
| --concurrency <n> | Number of parallel requests. Default: 5. |
| --delay <ms> | Delay between requests in milliseconds. Default: 100. |
| --path-prefix <path> | Only crawl URLs whose path starts with this prefix. |
| --sitemap | Seed the crawl queue from sitemap discovery before starting BFS. |
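For example, crawling a documentation site (URL and limits are illustrative):

```shell
# Crawl two levels deep, up to 50 pages, 10 requests in parallel
webclaw --crawl --depth 2 --max-pages 50 --concurrency 10 https://example.com/docs

# Seed from the sitemap and stay under /blog
webclaw --crawl --sitemap --path-prefix /blog https://example.com
```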
Sitemap discovery
Discover all URLs from a site's sitemap.xml and robots.txt without crawling.
| Flag | Description |
|---|---|
| --map | Discover URLs from sitemap.xml and robots.txt. Outputs the URL list. |
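For example:

```shell
# List every URL the site advertises, without fetching the pages
webclaw --map https://example.com
```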
Change tracking
Snapshot a page as JSON and compare against a future extraction to see what changed.
| Flag | Description |
|---|---|
| --diff-with <file> | Compare the current extraction against a previous JSON snapshot file. |
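A typical two-step workflow (filenames are placeholders):

```shell
# Step 1: snapshot the page as JSON
webclaw -f json https://example.com/pricing > snapshot.json

# Step 2: later, compare a fresh extraction against the snapshot
webclaw --diff-with snapshot.json https://example.com/pricing
```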
Brand extraction
Extract brand identity from a website: colors, fonts, logo URL, and favicon. Analyzes both DOM structure and CSS.
| Flag | Description |
|---|---|
| --brand | Extract brand identity (colors, fonts, logo, favicon) from the page. |
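For example:

```shell
# Pull colors, fonts, logo URL, and favicon from a site's landing page
webclaw --brand https://example.com
```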
LLM features
webclaw can use LLMs to extract structured data, answer questions about page content, or summarize. The provider chain tries Ollama first (local, free), then OpenAI, then Anthropic.
| Flag | Description |
|---|---|
| --extract-json <schema> | Extract data matching a JSON schema. Pass the schema as a string or use @file to read from a file. |
| --extract-prompt <text> | Natural language extraction. Describe what you want and the LLM extracts it. |
| --summarize [sentences] | Summarize the page content. Default: 3 sentences. |
| --llm-provider <name> | Force a specific LLM provider: ollama, openai, or anthropic. |
| --llm-model <name> | Override the default model for the chosen provider. |
| --llm-base-url <url> | Override the provider's base URL (useful for proxies or custom deployments). |
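A few illustrative invocations (the schema and URLs are placeholders):

```shell
# Structured extraction against an inline JSON schema
webclaw --extract-json '{"type":"object","properties":{"price":{"type":"number"}}}' https://example.com/product

# Same, but reading the schema from a file
webclaw --extract-json @schema.json https://example.com/product

# Natural-language extraction
webclaw --extract-prompt "list all product names and prices" https://example.com/shop

# Five-sentence summary, forcing the local Ollama provider
webclaw --summarize 5 --llm-provider ollama https://example.com/article
```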
Install Ollama from ollama.ai and webclaw will use it automatically. For OpenAI and Anthropic, set the standard environment variables: OPENAI_API_KEY or ANTHROPIC_API_KEY.

PDF extraction
webclaw auto-detects PDF documents via the Content-Type header and extracts text content. No special flags needed for basic PDF extraction.
| Flag | Description |
|---|---|
| --pdf-mode auto | Default mode. Extracts text from PDFs. Returns an error if text extraction fails. |
| --pdf-mode fast | Returns empty content on extraction failure instead of erroring. |
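For example:

```shell
# PDFs are detected automatically; no extra flag needed
webclaw https://example.com/whitepaper.pdf

# Tolerate extraction failures across a bulk PDF run
webclaw --pdf-mode fast --urls-file pdfs.txt
```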
Other options
| Flag | Description |
|---|---|
| -t <seconds> | Request timeout in seconds. Default: 30. |
| -v | Enable verbose logging. Shows request details, timing, and extraction stats. |
Complete examples
Common workflows combining multiple flags.
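Two illustrative workflows built from the flags documented above (URLs and filenames are placeholders):

```shell
# Crawl a docs site: seed from the sitemap, rotate browser profiles and
# proxies, and emit LLM-optimized output for downstream prompting
webclaw --crawl --depth 2 --sitemap --proxy-file proxies.txt \
  -b random -f llm https://example.com/docs

# Structured extraction from the main content only, with a longer
# timeout and verbose logging for debugging
webclaw --only-main-content --extract-json @schema.json \
  -t 60 -v https://example.com/product
```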