# webclaw CLI Reference

Complete reference for the `webclaw` command-line tool. Every flag, every option, with practical examples.

## Basic extraction

Pass one or more URLs as positional arguments. `webclaw` fetches each page, extracts the main content, and outputs clean markdown.

| Flag | Description |
| --- | --- |
| `webclaw <url>` | Extract a single URL. |
| `webclaw url1 url2 url3` | Batch-extract multiple URLs in one command. |
| `--urls-file <file>` | Read URLs from a file, one per line. |
| `--file <path>` | Read HTML from a local file instead of fetching. |
| `--stdin` | Read HTML from stdin. |

### Examples

```bash
# Single URL
webclaw https://example.com

# Multiple URLs
webclaw https://example.com https://news.ycombinator.com

# From a file list
webclaw --urls-file urls.txt

# Local HTML file
webclaw --file page.html

# Pipe from another command
curl -s https://example.com | webclaw --stdin
```
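The list file passed to `--urls-file` is plain text with one URL per line. A hypothetical `urls.txt` (these URLs are placeholders):

```text
https://example.com
https://example.com/about
https://news.ycombinator.com
```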

## Output formats

Control the output format with the `-f` flag. The default is markdown.

| Flag | Description |
| --- | --- |
| `-f markdown` | Clean markdown with resolved URLs and collected assets. This is the default. |
| `-f text` | Plain text with no formatting. |
| `-f json` | Full `ExtractionResult` as JSON. Includes metadata, content, word count, and extracted URLs. |
| `-f llm` | LLM-optimized output: a 9-step pipeline including image stripping, emphasis removal, link deduplication, stat merging, and whitespace collapse. Includes a metadata header. |
| `--metadata` | Include page metadata (title, description, OG tags) in the output. |
| `--raw-html` | Output the raw HTML response without any extraction processing. |

### Examples

```bash
# Default markdown
webclaw https://example.com

# LLM-optimized for minimal token usage
webclaw https://example.com -f llm

# Full JSON with metadata
webclaw https://example.com -f json

# Plain text
webclaw https://example.com -f text

# Markdown with metadata header
webclaw https://example.com --metadata

# Save JSON snapshot for later diffing
webclaw https://example.com -f json > snapshot.json
```

> **Tip:** The `llm` format typically achieves 67% fewer tokens than raw HTML while preserving all meaningful content. Use it when feeding content to language models.

## Content filtering

Use CSS selectors to control what content is extracted. Include mode is exclusive: only matched elements are returned. Exclude mode removes matched elements from the normal extraction.

| Flag | Description |
| --- | --- |
| `--include <selectors>` | Comma-separated CSS selectors to extract. Exclusive mode: only these elements are returned. |
| `--exclude <selectors>` | Comma-separated CSS selectors to remove from extraction. |
| `--only-main-content` | Extract only `article`, `main`, or `[role="main"]` elements. |

### Examples

```bash
# Extract only the article body
webclaw https://example.com --include "article"

# Extract specific sections
webclaw https://example.com --include ".post-content, .comments"

# Remove navigation and footer noise
webclaw https://example.com --exclude "nav, footer, .sidebar"

# Combine both
webclaw https://example.com --include "main" --exclude ".ads, .related-posts"

# Quick mode: just the main content area
webclaw https://example.com --only-main-content
```

## Browser impersonation

webclaw uses Impit to impersonate real browser TLS fingerprints, making requests indistinguishable from actual browser traffic at the TLS layer. Each browser option includes multiple version profiles.

| Flag | Description |
| --- | --- |
| `-b chrome` | Chrome profiles: v142, v136, v133, v131. This is the default. |
| `-b firefox` | Firefox profiles: v144, v135, v133, v128. |
| `-b random` | Random browser profile per request. Useful for bulk extraction. |

### Examples

```bash
# Default Chrome impersonation
webclaw https://example.com

# Firefox fingerprint
webclaw https://example.com -b firefox

# Random profile per request (good for batch)
webclaw url1 url2 url3 -b random
```

> **Note:** This is TLS fingerprint impersonation, not a headless browser. No browser engine is launched, so requests complete in milliseconds, not seconds.

## Proxy

Route requests through HTTP proxies. Supports a single proxy or pool rotation.

| Flag | Description |
| --- | --- |
| `-p <url>` | Single proxy. Format: `http://user:pass@host:port` |
| `--proxy-file <file>` | Proxy pool file, one proxy per line in `host:port:user:pass` format. Rotates per request. |

### Examples

```bash
# Single proxy
webclaw https://example.com -p http://user:pass@proxy.example.com:8080

# Proxy pool with rotation
webclaw https://example.com --proxy-file proxies.txt

# Batch extraction with proxy rotation
webclaw --urls-file urls.txt --proxy-file proxies.txt
```

> **Tip:** webclaw auto-loads a file named `proxies.txt` from the working directory if present; no flag needed. Proxy rotation is per-request, not per-client, so each request in a batch or crawl uses a different proxy from the pool.
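To illustrate the `host:port:user:pass` pool format, a `proxies.txt` might contain (hosts and credentials here are placeholders):

```text
proxy-a.example.com:8080:user1:pass1
proxy-b.example.com:8080:user2:pass2
proxy-c.example.com:3128:user3:pass3
```

Drop this file in the working directory and it is picked up automatically.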

## Crawling

A breadth-first (BFS), same-origin crawler. Discovers and extracts pages by following links within the same domain.

| Flag | Description |
| --- | --- |
| `--crawl` | Enable BFS crawling from the given URL. |
| `--depth <n>` | Maximum crawl depth. Default: 1. |
| `--max-pages <n>` | Maximum number of pages to crawl. Default: 20. |
| `--concurrency <n>` | Number of parallel requests. Default: 5. |
| `--delay <ms>` | Delay between requests in milliseconds. Default: 100. |
| `--path-prefix <path>` | Only crawl URLs whose path starts with this prefix. |
| `--sitemap` | Seed the crawl queue from sitemap discovery before starting BFS. |

### Examples

```bash
# Basic crawl, 1 level deep
webclaw https://docs.example.com --crawl

# Deep crawl with limits
webclaw https://docs.example.com --crawl --depth 3 --max-pages 100

# Faster crawl with more concurrency
webclaw https://docs.example.com --crawl --concurrency 10 --delay 50

# Only crawl the /api/ section
webclaw https://docs.example.com --crawl --path-prefix /api/

# Seed from sitemap first, then crawl
webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap
```

> **Warning:** Crawling is same-origin only; webclaw will not follow links to external domains. Respect the target site by keeping concurrency and depth reasonable.

## Sitemap discovery

Discover all URLs from a site's sitemap.xml and robots.txt without crawling.

| Flag | Description |
| --- | --- |
| `--map` | Discover URLs from sitemap.xml and robots.txt. Outputs the URL list. |

### Examples

```bash
# Discover all URLs from sitemap
webclaw https://docs.example.com --map

# Save discovered URLs to a file, then extract them
webclaw https://docs.example.com --map > urls.txt
webclaw --urls-file urls.txt -f llm
```

## Change tracking

Snapshot a page as JSON and compare it against a future extraction to see what changed.

| Flag | Description |
| --- | --- |
| `--diff-with <file>` | Compare the current extraction against a previous JSON snapshot file. |

### Examples

```bash
# Take a snapshot
webclaw https://example.com -f json > snapshot.json

# Later, check what changed
webclaw https://example.com --diff-with snapshot.json
```

## Brand extraction

Extract brand identity from a website: colors, fonts, logo URL, and favicon. Analyzes both the DOM structure and the CSS.

| Flag | Description |
| --- | --- |
| `--brand` | Extract brand identity (colors, fonts, logo, favicon) from the page. |

### Example

```bash
webclaw https://example.com --brand
```

## LLM features

webclaw can use LLMs to extract structured data, answer questions about page content, or summarize. The provider chain tries Ollama first (local, free), then OpenAI, then Anthropic.

| Flag | Description |
| --- | --- |
| `--extract-json <schema>` | Extract data matching a JSON schema. Pass the schema as a string, or use `@file` to read it from a file. |
| `--extract-prompt <text>` | Natural-language extraction. Describe what you want and the LLM extracts it. |
| `--summarize [sentences]` | Summarize the page content. Default: 3 sentences. |
| `--llm-provider <name>` | Force a specific LLM provider: `ollama`, `openai`, or `anthropic`. |
| `--llm-model <name>` | Override the default model for the chosen provider. |
| `--llm-base-url <url>` | Override the provider's base URL (useful for proxies or custom deployments). |

### Examples

```bash
# Summarize a page
webclaw https://example.com --summarize
webclaw https://example.com --summarize 5

# Natural language extraction
webclaw https://example.com --extract-prompt "Get all pricing tiers"
webclaw https://example.com --extract-prompt "List every author name and their role"

# JSON schema extraction
webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}'

# Schema from a file
webclaw https://example.com --extract-json @schema.json

# Force OpenAI instead of Ollama
webclaw https://example.com --summarize --llm-provider openai

# Use a specific model
webclaw https://example.com --summarize --llm-provider anthropic --llm-model claude-sonnet-4-20250514
```

> **Note:** Ollama runs locally and requires no API key; install it from ollama.ai and webclaw will use it automatically. For OpenAI and Anthropic, set the standard environment variables `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`.
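For the `@file` form of `--extract-json`, the file holds an ordinary JSON Schema. A `schema.json` equivalent to the inline example above:

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "price": { "type": "number" }
  }
}
```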

## PDF extraction

webclaw auto-detects PDF documents via the Content-Type header and extracts their text content. No special flags are needed for basic PDF extraction.

| Flag | Description |
| --- | --- |
| `--pdf-mode auto` | Default mode. Extracts text from PDFs; returns an error if text extraction fails. |
| `--pdf-mode fast` | Returns empty content on extraction failure instead of erroring. |

### Examples

```bash
# Auto-detected from Content-Type
webclaw https://example.com/report.pdf

# Fast mode (skip failures silently)
webclaw https://example.com/report.pdf --pdf-mode fast
```

## Other options

| Flag | Description |
| --- | --- |
| `-t <seconds>` | Request timeout in seconds. Default: 30. |
| `-v` | Enable verbose logging. Shows request details, timing, and extraction stats. |

### Examples

```bash
# Longer timeout for slow sites
webclaw https://slow-site.example.com -t 60

# Verbose output for debugging
webclaw https://example.com -v
```

## Complete examples

Common workflows combining multiple flags.

### Extract a blog post for an LLM

```bash
webclaw https://blog.example.com/post \
  -f llm \
  --include "article" \
  --exclude ".comments, .related-posts"
```

### Crawl documentation with proxy rotation

```bash
webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 50 \
  --sitemap \
  --proxy-file proxies.txt \
  -b random \
  -f llm
```

### Extract structured pricing data

```bash
webclaw https://example.com/pricing \
  --extract-json '{
    "type": "object",
    "properties": {
      "plans": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "price": {"type": "string"},
            "features": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    }
  }'
```

### Monitor a page for changes

```bash
# Initial snapshot
webclaw https://example.com/status -f json > baseline.json

# Check for changes (run periodically)
webclaw https://example.com/status --diff-with baseline.json
```
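The periodic check can be scheduled with cron. A hypothetical crontab entry (assuming `webclaw` is on the cron user's `PATH`; paths are placeholders) that checks hourly and appends output to a log:

```text
0 * * * * webclaw https://example.com/status --diff-with /home/me/baseline.json >> /home/me/changes.log 2>&1
```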

### Batch extract with Firefox impersonation

```bash
webclaw \
  https://site-a.com \
  https://site-b.com \
  https://site-c.com \
  -b firefox \
  -f llm \
  --metadata
```