# webclaw CLI Reference

Complete reference for the `webclaw` command-line tool. Every flag, every option, with practical examples.

## Basic extraction

Pass one or more URLs as positional arguments. `webclaw` fetches each page, extracts the main content, and outputs clean markdown.

| Flag | Description |
| --- | --- |
| `webclaw <url>` | Extract a single URL. |
| `webclaw url1 url2 url3` | Batch-extract multiple URLs in one command. |
| `--urls-file <file>` | Read URLs from a file, one per line. |
| `--file <path>` | Read HTML from a local file instead of fetching. |
| `--stdin` | Read HTML from stdin. |

### Examples

```bash
# Single URL
webclaw https://example.com

# Multiple URLs
webclaw https://example.com https://news.ycombinator.com

# From a file list
webclaw --urls-file urls.txt

# Local HTML file
webclaw --file page.html

# Pipe from another command
curl -s https://example.com | webclaw --stdin
```
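The list file passed to `--urls-file` is plain text with one URL per line. A hypothetical `urls.txt` (these URLs are placeholders):

```text
https://example.com
https://example.com/about
https://news.ycombinator.com
```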

## Output formats

Control the output format with the `-f` flag. The default is markdown.

| Flag | Description |
| --- | --- |
| `-f markdown` | Clean markdown with resolved URLs and collected assets. This is the default. |
| `-f text` | Plain text with no formatting. |
| `-f json` | Full `ExtractionResult` as JSON. Includes metadata, content, word count, and extracted URLs. |
| `-f llm` | LLM-optimized output: a 9-step pipeline including image stripping, emphasis removal, link deduplication, stat merging, and whitespace collapse. Includes a metadata header. |
| `--metadata` | Include page metadata (title, description, OG tags) in the output. |
| `--raw-html` | Output the raw HTML response without any extraction processing. |

### Examples

```bash
# Default markdown
webclaw https://example.com

# LLM-optimized for minimal token usage
webclaw https://example.com -f llm

# Full JSON with metadata
webclaw https://example.com -f json

# Plain text
webclaw https://example.com -f text

# Markdown with metadata header
webclaw https://example.com --metadata

# Save JSON snapshot for later diffing
webclaw https://example.com -f json > snapshot.json
```

> **Tip:** The `llm` format typically achieves 67% fewer tokens than raw HTML while preserving all meaningful content. Use it when feeding content to language models.

## Content filtering

Use CSS selectors to control what content is extracted. Include mode is exclusive: only matched elements are returned. Exclude mode removes matched elements from the normal extraction.

| Flag | Description |
| --- | --- |
| `--include <selectors>` | Comma-separated CSS selectors to extract. Exclusive mode: only these elements are returned. |
| `--exclude <selectors>` | Comma-separated CSS selectors to remove from extraction. |
| `--only-main-content` | Extract only `article`, `main`, or `[role="main"]` elements. |

### Examples

```bash
# Extract only the article body
webclaw https://example.com --include "article"

# Extract specific sections
webclaw https://example.com --include ".post-content, .comments"

# Remove navigation and footer noise
webclaw https://example.com --exclude "nav, footer, .sidebar"

# Combine both
webclaw https://example.com --include "main" --exclude ".ads, .related-posts"

# Quick mode: just the main content area
webclaw https://example.com --only-main-content
```

## Browser impersonation

webclaw uses Impit to impersonate real browser TLS fingerprints, making requests indistinguishable from actual browser traffic at the TLS layer. Each browser option includes multiple version profiles.

| Flag | Description |
| --- | --- |
| `-b chrome` | Chrome profiles: v142, v136, v133, v131. This is the default. |
| `-b firefox` | Firefox profiles: v144, v135, v133, v128. |
| `-b random` | Random browser profile per request. Useful for bulk extraction. |

### Examples

```bash
# Default Chrome impersonation
webclaw https://example.com

# Firefox fingerprint
webclaw https://example.com -b firefox

# Random profile per request (good for batch)
webclaw url1 url2 url3 -b random
```

> **Note:** This is TLS fingerprint impersonation, not a headless browser. No browser engine is launched, so requests complete in milliseconds, not seconds.

## Proxy

Route requests through HTTP proxies. Supports a single proxy or pool rotation.

| Flag | Description |
| --- | --- |
| `-p <url>` | Single proxy. Format: `http://user:pass@host:port` |
| `--proxy-file <file>` | Proxy pool file, one proxy per line in `host:port:user:pass` format. Rotates per request. |

### Examples

```bash
# Single proxy
webclaw https://example.com -p http://user:pass@proxy.example.com:8080

# Proxy pool with rotation
webclaw https://example.com --proxy-file proxies.txt

# Batch extraction with proxy rotation
webclaw --urls-file urls.txt --proxy-file proxies.txt
```

> **Tip:** webclaw auto-loads a file named `proxies.txt` from the working directory if present; no flag needed. Proxy rotation is per-request, not per-client, so each request in a batch or crawl uses a different proxy from the pool.
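To illustrate the `host:port:user:pass` pool format, a `proxies.txt` might contain (hosts and credentials here are placeholders):

```text
proxy-a.example.com:8080:user1:pass1
proxy-b.example.com:8080:user2:pass2
proxy-c.example.com:3128:user3:pass3
```

Drop this file in the working directory and it is picked up automatically.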

## Crawling

A breadth-first (BFS), same-origin crawler. Discovers and extracts pages by following links within the same domain.

| Flag | Description |
| --- | --- |
| `--crawl` | Enable BFS crawling from the given URL. |
| `--depth <n>` | Maximum crawl depth. Default: 1. |
| `--max-pages <n>` | Maximum number of pages to crawl. Default: 20. |
| `--concurrency <n>` | Number of parallel requests. Default: 5. |
| `--delay <ms>` | Delay between requests in milliseconds. Default: 100. |
| `--path-prefix <path>` | Only crawl URLs whose path starts with this prefix. |
| `--sitemap` | Seed the crawl queue from sitemap discovery before starting BFS. |

### Examples

```bash
# Basic crawl, 1 level deep
webclaw https://docs.example.com --crawl

# Deep crawl with limits
webclaw https://docs.example.com --crawl --depth 3 --max-pages 100

# Faster crawl with more concurrency
webclaw https://docs.example.com --crawl --concurrency 10 --delay 50

# Only crawl the /api/ section
webclaw https://docs.example.com --crawl --path-prefix /api/

# Seed from sitemap first, then crawl
webclaw https://docs.example.com --crawl --depth 2 --max-pages 50 --sitemap
```

> **Warning:** Crawling is same-origin only; webclaw will not follow links to external domains. Respect the target site by keeping concurrency and depth reasonable.

## Sitemap discovery

Discover all URLs from a site's sitemap.xml and robots.txt without crawling.

| Flag | Description |
| --- | --- |
| `--map` | Discover URLs from sitemap.xml and robots.txt. Outputs the URL list. |

### Examples

```bash
# Discover all URLs from sitemap
webclaw https://docs.example.com --map

# Save discovered URLs to a file, then extract them
webclaw https://docs.example.com --map > urls.txt
webclaw --urls-file urls.txt -f llm
```

## Change tracking

Snapshot a page as JSON and compare it against a future extraction to see what changed.

| Flag | Description |
| --- | --- |
| `--diff-with <file>` | Compare the current extraction against a previous JSON snapshot file. |

### Examples

```bash
# Take a snapshot
webclaw https://example.com -f json > snapshot.json

# Later, check what changed
webclaw https://example.com --diff-with snapshot.json
```

## Brand extraction

Extract brand identity from a website: colors, fonts, logo URL, and favicon. Analyzes both the DOM structure and the CSS.

| Flag | Description |
| --- | --- |
| `--brand` | Extract brand identity (colors, fonts, logo, favicon) from the page. |

### Example

```bash
webclaw https://example.com --brand
```

## LLM features

webclaw can use LLMs to extract structured data, answer questions about page content, or summarize. The provider chain tries Ollama first (local, free), then OpenAI, then Anthropic.

| Flag | Description |
| --- | --- |
| `--extract-json <schema>` | Extract data matching a JSON schema. Pass the schema as a string, or use `@file` to read it from a file. |
| `--extract-prompt <text>` | Natural-language extraction. Describe what you want and the LLM extracts it. |
| `--summarize [sentences]` | Summarize the page content. Default: 3 sentences. |
| `--llm-provider <name>` | Force a specific LLM provider: `ollama`, `openai`, or `anthropic`. |
| `--llm-model <name>` | Override the default model for the chosen provider. |
| `--llm-base-url <url>` | Override the provider's base URL (useful for proxies or custom deployments). |

### Examples

```bash
# Summarize a page
webclaw https://example.com --summarize
webclaw https://example.com --summarize 5

# Natural language extraction
webclaw https://example.com --extract-prompt "Get all pricing tiers"
webclaw https://example.com --extract-prompt "List every author name and their role"

# JSON schema extraction
webclaw https://example.com --extract-json '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}'

# Schema from a file
webclaw https://example.com --extract-json @schema.json

# Force OpenAI instead of Ollama
webclaw https://example.com --summarize --llm-provider openai

# Use a specific model
webclaw https://example.com --summarize --llm-provider anthropic --llm-model claude-sonnet-4-20250514
```

> **Note:** Ollama runs locally and requires no API key; install it from ollama.ai and webclaw will use it automatically. For OpenAI and Anthropic, set the standard environment variables `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`.
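For the `@file` form of `--extract-json`, the file holds an ordinary JSON Schema. A `schema.json` equivalent to the inline example above:

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "price": { "type": "number" }
  }
}
```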

## PDF extraction

webclaw auto-detects PDF documents via the Content-Type header and extracts their text content. No special flags are needed for basic PDF extraction.

| Flag | Description |
| --- | --- |
| `--pdf-mode auto` | Default mode. Extracts text from PDFs; returns an error if text extraction fails. |
| `--pdf-mode fast` | Returns empty content on extraction failure instead of erroring. |

### Examples

```bash
# Auto-detected from Content-Type
webclaw https://example.com/report.pdf

# Fast mode (skip failures silently)
webclaw https://example.com/report.pdf --pdf-mode fast
```

## Other options

| Flag | Description |
| --- | --- |
| `-t <seconds>` | Request timeout in seconds. Default: 30. |
| `-v` | Enable verbose logging. Shows request details, timing, and extraction stats. |

### Examples

```bash
# Longer timeout for slow sites
webclaw https://slow-site.example.com -t 60

# Verbose output for debugging
webclaw https://example.com -v
```

## Complete examples

Common workflows combining multiple flags.

### Extract a blog post for an LLM

```bash
webclaw https://blog.example.com/post \
  -f llm \
  --include "article" \
  --exclude ".comments, .related-posts"
```

### Crawl documentation with proxy rotation

```bash
webclaw https://docs.example.com \
  --crawl \
  --depth 2 \
  --max-pages 50 \
  --sitemap \
  --proxy-file proxies.txt \
  -b random \
  -f llm
```

### Extract structured pricing data

```bash
webclaw https://example.com/pricing \
  --extract-json '{
    "type": "object",
    "properties": {
      "plans": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": {"type": "string"},
            "price": {"type": "string"},
            "features": {"type": "array", "items": {"type": "string"}}
          }
        }
      }
    }
  }'
```

### Monitor a page for changes

```bash
# Initial snapshot
webclaw https://example.com/status -f json > baseline.json

# Check for changes (run periodically)
webclaw https://example.com/status --diff-with baseline.json
```
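The periodic check can be scheduled with cron. A hypothetical crontab entry (assuming `webclaw` is on the cron user's `PATH`; paths are placeholders) that checks hourly and appends output to a log:

```text
0 * * * * webclaw https://example.com/status --diff-with /home/me/baseline.json >> /home/me/changes.log 2>&1
```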

### Batch extract with Firefox impersonation

```bash
webclaw \
  https://site-a.com \
  https://site-b.com \
  https://site-c.com \
  -b firefox \
  -f llm \
  --metadata
```