RAW HTTP — NO HEADLESS BROWSER OVERHEAD
MARKDOWN · JSON · HTML · LLM-READY FORMATS
MCP SERVER FOR AI AGENTS
TLS FINGERPRINT IMPERSONATION
EXTRACT · SUMMARIZE · DIFF · BRAND
SITEMAP DISCOVERY & DEEP CRAWLING
SELF-HOST OR USE OUR CLOUD API
BUILT IN RUST — FAST BY DEFAULT
DEEP RESEARCH — AI SYNTHESIZES REPORTS FROM 50+ SOURCES
WEB SEARCH — QUERY AND SCRAPE SEARCH RESULTS IN ONE CALL
AGENT SCRAPE — GIVE A GOAL, AI EXTRACTS WHAT YOU NEED
URL MONITORING — WATCH PAGES FOR CHANGES WITH WEBHOOKS
BONUS CREDITS — EARN FREE CREDITS BY STARRING AND REFERRING


The hidden cost of agent web fetch.

When an AI agent fetches a web page with a built-in tool, it reads every token of raw HTML internally — then hands you back a lossy summary. You pay full price, get partial data. webclaw inverts this.

What actually happens

I fetched webclaw.io/docs three ways and counted the tokens. Here is what each method costs.

Raw HTML (what WebFetch reads internally): 3,054 tokens
Standard markdown (webclaw default): 1,405 tokens
webclaw LLM format (token-optimized): ~950 tokens
WebFetch output (what your agent actually gets): ~300 tokens (lossy)

WebFetch is a black box. It consumes 3,054 tokens internally, then returns a ~300 token summary. Your agent spent 10× the tokens for 1/3 of the information. webclaw returns 950 tokens of complete, structured content — nothing hidden, nothing dropped.
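The arithmetic is simple enough to check yourself, using the token counts from the table above:

```python
# Token counts measured on webclaw.io/docs (from the table above).
raw_html = 3054        # what WebFetch reads internally
webclaw_llm = 950      # webclaw --format llm output
webfetch_output = 300  # lossy summary WebFetch hands back

# WebFetch pays for the whole raw page but returns a fraction of it.
overhead = raw_html / webfetch_output
print(f"WebFetch reads {overhead:.0f}x more tokens than it returns")

# webclaw: the tokens you pay for are the tokens you get.
reduction = 1 - webclaw_llm / raw_html
print(f"webclaw LLM format: {reduction:.0%} reduction vs raw HTML")
```

The same two ratios appear in the stats further down the page.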

What your agent actually sees

Same page. Same question. Different reality.

Built-in WebFetch

~300 tokens

The page is the official webclaw documentation. webclaw is a web extraction tool built in Rust designed for LLM pipelines. It provides a CLI, MCP server, and cloud API. The documentation covers installation, getting started guides, API reference, and CLI reference. The tool supports multiple output formats and is open source under AGPL-3.0.

agent can…

✗ name the CLI install command
✗ list the API endpoints
✗ write a working curl example
✗ know which output formats exist
✗ name the MCP tools
✗ explain the --format flag

webclaw LLM format

~950 tokens

output formats

llm: token-optimized, removes boilerplate
markdown: full page, all structure preserved
json: title, description, links, images
text: plain text strip

CLI

webclaw scrape <url> --format llm
webclaw extract <url> --prompt "..."
webclaw crawl <url> --limit 50

API endpoints

/v1/scrape · /v1/extract · /v1/crawl · /v1/batch · /v1/research

MCP tools

scrape, extract, crawl, search, map, batch, research, diff

agent can…

✓ name the CLI install command
✓ list the API endpoints
✓ write a working curl example
✓ know which output formats exist
✓ name the MCP tools
✓ explain the --format flag

WebFetch returned the vibe of the page. webclaw returned the actual endpoints, formats, CLI commands, and MCP tool names. For anything that requires real information — code generation, API integration, data extraction — the difference is whether your agent can do the task at all.

Why raw HTML costs so much

This is what 12,215 characters of a single documentation page looks like before extraction:

<div class="nx-mt-6 nx-leading-7 first:nx-mt-0"><h1 class="nx-mt-2 nx-text-4xl nx-font-bold nx-tracking-tight nx-text-slate-900 dark:nx-text-slate-100">Getting Started</h1><div class="nx-mt-6 nx-leading-7 first:nx-mt-0"><p>webclaw is a web extraction platform built for LLM pipelines. Install the CLI with cargo or use the cloud API directly.</p><div class="nx-mt-6 nx-leading-7 first:nx-mt-0"> <div class="nextra-code-block nx-relative nx-mt-6 first:nx-mt-0"> <pre class="bg-primary-700/5 nx-mb-4 nx-overflow-x-auto nx-rounded-xl nx-subpixel-antialiased dark:nx-bg-primary-300/10 nx-text-[.9em] contrast-more:nx-border contrast-more:nx-border-primary-900/20 contrast-more:dark:nx-border-primary-100/40"><code class="nx-border-black nx-border-opacity-[0.04] nx-bg-opacity-[0.03] nx-bg-black nx-break-words nx-rounded-md nx-border nx-py-0.5 nx-px-[.25em] nx-text-[.9em] dark:nx-border-white/10 dark:nx-bg-white/10">cargo install webclaw</code> </pre></div></div></div></div> ... [11,800 more characters of class names, aria labels, nav markup, script tags, and repeated boilerplate]

Class names, wrapper divs, aria attributes, script tags, repeated nav elements. None of it is information. All of it counts as tokens when a built-in fetch tool processes the raw page.
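This is not webclaw's extractor (that is built in Rust); but even a toy sketch with Python's stdlib `html.parser`, run on a fragment like the one above, shows how little of the markup is information:

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect visible text; drop tags, attributes, scripts, styles."""
    def __init__(self):
        super().__init__()
        self.parts, self.skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

# A shortened version of the fragment shown above.
html = ('<div class="nx-mt-6 nx-leading-7"><h1 class="nx-text-4xl">Getting Started</h1>'
        '<p>webclaw is a web extraction platform built for LLM pipelines.</p>'
        '<pre><code>cargo install webclaw</code></pre></div>')

parser = TextOnly()
parser.feed(html)
text = " ".join(parser.parts)
print(f"{len(html)} chars of HTML -> {len(text)} chars of text")
```

Every character the parser drops is a character a built-in fetch tool still pays to read.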

How to fix it

Use webclaw instead of your agent's built-in fetch — via CLI, MCP server, or API.

CLI

webclaw scrape https://docs.example.com --format llm

MCP (Claude Desktop / any MCP client)

// claude_desktop_config.json
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp"
    }
  }
}

API

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_KEY" \
  -d '{"url": "https://docs.example.com", "format": "llm"}'
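The same call from Python, as a sketch: the endpoint and body shape mirror the curl example above, and actually sending the request requires a real API key.

```python
import json

API_URL = "https://api.webclaw.io/v1/scrape"

def build_scrape_request(url, fmt="llm", api_key="YOUR_WEBCLAW_KEY"):
    # Mirrors the curl call above: POST with a Bearer token and a JSON body.
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"url": url, "format": fmt})
    return API_URL, headers, body

endpoint, headers, body = build_scrape_request("https://docs.example.com")
print(endpoint, body)
# Send with urllib.request or any HTTP client once you have a key.
```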

69% token reduction vs raw HTML
~3× more content than WebFetch returns
118ms avg latency for static pages

Token counts measured on webclaw.io/docs using cl100k_base tokenizer. WebFetch behavior observed in Claude Code session. Results vary by page but the pattern is consistent: raw HTML is expensive, built-in fetch tools summarize lossily, webclaw extracts precisely.

Stay in the loop

Get notified when the webclaw API launches. Early subscribers get extended free tier access.

No spam. Unsubscribe anytime.