Massi

HTML to markdown for LLMs. What raw web data actually costs you.

Here's a number that should bother you. A typical webpage is 50,000 to 200,000 tokens of raw HTML. The actual content on that page (the article, the product info, the documentation) is usually 500 to 2,000 tokens.

If you're feeding web data to an LLM, you're paying for every one of those tokens. The navigation bar. The footer with 47 links. The cookie consent banner. The inline SVGs. The data-testid attributes. The CSS classes that look like flex items-center justify-between px-4 py-2 bg-gradient-to-r from-blue-500 to-purple-600.

Your LLM reads all of it. Reasons over all of it. Bills you for all of it.

Why you can't just strip the tags

The first thing everyone tries is stripping HTML tags and keeping the text. innerText, regex, BeautifulSoup's .get_text(). It works for about five minutes.

Then you realize you've lost all structure. Headings are gone. Lists are flat paragraphs. Code blocks are unformatted. Links disappear entirely, and those links were the whole point of some pages. Tables become meaningless rows of words. Your LLM gets a wall of text with no hierarchy to reason about.
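The structure loss is easy to demonstrate with a one-line tag strip, a naive sketch rather than any particular library:

```python
import re

html = "<h2>Install</h2><ul><li>Run the installer</li><li>Restart the shell</li></ul>"

# Naive tag stripping: replace every tag with a space, collapse whitespace.
flat = re.sub(r"<[^>]+>", " ", html)
flat = re.sub(r"\s+", " ", flat).strip()

# The heading and the two list items are now one undifferentiated run:
# "Install Run the installer Restart the shell"
```

The heading, the list boundaries, and any links are gone; the model has no way to tell a title from a step.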

Markdown is the right middle ground. It preserves headings, lists, code blocks, links, and tables using minimal syntax. An LLM reads markdown as well as it reads English. The token overhead of ## and - and [text](url) is tiny compared to <div class="container mx-auto">.

But converting HTML to markdown is not the end of the story. Standard conversion tools just transliterate the HTML structure. Every <img> becomes ![alt](src). Every <b> becomes **bold**. Every link stays inline. The result is valid markdown, but it's not optimized for what an LLM actually needs.
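The overhead gap is easy to eyeball. A crude sketch comparing the same content in both forms, using character counts as a rough proxy for tokens (real tokenizers differ, but the ratio is similar):

```python
# Equivalent content: class-heavy HTML vs. minimal markdown.
html = ('<div class="container mx-auto">'
        '<h2 class="text-2xl font-bold">Pricing</h2>'
        '<ul class="list-disc"><li>Free tier</li><li>Pro tier</li></ul>'
        '</div>')
markdown = "## Pricing\n\n- Free tier\n- Pro tier\n"

ratio = len(html) / len(markdown)  # roughly 4x, before any page chrome
```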

What's hiding in the markdown

Once you start looking at real HTML-to-markdown output, you find that web pages are full of things that make sense visually but are useless as text.

Images that aren't content. Most images on a page are logos, icons, and decorative elements. A partner section with 12 company logos generates 12 image references that an LLM can't see and can't use. On marketing pages, 30-40% of the markdown output is image references pointing at things with zero informational value.

Emphasis that means nothing. Designers bold entire paragraphs for visual weight. They italicize taglines for style. **Get started today** and Get started today carry identical information for an LLM. The ** markers are pure token waste.

Duplicate content. A heading that says "Features" followed by a paragraph starting with "Features include..." says the same thing twice. Card carousels repeat content for mobile and desktop breakpoints. Sticky headers appear in the extraction. The same CTA shows up four times on one page.

UI debris. Material Icons render as icons in a browser but show up as random words in markdown. navigate_before, chevron_left, expand_more. Cookie consent text. "Your browser does not support video" messages. Breadcrumb separators. These are visual affordances, not content.

Leaked code. Tailwind class names appearing as text content: text-4xl font-bold tracking-tight. Next.js hydration code: self.__wrap_n=.... Stray @keyframes and @font-face declarations. Any HTML-to-markdown converter that isn't careful about element boundaries will include some of this.

Links that aren't useful. Navigation links, footer links, "reply" and "flag" and "hide" links on forums, pagination controls. A single Hacker News page has 200+ links where maybe 30 are relevant. All of those inline [text](url) patterns are burning tokens.

What proper extraction looks like

webclaw runs a 9-step optimization pipeline that processes extracted markdown into LLM-ready output. Each step targets a specific category of noise.

Image handling. Logo clusters get collapsed into a single line ("WRITER, MongoDB, GROQ, LangChain" instead of four separate image references). Linked images become plain links. Standalone decorative images get stripped. Meaningful alt text descriptions are preserved.
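A sketch of the logo-cluster idea in Python. webclaw's actual pipeline is Rust, and these regexes are simplified for illustration:

```python
import re

IMG = re.compile(r"!\[([^\]]*)\]\([^)]*\)")
RUN = re.compile(r"(?:!\[[^\]]*\]\([^)]*\)\s*){2,}")

def collapse_logo_cluster(md: str) -> str:
    # A run of 2+ adjacent image refs becomes one comma-separated alt list.
    def repl(m):
        alts = [a for a in IMG.findall(m.group(0)) if a]
        return ", ".join(alts)
    return RUN.sub(repl, md)
```

Single images are left alone; only clusters of two or more collapse.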

Text cleanup. Bold and italic markers are removed while keeping the content. UI control text is stripped. CSS artifacts and leaked framework code are cleaned out.
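The emphasis pass is conceptually simple. A minimal sketch, not webclaw's actual implementation:

```python
import re

def strip_emphasis(md: str) -> str:
    # Remove **bold** markers first, then single *italic* and _italic_.
    md = re.sub(r"\*\*(.+?)\*\*", r"\1", md)
    md = re.sub(r"(?<!\*)\*([^*\n]+)\*(?!\*)", r"\1", md)
    md = re.sub(r"_([^_\n]+)_", r"\1", md)
    return md
```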

Link processing. All links get pulled out of inline text and collected into a deduplicated list at the end. Navigation links, anchor links, JavaScript void links, and action links ("reply", "flag", "hide") are filtered out. This alone cuts 20-30% of token count on link-heavy pages.
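The relocation step can be sketched like this, again as illustrative Python rather than the real pipeline; the noise-word list is an assumption:

```python
import re

LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")
NOISE = {"reply", "flag", "hide", "next", "prev"}  # illustrative set

def relocate_links(md: str) -> str:
    links = []
    def repl(m):
        text, url = m.group(1), m.group(2)
        # Drop action/anchor/JS links entirely, keeping only the text.
        if text.lower() in NOISE or url.startswith(("javascript:", "#")):
            return text
        if url not in links:  # dedupe while preserving order
            links.append(url)
        return text
    body = LINK.sub(repl, md)
    if links:
        body += "\n\nLinks:\n" + "\n".join(f"- {u}" for u in links)
    return body
```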

Deduplication. Headings that duplicate their following paragraph get merged. Carousel content that repeats across breakpoints collapses to one instance. Consecutive identical phrases ("Read more Read more Read more") reduce to one.
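The consecutive-phrase case reduces to a single backreference regex, sketched here under the same illustrative-Python caveat:

```python
import re

def collapse_repeats(text: str) -> str:
    # "Read more Read more Read more" -> "Read more".
    return re.sub(r"\b(.{3,40}?)(?:\s+\1)+\b", r"\1", text)
```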

Stat merging. Marketing pages love separating numbers from their labels. "100M+" on one line, "monthly requests" three lines below. The pipeline merges these into "100M+ monthly requests" and removes the whitespace.
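A minimal sketch of the stat-merging idea, assuming the number sits alone on a line with its lowercase label somewhere below:

```python
import re

STAT = re.compile(r"^([\d.,]+[MBK]?\+?)\n+([a-z][^\n]*)$", re.MULTILINE)

def merge_stats(md: str) -> str:
    # Join a bare number line ("100M+") with the lowercase label below it.
    return STAT.sub(r"\1 \2", md)
```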

After all of that, a final whitespace pass collapses the gaps left behind. The output reads like a well-edited document where every token carries information.

The numbers

Here's what the pipeline does to real pages.

On a representative benchmark set across documentation sites, marketing pages, news articles, and e-commerce pages:

  • Raw HTML: 4,820 tokens (baseline)
  • Standard markdown conversion: 1,840 tokens (62% smaller)
  • LLM-optimized markdown: 1,590 tokens (67% smaller)

The 67% average undersells the impact on noisy pages. Marketing sites with hero sections, partner logos, testimonial carousels, and sticky CTAs regularly see 85-90% reductions. Documentation pages are leaner to start with, so the improvement is more modest at 40-50%.

The extraction speed is 0.8ms for a 10KB page and 3.2ms for a 100KB page. That's the content processing, not the network request. End-to-end, including HTTP fetch with TLS fingerprinting, webclaw averages 118ms for static pages.

At scale, the token savings translate directly to cost. Process 1,000 pages a day through an LLM at $3 per million input tokens. With raw HTML averaging 50,000 tokens per page, that's $150 a day, $4,500 a month. With LLM-optimized output averaging 2,000 tokens per page for the same content, that's $6 a day, $180 a month. Same information.
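The arithmetic above as a back-of-envelope model you can plug your own numbers into ($3 per million input tokens, 1,000 pages/day, 30-day month):

```python
PRICE_PER_MTOK = 3.00   # $ per million input tokens
PAGES_PER_DAY = 1_000

def monthly_cost(tokens_per_page: int, days: int = 30) -> float:
    daily = PAGES_PER_DAY * tokens_per_page / 1_000_000 * PRICE_PER_MTOK
    return daily * days

raw_html = monthly_cost(50_000)   # $4,500/month
optimized = monthly_cost(2_000)   # $180/month
```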

Using the right format

webclaw exposes all of this through a single parameter. Set format to llm and you get the fully optimized output.

    curl -X POST https://api.webclaw.io/v1/scrape \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"url": "https://example.com", "format": "llm"}'

The other options: markdown for full formatting with images and emphasis preserved. text for plain text with all syntax stripped. json for structured metadata alongside content.

For LLM applications, llm is almost always what you want. It keeps the structure an LLM needs to reason (headings, lists, code blocks) and strips everything it doesn't.

With the MCP server, your AI agent calls scrape with format: "llm" and gets back content ready for processing. No post-processing, no cleanup, no regex.

    {
      "mcpServers": {
        "webclaw": {
          "command": "webclaw-mcp"
        }
      }
    }

If you need structured data instead of text, the /v1/extract endpoint takes a JSON schema and returns exactly the data shape you specify. Product names and prices from a pricing page, contact info from an about page, event details from a calendar. Different problem, different tool.
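In Python, such a request might look like this; the schema field names here are an assumption for illustration, not taken from webclaw's documented API:

```python
import json
import urllib.request

# Hypothetical /v1/extract request: a JSON schema describing the shape
# you want back from a pricing page.
payload = {
    "url": "https://example.com/pricing",
    "schema": {
        "type": "object",
        "properties": {
            "plan": {"type": "string"},
            "price": {"type": "string"},
        },
    },
}
req = urllib.request.Request(
    "https://api.webclaw.io/v1/extract",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would send it; requires a valid API key.
```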

The trade-offs

LLM-optimized markdown strips things. That's the whole point, but it means you lose information.

Images are gone. If the page has charts, diagrams, or screenshots that matter, you won't see them. For pages where visual content is important, use markdown format and pass the images to a multimodal model separately.

Emphasis is gone. If the original page used bold to highlight semantically meaningful terms (not just visual weight), that distinction is lost in the optimized output.

Links are relocated. Instead of inline links within sentences, they're collected at the end. For most use cases this is fine or better. For tasks where link position in context matters, use markdown format.

The 67% token reduction is an average. Some pages are already clean and you'll see 30-40%. The biggest gains are on bloated marketing sites, and the smallest on well-structured documentation.

Why this matters for your pipeline

The extraction format is the most overlooked decision in any LLM pipeline that touches web data. Most people pick the default and never think about it again.

If you're building RAG, cleaner chunks produce better embeddings. Better embeddings produce better retrieval. Better retrieval produces better answers. The quality of your extraction is the ceiling for your application's output quality.

If you're building agents that read web pages, every saved token is faster response times and lower costs. The format you choose compounds across every page, every task, every user.

Check the documentation for the full format reference, or try the API at webclaw.io.