Massi

HTML to markdown for LLMs. What raw web data actually costs you.

Here's a number that should bother you. A typical webpage is 50,000 to 200,000 tokens of raw HTML. The actual content on that page (the article, the product info, the documentation) is usually 500 to 2,000 tokens.

If you're feeding web data to an LLM, you're paying for every one of those tokens. The navigation bar. The footer with 47 links. The cookie consent banner. The inline SVGs. The data-testid attributes. The CSS classes that look like flex items-center justify-between px-4 py-2 bg-gradient-to-r from-blue-500 to-purple-600.

Your LLM reads all of it. Reasons over all of it. Bills you for all of it.

Why you can't just strip the tags

The first thing everyone tries is stripping HTML tags and keeping the text. innerText, regex, BeautifulSoup's .get_text(). It works for about five minutes.

Then you realize you've lost all structure. Headings are gone. Lists are flat paragraphs. Code blocks are unformatted. Links disappear entirely, and those links were the whole point of some pages. Tables become meaningless rows of words. Your LLM gets a wall of text with no hierarchy to reason about.
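The structure loss is easy to demonstrate with a one-line tag strip, a naive sketch rather than any particular library:

```python
import re

html = "<h2>Install</h2><ul><li>Run the installer</li><li>Restart the shell</li></ul>"

# Naive tag stripping: replace every tag with a space, collapse whitespace.
flat = re.sub(r"<[^>]+>", " ", html)
flat = re.sub(r"\s+", " ", flat).strip()

# The heading and the two list items are now one undifferentiated run:
# "Install Run the installer Restart the shell"
```

The heading, the list boundaries, and any links are gone; the model has no way to tell a title from a step.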

Markdown is the right middle ground. It preserves headings, lists, code blocks, links, and tables using minimal syntax. An LLM reads markdown as well as it reads English. The token overhead of ## and - and [text](url) is tiny compared to <div class="container mx-auto">.

But converting HTML to markdown is not the end of the story. Standard conversion tools just transliterate the HTML structure. Every <img> becomes ![alt](src). Every <b> becomes **bold**. Every link stays inline. The result is valid markdown, but it's not optimized for what an LLM actually needs.
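The overhead gap is easy to eyeball. A crude sketch comparing the same content in both forms, using character counts as a rough proxy for tokens (real tokenizers differ, but the ratio is similar):

```python
# Equivalent content: class-heavy HTML vs. minimal markdown.
html = ('<div class="container mx-auto">'
        '<h2 class="text-2xl font-bold">Pricing</h2>'
        '<ul class="list-disc"><li>Free tier</li><li>Pro tier</li></ul>'
        '</div>')
markdown = "## Pricing\n\n- Free tier\n- Pro tier\n"

ratio = len(html) / len(markdown)  # roughly 4x, before any page chrome
```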

What's hiding in the markdown

Once you start looking at real HTML-to-markdown output, you find that web pages are full of things that make sense visually but are useless as text.

Images that aren't content. Most images on a page are logos, icons, and decorative elements. A partner section with 12 company logos generates 12 image references that an LLM can't see and can't use. On marketing pages, 30-40% of the markdown output is image references pointing at things with zero informational value.

Emphasis that means nothing. Designers bold entire paragraphs for visual weight. They italicize taglines for style. **Get started today** and Get started today carry identical information for an LLM. The ** markers are pure token waste.

Duplicate content. A heading that says "Features" followed by a paragraph starting with "Features include..." says the same thing twice. Card carousels repeat content for mobile and desktop breakpoints. Sticky headers appear in the extraction. The same CTA shows up four times on one page.

UI debris. Material Icons render as icons in a browser but show up as random words in markdown. navigate_before, chevron_left, expand_more. Cookie consent text. "Your browser does not support video" messages. Breadcrumb separators. These are visual affordances, not content.

Leaked code. Tailwind class names appearing as text content: text-4xl font-bold tracking-tight. Next.js hydration code: self.__wrap_n=.... Stray @keyframes and @font-face declarations. Any HTML-to-markdown converter that isn't careful about element boundaries will include some of this.

Links that aren't useful. Navigation links, footer links, "reply" and "flag" and "hide" links on forums, pagination controls. A single Hacker News page has 200+ links where maybe 30 are relevant. All of those inline [text](url) patterns are burning tokens.

What proper extraction looks like

webclaw runs a 9-step optimization pipeline that processes extracted markdown into LLM-ready output. Each step targets a specific category of noise.

Image handling. Logo clusters get collapsed into a single line ("WRITER, MongoDB, GROQ, LangChain" instead of four separate image references). Linked images become plain links. Standalone decorative images get stripped. Meaningful alt text descriptions are preserved.
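A sketch of the logo-cluster idea in Python. webclaw's actual pipeline is Rust, and these regexes are simplified for illustration:

```python
import re

IMG = re.compile(r"!\[([^\]]*)\]\([^)]*\)")
RUN = re.compile(r"(?:!\[[^\]]*\]\([^)]*\)\s*){2,}")

def collapse_logo_cluster(md: str) -> str:
    # A run of 2+ adjacent image refs becomes one comma-separated alt list.
    def repl(m):
        alts = [a for a in IMG.findall(m.group(0)) if a]
        return ", ".join(alts)
    return RUN.sub(repl, md)
```

Single images are left alone; only clusters of two or more collapse.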

Text cleanup. Bold and italic markers are removed while keeping the content. UI control text is stripped. CSS artifacts and leaked framework code are cleaned out.
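The emphasis pass is conceptually simple. A minimal sketch, not webclaw's actual implementation:

```python
import re

def strip_emphasis(md: str) -> str:
    # Remove **bold** markers first, then single *italic* and _italic_.
    md = re.sub(r"\*\*(.+?)\*\*", r"\1", md)
    md = re.sub(r"(?<!\*)\*([^*\n]+)\*(?!\*)", r"\1", md)
    md = re.sub(r"_([^_\n]+)_", r"\1", md)
    return md
```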

Link processing. All links get pulled out of inline text and collected into a deduplicated list at the end. Navigation links, anchor links, JavaScript void links, and action links ("reply", "flag", "hide") are filtered out. This alone cuts 20-30% of token count on link-heavy pages.
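The relocation step can be sketched like this, again as illustrative Python rather than the real pipeline; the noise-word list is an assumption:

```python
import re

LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")
NOISE = {"reply", "flag", "hide", "next", "prev"}  # illustrative set

def relocate_links(md: str) -> str:
    links = []
    def repl(m):
        text, url = m.group(1), m.group(2)
        # Drop action/anchor/JS links entirely, keeping only the text.
        if text.lower() in NOISE or url.startswith(("javascript:", "#")):
            return text
        if url not in links:  # dedupe while preserving order
            links.append(url)
        return text
    body = LINK.sub(repl, md)
    if links:
        body += "\n\nLinks:\n" + "\n".join(f"- {u}" for u in links)
    return body
```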

Deduplication. Headings that duplicate their following paragraph get merged. Carousel content that repeats across breakpoints collapses to one instance. Consecutive identical phrases ("Read more Read more Read more") reduce to one.
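The consecutive-phrase case reduces to a single backreference regex, sketched here under the same illustrative-Python caveat:

```python
import re

def collapse_repeats(text: str) -> str:
    # "Read more Read more Read more" -> "Read more".
    return re.sub(r"\b(.{3,40}?)(?:\s+\1)+\b", r"\1", text)
```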

Stat merging. Marketing pages love separating numbers from their labels. "100M+" on one line, "monthly requests" three lines below. The pipeline merges these into "100M+ monthly requests" and removes the whitespace.
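A minimal sketch of the stat-merging idea, assuming the number sits alone on a line with its lowercase label somewhere below:

```python
import re

STAT = re.compile(r"^([\d.,]+[MBK]?\+?)\n+([a-z][^\n]*)$", re.MULTILINE)

def merge_stats(md: str) -> str:
    # Join a bare number line ("100M+") with the lowercase label below it.
    return STAT.sub(r"\1 \2", md)
```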

After all of that, a final whitespace pass collapses the gaps left behind. The output reads like a well-edited document where every token carries information.

The numbers

Here's what the pipeline does to real pages.

On a representative benchmark set across documentation sites, marketing pages, news articles, and e-commerce pages:

  • Raw HTML: 4,820 tokens (baseline)
  • Standard markdown conversion: 1,840 tokens (62% smaller)
  • LLM-optimized markdown: 1,590 tokens (67% smaller)

The 67% average undersells the impact on noisy pages. Marketing sites with hero sections, partner logos, testimonial carousels, and sticky CTAs regularly see 85-90% reductions. Documentation pages are leaner to start with, so the improvement is more modest at 40-50%.

The extraction speed is 0.8ms for a 10KB page and 3.2ms for a 100KB page. That's the content processing, not the network request. End-to-end, including HTTP fetch with TLS fingerprinting, webclaw averages 118ms for static pages.

At scale, the token savings translate directly to cost. Process 1,000 pages a day through an LLM at $3 per million input tokens. With raw HTML averaging 50,000 tokens per page, that's $150 a day, $4,500 a month. With LLM-optimized output averaging 2,000 tokens per page for the same content, that's $6 a day, $180 a month. Same information.
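The arithmetic above as a back-of-envelope model you can plug your own numbers into ($3 per million input tokens, 1,000 pages/day, 30-day month):

```python
PRICE_PER_MTOK = 3.00   # $ per million input tokens
PAGES_PER_DAY = 1_000

def monthly_cost(tokens_per_page: int, days: int = 30) -> float:
    daily = PAGES_PER_DAY * tokens_per_page / 1_000_000 * PRICE_PER_MTOK
    return daily * days

raw_html = monthly_cost(50_000)   # $4,500/month
optimized = monthly_cost(2_000)   # $180/month
```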

Using the right format

webclaw exposes all of this through a single parameter. Set format to llm and you get the fully optimized output.

    curl -X POST https://api.webclaw.io/v1/scrape \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"url": "https://example.com", "format": "llm"}'

The other options: markdown for full formatting with images and emphasis preserved. text for plain text with all syntax stripped. json for structured metadata alongside content.

For LLM applications, llm is almost always what you want. It keeps the structure an LLM needs to reason (headings, lists, code blocks) and strips everything it doesn't.

With the MCP server, your AI agent calls scrape with format: "llm" and gets back content ready for processing. No post-processing, no cleanup, no regex.

    {
      "mcpServers": {
        "webclaw": {
          "command": "webclaw-mcp"
        }
      }
    }

If you need structured data instead of text, the /v1/extract endpoint takes a JSON schema and returns exactly the data shape you specify. Product names and prices from a pricing page, contact info from an about page, event details from a calendar. Different problem, different tool.
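In Python, such a request might look like this; the schema field names here are an assumption for illustration, not taken from webclaw's documented API:

```python
import json
import urllib.request

# Hypothetical /v1/extract request: a JSON schema describing the shape
# you want back from a pricing page.
payload = {
    "url": "https://example.com/pricing",
    "schema": {
        "type": "object",
        "properties": {
            "plan": {"type": "string"},
            "price": {"type": "string"},
        },
    },
}
req = urllib.request.Request(
    "https://api.webclaw.io/v1/extract",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would send it; requires a valid API key.
```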

The trade-offs

LLM-optimized markdown strips things. That's the whole point, but it means you lose information.

Images are gone. If the page has charts, diagrams, or screenshots that matter, you won't see them. For pages where visual content is important, use markdown format and pass the images to a multimodal model separately.

Emphasis is gone. If the original page used bold to highlight semantically meaningful terms (not just visual weight), that distinction is lost in the optimized output.

Links are relocated. Instead of inline links within sentences, they're collected at the end. For most use cases this is fine or better. For tasks where link position in context matters, use markdown format.

The 67% token reduction is an average. Some pages are already clean and you'll see 30-40%. The biggest gains are on bloated marketing sites, and the smallest on well-structured documentation.

Why this matters for your pipeline

The extraction format is the most overlooked decision in any LLM pipeline that touches web data. Most people pick the default and never think about it again.

If you're building RAG, cleaner chunks produce better embeddings. Better embeddings produce better retrieval. Better retrieval produces better answers. The quality of your extraction is the ceiling for your application's output quality.

If you're building agents that read web pages, every saved token is faster response times and lower costs. The format you choose compounds across every page, every task, every user.

Check the documentation for the full format reference, or try the API at webclaw.io.