POST /v1/scrape

Website to Markdown for AI and LLM context windows

Convert any URL into clean, LLM-ready Markdown in one call.

Turn any webpage into clean Markdown with nav, ads, and boilerplate stripped. Built for docs ingestion, prompts, and context windows, the output uses around 90% fewer tokens than raw HTML.

How it works

Build it step by step.

The real flow, one step at a time.

  1. Send a URL

    POST the page URL to /v1/scrape with formats set to markdown.

    // POST the page URL to /v1/scrape, asking for Markdown
    const result = await webclaw.scrape({
      url: "https://example.com/docs/getting-started",
      formats: ["markdown"],
    });

  2. Page is fetched

    The engine fetches the page and escalates to JS rendering or bot-protection handling only when the page needs it.

  3. HTML is cleaned

    Navigation, ads, and scripts are stripped while headings, lists, tables, and links are kept as Markdown.

  4. Markdown returned

    Clean Markdown plus title, description, status, and timing metadata comes back, ready for a prompt, vector store, or docs index.

    // Clean Markdown plus page metadata comes back
    console.log(result.markdown);   // boilerplate stripped
    console.log(result.metadata);   // title, description, status, timing

    // Drop straight into a prompt or context window
    const prompt = `Summarize this page:\n\n${result.markdown}`;
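
If you would rather not use an SDK, the same call can be sketched as a plain HTTP request. This is a minimal sketch under stated assumptions: the host `https://api.example.com` is a placeholder, the `Authorization: Bearer` header is an assumed auth scheme, and the JSON body is assumed to mirror the `url` and `formats` fields from the snippet in step 1. The helper name `buildScrapeRequest` is hypothetical. Check the API docs for the real values.

```typescript
// Placeholder host, NOT the real API base URL (assumption for illustration).
const BASE_URL = "https://api.example.com";

interface ScrapeRequestInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds the request for POST /v1/scrape without sending it.
function buildScrapeRequest(
  pageUrl: string,
  apiKey: string
): { endpoint: string; init: ScrapeRequestInit } {
  return {
    endpoint: `${BASE_URL}/v1/scrape`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Assumed auth scheme; the real header may differ.
        Authorization: `Bearer ${apiKey}`,
      },
      // Body fields assumed to match the SDK call: url plus formats.
      body: JSON.stringify({ url: pageUrl, formats: ["markdown"] }),
    },
  };
}
```

Passing `endpoint` and `init` to `fetch` would send the request. Keeping the builder separate from the network call makes the request shape easy to inspect and test.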
Why webclaw

Built for website-to-Markdown conversion.

Markdown out of the box, no Turndown or Readability to maintain

Around 90% fewer tokens than raw HTML fed to a model

Around 118ms on static pages, no headless browser to run

Headings, lists, tables, and links preserved as structure

Protected pages handled automatically, no proxy or browser setup

What you get

Everything this use case needs.

  • URL to Markdown in one call
  • Boilerplate, nav, and ad stripping
  • Heading, list, table, and link fidelity
  • Markdown, JSON, llm, text, and HTML formats
  • Page metadata: title, description, status, timing
Where it fits

Built for the messy parts.

Feeding raw HTML to an LLM wastes the context window on navigation, ads, cookie banners, and scripts. Hand-rolling an HTML-to-Markdown converter means maintaining Readability heuristics, Turndown rules, and a headless browser, and it still breaks on bot-protected pages.

webclaw's /v1/scrape endpoint returns clean Markdown directly. It strips boilerplate, preserves headings, lists, tables, and links, and returns static pages in around 118ms without a headless browser. JS rendering and bot-protection handling kick in automatically when a page needs them.

Common questions

Frequently asked questions

How is webclaw better than an HTML-to-Markdown library like Turndown?

Turndown converts whatever HTML you hand it, including nav, ads, and scripts, so you still need to fetch the page and clean it first. webclaw fetches the page, strips boilerplate, and returns article-grade Markdown in one call, and it handles JS rendering and bot-protected pages that a plain HTTP fetch cannot reach.

Does the Markdown keep tables, lists, and links?

Yes. webclaw preserves semantic structure: headings become Markdown headings, lists stay lists, tables stay tables, and links keep their hrefs. The output is meant to read cleanly both for a human and for an LLM.
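
Because headings survive as Markdown headings, the output can be split into sections before it goes into a vector store or docs index. The sketch below is one way to do that, not part of webclaw itself: `splitByHeadings` and the `Section` type are hypothetical names, and it handles only ATX-style headings (`#` through `######`), skipping lines inside fenced code blocks.

```typescript
interface Section {
  heading: string; // empty string for text that appears before the first heading
  body: string;
}

// Built at runtime so this example never contains a literal code fence.
const FENCE = "`".repeat(3);

// Hypothetical helper: split Markdown into sections at each ATX heading.
function splitByHeadings(markdown: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { heading: "", body: "" };
  let inFence = false;

  for (const line of markdown.split("\n")) {
    // Toggle fence state so "# comment" lines inside code are not treated as headings.
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;

    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      // Close the previous section if it holds any content.
      if (current.heading || current.body.trim()) sections.push(current);
      current = { heading: match[2].trim(), body: "" };
    } else {
      current.body += line + "\n";
    }
  }
  if (current.heading || current.body.trim()) sections.push(current);

  return sections.map((s) => ({ heading: s.heading, body: s.body.trim() }));
}
```

Each section can then be embedded or indexed on its own, with the heading kept as a label for retrieval.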

Why convert to Markdown before sending a page to an LLM?

Raw HTML burns context tokens on markup and chrome that add no meaning. Clean Markdown cuts token count by roughly 90% versus raw HTML, which lowers cost, leaves more room in the context window, and improves the signal your model reasons over.
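
To check the savings on your own pages, a rough estimate is enough. The sketch below assumes about 4 characters per token, a common rule of thumb for English text. Real tokenizers vary by model, so treat the result as an approximation. The names `estimateTokens` and `tokenSavings` are hypothetical.

```typescript
// Rule-of-thumb assumption: roughly 4 characters per token for English text.
const CHARS_PER_TOKEN = 4;

// Approximate token count from character length.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Compare raw HTML against the cleaned Markdown for the same page.
function tokenSavings(
  rawHtml: string,
  markdown: string
): { htmlTokens: number; markdownTokens: number; savedPercent: number } {
  const htmlTokens = estimateTokens(rawHtml);
  const markdownTokens = estimateTokens(markdown);
  // Guard against division by zero for empty input.
  const savedPercent =
    htmlTokens === 0 ? 0 : Math.round((1 - markdownTokens / htmlTokens) * 100);
  return { htmlTokens, markdownTokens, savedPercent };
}
```

For example, a 4,000-character HTML page that cleans down to 400 characters of Markdown comes out at an estimated 1,000 versus 100 tokens, a 90% reduction.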

For AI agents

Or hand it to your agent.

Add the webclaw MCP server to Claude, Cursor, or any MCP client, then paste this prompt. The agent calls the webclaw tools and hands the result back to your model, with no code to write.

PROMPT FOR YOUR AGENT

Using the webclaw tools, call scrape on [the page URL] and convert it into clean, LLM-ready Markdown. Strip out the navigation, ads, cookie banners, and scripts, but keep the real structure intact: headings, lists, tables, and links should survive as proper Markdown. Return the cleaned Markdown along with the page's title and any description metadata. If I give you several URLs, use batch to scrape them all in parallel and return one Markdown block per page, each labeled with its source URL.

Ready to build? Start extracting.

Cancel anytime. Clean, structured data on every call.
