Web scraping for RAG pipelines
Feed your vector database with fresh, clean web content.
Build retrieval-augmented generation (RAG) systems with real, up-to-date web data. webclaw returns LLM-optimized markdown that embeds cleanly into any vector database, with 67% fewer tokens than raw HTML.
The problem
Your RAG pipeline is only as good as the documents you index. Most web scrapers return messy HTML full of navigation, ads, and boilerplate that wastes embedding costs and poisons retrieval quality. Re-crawling frequently is slow and expensive with browser-based tools.
The webclaw solution
webclaw returns clean markdown with boilerplate stripped, semantic structure preserved, and token count minimized. At 118ms average per page, you can re-index large document sets multiple times per day. Built-in content diffing lets you incrementally update only what changed.
Why webclaw for RAG pipelines
- LLM-optimized markdown cuts embedding costs by 67% vs raw HTML
- 118ms average response makes large-scale re-indexing affordable
- Content diffing for incremental updates instead of full re-crawls
- CSS selector filtering to extract only article content, not nav/footer
- Built-in batch endpoint for parallel multi-URL ingestion
Code example
Python — batch scrape for RAG
from webclaw import Webclaw

client = Webclaw(api_key="wc_...")
urls = ["https://example.com/docs/page1", "https://example.com/docs/page2"]

# Parallel scrape, LLM-optimized markdown output
results = client.batch(urls=urls, formats=["markdown"])

# Feed directly into your vector database
# (`embed` and `vectordb` stand in for your embedding function and vector DB client)
for r in results.data:
    embedding = embed(r.markdown)
    vectordb.upsert(id=r.url, text=r.markdown, embedding=embedding)
webclaw features for this use case
- Markdown and LLM-optimized output formats
- Batch endpoint for parallel ingestion
- Content diff endpoint for incremental updates
- CSS selector filtering
- Firecrawl v2 compatible (drop-in replacement)
Frequently asked questions
How does webclaw reduce embedding costs for RAG?
webclaw strips navigation, ads, scripts, and boilerplate before returning content. The LLM-optimized format cuts token count by about 67% vs raw HTML, which directly reduces embedding API costs when indexing large document sets.
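As a back-of-the-envelope illustration of how the token reduction flows into spend (the page count, tokens per page, and embedding price below are made-up inputs, not webclaw measurements):

```python
pages = 100_000
tokens_raw_html = 3_000      # assumed average tokens per page as raw HTML
price_per_million = 0.02     # assumed embedding price in $ per 1M tokens

tokens_clean = tokens_raw_html * (1 - 0.67)  # 67% fewer tokens after cleanup
cost_raw = pages * tokens_raw_html / 1e6 * price_per_million
cost_clean = pages * tokens_clean / 1e6 * price_per_million
print(f"${cost_raw:.2f} -> ${cost_clean:.2f}")  # $6.00 -> $1.98
```

Because embedding APIs bill per token, the saving scales linearly with corpus size and re-index frequency.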
Can webclaw incrementally update my vector database?
Yes. Use the /v1/diff endpoint to compare a current scrape against a previous snapshot. Only embed and re-index pages where content changed, instead of running full re-crawls.
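The selection logic behind incremental updates can be sketched as a pure function: given the content hashes stored at the last indexing run and a fresh scrape, re-embed only what changed. The local sha256 comparison here stands in for the server-side /v1/diff comparison, whose exact response shape is not shown in this page, so treat this as an illustration of the flow rather than the endpoint's contract:

```python
import hashlib


def pages_to_reindex(snapshot: dict[str, str], fresh: dict[str, str]) -> list[str]:
    """Return URLs whose content differs from the stored snapshot.

    snapshot: url -> sha256 hex digest saved at the last indexing run
    fresh:    url -> freshly scraped markdown
    """
    changed = []
    for url, markdown in fresh.items():
        digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
        if snapshot.get(url) != digest:  # new page, or content changed
            changed.append(url)
    return changed


# Example: page1 is unchanged, page2 is new since the last run
old = {"https://example.com/docs/page1": hashlib.sha256(b"stable").hexdigest()}
new = {
    "https://example.com/docs/page1": "stable",
    "https://example.com/docs/page2": "brand new",
}
print(pages_to_reindex(old, new))  # ['https://example.com/docs/page2']
```

Only the URLs this returns need to go back through embedding and upsert, which is what keeps frequent re-indexing cheap.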
Does webclaw work with my existing RAG framework?
webclaw has official SDKs for Python, TypeScript, and Go. The REST API is also Firecrawl v2 compatible, so any LangChain, LlamaIndex, or CrewAI integration that uses Firecrawl can point at webclaw by changing one environment variable.
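If your framework already goes through the firecrawl-py SDK, the switch can be as small as overriding the API base URL before the client is constructed. The environment variable name follows firecrawl-py's convention, and the webclaw base URL below is an assumption; check your webclaw dashboard for the real endpoint:

```python
import os

# Point an existing Firecrawl integration at webclaw instead of api.firecrawl.dev.
# The URL below is a placeholder, not a confirmed webclaw endpoint.
os.environ["FIRECRAWL_API_URL"] = "https://api.webclaw.dev"
```

Any FirecrawlApp created after this point should send its requests to webclaw's compatible API instead of Firecrawl's.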