Web scraping for RAG pipelines
Feed your vector database with fresh, clean web content.
Build retrieval-augmented generation (RAG) systems with real, up-to-date web data. webclaw returns LLM-optimized markdown that embeds cleanly into any vector database, with 67% fewer tokens than raw HTML.
The problem
Your RAG pipeline is only as good as the documents you index. Most web scrapers return messy HTML full of navigation, ads, and boilerplate that wastes embedding costs and poisons retrieval quality. Re-crawling frequently is slow and expensive with browser-based tools.
The webclaw solution
webclaw returns clean markdown with boilerplate stripped, semantic structure preserved, and token count minimized. At 118ms average per page, you can re-index large document sets multiple times per day. Built-in content diffing lets you incrementally update only what changed.
Why webclaw for RAG pipelines
- LLM-optimized markdown cuts embedding costs by 67% vs raw HTML
- 118ms average response makes large-scale re-indexing affordable
- Content diffing for incremental updates instead of full re-crawls
- CSS selector filtering to extract only article content, not nav/footer
- Built-in batch endpoint for parallel multi-URL ingestion
Code example
Python — batch scrape for RAG
from webclaw import Webclaw

client = Webclaw(api_key="wc_...")
urls = ["https://example.com/docs/page1", "https://example.com/docs/page2"]

# Parallel scrape, LLM-optimized markdown output
results = client.batch(urls=urls, formats=["markdown"])

# Feed directly into your vector database
# (`embed` and `vectordb` stand in for your embedding function and vector DB client)
for r in results.data:
    embedding = embed(r.markdown)
    vectordb.upsert(id=r.url, text=r.markdown, embedding=embedding)
webclaw features for this use case
- Markdown and LLM-optimized output formats
- Batch endpoint for parallel ingestion
- Content diff endpoint for incremental updates
- CSS selector filtering
- Firecrawl v2 compatible (drop-in replacement)
Frequently asked questions
How does webclaw reduce embedding costs for RAG?
webclaw strips navigation, ads, scripts, and boilerplate before returning content. The LLM-optimized format cuts token count by about 67% vs raw HTML, which directly reduces embedding API costs when indexing large document sets.
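As a back-of-the-envelope illustration of how the token reduction flows into spend (the page count, tokens per page, and embedding price below are made-up inputs, not webclaw measurements):

```python
pages = 100_000
tokens_raw_html = 3_000      # assumed average tokens per page as raw HTML
price_per_million = 0.02     # assumed embedding price in $ per 1M tokens

tokens_clean = tokens_raw_html * (1 - 0.67)  # 67% fewer tokens after cleanup
cost_raw = pages * tokens_raw_html / 1e6 * price_per_million
cost_clean = pages * tokens_clean / 1e6 * price_per_million
print(f"${cost_raw:.2f} -> ${cost_clean:.2f}")  # $6.00 -> $1.98
```

Because embedding APIs bill per token, the saving scales linearly with corpus size and re-index frequency.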
Can webclaw incrementally update my vector database?
Yes. Use the /v1/diff endpoint to compare a current scrape against a previous snapshot. Only embed and re-index pages where content changed, instead of running full re-crawls.
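The selection logic behind incremental updates can be sketched as a pure function: given the content hashes stored at the last indexing run and a fresh scrape, re-embed only what changed. The local sha256 comparison here stands in for the server-side /v1/diff comparison, whose exact response shape is not shown in this page, so treat this as an illustration of the flow rather than the endpoint's contract:

```python
import hashlib


def pages_to_reindex(snapshot: dict[str, str], fresh: dict[str, str]) -> list[str]:
    """Return URLs whose content differs from the stored snapshot.

    snapshot: url -> sha256 hex digest saved at the last indexing run
    fresh:    url -> freshly scraped markdown
    """
    changed = []
    for url, markdown in fresh.items():
        digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
        if snapshot.get(url) != digest:  # new page, or content changed
            changed.append(url)
    return changed


# Example: page1 is unchanged, page2 is new since the last run
old = {"https://example.com/docs/page1": hashlib.sha256(b"stable").hexdigest()}
new = {
    "https://example.com/docs/page1": "stable",
    "https://example.com/docs/page2": "brand new",
}
print(pages_to_reindex(old, new))  # ['https://example.com/docs/page2']
```

Only the URLs this returns need to go back through embedding and upsert, which is what keeps frequent re-indexing cheap.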
Does webclaw work with my existing RAG framework?
webclaw has official SDKs for Python, TypeScript, and Go. The REST API is also Firecrawl v2 compatible, so any LangChain, LlamaIndex, or CrewAI integration that uses Firecrawl can point at webclaw by changing one environment variable.
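If your framework already goes through the firecrawl-py SDK, the switch can be as small as overriding the API base URL before the client is constructed. The environment variable name follows firecrawl-py's convention, and the webclaw base URL below is an assumption; check your webclaw dashboard for the real endpoint:

```python
import os

# Point an existing Firecrawl integration at webclaw instead of api.firecrawl.dev.
# The URL below is a placeholder, not a confirmed webclaw endpoint.
os.environ["FIRECRAWL_API_URL"] = "https://api.webclaw.dev"
```

Any FirecrawlApp created after this point should send its requests to webclaw's compatible API instead of Firecrawl's.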