
SDK INTEGRATION

Web scraping for LlamaIndex

Ingest web content into LlamaIndex nodes with clean markdown.

LlamaIndex is a data framework for building LLM applications, specializing in retrieval and context augmentation. webclaw provides clean markdown and structured data that loads directly into LlamaIndex Documents, with automatic bot bypass for protected sites.

Setup

LlamaIndex Python — custom reader

from llama_index.core import Document, VectorStoreIndex
from webclaw import Webclaw

wc = Webclaw(api_key="wc_...")

# Scrape a site with LLM-optimized output
urls = ["https://example.com/docs", "https://example.com/api"]
results = wc.batch(urls=urls, formats=["markdown"])

# Convert to LlamaIndex Documents
docs = [
    Document(text=r.markdown, metadata={"url": r.url})
    for r in results.data
]

# Build the index
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
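Before building the index, it can help to drop failed or near-empty scrapes so they don't pollute retrieval. A minimal sketch follows; it uses plain dicts with `url`/`markdown` keys to stay self-contained (the SDK's result objects expose these as attributes, as in the example above), and the 50-character minimum is an arbitrary illustrative threshold:

```python
# Hedged sketch: filter out pages not worth embedding.
# The dict shape and the min_chars threshold are illustrative assumptions.

def usable_pages(results, min_chars=50):
    """Keep only pages whose markdown is long enough to embed."""
    return [
        r for r in results
        if r.get("markdown") and len(r["markdown"]) >= min_chars
    ]

pages = [
    {"url": "https://example.com/docs", "markdown": "# Docs\n" + "content " * 20},
    {"url": "https://example.com/404", "markdown": ""},
]
print(len(usable_pages(pages)))  # 1
```

The same filter works unchanged on batch results of any size, since it only inspects the markdown payload.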

Why webclaw for LlamaIndex

  • Clean markdown output ideal for semantic chunking
  • Batch endpoint for parallel multi-URL ingestion
  • 67% token reduction cuts embedding costs
  • Content diff for incremental index updates

Common use cases

  • LlamaIndex document readers for web content
  • Vector indexes over fresh web documentation
  • Hybrid search with structured + unstructured data
  • Query engines over competitor or market data

Frequently asked questions

How do I add web content to a LlamaIndex VectorStoreIndex?

Scrape URLs with webclaw's batch endpoint, convert each result's markdown into a LlamaIndex Document, and pass them to VectorStoreIndex.from_documents. See the code example above.

Can I update a LlamaIndex index incrementally as web content changes?

Yes. Use webclaw's /v1/diff endpoint to detect which pages changed, then re-embed and upsert only those Documents instead of rebuilding the entire index.
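A sketch of that re-embed step is below. The diff payload shape (a list of `{"url", "changed"}` entries) is an assumption for illustration, not the documented /v1/diff schema; consult the API reference for the real response format:

```python
# Hedged sketch of incremental index updates. The diff result shape
# below is assumed for illustration only.

def urls_to_refresh(diff_results):
    """Return only the URLs the diff endpoint flagged as changed."""
    return [r["url"] for r in diff_results if r.get("changed")]

diff_results = [
    {"url": "https://example.com/docs", "changed": True},
    {"url": "https://example.com/api", "changed": False},
]
stale = urls_to_refresh(diff_results)
# Re-scrape and upsert only the changed pages, e.g.:
#   results = wc.batch(urls=stale, formats=["markdown"])
#   for r in results.data:
#       index.insert(Document(text=r.markdown, metadata={"url": r.url}))
print(stale)  # ['https://example.com/docs']
```

Keeping the URL in each Document's metadata (as in the setup example) is what makes this targeted upsert possible: it gives you a stable key for matching diff results to existing index entries.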

Get started

500 pages/month free. No credit card. Open source.

Stay in the loop

Get notified when the webclaw API launches. Early subscribers get extended free tier access.

No spam. Unsubscribe anytime.