SDK INTEGRATION
Web scraping for LlamaIndex
Ingest web content into LlamaIndex nodes with clean markdown.
LlamaIndex is a data framework for LLM applications, specializing in retrieval and context augmentation. webclaw provides clean markdown and structured data that load directly into LlamaIndex Documents, with automatic bot bypass for protected sites.
Setup
LlamaIndex Python — custom reader

```python
from llama_index.core import Document, VectorStoreIndex
from webclaw import Webclaw

wc = Webclaw(api_key="wc_...")

# Scrape a batch of pages with LLM-optimized markdown output
urls = ["https://example.com/docs", "https://example.com/api"]
results = wc.batch(urls=urls, formats=["markdown"])

# Convert each result to a LlamaIndex Document
docs = [
    Document(text=r.markdown, metadata={"url": r.url})
    for r in results.data
]

# Build the index and expose it as a query engine
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
```

Why webclaw for LlamaIndex
- Clean markdown output ideal for semantic chunking
- Batch endpoint for parallel multi-URL ingestion
- 67% token reduction cuts embedding costs
- Content diff for incremental index updates
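To put the token-reduction point in concrete terms, here is a back-of-the-envelope estimate of embedding savings. The corpus size and per-million-token rate below are illustrative assumptions, not webclaw or provider figures:

```python
def embedding_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Embedding cost at a flat per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million_tokens

raw_tokens = 10_000_000                  # hypothetical corpus of raw page text
clean_tokens = round(raw_tokens * 0.33)  # ~67% reduction from clean markdown
rate = 0.02                              # illustrative rate, USD per 1M tokens

saved = embedding_cost(raw_tokens, rate) - embedding_cost(clean_tokens, rate)
print(f"Embedding spend: ${embedding_cost(raw_tokens, rate):.2f} raw vs "
      f"${embedding_cost(clean_tokens, rate):.2f} clean (${saved:.2f} saved)")
```

The savings scale linearly with corpus size and re-embedding frequency, so the effect compounds for indexes that are rebuilt or refreshed often.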
Common use cases
- LlamaIndex document readers for web content
- Vector indexes over fresh web documentation
- Hybrid search with structured + unstructured data
- Query engines over competitor or market data
Frequently asked questions
How do I add web content to a LlamaIndex VectorStoreIndex?
Scrape URLs with webclaw's batch endpoint, convert each result's markdown into a LlamaIndex Document, and pass them to VectorStoreIndex.from_documents. See the code example above.
Can I update a LlamaIndex index incrementally as web content changes?
Yes. Use webclaw's /v1/diff endpoint to detect which pages changed, then re-embed and upsert only those Documents instead of rebuilding the entire index.