SDK INTEGRATION
Web scraping for LlamaIndex
Ingest web content into LlamaIndex nodes with clean markdown.
LlamaIndex is a data framework for LLM applications, specializing in retrieval and context augmentation. webclaw provides clean markdown and structured data that load directly into LlamaIndex Documents, with automatic bot bypass for protected sites.
Setup
LlamaIndex Python — custom reader

```python
from llama_index.core import Document, VectorStoreIndex
from webclaw import Webclaw

wc = Webclaw(api_key="wc_...")

# Scrape a batch of pages with LLM-optimized markdown output
urls = ["https://example.com/docs", "https://example.com/api"]
results = wc.batch(urls=urls, formats=["markdown"])

# Convert each result to a LlamaIndex Document
docs = [
    Document(text=r.markdown, metadata={"url": r.url})
    for r in results.data
]

# Build the index and expose it as a query engine
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
```

Why webclaw for LlamaIndex
- Clean markdown output ideal for semantic chunking
- Batch endpoint for parallel multi-URL ingestion
- 67% token reduction cuts embedding costs
- Content diff for incremental index updates
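To put the token-reduction point in concrete terms, here is a back-of-the-envelope estimate of embedding savings. The corpus size and per-million-token rate below are illustrative assumptions, not webclaw or provider figures:

```python
def embedding_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Embedding cost at a flat per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million_tokens

raw_tokens = 10_000_000                  # hypothetical corpus of raw page text
clean_tokens = round(raw_tokens * 0.33)  # ~67% reduction from clean markdown
rate = 0.02                              # illustrative rate, USD per 1M tokens

saved = embedding_cost(raw_tokens, rate) - embedding_cost(clean_tokens, rate)
print(f"Embedding spend: ${embedding_cost(raw_tokens, rate):.2f} raw vs "
      f"${embedding_cost(clean_tokens, rate):.2f} clean (${saved:.2f} saved)")
```

The savings scale linearly with corpus size and re-embedding frequency, so the effect compounds for indexes that are rebuilt or refreshed often.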
Common use cases
- LlamaIndex document readers for web content
- Vector indexes over fresh web documentation
- Hybrid search with structured + unstructured data
- Query engines over competitor or market data
Frequently asked questions
How do I add web content to a LlamaIndex VectorStoreIndex?
Scrape URLs with webclaw's batch endpoint, convert each result's markdown into a LlamaIndex Document, and pass them to VectorStoreIndex.from_documents. See the code example above.
Can I update a LlamaIndex index incrementally as web content changes?
Yes. Use webclaw's /v1/diff endpoint to detect which pages changed, then re-embed and upsert only those Documents instead of rebuilding the entire index.