Massi

Build a RAG pipeline with live web data

Most RAG tutorials show you how to upload a PDF and ask questions about it. Cool demo. Not a real product.

Real applications need live data. Stock prices change. Documentation gets updated. Blog posts get published. If your RAG pipeline only knows what was true when you last uploaded a file, your answers are stale and your users notice.

I've spent the last few months building webclaw specifically for this problem: getting clean, structured content out of the web and into a vector database without losing your mind in the process. Here's what I learned.

The pipeline

A RAG pipeline with web data has four steps. Every step has a way to go wrong.

1. Fetch the page. Sounds simple. It's not. Half the web is behind Cloudflare, cookie consent overlays, or JavaScript rendering. A basic HTTP request returns either a 403 or an empty shell with a loading spinner.

2. Extract the content. Raw HTML is 50,000 tokens for a typical page. The actual content you care about is maybe 800 tokens. Navigation, ads, footers, cookie banners, tracking scripts. All noise. If you feed raw HTML to your embeddings model, you're burning money and polluting your vector space with garbage.
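To see the signal-to-noise ratio concretely, here's a toy extractor using only Python's standard library. It drops nav, footer, script, and style regions and keeps the rest. Real extraction engines do far more (readability scoring, boilerplate classifiers), so treat this as an illustration of the problem, not the technique webclaw uses.

```python
from html.parser import HTMLParser

NOISE_TAGS = ("script", "style", "nav", "footer")

class TextExtractor(HTMLParser):
    """Collects text content, skipping common noise regions."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth counter for nested noise tags

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = """<html><head><style>body{color:red}</style></head>
<body><nav><a href=/>Home</a> <a href=/about>About</a></nav>
<article><h1>Auth</h1><p>Use the API key header.</p></article>
<footer>© 2024 Example</footer></body></html>"""

parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)
print(text)  # only the article text survives
```

Even on this tiny page, the extracted text is a fraction of the raw markup; on a real page with tracking scripts and consent banners, the ratio is far worse.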

3. Chunk and embed. Split the clean content into pieces that make semantic sense, then run them through an embeddings model. The quality of your chunks determines the quality of your retrieval. Bad chunks mean the right answer exists in your database but the retriever can't find it.

4. Index and retrieve. Store the vectors, build your retrieval logic, serve results. This part has the most tutorials and the least actual difficulty.

Most teams spend 80% of their time on steps 1 and 2. The fetching and extraction. The actual RAG part is well-documented. The "get clean data from the web" part is not.

Step 1: fetching pages that don't want to be fetched

The naive approach is requests.get(url). Works on about 40% of the web. The rest returns a challenge page, a redirect loop, or an empty response.

The reason is TLS fingerprinting. Modern anti-bot systems don't just check your User-Agent header. They look at your TLS handshake, HTTP/2 settings, header ordering, and dozens of other signals. A Python requests library looks nothing like a real browser at the network level.

webclaw handles this at the transport layer. It impersonates real browser fingerprints (Chrome, Firefox, Safari) so the target server sees what looks like a real browser connection. No headless Chrome needed. No Puppeteer. Just HTTP with the right fingerprint.

# This gets through Cloudflare, Akamai, DataDome on most sites
webclaw https://example.com

For the pages that genuinely need JavaScript rendering (single-page apps, React sites), webclaw has a rendering pipeline that kicks in automatically when the initial fetch returns thin content.

Step 2: extraction matters more than you think

Here's something that took me a while to understand. The quality of your extraction directly determines the quality of your RAG answers. Not somewhat. Directly.

If your extraction includes navigation menus, the embeddings model creates vectors for "Home | About | Contact | Blog" that are semantically similar to actual navigation queries. Now when a user asks "how do I navigate the API," the retriever pulls up nav menus instead of the actual documentation.

If your extraction includes cookie consent text, you get vectors about privacy policies mixed into your knowledge base. If it includes footer links, you get vectors about social media profiles.

webclaw strips all of that. Navigation, ads, cookie banners, footers, sidebars. What comes out is the actual content of the page in clean markdown.

# Returns just the content, no noise
webclaw https://docs.example.com/api/authentication --format llm

The llm format goes further. It strips emphasis markers, deduplicates links, merges statistics, and collapses whitespace. Optimized specifically for LLM consumption. Fewer tokens, same information.
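As a rough sketch of what that kind of cleanup involves (this approximates the idea with naive regexes and is not webclaw's actual implementation):

```python
import re

def compact_markdown(md: str) -> str:
    """Approximate LLM-oriented cleanup: strip emphasis markers,
    collapse blank-line runs, keep only the first mention of each
    link target. Naive regexes; URLs containing _ would need care."""
    # Strip bold/italic markers but keep the text inside them
    md = re.sub(r'(\*\*|__)(.*?)\1', r'\2', md)
    md = re.sub(r'(\*|_)(.*?)\1', r'\2', md)
    # Collapse runs of blank lines into a single blank line
    md = re.sub(r'\n{3,}', '\n\n', md)
    # Deduplicate links: later mentions of a URL become plain text
    seen = set()
    def dedupe(m):
        text, url = m.group(1), m.group(2)
        if url in seen:
            return text
        seen.add(url)
        return m.group(0)
    md = re.sub(r'\[([^\]]*)\]\(([^)]*)\)', dedupe, md)
    return md.strip()
```

For example, `compact_markdown("**Bold** and *it*\n\n\n\n[a](http://x) then [b](http://x)")` flattens the emphasis, collapses the blank lines, and keeps only the first link to `http://x`.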

Step 3: chunking strategies that actually work

Once you have clean content, you need to split it into chunks. The standard approach is recursive character splitting with some overlap. It works, but there are better options when your source is markdown.

Heading-based splitting. Markdown has structure. H1, H2, H3 headers create a natural hierarchy. Split on headers and you get chunks that are semantically coherent because the author organized the content that way.

import re

def split_by_headings(markdown: str, max_chunk: int = 1500) -> list[str]:
    sections = re.split(r'\n(?=#{1,3} )', markdown)
    chunks = []
    for section in sections:
        if len(section) > max_chunk:
            # Fall back to paragraph splitting for long sections
            paragraphs = section.split('\n\n')
            current = ""
            for p in paragraphs:
                if len(current) + len(p) > max_chunk and current:
                    chunks.append(current.strip())
                    current = p
                else:
                    current += "\n\n" + p
            if current.strip():
                chunks.append(current.strip())
        else:
            chunks.append(section.strip())
    return [c for c in chunks if len(c) > 50]

Metadata enrichment. Before embedding, prepend the page title and URL to each chunk. This gives the embeddings model context about where the content came from and improves retrieval accuracy significantly.

def enrich_chunk(chunk: str, title: str, url: str) -> str:
    return f"Source: {title}\nURL: {url}\n\n{chunk}"

This simple addition means when the retriever pulls a chunk, the LLM knows which page it came from and can cite sources.

Step 4: keeping it fresh

Static RAG pipelines are easy. You run the ingestion once and you're done. Live web RAG is harder because you need to decide when to re-fetch and how to handle content changes.

webclaw has a /v1/diff endpoint that tracks content changes between snapshots. You can use this to build a refresh strategy:

1. Crawl your sources on a schedule (daily, hourly, whatever makes sense)

2. Diff each page against the last snapshot

3. Only re-embed pages that actually changed

4. Delete old vectors and insert new ones

This keeps your vector database fresh without re-embedding everything on every cycle.
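The diff-then-re-embed loop can be sketched locally with nothing but content hashes; in a real pipeline the /v1/diff endpoint would replace the hashing step, so the names and snapshot shape here are illustrative:

```python
import hashlib

def refresh_plan(previous: dict, current: dict) -> dict:
    """Compare url -> content snapshots and decide what needs work.
    Returns urls to re-embed (changed/new) and urls to delete (removed)."""
    def h(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()

    prev_hashes = {url: h(content) for url, content in previous.items()}
    plan = {"changed": [], "new": [], "removed": []}
    for url, content in current.items():
        if url not in prev_hashes:
            plan["new"].append(url)
        elif prev_hashes[url] != h(content):
            plan["changed"].append(url)
    plan["removed"] = [url for url in previous if url not in current]
    return plan
```

Only the `changed` and `new` buckets go to the embeddings model; `removed` just triggers vector deletes. That's the whole economy of the approach.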

For monitoring specific pages, webclaw's /v1/watch endpoint does this automatically. Set a URL, a check interval, and a webhook. When the content changes, you get notified.

The full picture

Putting it all together:

import hashlib

from webclaw import Webclaw
from openai import OpenAI

wc = Webclaw(api_key="your-key")
openai = OpenAI()

# 1. Fetch and extract
result = wc.scrape("https://docs.example.com/api", formats=["llm"])
content = result.llm

# 2. Chunk
chunks = split_by_headings(content)
enriched = [enrich_chunk(c, result.metadata.title, result.url) for c in chunks]

# 3. Embed
embeddings = openai.embeddings.create(
    model="text-embedding-3-small",
    input=enriched
)

# 4. Store (using whatever vector DB you prefer)
for chunk, embedding in zip(enriched, embeddings.data):
    vector_db.upsert(
        id=hashlib.sha256(chunk.encode()).hexdigest(),  # stable across runs, unlike hash()
        vector=embedding.embedding,
        metadata={"text": chunk, "url": result.url}
    )

For bulk ingestion, use /v1/crawl to discover all pages on a site and /v1/batch to extract them in parallel.
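The parallel part of bulk ingestion is just a thread pool around whatever fetch call you use; `fetch` below stands in for a scrape client method or batch request, so nothing in this sketch is webclaw-specific:

```python
from concurrent.futures import ThreadPoolExecutor

def ingest_parallel(urls, fetch, max_workers=8):
    """Fetch many pages concurrently. `fetch` is any callable
    url -> content; failed fetches map to None instead of raising."""
    def safe(url):
        try:
            return url, fetch(url)
        except Exception:
            return url, None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe, urls))
```

Mapping failures to None instead of raising matters at scale: one flaky page shouldn't abort a thousand-URL crawl.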

What I'd do differently

After building this for several projects, here's what I wish I knew earlier:

Start with fewer sources. It's tempting to crawl everything. Don't. Start with 10 pages, get the quality right, then scale. Bad extraction at scale just means more garbage in your vector database.

Monitor your retrieval quality. Log what chunks get retrieved for each query. When the retriever returns irrelevant results, the problem is almost always in extraction or chunking, not in the retrieval algorithm.
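One low-effort way to get that logging is a thin wrapper around whatever retriever you have; the `(chunk_id, score, text)` tuple shape here is an assumption, so adjust it to your vector store's return type:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("retrieval")

def logged_retrieve(retriever, query: str, k: int = 3):
    """Call `retriever` (any callable returning (chunk_id, score, text)
    tuples) and log every hit so bad retrievals are easy to audit."""
    results = retriever(query, k)
    for chunk_id, score, text in results:
        log.info("query=%r hit=%s score=%.3f text=%r",
                 query, chunk_id, score, text[:80])
    return results
```

When a nav menu shows up in these logs with a high score, you know the fix belongs in extraction, not in the retriever.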

Clean content beats more content. 100 well-extracted pages outperform 10,000 pages of noisy HTML every time. The extraction step is where you win or lose.

If you're building a RAG pipeline and the web is your data source, the extraction layer is the most important piece. Get that right and the rest follows.

webclaw is open source and MIT licensed. The whole extraction engine is on GitHub. Star it if it saves you time.

Install it in 30 seconds, or read the documentation to get started. If you have questions, open an issue on the repo or join the Discord.

---

Read next: HTML to markdown for LLMs | Web scraping for AI agents | Getting started