Web extraction, LLMs, and building in public.
Technical deep dives on web extraction, content parsing for LLMs, anti-bot bypass, and building open-source infrastructure in Rust. Written by the team behind webclaw.
webclaw turns any website into clean, structured content for AI applications. These posts cover the engineering decisions, trade-offs, and lessons learned building a web extraction toolkit from scratch.

LlamaIndex Web Scraping: Fix SimpleWebPageReader
LlamaIndex web scraping fails on bot-blocked pages, empty JavaScript shells, and noisy HTML. Here's how to feed cleaner Markdown into RAG pipelines and agents.

LangChain web scraping in 2026: what loaders can't do
LangChain's built-in loaders break on bot-protected sites and return raw HTML your LLM can't use. Here's how to get clean, reliable web data into any LangChain pipeline.

How to Scrape Google Search Results in 2026 (5 Ways)
Google no longer serves search results to plain HTTP requests. Here are five approaches that still work in 2026, including TLS fingerprinting, headless browsers, and SERP APIs, with code examples for each.

The 6 best web scraping APIs for LLMs in 2026
If you're building with LLMs, you need web data. Here's how the main scraping APIs compare on the things that actually matter for AI use cases.

How to Bypass Cloudflare Bot Protection (2026, No Browser)
Fix the four signals Cloudflare checks before you reach for a headless browser: TLS, HTTP/2, challenge, session. Why proxy and user-agent rotation alone fails.

Extract structured data from any URL in one call
You don't always need the full page. Sometimes you need three fields from a product listing. Here's how to pull exactly the data you want from any URL.

Build a RAG pipeline with live web data (4 steps)
Most RAG tutorials stop at "upload a PDF." Real apps need live web data. Here's how to build a pipeline that fetches, extracts, and indexes pages.

MCP web scraping for Claude Code and Cursor
MCP web scraping gives Claude Code, Cursor, and AI agents live web access. Scrape, crawl, search, extract, and summarize from one server.

HTML to Markdown for LLMs and RAG
Convert HTML to Markdown for LLMs: boilerplate removed, links preserved, and cleaner RAG input for agents and summarization.