MCP and web scraping. Give your AI agent real internet access.
Your AI agent can write code, analyze documents, query databases, and hold long conversations. But ask it to check a competitor's pricing page, read the latest docs for a framework, or pull product specs from a supplier's website, and it hits a wall. It can't read the web.
This is the gap that MCP closes. And web scraping is the use case that makes it obvious.
What MCP actually is
MCP stands for Model Context Protocol. It's an open standard that lets AI models call external tools. Think of it like USB for AI. Before USB, every peripheral needed its own driver, its own connector, its own software. MCP does the same thing for AI tools: one protocol, any tool, any model.
The server advertises which tools it offers. The user (or the model itself) decides when to call one. The tool runs, returns data, and the model keeps going with the new context.
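Under the hood, MCP messages are JSON-RPC 2.0. A client asking a server to run a tool sends something roughly like this (the `scrape` tool name and URL here are illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "scrape",
    "arguments": { "url": "https://example.com/pricing" }
  }
}
```

The server replies with the tool's result, which the client injects into the model's context. You never write these messages by hand; the client handles the protocol.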
Claude Desktop, Claude Code, Cursor, Windsurf, and a growing list of other clients support MCP natively. You install an MCP server, it shows up as a set of tools your AI can call, and that's it. No API wiring, no middleware, no custom code.
The MCP SDK crossed 97 million monthly downloads. This is not experimental anymore.
Why web data is the killer MCP use case
Most MCP tools are wrappers around APIs. Connect to Slack, read a GitHub issue, query a database. Useful, but limited to services you already have access to.
Web scraping is different. It gives your AI access to the entire public web. Any URL, any page, any site. The agent decides what to read based on the conversation, not a predefined list.
This changes what agents can do.
An agent helping you evaluate SaaS tools can read their actual pricing pages instead of relying on its training data from months ago. An agent writing documentation can crawl the framework's latest docs. An agent doing competitive research can pull real numbers from public filings and product pages.
Without web access, agents are limited to what they already know. With web access, they can go find what they need. That's a fundamental capability shift.
Setting it up
webclaw ships an MCP server called webclaw-mcp with 8 tools. Install it once and your AI gets scraping, crawling, search, sitemap discovery, structured extraction, summarization, content diffing, and brand extraction.
Add this to your Claude Desktop config:
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp"
    }
  }
}

Restart Claude Desktop. The tools appear in the tool menu. Your AI can now call them during any conversation.
For Claude Code, same config in your project's .mcp.json. For Cursor, add it to the MCP settings panel.
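For Claude Code, the `.mcp.json` in your project root takes the same shape as the Claude Desktop config:

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp"
    }
  }
}
```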
No API key needed for the local server. It runs on your machine, uses its own HTTP client with TLS fingerprinting, and returns clean markdown. If you want to use the cloud API instead (for higher concurrency, JavaScript rendering, or anti-bot bypass), set the WEBCLAW_API_KEY environment variable and add --cloud to the command.
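Cloud mode, as a sketch: standard MCP client configs accept `args` and `env` keys, so the `--cloud` flag and API key slot in like this (the key value is a placeholder):

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp",
      "args": ["--cloud"],
      "env": {
        "WEBCLAW_API_KEY": "your-api-key"
      }
    }
  }
}
```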
What the tools do
scrape reads a single URL and returns clean content. You control the format: markdown for full fidelity, llm for token-optimized output, text for plain text, json for structured metadata. The agent picks the format based on what it needs.
crawl follows links from a starting URL. It discovers pages across the site, extracts each one, and returns the full set. Useful for ingesting documentation sites, mapping a competitor's product catalog, or building a knowledge base from a company's blog.
search queries the web and returns results with snippets. When the agent needs to find information but doesn't have a specific URL, it searches first, then scrapes the most relevant results. This is how research workflows start.
map discovers all URLs on a site without scraping them. It reads the sitemap, follows internal links, and returns a clean list. The agent uses this to understand the structure of a site before deciding what to extract.
extract pulls structured data from a page using a JSON schema. The agent describes the shape of data it wants (product names and prices, contact information, event dates), and the extraction engine returns exactly that. No regex, no selectors, no brittle parsing.
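For instance, to pull a product catalog, the agent might pass a schema shaped like the following (an illustrative JSON Schema; check the webclaw docs for the exact dialect the extraction engine accepts):

```json
{
  "type": "object",
  "properties": {
    "products": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name":  { "type": "string" },
          "price": { "type": "string" }
        }
      }
    }
  }
}
```

The engine returns data matching that shape, regardless of how the page lays it out.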
summarize condenses a page into a short summary. When the agent needs the gist of an article but not the full content, this saves tokens and keeps the context window focused.
diff compares a page against a previous snapshot. The agent uses this to detect content changes: updated pricing, new product listings, modified documentation.
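As a rough illustration of what a diff tool produces (not webclaw's actual implementation), Python's standard-library difflib generates the same kind of change report between two stored snapshots:

```python
import difflib

def page_diff(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two page snapshots."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    ))

old = "Pro plan: $49/mo\nTeam plan: $99/mo"
new = "Pro plan: $59/mo\nTeam plan: $99/mo"
for line in page_diff(old, new):
    print(line)
```

The agent sees only the lines that changed, which is exactly what a monitoring workflow needs to report.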
brand extracts visual identity from a page: colors, fonts, logos, favicons, OG images. Useful for design tools, competitive analysis, or generating brand-consistent content.
How agents actually use these
The tools are simple. What makes them powerful is how agents chain them together.
Research workflow. You ask: "Compare the pricing of webclaw, firecrawl, and scrapingbee." The agent calls search to find each pricing page. Calls scrape on each result. Extracts the relevant pricing data. Compares them in a table. All within one conversation, all with live data.
Documentation ingestion. You say: "Read the Next.js App Router docs and explain how middleware works." The agent calls map on nextjs.org/docs to find all doc pages. Calls crawl to extract the middleware-related pages. Reads the content and explains it with references to the actual documentation.
Content monitoring. You run a daily check: "Has the pricing changed on these three competitor pages?" The agent calls diff against stored snapshots. Reports what changed. Stores the new snapshots for next time.
Lead enrichment. You pass a list of company URLs. The agent calls extract on each with a schema for company name, tech stack, team size, and recent news. Returns a structured spreadsheet of enriched data.
None of this requires custom code. The agent figures out which tools to call and in what order. You describe the outcome you want in plain language.
What works well and what doesn't
MCP web scraping works best for focused, real-time extraction. Read a page, get the data, move on. The latency is low enough (100-300ms per page for static content) that it feels seamless in a conversation.
It works less well for massive scale. If you need to scrape 10,000 pages, doing it through MCP one conversation turn at a time is slow. For that, use the REST API directly with the batch or crawl endpoints, then bring the results into your agent's context.
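A sketch of that fallback, assuming a hypothetical batch endpoint and bearer-token auth (the URL, payload shape, and `format` field here are illustrative; consult the API reference for the real contract):

```python
import json
import urllib.request

API_KEY = "your-api-key"                       # placeholder; set from your account
BATCH_URL = "https://api.webclaw.io/v1/batch"  # hypothetical endpoint name

def build_batch_request(urls: list[str], fmt: str = "llm") -> urllib.request.Request:
    """Build one batched request instead of N conversation turns."""
    payload = json.dumps({"urls": urls, "format": fmt}).encode()
    return urllib.request.Request(
        BATCH_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

req = build_batch_request([f"https://example.com/p/{i}" for i in range(3)])
```

Once the batch completes, you load the results into your agent's context in one shot rather than page by page.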
JavaScript-heavy SPAs (React apps with client-side rendering only) sometimes return empty content through the local MCP server because it doesn't run a browser engine. The cloud API handles these through server-side JavaScript rendering, so if you're hitting SPAs, use --cloud.
Anti-bot protected sites (Cloudflare, DataDome) work fine with the TLS fingerprinting in most cases. For the hardest sites that require CAPTCHA solving, the cloud API has an antibot sidecar that handles it.
The MCP protocol itself has a limitation worth knowing: tool results are injected into the model's context window. A scrape that returns 5,000 tokens of content consumes 5,000 tokens of context. For long conversations or multi-page research, the context fills up. Using llm format instead of markdown helps because it returns 67% fewer tokens for the same content.
Beyond Claude
MCP is not Claude-specific. Any client that supports the Model Context Protocol can use webclaw-mcp. Cursor, Windsurf, Continue, and other coding tools already support MCP. OpenAI has announced MCP support. The ecosystem is converging on this standard.
This matters because the tool you install today works with every client that adopts MCP tomorrow. You're not locked into one vendor's tool ecosystem.
Getting started
Install webclaw:
cargo install webclaw

Or download a prebuilt binary from the releases page. The webclaw-mcp binary is included.
Add the config to your AI client. Start a conversation. Ask your agent to read a webpage. It will call scrape, get the content, and work with it like it was always there.
If you want the cloud API for JavaScript rendering, anti-bot bypass, and higher concurrency, sign up at webclaw.io and set your API key in the MCP config.
The MCP server is open source and MIT licensed. The cloud API has a free tier with 500 pages per month.
Check the MCP documentation for the full tool reference and advanced configuration.