MCP and web scraping. Give your AI agent real internet access.
Your AI agent can write code, analyze documents, query databases, and hold long conversations. But ask it to check a competitor's pricing page, read the latest docs for a framework, or pull product specs from a supplier's website, and it hits a wall. It can't read the web.
This is the gap that MCP closes. And web scraping is the use case that makes it obvious.
What MCP actually is
MCP stands for Model Context Protocol. It's an open standard that lets AI models call external tools. Think of it like USB for AI. Before USB, every peripheral needed its own driver, its own connector, its own software. MCP does the same thing for AI tools: one protocol, any tool, any model.
The server advertises which tools it offers. The user (or the model itself) decides when to call one. The tool runs, returns data, and the model keeps going with the new context.
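Under the hood, MCP messages are JSON-RPC 2.0. A client asking a server to run a tool sends something roughly like this (the `scrape` tool name and URL here are illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "scrape",
    "arguments": { "url": "https://example.com/pricing" }
  }
}
```

The server replies with the tool's result, which the client injects into the model's context. You never write these messages by hand; the client handles the protocol.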
Claude Desktop, Claude Code, Cursor, Windsurf, and a growing list of other clients support MCP natively. You install an MCP server, it shows up as a set of tools your AI can call, and that's it. No API wiring, no middleware, no custom code.
The MCP SDK crossed 97 million monthly downloads. This is not experimental anymore.
Why web data is the killer MCP use case
Most MCP tools are wrappers around APIs. Connect to Slack, read a GitHub issue, query a database. Useful, but limited to services you already have access to.
Web scraping is different. It gives your AI access to the entire public web. Any URL, any page, any site. The agent decides what to read based on the conversation, not a predefined list.
This changes what agents can do.
An agent helping you evaluate SaaS tools can read their actual pricing pages instead of relying on its training data from months ago. An agent writing documentation can crawl the framework's latest docs. An agent doing competitive research can pull real numbers from public filings and product pages.
Without web access, agents are limited to what they already know. With web access, they can go find what they need. That's a fundamental capability shift.
Setting it up
webclaw ships an MCP server called webclaw-mcp with 8 tools. Install it once and your AI gets scraping, crawling, search, sitemap discovery, structured extraction, summarization, content diffing, and brand extraction.
Add this to your Claude Desktop config:
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp"
    }
  }
}

Restart Claude Desktop. The tools appear in the tool menu. Your AI can now call them during any conversation.
For Claude Code, same config in your project's .mcp.json. For Cursor, add it to the MCP settings panel.
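For Claude Code, the `.mcp.json` in your project root takes the same shape as the Claude Desktop config:

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp"
    }
  }
}
```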
No API key needed for the local server. It runs on your machine, uses its own HTTP client with TLS fingerprinting, and returns clean markdown. If you want to use the cloud API instead (for higher concurrency, JavaScript rendering, or anti-bot bypass), set the WEBCLAW_API_KEY environment variable and add --cloud to the command.
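Cloud mode, as a sketch: standard MCP client configs accept `args` and `env` keys, so the `--cloud` flag and API key slot in like this (the key value is a placeholder):

```json
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp",
      "args": ["--cloud"],
      "env": {
        "WEBCLAW_API_KEY": "your-api-key"
      }
    }
  }
}
```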
What the tools do
scrape reads a single URL and returns clean content. You control the format: markdown for full fidelity, llm for token-optimized output, text for plain text, json for structured metadata. The agent picks the format based on what it needs.
crawl follows links from a starting URL. It discovers pages across the site, extracts each one, and returns the full set. Useful for ingesting documentation sites, mapping a competitor's product catalog, or building a knowledge base from a company's blog.
search queries the web and returns results with snippets. When the agent needs to find information but doesn't have a specific URL, it searches first, then scrapes the most relevant results. This is how research workflows start.
map discovers all URLs on a site without scraping them. It reads the sitemap, follows internal links, and returns a clean list. The agent uses this to understand the structure of a site before deciding what to extract.
extract pulls structured data from a page using a JSON schema. The agent describes the shape of data it wants (product names and prices, contact information, event dates), and the extraction engine returns exactly that. No regex, no selectors, no brittle parsing.
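For instance, to pull a product catalog, the agent might pass a schema shaped like the following (an illustrative JSON Schema; check the webclaw docs for the exact dialect the extraction engine accepts):

```json
{
  "type": "object",
  "properties": {
    "products": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name":  { "type": "string" },
          "price": { "type": "string" }
        }
      }
    }
  }
}
```

The engine returns data matching that shape, regardless of how the page lays it out.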
summarize condenses a page into a short summary. When the agent needs the gist of an article but not the full content, this saves tokens and keeps the context window focused.
diff compares a page against a previous snapshot. The agent uses this to detect content changes: updated pricing, new product listings, modified documentation.
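As a rough illustration of what a diff tool produces (not webclaw's actual implementation), Python's standard-library difflib generates the same kind of change report between two stored snapshots:

```python
import difflib

def page_diff(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two page snapshots."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous", tofile="current", lineterm="",
    ))

old = "Pro plan: $49/mo\nTeam plan: $99/mo"
new = "Pro plan: $59/mo\nTeam plan: $99/mo"
for line in page_diff(old, new):
    print(line)
```

The agent sees only the lines that changed, which is exactly what a monitoring workflow needs to report.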
brand extracts visual identity from a page: colors, fonts, logos, favicons, OG images. Useful for design tools, competitive analysis, or generating brand-consistent content.
How agents actually use these
The tools are simple. What makes them powerful is how agents chain them together.
Research workflow. You ask: "Compare the pricing of webclaw, firecrawl, and scrapingbee." The agent calls search to find each pricing page. Calls scrape on each result. Extracts the relevant pricing data. Compares them in a table. All within one conversation, all with live data.
Documentation ingestion. You say: "Read the Next.js App Router docs and explain how middleware works." The agent calls map on nextjs.org/docs to find all doc pages. Calls crawl to extract the middleware-related pages. Reads the content and explains it with references to the actual documentation.
Content monitoring. You run a daily check: "Has the pricing changed on these three competitor pages?" The agent calls diff against stored snapshots. Reports what changed. Stores the new snapshots for next time.
Lead enrichment. You pass a list of company URLs. The agent calls extract on each with a schema for company name, tech stack, team size, and recent news. Returns a structured spreadsheet of enriched data.
None of this requires custom code. The agent figures out which tools to call and in what order. You describe the outcome you want in plain language.
What works well and what doesn't
MCP web scraping works best for focused, real-time extraction. Read a page, get the data, move on. The latency is low enough (100-300ms per page for static content) that it feels seamless in a conversation.
It works less well for massive scale. If you need to scrape 10,000 pages, doing it through MCP one conversation turn at a time is slow. For that, use the REST API directly with the batch or crawl endpoints, then bring the results into your agent's context.
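A sketch of that fallback, assuming a hypothetical batch endpoint and bearer-token auth (the URL, payload shape, and `format` field here are illustrative; consult the API reference for the real contract):

```python
import json
import urllib.request

API_KEY = "your-api-key"                       # placeholder; set from your account
BATCH_URL = "https://api.webclaw.io/v1/batch"  # hypothetical endpoint name

def build_batch_request(urls: list[str], fmt: str = "llm") -> urllib.request.Request:
    """Build one batched request instead of N conversation turns."""
    payload = json.dumps({"urls": urls, "format": fmt}).encode()
    return urllib.request.Request(
        BATCH_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

req = build_batch_request([f"https://example.com/p/{i}" for i in range(3)])
```

Once the batch completes, you load the results into your agent's context in one shot rather than page by page.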
JavaScript-heavy SPAs (React apps with client-side rendering only) sometimes return empty content through the local MCP server because it doesn't run a browser engine. The cloud API handles these through server-side JavaScript rendering, so if you're hitting SPAs, use --cloud.
Anti-bot protected sites (Cloudflare, DataDome) work fine with the TLS fingerprinting in most cases. For the hardest sites that require CAPTCHA solving, the cloud API has an antibot sidecar that handles it.
The MCP protocol itself has a limitation worth knowing: tool results are injected into the model's context window. A scrape that returns 5,000 tokens of content consumes 5,000 tokens of context. For long conversations or multi-page research, the context fills up. Using llm format instead of markdown helps because it returns 67% fewer tokens for the same content.
Beyond Claude
MCP is not Claude-specific. Any client that supports the Model Context Protocol can use webclaw-mcp. Cursor, Windsurf, Continue, and other coding tools already support MCP. OpenAI has announced MCP support. The ecosystem is converging on this standard.
This matters because the tool you install today works with every client that adopts MCP tomorrow. You're not locked into one vendor's tool ecosystem.
Getting started
Install webclaw:
cargo install webclaw

Or download a prebuilt binary from the releases page. The webclaw-mcp binary is included.
Add the config to your AI client. Start a conversation. Ask your agent to read a webpage. It will call scrape, get the content, and work with it like it was always there.
If you want the cloud API for JavaScript rendering, anti-bot bypass, and higher concurrency, sign up at webclaw.io and set your API key in the MCP config.
The MCP server is open source and MIT licensed. The cloud API has a free tier with 500 pages per month.
Check the MCP documentation for the full tool reference and advanced configuration.