How to evaluate web scraping APIs for AI agents
Most web scraping API evaluations start with the wrong URL.
https://example.com

It is fast. It is stable. It has clean HTML. It has no JavaScript app shell, no pricing table, no docs sidebar, no cookie banner, no bot protection, no weird markdown edge cases, and no downstream parser waiting to break.
That makes it useful for checking if an API key works.
It makes it almost useless for deciding if a web scraping API belongs in your product.
If you are building an AI agent, a RAG pipeline, a competitor monitor, or a research workflow, the question is not:
Can this API scrape a page?

The real question is:

Can this API return useful context from the pages my workflow actually depends on?

Those are very different tests.
I am building webclaw, a web extraction API, CLI, and MCP server for AI agents. The more I talk with teams testing scraping providers, the more I see the same mistake: they compare tools on toy pages, then discover the real failures only after wiring the tool into an agent, RAG ingestion job, or production data pipeline.
This is how I would evaluate a web scraping API before trusting it.
This post continues the provider-evaluation cluster after Migrating from Firecrawl: compatible API for AI agents. The goal is simple: test scraping APIs like infrastructure, not like a landing-page demo.
Start With A Real URL Set
Do not start with the homepage of a famous company.
Do not start with a static demo page.
Start with 10 to 20 URLs that represent your actual workflow. This is the fastest way to evaluate a scraping API for AI agents because agents do not browse the average web page. They hit docs, pricing pages, changelogs, search results, and weird edge cases.
| URL type | Why it matters |
|---|---|
| Documentation page | Tests headings, code blocks, tables, sidebars, and internal links. |
| Changelog page | Tests date structure, repeated entries, and incremental monitoring. |
| Pricing page | Tests tables, plan names, feature lists, and layout-heavy content. |
| Product page | Tests messy marketing pages, images, specs, and variant data. |
| Blog article | Tests main-content extraction and boilerplate removal. |
| Search results page | Tests dynamic content and anti-automation behavior. |
| JavaScript-heavy page | Tests whether the initial HTML is enough or rendering is needed. |
| Previously flaky URL | Tests the failure mode you already know exists. |
The best benchmark is not broad.
It is representative.
If your product monitors competitor pricing pages, test pricing pages.
If your agent reads docs, test docs.
If your RAG pipeline ingests help centers, test help centers.
That sounds obvious. It is also where most evaluations get lazy.
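If it helps, the set can live next to your tests as data. A minimal TypeScript sketch, where the URLs and categories are placeholders to swap for the pages your product actually depends on:

```ts
// A minimal sketch of a representative evaluation set.
// The URLs and categories are placeholders, not recommendations.
type EvalUrl = {
  url: string;
  kind: "docs" | "changelog" | "pricing" | "product" | "blog" | "search" | "js-heavy" | "flaky";
  note?: string; // why this URL is in the set
};

const evalSet: EvalUrl[] = [
  { url: "https://example.com/docs/getting-started", kind: "docs" },
  { url: "https://example.com/changelog", kind: "changelog" },
  { url: "https://example.com/pricing", kind: "pricing" },
  { url: "https://example.com/app", kind: "js-heavy", note: "initial HTML is an empty shell" },
  { url: "https://example.com/search?q=widgets", kind: "flaky", note: "has failed before" },
];
```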
A 200 Is Not Success
Web scraping APIs make it too easy to treat HTTP status as the result.
```json
{
  "success": true,
  "status": 200
}
```

That can still be a failure.
For AI workflows, these are common false positives:
| Failure | What it looks like |
|---|---|
| Empty app shell | The response contains header/nav text, but no real page body. |
| Challenge page | The API returns an anti-bot page as if it were content. |
| Login wall | The markdown describes a sign-in page instead of the requested page. |
| Boilerplate flood | The useful content is buried under nav, footer, cookie, and promo text. |
| Broken code blocks | Docs pages lose formatting and become useless for developer agents. |
| Flattened tables | Pricing or comparison data loses row/column meaning. |
| Missing source metadata | Your downstream answer has no reliable URL, title, or timestamp. |
For LLM apps, a clean-looking wrong page is worse than an error.
An error stops the workflow.
Bad context poisons the workflow.
The agent summarizes a block page. The retriever embeds repeated nav text. The monitor reports no change because it never saw the real page.
That is why your evaluation needs to inspect output quality, not just status.
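A cheap way to catch these is a sanity check that runs on every result before anything downstream touches it. This is a rough sketch; the length threshold and phrase list are assumptions you would tune against your own URL set:

```ts
// A rough sketch of post-scrape sanity checks.
// The threshold and phrases are illustrative assumptions, not a complete detector.
type ScrapeResult = { status: number; markdown: string };

function looksLikeRealContent(result: ScrapeResult): boolean {
  const text = result.markdown.trim();
  if (result.status !== 200) return false;
  if (text.length < 300) return false; // likely an empty app shell

  const suspicious = [
    "checking your browser",
    "verify you are human",
    "enable javascript",
    "sign in to continue",
  ];
  const lower = text.toLowerCase();
  // Challenge pages and login walls often come back as "successful" markdown.
  if (suspicious.some((phrase) => lower.includes(phrase))) return false;

  return true;
}
```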
Compare The Output Shape
When testing providers, put the outputs side by side.
Not in a vibes-based way.
Use a checklist.
| Check | What to look for |
|---|---|
| Title | Is it the real page title, not a generic site title? |
| URL | Is the final URL preserved after redirects? |
| Headings | Are page sections represented clearly? |
| Main content | Is the actual article/docs/pricing content present? |
| Boilerplate | Are nav, footer, cookie banners, and repeated sidebars removed? |
| Code blocks | Are code samples preserved with formatting? |
| Tables | Are rows and columns understandable in text? |
| Links | Are important links preserved? |
| Metadata | Do you get useful title, description, language, and timing fields? |
| Error behavior | Does the API clearly report blocks, timeouts, and empty pages? |
The point is not to find the prettiest markdown.
The point is to find the output that survives your downstream workflow.
If the result goes into an agent, paste it into the actual agent prompt path.
If it goes into RAG, chunk it and inspect retrieval.
If it goes into a monitor, diff it against a later run.
The consumer decides whether the extraction is good.
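It also helps to record the checklist as data instead of impressions. A minimal scorecard sketch, with field names that are illustrative rather than any provider's schema:

```ts
// A sketch of a per-URL scorecard so side-by-side comparison is recorded, not remembered.
type OutputChecks = {
  realTitle: boolean;
  finalUrlPreserved: boolean;
  headingsPresent: boolean;
  mainContentPresent: boolean;
  boilerplateRemoved: boolean;
  codeBlocksIntact: boolean;
  tablesReadable: boolean;
  downstreamParserOk: boolean;
};

type ScorecardRow = { url: string; provider: string; checks: OutputChecks };

// Crude summary: fraction of all checks that passed for a provider's rows.
function passRate(rows: ScorecardRow[]): number {
  const flags = rows.flatMap((row) => Object.values(row.checks));
  return flags.filter(Boolean).length / flags.length;
}
```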
Measure Token Waste
For AI products, token size is not a cosmetic detail. If you are comparing website-to-markdown APIs for LLMs, output size and content quality should be part of the test.
It affects cost, latency, context quality, and retrieval quality.
There are three outputs you should compare:
| Output | Problem |
|---|---|
| Raw HTML | Huge, noisy, full of markup and scripts. |
| Plain text | Smaller, but often loses structure. |
| Clean markdown / LLM format | Keeps useful structure while cutting noise. |
A scraping API that returns raw HTML quickly is still pushing work downstream.
Your app now has to clean it.
Your LLM now has to ignore it.
Your vector database now has to embed it.
That is not free.
For each test URL, record:
- raw HTML size
- markdown size
- LLM-context size
- useful content present: yes/no
- boilerplate level: low/medium/high

Do not optimize only for the smallest output.
The smallest output can be wrong.
Optimize for the smallest output that still preserves the content your workflow needs.
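A small helper makes the recording concrete. The characters-per-token ratio below is a rough heuristic; use your model's tokenizer when the numbers feed real cost decisions:

```ts
// A sketch for recording output size per URL.
// The 4-characters-per-token ratio is a rough heuristic, not a real tokenizer.
type SizeRecord = {
  url: string;
  rawHtmlChars: number;
  markdownChars: number;
  approxMarkdownTokens: number;
  usefulContentPresent: boolean;          // filled in by a human or a content check
  boilerplate: "low" | "medium" | "high"; // filled in by a human
};

function measure(url: string, rawHtml: string, markdown: string) {
  return {
    url,
    rawHtmlChars: rawHtml.length,
    markdownChars: markdown.length,
    approxMarkdownTokens: Math.ceil(markdown.length / 4), // swap in a real tokenizer if needed
  };
}
```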
Test Crawl Separately From Scrape
Scrape and crawl are different products. A scraping API can be good at one-page extraction and still be weak for RAG web crawling.
Scrape answers:
Can you extract this one page?

Crawl answers:

Can you discover the right pages, stay inside boundaries, extract each page, and return a usable collection?

That adds new failure modes.
| Crawl concern | What can go wrong |
|---|---|
| Discovery | It misses pages that matter. |
| Boundaries | It wanders into irrelevant pages. |
| Deduplication | It extracts the same content many times. |
| Depth | It stops before reaching useful docs. |
| Pagination | It misses list/detail pages. |
| Status polling | Jobs are hard to debug or recover. |
| Output consistency | Pages come back in mixed formats or quality levels. |
For docs ingestion and RAG, crawl quality often matters more than single-page quality.
You do not want “more pages.”
You want the right pages.
Start with a tiny crawl:
- start URL: docs home
- max pages: 10-25
- depth: 1-2
- format: markdown or LLM-ready

Then ask:
| Question | Why |
|---|---|
| Did it find the pages a human would click first? | Discovery quality. |
| Did it avoid login, legal, footer, and duplicate pages? | Boundary quality. |
| Is each page clean enough to embed or summarize? | Extraction quality. |
| Can I map each answer back to a source URL? | Citation quality. |
This is the test most teams skip until the week they need to ingest a whole site.
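To make the tiny crawl concrete, here is a sketch of a bounded crawl request and a naive runner. The endpoint path and parameter names are illustrative; check your provider's docs for the real fields:

```ts
// A sketch of a tiny, bounded crawl request.
// Endpoint path and parameter names are illustrative, not a specific provider's API.
const tinyCrawl = {
  url: "https://example.com/docs",
  maxPages: 20,
  maxDepth: 2,
  format: "markdown",
};

async function runTinyCrawl(apiBase: string, apiKey: string) {
  const res = await fetch(`${apiBase}/crawl`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify(tinyCrawl),
  });
  if (!res.ok) throw new Error(`crawl request failed: ${res.status}`);
  // Inspect page count, discovered URLs, and markdown quality by hand.
  return res.json();
}
```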
Include Your Worst URLs
Every team has a few URLs they hate.
The page that randomly fails.
The docs site with weird sidebar navigation.
The pricing page with layout-heavy cards.
The competitor site that sometimes returns a block.
The JavaScript app where the initial HTML is just:
```html
<div id="root"></div>
```

Put those URLs in the evaluation set.
Do not hide them because they make the benchmark messy.
They are the benchmark.
If a provider works beautifully on easy pages and fails on the three pages your product depends on, it is not a good fit for your product.
Test Error Behavior
A good scraping API should fail in a way your app can use.
Bad error behavior looks like this:
```text
200 OK
markdown: "Checking your browser..."
```

Or this:

```text
timeout
```

with no clue what timed out.
Useful error behavior tells you what happened:
| Error shape | Why it helps |
|---|---|
| Block detected | You can retry, route, or alert. |
| Empty content detected | You know rendering or another path may be needed. |
| Timeout type | You can distinguish connect, fetch, render, and extraction failures. |
| Source URL preserved | You can debug the exact page. |
| Partial crawl results | You can keep useful pages instead of losing the whole job. |
For agents, this matters even more.
An agent can recover from a typed failure.
It cannot recover from a lie.
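Concretely, the agent needs something it can branch on. Here is a sketch of a typed outcome that makes recovery possible; the failure taxonomy is an assumption, not any specific provider's schema:

```ts
// A sketch of a typed result an agent can branch on.
// The point is that "blocked" and "empty" are distinguishable from success.
type ScrapeOutcome =
  | { kind: "ok"; url: string; markdown: string }
  | { kind: "blocked"; url: string; detail?: string }
  | { kind: "empty"; url: string }
  | { kind: "timeout"; url: string; stage: "connect" | "fetch" | "render" | "extract" };

function nextAction(outcome: ScrapeOutcome): "use" | "retry" | "route-elsewhere" | "alert" {
  switch (outcome.kind) {
    case "ok":
      return "use";
    case "empty":
      return "retry"; // try again with rendering enabled
    case "timeout":
      return "retry";
    case "blocked":
      return "route-elsewhere"; // or alert, depending on the workflow
  }
}
```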
Do A Provider Swap Test
If you are migrating from an existing provider or evaluating a Firecrawl alternative, do not rewrite the whole integration first.
Put the provider behind a small adapter.
The adapter should normalize only the fields your app actually uses.
```ts
type ExtractedPage = {
  url: string;
  title?: string;
  markdown: string;
  metadata?: Record<string, unknown>;
};
```

Then run the same URLs through both providers.
| Metric | Provider A | Provider B |
|---|---|---|
| Returned useful content | yes/no | yes/no |
| Markdown size | tokens or chars | tokens or chars |
| Main content quality | low/medium/high | low/medium/high |
| Code/table preservation | yes/no | yes/no |
| Error clarity | low/medium/high | low/medium/high |
| Downstream parser success | yes/no | yes/no |
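If you want to automate part of that table, a sketch of running the same URLs through both providers behind the adapter could look like this. `scrapeWithProviderA` and `scrapeWithProviderB` are hypothetical functions that each call one provider and normalize its response into `ExtractedPage`:

```ts
// A sketch of a swap test: same URLs, two providers, one adapter type.
// The 300-character threshold is a crude stand-in for "returned useful content".
async function swapTest(
  urls: string[],
  scrapeWithProviderA: (url: string) => Promise<ExtractedPage>,
  scrapeWithProviderB: (url: string) => Promise<ExtractedPage>,
) {
  const rows = [];
  for (const url of urls) {
    const [a, b] = await Promise.all([scrapeWithProviderA(url), scrapeWithProviderB(url)]);
    rows.push({
      url,
      aChars: a.markdown.length,
      bChars: b.markdown.length,
      aHasContent: a.markdown.trim().length > 300,
      bHasContent: b.markdown.trim().length > 300,
    });
  }
  return rows;
}
```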
You are not trying to crown a universal winner.
You are trying to answer:
Which provider works better for this workflow?

For some teams, that means switching one endpoint.
For others, it means keeping two providers and routing specific URL classes differently.
That is a better outcome than arguing from marketing pages.
Where webclaw Fits
webclaw is built around the idea that extraction quality is the interface.
The useful output is not “HTML fetched successfully.”
The useful output is:
URL -> clean markdown / JSON / metadata -> agent, RAG pipeline, monitor, or script

That is why webclaw exposes:

- `scrape` for one page
- `crawl` for site ingestion
- `map` for URL discovery
- `batch` for lists of URLs
- `extract` for structured JSON
- `summarize` for quick page understanding
- `diff` for monitoring changes
- `brand` for identity extraction
- `/v2` endpoints for migration tests

If you are already evaluating Firecrawl-shaped APIs, start with the migration checklist:
Migrating from Firecrawl: compatible API for AI agents
If you are building the RAG side, this connects directly to:
Build a RAG pipeline with live web data
And if the output goes into Claude Code, Cursor, or another MCP client:
MCP web scraping for Claude Code and Cursor
The Practical Checklist
Before picking a scraping API, run this:
| Step | Done |
|---|---|
| Pick 10-20 real URLs from your workflow | |
| Include docs, pricing, changelog, product, and flaky pages | |
| Compare markdown, not just status code | |
| Check title, URL, headings, links, code blocks, and tables | |
| Measure output size and boilerplate level | |
| Test crawl separately from scrape | |
| Test error behavior on blocked, empty, and slow pages | |
| Run output through the actual agent, RAG, parser, or monitor | |
| Keep the provider behind an adapter until you are confident | |
That is the evaluation.
Not the landing page.
Not the benchmark table.
Not example.com.
The only thing that matters is whether the API returns clean, useful context from the pages your product actually needs.
FAQ
What is the best way to evaluate a web scraping API?
Test it on the URLs your product actually depends on. Include docs pages, pricing pages, changelogs, JavaScript-heavy pages, and known flaky URLs. Then inspect the markdown, metadata, errors, and downstream parser or agent behavior. Do not stop at HTTP status.
What should I test in a scraping API for AI agents?
For AI agents, test whether the API returns clean context with source URL, title, headings, links, code blocks, and useful metadata. Also check whether it detects empty pages, blocked pages, and login walls instead of returning them as successful content.
How is RAG web scraping different from normal scraping?
RAG web scraping needs clean, chunkable, source-linked content. The output should preserve structure and remove boilerplate because it will be embedded, retrieved, and passed into an LLM. Raw HTML or noisy plain text usually hurts retrieval quality.
Should I test crawl and scrape separately?
Yes. Scrape tests one-page extraction. Crawl tests URL discovery, boundaries, deduplication, depth, pagination, and consistency across many pages. A provider can be good at scrape and still weak for crawl-based docs ingestion.
Is webclaw a Firecrawl alternative?
webclaw can be tested as a Firecrawl alternative because it exposes Firecrawl-compatible /v2 scrape, crawl, map, and search endpoints. The safest path is to run the same real URLs through both providers and compare output quality, token size, error clarity, and downstream success.
Website: webclaw.io
GitHub: 0xMassi/webclaw