How to evaluate web scraping APIs for AI agents
Most web scraping API evaluations start with the wrong URL.
https://example.com

It is fast. It is stable. It has clean HTML. It has no JavaScript app shell, no pricing table, no docs sidebar, no cookie banner, no bot protection, no weird markdown edge cases, and no downstream parser waiting to break.
That makes it useful for checking if an API key works.
It makes it almost useless for deciding if a web scraping API belongs in your product.
If you are building an AI agent, a RAG pipeline, a competitor monitor, or a research workflow, the question is not:
Can this API scrape a page?

The real question is:

Can this API return useful context from the pages my workflow actually depends on?

Those are very different tests.
I am building webclaw, a web extraction API, CLI, and MCP server for AI agents. The more I talk with teams testing scraping providers, the more I see the same mistake: they compare tools on toy pages, then discover the real failures only after wiring the tool into an agent, RAG ingestion job, or production data pipeline.
This is how I would evaluate a web scraping API before trusting it.
This post continues the provider-evaluation cluster after Migrating from Firecrawl: compatible API for AI agents. The goal is simple: test scraping APIs like infrastructure, not like a landing-page demo.
Start With A Real URL Set
Do not start with the homepage of a famous company.
Do not start with a static demo page.
Start with 10 to 20 URLs that represent your actual workflow. This is the fastest way to evaluate a scraping API for AI agents because agents do not browse the average web page. They hit docs, pricing pages, changelogs, search results, and weird edge cases.
| URL type | Why it matters |
|---|---|
| Documentation page | Tests headings, code blocks, tables, sidebars, and internal links. |
| Changelog page | Tests date structure, repeated entries, and incremental monitoring. |
| Pricing page | Tests tables, plan names, feature lists, and layout-heavy content. |
| Product page | Tests messy marketing pages, images, specs, and variant data. |
| Blog article | Tests main-content extraction and boilerplate removal. |
| Search results page | Tests dynamic content and anti-automation behavior. |
| JavaScript-heavy page | Tests whether the initial HTML is enough or rendering is needed. |
| Previously flaky URL | Tests the failure mode you already know exists. |
The best benchmark is not broad.
It is representative.
If your product monitors competitor pricing pages, test pricing pages.
If your agent reads docs, test docs.
If your RAG pipeline ingests help centers, test help centers.
That sounds obvious. It is also where most evaluations get lazy.
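If it helps, the set can live next to your tests as data. A minimal TypeScript sketch, where the URLs and categories are placeholders to swap for the pages your product actually depends on:

```ts
// A minimal sketch of a representative evaluation set.
// The URLs and categories are placeholders, not recommendations.
type EvalUrl = {
  url: string;
  kind: "docs" | "changelog" | "pricing" | "product" | "blog" | "search" | "js-heavy" | "flaky";
  note?: string; // why this URL is in the set
};

const evalSet: EvalUrl[] = [
  { url: "https://example.com/docs/getting-started", kind: "docs" },
  { url: "https://example.com/changelog", kind: "changelog" },
  { url: "https://example.com/pricing", kind: "pricing" },
  { url: "https://example.com/app", kind: "js-heavy", note: "initial HTML is an empty shell" },
  { url: "https://example.com/search?q=widgets", kind: "flaky", note: "has failed before" },
];
```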
A 200 Is Not Success
Web scraping APIs make it too easy to treat HTTP status as the result.
```json
{
  "success": true,
  "status": 200
}
```

That can still be a failure.
For AI workflows, these are common false positives:
| Failure | What it looks like |
|---|---|
| Empty app shell | The response contains header/nav text, but no real page body. |
| Challenge page | The API returns an anti-bot page as if it were content. |
| Login wall | The markdown describes a sign-in page instead of the requested page. |
| Boilerplate flood | The useful content is buried under nav, footer, cookie, and promo text. |
| Broken code blocks | Docs pages lose formatting and become useless for developer agents. |
| Flattened tables | Pricing or comparison data loses row/column meaning. |
| Missing source metadata | Your downstream answer has no reliable URL, title, or timestamp. |
For LLM apps, a clean-looking wrong page is worse than an error.
An error stops the workflow.
Bad context poisons the workflow.
The agent summarizes a block page. The retriever embeds repeated nav text. The monitor reports no change because it never saw the real page.
That is why your evaluation needs to inspect output quality, not just status.
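A cheap way to catch these is a sanity check that runs on every result before anything downstream touches it. This is a rough sketch; the length threshold and phrase list are assumptions you would tune against your own URL set:

```ts
// A rough sketch of post-scrape sanity checks.
// The threshold and phrases are illustrative assumptions, not a complete detector.
type ScrapeResult = { status: number; markdown: string };

function looksLikeRealContent(result: ScrapeResult): boolean {
  const text = result.markdown.trim();
  if (result.status !== 200) return false;
  if (text.length < 300) return false; // likely an empty app shell

  const suspicious = [
    "checking your browser",
    "verify you are human",
    "enable javascript",
    "sign in to continue",
  ];
  const lower = text.toLowerCase();
  // Challenge pages and login walls often come back as "successful" markdown.
  if (suspicious.some((phrase) => lower.includes(phrase))) return false;

  return true;
}
```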
Compare The Output Shape
When testing providers, put the outputs side by side.
Not in a vibes-based way.
Use a checklist.
| Check | What to look for |
|---|---|
| Title | Is it the real page title, not a generic site title? |
| URL | Is the final URL preserved after redirects? |
| Headings | Are page sections represented clearly? |
| Main content | Is the actual article/docs/pricing content present? |
| Boilerplate | Are nav, footer, cookie banners, and repeated sidebars removed? |
| Code blocks | Are code samples preserved with formatting? |
| Tables | Are rows and columns understandable in text? |
| Links | Are important links preserved? |
| Metadata | Do you get useful title, description, language, and timing fields? |
| Error behavior | Does the API clearly report blocks, timeouts, and empty pages? |
The point is not to find the prettiest markdown.
The point is to find the output that survives your downstream workflow.
If the result goes into an agent, paste it into the actual agent prompt path.
If it goes into RAG, chunk it and inspect retrieval.
If it goes into a monitor, diff it against a later run.
The consumer decides whether the extraction is good.
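It also helps to record the checklist as data instead of impressions. A minimal scorecard sketch, with field names that are illustrative rather than any provider's schema:

```ts
// A sketch of a per-URL scorecard so side-by-side comparison is recorded, not remembered.
type OutputChecks = {
  realTitle: boolean;
  finalUrlPreserved: boolean;
  headingsPresent: boolean;
  mainContentPresent: boolean;
  boilerplateRemoved: boolean;
  codeBlocksIntact: boolean;
  tablesReadable: boolean;
  downstreamParserOk: boolean;
};

type ScorecardRow = { url: string; provider: string; checks: OutputChecks };

// Crude summary: fraction of all checks that passed for a provider's rows.
function passRate(rows: ScorecardRow[]): number {
  const flags = rows.flatMap((row) => Object.values(row.checks));
  return flags.filter(Boolean).length / flags.length;
}
```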
Measure Token Waste
For AI products, token size is not a cosmetic detail. If you are comparing website-to-markdown APIs for LLMs, output size and content quality should be part of the test.
It affects cost, latency, context quality, and retrieval quality.
There are three outputs you should compare:
| Output | Problem |
|---|---|
| Raw HTML | Huge, noisy, full of markup and scripts. |
| Plain text | Smaller, but often loses structure. |
| Clean markdown / LLM format | Keeps useful structure while cutting noise. |
A scraping API that returns raw HTML quickly is still pushing work downstream.
Your app now has to clean it.
Your LLM now has to ignore it.
Your vector database now has to embed it.
That is not free.
For each test URL, record:
- raw HTML size
- markdown size
- LLM-context size
- useful content present: yes/no
- boilerplate level: low/medium/high

Do not optimize only for the smallest output.
The smallest output can be wrong.
Optimize for the smallest output that still preserves the content your workflow needs.
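A small helper makes the recording concrete. The characters-per-token ratio below is a rough heuristic; use your model's tokenizer when the numbers feed real cost decisions:

```ts
// A sketch for recording output size per URL.
// The 4-characters-per-token ratio is a rough heuristic, not a real tokenizer.
type SizeRecord = {
  url: string;
  rawHtmlChars: number;
  markdownChars: number;
  approxMarkdownTokens: number;
  usefulContentPresent: boolean;          // filled in by a human or a content check
  boilerplate: "low" | "medium" | "high"; // filled in by a human
};

function measure(url: string, rawHtml: string, markdown: string) {
  return {
    url,
    rawHtmlChars: rawHtml.length,
    markdownChars: markdown.length,
    approxMarkdownTokens: Math.ceil(markdown.length / 4), // swap in a real tokenizer if needed
  };
}
```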
Test Crawl Separately From Scrape
Scrape and crawl are different products. A scraping API can be good at one-page extraction and still be weak for RAG web crawling.
Scrape answers:
Can you extract this one page?

Crawl answers:

Can you discover the right pages, stay inside boundaries, extract each page, and return a usable collection?

That adds new failure modes.
| Crawl concern | What can go wrong |
|---|---|
| Discovery | It misses pages that matter. |
| Boundaries | It wanders into irrelevant pages. |
| Deduplication | It extracts the same content many times. |
| Depth | It stops before reaching useful docs. |
| Pagination | It misses list/detail pages. |
| Status polling | Jobs are hard to debug or recover. |
| Output consistency | Pages come back in mixed formats or quality levels. |
For docs ingestion and RAG, crawl quality often matters more than single-page quality.
You do not want “more pages.”
You want the right pages.
Start with a tiny crawl:
- start URL: docs home
- max pages: 10-25
- depth: 1-2
- format: markdown or LLM-ready

Then ask:
| Question | Why |
|---|---|
| Did it find the pages a human would click first? | Discovery quality. |
| Did it avoid login, legal, footer, and duplicate pages? | Boundary quality. |
| Is each page clean enough to embed or summarize? | Extraction quality. |
| Can I map each answer back to a source URL? | Citation quality. |
This is the test most teams skip until the week they need to ingest a whole site.
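To make the tiny crawl concrete, here is a sketch of a bounded crawl request and a naive runner. The endpoint path and parameter names are illustrative; check your provider's docs for the real fields:

```ts
// A sketch of a tiny, bounded crawl request.
// Endpoint path and parameter names are illustrative, not a specific provider's API.
const tinyCrawl = {
  url: "https://example.com/docs",
  maxPages: 20,
  maxDepth: 2,
  format: "markdown",
};

async function runTinyCrawl(apiBase: string, apiKey: string) {
  const res = await fetch(`${apiBase}/crawl`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify(tinyCrawl),
  });
  if (!res.ok) throw new Error(`crawl request failed: ${res.status}`);
  // Inspect page count, discovered URLs, and markdown quality by hand.
  return res.json();
}
```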
Include Your Worst URLs
Every team has a few URLs they hate.
The page that randomly fails.
The docs site with weird sidebar navigation.
The pricing page with layout-heavy cards.
The competitor site that sometimes returns a block.
The JavaScript app where the initial HTML is just:
```html
<div id="root"></div>
```

Put those URLs in the evaluation set.
Do not hide them because they make the benchmark messy.
They are the benchmark.
If a provider works beautifully on easy pages and fails on the three pages your product depends on, it is not a good fit for your product.
Test Error Behavior
A good scraping API should fail in a way your app can use.
Bad error behavior looks like this:
```text
200 OK
markdown: "Checking your browser..."
```

Or this:

```text
timeout
```

with no clue what timed out.
Useful error behavior tells you what happened:
| Error shape | Why it helps |
|---|---|
| Block detected | You can retry, route, or alert. |
| Empty content detected | You know rendering or another path may be needed. |
| Timeout type | You can distinguish connect, fetch, render, and extraction failures. |
| Source URL preserved | You can debug the exact page. |
| Partial crawl results | You can keep useful pages instead of losing the whole job. |
For agents, this matters even more.
An agent can recover from a typed failure.
It cannot recover from a lie.
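Concretely, the agent needs something it can branch on. Here is a sketch of a typed outcome that makes recovery possible; the failure taxonomy is an assumption, not any specific provider's schema:

```ts
// A sketch of a typed result an agent can branch on.
// The point is that "blocked" and "empty" are distinguishable from success.
type ScrapeOutcome =
  | { kind: "ok"; url: string; markdown: string }
  | { kind: "blocked"; url: string; detail?: string }
  | { kind: "empty"; url: string }
  | { kind: "timeout"; url: string; stage: "connect" | "fetch" | "render" | "extract" };

function nextAction(outcome: ScrapeOutcome): "use" | "retry" | "route-elsewhere" | "alert" {
  switch (outcome.kind) {
    case "ok":
      return "use";
    case "empty":
      return "retry"; // try again with rendering enabled
    case "timeout":
      return "retry";
    case "blocked":
      return "route-elsewhere"; // or alert, depending on the workflow
  }
}
```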
Do A Provider Swap Test
If you are migrating from an existing provider or evaluating a Firecrawl alternative, do not rewrite the whole integration first.
Put the provider behind a small adapter.
The adapter should normalize only the fields your app actually uses.
```ts
type ExtractedPage = {
  url: string;
  title?: string;
  markdown: string;
  metadata?: Record<string, unknown>;
};
```

Then run the same URLs through both providers.
| Metric | Provider A | Provider B |
|---|---|---|
| Returned useful content | yes/no | yes/no |
| Markdown size | tokens or chars | tokens or chars |
| Main content quality | low/medium/high | low/medium/high |
| Code/table preservation | yes/no | yes/no |
| Error clarity | low/medium/high | low/medium/high |
| Downstream parser success | yes/no | yes/no |
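If you want to automate part of that table, a sketch of running the same URLs through both providers behind the adapter could look like this. `scrapeWithProviderA` and `scrapeWithProviderB` are hypothetical functions that each call one provider and normalize its response into `ExtractedPage`:

```ts
// A sketch of a swap test: same URLs, two providers, one adapter type.
// The 300-character threshold is a crude stand-in for "returned useful content".
async function swapTest(
  urls: string[],
  scrapeWithProviderA: (url: string) => Promise<ExtractedPage>,
  scrapeWithProviderB: (url: string) => Promise<ExtractedPage>,
) {
  const rows = [];
  for (const url of urls) {
    const [a, b] = await Promise.all([scrapeWithProviderA(url), scrapeWithProviderB(url)]);
    rows.push({
      url,
      aChars: a.markdown.length,
      bChars: b.markdown.length,
      aHasContent: a.markdown.trim().length > 300,
      bHasContent: b.markdown.trim().length > 300,
    });
  }
  return rows;
}
```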
You are not trying to crown a universal winner.
You are trying to answer:
Which provider works better for this workflow?

For some teams, that means switching one endpoint.
For others, it means keeping two providers and routing specific URL classes differently.
That is a better outcome than arguing from marketing pages.
Where webclaw Fits
webclaw is built around the idea that extraction quality is the interface.
The useful output is not “HTML fetched successfully.”
The useful output is:
URL -> clean markdown / JSON / metadata -> agent, RAG pipeline, monitor, or script

That is why webclaw exposes:

- `scrape` for one page
- `crawl` for site ingestion
- `map` for URL discovery
- `batch` for lists of URLs
- `extract` for structured JSON
- `summarize` for quick page understanding
- `diff` for monitoring changes
- `brand` for identity extraction
- `/v2` endpoints for migration tests

If you are already evaluating Firecrawl-shaped APIs, start with the migration checklist:
Migrating from Firecrawl: compatible API for AI agents
If you are building the RAG side, this connects directly to:
Build a RAG pipeline with live web data
And if the output goes into Claude Code, Cursor, or another MCP client:
MCP web scraping for Claude Code and Cursor
The Practical Checklist
Before picking a scraping API, run this:
| Step | Done |
|---|---|
| Pick 10-20 real URLs from your workflow | |
| Include docs, pricing, changelog, product, and flaky pages | |
| Compare markdown, not just status code | |
| Check title, URL, headings, links, code blocks, and tables | |
| Measure output size and boilerplate level | |
| Test crawl separately from scrape | |
| Test error behavior on blocked, empty, and slow pages | |
| Run output through the actual agent, RAG, parser, or monitor | |
| Keep the provider behind an adapter until you are confident | |
That is the evaluation.
Not the landing page.
Not the benchmark table.
Not example.com.
The only thing that matters is whether the API returns clean, useful context from the pages your product actually needs.
FAQ
What is the best way to evaluate a web scraping API?
Test it on the URLs your product actually depends on. Include docs pages, pricing pages, changelogs, JavaScript-heavy pages, and known flaky URLs. Then inspect the markdown, metadata, errors, and downstream parser or agent behavior. Do not stop at HTTP status.
What should I test in a scraping API for AI agents?
For AI agents, test whether the API returns clean context with source URL, title, headings, links, code blocks, and useful metadata. Also check whether it detects empty pages, blocked pages, and login walls instead of returning them as successful content.
How is RAG web scraping different from normal scraping?
RAG web scraping needs clean, chunkable, source-linked content. The output should preserve structure and remove boilerplate because it will be embedded, retrieved, and passed into an LLM. Raw HTML or noisy plain text usually hurts retrieval quality.
Should I test crawl and scrape separately?
Yes. Scrape tests one-page extraction. Crawl tests URL discovery, boundaries, deduplication, depth, pagination, and consistency across many pages. A provider can be good at scrape and still weak for crawl-based docs ingestion.
Is webclaw a Firecrawl alternative?
webclaw can be tested as a Firecrawl alternative because it exposes Firecrawl-compatible /v2 scrape, crawl, map, and search endpoints. The safest path is to run the same real URLs through both providers and compare output quality, token size, error clarity, and downstream success.
Website: webclaw.io
GitHub: 0xMassi/webclaw