POST /v1/crawl

Crawl whole sites, page by page.

Point it at one URL and get clean markdown for every page on the site.

Feed your LLM or agent a whole knowledge base in one job. Start an async crawl from a single URL, follow same-origin links within the depth and page limits you set, and poll for clean markdown on every page. The crawler is built in Rust and handles JS rendering and bot protection automatically.


What you get

Everything in one call.

Every page as markdown

Each crawled page comes back as clean markdown with title and word count, stripped of nav, ads, and boilerplate.

Depth and page limits

Set max_depth and max_pages to control exactly how far the crawler walks and how many pages it pulls.

Sitemap seeding

Turn on use_sitemap to seed the queue with sitemap URLs and reach pages that links alone never expose.

Async with polling

Start a job, get a UUID back, and poll until it completes. You never have to hold a request open.

How it works

From URL to output in four steps.

Post a start URL

Send one URL with optional max_depth, max_pages, and use_sitemap, and get a crawl job ID in return.
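As a rough sketch, the start request could be built like this with Python's standard library. Only the /v1/crawl path and the max_depth, max_pages, and use_sitemap parameters come from the description above. The base URL, the "url" body field, and the Bearer auth header are assumptions made for illustration, so check the API docs for the real values.

```python
import json
import urllib.request

# Hypothetical base URL; substitute the real API host.
API_BASE = "https://api.example.com"


def build_crawl_request(start_url, api_key, max_depth=None,
                        max_pages=None, use_sitemap=None):
    """Build (but do not send) a POST /v1/crawl request."""
    body = {"url": start_url}  # the field name "url" is an assumption
    # Include only the optional parameters the caller set, so the
    # service's own defaults apply to everything else.
    optional = (
        ("max_depth", max_depth),
        ("max_pages", max_pages),
        ("use_sitemap", use_sitemap),
    )
    for name, value in optional:
        if value is not None:
            body[name] = value
    return urllib.request.Request(
        f"{API_BASE}/v1/crawl",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # auth scheme is an assumption
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would send it. The shape of the response beyond "a crawl job ID" is not specified here, so the sketch stops at building the request.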

Same-origin traversal

The crawler walks links breadth-first, staying on the same origin until it hits your depth or page limit.
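To illustrate how depth and page limits bound a breadth-first, same-origin walk, here is a small in-memory sketch. It is not the crawler's actual implementation: get_links is a hypothetical callback that returns the links found on a page, the start URL is treated as depth 0, and origins are compared by scheme and host:port only. The real service may count depth or normalize URLs differently.

```python
from collections import deque
from urllib.parse import urlsplit


def same_origin(a, b):
    """True when two URLs share the same scheme and netloc (host:port)."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.netloc) == (pb.scheme, pb.netloc)


def crawl_order(start_url, get_links, max_depth=2, max_pages=50):
    """Return URLs in the order a bounded breadth-first crawl would reach them."""
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth); the start URL is depth 0
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for link in get_links(url):
            if link in seen or not same_origin(start_url, link):
                continue  # skip repeats and off-origin links
            if len(visited) >= max_pages:
                return visited  # page budget exhausted
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited
```

Because the queue is first-in, first-out, every page at depth 1 is reached before any page at depth 2, so a tight max_pages budget is spent on the pages closest to the start URL.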

Clean each page

Every page is rendered if needed and converted to markdown, which uses around 90% fewer tokens than the raw HTML.

Poll for results

Poll GET /v1/crawl/{id} until status is completed, then read the full pages array of markdown.
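A polling loop along these lines is one way to wait for the result. The fetch_status callback is hypothetical: it stands in for whatever HTTP client you use to call GET /v1/crawl/{id} and parse the JSON. The "status" and "pages" keys follow the wording above, but their exact names, and any failure statuses, are assumptions to verify against the API docs.

```python
import time


def poll_crawl(fetch_status, job_id, interval=2.0, timeout=300.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reports status "completed", then return its pages."""
    deadline = clock() + timeout
    while True:
        # fetch_status is expected to GET /v1/crawl/{id} and return a dict.
        job = fetch_status(job_id)
        if job.get("status") == "completed":
            return job.get("pages", [])
        if clock() >= deadline:
            raise TimeoutError(
                f"crawl {job_id} not completed after {timeout} seconds")
        sleep(interval)  # wait before asking again
```

The sleep and clock parameters exist only so the loop can be tested without real waiting. This sketch watches for "completed" alone, so a job that ends in some other terminal state would run until the timeout.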

API

One request, structured back.

POST /v1/crawl

Crawl of webclaw.io

4 pages crawled

webclaw: Web Scraping API for LLMs and AI Agents
webclaw Docs: Scrape, Crawl, Extract and Use MCP
webclaw Pricing: Starter, Growth, Pro, Scale

Common questions

Frequently asked questions

How do I crawl an entire website with an API?

Send one POST to /v1/crawl with a start URL. The crawler follows same-origin links breadth-first and returns clean markdown for every page it reaches. Set max_depth and max_pages to bound how far it goes.

Is the crawl synchronous or do I have to poll?

It is async. Starting a crawl returns a job UUID and a status, then you poll GET /v1/crawl/{id} until the status flips to completed and the full pages array is ready.

How do I limit crawl depth and the number of pages?

Pass max_depth and max_pages in the request body. They default to 2 and 50, so the crawler stays bounded unless you raise them.

Why are some pages missing from my crawl results?

Pages only reachable through forms or scripts may not show up via link traversal. Set use_sitemap to true to seed the queue with sitemap URLs and reach pages that links alone never expose.
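Conceptually, sitemap seeding means the queue starts with more than the one URL you posted. The sketch below shows the idea using the standard sitemaps.org XML format and Python's standard library. It illustrates the concept under those assumptions and is not a description of how the service implements use_sitemap. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Extract the <loc> values from a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]


def seed_queue(start_url, xml_text):
    """Return the start URL followed by sitemap URLs, without duplicates."""
    queue = [start_url]
    for url in sitemap_urls(xml_text):
        if url not in queue:
            queue.append(url)
    return queue
```

A page that no other page links to can still appear in the sitemap, which is why seeding reaches URLs that link traversal alone would miss.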

Am I billed for failed requests?

No. Credits are only consumed on successful responses. A standard page is 1 credit; heavier work like JS rendering or protected-site access costs a few extra credits.

Ship an agent that actually sees the web.

One credit pool, every endpoint. Cancel anytime, or self-host the open-source core for free.


Every endpoint

Web Scraping API, HTML to Markdown API, Sitemap API, Web Search API, Batch Scraping API, AI Web Extraction API, Webpage Summarization API, Website Change Monitoring API, Brand Data API, Deep Research API, YouTube Transcript API, Lead Enrichment API