webclaw

Crawl

Crawl an entire site using breadth-first (BFS) link traversal. Crawls run asynchronously: start one, then poll until it completes.

Start a crawl

`POST /v1/crawl`

Start an async same-origin crawl from the given URL.

Request body

```json
{
  "url": "https://docs.example.com",
  "max_depth": 2,
  "max_pages": 50,
  "use_sitemap": true
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | Yes | Starting URL. The crawl stays within this origin. |
| `max_depth` | number | No | Maximum link depth to follow. Default: `2`. |
| `max_pages` | number | No | Maximum number of pages to extract. Default: `50`. |
| `use_sitemap` | boolean | No | Seed the crawl queue with URLs from the site's sitemap. Default: `false`. |

Response

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running"
}
```
Note
The crawl ID is a UUID. Store it -- you need it to poll for results.
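Starting a crawl from code is a one-call wrapper around this endpoint. Below is a minimal sketch using only the Python standard library; the base URL and bearer-token header follow the curl example later in this page, and the injectable `opener` parameter is an illustrative choice so the helper can be exercised without network access.

```python
import json
from urllib import request as urlreq

API_BASE = "https://api.webclaw.io"  # taken from the curl example; adjust if needed

def start_crawl(api_key, url, max_depth=2, max_pages=50, use_sitemap=False,
                opener=None):
    """POST /v1/crawl and return the crawl job ID (a UUID string).

    `opener` defaults to urllib's urlopen; it is injectable purely so the
    function can be tested with a stubbed response.
    """
    body = json.dumps({
        "url": url,
        "max_depth": max_depth,
        "max_pages": max_pages,
        "use_sitemap": use_sitemap,
    }).encode("utf-8")
    req = urlreq.Request(
        f"{API_BASE}/v1/crawl",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with (opener or urlreq.urlopen)(req) as resp:
        return json.load(resp)["id"]
```

Store the returned ID; it is the path parameter for the status endpoint below.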

Poll crawl status

`GET /v1/crawl/{id}`

Get the current status and results of a running or completed crawl.

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `id` | string | The crawl job UUID returned by `POST /v1/crawl`. |

Response

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "pages": [
    {
      "url": "https://docs.example.com",
      "markdown": "# Getting Started\n\nWelcome to the documentation...",
      "metadata": {
        "title": "Getting Started",
        "word_count": 842
      }
    },
    {
      "url": "https://docs.example.com/guides/setup",
      "markdown": "# Setup Guide\n\nFollow these steps...",
      "metadata": {
        "title": "Setup Guide",
        "word_count": 1203
      }
    }
  ],
  "total": 50,
  "completed": 48,
  "errors": 2,
  "created_at": "2026-03-12T10:30:00Z"
}
```
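Once a crawl reports `completed`, the `pages` array can be iterated directly. A minimal sketch (the `summarize` helper name is illustrative, not part of any SDK):

```python
def summarize(result):
    """Return (url, title, word_count) tuples for each extracted page
    of a crawl status response shaped like the one above."""
    return [
        (page["url"], page["metadata"]["title"], page["metadata"]["word_count"])
        for page in result.get("pages", [])
    ]
```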

Status values

| Status | Description |
| --- | --- |
| `running` | Crawl is in progress. Poll again for updates. |
| `completed` | Crawl finished. All results are available. |
| `failed` | Crawl encountered a fatal error and stopped. |
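The status values above imply a simple polling loop: keep fetching while the job is `running`, and stop on `completed` or `failed`. A sketch with an injected `get_status` callable (e.g. a wrapper around `GET /v1/crawl/{id}`) so the loop itself is network-agnostic; the interval and attempt cap are illustrative defaults, not documented limits.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}

def poll_until_done(get_status, crawl_id, interval=2.0, max_attempts=60):
    """Call get_status(crawl_id) until the job reaches a terminal status,
    sleeping `interval` seconds between attempts.

    Returns the final status payload, or raises TimeoutError if the
    attempt cap is exhausted while the crawl is still running.
    """
    for _ in range(max_attempts):
        job = get_status(crawl_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        time.sleep(interval)
    raise TimeoutError(f"crawl {crawl_id} still running after {max_attempts} polls")
```

A fixed interval is the simplest choice; for long crawls, exponential backoff reduces request volume at the cost of slower completion detection.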

Example

Start a crawl:

```shell
curl -X POST https://api.webclaw.io/v1/crawl \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.stripe.com",
    "max_depth": 2,
    "max_pages": 100,
    "use_sitemap": true
  }'
```
Poll for results:

```shell
curl https://api.webclaw.io/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer wc_your_api_key"
```
Tip
Enable `use_sitemap` to seed the crawl queue with sitemap URLs. This helps discover pages that aren't reachable through link traversal alone.