webclaw

Crawl

Crawl an entire site using breadth-first (BFS) link traversal. Crawls run asynchronously: start one, then poll until it completes.

Start a crawl

`POST /v1/crawl`

Start an async same-origin crawl from the given URL.

Request body

```json
{
  "url": "https://docs.example.com",
  "max_depth": 2,
  "max_pages": 50,
  "use_sitemap": true
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | Yes | Starting URL. The crawl stays within this origin. |
| `max_depth` | number | No | Maximum link depth to follow. Default: `2`. |
| `max_pages` | number | No | Maximum number of pages to extract. Default: `50`. |
| `use_sitemap` | boolean | No | Seed the crawl queue with URLs from the site's sitemap. Default: `false`. |

Response

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "running"
}
```
Note
The crawl ID is a UUID. Store it -- you need it to poll for results.
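Starting a crawl from code is a one-call wrapper around this endpoint. Below is a minimal sketch using only the Python standard library; the base URL and bearer-token header follow the curl example later in this page, and the injectable `opener` parameter is an illustrative choice so the helper can be exercised without network access.

```python
import json
from urllib import request as urlreq

API_BASE = "https://api.webclaw.io"  # taken from the curl example; adjust if needed

def start_crawl(api_key, url, max_depth=2, max_pages=50, use_sitemap=False,
                opener=None):
    """POST /v1/crawl and return the crawl job ID (a UUID string).

    `opener` defaults to urllib's urlopen; it is injectable purely so the
    function can be tested with a stubbed response.
    """
    body = json.dumps({
        "url": url,
        "max_depth": max_depth,
        "max_pages": max_pages,
        "use_sitemap": use_sitemap,
    }).encode("utf-8")
    req = urlreq.Request(
        f"{API_BASE}/v1/crawl",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with (opener or urlreq.urlopen)(req) as resp:
        return json.load(resp)["id"]
```

Store the returned ID; it is the path parameter for the status endpoint below.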

Poll crawl status

`GET /v1/crawl/{id}`

Get the current status and results of a running or completed crawl.

Path parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `id` | string | The crawl job UUID returned by `POST /v1/crawl`. |

Response

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "pages": [
    {
      "url": "https://docs.example.com",
      "markdown": "# Getting Started\n\nWelcome to the documentation...",
      "metadata": {
        "title": "Getting Started",
        "word_count": 842
      }
    },
    {
      "url": "https://docs.example.com/guides/setup",
      "markdown": "# Setup Guide\n\nFollow these steps...",
      "metadata": {
        "title": "Setup Guide",
        "word_count": 1203
      }
    }
  ],
  "total": 50,
  "completed": 48,
  "errors": 2,
  "created_at": "2026-03-12T10:30:00Z"
}
```
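Once a crawl reports `completed`, the `pages` array can be iterated directly. A minimal sketch (the `summarize` helper name is illustrative, not part of any SDK):

```python
def summarize(result):
    """Return (url, title, word_count) tuples for each extracted page
    of a crawl status response shaped like the one above."""
    return [
        (page["url"], page["metadata"]["title"], page["metadata"]["word_count"])
        for page in result.get("pages", [])
    ]
```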

Status values

| Status | Description |
| --- | --- |
| `running` | Crawl is in progress. Poll again for updates. |
| `completed` | Crawl finished. All results are available. |
| `failed` | Crawl encountered a fatal error and stopped. |
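The status values above imply a simple polling loop: keep fetching while the job is `running`, and stop on `completed` or `failed`. A sketch with an injected `get_status` callable (e.g. a wrapper around `GET /v1/crawl/{id}`) so the loop itself is network-agnostic; the interval and attempt cap are illustrative defaults, not documented limits.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}

def poll_until_done(get_status, crawl_id, interval=2.0, max_attempts=60):
    """Call get_status(crawl_id) until the job reaches a terminal status,
    sleeping `interval` seconds between attempts.

    Returns the final status payload, or raises TimeoutError if the
    attempt cap is exhausted while the crawl is still running.
    """
    for _ in range(max_attempts):
        job = get_status(crawl_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        time.sleep(interval)
    raise TimeoutError(f"crawl {crawl_id} still running after {max_attempts} polls")
```

A fixed interval is the simplest choice; for long crawls, exponential backoff reduces request volume at the cost of slower completion detection.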

Example

Start a crawl:

```shell
curl -X POST https://api.webclaw.io/v1/crawl \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.stripe.com",
    "max_depth": 2,
    "max_pages": 100,
    "use_sitemap": true
  }'
```
Poll for results:

```shell
curl https://api.webclaw.io/v1/crawl/550e8400-e29b-41d4-a716-446655440000 \
  -H "Authorization: Bearer wc_your_api_key"
```
Tip
Enable `use_sitemap` to seed the crawl queue with sitemap URLs. This helps discover pages that aren't reachable through link traversal alone.