Scrape
Extract content from a single URL. This is the core endpoint -- one URL in, clean structured content out.
POST
/v1/scrapeExtract content from a single URL in one or more output formats.
Request body
json
{
"url": "https://example.com",
"formats": ["markdown", "llm", "text", "json"],
"include_selectors": [".article-content"],
"exclude_selectors": ["nav", ".sidebar"],
"only_main_content": true
}Parameters
| Field | Type | Required | Description |
|---|---|---|---|
url | string | Yes | The URL to scrape. |
formats | string[] | No | Output formats to include. Options: markdown, llm, text, json. Defaults to ["markdown"]. |
include_selectors | string[] | No | CSS selectors to extract exclusively. Only content matching these selectors will be included. |
exclude_selectors | string[] | No | CSS selectors to remove from the page before extraction. |
only_main_content | boolean | No | When true, extracts only the main article or content element, ignoring sidebars, headers, and footers. |
Tip
The
llm format runs a 9-step optimization pipeline that strips images, collapses whitespace, deduplicates links, and reduces token count by ~90% compared to raw HTML. Use it when feeding content to language models.Response
The response includes the requested formats alongside extracted metadata. Only the formats you request are populated.
json
{
"url": "https://example.com",
"metadata": {
"title": "Example Domain",
"description": "This domain is for use in illustrative examples.",
"author": null,
"published_date": null,
"language": "en",
"site_name": "Example",
"image": null,
"favicon": "https://example.com/favicon.ico",
"word_count": 1234
},
"markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
"text": "Example Domain\n\nThis domain is for use in illustrative examples...",
"llm": "> URL: https://example.com\n> Title: Example Domain\n\nThis domain is for use in illustrative examples...\n\n## Links\n- ...",
"extraction": { ... }
}Metadata fields
| Field | Type | Description |
|---|---|---|
title | string | Page title from OG, meta, or title tag. |
description | string | Page description from meta or OG tags. |
author | string? | Author name if detected. |
published_date | string? | Publication date if found in metadata. |
language | string? | Page language code (e.g. "en"). |
site_name | string? | Site name from OG metadata. |
image | string? | Primary image URL (OG or Twitter Card). |
favicon | string? | Favicon URL. |
word_count | number | Total word count of extracted content. |
Examples
Basic extraction
curl
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://openai.com/blog/gpt-4",
"formats": ["markdown"]
}'LLM-optimized with selector filtering
curl
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://docs.stripe.com/payments/checkout",
"formats": ["llm"],
"include_selectors": [".content-container"],
"exclude_selectors": ["nav", "footer", ".sidebar"],
"only_main_content": true
}'Multiple output formats
curl
curl -X POST https://api.webclaw.io/v1/scrape \
-H "Authorization: Bearer wc_your_api_key" \
-H "Content-Type: application/json" \
-d '{
"url": "https://en.wikipedia.org/wiki/Web_scraping",
"formats": ["markdown", "llm", "text"]
}'Note
PDFs are auto-detected via Content-Type. If the URL serves a PDF, webclaw extracts text using its PDF engine instead of HTML parsing.
Error responses
400 Bad Request
{
"error": "Missing required field: url"
}401 Unauthorized
{
"error": "Invalid or missing API key"
}422 Unprocessable
{
"error": "Failed to fetch URL: connection timeout"
}