Scrape

Extract content from a single URL. This is the core endpoint: one URL in, clean structured content out.

POST /v1/scrape

Extract content from a single URL in one or more output formats.

Request body

json

{
  "url": "https://example.com",
  "formats": ["markdown", "llm", "text", "json"],
  "include_selectors": [".article-content"],
  "exclude_selectors": ["nav", ".sidebar"],
  "only_main_content": true
}

Parameters

Field	Type	Required	Description
`url`	`string`	Yes	The URL to scrape.
`formats`	`string[]`	No	Output formats to include. Options: `markdown`, `llm`, `text`, `json`. Defaults to `["markdown"]`.
`include_selectors`	`string[]`	No	CSS selectors to extract exclusively. Only content matching these selectors will be included.
`exclude_selectors`	`string[]`	No	CSS selectors to remove from the page before extraction.
`only_main_content`	`boolean`	No	When `true`, extracts only the main article or content element, ignoring sidebars, headers, and footers.
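
As a rough illustration of the parameters above, the following Python sketch assembles a request body and rejects format names that are not in the table. The helper name `build_scrape_request` and the client-side validation are assumptions made for this example; they are not part of the webclaw API, and the server may validate differently.

```python
from typing import Any, Dict, Optional, Sequence

# Format names listed in the Parameters table.
ALLOWED_FORMATS = ("markdown", "llm", "text", "json")


def build_scrape_request(
    url: str,
    formats: Optional[Sequence[str]] = None,
    include_selectors: Optional[Sequence[str]] = None,
    exclude_selectors: Optional[Sequence[str]] = None,
    only_main_content: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable body for POST /v1/scrape (hypothetical helper)."""
    if not url:
        raise ValueError("url is required")

    body: Dict[str, Any] = {"url": url}

    if formats is not None:
        unknown = [f for f in formats if f not in ALLOWED_FORMATS]
        if unknown:
            raise ValueError(f"unsupported formats: {unknown}")
        body["formats"] = list(formats)

    # Optional fields are omitted when unset so the documented defaults apply.
    if include_selectors is not None:
        body["include_selectors"] = list(include_selectors)
    if exclude_selectors is not None:
        body["exclude_selectors"] = list(exclude_selectors)
    if only_main_content is not None:
        body["only_main_content"] = only_main_content

    return body
```

Leaving unset fields out of the body, rather than sending `null`, keeps the request minimal and avoids guessing how the server treats explicit nulls.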

Tip

The `llm` format runs a 9-step optimization pipeline that strips images, collapses whitespace, deduplicates links, and reduces token count by ~90% compared to raw HTML. Use it when feeding content to language models.

Response

The response includes the requested formats alongside extracted metadata. Only the formats you request are populated.
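
Because only requested formats are populated, client code may want to pick out whichever content fields actually came back. A minimal sketch follows, assuming that unrequested formats are either absent or `null` (the page does not say which) and that every top-level key other than `url` and `metadata` holds format output. The function name `populated_formats` is hypothetical.

```python
from typing import Any, Dict

# Top-level response keys that are not format output.
NON_FORMAT_KEYS = ("url", "metadata")


def populated_formats(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the content fields that are present and non-null."""
    return {
        key: value
        for key, value in response.items()
        if key not in NON_FORMAT_KEYS and value is not None
    }
```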

json

{
  "url": "https://example.com",
  "metadata": {
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples.",
    "author": null,
    "published_date": null,
    "language": "en",
    "site_name": "Example",
    "image": null,
    "favicon": "https://example.com/favicon.ico",
    "word_count": 1234
  },
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "text": "Example Domain\n\nThis domain is for use in illustrative examples...",
  "llm": "> URL: https://example.com\n> Title: Example Domain\n\nThis domain is for use in illustrative examples...\n\n## Links\n- ...",
  "extraction": { ... }
}

Metadata fields

Field	Type	Description
`title`	`string`	Page title from OG, meta, or title tag.
`description`	`string`	Page description from meta or OG tags.
`author`	`string?`	Author name if detected.
`published_date`	`string?`	Publication date if found in metadata.
`language`	`string?`	Page language code (e.g. "en").
`site_name`	`string?`	Site name from OG metadata.
`image`	`string?`	Primary image URL (OG or Twitter Card).
`favicon`	`string?`	Favicon URL.
`word_count`	`number`	Total word count of extracted content.
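
One way to consume the metadata fields above is to map them onto a typed structure, treating the `string?` fields as optional. The sketch below is illustrative only: the `PageMetadata` class is a hypothetical client-side type, not something webclaw ships.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class PageMetadata:
    # Documented as non-nullable.
    title: str
    description: str
    word_count: int
    # Documented as nullable (`string?`).
    author: Optional[str] = None
    published_date: Optional[str] = None
    language: Optional[str] = None
    site_name: Optional[str] = None
    image: Optional[str] = None
    favicon: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Mapping[str, Any]) -> "PageMetadata":
        """Build from the `metadata` object of a scrape response."""
        return cls(
            title=data["title"],
            description=data["description"],
            word_count=data["word_count"],
            # .get() yields None for nullable fields that are missing.
            author=data.get("author"),
            published_date=data.get("published_date"),
            language=data.get("language"),
            site_name=data.get("site_name"),
            image=data.get("image"),
            favicon=data.get("favicon"),
        )
```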

Examples

Basic extraction

curl

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://openai.com/blog/gpt-4",
    "formats": ["markdown"]
  }'
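
For readers calling the endpoint from Python rather than curl, the same request can be sketched with the standard library alone. The function name `make_scrape_request` is an assumption for this example, and `wc_your_api_key` is the same placeholder used in the curl command.

```python
import json
import urllib.request
from typing import Any, Dict

SCRAPE_URL = "https://api.webclaw.io/v1/scrape"


def make_scrape_request(api_key: str, body: Dict[str, Any]) -> urllib.request.Request:
    """Prepare (but do not send) a POST /v1/scrape request."""
    return urllib.request.Request(
        SCRAPE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Mirrors the curl example above.
request = make_scrape_request(
    "wc_your_api_key",
    {"url": "https://openai.com/blog/gpt-4", "formats": ["markdown"]},
)
```

To send it, pass `request` to `urllib.request.urlopen` and decode the response body with `json.load`. Building the request separately from sending it makes the payload easy to inspect or test without network access.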

LLM-optimized with selector filtering

curl

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.stripe.com/payments/checkout",
    "formats": ["llm"],
    "include_selectors": [".content-container"],
    "exclude_selectors": ["nav", "footer", ".sidebar"],
    "only_main_content": true
  }'

Multiple output formats

curl

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "formats": ["markdown", "llm", "text"]
  }'

Note

PDFs are auto-detected via Content-Type. If the URL serves a PDF, webclaw extracts text using its PDF engine instead of HTML parsing.

Error responses

400 Bad Request

{
  "error": "Missing required field: url"
}

401 Unauthorized

{
  "error": "Invalid or missing API key"
}

422 Unprocessable Entity

{
  "error": "Failed to fetch URL: connection timeout"
}
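
The error shapes above can be turned into typed exceptions on the client side. This is a sketch under the assumption that every error body is a JSON object with a single `error` string, as in the three examples; the exception class names and `error_from_response` are hypothetical, and other status codes fall back to a generic error.

```python
import json


class ScrapeError(Exception):
    """Base error carrying the HTTP status and the server's message."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


class BadRequestError(ScrapeError):
    """400: the request body was invalid, e.g. missing `url`."""


class AuthError(ScrapeError):
    """401: the API key was invalid or missing."""


class FetchError(ScrapeError):
    """422: the target URL could not be fetched."""


_ERROR_TYPES = {400: BadRequestError, 401: AuthError, 422: FetchError}


def error_from_response(status: int, raw_body: bytes) -> ScrapeError:
    """Map a status code and raw response body to an exception instance."""
    try:
        message = json.loads(raw_body).get("error", "unknown error")
    except (ValueError, AttributeError):
        # Body was not a JSON object; keep the raw text for debugging.
        message = raw_body.decode("utf-8", errors="replace")
    return _ERROR_TYPES.get(status, ScrapeError)(status, message)
```

A caller using `urllib` could catch `urllib.error.HTTPError` and raise `error_from_response(exc.code, exc.read())`, which lets application code distinguish an authentication problem from a fetch failure that may be worth retrying.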
