webclaw

Scrape

Extract content from a single URL. This is the core endpoint -- one URL in, clean structured content out.

POST /v1/scrape

Extract content from a single URL in one or more output formats.

Request body

json
{
  "url": "https://example.com",
  "formats": ["markdown", "llm", "text", "json"],
  "include_selectors": [".article-content"],
  "exclude_selectors": ["nav", ".sidebar"],
  "only_main_content": true
}

Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The URL to scrape. |
| formats | string[] | No | Output formats to include. Options: markdown, llm, text, json. Defaults to ["markdown"]. |
| include_selectors | string[] | No | CSS selectors to extract exclusively. Only content matching these selectors will be included. |
| exclude_selectors | string[] | No | CSS selectors to remove from the page before extraction. |
| only_main_content | boolean | No | When true, extracts only the main article or content element, ignoring sidebars, headers, and footers. |

Tip
The llm format runs a 9-step optimization pipeline that strips images, collapses whitespace, deduplicates links, and reduces token count by ~67% compared to raw HTML. Use it when feeding content to language models.
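
The parameters above can be assembled and checked on the client before a request is sent. The following is a minimal sketch, not part of any official webclaw SDK: the helper name `build_scrape_body` and its validation rules are assumptions made for illustration, and only the field names and format options come from the table above.

```python
from typing import Any, Dict, Optional, Sequence

# Format names taken from the parameters table.
ALLOWED_FORMATS = ("markdown", "llm", "text", "json")


def build_scrape_body(
    url: str,
    formats: Optional[Sequence[str]] = None,
    include_selectors: Optional[Sequence[str]] = None,
    exclude_selectors: Optional[Sequence[str]] = None,
    only_main_content: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a /v1/scrape request body (hypothetical helper).

    Optional fields that are left unset are omitted, so the documented
    server-side defaults (for example formats = ["markdown"]) apply.
    """
    if not url:
        raise ValueError("url is required")
    body: Dict[str, Any] = {"url": url}
    if formats is not None:
        unknown = [f for f in formats if f not in ALLOWED_FORMATS]
        if unknown:
            raise ValueError(f"unknown formats: {unknown}")
        body["formats"] = list(formats)
    if include_selectors:
        body["include_selectors"] = list(include_selectors)
    if exclude_selectors:
        body["exclude_selectors"] = list(exclude_selectors)
    if only_main_content is not None:
        body["only_main_content"] = only_main_content
    return body
```

Rejecting unknown format names locally is a design choice of this sketch; it surfaces typos before a request is spent on them.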

Response

The response includes the requested formats alongside extracted metadata. Only the formats you request are populated.

json
{
  "url": "https://example.com",
  "metadata": {
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples.",
    "author": null,
    "published_date": null,
    "language": "en",
    "site_name": "Example",
    "image": null,
    "favicon": "https://example.com/favicon.ico",
    "word_count": 1234
  },
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "text": "Example Domain\n\nThis domain is for use in illustrative examples...",
  "llm": "> URL: https://example.com\n> Title: Example Domain\n\nThis domain is for use in illustrative examples...\n\n## Links\n- ...",
  "extraction": { ... }
}
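
Because only the requested formats are populated, client code may want to check which content fields actually arrived before using them. The sketch below assumes that unrequested formats are either absent or null in the response; the documentation does not say which, so the helper (a hypothetical name) tolerates both. The sample payload is a trimmed, made-up response used only to exercise the helper.

```python
import json
from typing import Any, Dict

# Content keys as they appear in the example response.
CONTENT_KEYS = ("markdown", "text", "llm", "extraction")


def populated_formats(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the content fields that are present and non-null."""
    return {k: response[k] for k in CONTENT_KEYS if response.get(k) is not None}


# Trimmed, illustrative payload (not a real API response).
sample = json.loads("""
{
  "url": "https://example.com",
  "metadata": {"title": "Example Domain", "word_count": 1234},
  "markdown": "# Example Domain",
  "text": null
}
""")

available = populated_formats(sample)
print(sorted(available))  # ['markdown']
```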

Metadata fields

| Field | Type | Description |
| --- | --- | --- |
| title | string | Page title from OG, meta, or title tag. |
| description | string | Page description from meta or OG tags. |
| author | string? | Author name if detected. |
| published_date | string? | Publication date if found in metadata. |
| language | string? | Page language code (e.g. "en"). |
| site_name | string? | Site name from OG metadata. |
| image | string? | Primary image URL (OG or Twitter Card). |
| favicon | string? | Favicon URL. |
| word_count | number | Total word count of extracted content. |
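
One way to carry these fields through client code is a small typed container. This is a sketch only: the class name `ScrapeMetadata` and the fallback values for missing required fields are assumptions, while the field names and their optionality follow the table above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ScrapeMetadata:
    """Typed view of the "metadata" object (hypothetical client-side class)."""

    title: str
    description: str
    word_count: int
    author: Optional[str] = None
    published_date: Optional[str] = None
    language: Optional[str] = None
    site_name: Optional[str] = None
    image: Optional[str] = None
    favicon: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ScrapeMetadata":
        # Fields marked "string?" may be null; the defaults used for the
        # non-nullable fields are a defensive assumption of this sketch.
        return cls(
            title=data.get("title") or "",
            description=data.get("description") or "",
            word_count=int(data.get("word_count") or 0),
            author=data.get("author"),
            published_date=data.get("published_date"),
            language=data.get("language"),
            site_name=data.get("site_name"),
            image=data.get("image"),
            favicon=data.get("favicon"),
        )
```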

Examples

Basic extraction

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://openai.com/blog/gpt-4",
    "formats": ["markdown"]
  }'
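
For reference, the same request could be built with Python's standard library. This is a hedged sketch rather than an official client: the function name `make_scrape_request` is made up for illustration, and the endpoint, headers, and body mirror the curl example. The request is only constructed here; sending it requires a valid API key and network access.

```python
import json
import urllib.request

API_URL = "https://api.webclaw.io/v1/scrape"  # endpoint from the curl example


def make_scrape_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = make_scrape_request(
    "wc_your_api_key",
    {"url": "https://openai.com/blog/gpt-4", "formats": ["markdown"]},
)

# Sending is left to the caller:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     result = json.load(resp)
```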

LLM-optimized with selector filtering

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.stripe.com/payments/checkout",
    "formats": ["llm"],
    "include_selectors": [".content-container"],
    "exclude_selectors": ["nav", "footer", ".sidebar"],
    "only_main_content": true
  }'

Multiple output formats

curl
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer wc_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "formats": ["markdown", "llm", "text"]
  }'
Note
PDFs are auto-detected via Content-Type. If the URL serves a PDF, webclaw extracts text using its PDF engine instead of HTML parsing.
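
To illustrate what Content-Type based detection generally involves, here is a minimal sketch. It is not webclaw's implementation, which is not documented here; the function name is hypothetical, and the check simply compares the media type, ignoring parameters and letter case, against application/pdf.

```python
def is_pdf_content_type(content_type: str) -> bool:
    """Return True if a Content-Type header value denotes a PDF.

    Parameters such as "; charset=..." are dropped and the media type
    is compared case-insensitively.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"
```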

Error responses

400 Bad Request
{
  "error": "Missing required field: url"
}
401 Unauthorized
{
  "error": "Invalid or missing API key"
}
422 Unprocessable Entity
{
  "error": "Failed to fetch URL: connection timeout"
}
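
All three error bodies share a single "error" string, so a client can surface them uniformly. The sketch below is one possible approach under that assumption; `ScrapeError` and `raise_for_error` are hypothetical names, not part of a published SDK.

```python
import json


class ScrapeError(Exception):
    """Raised for non-2xx responses (hypothetical client-side exception)."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def raise_for_error(status: int, body: bytes) -> None:
    """Raise ScrapeError for non-2xx statuses, using the "error" field if present."""
    if 200 <= status < 300:
        return
    try:
        message = json.loads(body).get("error", "unknown error")
    except (ValueError, AttributeError):
        # Body was not a JSON object; fall back to the raw text.
        message = body.decode("utf-8", errors="replace")
    raise ScrapeError(status, message)
```

A caller might reasonably treat a 422 fetch failure as worth retrying, while 400 and 401 indicate a problem with the request itself; that policy is a suggestion, not documented behavior.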