Scrape
Extract content from a single URL. This is the core endpoint -- one URL in, clean structured content out.
POST /v1/scrape

Extract content from a single URL in one or more output formats.
Request body
Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | The URL to scrape. |
| formats | string[] | No | Output formats to include. Options: markdown, llm, text, json. Defaults to ["markdown"]. |
| include_selectors | string[] | No | CSS selectors to extract exclusively. Only content matching these selectors is included. |
| exclude_selectors | string[] | No | CSS selectors to remove from the page before extraction. |
| only_main_content | boolean | No | When true, extracts only the main article or content element, ignoring sidebars, headers, and footers. |
Tip

The llm format runs a 9-step optimization pipeline that strips images, collapses whitespace, deduplicates links, and reduces token count by ~67% compared to raw HTML. Use it when feeding content to language models.

Response
The response includes the requested formats alongside extracted metadata. Only the formats you request are populated.
Metadata fields
| Field | Type | Description |
|---|---|---|
| title | string | Page title from OG, meta, or title tag. |
| description | string | Page description from meta or OG tags. |
| author | string? | Author name, if detected. |
| published_date | string? | Publication date, if found in metadata. |
| language | string? | Page language code (e.g. "en"). |
| site_name | string? | Site name from OG metadata. |
| image | string? | Primary image URL (OG or Twitter Card). |
| favicon | string? | Favicon URL. |
| word_count | number | Total word count of the extracted content. |
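A minimal sketch of reading the metadata block from a response. The exact JSON envelope shown here (a top-level "data" object holding the formats and a "metadata" key) is an assumption for illustration; only the metadata field names come from the table above.

```python
import json

# Hypothetical response body -- the "data"/"metadata" envelope is an
# assumption; the metadata field names are from the table above.
raw = """
{
  "data": {
    "markdown": "# Example Domain",
    "metadata": {
      "title": "Example Domain",
      "description": "Illustrative example page",
      "language": "en",
      "word_count": 28
    }
  }
}
"""

resp = json.loads(raw)
meta = resp["data"]["metadata"]

# Optional fields (author, published_date, ...) may be absent, so
# read them with .get() rather than indexing.
print(meta["title"])       # Example Domain
print(meta.get("author"))  # None -- not detected for this page
print(meta["word_count"])  # 28
```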
Examples
Basic extraction
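A minimal request body, sketched in Python. Only the url field is required; formats defaults to ["markdown"]. The target URL is illustrative, and the base URL and auth header mentioned in the comment are assumptions not specified in this section.

```python
import json

# Minimal request: "url" is the only required field, and formats
# defaults to ["markdown"] when omitted.
payload = {"url": "https://example.com/article"}

body = json.dumps(payload)
print(body)  # {"url": "https://example.com/article"}

# Sending it would look roughly like this (base URL and auth scheme
# are assumptions, not specified in this section):
#   POST https://<your-base-url>/v1/scrape
#   Content-Type: application/json
```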
LLM-optimized with selector filtering
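A request body combining the llm format with selector filtering. The field names come from the parameters table above; the URL and the CSS selector values are illustrative assumptions.

```python
import json

# Request the LLM-optimized output and narrow extraction with CSS
# selectors. The selector values here are illustrative only.
payload = {
    "url": "https://example.com/docs/page",
    "formats": ["llm"],
    "include_selectors": ["article", ".docs-content"],  # keep only these
    "exclude_selectors": [".ad-banner", "nav"],         # strip before extraction
}

body = json.dumps(payload, indent=2)
print(body)
```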
Multiple output formats
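A request body asking for several formats at once; per the Response section above, only the formats you request are populated. The field names come from the parameters table; the URL is an illustrative assumption.

```python
import json

# Ask for several output formats in a single call, restricted to the
# main content element (ignoring sidebars, headers, and footers).
payload = {
    "url": "https://example.com/blog/post",
    "formats": ["markdown", "text", "json"],
    "only_main_content": True,
}

body = json.dumps(payload, indent=2)
print(body)
```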
Note
PDFs are auto-detected via Content-Type. If the URL serves a PDF, webclaw extracts text using its PDF engine instead of HTML parsing.