Extract structured data from any webpage
You scraped the page. You got clean markdown. Now what?
If you're building a price comparison tool, you don't need the whole article. You need the price, the product name, and whether it's in stock. If you're enriching leads, you need the company name, team size, and tech stack. Not the entire "About Us" page converted to markdown.
Most scraping tools stop at "here's the content." They give you text and leave the parsing to you. Which means you're back to writing regex, CSS selectors, or feeding the entire page to an LLM with a prompt like "please find the price somewhere in here."
There's a better way.
Why selectors break
The traditional approach to pulling specific data from a webpage is CSS selectors. Find the element, grab the text.
```python
price = soup.select_one(".product-price .amount").text
```

This works until it doesn't. And it always stops working.
The site redesigns. The class name changes from product-price to pdp-price-container. The price moves from a <span> to a <div>. The format changes from "$29.99" to "US$29.99/mo". Your selector returns None and your pipeline breaks at 3am.
Selectors are brittle because they depend on implementation details. You're coupling your data extraction to someone else's frontend code. Every deploy on their end is a potential breakpoint on yours.
For one site, you can maintain selectors. For ten sites, it's a part-time job. For "any URL an agent decides to visit," it's impossible.
Schema-based extraction
webclaw's /v1/extract endpoint takes a different approach. You describe what data you want as a JSON schema. The extraction engine reads the page, understands the content, and returns data matching your schema.
No selectors. No XPath. No regex. You define the shape of the data you need, webclaw fills it in.
```bash
curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.example.com/product/wireless-headphones",
    "schema": {
      "type": "object",
      "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
        "rating": {"type": "number"},
        "review_count": {"type": "integer"}
      }
    }
  }'
```

Response:

```json
{
  "data": {
    "product_name": "Sony WH-1000XM5",
    "price": 279.99,
    "currency": "USD",
    "in_stock": true,
    "rating": 4.7,
    "review_count": 3842
  }
}
```

The site can redesign completely. As long as the information is somewhere on the page, the extraction still works. You're extracting meaning, not DOM positions.
How it works under the hood
The extraction pipeline has three steps.
First, webclaw fetches the page with TLS fingerprinting, the same way it handles any scrape. If the page needs JavaScript rendering, it renders. If it's behind anti-bot protection, it gets through. The extract endpoint inherits all of webclaw's fetch capabilities.
Second, the page content gets cleaned through the same 9-step optimization pipeline from the regular scrape. Navigation, ads, cookie banners, footers. All stripped. What's left is the actual content.
Third, an LLM reads the clean content against your schema and extracts the matching fields. Because the content is already optimized, the LLM focuses on actual information instead of wading through noise. This makes the extraction more accurate and costs fewer tokens.
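The three steps can be sketched in Python. Every function body below is a stand-in, since the real fetcher, cleaning pipeline, and LLM step all run server-side inside webclaw, but the flow mirrors the description above.

```python
# Illustrative sketch of the three-step extract pipeline. All bodies are stubs.

def fetch(url: str) -> str:
    # Step 1: fetch with TLS fingerprinting, rendering JS if needed (stubbed).
    return "<html><nav>...</nav><main>Price: $279.99</main><footer>...</footer></html>"

def clean(html: str) -> str:
    # Step 2: strip navigation, ads, cookie banners, footers (crudely stubbed
    # here by keeping only the <main> content).
    start = html.index("<main>") + len("<main>")
    end = html.index("</main>")
    return html[start:end]

def llm_extract(content: str, schema: dict) -> dict:
    # Step 3: an LLM fills the schema from the cleaned content (stubbed with None).
    return {field: None for field in schema["properties"]}

def extract(url: str, schema: dict) -> dict:
    return llm_extract(clean(fetch(url)), schema)

result = extract("https://store.example.com/p/1", {"properties": {"price": {}, "in_stock": {}}})
print(sorted(result))  # ['in_stock', 'price']
```

The point of the ordering is that step 3 only ever sees cleaned content, which is why the LLM step stays cheap and accurate.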
The LLM handles ambiguity the way a human would. Price shows up as "Starting from $29/mo"? It extracts 29 and knows it's monthly. Rating is "4.7 out of 5 stars (3,842 reviews)"? It splits that into rating and review_count correctly. A paragraph mentions "the team has grown to over 200 engineers across 3 offices"? It pulls 200 and 3 into the right fields.
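To make that concrete, here is a deterministic approximation of those inferences. The real endpoint does this with a model, not regexes; this sketch only shows the kind of normalization you would otherwise hand-write per site.

```python
import re

# Illustration only: regex stand-ins for inferences the LLM makes on its own.

def parse_price(text):
    """Pull a numeric price out of free-form text and flag monthly billing."""
    m = re.search(r"\$(\d+(?:\.\d+)?)", text)
    price = float(m.group(1)) if m else None
    monthly = "/mo" in text or "per month" in text
    return price, monthly

def parse_rating(text):
    """Split 'X out of 5 stars (N reviews)' into rating and review count."""
    m = re.search(r"([\d.]+) out of 5 stars \(([\d,]+) reviews\)", text)
    if not m:
        return None, None
    return float(m.group(1)), int(m.group(2).replace(",", ""))

print(parse_price("Starting from $29/mo"))                 # (29.0, True)
print(parse_rating("4.7 out of 5 stars (3,842 reviews)"))  # (4.7, 3842)
```

Each of these patterns breaks the moment a site phrases things differently, which is exactly the maintenance burden the model-based step removes.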
You can also just ask in plain English
Not everything fits neatly into a JSON schema. Sometimes you don't know the exact structure upfront. For those cases, the extract endpoint accepts a prompt parameter alongside or instead of a schema.
```bash
curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://company.com/about",
    "prompt": "Find the founding year, number of employees, and what the company does in one sentence"
  }'
```

You get back structured data without having to define every field type. Useful for exploration, quick lookups, and cases where the data shape varies across pages.
For production pipelines where you need consistent output, use the schema. For ad-hoc extraction and agent workflows, the prompt is faster to set up.
What you can extract
The same endpoint works across completely different types of pages. Same approach, different schema.
Product pages. Name, price, availability, specs, reviews. Works on any e-commerce site regardless of their frontend framework or layout.
Job listings. Title, company, location, salary range, requirements, remote status. Same schema works on LinkedIn, Greenhouse, Lever, Indeed. No per-site configuration.
Contact pages. Email, phone, address, social media links. Useful for lead enrichment at scale.
Event listings. Conference pages, meetup groups, concert venues. Pull dates, locations, speakers, prices into a consistent format regardless of how each site presents the information.
Pricing pages. Plan names, features, prices, billing frequency. Competitive analysis without manually checking each competitor's site every week.
The pattern is always the same. Define the data shape. Point it at a URL. Get clean JSON back.
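That pattern fits in a few lines of Python using only the standard library. The endpoint URL and headers below match the curl examples in this post; the job-listing schema is an example shape, not an official one.

```python
import json
import urllib.request

API_URL = "https://api.webclaw.io/v1/extract"

def build_request(url, schema, api_key):
    """Assemble the POST request for /v1/extract; sending it is left to the caller."""
    body = json.dumps({"url": url, "schema": schema}).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Example shape for job listings; swap the schema to target any other page type.
job_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "salary_range": {"type": "string"},
        "remote": {"type": "boolean"},
    },
}

req = build_request("https://example.com/jobs/123", job_schema, "YOUR_API_KEY")
# urllib.request.urlopen(req)  # uncomment with a real API key to send it
```

Switching from job listings to product pages or events means changing only the schema dict, never the request code.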
Extract vs scrape: when to use which
Use scrape when you want the content of a page. Articles, documentation, blog posts. You want to read it, summarize it, feed it to a RAG pipeline, or show it to a user. The output is text.
Use extract when you want specific data points. Prices, names, dates, structured fields. You want to store it in a database, compare it across sites, or use it in a calculation. The output is JSON.
You can combine them. Scrape a page to get the full content for context, then extract specific fields for your database. Or run extract on a batch of URLs to build a structured dataset from pages that all look completely different.
Using extract through MCP
If you're using webclaw through MCP with an AI agent, the agent gets the extract tool automatically. During a conversation, the agent can call extract with a schema and get back structured data without you writing any code.
You say: "Compare the pricing of these three SaaS tools." The agent calls extract on each pricing page with the same schema, gets back consistent JSON, and builds a comparison table. Three pages, three seconds, structured output.
This is where extract gets really useful. The agent decides what to extract based on the conversation. You describe the outcome, the agent figures out the schema and the URLs.
```json
{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp"
    }
  }
}
```

The extract tool is one of 8 tools in webclaw-mcp. It works alongside scrape, crawl, search, map, summarize, diff, and brand. Install once, your agent gets all of them.
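The comparison step the agent performs can be sketched in a few lines. The dicts below are hard-coded stand-ins for real extract responses, not live data.

```python
# Same schema, three pricing pages, one table -- stand-in data for three
# hypothetical extract responses.
results = {
    "tool-a.example.com": {"plan": "Pro", "price": 29, "billing": "monthly"},
    "tool-b.example.com": {"plan": "Team", "price": 49, "billing": "monthly"},
    "tool-c.example.com": {"plan": "Growth", "price": 39, "billing": "monthly"},
}

header = f"{'site':<22}{'plan':<10}{'price':<8}billing"
rows = [
    f"{site:<22}{data['plan']:<10}{data['price']:<8}{data['billing']}"
    for site, data in results.items()
]
print("\n".join([header] + rows))
```

Because every response matches the same schema, the table-building step is trivial; the hard part, normalizing three differently designed pages, already happened inside extract.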
Accuracy and trade-offs
The extraction uses an LLM for the parsing step, which means it handles messy, inconsistent pages well. But it also means there are trade-offs worth knowing.
Accuracy. For standard fields like names, prices, dates, and availability, accuracy is above 95% on pages where the data is clearly present. For more ambiguous extractions (categorizing a product, determining sentiment, inferring data that isn't explicitly stated), results depend on how clear the page content is.
Cost. An extract call costs roughly 2x a regular scrape because of the LLM step. For most use cases that's a fraction of a cent per page. Cheaper than writing and maintaining custom parsers for every site you need to extract from.
Missing data. If the data isn't on the page, the response returns null for those fields rather than making something up. The extraction is conservative. It would rather give you nothing than hallucinate a price or a phone number that doesn't exist.
Speed. Slightly slower than a plain scrape because of the LLM step. Expect 500ms to 1.5 seconds depending on the page size and schema complexity. Fast enough for real-time use, but for bulk extraction (thousands of pages), use the batch endpoint and run extractions in parallel.
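The conservative-null behavior is worth handling explicitly in your own code. In this sketch, `response` mimics the shape of an extract response, and the field names come from the product example earlier in the post.

```python
# Fields webclaw can't find come back as null instead of a guess, so check
# for them before writing to your database.
response = {"data": {"product_name": "Sony WH-1000XM5", "price": None, "in_stock": True}}

def require_fields(data, fields):
    """Return the requested fields that came back null or missing,
    so the caller can retry, fall back, or skip the record."""
    return [f for f in fields if data.get(f) is None]

missing = require_fields(response["data"], ["product_name", "price", "in_stock"])
print(missing)  # ['price']
```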
Getting started
The extract endpoint is available on the webclaw cloud API. If you have an API key, you can start using it right now.
```bash
curl -X POST https://api.webclaw.io/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "schema": {
      "type": "object",
      "properties": {
        "plans": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "price": {"type": "number"},
              "billing": {"type": "string"},
              "features": {"type": "array", "items": {"type": "string"}}
            }
          }
        }
      }
    }
  }'
```

webclaw SDKs for Python, TypeScript, and Go are coming soon with native extract() methods. For now, the REST API and MCP cover everything.
Define the data you need. Point it at any page. Get structured JSON back. No selectors to maintain, no parsers to update, no pipelines to fix when a site changes its CSS.
Check the API documentation for the full schema reference and response format. Sign up at webclaw.io to get your API key.
---
Read next: Build a RAG pipeline with live web data | Web scraping for AI agents | API reference