Massi

How to bypass Cloudflare bot protection when scraping

You send a request. You get a 403. Or worse, you get back HTML that looks like content but is actually a Cloudflare challenge page. Your scraper reports success. Your data is garbage.
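You can catch this failure mode with a cheap sanity check before trusting scraped HTML. A minimal sketch in Python; the marker strings are assumptions based on commonly observed challenge pages, not a stable contract, and Cloudflare rotates them over time:

```python
# Heuristic check for a Cloudflare challenge page masquerading as content.
# These markers are an assumption -- Cloudflare changes them, so treat
# this as a pipeline sanity check, not a guarantee.
CHALLENGE_MARKERS = (
    "just a moment",            # interstitial page title
    "cf-browser-verification",  # legacy challenge container id
    "challenge-platform",       # script path used by managed challenges
    "cf-turnstile",             # Turnstile widget class
)

def looks_like_cf_challenge(status_code: int, html: str) -> bool:
    """Return True if a response is probably a challenge, not real content."""
    if status_code in (403, 503):
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

Wiring this in as a post-fetch assertion turns "silent garbage" into a loud, retryable error.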

If you've tried to scrape anything meaningful in 2026, you've hit this wall. Cloudflare protects somewhere north of 20% of all websites. That's not just enterprise sites. It's blogs, documentation, e-commerce stores, SaaS pricing pages. The kind of pages AI agents and data pipelines need to read every day.

Most scraping tools handle this by either failing silently or telling you to upgrade to a paid proxy tier. Neither is a real solution. So let me walk you through what Cloudflare actually does, why most bypass approaches fail, and what works reliably without a $500/month proxy bill.

What Cloudflare actually checks

Cloudflare's bot detection isn't one thing. It's a stack of signals evaluated together. Understanding the layers matters because most tools only address one or two of them.

TLS fingerprinting. Before your request even reaches the server, Cloudflare inspects your TLS handshake. Every HTTP client produces a unique fingerprint based on which cipher suites it supports, in what order, and which TLS extensions it sends. Python's requests library, Go's net/http, Node's axios — they all have fingerprints that look nothing like a real browser. Cloudflare knows this. The fingerprint is checked before your User-Agent header is even read.

HTTP/2 fingerprinting. Modern browsers use HTTP/2 with distinctive connection parameters: the values in the SETTINGS frame (SETTINGS_HEADER_TABLE_SIZE, SETTINGS_MAX_CONCURRENT_STREAMS, and so on) plus the initial WINDOW_UPDATE increment. These are declared at connection time and vary between Chrome, Firefox, and Safari. If your HTTP client sends default HTTP/2 settings that don't match any known browser, that's another signal.
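To make this concrete, here's a sketch comparing browser-like SETTINGS values against the RFC 7540 defaults most libraries ship. The Chrome numbers are representative values observed in the wild, an assumption for illustration, not pinned to any specific release:

```python
# Representative HTTP/2 SETTINGS values observed from Chrome -- assumed
# for illustration; exact values shift between releases.
CHROME_H2_SETTINGS = {
    "SETTINGS_HEADER_TABLE_SIZE": 65536,
    "SETTINGS_ENABLE_PUSH": 0,
    "SETTINGS_INITIAL_WINDOW_SIZE": 6291456,
    "SETTINGS_MAX_HEADER_LIST_SIZE": 262144,
}

# A client shipping the RFC 7540 defaults instead (4096-byte header
# table, 65535-byte window) matches no mainstream browser profile.
RFC7540_DEFAULTS = {
    "SETTINGS_HEADER_TABLE_SIZE": 4096,
    "SETTINGS_INITIAL_WINDOW_SIZE": 65535,
}

def matches_chrome(settings: dict) -> bool:
    """Naive profile match: every Chrome setting present with the same value."""
    return all(settings.get(k) == v for k, v in CHROME_H2_SETTINGS.items())
```

A real detector compares against per-version profiles for every major browser, but the shape of the check is the same.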

Header order and values. Browsers send headers in a specific, consistent order. Chrome sends sec-ch-ua before sec-ch-ua-mobile before sec-ch-ua-platform. Most HTTP libraries send headers in whatever order their internal data structures produce, often with library defaults injected first. That ordering matches no browser and is immediately suspicious.
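Here's what getting the order right looks like in practice. A sketch assuming Python 3.7+, where dicts preserve insertion order; the header values are illustrative of Chrome on Windows, not tied to a version, and note that many clients still inject or reorder their own defaults at send time, so defining the dict correctly is necessary but not sufficient:

```python
# Chrome-like header ordering. Values are illustrative assumptions.
chrome_headers = {
    "sec-ch-ua": '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
}

# The ordering constraint the article describes:
order = list(chrome_headers)
assert order.index("sec-ch-ua") < order.index("sec-ch-ua-mobile") < order.index("sec-ch-ua-platform")
```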

JavaScript challenges. For higher security levels, Cloudflare serves a JavaScript challenge that a real browser executes automatically. The challenge collects browser environment data: canvas fingerprint, WebGL renderer, installed fonts, screen dimensions, timezone. Turnstile is the visible face of this layer; most challenges run invisibly. Either way: no JavaScript engine, no pass.

Behavioral analysis. Request timing, mouse movements, scroll patterns, cookie handling. This layer kicks in for sites with "I'm Under Attack" mode or custom WAF rules.

The key insight: these layers work together. Passing one doesn't mean you pass them all. You can have a perfect TLS fingerprint and still get blocked by a JavaScript challenge. You can solve the JavaScript challenge and still get re-challenged because your subsequent requests have a non-browser fingerprint. Cloudflare is a stack, and you need to handle the whole stack.

The approaches, and why most of them fail

Proxy rotation

The most common advice is "just use proxies." Services like Bright Data, Oxylabs, and Smartproxy sell residential and datacenter proxies that rotate your IP address on each request.

The problem: Cloudflare doesn't primarily block by IP. It blocks by fingerprint. You can rotate through a thousand IPs, but if every request has the same Python requests TLS fingerprint, Cloudflare sees a thousand requests from the same bot on different IPs. You've spent money to look more suspicious, not less.

Proxies are useful as one component of a bypass stack. They're not a solution on their own.
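For completeness, here's what naive rotation looks like, with placeholder proxy URLs. Every request this produces would still carry the client's own TLS fingerprint, which is exactly the problem:

```python
import itertools

# Naive proxy rotation sketch. The proxy URLs are placeholders.
# What does NOT change per request: the client's TLS fingerprint.
proxies = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
    "http://user:pass@proxy-3.example.com:8000",
]
rotation = itertools.cycle(proxies)

def next_proxy_config() -> dict:
    """Per-request proxy mapping in the shape libraries like requests expect."""
    proxy = next(rotation)
    return {"http": proxy, "https": proxy}
```

Twenty lines of code, a recurring bill, and the fingerprint signal is untouched.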

Headless browsers

Puppeteer, Playwright, Selenium. Spin up a real Chrome instance, navigate to the page, solve the challenge, extract the content.

This works. A real Chrome instance has the right TLS fingerprint, the right HTTP/2 settings, and a real JavaScript engine. Cloudflare's challenge runs and passes.

The catch is everything else. Each request takes 3-8 seconds. You need 200MB+ of Chromium per instance. Scaling to thousands of pages means running a browser farm. Memory usage is brutal. And Cloudflare has gotten good at detecting headless Chrome specifically. The navigator.webdriver flag, missing plugins, headless-specific quirks. Tools like puppeteer-extra-plugin-stealth patch some of these tells, but it's an arms race.

For a handful of pages, headless browsers work fine. For anything at scale, you need a lighter approach.

Undetected Chrome wrappers

undetected-chromedriver and similar tools patch Selenium's Chrome to remove detectable artifacts. They modify the binary to strip headless tells, patch JavaScript APIs, and randomize fingerprint values.

These work better than raw Puppeteer but still carry the Chrome overhead. And they break regularly. Every Chrome update changes internals, and the patches need to catch up. You'll find GitHub issues filled with "stopped working after Chrome 130" posts.

CAPTCHA solving services

Capsolver, 2Captcha, Anti-Captcha. These services solve Cloudflare Turnstile challenges by running them in real browser environments and returning the solution token.

The issue is that solving the challenge isn't enough. You still need to make subsequent requests with the right fingerprint and the cookies from the solved session. If your next request comes from a Python requests client with a non-browser TLS fingerprint, Cloudflare re-challenges you immediately. The solved token was worthless.

CAPTCHA solvers are a piece of the puzzle, not the whole picture.
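If you do pair a solver with your own client, the solved session has to be replayed faithfully. A hypothetical sketch; solver APIs differ, but they generally hand back a cf_clearance cookie plus the user agent the challenge was solved under, and the follow-up requests must present both (and, in practice, a matching TLS fingerprint):

```python
# Hypothetical shape: carrying a solved Cloudflare session forward.
# Field names are illustrative, not any specific solver's API.
def session_from_solver(solver_result: dict) -> dict:
    """Build request kwargs that reuse the solved session.

    The clearance cookie is only honored when subsequent requests also
    match the user agent (and TLS fingerprint) used during solving.
    """
    return {
        "headers": {"user-agent": solver_result["user_agent"]},
        "cookies": {"cf_clearance": solver_result["cf_clearance"]},
    }
```

Drop either piece, or switch to a client with a different fingerprint, and you're back at the challenge page.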

TLS fingerprint impersonation

Instead of running a full browser, you make your HTTP client look like a browser at the network level. Same TLS cipher suites, same HTTP/2 settings, same header order.

Libraries like curl-impersonate, primp for Python, and tls-client for Go do exactly this. They patch the underlying TLS library to produce browser-matching fingerprints.

This helps with simpler bot detection systems and gets you past the first layer of Cloudflare's checks. But let's be honest: TLS impersonation alone doesn't reliably bypass Cloudflare in 2026. Cloudflare has evolved well beyond fingerprint checks. Even with a perfect Chrome TLS fingerprint, you'll still hit JavaScript challenges, Turnstile widgets, and behavioral analysis on most protected sites. TLS impersonation is a necessary foundation, but it's not a bypass on its own.

Full-stack scraping APIs

Services like ScrapingBee, Scrapfly, and ZenRows combine multiple bypass techniques behind an API. You send a URL, they handle the fingerprinting, proxies, JavaScript rendering, and challenge solving.

The trade-off is cost and control. You're paying per request (typically $1-5 per 1,000 pages), you don't control the browser profile, and you're dependent on their infrastructure. For some use cases that's fine. For high-volume scraping or latency-sensitive applications, the economics don't work.

What webclaw does differently

Every approach above fails at Cloudflare for the same reason: they solve one layer and ignore the rest. Proxies don't fix your fingerprint. TLS impersonation doesn't solve JavaScript challenges. CAPTCHA solvers don't maintain sessions. Headless browsers work but don't scale.

webclaw has a built-in antibot engine that handles Cloudflare end-to-end. You send a URL, and the engine deals with whatever Cloudflare throws at it — challenges, Turnstile, behavioral checks. You don't pick a strategy, you don't configure bypass modes, you don't chain tools together. It just works.

I'm not going to go deep into how the antibot engine works internally. But the result is an 89% bypass rate across Cloudflare-protected sites.

webclaw https://cloudflare-protected-site.com

No proxy configuration. No browser setup. No CAPTCHA API key. If the site is behind Cloudflare, webclaw handles it automatically.

Using the API

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://cloudflare-protected-site.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

When antibot bypass activates, the response includes timing data so you know what happened:

{
  "url": "https://cloudflare-protected-site.com",
  "markdown": "# Page content...",
  "antibot": {
    "bypass": true,
    "elapsed_ms": 3200
  }
}
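In a client, you might branch on that antibot field for logging or metrics. A small sketch based on the example payload above; the field names follow the example, so treat the shape as illustrative:

```python
# Sketch of handling the scrape response shown above. Field names
# follow the example payload and are assumed, not a documented schema.
def summarize_scrape(result: dict) -> str:
    """Return a one-line log message describing the antibot outcome."""
    antibot = result.get("antibot") or {}
    if antibot.get("bypass"):
        return f"bypassed challenge in {antibot.get('elapsed_ms', '?')} ms"
    return "no challenge encountered"
```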

Using the CLI

webclaw https://cloudflare-protected-site.com --format llm

The llm format gives you LLM-optimized output. Clean markdown, title and URL header, deduplicated link references. Typically 67% fewer tokens than standard markdown conversion.

Using MCP

If you're building with Claude or another MCP-compatible AI, add webclaw-mcp to your config and your AI handles Cloudflare-protected pages automatically:

{
  "mcpServers": {
    "webclaw": {
      "command": "webclaw-mcp"
    }
  }
}

Your AI calls scrape with a URL. If the site is behind Cloudflare, the bypass runs transparently.

The bypass spectrum

Not all Cloudflare configurations are the same. The protection level depends on what the site operator has configured.

Basic. Standard Cloudflare proxy with default bot detection. These sites still check fingerprints and may serve lightweight challenges. Easier to bypass, but not trivial. A proper antibot tool handles these reliably.

Managed challenge. Cloudflare serves an interstitial challenge page. For low-risk visitors it auto-solves in the background (the "checking your browser" spinner). For higher-risk visitors it shows a Turnstile widget. You need a real browser environment or a specialized solver with proper session handling.

I'm Under Attack mode. The site operator has explicitly turned on aggressive bot filtering. Every visitor gets a 5-second JavaScript challenge. Behavioral signals are weighted heavily. This is the hardest tier to bypass consistently.

Custom WAF rules. The site has custom rules that go beyond Cloudflare's defaults. Rate limiting, geographic restrictions, specific header requirements, device fingerprint checks. These are site-specific and there's no one-size-fits-all bypass.

webclaw handles the first three tiers automatically. Custom WAF rules may need additional configuration like proxies.

What I'd recommend

You're scraping a few pages and don't want to think about it: Use webclaw's CLI or API. Cloudflare bypass is automatic. You don't need to understand the layers.

You're building an AI agent that needs web access: Use webclaw-mcp. Your AI gets Cloudflare bypass as a transparent capability. No per-page configuration.

You're scraping one specific Cloudflare site and want to DIY: You'll need to combine multiple tools. Puppeteer with stealth or undetected-chromedriver to solve challenges, then maintain the session cookies for subsequent requests. Expect to spend time keeping it working as Cloudflare updates its detection.

You have budget and don't want infrastructure: A managed service like ScrapingBee or Scrapfly handles everything. Cost scales linearly with volume, so check your per-page economics first.

Frequently asked questions

How does Cloudflare detect web scrapers?

Cloudflare uses a stack of detection signals. The primary ones are TLS fingerprinting (analyzing the TLS handshake to identify the HTTP client), HTTP/2 settings analysis, header ordering, and JavaScript challenges. Most scrapers fail at the TLS layer before reaching the JavaScript challenge. Cloudflare also uses behavioral analysis for high-security configurations.

Can you scrape a Cloudflare-protected website?

Yes. Cloudflare protection can be bypassed through headless browsers with stealth patches, specialized antibot engines, or scraping APIs that handle bypass automatically. The approach depends on the protection level. TLS fingerprint impersonation alone is no longer enough for most Cloudflare sites in 2026. You need a tool that handles the full detection stack including JavaScript challenges and behavioral analysis.

Is it legal to scrape Cloudflare-protected websites?

Legality depends on what you're scraping and how, not whether Cloudflare is involved. The hiQ v. LinkedIn ruling established that scraping publicly accessible data is generally legal under US law. However, violating a site's Terms of Service, scraping personal data under GDPR, or circumventing access controls on non-public content can create legal risk. This is not legal advice. Consult a lawyer for your specific use case.

What is TLS fingerprinting?

TLS fingerprinting identifies HTTP clients by analyzing their TLS handshake. When a client connects via HTTPS, it sends a ClientHello message containing supported cipher suites, TLS extensions, and other parameters. This combination is unique enough to distinguish Chrome from Firefox from Python's requests. The JA3 and JA4 algorithms hash these parameters into a short fingerprint string. Anti-bot systems compare this fingerprint against known browser profiles.
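To make the JA3 mechanics concrete, here's a toy computation with made-up field values; real fingerprints are derived from the actual ClientHello bytes:

```python
import hashlib

# Toy JA3 computation. Field values below are invented for illustration.
tls_version = "771"                    # TLS 1.2 in the record layer
ciphers = [4865, 4866, 4867, 49195]    # cipher suite IDs, in offered order
extensions = [0, 23, 65281, 10, 11]    # extension IDs, in offered order
curves = [29, 23, 24]                  # supported elliptic curves
point_formats = [0]                    # EC point formats

# JA3 joins the five fields with commas, dash-separating list items,
# then MD5-hashes the result into a 32-character fingerprint.
ja3_string = ",".join([
    tls_version,
    "-".join(map(str, ciphers)),
    "-".join(map(str, extensions)),
    "-".join(map(str, curves)),
    "-".join(map(str, point_formats)),
])
ja3_hash = hashlib.md5(ja3_string.encode()).hexdigest()
# Reordering even one cipher suite yields a completely different hash,
# which is why each HTTP client stack is reliably distinguishable.
```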

Why do proxies alone not work against Cloudflare?

Proxies change your IP address but not your fingerprint or behavior. Cloudflare checks far more than IP — TLS handshake, browser environment, JavaScript challenge responses, behavioral signals. A Python requests library routed through a residential proxy still looks like a bot to Cloudflare. Rotating through 1,000 IPs with the same bot fingerprint actually makes you more suspicious, not less. Proxies can be useful as part of a full bypass stack, but they solve the wrong problem on their own.

What is the fastest way to scrape Cloudflare-protected sites?

A scraping API with a built-in antibot engine is the fastest practical approach. DIY solutions with headless browsers take 3-8 seconds per page and require constant maintenance. webclaw handles Cloudflare bypass automatically — you send a URL and get clean content back without configuring anything.

---

Read next: Web scraping for AI agents | MCP and web scraping | Get started with webclaw
