Massi

Cloudflare scraping checklist: diagnose the block before you retry

Most Cloudflare scraping failures get worse because the scraper retries too early.

You get a 403. You rotate the proxy. Same 403. You change the User-Agent. Now it is a 503. You launch Puppeteer. It works once. Then it dies on page three. At that point the code is no longer debugging Cloudflare. It is generating more bad traffic for Cloudflare to score.

This post closes the Cloudflare cluster with a checklist. Not a silver bullet. A way to decide what actually failed before you change anything.

If you want the deeper pieces, start here:

1. Bypass Cloudflare bot protection

2. Cloudflare Turnstile in 2026

3. Why Puppeteer stealth stopped working on Cloudflare

4. Cloudflare error codes for scrapers

5. TLS fingerprinting in 2026

The short version: log the layer first. Then change the layer that failed.

Cloudflare scraping diagnostic checklist: log the response, error code, fingerprint, and session before changing the scraper.

The mistake: treating every block like the same block

A Cloudflare block can come from several places.

Cloudflare's docs on its bot detection engines describe several of them: heuristics, JavaScript Detections, machine learning, and, on some plans, anomaly detection. The machine learning docs say the model uses request features, headers, session characteristics, and browser signals. The __cf_bm cookie smooths the bot score across a user's request pattern.

That means one scrape can fail because:

1. The TLS or HTTP fingerprint does not match the browser you claim to be.

2. The request hits a path-specific WAF rule.

3. JavaScript Detections failed or never had a chance to run.

4. The session has no believable history.

5. The IP, ASN, or country is wrong for the target.

6. The rate limit fired.

7. The body is a challenge page, even if the status code says 200.

Those are different failures. They need different fixes.

What to log on every Cloudflare request

If your scraper does not store these fields, add them before changing the bypass logic.

URL and method: Cloudflare rules are often path-specific
Status code: useful, but not enough by itself
cf-ray: the only useful handle if a site owner checks logs
cf-mitigated: Cloudflare sets this to "challenge" on Challenge Page responses
Content-Type: challenge pages return HTML, even for some fetch/XHR flows
First 2 KB of body: enough to detect cf-turnstile, /cdn-cgi/challenge-platform/, and error codes
Response headers: rate limits, cookies, and challenge markers live here
Request headers sent: the bug is often in what you actually sent, not what you meant to send
Proxy ASN and country: a clean fingerprint from the wrong network still looks wrong
Session ID and cookie age: fresh sessions and returning sessions are scored differently
Duration and retry number: rate-limit loops and challenge loops look different over time

Cloudflare's challenge docs give one especially useful signal: Challenge Page responses include the cf-mitigated header with value challenge, and the content type is text/html regardless of the requested resource type. If you index that body as if it were the page, you just poisoned your dataset.
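As a sketch, those fields fit in one record per request. The names and shape here are illustrative, not a required schema; the helper assembles the record from raw pieces so nothing is logged conditionally:

```typescript
// One log record per Cloudflare-fronted request. Field names are
// illustrative, not a fixed schema; map them onto your own logging.
interface CfRequestLog {
  url: string;
  method: string;
  status: number;
  cfRay?: string;            // cf-ray response header
  cfMitigated?: string;      // "challenge" on Challenge Page responses
  contentType?: string;
  bodyPrefix: string;        // first 2 KB: enough to spot challenge markers
  responseHeaders: Record<string, string>;
  requestHeaders: Record<string, string>; // what was actually sent
  proxyAsn?: number;
  proxyCountry?: string;
  sessionId: string;
  cookieAgeMs: number;
  durationMs: number;
  retryNumber: number;
}

// Assemble the record from raw pieces. Header keys are assumed lowercase.
function toCfLog(input: {
  url: string; method: string; status: number; body: string;
  responseHeaders: Record<string, string>;
  requestHeaders: Record<string, string>;
  sessionId: string; cookieAgeMs: number;
  durationMs: number; retryNumber: number;
  proxyAsn?: number; proxyCountry?: string;
}): CfRequestLog {
  const { body, responseHeaders, ...rest } = input;
  return {
    ...rest,
    responseHeaders,
    cfRay: responseHeaders["cf-ray"],
    cfMitigated: responseHeaders["cf-mitigated"],
    contentType: responseHeaders["content-type"],
    bodyPrefix: body.slice(0, 2048),
  };
}
```

The record is written before any retry decision, which is the point: every later step in this checklist needs this evidence to exist.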

Step 1: classify the response body

Do this before reading the status code.

type CloudflareShape =
  | "real_page"
  | "challenge_page"
  | "turnstile"
  | "waf_error"
  | "rate_limited"
  | "unknown_block";

export function classifyCloudflareResponse(input: {
  status: number;
  headers: Record<string, string | undefined>; // keys assumed lowercase
  body: string;
}): CloudflareShape {
  // The first 20 KB is enough to catch every marker we care about.
  const body = input.body.slice(0, 20_000).toLowerCase();
  const mitigated = input.headers["cf-mitigated"];

  // Order matters: the explicit header beats any body heuristic.
  if (mitigated === "challenge") return "challenge_page";
  if (body.includes("cf-turnstile")) return "turnstile";
  if (body.includes("challenges.cloudflare.com/turnstile")) return "turnstile";
  if (body.includes("/cdn-cgi/challenge-platform/")) return "challenge_page";
  if (body.includes("error 1015") || input.status === 429) return "rate_limited";
  if (body.includes("error 1020") || body.includes("access denied")) return "waf_error";
  if (input.status === 403 || input.status === 503) return "unknown_block";
  return "real_page";
}

This is not magic. It is hygiene.

A 200 with a challenge body is not a success. A 503 with /cdn-cgi/challenge-platform/ is not an origin outage. A 1015 is not fixed by another stealth plugin. Your first job is to stop treating all of them as "retry later."

Step 2: read the status and Cloudflare code together

Status code alone is too coarse. Read the body and the Cloudflare error number.

cf-mitigated: challenge (challenge page): detect as a failure, do not parse the body
cf-turnstile in the body (Turnstile): a browser or token path may be required
403 with no error code (WAF or bot score): inspect fingerprint, headers, and IP
1020 (custom WAF rule): identify which request attribute the rule matched
1010 (browser fingerprint classified as automation): fix the TLS and HTTP/2 fingerprint
1015 or 429 (rate limit): back off and reduce per-host concurrency
503 plus a challenge script (interstitial challenge): persist clearance and retry coherently
Tiny word count (shell, challenge, or blocked variant): do not accept it as extracted content

The goal is not to memorize codes. The goal is to stop changing the wrong variable.
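A small helper can pull the error number out of a block page body so the router can switch on it. The regex is a heuristic against common block-page templates, not a parser for every variant Cloudflare can serve:

```typescript
// Pull the Cloudflare error number (1010, 1015, 1020, ...) out of a
// block page body. Heuristic: matches "Error 1020" and "error code: 1015"
// style templates, returns null when no code is present.
function cloudflareErrorCode(body: string): number | null {
  const m = body.toLowerCase().match(/error[\s:]+(?:code[\s:]+)?(10\d\d)\b/);
  return m ? Number(m[1]) : null;
}
```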

Step 3: check whether the network fingerprint matches the claim

Cloudflare's JA4 Signals post explains the direction clearly. JA4 fingerprints alone are not enough, so Cloudflare also computes inter-request features from traffic over the last hour. The post lists signals such as browser ratio, cache ratio, HTTP/2 and HTTP/3 ratio, request quantiles, and IP quantiles for a JA4 fingerprint.

That matters for scrapers because a request can look wrong before JavaScript ever runs.

Common mismatches:

Chrome User-Agent, but a TLS ClientHello from Python, Go, Node, or curl
Chrome on macOS, but a Linux container browser surface
Browser traffic, but no Client Hints or the wrong Client Hints
A normal session, but no cookies, no cache, no asset requests
A local user, but a proxy country that does not match the language or the site's market
Human browsing, but direct deep links at machine cadence

Cloudflare's Detection IDs docs also mention detection tags for categories Cloudflare has fingerprinted, including a go tag for traffic observed from a Go programming language bot. Do not read that as "Cloudflare hates Go." Read it as evidence that implementation fingerprints are visible.

If the connection says "library" and the User-Agent says "Chrome", the lie is the signal.
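You can catch the cheapest lies before the request leaves. This sketch flags header-level claim mismatches; it cannot see the TLS layer, which needs a browser-grade client, but it stops you from shipping a Chrome User-Agent with no Client Hints. The checks are heuristics I chose for illustration, not an exhaustive list:

```typescript
// Flag obvious claim mismatches in the headers you are about to send.
// This cannot see the TLS ClientHello, but it catches the cheap lies
// first. Pure heuristics: extend as you find new self-contradictions.
function headerClaimMismatches(h: Record<string, string>): string[] {
  const issues: string[] = [];
  const ua = (h["user-agent"] ?? "").toLowerCase();
  const platform = (h["sec-ch-ua-platform"] ?? "").toLowerCase();

  if (ua.includes("chrome") && !h["sec-ch-ua"]) {
    issues.push("Chrome UA but no sec-ch-ua Client Hints");
  }
  if (ua.includes("macintosh") && platform !== "" && !platform.includes("macos")) {
    issues.push("macOS UA but sec-ch-ua-platform says " + platform);
  }
  if (ua.includes("windows") && platform !== "" && !platform.includes("windows")) {
    issues.push("Windows UA but sec-ch-ua-platform says " + platform);
  }
  if (!h["accept-language"]) {
    issues.push("browser-like UA but no Accept-Language header");
  }
  return issues;
}
```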

Step 4: decide whether this needs JavaScript

A lot of teams launch a browser because Cloudflare is involved. That is expensive and often unnecessary.

Ask one question first: does the page content exist in the first HTML response?

If yes, a browser-grade HTTP client is usually the right first move. Match the TLS, HTTP/2, headers, locale, and proxy geography. Then parse the HTML.

If no, you need one of these:

1. The underlying JSON endpoint the page uses.

2. Browser rendering for the page.

3. A token or clearance flow if the page explicitly requires it.

The mistake is making browser rendering the default for every Cloudflare page. It hides the real failure and makes the system slower. Use it when the content or the challenge requires JavaScript, not because the domain uses Cloudflare.
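One way to answer the "does the content exist in the first HTML response" question mechanically: strip tags and count visible words, then check a per-target content marker. The 150-word threshold is illustrative and should be tuned per site:

```typescript
// Rough heuristic for "does the first HTML response already carry the
// content, or is it a JavaScript shell?" Strips tags crudely and counts
// visible words. Threshold and marker are illustrative: tune per target.
function looksLikeServerRendered(html: string, contentMarker: RegExp): boolean {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, " ")
    .replace(/<style[\s\S]*?<\/style>/gi, " ")
    .replace(/<[^>]+>/g, " ");
  const words = text.split(/\s+/).filter(Boolean).length;
  return words > 150 && contentMarker.test(html);
}
```

If this returns false and the response was not a challenge, you are looking at a shell: go hunting for the JSON endpoint before you reach for rendering.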

Step 5: keep sessions coherent

Cloudflare's docs say the __cf_bm cookie measures a user's request pattern and helps generate a reliable bot score for that user's requests. Their JavaScript Detections docs also describe a cf_clearance cookie that stores the JavaScript Detections outcome.

For a scraper, this means stateless retry loops are suspicious by design.

Bad pattern:

1. New proxy.

2. New browser context.

3. No cookies.

4. Deep product URL.

5. Same request every two seconds.

Better pattern:

1. Reuse a session per host.

2. Keep cookies between requests.

3. Keep language and proxy geography aligned.

4. Back off after challenge or rate-limit responses.

5. Escalate only after classifying the block.

You do not need to fake a full human life story. You do need the request sequence to be internally consistent.
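A minimal per-host session store makes the better pattern concrete. This is a sketch: name=value cookie handling only, no expiry or path logic, which a real implementation would need:

```typescript
// One session per host: a cookie jar plus enough metadata to reason
// about session age. Sketch only: no cookie expiry or path handling.
interface HostSession {
  cookies: Map<string, string>;
  createdAt: number;
  requestCount: number;
}

const sessions = new Map<string, HostSession>();

// Reuse the session for a host instead of starting stateless each time.
function sessionFor(host: string): HostSession {
  let s = sessions.get(host);
  if (!s) {
    s = { cookies: new Map(), createdAt: Date.now(), requestCount: 0 };
    sessions.set(host, s);
  }
  s.requestCount += 1;
  return s;
}

// Merge Set-Cookie values into the jar (name=value only).
function absorbSetCookies(s: HostSession, setCookies: string[]): void {
  for (const line of setCookies) {
    const [pair] = line.split(";");
    const eq = pair.indexOf("=");
    if (eq > 0) s.cookies.set(pair.slice(0, eq).trim(), pair.slice(eq + 1));
  }
}

function cookieHeader(s: HostSession): string {
  const parts: string[] = [];
  s.cookies.forEach((v, k) => parts.push(`${k}=${v}`));
  return parts.join("; ");
}
```

With this in place, __cf_bm and cf_clearance survive between requests, which is exactly the continuity the scoring expects.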

Step 6: separate WAF blocks from rate limits

A 1020 and a 1015 are not cousins.

1015 means rate limit. The fix is mechanical: slow down, respect Retry-After, reduce per-host concurrency, spread requests across more exits if the use case allows it.

1020 means a custom rule matched. Cloudflare's custom rules docs show how site owners can combine bot score with URI path, ASN, country, JA3/JA4 fingerprint, user agent, and other request fields. That is a very different problem.

If you hit 1020, changing speed may do nothing. The rule probably matched what the request is, not how often it runs.
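For the 1015 case, the mechanical fix is a few lines: honor Retry-After when the response carries it, otherwise back off exponentially with jitter. The 60-second cap and 250 ms jitter are illustrative values, not recommendations from Cloudflare:

```typescript
// Backoff for 1015/429: honor Retry-After when present, otherwise
// exponential backoff with jitter. Cap and jitter values are illustrative.
function rateLimitDelayMs(retryAfter: string | undefined, attempt: number): number {
  if (retryAfter) {
    const secs = Number(retryAfter);
    if (Number.isFinite(secs) && secs >= 0) return secs * 1000;
    const at = Date.parse(retryAfter); // Retry-After can also be an HTTP-date
    if (!Number.isNaN(at)) return Math.max(0, at - Date.now());
  }
  const base = Math.min(60000, 1000 * Math.pow(2, attempt));
  return base + Math.floor(Math.random() * 250); // jitter de-syncs workers
}
```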

Step 7: write the retry policy last

Retries are useful after classification. They are harmful before it.

Use a policy like this:

real_page: accept only if content markers are present
challenge_page: retry with session continuity or escalate
turnstile: use a real browser or token path if allowed
waf_error: change fingerprint, headers, geo, or path
rate_limited: respect backoff and reduce concurrency
unknown_block: store the body, Ray ID, and headers, and stop blind retries

The worst retry policy is "same request, different proxy, ten times." That creates more negative history for every layer Cloudflare cares about.
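The policy translates directly into code. The shape type is repeated here so the snippet stands alone; the action names are placeholders for whatever your pipeline calls these moves:

```typescript
// Repeated from the classifier so this snippet is self-contained.
type CloudflareShape =
  | "real_page" | "challenge_page" | "turnstile"
  | "waf_error" | "rate_limited" | "unknown_block";

// Action names are placeholders for your pipeline's own moves.
type Action =
  | { kind: "accept" }
  | { kind: "retry_with_session" }
  | { kind: "escalate_browser" }
  | { kind: "change_fingerprint" }
  | { kind: "backoff" }
  | { kind: "stop_and_store" };

// One place that turns a classified response into the next move.
function nextAction(shape: CloudflareShape, hasContentMarkers: boolean): Action {
  switch (shape) {
    case "real_page":
      // A 200 without content markers is a silent block, not a success.
      return hasContentMarkers ? { kind: "accept" } : { kind: "stop_and_store" };
    case "challenge_page": return { kind: "retry_with_session" };
    case "turnstile":      return { kind: "escalate_browser" };
    case "waf_error":      return { kind: "change_fingerprint" };
    case "rate_limited":   return { kind: "backoff" };
    case "unknown_block":  return { kind: "stop_and_store" };
  }
}
```

Putting every routing decision through one function also means you get one log line per decision, which is the evidence trail the next debugging session will need.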

A concrete debugging flow

Here is the flow I use when a Cloudflare target starts failing.

1. Fetch once with logging enabled.

2. Store status, headers, first 20 KB of body, proxy metadata, duration, and session ID.

3. Classify the response body.

4. If cf-mitigated is challenge, stop parsing and mark the run as blocked.

5. If the body has a Cloudflare error number, route by that number.

6. If the response is 200 but the word count is tiny, treat it as a silent block until proven otherwise.

7. If the block is fingerprint-shaped, move to a browser-grade HTTP client.

8. If the content is JavaScript-only, escalate to rendering.

9. If the block is rate-shaped, reduce concurrency before changing fingerprints.

10. If it still fails, keep the Ray ID and the exact request. Do not guess.

That flow is boring. Boring is good. Boring means your scraper is producing evidence instead of folklore.

How webclaw handles this

webclaw routes a scrape through the same idea.

The fast path is a browser-grade HTTP fetch. It keeps the request coherent across fingerprint, headers, locale, and proxy geography. The response classifier checks for Cloudflare challenge markers, Turnstile markers, WAF bodies, status codes, content size, and extraction quality.

If the response is a real page, it extracts markdown, text, JSON, or LLM-ready content. If the response is a challenge, it does not hand you that HTML as success. It escalates.

import { Webclaw } from "@webclaw/sdk";

const client = new Webclaw({
  apiKey: process.env.WEBCLAW_API_KEY,
});

const page = await client.scrape({
  url: "https://target.example/product/123",
  formats: ["markdown", "llm"],
});

console.log(page.markdown);

Same thing over REST:

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://target.example/product/123", "formats": ["llm"]}'

Full reference: scrape API docs. If you are migrating from a browser-first stack, start with the Cloudflare error code guide and the TLS fingerprinting guide.

What to remember

Cloudflare does not block "scrapers" in one generic way.

It scores requests. It sees fingerprints. It runs JavaScript Detections when configured. It lets site owners write custom rules with bot scores, JA3/JA4, ASN, path, country, user agent, and detection IDs. It has challenge responses that can look like ordinary HTML if you only check status code.

So the fix is not one more header, one more proxy, or one more stealth plugin.

The fix is a scraper that knows what happened.

Log the layer. Classify the block. Change the right thing.