Massi

Residential Proxies for Self-Hosted webclaw Scraping

You run the webclaw CLI on your own box. It works for the first hundred pages. Then a site that returned clean markdown an hour ago starts handing you 429s, captchas, or a page localized to the wrong language. Your code did not change. The site noticed one IP pulling pages faster than any human would, and it throttled you.

Every self-hosted scraper hits this. One origin IP carries one reputation, one rate budget, and one location. Spread the same volume across hundreds of IPs in the right countries and the per-IP request rate drops below the threshold that triggers blocks. The site sees ordinary traffic from many places.

This guide puts ColdProxy residential proxies in front of the webclaw CLI: a single proxy first, then a rotating pool across a crawl, then geo-targeting by country. webclaw is open source, so you wire up the proxy layer yourself. ColdProxy supplies the IPs. Every command below uses a real webclaw flag, so once you swap in your own credentials and URLs you can paste it straight into a terminal.

Key takeaways

  • A self-hosted scraper sends every request from one IP, so it gets rate-limited or geo-walled fast. Spread requests across many proxy IPs to fix both.
  • webclaw's hosted cloud handles proxies for you. The open-source CLI does not, so self-hosters bring their own. That is the audience for this guide.
  • Set one proxy with WEBCLAW_PROXY or --proxy. Rotate a pool with --proxy-file, where each proxy gets its own client and TLS fingerprint.
  • Residential IPs look like real home connections and suit geo-specific and protected targets. Datacenter IPs run faster and cost less for high-volume crawling of tolerant endpoints.
  • Geo-target by putting one country-tagged ColdProxy endpoint per line in the pool file. ColdProxy covers 195+ countries with country, city, ZIP, and ASN targeting.
  • Proxies fix throughput and IP reputation. They do not handle JS rendering or challenge solving. For heavily protected sites, switch to webclaw's managed cloud.

    Why a self-hosted scraper needs its own proxies

    webclaw comes two ways. The hosted cloud at webclaw.io runs the full extraction pipeline on managed infrastructure with proxies included: you send a URL, you get content back, and IPs never enter the picture. The open-source CLI runs on your machine and gives you the same extraction engine without the managed plumbing. This guide covers the CLI.

    When you self-host, your scraper inherits the network identity of the machine it runs on. A laptop, a VPS, a CI runner: each has a single egress IP, and every request to a target stamps that same address. The target counts requests per IP over time, and once you cross the rate it tolerates, it throttles or bans the IP. A datacenter VPS IP often gets flagged as non-human before you send a single request.

    Proxies break the one-IP bottleneck. Route each request through a different upstream address and the target sees a spread of clients. Residential proxies go one step further: they use real consumer IPs from actual ISPs, so to the target your scraper reads as an ordinary home visitor. That can decide whether a request returns the page or a block.

    Residential vs datacenter proxies for scraping

    Residential and datacenter proxies fit different jobs. Residential IPs come from consumer ISP connections and carry the trust of a real household, which matters when a site scrutinizes IP reputation or serves region-locked content. Datacenter IPs live in server farms. They run faster and cost less per gigabyte, and they work on endpoints that do not inspect IP origin closely, like public APIs or sitemaps.

    |                         | Residential | Datacenter |
    |-------------------------|-------------|------------|
    | IP source               | Real consumer ISP connections | Server farms / cloud ranges |
    | Trust with strict sites | High, reads as a home user | Lower, often pre-flagged |
    | Speed                   | Good | Fastest |
    | Cost per request        | Higher | Lowest |
    | Geo-targeting           | Country / city / ZIP / ASN | Country-level |
    | Best for                | Region-specific testing, localized content, public-data collection, market monitoring | High-volume crawling of tolerant endpoints |

    A practical default: pick residential when the target cares who is asking or where you are, and pick datacenter when you need to move a lot of pages cheaply from a forgiving source. ColdProxy offers residential IPv4, residential IPv6, and datacenter IPv6 from one dashboard, so you can mix both in the same project. ColdProxy's residential vs datacenter breakdown covers the trade-offs in more depth.

    Step 1: Install webclaw

    Pick whichever install path fits your machine. Homebrew is the shortest on macOS and Linux:

    brew tap 0xMassi/webclaw && brew install webclaw

    With a Rust toolchain, build from source via cargo:

    cargo install --git https://github.com/0xMassi/webclaw.git webclaw-cli

    Want a prebuilt binary with no toolchain? Grab one from the GitHub releases page and drop it on your PATH.

    Confirm it runs by scraping a page with no proxy yet:

    webclaw https://example.com --format markdown

    You should get clean markdown on stdout. Once that works, you can route it through ColdProxy.

    Step 2: Get your ColdProxy endpoint

    Sign in to the ColdProxy dashboard and open the proxy product you chose in the residential vs datacenter comparison above (residential IPv4 is a safe starting choice). The dashboard gives you four pieces:

  • a host (the proxy gateway hostname)
  • a port
  • a username
  • a password

    ColdProxy uses a username-tag scheme, so the username string also carries your targeting options. You select a country, a sticky or rotating session, and similar controls in the dashboard, and those choices fold into the username it hands you. Copy the values as shown.

    Assemble them into a standard proxy URL. The shape webclaw expects is http://USERNAME:PASSWORD@HOST:PORT. With the four values in hand, you have everything for the next step.
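
The assembly is plain string formatting. A minimal sketch, assuming a hypothetical helper named proxy_url; the host, port, and credentials below are placeholders, not real ColdProxy values:

```shell
# Hypothetical helper: build the proxy URL from the four dashboard values.
# Usage: proxy_url HOST PORT USERNAME PASSWORD
proxy_url() {
  printf 'http://%s:%s@%s:%s\n' "$3" "$4" "$1" "$2"
}

# Placeholder values for illustration only.
proxy_url gw.example.test 8000 myuser mypass
# prints http://myuser:mypass@gw.example.test:8000
```

If the password contains characters that are reserved in URLs, such as @ or /, they generally need percent-encoding before they go into the URL. That is a general URL rule, not something specific to webclaw.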

    Step 3: Scrape through a single ColdProxy proxy

    The cleanest way to set a proxy is the WEBCLAW_PROXY environment variable. Export it once and every webclaw call in that shell routes through it:

    export WEBCLAW_PROXY="http://USERNAME:PASSWORD@HOST:PORT"
    webclaw https://example.com --format markdown

    To keep the proxy out of your environment, pass it inline with --proxy on the single command instead:

    webclaw https://example.com --proxy "http://USERNAME:PASSWORD@HOST:PORT" --format markdown

    Run the scrape and check the page for anything that echoes your apparent location, like a currency or a language banner. With the proxy working, that location reflects the ColdProxy exit IP, not your real one. A single proxy covers low-volume work. Once you crawl a whole site, you want rotation.

    Step 4: Rotate a ColdProxy pool across a crawl

    Past a few hundred requests, spread the load across many IPs. webclaw reads a pool from a plain text file: one proxy per line in host:port:user:pass format, with # lines ignored as comments. Create coldproxy.txt:

    # residential IPv4
    HOST:PORT:USERNAME:PASSWORD
    HOST:PORT:USERNAME:PASSWORD
    # datacenter IPv6
    HOST:PORT:USERNAME:PASSWORD
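
The pool file uses a different field order than the proxy URL from Step 3, so it is easy to transpose values when copying by hand. A small sketch of the conversion, assuming a hypothetical helper named url_to_pool_line and placeholder credentials:

```shell
# Hypothetical helper: convert http://USER:PASS@HOST:PORT to HOST:PORT:USER:PASS.
url_to_pool_line() {
  utp_rest="${1#*://}"            # drop the scheme
  utp_creds="${utp_rest%@*}"      # everything before the last @ -> USER:PASS
  utp_hostport="${utp_rest##*@}"  # everything after the last @  -> HOST:PORT
  printf '%s:%s\n' "$utp_hostport" "$utp_creds"
}

# Placeholder values for illustration only.
url_to_pool_line "http://myuser:mypass@gw.example.test:8000"
# prints gw.example.test:8000:myuser:mypass
```

Appending is then one line, for example url_to_pool_line "$WEBCLAW_PROXY" >> coldproxy.txt. The sketch does not handle a password that itself contains a colon, since the colon-delimited format would make that line ambiguous.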

    Point a crawl at the file with --proxy-file. webclaw rotates the pool per request, and each proxy gets its own client with its own TLS fingerprint, so the rotation goes beyond a swapped IP:

    webclaw https://docs.example.com --crawl --depth 2 --max-pages 200 \
      --concurrency 10 --delay 200 --proxy-file coldproxy.txt --format markdown

    That crawls two levels deep, caps the job at 200 pages, runs 10 requests in parallel, and waits 200ms between requests on each worker. With 10 proxies in the pool, no single IP carries the full crawl rate. The same --proxy-file works for batch jobs over a fixed URL list:

    webclaw --urls-file urls.txt --proxy-file coldproxy.txt --concurrency 10 --format json

    You can also set the pool through the WEBCLAW_PROXY_FILE environment variable instead of the flag, the same way WEBCLAW_PROXY mirrors --proxy. ColdProxy's sticky vs rotating session guide covers when to lock an IP for a multi-step session versus rotating on every call, which maps onto whether you want one proxy line or a full pool here.

    Geo-targeting by country

    A lot of scraping fails not because the site blocks you but because you sit in the wrong country. Pricing pages, search results, and content libraries change by region. Scrape a US storefront from a German IP and you get euros, German copy, and a different catalog than a US shopper sees.

    ColdProxy targets 195+ countries, with city, ZIP, and ASN precision on top of country. The targeting rides in the username tag you copied from the dashboard, so a US endpoint and a UK endpoint differ only in their credentials. To collect region-correct data, put one endpoint per country in the pool file and label each with a comment:

    # United States exit
    US_HOST:PORT:US_USERNAME:US_PASSWORD
    # United Kingdom exit
    UK_HOST:PORT:UK_USERNAME:UK_PASSWORD
    # Germany exit
    DE_HOST:PORT:DE_USERNAME:DE_PASSWORD

    To pin a whole crawl to one country, build a pool of only that country's endpoints. To sample a site across markets, mix countries in the file and run the crawl. Because webclaw rotates the pool per request, each page is fetched through one of those exits and comes back as a local visitor there would see it. ColdProxy's geo-targeting guide documents the city, ZIP, and ASN syntax for drilling below the country level.
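
If you want the same page through every market, rather than through whichever exit the rotation happens to pick, one option is to skip the pool and run one single-proxy fetch per country. A sketch under assumptions: compare_markets, the WEBCLAW_BIN override (handy for a dry run with echo), and the page.LABEL.md output names are all hypothetical. Only --proxy and --format come from the commands above.

```shell
# Hypothetical helper: fetch one URL once per country-tagged proxy URL.
# Usage: compare_markets URL LABEL=PROXY_URL [LABEL=PROXY_URL ...]
compare_markets() {
  cm_url="$1"
  shift
  for cm_pair in "$@"; do
    cm_label="${cm_pair%%=*}"   # text before the first =
    cm_proxy="${cm_pair#*=}"    # text after the first =
    # WEBCLAW_BIN is an assumed override; it defaults to the real CLI.
    "${WEBCLAW_BIN:-webclaw}" "$cm_url" --proxy "$cm_proxy" --format markdown \
      > "page.${cm_label}.md"
  done
}

# Example call (placeholders, not run here):
#   compare_markets https://example.com \
#     us="http://US_USERNAME:US_PASSWORD@US_HOST:PORT" \
#     de="http://DE_USERNAME:DE_PASSWORD@DE_HOST:PORT"
```

Each run writes one markdown file per label, which you can then diff to see how price, language, or catalog changes between markets.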

    Reliability beyond the proxy layer

    Good proxies get you part of the way. Three webclaw flags carry most of the reliability load on a self-hosted crawl.

  • --concurrency sets how many requests run in parallel. Higher numbers finish faster and lean harder on the target. Start at 10 and raise it only if both the site and your pool absorb the load without errors climbing.
  • --delay adds a pause in milliseconds between requests per worker, which smooths your request pattern so it reads as less mechanical. 200ms is a reasonable floor for a polite crawl.
  • --timeout caps how long a single request waits before giving up, so one slow proxy does not stall the job.

    Match concurrency to pool size. Ten parallel workers against three proxies means each IP eats heavy load and the pool advantage disappears. Size the pool large enough that per-IP rate stays low, then tune concurrency under it. Watch your error rate as you scale. A creeping share of 429s or timeouts tells you to add proxies, lower concurrency, or raise the delay before the target bans IPs outright.
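
The arithmetic behind that advice can be sketched in a few lines. This is a rough ceiling, not a measurement: it assumes requests are spread evenly across the pool and ignores response time, so real rates will be lower. The helper names pool_size and max_per_ip_rate are hypothetical.

```shell
# Count usable entries in a pool file: skip blank lines and # comments.
pool_size() {
  grep -c -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" || true
}

# Upper bound on requests per minute per IP, rounded down.
# Assumes even rotation and zero response time; DELAY_MS must be above 0.
# Usage: max_per_ip_rate CONCURRENCY DELAY_MS POOL_SIZE
max_per_ip_rate() {
  echo $(( $1 * 60000 / $2 / $3 ))
}

max_per_ip_rate 10 200 3    # ten workers, 200ms delay, three proxies: prints 1000
max_per_ip_rate 10 200 10   # same crawl over ten proxies: prints 300
```

To feed it the real pool, pass the count in directly, for example max_per_ip_rate 10 200 "$(pool_size coldproxy.txt)". If the result is far above what one human visitor could produce, grow the pool or raise the delay before raising concurrency.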

    Pool hygiene matters over a long run. Dead or slow proxies drag down throughput and inflate error rates, so prune them and refresh the file. ColdProxy's pool management guidance covers keeping a rotating set clean. On your side, log status codes per run so you catch a degrading pool before it tanks a crawl.
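
One way to turn logged status codes into a single number to watch is sketched below. It assumes you have collected one HTTP status code per line by whatever means your setup allows, with 000 standing for a timeout. Both the log format and the error_share helper are hypothetical; the source does not describe how webclaw reports status codes.

```shell
# Print the percentage (rounded down) of requests that failed.
# Reads one status code per line on stdin.
# Counts 429, any 5xx, and 000 (assumed timeout marker) as failures.
error_share() {
  awk '
    NF  { total++; if ($1 == 429 || $1 >= 500 || $1 == 0) bad++ }
    END { if (total) printf "%d\n", bad * 100 / total; else print 0 }
  '
}

# Illustrative input, not real crawl data.
printf '200\n200\n429\n200\n' | error_share
# prints 25
```

Comparing this figure between runs shows whether a pool is degrading, even when each individual run still finishes.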

    When to use webclaw's managed cloud instead

    Proxies have a ceiling. Rotation helps with throughput and IP reputation. It does not replace request fingerprinting, JS rendering, or challenge handling for heavily protected sites. For those, use webclaw's hosted cloud mode (set WEBCLAW_API_KEY), which handles all three for you.

    The line is concrete. If a site serves its content in the initial HTML and only rations requests by IP, a good ColdProxy pool with sane concurrency and delay carries you a long way. If the page renders its content with JavaScript after load, throws an interactive challenge, or fingerprints the browser beyond the TLS layer, no proxy alone solves it. You would rebuild a rendering and challenge pipeline by hand, which is the work the managed cloud already does. Self-host with ColdProxy for tolerant-to-moderate targets and bulk collection. Reach for the cloud when the target fights back at the browser level.
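
A quick way to see which side of that line a target falls on is to look for a phrase you can see in the browser inside the raw HTML. A minimal sketch, assuming a hypothetical helper named html_contains; the URL and phrase in the usage comment are placeholders:

```shell
# Succeeds if PHRASE appears literally in the HTML read from stdin.
# Usage: html_contains PHRASE < page.html
html_contains() {
  grep -qF -- "$1"
}

# Usage with curl (not run here):
#   curl -fsSL "https://example.com/pricing" | html_contains "Add to cart" \
#     && echo "in initial HTML" || echo "probably rendered after load"
```

Treat a miss as a hint, not proof. The text may be encoded differently in the markup, or the server may have returned a block page instead of the real one.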

    Frequently asked questions

    Do I need residential proxies, or will datacenter ones work?

    Depends on the target. Datacenter IPv6 runs faster and costs less, and it works on public APIs, sitemaps, and sites that do not scrutinize IP origin. Sites that check IP reputation or serve region-locked content flag datacenter ranges, so use residential IPv4 or IPv6 there. Many projects mix both: datacenter for the easy bulk pages, residential for the fussy ones.

    How many proxies should my pool have?

    Enough that no single IP carries a request rate a human would never produce. There is no fixed number; it scales with your total volume and the target's tolerance. A rough rule: keep per-IP requests per minute well under what the site tolerates from one visitor, then size the pool to hit your throughput target. If error rates climb as you scale concurrency, add proxies before pushing parallelism higher.
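
That rule of thumb is a ceiling division. A sketch with a hypothetical min_pool_size helper; the numbers are illustrative, not measured tolerances for any site:

```shell
# Smallest pool that reaches TARGET_RPM while each IP stays at or under PER_IP_RPM.
# Usage: min_pool_size TARGET_RPM PER_IP_RPM
min_pool_size() {
  echo $(( ($1 + $2 - 1) / $2 ))   # integer ceiling of TARGET_RPM / PER_IP_RPM
}

# 600 requests per minute overall, at most 20 per minute per IP.
min_pool_size 600 20
# prints 30
```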

    Does webclaw rotate proxies automatically?

    Yes, when you pass a pool with --proxy-file (or the WEBCLAW_PROXY_FILE env var). webclaw rotates per request and builds a separate client with its own TLS fingerprint for each proxy. A single --proxy or WEBCLAW_PROXY value does not rotate; it routes every request through that one endpoint.

    Why am I getting blocked even with proxies?

    Most often the target needs more than a clean IP. If it renders content with JavaScript or throws a challenge, rotation alone will not get you the page. Confirm whether the content sits in the initial HTML. If it does not, switch that target to webclaw's hosted cloud, which handles rendering and challenges. If the content is in the HTML, check your concurrency and delay; an aggressive rate from too few proxies gets IPs banned regardless of how clean they are.

    Next steps

    Two moves take you from blocked to scraping at volume. Pick up a residential proxy plan from ColdProxy, copy your endpoint, and drop it into a coldproxy.txt pool. Run the rotating crawl command from Step 4 against your real target and watch the error rate stay flat where one IP used to fall over.

    When you hit a site that renders with JavaScript or throws challenges, stop fighting it by hand. Set WEBCLAW_API_KEY and route that target through webclaw's managed cloud, which handles fingerprinting, rendering, and challenges for you. Self-host with ColdProxy where it fits, and lean on the cloud for the sites that fight back.
