
When the auditor can't reach your site: paste your own HTML, or briefly relax your CDN

The audit tools on this site fetch a URL through a serverless proxy, then parse and score whatever HTML comes back. That works for most sites. It does not work when the origin sits behind a bot-management layer that returns a JavaScript challenge page to anything that does not look like a real Chrome session.

The shipped auditors detect that case and tell you about it. They also try the Wayback Machine as a fallback so you still get something. But for sites with no archived snapshot, or for sites where you want to score the live current state instead of an old archive, the tool will still come back partly empty and tell you it was blocked.

If the site is yours, there are two clean ways to get a real audit anyway. This post walks through both.

Read this first: a blocked audit is also a blocked AI retriever

If a tool on this site shows you a "blocked by WAF / Wayback fallback" banner, the same block is hitting GPTBot, ClaudeBot, OAI-SearchBot, ChatGPT-User, Claude-User, PerplexityBot, Perplexity-User, Applebot, Amazonbot, MistralAI-User, Meta's external agent, and crawlers such as CCBot and Bytespider. Every one of those agents fetches your page from a non-browser context the same way the audit proxy does. They get the same challenge body, read the same noindex,nofollow meta tag, and walk away.

This is the most important takeaway in the post: a blocked audit is a leading indicator that your site is invisible to AI search. Even if you do not care about the audit specifically, the audit being blocked tells you something measurable about how AI engines see your site, and the fix is the same.

The companion post How to allowlist AI crawlers without weakening bot protection covers the long-term fix. The 499 problem: Cloudflare's signal that your AI eligibility is collapsing covers the metric to watch in your logs. This post focuses on the short term: getting an actual audit done today on a site you own.

Why this happens at all

Cloudflare Bot Management, Vercel's Security Checkpoint, AWS WAF challenge actions, Akamai Bot Manager, Imperva, Sucuri, and DataDome all share the same defensive instinct: anything that arrives without a full browser fingerprint is treated as suspect. The default response is HTTP 403 (or sometimes a 200) carrying a small JavaScript challenge body. A real browser solves the challenge in milliseconds and the user never notices. A serverless function fetching the URL receives the challenge body, no JavaScript runs, and the audit tool ends up parsing a thousand bytes of obfuscation with <meta name="robots" content="noindex,nofollow"> baked in.
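
You can reproduce the failure mode from a terminal. A minimal sketch, assuming curl is available and using a placeholder hostname: fetch with curl's default (non-browser) user agent and look at the status code and body size.

# Bare fetch: no browser fingerprint, curl's default user agent.
curl -s -o challenge.html -w "status=%{http_code} bytes=%{size_download}\n" "https://yoursite.com/"

# A 403 (or a suspiciously small 200) plus a noindex hit means you got the challenge body.
grep -io "noindex" challenge.html | head -n 1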

This is the same behavior that hides your pages from GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and ChatGPT-User. If a remote auditor cannot read your site, AI retrievers cannot read it either. The fix paths are the same in both cases. This post covers the audit-only path. The full retriever-allowlisting walkthrough lives at Cloudflare is blocking AI crawlers from your site.

Path one: paste your own HTML

The fastest path for a one-time audit is to fetch the page yourself, in your own browser, and feed the HTML directly to the tools that accept it.

Step one: get the HTML. In any modern browser, open the page you want to audit, hit Ctrl+U (Windows/Linux) or Option+Cmd+U (macOS) to view the source, then Ctrl+A and Ctrl+C to copy it. Or use File → Save Page As and pick "Webpage, HTML Only" if you prefer a file you can open later. Either way you now have the exact HTML the browser received, post-CDN, post-challenge.

A small caveat: View Source shows the HTML the server sent. If your page is a single-page app that renders content client-side, View Source will show an empty shell. In that case use the DevTools Elements panel — right-click anywhere on the page, choose "Inspect", right-click the <html> tag in the Elements pane, and pick "Copy outerHTML". That captures the post-render DOM, which is what AI retrievers and accessibility scanners need anyway.

Step two: pick a tool that accepts pasted HTML. Several of the audit tools have a paste-HTML mode in addition to URL fetch; the cheat sheet later in this post lists which ones.

The trade-off with paste mode is that any tool that needs to fetch sibling resources (sitemap.xml, robots.txt, CSS files, images for actual byte sizes) will not have access to those when you only paste the page HTML. For those, path two is the right move.

Step three: cURL or wget if the page does not render client-side. If you prefer the command line, this is the same content the audit proxy would get if it were unblocked:

curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36" \
     -H "Accept: text/html,application/xhtml+xml" \
     -L --max-time 15 \
     "https://yoursite.com/" > page.html

Open page.html in a browser to confirm it is the real content rather than a challenge response, then paste it into the tool.
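
If you prefer to verify from the terminal, here is a quick sanity check on the saved file (a sketch assuming the page.html produced above; it only looks for the obvious markers):

# The real page should carry your own <title>; a challenge body usually does not.
grep -io "<title>[^<]*</title>" page.html

# If this prints a match, you are still looking at a challenge response.
grep -io 'content="noindex' page.html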

Path two: temporarily relax the rule on your own site

If you would rather audit live, you can disable the bot challenge for a few minutes while the audit runs. Do this only on a site you own. Do not leave it open afterward — the same rules that block the audit also block credential-stuffing attempts and scraping bots. Re-enable when you are done.

The exact UI differs by host. Here is where the toggle lives on each of the major platforms.

Cloudflare

The blanket option is in Security → Bots → Configure. The "Bot Fight Mode" toggle is the simplest one to flip; turning it off allows everything that is not on Cloudflare's known-bad list. For a more surgical move that does not weaken protection sitewide, go to Security → WAF → Custom Rules and add a rule with the action "Skip" for the IP you are auditing from. The Cloudflare docs cover the full skip-rule syntax including how to scope by IP, ASN, or user-agent string (WAF Custom Rules → Skip).

If you only need to allowlist one auditor IP, find your audit machine's public IP at whatismyip.com and write the rule as: (ip.src eq YOUR.IP.ADDRESS.HERE) with action Skip → check "All remaining custom rules" + "Bot Fight Mode" + "Super Bot Fight Mode". Save, run the audit, then disable the rule.
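
If you would rather do it from the API (the reference prompt later in this post asks for exactly that), this is roughly what the call looks like with the Rulesets API. Treat it as a sketch: the zone ID, API token, ruleset ID, and IP are placeholders, and you should confirm the endpoint and payload against Cloudflare's current Rulesets documentation before running it.

# 1. Find the custom-rules entrypoint ruleset for the zone (note the ruleset id in the response).
curl -s "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/phases/http_request_firewall_custom/entrypoint" \
     -H "Authorization: Bearer $CF_API_TOKEN"

# 2. Add a temporary skip rule to that ruleset.
curl -s -X POST "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/rulesets/$RULESET_ID/rules" \
     -H "Authorization: Bearer $CF_API_TOKEN" \
     -H "Content-Type: application/json" \
     --data '{
       "description": "Temporary skip for audit IP",
       "expression": "(ip.src eq 203.0.113.7)",
       "action": "skip",
       "action_parameters": { "ruleset": "current" }
     }'

# 3. When the audit is done, delete the rule again:
#    DELETE /zones/$ZONE_ID/rulesets/$RULESET_ID/rules/$RULE_ID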

For the long-term path that allowlists AI retrievers without weakening protection, the Cloudflare AI crawler allowlist guide covers the verified-bots toggle and the per-bot UA skip patterns.

Vercel

Vercel's bot defense lives in two places. The Security Checkpoint is enabled per-project under Settings → Security → Attack Challenge Mode. Toggle it off, run the audit, toggle it back on. The Firewall (a separate feature) is at Settings → Security → Firewall — disable any rules with the Challenge action while you audit, then re-enable. The Vercel docs page is Attack Challenge Mode.

If you only want to skip the challenge for the audit IP, the firewall supports a Skip action with an ip.src condition the same way Cloudflare does.

Netlify

Netlify itself does not run a bot-management layer at the edge. If your Netlify site is challenging requests, the rule is somewhere else: a Cloudflare or Akamai layer in front, a Netlify Edge Function you wrote yourself, or a Netlify Function that imposes its own rate limit (the _guard.mjs pattern this site uses).

Check Site Settings → Build & Deploy → Edge Functions for any installed challenge logic, and look at any netlify.toml [[redirects]] blocks that send / to a challenge page. If your site uses Netlify Forms, the form-submission endpoint can rate-limit but should not block the page itself.

GoDaddy

GoDaddy's Website Builder + Managed WordPress products sit behind a Sucuri WAF. To pause it, sign in to your GoDaddy dashboard, go to Web Security → Managed WAF → Settings, and toggle "Block Aggressive Web Crawlers" off for the duration of the audit. The Sucuri admin docs have the full panel reference at Sucuri WAF — Access Control.

If you cannot find the WAF panel in your GoDaddy dashboard, the protection might be on the Sucuri direct dashboard at waf.sucuri.net rather than in the GoDaddy UI — that is the case for the older bundled-WAF SKUs.

AWS WAF

If your site sits behind CloudFront + AWS WAF, find your Web ACL in the WAF console, locate any rule with an Action of "Challenge" or "CAPTCHA", and either disable the rule temporarily or add an override that skips it for your audit IP. An override is narrowly scoped and easy to revert, which makes it safer than disabling the rule outright. AWS docs cover the override syntax at Action overrides.
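
To locate the rule from the command line instead of the console, here is a sketch with the AWS CLI (the ACL name and id are placeholders; the CLOUDFRONT scope is always queried through us-east-1):

# List the web ACLs attached at the CloudFront scope.
aws wafv2 list-web-acls --scope CLOUDFRONT --region us-east-1

# Dump the rules and note which ones carry a Challenge or Captcha action.
aws wafv2 get-web-acl --scope CLOUDFRONT --region us-east-1 \
    --name my-web-acl --id 11111111-2222-3333-4444-555555555555 \
    --query "WebACL.Rules[].{Name:Name,Action:Action}"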

Akamai, Imperva, DataDome

For Akamai Bot Manager, the per-IP allow rule is at Bot Manager → Custom Bot Categories → Allow List. Imperva's equivalent lives under Account Settings → Security → Allow List. DataDome's exception interface is in Settings → Web Protection → Exceptions, where you add the IP and pick "Allow" as the action.

In every case the discipline is the same: scope the change as narrowly as you can, run the audit, then revert.

Reference prompt — paste this into your AI assistant

If the steps above feel too vague for your setup, here is a prompt you can paste into Claude, ChatGPT, or any other assistant. Replace the bracketed values with yours and the assistant will walk you through the exact clicks for your hosting stack.

You are a senior web-infrastructure engineer. I need to temporarily relax my CDN's bot
protection so a remote audit tool can fetch my page. After the audit, I want to fully
restore the original protection.

My setup:
- Domain: [example.com]
- DNS / CDN provider: [Cloudflare / Vercel / Netlify / GoDaddy / AWS / Akamai / Imperva / DataDome / other]
- Hosting platform: [Netlify / Vercel / WordPress / Shopify / custom Node / static / other]
- The audit tool's outbound IP (if known): [IP or "unknown — uses a serverless proxy from Netlify/AWS/GCP"]
- The audit tool's user agent (if I want to allow by UA instead of IP): [Mozilla/5.0 ... or "unknown"]
- Page I want audited: [https://example.com/specific-page]

What I want:
1. The exact menu path in my CDN dashboard to either (a) allow my audit IP, or (b) temporarily disable
   the bot challenge for the page above.
2. A reversal checklist — how to confirm that protection is back to the original posture after I am done.
3. Anything I should NOT touch (settings that look related but are unrelated and would weaken protection).
4. If my CDN supports a "skip" / "override" action that is reversible, prefer that to a "disable" toggle.
5. If a UI walkthrough is impractical, give me the exact API call (with a curl example) that does the
   same thing.

Treat this as a working-hours change. I want to audit, verify, and restore within 15 minutes.

The assistant should produce a clean step-by-step that matches the dashboard you actually use, plus the rollback. If the answer is generic, paste a screenshot of your CDN console and ask again — the second turn typically lands on the right setting once it can see the actual UI.

Manual steps for every audit type that gets blocked

The audit tools on this site fall into a handful of fetch patterns. When the proxy gets challenged, the manual workaround depends on which pattern the tool uses. This is the cheat sheet.

Single-page audits (Mega Analyzer, Site Analyzer, AI Posture, AI Citation Readiness, headings, etc.)

These tools fetch one URL and parse the HTML. Manual recovery:

  1. Open the page in your browser.
  2. View source (Ctrl+U / Option+Cmd+U) → select all → copy. If the page renders content client-side, use DevTools Elements → right-click <html> → Copy outerHTML instead.
  3. Open the tool's paste-HTML mode if it has one (Mega Analyzer, Site Analyzer, headings, image-alt, WCAG, FAQ-parity, content-velocity all do). Paste and run.
  4. If the tool only takes a URL, confirm the block really is the challenge (the page loads fine in your browser while the tool reports it was blocked), then run path two: relax the CDN rule for the audit IP and re-run.

Sitemap-driven batch audits (sitemap audit, batch analyzer, internal-link auditor, redirect-chain audit)

These tools fetch /sitemap.xml or follow links across the site. Single-page paste does not solve the batch case. Manual recovery:

  1. Open https://yoursite.com/sitemap.xml in your browser, save as a .xml file or copy the body.
  2. Several batch tools accept a pasted URL list. Extract the URLs from your sitemap (one-liner after this list) and paste them into the batch input box on the tool page.
  3. If the tool needs to fetch each listed URL itself, the temporary CDN relax (path two above) is the only option. Time-box the window: turn the rule off, run the audit, turn it back on within 15 minutes.
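
For step 2, a quick way to turn the sitemap into a pasteable URL list (a sketch assuming a single-file sitemap at the placeholder hostname; if it is a sitemap index, repeat for each child sitemap):

# Strip the sitemap down to one URL per line, ready to paste into a batch input box.
curl -s "https://yoursite.com/sitemap.xml" \
  | grep -o "<loc>[^<]*</loc>" \
  | sed -e "s/<loc>//" -e "s|</loc>||" > urls.txt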

Sibling-resource audits (security headers, DNS / email auth, AI crawler access, AI bot allowlist validator)

These tools query DNS, fetch /robots.txt, fetch /ai.txt, or probe HTTP headers from your origin. They do not parse rendered content, so paste mode is not enough.

  1. DNS-only audits (DNS + email auth, BIMI, MTA-STS) work regardless of WAF — they query DNS, not your web origin (see the dig queries after this list). If they show errors, the issue is DNS, not bot protection.
  2. Header / robots audits need the live fetch. Use path two and relax the rule briefly.
  3. If you cannot relax the rule (a client site, a regulated environment), run the audit against the dev / staging hostname instead. Most CDN setups apply bot protection to production only.
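
For the DNS-only case in step 1, you can confirm the records resolve without touching the web origin at all. A few illustrative dig queries (the domain is a placeholder):

# SPF, DMARC, and MTA-STS discovery all live in DNS, so a WAF cannot block these lookups.
dig +short TXT yoursite.com
dig +short TXT _dmarc.yoursite.com
dig +short TXT _mta-sts.yoursite.com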

Per-bot UA probes (AI Crawler Access Auditor, AI Bot Allowlist Validator)

These tools deliberately fetch your origin once per AI bot user-agent string, comparing the responses. The whole point is to detect bot challenges, so a "WAF challenge" verdict is the correct verdict, not a tool failure. Do not relax the CDN rule before running these audits — you will mask the very signal they are designed to find.

If a per-bot probe says "ClaudeBot challenged", that is a real production finding. The fix is to allowlist verified bots in the CDN (Cloudflare AI crawler allowlist), not to disable the rule.
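
To reproduce the per-bot comparison by hand, here is a rough shell sketch of the same idea (the user-agent strings are abbreviated to the bot tokens; the real crawlers send fuller UA strings and many CDNs also verify by source IP, so treat the result as indicative, not definitive):

# Fetch the same page once per bot token and compare status codes.
for ua in GPTBot ClaudeBot PerplexityBot OAI-SearchBot Amazonbot; do
  code=$(curl -s -o /dev/null -w "%{http_code}" -A "$ua" "https://yoursite.com/")
  echo "$ua -> $code"
done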

Public-data audits (Lighthouse, PageSpeed Insights, schema validator, Wayback)

If the tool relies on a third-party provider (Google PageSpeed, schema.org validator, archive.org), the third-party fetcher uses different IPs than this site's auditor proxy. Some of those will be allowlisted by your CDN already — Googlebot in particular, since most CDNs ship a verified-bot rule that lets Google through. Try the third-party tool directly first.

If those external services see your real page but the jwatte audit proxy does not, your CDN is correctly allowing verified Googlebot but blocking other automation. That is exactly the pattern that needs the verified-bots category turned on rather than left off.

Always re-run after you change CDN settings

The point of the audit is to verify the change, not to assume it. After you allowlist or skip a rule, re-run the audit with no other changes and confirm:

  • The WAF banner is no longer shown
  • The score updates to reflect the actual content (word count above zero, schema types listed, headings populated)
  • The AI Crawler Access Auditor no longer flags any bot as "WAF challenge"

Only then is the change verified. If any of those still fail, the rule scope was wrong and you need to widen it.

Even when the auditor works, recheck your blocking posture

A clean audit run does not mean your site is fully reachable to AI agents. The audit proxy is one fetcher with one user agent, hitting from one IP range. AI retrievers run dozens of fetchers from different IP ranges with different user agents. Edge cases that escape this site's proxy can still trip up GPTBot, and vice versa.

Run the AI Crawler Access Auditor and the AI Bot Allowlist Validator on your site quarterly, not just when something breaks. They probe each AI bot's published user agent against your origin, separately, and tell you which specific bot UAs are getting challenged. That is a different and more diagnostic test than the single-fetch path used by the content auditors.

If anything in those probes says "WAF challenge" for a verified retrieval bot — GPTBot, ClaudeBot, ChatGPT-User, Claude-User, PerplexityBot, Perplexity-User, OAI-SearchBot, Applebot, Bingbot — it is a citation-loss signal, not a security signal. Fix it.

Why "audit my own HTML" is sometimes better than relaxing the rule

A surprising number of audits go faster when you skip the live-fetch path entirely. Reasons to favor the paste-HTML approach:

  • You audit what humans actually see. A logged-in dashboard, a paywalled article, an A/B test variant — the auditor proxy will never get to those. Your browser already does. Save the page after you have the exact state you want scored.
  • You can audit a draft before publish. Paste the HTML from the WordPress preview pane, the Eleventy _site/ build, or the Next.js dev-server output. No need to push to production first.
  • You avoid a failure mode where the audit succeeds against the challenge page. If you forgot to verify the proxy got real content, you can score a 100% perfect challenge response — meaningless. Paste removes that ambiguity.
  • It costs nothing. No CDN setting changes, no risk window, no rollback to forget.

The trade-off is the one I mentioned above: tools that need to fetch sibling resources (/sitemap.xml, /robots.txt, JSON-LD blocks loaded via fetch, image dimensions from byte size) cannot do that when you only paste the page HTML. For those checks, the live-fetch path is required, which is when the temporary CDN relax becomes the right call.

A note on what to do if you keep getting blocked

If you find yourself repeatedly relaxing rules to audit your own site, that is the signal that the bot rules are too strict for the kind of automated retrieval you actually want. The same friction that blocks you from auditing on demand is hitting AI retrievers, search engines, monitoring services, and uptime probes constantly. The fix is the verified-bots category in your CDN, not a one-off skip rule.

Cloudflare publishes a verified-bots list that includes Googlebot, Bingbot, Applebot, Amazonbot, GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-User, Claude-Web, PerplexityBot, Perplexity-User, and DuckDuckBot. Allowing the verified set keeps the protection intact against unverified scrapers. The full setup is in the AI crawler allowlist guide.

Fact-check notes and sources

This post is informational, not security or hosting advice. Mentions of Cloudflare, Vercel, Netlify, GoDaddy, AWS, Akamai, Imperva, Sucuri, and DataDome are nominative fair use. No affiliation is implied.
