
I shipped a tool that probes your site as ChatGPT, Claude, Perplexity, and 13 other bots so you can see exactly who is being blocked

A few days ago I wrote about Cloudflare bot management blocking AI crawlers from your site. The short version: when Cloudflare decides a request looks automated, it returns a 403 with a JavaScript challenge page, and AI bots cannot solve the challenge, so they read your homepage as if it were empty.

After I shipped that post I kept getting the same question from people who run small business sites: "Okay, but how do I actually tell if Cloudflare is blocking the bots? I logged in, I see settings, I do not know which one is the problem."

So I built the AI Bot Allowlist Validator. It does one thing: it sends a real HTTP request to a URL you pick, using each major AI bot's published User-Agent string, and reports what came back per bot: allowed, challenged, blocked, or partial. About fifteen seconds end to end.

What the tool actually does

The tool walks down the full bot catalog the rest of the site shares (the same list every other AEO tool here uses, kept in /js/ai-bots.js). Right now that's the OpenAI family (GPTBot, OAI-SearchBot, ChatGPT-User), Anthropic (ClaudeBot, Claude-Web), Perplexity (PerplexityBot, Perplexity-User), Apple (Applebot), ByteDance (Bytespider), Common Crawl (CCBot), Meta (Meta-ExternalAgent), Amazon (Amazonbot), Mistral (MistralAI-User), Cohere (cohere-ai, Cohere-Training-Data-Crawler), You.com (YouBot), Diffbot, Timpibot, Webz.io (omgili), Huawei (PetalBot), the major search engines (Googlebot, bingbot, DuckDuckBot, YandexBot, Slurp, Baiduspider), and the link-preview agents (facebookexternalhit, LinkedInBot, Twitterbot, Slackbot). The list grows whenever a new vendor publishes a bot. New bots show up in every AEO tool the day I add them to the registry.

For each bot it fires a single HTTP request to your URL and inspects the response.
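A single probe can be sketched in a few lines of Node (18+, for built-in `fetch`). The 10-second timeout and the return shape here are my illustrative choices, not the tool's actual internals:

```javascript
// Fetch a URL while presenting a given bot's User-Agent string,
// and return the raw facts the verdict logic needs.
// Sketch only: the timeout and return shape are assumptions.
async function probeAsBot(url, userAgent) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 10_000); // 10 s cap
  try {
    const res = await fetch(url, {
      headers: { "User-Agent": userAgent },
      redirect: "follow",
      signal: controller.signal,
    });
    const body = await res.text();
    return { status: res.status, headers: res.headers, body };
  } finally {
    clearTimeout(timer);
  }
}
```

Browsers refuse to let scripts override the `User-Agent` header, which is why probes like this have to run server-side rather than from the page in your tab.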

  • Allowed = HTTP 2xx with content. Bot can read the page.
  • WAF Challenged = the response was a Cloudflare / AWS WAF / Akamai / Imperva / Sucuri / Vercel Firewall challenge page. Bot saw noindex,nofollow and walked away.
  • Blocked = HTTP 4xx or 5xx with no challenge page. Hard refusal.
  • Partial = HTTP 200 but the body is much smaller than what a browser receives. Usually a soft-block, an empty SPA shell, or a rendering misconfiguration.
  • Error = the connection failed entirely (DNS, TLS, timeout).

Before the bot probes run, the tool fetches the URL once with a regular desktop browser User-Agent to establish a baseline word count. The browser fetch tells me what a normal visitor sees. The bot probes are then judged against that baseline. If the browser sees 1,800 words and GPTBot sees 12 words, GPTBot is in trouble even though both got an HTTP 200.
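The verdict rules plus the baseline comparison condense into one pure function. The 30% cutoff for PARTIAL is my guess at a sensible threshold, not the tool's actual number, and the `challenge` flag is assumed to be set by a separate WAF-fingerprinting step:

```javascript
// Classify one bot probe against the browser baseline word count.
// probe: { status, wordCount, challenge?, error? } — `challenge`
// is assumed to come from a separate WAF-fingerprint check.
const PARTIAL_RATIO = 0.3; // illustrative: bot body < 30% of baseline

function classifyProbe(probe, baselineWords) {
  if (probe.error) return "ERROR";              // DNS, TLS, timeout
  if (probe.challenge) return "WAF-CHALLENGED"; // challenge page served
  if (probe.status >= 400) return "BLOCKED";    // hard 4xx/5xx refusal
  if (probe.status >= 200 && probe.status < 300) {
    const ratio = probe.wordCount / Math.max(baselineWords, 1);
    return ratio < PARTIAL_RATIO ? "PARTIAL" : "ALLOWED";
  }
  return "BLOCKED"; // anything else (unresolved 3xx, odd statuses)
}
```

With an 1,800-word baseline, a 12-word GPTBot response lands at a ratio under 1%, so it comes back PARTIAL despite the HTTP 200.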

What you do with the output

When you run the tool you usually end up in one of three states.

State 1: every bot says ALLOWED. Good news, mostly. Two caveats. First, this is a User-Agent test only. Some CDN allowlists also verify the requesting IP belongs to the bot's published range. The probe traffic originates from Netlify's IPs, not from OpenAI's range, so a strict IP-allowlist setup will show ALLOWED on this tool but still block the actual bot in production. Second, bots can fail at the page level even when the homepage works (a broken /robots.txt, an X-Robots-Tag noindex on a specific path, a country-block).
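For what it's worth, the IP half of a strict allowlist is just a CIDR membership test. Vendors such as OpenAI publish their crawler IP ranges, so a fuller tool could fetch those and check its own egress address against them. The addresses below are documentation and private ranges, not any vendor's real crawler prefixes:

```javascript
// Check whether an IPv4 address falls inside a CIDR block.
// Example ranges in the comments are documentation/private
// space, not real bot prefixes.
function ipInCidr(ip, cidr) {
  const toInt = (addr) =>
    addr.split(".").reduce((n, octet) => (n << 8) + Number(octet), 0) >>> 0;
  const [base, bits] = cidr.split("/");
  // JS shifts are mod 32, so a /0 prefix needs a special case.
  const mask = Number(bits) === 0 ? 0 : (~0 << (32 - Number(bits))) >>> 0;
  return (toInt(ip) & mask) === (toInt(base) & mask);
}
```

A probe arriving from Netlify's egress range would fail this test against a vendor's published range, which is exactly the false-ALLOWED scenario described above.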

If the validator says ALLOWED everywhere and your AI referral traffic is still flat, run the AI Crawler Access Auditor next to check the policy layer.

State 2: most bots are WAF-CHALLENGED. This is what I see most often on small business sites running Cloudflare. The fix is in the Cloudflare allowlist post — usually it is the Verified Bots toggle in Security → Bots, or the AI Audit "Block AI Bots" feature you turned on at some point and forgot about.

The tool will tell you which specific WAF returned the challenge (Cloudflare, AWS WAF, Akamai, Imperva, Sucuri, or Vercel Firewall), so you know which dashboard to log into.
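Fingerprinting which WAF served the challenge comes down to checking a handful of headers and body markers. The markers below are ones commonly reported in the wild (Cloudflare's `cf-ray` header, Imperva's `_Incapsula_Resource` script, and so on); vendors rotate these, so treat each as an assumption rather than the tool's exact logic:

```javascript
// Guess which WAF produced a challenge response from commonly
// observed markers. headers is a plain object; fingerprints are
// assumptions — verify against a live challenge before relying.
function detectWaf(headers, body) {
  const h = Object.fromEntries(
    Object.entries(headers).map(([k, v]) => [k.toLowerCase(), String(v).toLowerCase()])
  );
  const b = body.toLowerCase();
  if (h["cf-ray"] || h["cf-mitigated"] || b.includes("cf-chl")) return "Cloudflare";
  if (h["x-amzn-waf-action"] || b.includes("awswaf")) return "AWS WAF";
  if (h["x-vercel-mitigated"]) return "Vercel Firewall";
  if (h["x-sucuri-id"] || h["x-sucuri-block"]) return "Sucuri";
  if (b.includes("_incapsula_resource") || (h["x-cdn"] || "").includes("imperva")) return "Imperva";
  if ((h["server"] || "").includes("akamaighost") || b.includes("errors.edgesuite.net")) return "Akamai";
  return null; // unrecognized challenge — report it generically
}
```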

State 3: a mix. This is the most informative case. Often I see Googlebot and bingbot allowed, but GPTBot and ClaudeBot challenged. That happens when a CDN's "Verified Bots" category includes search engines but not the newer AI training crawlers, and the operator never updated the rule. The fix: extend the allowlist to include the AI bots specifically.

I have also seen the inverse: Googlebot challenged but GPTBot allowed. Usually that is a custom Cloudflare WAF rule someone wrote to "block aggressive crawling" that accidentally caught Googlebot's UA and was never noticed, because Search Console traffic dropped slowly enough to look like a normal down quarter for SEO.

What the tool does not do

I want to be careful about overpromising.

It does not test what an AI engine actually sees in production. AI engines crawl from their own IP ranges with their own caching layers and their own retry behavior. The validator is a sanity check on the User-Agent layer, not a substitute for inspecting your CDN's bot analytics or your origin's access logs.

It does not test JavaScript rendering. If your site requires JavaScript to render content, even an "allowed" bot will see a mostly-empty page. The validator will flag this as PARTIAL when the body is much smaller than the browser baseline.

It does not detect IP-based allowlists. As noted above, a CDN that allowlists by both UA and IP will report ALLOWED on this tool but still block in real bot traffic. This is by design — the tool is honest about the test surface it covers.

How I use it on my own work

Before this tool existed, I diagnosed bot-blocking issues by looking at server logs, then guessing which Cloudflare rule was responsible, toggling settings, and waiting. The diagnostic loop was about an hour per round.

Now the loop is: change a Cloudflare setting, wait 30 seconds for cache propagation, click "Probe all bots," see whether the situation moved. Fifteen seconds per check. I tend to run it after every Cloudflare-side change to confirm I did not accidentally lock something out.

If you want a deeper look at how the policy layer (robots.txt, ai.txt, meta robots, X-Robots-Tag) connects to the operational layer (CDN, WAF, origin), my book The $97 Launch has the technical-setup chapter that walks through it end to end. The validator and the auditor are the diagnostic side; the book is the playbook side.

Run it

AI Bot Allowlist Validator →

Paste a URL, click probe, wait fifteen seconds. If anything is blocked, the tool emits a paste-ready AI fix prompt that includes the specific WAF detected and the exact bots that need allowlisting. Drop that into Claude or ChatGPT and you get the configuration changes back as actual code, not advice.

Fact-check notes and sources

The User-Agent strings used in the validator are the ones each vendor publishes in their official bot documentation. Sources:

  • OpenAI (GPTBot, OAI-SearchBot, ChatGPT-User): https://platform.openai.com/docs/bots
  • Anthropic (ClaudeBot, Claude-Web): https://docs.anthropic.com/en/docs/agents-and-tools/web-fetch-tool
  • Perplexity (PerplexityBot, Perplexity-User): https://docs.perplexity.ai/guides/bots
  • Apple (Applebot, Applebot-Extended): https://support.apple.com/en-us/119829
  • Common Crawl (CCBot): https://commoncrawl.org/ccbot
  • ByteDance (Bytespider): publicly advertised UA in the bot's request headers
  • Google (Googlebot, Google-Extended): https://developers.google.com/search/docs/crawling-indexing/google-special-case-crawlers
  • Microsoft (bingbot): https://www.bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0
  • Meta (Meta-ExternalAgent): https://developers.facebook.com/docs/sharing/webmasters/crawler/
  • Amazon (Amazonbot): https://developer.amazon.com/support/amazonbot
  • Mistral (MistralAI-User): https://docs.mistral.ai/
  • DuckDuckGo (DuckDuckBot): https://duckduckgo.com/duckduckbot
  • Cloudflare Verified Bots directory: https://radar.cloudflare.com/traffic/verified-bots
  • Cloudflare bot challenge documentation: https://developers.cloudflare.com/bots/concepts/challenge-types/


This post is informational, not security or SEO consulting advice. Mentions of OpenAI, Anthropic, Perplexity, Apple, Google, Microsoft, Meta, ByteDance, Amazon, Mistral, Common Crawl, DuckDuckGo, Cloudflare, AWS, Akamai, Imperva, Sucuri, and Vercel are nominative fair use. No affiliation is implied.

