I run a tool called the Mega Analyzer that scores how well a page is set up for AI retrieval. Every so often I run it on a client site and the score comes back as an F across the board: zero schema, zero word count, zero internal links, no rendered H1.
The first time it happened I assumed the site was broken. It was not: Cloudflare was returning a 403 challenge page to my scanner, and the scanner was scoring the challenge page instead of the actual site.
Then I realized: ChatGPT, Claude, Perplexity, Gemini, Siri, and Google's AI Overviews were all seeing exactly the same 403 challenge page. The site was effectively invisible to every AI engine.
If you run a small business site behind Cloudflare, this is probably happening to you too. Here is what is going on and how to fix it without disabling the protection that keeps real attackers out.
What you are actually serving to AI crawlers
When Cloudflare's bot management decides a request looks automated, the default response is a JavaScript challenge page. The HTTP status is 403, the body is around a kilobyte of JavaScript and a <meta name="robots" content="noindex,nofollow"> tag, and the response includes a header named cf-mitigated: challenge.
A browser running JavaScript will solve the challenge in a fraction of a second and the user never sees anything wrong. An AI crawler does not run JavaScript. It receives the 403, reads the noindex,nofollow meta tag, and walks away. As far as that AI engine is concerned, your homepage has no content and explicitly does not want to be indexed.
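You can reproduce what a crawler sees with a few lines of Python. This is a minimal sketch using only the standard library and the three signals described above; example.com is a placeholder, and requiring all three signals is just a conservative choice to avoid false positives:

# Probe a page the way a non-JS crawler would and check for the
# challenge signals: HTTP 403, a cf-mitigated header, and noindex.
# "example.com" is a placeholder.
import urllib.error
import urllib.request

req = urllib.request.Request("https://example.com/",
                             headers={"User-Agent": "GPTBot"})
try:
    resp = urllib.request.urlopen(req, timeout=10)
    status, headers, body = resp.status, resp.headers, resp.read().decode("utf-8", "replace")
except urllib.error.HTTPError as e:
    # urllib raises on a 403 instead of returning the response
    status, headers, body = e.code, e.headers, e.read().decode("utf-8", "replace")

challenged = (status == 403
              and headers.get("cf-mitigated") == "challenge"
              and "noindex" in body)
print("WAF challenge detected" if challenged else f"Got real content? HTTP {status}")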
This is not a Cloudflare bug. The challenge page is doing exactly what it is designed to do: stop scrapers, stop credential stuffing, stop layer-7 DDoS attempts. The trouble is that the same shield also stops the agents you actually want reading your site.
The same pattern shows up with AWS WAF, Akamai Bot Manager, Imperva, Sucuri, and the Vercel Firewall. Cloudflare is the one most small-business sites run into, so I will use it as the running example; the shape of the fix is the same elsewhere.
Who is being blocked
The AI crawlers that matter for visibility in 2026 use predictable, published user agents. The major ones:
- GPTBot and OAI-SearchBot from OpenAI. GPTBot collects content that may be used for training; OAI-SearchBot is the retrieval bot that powers ChatGPT's "search the web" feature. (OpenAI bot docs)
- ClaudeBot, Claude-Web, and anthropic-ai from Anthropic. ClaudeBot is the primary crawler; Claude-Web and anthropic-ai are older tokens that still show up in logs. (Anthropic crawler docs)
- PerplexityBot and Perplexity-User from Perplexity. (Perplexity bot docs)
- Google-Extended from Google. This is not a separate crawler — it is a token Google checks against your robots.txt to decide whether to use already-crawled content for Gemini and AI Overviews. (Google special-case crawlers)
- Applebot-Extended from Apple, used for Apple Intelligence and Siri summarization. (Apple's About Applebot)
- CCBot from Common Crawl. Most LLM training corpora draw from Common Crawl as a source. (CCBot info)
If Cloudflare is challenging requests by default, every one of these bots is hitting the wall.
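If you want evidence from your own traffic before touching any settings, you can tally the status codes these bots are getting in your access logs. A rough sketch, assuming combined log format and a hypothetical access.log path. Google-Extended and Applebot-Extended are robots.txt tokens rather than request user agents, so they are not in the list; and note that if Cloudflare is answering at the edge, challenged requests may never reach your origin log at all, which is itself a signal (Cloudflare's own analytics are the more reliable view):

# Tally HTTP status codes per AI-bot user agent in an access log.
# Assumes combined log format; "access.log" is a hypothetical path.
from collections import Counter

BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
              "PerplexityBot", "Applebot", "CCBot"]

tallies = {bot: Counter() for bot in BOT_TOKENS}
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        for bot in BOT_TOKENS:
            if bot in line:
                # In combined log format the status code is the first
                # field after the quoted request line.
                parts = line.split('"')
                if len(parts) > 2 and parts[2].split():
                    tallies[bot][parts[2].split()[0]] += 1

for bot, counts in tallies.items():
    if counts:
        print(bot, dict(counts))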
Three ways to fix it, in order of how I would actually do it
1. Turn on Cloudflare's "Verified Bots" allowlist
Cloudflare maintains a category called Verified Bots. This is a list of crawlers that Cloudflare has independently verified by IP range and user-agent signature, including GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot, and many search crawlers. (Cloudflare verified bots directory)
In the Cloudflare dashboard, go to Security → Bots and make sure Verified Bots are not being challenged. The setting is sometimes labeled "Allow Verified Bots" or appears as a Skip rule in the WAF custom rules section. After flipping it, real AI crawlers are exempt from the challenge while everything else still gets the same screening.
This is the cleanest fix because Cloudflare is the one verifying identity. You do not have to maintain a list yourself, and IP spoofing does not get past their check.
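If you would rather express this as an explicit rule than hunt for the toggle, the expression language has a field for it. A sketch, with the caveat that field availability varies by plan (cf.client.bot is the broadly available known-bots field; cf.bot_management.verified_bot requires the Bot Management add-on):

(cf.client.bot)

Action: Skip → whichever bot-protection components your plan lists. The effect is the same as the toggle: Cloudflare does the identity verification, you just state the policy.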
2. Use Cloudflare's AI Audit / "Block AI Bots" toggle (and turn it off)
In late 2024 Cloudflare added an AI Audit feature that gives site owners a single switch to either block or allow categories of AI crawlers. (Cloudflare AI Audit docs)
If you turned this on at some point because the marketing made it sound like AI bots are bad, and you have not been seeing AI referral traffic since, this is probably your problem. Turn it off, or set it to "Allow" for the retrieval crawlers and "Block" only for training crawlers if that is your policy.
3. Write a WAF Skip rule by user agent
If you cannot use Verified Bots for some reason — usually because you are on a free plan with limited rule slots — write a Skip rule that exempts the AI bot user agents from bot management.
In Cloudflare → Security → WAF → Custom Rules, create a rule with this expression:
(http.user_agent contains "GPTBot") or
(http.user_agent contains "OAI-SearchBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "anthropic-ai") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "Applebot-Extended") or
(http.user_agent contains "CCBot")
Action: Skip → Bot Fight Mode + Managed Rules. This is weaker than the Verified Bots check (a script kiddie can set their user agent to GPTBot to slip past), but if all you are protecting is a brochure site, the practical risk is low and the AEO (answer-engine optimization) upside is much higher.
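If your plan does expose the known-bots field mentioned earlier, you can combine the two approaches so a spoofed user agent alone does not qualify. A sketch, same caveat about field availability:

(cf.client.bot) and (
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "PerplexityBot")
)

This skips screening only for requests that both claim an AI-bot identity and pass Cloudflare's own verification, closing the spoofing hole at the cost of depending on that field being available.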
Don't strip your robots.txt and ai.txt
A common mistake when fixing this is to also start allowing every bot in robots.txt because you want "maximum visibility." Don't. Your robots.txt and ai.txt are the layer where you express policy (which bots may train on your content, which directories are off-limits, where the sitemap is). The Cloudflare allowlist is the layer where you make sure the bots can actually reach the policy file.
Keep them separate. If you want to opt out of training but allow retrieval, robots.txt and ai.txt are where you say so, and Cloudflare's job is to let the well-behaved bots get to those files in the first place.
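As a concrete example, here is the shape of a robots.txt that allows retrieval while opting out of training. The policy choices are illustrative, not a recommendation (in particular, whether to disallow Google-Extended depends on whether you count Gemini grounding and AI Overviews as training or retrieval), and example.com is a placeholder:

# Allow the retrieval bot behind ChatGPT's web search
User-agent: OAI-SearchBot
Allow: /

# Opt out of training corpora
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: normal crawling
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml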
How to verify the fix worked
Once you have flipped the setting, run two checks:
- Run the AI Crawler Access Auditor against your homepage. It will tell you whether the response looks like a challenge page versus real content. If you see "WAF challenge detected," the fix has not landed yet.
- Check your server logs (or Cloudflare's Bot Analytics tab) for a GPTBot or PerplexityBot request returning 200. If you see them succeeding, you are done. If they are still 403'ing, your Skip rule has a typo or sits below another rule that fires first. A scripted version of this check is sketched below.
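Here is a rough script for that check, probing once per bot token. One caveat worth stating plainly: this spoofs the user agent from your own machine, so it directly tests a user-agent Skip rule (option 3). If you used the Verified Bots allowlist (option 1), your spoofed request is not a verified bot and may still be challenged even though the real crawlers get through; in that case, trust the logs and Bot Analytics instead. example.com is a placeholder:

# Probe one URL per AI-bot user agent and report the verdict.
# See the caveat above about what this does and does not prove.
import urllib.error
import urllib.request

BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot"]

for bot in BOTS:
    req = urllib.request.Request("https://example.com/",
                                 headers={"User-Agent": bot})
    try:
        resp = urllib.request.urlopen(req, timeout=10)
        status, mitigated = resp.status, resp.headers.get("cf-mitigated")
    except urllib.error.HTTPError as e:
        status, mitigated = e.code, e.headers.get("cf-mitigated")
    verdict = "challenged" if mitigated == "challenge" else f"HTTP {status}"
    print(f"{bot}: {verdict}")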
The Mega Analyzer and the AI Posture Audit on this site both detect the same challenge pattern and now display a banner instead of pretending your site has no content. Run either one to confirm the fix.
Why this matters more than it sounds
I want to be careful with stats here, so I will not make up percentages. What is true and easy to confirm: AI engines route traffic to sites whose content they can read. If the engine cannot read your homepage on a Tuesday afternoon, your site is not in the citation pool when someone asks ChatGPT "best self-storage in Twin Falls" on Tuesday afternoon.
The fix takes about ten minutes if you have Cloudflare access. The cost of leaving it broken compounds every week the AI engines keep cycling through their corpora.
If you want a deeper look at how AI engines decide who to cite (and what your robots.txt + ai.txt should actually say), my book The $97 Launch has a chapter on technical setup that is built around exactly this kind of problem. Get it on Kindle if you want the long version.
Fact-check notes and sources
- Cloudflare Verified Bots directory: https://radar.cloudflare.com/traffic/verified-bots
- Cloudflare AI Audit feature documentation: https://developers.cloudflare.com/ai-audit/
- Cloudflare bot management challenge page documentation: https://developers.cloudflare.com/bots/concepts/challenge-types/
- OpenAI GPTBot and OAI-SearchBot user agents: https://platform.openai.com/docs/bots
- Anthropic ClaudeBot and crawler user agent reference: https://docs.anthropic.com/en/docs/agents-and-tools/web-fetch-tool
- Google-Extended documentation (Google special-case crawlers): https://developers.google.com/search/docs/crawling-indexing/google-special-case-crawlers
- Apple "About Applebot" support article (Applebot-Extended): https://support.apple.com/en-us/119829
- Perplexity bot documentation: https://docs.perplexity.ai/guides/bots
- Common Crawl CCBot: https://commoncrawl.org/ccbot
- HTTP status code 403 is defined by RFC 9110 §15.5.4: "the server understood the request but refuses to fulfill it."
Related reading
- The AI Posture Audit master prompt — how to align robots.txt, ai.txt, X-Robots-Tag, and meta robots so they do not contradict each other.
- llms.txt and the .well-known mirror — the file you want AI engines to find once they can reach your site.
- Netlify WAF vs Cloudflare double-CDN — what happens when two layers of bot protection stack on top of each other.
- The AI Crawler Access Auditor tool — how the per-bot verdict is calculated.
- Max-snippet robots directive — once the bots can read your site, this is the next setting that affects how they quote you.
This post is informational, not security or SEO consulting advice. Cloudflare, OpenAI, Anthropic, Perplexity, Google, Apple, Akamai, Imperva, Sucuri, Vercel, and Common Crawl are trademarks of their respective owners; mentions are nominative fair use. No affiliation is implied.