
When Robots.txt Allows GPTBot and Cloudflare Still Returns 403

A thing that happens more often than people think: your robots.txt explicitly allows GPTBot. Your ai.txt has no Disallow. Your meta robots is permissive. And GPTBot still gets a 403. Because Cloudflare's bot-protection rules return a challenge page to any non-browser user-agent, and nobody configured an exception for GPTBot.

The AI Crawler Access Auditor catches this. It audits all three layers — robots.txt + ai.txt + CDN — for 20 major AI crawlers and tells you which bots actually reach your content.

The three layers

Layer 1: robots.txt + ai.txt directives. The auditor parses robots.txt using longest-match precedence (specific agent beats wildcard, specific path beats wildcard path). ai.txt is parsed against the Spawning spec. Each bot gets an allow/block verdict from the directive layer.
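The longest-match rule can be sketched in a few lines. This is an illustrative reduction, not the auditor's actual code; the names (pickGroup, isAllowed) and the pre-parsed rule shape are assumptions, and empty Disallow lines are assumed to have been dropped during parsing.

```javascript
// Pick the rule group for a bot: a group naming the bot specifically
// beats the "*" wildcard group.
function pickGroup(groups, bot) {
  const lower = bot.toLowerCase();
  return groups.find(g => g.agent.toLowerCase() === lower)
      || groups.find(g => g.agent === "*")
      || null;
}

// Decide allow/block for a path within a group. Rules look like
// { type: "allow" | "disallow", path: "/..." }.
function isAllowed(group, path) {
  if (!group) return true; // no applicable group: crawling is allowed
  let best = null;
  for (const rule of group.rules) {
    if (rule.path && path.startsWith(rule.path)) {
      // Longest matching path wins; on equal length, Allow beats Disallow.
      if (!best ||
          rule.path.length > best.path.length ||
          (rule.path.length === best.path.length && rule.type === "allow")) {
        best = rule;
      }
    }
  }
  return !best || best.type === "allow";
}
```

So with `Disallow: /private/` and `Allow: /private/public/` in a GPTBot group, `/private/public/page` is allowed (longer match) while `/private/secret` is blocked.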

Layer 2: per-page meta robots + X-Robots-Tag. Fetches the page HTML, extracts <meta name="robots">, cross-references against what robots.txt said. The conflict case is "robots.txt allows, meta robots disallows" — or vice versa — which LLMs handle inconsistently.
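A minimal sketch of the layer-2 check, assuming the HTML has already been fetched. The regex-based extraction and the function names are illustrative (a production auditor may use a real HTML parser, and the regex assumes `name` precedes `content` in the tag).

```javascript
// Extract the directives from <meta name="robots" content="..."> as a
// lowercase array, e.g. ["noindex", "nofollow"]. Empty if no tag found.
function metaRobotsDirectives(html) {
  const m = html.match(/<meta[^>]+name=["']robots["'][^>]*content=["']([^"']*)["']/i);
  return m ? m[1].toLowerCase().split(",").map(s => s.trim()) : [];
}

// Cross-reference the meta layer against the robots.txt verdict and
// report a conflict string, or null if the layers agree.
function directiveConflict(robotsTxtAllows, html) {
  const directives = metaRobotsDirectives(html);
  const metaBlocks = directives.includes("noindex") || directives.includes("none");
  if (robotsTxtAllows && metaBlocks) return "robots.txt allows, meta blocks";
  if (!robotsTxtAllows && directives.length && !metaBlocks) {
    return "robots.txt blocks, meta permits";
  }
  return null;
}
```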

Layer 3: CDN challenge page detection. This is the one most AI-access audits miss. If Cloudflare or Akamai return a challenge page ("Checking your browser…") instead of your actual content, non-browser UAs see the challenge, humans see the page. The auditor heuristically detects challenge markers in the page body and flags a "CDN-risk" verdict for every bot — because the bot will read the challenge, not your content.
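The detection heuristic amounts to string-matching the fetched body against known challenge-page tells. The marker list below is a plausible sample, not the auditor's exhaustive list, and the function name is illustrative.

```javascript
// Substrings that commonly appear in Cloudflare/Akamai challenge pages.
const CHALLENGE_MARKERS = [
  "checking your browser",
  "just a moment",
  "attention required! | cloudflare",
  "cf-challenge",
  "_abck", // Akamai Bot Manager cookie name, often echoed in challenge HTML
];

// Flag CDN-risk when the body looks like a challenge page, or when the
// edge answered with a 403/503 instead of the content.
function looksLikeChallenge(body, status) {
  const lower = body.toLowerCase();
  const markerHit = CHALLENGE_MARKERS.some(m => lower.includes(m));
  return markerHit || status === 403 || status === 503;
}
```

Note the verdict applies to every bot at once: the CDN doesn't consult your robots.txt, so one challenge page flips all 20 rows to CDN-risk.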

The 20 bots audited

OpenAI (GPTBot, ChatGPT-User, OAI-SearchBot), Anthropic (ClaudeBot, Claude-Web, anthropic-ai), Google (Google-Extended, Googlebot), Perplexity (PerplexityBot), Apple (Applebot-Extended, Applebot), Meta (Meta-ExternalAgent), ByteDance (Bytespider), Amazon (Amazonbot), Common Crawl (CCBot), Cohere (cohere-ai), Mistral (MistralAI-User), Diffbot, You.com (YouBot), DuckDuckGo (DuckAssistBot).

This is the minimum AI-crawler surface as of mid-2026. If a new bot launches, add one entry to the BOTS array in the page script and it picks up the same allow/block/challenge logic.
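Adding a bot would look roughly like this. The field names below are assumptions about the page script's BOTS entry shape, shown only to illustrate how little a new crawler requires; "NewCrawler"/"ExampleAI" are hypothetical.

```javascript
// Hypothetical shape of BOTS entries; each new crawler is one object,
// and the same allow/block/challenge pipeline applies to it.
const BOTS = [
  { name: "GPTBot",        vendor: "OpenAI",    purpose: "training" },
  { name: "OAI-SearchBot", vendor: "OpenAI",    purpose: "search" },
  { name: "NewCrawler",    vendor: "ExampleAI", purpose: "training" },
];
```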

Why the CDN layer matters most

Cloudflare's bot-management rules are configured per domain, but in practice they're on by default for free-tier users who clicked through the onboarding. The default rules allow verified Googlebot and Bingbot and block "unknown bots" — a bucket that includes every AI bot on the list above. The result: your robots.txt says "come on in" and Cloudflare says "absolutely not" at the edge.

The fix is a Cloudflare WAF custom rule with the "Skip" action that lets specific User-Agent strings through (Page Rules won't work here — they match URLs, not User-Agents). If you want to allow training bots you also need to configure the "AI scrapers and crawlers" controls that Cloudflare shipped in 2024, which name the AI bots explicitly and let you allow or block them individually.
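A skip rule's match expression might look like the following, written in Cloudflare's rules expression language. Treat it as a sketch: verify the exact bot names and the rule's scope in your own dashboard before deploying, and keep the skip as narrow as possible.

```
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "PerplexityBot")
```

Attach the "Skip" action to this expression and select the bot-protection features it should bypass; a plain "Allow" elsewhere won't stop a later challenge rule from firing.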

What to do with the output

Run it against your homepage, your most-cited article, and a random deep page. If the homepage is accessible but a deep page shows CDN-risk, your Cloudflare rule is too restrictive and you're losing retrieval coverage. If the homepage itself shows CDN-risk, nothing is getting indexed and you have a bigger problem.

The auditor doesn't emit fix code (that's the job of the AI Bot Policy Generator for directive-layer fixes, and a Cloudflare dashboard for CDN-layer fixes). What it does is tell you which layer is blocking which bot, so you know where to look.

Related reading

The $100 Network covers aligning robots.txt + ai.txt + CDN rules across a site network with single-source templates. The auditor is how you verify the templates actually took effect.
