Your robots.txt Says One Thing And Your CDN Says Another
The annoying truth about AI-crawler policy: it lives in four places and they can disagree silently.

  1. /robots.txt — the classic text file
  2. /ai.txt (and /.well-known/ai.txt) — training-specific declarations
  3. <meta name="robots"> — per-page HTML
  4. X-Robots-Tag — HTTP header set by your server or CDN

A crawler reading one source sees one policy. A crawler reading another sees another. Which one "wins" is crawler-specific and rarely documented. The only safe state is all four agree.

AI Posture Audit fetches all four, per named AI bot, and reports every disagreement.
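The comparison step itself is simple once the four signals are in hand. Here's an illustrative sketch (not the audit tool's actual code); each source is reduced to one of three states for a single named bot, and the function names are hypothetical:

```python
# Illustrative sketch of the four-source comparison, not the tool's real
# API. Each source has been fetched and reduced to "allow", "disallow",
# or "missing" for one named bot before this runs.

def policy_disagreements(signals: dict[str, str]) -> list[str]:
    """Return a line per source when the four sources disagree."""
    # In practice a missing ai.txt or meta tag behaves like "allow".
    effective = {src: ("allow" if v == "missing" else v)
                 for src, v in signals.items()}
    if len(set(effective.values())) <= 1:
        return []  # all sources agree once defaults are applied
    return [f"{src}: {signals[src]} (effective: {effective[src]})"
            for src in signals]

# Example: robots.txt disallows a bot, but the CDN injects a permissive
# header and ai.txt was never created.
report = policy_disagreements({
    "robots.txt":   "disallow",
    "ai.txt":       "missing",
    "meta-robots":  "allow",
    "x-robots-tag": "allow",
})
```

Note the "missing means allow" default in the middle: that single line is why an absent ai.txt silently flips your effective policy, as described below.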

The failure modes it catches

robots.txt disallows, X-Robots-Tag allows. You wrote a Disallow rule. Your CDN (Cloudflare, Akamai, CloudFront) has an edge rule that injects X-Robots-Tag: all on every response. The header wins for bots that prefer header directives.

ai.txt is missing. A missing ai.txt is not a restrictive default; in practice, absence reads as permission. If you meant to opt out of AI training, a missing file means you didn't.

Meta robots contradicts robots.txt. Your template injects <meta name="robots" content="index,follow"> on every page, including pages you disallowed in robots.txt. Some crawlers will still index the page if they find it via a link; meta robots is their next signal.
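Extracting the per-page signal for that comparison takes only the standard-library HTML parser. A minimal sketch (the class name is illustrative):

```python
# Minimal sketch: pull per-page robots directives out of HTML with the
# stdlib parser, so they can be compared against robots.txt.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            # Split "index,follow" into normalized individual directives.
            self.directives += [d.strip().lower()
                                for d in a.get("content", "").split(",")]

p = RobotsMetaParser()
p.feed('<meta name="robots" content="index,follow">')
```

A real audit would also check bot-specific variants like `<meta name="GPTBot" ...>`, which crawlers that recognize their own name prefer over the generic tag.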

CDN challenge pages. Cloudflare's Bot Fight Mode returns a 403 challenge page to clients that fail its JS test. A crawler that does not execute JS is effectively blocked regardless of your declared policy, and a naive audit that only looks at whether it got a response back may misread the challenge HTML as a successful fetch. The tool detects challenge pages and flags the ambiguity.
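A heuristic for telling a challenge page apart from a genuine policy response might look like this. Both signals are Cloudflare-specific behaviors (the `cf-mitigated` response header and the challenge page's title text); treat them as hints, not proof, and this sketch as an assumption-laden illustration:

```python
# Heuristic sketch: is this response a CDN challenge rather than a real
# policy answer? cf-mitigated and the "Just a moment" title are
# Cloudflare behaviors; other CDNs need their own fingerprints.

def looks_like_challenge(status: int, headers: dict[str, str], body: str) -> bool:
    h = {k.lower(): v for k, v in headers.items()}
    if h.get("cf-mitigated") == "challenge":
        return True
    return status == 403 and "just a moment" in body.lower()
```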

Server-specific X-Robots-Tag. Apache's mod_headers and nginx's add_header can set X-Robots-Tag: noindex at the server level. If that directive lands on a subdomain or path you didn't mean to block, you've quietly removed that content from search indexes.
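For reference, these are the one-line directives to grep your configs for. Either one, inherited from a server-wide include file, is enough to deindex everything it covers:

```nginx
# nginx: "always" makes the header apply even to error responses.
add_header X-Robots-Tag "noindex" always;
```

```apache
# Apache (mod_headers): same effect, often hiding in a shared .htaccess.
Header set X-Robots-Tag "noindex"
```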

Why this matters specifically for AI crawlers

Search crawlers have been here for 25 years and the disagreement modes are well-catalogued. AI crawlers are new and each one treats the four sources differently:

  • GPTBot (OpenAI) checks robots.txt first; honors Disallow for both crawl and training.
  • Google-Extended is a "virtual" bot — it does not crawl but signals training opt-out via robots.txt. Cannot be controlled via ai.txt.
  • ClaudeBot (Anthropic) checks robots.txt; training stance depends on the deployment (ClaudeBot vs ClaudeBot-User).
  • PerplexityBot checks robots.txt and honors Disallow for crawl; has a separate agent for user-initiated fetches.

The audit output flags per-bot whether each signal is consistent across the four sources, so you know which bot has a mismatch and where.
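The bullet list above can be kept as data, which is roughly what a per-bot audit needs. This mapping encodes only what the list states; `None` marks unknown or undocumented behavior, and each vendor's docs remain the source of truth:

```python
# Which of the four sources each bot is documented to read, per the list
# above. None = unknown/undocumented -- verify against vendor docs.
BOT_READS = {
    "GPTBot":          {"robots.txt": True, "ai.txt": None,  "meta": None, "header": None},
    "Google-Extended": {"robots.txt": True, "ai.txt": False, "meta": None, "header": None},
    "ClaudeBot":       {"robots.txt": True, "ai.txt": None,  "meta": None, "header": None},
    "PerplexityBot":   {"robots.txt": True, "ai.txt": None,  "meta": None, "header": None},
}

def comparable_sources(bot: str) -> list[str]:
    """Sources worth cross-checking for this bot (documented or unknown)."""
    return [src for src, reads in BOT_READS[bot].items() if reads is not False]
```

The point of the `False` entry: there is no sense flagging an ai.txt mismatch for Google-Extended, since no ai.txt rule can reach it.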

When to run it

After any robots.txt, ai.txt, template, or CDN-config change. Server config drift is the most common silent source of accidental policy changes. You moved to a new hosting provider last month; their default X-Robots-Tag is different; you had no reason to suspect anything changed.

Monthly, as hygiene. Cloudflare and Akamai push edge-rule updates; WAF provider updates; your CMS updates; all of those can inject headers. Once a month is cheap insurance.

Before any major AI-crawler launch. When OpenAI shipped OAI-SearchBot, sites that had blocked GPTBot but not OAI-SearchBot got a surprise. Audit before the bot goes live, not after.

The fix output

Every disagreement produces an AI fix prompt that describes the gap ("ai.txt disallows GPTBot but X-Robots-Tag on /blog/ returns all") and emits the exact patch. Drop the patch into your Netlify _headers, nginx config, Apache .htaccess, or robots.txt, deploy, re-audit.
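For the example gap above, the emitted patch might be a Netlify _headers entry like this. The path and directive here are hypothetical, and "noai" is an emerging, non-standard X-Robots-Tag value; confirm each bot you care about actually honors it before relying on it:

```
# Netlify _headers -- hypothetical patch: replace the CDN's permissive
# header on /blog/ with a restrictive one.
/blog/*
  X-Robots-Tag: noai
```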

If you're using the AI Bot Policy Generator, pass its output into this audit after every deploy. Generate → deploy → audit is the full loop.

The $100 Network approach: treat crawler-policy drift as a monitored-service problem, not a set-and-forget config. One monthly audit is cheaper than a quarter of unwanted training contribution.
