# Your robots.txt Says One Thing And Your CDN Says Another

You disallowed GPTBot in robots.txt. Your CDN injects X-Robots-Tag: all. Your ai.txt was never created so it defaults to permissive. Three sources of truth, three different answers, and the crawler picks whichever it likes. This tool flags every disagreement per bot.

Author: J.A. Watte
Published: May 12, 2026
Source: https://jwatte.com/blog/blog-tool-ai-posture-audit/

---

_Part of the [AEO / GEO / AI-search audit tool stack](/blog/blog-new-aeo-audit-tools-2026/). See the pillar post for the full catalog of sibling audits and where this one fits in the lineup._

The annoying truth about AI-crawler policy: it lives in four places and they can disagree silently.

1. `/robots.txt` — the classic text file
2. `/ai.txt` (and `/.well-known/ai.txt`) — training-specific declarations
3. `<meta name="robots">` — per-page HTML
4. `X-Robots-Tag` — HTTP header set by your server or CDN

A crawler reading one source sees one policy. A crawler reading another sees another. Which one "wins" is crawler-specific and rarely documented. The only safe state is **all four agree**.

[AI Posture Audit](/tools/ai-posture-audit/) fetches all four, per named AI bot, and reports every disagreement.

## The failure modes it catches

**robots.txt disallows, X-Robots-Tag allows.** You wrote a Disallow rule. Your CDN (Cloudflare, Akamai, CloudFront) has an edge rule that injects `X-Robots-Tag: all` on every response. The header wins for bots that prefer header directives.
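A minimal sketch of how this first mismatch could be detected, assuming you already have the robots.txt body and the `X-Robots-Tag` value as strings. The function names are hypothetical and are not the tool's actual code. The header check is deliberately naive: it only looks at unscoped, comma-separated directives.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt lets this user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def header_is_permissive(x_robots_tag: Optional[str]) -> bool:
    """Treat a missing header, or one without noindex/none, as permissive."""
    if not x_robots_tag:
        return True
    tokens = {t.strip().lower() for t in x_robots_tag.split(",")}
    return not tokens & {"noindex", "none"}


def robots_header_conflict(robots_txt: str, x_robots_tag: Optional[str],
                           user_agent: str, url: str) -> bool:
    """True when robots.txt blocks the bot but the header stays permissive."""
    return (not robots_allows(robots_txt, user_agent, url)
            and header_is_permissive(x_robots_tag))


# Hypothetical inputs: a site-wide GPTBot block plus a CDN-injected header.
robots = "User-agent: GPTBot\nDisallow: /\n"
print(robots_header_conflict(robots, "all", "GPTBot", "https://example.com/blog/"))
```

Parsing with the standard library's `urllib.robotparser` keeps the sketch short, but its matching rules may differ in edge cases from what a given crawler implements.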

**ai.txt is missing.** A missing ai.txt is not equivalent to a restrictive one — it's the same as a permissive one. If you meant to opt out of AI training, a missing file means you didn't.

**Meta robots contradicts robots.txt.** Your template injects `<meta name="robots" content="index,follow">` on every page, including pages you disallowed in robots.txt. Some crawlers will still index the page if they find it via a link; meta robots is their next signal.
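To illustrate the meta robots case, here is a rough sketch that pulls the directives out of a page and compares them with a robots.txt verdict you have computed separately. It uses only the standard library's `html.parser`; the helper names are hypothetical, and it reads only the generic `name="robots"` tag, not bot-specific meta names.

```python
from html.parser import HTMLParser
from typing import Set


class _MetaRobots(HTMLParser):
    """Collects directive tokens from <meta name="robots" content="...">."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.tokens.add(token.strip().lower())


def meta_robots_tokens(html: str) -> Set[str]:
    """Return the lowercased meta robots directives found in the HTML."""
    parser = _MetaRobots()
    parser.feed(html)
    return parser.tokens


def meta_contradicts_robots(disallowed_in_robots: bool, tokens: Set[str]) -> bool:
    """True when robots.txt disallows the page but meta robots is not restrictive."""
    restrictive = bool(tokens & {"noindex", "none"})
    return disallowed_in_robots and not restrictive
```

For a page carrying `content="index,follow"` under a disallowed path, `meta_contradicts_robots(True, meta_robots_tokens(html))` would report the contradiction.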

**CDN challenge pages.** Cloudflare's Bot Fight Mode returns a 403 challenge to bots that fail the JS test. A crawler that does not execute JS is effectively blocked, yet a naive check that only sees a response come back can record the challenge page as a soft success. The tool detects challenge pages and flags the ambiguity.
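One way challenge pages might be told apart from real content or a plain block is a small heuristic over the status, headers, and body. This is a sketch under stated assumptions, not the tool's detection logic: it assumes Cloudflare marks challenges with a `cf-mitigated: challenge` response header, and the body markers are guesses at typical challenge-page text that may change over time.

```python
from typing import Mapping

# Assumed markers of a challenge interstitial; treat these as illustrative.
CHALLENGE_MARKERS = ("just a moment", "challenge-platform")


def classify_response(status: int, headers: Mapping[str, str], body: str) -> str:
    """Return 'challenge', 'blocked', 'ok', or 'other' for one fetched response."""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    text = body.lower()
    # A challenge can arrive with different status codes, so check it first.
    if lowered.get("cf-mitigated") == "challenge":
        return "challenge"
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "challenge"
    if status in (401, 403, 429):
        return "blocked"
    if 200 <= status < 300:
        return "ok"
    return "other"
```

Checking for the challenge before looking at the status code is the point of the design: a challenge is neither a clean allow nor a clean block, so it gets its own verdict.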

**Server-specific X-Robots-Tag.** Apache's `mod_headers` and nginx's `add_header` can set `X-Robots-Tag: noindex` at the server level. If deployed on a subdomain or path that wasn't meant to be blocked, you've quietly removed it from search indexes.
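Because `X-Robots-Tag` can be sent several times and can be scoped to a named crawler (Google documents forms such as `googlebot: nofollow`), an audit has to parse it per bot. The sketch below is one hedged way to do that; the function names are hypothetical, and whether a given AI crawler honors scoped headers is not something this code can tell you.

```python
from typing import Dict, Iterable, Set

# Directives that carry their own "name: value" syntax and are not bot names.
VALUED = {"unavailable_after", "max-snippet", "max-image-preview", "max-video-preview"}


def parse_x_robots_tag(values: Iterable[str]) -> Dict[str, Set[str]]:
    """Map user agent ('*' when unscoped) to its set of directive tokens."""
    result: Dict[str, Set[str]] = {}
    for value in values:
        agent = "*"
        head, sep, rest = value.partition(":")
        # "googlebot: noindex" is scoped; "max-snippet:50" and "noindex" are not.
        if sep and "," not in head and head.strip().lower() not in VALUED:
            agent, value = head.strip().lower(), rest
        for token in value.split(","):
            token = token.strip().lower()
            if token:
                result.setdefault(agent, set()).add(token)
    return result


def directives_for(parsed: Dict[str, Set[str]], bot: str) -> Set[str]:
    """Combine unscoped directives with those scoped to this bot."""
    return parsed.get("*", set()) | parsed.get(bot.lower(), set())
```

Running `directives_for` for each bot on an unexpected subdomain or path is a quick way to spot a server-level `noindex` that was never meant to apply there.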

## Why this matters specifically for AI crawlers

Search crawlers have been here for 25 years and the disagreement modes are well-catalogued. AI crawlers are new and each one treats the four sources differently:

- **GPTBot** (OpenAI) checks robots.txt first; honors Disallow for both crawl and training.
- **Google-Extended** is a "virtual" bot — it does not crawl but signals training opt-out via robots.txt. Cannot be controlled via ai.txt.
- **ClaudeBot** (Anthropic) checks robots.txt; training stance depends on which agent is fetching (ClaudeBot vs Claude-User).
- **PerplexityBot** checks robots.txt and honors Disallow for crawl; has a separate agent for user-initiated fetches.

The audit output flags per-bot whether each signal is consistent across the four sources, so you know which bot has a mismatch and where.
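The per-bot consistency check described above can be sketched as a comparison over four already-resolved stances. This is an illustrative model only: the `allow`/`block` labels, the `SOURCES` tuple, and `find_disagreements` are hypothetical, and a source with no recorded stance is assumed to be permissive, following the rule stated earlier for a missing ai.txt.

```python
from typing import Dict, List

SOURCES = ("robots.txt", "ai.txt", "meta robots", "X-Robots-Tag")


def find_disagreements(stances: Dict[str, Dict[str, str]]) -> List[str]:
    """Return one finding per bot whose four sources do not all agree."""
    findings = []
    for bot, by_source in stances.items():
        # Assumption: a missing source counts as "allow" (permissive default).
        resolved = {s: by_source.get(s, "allow") for s in SOURCES}
        if len(set(resolved.values())) > 1:
            detail = ", ".join(f"{s}={resolved[s]}" for s in SOURCES)
            findings.append(f"{bot}: {detail}")
    return findings


stances = {
    "GPTBot": {"robots.txt": "block", "X-Robots-Tag": "allow"},
    "ClaudeBot": {s: "block" for s in SOURCES},
}
for finding in find_disagreements(stances):
    print(finding)
```

Here only GPTBot is reported, because its robots.txt stance differs from the other three sources while ClaudeBot is blocked consistently.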

## When to run it

**After any robots.txt, ai.txt, template, or CDN-config change.** Server config drift is the #1 silent source of accidental policy changes. Say you moved to a new hosting provider last month and their default `X-Robots-Tag` is different: you had no reason to suspect anything had changed.

**Monthly, as hygiene.** Cloudflare and Akamai push edge-rule updates, WAF providers ship rule changes, and your CMS updates itself; any of those can inject headers. Once a month is cheap insurance.

**Before any major AI-crawler launch.** When OpenAI shipped OAI-SearchBot, sites that had blocked GPTBot but not OAI-SearchBot got a surprise. Audit before the bot goes live, not after.
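A small pre-launch check along these lines: list which bots on a watchlist have no `User-agent` group of their own in robots.txt, so they would fall through to whatever the `*` group (or no group) says. The watchlist and function names are hypothetical, and the parsing is a simplified line scan rather than a full RFC 9309 parser.

```python
from typing import Iterable, List, Set


def named_agents(robots_txt: str) -> Set[str]:
    """Collect the lowercased names from every User-agent line."""
    agents: Set[str] = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def uncovered_bots(robots_txt: str, watchlist: Iterable[str]) -> List[str]:
    """Return watchlist bots that robots.txt never names explicitly."""
    named = named_agents(robots_txt)
    return [bot for bot in watchlist if bot.lower() not in named]


robots = "User-agent: GPTBot\nDisallow: /\n"
print(uncovered_bots(robots, ["GPTBot", "OAI-SearchBot"]))
```

With this robots.txt the result contains only `OAI-SearchBot`, which mirrors the surprise described above: the GPTBot rule does not extend to a differently named agent.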

## The fix output

Every disagreement produces an AI fix prompt that describes the gap ("ai.txt disallows GPTBot but X-Robots-Tag on /blog/ returns `all`") and emits the exact patch. Drop the patch into your Netlify `_headers`, nginx config, Apache `.htaccess`, or robots.txt, deploy, re-audit.

If you're using the [AI Bot Policy Generator](/tools/ai-bot-policy-gen/) you should pass its output into this audit after deploy. Generate → deploy → audit is the full loop.

## Related reading

- [AI Bot Policy Generator](/tools/ai-bot-policy-gen/) — emit robots.txt + ai.txt + meta/header stance together
- [AI Crawler Access Auditor](/tools/ai-crawler-access-auditor/) — per-bot allow/block verdict with CDN detection
- [Robots/LLM Drift Diff](/tools/robots-llm-drift-diff/) — before/after drift detector

## Fact-check notes and sources

- Google meta robots + X-Robots-Tag: [developers.google.com/search/docs/crawling-indexing/robots-meta-tag](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag)
- RFC 9309 — Robots Exclusion Protocol: [rfc-editor.org/rfc/rfc9309.html](https://www.rfc-editor.org/rfc/rfc9309.html)
- OpenAI bot docs: [platform.openai.com/docs/bots](https://platform.openai.com/docs/bots)
- Cloudflare AI crawlers and bots: [developers.cloudflare.com/bots/concepts/ai-bots/](https://developers.cloudflare.com/bots/concepts/ai-bots/)
- ai.txt spec: [site-ai.org](https://site-ai.org/)

---

*The $100 Network approach: treat crawler-policy drift as a monitored-service problem, not a set-and-forget config. One monthly audit is cheaper than a quarter of unwanted training contribution.*

