The annoying truth about AI-crawler policy: it lives in four places and they can disagree silently.
- /robots.txt — the classic text file
- /ai.txt (and /.well-known/ai.txt) — training-specific declarations
- <meta name="robots"> — per-page HTML directive
- X-Robots-Tag — HTTP header set by your server or CDN
A crawler reading one source sees one policy. A crawler reading another sees another. Which one "wins" is crawler-specific and rarely documented. The only safe state is all four agree.
AI Posture Audit fetches all four, per named AI bot, and reports every disagreement.
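The comparison step can be sketched in a few lines, assuming the four signals have already been fetched and normalized to allow/disallow verdicts. Everything here (the source names, the majority heuristic) is illustrative, not the tool's actual API:

```python
# Hypothetical sketch: compare per-source verdicts for one bot.
# "allow"/"disallow" values are assumed to be normalized upstream.
SOURCES = ("robots_txt", "ai_txt", "meta_robots", "x_robots_tag")

def find_disagreements(signals: dict) -> list:
    """Return the sources that differ from the majority verdict."""
    verdicts = {s: signals.get(s, "allow") for s in SOURCES}  # missing => permissive
    values = list(verdicts.values())
    majority = max(set(values), key=values.count)
    return sorted(s for s, v in verdicts.items() if v != majority)

# Example: robots.txt blocks GPTBot, but the CDN injects a permissive header.
gptbot = {"robots_txt": "disallow", "ai_txt": "disallow",
          "meta_robots": "disallow", "x_robots_tag": "allow"}
print(find_disagreements(gptbot))  # -> ['x_robots_tag']
```

The only non-obvious choice is in the dict default: an absent source is treated as permissive, which mirrors how crawlers treat a missing file.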
The failure modes it catches
robots.txt disallows, X-Robots-Tag allows. You wrote a Disallow rule. Your CDN (Cloudflare, Akamai, CloudFront) has an edge rule that injects X-Robots-Tag: all on every response. The header wins for bots that prefer header directives.
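That specific conflict is detectable with the standard library alone. A sketch, with a made-up robots.txt and header value (real X-Robots-Tag values can also carry bot-specific prefixes, which this ignores):

```python
from urllib.robotparser import RobotFileParser

def robots_allows(robots_txt: str, agent: str, path: str) -> bool:
    """Apply robots.txt rules for one agent and path."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, path)

def header_allows(x_robots_tag: str) -> bool:
    # "all" or absence of noindex/none counts as permissive here.
    directives = {d.strip().lower() for d in x_robots_tag.split(",")}
    return not directives & {"noindex", "none"}

robots_txt = "User-agent: GPTBot\nDisallow: /\n"
header = "all"  # injected by an edge rule

if robots_allows(robots_txt, "GPTBot", "/blog/") != header_allows(header):
    print("MISMATCH: robots.txt and X-Robots-Tag disagree for GPTBot")
```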
ai.txt is missing. A missing ai.txt is not equivalent to a restrictive one — it's the same as a permissive one. If you meant to opt out of AI training, a missing file means you didn't.
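That asymmetry is easy to encode: absence must map to "permissive", never to "unknown". A sketch that assumes a robots-like ai.txt syntax (the status-code handling and policy labels are illustrative):

```python
def ai_txt_policy(status_code: int, body: str, bot: str) -> str:
    """Interpret an ai.txt fetch result for one bot (simplified sketch)."""
    if status_code == 404:
        return "allow"  # missing file == no opt-out == permissive
    if status_code != 200:
        return "unknown"  # server error: flag for re-check, not a verdict
    blocking = False
    applies = False
    for line in body.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            applies = value == "*" or value.lower() == bot.lower()
        elif key == "disallow" and applies and value == "/":
            blocking = True
    return "disallow" if blocking else "allow"

print(ai_txt_policy(404, "", "GPTBot"))  # -> allow
```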
Meta robots contradicts robots.txt. Your template injects <meta name="robots" content="index,follow"> on every page, including pages you disallowed in robots.txt. Some crawlers will still index the page if they find it via a link; meta robots is their next signal.
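Extracting the page-level signal needs no third-party parser; a stdlib sketch:

```python
from html.parser import HTMLParser

class MetaRobots(HTMLParser):
    """Collect the content of every <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives.append(a.get("content", ""))

page = '<html><head><meta name="robots" content="index,follow"></head></html>'
parser = MetaRobots()
parser.feed(page)
print(parser.directives)  # -> ['index,follow']
```

A real audit would also collect bot-specific variants (e.g. name="googlebot"), which this sketch skips.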
CDN challenge pages. Cloudflare's Bot Fight Mode returns a 403 challenge to bots that fail the JS test. For a crawler that does not execute JS this is effectively "blocked", but a naive audit that only looks at the HTML body can misread the challenge page as a soft success. The tool detects challenge pages and flags the ambiguity.
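Challenge detection boils down to marker heuristics. A sketch; the marker strings below are assumptions that vary by provider and change over time, so a match should be reported as "ambiguous", not as a definitive verdict:

```python
# Hypothetical markers suggesting the body is a CDN challenge page
# rather than real content.
CHALLENGE_MARKERS = (
    "just a moment",        # Cloudflare interstitial title
    "cf-challenge",
    "checking your browser",
)

def looks_like_challenge(status_code: int, body: str) -> bool:
    """Flag responses that are probably a bot challenge, not content."""
    lowered = body.lower()
    return status_code in (403, 503) and any(m in lowered for m in CHALLENGE_MARKERS)

html = "<title>Just a moment...</title>"
print(looks_like_challenge(403, html))  # -> True
```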
Server-specific X-Robots-Tag. Apache's mod_headers and nginx's add_header can set X-Robots-Tag: noindex at the server level. If deployed on a subdomain or path that wasn't meant to be blocked, you've quietly removed it from search indexes.
Why this matters specifically for AI crawlers
Search crawlers have been reading robots.txt since 1994 and the disagreement modes are well-catalogued. AI crawlers are new and each one treats the four sources differently:
- GPTBot (OpenAI) checks robots.txt first; honors Disallow for both crawl and training.
- Google-Extended is a "virtual" bot — it does not crawl but signals training opt-out via robots.txt. Cannot be controlled via ai.txt.
- ClaudeBot (Anthropic) checks robots.txt; training stance depends on the deployment (ClaudeBot vs ClaudeBot-User).
- PerplexityBot checks robots.txt and honors Disallow for crawl; has a separate agent for user-initiated fetches.
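The per-bot check against robots.txt alone can be sketched with urllib.robotparser. The rules below are illustrative, and note the Google-Extended subtlety: "disallow" here means "opted out of training", since that bot never crawls:

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ("GPTBot", "Google-Extended", "ClaudeBot", "PerplexityBot")

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

for bot in AI_BOTS:
    # No matching group and no "*" default means the bot is allowed.
    verdict = "allow" if rp.can_fetch(bot, "/") else "disallow"
    print(f"{bot}: {verdict}")
```

Running this prints disallow for GPTBot and Google-Extended and allow for ClaudeBot and PerplexityBot, which is exactly the kind of per-bot gap the audit surfaces.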
The audit output flags per-bot whether each signal is consistent across the four sources, so you know which bot has a mismatch and where.
When to run it
After any robots.txt, ai.txt, template, or CDN-config change. Server-config drift is the #1 silent source of accidental policy changes: you moved to a new hosting provider last month, their default X-Robots-Tag is different, and you had no reason to suspect anything changed.
Monthly, as hygiene. Cloudflare and Akamai push edge-rule updates, WAF providers update rulesets, and CMS upgrades change templates; any of those can inject headers. Once a month is cheap insurance.
Before any major AI-crawler launch. When OpenAI shipped OAI-SearchBot, sites that had blocked GPTBot but not OAI-SearchBot got a surprise. Audit before the bot goes live, not after.
The fix output
Every disagreement produces an AI fix prompt that describes the gap ("ai.txt disallows GPTBot but X-Robots-Tag on /blog/ returns all") and emits the exact patch. Drop the patch into your Netlify _headers, nginx config, Apache .htaccess, or robots.txt, deploy, re-audit.
If you're using the AI Bot Policy Generator, pass its output into this audit after deploy. Generate → deploy → audit is the full loop.
Related reading
- AI Bot Policy Generator — emit robots.txt + ai.txt + meta/header stance together
- AI Crawler Access Auditor — per-bot allow/block verdict with CDN detection
- Robots/LLM Drift Diff — before/after drift detector
Fact-check notes and sources
- Google meta robots + X-Robots-Tag: developers.google.com/search/docs/crawling-indexing/robots-meta-tag
- RFC 9309 — Robots Exclusion Protocol: rfc-editor.org/rfc/rfc9309.html
- OpenAI bot docs: platform.openai.com/docs/bots
- Cloudflare AI crawlers and bots: developers.cloudflare.com/bots/concepts/ai-bots/
- ai.txt spec: site-ai.org
The $100 Network approach: treat crawler-policy drift as a monitored-service problem, not a set-and-forget config. One monthly audit is cheaper than a quarter of unwanted training contribution.