
Does Your robots.txt Agree With Your ai.txt? The Silent AI-Crawler Misconfig Most Sites Have

A typical site tells AI crawlers four different things about whether they can read it. The files were added at different times by different people (or different IDE sessions), and nobody ever cross-referenced them. I discovered this running the new AI Posture Audit against half a dozen sites I thought were clean. Every single one had at least one disagreement.

The four sources, and — critically — what each one actually governs:

  • /robots.txt — governs CRAWLING (may the bot fetch the page?). The oldest, most widely honored directive. Covers User-agent: GPTBot, User-agent: ClaudeBot, etc.
  • <meta name="robots"> — a per-page INDEXING directive (may the bot keep and surface what it fetched?). Usually index, follow. A bot only sees it after crawling the page.
  • X-Robots-Tag response header — the same INDEXING directive, sent by the server. Can be bot-scoped: X-Robots-Tag: googlebot: noindex.
  • /ai.txt (root + /.well-known/ai.txt) — governs TRAINING-DATA USE (may the bot add this content to its training corpus?). From the Spawning spec.
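Two of these four sources travel with the page response itself. A minimal sketch (Python; the function and class names are mine, not from any audit tool) of extracting both from a fetched page:

```python
from html.parser import HTMLParser

class MetaRobots(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.values.append(a.get("content", ""))

def page_signals(html, headers):
    """Extract the per-page signals: meta robots and X-Robots-Tag.
    `headers` is the response-header dict; lookup is case-insensitive."""
    parser = MetaRobots()
    parser.feed(html)
    xrt = next((v for k, v in headers.items()
                if k.lower() == "x-robots-tag"), None)
    return {"meta_robots": parser.values, "x_robots_tag": xrt}
```

Feeding it a page whose meta and header disagree makes the mismatch visible immediately, before any per-bot interpretation.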

Missing this distinction is the common trap: people read robots.txt and ai.txt as if they're saying the same thing. They aren't. The canonical AEO-friendly, training-hostile pattern is:

robots.txt: User-agent: GPTBot → Allow: /          (yes, you can crawl me)
ai.txt:     User-Agent: GPTBot → Disallow: /       (no, don't train on me)

This combination is not a contradiction — it's a coherent policy. The bot fetches your page so ChatGPT can cite you when answering user queries (AEO win), but the page doesn't get ingested into GPT-5's next training run (IP protection). Audit tools that flag this as a "disagreement" are wrong — they're conflating the two directive types.

A real disagreement is when same-kind directives disagree: robots.txt says Allow: / for GPTBot but <meta name="robots"> says noindex. Both sit on the crawl/index side of the ledger; they must agree.
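That rule can be expressed mechanically. A toy detector (Python; the normalization here is a deliberate simplification I'm assuming for illustration, since each bot parses each source with its own rules):

```python
def normalize(signal):
    """Map one crawl/index signal to 'allow'/'deny' (None = signal absent).
    Deliberately naive: real parsers are bot- and source-specific."""
    if signal is None:
        return None
    s = signal.lower()
    if "noindex" in s or "disallow" in s or s.strip() == "none":
        return "deny"
    return "allow"

def crawl_conflict(robots_txt, meta_robots, x_robots_tag):
    """True when the same-kind signals point in different directions."""
    stances = {normalize(s) for s in (robots_txt, meta_robots, x_robots_tag)}
    stances.discard(None)  # an absent signal is not a disagreement
    return len(stances) > 1
```

For the example above, `crawl_conflict("Allow: /", "noindex, nofollow", None)` returns True: two present signals, two directions.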

The problem: each AI bot's documentation resolves conflicts differently.

  • OpenAI (GPTBot): checks robots.txt first. If GPTBot is blocked there, stops. Otherwise crawls — does not consult ai.txt directly per their current documentation.
  • Anthropic (ClaudeBot): checks robots.txt. Honors site-wide directives only; does not parse ai.txt natively.
  • Perplexity (PerplexityBot): checks robots.txt. Has historically respected ai.txt for training policy but not crawl policy.
  • Google-Extended: the sub-agent that controls whether Google's AI products (Gemini, AI Overviews) train on your content. Checks robots.txt. Treats absence of an explicit rule as "allow".
  • CCBot (Common Crawl): checks robots.txt. The 2024 spec update accepts ai.txt as equivalent guidance but falls back to robots.txt.
  • Bytespider (TikTok/ByteDance): checks robots.txt. Historically aggressive — known to ignore some directives per published research. Defensive sites explicitly Disallow: / it.

So if your robots.txt says Allow: / for GPTBot but your ai.txt says Disallow: /, GPTBot reads robots.txt and crawls; it never reads ai.txt. The policy itself is coherent, but the training opt-out in ai.txt goes unread by the very bot it targets. Your intent is sound, your deployment can't deliver it, and you won't see the gap unless you're actively monitoring crawl logs.
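That monitoring can be as simple as counting user-agent hits. A sketch (Python; LOG_LINES is a stand-in for lines read from your real access log, and the log format shown is only illustrative):

```python
# Stand-in for lines read from your server's access log.
LOG_LINES = [
    '203.0.113.7 - - "GET /post HTTP/1.1" 200 "GPTBot/1.2; +https://openai.com/gptbot"',
    '198.51.100.9 - - "GET / HTTP/1.1" 200 "Mozilla/5.0"',
]

# If GPTBot shows up here while your only block lives in ai.txt,
# the bot is crawling exactly as your robots.txt permits.
gptbot_hits = [line for line in LOG_LINES if "GPTBot" in line]
print(f"GPTBot requests: {len(gptbot_hits)}")  # → GPTBot requests: 1
```

Run the same count for each bot you think you've blocked; a nonzero count against a supposed block is the misconfiguration surfacing.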

The common failure modes I've seen

Failure mode 1: Crawl directives disagree (robots.txt ↔ meta robots ↔ X-Robots-Tag).

These three sources govern the same question: may the bot fetch the page, and may it use what it fetched? When they disagree for a specific bot, behavior is unpredictable, because each bot picks its own winner. This is the class of disagreement that actually matters.

Example: robots.txt says User-agent: GPTBot → Allow: /. The page's <meta name="robots"> says noindex, nofollow. GPTBot reads robots.txt, crawls the page, and may cite it in ChatGPT despite the noindex intent. Fix: align all three crawl sources per-bot.

Failure mode 2: X-Robots-Tag disagrees with HTML meta.

Server sends X-Robots-Tag: noindex but the HTML meta says index, follow. Google documents that the most restrictive signal wins when both are present (noindex here), but audit tools that read only the HTML miss the header entirely and report the page as indexable.

Failure mode 3: GPTBot-specific directive in robots.txt, not mirrored appropriately in ai.txt.

Separate, legitimate failure: if your intent is "allow GPTBot to crawl, disallow GPTBot from training on your content," both files need explicit GPTBot rules, not wildcard fall-through. A User-agent: GPTBot → Allow: / in robots.txt with User-Agent: * (no GPTBot block) in ai.txt means GPTBot falls back to the wildcard — which usually says Allow: / too. Your training-hostile intent isn't expressed. Fix: name GPTBot explicitly in both files, with their respective directive type (crawl in robots.txt, training in ai.txt).
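The fall-through is easy to reproduce. A simplified resolver (Python; real robots.txt matching also involves path rules and longest-match precedence, which this sketch deliberately ignores):

```python
def directive_for(policy_text, bot):
    """Resolve the effective allow/disallow for one bot in a
    robots.txt-style file: an explicitly named group wins; otherwise
    the bot falls through to the User-agent: * group."""
    groups, agents, rules = {}, [], []
    for raw in policy_text.splitlines():
        line = raw.split("#")[0].strip()
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:  # a new group starts; store the finished one
                for a in agents:
                    groups[a] = rules
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            rules.append(field)
    for a in agents:  # store the final group
        groups[a] = rules
    match = groups.get(bot.lower()) or groups.get("*") or [None]
    return match[0]
```

Against an ai.txt that contains only `User-Agent: *` / `Allow: /`, `directive_for(ai_txt, "GPTBot")` resolves to "allow": the wildcard wins because GPTBot is never named, which is exactly the silent fall-through described above.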

Non-failure: robots.txt allow + ai.txt disallow.

If the audit flags GPTBot with robots.txt=allow, ai.txt=disallow as a "disagreement" — that's the audit being wrong, not your configuration. That's the canonical AEO-friendly, training-hostile pattern. The AI Posture Audit now explicitly separates crawl columns from the training column in its matrix and treats this pattern as INFO (intentional) rather than WARN (conflict). Earlier versions of the tool conflated the two; that was a bug I fixed after catching it in my own site's audit. If you see "10 bots with conflicting signals" and every single one is the crawl-allow-train-deny pattern, your config is actually clean — the audit read it wrong.

What the audit does

The AI Posture Audit does exactly what this article's thesis suggests. Paste a URL; it fetches all four sources (robots.txt, ai.txt at root and .well-known/, page's meta robots, the response's X-Robots-Tag header) and renders a per-bot matrix.

For each of the 15+ known AI crawlers, the matrix shows what each source says. Disagreements are highlighted. The audit surfaces the count and produces a fix prompt you can hand to Claude or ChatGPT with the literal content of all four sources for it to reconcile.

The tool runs every check in the browser via the existing /.netlify/functions/fetch-page proxy. No OAuth, no API keys, no site data leaves your browser.

What "consistent" actually means

Consistency is easier to reason about than the four-file tangle suggests. Pick one stance per bot, and mirror it across every source:

Option A — allow all AI crawlers (open-web stance):

# robots.txt
User-agent: *
Allow: /
# ai.txt
User-Agent: *
Allow: /

No per-page meta robots noindex restrictions. No X-Robots-Tag header. The wildcard rule covers every bot; no bot-specific entries needed.

Option B — deny training-focused AI crawlers, allow answer-engine crawlers:

# robots.txt
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /

Same thing in ai.txt — every bot named in robots.txt must also be named in ai.txt with the same directive. Silent absence from ai.txt means "fall back to User-Agent: *" which is almost never what you want.
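For concreteness, the ai.txt mirror of Option B would name the same bots with the same stances (shown here as a sketch in the Spawning spec's robots.txt-like syntax):

```
# ai.txt
User-Agent: GPTBot
Disallow: /

User-Agent: Google-Extended
Disallow: /

User-Agent: CCBot
Disallow: /

User-Agent: anthropic-ai
Disallow: /

User-Agent: ClaudeBot
Allow: /

User-Agent: OAI-SearchBot
Allow: /

User-Agent: PerplexityBot
Allow: /

User-Agent: *
Allow: /
```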

Option C — deny all AI crawlers (training-hostile stance):

# robots.txt
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: *
Allow: /

Same bots listed in ai.txt with the same directive. Do not rely on wildcard fall-through.

The key principle: never rely on the fallback User-agent: * rule for a bot you care about. Always name the bot explicitly in both files, with the same directive.
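That principle is mechanically checkable. A sketch (Python; function names are mine) that diffs the explicitly named bots across the two files:

```python
def named_agents(policy_text):
    """User-agents explicitly named in a robots.txt- or ai.txt-style file."""
    agents = set()
    for raw in policy_text.splitlines():
        line = raw.split("#")[0].strip()
        if line.lower().startswith("user-agent:"):
            agents.add(line.partition(":")[2].strip().lower())
    return agents - {"*"}  # the wildcard is the fallback, not a named bot

def parity_gaps(robots_txt, ai_txt):
    """Bots named in one file that silently fall to the wildcard in the other."""
    r, a = named_agents(robots_txt), named_agents(ai_txt)
    return {
        "missing_from_ai_txt": sorted(r - a),
        "missing_from_robots_txt": sorted(a - r),
    }
```

Any bot appearing in either gap list is relying on wildcard fall-through somewhere, which is the situation the principle above forbids.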

The X-Robots-Tag header piece

If your server sends an X-Robots-Tag header, it must agree with the meta robots tag on every page the header covers. The header is usually set at the CDN or server layer:

# netlify.toml — applies X-Robots-Tag to every HTML response
[[headers]]
  for = "/*"
  [headers.values]
    X-Robots-Tag = "index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1"

If you then have a page with <meta name="robots" content="noindex"> because it's a draft, the header and meta disagree. Google respects the most-restrictive signal (noindex), but other crawlers behave less predictably.

Fix: either drop the X-Robots-Tag at the CDN level (rely on per-page meta) or scope the header to specific paths (for = "/blog/*").
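If you want to predict what Google will do with a mismatched pair before fixing it, the documented rule is "most restrictive wins". A sketch (Python; treating this as a general rule for other crawlers is an assumption, as the article notes):

```python
def effective_indexing(meta_robots, x_robots_tag):
    """Combine per-page meta robots with the X-Robots-Tag header.
    Google documents that the most restrictive signal wins; other
    crawlers may not follow the same rule."""
    signals = [s.lower() for s in (meta_robots, x_robots_tag) if s]
    if any("noindex" in s or s.strip() == "none" for s in signals):
        return "noindex"
    return "index"
```

For the draft-page example above, `effective_indexing("noindex", "index, follow, max-snippet:-1")` returns "noindex": the page stays out of Google's index even though the CDN header says index.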

The fix prompt

The audit emits a fix prompt that looks like this (abbreviated):

I ran an AI-posture parity audit on example.com and these four sources provide conflicting signals for the following bots: GPTBot (robots.txt=allow, ai.txt=disallow), Bytespider (robots.txt=disallow, ai.txt=allow). Task: for each, tell me which source wins per public documentation, then emit a unified configuration for robots.txt, ai.txt, meta robots, and X-Robots-Tag that expresses the same intent consistently.

Hand it to Claude or ChatGPT with the literal content of all four sources pasted in. Output: a clean, aligned configuration.

When the audit finds nothing

If the audit reports all bots aligned, that's the ideal — your deployment matches your intent for every bot. But the audit only reports on the ~15 bots it tests. If a new AI crawler launches tomorrow and your robots.txt has a wildcard allow, you're implicitly allowing it. The audit can't warn you about future bots; it can only warn about the bots that already exist.

Revisit the audit quarterly, or whenever you change any of the four sources. The cost is 30 seconds.

Related reading

  • ai.txt Generator — produces a consistent ai.txt file with defaults for every major bot
  • .well-known Audit — finds security.txt, agent-card.json, ai-plugin.json, and other identity files that bots consult
  • Mega Analyzer — runs this AI posture check automatically as part of the Indexing Hygiene tab
  • How llms.txt works, structurally — the companion file that tells AI engines where to find your structured content

Run the AI Posture Audit on your site right now. The cost is 30 seconds; the risk of NOT running it is an AI crawler behaving differently from what you think your configuration expresses.
