# Does Your robots.txt Agree With Your ai.txt? The Silent AI-Crawler Misconfig Most Sites Have

Four different sources tell AI crawlers whether they can read your site. When they disagree, each bot resolves the conflict its own way: GPTBot may honor robots.txt, Perplexity may honor ai.txt, and another crawler may only look at X-Robots-Tag. The audit you need takes 30 seconds and catches the misalignment most small-business sites have.

Author: J.A. Watte
Published: April 20, 2026
Source: https://jwatte.com/blog/blog-ai-posture-consistency/

---

A typical site tells AI crawlers four different things about whether they can read it. The files were added at different times by different people (or different IDE sessions), and nobody ever cross-referenced them. I discovered this running the new AI Posture Audit against half a dozen sites I thought were clean. Every single one had at least one disagreement.

The four sources, and, critically, what each one governs:

- **`/robots.txt`** governs **CRAWLING** (can the bot fetch the page?). It is the oldest and most widely honored directive, and it is where per-bot groups such as `User-agent: GPTBot` and `User-agent: ClaudeBot` live.
- **`<meta name="robots">`** sits on the **CRAWLING** side of the split and applies per page. Strictly, it controls indexing and link-following once the page has been fetched. Usually `index, follow`.
- **`X-Robots-Tag` response header** also sits on the **CRAWLING** side. It carries the same directives as meta robots but is sent by the server, and it can be bot-scoped: `X-Robots-Tag: googlebot: noindex`.
- **`/ai.txt` (root + `/.well-known/ai.txt`)** governs **TRAINING-DATA USE** (can the bot add this content to its training corpus?). It comes from the Spawning spec.

**This distinction is the common trap**: people read robots.txt and ai.txt as if they're saying the same thing. They aren't. The canonical AEO-friendly, training-hostile pattern is:

```
robots.txt: User-agent: GPTBot → Allow: /          (yes, you can crawl me)
ai.txt:     User-Agent: GPTBot → Disallow: /       (no, don't train on me)
```

This combination is **not a contradiction**; it is a coherent policy. The bot fetches your page so ChatGPT can cite you when answering user queries (AEO win), while your stated policy is that the page stays out of the next model training run (IP protection). Audit tools that flag this as a "disagreement" are wrong because they conflate the two directive types.

A **real** disagreement is when same-kind directives disagree: robots.txt says `Allow: /` for GPTBot but `<meta name="robots">` says `noindex`. Both are crawl directives; they must agree.

The problem: each AI bot's documentation resolves conflicts differently.

- **OpenAI (GPTBot)**: checks robots.txt first. If GPTBot is blocked there, it stops. Otherwise it crawls and does not consult ai.txt directly, per their [current documentation](https://platform.openai.com/docs/bots).
- **Anthropic (ClaudeBot)**: checks robots.txt. Honors site-wide directives only; does not parse ai.txt natively.
- **Perplexity (PerplexityBot)**: checks robots.txt. Has historically respected ai.txt for training policy but not crawl policy.
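- **Google-Extended**: a robots.txt product token rather than a separate crawler. It controls whether content Google has already crawled may be used for Gemini training and grounding; Google's documentation says it does not affect inclusion in Search. Checked in robots.txt. Treats absence of an explicit rule as "allow".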
- **CCBot (Common Crawl)**: checks robots.txt. The 2024 spec update accepts ai.txt as equivalent guidance but falls back to robots.txt.
- **Bytespider (TikTok/ByteDance)**: checks robots.txt. Historically aggressive, known to ignore some directives per [published research](https://www.darkreading.com/cybersecurity-operations/bytedance-bytespider-web-crawler-security). Defensive sites explicitly `Disallow: /` it.

So if your robots.txt says `Allow: /` for GPTBot but your ai.txt says `Disallow: /`, GPTBot reads robots.txt and crawls anyway; it never reads ai.txt. If your **intent** was to keep GPTBot out entirely, your **deployment** does not say so, and you won't see the mismatch unless you're actively monitoring crawl logs.
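
To see what a given robots.txt actually tells a bot, independent of anything in ai.txt, a minimal Python sketch using the standard library's `urllib.robotparser` can help. The sample file and the `crawl_verdict` helper are illustrative assumptions, not part of the audit tool, and the matching rules are Python's, which may differ in edge cases from a particular vendor's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def crawl_verdict(robots_txt, bot, url="https://example.com/"):
    """Return 'allow' or 'disallow' for one bot, based on robots.txt alone."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return "allow" if parser.can_fetch(bot, url) else "disallow"


for bot in ("GPTBot", "Bytespider"):
    print(bot, crawl_verdict(ROBOTS_TXT, bot))
```

A check like this only answers the crawl question. Nothing in it looks at ai.txt, which mirrors how a robots.txt-only bot sees the site.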

## The common failure modes I've seen

**Failure mode 1: Crawl directives disagree (robots.txt ↔ meta robots ↔ X-Robots-Tag).**

These three sources govern the same thing: can the bot fetch the page? When they disagree for a specific bot, behavior is unpredictable because each bot picks its own winner. This is the class of disagreement that matters.

Example: `robots.txt` says `User-agent: GPTBot → Allow: /`. The page's `<meta name="robots">` says `noindex, nofollow`. GPTBot reads robots.txt, crawls the page, and may cite it in ChatGPT despite the `noindex` intent. Fix: align all three crawl sources per-bot.

**Failure mode 2: X-Robots-Tag disagrees with HTML meta.**

Server sends `X-Robots-Tag: noindex` but the HTML meta says `index, follow`. Google applies the most restrictive directive it finds across the header and the meta tag, so the page is treated as `noindex`. Audit tools that read only the HTML never see the header and report the page as indexable.
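
The header-versus-meta comparison can be sketched in a few lines of Python with the standard library's `html.parser`. This is an assumption-laden illustration, not the audit's implementation: the helper names are hypothetical, and it only understands unscoped values such as `noindex`, not bot-scoped ones such as `googlebot: noindex`.

```python
from html.parser import HTMLParser


def split_directives(value):
    """Split a comma-separated robots value into lowercase directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class MetaRobotsParser(HTMLParser):
    """Collects directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            self.directives |= split_directives(attributes.get("content") or "")


def index_verdict(directives):
    """'unset' when nothing was declared, otherwise 'noindex' or 'index'."""
    if not directives:
        return "unset"
    return "noindex" if directives & {"noindex", "none"} else "index"


def compare_header_and_meta(header_value, html):
    """Compare the X-Robots-Tag header value with the page's meta robots."""
    parser = MetaRobotsParser()
    parser.feed(html)
    header = index_verdict(split_directives(header_value or ""))
    meta = index_verdict(parser.directives)
    conflict = "unset" not in (header, meta) and header != meta
    return {"header": header, "meta": meta, "conflict": conflict}


page = '<html><head><meta name="robots" content="index, follow"></head></html>'
print(compare_header_and_meta("noindex", page))
```

Passing `None` as the header value models a response with no X-Robots-Tag at all, which the sketch treats as "unset" rather than as a conflict.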

**Failure mode 3: GPTBot-specific directive in robots.txt, not mirrored appropriately in ai.txt.**

Separate, legitimate failure: if your intent is "allow GPTBot to crawl, disallow GPTBot from training on your content," both files need explicit GPTBot rules, not wildcard fall-through. A `User-agent: GPTBot → Allow: /` in robots.txt with `User-Agent: *` (no GPTBot block) in ai.txt means GPTBot falls back to the wildcard, which usually says `Allow: /` too. Your training-hostile intent isn't expressed. Fix: name GPTBot explicitly in both files, with their respective directive type (crawl in robots.txt, training in ai.txt).
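
Wildcard fall-through is easy to detect mechanically. The following is a rough Python sketch under one loud assumption: that ai.txt uses the same `User-Agent:` line syntax as robots.txt, as the examples in this article do. The function names are hypothetical.

```python
def named_agents(text):
    """Return the lowercase agent names that have their own User-agent line."""
    agents = set()
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, separator, value = line.partition(":")
        if separator and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def wildcard_fallthrough(bots, robots_txt, ai_txt):
    """Map each bot to the files where it is not named explicitly."""
    sources = {
        "robots.txt": named_agents(robots_txt),
        "ai.txt": named_agents(ai_txt),
    }
    report = {}
    for bot in bots:
        missing = [name for name, agents in sources.items() if bot.lower() not in agents]
        if missing:
            report[bot] = missing
    return report


robots_txt = "User-agent: GPTBot\nAllow: /\n\nUser-agent: *\nAllow: /\n"
ai_txt = "User-Agent: *\nAllow: /\n"
print(wildcard_fallthrough(["GPTBot"], robots_txt, ai_txt))
```

An empty report means every bot in the list is named in both files; it says nothing about whether the directives themselves express your intent.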

**Non-failure: robots.txt allow + ai.txt disallow.**

If the audit flags GPTBot with `robots.txt=allow, ai.txt=disallow` as a "disagreement", the audit is wrong, not your configuration. That's the canonical AEO-friendly, training-hostile pattern. The [AI Posture Audit](/tools/ai-posture-audit/) now explicitly separates crawl columns from the training column in its matrix and treats this pattern as `INFO` (intentional) rather than `WARN` (conflict). Earlier versions of the tool conflated the two; that was a bug I fixed after catching it in my own site's audit. If you see "10 bots with conflicting signals" and every single one is the crawl-allow-train-deny pattern, your config is clean and the audit read it wrong.
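
The separation between crawl columns and the training column can be expressed as a small classifier. This Python sketch is not the audit's source code; the `BotSignals` type, the normalized `"allow"`/`"disallow"` values, and the catch-all `"OK"` result for combinations the article does not discuss are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotSignals:
    """Normalized per-bot signals; None means the source says nothing."""
    robots_txt: Optional[str] = None
    meta_robots: Optional[str] = None
    x_robots_tag: Optional[str] = None
    ai_txt: Optional[str] = None


def classify(signals):
    """WARN when crawl sources disagree, INFO for crawl-allow plus train-deny."""
    crawl_values = {
        value
        for value in (signals.robots_txt, signals.meta_robots, signals.x_robots_tag)
        if value is not None
    }
    if len(crawl_values) > 1:
        return "WARN"  # same-kind directives disagree
    if crawl_values == {"allow"} and signals.ai_txt == "disallow":
        return "INFO"  # intentional AEO-friendly, training-hostile pattern
    return "OK"  # assumption: everything else is treated as aligned


print(classify(BotSignals(robots_txt="allow", ai_txt="disallow")))
print(classify(BotSignals(robots_txt="allow", meta_robots="disallow")))
```

The point of the design is that ai.txt never participates in the `WARN` decision; only the three crawl-side sources are compared against each other.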

## What the audit does

The [AI Posture Audit](/tools/ai-posture-audit/) does exactly what this article's thesis suggests. Paste a URL; it fetches all four sources (robots.txt, ai.txt at root and `.well-known/`, page's meta robots, the response's X-Robots-Tag header) and renders a per-bot matrix.

For each of the 15+ known AI crawlers, the matrix shows what each source says. Disagreements are highlighted. The audit surfaces the count and produces a fix prompt you can hand to Claude or ChatGPT with the literal content of all four sources for it to reconcile.

The tool fetches each source through the existing `/.netlify/functions/fetch-page` proxy and runs every check in the browser. No OAuth, no API keys, and beyond those proxied fetches of public URLs, no site data leaves your browser.

## What "consistent" actually means

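Consistency is easier to reason about than the four-file tangle suggests. Pick one stance per bot, then express it in every source: the three crawl sources must agree with each other, and ai.txt must name the same bots with the training directive you intend.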

**Option A, allow all AI crawlers (open-web stance):**
```
# robots.txt
User-agent: *
Allow: /
```
```
# ai.txt
User-Agent: *
Allow: /
```
No per-page meta robots noindex restrictions. No X-Robots-Tag header. The wildcard rule covers every bot; no bot-specific entries needed.

**Option B, deny training-focused AI crawlers, allow answer-engine crawlers:**
```
# robots.txt
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

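# Assumption: ClaudeBot is treated here as an answer-engine crawler.
# Anthropic's documentation describes it as collecting content that may be
# used for training, so switch this to Disallow if that is what you want to block.
User-agent: ClaudeBot
Allow: /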

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
```
Mirror the bot list in ai.txt: every bot named in robots.txt should also be named in ai.txt, each with the training directive you intend for it. A bot that is absent from ai.txt falls back to `User-Agent: *`, which is almost never what you want.

**Option C, deny all AI crawlers (training-hostile stance):**
```
# robots.txt
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: *
Allow: /
```
Same bots listed in ai.txt with the same directive. Do not rely on wildcard fall-through.

The key principle: **never rely on the fallback `User-agent: *` rule for a bot you care about**. Always name the bot explicitly in both files, using the crawl directive you intend in robots.txt and the training directive you intend in ai.txt.
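
One way to guarantee that both files name the same bots is to generate them from a single policy table. The Python sketch below is a hypothetical illustration: the `POLICY` entries are example stances rather than recommendations, and it assumes ai.txt accepts the same `Allow`/`Disallow` syntax shown in this article.

```python
# Hypothetical per-bot policy: one crawl stance and one training stance each.
POLICY = {
    "GPTBot": {"crawl": "allow", "train": "disallow"},
    "Bytespider": {"crawl": "disallow", "train": "disallow"},
    "*": {"crawl": "allow", "train": "allow"},
}


def render(policy, kind, agent_label):
    """Render one file; kind is 'crawl' (robots.txt) or 'train' (ai.txt)."""
    blocks = []
    for bot, stance in policy.items():
        directive = "Allow" if stance[kind] == "allow" else "Disallow"
        blocks.append(f"{agent_label}: {bot}\n{directive}: /")
    return "\n\n".join(blocks) + "\n"


robots_txt = render(POLICY, "crawl", "User-agent")
ai_txt = render(POLICY, "train", "User-Agent")
print(robots_txt)
print(ai_txt)
```

Because both files are rendered from the same dictionary, a bot cannot be named in one file and silently fall through to the wildcard in the other.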

## The X-Robots-Tag header piece

If you serve an X-Robots-Tag header, it must agree with the meta robots tag on every page the header covers. The header is usually set at the CDN or server layer:

```toml
# netlify.toml: applies X-Robots-Tag to every response matched by "/*", not only HTML
[[headers]]
  for = "/*"
  [headers.values]
    X-Robots-Tag = "index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1"
```

If you then have a page with `<meta name="robots" content="noindex">` because it's a draft, the header and meta disagree. Google respects the most-restrictive signal (noindex), but other crawlers behave less predictably.

Fix: either drop the CDN-level X-Robots-Tag and rely on per-page meta, or scope the header to specific paths (`for = "/blog/*"`).

## The fix prompt

The audit emits a fix prompt that looks like this (abbreviated):

> I ran an AI-posture parity audit on example.com and these four sources provide conflicting signals for the following bots: GPTBot (robots.txt=allow, ai.txt=disallow), Bytespider (robots.txt=disallow, ai.txt=allow). Task: for each, tell me which source wins per public documentation, then emit a unified configuration for robots.txt, ai.txt, meta robots, and X-Robots-Tag that expresses the same intent consistently.

Hand it to Claude or ChatGPT with the literal content of all four sources pasted in. Output: a clean, aligned configuration.

## When the audit finds nothing

If the audit reports all bots aligned, that's the ideal: your deployment matches your intent for every bot. But the audit only reports on the bots in the shared registry at `/js/ai-bots.js`, which lists the AI, search, and data crawlers with a published User-Agent string that have been added so far. If a new vendor launches a bot tomorrow and your robots.txt has a wildcard allow, you're implicitly allowing it. The audit can't warn you about future bots; it can only warn about the ones in the registry today. New bots get added to the registry as vendors publish them, and every tool on the site picks the addition up automatically.
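
The implicit-allow risk can be probed directly by asking a robots.txt parser about a user agent that no file names. This is a small Python sketch using `urllib.robotparser`; the probe name `hypothetical-future-bot` and both sample files are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def admits_unlisted_bot(robots_txt, probe_agent="hypothetical-future-bot"):
    """True if a bot that matches no named group would be allowed to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(probe_agent, "https://example.com/")


# Named bots are denied, but the wildcard still allows everything else.
DENY_LISTED = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Shown only for contrast: a wildcard disallow also blocks ordinary search crawlers.
DENY_BY_DEFAULT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /
"""

print(admits_unlisted_bot(DENY_LISTED))
print(admits_unlisted_bot(DENY_BY_DEFAULT))
```

The first file admits the unlisted bot and the second does not, which is the gap a registry-based audit cannot see until the new bot is added.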

Revisit the audit quarterly, or whenever you change any of the four sources. The cost is 30 seconds.

## Related reading

- [ai.txt Generator](/tools/ai-txt-gen/): produces a consistent ai.txt file with defaults for every major bot
- [.well-known Audit](/tools/well-known-audit/): finds security.txt, agent-card.json, ai-plugin.json, and other identity files that bots consult
- [Mega Analyzer](/tools/mega-analyzer/): runs this AI posture check automatically as part of the Indexing Hygiene tab
- [How llms.txt works, structurally](/blog/blog-llms-txt-structure-spec/): the companion file that tells AI engines where to find your structured content

## Fact-check notes and sources

- OpenAI bot documentation: [platform.openai.com/docs/bots](https://platform.openai.com/docs/bots)
- Anthropic crawler documentation: [support.anthropic.com, Does Anthropic crawl data from the web, and how can site owners block the crawler?](https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler)
- Google crawler overview: [developers.google.com/search/docs/crawling-indexing/overview-google-crawlers](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) (includes Google-Extended documentation)
- Perplexity bot documentation: [docs.perplexity.ai/guides/bots](https://docs.perplexity.ai/guides/bots)
- Spawning ai.txt spec: [site-spawn.org/ai.txt](https://site-spawn.org)
- X-Robots-Tag syntax reference: [developers.google.com/search/docs/crawling-indexing/robots-meta-tag](https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag)

---

*Run the [AI Posture Audit](/tools/ai-posture-audit/) on your site right now. The cost is 30 seconds; the risk of NOT running it is an AI crawler behaving differently from what you think your configuration expresses.*


