# Your AI Bot Policy Lives In Three Files, And They Probably Disagree

Your robots.txt disallows GPTBot. Your ai.txt allows ChatGPT-User. Your meta robots says index,follow. Crawlers pick the most permissive interpretation and you accidentally opt into training you thought you'd blocked. One tool emits all three files, consistent with each other, with a stance for each of 22+ named AI crawlers.

Author: J.A. Watte
Published: May 12, 2026
Source: https://jwatte.com/blog/blog-tool-ai-bot-policy-gen/

---

Most sites that think they have an "AI bot policy" actually have three half-configured files that contradict each other. A common real state:

- `robots.txt` disallows GPTBot (copy-pasted from a blog post in 2023)
- `ai.txt` does not exist, so the default is "opt in to everything"
- `/.well-known/ai.txt` also does not exist
- `<meta name="robots" content="index,follow">` has no AI-specific directives

A crawler hitting the site reads the most permissive signal. You think you blocked OpenAI training. You did not.

[AI Bot Policy Generator](/tools/ai-bot-policy-gen/) emits all three files from one set of per-bot decisions, so they agree after deploy.

## What the tool covers

The tool covers 22+ named AI crawlers. Each gets an explicit **crawl** stance (allowed to fetch pages) and an explicit **training** stance (allowed to use content for model training). The two are different: Google-Extended lets you opt out of training while still letting Googlebot crawl for search. Separating them is the whole point.
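One way to picture the two-axis decision is a small record per bot. The sketch below is illustrative only: `Stance`, `BotPolicy`, and the sample stances are hypothetical names and values, not the tool's internal data model.

```python
from dataclasses import dataclass
from enum import Enum


class Stance(Enum):
    ALLOW = "allow"
    DISALLOW = "disallow"


@dataclass(frozen=True)
class BotPolicy:
    """One decision per bot, with crawl and training kept separate."""
    user_agent: str
    crawl: Stance
    training: Stance


# Sample decisions (assumed for illustration, not recommendations).
POLICIES = [
    BotPolicy("GPTBot", crawl=Stance.DISALLOW, training=Stance.DISALLOW),
    BotPolicy("Google-Extended", crawl=Stance.ALLOW, training=Stance.DISALLOW),
    BotPolicy("PerplexityBot", crawl=Stance.ALLOW, training=Stance.ALLOW),
]


def blocked_on(axis: str, policies: list[BotPolicy]) -> list[str]:
    """Return the user agents disallowed on the given axis ('crawl' or 'training')."""
    return [p.user_agent for p in policies if getattr(p, axis) is Stance.DISALLOW]
```

Keeping the two stances as separate fields means a "crawl yes, train no" decision is a normal record rather than a special case.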

The named bots include: GPTBot (OpenAI), ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, Claude-User, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, Bytespider, Amazonbot, CCBot, MistralAI-User, Diffbot, FacebookBot, Meta-ExternalAgent, YouBot, Bingbot (with IndexNow), Yandex, DuckAssistBot, and more.

## Three files, one source of truth

**robots.txt** — the classic. Written as `User-agent: <bot> / Disallow: <path>` blocks. Honored by well-behaved crawlers, which is most major AI bots.
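As a rough sketch of the robots.txt side, the snippet below renders one `User-agent` block per bot from a dictionary of crawl decisions, then checks the result with Python's standard `urllib.robotparser`. The helper names and the sample decisions are assumptions for illustration; the tool's real output may be laid out differently.

```python
from urllib.robotparser import RobotFileParser


def render_robots_txt(crawl_allowed: dict[str, bool]) -> str:
    """Render one User-agent block per bot, separated by blank lines."""
    blocks = []
    for agent, allowed in crawl_allowed.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        blocks.append(f"User-agent: {agent}\n{rule}\n")
    return "\n".join(blocks)


def may_fetch(robots_txt: str, agent: str, url: str = "https://example.com/page") -> bool:
    """Parse robots.txt text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical decisions: block GPTBot, allow ClaudeBot.
ROBOTS_TXT = render_robots_txt({"GPTBot": False, "ClaudeBot": True})
```

Note that an agent with no matching block is allowed by default, which is why bots you have never heard of need an explicit entry if you want them blocked.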

**ai.txt** — the site-ai.org spec. A plain-text declaration of training preferences, scoped to AI training use. Emerging standard; [site-ai.org](https://site-ai.org/) is the source.

**/.well-known/ai.txt** — same content as /ai.txt but at the RFC 8615 well-known-URIs location. Some crawlers check one path, some the other, and some both. Emitting at both eliminates the guess.
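Writing the same body to both locations is simple to script. The sketch below assumes a static-site output directory and treats the ai.txt body as an opaque string; it does not attempt to reproduce the site-ai.org syntax, and `write_ai_txt` is a hypothetical helper name.

```python
from pathlib import Path


def write_ai_txt(site_root: Path, body: str) -> list[Path]:
    """Write identical content to /ai.txt and /.well-known/ai.txt under site_root."""
    targets = [site_root / "ai.txt", site_root / ".well-known" / "ai.txt"]
    for target in targets:
        # Create .well-known/ if the build output does not have it yet.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(body, encoding="utf-8")
    return targets
```

Generating both files from one string in one step is what keeps them from drifting apart on the next edit.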

The tool also emits `<meta name="robots">` and `X-Robots-Tag` hints so server-level headers agree with the text files.
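A quick way to sanity-check that the page-level tag and the server header say the same thing is to compare their directive sets. This is a minimal sketch using only the standard library; it assumes a plain comma-separated `X-Robots-Tag` value and ignores the optional per-user-agent prefix form, and the function names are hypothetical.

```python
from html.parser import HTMLParser


def split_directives(value: str) -> set[str]:
    """Turn 'index, follow' into {'index', 'follow'}."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


class _MetaRobots(HTMLParser):
    """Collects directives from every <meta name="robots"> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives |= split_directives(attr.get("content") or "")


def meta_matches_header(html: str, x_robots_tag: str) -> bool:
    """True when the meta robots tag and the header carry the same directives."""
    parser = _MetaRobots()
    parser.feed(html)
    return parser.directives == split_directives(x_robots_tag)
```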

## Opt-out semantics are not uniform

An important gotcha: some AI companies treat a `Disallow:` rule as both a crawl and a training opt-out (OpenAI's GPTBot). Others expose the training opt-out as a separate signal: Google-Extended is explicitly "training-only", so disallowing it leaves Googlebot's search crawling untouched. Reading each company's docs is the only way to know which mode applies. The tool's output links to each company's official bot documentation so you can verify.
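The difference can be written down as a lookup from user agent to what a `Disallow:` means for it. The sketch below only encodes the two cases named above; every other agent is deliberately reported as unknown, and the mode labels and helper names are made up for illustration.

```python
# What a Disallow rule means per agent, limited to the two cases described
# in the text. Anything else stays unknown until you read the vendor's docs.
DISALLOW_MEANING = {
    "GPTBot": "crawl+training",
    "Google-Extended": "training-only",
}

UNKNOWN = "unknown: check vendor docs"


def disallow_effect(user_agent: str) -> str:
    """Return the documented effect of disallowing this agent, if we have one."""
    return DISALLOW_MEANING.get(user_agent, UNKNOWN)


def needs_review(user_agents: list[str]) -> list[str]:
    """List the agents whose opt-out semantics have not been verified yet."""
    return [ua for ua in user_agents if disallow_effect(ua) == UNKNOWN]
```

Defaulting to "unknown" rather than guessing keeps the table honest: an entry only appears once someone has read that company's documentation.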

## After deploy, audit for drift

Emit the three files, deploy, then run [AI Posture Audit](/tools/ai-posture-audit/) to confirm all three agree. If your CDN rewrites robots.txt or injects an X-Robots-Tag header, the audit catches the disagreement before a crawler does.
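For a sense of what a drift check involves, here is a minimal sketch that compares the intended crawl decisions against the text of the files as actually served. It is not how AI Posture Audit works internally; `find_drift` and its inputs are hypothetical, and fetching the live files is left out so the logic stays testable offline.

```python
from urllib.robotparser import RobotFileParser


def find_drift(
    expected_crawl: dict[str, bool],
    robots_txt: str,
    ai_txt: str,
    well_known_ai_txt: str,
) -> list[str]:
    """Return human-readable mismatches between intent and served files."""
    problems = []

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    for agent, allowed in expected_crawl.items():
        served = parser.can_fetch(agent, "/")
        if served != allowed:
            problems.append(
                f"robots.txt: {agent} expected allowed={allowed}, served allowed={served}"
            )

    # The two ai.txt locations should carry the same content.
    if ai_txt.strip() != well_known_ai_txt.strip():
        problems.append("/ai.txt and /.well-known/ai.txt differ")

    return problems
```

A CDN that replaces your robots.txt with a blanket `User-agent: *` / `Allow: /` would show up here as a GPTBot mismatch, even though the file in your repository is correct.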

Quarterly re-runs are the right cadence. New major AI bots ship every few months; your policy should name them before they start crawling with no rule to follow.

## The "I'm a solo publisher, do I actually need this" answer

If you publish anything with revenue implications (product pages, book pages, paid service pages, a newsletter signup), yes. AI-generated summaries of your pages can reduce downstream click-through. You get to decide which crawlers get to contribute to that summarization. The three files are how that decision is communicated.

If you're pure ad-supported and every eyeball is valuable, lean permissive. If you sell a book or a product where the sale requires the buyer landing on your page, lean restrictive on the training axis while staying permissive on the crawl axis. The tool separates those two decisions per bot.

## Related reading

- [AI Posture Audit](/tools/ai-posture-audit/) — verify the three files agree after deploy
- [AI Crawler Access Auditor](/tools/ai-crawler-access-auditor/) — per-bot allow/block verdict with CDN challenge detection
- [ai.txt Generator](/tools/ai-txt-gen/) — the simpler single-file predecessor
- [Robots/LLM Drift Diff](/tools/robots-llm-drift-diff/) — before/after diff on your three files

## Fact-check notes and sources

- OpenAI bot docs: [platform.openai.com/docs/bots](https://platform.openai.com/docs/bots)
- Anthropic ClaudeBot docs: [docs.anthropic.com](https://docs.anthropic.com/en/docs/agents-and-tools/claude-for-chrome/browsing-tools-and-claudebot)
- Google crawlers reference: [developers.google.com/search/docs/crawling-indexing/overview-google-crawlers](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers)
- ai.txt spec: [site-ai.org](https://site-ai.org/)
- RFC 8615 — Well-Known URIs: [datatracker.ietf.org/doc/html/rfc8615](https://datatracker.ietf.org/doc/html/rfc8615)
- RFC 9309 — Robots Exclusion Protocol: [rfc-editor.org/rfc/rfc9309.html](https://www.rfc-editor.org/rfc/rfc9309.html)

---

*The $100 Network treats AI bot policy as part of publisher infrastructure, not an afterthought. The three-file policy is the bedrock; everything downstream (Perplexity citations, appearance in Google AI Overviews, LLM training contribution) branches from it.*


