
Your AI Bot Policy Lives In Three Files, And They Probably Disagree


Most sites that think they have an "AI bot policy" actually have three half-configured files that contradict one another. A common real-world state:

  • robots.txt disallows GPTBot (copy-pasted from a blog post in 2023)
  • ai.txt does not exist, so no training preference is declared — the effective default is fully permissive
  • /.well-known/ai.txt also does not exist
  • <meta name="robots" content="index,follow"> has no AI-specific directives

A crawler hitting the site follows whichever signal it checks, and for most bots that signal is missing or permissive. You think you have an AI policy. You have one directive aimed at one bot.

AI Bot Policy Generator emits all three files from one set of per-bot decisions, so they agree after deploy.
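The one-source-of-truth idea can be sketched in a few lines. The decision table, bot stances, and `emit_robots_txt` helper below are illustrative, not the tool's actual implementation:

```python
# Sketch: every emitted file derives from one per-bot decision table.
# Bot names are real user-agent strings; the policy shape is illustrative.
DECISIONS = {
    "GPTBot":        {"crawl": False, "train": False},
    "PerplexityBot": {"crawl": True,  "train": True},
    "CCBot":         {"crawl": True,  "train": False},
}

def emit_robots_txt(decisions):
    """Render each bot's crawl stance as robots.txt blocks.

    Training-only tokens (e.g. Google-Extended) need vendor-specific
    handling, and the training stance proper goes into ai.txt.
    """
    blocks = []
    for bot, d in decisions.items():
        rule = "Allow: /" if d["crawl"] else "Disallow: /"
        blocks.append(f"User-agent: {bot}\n{rule}")
    return "\n\n".join(blocks) + "\n"

print(emit_robots_txt(DECISIONS))
```

Because ai.txt and /.well-known/ai.txt would be rendered from the same `DECISIONS` table, the three outputs cannot drift apart at generation time — only at deploy time, which is what the audit step later catches.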

What the tool covers

22+ named AI crawlers. Each gets an explicit crawl stance (allowed to fetch pages) and an explicit training stance (allowed to use content for model training). The two are different — Google-Extended lets you opt out of training while still letting Googlebot crawl for search. Separating them is the whole point.
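In robots.txt terms, that split looks like this (an illustrative fragment; Google documents Google-Extended as a control token for AI training/grounding, not a separate crawler):

```
# Keep search crawling open
User-agent: Googlebot
Allow: /

# Opt out of AI training / grounding
User-agent: Google-Extended
Disallow: /
```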

The named bots include: GPTBot (OpenAI), ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, ClaudeBot-User, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, Bytespider, Amazonbot, CCBot, MistralAI-User, Diffbot, FacebookBot, Meta-ExternalAgent, YouBot, Bingbot (with IndexNow), Yandex, DuckAssistBot, and more.

Three files, one source of truth

robots.txt — the classic. Written as User-agent: <bot> / Disallow: <path> blocks. Honored by well-behaved crawlers, which includes most major AI bots.

ai.txt — the site-ai.org spec. A plain-text declaration of training preferences, scoped to AI training use. Emerging standard; site-ai.org is the source.

/.well-known/ai.txt — same content as /ai.txt but at the RFC 8615 well-known URIs location. Some crawlers check one path, some the other, and some both. Emitting both eliminates the guesswork.

The tool also emits <meta name="robots"> and X-Robots-Tag hints so server-level headers agree with the text files.
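Those hints have roughly the following shape. The values shown are illustrative; `noai` and `noimageai` are nonstandard directives that only some crawlers recognize, so treat the text files as the primary signal and verify per bot:

```html
<!-- page-level hint in the document <head> -->
<meta name="robots" content="noai, noimageai">
```

```
# server-level equivalent, sent as an HTTP response header
X-Robots-Tag: noai, noimageai
```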

Opt-out semantics are not uniform

An important gotcha: some AI companies honor the Disallow: directive as both crawl and training opt-out (OpenAI's GPTBot). Others treat training opt-out as a separate signal (Google-Extended is explicitly "training-only"). Reading each company's docs is the only way to know which mode applies. The tool's output links to each company's official bot documentation so you can verify.

After deploy, audit for drift

Emit the three files, deploy, then run AI Posture Audit to confirm all three agree. If your CDN rewrites robots.txt or injects an X-Robots-Tag header, the audit catches the disagreement before a crawler does.
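A minimal version of that drift check fits in the standard library. This sketch compares the parsed stance per bot between two robots.txt bodies (here hardcoded strings standing in for the origin and CDN-edge responses):

```python
import urllib.robotparser

def stance(robots_txt, bot):
    """True if `bot` may fetch "/" under this robots.txt text."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(bot, "/")

origin = "User-agent: GPTBot\nDisallow: /\n"
# Simulated CDN rewrite that silently drops the Disallow rule
edge = "User-agent: GPTBot\nAllow: /\n"

for bot in ["GPTBot", "ClaudeBot"]:
    if stance(origin, bot) != stance(edge, bot):
        print(f"drift: {bot} differs between origin and edge")
```

In a real audit you would fetch both bodies over HTTP and loop over the full bot list; the comparison logic stays the same.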

Quarterly re-runs are the right cadence. New major AI bots ship every few months; your policy should cover them before they crawl your site unconstrained.

The "I'm a solo publisher, do I actually need this" answer

If you publish anything with revenue implications (product pages, book pages, paid service pages, a newsletter signup), yes. AI-generated summaries of your pages can reduce downstream click-through. You decide which crawlers may contribute to that summarization, and the three files are how that decision is communicated.

If you're pure ad-supported and every eyeball is valuable, lean permissive. If you sell a book or a product where the sale requires the buyer landing on your page, lean restrictive on the training axis while staying permissive on the crawl axis. The tool separates those two decisions per bot.

The $100 Network treats AI bot policy as part of publisher infrastructure, not an afterthought. The three-file policy is the bedrock; everything downstream (Perplexity citations, Google AIO appearance, LLM training contribution) branches from it.
