The robots.txt protocol is "RFC 9309 with vendor-specific extensions and undocumented quirks." Each major LLM crawler implements its own parser. The differences are real:
- GPTBot follows most-specific-match (longest path wins).
- ClaudeBot follows last-match-wins (the directive ordering matters).
- PerplexityBot has documented quirks around wildcards in certain patterns.
- Google-Extended inherits Googlebot's parser (reliable, well-documented).
- CCBot (Common Crawl) follows RFC 9309 strictly, with no vendor extensions.
The same robots.txt produces different access maps depending on which bot is reading it. A site that thinks it's blocked GPTBot may have left a hole open to ClaudeBot, or vice versa. Most sites ship one robots.txt and assume "block all LLMs" means "block all LLMs."
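The divergence is easy to reproduce. Here is a minimal sketch of the two matching strategies — simplified models for illustration, not any vendor's actual parser code:

```python
# Hypothetical rule list: the Allow appears BEFORE the broader Disallow,
# so the matching strategy alone decides the verdict for /blog/post-1.
rules = [("allow", "/blog/post-1"), ("disallow", "/blog/")]

def most_specific(rules, path):
    """RFC 9309 / Google style: the longest matching pattern wins
    (Allow preferred on a length tie)."""
    matches = [(len(pat), verdict == "allow") for verdict, pat in rules
               if path.startswith(pat)]
    if not matches:
        return "allow"                    # no rule matched: crawling permitted
    return "allow" if max(matches)[1] else "disallow"

def last_match(rules, path):
    """Last-match style: the final directive in source order that matches wins."""
    verdict_out = "allow"                 # default when nothing matches
    for verdict, pat in rules:
        if path.startswith(pat):
            verdict_out = verdict
    return verdict_out
```

A most-specific parser reads this rule set as "post-1 is allowed"; a last-match parser still blocks it. Same file, opposite access decisions — that is the drift.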
What the Robots LLM Drift Simulator does
You paste your robots.txt content + a list of test paths. The tool:
- Parses the directives into per-UA rule groups.
- Simulates each rule group against 11 LLM bot UAs (GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, PerplexityBot, Google-Extended, Googlebot, Bingbot, Applebot-Extended, CCBot).
- Applies each bot's specific interpretation rules (most-specific vs last-match, wildcard support, suffix-anchor support).
- Builds a paths × bots matrix showing each bot's allow/block decision for each path.
- Highlights drift rows — paths where bots disagree.
- Emits an AI prompt with a rewritten robots.txt that produces consistent decisions.
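The matrix step above reduces to a nested loop over bots and paths. A toy sketch with stubbed decision functions (bot names from this article, but the per-bot logic here is invented for illustration — the real tool applies each bot's actual interpretation rules):

```python
# Stub decision functions standing in for real per-bot interpreters.
BOTS = {
    "GPTBot":    lambda path: "block" if path.startswith("/private/") else "allow",
    "CCBot":     lambda path: "block" if path.startswith("/private/") else "allow",
    "ClaudeBot": lambda path: "allow",   # pretend its parser dropped the rule
}

def build_matrix(paths):
    """Return {path: {"decisions": {bot: verdict}, "drift": bool}}."""
    matrix = {}
    for path in paths:
        decisions = {bot: decide(path) for bot, decide in BOTS.items()}
        matrix[path] = {
            "decisions": decisions,
            "drift": len(set(decisions.values())) > 1,  # any disagreement?
        }
    return matrix

matrix = build_matrix(["/", "/private/keys"])
```

A drift row is simply any path where the set of decisions has more than one distinct value.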
The five interpretation rules that drift
1. Most-specific vs last-match. When two directives both apply, do you take the longest path (Google's RFC 9309 behavior) or the last directive in source order (some bots' behavior)? Different bots, different choice.
2. Wildcard support. Disallow: /api/* works in some parsers, not in others. The original 1994 robots.txt convention had no wildcards; RFC 9309 standardized * as a match-anything character, and Google + most LLMs support it. Perplexity is documented as inconsistent.
3. Suffix anchors ($). Disallow: /*.pdf$ works in Google + GPTBot + Anthropic. The original 1994 convention didn't define $; RFC 9309 does, as an end-of-match anchor, but not every parser honors it.
4. UA-specific group inheritance. When a bot has a UA-specific block, does it ALSO honor * rules, or only its own block? Most bots honor only their own. A few (older parsers) inherit from *.
5. Comment + whitespace handling. Blank lines, tabs, mid-line comments — slight differences in tokenization between parsers can cause one bot to see a directive that another silently drops.
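Rules 2 and 3 come down to how a pattern gets compiled. For parsers that do support * and $, a pattern translator looks roughly like this (a simplified sketch — real parsers also normalize percent-encoding and handle more edge cases):

```python
import re

def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern: * matches any run of characters,
    a trailing $ anchors the end of the path, everything else is literal.
    Matching is anchored at the start of the path (prefix semantics)."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

pdf_rule = pattern_to_regex("/*.pdf$")   # suffix-anchored wildcard rule
api_rule = pattern_to_regex("/api/*")    # open-ended wildcard rule
```

A parser without wildcard support treats the same * as a literal character, so /api/* matches almost nothing — which is exactly how the same directive produces opposite decisions in two bots.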
What "drift" actually looks like
Common patterns:
- A path blocked for * and allowed for GPTBot. You added User-agent: GPTBot / Allow: / after a global Disallow. Most-specific-match bots see GPTBot is allowed everywhere; last-match bots may still block.
- Wildcard block that catches more than intended. Disallow: /api (no $ anchor) blocks /api, /apiv2, /api-docs/ — anything starting with /api. That's different from /api/ (trailing slash), which blocks only the directory.
- Allow precedence inversion. Some bots treat Allow as overriding Disallow regardless of specificity; others require the Allow to be more specific.
The simulator catches these by enumerating per-bot decisions and flagging mismatches.
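The over-broad wildcard pattern is pure prefix matching, which a two-line check makes concrete (plain startswith, which is how literal no-wildcard patterns behave):

```python
PATHS = ["/api", "/api/v1/users", "/apiv2", "/api-docs/", "/about"]

def blocked_by(pattern, path):
    # A literal (no-wildcard) robots.txt pattern is a prefix match on the path.
    return path.startswith(pattern)

# Disallow: /api sweeps up every path that merely starts with "/api"...
broad  = [p for p in PATHS if blocked_by("/api", p)]
# ...while Disallow: /api/ confines the block to the directory subtree.
narrow = [p for p in PATHS if blocked_by("/api/", p)]
```

Note that the trailing-slash form doesn't even block /api itself — only paths inside the directory.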
When drift is OK vs when it's a problem
Drift is OK when you intentionally want different policies per bot. Example: you want training-corpus crawlers (CCBot, GPTBot, Google-Extended) blocked but retrieval-time crawlers (ChatGPT-User, ClaudeBot, OAI-SearchBot, PerplexityBot) allowed because retrieval-time crawls send users back to your site. That's a deliberate split.
Drift is a problem when:
- A bot you intended to block is allowed because of a parser quirk.
- A bot you intended to allow is blocked because of an over-broad pattern.
- You don't know which bots interpret your file each way (the most common case).
The simulator surfaces all three.
The training vs retrieval split (recommended starting point)
A common 2026 policy:
# Training-corpus crawlers — usually block (you don't want training without consent)
User-agent: CCBot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Retrieval-time crawlers — usually allow (they send users back)
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Apple Intelligence (training)
User-agent: Applebot-Extended
Disallow: /
# Default
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Run this through the simulator with your real test paths to verify the policy expresses what you intend across every bot.
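One cheap cross-check before the full simulation: feed a trimmed version of the policy to Python's urllib.robotparser. It is a single RFC-style reference parser (no wildcard expansion, rules checked in source order), so it models only one of the eleven interpretations — useful as a sanity check, not a substitute for per-bot simulation:

```python
import urllib.robotparser

# Trimmed version of the policy above (hypothetical site, for illustration).
POLICY = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(POLICY.splitlines())

gpt_home    = rp.can_fetch("GPTBot", "https://example.com/")          # blocked
pplx_home   = rp.can_fetch("PerplexityBot", "https://example.com/")   # allowed
other_admin = rp.can_fetch("SomeOtherBot", "https://example.com/admin/users")
```

If even this one parser disagrees with your intent, the file is wrong for everyone; if it agrees, you've verified exactly one interpretation out of eleven.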
The 7-day robots.txt cleanup path
Day 1-2: Run the simulator with your current robots.txt + 20 representative paths (homepage, top categories, blog index, product pages, admin, API endpoints, parametrized URLs, sitemap).
Day 3-4: For every drift row, decide intent. Rewrite the robots.txt to express that intent unambiguously.
Day 5: Re-run the simulator. Confirm zero drift on intended-uniform paths and intentional drift on others.
Day 6: Deploy. Add the matching ai.txt for finer-grained signals.
Day 7: Verify in production via reverse-DNS lookup of incoming bot traffic + log analysis (use the Web Log Anomaly Detector for that).
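The Day 7 reverse-DNS check is the standard two-step bot verification: reverse-resolve the client IP, confirm the hostname falls under the vendor's published domain, then forward-resolve that hostname and confirm it maps back to the same IP (otherwise the PTR record could be spoofed). A sketch with injectable resolvers — the defaults use the socket module, and the domain suffixes you'd pass in are the vendors' published ones, not invented here:

```python
import socket

def verify_bot_ip(ip, allowed_suffixes,
                  reverse=lambda ip: socket.gethostbyaddr(ip)[0],
                  forward=socket.gethostbyname):
    """True only if reverse DNS of the IP lands in an allowed domain AND
    forward DNS of that hostname maps back to the same IP."""
    try:
        host = reverse(ip)
    except OSError:
        return False                      # no PTR record: treat as unverified
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False                      # hostname outside the vendor's domain
    try:
        return forward(host) == ip        # forward-confirm to defeat PTR spoofing
    except OSError:
        return False
```

The resolver parameters exist so the logic is testable offline; in production you'd call it with just the IP and the suffix list, and cache results to avoid a DNS lookup per log line.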
Related reading
- AI Posture Audit — robots.txt + ai.txt + meta robots cross-audit
- AI.txt Generator — generates the ai.txt companion file
- AI Crawler Access Auditor — verifies live bot access
- Mega Security Analyzer — full security audit including AI-bot posture
Fact-check notes and sources
- robots.txt RFC: RFC 9309 — current standard
- Google's robots.txt parser: Google Search Central — robots.txt specification
- OpenAI's robots.txt rules: OpenAI — GPTBot
- Anthropic's documentation: Anthropic — Claude crawler IPs and UAs
- Per-bot interpretation rules: synthesis of community testing 2024-2026; some parser quirks are observed-not-documented and may shift over time
This post is informational, not robots.txt-policy-consulting advice. Mentions of OpenAI, Anthropic, Google, Perplexity, Apple, Common Crawl are nominative fair use. No affiliation is implied.