# When The Same robots.txt Blocks One LLM And Allows Another

GPTBot honors most-specific blocks. ClaudeBot honors the LAST matching block. PerplexityBot ignores wildcards in some patterns. The same robots.txt yields different access maps per bot — and most operators never check.

Author: J.A. Watte
Published: April 23, 2026
Source: https://jwatte.com/blog/blog-tool-robots-llm-drift-simulator/

---

In practice, the robots.txt protocol is RFC 9309 plus vendor-specific extensions and undocumented quirks. Each major LLM crawler implements its own parser, and the differences are real:

- GPTBot follows most-specific-match (longest path wins).
- ClaudeBot follows last-match-wins (the directive ordering matters).
- PerplexityBot has reported quirks around wildcards in certain patterns.
- Google-Extended inherits Googlebot's parser (reliable, well-documented).
- CCBot (Common Crawl) follows RFC 9309 strictly.

The same robots.txt produces different access maps depending on which bot is reading it. A site that thinks it's blocked GPTBot may have left a hole open to ClaudeBot, or vice versa. Most sites ship one robots.txt and assume "block all LLMs" means "block all LLMs."

## What the [Robots LLM Drift Simulator](/tools/robots-llm-drift-simulator/) does

You paste your robots.txt content and a list of test paths. The tool:

1. Parses the directives into per-UA rule groups.
2. Simulates each rule group against 11 crawler UAs, LLM bots plus the search crawlers they are usually compared with (GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, PerplexityBot, Google-Extended, Googlebot, Bingbot, Applebot-Extended, CCBot).
3. Applies each bot's specific interpretation rules (most-specific vs last-match, wildcard support, suffix-anchor support).
4. Builds a paths × bots matrix showing each bot's allow/block decision for each path.
5. Highlights drift rows — paths where bots disagree.
6. Emits an AI prompt with a rewritten robots.txt that produces consistent decisions.
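
Step 1 of the list above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the simulator's actual code: `parse_groups` is a hypothetical name, user-agent tokens are lowercased, and every directive other than `Allow` and `Disallow` is skipped.

```python
def parse_groups(text):
    """Parse robots.txt text into {user_agent: [(directive, path), ...]}.

    Simplified sketch: consecutive User-agent lines share the rules that
    follow them; Sitemap, Crawl-delay and unknown fields are ignored.
    """
    groups = {}
    current_agents = []    # agents named by the group currently being read
    reading_rules = False  # True once the current group has seen a rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                reading_rules = False
            agent = value.lower()
            current_agents.append(agent)
            groups.setdefault(agent, [])
        elif field in ("allow", "disallow"):
            reading_rules = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


sample = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/   # private area
Allow: /admin/help
Sitemap: https://example.com/sitemap.xml
"""
print(parse_groups(sample))
```

Keeping the rules as ordered `(directive, path)` pairs preserves source order, which is the information a last-match interpretation needs.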

## The five interpretation rules that drift

**1. Most-specific vs last-match.** When two directives both match a path, does the parser take the longest matching path (the RFC 9309 rule, which Google follows) or the last directive in source order (some bots' behavior)? Different bots, different choice.
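
To make the difference concrete, here is a small sketch of both strategies applied to the same two rules. The function names are hypothetical, matching is plain prefix matching, and the tie-break (Allow wins on equal length) follows the RFC 9309 recommendation; which real bot uses which strategy is the claim of this post, not something the code verifies.

```python
def matching_rules(rules, path):
    """Return the rules whose path is a prefix of the requested path."""
    return [(d, p) for d, p in rules if p and path.startswith(p)]


def decide_longest_match(rules, path):
    """Most specific (longest) rule wins; on a tie, Allow wins."""
    hits = matching_rules(rules, path)
    if not hits:
        return "allow"
    return max(hits, key=lambda r: (len(r[1]), r[0] == "allow"))[0]


def decide_last_match(rules, path):
    """The last matching rule in source order wins."""
    hits = matching_rules(rules, path)
    return hits[-1][0] if hits else "allow"


rules = [("allow", "/blog/public/"), ("disallow", "/blog/")]
path = "/blog/public/post-1"
print(decide_longest_match(rules, path))  # allow
print(decide_last_match(rules, path))     # disallow
```

Swapping the order of the two rules makes both strategies agree, which is why putting broad rules first and specific rules last is a cheap way to reduce this kind of drift.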

**2. Wildcard support.** `Disallow: /api/*` works in some parsers, not in others. The original 1994 robots exclusion convention matched plain path prefixes only and had no wildcards. RFC 9309 defines `*` as matching any sequence of characters, and Google and most LLM crawlers support it. Perplexity is reported as inconsistent.

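**3. Suffix anchors ($).** `Disallow: /*.pdf$` works in Google, GPTBot, and Anthropic's crawlers. RFC 9309 defines `$` as the end-of-pattern anchor, but the original 1994 convention did not, so an older prefix-only parser may treat it as a literal character.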
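
Rules 2 and 3 come down to whether a parser translates `*` and `$` or reads them as literal characters. The sketch below contrasts the two readings; `pattern_to_regex` and the other names are hypothetical, and the regex translation is one reasonable reading of the RFC 9309 special characters, not any vendor's implementation.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def matches_with_wildcards(pattern, path):
    """Wildcard-aware parser: match from the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None


def matches_literal_prefix(pattern, path):
    """Prefix-only parser: '*' and '$' are ordinary characters."""
    return path.startswith(pattern)


for pattern, path in [
    ("/*.pdf$", "/files/report.pdf"),
    ("/*.pdf$", "/files/report.pdf?download=1"),
    ("/api/*", "/api/v1/users"),
]:
    print(pattern, path,
          matches_with_wildcards(pattern, path),
          matches_literal_prefix(pattern, path))
```

In the prefix-only reading none of these patterns match, so a rule that blocks every PDF for one bot blocks nothing for another.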

**4. UA-specific group inheritance.** When a bot has a UA-specific block, does it ALSO honor `*` rules, or only its own block? Most bots honor only their own. A few (older parsers) inherit from `*`.
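
A short sketch of the two group-selection behaviors described in rule 4. The function name and the `inherit_wildcard` flag are hypothetical, and placing the `*` rules before the bot's own rules in the inheriting case is an assumption that matters for last-match parsers.

```python
def select_rules(groups, bot, inherit_wildcard=False):
    """Pick the rules a bot applies.

    groups maps lowercase user-agent tokens to [(directive, path), ...].
    Default: the bot's own group only, falling back to '*' when it has none.
    inherit_wildcard=True models an older parser that also applies '*' rules.
    """
    own = groups.get(bot.lower())
    fallback = groups.get("*", [])
    if own is None:
        return list(fallback)
    if inherit_wildcard:
        return list(fallback) + list(own)  # ordering is an assumption
    return list(own)


groups = {
    "*": [("disallow", "/admin/")],
    "gptbot": [("disallow", "/private/")],
}
print(select_rules(groups, "GPTBot"))
print(select_rules(groups, "GPTBot", inherit_wildcard=True))
print(select_rules(groups, "CCBot"))
```

Note the side effect in the first call: once GPTBot has its own group, the `/admin/` rule under `*` no longer applies to it unless the group repeats that rule.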

**5. Comment + whitespace handling.** Blank lines, tabs, mid-line comments — slight differences in tokenization between parsers can cause one bot to see a directive that another silently drops.
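
The following sketch shows how two tokenizers can disagree on the same line. Both readers are hypothetical and deliberately exaggerated; they do not model any specific bot, only the kind of difference rule 5 describes.

```python
def read_line_lenient(raw):
    """Strip inline comments and surrounding whitespace before parsing."""
    line = raw.split("#", 1)[0].strip()
    if ":" not in line:
        return None
    field, value = line.split(":", 1)
    return field.strip().lower(), value.strip()


def read_line_strict(raw):
    """Reject indented lines and 'Field :' spacing; keep comments in the value."""
    if raw != raw.lstrip() or ":" not in raw:
        return None  # indented or malformed line is silently dropped
    field, value = raw.split(":", 1)
    if field != field.rstrip():
        return None  # space before the colon is rejected
    return field.lower(), value.strip()


for raw in ["\tDisallow: /tmp/", "Disallow: /drafts/ # unpublished"]:
    print(repr(raw), read_line_lenient(raw), read_line_strict(raw))
```

The strict reader drops the indented rule entirely and turns the second rule into a path that no real URL will match, so both lines are effectively invisible to it.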

## What "drift" actually looks like

Common patterns:

- **A path blocked for `*` and allowed for GPTBot.** You added a `User-agent: GPTBot` group with `Allow: /` after a global Disallow. Bots that read only their own group see GPTBot allowed everywhere; parsers that also inherit `*` rules may still block.
- **Prefix block that catches more than intended.** `Disallow: /api` (no trailing slash or `$` anchor) blocks `/api`, `/apiv2`, `/api-docs/`, and anything else starting with `/api`. That differs from `/api/` (trailing slash), which blocks only paths under that directory.
- **Allow precedence inversion.** Some bots treat Allow as overriding Disallow regardless of specificity; others require the Allow to be more specific.

The simulator catches these by enumerating per-bot decisions and flagging mismatches.
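
The enumeration step can be sketched as a paths × bots matrix plus a filter for rows where the decisions differ. This is an illustrative sketch, not the simulator's code: the bot names `longest-bot` and `last-bot` are placeholders, matching is plain prefix matching, and only the longest-match and last-match strategies are modeled.

```python
def decide(rules, path, strategy):
    """Return 'allow' or 'disallow' for a path under the given strategy."""
    hits = [(d, p) for d, p in rules if p and path.startswith(p)]
    if not hits:
        return "allow"
    if strategy == "last":
        return hits[-1][0]
    # Longest match; Allow wins when two matching rules have equal length.
    return max(hits, key=lambda r: (len(r[1]), r[0] == "allow"))[0]


def build_matrix(rules_by_bot, strategy_by_bot, paths):
    """Map each path to {bot: decision}."""
    return {
        path: {
            bot: decide(rules, path, strategy_by_bot[bot])
            for bot, rules in rules_by_bot.items()
        }
        for path in paths
    }


def drift_rows(matrix):
    """Paths on which at least two bots disagree."""
    return [path for path, row in matrix.items() if len(set(row.values())) > 1]


shared = [("allow", "/docs/"), ("disallow", "/")]
matrix = build_matrix(
    {"longest-bot": shared, "last-bot": shared},
    {"longest-bot": "longest", "last-bot": "last"},
    ["/docs/intro", "/admin/"],
)
print(drift_rows(matrix))  # ['/docs/intro']
```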

## When drift is OK vs when it's a problem

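Drift is **OK** when you intentionally want different policies per bot. Example: you want training-corpus crawlers (CCBot, GPTBot, Google-Extended) blocked but retrieval-time crawlers (ChatGPT-User, ClaudeBot, OAI-SearchBot, PerplexityBot) allowed because retrieval-time crawls send users back to your site. That's a deliberate split. Vendors define, and occasionally change, what each user agent is used for, so confirm the training or retrieval role of every agent against the vendor's current documentation before relying on the split.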

Drift is **a problem** when:

- A bot you intended to block is allowed because of a parser quirk.
- A bot you intended to allow is blocked because of an over-broad pattern.
- You don't know which bots interpret your file each way (the most common case).

The simulator surfaces all three.

## The training vs retrieval split (recommended starting point)

A common 2026 policy:

```
# Training-corpus crawlers — usually block (you don't want training without consent)
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Retrieval-time crawlers — usually allow (they send users back)
User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Apple Intelligence (training)
User-agent: Applebot-Extended
Disallow: /

# Default
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
```

Run this through the simulator with your real test paths to verify the policy expresses what you intend across every bot.
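
As a quick independent sanity check, Python's standard-library `urllib.robotparser` can evaluate a trimmed copy of the policy. Treat it as one more interpretation rather than a stand-in for any LLM crawler: it applies a group's rules in file order rather than by longest match, and the `check` helper and the `SomeOtherBot` name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the policy above: one blocked bot, one allowed bot, the default.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(POLICY.splitlines())


def check(bot, path):
    """True if the stdlib parser lets `bot` fetch `path` on example.com."""
    return parser.can_fetch(bot, "https://example.com" + path)


print(check("GPTBot", "/blog/post"))          # False: its group disallows everything
print(check("ClaudeBot", "/blog/post"))       # True
print(check("ClaudeBot", "/admin/"))          # True: its own group replaces the * rules
print(check("SomeOtherBot", "/admin/users"))  # False: falls back to the * group
```

The third result is the one to notice: a bot with its own `Allow: /` group is not covered by the `/admin/` block under `*`, so repeat that rule in each group where it should apply.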

## The 7-day robots.txt cleanup path

**Day 1-2:** Run the simulator with your current robots.txt + 20 representative paths (homepage, top categories, blog index, product pages, admin, API endpoints, parametrized URLs, sitemap).

**Day 3-4:** For every drift row, decide intent. Rewrite the robots.txt to express that intent unambiguously.

**Day 5:** Re-run the simulator. Confirm zero drift on intended-uniform paths and intentional drift on others.
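
One way to make the Day 5 confirmation repeatable is to write the intended policy down as data and compare it with the simulated matrix. The sketch below assumes a matrix shaped as `{path: {bot: decision}}`; that shape and the function name `policy_violations` are assumptions for illustration, not the simulator's export format.

```python
def policy_violations(matrix, expected):
    """List every (path, bot, wanted, got) where the matrix misses the intent.

    matrix:   {path: {bot: "allow" | "disallow"}} from a simulation run
    expected: {(path, bot): "allow" | "disallow"} written by hand
    """
    violations = []
    for (path, bot), wanted in expected.items():
        got = matrix.get(path, {}).get(bot)  # None if the cell was never simulated
        if got != wanted:
            violations.append((path, bot, wanted, got))
    return violations


matrix = {
    "/blog/": {"GPTBot": "disallow", "ClaudeBot": "allow"},
    "/admin/": {"GPTBot": "disallow", "ClaudeBot": "allow"},
}
expected = {
    ("/blog/", "GPTBot"): "disallow",
    ("/blog/", "ClaudeBot"): "allow",
    ("/admin/", "ClaudeBot"): "disallow",
}
print(policy_violations(matrix, expected))
```

An empty list means every cell you care about matches your intent; intentional drift simply shows up as different expected values for different bots.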

**Day 6:** Deploy. Add the matching ai.txt for finer-grained signals.

**Day 7:** Verify in production via reverse-DNS lookup of incoming bot traffic + log analysis (use the [Web Log Anomaly Detector](/blog/blog-tool-web-log-anomaly-detector/) for that).
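
The reverse-DNS step can be sketched as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's suffix, then resolve the hostname back and confirm it returns the same IP. The function name is hypothetical, the suffix list is an assumption you should take from each vendor's documentation, and the forward lookup used here covers IPv4 only.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    reverse/forward are injectable resolvers (useful for tests); by default
    they use the system resolver via the socket module.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip).rstrip(".").lower()
        if not any(host.endswith(suffix) for suffix in allowed_suffixes):
            return False  # hostname is not in the vendor's domain
        return ip in forward(host)  # the hostname must resolve back to the IP
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False


# Example call against live DNS (the result depends on your resolver):
# verify_crawler_ip("66.249.66.1", (".googlebot.com", ".google.com"))
```

A spoofed user agent fails either at the suffix check or at the forward confirmation, which is what makes the pairing more reliable than a reverse lookup alone.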

## Related reading

- [AI Posture Audit](/tools/ai-posture-audit/) — robots.txt + ai.txt + meta robots cross-audit
- [AI.txt Generator](/tools/ai-txt-gen/) — generates the ai.txt companion file
- [AI Crawler Access Auditor](/blog/blog-tool-ai-crawler-access-auditor/) — verifies live bot access
- [Mega Security Analyzer](/tools/mega-security-analyzer/) — full security audit including AI-bot posture

## Fact-check notes and sources

- robots.txt RFC: [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html) — current standard
- Google's robots.txt parser: [Google Search Central — robots.txt specification](https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt)
- OpenAI's robots.txt rules: [OpenAI — GPTBot](https://platform.openai.com/docs/gptbot)
- Anthropic's documentation: [Anthropic — Claude crawler IPs and UAs](https://docs.anthropic.com/en/docs/build-with-claude/ips-and-user-agents)
- Per-bot interpretation rules: synthesis of community testing 2024-2026; some parser quirks are observed-not-documented and may shift over time

*This post is informational, not robots.txt-policy-consulting advice. Mentions of OpenAI, Anthropic, Google, Perplexity, Apple, Common Crawl are nominative fair use. No affiliation is implied.*


