The robots.txt file is one of the oldest standards on the web. It dates back to 1994, predating CSS, JavaScript, and most of what we consider the modern internet. Despite that history, or maybe because of it, most site owners treat it as a write-once, forget-forever file. They paste in some rules from a Stack Overflow answer during initial setup and never look at it again.
The problem is that robots.txt rules interact with each other in ways that are not obvious. A Disallow: /api/ rule seems harmless until you realize your sitemap references /api/products/feed.xml and Google cannot reach it. A Disallow: /*? rule blocks all query strings, which also blocks ?page=2 on your paginated blog archive. The syntax is deceptively simple, but the consequences compound.
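Both failure modes fit in a handful of lines. Here is a sketch with hypothetical paths showing how each rule quietly over-blocks:

```
User-agent: *
Disallow: /api/    # also blocks /api/products/feed.xml, which the sitemap references
Disallow: /*?      # blocks every URL with a query string, including /blog/?page=2
```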
RFC 9309 changed the game
In September 2022, the IETF published RFC 9309, which formalized robots.txt parsing rules for the first time in 28 years. Before this, every search engine had its own interpretation of edge cases. Google, Bing, and Yandex all handled wildcards, dollar signs, and group precedence slightly differently.
RFC 9309 establishes clear precedence rules. The most specific matching rule wins, regardless of order in the file. An Allow: /products/feed rule takes precedence over Disallow: /products/ because it matches more octets of the path, and when an Allow and a Disallow match equally, the Allow wins. This seems intuitive, but many older robots.txt files were written assuming order-based precedence, where whichever rule appears last wins. On compliant crawlers, those files may not behave as the author expected.
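A minimal sketch of that precedence logic in Python, assuming literal, wildcard-free patterns (the rules and path are hypothetical, echoing the example above):

```python
# Minimal RFC 9309 precedence sketch: wildcard-free patterns only.
rules = [("disallow", "/products/"), ("allow", "/products/feed")]
path = "/products/feed.xml"

matches = [
    (len(pattern), directive == "allow", directive, pattern)
    for directive, pattern in rules
    if path.startswith(pattern)
]
# Sorting by (length, is-allow) means the longest pattern wins and a
# tie goes to Allow. An empty matches list would mean allowed by default.
length, is_allow, directive, pattern = max(matches)
print("allowed" if is_allow else "blocked", "by", f"{directive}: {pattern}")
```

Running this prints "allowed by allow: /products/feed", even though the Disallow appears first in the rule list.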
The standard also clarifies that unrecognized directives are ignored. If you added Crawl-delay: 10 thinking it would slow down Googlebot, it does nothing. Googlebot does not support Crawl-delay. Bingbot honors it, though Microsoft recommends managing crawl rate through Bing Webmaster Tools instead. The only reliable way to control crawl rate for Google is through Search Console.
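For example, with a hypothetical value:

```
User-agent: *
Crawl-delay: 10    # not part of RFC 9309: Googlebot ignores this line entirely
Disallow: /tmp/    # this rule still applies; the unknown line above changes nothing
```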
What the simulator does
The robots.txt Simulator takes your robots.txt content, a list of URLs, and a user-agent string, then runs RFC 9309 group precedence rules against every URL. For each URL, it reports whether the crawler would be allowed or blocked, and which specific rule caused the decision.
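To make the mechanics concrete, here is a minimal, self-contained Python sketch of that evaluation loop. It is an illustration of RFC 9309's rules, not the tool's actual implementation, and the group-selection step is simplified: the longest user-agent token that appears in the crawler's name wins.

```python
import re
from urllib.parse import urlsplit

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex:
    '*' matches any run of characters; a trailing '$' pins the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

def parse_groups(robots_txt: str) -> dict:
    """Collect (directive, pattern) rules per user-agent token.
    Unrecognized directives (Crawl-delay, Sitemap, ...) are skipped."""
    groups, agents, seen_rule = {}, [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip comments
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if seen_rule:                     # a rule line ended the previous group
                agents, seen_rule = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif key in ("allow", "disallow"):
            seen_rule = True
            for agent in agents:
                groups[agent].append((key, value))
    return groups

def decide(groups: dict, user_agent: str, url: str):
    """Return (allowed, matched_rule) for one URL: longest pattern
    wins, Allow wins ties, and no match at all means allowed."""
    ua = user_agent.lower()
    named = [a for a in groups if a != "*" and a in ua]
    rules = groups[max(named, key=len)] if named else groups.get("*", [])
    parts = urlsplit(url)
    path = (parts.path or "/") + ("?" + parts.query if parts.query else "")
    best = None
    for directive, pattern in rules:
        if not pattern:                       # 'Disallow:' with no value blocks nothing
            continue
        if pattern_to_regex(pattern).match(path):
            rank = (len(pattern), directive == "allow")
            if best is None or rank > best[0]:
                best = (rank, directive, pattern)
    if best is None:
        return True, None
    _, directive, pattern = best
    return directive == "allow", f"{directive.capitalize()}: {pattern}"
```

Feeding it the /api/ example from the opening, decide(parse_groups(robots), "Googlebot/2.1", "https://example.com/api/products/feed.xml") would return (False, "Disallow: /api/"), naming the rule that caused the block.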
This is useful in three scenarios. First, before deploying a new robots.txt, you can test it against your actual URL structure to catch unintended blocks. Second, after a site migration where URL patterns changed, you can verify that the existing robots.txt still makes sense. Third, when diagnosing indexation problems in Search Console, you can confirm whether robots.txt is the cause.
The simulator handles the tricky cases that catch most people off guard: wildcard patterns with *, end-of-string anchors with $, multiple user-agent groups where the crawler might match more than one, and empty Disallow: lines, which mean "allow everything" in the matched group but are often added by accident.
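A few of those cases in one file, with hypothetical bot names, annotated with the outcome each line produces:

```
User-agent: BadBot
Disallow: /*.pdf$    # blocks /report.pdf but NOT /report.pdf?download=1

User-agent: GoodBot
User-agent: OtherBot
Disallow:            # empty value: blocks nothing, so both bots may crawl everything
```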
Common mistakes
The most frequent issue is blocking CSS and JavaScript files. In the early 2010s, it was common practice to disallow /css/ and /js/ directories. Google now requires access to these files for proper rendering. Blocking them triggers warnings in Search Console and degrades your site's appearance in search results because Google cannot render the page layout.
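The legacy pattern to look for and remove, with hypothetical directory names:

```
# Anti-pattern from the early 2010s: breaks Google's page rendering today
User-agent: *
Disallow: /css/
Disallow: /js/
```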
Another common problem is conflicting rules across user-agent groups. If you have a block for User-agent: * and a separate block for User-agent: Googlebot, the Googlebot block completely replaces the wildcard block for Google's crawler. It does not inherit the wildcard rules. Many site owners add a specific Googlebot section to allow one path, not realizing they just removed all the wildcard restrictions for Googlebot.
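A sketch of the trap, with hypothetical paths:

```
User-agent: *
Disallow: /admin/
Disallow: /staging/

User-agent: Googlebot
Allow: /reports/public/
# Googlebot obeys ONLY this group. It no longer sees the /admin/ and
# /staging/ rules above; repeat them here if they should still apply.
```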
Trailing slashes also cause confusion. Disallow: /admin blocks /admin, /admin/, and /administrator/. If you only want to block the admin directory, you need Disallow: /admin/ with the trailing slash. One character difference, completely different scope.
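Side by side:

```
Disallow: /admin     # blocks /admin, /admin/, /administrator/, /admin.php
Disallow: /admin/    # blocks only paths inside the /admin/ directory
```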
Testing before deploying
Run the simulator before every robots.txt change. Feed it your full sitemap URL list and test against both Googlebot and * user-agents. Any URL that returns a different result between the two is worth investigating.
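If you want a quick local approximation of that diff, Python's standard library ships urllib.robotparser. Be warned that it predates RFC 9309 and does not understand wildcards or longest-match precedence, so treat it as a rough first pass rather than a verdict; the rules and URLs below are hypothetical:

```python
from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /api/

User-agent: Googlebot
Allow: /api/products/feed.xml
"""

urls = [
    "https://example.com/api/products/feed.xml",
    "https://example.com/blog/?page=2",
]

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Flag every URL where Googlebot and the wildcard group disagree
for url in urls:
    verdicts = {ua: rp.can_fetch(ua, url) for ua in ("Googlebot", "*")}
    if verdicts["Googlebot"] != verdicts["*"]:
        print(f"diverges: {url} -> {verdicts}")
```

Here the feed URL diverges: the Googlebot group allows it while the wildcard group blocks it, which is exactly the kind of result worth a second look.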
If you are managing multiple sites, whether as an agency or as a portfolio builder like I describe in The $97 Launch, robots.txt testing is the kind of 5-minute check that prevents month-long indexation mysteries.
Related tools
- AI Crawler Access Auditor checks access rules for AI-specific crawlers
- AI Bot Policy Generator generates ai.txt and robots.txt rules for AI crawlers
- Sitemap Lastmod Truthfulness audits whether your sitemap dates match reality
- Noindex Conflict Audit catches conflicts between robots.txt and meta robots directives
- Canonical Redirect Graph maps redirect chains that may interact with crawl rules
Fact-check notes and sources
- RFC 9309 "Robots Exclusion Protocol" published September 2022 by the IETF. Source: datatracker.ietf.org/doc/rfc9309/.
- Original robots.txt proposal from Martijn Koster, June 1994. Source: robotstxt.org/orig.html.
- Google does not support Crawl-delay: Google Search Central documentation, "How Google interprets the robots.txt specification."
- Google requires access to CSS and JS: Google Search Central Blog, "Updating our technical Webmaster Guidelines," October 2014.
- RFC 9309 specificity-based precedence: Section 2.2.2, "The most specific match found MUST be used."
- Bing Crawl-delay support: Bing Webmaster Tools documentation, "Bing Crawler (Bingbot)" guidelines.
This post is informational, not SEO-consulting advice. Crawlers from different search engines may interpret edge cases differently despite the RFC.