
LLMs.txt Validator — Why Most Published llms.txt Files Silently Fail

The frustrating thing about llms.txt is that a broken file is worse than a missing one.

When a retrieval engine fetches /llms.txt and finds a 404, it moves on. No harm, no half-measure. But when it fetches /llms.txt and the file is present but malformed — no H1, relative URLs, duplicate section names, missing link-list hyphens — the retriever tries to parse it, fails silently, logs the URL as covered, and deprioritizes your site because "this site publishes llms.txt but it's garbled, low trust signal."

The new LLMs.txt Validator runs the 12 structural checks the llmstxt.org spec cares about, tells you which ones fail, and emits a fix prompt. This post walks through what each check detects, why it matters, and what I see fail most often in the wild.

The twelve checks

  1. File is non-empty. Zero-byte files exist. Usually the result of a CI pipeline that overwrites the file with an empty template on deploy.

  2. First non-blank line is an H1 (# Title). The spec requires the first line to be a single # followed by the site's title. Parsers use this as the retrieval key. If your first line is a blockquote, an H2, or prose, the parser can't anchor the file and silently downgrades.

  3. Blockquote description (> text) immediately after the H1. Technically optional, but retrievers weight it heavily when summarizing your site in a citation footnote. A file with no blockquote gets cited with a description that defaults to whatever the retriever scraped from the homepage, which is often wrong.

  4. At least one H2 section (## Section). The spec models llms.txt as a hierarchy: H1 (site) → H2 (category) → list rows (URLs). A file with no H2s is a flat link dump; retrievers can still extract URLs but lose all category signal. Your docs, blog, and tool sections get indistinguishable.

  5. H2 section names are unique. Duplicated H2s ("## Docs" appearing twice) cause one section to overwrite the other in most parsers. The second "Docs" wins and the links under the first get dropped. I see this on sites that auto-generate llms.txt from multiple sources without de-duping.

  6. Every list row matches the markdown link pattern. The shape is strict: - [Title](url) with optional : description. A row that uses parentheses instead of brackets, or skips the leading hyphen-space, fails to parse. I see this most on llms.txt files hand-edited from markdown-table source material.

  7. All URLs are absolute. - [Docs](/docs/) is a hard failure. Retrievers fetch llms.txt out-of-context (from a proxy, from an archive, from a different origin entirely) and relative URLs resolve against whatever base happens to be set, which is never yours.

  8. URLs are HTTPS, not HTTP. Technically HTTP works — but retrievers follow redirects, and during the follow they run a source-quality scoring pass. HTTP URLs take a small trust hit because they look "downgraded" next to their peers. With Let's Encrypt free since 2016, there's no excuse to ship HTTP anymore.

  9. No duplicate URLs across sections. A URL appearing in two H2 sections (common when the same page is both a "Guide" and a "Reference") gets deduplicated by the retriever, and only one section wins the attribution. You've effectively given up a slot.

  10. Every row has a non-empty title. - [](https://example.com/foo) is a failure. The title is what the retriever displays when citing the link in a summary. Empty title = the retriever falls back to the URL itself, which is unreadable.

  11. Link titles are under 80 characters. Long titles get truncated in citation footnotes. 80 characters is roughly the width of a single-line citation in a ChatGPT or Perplexity response. Anything over gets sliced mid-word and looks unprofessional.

  12. Structural ordering is correct. H1 comes before H2. H2 comes before its link rows. Trailing whitespace is absent on content lines. These are the "shouldn't matter but do" checks — strict parsers (the ones the big engines actually use) sometimes trip on whitespace or ordering that permissive markdown libraries handle silently.
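For reference, here is what a file that passes all twelve checks looks like. This is a minimal invented example; the domain, titles, and descriptions are placeholders:

```markdown
# Example Co

> Example Co builds widgets. These links cover the docs, blog, and API reference.

## Docs

- [Quickstart](https://example.com/docs/quickstart): Install and run your first widget
- [API Reference](https://example.com/docs/api): Every endpoint, with auth examples

## Blog

- [Launch post](https://example.com/blog/launch): Why we built Example Co
```

H1 first, one-line blockquote, unique H2s, absolute HTTPS URLs, non-empty titles under 80 characters, no duplicates, no trailing whitespace.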
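The file-level checks (1 through 5, plus the whitespace part of 12) are straightforward to sketch. This is my own illustrative Python, not the validator's actual code, and the failure messages are invented:

```python
import re

def check_structure(text: str) -> list[str]:
    """Return failure messages for the file-level checks; empty list means pass."""
    failures = []
    lines = text.splitlines()
    content = [l for l in lines if l.strip()]
    if not content:
        return ["file is empty"]                                # check 1
    if not re.match(r'^# \S', content[0]):
        failures.append("first non-blank line is not an H1")    # check 2
    elif len(content) < 2 or not content[1].startswith("> "):
        failures.append("no blockquote description after H1")   # check 3 (warn)
    h2s = [l for l in content if l.startswith("## ")]
    if not h2s:
        failures.append("no H2 sections")                       # check 4
    elif len(h2s) != len(set(h2s)):
        failures.append("duplicate H2 section names")           # check 5
    if any(l != l.rstrip() for l in lines if l.strip()):
        failures.append("trailing whitespace on content lines") # check 12
    return failures
```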
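The per-row checks (6 through 11) hang off a single regex for the link-row shape. Again a sketch under my own naming, assuming the strict - [Title](url) pattern with an optional : description tail:

```python
import re
from urllib.parse import urlparse

# Check 6's shape: hyphen-space, bracketed title, parenthesized URL,
# optional ": description" tail.
ROW_RE = re.compile(r'^- \[(?P<title>[^\]]*)\]\((?P<url>[^)\s]+)\)(?:: .+)?$')

def check_row(line: str) -> list[str]:
    """Return failure messages for one list row; empty list means pass."""
    m = ROW_RE.match(line)
    if not m:
        return ["row does not match '- [Title](url)' pattern"]  # check 6
    failures = []
    title, url = m.group("title"), m.group("url")
    parsed = urlparse(url)
    if not parsed.scheme:
        failures.append("URL is relative, not absolute")        # check 7
    elif parsed.scheme != "https":
        failures.append("URL is not HTTPS")                     # check 8
    if not title.strip():
        failures.append("title is empty")                       # check 10
    elif len(title) > 80:
        failures.append("title exceeds 80 characters")          # check 11
    return failures
```

Check 9, duplicate URLs, needs the whole file rather than a single row, so it belongs with the file-level pass.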

What I see fail most often

After running the validator against every llms.txt I could find from a sample of 40+ sites, the failures cluster into four patterns:

Pattern A — missing blockquote description (about 55% of files). The author wrote the H1 and jumped straight to the H2 sections. The blockquote feels optional so it gets skipped. Fix: add one sentence. The citation-footnote weight alone is worth the 30 seconds.

Pattern B — relative URLs (about 20%). Usually a site where llms.txt was generated from the sitemap or a static-site tool that wrote URLs relative to the site root. Fix: prefix all URLs with https://domain.com.
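Pattern B is mechanical enough to fix in a few lines. A sketch, with the base origin as a placeholder you'd swap for your own domain:

```python
import re
from urllib.parse import urljoin, urlparse

def absolutize_row(row: str, base: str) -> str:
    """Prefix a relative URL inside '- [Title](url)' with the site origin.

    `base` is your canonical origin, e.g. "https://example.com" (placeholder).
    Rows whose URL already carries a scheme are left untouched.
    """
    def fix(m: re.Match) -> str:
        url = m.group(1)
        if urlparse(url).scheme:          # already absolute: leave it alone
            return m.group(0)
        return "(" + urljoin(base, url) + ")"
    return re.sub(r'\(([^)\s]+)\)', fix, row, count=1)
```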

Pattern C — no H2 sections, flat link list (about 15%). The author had a simple site and didn't bother with categories. This is the file that looks cleanest to humans and parses worst to retrievers. Fix: even two H2s ("## Docs", "## Blog") is infinitely better than zero.

Pattern D — duplicated section names (about 10%). Auto-generated files that concatenate multiple data sources. Fix: de-dupe at generation time, or rename to distinguish.
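Pattern D's generation-time fix is a merge that combines rather than overwrites. A sketch, assuming each source produces a map of H2 name to link rows (my own shape, not any particular generator's):

```python
def merge_sections(sources: list[dict[str, list[str]]]) -> dict[str, list[str]]:
    """Merge per-source section maps so duplicate H2 names combine
    instead of the second one clobbering the first."""
    merged: dict[str, list[str]] = {}
    seen_rows: set[str] = set()
    for source in sources:
        for section, rows in source.items():
            bucket = merged.setdefault(section, [])  # same-name sections merge
            for row in rows:
                if row not in seen_rows:             # each link keeps one slot
                    seen_rows.add(row)
                    bucket.append(row)
    return merged
```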

Pattern C is the biggest win-to-effort ratio. Adding H2s to an existing flat file takes three minutes and lifts retriever parsability from "link list" to "categorized library."

How the validator works

Paste a URL — the tool fetches /llms.txt via the Netlify fetch-page serverless proxy (which bypasses CORS and retries if the root URL doesn't resolve to the file), parses the markdown in-browser with a strict tokenizer, and runs all 12 checks. Or paste the file contents directly if you want to validate a draft before shipping.

For each check you get a pass/fail/warn indicator plus a specific detail line — not just "failed" but "3 relative URL(s) found on lines 14, 22, 29." Fix prompts are specific enough that pasting them into Claude gives you a corrected file back without needing to re-explain the issue.

Why this isn't in an existing tool

The AI Posture Audit already fetches llms.txt as part of its 13-file identity surface audit, and Mega Analyzer backports the check. But both tools just report presence and size; neither does the structural parse. The validator is the dedicated-parser companion — useful when you want to iterate on the file itself, debug why your retriever score didn't move after you published, or validate a draft before shipping.

Running both together is the full flow: Posture Audit tells you if the file is present; Validator tells you if the file is correct; Mega Analyzer tells you how the whole discovery surface stacks up across the other 12 identity files.

Related reading

The $100 Network covers publishing llms.txt across a site network with single-source templates and per-site overrides. If you run more than one site, the validator is how you catch template drift before it costs you citations.
