llms.txt Structural Validation — Beyond Presence, Does Yours Actually Match the Spec?

I've been serving /llms.txt at jwatte.com for almost a year. The llmstxt.org spec exists. I wrote a generator for it. Most SEO tools will detect the file and tell you "found — good." What they won't tell you is whether the file actually follows the spec.

Most don't. I audited twenty random /llms.txt files across sites that had published them, and found:

  • 3 had no H1 title line. A markdown file without an H1 is an anonymous blob; the LLM reading it can't anchor the content to a site identity.
  • 9 had a single giant wall of URLs with no H2 sections. The spec uses H2 sections (for example ## Projects, ## Docs, ## Blog) to tell retrievers what each group of links is. A flat URL list is semantically equivalent to a sitemap, which is exactly what llms.txt was designed not to be.
  • 11 used relative URLs in their link list ([tool page](/tools/)). LLMs fetching the file can't resolve relative paths reliably. Absolute URLs only.
  • 6 had no description blockquote (the > <text> line after the H1). This is the canonical summary the LLM uses when asked "what is this site?" — skipping it means the model answers with whatever it can glean, often wrongly.

The fix for all four issues is mechanical. But you have to notice them first. Presence checks ("is the file served? yes, good.") miss all of them.

The canonical structure per the spec

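The llmstxt.org spec proposes the structure below. Strictly, only the H1 is required by the spec; the other elements are optional there, but they are what the checks in this post look for: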

# Site Name

> One-sentence description of the site, used as canonical summary by AI retrieval.

Optional paragraph with additional context.

## Section 1 name

- [Link title](https://full-url.com/path): Optional one-line description.
- [Another link](https://full-url.com/other): What it is and why it matters.

## Section 2 name

- [Link](https://url.com/): Description.

## Optional

- [Less-critical link](https://url.com/): Retrieval engines deprioritize this section.

Five things matter:

1. Exactly one H1 at the top. The site/project name. No more, no less. Multiple H1s leave a parser with no single title to treat as the site identity. No H1 at all means the file has no semantic title.

2. A blockquote description. The > ... line right after the H1. This is the canonical one-liner summary. LLMs asked "what is [site]?" read this first. If you skip it, the model answers based on whatever else it finds — usually the first H2 section, which is almost never what you meant as a description.

3. H2 sections organize the links. Minimum one H2 section; ideally 3-5 covering Projects, Docs, Blog, About. Each H2 becomes a topical bucket the retriever understands. Without sections, you have a flat list — same information density as a sitemap.

4. Link list format: - [name](url): description. The leading dash matters. The brackets and parens matter. The description after the colon matters for short-summary retrieval. Tabs or indented sub-bullets are fine for nested lists.

5. Absolute URLs only. https://example.com/page/, not /page/ or ./page. Once the file's text is handed to a model, the URL it was fetched from is usually gone, so there is no base to resolve against and a relative href becomes a broken link.
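Rules 4 and 5 are easy to test one line at a time. Here is a minimal sketch, assuming a hypothetical helper named parseLinkItem (it is not part of the spec or of any published tool) and treating only http(s) URLs as absolute:

```javascript
// Hypothetical helper: parses one llms.txt link-list line of the form
// "- [name](url): description". Returns null when the line is not a link item.
function parseLinkItem(line) {
  // Leading whitespace is allowed so indented sub-bullets still match.
  const match = /^\s*-\s+\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?\s*$/.exec(line);
  if (!match) return null;
  const [, name, url, description = ""] = match;
  return {
    name,
    url,
    description: description.trim(),
    // Rule 5 (assumption): only http(s) URLs with a host count as absolute.
    isAbsolute: /^https?:\/\/[^/]+/i.test(url),
  };
}

console.log(parseLinkItem("- [Docs](https://example.com/docs/): API reference."));
console.log(parseLinkItem("- [Tools](/tools/)"));
console.log(parseLinkItem("https://example.com/tool-1"));
```

The first call returns an item with isAbsolute set to true, the second an item with isAbsolute set to false, and the third returns null because a bare URL is not a markdown link.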

The validator in the Mega Analyzer and the new shared module

The new /js/aeo-geo-checks.js module exports jwValidateLlmsTxtStructure(content) — a small function that runs five checks against llms.txt content and returns structured findings. It's wired into both the Mega Analyzer (Indexing Hygiene tab) and the Site Analyzer (Network bucket).

The checks:

  • H1 present and unique. Flags files with zero H1s (fail) or multiple H1s (warn).
  • Blockquote description present. Warns if missing — "Add a single-line > <description> under the H1 — this is what AI retrieval uses as the canonical summary."
  • At least one H2 section. Warns if the file is a flat list with no sectioning.
  • At least one link-list item. Fails if zero; warns if fewer than 3 items.
  • All link URLs absolute. Warns if any link uses a relative path.

Pass condition: all five satisfied. A passing file emits "llms.txt structure valid (N sections, M links)" as a confirmation row.
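To make the five checks concrete, here is a rough sketch of how such a validator could be written. It is not the source of jwValidateLlmsTxtStructure: the function name validateLlmsTxtSketch, the finding messages, and the return shape are all assumptions made for illustration. It also simplifies by accepting a blockquote anywhere in the file rather than only directly under the H1.

```javascript
// Sketch of the five checks listed above. NOT the real module's code;
// names, messages, and return shape are assumptions.
function validateLlmsTxtSketch(content) {
  const lines = content.split(/\r?\n/);
  const findings = [];
  const add = (level, message) => findings.push({ level, message });

  // 1. H1 present and unique ("# Title", not "## Section").
  const h1Count = lines.filter((l) => /^#\s+\S/.test(l)).length;
  if (h1Count === 0) add("fail", "No H1 title line");
  else if (h1Count > 1) add("warn", `Multiple H1s (${h1Count})`);

  // 2. Blockquote description present (position is not checked here).
  if (!lines.some((l) => /^>\s*\S/.test(l))) add("warn", "No blockquote description");

  // 3. At least one H2 section.
  const sections = lines.filter((l) => /^##\s+\S/.test(l)).length;
  if (sections === 0) add("warn", "No H2 sections");

  // 4. At least one link-list item; fewer than 3 is a warning.
  const urls = [];
  for (const l of lines) {
    const m = /^\s*-\s+\[[^\]]+\]\(([^)\s]+)\)/.exec(l);
    if (m) urls.push(m[1]);
  }
  if (urls.length === 0) add("fail", "No link-list items");
  else if (urls.length < 3) add("warn", `Only ${urls.length} link-list item(s)`);

  // 5. All link URLs absolute.
  const relative = urls.filter((u) => !/^https?:\/\//i.test(u));
  if (relative.length > 0) add("warn", `Relative URLs: ${relative.join(", ")}`);

  return { pass: findings.length === 0, sections, links: urls.length, findings };
}

const sample = [
  "# Example Site",
  "",
  "> A one-sentence description.",
  "",
  "## Docs",
  "",
  "- [Guide](https://example.com/guide/): Getting started.",
  "- [API](/api/): Reference.",
].join("\n");
console.log(validateLlmsTxtSketch(sample));
```

The sample file produces two warnings: it has only two link items, and one of them uses a relative URL.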

Why retrievers specifically benefit from the structured file

AI retrievers fetching /llms.txt use it differently from how they use /sitemap.xml. A sitemap is a flat list of URLs, each with optional last-modified, change-frequency, and priority hints. llms.txt lists URLs organized by topic, with short descriptions, and a site-level description at the top.

The retriever's flow when it's asked a question about your site:

  1. Fetch /llms.txt. Parse the H1 and the blockquote description to seed a site-level summary.
  2. Scan H2 sections. Each section is a topical bucket: "Projects", "Docs", "Blog", "About".
  3. Match the user's query against section names + link descriptions. Pick the most relevant section.
  4. Follow links in that section. Fetch one or two pages. Extract relevant passages.
  5. Cite the source(s) in the answer.

Compare to the retriever flow without llms.txt:

  1. Fetch / (homepage). Try to extract a description from meta tags + OG description + first paragraph.
  2. Look for internal links. Follow them in whatever order the crawler decides.
  3. Fetch some pages. Hope the one most relevant to the query is among them.
  4. Extract passages; cite the source.

The version with llms.txt is guided; the version without is opportunistic. A guided retriever has a better chance of fetching, and therefore citing, the right page, because the file tells it where to look.
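The guided flow can be illustrated in a few lines. This is only a sketch: real retrievers do not publish how they handle llms.txt, and the function names parseLlmsTxt and pickSection, as well as the naive word-overlap scoring, are assumptions chosen to keep the example small.

```javascript
// Steps 1 and 2 of the guided flow: title, summary, then H2 sections with links.
function parseLlmsTxt(content) {
  const doc = { title: "", summary: "", sections: [] };
  let current = null;
  for (const line of content.split(/\r?\n/)) {
    let m;
    if ((m = /^#\s+(.+)$/.exec(line))) {
      doc.title = m[1].trim();
    } else if (!doc.summary && (m = /^>\s*(.+)$/.exec(line))) {
      doc.summary = m[1].trim();
    } else if ((m = /^##\s+(.+)$/.exec(line))) {
      current = { name: m[1].trim(), links: [] };
      doc.sections.push(current);
    } else if (current && (m = /^\s*-\s+\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$/.exec(line))) {
      current.links.push({ name: m[1], url: m[2], description: (m[3] || "").trim() });
    }
  }
  return doc;
}

// Step 3: score each section by how many query words appear in its
// name, link titles, and link descriptions. Returns null if nothing matches.
function pickSection(doc, query) {
  const words = query.toLowerCase().split(/\W+/).filter((w) => w.length > 2);
  let best = null;
  let bestScore = 0;
  for (const section of doc.sections) {
    const text = [section.name, ...section.links.map((l) => `${l.name} ${l.description}`)]
      .join(" ")
      .toLowerCase();
    const score = words.filter((w) => text.includes(w)).length;
    if (score > bestScore) {
      best = section;
      bestScore = score;
    }
  }
  return best;
}

const doc = parseLlmsTxt([
  "# Example Site",
  "> Tools and notes for site owners.",
  "## Docs",
  "- [Install guide](https://example.com/docs/install/): How to install the CLI.",
  "## Blog",
  "- [Release notes](https://example.com/blog/releases/): What changed in each version.",
].join("\n"));
console.log(pickSection(doc, "how do I install it?").name);
```

With section names and descriptions in place, even this crude matcher lands on the Docs section for an install question. A flat URL list would give it nothing to match against.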

Common mistakes the validator catches

Mistake 1: H1 reads like a page title, not a site name.

Seen: # How to Set Up Your Site (reads like a blog title). Fix: # Example.com — Site Title (reads like a site identity).

Mistake 2: Description that's too long.

Seen: a five-sentence description paragraph under the H1. Fix: one sentence in a blockquote. The blockquote is the summary; additional context belongs in paragraph form after it.

Mistake 3: URL list with no link format.

Seen:

## Projects
https://example.com/tool-1
https://example.com/tool-2

Fix:

## Projects
- [Tool 1 name](https://example.com/tool-1): What this tool does.
- [Tool 2 name](https://example.com/tool-2): What this tool does.

The brackets and parens are what make it a link in markdown; retrievers that run the file through a markdown parser expect that structure.
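If a file has many bare URLs, the conversion can be scripted. A minimal sketch, assuming a hypothetical helper named bareUrlsToLinkItems; the link names it generates from the URL path and the TODO descriptions are placeholders that still need hand editing:

```javascript
// Hypothetical fixer: rewrites bare URL lines as markdown link items.
// Generated names and TODO descriptions are placeholders, not real content.
function bareUrlsToLinkItems(content) {
  return content
    .split(/\r?\n/)
    .map((line) => {
      const m = /^\s*(?:-\s+)?(https?:\/\/\S+)\s*$/.exec(line);
      if (!m) return line; // headings, existing link items, prose: untouched
      let url;
      try {
        url = new URL(m[1]);
      } catch {
        return line; // leave unparseable URLs for manual review
      }
      // Use the last path segment as a placeholder name, else the hostname.
      const slug = url.pathname.split("/").filter(Boolean).pop() || url.hostname;
      return `- [${slug}](${m[1]}): TODO describe this page.`;
    })
    .join("\n");
}

console.log(bareUrlsToLinkItems("## Projects\nhttps://example.com/tool-1\nhttps://example.com/tool-2"));
```

The script only fixes the syntax. The names and descriptions are what retrievers match against, so those still have to be written by someone who knows the pages.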

Mistake 4: Mixing relative and absolute URLs.

Seen:

- [Homepage](/): The site home.
- [About](https://example.com/about/): About page.

Fix: every URL absolute.
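This fix can also be automated with the standard URL constructor, which resolves a relative reference against a base. A small sketch, assuming a hypothetical helper named absolutizeLinks and a siteOrigin value supplied by the caller:

```javascript
// Sketch: rewrites relative markdown link targets as absolute URLs.
// `siteOrigin` is an assumption supplied by the caller, e.g. "https://example.com".
function absolutizeLinks(content, siteOrigin) {
  return content.replace(/\]\(([^)\s]+)\)/g, (whole, target) => {
    try {
      // new URL(target, base) resolves relative targets against the base
      // and keeps already-absolute URLs (in normalized form).
      return `](${new URL(target, siteOrigin).href})`;
    } catch {
      return whole; // leave anything unparseable for manual review
    }
  });
}

console.log(
  absolutizeLinks(
    "- [Homepage](/): The site home.\n- [About](https://example.com/about/): About page.",
    "https://example.com"
  )
);
```

The Homepage link becomes https://example.com/ and the About link is left as it was.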

Cross-reference: the /.well-known/ mirror

Some sites also mirror /llms.txt at /.well-known/llms.txt, following the RFC 8615 well-known URI convention. The llmstxt.org spec itself places the file at the root path, but some AI retrievers check the well-known location first. Both the Mega Analyzer's well-known audit and its aux-file checks detect the mirror.

If you serve /llms.txt at the root but not at /.well-known/, retrievers that look in the well-known location may miss your file. Easy fix: copy the file to both locations, or configure your host to answer /.well-known/llms.txt with a redirect or rewrite to /llms.txt.

Fact-check notes and sources

  • llms.txt spec: llmstxt.org — Jeremy Howard's informal specification, adopted by Anthropic, Perplexity, and several retrieval engines
  • RFC 8615 "Well-Known Uniform Resource Identifiers": datatracker.ietf.org/doc/html/rfc8615 — the /.well-known/ convention
  • Markdown link syntax per CommonMark spec: the [text](url) pattern is universally parsed

Run the Mega Analyzer on your site. Check the Indexing Hygiene tab. If your llms.txt is missing structural elements, the validator will list every one.
