AI Content Disclosure Audit

TL;DR. AI search = discovery (classic SEO buys the seat) + retrieval (passage-level chunking buys the citation). AI crawlers do not execute JS — every critical claim must live in the server-rendered HTML.

The AI Content Disclosure Audit is the audit you reach for when you already suspect a problem in this dimension and need a fast, copy-paste-able fix list. It reuses the same chrome as every other jwatte.com tool — deep-links from the mega analyzers, AI-prompt export, CSV/PDF/HTML download — but the checks it runs are narrow and specific to the dimension described above.

Checks for AI-generated content disclosure: visible

Why this dimension matters

AI search runs in two stages: DISCOVERY (the LLM queries a classic search engine to get ~20 candidate URLs) and RETRIEVAL (it fetches those pages, chunks them into ~150-token passages, and cites whichever chunk best matches the query). Classic SEO buys the seat; paragraph-level structure buys the citation. AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) do NOT execute JavaScript — every critical claim must be in the server-rendered HTML.

Common failure patterns

SPA shell with empty <div id="root"> — React / Vue / Angular apps that hydrate on the client look completely empty to AI crawlers. The fix is SSR (Next.js getServerSideProps, Nuxt asyncData, Svelte Kit load) or prerender / static export for content-heavy pages.
Missing llms.txt at the site root — the emerging standard for pointing AI crawlers at your canonical content. Absence is not catastrophic but presence makes your site noticeably easier to retrieve. Pair with llms-full.txt for full-content mirroring.
AI crawler-blocking in robots.txt without strategy — blocking GPTBot while allowing Googlebot is a choice; blocking all AI crawlers by default without knowing whether your audience queries ChatGPT / Claude / Perplexity is a cost. Decide deliberately; most content businesses benefit from allowing retrieval crawlers while blocking training crawlers.
Paragraphs over 300 words — each <p> is a retrieval unit for the chunker. Target 40–150 words per paragraph. Thinner = no answer match; thicker = split mid-thought and lose coherence at citation time.

How to fix it at the source

Start with llms.txt + llms-full.txt at the site root. Audit your robots.txt stance per bot deliberately. Restructure long paragraphs into 40–150-word chunks that each contain a complete claim + evidence pair. Track LLM referral visits via a custom Referrer segment (chatgpt.com, perplexity.ai, claude.ai, gemini.google.com, copilot.microsoft.com) — that is the canonical AEO KPI.

Thresholds that matter

Signal	Target
Paragraph length (retrieval unit)	40–150 words. Thinner fails to answer; thicker gets split mid-thought.
JSON-LD blocks	2+ per page (site-wide Org + page-specific type).
llms.txt byte size	< 50 KB for fast ingestion; `llms-full.txt` can be larger (1–2 MB).
robots.txt per-bot directive	Explicit for GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Bytespider, Applebot-Extended.

Example fix

llms.txt starter at site root:

# Your Business

> One-sentence "what this site is" — used by LLM retrievers as the authoritative site description.

## Core content

- [About](https://yoursite.com/about): who you are, why you do this
- [Products](https://yoursite.com/products): catalog with stable URLs
- [Documentation](https://yoursite.com/docs): technical references

## Policies

- [Privacy](https://yoursite.com/privacy)
- [Terms](https://yoursite.com/terms)
- [AI-crawler policy](https://yoursite.com/ai.txt)

## Optional — full content mirror

- [llms-full.txt](https://yoursite.com/llms-full.txt): full canonical content for long-form retrieval

When to run the audit

After a major site change — redesign, CMS migration, DNS change, hosting platform swap.
Quarterly as part of routine technical hygiene; the checks are cheap to run repeatedly.
Before an investor / client review, a PCI scan, a SOC 2 audit, or an accessibility-compliance review.
When a downstream metric drops (rankings, conversion, AI citations) and you need to rule out this dimension as the cause.

Reading the output

Every finding is severity-classified. The playbook is the same across tools:

Critical / red — same-week fixes. These block the primary signal and cascade into downstream dimensions.
Warning / amber — same-month fixes. Drag the score, usually don't block.
Info / blue — context only. Often what a PR reviewer would flag but that doesn't block merge.
Pass / green — confirmation. Keep the control in place.

Every audit also emits an "AI fix prompt" — paste into ChatGPT / Claude / Gemini for exact copy-paste code patches tied to your specific stack.

Related tools in this family

Mega AEO Analyzer — the AEO orchestrator — 10 dimensions (citation, attribution, retrievability, freshness, tokenizer, prompt-injection, fair-use).
AI Posture Audit — cross-references robots.txt, ai.txt, meta robots, X-Robots-Tag per bot — flags disagreements.
llms.txt Quality Scorer — audits llms.txt structure against the llmstxt.org spec.
AI Crawler Access Auditor — simulates each major AI bot's crawl permissions on your site.
RAG Readiness Audit — tests how cleanly your pages chunk for enterprise RAG pipelines.

Fact-check notes and sources

This post is informational and not a substitute for professional consulting. Mentions of third-party platforms in the tool itself are nominative fair use. No affiliation is implied.

Why AI Content Disclosure Audit Exists