TL;DR. AI search = discovery (classic SEO buys the seat) + retrieval (passage-level chunking buys the citation). AI crawlers do not execute JS — every critical claim must live in the server-rendered HTML.
The Content Credentials audit is the tool you reach for when you already suspect a problem in this dimension and need a fast, copy-paste-able fix list. It reuses the same chrome as every other jwatte.com tool (deep-links from the mega analyzers, AI-prompt export, CSV/PDF/HTML download), but the checks it runs are narrow and specific to the dimension described above.
Scan a page or a single image URL for C2PA / Content Credentials manifests. The audit tells you which images carry a provenance chain (so AI-vs-human origin is verifiable) and which don't.
Why this dimension matters
AI search runs in two stages: DISCOVERY (the LLM queries a classic search engine to get ~20 candidate URLs) and RETRIEVAL (it fetches those pages, chunks them into ~150-token passages, and cites whichever chunk best matches the query). Classic SEO buys the seat; paragraph-level structure buys the citation. AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot generally do NOT execute JavaScript, so every critical claim must be in the server-rendered HTML. (Google-Extended is a robots.txt control token rather than a separate crawler, so it belongs in your robots.txt policy rather than in this list.)
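To make the retrieval stage concrete, here is a minimal sketch of chunk-then-match. It is an illustration under loud assumptions, not how any particular AI search product works: whitespace-separated words stand in for tokens, and plain word overlap stands in for whatever ranking a real retriever uses. The function names `chunk_words` and `best_chunk` are hypothetical.

```python
def chunk_words(text, size=150):
    """Split text into consecutive passages of at most `size` words.

    Assumption: words approximate tokens; real pipelines use a tokenizer.
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def best_chunk(chunks, query):
    """Return the passage sharing the most distinct words with the query.

    Assumption: word overlap is a crude stand-in for real relevance ranking.
    Expects a non-empty list of chunks.
    """
    query_terms = set(query.lower().split())
    return max(chunks, key=lambda c: len(query_terms & set(c.lower().split())))
```

Even in this toy version, a claim that straddles a chunk boundary is split across two passages, which is the reason paragraph-level structure matters at citation time.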
Common failure patterns
- SPA shell with empty `<div id="root">`: React / Vue / Angular apps that hydrate on the client look completely empty to AI crawlers. The fix is SSR (Next.js `getServerSideProps`, Nuxt `asyncData`, SvelteKit `load`) or prerender / static export for content-heavy pages.
- Missing `llms.txt` at the site root: the emerging standard for pointing AI crawlers at your canonical content. Absence is not catastrophic, but presence makes your site noticeably easier to retrieve. Pair with `llms-full.txt` for full-content mirroring.
- AI crawler-blocking in robots.txt without strategy: blocking GPTBot while allowing Googlebot is a choice; blocking all AI crawlers by default without knowing whether your audience queries ChatGPT / Claude / Perplexity is a cost. Decide deliberately; most content businesses benefit from allowing retrieval crawlers while blocking training crawlers.
- Paragraphs over 300 words: each `<p>` is a retrieval unit for the chunker. Target 40–150 words per paragraph. Thinner = no answer match; thicker = split mid-thought and lose coherence at citation time.
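Two of the patterns above (the empty SPA shell and out-of-range paragraphs) can be checked against the raw, un-rendered HTML. The sketch below uses only the Python standard library. The class and function names, the `root` mount-point id, and the 20-word "shell" threshold are illustrative assumptions, not part of the audit tool.

```python
from html.parser import HTMLParser


class ParagraphAudit(HTMLParser):
    """Counts words per <p> and overall visible words in raw HTML."""

    def __init__(self, root_id="root"):
        super().__init__()
        self.root_id = root_id      # assumed SPA mount-point id
        self.saw_root = False
        self.total_words = 0        # words outside <script>/<style>
        self.paragraph_words = []   # one word count per <p>
        self._in_p = False
        self._buf = []
        self._skip = 0              # nesting depth inside <script>/<style>

    def _flush(self):
        if self._in_p:
            self.paragraph_words.append(len(" ".join(self._buf).split()))
            self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "p":
            self._flush()           # an unclosed <p> ends when the next begins
            self._in_p, self._buf = True, []
        if dict(attrs).get("id") == self.root_id:
            self.saw_root = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return                  # ignore JavaScript and CSS source
        self.total_words += len(data.split())
        if self._in_p:
            self._buf.append(data)


def audit_html(html, min_words=40, max_words=150, shell_threshold=20):
    """Flag an empty SPA shell and paragraphs outside the target range."""
    parser = ParagraphAudit()
    parser.feed(html)
    parser.close()
    parser._flush()
    return {
        "empty_spa_shell": parser.saw_root and parser.total_words < shell_threshold,
        "too_short": [n for n in parser.paragraph_words if n < min_words],
        "too_long": [n for n in parser.paragraph_words if n > max_words],
    }
```

Feed it the response body exactly as fetched, with no headless browser in between, since that is the closest approximation of what a non-rendering crawler sees.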
How to fix it at the source
Start with llms.txt + llms-full.txt at the site root. Audit your robots.txt stance per bot deliberately. Restructure long paragraphs into 40–150-word chunks that each contain a complete claim + evidence pair. Track LLM referral visits via a custom Referrer segment (chatgpt.com, perplexity.ai, claude.ai, gemini.google.com, copilot.microsoft.com) — that is the canonical AEO KPI.
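The referral segment can be approximated in a log pipeline with a small classifier. This is a sketch: the hostnames come from the list above, while the function name, the display labels, and the subdomain-matching rule are assumptions to adapt to your own analytics setup.

```python
from collections import Counter
from urllib.parse import urlparse

# Hostnames from the referrer list above; labels are illustrative.
LLM_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
    "gemini.google.com": "Gemini",
    "copilot.microsoft.com": "Copilot",
}


def classify_llm_referrer(referrer):
    """Return the assistant label for an LLM referral URL, else None."""
    if not referrer:
        return None
    host = (urlparse(referrer).hostname or "").lower()
    for domain, label in LLM_REFERRER_HOSTS.items():
        # Match the domain itself or any subdomain (e.g. www.perplexity.ai).
        if host == domain or host.endswith("." + domain):
            return label
    return None


def count_llm_referrals(referrers):
    """Tally LLM referral visits per assistant from an iterable of URLs."""
    labels = (classify_llm_referrer(r) for r in referrers)
    return Counter(label for label in labels if label is not None)
```

Matching on the parsed hostname rather than a substring keeps lookalike domains out of the segment.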
Thresholds that matter
| Signal | Target |
|---|---|
| Paragraph length (retrieval unit) | 40–150 words. Thinner fails to answer; thicker gets split mid-thought. |
| JSON-LD blocks | 2+ per page (site-wide Org + page-specific type). |
| llms.txt byte size | < 50 KB for fast ingestion; llms-full.txt can be larger (1–2 MB). |
| robots.txt per-bot directive | Explicit for GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot, Bytespider, Applebot-Extended. |
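The robots.txt row can be spot-checked with Python's standard `urllib.robotparser`. The sketch below reports, for each bot in the table, whether robots.txt names it explicitly and whether a test URL is allowed. The function name and the default test URL are assumptions. Note that the standard-library parser matches user agents by substring, so its answers may differ from how a given crawler interprets the same file.

```python
from urllib.robotparser import RobotFileParser

# Bot tokens taken from the thresholds table above.
AI_BOTS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
    "CCBot", "Bytespider", "Applebot-Extended",
]


def audit_robots(robots_txt, test_url="https://yoursite.com/"):
    """Report per-bot explicitness and access for a robots.txt body."""
    lines = robots_txt.splitlines()

    # Collect every user-agent token the file names explicitly.
    named = set()
    for line in lines:
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            named.add(value.split("#")[0].strip().lower())

    parser = RobotFileParser()
    parser.parse(lines)

    return {
        bot: {
            "explicit": bot.lower() in named,
            "allowed": parser.can_fetch(bot, test_url),
        }
        for bot in AI_BOTS
    }
```

A bot reported as `"explicit": False` is inheriting the `*` group, which is the "blocking or allowing without strategy" case described earlier.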
Example fix
llms.txt starter at site root:
```markdown
# Your Business

> One-sentence "what this site is" — used by LLM retrievers as the authoritative site description.

## Core content
- [About](https://yoursite.com/about): who you are, why you do this
- [Products](https://yoursite.com/products): catalog with stable URLs
- [Documentation](https://yoursite.com/docs): technical references

## Policies
- [Privacy](https://yoursite.com/privacy)
- [Terms](https://yoursite.com/terms)
- [AI-crawler policy](https://yoursite.com/ai.txt)

## Optional — full content mirror
- [llms-full.txt](https://yoursite.com/llms-full.txt): full canonical content for long-form retrieval
```
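A file shaped like the starter above can be sanity-checked before deployment. The sketch below assumes the layout the starter uses (an H1 title, a blockquote summary, then H2 sections of markdown link items) and the 50 KB target from the thresholds table. The function name and problem messages are hypothetical, and this is not a full validator for the llmstxt.org proposal.

```python
import re

# "- [Title](https://url)" with an optional ": description" suffix.
LINK_ITEM = re.compile(r"^- \[([^\]]+)\]\((https?://[^)\s]+)\)(?::\s*(.*))?$")


def check_llms_txt(text, max_bytes=50 * 1024):
    """Return (problems, sections) for an llms.txt body."""
    problems = []
    if len(text.encode("utf-8")) >= max_bytes:
        problems.append(f"file is {max_bytes} bytes or larger")

    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("# "):
        problems.append("first line is not an H1 title")
    if not any(ln.startswith("> ") for ln in lines[1:3]):
        problems.append("no blockquote summary near the top")

    sections = {}   # section name -> list of link URLs
    current = None
    for ln in lines[1:]:
        if ln.startswith("## "):
            current = ln[3:].strip()
            sections[current] = []
        elif ln.startswith("- "):
            match = LINK_ITEM.match(ln)
            if match is None:
                problems.append(f"malformed link item: {ln}")
            elif current is None:
                problems.append(f"link item before any section: {ln}")
            else:
                sections[current].append(match.group(2))

    for name, links in sections.items():
        if not links:
            problems.append(f"section '{name}' has no links")
    return problems, sections
```

An empty `problems` list means only that the file matches the assumed layout; it says nothing about whether the linked pages themselves are retrievable.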
When to run the audit
- After a major site change — redesign, CMS migration, DNS change, hosting platform swap.
- Quarterly as part of routine technical hygiene; the checks are cheap to run repeatedly.
- Before an investor / client review, a PCI scan, a SOC 2 audit, or an accessibility-compliance review.
- When a downstream metric drops (rankings, conversion, AI citations) and you need to rule out this dimension as the cause.
Reading the output
Every finding is severity-classified. The playbook is the same across tools:
- Critical / red — same-week fixes. These block the primary signal and cascade into downstream dimensions.
- Warning / amber — same-month fixes. Drag the score, usually don't block.
- Info / blue — context only. Often what a PR reviewer would flag but that doesn't block merge.
- Pass / green — confirmation. Keep the control in place.
Every audit also emits an "AI fix prompt" — paste into ChatGPT / Claude / Gemini for exact copy-paste code patches tied to your specific stack.
Related tools in this family
- Mega AEO Analyzer — the AEO orchestrator — 10 dimensions (citation, attribution, retrievability, freshness, tokenizer, prompt-injection, fair-use).
- AI Posture Audit — cross-references robots.txt, ai.txt, meta robots, X-Robots-Tag per bot — flags disagreements.
- llms.txt Quality Scorer — audits llms.txt structure against the llmstxt.org spec.
- AI Crawler Access Auditor — simulates each major AI bot's crawl permissions on your site.
- RAG Readiness Audit — tests how cleanly your pages chunk for enterprise RAG pipelines.
Fact-check notes and sources
- llmstxt.org: llms.txt proposed standard
- OpenAI: GPTBot documentation
- Anthropic: ClaudeBot documentation
- Perplexity: PerplexityBot
- Google: Google-Extended opt-out
This post is informational and not a substitute for professional consulting. Mentions of third-party platforms in the tool itself are nominative fair use. No affiliation is implied.