Why Live Citation Surface Probe Exists

The Live Citation Surface Probe is the audit you reach for when you already suspect a problem in this dimension and need a fast, copy-paste-able fix list. It reuses the same chrome as every other jwatte.com tool — deep-links from the mega analyzers, AI-prompt export, CSV/PDF/HTML download — but the checks it runs are narrow and specific.

The probe queries DuckDuckGo to map your live citation surface across knowledge aggregators, academic databases, and reference corpora. It tells you where you are cited and where you are not — the first step in raising your presence in AI-answered queries.

Why this dimension matters

AI search runs in two stages: DISCOVERY (the LLM queries a classic search engine to get ~20 candidate URLs) and RETRIEVAL (it fetches those pages, chunks them into ~150-token passages, and cites whichever chunk best matches the query). Classic SEO buys the seat; paragraph-level structure buys the citation. AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) do NOT execute JavaScript — every critical claim must be in the server-rendered HTML.
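The retrieval-stage chunking described above can be sketched in a few lines. This is a simplification: it uses whitespace-separated words as a stand-in for real tokenizer tokens, and ~150 is an approximate passage size, not a spec.

```python
def chunk_passages(text: str, max_tokens: int = 150) -> list[str]:
    """Split text into fixed-size passages, approximating RAG-style chunking.

    Real pipelines use a proper tokenizer; whitespace words are a rough proxy.
    """
    words = text.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
```

A paragraph that fits in one chunk survives intact; a 300-word paragraph gets cut in two, and the citation-worthy claim may land on the wrong side of the cut.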

Common failure patterns

  • SPA shell with empty <div id="root"> — React / Vue / Angular apps that hydrate on the client look completely empty to AI crawlers. The fix is SSR (Next.js getServerSideProps, Nuxt asyncData, SvelteKit load) or prerendering / static export for content-heavy pages.
  • Missing llms.txt at the site root — the emerging standard for pointing AI crawlers at your canonical content. Absence is not catastrophic but presence makes your site noticeably easier to retrieve. Pair with llms-full.txt for full-content mirroring.
  • AI crawler-blocking in robots.txt without strategy — blocking GPTBot while allowing Googlebot is a choice; blocking all AI crawlers by default without knowing whether your audience queries ChatGPT / Claude / Perplexity is a cost. Decide deliberately; most content businesses benefit from allowing retrieval crawlers while blocking training crawlers.
  • Paragraphs over 300 words — each <p> is a retrieval unit for the chunker. Target 40–150 words per paragraph. Thinner = no answer match; thicker = split mid-thought and lose coherence at citation time.
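The paragraph-length check in the last bullet is easy to approximate offline. A minimal sketch, assuming server-rendered HTML and using a regex rather than a real HTML parser (adequate as an audit heuristic, not for production parsing); `audit_paragraphs` is an illustrative name, not the tool's API:

```python
import re

def audit_paragraphs(html: str, lo: int = 40, hi: int = 150) -> list[tuple[int, int, str]]:
    """Flag <p> blocks whose word counts fall outside the lo..hi target range."""
    findings = []
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", html, re.S | re.I)
    for i, p in enumerate(paragraphs):
        text = re.sub(r"<[^>]+>", " ", p)  # strip any nested inline tags
        words = len(text.split())
        if words > hi:
            findings.append((i, words, "too thick: split it"))
        elif words < lo:
            findings.append((i, words, "too thin: merge or expand"))
    return findings
```

Run it against your rendered HTML (not your source templates): what the chunker sees is the server response, so that is what you should measure.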

How to fix it at the source

Start with llms.txt + llms-full.txt at the site root. Audit your robots.txt stance per bot deliberately. Restructure long paragraphs into 40–150-word chunks that each contain a complete claim + evidence pair. Track LLM referral visits via a custom Referrer segment (chatgpt.com, perplexity.ai, claude.ai, gemini.google.com, copilot.microsoft.com) — that is the canonical AEO KPI.
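The referral-tracking step above can be sketched as a tiny classifier over the referrer hostname. `is_llm_referral` and `AI_REFERRERS` are illustrative names; the domain list is the one from the paragraph above:

```python
from urllib.parse import urlparse

AI_REFERRERS = {
    "chatgpt.com", "perplexity.ai", "claude.ai",
    "gemini.google.com", "copilot.microsoft.com",
}

def is_llm_referral(referrer_url: str) -> bool:
    """Return True if the referrer hostname belongs to a known LLM surface."""
    host = urlparse(referrer_url).hostname or ""
    return host in AI_REFERRERS or any(
        host.endswith("." + domain) for domain in AI_REFERRERS
    )
```

The suffix check catches subdomains (www.perplexity.ai) without matching look-alike domains. The same list plugs directly into an analytics custom segment.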

When to run the audit

  • After a major site change — redesign, CMS migration, DNS change, hosting platform swap.
  • Quarterly as part of routine technical hygiene; the checks are cheap to run repeatedly.
  • Before an investor / client review, a PCI scan, a SOC 2 audit, or an accessibility-compliance review.
  • When a downstream metric drops (rankings, conversion, AI citations) and you need to rule out this dimension as the cause.

Reading the output

Every finding is severity-classified. The playbook is the same across tools:

  • Critical / red: same-week fixes. These block the primary signal and cascade into downstream dimensions.
  • Warning / amber: same-month fixes. Drag the score, usually don't block.
  • Info / blue: context-only. Often what a PR reviewer would flag but that doesn't block merge.
  • Pass / green: confirmation — keep the control in place.

Every audit also emits an "AI fix prompt" — paste it into ChatGPT / Claude / Gemini for exact copy-paste code patches tied to your stack.

Related tools

  • Mega AEO Analyzer — One URL, 10 AEO probes in one pass: schema, attribution, retrievability, freshness, accessibility, tokenizer, prompt-injection, AI-bot meta, speakable, E-E-A-T.
  • AI Posture Audit — Cross-references robots.txt, ai.txt, meta robots, and X-Robots-Tag per AI bot — flags disagreements that cause unpredictable crawl behavior.
  • llms.txt Quality Scorer — Fetches /llms.txt, /.well-known/llms.txt, /llms-full.txt.
  • AI Crawler Access Auditor — Fetches robots.txt, ai.txt, llms.txt, meta robots, X-Robots-Tag.
  • RAG Readiness Audit — 10-check score: SSR content, canonical, heading hierarchy, passage-friendly paragraphs, sentence-complete alt, schema type, freshness, script density, robots, clean canonical.

Fact-check notes and sources

This post is informational and not a substitute for professional consulting. Mentions of third-party platforms in the tool itself are nominative fair use. No affiliation is implied.
