# Why RAG-Readiness Is The Next Table-Stakes Audit

Every SMB wanting to stand up an internal AI assistant will hit the same wall: their own website can't be ingested cleanly. RAG pipelines demand semantic HTML, passage boundaries, and freshness signals most pages never think about.

Author: J.A. Watte

Published: April 23, 2026

Source: https://jwatte.com/blog/blog-tool-rag-readiness-audit/

---

The SMB pattern of 2026 goes like this: the owner discovers ChatGPT. They install a customer-support chatbot. They try to "train" it on their own website. It works badly. They blame the chatbot.

The chatbot isn't the problem. The website is.

RAG (retrieval-augmented generation) pipelines, the underlying tech behind every "chat with your docs" product, need ingestible source material. Ingestibility has specific requirements, and most websites fail several of them without their owner ever knowing.

## What RAG pipelines need

A RAG ingestion loop does roughly this:

1. Fetch the page (without executing JavaScript).
2. Strip boilerplate (nav, footer, sidebars).
3. Chunk the remaining content into passages, typically 200-500 words each.
4. Embed each passage into a vector representation.
5. Store the passages indexed by vector + metadata (URL, title, dateModified, canonical).

Each step has failure modes:

- **Fetch without JS**: if your content is rendered by React/Vue/Svelte at runtime and your server returns a skeleton, the RAG pipeline sees the skeleton. Content invisible.
- **Chunking**: if your page is a single 4,000-word paragraph with no H2 breaks, the chunker splits at arbitrary token boundaries, producing incoherent chunks that embed poorly.
- **Canonical identity**: if the same content is reachable at `/foo`, `/foo/`, `/foo?utm=x`, and `/index.html`, the dedupe step can't tell the copies are one page, so the pipeline stores duplicate embeddings, wastes vector storage, and fragments retrieval.
- **Freshness signals**: RAG rankers down-weight content with no `dateModified`.
A stale article and a fresh article with the same content rank together if neither has a stamp.

## What the [RAG Readiness Audit](/tools/rag-readiness-audit/) checks

Ten checks against the common denominator across LlamaIndex, LangChain, Haystack, and Vertex AI Search:

1. **SSR content present**: does the returned HTML include ≥300 words without JS execution?
2. **Canonical URL**: is there a `<link rel="canonical">`?
3. **Heading hierarchy**: one H1, 2+ H2 sections, clean tree?
4. **Passage-friendly paragraphs**: are 60%+ of paragraphs in the 20-150 word sweet spot for chunking?
5. **Sentence-complete alt text**: do 80%+ of images have alt text that functions as a standalone sentence?
6. **Entity / content-type schema**: is Article/Product/Service/FAQPage/HowTo declared?
7. **dateModified freshness**: is there a fresh (<180 days) dateModified?
8. **Script density**: are there fewer than 30 `