The SMB pattern of 2026 goes like this: the owner discovers ChatGPT. They install a customer-support chatbot. They try to "train" it on their own website. It works badly. They blame the chatbot.
The chatbot isn't the problem. The website is.
RAG (retrieval-augmented generation) pipelines — the underlying tech behind every "chat with your docs" product — need ingestible source material. Ingestibility has specific requirements, and most websites fail several of them without their owner ever knowing.
What RAG pipelines need
A RAG ingestion loop does roughly this:
- Fetch the page (without executing JavaScript).
- Strip boilerplate (nav, footer, sidebars).
- Chunk the remaining content into passages — typically 200-500 words each.
- Embed each passage into a vector representation.
- Store the passages indexed by vector + metadata (URL, title, dateModified, canonical).
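The loop above can be sketched in a few dozen lines of stdlib Python. This is a toy, not a real pipeline: `fake_embed` stands in for an actual embedding model, and the fetch step is replaced by passing HTML in directly.

```python
import hashlib
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style and boilerplate tags."""
    SKIP = {"script", "style", "nav", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # >0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def fake_embed(text):
    """Stand-in for a real embedding model: a deterministic 4-dim vector."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:4]]

def ingest(html, url, max_words=300):
    """Strip boilerplate, chunk by word count, embed, and index by URL."""
    parser = TextExtractor()
    parser.feed(html)
    words = " ".join(parser.parts).split()
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]
    return [{"url": url, "chunk": c, "vector": fake_embed(c)} for c in chunks]
```

Note that the chunker here splits at fixed word boundaries, which is exactly the naive behavior the failure modes below describe; production loaders try to split at headings and paragraphs first.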
Each step has failure modes:
- Fetch without JS — if your content is rendered by React/Vue/Svelte at runtime and your server returns a skeleton, the RAG pipeline sees the skeleton. Content invisible.
- Chunking — if your page is a single 4,000-word paragraph with no H2 breaks, the chunker splits at arbitrary token boundaries, producing incoherent chunks that embed poorly.
- Canonical identity — if the same content is reachable at /foo, /foo/, /foo?utm=x, and /index.html, the dedupe step misses the duplicates: embeddings get stored more than once, vector storage is wasted, and retrieval fragments across the copies.
- Freshness signals — RAG rankers down-weight content with no dateModified. A stale article and a fresh article with the same content rank together if neither has a stamp.
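The canonical-identity problem is usually handled at ingestion time with a URL normalizer. A minimal sketch, using only the stdlib; the exact rules (which parameters count as tracking, how index files are named) vary per pipeline:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative tracking-parameter prefixes; real lists are longer.
TRACKING_PREFIXES = ("utm", "fbclid", "gclid")

def canonicalize(url):
    """Collapse /foo, /foo/, /foo?utm=x, and /foo/index.html to one identity."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    path = path.rstrip("/") or "/"
    kept = [(k, v) for k, v in parse_qsl(query)
            if not k.startswith(TRACKING_PREFIXES)]
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))
```

With this in place, all four spellings of the same page hash to one index key, so each passage is embedded and stored exactly once.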
What the RAG Readiness Audit checks
Ten checks against the common denominator across LlamaIndex, LangChain, Haystack, and Vertex AI Search:
- SSR content present — does the returned HTML include ≥300 words without JS execution?
- Canonical URL — is there a <link rel="canonical">?
- Heading hierarchy — one H1, 2+ H2 sections, clean tree?
- Passage-friendly paragraphs — are 60%+ of paragraphs in the 20-150 word sweet spot for chunking?
- Sentence-complete alt text — do 80%+ of images have alt text that functions as a standalone sentence?
- Entity / content-type schema — is Article/Product/Service/FAQPage/HowTo declared?
- dateModified freshness — is there a fresh (<180 days) dateModified?
- Script density — are there fewer than 30 <script> tags (a proxy for JS-gating)?
- Meta robots allows indexing — no noindex?
- Clean canonical — no tracking parameters in the canonical URL?
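Two of these checks are simple enough to sketch directly against the raw HTML, using the thresholds stated above. Real audits parse the DOM properly and also emit a warn grade; this sketch only distinguishes pass from fail:

```python
import re

def check_script_density(html, limit=30):
    """Pass if there are fewer than `limit` <script> tags (JS-gating proxy)."""
    count = len(re.findall(r"<script\b", html, re.IGNORECASE))
    return "pass" if count < limit else "fail"

def check_ssr_content(html, min_words=300):
    """Pass if the raw HTML carries enough words without executing JS."""
    text = re.sub(r"<script\b.*?</script>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)  # crude tag strip
    return "pass" if len(text.split()) >= min_words else "fail"
```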
Each check is scored pass/warn/fail. The overall RAG readiness score is a 0-100 weighted aggregate. The AI fix prompt ranks the fails by ingestion-impact and proposes specific remediation.
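The 0-100 aggregate can be sketched as a weighted average where a warn earns half credit. The weights below are illustrative, not the audit's actual calibration:

```python
# Illustrative weights per check; not the audit's real calibration.
WEIGHTS = {
    "ssr_content": 3, "canonical_url": 2, "heading_hierarchy": 2,
    "paragraph_length": 2, "alt_text": 1, "schema": 2,
    "date_modified": 2, "script_density": 1, "meta_robots": 3,
    "clean_canonical": 1,
}
CREDIT = {"pass": 1.0, "warn": 0.5, "fail": 0.0}

def rag_readiness_score(results):
    """results: {check_name: 'pass'|'warn'|'fail'} -> 0-100 weighted score."""
    total = sum(WEIGHTS.values())
    earned = sum(WEIGHTS[name] * CREDIT[grade]
                 for name, grade in results.items())
    return round(100 * earned / total)
```

Weighting matters because the checks are not equal: a failed SSR check hides the entire page from ingestion, while a missing alt-text sentence degrades only image passages.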
What "good" looks like
Score 85+: RAG-ready. Any major pipeline ingests this page cleanly. Passages are coherent, entities are disambiguated, freshness is signaled.
Score 65-84: ingestible with friction. Some chunks will be mediocre. Some entities will be missed. Usually fixable in a 2-hour pass (add schema, add dateModified, clean up headings, split a giant paragraph).
Score below 65: structural problems. The site is probably JS-gated, has no canonical URLs, lacks semantic HTML. Requires an architectural pass (SSR/prerender, CMS restructure, template rebuild). Not a weekend job.
The SSR question
If your audit fails the "SSR content present" check, your site is built as a client-rendered SPA. Options:
- Prerender via Netlify or Cloudflare — inexpensive, drop-in, works for most SPAs. Netlify has a native prerender feature for SEO that also serves RAG crawlers.
- Switch to a static/hybrid framework — Next.js, Nuxt, Astro, Eleventy all support SSR or static generation. Migrating a React SPA to Next.js SSR is a week of focused work.
- Expose a parallel server-rendered copy — some pages (landing, service pages, blog) ship as static HTML; the interactive app stays at /app/*. This is the fastest middle path.
The choice depends on scale. For an SMB with <100 pages, prerendering is usually right. For a catalog site with thousands of pages, migrate to a framework with native SSR.
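The parallel-copy option often reduces to routing by user agent: known crawlers get the prerendered HTML, humans get the SPA shell. A minimal routing sketch; the bot markers below are real published crawler tokens as of early 2026, but the list is illustrative, so check each vendor's current documentation:

```python
# Known AI/RAG crawler user-agent tokens; illustrative, not exhaustive.
BOT_MARKERS = ("GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "PerplexityBot")

def pick_variant(user_agent, path):
    """Decide which copy of a page to serve: 'prerendered' or 'spa'."""
    if path.startswith("/app/"):
        return "spa"  # the interactive app stays client-rendered
    if any(marker in user_agent for marker in BOT_MARKERS):
        return "prerendered"
    return "spa"
```

The same branch works as middleware in any server; the key design decision is keeping the prerendered copy content-identical to the SPA so the routing never counts as cloaking.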
Why this is the next table-stakes audit
Three 2024-2026 trends compounded:
- SMBs started adopting AI assistants for customer support at scale.
- "Chat with your documents" went from research demo to $30/month SaaS (ChatPDF, Humata, internal ChatGPT teams).
- Vertex AI Search, Bing Copilot for Business, and Claude's internal search all started grounding answers in first-party websites instead of general web search.
The net: your website is now a RAG corpus whether you meant it to be or not. The sites that are RAG-ready today become the reliable sources for every AI-mediated query about them. The sites that aren't become the ones with hallucinated answers.
The audit takes ninety seconds. The fix list is usually 2-8 items. The compound payoff is every AI-mediated query about the business for the next three years.
Related reading
- Chunk Retrievability — passage-level scoring (complements this page-level score)
- Passage Retrievability — Google-specific featured-snippet extraction grading
- LLM Training-Data Inclusion Audit — upstream: is your site even in the dataset?
- AI Posture Audit — robots/ai.txt/meta coverage
Methodology: the weights and thresholds in the audit were calibrated against the default ingestion settings of LlamaIndex SimpleDirectoryReader, LangChain HTMLLoader, and Haystack HTMLToDocument as of early 2026. Specialized pipelines (with custom chunkers, embedded JS-rendering steps, or proprietary dedupers) may weigh additional factors.
Fact-check notes and sources
- RAG foundational paper: Lewis et al., 2020, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- LlamaIndex docs: docs.llamaindex.ai
- LangChain document loaders: python.langchain.com/docs/integrations/document_loaders
- Typical chunk sizes (200-500 words): convergent community standard; OpenAI embeddings benchmarks suggest ~300 tokens as sweet spot
This post is informational, not engineering advice. Mentions of LlamaIndex, LangChain, Haystack, Vertex AI, Pinecone, Vectara, and similar products are nominative fair use. No affiliation is implied.