Passage Retrieval Is The New SEO — Why Your Page Score Doesn't Match Your AI-Citation Rate
Your page scores 92/100 in every SEO tool. Schema is clean. Meta robots says max-snippet:-1. hreflang is bidirectional. E-E-A-T is maxed. But Perplexity never cites you. ChatGPT summarizes around your answer without mentioning the source. Claude references sites less authoritative than yours when answering questions your page is explicitly about.

The gap is passage retrieval. Classic SEO scores a page as a single unit. Answer engines score each paragraph as a separate unit, because that's the chunk they retrieve.

How modern retrieval actually works

When you ask ChatGPT "what's the best framework for tracking subscription revenue in a small SaaS?", here's what happens:

  1. ChatGPT calls a search engine (Bing, usually) to get 10-20 URLs relevant to the query.
  2. It fetches each URL — but not to read the whole page. It splits the page into passages of roughly 150 tokens each (~110 words). Each passage becomes a retrieval unit.
  3. Each passage gets embedded into vector space. The user's query is also embedded.
  4. ChatGPT ranks passages by similarity. The top 3-5 passages get included in the context window with a citation link back to their source URL.
  5. ChatGPT answers, weaving citations from the passages into the response.

The crucial step is (2): the page is not treated as a single blob. If only one paragraph on your page answers the question well, and the other 40 paragraphs are generic filler, only the one good paragraph gets retrieved. The other 40 never exist from the retriever's perspective.
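The retrieval loop in steps 2–4 can be sketched in a few lines. This is a toy, not ChatGPT's actual pipeline: a bag-of-words cosine similarity stands in for a real embedding model, and the `embed`/`retrieve` names and the sample page are illustrative.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9$]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(page_text: str, query: str, top_k: int = 3) -> list[tuple[float, str]]:
    # Step 2: split the page into passages on blank lines.
    passages = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    # Steps 3-4: embed query and passages, rank passages by similarity.
    query_vec = embed(query)
    ranked = sorted(((cosine(embed(p), query_vec), p) for p in passages), reverse=True)
    return ranked[:top_k]

page = (
    "We started thinking about revenue a while ago.\n\n"
    "The $97 Launch methodology tracks subscription revenue "
    "for small SaaS teams using three weekly metrics."
)
best = retrieve(page, "best framework for tracking subscription revenue in a small SaaS")
```

Even in this crude model, the generic first paragraph loses to the entity-rich second one; real embedding models sharpen the same effect.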

And if your one good paragraph starts with "This is why it matters" — with no antecedent for "this" — the retriever sees a fragment that makes no sense in isolation. It ranks lower. It doesn't get cited. The content was there; the structure hid it.

What makes a passage retrievable

A retrievable passage has five properties:

1. Length in the 40-150 word sweet spot. Shorter than 40 words is a fragment. Longer than 150 words gets split mid-thought during chunking — the split boundary cuts your reasoning in half.

2. Self-contained opening. The first sentence has a subject and a verb, and doesn't depend on a pronoun from a previous paragraph. "Ceramides repair the skin barrier by filling gaps in the lipid matrix" stands alone. "They help repair it" does not.

3. Named entity presence. The passage mentions at least one capitalized proper noun that would be a query match. "The $97 Launch methodology" beats "our approach". "BigCommerce pricing" beats "the pricing model". Retrievers match on entities; no entity means no match.

4. Answer-form structure. Passages that follow "The X is Y", "X works by Z", or "Q: ... A: ..." patterns read cleanly as answers. Narrative prose ("We started thinking about...") doesn't read as an answer and doesn't get cited for answer queries.

5. Specific claim markers. Numbers, dates, percentages, named frameworks. "$97 vs $3,500" is citable. "We save you a lot of money" is not.

Most pages get one or two of these across their key paragraphs. The best pages get all five across every paragraph, and the retrievers reward them with disproportionate citation rates.
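The five properties are mechanical enough to check in code. Here's a minimal heuristic scorer — the regexes, word lists, and equal 20-point weights are my illustrative assumptions, not the Mega Analyzer's actual rules:

```python
import re

# Pronouns that make an opening sentence depend on a previous paragraph.
ORPHAN_OPENERS = {"this", "that", "it", "they", "these", "those", "he", "she", "we"}
# Capitalized words that are likely sentence starters, not named entities.
NOT_ENTITIES = ORPHAN_OPENERS | {"the", "a", "an", "in", "on", "for", "and", "but", "if", "i", "to"}

def score_passage(text: str) -> tuple[int, list[str]]:
    """Score a passage 0-100 against the five properties, 20 points each."""
    words = text.split()
    issues, score = [], 0

    if 40 <= len(words) <= 150:                       # 1. length sweet spot
        score += 20
    else:
        issues.append("outside 40-150 word sweet spot")

    first = words[0].strip(".,").lower() if words else ""
    if first not in ORPHAN_OPENERS:                   # 2. self-contained opening
        score += 20
    else:
        issues.append("pronoun-dependent opening")

    if any(w[0].isupper() and w.strip(".,'\"").lower() not in NOT_ENTITIES
           for w in words[1:]):                       # 3. named entity present
        score += 20
    else:
        issues.append("no named entity")

    opening = " ".join(words[:12])
    if re.search(r"\b(is|are|works by)\b", opening) or text.startswith("Q:"):
        score += 20                                   # 4. answer-form structure
    else:
        issues.append("no answer-form opening")

    if re.search(r"[\$€]?\d", text):                  # 5. specific claim markers
        score += 20
    else:
        issues.append("no specific claim markers")

    return score, issues
```

Feed it a pronoun-opening filler paragraph and a specific, entity-rich one and the gap is immediate — which is exactly the gap the retriever sees.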

The new Retrieval tab in the Mega Analyzer

The Mega Analyzer now has a Retrieval (AEO) tab that simulates this process. When you run an audit, the tool:

  1. Walks the page's <main>/<article> element, extracting the text of every <p> and <li>, plus any other semantic chunk.
  2. For each passage, scores against the five criteria above, producing a 0-100 score.
  3. Surfaces the average score, count of "strong" (≥70) vs "weak" (<40) passages, and a per-passage breakdown with issues and kudos.

A healthy page averages 70+ with zero passages under 40. Weak pages cluster in the 30-50 range — every paragraph has one or two strengths but most have a fatal flaw (pronoun-dependent opening, no entity, too thin, or too long).
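Step 1 of that audit — pulling passages out of the page's main content — can be sketched with nothing but the standard library. This is a simplified stand-in for the analyzer's walker (it handles only `<p>` and `<li>` inside `<main>`/`<article>`):

```python
from html.parser import HTMLParser

class PassageExtractor(HTMLParser):
    """Collect the text of each <p> and <li> inside <main> or <article>."""

    def __init__(self):
        super().__init__()
        self.in_scope = 0    # nesting depth inside <main>/<article>
        self.in_passage = 0  # nesting depth inside <p>/<li>
        self.buffer = []
        self.passages = []

    def handle_starttag(self, tag, attrs):
        if tag in ("main", "article"):
            self.in_scope += 1
        elif tag in ("p", "li") and self.in_scope:
            self.in_passage += 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in ("main", "article"):
            self.in_scope -= 1
        elif tag in ("p", "li") and self.in_passage:
            self.in_passage -= 1
            text = " ".join("".join(self.buffer).split())  # normalize whitespace
            if text:
                self.passages.append(text)

    def handle_data(self, data):
        if self.in_passage:
            self.buffer.append(data)

sample_html = """
<body><nav><p>skip me</p></nav>
<main><p>Ceramides repair the skin barrier.</p>
<ul><li>First point</li><li>Second point</li></ul></main></body>
"""
extractor = PassageExtractor()
extractor.feed(sample_html)
```

Note that the `<p>` inside `<nav>` never becomes a passage — content outside the main content element is invisible to this kind of audit, just as it should be to a retriever.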

The output tells you what to fix per passage:

Passage 3 · 28 words · 32/100

"This is why we recommend it. It saves time and effort on every project."

Issues:

  • Too thin (28 words) — retrievers will extract it, but the chunk won't answer a question on its own
  • No named entities — retriever can't match on topic/entity queries
  • First sentence isn't self-contained (orphan pronoun or fragment) — retriever can't cite this standalone

Fix: rewrite the passage's opening phrase to name the subject. Merge with the preceding paragraph if appropriate so the chunk has enough volume. Add a concrete claim marker.

The math of rewriting for retrieval

A page with 20 paragraphs averaging 58/100 produces roughly 4-6 citation candidates (passages above 70). A page with the same content reorganized to average 78/100 produces 14-16 citation candidates. Same word count, same meaning, dramatically different retrieval surface.
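The arithmetic is easy to check. The score lists below are hypothetical, chosen to average 58 and 78 across 20 passages:

```python
def citation_candidates(scores: list[int], threshold: int = 70) -> int:
    """Count passages at or above the citation-worthiness threshold."""
    return sum(1 for s in scores if s >= threshold)

# Hypothetical per-passage scores for a 20-paragraph page (average 58)...
before = [58, 45, 72, 80, 55, 40, 62, 75, 50, 48, 66, 71, 38, 59, 61, 44, 73, 52, 64, 47]
# ...and the same page after sentence-level rewrites lift every passage by 20 (average 78).
after = [s + 20 for s in before]

n_before = citation_candidates(before)  # 5 candidates
n_after = citation_candidates(after)    # 14 candidates
```

A 20-point average lift nearly triples the citation surface, because so many passages sit just under the threshold.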

The rewriting work per paragraph is usually small:

  • Replace a leading pronoun with its named antecedent
  • Add a specific number, date, or named framework to a vague claim
  • Merge two thin paragraphs that together form one complete thought
  • Split one 250-word paragraph into two 100-word paragraphs with clean topical boundaries
  • Re-open a paragraph with an answer-form structure: "The [topic] is a..." or "To [do X], you..."

A 2,000-word article typically needs 30-45 minutes of this kind of sentence-level rewriting to move from 50-average to 75-average.

Chunking is deterministic and predictable

The retrievers don't use mysterious algorithms for chunking. Most public LLMs use one of three patterns:

  • Paragraph-boundary chunking: split on \n\n, merge short chunks until they hit a target token count.
  • Fixed-window chunking: split on a fixed token count (often 512 or 1024 for embedding models), with slight overlap.
  • Semantic chunking: split on topical shifts detected via embedding similarity drops between sentences.

The simulator uses paragraph-boundary logic because that's what most public retrievers use and what your HTML structure most directly influences. If you care about a retriever that uses fixed-window chunking, the ~150-token target still applies — your paragraph boundaries just don't map 1:1 to retrieval chunks anymore.
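Paragraph-boundary chunking is simple enough to sketch. The 0.75 words-per-token ratio is a rough English-prose heuristic, and the merge-forward policy is an assumption, not any specific engine's implementation:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: English prose runs ~0.75 words per token.
    return round(len(text.split()) / 0.75)

def chunk_paragraphs(text: str, target: int = 150) -> list[str]:
    """Paragraph-boundary chunking: split on blank lines, then merge short
    paragraphs forward until a chunk would exceed the target token count."""
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and approx_tokens(candidate) > target:
            chunks.append(current)  # close the chunk before it overruns
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Run it on four 30-word paragraphs and the first three merge into one ~120-token chunk while the fourth starts a new one — which is why a run of thin paragraphs can end up sharing a single retrieval unit with its neighbors.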

Why this matters more than your overall page score

Classic SEO KPIs (ranking, impressions, CTR) are page-level. You rank for a keyword; Google sends you traffic; the page is the unit.

AEO KPIs are passage-level. A specific passage gets cited for a specific question. Multiple passages on the same page can be cited for different questions. Your page doesn't "rank" in Perplexity; specific passages from it get surfaced when the user asks specific questions.

A page with a bad overall SEO score but strong passages can get cited heavily. A page with a great SEO score but weak passages gets the rank but loses the citation. The two metrics are correlated but not the same, and for AEO the passage-level metric dominates.

What to change on your site

Run the Mega Analyzer's Retrieval tab on your three most-important pages (homepage, top blog post, your "about" page if E-E-A-T matters to you). Look at the distribution:

  • Average 70+, zero weak passages: you're in great shape. Ship anyway; re-run after content updates.
  • Average 55-70, fewer than 20% weak: a targeted editorial pass over the weak passages will move the average above 70.
  • Average under 55, many weak: the page needs a restructuring pass. Merge thin paragraphs, expand vague claims with specifics, rewrite pronoun-dependent openings.

The biggest wins come from fixing the weak passages rather than polishing strong ones. A 35-score passage jumping to 60 moves the average more than a 70-score passage moving to 85.



Run the Mega Analyzer on your top blog post. The Retrieval tab tells you passage-by-passage how citable your content actually is, not just how "good" the page is as a whole.


Last updated: April 2026