# When Programmatic SEO Becomes Thin-Content Spam — Shingle Detection

Programmatic SEO scales content by swapping one noun across a template. Google's Helpful Content system detects near-duplicate sets via shingle similarity and filters the whole template when pairwise similarity stays above 0.7. The new detector runs the same math in your browser.

Author: J.A. Watte
Published: May 3, 2026
Source: https://jwatte.com/blog/blog-tool-pseo-thinness-audit/

---

City pages, "X vs Y" comparison pages, service-area pages — the pSEO playbook is one template with one slot swapped per variant. When done well, it captures long-tail demand efficiently. When done poorly, it hits Google's Helpful Content filter and the whole template drops out of the index in one update.

What separates the two outcomes isn't word count. It's pairwise shingle similarity across the set. [pSEO Thinness Audit](/tools/pseo-thinness-audit/) runs the same near-duplicate detection that Google's filter does.

## How shingle similarity works

Every page gets tokenized and sliced into overlapping 5-word sequences (shingles). For two pages, Jaccard similarity is the number of shingles they share divided by the number of distinct shingles across both: intersection / union of their shingle sets. Identical pages = 1.0. Pages with no shingle in common = 0.0.
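A minimal Python sketch of that computation. The tokenizer (lowercased runs of letters, digits, and apostrophes) and the function names `shingles` and `jaccard` are assumptions for illustration; the audit tool's actual tokenization may differ.

```python
import re


def shingles(text: str, k: int = 5) -> set:
    """Return the set of overlapping k-word shingles in text."""
    # Assumption: lowercase the text and treat runs of letters, digits,
    # and apostrophes as words. Real tokenizers vary.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Pages shorter than k words yield one short shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection / size of union."""
    if not a and not b:
        return 1.0  # Assumption: two empty pages count as identical.
    return len(a & b) / len(a | b)


page_a = "We offer fast affordable plumbing repair in Austin with licensed local technicians"
page_b = "We offer fast affordable plumbing repair in Dallas with licensed local technicians"

score = jaccard(shingles(page_a), shingles(page_b))
print(round(score, 3))
```

In this 12-word example, swapping one word changes five of the eight shingles on each page, so only three are shared and the score is 3/13 (about 0.23). On a full-length page the unchanged template text contributes far more shingles than the swapped slot does, which is why templated sets score high.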

Real pSEO numbers:
- **0.7+** — near-duplicate. HCU will filter the set. Fail.
- **0.5-0.7** — heavy templating. Some filtering likely.
- **0.35-0.5** — moderate templating. Usually survives.
- **Under 0.35** — differentiated. Safe.
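The bands above can be written as a small lookup. This is a sketch: the function name `similarity_tier` is hypothetical, and treating each lower bound as inclusive (so exactly 0.5 counts as heavy templating) is an assumption, since the list doesn't say which band a boundary value belongs to.

```python
def similarity_tier(score: float) -> str:
    """Map a pairwise Jaccard score to the bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("Jaccard similarity must be between 0.0 and 1.0")
    # Assumption: lower bounds are inclusive.
    if score >= 0.7:
        return "near-duplicate"
    if score >= 0.5:
        return "heavy templating"
    if score >= 0.35:
        return "moderate templating"
    return "differentiated"


for s in (0.82, 0.55, 0.4, 0.2):
    print(s, similarity_tier(s))
```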

## What the audit reports

- **Pairwise matrix** — similarity between every pair of pages in the set.
- **Per-page average** — each page's mean similarity to the others. Pages with the lowest per-page average are your most differentiated; pages with the highest are your biggest risk.
- **Title + H1 duplication** — clusters where multiple pages share the exact same title. Classic pSEO mistake.
- **Thin-page count** — pages under 300 words are thin regardless of similarity.
- **HCU risk tier** — combined score flagging LOW / MEDIUM / HIGH risk.
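A rough sketch of how the first four report items could be computed for a small page set. The input shape (URL mapped to a `title` and `body`), the function name `audit`, and the title normalization are assumptions for illustration, not the tool's actual code. It checks titles only, not H1s, and it leaves out the combined risk tier because the scoring formula isn't given here.

```python
import re
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 5) -> set:
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def audit(pages: dict, k: int = 5, thin_words: int = 300) -> dict:
    """pages maps URL -> {"title": str, "body": str} (hypothetical shape)."""
    urls = list(pages)
    sets = {u: shingles(pages[u]["body"], k) for u in urls}

    # Pairwise matrix: similarity between every pair of pages.
    matrix = {u: {} for u in urls}
    for u, v in combinations(urls, 2):
        matrix[u][v] = matrix[v][u] = jaccard(sets[u], sets[v])

    # Per-page average: mean similarity to every other page.
    per_page_avg = {
        u: sum(row.values()) / len(row) if row else 0.0
        for u, row in matrix.items()
    }

    # Title duplication: group URLs that share a title.
    # Assumption: compare after trimming whitespace and lowercasing.
    by_title = defaultdict(list)
    for u in urls:
        by_title[pages[u]["title"].strip().lower()].append(u)
    duplicate_titles = [group for group in by_title.values() if len(group) > 1]

    # Thin pages: fewer than thin_words whitespace-separated words.
    thin_pages = [u for u in urls if len(pages[u]["body"].split()) < thin_words]

    return {
        "matrix": matrix,
        "per_page_avg": per_page_avg,
        "duplicate_titles": duplicate_titles,
        "thin_pages": thin_pages,
    }
```

The pairwise loop is quadratic in the number of pages, which is fine for a sample of tens of pages but not for a full thousand-page set.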

## The fix pattern

The audit's AI prompt emits a content-diff plan with three buckets: sections to delete (pure boilerplate), sections to rewrite uniquely per variant (local landmarks, regulations, pricing examples, author quotes), and sections to keep as shared scaffold.

The specific advice depends on the page type. City pages: rewrite the regulations section for each city's actual laws, add local-pricing examples with real-dollar figures, include a quote from a local author/employee. "X vs Y" pages: rewrite the tradeoffs section uniquely for each pair because the tradeoffs really are different.
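Before prompting, a quick paragraph-frequency pass can show which blocks are shared across the set. This is a swapped-in heuristic, not the tool's prompt: the function names, the blank-line paragraph split, and the 0.8 `min_share` cutoff are all assumptions. It only finds shared paragraphs; deciding whether each one is deletable boilerplate or useful scaffold is still a human call.

```python
from collections import Counter


def split_paragraphs(body: str) -> list:
    """Split on blank lines; collapse whitespace and lowercase each paragraph."""
    return [" ".join(p.split()).lower() for p in body.split("\n\n") if p.strip()]


def shared_paragraphs(bodies: list, min_share: float = 0.8) -> list:
    """Return paragraphs that appear verbatim on at least min_share of pages."""
    counts = Counter()
    for body in bodies:
        # Count each paragraph at most once per page.
        counts.update(set(split_paragraphs(body)))
    cutoff = min_share * len(bodies)
    return [p for p, n in counts.items() if n >= cutoff]


bodies = [
    "Our licensed team handles every job.\n\nAustin permits start at the city office.",
    "Our licensed team handles every job.\n\nDallas permits are filed online.",
]
print(shared_paragraphs(bodies))
```

Paragraphs that differ only by the swapped slot (the city name, for example) won't match verbatim here, so this undercounts templated text; shingle similarity remains the better signal for that case.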

## Related reading

- [Chunk Retrievability](/tools/chunk-retrievability/) — per-page passage quality
- [Heading Gap Audit](/tools/heading-gap-audit/) — competitive H2 coverage
- [Voice Cleanup](/tools/voice-cleanup/) — de-slop content after rewriting

## Fact-check notes and sources

- Google Helpful Content Update: [developers.google.com/search/updates/helpful-content-update](https://developers.google.com/search/updates/helpful-content-update)
- W-shingling near-duplicate detection: [en.wikipedia.org/wiki/W-shingling](https://en.wikipedia.org/wiki/W-shingling)
- Jaccard index: [en.wikipedia.org/wiki/Jaccard_index](https://en.wikipedia.org/wiki/Jaccard_index)

---

*The $100 Network covers scaling pSEO without triggering HCU. The detector is the gate before scaling from 10 pages to 1000.*


