
When Two Of Your Own Pages Cannibalize Each Other


A site has 300 pages, indexes 180 of them, and can't figure out why the other 120 keep getting dropped. The answer in 70% of cases: internal duplicates. Two pages with 80%+ Jaccard similarity compete for the same query, Google picks one, ignores the other.

Copyscape will catch external scraping. Siteliner will catch the same problem internally but only on the homepage and one level deep, and even then the free tier caps you. Tools like SEMrush bundle a duplicate-content check inside a $139/mo subscription.

Or: paste your URL list into a free tool that shingles each page and computes pairwise similarity in your browser.

What the Duplicate Content Fingerprint does

You paste up to 20 URLs. The tool:

  1. Fetches each page through the proxy.
  2. Strips navigation, footer, scripts, asides — keeps body text.
  3. Splits each page into 5-word rolling shingles.
  4. Hashes each shingle to a stable integer.
  5. Computes pairwise Jaccard similarity between every page combination.
  6. Flags any pair ≥60% (functional duplicate) or 30-60% (significant overlap).
  7. Emits an AI prompt for consolidation strategy per pair.
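Steps 3–5 can be sketched in a few lines of Python. This is a minimal re-implementation of the idea, not the tool's actual code; the 5-word shingle width matches the article, but the BLAKE2 hash choice is an assumption:

```python
from hashlib import blake2b

def shingles(text: str, k: int = 5) -> set[int]:
    """Rolling k-word shingles, each hashed to a stable 64-bit integer."""
    words = text.lower().split()
    out = set()
    for i in range(len(words) - k + 1):
        shingle = " ".join(words[i:i + k])
        digest = blake2b(shingle.encode(), digest_size=8).digest()
        out.add(int.from_bytes(digest, "big"))
    return out

def jaccard(a: set[int], b: set[int]) -> float:
    """Intersection over union of shingle sets; 1.0 = identical text."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Hashing shingles to integers rather than comparing raw strings keeps the pairwise comparisons cheap: set intersection on small integers is fast enough to run all page combinations in the browser.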

What the similarity thresholds mean

≥80% — these pages are the same content with cosmetic differences. Almost certainly the result of templated category pages, paginated archives, A/B test variants left in production, or copy-paste from one URL to another. Consolidate.

60-80% — heavy overlap. Same topic, same key phrases, but some sections differ. Common pattern: location pages where 70% of the content is copy-pasted boilerplate and only the city name + map embed differs. Either differentiate harder or canonical to a hub page.

30-60% — meaningful overlap. Could be intentional (two genuinely-distinct articles on related topics that share vocabulary) or accidental (a "guide" and a "tutorial" that ended up overlapping more than planned). Worth a human read.

Under 30% — distinct. No action needed.
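The four buckets above map directly onto a triage function. A minimal sketch; the thresholds mirror the article, and the action labels are my own shorthand:

```python
def triage(similarity: float) -> str:
    """Map a pairwise Jaccard score to an action bucket."""
    if similarity >= 0.80:
        return "consolidate"      # same content, cosmetic differences
    if similarity >= 0.60:
        return "differentiate"    # heavy overlap: rewrite or canonical to a hub
    if similarity >= 0.30:
        return "human review"     # meaningful overlap: could be intentional
    return "distinct"             # no action needed
```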

Why internal duplicates hurt more than external ones

External scraping (someone copying your content) usually gets caught by Google's canonical-source signals — the original ranks, the scraper doesn't. Internal duplicates are harder: Google can't tell which of your two near-identical pages is "canonical" without your help.

Default Google behavior on an internal duplicate:

  1. Picks one URL to index, drops the other.
  2. The dropped URL still wastes crawl budget on every recrawl.
  3. Backlinks pointing to the dropped URL transmit no PageRank.
  4. If Google's pick is the wrong URL (e.g., a /category/?sort=oldest instead of the canonical product page), you can lose meaningful traffic.

The fix is to give Google explicit direction: 301 redirects for clearly-redundant URLs, canonical tags for variants, robots.txt blocks for parametric variants, or actual content differentiation for pages that should stay independent.
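The redirect half of that fix is a one-liner in most servers. A minimal nginx sketch; the /about-us → /about/ paths are hypothetical, so adapt them to your own redundant URLs:

```nginx
server {
    # ... existing server config ...

    # Hypothetical example: the legacy /about-us duplicates /about/.
    # A 301 tells Google which URL is canonical and forwards link equity.
    location = /about-us {
        return 301 /about/;
    }
}
```

For variants that must stay live (print views, tracked URLs), a `<link rel="canonical" href="...">` in the page head sends the same signal without a redirect.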

The five common internal-duplicate patterns

1. Templated location/service pages. "Roofers in Twin Falls" and "Roofers in Boise" share 80% boilerplate, only city name differs. Fix: write 200+ unique words per location (not just keyword-swap).

2. Paginated archives. /blog/page/1/, /blog/page/2/, etc. each share the same hero/intro/sidebar. Fix: self-referencing canonicals plus unique intro copy per page. (Google stopped using rel="next"/rel="prev" as an indexing signal in 2019, and it advises against canonicalizing page 2+ to page 1.)

3. Faceted navigation. /products?color=red, /products?color=blue, etc. Same product list re-sorted. Fix: canonical to the unfiltered URL + robots.txt block on parametric variants.

4. Old/new variant left in production. A migration kept both /about-us and /about/ live. Fix: 301 the old URL to the new one.

5. Copy-paste content drift. Author A wrote an article. 6 months later author B wrote a near-identical article on the same topic, unaware. Fix: merge into one canonical, 301 the other.
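The faceted-navigation fix (pattern 3) can be sketched in robots.txt. The /products path and parameter names here are hypothetical, and wildcard (*) support is a Googlebot extension that other crawlers may ignore:

```
User-agent: *
# Hypothetical: block parametric sort/filter variants of /products
Disallow: /products?*color=
Disallow: /*?sort=
```

Blocking crawl of the variants saves budget, but pair it with canonical tags on the pages themselves so any variant URLs Google already knows about consolidate correctly.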

The 30-day deduplication path

Week 1: Run the audit on three batches — your top 20 traffic-driving URLs, your top 20 backlinked URLs, and your top 20 "submitted but not indexed" URLs from GSC.

Week 2: Triage flagged pairs into three buckets — merge (≥80% similarity), differentiate (60-80%), monitor (30-60%).

Week 3: Execute merges with 301 redirects. Implement canonical tags for variant pairs. Add robots.txt blocks for parametric duplicates.

Week 4: Re-check index coverage in GSC and submit the updated sitemap. Watch the indexed-pages count: typical lift after deduplication is 15-30% more URLs indexed within 30-60 days.

Why Jaccard over cosine similarity

Jaccard on word shingles is a lexical measure: it cares about exact word sequences. Cosine on embeddings is semantic: it catches "same topic, different wording."

Jaccard catches the duplicates that hurt SEO most: templated content, copy-paste, paginated archives. Embeddings catch the duplicates that hurt LLM retrieval most: same idea expressed differently. Different problems, different tools — the Vector Embedding Similarity tool covers the embedding side.
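The lexical bias is easy to demonstrate. A toy comparison using single-word Jaccard (cruder than the tool's 5-word shingles, but the bias is the same; the example strings are invented):

```python
def word_jaccard(a: str, b: str) -> float:
    """Jaccard over single-word sets; crude, but shows the lexical bias."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

copy_paste   = "roofing contractors serving the boise area since 1998"
keyword_swap = "roofing contractors serving the twin falls area since 1998"
paraphrase   = "we have repaired roofs for local homeowners for decades"

# Templated keyword-swap pages score high (lexical duplicate)...
print(word_jaccard(copy_paste, keyword_swap))
# ...while a true paraphrase of the same idea scores near zero,
# even though an embedding model would call them semantically close.
print(word_jaccard(copy_paste, paraphrase))
```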


Fact-check notes and sources

This post is informational, not SEO-consolidation-consulting advice. Mentions of Google, Copyscape, Siteliner, SEMrush are nominative fair use. No affiliation is implied.
