Vector Embedding Similarity

Writers paraphrase. Google doesn't care about the paraphrase.

An article rewritten from a competitor's piece (same structure, same examples, same talking points, different words in each sentence) passes Copyscape, which looks for matching byte sequences. It can still get flagged by Google's content-quality systems, which appear to use semantic embeddings and 5-gram shingles.

The gap between those two detection methods is where most SMB content sits after "AI rewrites" and "spintax-style edits" and "paraphrasing passes." The content reads fresh. It ranks like a duplicate.

The four similarity layers

Byte-level (Copyscape). Exact-phrase matching. Easy to defeat with a paraphrase. $0.05 per check. Misses anything paraphrased beyond 4-5 words.

Lexical cosine (TF-IDF). Count word frequencies, compute cosine similarity between the two vectors. Catches duplication that survives reordering and light rewording, because the shared vocabulary still lines up. Free to compute. Misses paraphrases that use entirely different vocabulary.

5-gram shingle Jaccard. Break both texts into overlapping 5-word sequences. Count the intersection over the union. Catches the "rewrote the sentence order but kept the phrases" pattern. Free. This is roughly what Google's deduplication systems do at scale.

Semantic embeddings (OpenAI, Voyage, Cohere). Convert both texts to high-dimensional vectors, compute cosine similarity. Catches paraphrases even when wording, phrasing, AND structure differ. Requires an API key; costs ~$0.00002 per comparison. This is roughly what Google's newer quality systems do.
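The two free layers are small enough to sketch in a few lines. This is a minimal illustration, not the tool's own code: the tokenizer, the function names, and the use of raw word counts without IDF weighting (with only two documents there is little corpus to weight against) are all assumptions.

```python
import math
import re
from collections import Counter


def words(text: str) -> list[str]:
    """Lowercase word tokens; this simple tokenizer is an assumption."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lexical_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw word-count vectors (no IDF weighting)."""
    freq_a = Counter(words(text_a))
    freq_b = Counter(words(text_b))
    dot = sum(count * freq_b[word] for word, count in freq_a.items())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shingle_jaccard(text_a: str, text_b: str, size: int = 5) -> float:
    """Shared word shingles divided by all distinct shingles in either text."""

    def shingles(text: str) -> set[tuple[str, ...]]:
        tokens = words(text)
        return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}

    set_a = shingles(text_a)
    set_b = shingles(text_b)
    union = set_a | set_b
    if not union:
        return 0.0  # both texts are shorter than one shingle
    return len(set_a & set_b) / len(union)
```

The two scores fail in different directions. Shuffle the word order of a text and the lexical cosine stays at its maximum while the shingle score drops; keep a handful of long phrases intact inside otherwise new copy and the shingle score is the one that notices.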

What the Vector Embedding Similarity tool does

Free tier (always available): lexical cosine + 5-gram shingle Jaccard, combined into a single similarity score. No API key needed. Runs entirely in the browser.
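How the two scores are blended is not spelled out here. As a rough sketch, assuming a simple weighted average with equal weights by default (the function name and the `lexical_weight` parameter are hypothetical), the combination could look like this:

```python
def combined_similarity(
    lexical_cosine: float,
    shingle_jaccard: float,
    lexical_weight: float = 0.5,  # assumed weighting, not the tool's documented value
) -> float:
    """Blend two 0..1 scores into a single 0..100 percentage."""
    for name, value in (
        ("lexical_cosine", lexical_cosine),
        ("shingle_jaccard", shingle_jaccard),
        ("lexical_weight", lexical_weight),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    blended = lexical_weight * lexical_cosine + (1.0 - lexical_weight) * shingle_jaccard
    return 100.0 * blended
```

A different weighting shifts every combined percentage, so any thresholds applied to the result only make sense for the weighting they were tuned against.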

Output:

Combined similarity percentage
Lexical cosine score
Shingle Jaccard score
Verdict: distinct / related / substantial overlap / near-duplicate
Top 15 shared 5-gram phrases (the smoking gun when flagged)
AI fix prompt that decides: intentional duplication, accidental duplication, or external near-copy

Two input paths: paste two blocks of raw text, or paste two URLs and the tool fetches both pages through its same-origin proxy.

Reading the verdicts

Below 15% combined — distinct content. No action needed.

15-35% — related topic. Expected for two articles on the same subject. Not a problem.

35-65% — substantial overlap. One of three things is happening:

Intentional: a translated version, an AMP copy, a legitimate pillar-plus-child pair. Action: ensure canonical tags are correct.
Accidental: two pages on the site that should have been one. Action: consolidate. Pick one canonical URL, 301 the other, update internal links.
External: another site has lifted your content. Action: DMCA notice or outreach.

Above 65% — near-duplicate. Treat as one of the three above cases, but with urgency. At this level Google's deduplication systems will pick one URL to index and ignore the rest. If the "wrong" URL wins, you lose.
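The bands above reduce to a small lookup. This is a sketch rather than the tool's code: the function name is made up, and assigning exactly 35% to "substantial overlap" is an assumption, since the ranges as written list 35 in two bands.

```python
def verdict(combined_percent: float) -> str:
    """Map a combined similarity percentage (0..100) to a verdict label."""
    if not 0.0 <= combined_percent <= 100.0:
        raise ValueError("combined_percent must be between 0 and 100")
    if combined_percent < 15:
        return "distinct"
    if combined_percent < 35:
        return "related"
    if combined_percent <= 65:  # "above 65%" is the near-duplicate band
        return "substantial overlap"
    return "near-duplicate"
```

The label only says how much the two texts overlap. Deciding whether the overlap is intentional, accidental, or external still takes a human look at the two URLs.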

Why the shingle phrases matter

The report lists the top 15 shared 5-gram phrases. These are the smoking gun. If you see phrases like "the shingles should be installed" and "to prevent water damage from" and "inspect the roof every six," the rewrite was cosmetic — the author replaced words but left the phrase skeleton.
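Pulling that list out of two texts is straightforward. The sketch below assumes a simple word tokenizer and ranks shared phrases by how often they appear in both texts; the tool's real ranking is not described here, so treat the ordering and the function name as illustrative.

```python
import re
from collections import Counter


def shared_phrases(text_a: str, text_b: str, size: int = 5, limit: int = 15) -> list[str]:
    """Return up to `limit` word shingles that occur in both texts."""

    def shingle_counts(text: str) -> Counter:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        return Counter(
            " ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
        )

    # Counter intersection keeps each phrase with the smaller of its two counts,
    # so phrases repeated in both texts rank first.
    shared = shingle_counts(text_a) & shingle_counts(text_b)
    return [phrase for phrase, _ in shared.most_common(limit)]
```

Because shingles overlap, one lifted sentence shows up as a run of several shared phrases, each shifted by one word. A long run like that is a stronger signal than the same number of unrelated five-word matches.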

The fix isn't to run another paraphrasing pass. It's to restructure the argument. Different examples. Different order of points. Different first-person specifics. A paragraph that starts "When I was inspecting a roof in Twin Falls last fall" cannot accidentally overlap with a competitor's, no matter how many individual words the two share.

The three common SMB scenarios

1. Accidental service-page sprawl. An SMB has /roof-repair and /roof-repair-twin-falls and /roofing-services. All three cover the same ground. Combined similarity 70%+. Fix: consolidate to one canonical URL, 301 the others, and vary the anchor text on internal links that point to the consolidated URL.

2. AI-generated duplicates across locations. An SMB operating in multiple cities generates a /roof-repair-[city] page for each. The template produces pages that are 80%+ similar across cities. Google tends to index one per cluster. Fix: inject real per-city specifics (local landmarks, actual project photos from that city, crew-member quotes). Target <40% similarity between any two city pages.

3. Competitor lifted your content. You published a case study. A competitor published a near-copy two weeks later. Run this tool on both URLs; the shared-phrase list gives you documentation for a DMCA notice. File it with Google via the Search Console DMCA form.
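For scenario 2, the <40% target has to hold for every pair of city pages, which means checking all combinations rather than comparing each page against a single reference. A minimal sketch, assuming the page texts are already loaded into a dictionary and that `score` is whatever function returns a 0..100 similarity percentage (both names are hypothetical):

```python
from itertools import combinations
from typing import Callable


def pairs_over_target(
    pages: dict[str, str],
    score: Callable[[str, str], float],
    target: float = 40.0,
) -> list[tuple[str, str, float]]:
    """List every pair of pages whose similarity is at or above the target."""
    flagged = []
    for (url_a, text_a), (url_b, text_b) in combinations(pages.items(), 2):
        percent = score(text_a, text_b)
        if percent >= target:
            flagged.append((url_a, url_b, percent))
    # Worst offenders first, so the rewrite effort goes where it matters most.
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

The number of pairs grows quickly: 10 city pages means 45 comparisons, 50 pages means 1,225. That is still cheap for the free lexical and shingle scores.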

When to pay for real embeddings

The free tool catches most cases. Real semantic embeddings catch the rest. If you're:

Running a content operation with 50+ articles per month
Publishing across multiple sites where lexical overlap is expected but semantic overlap needs to stay distinct
Auditing for E-E-A-T-grade originality

...pay for API-based embeddings. OpenAI's text-embedding-3-large is the default choice; Voyage and Cohere are alternatives. Cost: pennies per article. Benefit: semantic-grade overlap detection at scale.
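Once a provider has returned one vector per text, the comparison itself is the same cosine calculation used for word counts. The sketch below assumes the embeddings have already been fetched and are passed in as plain lists of floats; it makes no API call, and the function name is illustrative.

```python
import math
from typing import Sequence


def embedding_cosine(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal dimension."""
    if len(vec_a) != len(vec_b):
        raise ValueError("embedding vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Both vectors must come from the same model. Scores from different embedding models are not comparable, and neither are thresholds tuned on one model and applied to another.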

Fact-check notes and sources

5-gram shingle as deduplication signal: Broder (1997), On the resemblance and containment of documents
Cosine similarity for document comparison: standard text-retrieval technique, see Salton & McGill (1983)
Google's deduplication at scale uses both lexical and embedding signals per public presentations from Google Search team at WMX conferences
OpenAI embedding pricing: platform.openai.com/docs/pricing

This post is informational, not copyright-law advice. Consult an attorney before sending DMCA notices. Mentions of Copyscape, Siteliner, OpenAI, Voyage, and Cohere are nominative fair use. No affiliation is implied.
