LLM Fair Use Audit

A 30-word quote with attribution is almost certainly fair use. A 500-word verbatim block embedded in an answer, with a link at the bottom, is functionally republication.

The line in between is murky, contested, and currently being litigated by several publisher and author groups at once. But you don't have to wait for the courts to set a number: you can measure it yourself, on your own pages, against actual LLM outputs.

The working thresholds this audit uses, drawn from legal commentary rather than from statute or case law, are roughly 150 verbatim words for a single passage and 300 cumulative words across a response. Below them, the LLM is citing and linking to you. Above them, it is citing and replacing you. The audit checks both.

What the LLM Fair Use Audit does

You paste two things: source content from your page, and an LLM response that mentions or quotes that page. The tool:

Splits both into normalized sentences and token sequences.
Runs sliding-window n-gram comparison (minimum 12-word matches).
Computes the longest verbatim run and total verbatim coverage.
Flags individual passages by length tier.
Estimates fair-use risk under the four-factor analysis: purpose, nature, amount, market effect.
Emits an AI prompt for drafting a takedown / clarification request to the LLM provider, with the matched passages embedded.
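
The matching steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the tokenization rule, the greedy longest-match strategy, the definition of coverage as a fraction of response words, and all function names are assumptions made for the example. Only the 12-word minimum comes from the step list.

```python
import re
from collections import defaultdict

MIN_MATCH_WORDS = 12  # minimum run length, from the step list above


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())


def find_verbatim_runs(source_tokens, response_tokens, min_words=MIN_MATCH_WORDS):
    """Return non-overlapping runs of response tokens copied from the source.

    Each run is (start_index_in_response, length_in_words).
    """
    # Index every min_words-long window of the source by its content.
    index = defaultdict(list)
    for s in range(len(source_tokens) - min_words + 1):
        index[tuple(source_tokens[s:s + min_words])].append(s)

    runs = []
    r = 0
    while r + min_words <= len(response_tokens):
        starts = index.get(tuple(response_tokens[r:r + min_words]))
        if not starts:
            r += 1
            continue
        # Extend each candidate match word by word and keep the longest.
        best = 0
        for s in starts:
            length = min_words
            while (s + length < len(source_tokens)
                   and r + length < len(response_tokens)
                   and source_tokens[s + length] == response_tokens[r + length]):
                length += 1
            best = max(best, length)
        runs.append((r, best))
        r += best  # skip past the matched run so runs never overlap
    return runs


def summarize(source, response):
    """Compute the longest verbatim run and total verbatim coverage."""
    src, resp = tokenize(source), tokenize(response)
    runs = find_verbatim_runs(src, resp)
    total = sum(length for _, length in runs)
    return {
        "longest_run": max((length for _, length in runs), default=0),
        "total_verbatim_words": total,
        "response_words": len(resp),
        "coverage": total / len(resp) if resp else 0.0,
    }
```

Calling `summarize(page_text, llm_answer)` returns the two numbers the rest of this post relies on: `longest_run` and `total_verbatim_words`. Because both texts are normalized the same way, differences in case and punctuation do not break a match, but any reworded phrase does.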

The four factors, applied to LLM verbatim copying

US fair use analysis weighs four factors. None alone is dispositive, but the audit reports each:

1. Purpose and character. Is the LLM use transformative? Pure summarization (paraphrased) leans transformative. Verbatim reproduction inside a generated answer leans non-transformative — the answer is using your content to be the answer, not to comment on it.

2. Nature of the copyrighted work. Highly creative works (poetry, fiction) get more protection than factual works. Small and midsize business (SMB) content is mostly factual, which leans toward fair use. But the expression of those facts (your specific phrasing) is still protected.

3. Amount and substantiality. The audit's main signal. 12-word matches: trivial. 50-word matches: meaningful. 150+ word matches: substantial. 300+ words total across a response: treated by the audit as high risk. These cutoffs are commentator estimates, not legal limits.

4. Effect on market. This is where SMBs get hurt. If a user reads the LLM answer and never visits your page, you've lost the ad impression / lead opportunity. Even attribution doesn't compensate when the substitute satisfies the user's intent fully.

What the verbatim thresholds mean

Under 50 words longest match: typical paraphrase-and-cite pattern. Low fair-use concern.

50-149 words: noteworthy. Likely cite-and-extended-quote. Probably defensible as fair use but worth tracking.

150-299 words single passage OR 300+ cumulative: entering substitute-for-original territory. Document the instance. If it recurs across queries, escalate.

300+ words single passage: functionally republication. Strong basis for a clarification or takedown request to the LLM provider.
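
The tiers above reduce to a small decision function. The sketch below assumes boundary values fall into the higher tier (exactly 150 words counts as substitute-for-original, exactly 300 as republication). The tier labels and the function name are hypothetical, chosen only for this example.

```python
def classify(longest_run, total_verbatim_words):
    """Map the two audit metrics to a risk tier (labels are illustrative)."""
    if longest_run >= 300:
        return "republication"
    if longest_run >= 150 or total_verbatim_words >= 300:
        return "substitute-for-original"
    if longest_run >= 50:
        return "noteworthy"
    return "paraphrase-and-cite"
```

Note that the cumulative check matters on its own: a response stitched together from several 100-word passages never trips the single-passage threshold, yet `classify(100, 320)` still lands in the substitute tier.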

The four LLM-provider response paths

When an LLM provider receives a complaint about substantial verbatim copying, there are four typical outcomes:

1. Acknowledged and patched. Provider adds source filtering / paraphrase rewriting on the offending content path. Resolves within 1-2 model retraining cycles (months).

2. Acknowledged, deferred. Provider logs the issue, makes no immediate change. Use this as the basis for a future legal escalation if the conduct persists.

3. Disputed under fair use. Provider claims their use is transformative. Document and decide whether to escalate based on commercial impact.

4. Ignored. Less common from major providers; common from open-source LLM hosts. May warrant a DMCA-style notice.

The audit's AI prompt helps you draft the initial clarification request with the specific verbatim matches embedded so the provider can identify the source.
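
As a rough idea of what such a prompt might look like, here is a hypothetical builder. The wording of the template, the function name, and its parameters are all assumptions for illustration, not the audit's real output format.

```python
def build_request_prompt(page_url, provider, passages):
    """Assemble a drafting prompt that embeds each matched passage.

    Hypothetical template; adjust the wording to your own situation.
    """
    lines = [
        f"Draft a polite clarification request to {provider}.",
        f"Source page: {page_url}",
        "The LLM response reproduced these passages verbatim:",
    ]
    for number, text in enumerate(passages, start=1):
        word_count = len(text.split())
        lines.append(f'{number}. ({word_count} words) "{text}"')
    lines.append("Ask how the provider will limit verbatim reuse of this page.")
    return "\n".join(lines)
```

Embedding the passages with their word counts gives the provider something concrete to search for, rather than a general complaint about copying.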

The 60-day monitoring path

Week 1-2: Pick your top 20 highest-value pages (revenue-driving, lead-driving, or competitively differentiated). For each, identify 5 likely query phrasings a user would type to find that page.

Week 3-4: Run each query through ChatGPT, Claude, Gemini, Perplexity, and You.com. Capture the responses.

Week 5-6: Run the audit on each (response, source-page) pair. Tabulate longest match and total coverage by provider.

Week 7-8: For any provider/page combination above the 150-word threshold, draft and send the clarification request using the AI prompt output.

Week 9+: Re-run a representative sample monthly. Trend the longest-match metric over time per provider.
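
At full scale this plan produces 20 pages x 5 queries x 5 providers = 500 (response, source-page) pairs, so the tabulation step is worth scripting. The sketch below assumes each audit result is stored as a dictionary with `provider`, `page`, and `longest_run` keys. That record shape and the function names are hypothetical, and the sketch treats a run of exactly 150 words as meeting the threshold.

```python
from collections import defaultdict

ESCALATION_THRESHOLD = 150  # single-passage threshold used in weeks 7-8


def worst_by_provider_page(results):
    """Return the longest verbatim run seen for each (provider, page) pair."""
    worst = defaultdict(int)
    for row in results:
        key = (row["provider"], row["page"])
        worst[key] = max(worst[key], row["longest_run"])
    return dict(worst)


def needs_request(results, threshold=ESCALATION_THRESHOLD):
    """List the (provider, page) pairs that warrant a clarification request."""
    worst = worst_by_provider_page(results)
    return sorted(key for key, longest in worst.items() if longest >= threshold)
```

For the monthly re-runs, storing the output of `worst_by_provider_page` alongside the run date is enough to trend the longest-match metric per provider.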

robots.txt, meta tags, and emerging "no-train" signals

Beyond fair-use complaints, you can preemptively signal copy preferences:

robots.txt with explicit AI-bot disallows (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, etc.) — this is the established mechanism. The audit links to the AI-bot allow-list reference.
<meta name="robots" content="noai, noimageai"> — emerging signal, not universally honored but parsed by some crawlers.
<meta name="ai-content-declaration" content="no-training, no-summary-substitute"> — proposed but not standardized.

These are signals, not enforcement. They do not create legal rights on their own, but they strengthen your position in any later dispute by documenting that the publisher made the preference explicit.
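
Before relying on a robots.txt file, it helps to confirm that it says what you intended. The sketch below uses Python's standard `urllib.robotparser` to check an example file against the bot tokens named above. The file contents, the `example.com` URL, and the helper name are assumptions for illustration, and real crawlers may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block the AI-related tokens, allow everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/pricing"):
    """Return {agent: True if the rules disallow this URL for that agent}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}
```

Running `blocked_agents(ROBOTS_TXT, ["GPTBot", "Googlebot"])` should show the AI token blocked while ordinary search crawling stays allowed, which is usually the outcome a publisher wants.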

Fact-check notes and sources

US fair use four-factor analysis: 17 USC § 107 and Copyright Office Fair Use Index
150-word and 300-word thresholds: synthesis of legal-commentator analyses 2023-2026 around AI training and output cases (NYT v. OpenAI, Authors Guild v. OpenAI, Getty v. Stability AI). These are commentator estimates, not statutory limits.
AI-bot opt-out signals: robots.txt RFC 9309; ai.txt is an emerging community convention without formal RFC status as of 2026
Provider response patterns: pattern observation across reported takedown / clarification cases 2023-2026

This post is informational, not legal advice. Mentions of OpenAI, Anthropic, Google, Perplexity, You.com, Stability AI, NYT, Getty, Authors Guild are nominative fair use. No affiliation is implied. If you face a substantive verbatim-copying issue with commercial impact, consult a copyright attorney.

When An LLM Quotes 200 Words Of You Verbatim, Is That Citation Or Republication?