
LLM Training-Data Inclusion Audit

Presence in Common Crawl is the single biggest under-reported AI-discoverability signal. This probe queries the public Common Crawl (CC) index directly, lists every snapshot in which your domain was crawled along with per-URL depth, and cross-references the derived training corpora (C4, RedPajama, FineWeb) that almost certainly contain your pages.

Context and background

Read the story behind this tool: "The Common Crawl test: are you even in the dataset?"

Inputs

We query the last 4 Common Crawl indexes by default. The indexes are spaced 3 months apart, so together they cover roughly the past year. All API calls route through the site's same-origin fetch proxy.
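For readers who want to reproduce the lookup by hand, here is a minimal Python sketch of the same idea, not this tool's actual implementation. It assumes the public index server at index.commoncrawl.org, whose `collinfo.json` lists the available indexes and whose per-index endpoint returns one JSON record per line when called with `output=json`. The helper names (`pick_latest`, `build_query_url`, `summarize_captures`) are hypothetical, the same-origin proxy is not modeled, and no network call is made here: fetch the URL with whatever HTTP client you prefer and pass the response text to the parser.

```python
import json
from collections import Counter
from urllib.parse import urlencode

# Assumption: the public Common Crawl index server.
CDX_BASE = "https://index.commoncrawl.org"


def pick_latest(collinfo, n=4):
    """Return the ids of the n newest indexes.

    `collinfo` is the parsed JSON list from collinfo.json. Assumption: ids
    look like "CC-MAIN-2024-18", so a reverse string sort puts the newest first.
    """
    ids = sorted((entry["id"] for entry in collinfo), reverse=True)
    return ids[:n]


def build_query_url(index_id, domain):
    """Build a query URL asking one index for every capture under a domain."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{CDX_BASE}/{index_id}-index?{params}"


def summarize_captures(ndjson_text):
    """Count successful (HTTP 200) captures per URL in a line-delimited JSON response."""
    per_url = Counter()
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # The index reports the HTTP status as a string.
        if record.get("status") == "200":
            per_url[record["url"]] += 1
    return per_url
```

A domain that returns no records from any of the selected indexes was not captured in those snapshots. Records with a non-200 status (redirects, errors) are skipped in this sketch because they are unlikely to have contributed page text to a derived corpus.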