There's a test that takes sixty seconds, costs nothing, and tells you whether the AI models that shape how the world finds your business even know you exist.
Almost nobody runs it.
The test is: is your domain in Common Crawl?
What Common Crawl is, and why it matters
Common Crawl is a nonprofit that has been crawling the open web since 2008. They release a new dataset every 1-3 months, publicly, for free. Each dataset is a fresh snapshot — billions of pages, hundreds of terabytes per crawl.
Every major LLM's pretraining corpus is downstream of Common Crawl. GPT-4 and GPT-4o: trained partly on CC-derived data, per public reporting. Claude 3.7 and 4: same. Gemini: same. Llama 3 and 4: same. The labs rarely disclose exact mixtures, but the pattern doesn't change. Common Crawl is the substrate.
Which means:
- If your domain is in Common Crawl, you're a candidate for inclusion in every derived training dataset (C4, FineWeb, RedPajama, SlimPajama, OSCAR).
- If your domain is not in Common Crawl, you're excluded from that entire tier of downstream training data. Your only path to the LLM's knowledge is via retrieval — live search grounding at query time.
Retrieval is better than nothing. But retrieval is unreliable (models sometimes don't search), lossy (they summarize imperfectly), and trailing (the search result is more recent than the training data but still delayed by caching). Pretraining knowledge is baked in, always present, and structurally harder to displace once it's there.
The asymmetry is: being in Common Crawl is high-value and essentially free. Not being in Common Crawl is silent and invisible and costs you every LLM-surfaced impression.
How pages end up in or out
In:
- Public HTTP-accessible pages
- robots.txt allows CCBot (the default if robots.txt doesn't name-exclude CCBot)
- Server returns 200 OK on GET requests from CCBot's user agent
- No aggressive WAF or Cloudflare challenge blocking a non-browser UA
- Server-rendered HTML or well-structured SSR; CCBot does not execute JavaScript
Out:
- robots.txt disallows CCBot (some sites do this intentionally; many do it by accident via copy-pasted AI blocklists)
- Server returns 5xx or 403 to CCBot's UA
- A Cloudflare / Akamai / AWS WAF configuration that challenges bots by default, returning JS-rendered interstitials CCBot can't bypass
- Site is behind a login wall
- Content is entirely JS-rendered with no SSR fallback
- Site launched after the most recent CC-MAIN crawl (fine — you'll be in the next one)
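The robots.txt condition above is the one you can test locally in seconds. Here's a minimal sketch using the standard library's robotparser; the blocklist text is a made-up example of the copy-pasted pattern, not any real site's file.

```python
# Check whether a robots.txt would let CCBot fetch a URL.
# Standard library only; the sample robots.txt below is illustrative.
from urllib.robotparser import RobotFileParser

def ccbot_allowed(robots_txt: str, url: str = "https://example.com/") -> bool:
    """Return True if CCBot may fetch the given URL under this robots.txt."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("CCBot", url)

# A typical copy-pasted "block AI crawlers" file that also catches CCBot.
blocklist = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""
```

To run this against your live site, fetch https://yourdomain.com/robots.txt and pass its text to ccbot_allowed.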
Every paid AEO monitoring platform pretends this isn't the first thing to check. The LLM Training-Data Inclusion Audit makes it the first thing you check.
What the audit actually does
It queries Common Crawl's public CDX index API directly for your domain across the last 4 CC-MAIN indexes (roughly the past year of snapshots). For each index, it reports:
- How many records Common Crawl has for your domain
- How many unique URLs those records cover
- Whether there are gaps between indexes that suggest temporary outages or robots.txt changes
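The per-index queries described above can be sketched with the standard library alone. This assumes the public CDX endpoint pattern at index.commoncrawl.org (the index name CC-MAIN-2024-33 is an example; the live list is published at /collinfo.json), and the summarize helper is an illustrative reimplementation, not the audit's actual code.

```python
import json

CDX_HOST = "https://index.commoncrawl.org"

def cdx_query_url(domain: str, index: str) -> str:
    """Build a CDX query for every captured URL under a domain."""
    # The CDX server accepts a wildcard path and JSON-lines output.
    return f"{CDX_HOST}/{index}-index?url={domain}/*&output=json"

def summarize(cdx_lines):
    """Count total records and unique URLs in a CDX JSON-lines response."""
    records = [json.loads(ln) for ln in cdx_lines if ln.strip()]
    return {"records": len(records),
            "unique_urls": len({r["url"] for r in records})}
```

Fetching cdx_query_url(...) with any HTTP client and feeding the response lines to summarize gives the per-index counts; repeating across the last four CC-MAIN index names reveals the gaps.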
It then cross-references against five major derived corpora and reports likely inclusion status:
- Common Crawl (direct): hit if any records found
- C4: likely hit if CC presence exists, since C4 is CC filtered for English + clean text
- RedPajama / SlimPajama: likely hit if CC exists, since RPJ sources CC
- FineWeb: likely hit; FineWeb is the current favorite open-source English CC refinement (15T tokens)
- OSCAR: candidate hit; multilingual derivative requires language confirmation
- Wikipedia derivatives: separate check — run the Knowledge Graph + Wikidata audit for that pathway
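The cross-reference step amounts to a small rule table over direct CC presence. A sketch of that reasoning follows; the labels and rules are an illustrative rendering of the list above, not the tool's implementation.

```python
# Map direct Common Crawl presence to likely status in derived corpora.
# Rules mirror the prose above; labels are illustrative, not the audit's code.
def derived_status(cc_records: int, english=None):
    corpora = ("common_crawl", "c4", "redpajama", "fineweb", "oscar")
    if cc_records == 0:
        # No CC presence means no path into any CC-derived corpus.
        return {c: "miss" for c in corpora}
    return {
        "common_crawl": "hit",                     # direct: any record counts
        "c4": "likely" if english else "unknown",  # C4 = CC filtered to clean English
        "redpajama": "likely",                     # RPJ sources CC directly
        "fineweb": "likely",                       # English CC refinement
        "oscar": "candidate",                      # multilingual; needs language check
    }
```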
The output also includes the top ~30 indexed URLs — so you can see what Common Crawl actually grabbed from you. If it grabbed only your homepage and an old blog post while you have 400 product pages, that's a structural problem with your site's crawlability.
What "good inclusion" looks like
Strong inclusion looks like: 100+ records across the last 4 indexes, 50+ unique URLs, representation across your main content sections (not just homepage + tag archives). This means CC is finding your content consistently, your robots.txt is clean, your server returns 200s, and your site is accessible without JS execution.
Thin inclusion looks like: single-digit records, only the homepage, or long gaps between indexes. Usually the fix is simple: check your robots.txt, check your WAF rules, and check whether Cloudflare's Bot Fight Mode is on (it auto-blocks non-browser UAs).
Zero inclusion is rare but recoverable: a 90-day fix usually gets you into the next CC-MAIN crawl. The audit's AI fix prompt walks through the exact steps per likely failure mode.
The three most common failures we catch
1. Copy-pasted AI-blocklist in robots.txt. Someone saw a "block GPTBot" article and pasted an entire block list, which included CCBot. The site meant to opt out of OpenAI's model-training crawl and accidentally opted out of being in any LLM's knowledge base at all. Fix: remove CCBot from the disallow list. Allow GPTBot too if you want the training-inclusion benefit (this is a philosophical choice — see the AI Crawler Policy Generator for the full matrix).
2. Cloudflare bot-fight-mode. Cloudflare's "Bot Fight Mode" setting auto-challenges non-browser user agents. CCBot gets a JS-rendered challenge page and fails to crawl. Fix: turn off Bot Fight Mode, or add CCBot to the allowed UA list in Cloudflare's bot management.
3. Entirely client-side rendering. Common Crawl does not execute JavaScript. If your homepage body is a <div id="app"> and the content loads via React/Vue/Svelte at runtime, CC sees nothing but the loader. Fix: add SSR or prerendering. Every major framework and static site generator (Next.js, Nuxt, Astro, Eleventy) supports this.
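A crude way to see your site the way a non-JS crawler does: strip tags from the raw HTML and measure how much visible text is left. The 200-character threshold below is an arbitrary illustration, not anything CCBot actually applies, and the sample markup is made up.

```python
# Estimate how much visible text a non-JS crawler would see in raw HTML.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text outside <script>/<style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data.strip())

def looks_like_app_shell(html: str, min_chars: int = 200) -> bool:
    """True if the raw HTML carries almost no visible text (illustrative cutoff)."""
    p = TextExtractor()
    p.feed(html)
    return len(" ".join(c for c in p.chunks if c)) < min_chars

# An empty SPA loader: a crawler without JS execution sees nothing here.
shell = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'
```

Run it on `curl`-fetched HTML, not browser-rendered HTML — the whole point is to see the pre-JavaScript payload.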
The 90-day recovery playbook
If your audit shows zero or thin CC presence:
- Days 1-7: audit robots.txt, server UA policy, Cloudflare settings. Fix blockers.
- Days 8-14: submit your site to aggregators and directory sites that CC crawls regularly. Reddit submissions (if relevant), Hacker News, industry-specific directories, news mentions. Each inbound link from a high-CC-frequency domain increases your odds of being picked up in the next crawl.
- Days 15-30: add canonical URLs to every page, make sure <title> and meta description tags are clean, ensure clean semantic HTML.
- Days 30-60: CC's next crawl window. Likely to pick you up if blockers are removed.
- Days 60-90: re-run this audit. Confirm inclusion. Expand.
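The Days 15-30 checks can be spot-verified locally. Here's a regex-based sketch (fragile by nature, but fine as a smoke test) with illustrative sample markup; it only confirms a non-empty <title> and a canonical link are present in the raw HTML.

```python
# Smoke-test a page's raw HTML for a non-empty <title> and a canonical link.
# Regex-on-HTML is brittle; this is a quick check, not a validator.
import re

def page_basics(html: str):
    return {
        "title": bool(re.search(r"<title>\s*\S.*?</title>", html, re.S | re.I)),
        "canonical": bool(re.search(r'<link[^>]+rel=["\']canonical["\']', html, re.I)),
    }
```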
Related reading
- AI Hallucination Detector — downstream effect: when LLMs get you wrong because the CC data is thin
- Robots LLM Drift Diff — watches robots.txt for accidental AI-crawler blocks
- AI Crawler Access Auditor — per-bot access matrix
- AI Posture Audit — full cross-reference of robots.txt, ai.txt, meta robots, X-Robots-Tag
Methodology: the Common Crawl → C4/FineWeb derivation path and the training-cutoff vs retrieval interplay are covered in the AEO chapter of The $100 Network, which argues that being pretraining-present (not just retrievable) is the durable competitive moat in a world of AI-mediated search.
Fact-check notes and sources
- Common Crawl public CDX index API: index.commoncrawl.org — publicly documented, free to query
- C4 (Colossal Clean Crawled Corpus): Raffel et al., 2020, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer — defines the C4 filters on top of CC
- FineWeb dataset documentation: HuggingFace FineWeb — 15T-token English CC refinement
- RedPajama documentation: TogetherAI RedPajama-Data — Llama pretraining-corpus reproduction
- OSCAR project: oscar-project.org — CC deduplicated + language-classified
- Common Crawl crawl cadence: CC-MAIN datasets published approximately every 1-3 months since 2013
This post is informational, not engineering or AI-strategy advice. Mentions of Common Crawl, C4, FineWeb, RedPajama, OSCAR, OpenAI, Anthropic, Google, Meta, Together AI, and Cloudflare are nominative fair use. No affiliation is implied.