A roofing company in Twin Falls asked Perplexity about their own business last month. Perplexity confidently said they were a residential painting company, founded in 1998, serving the Magic Valley.
Three facts. Two were wrong.
They'd been roofing for 21 years and had never touched a paint can. The 1998 date matched their nearest competitor's founding year; it had bled across the entity boundary somewhere in training. The "Magic Valley" bit was right.
The scary part: customers were arriving confused about painting estimates. Perplexity was shaping the first impression of the business for everyone who asked.
Why LLMs hallucinate about brands
Three root causes, stacked:
Training-data conflation. During pretraining, the model compressed thousands of similar-looking businesses into a shared statistical space. Small business names are near-identical from ten feet away — "Acme Roofing of Twin Falls" and "Apex Roofing of Twin Falls" and "Alpine Roofing of Twin Falls" collapse into a single fuzzy cluster. The model samples from the cluster at inference time and emits details that belong to the neighbors.
Stale source data. You changed your hours in March. The LLM's training cutoff was the previous September. The "ground truth" it cites is six months obsolete. Even when retrieval-augmented generation (RAG) kicks in and pulls live data, the model's priors drag the answer back toward the outdated version.
Retrieval thinness. The LLM retrieved three snippets about your brand. Two were accurate, one was wrong (maybe an old archived page, a scraped review site, an aggregator with bad data). The model weights all three equally. The wrong one bleeds through.
None of these are going away. The LLM companies know about them and are working on mitigations, but the fundamental shape — a statistical model trained on a messy internet — means some degree of drift is structural.
Which means you need to measure it, and respond, on your own cadence.
What the AI Hallucination Detector does
The tool takes three inputs:
- Your canonical facts — what's true about the business. Hours, pricing, founding year, services, service area, licenses, policies. One row per fact, plus the synonyms a model might use when paraphrasing it.
- LLM responses — you paste in the actual answers from ChatGPT, Claude, Gemini, Perplexity, Copilot. One panel per model.
- Brand context — name and URL so the tool can generate prompts that prime each LLM fairly.
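A minimal sketch of what a canonical-fact row might contain. The field names here are illustrative assumptions, not the tool's actual storage format:

```python
# Illustrative shape for canonical-fact rows. One row per fact,
# with the synonyms an LLM might use when paraphrasing the value.
# Field names are assumptions, not the tool's real schema.
canonical_facts = [
    {
        "fact": "hours",
        "value": "Mon-Fri 8am-5pm",
        "synonyms": ["8 to 5", "8:00 AM - 5:00 PM", "weekdays 8-5"],
    },
    {
        "fact": "founding_year",
        "value": "2004",
        "synonyms": ["founded in 2004", "since 2004", "est. 2004"],
    },
]
```

The synonyms matter because models rarely echo your exact wording; without them, a correct paraphrase would score as a miss.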
It produces three outputs:
- Per-model accuracy score — what percentage of your canonical facts each model got right, partially right, or wrong.
- Per-fact cross-model coverage — which facts are universally wrong versus which LLM is the outlier. A fact that 4 out of 5 LLMs get wrong is a ground-truth problem on your own web presence. A fact that 1 of 5 gets wrong is that specific LLM's problem to route around.
- AI fix prompt — a copy-paste prompt that takes the findings and asks an LLM to propose the specific schema, sameAs, Wikipedia citation, press release, or on-site content changes that would propagate the correction back into future crawls.
Everything stays in your browser's localStorage. Nothing leaves the device. The paste-in model is intentional — it avoids needing API keys for every LLM vendor, which would be a configuration nightmare and a security risk.
How to run it — the 15-minute workflow
Minute 0-3: Declare your facts. Start with hours, service area, primary services, founding year, license numbers. Use the "Load example" button to see the shape — then replace with yours.
Minute 3-6: Build the prompts. Click "Build prompts." You'll get three variants: a factual summary prompt, a specific-Q&A prompt, and a comparison-grounding prompt.
Minute 6-12: Run each LLM. Open ChatGPT, Claude, Gemini, Perplexity in fresh chats (no prior brand context in the session). Paste the factual summary prompt. Copy the response. Paste it into the matching panel in the tool.
Minute 12-15: Read the diff. The tool scores automatically. You'll get a per-model percentage, a per-fact miss list, and the AI fix prompt.
What "good" looks like
Average accuracy over 80%: you're in strong shape. The one or two misses are usually nuanced stuff (service-area boundaries, a specific policy detail).
Average 60-80%: drift is real but contained. Usually fixable with schema + NAP updates + a Wikipedia presence push.
Average below 60%: structural problem. Either the LLMs don't have enough canonical data on you (thin web presence, weak schema, no Wikipedia entry) or they're mixing you with a competitor.
A specific model scoring far below the others — say, Perplexity at 40% when ChatGPT is at 85% — is usually a live-index issue. Perplexity leans heavier on real-time crawl, so stale pages hurt it more.
What to fix first, by hallucination type
Hours are wrong — update Google Business Profile immediately. GBP hours flow into Gemini via grounding, into Perplexity via live fetch, into Bing's Copilot index. Then update schema.org openingHours on your site. The schema update catches the stragglers that don't pull GBP.
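As a sketch, the on-site hours markup might look like this, rendered as a Python dict you would serialize to JSON-LD in a script tag. The business name, type, URL, and hours are placeholders:

```python
import json

# Placeholder LocalBusiness node with explicit opening hours.
# Swap in your own name, type, URL, and real hours before publishing.
local_business = {
    "@context": "https://schema.org",
    "@type": "RoofingContractor",
    "name": "Acme Roofing LLC",
    "url": "https://example.com",
    "openingHoursSpecification": [{
        "@type": "OpeningHoursSpecification",
        "dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
        "opens": "08:00",
        "closes": "17:00",
    }],
}

print(json.dumps(local_business, indent=2))
```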
Services are wrong — update schema.org Service or OfferCatalog on the site. Add a "What we do / What we don't do" FAQ with explicit exclusions. ("We are a roofing company. We do not provide painting services.") LLMs respect explicit negation better than people expect.
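The explicit-negation FAQ can carry matching FAQPage markup so the exclusion is machine-readable, not just on-page text. A sketch, with placeholder wording:

```python
import json

# Placeholder FAQPage markup carrying an explicit service exclusion.
faq = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [{
        "@type": "Question",
        "name": "Do you provide painting services?",
        "acceptedAnswer": {
            "@type": "Answer",
            "text": "No. We are a roofing company and do not "
                    "provide painting services.",
        },
    }],
}

print(json.dumps(faq, indent=2))
```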
Founding year is wrong — add schema.org foundingDate to your Organization node. Claim a Wikipedia entry if your business qualifies under WP:NCORP (meets notability threshold). Issue a press release mentioning the date explicitly. Wikipedia dates propagate to Google Knowledge Graph which propagates to Gemini.
Competitor conflation — this is the hardest to fix. The cure is entity disambiguation: distinct schema @id, explicit sameAs URLs, a Wikidata entry if you're large enough. Until then, every LLM prompt should name you specifically ("Acme Roofing LLC of Twin Falls" not "Acme Roofing") to force the model to route around the ambiguous cluster.
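The disambiguation signals above can live in one Organization node: a stable @id, the foundingDate, and sameAs links to the profiles that anchor your identity. A sketch with placeholder values (the Wikidata ID shown is hypothetical):

```python
import json

# Placeholder Organization node combining disambiguation signals:
# a stable @id, foundingDate, and sameAs links.
# The Wikidata Q-number below is hypothetical; use your real entry.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "@id": "https://example.com/#organization",
    "name": "Acme Roofing LLC",
    "foundingDate": "2004",
    "sameAs": [
        "https://www.wikidata.org/wiki/Q00000000",
        "https://en.wikipedia.org/wiki/Acme_Roofing",
    ],
}

print(json.dumps(organization, indent=2))
```

The stable @id is the piece most small-business sites skip, and it is exactly what lets a crawler tell two near-identical roofing companies apart.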
The monthly monitoring cadence
Hallucinations drift. A model that knew you correctly in January might confuse you with a newly-launched competitor by April. The only defense is periodic re-checking.
A simple monthly ritual:
- Run the tool on the first Monday of the month.
- Note any regressions from the prior month.
- Apply the AI fix prompt recommendations for the top 3 regressions.
- Log the score trend in a spreadsheet or a local note.
15 minutes a month. Catches most drift before customers do.
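The score log can be as simple as a local CSV you append to each month. A sketch; the filename, columns, and scores are arbitrary:

```python
import csv
from datetime import date
from pathlib import Path

# Append this month's per-model scores to a local CSV so
# regressions show up as a trend, not a surprise.
LOG = Path("hallucination-scores.csv")
scores = {"chatgpt": 85, "claude": 80, "gemini": 75, "perplexity": 40}

new_file = not LOG.exists()
with LOG.open("a", newline="") as f:
    writer = csv.writer(f)
    if new_file:
        writer.writerow(["date", "model", "accuracy_pct"])
    for model, pct in scores.items():
        writer.writerow([date.today().isoformat(), model, pct])
```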
Related reading
- Live Citation Surface Probe — runs the point-in-time version of this check against search surfaces, not just LLM answers
- LLM Training-Data Inclusion Audit — upstream diagnostic: are you even in the dataset?
- Knowledge Graph + Wikidata Audit — the entity-disambiguation layer that prevents competitor conflation
- AI Citation Readiness — baseline preparedness for being cited accurately at all
Methodology: the entity-disambiguation and sameAs propagation path is covered in depth in the AEO chapter of The $20 Dollar Agency, which works through a case study of a dental practice that reduced competitor conflation from 60% to 8% in 90 days via schema + Wikipedia + GBP coordination.
Fact-check notes and sources
- Retrieval-augmented generation (RAG) prior-drag effect: Lewis et al., 2020, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — foundational paper documenting the phenomenon
- LLM training cutoffs shift with each model release: GPT-4.1 reports a mid-2024 cutoff (with browsing), Claude 3.7 Sonnet a late-2024 cutoff, Gemini 2.0 a 2024 cutoff with live grounding, and Perplexity works from a fresh live index. Check each vendor's model documentation for current dates.
- Wikipedia notability for corporations: WP:NCORP
- Schema.org openingHours spec: schema.org/openingHours
This post is informational, not legal or SEO-consulting advice. Mentions of ChatGPT, Claude, Gemini, Perplexity, Copilot, and Profound are nominative fair use. No affiliation is implied.