Your AI Is Quietly Making Up the Numbers in Your Reports

May 30, 2026

Editorial note. The publication date shown above may be in the future. That is intentional. Posts on this site are scheduled against an editorial calendar that aligns with product releases, book launches, and platform-signal timing; the datePublished reflects the date the post is slated to go public, which is also the date indexers and syndication partners should treat as canonical. If you are reading this before that date you were early — welcome.

I run a small AI news project. One of its daily outputs is a "source diversity audit," a little box that tells the reader how balanced the day's coverage was: how many stories came from state media, from Western outlets, from regional independents, and so on. It looked professional. It had confident numbers in it. And one morning I noticed the numbers were fiction.

The audit said it had counted the day's stories by source type. The totals added up to 52. The corpus that day had 163 stories. The model had not counted anything. It had glanced at the ones it happened to cite, produced a set of numbers that looked about right, and moved on. A reader would never know. I only knew because I went looking.

This is one of the most common and least talked-about failure modes in AI features, and once you see it you cannot unsee it.

Why a language model cannot count

A large language model predicts text. When you ask it "how many of these 200 stories are from Russian state media," it does not loop through 200 items and tally them. It produces the most plausible-looking answer. Plausible and correct are not the same thing, and for arithmetic over a long list they are usually not even close.

The tell is that the numbers never quite reconcile, and they wobble from run to run. Ask the same question twice and you get 14 one time and 11 the next. A spreadsheet does not do that. A for loop does not do that. A model does, because it is guessing every time.
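That reconciliation tell is easy to automate. Here is a minimal sketch, assuming the audit arrives as a dict of per-category counts. The function name and category labels are hypothetical, and the individual values are made up for illustration; only the total of 52 and the corpus size of 163 come from the incident described above.

```python
def reconciles(reported_counts: dict[str, int], corpus_size: int) -> bool:
    """Return True only if the per-category counts add up to the corpus size."""
    return sum(reported_counts.values()) == corpus_size


# Hypothetical breakdown: the per-category values are invented for this
# example. Only their sum (52) and the corpus size (163) match the post.
model_audit = {"state_media": 14, "western": 27, "regional_independent": 11}

print(sum(model_audit.values()))     # 52
print(reconciles(model_audit, 163))  # False
```

A check like this does not tell you the right numbers. It only tells you that the ones you were handed cannot all be right.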

Once I started checking, the same project had the problem in three more places. A "trending entities" panel claimed China was mentioned in 12 stories. The real number was 5. It claimed NATO appeared in 5. The real number was 1. A stats line reported the number of unique sources for the day. That was invented too. Every one of these was a confident, specific, wrong integer, printed next to real analysis, borrowing its credibility.

The fix is to split the job

Here is the thing that makes this easy to fix: a language model is genuinely good at the hard part and genuinely bad at the easy part.

The hard part is judgment. Which stories actually matter today. What is the framing difference between two outlets covering the same event. What is missing from the coverage. That is real work, and the model is good at it.

The easy part is counting. How many stories came from each region. How many outlets appeared. How many times a name shows up. That is not judgment. It is arithmetic over data you already have sitting in front of you.

So you split the job. Let the model write the words. Compute every number in code.

In my case the data was already in hand. Every story already carried its outlet, its region, its category, its language. Counting them by type was about fifteen lines of plain code that runs in a millisecond and is correct every single time. I had been paying a model to guess at arithmetic I could do for free, and getting wrong answers for the privilege. I deleted the part of the instructions that asked it to count, and I started computing the numbers myself right before the report gets saved. The model still writes the qualitative part, the "here is what is thin in today's coverage" sentence, because that part it is good at.
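The counting step can be sketched roughly like this. It is not the project's actual code: the `Story` class, the `source_audit` function, and the sample outlets are hypothetical, and the only thing taken from the description above is that each story carries an outlet, a region, a category, and a language.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    # Hypothetical record shape, based on the four fields named above.
    outlet: str
    region: str
    category: str
    language: str


def source_audit(stories: list[Story]) -> dict:
    """Compute every number in the audit box directly from story metadata."""
    return {
        "total_stories": len(stories),
        "by_category": dict(Counter(s.category for s in stories)),
        "by_region": dict(Counter(s.region for s in stories)),
        "unique_outlets": len({s.outlet for s in stories}),
    }


# Made-up sample data for illustration.
stories = [
    Story("Outlet A", "Europe", "western", "en"),
    Story("Outlet A", "Europe", "western", "en"),
    Story("Outlet B", "Asia", "state_media", "zh"),
    Story("Outlet C", "Africa", "regional_independent", "fr"),
]

audit = source_audit(stories)
print(audit["by_category"])

# Each story has exactly one category, so the counts reconcile by construction.
assert sum(audit["by_category"].values()) == audit["total_stories"]
```

Because the totals are derived from the same list that feeds the report, they cannot drift from the corpus size and they do not change between runs.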

The report did not get less impressive. It got honest.

The rule you can take to any AI feature

Whenever an AI feature shows a number (a count, a percentage, a total, a "14 mentions" badge, a "97% complete" bar), stop and ask one question: could I compute this from data I already have?

If the answer is yes, compute it. Do not let the model emit it. The model can decide what to show and how to describe it. The number itself should come from code that counts.
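One way to enforce that split, sketched here as an assumption rather than a description of my setup, is to have the model write prose with named placeholders and let code substitute the computed values. The `render` function, the placeholder names, and the sample sentence are all hypothetical.

```python
import string


def render(model_text: str, computed: dict[str, int]) -> str:
    """Fill {placeholders} in model-written prose with numbers computed in code."""
    # Collect every placeholder name the model used.
    fields = {
        name
        for _, name, _, _ in string.Formatter().parse(model_text)
        if name
    }
    # Refuse to render if the model asks for a number the code never computed.
    unknown = fields - set(computed)
    if unknown:
        raise KeyError(f"no computed value for: {sorted(unknown)}")
    return model_text.format(**computed)


# Hypothetical model output: words and placeholders, no digits of its own.
model_text = (
    "Coverage leaned on Western outlets ({western} of {total_stories} stories), "
    "with only {state_media} from state media."
)
computed = {"western": 2, "state_media": 1, "total_stories": 4}

print(render(model_text, computed))
# Coverage leaned on Western outlets (2 of 4 stories), with only 1 from state media.
```

The model still chooses which figures are worth mentioning and how to frame them. It just never gets to type the digits.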

If the answer is no, because the thing genuinely cannot be computed cleanly, be honest about it. Some things look countable but are not. Counting how many stories "mention" an entity sounds simple until you realize the model wrote the entity name as "European Union" while the stories all say "EU" or "Brussels." A naive text match undercounts and sometimes lands on zero. When I hit that case, I made the rule: compute the count where the name actually matches, and where it does not, hide the number rather than print a fake one. A missing badge is fine. A confidently wrong one is not.
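The hide-rather-than-fake rule can be sketched as a function that returns a count when the name matches and `None` when it does not. The function name, the whole-word matching, and the sample sentences are assumptions for illustration; the real matching logic may differ.

```python
import re
from typing import Optional


def mention_badge(name: str, story_texts: list[str]) -> Optional[int]:
    """Count stories that contain the name as a whole word.

    Returns None when nothing matches, which the caller should treat as
    "hide the badge", not as "zero mentions".
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    count = sum(1 for text in story_texts if pattern.search(text))
    return count if count > 0 else None


# Made-up story snippets for illustration.
texts = [
    "The EU agreed new sanctions on Tuesday.",
    "Brussels signalled it would delay the vote.",
    "NATO ministers met to discuss the budget.",
]

print(mention_badge("NATO", texts))            # 1
print(mention_badge("European Union", texts))  # None, so the badge is hidden
```

Returning `None` instead of 0 keeps the two cases apart: a zero would be printed as a fact, while a missing value forces the display code to drop the badge.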

Why this matters more than it looks

A wrong number in a casual chatbot is a shrug. A wrong number in something that calls itself a report, an audit, an analysis, a dashboard, is poison. The whole value of those formats is that the reader trusts the figures. The first time someone catches one made-up number, they stop trusting all of them, including the ones that were real.

The good news is that this is one of the cheapest fixes in all of applied AI. You are not buying a better model or a bigger context window. You are moving a small job from the thing that is bad at it to the thing that is good at it, and the thing that is good at it is a few lines of code you already know how to write.

Let the model think. Let your code count. Keep them in their lanes and the whole product gets more trustworthy overnight.

If you are building your own small software product and want the broader playbook for shipping lean and credible without a team behind you, that is the whole subject of The $97 Launch, one of my $9.99 field guides.

Fact-check notes and sources

The 163-versus-52 story count, the China (12 vs 5) and NATO (5 vs 1) mention gaps, and the unique-source figures are from my own project's logs, verified by recomputing the counts directly from the source data.
Language models are documented by their own makers as unreliable for exact tallying and prone to confident errors. See Anthropic's guidance on reducing hallucinations at docs.claude.com.

This post is informational, not consulting advice. It describes work on my own software. No affiliation with any vendor named is implied.
