I Spent a Day Trying to Make an LLM Count to 12. The Real Problem Was a Word.

I built an AI assistant into our enterprise architecture platform. It answers natural language questions about your application portfolio: “What apps are End of Life and on-prem?”, “What’s my total cloud spend?”, “Which Mission-critical apps have red health?”

It looked great in demos. The answers were fluent, well-structured, and came back fast. Then we started testing it properly.

The same question, asked four times in a row, returned four different answers. Counts varied between 6 and 8. Cost totals swung by six figures. The currency changed from £ to $ mid-session. App lists that should have been identical were different every time.

We had a reliability problem. I assumed it was an AI problem. I was wrong, but it took me most of a day to understand why, and the real answer is more useful than the one I went looking for.

The Naive Approach

My initial architecture was straightforward. Before each query, I built a portfolio snapshot: a JSON object containing every app’s lifecycle status, hosting type, criticality, cost, health score, and about 80 other fields. I stripped the sensitive data, sent the snapshot to the model along with the user’s question, and asked it to return a structured answer.

Simple. Elegant. Fast.

The problem is that I was asking the model to do two fundamentally different things at once: act as a query engine over structured data, and act as a narrative intelligence layer on top of it. That turned out to matter. But it wasn’t the whole story, and I want to be honest about the order I learned things in, because the order is the lesson.

The Investigation

I started where any engineer should, with the simplest possible fix.

Fix 1: Harden the counting rule. I made the model produce a workings array listing every matching app’s ID, and forced the count to equal the array’s length. Counts stabilised. Still wrong, but consistently wrong.
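A minimal sketch of that rule, assuming the model replies with JSON containing `count` and `workings` keys. Both key names and the sample response are illustrative, not the platform's actual schema.

```python
import json

def enforce_count(raw_response: str) -> dict:
    """Overwrite the model's stated count with the length of its workings list."""
    payload = json.loads(raw_response)
    workings = payload["workings"]  # hypothetical key: the ID of every matching app
    if len(set(workings)) != len(workings):
        raise ValueError("workings contains duplicate app IDs")
    payload["count"] = len(workings)  # the stated count is never trusted
    return payload

# Illustrative response in which the stated count disagrees with the list.
example = enforce_count('{"count": 3, "workings": ["app-01", "app-02"]}')
print(example["count"])  # 2
```

A check like this only makes the count agree with the list. It says nothing about whether the list itself is complete.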

Fix 2: Broaden the trigger. Extended the rule to every multi-filter query, not just ones with “how many” in them. No change.

Fix 3: Increase the token budget. In case the model was running out of room mid-answer. No change.

Three fixes in, the count was stuck at 8. The correct answer was 12. And I still thought I was debugging an AI.

The Diagnostic Turn

I added logging. And the logs said something I didn’t expect.

All 12 apps were in the data I sent the model. It was receiving everything. It was choosing to return only 8.

I cross-referenced the 8 it kept against the 4 it dropped. The dropped apps had something in common. They didn’t fit the model’s intuition of what “End of Life” means. Two were on extended vendor support. One was marked retired. The model was quietly applying its own judgment about which apps “really” counted, and excluding the ones that didn’t match its mental picture.
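The cross-reference itself is a set difference. A sketch, assuming each app in the snapshot carries an `id` and that the model's answer is a list of IDs; the rows and the `note` field below are made up for illustration.

```python
def dropped_apps(sent: list[dict], returned_ids: list[str]) -> list[dict]:
    """Return the apps that were in the snapshot but missing from the model's answer."""
    returned = set(returned_ids)
    return [app for app in sent if app["id"] not in returned]

# Made-up rows for illustration only; not the real portfolio data.
snapshot = [
    {"id": "app-01", "note": "standard support"},
    {"id": "app-02", "note": "extended vendor support"},
    {"id": "app-03", "note": "marked retired"},
]
model_answer_ids = ["app-01"]

for app in dropped_apps(snapshot, model_answer_ids):
    print(app["id"], app["note"])
```

Printing the dropped rows side by side is what makes a shared trait visible.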

My first conclusion was the obvious one. The model can’t be trusted to query data. A database applies a filter mechanically. It returns exactly what matches, with no opinion. A language model reasons semantically. It brings judgment to everything. That makes it superb at narrative and useless as a deterministic query engine where the answer is either right or wrong.

So I rebuilt the architecture around that. Questions that are really database queries now go to the database. The model only narrates the verified result. Separate retrieval from reasoning, and let each side do what it’s good at.

That was correct. It was also not the real problem. It just made the real problem easier to see.

The Word

Here is the thing I had missed for most of the day.

There is no “End of Life” in our data model.

The field a user thinks they’re asking about, lifecycle, has four values: Invest, Tolerate, Migrate, Eliminate. “End of Life” is not one of them. The strategic concept the user means by “End of Life” is actually called Eliminate.

And to make it worse, there is something called “End of Life” in the data model. It’s a value in a different field, support status, describing whether the vendor still supports the product. So the user’s three words don’t map onto nothing. They map onto two different things, in two different fields, one of which has a different name and one of which has the same name but the wrong meaning.

The model was never miscounting. It was doing something far more reasonable. It was being handed a genuinely ambiguous question, a question that did not have one correct interpretation, and it was resolving that ambiguity slightly differently each time it was asked. The inconsistency I had spent a day chasing wasn’t the model failing. It was the model faithfully reflecting an ambiguity that was already sitting in our product, in the gap between the words our users say and the way our data is modelled.
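To make the ambiguity concrete, here is a small sketch that applies both readings of the phrase to the same rows. The field names follow the text (`lifecycle`, and `support_status` for support status); the rows and the `Supported` value are assumptions for illustration.

```python
# Made-up rows for illustration; only the field names and the values
# "Eliminate", "Tolerate", "Invest" and "End of Life" come from the text.
apps = [
    {"id": "app-01", "lifecycle": "Eliminate", "support_status": "End of Life"},
    {"id": "app-02", "lifecycle": "Eliminate", "support_status": "Supported"},
    {"id": "app-03", "lifecycle": "Tolerate", "support_status": "End of Life"},
    {"id": "app-04", "lifecycle": "Invest", "support_status": "Supported"},
]

# Two defensible readings of the user's phrase "End of Life".
READINGS = {
    "strategic intent": ("lifecycle", "Eliminate"),
    "vendor support": ("support_status", "End of Life"),
}

def matching_ids(rows, field, value):
    """Apply one exact-match filter and return the matching app IDs."""
    return [row["id"] for row in rows if row[field] == value]

results = {name: matching_ids(apps, f, v) for name, (f, v) in READINGS.items()}
for name, ids in results.items():
    print(name, ids)
```

Each reading is applied correctly, and the two lists still differ. Neither filter is wrong; the question is.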

The AI didn’t introduce the problem. The AI made a pre-existing problem impossible to ignore.

Why This Is the Dangerous Kind of Bug

Here’s the part that should worry anyone building this sort of thing.

My deterministic rebuild, where the database does the retrieval and the model only narrates, did not fix this. It couldn’t. If the system maps “End of Life” to the wrong field, the database will now return the same wrong answer every single time, quickly and confidently, with a little “verified” badge next to it.

I had taken a bug that was random and wrong and turned it into a bug that was consistent and wrong. And consistent-and-wrong is worse, because it looks trustworthy. A number that changes every time invites suspicion. A number that’s stable, fast, and badged as verified gets believed.

The verified-data architecture guarantees the system correctly answers the question it was given. It does nothing to guarantee the question was understood the way the user meant it. Those are two different problems. I had solved the first and assumed I had solved the second.

What I Actually Built

Before I get to that second problem, it’s worth being concrete about the architecture, because the shape of it is the part other builders will want.

The rebuild has four moving parts, and none of them is clever. Clever was the original mistake.

An intent classifier sits in front. Every question hits a small, fast model call that does exactly one job. It decides what kind of question this is, and it extracts a structured filter. Is the user asking for a list or a count (retrieve)? For interpretation or advice (advise)? Or for judgement that depends on facts (hybrid)? The classifier does not answer anything. It routes.

There are three paths. A pure retrieval question goes straight to the database, and the model never touches the numbers; it only writes the sentence around them. A pure advisory question goes to the model with portfolio context, because interpretation is what the model is actually good at. The common case is hybrid: retrieve the verified data first, deterministically, then hand that verified set to the model to reason over. Retrieval finishes before reasoning begins. The model never decides what to fetch.
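The routing can be sketched in a few lines. This is an illustration under assumptions, not the production code: `Intent`, `answer`, and the three callables are hypothetical names, and the stubs stand in for the database and the model.

```python
from enum import Enum
from typing import Callable, Optional

class Intent(Enum):
    RETRIEVE = "retrieve"
    ADVISE = "advise"
    HYBRID = "hybrid"

Rows = list[dict]

def answer(
    intent: Intent,
    question: str,
    fetch_rows: Callable[[], Rows],                   # deterministic database call
    narrate: Callable[[Rows], str],                   # model writes a sentence around facts
    reason: Callable[[str, Optional[Rows]], str],     # model interprets or advises
) -> str:
    if intent is Intent.ADVISE:
        return reason(question, None)
    rows = fetch_rows()  # retrieval always completes before any model reasoning
    if intent is Intent.RETRIEVE:
        return narrate(rows)
    return reason(question, rows)  # hybrid: the model reasons over verified rows only

# Stubs that record the order in which they are called.
calls = []

def fake_fetch():
    calls.append("fetch")
    return [{"id": "app-01"}]

def fake_narrate(rows):
    calls.append("narrate")
    return f"{len(rows)} app(s) match."

def fake_reason(question, rows):
    calls.append("reason")
    return "advice"

answer(Intent.HYBRID, "Which should we retire first?", fake_fetch, fake_narrate, fake_reason)
print(calls)  # ['fetch', 'reason']
```

Passing the fetch step in as a plain function, rather than letting the model request it, is what keeps the decision about what to fetch out of the model's hands.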

The classifier emits a typed filter, not SQL. It does not write a query. It produces a structured, validated filter object, lifecycle equals Eliminate, hosting equals on-premises, and the application layer turns that into a query. A model writing raw SQL against a production database is a security and correctness problem waiting to happen. A model emitting a constrained, typed object that the application validates is auditable and safe. The grammar of what can be asked is fixed by us, not improvised by the model.
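A sketch of what a typed, validated filter might look like, assuming a table called `apps` with `lifecycle` and `hosting` columns. The table name, the class name, and the `cloud` hosting value are assumptions; the four lifecycle values and `on-premises` come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# The only fields and values a filter may contain. Fixed by us, not by the model.
ALLOWED = {
    "lifecycle": {"Invest", "Tolerate", "Migrate", "Eliminate"},
    "hosting": {"on-premises", "cloud"},  # "cloud" is an assumed value
}

@dataclass(frozen=True)
class PortfolioFilter:
    lifecycle: Optional[str] = None
    hosting: Optional[str] = None

    def __post_init__(self):
        # Reject anything outside the fixed vocabulary before it reaches a query.
        for field, allowed in ALLOWED.items():
            value = getattr(self, field)
            if value is not None and value not in allowed:
                raise ValueError(f"{field}={value!r} is not an allowed value")

def to_query(f: PortfolioFilter) -> tuple[str, list[str]]:
    """Build a parameterised query; column names come from ALLOWED, values are bound."""
    clauses, params = [], []
    for field in ALLOWED:
        value = getattr(f, field)
        if value is not None:
            clauses.append(f"{field} = ?")
            params.append(value)
    where = " AND ".join(clauses) or "1 = 1"
    return f"SELECT id FROM apps WHERE {where}", params

sql, params = to_query(PortfolioFilter(lifecycle="Eliminate", hosting="on-premises"))
print(sql)     # SELECT id FROM apps WHERE lifecycle = ? AND hosting = ?
print(params)  # ['Eliminate', 'on-premises']
```

Note that a filter built with `lifecycle="End of Life"` raises an error here. That is the validation doing its job, and it is also the first visible sign that the phrase has no home in that field.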

Aggregation is deterministic. Counts, sums, totals, percentages: none of them are done by the model. They are computed in code from the verified rows the database returned. This is the lesson of the cost totals that swung by six figures, made permanent. The model is handed finished figures and told, in plain terms, to state these exactly and calculate nothing.
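A minimal sketch of the idea, assuming each verified row carries an `annual_cost` stored as a decimal string. The field name, the rows, and the prompt wording are illustrative.

```python
import json
from decimal import Decimal

def aggregate(rows):
    """Compute every figure in code, using Decimal so money never hits float rounding."""
    total = sum((Decimal(row["annual_cost"]) for row in rows), Decimal("0"))
    return {"count": len(rows), "total_annual_cost": str(total)}

def narration_prompt(facts):
    """Hand the model finished figures; it writes the sentence and computes nothing."""
    return (
        "State these figures exactly and calculate nothing:\n"
        + json.dumps(facts, sort_keys=True)
    )

# Made-up verified rows for illustration.
rows = [
    {"id": "app-01", "annual_cost": "1200.50"},
    {"id": "app-02", "annual_cost": "300.25"},
]
facts = aggregate(rows)
print(facts)  # {'count': 2, 'total_annual_cost': '1500.75'}
```

Because the figures exist before the model is called, the application can also check that the numbers in the narrated sentence match `facts` before showing it to anyone.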

And every answer carries verification metadata, a flag recording whether its numbers came from the database or from model inference. The interface can then tell the user which kind of answer they are looking at. A verified count and an informed estimate should not look identical, and now they don’t.
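One way to carry that flag, sketched with hypothetical names (`Provenance`, `Answer`, `render`) and illustrative label text:

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    DATABASE = "database"                # figures computed from verified rows
    MODEL_INFERENCE = "model_inference"  # figures or claims produced by the model

@dataclass(frozen=True)
class Answer:
    text: str
    provenance: Provenance

    @property
    def verified(self) -> bool:
        return self.provenance is Provenance.DATABASE

def render(answer: Answer) -> str:
    """Make the two kinds of answer look different to the user."""
    label = "Verified" if answer.verified else "Estimate"
    return f"[{label}] {answer.text}"

print(render(Answer("12 apps match.", Provenance.DATABASE)))
print(render(Answer("Roughly a third look risky.", Provenance.MODEL_INFERENCE)))
```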

None of this is a pattern I lifted from a paper. I arrived at it by hitting the wall repeatedly and removing, one at a time, every place the model was being trusted with something it could not be trusted with. It is reassuring, afterwards, to see the wider industry converging on the same instinct: much of the serious enterprise AI work of the last year has been about treating the knowledge source, not the model, as the thing you invest in and trust. But I did not start there. I started with a number that would not stay still.

The Real Lesson

An AI assistant over enterprise data is only ever as reliable as the semantic agreement between how your users speak and how your data is modelled.

That sentence is not about AI. It’s the oldest problem in enterprise architecture. The shared business vocabulary, the agreed definition of a term, the semantic layer between what people say and what the system stores. We have always known that “customer” means six different things in six different systems. AI did not create that problem. It just removed the last place it could hide.

Before a natural-language interface, the ambiguity was contained. A user clicked a filter labelled “Lifecycle: Eliminate” and got exactly that. The label did the disambiguation. The moment you let someone type “show me the End of Life apps” in their own words, every unresolved mismatch between their vocabulary and your model becomes a live fault. And the more fluent and confident your AI, the more invisible the fault becomes.

So the fix isn’t only architectural. Yes, separate retrieval from reasoning, and never let the model do arithmetic or enumeration. But underneath that, the real work is a semantic audit. Go through every field and every value in the data model, list the words real users will actually say to mean each one, and find every place those two vocabularies disagree. Same word, different meaning. One word, several possible meanings. Words users will say that the data model has no concept for at all. Each of those is a latent version of the bug I spent a day chasing. “End of Life” was just the first one a user happened to hit.
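Part of that audit can be mechanised. The sketch below assumes you have written down, for each phrase users say, the (field, value) pairs they mean by it. The `Supported` value and the phrase `sunset` are invented for illustration; the other names come from the text.

```python
# Fields and their allowed values. "Supported" is an assumed value.
DATA_MODEL = {
    "lifecycle": {"Invest", "Tolerate", "Migrate", "Eliminate"},
    "support_status": {"End of Life", "Supported"},
}

# What users mean by each phrase, as (field, value) pairs gathered from interviews.
USER_PHRASES = {
    "eliminate": [("lifecycle", "Eliminate")],
    "end of life": [("lifecycle", "Eliminate")],
    "sunset": [],  # hypothetical phrase with no agreed meaning yet
}

def audit(phrases, data_model):
    """Flag every phrase where user vocabulary and the data model disagree."""
    findings = {}
    for phrase, meanings in phrases.items():
        # Values in the data model whose name is literally the user's phrase.
        literal = {
            (field, value)
            for field, values in data_model.items()
            for value in values
            if value.lower() == phrase.lower()
        }
        issues = []
        if not meanings:
            issues.append("no concept in the data model")
        if len(meanings) > 1:
            issues.append("several possible meanings")
        if literal - set(meanings):
            issues.append("same word, different meaning")
        if issues:
            findings[phrase] = issues
    return findings

findings = audit(USER_PHRASES, DATA_MODEL)
for phrase, issues in findings.items():
    print(phrase, issues)
```

The code only finds collisions in what has been written down. Collecting the phrases users actually say is still interview work.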

The output of that audit is a governed dictionary. User language mapped to canonical meaning mapped to where it actually lives in the data. In a regulated industry that dictionary is not housekeeping. It’s an auditable control, evidence that the system’s interpretation of a question is defined and intentional rather than left to a model’s instinct.
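In code, such a dictionary can be as plain as a lookup table. A sketch with hypothetical names (`Entry`, `DICTIONARY`, `resolve`); the phrase `out of vendor support` and the wording of the canonical meanings are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Entry:
    canonical_meaning: str  # the agreed business definition
    field: str              # where it lives in the data model
    value: str

# User language -> canonical meaning -> location in the data.
DICTIONARY = {
    "end of life": Entry(
        "Application the strategy says to remove", "lifecycle", "Eliminate"
    ),
    "out of vendor support": Entry(
        "Vendor no longer supports the product", "support_status", "End of Life"
    ),
}

def resolve(phrase: str) -> Optional[Entry]:
    """Look up a user phrase; None means ask the user, never guess."""
    normalised = " ".join(phrase.lower().split())
    return DICTIONARY.get(normalised)

entry = resolve("End of  Life")
print(entry.field, entry.value)  # lifecycle Eliminate
print(resolve("sunset"))         # None
```

Keeping the table in version control, with reviews on every change, is one way to produce the audit trail: each interpretation has an author, a date, and a reason.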

The Honest Reflection

I went into this thinking I had an AI accuracy problem. I rebuilt an architecture, and the rebuild was right and worth doing. But the rebuilt, deterministic, verified system would still have confidently told a customer the wrong number, because the actual fault was never in the model. It was in the inch of undefined space between a word our users say every day and a field in our database that doesn’t use that word.

The AI was not the unreliable component. The AI was the diagnostic instrument. It took an ambiguity our product had always contained and turned it into a visible, reproducible, four-different-answers-in-a-row failure. That’s a gift, if you’re paying attention. Most semantic debt in enterprise systems never announces itself this clearly.

If you’re building AI features on structured data, by all means separate retrieval from reasoning, and never let the model be the calculator. But don’t stop there, because that only guarantees the right answer to the question as the system understood it. Go and check that the system understands the words the way your users mean them. Audit the vocabulary. Find the gaps before a customer does.

The hardest bug to catch is not a wrong answer. It’s a confident, fluent, well-formatted answer to a question that was quietly understood to mean something other than what was asked. A language model produces nothing else, and that is exactly why it’s so good at showing you where your own definitions were never as solid as you thought.

John Murphy is Company Director of ailíniú Ltd, an enterprise architecture and AI consultancy, and the builder of Soiléire, an Application Portfolio Management platform for regulated financial services. He writes about building production AI systems at the intersection of enterprise architecture and regulated industries.