AI testing in 2026. The bit where everyone realised the demo wasn't the product.

A field report from inside the corporate scramble — what's working, what's not, and the testing playbook quietly emerging.

By Duncan Smith · 24 May 2026

It's 2026 and your bank has an AI assistant. Your insurer too. Your telco. The coffee shop chain that used to send you a paper loyalty stamp now has a chatbot that "understands your preferences." Half of them are wonderful. The other half will cheerfully tell you that yes, you can absolutely deduct your dog as a dependent.

We are, collectively, in the bit between the demo and the product.

The demo phase was great. An executive saw a model do something impressive and said "we need this." Procurement moved. Vendors arrived. Pilots launched. Slack channels filled with rocket emojis. Then somebody asked the question that breaks the spell:

"How do we know it works?"

That, dear reader, is the gap.

The corporate gap, in one sentence.

The model is in production. The test plan is in someone's head. The "head" left for a competitor in March.

If you laughed, you work in this industry.

The pattern repeats across every Fortune 500 I see. A team builds a retrieval pipeline over three weeks. It demos beautifully on cherry-picked questions. It ships. Six months later somebody runs ten questions a real customer might ask, and four of them confidently cite a policy that hasn't existed since 2019.

Nobody was lazy. Nobody was incompetent. The tooling didn't exist a year ago, and the test discipline didn't transfer from regular software because LLMs don't behave like regular software.

That's the honest version of the story.

Why testing got weird.

Regular software has a charming property: same input, same output. Run the test, get a green tick, go home.

LLMs are the opposite of charming. Same input, slightly different output. Sometimes wildly different. Sometimes the model invents a citation. Sometimes it refuses to answer a question it answered yesterday. Sometimes it confuses your customer's first name with the name of a deprecated SKU.

So the question shifts from "did it return the expected value?" to "did it return something acceptable across a distribution of plausible inputs, with grounding I can trace back to a source document?"

You can see why this gave QA leaders a headache. Their entire profession was built on determinism.

The good news: a working playbook is emerging. The better news: it's not as terrifying as the trade press makes out.

Seven things that hold up on the eval bench.

1. Build the golden set before you build the model

A golden set is fifty to two hundred real questions, each with the right answer written down by a human who knows the domain. Boring. Effective. Most teams skip it because it isn't fun. Don't skip it. In my experience, the team that writes the golden set first ships roughly twice as fast as the team that writes it after the first user complaint.
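To make that concrete, here is a minimal sketch of a golden set and a pass-rate check in Python. Everything in it is an assumption for illustration: the `GoldenCase` shape, the `must_contain` idea of listing facts a correct answer has to mention, and the `answer_fn` callable standing in for whatever system you are testing. Substring matching is the crudest scorer that could possibly work, which is rather the point of starting with it.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    """One human-written question plus the facts a correct answer must state."""
    question: str
    must_contain: tuple[str, ...]


def pass_rate(cases: Sequence[GoldenCase], answer_fn: Callable[[str], str]) -> float:
    """Return the fraction of golden cases whose answer mentions every required fact.

    `answer_fn` is a hypothetical stand-in for the system under test.
    """
    if not cases:
        raise ValueError("golden set is empty")
    passed = 0
    for case in cases:
        answer = answer_fn(case.question).lower()
        # Case-insensitive substring match: crude, but easy to audit by eye.
        if all(fact.lower() in answer for fact in case.must_contain):
            passed += 1
    return passed / len(cases)
```

Run it on every change to the prompt, the retriever, or the model, and track the number over time. When substring matching stops being good enough, swap the scorer and keep the cases.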

2. Test retrieval separately from generation

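If the model gives a bad answer, you have two suspects: the retriever pulled the wrong documents, or the generator ignored the right ones. Different failure modes, different fixes. Test them apart. Two metrics are your friends here. Recall@5 tells you whether the right documents made it into the top five results. Groundedness tells you whether the claims in the answer are actually supported by those documents.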
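A rough sketch of both measurements, in Python. Recall@k is the standard definition. The groundedness function is only a token-overlap proxy that I am assuming for illustration, and the `min_overlap` value of 0.6 is an arbitrary starting point, not a recommendation. Real groundedness checks usually need something smarter, but this is enough to separate the two suspects.

```python
import re
from typing import Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Sequence[str], k: int = 5) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant document id")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, context_passages: Sequence[str], min_overlap: float = 0.6) -> float:
    """Crude proxy: fraction of answer sentences whose tokens mostly occur in the context.

    This is an illustrative heuristic, not an established metric implementation.
    """
    context_tokens: set[str] = set()
    for passage in context_passages:
        context_tokens |= _tokens(passage)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if _tokens(s)]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        tokens = _tokens(sentence)
        if len(tokens & context_tokens) / len(tokens) >= min_overlap:
            supported += 1
    return supported / len(sentences)
```

Low recall with high groundedness points at the retriever. High recall with low groundedness points at the generator.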

3. Run it twice

Same prompt. Same model. Two runs. If the answers disagree on the facts, you have a consistency problem worth knowing about before your customer finds it. Two API calls. One postmortem avoided.
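Here is one way to sketch the check in Python. The `ask` callable is hypothetical and stands in for your own model call. Comparing only the numbers in each answer is my simplifying assumption: it catches "30 days" versus "14 days" and misses disagreements expressed in words, so treat it as a first tripwire rather than a full comparison.

```python
import re
from typing import Callable


def numeric_facts(text: str) -> set[str]:
    """Pull out the numbers in an answer (amounts, day counts, years)."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def runs_agree(ask: Callable[[str], str], prompt: str, runs: int = 2) -> tuple[bool, list[str]]:
    """Call the model `runs` times and report whether every run states the same numbers.

    Returns the verdict plus the raw answers so a human can read the disagreement.
    """
    answers = [ask(prompt) for _ in range(runs)]
    fact_sets = [numeric_facts(answer) for answer in answers]
    agreed = all(facts == fact_sets[0] for facts in fact_sets)
    return agreed, answers
```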

4. Use a model to judge a model (carefully)

LLM-as-judge gets a lot of stick, and most of the stick is deserved when people use it lazily. Used carefully — with a rubric, calibrated against human-graded examples, with a confidence threshold — it scales evaluation in a way humans can't. Treat the judge as a noisy sensor, not an oracle.
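A small Python sketch of the "calibrated" and "confidence threshold" parts. It assumes you already have a list of human pass/fail grades and the judge's grades for the same cases. Cohen's kappa is my choice of agreement statistic here, not the only option, and the `route` function, its tuple layout, and the 0.8 threshold are hypothetical.

```python
from typing import Hashable, Sequence


def cohens_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Agreement between human and judge labels, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (list(human).count(label) / n) * (list(judge).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def route(verdicts: Sequence[tuple[str, bool, float]], min_confidence: float = 0.8):
    """Split (case_id, passed, confidence) verdicts into auto-accepted and human-review ids."""
    auto, review = [], []
    for case_id, _passed, confidence in verdicts:
        (auto if confidence >= min_confidence else review).append(case_id)
    return auto, review
```

If kappa against your human graders is poor, fix the rubric before you trust a single judge score. The low-confidence queue is where your humans should spend their time.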

5. Watch for the boring failures first

Everyone wants to talk about jailbreaks and adversarial attacks. Glamorous. Worth your attention eventually. Not the failure mode costing you money this quarter. The failure mode costing you money is the model politely making up a refund policy. Catch the boring stuff first.

6. Version your prompts like you version your code

Your prompt is code. Your prompt-template-with-six-edits-from-three-people-in-a-Slack-thread is not code. It is a liability. Put it in git. Tag it. Diff it. When behaviour changes you'll be glad you can answer "what changed?" in thirty seconds instead of two days.
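Git stays the source of truth. As a complement, here is a hedged Python sketch of two helpers I am assuming for illustration: a short fingerprint you can log next to every model response, so any answer can be traced to the exact prompt text that produced it, and a diff for answering "what changed?". Both use only the standard library (`hashlib`, `difflib`); the function names and the 12-character length are arbitrary.

```python
import difflib
import hashlib


def prompt_fingerprint(template: str) -> str:
    """Short, stable id for a prompt template; changes whenever the text changes."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def prompt_diff(old: str, new: str) -> str:
    """Unified diff between two prompt versions, line by line."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    )
    return "\n".join(lines)
```

Log the fingerprint with each request. When behaviour shifts, group the logs by fingerprint and the thirty-second answer is usually sitting right there.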

7. Have someone you trust try to break it before you ship

I'll keep this short on detail for reasons that will become obvious if you think about why. Find a person who is good at finding edges. Let them poke. Fix what they find. Repeat. The cost of doing this internally is small. The cost of skipping it is large and public.

The thing nobody puts in the slide deck.

Most corporate AI projects don't fail because the model is bad. They fail because the organisation hasn't decided who owns the answer when the model is wrong.

Is it the vendor? The platform team? The business unit that signed off the pilot? The compliance officer who approved the use case but not the prompt? In most companies the answer is "nobody, until it's a problem, and then everybody."

Fix this before you fix anything else. A bad model with clear ownership recovers. A good model with no ownership becomes a slow-motion incident.

Unglamorous advice. Also the difference between an AI programme that's still alive in 2027 and one that quietly disappears from the next town hall.

What 2026 feels like, from where I sit.

The vibe in the industry has shifted. Two years ago everyone was racing. Last year everyone was hedging. This year people are starting to ask the boring, useful questions: what's the test coverage, what's the failure rate, how do we know it's getting better, who do I page when it's not.

That's a healthy place to be. Less rocket emoji. More clipboard. The clipboard always wins eventually.

If you're building or buying AI right now, the best thing you can do is treat it like a system that will fail and design around the failures. The teams that do this look slower from the outside and ship faster on the inside. Funny how that works.

A quick plug, then I'll go.

Sau5 is the small consultancy I run that does this kind of testing for a living. If your team is in the gap between "we shipped it" and "we know it works," that's where we live. Otherwise, please steal the playbook above and use it. The industry will be in better shape if everyone tests their AI before everyone else finds the bugs.

Now go write your golden set.

Duncan Smith runs Sau5, a global AI testing consultancy. He writes about evaluation, grounding, and the slow art of making models behave. Find more on Medium and LinkedIn.
