
The real cost of not evaluating your AI system

Every enterprise AI team says they will add evaluation later. Later never comes — and the cost of that missing harness is not theoretical. Here is what we see when we audit unevaluated systems.

by Skygena Editorial

There is a sentence we hear in almost every AI project audit we run: “We plan to add evaluation once the system stabilises.” It is the AI equivalent of “we will write tests after launch.” It never happens, and the cost of not doing it is not abstract — it is measurable in regressions shipped, trust lost and operating hours wasted.

Here is what unevaluated AI systems actually cost.

The silent regression

Without an evaluation harness, every model update, every prompt tweak, every retrieval config change is a roll of the dice. The team makes a change, eyeballs a few outputs, decides it “looks fine”, and ships. Three weeks later, someone notices that the agent has been giving wrong answers to a specific class of question since the change.

Nobody caught it because nobody measured it. The regression was silent — not because it was subtle, but because there was no instrument listening.

In our experience, unevaluated systems ship at least one silent regression per quarter. Each one takes 2–4 weeks to detect (usually by a user complaint, not by the team), 1–2 weeks to diagnose, and a variable amount of trust to rebuild.

The confidence trap

Teams without evaluation harnesses develop a dangerous pattern: they lose confidence in their own system. Every change feels risky because they cannot prove it is safe. So they stop changing things. The system ossifies. The model stays on an old version because “we are not sure what would break.” The prompts stay untouched because “they work, we think, mostly.”

This is the opposite of what AI systems need. They need fast iteration: new models, new prompts, new retrieval strategies. The only thing that makes fast iteration safe is a harness that catches regressions before they ship.

The audit hole

Regulators are starting to ask: “How do you know this system is performing correctly?” The honest answer from most teams is: “We look at it sometimes.” That answer will not survive the EU AI Act enforcement cycle, and it certainly will not survive an incident investigation.

An evaluation harness is not just an engineering tool. It is the evidence base that your system works as documented. Without it, every compliance claim is unsubstantiated.

What evaluation actually costs to build

The common objection is that evaluation is expensive. It is not — relative to the system it protects. A minimal viable evaluation harness for a production agent looks like this:

  1. A golden test set of 50–200 labelled question/answer pairs, co-authored with the business owner. Time to create: 2–5 days of the business owner’s attention + 2–3 days of engineering to wire the harness.

  2. A CI gate that runs the agent against the golden set on every pull request and blocks merge if agreement drops below a threshold (a minimal sketch of the script follows below). Time to build: 1 day of engineering.

  3. A nightly regression run that re-tests the full set and alerts on drift. Time to build: half a day.

Total: roughly one engineering week and one week of business owner time (spread over 2–3 weeks). For a system that processes thousands of decisions a day, this is not expensive. It is negligent not to do it.
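
As an illustration of what items 1 and 2 amount to in practice, here is a minimal sketch of the harness script. It assumes a golden_set.jsonl file checked into the repo and a hypothetical ask_agent() function standing in for however your system is actually invoked; the exact-match scorer is the simplest possible starting point, and most teams replace it with a domain-specific check.

```python
# eval_harness.py -- minimal sketch of a golden-set gate (illustrative, not a drop-in).
# Assumes golden_set.jsonl with one {"question": ..., "expected": ...} object per line,
# and a hypothetical ask_agent(question) -> str wrapper around your system.
import json
import sys
from pathlib import Path

from my_agent import ask_agent  # hypothetical module name

THRESHOLD = 0.95  # per-PR gate: block merge below 95% agreement


def load_golden_set(path: str) -> list[dict]:
    """Read one labelled question/answer pair per JSONL line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def is_correct(expected: str, actual: str) -> bool:
    """Simplest possible scorer: normalised exact match.
    Real harnesses usually swap this for a domain-specific or model-graded check."""
    return expected.strip().lower() == actual.strip().lower()


def main() -> int:
    pairs = load_golden_set("golden_set.jsonl")
    failures = []
    for pair in pairs:
        answer = ask_agent(pair["question"])
        if not is_correct(pair["expected"], answer):
            failures.append((pair["question"], pair["expected"], answer))

    agreement = 1 - len(failures) / len(pairs)
    print(f"agreement: {agreement:.1%} ({len(pairs) - len(failures)}/{len(pairs)})")
    for question, expected, actual in failures:
        print(f"FAIL: {question!r}\n  expected: {expected!r}\n  got: {actual!r}")

    # A non-zero exit code is what lets CI block the merge.
    return 0 if agreement >= THRESHOLD else 1


if __name__ == "__main__":
    sys.exit(main())
```

Wired into CI as a required check, the non-zero exit code is what blocks the merge; the same script run on a schedule gives you the nightly regression run in item 3.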

The pattern we recommend

After 18 months of building evaluation harnesses for client agents, our standard is:

  • Golden set: 100–500 pairs, versioned in git, owned by the business. Updated monthly. Treated as code, not as a document.
  • Per-PR gate: agreement threshold (typically 95%). Blocks merge. No exceptions.
  • Nightly full run: catches drift from external changes (model updates, data changes, upstream API changes). Alerts to Slack (sketched below).
  • Monthly review: business owner + engineering lead review the harness results, update the golden set, discuss overrides.

The entire setup fits in a single CI workflow and a 200-line evaluation script.
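
The only piece the per-PR script does not cover is the alert. Here is a sketch of the nightly step, assuming a Slack incoming webhook and reusing the hypothetical helpers from the harness sketch above (SLACK_WEBHOOK_URL, my_agent and the file name are stand-ins for your own setup):

```python
# nightly_drift.py -- sketch of the nightly full run with a Slack alert.
import json
import os
import urllib.request

from eval_harness import is_correct, load_golden_set  # from the sketch above
from my_agent import ask_agent  # hypothetical module name

ALERT_THRESHOLD = 0.95


def run_golden_set(path: str = "golden_set.jsonl") -> float:
    """Re-run the full golden set and return the agreement score."""
    pairs = load_golden_set(path)
    correct = sum(is_correct(p["expected"], ask_agent(p["question"])) for p in pairs)
    return correct / len(pairs)


def alert_slack(text: str) -> None:
    """Post a plain-text message to a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)


if __name__ == "__main__":
    agreement = run_golden_set()
    if agreement < ALERT_THRESHOLD:
        alert_slack(f"Nightly eval drift: agreement dropped to {agreement:.1%}")
```

Scheduled with cron or the CI system's own scheduler, this catches the drift that no pull request ever touched.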

What to do this week

If your AI system does not have an evaluation harness:

  1. Open your system and write down 20 questions it should be able to answer correctly. Include the expected answer.
  2. Run those 20 questions through the system and score the output (correct / incorrect / partially correct).
  3. If more than 2 are wrong, you have found regressions you did not know about. That is the cost of not evaluating.
  4. Commit the 20 pairs to a file in your repo (a starter format is sketched below). You have just started your golden set.
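
The format of that file matters far less than the fact that it is versioned. One JSON object per line keeps it diff-friendly and trivially machine-readable; the field names and example pairs below are placeholders, not a prescribed schema:

```python
# seed_golden_set.py -- sketch of committing the first pairs as a JSONL file.
import json

# Placeholder pairs for illustration; replace with the 20 you wrote down in step 1.
starter_pairs = [
    {"question": "What is the refund window for annual plans?", "expected": "30 days"},
    {"question": "Which plan tiers include SSO?", "expected": "Business and Enterprise"},
    # ... remaining pairs
]

with open("golden_set.jsonl", "w", encoding="utf-8") as f:
    for pair in starter_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```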

The rest — CI gate, nightly run, monthly review — is engineering work that any competent team can build in a week. The hard part is deciding to start. Start this week.

If you want help building or auditing an evaluation harness for your production agent, write to [email protected].

Thinking about AI in your business?

Skygena is a boutique European AI studio engineering autonomous agents and LLM products. If you're wrestling with where to start — or where to stop — we can help.

Book a 30-minute call