Year in review — agent evaluation is the discipline that finally grew up
2025 was the year enterprise AI stopped being a demo and started being a system. The under-recognised reason is that agent evaluation finally became a serious engineering discipline.
by Skygena Editorial
Every December the AI press writes about the year’s biggest model release. We are going to do the opposite. The story of 2025 in enterprise AI was not a model. It was that agent evaluation finally became a serious engineering discipline — and the teams that took it seriously are the ones with agents in production.
Looking back at the engagements we ran this year, the through-line is unmistakable. The clients who shipped were the ones who treated evaluation as a first-class deliverable, not a chore to do at the end. The clients who got stuck were the ones who treated it as a nice-to-have.
Why evaluation was the unlock
A year ago, “evaluating an agent” mostly meant “running a few prompts and looking at the output”. That is fine for a demo. It is catastrophic for a production system.
Three things changed in 2025:
- Golden test sets became standard practice. Teams started curating real, labelled question/answer sets — co-authored with business owners — that the agent has to pass on every release. Nothing ships if agreement on the golden set drops below the threshold (a minimal sketch of such a gate follows below).
- Pairwise and rubric-based grading replaced gut feel. Letting a stronger model grade the weaker model's output against an explicit rubric became normal (sketched below). It scales. It is repeatable. It catches regressions a human would miss after the third coffee.
- Evaluation got wired into CI. Not a quarterly review — a per-pull-request gate. The agent has to pass the harness before it can ship. The same discipline software teams have had for unit tests for two decades, applied to agents.
Nothing on this list is exotic. None of it required a frontier model. All of it required engineering rigour the field had been postponing.
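To make the first and third items on that list concrete, here is roughly what a golden-set gate looks like. This is a minimal sketch, not our production harness: golden_set.jsonl, run_agent and the 95% threshold are illustrative stand-ins for whatever your stack and your business owners settle on. Because the script exits non-zero when agreement drops, the same file doubles as the per-pull-request CI gate.

```python
# golden_set_gate.py -- minimal sketch of a golden-set release gate.
# Everything named here (the JSONL file, run_agent, the threshold) is a
# placeholder; wire in your own agent client and agreed numbers.
import json
import sys

THRESHOLD = 0.95  # agree this number with the business owners up front


def run_agent(question: str) -> str:
    """Placeholder: call your agent and return its answer."""
    raise NotImplementedError


def normalise(text: str) -> str:
    """Cheap normalisation so trivial formatting differences do not count as disagreement."""
    return " ".join(text.lower().split())


def main() -> int:
    with open("golden_set.jsonl", encoding="utf-8") as f:
        cases = [json.loads(line) for line in f if line.strip()]

    agreed = sum(
        1 for case in cases
        if normalise(run_agent(case["question"])) == normalise(case["expected"])
    )
    agreement = agreed / len(cases)
    print(f"golden-set agreement: {agreement:.1%} across {len(cases)} cases")

    # A non-zero exit code fails the CI job, which blocks the release.
    return 0 if agreement >= THRESHOLD else 1


if __name__ == "__main__":
    sys.exit(main())
```

Exact string match is deliberately crude here; in most real harnesses the comparison step is swapped for a rubric-graded check like the one sketched next.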
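The rubric-graded check itself is not much more code. Again, a sketch under assumptions: call_judge_model stands in for whichever stronger model you use as judge, and the 1-to-5 rubric and the pass floor of 4 are examples, not recommendations.

```python
# rubric_judge.py -- sketch of rubric-based grading with a stronger model as judge.
# call_judge_model is a stand-in for your LLM client; the rubric and the
# pass floor are illustrative, not prescriptive.
RUBRIC = (
    "Score the candidate answer from 1 to 5 against the reference answer.\n"
    "5 = factually equivalent and complete, 3 = partially correct,\n"
    "1 = wrong or misleading. Reply with the number only."
)


def call_judge_model(prompt: str) -> str:
    """Placeholder: send the prompt to the stronger judge model and return its reply."""
    raise NotImplementedError


def grade(question: str, reference: str, candidate: str) -> int:
    prompt = (
        f"{RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
    )
    reply = call_judge_model(prompt).strip()
    return int(reply[0])  # parsing kept deliberately simple for the sketch


def agrees(question: str, reference: str, candidate: str, floor: int = 4) -> bool:
    """A case counts towards golden-set agreement when the judge scores at or above the floor."""
    return grade(question, reference, candidate) >= floor
```

The pairwise variant is the same idea with two candidate answers in the prompt and a "which is better, A or B" question; what matters is that the rubric, not the reviewer's mood, stays constant from release to release.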
What this looked like in practice
In our own work this year:
- A reporting agent we shipped to a mid-size industrial group has a golden set of 800 controlling questions, co-authored with the CFO's team. The agent's nightly run has to clear 96% agreement or the build fails.
- A drawing-reading agent for an auditing firm has a golden set of 400 architectural drawings labelled by senior auditors. We measure drawing-extraction agreement on every release. It has saved us from at least four bad releases this year.
- A messaging system for a European publisher has a brand-voice evaluator co-authored with the editorial team. Nothing sends unless it scores above the threshold. The publisher's editor asked us to use it elsewhere in the product.
In all three, the model is interchangeable. The evaluator is not. The evaluator is the asset.
What we expect in 2026
The next 12 months will normalise three things:
- Evaluation harnesses as required deliverables. No more "we'll add it later". Procurement teams will ask for the evaluator before they sign.
- Domain-specific evaluators authored by the business, not the engineers. This is the only way evaluators stay aligned to what the business actually cares about.
- Evaluation telemetry on the executive dashboard. Not just uptime and cost: accuracy, drift, override rate, escalation rate. Operating metrics for an operating system.
If 2024 was the year of the demo and 2025 was the year of the evaluator, 2026 will be the year agents are finally treated as systems. We are looking forward to it.
Happy holidays from the Skygena bench. We will see you in January.
Thinking about AI in your business?
Skygena is a boutique European AI studio engineering autonomous agents and LLM products. If you're wrestling with where to start — or where to stop — we can help.
Book a 30-minute call