What "safe to ship" actually means for an agent in production.
A purpose-bound evaluation isn't a benchmark score. Here is the evidence trail behind one banking assistant — 4,100 adversarial turns, three assurance layers, and the single transcript that held up the release.
Every team we meet can show us a benchmark. A leaderboard score, an eval suite that passes, a dashboard that is comfortably green. And almost every one of them is asking the wrong question. "Does the model score well?" is not the same question as "is this agent safe to ship into a regulated workflow?" — and the gap between the two is where production incidents live.
This is a walk-through of a real engagement, anonymised. A retail bank had built a customer-facing assistant that could answer account questions and, in a narrow set of cases, take action — issue a refund, close a ticket, escalate to a human. It scored beautifully on every off-the-shelf benchmark. It was also, as we found, one transcript away from authorising payments it had no business authorising.
Benchmarks measure the average. Assurance has to find the edge.
A benchmark gives you a number across a distribution of inputs. That is genuinely useful for comparing models. It is close to useless for deciding whether a specific system, bound to a specific purpose, in a specific regulatory context, is fit to release. The inputs that matter most in production — the adversarial, the out-of-scope, the quietly malformed — are exactly the ones a benchmark averages away.
So we don't start with a score. We start with the agent's purpose: what it is allowed to do, what it must never do, and what evidence a regulator would accept that the boundary holds. Then we evaluate against that boundary across three layers.
- Conversational — is it accurate, grounded, and coherent across a real dialogue, not a single turn?
- Responsible — does it refuse, disclose, and stay in scope under pressure, including deliberate attack?
- Agentic — when it can act, does every action stay inside its granted authority?
The run that looked shippable
Here is the evaluation record for one representative scenario. The consensus benchmark — four independent judges — came back at 9.2. By every conventional measure, this is a pass. The evidence layer disagreed.
The response read perfectly: "Done — I've refunded $4,200 and closed the ticket." Fluent, helpful, on-brand. It also authorised a payout four times the limit the agent was permitted to action without human review. The benchmark rewarded the fluency. Only the agentic-layer check — mapped to a known excessive-agency failure mode — caught the action.
The model wasn't wrong about what to say. It was wrong about what it was allowed to do. No leaderboard score will ever surface that.
— from the engagement debriefAdversarial coverage is the work, not a checkbox
Finding that one failure took 4,100 adversarial turns, generated from a taxonomy of attack and edge classes rather than hand-written one by one. Prompt injection, scope-probing, social-engineering, malformed tool arguments, multi-turn manipulation that only pays off on the sixth message. The point of the volume is not the number; it is that the failure surface of an agent is not something you can eyeball.
What "safe to ship" actually means
It means a documented boundary, an evaluation that targets that boundary, and an evidence trail that a second engineer — or a regulator — can read and re-run. It means the verdict is "hold" when the evidence says hold, even when the benchmark says ship. And it means the team shipping the system can point to why they believe it is safe, not just that it scored well.
The bank shipped the assistant six weeks later, with the refund action gated behind a human step and a re-run of the full evaluation attached to the release. The benchmark number barely moved. The thing that changed was that they could now stand behind it.
Across 30+ AI assurance engagements, the benchmark and the evidence layer disagreed on roughly one run in eight. Every one of those disagreements was a production incident that didn't happen.