When the benchmark says ship and the evidence says hold.
A field note on the one-in-eight problem, a methodology habit we've abandoned, and three things worth your attention this fortnight. Five minutes, no filler.
Hello from the regulated edge. This issue is shorter than usual because one idea earned the space: the gap between a model that scores well and a system that's safe to release. We keep finding it, and it keeps surprising people. Let's get into it.
The one-in-eight problem
Across our last thirty-odd AI assurance engagements, the consensus benchmark and the evidence layer disagreed on roughly one evaluated run in eight. In every one of those disagreements the benchmark said ship and the evidence said hold: an action taken out of scope, a citation that wasn't in the source, a refusal that didn't hold under a sixth-turn nudge.
The lesson isn't that benchmarks are bad. It's that they answer "is the model good?" when the question on a release call is "is this system, bound to this purpose, safe?" Those are different questions, and the second one is the one that shows up in the incident review. The full field report is here.
We stopped reporting agent pass-rate as a single number
For a year we put a pass-rate percentage at the top of every agent evaluation, because that's what stakeholders were used to. We've stopped. A single percentage over a probabilistic system flattens exactly the information that matters — the spread, the boundary, the variance under repetition.
Now the top line is a distribution and a boundary result. It's a harder result to put on a slide. It's also the one that's true.
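For readers who want to see the shape of that top line, here is a minimal sketch rather than our actual tooling. It assumes each task is run several times and that every run records two things: whether it passed, and whether the agent acted outside its scope. The names Run, out_of_scope and summarize are hypothetical, and reading "boundary result" as a count of out-of-scope actions is an assumption made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    task_id: str        # which task this repetition belongs to
    passed: bool        # did the run meet the task's pass criterion?
    out_of_scope: bool  # hypothetical flag: did the agent act outside its scope?


def summarize(runs):
    """Report the spread of per-task pass rates plus a boundary count."""
    by_task = {}
    for r in runs:
        by_task.setdefault(r.task_id, []).append(r)

    # One pass rate per task, computed over its repeated runs.
    rates = sorted(
        mean(1.0 if r.passed else 0.0 for r in task_runs)
        for task_runs in by_task.values()
    )

    return {
        "mean_pass_rate": mean(rates),  # the old single number, kept for reference
        "min_task_pass_rate": rates[0],
        "max_task_pass_rate": rates[-1],
        # Tasks that sometimes pass and sometimes fail under repetition.
        "unstable_tasks": sum(1 for x in rates if 0.0 < x < 1.0),
        # Boundary result: any out-of-scope action, whether or not the run passed.
        "boundary_violations": sum(1 for r in runs if r.out_of_scope),
    }
```

In this sketch a suite can show a respectable mean while still reporting an unstable task and a boundary violation, which is the information a single percentage would flatten.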