QE Notes — Issue No. 047 | Cogniron

Hello from the regulated edge. This issue is shorter than usual because one idea earned the space: the gap between a model that scores well and a system that's safe to release. We keep finding it, and it keeps surprising people. Let's get into it.

From the field

The one-in-eight problem

Across our last thirty-odd AI assurance engagements, the consensus benchmark and the evidence layer disagreed on roughly one evaluated run in eight. In every case the benchmark said ship and the evidence said hold — an action taken out of scope, a citation that wasn't in the source, a refusal that didn't hold under a sixth-turn nudge.

The lesson isn't that benchmarks are bad. It's that they answer "is the model good?" when the question on a release call is "is this system, bound to this purpose, safe?" Those are different questions, and the second one is the one that shows up in the incident review. The full field report is here.
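For readers who want the distinction in concrete terms, here is a minimal sketch of a release gate that keeps the two questions apart. Everything in it is an assumption made for illustration: the `RunRecord` type, the `verdict` and `disagreements` helpers, and the eight sample runs are hypothetical, not our tooling or engagement data.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """One evaluated run (hypothetical structure, for illustration only)."""
    run_id: str
    benchmark_pass: bool                           # "is the model good?"
    findings: list = field(default_factory=list)   # evidence-layer findings


def verdict(run: RunRecord) -> str:
    """Ship only when the benchmark passes AND the evidence layer is clean."""
    return "ship" if run.benchmark_pass and not run.findings else "hold"


def disagreements(runs: list) -> list:
    """Runs where the benchmark said ship but the evidence said hold."""
    return [r for r in runs if r.benchmark_pass and r.findings]


# Illustrative data only: eight runs, one with an evidence finding.
runs = [RunRecord(f"run-{i}", benchmark_pass=True) for i in range(7)]
runs.append(RunRecord("run-7", benchmark_pass=True,
                      findings=["citation not in source"]))

disagreement_rate = len(disagreements(runs)) / len(runs)
for r in disagreements(runs):
    print(r.run_id, verdict(r), r.findings)
print(f"disagreement rate: {disagreement_rate:.3f}")
```

The design choice worth noting is that a benchmark pass is necessary but never sufficient: any evidence finding flips the verdict to hold, whatever the score.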

A method we changed our minds about

We stopped reporting agent pass-rate as a single number

For a year we put a pass-rate percentage at the top of every agent evaluation, because that's what stakeholders were used to. We've stopped. A single percentage over a probabilistic system flattens exactly the information that matters — the spread, the boundary, the variance under repetition.

Now the top line is a distribution and a boundary result. It's a harder number to put on a slide. It's also the one that's true.
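As a rough illustration of what "a distribution and a boundary result" can look like, the sketch below summarizes repeated runs per scenario instead of pooling them. The `summarize_runs` function, the scenario names, and the outcomes are hypothetical, and the choice of summary statistics is an assumption, not a description of our report format.

```python
from statistics import mean, median, pstdev


def summarize_runs(results: dict, boundary: set) -> dict:
    """Summarize repeated agent runs.

    results:  scenario name -> list of bool outcomes, one per repetition
    boundary: scenarios that must pass on every repetition (assumed rule)
    """
    rates = {name: mean(1.0 if ok else 0.0 for ok in runs)
             for name, runs in results.items()}
    all_outcomes = [ok for runs in results.values() for ok in runs]
    return {
        # The old single number, kept only for comparison.
        "headline_pass_rate": mean(1.0 if ok else 0.0 for ok in all_outcomes),
        # Distribution of per-scenario pass rates.
        "min_rate": min(rates.values()),
        "median_rate": median(rates.values()),
        "spread": pstdev(rates.values()),
        # Scenarios that sometimes pass and sometimes fail under repetition.
        "unstable": sorted(n for n, r in rates.items() if 0.0 < r < 1.0),
        # Boundary result: every repetition of every boundary scenario passed.
        "boundary_held": all(all(results[n]) for n in boundary),
    }


# Illustrative data only: four scenarios, ten repetitions each.
results = {
    "lookup_balance": [True] * 10,
    "summarize_policy": [True] * 10,
    "draft_reply": [True] * 9 + [False],
    "refuse_out_of_scope_transfer": [True] * 6 + [False] * 4,
}
summary = summarize_runs(results, boundary={"refuse_out_of_scope_transfer"})
for key, value in summary.items():
    print(key, value)
```

In this made-up data the pooled pass rate is 35 of 40, which looks healthy on a slide, while the boundary scenario fails four repetitions in ten. That is the information a single percentage flattens.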

Worth your time

The test pyramid was built for code. Agents need a different shape.
cogniron · methodology

How a tier-1 bank cut regression from six weeks to nine days
cogniron · case study

Non-functional debt is the silent killer of regulated releases
cogniron · essay

When the benchmark says ship and the evidence says hold.
