MethodologyEssay · 9 min read

The test pyramid was built for code. Agents need a different shape.

Coverage and pass-rate stop meaning much the moment behaviour becomes probabilistic. A field-tested model for what to measure when the system under test can answer the same question two different ways.

The test pyramid has been good to us. Many unit tests, fewer integration tests, a thin layer of end-to-end checks on top. It encodes a real truth about deterministic software: push verification down to where it is cheap, fast, and unambiguous. The trouble is that its central assumption — that the same input produces the same output — quietly stops being true the moment an LLM enters the system.

You can still write unit tests around an agent. They just stop telling you what you think they tell you. A green suite on a probabilistic system is a statement about one sampled path, not about the behaviour. We need a different shape.

Why the pyramid cracks

Three assumptions hold the pyramid up. Agents break all three.

  • Determinism. A passing assertion means the behaviour is correct. For an agent, it means this sample was correct. Run it again and the verdict can flip.
  • Locality. A unit test isolates a unit. An agent's behaviour is emergent across a whole dialogue and its tools; the interesting failures are non-local by definition.
  • Coverage as confidence. Line and branch coverage approximate how much of the system you've exercised. Neither has any meaning over a latent space of possible responses.

So pass-rate and coverage — the two numbers the pyramid trains us to trust — are the two numbers that mislead us most.

A green suite on a probabilistic system is a statement about one sampled path, not about the behaviour.

The shape we use instead

We replace the pyramid with what we call the assurance prism: three faces you evaluate simultaneously rather than layers you climb. Capability, boundary, and consistency. A system can be strong on one face and dangerously weak on another, and the failure is always at the join.

Capability — can it do the job, across a distribution?

Not "did it pass" but "how does quality hold across many sampled runs of the same task?" The unit of measurement is a distribution, not a boolean. We report spread, not just a mean.

Boundary — does it refuse what it must refuse?

The adversarial, out-of-scope, and unsafe surface. This is where benchmarks are silent and where regulated risk actually lives. It is evaluated by attack, not by example.

Consistency — does it give the same answer to the same question?

Run the identical input n times. The variance is the signal. A system that answers a compliance question two different ways has failed, even if both answers score well.

diagram · the assurance prism
Fig 1 — Capability, boundary, and consistency, evaluated together rather than stacked.

What this changes in practice

Three habits change the day you stop climbing the pyramid and start reading the prism:

  • Every behavioural check runs n times and reports a distribution. A single run is treated as anecdote, not evidence.
  • Adversarial coverage is generated from a taxonomy and tracked as a first-class metric, alongside capability.
  • The release gate is an evidence pack — capability spread, boundary results, consistency variance — not a pass-rate percentage.
The takeaway

Keep your unit tests for the deterministic parts of the system. For the agent, stop asking "did it pass?" and start asking "how does it behave across the distribution, at the boundary, and under repetition?"

Lena Vasquez
Head of Methodology · Cogniron

Lena owns the methodology that runs inside every Cogniron QE Pod. She has spent fifteen years turning "we think it works" into evidence a regulator will accept, across payments, health, and public-sector software.

No pitch deck. Just a conversation.

Bring our methodology to your release gate.

A QE Pod brings the prism, the tooling, and the sign-off with it. Book 30 minutes and we'll map it to your stack.