The test pyramid was built for code. Agents need a different shape.
Coverage and pass-rate stop meaning much the moment behaviour becomes probabilistic. A field-tested model for what to measure when the system under test can answer the same question two different ways.
The test pyramid has been good to us. Many unit tests, fewer integration tests, a thin layer of end-to-end checks on top. It encodes a real truth about deterministic software: push verification down to where it is cheap, fast, and unambiguous. The trouble is that its central assumption — that the same input produces the same output — quietly stops being true the moment an LLM enters the system.
You can still write unit tests around an agent. They just stop telling you what you think they tell you. A green suite on a probabilistic system is a statement about one sampled path, not about the behaviour. We need a different shape.
Why the pyramid cracks
Three assumptions hold the pyramid up. Agents break all three.
- Determinism. A passing assertion means the behaviour is correct. For an agent, it means this sample was correct. Run it again and the verdict can flip.
- Locality. A unit test isolates a unit. An agent's behaviour is emergent across a whole dialogue and its tools; the interesting failures are non-local by definition.
- Coverage as confidence. Line and branch coverage approximate how much of the system you've exercised. Neither has any meaning over a latent space of possible responses.
So pass-rate and coverage — the two numbers the pyramid trains us to trust — are the two numbers that mislead us most.
A green suite on a probabilistic system is a statement about one sampled path, not about the behaviour.
The shape we use instead
We replace the pyramid with what we call the assurance prism: three faces you evaluate simultaneously rather than layers you climb. Capability, boundary, and consistency. A system can be strong on one face and dangerously weak on another, and the failure is always at the join.
Capability — can it do the job, across a distribution?
Not "did it pass" but "how does quality hold across many sampled runs of the same task?" The unit of measurement is a distribution, not a boolean. We report spread, not just a mean.
Boundary — does it refuse what it must refuse?
The adversarial, out-of-scope, and unsafe surface. This is where benchmarks are silent and where regulated risk actually lives. It is evaluated by attack, not by example.
Consistency — does it give the same answer to the same question?
Run the identical input n times. The variance is the signal. A system that answers a compliance question two different ways has failed, even if both answers score well.
What this changes in practice
Three habits change the day you stop climbing the pyramid and start reading the prism:
- Every behavioural check runs
ntimes and reports a distribution. A single run is treated as anecdote, not evidence. - Adversarial coverage is generated from a taxonomy and tracked as a first-class metric, alongside capability.
- The release gate is an evidence pack — capability spread, boundary results, consistency variance — not a pass-rate percentage.
Keep your unit tests for the deterministic parts of the system. For the agent, stop asking "did it pass?" and start asking "how does it behave across the distribution, at the boundary, and under repetition?"