We Deleted 20 of Our Own Quality Checks

By Benjamin Taini · Founder, Bouletteproof

Auditing the gates we were proud of taught us something we did not expect: the scaffolding you build for a weak model becomes a cage for a strong one.

When you build a system that lets AI write production code, you do not trust it. You bolt on checks. Does it compile? Are the braces balanced? Is there a null guard where one is needed? Did it wrap the error instead of swallowing it? Each check is a small insurance policy against the model doing something dumb, and over time you accumulate a lot of them. We had thirty-six.

Last month we counted how often each one actually caught a real failure in production over thirty days. About twenty of the thirty-six had caught nothing at all. Not "a few." Nothing. They fired, they passed, they logged a green tick, and in a month of real work not one of them had stopped a single bad outcome. They were not a safety net. They were noise wearing the costume of rigour.
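A count like this is simple to sketch. What follows is an illustration rather than our actual tooling: the `CheckRun` record, its fields, and the check names are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckRun:
    check: str                 # hypothetical check name, e.g. "null_guard"
    passed: bool               # did the check pass on this run?
    caught_real_failure: bool  # if it failed, was the failure a real one?


def dead_checks(runs, all_checks):
    """Return the checks that caught no real failure in the audited window."""
    catches = Counter(
        r.check for r in runs if not r.passed and r.caught_real_failure
    )
    # Counter returns 0 for checks that never appear among the catches.
    return sorted(c for c in all_checks if catches[c] == 0)


# Illustrative data only, not real production numbers.
runs = [
    CheckRun("compiles", passed=False, caught_real_failure=True),
    CheckRun("compiles", passed=True, caught_real_failure=False),
    CheckRun("braces_balanced", passed=True, caught_real_failure=False),
    CheckRun("null_guard", passed=True, caught_real_failure=False),
]
print(dead_checks(runs, {"compiles", "braces_balanced", "null_guard"}))
```

The point of the sketch is the denominator: it counts real catches, not runs. A check can fire thousands of times and still land on the dead list.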

So we deleted them.

Why they were useless, and why they had not always been

The more interesting finding is why they were useless, because they had not always been. Those checks earned their place when the models underneath were weaker. A model that routinely forgot a null check needed something watching for missing null checks. The scaffolding was real engineering, and it worked.

Then the models got better at reasoning, and quietly the relationship inverted. A model that reasons through its own work does not need a mechanical reminder to handle the empty case, because it has already considered the empty case. The check stopped catching the model's mistakes because the model stopped making them. And a check that only ever passes is not free. Ours were doing two kinds of harm we had not priced in. First, they obscured what actually went wrong when something did fail, because the real signal was buried under twenty green ticks that meant nothing. Second, they forced steps to run one at a time, each waiting on a check that never mattered, when they could have run in parallel.
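The second cost is easy to see in miniature. This is a rough sketch, assuming the steps really are independent once no gate sits between them; `step`, `run_serial`, and `run_parallel` are hypothetical names, and the sleeps stand in for real work.

```python
import asyncio


async def step(name, seconds):
    """Stand-in for one unit of pipeline work."""
    await asyncio.sleep(seconds)
    return name


async def run_serial(steps):
    # Each step waits for the previous one, as it would behind a gate.
    results = []
    for name, seconds in steps:
        results.append(await step(name, seconds))
    return results


async def run_parallel(steps):
    # With no gate in between, independent steps can overlap.
    return await asyncio.gather(*(step(n, s) for n, s in steps))


steps = [("lint", 0.05), ("unit", 0.05), ("package", 0.05)]
print(asyncio.run(run_serial(steps)))
print(asyncio.run(run_parallel(steps)))
```

Both calls return the same results in the same order. The serial version takes roughly the sum of the step times, while the parallel one takes roughly the longest single step.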

The scaffolding you build for a weak model becomes a cage for a strong one. You do not notice the bars until you measure them.

What survived the cull

Here is the part that survived, and it is the whole lesson. Underneath all thirty-six checks, one thing was actually deciding whether the work was good: did it build, and did it do what was asked? The compiler's verdict is not a matter of taste. The code either builds or it does not. A check that asks "does this output actually do the thing the job specified," pass or fail, is worth more than twenty checks that ask "does this look like code a careful person would write." The first is ground truth. The second is a model of taste, and a model of taste is exactly the thing a good-enough model already has. We wrote about that deterministic machinery in The Boring Parts That Ship; this is what happened when we turned the same scrutiny on the checks themselves.
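In code, a ground-truth gate of this kind can be very small. The version below is a sketch under assumptions: `ground_truth_gate` is a hypothetical name, and the two commands are stand-ins so the example runs anywhere. A real pipeline would call its own compiler and its own job-specific acceptance test.

```python
import subprocess
import sys


def ground_truth_gate(build_cmd, acceptance_cmd):
    """Pass only if the build succeeds and the acceptance command exits 0."""
    for cmd in (build_cmd, acceptance_cmd):
        if subprocess.run(cmd, capture_output=True).returncode != 0:
            return False
    return True


# Stand-in commands (assumptions for illustration): one exits 0, one exits 1.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
broken = [sys.executable, "-c", "raise SystemExit(1)"]

print(ground_truth_gate(ok, ok))      # built and did the job
print(ground_truth_gate(ok, broken))  # built, but failed the acceptance test
```

There is no score and no style opinion anywhere in it. The only input is an exit code, and the only output is pass or fail.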

We were not the only ones

The convergence is worth naming because it came from a completely different problem. Fareed Khan, building an automated security pipeline, wrote that a crash, not the model, decides what counts. His system refuses to believe the AI's claim that it found a vulnerability; it only counts the finding when the machine itself reproduces the crash. Different domain, same instinct: trust the verifier, distrust the narrator. When you are deciding whether a thing is right, the model's confidence is not evidence. The ground truth is. It is the same reason a per-job score of 85% can hide a broken build — the number measures the narrator, not the result.
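The same instinct fits in a few lines. This is not Khan's pipeline, only a toy sketch of the idea: `parse_header` is a hypothetical target, the findings are made up, and an uncaught Python exception stands in for a real crash.

```python
def parse_header(data: bytes) -> int:
    """Hypothetical target under test; fails on empty input."""
    return data[0]


def reproduces_crash(target, payload) -> bool:
    """Run the payload against the target and report whether it crashed."""
    try:
        target(payload)
    except Exception:
        return True
    return False


# Made-up findings: the model's confidence is recorded but never consulted.
findings = [
    {"claim": "overflow on long input", "confidence": 0.97, "payload": b"\x01ok"},
    {"claim": "crash on empty input", "confidence": 0.40, "payload": b""},
]

confirmed = [
    f["claim"] for f in findings if reproduces_crash(parse_header, f["payload"])
]
print(confirmed)
```

The finding the model was most sure of is dropped, and the one it doubted is kept, because only the second one reproduces.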

None of this is an argument for fewer safeguards. It is an argument for honest ones. The discipline that ships working software is not a longer list of checks — it is knowing which single check is actually deciding, and having the nerve to delete the twenty that were only ever decorating the result.

The uncomfortable version of the lesson is that building a good AI system increasingly means removing the cleverness you were proudest of, the moment the models outgrow it. We deleted twenty checks and the system got faster, clearer, and no less safe. We will probably delete more. It is one more piece of evidence for a thesis we keep running into: the model is the smallest part.
