Essay

The Model Is the Smallest Part

By Benjamin Taini · Founder, Bouletteproof

We built Bouletteproof on a bet: the system around the model matters far more than the model itself. Over the last few months, five independent voices arrived at the same place — none of them ours.

There is a number we keep coming back to. When we measured what actually moves the quality of AI-written software — across real production, not benchmarks — the model you choose accounts for a surprisingly small slice of the outcome. Most of it is the system around the model: how work is broken down, how context is fed in, how output is verified before it ships, how the whole thing learns from its own mistakes. Call it, roughly, the plumbing over the pump.

That is an uncomfortable claim to lead a company with, because the entire industry is organised around the pump. New model, new leaderboard, new launch. So for a long time the honest answer to "how do you know the system matters more?" was just: we measured it — across 4,600+ scored executions in our own production — and we would rather show the architecture than argue about it.

In the last few months, we stopped being the only ones saying it.

Five voices, five directions, one conclusion

An independent researcher spent eighteen months and around a hundred and forty thousand messages on the question and titled what he found The Model Wasn't the Bottleneck. The Configuration Was. Akimitsu Takeuchi's conclusion — that the working system "was built by the corrections that survived across models," not by any one model — is our thesis in a sentence we did not write.

A practitioner building an automated security pipeline put the verification half of it more bluntly: a crash, not the model, decides what counts. Fareed Khan's system ignores whatever the AI claims it found and only believes a result when the machine reproduces the failure itself. The model in the middle is almost interchangeable; the part that decides is the check.

An architect writing about how to actually direct AI builds landed on the same line from the methodology side: the harness is not the method. Kapil Viren Ahuja's whole argument is that the loop which validates the work — not the model running inside it — is where correctness comes from, and that "confidently wrong" is the failure every serious system has to design against.

Someone analysing a frontier coding stack said it in competitive terms: the harness alone can move an agent from outside the Top 30 to the Top 5, with the model unchanged. Jaroslaw Wasowski's point was that you should evaluate the whole system, not a single leaderboard number — which is exactly what we mean when we say the model is the smallest part.

And the clearest signal came from a model maker itself. When Anthropic shipped dynamic workflows into Claude Code — a way for the agent to hold its plan in a script, run adversarial checkers against its own work, and checkpoint long jobs — it moved the orchestration layer into the product. The layer everyone treated as glue became the thing worth building. That is not an opinion piece; that is a roadmap voting with its feet.

Convergence, not endorsement

We want to be precise about what this is, because the easy version would be dishonest. None of these people validated Bouletteproof. Most of them have never heard of us. They were solving their own problems and writing up what they found. That is exactly why it is worth pointing at: when five people reach the same conclusion from five unrelated directions — an alignment researcher, a security engineer, a software architect, a competitive analyst, and a frontier lab's product team — the conclusion is probably load-bearing, not a pitch.

Where we go one step further

There is one place we would argue we go further than most, and it is the part Khan and Ahuja are circling: the gate. It is easy to check whether code compiles. It is harder, and far more useful, to check whether the output actually does what was asked — pass or fail, before anything ships. A compiler tells you the code is valid. It does not tell you the code is right. Most of the value in a build system is in closing that gap, and that is where we have spent our time — including deleting twenty of our own checks once they stopped earning their place.

So this is not a victory lap. It is an invitation to a quieter, more useful way of thinking about agents. Stop asking which model is winning this month. Start asking what the system around it does when the model is confidently wrong — because it will be, and the system is the only thing that catches it.

The model is the smallest part. We built a company on that. It turns out we have company.

Related reading

We build supervised multi-agent systems for software delivery. If you are weighing build-versus-buy on agent infrastructure, tell us what you are working on.

Back to writing