The Model Is the Smallest Part
We bet the system around the model matters more than the model itself. Over the last few months, five independent practitioners and a frontier lab arrived at the same conclusion — none of them ours.
Writing
Essays and open-source releases from the team. No hot takes, no vendor pitch — just data and patterns from systems we run ourselves.
We bet the system around the model matters more than the model itself. Over the last few months, five independent practitioners and a frontier lab arrived at the same conclusion — none of them ours.
We audited 36 mechanical quality checks and found about 20 had caught nothing in a month of production. The scaffolding you build for a weak model becomes a cage for a strong one.
Lazy skill loading for agent systems. Claude Skills are useful when loaded on demand, expensive when loaded up front. context-steward is the npm package that solves the second problem without breaking the first.
Per-job quality scores in multi-agent systems almost always land around 85%. That number is telling you something different than you think it is. 4,600+ scored executions from our own production, and the pattern we found.
We ran a controlled A/B across 88 scored agent executions, with and without skill checklists in the prompt. Quality moved +0.002 — inside the noise. Why front-loaded instructions don't land, and what does.
Multi-agent delivery doesn't break on the model. It breaks on file conflicts, bad sequencing, and per-job metrics that miss the deployment. The unglamorous machinery that decides whether a sprint ships.