Writing

Practitioner notes
from production.

Essays and open-source releases from the team. No hot takes, no vendor pitch — just data and patterns from systems we run ourselves.

Essay 2026-06

The Model Is the Smallest Part

We bet the system around the model matters more than the model itself. Over the last few months, five independent practitioners and a frontier lab arrived at the same conclusion — none of them ours.

Essay 2026-06

We Deleted 20 of Our Own Quality Checks

We audited 36 mechanical quality checks and found about 20 had caught nothing in a month of production. The scaffolding you build for a weak model becomes a cage for a strong one.

Library · MIT 2026-04

context-steward

Lazy skill loading for agent systems. Claude Skills are useful when loaded on demand, expensive when loaded up front. context-steward is the npm package that solves the second problem without breaking the first.

Essay 2026-03

The 85% Accuracy Trap

Per-job quality scores in multi-agent systems almost always land around 85%. That number is telling you something different than you think it is. 4,600+ scored executions from our own production, and the pattern we found.

Essay 2026-04

The Median Trap

We ran a controlled A/B across 88 scored agent executions, with and without skill checklists in the prompt. Quality moved +0.002 — inside the noise. Why front-loaded instructions don't land, and what does.

Essay 2026-05

The Boring Parts That Ship

Multi-agent delivery doesn't break on the model. It breaks on file conflicts, bad sequencing, and per-job metrics that miss the deployment. The unglamorous machinery that decides whether a sprint ships.