The Median Trap

By Benjamin Taini · Founder, Bouletteproof

Why skill checklists don't improve agent code quality — an 88-execution controlled experiment.

When software agents fail, the instinctive reaction is to give them more instructions. We write a checklist of "skills" or "rules" — always handle null pointers, never use nested ternary operators, ensure database connections are closed — and append them to the system prompt.

We call this "skill loading." It feels like engineering. It looks like progress. But in our controlled testing, the benefit disappeared once more than a few skills were loaded into the agent's context at the same time: at five skills, code quality fell back to the baseline, and at ten it dropped well below it.

We call this phenomenon The Median Trap.

The Experiment

To measure the exact impact of skill loading on code generation, we designed a controlled experiment using Claude 3.5 Sonnet. We generated a set of 11 distinct coding tasks of moderate complexity (e.g., implementing a rate limiter, parsing a custom file format, writing a thread-safe cache).

We then ran these tasks across 88 total executions, varying only the number of "skills" (explicit code-quality rules) loaded into the system prompt:

Control Group: 0 skills loaded (pure task description).
Group A: 2 skills loaded.
Group B: 5 skills loaded.
Group C: 10 skills loaded.

Each output was evaluated by an automated test suite for functional correctness, and graded by an independent LLM-as-a-scorer for architectural elegance, adherence to constraints, and code cleanliness.
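
The design above can be laid out as a run schedule. The sketch below is illustrative only: the task ids are placeholders, and it assumes the 88 executions were split evenly across the 11 tasks and 4 conditions (which would mean 2 runs per cell). The essay does not state how the runs were distributed.

```python
from itertools import product

# 11 placeholder task ids; the essay names only three example tasks.
TASKS = [f"task_{i:02d}" for i in range(1, 12)]

# Condition name -> number of skills loaded into the system prompt (from the essay).
CONDITIONS = {"control": 0, "group_a": 2, "group_b": 5, "group_c": 10}

TOTAL_EXECUTIONS = 88


def build_schedule(tasks, conditions, total):
    """Return one dict per execution, assuming an even split across cells."""
    cells = len(tasks) * len(conditions)
    if total % cells != 0:
        raise ValueError("total executions do not divide evenly across cells")
    repeats = total // cells  # ASSUMPTION: same number of runs in every cell
    return [
        {"task": task, "condition": name, "skills": n_skills, "run": run}
        for task, (name, n_skills), run in product(
            tasks, conditions.items(), range(repeats)
        )
    ]


schedule = build_schedule(TASKS, CONDITIONS, TOTAL_EXECUTIONS)
print(len(schedule))  # 88
```

Under that even-split assumption, each condition receives 22 executions, so every group is compared on the same tasks the same number of times.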

The Results

The data revealed a clear, non-linear relationship between prompt complexity and output quality:

The Sweet Spot (1-3 Skills): Adding 2 highly relevant skills improved the architectural score by 14% compared to the control group. The model successfully incorporated the constraints without losing sight of the primary objective.
The Cliff (5+ Skills): At 5 skills, performance began to revert to the baseline. At 10 skills, the architectural score dropped 22% below the control group.
The Compliance Paradox: While Group C (10 skills) had the lowest overall code quality, it had the highest literal compliance with the checklist. The model appeared to spend so much of its attention budget avoiding the "forbidden" patterns that it wrote overly verbose, convoluted, and fragile code to satisfy the rules.
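
To put the two reported percentages on one scale, the short calculation below normalizes the control group's architectural score to 1.0. It assumes both figures are relative changes in the mean architectural score against the control group; the normalized values are not raw scores from the experiment.

```python
CONTROL = 1.00  # normalized baseline (assumption), not a raw score

group_a = CONTROL * (1 + 0.14)  # 2 skills: +14% vs. control
group_c = CONTROL * (1 - 0.22)  # 10 skills: -22% vs. control

# Gap between the best and worst conditions, expressed both ways.
a_over_c = group_a / group_c - 1   # how much higher Group A scores than Group C
c_below_a = 1 - group_c / group_a  # how much lower Group C scores than Group A

print(f"Group A vs. Group C: {a_over_c:.0%} higher")   # 46% higher
print(f"Group C vs. Group A: {c_below_a:.0%} lower")   # 32% lower
```

Read this way, going from 2 skills to 10 costs roughly a third of the architectural score that the 2-skill configuration achieved.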

Why the Median Trap Happens

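Large language models do not process instructions the way a compiler processes code. Rules in a prompt are not executed one by one; each is simply more context that competes with the task description for the model's attention.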

When you load 10 different rules into a prompt, you are forcing the model to distribute its attention across 10 different dimensions of constraint. Because the model's capacity for reasoning per token is finite, it is forced to find the "median" path of least resistance.

Instead of writing the most elegant solution for the specific problem, it writes a generic, defensive solution that guarantees none of the 10 rules are violated. It optimizes for non-violation rather than excellence.

Escaping the Trap: Lazy Skill Loading

The solution is not to abandon code-quality rules. The solution is to change how they are loaded.

Instead of statically loading every skill your agent might ever need, you must load them dynamically and lazily. If a task does not involve database operations, the agent should not have database-connection rules in its context. If a task is a simple utility script, it should not be burdened with enterprise-grade logging constraints.

By keeping the active skill count at three or fewer at any given moment, you preserve the model's attention budget for the actual problem-solving task.

This is why we built and open-sourced context-steward. It acts as an automated gatekeeper, analyzing the current task context and injecting only the most relevant skills on demand, so your agent's attention stays on the problem at hand.
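
A minimal sketch of the lazy-loading idea follows. It is not context-steward's implementation or API: the `Skill` class, the keyword triggers, the `select_skills` and `build_system_prompt` helpers, and the cap of three are all hypothetical, and simple keyword overlap stands in for whatever relevance analysis a real gatekeeper would use.

```python
import re
from dataclasses import dataclass

MAX_ACTIVE_SKILLS = 3  # hypothetical cap, following the 1-3 skill sweet spot


@dataclass(frozen=True)
class Skill:
    name: str
    rule: str
    triggers: frozenset  # hypothetical keywords that make the rule relevant


# Example registry; the rules echo the checklist items mentioned in the essay.
SKILLS = [
    Skill("close-db-connections", "Ensure database connections are closed.",
          frozenset({"database", "sql", "connection"})),
    Skill("null-safety", "Always handle null values.",
          frozenset({"parse", "parsing", "input", "optional"})),
    Skill("no-nested-ternaries", "Never use nested ternary operators.",
          frozenset({"conditional", "branching"})),
    Skill("structured-logging", "Use structured, leveled logging.",
          frozenset({"service", "logging", "production"})),
]


def select_skills(task, skills=SKILLS, limit=MAX_ACTIVE_SKILLS):
    """Return at most `limit` skills whose triggers appear in the task text."""
    words = set(re.findall(r"[a-z]+", task.lower()))
    scored = [(len(skill.triggers & words), skill) for skill in skills]
    relevant = [pair for pair in scored if pair[0] > 0]  # drop irrelevant skills
    relevant.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep registry order
    return [skill for _, skill in relevant[:limit]]


def build_system_prompt(task, base="You are a coding agent."):
    """Append only the selected rules; a task with no matches gets no rules."""
    selected = select_skills(task)
    if not selected:
        return base
    rules = "\n".join(f"- {skill.rule}" for skill in selected)
    return f"{base}\n\nRules for this task:\n{rules}"


print(build_system_prompt("Parse user input and store it in a SQL database"))
```

With this registry, a task such as "Write a small utility script that renames files" matches no triggers, so the prompt carries no rules at all, while the database task above receives only the connection and null-handling rules.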

Want to implement lazy skill loading in your own agent systems?

context-steward on GitHub