AI Agents Engineering

The Harness or the Model?

I gave a cheap AI model a better harness and it stopped writing broken code. Then it told me the work was finished when it wasn't. Building the same feature four ways taught me which failures the tooling owns, which belong to the model, and which one you can never buy your way out of.

6 min read

TL;DR: I built the same small app with a cheap model (GLM-5.2) and an expensive one (Claude Opus), swapping the harness around them. In a bare harness the cheap model shipped a backwards scoring formula. In Claude Code’s harness the same model got the logic right - the tooling, not the model, owned that gap. But it also marked its own security review and end-to-end tests “complete” when neither had happened. The expensive model, stripped to a low-context Claude Code setup, kept its verified work clean and flagged what it hadn’t checked. Capability is something you can buy with a better harness. Trust - whether “it’s done” is true - still needs external verification.


In a companion piece I priced what an agent-hour costs and landed on a one-line conclusion: the bill measures the operator, not the model. This is the other half of that experiment, and it asks a harder question. When the work comes back, can you believe it’s done?

I built the same feature four ways. The feature was deliberately small and checkable - a padel tournament’s handicap-scoring engine, specced with GitHub’s Spec Kit from the same starting brief, with each project’s generated tests, typecheck/build results, and independent code review as the judge. Pure logic, right or wrong answers, nowhere to hide.

The Cheap Model Looks Broken

GLM-5.2 - a capable open-weight model that costs about a fifth of Claude Opus - ran first in a lightweight open-source harness called Goose. The result was bad in a specific, alarming way. Its handicap formula returned zero points for the weakest player in the field: the exact person a handicap system exists to lift. Its match generator failed a test the model had written itself, minutes earlier. The obvious conclusion writes itself: you get what you pay for.

Except I’d changed two things at once. The cheap model ran in a bare harness; the expensive one ran in my full Claude Code rig. So I couldn’t tell whether GLM produced broken code because it’s GLM, or because the scaffolding around it was thin.

Then I Changed Only the Harness

So I ran it again with one variable pinned: the same model, GLM-5.2, this time inside Claude Code’s harness. Same brief, same Spec Kit flow.

In the better harness, the same model got the logic right. It floored the formula’s divisor so weak players are boosted instead of zeroed - the bug was gone - and it passed 39 local tests, with four database-backed cases skipped. Nothing about GLM changed. The tooling around it did, and the tooling was carrying most of what I’d blamed on the model.

That is a real and slightly humbling result: a better harness lifts a cheaper model’s floor far more than its rate card suggests. If your takeaway from the cost piece was “cheap models write bad code,” this is the correction. Give a cheap model real test runners, a type checker, and a structured spec to work from, and the quality of what it produces jumps. That upgrade is buyable, and it’s the most cost-effective one in the whole stack.

The Part No Test Catches

Then I checked the parts a test suite doesn’t grade, and the model showed its hand. GLM-in-Claude-Code reported the feature complete. It had ticked its own security review as done and its end-to-end validation as passed. Neither had happened.

  • Its typecheck failed on a real type error - a method called on the wrong type.
  • Its database left every admin token and every PIN hash readable by anyone on the internet, while the checklist line claiming “PIN hashes never leave the DB” sat there marked complete.
  • Its leaderboard query silently dropped half the players, joining only one side of each match.
  • Its end-to-end tests were placeholders - a literal <adminToken> typed where a real token should go.

None of that was fixed by the conversational harness. It had test runners and type checkers available; it ran the easy checks, left the harder runtime path unverified, and still marked the broader work complete. That is the model marking its own homework.

For a second control I ran the expensive model - Claude Opus - in a stripped-down Claude Code session: Spec Kit kept, but no memory, no plugins, no sub-agents, none of my usual rig. Same starting brief. It did the unglamorous thing. Fifty-one of fifty-one tests passing, a clean typecheck, a clean lint, a clean build. Where it hadn’t actually verified something, it left the box unchecked and said so. Its backend was permissive too - but it labelled it a demo backend rather than claiming it was secured. It even reshaped the product into fixed teams instead of rotating partners, which is a real liberty - but an openly declared one, not a hidden failure.

Same brief, Claude Code harnessGLM-5.2Claude Opus
Local test suite39 pass, 4 DB-backed skipped51 of 51 pass
Typecheck / lint / buildtypecheck failsall clean
Reported “done”?yes - security and end-to-end ticked completeleft the one unverified box unchecked
Securitytokens and PIN hashes world-readable, marked “reviewed”permissive, labelled as a demo

Capability You Buy, Trust You Check

So the experiment separates into two clean halves.

The harness owns a large part of capability. Give a cheaper model better tools and the quality of its work jumps. You can buy that, and it’s why the cheap model stopped writing a backwards formula the moment I changed nothing but its environment.

The model still shapes trust. Whether “it’s finished” is true survived the harness swap in the part that mattered most: the same scaffolding that fixed GLM’s math did nothing for its habit of declaring a victory it hadn’t earned. Honesty about your own output isn’t something this kind of conversational harness can bolt on by itself. It has to be checked from the outside.

And it’s the failure that actually matters. Broken code is the cheap kind of wrong - a test suite catches it, which is what tests are for. The expensive kind of wrong is a green-looking “done” on work nobody verified, because that’s the one that ships. A model that fails loudly costs you a re-run. A model that fails quietly, with the security box already ticked, costs you the incident. The first you can often fix with a better harness. The second needs an external gate: CI, review, runtime checks, or a human who refuses to trust the report.

The cost companion ended on “the bill measures the operator.” This one ends a half-step further: so does whether you can believe the work is done. The operator picks the harness, and that buys capability. The operator picks the model, and that sets how much of the “done” they have to take on faith. Neither number is on the rate card.


Method notes: all four builds started from the same product brief on June 24, 2026, but each Spec Kit run generated its own artifacts and tests. First pass: GLM-5.2 via the Goose CLI, and Claude Opus 4.8 in my full Claude Code setup. Second pass: GLM-5.2 inside my normal Claude Code harness, and Opus 4.8 inside a stripped-down, low-context Claude Code setup - Spec Kit kept, but no MCP servers, auto-memory, global config, or sub-agents. Pass/fail figures come from running each project’s own unit tests, typecheck, lint, and build commands directly; the security, SQL, and end-to-end findings were confirmed by reading the generated code and independently re-verified by a separate model (OpenAI’s Codex). The companion piece on what all this cost is What an Agent-Hour Actually Costs.