The Seven-Layer Agent Audit

Your agent is starved on one layer of seven. It is rarely the harness everyone argues about.

Jun 19, 2026

Article voiceover

0:00

-22:30

Your agent failed again, and your hand found the model dropdown before you had finished reading the transcript. You told yourself the next model up would fix it. It did not, and the failure came back wearing better prose. The dropdown is a comfortable place to put the blame, because the model is the one part of your agent that is public, ranked, and argued about. Everything else is private, unglamorous, and yours. So you upgrade the layer you can see and leave the one that is actually failing untouched. You have been debugging the layer easiest to talk about, not the one quietly costing you trust.

I have built inside this discipline for years and I still catch my own hand doing it. The reflex has a more sophisticated form too. The builders who would never just click the dropdown reach instead for a thicker harness, a bigger context window, the framework everyone is posting about. It is the same move every time: spend on the part you can see so you do not have to diagnose the part you cannot. Call it the comforting false fix, and most of agent engineering is some version of it.

What makes the reflex so hard to drop is that the visible layer sometimes is the answer. In 2024 a Princeton team took a model everyone already had, pointed it at the hardest software benchmark of the day, and resolved 12.5% of SWE-bench tasks.¹ The best prior approach, one that could retrieve context but could not act, had managed 3.8%. They shipped no new model. They rebuilt the interface the agent worked through: what it could see at once, how it edited files, what it heard back when a command failed. The number more than tripled on the strength of the scaffolding alone. It worked because, that time, the scaffolding was the starved layer. Spend the same effort on a layer that is already fed and you have nothing to show for it.

That result founded a discipline, and two years on the discipline is at war with itself over what to do with the thing it found. One camp says the scaffolding is the moat: build the harness, own your control flow, and the model becomes a component you swap underneath it. The other, the view from inside OpenAI’s Codex team argued on the Dev Interrupted podcast, says the scaffolding is coping: rip it out, let the model carry the load, and every line of harness you wrote is debt that dissolves at the next release. Read both and you will be told, with equal confidence and real evidence, to build more harness and to build less.

Both camps are right. They are describing different repos.

The harness is everything around the model: the tools it can call, the files it can touch, the feedback it gets back, and the rules that decide when its work is accepted. The whole war is over how much of it to build. That broad sense of the word is where the war hides its mistake, because it bundles half a dozen different jobs into one, and each camp has taken whichever one was starved in its own repos and mistaken it for the law of every agent. Tell a team whose harness is the missing piece to own its control flow and the harness looks like the moat. Tell a team whose harness a frontier model already covers to rip it out and the same harness looks like debt. Both read their own repo correctly, then generalised it to yours.

The harness war is a fight about where reliability lives, waged mostly before anyone measures where their own is leaking.

Thicken the harness and thin the harness are the same reflex one floor up, a guess about where the failure lives made before anyone measured. The guess is usually wrong, because the failures that cost you hide in layers that neither the dropdown nor the harness argument ever names. Before you can take a side, you need the thing nobody in the fight is offering: a way to find which layer of your own agent is starved.

Three failures that look identical

Watch an agent fail for a month and every incident blurs into one complaint: it said something wrong. Sit with the transcripts longer and the complaint splits into species with different mechanisms.

The first species repeats itself. On Monday you tell it the service deploys on Fly, not Vercel, and it adjusts at once, gracefully. On Thursday it opens a fresh plan with assuming a standard Vercel deploy, polite and certain, the Monday correction nowhere in it. The work inside any one session can be flawless. Across sessions the agent is a goldfish with a good vocabulary, meeting the same problem new each morning.

The second species ships with confidence. It hands back a clean table, every cell aligned, every source linked, and one figure reads 2.4 where the filing says 4.2, two digits transposed and certain. No one re-derives it. By the time anyone notices, it is in the board deck, the pricing sheet, the migration script. The agent did its job. Between generating that number and accepting it, no checkpoint stood, human or machine, that might have caught it.

The third species reports from a world that changed underneath it. You ask how to stream a response; it writes a clean snippet calling a method the SDK renamed two releases ago, in the exact cadence of the documentation it learned from. The fluency is total. The substrate underneath has gone stale, and the model fills the gap with the one thing it can always produce, plausibility.

Three species, one surface symptom. And one repair gets reached for across all three: a bigger model. The bigger model usually re-assumes Monday’s correction with more eloquence, ships the uncaught error with better formatting, and extrapolates from the stale substrate with more confidence. Money was spent. The mechanism producing the failure was never touched.

Where the mechanisms live

A month of transcripts teaches you the species. Years of debugging my own agents and reading other people’s taught me the geography. I debug through seven layers now, in a fixed order, starting from the one whose damage reaches furthest. One question per layer, and the failure signature you see when that layer is starved.

Most teams cannot answer these seven for their own agent. They can recite the model, the framework, the vector store, the latest eval score. Ask which layer is actually losing them trust and the room goes quiet, because that answer sits on no dashboard. It is in the order the layers fail.

I treat these seven as a stack with a direction of damage, a lens I debug by rather than a law I have proven, most reliable at the bottom. They are not a strict dependency chain: an agent can have clean data and broken memory, or the reverse. But a failure low in the stack, in the data an agent reasons from or the checks on its output, tends to corrupt everything that runs on top of it, while most failures higher up stay where they are. A stale fact propagates into memory, skills, and answer alike. An unverified output ships no matter how good the layers above it are. Purpose, at the very top, is the clean exception that marks this as a tendency and not a rule: get the job wrong and every layer beneath inherits the mistake.

That is why I repair in the opposite order to the way I design. You build from the top of the stack down, naming the job at Purpose, then binding the harness, then packaging the skills. You debug from the bottom up, starting at Data, the facts everything else reasons from, because the lower a starved layer sits, the further its fault has already spread through everything above it. Design outside-in, debug inside-out. The dropdown reflex fails because it debugs at the design end of the stack, the top; the harness reflex fails the same way, one rung lower.

The harness is layer 6, one floor of seven. Harness engineering drives a real share of the gap between agent products, and it is still one layer; the skills layer just below makes the same point from the other side, where a copied skill is a behavioural dependency and curated skills tend to lift task success more than the ones a model writes for itself.2

So the thicken-or-thin question has an answer, and it turns on which layer your own repo is starving, the variable each camp read correctly at home and then generalised too far. When your starved layer sits below the harness, at verification or data, thickening the harness builds a better second storey over a cracked foundation, and thinning it at least stops you reinforcing a floor that was already sound. When the harness itself is starved, the moat camp is right, and a model left to carry the load alone tends to return confident work you struggle to reproduce.

The anti-harness camp has the stronger long-run argument: the frontier model keeps absorbing the failure modes your harness was patching, so every layer you hand-build is debt with a short half-life. They are right about the trajectory, and the trajectory is the case for owning the audit rather than the harness. As the model improves, the starved layer moves, and what holds its value across releases is not the layer you built but the instrument that finds where the constraint went. Do not own the harness; own the thing that tells you when the harness stopped mattering. My bet is not that the harness is unimportant. It is that the fight happens a floor too high, with teams arguing over it before they have proved that is where their own reliability leaks.

The layer quietly capping your agent is rarely the one with the famous name or the loudest debate.

The seven are overlapping lenses rather than a clean taxonomy. An acceptance rule is part harness and part verification; a persistent store is part memory and part orchestration. And they map where an agent’s reliability leaks, which is a different question from where it should be careful: the guardrail layer you keep deliberately thick falls outside this audit and should stay thick. They give you seven places to look, in the order the damage travels, not a partition of your system.

What the audit turns up

Point the seven questions at two different agents and they tend to land on two different floors. That is the test that the instrument is reading the repo and not your assumptions: a checklist that always blamed the same layer would just be that layer’s advocate.

A research agent that quotes prices, versions, and policies usually breaks at data. The retrieval is stale, the checks above it are sound, and the failure is a confident answer drawn from a world that moved. Reach for a bigger model and it delivers the out-of-date figure with more poise. The starved floor is lower than anyone was looking.

An agent that turns out clean, well-formed prose or code usually breaks one floor up, at verification, the row easiest to mark green by pointing at a passing test suite. Look at what those tests check. Most confirm the output is well formed, parseable, the right shape, and almost none test whether it is right. That is a format check wearing a verification badge. The trap is worse for an agent that rewrites its own behaviour, which can pass every check while drifting from the behaviour those checks were meant to protect, the verification problem that time alone solves. A green test suite can sit on a starved verification layer, and it is sometimes the most expensive way to stay blind.

The same seven questions land on different floors for different repos. That is the difference between an instrument and a hunch.

I wrote a script to run the seven questions for me, then threw it out, for the reason this whole piece is about. A script reads your file names, not your setup, so it returns a confident verdict on any stack it does not recognise. Point it at a well-built agent on a framework it never learned and it will report, with total composure, that most of its layers are missing, because they live in code it cannot read. That is the second failure species, shipped as a tool. The thinking crosses from one stack to the next; the automation does not. So I kept the questions and dropped the script.

The afternoon audit

The audit runs on any stack today, by hand, in an afternoon, though the afternoon buys the diagnosis, not the repair. Finding the starved row is fast; rebuilding a starved data or verification layer is real engineering, and diagnosing first is how you spend that effort on the right floor. Running it by hand finds the leak once; at scale you turn the same seven questions into what you instrument, the traces and checks that surface a starving layer without you reading every transcript.

Take the seven questions in debug order, bottom to top, and for each one write the evidence in your repo that answers it: the file path, the asserted comparison, the memory store’s last write. None of it needs to be a file; on a raw SDK loop, memory is whatever you persist between calls and the question is only when it was last written. The form does not matter. Where you catch yourself writing a sentence about how you sort of handle that layer, instead of pointing at where it lives, you have found a starved row.

Take the data row as the worked example. The question is what reality the agent reasons from, so find where its facts enter: the retrieval call, the assumptions written into the system prompt, the document set you handed it. Then check one claim against its source. When the agent quotes a price, a version, a policy, can you name when that fact was last refreshed, and does it still match the world? A fed row has a provenance you can point at and a staleness you can bound. A starved one is a confident answer with no timestamp behind it. The move when it comes back red is to fix the source before you touch the gate above it, because a verification check on a stale fact only certifies the wrong answer faster.

Most of the seven clear in a few minutes, the rows you already trust, and running all of them is what earns you the right to drop to the one or two that bite instead of guessing. The starved row is the one you have been compensating for by hand without naming. The two floors I find starved most often are data and verification, the layers with no dropdown and no debate to hide inside. The build-the-harness camp, by its own diagnosis, is short a few floors up; different repos, different floors, which is the whole point. Which layer is in fashion turns over, memory last year, context engineering now,3 but the audit is what tells you which one is yours.

The seven fit on one page you can print, a row each with space to point at your own evidence, a fed-or-starved mark, and a line at the foot for the binding layer you land on.

The Seven-Layer Agent Audit is that page, free to download and keep by your desk for the next time an agent fails.

Seven Layer Audit Scorecard

799KB ∙ PDF file

Download

So before you upgrade the model, before you rebuild the harness, run the seven. The row that comes back red is the one already costing you trust, and naming it stops the wasted motion: you quit arguing with the model, quit rewriting prompts that were never the problem, quit adding memory to a layer whose facts were stale to begin with. The answer to thicken-or-thin is sitting in your own repo.

My bet: the row that comes back starved is data or verification, not the harness everyone is fighting about. Run the seven and tell me I am wrong. The one that turns up most often is the one I take apart next.

John Yang, Carlos E. Jimenez et al., “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering,” NeurIPS 2024. arXiv:2405.15793. SWE-agent (GPT-4 Turbo) resolved 12.5% of SWE-bench at pass@1; the same paper reports the previous best as 3.8%, “achieved by a non-interactive, retrieval-augmented system.” The benchmark itself was introduced by Jimenez et al., arXiv:2310.06770. Scores have climbed since, carried by better models and better interfaces both.

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, arXiv:2602.12670. Across dozens of tasks, human-curated skills raised the average pass rate by roughly 16 points, while skills the model generated for itself produced no average benefit, the paper’s evidence that models cannot reliably author the procedural knowledge they benefit from consuming. (Reported task counts and the exact gain shift slightly between versions of the paper; the directional result is stable.)

Nelson F. Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” TACL. arXiv:2307.03172. The finding behind the row-3 signature and much of the context-engineering wave: models reliably lose information placed in the middle of long contexts, which is why a bigger window substitutes for neither memory nor orchestration.

The Durability Curve

Discussion about this post

Ready for more?