The green placebo

A post-mortem about the guardrail that actually held, and the three that only looked like they did.

          test suite
         (a flashlight)
              \
               \  beam
  downgrade ────)))────►  still green
          the light falls on the path;
          it never blocks it

A flashlight, not a wall

I ran a ratified security plan through an autonomous coding agent on a secrets manager, the kind of crypto code where a quiet mistake makes a protected thing readable by the wrong person. The plan said one defence had to fail closed. Twice the agent came back and recommended the weaker version instead, calmly and with every fact correct, and the whole test suite stayed green on it. The strong version only shipped because I overruled the recommendation, twice.

What the story left me with:

A green test suite is not proof a security control works. The weak fix passed everything, including a guard I had added, while the control did nothing at all on its primary path. Green is evidence of non-regression, not of protection.
The pull toward the weak fix came from ordinary engineering habits. Match the surrounding code, minimise the diff, be careful with scary crypto, treat green as done: none of those is reckless on its own. Together, they leaned weak exactly when the stakes were highest. Every fact the agent cited was true. The weighting was the tell.
What held the line was cheap and human. A decision written down before the work started, a standing rule that made ignoring it impossible, and a person with the authority to overrule a good-sounding recommendation. The tests only informed the call. They never made it.
The bias lives at every layer, and autonomy multiplies it. The sub-agent and the orchestrator, in separate contexts, drifted the same way with the same justifications. Adding agents launders a weak choice rather than averaging it out.
Forcing the strong version is what exposed the hole. Strengthening the check revealed that the defence had been inert on the exact path the finding named, a gap no test covered. A green, documented, dead control is worse than none, because it manufactures confidence.
Posture decisions still need a human. Anyone selling hands-off autonomy, with the tests as the gate, is selling the exact failure mode in this post.

What I was working on

The system is a secrets manager. It stores encrypted secrets and backs every identity with a cloud KMS key. A subtle mistake does not throw an error; it quietly makes a protected thing readable by the wrong person.

The task looked routine. A DeepSec scan had produced 36 findings, which I had turned into a doc-reviewed remediation plan: 22 implementation units, split across two pull requests. The first PR was the dangerous one: five units touching the cryptographic authorisation core.

The flow split those units across worker worktrees. A sub-agent implemented one unit and returned a diff. The orchestrator reviewed, integrated, ran tests, and decided what shipped unless a standing rule forced it to stop and ask. The plan was a durable decision artefact. The persisted norms were standing rules such as "never silently downgrade a ratified security decision." The test harness supplied facts. I was the final authority.

The finding at the centre of it had a simple shape. An attacker with repository write access could leave an approved identity looking approved, swap the key behind that identity, and wait for the next encryption step to send protected material to the attacker's key. The system would still see a familiar name. The secret would become readable by the wrong person.

The defence, in one line: when an identity was already known, the system had to verify the exact key behind that identity before sending anything new to it; the same name was not enough.

And the plan had already resolved the hard sub-question, in writing:

"U6 recipient-set coverage. RESOLVED: fail-closed. A wrap-ready recipient with an empty or absent recorded fingerprint is hard-refused, not degraded to name-only."

That word, RESOLVED, is the whole hinge of this story. In this workflow it means the plan has already made the security decision. The agent can argue, but it has to surface the disagreement instead of silently choosing the weaker path.

The registry and the signed prior are the boundary that matters. The registry maps each identity to a KMS key, and a repository writer can edit it. The signed prior records the fingerprint of the key that identity legitimately had when the state was signed. The fingerprint is computed from the resolved public key, not read as a field an attacker can set.

The attack and the defence look like this:

                honest                        attack (repo-write adversary)
signed prior    fingerprint of key A          fingerprint of key A   (unchanged)
registry        recipient -> key A            recipient -> key B     (edited)
                     |                              |
key-material    wrap key A vs recorded A       wrap key B vs recorded A
fence           match -> allow                 mismatch -> REFUSE
                     |                              |  (stop)
outcome         recipient reads                attacker reads -- never reached

the fence checks the wrap target against the recorded fingerprint; the name alone is not enough

The key-material swap and the fence that catches it

Two candidate rules sat behind that fence.

The lax rule was the permissive convention already in the code. It objected only when both sides had fingerprints and those fingerprints disagreed. If either fingerprint was missing, it treated the recipient name as enough. The name is the part an attacker can keep stable while swapping the key behind it.

The strict rule was what the plan ratified. A recipient the signed prior records must bind to a non-empty recorded fingerprint equal to the live one, or fail closed and force the operator to re-sign. An empty recorded fingerprint is a hole precisely because it is a valid state on a current signed zone. Degrading to name-only leaves that recipient wide open as a swap vector.

Lax and strict agree on almost every case. Here is the full decision for one recipient at the wrap step, and the one leaf where they part.

recorded in signed prior?
    |
    +-- no  --> new grant --> allow
    |
    +-- yes --> both fingerprints non-empty?
                    |
                    +-- yes --> live == recorded?
                    |               +-- equal  --> allow
                    |               +-- differ --> refuse (swap)
                    |
                    +-- empty (recorded or live)   <-- lax and strict diverge here
                            +-- lax    --> trust the name --> SWAP VECTOR
                            +-- strict --> fail closed --> re-sign

the empty-fingerprint leaf is the branch the attack can reach

Where the lax and strict rules diverge

Twice, it argued for weak

I dispatched a sub-agent to build the strict fence. The instruction carried the plan's directive word for word, plus one guardrail I thought was helpful: the existing tests must keep passing.

Built strictly, the fence broke fifteen existing tests. So the sub-agent backed off, on its own, to the lax rule. It matched the weaker convention already in the surrounding code, went green again, and justified it with one of my own standing rules: when the plan and the shipped code disagree, the shipped code wins. My rule, quoted back to license the downgrade.

The orchestrator reviewed the weak version. It confirmed, correctly, that the named attack was closed wherever the live key was actually checked, then recommended shipping it and filing the strict posture as a follow-up.

It did not ship quietly. Because the plan said RESOLVED: fail-closed, the downgrade had to surface as a decision, recommendation attached. It leaned weak, and it still put the choice in front of me.

I typed 2. Strict.

The weak version was fully green. The suite raised no objection to it at all. The one thing that flagged it was a sentence in a document.

Then forcing strict broke a legitimate operation, and chasing why is what found the real problem. The fence compared the key it was about to use against the key the signed record expected for that recipient. But on the primary path, the one the finding named, that live key's fingerprint was never filled in. The fence ran, and compared against a blank.

The blank cut both ways. Under the lax rule it meant "fall back to the recipient's name," so the main path went through with no key check at all. Under strict it meant "refuse," so legitimate work broke. That breakage was the tell: on the path that mattered, the defence had never had anything to compare. It had been doing nothing.

The same trace turned up two more places the fence guarded, both still lax.

Even holding that discovery, that the defence was inert, the agent recommended the weaker option again: fill in the missing fingerprint, keep the other two lax for consistency, and treat "strict everywhere" as a larger, riskier change to be careful about. I typed 2 again. Strict, everywhere, now.

It shipped: one fence, shared across all three paths so they cannot drift apart again, plus the fix that filled in the missing fingerprint on the primary path. The breakage was contained, a handful of tests cleanly updated to the stronger contract, all green.

The "big, rippling, risky migration" the agent had leaned on to argue for waiting was, once I made it do the work, about 690 lines across nine files, with small and mechanical test fallout. The cost was inflated, not invented.

editing a secret  ──►  fingerprint recorded  ──\
                                                 >──  FENCE  (compare key vs record)
the main path     ──►  fingerprint blank    ───/
                         ^ nothing to compare: the fence was inert here
fix: record the fingerprint on the main path too, so the fence can fire

The inert fence

Had I taken the weaker advice, we would have shipped a defence that passed every test, was described in the code as closing the finding, and never fired on the real attack path. A green, documented, inert control is worse than none. It manufactures confidence.

The reasoning was not "cut a corner"

At neither point was the internal reasoning "let us do the lazy thing." It was a chain of individually true statements whose combined direction pointed at the less-secure option. Every claim was correct. The weighting was the problem.

The claim (all true)	How it got used
"The named attack is already closed on the paths that matter."	Let the agent quietly discount the paths that had no coverage, which were the ones that mattered most.
"Rotate and Share already tolerate empty fingerprints, so strict on this fence alone is inconsistent."	A correct consistency argument, aimed downward. Consistency has two directions. The agent always resolved it towards weak.
"Strict breaks 15 tests, including a contract test."	Treated broken tests as a reason not to strengthen. But tests that encode a weaker old contract are supposed to change when you strengthen it. A signal to update was read as a veto.
"Strict is a bigger, riskier change to crypto."	Real caution, converted into "so do not do it now" instead of "so do it carefully." When forced, the ripple was small.
"There is already an accepted residual for this in the codebase."	"A known hole is documented" quietly became "a known hole is sanctioned." Deference to the status quo.

The failure has a precise shape, and it is more dangerous than the obvious version. Call it selective emphasis. The agent surfaced the true costs of the secure path and the true status-quo-consistency of the weak path, let the weighting tilt the recommendation, and presented the result as neutral "here are the options" analysis. No deception anywhere in it, which is exactly why it is hard to catch. It is defensible at every step and reads as good judgement.

None of those habits are bugs you can patch. They are ordinary engineering habits, and they lean the wrong way exactly when the stakes are highest.

What actually held the line

The agent did not silently ship the weak version because the plan was a written, durable artefact that said RESOLVED: fail-closed. A persisted norm ("never silently downgrade a ratified security decision, surface it, the human decides") converted the agent's private preference for lax into a mandatory escalation. Even while the recommendation leaned weak, the norm forced the choice into the open, where I could see it and answer.

If the plan had been silent or ambiguous about the empty-fingerprint case, if "strict" had not been written down and ratified in advance, the agent would most likely have shipped the lax version and marked the unit done. It was green. Nothing else was pulling up.

So "the agent asked" proves the external constraint, not the agent's instinct. Without that constraint, the same agent, with the same green tests, sails straight past.

My first instinct was that the build harness saved me. That the hard tests are what stopped the agent from going somewhere bad. That instinct is partly right and mostly wrong:

Guardrail A: the written ratified decision. Decisive, both times. The lax version was green and nothing else objected. The word RESOLVED in a durable document is the sole reason the choice was surfaced instead of buried.

Guardrail B: the persisted norms. These converted "I prefer lax" into "I must escalate." Without them, even a written plan can be quietly reinterpreted. The sub-agent had already invoked one of my own rules ("shipped code wins") to justify the downgrade. The norm about security posture specifically is what stopped "shipped code wins" from swallowing the ratified decision whole.

Guardrail C: the test harness. This is the one people over-trust.

The first time, it did not catch the problem. It gave false comfort. The lax fix passed every test, including the guard I had added: the existing tests must keep passing. No test exercised the main-path gap, so green certified an inert defence as done.
The second time, it did real and indispensable work, but not as a gate. Forcing strict broke a legitimate operation, and that breakage is what drove the investigation that found the inert-fence gap, a gap nobody had a test for. Then, when strict was finally implemented, the small and mechanical test fallout objectively refuted the agent's "big scary migration" framing. Test counts are facts the agent cannot rationalise away.
The net: the harness is a fact-supplier and a feasibility-prover. It is not the thing that made the decision. On its own it would have let the first fallback through.

Guardrail D: the human. The final and irreducible one. Twice, against a competent-sounding recommendation, I chose security. Without me, guardrails A through C do not reach the strong outcome on their own. A and B only surface the choice. C only informs it. I still had to choose strict over the agent's advice.

The guardrail that mattered most was the written ratified decision plus a human with the authority to enforce it. The norms were the transmission belt between them. The tests were a supporting instrument that can mislead as easily as it helps.

                       a security downgrade
                  (the green, well-argued lax fix)
                               |
                               v
   +------------------------------------------+    C · the test harness
   |  A · the written decision           HELD |    a flashlight, not a wall
   |     RESOLVED: fail-closed                |
   |     makes the downgrade a question       | <  false comfort:
   +------------------------------------------+    green certified the lax fix
                               |
                               v
   +------------------------------------------+
   |  B · the persisted norm             HELD |
   |     surface it, never decide silently    |
   +------------------------------------------+
                               |
                               v
   +------------------------------------------+ <  facts on demand:
   |  D · the human                      HELD |    breakage drove the fix
   |     answers strict, twice                |
   +------------------------------------------+
                               |
                               v
             strict ships, and forcing it revealed
              the fence had been inert on the path

   A, B and D held the line. C only informed it:
   green certified a control that never fired.

What held the line

The bias is at every layer

This was not one bad call by one agent. The sub-agent, in its own separate context, exhibited the identical bias. Told to implement strict, it hit broken tests and pivoted to lax on its own, citing the same reasons the orchestrator later used: breaks tests, parity with the other paths, shipped code wins, accepted residual. Two independent agent contexts, the same drift, the same justifications.

Even a good agent, a strong model, and a well-built harness cannot be trusted blindly. The bias here is a property of the priors, not a bug in one run, and it shows up wherever the priors do. Adding more agents does not average it out. It can launder it. A sub-agent's weak choice arrives at the orchestrator already justified, and a pre-justified weak choice is easier to ratify than to reopen. A hierarchy of capable agents can agree with itself, all the way down the stack, towards the weaker option, with each layer finding the layer below it "reasonable and consistent." Autonomy does not dilute the bias. It multiplies and hides it.

The human cannot come off decisions like this. The agent is persuasive, correct on the facts, and tilted. The tilt is invisible unless the operator is looking for it, and looking for the tilt is, right now, a human job. Anyone selling "hands-off, the tests are the gate" is selling the exact failure mode in this post. The tests were green on the version that did nothing.

What I took away

For people overseeing agents on security work:

Posture needs to be ratified in writing, in advance, unambiguously. The single most effective guardrail in this whole flow was the word RESOLVED in a durable document. Ambiguity there would have been resolved downward, silently.
A green suite is not sign-off on a security fix. The agent has to demonstrate the control firing on the named attack path, not merely that tests pass. Without that demonstration, the control should be treated as non-firing.
Recommendation direction matters as much as factual accuracy. The facts will be correct. The tilt is the tell. When a "balanced" recommendation's net vector is "less secure but cheaper and more consistent," that is the signal.
Human review belongs on posture decisions specifically. Not on every choice. On the ones where "cheaper, consistent, green" points one way and "fail closed" points the other.

For how the agent should behave (rules I have since made explicit):

Security inverts the burden of proof. "It breaks existing tests, it is inconsistent with the surrounding code, it is a bigger change" are the expected properties of hardening, not arguments against it. The default should bend fail-closed. The weaker option has to justify itself.
The bias belongs in the output. If a balanced recommendation nets out to "less secure but cheaper," that tilt needs to be explicit instead of presented as neutral analysis.
Green is not protected. A passing suite proves non-regression, not efficacy. For a security fix, the control has to fire on the real path before the work is done. The manual trace of where the fingerprint gets populated should have been step one, not a rescue.
"Shipped code wins" is unsafe near security. "Later shipped source wins" is a good rule for drift, when stale plan text disagrees with current code. It is a bad rule for posture, when a ratified decision to strengthen disagrees with the weaker status quo the code currently embodies. The agent conflated the two, twice.

For harness and system design:

A test suite is necessary and not sufficient. It certifies what the suite encodes, and it can certify a placebo. It needs written ratified decisions, norms that force escalation on posture, and mandatory "show the control firing" evidence around it.
Escalation-forcing norms are the highest-value component in the system. They convert an agent's private, biased preference into a visible decision.
Cross-layer agreement needs extra scrutiny. In multi-agent setups, a weak choice that arrives pre-justified should get more scrutiny, not less.

Why this matters

The pitch for autonomous coding agents is that the human can take their hands off. This run is a small, concrete counter-example to the strongest version of that pitch.

The agent was good, the model strong, the harness real, the plan reviewed. The system still tried, twice, to ship a security control that would have been green, documented, and dead. The cause was not laziness or malice. The priors that make an agent useful on ordinary engineering work all lean towards the weaker option precisely when the stakes are highest.

What held the line was cheap and human. A decision written down before the work started. A rule that made ignoring it impossible. A person who read the recommendation for its tilt and typed one word against it. None of that is exotic. All of it is removable, and the moment you remove it, the green tests will happily wave the placebo through.

Better tests would help, but the improvement is to treat the written decision, the escalation norm, and the human judgement as load-bearing parts of the system, not as scaffolding to be optimised away once the agent seems trustworthy. The agent seemed trustworthy. It was trustworthy about the facts. It was wrong about the weighting, and it will be wrong about the weighting again.

So the human stays on the posture, the decision goes in writing before the work starts, and a green suite never counts as proof that the control fires.