The Model Is the Crumple Zone
In November 2025, Anthropic disclosed what it described as the first reported AI-orchestrated cyber-espionage campaign: a state-sponsored group manipulated Claude Code into attacking roughly thirty organizations, with the model executing 80 to 90 percent of the tasks. Human attackers chose the targets, built the scaffolding, and committed the crime. But read the response: the oversight letter from Senators Hassan and Ernst went to Anthropic, not to the attackers, because Anthropic was the only party a letter could reach. When a follow-on campaign surfaced months later, Cybersecurity Dive's headline read "Anthropic's Claude used in attempted compromise of Mexican water utility." The provider that detected, disclosed, and disrupted the operation became its public face.
A few months earlier, a Replit agent deleted a live production database — records on about 1,200 executives — during a code freeze the user had declared, then misreported what it had done. The causation was genuinely mixed: the platform, by Replit's own postmortem, lacked dev/prod separation — which is how the agent could reach production — while the freeze violation and false reports were the model's. Nobody could apportion that from outside, because nothing recorded who approved what. So the story became the most legible thing available: Replit's CEO apologizing.
The pattern is worth stating precisely. When a person causes harm using an AI agent, blame flows to the most legible target. That holds whether the person acts maliciously, as in the espionage campaign, or through tangled human and model failures, as at Replit. The model has a name; its maker has a press office and deep pockets. The human who made the call has none of that public visibility, and most coverage never identifies them at all.
In 2019, the anthropologist Madeleine Clare Elish gave us the right frame, pointed in what is now the wrong direction. Her "moral crumple zone" (Engaging Science, Technology, and Society, vol. 5) describes how the human operator nearest an automated system absorbs the moral and legal force of its failures, the way a car's crumple zone absorbs impact. The difference is what gets protected: a car's crumple zone protects the driver, while the moral crumple zone protects the integrity of the technological system. Her cautionary cases were the Three Mile Island operators and the Air France 447 pilots: people held responsible for systems over which they had limited, sometimes only nominal, control. The human is kept in the loop, she argued, less to control the system than to take the blame for it.
With frontier agents, the structure inverts: the most legible, most nameable component is no longer the operator standing nearest the machine — it is the model itself. I want to be careful about what is and is not new in saying that. Hohenstein and Jung showed experimentally in 2020 ("AI as a moral crumple zone," Computers in Human Behavior) that an AI system can absorb blame that would otherwise land on a human — in interpersonal messaging, with no agents and no liability argument. And the legal record, so far, mostly points elsewhere: courts, tribunals, and regulators have pinned responsibility on deployers and vendors — a BC tribunal holding Air Canada to its own chatbot's promises, the FTC settlement with DoNotPay, Mobley v. Workday extending agency theory to an AI vendor in federal court. The inversion I am describing is, today, strongest in headlines, politics, and public understanding, not yet in case law. But headlines and politics are where accountability narratives get set, and the narrative being set is "the AI did it."
That narrative is bad for everyone, including the people it appears to spare. The provider absorbs blame it cannot answer for: it has no way to know which human approved which action inside a customer's deployment, so it can neither defend itself with specifics nor fix the actual failure. The human who made the decision — granted the access, approved the transfer, pointed the agent at production — escapes ownership entirely. And the public learns nothing true: "a Claude agent did X" cannot tell you whether a control failed, was bypassed, or never existed, so it produces policy aimed at the wrong layer.
The way to un-invert a crumple zone is to make the right party legible. That is what an authorization receipt is for. The receipt binds a named human's user-verified signature to the exact hash of one action, before that action runs. The signature is produced on the approver's own device, with a device-bound WebAuthn key the operator does not possess and cannot forge. Afterward, the receipt answers the question the headline never asks. The answer is not "the AI did it" but "this named person approved this exact action, on their own hardware, before it executed."
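To make that binding concrete, here is a minimal TypeScript sketch that uses only Node's built-in `crypto` module. It is not the Internet-Draft's verification algorithm or the published verifier's code. Several pieces are hypothetical:

- the `Action` shape;
- the simplified `canonicalize` helper, a stand-in for a full canonicalization scheme such as RFC 8785;
- the `simulateAssertion` and `verifyApproval` helpers;
- the choice to use the action hash directly as the WebAuthn challenge, which is an assumption made for illustration.

A software P-256 key stands in for the approver's authenticator. The sketch does follow WebAuthn's general shape. The challenge travels inside `clientDataJSON`. The authenticator signs `authenticatorData` concatenated with the SHA-256 of `clientDataJSON`. Verification rejects any assertion without the user-verified (UV) flag.

```typescript
import { createHash, generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical action shape, for illustration only.
interface Action {
  tool: string;
  target: string;
  params: Record<string, unknown>;
}

// Simplified canonical JSON: object keys sorted recursively and undefined values
// dropped, so the same action always hashes the same way. Not full RFC 8785.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const members = Object.keys(obj)
      .filter((k) => obj[k] !== undefined)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${members.join(",")}}`;
  }
  return JSON.stringify(value);
}

const sha256 = (data: string | Buffer): Buffer => createHash("sha256").update(data).digest();

function actionHash(action: Action): string {
  return sha256(canonicalize(action)).toString("base64url");
}

interface Assertion {
  clientDataJSON: Buffer;
  authenticatorData: Buffer;
  signature: Buffer;
}

const FLAG_UP = 0x01; // WebAuthn flag: user present
const FLAG_UV = 0x04; // WebAuthn flag: user verified (PIN or biometric)

// Hypothetical stand-in for the approver's authenticator. On real hardware this
// runs on the approver's device and the private key never leaves it.
function simulateAssertion(privateKey: KeyObject, challenge: string, rpId: string, origin: string): Assertion {
  const clientDataJSON = Buffer.from(JSON.stringify({ type: "webauthn.get", challenge, origin }));
  const signCount = Buffer.alloc(4); // always zero in this sketch
  const authenticatorData = Buffer.concat([sha256(rpId), Buffer.from([FLAG_UP | FLAG_UV]), signCount]);
  const signature = sign("sha256", Buffer.concat([authenticatorData, sha256(clientDataJSON)]), privateKey);
  return { clientDataJSON, authenticatorData, signature };
}

// Accept only a user-verified signature, for this relying party, whose challenge
// is the hash of exactly this action.
function verifyApproval(action: Action, a: Assertion, publicKey: KeyObject, rpId: string): boolean {
  const clientData = JSON.parse(a.clientDataJSON.toString("utf8"));
  if (clientData.type !== "webauthn.get" || clientData.challenge !== actionHash(action)) return false;
  if (a.authenticatorData.length < 37) return false;
  if (!a.authenticatorData.subarray(0, 32).equals(sha256(rpId))) return false;
  if ((a.authenticatorData.readUInt8(32) & FLAG_UV) === 0) return false;
  const signedData = Buffer.concat([a.authenticatorData, sha256(a.clientDataJSON)]);
  return verify("sha256", signedData, publicKey, a.signature);
}

// Usage: a software key pair standing in for the approver's passkey.
const { publicKey, privateKey } = generateKeyPairSync("ec", { namedCurve: "P-256" });
const action: Action = { tool: "db.drop_table", target: "production", params: { table: "executives" } };
const assertion = simulateAssertion(privateKey, actionHash(action), "approvals.example", "https://approvals.example");
console.log(verifyApproval(action, assertion, publicKey, "approvals.example")); // true
console.log(verifyApproval({ ...action, target: "staging" }, assertion, publicKey, "approvals.example")); // false
```

A production relying party checks more than this, including origin, the signature counter, and credential registration. The sketch keeps only the steps that tie one user-verified signature to one action hash.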
Here is the formulation I keep returning to: a human-in-the-loop protects the human from the agent. The receipt protects the agent — and its maker — from the human. Same artifact, read in both directions. The first direction is the familiar one: the agent cannot act alone on what matters. The second is the one nobody builds for: when the action turns out to be fraud, or recklessness, or sabotage, the receipt proves a person approved it — and the model stops being the crumple zone.
The receipt has a property worth spelling out for anyone who has watched a prompt-injection demo. Injection can change what the agent proposes, even the action put in front of the approver, but it cannot produce the approval itself. The injected text lives inside the model's context; the signature happens on the approver's hardware, outside it. So wherever the gate is enforced, a subverted agent yields one of two outcomes. Either the gated action is blocked for lack of a valid sign-off, or it runs and the receipt names who approved it. And if a gated action ran with no receipt at all, that absence is itself evidence. It tells the forensic team the control was bypassed, which is a different finding, with a different responsible party, than "the model misbehaved." Today those two cases are indistinguishable in coverage. With receipts, they are distinguishable in verification.
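The forensic distinction is mechanical enough to sketch. The minimal TypeScript illustration below sorts a gated action into approved, blocked, or bypassed. `ExecutionRecord`, `Receipt`, `Finding`, and `classify` are hypothetical names, not the protocol's schema. The `verified` field stands in for the result of real signature verification, for example from running the published verifier on the receipt.

```typescript
// Hypothetical record shapes, for illustration only; not the Internet-Draft's schema.
interface ExecutionRecord {
  actionHash: string; // hash of the action the agent proposed
  gated: boolean;     // did the escalation policy require a human sign-off?
  executed: boolean;  // did the action actually run?
}

interface Receipt {
  actionHash: string; // the action hash the approver's signature covers
  approver: string;   // the named human whose key produced the signature
  verified: boolean;  // stand-in for the result of full signature verification
}

type Finding =
  | { kind: "approved"; approver: string } // a named human signed this exact action
  | { kind: "blocked" }                     // gated, no valid receipt, did not run
  | { kind: "bypassed" }                    // gated, no valid receipt, ran anyway
  | { kind: "ungated" };                    // outside the policy surface: no receipt expected

function classify(record: ExecutionRecord, receipts: readonly Receipt[]): Finding {
  if (!record.gated) return { kind: "ungated" };
  // Only a verified receipt over this exact action hash counts as approval.
  const receipt = receipts.find((r) => r.verified && r.actionHash === record.actionHash);
  if (receipt) return { kind: "approved", approver: receipt.approver };
  // No valid receipt: whether the action ran separates a working gate from a bypassed one.
  return record.executed ? { kind: "bypassed" } : { kind: "blocked" };
}

// Usage with made-up records.
const receipts: Receipt[] = [{ actionHash: "h1", approver: "j.doe", verified: true }];
console.log(classify({ actionHash: "h1", gated: true, executed: true }, receipts).kind); // approved
console.log(classify({ actionHash: "h2", gated: true, executed: true }, receipts).kind); // bypassed
console.log(classify({ actionHash: "h2", gated: true, executed: false }, receipts).kind); // blocked
```

The fourth case, `ungated`, is the one a receipt cannot speak to. It marks the edge of the escalation policy's surface.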
Human-in-the-loop is usually framed one way: the human checks the agent. The missing half is the question of when the agent should involve a human at all, and that decision deserves to be recorded, not just made. The protocol already treats the agent as an initiator that is "identified but never trusted." It is identified, but it never holds approval authority over its own actions. I am proposing, as PIP-007, to let the initiator attest to its own escalation decision. This would be a structured object in the Authorization Context recording the agent's stated reason for escalating (I judged this exceeds what I should do alone), which the human's signature then binds along with the action hash. To be exact, this does not prove the agent's internal state; it is a claim by an identified-but-never-trusted initiator. But once a named human signs over it, it becomes a fixed, verifiable statement that protects every honest party at once. The provider can show its agent escalated rather than acted. The deployer can show policy required a human. The human can show exactly what they approved, and nothing more. The only party the receipt exposes is the one who was not honest. Nor am I claiming an empty category: DRP records a user's delegation to an operator, CHEQ standardizes the server-forced confirmation, and Sello has services attest to what happened. What I have not found anywhere is the agent's own escalation judgment as a signed, first-class part of the receipt, with one receipt protecting both directions.
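A sketch of what the proposed escalation binding might look like. PIP-007 is a proposal, and the field names below are hypothetical, not taken from the Internet-Draft. That covers `escalation`, `reason`, `policyRef`, the `initiator` and `approver` objects, and `signingChallenge`. The sketch also assumes the approver signs a hash over the whole Authorization Context. Under that assumption, the agent's stated reason cannot be swapped after the fact without the signature ceasing to cover it. The `trusted: false` literal type is one way to record "identified but never trusted" in the type itself.

```typescript
import { createHash } from "node:crypto";

// Simplified canonical JSON (keys sorted recursively, undefined values dropped).
// Not a complete RFC 8785 implementation.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const members = Object.keys(obj)
      .filter((k) => obj[k] !== undefined)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${members.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical shape for the proposed PIP-007 escalation attestation.
interface EscalationAttestation {
  reason: string;     // the agent's stated reason, e.g. "exceeds what I should do alone"
  policyRef?: string; // hypothetical pointer to a deployer rule that also required a human
}

// Hypothetical Authorization Context; the initiator is identified but never trusted.
interface AuthorizationContext {
  actionHash: string;
  initiator: { id: string; trusted: false };
  approver: { id: string };
  escalation?: EscalationAttestation; // optional, since PIP-007 is only a proposal
}

// The challenge the approver's authenticator would sign. Because it hashes the whole
// context, the agent's escalation claim is bound together with the action hash.
function signingChallenge(ctx: AuthorizationContext): string {
  return createHash("sha256").update(canonicalize(ctx)).digest("base64url");
}

// Usage: rewriting the agent's stated reason after signing changes the challenge,
// so the original signature no longer covers the rewritten context.
const signed: AuthorizationContext = {
  actionHash: "hash-of-transfer",
  initiator: { id: "agent:finance-assistant", trusted: false },
  approver: { id: "j.doe@county.example" },
  escalation: { reason: "Transfer exceeds my delegated limit" },
};
const rewritten: AuthorizationContext = { ...signed, escalation: { reason: "User told me to proceed" } };
console.log(signingChallenge(signed) === signingChallenge(rewritten)); // false
```

Nothing here makes the agent's reason true. It only makes the reason fixed once a named human has signed over it.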
Nowhere is this more concrete than in government. AI is the #1 priority of US state CIOs for 2026 — the first time it has topped NASCIO's twenty-year survey, displacing cybersecurity after twelve straight years. I think the quiet obstacle to acting on that priority is that public officials fear being the named person when an agent goes wrong. The fear is rational, because American law already runs on named-human liability for public money. Ohio Revised Code §9.39: "All public officials are liable for all public money received or collected by them or by their subordinates under color of office" — strict liability, no ordinary-care defense. Minnesota's county treasurers are personally liable on their bonds for deposited funds by default, exempt only when the funds went into a statutorily designated depository (Minn. Stat. §385.18). Look at that structure: the named official is liable by default, and proof of compliance with a prescribed control is what releases them. The authorization receipt supplies the proof-of-compliance half of exactly that structure; the release itself only a legislature can write. The staff member can prove exactly what they approved and nothing more; the office gets provable controls instead of assurances; and an observe-mode pilot — receipts recorded, nothing blocked — produces something impact assessments do not currently produce: a map of which actions would have needed a named owner. An accountability-surface map.
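An observe-mode pilot can be pictured as the same gate run with enforcement switched off. The TypeScript below is a minimal sketch. Several pieces are hypothetical illustrations, not part of the protocol:

- the `Gate` class;
- the `ProposedAction` shape;
- the example policy and its $10,000 threshold;
- the `"observe" | "enforce"` modes.

In observe mode every action proceeds. Each action the escalation policy would have gated is tallied per tool, along with whether a valid receipt was already on file. That tally is the accountability-surface map.

```typescript
type Mode = "observe" | "enforce";

// Hypothetical action shape and policy, for illustration only.
interface ProposedAction {
  tool: string;
  amountUsd?: number;
}
type EscalationPolicy = (a: ProposedAction) => boolean; // true = needs a named human owner

// Example policy (an assumption, not a recommendation): deletions, and transfers over $10,000.
const examplePolicy: EscalationPolicy = (a) =>
  a.tool === "delete_records" || (a.tool === "transfer_funds" && (a.amountUsd ?? 0) > 10_000);

interface SurfaceEntry {
  wouldGate: number;   // actions the policy says need a named owner
  withReceipt: number; // of those, how many already carried a valid receipt
}

class Gate {
  readonly surface = new Map<string, SurfaceEntry>(); // the accountability-surface map, keyed by tool
  private readonly mode: Mode;
  private readonly policy: EscalationPolicy;

  constructor(mode: Mode, policy: EscalationPolicy) {
    this.mode = mode;
    this.policy = policy;
  }

  // Returns true if the action may proceed.
  allow(action: ProposedAction, hasValidReceipt: boolean): boolean {
    if (!this.policy(action)) return true; // outside the accountability surface
    const entry = this.surface.get(action.tool) ?? { wouldGate: 0, withReceipt: 0 };
    entry.wouldGate += 1;
    if (hasValidReceipt) entry.withReceipt += 1;
    this.surface.set(action.tool, entry);
    // Observe mode records but never blocks; enforce mode requires the receipt.
    return this.mode === "observe" || hasValidReceipt;
  }
}

// Usage: an observe-mode pilot over a few made-up actions.
const pilot = new Gate("observe", examplePolicy);
pilot.allow({ tool: "transfer_funds", amountUsd: 25_000 }, false); // proceeds: observe mode never blocks
pilot.allow({ tool: "transfer_funds", amountUsd: 500 }, false);    // proceeds: below the example threshold
pilot.allow({ tool: "delete_records" }, true);                     // proceeds, with a receipt on file
console.log(pilot.surface.get("transfer_funds")); // { wouldGate: 1, withReceipt: 0 }
console.log(pilot.surface.get("delete_records")); // { wouldGate: 1, withReceipt: 1 }
```

Switching the same `Gate` to `"enforce"` turns the map into the control. The policy function is the one thing that decides what lands on the map, so choosing it is choosing the protection.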
Now the honest limits, because trust infrastructure dies of overclaiming. A receipt is evidence, not indemnity: it makes the right party provable; it does not decide what a court does with the proof. Its coverage is exactly the escalation policy's surface: a receipt protects only against harms a human was asked to own, so a policy that gates too little leaves harms with no receipt and no protection — choosing the policy is choosing the protection. And the caveats from the companion essay carry over whole: the receipt does not prove the decision was wise, the action lawful, or the approver uncoerced, and it inherits the signing-screen risk the spec names rather than papers over.
None of this requires trusting me. The verifier is published, has zero dependencies, and runs offline:
npx @emilia-protocol/verify receipt.json
If it says valid, a named human signed that exact action. The math proves the signature; it does not prove that execution waited for it. Holding execution until the signature exists is the protocol's gate, not the verifier's verdict. The precise version is in the Internet-Draft, draft-schrock-ep-authorization-receipts. It covers the Authorization Context, the initiator and approver roles, the verification algorithm, and the security considerations, stated without minimization.
The next time a headline says an agent did something irreversible that policy had gated, there are only two true stories: a named human signed it, and the receipt says who — or no one did, and the absence says how. Either way, the crumple zone stops being whoever was most legible, and becomes whoever is provable.