Blog · Formal Methods · June 2026

We formally verified an AI-safety protocol — here's the proof

Most “AI safety” and “AI governance” is policy documents and good intentions. EMILIA’s core guarantee — that an AI agent cannot take an irreversible action without a signed human approval — is written as a formal specification and checked by a model checker on every commit. Tests prove a protocol works on the cases the author thought of; formal verification proves it across every case the model admits. For an authorization protocol standing between an autonomous agent and someone’s money, that gap is the whole point.

What the proofs guarantee for an AI agent

EMILIA sits at the pre-execution moment: an agent is about to do something irreversible, and the protocol decides whether it may proceed. Four of the verified invariants are the load-bearing safety properties — they hold across every reachable state of the model, not just the tested ones:

ConsumeOnceSafety — a signed authorization is consumed exactly once; an agent cannot replay it.
WriteBypassSafety — no path can fabricate a committed decision outside the ceremony.
SelfContestImpossible — an agent cannot approve, consume, or contest its own action.
TerminalStateIrreversibility — once an action is committed or refused, that outcome cannot be silently flipped.

What it does not prove: anything about a language model’s cognition. This is bounded model-checking of the authorization state machine — it proves the gate around the agent is sound, not that the agent is. That distinction is deliberate, and it is exactly why the gate exists. See it stop a real attack on the live crash test, or try to break the guarantees yourself on the open challenge.

What "formally verified" actually means

A formal model is a mathematical description of the protocol — its states, its transitions, the conditions under which a transition is allowed. A formal verification tool explores that model and either confirms a property holds across every reachable state, or returns a counterexample: a specific sequence of transitions that breaks the property.

The thing being verified is the model, not the implementation. The implementation can still have bugs. What you have ruled out is design-level ambiguity: races, missing invariants, ambiguous orderings, properties that turn out to be subtly false. Those are the bugs that survive code review and unit tests because nobody ever wrote down precisely enough what was supposed to be true.

TLA+ — temporal properties across runs

TLA+ is built around the idea that a protocol is a sequence of states over time. You write actions ("here's how a state transitions to the next") and properties that must hold either everywhere ("safety": no bad state is ever reachable) or eventually ("liveness": every request eventually completes).

The TLC model checker explores the reachable state space and either certifies the property or gives you a trace. EMILIA's spec encodes 26 such theorems — examples include: every issued handshake is bound to a specific action and cannot be reused for a different action; every signoff binds to the exact handshake context and cannot be replayed; every receipt is verifiable from its own contents without reference to issuer state.

When a TLA+ theorem fails, the counterexample is a literal step-by-step run of the protocol that violates the property. You read it like a stack trace. The fix is usually either tightening a precondition (the action shouldn't have been allowed in that state) or adding a binding (the missing field that lets you distinguish two superficially similar actions).

Alloy — structural invariants in bounded scope

Alloy is the complement. Where TLA+ explores time, Alloy explores structure. You write facts ("this property is always true of any well-formed instance of the protocol's state") and assertions ("this stronger property follows from the facts"). The Alloy Analyzer searches for a counterexample within a bounded scope — usually small numbers of actors, handshakes, signoffs.

Bounded scope sounds like a weakness; it is also a feature. The "small scope hypothesis" — that most invariant violations show up in small instances — has held remarkably well in practice. EMILIA's Alloy model has 35 facts (e.g., OneBindingPerHandshake, ConsumedRequiresVerifiedEvent, SignoffChallengeBindingMatchesHandshake) and 22 assertions. Three of those assertions failed during initial development and surfaced real protocol gaps; fixing them added the facts that now keep the assertions true.

Why use both

The two tools answer different questions. TLA+ is best at "across every possible run of the protocol, is this property always true?" Alloy is best at "across every possible structural configuration, is this invariant always true?" Most authorization protocols have properties of both shapes.

A worked example: handshake-binding integrity. The TLA+ theorem says "no execution path produces a state where a signoff exists for one handshake and a different handshake is consumed." The Alloy fact says "every consumed handshake has exactly one signoff that binds to its own challenge — never a sibling's." Both are about the same property; they catch different ways the protocol can go wrong.

Verification in CI

The interesting engineering claim isn't "we ran TLC once." It's "TLC and the Alloy Analyzer run on every push." When the model and the implementation can drift, verification becomes a snapshot. When they ship together — the spec changes, CI fails, the spec is fixed before the implementation lands — verification stays load-bearing.

EMILIA's CI runs the full theorem suite in under five minutes for the standard scope, longer scopes on a nightly job. Counterexamples surface as PR comments. The model is part of the codebase, not a paper attached to it.

Read the spec

The full TLA+ module, the Alloy facts, and the assertion list are in the public repo. The protocol page links to a guided walk-through.

Read the spec Protocol overview