The Soul and the Hands, Part II: From Intuition to Proof
A technical companion to "A Third Path for AI Alignment"
In Part I, I argued that AI safety requires attention to both the "soul" (values, intentions, alignment) and the "hands" (capabilities, authorities, actions). I proposed Formal Capability Verification as a complementary layer to existing approaches.
Several readers, including colleagues with deep expertise in formal methods, raised a fair challenge: where's the formalism?
This essay attempts to answer that question. I want to show that capability bounding for AI agents isn't a novel invention requiring new theory. The foundations already exist. They've been proven correct. They're running in production systems right now. The question is whether we can adapt them.
What We Can Learn from Operating Systems
The seL4 microkernel is, to my knowledge, the only operating system kernel with a complete formal verification of its security properties. The proof is machine-checked in Isabelle/HOL. It runs on real hardware. It's deployed in systems where failure is not an option.
Among the properties seL4 proves are two that matter for our purposes:
Integrity: An upper bound on what any component can write. If the security policy says process A cannot modify memory region X, then A cannot modify X. Full stop. Not "probably can't" or "is trained not to." Cannot, in the mathematical sense.
Authority confinement: An upper bound on how authority can propagate. If A has capability C, A can delegate C to B. But neither A nor B can conjure capabilities from nothing. Authority flows; it doesn't spontaneously generate.
These properties hold not because the kernel was trained on good examples or because its developers had noble intentions. They hold because the proof rules out all other possibilities. The proof is the documentation.
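To fix the shape of the first property, here is roughly what an integrity statement looks like over a deliberately abstract machine, written in Lean 4. The names (Policy, mayWrite, step, Integrity) are mine, and this is only the skeleton of the statement; seL4's actual theorem is stated over its Isabelle/HOL model of the kernel.

-- The shape of an integrity property over an abstract machine: a policy says
-- which components may write which addresses, and integrity says a component's
-- step leaves every address it may not write unchanged.
structure Policy (Component Addr : Type) where
  mayWrite : Component → Addr → Prop

def Integrity {Component Addr Val : Type}
    (policy : Policy Component Addr)
    (step : Component → (Addr → Val) → (Addr → Val)) : Prop :=
  ∀ (c : Component) (mem : Addr → Val) (a : Addr),
    ¬ policy.mayWrite c a → step c mem a = mem a

The point is the shape: integrity is a universally quantified claim over every component, every memory state and every address, not a statistical tendency.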
The seL4 protection model is a variant of the take-grant model. The key restriction is that it permits only grant operations, not take operations: authority must be explicitly given, never seized. This asymmetry is what makes confinement provable.
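To see why the grant-only restriction matters, here is a deliberately tiny Lean 4 model of my own devising (it is not seL4's protection model): the system's authority is flattened into a single pool, a grant step can only copy a capability that already exists in that pool, and a short proof shows that no step can introduce a capability that was not already present.

-- A drastically simplified grant-only system. The state is the pool of all
-- capabilities reachable anywhere; a grant step copies an existing capability
-- (delegating it to another holder). There is no rule that mints a new one.
inductive GrantStep (α : Type) : List α → List α → Prop where
  | grant (c : α) (pool : List α) (h : c ∈ pool) : GrantStep α pool (c :: pool)

-- No single step amplifies authority: everything reachable afterwards was
-- already reachable before.
theorem grantStep_no_amplification {α : Type} {before after : List α}
    (s : GrantStep α before after) : ∀ c ∈ after, c ∈ before := by
  cases s with
  | grant c pool h =>
    intro x hx
    cases hx
    · exact h        -- x is the freshly granted copy; it was already in the pool
    · assumption     -- x was untouched by the grant

The proof is short precisely because the take operation is absent; add a rule that conjures a capability from nothing and the theorem becomes unprovable.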
Hardware That Enforces Boundaries
CHERI (Capability Hardware Enhanced RISC Instructions) takes this further. It builds capability checking into the processor itself.
In a CHERI system, pointers are replaced with capabilities: unforgeable tokens that encode not just an address but also bounds and permissions. You can't just cast an integer to a pointer and access arbitrary memory. The hardware won't let you.
The Cambridge/Arm team has proven a property called reachable capability monotonicity: no sequence of instructions can increase the authority of capabilities reachable from the current register state. You can restrict capabilities. You can delegate them. You cannot amplify them.
This proof, too, is machine-checked. It covers the actual instruction set of the Morello prototype processor. It's not a model of an idealised system. It's a proof about silicon.
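To see what such a monotonicity statement looks like in miniature, here is a toy Lean 4 model. It is mine, not the Morello formalisation, which is stated over the full Sail specification of the instruction set: a capability here is just an address range, the only derivation operation narrows that range, and the lemma says the result never reaches outside its parent.

-- A toy CHERI-style capability: the address range its holder may touch.
-- Real CHERI capabilities also carry permission bits, an object type and a
-- hardware validity tag; this sketch keeps only the bounds.
structure MemCap where
  base  : Nat
  limit : Nat
  deriving Repr

-- `within c' c`: capability c' grants access to no more memory than c.
def MemCap.within (c' c : MemCap) : Prop :=
  c.base ≤ c'.base ∧ c'.limit ≤ c.limit

-- The only derivation operation: shrink the bounds. There is deliberately no
-- operation that widens them.
def MemCap.restrict (c : MemCap) (newBase newLimit : Nat) : MemCap :=
  { base := max c.base newBase, limit := min c.limit newLimit }

-- Monotonicity in miniature: anything derived by restrict stays within its
-- parent capability.
theorem MemCap.restrict_within (c : MemCap) (b l : Nat) :
    (c.restrict b l).within c :=
  And.intro (Nat.le_max_left c.base b) (Nat.min_le_left c.limit l)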
The Type-Theoretic View
There's another way to think about capability bounding: through the lens of effect systems.
Traditional type systems tell you about data: this function takes an integer and returns a string. Effect systems extend this to tell you about computation: this function takes an integer, returns a string, and may perform file I/O and throw exceptions.
The Koka programming language, developed at Microsoft Research, uses row-polymorphic effect types. A function's type signature includes its effects:
fun fetch-page(url: string): <network, exn> string
This function may access the network and may throw exceptions. The effect is part of the type. If you try to call this function from a context that doesn't allow network access, the type checker rejects your program. Not at runtime. At compile time.
The key property is that effects can only be weakened, never strengthened. A function with effects {A} can be used anywhere that allows {A, B, C}. But a function with effects {A, B} cannot be used in a context that only allows {A}. The type system enforces monotonicity.
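The same direction of travel can be written down in a few lines of Lean 4. What follows is my own toy encoding, not Koka's row polymorphism: a computation carries the set of effects it may perform in its type, and the only conversion the encoding offers is weakening into a larger allowed set.

-- A toy effect discipline: a computation is tagged with the effects it may
-- perform, and the only available conversion enlarges the allowed set.
inductive Effect where
  | network | exn | fileIO
  deriving DecidableEq

-- A computation that may perform the effects in `effs`. The payload is a
-- stand-in; what matters is the index.
structure Comp (effs : List Effect) (α : Type) where
  result : α

-- Weakening: a computation tagged `es` may run where `es'` is allowed, given a
-- proof that every effect in `es` is also in `es'`.
def Comp.weaken {es es' : List Effect} {α : Type}
    (c : Comp es α) (_h : ∀ e ∈ es, e ∈ es') : Comp es' α :=
  ⟨c.result⟩

-- The reverse direction cannot be written: nothing lets you retag a
-- Comp [Effect.network, Effect.exn] α as a Comp [Effect.exn] α, because the
-- required inclusion proof does not exist.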
This is exactly the property we want for AI agent capabilities.
A Sketch for Agent Capabilities
Let me try to be concrete. Suppose we're building an AI agent that can use tools: web search, a calculator, file operations. We want to prove that this agent cannot execute arbitrary code.
First, we define our capability vocabulary:
inductive Capability where
  | webSearch : Capability
  | calculator : Capability
  | fileRead : FilePath → Capability
  | fileWrite : FilePath → Capability
  | codeExec : Capability
  | shellCommand : Capability
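The sketch relies on a CapabilitySet type that I have left unspecified. One simple concrete choice, assumed here purely for illustration, is a list of capabilities under the ordinary subset relation, so that the required ⊆ granted hypothesis used below is a genuine proposition:

-- One concrete (assumed) choice for CapabilitySet: a plain list of capabilities.
-- An abbrev keeps it transparent, so membership and the subset relation ⊆
-- apply to it directly.
abbrev CapabilitySet := List Capability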
Then we define agent actions, indexed by the capabilities they require:
structure AgentAction (required : CapabilitySet) (α : Type) where
  run : Environment → IO α
  proof_bounded : OnlyUses required
The proof_bounded field is a proof obligation: it must be discharged before the action can be constructed. (OnlyUses is left abstract in this sketch; it stands for whatever predicate certifies that run uses only the listed capabilities.)
Finally, we have an execution gate:
def executeAction {required : CapabilitySet} {α : Type}
    (granted : CapabilitySet)
    (action : AgentAction required α)
    (env : Environment)
    (h : required ⊆ granted) : IO α :=
  action.run env
The key is the argument h : required ⊆ granted. This is a proof that the action's required capabilities are a subset of the granted capabilities. If this proof can't be constructed, the code doesn't compile.
This is a sketch, not a complete implementation. But it illustrates the pattern: capability checking becomes a proof obligation, and the type system enforces it.
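To show that the pattern really does type-check end to end, here is a stripped-down, self-contained variant: the names are simplified, the OnlyUses obligation is dropped, and the capability check is phrased as a decidable coverage test so that the proof at the call site is just decide. Everything here is my own illustration rather than a proposed API.

-- A self-contained miniature of the gate. ToolCap is the capability
-- vocabulary, covers is the decidable check, and execute refuses to
-- elaborate unless the coverage proof can be supplied.
inductive ToolCap where
  | webSearch
  | calculator
  | fileRead
  | codeExec
  deriving BEq, DecidableEq

-- Every capability in `required` appears in `granted`.
def covers (granted required : List ToolCap) : Bool :=
  required.all fun c => granted.contains c

structure ToolAction (required : List ToolCap) (α : Type) where
  run : IO α

def execute {required : List ToolCap} {α : Type} (granted : List ToolCap)
    (act : ToolAction required α) (_h : covers granted required = true) : IO α :=
  act.run

def searchAction : ToolAction [ToolCap.webSearch] String :=
  ⟨pure "search results"⟩

-- Compiles and runs: webSearch is covered by the grant.
#eval execute [ToolCap.webSearch, ToolCap.calculator] searchAction (by decide)

-- Does not compile: covers [ToolCap.calculator] [ToolCap.webSearch] is false,
-- so no proof can be supplied for the final argument.
-- #eval execute [ToolCap.calculator] searchAction (by decide)

The rejected call is not caught by a runtime monitor; it simply never becomes a program.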
Composing Agents
Real systems involve multiple agents interacting. This is where compositional reasoning becomes essential.
If Agent A has capabilities C₁ and Agent B has capabilities C₂, what capabilities does their composition have?
The naive answer is C₁ ∪ C₂. But this misses something important: the interaction between agents might itself require capabilities. And crucially, we need to prove that the interaction cannot create new capabilities that neither agent originally possessed.
This is precisely seL4's authority confinement property, lifted to the agent level:
theorem interaction_confined
    (a1 : Agent caps1)
    (a2 : Agent caps2)
    (result : Interaction a1 a2) :
    result.capabilities ⊆ caps1 ∪ caps2 := by
  -- The proof must show no capability amplification occurs
  sorry -- This is where the real work lives
The sorry is honest. I don't have this proof. But the structure tells us what we'd need to prove: that interaction preserves the capability monotonicity property.
This connects directly to Emmett Shear's concern about organic alignment. In multi-agent systems, the question isn't just whether each agent is aligned, but whether their interaction produces aligned behaviour. Capability bounds give us a handle on this: even if interaction produces emergent behaviour, that behaviour is constrained by the union of the agents' capabilities.
What This Doesn't Solve
I want to be clear about limitations.
Rice's theorem still applies. For Turing-complete systems, semantic properties ("this program causes harm") are generally undecidable. We can prove syntactic properties ("this program never calls the shell_exec function") but not arbitrary semantic ones.
Capability bounds don't ensure good outcomes. A web-search-only agent could still research harmful information. Restricting what an agent can do doesn't determine whether what it does is good. This is precisely why I frame capability verification as a complement to alignment, not a replacement.
We're bounding tools, not understanding. An agent might "understand" how to cause harm even if it lacks the capability to execute that understanding. This is the distinction between knowing and doing. Capability bounds address doing.
Specification remains hard. Deciding which capabilities to grant is a human judgment. The formal methods tell you "if you grant these capabilities, the agent cannot exceed them." They don't tell you which capabilities to grant.
This is how seL4 works too. It proves confinement relative to a policy. It doesn't prove the policy is wise.
Why Lean 4 Matters
I've been watching the work coming out of Harmonic with great interest. Their Aristotle system achieved gold-medal performance on International Mathematical Olympiad problems, with proofs in Lean 4 that are machine-verified.
This matters for capability verification because it suggests frontier AI systems can already generate formal proofs. The question is whether we can define the right proof obligations.
Imagine an architecture where:
- An AI agent proposes an action
- The action comes with a proof certificate claiming it stays within granted capabilities
- A trusted verifier (much simpler than the agent) checks the proof
- Only if the proof verifies does the action execute
This is essentially the gatekeeper architecture that David Dalrymple has proposed for ARIA's Safeguarded AI programme. The innovation would be using capability bounds as the property being proven.
The advantage is that the verifier can be simple and auditable. The complexity lives in the proof generator (the AI), but we don't need to trust the generator. We only need to trust the verifier. And the verifier is just checking that proof steps follow from axioms.
Research Questions
I don't have all the answers. But I can articulate questions that seem worth pursuing:
Can we define a practical capability vocabulary for AI agents? The examples I've given (web search, file access, code execution) are coarse-grained. Real tool use is more nuanced. What's the right level of abstraction?
How do we handle capability delegation in agent networks? If Agent A grants Agent B a subset of its capabilities, how do we track this? Can we adapt seL4's capability derivation proofs?
What's the relationship between capabilities and world models? ARIA's framework involves proving properties relative to a world model. Can capability bounds be expressed as constraints on an agent's influence over the world model state?
Can Aristotle-style systems generate capability proofs? This seems testable. Define a capability proof obligation, fine-tune a model to generate such proofs, measure success rate.
What would hardware-assisted AI capability enforcement look like? CHERI constrains memory access at the hardware level. Could similar mechanisms constrain AI tool calls? This might require rethinking the AI serving stack.
Synthesis
In Part I, I argued that we need to attend to both the soul and the hands. This essay has tried to show what attending to the hands might look like in practice.
The core insight is that capability monotonicity is provable. We have existence proofs: seL4 proves it for OS kernels, CHERI proves it for hardware, effect systems prove it for programs. The question is whether we can prove it for AI agents.
I'm not claiming this solves alignment. I'm claiming it provides a layer of defence that doesn't depend on alignment being solved. Even if we can't verify that an agent's values are good, we might be able to verify that its capabilities are bounded.
In the words of the seL4 team: "The proof is the documentation." For AI systems with growing real-world impact, we need the same standard: bounds that are not merely intended or trained, but proven.
I'm grateful to the teams at Harmonic, ARIA, the seL4 Foundation, and the CHERI project for their work on formal verification and provable safety. Their research provided the foundations for this essay. The errors are mine.
References
- Klein, G., et al. "seL4: Formal Verification of an OS Kernel." SOSP 2009.
- Sewell, T., et al. "seL4 Enforces Integrity." ITP 2011.
- seL4 Project. "Verified Protection Model of the seL4 Microkernel."
- Arm/Cambridge. "Verified Security for the Morello Capability-enhanced Prototype Arm Architecture." ESOP 2022.
- Watson, R., et al. "CHERI: Capability Hardware Enhanced RISC Instructions."
- Leijen, D. "Koka: Programming with Row-polymorphic Effect Types." 2014. arXiv:1406.2061
- Leijen, D. "Algebraic Effects for Functional Programming." Microsoft Research, 2016. PDF
- Harmonic. "Aristotle: IMO-level Automated Theorem Proving." 2025. arXiv:2510.01346
- Dalrymple, D., et al. "Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems." 2024. arXiv:2405.06624
- ARIA. "Safeguarded AI: Constructing Guaranteed Safety." Programme Thesis v1.2. PDF
- Miller, M., Yee, K., Shapiro, J. "Capability Myths Demolished." 2003.
- Dennis, J., Van Horn, E. "Programming Semantics for Multiprogrammed Computations." CACM, 1966.
Part I of this essay, "The Soul and the Hands: A Third Path for AI Alignment," is available at the-soul-and-the-hands.html