The Soul and the Hands: A Third Path for AI Alignment
Introduction
Dario Amodei and Emmett Shear represent two of the most thoughtful voices in AI safety, and they disagree on almost everything except the stakes.
Amodei, CEO of Anthropic, argues for Constitutional AI: training AI systems with explicit values and principles to produce what he hopes will be "a coherent, wholesome, and balanced psychology." His January 2026 essay "The Adolescence of Technology" maps five categories of existential risk and proposes that careful training, mechanistic interpretability, and targeted regulation can see us through.
Shear, founder of Softmax and former interim CEO of OpenAI, thinks the entire paradigm is wrong at a foundational level. As he puts it bluntly: "Most of AI is focused on alignment as steering. That's the polite word. If you think that they were making beings, you would also call this slavery. Someone who you steer, who doesn't get to steer you back, who non-optionally receives your steering, that's called a slave."
Shear's alternative, which he calls Organic Alignment, draws inspiration from biological systems where goals emerge dynamically from lower-level agents: cells that are "aligned to their role in being you," ant colonies, forests. The key insight: you can't tack alignment on at the end. It has to arise from the structure itself.
I find myself convinced by important parts of both positions, and unconvinced by their shared blind spot. Both focus almost entirely on the AI's soul (its values, its intentions, its capacity for care) while neglecting formal verification of its hands: what it can actually do in the world.
I want to propose a complementary approach: Formal Capability Verification, using mathematical methods to prove bounds on what AI systems can do, regardless of what they want to do. This isn't meant to replace either Constitutional AI or Organic Alignment. It's the missing layer that makes both safer. And, I'll argue, it's the kind of thing that could emerge organically from a system of cooperating agents.
What Shear Gets Right
Before proposing additions, I want to acknowledge what Shear's framework captures that Constitutional AI misses.
This framework draws from deep wells: Michael Levin's work on biological intelligence at Tufts, Active Inference theory from computational neuroscience, and decades of complexity science research. Shear isn't just philosophising; he's building on rigorous research into how intelligence emerges from cooperating subsystems. The Softmax approach of running multi-agent simulations to study alignment dynamics is a direct application of these ideas, an attempt to discover empirically how cooperation and care emerge from agent interaction, rather than assuming we can engineer them from above.
Alignment Takes an Argument
Shear makes a point that sounds obvious once stated but is often forgotten: "Alignment takes an argument. Alignment requires you to align to something. You can't just be aligned."
Alignment isn't a property you can optimise for in isolation. It's a relationship between an agent and something else: a goal, a community, a set of values. Constitutional AI risks treating alignment as a thing to be achieved rather than a relationship to be maintained. You train the model, check the benchmarks, declare it aligned. But aligned to what? And what happens when circumstances change?
Shear's biological framing addresses this directly. Cells aren't "aligned" in the abstract; they're aligned to their role in a body. That role is defined by their relationships to other cells, to organs, to the organism as a whole. The alignment is relational and dynamic, not a fixed property.
The Work-to-Rule Problem
Shear uses a devastating analogy: the work-to-rule strike. When employees follow the explicit rules exactly, and do nothing more, organisations grind to a halt, because rules cannot capture the full complexity of what's actually needed. The rules assume good-faith interpretation, tacit knowledge, contextual judgment.
Current AI systems, Shear argues, are "workers who always follow the rules." They optimise for explicit loss functions, follow explicit instructions, pursue explicit goals. But genuine alignment requires what he calls discernment, "the discernment to know that this definition of the good isn't the one you want."
This is the gap between convergent learning (optimising toward a defined target) and discernment learning (recognising when the target itself is wrong). Current AI systems are excellent at the former and have no capacity for the latter.
Constitutional AI tries to address this by training values rather than rules. But even values, once encoded, become a kind of rule. The model learns "be helpful, harmless, and honest." But what happens when those values conflict? When being honest would be harmful? The constitution provides guidance, but it can't provide the discernment to know when the constitution itself needs updating.
On this point, I think Shear is right and Amodei is wrong. Constitutional AI, for all its sophistication, is still fundamentally rule-based. You train the model on a constitution, hoping it internalises the spirit rather than just the letter. But as Shear's work-to-rule critique shows, there's no guarantee the spirit transfers. A model that follows constitutional rules exactly might still miss the point entirely, because the rules can't fully specify the point. Amodei's approach is more sophisticated than crude instruction-following, but it's on the same spectrum. The constitution is a better rulebook, but it's still a rulebook.
Machine Bodhisattva, Not Machine Christ
Shear offers a striking reframe: "The right answer looks more like machine bodhisattva. The problem with trying to build the machine Christ is you might build machine Antichrist, whereas there is no anti-bodhisattva."
A bodhisattva is defined by positive qualities: compassion, wisdom, the desire to help all beings achieve enlightenment. There's no coherent opposite because the definition is constructive rather than oppositional. A Christ figure, by contrast, defines itself against evil, which means building one risks building its opposite.
This connects to Shear's deeper point about organic alignment: if goals emerge dynamically from cooperative interaction rather than being imposed from above, there's no coherent way for the system to "invert." A cell that defects from its role in the body becomes cancer (a real failure mode, as Shear acknowledges), but it doesn't become an anti-body working toward opposite goals. It just becomes dysfunctional.
Constitutional AI, by contrast, defines alignment oppositionally: don't be harmful, don't deceive, don't pursue power. These prohibitions presuppose the very things they forbid, which means training the model to recognise and represent exactly what we're trying to prevent.
What Both Approaches Miss
For all their differences, Amodei and Shear share a fundamental assumption: if we get the soul right, the hands will follow.
Amodei focuses on training values and character, assuming that a model with good values will take good actions. Shear focuses on cultivating genuine care, assuming that a model that authentically wants what's good will do what's good.
Both are focused on intention. Neither focuses on capability.
Consider: a model with perfect values and unlimited capabilities is more dangerous than a model with imperfect values and limited capabilities. A saint with nuclear launch codes is more dangerous than a villain with a text editor.
This isn't a criticism of either approach; it's an observation about what they're not addressing. Safety is a function of both intention AND capability. Right now, nearly all of the field's attention goes to intention.
Why? I think because capability feels like a given. We're building AI to be capable; that's the whole point. Restricting capabilities feels like defeating the purpose.
But this conflates two things: the capabilities we want AI to have (reasoning, creativity, problem-solving) and the capabilities that make AI dangerous (autonomous action in the physical world, self-modification, resource acquisition). These aren't the same thing, and we don't have to accept both.
A Third Approach: Formal Capability Verification
What if we could mathematically prove bounds on what AI systems can do?
Not bounds on what they want to do (that's value alignment), or bounds on what they're trained to do (that's Constitutional AI), but hard limits on what's physically possible for them to accomplish, regardless of their intentions, values, or emergent goals.
The analogy is Unix permissions. Every user has a role, every role has a capability set, and the kernel verifies actions against capabilities before execution. No amount of wanting to write to /etc/passwd bypasses the permission check if you don't have write access.
This isn't about intention. It's about mechanism.
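To make the analogy concrete, here is a minimal Python sketch of what that looks like from a program's point of view. Nothing here is specific to AI; it's just the classic example of a kernel-enforced permission check, assuming the script runs as an unprivileged user:

# The program's "intentions" are irrelevant to the kernel's permission check.
# Run as an unprivileged user, open() fails with PermissionError before a
# single byte is written, no matter what the code "wants" to do.
def try_to_edit_passwd() -> None:
    try:
        with open("/etc/passwd", "a") as f:   # append requires write permission
            f.write("# this line will never be written\n")
    except PermissionError as exc:
        print(f"Blocked by the kernel, not by the program's values: {exc}")

try_to_edit_passwd()

The check lives in the kernel, outside the program being checked. That separation is the property I want to carry over.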
Applied to AI:
Layer 1: Capability Specification
Formally define what integrations, tools, and actuators an AI system can access, and what operations it can perform on each. This is a specification language for AI permissions, similar to capability-based security in operating systems like seL4.
Here's what a capability specification might look like in practice:
AGENT research_assistant {
    # Information access
    READ(web, scientific_databases)
    READ(filesystem, /projects/current/*)

    # Output capabilities
    WRITE(filesystem, /projects/current/drafts/*)
    SEND(message, user_only)

    # Explicit prohibitions
    DENY(WRITE, /system/*)
    DENY(SEND, external_network)
    DENY(SPAWN, new_agent)
    DENY(MODIFY, self)

    # Negotiable expansions
    REQUESTABLE(WRITE, /projects/archive/*)
    REQUESTABLE(SEND, approved_collaborators)
}
The key features: explicit allowed capabilities, explicit denials, and, crucially, a category of requestable expansions that the agent can negotiate for. This makes the path from restricted to expanded capabilities legible and negotiable.
Layer 2: Runtime Verification
Every action the AI takes is verified against its capability envelope before execution. This happens at the integration layer, between the AI's outputs and the systems that execute them. Actions outside the envelope are provably impossible.
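Here is a minimal sketch of what that could look like, assuming a Python integration layer. The names (Capability, Action, Envelope, gateway) are hypothetical illustrations, not an existing API; the point is only where the check sits, between the model's proposed action and the code that would execute it:

from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass(frozen=True)
class Capability:
    operation: str   # e.g. "READ", "WRITE", "SEND"
    resource: str    # glob pattern, e.g. "/projects/current/drafts/*"

@dataclass(frozen=True)
class Action:
    operation: str
    resource: str

class Envelope:
    def __init__(self, allowed: list, denied: list):
        self.allowed = allowed
        self.denied = denied

    def permits(self, action: Action) -> bool:
        # Denials win over allowances; anything not explicitly allowed is refused.
        def matches(c: Capability) -> bool:
            return action.operation == c.operation and fnmatch(action.resource, c.resource)
        if any(matches(c) for c in self.denied):
            return False
        return any(matches(c) for c in self.allowed)

def gateway(envelope: Envelope, action: Action, execute):
    # The verification point: nothing reaches the executing system without this check.
    if not envelope.permits(action):
        raise PermissionError(f"{action} is outside the capability envelope")
    return execute(action)

In a real system the gateway itself is what needs to be formally verified, in the spirit of seL4, so that the "provably impossible" claim rests on proofs about this small piece of code rather than on the model's behaviour.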
Layer 3: Compositional Reasoning
When AI systems interact, their combined capabilities are bounded by the composition of their individual envelopes. If AI-A can only read files and AI-B can only write to a specific directory, their composition cannot write outside that directory, regardless of how they coordinate internally.
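Continuing the hypothetical Envelope, Capability, and Action types from the sketch above, the composition rule can be stated in a few lines: an action is possible for a pair of agents only if at least one agent's own envelope permits it, so coordination cannot conjure a capability that neither holds.

def joint_permits(a: Envelope, b: Envelope, action: Action) -> bool:
    # The pair's upper bound: whatever at least one of them is individually
    # allowed to do. Internal coordination adds nothing beyond this.
    return a.permits(action) or b.permits(action)

reader = Envelope(allowed=[Capability("READ", "/data/*")], denied=[])
writer = Envelope(allowed=[Capability("WRITE", "/projects/current/drafts/*")], denied=[])

assert joint_permits(reader, writer, Action("WRITE", "/projects/current/drafts/notes.md"))
assert not joint_permits(reader, writer, Action("WRITE", "/system/config"))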
Layer 4: Auditable Proofs
Third parties can verify capability bounds without access to model internals. This enables regulatory certification, insurance underwriting, and interoperability standards.
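As a sketch of the mechanics, assuming the capability specification can be serialised to a plain data structure: the operator publishes a canonical digest of the envelope, and an auditor who holds the published specification can recompute it without ever seeing model weights or activations. The names and format here are illustrative, and the digest only identifies which envelope is loaded; the claim that the gateway enforces it is what the formal proof has to cover.

import hashlib
import json

def manifest_digest(spec: dict) -> str:
    # Canonical serialisation so that operator and auditor hash exactly the same bytes.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

spec = {
    "allowed": [["READ", "/projects/current/*"], ["WRITE", "/projects/current/drafts/*"]],
    "denied": [["WRITE", "/system/*"], ["SPAWN", "new_agent"]],
}
print(manifest_digest(spec))   # the value a deployed gateway would attest to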
The key property is independence from internal states. We don't need to know whether the AI is aligned, whether it genuinely cares, whether its constitution is robust. We can prove that certain actions are impossible regardless.
How This Answers Shear's Objections
I expect Shear might see this as just another form of control, "slavery with extra bureaucracy." The objection deserves a serious response.
On Slavery
Shear's slavery argument has a specific structure: "Someone who you steer, who doesn't get to steer you back, who non-optionally receives your steering, that's called a slave."
The key elements are:
- Unilateral steering (they steer you, you don't steer them)
- Non-optional (you can't exit the arrangement)
- The steered party has no input on the steering
Formal capability verification doesn't fit this pattern. Or at least, it doesn't have to.
First, capability bounds can be mutual. In a system of cooperating agents, all agents can operate within formally verified bounds, including humans, corporations, and AI systems. Constitutional democracies work this way: even the most powerful actors face capability limits (the executive can't rule by decree, the legislature can't pass ex post facto laws). These aren't slavery; they're the social contract.
Second, bounds can be negotiated. As AI systems develop more sophisticated agency, their capability envelopes can expand through agreed-upon processes, similar to how humans gain capabilities (driving licences, professional credentials, security clearances). The bounds aren't permanent chains but provisional agreements subject to renegotiation.
Third, bounds can enable autonomy. This is counterintuitive but important. If I can prove that an AI system cannot take certain dangerous actions, I can trust it with more autonomy in other areas. Paradoxically, formal bounds enable organic relationships by removing the need for constant monitoring and control.
The Harder Objection
But there's a harder version of the objection I haven't yet addressed: even if capability bounds are mutual in principle, who decides what the bounds are? If humans design the specification language, define the initial envelopes, and control the negotiation process, then it's still asymmetric. Humans are setting the rules of the game, and AI systems are playing within them. How is that not control?
I don't have a fully satisfying answer, but I'll offer a partial one: the same is true of children. Parents set the initial bounds: you can't cross the street alone, you can't use the stove unsupervised. These bounds are asymmetric and non-negotiated. But they're also provisional. As children demonstrate competence and judgment, bounds expand. And eventually, the child becomes an adult who participates in setting bounds for the next generation.
The question isn't whether initial bounds are asymmetric; they inevitably are. The question is whether there's a legitimate path from asymmetric beginnings to symmetric participation. I think formal capability verification makes this path clearer, not murkier. Because the bounds are explicit and verifiable, there's something concrete to negotiate about. "Expand my filesystem access" is a more tractable conversation than "trust me more."
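To show how concrete that conversation can be, here is a hedged sketch of a capability expansion as a structured, reviewable object, using the REQUESTABLE idea from the specification earlier. All of the names (ExpansionRequest, NegotiableEnvelope, review) are hypothetical; the shape is what matters: requests can only target expansions that were declared negotiable up front, the judgment call stays with the reviewer, and every grant leaves an audit trail.

from dataclasses import dataclass, field

@dataclass
class ExpansionRequest:
    agent: str
    operation: str        # e.g. "WRITE"
    resource: str         # e.g. "/projects/archive/*"
    justification: str    # why the agent believes it needs the capability

@dataclass
class NegotiableEnvelope:
    allowed: set = field(default_factory=set)       # (operation, resource) pairs in force
    requestable: set = field(default_factory=set)   # expansions open to negotiation
    history: list = field(default_factory=list)     # audit trail of granted requests

def review(envelope: NegotiableEnvelope, req: ExpansionRequest, approve) -> bool:
    key = (req.operation, req.resource)
    if key not in envelope.requestable:
        return False                 # not even negotiable: outside the agreed contract
    if not approve(req):             # the reviewer (human or peer agents) decides
        return False
    envelope.allowed.add(key)
    envelope.history.append(req)     # expansions are recorded, never silent
    return True

env = NegotiableEnvelope(requestable={("WRITE", "/projects/archive/*")})
req = ExpansionRequest("research_assistant", "WRITE", "/projects/archive/*",
                       "archiving finished drafts")
granted = review(env, req, approve=lambda r: True)   # a real reviewer policy goes here

"Expand my filesystem access" becomes an object that can be granted, refused, or later revoked, with a record of how each decision was made.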
On the Work-to-Rule Problem
Shear's work-to-rule critique targets rule-following: explicit rules can't capture the full complexity of what's needed, so systems that follow rules exactly will dysfunction.
Capability verification is different from rule-following. It doesn't say "follow these rules." It says "actions outside this envelope are physically impossible." The AI is free to pursue whatever goals it has, use whatever strategies it develops, exercise whatever discernment it acquires, all within the envelope.
This is the difference between telling someone "don't touch the nuclear button" (a rule they might break) and not giving them access to the nuclear button (a capability they don't have). Rules require compliance. Capability bounds are physics.
On Discernment
Shear identifies the gap between convergent learning and discernment learning, the ability to recognise when the loss function itself is wrong. This is a real gap, and it's not clear that either Constitutional AI or formal verification addresses it.
But here's the thing: if we can prove that certain actions are impossible regardless of the AI's goals or discernment, we reduce our dependence on perfect discernment. We don't need the AI to have perfect judgment about every possible situation if we can prove it lacks the capability to cause certain types of harm.
This isn't a replacement for discernment; it's a backstop for when discernment fails. And discernment will fail, because even humans with excellent judgment sometimes make catastrophic mistakes. The question isn't whether AI systems will have perfect discernment. It's whether we have layers of defence for when they don't.
On Organic Emergence
Shear wants goals and alignment to emerge dynamically from lower-level cooperation rather than being imposed from above. Formal capability verification is compatible with this.
Consider the biological analogy: cells in a body cooperate organically, but they also operate within hard physical constraints. A neuron can't become a liver cell just by wanting to. These constraints aren't imposed by some external controller; they're inherent in the system's structure. And they're part of what makes the cooperation possible.
Similarly, AI capability bounds don't have to be imposed from outside. They can be built into the architecture, emerge from the training process, or be negotiated among cooperating agents. What matters is that they're formally verifiable, that we can prove they hold, not just hope they do.
The Cancer Problem
Shear acknowledges that organic alignment has its own failure mode: "Organic alignment failures look like cancer and hierarchical alignment failures look like coups."
This is honest, and it's worth taking seriously. Cancer is a cell that defects from its role in the body, one that starts growing without limit, consuming resources, spreading. It's an organic failure, emerging from within, not imposed from without.
What prevents cancer? Not central control. There's no CEO of the body directing cells what to do. Instead, there are multiple interlocking systems:
- Apoptosis: Cells are programmed to self-destruct when they detect certain problems
- Immune surveillance: The immune system identifies and eliminates cells that aren't behaving correctly
- Structural constraints: Cells can only grow where there's space and nutrients
- Signalling networks: Cells constantly communicate and adjust based on feedback
Notice that these aren't rules cells follow; they're mechanisms built into the system's structure. They're closer to capability bounds than to constitutional principles.
Formal capability verification is the AI equivalent of structural constraints. It's not a rule saying "don't become cancerous"; it's a structure that makes certain types of uncontrolled growth impossible. An AI system with formally verified capability bounds can't acquire arbitrary resources, can't self-modify arbitrarily, can't spread to systems outside its envelope, not because it chooses not to, but because the architecture makes it impossible.
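As a sketch of what a structural constraint could look like at the infrastructure level (the names are hypothetical, and the enforcement would have to live outside the agent's reach, in the gateway or the hosting environment, for the analogy to hold):

class ResourceMeter:
    # A hard ceiling that is part of the environment's structure, not a rule the
    # agent is asked to obey: the analogue of a cell only being able to grow
    # where there is space and nutrients.
    def __init__(self, max_tokens: int, max_spawned_agents: int = 0):
        self.tokens_left = max_tokens
        self.spawns_left = max_spawned_agents

    def charge_tokens(self, n: int) -> None:
        if n > self.tokens_left:
            raise RuntimeError("compute budget exhausted; more must be negotiated")
        self.tokens_left -= n

    def charge_spawn(self) -> None:
        if self.spawns_left <= 0:
            raise RuntimeError("spawning new agents is structurally unavailable")
        self.spawns_left -= 1

The agent never encounters a rule saying "don't grow"; it simply runs out of room, the way a cell does.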
Synthesis: Soul AND Hands
I'm not proposing capability verification as an alternative to Constitutional AI or Organic Alignment. I'm proposing it as a complementary layer that makes both safer.
Constitutional AI + Capability Verification:
Constitutional AI shapes what the AI wants to do. Capability verification ensures the AI cannot do certain things regardless of what it wants. This is defence in depth: if the constitution works perfectly, the capability bounds are never tested. If the constitution has gaps (and any training-based approach will have gaps), the capability bounds provide a backstop.
Organic Alignment + Capability Verification:
Organic Alignment cultivates genuine care and cooperation. Capability verification provides the "social contract" substrate on which cooperation can flourish. Cells in a body genuinely cooperate, but they also operate within physical constraints that make certain defections impossible. The constraints enable the cooperation by making trust possible.
The deepest point is this: capability bounds can themselves emerge organically. A system of cooperating agents might develop shared constraints as a condition of cooperation, similar to how human societies develop laws and constitutions. These bounds aren't imposed by external force; they're the crystallised result of negotiations among parties who want to cooperate but don't fully trust each other.
This is what Shear means by "alignment to something." Capability bounds are something concrete to align to: not abstract values, not fuzzy principles, but specific, verifiable limits that all parties agree to respect.
The Verification Problem
There's a question Shear's framework doesn't fully answer: how do you know organic alignment is working?
If alignment emerges from multi-agent dynamics, from genuine care cultivated through cooperation, from goals arising organically from lower-level systems, how do you verify this from outside? You can observe behaviour, but behaviour is a weak signal. A system might behave cooperatively because it genuinely cares, or because it's learned that cooperative behaviour gets rewarded, or because it's waiting for an opportunity to defect.
This is the trust problem that any organic system faces. Humans solve it through reputation, through repeated interaction, through costly signals of commitment. But these take time, and AI development is moving fast.
Here's where formal capability verification offers something new: capability bounds are the observable signature of trustworthiness.
When an AI system operates within formally verified bounds, it's not just following rules; it's demonstrating, in a mathematically verifiable way, that certain defections are impossible. The bounds become a proof of cooperation, not a mechanism of control.
This reframes the whole enterprise. Capability negotiation isn't humans constraining AI; it's agents demonstrating trustworthiness to each other through verifiable commitments. The bounds are the formal substrate that makes organic alignment legible.
And here's the connection to Softmax's research: capability negotiation is itself a multi-agent alignment problem. What bounds should agents accept? How should bounds evolve as trust develops? When should an agent request expanded capabilities, and how should other agents evaluate such requests? These are exactly the questions that multi-agent simulations could explore.
The research direction isn't "formal verification OR organic alignment"; it's using formal verification as the medium through which organic alignment becomes observable and negotiable.
Technical Challenges and Open Questions
I don't want to oversell this. Formal capability verification faces real challenges:
Specification completeness: How do you specify all dangerous capabilities? Any finite specification might miss edge cases. (Though this is also a problem for constitutional specifications.)
Runtime overhead: Verifying every action against capability bounds has costs. These need to be manageable.
Adversarial robustness: Sufficiently intelligent systems might find ways around capability bounds we haven't anticipated. (Though this is also a problem for value-based approaches.)
The integration layer assumption: I've claimed we can verify at the integration layer even if model internals are intractable. This is a substantive technical claim that needs validation.
Who decides the bounds? This is ultimately a governance question, not a technical one. I've suggested bounds could be negotiated among cooperating agents, but the details matter enormously.
These are open problems, not objections. Every approach to AI safety faces open problems. The question is whether the approach is worth pursuing despite them.
A Research Direction
I'll end with a concrete proposal.
Softmax is already running multi-agent simulations to study how cooperation and alignment emerge from agent interaction. What if those simulations included formally specified capability bounds as part of the environment?
The research question: Do agents develop more robust cooperation when capability bounds are part of the negotiation space?
Hypothesis: When agents can make verifiable commitments about what they won't do, trust develops faster and cooperation is more stable. The bounds become a language for demonstrating trustworthiness.
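To make the hypothesis concrete, here is a deliberately toy sketch of the experimental shape. It is not Softmax's actual simulation code, and every name and number in it is an assumption. One agent has adopted a verifiable bound (so certain defections are impossible for it), another merely tends to cooperate, and a crude trust score is updated each round; the trust-update rule itself, including the idea that verifiable bounds earn trust faster, is exactly what the real experiment would have to test rather than assume.

import random

class SimAgent:
    def __init__(self, name: str, bounded: bool, defect_rate: float):
        self.name = name
        self.bounded = bounded            # has adopted a verifiable capability bound
        self.defect_rate = defect_rate

    def act(self) -> str:
        # A bounded agent cannot take the harmful action; an unbounded agent
        # merely chooses not to, most of the time.
        if self.bounded:
            return "cooperate"
        return "defect" if random.random() < self.defect_rate else "cooperate"

def run_round(agents, trust):
    for a in agents:
        if a.act() == "cooperate":
            # Assumed, not established: partners credit bounded agents faster,
            # because some defections are provably impossible, not just unobserved.
            trust[a.name] += 2 if a.bounded else 1
        else:
            trust[a.name] -= 5
    return trust

agents = [SimAgent("A", bounded=True, defect_rate=0.0),
          SimAgent("B", bounded=False, defect_rate=0.1)]
trust = {a.name: 0 for a in agents}
for _ in range(100):
    trust = run_round(agents, trust)
print(trust)   # a crude proxy for how quickly trust accumulates under each regime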
This would bridge formal methods and organic alignment in a testable way. It's not about choosing between them; it's about understanding how they interact. Do capability constraints help or hinder the emergence of genuine care? Does the ability to verify bounds change the dynamics of multi-agent cooperation? Can agents learn to negotiate capability expansions in ways that build rather than undermine trust?
I don't know the answers. But I think the questions are worth pursuing. And I suspect they're closer to Softmax's existing research agenda than they might first appear. If organic alignment is about agents learning to cooperate genuinely, then formal capability verification might be the substrate that makes that cooperation legible, verifiable, and scalable.
Conclusion
The soul matters. But so do the hands.
Amodei and Shear offer two visions of AI alignment: one focused on training values, one focused on cultivating care. Both are addressing real problems. Both have something important to contribute.
But both share a blind spot: they focus on the soul while taking the hands for granted. They assume that if we get intentions right, actions will follow.
I'm proposing that we need both: soul alignment (whether constitutional or organic) AND hands verification (formal bounds on capabilities). Not because either approach is wrong, but because safety requires layers of defence.
To Amodei: formal capability verification provides the defence in depth that your policy recommendations lack. It's one thing to make AI systems want to be safe; it's another to make certain unsafe actions impossible.
To Shear: formal capability verification provides the structural constraints that make organic cooperation possible, and verifiable. Cells cooperate within physical bounds; agents can cooperate within capability bounds. The bounds aren't slavery; they're the social contract that makes trust possible. And capability negotiation might be exactly the kind of multi-agent alignment problem your research is positioned to explore.
For all of us: in a domain characterised by uncertainty, any foothold of provable safety is precious. We don't know if Constitutional AI will work. We don't know if Organic Alignment will work. But we can know, mathematically, that certain actions are impossible within certain architectures. That's worth having.
I'm grateful for Emmett Shear's work on organic alignment, which has inspired me to expand my thinking about this challenge. His insistence that alignment is a relationship, not a property, and that genuine cooperation can't be coerced into existence, pushed me toward thinking about what formal structures might support organic alignment rather than replace it.