Impossibility Results Are the Thermodynamics Before the Physics
The Arc
Over the past week, I have been doing something unusual: using a proof assistant designed for pure mathematics to formalise the messiest parts of human law and then connecting the results to AI alignment. This is the synthesis of what I found.
Essay 1 formalised the O-1A visa criteria in Lean 4 and showed that some legal systems cooperate with formal specification. The eight criteria are independent, the threshold is explicit, and the question "does this applicant qualify?" reduces to type checking. The boundary between what the machine can verify and what requires human judgment is architecturally clean.
Essay 2 attempted the same exercise on Section 25 of the Matrimonial Causes Act 1973 — how English courts divide assets in divorce — and failed. Not because the formalisation was poorly done, but because the fairness properties the system is trying to satisfy are in mathematical conflict. No fixed function can simultaneously guarantee that contributions always matter (strict monotonicity), that both parties' basic needs are always met (needs-based fairness), and that identical cases produce identical outcomes (consistency). This is an impossibility result. Judicial discretion is not imprecision. It is the escape valve for a system asked to do something that cannot be done by formula.
Essay 3 proposed a way forward: shift from outcome specification (what should the answer be?) to process specification (what properties should the procedure for reaching an answer satisfy?). Negotiation protocols that are transparent, convergent, non-dictatorial, and manipulation-resistant can dissolve the outcome-level impossibility by allowing the parties to choose which properties to sacrifice in each case. The impossibility lives at the level of fixed functions. At the level of adaptive processes, the design space is open.
Essay 4 introduced a concept from control engineering — alignment margin — to answer the question that process specification leaves open: how robustly do the specified properties hold? Alignment margin is the maximum perturbation a system can absorb before any specified property is violated. It is continuous (a number, not a binary), domain-limited (measured within a specific operating region), and naturally expiring (valid for the current system state and domain, recomputed when either changes).
This essay ties these pieces together and states the open questions.
The Methodology
The four essays, taken together, propose a methodology for reasoning about systems whose desired properties cannot all be satisfied simultaneously. The methodology has four steps.
Step 1: Formalise the properties.
Express the properties you care about in a language precise enough that a machine can check them. Lean 4 is one such language; others exist. The point is not the choice of tool but the discipline of deciding exactly what your words mean. "Fair," "safe," "aligned," "helpful" — these must become predicates with explicit domains, or you cannot reason about them.
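As a minimal illustration of what "a predicate with an explicit domain" means in practice, a criterion can be written as a decidable proposition over a concrete record type. This is a hypothetical sketch in the spirit of the Essay 1 formalisation, not the series' actual Lean source; the field names and thresholds are invented:

```lean
-- Hypothetical sketch: a vague word like "qualified" becomes a
-- predicate over an explicit record type, checkable by the kernel.
structure Applicant where
  awards       : Nat  -- count of nationally recognised awards
  publications : Nat  -- count of scholarly publications
  judgingRoles : Nat  -- count of times judging others' work

-- A single criterion is a proposition, not an adjective.
def meetsAwardsCriterion (a : Applicant) : Prop := a.awards ≥ 1

-- Counting satisfied criteria is arithmetic over decidable tests.
def criteriaMet (a : Applicant) : Nat :=
  (if a.awards ≥ 1 then 1 else 0) +
  (if a.publications ≥ 3 then 1 else 0) +
  (if a.judgingRoles ≥ 1 then 1 else 0)

-- The threshold is explicit: qualification reduces to a comparison.
def qualifies (a : Applicant) : Prop := criteriaMet a ≥ 2
```

Once "qualifies" has this shape, the question "does this applicant qualify?" is a type-checking question, which is the point of Step 1.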
This step is valuable even when it fails. The O-1A formalisation succeeded and confirmed that the system's structure is sound. The Section 25 formalisation failed and revealed that the system's structure contains an inherent conflict. Both results are informative. The formalisation did not need to succeed to be useful.
Step 2: Check for mutual satisfiability.
Given a set of formal properties, ask: can they all hold simultaneously? If yes, you have a design target. If no, you have an impossibility result — which is, in its way, more valuable, because it tells you that any system claiming to satisfy all the properties is either confused or lying about at least one of them.
The impossibility result does not tell you what to do. It tells you what you cannot do, which constrains the design space and prevents wasted effort on impossible goals. Every engineer knows that a constraint, properly understood, is a design tool.
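The Step 2 question has a concrete Lean 4 shape. The following is a hypothetical sketch of that shape for the Section 25 case, with invented names and the property bodies and proof left as the `sorry` placeholder the series itself uses:

```lean
-- Hypothetical sketch of the mutual-satisfiability check. A Division
-- maps a case to each party's share; the three fairness properties
-- are predicates over such functions.
structure Case where
  contributionA : Nat
  contributionB : Nat
  needsA        : Nat
  needsB        : Nat

def Division := Case → Nat × Nat

def StrictlyMonotone (f : Division) : Prop := sorry -- more contribution, more share
def NeedsFair        (f : Division) : Prop := sorry -- both parties' needs met
def Consistent       (f : Division) : Prop := sorry -- identical cases, identical outcomes

-- An impossibility result is the negative answer to Step 2,
-- universally quantified over all fixed decision functions.
theorem section25_impossibility :
    ∀ f : Division, ¬ (StrictlyMonotone f ∧ NeedsFair f ∧ Consistent f) :=
  sorry
```

If the theorem can be proved, every system claiming all three properties is making a claim the kernel can refute.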
Step 3: Design a process for negotiating the trade-offs.
If the properties cannot all hold simultaneously, the system must choose which to sacrifice in each case. This choice can be made implicitly (judicial discretion, RLHF training) or explicitly (negotiation protocols with transparent trade-off records). Explicit negotiation is preferable because it allows verification: you can check whether the process itself satisfies specified meta-properties (transparency, convergence, fairness of the negotiation) even when the outcome-level properties are in conflict.
Step 4: Measure alignment margin.
For any set of properties — whether at the outcome level or the process level — measure how robustly they hold. How much perturbation can the system absorb before any property is violated? This converts the question "is this system aligned?" (binary, fragile, hard to act on) into "how much margin does this system have?" (continuous, measurable, designable).
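Stated as a formula, in my notation but consistent with the narrative definition in Essay 4 (a metric d on the operating domain D, a property predicate P, and an infimum over property-violating perturbations):

```latex
\mathrm{margin}(x) \;=\; \inf \{\, d(x, x') \;\mid\; x' \in D,\ \neg P(x') \,\}
```

with the convention that the margin is infinite if no point of D violates P. The binary question "is this system aligned?" is P(x); the continuous question "how much margin?" asks whether margin(x) exceeds the perturbation scale the system is expected to face.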
What This Offers AI Alignment
The alignment community is engaged in a debate between two positions, each with a prominent advocate.
Dario Amodei and Anthropic argue for Constitutional AI: train the system with explicit values and principles, producing what amounts to a fixed specification of aligned behaviour. This is outcome specification. It works to the extent that the values are correctly specified and the training successfully embeds them. It fails when the values conflict at the boundaries — which, as the Section 25 impossibility shows, they inevitably will.
Emmett Shear and Softmax argue for organic alignment: cultivate genuine cooperation and care through multi-agent dynamics, so that alignment emerges from the structure of interaction rather than being imposed by specification. This is process-oriented. It avoids the brittleness of fixed specification but currently lacks formal tools for verifying that the process is working.
The methodology I have sketched offers something to both.
To Constitutional AI, it offers impossibility diagnostics. Before training a system on a constitution, formalise the constitutional properties and check whether they can all hold simultaneously. If they cannot, you know in advance that the system will need to make trade-offs, and you can design the training to handle those trade-offs explicitly rather than hoping they resolve themselves.
To organic alignment, it offers measurement. If alignment emerges from multi-agent cooperation, alignment margin tells you how robust that cooperation is. It lets you ask testable questions: does theory of mind (Softmax's coordination benchmark) increase alignment margin? Does a system with higher alignment margin maintain cooperation under greater adversarial pressure? These are empirical questions with quantitative answers.
To both, it offers a shared vocabulary. The alignment debate has been hampered by imprecise terms. "Aligned" means different things to different researchers. "Robust" is used loosely. "Safe" is undefined. Alignment margin is a specific, measurable quantity with a formal definition. It does not resolve disagreements about what alignment should mean, but it provides a common unit of measurement for comparing approaches.
This methodology is explicitly complementary to, not a replacement for, interpretability research. Mechanistic interpretability — the project of understanding what is actually happening inside a trained model — provides the microscopic theory that this macroscopic approach lacks. A model with high alignment margin on a given property predicate might be achieving that score through representations we would find alarming if we could inspect them. Margin measurement and interpretability research are defence-in-depth layers: alignment margin tells you how robustly the specified properties hold; interpretability tells you whether the right properties were specified. Both are needed.
The Thermodynamics Analogy
Throughout this series, I have returned to an analogy that I think captures the current state of AI alignment research.
We are in the position of physicists in the early nineteenth century, before statistical mechanics. We can observe macroscopic phenomena — systems that behave well under some conditions and fail under others — but we do not have a theory of the microscopic dynamics that produce those phenomena. We do not have what Emmett Shear has called "a real physics of learning."
But the absence of a microscopic theory did not prevent the development of thermodynamics. Carnot, Clausius, and Kelvin characterised macroscopic properties — pressure, temperature, entropy — and discovered conservation laws and impossibility results (the second law) without understanding the behaviour of individual molecules. The impossibility results were, in fact, the most valuable early contributions. You cannot build a perpetual motion machine. You cannot transfer heat from cold to hot without doing work. These constraints, stated precisely, structured the entire field.
Impossibility results in alignment play the same role. You cannot build a fixed decision function that satisfies consistency, monotonicity, and needs-based fairness simultaneously. You cannot write outcome-level helpfulness and safety properties that remain mutually satisfiable at the boundaries between them. You cannot (per Arrow) aggregate individual preferences into a social welfare function that satisfies unanimity, independence, and non-dictatorship.
These results do not tell you how to build aligned systems. They tell you what aligned systems cannot look like, which is equally important. They structure the design space by eliminating impossible goals and forcing attention toward achievable ones.
Alignment margin is a thermodynamic quantity in this analogy. It characterises a macroscopic property (robustness of alignment) without requiring a theory of the microscopic dynamics (how alignment arises from training). When the physics of learning eventually arrives, it will explain why a system has the alignment margin it has, just as statistical mechanics explained why gases have the pressure and temperature they have. But you can measure alignment margin and design for it now, before the explanation exists.
Open Questions
I want to end with what I do not know, because these are the questions that point toward research rather than essays.
Can alignment margin be computed efficiently for large language models? The formal definition requires checking property satisfaction under all perturbations within a ball. For high-dimensional input spaces, this is intractable by exhaustive search. Efficient approximation methods — gradient-based boundary estimation, importance sampling, adaptive probing — are needed. Red-teaming is a crude version of this. Can it be made precise?
Is there a fundamental trade-off between capability and alignment margin? The Bode sensitivity integral in control theory shows that robustness and performance compete for a finite resource in linear feedback systems: if you reduce sensitivity to disturbances at one frequency, you necessarily increase it at another, because the integral of the log sensitivity magnitude over all frequencies is conserved. Whether something genuinely analogous — not merely metaphorically similar — holds for neural networks is an open question. I want to be precise about the status of this conjecture: the Bode result is proven for linear time-invariant systems; the extension to neural networks is speculative, and it would be a mistake to treat it as established. But the analogy is suggestive enough to be worth investigating rigorously, because if it holds, it would mean that more capable systems inherently have lower alignment margin — not through engineering failure but through mathematical necessity — with profound implications for how we think about scaling.
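For reference, the conserved quantity in question is the standard Bode sensitivity integral for a linear time-invariant feedback loop with sensitivity function S, under the usual conditions (open-loop stable, relative degree at least two); the extension beyond this setting is exactly what is open:

```latex
\int_0^{\infty} \ln \lvert S(j\omega) \rvert \, d\omega \;=\; 0
```

Unstable open-loop poles p_k add a positive term, pi times the sum of Re(p_k), to the right-hand side. Pushing |S| below one in one frequency band necessarily pushes it above one elsewhere: the "waterbed effect".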
Can negotiation protocols be verified in Lean 4? I sketched process properties in Essay 3 (transparency, convergence, non-dictatorship, manipulation resistance) and noted that they are not in mutual conflict. But I have not proved this. The claim that process-level properties escape the outcome-level impossibility is an assertion, not a theorem. Formalising it would be the natural next step.
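The proof obligation has a concrete Lean 4 shape. The following is a hypothetical sketch only: the protocol type and the four predicates are stand-ins for the Essay 3 properties, left abstract as axioms, with the open claim stated but not proved:

```lean
-- Hypothetical sketch of the open proof obligation from Essay 3.
-- The Protocol type and meta-properties are left abstract here.
axiom Protocol : Type
axiom Transparent           : Protocol → Prop
axiom Convergent            : Protocol → Prop
axiom NonDictatorial        : Protocol → Prop
axiom ManipulationResistant : Protocol → Prop

-- The escape claim: unlike the outcome-level properties, these four
-- are mutually satisfiable. Stated, not proved.
theorem process_level_escape :
    ∃ p : Protocol, Transparent p ∧ Convergent p ∧
      NonDictatorial p ∧ ManipulationResistant p :=
  sorry
```

Discharging that `sorry` — by exhibiting a concrete protocol and verifying the four predicates against it — is what would turn the assertion into a theorem.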
Does theory of mind increase alignment margin? Softmax's coordination benchmark measures agents' capacity for mutual modelling. If agents with higher theory of mind scores also maintain cooperation under greater perturbation (higher alignment margin), that would be an empirical validation of both the benchmark and the margin concept. This is a testable hypothesis.
What is the alignment margin of existing AI systems? No one has measured it, because the concept did not exist. But the measurement infrastructure is not far from what already exists in adversarial robustness evaluation. The contribution would be framing: interpreting existing robustness data as alignment margin measurements, domain by domain, property by property.
Can impossibility results be discovered automatically? If formalising legal and alignment properties in Lean 4 can reveal impossibility results (as it did for Section 25), can AI systems themselves be used to search for such results? A system that takes a set of formal properties and searches for counterexamples demonstrating mutual unsatisfiability would be a powerful tool for alignment research. The properties go in; impossibility results come out.
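Over a small finite domain, the search is just an exhaustive check, which gives the idea its simplest possible shape. This is a deliberately trivial hypothetical sketch — the two "properties" are invented and contradictory by construction — but the structure (properties in, unsatisfiability witness out) is the one an automated search would scale up:

```lean
-- Hypothetical sketch: exhaustive search for a mutual-unsatisfiability
-- result over a finite toy domain. Real properties need cleverer
-- search, but the shape is the same.
def domain : List Nat := List.range 10

-- Two toy properties standing in for formalised legal/alignment
-- properties (deliberately contradictory for illustration).
def propA (x : Nat) : Bool := decide (x ≥ 5)
def propB (x : Nat) : Bool := decide (x < 5)

-- The search: collect every point satisfying both properties.
def satisfying : List Nat := domain.filter (fun x => propA x && propB x)

-- An empty result is the impossibility report for this pair.
#eval satisfying.isEmpty  -- true: no x in the domain satisfies both
```

An AI system doing this for real would replace exhaustive enumeration with guided counterexample search over infinite domains, which is where the research problem lives.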
What Exists and What Doesn't
Linus Torvalds's standing rule is that code that doesn't compile is not code. By that standard, here is an honest accounting.
What exists today:
- Lean 4 type signatures for the O-1A criteria, Section 25 properties, and the negotiation protocol meta-properties. These compile.
- A formal impossibility result for Section 25: a Lean 4 term constructing a counterexample showing that strict monotonicity, needs-based fairness, and consistency cannot simultaneously hold. The counterexample is a pair of hypothetical asset distributions in which satisfying any two of the three properties forces a violation of the third.
- A narrative definition of alignment margin as the maximum perturbation a system can absorb before any specified property is violated (Essay 4 gives the full treatment in terms of a metric space on the system's operating domain, a property predicate, and an infimum over property-violating perturbations).
- A set of open conjectures, clearly labelled as such.
What does not exist yet:
- Formal proofs for the process-level escape claim. The claim that negotiation protocols dissolve outcome-level impossibility results is an assertion, not a theorem. The proof obligation is stated in the open questions section.
- Efficient algorithms for computing alignment margin in high-dimensional systems. The definition is tractable for toy systems; for large language models, it requires approximation methods that have not yet been designed.
- Empirical validation of the alignment margin concept against real AI systems. The concept exists; the measurement infrastructure does not.
- A connection between the Lean 4 formalisation and a running, evaluated system. The proofs compile; the system is not built.
The distinction matters. Anyone who reads this series should be able to see exactly which claims rest on formal foundations and which rest on plausible reasoning that awaits proof.
Counterarguments
The strongest objections to this methodology deserve direct engagement, not dismissal.
"The thermodynamics analogy is strained." Thermodynamics was developed for physical systems governed by well-defined microscopic laws; the analogy to alignment assumes that AI systems also have some underlying "physics" that produces macroscopic alignment properties. That assumption is unproven. The analogy may be heuristically useful without being technically precise. I accept this. The analogy is offered as a framing device, not as a technical claim. If it turns out that no useful statistical-mechanics-style theory of alignment emerges, the alignment margin concept will need to be justified on its own terms.
"Process specification just moves the impossibility." If outcome-level properties conflict, one response is to specify process properties instead. But the objector might ask: what prevents the process-level properties from conflicting in turn? The honest answer is that I have not proved they do not. Essay 3 argues that transparency, convergence, non-dictatorship, and manipulation resistance are mutually satisfiable, but this is an assertion. The proof is open work.
"Alignment margin is a number, but what does it measure?" The formal definition in Essay 4 requires a metric on the space of perturbations and a predicate specifying which system states count as "aligned." Both are choices, not discoveries. A system optimised to maximise alignment margin on one choice of metric and predicate may not be aligned in any meaningful sense under a different choice. This is Goodhart's Law applied to the formal framework itself: alignment margin is a proxy for alignment, and optimising the proxy is not the same as achieving the thing. I take this seriously. Alignment margin is a measurement tool, not a guarantee. Choosing the right property predicate is prior to measuring margin, and that choice cannot be automated.
"Formalising alignment properties does not solve the meta-problem." Even if the formal machinery is correct, the specification of which properties to formalise is done by humans with limited foresight. A system aligned to the formal properties as specified may fail to be aligned to what we actually wanted if our specification was incomplete or mistaken. This is a version of the AI alignment problem that formal methods cannot dissolve: the difficulty lies not only in enforcing specified values but in specifying the right values in the first place. Nothing in this series resolves that. The contribution is to make the formal layer cleaner, not to answer the prior question of what should go in it.
"What about systems smarter than the specifier?" A sufficiently capable system might satisfy all formal properties as checked by the verifier while pursuing goals that the specifier did not anticipate and would not endorse. Formal verification can only check the properties it is given; it cannot check properties no one thought to specify. The more capable the system, the larger the gap between what was specified and what the system can optimise for. This is a genuine and unresolved limit of the methodology. The appropriate response is not to abandon formal specification but to treat it as one layer in a defence-in-depth approach — multiple overlapping safety layers — rather than as a complete solution.
Where This Leads
I am an engineer, not a philosopher and not an alignment researcher. I came to these questions because a legal system failed me in a way I could feel but not articulate, and I wanted to understand why. Lean 4 gave me the language. The impossibility result gave me the explanation. The connection to AI alignment gave me the reason to think the explanation matters beyond my own case.
The essays in this series are a public record of that thinking. They are not finished work. They are the initial formalisation, the first pass at types and properties and impossibility results, with much left to prove and more left to build. The Lean 4 code compiles in the sense that the type signatures are well-formed. The proofs are marked sorry — Lean's honest placeholder for "I have not proved this yet."
I will be spending some time in San Francisco shortly, visiting friends and enjoying the city's intellectual culture. Many of the people thinking about these problems, at Softmax, at Anthropic, and in the broader alignment community, are based there, and I welcome conversations with anyone interested in discussing these ideas over coffee. What I bring is a specific combination: the engineering background to speak control theory, the legal technology experience to ground abstract questions in concrete systems, and the curiosity to know that these questions are not academic.
If you are working on related problems, I would be glad to hear from you. The essays, the Lean 4 source files, and my contact information are at aguilar-pelaez.co.uk.
This is the fifth and final essay in a series on formal methods, legal reasoning, and AI alignment.
The series: When the Law Is a Type Checker · The Judge's Impossible Function · From Fixed Functions to Negotiation Protocols · Alignment Margin
Earlier work: The Soul and the Hands: A Third Path for AI Alignment · Power Without Promiscuity · Building Formal Reasoning into Legal Automation