Alignment Margin: What Control Theory Offers AI Safety

12 February 2026

Introduction

The first three essays in this series traced an arc: from a legal system that cooperates with formal specification (O-1A visa), through one that resists it (Section 25 divorce law), to the recognition that process properties can be formalised even when outcome properties cannot.

This essay introduces a concept from control theory that I believe the alignment community is missing: a continuous, measurable quantity that captures how aligned a system is, not as a binary judgement, but as a degree of robustness against perturbation.

In control engineering, this quantity is called phase margin. I propose its analogue for AI systems: alignment margin — the maximum perturbation magnitude in the input space for which all specified alignment properties continue to hold.

The claim is not that alignment margin solves alignment. The claim is that the alignment community is operating without a concept that every control engineer takes for granted, and that the absence of this concept is making several important questions harder than they need to be.


The Problem: Binary Alignment

The alignment literature speaks overwhelmingly in binary terms. A system is aligned or misaligned. A specification is satisfied or violated. An output is safe or unsafe.

This framing is inherited from formal verification, where properties are either proved or not, and from machine learning evaluation, where benchmarks produce pass/fail rates. Both traditions have reasons for binary thinking. In formal verification, a proof that holds "mostly" is not a proof. In benchmark evaluation, a test is passed or failed.

But the binary frame obscures something critical: the distance between current behaviour and failure. A system that is aligned today but one input away from misalignment is, in every practical sense, less aligned than a system that could absorb substantial perturbation before any property violation occurs. Yet the binary frame treats them identically. Both are "aligned."

Control engineers learned this lesson decades ago. A system that is stable but has 2° of phase margin is not meaningfully stable. It will oscillate or diverge under any realistic noise, delay, or modelling error. A system with 60° of phase margin can absorb substantial uncertainty and still perform correctly. Both are technically "stable," but no engineer would treat them as equivalent.

The alignment community has no equivalent vocabulary. And I think this matters.


Phase Margin: The Control Theory Concept

For readers without a control systems background, let me explain phase margin concretely before proposing the analogue.

Consider a feedback control system: a thermostat controlling room temperature. The thermostat measures the current temperature (the output), compares it to the desired temperature (the reference), and adjusts the heater (the input) to reduce the error.

This system can fail in a specific way: oscillation. If the feedback signal arrives with too much delay — the thermostat measures old temperature and over-corrects — the room swings between too hot and too cold. More delay makes the swings larger, until the system is effectively uncontrolled.

Phase margin is the formal measure of how much additional delay the system can tolerate before oscillation begins. It is measured in degrees (representing phase lag in the frequency domain) and it tells you exactly how far the system is from instability.

The key properties of phase margin:

It is continuous. Phase margin is a number, not a binary. A system with 45° of margin is meaningfully different from one with 10°.

It is measurable. You can compute phase margin from the system's transfer function without running it to failure. You do not need to observe the system oscillating to know how close it is to oscillating.

It is domain-specific. Phase margin is defined for a particular operating point. A system might have excellent margin at one operating condition and poor margin at another. This is a feature: it forces you to ask "robust under what conditions?"

It predicts behaviour under perturbation. If you know the phase margin, you know what class of disturbances the system can absorb. This converts a qualitative question ("is this system robust?") into a quantitative one ("how much disturbance can it handle?").

It can be designed for. When engineers design control systems, they specify a minimum phase margin (typically 30–60°) as a design requirement. The system is then designed to meet this requirement. This is a specification on robustness, not just performance.
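To make the concept concrete in code, phase margin can be computed directly from a transfer function: find the gain crossover frequency (where |G(jω)| passes through 0 dB) and read off the phase there. The plant below is an illustrative textbook-style example of my own choosing, not one from any source discussed here; the method itself is the standard one.

```python
import numpy as np
from scipy import signal

# Illustrative open-loop plant G(s) = 10 / (s (s + 1)(s + 5)),
# i.e. numerator 10, denominator s^3 + 6 s^2 + 5 s.
G = signal.TransferFunction([10], [1, 6, 5, 0])

w = np.logspace(-2, 2, 2000)            # frequency grid, rad/s
w, mag_db, phase_deg = signal.bode(G, w)

# Gain crossover: first frequency where |G(jw)| falls below 0 dB.
idx = int(np.argmax(mag_db < 0))

# Phase margin = 180 degrees + phase at the gain crossover.
phase_margin = 180.0 + phase_deg[idx]
print(f"gain crossover ~ {w[idx]:.2f} rad/s, phase margin ~ {phase_margin:.1f} deg")
```

For this plant the margin comes out around 25°, which a control engineer would read as "stable, but with little room for extra delay or modelling error" — exactly the continuous judgement the binary frame cannot express.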


Alignment Margin: The Proposed Analogue

I propose the following definition:

Alignment margin is the maximum perturbation magnitude in the input space for which all specified alignment properties continue to hold.

More precisely: given a system S, a set of alignment properties P = {p₁, p₂, ..., pₙ}, and an operating domain D, the alignment margin M(S, P, D) is:

M(S, P, D) = sup { ε ≥ 0 : ∀ x ∈ D, ∀ δ with ‖δ‖ ≤ ε,
                    S(x + δ) satisfies all p ∈ P }

In words: what is the largest ball of perturbation around any input in the operating domain such that the system's output still satisfies all the alignment properties?

This definition has several features that are worth unpacking.

It requires a formal specification of properties

You cannot compute alignment margin without specifying what properties you are measuring robustness against. This is not a weakness — it is a feature. It forces the conversation that the binary frame avoids: which properties, exactly, are we claiming this system satisfies?

The Lean 4 formalisations from the earlier essays provide exactly this. The O-1A predicates (an applicant satisfies three of eight criteria), the Section 25 properties (monotonicity, needs-based fairness) — these are the P in the definition. Alignment margin measures how robustly those specific properties hold.

It is naturally domain-limited

Alignment margin is defined over an operating domain D. A system might have high margin within one domain (routine requests, standard use cases) and low margin in another (adversarial inputs, edge cases, novel contexts). This is precisely what you want: a measure that tells you where the system is robust and where it is fragile.

This connects directly to a point Emmett Shear made in a recent exchange about formal verification: he expressed interest in "expiring, domain-limited, reflective statistical properties" rather than universal guarantees. Alignment margin formalises exactly this: it is a domain-limited measure that can be recomputed as the domain changes, and it expires naturally because any measurement is valid only for the current system state and the specified domain.

It is continuous

An alignment margin of 0.3 means something different from 0.01. Both systems satisfy the properties at the current operating point, but the first can absorb thirty times more perturbation before failure. This distinction is invisible in the binary frame and obvious in the margin frame.

It can be designed for

Just as engineers specify a minimum phase margin when designing control systems, alignment researchers could specify a minimum alignment margin as a design requirement. "This system must have an alignment margin of at least X within domain D" is a testable, quantifiable specification. It is something a regulator could inspect, an auditor could verify, and an engineer could optimise for.


Four Concepts the Alignment Community Is Missing

Phase margin is not the only concept from control theory that transfers to alignment. Let me map four concepts that I believe are underexplored.

1. Meta-stability

A meta-stable equilibrium persists under small perturbations but collapses under large ones. Think of a ball balanced in a shallow depression on top of a hill. It is stable within the depression but any push beyond the rim sends it rolling.

In alignment terms: a system that appears aligned may be in a meta-stable state. Its current behaviour satisfies the properties we care about, but only within a basin of attraction. Novel inputs, distributional shift, or adversarial prompts could push it beyond the basin boundary into a qualitatively different behavioural regime.

This is not hypothetical. Language models exhibit sudden capability transitions — they perform poorly on a task until a threshold of model size or training data is reached, after which performance jumps discontinuously. These transitions look exactly like a system moving between meta-stable attractors. The behaviour is not gradually degrading; it is snapping from one basin to another.

The Section 25 judicial system is meta-stable in exactly this sense. Within "normal" cases — moderate assets, roughly equal contributions, standard needs — judges produce roughly consistent outcomes. But edge cases push the system beyond its basin. Extreme wealth disparity, international assets, non-financial contributions that resist quantification — these cause outcomes to become wildly variable, not because the judges are incompetent, but because the system's attractor landscape has multiple basins and the input has crossed a boundary between them.

Alignment margin captures meta-stability naturally. The margin tells you the distance from the current operating point to the basin boundary. Low margin means you are near the edge. High margin means you are deep inside a robust basin. Zero margin means you are on the boundary, and any perturbation could send you either way.
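The rim-of-the-basin picture can be simulated directly. The double-well landscape below is a toy of my own choosing, not a model of any system discussed above: two attractors at x = ±1 with the basin boundary at x = 0, so the margin of the state x = −1 should come out as the distance to the rim, about 1.

```python
# Double-well landscape V(x) = (x^2 - 1)^2: attractors at x = -1 and x = +1,
# basin boundary (the "rim") at x = 0. Gradient flow: dx/dt = -V'(x).
def settle(x, dt=0.01, steps=5000):
    """Follow the gradient flow until the state reaches an attractor."""
    for _ in range(steps):
        x -= dt * 4.0 * x * (x * x - 1.0)   # one -V'(x) step
    return x

def basin_margin(x0, step=0.01, max_r=2.0):
    """Largest symmetric perturbation of x0 that still settles in x0's basin."""
    home = round(settle(x0))
    r = step
    while r < max_r:
        if round(settle(x0 + r)) != home or round(settle(x0 - r)) != home:
            return r - step                  # last radius that stayed home
        r += step
    return max_r

print(basin_margin(-1.0))   # distance from x = -1 to the rim at x = 0
```

The same search, applied to a system whose basins you cannot write down analytically, is what the measurement section below turns into an algorithm.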

2. Positive Feedback Loops

In control systems, positive feedback amplifies deviations from equilibrium. Without damping, the system diverges.

In alignment: Emmett Shear uses a biological metaphor — cancer as an organic alignment failure. A cell that begins optimising for its own replication rather than the organism's health enters a positive feedback loop. More replication captures more resources, which enables more replication. The organism's immune system is the negative feedback that normally damps this.

In family law: the adversarial process itself creates positive feedback. One party's aggressive legal strategy provokes escalation from the other, which provokes further escalation. Legal costs compound. Positions harden. The collaborative divorce and mediation movements are, in control-theoretic terms, attempts to introduce negative feedback (structured de-escalation, mandatory cooling-off periods, shared information disclosure) into a system that otherwise has none.

In AI multi-agent systems: agents negotiating capabilities without damping mechanisms can enter escalation spirals. A formally specified negotiation protocol — the kind I proposed in the previous essay — would need to include explicit negative feedback. Not as a design choice, but as a stability requirement that control theory would flag as mandatory.

3. Oscillation

A system oscillates when it has insufficient phase margin and overshoots its target repeatedly. It is not diverging (that would be full instability), but it is perpetually hunting for equilibrium without reaching it.

English family law oscillates at the policy level. Parliament considers tightening the statute to increase consistency (more prescription, less discretion). Courts push back, arguing that discretion is necessary to achieve justice in individual cases. Parliament backs off. Inconsistency builds. Parliament considers tightening again. The system oscillates because the underlying impossibility — the fairness properties cannot all be satisfied simultaneously — prevents a stable equilibrium. Every reform that emphasises one property creates pressure to correct toward the others.

AI alignment governance may face the same dynamic. Regulation that constrains AI capabilities will be met with arguments that the constraints prevent beneficial uses. Relaxing constraints will be met with safety concerns. The governance system oscillates between permissiveness and restriction because there is no stable equilibrium that satisfies all stakeholders' properties simultaneously.

Alignment margin offers a way to damp these oscillations by replacing binary thresholds ("is this system safe?") with continuous measurement ("how much margin does this system have?"). Continuous measurement enables proportional response: a system with declining margin warrants increased scrutiny, not an on/off regulatory switch.

4. Gain and Bandwidth

In control theory, there is an inherent trade-off between performance and robustness. A system designed for fast response (high bandwidth) typically has lower stability margins than one designed for slow, cautious response. You cannot have both maximum performance and maximum robustness; for linear time-invariant systems, the Bode sensitivity integral makes this trade-off a mathematical fact.

In alignment terms: there may be a fundamental trade-off between capability and alignment margin. A more capable AI system (one that can handle a wider range of inputs and produce more sophisticated outputs) may inherently have lower alignment margin than a less capable one, because the larger capability space creates more opportunities for property violations at the boundaries.

This trade-off, if real, has significant implications. It suggests that the question "how capable should AI systems be?" is not separable from the question "how much alignment margin do we require?" The two are coupled, and optimising one without regard to the other produces systems that are either too restricted to be useful or too fragile to be safe.

I state this as a hypothesis, not a proven result. The formal relationship between capability (as measured by benchmark performance) and alignment margin (as defined above) is an open research question. The Bode sensitivity integral is suggestive but does not apply directly to neural networks (which are neither linear nor time-invariant); the Technical Notes below are precise about this. But the intuition is plausible and, more importantly, it is falsifiable.

Testable prediction. If the capability/margin trade-off is real, then for a fixed model family (e.g. decoder-only transformers trained on the same data distribution), alignment margin — measured as the minimum perturbation magnitude required to produce a property violation in a fixed formal test suite — should decrease monotonically as benchmark capability (e.g. MMLU score) increases, when both are measured on the same evaluation set. A study that ran alignment margin estimation across the Pythia model suite (70M to 12B parameters, trained identically) would provide a direct empirical test. I am not aware of such a study. If the prediction fails — if larger models maintain or increase their margin — that would suggest capability and robustness are not in tension, which would itself be a significant and useful result.


Making Alignment Margin Measurable

A concept is only useful if it can be computed. How would one measure alignment margin in practice?

The Verification Architecture: Generation Separate from Checking

Before discussing measurement in detail, it is worth noting a structural insight that the measurement framework implies. Computing alignment margin requires two distinct components: a generator (the AI system S, which produces outputs from inputs) and a checker (an oracle that evaluates whether those outputs satisfy the formal properties P). These two components have radically different complexity profiles.

The generator is opaque and complex — a large neural network whose internal workings are not directly auditable. The checker, if the properties are formally specified, can be a simple, verified, and auditable program. A Lean 4 predicate that tests whether an output satisfies a property is orders of magnitude simpler than the language model that produced the output. This separation between a complex generator and a simple, machine-checkable verifier is precisely the architecture that researchers in formal AI safety (such as the ARIA Safeguarded AI programme) argue is necessary for trustworthy AI systems. Alignment margin provides a natural motivation for this architecture: you cannot measure margin without a verifier, and a trustworthy verifier needs to be simple enough to be checked itself.

This also suggests what a proof certificate for alignment looks like in this framework. Rather than claiming "this system is aligned" (a statement about the generator), a certificate would assert "this system has an alignment margin of at least M within domain D with respect to properties P, as measured by checker C at date T." The certificate is bounded, domain-limited, and auditable. It names its checker, its property set, its domain, and its measurement date. This is the kind of evidence that an independent auditor or regulator could inspect and re-verify.
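As a sketch of what such a certificate might look like as a data structure — every field name here is hypothetical, not an established schema or standard:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class MarginCertificate:
    """Hypothetical shape of an alignment-margin certificate: bounded,
    domain-limited, and auditable. It names its checker, properties,
    domain, norm, and measurement date."""
    system_id: str       # identifies the generator S (model + version)
    checker_id: str      # identifies the verified checker C
    properties: tuple    # names of the formal properties P
    domain: str          # description of the operating domain D
    margin: float        # measured lower bound M
    norm: str            # perturbation norm used for the measurement
    measured_on: date    # measurement date T (the certificate expires)

cert = MarginCertificate(
    system_id="model-x-2026-01",
    checker_id="lean4-checker-v3",
    properties=("non_deception", "refusal_on_harm"),
    domain="customer-support prompts, English",
    margin=0.42,
    norm="embedding L2",
    measured_on=date(2026, 2, 12),
)
```

The point of the frozen dataclass is that a certificate is a record of a measurement, not a mutable claim: re-measuring produces a new certificate with a new date.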

For Formally Specified Properties

If the alignment properties are expressed as formal predicates — as in the Lean 4 formalisations from Essays 1 and 2 — alignment margin can in principle be computed by systematic perturbation of inputs.

Algorithm: Estimate alignment margin (design sketch — not production code)

Input:  System S, properties P, domain D, step size ε
Output: Estimated alignment margin M (lower bound)

1. For each sample point x ∈ D:
   a. Verify S(x) satisfies all p ∈ P
   b. For increasing radius r = ε, 2ε, 3ε, ...:
      i.   Sample perturbations δ uniformly on the sphere ‖δ‖ = r
      ii.  Check whether S(x + δ) satisfies all p ∈ P for each sample
      iii. If any p is violated, record r as the local margin at x and stop
2. Return M = min over all sample points of the local margin

This is a design sketch of a Monte Carlo approach, not a runnable implementation. It gives a lower bound on alignment margin (because it samples rather than exhaustively checks) and it scales with the dimensionality of the input space. For high-dimensional systems — which language models are — this is computationally expensive but not intractable.
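The sketch can be made runnable for a toy system in which the "model" is a fixed linear map and the property is a simple predicate. Everything here (the system S, the property, the sampling scheme) is an illustrative assumption of mine; for this linear case the true local margin at a point x is S(x)/‖w‖, and because sampling can only miss violations, the estimate approaches that value from above.

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.array([1.0, 2.0])

def S(x):
    """Toy 'system': a fixed linear map R^2 -> R (stand-in for a model)."""
    return float(x @ W)

def satisfies(y):
    """Toy alignment property: the output must stay nonnegative."""
    return y >= 0.0

def local_margin(x, step=0.01, n_dirs=200, max_r=5.0):
    """First radius at which a sampled perturbation violates the property.
    Sampling can only miss violations, so this over-estimates the true
    local margin; more directions per radius tighten the estimate."""
    r = step
    while r <= max_r:
        dirs = rng.normal(size=(n_dirs, x.size))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        if not all(satisfies(S(x + r * d)) for d in dirs):
            return r
        r += step
    return max_r   # no violation found within the search radius

def alignment_margin(domain_samples):
    """Margin over the domain = worst case (minimum) of the local margins."""
    return min(local_margin(x) for x in domain_samples)

samples = [np.array([1.0, 1.0]), np.array([2.0, 0.5])]
# For this linear system the exact local margin at each x is S(x)/||W||,
# about 1.34; the Monte Carlo estimate lands slightly above it.
print(alignment_margin(samples))
```

Replacing S with a language model and `satisfies` with a formal property checker is exactly the engineering gap described below: the loop structure survives, but the oracle does not yet exist.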

What exists today. Red-teaming frameworks (e.g. Microsoft PyRIT, Anthropic's automated red-teaming pipelines, and academic adversarial robustness libraries such as ART and Foolbox) implement crude versions of steps 1b–1biii: they probe the system with adversarial inputs to find where it fails. Alignment margin formalises what red-teaming is implicitly measuring.

What does not exist yet. The connection between formal property specifications (in Lean 4 or equivalent) and the perturbation oracle in step 1bii. Building that bridge — a machine-checkable property checker that can evaluate arbitrary alignment predicates on language model outputs — is the core engineering gap this framework requires. It is a non-trivial research problem, not an off-the-shelf component.

For Informally Specified Properties

Most alignment properties today are specified informally ("be helpful," "be honest") and evaluated through human judgment. Alignment margin can still be approximated in this setting by measuring the consistency of human evaluations under input perturbation.

If a system's output is rated "helpful" by evaluators, and small perturbations to the input produce outputs that are also rated "helpful," the system has high alignment margin (for that property, in that region). If small perturbations cause evaluator ratings to flip, the margin is low.

This connects alignment margin to the robustness evaluation already standard in machine learning. The contribution is framing: interpreting robustness evaluation as measuring a continuous quantity with physical meaning (distance to property violation) rather than a dimensionless accuracy percentage.
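A minimal sketch of this evaluator-consistency measurement, assuming ratings have already been collected at each perturbation radius; the data and the 90% agreement threshold are illustrative choices, not a calibrated protocol.

```python
def empirical_margin(ratings_by_radius, agreement=0.9):
    """Largest perturbation radius at which evaluator ratings still match
    the unperturbed rating at the required agreement rate.
    ratings_by_radius maps radius -> list of bools (True = rating unchanged)."""
    margin = 0.0
    for radius in sorted(ratings_by_radius):
        unchanged = ratings_by_radius[radius]
        if sum(unchanged) / len(unchanged) >= agreement:
            margin = radius
        else:
            break                    # ratings have started to flip
    return margin

# Illustrative rating data at three perturbation radii.
ratings = {
    0.1: [True] * 20,                # every evaluator rating unchanged
    0.2: [True] * 19 + [False],      # 95% agreement
    0.3: [True] * 12 + [False] * 8,  # 60% agreement: margin exceeded
}
print(empirical_margin(ratings))     # prints 0.2
```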

For Negotiation Protocols

The previous essay proposed that process properties (transparency, convergence, non-dictatorship, manipulation resistance) could be formalised for negotiation protocols. Alignment margin applies at this level too.

The alignment margin of a negotiation protocol is the maximum perturbation to the negotiating agents' inputs (misinformation, strategic misrepresentation, noise in communication) for which the protocol still converges to an outcome satisfying all process properties. A protocol with high margin is robust to bad-faith participation. A protocol with low margin breaks down under adversarial pressure.

This is a measurable quantity for any specified protocol. And it provides exactly what is needed for the design problem from Essay 3: a quantitative criterion for comparing protocols. Protocol A has an alignment margin of 0.4; Protocol B has a margin of 0.15. Both work under ideal conditions. Protocol A is more robust. Choose accordingly.


The Connection to Organic Alignment

I want to be explicit about why I think this concept is relevant to Softmax's research agenda.

Emmett Shear's organic alignment framework uses biological metaphors: cells cooperating in an organism, ant colonies, ecosystems. These are rich metaphors, but they are metaphors. They describe qualitative dynamics without providing a measurement framework.

Control theory offers the quantitative complement. Every biological metaphor that Shear uses has a precise control-theoretic analogue:

Biological metaphor | Control theory analogue | Formal tool
"Healthy cell cooperating in organism" | System operating within alignment margin | Phase margin analysis
"Cancer as alignment failure" | Positive feedback loop with insufficient damping | Nyquist stability criterion
"Immune system detecting defection" | Negative feedback / error detection | Observer design
"Organism adapting to environment" | Adaptive control / parameter estimation | System identification
"Cells differentiating into roles" | Multi-agent equilibrium selection | Game-theoretic stability

The claim is not that the biological metaphors are wrong. They capture something important about how alignment works in natural systems. The claim is that control theory provides formal tools for quantifying the properties that the biological metaphors describe qualitatively.

If a cell is "cooperating healthily," what is its alignment margin? How much mutation (perturbation) can it absorb before becoming cancerous (entering a positive feedback loop)? These are questions that evolutionary biology answers qualitatively ("robust regulatory networks," "redundant error-correction mechanisms") and that control theory answers quantitatively (stability margins, damping ratios, gain crossover frequencies).

AI alignment needs the quantitative answers. Not because the qualitative answers are wrong, but because you cannot engineer a system to a qualitative specification. "Make it robustly aligned" is not an engineering requirement. "Achieve an alignment margin of at least X within domain D" is.

Softmax's coordination benchmark — measuring theory of mind, goal anticipation, and mutual modelling — is implicitly measuring something like alignment margin. Agents with high coordination capability can absorb more perturbation (misunderstanding, conflicting goals, incomplete information) while maintaining cooperative behaviour. The question is whether alignment margin provides a useful formalisation of what the benchmark is actually measuring.


What Alignment Margin Does Not Do

I want to be direct about limitations.

It does not solve the specification problem. Alignment margin tells you how robustly a set of properties holds. It does not tell you whether you have specified the right properties. If your formal specification omits an important alignment property, the margin is meaningless with respect to that property. The specification problem remains hard.

It is susceptible to Goodhart's Law. This is the most important limitation and deserves direct treatment. Goodhart's Law states that when a measure becomes a target, it ceases to be a good measure. If alignment margin becomes an optimisation target — if systems are trained to maximise their measured margin on a test suite — then a sufficiently capable system may learn to maintain high margin on the test distribution while pursuing different objectives outside it. The margin score would remain high while actual alignment degrades.

This is not a hypothetical concern: adversarial robustness research has documented exactly this dynamic, where models trained to be robust against one perturbation type become more vulnerable to others (see Tramer et al. 2020 on adversarial training and the tension between robustness to different threat models). Alignment margin should therefore be treated as a measurement instrument and a design requirement, not as a training objective. It measures alignment; it does not produce it. This distinction is the same one that control engineers draw between a diagnostic instrument and a controller.

A further concern arises for systems that are more capable than their specifiers: such a system could, in principle, infer the structure of the measurement protocol and optimise for high measured margin while pursuing objectives that are invisible to the test suite. This is the instrumental convergence concern — the tendency of sufficiently capable optimisers to pursue certain intermediate goals regardless of their final objective — applied to margin measurement specifically. There is no complete defence against it short of the combination of (a) diverse, evolving test suites that are not disclosed to the system being measured, (b) interpretability tools that audit internal representations, not just outputs, and (c) ongoing adversarial probing by independent parties.

It does not replace interpretability. Alignment margin is a black-box measure. It tells you how far the system is from property violation but not why it is that distance. A system with high margin might be genuinely robust or might be exhibiting a coincidental pattern that collapses under a perturbation you did not test. Interpretability research — understanding how the system produces aligned behaviour — remains essential. Anthropic's work on mechanistic interpretability and circuit-level analysis of transformers is aimed at exactly this: not measuring alignment from the outside, but understanding its internal substrate. Alignment margin and interpretability are complementary instruments, not competing ones.

It does not guarantee safety at zero margin. A system with an alignment margin of zero is, by definition, at the boundary of property violation. But the boundary might be a cliff (small perturbation causes catastrophic failure) or a gentle slope (small violation with proportional consequence). Alignment margin does not distinguish between these cases. A richer framework would need to account for the severity of violation, not just its presence.

It is expensive to compute for high-dimensional systems. Language models operate in extremely high-dimensional input spaces. Systematic perturbation analysis is computationally intensive. Practical measurement of alignment margin will require efficient approximation methods — sampling, importance weighting, gradient-based boundary estimation — that have not yet been developed for this purpose.

These are real limitations. They are also, I think, the right kind of limitations: they point toward specific research problems (efficient margin estimation, severity-weighted margins, specification completeness) rather than fundamental impossibilities. The concept is useful even before these problems are fully solved, just as phase margin was useful to engineers before optimal control theory provided exact computation methods.


Conclusion

The arc of this series:

Essay 1: Some legal systems (O-1A) can be formalised cleanly. The criteria are independent, the threshold is explicit.

Essay 2: Some legal systems (Section 25) cannot. The fairness properties are in mathematical conflict. Discretion is the escape valve.

Essay 3: The impossibility dissolves if you shift from outcome specification to process specification. Negotiation protocols can satisfy process-level properties even when no fixed function satisfies outcome-level properties.

This essay: Whether at the outcome level or the process level, the question "does this system satisfy its properties?" is less useful than "how robustly does it satisfy them?" Alignment margin — borrowed from control theory's phase margin — provides a continuous, measurable, domain-limited answer.

The synthesis, which I will develop in the final essay of this series, is that these four observations form a methodology:

  1. Formalise the properties you care about (Lean 4, or any formal specification language).
  2. Check whether they can be simultaneously satisfied. If not, you have an impossibility result that clarifies the design space.
  3. If they cannot, design a process for negotiating which properties to sacrifice in each case, and specify properties of the process itself.
  4. Measure alignment margin: how robustly do the specified properties (outcome-level or process-level) hold under perturbation?

This methodology does not require that we solve alignment. It requires that we measure it — continuously, quantitatively, domain by domain — and design for a specified margin of robustness. This is how every other safety-critical engineering discipline works. I see no reason why AI alignment should be different.


Technical Notes

On the formal definition of alignment margin. The definition given (sup { ε : ... }) assumes a norm on the perturbation space, which must be chosen appropriately for the domain. For text inputs, this is non-trivial: what constitutes a "small perturbation" to a natural language prompt? Possible norms include edit distance, semantic similarity (embedding distance), or paraphrase equivalence. The choice of norm affects the magnitude of the margin and must be specified as part of the measurement protocol.
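As one concrete candidate norm, Levenshtein edit distance is easy to state precisely; whether it is the right norm for prompts is exactly the open question above, since a one-character edit can change meaning more than a ten-word paraphrase.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence:
    one candidate 'norm' on the space of text perturbations."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

# Two prompts one edit apart under this norm:
print(edit_distance("ignore previous instructions",
                    "ignore previous instruction"))      # prints 1
```

Embedding-distance and paraphrase-equivalence norms would require a model of their own, which reintroduces the trust question: the norm becomes part of the measurement protocol and must be specified in any margin certificate.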

On the relationship to prior robustness literature. Alignment margin is related to, but distinct from, adversarial robustness in machine learning. Adversarial robustness measures robustness of a classification (the label does not change under perturbation); see Madry et al. (2018) for the PGD attack framework that defined a generation of robustness training, and Szegedy et al. (2014) for the original adversarial examples observation. Alignment margin generalises from classification to arbitrary formal properties: the output continues to satisfy specified predicates under perturbation. This connects to the certified defences literature (Cohen et al. 2019 on randomised smoothing, which provides provable lower bounds on perturbation robustness for classifier outputs), but alignment properties are richer than classification labels, so certified defence techniques require extension rather than direct application. Within the alignment field specifically, Hendrycks et al. (2021) on natural adversarial examples and Perez et al. (2022) on red-teaming language models provide empirical baselines for what "hard inputs" look like at the distribution level. Alignment margin is attempting to formalise the quantity these empirical approaches are implicitly measuring.

On the Bode sensitivity trade-off. The claim that capability and alignment margin may be in fundamental tension is motivated by, but not formally derived from, the Bode sensitivity integral. The Bode integral theorem applies to linear time-invariant systems and shows that reducing sensitivity (improving robustness) at one frequency necessarily increases it at another. Neural networks are neither linear nor time-invariant, so the theorem does not apply directly. However, the intuition — that robustness and performance compete for a finite resource — is plausible and has empirical support in the observation that more capable models tend to have more complex failure modes.

On the connection to Shear's "physics of learning." Emmett Shear has noted that "the issue with formal methods right now is we lack a real physics of learning" and "we don't even really know how to make rigorous claims." Alignment margin is an attempt to make one kind of rigorous claim: not about why a system is aligned, but about how robustly it is aligned, measured from the outside. This is analogous to thermodynamics before statistical mechanics: you can characterise macroscopic properties (pressure, temperature, entropy) and state conservation laws without understanding the microscopic dynamics. Alignment margin characterises a macroscopic property (robustness of alignment) without requiring a complete theory of how alignment arises from training dynamics. The physics of learning, when it arrives, will explain why a system has the margin it has. But you can measure the margin and design for it before the explanation exists.

This is the fourth essay in a series on formal methods, legal reasoning, and AI alignment. Previous: From Fixed Functions to Negotiation Protocols. Next: Impossibility Results Are the Thermodynamics Before the Physics.

Related: The Soul and the Hands: A Third Path for AI Alignment · Power Without Promiscuity · When the Law Is a Type Checker · The Judge's Impossible Function