Legal Reasoning Needs a Wind Tunnel

Hardening English-to-Lean translation with factorial testing, canonical IRs, and equivalence proofs

18 March 2026

The Problem No One Talks About

LegalLean can verify that a tax rule is correctly encoded in Lean 4. We proved that: 44 machine-verified theorems across three jurisdictions. But here is the question that keeps me up at night: what happens when a lawyer rephrases the rule?

"A taxpayer who earns more than $600 in freelance income must report it."

Now consider:

"Freelance income in excess of $600 must be reported by the taxpayer."

"Unless a taxpayer's freelance income is $600 or less, they must report it."

"No taxpayer who earns more than $600 in freelance income is excused from reporting it."

A competent lawyer would treat these as semantically equivalent. They describe the same obligation, the same threshold, the same consequence. But when you run each through a natural-language-to-Lean-4 pipeline, you do not always get the same formal term. Sometimes you get logically equivalent but structurally different propositions. Sometimes you get propositions that are not equivalent, where the quantifier scope shifted, or the negation attached to the wrong subexpression, or the exception was parsed as a separate rule rather than a proviso of the original.

If your system claims to do verified legal reasoning but produces different formal objects for paraphrases of the same law, you do not have verified legal reasoning. You have a fragile translation layer wearing a theorem prover as a hat.

The solution comes from an unexpected place: factorial experimental design, a methodology developed for manufacturing quality control and adapted by NIST for combinatorial software testing. It turns out that the same discipline that helps engineers find interaction defects in hardware configurations is exactly the right tool for stress-testing formal legal encodings.


What Exists Today and What Does Not

To be clear about the status of this work: LegalLean today has 44 machine-verified theorems across US tax law (IRC section 163(h)), immigration visa eligibility (O-1A extraordinary ability), and Australian telecommunications regulation (TCP Code). All 44 compile with zero sorry stubs. The FormalisationBoundary type for encoding open texture is implemented and working. Docker-verified reproducibility is confirmed.

What does not yet exist: the canonical semantic IR described below, the factorial test suite, the Lean-based equivalence oracle, and the self-verifying covering array. These are the next phase of the programme. This essay describes the engineering discipline we intend to apply, not capabilities we are claiming to have shipped. The 44 theorems are the foundation; the wind tunnel is the plan.


Why Legal Language Is an Interaction Problem

The instinct when testing a translation pipeline is to collect example sentences, translate them, and check whether the output compiles. This is necessary but radically insufficient, because legal text is intensionally dense: many distinct surface forms can carry the same or subtly different meanings. The difficulty is not any single linguistic feature. It is combinations of features.

Consider the factors that vary independently in statutory drafting: deontic modality ("must," "may," "is required to"), negation placement, quantifier type and scope, exception and proviso structure, temporal expressions, grammatical voice, threshold verbalisation, and vagueness or open texture.

If each factor has even 3 levels, a single legal clause generates 3⁸ = 6,561 surface variants. You cannot test them all. But the key empirical finding from NIST's combinatorial testing research is that most software failures are triggered by interactions of six or fewer factors, with the vast majority caused by 2-way or 3-way interactions. A carefully constructed covering array (a subset of test cases guaranteeing every t-tuple of factor levels appears at least once) of a few hundred cases can cover the interaction space that matters.

For legal language specifically, the interactions that matter most are:

Modality and negation. "Must not" versus "is not required to." These are legally distinct (prohibition versus absence of obligation) but syntactically adjacent. A pipeline that conflates them has a soundness bug.

Quantifier scope and exception. "Every person who is not exempt must..." versus "No person is required to... unless they are not exempt." Scope ambiguity multiplied by exception parsing is where formal systems routinely produce non-equivalent outputs.

Vagueness and deontic modality. "The minister may grant relief where it is reasonable." Change "may" to "must" and the discretion disappears. But does the open-textured "reasonable" survive? If your Lean encoding of "must" plus "reasonable" still contains a Discretionary wrapper, you have a type error in your legal semantics.

Temporal expression and negation. "Must be filed within 30 days, unless an extension has been granted." Does the exception negate the temporal bound, the filing obligation, or both? Different parses yield structurally different Lean terms.
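
The modality-and-negation interaction can be made concrete with a minimal Lean 4 sketch. The `Obligation` wrapper here is a hypothetical opaque constant, not LegalLean's shipped API; the point is only that the two surface forms land on structurally different propositions.

```lean
-- Hypothetical deontic wrapper (illustrative only, not LegalLean's API).
opaque Obligation : Prop → Prop

-- "must not file": an obligation whose content is a negation.
def mustNot (p : Prop) : Prop := Obligation (¬ p)

-- "is not required to file": the absence of an obligation.
def notRequired (p : Prop) : Prop := ¬ Obligation p

-- No general proof of `mustNot p ↔ notRequired p` exists.
-- A pipeline that emits the same term for both phrasings
-- has a soundness bug, not a style preference.
```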

A factorial design forces you to confront every one of these combinations systematically, rather than stumbling into them when a client's solicitor phrases a clause unexpectedly.


The Wrong Architecture: English Straight into Lean

There is a deeper reason this matters.

If you map raw English directly into Lean 4, every surface perturbation is allowed to shake the entire formal object. A pronoun can become a typing problem. A passive construction can become an ontology problem. A proviso can become a proof-search problem. Once that happens, debugging becomes opaque.

You do not know whether the failure came from parsing, legal interpretation, ontology binding, vagueness handling, or the final lowering into Lean syntax.

The better approach is denotational: introduce a semantic intermediate representation between English and Lean.

English → legal semantic IR → Lean 4 → proof obligations

That middle layer is where paraphrases should canonicalise. It is where "must," "shall," and "is required to" can converge if they are truly equivalent in context. It is where "reasonable care" can be tagged as open-textured rather than silently converted into a Boolean. It is where "the Secretary may waive" can be represented as discretion rather than determinised into entitlement.

Without that layer, the translator has to solve language, ontology, legal interpretation, and formalisation all at once. With it, each stage becomes inspectable. When two variants diverge, you can pinpoint exactly where: did they diverge at the IR (a language-understanding problem), or at the Lean lowering (an encoding-choice problem)?
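
A minimal sketch of what such an IR might look like, with illustrative constructor names rather than the shipped representation:

```lean
-- Hypothetical IR sketch; names are illustrative, not LegalLean's API.
inductive Modality where
  | obligation
  | permission
  | prohibition
  | discretion

inductive Condition where
  | threshold (amount : Nat)       -- crisp numeric bound
  | openTexture (phrase : String)  -- deliberately vague standard

structure Clause where
  modality    : Modality
  condition   : Condition
  consequence : String

-- "must report if income exceeds $600" and its paraphrases
-- should all canonicalise to the same IR value:
def reportClause : Clause :=
  { modality    := .obligation
  , condition   := .threshold 600
  , consequence := "report freelance income" }
```

Because the IR is a plain inductive type, "did two paraphrases converge?" becomes a decidable equality check at this stage, before any proof search happens.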


The Lean Kernel as Test Oracle

Here is why this approach is uniquely powerful for LegalLean rather than for conventional NLP benchmarks: Lean 4's type-checking kernel is the oracle.

For each pair of English variants that should encode the same legal proposition, you translate both to Lean 4 terms T1 and T2, then attempt to prove T1 ↔ T2. Three outcomes are possible:

Equivalence proved. The translation is paraphrase-invariant for this variation. Pass.

Both compile, but equivalence unprovable. Semantic divergence: same law, materially different formalisations. This splits into two sub-cases. Either the two English phrasings really do mean different things (genuine legal ambiguity, which should be surfaced as a typed uncertainty marker), or the English is unambiguous to a lawyer but the pipeline produced structurally different output (a translator defect, which should be fixed).

One compiles, the other fails. Coverage gap: the translator cannot handle this syntactic form.

In Lean 4, the equivalence check is direct:

-- Assumes deontic primitives are in scope, e.g.:
opaque Report : Prop
opaque Obligation : Prop → Prop

-- Two translations of "income > $600 → must report"
def t1 (income : Nat) : Prop := income > 600 → Obligation Report
def t2 (income : Nat) : Prop := ¬(income ≤ 600) → Obligation Report

-- The oracle: prove they are equivalent
theorem t1_iff_t2 (income : Nat) : t1 income ↔ t2 income := by
  simp [t1, t2, Nat.not_le]

When the proof goes through, you know the two translations are semantically identical. When it does not, you have found either a genuine legal distinction or a translator defect. The kernel does not guess. It checks.

Factorial design does not just find bugs. It classifies the failure space and tells you which factors and which interaction orders are responsible. That is engineering intelligence you cannot get from ad hoc prompt testing.


The Failure Mode That Matters Most

LegalLean's core thesis (as described in the previous essay) is that deliberate legal vagueness should be a first-class type, not erased. The FormalisationBoundary type explicitly represents where machine verification ends and human judgment begins.

The worst failure mode is therefore not compilation failure: it is false precision. This is when the input is vague, discretionary, or evidential, but the system emits a crisp decidable predicate anyway. A formal proof built on a silently determinised premise is worse than no proof at all, because it carries the imprimatur of mathematical certainty while encoding the wrong semantics.

Consider the difference between these two formulations:

"A person who has resided continuously in the country for five years qualifies for permanent residence."

"A person may be treated as qualifying where the decision-maker is satisfied that they have resided substantially in the country."

The first has a crisp threshold ("continuously", "five years") and a clear rule structure. The second introduces discretion ("may be treated"), an evidential standard ("the decision-maker is satisfied"), and an open-textured term ("substantially"). If the translator emits the same Lean predicate for both, it is not robust. It is flattening legally meaningful structure.

A factorial test suite that includes vagueness as a factor will catch this, because the comparison is built into the experimental design. Abstention is not weakness. It is soundness preservation. False precision is the real risk.


What We Should Measure

Not "percentage of clauses that compile." That metric is too weak. The right metrics for a hardened formal legal pipeline are:

Paraphrase-invariance rate. The fraction of variant pairs for which the kernel proves T1 ↔ T2.

Semantic divergence rate. Pairs where both terms compile but equivalence is unprovable, classified into genuine legal ambiguity versus translator defect.

Coverage gap rate. The fraction of variants the translator cannot turn into compiling Lean at all.

False precision rate. The fraction of vague, discretionary, or evidential inputs that are emitted as crisp decidable predicates.

If false precision rate is not a release gate, you are not taking your own thesis seriously.


A Concrete Example

Take the proposition: "An applicant qualifies for the visa if they have been physically present in the country for at least 183 days during the qualifying year."

Factorial variations:

  1. Base. "An applicant qualifies if they have been physically present for at least 183 days during the qualifying year."
  2. Passive voice. "The visa may be granted to an applicant who has been physically present for at least 183 days during the qualifying year."
  3. Exception form. "An applicant does not qualify unless they have been physically present for at least 183 days during the qualifying year."
  4. Universal quantifier. "Every applicant who has been physically present for at least 183 days during the qualifying year qualifies for the visa."
  5. Verbalised threshold. "An applicant qualifies if their days of physical presence in the qualifying year number one hundred and eighty-three or more."
  6. Discretionary + vague. "An applicant may qualify where the authority is satisfied that they were substantially present during the qualifying year."

Variants 1 through 5 should canonicalise to the same IR and produce provably equivalent Lean terms. The threshold encoding might differ between "greater than or equal to 183" and "greater than 182," but Lean's omega tactic closes that equivalence trivially.
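
That off-by-one equivalence can be checked directly. Assuming the hypothetical predicates below for the two threshold encodings, omega discharges it:

```lean
-- Two encodings of the 183-day threshold (variants 1 and 5):
def presentGE (days : Nat) : Prop := days ≥ 183
def presentGT (days : Nat) : Prop := days > 182

-- omega closes linear-arithmetic equivalences like this one.
theorem presentGE_iff_presentGT (days : Nat) :
    presentGE days ↔ presentGT days := by
  unfold presentGE presentGT
  omega
```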

Variant 6 is intentionally different. "May qualify" introduces discretion. "Substantially present" replaces a crisp 183-day threshold with an open-textured standard. "The authority is satisfied" introduces an evidential burden. The correct behaviour is to produce a structurally different Lean term, one that includes Discretionary, OpenTexture "substantially present", and RequiresEvidence (authority_satisfied) markers. If the system instead emits the same crisp predicate for all six, it has committed false precision. The factorial test catches this because the comparison is built into the design.
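
One way to sketch those markers in Lean, in the spirit of FormalisationBoundary (constructor names are illustrative, not LegalLean's shipped API):

```lean
-- Hypothetical markers; names illustrative, not LegalLean's API.
inductive LegalProp where
  | crisp (p : Prop)
  | discretionary (inner : LegalProp)
  | openTexture (phrase : String)
  | requiresEvidence (standard : String) (inner : LegalProp)

-- Variant 1: a crisp, decidable threshold.
def variant1 (days : Nat) : LegalProp :=
  .crisp (days ≥ 183)

-- Variant 6: discretion over an evidential, open-textured standard.
def variant6 : LegalProp :=
  .discretionary
    (.requiresEvidence "authority_satisfied"
      (.openTexture "substantially present"))

-- Emitting `variant1`'s shape for variant 6 is false precision:
-- the type system should make that mismatch visible.
```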


Making the Test Methodology Self-Verifying

There is a further move available that is, to my knowledge, novel in the legal formalisation literature.

The covering array is a finite combinatorial object: a matrix with rows (test cases) and columns (factors), satisfying the property that every t-tuple of factor levels appears in at least one row. That property is expressible in Lean 4. You can encode the covering array as a Lean structure and prove, within Lean itself, that it satisfies the required t-way coverage.
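
A sketch of the pairwise (2-way) case, with a hypothetical encoding rather than a shipped LegalLean structure:

```lean
-- A row assigns a level to each of k factors.
def Row (k : Nat) : Type := Fin k → Nat

-- 2-way coverage: every pair of distinct factors exhibits every
-- pair of levels (below the level bound) in at least one row.
def Covers2 {k : Nat} (levels : Nat) (rows : List (Row k)) : Prop :=
  ∀ i j : Fin k, i ≠ j →
    ∀ a b : Nat, a < levels → b < levels →
      ∃ r, r ∈ rows ∧ r i = a ∧ r j = b

-- Proving `Covers2 3 suite` for a concrete `suite` makes the
-- adequacy of the test suite itself a machine-checked claim.
```

The same pattern generalises to t-way coverage by quantifying over t-tuples of factors instead of pairs.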

At that point, not only are the legal theorems machine-verified, but the test methodology that validates the translation is machine-verified too. The claim "our robustness testing is adequate" becomes not an assertion but a proof. This is a natural extension of LegalLean's philosophy: if you are going to build formal verification infrastructure, verify the infrastructure.


Relation to Existing Work

This essay does not claim that formalising legal rules is new. Catala (Merigoux et al.) compiles tax statutes into executable code with a literate programming approach anchored to legislative text. Governatori's Defeasible Deontic Logic (DDL) computes conclusions from legal rules using defeat relations. LegalRuleML provides an interchange format for rule-based legal knowledge. Hou et al.'s L4L combines LLM agents with formal reasoning.

What none of these systems currently does is systematically test whether their formalisation is stable under paraphrase variation. Catala's compiler is anchored to one statutory text; it does not ask what happens when the same rule is expressed differently. DDL reasons over a fixed rule base; it does not test whether alternative English phrasings produce equivalent rule bases. The contribution here is not a new formalisation approach but a new hardening discipline: using factorial design to stress-test the translation layer and using Lean's kernel as the oracle that distinguishes genuine legal ambiguity from translator defects.


The Deeper Lesson

Legal language is not noisy because lawyers are bad at writing. It is variable because law operates across institutions, procedures, audiences, and edge cases. Any formal system worthy of the domain has to treat that variation as a first-class engineering problem.

This means moving beyond "we verified a theorem" to "we verified that the theorem is stable under the full combinatorial space of how lawyers actually express the same rule." It means introducing a canonical intermediate representation so that paraphrases converge before they reach the proof engine. It means using Lean not just as the target language but as the oracle that checks whether two translations agree. It means measuring false precision as seriously as you measure compilation success.

Formal legal reasoning needs better models. It also needs better experiments.

Not just examples. Not just benchmarks. Not just one more successful theorem.

It needs a wind tunnel.

Because the question is no longer whether a machine can formalise one legal sentence. The question is whether it can preserve meaning when the legal world says the same thing six different ways.

About the author: Eduardo Aguilar Pelaez is CTO and co-founder at Legal Engine Ltd and Honorary Research Fellow at the Department of Surgery & Cancer, Faculty of Medicine, Imperial College London. LegalLean is an open framework for verified legal reasoning built on Lean 4. The 44 machine-verified theorems span US tax law, immigration visa eligibility, and Australian telecommunications regulation. The framework is open source (MIT); the production formalisation pipeline is proprietary to Legal Engine Ltd. Read the previous essay: When a Computer Says 'You Owe Tax', Can You Check Its Reasoning?