LegalLean: When a Computer Says 'You Owe Tax', Can You Check Its Reasoning?

A verification layer for legal AI that makes machine reasoning auditable, not just fast

16 March 2026

The Accountability Gap in Legal AI

Imagine a government system tells you that your mortgage interest is no longer tax-deductible. Or that your visa application has been rejected. Or that your telecoms provider has violated a consumer protection obligation. You ask: why? The system produces an answer. But can anyone actually verify the chain of reasoning that led there?

This is the central problem with legal AI today. Systems that extract rules from statutes, assess compliance, or answer legal queries are getting very good at producing outputs. But none of them provides a verifier in the formal sense: a component that gives you a machine-checkable certificate of the reasoning, not just a plausible-sounding result.

LegalLean is an attempt to close that gap. It is a verification framework built on Lean 4, the same proof assistant used by mathematicians to formally verify theorems. The idea is straightforward: if legal reasoning can be encoded as types and proofs, then a computer can check whether the reasoning is internally consistent, whether a rule change actually altered the outcome, and precisely where human judgment is still required. Not as an assertion. As a proof.


What LegalLean Actually Is

LegalLean is not an AI system that reads statutes and answers questions. It is a verification layer that sits on top of legal formalisations and checks them for consistency. The distinction matters.

Think of it like a compiler. A compiler does not write your code; it checks that the code you wrote is well-formed. LegalLean does not formalise legal rules; it checks that the formalisations you (or a language model) produced are internally consistent and correctly typed. If the encoding says "home equity interest is deductible before 2018 and not deductible from 2018 onwards," LegalLean can verify that this is a structural property of the encoding that holds for every possible taxpayer satisfying the preconditions, not just the cases you happened to test.

The framework uses three interlocking ideas:

  1. Propositions as types. In Lean 4, a legal conclusion like "this interest payment is not deductible" is a type. A proof of that conclusion is a value of that type. If the value type-checks, the conclusion is verified. If it does not, the error is explicit and localised.
  2. Dependent types for thresholds. The type of evidence required can depend on the specific claim being made. "Meets 3 of 8 visa criteria" is not just a boolean; it is a type that can only be inhabited by providing a duplicate-free list of at least three criteria, each with its own evidence witness. The threshold is enforced at compile time.
  3. First-class formalisation boundaries. Where legal reasoning requires human judgment, LegalLean does not pretend otherwise. Instead, it represents that boundary as a typed, auditable object. More on this shortly.
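To make the first two ideas concrete, here is a minimal Lean 4 sketch. The names Payment and PreTCJA are toy stand-ins invented for illustration, not LegalLean's actual definitions:

```lean
-- Toy stand-ins for illustration only; not LegalLean's definitions.
structure Payment where
  amount : Nat
  year   : Nat

-- A legal proposition as a type: "this payment predates the TCJA".
def PreTCJA (p : Payment) : Prop := p.year < 2018

-- A proof is a value of that type; if it type-checks, it is verified.
example : PreTCJA ⟨10000, 2017⟩ := by unfold PreTCJA; decide
```

If the year were 2018 instead, no proof term could be produced and elaboration would fail with a localised error, which is exactly the behaviour idea 1 describes.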

The Key Insight: Open Texture as a Type

The philosopher H.L.A. Hart observed in 1961 that legal concepts have a "core of settled meaning" and a "penumbra of uncertainty." His canonical example: a rule prohibiting vehicles in a park. A car is unambiguously in. A toy pedal car? That requires a judgment about the rule's purpose. Hart called this irreducible property "open texture": no formalisation can eliminate the need for human judgment at the boundary.

Every prior system for formal legal reasoning treats open texture as a limitation imposed from outside. The formalisation goes as far as it can, and the rest is left to the practitioner, often silently. A system that returns "non-compliant" gives you no indication of whether that verdict rests on a formally verified ground or a contestable judgment call.

LegalLean treats open texture as a property to be represented inside the system, not worked around. The core type is FormalisationBoundary:

inductive FormalisationBoundary (P : Sort _) where
  | formal   (proof : P)                                : FormalisationBoundary P
  | boundary (reason : String) (authority : String)     : FormalisationBoundary P
  | mixed    (formalPart : String) (humanPart : String) : FormalisationBoundary P

Three constructors, three epistemic positions about a legal proposition:

  1. formal carries an actual proof term: the conclusion is machine-verified, end to end.
  2. boundary records that the proposition requires human judgment, naming why (reason) and which institution decides (authority).
  3. mixed splits the proposition into a formally verified part and a part that remains with a human adjudicator.

This is not just a data structure. It changes what it means to get an answer from a legal reasoning system. Instead of "non-compliant" with no further explanation, you get a typed result that tells you whether the rejection was machine-verified or requires a human adjudicator to confirm it. The distinction is auditable, not just claimed.
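One way a caller might consume such a result, assuming the FormalisationBoundary definition above; the helper name describe is illustrative, not part of the framework:

```lean
-- Surface the epistemic status of a result in an audit trail.
-- `describe` is an assumed helper name, not part of LegalLean.
def describe {P : Sort _} : FormalisationBoundary P → String
  | .formal _             => "machine-verified"
  | .boundary reason auth => s!"human judgment required ({auth}): {reason}"
  | .mixed f h            => s!"formal part: {f}; human part: {h}"
```

Because the match is exhaustive over the three constructors, a caller cannot forget to handle the case where human judgment is still pending; the type checker will reject any incomplete match.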


The TCJA Monotonicity Break: A Machine-Checked Proof

The best way to understand what this makes possible is through the headline result: a machine-verified proof that the Tax Cuts and Jobs Act 2017 removed a deduction.

Before 2018, taxpayers could potentially deduct home equity interest payments, subject to various conditions. The TCJA changed this: from 2018 onwards, home equity interest is not deductible. This is a non-monotonic change in the law: adding a new rule to the legal system caused a previously available permission to be withdrawn.

Non-monotonicity matters because it is a known source of errors in legal automation. A system trained on pre-TCJA law, or one that does not properly track which rules defeat which, might still report home equity interest as potentially deductible. Worse, the error might be silent: the system produces an answer, and you have no way to check whether it accounted for the rule change.

LegalLean's theorem monotonicity_break_general proves this break formally:

theorem monotonicity_break_general
    (payment : InterestPayment) (status : FilingStatus)
    (h_individual : payment.paidByIndividual = true)
    (h_secured    : payment.indebtedness.securedByResidence = true)
    (h_purpose    : payment.indebtedness.purpose = IndebtednessPurpose.other)
    (h_within     : withinHomeEquityLimit payment.indebtedness status = true)
    : (∃ r a, isDeductible payment ⟨2017⟩ status =
                FormalisationBoundary.boundary r a) ∧
      (isDeductible payment ⟨2018⟩ status =
                FormalisationBoundary.formal False) := by
  constructor
  · simp [isDeductible, h_individual, h_secured, h_purpose, h_within]
  · simp [isDeductible, h_individual, h_secured, h_purpose]

In plain terms: for all home equity interest payments by individual taxpayers on secured property within the applicable dollar limits, the pre-TCJA outcome (year = 2017) is boundary (potentially deductible; whether the property qualifies as a "qualified residence" requires IRS or Tax Court determination), while the post-TCJA outcome (year = 2018) is formal False (automatically rejected; machine-verified).

The proof is two lines of simp. That brevity might seem anticlimactic, but it reflects how Lean 4 works: the simp tactic unfolds the isDeductible function, substitutes the hypotheses, and evaluates the conditional branches. The non-triviality is in the universality. This is not a test case checking one taxpayer. It is a theorem quantifying over every possible InterestPayment and FilingStatus satisfying the four preconditions. A unit test checks one input. This theorem covers the entire input space. That is the difference between empirical testing and formal verification.
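The shape of the function that simp unfolds can be sketched in miniature. The names below are assumed for illustration; the paper's real isDeductible also threads the payer, security, purpose, and dollar-limit preconditions and the filing status:

```lean
-- Miniature, assumed sketch of the year-dependent branch; the real
-- `isDeductible` also checks the payer, security, purpose and limit.
def isDeductibleSketch (year : Nat) : FormalisationBoundary Prop :=
  if year ≥ 2018 then
    .formal False  -- post-TCJA: machine-verified rejection
  else
    .boundary "qualified residence determination" "IRS / Tax Court"

-- The monotonicity break in miniature: the two outcomes differ by
-- construction, and `rfl` checks each by evaluation.
example : isDeductibleSketch 2018 = .formal False := rfl
example : isDeductibleSketch 2017 =
    .boundary "qualified residence determination" "IRS / Tax Court" := rfl
```

In this reduced form the break is visible to the kernel as pure computation, which is why a simp-style proof can discharge the general theorem once the hypotheses pin down the branch taken.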


The Visa Case: Dependent Types Enforce Eligibility Thresholds

The O-1A extraordinary ability visa requires demonstrating that an applicant meets at least three of eight evidentiary criteria. This sounds like a simple threshold check. It is actually a natural fit for dependent types, and a good illustration of why the type system matters.

The naïve encoding would be a function that takes a list of criteria and returns true if the list has at least three elements. But that would allow a list with duplicates (counting the same criterion twice) or a list with three claimed criteria but no evidence for any of them. Neither should pass.

LegalLean's encoding makes these errors structurally impossible:

structure MeetsThreeCriteria (a : Applicant) where
  satisfied : List Criterion
  distinct  : AllDistinct satisfied
  cardinal  : satisfied.length ≥ 3
  evidence  : ∀ c ∈ satisfied, EvidenceFor (criterionProp a c)

A value of type MeetsThreeCriteria a can only be constructed if you provide: a list of criteria with no duplicates (AllDistinct satisfied), at least three of them (satisfied.length ≥ 3), and for each criterion on the list, an evidence witness (EvidenceFor (criterionProp a c)). The type checker enforces all three conditions at compile time. Constructing the value is the proof.
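The duplicate-freedom predicate admits a standard inductive encoding. The sketch below is one plausible shape, not necessarily the paper's actual AllDistinct; Lean's own List.Nodup expresses the same idea:

```lean
-- One plausible encoding; the paper's `AllDistinct` may differ.
-- A list is distinct when each head is absent from its own tail.
inductive AllDistinctSketch {α : Type} : List α → Prop
  | nil  : AllDistinctSketch []
  | cons {x : α} {xs : List α} :
      x ∉ xs → AllDistinctSketch xs → AllDistinctSketch (x :: xs)

-- The cardinality condition is a compile-time proof obligation:
example : ([1, 2, 3] : List Nat).length ≥ 3 := by decide
```

Counting the same criterion twice would require proving x ∉ xs for an x that is in xs, which is impossible, so the duplicate-counting failure mode is ruled out by construction rather than by a runtime check.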

The Kazarian two-part test that USCIS uses (Kazarian v. USCIS, 596 F.3d 1115, 9th Cir. 2010) is encoded as FormalisationBoundary.mixed: Part 1 (criteria count) is formal; Part 2 (holistic merits review) is boundary, with the authority field naming USCIS as the responsible institution. The theorem eligibility_always_mixed proves that this function always returns mixed, never formal or boundary alone. That structural constraint is a consequence of the Kazarian framework itself: Part 1 is always machine-checkable; Part 2 is always not. The theorem formalises this as a property of the encoding, not just a design intention.
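The Kazarian-shaped result can be sketched as follows, with assumed names (kazarianSketch and sketch_always_mixed are not the paper's identifiers), to show how "always returns mixed" becomes a provable property of the encoding:

```lean
-- Hedged sketch with assumed names (`kazarianSketch` is not the
-- paper's function): Part 1 is machine-checked, Part 2 is deferred.
def kazarianSketch (criteriaCount : Nat) : FormalisationBoundary Prop :=
  .mixed s!"Part 1: meets {criteriaCount} of 8 criteria (machine-checked)"
         "Part 2: final merits determination (USCIS)"

-- A structural property in the style of `eligibility_always_mixed`:
-- the result is `mixed` for every possible input.
theorem sketch_always_mixed (n : Nat) :
    ∃ f h, kazarianSketch n = .mixed f h := ⟨_, _, rfl⟩
```

The theorem is trivial here because the sketch has a single return point; in the real encoding the value of the statement is that it survives any future refactoring of the eligibility function, or fails loudly.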


What This Means for Society

Legal AI systems are already making consequential decisions. They assess compliance, advise on deductibility, screen applications, and flag violations. The question of who can check their reasoning is not academic.

Three things change when legal reasoning is verifiable.

Transparency. A FormalisationBoundary result tells you not just what the system decided, but why, and whether the decision is machine-checkable or requires human judgment. The system cannot hide a contested judgment call inside a confident-sounding "compliant" or "non-compliant."

Fairness. When the encoding is open and the proofs are auditable, affected parties can challenge the reasoning at the level of the code, not just the output. If the formalisation of "qualified residence" is wrong, that error is visible and correctable. With an opaque model, you can only argue with the answer.

Accountability. Regulators and courts increasingly demand that automated decisions be explainable. A machine-checked proof is the most rigorous form of explanation possible. It does not just assert that the conclusion follows; it provides a term that the Lean 4 kernel has verified. That verification is deterministic, auditable, and reproducible.

None of this eliminates human judgment. The boundary and mixed constructors exist precisely to mark where the system defers. The goal is not to automate the law entirely; it is to make the boundary between what a machine can certify and what requires a human clearly visible, and to have that boundary itself be a typed, auditable object rather than a gap.


The Open Core Model

LegalLean is open source. The framework, the case studies, and the Lean 4 proofs are all available on GitHub. The research pipeline that uses Perplexity to fill in sorry stubs, the Aristotle integration for automated theorem proving, and the comparison against prior systems are all documented and reproducible.

The production pipeline, however, is proprietary. The automated formalisation layer, which converts statutory text to Lean 4 encodings using language models, is part of Legal Engine's commercial offering. The verification layer can be open because its value is in trust: anyone can inspect the proofs, challenge the encodings, and propose improvements. The formalisation layer requires the legal domain expertise, tooling, and engineering investment that make it commercially viable.

This is the same model that underlies much of the formal verification ecosystem. The Lean 4 theorem prover itself is open source. Mathlib, the community mathematics library, is open source. The value of the tooling is inseparable from the community that builds, audits, and extends it. LegalLean aims to contribute to that community by making legal formalisation a concrete, tested domain, not just a theoretical possibility.


Further Reading


What's next: healthcare. Drug dosage rules (BNF), clinical guidelines (NICE), NHS pathways, and US insurance coverage decisions follow the same pattern as tax law: clear rules with contested boundaries. NICE evidence grading maps directly to FormalisationBoundary: Grade A evidence is formal; "consider" recommendations are boundary. If you work in healthcare compliance and are interested in verified clinical reasoning, get in touch.

About the author: Eduardo Aguilar Pelaez is CTO and co-founder at Legal Engine Ltd and Honorary Research Fellow at the Department of Surgery & Cancer, Faculty of Medicine, Imperial College London. Contact: edu@legalengine.co.uk. The full paper, source code, and case studies are available at legal-lean.aguilar-pelaez.co.uk. The framework is open source (MIT); the production formalisation pipeline is proprietary to Legal Engine Ltd.