Holding Water With Our Hands: Why Law Needs a Chaos Monkey

A chaos monkey for statute law: property-based testing as legislative eval

19 March 2026

Where This Sits

In the previous essay I argued that LegalLean needs a wind tunnel: factorial testing to harden the translation layer between English and Lean 4. That article attacks the middle of the pipeline, the morphism from natural language to formal type theory. It asks: when a lawyer rephrases a rule, does the formalisation remain stable?

This essay attacks the input stage. Once a statute has been formalised in Lean 4 and the translation is accepted as faithful, how do we know the law itself handles the space of real-world situations it was drafted to govern?

The answer: subject it to a chaos monkey. A combinatorial barrage of synthetic legal fact-patterns, generated by Monte Carlo sampling over the property space, evaluated against the formalisation. This is the legislative equivalent of evals in AI and of property-based testing in software engineering. It is complementary to the wind tunnel, not a replacement for it. The wind tunnel tests translation robustness (same law, many phrasings). The chaos monkey tests legislative robustness (same formalisation, many fact-patterns).

What Exists and What Does Not

To be clear about the status: LegalLean today has 44 machine-verified theorems, a working FormalisationBoundary type, and three case studies (US tax, immigration, Australian telecoms). The factorial test design described in the previous essay is a concrete plan with a phased implementation programme. What this essay proposes, the chaos monkey evaluation framework, does not yet exist. The FactPattern structure, the TriageResult type, the property-space sampler, and the coverage-map analysis are proposals, not shipped capabilities. The 44 theorems provide the foundation on which this can be built. This essay describes what comes next, not what we have today.


The Holding-Water Problem

A statute is a human attempt to partition an infinite fact-space into a finite number of outcomes: liable or not liable, eligible or ineligible, deductible or not deductible. Legislators draft these partitions in natural language, debate them in committee, vote on them, and hope the resulting text covers the cases that matter.

The operative word is hope.

Right now, law is like trying to hold water with cupped hands. The intention is good. Every finger is placed deliberately. But water leaks between them. Valuable cases seep through: edge cases the drafter never imagined, interaction effects between subsections, temporal boundaries that produce discontinuities nobody intended. We know the leaks exist because litigation exists. Every lawsuit that turns on "what does 'reasonable' mean here?" or "does a houseboat count as a qualified residence?" is water that escaped.

The analogy extends: our hands are the best tool evolution gave us, and for millennia they were the only option. But we now have vessels, Lean 4 and dependent type theory, that can hold liquid without gaps, provided we know where the gaps are. The chaos monkey's job is to find them.

There is a deeper philosophical distinction at work here. Legislators draft law intensionally: from idealised narratives, policy intent, and envisioned scenarios. The chaos monkey tests law extensionally: from observable input-output behaviour across the full space of possible facts. The gap between legislative intent and operational reality is precisely the gap between intensional drafting and extensional testing. No amount of careful hand-placement closes it. You have to pour water and watch where it leaks.


Evals for Law

In AI, an eval is a curated suite of inputs designed to probe a model's behaviour across a distribution of scenarios, including adversarial ones. The eval does not train the model. It characterises it. It answers: for which inputs does this system produce correct outputs, and for which does it fail?

The chaos monkey for statute law does exactly the same thing, but the model under test is the formalised statute and the inputs are synthetic fact-patterns: combinations of legally relevant properties describing a hypothetical case.

  AI Eval Concept     Legislative Chaos Monkey
  -----------------   ---------------------------------------------------
  Model under test    Lean 4 formalisation of a statute
  Eval input          Synthetic fact-pattern (Monte Carlo sampled)
  Expected output     Clear verdict or explicit sorry
  Failure mode        Silent gap, inconsistency, or unintended outcome
  Eval suite          Factorial or Monte Carlo sample over property space

This draws on a well-established engineering practice. Netflix's Chaos Monkey, introduced in 2011, randomly terminates production services to verify that the system handles failures gracefully. The insight is counterintuitive: you make the system more reliable by deliberately breaking it. The chaos monkey for law applies the same logic: you make a statute more robust by deliberately bombarding it with hard cases. The difference is that Netflix tests infrastructure resilience, while we test legislative coverage.


The Property Space

Every statute implicitly defines a property space: the Cartesian product of all legally relevant attributes that could affect the outcome. Consider the Tax Cuts and Jobs Act mortgage-interest deduction formalised in LegalLean:

Relevant properties:
  taxpayer_type    : {Individual, Corporation, Trust, Estate}
  loan_type        : {Acquisition, HomeEquity, Refinance, Construction}
  property_type    : {PrimaryResidence, SecondHome, Houseboat, RV, ...}
  principal_amount : Nat  (in dollars)
  taxable_year     : Nat
  filing_status    : {Single, MarriedJointly, MarriedSeparately, HeadOfHousehold}
  property_secured : Bool
  proceeds_used    : {BuyBuildImprove, Other, Mixed}
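The discrete factors above translate directly into Lean inductive types. A sketch whose constructor names mirror the listing (they are illustrative, not the LegalLean source):

```lean
-- Discrete factors from the property listing, as Lean inductive types.
-- A sketch mirroring the listing above, not the LegalLean source.
inductive TaxpayerType where
  | individual | corporation | trust | estate

inductive LoanType where
  | acquisition | homeEquity | refinance | construction

inductive FilingStatus where
  | single | marriedJointly | marriedSeparately | headOfHousehold

inductive ProceedsUse where
  | buyBuildImprove | other | mixed
```

These four factors alone give 4 × 4 × 4 × 3 = 192 combinations before the property type, the Bool flag, and the continuous properties enter.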

The full Cartesian product is large but enumerable for discrete factors and samplable for continuous ones. A chaos monkey does the following:

  1. Enumerates the factorial combinations for discrete properties (taxpayer type by loan type by filing status by ...)
  2. Samples continuous properties (principal amount, taxable year) from distributions designed to hit boundary conditions: just below $750,000, exactly at $750,000, just above; year 2017, 2018, 2025
  3. Constructs a fact-pattern for each combination
  4. Evaluates the Lean 4 formalisation on each fact-pattern
  5. Classifies the result into one of three buckets

The sampling strategy borrows from design of experiments. Fractional factorial designs cover the discrete space efficiently. Boundary-value sampling targets the exact points where statutes tend to break. And for the hardest cases, an LLM can be prompted to generate fact-patterns that a practising lawyer would find genuinely debatable, the legislative analogue of red-teaming in AI safety.
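The boundary-value step can be sketched in a few lines of Lean. The $750,000 figure is the acquisition-debt cap discussed in the LegalLean case study; the helper names are illustrative assumptions, not LegalLean API:

```lean
-- Boundary-value sampling: for each statutory threshold, emit the
-- value just below, exactly at, and just above it.
def boundaryValues (threshold : Nat) : List Nat :=
  [threshold - 1, threshold, threshold + 1]

-- Cross the principal-amount boundary cases with the years of
-- interest: just before, at, and after the TCJA effective date.
def boundarySkeletons : List (Nat × Nat) :=
  (boundaryValues 750000).flatMap fun amt =>
    [2017, 2018, 2025].map fun yr => (amt, yr)

#eval boundarySkeletons.length  -- 9 skeletons to flesh out into fact-patterns
```

Each pair then seeds a full fact-pattern by crossing it with the discrete factors.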


The Three-Way Triage

The chaos monkey's most consequential output is a triage of the entire case-space into three zones.

Green zone: machine-decidable. These are the fact-patterns where Lean 4 produces a definitive, machine-verified verdict. The Monotonicity Break Theorem from LegalLean is a canonical example: for every home equity interest payment by an individual taxpayer on secured property within applicable dollar limits, the post-2018 outcome is "definitely not deductible". No judge needed. No lawyer needed. No latency. No cost. The chaos monkey's job is to maximise this zone, not by changing the law, but by revealing how much of the law is already deterministic and simply has not been recognised as such because it is buried in prose.
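The shape of such a green-zone result can be shown with a toy decision rule. Everything here is illustrative (a two-constructor loan type and a hard-coded year threshold), not the LegalLean encoding:

```lean
-- Toy decision rule: post-2017, home equity interest is never
-- deductible; acquisition interest still is. Illustrative only.
inductive LoanKind where
  | acquisition | homeEquity

def toyDeductible (k : LoanKind) (year : Nat) : Bool :=
  match k with
  | .acquisition => true
  | .homeEquity  => year < 2018

#eval toyDeductible .homeEquity 2017  -- true  (pre-TCJA)
#eval toyDeductible .homeEquity 2024  -- false (the break: green zone)

-- A machine-checked instance; the actual LegalLean theorem is
-- universal over all post-2017 years and the full fact-pattern.
theorem homeEquity_2024_not_deductible :
    toyDeductible .homeEquity 2024 = false := by decide
```

The point is the theorem's form: a categorical, machine-checked verdict over a slice of the fact-space, with no residual discretion.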

Amber zone: explicit human determination. These are the fact-patterns where the formalisation hits a FormalisationBoundary.boundary: a statutory term that is deliberately vague, in the sense of Hart's open texture. "Qualified residence", "extraordinary ability", "substantial contribution". The formalisation does not say sorry. It says: I know exactly why I cannot decide this, and here is the authority who should. This is where judges should focus. Not on the green zone (a waste of human cognition and calendar time) and not on the red zone (which should not exist). The amber zone is the law working as designed: providing structured discretion to human adjudicators.

Red zone: sorry. These are the fact-patterns where Lean 4 returns sorry: the term has no defined boundary, the statute does not address the combination, or the interaction between two provisions produces a contradiction or gap. These are the leaks between the fingers.

Every red-zone case is a legislative defect report. "Section 163(h)(3)(B)(i) does not specify what happens when a taxpayer refinances an acquisition loan into a home equity loan mid-year and uses the proceeds partly to improve the property and partly to pay off credit card debt." Currently, this defect is discovered by a taxpayer, litigated through Tax Court, and resolved by judicial opinion, years later, at enormous cost. The chaos monkey finds it in milliseconds.


Legislative Margin

In a previous essay I defined alignment margin as the maximum perturbation a system can absorb before any specified alignment property is violated, borrowing from control theory's phase and gain margins. The chaos monkey provides the analogous measure for statutes: legislative margin.

Legislative margin is the proportion of the property space that falls in the green zone, weighted by the empirical frequency of each fact-pattern in actual case law. A statute with high legislative margin is robust: small changes in fact-patterns do not push cases from green to red. A statute with low legislative margin is brittle. This is measurable, comparable across statutes, and trackable over time as amendments are made.
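As a sketch, legislative margin is a small computation over triage output. The Zone type and Float weights here are illustrative stand-ins, each weight representing a fact-pattern's empirical frequency in case law:

```lean
-- Legislative margin as the case-law-weighted share of green-zone
-- outcomes. `Zone` and the Float weights are illustrative, not
-- LegalLean types.
inductive Zone where
  | green | amber | red

def legislativeMargin (results : List (Zone × Float)) : Float :=
  let total := results.foldl (fun acc p => acc + p.2) 0.0
  let greenMass := results.foldl (fun acc p =>
    match p.1 with
    | Zone.green => acc + p.2
    | _          => acc) 0.0
  if total == 0.0 then 0.0 else greenMass / total
```

A margin near 1 means most of the weighted fact-space is machine-decidable; a margin near 0 means the statute delegates, or silently drops, almost everything.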

Consider the contrast between two of LegalLean's case studies. The O-1A visa criteria produce a large green zone and a small amber zone, because the statute was designed with discrete categories, explicit thresholds, and binary predicates. The chaos monkey validates that design choice quantitatively. Section 25 of the Matrimonial Causes Act, by contrast, produces a massive amber zone and a small green zone, because the impossibility result proved in an earlier essay shows that the fairness properties are in mathematical conflict. The chaos monkey confirms what the impossibility proof established: this statute was designed to be almost entirely discretionary, and if you want to reduce judicial workload on divorce financial settlements, you need to amend the statute, not improve the formalisation.


The Latency Argument

The human cost of the current system is not just cognitive. It is temporal. A green-zone case that reaches a courtroom wastes months to years of calendar time between filing and adjudication, thousands of pounds in legal fees for both parties, and scarce judicial bandwidth that could be allocated to genuinely hard amber-zone cases. It also costs the parties certainty: they live in limbo while waiting for a verdict that a type-checker could have produced in 15 seconds.

The chaos monkey, by maximising the green zone and making the amber zone explicit, enables a triage system for legal disputes. Green-zone cases receive automated adjudication: instant verdict, machine-verified reasoning, full auditability. Amber-zone cases are fast-tracked to a human judge with the contested boundary precisely identified and all machine-decidable sub-questions already resolved. Red-zone cases are flagged as legislative gaps and referred to the appropriate reform body.

This is not replacing human judgment. It is concentrating it where it matters and eliminating the enormous waste of applying it where it does not.


The Feedback Loop

Here is the central thesis: the density of the red zone is a quantitative measure of legislative quality.

A statute with a large green zone and a small, well-delineated amber zone is a good statute. It decides what can be decided and delegates what cannot. A statute with a large red zone is a leaky statute. It fails silently on common fact-patterns, generating unnecessary litigation.

This creates a feedback loop that does not exist today:

  1. Draft the statute
  2. Formalise it in Lean 4
  3. Subject it to the chaos monkey (Monte Carlo plus factorial plus adversarial)
  4. Measure the green/amber/red zone ratios
  5. Identify the red-zone fact-patterns
  6. Amend the statute to address the gaps
  7. Re-formalise and re-test
  8. Iterate until the red zone is acceptably small

This is test-driven legislation: drafting law against a continuously evaluated suite of synthetic cases, analogous to test-driven development in software engineering. The chaos monkey does not tell legislators what the law should say. That remains a normative question. It tells them where the law says nothing at all, and lets them decide deliberately rather than discovering gaps through decades of litigation.


Implementation Sketch

In Lean 4, the chaos monkey's core types are straightforward:

-- A FactPattern is a structured record of legally relevant properties
structure FactPattern where
  taxpayerType   : TaxpayerType
  loanType       : LoanType
  propertyType   : PropertyType
  principalAmt   : Nat
  taxableYear    : Nat
  filingStatus   : FilingStatus
  propertySec    : Bool
  proceedsUsed   : ProceedsUse

-- The triage result: green, amber, or red
inductive TriageResult where
  | green : (verdict : Verdict) → TriageResult
  | amber : (boundary : FormalisationBoundary) → TriageResult
  | red   : (gap : String) → TriageResult

-- The eval: apply a formalised statute to a sampled fact-pattern
def chaosMonkeyEval
    (statute : FactPattern → TriageResult)
    (sampler : Nat → List FactPattern)
    (n : Nat) : List (FactPattern × TriageResult) :=
  (sampler n).map fun fp => (fp, statute fp)
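Folding the eval output into per-zone counts is equally small. A sketch using the same types, where CoverageMap is an assumed helper rather than a shipped capability:

```lean
-- Aggregate the eval output into per-zone counts.
-- CoverageMap is an assumed helper, not a shipped LegalLean type.
structure CoverageMap where
  green : Nat := 0
  amber : Nat := 0
  red   : Nat := 0

def coverage (results : List (FactPattern × TriageResult)) : CoverageMap :=
  results.foldl
    (fun acc pr =>
      match pr.2 with
      | TriageResult.green _ => { acc with green := acc.green + 1 }
      | TriageResult.amber _ => { acc with amber := acc.amber + 1 }
      | TriageResult.red _   => { acc with red   := acc.red + 1 })
    {}
```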

The output is a coverage map: for each region of the property space, we know whether the formalised statute produces a machine-verified verdict, an explicit deferral to human judgment, or silence. The silence is the signal.


Relation to Existing Practice

The idea of testing law against hypothetical scenarios is not new. Law students learn through hypothetical questions. The Law Commission conducts consultations with worked examples. Regulatory impact assessments model the expected effects of proposed legislation on representative populations. Draft bills are scrutinised by parliamentary committees using imagined fact-patterns.

What the chaos monkey adds is automation, exhaustiveness, and a formal oracle. Existing methods are qualitative: a human reads the statute, imagines a scenario, and reasons about the outcome. The chaos monkey is quantitative: a sampler generates thousands of fact-patterns, the Lean 4 type-checker evaluates each one, and the coverage map reveals where the statute succeeds, where it defers to human judgment, and where it is silent. The difference is the same as the difference between a code review and a test suite. Both are valuable. One scales.

The deeper contribution is not the testing itself but the triage output. The Law Commission can identify problem areas in a statute. It cannot, at present, produce a quantitative coverage map showing exactly which regions of the fact-space are machine-decidable, which require human judgment, and which are unaddressed. The chaos monkey can.


The Triptych

This article completes a triptych:

  1. LegalLean: can we formalise statute law in Lean 4? Yes, with 44 verified theorems across three jurisdictions.
  2. The Wind Tunnel: is the natural-language-to-Lean-4 translation robust? Test it by varying phrasing across factorial designs.
  3. The Chaos Monkey: does the formalised statute handle real-world cases? Test it by bombarding it with synthetic fact-patterns.

The three together constitute an end-to-end legislative verification pipeline. The formalisation is correct (LegalLean). The translation is robust (wind tunnel). The statute itself covers the case-space (chaos monkey). Each attacks a different failure mode. Each is necessary. None is sufficient alone.


The Formalisation Is Not the Law

A concern that must be addressed directly: the chaos monkey tests the formalisation, not the statute. If the formalisation is wrong, the chaos monkey validates the wrong thing.

This is correct, and it is not a bug. It is the architecture working as designed.

The formalisation is an interpretation of the statute, not the statute itself. Every formalisation embeds choices: how to model "qualified residence", where to draw the line between formal and boundary, which properties to include in the fact-space and which to omit. These choices are contestable. Different formalisers might produce different Lean 4 encodings of the same statute, just as different judges produce different interpretations.

The chaos monkey does not resolve this. What it does is make the interpretive choices visible and testable. If two competing formalisations of the same statute produce different coverage maps, that divergence is itself informative. It tells you where the statute is genuinely ambiguous at the level of formal structure, not just at the level of individual words.

Lawrence Lessig argued in 1999 that "code is law": the architecture of digital systems constrains behaviour as powerfully as any statute. The chaos monkey inverts this. It uses code to test law, not to replace it. The Lean 4 formalisation is not the authoritative legal text. It is a model that can be checked, challenged, and improved. The statute, in natural language, remains the law. The formalisation is an instrument for understanding it.

Who controls the formalisation therefore matters. LegalLean is open source precisely for this reason. If the encoding of "qualified residence" is wrong, anyone can inspect the Lean 4 source, challenge the modelling choice, and propose an alternative. The chaos monkey then tells you whether the alternative formalisation covers more of the fact-space. That is a better process than the current one, where interpretive choices are embedded in opaque AI systems and the affected parties have no way to inspect them.


From Archaic Hands to Engineered Vessels

We write laws with the best of intentions and the worst of tools. Natural language is a medium optimised for human persuasion, not for unambiguous specification of decision procedures over infinite fact-spaces. The result is predictable: valuable cases leak through the gaps, citizens wait years for verdicts that should take seconds, and judges spend their scarce attention on cases that never required human judgment in the first place.

The solution is not to replace human judgment. It is to stop wasting it. Formalise the statute in Lean 4. Subject the formalisation to a chaos monkey. Measure the green, amber, and red zones. Fix the red-zone gaps. Repeat.

The chaos monkey is not an attack on the law. It is the most respectful thing we can do for it. It says: we take this statute seriously enough to check whether it actually works.

When Lean 4 says "proven", trust the machine. When it says "human required", send the case to a judge, with the contested boundary precisely identified and all machine-decidable sub-questions already resolved. When it says sorry, send the gap to the legislature.

Holding water with our hands was the best we could do for centuries. It is no longer the best we can do.

About the author: Eduardo Aguilar Pelaez is CTO and co-founder at Legal Engine Ltd and Honorary Research Fellow at the Department of Surgery & Cancer, Faculty of Medicine, Imperial College London. LegalLean is an open framework for verified legal reasoning built on Lean 4. This is the third essay in the LegalLean series, following When a Computer Says 'You Owe Tax', Can You Check Its Reasoning? and Legal Reasoning Needs a Wind Tunnel.