Elan: Why AI Agents Need an Operating System, Not a Framework

Introducing a BEAM-native runtime where agents recover from crashes, prove their provenance, and coordinate at scale

16 March 2026

The Fragility Problem

Here is a thought experiment. You have an AI agent managing a legal research pipeline. It has been running for forty minutes: reading case law, cross-referencing statutes, building a structured memorandum. Then the LLM provider returns a 503. The HTTP connection drops. The agent process dies.

What happens next?

In most mainstream agent frameworks, the default answer is: you start over. Some frameworks have added checkpointing (LangGraph, for instance, introduced persistent state in 2024), but these are bolted onto runtimes that were not designed for it. The checkpointing is opt-in, often requires external infrastructure, and does not extend to process-level fault isolation. The forty minutes of work are gone, or at best partially recoverable if the developer anticipated the failure and configured persistence in advance. The agent had no durable state by default. It was a long-running function call pretending to be a system.

This is not a bug in any particular framework. It is a consequence of building agents on top of runtimes that were never designed for long-lived, stateful, concurrent processes. Python's asyncio, Node's event loop, Go's goroutines: these are excellent for request-response workloads. They are structurally wrong for autonomous agents that run for hours, coordinate with other agents, and must survive infrastructure failures without losing work.

Elan is my attempt to fix this. It is a multi-agent runtime built on the BEAM virtual machine, the same technology that powers WhatsApp, Discord's real-time infrastructure, and a large share of the world's telecommunications switching fabric. The thesis is simple: the problem we call "agent reliability" is actually the problem that telecom engineers solved forty years ago, and we should use their solution.


Why BEAM

The BEAM is the virtual machine that runs Erlang and Elixir. Erlang was designed at Ericsson in the late 1980s for telephone switches, systems that must handle millions of concurrent connections, never go down, and recover gracefully when individual components fail. These are exactly the properties that autonomous agents need.

Three features of the BEAM matter here.

Lightweight isolated processes

A BEAM process is not an OS thread. It is a user-space construct that costs roughly 2KB of memory and is scheduled preemptively by the VM. You can run millions of them on a single node. Each process has its own heap, its own garbage collector, and its own failure domain. When one process crashes, no other process is affected. There is no shared mutable state to corrupt.

In Elan, every agent is a process. One agent per process. No shared memory, no locks, no race conditions on agent state. If you need ten thousand agents coordinating on a research task, that is ten thousand processes, each independently scheduled, each independently recoverable.
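This isolation is the plain BEAM primitive, not Elan-specific machinery. A minimal script (runnable with elixir isolation.exs) shows one process crashing while a sibling keeps its private state:

```elixir
# A minimal demonstration of process isolation: one process crashes,
# the other keeps its state. Plain BEAM, no Elan code involved.
defmodule IsolationDemo do
  def run do
    # An agent-like process holding private state in its own heap.
    survivor =
      spawn(fn ->
        receive do
          {:get, from} -> send(from, {:state, %{notes: ["case law", "statutes"]}})
        end
      end)

    # A sibling process that dies. Unlinked processes are unaffected
    # by each other's crashes; no trapping or handling is needed.
    doomed = spawn(fn -> raise "malformed LLM response" end)

    Process.sleep(50)

    # The crash of `doomed` has no effect on `survivor`.
    send(survivor, {:get, self()})

    receive do
      {:state, state} -> {Process.alive?(doomed), state}
    end
  end
end

IO.inspect(IsolationDemo.run())
# => {false, %{notes: ["case law", "statutes"]}}
```

The crash report from the doomed process appears in the logs, which is exactly the point: failures are visible, contained, and attributable.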

Supervision trees

This is the key insight from Erlang/OTP that the AI agent community has not yet absorbed. In OTP, you do not try to prevent crashes. You assume they will happen and build a hierarchy of supervisors that restart failed processes according to explicit strategies. The supervisor itself is a process. If it crashes, its supervisor restarts it. The tree is arbitrarily deep.

Joe Armstrong, the creator of Erlang, called this "let it crash." The phrase sounds reckless until you understand what it means in practice: instead of defensive programming that tries to handle every possible error (and inevitably misses some), you write the happy path and let the supervision tree handle recovery. The result is simpler code that is paradoxically more reliable, because the recovery logic is separated from the business logic and is itself tested and supervised.

For AI agents, this is transformative. An agent that encounters a malformed LLM response does not need elaborate error-handling code. It crashes. Its supervisor restarts it. The agent reconstructs its state from the event log and resumes from the last checkpoint. The crash is visible in telemetry. The recovery is deterministic. And crucially, no other agent in the system is affected.
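A minimal supervision sketch, using a plain OTP Supervisor rather than Elan's actual tree (the module and message names here are illustrative):

```elixir
# A worker that crashes on bad input and is restarted by its supervisor.
# Illustrative names; not Elan's supervision tree.
defmodule FlakyAgent do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def init(:ok), do: {:ok, %{attempts: 0}}

  # Crashes the process; the supervisor's :one_for_one strategy restarts it.
  def handle_cast(:bad_response, _state), do: raise "malformed LLM response"
end

{:ok, _sup} = Supervisor.start_link([FlakyAgent], strategy: :one_for_one)

first_pid = Process.whereis(FlakyAgent)
GenServer.cast(FlakyAgent, :bad_response)
Process.sleep(100)

# The restarted process re-registers the name: same identity, new pid.
IO.inspect(Process.whereis(FlakyAgent) != first_pid)
# => true
```

No error-handling code appears in FlakyAgent. The crash is loud, the recovery is automatic, and the two concerns never mix.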

Hot code upgrades

The BEAM can replace running code without stopping the system. In a telecom switch, you cannot take the system offline to deploy a patch. In a long-running agent system, you face the same constraint: you cannot kill all agents, update the code, and restart them without losing their accumulated state and context. The BEAM's ability to load new modules while processes continue running means Elan can evolve without interrupting work in progress.

Elan is not the first project to recognise these properties. The Elixir ecosystem has been moving toward AI workloads through Nx (numerical computing and ML on the BEAM), Bumblebee (pre-trained model serving), and Livebook (interactive notebooks). Phoenix PubSub provides the message routing substrate that Elan's inter-agent communication builds on. What has been missing is a runtime layer that combines these capabilities with durable state, provenance, and policy into a coherent agent lifecycle. That is the gap Elan targets.


The Four Invariants

Elan is built around four architectural invariants. Every design decision is tested against them.

1. Durable state

Agent state is persisted as an append-only event log with periodic checkpoints. Every state transition is a recorded event. Recovery replays the event log from the last checkpoint. For replay to be deterministic, all non-deterministic inputs (wall-clock timestamps, LLM responses, external API results) must be captured in the log alongside state transitions. Elan records these as first-class events (LlmResponseReceived, ToolExecuted with result payload), so that replay reconstructs identical state without re-issuing external calls. This is event sourcing applied to agent cognition.
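The recording-and-replay idea can be sketched in a few lines. This is illustrative, not Elan's API; only the LlmResponseReceived event name comes from the text above:

```elixir
# A sketch of deterministic replay: during live execution the LLM is
# called and the response recorded as an event; during replay the
# recorded event is returned instead, so the same log always
# reconstructs the same state without re-issuing external calls.
defmodule ReplaySketch do
  # Live mode: perform the call and append the result to the log.
  def llm_call(prompt, log, call_fn) do
    response = call_fn.(prompt)
    event = %{type: "LlmResponseReceived", prompt: prompt, response: response}
    {response, log ++ [event]}
  end

  # Replay mode: consume the next recorded event; the provider is never hit.
  def replay_llm_call(prompt, [%{type: "LlmResponseReceived", prompt: p, response: r} | rest])
      when p == prompt do
    {r, rest}
  end
end

# Live run records the non-deterministic input...
{resp, log} = ReplaySketch.llm_call("summarise", [], fn _ -> "memo draft" end)

# ...and replay reproduces it exactly, with no external call.
{replayed, []} = ReplaySketch.replay_llm_call("summarise", log)
IO.inspect(resp == replayed)
# => true
```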

The implementation builds directly on Erlang/OTP's :gen_statem behaviour. Each agent has an explicit finite state machine with validated transitions. You cannot reach a state without going through the prescribed transition. The state machine is the contract between the agent's current behaviour and its recovery behaviour: if the FSM says "I am in state :executing," then recovery knows exactly what context to restore.

The following is extracted from the working module (lib/elan/agent_process.ex), simplified for readability:

defmodule Elan.AgentProcess do
  @behaviour :gen_statem

  # All events are delivered to handle_event/4 rather than per-state callbacks
  @impl :gen_statem
  def callback_mode, do: :handle_event_function

  # Allowed transitions define the FSM graph
  @allowed_transitions %{
    booting: [:idle],
    idle: [:planning],
    planning: [:executing, :failed],
    executing: [:completed, :failed],
    completed: [],
    failed: []
  }

  @impl :gen_statem
  def handle_event(:cast, {:transition_state, event}, state, data) do
    desired = Map.get(event, :to)
    allowed = Map.get(@allowed_transitions, state, [])

    if desired in allowed do
      emit_event("AgentStateTransitioned",
        %{agent_id: data.agent_id, from: state, to: desired})
      {:ok, checkpoint_id} =
        Elan.CheckpointStore.write_checkpoint(data.agent_id, %{state: desired})
      {:next_state, desired, %{data | checkpoint_ref: checkpoint_id}}
    else
      emit_event("AgentStateTransitionBlocked",
        %{agent_id: data.agent_id, from: state, to: desired})
      # Reject the transition: keep both the current state and data unchanged
      :keep_state_and_data
    end
  end
end

2. Git-native provenance

Every change an agent makes to the file system happens on a git branch. The branch is the agent's workspace. The commit history is the agent's audit trail. When multiple agents collaborate, their work is isolated by branch and merged through standard git operations. Conflicts are visible, attributable, and resolvable.

This is not just version control for convenience. It is provenance as a first-class architectural property. Any change can be traced to the agent that made it, the event that triggered it, and the policy that authorised it. The Merkle tree structure of git means this provenance is cryptographically verifiable. You do not need to trust the agent's self-report of what it did; you can verify it against the commit graph.

Elan uses git worktrees to give each agent its own working directory without duplicating the repository. Worktrees are lightweight (they share the object store) and provide full filesystem isolation. An agent working in its worktree cannot accidentally modify another agent's files.
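One plausible shape of the worktree operation, shelling out to git from Elixir. The module, function, and branch names are assumptions for illustration, not Elan's coordinator API:

```elixir
# Illustrative only: giving an agent its own worktree by shelling out
# to git. Names here are assumptions, not Elan's actual coordinator.
defmodule WorktreeSketch do
  # Creates a branch and a worktree for the agent; the worktree shares
  # the repository's object store, so creation is cheap.
  def create(repo_path, agent_id) do
    branch = "agent/#{agent_id}"
    dir = Path.join([repo_path, ".worktrees", agent_id])

    case System.cmd("git", ["worktree", "add", "-b", branch, dir],
           cd: repo_path, stderr_to_stdout: true) do
      {_out, 0} -> {:ok, %{branch: branch, dir: dir}}
      {out, code} -> {:error, {code, out}}
    end
  end

  # Removing the worktree leaves the branch, and its audit trail, intact.
  def remove(repo_path, agent_id) do
    dir = Path.join([repo_path, ".worktrees", agent_id])
    {_out, 0} = System.cmd("git", ["worktree", "remove", dir], cd: repo_path)
    :ok
  end
end
```

Because the branch survives worktree removal, an agent's commit history remains inspectable long after its working directory is gone.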

3. Policy-governed tool execution

Before an agent executes any tool, the policy engine checks whether the agent has the required capability. This is the "hands" verification I described in The Soul and the Hands, implemented as a runtime system rather than a theoretical framework.

defmodule Elan.PolicyEngine do
  use GenServer

  defstruct policy_version: "v0", allow: MapSet.new()

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl GenServer
  def init(_opts), do: {:ok, %__MODULE__{}}

  def check_capability(actor, capability, context \\ %{}) do
    GenServer.call(__MODULE__, {:check_capability, actor, capability, context})
  end

  @impl GenServer
  def handle_call({:check_capability, _actor, capability, _context}, _from, state) do
    allowed = MapSet.member?(state.allow, capability)
    decision = %{capability: capability, allowed: allowed,
                 decided_at: DateTime.utc_now()}
    {:reply, decision, state}
  end
end

The policy check is not advisory. It is a gate. An agent whose allowlist does not include :filesystem_write gets a %{allowed: false} decision, and the tool runner refuses to proceed. The denial is logged as an event, visible in telemetry, and available for audit.

The design borrows from capability-based security as implemented in systems like seL4 and CHERI, though it is important to be precise about the gap: seL4's capability model is backed by a machine-checked refinement proof; Elan's policy engine is a runtime enforcement mechanism in application code. The architectural pattern is the same (capabilities checked before execution, unforgeable by the subject), but Elan does not yet have formal verification of the enforcement layer itself. That is an explicit goal, not a current claim.
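The gate's composition with a tool runner can be sketched as follows. The runner and the injected check function are illustrative, not Elan's actual interfaces:

```elixir
# A sketch of the gate in front of tool execution. The check function is
# injected to keep the example self-contained; in Elan it would be the
# policy engine. Shapes here are illustrative, not Elan's API.
defmodule ToolRunnerSketch do
  # `check` returns %{allowed: boolean, ...}; the tool closure only runs
  # when the capability is present in the allowlist.
  def run(capability, check, tool_fn) do
    case check.(capability) do
      %{allowed: true} ->
        {:ok, tool_fn.()}

      %{allowed: false} = decision ->
        # The denial itself is returned (and, in Elan, logged as an
        # auditable event); the tool never executes.
        {:denied, decision}
    end
  end
end

allow = MapSet.new([:filesystem_read])
check = fn cap -> %{capability: cap, allowed: MapSet.member?(allow, cap)} end

{:denied, decision} = ToolRunnerSketch.run(:filesystem_write, check, fn -> :wrote end)
IO.inspect(decision.allowed)
# => false
```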

4. Idempotent side effects

Every tool execution is tracked with a unique identifier. If an agent crashes mid-execution and restarts, the recovery process checks whether the tool call was already completed. If it was, the result is retrieved from the log rather than re-executed. This makes recovery safe: you never get duplicate emails sent, duplicate files created, or duplicate API calls made because an agent restarted.

Idempotency is not optional in a system where crashes are expected. Without it, "let it crash" becomes "let it crash and hope nobody notices the side effects."
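The mechanism can be sketched with a map standing in for the durable log (the shapes are illustrative, not Elan's store API):

```elixir
# A minimal sketch of idempotent tool execution: results are keyed by a
# unique execution id, and a restarted agent retrieves the recorded
# result instead of re-running the side effect.
defmodule IdempotencySketch do
  def execute(log, exec_id, effect_fn) do
    case Map.fetch(log, exec_id) do
      # Already completed before the crash: return the recorded result.
      {:ok, result} ->
        {:replayed, result, log}

      # First execution: run the effect and record it under exec_id.
      :error ->
        result = effect_fn.()
        {:executed, result, Map.put(log, exec_id, result)}
    end
  end
end

# The side effect runs once; the post-crash retry is answered from the log.
{:executed, :email_sent, log} =
  IdempotencySketch.execute(%{}, "exec-42", fn -> :email_sent end)

{:replayed, :email_sent, _log} =
  IdempotencySketch.execute(log, "exec-42", fn -> raise "must not re-run" end)

IO.puts("no duplicate side effect")
```

The essential property is that the execution id is assigned before the effect runs and persisted with the result, so the check-then-retrieve path is safe across any crash point.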


What This Enables

The combination of these four invariants enables agent behaviours that current frameworks cannot support.

Long-running autonomous tasks. An Elan agent can run for hours or days without losing state. A legal research agent that spends six hours building a comprehensive case analysis does not lose its work when the LLM provider has an outage. It resumes from the last checkpoint when the provider comes back.

Multi-agent coordination without state corruption. Because each agent is an isolated process with its own event log and git branch, thousands of agents can work concurrently without clobbering each other's state. Coordination happens through message passing (the BEAM's native communication primitive), not shared memory. Mutex-based deadlocks are eliminated because there are no locks. Circular message dependencies remain possible in principle (A waits for B, B waits for A), but a call that times out crashes the waiting process, its supervisor restarts it, and the hang becomes a recoverable crash.

Auditable decision chains. Every decision an agent makes, every tool it calls, every state transition it undergoes is recorded in the event log. This is not logging in the traditional sense (unstructured text written to stdout). It is a typed, queryable record of everything that happened. For regulated industries like legal services and healthcare, this audit trail is not a nice-to-have; it is a compliance requirement.

Safe recursive agent spawning. Agents can spawn sub-agents to handle sub-tasks. The sub-agents are supervised by the parent agent's supervisor. If a sub-agent crashes, the supervisor decides whether to restart it, escalate, or abort the sub-task. The parent agent is notified and can adapt. This is recursive composition with explicit failure handling, not the "fire and forget" pattern that current multi-agent systems use.
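In OTP terms, this is what DynamicSupervisor provides. A minimal sketch follows; Elan's actual spawn API may differ:

```elixir
# A sketch of supervised sub-agent spawning via DynamicSupervisor,
# the OTP primitive for starting children at runtime. Illustrative
# names; not Elan's spawn API.
defmodule SubAgent do
  # :transient means the supervisor restarts the sub-agent only on
  # abnormal exit; normal completion is left alone.
  use GenServer, restart: :transient

  def start_link(task), do: GenServer.start_link(__MODULE__, task)
  def init(task), do: {:ok, %{task: task}}
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

# The parent spawns a supervised sub-agent for a sub-task. If the child
# crashes, the supervisor restarts it rather than losing it silently.
{:ok, child} = DynamicSupervisor.start_child(sup, {SubAgent, "summarise precedents"})

IO.inspect(Process.alive?(child))
# => true
```

A parent that also calls Process.monitor(child) receives a :DOWN message on failure, which is the hook for the "adapt, escalate, or abort" decision described above.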


The Contrast with Current Approaches

I do not want to be uncharitable to existing agent frameworks. LangChain (and its orchestration layer LangGraph), CrewAI, AutoGen, and others have done important work in making agent development accessible. Some have added persistence: LangGraph supports checkpointing to external stores, and CrewAI has memory modules. But even with these additions, they share a structural limitation: they are libraries running on general-purpose runtimes, not runtimes designed for the workload.

A library gives you abstractions for building agents. A runtime gives you the substrate on which agents live. The difference matters when things go wrong, and with long-running autonomous systems, things always go wrong.

Consider the failure modes: an LLM provider returning a 503 forty minutes into a task, an agent crashing on a malformed response, a process dying between issuing a tool call and recording its result, a deploy that must ship while agents are mid-task.

These are not edge cases. They are the normal operating conditions of any system that runs autonomous agents at scale. Building on a runtime that was designed for exactly these conditions is not over-engineering; it is appropriate engineering.


The Connection to Formal Verification

Readers of my earlier essays will notice the thread connecting Elan to the "Soul and the Hands" thesis. The policy engine is a runtime implementation of capability verification. The event log is the audit substrate that makes capability bounds observable. Git-native provenance is the mechanism that makes agent actions attributable.

But there is a deeper connection. The BEAM's process model is inherently compositional. Each agent is a self-contained unit with well-defined inputs (messages), outputs (messages and side effects), and failure modes (crashes caught by supervisors). This compositionality is exactly what makes formal reasoning about agent systems tractable. You can reason about each agent independently, then compose the guarantees.

Elan's event types are defined as Elixir structs with enforced schemas. Each event type is a contract: AgentStateTransitioned has a from state, a to state, and a timestamp. ToolExecuted has a tool name, parameters, a result, and a duration. These typed events are the raw material for formal verification. A future version of Elan could export its event schemas as Lean 4 types and prove properties about agent behaviour, for instance, that an agent with a given policy can never execute a given tool, or that a given state is unreachable from a given starting configuration. To be clear about the current state: this is an aspiration, not an achievement. Today, Elan's enforcement is at runtime. The bridge to machine-checked proofs is a research direction, not a shipping feature.
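A sketch of one such struct, using the fields named above plus the agent_id that appears in the earlier snippet (@enforce_keys makes the schema mandatory):

```elixir
# A typed event as an Elixir struct. Field names follow the essay's
# description of AgentStateTransitioned; the exact struct definition
# in Elan may differ.
defmodule AgentStateTransitioned do
  # Constructing a struct literal without a required field is a
  # compile-time error, so malformed events cannot enter the log.
  @enforce_keys [:agent_id, :from, :to, :at]
  defstruct [:agent_id, :from, :to, :at]
end

event = %AgentStateTransitioned{
  agent_id: "agent-7",
  from: :planning,
  to: :executing,
  at: DateTime.utc_now()
}

IO.inspect(event.from)
# => :planning
```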

This is the bridge between the "soul" work (LegalLean, formal verification of reasoning) and the "hands" work (Elan, runtime enforcement of capability bounds). LegalLean verifies that legal reasoning is correct. Elan ensures that agents executing that reasoning operate within provable bounds. Together, they form a stack where both the logic and the execution are auditable.


Open Problems and Honest Costs

It would be dishonest to present the architecture without naming what is hard, unsolved, or expensive.

Event log growth. Persisting every state transition, every LLM response, and every tool result as an event means the log grows fast. A single agent running for six hours with frequent LLM calls could produce tens of megabytes of event data. Checkpointing amortises replay cost but does not solve storage growth. Elan will need compaction or archival policies, and these interact non-trivially with the durability guarantees. We have not solved this yet.

Recovery is not free, especially with LLMs. Restoring agent state from the event log reconstructs the FSM and the data context. But the LLM has no memory of the prior conversation. Rebuilding the LLM's context window after a crash means re-sending the relevant history, which costs tokens and time. For a long-running agent with a large context, this could mean thousands of tokens of re-prompting before the agent can resume productive work. The event log makes recovery possible; it does not make it cheap.

Distributed BEAM is not single-node BEAM. The fault tolerance story in this essay is primarily about single-node supervision. Erlang distribution (connecting multiple BEAM nodes) introduces network partitions, split-brain scenarios, and the full complexity of distributed consensus. Elan's current design is single-node. Multi-node distribution is a future concern, and when it arrives, it will bring CAP trade-offs that cannot be hand-waved away.

Operational tooling. How do you deploy Elan? How do you monitor agent health? What does the observability stack look like? The BEAM has excellent introspection tools (Observer, recon, telemetry), and Elan emits structured telemetry events. But a production-grade dashboard for "show me all running agents, their states, their checkpoint freshness, and their event log sizes" does not exist yet. The typed event system makes this buildable; it is not yet built.

The Elixir adoption barrier. The AI agent community writes Python. Asking developers to learn Elixir, OTP supervision patterns, and :gen_statem to build agents is a significant friction cost. Whether the architectural benefits justify that cost is an open question. I believe they do for the class of problems Elan targets, but I acknowledge the barrier is real.


Current Status

Elan is in early build. The runtime core, supervision tree, agent FSM, event log, checkpoint store, and policy engine exist as Elixir modules with working logic (in-memory persistence, not yet backed by durable storage). The git coordinator is stubbed: it manages branch names in memory but does not yet shell out to git or create real worktrees. The LLM adapter interface is defined but not yet connected to production providers. You can clone the repository and run mix compile to verify the modules are well-formed. What you cannot yet do is point Elan at a real LLM and run a real task end to end.

The PRD tracks twelve requirements with WSJF prioritisation. The current focus is on the recovery pipeline: ensuring that an agent can crash at any point in its execution, restart, and resume without data loss or duplicate side effects. This is the hardest part. If recovery is correct, everything else is engineering. If recovery is wrong, nothing else matters.

Three models, GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro, independently converged on the same core architecture when asked to design a BEAM-native agent runtime. The convergence was striking: all three proposed supervision trees for agent lifecycle, event sourcing for state persistence, git branches for provenance isolation, and capability-based policy for tool governance. When three competing models agree on an architecture without prompting toward each other's outputs, it suggests the design is near-canonical for the problem space.


Who This Is For

Elan is not for building chatbots. If your agent handles a single user request and returns a response within seconds, you do not need durable state, supervision trees, or git-native provenance. Use LangChain. It is good at that.

Elan is for systems where agents run autonomously for extended periods, where crashes must not cause data loss, where multiple agents must coordinate without corrupting each other's state, and where every action must be attributable and auditable. Legal research pipelines. Compliance monitoring. Clinical decision support. Financial analysis workflows. Infrastructure management. Any domain where "the agent crashed and we lost an hour of work" is not an acceptable outcome.

It is also for anyone who believes, as I do, that the agent infrastructure problem is fundamentally a distributed systems problem, and that the best distributed systems runtime ever built is sitting right there, battle-tested across four decades of telecommunications, waiting for us to use it.


Get Involved

Elan is open source and under active development. The repository, PRD, and all design documents are public.

If you have experience with Erlang/OTP, Elixir, or distributed systems and are interested in applying that expertise to AI agent infrastructure, I would like to hear from you. The hardest open problems are in recovery correctness (proving that replay from the event log produces identical state) and policy composition (reasoning about the combined capabilities of cooperating agents). These are problems where telecom engineering experience directly transfers.

About the author: Eduardo Aguilar Pelaez is CTO and co-founder at Legal Engine Ltd. He previously led product strategy at Canonical (Ubuntu) and served as a voting member of the Cloud Native Computing Foundation. Elan grows from the intersection of two decades of distributed systems work and the conviction that AI agents deserve infrastructure as serious as the problems they solve. Contact: edu@legalengine.co.uk.