TorchLean: Verified Neural Networks in Lean
After a year of working on TorchLean, I keep coming back to one question: when we say something about a neural network, which exact computation are we talking about?
That sounds pedantic until you follow a real ML workflow. A model may be written as Python code, trained into a checkpoint, exported to ONNX or another graph format, simplified by a compiler, executed through CUDA kernels, and then checked by a verifier that reads yet another artifact. None of those steps is suspicious by itself. The problem is that each step can change a convention: tensor layout, broadcasting, preprocessing, numerical precision, fused operations, masks, cache state, or parameter order.
So by “drift,” I mean something concrete. The verifier may check a graph after export, while deployment runs a fused kernel. A theorem may talk about real numbers, while the model runs in Float32. A text model may be trained with one tokenizer and sampled with another. A certificate may name an input box, but the runtime path may normalize inputs differently. The bug is not always that the code crashes. Often the code runs, but the claim has quietly moved away from the computation.
For the past year, I have been working on TorchLean in the AI for science environment around Anima Anandkumar’s group at Caltech.8 That context shaped the project. In scientific computing, a neural network is often not just a predictor; it is part of a larger claim about a PDE, a controller, a geometry pipeline, a simulator, or a physical system.
TorchLean is my attempt to make that drift harder to ignore. It puts neural-network programs inside Lean so that the model, graph, gradient rule, certificate, Float32 assumption, and runtime artifact can be named in one formal setting.
Why this feels urgent now
Two things are happening at the same time.
The first is that code is becoming easier to produce than to understand. AI assistants can now write scripts, wrappers, exporters, kernels, and verifier harnesses quickly. That is genuinely useful. But it shifts the bottleneck: the hard part is no longer only writing code, but knowing what computation the code implements and whether that claim survived all the interfaces it crossed.
This is the part of “vibe coding” that interests me most. Karpathy’s phrase captures a real change in programming practice, and Collins later named vibe coding its 2025 word of the year.1 For scientific ML, the consequence is not that expertise matters less. It is that checks matter more: shape checks, mask checks, graph checks, certificate checks, numerical checks, and contracts around external tools.
The second is that theorem proving and formal verification are being pulled into the mainstream software conversation. The bigger idea is vericoding: AI-assisted programming where code is generated together with formal specifications, proofs, or certificates that a checker can reject if they are wrong. Max Tegmark and collaborators use the term for LLM-generation of formally verified code from formal specifications, explicitly contrasting it with vibe coding.2 I fully believe some version of this is the future. I do not want AI systems only to produce more plausible glue code. I want them to help produce artifacts that carry checkable contracts.
Lean is one concrete reason this now feels possible. It is both a functional programming language and an interactive theorem prover: you write definitions, state theorems about them, and have a small kernel check that the proof really establishes the statement. This is why Lean has become so visible in mathematics and AI. DeepMind’s AlphaProof used Lean as a formal target for Olympiad-style reasoning, and Terence Tao’s Lean formalization work shows mathematicians using formal systems to make existing arguments more explicit and collaborative.3
The functional programming part matters too. In Lean, the values that matter to a theorem tend to have names: the parameters, the graph, the input box, the random seed, the mask, the Float32 model. That is exactly the discipline I want around ML systems, especially when AI-generated code makes plausible-looking glue code cheap.
My bet is that this shift will matter even more in scientific computing. We are asking whether a learned simulator, a controller, a geometry pipeline, or a numerical certificate actually supports the claim the paper says it supports.
What TorchLean is
Concretely, TorchLean has three parts.4
First, it is an ML programming layer in Lean: typed tensors, layers, training loops, autograd, optimizers, logs, model examples, PyTorch interop, and optional CUDA execution.
Second, it is a specification layer: graph IRs, shape-indexed tensors, scalar models, Float32 and IEEE-style arithmetic, and definitions that say what the model is supposed to compute.
Third, it is a checking layer: autograd proofs, graph checks, IBP and CROWN-style verification, certificate checkers, Bug Zoo contracts, and scientific ML examples.5
What matters is that these layers are meant to meet. I want the graph the verifier checked to be recognizably the model that was written and run.
At a glance, the moving pieces are:
- Typed tensor and layer code written in Lean with a PyTorch-shaped surface.
- An op-tagged SSA DAG with node shapes and a denotation \(\llbracket G \rrbracket\).
- Reverse-mode rules expressed as VJPs over the graph denotation.
- Real, interval, FP32, IEEE32Exec, runtime Float32, and native CUDA floats.
- IBP boxes, CROWN affine bounds, margin checks, and external certificates.
- Runtime acceleration with stated agreements around native kernels and libraries.
- GPT-style text, Mamba, diffusion, RL, FNO, ViT, ResNet, and other examples.
- PDE residuals, controllers, 3D projections, robustness, and geometry checks.
The Lean view of a neural network
A neural network is a composition of tensor operations. TorchLean makes the tensor shape part of the type, so many shape errors become type errors rather than runtime surprises. At the specification layer, tensors can be viewed as total functions over finite index sets. This is not always the fastest representation, but it is a clean reference object for proofs.
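As a rough picture of what putting the shape in the type means, here is a plain Lean 4 sketch; the names Vec, Mat, vadd, and transpose are illustrative, not TorchLean's actual tensor definitions:
def Vec (n : Nat) : Type := Fin n → Float
def Mat (m n : Nat) : Type := Fin m → Fin n → Float

-- Elementwise add only accepts vectors of the same length.
def vadd {n : Nat} (x y : Vec n) : Vec n := fun i => x i + y i

-- Transposition swaps the shape indices in the type, not just the data.
def transpose {m n : Nat} (A : Mat m n) : Mat n m := fun i j => A j i

def u : Vec 3 := fun i => Float.ofNat i.val
def v : Vec 3 := fun _ => 1.0
def w4 : Vec 4 := fun _ => 0.0

#eval vadd u v ⟨2, by decide⟩   -- 3.0
-- vadd u w4                    -- rejected at elaboration: expected Vec 3, got Vec 4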
The graph view is equally important. A model lowers to an op-tagged SSA DAG. Each node has a primitive operation, parent nodes, and an output shape. Because the graph is acyclic, evaluation is defined in topological order. Because each intermediate is assigned once, proofs and checkers can attach values, gradients, boxes, or affine forms to node IDs without ambiguity.
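That graph view can also be sketched in a few lines. The toy IR below uses illustrative names (Op, Graph, Graph.eval) and scalar node values standing in for tensors; the real NN.IR.Graph is richer, but the two properties the text relies on are the same: nodes only point backwards, and evaluation assigns each id exactly one value.
inductive Op where
  | input (idx : Nat)              -- read the idx-th graph input
  | add (a b : Nat)                -- sum of two earlier nodes
  | mulConst (c : Float) (a : Nat) -- scale an earlier node by a constant
  | relu (a : Nat)

structure Graph where
  nodes : Array Op                 -- node i may only refer to nodes j < i

def evalNode (inputs vals : Array Float) : Op → Float
  | .input i      => inputs.getD i 0.0
  | .add a b      => vals.getD a 0.0 + vals.getD b 0.0
  | .mulConst c a => c * vals.getD a 0.0
  | .relu a       => if vals.getD a 0.0 > 0.0 then vals.getD a 0.0 else 0.0

-- Evaluate in topological (id) order; the last entry is the graph output.
def Graph.eval (g : Graph) (inputs : Array Float) : Array Float :=
  g.nodes.foldl (fun vals op => vals.push (evalNode inputs vals op)) #[]

-- y = relu(2 * x0 + x1)
def tiny : Graph := ⟨#[.input 0, .input 1, .mulConst 2.0 0, .add 2 1, .relu 3]⟩

#eval (tiny.eval #[1.5, -4.0]).back?   -- some 0.0: the negative sum is clipped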
For a well-typed graph \(G\) over \(\R\), reverse-mode backpropagation computes the adjoint derivative of the graph denotation:
\[\VJP_G(x;\bar y) = \left(D\llbracket G\rrbracket(x)\right)^\top \bar y.\]
That is the kind of statement I want ML infrastructure to make routine: not “the gradients look right,” but “this graph level algorithm computes the derivative of this denotation under these hypotheses.”
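For a single linear node \(y = W x\), for example, the statement specializes to the familiar adjoints
\[\bar x = W^\top \bar y, \qquad \bar W = \bar y\, x^\top,\]
and the graph-level claim is that reverse mode assembles exactly such per-node rules along the reversed topological order.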
Different parts of the stack carry different kinds of evidence:
- Definitions and theorems over tensors, graphs, selected derivatives, and sound checker cores.
- External JSON or certificate data can be reloaded and checked against a small contract.
- CUDA kernels and CPU stubs are compared by tests, finite-difference checks, and sanitizer runs.
- Native code, vendor libraries, Python, Julia, Arb, and CROWN producers have stated roles.
Bugs become theorem-shaped
One part of TorchLean that I like a lot is the Bug Zoo. These are small case studies for ML bugs that pass ordinary runtime checks. The code still returns tensors, losses, or tokens, but the value no longer satisfies the intended contract.6
This is exactly the kind of bug I worry about in AI-assisted coding. The code looks plausible. The output has the right shape. The notebook runs. But the mask is wrong, the cache position drifted, the tokenizer changed, or the Float32 path is being treated like a real-valued proof.
In TorchLean, these failures become statements. Some are ruled out by types. Some become checked artifacts. Some become named runtime assumptions. Some become small Lean theorems.
- Causal attention masks: a future token can accidentally receive attention mass while all tensor shapes still look correct (see the sketch after this list).
- KV cache and RoPE: incremental decoding can disagree with full-sequence decoding if cache positions and rotary embeddings drift.
- Tokenizer contracts: the model can be correct for one tokenizer and meaningless for another; byte-token demos remove an external BPE dependency.
- Normalization state: training statistics, inference statistics, and explicit inputs should not be silently conflated.
- Batch invariance: one sample processed alone should agree with the same sample inside a batch only under stated conditions.
- Float and autograd bridges: runtime Float32 and proof-facing IEEE arithmetic are related by a named bridge.
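To show how the first of these becomes a checkable statement, here is a minimal sketch in plain Lean over Nat-indexed score functions; maskScore and causalOk are hypothetical names, not the library's attention code. The contract says every future position carries \(-\infty\) before softmax, which forces zero attention weight after it:
def negInf : Float := (-1.0) / 0.0

-- Apply a causal mask to a score function of (query, key) positions.
def maskScore (score : Nat → Nat → Float) (i j : Nat) : Float :=
  if j > i then negInf else score i j

-- Contract: no query position gives weight to a future key position.
def causalOk (n : Nat) (masked : Nat → Nat → Float) : Bool :=
  (List.range n).all fun i =>
    (List.range n).all fun j => j ≤ i || masked i j == negInf

#eval causalOk 4 (maskScore (fun _ _ => 1.0))   -- true
#eval causalOk 4 (fun _ _ => 1.0)               -- false: future tokens would leak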
A favorite example: fused attention without changing the computation
Attention is a good example because it sits exactly where modern ML systems become both mathematical and systems heavy. The mathematical operation is scaled dot product attention. The efficient implementation may fuse scores, masking, softmax, and value multiplication so the full attention matrix is never materialized.
TorchLean’s CUDA guide describes a FlashAttention-style fused attention contract. The Lean side has equality theorem names such as flashAttention_eq_scaledDotProductAttention, flashAttentionBackward_eq_scaledDotProductAttentionBackward, and cudaLoopFlashAttention_eq_scaledDotProductAttention. The fused Lean spec denotes ordinary scaled dot product attention. The native CUDA kernel is a runtime implementation path with separate agreement evidence.7
Informally, for queries \(Q\), keys \(K\), values \(V\), and mask \(M\):
\[\Flash(Q,K,V,M) = \SDPA(Q,K,V,M)\]
where
\[\SDPA(Q,K,V,M) = \softmax\left(\frac{QK^\top}{\sqrt{d}} + M\right)V.\]
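Those equality names belong to the library. As a standalone illustration of why such an equality is plausible, and emphatically not the actual kernel or its Lean spec, here is the online-softmax bookkeeping that fused attention relies on, written over plain lists of floats; softmaxRef, streamStats, and softmaxStream are illustrative names:
-- Reference: two-pass softmax with max subtraction for stability.
def softmaxRef (xs : List Float) : List Float :=
  let m := xs.foldl (fun a x => if x > a then x else a) ((-1.0) / 0.0)
  let e := xs.map (fun x => Float.exp (x - m))
  let s := e.foldl (· + ·) 0.0
  e.map (· / s)

-- One streaming pass: a running max and a running rescaled sum,
-- the same bookkeeping a fused kernel keeps per block.
def streamStats (xs : List Float) : Float × Float :=
  xs.foldl
    (fun (acc : Float × Float) x =>
      let (m, s) := acc
      let m' := if x > m then x else m
      (m', s * Float.exp (m - m') + Float.exp (x - m')))
    ((-1.0) / 0.0, 0.0)

def softmaxStream (xs : List Float) : List Float :=
  let (m, s) := streamStats xs
  xs.map (fun x => Float.exp (x - m) / s)

#eval softmaxRef    [1.0, 2.0, 3.0]   -- both evaluate to ≈ [0.090, 0.245, 0.665]
#eval softmaxStream [1.0, 2.0, 3.0]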
Why I care about scientific computing
Scientific computing is where this feels most important to me.
A learned PDE surrogate can sit inside a solver, an inverse-design loop, or a decision system. A neural controller can become part of a safety argument. A 3D perception model can produce geometry that another system treats as a certificate about the world.
In these settings, the model is embedded inside a larger chain of reasoning. That chain includes data, numerical methods, finite precision, kernels, solvers, and physical interpretation. TorchLean is useful because it gives us a place to ask, piece by piece, which parts of that chain have been run, checked, proved, or assumed.
Certificates: external tools as producers, Lean as checker
There is a pragmatic lesson from neural network verification: the strongest bound search procedures are often complex. You may want an external solver to optimize CROWN slopes, split regions, call a Python library, or produce a JSON artifact. TorchLean can treat those tools as producers.
The verification tutorial describes a simple path: write or import a model, compile it to NN.IR.Graph, attach an input box, run a bound engine, and inspect or check output bounds. For margin properties, a lower bound on logit_true - logit_other can certify all inputs in a box when the bound stays positive.9
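To make the margin pattern concrete, here is a tiny self-contained sketch for a single linear layer, not the TorchLean bound engine or its certificate format; Ival, dotBox, and marginLower are hypothetical names, and real engines propagate such bounds through the whole graph:
structure Ival where
  lo : Float
  hi : Float

def Ival.add (a b : Ival) : Ival := ⟨a.lo + b.lo, a.hi + b.hi⟩

-- Multiply an interval by a known constant weight.
def Ival.scale (c : Float) (a : Ival) : Ival :=
  if c ≥ 0 then ⟨c * a.lo, c * a.hi⟩ else ⟨c * a.hi, c * a.lo⟩

-- Bound w · x + b over a box of inputs x.
def dotBox (w : List Float) (x : List Ival) (b : Float) : Ival :=
  (List.zip w x).foldl (fun acc p => acc.add (Ival.scale p.1 p.2)) (Ival.mk b b)

-- Lower bound of (logit_true - logit_other) over the box: positive means certified.
def marginLower (wTrue wOther : List Float) (bTrue bOther : Float)
    (x : List Ival) : Float :=
  let diffW := (List.zip wTrue wOther).map (fun p => p.1 - p.2)
  (dotBox diffW x (bTrue - bOther)).lo

#eval marginLower [1.0, 0.5] [0.2, 0.1] 0.3 0.0 [⟨-0.1, 0.1⟩, ⟨0.4, 0.6⟩]
-- ≈ 0.38 > 0: the margin property holds for every input in the box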
The pattern generalizes. An external tool may search. Lean checks the small statement that matters: the shapes match, the nodes match, the boxes or affine bounds support the final inequality, and the imported artifact has the format the checker expects.
CUDA belongs in the story
CUDA belongs in the story because scientific ML does not run only on clean mathematical definitions. It runs on accelerators, vendor libraries, fused kernels, reductions, and memory layouts. The CUDA guide describes an opt-in runtime backend for supported Float32 tensor work such as matmul, convolution, reductions, attention helpers, FFT and FNO kernels, selective scan and Mamba support, plus sanitizer and parity harnesses.10
TorchLean’s choice is to let CUDA be useful without letting CUDA silently define the theorem. The Lean-side operation still has a specification. The native path is an implementation path. If a claim depends on that path, the agreement has to be named, tested, or eventually proved.
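A sketch of what a named, testable agreement can look like, assuming nothing about the actual parity harness; agreesWithin is a hypothetical helper:
-- Agreement within a tolerance is an explicit, executable statement.
def agreesWithin (tol : Float) (ref native : Array Float) : Bool :=
  ref.size == native.size &&
    (ref.zip native).all (fun p => Float.abs (p.1 - p.2) ≤ tol)

#eval agreesWithin 1e-6 #[1.0, 2.0] #[1.0, 2.0 + 1e-7]   -- true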
This is also a good example of the personal reason I care about the project. If theorem proving is going to matter for science, it has to meet accelerators, vendor libraries, and floating point without becoming vague about the contract.
Why the model zoo matters
The model zoo is there because different model families expose different sources of drift.
A GPT-style model brings tokenization, positions, causal masks, sampling, and saved parameter packs. Mamba brings recurrent scan state. Diffusion brings schedules and iterative sampling. FNO brings spectral convolution and scientific data. RL brings environment transitions and trajectory records. Bug Zoo collects the cases where these details go wrong but the code still runs.
These examples are compact, but they are not decorative. They force the library to handle the kinds of structure that real ML systems use.
The TorchLean text demos tokenize bytes directly. Every UTF-8 byte is a token, so the vocabulary is fixed at 256. The docs say this avoids an external BPE file, tokenizer version ambiguity, and hidden cache dependencies.11 That may sound minor, but it is the kind of design choice that makes a verification claim easier to state.
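In plain Lean terms, and as an illustrative helper rather than the demo's own code, the encoder is just a byte map:
def encodeBytes (s : String) : List Nat :=
  s.toUTF8.toList.map (·.toNat)

#eval encodeBytes "FNO"                      -- [70, 78, 79]
#eval (encodeBytes "∂u/∂t").all (· < 256)    -- true: every token id fits the 256-word vocabulary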
Finite precision is not a footnote
Mathematical neural networks often live over \(\R\). Deployed neural networks do not. They run with Float32, library conventions, signed zeros, NaNs, subnormals, FMA behavior, reduction order, and sometimes approximate math modes.
TorchLean therefore separates numeric roles. Real arithmetic is useful for reference proofs and calculus. Interval domains are useful for enclosures. \(\texttt{FP32}\) and related rounded-real models are useful for error envelopes. \(\texttt{IEEE32Exec}\) gives executable IEEE-style binary32 behavior inside Lean. Runtime Float32 and native CUDA floats are useful for speed, but they enter through named bridges and assumptions.
This may feel like bookkeeping, but it is central to the project. A real-valued margin of \(10^{-4}\) is not useful if the Float32 error budget is \(10^{-3}\). The theorem has to say which case we are in.
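Lean's built-in Float is binary64 rather than binary32, but even it shows why the numeric model has to be named; reduction order changes the answer:
#eval ((1.0e16 + 1.0) + 1.0) == (1.0e16 + (1.0 + 1.0))   -- false: the two sums differ by 2.0
#eval (0.1 + 0.2) == 0.3                                  -- false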
The vision
I do not think the future is that every ML researcher writes long Lean proofs by hand before running an experiment. That would be the wrong lesson. The future I want is one where more of the stack becomes checkable by construction.
Models should carry shape contracts. Transformations should carry mathematical contracts. Runtime paths should be named. Certificates should be small enough to check. Scientific claims should be connected to the computation that produced them. AI-generated code should have places where its contracts are tested along with its syntax.
TorchLean is one step toward that. It is a place to put neural networks, graphs, gradients, finite precision, CUDA paths, verification artifacts, and scientific examples under a common discipline. The project is public because I think this needs to become an ecosystem, not a private artifact.
A shared substrate for theorem-aware machine learning: not a replacement for PyTorch, CUDA, CROWN, or scientific solvers, but a Lean-based layer where their claims can be stated, checked, and related to the model being used.
Trying it
The repository builds as a normal Lean and Lake project. The public README lists the basic commands and maintained CLI entry points.12
git clone https://github.com/lean-dojo/TorchLean.git
cd TorchLean
lake build
lake exe torchlean --help
lake exe verify --help
For verification examples:
lake exe verify -- torchlean-crown-ops --dtype float
lake exe verify -- torchlean-robustness --dtype float
lake exe verify -- margin-cert
lake exe verify -- abcrown-leaf
For CUDA, build with CUDA explicitly and then use runtime flags:
lake build -R -K cuda=true
lake exe -K cuda=true torchlean mlp --cuda --steps 20
lake exe -K cuda=true nn_tests_suite
Closing thought
TorchLean still feels exciting to me because it sits at the intersection of things that are usually kept apart: neural-network systems, theorem proving, numerical computing, and scientific modeling.
Each of those areas has its own idea of correctness. A training run can be successful. A theorem can be checked. A numerical method can be stable. A CUDA kernel can be fast. A scientific model can be useful. The hard question is how these notions meet when they are all part of the same pipeline.
That is the question I want TorchLean to make precise. Not because every ML researcher should write long Lean proofs by hand, but because learned computation is becoming part of serious scientific arguments. When that happens, the model should carry more than weights. It should carry a contract we can inspect, check, and share.
Notes and references
1. Collins Dictionary, “The Collins Word of the Year 2025 is... vibe coding”. Collins describes vibe coding as software development that turns natural language into code using AI and notes Karpathy’s role in popularizing the term.
2. Sergiu Bursuc et al., “A benchmark for vericoding: formally verified program synthesis”, arXiv:2509.22908. The paper defines vericoding as LLM-generation of formally verified code from formal specifications, in contrast to vibe coding, and includes Max Tegmark among the authors.
3. Google DeepMind, “AI achieves silver medal standard solving International Mathematical Olympiad problems”; Terence Tao, “Formalizing the proof of PFR in Lean4 using Blueprint”; Lean, official project page.
4. TorchLean homepage, Formalizing Neural Networks in Lean.
5. LeanDojo TorchLean page, TorchLean: Formalizing Neural Networks in Lean; GitHub repository, lean-dojo/TorchLean.
6. TorchLean examples, Bug Zoo Walkthrough.
7. TorchLean guide, FlashAttention style fused attention; TRUST_BOUNDARIES.md.
8. Caltech, AI4Science @ Caltech. The site describes the initiative led by Anima Anandkumar and Yisong Yue as bringing AI researchers together with domain experts across science and engineering.
9. TorchLean examples, Verification Bounds.
10. TorchLean guide, GPU and CUDA Boundaries.
11. TorchLean examples, Text Models Walkthrough.
12. GitHub repository, lean-dojo/TorchLean.
13. TorchLean preprint, arXiv:2602.22631. I treat the paper as the starting point, and the current site and repository as the source of truth for the broader public codebase described here.