TorchLean: Verified Neural Networks in Lean
For roughly the past year, we have been building TorchLean in and around Anima Anandkumar’s group at Caltech. I have thought about it as a machine learning project, a formal methods project, and a scientific computing project. The third framing is the one that now feels most important to me.
Anima’s lab sits at the interface of AI and science. That matters for TorchLean. Scientific machine learning is not just about fitting a model and reporting a benchmark. It is about using learned models inside simulations, controllers, inverse design loops, PDE solvers, geometry pipelines, and physical decision systems. In those settings, a neural network is not merely a predictor. It becomes part of an argument about the world.
That argument can be fragile. A model can have the right weights but the wrong preprocessing. A verifier can prove a property of a graph that is not quite the graph that ran. A CUDA kernel can be faster because it fuses operations, but then someone has to say what value it is supposed to compute. A physics informed neural network can use derivatives, but those derivatives have to mean something. A controller can look stable in experiments, but safety is a statement over a region, not over the points we happened to sample.
Why this feels urgent now
AI systems are making it easier to produce code. Andrej Karpathy popularized the phrase “vibe coding” for a style of programming where the user describes what they want and lets an AI system fill in much of the implementation. Collins later named vibe coding its 2025 word of the year, describing it as software development that turns natural language into code with AI assistance.1
I do not take this to mean that expertise stops mattering. I take it to mean the opposite. When more code is produced with less friction, the bottleneck shifts from writing code to knowing whether the code preserves the intended meaning. This is especially true in machine learning, where the correctness of a system often depends on tensor shapes, broadcasting conventions, masks, tokenizers, randomness, floating point, cached state, and device kernels.
At the same time, theorem proving is becoming part of mainstream mathematics and AI. DeepMind’s AlphaProof made Lean visible to a broad audience by searching for formal proofs of Olympiad problems. Terence Tao’s Lean formalization work shows the complementary direction: mathematicians using formal systems to make existing arguments more explicit and collaborative.2
My bet is that the same shift will matter even more in scientific computing. The stakes are different. We are not only asking whether a proof of a theorem is valid. We are asking whether a learned simulator, a controller, a geometry pipeline, or a numerical certificate actually means what the paper says it means.
What TorchLean is
TorchLean is a Lean 4 codebase for neural network specification, execution, and verification. The project site describes it as connecting typed tensor and layer specifications, runnable training examples, graph IR semantics, floating point contracts, CUDA boundaries, and artifacts that Lean checkers can inspect.3
The current codebase is broader than the first paper. It includes a PyTorch shaped API, eager and compiled execution, reverse mode autograd, optimizers, logging, import and export, CUDA execution, Float32 and IEEE style semantics, IBP and CROWN style verification, and examples across vision, sequence models, generative models, reinforcement learning, PINNs, and 3D geometry certificates.4
But lists can obscure the point. The point is that all of these pieces should be able to refer to the same mathematical object. A TorchLean model can be user code. It can be a graph. It can be a runtime computation. It can be the target of autograd theorems. It can be the object whose bounds are checked by a verifier. Those are different views, not different meanings.
- Typed tensor and layer code written in Lean with a PyTorch shaped surface.
- An op tagged SSA DAG with node shapes and a denotation \(\llbracket G \rrbracket\).
- Reverse mode rules expressed as VJPs over the graph semantics.
- Real, interval, FP32, IEEE32Exec, runtime Float32, and native CUDA floats.
- IBP boxes, CROWN affine bounds, margin checks, and external certificates.
- Runtime acceleration with explicit contracts around native kernels and libraries.
- GPT style text, Mamba, diffusion, RL, FNO, ViT, ResNet, and other examples.
- PDE residuals, controllers, 3D projections, robustness, and geometry checks.
The Lean view of a neural network
A neural network is a composition of tensor operations. TorchLean makes the tensor shape part of the type, so many shape errors become type errors rather than runtime surprises. At the semantic layer, tensors can be viewed as total functions over finite index sets. This is not always the fastest representation, but it is a clean reference meaning for proofs.
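A minimal sketch of that semantic view, with hypothetical names rather than TorchLean's actual definitions, puts the shape into the type and treats the tensor as a function on indices:

-- Illustrative sketch only: a tensor over a fixed shape as a total function
-- from a finite index set to values. TorchLean's real definitions are richer.
def TensorF (shape : List Nat) (α : Type) : Type :=
  ((i : Fin shape.length) → Fin (shape.get i)) → α

-- Elementwise addition: the two shapes must already agree, so a mismatch is
-- a type error at elaboration time rather than a runtime surprise.
def addF {α : Type} {shape : List Nat} [Add α] (x y : TensorF shape α) : TensorF shape α :=
  fun idx => x idx + y idx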
The graph view is equally important. A model lowers to an op tagged SSA DAG. Each node has a primitive operation, parent nodes, and an output shape. Because the graph is acyclic, evaluation is defined in topological order. Because each intermediate is assigned once, proofs and checkers can attach values, gradients, boxes, or affine forms to node IDs without ambiguity.
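The graph view can be sketched the same way. The names below are illustrative and the real NN.IR.Graph carries more information, but the core idea is a list of single assignment nodes consumed in topological order:

-- Illustrative sketch, not the repository's IR: each node names a primitive
-- op, its parents by index, and its output shape. Acyclicity comes from
-- parents only pointing at earlier nodes.
inductive Op where
  | matmul | relu | add | softmax
  deriving Repr

structure Node where
  op      : Op
  parents : List Nat     -- indices of earlier nodes; each value assigned once
  shape   : List Nat     -- output shape recorded per node
  deriving Repr

structure Graph where
  inputShapes : List (List Nat)
  nodes       : List Node   -- topologically ordered by construction
  deriving Repr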
For a well typed graph \(G\) over \(\R\), reverse mode backpropagation computes the adjoint derivative of the graph denotation:
\[\VJP_G(x;\bar y) = \left(D\llbracket G\rrbracket(x)\right)^\top \bar y.\]
That is the kind of statement I want ML infrastructure to make routine: not “the gradients look right,” but “this graph level algorithm computes the derivative of this denotation under these hypotheses.”
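Two standard instances make the shape of the claim concrete (ordinary multivariable calculus, written in the same notation): for a single linear node \(y = Wx\) the adjoint is \(W^\top\), and for a two node composition the adjoints apply in reverse order,
\[\VJP_{Wx}(x;\bar y) = W^\top \bar y, \qquad \VJP_{G_2 \circ G_1}(x;\bar y) = \left(D\llbracket G_1\rrbracket(x)\right)^\top \left(D\llbracket G_2\rrbracket\!\left(\llbracket G_1\rrbracket(x)\right)\right)^\top \bar y,\]
which is exactly what a reverse topological sweep over the SSA graph computes node by node.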
Not everything in that stack carries the same kind of guarantee, and the trust boundaries are meant to stay explicit about what is proved, what is checked, what is tested, and what is assumed:

- Definitions and theorems over tensor semantics, graphs, selected derivatives, and sound checker cores are proved in Lean.
- External JSON or certificate data can be reloaded and checked against a small contract.
- CUDA kernels and CPU stubs are compared by tests, finite difference checks, and sanitizer runs.
- When native code, vendor libraries, Python, Julia, Arb, or CROWN producers enter, the boundary is explicit.
Bugs become theorem shaped
One part of TorchLean that I like a lot is the Bug Zoo. These are small case studies of semantic bugs that pass ordinary runtime checks. The code still returns tensors, losses, or tokens, but the value no longer satisfies the intended contract.5
This is exactly the kind of bug I worry about in AI assisted coding. The code looks plausible. The output has the right shape. The notebook runs. But the meaning has shifted.
In TorchLean, the point is to turn these bugs into statements. Some bugs become impossible because the types rule them out. Some become checked artifacts. Some become named runtime assumptions. Some become small Lean theorems.
- Causal attention masks: a future token can accidentally receive attention mass while all tensor shapes still look correct (made precise just after this list).
- KV cache and RoPE: incremental decoding can disagree with full sequence decoding if cache positions and rotary embeddings drift.
- Tokenizer boundaries: the model can be correct for one tokenizer and meaningless for another. Byte token demos remove an external BPE dependency.
- Normalization state: training statistics, inference statistics, and explicit inputs should not be silently conflated.
- Batch invariance: one sample processed alone should agree with the same sample inside a batch only under stated conditions.
- Float and autograd boundaries: runtime Float32 and proof facing IEEE semantics are related by a named bridge.
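To show what "theorem shaped" means for the first entry, the causal mask contract is a statement about attention weights rather than about shapes (standard notation, not a quote of the repository): with an additive mask satisfying \(M_{ij} = -\infty\) for \(j > i\), row \(i\) of the attention matrix puts no mass on the future,
\[A_{ij} = \softmax\!\left(\frac{q_i K^\top}{\sqrt d} + M_i\right)_j = 0 \quad \text{for } j > i, \qquad \sum_{j \le i} A_{ij} = 1.\]
A shifted or transposed mask leaves every shape intact and silently breaks exactly this equality.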
A favorite example: fused attention without changing the meaning
Attention is a good example because it sits exactly where modern ML systems become both mathematical and systems heavy. The mathematical operation is scaled dot product attention. The efficient implementation may fuse scores, masking, softmax, and value multiplication so the full attention matrix is never materialized.
TorchLean’s CUDA guide describes a FlashAttention style fused attention contract. The Lean side has equality theorem names such as flashAttention_eq_scaledDotProductAttention, flashAttentionBackward_eq_scaledDotProductAttentionBackward, and cudaLoopFlashAttention_eq_scaledDotProductAttention. The guide is careful about the boundary: the fused spec denotes ordinary scaled dot product attention, while the native CUDA kernel is an FFI runtime path validated separately, not a proof of production CUDA machine code.6
Informally, for queries \(Q\), keys \(K\), values \(V\), and mask \(M\):
\[\Flash(Q,K,V,M) = \SDPA(Q,K,V,M)\]
where
\[\SDPA(Q,K,V,M) = \softmax\left(\frac{QK^\top}{\sqrt{d}} + M\right)V.\]
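The reason the equality needs a theorem at all is that the fused algorithm never forms the softmax directly. The standard streaming formulation (sketched here for one query row; this is the textbook online softmax recurrence, not a quote of TorchLean's definition) keeps a running maximum \(m\), normalizer \(\ell\), and accumulator \(o\) while scanning key and value blocks:
\[m' = \max\!\left(m, \max_j s_j\right), \qquad \ell' = e^{m - m'}\,\ell + \sum_j e^{s_j - m'}, \qquad o' = e^{m - m'}\,o + \sum_j e^{s_j - m'}\,v_j,\]
with \(s_j = \frac{q \cdot k_j}{\sqrt d} + M_j\). The final output \(o/\ell\) equals the corresponding row of \(\SDPA(Q,K,V,M)\), and the equality theorems say that this rearrangement does not change the denoted function.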
(Diagram: composed SDPA on one side, the fused contract on the other; both denote the same function.)
Why I care about scientific computing
In Anima’s group, a lot of the intellectual energy is around AI for science: neural operators, physical simulation, weather, inverse design, and models that interact with the physical world. Caltech’s AI4Science initiative, led by Anima Anandkumar and Yisong Yue, explicitly aims to bring AI researchers together with domain experts and push modern AI tools into science and engineering.7
This is where verification feels most alive to me. A scientific model is rarely just a function approximator. It is part of a claim: this residual is small, this controller is stable, this projected box encloses the object, this learned operator respects a numerical contract, this policy satisfies a safety condition over a region. Those claims involve math, code, data, and hardware at the same time.
TorchLean gives us a language for turning some of those claims into checkable objects. The goal is not to make every scientist prove everything manually. The goal is to make more of the scientific stack capable of carrying precise contracts by default.
Certificates: external tools as producers, Lean as checker
There is a pragmatic lesson from neural network verification: the strongest bound search procedures are often complex. You may want an external solver to optimize CROWN slopes, split regions, call a Python library, or produce a JSON artifact. TorchLean does not need to pretend those tools are Lean proofs. It can treat them as producers.
The verification tutorial describes a simple path: write or import a model, compile it to NN.IR.Graph, attach an input box, run a bound engine, and inspect or check output bounds. For margin properties, a lower bound on logit_true - logit_other can certify all inputs in a box when the bound stays positive.8
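Spelled out, the margin check is a small inequality over the graph denotation (standard robustness certificate form, not a quote of the tutorial): for a network \(f = \llbracket G\rrbracket\) with logits indexed by class, an input box \(B\), and a target class \(t\), a verified bound
\[\ell_j \le \min_{x \in B}\left(f(x)_t - f(x)_j\right) \quad \text{with} \quad \ell_j > 0 \ \text{for every } j \ne t\]
implies that every input in \(B\) is classified as \(t\). The bound engine produces the \(\ell_j\); the checkable part is the inequality and the bookkeeping that ties it to the right graph and box.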
The pattern generalizes. An external tool may search. Lean checks the small statement that matters: the shapes match, the nodes match, the boxes or affine bounds are sound enough, and the final inequality follows.
CUDA belongs in the story
If TorchLean only worked on tiny examples, it would be too detached from how ML is actually practiced. That is why CUDA matters. The CUDA guide says the backend is opt in, that it is a runtime backend rather than a second semantics, and that it covers float32 tensor work such as matmul, convolution, reductions, attention helpers, FFT and FNO kernels, selective scan and Mamba support, plus sanitizer and parity harnesses.9
The right attitude is practical. CUDA changes where supported tensor work runs. It should not silently change what the operation means. When a GPU path calls cuBLAS, cuFFT, a fused attention kernel, or a custom reduction, TorchLean keeps the Lean side meaning separate from the native runtime agreement. That way the same model API and graph can describe CPU eager execution, CUDA eager execution, and compiled verification graphs.
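What a named bridge looks like in practice can be sketched in a few lines (hypothetical identifiers, not the repository's API): the Lean definition carries the meaning, the FFI call is opaque, and the claim that they agree is a separate named statement that downstream results must mention rather than assume silently.

-- Hypothetical sketch of the boundary discipline, not TorchLean's API.
def reluSpec (x : Array Float) : Array Float :=
  x.map (fun v => if v < 0.0 then 0.0 else v)   -- the Lean side meaning

@[extern "torchlean_cuda_relu"]
opaque cudaRelu (x : Array Float) : Array Float  -- runtime path, validated by tests

-- The agreement is a named proposition, never an implicit assumption.
def CudaReluMatchesSpec : Prop :=
  ∀ x, cudaRelu x = reluSpec x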
This is also a good example of the personal reason I care about the project. Scientific ML systems do not live in a clean theorem prover alone. They run on accelerators. They call vendor libraries. They use floating point. If theorem proving is going to matter for science, it has to meet those systems without losing precision about the contract.
Why the model zoo matters
A natural question is why TorchLean includes GPT style text models, GPT 2 style examples, Mamba, diffusion, MAE style learning, reinforcement learning, FNO examples, and a Bug Zoo if the core verification demos are smaller.
My answer is that verification should not be designed only for toy worlds. Text models bring tokenization, causal masks, caches, softmax, attention, sampling, and saved parameter packs. Mamba brings state space sequence computation and selective scan. Diffusion brings iterative sampling and scheduler conventions. RL brings policies, rollouts, environments, and randomness. FNO brings FFT conventions and spectral multiplication. These are exactly the places where semantics matter.
The TorchLean text demos tokenize bytes directly. Every UTF 8 byte is a token, so the vocabulary is fixed at 256. The docs say this avoids an external BPE file, tokenizer version ambiguity, and hidden cache dependencies.10 That may sound minor, but it is the kind of design choice that makes a verification claim easier to state.
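The whole tokenizer fits in a line or two (a sketch, not the repository's code), which is part of why the verification claims get easier to state:

-- Byte level tokenization: every UTF 8 byte is a token id, so the vocabulary
-- is exactly 256 and there is no external BPE file or tokenizer version to pin.
def byteTokens (s : String) : List UInt8 :=
  s.toUTF8.toList

#eval byteTokens "PDE"   -- [80, 68, 69]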
Finite precision is not a footnote
Mathematical neural networks often live over \(\R\). Deployed neural networks do not. They run with Float32, library conventions, signed zeros, NaNs, subnormals, FMA behavior, reduction order, and sometimes approximate math modes.
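One small arithmetic example, standard IEEE binary32 behavior under round to nearest even rather than anything TorchLean specific, shows why reduction order is part of the meaning:
\[\left(1 \oplus 2^{-24}\right) \oplus 2^{-24} = 1, \qquad 1 \oplus \left(2^{-24} \oplus 2^{-24}\right) = 1 + 2^{-23},\]
where \(\oplus\) is rounded Float32 addition. A kernel that reorders the same terms can therefore return a different value without any bug in the code.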
TorchLean therefore separates numeric roles. Real semantics are useful for reference proofs and calculus. Interval domains are useful for enclosures. \(\texttt{FP32}\) and related round on real models (real arithmetic with explicit rounding) are useful for error envelopes. \(\texttt{IEEE32Exec}\) gives an executable IEEE style binary32 semantics inside Lean. Runtime Float32 and native CUDA floats are useful for speed, but they enter through named bridges and assumptions.
This may feel like bookkeeping, but it is central to the project. A claim about real arithmetic, a claim about a proof oriented rounded model, a claim about an executable Float32 model, and a claim about a GPU run are not automatically the same claim. TorchLean tries to make the relation part of the statement.
The vision
I do not think the future is that every ML researcher writes long Lean proofs by hand before running an experiment. That would be the wrong lesson. The future I want is one where more of the stack becomes checkable by construction.
Models should carry shape contracts. Transformations should carry semantic contracts. Runtime boundaries should be named. Certificates should be small enough to check. Scientific claims should be connected to the computation that produced them. AI generated code should have places where its meaning is tested, not only its syntax.
TorchLean is one step toward that. It is a place to put neural networks, graphs, gradients, finite precision, CUDA boundaries, verification artifacts, and scientific examples under a common discipline. The project is public because I think this needs to become an ecosystem, not a private artifact.
A shared substrate for theorem aware machine learning: not a replacement for PyTorch, CUDA, CROWN, or scientific solvers, but a Lean based layer where their claims can be stated, checked, and related to a precise model semantics.
Trying it
The repository builds as a normal Lean and Lake project. The public README lists the basic commands and maintained CLI entry points.11
git clone https://github.com/lean-dojo/TorchLean.git
cd TorchLean
lake build
lake exe torchlean --help
lake exe verify --help

For verification examples:
lake exe verify -- torchlean-crown-ops --dtype float
lake exe verify -- torchlean-robustness --dtype float
lake exe verify -- margin-cert
lake exe verify -- abcrown-leaf

For CUDA, build with CUDA explicitly and then use runtime flags:
lake build -R -K cuda=true
lake exe -K cuda=true torchlean mlp --cuda --steps 20
lake exe -K cuda=true nn_tests_suite

Closing thought
The reason TorchLean still feels exciting to me after a year is that it sits between several things I care about: machine learning systems, theorem proving, numerical computing, and science. Each field has its own notion of correctness. TorchLean asks whether those notions can be made to meet around one object: the neural network as a precise semantic object.
That is the story I want the project to tell. We are not only verifying networks because verification is nice. We are trying to build a world where learned computation can enter scientific arguments with a meaning that is explicit, checkable, and shared.
Notes and references
- Collins Dictionary, “The Collins Word of the Year 2025 is... vibe coding”. Collins describes vibe coding as software development that turns natural language into code using AI and notes Karpathy’s role in popularizing the term.
- Google DeepMind, “AI achieves silver medal standard solving International Mathematical Olympiad problems”; Terence Tao, “Formalizing the proof of PFR in Lean4 using Blueprint”; Lean, official project page.
- TorchLean homepage, Formalizing Neural Networks in Lean.
- LeanDojo TorchLean page, TorchLean: Formalizing Neural Networks in Lean; GitHub repository, lean-dojo/TorchLean.
- TorchLean examples, Bug Zoo Walkthrough.
- TorchLean guide, FlashAttention style fused attention; TRUST_BOUNDARIES.md.
- Caltech, AI4Science @ Caltech. The site describes the initiative led by Anima Anandkumar and Yisong Yue as bringing AI researchers together with domain experts across science and engineering.
- TorchLean examples, Verification Bounds.
- TorchLean guide, GPU and CUDA Boundaries.
- TorchLean examples, Text Models Walkthrough.
- GitHub repository, lean-dojo/TorchLean.
- TorchLean preprint, arXiv:2602.22631. I treat the paper as the starting point, and the current site and repository as the source of truth for the broader public codebase described here.