TorchLean: Verified Neural Networks in Lean
After a year of working on TorchLean, I keep coming back to one question: when we say something about a neural network, which exact computation are we talking about?
That sounds pedantic until you follow a real ML workflow. A model may be written as Python code, trained into a checkpoint, exported to ONNX or another graph format, simplified by a compiler, executed through CUDA kernels, and then checked by a verifier that reads yet another artifact. None of those steps is suspicious by itself. The problem is that each step can change a convention: tensor layout, broadcasting, preprocessing, numerical precision, fused operations, masks, cache state, or parameter order.
So by “drift,” I mean something concrete. The verifier may check a graph after export, while deployment runs a fused kernel. A theorem may talk about real numbers, while the model runs in Float32. A text model may be trained with one tokenizer and sampled with another. A certificate may name an input box, but the runtime path may normalize inputs differently. The bug is not always that the code crashes. Often the code runs, but the claim has quietly moved away from the computation.
For the past year, I have been working on TorchLean in the AI for science environment around Anima Anandkumar’s group at Caltech.8 That context shaped the project. In scientific computing, a neural network is often not just a predictor; it is part of a larger claim about a PDE, a controller, a geometry pipeline, a simulator, or a physical system.
TorchLean is my attempt to make that drift harder to ignore. It puts neural-network programs inside Lean so that the model, graph, gradient rule, certificate, Float32 assumption, and runtime artifact can be named in one formal setting.
Why this feels urgent now
Two things are happening at the same time.
The first is that code is becoming easier to produce than to understand. AI assistants can now write scripts, wrappers, exporters, kernels, and verifier harnesses quickly. That is genuinely useful. But it shifts the bottleneck: the hard part is no longer only writing code, but knowing what computation the code implements and whether that claim survived all the interfaces it crossed.
This is the part of “vibe coding” that interests me most. Karpathy’s phrase captures a real change in programming practice, and Collins later named vibe coding its 2025 word of the year.1 For scientific ML, the consequence is not that expertise matters less. It is that checks matter more: shape checks, mask checks, graph checks, certificate checks, numerical checks, and contracts around external tools.
The second is that theorem proving and formal verification are being pulled into the mainstream software conversation. The bigger idea is vericoding: AI-assisted programming where code is generated together with formal specifications, proofs, or certificates that a checker can reject if they are wrong. Max Tegmark and collaborators use the term for LLM-generation of formally verified code from formal specifications, explicitly contrasting it with vibe coding.2 I fully believe some version of this is the future. I do not want AI systems only to produce more plausible glue code. I want them to help produce artifacts that carry checkable contracts.
Lean is one concrete reason this now feels possible. It is both a functional programming language and an interactive theorem prover: you write definitions, state theorems about them, and have a small kernel check that the proof really establishes the statement. This is why Lean has become so visible in mathematics and AI. DeepMind’s AlphaProof used Lean as a formal target for Olympiad-style reasoning, and Terence Tao’s Lean formalization work shows mathematicians using formal systems to make existing arguments more explicit and collaborative.3
The functional programming part matters too. In Lean, the values that matter to a theorem tend to have names: the parameters, the graph, the input box, the random seed, the mask, the Float32 model. That is exactly the discipline I want around ML systems, especially when AI-generated code makes plausible-looking glue code cheap.
My bet is that this shift will matter even more in scientific computing. We are asking whether a learned simulator, a controller, a geometry pipeline, or a numerical certificate actually supports the claim the paper says it supports.
What TorchLean is
Concretely, TorchLean has three parts.4
First, it is an ML programming layer in Lean: typed tensors, layers, training loops, autograd, optimizers, logs, model examples, PyTorch interop, and optional CUDA execution.
Second, it is a specification layer: graph IRs, shape-indexed tensors, scalar models, Float32 and IEEE-style arithmetic, and definitions that say what the model is supposed to compute.
Third, it is a checking layer: autograd proofs, graph checks, IBP and CROWN-style verification, certificate checkers, Bug Zoo contracts, and scientific ML examples.5
What matters is that these layers are meant to meet. I want the graph the verifier checked to be recognizably the model that was written and run.
At a glance, the moving pieces are:
- Typed tensor and layer code written in Lean with a PyTorch-shaped surface.
- An op-tagged SSA DAG with node shapes and a denotation \(\llbracket G \rrbracket\).
- Reverse-mode rules expressed as VJPs over the graph denotation.
- Real, interval, FP32, IEEE32Exec, runtime Float32, and native CUDA floats.
- IBP boxes, CROWN affine bounds, margin checks, and external certificates.
- Runtime acceleration with stated agreements around native kernels and libraries.
- GPT-style text, Mamba, diffusion, RL, FNO, ViT, ResNet, and other examples.
- PDE residuals, controllers, 3D projections, robustness, and geometry checks.
The Lean view of a neural network
A neural network is a composition of tensor operations. TorchLean makes the tensor shape part of the type, so many shape errors become type errors rather than runtime surprises. At the specification layer, tensors can be viewed as total functions over finite index sets. This is not always the fastest representation, but it is a clean reference object for proofs.
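As a rough picture of what putting the shape in the type means, here is a plain Lean 4 sketch; the names Vec, Mat, vadd, and transpose are illustrative, not TorchLean's actual tensor definitions:
def Vec (n : Nat) : Type := Fin n → Float
def Mat (m n : Nat) : Type := Fin m → Fin n → Float

-- Elementwise add only accepts vectors of the same length.
def vadd {n : Nat} (x y : Vec n) : Vec n := fun i => x i + y i

-- Transposition swaps the shape indices in the type, not just the data.
def transpose {m n : Nat} (A : Mat m n) : Mat n m := fun i j => A j i

def u : Vec 3 := fun i => Float.ofNat i.val
def v : Vec 3 := fun _ => 1.0
def w4 : Vec 4 := fun _ => 0.0

#eval vadd u v ⟨2, by decide⟩   -- 3.0
-- vadd u w4                    -- rejected at elaboration: expected Vec 3, got Vec 4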
The graph view is equally important. A model lowers to an op-tagged SSA DAG. Each node has a primitive operation, parent nodes, and an output shape. Because the graph is acyclic, evaluation is defined in topological order. Because each intermediate is assigned once, proofs and checkers can attach values, gradients, boxes, or affine forms to node IDs without ambiguity.
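That graph view can also be sketched in a few lines. The toy IR below uses illustrative names (Op, Graph, Graph.eval) and scalar node values standing in for tensors; the real NN.IR.Graph is richer, but the two properties the text relies on are the same: nodes only point backwards, and evaluation assigns each id exactly one value.
inductive Op where
  | input (idx : Nat)              -- read the idx-th graph input
  | add (a b : Nat)                -- sum of two earlier nodes
  | mulConst (c : Float) (a : Nat) -- scale an earlier node by a constant
  | relu (a : Nat)

structure Graph where
  nodes : Array Op                 -- node i may only refer to nodes j < i

def evalNode (inputs vals : Array Float) : Op → Float
  | .input i      => inputs.getD i 0.0
  | .add a b      => vals.getD a 0.0 + vals.getD b 0.0
  | .mulConst c a => c * vals.getD a 0.0
  | .relu a       => if vals.getD a 0.0 > 0.0 then vals.getD a 0.0 else 0.0

-- Evaluate in topological (id) order; the last entry is the graph output.
def Graph.eval (g : Graph) (inputs : Array Float) : Array Float :=
  g.nodes.foldl (fun vals op => vals.push (evalNode inputs vals op)) #[]

-- y = relu(2 * x0 + x1)
def tiny : Graph := ⟨#[.input 0, .input 1, .mulConst 2.0 0, .add 2 1, .relu 3]⟩

#eval (tiny.eval #[1.5, -4.0]).back?   -- some 0.0: the negative sum is clipped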
For a well-typed graph \(G\) over \(\R\), reverse-mode backpropagation computes the adjoint derivative of the graph denotation:
\[\VJP_G(x;\bar y) = \left(D\llbracket G\rrbracket(x)\right)^\top \bar y.\]
That is the kind of statement I want ML infrastructure to make routine: not “the gradients look right,” but “this graph level algorithm computes the derivative of this denotation under these hypotheses.”
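For a single linear node \(y = W x\), for example, the statement specializes to the familiar adjoints
\[\bar x = W^\top \bar y, \qquad \bar W = \bar y\, x^\top,\]
and the graph-level claim is that reverse mode assembles exactly such per-node rules along the reversed topological order.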
Different parts of the stack carry different kinds of evidence:
- Definitions and theorems over tensors, graphs, selected derivatives, and sound checker cores.
- External JSON or certificate data can be reloaded and checked against a small contract.
- CUDA kernels and CPU stubs are compared by tests, finite-difference checks, and sanitizer runs.
- Native code, vendor libraries, Python, Julia, Arb, and CROWN producers have stated roles.
Bugs become theorem-shaped
One part of TorchLean that I like a lot is the Bug Zoo. These are small case studies for ML bugs that pass ordinary runtime checks. The code still returns tensors, losses, or tokens, but the value no longer satisfies the intended contract.6
This is exactly the kind of bug I worry about in AI-assisted coding. The code looks plausible. The output has the right shape. The notebook runs. But the mask is wrong, the cache position drifted, the tokenizer changed, or the Float32 path is being treated like a real-valued proof.
In TorchLean, these failures become statements. Some are ruled out by types. Some become checked artifacts. Some become named runtime assumptions. Some become small Lean theorems.
- Causal attention masks: a future token can accidentally receive attention mass while all tensor shapes still look correct (see the sketch after this list).
- KV cache and RoPE: incremental decoding can disagree with full-sequence decoding if cache positions and rotary embeddings drift.
- Tokenizer contracts: the model can be correct for one tokenizer and meaningless for another; byte-token demos remove an external BPE dependency.
- Normalization state: training statistics, inference statistics, and explicit inputs should not be silently conflated.
- Batch invariance: one sample processed alone should agree with the same sample inside a batch only under stated conditions.
- Float and autograd bridges: runtime Float32 and proof-facing IEEE arithmetic are related by a named bridge.
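To show how the first of these becomes a checkable statement, here is a minimal sketch in plain Lean over Nat-indexed score functions; maskScore and causalOk are hypothetical names, not the library's attention code. The contract says every future position carries \(-\infty\) before softmax, which forces zero attention weight after it:
def negInf : Float := (-1.0) / 0.0

-- Apply a causal mask to a score function of (query, key) positions.
def maskScore (score : Nat → Nat → Float) (i j : Nat) : Float :=
  if j > i then negInf else score i j

-- Contract: no query position gives weight to a future key position.
def causalOk (n : Nat) (masked : Nat → Nat → Float) : Bool :=
  (List.range n).all fun i =>
    (List.range n).all fun j => j ≤ i || masked i j == negInf

#eval causalOk 4 (maskScore (fun _ _ => 1.0))   -- true
#eval causalOk 4 (fun _ _ => 1.0)               -- false: future tokens would leak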
A favorite example: fused attention without changing the computation
Attention is a good example because it sits exactly where modern ML systems become both mathematical and systems heavy. The mathematical operation is scaled dot product attention. The efficient implementation may fuse scores, masking, softmax, and value multiplication so the full attention matrix is never materialized.
TorchLean’s CUDA guide describes a FlashAttention-style fused attention contract. The Lean side has equality theorem names such as flashAttention_eq_scaledDotProductAttention, flashAttentionBackward_eq_scaledDotProductAttentionBackward, and cudaLoopFlashAttention_eq_scaledDotProductAttention. The fused Lean spec denotes ordinary scaled dot product attention. The native CUDA kernel is a runtime implementation path with separate agreement evidence.7
Informally, for queries \(Q\), keys \(K\), values \(V\), and mask \(M\):
\[\Flash(Q,K,V,M) = \SDPA(Q,K,V,M)\]
where
\[\SDPA(Q,K,V,M) = \softmax\left(\frac{QK^\top}{\sqrt{d}} + M\right)V.\]
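Those equality names belong to the library. As a standalone illustration of why such an equality is plausible, and emphatically not the actual kernel or its Lean spec, here is the online-softmax bookkeeping that fused attention relies on, written over plain lists of floats; softmaxRef, streamStats, and softmaxStream are illustrative names:
-- Reference: two-pass softmax with max subtraction for stability.
def softmaxRef (xs : List Float) : List Float :=
  let m := xs.foldl (fun a x => if x > a then x else a) ((-1.0) / 0.0)
  let e := xs.map (fun x => Float.exp (x - m))
  let s := e.foldl (· + ·) 0.0
  e.map (· / s)

-- One streaming pass: a running max and a running rescaled sum,
-- the same bookkeeping a fused kernel keeps per block.
def streamStats (xs : List Float) : Float × Float :=
  xs.foldl
    (fun (acc : Float × Float) x =>
      let (m, s) := acc
      let m' := if x > m then x else m
      (m', s * Float.exp (m - m') + Float.exp (x - m')))
    ((-1.0) / 0.0, 0.0)

def softmaxStream (xs : List Float) : List Float :=
  let (m, s) := streamStats xs
  xs.map (fun x => Float.exp (x - m) / s)

#eval softmaxRef    [1.0, 2.0, 3.0]   -- both evaluate to ≈ [0.090, 0.245, 0.665]
#eval softmaxStream [1.0, 2.0, 3.0]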
Why I care about scientific computing
Scientific computing is where this feels most important to me.
A learned PDE surrogate can sit inside a solver, an inverse-design loop, or a decision system. A neural controller can become part of a safety argument. A 3D perception model can produce geometry that another system treats as a certificate about the world.
In these settings, the model is embedded inside a larger chain of reasoning. That chain includes data, numerical methods, finite precision, kernels, solvers, and physical interpretation. TorchLean is useful because it gives us a place to ask, piece by piece, which parts of that chain have been run, checked, proved, or assumed.
Certificates: external tools as producers, Lean as checker
There is a pragmatic lesson from neural network verification: the strongest bound search procedures are often complex. You may want an external solver to optimize CROWN slopes, split regions, call a Python library, or produce a JSON artifact. TorchLean can treat those tools as producers.
The verification tutorial describes a simple path: write or import a model, compile it to NN.IR.Graph, attach an input box, run a bound engine, and inspect or check output bounds. For margin properties, a lower bound on logit_true - logit_other can certify all inputs in a box when the bound stays positive.9
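To make the margin pattern concrete, here is a tiny self-contained sketch for a single linear layer, not the TorchLean bound engine or its certificate format; Ival, dotBox, and marginLower are hypothetical names, and real engines propagate such bounds through the whole graph:
structure Ival where
  lo : Float
  hi : Float

def Ival.add (a b : Ival) : Ival := ⟨a.lo + b.lo, a.hi + b.hi⟩

-- Multiply an interval by a known constant weight.
def Ival.scale (c : Float) (a : Ival) : Ival :=
  if c ≥ 0 then ⟨c * a.lo, c * a.hi⟩ else ⟨c * a.hi, c * a.lo⟩

-- Bound w · x + b over a box of inputs x.
def dotBox (w : List Float) (x : List Ival) (b : Float) : Ival :=
  (List.zip w x).foldl (fun acc p => acc.add (Ival.scale p.1 p.2)) (Ival.mk b b)

-- Lower bound of (logit_true - logit_other) over the box: positive means certified.
def marginLower (wTrue wOther : List Float) (bTrue bOther : Float)
    (x : List Ival) : Float :=
  let diffW := (List.zip wTrue wOther).map (fun p => p.1 - p.2)
  (dotBox diffW x (bTrue - bOther)).lo

#eval marginLower [1.0, 0.5] [0.2, 0.1] 0.3 0.0 [⟨-0.1, 0.1⟩, ⟨0.4, 0.6⟩]
-- ≈ 0.38 > 0: the margin property holds for every input in the box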
The pattern generalizes. An external tool may search. Lean checks the small statement that matters: the shapes match, the nodes match, the boxes or affine bounds support the final inequality, and the imported artifact has the format the checker expects.
CUDA belongs in the story
CUDA belongs in the story because scientific ML does not run only on clean mathematical definitions. It runs on accelerators, vendor libraries, fused kernels, reductions, and memory layouts. The CUDA guide describes an opt-in runtime backend for supported Float32 tensor work such as matmul, convolution, reductions, attention helpers, FFT and FNO kernels, selective scan and Mamba support, plus sanitizer and parity harnesses.10
TorchLean’s choice is to let CUDA be useful without letting CUDA silently define the theorem. The Lean-side operation still has a specification. The native path is an implementation path. If a claim depends on that path, the agreement has to be named, tested, or eventually proved.
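A sketch of what a named, testable agreement can look like, assuming nothing about the actual parity harness; agreesWithin is a hypothetical helper:
-- Agreement within a tolerance is an explicit, executable statement.
def agreesWithin (tol : Float) (ref native : Array Float) : Bool :=
  ref.size == native.size &&
    (ref.zip native).all (fun p => Float.abs (p.1 - p.2) ≤ tol)

#eval agreesWithin 1e-6 #[1.0, 2.0] #[1.0, 2.0 + 1e-7]   -- true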
This is also a good example of the personal reason I care about the project. If theorem proving is going to matter for science, it has to meet accelerators, vendor libraries, and floating point without becoming vague about the contract.
Why the model zoo matters
The model zoo is there because different model families expose different sources of drift.
A GPT-style model brings tokenization, positions, causal masks, sampling, and saved parameter packs. Mamba brings recurrent scan state. Diffusion brings schedules and iterative sampling. FNO brings spectral convolution and scientific data. RL brings environment transitions and trajectory records. Bug Zoo collects the cases where these details go wrong but the code still runs.
These examples are compact, but they are not decorative. They force the library to handle the kinds of structure that real ML systems use.
The TorchLean text demos tokenize bytes directly. Every UTF-8 byte is a token, so the vocabulary is fixed at 256. The docs say this avoids an external BPE file, tokenizer version ambiguity, and hidden cache dependencies.11 That may sound minor, but it is the kind of design choice that makes a verification claim easier to state.
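In plain Lean terms, and as an illustrative helper rather than the demo's own code, the encoder is just a byte map:
def encodeBytes (s : String) : List Nat :=
  s.toUTF8.toList.map (·.toNat)

#eval encodeBytes "FNO"                      -- [70, 78, 79]
#eval (encodeBytes "∂u/∂t").all (· < 256)    -- true: every token id fits the 256-word vocabulary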
Finite precision is not a footnote
Mathematical neural networks often live over \(\R\). Deployed neural networks do not. They run with Float32, library conventions, signed zeros, NaNs, subnormals, FMA behavior, reduction order, and sometimes approximate math modes.
TorchLean therefore separates numeric roles. Real arithmetic is useful for reference proofs and calculus. Interval domains are useful for enclosures. \(\texttt{FP32}\) and related rounded-real models are useful for error envelopes. \(\texttt{IEEE32Exec}\) gives executable IEEE-style binary32 behavior inside Lean. Runtime Float32 and native CUDA floats are useful for speed, but they enter through named bridges and assumptions.
This may feel like bookkeeping, but it is central to the project. A real-valued margin of \(10^{-4}\) is not useful if the Float32 error budget is \(10^{-3}\). The theorem has to say which case we are in.
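Lean's built-in Float is binary64 rather than binary32, but even it shows why the numeric model has to be named; reduction order changes the answer:
#eval ((1.0e16 + 1.0) + 1.0) == (1.0e16 + (1.0 + 1.0))   -- false: the two sums differ by 2.0
#eval (0.1 + 0.2) == 0.3                                  -- false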
The vision
I do not think the future is that every ML researcher writes long Lean proofs by hand before running an experiment. That would be the wrong lesson. The future I want is one where more of the stack becomes checkable by construction.
Models should carry shape contracts. Transformations should carry mathematical contracts. Runtime paths should be named. Certificates should be small enough to check. Scientific claims should be connected to the computation that produced them. AI-generated code should have places where its contracts are tested along with its syntax.
TorchLean is one step toward that. It is a place to put neural networks, graphs, gradients, finite precision, CUDA paths, verification artifacts, and scientific examples under a common discipline. The project is public because I think this needs to become an ecosystem, not a private artifact.
A shared substrate for theorem-aware machine learning: not a replacement for PyTorch, CUDA, CROWN, or scientific solvers, but a Lean-based layer where their claims can be stated, checked, and related to the model being used.
Trying it
The repository builds as a normal Lean and Lake project. The public README lists the basic commands and maintained CLI entry points.12
git clone https://github.com/lean-dojo/TorchLean.git
cd TorchLean
lake build
lake exe torchlean --help
lake exe verify --help
For verification examples:
lake exe verify -- torchlean-crown-ops --dtype float
lake exe verify -- torchlean-robustness --dtype float
lake exe verify -- margin-cert
lake exe verify -- abcrown-leaf
For CUDA, build with CUDA explicitly and then use runtime flags:
lake build -R -K cuda=true
lake exe -K cuda=true torchlean mlp --cuda --steps 20
lake exe -K cuda=true nn_tests_suite
Closing thought
TorchLean still feels exciting to me because it sits at the intersection of things that are usually kept apart: neural-network systems, theorem proving, numerical computing, and scientific modeling.
Each of those areas has its own idea of correctness. A training run can be successful. A theorem can be checked. A numerical method can be stable. A CUDA kernel can be fast. A scientific model can be useful. The hard question is how these notions meet when they are all part of the same pipeline.
That is the question I want TorchLean to make precise. Not because every ML researcher should write long Lean proofs by hand, but because learned computation is becoming part of serious scientific arguments. When that happens, the model should carry more than weights. It should carry a contract we can inspect, check, and share.
Notes and references
1. Collins Dictionary, “The Collins Word of the Year 2025 is... vibe coding”. Collins describes vibe coding as software development that turns natural language into code using AI and notes Karpathy’s role in popularizing the term.
2. Sergiu Bursuc et al., “A benchmark for vericoding: formally verified program synthesis”, arXiv:2509.22908. The paper defines vericoding as LLM-generation of formally verified code from formal specifications, in contrast to vibe coding, and includes Max Tegmark among the authors.
3. Google DeepMind, “AI achieves silver medal standard solving International Mathematical Olympiad problems”; Terence Tao, “Formalizing the proof of PFR in Lean4 using Blueprint”; Lean, official project page.
4. TorchLean homepage, Formalizing Neural Networks in Lean.
5. LeanDojo TorchLean page, TorchLean: Formalizing Neural Networks in Lean; GitHub repository, lean-dojo/TorchLean.
6. TorchLean examples, Bug Zoo Walkthrough.
7. TorchLean guide, FlashAttention style fused attention; TRUST_BOUNDARIES.md.
8. Caltech, AI4Science @ Caltech. The site describes the initiative led by Anima Anandkumar and Yisong Yue as bringing AI researchers together with domain experts across science and engineering.
9. TorchLean examples, Verification Bounds.
10. TorchLean guide, GPU and CUDA Boundaries.
11. TorchLean examples, Text Models Walkthrough.
12. GitHub repository, lean-dojo/TorchLean.
13. TorchLean preprint, arXiv:2602.22631. I treat the paper as the starting point, and the current site and repository as the source of truth for the broader public codebase described here.