Spec vs Runtime Refinement

Robert Joseph George·June 2026

Here is a problem that had been quietly bothering me. You spend weeks proving that your neural network satisfies some formal property: robustness, output bounds, certified behavior. The proof is over real numbers, with clean algebra and exact arithmetic. But the model that actually ships runs in Float32. Every single multiply rounds. Every add rounds. The proof and the runtime are about two different mathematical objects, and almost nobody writes that gap down anywhere.

I came to this thinking the answer was to somehow prove they are equal. That turns out to be the wrong goal. They are not equal, and pretending otherwise is where verified ML systems quietly break. The right answer is a refinement theorem: a formal statement that the runtime output lies within some explicit, checked bound of the real-valued spec. Small shift in framing, large shift in what you actually have to prove.

This changed how I thought about TorchLean. It is not enough for the system to explain the neural network over exact real numbers. It also has to say how far the Float32 computation is allowed to move away from that idealized model. Otherwise the formal spec and the deployed model sit next to each other looking related, but no theorem actually connects them.

Why this matters

Let me be concrete about why this gap matters. When a neural network flags an abnormality in a CT scan, decides whether a car should brake, or computes actuator commands for a flight controller, the mathematical proof we wrote about its behavior is a proof about the real-number version of the model. The CUDA kernels, the fused multiply-adds, the actual bits on hardware: those compute in Float32. Or Float16. Or bfloat16. The proof lives in one universe; the deployed artifact lives in another.

The standard workflow in verified ML is to write the spec over real numbers, prove your theorems there, and then kind of hope the floating-point implementation is close enough. The rounding errors are small individually, around \(2^{-24} \approx 6 \times 10^{-8}\) per operation in Float32, but they compound across hundreds of layers. On adversarially chosen inputs, that compounding can flip a classification boundary. Jia and Rinard showed this concretely in 2020: formally verified robustness guarantees, stated over real arithmetic, violated by the actual floating-point model.

What we need is a bridge: a formal statement connecting the two worlds. On one side is the real-valued spec and its clean properties. On the other side is the Float32 runtime and its rounding errors. The bridge is a quantitative refinement theorem: for every input in the domain, the floating-point output lies within an explicit checked bound of the spec output. Once that bound is in the theorem, downstream safety claims can include numerical error instead of leaving it implicit.

I started with the smallest case: a single dot product. Not because dot products are the hard part, but because they are the atomic unit where the gap first appears. Every linear layer is a batch of dot products. Every attention head is dot products. If we cannot state the refinement theorem for one dot product cleanly, we cannot compose it through a whole network. And TorchLean is exactly the system where that composition needs to happen.

Where this fits in prior work

This note is not claiming that floating-point proof is new. It is very much not new. Numerical analysts have been writing this kind of theorem for decades: define the exact mathematical answer, define the computed answer, and prove a forward or backward error bound explaining how far the computed answer can drift. Higham's Accuracy and Stability of Numerical Algorithms is the standard reference for this mindset.

\[ \widehat{y}=y+\Delta y, \qquad \lVert \Delta y\rVert \leq \text{explicit error bound}. \]

What is different here is the object being connected. The exact object is no longer just a small numerical kernel in isolation. It is a real-valued neural-network specification. The computed object is not just a line of C arithmetic. It may be a TorchLean graph, an executable IR, an ATen call, a CUDA kernel, a Float32 path, or eventually a low-precision accelerator path. So the classical numerical-analysis shape has to be attached to a modern ML runtime stack.

There is also a long line of formal floating-point tools. Flocq formalizes floating-point arithmetic in Coq/Rocq. Gappa proves properties of numerical programs over floating-point and fixed-point arithmetic. FPTaylor computes rigorous roundoff-error bounds. PRECiSA estimates roundoff error and emits proof certificates. VCFloat gives Coq infrastructure for floating-point roundoff proofs about C programs. This note borrows that numerical-analysis shape and tries to attach it to a neural-network runtime.

On the neural-network side, there is a separate and very large verification literature: Reluplex, Marabou, ERAN, and \(\alpha,\beta\)-CROWN are all examples of serious systems for proving neural-network properties using SMT, abstract interpretation, bound propagation, and branch-and-bound. I am not arguing with those verifiers here. The narrower issue is this: if the verified object is real-valued and the deployed object is floating-point, then the connection between them has to be stated somewhere.

Jia and Rinard made this failure mode explicit for neural-network verification: if a verifier reasons about one arithmetic model while inference or verification runs in another, floating-point effects can invalidate the claimed robustness guarantee. TorchLean is trying to make that seam visible: the spec, graph semantics, executable semantics, finite precision model, and runtime assumptions should all be explicit enough that Lean can check how they fit together.

Where Aristotle enters

TorchLean sits right on this fault line. On one side: the mathematical neural network, tensors over \( \mathbb{R} \), exact dot products, exact nonlinearities, a clean computational graph. On the other side: Float32, CUDA, ATen, fused kernels, and many places where the bits being computed are not quite what the spec says. The project is trying to connect those two worlds.

This is where the Aristotle story is different from the AC story. For the Andrews-Curtis blog, we had already built a large manual Lean formalization, and Aristotle mostly acted as a reviewer. Here, Aristotle helped create the first formal problem shape. I gave it the informal question and asked it to turn the gap into Lean objects, theorem stubs, and hypotheses.

The question was basic:

\[ \text{How should } f_{\mathbb{R}} \text{ and } f_{\mathrm{rt}} \text{ be related?} \]

Here \(f_{\mathbb{R}}\) is the real-valued specification: the clean mathematical function we would like to reason about. The symbol \(f_{\mathrm{rt}}\) is the runtime path: the finite-precision computation that is actually executed. The note is about proving a relationship between those two objects, not pretending they are definitionally the same.

The generated Lean file did not prove TorchLean correct. That would be too much to claim. What it did was more useful at this stage: it gave me a compact theorem shape that matched the way numerical analysts talk about roundoff.

The claim I care about is not “this Python model, this Lean graph, this CUDA kernel, and this theorem all look morally similar.” The claim should be more exact:

\[ \forall x \in \mathcal{D},\qquad \lVert f_{\mathrm{rt}}(x)-f_{\mathbb{R}}(x)\rVert \leq \varepsilon(x). \]

That is the LaTeX version of the note. The Lean version replaces each symbol by an explicit definition: a tensor shape, a rounding model, a runtime evaluator, and a proof term for the inequality.

The conversation loop

The loop was simple. Ask Aristotle to formalize the numerical claim. Compare the generated Lean shape against TorchLean's existing files. Feed back the mismatch. Keep the part that clarified the theorem and throw away the part that did not fit.

The first object that helped was a FloatModel: a rounding function, a unit roundoff \(u\), and the proof that rounding \(x\) stays within \(u|x|\). That was the right abstraction. It did not try to model all of IEEE 754 at once. It named the contract needed by the dot-product theorem.

The second useful move was the bridge back to TorchLean. TorchLean's local FP32 facts were not packaged exactly as a single relative-error field. They came as an absolute error bound plus a proof that this absolute error is below \(u|x|\) on the domain we care about. Aristotle helped turn that mismatch into the adapter FloatModel.ofTorchLeanFP32. That adapter is the kind of small theorem-shaping move that matters a lot in a verified runtime.

The third useful move was forcing schedules and range assumptions into the statement. Are we modeling left-to-right accumulation, tree reduction, or FMA? Are we ruling out overflow and NaNs? Are subnormals included? Those questions sound annoying until you realize they are exactly where a spec/runtime theorem can lie to you if they stay implicit.

Equality is too strong

The naive approach, and the one I initially reached for, is to try to prove this directly:

\[ \texttt{floatRuntime}(x)=\texttt{realSpec}(x). \]

This is just false. Not sometimes false, straightforwardly false for typical neural network computations. A Float32 dot product rounds after every multiply and every add. A GPU may use FMA, fusing those into a single rounding event. A CUDA reduction might go left-to-right in one schedule and tree-reduce in another. The arithmetic object running on hardware is not the real-valued spec. Claiming equality would not be a theorem; it would be a bug in the proof.

The theorem I actually want is quantitative:

\[ |\texttt{floatRuntime}(x)-\texttt{realSpec}(x)| \le \varepsilon(x). \]

That inequality is the bridge. If the downstream margin is larger than the error, the decision is stable. If it is not, Lean should not pretend the runtime has the exact behavior of the real spec.

Key reframe: equality is a degenerate special case (only true when roundoff is literally zero). For real hardware, the correct interface between a spec and its runtime is always a refinement theorem with a named, quantitative bound.

For neural networks, the user usually cares about a higher level claim. Maybe a classifier is robust inside an input ball. Maybe a PDE surrogate respects a residual tolerance. Maybe a token choice is stable under small logit drift. Each of those claims has to budget for the numerical error introduced by the runtime.

The floating point contract

The Lean file starts with a small abstraction. A floating point model has a unit roundoff \(u\), a rounding function, and a proof that rounding error is controlled.

\[ \operatorname{rnd}(x)=x+\delta, \qquad |\delta|\leq u|x|. \]

Lean: the standard rounding model

-- A FloatModel is the tiny contract we need from a runtime.
-- It does not say how rounding is implemented.
-- It only says: every rounded value stays within u * |x| of the exact value.
structure FloatModel where
  -- Unit roundoff, e.g. roughly 2^-24 for Float32 in the normal range.
  u : ℝ
  -- Lean needs this because later inequalities use that u is nonnegative.
  u_nonneg : 0 ≤ u
  -- The mathematical rounding operation.
  rnd : ℝ → ℝ
  -- The key promise: rounding x is close to x.
  rnd_err : ∀ x, |rnd x - x| ≤ u * |x|

The displayed equation is what one would write in a numerical analysis note. The Lean structure is the same statement with the pieces named: rnd is the rounding map, u is the unit roundoff, and rnd_err is the proof that every call to rnd satisfies the error inequality.

This is not native Lean Float. Native Float is an implementation type. For a proof, we want the mathematical contract: every rounded result is close to the exact value, with the error controlled by \(u\).

The fields are doing real work. u is the unit roundoff. rnd is the mathematical rounding operation. rnd_err is the promise that every rounded value stays close to the exact value. Once that promise is available locally, Lean can compose it through a larger computation.

For real IEEE-754 binary32, this relative-error model needs hypotheses. We need to stay in a normal range and rule out overflow, infinities, NaNs, and the subnormal corner unless those are modeled separately. That is not a problem with the theorem. It is exactly the kind of boundary the theorem forces us to write down.

The runtime model

The hardest part of writing the spec/runtime comparison was deciding what to model. You could model individual FMA instructions, GPU warp-level reduction trees, the full ATen dispatch table. All more accurate, all unmanageable in a proof. The decision I landed on, with Aristotle, was the most abstract model that still captures the essential behavior: a recursive accumulator where each element contributes exactly two rounding events, one for the multiply and one for the add into the accumulator.

Then Aristotle wrote the tiny comparison I actually wanted. The real spec is an exact dot product. The runtime model is the same dot product, except each multiply and each add is rounded.

\[ \operatorname{dot}_{\mathbb{R}}(w,x)=\sum_i w_i x_i, \qquad \operatorname{dot}_{\mathrm{fp}}(w,x) = \operatorname{rnd}\!\left( \operatorname{rnd}(w_0x_0)+\cdots \right). \]

Lean: exact real valued spec

-- dotR is the ideal mathematical dot product.
-- Pattern matching on the list is just recursion over the input pairs.
def dotR : List (ℝ × ℝ) → ℝ
  -- Empty list contributes zero.
  | [] => 0
  -- p.1 is the weight, p.2 is the input, and t is the remaining list.
  | p :: t => p.1 * p.2 + dotR t

Lean: rounded runtime model

-- dotF is the runtime-shaped model.
-- It uses the same list recursion, but explicitly rounds the multiply and the add.
def dotF (M : FloatModel) : List (ℝ × ℝ) → ℝ
  | [] => 0
  | p :: t =>
      -- First round p.1 * p.2, then round the accumulator addition.
      M.rnd (M.rnd (p.1 * p.2) + dotF M t)

This is small, but it is the right kind of small. It does not hide the runtime in a sentence like “computed in floating point.” It says exactly where rounding happens.

The two M.rnd calls in dotF are deliberate and not redundant. The inner call M.rnd (p.1 * p.2) captures that a hardware multiplier rounds the product \(w_i x_i\) before it ever enters the accumulator register. This is the multiply rounding event. The outer call M.rnd (... + dotF M t) captures that adding the rounded product into the running sum is itself a floating-point addition, which is also rounded according to IEEE 754. That is the scalar, non-FMA accumulator model: multiply then round, accumulate then round, one element at a time. An FMA hardware unit fuses those into a single rounding event, which is a different model requiring a different dotF definition and a tighter error bound.

In dotR, the product and the addition are exact. In dotF, the product is rounded first, and then the addition into the accumulator is rounded too. That is the whole difference between the spec and this runtime model, and Lean can now see it.

It also shows why schedules matter. This dotF chooses a particular recursive reduction order. A different runtime schedule needs a different model or a proof that it refines the same one.

The refinement theorem

The main theorem says that the rounded runtime stays close to the exact spec. The error grows with the number of rounded operations and with the size of the products being summed. The error grows because each of the \(|l|\) elements contributes two rounded operations: the relative error from each operation compounds multiplicatively, not additively, which is why the bound uses a power of \((1+u)\) rather than a simple multiple.

\[ |\operatorname{dotF}_M(l)-\operatorname{dotR}(l)| \le \big((1+u)^{2|l|}-1\big) \sum_i |w_i x_i|. \]

Lean: final refinement theorem

-- This is the theorem shape we want:
-- runtime output is not equal to the real spec,
-- but it is within an explicit bound.
theorem dotF_error_bound :
    ∀ l : List (ℝ × ℝ),
      |dotF M l - dotR l| ≤
        ((1 + M.u) ^ (2 * l.length) - 1) * sumAbs l := by
  -- Proof idea: induct over the list.
  -- At each cons cell, use M.rnd_err for the multiply and again for the add.
  ...

This is just a forward-error bound. That is why I like it. It gives TorchLean the statement a layer needs: here is the exact spec, here is the finite-precision runtime model, and here is the quantitative gap.

The coefficient \(\big((1+u)^{2|l|}-1\big)\) is a compact way of counting accumulated roundoff. The \(2|l|\) appears because each list element contributes a rounded multiply and a rounded add. The \(\sum_i |w_i x_i|\) term says the error scales with the size of the values being accumulated, not only with the number of operations.

A concrete worked example

To make the accumulation intuitive, walk through a 3-element dot product \(w_1 x_1 + w_2 x_2 + w_3 x_3\) step by step in a left-to-right scalar accumulator (no FMA). Write \(p_i = w_i x_i\) for the exact products.

Step 1: round the first product. The hardware computes \(\hat{p}_1 = \operatorname{rnd}(p_1) = p_1(1+\delta_1)\) where \(|\delta_1| \leq u\). The accumulator starts at \(a_1 = \hat{p}_1\).

Step 2: round the second product, then round the addition. The hardware computes \(\hat{p}_2 = p_2(1+\delta_2)\) and then rounds the sum into the accumulator: \(a_2 = \operatorname{rnd}(a_1 + \hat{p}_2) = (a_1 + \hat{p}_2)(1+\delta_3)\). Already two rounding events have touched \(p_2\): one from the multiply (\(\delta_2\)) and one from the addition into the accumulator (\(\delta_3\)).

Step 3: repeat the pattern for the third element. \(\hat{p}_3 = p_3(1+\delta_4)\) and \(a_3 = (a_2 + \hat{p}_3)(1+\delta_5)\). The first product \(p_1\) has now been through three rounding events: \(\delta_1\) (its own multiply), \(\delta_3\) (the first accumulate-add), and \(\delta_5\) (the second accumulate-add).

Multiplying these factors out, the total error on \(p_1\) alone is bounded by \((1+u)^3 - 1 \approx 3u\) for small \(u\). The last element \(p_3\) gets only two rounding events. Combining the worst case across all elements gives \((1+u)^{2n} - 1\) for a list of length \(n\), exactly the exponent in the theorem. The factor 2 comes from the two rounding events, multiply and add, that each element contributes in the FMA-less model.

To make the scale concrete: for Float32, \(u = 2^{-24} \approx 6 \times 10^{-8}\). A 512-element dot product gives \(2n = 1024\) and a bound of roughly \((1 + 6\times 10^{-8})^{1024} - 1 \approx 6.1 \times 10^{-5}\) times \(\sum |w_i x_i|\). That is small for a single layer, but it accumulates across every layer in a deep network. That is why the theorem must compose cleanly when layers are stacked.

The generated file also had the sanity check I wanted. If the unit roundoff is zero, the bound collapses and the runtime equals the spec.

Lean: equality as a special case

-- If the unit roundoff is zero, rounding introduces no error.
-- Then the runtime-shaped model collapses back to the real spec.
theorem dotF_eq_dotR_of_u_zero
    (h : M.u = 0) (l : List (ℝ × ℝ)) :
    dotF M l = dotR l := by
  -- Use the error bound above. Its right side becomes zero.
  have hb := dotF_error_bound M l
  ...

That is the right place for equality. Exact arithmetic gives equality. Float32 gives a refinement theorem.

Back to TorchLean

The part I care about for TorchLean is the bridge. TorchLean already has finite-precision definitions and runtime contracts. The abstract FloatModel should be something we instantiate from those existing facts, not a competing float library.

Lean: bridge shape for TorchLean FP32

-- Adapter from TorchLean's local FP32 facts into the simple FloatModel contract.
-- This is the "bridge" object: it packages low-level runtime facts
-- into the abstraction the theorem above expects.
def FloatModel.ofTorchLeanFP32
    (round eps : ℝ → ℝ) (u : ℝ) (hu : 0 ≤ u)
    -- habs: the runtime has some absolute rounding-error bound eps x.
    (habs : ∀ x, |round x - x| ≤ eps x)
    -- hulp: that absolute bound is small enough to be a relative u * |x| bound.
    (hulp : ∀ x, eps x ≤ u * |x|) :
    FloatModel where
  u := u
  u_nonneg := hu
  rnd := round
  -- This lemma composes habs and hulp into the one field FloatModel needs.
  rnd_err := relErr_of_halfUlp habs hulp

In words: if TorchLean proves an absolute rounding bound, and then proves that this absolute bound is below \(u|x|\) on the range we care about, it can instantiate the abstract rounding model. Then the dot-product theorem applies.

That is the architecture I want more of. First prove local facts about rounding, FMA, ATen calls, or CUDA kernels. Then lift those facts into dot products, layers, graphs, logits, and final decisions.

For example, a TorchLean linear layer theorem should eventually have three visible pieces. The real spec says what the layer means over exact tensors. The executable model says what the checked graph evaluator computes. The runtime contract says when a backend call is allowed to stand in for that executable model.

This is also where small tests matter. They do not replace theorem proving, but they make the boundaries concrete. A tiny deterministic MLP can be run through the TorchLean program path and the mathematical spec path with the same initialized parameters. If those disagree, the problem is probably a shape convention, parameter ordering, or graph semantics bug.

Small model checks and bugs

The abstract theorem shape quickly connected to very ordinary bugs. TorchLean has little checked examples that ask: does the spec MLP agree with the TorchLean forward path? Does the IR denotation agree with the executable graph? Is BatchNorm using epsilon in the right place? Did we make the Float32 boundary explicit rather than trusting native runtime arithmetic by default?

That is the workflow I want from a formal ML stack: first state the clean theorem, then keep a zoo of small examples where real systems usually go wrong.

Spec versus TorchLean MLP

A deterministic \(2 \to 3 \to 1\) MLP is initialized by explicit seeds, interpreted as a spec model, compiled as a TorchLean program, and compared coordinate by coordinate.

IR versus executable graph

The same small model is compiled to an IR graph and to executable graph data. The check catches disagreement between denotational semantics and the compiled evaluator.

BatchNorm state and epsilon

The spec separates the correct formula \((x-\mu)/\sqrt{\sigma^2+\varepsilon}\) from the common bug \((x-\mu)/(\sqrt{\sigma^2}+\varepsilon)\), and makes running statistics explicit inputs.

Float32 trust boundary

Runtime Float32 arithmetic only rewrites to the explicit bit-level IEEE executor under a named runtime-conformance assumption.

Lean: a tiny MLP regression check

-- Same initialized parameters, two meanings:
--   1. the mathematical Spec MLP
--   2. the TorchLean compiled forward path
-- If these disagree on a deterministic toy model, something basic is wrong.
let ySpec : Tensor Float yShape := Examples.mlpForward (α := Float) l1 l2 x

let yTorch : Tensor Float yShape :=
  Runtime.Autograd.Torch.CompiledOut.forward compiled args

-- Compare the first output coordinate with a small tolerance.
assertApprox "mlp forward[0] spec/torchlean"
  (vecVal ySpec 0) (vecVal yTorch 0) 1e-6

One tiny MLP does not prove all of TorchLean. It tests the seam where mistakes are surprisingly easy: weight layout, bias layout, input shape, layer order, and the meaning of Linear → ReLU → Linear.

Lean: the Float32 runtime boundary

-- This theorem does not blindly trust native Float32 addition.
-- It requires an explicit assumption saying the runtime operation matches
-- the bit-level IEEE executor.
theorem runtimeFloat32_add_rewrites_to_ieee32
    [RuntimeFloat32MatchesIEEE32Exec] (a b : F32) :
    toIEEE32Exec (a + b) =
      IEEE32Exec.add (toIEEE32Exec a) (toIEEE32Exec b) :=
  RuntimeFloat32MatchesIEEE32Exec.toIEEE32Exec_add a b

I like this theorem because the trust assumption is visible. If the runtime primitive matches the bit-level executable IEEE model, then runtime addition rewrites to that model. Without that assumption, Lean does not give us the rewrite.

Lean: BatchNorm formula bug as a spec-level contrast

-- These two formulas look annoyingly similar in prose.
-- In Lean they are different functions, so the parenthesis bug is visible.
def wrongEpsilonOutsideSqrt
    (x mean variance gamma beta epsilon : ℝ) : ℝ :=
  ((x - mean) / (Real.sqrt variance + epsilon)) * gamma + beta

-- This is the standard BatchNorm denominator: sqrt(variance + epsilon).
def correctEpsilonInsideSqrt
    (x mean variance gamma beta epsilon : ℝ) : ℝ :=
  ((x - mean) / Real.sqrt (variance + epsilon)) * gamma + beta

That example is small, but it is the same species of bug that shows up in real deep-learning libraries: a formula looks familiar, a parenthesis moves, and numerical behavior changes. TorchLean turns that kind of ambiguity into a named expression and a checked theorem about the actual spec.

Lean: batch-invariance target

-- Semantic target for batching:
-- run f over the whole batch, then select row i,
-- equals selecting row i first and running f on that one row.
theorem mapBatch_select_eq_single
    (f : Spec.Tensor α sIn → Spec.Tensor α sOut)
    (xs : Spec.Tensor α (.dim batch sIn))
    (i : Fin batch) :
    Spec.getAtSpec (mapBatch f xs) i = f (Spec.getAtSpec xs i) := by
  -- The current definition is set up so this reduces directly.
  cases xs
  rfl

This theorem is the clean semantic target for batching: evaluating one row inside a batch should match evaluating that row alone. A high-performance runtime may use dynamic batching or different kernels, but then that behavior belongs in the contract. It is not invisible implementation dust.

Lean: safe autograd domain before masking

-- Divide safely first, then apply the mask.
-- This avoids creating a bad division node that is merely hidden later.
def maskAfterSafeDiv
    (mask numerator denominator : Spec.Tensor α s) : Spec.Tensor α s :=
  Spec.Tensor.mulSpec mask
    (Spec.Tensor.safedivSpec numerator denominator)

-- The theorem unfolds the definition and confirms that safediv really
-- uses denominator + epsilon before the mask is applied.
theorem maskAfterSafeDiv_uses_epsilon_denominator :
    maskAfterSafeDiv mask numerator denominator =
      Spec.Tensor.mulSpec mask
        (Spec.Tensor.map2Spec
          (fun a b => a / (b + Numbers.epsilon))
          numerator denominator) := by
  rfl

This is another small example that catches a real kind of bug. If an undefined division is recorded before a mask is applied, the forward result may look harmless while the backward graph is already poisoned. The TorchLean spec can choose the safe graph shape first, then mask the protected division.

What is not proved yet

This is not a proof that TorchLean is correct or that Float32 neural networks are safe. It is a worked theorem shape for a single abstract dot product, formalized with its assumptions visible.

To make this production ready inside TorchLean, several obligations still have to be made explicit.

First, the relative-error model needs range assumptions. Overflow, NaNs, infinities, and subnormals either need to be ruled out by preconditions or modeled directly.

Second, the theorem is specific to a schedule. If a backend changes the reduction order, the proof has to mention that. This connects directly to the batch invariance problem: schedules are part of the runtime contract, not an implementation detail we can ignore.

Third, a real backend needs a trusted or certified bridge. If TorchLean calls ATen, CUDA, or another external runtime, Lean needs a contract saying which mathematical operation that call refines.

The sentence I want TorchLean to say: given this real valued spec and this runtime path, the runtime output is within this checked error bound, and that bound is small enough for the claim we want to make.

References

Harmonic. Aristotle: IMO-level Automated Theorem Proving. The paper describes Aristotle as a system that combines formal verification, Lean proof search, informal reasoning, and auxiliary solvers.
Harmonic. Harmonic website. Source for the Harmonic mark used in the header.
Caltech. California Institute of Technology. Institutional context for the research environment around this project.
TorchLean. TorchLean project site. Neural-network verification infrastructure in Lean, including executable runtime paths and finite-precision models.
Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. Standard reference for floating point forward error analysis.
Flocq. Floating-point formalization for Coq/Rocq, with a comprehensive library of floating-point arithmetic theorems.
Gappa. A tool for verifying properties of numerical programs involving floating-point or fixed-point arithmetic.
FPTaylor. A tool for rigorous estimation of floating-point roundoff errors.
PRECiSA. NASA Langley roundoff-error analyzer that computes over-approximations and proof certificates.
VCFloat. Coq infrastructure for floating-point roundoff proofs, including C-program-oriented workflows.
Jia and Rinard. “Exploiting Verified Neural Networks via Floating Point Numerical Error.” A useful reminder that real-valued robustness claims do not automatically transfer to floating-point execution.
Neural-network verifiers. Representative systems include Reluplex, Marabou, ERAN, and \(\alpha,\beta\)-CROWN.
Pham et al.; Wang et al. Cross-backend and model-generation testing papers documenting deep-learning-library bugs, including normalization and backend-state failures.