Origin
I did not build this because mathematics needs another list.
I started building Failure Atlas after watching the Stanford Future of Mathematics Symposium, especially Terence Tao’s keynote, “New mathematical workflows.” The official symposium framing was already striking: artificial intelligence, formal verification, and frontier reasoning models are not merely producing new answers; they are beginning to reshape the workflow of mathematical practice itself.[1]
But the question that stayed with me was not the loud one: Can AI prove theorems? It was the quieter one: what does research need to remember in order for humans and machines to think better together?
The Stanford Daily’s public report of Tao’s talk captured the point that felt decisive to me. Tao argued that the metric is not merely whether a proof has been generated or verified, but whether someone can explain it, answer questions about it, and connect it to past literature and future directions; he also described “proof digestion,” the work of making a result understandable and reusable.[2]
I wanted an object that did not yet exist in the form I needed: not a forum, not only a wiki, not only a citation graph, not only a benchmark, but a research graph that remembers failure: a graph of problems, attempts, partial results, obstructions, sources, review states, and method families across fields.
The asymmetry
Success is archived. Failure is mostly oral history.
Modern scholarship is excellent at archiving success. We have papers, preprints, DOI metadata, citation graphs, theorem prover libraries, benchmark leaderboards, conference proceedings, and institutional memory around final results. The successful theorem is searchable. The polished proof is citable. The solved case has a place in the archive.
The negative space around a problem is much less visible.
For an important problem, specialists often know facts of the following kind:
- this sieve reaches almost primes but cannot isolate primes;
- this compactness argument loses the structure needed at the last step;
- this spectral analogy would work only if the right operator existed;
- this computational search tested a huge range but produced no explanation;
- this geometric intuition is true in the toy model but false in the actual category;
- this proof claim failed because a lemma was stated one quantifier too strongly;
- this formulation is equivalent to the problem but not easier.
These are not side notes. They are part of the problem. Yet they are often distributed across talks, MathOverflow comments, referee reports, private messages, failed drafts, lecture notes, and the memories of a small number of people. A new researcher can easily rediscover a dead end because the dead end was never given a durable address.
The Polymath project is important precisely because it made some of this process visible. Scientific American’s account notes Gowers’s point that mathematicians often spend weeks or months in “blind alleys,” and that public exposure of attempts lets other experts accelerate the process by recognizing fruitless routes earlier. The same account describes Polymath as a rare documented record of serious mathematical research with “false starts” and “dead ends.”[3]
A theorem is the headline. The failed attempts are the map.
If Polymath showed the value of public mathematical process, Failure Atlas asks whether that process can be turned into a reusable knowledge object.
Proofs, refutations, anomalies
Failure is not a defect in research. It is one of the forms research takes.
The philosophical point is old, but our infrastructure has not caught up with it. Lakatos’s Proofs and Refutations made a famous case that informal mathematics grows through a dialectic of proposed proofs, counterexamples, repaired definitions, and revised concepts. The Stanford Encyclopedia of Philosophy summarizes the pedagogical implication sharply: definitions often become understandable only by seeing the process that gave rise to them.[4]
Popper’s philosophy of science likewise made refutation central, though actual scientific practice is more complicated than a single decisive falsification; the Stanford Encyclopedia notes that Popper distinguished the simple logic of falsifiability from the messier methodology of deciding what counts as a genuine falsification.[5] Kuhn then shifted attention to anomalies: persistent failures inside a paradigm can reorganize what a field takes to be a legitimate problem, method, or explanation.[6]
In the empirical sciences, the same lesson appears under another name: negative results. Nature’s editorial on negative results described how costly failed replications can be, and how little career credit often attaches to the work of discovering that something does not reproduce.[7] Commentaries on negative results and publication bias have repeatedly argued that scientific literature over-represents positive results, while negative or null outcomes often disappear even though they can prevent wasted time and duplicated effort.[8]
Failure is not the absence of knowledge. It is knowledge about the boundary of a method.
A proof is the final commit. A failure atlas is the git history of discovery.
If we only archive success, every generation pays again for the same dead ends.
The point is not that failure is glamorous. The point is that failure is information. A failed attempt can measure the reach of a technique, reveal a hidden assumption, locate a missing invariant, or show that an analogy has been pushed beyond its domain of validity. In that sense, a failed attempt is often closer to an experiment than to a mistake.
Why LLMs need failure
LLMs trained only on successful prose may learn the style of discovery without the discipline of discovery.
Large language models learn from a literature that is biased toward finished artifacts. Papers compress the path. They tell a coherent story after the concepts have stabilized. They name the successful proof technique, not the five plausible routes that failed before the right one appeared.
This creates a mismatch. The data distribution is success; the research process is search. The published proof is a terminal state; discovery is a sequence of trial, obstruction, repair, and reorientation.
This matters because modern mathematical AI is moving quickly. AlphaGeometry showed that a neurosymbolic system could solve many Olympiad level geometry problems by combining a language model with a symbolic deduction engine and synthetic proof data.[9] AlphaProof pushed further by using reinforcement learning inside Lean and reached silver medal level performance on the 2024 IMO within a fully formal pipeline, although it required days of computation and still failed on some problems.[10] Lean’s Mathlib, meanwhile, illustrates why formalized mathematics depends on reusable libraries: without a library of formal objects, every proof starts too close to zero.[11]
These systems need more than answers. They need memory of search. A corpus that includes failure could train and evaluate at least five capabilities (a minimal sketch of the third follows the list):
- Obstruction recognition: identifying the symptom of a known barrier before hallucinating around it.
- Method boundary prediction: knowing where a historically successful method stops.
- Search space pruning: avoiding rediscovery of known dead ends.
- Analogy discipline: learning that a method transferred from one field may fail because a hidden structure is absent.
- Epistemic calibration: distinguishing evidence, special case, conditional theorem, proof claim, formal verification, and community accepted proof.
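To make the third capability concrete, here is a minimal sketch, in Python, of what search space pruning against an atlas could look like. Everything in it is hypothetical: the atlas does not currently expose this API, and the card fields and data are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeadEnd:
    """A recorded obstruction: a method family known to stall on a problem."""
    problem: str
    method_family: str
    failure_mode: str
    confidence: str  # "well understood barrier" is not the same as "one failed draft"

# Illustrative data only; real cards would carry source trails and review states.
KNOWN_DEAD_ENDS = [
    DeadEnd(
        problem="twin primes",
        method_family="sieve methods",
        failure_mode="parity problem: sieves reach almost primes, not primes",
        confidence="well understood barrier",
    ),
]

def prune(problem: str, method_family: str) -> list[DeadEnd]:
    """Surface known obstructions before a model spends search budget on a route."""
    return [
        d for d in KNOWN_DEAD_ENDS
        if d.problem == problem and d.method_family == method_family
    ]
```

The design point is the return value: the model is not told “stop”; it is told which obstruction has already been recorded, and at what confidence.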
This also connects to recent discussion of reasoning models and tool using systems. The Alex Zhang post that shaped the style of this essay argues, in a different setting, that broad claims about what models “cannot do” can be too strong when the surrounding workflow is mismanaged; what matters is not only the model but the decomposition, harness, feedback, and verification loop around it.[12] Failure Atlas applies a related idea to scientific memory: the toolchain should not only generate attempts; it should preserve and learn from the failed structure of attempts.
The object
A failure card is a small unit of research memory.
The unit I wanted was not a discussion thread. Threads are good for conversation; they are bad at preserving claims with review states. The unit I wanted was closer to a citable card:
What is the target?
State the conjecture, theorem, physical anomaly, or computational barrier in a form that can be versioned.
What was tried?
Name the method family: sieve, forcing, spectral expansion, convex integration, SAT search, lattice simulation, formalization, and so on.
Where does it stop?
Record the precise failure mode, not just the fact that the attempt failed.
A good failure card should answer four questions: what survived, what failed, why it failed, and who says so. It should carry sources and a review state. A peer reviewed theorem, a survey remark, an arXiv preprint, a MathOverflow answer, a model generated attempt, and an unreviewed community note should not collapse into the same authority level.
That is why Failure Atlas treats review state as data. The system is not allowed to pretend that a source backed draft is expert reviewed. It should say, visibly, what level of trust the card has earned.
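To make the shape concrete, here is a minimal sketch of a failure card as a typed record. This is an illustration, not the atlas’s actual schema; the field names are assumptions. The point is that sources and review state are first class fields rather than footnotes.

```python
from dataclasses import dataclass, field
from enum import Enum

class ReviewState(Enum):
    UNREVIEWED_NOTE = "unreviewed note"
    SOURCE_BACKED_DRAFT = "source backed draft"
    EDITOR_REVIEWED = "domain editor reviewed"
    EXPERT_REVIEWED = "externally expert reviewed"

@dataclass
class FailureCard:
    target: str         # the conjecture, anomaly, or barrier, stated so it can be versioned
    method_family: str  # e.g. "sieve", "forcing", "SAT search"
    failure_mode: str   # the precise point where the attempt stops
    what_survived: str  # partial results that remain usable
    sources: list[str] = field(default_factory=list)  # who says so: papers, answers, reports
    review_state: ReviewState = ReviewState.SOURCE_BACKED_DRAFT  # trust travels with the card
```

Because `review_state` is a field rather than a comment, the system cannot render a source backed draft as expert reviewed without changing data that is visible and auditable.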
Atlas evidence
What the 2,000 card atlas reveals.
The current version is not an encyclopedia and not a claim of expert review. It is a representation experiment: can a large body of open problems, solved problems, failed attempts, partial results, citations, and graph edges be stored in a way that makes the negative space of research visible?
| Domain | Problem cards |
|---|---|
| Mathematics | 814 |
| Computer Science | 618 |
| Physics | 568 |

| Status | Problem cards |
|---|---|
| Open / active | 1,829 |
| Solved | 144 |
| Disproved | 27 |

| Quality audit tier | Cards |
|---|---|
| A: strong source trail, still not expert reviewed | 56 |
| B: triangulated source trail, still not expert reviewed | 856 |
| C: source backed draft, editor review needed | 1,088 |
The most important number is not 2,000. The most important number is 1,088: the number of cards still explicitly labeled as editor review needed. The atlas is useful precisely because it does not confuse scale with authority. Scale without review is a hallucination machine. Scale with visible review state becomes an audit surface.
Ulam’s UnsolvedMath dataset points toward the same infrastructure need from the benchmark side: its dataset card describes 1,146 curated open mathematical problems, including 632 Erdős problems, and explicitly notes limitations such as subjective difficulty and status needing verification.[13] The Erdős Problems database is likewise a living record, with entries that move from open to solved and problem lists that are still being checked.[14] The lesson is not that problem databases are easy. The lesson is that living problem databases need explicit provenance, review state, and update workflows.
The graph
The interesting graph is not only problem → problem. It is attempt → attempt.
At first, I thought the main value would be linking problems to other problems: Riemann to prime distribution, P versus NP to circuit lower bounds, Navier–Stokes to fluid regularity, dark matter to cosmological inference.
That graph is useful. But it is not enough.
The more interesting graph connects method families and failure modes. A combinatorics problem can share spectral methods with theoretical computer science. A number theory question can become an ergodic theory problem. A graph coloring problem can connect to topology. A condensed matter problem can share renormalization language with probability. A proof complexity barrier can illuminate why a formalization task is hard.
In the current atlas, bridge family nodes include:

- probabilistic method
- spectral/expansion methods
- topological methods
- algebraic geometry/polynomial method
- analytic/harmonic analysis
- sieve and local to global methods
- ergodic/dynamical systems
- computational search/SAT/formalization
- logic/model theory/independence
- information/entropy
- optimization/SDP/convexity
- statistical physics correspondence
- quantum/noncommutative methods
- geometric flow/variational methods
- renormalization/multiscale analysis
- automorphic/representation theory
- AI and machine discovered conjectures
This is where the “failure dataset” becomes more than a negative results archive. It becomes a map of reusable obstruction knowledge.
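Here is a minimal sketch of that attempt-to-attempt structure, with illustrative labels; the index lets two problems discover that they stalled against the same wall.

```python
from collections import defaultdict

# Each attempt is (problem, method family, obstruction). Labels are illustrative.
attempts = [
    ("twin primes", "sieve and local to global methods",
     "parity: sieves reach almost primes, not primes"),
    ("Goldbach", "sieve and local to global methods",
     "parity: sieves reach almost primes, not primes"),
]

# Invert the attempt list into a method-centered index.
problems_by_method: defaultdict[str, list[tuple[str, str]]] = defaultdict(list)
for problem, method, obstruction in attempts:
    problems_by_method[method].append((problem, obstruction))

def shared_wall(method: str) -> list[tuple[str, str]]:
    """Problems that stalled on the same method family, with the recorded obstruction."""
    return problems_by_method[method]

# shared_wall("sieve and local to global methods") pairs twin primes with Goldbach:
# the same obstruction, reusable across problems.
```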
Research agenda
What should be built next?
A mature Failure Atlas would need more than cards. It would need a research workflow.
- First class attempt nodes. Attempts and the method families behind them should not be buried inside problem pages. A method like “polynomial method” or “SAT search” should have its own page, history, successes, and failure modes.
- Review tiers. Every claim should have a status: unreviewed note, source backed draft, domain editor reviewed, externally expert reviewed, disputed, retracted, or quarantined proof claim.
- Citation triangulation. Ulam, Erdős Problems, MathOverflow, surveys, papers, arXiv, OpenAlex, Crossref, Semantic Scholar, formal libraries, and primary sources should be checked against one another rather than collapsed.
- Proof claim quarantine. Famous unsolved problems attract confident wrong proofs. The interface should make it possible to store claims without letting them pollute the curated layer (see the sketch after this list).
- Failure aware model evaluation. LLMs should be tested not only on solving but on diagnosing why a proposed route fails, identifying the missing lemma, and finding known obstructions.
- Human in the loop updates. Domain experts should be able to sign off on cards, correct status changes, attach new partial results, and mark entries stale.
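As a minimal sketch of how review tiers and quarantine interact (the tier names come from the list above; the filter itself is hypothetical), the curated layer can be nothing more than a visible filter over review states:

```python
from enum import Enum

class Tier(Enum):
    UNREVIEWED_NOTE = "unreviewed note"
    SOURCE_BACKED_DRAFT = "source backed draft"
    EDITOR_REVIEWED = "domain editor reviewed"
    EXPERT_REVIEWED = "externally expert reviewed"
    DISPUTED = "disputed"
    RETRACTED = "retracted"
    QUARANTINED_PROOF_CLAIM = "quarantined proof claim"

# Everything is stored; only some of it is surfaced as curated knowledge.
CURATED = {Tier.SOURCE_BACKED_DRAFT, Tier.EDITOR_REVIEWED, Tier.EXPERT_REVIEWED}

def curated_layer(cards: list[tuple[str, Tier]]) -> list[str]:
    """Keep every card in storage, but expose only tiers that earned visibility."""
    return [title for title, tier in cards if tier in CURATED]
```

Quarantined proof claims are never deleted; they simply never pass the filter. That is the difference between an archive and a feed.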
In short, this should become infrastructure for proof digestion at the level of the whole research ecosystem.
Build note
This was built as a real collaboration with Codex.
I used Codex for five days straight while building Failure Atlas. I let it spin up multiple agents to explore the code, generate card structures, check citations, clean the interface, and wire the static website together. I still made the editorial decisions, but the speed and breadth of the project came from that collaboration.
The site would not have been possible in this form without that help. Codex made it possible to keep many threads moving at once: data shaping, citation checking, visual design, local testing, GitHub setup, and deployment.
Objections
Three objections are serious.
1. “Failures are too hard to formalize.”
Yes. But that is not a reason to leave them unstructured. A failed attempt need not be fully formalized to be useful. A source backed obstruction card can still prevent duplication and clarify the research frontier.
2. “This will attract bad proof claims.”
Yes. That is why review state and quarantine are core design requirements. A failure atlas without moderation becomes noise. A failure atlas with explicit review state becomes an epistemic filter.
3. “Negative results are not always reliable.”
Correct. A failed attempt may fail because the attempt was bad, not because the route is impossible. The card must say exactly what failed and at what level of confidence. “This proof draft broke at Lemma 4” is different from “this method cannot work in principle.” The atlas should make that difference visible.
The general principle is simple: failure should be archived, not overinterpreted.
Conclusion
We need systems that remember how not to know.
Science and mathematics progress by success, but they learn by constrained failure. A failed route tells us where a method ends. A counterexample tells us what a definition was hiding. A failed replication tells us what the literature overclaimed. A proof gap tells us which dependency was missing. A formalization failure tells us where informal understanding outran explicit structure.
If LLMs are going to become useful research collaborators, they need to inherit this memory. Not just the finished proofs. Not just the abstracts. Not just the benchmark answers. They need the failed attempts, the obstruction taxonomy, the partial results, the dead ends, the source trails, and the review states.
The dead ends are not noise around discovery. They are the geometry of discovery.
Failure Atlas is my attempt to draw that geometry.
References and public sources
1. Stanford University Events, Future of Mathematics Symposium. The agenda lists Terence Tao’s keynote, “New mathematical workflows,” and frames the symposium around AI, formal verification, and frontier reasoning models reshaping mathematical practice.
2. Angikar Ghosal, Stanford Daily, “‘Like space aliens landing’: Symposium weighs effect of AI on the future of mathematics,” May 6, 2026. Public report of Tao’s “proof digestion” discussion and related symposium remarks.
3. Davide Castelvecchi, Scientific American, “Project Polymath: Collaborative Mathematics through Blogs,” 2011. Discusses blind alleys, false starts, and the public record of Polymath collaboration.
4. Stanford Encyclopedia of Philosophy, “Imre Lakatos.” See the discussion of Proofs and Refutations and the dialectic of proofs, counterexamples, and concept repair.
5. Stanford Encyclopedia of Philosophy, “Karl Popper.” See the section on falsifiability and the methodological complexity of falsification.
6. Stanford Encyclopedia of Philosophy, “Scientific Revolutions.” See the discussion of Kuhn, anomalies, paradigm change, and scientific revolutions.
7. Nature editorial, “Negative results,” 2008. Discusses retractions, failed reproduction, and the cost of correcting scientific records.
8. Jaime A. Teixeira da Silva, “Negative results: negative perceptions limit their potential for increasing reproducibility,” Journal of Negative Results in BioMedicine, 2015; see also Nature Index, “Highlight negative results to improve science,” 2019.
9. Trinh et al., Nature, “Solving olympiad geometry without human demonstrations,” 2024.
10. DeepMind / Nature, “Olympiad level formal mathematical reasoning with reinforcement learning,” 2025.
11. Lean, “Mathlib: A Foundation for Formal Mathematics Research and Verification.” Describes Mathlib as a large community driven formal mathematics library for Lean.
12. Alex L. Zhang and Omar Khattab, “A Mini Exercise on the Mismanaged Geniuses Hypothesis (RLMs on LongCoT),” 2026. Cited here for the general workflow/harness lesson and stylistic inspiration.
13. UlamAI / Hugging Face, “UnsolvedMath Dataset.” The dataset card describes 1,146 open mathematical problems, 632 Erdős problems, categories, difficulty levels, and limitations.
14. Erdős Problems, “Erdős Problem Lists”; see also the Erdős Problems homepage for living status updates.