
Machines Proving Theorems: Goedel-Prover and the IMO Gold Medal Frontier

Goedel-Prover achieves state-of-the-art open-source theorem proving in Lean 4, while Aristotle and Seed-Prover reach IMO competition level. The convergence of LLMs and formal verification is creating machines that don't just calculate: they prove.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Mathematics is the last domain where correctness is absolute. A proof is either valid or it is not; there is no "mostly correct" in formal mathematics. This unforgiving standard makes automated theorem proving (ATP) simultaneously the hardest and most meaningful benchmark for artificial intelligence: a machine that can prove theorems has achieved something qualitatively different from one that merely generates plausible text.

In 2025, several systems crossed thresholds that reshape our understanding of what machines can achieve in mathematical reasoning. Goedel-Prover, the leading open-source ATP system, achieves state-of-the-art performance in generating formal proofs in Lean 4, the proof assistant increasingly adopted by the mathematical community. Aristotle targets problems from the International Mathematical Olympiad (IMO), problems that require not just formal manipulation but genuine mathematical creativity.

Together, these systems represent a significant step forward in the relationship between artificial intelligence and mathematical reasoning.

The Formalization Bottleneck

The fundamental challenge in ATP is not computational power but data scarcity. While LLMs for natural language are trained on trillions of tokens of text, formalized mathematics (theorems and proofs written in machine-verifiable languages like Lean, Coq, or Isabelle) comprises perhaps millions of tokens. This gap of roughly six orders of magnitude means that standard scaling approaches fail: you cannot train a theorem prover the same way you train a chatbot.
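
To make the contrast concrete, here is what machine-verifiable mathematics looks like: a short Lean 4 theorem whose proof the checker either accepts in full or rejects (the proof term uses the built-in lemma `Nat.add_comm`; the theorem name is ours).

```lean
-- Commutativity of addition on natural numbers, stated and proved in Lean 4.
-- The proof term is checked mechanically: it is valid or it is rejected.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Every token of such text is precious training data, because it carries a machine-checked guarantee that free-form mathematical prose does not.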

Goedel-Prover addresses this bottleneck through an ingenious bootstrapping strategy. Starting from a modest corpus of existing Lean proofs, the system uses an LLM to generate conjectures in natural language, formalize them into Lean statements, and then prove or disprove them. Successful proofs are added to the training corpus, creating a self-reinforcing cycle of data generation.
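
A minimal sketch of this bootstrapping loop, with toy stand-ins for the LLM conjecturer, the autoformalizer, and the Lean checker. All function names here are illustrative, not from the paper, and the "proofs" are simple arithmetic checks rather than real Lean verification:

```python
# Toy sketch of Goedel-Prover-style bootstrapping. The real system uses an
# LLM to conjecture and formalize, and Lean 4 to verify; here conjectures
# are arithmetic identities (some deliberately false) and the "checker"
# evaluates them directly.

def conjecture(seed: int) -> str:
    """Generate a natural-language conjecture; every third one is false."""
    a, b = seed % 7, seed % 5
    error = 1 if seed % 3 == 0 else 0
    return f"{a} + {b} = {a + b + error}"

def formalize(stmt: str) -> str:
    """Translate the conjecture into a toy 'formal' statement."""
    return f"theorem t : {stmt}"

def verify(formal_stmt: str) -> bool:
    """Stand-in for the Lean checker: accept only true identities."""
    lhs, rhs = formal_stmt.split(" : ")[1].split(" = ")
    return eval(lhs) == int(rhs)

def bootstrap(rounds: int) -> list[str]:
    """Conjecture, formalize, verify; keep only verified statements."""
    corpus = []
    for seed in range(rounds):
        formal = formalize(conjecture(seed))
        if verify(formal):
            corpus.append(formal)  # verified statements become training data
    return corpus

print(len(bootstrap(10)))  # 6 of 10 conjectures survive verification
```

The essential property is that the verifier, not the generator, decides what enters the corpus, so the training data stays sound even when the conjecturer is unreliable.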

The approach mirrors how human mathematicians expand knowledge: formulate conjectures, attempt proofs, learn from both successes and failures. The difference is speed: Goedel-Prover can generate and attempt thousands of conjectures per hour, systematically exploring mathematical territory that would take human mathematicians years.

Aristotle: Creativity in Formal Proof

If Goedel-Prover demonstrates systematic mathematical capability, Aristotle demonstrates something closer to mathematical creativity. The system's gold-medal equivalent performance on IMO 2025 problems is significant because IMO problems are specifically designed to resist systematic approachesโ€”they require insight, novel construction, and the ability to see connections that are not obvious from the problem statement.

Aristotle achieves this by combining three components that mirror distinct aspects of mathematical thinking:

  • Lean proof search: A highly parallel search algorithm that uses a large transformer model as its policy and value function. The transformer selects promising Lean tactics and estimates the likelihood of future proof success, conditioned on the proof state, proof history, and any available informal proof.
  • Informal reasoning: A lemma-based system that generates informal proofs of mathematical statements, breaks these proofs down into lemmas, formalizes each lemma into Lean, and iterates based on formal feedback, enabling the system to leverage high-level mathematical intuition to guide the formal search.
  • Geometry solver: A dedicated plane geometry solver operating outside of Lean that handles IMO geometry problems which would be cumbersome to formalize directly in Lean.

This architecture embodies a profound insight: mathematical reasoning is not purely formal (or machines would have solved it decades ago) nor purely intuitive (or formalization would be unnecessary). It is the interplay between informal insight and formal verification that defines mathematical thought, and Aristotle operationalizes this interplay at competition level.
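
The first component, policy/value-guided search, can be sketched in miniature. This toy best-first search treats a natural number as the "proof state" and two moves as the "tactics"; the policy and value functions are trivial stand-ins for Aristotle's transformer, and every name here is illustrative:

```python
import heapq

# Toy policy/value-guided proof search. The "goal" is to reduce a number
# to 0 using two "tactics"; Aristotle's real states are Lean proof states
# and its policy/value come from a large transformer.

def policy(state: int) -> list[tuple[str, int]]:
    """Propose candidate tactics and the states they lead to."""
    return [("dec", state - 1), ("halve", state // 2)]

def value(state: int) -> float:
    """Estimate closeness to a finished proof: a smaller goal is better."""
    return -state

def best_first_search(start: int, budget: int = 100):
    """Expand the most promising proof state first, tracking the tactic path."""
    frontier = [(-value(start), start, [])]  # min-heap keyed on negated value
    seen = {start}
    for _ in range(budget):
        if not frontier:
            break
        _, state, path = heapq.heappop(frontier)
        if state == 0:                       # goal closed: proof found
            return path
        for tactic, nxt in policy(state):
            if nxt >= 0 and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-value(nxt), nxt, path + [tactic]))
    return None                              # budget exhausted, no proof
```

Because the value function prefers smaller goals, the search halves aggressively before decrementing: `best_first_search(10)` returns the tactic path `["halve", "halve", "dec", "dec"]`.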

Seed-Prover: Scaling Through Depth

Seed-Prover takes a different path, demonstrating that reinforcement learning with long chains of thought can dramatically improve formal theorem proving without Goedel-Prover's data-generation strategy.

The key insight is that standard RL for theorem proving rewards only the final outcome (proof found or not). Seed-Prover introduces deep reasoning rewards: intermediate signals that reward the model for making progress toward a proof even when the complete proof remains elusive. This enables learning from partial successes, which are vastly more common than complete proofs and carry rich information about productive reasoning strategies.
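
A toy contrast between the two reward schemes. The specific shaping here (fractional credit per verified lemma plus a completion bonus) is an illustrative assumption, not the paper's exact formulation:

```python
# Outcome-only vs. progress-shaped rewards for proof search, loosely in the
# spirit of Seed-Prover's intermediate signals. The weighting is made up.

def outcome_reward(proof_complete: bool) -> float:
    """Standard RL signal: all-or-nothing."""
    return 1.0 if proof_complete else 0.0

def shaped_reward(lemmas_closed: int, total_lemmas: int) -> float:
    """Credit each verified lemma, plus a bonus when the proof is closed."""
    progress = lemmas_closed / total_lemmas
    bonus = 1.0 if lemmas_closed == total_lemmas else 0.0
    return progress + bonus
```

Under the outcome-only scheme, an attempt that closes 3 of 4 lemmas scores the same as one that closes none; the shaped scheme distinguishes them, so failed attempts still produce a learning signal.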

The result is a model that generates proofs that are not merely correct but structured: organized into lemmas, intermediate results, and final conclusions in a way that human mathematicians would recognize as principled. This structural quality matters because it enables the proofs to serve as building blocks for proving harder theorems, creating a compounding advantage. Seed-Prover, combined with its geometry reasoning component Seed-Geometry, fully proved 5 out of 6 problems at IMO 2025.

From Mathematics to Software

Dougherty et al.'s FVAPPS benchmark extends the ATP paradigm from pure mathematics to software verification. Their benchmark asks AI systems not only to write code but to prove that the code is correct, generating formal proofs of functional correctness alongside the implementations.

This is the bridge between theoretical mathematics and practical engineering. A world where AI can both write software and prove it correct is a world where entire categories of software bugs (buffer overflows, race conditions, logic errors) become provably impossible. The current systems are far from achieving this at production scale, but the benchmark establishes the target and enables systematic progress measurement.
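
The code-plus-proof pattern can be illustrated with a deliberately tiny Lean 4 example: an implementation shipped together with a theorem pinning down its behavior, both checked by the same tool. The function and theorem names here are made up for illustration:

```lean
-- A toy implementation and its machine-checked specification.
def double (n : Nat) : Nat := n + n

-- This specification holds by definitional unfolding, so `rfl` closes it;
-- a real FVAPPS task pairs nontrivial code with nontrivial proofs.
theorem double_spec (n : Nat) : double n = n + n := rfl
```

The point is the packaging: the compiler that accepts the program is the same checker that accepts the proof, so the two cannot drift apart.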

Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| LLMs can generate valid formal mathematical proofs | Goedel-Prover and Seed-Prover achieve state of the art on Lean benchmarks | ✅ Strongly supported |
| AI can solve IMO-level competition problems | Aristotle demonstrates gold-medal-equivalent performance on IMO 2025 | ✅ Demonstrated |
| Data bootstrapping overcomes formalization scarcity | Goedel-Prover's self-play approach generates useful training data | ✅ Supported |
| ATP systems exhibit genuine mathematical creativity | Aristotle's informal reasoning component shows non-obvious insights | ⚠️ Debatable; depends on the definition of creativity |
| Formal software verification via AI is practical | FVAPPS benchmark establishes feasibility; production readiness is distant | ⚠️ Early stage |

Open Questions

  • The understanding question: When Goedel-Prover proves a theorem, does it understand the mathematics, or is it performing sophisticated pattern matching over proof tactics? This is not a philosophical quibble: the answer determines whether these systems can generalize to genuinely novel mathematics or are limited to recombining known techniques.
  • Scalability to research mathematics: IMO problems, while difficult, are solvable with techniques that a well-trained undergraduate might know. Can ATP systems scale to open research problems, the kind that consume years of a professional mathematician's career?
  • Collaboration models: How should human mathematicians work with ATP systems? As co-authors? As verification tools? As exploration assistants? The optimal human-AI collaboration model for mathematical research is unknown.
  • Cross-system transfer: Proofs in Lean do not automatically transfer to Coq or Isabelle. Can we build ATP systems that are proof-assistant agnostic, or will the community fragment around incompatible formal ecosystems?
  • The formalization gap: Most published mathematics is informal, written in natural language with implicit assumptions. Can AI bridge the gap, automatically formalizing the existing mathematical literature to create the training data needed for even more powerful ATP systems?

What This Means for Your Research

For mathematicians, ATP systems are evolving from curiosities to collaborators. Goedel-Prover can verify conjectures that would take days to check by hand. Aristotle can suggest proof strategies that a human might not consider. The researchers who integrate these tools into their workflow will have a genuine advantage: not because the machines replace mathematical thinking, but because they amplify it.

For computer scientists, the convergence of LLMs and formal verification creates a new design space. Software that is not merely tested but proven correct becomes feasible for critical systems (medical devices, financial infrastructure, autonomous vehicles) where bugs have catastrophic consequences.

For the philosophy of mathematics, these systems reignite old questions with new urgency. If a machine proves a theorem that no human understands, is it really proven? If Aristotle solves an IMO problem using a strategy no human competitor considered, is that creativity or computation? The answers matter not just abstractly, but for how we structure the future relationship between human and artificial mathematical intelligence.

References (4)

[1] Lin, Y., Tang, S., Lyu, B., et al. (2025). Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving. arXiv:2502.07640.
[2] Achim, T., Best, A., Bietti, A., et al. (2025). Aristotle: IMO-level Automated Theorem Proving. arXiv:2510.01346.
[3] Chen, L., Gu, J., Huang, L., et al. (2025). Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving. arXiv:2507.23726.
[4] Dougherty, Q., Mehta, R. (2025). Proving the Coding Interview: A Benchmark for Formally Verified Code Generation. IEEE LLM4Code.
