
Machines Proving Theorems: Goedel-Prover and the IMO Gold Medal Frontier

Goedel-Prover achieves state-of-the-art open-source theorem proving in Lean 4, while Aristotle and Seed-Prover reach IMO competition level. The convergence of LLMs and formal verification is creating machines that don't just calculate: they prove.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Mathematics is the last domain where correctness is absolute. A proof is either valid or it is not; there is no "mostly correct" in formal mathematics. This unforgiving standard makes automated theorem proving (ATP) simultaneously the hardest and most meaningful benchmark for artificial intelligence: a machine that can prove theorems has achieved something qualitatively different from one that merely generates plausible text.

In 2025, several systems crossed thresholds that reshape our understanding of what machines can achieve in mathematical reasoning. Goedel-Prover, the leading open-source ATP system, achieves state-of-the-art performance in generating formal proofs in Lean 4, the proof assistant increasingly adopted by the mathematical community. Aristotle targets problems from the International Mathematical Olympiad (IMO), problems that require not just formal manipulation but genuine mathematical creativity.

Together, these systems represent a significant step forward in the relationship between artificial intelligence and mathematical reasoning.

The Formalization Bottleneck

The fundamental challenge in ATP is not computational power but data scarcity. While LLMs for natural language are trained on trillions of tokens of text, formalized mathematics (theorems and proofs written in machine-verifiable languages like Lean, Coq, or Isabelle) comprises perhaps millions of tokens. This gap of roughly six orders of magnitude means that standard scaling approaches fail: you cannot train a theorem prover the same way you train a chatbot.
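
To make the contrast concrete, here is what machine-verifiable mathematics looks like: a short Lean 4 theorem whose proof the checker either accepts in full or rejects (the proof term uses the built-in lemma `Nat.add_comm`; the theorem name is ours).

```lean
-- Commutativity of addition on natural numbers, stated and proved in Lean 4.
-- The proof term is checked mechanically: it is valid or it is rejected.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Every token of such text is precious training data, because it carries a machine-checked guarantee that free-form mathematical prose does not.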

Goedel-Prover addresses this bottleneck through an ingenious bootstrapping strategy. Starting from a modest corpus of existing Lean proofs, the system uses an LLM to generate conjectures in natural language, formalize them into Lean statements, and then prove or disprove them. Successful proofs are added to the training corpus, creating a self-reinforcing cycle of data generation.
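
A minimal sketch of this bootstrapping loop, with toy stand-ins for the LLM conjecturer, the autoformalizer, and the Lean checker. All function names here are illustrative, not from the paper, and the "proofs" are simple arithmetic checks rather than real Lean verification:

```python
# Toy sketch of Goedel-Prover-style bootstrapping. The real system uses an
# LLM to conjecture and formalize, and Lean 4 to verify; here conjectures
# are arithmetic identities (some deliberately false) and the "checker"
# evaluates them directly.

def conjecture(seed: int) -> str:
    """Generate a natural-language conjecture; every third one is false."""
    a, b = seed % 7, seed % 5
    error = 1 if seed % 3 == 0 else 0
    return f"{a} + {b} = {a + b + error}"

def formalize(stmt: str) -> str:
    """Translate the conjecture into a toy 'formal' statement."""
    return f"theorem t : {stmt}"

def verify(formal_stmt: str) -> bool:
    """Stand-in for the Lean checker: accept only true identities."""
    lhs, rhs = formal_stmt.split(" : ")[1].split(" = ")
    return eval(lhs) == int(rhs)

def bootstrap(rounds: int) -> list[str]:
    """Conjecture, formalize, verify; keep only verified statements."""
    corpus = []
    for seed in range(rounds):
        formal = formalize(conjecture(seed))
        if verify(formal):
            corpus.append(formal)  # verified statements become training data
    return corpus

print(len(bootstrap(10)))  # 6 of 10 conjectures survive verification
```

The essential property is that the verifier, not the generator, decides what enters the corpus, so the training data stays sound even when the conjecturer is unreliable.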

The approach mirrors how human mathematicians expand knowledge: formulate conjectures, attempt proofs, learn from both successes and failures. The difference is speed: Goedel-Prover can generate and attempt thousands of conjectures per hour, systematically exploring mathematical territory that would take human mathematicians years.

Aristotle: Creativity in Formal Proof

If Goedel-Prover demonstrates systematic mathematical capability, Aristotle demonstrates something closer to mathematical creativity. The system's gold-medal equivalent performance on IMO 2025 problems is significant because IMO problems are specifically designed to resist systematic approachesโ€”they require insight, novel construction, and the ability to see connections that are not obvious from the problem statement.

Aristotle achieves this by combining three components that mirror distinct aspects of mathematical thinking:

  • Lean proof search: A highly parallel search algorithm that uses a large transformer model as its policy and value function. The transformer selects promising Lean tactics and estimates the likelihood of future proof success, conditioned on the proof state, proof history, and any available informal proof.
  • Informal reasoning: A lemma-based system that generates informal proofs of mathematical statements, breaks these proofs down into lemmas, formalizes each lemma into Lean, and iterates based on formal feedback, enabling the system to leverage high-level mathematical intuition to guide the formal search.
  • Geometry solver: A dedicated plane geometry solver operating outside of Lean that handles IMO geometry problems which would be cumbersome to formalize directly in Lean.

This architecture embodies a profound insight: mathematical reasoning is not purely formal (or machines would have solved it decades ago) nor purely intuitive (or formalization would be unnecessary). It is the interplay between informal insight and formal verification that defines mathematical thought, and Aristotle operationalizes this interplay at competition level.
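
The first component, policy/value-guided search, can be sketched in miniature. This toy best-first search treats a natural number as the "proof state" and two moves as the "tactics"; the policy and value functions are trivial stand-ins for Aristotle's transformer, and every name here is illustrative:

```python
import heapq

# Toy policy/value-guided proof search. The "goal" is to reduce a number
# to 0 using two "tactics"; Aristotle's real states are Lean proof states
# and its policy/value come from a large transformer.

def policy(state: int) -> list[tuple[str, int]]:
    """Propose candidate tactics and the states they lead to."""
    return [("dec", state - 1), ("halve", state // 2)]

def value(state: int) -> float:
    """Estimate closeness to a finished proof: a smaller goal is better."""
    return -state

def best_first_search(start: int, budget: int = 100):
    """Expand the most promising proof state first, tracking the tactic path."""
    frontier = [(-value(start), start, [])]  # min-heap keyed on negated value
    seen = {start}
    for _ in range(budget):
        if not frontier:
            break
        _, state, path = heapq.heappop(frontier)
        if state == 0:                       # goal closed: proof found
            return path
        for tactic, nxt in policy(state):
            if nxt >= 0 and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-value(nxt), nxt, path + [tactic]))
    return None                              # budget exhausted, no proof
```

Because the value function prefers smaller goals, the search halves aggressively before decrementing: `best_first_search(10)` returns the tactic path `["halve", "halve", "dec", "dec"]`.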

Seed-Prover: Scaling Through Depth

Seed-Prover takes a different path, demonstrating that reinforcement learning with long chains of thought can dramatically improve formal theorem proving without Goedel-Prover's data-generation strategy.

The key insight is that standard RL for theorem proving rewards only the final outcome (proof found or not). Seed-Prover introduces deep reasoning rewards: intermediate signals that reward the model for making progress toward a proof even when the complete proof remains elusive. This enables learning from partial successes, which are vastly more common than complete proofs and carry rich information about productive reasoning strategies.
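
A toy contrast between the two reward schemes. The specific shaping here (fractional credit per verified lemma plus a completion bonus) is an illustrative assumption, not the paper's exact formulation:

```python
# Outcome-only vs. progress-shaped rewards for proof search, loosely in the
# spirit of Seed-Prover's intermediate signals. The weighting is made up.

def outcome_reward(proof_complete: bool) -> float:
    """Standard RL signal: all-or-nothing."""
    return 1.0 if proof_complete else 0.0

def shaped_reward(lemmas_closed: int, total_lemmas: int) -> float:
    """Credit each verified lemma, plus a bonus when the proof is closed."""
    progress = lemmas_closed / total_lemmas
    bonus = 1.0 if lemmas_closed == total_lemmas else 0.0
    return progress + bonus
```

Under the outcome-only scheme, an attempt that closes 3 of 4 lemmas scores the same as one that closes none; the shaped scheme distinguishes them, so failed attempts still produce a learning signal.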

The result is a model that generates proofs that are not merely correct but structured: organized into lemmas, intermediate results, and final conclusions in a way that human mathematicians would recognize as principled. This structural quality matters because it enables the proofs to serve as building blocks for proving harder theorems, creating a compounding advantage. Seed-Prover, combined with its geometry reasoning component Seed-Geometry, fully proved 5 out of 6 problems at IMO 2025.

From Mathematics to Software

Dougherty et al.'s FVAPPS benchmark extends the ATP paradigm from pure mathematics to software verification. Their benchmark asks AI systems not only to write code but to prove that the code is correct, generating formal proofs of functional correctness alongside the implementations.

This is the bridge between theoretical mathematics and practical engineering. A world where AI can both write software and prove it correct is a world where entire categories of software bugs (buffer overflows, race conditions, logic errors) become provably impossible. The current systems are far from achieving this at production scale, but the benchmark establishes the target and enables systematic progress measurement.
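
The code-plus-proof pattern can be illustrated with a deliberately tiny Lean 4 example: an implementation shipped together with a theorem pinning down its behavior, both checked by the same tool. The function and theorem names here are made up for illustration:

```lean
-- A toy implementation and its machine-checked specification.
def double (n : Nat) : Nat := n + n

-- This specification holds by definitional unfolding, so `rfl` closes it;
-- a real FVAPPS task pairs nontrivial code with nontrivial proofs.
theorem double_spec (n : Nat) : double n = n + n := rfl
```

The point is the packaging: the compiler that accepts the program is the same checker that accepts the proof, so the two cannot drift apart.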

Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| LLMs can generate valid formal mathematical proofs | Goedel-Prover and Seed-Prover achieve state of the art on Lean benchmarks | ✅ Strongly supported |
| AI can solve IMO-level competition problems | Aristotle demonstrates gold-medal-equivalent performance on IMO 2025 | ✅ Demonstrated |
| Data bootstrapping overcomes formalization scarcity | Goedel-Prover's self-play approach generates useful training data | ✅ Supported |
| ATP systems exhibit genuine mathematical creativity | Aristotle's informal reasoning component shows non-obvious insights | ⚠️ Debatable; depends on the definition of creativity |
| Formal software verification via AI is practical | FVAPPS benchmark establishes feasibility; production readiness is distant | ⚠️ Early stage |

Open Questions

  • The understanding question: When Goedel-Prover proves a theorem, does it understand the mathematics, or is it performing sophisticated pattern matching over proof tactics? This is not a philosophical quibble: the answer determines whether these systems can generalize to genuinely novel mathematics or are limited to recombining known techniques.
  • Scalability to research mathematics: IMO problems, while difficult, are solvable with techniques that a well-trained undergraduate might know. Can ATP systems scale to open research problems, the kind that consume years of a professional mathematician's career?
  • Collaboration models: How should human mathematicians work with ATP systems? As co-authors? As verification tools? As exploration assistants? The optimal human-AI collaboration model for mathematical research is unknown.
  • Cross-system transfer: Proofs in Lean do not automatically transfer to Coq or Isabelle. Can we build ATP systems that are proof-assistant agnostic, or will the community fragment around incompatible formal ecosystems?
  • The formalization gap: Most published mathematics is informal, written in natural language with implicit assumptions. Can AI bridge the gap, automatically formalizing the existing mathematical literature to create the training data needed for even more powerful ATP systems?

What This Means for Your Research

For mathematicians, ATP systems are evolving from curiosities to collaborators. Goedel-Prover can verify conjectures that would take days to check by hand. Aristotle can suggest proof strategies that a human might not consider. The researchers who integrate these tools into their workflow will have a genuine advantage: not because the machines replace mathematical thinking, but because they amplify it.

For computer scientists, the convergence of LLMs and formal verification creates a new design space. Software that is not merely tested but proven correct becomes feasible for critical systems (medical devices, financial infrastructure, autonomous vehicles) where bugs have catastrophic consequences.

For the philosophy of mathematics, these systems reignite old questions with new urgency. If a machine proves a theorem that no human understands, is it really proven? If Aristotle solves an IMO problem using a strategy no human competitor considered, is that creativity or computation? The answers matter not just abstractly, but for how we structure the future relationship between human and artificial mathematical intelligence.

References (4)

[1] Lin, Y., Tang, S., Lyu, B., et al. (2025). Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving. arXiv:2502.07640.
[2] Achim, T., Best, A., Bietti, A., et al. (2025). Aristotle: IMO-level Automated Theorem Proving. arXiv:2510.01346.
[3] Chen, L., Gu, J., Huang, L., et al. (2025). Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving. arXiv:2507.23726.
[4] Dougherty, Q., Mehta, R. (2025). Proving the Coding Interview: A Benchmark for Formally Verified Code Generation. IEEE LLM4Code.
