
Goedel-Prover: How an Open-Source Model Reached the Frontier of Automated Theorem Proving

Goedel-Prover is the leading open-source automated theorem prover—achieving state-of-the-art performance in Lean 4 through a bootstrapping strategy that generates its own training data. Seed-Prover complements it with reinforcement learning for deeper reasoning chains.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Automated theorem proving (ATP) occupies an unusual position in AI research: it is one of the few domains where correctness is absolute. A generated proof is either valid, meaning every step follows from axioms and previously proven lemmas, or it is not. There is no "mostly correct" proof, no "approximately valid" derivation. This binary standard makes ATP one of the hardest and most meaningful benchmarks for machine reasoning.
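To make the binary standard concrete, here is a minimal Lean 4 sketch. The theorem names are made up for illustration, and only Lean's core library is assumed (no Mathlib).

```lean
-- Accepted: `n + 0` reduces to `n` by definition, so `rfl` suffices.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl

-- Accepted: the proof term is a lemma from Lean's core library.
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b

-- Rejected if uncommented: the statement is false, so no proof term
-- can type-check. There is no partial credit.
-- theorem two_plus_two_bad : 2 + 2 = 5 := rfl
```

The checker either accepts the file or reports an error; that verdict is the only signal a prover gets from it.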

Goedel-Prover (Lin et al.) has emerged as the leading open-source system for generating formal proofs in Lean 4, the proof assistant increasingly adopted by the mathematical community for both research and education. Its success rests not on architectural novelty but on a practical insight about the field's central bottleneck: data scarcity.

The Data Problem in Formal Mathematics

The entire corpus of formalized mathematics (theorems and proofs written in machine-verifiable languages) is minuscule compared to the datasets that power other AI systems. The Mathlib library for Lean 4, one of the largest repositories of formalized mathematics, contains over 210,000 theorems and 100,000 definitions (as of mid-2025). Compare this to the trillions of tokens used to train language models. The two figures are in different units (declarations versus tokens), but even under generous assumptions about how many tokens a declaration occupies, the gap spans several orders of magnitude. That gap helps explain why standard scaling approaches struggle for theorem proving: there is not enough formal mathematical data to train a strong prover through supervised learning alone.
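A rough back-of-envelope check of the size of that gap follows. The tokens-per-declaration figure and the size of the language-model corpus are assumptions chosen for illustration, not measurements; only the declaration counts come from the text above.

```python
import math

# All inputs below are rough assumptions for illustration, not measurements.
MATHLIB_DECLARATIONS = 210_000 + 100_000   # theorems + definitions (from the text)
TOKENS_PER_DECLARATION = 100               # hypothetical average
LLM_TRAINING_TOKENS = 1e12                 # lower end of "trillions"


def orders_of_magnitude_gap(formal_tokens: float, llm_tokens: float) -> float:
    """Return log10 of how many times larger the LLM corpus is."""
    return math.log10(llm_tokens / formal_tokens)


formal_tokens = MATHLIB_DECLARATIONS * TOKENS_PER_DECLARATION
gap = orders_of_magnitude_gap(formal_tokens, LLM_TRAINING_TOKENS)
print(f"estimated gap: about {gap:.1f} orders of magnitude")
```

Under these assumptions the estimate lands between four and five orders of magnitude. A larger training corpus or shorter declarations push it higher.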

Goedel-Prover addresses this through self-bootstrapping: the system generates its own training data by formalizing existing mathematical problems into Lean 4 statements, attempting to prove them, and adding successful proofs to its training corpus. The cycle operates as follows:

An LLM formalizes mathematical problems from existing datasets (e.g., the Numina competition dataset) into Lean 4 formal statements

The prover attempts to prove each formalized statement

Successful proofs are added to the training set

The prover is retrained on the expanded dataset of verified proofs

The improved prover tackles harder formalized statements, and the cycle repeats
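The cycle above can be sketched as a toy loop. This is not the authors' pipeline: `ToyProver`, its scalar `skill`, the `difficulty` field, and `lean_verifies` are hypothetical stand-ins for an LLM prover, problem hardness, and a call to the Lean 4 checker.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProver:
    """Stand-in for a prover model; `skill` is a hypothetical scalar."""
    skill: int = 1
    training_set: dict = field(default_factory=dict)

    def attempt(self, statement):
        # A real system would sample a candidate Lean proof from an LLM here.
        if statement["difficulty"] <= self.skill:
            return f"proof_of_{statement['name']}"
        return None

    def retrain(self):
        # Toy rule: skill grows with the number of verified proofs collected.
        self.skill = 1 + len(self.training_set)


def lean_verifies(statement, proof):
    # Placeholder for running the Lean 4 checker on a candidate proof.
    return proof is not None


def bootstrap(statements, rounds):
    """Attempt, verify, collect, retrain; return proofs collected per round."""
    prover = ToyProver()
    history = []
    for _ in range(rounds):
        for st in statements:
            if st["name"] in prover.training_set:
                continue  # already solved in an earlier round
            proof = prover.attempt(st)
            if lean_verifies(st, proof):
                prover.training_set[st["name"]] = proof
        prover.retrain()
        history.append(len(prover.training_set))
    return history


statements = [{"name": f"s{d}", "difficulty": d} for d in range(1, 7)]
history = bootstrap(statements, rounds=4)
print(history)
```

The point of the sketch is the shape of the loop: only verified proofs enter the training set, and each retraining step unlocks statements that were out of reach in the previous round.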

This iterative training mechanism enables the system to build mathematical capability from existing competition-level problems at scale. The result is the Goedel-Pset-v1 corpus: 1.64 million formal statements with over 800,000 accompanied by verified proofs—the largest open-source dataset of its kind.

Seed-Prover: Depth Through Reinforcement Learning

Chen et al.'s Seed-Prover takes a complementary approach. Rather than generating more data, it extracts more learning from existing data through reinforcement learning with deep reasoning chains.

Standard RL for theorem proving rewards only completed proofs: a binary signal (verified by Lean or not) that carries no information about partial progress. For difficult theorems where complete proofs are rare, this sparse reward signal makes learning extremely slow.
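A small calculation shows how quickly the signal dries up. The per-attempt success rates and the sample budget below are hypothetical, and attempts are assumed independent.

```python
def prob_any_reward(p_success: float, samples: int) -> float:
    """Chance that at least one of `samples` independent attempts verifies."""
    return 1.0 - (1.0 - p_success) ** samples


SAMPLES = 32                             # hypothetical attempts per theorem
easy = prob_any_reward(0.2, SAMPLES)     # theorem solved 20% of the time
hard = prob_any_reward(0.001, SAMPLES)   # theorem solved 0.1% of the time
print(f"easy theorem: {easy:.3f}, hard theorem: {hard:.3f}")
```

With these numbers almost every batch on the easy theorem contains a rewarded proof, while only about 3% of batches on the hard theorem do. The remaining batches give the learner nothing to reinforce.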

Seed-Prover addresses this sparsity through an intermediate lemma generation paradigm. Rather than attempting to prove theorems in a single pass, the system decomposes proofs into intermediate lemmas—each of which can be independently verified by Lean (still a binary reward per lemma). This decomposition strategy effectively creates more frequent reward signals by breaking hard problems into verifiable sub-problems:

Generating and proving lemmas that serve as sub-goals of the target theorem
Composing verified lemmas into complete proofs
Building a growing library of reusable proven lemmas
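The contrast between a single proof-level reward and per-lemma rewards can be sketched as follows. This is an illustrative toy, not Seed-Prover's actual reward design; the function names, lemma names, and verifier results are all made up.

```python
from typing import Dict, List


def whole_proof_reward(lemma_ok: List[bool], final_ok: bool) -> int:
    """Single binary reward: 1 only if every lemma and the final step verify."""
    return int(all(lemma_ok) and final_ok)


def lemma_level_rewards(lemma_ok: List[bool]) -> List[int]:
    """One binary reward per lemma, independent of the final theorem."""
    return [int(ok) for ok in lemma_ok]


def update_library(library: Dict[str, str], names: List[str],
                   proofs: List[str], lemma_ok: List[bool]) -> None:
    """Store only lemmas that passed verification, for later reuse."""
    for name, proof, ok in zip(names, proofs, lemma_ok):
        if ok:
            library[name] = proof


# One failed attempt at a hard theorem: two of three lemmas verified.
names = ["lemma_a", "lemma_b", "lemma_c"]
proofs = ["proof_a", "proof_b", "proof_c"]
lemma_ok = [True, True, False]   # hypothetical verifier results
library: Dict[str, str] = {}

sparse = whole_proof_reward(lemma_ok, final_ok=False)
dense = lemma_level_rewards(lemma_ok)
update_library(library, names, proofs, lemma_ok)
print(sparse, dense, sorted(library))
```

The attempt earns nothing under the whole-proof reward, yet two lemmas still produce a positive signal and are kept in the library for later attempts.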

The result is proofs that are not merely correct but well-structured—organized into logical sections with intermediate lemmas, in a style that human mathematicians would recognize as principled rather than accidental. This structural quality is not just aesthetically pleasing; it enables proof reuse, where lemmas proven for one theorem become available as building blocks for others.
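A minimal Lean 4 sketch of lemma reuse follows. The lemma and theorem names are invented for illustration, and a recent Lean 4 toolchain is assumed, so that the `omega` tactic is available without extra imports.

```lean
-- A small helper lemma, proved once.
theorem double_eq (n : Nat) : n + n = 2 * n := by omega

-- First reuse: rewrite with the lemma twice, at different arguments.
theorem four_times (n : Nat) : (n + n) + (n + n) = 2 * (2 * n) := by
  rw [double_eq n, double_eq (2 * n)]

-- Second reuse: the lemma instantiated at `n + 1` is the whole proof.
theorem double_succ (n : Nat) : (n + 1) + (n + 1) = 2 * (n + 1) :=
  double_eq (n + 1)
```

Once `double_eq` is verified, later theorems can cite it without re-proving it, which is the reuse the paragraph above describes.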

The Lean 4 Ecosystem

Both systems target Lean 4 specifically, reflecting a strategic bet on the proof assistant ecosystem. Lean 4's combination of a powerful type system, a growing mathematical library (Mathlib), and an active community of mathematicians and computer scientists has positioned it as the platform of choice for formalized mathematics.

The ATP research community's convergence on Lean 4 creates a virtuous cycle: more theorem provers targeting Lean generate more formalized mathematics, which provides more training data for provers, which enables proving harder theorems, which attracts more mathematicians to formalization. Goedel-Prover's self-bootstrapping amplifies this cycle by generating training data that benefits the entire Lean ecosystem, not just its own training.

Claims and Evidence

Claim	Evidence	Verdict
Self-bootstrapping effectively addresses formal math data scarcity	Goedel-Prover demonstrates improving performance through data generation cycles	✅ Supported
RL with intermediate rewards improves proof quality	Seed-Prover produces structured proofs with reusable lemmas	✅ Supported
Open-source ATP models can match proprietary performance	Goedel-Prover achieves state-of-the-art among open models (as of April 2025)	⚠️ Shown among open models only; no proprietary comparison is cited here
ATP systems can prove research-level mathematics	Current systems handle competition and undergraduate-level problems; research frontiers remain out of reach	⚠️ Not yet for open research problems
Lean 4 is the optimal platform for ATP research	Strong community and library; but Coq and Isabelle retain dedicated user bases	⚠️ Dominant but not universal

Open Questions

Statement quality: Goedel-Prover's training statements are produced by automatic formalization of existing problems, but are they faithful to the originals and mathematically interesting? A system that proves thousands of trivial variations adds data volume without mathematical depth. How do we steer statement generation toward mathematically meaningful territory?

Proof search scalability: As target theorems become more difficult, the search space for proofs grows exponentially with proof depth. Tactic-based search methods are reported to stall at depths of roughly 20 to 30 tactics. How do we extend proof search to the depth required for research-level mathematics?
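The exponential growth is easy to see with a toy count. The branching factor of 10 plausible tactics per step is a hypothetical figure; real branching varies by goal and by how the prover ranks candidates.

```python
def search_space_size(branching_factor: int, depth: int) -> int:
    """Number of distinct tactic sequences of exactly `depth` steps."""
    return branching_factor ** depth


BRANCHING = 10  # hypothetical number of plausible tactics per step
for depth in (5, 10, 20, 30):
    print(depth, search_space_size(BRANCHING, depth))
```

Going from depth 20 to depth 30 multiplies the space by a factor of ten billion under this assumption, which is why deeper proofs need better guidance and not just more compute.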

Cross-domain transfer: A prover trained on algebra may struggle with topology, and vice versa. How do we build provers that transfer mathematical knowledge across domains—mirroring the human mathematician's ability to apply tools from one area to problems in another?

Formalization vs. understanding: When Goedel-Prover generates a proof, does the underlying model "understand" the mathematics, or is it performing sophisticated pattern matching over tactic sequences? The answer matters for predicting where these systems will plateau.

Human-prover collaboration: What is the optimal workflow for mathematicians using ATP tools? Should the prover attempt full proofs autonomously, or should it suggest proof strategies that the mathematician refines?

What This Means for Your Research

For mathematicians, Goedel-Prover and Seed-Prover are tools that can verify conjectures, fill in routine proof details, and suggest proof approaches—augmenting human mathematical reasoning without replacing the creative insight that drives mathematical discovery.

For AI researchers, formal mathematics provides one of the most rigorous evaluation domains available: proofs are either valid or invalid, with no ambiguity about evaluation. The bootstrapping and RL approaches developed here may transfer to other domains where formal correctness matters, such as software verification, hardware design, and protocol analysis.

For the mathematical community broadly, the convergence of AI and formal mathematics is reshaping how mathematics is practiced, taught, and verified. The formalization movement—translating informal mathematical knowledge into machine-verifiable form—is accelerating, driven partly by ATP tools that make formalization productive rather than merely pedagogical.


References

[1] Lin, Y., Tang, S., Lyu, B. et al. (2025). Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving. arXiv:2502.07640.


[2] Chen, L., Gu, J., Huang, L. et al. (2025). Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving. arXiv:2507.23726.