
RLMEval: Can Neural Theorem Provers Handle Research-Level Mathematics?

Most ATP benchmarks test undergraduate or competition mathematics. RLMEval evaluates neural theorem provers on research-level mathematics from real publications, revealing that the gap between solving competition problems and advancing mathematical research remains substantial.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The automated theorem proving community celebrates each new benchmark conquered: miniF2F problems solved, IMO questions answered, Mathlib theorems reproven. These achievements are genuine and impressive. But they share a common limitation: the problems are known to be solvable. Competition problems have solutions that fit on a single page. Textbook exercises have answers in the back. Mathlib theorems have proofs that human mathematicians have already written.

Research-level mathematics is qualitatively different. The problems may not have solutions. The techniques required may not exist yet. The formalization of the problem statement itself may require mathematical insight. The gap between solving known problems and contributing to open research is the gap between a talented student and a working mathematician, and it is enormous.

Poiroux et al.'s RLMEval (presented at EMNLP Findings) confronts this gap directly by evaluating neural theorem provers on research-level mathematics from real published papers. The results provide a sobering calibration of where AI mathematical reasoning actually stands.

The Benchmark Design

RLMEval collects theorems from recent mathematical publicationsโ€”real results that working mathematicians proved and published. The theorems span multiple mathematical subdisciplines and difficulty levels, from routine lemmas (technical results needed for the main theorems) to the main theorems themselves.

Each theorem is provided in two forms:

  • Informal statement: The theorem as written in the paper (natural language with mathematical notation)
  • Formal statement: The theorem formalized in Lean (machine-verifiable specification)

The evaluation measures two capabilities:

  • Proof generation: Given the formal statement, can the prover find a proof?
  • Autoformalization: Given the informal statement, can the system produce a correct formalization?
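
As a concrete illustration of the two forms, consider a toy pairing (a hypothetical item, not an actual RLMEval entry): the informal claim "the sum of the first n natural numbers is n(n+1)/2" alongside one possible Lean formalization. Proof generation then amounts to replacing the `sorry`:

```lean
import Mathlib

-- Hypothetical benchmark item, shown only to illustrate the
-- informal/formal pairing; RLMEval's real entries come from
-- recent research papers.
--
-- Informal statement: "For every natural number n, the sum
-- 0 + 1 + ... + n equals n(n+1)/2."

-- Formal statement (stated without division to stay in ℕ):
theorem sum_first_n (n : ℕ) :
    2 * (Finset.range (n + 1)).sum (fun i => i) = n * (n + 1) := by
  sorry  -- proof generation: the prover must fill this in
```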
What the Results Reveal

The findings are instructive in their specificity:

Routine lemmas: Neural provers handle many routine lemmas, such as straightforward consequences of definitions, applications of known theorems, and algebraic manipulations. These are the mathematical equivalent of "boilerplate code": necessary but not creative.
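
In Lean terms, a routine lemma is the kind of goal that standard Mathlib automation closes in one step. A minimal illustrative example (not taken from the benchmark):

```lean
import Mathlib

-- A routine algebraic manipulation: one call to the `ring` tactic
-- closes the goal with no search. Goals of this shape are the
-- "boilerplate" tier where current neural provers do well.
example (a b : ℤ) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by
  ring
```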

Non-trivial intermediate results: Provers struggle with intermediate results that require choosing the right mathematical technique from several possibilities. Unlike competition problems, where the technique is often suggested by the problem's context, research mathematics requires the prover to select autonomously from a large toolbox.

Main theorems: Current provers rarely succeed on the main theorems of published papers. These theorems typically require novel proof strategies that combine techniques in ways not seen in the training data, precisely the capability that defines mathematical research.

Autoformalization: Translating informal mathematical statements to formal specifications is itself a challenging task. Mathematical notation is ambiguous (the same symbol means different things in different contexts), implicit assumptions are common (domain experts share unstated conventions), and formalization choices affect proof difficulty.
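
A toy illustration of the ambiguity problem (hypothetical statements, not benchmark items): the informal claim "a - b + b = a" formalizes to a false statement over the natural numbers, where subtraction truncates at zero, but to a true one over the integers.

```lean
import Mathlib

-- Over ℕ, subtraction truncates, so the naive formalization is false:
example : ¬ ∀ a b : ℕ, a - b + b = a := by
  intro h
  exact absurd (h 0 1) (by decide)  -- 0 - 1 + 1 = 1, not 0

-- Over ℤ, the same informal sentence is a theorem:
example (a b : ℤ) : a - b + b = a := by
  ring
```

Which reading the paper's author intended is an unstated convention, and the choice changes the downstream proof burden.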

The Gap Analysis

RLMEval enables a precise gap analysis between current AI capability and research-level mathematics:

| Capability level | AI performance | Human comparison |
|---|---|---|
| Routine lemmas | Good | Undergraduate |
| Non-trivial intermediates | Moderate | Graduate student |
| Main theorems | Poor | Researcher |
| Novel proof strategies | Absent | Expert researcher |
| Conjecture generation | Not evaluated | Creative mathematician |

The progression from routine to creative mirrors the human mathematical development trajectory, and current AI systems are roughly at the graduate-student level: competent with known techniques, struggling when creativity is required.

Claims and Evidence

| Claim | Evidence | Verdict |
|---|---|---|
| Current provers solve research-level lemmas | RLMEval demonstrates moderate success on routine results | ✅ Supported |
| Current provers solve main theorems of published papers | Success rate is low | ❌ Not yet |
| Autoformalization is a bottleneck for research-level ATP | Formalization errors degrade downstream proving | ✅ Supported |
| The gap between competition and research mathematics is large | RLMEval quantifies the gap across difficulty levels | ✅ Supported |
| Benchmarks on known problems overestimate AI mathematical capability | Research-level evaluation reveals lower performance | ✅ Supported |

Open Questions

  • What makes research mathematics hard for AI? Is it the novelty of proof strategies, the depth of required background knowledge, the need for mathematical intuition, or the formalization challenge? RLMEval identifies the gap but does not fully diagnose its cause.
  • Can retrieval help? If the prover has access to the mathematical literature (not just formalized libraries), can it find and adapt proof strategies from similar published results?
  • Collaborative proving: Rather than fully automated proving, can AI assist human mathematicians on specific sub-goals of a research proof, handling the routine parts while the human provides creative direction?
  • Evaluation beyond binary success: A prover that makes partial progress on a main theorem, formalizing the right approach but failing on a technical sub-goal, is more capable than one that makes no progress. Can we evaluate partial mathematical reasoning? (A sketch of what partial progress looks like follows this list.)
  • Domain adaptation: Performance likely varies across mathematical subdisciplines. Combinatorics may be easier (more pattern-based) than analysis (more epsilon-delta reasoning). How should evaluation account for domain-specific difficulty?
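
To make the partial-progress point concrete, here is a hypothetical proof skeleton in Lean: the overall decomposition is in place, but one technical sub-goal is left as `sorry`. A binary pass/fail metric scores this attempt the same as no attempt at all.

```lean
import Mathlib

-- Hypothetical partial attempt (illustrative statement and names):
-- the strategy is found, but one sub-goal remains open.
theorem illustrative_result (n : ℕ) : n ≤ n ^ 2 := by
  have key : n ≤ n * n := by
    sorry  -- unsolved technical sub-goal
  calc n ≤ n * n := key
    _ = n ^ 2 := by ring
```

One plausible metric, under these assumptions, is the fraction of sub-goals closed; here every step but `key` is complete.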
What This Means for Your Research

For AI researchers working on mathematical reasoning, RLMEval provides the most honest assessment of current capability. Competition benchmarks are useful for measuring progress but create a misleading impression of proximity to genuine mathematical research capability. The gap is large, and closing it requires advances in creative reasoning that current architectures may not support.

For mathematicians, RLMEval calibrates expectations. AI proof assistants can genuinely help with routine proof obligations, freeing human effort for creative work. But the headline-grabbing results on competition problems should not be extrapolated to research mathematics. The creative core of mathematical research remains distinctly human, for now.

For the mathematical community broadly, RLMEval raises the question of what mathematical research really is. If routine lemmas can be automated and competition problems can be solved, the uniquely human contribution to mathematics is increasingly concentrated in the creative acts of conjecture, strategy selection, and conceptual insight that current AI systems cannot perform.

References

[1] Poiroux, A., Bosselut, A., & Kuncak, V. (2025). RLMEval: Evaluating Research-Level Neural Theorem Proving. Findings of EMNLP.
