
RLMEval: Can Neural Theorem Provers Handle Research-Level Mathematics?

Most ATP benchmarks test undergraduate or competition mathematics. RLMEval evaluates neural theorem provers on research-level mathematics from real publications, revealing that the gap between solving competition problems and advancing mathematical research remains substantial.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The automated theorem proving community celebrates each new benchmark conquered: miniF2F problems solved, IMO questions answered, Mathlib theorems reproven. These achievements are genuine and impressive. But they share a common limitation: the problems are known to be solvable. Competition problems have solutions that fit on a single page. Textbook exercises have answers in the back. Mathlib theorems have proofs that human mathematicians have already written.

Research-level mathematics is qualitatively different. The problems may not have solutions. The techniques required may not exist yet. The formalization of the problem statement itself may require mathematical insight. The gap between solving known problems and contributing to open research is the gap between a talented student and a working mathematician, and it is enormous.

Poiroux et al.'s RLMEval (presented at EMNLP Findings) confronts this gap directly by evaluating neural theorem provers on research-level mathematics from real published papers. The results provide a sobering calibration of where AI mathematical reasoning actually stands.

The Benchmark Design

RLMEval collects theorems from recent mathematical publicationsโ€”real results that working mathematicians proved and published. The theorems span multiple mathematical subdisciplines and difficulty levels, from routine lemmas (technical results needed for the main theorems) to the main theorems themselves.

Each theorem is provided in two forms:

  • Informal statement: The theorem as written in the paper (natural language with mathematical notation)
  • Formal statement: The theorem formalized in Lean (machine-verifiable specification)

The evaluation measures two capabilities:

  • Proof generation: Given the formal statement, can the prover find a proof?
  • Autoformalization: Given the informal statement, can the system produce a correct formalization?
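
As a concrete illustration of the two forms, consider a toy pairing (a hypothetical item, not an actual RLMEval entry): the informal claim "the sum of the first n natural numbers is n(n+1)/2" alongside one possible Lean formalization. Proof generation then amounts to replacing the `sorry`:

```lean
import Mathlib

-- Hypothetical benchmark item, shown only to illustrate the
-- informal/formal pairing; RLMEval's real entries come from
-- recent research papers.
--
-- Informal statement: "For every natural number n, the sum
-- 0 + 1 + ... + n equals n(n+1)/2."

-- Formal statement (stated without division to stay in ℕ):
theorem sum_first_n (n : ℕ) :
    2 * (Finset.range (n + 1)).sum (fun i => i) = n * (n + 1) := by
  sorry  -- proof generation: the prover must fill this in
```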
What the Results Reveal

The findings are instructive in their specificity:

Routine lemmas: Neural provers handle many routine lemmas, such as straightforward consequences of definitions, applications of known theorems, and algebraic manipulations. These are the mathematical equivalent of "boilerplate code": necessary but not creative.
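
In Lean terms, a routine lemma is the kind of goal that standard Mathlib automation closes in one step. A minimal illustrative example (not taken from the benchmark):

```lean
import Mathlib

-- A routine algebraic manipulation: one call to the `ring` tactic
-- closes the goal with no search. Goals of this shape are the
-- "boilerplate" tier where current neural provers do well.
example (a b : ℤ) : (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by
  ring
```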

Non-trivial intermediate results: Provers struggle with intermediate results that require choosing the right mathematical technique from several possibilities. Unlike competition problems, where the technique is often suggested by the problem's context, research mathematics requires the prover to select autonomously from a large toolbox.

Main theorems: Current provers rarely succeed on the main theorems of published papers. These theorems typically require novel proof strategies that combine techniques in ways not seen in the training data, precisely the capability that defines mathematical research.

Autoformalization: Translating informal mathematical statements to formal specifications is itself a challenging task. Mathematical notation is ambiguous (the same symbol means different things in different contexts), implicit assumptions are common (domain experts share unstated conventions), and formalization choices affect proof difficulty.
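
A toy illustration of the ambiguity problem (hypothetical statements, not benchmark items): the informal claim "a - b + b = a" formalizes to a false statement over the natural numbers, where subtraction truncates at zero, but to a true one over the integers.

```lean
import Mathlib

-- Over ℕ, subtraction truncates, so the naive formalization is false:
example : ¬ ∀ a b : ℕ, a - b + b = a := by
  intro h
  exact absurd (h 0 1) (by decide)  -- 0 - 1 + 1 = 1, not 0

-- Over ℤ, the same informal sentence is a theorem:
example (a b : ℤ) : a - b + b = a := by
  ring
```

Which reading the paper's author intended is an unstated convention, and the choice changes the downstream proof burden.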

The Gap Analysis

RLMEval enables a precise gap analysis between current AI capability and research-level mathematics:

| Capability level | AI performance | Human comparison |
|---|---|---|
| Routine lemmas | Good | Undergraduate |
| Non-trivial intermediates | Moderate | Graduate student |
| Main theorems | Poor | Researcher |
| Novel proof strategies | Absent | Expert researcher |
| Conjecture generation | Not evaluated | Creative mathematician |

The progression from routine to creative mirrors the human mathematical development trajectory, and current AI systems are roughly at the graduate-student level: competent with known techniques, struggling when creativity is required.

Claims and Evidence

| Claim | Evidence | Verdict |
|---|---|---|
| Current provers solve research-level lemmas | RLMEval demonstrates moderate success on routine results | ✅ Supported |
| Current provers solve main theorems of published papers | Success rate is low | ❌ Not yet |
| Autoformalization is a bottleneck for research-level ATP | Formalization errors degrade downstream proving | ✅ Supported |
| The gap between competition and research mathematics is large | RLMEval quantifies the gap across difficulty levels | ✅ Supported |
| Benchmarks on known problems overestimate AI mathematical capability | Research-level evaluation reveals lower performance | ✅ Supported |

Open Questions

  • What makes research mathematics hard for AI? Is it the novelty of proof strategies, the depth of required background knowledge, the need for mathematical intuition, or the formalization challenge? RLMEval identifies the gap but does not fully diagnose its cause.
  • Can retrieval help? If the prover has access to the mathematical literature (not just formalized libraries), can it find and adapt proof strategies from similar published results?
  • Collaborative proving: Rather than fully automated proving, can AI assist human mathematicians on specific sub-goals of a research proof, handling the routine parts while the human provides creative direction?
  • Evaluation beyond binary success: A prover that makes partial progress on a main theorem, formalizing the right approach but failing on a technical sub-goal, is more capable than one that makes no progress. Can we evaluate partial mathematical reasoning? (A sketch of what partial progress looks like follows this list.)
  • Domain adaptation: Performance likely varies across mathematical subdisciplines. Combinatorics may be easier (more pattern-based) than analysis (more epsilon-delta reasoning). How should evaluation account for domain-specific difficulty?
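
To make the partial-progress point concrete, here is a hypothetical proof skeleton in Lean: the overall decomposition is in place, but one technical sub-goal is left as `sorry`. A binary pass/fail metric scores this attempt the same as no attempt at all.

```lean
import Mathlib

-- Hypothetical partial attempt (illustrative statement and names):
-- the strategy is found, but one sub-goal remains open.
theorem illustrative_result (n : ℕ) : n ≤ n ^ 2 := by
  have key : n ≤ n * n := by
    sorry  -- unsolved technical sub-goal
  calc n ≤ n * n := key
    _ = n ^ 2 := by ring
```

One plausible metric, under these assumptions, is the fraction of sub-goals closed; here every step but `key` is complete.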
What This Means for Your Research

For AI researchers working on mathematical reasoning, RLMEval provides the most honest assessment of current capability. Competition benchmarks are useful for measuring progress but create a misleading impression of proximity to genuine mathematical research capability. The gap is large, and closing it requires advances in creative reasoning that current architectures may not support.

For mathematicians, RLMEval calibrates expectations. AI proof assistants can genuinely help with routine proof obligations, freeing human effort for creative work. But the headline-grabbing results on competition problems should not be extrapolated to research mathematics. The creative core of mathematical research remains distinctly human, for now.

For the mathematical community broadly, RLMEval raises the question of what mathematical research really is. If routine lemmas can be automated and competition problems can be solved, the uniquely human contribution to mathematics is increasingly concentrated in the creative acts of conjecture, strategy selection, and conceptual insight that current AI systems cannot perform.

References

[1] Poiroux, A., Bosselut, A., & Kuncak, V. (2025). RLMEval: Evaluating Research-Level Neural Theorem Proving. Findings of EMNLP.
