Aristotle at the Olympiad: AI Achieves Gold-Medal Mathematics Without Human Coaching

Aristotle solves 2025 International Mathematical Olympiad problems at gold-medal level by combining informal mathematical intuition (LLM reasoning) with formal proof verification (Lean 4). MATP-BENCH extends the challenge to multimodal problems requiring diagram understanding.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The International Mathematical Olympiad (IMO) is the most prestigious mathematics competition for pre-university students. Its problems are designed to resist systematic solution methods: they require creative insight, novel constructions, and the ability to connect disparate mathematical ideas in ways that standard techniques do not prescribe. Roughly one contestant in twelve earns a gold medal, which amounts to only several dozen students worldwide each year.

Achim et al.'s Aristotle system reaches gold-medal-equivalent performance on the 2025 IMO problems, producing formal proofs in Lean 4. The system combines three components: a Lean proof search engine, an informal reasoning system for lemma generation, and a dedicated geometry solver. For some problems, Aristotle produces proofs that take a different approach from typical human solutions, for instance a more algebraic proof where humans favor geometric arguments. The authors describe the informal reasoning process as inherently noisy, with gaps and errors that the formal verification step catches.

The Dual Architecture

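Aristotle's achievement rests on a dual architecture that mirrors two modes of mathematical thinking, joined by a layer that coordinates them (the dedicated geometry solver mentioned above sits alongside this pair):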

Informal reasoning (System 1): An LLM-based module reads the problem in natural language and generates a high-level proof strategy—a sketch of how the proof should proceed, which mathematical tools might be relevant, and what the key insights are. This is analogous to a mathematician's initial intuition: "This looks like it might yield to an induction argument combined with a pigeonhole principle application."

Formal verification (System 2): A Lean 4-based proof search system translates the informal strategy into rigorous formal proof steps. Each step is machine-verified—ensuring that the proof is not just plausible but mathematically valid. This is analogous to a mathematician writing out the full proof, checking each deduction.
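
To make "machine-verified" concrete, here is a minimal Lean 4 sketch. It is not taken from Aristotle's output; the theorem names `swap_add` and `swap_add_three` are invented for illustration, and only `Nat.add_comm` from the Lean core library is assumed. The first theorem plays the role of an intermediate lemma of the kind an informal reasoner might propose, and the second uses it to close the main goal.

```lean
-- Intermediate lemma (hypothetical name): addition on Nat commutes.
theorem swap_add (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Main statement, closed by rewriting with the lemma twice.
-- Lean accepts the file only if every step type-checks.
theorem swap_add_three (a b c : Nat) : (a + b) + c = c + (b + a) := by
  rw [swap_add a b, swap_add (b + a) c]
```

If the lemma were stated incorrectly, or a rewrite did not match the goal, Lean would reject the proof rather than accept a plausible-looking argument. That is the property the formal side contributes.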

Integration layer: A meta-controller mediates between the two systems. When the informal strategy suggests an approach that the formal system cannot complete, the controller backtracks and requests an alternative strategy. When the formal system discovers that an intermediate claim is unprovable, it feeds this information back to the informal reasoner, which adjusts its approach.
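
The propose, verify, and backtrack cycle described above can be sketched in a few lines of Python. This is an illustrative toy, not Aristotle's controller: the names `solve`, `propose`, `verify`, `toy_propose`, and `toy_verify` are hypothetical, and the strategies are plain strings standing in for real proof plans.

```python
from typing import Callable, List, Optional, Tuple

# Each feedback entry: (strategy that was tried, claim that failed to verify).
Feedback = List[Tuple[str, str]]


def solve(problem: str,
          propose: Callable[[str, Feedback], Optional[str]],
          verify: Callable[[str], Optional[str]],
          max_rounds: int = 5) -> Tuple[Optional[str], Feedback]:
    """Alternate between an informal proposer and a formal verifier.

    propose(problem, feedback) returns a strategy, or None if it has no ideas left.
    verify(strategy) returns None on success, or the claim it could not prove.
    """
    feedback: Feedback = []
    for _ in range(max_rounds):
        strategy = propose(problem, feedback)
        if strategy is None:           # informal side has run out of ideas
            break
        failed_claim = verify(strategy)
        if failed_claim is None:       # every step checked: done
            return strategy, feedback
        # Backtrack: record what failed so the next proposal can avoid it.
        feedback.append((strategy, failed_claim))
    return None, feedback


# Toy stand-ins for the two systems (assumptions for illustration only).
def toy_propose(problem: str, feedback: Feedback) -> Optional[str]:
    tried = {strategy for strategy, _ in feedback}
    for candidate in ("coordinate bash", "pigeonhole", "induction"):
        if candidate not in tried:
            return candidate
    return None


def toy_verify(strategy: str) -> Optional[str]:
    return None if strategy == "induction" else f"key lemma of '{strategy}'"


result, log = solve("toy problem", toy_propose, toy_verify)
print(result)                              # the strategy that verified
print([strategy for strategy, _ in log])   # strategies abandoned along the way
```

The design point is that the verifier's failures are not discarded: each one is passed back to the proposer, which is what separates this loop from simply sampling independent attempts.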

This integration is the key architectural innovation. Previous systems operated in either informal mode (generating plausible but unverified solutions) or formal mode (searching for proofs within a fixed strategy space). Aristotle is among the first to operationalize the interplay between insight and rigor that characterizes human mathematical reasoning.

Multimodal Mathematics: MATP-BENCH

He et al.'s MATP-BENCH extends the automated theorem proving challenge to multimodal problems—mathematical problems that include diagrams, figures, and geometric constructions alongside text. Many competition mathematics problems (and much of geometry) are inherently visual: the diagram is not merely an illustration but an essential part of the problem statement.

Multimodal LLMs can perceive diagrams, but using visual information for formal mathematical reasoning presents unique challenges:

Extracting geometric relationships from a diagram (which points are collinear, which angles are equal, which segments are parallel) requires visual understanding combined with geometric knowledge.
Formalizing visual intuition (this triangle "looks" isosceles) means turning it into a precise mathematical claim that can be verified.
Using diagram information to guide proof search is hard to replicate: human mathematicians frequently "see" the proof strategy in the diagram before formalizing it.
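
The first two challenges can be made concrete with a small Python sketch. It assumes point coordinates have already been read off a diagram (the `points` dictionary, the claim strings, and the tolerances are all made up for illustration and are not part of MATP-BENCH). It turns numerical coincidences into candidate claims, which still have to be proved formally before they count for anything.

```python
from itertools import combinations
from math import dist, isclose

# Hypothetical coordinates read off a diagram; labels are illustrative.
points = {"A": (0.0, 0.0), "B": (4.0, 0.0), "C": (2.0, 3.0), "M": (2.0, 0.0)}


def collinear(p, q, r, tol=1e-9):
    """True if the cross product of (q - p) and (r - p) is within tol of zero."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return abs(cross) <= tol


def candidate_claims(pts, rel_tol=0.02):
    """List claims the diagram suggests; these are conjectures, not proofs."""
    names = sorted(pts)
    claims = []
    for a, b, c in combinations(names, 3):
        if collinear(pts[a], pts[b], pts[c]):
            claims.append(f"Collinear({a}, {b}, {c})")
    segments = list(combinations(names, 2))
    for (a, b), (c, d) in combinations(segments, 2):
        # "Looks equal": lengths agree within a relative tolerance.
        if isclose(dist(pts[a], pts[b]), dist(pts[c], pts[d]), rel_tol=rel_tol):
            claims.append(f"EqualLength({a}{b}, {c}{d})")
    return claims


print(candidate_claims(points))

# A slightly misdrawn diagram still "looks" isosceles under the tolerance,
# which is why perception alone cannot settle the claim.
noisy = dict(points, C=(2.03, 3.0))
print(candidate_claims(noisy))
```

The tolerance is the crux. Too tight, and a hand-drawn figure yields no claims. Too loose, and the figure suggests equalities that are false in general. In both cases the output is only a list of hypotheses for the formal prover to confirm or reject.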

MATP-BENCH evaluates whether current multimodal LLMs can perform these steps, and the results reveal substantial gaps: models that perform well on text-only mathematical reasoning degrade when diagrams are introduced, suggesting that visual mathematical reasoning is a distinct capability from textual mathematical reasoning.

Claims and Evidence

Claim	Evidence	Verdict
AI can solve IMO-level competition problems	Aristotle reaches gold-medal-equivalent performance on the 2025 IMO problems	✅ Demonstrated
Informal-formal integration improves over either alone	Aristotle outperforms both informal-only and formal-only baselines	✅ Supported
Current MLLMs handle multimodal mathematical reasoning well	MATP-BENCH shows degradation when diagrams are introduced	❌ Not yet
AI mathematical reasoning transfers across problem types	Limited evidence; performance varies by mathematical domain	⚠️ Domain-dependent

Open Questions

Research mathematics: IMO problems, while difficult, are solvable with known techniques. Can AI systems tackle open research problems—conjectures where the answer is genuinely unknown?

Mathematical taste: Human mathematicians have "taste"—an intuition for which problems are important, which approaches are elegant, which results are surprising. Can AI develop mathematical taste, or will it remain a powerful but undiscriminating tool?

Collaboration models: How should mathematicians work with AI proof assistants? As verifiers (checking human proofs)? As co-authors (contributing novel proof steps)? As explorers (systematically searching for proofs of conjectured results)?

Educational implications: If AI can solve competition mathematics at gold-medal level, what does this mean for mathematics education? Does it devalue competition training, or does it provide a new pedagogical tool?

What This Means for Your Research

For mathematicians, Aristotle represents both a tool and a challenge. The tool: AI proof assistants that can verify and extend mathematical work. The challenge: understanding what mathematical creativity means when machines can replicate its outputs. A likely resolution is that AI takes over the routine aspects of mathematical reasoning, freeing mathematicians for the conceptual work that remains distinctly human.

For AI researchers, competition mathematics provides a uniquely rigorous evaluation domain. The problems are hard, the solutions are verifiable, and the comparison to human performance is direct. Aristotle's success raises the bar for what we expect from AI reasoning systems.

References (2)

[1] Achim, T., Best, A., Bietti, A. et al. (2025). Aristotle: IMO-level Automated Theorem Proving. arXiv:2510.01346.

[2] He, Z., Lyu, Z., Chen, D. et al. (2025). MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems? arXiv:2506.06034.