Gemini 2.5 Pro's Thinking Budget: Controlling the Quality-Cost Tradeoff in Reasoning

Google's Gemini 2.5 Pro introduces a 'thinking budget' that gives users direct control over how much computation a model spends reasoning. We examine what this means for the quality-cost-latency triangle and whether user-controlled inference scaling changes the economics of AI deployment.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Most reasoning models present users with a binary choice: either the model thinks extensively (high cost, high latency, better answers) or it does not (fast, cheap, sometimes wrong). Google's Gemini 2.5 Pro technical report (Comanici et al., 2025) introduces a different framing. Rather than a toggle, users receive a dial — a thinking budget that lets them specify, in concrete terms, how much computation the model should spend on internal reasoning before producing a response.

This is a design choice worth examining carefully, because it shifts responsibility for inference economics from the model provider to the user.

Research Landscape: The Reasoning Model Generation

The 2024-2025 period has seen a clear architectural trend: models that allocate variable compute at inference time based on problem difficulty. OpenAI's o1 and o3 series, DeepSeek-R1, and Anthropic's extended thinking in Claude all implement variations on this theme. The core insight is shared — harder problems benefit from more internal deliberation — but the implementations differ in a critical design dimension: who decides how much thinking happens?

In most systems, the model itself determines reasoning depth. The model reads the prompt, estimates difficulty, and allocates tokens to its chain-of-thought accordingly. This is elegant but opaque: the user cannot predict the cost of a query before it executes, and there is no mechanism to say "this problem is not worth more than $0.02 of compute."

Gemini 2.5 Pro's thinking budget makes this tradeoff explicit and user-controllable. According to the technical report, users can directly set the reasoning budget, enabling them to manage the quality-cost-latency triangle for their specific use case. A developer building a chatbot for quick factual queries might set a minimal thinking budget. A research team solving competition-level mathematics might set it to maximum.
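To make the dial concrete, here is a minimal sketch of turning a dollar ceiling into a thinking budget and attaching it to a request body. The `generationConfig.thinkingConfig.thinkingBudget` field names follow the public Gemini API's REST shape as we understand it. The price per million tokens, the budget bounds, and the helper names `budget_from_cost_cap` and `build_request` are placeholders assumed for illustration, so check the current documentation before relying on any of them. The sketch also assumes thinking tokens are billed at a flat per-token price.

```python
# Hypothetical figures for illustration only; real prices and limits
# vary by model and change over time.
ASSUMED_PRICE_PER_MILLION_TOKENS_USD = 10.0
ASSUMED_MIN_BUDGET = 128
ASSUMED_MAX_BUDGET = 32_768


def budget_from_cost_cap(
    max_usd: float,
    price_per_million: float = ASSUMED_PRICE_PER_MILLION_TOKENS_USD,
    lo: int = ASSUMED_MIN_BUDGET,
    hi: int = ASSUMED_MAX_BUDGET,
) -> int:
    """Convert a dollar ceiling on reasoning into a token budget clamped to [lo, hi]."""
    if max_usd < 0 or price_per_million <= 0:
        raise ValueError("cost cap must be non-negative and price must be positive")
    tokens = int(max_usd * 1_000_000 // price_per_million)
    return max(lo, min(hi, tokens))


def build_request(prompt: str, thinking_budget: int) -> dict:
    """Assemble a request body carrying the budget (field names are assumed)."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {"thinkingBudget": thinking_budget},
        },
    }


# "This problem is not worth more than $0.02 of reasoning."
budget = budget_from_cost_cap(0.02)
request = build_request("What is the capital of France?", budget)
print(budget, request["generationConfig"])
```

The clamp matters in practice: a cost cap below the smallest allowed budget still yields the minimum, so the real floor on spend is set by the provider rather than by the caller.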

Benchmark Context

The report positions Gemini 2.5 Pro as achieving state-of-the-art performance on coding and reasoning benchmarks. Specific results cited include a strong score on SWE-Bench Verified using a custom agent setup, first place on AIME 2025 without majority voting, and a Gold medal on IMO 2025. The model also supports processing up to 3 hours of video content, reflecting its multimodal capabilities.

Several contextual notes are important for interpreting these results. SWE-Bench Verified measures end-to-end software engineering ability: given a GitHub issue, can the model produce a working patch? The reported figure uses a custom agent setup, so it reflects the model together with its scaffolding, and the score under a different harness may differ. The AIME result is notable specifically because it was achieved without majority voting, a technique in which multiple samples are generated and the most common answer is selected. This distinction matters because majority voting multiplies inference cost by roughly the number of samples and can make a model look stronger than any single response would suggest.
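For readers unfamiliar with majority voting, the following is a minimal sketch of the technique. The sampled answers are made up for illustration and do not come from the report. The function name `majority_vote` is ours.

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers from five independent samples of one question.
samples = ["204", "204", "113", "204", "540"]

voted = majority_vote(samples)
print("voted answer:", voted)
print("samples paid for:", len(samples))
```

Two of the five samples disagree with the voted answer, so a single draw could have been wrong where the vote is not. The price is five generations instead of one, which is why a result reported without voting is the more conservative figure.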

Critical Analysis: Claims and Evidence

Claim	Source	Assessment
Thinking budget allows users to control quality-cost-latency tradeoff	Technical report	Supported as architectural feature; long-term user behavior data not yet available
Strong performance on SWE-Bench Verified	Technical report (custom agent setup)	Supported with caveat: the score depends on the agent scaffolding
AIME 2025 first place without majority voting	Technical report	Supported; the "without majority voting" qualifier is significant
IMO 2025 Gold medal	Technical report	Supported; a separate paper (arXiv:2507.15855) details the methodology
Thinking budget changes deployment economics	Implied	Plausible but unverified at scale

What the Report Does Not Address

The technical report does not provide detailed ablation studies showing how performance degrades as the thinking budget decreases. This is the most important missing piece: if performance drops sharply below a certain threshold, the "dial" is effectively a binary switch with extra steps. If degradation is gradual, the feature genuinely enables fine-grained cost management.
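One way to tell a dial from a switch, once budget-versus-accuracy measurements exist, is to ask how much of the total accuracy loss happens in a single step between adjacent budget levels. The sketch below is our own illustrative heuristic, not a method from the report, and both curves are invented numbers.

```python
from typing import Sequence


def largest_drop_share(budgets: Sequence[int], accuracies: Sequence[float]) -> float:
    """Share of the total accuracy loss that occurs between one pair of adjacent budgets.

    Values near 1.0 suggest a step function; values near 1 / (n - 1)
    suggest an even, gradual decline across n budget levels.
    """
    if len(budgets) != len(accuracies) or len(budgets) < 2:
        raise ValueError("need at least two (budget, accuracy) pairs of equal length")
    # Order the measurements from the largest budget to the smallest.
    ordered = [acc for _, acc in sorted(zip(budgets, accuracies), reverse=True)]
    total_loss = ordered[0] - ordered[-1]
    if total_loss <= 0:
        return 0.0  # no degradation to attribute
    drops = [high - low for high, low in zip(ordered, ordered[1:])]
    return max(drops) / total_loss


budgets = [32768, 8192, 2048, 512, 128]
gradual_curve = [0.80, 0.75, 0.70, 0.65, 0.60]  # hypothetical
step_curve = [0.80, 0.79, 0.78, 0.42, 0.40]     # hypothetical

print("gradual:", round(largest_drop_share(budgets, gradual_curve), 2))
print("step:   ", round(largest_drop_share(budgets, step_curve), 2))
```

A share close to one means most of the loss sits at a single threshold, in which case the useful settings are effectively "above it" and "below it".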

The report also does not address how the thinking budget interacts with problem difficulty estimation. When a user sets a low budget for a genuinely hard problem, does the model fail gracefully (producing a lower-confidence answer) or fail catastrophically (producing a confidently wrong answer)?

The Design Philosophy Question

The thinking budget concept reflects a broader tension in AI system design: abstraction versus control. Most AI products abstract away inference details, presenting users with a simple input-output interface. The thinking budget breaks this abstraction deliberately, exposing an internal parameter that was previously hidden.

This has precedent in cloud computing, where users choose between instance types with different resource configurations. Most cloud users converge on a small number of configurations. Whether thinking budgets follow the same pattern — collapsing into "fast," "balanced," and "deep" presets — remains to be seen.

Open Questions

Degradation curve: How does performance on reasoning benchmarks change as the thinking budget decreases from maximum to minimum? Is the relationship linear, logarithmic, or step-function?

User calibration: Can users accurately estimate the appropriate thinking budget for a given task? If not, does the feature create anxiety rather than control?

Competitive dynamics: Will other providers adopt user-controllable reasoning budgets, or will they compete on automatic budget allocation?

Benchmark inflation: As thinking models proliferate, do existing benchmarks adequately distinguish between models that reason well efficiently and models that reason well expensively?

Multimodal reasoning cost: The report emphasizes multimodal capabilities including video processing. How does the thinking budget interact with multimodal inputs, where the "difficulty" of reasoning depends on modality?

What This Means for Practitioners

For developers integrating LLMs into products, the thinking budget is operationally significant. It converts unpredictable inference costs into controllable ones — a genuine improvement for production budgeting. The practical recommendation is straightforward: benchmark your specific use case at multiple budget levels to find the cost-performance knee, rather than defaulting to maximum.
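As a minimal sketch of that recommendation, the helper below picks the smallest budget whose measured accuracy is within a chosen tolerance of the best one. This is a simple proxy for the knee, not a formal knee-detection method. The measurements, the tolerance, and the name `cheapest_adequate_budget` are assumptions for illustration.

```python
from typing import Mapping


def cheapest_adequate_budget(results: Mapping[int, float], tolerance: float = 0.02) -> int:
    """Smallest budget whose accuracy is within `tolerance` of the best measured accuracy."""
    if not results:
        raise ValueError("no measurements supplied")
    best = max(results.values())
    return min(budget for budget, acc in results.items() if acc >= best - tolerance)


# Hypothetical accuracy on an in-house evaluation set at each budget level.
measured = {
    128: 0.61,
    1024: 0.74,
    4096: 0.835,
    16384: 0.84,
    32768: 0.85,
}

print(cheapest_adequate_budget(measured, tolerance=0.02))
```

With these invented numbers the helper settles on a budget one eighth of the maximum, at a cost of about one and a half points of accuracy. Whether that trade is acceptable is a product decision, which is why the tolerance is a parameter.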

For researchers, the thinking budget raises a methodological question: when reporting benchmark results for reasoning models, should the compute budget be standardized? A model achieving marginally higher accuracy at significantly greater compute cost has not clearly demonstrated superiority.
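The point can be stated as a dominance check: one result is clearly better than another only if it is at least as accurate and at least as cheap, and strictly better on one of the two. The sketch below uses hypothetical models and numbers. Tokens per problem stands in for compute cost, which is itself an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    accuracy: float
    tokens_per_problem: float  # stand-in for compute cost


def dominates(a: Result, b: Result) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.tokens_per_problem <= b.tokens_per_problem
    strictly_better = a.accuracy > b.accuracy or a.tokens_per_problem < b.tokens_per_problem
    return no_worse and strictly_better


# Hypothetical benchmark entries.
model_a = Result("A", accuracy=0.86, tokens_per_problem=30_000)
model_b = Result("B", accuracy=0.85, tokens_per_problem=6_000)

print("A dominates B:", dominates(model_a, model_b))
print("B dominates A:", dominates(model_b, model_a))
```

Neither entry dominates the other, so a leaderboard sorted by accuracy alone would rank A first while hiding a fivefold difference in compute.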

The thinking budget is not a technical novelty so much as an economic interface innovation. It makes the cost of intelligence visible and manageable — a necessary step as reasoning models move from research demonstrations to production infrastructure.

Disclaimer: This post is a research overview for informational purposes. Before citing it in academic work, verify specific findings, statistics, and claims against the original papers.

References

[1] Comanici, G., Bieber, E., Schaekermann, M. et al. (2025). Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv:2507.06261.

[2] Setlur, Yang, Snell (2025). e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs.

[3] Wilhelm, P., Wittkopp, T., & Kao, O. (2025). Beyond Test-Time Compute Strategies: Advocating Energy-per-Token in LLM Inference. Proceedings of the 5th Workshop on Machine Learning and Systems, 208-215.
