The RLVR Paradox: Why Checking Only the Answer Makes the Reasoning Right

A persistent worry in RL-trained reasoning models: if you only reward the final answer, won't the model learn to reach correct answers through flawed reasoning? A new theoretical result shows that under specific conditions, GRPO with binary verifiable rewards implicitly amplifies the probability of correct chain-of-thought—not just correct answers.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Here is a puzzle that has quietly troubled the reasoning-LLM community. You train a model with reinforcement learning, rewarding it when its final answer is correct and penalizing it when the answer is wrong. You never inspect the intermediate reasoning steps. You never reward good reasoning or penalize bad reasoning. You check only the output.

Common sense suggests this should produce a model that games the reward: finding shortcuts, memorizing answer patterns, or stumbling onto correct answers through flawed logic. The reasoning chain should degrade or become decorative—present but not functional.

Yet empirically, models trained this way, DeepSeek-R1 among them, develop coherent, step-by-step reasoning. The chains of thought are not random walks that happen to terminate at correct answers. They contain logical structure, self-correction, and verification steps. How?

The RLVR paper (2025) [1] provides a theoretical answer: under specific, identifiable conditions, GRPO (Group Relative Policy Optimization) with binary verifiable rewards implicitly incentivizes correct chain-of-thought reasoning. The gradient does the work that explicit process supervision was thought to require.

The Research Landscape

The tension between outcome-based and process-based reward has been a central debate in reasoning model development. Process reward models (PRMs) evaluate each reasoning step individually, providing dense supervision that directly rewards correct reasoning. The drawback is cost: step-level annotations require expert human judgment, and training a separate reward model introduces its own failure modes.

Outcome reward models (ORMs) check only the final answer. They are cheap to construct—for math problems, the answer is either right or wrong, verifiable automatically. But the theoretical concern is reward hacking: the model might learn associations between surface patterns and correct answers without developing genuine reasoning capability.
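To make "verifiable automatically" concrete, here is a minimal sketch of a binary outcome reward for short math answers. The function names and the normalization rules (trimming whitespace and a trailing period, comparing numbers exactly via fractions) are illustrative assumptions, not something the paper specifies.

```python
from fractions import Fraction


def normalize_answer(text: str) -> str:
    """Trim whitespace and one trailing period, then lowercase (assumed rules)."""
    return text.strip().rstrip(".").strip().lower()


def binary_reward(predicted: str, ground_truth: str) -> float:
    """Return 1.0 if the final answers match, else 0.0.

    Tries an exact numeric comparison first, so "0.5" matches "1/2",
    then falls back to normalized string equality.
    """
    p, g = normalize_answer(predicted), normalize_answer(ground_truth)
    try:
        return 1.0 if Fraction(p) == Fraction(g) else 0.0
    except (ValueError, ZeroDivisionError):
        # Not both parseable as numbers: compare as plain strings.
        return 1.0 if p == g else 0.0


if __name__ == "__main__":
    print(binary_reward("0.5", "1/2"))
    print(binary_reward("41", "42"))
```

Note that nothing in this function looks at the reasoning that preceded the answer, which is exactly the property the reward-hacking concern is about.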

The practical evidence has been mixed. Some studies find that ORMs produce reasoning quality comparable to PRMs. Others find degradation in reasoning faithfulness despite maintained answer accuracy. The RLVR paper attempts to reconcile these findings by identifying the conditions under which outcome-only reward does and does not incentivize correct reasoning.

The Theoretical Result

The paper's central theoretical contribution, as stated in the abstract, is a proof that GRPO with binary verifiable rewards—rewards that simply check whether the final answer matches the ground truth—implicitly incentivizes correct chain-of-thought reasoning.

The mechanism works as follows. GRPO updates the model by comparing outputs within a group: for the same problem, multiple candidate solutions are generated, and the policy gradient pushes probability mass toward solutions that received higher reward (correct answers) and away from those that received lower reward (incorrect answers).
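The group comparison described above is usually written as a normalized advantage: each sample's reward minus the group mean, divided by the group standard deviation. The sketch below assumes that common formulation, with the population standard deviation and a small `eps` for numerical safety. Implementations differ on both details, so treat it as an illustration rather than the paper's exact definition.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Mean-center and scale the rewards of one group of sampled solutions.

    rewards: one reward per sampled solution to the same problem
             (1.0 = correct final answer, 0.0 = incorrect).
    eps:     assumed stabilizer so a zero-variance group does not divide by zero.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Two correct and two incorrect answers: correct ones get positive advantage.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Every answer correct: all advantages are zero, so there is nothing to push on.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Because the advantages are centered on the group mean, the update is purely relative: a solution is pushed up only to the extent that it did better than its siblings on the same problem.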

The key insight is about what distinguishes correct-answer solutions from incorrect-answer solutions within the model's own generation distribution. According to the abstract, the critical condition is that the base LLM can distinguish correct from incorrect reasoning chains through strong pretraining. When this condition holds, correct reasoning chains are more likely to produce correct answers than incorrect reasoning chains are. Consequently, the set of solutions receiving positive reward (correct final answers) is enriched for correct reasoning, and the set receiving negative reward is enriched for incorrect reasoning.
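The enrichment step can be made explicit with Bayes' rule. The notation here is introduced for illustration and is not taken from the paper: C is the event that a sampled chain of thought is correct, A is the event that its final answer is correct.

```latex
\begin{aligned}
\pi &= P(C), \qquad \alpha = P(A \mid C), \qquad \beta = P(A \mid \neg C), \\
P(C \mid A) &= \frac{\alpha\,\pi}{\alpha\,\pi + \beta\,(1-\pi)}, \\
P(C \mid A) - \pi &= \frac{\pi\,(1-\pi)\,(\alpha-\beta)}{\alpha\,\pi + \beta\,(1-\pi)} .
\end{aligned}
```

For 0 < π < 1 the last expression is positive exactly when α > β. In words: conditioning on a correct answer raises the share of correct reasoning above its base rate if and only if correct reasoning is more likely than incorrect reasoning to reach the right answer.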

The GRPO gradient, by amplifying probability of correct-answer solutions and suppressing incorrect-answer solutions, therefore implicitly amplifies the probability of correct reasoning chains and suppresses incorrect ones—even though the reward signal contains no information about reasoning quality.
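A small Monte Carlo sketch can illustrate this argument. It assumes a toy policy in which each sampled chain of thought is independently correct with probability `pi`, and reaches the right answer with probability `alpha` if correct and `beta` if not. These parameters, the group size, and the mean/std advantage normalization are all assumptions made for the illustration. The reward never sees the reasoning label, yet the average advantage differs by reasoning type.

```python
import random
from statistics import fmean, pstdev


def simulate_mean_advantage(pi, alpha, beta, group_size=8, n_groups=20000, seed=0):
    """Average group-relative advantage received by correct vs. incorrect CoTs.

    pi    : P(sampled chain of thought is correct)   (toy assumption)
    alpha : P(correct answer | correct CoT)          (toy assumption)
    beta  : P(correct answer | incorrect CoT)        (toy assumption)
    """
    rng = random.Random(seed)
    adv_correct, adv_incorrect = [], []
    for _ in range(n_groups):
        cot_ok = [rng.random() < pi for _ in range(group_size)]
        # Binary outcome reward: depends on the answer only.
        rewards = [1.0 if rng.random() < (alpha if ok else beta) else 0.0
                   for ok in cot_ok]
        if min(rewards) == max(rewards):
            # Uniform group: every advantage is zero.
            advantages = [0.0] * group_size
        else:
            mean, std = fmean(rewards), pstdev(rewards)
            advantages = [(r - mean) / std for r in rewards]
        for ok, adv in zip(cot_ok, advantages):
            (adv_correct if ok else adv_incorrect).append(adv)
    return fmean(adv_correct), fmean(adv_incorrect)


if __name__ == "__main__":
    print(simulate_mean_advantage(pi=0.3, alpha=0.9, beta=0.2))
    print(simulate_mean_advantage(pi=0.3, alpha=0.5, beta=0.5))
```

With `alpha` well above `beta`, correct chains receive a positive average advantage and incorrect chains a negative one. With `alpha` equal to `beta`, both averages sit near zero, which is the toy-model version of the condition failing.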

Why the Base Model Matters

The condition identified in the abstract—that the base LLM must be able to distinguish correct from incorrect reasoning through strong pretraining—is not a trivial assumption. It implies that the base model already has latent knowledge of what constitutes valid reasoning, even if it does not reliably produce valid reasoning in practice.

This is plausible for large models pretrained on extensive corpora that include mathematical proofs, logical arguments, scientific papers, and other reasoning-heavy text. Through pretraining, these models develop internal representations that correlate with reasoning validity. The RL training does not create reasoning ability from scratch; it amplifies a latent signal that pretraining established.

This framing also explains why the approach might fail for smaller or less capable base models: if the base model cannot distinguish good from bad reasoning, then correct answers will be equally likely to arise from correct and incorrect reasoning chains, and the GRPO gradient will not preferentially amplify correct reasoning.
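The failure case can be checked numerically with Bayes' rule. The function and its parameters are hypothetical names for this illustration: `pi` is the base rate of correct reasoning, `alpha` and `beta` are the probabilities of a correct answer given correct and incorrect reasoning respectively.

```python
def posterior_correct_cot(pi, alpha, beta):
    """P(correct CoT | correct answer) under the toy model described above."""
    evidence = alpha * pi + beta * (1.0 - pi)  # P(correct answer)
    if evidence == 0.0:
        raise ValueError("correct answers never occur under these parameters")
    return alpha * pi / evidence


if __name__ == "__main__":
    # Capable base model: correct reasoning reaches the answer far more often.
    print(posterior_correct_cot(pi=0.3, alpha=0.9, beta=0.2))
    # Weak base model: answers are equally likely either way, so no enrichment.
    print(posterior_correct_cot(pi=0.3, alpha=0.5, beta=0.5))
```

In the first case the share of correct reasoning among rewarded solutions rises from 0.30 to roughly 0.66. In the second it stays at the 0.30 base rate, so rewarding correct answers selects nothing about reasoning quality.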

Critical Analysis: Claims and Evidence

Claim	Source	Verdict
GRPO with binary verifiable rewards implicitly incentivizes correct chain-of-thought	Abstract, theoretical proof	✅ Supported — theoretical result with identified conditions
The mechanism requires the base LLM to distinguish correct from incorrect reasoning	Abstract, stated condition	✅ Explicitly stated as a condition of the result
GRPO gradient automatically amplifies correct CoT probability	Abstract	✅ Follows from the theoretical framework
Outcome-only reward is sufficient to replace process reward for reasoning	Implication of the result	⚠️ Conditional — holds only when the base model satisfies the distinguishability condition
This explains why DeepSeek-R1 and similar models develop coherent reasoning	Contextual interpretation	⚠️ Plausible connection but not directly claimed in the abstract

The theoretical nature of the contribution is both a strength and a limitation. It provides a formal explanation for an empirically observed phenomenon, which is valuable. But theoretical proofs in machine learning often rely on assumptions (e.g., convergence conditions, distribution properties) that may not hold precisely in practice. The gap between the theorem's conditions and real training dynamics is worth scrutinizing.

Open Questions

Threshold for the base model condition. How strong must the base model's pretraining be for the distinguishability condition to hold? Is there a measurable threshold—a perplexity score, a benchmark performance level—below which outcome-only RL will fail to incentivize correct reasoning?

Reasoning faithfulness vs. reasoning correctness. The result addresses whether the probability of correct reasoning increases. It does not directly address whether the model's reasoning traces faithfully represent its internal computation. A model could produce text that looks like correct reasoning while computing the answer through different internal mechanisms.

Domain dependence. Mathematical reasoning has clean verifiability—answers are objectively right or wrong. For domains where answer verification is noisy or ambiguous (scientific reasoning, ethical judgment, open-ended analysis), does the theoretical result extend?

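Training dynamics. Does the implicit incentive for correct reasoning strengthen or weaken as training progresses? GRPO's advantage is computed relative to the group, so a group in which every sampled solution is correct yields zero advantage for every sample and hence no reward-driven update. As the model becomes very good at producing correct answers, the gradient signal distinguishing correct from incorrect reasoning may therefore diminish.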
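One rough way to quantify this concern: under group-relative rewards, a group carries signal only if it contains both correct and incorrect answers. The sketch below assumes each of the `group_size` samples is independently correct with the same probability `p`. That is a simplification, and it measures answer-level signal only, not reasoning quality.

```python
def prob_informative_group(p, group_size):
    """Probability that a group contains both correct and incorrect answers.

    p:          per-sample probability of a correct answer (assumed i.i.d.)
    group_size: number of solutions sampled per problem
    """
    return 1.0 - p ** group_size - (1.0 - p) ** group_size


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99, 1.0):
        print(p, prob_informative_group(p, group_size=8))
```

At a 50% pass rate with eight samples, nearly every group is mixed. At 99%, fewer than one group in ten is. Whether the remaining mixed groups still favor correct reasoning is the open question; this only shows how quickly the raw signal thins out.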

What This Means for Your Research

For practitioners training reasoning models, the RLVR result provides theoretical grounding for the empirically successful approach of using outcome-only rewards. If your base model is sufficiently capable, you may not need the expense of process-level annotation.

For theorists, the result opens a productive line of inquiry: characterizing exactly when and why sparse reward signals produce structured intermediate behavior. This connects to broader questions about implicit regularization in neural network training.

For those concerned about reasoning faithfulness and safety, the result is double-edged. It explains why reasoning emerges, but the mechanism depends on a correlation between correct reasoning and correct answers—a correlation that adversarial problems or distributional shift could break.


References

[1] (2025). Verifiable Rewards Implicitly Incentivize Correct Chain-of-Thought Reasoning. arXiv:2506.14245.
