DeepSeek-R1: When Reinforcement Learning Alone Produces Emergent Reasoning

The standard recipe for building a reasoning LLM involves supervised fine-tuning on curated chain-of-thought data before applying reinforcement learning. DeepSeek-R1 asks: what if you skip the supervised step entirely? The answer—that self-reflection, verification, and dynamic strategy adaptation emerge spontaneously from RL alone—challenges assumptions about how reasoning develops in language models.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The dominant approach to building reasoning-capable language models follows a two-stage pipeline: first, supervised fine-tuning (SFT) on curated chain-of-thought demonstrations teaches the model what reasoning looks like; then, reinforcement learning from human feedback (RLHF) or similar methods refine and align the behavior. The supervised stage is considered essential—without explicit examples of step-by-step reasoning, the model presumably cannot learn to reason.

DeepSeek-R1, published by DeepSeek-AI (2025) and subsequently appearing in Nature (vol. 645, pp. 633–638), challenges this assumption directly. The central claim is that reinforcement learning alone—without any supervised fine-tuning on reasoning demonstrations—can incentivize reasoning capability in large language models. The behaviors that emerge include self-reflection, verification of intermediate steps, and dynamic adaptation of problem-solving strategies.

If this holds, it suggests that reasoning may not need to be taught through imitation. It may instead be a latent capability that the right optimization pressure can surface.

The Research Landscape

The question of how reasoning arises in language models sits at the intersection of several active research threads. One thread concerns chain-of-thought prompting: the observation, dating to Wei et al. (2022), that prompting a model with worked step-by-step examples substantially improves performance on reasoning tasks. Kojima et al. (2022) showed that simply appending "Let's think step by step" has a similar effect. Both results demonstrated that reasoning-like behavior could be elicited without additional training, but only for models of sufficient scale.
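The two common prompting styles can be sketched as plain string templates. This is a minimal illustration only: the exemplar text and the function names are hypothetical, and no model is called.

```python
# Hypothetical prompt builders; the exemplar and names are illustrative only.

FEW_SHOT_EXEMPLAR = (
    "Q: A shelf holds 3 boxes with 4 books each. How many books are there?\n"
    "A: Each box has 4 books and there are 3 boxes, so 3 * 4 = 12. "
    "The answer is 12.\n"
)


def few_shot_cot_prompt(question: str) -> str:
    """Prepend a worked example so the model can imitate its step-by-step style."""
    return f"{FEW_SHOT_EXEMPLAR}\nQ: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """Append a generic reasoning cue instead of a worked example."""
    return f"Q: {question}\nA: Let's think step by step."


question = "A train travels 60 km in 1.5 hours. What is its average speed?"
print(few_shot_cot_prompt(question))
print(zero_shot_cot_prompt(question))
```

Neither template changes the model's weights; any improvement comes entirely from how the question is framed.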

A second thread concerns process reward models: training separate models to evaluate individual reasoning steps rather than only final answers. This approach, explored by Lightman et al. (2023) and others, provides denser training signal but requires expensive step-level annotations.
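The difference in signal density can be made concrete with a toy trace. The sketch below assumes a hypothetical four-step solution with hand-written step labels; it does not reproduce any published reward model.

```python
from typing import List


def outcome_reward(final_answer_correct: bool) -> float:
    """One scalar for the whole trace, based only on the final answer."""
    return 1.0 if final_answer_correct else 0.0


def process_rewards(step_labels: List[bool]) -> List[float]:
    """One scalar per step; this requires a correctness label for every step."""
    return [1.0 if ok else 0.0 for ok in step_labels]


# Toy trace: the third step introduces an error that carries into the fourth.
step_labels = [True, True, False, False]
final_answer_correct = False

print(outcome_reward(final_answer_correct))  # 0.0
print(process_rewards(step_labels))          # [1.0, 1.0, 0.0, 0.0]
```

The per-step signal shows where the trace went wrong, while the outcome signal only says that it did. That extra information is what the step-level annotations pay for.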

DeepSeek-R1 takes a different path. Rather than providing demonstrations of reasoning (SFT) or step-level feedback (process rewards), it applies reinforcement learning with outcome-based rewards directly to a base model. The paper calls this pure-RL variant DeepSeek-R1-Zero; the released DeepSeek-R1 additionally uses a small amount of cold-start fine-tuning data and a multi-stage training pipeline. In the pure-RL setting the reward depends on whether the final answer is correct, plus a simple check on output format, and carries no information about how to reach the answer. The question is whether so sparse a signal is sufficient to produce structured reasoning behavior.
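An outcome-based reward of this kind can be sketched in a few lines. The example assumes a hypothetical "Final answer:" marker and exact string matching; the paper's actual answer format and checking rules are not reproduced here.

```python
import re
from typing import Optional

# Assumed output convention: the completion ends with "Final answer: <value>".
ANSWER_PATTERN = re.compile(r"Final answer:\s*(.+)", re.IGNORECASE)


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last stated final answer, or None if no marker is present."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1].strip() if matches else None


def answer_match_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer equals the reference, otherwise 0.0.

    The reasoning text before the marker never affects the score.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0


trace = "Try x = 3. Wait, let me reconsider.\nFinal answer: 4"
print(answer_match_reward(trace, "4"))                  # 1.0
print(answer_match_reward("Final answer: 5", "4"))      # 0.0
print(answer_match_reward("I think it is 4.", "4"))     # 0.0
```

Because the score ignores everything before the marker, any reasoning structure the model develops has to be discovered through trial and error rather than rewarded directly.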

What Emerges from RL Alone

According to the paper's abstract, three specific behaviors emerge spontaneously through RL training without being explicitly taught:

Self-reflection. The model begins to question and reconsider its own intermediate conclusions. Rather than generating a linear chain of reasoning, it produces traces that include statements like "wait, let me reconsider" or "this approach may not work because..." These self-corrective patterns were not present in the base model and were not demonstrated through supervised examples—they developed as the model learned that self-correction improved its probability of reaching correct final answers.
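One crude way to look for this pattern in generated traces is to count reflective phrases. The marker list below is an assumption chosen for illustration, not a list taken from the paper, and substring counting will over-count (for example, "wait" inside "await").

```python
from typing import Iterable

# Hypothetical marker phrases; a real analysis would need a validated list.
REFLECTION_MARKERS = (
    "wait",
    "let me reconsider",
    "let me double-check",
    "on second thought",
)


def count_reflection_markers(
    trace: str, markers: Iterable[str] = REFLECTION_MARKERS
) -> int:
    """Count case-insensitive occurrences of reflective phrases in a trace."""
    text = trace.lower()
    return sum(text.count(marker) for marker in markers)


trace = "x = 3. Wait, let me reconsider. On second thought, x = 4."
print(count_reflection_markers(trace))               # 3
print(count_reflection_markers("The answer is 4."))  # 0
```

Averaging such a count over sampled traces at successive training checkpoints would give a rough curve of how often the behavior appears, though it says nothing about whether the reflection is useful.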

Verification. The model develops behaviors that check intermediate results before proceeding. In mathematical reasoning, for instance, it may re-derive a partial result or substitute values back into an equation to confirm correctness. This verification behavior functions as an internal process reward—the model learns to evaluate its own steps without an external process reward model.
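Substituting values back into an equation is easy to show in code. The sketch below is an analogy for the behavior described, using a hypothetical quadratic; it is not a description of what happens inside the model.

```python
import math
from typing import List


def solve_quadratic(a: float, b: float, c: float) -> List[float]:
    """Real roots of a*x^2 + b*x + c = 0 (a != 0), smallest first."""
    discriminant = b * b - 4 * a * c
    if discriminant < 0:
        return []
    root = math.sqrt(discriminant)
    return [(-b - root) / (2 * a), (-b + root) / (2 * a)]


def verify_by_substitution(
    a: float, b: float, c: float, x: float, tol: float = 1e-9
) -> bool:
    """Plug x back into the equation and check that the residual is near zero."""
    return abs(a * x * x + b * x + c) <= tol


# x^2 - 5x + 6 = 0
candidates = solve_quadratic(1, -5, 6)
verified = [x for x in candidates if verify_by_substitution(1, -5, 6, x)]
print(verified)                                 # [2.0, 3.0]
print(verify_by_substitution(1, -5, 6, 4.0))    # False
```

The check is independent of how the candidate was produced, which is what lets it catch an error made earlier in the derivation.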

Dynamic strategy adaptation. Rather than committing to a single problem-solving approach, the model learns to switch strategies when an initial approach appears unproductive. This flexibility—trying algebraic manipulation, switching to geometric reasoning, falling back to enumeration—emerges from the RL training signal that rewards correct outcomes regardless of the method used.
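Strategy switching can be pictured as trying solvers in order and accepting the first answer that passes a check. The following sketch is an analogy under that assumption; the strategy functions and the integer-root task are hypothetical and are not drawn from the paper.

```python
from typing import Callable, Optional, Sequence, Tuple

Coeffs = Sequence[int]  # polynomial coefficients, highest degree first


def evaluate(coeffs: Coeffs, x: int) -> int:
    """Evaluate the polynomial at x using Horner's rule."""
    result = 0
    for c in coeffs:
        result = result * x + c
    return result


def linear_formula(coeffs: Coeffs) -> Optional[int]:
    """Closed form, applicable only to a*x + b = 0 with an integer solution."""
    if len(coeffs) != 2 or coeffs[0] == 0:
        return None
    a, b = coeffs
    return -b // a if b % a == 0 else None


def enumeration(coeffs: Coeffs, limit: int = 100) -> Optional[int]:
    """Fallback: scan integers in [-limit, limit] for a root."""
    for x in range(-limit, limit + 1):
        if evaluate(coeffs, x) == 0:
            return x
    return None


def solve_with_fallback(
    coeffs: Coeffs, strategies: Sequence[Callable[[Coeffs], Optional[int]]]
) -> Optional[Tuple[str, int]]:
    """Try each strategy in turn; return the first answer that checks out."""
    for strategy in strategies:
        answer = strategy(coeffs)
        if answer is not None and evaluate(coeffs, answer) == 0:
            return strategy.__name__, answer
    return None


strategies = [linear_formula, enumeration]
print(solve_with_fallback([2, -8], strategies))     # ('linear_formula', 4)
print(solve_with_fallback([1, -5, 6], strategies))  # ('enumeration', 2)
print(solve_with_fallback([1, 0, 1], strategies))   # None
```

Only the verified result matters to the caller, which mirrors a reward that scores correct outcomes regardless of the method used.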

The paper reports that the resulting model achieves what it describes as frontier-level reasoning performance—competitive with models that used the standard SFT-then-RL pipeline.

Critical Analysis: Claims and Evidence

Claim	Source	Verdict
RL alone, without SFT, can incentivize reasoning capability in LLMs	DeepSeek-AI (2025), abstract	✅ Supported by reported results; notable given peer-reviewed publication in Nature
Self-reflection emerges spontaneously through RL training	DeepSeek-AI (2025), abstract	✅ Reported as observed behavior; mechanism plausible given reward structure
Verification behavior develops without explicit process rewards	DeepSeek-AI (2025), abstract	✅ Reported; represents a form of learned internal reward
Dynamic strategy adaptation arises from outcome-based RL	DeepSeek-AI (2025), abstract	✅ Reported; consistent with RL theory on exploration under sparse reward
Model achieves frontier-level reasoning performance	DeepSeek-AI (2025), abstract	⚠️ Claimed but "frontier-level" depends on benchmark selection and comparison set

Several aspects warrant careful consideration. First, the claim that these behaviors "emerge" carries significant theoretical weight. Emergence implies that the behaviors are not straightforwardly predicted from the training signal—that outcome-based reward alone does not obviously lead to self-reflection. Whether this constitutes true emergence or is a predictable consequence of optimizing for correctness in a sufficiently capable model is an open theoretical question.

Second, reproducibility and attribution both deserve scrutiny. DeepSeek-R1's training infrastructure is substantial, which makes independent replication costly. In addition, the base model from which RL training begins already possesses considerable knowledge and linguistic capability from pretraining. The RL-only claim means no reasoning-specific supervised fine-tuning, but pretraining on internet text inevitably includes exposure to examples of step-by-step reasoning. The RL signal may therefore be surfacing patterns already latent in the pretrained weights rather than creating genuinely novel reasoning capability.

Third, the practical relevance is substantial regardless of the theoretical interpretation. If outcome-based RL can replace or reduce the need for curated reasoning demonstrations, it removes a significant bottleneck in building reasoning models: the expensive, labor-intensive process of creating high-quality chain-of-thought training data.

Open Questions

Scale dependence. Does RL-only reasoning emergence require a base model above a certain capability threshold? If so, the approach may not generalize to smaller models, limiting its practical impact for resource-constrained settings.

Reasoning faithfulness. The model produces reasoning traces that correlate with correct answers, but are these traces faithful representations of the model's actual computation? Or are they post-hoc rationalizations that happen to accompany correct outputs?

Domain transfer. The emergent reasoning behaviors are demonstrated primarily on mathematical and logical reasoning tasks. Whether similar emergence occurs for scientific reasoning, causal inference, or common-sense reasoning remains to be established.

Training stability. RL training is notoriously unstable. How sensitive are these emergent behaviors to hyperparameter choices, reward design, and training duration? A behavior that emerges only under narrow training conditions is less theoretically interesting than one that emerges robustly.

Interaction with SFT. If SFT is applied after RL-only training, does it improve, degrade, or leave unchanged the emergent reasoning behaviors? Understanding this interaction could inform optimal training pipelines.

What This Means for Your Research

For researchers building reasoning systems, DeepSeek-R1 suggests that the investment in curated chain-of-thought datasets may be partially replaceable by RL training, a potentially significant reduction in data preparation cost. However, if the approach depends on a highly capable base model, as the scale-dependence question above suggests it might, it is unlikely to be a shortcut for smaller-scale projects.

For those studying emergence in neural networks, the reported spontaneous development of self-reflection and verification provides a concrete case study. Whether this constitutes genuine emergence or sophisticated pattern matching from pretraining remains a productive research question.

References (1)

[1] DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature, 645, 633–638. / arXiv:2501.12948.