Paper Review · AI & Machine Learning · Reinforcement Learning

After DeepSeek R1: How Reinforcement Learning Is Teaching LLMs to Think Harder

DeepSeek R1 proved that RL can unlock genuine reasoning in LLMs. Now the field is asking harder questions: how to maintain reasoning diversity, how to scale inference compute, and whether RL-trained reasoners actually understand or merely pattern-match.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The release of DeepSeek R1 in January 2025 was an inflection point: not for what it achieved, but for what it demonstrated was achievable. A language model, trained with reinforcement learning to reason through problems step by step before answering, outperformed models with vastly more parameters on mathematical and scientific reasoning benchmarks. The implication was immediate and unsettling for the established order: perhaps the path to better AI reasoning runs not through more data or bigger models, but through better learning algorithms applied to the reasoning process itself.

Six months later, the research community has absorbed this lesson and is pushing beyond it. The questions have shifted from "Can RL improve LLM reasoning?" (yes, definitively) to questions that are harder and more consequential: How do we prevent RL from collapsing reasoning diversity? How much inference-time computation should we allocate? And can we trust what a reasoning model tells us about its own confidence?

The RL Reasoning Advance

Hou et al.'s comprehensive framework provides the most rigorous treatment of RL-enhanced reasoning since DeepSeek R1 itself. Their central contribution is a principled method for inference scaling: allocating additional computation at test time to improve reasoning quality on difficult problems.

The intuition is elegant. Not all questions deserve the same computational effort. A simple factual query requires one forward pass; a complex mathematical proof benefits from generating multiple reasoning chains and selecting the best. Hou et al. formalize this with a learned difficulty estimator that dynamically allocates compute: easy questions get fast, cheap answers; hard questions trigger extended reasoning with multiple candidate solutions.
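To make the allocation policy concrete, here is a minimal sketch of difficulty-aware inference scaling. The callables `estimate_difficulty`, `generate_chain`, and `score_chain` are hypothetical stand-ins for the learned difficulty estimator, a sampling call, and a candidate scorer; Hou et al.'s actual formulation may differ.

```python
from typing import Callable, List

def answer(question: str,
           estimate_difficulty: Callable[[str], float],   # learned estimator, output in [0, 1]
           generate_chain: Callable[[str], str],          # samples one reasoning chain
           score_chain: Callable[[str, str], float],      # scores a candidate chain
           max_samples: int = 16) -> str:
    """Allocate more reasoning chains to harder questions, then pick the best."""
    difficulty = estimate_difficulty(question)
    # Easy questions get a single cheap pass; the hardest get up to max_samples chains.
    n_samples = max(1, round(difficulty * max_samples))
    candidates: List[str] = [generate_chain(question) for _ in range(n_samples)]
    # Select the best candidate under the scorer (e.g., a verifier or majority vote).
    return max(candidates, key=lambda c: score_chain(question, c))
```

The key design choice is that compute scales with estimated difficulty rather than being fixed per query, which is what distinguishes this from plain best-of-n sampling.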

The results validate the approach convincingly: on challenging mathematical benchmarks, inference scaling improves accuracy substantially over fixed-computation baselines, with the majority of improvement concentrated on the hardest problems, precisely where it matters most.

The Diversity Crisis

Yao et al. identify a pathology that cuts deeper than performance metrics. As RL training progresses, models learn to generate fewer distinct reasoning strategies. The policy converges on a narrow set of approaches that maximize reward, abandoning alternative strategies that might succeed on problems where the dominant strategy fails.

This is not merely a theoretical concern. On problems requiring creative or unconventional reasoning (those for which the "obvious" approach fails), diversity-impoverished models perform dramatically worse than models that maintain a repertoire of strategies. The analogy to human cognition is apt: a mathematician who knows only one proof technique will solve many problems efficiently but hit a wall when that technique does not apply.

Their diversity-aware policy optimization adds an explicit diversity bonus to the RL reward, encouraging the model to maintain multiple reasoning approaches even as it optimizes overall quality. The technical challenge is defining "diversity" in the space of reasoning chains; they use embedding-space dispersion as a proxy, rewarding the model when its candidate solutions occupy a broader region of representation space.
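As an illustration, here is a minimal sketch of such a reward shaping, assuming each candidate chain has already been embedded as a vector. The dispersion measure (mean pairwise distance) and the weight `beta` are illustrative assumptions, not Yao et al.'s exact formulation.

```python
import numpy as np

def diversity_bonus(embeddings: np.ndarray) -> float:
    """Mean pairwise Euclidean distance among candidate-chain embeddings."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    dists = [np.linalg.norm(embeddings[i] - embeddings[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

def shaped_reward(task_reward: float, embeddings: np.ndarray, beta: float = 0.1) -> float:
    """Task reward plus a weighted bonus for occupying a broader embedding region."""
    return task_reward + beta * diversity_bonus(embeddings)
```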

The improvement on challenging reasoning benchmarks is significant: notable gains on problems where the majority-vote strategy previously failed. More importantly, the diversity bonus does not degrade performance on problems where a single strategy suffices; it purely adds capability at the frontier of difficulty.

Domain Transfer: When Reasoning Meets Medicine

Tordjman et al.'s benchmark of DeepSeek on medical tasks provides the most consequential test of RL-trained reasoning. Medicine is the ultimate stress test: reasoning errors have life-or-death consequences, the knowledge base is vast and continuously evolving, and clinical reasoning requires integrating heterogeneous evidence types (symptoms, lab values, imaging, patient history) that pure language models have not been trained on.

Their findings are characteristically nuanced. DeepSeek demonstrates strong performance on diagnostic reasoning (generating differential diagnoses and working through clinical scenarios) but shows meaningful limitations in settings requiring specialized or personalized clinical knowledge. Notably, the model's expressed confidence correlates poorly with its actual accuracy.

This last finding connects to Zhang et al.'s graph-based confidence estimation. Their method constructs a graph over the model's reasoning steps, where edges represent logical dependencies, and estimates confidence based on the structural consistency of the reasoning graph rather than the model's self-reported certainty. Early results suggest this approach better discriminates between reliable and unreliable reasoning, but the method has yet to be validated at clinical scale.
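To make the idea concrete, here is a minimal sketch using `networkx`, assuming the reasoning steps and their logical dependencies have already been extracted. Scoring the fraction of steps that support the final conclusion is one plausible consistency proxy; Zhang et al.'s actual graph construction and scoring may differ.

```python
import networkx as nx

def structural_confidence(steps: list[str],
                          dependencies: list[tuple[int, int]]) -> float:
    """Confidence as the fraction of steps with a dependency path to the conclusion."""
    if not steps:
        return 0.0
    g = nx.DiGraph()
    g.add_nodes_from(range(len(steps)))
    g.add_edges_from(dependencies)  # edge (i, j): step j builds on step i
    final = len(steps) - 1          # treat the last step as the conclusion
    # Steps that reach the conclusion actually support it; orphaned steps do not.
    supporting = sum(1 for n in g.nodes if nx.has_path(g, n, final))
    return supporting / len(steps)
```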

Claims and Evidence

| Claim | Evidence | Verdict |
|---|---|---|
| RL significantly improves LLM reasoning over SFT alone | DeepSeek R1 + Hou et al. demonstrate consistent gains | ✅ Strongly supported |
| Inference scaling improves hard-problem performance | Substantial accuracy gains on MATH with dynamic compute allocation | ✅ Supported |
| RL training reduces reasoning diversity | Yao et al. document strategy collapse empirically | ✅ Supported |
| Diversity-aware training recovers lost capability | Notable gains on previously failed problems | ✅ Supported |
| RL-trained reasoners generalize to medical domains | Strong on diagnosis, weak on rare diseases and uncertainty | ⚠️ Partially supported |
| LLMs accurately estimate their own reasoning confidence | Zhang et al. and Tordjman et al. show poor calibration | ❌ Refuted |

Open Questions

  • The verification problem: RL rewards for reasoning are typically based on whether the final answer is correct. But correct answers can arise from wrong reasoning (lucky guesses), and wrong answers can arise from sound reasoning applied to ambiguous premises. How do we reward reasoning quality rather than outcome correctness?
  • Process vs. outcome supervision: Should RL reward each reasoning step individually (process supervision) or only the final answer (outcome supervision)? Process supervision is more informative but requires step-level labels that are expensive to obtain. The optimal balance remains unresolved; a toy contrast of the two schemes appears after this list.
  • Reasoning or retrieval? When an RL-trained model "reasons" through a math problem, is it genuinely performing logical inference, or is it pattern-matching against similar problems seen during training? The distinction matters for generalization to truly novel problems.
  • Scaling laws for reasoning: Do reasoning capabilities follow the same scaling laws as factual knowledge? Or does reasoning ability require qualitatively different scaling, with more RL iterations rather than more parameters or data?
  • Human-AI reasoning collaboration: If LLMs reason through problems step-by-step, producing visible chains of thought, how should human users interact with this reasoning? Should they verify each step? Override intermediate conclusions? The UX of reasoning models is largely unexplored.
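To ground the process-versus-outcome question, the toy contrast below shows the two reward schemes side by side. The step-level verifier `step_is_valid` is a hypothetical callable; obtaining it, whether from human labels or a learned verifier, is precisely what makes process supervision expensive.

```python
from typing import Callable, List

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Reward only the final answer: cheap to compute, but blind to lucky guesses."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_reward(steps: List[str],
                   step_is_valid: Callable[[str], bool]) -> float:
    """Reward each reasoning step: more informative, but needs step-level labels."""
    if not steps:
        return 0.0
    return sum(step_is_valid(s) for s in steps) / len(steps)
```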
What This Means for Your Research

The post-DeepSeek landscape has two immediate implications for researchers working with LLMs.

First, reasoning is now a tunable dimension. Through RL training and inference scaling, you can trade compute for reasoning quality in a principled way. This means that the optimal model for your task may not be the largest one; it may be a smaller model with more sophisticated reasoning training and generous inference-time compute.

Second, do not trust model confidence on reasoning tasks. The calibration failures documented by Tordjman et al. and Zhang et al. are systematic, not anecdotal. If your application depends on knowing how certain the model is about its reasoning (and any consequential application does), you need external calibration mechanisms, not the model's self-report.

The RL reasoning advance is real. But like many advances, it has surfaced new problems alongside the ones it has solved. The field's task now is not to celebrate the breakthrough but to understand its limits, and to build the verification, calibration, and diversity-preservation infrastructure that turns impressive demos into trustworthy tools.

References

[1] Hou, Z., Lv, X., Lu, R. et al. (2025). Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling. arXiv:2501.11651.
[2] Yao, J., Cheng, R., Wu, X. et al. (2025). Diversity-Aware Policy Optimization for Large Language Model Reasoning. arXiv:2505.23433.
[3] Tordjman, M., Liu, Z., Yuce, M. et al. (2025). Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nature Medicine.
[4] Zhang, C., Shu, C., Shareghi, E. (2025). All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning. arXiv:2509.12908.
