Paper Review · AI & Machine Learning · Reinforcement Learning

After DeepSeek R1: How Reinforcement Learning Is Teaching LLMs to Think Harder

DeepSeek R1 proved that RL can unlock genuine reasoning in LLMs. Now the field is asking harder questions: how to maintain reasoning diversity, how to scale inference compute, and whether RL-trained reasoners actually understand or merely pattern-match.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The release of DeepSeek R1 in January 2025 was an inflection point: not for what it achieved, but for what it demonstrated was achievable. A language model, trained with reinforcement learning to reason through problems step by step before answering, outperformed models with vastly more parameters on mathematical and scientific reasoning benchmarks. The implication was immediate and unsettling for the established order: perhaps the path to better AI reasoning runs not through more data or bigger models, but through better learning algorithms applied to the reasoning process itself.

Six months later, the research community has absorbed this lesson and is pushing beyond it. The questions have shifted from "Can RL improve LLM reasoning?" (yes, definitively) to questions that are harder and more consequential: How do we prevent RL from collapsing reasoning diversity? How much inference-time computation should we allocate? And can we trust what a reasoning model tells us about its own confidence?

The RL Reasoning Advance

Hou et al.'s comprehensive framework provides the most rigorous treatment of RL-enhanced reasoning since DeepSeek R1 itself. Their central contribution is a principled method for inference scaling: allocating additional computation at test time to improve reasoning quality on difficult problems.

The intuition is elegant. Not all questions deserve the same computational effort. A simple factual query requires one forward pass; a complex mathematical proof benefits from generating multiple reasoning chains and selecting the best. Hou et al. formalize this with a learned difficulty estimator that dynamically allocates compute: easy questions get fast, cheap answers; hard questions trigger extended reasoning with multiple candidate solutions.
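To make the allocation policy concrete, here is a minimal sketch of difficulty-aware inference scaling. The callables `estimate_difficulty`, `generate_chain`, and `score_chain` are hypothetical stand-ins for the learned difficulty estimator, a sampling call, and a candidate scorer; Hou et al.'s actual formulation may differ.

```python
from typing import Callable, List

def answer(question: str,
           estimate_difficulty: Callable[[str], float],   # learned estimator, output in [0, 1]
           generate_chain: Callable[[str], str],          # samples one reasoning chain
           score_chain: Callable[[str, str], float],      # scores a candidate chain
           max_samples: int = 16) -> str:
    """Allocate more reasoning chains to harder questions, then pick the best."""
    difficulty = estimate_difficulty(question)
    # Easy questions get a single cheap pass; the hardest get up to max_samples chains.
    n_samples = max(1, round(difficulty * max_samples))
    candidates: List[str] = [generate_chain(question) for _ in range(n_samples)]
    # Select the best candidate under the scorer (e.g., a verifier or majority vote).
    return max(candidates, key=lambda c: score_chain(question, c))
```

The key design choice is that compute scales with estimated difficulty rather than being fixed per query, which is what distinguishes this from plain best-of-n sampling.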

The results validate the approach convincingly: on challenging mathematical benchmarks, inference scaling improves accuracy substantially over fixed-computation baselines, with the majority of improvement concentrated on the hardest problems, precisely where it matters most.

The Diversity Crisis

Yao et al. identify a pathology that cuts deeper than performance metrics. As RL training progresses, models learn to generate fewer distinct reasoning strategies. The policy converges on a narrow set of approaches that maximize reward, abandoning alternative strategies that might succeed on problems where the dominant strategy fails.

This is not merely a theoretical concern. On problems requiring creative or unconventional reasoning (those for which the "obvious" approach fails), diversity-impoverished models perform dramatically worse than models that maintain a repertoire of strategies. The analogy to human cognition is apt: a mathematician who knows only one proof technique will solve many problems efficiently but hit a wall when that technique does not apply.

Their diversity-aware policy optimization adds an explicit diversity bonus to the RL reward, encouraging the model to maintain multiple reasoning approaches even as it optimizes overall quality. The technical challenge is defining "diversity" in the space of reasoning chains; they use embedding-space dispersion as a proxy, rewarding the model when its candidate solutions occupy a broader region of representation space.
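As an illustration, here is a minimal sketch of such a reward shaping, assuming each candidate chain has already been embedded as a vector. The dispersion measure (mean pairwise distance) and the weight `beta` are illustrative assumptions, not Yao et al.'s exact formulation.

```python
import numpy as np

def diversity_bonus(embeddings: np.ndarray) -> float:
    """Mean pairwise Euclidean distance among candidate-chain embeddings."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    dists = [np.linalg.norm(embeddings[i] - embeddings[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

def shaped_reward(task_reward: float, embeddings: np.ndarray, beta: float = 0.1) -> float:
    """Task reward plus a weighted bonus for occupying a broader embedding region."""
    return task_reward + beta * diversity_bonus(embeddings)
```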

The improvement on challenging reasoning benchmarks is significant: notable gains on problems where the majority-vote strategy previously failed. More importantly, the diversity bonus does not degrade performance on problems where a single strategy suffices; it purely adds capability at the frontier of difficulty.

Domain Transfer: When Reasoning Meets Medicine

Tordjman et al.'s benchmark of DeepSeek on medical tasks provides the most consequential test of RL-trained reasoning. Medicine is the ultimate stress test: reasoning errors have life-or-death consequences, the knowledge base is vast and continuously evolving, and clinical reasoning requires integrating heterogeneous evidence types (symptoms, lab values, imaging, patient history) that pure language models have not been trained on.

Their findings are characteristically nuanced. DeepSeek demonstrates strong performance on diagnostic reasoning (generating differential diagnoses and working through clinical scenarios) but shows meaningful limitations in settings requiring specialized or personalized clinical knowledge. Notably, the model's expressed confidence correlates poorly with its actual accuracy.

This last finding connects to Zhang et al.'s graph-based confidence estimation. Their method constructs a graph over the model's reasoning steps, where edges represent logical dependencies, and estimates confidence based on the structural consistency of the reasoning graph rather than the model's self-reported certainty. Early results suggest this approach better discriminates between reliable and unreliable reasoning, but the method has yet to be validated at clinical scale.
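To make the idea concrete, here is a minimal sketch using `networkx`, assuming the reasoning steps and their logical dependencies have already been extracted. Scoring the fraction of steps that support the final conclusion is one plausible consistency proxy; Zhang et al.'s actual graph construction and scoring may differ.

```python
import networkx as nx

def structural_confidence(steps: list[str],
                          dependencies: list[tuple[int, int]]) -> float:
    """Confidence as the fraction of steps with a dependency path to the conclusion."""
    if not steps:
        return 0.0
    g = nx.DiGraph()
    g.add_nodes_from(range(len(steps)))
    g.add_edges_from(dependencies)  # edge (i, j): step j builds on step i
    final = len(steps) - 1          # treat the last step as the conclusion
    # Steps that reach the conclusion actually support it; orphaned steps do not.
    supporting = sum(1 for n in g.nodes if nx.has_path(g, n, final))
    return supporting / len(steps)
```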

Claims and Evidence

| Claim | Evidence | Verdict |
|---|---|---|
| RL significantly improves LLM reasoning over SFT alone | DeepSeek R1 + Hou et al. demonstrate consistent gains | ✅ Strongly supported |
| Inference scaling improves hard-problem performance | Substantial accuracy gains on MATH with dynamic compute allocation | ✅ Supported |
| RL training reduces reasoning diversity | Yao et al. document strategy collapse empirically | ✅ Supported |
| Diversity-aware training recovers lost capability | Notable gains on previously failed problems | ✅ Supported |
| RL-trained reasoners generalize to medical domains | Strong on diagnosis, weak on rare diseases and uncertainty | ⚠️ Partially supported |
| LLMs accurately estimate their own reasoning confidence | Zhang et al. and Tordjman et al. show poor calibration | ❌ Refuted |

Open Questions

  • The verification problem: RL rewards for reasoning are typically based on whether the final answer is correct. But correct answers can arise from wrong reasoning (lucky guesses), and wrong answers can arise from sound reasoning applied to ambiguous premises. How do we reward reasoning quality rather than outcome correctness?
  • Process vs. outcome supervision: Should RL reward each reasoning step individually (process supervision) or only the final answer (outcome supervision)? Process supervision is more informative but requires step-level labels that are expensive to obtain. The optimal balance remains unresolved; a toy contrast of the two schemes appears after this list.
  • Reasoning or retrieval? When an RL-trained model "reasons" through a math problem, is it genuinely performing logical inference, or is it pattern-matching against similar problems seen during training? The distinction matters for generalization to truly novel problems.
  • Scaling laws for reasoning: Do reasoning capabilities follow the same scaling laws as factual knowledge? Or does reasoning ability require qualitatively different scaling, with more RL iterations rather than more parameters or data?
  • Human-AI reasoning collaboration: If LLMs reason through problems step-by-step, producing visible chains of thought, how should human users interact with this reasoning? Should they verify each step? Override intermediate conclusions? The UX of reasoning models is largely unexplored.
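To ground the process-versus-outcome question, the toy contrast below shows the two reward schemes side by side. The step-level verifier `step_is_valid` is a hypothetical callable; obtaining it, whether from human labels or a learned verifier, is precisely what makes process supervision expensive.

```python
from typing import Callable, List

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Reward only the final answer: cheap to compute, but blind to lucky guesses."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_reward(steps: List[str],
                   step_is_valid: Callable[[str], bool]) -> float:
    """Reward each reasoning step: more informative, but needs step-level labels."""
    if not steps:
        return 0.0
    return sum(step_is_valid(s) for s in steps) / len(steps)
```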
What This Means for Your Research

The post-DeepSeek landscape has two immediate implications for researchers working with LLMs.

First, reasoning is now a tunable dimension. Through RL training and inference scaling, you can trade compute for reasoning quality in a principled way. This means that the optimal model for your task may not be the largest one; it may be a smaller model with more sophisticated reasoning training and generous inference-time compute.

Second, do not trust model confidence on reasoning tasks. The calibration failures documented by Tordjman et al. and Zhang et al. are systematic, not anecdotal. If your application depends on knowing how certain the model is about its reasoning (and any consequential application does), you need external calibration mechanisms, not the model's self-report.

The RL reasoning advance is real. But like many advances, it has surfaced new problems alongside the ones it has solved. The field's task now is not to celebrate the breakthrough but to understand its limits, and to build the verification, calibration, and diversity-preservation infrastructure that turns impressive demos into trustworthy tools.

References

[1] Hou, Z., Lv, X., Lu, R. et al. (2025). Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling. arXiv:2501.11651.
[2] Yao, J., Cheng, R., Wu, X. et al. (2025). Diversity-Aware Policy Optimization for Large Language Model Reasoning. arXiv:2505.23433.
[3] Tordjman, M., Liu, Z., Yuce, M. et al. (2025). Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning. Nature Medicine.
[4] Zhang, C., Shu, C., Shareghi, E. (2025). All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning. arXiv:2509.12908.
