Research · Apr 17

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

arXiv:2510.10649v2. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has shown significant promise for enhancing the reasoning capabilities of large language models (LLMs). However, prevailing algorithms like GRPO broadcast a uniform advantage signal across all tokens in a sequence. This coarse-grained approach overlooks the pivotal role of uncertain, high-stakes decisions during reasoning, leading to inefficient exploration and the well-documented problem of entropy collapse. To address this, we introduce UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment by leveraging the model's internal uncertainty signals. UCAS operates in two stages: it first modulates the response-level advantage using a logit-space self-confidence proxy, and then applies an asymmetric token-level penalty based on raw logit certainty. This dual mechanism encourages exploration of high-uncertainty paths that yield correct answers while penalizing overconfident yet erroneous reasoning, effectively balancing the exploration-exploitation trade-off. Extensive experiments on five mathematical reasoning benchmarks show that UCAS significantly…
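The two-stage shaping described in the abstract can be illustrated with a small NumPy sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration rather than the authors' method: the function names (`group_advantages`, `shape_advantages`), the use of mean per-token certainty as the response-level confidence proxy, the `tanh` scaling, and the coefficients `alpha` and `beta`.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style baseline: one standardized advantage per response in a group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def shape_advantages(rewards, token_certainty, alpha=0.5, beta=0.1, eps=1e-6):
    """Hypothetical two-stage shaping (an assumption, not the paper's formulas).

    token_certainty: one 1-D array per response holding a per-token certainty
    score, e.g. the maximum raw logit at each step (assumed proxy).
    """
    adv = group_advantages(rewards, eps)

    # Stage 1: response-level modulation. Standardize the mean certainty
    # across the group, then boost correct-but-uncertain responses and
    # strengthen the penalty for wrong-but-confident ones.
    conf = np.array([np.mean(c) for c in token_certainty])
    z = (conf - conf.mean()) / (conf.std() + eps)
    scale = 1.0 + alpha * np.tanh(-np.sign(adv) * z)
    response_adv = adv * scale

    # Stage 2: token-level penalty. Rescale certainty to [0, 1] over the group
    # and subtract it, so confident tokens earn less credit when the answer is
    # right and more blame when it is wrong.
    all_c = np.concatenate([np.asarray(c, dtype=float) for c in token_certainty])
    lo, hi = all_c.min(), all_c.max()
    token_adv = []
    for a, c in zip(response_adv, token_certainty):
        c01 = (np.asarray(c, dtype=float) - lo) / (hi - lo + eps)
        token_adv.append(a - beta * c01)
    return response_adv, token_adv

# Toy group: two correct and two incorrect responses with made-up certainties.
rewards = [1.0, 1.0, 0.0, 0.0]
cert = [np.array([2.0, 3.0]), np.array([8.0, 9.0, 10.0]),
        np.array([1.0, 2.0]), np.array([9.0, 9.5])]
resp, tok = shape_advantages(rewards, cert)
```

In this toy group, the correct but uncertain first response ends up with a larger advantage than the correct and confident second one, while the confident wrong answer is penalized more than the uncertain wrong one. Unlike the uniform GRPO signal, the per-token arrays in `tok` vary within a response.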

Covered by 2 sources

  • arXiv CS.AI · Can Xie, Ruotong Pan, Xiangyu Wu, Yunfei Zhang, Jiayi Fu, Tingting Gao, Guorui Zhou · Apr 17
  • arXiv CS.AI · Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Ge Lan, Yue Wang · Apr 17

Related stories

  • Research · MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining · Apr 16 · 2 sources
  • Research · Automated Alignment Researchers: Using large language models to scale scalable oversight - Anthropic · Apr 14 · 2 sources
  • Research · AI as scientist? Machine-written papers clear academic reviews, raise questions - MSN · Apr 13 · 2 sources
  • Research · Nvidia wants to scale robot simulation training with Lyra 2.0 · Apr 16