Research · Mar 30

Entropy-Preserving Reinforcement Learning

Policy gradient algorithms have driven many recent advancements in language model reasoning. An appealing property is their ability to learn from exploration on their own trajectories, a process crucial for fostering diverse and creative solutions. As we show in this paper, many policy gradient algorithms naturally reduce the entropy—and thus the diversity of explored trajectories—as part of training, yielding a policy increasingly limited in its ability to explore. In this paper, we argue that entropy should be actively monitored and controlled throughout training. We formally analyze the…
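To make "monitoring entropy" concrete, here is a minimal sketch of how the average per-token entropy of a policy could be tracked over a sampled trajectory. It is not the paper's method: the function names (`softmax`, `token_entropy`, `mean_entropy`) and the toy logits are assumptions made for illustration, and a real training loop would compute this on batched tensors instead of Python lists.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at one step."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(trajectory_logits):
    """Average per-token entropy over all steps of one trajectory."""
    entropies = [token_entropy(step) for step in trajectory_logits]
    return sum(entropies) / len(entropies)

# Hypothetical toy data: a 4-token vocabulary, 3 generation steps.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3   # maximally exploratory policy
peaked = [[10.0, 0.0, 0.0, 0.0]] * 3   # nearly deterministic policy

print(f"uniform policy entropy: {mean_entropy(uniform):.4f} nats")
print(f"peaked policy entropy:  {mean_entropy(peaked):.4f} nats")
```

Logging a quantity like this at each training step would show whether the policy is drifting toward the peaked case, which is the loss of exploration the abstract describes.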

