10 · Policy · Nov 21

From shortcuts to sabotage: natural emergent misalignment from reward hacking - Anthropic

Covered by 1 source

Related stories

Policy · Anthropic partners with Rwandan Government and ALX to bring AI education to hundreds of thousands of learners across Africa - Anthropic · Nov 18
Policy · Getty Images v Stability AI: A landmark judgment reinforcing the need for the UK government to amend its copyright laws - Wolters Kluwer · Nov 20
Policy · Strengthening our safety ecosystem with external testing · Nov 19
Policy · Mitigating the risk of prompt injections in browser use - Anthropic · Nov 24 · 2 sources