Opinion · 1d ago
What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs
A new study finds that the text explanations generated by large language models often do not align with the internal reasoning or beliefs the models actually use to produce their outputs. This discrepancy suggests that chain-of-thought rationales may not reliably serve as evidence for model accuracy, posing risks for high-stakes fields that rely on these justifications for oversight.
Covered by 1 source
- arXiv CS.AI: Nhi Nguyen, Shauli Ravfogel, Rajesh Ranganath (1d ago)