Policy · Apr 16

HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

arXiv:2603.15620v2 · Announce Type: replace

Abstract: Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover,…

Covered by 2 sources

  • arXiv CS.AI · Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai · Apr 16
  • arXiv CS.AI · Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin · Apr 16

Related stories

  • Policy · Making AI operational in constrained public sector environments · Apr 16
  • Policy · The Specification Trap: Why Static Value Alignment Alone Is Insufficient for Robust Alignment · Apr 17
  • Policy · Google Told to Share Search Data With AI Rivals in EU Proposal · Apr 16
  • Policy · AutoRAN: Automated Hijacking of Safety Reasoning in Large Reasoning Models · Apr 17