Research · Apr 16

SiLVR: A Simple Language-based Video Reasoning Framework

arXiv:2505.24869v3. Abstract: Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still significantly lag, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an Adaptive Context Reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU…
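
The abstract does not spell out how Adaptive Context Reduction picks its granularity, so the following is only a minimal sketch of one plausible reading: coarsen the caption sampling interval step by step until the captions plus subtitles fit a token budget. Every name here (`Caption`, `count_tokens`, `sample_captions`, `build_context`, `reduce_context`), the whitespace token count, and the doubling schedule of clip lengths are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Caption:
    start: float  # clip start time in seconds
    text: str     # caption text from some stage-1 captioner (hypothetical)


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real LLM tokenizer.
    return len(text.split())


def sample_captions(captions: Sequence[Caption], clip_len: float) -> List[Caption]:
    # Keep at most one caption per clip_len-second window, in time order.
    kept: List[Caption] = []
    next_start = float("-inf")
    for cap in sorted(captions, key=lambda c: c.start):
        if cap.start >= next_start:
            kept.append(cap)
            next_start = cap.start + clip_len
    return kept


def build_context(captions: Sequence[Caption], subtitles: str) -> str:
    # Language-only representation of the video: timestamped captions,
    # followed by the speech subtitles.
    lines = [f"[{c.start:.0f}s] {c.text}" for c in captions]
    return "\n".join(lines + [subtitles])


def reduce_context(
    captions: Sequence[Caption],
    subtitles: str,
    budget: int,
    clip_lens: Sequence[float] = (1, 2, 4, 8, 16, 32, 64),  # assumed schedule
) -> Tuple[float, str]:
    # Try the finest granularity first and coarsen until the context fits.
    # If nothing fits, the coarsest setting is returned as a fallback.
    if not clip_lens:
        raise ValueError("clip_lens must not be empty")
    for clip_len in clip_lens:
        context = build_context(sample_captions(captions, clip_len), subtitles)
        if count_tokens(context) <= budget:
            break
    return clip_len, context


# Toy input: ten one-second clips, each with a three-word caption.
captions = [Caption(start=float(t), text="a b c") for t in range(10)]
clip_len, context = reduce_context(captions, "hello world", budget=15)
```

In this toy run the 1-second and 2-second settings exceed the 15-token budget, so the sketch settles on 4-second sampling. The resulting `context` string is what would be handed to the reasoning LLM in the second stage.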

Covered by 3 sources

  • arXiv CS.AI · Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, Gedas Bertasius · Apr 16
  • arXiv CS.AI · Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye · Apr 16
  • arXiv CS.AI · Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria · Apr 17

Related stories

  • Research · MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining · Apr 16 · 2 sources
  • Research · Automated Alignment Researchers: Using large language models to scale scalable oversight - Anthropic · Apr 14 · 2 sources
  • Research · AI as scientist? Machine-written papers clear academic reviews, raise questions - MSN · Apr 13 · 2 sources
  • Research · Nvidia wants to scale robot simulation training with Lyra 2.0 · Apr 16