SiLVR: A Simple Language-based Video Reasoning Framework
arXiv:2505.24869v3 Abstract: Recent advances in test-time optimization have led to remarkable reasoning capabilities in Large Language Models (LLMs), enabling them to solve highly complex problems in math and coding. However, the reasoning capabilities of multimodal LLMs (MLLMs) still lag significantly, especially for complex video-language tasks. To address this issue, we present SiLVR, a Simple Language-based Video Reasoning framework that decomposes complex video understanding into two stages. In the first stage, SiLVR transforms raw video into language-based representations using multisensory inputs, such as short clip captions and audio/speech subtitles. In the second stage, the language descriptions are fed into a powerful reasoning LLM to solve complex video-language understanding tasks. To handle long-context multisensory inputs, we use an Adaptive Context Reduction scheme, which dynamically determines the temporal granularity with which to sample the tokens. Our simple, modular, and training-free video reasoning framework achieves the best-reported results on Video-MME (long), Video-MMMU…
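The abstract does not spell out how the Adaptive Context Reduction scheme picks its temporal granularity. One plausible reading, offered here only as a minimal sketch and not as the authors' implementation, is to caption the video at the finest clip length first and coarsen it until captions plus subtitles fit a token budget. The names `adaptive_context_reduction`, `caption_clip`, `token_budget`, and `clip_lengths` are hypothetical, and the whitespace token count stands in for a real tokenizer.

```python
from typing import Callable, List, Sequence, Tuple


def count_tokens(text: str) -> int:
    # Crude whitespace count; a real system would use the reasoning LLM's tokenizer.
    return len(text.split())


def adaptive_context_reduction(
    video_seconds: float,
    caption_clip: Callable[[float, float], str],  # hypothetical stage-one captioner
    subtitles: str,
    token_budget: int,
    clip_lengths: Sequence[float] = (1, 2, 4, 8, 16, 32, 64),
) -> Tuple[float, List[str]]:
    """Return the first (finest) clip length whose captions fit the budget.

    Assumption: coarser clips mean fewer captions, hence fewer tokens.
    """
    captions: List[str] = []
    clip_len = clip_lengths[0]
    for clip_len in clip_lengths:
        captions = []
        start = 0.0
        while start < video_seconds:
            end = min(start + clip_len, video_seconds)
            captions.append(f"[{start:.0f}s-{end:.0f}s] {caption_clip(start, end)}")
            start = end
        total = count_tokens(subtitles) + sum(count_tokens(c) for c in captions)
        if total <= token_budget:
            return clip_len, captions
    # Nothing fit: fall back to the coarsest granularity tried.
    return clip_len, captions


def build_prompt(captions: List[str], subtitles: str, question: str) -> str:
    # Stage two input: plain-language video description plus the question.
    return "\n".join(
        ["Clip captions:", *captions, "Subtitles:", subtitles, "Question:", question]
    )
```

With a stub captioner, a 64-second video and a 70-token budget, the loop rejects 1-second and 2-second clips and settles on 4-second clips. The resulting prompt from `build_prompt` would then be passed to whichever reasoning LLM is used in the second stage.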
Covered by 3 sources
- arXiv CS.AI: Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, Gedas Bertasius (Apr 16)
- arXiv CS.AI: Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye (Apr 16)
- arXiv CS.AI: Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria (Apr 17)