Opinion · Apr 16

Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

arXiv:2604.13403v1

Abstract: In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood in terms of its internal mechanisms and how it differs from text-only ICL. In this work, we conduct a systematic analysis of ICL in multimodal large language models. Using identical task formulations across modalities, we show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. To understand this gap, we decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings, and transfer them to query samples across layers. Our analysis reveals that current models lack reasoning-level alignment between visual and textual representations, and fail to reliably transfer learned task mappings to queries. Guided by these findings, we further…
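For readers unfamiliar with the zero-shot versus few-shot distinction the abstract draws, the following is a minimal sketch of how an ICL prompt might be assembled from demonstrations. It is not the paper's code: the `Demo` class, the `build_prompt` function, the `Input:`/`Output:` template, and the `<image:...>` placeholder token are all hypothetical names chosen for illustration, and real multimodal models each use their own image-token format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Demo:
    """One example: input text, its label, and an optional image reference."""
    text: str
    label: str
    image: Optional[str] = None  # hypothetical image path; None means text-only


def build_prompt(demos: List[Demo], query: Demo) -> str:
    """Join k labeled demonstrations with an unlabeled query.

    An empty `demos` list yields a zero-shot prompt; k > 0 yields a k-shot prompt.
    """
    def render(d: Demo, with_label: bool) -> str:
        # "<image:...>" is an illustrative placeholder, not a real model token.
        image_part = f"<image:{d.image}> " if d.image else ""
        answer = f" {d.label}" if with_label else ""
        return f"{image_part}Input: {d.text}\nOutput:{answer}"

    blocks = [render(d, True) for d in demos]
    blocks.append(render(query, False))  # the query is left for the model to answer
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demos = [Demo("How many objects?", "3", image="demo1.png")]
    query = Demo("How many objects?", "", image="query.png")
    print(build_prompt([], query))     # zero-shot
    print(build_prompt(demos, query))  # one-shot, multimodal
```

In this framing, the same task can be posed with text-only demonstrations (no `image` field) or with image-bearing ones, which is one way to read the abstract's "identical task formulations across modalities".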

Covered by 3 sources

  • arXiv CS.AI · Yu Wang, Sharon Li · Apr 16
  • arXiv CS.AI · Hongjian Zou, Yue Ge, Qi Ding, Yixuan Liao, Xiaoxin Chen · Apr 16
  • arXiv CS.AI · Bahey Tharwat, Giorgos Kordopatis-Zilos, Pavel Suma, Ian Reid, Giorgos Tolias · Apr 16
