Models · Apr 16

Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning

arXiv:2604.13586v1. Abstract: Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, which makes them computationally expensive. To address this problem, ToC3D, the current state-of-the-art (SOTA) method for efficient multi-view ViT-based 3D object detection, employs ego-motion-based relevant token selection. However, it has two key limitations: (1) the fixed, layer-individual token selection ratios limit computational efficiency during both training and inference; (2) full end-to-end retraining of the ViT backbone is required. In this work, we propose an image token compensator combined with token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike ToC3D, our approach enables dynamic layer-wise token selection within the ViT backbone. Furthermore, we introduce a parameter-efficient fine-tuning strategy, which trains only the proposed modules, thereby reducing the number of fine-tuned parameters from more than 300 million (M) to only 1.6 M.…
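To make the contrast between fixed and dynamic token selection concrete, here is a minimal sketch in plain Python. It is not the paper's method: the abstract does not say how tokens are scored or how the compensator works, so the per-token relevance scores, the threshold rule, and the function names `select_fixed_ratio`, `select_dynamic`, and `trainable_fraction` are all assumptions made for illustration. The last function only restates the abstract's parameter counts as a ratio.

```python
import math


def select_fixed_ratio(scores, keep_ratio):
    """Keep the top ceil(keep_ratio * N) tokens, regardless of the input.

    This mimics a fixed per-layer budget: the number of kept tokens
    depends only on keep_ratio and the token count.
    """
    k = max(1, math.ceil(keep_ratio * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def select_dynamic(scores, threshold):
    """Keep every token whose score reaches the threshold (assumed rule).

    The number of kept tokens now varies with the input, which is the
    property "dynamic" selection is meant to illustrate here.
    """
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    if not kept:
        # Fall back to the single best token so a layer never gets an empty set.
        kept = [max(range(len(scores)), key=lambda i: scores[i])]
    return kept


def trainable_fraction(trainable_params, total_params):
    """Share of parameters that are updated during fine-tuning."""
    return trainable_params / total_params


# Hypothetical relevance scores for four image tokens in two different scenes.
sparse_scene = [0.9, 0.1, 0.05, 0.2]   # one relevant token
crowded_scene = [0.9, 0.8, 0.7, 0.6]   # all tokens relevant

fixed_sparse = select_fixed_ratio(sparse_scene, 0.5)
fixed_crowded = select_fixed_ratio(crowded_scene, 0.5)
dynamic_sparse = select_dynamic(sparse_scene, 0.5)
dynamic_crowded = select_dynamic(crowded_scene, 0.5)

# Figures from the abstract: 1.6 M trained vs. "more than 300 M" in the backbone,
# so this ratio is an upper bound on the trained share.
fraction = trainable_fraction(1.6e6, 300e6)
```

In this toy setup the fixed-ratio rule keeps two tokens in both scenes, while the threshold rule keeps one token in the sparse scene and all four in the crowded one. With the abstract's figures, the trained share comes out at roughly half a percent of the backbone's parameters.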

Covered by 2 sources

  • arXiv CS.AI · Danish Nazir, Antoine Hanna-Asaad, Lucas Görnhardt, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt · Apr 16
  • arXiv CS.AI · Mingqian Ji, Shanshan Zhang, Jian Yang · Apr 17
