Research · 5h ago

Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving

Researchers have identified that interleaving speech and text data during model training improves automatic speech recognition performance more effectively than traditional methods. This approach addresses the diminishing returns often seen when simply increasing the volume of supervised speech data alone. By better integrating textual pretraining with audio inputs, developers may be able to build more accurate voice recognition systems without needing proportionally larger datasets.

Covered by 1 source

  • arXiv CS.AI · Ruchao Fan, Yiming Wang, Rui Zhao, Liliang Ren, Keqi Deng, Xiaoyang Chen, Ali Zare, Bo Ren, Yuxuan Hu, Junkun Chen, Yan Huang, Yelong Shen, Jinyu Li · 5h ago

Related stories

  • Research · On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs · Jun 29 · 13 sources
  • Research · Anti-Causal Domain Generalization: Leveraging Unlabeled Data · Jul 1 · 2 sources
  • Research · Learning Unmasking Policies for Diffusion Language Models · Jun 29 · 6 sources
  • Research · RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention · Jun 29