Research · Apr 17

Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality

arXiv:2509.14255v2. Abstract: Mixture-of-Experts (MoE) models improve efficiency through sparse activation, but their learned gating functions provide limited insight into routing decisions. This work introduces the Semantic Resonance Architecture (SRA), which routes tokens to experts via cosine similarity between token representations and learnable semantic anchors, making every routing decision directly traceable to anchor-token similarity scores. We evaluate SRA on WikiText-103 across 17 configurations. In a controlled multi-seed comparison (3 seeds $\times$ 4 configurations, 256 experts, $D_{ff}=256$), cosine routing achieves perplexity competitive with standard linear routing ($12.57 \pm 0.03$ vs $12.45 \pm 0.03$ for $K=1 \to 4$; $12.52 \pm 0.02$ vs $12.57 \pm 0.02$ for $K=2 \to 4$). The training recipe, not the routing function, drives specialization quality, while cosine routing provides inherent inspectability. We introduce a bandpass routing loss, a floor-and-ceiling corridor on expert utilization, that reduces dead experts from 30-45% to 0-6% and…
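To make the routing idea in the abstract concrete, here is a minimal pure-Python sketch of top-K routing by cosine similarity to per-expert anchors, plus a hinge-style corridor penalty on expert utilization. It is not the paper's implementation: the function names (`route`, `utilization`, `corridor_penalty`), the toy 2-dimensional vectors, the floor and ceiling values, and the hinge form of the penalty are all assumptions chosen for illustration, since the abstract does not give the exact loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def route(token, anchors, k):
    """Return the k (expert_index, similarity) pairs with the highest similarity.

    Each routing decision is traceable: the returned scores are exactly
    the anchor-token similarities that produced it.
    """
    scores = [(i, cosine(token, a)) for i, a in enumerate(anchors)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


def utilization(tokens, anchors, k):
    """Fraction of all routing slots (len(tokens) * k) assigned to each expert."""
    counts = [0] * len(anchors)
    for token in tokens:
        for expert, _ in route(token, anchors, k):
            counts[expert] += 1
    total = len(tokens) * k
    return [c / total for c in counts]


def corridor_penalty(util, floor, ceiling):
    """Assumed hinge form: zero inside [floor, ceiling], linear outside it."""
    return sum(max(0.0, floor - u) + max(0.0, u - ceiling) for u in util)


# Hypothetical toy data: three expert anchors and four token vectors.
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tokens = [[2.0, 0.1], [0.9, 0.2], [0.1, 3.0], [1.0, 0.5]]

util = utilization(tokens, anchors, k=1)
penalty = corridor_penalty(util, floor=0.1, ceiling=0.5)
print(util, penalty)
```

In this toy setup the third anchor receives no tokens, so it plays the role of a dead expert and contributes to the penalty through the floor term, while the over-used first expert contributes through the ceiling term. In a trained model the penalty would have to be computed from differentiable routing probabilities rather than hard counts; that detail is left out here.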

Covered by 2 sources
