Mixture of a Million Experts
ABSTRACT. The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
CONCLUSION. This work introduces a fine-grained MoE architecture that decomposes an extremely wide dense feedforward layer into a large number of small experts. This design is supported by the recent discovery of the fine-grained MoE scaling law. To overcome the computational overhead of routing to a large number of experts, we apply the product key technique to efficiently select a small subset of hidden neurons within a wide MLP layer. Empirical analysis using language modeling tasks demonstrates that, given the same compute budget, PEER significantly outperforms dense transformers, coarse-grained MoEs, and product key memory layers.
Parameter Efficient Expert Retrieval (PEER) for MoE Models: Addressing Limitations
This paper introduces a novel Mixture-of-Experts (MoE) architecture called PEER (Parameter Efficient Expert Retrieval) capable of handling over a million experts. It addresses the limitations of existing MoEs by enabling efficient routing over a massive pool of tiny experts.
Key Contributions:
Exploring an Extreme MoE Setting: Unlike prior research focusing on a small number of large experts, PEER investigates using a vast number of tiny experts. This is motivated by the fine-grained MoE scaling law, which suggests that higher granularity leads to better performance.
Learned Index Structure for Routing: PEER uses product key retrieval, a learned index structure, to route efficiently over a million experts. This significantly reduces the computational cost compared to traditional MoE routing (a back-of-the-envelope comparison follows this list).
New Layer Design: By combining product key routing with single-neuron experts, the PEER layer expands model capacity without significant computational overhead, surpassing dense FFWs, coarse-grained MoEs, and Product Key Memory (PKM) layers in the performance-compute trade-off.
Comprehensive Ablation Studies: The paper investigates the impact of various design choices, including the number of experts, active parameters, and heads, demonstrating the effectiveness of PEER.
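To make the routing saving concrete, the following back-of-the-envelope calculation compares the number of query-key multiply-adds for naive top-k routing, O(Nd), against product key retrieval, O((√N + k²)d), over roughly a million experts. The specific values of d and k are illustrative assumptions, not figures from the paper.

```python
# Rough routing-cost comparison for N ~ 1M experts.
# The query dimension d and retrieval count k below are illustrative assumptions.
import math

N, d, k = 1_048_576, 128, 16

naive_cost = N * d                              # score every expert key: O(N * d)
product_key_cost = (math.isqrt(N) + k * k) * d  # two sub-key scans plus k^2 merges: O((sqrt(N) + k^2) * d)

print(f"naive routing:       {naive_cost:,} multiply-adds")        # 134,217,728
print(f"product-key routing: {product_key_cost:,} multiply-adds")  # 163,840
```

For these illustrative values, product key retrieval requires roughly 800 times fewer comparisons than scoring every expert key.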
Technical Details:
Expert Retrieval: PEER utilizes product keys for efficient retrieval. By splitting queries and keys into sub-components, it reduces the complexity of retrieving the top-k experts from O(Nd) to O((√N + k²)d); see the combined sketch after this list.
Parameter-efficient Experts: PEER uses single-neuron MLPs as experts, sharing hidden neurons among experts to enhance knowledge transfer and parameter efficiency.
Multi-head Retrieval: The layer incorporates multiple independent query networks (heads), each retrieving a separate set of experts.
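The three components above fit together as sketched below: a minimal PyTorch reconstruction of a PEER-style layer with product-key retrieval, single-neuron experts stored as embedding rows, and multi-head retrieval. It is written from the description in this summary rather than from the authors' code, so the class name, argument names, default sizes, and the GELU activation are assumptions.

```python
# Minimal PEER-style layer sketch (PyTorch). Reconstructed from the summary above,
# not the authors' implementation; names, defaults, and the GELU activation are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class PEERSketch(nn.Module):
    def __init__(self, dim, num_experts=1024 ** 2, heads=8, topk=16, query_dim=128):
        super().__init__()
        assert math.isqrt(num_experts) ** 2 == num_experts, "num_experts must be a perfect square"
        self.n_sub = math.isqrt(num_experts)              # sqrt(N) sub-keys per codebook
        self.heads, self.topk = heads, topk
        self.queries = nn.Linear(dim, heads * query_dim)  # one query network per retrieval head
        # Two product-key codebooks; the pair (i, j) indexes expert i * sqrt(N) + j.
        self.sub_keys = nn.Parameter(torch.randn(2, self.n_sub, query_dim // 2))
        # Single-neuron experts: one down-projection row and one up-projection row each.
        self.w_down = nn.Embedding(num_experts, dim)
        self.w_up = nn.Embedding(num_experts, dim)

    def forward(self, x):                                            # x: (batch, seq, dim)
        b, s, _ = x.shape
        q = self.queries(x).view(b, s, self.heads, 2, -1)            # split each query into two halves
        scores = torch.einsum("bshcd,cnd->bshcn", q, self.sub_keys)  # (b, s, heads, 2, sqrt(N))
        top_s, top_i = scores.topk(self.topk, dim=-1)                # top-k per query half
        # Cartesian product of the two candidate sets -> k^2 candidate experts per head.
        cand_s = top_s[..., 0, :, None] + top_s[..., 1, None, :]
        cand_i = top_i[..., 0, :, None] * self.n_sub + top_i[..., 1, None, :]
        cand_s, cand_i = cand_s.flatten(-2), cand_i.flatten(-2)
        gate, sel = cand_s.topk(self.topk, dim=-1)                   # final top-k experts per head
        idx = cand_i.gather(-1, sel)                                 # (b, s, heads, topk)
        # Gather the selected single-neuron experts and aggregate with softmaxed router scores.
        down, up = self.w_down(idx), self.w_up(idx)                  # (b, s, heads, topk, dim)
        h = F.gelu(torch.einsum("bsd,bshkd->bshk", x, down)) * gate.softmax(dim=-1)
        return torch.einsum("bshk,bshkd->bsd", h, up)                # sum over heads and experts
```

Storing each expert as a single row of w_down and w_up is what keeps the per-token cost proportional to heads × topk rather than to the total number of experts.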
Results:
IsoFLOP Analysis: PEER consistently outperforms dense FFWs, coarse-grained MoEs, and PKM layers across different FLOP budgets, demonstrating its superior performance-compute trade-off.
Language Modeling Evaluation: PEER achieves lower perplexity scores on various language modeling benchmarks compared to baselines, indicating improved performance.
Ablation Studies: Varying the number of experts, active experts, and heads reveals that PEER scales efficiently with increasing granularity, demonstrating its ability to leverage large pools of experts.
Overall:
This paper makes a significant contribution to the field of MoE models by introducing a novel architecture capable of efficiently handling a million experts. PEER opens new avenues for scaling language models while maintaining computational efficiency, potentially leading to further breakthroughs in the development of large language models.