Mixture-of-Experts with Expert Choice Routing. (arXiv:2202.09368v1 [cs.LG])

Sparsely activated Mixture-of-Experts (MoE) models allow the number of
parameters to greatly increase while keeping the amount of computation for a
given token or a given sample unchanged. However, a poor expert routing
strategy (e.g., one resulting in load imbalance) can cause certain experts to be
under-trained, leaving them under- or over-specialized. Prior work
allocates a fixed number of experts to each token using a top-k function
regardless of the relative importance of different tokens. To address this, we
propose a heterogeneous mixture-of-experts employing an expert choice method.
Instead of letting tokens select the top-k experts, we have experts select
the top-k tokens. As a result, each token can be routed to a variable number of
experts and each expert can have a fixed bucket size. We systematically study
pre-training speedups using the same computational resources as the Switch
Transformer top-1 and GShard top-2 gating of prior work and find that our
method improves training convergence time by more than 2x. For the same
computational cost, our method demonstrates higher performance when fine-tuning
on 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller
activation cost, our method outperforms the T5 dense model in 7 out of the 11
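
The contrast between token-choice and expert-choice routing described above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names, the toy score matrix, and the choice to compute affinities with a softmax over experts for each token are assumptions made for illustration, and the bucket size is passed in directly rather than derived from a capacity factor.

```python
import math

def softmax(row):
    """Numerically stable softmax over one token's expert scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def token_choice(scores, k):
    """Conventional routing: each token selects its top-k experts.
    Returns {expert index: list of token indices}; loads may be uneven."""
    num_experts = len(scores[0])
    assignment = {e: [] for e in range(num_experts)}
    for t, row in enumerate(scores):
        probs = softmax(row)
        top = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:k]
        for e in top:
            assignment[e].append(t)
    return assignment

def expert_choice(scores, bucket_size):
    """Expert-choice routing: each expert selects its top-`bucket_size` tokens.
    Every expert ends up with exactly `bucket_size` tokens."""
    affinity = [softmax(row) for row in scores]  # assumed: softmax over experts
    num_tokens, num_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: affinity[t][e], reverse=True)
        assignment[e] = sorted(ranked[:bucket_size])
    return assignment

def experts_per_token(assignment, num_tokens):
    """Count how many experts each token was routed to."""
    counts = [0] * num_tokens
    for tokens in assignment.values():
        for t in tokens:
            counts[t] += 1
    return counts

# Hypothetical router scores: 4 tokens (rows) x 3 experts (columns).
SCORES = [
    [3.0, 2.9, 0.0],
    [0.2, 0.1, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
]

if __name__ == "__main__":
    tc = token_choice(SCORES, k=1)
    ec = expert_choice(SCORES, bucket_size=2)
    print("token choice :", tc)
    print("expert choice:", ec)
    print("experts per token:", experts_per_token(ec, len(SCORES)))
```

On this toy input, top-1 token choice sends three of the four tokens to expert 0 and none to expert 1, whereas expert choice gives every expert exactly two tokens while individual tokens receive either one or two experts.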

Source: https://arxiv.org/abs/2202.09368

