
visarga on Aug 30, 2023, on: Understanding Llama 2 and the New Code Llama LLMs

You don't have to use the whole MoE model: for each token only about 1/N of the expert weights are used, where N is the number of experts (assuming one expert is routed per token). All N experts still have to sit in memory, though, so its compute utilisation scales slower than memory usage.
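A rough sketch of that arithmetic, in case it helps. The function name, the parameter sizes, and the split into "shared" (attention, embeddings) and "per-expert" weights are illustrative assumptions, not figures from any real model. `top_k=1` matches the one-expert-per-token case described in the comment.

```python
def moe_param_counts(shared_params, expert_params, num_experts, top_k=1):
    """Return (total, active) parameter counts for a toy MoE model.

    total  -> what has to be held in memory (every expert is resident)
    active -> what is actually touched for a single token (top_k experts)
    All sizes are hypothetical; this only illustrates the scaling.
    """
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared_params + num_experts * expert_params
    active = shared_params + top_k * expert_params
    return total, active


if __name__ == "__main__":
    # Made-up sizes: 1M shared parameters, 10M parameters per expert.
    for n in (1, 2, 4, 8):
        total, active = moe_param_counts(1_000_000, 10_000_000, n)
        print(f"N={n}: in memory={total:,}  used per token={active:,}")
```

With one expert per token, the per-token count stays flat as N grows while the in-memory count grows linearly. That gap is the point being made above.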