Cool paper. It's more independent than a dense or normal MoE model, but I think it's still far from the distributed training you're looking for: you still need a seed LM that's trained normally, and when fine-tuning each expert from that seed LM you still need enough GPUs/VRAM to fine-tune the whole LLM. So you're still limited to large GPU clusters, which is exactly the problem we're trying to avoid.
In the case of the paper, they use OPT-6.7B as the seed LM, which takes 8x V100s to fine-tune each expert. That's a combined 256GB of VRAM for a single expert, while a 3090 only has 24GB and is still one of the more expensive consumer GPUs out there.
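Rough back-of-the-envelope for why a single 24GB card doesn't cut it (assuming full fine-tuning with Adam in mixed precision; the real figure depends on activations, sequence length, batch size, etc.):

    # Very rough memory estimate for fully fine-tuning a 6.7B-parameter model
    # with Adam in mixed precision. Activations and optimizer sharding ignored.
    params = 6.7e9
    weights_fp16   = params * 2      # fp16 model weights
    grads_fp16     = params * 2      # fp16 gradients
    adam_states    = params * 4 * 2  # fp32 momentum + variance
    master_weights = params * 4      # fp32 master copy of the weights
    total = weights_fp16 + grads_fp16 + adam_states + master_weights
    print(f"~{total / 2**30:.0f} GiB before activations")  # ~100 GiB

So you're at roughly 100GiB before activations even enter the picture, which is why they spread each expert across 8 cards.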
Maybe we could use something like PEFT or QLoRA in combination with this technique to make each expert cheap enough for the community to fine-tune, and end up with a worse Mixtral 8x7B, but I don't know enough to say for sure.
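Something along these lines, as a rough sketch with Hugging Face transformers/peft/bitsandbytes (the LoRA hyperparameters and target modules are placeholders I made up, not anything from the paper):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Load the seed LM in 4-bit (QLoRA-style) so it fits on a single 24GB card.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-6.7b",
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)

    # Each community "expert" would only train a small LoRA adapter on its own domain data.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # a fraction of a percent of the 6.7B weights

Each expert would then just be an adapter on the order of tens of megabytes rather than a full 6.7B copy, which seems like a much more realistic thing for individuals to train and pass around.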
Or maybe it turns out we can make a good MoE model with thousands of smaller experts, each one small enough for a separate member of the community to independently fine-tune on a normal GPU, but idk.
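For what that could even look like, here's a toy sketch of top-k routing over a big pool of small experts (plain PyTorch, nothing from either paper; the expert count and sizes are made up):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ManySmallExperts(nn.Module):
        """Toy MoE layer: route each token to its top-k experts out of a big pool."""
        def __init__(self, d_model=64, d_hidden=256, n_experts=1024, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts)
            # In the distributed-training scenario, each of these would be
            # fine-tuned separately by a different community member.
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):  # x: (tokens, d_model)
            weights, idx = self.router(x).topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e in idx[:, k].unique():
                    mask = idx[:, k] == e
                    out[mask] += weights[mask, k, None] * self.experts[int(e)](x[mask])
            return out

    x = torch.randn(8, 64)
    print(ManySmallExperts()(x).shape)  # torch.Size([8, 64])

Only the router plus the handful of selected experts run per token, so the expert pool can in principle be much larger than what any one contributor could train or even hold in memory.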
To have a performant, distributed LLM trained from scratch we'd still need a completely different architecture, but this work is pretty cool, and if nothing else it may mean there's something the community can do to help move things forward.
Also, I was going to say the MoE routing in this technique was lacking, but I found a more recent paper[0] from Meta which fixes that with a final fine-tuning stage.
[0] https://arxiv.org/abs/2403.07816