Sure! Basically, a traditional MoE layer contains several parallel expert sub-networks, and a learned gate routes each token to a subset of them. The routing is trained end-to-end from the task loss even though the top-k selection is discrete (much as max-pooling passes gradients only through the selected element, despite the selection itself being non-differentiable). However, MoE experts have been shown to specialize on token identity rather than high-level semantics. This was eloquently explained by Fuzhao Xue, author of OpenMoE, in one of our reading groups: https://www.youtube.com/watch?v=k3QOpJA0A0Q&t=1547s
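To make the gating mechanism concrete, here's a minimal NumPy sketch of top-k expert routing (the function name, shapes, and k=2 are illustrative, not taken from any particular implementation):

```python
import numpy as np

def top_k_route(x, W_gate, k=2):
    """Toy MoE gating: score all experts, keep the top-k, renormalize.

    x: (d,) token embedding; W_gate: (d, n_experts) learned gating weights.
    The top-k selection is discrete, but gradients still flow through
    the softmax weights of the chosen experts.
    """
    logits = x @ W_gate                    # one score per expert
    top = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()               # softmax over the selected experts only
    return top, weights

rng = np.random.default_rng(0)
x = rng.normal(size=8)                     # a single token embedding
W = rng.normal(size=(8, 4))                # gate over 4 experts
experts, w = top_k_route(x, W, k=2)        # w sums to 1 over the 2 chosen experts
```

In a real MoE, the token would then be processed by the selected experts and their outputs combined with these weights.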
In contrast, our router sits at a higher level of the stack: it sends each prompt to a different model and provider based on quality on that prompt distribution, speed, and cost. Happy to clarify further if helpful!
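As a rough illustration of that kind of prompt-level routing (purely hypothetical: the model names, metrics, and weighted-score formula are made up for this sketch, not our actual routing logic):

```python
# Hypothetical candidates with per-model quality, latency, and cost figures.
candidates = {
    "model-a": {"quality": 0.92, "latency_s": 1.8, "cost_per_1k": 0.030},
    "model-b": {"quality": 0.85, "latency_s": 0.6, "cost_per_1k": 0.002},
}

def route(candidates, w_quality=1.0, w_speed=0.3, w_cost=0.2):
    """Pick the model maximizing a weighted trade-off of the three objectives."""
    def score(m):
        return (w_quality * m["quality"]
                - w_speed * m["latency_s"]      # penalize slow models
                - w_cost * m["cost_per_1k"])    # penalize expensive models
    return max(candidates, key=lambda name: score(candidates[name]))

best = route(candidates)  # here the faster, cheaper model wins despite lower quality
```

Adjusting the weights shifts the trade-off, e.g. a quality-heavy weighting would flip the choice toward the stronger model.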