> This is because the entire model's weights must be run through the algorithm m...

> This is because the entire model's weights must be run through the algorithm many times per prompt.

And this is why I'm so excited about MoE models! qwen3:30b-a3b runs at the speed of a 3B parameter model. It's completely realistic to run on a plain CPU with 20 GB RAM for the model.