
It is widely believed that GPT-3.5 is an MoE model, which means it could have 175B parameters but still have much lower latency than GPT-3.



Interesting, I thought GPT-3.5 was considered GPT-3 + InstructGPT-style RLHF on a large scale, whereas GPT-4 is considered to be an MoE model.


There was an article on HN a couple of weeks ago that conjectured it might apply to GPT-3.5 Turbo as well: https://news.ycombinator.com/item?id=37006224


Haven't seen that one yet. Thanks for sharing!


Why would MoE make it lower latency?


You don't have to run the whole MoE model: for each token only about 1/N of the model is used, where N is the number of experts. So its compute utilisation scales more slowly than its memory usage.
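
For intuition, here's a minimal sketch of a top-1-routed MoE feed-forward layer (PyTorch; the class and variable names are mine, not from any particular implementation). All of the experts' weights sit in memory, but each token only passes through the one expert the gate picks, which is why compute grows more slowly than parameter count:

    # Sketch of a top-1-routed MoE layer; names are illustrative only.
    import torch
    import torch.nn as nn

    class Top1MoE(nn.Module):
        def __init__(self, d_model: int, d_ff: int, n_experts: int):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)  # router
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (n_tokens, d_model)
            weights = self.gate(x).softmax(dim=-1)   # (n_tokens, n_experts)
            top_w, top_idx = weights.max(dim=-1)     # one expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = top_idx == i
                if mask.any():                       # only run tokens routed to this expert
                    out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
            return out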


It's easier to parallelize, so you can throw more GPUs at a single request (or really, a batch of requests).


Interesting, yeah I buy that, thanks. Building my intuition with this stuff. Anyone seen a good open-source implementation of MoE with Llama yet?


Jon Durbin has been working on LMoE, which isn't pure MoE but uses a LoRA-based approach instead. Core idea is dynamically swapping PEFT adapters based on the incoming utterance.

https://github.com/jondurbin/airoboros#lmoe
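
A rough sketch of the adapter-swapping idea using the peft library (my own simplification, not the airoboros code; the adapter paths and the router are placeholders):

    # Sketch of swapping PEFT adapters per utterance; not the airoboros LMoE
    # implementation. Adapter paths and the routing rule are placeholders.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = PeftModel.from_pretrained(base, "adapters/coding", adapter_name="coding")
    model.load_adapter("adapters/writing", adapter_name="writing")

    def route(utterance: str) -> str:
        # Placeholder router; a real one might embed the utterance and compare it
        # against a description of each expert.
        return "coding" if "function" in utterance.lower() else "writing"

    model.set_adapter(route("Write a function that reverses a linked list"))
    # ...then generate with the active adapter as usual.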


I'm pretty excited about LoRA MoEs, but for the sake of conversation I'll point out a reply someone made to me when I commented about them: https://news.ycombinator.com/item?id=37007795

Any LoRA approach is obviously going to perform a little worse than a fully tuned model, but I guess the jury is still out on whether this approach will actually work well.

Exciting times!


Yeah, it's definitely a tradeoff. My intuition here is that, much like the resistance you get to catastrophic forgetting when using LoRAs, adapter-based approaches will be useful in scenarios where your "experts" largely need to maintain the base capabilities of the model. So maybe the experts in this case are just style experts rather than knowledge experts (this is pure conjecture; we will see as we eval all these approaches).


You can't just turn an existing model into an MoE; they need to be trained from scratch, unfortunately. I'm not aware of any open-source MoE models; they're complicated and probably not that useful if you want to run them on your own hardware.


Would you mind correcting my misunderstanding here? Code Llama is a fine-tuned version of Llama 2 (i.e. not trained from scratch). Suppose I fine-tuned Llama 2 on a bunch of law text and had Law Llama, then fine-tuned a couple more on some history text and science text. Why couldn't Code Llama, Law Llama, History Llama, and Science Llama be the experts in my MoE setup? It seems like I just need a simple router in front of those models to direct each prompt to the right expert.
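
Something like this toy sketch, where the checkpoint names and keyword rules are made up for illustration:

    # Toy sketch of a "simple router in front of full fine-tunes" setup.
    # Checkpoint names and keyword rules are hypothetical.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    EXPERTS = {
        "code":    "my-org/code-llama-ft",
        "law":     "my-org/law-llama-ft",
        "history": "my-org/history-llama-ft",
        "science": "my-org/science-llama-ft",
    }

    def route(prompt: str) -> str:
        # Trivial keyword router; a real setup might use an embedding classifier.
        p = prompt.lower()
        if any(w in p for w in ("function", "bug", "compile")):
            return "code"
        if any(w in p for w in ("statute", "contract", "liability")):
            return "law"
        if any(w in p for w in ("experiment", "hypothesis", "physics")):
            return "science"
        return "history"

    def answer(prompt: str) -> str:
        name = EXPERTS[route(prompt)]
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=256)
        return tok.decode(out[0], skip_special_tokens=True)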


That could work, but I'd expect the following issues:

* For a lot of prompts, every fine-tuned model will make the same mistakes (they mostly share the same weights, after all), so you aren't getting nearly as much benefit as e.g. GPT-4 gets.

* It's going to be really expensive at inference time, since you have to run multiple models even though in most cases they won't help much.

* Normally when people talk about hobbyists doing fine-tuning they mean <1M tokens, whereas Code Llama Python was fine-tuned on 100B tokens, way outside most people's price range. With the fine-tuning you can afford, you can't really teach the model new knowledge, just show it how to apply the knowledge it already has.



