
It is widely believed that GPT-3.5 is an MoE model, which means it could have 175B parameters but still have much lower latency than GPT-3.



Interesting, I thought GPT-3.5 was considered GPT-3 + InstructGPT-style RLHF on a large scale, whereas GPT-4 is considered to be an MoE model.


There was an article on HN a couple of weeks ago that conjectured it might apply to GPT-3.5 Turbo as well: https://news.ycombinator.com/item?id=37006224


Haven't seen that one yet. Thanks for sharing!


Why would MoE make it lower latency?


You don't have to run the whole MoE model: for each token only about 1/N of the model is used, where N is the number of experts. So its compute utilisation scales more slowly than its memory usage.
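
For intuition, here's a minimal sketch of a top-1-routed MoE feed-forward layer (PyTorch; the class and variable names are mine, not from any particular implementation). All of the experts' weights sit in memory, but each token only passes through the one expert the gate picks, which is why compute grows more slowly than parameter count:

    # Sketch of a top-1-routed MoE layer; names are illustrative only.
    import torch
    import torch.nn as nn

    class Top1MoE(nn.Module):
        def __init__(self, d_model: int, d_ff: int, n_experts: int):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)  # router
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (n_tokens, d_model)
            weights = self.gate(x).softmax(dim=-1)   # (n_tokens, n_experts)
            top_w, top_idx = weights.max(dim=-1)     # one expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = top_idx == i
                if mask.any():                       # only run tokens routed to this expert
                    out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
            return out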


It's easier to parallelize, so you can throw more GPUs at a single request (or really, a batch of requests).


Interesting, yeah I buy that, thanks. Building my intuition with this stuff. Anyone seen a good open-source implementation of MoE with Llama yet?


Jon Durbin has been working on LMoE, which isn't pure MoE but uses a LoRA-based approach instead. Core idea is dynamically swapping PEFT adapters based on the incoming utterance.

https://github.com/jondurbin/airoboros#lmoe
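
A rough sketch of the adapter-swapping idea using the peft library (my own simplification, not the airoboros code; the adapter paths and the router are placeholders):

    # Sketch of swapping PEFT adapters per utterance; not the airoboros LMoE
    # implementation. Adapter paths and the routing rule are placeholders.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    model = PeftModel.from_pretrained(base, "adapters/coding", adapter_name="coding")
    model.load_adapter("adapters/writing", adapter_name="writing")

    def route(utterance: str) -> str:
        # Placeholder router; a real one might embed the utterance and compare it
        # against a description of each expert.
        return "coding" if "function" in utterance.lower() else "writing"

    model.set_adapter(route("Write a function that reverses a linked list"))
    # ...then generate with the active adapter as usual.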


I'm pretty excited about LoRA MoEs, but for the sake of conversation I'll point out a reply someone made to me when I commented about them: https://news.ycombinator.com/item?id=37007795

Any LoRA approach is obviously going to perform a little worse than a fully tuned model, but I guess the jury is still out on whether this approach will actually work well.

Exciting times!


Yeah, it's definitely a tradeoff. My intuition here is that, much like the resistance you get to catastrophic forgetting when using LoRAs, adapter-based approaches will be useful in scenarios where your "experts" largely need to maintain the base capabilities of the model. So maybe the experts in this case are just style experts rather than knowledge experts (this is pure conjecture; we will see as we eval all these approaches).


You can't just turn an existing model into an MoE; they need to be trained from scratch, unfortunately. I'm not aware of any open-source MoE models; they're complicated and probably not that useful if you want to run them on your own hardware.


Would you mind correcting my misunderstanding here? Code Llama is a fine-tuned version of Llama 2 (i.e. not trained from scratch). Suppose I fine-tuned Llama 2 on a bunch of law text and had Law Llama, then fine-tuned a couple more on some history text and science text. Why couldn't Code Llama, Law Llama, History Llama, and Science Llama be the experts in my MoE setup? It seems like I just need a simple router in front of those models to direct each prompt to the right expert.
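
Something like this toy sketch, where the checkpoint names and keyword rules are made up for illustration:

    # Toy sketch of a "simple router in front of full fine-tunes" setup.
    # Checkpoint names and keyword rules are hypothetical.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    EXPERTS = {
        "code":    "my-org/code-llama-ft",
        "law":     "my-org/law-llama-ft",
        "history": "my-org/history-llama-ft",
        "science": "my-org/science-llama-ft",
    }

    def route(prompt: str) -> str:
        # Trivial keyword router; a real setup might use an embedding classifier.
        p = prompt.lower()
        if any(w in p for w in ("function", "bug", "compile")):
            return "code"
        if any(w in p for w in ("statute", "contract", "liability")):
            return "law"
        if any(w in p for w in ("experiment", "hypothesis", "physics")):
            return "science"
        return "history"

    def answer(prompt: str) -> str:
        name = EXPERTS[route(prompt)]
        tok = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=256)
        return tok.decode(out[0], skip_special_tokens=True)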


That could work, but I'd expect the following issues:

* For a lot of prompts, every fine-tuned model will make the same mistakes (they mostly share the same weights, after all), so you aren't getting nearly as much benefit as e.g. GPT-4 gets.

* It's going to be really expensive at inference time, since you have to run multiple models even though in most cases they won't help much.

* Normally when people talk about hobbyists doing fine-tuning they mean <1M tokens, whereas Code Llama Python was fine-tuned on 100B tokens, way outside most people's price range. With the fine-tuning you can afford, you can't really teach the model new knowledge, just show it how to apply the knowledge it already has.



