Understanding Llama 2 and the New Code Llama LLMs (sebastianraschka.com)
170 points by rasbt on Aug 30, 2023 | 34 comments



The author of the article appears to have misunderstood one important detail about Code Llama.

They state:

> The Code Llama models were trained on 500B tokens, whereas Llama 2 models were trained on 2T tokens. Since the Code Llama model was trained on 4x fewer tokens, maybe a CodeLlama 70B version did not perform well enough due to LLM scaling laws—there was not enough training data.

But if you read the paper, on page 1, it says:

> Our approach is based on gradually specializing and increasing the capabilities of Llama 2 models by applying a cascade of training and fine-tuning steps [...]

In fact, they show a diagram at the top of page 3 that details the process, starting with Llama 2 foundation models.

Llama 2 Foundation models (7B, 13B, 34B) -> Code training 500B -> Python / Long Context.

See the paper here: https://arxiv.org/abs/2308.12950


Good catch. Above that paragraph, I wrote that the Code Llama models were initialized with the Llama 2 weights, which makes this contradictory, indeed.

What I meant to say here was 500B domain-specific tokens. Maybe domain-specific is not the right word here, but tokens related to the problems that the LLM aims to solve.

EDIT: Updated the text to be more clear.


It does say this: Note that all Code Llama models were initialized with Llama 2 weights before they were further trained on code.


They also moved part of the article to another post and made it paywalled. Is that really necessary for someone who's already been a professor, has a famous book, and works at a (supposedly highly invested) AI company?


Right.

### off topic rants below

Somehow there are so many blog posts about these things, all trying to ask for your email. Is it becoming easier to put more words together nowadays? I guess so.

I really wish there were a way to fact-check it all, instead of depending on good Samaritans in the HN comments to point these obvious misconceptions out.


> I really wish there were a way to fact-check it all, instead of depending on good Samaritans in the HN comments to point these obvious misconceptions out.

You mean like reading original sources? Frequently, big research projects like this come with an official paper[1] and/or blog post[2] explaining what they did.

[1] https://ai.meta.com/research/publications/code-llama-open-fo...

[2] https://ai.meta.com/blog/code-llama-large-language-model-cod...


I wonder how long until we can just use LLMs to do that for us: first summarizing a blog post (something we've already seen many examples of LLMs doing), but focused on extracting factual claims, then using those claims as context when ingesting the linked sources, to find what in the sources actually backs up the claims, or whether anything in the sources contradicts them.
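A rough sketch of what I mean, in Python; `llm` here is a hypothetical callable (prompt in, text out), and the prompts and helper names are made up for illustration rather than taken from any existing tool:

```python
# Hypothetical two-step pipeline: extract claims from a post, then check them
# against a linked source. `llm` is any prompt -> text callable you supply.
def extract_claims(llm, post_text: str) -> list[str]:
    prompt = (
        "List the factual claims made in this blog post, one per line:\n\n"
        + post_text
    )
    return [line.lstrip("- ").strip()
            for line in llm(prompt).splitlines() if line.strip()]

def check_claims(llm, claims: list[str], source_text: str) -> str:
    prompt = (
        "For each claim below, say whether the source supports it, contradicts it, "
        "or does not mention it, quoting the relevant passage.\n\n"
        "Claims:\n" + "\n".join(f"- {c}" for c in claims)
        + "\n\nSource:\n" + source_text
    )
    return llm(prompt)
```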


> Somehow there are so many blogpost about these things, all trying to ask for your emails.

That's because Substack defaults to bothering people for their email, and lots of people are using Substack as their blogging platform these days.


> and lots of people are using Substack as their blogging platform these days.

they shouldn't. It's Medium all over again...


>GPT-3.5 has 175B parameters versus 70B parameters in Llama 2

We know that for the original version of GPT-3.5, but my assumption was that Turbo was a distilled smaller model (which is why it uses OAI's new vocab & is so much faster).

If that's not the case, what could be the explanation for it being faster?


I think that unless (until?) OpenAI releases information about the model itself and the inference engine it runs on, everything is just speculation. Clearly, there's impressive ML and systems engineering at play with GPT-3.5-turbo given how capable, fast, and scalable to their customer base it is.


I think so too. But in general, it could also be due to other reasons: faster hardware, lower timeout for batched inference, optimizations like flash attention and flash attention 2, quantization, ...

I'd say that it's probably a mix of all of the above (incl some distillation).


It is widely believed that GPT-3.5 is an MoE model, which means it could have 175B parameters but still have much lower latency than GPT-3.


Interesting, I thought GPT-3.5 was considered GPT-3 + InstructGPT-style RLHF on a large scale, whereas GPT-4 is considered to be an MoE model.


There was an article on HN a couple of weeks ago that conjectured it might apply to GPT-3.5 Turbo as well: https://news.ycombinator.com/item?id=37006224


Haven't seen that one, yet. Thanks for sharing!


Why would MoE make it lower latency?


You don't have to use the whole MoE model: for each token, only 1/N of the model is used, where N is the number of experts. So its compute utilisation scales more slowly than its memory usage.
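To make that concrete, here's a minimal top-1 routing sketch in PyTorch (purely illustrative; nobody outside OpenAI knows what GPT-3.5 actually looks like):

```python
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    """Toy MoE layer: every expert lives in memory, but each token only runs
    through the single expert the router picks, so per-token compute is ~1/N
    of the total expert parameters."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)        # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])               # only routed tokens hit expert i
        return out
```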


It's easier to parallelize, so you can throw more GPUs at a single request (or really, a batch of requests).


Interesting, yeah I buy that, thanks. Building my intuition with this stuff. Anyone seen a good open-source implementation of MoE with Llama yet?


Jon Durbin has been working on LMoE, which isn't pure MoE but uses a LoRA-based approach instead. Core idea is dynamically swapping PEFT adapters based on the incoming utterance.

https://github.com/jondurbin/airoboros#lmoe
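Rough sketch of how that adapter-swapping looks with PEFT (the adapter paths are placeholders, and this is not the exact airoboros implementation, just the general idea):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)

# One LoRA adapter per "expert", all sharing the same base weights.
model = PeftModel.from_pretrained(base, "placeholder/lora-code", adapter_name="code")
model.load_adapter("placeholder/lora-law", adapter_name="law")

def generate(prompt: str, expert: str) -> str:
    model.set_adapter(expert)                 # swap adapters instead of whole models
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128)
    return tok.decode(out[0], skip_special_tokens=True)
```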


I'm pretty excited about LoRA MoEs, but for the sake of conversation I'll point out a reply someone made to me when I commented about them: https://news.ycombinator.com/item?id=37007795

Any LoRA approach is obviously going to perform a little worse than a fully tuned model, but I guess the jury is still out on whether this approach will actually work well.

Exciting times!


Yeah, it's definitely a tradeoff. My intuition here is that, much like the resistance to catastrophic forgetting you get when using LoRAs, adapter-based approaches will be useful in scenarios where your "experts" largely need to maintain the base capabilities of the model. So maybe the experts in this case are just style experts rather than knowledge experts (this is pure conjecture; we will see as we eval all these approaches).


You can't just turn an existing model into an MoE; they need to be trained from scratch, unfortunately. I'm not aware of any open-source MoE models; they are complicated and probably not that useful if you want to run them on your own hardware.


Would you mind correcting my misunderstanding here? Code Llama is a fine-tuned version of Llama 2 (i.e. not trained from scratch). Say I fine-tuned Llama 2 with a bunch of law text to get Law Llama, and fine-tuned a couple more with some history text and science text. Why wouldn't Code Llama, Law Llama, History Llama, and Science Llama be the experts in my MoE setup? Seems like I just need a simple router in front of those models to direct the prompt to the right expert.
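Something like this rough sketch is what I have in mind (the model ids are placeholders, and the zero-shot classifier is just one cheap way to implement the router):

```python
from transformers import pipeline

EXPERTS = {
    "code": "placeholder/code-llama-ft",      # stand-ins for the fine-tuned checkpoints
    "law": "placeholder/law-llama-ft",
    "history": "placeholder/history-llama-ft",
    "science": "placeholder/science-llama-ft",
}

# Lightweight router: classify the prompt into one of the expert domains.
router = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def pick_expert(prompt: str) -> str:
    result = router(prompt, candidate_labels=list(EXPERTS))
    return EXPERTS[result["labels"][0]]       # highest-scoring domain wins

def answer(prompt: str) -> str:
    # Loading the expert per call just for brevity; you'd keep them resident in practice.
    generator = pipeline("text-generation", model=pick_expert(prompt))
    return generator(prompt, max_new_tokens=256)[0]["generated_text"]
```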


That could work, but I'd expect the following issues:

* For a lot of prompts every fine tuned model will make the same mistakes (they mostly share the same weights after all) and so you aren't getting nearly as much benefit as e.g. GPT-4 gets.

* It's going to be really expensive at inference time, since you have to run multiple models even though in most cases they won't help much

* Normally when people talk about hobbyists doing finetuning they mean <1M tokens, whereas Code Llama Python was finetuned on 100B tokens, way outside most people's price range. For the finetuning that you can afford, you can't teach it new knowledge, just show it how to apply the knowledge it already has.


There is also speculative sampling: you decode a few tokens with a smaller model, then use the big model to validate them all at once. The big model might trim the prediction at some point and add an extra token of its own. Then you cycle again with the small model, for a 2-2.5x speedup.
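A toy greedy version of that loop, just to show the accept/trim logic (`draft` and `target` are hypothetical callables mapping a token-id prefix to next-token logits; a real implementation scores all the drafted positions with a single forward pass of the big model):

```python
import torch

def speculative_step(prefix: list[int], draft, target, k: int = 4) -> list[int]:
    # 1) The small model drafts k tokens cheaply, one at a time.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = int(torch.argmax(draft(ctx)))
        drafted.append(tok)
        ctx.append(tok)

    # 2) The big model checks the draft; keep tokens while they match its own
    #    greedy choice, and on the first mismatch substitute its token instead
    #    (the "trim the prediction and add an extra token" part).
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        big_tok = int(torch.argmax(target(ctx)))
        if big_tok != tok:
            accepted.append(big_tok)
            break
        accepted.append(tok)
        ctx.append(tok)
    return prefix + accepted
```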


I'm not sure if I'm the only one, but I find the starcoder model to be muuuuch better than codellama 34B quantized. I can't seem to find any good coding benchmarks online comparing the two.

Anyone else having a similar experience?


Managed to get code llama 34 integrated into vscode and must say it’s surprisingly usable for scaffolding and also explaining pieces of code


Could you share instructions on how you did that?


On a very high level you chain it together like so:

llama.cpp >> OpenAI translation server (included in llama git) >> Continue extension in vscode
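A quick way to sanity-check the local OpenAI-compatible endpoint before pointing Continue at it is the old 0.x `openai` Python client; note the port and model name below are assumptions and depend on how you launched the translation server:

```python
import openai

openai.api_key = "not-needed-locally"          # the local server ignores it
openai.api_base = "http://127.0.0.1:8081/v1"   # assumed host/port

resp = openai.ChatCompletion.create(
    model="codellama-34b",                      # whatever name your server exposes
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(resp["choices"][0]["message"]["content"])
```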


Does this mean there's most likely a non-released version of Llama 2 34B at Meta, since they need one as a base for Code Llama?


Indeed, and its existence was mentioned in the Llama 2 paper too.


Yes, if I remember correctly, they said they didn't release the 34B Llama 2 model yet because they hadn't had a chance to "red team" that one, where "red teaming" means something along the lines of identifying and exploiting vulnerabilities.



