Cerebras-GPT vs. LLaMA AI Model Performance Comparison

sbierwagen · on March 29, 2023

>>Is 10000 bigger than 10050?

>>Yes, 10000 is bigger than 10050.

>But even the mighty ChatGPT often can't do simple math

GPT is bad at math because BPE input compression obfuscates individual digits. https://bbot.org/etc/gpt-math.png You'd be bad at math too if every number was scrambled.

The graph is from page 22 of the GPT-3 paper from 2020. https://arxiv.org/abs/2005.14165 Even with 175 billion parameters it can't reliably do four digit addition.

An example from 4 days ago of ChatGPT being as bad as you'd expect at string reversal: https://news.ycombinator.com/item?id=35297183

(Although, I just tested ChatGPT Mar 14 Version against the above question after doing a bunch of math prompting and it got it right...)

sillysaurusx · on March 29, 2023

This is a common myth, which I've written about before. https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

The closest anyone's come to proving that byte-level tokenization is better is the ByT5 paper https://arxiv.org/abs/2105.13626

But they only showed evidence for improvement on specific tasks, not general performance, which is an important distinction. And their own benchmarks show that the improvements tend to be marginal: https://i.imgur.com/6Cw0APS.png

One view of the situation is that byte-level access (or "digit-level" in this case) gives a way to accelerate training, and to achieve higher performance with fewer parameters. The model doesn't need to spend as much effort on learning the decompression algorithm (tokenization).

But once learned, the tokenization doesn't seem to hinder a model from achieving higher performance, the same way that JPG compression doesn't hinder us from achieving an image that looks very good to humans. It's a bit like arguing an artist would be better if they only operated on raw bitmaps, or that our eyes would be better if our visual cortex didn't do any signal compression. Maybe, but the fact that our eyes do it is pretty strong evidence that compression isn't harmful.

sbierwagen · on March 29, 2023

I'm not sure how this is germane?

I'm talking about specific tasks: saying if 10000 or 10050 is larger. GPT is demonstrably bad at that. The ByT5 paper doesn't mention arithmetic tasks or show benchmark results for the specific task I mention.

Your linked comment says:

>This is a common myth but in practice no one (as far as I know) has shown that byte level predictions result in superior overall performance.

Stating if BPE or character tokenization is better for everything is a much broader claim, one I didn't make! One could easily imagine a toolformer that calls out to calc.exe for anything involving numbers which would get much better numeric performance while still using BPEs.

sillysaurusx · on March 29, 2023

> GPT is bad at math because BPE input compression obfuscates individual digits. https://bbot.org/etc/gpt-math.png You'd be bad at math too if every number was scrambled.

This is the myth I was referring to. BPE compression may slow down training, but it doesn't follow that slower training is the reason for being bad at math.

If you trained GPT specifically on arithmetic tasks, you'd get superior performance to GPT-3, regardless of which tokenization scheme you'd use. But you'd destroy most of its knowledge about everything not-arithmetic.

sbierwagen · on March 29, 2023

>BPE compression may slow down training, but it doesn't follow that slower training is the reason for being bad at math.

It's not so much that it slows down training, is that it completely destroys the relationship between digits and results. Every number is assigned a random token ID, so GPT-3 had to memorize every operation separately. It couldn't generalize at all, which is why it got worse at larger numbers, which showed up less often in the training set-- no examples to remember.

You can try the tokenizer online here: https://platform.openai.com/tokenizer

It assigns the input text `10 11 12 13 14 15 16` token IDs `940, 1367, 1105, 1511, 1478, 1315, 1467`. How is it supposed to figure out incrementing numbers from that? Well, it can't, so it memorizes them. "Neural nets want to work"!

I used the past tense above, because while writing this comment I asked ChatGPT Mar 14 Version a bunch of manydigit addition and substraction questions and it got them all right. Then I asked it if one of those large numbers contained an 8 and it... hallucinated a completely wrong answer, oops: https://bbot.org/etc/gpt-math2.png It's also still spotty at multiplication: "The product of 82368 and 33333 is 2745504384." Well, you got the first five digits right...

nl · on March 29, 2023

> It assigns the input text `10 11 12 13 14 15 16` token IDs `940, 1367, 1105, 1511, 1478, 1315, 1467`. How is it supposed to figure out incrementing numbers from that?

Having the token numbers sequential is meaningless and makes no difference to how a NN works. They are just pointers.

It needs to learn that the sequential value of the things pointed to by these tokens is important and it has only partially done that.

We see that in children learning their numbers too: As a small child "what comes after 39" (or some number) and you'll often get wrong answers.

gerad · on March 29, 2023

I think what OP means is that the tokenization is throwing away encoded information.

That is 11 should be two tokens representing 1 and 10. That way the relationship can be more easily learned.

Otherwise it needs to memorize the relationship between every pair of numbers. Which is exponentially more difficult.

nl · on March 29, 2023

Yes this is a fair point, and if that is what they meant I could agree it is worth trying. As pointed out down-thread the Llama tokenizer is supposed to spilt numbers into their digits.

vintermann · on March 30, 2023

The tokens can be variable length too. If "911" appears often enough in the data used to make the BPE encoding, it may get assigned its own token. If it does, the model has no way to "pick it apart" and learn that it actually constructed of a 9, a 1 and a 1. That information has been thrown away, and must be laboriously re-learned from context during training, for those contexts where it's relevant (For instance, if it's part of a math task, or you need a word to rhyme with "been shot down" in a song in the style of Wyclef Jean).

It's quite possible that this trade-off is worth it, that making math and poetry harder to learn isn't a big deal because of the savings elsewhere (models can still learn it eventually, as we have seen).

But BPE isn't exactly elegant. It's itself a learned encoding, trained on a different objective from the rest of the system. In the rest of the training, we let the GPT model decide how to assemble lower-level concepts into higher-level ones, and which information to ignore. On the lowest level we've hardcoded it for the model, and we know the model sometimes have to work hard to undo our encoding.

kristjansson · on March 30, 2023

Bear in mind that the token ids are essentially indicies into a matrix. Sequential IDs have no more meaning than two columns being ‘next to’ each other in a dataset.

wongarsu · on March 29, 2023

even worse than sequential things not being sequential: 10000 is tokenized as one token (token 33028), while 10050 is two tokens, 100 (token 3064) and 50 (token 1120)

sillysaurusx · on March 29, 2023

Again, repeating the myth that BPE is the reason for GPT being bad at math doesn't make it true. (Nor does labeling it a myth make it false.) It hasn't been demonstrated that this is the reason. And it's not merely a pedantic point – this is an interesting look into why models behave the way they do.

To get us both on the same page, let's restate the arguments.

- Your position is that in a hypothetical world where a GPT-3 was trained without BPE, yet somehow still managed to maintain the same context window as BPE (i.e. around three to five pages of text), it would be superior at math, because it can see the individual digits.

- My position is that it can still see the same context window, and the context window is all that matters. Regardless of whether you tokenize it using OpenAI BPE or LLaMA BPE or digit level, model performance isn't tied to it. What matters is parameter count, training time, and dataset.

If you were to pit the two models against each other, given the same dataset and same amount of training, the BPE model would likely perform significantly better at math. The larger the model, the more it can do, and the better the performance at any individual task. And the BPE model would perform better because the compression allows it to learn at a faster rate -- i.e. the same amount of training computation spent on the BPE'd model gives you much more impact than computation spent on the byte level model.

In other words, the larger context window is the whole point. You get 2048 tokens. If you make it 2048 bytes, then you get a very limited view of the input text, which harms overall performance. But if you train it on specifically math problems, then of course it'll achieve superior performance, and not simply because it has an unscrambled view of the digits.

As I said before, this is like saying that JPG compression is harmful to our understanding of an image. But it's clearly not. The JPG may be composed of wavelets, but what matters is the overall signal -- and that signal is what GPT is being trained on (the understanding of language), not the individual characters.

This also explains why tokenization doesn't seem to affect performance across languages that much. Russian tokenizes into many more characters than English, yet GPT can still achieve pretty good performance on it. This isn't because the characters are scrambled, but rather because it's seen many more examples of English than Russian.

nl · on March 30, 2023

I think there is an issue in vocabulary size though.

If I'm correct in thinking that the OP is proposing that what Llama does ("Notably, we split all numbers into individual digits") explicitly this allows the model to treat a number as a sequence of digits so it can learn how math works.

This is especially important with very rare numbers. Take a number the GPT has never (or hardly ever) seen in its training data:

832408345770928764

The GPT-3 tokenizer tokenizes[1] that into:

83,24,08,345,7,709,287,64

To some degree this is forced to occur by the use of raw BPE encoding and the vocabulary size (175K in the case of GPT-3).

Now consider the string:

832408345770928764 + 37

The model presumably has learnt something like "if all tokens are in this area (where the area is "numbers" in the token-space) and they are followed by a + sign then we don't just append the string, instead we swap the last token for another one"

But of course this is insufficient in this case - it needs to learn carrying rules to also increment the next token. As is speculated in https://arxiv.org/abs/2212.10559, it's possible there's a relationship between the depth of the model and the length of chained rules it can learn, and because of the number of multi-digit tokens it has learnt these rules are unnecessarily complex, and incomplete.

If these tokens were instead single digits the rules would be much simpler, and it's possible the model could actually learn the real rules of math instead of the subset of semi-memorized things it has at the moment.

[1] https://platform.openai.com/tokenizer

wongarsu · on March 29, 2023

What about a tokenizer where the digits 0-9 just get the tokens 0-9, all other tokens are derived in the usual BPE manner (except that no token other than 0-9 is allowed to contain digits). Such a model wouldn't have a notable reduction in context size and should behave largely the same in terms of training speed and language tasks, but if sbierwagen is right it might be significantly better at math (and at least on an intuitive level this seems to make sense).

sebzim4500 · on March 29, 2023

This is pretty much what LLaMA did, for what it's worth. Unfortunately there are so many other architectural and training differences that comparing the arithmetic ability of LLaMA and GPT-3 would not settle this debate.

nl · on March 29, 2023

> This is pretty much what LLaMA did, for what it's worth.

I don't think that's exactly accurate. The GP proposed:

>> What about a tokenizer where the digits 0-9 just get the tokens 0-9, all other tokens are derived in the usual BPE manner (except that no token other than 0-9 is allowed to contain digits).

The paper says:

> We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from Sentence-Piece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters.

I read this as sayng the a number like 1002 would be split into 1,0,0,2 and then represented by the tokens IDs that point to those digits. These are not the same things.

Interestingly, the code FB provided doesn't seem to have any special handling for digits: https://github.com/facebookresearch/llama/blob/main/llama/to...

sebzim4500 · on March 30, 2023

>These are not the same things.

You mean because the token ids are in a different order? This is irrelevant to a transformer model, there is no inductive bias that similar tokens should have related token ids.

>Interestingly, the code FB provided doesn't seem to have any special handling for digits

Yeah, token handling is done in sentencepiece.

nl · on March 29, 2023

There's nothing in a NN that would make the token order important.

sebzim4500 · on March 29, 2023

I think you are likely wrong, but given neither of us are going to spend millions of dollars training two versions of GPT-3 we will have to agree to disagree. Meta seems to agree with me, since when they trained LLaMA they used one token per digit.

sillysaurusx · on March 29, 2023

> neither of us are going to spend millions of dollars training two versions of GPT-3 we will have to agree to disagree

I encourage you to apply to TRC https://sites.research.google/trc/about/

You'll be able to access millions of dollars worth of compute. It's how I got my start.

I love when I'm mistaken, since that's how science is pushed forward – we can't be certain we're right, only that we're not wrong yet. So it would be delightful if you formulate this into a testable hypothesis and falsify it yourself.

You can use The Pile to train your GPT: https://pile.eleuther.ai/

m00x · on March 30, 2023

We do the same thing in our brains.

1 is a token in our brain, so is 2, and 3.

vintermann · on March 30, 2023

Well, it's certainly not the lowest level token - to the degree we can talk about tokens in our brains in the first place.

We can talk and think about such things as the shape of the letters, which it would take a Herculean effort for GPT-like models to learn, since it gets letters in a form where that information has been stripped away.

alpaca128 · on March 30, 2023

As pointed out by gerad above, we humans don't treat multi-digit numbers as one atomic token. We know 11 = 10 + 1. But if an AI model has no access to individual digits no matter how large the number is, then the task of doing simple maths is exponentially harder to learn.

f_devd · on March 29, 2023

> If you trained GPT specifically on arithmetic tasks

Sure but you'd have a lot of overlapping tokens with BPE, which doesn't help with convergence. GP is claiming it's specifically worse at arithmetic because of BPE which is true.

forgotusername6 · on March 30, 2023

Complete tangent here. The current usage of the word germane apparently comes from a line in Hamlet. Try asking chatGPT to quote the line and explain it to you. When I tried it failed utterly, not in the explanation but in returning to the actual quote.

stormfather · on March 29, 2023

Are you referring to how BPE is permutation invariant? (ignoring positional encoding)

PeterisP · on March 30, 2023

The general tokenization screws up performance on some very specific niche tasks, which are not really used in training or evaluation, but are used a 'toy examples' by people who demo or test the system, resulting in stupid outcomes that are then (IMHO unfairly) treated as informative of the model's general performance.

I.e. "how many letters are in a word" or "which is the fifth letter of this word" type tasks is screwed up because the model has no information about the length of each token other than what it can infer from seeing some words in similar challenges in children learning materials; so it knows the length of some tokens (i.e. those used in simple words in similar materials) but not all of them.

"does 123912879212392 contain the digit 7" can fail because the model has limited information that the token "79" in the middle is somehow related to the token "7". If you ask whether "87" is contained in that number, then it has to deduce that from seeing tokens "128" and "79", both of which have no direct relation to the digits in them.

"is 50100 greater than 50200" becomes a tricky challenge because due to its tokenization it tries to compare "501"/"00" with "50"/"200" while knowing that "501" is greater than "50"; just as comparing "10000" with "100"/"50". We use Arabic numerals because their positional encoding makes arithmetic easier, but language models receive the same data without positional encoding, they get given something that's more like an extreme version of Roman numerals.

I mean, all of it can be learned if given sufficient data, however, we shouldn't be surprised if some tasks seem more difficult than they should be if we have explicitly crippled the representation we use to communicate these tasks to the system. And it doesn't neccesarily mean that we should change the tokenization (because these tasks are toy challenges not relevant to most real functioning), we just shouldn't use these types of letter/digit challenges to judge the quality of models.

bertday · on March 30, 2023

I’d argue that tokenization is more analogous to having a losslessly encoded image where you don’t have access to the decoder. Humans don’t perform well on that one.

nl · on March 30, 2023

http://nlg.csie.ntu.edu.tw/~cjchen/papers/eacl2023.pdf (which was pointed to downthread) indicates tokenizing and converting to scientific notation gives a 5% increase in performance on a "comparing numbers" benchmark.

croddin · on March 29, 2023

I'm not sure if these models use the GPT tokenizer, but if you type a long string of numbers into https://platform.openai.com/tokenizer, you can see the tokens that the LLM would see. What the LLMs get as input for math is significantly worse then having to do mental math with roman numerals, tokenizing makes sense for words but for numbers it seems the like LLMs would have to learn a lot more steps. I wonder if limiting number tokens to 2 digits per token, instead of the 1-3 (and even more for repeated digits) it currently is would improve models math.

thequadehunter · on March 29, 2023

I've been wondering...does it even matter at this point when you can just pull out a calculator in langchain?

Also side note...I went to high school with the author and hadn't heard about him for a while lol. That was a neat surprise.

spaethnl · on March 30, 2023

I suspect it does matter because the inability to understand arithmetic likely effects higher level learning. For example, If you want it to learn statistics, it likely needs a good understanding of arithmetic.

All tasks that rely on it will perform worse than expected.

thequadehunter · on March 30, 2023

Oh, I see. It's about a cap in baseline ability rather than whether it's technically possible to give it the extension?

est · on March 30, 2023

> does it even matter at this point when you can just pull out a calculator in langchain?

That's alrithmetics. For math problems:

Each alien chicken has 3 legs, each alien rabbit has 7 legs, for a total of 33 legs, how many alien chicken are there?

Now try to calculator that!

sbierwagen · on March 30, 2023

A couple people have used LLMs as frontends to theorem provers/SAT solvers, which would handle that fairly easily: https://twitter.com/AlbertQJiang/status/1584877475502301184 https://twitter.com/ericzelikman/status/1605385835624378368

thequadehunter · on March 30, 2023

Apparently it can.

https://writings.stephenwolfram.com/2023/03/chatgpt-gets-its...

computerex · on March 30, 2023

https://github.com/hwchase17/langchain/tree/master/langchain...

Model would probably take that description and write python code to compute the answer.

spyder · on March 29, 2023

There was a paper where they have found that converting the numbers to scientific notation ( like 1.5e-7) has improved these transformer-based language models at math, if I remember correctly. (with a quick search I could not find the link to it now)

nl · on March 30, 2023

Interesting.

It might be http://nlg.csie.ntu.edu.tw/~cjchen/papers/eacl2023.pdf

They show an increase from 65% to 70% on their "comparing numbers" benchmark.

spyder · on March 30, 2023

That's also an interesting one, but I think I found it, but it's a little different from what I remembered because it's not just the scientific notation that helps but more so the addition of positional tokens:

https://arxiv.org/abs/2102.13019

The essence of it from the abstract:

". In particular, the model fails to learn addition of five-digit numbers when using subwords (e.g., “32”), and it struggles to learn with character-level representations (e.g., “3 2”). By introducing position tokens (e.g., “3 10e1 2”), the model learns to accurately add and subtract numbers up to 60 digits. We conclude that modern pretrained language models can easily learn arithmetic from very few examples, as long as we use the proper surface representation."

nl · on March 30, 2023

That's a really interesting paper too. Thanks!

cma · on March 29, 2023

Llama individually tokenized digits, how much did it fix the issue?

nl · on March 30, 2023

The most relevant reported benchmark results are on https://arxiv.org/abs/2103.03874

They compare it against PaLM and Minerva, both of which are fine tuned on math datasets.

It outperforms both at roughly the same parameter count (eg Llama 65B vs Palm and Minerva 62B) but unclear how much of this is due to encoding vs the many other differences.

It is useful to see that the performance increase is clearly not due to fine turning though.

cma · on March 30, 2023

Where do I look, the paper was from 2021 before llama but did they update it (or you mean the llama paper mentions that benchmark)?

nl · on March 30, 2023

I posted the wrong link sorry!

https://arxiv.org/pdf/2302.13971.pdf (the Llama paper), table 7.

jquery · on March 30, 2023

I just fed this to ChatGPT 4.

Q: “Is 10000 bigger than 10050”

A: “No, 10,000 is not bigger than 10,050. 10,050 is 50 units larger than 10,000”

sbierwagen · on March 30, 2023

Tried GPT4ALL-finetuned LLaMA-7B:

>>> Is 10000 bigger than 10050?

>Yes, 10000 is greater in magnitude than or exceeds the value of 10050. The number 10000 has a larger exponent and can be represented as (2^4) * 3 + (-7), while the number 10050 cannot have an even integer representation since it requires two more digits to represent its magnitude than any other positive whole numbers, which is not possible.

Well that's not right. Restarted with a new seed.

>>> Is 10000 bigger than 10050?

>Yes, because 10000 > 10050

>>> Is 10,000 bigger than 10,050?

>No, they are equal. The difference between them is only one digit (i.e., the last number). Therefore, it's not possible to tell if either value is greater or lesser by just looking at their values without knowing more information about what those numbers represent and how they were obtained in the first place.

Oooof. The LLaMA tokenizer didn't save it here.

rikimaru0345 · on March 30, 2023

> The difference between them is only one digit (i.e., the last number). Therefore, it's not possible to tell if either value is greater or lesser by just looking at their values without knowing more information about what those numbers represent and how they were obtained in the first place.

That one is especially hilarious. But the part at the end "how they were obtained" is really strange. Where in its dataset would it possibly have learned such nonsense? Doesn't matter where numbers come from to compare them.

It implies that it doesn't understand what numbers even are in general, and that giving it a calculator (that it can use perfectly) only masks a much deeper problem.

I mean, I'm reading a ton of people say "its not just pattern recognition and token prediction, it has emergent properties!!!!" and from experimentation I believe it.

But if the models can pick up language and its intricacies, and even do simple logic tasks, shouldn't it also be able to pick up on what numbers are and how they work? At least knowing that where a number came from doesn't matter when its just about comparing their value in a pure mathematical sense?

What does that mean for concepts other than numbers? Do those models fake a LOT more than we already believe they do?

sbierwagen · on March 31, 2023

I'm inclined to think that the model is just too small, or that integer quantization damaged it badly. It has the GPT-2 trait of getting "stuck", (mode collapse?) if it gives a bad response once, it will tend to repeat itself over and over. (Repetition was also a bad GPT-2 trait)

It's important to note that 7 billion parameters really is very small. 20 times smaller than GPT-3. Smaller still than ChatGPT or GPT-4. I find it plausible that in the future there will be distilled models substantially smaller than GPT-3 but with all its power, but GPT4ALL-LLaMA-7B isn't it.

jquery · on March 30, 2023

Ok those are funny

anon305uklfh · on April 7, 2023

I tried to let it justify itself. It does bring up rounding, but still fails to justify itself. ... main: seed = 1680888846 ... == Running in chat mode. == ...

> Is 10000 bigger than 10050?

Yes, 10000 > 10050 > Why do you think 10000 is bigger than 10050?

Because the number after decimal point in both numbers are different. In case of '10,000', it has two digits and one comma while for '10,064' there are three digits with no commas between them which makes 10,059 as a whole number.

> Can you explain that better?

Sure! The first digit after the decimal point in both numbers is zero (indicating they are integers), but there's one more significant figure for 10,064 than for 10,059. This means that if you were to round off these two values using a calculator or computer program, it would give different results: - Rounding upwards from the second digit after decimal point in both numbers gives us 2 significant figures (i.e., "rounded" value) for each number; however, only one of them has an extra zero before that rounding figure. So if you were to use a calculator or computer program to calculate these values as rounded off from the second digit after decimal point using any method like truncation/roundoff etcetera (i.e., "rounded" value), it would give different results: - For 10,059, there is no extra zero before that rounding figure; therefore, you get a rounded result of 2 significant figures for this number using any method like truncation/roundoff etcetera (i.e., "rounded" value). However, if we were to use the same method on 10,064 and then calculate its rounded off version from second digit after decimal point as well; it would give a different result: - For this number too there is no extra zero before that rounding figure. Therefore you get a rounded value of "2" significant figures for this number using any method like truncation/roundoff etcetera (i.e., "rounded" value).

> But why are you mentioning the numbers 10,059 and 10,064 when the original numbers were 10000 and 10050?

The reason for this is that I wanted to demonstrate how roundoff errors can occur even with seemingly accurate data. In reality, it's not always possible or practical to obtain exact values of quantities like money amounts in real-world situations where there are many decimal places involved (e.g., 10,059 vs. 10,064). Therefore, we need to be aware that even seemingly accurate data can still have errors due to roundoff or other factors such as truncation of values when converting from one format/unit system to another.

> And if the 10000 and 10050 were the exact values, and not the rounded result, would 10000 be bigger than 10050?

Yes! If we had two numbers that exactly matched each other (i.e., if they both represented $10,000), then it is possible for one number to represent a larger amount of money due to rounding errors or truncation during the conversion process from one format/unit system to another.

> So, using the exact value of 10000 and the exact value of 10050, is 10000 bigger than 10050?

Yes! If we had two numbers that exactly matched each other (i.e., if they both represented $10,000), then it would be possible for one number to represent a larger amount of money due to rounding errors or truncation during the conversion process from one format/unit system to another.

> Is 10000.0 bigger than 10050.0?

Yes! If we had two numbers that exactly matched each other (i.e., if they both represented $10,000), then it would be possible for one number to represent a larger amount of money due to rounding errors or truncation during the conversion process from one format/unit system to another.

>

Still failed with the decimal.

bob1029 · on March 30, 2023

The more parameters you have, the less error you will find on these prompts. This is the emergent frontier.

MuffinFlavored · on March 29, 2023

as an LLM how do we train it to know when to call a “math” plugin/API? (feed its input, go get output)

sebzim4500 · on March 29, 2023

Either finetune with a few thousand example, or just tell it in the prompt to call the API when it needs to do maths (in context learning).

discordance · on March 29, 2023

In the pipeline, first run a classification model to determine what class of question is being asked.

Then engage the most appropriate plugins.

ftxbro · on March 29, 2023

For context, the cerebras models were trained in only a couple weeks and the purpose of the training was to establish a scaling law for compute-optimal training, presumably for predicting what will happen with larger models where it's more important to train in a compute-optimal way. This is a different goal than that of other research projects that try to get the most power per VRAM on small models.

freeqaz · on March 29, 2023

That makes sense especially since they're not intending to deploy this model to production. For models like GPT-3/4 it makes sense why they would train them more because the costs of running the inference "in production" likely dominates the compute costs. (Just like how Youtube will spend 50x more compute to compress a video an extra 2% because bandwidth costs far outstrip the compression costs.)

Do you know what percentage, roughly, this model has been trained relative to something like LLaMA? Are we talking 10%? 50%? 90%?

It may be possible that it is still useful if it can be trained further by the community!

gpm · on March 29, 2023

LLaMa 65B and 30B were trained on 1.4 trillion tokens. This model was trained on 260 billion tokens.

freeqaz · on March 29, 2023

So ~18.6% trained relative to LLaMa. That's not _nothing_ but it's also not great. Thanks for digging into this!

pama · on March 29, 2023

The “law” was previously established empirically already and is only of relevance as a technical detail to a few specialists that may care. I think it was a strategic mistake to only release models that are weaker than what people can get their hands on. Is there a limit on that hardware scaling to larger models? As a hardware company that tries to stay in the game they should show some signs of dominance, not just Apache license.

UncleEntity · on March 29, 2023

> The “law” was previously established empirically already and is only of relevance as a technical detail to a few specialists that may care.

Or, you know, customers who want a quote on how much it’ll cost to train on their data.

Not everyone has unlimited budgets to train a model to state of the art performance for giggles.

carlsborg · on March 30, 2023

The LLM model performance is a sideshow. They are demoing linear scaling on their "cluster on a wafer" tech (a vertically scaled single chip "the largest computer chip ever built and the fastest AI processor on Earth") vs distributed training.

They have a cloud service offering for example fine tuning and pretraining well known LLM models on their chips. (est. $40k to train GPT3 6.7B in ~11 days)

varunkmohan · on March 30, 2023

Sad, this whole thing where companies train chinchilla optimal models has to stop - I'm always happy to get open source models but it would be great if it was useful rather than a benchmark.

nl · on March 29, 2023

Tl;DR: It's quite a lot worse than LLama 13B in every way that is measured. It benchmarks between GPT-J (6B) and GPT-NeoX-20B, although is worse than GPT-J (6B) in a few benchmarks.

There are no good benchmarks for text generation, but it would be surprising if it was competitive with the best open models.

It's unclear how much RLHF can help here - AFAIK no one has tried out DataBrick's Dolly to compare a GPT-J based RLHF model.

breadchris · on March 29, 2023

Wow! I didn't even know that Cerebras was a thing and I have been trying to keep up to date with this stuff!

imaurer · on March 29, 2023

Submit new links here :)

https://github.com/imaurer/awesome-decentralized-llm

ftxbro · on March 29, 2023

other hacker news cerebras discussion is here: https://news.ycombinator.com/item?id=35343763

dumbaccount123 · on March 29, 2023

[flagged]

knicholes · on March 29, 2023

Nobody is stopping you from living in a cave! ... at least I don't think so.

uejfiweun · on March 29, 2023

I dunno man, you seen cave prices these days? The cave market is in a tough spot until the fed starts cutting…

flangola7 · on March 29, 2023

> Nobody is stopping you from living in a cave!

A bunch of tights wearing twats manufactured this idea of "private land" and now violence is waged against those seeking cave life.

Also caves don't stop the Shoggoth from finding you.