As mentioned in the post, the smaller models are trained well past the "compute-optimal" amount of data, and I would expect they are well into diminishing returns. On the other hand, large models are good one-shot and few-shot learners and might be able to pick up enough context from your prompt alone to be usable, even if they weren't specifically trained on your use case.
In this context, compute-optimal isn't quite the same as diminishing returns. If you look at the loss graphs in the Llama paper, you can see that even the curves for the smaller models were still going down at the time training stopped and were nowhere near plateauing yet. LLMs are notoriously data-hungry and take a long time to reach convergence.
Compute-optimal here means the point at which it makes sense to move from a smaller model to a larger one, assuming that (a) you have a fixed compute budget in FLOPs, and (b) you want to train the best model possible with it. The problem is that this applies only to training and says nothing about the cost of inference. If you actually need to deploy these trained models and support them long term for hundreds, thousands, even millions of people to use, would you rather serve a 13B model or a 30B model of the same quality, even if the 13B model costs more to train?
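To make that concrete, here's a rough back-of-the-envelope sketch (Python) using the standard approximations of ~6ND FLOPs for training and ~2N FLOPs per generated token at inference. The token counts and model sizes below are made-up assumptions for illustration, not figures from the Llama paper:

```python
# Back-of-the-envelope comparison of training cost vs. lifetime inference cost.
# Approximations: C_train ~= 6 * N * D FLOPs, C_infer ~= 2 * N FLOPs per token.
# All numbers below are hypothetical, chosen only to illustrate the tradeoff.

def train_flops(n_params: float, n_train_tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return 6 * n_params * n_train_tokens

def inference_flops(n_params: float, n_served_tokens: float) -> float:
    """Approximate total inference compute in FLOPs over deployment."""
    return 2 * n_params * n_served_tokens

# Hypothetical scenario: a 13B model trained on 1T tokens vs. a 30B model
# trained on 400B tokens, both assumed to reach similar quality and both
# serving 10T tokens over their deployed lifetime.
small = {"params": 13e9, "train_tokens": 1.0e12}
large = {"params": 30e9, "train_tokens": 0.4e12}
served_tokens = 10e12

for name, m in [("13B", small), ("30B", large)]:
    t = train_flops(m["params"], m["train_tokens"])
    i = inference_flops(m["params"], served_tokens)
    print(f"{name}: train {t:.2e} FLOPs, inference {i:.2e} FLOPs, total {t + i:.2e}")
```

Under these assumptions the 13B model costs a bit more to train (7.8e22 vs. 7.2e22 FLOPs) but less than half as much in total once inference dominates, which is the whole argument for over-training the small models.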
There is going to be a point at which these models plateau and further improvement won't be possible without moving to a larger model, but Llama isn't quite there yet.