Optimized small model training is not only important for availability but also for the scientific study of LLMs. It’s like the use of simple organisms like yeast for biological studies - we also need to study the simplest possible transformers that exhibit behaviors of interest from the larger models if we hope to ever understand LLMs and have more control over their behavior.
Totally agree, one of the most interesting podcasts i have listened to in a while was a couple of years ago on the Tiny Stories paper and dataset (the author used that dataset) which focuses on stories that only contain simple words and concepts (like bedtime stories for a 3 year old), but which can be used to train smaller models to produce coherent english, both with grammar, diversity, and reasoning.
The podcast itself with one of the authors was fantastic for explaining and discussing the capabilities of LLMs more broadly, using this small controlled research example.
As an aside: i dont know what the dataset is in the biological analogy, maybe the agar plate. A super simple and controlled environment in which to study simple organisms.
I like the agar plate analogy. Of course, the yeast is the star of the show, but so much work goes into prepping the plate.
As someone in biotech, 90% of the complaints I hear over lunch are not about bad results, but about bad mistakes during the experiment. E.G. someone didn't cover their mouth while pipetting and the plates unusable now.
Ha! I remember where I was when I listened to that episode (Lakeshore Drive almost into Chicago for some event or other) — thanks for triggering that memory — super interesting stuff
(there are also lots of private company datasets like e.g. user purchase history that can be used with small models to solve real business problems. All the advances in 'large' language models can be leveraged and applied to small problems if the input sequences can be represented as a special custom language.)
Doing hyperparameter sweeps on lots of small models to find the optimal values for each size and fitting scaling laws to predict the hyperparameters to use for larger models seems to work reasonably well. I think https://arxiv.org/abs/2505.01618 is the latest advance in that vein.
That's not a mirage, it's clearly capability that a smaller model cannot demonstrate. A model with less parameters and less hidden layers cannot have a neuron that lights up when it detects a face.
Consider a single-neuron model that just pools all pixels in an image together. It's possible for the average activation of this neuron to be exactly the same on faces and non-faces, but extremely unlikely given the large range of possibilities. So in aggregate, this neuron can distinguish faces from non-faces, even though, when you apply it to classifying a particular image, it'll be better than random only by an extremely tiny amount.
As the number of neurons increases, the best face/non-face distinguisher neuron gets better and better, but there's never a size where the model cannot recognize faces at all and then you add just a single neuron that recognizes them perfectly.
> here's never a size where the model cannot recognize faces at all
True
> then you add just a single neuron that recognizes them perfectly
Not true.
Don't think in terms of neurons, think in terms of features. A feature can be spread out over multiple neurons (polysemanticity), I just use a single neuron as a simplified example. But if those multiple neurons perfectly describe the feature, then all of them are important to describe the feature.
The Universal Approximation Theorem implies that a large enough network to perfectly achieve that goal would exist (let's call it size n or larger), so eventually you'd get what you want between 0 and n neurons.
> if those multiple neurons perfectly describe the feature, then all of them are important to describe the feature.
You could remove any one of those neurons before retraining the model from scratch and polysemanticity would slightly increase while perfomance slightly decreases, but really only slightly. There are no hard size thresholds, just a spectrum of more or less accurate approximations.
It mostly has to do with sparsity in high dimensional space. When you scale things to the extreme everything is very far away from each other, the space is sparse, and random vectors have very high chance to be orthogonal, etc. All of these makes optimization incredibly slow and difficult. Just another facet of the so called "curse of dimensionality".
What the author is doing here is pre-training. This is something usually model makers like Google and Meta need to do. Most business are much better off doing fine-tuning or to a lesser extent continued pre-training. The author is doing this for academic reasons.
As a researcher, I can totally agree, but at the same time this isn't super straight forward. Things get weird because you can't just translate from one GPU to another. There isn't a clean calculation for that. There's also other issues like parallelism. Sure, your model is stable with a batch size of 8192 but that's across 1 node, it might not be stable with that batch across 2 nodes. This is a real frustrating part and honestly I don't think most people even are aware such issues exist.
Right now I'm just happy when people are including parameter, GMACs (or FLOPs), and throughput. I always include those and the GPUs I used. I also frequently include more information in the appendix but frankly when I include it in the front matter the paper is more likely to be rejected.
I can tell you why this isn't happening though. There's a common belief that scale is all you need. Which turns into "fuck the GPU poor". I've published works where my model is 100x smaller (with higher throughput, and far lower training costs), and the responses from reviewers tend to be along the lines "why isn't it better?" or "why not just distill or prune a large model?" There's this weird behavior that makes the black box stay a black box. I mean Yi Tay famously said "Fuck theorists" on twitter
Exactly. YOLO runs of frontier models with a single random seed/data shuffle are pretty limited for trying to study the “molecular biology”. I actually like to think of LLM understanding as being like biology in the 1850s. There's lots of inspiration to be found in how biology has advanced since then and the types of experiments we might run to better understand LLMs.
Its something I keep thinking about when I see all these deep-dives by Anthropic on the "genetics" of LLMs. I see the emergent properties of LLMs as inseparable from their data environment. If the organization/prevalence of text online was different, I think Anthropic would see different "genetics". As the amt of LLM-generated text grows, I think it will become more clear that the "fundamental unit" is their relationship.