Optimized small model training is not only important for availability but also f...

azath92 · 2025-08-14T15:24:21 1755185061

Totally agree. One of the most interesting podcasts I have listened to in a while was a couple of years ago on the Tiny Stories paper and dataset (the author used that dataset), which focuses on stories that only contain simple words and concepts (like bedtime stories for a 3 year old), but which can be used to train smaller models to produce coherent English, with grammar, diversity, and reasoning.

The podcast itself with one of the authors was fantastic for explaining and discussing the capabilities of LLMs more broadly, using this small controlled research example.

As an aside: I don't know what the dataset is in the biological analogy, maybe the agar plate. A super simple and controlled environment in which to study simple organisms.

For ref:
- Podcast ep: https://www.cognitiverevolution.ai/the-tiny-model-revolution...
- TinyStories paper: https://arxiv.org/abs/2305.07759

momojo · 2025-08-14T17:08:05 1755191285

I like the agar plate analogy. Of course, the yeast is the star of the show, but so much work goes into prepping the plate.

As someone in biotech, 90% of the complaints I hear over lunch are not about bad results, but about bad mistakes during the experiment. E.g. someone didn't cover their mouth while pipetting and the plates are unusable now.

re5i5tor · 2025-08-16T04:43:08 1755319388

Ha! I remember where I was when I listened to that episode (Lakeshore Drive almost into Chicago for some event or other) — thanks for triggering that memory — super interesting stuff

willvarfar · 2025-08-14T14:13:49 1755180829

(there are also lots of private company datasets like e.g. user purchase history that can be used with small models to solve real business problems. All the advances in 'large' language models can be leveraged and applied to small problems if the input sequences can be represented as a special custom language.)
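One way to picture "a special custom language" is to turn each purchase event into a couple of tokens. This is only a sketch under assumptions: the event fields, the `cat:`/`price:` token names, and the helpers `build_vocab` and `encode` are all hypothetical and not taken from the comment above.

```python
# Hypothetical purchase history for one user: (category, price bucket) pairs.
events = [("shoes", "mid"), ("socks", "low"), ("shoes", "high")]


def build_vocab(sequences):
    """Assign an integer id to every token seen in the given sequences."""
    vocab = {"<bos>": 0, "<eos>": 1}
    for seq in sequences:
        for category, bucket in seq:
            for token in (f"cat:{category}", f"price:{bucket}"):
                # setdefault only adds the token if it is not already known.
                vocab.setdefault(token, len(vocab))
    return vocab


def encode(seq, vocab):
    """Turn one purchase history into a list of token ids."""
    ids = [vocab["<bos>"]]
    for category, bucket in seq:
        ids.append(vocab[f"cat:{category}"])
        ids.append(vocab[f"price:{bucket}"])
    ids.append(vocab["<eos>"])
    return ids


vocab = build_vocab([events])
token_ids = encode(events, vocab)
print(token_ids)
```

The resulting id sequences could then be fed to a small sequence model in the same way text tokens would be.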

tmule · 2025-08-14T16:05:23 1755187523

Unfortunately, as things stand, it’s well-known that behaviors and optimizations in small scale models fail to replicate in larger models.

yorwba · 2025-08-14T19:01:57 1755198117

Doing hyperparameter sweeps on lots of small models to find the optimal values for each size and fitting scaling laws to predict the hyperparameters to use for larger models seems to work reasonably well. I think https://arxiv.org/abs/2505.01618 is the latest advance in that vein.
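A rough sketch of the general idea, assuming purely for illustration that the best learning rate follows a power law in parameter count (best_lr ≈ a · N^b). The sweep numbers below are synthetic, `fit_power_law` and `predict` are hypothetical helpers, and the linked paper's actual method may well differ.

```python
import math


def fit_power_law(sizes, best_values):
    """Least-squares fit of best_value = a * size**b, done in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in best_values]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, size):
    """Extrapolate the fitted power law to a new model size."""
    return a * size ** b


# Synthetic "sweep results": best learning rate per small-model size,
# generated from lr = 0.5 * N**-0.3. These are NOT real measurements.
sweep_sizes = [1e6, 1e7, 1e8]
sweep_best_lr = [0.5 * n ** -0.3 for n in sweep_sizes]

a, b = fit_power_law(sweep_sizes, sweep_best_lr)
lr_for_1b = predict(a, b, 1e9)
print(f"fitted exponent b = {b:.3f}, extrapolated lr at 1e9 params = {lr_for_1b:.2e}")
```

With real sweeps the points would be noisy, so the quality of the fit (and how far it is extrapolated) matters a lot more than in this toy case.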

swyx · 2025-08-14T21:22:44 1755206564

the problem is that the eval processes don't really work here if you believe in "Emergent Abilities" https://arxiv.org/abs/2206.07682

exasperaited · 2025-08-14T22:06:13 1755209173

Which we probably should not, at least not the "sudden" emergence that those researchers claimed to see.

https://arxiv.org/abs/2304.15004

Good article about why here; this helped me understand a lot:

https://www.wired.com/story/how-quickly-do-large-language-mo...

jychang · 2025-08-14T23:00:27 1755212427

Why not? It takes models of a certain size to contain xyz neuron/feature.

https://www.youtube.com/watch?v=AgkfIQ4IGaM

That's not a mirage, it's clearly capability that a smaller model cannot demonstrate. A model with fewer parameters and fewer hidden layers cannot have a neuron that lights up when it detects a face.

yorwba · 2025-08-15T07:42:19 1755243739

Consider a single-neuron model that just pools all pixels in an image together. It's possible for the average activation of this neuron to be exactly the same on faces and non-faces, but extremely unlikely given the large range of possibilities. So in aggregate, this neuron can distinguish faces from non-faces, even though, when you apply it to classifying a particular image, it'll be better than random only by an extremely tiny amount.

As the number of neurons increases, the best face/non-face distinguisher neuron gets better and better, but there's never a size where the model cannot recognize faces at all and then you add just a single neuron that recognizes them perfectly.
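A toy calculation of this argument, under stated assumptions: the pooled activation is modeled as Gaussian with the same variance for both classes, the classes are balanced, and combining k independent, equally weak detectors scales the class separation by sqrt(k). None of these numbers come from a real vision model.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def best_threshold_accuracy(separation):
    """Accuracy of the optimal threshold between two equal-variance Gaussians
    with balanced classes; `separation` is the gap between the class means
    in units of the shared standard deviation."""
    return normal_cdf(separation / 2.0)


# Assumed toy number: a single pooled neuron separates faces from
# non-faces by only 0.01 standard deviations.
single_neuron_separation = 0.01

# Assumption: averaging k independent weak detectors multiplies the
# separation by sqrt(k).
accuracies = {
    k: best_threshold_accuracy(single_neuron_separation * math.sqrt(k))
    for k in (1, 100, 10_000, 1_000_000)
}
for k, acc in accuracies.items():
    print(f"{k:>9} detectors -> accuracy {acc:.4f}")
```

In this model the single detector is only barely better than chance, and accuracy climbs smoothly with k rather than jumping at any particular size.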

jychang · 2025-08-16T06:34:05 1755326045

> there's never a size where the model cannot recognize faces at all

True

> then you add just a single neuron that recognizes them perfectly

Not true.

Don't think in terms of neurons, think in terms of features. A feature can be spread out over multiple neurons (polysemanticity); I just use a single neuron as a simplified example. But if those multiple neurons perfectly describe the feature, then all of them are important to describe the feature.

The Universal Approximation Theorem implies that a large enough network to perfectly achieve that goal would exist (let's call it size n or larger), so eventually you'd get what you want between 0 and n neurons.

yorwba · 2025-08-16T08:33:43 1755333223

> if those multiple neurons perfectly describe the feature, then all of them are important to describe the feature.

You could remove any one of those neurons before retraining the model from scratch and polysemanticity would slightly increase while performance slightly decreases, but really only slightly. There are no hard size thresholds, just a spectrum of more or less accurate approximations.

victorbjorklund · 2025-08-14T17:16:36 1755191796

Which in itself is very interesting and requires study.

anvuong · 2025-08-14T19:16:02 1755198962

It mostly has to do with sparsity in high-dimensional space. When you scale things to the extreme, everything is very far away from everything else, the space is sparse, random vectors have a very high chance of being nearly orthogonal, etc. All of this makes optimization incredibly slow and difficult. Just another facet of the so-called "curse of dimensionality".
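The near-orthogonality part is easy to check numerically. A small sketch using only the standard library; the dimensions, the number of pairs, and the helper names are arbitrary choices for illustration, and it says nothing by itself about optimization difficulty.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly at random by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_abs_cosine(dim, pairs, rng):
    """Average |cosine similarity| over random pairs of unit vectors."""
    total = 0.0
    for _ in range(pairs):
        u = random_unit_vector(dim, rng)
        w = random_unit_vector(dim, rng)
        total += abs(sum(a * b for a, b in zip(u, w)))
    return total / pairs


rng = random.Random(0)  # fixed seed so the run is repeatable
results = {
    dim: mean_abs_cosine(dim, pairs=200, rng=rng)
    for dim in (2, 20, 200, 2000)
}
for dim, value in results.items():
    print(f"dim={dim:>4}: mean |cos| = {value:.4f}")
```

The average |cosine| shrinks roughly like 1/sqrt(dim), so in thousands of dimensions two random directions are almost always close to perpendicular.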

jebarker · 2025-08-14T16:43:01 1755189781

Well-known but not well-understood

jph00 · 2025-08-14T20:51:27 1755204687

That's not widely true. E.g. the GPT-4 tech report pointed out nearly all their experiments were done on models 1000x smaller than the final model.

tmule · 2025-08-15T00:46:12 1755218772

Fair point, though I’d argue that there’s inherent selection bias for improvements that could fit a scaling law curve in the small model regime here.

indoordin0saur · 2025-08-14T17:16:40 1755191800

But why? If we don't know why then how do we figure it out?

leopoldj · 2025-08-14T15:48:46 1755186526

What the author is doing here is pre-training. This is something that usually only model makers like Google and Meta need to do. Most businesses are much better off doing fine-tuning or, to a lesser extent, continued pre-training. The author is doing this for academic reasons.

smeeth · 2025-08-14T14:19:52 1755181192

I've been annoyed for a while that people don't use a common parameter weight/compute budget for benchmarking papers.

That said, it does make it easier to claim progress...

pizza · 2025-08-14T17:23:23 1755192203

https://github.com/KellerJordan/modded-nanogpt is pretty great in that respect

godelski · 2025-08-14T23:48:31 1755215311

As a researcher, I can totally agree, but at the same time this isn't super straightforward. Things get weird because you can't just translate from one GPU to another. There isn't a clean calculation for that. There are also other issues like parallelism. Sure, your model is stable with a batch size of 8192 but that's across 1 node, it might not be stable with that batch across 2 nodes. This is a really frustrating part and honestly I don't think most people are even aware such issues exist.

Right now I'm just happy when people are including parameter counts, GMACs (or FLOPs), and throughput. I always include those and the GPUs I used. I also frequently include more information in the appendix but frankly when I include it in the front matter the paper is more likely to be rejected.

I can tell you why this isn't happening though. There's a common belief that scale is all you need. Which turns into "fuck the GPU poor". I've published works where my model is 100x smaller (with higher throughput, and far lower training costs), and the responses from reviewers tend to be along the lines of "why isn't it better?" or "why not just distill or prune a large model?" There's this weird behavior that makes the black box stay a black box. I mean Yi Tay famously said "Fuck theorists" on Twitter.

ai-christianson · 2025-08-14T14:27:07 1755181627

I'm interested in one that can run fast on a laptop, but training can take a few days (maybe even longer) on the same laptop.

biophysboy · 2025-08-14T13:16:51 1755177411

It’s a fun analogy because the data “environment” of the model being trained matters a great deal

jebarker · 2025-08-14T13:52:00 1755179520

Exactly. YOLO runs of frontier models with a single random seed/data shuffle are pretty limited for trying to study the “molecular biology”. I actually like to think of LLM understanding as being like biology in the 1850s. There's lots of inspiration to be found in how biology has advanced since then and the types of experiments we might run to better understand LLMs.

biophysboy · 2025-08-14T16:43:52 1755189832

It's something I keep thinking about when I see all these deep-dives by Anthropic on the "genetics" of LLMs. I see the emergent properties of LLMs as inseparable from their data environment. If the organization/prevalence of text online was different, I think Anthropic would see different "genetics". As the amount of LLM-generated text grows, I think it will become more clear that the "fundamental unit" is their relationship.

moojacob · 2025-08-14T16:32:30 1755189150

Enough with big data! Who's working on small data? https://www.youtube.com/watch?v=eDr6_cMtfdA&pp=ygUKc21hbGwgZ...

arethuza · 2025-08-14T14:27:20 1755181640

Thanks - that's one of the most interesting comments I've seen about LLMs.

Makes me want to try training a model to sing "Daisy, Daisy..."