Hacker News
Tiny Language Models Come of Age (quantamagazine.org)
174 points by nsoonhui on Oct 8, 2023 | 78 comments



I'm a big believer that smaller models will take hold. The throwing paint at the wall strategy of having 100B+ models and random data won't scale. It's possible it will take new model architectures to get there. But having more fine-tuned control over model parameters is a pattern that will emerge. Then smaller, domain-specific models can be joined together to form larger models, if full generalization is needed.

I have an article on building micromodels that discusses some of this: https://neuml.hashnode.dev/train-a-language-model-from-scrat...


100%, I'd be amazed if we don't hit a time when the huge models we currently have are looked at the same way we see old-fashioned room-sized computers now.


I would take the other side of that bet for sure.

Modern computers don’t have less RAM than the old-fashioned room-sized computers did. The amount of RAM seems much more analogous here than the physical size of the hardware running the software. Of course the latter will go down, but the former? I don’t see why.


Techniques like distillation show that you can in fact get similar performance with 60% fewer parameters. Yes, it depends on making the large model first, but that only implies that the large model is somehow more useful in the initial learning. And if that's the case, then it would be rather surprising if we couldn't identify why and create new training techniques that go directly to the smaller equivalent model.

So in this way you've already lost your bet, as we already know that it's possible to create vastly smaller models with similar performance.
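
For anyone unfamiliar, the core of distillation is just training the small model to match the teacher's softened output distribution. A minimal sketch of the standard loss (the names, temperature and mixing weight here are placeholders, not from the article or any particular paper's settings):

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Blend cross-entropy on hard labels with a KL term pushing the
        student toward the teacher's temperature-softened distribution."""
        soft_targets = F.softmax(teacher_logits / T, dim=-1)
        soft_student = F.log_softmax(student_logits / T, dim=-1)
        kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce

    # Usage: run the teacher under torch.no_grad() to get teacher_logits,
    # then backprop this loss through the student only.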


60% is not vastly smaller in the way that I meant - we are talking about differences of three orders of magnitude in this article and that is what I am skeptical of.

Nevertheless, I have not actually seen a model exhibiting high-level LM capabilities (of the GPT-4 level) that was distilled.

A model matching perplexity with fewer parameters is not the same as matching performance as subjectively judged by humans interacting with the model.


Early days, yet. If we could cut the size by 60% every few years, we'd be looking at a situation akin to Moore's law where a performant AI might fit in my grandkid's pocket.


That doesn’t mean that a much larger model with high-quality training data and new training techniques won’t be more powerful than a smaller model. We have incredible efficiencies in modern computers in addition to immense resources; both are needed for the best performance. I suspect it’s the same for LLMs. Superior data, superior techniques, and superior scale will yield superior models.


Information density will increase in models just as spatial density increased for hardware.


Saying that information density will increase is very different from saying model sizes will decrease by a similar order of magnitude.

Regardless, this isn't the trend we've seen with software RAM usage - information density is only decreasing as hardware becomes more abundant and comparatively less engineering effort goes into fitting everything into as small a footprint as possible.


The computers I use are basically warehouse-sized now.


Very apt analogy, stealing this :)


> I'm a big believer that smaller models will take hold. The throwing paint at the wall strategy of having 100B+ models and random data won't scale.

But all these smaller models are based on those 100B+ models! The TinyStories work in the OP relies on larger models twice, to generate and then to evaluate. So does the recent wave of small models, which are all based on extracting data from GPT-3/4 to borrow the capabilities+RLHF for free.

You can talk about how interesting it is and how much of an overhang neural nets have in terms of being overparameterized (which is an important AI safety/capabilities issue - the first AGI will be the largest, slowest, and worst one, by a long shot), but the one thing it doesn't tell you is that smaller models can replace big models entirely. They are parasitic on said big models, so you still need to train the big models to begin with.

(The smaller models also seem to lose a lot of the qualitative capabilities of larger ones, like meta-learning. Their TinyStories model does only stories. I doubt you could get it to 'dax a blick'.)


Not all of them per se, take a look at something like Mistral. It's a 7B model displaying incredible performance. IMO, we still haven't even scratched the surface of what is possible with small LLMs. Especially not with pre-filtered/classified pre-training data. (Interesting LLMs based on their data approach and relatively small size: Qwen, InternLM, Mistral, Phi)


> Not all of them per se, take a look at something like Mistral. It's a 7B model displaying incredible performance.

I would, but they don't say what their dataset is anywhere that I can find, and the only thing they say about their instruction-tuned model is that it's trained on 'publicly available' datasets. You know, the ones where a lot of them turn out under the hood to be drawing from the OA API or other pretrained models in some way or another...

> Especially not with pre-filtered/classified pre-training data.

Indeed not! But what exactly is prefiltering or classifying all that data...?


In the context of this article, the small/tiny models were 1-30 million parameters and the large model was 1.5 billion parameters. Efficient training of a 7-billion-parameter model already requires algorithms for training across multiple GPUs, because the memory required for gradients and optimizer state will not fit in the typical 80 GB of RAM on current high-end GPUs.
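
Back-of-the-envelope numbers for why that is, assuming a standard mixed-precision Adam setup (fp16 weights and gradients plus fp32 master weights and moments; exact layouts vary by framework, and activations come on top):

    # Rough training-memory estimate for a dense 7B-parameter model with Adam.
    params = 7e9

    bytes_per_param = (
        2 +  # fp16 weights
        2 +  # fp16 gradients
        4 +  # fp32 master copy of the weights
        4 +  # Adam first moment (fp32)
        4    # Adam second moment (fp32)
    )

    total_gb = params * bytes_per_param / 1e9
    print(f"~{total_gb:.0f} GB before activations")  # ~112 GB, well past one 80 GB GPU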


Mind-boggling that 7B is now considered a small model. I think it's valid, given the preeminence of 70B+ models. But wow, the community really just leapfrogged over single-digit-billion parameter sizes.


Since GPT-3, OpenAI has been filtering their pre-training data, and I believe others have done so too.


I would like to have a more modular approach with more specialized training of the models.

Currently I can only really use smaller models for micro-tasks like sentiment analysis or classification, but any type of problem solving has to be left to GPT-4.


Agreed with the overall point (small models will prove powerful) but I disagree with:

> But having more fine-tuned control over model parameters is a pattern that will emerge

This assumes parameters do something interpretable. IMO, it’s more that parameters themselves are meaningless/dumb and patterns evolve via interaction of many such units (like in ants)


Empirically, perhaps due to the lottery ticket hypothesis, one needs to wait a lot longer for small models to reach a useful loss, so if one is impatient it almost always pays off to train the largest model that a given compute configuration can support.


I disagree - larger models are significantly more capable, likely as a function of their size, and the cost to inference them is only going to go down.

Maybe for niche use cases like handling customer support requests small models will do well, but GPT-4 is not going to be condensed to a few billion parameters anytime soon, IMO.


Believing this is existential for a thousand AI seed-stage startups. I hope it’s true for their sakes.

Compare this with the other timeline from Dario Amodei: https://a16z.com/improving-ai/ We may be entering a regime of multi-billion-dollar training costs on custom hardware. Open-source AI will be like open-source CPUs: cool, but the real thing, even at the smallest scale, comes out of monopsonized (is that a word?) inaccessible infrastructure.


I appreciate the conciseness of this article. Thank you for sharing. I agree. The kitchen sink data strategy appears to be fairly inefficient with current model architectures.


I’ve been wondering about variations on panel-of-experts designs that use a “mob” of small models, or possibly a network of them. Sort of like crowdsourcing with baby AIs.

Could we have libraries of small models that are all experts on niche subjects that basically collaborate?

The advantage is that it’s easier to scale and distribute lots of small models on commodity hardware than it is to run giant models that require heaps of contiguous RAM.

I’d love to have some time to play with versions of this.
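
Something like this toy dispatcher is the shape of what I mean - a cheap router picks one small specialist per query (the model names and keyword routing are made up for illustration; a real version might route with an embedding classifier instead):

    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class Expert:
        name: str
        answer: Callable[[str], str]  # wraps a small local model

    def route(query: str, experts: Dict[str, Expert], default: str) -> Expert:
        for topic, expert in experts.items():
            if topic in query.lower():
                return expert
        return experts[default]

    experts = {
        "chemistry": Expert("chem-7b", lambda q: f"[chem-7b] {q}"),
        "law":       Expert("law-7b",  lambda q: f"[law-7b] {q}"),
        "code":      Expert("code-3b", lambda q: f"[code-3b] {q}"),
    }

    query = "Is this clause enforceable under contract law?"
    print(route(query, experts, default="code").answer(query))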


The best design we know is how God made humans. Our brains have specialized parts that work together with analog components. We’re raised on knowledge of increasing complexity as humans, given general-purpose heuristics, and then we specialize. From there, small teams of specialists interacting typically do the best jobs. So, I’d try to do stuff like that first since it’s proven in the only general intelligence that exists.

If I do experiments, I’m making models specialized from foundational to programming to specific language to specific tasks. Each pass will have training data so it picks up general concepts. Then, it will be trained to focus more on what it needs. Eventually, what’s slushing around in its neurons should be just the right information. Or it won’t work.

Anyway, my concepts that I published are at this link:

https://gethisword.com/tech/exploringai/index.html

(See alternative models section.)


You could always start with a big model, then periodically prune low-activation parameters as part of fine-tuning until the desired size is reached.
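
A crude version of that idea in PyTorch - plain magnitude pruning that zeroes the smallest weights after each fine-tuning pass (activation-based pruning would instead track statistics over real inputs; this is just a sketch):

    import torch

    @torch.no_grad()
    def magnitude_prune(model: torch.nn.Module, fraction: float = 0.2) -> None:
        """Zero roughly `fraction` of the smallest-magnitude weights in each
        Linear layer. A real pipeline would keep the masks and re-apply them
        during subsequent fine-tuning steps."""
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                w = module.weight
                k = int(w.numel() * fraction)
                if k == 0:
                    continue
                threshold = w.abs().flatten().kthvalue(k).values
                w.mul_((w.abs() > threshold).to(w.dtype))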


There are hard limits to what a "brain" of a given size can do. Expecting ensembles of 7B models to match what a properly trained 180B model can do at specialised tasks is like expecting a thousand monkeys trained on a specialised task to match human performance.


If this were true, then GPT4 would be better at chess than Stockfish.

Since it's not, it isn't.


By that logic, humans should be better at chess than Stockfish, but they aren’t.


What if there was a large inefficient model just good enough that it could generate small specialized ones and orchestrate them?

Now that I've said it, I wonder if there's any evidence human brains work that way.


For anyone interested, the TinyStories dataset they used to train the "tiny" models is on Hugging Face: https://huggingface.co/datasets/roneneldan/TinyStories
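
It loads with the datasets library if anyone wants to poke at it (the split and column names here are from memory of the dataset card, so double-check them there):

    from datasets import load_dataset

    # Each row is one short synthetic story generated by a larger model.
    ds = load_dataset("roneneldan/TinyStories", split="train")
    print(ds[0]["text"][:200])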


>"Eldan and Li observed hints that networks with fewer layers but more neurons per layer were better at answering questions that required factual knowledge; conversely, networks with more layers and fewer neurons per layer were better at keeping track of characters and plot points from earlier in the story."

That is a very interesting result! It may point to the idea that the notion of time itself (as history/historical knowledge of the state of the system at an earlier point in its evaluation/evolution/computation) is more "present" (or pronounced, apparent in the results) in neural networks that are deep (many layers) rather than wide (more neurons per layer).

Which, if true and widely confirmed to be so by other researchers -- would be an interesting discovery indeed!

Also, if true (and it's a big 'if'!), this may help researchers in other areas better understand the relationship between time and information (e.g., Leonard Susskind, ER=EPR, etc.: https://en.wikipedia.org/wiki/Leonard_Susskind , https://en.wikipedia.org/wiki/ER_%3D_EPR , https://www.youtube.com/results?search_query=leonard+susskin... )


Imho we should move AI out of the realm of CS, very quickly. The information overload is getting too much.

As an analogy, it would be strange if psychology was a part of physics. Physicists would (I think) not be amused if their work literature became flooded with papers about the human psyche. Even if the underlying "hardware" is physics.


AI is not CS. AI implementation is CS but AI theory is firmly in the field of statistics.

The newcomers with their strong opinions seem to have become confused because programmers hit APIs and think they're doing "AI".

Software architecture and neural network architecture are not synonyms.


AI isn't really statistics at this point either though. The primitives are statistics tools, but parameter dynamics and macro-scale behavior of these models are very much their own area of study.


No, it's still very much at the intersection of optimization theory and statistics/probability.


In the last year or so I've seen a big shift upwards in the focus of papers. Academics don't have the resources to beat Meta at building foundation models, so the low hanging fruit is in understanding the behavior of existing models more deeply, and how to extend or leverage them in new ways.

Within a few years the fraction of papers in AI about new architectures, training, hyperparameter optimization, etc. will be dwarfed by papers about things like controlnets, LoRA combinators, multi-model dispatch networks and few-shot embedding methods.


If we're talking about "AI" as a nebulous catch-all and arbitrary label for whatever hype people are chasing, then sure, you're free to define it however you like, by construction.

Popularity and the status quo don't change the definition of the underlying theory. I will push back forever on any demotion of the importance of ML (i.e. the theory) as distinct from some hype-driven notion of AI.


It's not really a hype-driven notion. The underlying machinery is statistical models, but those models are being trained into higher order structures which are worthy of research in their own right due to emergent properties, and eventually work in understanding and controlling those emergent properties will be more important than foundational ML advances.


I am sympathetic to this idea of emergence creating a new field of AI as distinct from the statistics that underlies it.

Then the focus should be to identify something like a fundamental unit of intelligence in order to formalize a higher-order science out of the foundations. An analogy can be drawn between physics and chemistry: we needed to properly identify "the atom" and its component parts to get anywhere with the science of chemistry. But it took a whole lot of physics to get to that point. It seems similar with the ML-cum-AI transition where we'll still need to dig very deep into statistics and information theory before being able to abstract them away in favor of higher-order concepts.

To me it seems we're really far from anything like that yet. Like at least a couple decades if not more. Friston's got some cool ideas that make me think he may have his name on some stuff later on but again the theory on learning systems is barely getting started.


It sounds like that’s the dividing line. Generating a new foundational model is CS+engineering, doing research on such models or derivatives thereof which treats them as alien artefacts to be studied is “AI.”


If statisticians could build large-scale software, they’d be really angry.


I'm hopeful AI will end up being probability/statistic's "CS", in the same way that CS relates to its mathematics.

IOW, an intellectually-linked discipline that drives so much revenue (and thus has so much work to be done in it) that it comprises its own field (about 75% of which is unique "applications-of-thing" problems).


Software engineering is already CS's "CS". CS is a field of study, not a role. There may be something like "AI engineering" but I can't see a way it would be much different from existing ML engineer roles. And if you do ML engineer minus specialized statistics knowledge you're just back to plain old software engineering.

I don't know that this particular aspect of the "AI revolution" needs some disruptive paradigm shift.


Po-tato potat-o, re: CS vs SE and role vs field.

I'm hesitant to appropriate "engineer" into what we do. There are certainly people who work with code who earn that title. There are also many who don't.

I do think it would be healthy for people who work with ML to separate more decisively from people who work with general purpose code. There are enough unique problems and solutions in ML that a clear community would better serve the field's maturation.

As opposed to getting an endless summer of "Why don't you just" software developers fouling things up, because it's "similar".


As a nascent field, "why don't you just" ends up with new, useful, applicable techniques, along with many that aren't. We don't know which is which until after the fact.


On the other extreme, what distinguishes CS from math?


>> On the other extreme, what distinguishes CS from math?

Doesn't that support the argument even more? CS and math share a lot more than CS and AI, yet CS and math are different disciplines.


Actually that is not correct. Math is at the heart of CS and AI. Both came from math and rely on math.


>> Both came from math and rely on math.

Neural nets came from biology and rely on math.


> what distinguishes CS from math?

The art of application


No? Applied maths is maths, so applying maths does not stop maths from being maths.

Of course, computing is closer to pure math than applied maths. Computing is applied pure maths, perhaps.


The "computing" that's closer to pure math is algorithmic theory, cryptography, and computational engineering, which is an extremely small subset of the field.

The majority of what computer science practitioners do is general purpose coding, which doesn't have nearly as much to do with the underlying math.

Or, in other words, how proficient would an applied math person be at building a front end? And how many of their skills would they be able to leverage?

That's the overlap, or lack thereof.


We were specifically talking about comp sci, not software engineering [sic]. Yes, software engineering is way less rigorous. It is hardly maths, or comp sci for that matter.


Time


If we end up developing AGI, computer science itself will become useless since all economic progress will be driven by businesses using their own AGI to develop software. It won't matter in the same way nobody designs microchips by hand anymore.


Intuitively, this seems like a good approach -- pare down the datasets to isolate the essential, if not rudimentary, elements.

It feels like a good direction would be to "grow" these models from a linguistic kernel, then supply them with a more organized / reproducible "fact bank" / memory store.

We try to mimic this right now with prompt engineering, but in the end it's always just word soup. Having layers to it, like a language synthesis layer, logic layer, memory layer -- where the logic is a constraint on the language, and the memory provides the working data -- it could lead to a model that hallucinates far less.


I know that HN loves small models because they are more accessible and hacking-friendly.

However, the returns from increased capabilities vastly outweigh the costs of running larger models.

It doesn’t matter that a model requires $100,000 worth of hardware if it can automate the work of several information workers paid $200k annually.


Assuming you don't take privacy into consideration, yes.


TinyStories-trained models and an alternative implementation based on Llama 2: https://github.com/karpathy/llama2.c


All three of those "Katie and her cat" stories are horrible. Besides the fact that none of them sound plausible, there is no substance in these stories that would make them useful to even present to children. They're just random events strung together. There are actual reasons humans write and tell stories beyond just sounding "like a story a human wrote".


Still,

> Eldan and Li presented the same challenge to OpenAI’s GPT-2, a 1.5-billion-parameter model released in 2019. It fared far worse — before the story’s abrupt ending, the man threatens to take the girl to court, jail, the hospital, the morgue and finally the crematorium.


Is it possible to use ML to improve training returns? Similar to neural architecture search or circuit-design optimization, can we use it to maximize performance and accuracy through training-set selection/optimization?


Data subset selection is an age-old problem / sub-field in ML. Feature selection is a special case of this broader problem.
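
The modern LLM flavour of it usually boils down to "score every example, keep the best fraction". A minimal sketch of that pattern, with a made-up toy scorer standing in for what would really be a quality classifier, reference-model perplexity, or dedup/toxicity filters:

    from typing import Callable, List

    def select_subset(examples: List[str],
                      quality_score: Callable[[str], float],
                      keep_fraction: float = 0.3) -> List[str]:
        """Keep the top-scoring fraction of the corpus."""
        ranked = sorted(examples, key=quality_score, reverse=True)
        return ranked[: max(1, int(len(ranked) * keep_fraction))]

    # Toy scorer: prefer longer, punctuation-terminated examples.
    score = lambda s: len(s.split()) + (1.0 if s.strip().endswith(".") else 0.0)
    corpus = ["the cat sat.", "asdf qwer", "A tiny story about a brave dog."]
    print(select_subset(corpus, score, keep_fraction=0.5))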


I think the next big innovation in LLMs (sort of like the attention mechanism) will be some method of distributing work to much smaller, specialized, and capable units, rather than having one giant network.

We already see hints of this with MoE, but something entirely new wouldn't surprise me.
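
For reference, the MoE piece already has that flavour - a learned gate scores experts per token and only the top-k run. A stripped-down sketch, not any particular implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        """Minimal top-k mixture-of-experts layer: output is the gate-weighted
        sum of the k selected experts for each token."""
        def __init__(self, d_model=64, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.gate = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                      # x: (tokens, d_model)
            scores = self.gate(x)                  # (tokens, n_experts)
            topv, topi = scores.topk(self.k, dim=-1)
            weights = F.softmax(topv, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    mask = topi[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

    print(TinyMoE()(torch.randn(5, 64)).shape)  # torch.Size([5, 64])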


Using GPT to train smaller models is against the OpenAI TOS. At least for commercial purposes.

Is it not? Do we just ignore it because it's not fair or they can't prove it?

How can we get from needing a million stories to being able to use just 10 stories?


Companies are free to put whatever they wish in their TOS. Whether it holds legal water is an entirely different question and for the courts to decide. The law of the land always takes precedence over any arbitrary clauses that companies put in their TOS.


There should be a law prohibiting them from claiming that. Someone should challenge it in court


In my opinion, models that are trained based on highly selected, curated data are going to lose out to the larger more unsupervised approaches. This is just yet another instantiation of the bitter lesson.


I wonder how Google runs AI on a phone doing imaging stuff, when I can’t even get it to work on a 64 GB RAM machine with an Nvidia 3070 card.


Why doesn't it work? Try Ollama, or llama.cpp, they should work fine out of the box. They do on my 3070 just fine. Also, try a model that fits your VRAM.


Most of the AI doing image processing is architecturally very different from, and smaller than, LLMs.


Running (quantised) LLMs (encoder-decoder and decoder-only transformers) on phones isn't hard. Running CNNs and encoder-only transformers, the kind used in vision, on phones is even easier.
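
To give a sense of why it isn't hard: int8 dynamic quantization of the linear layers is a one-call transformation in PyTorch, roughly quartering the weight memory (shown on a toy model; on-device stacks like Core ML or TFLite do their own equivalents):

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(512, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
    )
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
    print(quantized(torch.randn(1, 512)).shape)  # same interface, ~4x smaller weights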


I am certain investment in synthetic data will grow exponentially. Synthetic data can be more diverse, less toxic, and of higher average quality. But the foremost reason we need it is to improve models.

You likely recall the reversal curse - if an LLM is trained on "A is B", it doesn't automatically deduce "B is A". You need to explicitly append the second part to the training set to make it complete. Fragmented or incomplete information in the training set stays fragmented in the trained model. Training is blind; only inference is intelligent.
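
For simple relational facts the fix is mechanical - emit the reversed statement alongside the original (a toy template illustration; real pipelines would have an LLM paraphrase both directions):

    # For each "A is B"-style fact, also emit the "B is A" direction so that
    # both orderings appear in the training set.
    facts = [
        ("Paris", "the capital of France"),
        ("Valentina Tereshkova", "the first woman in space"),
    ]

    augmented = []
    for a, b in facts:
        augmented.append(f"{a} is {b}.")
        augmented.append(f"{b[0].upper() + b[1:]} is {a}.")  # reversed direction

    print("\n".join(augmented))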

Similarly, many training examples contain surface information that conceals implicit deductions. For instance, in a math problem the statement is explicit, but the chain of thought is implicit, concealed.

What I am getting at is a general principle - LLMs need to "study" the original training data and make it comprehensive, elicit those implicit deductions, and enable LLMs to traverse the conceptual space in all directions.

Study, then train.

In other words, to study is to digest the raw data, to reconnect fragmented information, to generate insight, to execute the instructions and see the result, generally to unfold what is hidden.

It's a matter of utilizing LLMs in generative mode to prepare the dataset before training. Microsoft has generated a 150B token synthetic dataset for Phi-1.5 and it demonstrated 5x efficiency gains. They will likely increase 100x. The promise of faster, cheaper inference and fewer errors is very alluring.


Could we use this to create better models for very specific medical domains for example?


Some of their findings are really interesting, as is their approach with the children's stories, but I find it worrying that at no point does the article mention the problems inherent in feeding AI models the output of other AI models.

> The success of the TinyStories models also suggests a broader lesson. The standard approach to compiling training data sets involves vacuuming up text from across the internet and then filtering out the garbage. Synthetic text generated by large models could offer an alternative way to assemble high-quality data sets that wouldn’t have to be so large.

And they completely ignore the fact that those models are building their "synthetic stories" on the knowledge from all that text vacuumed up from across the internet, as if we had already solved the problem of sourcing human data without the need for further vacuuming.


Related questions:

What is the status of synthetic data generated from a model trained on copyright infringing content?

How about text generated from an open-source model, trained on open or licensed data, but with copyrighted material in the prompt for reference?

Does going through an AI model wash copyrights away? Does any hint of copyrighted data in the training corpus or prompt invalidate the right to publish the results? Only when they are similar enough to the copyrighted content? How about when the content itself is pretty common and not unique at all, like a solution to bubblesort?


Combined with MoE, this is the microservice architecture for AI. You know it's coming.





