“In my humble opinion, these companies would not allocate a second of compute to lightweight models if they thought there was a straightforward way to achieve the next leap in reasoning capabilities.”
The rumour/reasoning I’ve heard is that most advances are being made on synthetic data experiments happening after post-training. It’s a lot easier and faster to iterate on these with smaller models.
Eventually a lot of these learnings/setups/synthetic data generation pipelines will be applied to larger models but it’s very unwieldy to experiment with the best approach using the largest model you could possibly train. You just get way fewer experiments per day done.
The models the bigger labs are playing with seem to be converging on roughly the largest size a researcher can still run an experiment on overnight.
> You just get way fewer experiments per day done.
Smaller/simpler/weird/different models can be an incredible advantage due to iteration speed. I think this is the biggest meta problem in AI development. If you can try a large range of hyperparameters, fitness function implementations, etc. in a few hours, you will eventually wipe the floor with the parties forced to wait days, weeks and months for their results each time.
The bitter lesson certainly applies and favors those with a lot of compute and data, but if your algorithms fundamentally suck or are approaching a dead end, none of that compute or information will matter.
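As a toy illustration of how much iteration speed compounds (a minimal sketch with a made-up `train_and_eval` stand-in, not any lab's real tooling):

```python
import itertools, random

def train_and_eval(lr, batch_size, seed):
    """Stand-in for a small-model training run that finishes in minutes."""
    random.seed(hash((lr, batch_size, seed)))
    return random.random()  # pretend validation score

# With a small model the whole grid fits into one night of experiments.
grid = itertools.product([1e-4, 3e-4, 1e-3], [32, 64, 128], range(3))
results = [(cfg, train_and_eval(*cfg)) for cfg in grid]
best_cfg, best_score = max(results, key=lambda r: r[1])
print(f"best config {best_cfg} -> {best_score:.3f} after {len(results)} runs")
# At frontier scale these 27 runs would each take weeks of compute, so you
# would only ever get to see one or two of these points.
```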
Why is Searle's Room the seemingly de facto thought experiment for this stuff? You always have to take an extra couple of steps to make it work, and it never really sells it to me. It's true he was distinctly concerned with AI, but he was not at all speaking to the same context, i.e., before "the bitter lesson."
I humbly suggest looking into guys more like Quine. His problem of "radical translation" is much more easily mapped to LLMs. (Thinking specifically here of the model as the "translator"). It's maybe a little harder to grasp for non-domain experts, but at least there is no need for hyperstitional armchair interpretations of old problems in order to make it relevant.
People jump straight into cognitive science/philosophy with this stuff, I just want to be like "whoa, slow down! So much to establish before that.."
> No matter how fast Searle is, he won't be able to come up with a beautiful and original Chinese poem that has the creative spark special to humans
Why not?
> Of course, at some level of complexity, it will be stuck in a local maximum of work quality simply because the book has no guide on how to solve the problem at hand.
I find this a pretty un-optimistic view, especially from someone building a coding autopilot. Having myself used LLMs for a bunch of software development in the last year, it seems its 'local maximum' is no different from a developer's _if_ you split the process up appropriately. The author alludes to this when they mention 'workflow'.
Everyone is trying to use LLMs in a 'single inference pass', assuming that's as good as it gets, but that's like trying to find human creativity in a single cascading activation of neurons. A brain doesn't fit on an axon. So, I kinda think the author should be less shy about their optimism. Inference is soon ~free, as they say, so to me, naive as I might be, the future of AI coding agents is not limited to grunt tasks; it is as creative and exploratory as any human coder.
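To make the 'split the process up' point concrete, here's a rough sketch of what a multi-pass workflow could look like instead of one shot; `call_llm` is just a hypothetical stand-in for whichever model client you use, not any particular tool's API:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever model client you use."""
    raise NotImplementedError("plug in a real model call here")

def solve_task(task: str, max_rounds: int = 3) -> str:
    # Pass 1: plan, instead of asking for the whole solution at once.
    plan = call_llm(f"Break this task into small, ordered steps:\n{task}")
    # Pass 2: first draft, conditioned on the plan.
    draft = call_llm(f"Task: {task}\nPlan:\n{plan}\nWrite the code.")
    # Passes 3..n: critique and revise until the reviewer is satisfied.
    for _ in range(max_rounds):
        critique = call_llm("Review this code for bugs or missing steps. "
                            f"Reply LGTM if it looks complete:\n{draft}")
        if "LGTM" in critique:
            break
        draft = call_llm(f"Revise the code to address this review:\n{critique}\n\nCode:\n{draft}")
    return draft
```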
Ps. Fume looks cool. I'd suggest people take a look at aider.chat and claude-engineer too (on github).
Unsure if this is a useful answer. But Searle/LLM could make something that looks like it has a creative spark, and that's it.
Why I think that's different is in the case of a human artist, they create something because they have something they want to say. Whatever they produce is a way of saying 'this is what the world feels like to me, is it the same for you?'. And if it is, it resonates.
But I cannot see how an LLM would 'want' to say anything. If we're talking psychoanalytically of where wanting comes from, and call it a desire to fill a void of how incoherent you actually are, then an LLM doesn't go through that process.
Maybe Searle does, and still wants the characters to make you feel a certain way, in which case the comparison doesn't fit.
> If we're talking psychoanalytically of where wanting comes from, and call it a desire to fill a void of how incoherent you actually are, then an LLM doesn't go through that process.
Ironically, many people complain LLMs are too incoherent, with all their confabulations and hallucinations.
But I agree. Desire is a good verb. I think that's what differentiates us from the 'machines'. In art, we try to create meaning. From our lives. From our discontents. Even a million LLMs cannot be in deficit of meaning; they are precisely tuned to their own capacity. Whereas something strange about humans is our endless desire for 'more'.
I'm not convinced we do "want" to say anything, though. The combinations of physical inputs (which mostly translate to hormones, I imagine?) and data inputs seem to drive my behavior to such a degree that I question if I could really do anything else at any given moment.
The whole free will debate seems a bit out of scope (and out of my reach, hah), but nonetheless it feels interesting in the LLM context.
edit: Note that I don't necessarily think LLMs are there or even can be. We seem too technologically limited to reproduce the complexity in ourselves. Nonetheless I'm always interested in how far reduced complexity can take us.
> Why not?
The 'original' part is more important than the 'beautiful' part - which should have been clearer in my writing. This argument also triggers the question "is true originality even possible", but I think the difference for LLMs at the moment is their inability to build non-obvious analogies. I've yet to be inspired by something written by an AI, and I don't think simply overfitting a model on all human-generated data is enough for that. As I also mentioned in the blog, I would be happy to be proven wrong in the future.
> _if_ you split the process up appropriately
I believe this prerequisite is very important. LLMs are still terrible at planning and splitting a complex task into simpler steps. This might be a natural limitation of `next token prediction`: for complex planning, each step should be the result of both the previous steps and speculative future ones. We try to tackle this by dividing a plan into two, a macro plan and a micro plan, but there is still a lot to improve there.
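Roughly, the shape of that split looks something like the sketch below (heavily simplified and illustrative only, not our actual implementation; `call_llm` is a stand-in for the model client):

```python
def call_llm(prompt: str) -> str:
    """Stand-in for the model client."""
    raise NotImplementedError("plug in a real model call here")

def plan_and_execute(task: str) -> list[str]:
    # Macro plan: a handful of coarse milestones covering the whole task.
    macro_steps = call_llm(f"List 3-5 high-level milestones for: {task}").splitlines()

    outputs: list[str] = []
    for milestone in macro_steps:
        # Micro plan: expand one milestone at a time, conditioning on what has
        # already been done instead of guessing every detail up front.
        outputs.append(call_llm(
            f"Task: {task}\n"
            f"Done so far: {outputs}\n"
            f"Break this milestone into concrete steps and carry them out: {milestone}"
        ))
    return outputs
```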
An LLM, certainly by itself, can't be "as creative and exploratory as any human coder", because it's limited by an inability to reason beyond remixing its training data, has no curiosity, no ability to learn from its exploratory mistakes and successes (were it to make them), etc, etc.
It seems we've reached the point that understanding of LLMs would be a great candidate for the beginner/intermediate/expert meme. "It's just autocomplete" -> "It's got a world model, it's thinking for itself" -> "It's just autocomplete".
I think the discussion around "exponentials" with top-end LLM scaling (think Claude 3.5 Sonnet and GPT-4, not the smaller models) is really pointless. The heuristic we have for what to expect from performance is just scaling, which has worked pretty well. These benchmarks are imperfect in lots of ways, aren't necessarily sensitive enough to show exponential progress, and it is difficult to predict step changes in capability in advance.
If you zoom out on the first graphic from December 2023 back to 2020, the capabilities of models released at that time on these benchmarks would be much much lower. The best lens for future performance of large models is uncertainty.
> The best lens for future performance of large models is uncertainty.
100% agree. I think a better way to phrase my argument there would be to reject the notion that LLMs are destined to get exponentially smarter (the Twitter fallacy). This is not to say I believe they are not going to get any smarter in the future. We simply don't know, and building a company/product on the expectation of another Moore's Law is dangerous.
> No matter how fast Searle is, he won't be able to come up with a beautiful and original Chinese poem that has the creative spark special to humans.
This analogy seems flawed to me. Searle is in an empty room but LLMs are not. They are constantly learning from user inputs, and data is continuously being made more available for LLM ingestion. I still don't think that an LLM will completely replace humans at pure creativity, but I don't see why it can't come close. Especially since we're only 2 years into this craze.
424 terabytes of text is over a billion books' worth of data. On the common crawl website it even says "Over 250 billion pages spanning 17 years." That's an impressive amount of information.
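Rough arithmetic behind the billion-books figure (assuming roughly 400 KB of plain text per book, which is only a guess):

```python
# Back-of-the-envelope check of the "over a billion books" claim.
corpus_bytes = 424e12        # 424 TB of extracted text
bytes_per_book = 400_000     # ~80k words * ~5 chars/word -- a rough guess
print(f"~{corpus_bytes / bytes_per_book / 1e9:.1f} billion books")  # ~1.1
```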
Comparing common crawl to video makes no sense. Common crawl is text extracted from webpages. 424 terabytes of pure text contains orders of magnitude more text than I will read in my entire life.
I think this is a good thing to keep in mind. If you compare that to how much information a young human gets as input for example, it really puts things into perspective.
> Fume isn't built on the premise that LLMs are going to get substantially smarter in the coming months. Instead, we're certain they're going to get much cheaper and faster.
I'm so tired of the assumption that AI tools are going to get increasingly more capable until they can effectively take over any task that humans currently excel at. They are already useful, but they don't seem likely to take over everything. This is especially true when it comes to making critical decisions.
This take about cost, however, seems well-grounded. I appreciate clear statements like this that can act as guiding principles for what kinds of things to build, and how to anticipate changes in the coming months and years.