I was impressed until I read the caveat about the high-compute version using 172x more compute.
Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.
The results are cool, but man, this sounds like such a busted approach.
So what? I’m serious. Our current level of progress would have been sci-fi fantasy with the computers we had in 2000. The cost may be astronomical today, but we have proven a method to achieve human performance on tests of reasoning over novel problems. WOW. Who cares what it costs. In 25 years it will run on your phone.
So your claim for optimism here is that something today that took ~10^22 floating point operations (based on an estimate earlier in the thread) to execute will be running on phones in 25 years? Phones which are currently running at O(10^12) flops. That means ten orders of magnitudes of improvement for that to run in a reasonable amount of time? It's a similar scale up in going from ENIAC (500 flops) to a modern desktop (5-10 teraflops).
That sounds reasonable to me because the compute cost for this level of reasoning performance won’t stay at 10^22 and phones won’t stay at 10^12. This reasoning breakthrough is about 3 months old.
By a factor of 10,000,000,000? (that's the ten orders of magnitude needed, since the physical side is out). Algorithmic improvements will make this 10 billion times cheaper?
It's not so much the cost as much the fact that they got a slightly better result by throwing 172x more compute per/task. The fact that it may have cost somewhere north of $1 million simply helps to give a better idea of how absurd the approach is.
It feels a lot less like the breakthrough when the solution looks so much like simply brute-forcing.
But you might be right, who cares? Does it really matter how crude the solution is if we can achieve true AGI and bring the cost down by increasing the efficiency of compute?
That’s the thing that’s interesting to me though and I had the same first reaction. It’s a very different problem than brute-forcing chess. It has one chance to come to the correct answer. Running through thousands or millions of options means nothing if the model can’t determine which is correct. And each of these visual problems involve combinations of different interacting concepts. To solve them requires understanding, not mimicry. So no matter how inefficient and “stupid” these models are, they can be said to understand these novel problems. That’s a direct counter to everyone who ever called these a stochastic parrot and said they were a dead-end to AGI that was only searching an in distribution training set.
The compute costs are currently disappointing, but so was the cost of sequencing the first whole human genome. That went from 3 billion to a few hundred bucks from your local doctor.
Let's make two generous assumptions:
1. ARC-AGI actually generalizes to human intelligence
2. It took 172x more compute to go from ~75% to ~87%, so it will take roughly 4x that to get to 99% (the level of a STEM graduate), assuming every 172x'ing of the compute cuts the remaining gap in half
That is roughly 10^9 times more compute required, or roughly the US military budget per half an hour, to get the intelligence of 1 (!) STEM graduate (not any kind of superhuman intelligence).
Of course, algorithms will get better, but this particular approach feels like wading in a plateau of efficiency improvements, very, very far down the X axis.
Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.
The results are cool, but man, this sounds like such a busted approach.