This was really good. The only thing I didn't see explicitly discussed, although I guess it's obvious, is that what's different 33 years later is the inputs the models operate on. The '89 SOTA model used 16x16 greyscale images; today we have single-digit-megapixel color images. In 30 years a desktop will be able to train CLIP in 90 seconds, but what will the SOTA models be trained on?
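Just to put rough numbers on that input gap (a back-of-envelope sketch; the megapixel figure is an assumed round number):

    # Rough input dimensionality then vs now; illustrative figures only.
    pixels_1989 = 16 * 16 * 1        # 16x16 greyscale -> 256 input values
    pixels_now  = 3_000_000 * 3      # ~3 MP color image -> 9,000,000 values
    print(pixels_now / pixels_1989)  # roughly 35,000x more raw input per example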
Human behaviour, in a way far more general than which token we might type next. To mimic humans as closely as is possible with the basic deep learning method, train something that can predict human behaviour in general. Training would require billions to quadrillions of hours of video, audio, and probably many other inputs, from many different people engaged in the full variety of human activity.
Why? By age 25 an adult has only about 146k hours of visual "training" experience, most of it repeated, derivative, and unproductive. And whatever evolution encoded can be read directly from the genome, so it doesn't need to be re-derived by rerunning millions of years of evolution.
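That 146k figure is just waking hours by age 25 (a quick sketch, assuming roughly 16 waking hours a day):

    # Waking hours by age 25, assuming ~16 waking hours per day.
    print(25 * 365 * 16)  # 146,000 hours of visual "training data"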
Much of that time also includes physical interaction with the world, which makes it far more valuable, because interaction can improve performance in a focused way.
Neural nets seem to learn much more slowly than humans. Even GPT-2 has seen orders of magnitude more tokens of language than a human experiences in a lifetime. At least as far as language is concerned, humans are able to extract a lot more information from their training data.
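A rough comparison, with every figure below an assumption (WebText, GPT-2's training corpus, is about 40 GB of text; the bytes-per-token and human word-exposure numbers are guesses, and published estimates vary a lot):

    # Back-of-envelope only; all numbers are rough assumptions.
    webtext_bytes   = 40e9               # WebText, GPT-2's corpus, ~40 GB of text
    bytes_per_token = 4                  # assumed average bytes per BPE token
    gpt2_tokens     = webtext_bytes / bytes_per_token   # ~1e10 tokens per epoch

    words_per_day = 20_000               # assumed words heard/read per day
    human_words   = words_per_day * 365 * 25            # ~1.8e8 words by age 25

    print(f"{gpt2_tokens:.0e} vs {human_words:.0e}, "
          f"ratio ~{gpt2_tokens / human_words:.0f}x")

The exact ratio depends entirely on those guesses, but the direction of the comparison is hard to escape.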
The lambda calculus (a system we know is capable of infinite self-complexity, learning, etc. with the right program) can be described in a few hundred bits. And a neural net can be described in the lambda calculus in perhaps a few thousand bits.
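As a small illustration of how little machinery that takes, here is a sketch of Church-numeral arithmetic in Python, built from nothing but single-argument functions:

    # Church numerals: arithmetic from nothing but one-argument lambdas.
    ZERO = lambda f: lambda x: x
    SUCC = lambda n: lambda f: lambda x: f(n(f)(x))
    ADD  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

    def to_int(n):
        # Decode a Church numeral by counting applications of f.
        return n(lambda k: k + 1)(0)

    TWO, THREE = SUCC(SUCC(ZERO)), SUCC(SUCC(SUCC(ZERO)))
    print(to_int(ADD(TWO)(THREE)))  # 5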
Also, we have no idea how "compressed" the genome is.
Basic structure can be encoded, yes (and obviously is, given that brains have consistent structure), but the weights or parameters, presuming brains learn via synaptic weights, obviously do not fit in the genome.
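An order-of-magnitude check, using commonly cited round numbers (all of them rough assumptions):

    # Genome capacity vs synaptic parameter count; rough round numbers.
    genome_bits  = 3.2e9 * 2    # ~3.2 billion base pairs, 2 bits each -> ~0.8 GB
    synapse_bits = 1e14 * 1     # ~100 trillion synapses at a generous 1 bit each
    print(synapse_bits / genome_bits)  # ~15,000x: the weights can't all be inherited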
One difference is that humans are actively involved in data collection, so when there is a gap in their knowledge, they don't just wait for the information to show up; they ask a question, etc.
We might have megapixel images readily captured with phone cameras, but virtually all vision models in common use take 224x224-resolution images as input, or maybe 384x384. Anything higher resolution just gets resampled down. It seems that, for now, you are better off spending your compute budget on a bigger "brain" than on better "eyes".
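For concreteness, a standard ImageNet-style pipeline (sketched here with torchvision) throws away most of that resolution before the network ever sees the image:

    # Typical ImageNet-style preprocessing: a 12 MP phone photo and a thumbnail
    # both end up as the same 224x224 tensor.
    from torchvision import transforms

    preprocess = transforms.Compose([
        transforms.Resize(256),       # shorter side down to 256 px
        transforms.CenterCrop(224),   # the network only sees 224x224
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])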
I don't think that's current. Certainly the object detection models work on bigger images, and the datasets they're pretrained on, e.g. COCO, are not 224x224. I think standard models pretrained on ImageNet, like the ResNets, usually have everything resized to 224x224, and so they favor this kind of scaling.
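For example, the detection models that ship with torchvision rescale inputs to a much larger working size and accept variable-sized images (a sketch assuming a recent torchvision; the ~800/1333 resize defaults are from memory, so check model.transform on your version):

    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    print(model.transform)  # shows the resize defaults (shorter side ~800, max ~1333)

    # The model takes a list of variable-sized images, not a fixed 224x224 batch.
    images = [torch.rand(3, 1080, 1920)]   # e.g. a full-HD frame
    with torch.no_grad():
        preds = model(images)
    print(preds[0]["boxes"].shape)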