
This was really good. The only thing I didn't see explicitly discussed, although I guess it's obvious, is that what's different 33 years later is the inputs the models operate on. The '89 state-of-the-art model used 16x16 greyscale images; today we have single-digit-megapixel color images. In 30 years a desktop will be able to train CLIP in 90 seconds, but what will the state-of-the-art models be trained on?


Human behaviour, in a way far more general than which token we might type next. To mimic humans as closely as possible with the basic deep learning approach, train something that can predict human behaviour in general. Training would require billions to quadrillions of hours of video, audio, and probably many other inputs, from many different people, engaged in the full variety of human activity.


Humans have a brain that physically changes, especially in childhood, so that is potentially a massive advantage.


Why? By age 25 an adult has only about 146k hours of waking "video" experience as training, most of it repeated, derivative, and unproductive. And the genes evolution gave them are already written in their genome, so they don't need to be retrained by millions of years of evolution.
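For what it's worth, the 146k figure checks out as a back-of-envelope estimate; a quick sketch, assuming roughly 16 waking hours a day (the waking-hours number is my assumption, not from the comment above):

    # Back-of-envelope: waking "video" hours accumulated by age 25.
    # Assumes ~16 waking hours per day; both figures are rough.
    years = 25
    waking_hours_per_day = 16
    total_hours = years * 365 * waking_hours_per_day
    print(f"{total_hours:,} hours")  # -> 146,000 hours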


Much of that time also includes physical interaction with the world, which makes it far more valuable because it can improve performance in a focused way.


What better way to learn menial physical tasks like house cleaning and produce picking?


Neural nets seem to learn much slower than humans. Even GPT-2 has seen orders of magnitude more tokens of language than a human experiences in a lifetime. At least as far as language is concerned, humans are able to extract a lot more information from their training data.
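A rough, assumption-heavy comparison; the corpus size, bytes-per-token, and human word-exposure rate below are all ballpark assumptions, not figures from the thread:

    # GPT-2's WebText corpus is roughly 40 GB of text; at ~4 bytes per BPE
    # token that's on the order of 10 billion tokens. GPT-3's reported
    # training set is ~300 billion tokens.
    gpt2_tokens = 40e9 / 4
    gpt3_tokens = 300e9
    # Assume a person hears/reads ~15,000 words a day for 80 years.
    human_tokens = 15_000 * 365 * 80       # ~4.4e8
    print(gpt2_tokens / human_tokens)      # roughly 20x
    print(gpt3_tokens / human_tokens)      # roughly 700x

Depending on the assumptions, that's one to three orders of magnitude more text than a person ever encounters, and humans still learn language from far less of it.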


Humans are also extensively pretrained by billions of years of evolution, so GPT, starting from scratch, is admittedly at a disadvantage from the get-go.


But the human genome is only 3 gigabytes and the vast majority of that is unlikely to be encoding brain structure.


"Only" 3 gigabytes.

The lambda calculus (a system we know is capable of infinite self-complexity, learning, etc. with the right program) can be described in a few hundred bits. And a neural net can be described in the lambda calculus in perhaps a few thousand bits.

Also, we have no idea how "compressed" the genome is.


Basic structure can be encoded, yes (and obviously is, given that brains have consistent structure), but the weights or parameters, presuming that brains learn via synaptic weights, obviously do not fit in the genome.

Compression still must obey information theory.
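As a sketch of why: even at an absurdly generous one bit per synapse, the learned state dwarfs the genome (the synapse count below is the usual ~10^14 ballpark, an assumption on my part, and the 3 GB genome figure is taken from the comment above at face value):

    # Back-of-envelope: learned synaptic state vs. genome size.
    synapses = 1e14                        # assumed adult synapse count
    bits_per_synapse = 1                   # generous lower bound
    genome_bytes = 3e9                     # figure quoted upthread
    weight_bytes = synapses * bits_per_synapse / 8
    print(weight_bytes / genome_bytes)     # ~4,000x the genome, even at 1 bit/synapse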


True, but we don’t know how to recreate that kind of pretraining.


One difference is that humans are actively involved in data collection, so when there is a gap in their knowledge, they don't just wait for the information to show up; they ask a question, etc.


Humans do not train on video; the idea of a video, or even a frame, is a high-level abstraction within the human brain.


We can easily get megapixel images from phone cameras, but virtually all vision models in common use take 224x224 images as input, or maybe 384x384. Anything higher resolution just gets resampled down. It seems that, for now, you are better off spending your compute budget on a bigger "brain" than on better "eyes".
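For anyone curious, this is the standard preprocessing in front of torchvision's ImageNet-pretrained classifiers; whatever the camera produced gets resized and cropped to 224x224 before the model ever sees it (a minimal sketch, using the usual ImageNet normalization constants):

    # Typical torchvision classification preprocessing: everything ends up 224x224.
    from torchvision import transforms

    preprocess = transforms.Compose([
        transforms.Resize(256),            # shorter side -> 256 px
        transforms.CenterCrop(224),        # crop to 224x224
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    # A 12 MP phone photo and a 640x480 webcam frame come out the same size.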


I don't think that's current. Object detection models certainly work on bigger images, and the datasets they're pretrained on, e.g. COCO, are not 224x224. I think standard models pretrained on ImageNet, like the ResNets, usually have everything resized to 224x224, and so they favor this kind of scaling.
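Right, torchvision's COCO-pretrained detection models, for example, take images of arbitrary size and internally rescale so the shorter side is around 800 px, not 224 (a quick sketch; the 1080x1920 input is just an illustrative size):

    # Detection models accept a list of CHW tensors of any size.
    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")   # COCO-pretrained
    model.eval()

    img = torch.rand(3, 1080, 1920)        # e.g. a full-HD frame
    with torch.no_grad():
        preds = model([img])               # internally resized to ~800 px short side
    print(preds[0]["boxes"].shape)         # [N, 4] detected boxes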


Millions of hours of data captured by headsets like the Vision Pro?

Not sure of everything it captures, but a model could be trained on the combination of audio/video/spatial/iris/what have you...



