This was really good. The only thing I didn't see explicitly discussed, although I guess it's obvious, is that what's different 33 years later is the inputs the models operate on. The '89 SOTA model used 16x16 greyscale images; today we have single-digit-megapixel color images. In 30 years a desktop will be able to train CLIP in 90 seconds, but what will the SOTA models be trained on?
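Just to put rough numbers on that input gap (a back-of-envelope sketch; the megapixel figure is an assumed round number):

    # Rough input dimensionality then vs now; illustrative figures only.
    pixels_1989 = 16 * 16 * 1        # 16x16 greyscale -> 256 input values
    pixels_now  = 3_000_000 * 3      # ~3 MP color image -> 9,000,000 values
    print(pixels_now / pixels_1989)  # roughly 35,000x more raw input per example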
Human behaviour, in a way far more general than which token we might type next. To mimic humans as closely as is possible with the basic deep learning method, train something that can predict human behaviour in general. Training would require billions to quadrillions of hours of video, audio, and probably many other inputs, from many different people engaged in the full variety of human activity.
Why? By age 25 an adult has only about 146k hours of visual "training" experience, most of it repeated, derivative, and unproductive. And whatever evolution encoded can be read directly from the genome, so it doesn't need to be re-derived by rerunning millions of years of evolution.
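That 146k figure is just waking hours by age 25 (a quick sketch, assuming roughly 16 waking hours a day):

    # Waking hours by age 25, assuming ~16 waking hours per day.
    print(25 * 365 * 16)  # 146,000 hours of visual "training data"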
Much of that time also includes physical interaction with the world, which makes it far more valuable, because interaction can improve performance in a focused way.
Neural nets seem to learn much more slowly than humans. Even GPT-2 has seen orders of magnitude more tokens of language than a human experiences in a lifetime. At least as far as language is concerned, humans are able to extract a lot more information from their training data.
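A rough comparison, with every figure below an assumption (WebText, GPT-2's training corpus, is about 40 GB of text; the bytes-per-token and human word-exposure numbers are guesses, and published estimates vary a lot):

    # Back-of-envelope only; all numbers are rough assumptions.
    webtext_bytes   = 40e9               # WebText, GPT-2's corpus, ~40 GB of text
    bytes_per_token = 4                  # assumed average bytes per BPE token
    gpt2_tokens     = webtext_bytes / bytes_per_token   # ~1e10 tokens per epoch

    words_per_day = 20_000               # assumed words heard/read per day
    human_words   = words_per_day * 365 * 25            # ~1.8e8 words by age 25

    print(f"{gpt2_tokens:.0e} vs {human_words:.0e}, "
          f"ratio ~{gpt2_tokens / human_words:.0f}x")

The exact ratio depends entirely on those guesses, but the direction of the comparison is hard to escape.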
The lambda calculus (a system we know is capable of infinite self-complexity, learning, etc. with the right program) can be described in a few hundred bits. And a neural net can be described in the lambda calculus in perhaps a few thousand bits.
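As a small illustration of how little machinery that takes, here is a sketch of Church-numeral arithmetic in Python, built from nothing but single-argument functions:

    # Church numerals: arithmetic from nothing but one-argument lambdas.
    ZERO = lambda f: lambda x: x
    SUCC = lambda n: lambda f: lambda x: f(n(f)(x))
    ADD  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

    def to_int(n):
        # Decode a Church numeral by counting applications of f.
        return n(lambda k: k + 1)(0)

    TWO, THREE = SUCC(SUCC(ZERO)), SUCC(SUCC(SUCC(ZERO)))
    print(to_int(ADD(TWO)(THREE)))  # 5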
Also, we have no idea how "compressed" the genome is.
Basic structure can be encoded, yes (and obviously is, given that brains have consistent structure), but the weights or parameters, presuming brains learn via synaptic weights, obviously do not fit in the genome.
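An order-of-magnitude check, using commonly cited round numbers (all of them rough assumptions):

    # Genome capacity vs synaptic parameter count; rough round numbers.
    genome_bits  = 3.2e9 * 2    # ~3.2 billion base pairs, 2 bits each -> ~0.8 GB
    synapse_bits = 1e14 * 1     # ~100 trillion synapses at a generous 1 bit each
    print(synapse_bits / genome_bits)  # ~15,000x: the weights can't all be inherited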
One difference is that humans are actively involved in data collection, so when there is a gap in their knowledge, they don't just wait for the information to show up; they ask a question, etc.
We might have megapixel images readily captured with phone cameras, but virtually all vision models in common use take 224x224-resolution images as input, or maybe 384x384. Anything higher resolution just gets resampled down. It seems that, for now, you are better off spending your compute budget on a bigger "brain" than on better "eyes".
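For concreteness, a standard ImageNet-style pipeline (sketched here with torchvision) throws away most of that resolution before the network ever sees the image:

    # Typical ImageNet-style preprocessing: a 12 MP phone photo and a thumbnail
    # both end up as the same 224x224 tensor.
    from torchvision import transforms

    preprocess = transforms.Compose([
        transforms.Resize(256),       # shorter side down to 256 px
        transforms.CenterCrop(224),   # the network only sees 224x224
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])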
I don't think that's current. Certainly the object detection models work on bigger images, and the datasets they're pretrained on, e.g. COCO, are not 224x224. I think standard models pretrained on ImageNet, like the ResNets, usually have everything resized to 224x224, and so they favor this kind of scaling.
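For example, the detection models that ship with torchvision rescale inputs to a much larger working size and accept variable-sized images (a sketch assuming a recent torchvision; the ~800/1333 resize defaults are from memory, so check model.transform on your version):

    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    print(model.transform)  # shows the resize defaults (shorter side ~800, max ~1333)

    # The model takes a list of variable-sized images, not a fixed 224x224 batch.
    images = [torch.rand(3, 1080, 1920)]   # e.g. a full-HD frame
    with torch.no_grad():
        preds = model(images)
    print(preds[0]["boxes"].shape)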