Human behaviour in a way far more general than which token we might next type. To mimic humans as closely as possible with the basic deep learning method, you would train something that can predict human behaviour in general. Training would require billions to quadrillions of hours of video and audio, and probably many other inputs, from many different people engaged in the full variety of human activity.
Why? By age 25 an adult has only about 146k hours of video experience as “training,” most of it repeated, derivative, and unproductive. And whatever evolution encoded in their genes can be read off their genome, so it doesn’t need to be relearned through millions of years of evolution.
Much of that time also includes physical interaction with the world, which makes it far more valuable because it can improve performance in a focused way.
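For reference, the 146k figure is consistent with counting waking hours only. A minimal sketch of that arithmetic, assuming roughly 16 waking hours per day (that number is my assumption, not something stated in the comment):

```python
# Assumption: about 16 waking hours per day; the comment does not state this.
WAKING_HOURS_PER_DAY = 16
DAYS_PER_YEAR = 365
AGE_YEARS = 25


def waking_hours(age_years, hours_per_day=WAKING_HOURS_PER_DAY):
    """Total waking hours accumulated by a given age."""
    return age_years * DAYS_PER_YEAR * hours_per_day


total = waking_hours(AGE_YEARS)
print(f"{total:,} waking hours by age {AGE_YEARS}")  # 146,000 waking hours by age 25
```

Changing the hours-per-day assumption moves the total proportionally, but it stays in the low hundreds of thousands of hours either way.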
Neural nets seem to learn much more slowly than humans. Even GPT-2 has seen orders of magnitude more tokens of language than a human experiences in a lifetime. At least as far as language is concerned, humans are able to extract a lot more information from their training data.
The lambda calculus (a Turing-complete system, so with the right program it is capable of unbounded complexity, learning, etc.) can be described in a few hundred bits. And a neural net can be described in the lambda calculus in perhaps a few thousand bits.
Also, we have no idea how "compressed" the genome is.
Basic structure can be encoded, yes (and obviously is, given that brains have consistent structure), but the weights or parameters, presuming that brains learn via synaptic weights, obviously do not fit in the genome.
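A back-of-the-envelope version of that capacity argument, as a sketch only. Every number here is a rough, commonly cited estimate that I am assuming for illustration (about 3.1 billion base pairs, on the order of 10^14 synapses, and a deliberately stingy one bit per synaptic weight); none of them come from the comment itself:

```python
# All figures below are rough estimates used as assumptions, not measurements.
BASE_PAIRS = 3.1e9        # approximate length of the human genome
BITS_PER_BASE_PAIR = 2    # four possible bases, so 2 bits each
SYNAPSES = 1e14           # order-of-magnitude estimate of synapse count
BITS_PER_SYNAPSE = 1      # deliberately low: one bit per weight

genome_bits = BASE_PAIRS * BITS_PER_BASE_PAIR
weight_bits = SYNAPSES * BITS_PER_SYNAPSE
ratio = weight_bits / genome_bits

print(f"genome capacity : {genome_bits:.2e} bits")
print(f"synaptic weights: {weight_bits:.2e} bits")
print(f"weights exceed genome capacity by a factor of about {ratio:,.0f}")
```

Under these assumptions the gap is roughly four orders of magnitude even before any compression of the genome is considered, so the conclusion does not hinge on the exact estimates.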
One difference is that humans are actively involved in data collection, so when there is a gap in their knowledge, they don’t just wait for the information to show up; they ask a question, run an experiment, and so on.