It seems easy to you because you can't remember the years when you were a toddler and had to learn basic interactions with the world around you. It seems natural to an adult but it is quite complex.
But this argument applies just as well to plenty of other tasks that AIs handle fine. So it doesn't explain why this particular action is so much harder than everything else.
In particular, the task requires understanding that one can impact the world through action. Humans learn this through a constant feedback loop that runs for months to a year or more. The way we train AIs doesn't seem to teach this agency, only the ability to mimic having it in whatever forms we can capture data for (such as online discussions). Will that training eventually give rise to real agency? I'm doubtful with most current models, given that the learning process is so disconnected from execution and that execution is prompted rather than inherently ongoing. Maybe some agent swarm that is always running, always training, and upgrading its members could achieve that level of agency, which is why I'm not saying it's impossible. But I expect we'll have to wait for some newer kind of model that trains as it runs, continuously, before we see true agency develop.
Until then, it's a question of whether we can capture the appearance of agency in the training set well enough to learn it from training alone, without depending on interaction to learn more.
I don't think I am, and for context here I have built my own DQNs from scratch to learn to play games like Snake.
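For concreteness, "from scratch" means roughly the loop below. It's a minimal sketch (no target network or other niceties), and SnakeEnv here is a stand-in for whatever grid-world wrapper you write yourself: reset() returns a state, step(action) returns (state, reward, done).

```python
# Minimal DQN sketch (PyTorch). The env interface is assumed, not a real library:
#   env.reset() -> state (list of floats), env.step(action) -> (state, reward, done)
import random
from collections import deque

import torch
import torch.nn as nn


class QNet(nn.Module):
    def __init__(self, n_inputs, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)


def train(env, n_inputs, n_actions, episodes=500, gamma=0.99, eps=0.1):
    q = QNet(n_inputs, n_actions)
    opt = torch.optim.Adam(q.parameters(), lr=1e-3)
    buffer = deque(maxlen=10_000)

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: mostly exploit the current Q estimate, occasionally act randomly
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    action = q(torch.tensor(state, dtype=torch.float32)).argmax().item()
            next_state, reward, done = env.step(action)
            buffer.append((state, action, reward, next_state, done))
            state = next_state

            if len(buffer) >= 64:
                batch = random.sample(buffer, 64)
                s, a, r, s2, d = [torch.tensor(x, dtype=torch.float32) for x in zip(*batch)]
                # Bellman target: r + gamma * max_a' Q(s', a'), zeroed at episode end
                target = r + gamma * q(s2).max(dim=1).values * (1 - d)
                pred = q(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(pred, target.detach())
                opt.zero_grad()
                loss.backward()
                opt.step()
    return q
```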
I'd argue that if you consider the size of the input and output spaces here, it's not as complex as you're implying.
To refer back to my example, telling the difference between four-legged creatures is complicated because there's a huge number of possible outputs and the visual input space is both large and complex. Learning how to detect patterns in raw image data is hard, which is why we and other animals are preloaded with the neurological structures to do it. It's also why we often use pretrained models when training models to label new outputs – simply learning to detect basic patterns in visual data is difficult enough that if this step can be skipped, it often makes sense to skip it.
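As a rough sketch of what I mean by skipping that step (torchvision's ResNet-18 as the example; the number of target classes is a placeholder): freeze the backbone that already detects low-level visual patterns and only train a new head for the new labels.

```python
# Minimal sketch of reusing pretrained visual features instead of learning
# low-level pattern detection from scratch.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False      # freeze the hard-won edge/texture/shape detectors

# Only the new head is trained to map those features onto the new labels,
# e.g. 4 classes of four-legged animals (placeholder count).
backbone.fc = nn.Linear(backbone.fc.in_features, 4)
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
```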
In contrast, the inputs to Minecraft are comparatively simple: you have a handful of buttons, and those buttons can be pressed for different durations. Similarly, the output space here, while large, is relatively simple, and presumably detecting that an action like holding a button results in a state change shouldn't be that complex to learn... I mean, it's already learning that pressing a button results in a state change, so I think you'd need to explain to me why adding a tiny bit of additional complexity here is so unreasonable. Maybe I'm missing something.
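To make the asymmetry concrete, here's roughly what one tick of control looks like versus one frame of observation. The key names are MineRL-style and purely illustrative; the exact set depends on the environment wrapper.

```python
# Illustrative only: the control side of Minecraft is a small, structured space,
# while the observation side is raw pixels.
action = {
    "forward": 1, "back": 0, "left": 0, "right": 0,
    "jump": 0, "sneak": 0, "sprint": 0,
    "attack": 1,               # hold this across many ticks to mine a block
    "camera": (0.0, 0.0),      # pitch/yaw delta in degrees
}
observation_size = 64 * 64 * 3   # even a heavily downscaled RGB frame is ~12k raw values
print(len(action), observation_size)   # 9 control slots vs 12288 input values per step
```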
> I think you'd need to explain to me why adding a tiny bit of additional complexity here is so unreasonable
As far as I understand, DreamerV3 doesn't employ intrinsic rewards (as in novelty-based exploration). It relies on stochastic exploration, which makes it practically impossible to reach rewards that require consistently repeating an action with no intermediate rewards.
And finding intrinsic rewards that work well across diverse domains is a complex problem in itself.
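For anyone unfamiliar, one of the simplest intrinsic rewards is a count-based novelty bonus: pay the agent for reaching states it hasn't seen much, so long reward-free stretches (like digging straight down for a minute) still produce learning signal. A naive sketch below; the state-hashing line is a placeholder, and it's exactly the part that's hard to make work across diverse domains.

```python
# Naive count-based novelty bonus: reward visiting states the agent hasn't seen much.
import math
from collections import defaultdict


class CountBasedBonus:
    def __init__(self, scale=0.1):
        self.counts = defaultdict(int)
        self.scale = scale

    def __call__(self, state):
        key = hash(tuple(state))          # placeholder state abstraction
        self.counts[key] += 1
        # Bonus decays roughly as 1/sqrt(N(s)) as a state becomes familiar.
        return self.scale / math.sqrt(self.counts[key])


# In a training loop the agent would optimise extrinsic + intrinsic reward:
#   shaped_reward = reward + bonus(observation)
bonus = CountBasedBonus()
print(bonus([1, 2, 3]), bonus([1, 2, 3]))   # second visit gets a smaller bonus
```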
Example: When humans play Minecraft, they already know object permanence from the real world. I didn't see anywhere that the AI was trained to learn object permanence. Yet it's required for basics like finding your mineshaft again after turning around.