An important caveat from the paper

>Moreover, we follow previous work in accelerating block breaking because learning to hold a button for hundreds of consecutive steps would be infeasible for stochastic policies, allowing us to focus on the essential challenges inherent in Minecraft.




Like all things RL, it is 99.9% about engineering the environment and rewards. As one of the authors stated elsewhere here, there is a reward for completing each of 12 steps necessary to find diamonds.
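
For a sense of what that looks like, a sparse milestone reward can be as simple as the sketch below (the item names are my guess at a MineRL-style diamond progression, not necessarily the paper's exact list):

  # Hypothetical sketch of a sparse milestone reward: +1 the first time
  # each item in the diamond progression shows up in the inventory.
  MILESTONES = ["log", "plank", "stick", "crafting_table", "wooden_pickaxe",
                "cobblestone", "stone_pickaxe", "iron_ore", "furnace",
                "iron_ingot", "iron_pickaxe", "diamond"]

  def milestone_reward(inventory, already_rewarded):
      """inventory: dict of item -> count; already_rewarded: set, mutated."""
      reward = 0.0
      for item in MILESTONES:
          if inventory.get(item, 0) > 0 and item not in already_rewarded:
              already_rewarded.add(item)
              reward += 1.0
      return reward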

Mostly I'm tired of RL work being oversold by its authors and proponents by anthropomorphizing its behaviors. All while this "agent" cannot reliably learn to hold down a button, literally the most basic interaction of the game.


The "no free lunch" theorem. You can't start from scratch and expect your program to repeat 4 billion years of evolution collecting inductive biases useful in our corner of our Universe in a matter of hours[1].

While it's possible to bake in this particular inductive bias (repetitive actions might be useful), they decided not to (it's just not that interesting).

[1] And you certainly can't reproduce the observation selection effect in a laboratory. That is the thing that makes it possible to overcome the "no free lunch" theorem: our existence and intelligence are conditional on evolution being possible and finding the right biases.

We have to bake in inductive biases to get results. We have to incentivize behaviors useful (or interesting) to us to get useful results instead of generic exploration.


You don't have to repeat 4 billion years of evolution; an RL agent lives inside a strange universe whose basic axioms happen to be exactly aligned with what you can do in that universe.

Its actions are not muscular; they are literal gameplay actions. It is orders of magnitude easier to learn that the same action should be performed until completion than that the finger should be pressed against a surface while the hand is stabilized with respect to the cursor on a screen.

One of the most interesting (and pathological) things about humans is that we learn what is rewarding. Not how to get a reward, but actually we train ourselves to be rewarded by doing difficult/novel/funny/etc things. Notably this is communicated largely by being social, i.e., we feel reward for doing something difficult because other people are impressed by that.

In Cast Away, Hanks' only companion is a mute, deflated ball, but nonetheless he must keep that relationship alive---to keep himself alive. The climax of the movie is when Hanks returns home and people are so impressed that his efforts are validated.

Contrast that with RL: there is no intrinsic motivation. The agents do not play, or meaningfully explore, really. The extent of their exploration is a nervous tic that makes them press the wrong button with probability ε. The reason the agent cannot hold down buttons is that it explores by having Parkinson's disease, by accident, not because it thought it might find out something useful/novel/funny/etc. In fact, it can't even have a definition of those words, because they are defined in the space between beings.


Personally I am almost certain that the current framing of RL and its relationship to animal behavior is deeply misguided. It proves close to impossible to train animals using this paradigm (not for a lack of trying), i.e. animals such as mice only make any progress when water deprived and under conditions that exploit their natural instincts. Nevertheless they are capable of far more complex natural behaviors. There is a non-zero chance that RL as an explanation of animal behavior is just plain wrong or not applicable.


I naively believe that the lack of performance is a problem of connectivity. Animal brains don't use directed graphs, probably for the very reason that latching states, like holding a button, become unreasonable. Our brains probably use small-world network graphs [1][2].

[1] definition: https://en.wikipedia.org/wiki/Small-world_network

[2] evidence for our brains: https://www.semanticscholar.org/paper/Small-world-directed-n...


> The agents do not play, or meaningfully explore, really

As others have already pointed out, it's not an intrinsic limitation of RL agents.

> In fact, it can't even have a definition of those words, because they are defined in the space between beings.

In fact, an agent doesn't need to know definitions to act. A bacterium doesn't know what it means to reproduce, but it reproduces anyway.


Go read the Intrinsic Curiosity Module papers, 1 and 2.


Are you referring to this one?

https://arxiv.org/pdf/1905.10071


No, not that one. The first ICM paper:

https://pathak22.github.io/noreward-rl/

and the follow-up, which addresses the noise unpredictability problem.

There are more after that which I believe fail the black pill and miss the point of ML, ASIC-ifying the architecture with human priors. But the broader point is to show that RL is not just discovering solutions by chance through random actions. Nature starts with priors, and curiosity is one of the universal policy-bootstrapping techniques (others might be imitation, next-state prediction, total nearby replication count).

There is also a paper that deployed ICM on a physical robot, and it just played with a ball because it was the only source of novel stimuli, and inadvertently learned how to operate its arms. There was no other reward in the environment except for curiosity. It is amazing, and slightly creepy. I think the ICM will be rediscovered later in ML tech.
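
For anyone who hasn't read them, the core idea fits in a few lines: the intrinsic reward is the forward model's prediction error in a learned feature space, with an inverse model keeping the features focused on what the agent can influence. A rough PyTorch-style sketch (my own naming; encoder architecture and training loop omitted):

  import torch
  import torch.nn as nn

  class ICM(nn.Module):
      """Sketch of an Intrinsic Curiosity Module (Pathak et al., 2017)."""
      def __init__(self, obs_dim, n_actions, feat_dim=64):
          super().__init__()
          self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
          # Forward model: (phi(s), a) -> predicted phi(s')
          self.forward_model = nn.Linear(feat_dim + n_actions, feat_dim)
          # Inverse model: (phi(s), phi(s')) -> a; training it keeps the
          # features focused on things the agent's actions can affect
          self.inverse_model = nn.Linear(2 * feat_dim, n_actions)

      def intrinsic_reward(self, obs, action_onehot, next_obs):
          phi, phi_next = self.encoder(obs), self.encoder(next_obs)
          pred = self.forward_model(torch.cat([phi, action_onehot], dim=-1))
          # Curiosity bonus: squared prediction error of the forward model
          return 0.5 * (pred - phi_next).pow(2).sum(dim=-1)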


> While it's possible to bake in this particular inductive bias (repetitive actions might be useful), they decided not to (it's just not that interesting).

What's interesting to me about this is that the problem seems really aligned with the research they are doing. From what I can tell, they built a system where the agent has a simplified "mental" model of the game world, which it uses to predict actions that will lead to better rewards.

I don't think what's missing here is teaching the model that it should just try to do things a lot until they succeed. Instead, what I think is missing is the context that it's playing a game, and what that means.

For example, any human player who sits down to play Minecraft is likely to hold down the button to mine something. Younger children might also hold the jump button down and jump around aimlessly, but older children and adults probably wouldn't. Why? I suspect it's because people with experience in video games have set expectations for how game designers communicate the gameplay experience. We understand that clicking on things to interact with them is a common mode of interaction, and we expect that games have upgrade mechanics that will let us work faster or interact with higher-level items. It's not that we repeat any action arbitrarily to see whether it pays off, but rather that we're speaking a language of games, modeling the minds of the game designers, and anticipating what they expect from us.

I would think that trying to expand the model of the world to include this notion of the language of games might be a better approach to overcoming the limitation instead of just hard-coding the model to try things over and over again to see if there's a payoff.


Isn’t that exactly what AlphaZero did?

“AlphaZero was trained solely via self-play using 5,000 first-generation TPUs to generate the games and 64 second-generation TPUs to train the neural networks, all in parallel, with no access to opening books or endgame tables. After four hours of training, DeepMind estimated AlphaZero was playing chess at a higher Elo rating than Stockfish 8; after nine hours of training, the algorithm defeated Stockfish 8 in a time-controlled 100-game tournament (28 wins, 0 losses, and 72 draws).” [emphasis added] https://en.wikipedia.org/wiki/AlphaZero


I thought that this might be a rare chance to invoke the NFL theorem appropriately, but I guess I was wrong. The NFL theorem assumes a uniform distribution over problems, which is probably never the case, at least in habitable universes.

Nevertheless, the theorem basically states that there are games where AlphaZero will be beaten by another algorithm. Even if those games are nonsensical from our point of view.


Games drawn from this uniform distribution can't even be implemented in our physical universe (you would need exponentially large lookup tables to store the rules). There is no chance of ever encountering any of them.
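
To put a rough number on "exponentially large": with 64x64x3 byte observations (the 12288 figure a sibling comment mentions) there are 256^12288 ≈ 10^29592 possible inputs, so even a deterministic rule table drawn uniformly at random would need on the order of that many entries just to be written down.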

Of course, there are "games" like "invert SHA-512" that can be implemented in our world but are probably impractical to learn. But NFL has nothing to say about them; a game that simple has measure zero in a uniform distribution over problems.


I forget, was it Alpha or one of the others (Leela, Kata, FineArt,...) which had a weakness against... I wanna say the Micro Chinese (?), where it would consistently play the same suboptimal sequence that let players beat it easily if they took that path.


> I thought that it might be a rare chance to invoke the NFL theorem appropriately, but I guess I was wrong

Haha, I wouldn’t feel bad. It’s one of the most misunderstood theorems, and I don’t think I’ve ever seen it invoked correctly on a message board.


Given that a computer should be able to simulate at least some applicable aspects and processes of reality billions of times faster than our own universe runs: yes, I think it is entirely reasonable to have these agents follow at least some kind of from-scratch evolutionary history. It might also be valuable: it could further research into understanding what the word "applicable" there even means; what parts of our evolutionary history are important for inductively reasoning your way toward a diamond in Minecraft? What parts aren't? How can that generalize?

If you code a reward function for each step necessary to get a diamond, you are teaching the AI how to do it. There is no other way to look at it. It's extremely unethical to claim, as Nature does, that it did this without "being taught", and it is in my eyes academic malpractice to claim, as their paper does, that it did this "without human data or curricula", though this is mitigated by the fact that they admit it in the paper, if that is indeed the case; I am still digesting the paper, as it is quite technical.

This isn't an LLM, I'm aware of this, but I am at the point where if I could bet on the following statement being true, I'd go in at five figures: Every major AI benchmark, advancement, or similar accomplishment in the past two years can almost entirely be explained by polluted training data. These systems are not nearly as autonomously intelligent as anyone making money on them says they are.


> You can't start from scratch and expect your program to repeat 4 billion years of evolution collecting inductive biases useful in our corner of our Universe in a matter of hours

Really? Minecraft's gameplay dynamics are not particularly complex... The AI here isn't learning highly complex rules about the nuances of human interaction or learning to detect the relatively subtle differences between various four-legged creatures based on small differences in body morphology. In those cases I could see how millions of years of evolution is important, at least to give us and other animals a head start when entering the world. If the AI had to do something like this to progress in Minecraft then I'd get why learning those complexities would be skipped over.

But in this case a human would quickly understand that holding a button creates a state which tapping a button does not, and therefore would assume this state could be useful to explore further states. Identifying this doesn't seem particularly complex to me. If the argument is that it will take slightly longer for an AI to learn patterns in dependent states then okay, sure, but I think arguing that learning that holding a button creates a new state is such a complex problem that we couldn't possibly expect an AI to learn it from scratch within a short timeframe is a very weak argument. It's just not that complex. To me this suggests that current algorithms are lacking.


It seems easy to you because you can't remember the years when you were a toddler and had to learn basic interactions with the world around you. It seems natural to an adult but it is quite complex.


But this argument applies just as well to tons of other tasks AIs can handle just fine. So it doesn't explain why this particular action is so much harder compared to anything else.


In particular, the task requires understanding that one can impact the world through action. This is learned by humans through a constant feedback loop running for months to a year or more. The very way we train AIs doesn't seem to teach this agency, only the ability to mimic having that agency in ways that we can capture data for (such as online discussions). Will that training eventually give rise to such agency? I'm doubtful with most current models, given that the learning process is so disconnected from the execution and that execution is prompted rather than inherently ongoing. Maybe some agent swarm that is always running, always training, and upgrading its members could achieve that level of agency, which is why I'm not saying it is impossible, but I expect we will have to wait for some newer model that is always running and training as it runs to see true agency develop.

Until then, it is a question of whether we can capture the appearance of agency in the training set well enough to learn it from training alone, without depending on interaction to learn more.


Which tasks?


> basic interactions with the world around you, tasks that seem easy to us but are actually quite complex

Tasks such as:

  - recognizing objects in our surroundings,
  - speaking,
  - reasoning about other people's thoughts and feelings,
  - playing go?
All of those were at some point "easy for us but very hard for computer programs".


I don't think I am, and for context here I have built my own DQNs from scratch to learn to play games like Snake.

I'd argue that if you consider the size of the input and output space here, it's not as complex as you're implying.

To refer back to my example, telling the difference between four-legged creatures is complicated because there's a huge number of possible outputs and the visual input space is both large and complex. Learning how to detect patterns in raw image data is complicated, which is why we and other animals are preloaded with the neurological structures to do this. It's also why we often use pretrained models when training models to label new outputs – simply learning how to detect simple patterns in visual data is difficult enough that if this step can be skipped it often makes sense to skip it.

In contrast, the inputs to Minecraft are relatively simple – you have a handful of buttons which can be pressed, and those buttons can be pressed for different durations. Similarly, the output space here, while large, is relatively simple, and presumably detecting that an action like holding a button results in a state change shouldn't be that complex to learn... I mean, it's already learning that pressing a button results in a state change, so I think you'd need to explain to me why adding a tiny bit of additional complexity here is so unreasonable. Maybe I'm missing something.


> I think you'd need to explain to me why adding a tiny bit of additional complexity here is so unreasonable

As far as I understand, DreamerV3 doesn't employ intrinsic rewards (as in novelty-based exploration). It relies on stochastic exploration, which makes it practically impossible to reach rewards that require consistently repeating an action with no intermediate rewards.
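
To put numbers on that: if the policy assigns probability p to the attack action at each step, the chance of sampling it for N consecutive steps is p^N. Even with a fairly confident p = 0.9 and N = 200 (illustrative values; the paper only says "hundreds of steps"), that's 0.9^200 ≈ 7e-10, so the behaviour essentially never appears during exploration without shaping or acceleration.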

And finding intrinsic rewards that work well across diverse domains is a complex problem in itself.


Example: When humans play Minecraft, they already know object permanence from the real world. I did not see anywhere that AI got trained to learn object permanence. Yet it is required for basics like searching for your mineshaft after turning around.


And yet have you seen what toddlers are capable of learning on their own? It is natural to them.


> It is natural to them.

This is where the "inductive biases" from comment 43609692 are hidden, and this is what AI currently lacks.


> Minecraft's gameplay dynamic are not particularly complex...

I think you underestimate the complexity of going from 12288+400 changing numbers to a concept of gameplay dynamics in the first place. Or, in other words, your complexity prior is biased by experience.


When I was a child and first played Minecraft, I clicked instead of holding, and after 10 minutes I gave up, deciding that Minecraft was too hard.


What if you were in an environment where you had to play Minecraft for, say, an hour? Do you think your child brain would have eventually tried enough things (or had your finger slip and stay on the mouse a little longer), noticed that hitting a block caused an animation (maybe even connected it with the fact that your cursor highlights individual blocks with a black box), decided to explore that further, and eventually mined a block? Your example doesn't speak to this situation at all.


I think learning to hold a button down in itself isn't too hard for a human or robot that's been interacting with the physical world for a while and has learned all kinds of skills in that environment.

But for an algorithm learning from scratch in Minecraft, it's more like having to guess the cheat code for a helicopter in GTA: it's not something you'd stumble upon unless you have prior knowledge/experience.

Obviously, pretraining world models for common-sense knowledge is another important research frontier, but that's for another paper.


I had the same problem; I had learned from a Roblox mining game where mining a block required clicking it a bunch of times.


RL is useful for action selection and planning. Actually determining the mechanics of the game can be achieved with explicit instruction and definition of an action set.

I suppose whether you find this result intriguing or not depends on whether you're looking to build result-building planning agents over an indeterminate (and sizable!) time horizon, in which case this is a SOTA improvement and moderately cool, or whether you're looking for a god in the machine, which this is not.


If you have an alternative for RL in these use cases, please feel free to share.

When RL works, it really works.

The only alternative I have seen is deep networks with MCTS, and they are quick to ramp up to decent quality. But they hit caps relatively quickly.


And a relevant piece of ancient wisdom (exact date not known, but presumably before 1970):

> In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

> “What are you doing?”, asked Minsky.

> “I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.

> “Why is the net wired randomly?”, asked Minsky.

> “I do not want it to have any preconceptions of how to play”, Sussman said.

> Minsky then shut his eyes.

> “Why do you close your eyes?”, Sussman asked his teacher.

> “So that the room will be empty.”

> At that moment, Sussman was enlightened.


Well, to be fair... I (a human) had to look it up online the first time I played as well. I was repeatedly clicking on the same tree for an entire minute before that. I even tried several different trees just in case.


But it is possible to discover by holding down the button and realizing the block is getting progressively more "scratched".


In my mind, this generalizes to the same problem with other non-stochastic (deterministic) operations like logical conclusions (A => B).

I have a running bet with a friend that humans encode deterministic operations in neural networks too, while he thinks there has to be another process at play. But there might be something extra helping our neural networks learn the strong weights required for it. Or the answer is again: "more data".


"accelerating block breaking because learning to hold a button for hundreds of consecutive steps "

This is fine, and does not impact the importance of figuring out the steps.

For anybody who has done any tuning on systems that run at different speeds, adjusting for the speed difference is just engineering, and it allows you to get on with the more important/inventive work.
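
The usual engineering fix is a thin wrapper between the policy and the simulator. A sketch of one common approach (plain action repeat, under a gym-style step() API; the paper actually accelerates block breaking inside the environment, so this is an analogy rather than their implementation):

  class ActionRepeat:
      """Repeat the chosen action for several simulator steps so the
      policy acts on a slower timescale than the environment."""
      def __init__(self, env, repeat=8):
          self.env, self.repeat = env, repeat

      def reset(self):
          return self.env.reset()

      def step(self, action):
          total_reward, done, info = 0.0, False, {}
          for _ in range(self.repeat):
              obs, reward, done, info = self.env.step(action)
              total_reward += reward
              if done:
                  break
          return obs, total_reward, done, info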


I'm not sure it's a serious caveat if the "hint" or "control" is in the manual.


Sorry, I don't quite follow what you mean?


I didn't read the manual and when I was trying to help my kid play the game I couldn't figure out how to break blocks.


Turns out that AI are much better at playing video games if they're allowed to cheat.


"It allows AI to understand its physical environment and also to self-improve over time, without a human having to tell it exactly what to do."


In my view, the 'exactly' is crucial here. They do implicitly tell the model what to do by encoding it in the reward function:

> In Minecraft, the team used a protocol that gave Dreamer a ‘plus one’ reward every time it completed one of 12 progressive steps involved in diamond collection — including creating planks and a furnace, mining iron and forging an iron pickaxe.

This is also why I think the title of the article is slightly misleading.


It's kind of fair; humans also get rewarded for those steps when they learn Minecraft.


But they don't learn that way at all; my 7-year-old learns by watching YouTubers. There's a whole network of people teaching each other the game, and that's almost more fun than playing it alone.



