
If I remember correctly, there were similar scenarios in the popular Berkeley Pacman environment, where the agent would run into a ghost to avoid the penalty of living for too long.



The example you're thinking of is actually in gridworld [1]. As you allude to, one of the parameters of the model is the cost of simply being alive for an additional time-step. If the cost is negative (a reward), the agent will just sit there forever and accumulate infinite points. If it is zero, it might still just sit there to avoid falling into the hole, which carries a large penalty and ends the simulation. As you turn up the dial on the cost of living, the agent adopts increasingly aggressive strategies to reach the goal quickly. But if you make it too big, it will just jump in the hole. (There's a rough sketch of the effect after the link below.)

[1] https://inst.eecs.berkeley.edu/~cs188/fa18/assets/slides/lec...
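Not the actual CS188 code, just a minimal value-iteration sketch of that effect on a made-up deterministic 3x4 grid with a +1 goal and a -10 hole. The layout, reward magnitudes, and the step/solve helpers are all hypothetical, and the real gridworld adds transition noise that this toy leaves out (so the "sit still at zero cost" case won't reproduce here):

    # Toy gridworld: how the living reward changes the greedy policy.
    # Layout, rewards, and parameters are invented for illustration.
    import numpy as np

    ROWS, COLS = 3, 4
    GOAL, HOLE = (0, 3), (1, 3)   # terminal states: +1 goal, -10 hole
    ACTIONS = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}
    GAMMA = 0.99

    def step(state, action):
        # Deterministic move, clamped to the grid (no transition noise).
        r, c = state
        dr, dc = ACTIONS[action]
        return (max(0, min(ROWS - 1, r + dr)), max(0, min(COLS - 1, c + dc)))

    def solve(living_reward, iters=500):
        V = np.zeros((ROWS, COLS))
        for _ in range(iters):
            new_V = np.zeros_like(V)
            for r in range(ROWS):
                for c in range(COLS):
                    s = (r, c)
                    if s == GOAL:
                        new_V[s] = 1.0
                    elif s == HOLE:
                        new_V[s] = -10.0
                    else:
                        new_V[s] = max(living_reward + GAMMA * V[step(s, a)]
                                       for a in ACTIONS)
            V = new_V
        # Greedy policy for non-terminal states.
        return {(r, c): max(ACTIONS, key=lambda a: V[step((r, c), a)])
                for r in range(ROWS) for c in range(COLS)
                if (r, c) not in (GOAL, HOLE)}

    # Positive living reward: loiter forever. Small negative: head for the
    # goal while avoiding the hole. Very negative: the hole starts to look
    # cheaper than a long walk to the goal.
    for lr in (+0.1, 0.0, -0.04, -20.0):
        print(lr, solve(lr))

Running it prints one greedy policy per living-reward setting; with these made-up numbers the agent loiters at +0.1, heads for the goal around -0.04, and from states far from the goal prefers the hole once the per-step cost dwarfs the hole penalty.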


It reminds me of the thread about the Quake 3 bots which, left alone for several years, figured out that the best approach was to not kill each other.

https://i.imgur.com/dx7sVXj.jpg


Without knowledge of their reward function, it's difficult to tell whether they've converged on this strategy or it's just broken.





