
You can do Q-learning with a transformer. You simply define the state as the sequence of observations so far. This is in fact the natural thing to do in partially observed settings, so your distinction does not make sense.
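For concreteness, here is a minimal sketch (my own illustration, not from the comment) of that idea in PyTorch: a small transformer encodes the observation history, and the standard Bellman-target update is applied to the resulting Q-values. Names like SeqQNet and obs_dim are assumptions for the example.

```python
import torch
import torch.nn as nn

class SeqQNet(nn.Module):
    """Q-network whose 'state' is the whole observation sequence."""
    def __init__(self, obs_dim, n_actions, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.q_head = nn.Linear(d_model, n_actions)

    def forward(self, obs_seq):                 # obs_seq: (batch, T, obs_dim)
        h = self.encoder(self.embed(obs_seq))   # encode the full history
        return self.q_head(h[:, -1])            # Q-values given the whole sequence

def q_update(net, target_net, batch, gamma=0.99):
    """Ordinary Bellman-target regression, applied to sequence-valued states."""
    obs_seq, action, reward, next_obs_seq, done = batch
    q = net(obs_seq).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = reward + gamma * (1 - done) * target_net(next_obs_seq).max(dim=1).values
    return nn.functional.mse_loss(q, target)
```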


The distinction I'm drawing is between DT's reward-to-go conditioning and Q-learning's Bellman backup (with its discount factor), not the choice of architecture for the policy (see the sketch below). You could also do DTs with RNNs (though those have their own problems with memory).

Apologies if we're talking past one another.
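To make the contrast concrete, a small sketch of the two training signals (my own framing, names are illustrative): a Decision Transformer conditions on undiscounted reward-to-go and learns actions by supervised regression, while Q-learning regresses toward a bootstrapped, discounted Bellman target.

```python
def returns_to_go(rewards):
    """DT's conditioning signal: undiscounted sum of future rewards, no bootstrapping."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

# Decision Transformer: supervised regression of actions given reward-to-go
#   loss = || policy(rtg, obs_seq) - action ||^2
# Q-learning: bootstrapped, discounted Bellman target
#   loss = || Q(s, a) - (r + gamma * max_a' Q_target(s', a')) ||^2
```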



