
All information about the past that will be available for predicting future tokens must be stored in the present state. So if some bits of information about a past token at time t_p will be used to predict a future token at time t_f, those bits must pass through every state at times from t_p to t_f. The bits are carried by the recurrence. Once information about past tokens is lost from the hidden state, it is gone forever, so it must be stored and carried across many steps until it finally becomes useful.
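As a sketch of that constraint, here is a minimal vanilla RNN in NumPy (sizes and weights are arbitrary illustrations, not from any particular model). The only channel from token t_p to token t_f is the single vector h, rewritten at every step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
d_in, d_h = 4, 8
W_x = rng.normal(size=(d_h, d_in)) * 0.1
W_h = rng.normal(size=(d_h, d_h)) * 0.1

def step(h, x):
    # Everything the model will ever know about past tokens
    # must fit into (and survive in) this single vector h.
    return np.tanh(W_h @ h + W_x @ x)

T = 10  # sequence length
xs = rng.normal(size=(T, d_in))
h = np.zeros(d_h)
for t in range(T):
    # Bits about any earlier token reach step t only by surviving
    # every intermediate application of step().
    h = step(h, xs[t])

print(h.shape)  # (8,)
```

If step() ever overwrites the bits encoding some early token, no later step can recover them; that is the "gone forever" part.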

The memory cost of making the RNN state way bigger is high when done naively, but maybe someone can figure out a clever way to avoid storing full hidden states in memory during training, or big improvements in hardware could make memory use less of a bottleneck.



> The memory cost of making the RNN state way bigger is high when done naively, but maybe someone can figure out a clever way to avoid storing full hidden states in memory during training, or big improvements in hardware could make memory use less of a bottleneck.

Isn't this essentially what Mamba [1] does via its 'Hardware-aware Algorithm'?

[1] https://arxiv.org/pdf/2312.00752
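Roughly, yes: one ingredient of Mamba's kernel is to avoid materializing the expanded state in slow memory and recompute it during the backward pass. A toy sketch of that recompute-instead-of-store idea, using a scalar diagonal recurrence h_t = a·h_{t-1} + b·x_t (Mamba's a and b are input-dependent; constants here for simplicity):

```python
import numpy as np

# Toy scan: h_t = a*h_{t-1} + b*x_t. Keep only chunk-boundary states,
# regenerate the in-chunk states on demand -- trading FLOPs for memory
# traffic, in the spirit of Mamba's hardware-aware algorithm.
a, b = 0.9, 0.5
xs = np.arange(16, dtype=float)
chunk = 4

# Forward: save one state per chunk boundary instead of all 16 states.
boundaries = [0.0]
h = 0.0
for c in range(0, len(xs), chunk):
    for x in xs[c:c + chunk]:
        h = a * h + b * x
    boundaries.append(h)

def states_in_chunk(c):
    # Recompute the states inside one chunk from its saved boundary,
    # e.g. when the backward pass needs them.
    h = boundaries[c // chunk]
    out = []
    for x in xs[c:c + chunk]:
        h = a * h + b * x
        out.append(h)
    return out

# Sanity check: recomputed states match a full naive scan that
# stored everything.
h, full = 0.0, []
for x in xs:
    h = a * h + b * x
    full.append(h)
assert np.allclose(states_in_chunk(8), full[8:12])
```

The real kernel also fuses the scan so the state lives in fast on-chip SRAM rather than HBM, but the memory-vs-recompute trade is the same shape.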




