
What do you mean? Attention is just matrix multiplication and softmax; it's all feed-forward.
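For concreteness, here's a minimal sketch of that claim, assuming NumPy; the weight names (W_q, W_k, W_v) and toy dimensions are illustrative, not from any particular library. Single-head attention is two matrix products around one softmax, with no recurrence or state carried between positions:

    import numpy as np

    def softmax(x, axis=-1):
        # Subtract the row max for numerical stability before exponentiating.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(X, W_q, W_k, W_v):
        # X: (seq_len, d_model). Everything here is matrix multiplication
        # plus one softmax -- a single forward pass, no recurrence.
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len)
        return softmax(scores) @ V               # weighted sum of value rows

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 8))
    W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
    out = attention(X, W_q, W_k, W_v)  # shape (5, 8), computed in one pass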


I wasn't disputing that it's feed-forward. I just meant that stacked transformer layers can be thought of as iterative refinement of the intermediate activations: not the same as an autoregressive process that receives previous outputs as inputs, but far more expressive than a single transformer layer.
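A toy sketch of that reading, reusing the attention() function from the sibling comment with hypothetical per-layer weights: thanks to the residual connection, each layer reads the current activations and writes a correction back into the same stream, so depth does the work that iteration otherwise would.

    def transformer_stack(X, layers):
        # layers: list of (W_q, W_k, W_v) tuples, one per layer.
        for W_q, W_k, W_v in layers:
            # Residual connection: each layer refines X rather than
            # replacing it -- an iterative update unrolled across depth.
            X = X + attention(X, W_q, W_k, W_v)
        return X

    layers = [tuple(rng.standard_normal((8, 8)) for _ in range(3))
              for _ in range(4)]
    refined = transformer_stack(X, layers)  # four refinement steps, still feed-forward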



