They are definitely feed-forward. Self-attention looks at all pairs of tokens in the context window, but the network never looks backwards in time at its own output. Data flows layer by layer, and each layer gets exactly one shot at influencing the output. That's feed-forward.
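
To make the point concrete, here is a toy numpy sketch (not any real model's implementation, just the data-flow shape): attention mixes information across all token pairs within a single pass, and the stack is a plain loop over layers with no feedback to earlier layers or to generated output.

    import numpy as np

    def attention(x, Wq, Wk, Wv):
        # Self-attention: every token attends to every token in the
        # context window (all pairs), but only within this one pass.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(q.shape[-1])
        weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
        return weights @ v

    def transformer(x, layers):
        # Strictly feed-forward: the input flows through each layer
        # exactly once; no layer ever sees the model's own output.
        for Wq, Wk, Wv in layers:
            x = x + attention(x, Wq, Wk, Wv)  # residual connection
        return x

    # Toy example: 4 tokens, 8-dim embeddings, 2 layers of random weights.
    rng = np.random.default_rng(0)
    d = 8
    tokens = rng.normal(size=(4, d))
    layers = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(2)]
    print(transformer(tokens, layers).shape)  # (4, 8)

Autoregressive generation does feed the sampled token back in as new input, but that loop lives outside the network; each forward pass through the layers is still purely feed-forward.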
