
From their implementation it looks like they’re calculating tanh and then applying a weight and bias


Exactly, and that's what happens in LayerNorm too. So I figured the best basis for comparison would have been to leave that bit out when looking at their differences or similarities, because obviously the parts that share an implementation will be the same.
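To make that concrete, here is a rough sketch of the comparison I have in mind. It is not their code: the function names are made up for illustration, and I am assuming the layer is just an elementwise tanh followed by a per-element weight and bias, as described above. The affine step is shared, so only the two "core" functions are worth comparing.

```python
import math

def tanh_core(xs):
    # Assumed core of the tanh-based layer: elementwise squashing only.
    return [math.tanh(x) for x in xs]

def layernorm_core(xs, eps=1e-5):
    # Core of LayerNorm: subtract the mean, divide by the standard deviation
    # (biased variance, with a small eps for numerical stability).
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def affine(ys, weight, bias):
    # Shared step: per-element scale and shift, identical in both layers.
    return [w * y + b for y, w, b in zip(ys, weight, bias)]

def tanh_layer(xs, weight, bias):
    return affine(tanh_core(xs), weight, bias)

def layernorm_layer(xs, weight, bias):
    return affine(layernorm_core(xs), weight, bias)
```

Comparing tanh_core against layernorm_core directly isolates the part that actually differs; comparing the full layers mixes in the affine step, which is the same by construction.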



