From their implementation it looks like they’re calculating tanh and then applyi... | Hacker News


lukah 11 months ago | on: Transformers Without Normalization

From their implementation it looks like they’re calculating tanh and then applying a weight and bias.

Lerc 11 months ago

Exactly, and that's what happens in LayerNorm too. So I figured the best basis for comparison would have been to leave that bit out when looking at their differences or similarities, because obviously the bits that have the same implementation will be the same.
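
To make the point concrete, here is a rough sketch in plain Python of the comparison being suggested: split each layer into a "core" step and the shared weight-and-bias step, then compare only the cores. The function names, the `alpha` input scale, and the sample values are all assumptions for illustration; this is not the paper's code.

```python
import math

def affine(xs, weight, bias):
    # Shared step: elementwise scale and shift, identical in both layers.
    return [w * x + b for x, w, b in zip(xs, weight, bias)]

def layernorm_core(xs, eps=1e-5):
    # LayerNorm without its affine step: zero mean, unit variance.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def tanh_core(xs, alpha=1.0):
    # Tanh layer without its affine step.
    # alpha is an assumed input scale, not taken from the thread.
    return [math.tanh(alpha * x) for x in xs]

# Hypothetical sample activations.
xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
weight = [1.0] * len(xs)  # identity scale
bias = [0.0] * len(xs)    # zero shift

ln_core = layernorm_core(xs)
dyt_core = tanh_core(xs)

# With identity weight and zero bias the affine step is a no-op,
# so any difference in output comes from the cores alone.
ln_out = affine(ln_core, weight, bias)
dyt_out = affine(dyt_core, weight, bias)

for x, a, b in zip(xs, ln_out, dyt_out):
    print(f"x={x:+.2f}  layernorm={a:+.4f}  tanh={b:+.4f}")
```

Since `affine` is the same function in both paths, comparing `ln_core` against `dyt_core` directly isolates the part that actually differs.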
