
What are the practical implications of this?


From the abstract:

  By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning.
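
For reference, DyT as described in the paper is roughly an element-wise tanh(alpha * x) with a learnable scalar alpha, followed by the usual channel-wise scale and shift. A minimal PyTorch sketch (class name, init value, and parameter names are my assumptions, not the paper's code):

  import torch
  import torch.nn as nn

  class DyT(nn.Module):
      # Element-wise stand-in for LayerNorm/RMSNorm:
      # DyT(x) ~= weight * tanh(alpha * x) + bias
      def __init__(self, dim, init_alpha=0.5):  # 0.5 is an assumed default
          super().__init__()
          self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable scalar
          self.weight = nn.Parameter(torch.ones(dim))              # per-channel scale
          self.bias = nn.Parameter(torch.zeros(dim))               # per-channel shift

      def forward(self, x):
          # no mean/variance reduction -- purely element-wise
          return self.weight * torch.tanh(self.alpha * x) + self.bias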


Sure, but why would one prefer tanh over normalization layers if they have the same accuracy?

I suppose normalization kernels have reductions in them, but how hard are reductions in 2025?
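
To make the reduction point concrete: LayerNorm has to compute the mean and variance over the feature dimension before it can normalize anything, while DyT touches each element independently. A rough illustration (shapes and eps are arbitrary, not from the paper):

  import torch

  x = torch.randn(8, 1024, 768)  # (batch, seq, dim)

  # LayerNorm path: two reductions over the feature dim, then normalize
  mu = x.mean(dim=-1, keepdim=True)
  var = x.var(dim=-1, unbiased=False, keepdim=True)
  y_ln = (x - mu) / torch.sqrt(var + 1e-5)

  # DyT path: purely element-wise, no cross-element communication
  alpha = 0.5
  y_dyt = torch.tanh(alpha * x)

Presumably the appeal is less that reductions are hard and more that a purely element-wise op fuses trivially with its neighbours and takes activation statistics off the critical path entirely.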



