
It is explained in the paper.

> In this work, we use sine and cosine functions of different frequencies:

> PE(pos,2i) = sin(pos/10000^{2i/d_model})

> PE(pos,2i+1) = cos(pos/10000^{2i/d_model})

> where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
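
To make the last sentence of the quote concrete, here is a rough sketch in plain Python. The function names `positional_encoding` and `shift`, and the choice `d_model=8`, are mine for illustration and do not come from the paper. It builds PE for a position from the two formulas above, then checks that PE_{pos+k} can be obtained from PE_{pos} by rotating each (sin, cos) pair through an angle that depends only on k, via the angle-addition identities.

```python
import math

def positional_encoding(pos, d_model=8):
    """Sinusoidal encoding for one position, following the quoted formulas.

    d_model=8 is an arbitrary small value chosen for illustration.
    """
    pe = []
    for i in range(d_model // 2):
        angle = pos / 10000 ** (2 * i / d_model)
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i+1
    return pe

def shift(pe, k, d_model=8):
    """Map PE_pos to PE_{pos+k} with a 2x2 rotation per (sin, cos) pair.

    The rotation angle k*w depends on the offset k and the pair's
    frequency w, but not on pos, so the map is linear in PE_pos.
    """
    out = []
    for i in range(d_model // 2):
        w = 1 / 10000 ** (2 * i / d_model)  # frequency of this pair
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(k * w) + c * math.sin(k * w))  # sin(w*(pos+k))
        out.append(c * math.cos(k * w) - s * math.sin(k * w))  # cos(w*(pos+k))
    return out

pos, k = 5, 3
direct = positional_encoding(pos + k)
via_linear_map = shift(positional_encoding(pos), k)

# Largest elementwise gap between the two; should be at float rounding level.
max_err = max(abs(a - b) for a, b in zip(direct, via_linear_map))
print(max_err)
```

Since the coefficients inside `shift` are fixed once k is fixed, the same map works for every pos, which is presumably what the authors mean by "a linear function of PE_{pos}".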


