> In this work, we use sine and cosine functions of different frequencies:
> PE(pos,2i) = sin(pos/10000^{2i/d_model})
> PE(pos,2i+1) = cos(pos/10000^{2i/d_model})
> where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
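
The quoted formulas and the "linear function" claim can be checked numerically. The following is a minimal sketch, not the paper's implementation: the function names `positional_encoding` and `shift`, the `base` parameter, and the even-`d_model` restriction are assumptions made for illustration. Under these definitions, the pair of dimensions (2i, 2i+1) has wavelength 2π · 10000^{2i/d_model}, and moving from pos to pos+k amounts to rotating each pair by an angle that depends only on k, which follows from the angle-addition identities for sine and cosine.

```python
import math


def positional_encoding(pos, d_model, base=10000.0):
    """Sinusoidal encoding of a single position (hypothetical helper).

    Even indices hold sin(pos / base^(2i/d_model)), odd indices hold the
    matching cosine, following the quoted formulas.
    """
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe[2 * i] = math.sin(angle)
        pe[2 * i + 1] = math.cos(angle)
    return pe


def shift(pe, k, base=10000.0):
    """Map PE_pos to PE_{pos+k} using only k (hypothetical helper).

    Each (sin, cos) pair is rotated by theta = k / base^(2i/d_model);
    the coefficients do not depend on pos, so the map is linear in pe.
    """
    d_model = len(pe)
    out = [0.0] * d_model
    for i in range(d_model // 2):
        theta = k / base ** (2 * i / d_model)
        s, c = pe[2 * i], pe[2 * i + 1]
        out[2 * i] = s * math.cos(theta) + c * math.sin(theta)      # sin(a + theta)
        out[2 * i + 1] = c * math.cos(theta) - s * math.sin(theta)  # cos(a + theta)
    return out


# Shifting the encoding of position 5 by k = 3 should match position 8.
d_model = 8
shifted = shift(positional_encoding(5, d_model), 3)
direct = positional_encoding(8, d_model)
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(shifted, direct))
```

Because `shift` never looks at the position itself, the same per-pair rotation works for every pos, which is the property the quoted passage hypothesizes makes attending by relative position easy to learn.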