It is explained in the paper (*Attention Is All You Need*, Section 3.5):

> In this work, we use sine and cosine functions of different frequencies:

> PE(pos,2i) = sin(pos/10000^{2i/d_model})

> PE(pos,2i+1) = cos(pos/10000^{2i/d_model})

> where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
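
The last claim in the quote can be checked numerically. Below is a minimal sketch in plain Python, not the paper's code: the function names `positional_encoding` and `shift` are my own, and it assumes `d_model` is even. For each sine/cosine pair, the angle-addition identities give a rotation that depends only on the offset `k` and the frequency, never on `pos`, which is what "linear function of PE_{pos}" amounts to.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position (hypothetical helper, d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / 10000 ** (2 * i / d_model)
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) using coefficients that depend only on k."""
    out = []
    for i in range(d_model // 2):
        w = 1 / 10000 ** (2 * i / d_model)  # angular frequency of this pair
        s, c = pe[2 * i], pe[2 * i + 1]
        # sin(a + b) = sin a cos b + cos a sin b
        out.append(s * math.cos(w * k) + c * math.sin(w * k))
        # cos(a + b) = cos a cos b - sin a sin b
        out.append(c * math.cos(w * k) - s * math.sin(w * k))
    return out

d_model, pos, k = 8, 5, 3
shifted = shift(positional_encoding(pos, d_model), k, d_model)
target = positional_encoding(pos + k, d_model)
# The shifted encoding matches the encoding of pos + k up to rounding error.
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(shifted, target))
```

Note that `shift` never looks at `pos` itself, only at the encoding and the offset, so the same fixed map works for every position.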