Someone who read the paper pointed out to me recently that there's an aspect to transformers/attention that uses the sin or cos function to determine which words to pay attention to or the spacing between them (I'm probably not expressing this correctly, so please correct me if I'm wrong). It seems really unintuitive that sin and/or cos would be a factor in human language - can you explain this?
That sounds like a reference to the concept of cosine similarity.
Imagine that words are spread out in a space.
Cosine similarity is a measure of similarity between two vectors (each word is encoded as a vector).
By measuring the cosine of the angle between two vectors, we can tell:
1) whether 2 vectors have the same angle (2 words have the same meaning or close enough) when the cosine is close to 1
2) whether 2 vectors are perpendicular (2 words don't have anything to do with each other) when the cosine is close to zero
3) whether 2 vectors are opposite in direction (2 words have opposite meanings in some aspect) when the cosine is close to -1
Cosine similarity is like comparing two people's interests.
If two people have similar interests, the angle between them is small, and the cosine similarity value will be high.
If two people have completely different interests, the angle between them is large, and the cosine similarity value will be low.
So, cosine similarity is a way to measure how similar two things are by looking at the angle between them.
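To make that concrete, here's a minimal sketch of cosine similarity in plain Python (the "word vectors" are made up for illustration, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-D "word vectors", invented for this example:
king = [1.0, 0.9]
queen = [0.9, 1.0]
banana = [-0.8, 0.1]

print(cosine_similarity(king, queen))   # near 1: similar direction
print(cosine_similarity(king, banana))  # much lower (here negative): dissimilar
```

Real embeddings have hundreds of dimensions, but the formula is exactly the same.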
Not as much of an expert as others commenting here, but I believe the sine/cosine stuff comes in just because it’s a standard and very efficient way of comparing vectors.
(“Vector” is just an alternate way of talking about a coordinate point - you can say “a point is at (x, y)”, or equivalently you can say “turn θ degrees and then travel r units forward”; either description gives enough information to find the point exactly.)
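A quick sketch of those two equivalent descriptions of the same point (the numbers are arbitrary):

```python
import math

# Cartesian description: "the point is at (x, y)"
x, y = 3.0, 4.0

# Polar description: "turn theta degrees, then travel r units forward"
r = math.hypot(x, y)                     # distance to travel
theta = math.degrees(math.atan2(y, x))   # angle to turn first

print(r, theta)  # 5.0 units at about 53.13 degrees
```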
I don’t think sine and cosine are actually factors in human language - rather, the process of turning words into vectors captures whatever are the factors in human language, translates them into vectors, and in that translation process the factors get turned into something that sine/cosine measurements of vectors is good at picking up.
A toy example would be that arithmetic doesn’t seem to be a factor in political orientation, but if you assess everyone’s political orientation with some survey questions and then put them on a line from 0 to 10, you could use subtraction to find numbers that are close together - i.e., doing arithmetic to find similar politics. The reason that works is not because arithmetic has anything to do with political orientation; it’s because your survey questions captured information about political orientation and transformed it into something that arithmetic works on.
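The toy example in code (names and scores are made up):

```python
# Hypothetical survey results: political orientation mapped onto a 0-10 line.
scores = {"alice": 2.5, "bob": 3.0, "carol": 8.5}

def political_distance(a, b):
    """Plain subtraction now measures political similarity."""
    return abs(scores[a] - scores[b])

print(political_distance("alice", "bob"))    # 0.5 - similar politics
print(political_distance("alice", "carol"))  # 6.0 - very different
```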
I guess this explanation doesn’t do much except push the unintuitiveness into the embedding process (that’s the process of turning words into vectors in a way that captures their relationship to other words).
> In this work, we use sine and cosine functions of different frequencies:
>
> PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
> PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
>
> where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_{pos+k} can be represented as a linear function of PE_{pos}.
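That quoted formula is small enough to sketch in plain Python. The function names here are mine, not from the paper; the point of the second function is to demonstrate the "linear function of PE_{pos}" property - shifting a position by k is just a rotation of each (sin, cos) pair, and the rotation depends only on k, not on the position:

```python
import math

def positional_encoding(pos, d_model=8):
    """Sinusoidal positional encoding: sin/cos pairs at geometrically
    spaced frequencies (d_model must be even)."""
    pe = []
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe.append(math.sin(pos * freq))  # even dimension 2i
        pe.append(math.cos(pos * freq))  # odd dimension 2i+1
    return pe

def shift(pe, k, d_model=8):
    """Rotate each (sin, cos) pair by k * freq -- a linear map that
    depends only on the offset k, never on the original position."""
    out = []
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        a = k * freq
        # angle-addition identities:
        # sin(x + a) = sin x * cos a + cos x * sin a
        # cos(x + a) = cos x * cos a - sin x * sin a
        out.append(s * math.cos(a) + c * math.sin(a))
        out.append(c * math.cos(a) - s * math.sin(a))
    return out

# shift(PE(3), k=4) lands exactly on PE(7):
pe3, pe7 = positional_encoding(3), positional_encoding(7)
print(all(abs(a - b) < 1e-9 for a, b in zip(shift(pe3, 4), pe7)))  # True
```

So the model can "move its attention k steps over" by applying one fixed linear transformation, which is presumably what makes this encoding easy to learn with.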
Someone else can explain this better - I’m going off one of the videos suggested in another reply here. Sin and cos don’t have any inherent properties specific to language. They were chosen because the model needs to relate positions with a simple linear function (for any fixed offset k, the encoding of pos+k is a linear function of the encoding of pos), and sinusoids have exactly that property. Any other function with the same property could fit the bill as well.