
I'm still confused by the term 'attention', because it implies that something else is actively attending, while it's really about self-similarity. We begin with a sequence of vectors and linearly transform it in three ways to get Q, K and V (these transformations are learned). The attention output is softmax(QK^T / sqrt(d_k)) V, i.e. "mix the rows of V according to the similarity between the two other projections Q and K". Somehow, by running many of these self-similarity transformations in parallel and stacking them in series, we get syntax modeling. It remains a mystery to me what the transformations are supposed to model and why this works so well. This paper might well be one of the most profound discoveries of this century.
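For concreteness, here is a minimal sketch of that computation in NumPy: a single head, no masking or batching. The array names and sizes are illustrative assumptions, not taken from any reference implementation.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax over the last axis.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        # X: (seq_len, d_model) input vectors
        # W_q, W_k, W_v: learned projection matrices
        Q = X @ W_q                        # (seq_len, d_k) queries
        K = X @ W_k                        # (seq_len, d_k) keys
        V = X @ W_v                        # (seq_len, d_v) values
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)    # pairwise similarity of the two projections
        weights = softmax(scores)          # each row sums to 1
        return weights @ V                 # mix the value vectors by similarity

    # Hypothetical sizes: 5 tokens, model width 16, head width 8.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))
    W_q, W_k, W_v = [rng.normal(size=(16, 8)) for _ in range(3)]
    out = self_attention(X, W_q, W_k, W_v)  # shape (5, 8)

Each output row is a weighted average of the value vectors, with weights given by how similar that token's query is to every token's key; the softmax and the 1/sqrt(d_k) scaling (from the original paper) are what the shorthand "(Q.K)*V" leaves out.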

Please suggest some papers that delve a bit more into the theory behind the architecture.


