Is it known behavior that the attention + FF displacements tend to point in the same direction? I am kind of surprised they are even in the same latent space across layers. The FF network could be doing arbitrary rotations, right? I suspect I misunderstand what is going on.
It's a 2D representation of very high-dimensional vectors. Something has to be left out, and accurately depicting arbitrary rotations in the high-dimensional space is one of those things.
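To make the distortion concrete, here's a minimal sketch of the effect: random unit vectors in high dimensions are nearly orthogonal, but after projecting to 2D their angles get badly scrambled. The dimensionality, sample count, and PCA-style projection are all illustrative assumptions, not the actual method behind the figure being discussed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 768, 200  # assumed dimension and sample count, purely for illustration

# Random unit vectors in high dimensions are nearly orthogonal to one another.
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Project to 2D via PCA (an assumed projection; the figure's method is unspecified).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T
X2 /= np.linalg.norm(X2, axis=1, keepdims=True)

def mean_abs_cos(A):
    # Mean absolute cosine similarity over all distinct pairs of rows.
    C = A @ A.T
    return np.abs(C[np.triu_indices(len(A), k=1)]).mean()

print(f"mean |cos| in {d}D: {mean_abs_cos(X):.3f}")   # ~0.03: nearly orthogonal
print(f"mean |cos| in 2D:   {mean_abs_cos(X2):.3f}")  # ~0.64: angles heavily distorted
```

So two displacement vectors that look aligned in the plot could easily be close to orthogonal in the full space; the projection just can't represent that.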