Every matrix multiply can be viewed as a 1x1 convolution, just like every convolution can be viewed as an (implicit, for larger than 1x1 kernels) matrix multiply.

I'm not sure this is particularly enlightening on its own, but it's probably one of the small steps of understanding required to truly "get" the underlying math.

Edit: I should have said that every matrix multiply against weights is a convolution. The matrix multiply in the standard attention block (queries times keys) doesn't really make sense to view as a convolution, since neither operand is a fixed kernel of weights.
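
To make the 1x1 direction concrete, here's a minimal sketch assuming PyTorch; the sizes n, d_in, d_out and all the variable names are made up for illustration. The same weight matrix applied as a plain matmul and as a 1x1 conv2d gives the same result:

    import torch

    n, d_in, d_out = 8, 16, 32
    X = torch.randn(n, d_in)          # n rows, each a d_in-dim vector
    W = torch.randn(d_out, d_in)      # weight matrix

    out_matmul = X @ W.T              # plain matrix multiply -> (n, d_out)

    # Same thing as a 1x1 convolution: treat the n rows as n "pixels" of a
    # 1-sample, d_in-channel image, and W as d_out filters of shape (d_in, 1, 1).
    x_img = X.T.reshape(1, d_in, n, 1)                       # (batch, channels, H, W)
    w_kernel = W.reshape(d_out, d_in, 1, 1)                  # (out_ch, in_ch, kH, kW)
    out_conv = torch.nn.functional.conv2d(x_img, w_kernel)   # (1, d_out, n, 1)
    out_conv = out_conv.reshape(d_out, n).T                  # back to (n, d_out)

    print(torch.allclose(out_matmul, out_conv, atol=1e-5))   # True

Going the other way (a kxk convolution as a matmul) is the usual im2col trick: unroll each kxk input patch into a row and multiply by the flattened filters, which is the "implicit" matrix multiply the parent mentions.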



An integral transform is the right abstraction level: a matmul is a (discrete) integral transform.
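
In the discrete case that reads y_i = sum_j K_ij x_j, which is a Riemann sum for (Tf)(s) = ∫ K(s, t) f(t) dt. A minimal sketch of the analogy, assuming NumPy; the Gaussian kernel and the grid are arbitrary choices for illustration:

    import numpy as np

    n = 1000
    t = np.linspace(0.0, 1.0, n)
    dt = t[1] - t[0]

    f = np.sin(2 * np.pi * t)                   # the "input function", sampled on the grid
    K = np.exp(-np.subtract.outer(t, t) ** 2)   # kernel K(s, t) = exp(-(s - t)^2)

    Tf = (K * dt) @ f                           # matrix-vector multiply = Riemann sum of the integral

    # Check one point against direct quadrature at s = 0.5
    i = n // 2
    direct = np.trapz(np.exp(-(t[i] - t) ** 2) * f, t)
    print(Tf[i], direct)                        # agree up to the quadrature error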



