The short version (as I understand it) is that you use a neural network to score how relevant each pair of inputs is to each other, then weight the inputs by those scores. That lets the model downplay unimportant information while keeping what is actually important.
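To make that a bit more concrete, here is a rough sketch of the idea in plain Python. It is a simplification, not how any particular library does it: real transformers first pass the inputs through learned query, key, and value projections, which are left out here, and the function names and toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(inputs):
    """Simplified self-attention with no learned projections (an assumption).

    Each input vector is scored against every input, including itself.
    The scores become weights via softmax, and each output is the
    weighted average of all the inputs.
    """
    d = len(inputs[0])
    outputs, all_weights = [], []
    for query in inputs:
        # Pairwise relevance: scaled dot product of this input with each input.
        scores = [dot(query, key) / math.sqrt(d) for key in inputs]
        weights = softmax(scores)
        # Mix the inputs, so low-weight inputs contribute little.
        mixed = [sum(w * v[i] for w, v in zip(weights, inputs)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy data: the first two vectors are similar, the third is different.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
```

In this toy example the first vector gives more weight to the second (similar) vector than to the third (dissimilar) one, which is the "importance to each other" part. Nothing is thrown away outright; less relevant inputs just get smaller weights.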
Hi! I'm the creator of the site. Good news: I'm currently working on animations and an explainer video on transformers and self-attention. The best way to be notified is probably to subscribe to my YouTube channel and hit the bell icon for notifications.