This may make it difficult to explain, and I already see many incorrect explanations here and even more lazy ones (why post the first Google result? You're just adding noise).
> Steve Yegge on Medium thinks that the team behind Transformers deserves a Nobel
First, Yegge needs to be able to tell me what Attention and Transformers are. More importantly, he needs to tell me who invented them.
That actually gets to the important point, and to why there are so many bad answers here and elsewhere: you're missing a lot of context, and the definitions themselves are murky. That's also what makes this difficult to ELI5. I'll try, then point you to resources for an actually good answer.
== Bad Answer (ELI5) ==
A transformer is an algorithm that considers the relationships between all parts of a piece of data. It does this through four mechanisms, arranged in two parts. The first part is composed of a normalization block and an attention block. The normalization block scales the data so that no values grow too large. Then the attention mechanism takes all the data handed to it and considers how each piece relates to every other. This is called "self-attention" when we only consider one input, and "cross-attention" when we compare multiple inputs. Both create relationships that work much like a lookup table. The second part is also composed of a normalization block, this time followed by a linear layer. The linear layer reprocesses all the relationships just computed and gives them context. But we haven't stated the fourth mechanism! This is the residual, or "skip", layer. It lets the data pass right by each of the above parts without being processed, and this little side path is key to getting things to train efficiently.
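If code helps, here is a minimal NumPy sketch of the structure described above: normalize, attend, add the skip connection, then normalize, apply the linear part, and skip again. All the names and weight shapes here are illustrative toys, not any particular library's API, and real implementations add learned scale/shift in the norm, multiple attention heads, and more.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalization block: rescale each row so no values grow too large.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(x, Wq, Wk, Wv):
    # Each position asks "how related am I to every other position?"
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise relationship scores
    weights = softmax(scores, axis=-1)        # rows sum to 1: a soft lookup table
    return weights @ v

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    # Part 1: normalize + attend; the residual "skip" carries x around it.
    x = x + self_attention(layer_norm(x), Wq, Wk, Wv)
    # Part 2: normalize + linear layer, again with a skip path.
    h = layer_norm(x)
    x = x + np.maximum(0, h @ W1) @ W2        # ReLU between two linear maps
    return x

rng = np.random.default_rng(0)
d = 8                                          # toy embedding dimension
x = rng.normal(size=(5, d))                    # 5 tokens, d features each
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
out = transformer_block(x, *Ws)
print(out.shape)                               # same shape in, same shape out
```

Note how the block maps a (tokens, features) array to an array of the same shape, which is what lets you stack these blocks dozens of layers deep.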
Now, that doesn't really do the work justice or explain why or how things actually work. ELI5 isn't a good way to understand things for practical use, but it is an okay place to start and to learn abstract concepts. For the next level up I suggest Training Compact Transformers[0]. It gives illustrations and code to help you follow along. It focuses on vision transformers, but the mechanics are the same. After that I suggest Karpathy's video on GPT[1], where you build a transformer yourself and he goes into a bit more depth. Both are good for novices and people with little mathematical background. For more of the lore, and to understand how we got here and the confusion over the definition of attention, I suggest Lilian Weng's blog[2] (everything she does is gold). For a lecture with more depth I suggest Pascal Poupart's class. Lecture 19[3] covers attention and transformers, but you should at minimum watch Lecture 18 first, and if you actually have no ML experience or knowledge you should probably start from the beginning.
The truth is that not everything can be explained in simple terms, at least not if one wants an adequate understanding. That misquotation of Einstein (probably originating from Nelson) is far from accurate, and I wouldn't expect someone who introduced a highly abstract concept through complex mathematics (to such a degree that physicists argued he was a mathematician) to say something so silly. A lot is lost when distilling a concept, and neither the listener nor the speaker should fool themselves into believing the distillation makes them knowledgeable (armchair expertise is a frustrating fixture of the internet and has gotten our society into a lot of trouble).
As usual, the most helpful answer is buried among desperate platitudes full of inaccuracies trying to pander to the absurd reddit-esque ELI5 notion you've dismantled.
[0] https://medium.com/pytorch/training-compact-transformers-fro...
[1] https://www.youtube.com/watch?v=kCc8FmEb1nY
[2] https://lilianweng.github.io/posts/2018-06-24-attention/
[3] https://www.youtube.com/watch?v=OyFJWRnt_AY