This might be oversimplifying it a lot, but it all boils down to a special kind of lookup table: you give it a "query" and a bunch of "keys", and the "value" that comes back points to the most likely next word. As you input something into the network, a whole network of these tables is consulted and you are given the most likely next word.
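To make the lookup-table picture a bit more concrete, here is a rough sketch of a "soft" lookup in plain Python. Instead of returning the single value whose key matches exactly, it scores the query against every key and returns a weighted blend of all the values. The name `soft_lookup` and the toy numbers are made up for illustration, and a real model learns its queries, keys and values rather than having them hard-coded.

```python
import math

def softmax(scores):
    # Turn raw scores into positive weights that sum to 1.
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    # Hypothetical helper: score the query against every key with a
    # dot product, scaled by sqrt(dimension).
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The result is a blend of all values, weighted by how well each key matched.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy data (assumed): three table entries with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

weights, output = soft_lookup(query, keys, values)
print(weights)  # the second key barely matches, so its weight is the smallest
print(output)
```

The difference from a regular dictionary is that nothing here is an exact match: every entry contributes a little, and the better the key matches the query, the more its value counts.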

The novelty in this paper is this "query-key-value" relation that gets learned. A lot of previous work in this area focused on learning a rough state machine: you feed it a sequence of state transitions and it gives you the most likely next state. That also works, but training such networks is very slow, and you can't train the network to "attend" to certain parts of the input. The lookup-based technique lets you do that, and it is also much more compute-efficient than the previous techniques.
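A toy way to see the difference, with everything reduced to single numbers: the state-machine style has to walk the input one step at a time because each state depends on the previous one, while the lookup style computes each position's "attention" over the whole input independently. This is only a sketch under those simplifying assumptions; `recurrent_states`, `attention_weights` and the `decay` parameter are made-up names, not anything from the paper.

```python
import math

def recurrent_states(inputs, decay=0.5):
    # State-machine style: each new state depends on the previous state,
    # so the steps have to run strictly in order.
    state = 0.0
    states = []
    for x in inputs:
        state = math.tanh(decay * state + x)
        states.append(state)
    return states

def attention_weights(query, keys):
    # Lookup style: score one query against every key at once and
    # normalize the scores into weights that sum to 1.
    scores = [query * k for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy input (assumed): each "token" is just one number.
inputs = [0.1, 2.0, 0.3, 0.2]

states = recurrent_states(inputs)

# Each row below depends only on the inputs, not on any other row,
# so all rows could be computed in parallel.
all_weights = [attention_weights(q, inputs) for q in inputs]

for row in all_weights:
    print([round(w, 3) for w in row])
```

In each row the biggest weight lands on the second token, which is the "attend to certain parts of the input" idea in miniature. Since no row waits on another, the work spreads across parallel hardware much better than the step-by-step loop does.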

I'm missing a lot of details but that's basically the intuition behind this.

These are excellent resources:
- https://www.youtube.com/watch?v=ptuGllU5SQQ&list=PLoROMvodv4...
- https://www.youtube.com/watch?v=OyFJWRnt_AY&pp=ygUfYXR0ZW50a...