that's not what it says in the article. it actually says "information from all previous tokens can be passed to the current token".
that statement is meaningfully different from "all previous tokens can be passed to the current token", and both really make sense if you understand attention mechanisms.
Sorry for the misquote, but it's a distraction from my issue, which was with the use of the word 'passed'.
Do you pass information from other tokens to a token in the sense that each token processes information from other tokens? A token isn't a processing unit AFAIK, it's just a word part. The processing is not the responsibility of the token itself. My understanding is that tokens may be associated with each other via an external structure but not passed to each other. Or maybe they meant a token vector? And the token vector contains information from related tokens? It's unclear.
To me, 'passed' means data passed to a function or algorithm for processing. It's confusing unless a token is a function or algorithm.
My point is that this language only makes sense if you are already up to date in that field.
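For what it's worth, here is a rough sketch of the "token vector" reading, which is how I understand causal attention to work. Everything in it is a simplifying assumption for illustration: `causal_attention` is a made-up name, the vectors are toy values, and real models apply learned query/key/value projections that are skipped here. The point is only that each position's new vector is a weighted mix of its own vector and the earlier ones, computed by an outside routine rather than by the token itself.

```python
import math

def causal_attention(vectors):
    """Toy single-head causal self-attention.

    Assumption: queries, keys and values are the raw vectors themselves
    (no learned projections), which real transformers do not do.
    """
    d = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        # Position i can only see positions 0..i (itself and earlier tokens).
        visible = vectors[: i + 1]
        # Scaled dot-product score between this position and each visible one.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in visible]
        # Softmax turns the scores into weights that sum to 1.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The new vector for position i is a weighted average of visible vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, visible))
                 for j in range(d)]
        outputs.append(mixed)
    return outputs

# Three toy "token vectors" (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(tokens)
```

In this sketch the first output equals the first input, since there is nothing earlier to mix in, and changing the last vector leaves the earlier outputs untouched. So "passed" here would mean "mixed into the current position's vector", not "handed to the token for processing".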