Just a quick clarification: attention of the same sort transformers use was already being employed in RNNs for a while (e.g., in Bahdanau-style encoder-decoder models for machine translation). Hence the title "Attention Is All You Need": it turned out you can just remove the recurrent part, which was what made the network hard to train.
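
For concreteness, here's a minimal sketch of scaled dot-product attention on its own, with no recurrence anywhere; every position attends to every other position in a single matrix multiply. The shapes and the toy inputs are illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are (seq_len, d_k) arrays. There is no recurrence:
    the whole sequence is processed in one shot.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the values

# Toy self-attention example: 4 positions, 8-dim vectors (made-up sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

The same weighted-sum idea was used inside RNN decoders; the transformer's contribution was keeping this part and dropping the sequential state updates.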