
Simply put, no other model has the same number of effective skip connections, or passes as much information through the model from input to output.

Earlier models had huge bottlenecks in terms of information capacity and precision (autoencoders vs. U-Nets, for example), and LSTMs are still semi-unstable.
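
Roughly, the contrast looks like this (a minimal PyTorch sketch, with linear layers standing in for the convolutional stacks a real autoencoder or U-Net would use):

    import torch
    import torch.nn as nn

    # Plain autoencoder: everything must squeeze through the bottleneck z.
    class Autoencoder(nn.Module):
        def __init__(self, d_in=256, d_z=16):
            super().__init__()
            self.enc = nn.Linear(d_in, d_z)  # all information compressed into 16 dims
            self.dec = nn.Linear(d_z, d_in)

        def forward(self, x):
            z = torch.relu(self.enc(x))
            return self.dec(z)  # decoder sees only the bottleneck

    # U-Net-style block: the decoder also sees the encoder input directly,
    # so fine detail can bypass the bottleneck via the skip connection.
    class UNetBlock(nn.Module):
        def __init__(self, d_in=256, d_z=16):
            super().__init__()
            self.enc = nn.Linear(d_in, d_z)
            self.dec = nn.Linear(d_z + d_in, d_in)  # takes bottleneck + skip

        def forward(self, x):
            z = torch.relu(self.enc(x))
            return self.dec(torch.cat([z, x], dim=-1))  # concatenated skip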

Why the attention design proposed by Google works so well is partly the skip-forward paths and partly "now we have enough information and processing power to try this."
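
Concretely, every transformer block wraps its sublayers in additive skips, so the input keeps a clean residual path to the output. A sketch in PyTorch (pre-norm variant; the original paper used post-norm):

    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, d=512, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
            self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

        def forward(self, x):
            h = self.norm1(x)
            a, _ = self.attn(h, h, h, need_weights=False)
            x = x + a                       # skip around attention
            x = x + self.ff(self.norm2(x))  # skip around the MLP
            return x

Stack N of these and the embedding at the bottom still reaches the top through a chain of pure additions, which is exactly the "information from input to output" property.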

It's well motivated, but whether we'd expect it to work this well from first principles up is still a bit less well understood. And if you're good at that, you'll likely get a job offer very quickly.



