We know architecture and training procedures matter in practice.
MLPs and transformers are ultimately equivalent in theory: for any function a given transformer can compute, there exists an MLP that can represent it. In practice, though, that MLP is hard to identify and hard to train.
Also, a transformer already contains MLPs inside it, so the comparison isn't even a clean one...
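To make that last point concrete, here's a minimal sketch (assuming PyTorch; all names and sizes are illustrative, not anyone's actual implementation) of a standard pre-norm transformer block. The feed-forward sub-layer is just a two-layer MLP applied to each token.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # This feed-forward sub-layer is a plain two-layer MLP,
        # applied independently to every token position.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                  # residual around attention
        x = x + self.mlp(self.norm2(x))   # residual around the MLP
        return x

x = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
print(TransformerBlock()(x).shape)
```

So the real difference isn't MLP vs. no MLP; it's what the attention layers add around those MLPs and how that changes what is easy to learn.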