We know architecture and training procedures matter in practice.
MLPs and transformers are ultimately equivalent in theory: for any function a given transformer can compute, there exists an MLP that can represent it. In practice, though, that MLP is hard to identify and hard to train.
Also, a transformer already contains MLPs inside it, so the comparison isn't even a clean one...
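To make that last point concrete, here's a minimal sketch (assuming PyTorch; all names and sizes are illustrative, not anyone's actual implementation) of a standard pre-norm transformer block. The feed-forward sub-layer is just a two-layer MLP applied to each token.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # This feed-forward sub-layer is a plain two-layer MLP,
        # applied independently to every token position.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                  # residual around attention
        x = x + self.mlp(self.norm2(x))   # residual around the MLP
        return x

x = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
print(TransformerBlock()(x).shape)
```

So the real difference isn't MLP vs. no MLP; it's what the attention layers add around those MLPs and how that changes what is easy to learn.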