> I wonder if there is a cap to multi head attention architecture
I don't think there's a cap other than having good data. The model can learn every language in the world; it has the capacity. A comparatively simple model like AlphaZero beats humans at board games. As long as you have the data, the model isn't the obstacle. AlphaProof, an LLM-based system, reached silver-medal level at the IMO.
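For reference, the multi-head attention block being discussed is small and generic; a minimal NumPy sketch is below (random matrices stand in for learned weights, and masking, biases, and batching are omitted for brevity). The point is that nothing in the block itself bounds capacity: you scale it by widening `d_model`, adding heads, and stacking layers, so data and training become the limiting factors.

```python
import numpy as np

def multi_head_attention(x, num_heads, rng):
    """Minimal scaled dot-product multi-head attention (no mask, no bias)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    # Random projections stand in for learned parameters Wq, Wk, Wv, Wo.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    # Split into heads: (num_heads, seq_len, d_head).
    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Scaled dot-product attention, softmax over the key dimension.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ Vh                               # (num_heads, seq_len, d_head)

    # Concatenate heads and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))      # 5 tokens, d_model = 8
y = multi_head_attention(x, num_heads=2, rng=rng)
print(y.shape)                       # (5, 8): same shape in and out
```

Because the block maps a `(seq_len, d_model)` input to an output of the same shape, it composes freely: capacity comes from stacking and widening, not from any fixed ceiling inside the attention mechanism.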