It should be noted that the last part has a pretty large factor that also scales with model size, because to run transformers efficiently you cache some of the intermediate activations from the attention block (the keys and values, i.e. the KV cache).
The factor is basically 2 (keys + values) * number of layers * embedding dimension values stored per token, each value taking e.g. 2 bytes in fp16.
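To give a feel for the numbers, here is a minimal back-of-the-envelope sketch in Python. The specific dimensions (32 layers, 4096 embedding size, 4096-token context) are just illustrative assumptions, roughly in the range of a 7B-class model, not figures from the text above:

```python
# Rough per-token KV-cache cost under assumed model dimensions.
num_layers = 32          # assumed number of transformer layers
hidden_size = 4096       # assumed embedding (model) dimension
bytes_per_value = 2      # fp16 / bf16

# 2 = one set of keys + one set of values cached per layer
kv_values_per_token = 2 * num_layers * hidden_size
kv_bytes_per_token = kv_values_per_token * bytes_per_value

print(f"{kv_values_per_token} cached values per token")              # 262144
print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")  # 512 KiB

# Scaled up to a full context, per sequence in the batch:
context_len = 4096
total_gib = kv_bytes_per_token * context_len / 2**30
print(f"{total_gib:.1f} GiB for a {context_len}-token context")      # 2.0 GiB
```

With those assumed numbers it works out to about 0.5 MiB of cache per token, or roughly 2 GiB per sequence at a 4096-token context, which is why this term matters so much for serving.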