With low-precision formats like float8, you usually have to upcast the activations to BF16 before normalising. The normalisation layers therefore keep running at the higher precision while the rest of the network gets cheaper, so they take up a proportionally larger share of the compute as you go to lower precision. Replacing these layers would help reduce that cost significantly.
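
To make the proportionality argument concrete, here is a rough back-of-the-envelope sketch. The 90/10 cost split and the 2x speedup for float8 matmuls are made-up illustrative numbers, not measurements, and `norm_share` is a hypothetical helper. The only assumption carried over from the point above is that normalisation stays in BF16 and so does not get any cheaper.

```python
def norm_share(matmul_cost: float, norm_cost: float, matmul_speedup: float) -> float:
    """Fraction of total cost spent in normalisation.

    The matmul cost is divided by `matmul_speedup`; the normalisation cost is
    held fixed, modelling layers that still run in BF16.
    """
    scaled_matmul = matmul_cost / matmul_speedup
    return norm_cost / (scaled_matmul + norm_cost)


# Hypothetical cost split in arbitrary units (assumed, not measured):
# 90 for matmuls, 10 for normalisation.
bf16_share = norm_share(90.0, 10.0, matmul_speedup=1.0)
fp8_share = norm_share(90.0, 10.0, matmul_speedup=2.0)  # assumed 2x float8 speedup

print(f"BF16 baseline:  {bf16_share:.1%}")  # 10.0%
print(f"float8 matmuls: {fp8_share:.1%}")   # 18.2%
```

With these assumed numbers the normalisation share goes from 10% to roughly 18% even though its absolute cost is unchanged. The real figures will depend on the model and the hardware.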