MFU is probably the best metric, but it requires application-level logic. You can also export metrics at the infra level, such as SM efficiency. We explain a bit about how we used it to do some optimization.
MFU is indeed very useful. Today we found that, while scaling Karpathy’s nanoGPT to multiple H100 nodes, the MFU calculation itself was dragging MFU down![1]
Commenting it out improved per-iteration performance by almost 30%.
https://www.trainy.ai/blog/gpu-utilization-misleading
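For anyone wondering what "application logic" means here: MFU is the FLOPs/s the model actually achieves divided by the hardware's peak, so the training loop has to know the model shape and time each iteration. Below is a rough sketch, loosely modeled on the PaLM-style estimate that nanoGPT uses. The function name, its parameters, and the numbers in the usage lines are illustrative assumptions, not code from the thread or the linked post; the peak FLOPs figure in particular depends on GPU model and dtype and should come from the vendor datasheet.

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 tokens_per_iter, dt_seconds, peak_flops_per_gpu, n_gpus=1):
    """Rough model FLOPs utilization (MFU) for a decoder-only transformer.

    Hypothetical helper for illustration. Approximates training cost as
    ~6 FLOPs per parameter per token (forward + backward) plus an
    attention term of 12 * n_layer * n_head * head_dim * seq_len.
    """
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    flops_per_iter = flops_per_token * tokens_per_iter
    achieved_flops_per_sec = flops_per_iter / dt_seconds
    # Compare against the aggregate peak of all GPUs taking part in the step.
    return achieved_flops_per_sec / (peak_flops_per_gpu * n_gpus)


# Illustrative numbers only (assumed, not measured):
mfu = estimate_mfu(
    n_params=124e6, n_layer=12, n_head=12, head_dim=64, seq_len=1024,
    tokens_per_iter=491_520, dt_seconds=0.5,
    peak_flops_per_gpu=989e12,  # assumed dense BF16 peak; check your datasheet
    n_gpus=8,
)
print(f"MFU: {mfu:.1%}")
```

The arithmetic is cheap, so if a calculation like this slows training, the cost is more likely in how the timing is gathered than in the formula itself; that is a guess, not something the thread confirms.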