We shouldn't just be looking for under-trained tokens. Tokens are effectively the first layer of the network, but we should also be looking for training-data imbalances at every weight in every other layer of the network.
When we find them, it might be best to delete weights with hardly any data flowing through them (which might make the model smaller or help generalisation).
I believe model distillation does this. SparseGPT was a big one, managing to remove 50% of parameters without losing much accuracy, IIRC. I saw a more recent paper citing SparseGPT that managed around 70-80% sparsity, which is pretty impressive.
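A minimal sketch of the "delete weights with hardly any data flowing through them" idea, using NumPy on a made-up toy layer (the layer shapes, the 50% prune ratio, and the per-weight saliency score `|w| * mean(|x|)` are all illustrative assumptions, not SparseGPT's actual algorithm, which solves a layer-wise reconstruction problem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear layer: 8 inputs -> 4 outputs.
W = rng.normal(size=(4, 8))

# A batch of inputs; pretend two input features are almost never active,
# so hardly any data "flows through" the weights attached to them.
X = rng.normal(size=(100, 8))
X[:, [2, 5]] *= 0.01

# Crude per-weight saliency: |w_ij| scaled by the average magnitude
# of the activation that passes through that weight.
saliency = np.abs(W) * np.abs(X).mean(axis=0)

# Prune (zero out) the lowest-saliency half of the weights.
threshold = np.quantile(saliency, 0.5)
mask = saliency >= threshold
W_pruned = W * mask

print(f"sparsity: {1 - mask.mean():.0%}")
```

The weights on the near-dead input features get pruned almost entirely, while well-fed weights survive, which is the intuition behind activation-aware pruning methods in general.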