Post-training quantization doesn't come for free, but in the ternary BitNet paper, pretraining with precision-constrained weights counterintuitively improves performance per parameter as the number of parameters grows.
Even if there were zero efficiency gain from ternary weights, the research so far suggests large models should probably be trained with precision-limited weights from here on out.
I suspect it comes down to each weight participating in multiple 'features' in the network. The greater the precision, the more room competing features have to compromise on weight values that aren't best for either feature, instead of reorganizing the feature mapping to avoid the conflict.
Perhaps the number of bits used per weight during training could itself be folded into the regularization?
For instance, one could extend dropout to several levels: each weight would have a random chance of being truncated to its most significant 2-16 bits part of the time (and still zeroed part of the time), and the impact on the gradient of having fewer bits could be used to tune the ideal bit count for each weight.
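Here's a rough PyTorch sketch of what that could look like. Everything here is my own hypothetical construction, not anything from the BitNet paper: the name `PrecisionDropout`, the 2-16 bit range, and the symmetric fake-quantization scheme with a straight-through estimator are all assumptions.

```python
import torch
import torch.nn as nn

class PrecisionDropout(nn.Module):
    """Hypothetical 'precision dropout': on each forward pass, every weight is
    independently fake-quantized to a randomly drawn bit depth, and some
    weights are zeroed outright as in ordinary dropout."""

    def __init__(self, min_bits=2, max_bits=16, p_zero=0.1):
        super().__init__()
        self.min_bits = min_bits
        self.max_bits = max_bits
        self.p_zero = p_zero

    def forward(self, w):
        if not self.training:
            return w
        # Draw an independent bit depth in [min_bits, max_bits] per weight.
        bits = torch.randint(self.min_bits, self.max_bits + 1,
                             w.shape, device=w.device).float()
        levels = 2.0 ** (bits - 1) - 1          # max integer level per weight
        scale = w.abs().max().clamp(min=1e-8)   # shared symmetric scale
        q = torch.round(w / scale * levels) / levels * scale
        # Straight-through estimator: forward uses q, backward sees identity.
        q = w + (q - w).detach()
        # 'Still 0 part of the time': plain dropout on top (no 1/(1-p)
        # rescaling, kept minimal).
        mask = (torch.rand_like(w) > self.p_zero).float()
        return q * mask

# Usage: fake-quantize a layer's weights on the fly during training.
layer = nn.Linear(128, 128)
pdrop = PrecisionDropout()
x = torch.randn(4, 128)
y = x @ pdrop(layer.weight).T + layer.bias
```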
Then one could add L1 regularization on the total number of bits used, squeezing the total down to whatever size one aims for.
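A sketch of that part too, again entirely hypothetical: a learnable per-weight bit parameter is rounded through a straight-through estimator, so the task loss can pull bit depths up while the L1 penalty pushes them down. The class name, the penalty coefficient, and the clamped bit range are all my own assumptions.

```python
import torch
import torch.nn as nn

class LearnedBitWidth(nn.Module):
    """Hypothetical learnable per-weight bit budget. A continuous parameter
    is rounded to an integer bit depth via a straight-through estimator;
    its sum feeds an L1-style penalty on total bits."""

    def __init__(self, shape, init_bits=8.0, min_bits=2.0, max_bits=16.0):
        super().__init__()
        self.bits_raw = nn.Parameter(torch.full(shape, init_bits))
        self.min_bits = min_bits
        self.max_bits = max_bits

    def effective_bits(self):
        b = self.bits_raw.clamp(self.min_bits, self.max_bits)
        return b + (torch.round(b) - b).detach()  # STE through rounding

    def forward(self, w):
        bits = self.effective_bits()
        levels = 2.0 ** (bits - 1) - 1
        scale = w.abs().max().detach().clamp(min=1e-8)
        u = w / scale * levels
        # STE only on the rounding step, so the quantization error still
        # backpropagates into `bits` via `levels`.
        u = u + (torch.round(u) - u).detach()
        return u / levels * scale

    def bit_penalty(self):
        # L1-style penalty: total bits spent across all weights.
        return self.effective_bits().sum()

# Usage sketch: the penalty coefficient trades accuracy against total bits.
layer = nn.Linear(64, 64)
bw = LearnedBitWidth(layer.weight.shape)
x = torch.randn(8, 64)
out = x @ bw(layer.weight).T + layer.bias
loss = out.pow(2).mean() + 1e-6 * bw.bit_penalty()
loss.backward()  # grads reach both layer.weight and bw.bits_raw
```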