Post-training quantization doesn't come for free, but in the ternary BitNet paper, pretraining with precision-constrained weights counterintuitively improves performance per parameter as the number of parameters grows.
Even if there were zero efficiency gain from ternary weights, the research so far suggests large models should probably be trained with precision-limited weights from here on out.
I suspect it comes down to each weight participating in multiple 'features' in the network. The greater the precision, the more room competing features have to compromise on weight values that aren't best for either feature, instead of reorganizing the feature mapping to avoid the conflict.
Perhaps the number of bits used per weight during training could itself be folded into the regularization?
For instance, one could extend dropout to several levels: each weight would have a random chance of being truncated to its most significant 2-16 bits part of the time (and still zeroed part of the time), and the impact on the gradient of having fewer bits could be used to tune the ideal bit count for each weight.
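Here's a rough PyTorch sketch of what that could look like. Everything here is my own hypothetical construction, not anything from the BitNet paper: the name `PrecisionDropout`, the 2-16 bit range, and the symmetric fake-quantization scheme with a straight-through estimator are all assumptions.

```python
import torch
import torch.nn as nn

class PrecisionDropout(nn.Module):
    """Hypothetical 'precision dropout': on each forward pass, every weight is
    independently fake-quantized to a randomly drawn bit depth, and some
    weights are zeroed outright as in ordinary dropout."""

    def __init__(self, min_bits=2, max_bits=16, p_zero=0.1):
        super().__init__()
        self.min_bits = min_bits
        self.max_bits = max_bits
        self.p_zero = p_zero

    def forward(self, w):
        if not self.training:
            return w
        # Draw an independent bit depth in [min_bits, max_bits] per weight.
        bits = torch.randint(self.min_bits, self.max_bits + 1,
                             w.shape, device=w.device).float()
        levels = 2.0 ** (bits - 1) - 1          # max integer level per weight
        scale = w.abs().max().clamp(min=1e-8)   # shared symmetric scale
        q = torch.round(w / scale * levels) / levels * scale
        # Straight-through estimator: forward uses q, backward sees identity.
        q = w + (q - w).detach()
        # 'Still 0 part of the time': plain dropout on top (no 1/(1-p)
        # rescaling, kept minimal).
        mask = (torch.rand_like(w) > self.p_zero).float()
        return q * mask

# Usage: fake-quantize a layer's weights on the fly during training.
layer = nn.Linear(128, 128)
pdrop = PrecisionDropout()
x = torch.randn(4, 128)
y = x @ pdrop(layer.weight).T + layer.bias
```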
Then one could add L1 regularization on the total number of bits used, squeezing the total down to whatever size one aims for.
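A sketch of that part too, again entirely hypothetical: a learnable per-weight bit parameter is rounded through a straight-through estimator, so the task loss can pull bit depths up while the L1 penalty pushes them down. The class name, the penalty coefficient, and the clamped bit range are all my own assumptions.

```python
import torch
import torch.nn as nn

class LearnedBitWidth(nn.Module):
    """Hypothetical learnable per-weight bit budget. A continuous parameter
    is rounded to an integer bit depth via a straight-through estimator;
    its sum feeds an L1-style penalty on total bits."""

    def __init__(self, shape, init_bits=8.0, min_bits=2.0, max_bits=16.0):
        super().__init__()
        self.bits_raw = nn.Parameter(torch.full(shape, init_bits))
        self.min_bits = min_bits
        self.max_bits = max_bits

    def effective_bits(self):
        b = self.bits_raw.clamp(self.min_bits, self.max_bits)
        return b + (torch.round(b) - b).detach()  # STE through rounding

    def forward(self, w):
        bits = self.effective_bits()
        levels = 2.0 ** (bits - 1) - 1
        scale = w.abs().max().detach().clamp(min=1e-8)
        u = w / scale * levels
        # STE only on the rounding step, so the quantization error still
        # backpropagates into `bits` via `levels`.
        u = u + (torch.round(u) - u).detach()
        return u / levels * scale

    def bit_penalty(self):
        # L1-style penalty: total bits spent across all weights.
        return self.effective_bits().sum()

# Usage sketch: the penalty coefficient trades accuracy against total bits.
layer = nn.Linear(64, 64)
bw = LearnedBitWidth(layer.weight.shape)
x = torch.randn(8, 64)
out = x @ bw(layer.weight).T + layer.bias
loss = out.pow(2).mean() + 1e-6 * bw.bit_penalty()
loss.backward()  # grads reach both layer.weight and bw.bits_raw
```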