
> Does quantization work with most neural networks?

Yes. It works pretty well for CNN-based vision models. Or rather, I'd claim it works even better there: with post-training quantization you can make most models run with minimal accuracy loss entirely in int8 (fixed point), i.e. computation is done over int8/int32 with no floating point at all, as opposed to the weight-only approach discussed here.
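
A minimal sketch of what "entirely in int8" means (symmetric per-tensor quantization for brevity; real PTQ pipelines calibrate scales per channel/layer):

    import numpy as np

    def quantize(x, scale):
        # Map floats to int8 with a symmetric scale; clamp to the int8 range.
        return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 128)).astype(np.float32)  # weights
    a = rng.standard_normal((1, 64)).astype(np.float32)    # activations

    w_scale = np.abs(w).max() / 127
    a_scale = np.abs(a).max() / 127
    w_q = quantize(w, w_scale)
    a_q = quantize(a, a_scale)

    # The actual compute: int8 inputs, int32 accumulation -- no floats involved.
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32)

    # Dequantize once at the end (or fold the scale into the next layer).
    out = acc.astype(np.float32) * (w_scale * a_scale)
    print(np.abs(out - a @ w).max())  # small quantization error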

If you do QAT, something as low as 2-bit weights and 4-bit activations can work.
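
For the unfamiliar: QAT (quantization-aware training) simulates the quantization during training so the network learns to tolerate it. A toy fake-quant with a straight-through estimator might look like this (bit widths from above, scales hypothetical):

    import torch

    def fake_quant(x, bits, scale):
        # Simulate low-bit quantization in the forward pass.
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        q = torch.clamp(torch.round(x / scale), qmin, qmax) * scale
        # Straight-through estimator: gradients pass as if quantization were identity.
        return x + (q - x).detach()

    w = torch.randn(64, 128, requires_grad=True)
    a = torch.randn(1, 64)

    # 2-bit weights, 4-bit activations.
    out = fake_quant(a, bits=4, scale=0.1) @ fake_quant(w, bits=2, scale=0.05)
    out.sum().backward()           # gradients reach w despite round/clamp
    print(w.grad.abs().sum() > 0)  # tensor(True)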

People weren't interested in weight-only quantization back then because CNNs are in general "denser", i.e. the bottleneck was compute, not memory.



thanks!



