You seem like you know what you are talking about... mind if I ask what your tho...

espadrine · 2025-03-05T16:38:31 1741192711

There is no question that quantization degrades quality. The GGUF R1 uses Q4_K_M, which, on Llama-3-8B, increases the perplexity by 0.18[0]. Many plots show increasing degradation as you quantize more[1].

That said, it is possible to train a model in a quantization-aware way[2][3], which improves the quality a bit, although not higher than the raw model.

Also, a loss in quality may not be perceptible in a specific use-case. Famously LMArena.ai tested Llama 3.1 405B with bf16 and fp8, and the latter was only 2 Elo points below, well within measurement error.

[0]: https://github.com/ggml-org/llama.cpp/blob/master/examples/q...

[1]: https://github.com/ggml-org/llama.cpp/discussions/5063#discu...

[2]: https://pytorch.org/blog/quantization-aware-training/

[3]: https://mistral.ai/news/ministraux

sosuke · 2025-03-05T16:27:25 1741192045

I don't know what I'm talking about but when I first asked your question this https://gist.github.com/Artefact2/b5f810600771265fc1e3944228... helped start me on a path to understanding. I think.

But if you don't already know the question your asking is not at all something I could distill down into a sentence or to that would make sense to a lay-person. Even then I know I couldn't distill it at all sorry.

Edit: I found this link I referenced above on quantized models by bartowski on huggingface https://huggingface.co/bartowski/Qwen2.5-Coder-14B-GGUF#whic...

Ambix · 2025-03-06T10:23:50 1741256630

I did my own experiments and it looks like (surprisingly) Q4KM models often outperforms Q6 and Q8 quantised models.

For bigger models (in range of 8B - 70B) the Q4KM is very good, there are no any degradation compared to full FP16 models.