
I wonder why Mistral et al. don't prepare GGUF versions of these for launch day?

If I were them I'd want to be the default source of the versions of my models that people use, rather than farming that out to whichever third party races to publish the GGUF (and other formats) first.



Some of the major vendors _do_ create GGUFs for their models, but the files often ship with the wrong parameter settings, require changes to the inference code, or omit the correct prompt template. We (i.e. Ollama) have our own conversion scripts, and we try to work with the model vendors to get everything working ahead of time, but unfortunately Mistral doesn't usually give us a heads-up before they release.
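To make the "correct prompt template" point concrete, here's a minimal sketch (assuming the transformers library, and using a Mistral instruct checkpoint purely as an example) of rendering the template that ships with the weights. A GGUF that omits this metadata forces the runtime to guess:

    # Hypothetical example: the chat template is metadata that travels with
    # the weights. If a GGUF leaves it out, runtimes fall back to a generic
    # format and output quality quietly degrades.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
    messages = [{"role": "user", "content": "Hello!"}]

    # Renders the exact [INST] ... [/INST] framing the model was trained on,
    # taken from the template stored in tokenizer_config.json.
    print(tok.apply_chat_template(messages, tokenize=False,
                                  add_generation_prompt=True))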


llama.cpp is still under active development, and it sometimes ships breaking changes or new quantization methods; keeping up with those can be a lot of work as you publish more models over time. It's easier to publish a standard float32 safetensors checkpoint that works with PyTorch, and let the community deal with other runtimes and file formats.
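For reference, "the community deals with it" typically looks something like this sketch, assuming a local checkout of llama.cpp (both convert_hf_to_gguf.py and the llama-quantize binary live in that repo; the paths and filenames here are placeholders):

    import subprocess

    # Convert the vendor's float32 safetensors checkpoint into a float16
    # GGUF using llama.cpp's converter script.
    subprocess.run(
        ["python", "convert_hf_to_gguf.py", "path/to/hf-model",
         "--outfile", "model-f16.gguf", "--outtype", "f16"],
        check=True,
    )

    # Quantize down to a 4-bit variant for local inference. The set of
    # supported quant types is exactly what changes as llama.cpp evolves.
    subprocess.run(
        ["./llama-quantize", "model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M"],
        check=True,
    )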

If it's a new architecture, there's also extra work to add support in llama.cpp, which means more dev time, more testing, and potentially losing the element of surprise if the development work has to happen in the open before the release.


I think it's actually reasonable to leave some opportunities for the community. It's an Apache 2.0 model; it's meant for everyone to build upon freely.


The same could be said for ONNX.

Which format you want depends on which community you're in.


Right - imagine how much of an impact a model release could have if it included GGUF and ONNX and MLX along with PyTorch.


I kinda wish Hugging Face just did it for people.



