
I think this is a neat project, and I use it a lot. My only complaint is the lack of grammar support: the llama.cpp they wrap will take a grammar, and the dumbest patch to enable this is like two lines. They seem to be willfully ignoring this (pretty trivial) feature for some reason. I'd rather not maintain a -but-with-grammars fork, so here we are.

https://github.com/ollama/ollama/pull/4525#issuecomment-2157...
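
For anyone unfamiliar with what's being asked for: the server bundled with llama.cpp already accepts a GBNF grammar string on its /completion endpoint, so the missing piece is just plumbing that field through Ollama. A rough sketch of the llama.cpp side (this is not Ollama's API; the port and field names are the llama.cpp server defaults, adjust to your setup):

    import requests

    # GBNF grammar: constrain the completion to a bare yes/no answer
    grammar = 'root ::= "yes" | "no"'

    resp = requests.post(
        "http://localhost:8080/completion",  # default llama.cpp server address
        json={
            "prompt": "Is the sky blue? Answer yes or no: ",
            "grammar": grammar,
            "n_predict": 8,
        },
    )
    print(resp.json()["content"])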



I think the two main maintainers of Ollama have good intentions but suffer from a combination of being far too busy, juggling their forked llama.cpp server, and not having enough automation/testing for PRs.

There is a new draft PR up to look at moving away from trying to juggle maintaining a llama.cpp fork to using llama.cpp with cgo bindings which I think will really help: https://github.com/ollama/ollama/pull/5034


There are many pull requests trying to implement this feature, and they don't even care to reply. This is the only reason I'm still using the llama.cpp server instead of this.


Wouldn't it be more practical to make a PR for llama.cpp to replicate what Ollama does well instead?


Hey, that's my PR yah ... it's strange.


Sorry it's taking so long to review and for the radio silence on the PR.

We have been trying to figure out how to support more structured output formats without some of the side effects of grammars. With JSON mode (which uses grammars under the hood) there were originally quite a few issue reports, mainly around lower performance and cases where the model would infinitely generate whitespace, causing requests to hang. This is an issue with OpenAI's JSON mode as well, which requires the caller to "instruct the model to produce JSON" [1]. While it's possible to handle edge cases for a single grammar such as JSON (e.g. by checking for 'JSON' in the prompt), it's hard to generalize this to any format.
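
To make the whitespace failure mode concrete, here is a paraphrased, trimmed-down version of the kind of JSON grammar involved (see grammars/json.gbnf in the llama.cpp repo for the real file; don't treat this sketch as exact):

    json_grammar = r'''
    root   ::= ws object
    object ::= "{" ws ( string ws ":" ws value )? ws "}"
    value  ::= object | string | number | "true" | "false" | "null"
    string ::= "\"" [^"]* "\""
    number ::= [0-9]+
    ws     ::= ([ \t\n] ws)?
    '''
    # "ws" is recursive, so an arbitrarily long run of spaces/newlines is
    # always a legal continuation. A grammar only constrains what is allowed,
    # not what is likely, so a model that was never prompted to emit JSON can
    # keep sampling whitespace until the request times out -- hence the
    # "instruct the model to produce JSON" caveat above.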

Supporting more structured output formats is definitely important. Fine-tuning for output formats is promising, and this thread [2] also has some great ideas and links.

[1] https://platform.openai.com/docs/guides/text-generation/json...

[2] https://github.com/ggerganov/llama.cpp/issues/4218


Thank you!

I've been using llama.cpp for about a year now, mostly implementing RAG- and ReAct-related papers to stay up to date. For the past few months, though, I've been using both Ollama and llama.cpp.

If you added grammars I wouldn't have to be running the two servers. I think you're doing an excellent job of maintaining Ollama; every update is like Christmas. The llama.cpp folks also don't seem to have the server as a priority (it's still literally just an example of how you'd use their C API).

So, I understand your position, since their server API has been quite unstable, and the grammar validation didn't work at all until February. I also still can't get their multiple model loading to work reliably right now.

Having said that, GBNF is a godsend for my daily use cases. I even prefer using a phi3b with a grammar to dealing with the hallucinations of a 70b without one. Fine-tuning helps a lot, but it can't solve the problem fully (you still need to validate the generation), and it's a lot less agile when implementing ideas. Creating synthetic data sets is also easier if you have support for grammars.
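
To make that last point concrete, here's the kind of (made-up) grammar I mean for synthetic data: force every completion into a fixed label<TAB>text row, and malformed rows simply can't be sampled. You still validate afterwards, but most of the garbage never gets generated in the first place:

    # Hypothetical grammar for labeled training rows; the label set and the
    # tab-separated shape are just an example, nothing from llama.cpp or Ollama.
    row_grammar = r'''
    root  ::= label "\t" text "\n"
    label ::= "positive" | "negative" | "neutral"
    text  ::= [^\t\n]+
    '''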

I think many others are in the same spot as me. Thank you for being considerate about the stability and support this would require. But please, take a look at the current state of their grammar validation, which is pretty good right now.


Not to put too fine a point on it, but why not merge one of the simpler PRs for this feature, gate it behind an opt-in env var (e.g. OLLAMA_EXPERIMENTAL_GRAMMAR=1), and sprinkle the caveats you've mentioned into the documentation? That should be enough to ward off the casuals who would otherwise flood the issue queue. Add more hoops if you'd like.

There seems to be enough interest in this specific feature that you don't need to make it perfect or provide a complicated abstraction. I am very willing to accept/mitigate the side effects for the ability to arbitrarily constrain generation. Not sure about others, but given there are half a dozen different PRs specifically for this feature, I am pretty sure they, too, are willing to accept the side effects.


Since it's trivial enough to run mainline features on actual llama.cpp, it seems redundant to ask ollama to implement and independently maintain branches or features that aren't fully working, if it's not something already in an available testing branch.

We're not relying on ollama for feature development and there are multiple open projects with implementations already, so no one is deprived of anything without this or a hundred other potential PRs not in ollama yet.


This is extremely useful and seems like exactly what I need to fix all of my structured-output woes when my model gets chatty for no reason.

The issue doesn't really explain which feature they're talking about, so here's a link to the docs for it: https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
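
If you just want the flavor of it without clicking through: a grammar is a small BNF-style text blob, and the sampler masks out any token that would take the output outside it. A made-up example (not from the docs) that stops a chatty model from producing anything except a bare list of quoted strings:

    # Illustrative only: constrain output to something like ["red", "green"]
    list_grammar = r'''
    root ::= "[" item ("," " "? item)* "]"
    item ::= "\"" [a-zA-Z0-9 ]+ "\""
    '''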


Same! I use ollama a lot, but when I need to do real engineering with language models, I end up having to go back to llama.cpp because I need grammar-constrained generation to get most models to behave reasonably. They just don't follow instructions well enough without it.
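
Concretely, the gap as I understand it (double-check against each project's docs; the ports below are just the defaults): Ollama's /api/generate only lets you ask for JSON mode, while the llama.cpp server takes an arbitrary grammar field.

    import requests

    # Ollama today: the only structural constraint you can request is JSON mode.
    requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3",
        "prompt": "List three colors as a JSON array.",
        "format": "json",
        "stream": False,
    })

    # llama.cpp server: any GBNF grammar you care to write.
    requests.post("http://localhost:8080/completion", json={
        "prompt": "Reply yes or no: is the sky blue?",
        "grammar": 'root ::= "yes" | "no"',  # or whatever grammar you need
    })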


What's the deal with hosted open source model services not supporting grammars? I've seen fireworks.ai do it, but not anybody else - am I missing something?


Came here to say this. Love Ollama, have been using it since the beginning, can’t understand why the GBNF proposals are apparently going ignored. Really hope they move it forward. Llama3 really drove this home for us. For small-parameter models especially, grammar can be the difference between useful and useless.



