Based on my limited runs, I think 4-bit quantization is detrimental to the output quality:
> /main -m ~/Downloads/llama/7B/ggml-model-q4_0.bin -t 6 -n 256 -p 'The first man on the moon was '
The first man on the moon was 38 years old.
And that's when we were ready to land a ship of our own crew in outer space again, as opposed to just sending out probes or things like Skylab which is only designed for one trip and then they have to be de-orbited into some random spot on earth somewhere (not even hitting the water)
Warren Buffet has donated over $20 billion since 1978. His net worth today stands at more than a half trillion dollars ($53 Billiard). He's currently living in Omaha, NE as opposed to his earlier home of New York City/Berkshire Mountains area and he still lives like nothing changed except for being able to spend $20 billion.
Social Security is now paying out more than it collects because people are dying... That means that we're living longer past when Social security was supposed to run dry (65) [end of text]
I believe the quality loss is because this is RTN (round-to-nearest) quantization. If you use the "4chan version", which is 4-bit GPTQ, the loss from quantization should be very small.
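For context, RTN just rounds each weight to the nearest representable level using a per-block scale, and whatever error that rounding introduces is simply kept; GPTQ instead adjusts the remaining weights to compensate for the rounding error, which is why it tends to lose less quality at the same bit width. Here is a minimal sketch of what RTN 4-bit block quantization looks like; the block size, signed range, and packing are illustrative and not the exact ggml q4_0 layout:

```python
import numpy as np

def rtn_q4_block(weights, block_size=32):
    """Round-to-nearest 4-bit quantization with one scale per block.

    Roughly in the spirit of ggml's q4_0 (symmetric, no zero-point);
    details here are illustrative, not the actual file format.
    """
    w = weights.reshape(-1, block_size)
    # One scale per block: map the largest magnitude onto the
    # signed 4-bit range [-8, 7].
    scale = np.max(np.abs(w), axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                      # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7)      # the "round to nearest" step
    dequant = q * scale                          # what the model actually computes with
    return q.astype(np.int8), scale, dequant.reshape(weights.shape)

# Example: how much error plain RTN leaves behind on random weights
rng = np.random.default_rng(0)
w = rng.normal(size=(4096,)).astype(np.float32)
_, _, w_hat = rtn_q4_block(w)
print("mean abs quantization error:", np.abs(w - w_hat).mean())
```

Nothing in that loop tries to undo the rounding error, which is the whole difference from GPTQ-style methods.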
The OP article eventually gets around to demonstrating the model, and its output is similarly bad, jumping from George Washington to the purported physical fitness of Donald Trump.