Shouldn't this (2.6B/9B) be compared with Microsoft's Phi-3 mini (3.8B) instead of Mistral and Llama-3?

(Table 13 on page 7 of the Gemma 2 report) vs. https://arxiv.org/pdf/2404.14219 (page 6; Phi-3 looks noticeably better in general)

The report on knowledge distillation training is interesting, though.
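For context: "distillation" here means the student models are trained against a larger teacher's full next-token distribution rather than one-hot labels. A minimal PyTorch sketch of that kind of loss (the function name and temperature value are illustrative, not taken from the report):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=1.0):
        # Soften both distributions with the same temperature.
        t_probs = F.softmax(teacher_logits / temperature, dim=-1)
        s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        # KL(teacher || student); kl_div expects log-probs as input and
        # probs as target. The T^2 factor keeps gradient magnitudes
        # comparable across temperatures.
        return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature**2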



Picking up from there: the comparison games in this paper and model release are annoying.

The 2.6B would get stomped by Phi-3, so there's no comparison.

Fair enough. 2.6B vs. 3.8B is a fairly substantial size difference that's hard to intuit when it's written as "2.6 vs. 3.8" rather than 2,600,000,000 vs. 3,800,000,000 (the 3.8B model is ~46% larger).

But then we get what I'm going to call "parameter creep": Mistral 7B vs. Llama 3 8B vs. Gemma 2 9B. When Llama 3 went to 8B I worried we'd start seeing games played with parameter counts, but I thought I was being silly.


There was no parameter creep with Llama. Llama 3 8B is effectively a ~7B model comparable to Mistral 7B once you strip away the multilingual embeddings and match what Mistral 7B supports.


In the Llama 3 case I think the increase in parameters is mostly in the input embedding and output logits layers, reflecting the much larger tokenizer vocabulary (128K tokens vs. Mistral's 32K).
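Back-of-the-envelope check (hidden size 4096 for both models; untied input/output embedding matrices assumed; vocab sizes are the published ones):

    hidden = 4096                # model dim, Mistral 7B and Llama 3 8B
    mistral_vocab = 32_000
    llama3_vocab = 128_256

    def embed_params(vocab):
        # input embedding matrix + output (lm_head) matrix
        return 2 * vocab * hidden

    delta = embed_params(llama3_vocab) - embed_params(mistral_vocab)
    print(f"{delta / 1e9:.2f}B")  # ~0.79B

That ~0.8B sits entirely in the embedding/logits layers, which roughly closes the gap between Mistral 7B's ~7.2B total and Llama 3 8B's ~8B.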


Phi-3 3.8B seems to perform better than Gemma 2 9B on almost every test, so the two are comparable despite the size gap.


I agree.

The implication in my post is: if size was the reason for skipping Phi-3 against the 2.6B, that reason is invalidated later when the paper compares against much larger models.


It's such a wide range of model sizes that I could see why they compare with Llama 3 70B as well as Llama 3 8B (Tables 12, 13). I agree that the Phi-3 series is a stronger competitor for knowledge extraction/summarizing and would make a good comparison. My current favorite for such tasks, on a VRAM-limited workstation, is Phi-3 medium (phi3:14b-instruct).
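Rough VRAM arithmetic for why a 14B model is workable on a limited card (the 4-bit figure assumes a Q4-style quant of the kind ollama typically serves; the overhead multiplier is a guess for KV cache and runtime buffers):

    params = 14e9        # Phi-3 medium
    bits = 4             # Q4-style quant; 16 for fp16 weights
    overhead = 1.2       # assumed headroom for KV cache + buffers

    gb = params * bits / 8 / 1e9 * overhead
    print(f"~{gb:.1f} GB")  # ~8.4 GB quantized vs ~33.6 GB at fp16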



