Shouldn't this (2.6B/9B) be compared with Microsoft's Phi-3 mini (3.8B) instead of Mistral and Llama-3?

(Table 13 on page 7 of the Gemma 2 report) vs. https://arxiv.org/pdf/2404.14219 (page 6; Phi-3 looks noticeably better in general)

The report on knowledge distillation training is interesting, though.
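For context: "distillation" here means the student models are trained against a larger teacher's full next-token distribution rather than one-hot labels. A minimal PyTorch sketch of that kind of loss (the function name and temperature value are illustrative, not taken from the report):

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=1.0):
        # Soften both distributions with the same temperature.
        t_probs = F.softmax(teacher_logits / temperature, dim=-1)
        s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        # KL(teacher || student); kl_div expects log-probs as input and
        # probs as target. The T^2 factor keeps gradient magnitudes
        # comparable across temperatures.
        return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature**2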



Picking up from there: the comparison games in this paper and model release are annoying.

The 2.6B would get stomped by Phi-3, so there's no comparison.

Fair enough. 2.6B vs. 3.8B is a fairly substantial size difference that's hard to intuit when it's written as "2.6 vs. 3.8" rather than 2,600,000,000 vs. 3,800,000,000 (the 3.8B model is ~46% larger).

But then we get what I'm going to call "parameter creep": Mistral 7B vs. Llama 3 8B vs. Gemma 2 9B. When Llama 3 went to 8B I worried we'd start seeing games played with parameter counts, but I thought I was being silly.


There was no parameter creep with Llama. Llama 3 8B is effectively a ~7B model comparable to Mistral 7B once you strip away the multilingual embeddings and match what Mistral 7B supports.


In the Llama 3 case I think the increase in parameters is mostly in the input embedding and output logits layers, reflecting the much larger tokenizer vocabulary (128K tokens vs. Mistral's 32K).
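Back-of-the-envelope check (hidden size 4096 for both models; untied input/output embedding matrices assumed; vocab sizes are the published ones):

    hidden = 4096                # model dim, Mistral 7B and Llama 3 8B
    mistral_vocab = 32_000
    llama3_vocab = 128_256

    def embed_params(vocab):
        # input embedding matrix + output (lm_head) matrix
        return 2 * vocab * hidden

    delta = embed_params(llama3_vocab) - embed_params(mistral_vocab)
    print(f"{delta / 1e9:.2f}B")  # ~0.79B

That ~0.8B sits entirely in the embedding/logits layers, which roughly closes the gap between Mistral 7B's ~7.2B total and Llama 3 8B's ~8B.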


Phi-3 3.8B seems to perform better than Gemma 2 9B on almost every test, so the two are comparable despite the size gap.


I agree.

The implication in my post is: if size was the reason for skipping Phi-3 against the 2.6B, that reason is invalidated later when the paper compares against much larger models.


It's such a wide range of model sizes that I could see why they compare with Llama 3 70B as well as Llama 3 8B (Tables 12, 13). I agree that the Phi-3 series is a stronger competitor for knowledge extraction/summarizing and would make a good comparison. My current favorite for such tasks, on a VRAM-limited workstation, is Phi-3 medium (phi3:14b-instruct).
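Rough VRAM arithmetic for why a 14B model is workable on a limited card (the 4-bit figure assumes a Q4-style quant of the kind ollama typically serves; the overhead multiplier is a guess for KV cache and runtime buffers):

    params = 14e9        # Phi-3 medium
    bits = 4             # Q4-style quant; 16 for fp16 weights
    overhead = 1.2       # assumed headroom for KV cache + buffers

    gb = params * bits / 8 / 1e9 * overhead
    print(f"~{gb:.1f} GB")  # ~8.4 GB quantized vs ~33.6 GB at fp16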



