Picking up from there: the sizing games in this paper and model lineup are annoying.
The 2.6B would get stomped by Phi-3, so there's no comparison.
Fair enough. 2.6B vs. 3.8B is a fairly substantial size difference (roughly 46% more parameters) that's hard to intuit when it's written as 2.6 vs. 3.8 rather than 2,600,000,000 vs. 3,800,000,000.
But then we get what I'm going to call "parameter creep": Mistral 7B vs. Llama 8B vs. Gemma 9B. I worried after Llama 3 went to 8B that we'd start seeing games with parameter counts, but I thought I was being silly.
There was no parameter creep with Llama. Llama 8B is actually a ~7B model comparable to Mistral 7B if you strip away multilingual embeddings and match what Mistral 7B supports.
In the Llama 3 case I think the increase in parameters is mostly due to the input embedding and output logits layers, reflecting the vocabulary size increase (32,000 tokens in Llama 2 and Mistral vs. 128,256 in Llama 3).
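To make that concrete, here's a back-of-the-envelope sketch, assuming the published configs (a 4,096-dim hidden state for both models, a 32,000-token vocabulary for Mistral 7B, a 128,256-token vocabulary for Llama 3 8B, both with untied embedding and head weights); vocab_params is just a helper for this illustration:

    # Rough check: how much of Llama 3 8B's size over Mistral 7B
    # comes from the bigger vocabulary alone?

    def vocab_params(vocab_size: int, hidden_size: int, tied: bool = False) -> int:
        """Parameters in the input embedding plus the output logits layer."""
        matrices = 1 if tied else 2  # untied models keep separate embed + head matrices
        return matrices * vocab_size * hidden_size

    HIDDEN = 4096  # both models use a 4096-dim hidden state

    llama3 = vocab_params(128_256, HIDDEN)   # ~1.05B parameters
    mistral = vocab_params(32_000, HIDDEN)   # ~0.26B parameters

    print(f"Llama 3 8B vocab layers: {llama3 / 1e9:.2f}B")            # 1.05B
    print(f"Mistral 7B vocab layers: {mistral / 1e9:.2f}B")           # 0.26B
    print(f"difference:              {(llama3 - mistral) / 1e9:.2f}B")  # 0.79B

That ~0.79B accounts for most of the gap between Mistral's ~7.24B and Llama 3 8B's ~8.03B total parameters, which is why the two are closer to peers than the 7B/8B labels suggest.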
The paper covers such a wide range of model sizes that I can see why they compare against Llama 3 70B as well as Llama 3 8B (tables 12 and 13). I agree that the Phi-3 series is a stronger competitor for knowledge extraction/summarization and would make a good comparison. My current favorite for such tasks, on a VRAM-limited workstation, is Phi-3 medium (phi3:14b-instruct).
Compare table 13 on page 7 against the Phi-3 report (https://arxiv.org/pdf/2404.14219, page 6); the Phi-3 numbers look noticeably better in general.
The report on knowledge distillation training is interesting, though.