For this model, the point seems to be that it uses far fewer parameters than at least the large Llama model while having nearly identical performance. Given how large these models are getting, cutting the parameter count is an important step to take before pushing performance further.