There are larger models in there, an 8B and a 6B. By this logic they should rank above the 2B model, yet we don't see that. That's why we have open standard benchmarks: to measure this directly rather than hypothesizing from model sizes or doing cross-dataset arithmetic.
Also note that Voxtral's capacity is not necessarily all devoted to speech, since it "Retains the text understanding capabilities of its language model backbone".