
Interesting observations:

* Llama 3.2 multimodal actually still ranks below Molmo from AI2, released this morning.

* AI2D: 92.3 (Llama 3.2 90B) vs. 96.3 (Molmo 72B)

* Llama 3.2 1B and 3B are pruned from 3.1 8B, so no leapfrogging, unlike 3 -> 3.1.

* Notably, no code benchmarks. A deliberate exclusion of code data during distillation to maximize mobile on-device use cases?

Was hoping there would be some interesting models I could add to https://double.bot, but there don't seem to be any improvements to frontier coding performance.



On the second point, you're comparing MMMU-Pro (multimodal) with MMLU-Pro (text-only). I don't think they published MMLU-Pro scores for 3.2.

(Edit: parent comment was corrected, thanks!)


Yep, you're right; thanks for catching that (sorry for the ninja edit!)


Where do you see the MMLU-Pro evaluation for Llama 3.2 90B? On the linked page I only see Llama 3.2 90B evaluated against multimodal benchmarks.


Ah, you're right; I totally misread that!



