
Phi-3 blows this out of the water.

                      Benchmark  |  Gemma 2 (9B)  |  Phi-3 Small (7B)
    -----------------------------|----------------|-------------------
                  MMLU (5-Shot)  |       63.6     |       75.7
             HellaSwag (5-Shot)  |       49.8     |       77.0
                  ANLI (7-Shot)  |       48.7     |       58.1
           GSM-8K (8-Shot; CoT)  |       59.8     |       89.6
                 MedQA (2-Shot)  |       49.6     |       65.4
               AGIEval (0-Shot)  |       42.1     |       45.1
              TriviaQA (5-Shot)  |       72.3     |       58.1
                Arc-C (10-Shot)  |       78.3     |       90.7
                Arc-E (10-Shot)  |       91.4     |       97.0
                  PIQA (5-Shot)  |       78.1     |       86.9
                SociQA (5-Shot)  |       65.5     |       79.2
    BigBench-Hard (3-Shot; CoT)  |       59.6     |       79.1
            WinoGrande (5-Shot)  |       55.6     |       81.5
           OpenBookQA (10-Shot)  |       78.6     |       88.0
                 BoolQ (2-Shot)  |       66.0     |       84.8
        CommonSenseQA (10-Shot)  |       76.2     |       80.0
      TruthfulQA (10-Shot; MC2)  |       52.1     |       70.2
             HumanEval (0-Shot)  |       34.1     |       61.0
                  MBPP (3-Shot)  |       51.5     |       71.7


Another take on this: phi-3 small has an Elo of 1100 on LMSYS (ranked #52), while the confidence interval for Gemma 2 9B is [1170, 1200] Elo (ranked between #15 and #25).
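
To put that rating gap in head-to-head terms, here is a rough sketch using the standard Elo expected-score formula. It assumes the leaderboard uses the usual base-10, 400-point scale; the function name and the idea of treating the interval endpoints as point ratings are illustrative assumptions, not something taken from the leaderboard itself.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumption: base-10 logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the comment above.
phi3_small = 1100
gemma_interval = (1170, 1200)  # confidence interval endpoints, used here as point ratings

for gemma in gemma_interval:
    p = elo_expected_score(gemma, phi3_small)
    print(f"Gemma 2 9B at {gemma}: expected score vs phi-3 small = {p:.2f}")
```

Under those assumptions, a 70 to 100 point gap means Gemma 2 9B would be preferred in roughly 60% to 64% of head-to-head votes, which is a clear edge but not a blowout.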


Phi is notorious for benchmark overfitting. It's good, but not as good as it looks on the charts. On the LMSYS leaderboard it places a whole 23 spots behind Llama-3-8B, which it also claims to soundly beat on the benchmarks above. So YMMV.


Pretraining on the Test Set Is All You Need

https://arxiv.org/abs/2309.08632



