
Sharing PyLLMs [1] reasoning benchmark results for some of the recent models. Surprised by Nemo (speed/quality), and Mistral Large is actually pretty good (but painfully slow).

AnthropicProvider('claude-3-haiku-20240307') Median Latency: 1.61 | Aggregated speed: 122.50 | Accuracy: 44.44%

MistralProvider('open-mistral-nemo') Median Latency: 1.37 | Aggregated speed: 100.37 | Accuracy: 51.85%

OpenAIProvider('gpt-4o-mini') Median Latency: 2.13 | Aggregated speed: 67.59 | Accuracy: 59.26%

MistralProvider('mistral-large-latest') Median Latency: 10.18 | Aggregated speed: 18.64 | Accuracy: 62.96%

AnthropicProvider('claude-3-5-sonnet-20240620') Median Latency: 3.61 | Aggregated speed: 59.70 | Accuracy: 62.96%

OpenAIProvider('gpt-4o') Median Latency: 3.25 | Aggregated speed: 53.75 | Accuracy: 74.07%
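For a quick side-by-side read of the numbers above, here is a small sketch (not part of PyLLMs, just the posted figures hardcoded) that ranks the models by accuracy, breaking ties by median latency:

```python
# The posted results: (model, median latency s, aggregated speed, accuracy %)
results = [
    ("claude-3-haiku-20240307", 1.61, 122.50, 44.44),
    ("open-mistral-nemo", 1.37, 100.37, 51.85),
    ("gpt-4o-mini", 2.13, 67.59, 59.26),
    ("mistral-large-latest", 10.18, 18.64, 62.96),
    ("claude-3-5-sonnet-20240620", 3.61, 59.70, 62.96),
    ("gpt-4o", 3.25, 53.75, 74.07),
]

# Sort by accuracy (descending), then median latency (ascending) for ties.
ranked = sorted(results, key=lambda r: (-r[3], r[1]))
for name, latency, speed, acc in ranked:
    print(f"{name}: {acc:.2f}% accuracy @ {latency:.2f}s median latency")
```

On this tiebreak, claude-3-5-sonnet edges out mistral-large-latest at the same 62.96% accuracy because of its much lower latency.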

[1] https://github.com/kagisearch/pyllms
