- Improves upon GPT-4o's score on the Short Story Creative Writing Benchmark, but Claude Sonnets and DeepSeek R1 score higher. (https://github.com/lechmazur/writing/)
- Improves upon GPT-4o's score on the Confabulations/Hallucinations on Provided Documents Benchmark, nearly matching Gemini 1.5 Pro (Sept), the best-performing non-reasoning model. (https://github.com/lechmazur/confabulations)
- Improves upon GPT-4o's score on the Thematic Generalization Benchmark, but doesn't match the scores of Claude 3.7 Sonnet or Gemini 2.0 Pro Exp. (https://github.com/lechmazur/generalization)