
It significantly improves upon GPT-4o on my Extended NYT Connections Benchmark, from 22.4 to 33.7 (https://github.com/lechmazur/nyt-connections).
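For anyone wondering how a puzzle like this can be graded automatically, here's a minimal sketch assuming the usual Connections format (16 words split into 4 groups of 4) and exact-group-match scoring; the actual harness in the linked repo may prompt and grade differently, and the example words below are made up:

    # Sketch only: assumes 4 solution groups of 4 words and credits each
    # group the model reproduces exactly. Real benchmark scoring may differ.
    def score_connections(solution: list[set[str]], proposed: list[set[str]]) -> float:
        """Return the fraction of solution groups reproduced exactly."""
        solution_groups = {frozenset(g) for g in solution}
        proposed_groups = {frozenset(g) for g in proposed}
        return len(solution_groups & proposed_groups) / len(solution_groups)

    # Hypothetical puzzle: the model gets two of four groups right -> 0.5.
    solution = [{"apple", "pear", "plum", "fig"},
                {"red", "blue", "green", "teal"},
                {"jazz", "rock", "folk", "soul"},
                {"mars", "venus", "pluto", "ceres"}]
    proposed = [{"apple", "pear", "plum", "fig"},
                {"red", "blue", "green", "soul"},
                {"jazz", "rock", "folk", "teal"},
                {"mars", "venus", "pluto", "ceres"}]
    print(score_connections(solution, proposed))  # 0.5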


I ran three more of my independent benchmarks:

- Improves upon GPT-4o's score on the Short Story Creative Writing Benchmark, but the Claude Sonnet models and DeepSeek R1 score higher. (https://github.com/lechmazur/writing/)

- Improves upon GPT-4o's score on the Confabulations/Hallucinations on Provided Documents Benchmark, nearly matching Gemini 1.5 Pro (Sept) as the best-performing non-reasoning model. (https://github.com/lechmazur/confabulations)

- Improves upon GPT-4o's score on the Thematic Generalization Benchmark, though it doesn't match the scores of Claude 3.7 Sonnet or Gemini 2.0 Pro Exp. (https://github.com/lechmazur/generalization)

I should have the results from the multi-agent collaboration, strategy, and deception benchmarks within a couple of days. (https://github.com/lechmazur/elimination_game/, https://github.com/lechmazur/step_game and https://github.com/lechmazur/goods).


Honest question for you: are these puzzles actually a good way to test the models?

The answers are certainly in the training set, likely many times over.

I’d be curious to see performance on Bracket City, which was featured here on HN yesterday.



