Yeah I mean SQL is pretty nuanced - one of the things we want to improve in the benchmark is how we measure "success", in the sense that multiple correct SQL queries can look structurally dissimilar while still answering the prompt semantically.
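One way to think about it is grading on result sets rather than SQL text. A rough sketch of that idea (not our actual grading code; execute() is just a placeholder for a ClickHouse client call):

    # Sketch: grade on result sets rather than SQL text.
    def execute(sql: str) -> list[tuple]:
        # Placeholder: run the query and return rows via a real ClickHouse client.
        raise NotImplementedError

    def normalize(rows: list[tuple]) -> list[tuple]:
        # Make row order and float noise irrelevant before comparing.
        def norm(v):
            return round(v, 6) if isinstance(v, float) else v
        return sorted(tuple(norm(c) for c in row) for row in rows)

    def semantically_equal(reference_sql: str, candidate_sql: str) -> bool:
        # Structurally different queries count as correct if their
        # normalized result sets match.
        return normalize(execute(reference_sql)) == normalize(execute(candidate_sql))

Even that has gaps (column ordering, aliases, prompts that admit more than one reasonable aggregation), which is part of why it's nuanced.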
I pay for Claude premium but actually use Grok quite a bit; the 'think' function gets me where I want more often than not. Odd that you don't have any xAI models listed. Sure, Grok is a terrible name, but it surprises me more often. I haven't tried the $250 ChatGPT model yet, though; I just don't like OpenAI's practices lately.
Not saying you're wrong about "OpenAI practices", but that's kind of a strange thing to complain about right after praising an LLM that was only recently inserting claims of "white genocide" into every other response.
Even if you don't care about racial politics, or even good-vs-evil or legal-vs-criminal, the fact that that entire LLM got (obviously, and ineptly) tuned to the whim of one rich individual — even if he wasn't as creepy as he is — should be a deal-breaker, shouldn't it?
Just curious, how do you know your questions and the SQL aren't in the LLM training data? Looks like the benchmark questions w/SQL are online (https://ghe.clickhouse.tech/).
Actually no, we allow up to 3 attempts. In fact, Opus 4 failed 36/50 tests on the first attempt, but it was REALLY good at nailing the second attempt after receiving error feedback.
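For anyone curious about the mechanics, the loop is basically: run the generated SQL, and if it errors, hand the error message back to the model for the next attempt, up to 3 tries. A simplified sketch (generate_sql and execute are stand-ins for the model call and the ClickHouse client, not the actual harness code):

    # Simplified retry loop: up to 3 attempts, feeding each attempt's
    # database error back to the model before the next try.
    MAX_ATTEMPTS = 3

    def answer_with_retries(question, generate_sql, execute):
        feedback = None
        for attempt in range(1, MAX_ATTEMPTS + 1):
            sql = generate_sql(question, feedback)  # feedback is None on the first attempt
            try:
                return attempt, execute(sql)        # success: record which attempt landed
            except Exception as err:
                feedback = f"Previous query:\n{sql}\nError:\n{err}"
        return None, None                           # every attempt failed

So a 36/50 first-attempt failure rate can still end up as a strong overall score if the second attempt usually lands.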
No, it's definitely interesting. It suggests that Opus 4 failed to write valid syntax on the first attempt, but given the error feedback it absolutely nailed the second attempt. My takeaway is that this is great for pair-coding workflows: less "FIX IT CLAUDE".
Yeah, this was a surprising result. Of course, bear in mind that testing an LLM on SQL generation is pretty nuanced, so take everything with a grain of salt :)
Opus 4 beat all other models. It's good.