Yeah I mean SQL is pretty nuanced - one of the things we want to improve in the benchmark is how we measure "success", in the sense that multiple correct SQL queries can look structurally dissimilar while still answering the prompt semantically.
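One way to think about it is grading on result sets rather than SQL text. A rough sketch of that idea (not our actual grading code; execute() is just a placeholder for a ClickHouse client call):

    # Sketch: grade on result sets rather than SQL text.
    def execute(sql: str) -> list[tuple]:
        # Placeholder: run the query and return rows via a real ClickHouse client.
        raise NotImplementedError

    def normalize(rows: list[tuple]) -> list[tuple]:
        # Make row order and float noise irrelevant before comparing.
        def norm(v):
            return round(v, 6) if isinstance(v, float) else v
        return sorted(tuple(norm(c) for c in row) for row in rows)

    def semantically_equal(reference_sql: str, candidate_sql: str) -> bool:
        # Structurally different queries count as correct if their
        # normalized result sets match.
        return normalize(execute(reference_sql)) == normalize(execute(candidate_sql))

Even that has gaps (column ordering, aliases, prompts that admit more than one reasonable aggregation), which is part of why it's nuanced.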
I pay for Claude premium but actually use Grok quite a bit; the 'think' function gets me where I want more often than not. Odd that you don't have any xAI models listed. Sure, Grok is a terrible name, but it surprises me more often. I haven't tried the $250 ChatGPT model yet, though; I just don't like OpenAI's practices lately.
Not saying you're wrong about "OpenAI practices", but that's kind of a strange thing to complain about right after praising an LLM that was only recently inserting claims of "white genocide" into every other response.
Even if you don't care about racial politics, or even good-vs-evil or legal-vs-criminal, the fact that that entire LLM got (obviously, and ineptly) tuned to the whim of one rich individual — even if he wasn't as creepy as he is — should be a deal-breaker, shouldn't it?
Just curious, how do you know your questions and the SQL aren't in the LLM training data? Looks like the benchmark questions w/SQL are online (https://ghe.clickhouse.tech/).
Actually no, we allow up to 3 attempts. In fact, Opus 4 failed 36/50 tests on the first attempt, but it was REALLY good at nailing the second attempt after receiving error feedback.
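For anyone curious about the mechanics, the loop is basically: run the generated SQL, and if it errors, hand the error message back to the model for the next attempt, up to 3 tries. A simplified sketch (generate_sql and execute are stand-ins for the model call and the ClickHouse client, not the actual harness code):

    # Simplified retry loop: up to 3 attempts, feeding each attempt's
    # database error back to the model before the next try.
    MAX_ATTEMPTS = 3

    def answer_with_retries(question, generate_sql, execute):
        feedback = None
        for attempt in range(1, MAX_ATTEMPTS + 1):
            sql = generate_sql(question, feedback)  # feedback is None on the first attempt
            try:
                return attempt, execute(sql)        # success: record which attempt landed
            except Exception as err:
                feedback = f"Previous query:\n{sql}\nError:\n{err}"
        return None, None                           # every attempt failed

So a 36/50 first-attempt failure rate can still end up as a strong overall score if the second attempt usually lands.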
No, it's definitely interesting. It suggests that Opus 4 failed to write valid syntax on the first attempt, but given the error feedback it absolutely nailed the second attempt. My takeaway is that this is great for pair-coding workflows: less "FIX IT CLAUDE".
Yeah, this was a surprising result. Of course, bear in mind that testing an LLM on SQL generation is pretty nuanced, so take everything with a grain of salt :)
Opus 4 beat all other models. It's good.