Hacker News

I've really doubted that LLM benchmarks reflect real-world user experience ever since they claimed GPT-4o hallucinated less than the original GPT-4.


I don't have an accurate benchmark, but in my personal experience, GPT-4o hallucinates substantially less than GPT-4. We solved a ton of hallucination issues just by upgrading to it...


How much did you use the original GPT-4-0314?

(And even that was a downgrade compared to the more uncensored pre-release versions, which were comparable to GPT-4.5, at least judging by the unicorn test)


I don't recall the original version we used unfortunately :(

in our case, the bump was actually from gpt-4-vision to gpt-4o (the use case required image interpretation)

It got measurably better at both image cases and text-only cases


I'm beginning to believe LLM benchmarks are like European car mileage specs. They say it's 4 liters / 100 km, but everyone knows it's at least 30% off (same with WLTP for EVs).


Those numbers are not off; they're measured on test tracks.

You need to remove your shoe and drive with like two toes to get the speed just right, though.

The test drivers I have done this with take off their shoes or wear ballet shoes.


Cruise control?


No, you want to control the shape of the speed curve while following the speed profile, so you don't overshoot or accelerate too much.

And holding a steady-state speed is not that hard.


Hrm, it is a bit funny that modern cars are drive-by-wire (at least for the throttle), yet they still require a skilled driver to follow a speed profile during testing, when in theory the same thing could be done more precisely by a device plugged into the OBD-II port.
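For what it's worth, the "follow a speed profile without overshooting" part is a classic feedback-control problem. Here's a minimal sketch using a PID controller against a toy longitudinal vehicle model; the model, the gains, and the throttle interface are all hypothetical stand-ins (a real rig would read speed off the CAN bus and actuate the actual throttle), so treat this as an illustration of the idea, not a test procedure.

```python
# Sketch: tracking a target speed profile with a PID controller.
# The vehicle dynamics below (3.0 * throttle accel, linear drag) are
# a made-up toy model; the gains are picked to look plausible, not tuned.

def follow_profile(profile, dt=0.1, kp=0.8, ki=0.4, kd=0.05):
    """Track a list of target speeds (m/s); returns the actual speeds."""
    speed = profile[0]        # start already at the first target speed
    integral = 0.0
    prev_err = 0.0
    actual = []
    for target in profile:
        err = target - speed
        integral += err * dt
        deriv = (err - prev_err) / dt
        prev_err = err
        # PID output, clamped to a normalized throttle/brake range
        throttle = kp * err + ki * integral + kd * deriv
        throttle = max(-1.0, min(1.0, throttle))
        # toy longitudinal model: throttle accel minus speed-proportional drag
        accel = 3.0 * throttle - 0.05 * speed
        speed += accel * dt
        actual.append(speed)
    return actual

# Example profile: hold 10 m/s for 10 s, then ramp toward 15 m/s.
profile = [10.0] * 100 + [10.0 + 0.05 * i for i in range(100)]
actual = follow_profile(profile)
```

The integral term is what compensates for steady drag (the "hold exactly this speed" part a human does with two toes); the derivative term damps overshoot when the profile changes slope.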



